DeepSeek-V3
Revolutionary 671B MoE model rivaling GPT-4o at a fraction of the cost
Released on 2024.12.26
Overview
DeepSeek-V3 is a 671B-parameter Mixture-of-Experts (MoE) model that achieves performance comparable to leading closed-source models while being trained at a fraction of their cost. It introduces auxiliary-loss-free load balancing and multi-token prediction (MTP).
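Multi-token prediction trains the model to predict tokens beyond the immediate next one at each position, densifying the training signal. Below is a minimal sketch of one way such an objective can be combined with the usual next-token loss; the function name, tensor layout, and `mtp_weight` value are illustrative assumptions, not DeepSeek's released implementation.

```python
# Sketch of a multi-token prediction (MTP) style training loss.
# Assumptions: the model produces main next-token logits plus extra
# logits that predict tokens further ahead; all names are illustrative.
import torch
import torch.nn.functional as F

def mtp_loss(main_logits, mtp_logits, targets, mtp_weight=0.3):
    """main_logits: (batch, seq, vocab) next-token logits.
    mtp_logits: list of (batch, seq, vocab) tensors; the k-th entry
    predicts the token (k + 2) positions ahead of the input position.
    targets: (batch, seq) token ids."""
    vocab = main_logits.size(-1)
    # Standard next-token loss: position t predicts the target at t + 1.
    loss = F.cross_entropy(
        main_logits[:, :-1].reshape(-1, vocab), targets[:, 1:].reshape(-1)
    )
    # Extra predictions look further ahead; average their losses.
    extra = []
    for k, logits in enumerate(mtp_logits, start=2):
        extra.append(F.cross_entropy(
            logits[:, :-k].reshape(-1, vocab), targets[:, k:].reshape(-1)
        ))
    if extra:
        loss = loss + mtp_weight * torch.stack(extra).mean()
    return loss
```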
Key Features
- 671B total parameters (37B activated)
- Trained for only ~$5.58M in compute cost
- Auxiliary-loss-free load balancing (see the sketch after this list)
- Multi-token prediction (MTP)
- FP8 mixed-precision training
- Outperforms Claude 3.5 Sonnet on many benchmarks
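DeepSeek-V3 keeps experts evenly loaded without an auxiliary balancing loss: a per-expert bias is added to the routing scores only when selecting the top-k experts, and that bias is nudged down for overloaded experts and up for underloaded ones after each step. The sketch below illustrates this idea under those assumptions; the function names, the sign-based update rule, and the `update_speed` value are illustrative, not the released code.

```python
# Sketch of bias-based, auxiliary-loss-free MoE load balancing.
# The bias steers expert selection only; gating weights still come
# from the original, unbiased scores. All names are illustrative.
import torch

def route_tokens(scores, bias, top_k):
    """scores: (num_tokens, num_experts) routing affinity scores.
    bias: (num_experts,) load-balancing bias used only for selection."""
    # Select experts with the biased scores ...
    _, expert_idx = torch.topk(scores + bias, top_k, dim=-1)
    # ... but compute gating weights from the unbiased scores.
    gates = torch.gather(scores, -1, expert_idx).softmax(dim=-1)
    return expert_idx, gates

def update_bias(bias, expert_idx, num_experts, update_speed=1e-3):
    """After each step, lower the bias of overloaded experts and raise
    it for underloaded ones so future tokens spread out more evenly."""
    load = torch.bincount(expert_idx.flatten(), minlength=num_experts).float()
    return bias - update_speed * torch.sign(load - load.mean())
```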
Specifications
- Parameters: 671B total (37B activated)
- Architecture: MoE + Multi-head Latent Attention (MLA) + MTP
- Context Length: 128K tokens
- Training Tokens: 14.8T
- Benchmarks: MMLU 88.5%, HumanEval 82.6%