DeepSeek-V2

A strong, economical Mixture-of-Experts model with 236B parameters and a Multi-head Latent Attention (MLA) architecture

Released on 2024.05.06

Overview

DeepSeek-V2 is a strong Mixture-of-Experts (MoE) language model with 236B total parameters, of which 21B are activated for each token. It introduces Multi-head Latent Attention (MLA), which compresses the key-value cache into a compact latent vector for efficient inference, and adopts the DeepSeekMoE architecture for economical training.
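
A minimal sketch can make the MLA idea concrete: instead of caching full per-head keys and values, the hidden state is down-projected into a small latent vector that is cached, and keys/values are re-materialized from it at attention time. The PyTorch module below is illustrative only; all dimensions are placeholders, and the decoupled rotary-embedding path used by the real model is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentKVAttention(nn.Module):
    """Toy illustration of latent KV compression (not the actual MLA layer)."""

    def __init__(self, d_model=1024, n_heads=8, d_latent=128):
        super().__init__()
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model)
        # Down-projection: one small latent per token is all that gets cached.
        self.kv_down = nn.Linear(d_model, d_latent)
        # Up-projections recover per-head keys/values from the cached latent.
        self.k_up = nn.Linear(d_latent, d_model)
        self.v_up = nn.Linear(d_latent, d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x, kv_cache=None):
        b, t, d = x.shape
        latent = self.kv_down(x)                       # (b, t, d_latent)
        if kv_cache is not None:                       # append to cached latents
            latent = torch.cat([kv_cache, latent], dim=1)
        q = self.q_proj(x).view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = self.k_up(latent).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        v = self.v_up(latent).view(b, -1, self.n_heads, self.d_head).transpose(1, 2)
        attn = F.scaled_dot_product_attention(q, k, v, is_causal=kv_cache is None)
        out = attn.transpose(1, 2).reshape(b, t, d)
        return self.out_proj(out), latent              # latent is the new KV cache

attn = LatentKVAttention()
y, cache = attn(torch.randn(1, 16, 1024))                    # prefill 16 tokens
y1, cache = attn(torch.randn(1, 1, 1024), kv_cache=cache)    # decode one token
```

In this toy configuration the cache stores 128 values per token instead of 2 × 1024, which is where the KV-cache savings come from.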

Key Features

  • 236B total parameters (21B activated)
  • Multi-head Latent Attention (MLA)
  • DeepSeekMoE architecture with sparse expert routing (see the sketch after this list)
  • 128K context length
  • Significantly reduced inference cost: roughly 93.3% smaller KV cache and up to 5.76× higher maximum generation throughput than DeepSeek 67B
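
The total-versus-activated parameter gap comes from sparse expert routing: a router picks a few experts per token, so only their weights participate in that token's forward pass. The toy PyTorch layer below shows plain top-k routing; the real DeepSeekMoE design (fine-grained and shared experts, load balancing) is more elaborate, and all sizes here are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Toy top-k expert routing (not the actual DeepSeekMoE layer)."""

    def __init__(self, d_model=512, d_ff=1024, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                               # x: (tokens, d_model)
        scores = F.softmax(self.router(x), dim=-1)
        weights, idx = scores.topk(self.top_k, dim=-1)  # top-k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

moe = TopKMoE()
print(moe(torch.randn(32, 512)).shape)  # torch.Size([32, 512]); 2 of 8 experts run per token
```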

Specifications

  • Parameters: 236B total (21B activated per token)
  • Architecture: MoE (DeepSeekMoE) + MLA
  • Context Length: 128K tokens
  • Training Tokens: 8.1T
  • License: DeepSeek License
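
A hedged loading sketch with Hugging Face transformers follows; the repository id "deepseek-ai/DeepSeek-V2", the need for trust_remote_code, and the hardware notes are assumptions about how the checkpoint is commonly distributed rather than statements from this page, and the full 236B model requires multiple high-memory GPUs.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/DeepSeek-V2"  # assumed repository id
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",       # keep the checkpoint's native precision
    device_map="auto",        # shard across available GPUs (requires accelerate)
    trust_remote_code=True,   # assumes the repo ships custom MLA/MoE modeling code
)

inputs = tokenizer("Multi-head Latent Attention is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```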

Resources