DeepSeek-V4 Brings 1M-Token Context at a Fraction of Prior Cost

DeepSeek has released a preview of its V4 model series, and the headline number is not the parameter count but the efficiency.

The series includes two models: DeepSeek-V4-Pro, a 1.6 trillion-parameter Mixture-of-Experts model with 49 billion parameters active at inference time, and DeepSeek-V4-Flash, a smaller 284 billion-parameter variant with 13 billion active. Both handle contexts up to one million tokens, and both were pre-trained on more than 32 trillion tokens.

The efficiency story is the more striking one: at the one-million-token context length, V4-Pro needs only 27% of the per-token compute and 10% of the key-value cache that DeepSeek-V3.2 required for the same task. DeepSeek attributes the gains to three changes: a hybrid attention scheme pairing Compressed Sparse Attention with Heavily Compressed Attention, a modified residual connection design called Manifold-Constrained Hyper-Connections, and a training optimizer called Muon that the team credits with faster convergence. DeepSeek also claims that the top reasoning mode, DeepSeek-V4-Pro-Max, beats prior open models on core benchmarks.
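To put the quoted figures in proportion, here is a minimal arithmetic sketch. It uses only the numbers reported above. The helper names (`active_fraction`, `reduction_factor`) are illustrative, not part of any DeepSeek tooling, and the results restate vendor-reported claims rather than independent measurements.

```python
# Figures quoted in the article: total and active parameters, in billions.
MODELS = {
    "DeepSeek-V4-Pro": {"total_b": 1600, "active_b": 49},
    "DeepSeek-V4-Flash": {"total_b": 284, "active_b": 13},
}

# Reported V4-Pro cost relative to DeepSeek-V3.2 at a one-million-token context.
RELATIVE_COMPUTE = 0.27   # per-token compute
RELATIVE_KV_CACHE = 0.10  # key-value cache


def active_fraction(total_b: float, active_b: float) -> float:
    """Share of a model's parameters that are active at inference time."""
    return active_b / total_b


def reduction_factor(relative_cost: float) -> float:
    """Turn 'needs X of the old cost' into an 'N times less' factor."""
    return 1.0 / relative_cost


if __name__ == "__main__":
    for name, params in MODELS.items():
        share = active_fraction(**params)
        print(f"{name}: {share:.1%} of parameters active")

    print(f"Per-token compute: {reduction_factor(RELATIVE_COMPUTE):.1f}x lower")
    print(f"KV cache: {reduction_factor(RELATIVE_KV_CACHE):.1f}x smaller")
```

On these numbers, roughly 3.1% of V4-Pro's parameters and 4.6% of V4-Flash's are active at inference time, and the reported savings over V3.2 work out to about 3.7x less per-token compute and a 10x smaller key-value cache. Those ratios apply only to the one-million-token setting the article cites; nothing here says how they scale at shorter contexts.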

A one-million-token context window has been a spec-sheet fixture for a while, but actually serving it at scale is expensive enough that most providers throttle or price it punitively. If DeepSeek's efficiency numbers hold up under real workloads, routinely offering that window stops being a loss-leader. That matters for long-document analysis, code-over-entire-repos, and the multi-step agent tasks that need sustained context.

Model checkpoints are public on Hugging Face, so independent benchmark results should surface quickly. That is the only way to know whether the compute and cache claims hold outside the lab.
