Released April 24, 2026
1.6 trillion parameters. 1 million token context. Frontier-class intelligence at a fraction of the cost — open source under MIT License.
Pro for frontier reasoning and agentic work. Flash for cost-sensitive, high-throughput production.
The flagship. 1.6 trillion total parameters with 49B active per token. Near-frontier performance on coding, math, and agentic tasks — the best open-weights model available today at any price.
Lean and fast. 284B parameters with only 13B active. Within 1–2 points of Pro on most benchmarks at 10× lower API cost. The practical default for production workloads.
Six novel engineering advances combine to make V4 the most efficient large model ever built.
Core Innovation: A hybrid of Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA). At 1M tokens, V4-Pro uses only 27% of V3.2's inference FLOPs and 10% of the KV cache.
Long Context: Engram, a conditional memory architecture that selectively stores and retrieves relevant information across 1M tokens, solving the "needle in a haystack" failure mode of standard attention.
Training: Replaces AdamW with Momentum + Orthogonalization, which removes redundancy between gradient updates and achieves faster convergence at 32T+ token pre-training scale.
Stability: Modified Hyperbolic Connections replace standard residuals. Constraining weight updates to a Riemannian manifold enables stable training across hundreds of transformer layers.
Efficiency: Mixture of Experts with 1.6T total parameters but only 49B active per token, delivering full frontier intelligence without paying for all 1.6T params on every inference call (see the rough compute sketch below).
Hardware: First frontier-class model trained entirely on domestic Chinese hardware (Huawei Ascend 950 chips and Cambricon accelerators), proving frontier AI is possible without Nvidia Blackwell.
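To make the Efficiency claim concrete, here is a back-of-the-envelope sketch using the common approximation of ~2 forward-pass FLOPs per active parameter per token (the parameter counts come from the card above; the approximation ignores attention-length terms):

```python
# Rough per-token forward-pass cost: ~2 FLOPs per active parameter.
total_params = 1.6e12   # V4-Pro total parameters
active_params = 49e9    # parameters actually used per token (MoE routing)

dense_flops = 2 * total_params   # if all 1.6T params fired on every token
moe_flops = 2 * active_params    # what V4-Pro actually pays

print(f"Dense-equivalent: {dense_flops:.2e} FLOPs/token")
print(f"MoE (V4-Pro):     {moe_flops:.2e} FLOPs/token")
print(f"Per-token compute: ~{dense_flops / moe_flops:.0f}x less with MoE routing")
```

On these numbers the sparse routing pays for roughly 3% of the model per token, about a 33× per-token compute saving versus a dense model of the same size.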
Granular control over latency vs. intelligence, from instant responses to maximum reasoning effort.
Fast, intuitive responses for routine tasks. No internal chain-of-thought — answers immediately. Ideal for chat, summarization, and real-time applications.
Deliberate logical analysis. Slower but significantly more accurate for complex problem-solving, coding challenges, and structured reasoning tasks.
Pushes to absolute capability limits. Uses extended chain-of-thought for the most difficult problems. Recommended: 384K+ token context window for full effect.
Verified performance data across coding, reasoning, and mathematical domains.
No monthly fees. Pay only for the tokens you use. Per the table below, V4-Flash is roughly 100× cheaper than Claude Opus for comparable coding results.
| Model | Input ($/1M tokens) | Output ($/1M tokens) | Context | Open source? |
|---|---|---|---|---|
| V4-Flash | $0.14 | $0.28 | 1M | ✓ MIT |
| V4-Pro | $1.74 | $3.48 | 1M | ✓ MIT |
| GPT-5.4 | $10.00 | $30.00 | 128K | ✗ |
| Claude Opus | $15.00 | $75.00 | 200K | ✗ |
| Gemini 3 Pro | $1.25 | $5.00 | 1M | ✗ |
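As a quick sanity check on the table, here is the cost arithmetic for a hypothetical workload (the monthly token counts are invented for illustration; the prices are from the table above):

```python
# (input, output) prices in USD per 1M tokens, from the table above.
PRICES = {
    "V4-Flash": (0.14, 0.28),
    "V4-Pro": (1.74, 3.48),
    "GPT-5.4": (10.00, 30.00),
    "Claude Opus": (15.00, 75.00),
    "Gemini 3 Pro": (1.25, 5.00),
}

def cost_usd(model, input_tokens, output_tokens):
    """Total cost: tokens / 1e6 * price-per-million, input plus output."""
    inp, out = PRICES[model]
    return input_tokens / 1e6 * inp + output_tokens / 1e6 * out

# Hypothetical workload: 50M input + 5M output tokens per month.
for model in PRICES:
    print(f"{model:>12}: ${cost_usd(model, 50e6, 5e6):,.2f}/month")
```

On this workload V4-Flash comes to $8.40 against $1,125.00 for Claude Opus, which is where the "roughly 100×" figure above comes from.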
Multiple ways to access V4, from the free chat UI to a production API that drops into existing OpenAI code with a 2-line change.
Free users: go to chat.deepseek.com — no account needed. Developers: sign up at platform.deepseek.com for an API key with $5 free credits.
In chat: Instant Mode = Flash, Expert Mode = Pro. Via API: set model="deepseek-v4-flash" or "deepseek-v4-pro".
Toggle Think High for complex coding or math problems. Use Think Max for competition-level reasoning. Leave off for everyday chat.
DeepSeek's API is fully OpenAI-compatible. Change base_url and api_key; function calling, streaming, and JSON mode all work unchanged.
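A minimal quickstart sketch, assuming the official openai Python SDK and the base URL given in the FAQ further down (the prompt is only an illustration):

```python
from openai import OpenAI

# Two-line change from a stock OpenAI setup: the key and the base URL.
client = OpenAI(
    api_key="YOUR_DEEPSEEK_API_KEY",         # from platform.deepseek.com
    base_url="https://api.deepseek.com/v1",
)

resp = client.chat.completions.create(
    model="deepseek-v4-flash",  # or "deepseek-v4-pro"
    messages=[{"role": "user", "content": "Summarize the V4 release in one line."}],
)
print(resp.choices[0].message.content)
```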
Download V4-Flash (160GB) from Hugging Face. Requires ~4×A100 80GB for Flash, or a full cluster for Pro. Run under MIT license with no usage fees ever.
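For self-hosting, a sketch using vLLM as the serving engine. The Hugging Face repo id below is a guess (this page doesn't give the exact name, so check the hub), and the parallelism matches the ~4×A100 note above:

```python
from vllm import LLM, SamplingParams

# Repo id is hypothetical -- verify the actual name on Hugging Face.
llm = LLM(
    model="deepseek-ai/DeepSeek-V4-Flash",
    tensor_parallel_size=4,      # shard across ~4x A100 80GB, per the note above
    trust_remote_code=True,
)

params = SamplingParams(temperature=0.6, max_tokens=256)
out = llm.generate(["Explain why MoE inference is cheap."], params)
print(out[0].outputs[0].text)
```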
Web, mobile, API, or self-host — V4 is available everywhere, right now.
No signup. Chat with V4-Flash (Instant) or V4-Pro (Expert) directly in your browser.
OpenAI-compatible REST API. Use models deepseek-v4-pro or deepseek-v4-flash.
Download and self-host. V4-Flash (160GB) or V4-Pro (865GB). Free commercial use.
V4-Pro (1.6T params, 49B active) is the flagship — best reasoning, agentic coding, and knowledge-intensive tasks. V4-Flash (284B total, 13B active) is 10× cheaper and within 1–2 benchmark points on most real coding tasks. Start with Flash, upgrade to Pro only if you see a quality gap on your specific workload.
Yes. The Engram conditional memory system and CSA+HCA hybrid attention were engineered specifically to make 1M-token inference economically viable. V4-Pro requires only 27% of V3.2's FLOPs and 10% of the KV cache at 1M tokens. Standard transformers become prohibitively expensive at this scale — V4's architecture solves that.
Think Max extends the reasoning chain as far as the model can go — essentially unlimited budget for internal chain-of-thought. DeepSeek recommends setting a context window of at least 384K tokens when using Think Max. Think High is a lighter, faster reasoning mode suitable for most complex tasks without the latency overhead of maximum effort.
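This page doesn't document which API field selects Instant, Think High, or Think Max, so the `thinking` field below is purely hypothetical, shown only to illustrate where such a knob would go in an OpenAI-style request:

```python
from openai import OpenAI

client = OpenAI(api_key="YOUR_DEEPSEEK_API_KEY", base_url="https://api.deepseek.com/v1")

# "thinking" is a hypothetical field name for illustration only;
# the real parameter for Think High / Think Max is not specified here.
resp = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    extra_body={"thinking": "max"},  # hypothetical: "instant" / "high" / "max"
)
print(resp.choices[0].message.content)
```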
Nearly. Change base_url to https://api.deepseek.com/v1 and swap your API key. Update the model name to deepseek-v4-flash or deepseek-v4-pro. All existing code for streaming, function calling, structured JSON output, and the ChatCompletions schema works without any other changes. The Anthropic Messages API format is also supported.
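As a sketch of an existing OpenAI integration after that two-line change, here is standard streaming combined with JSON mode, both of which the answer above says work unchanged (the prompt is illustrative):

```python
from openai import OpenAI

# Only the key and base_url differ from a stock OpenAI setup.
client = OpenAI(api_key="YOUR_DEEPSEEK_API_KEY", base_url="https://api.deepseek.com/v1")

# Streaming and JSON mode use the standard ChatCompletions schema as-is.
stream = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": 'Return {"status": "ok"} as JSON.'}],
    response_format={"type": "json_object"},
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
```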
Both legacy model names will be fully retired and inaccessible after July 24, 2026 at 15:59 UTC. They currently route to V4-Flash (non-thinking and thinking modes respectively). Migrate to deepseek-v4-flash or deepseek-v4-pro now to avoid disruption.
Three key gaps: (1) HLE score (37.7%) trails Claude (40%) and Gemini (44.4%) on expert cross-domain reasoning. (2) Frontend UI code — V4-Pro produces functionally correct but visually less polished UI compared to GPT-5.5. (3) No native multimodal at launch — vision support is in development. For most coding, math, and agentic tasks, V4-Pro is competitive with the best proprietary models.
Frontier-class AI. Open source. No subscription. Start chatting now or grab an API key and build.