1.6 trillion parameters. 49 billion active per token. 1 million token context. The most capable open-source model ever built, rivaling GPT-5.5 and Claude Opus 4.7 at one-seventh the cost.
V4-Pro isn't V3 scaled up. Four new architectural innovations make 1M context economically viable for the first time.
Replaces standard full attention with two complementary mechanisms. Compressed Sparse Attention (CSA) selects the top 1,024 most relevant KV pairs per query. Heavily Compressed Attention (HCA) provides cheap global context from distant tokens. Together they make 1M-token inference practical — not just a benchmark number.
The CSA+HCA hybrid architecture reduces KV cache memory to just 10% of what V3.2 required at the same 1M-token context length. This makes long-context production deployments — processing entire codebases, legal contracts, or books — economically viable at scale.
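To make the CSA idea concrete, here is a minimal single-head sketch of top-k KV selection in PyTorch. This illustrates the general technique only, not DeepSeek's kernel: the learned indexer, batching, and the HCA path are omitted, and this dense version still computes all scores before discarding them.

```python
import torch

def topk_sparse_attention(q, k, v, k_keep=1024):
    """Single-head sketch: each query attends to only its k_keep best KV pairs."""
    scores = (q @ k.T) / k.shape[-1] ** 0.5          # (n_q, n_kv) similarity
    k_keep = min(k_keep, k.shape[0])
    keep = scores.topk(k_keep, dim=-1).indices       # top-k KV slots per query
    mask = torch.full_like(scores, float("-inf"))
    mask.scatter_(-1, keep, 0.0)                     # 0 where kept, -inf elsewhere
    return torch.softmax(scores + mask, dim=-1) @ v  # (n_q, d)

out = topk_sparse_attention(torch.randn(8, 64), torch.randn(4096, 64), torch.randn(4096, 64))
```

A production kernel gathers only the selected KV pairs rather than masking a dense score matrix, which is where the FLOP and KV-cache savings come from.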
Replaces standard residual connections with mixing matrices constrained to the Birkhoff Polytope (a doubly-stochastic manifold). Prevents signal explosion in deep networks and enables stable training at 1.6T parameter scale. Makes the extreme depth of V4-Pro trainable without gradient instability.
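A doubly stochastic matrix has non-negative entries whose rows and columns each sum to 1, so mixing residual streams with it cannot amplify total signal. As a rough illustration (the Sinkhorn parameterization below is an assumption for the sketch, not DeepSeek's published method), an unconstrained parameter can be projected onto that constraint like this:

```python
import torch

def sinkhorn(logits, n_iters=30):
    """Approximately project exp(logits) onto the Birkhoff polytope."""
    m = logits.exp()                            # positive entries
    for _ in range(n_iters):
        m = m / m.sum(dim=1, keepdim=True)      # rows sum to 1
        m = m / m.sum(dim=0, keepdim=True)      # columns sum to 1
    return m

mix = sinkhorn(torch.randn(4, 4))
print(mix.sum(dim=0), mix.sum(dim=1))           # both ~[1, 1, 1, 1]
```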
Replaces AdamW for most parameters with the Muon optimizer (Momentum + Orthogonalization). Removes redundancy between gradient updates, achieving faster convergence and greater training stability at 32T+ token pre-training scale. AdamW retained for embeddings, prediction head, and normalization weights.
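The core Muon step is public: accumulate momentum, then run a few Newton-Schulz iterations that push the update's singular values toward 1. The sketch below follows the open reference implementation; whether V4-Pro's variant differs is not documented here.

```python
import torch

def newton_schulz(g, steps=5):
    """Orthogonalize a 2-D update: drive its singular values toward 1."""
    a, b, c = 3.4445, -4.7750, 2.0315           # reference Muon coefficients
    x = g / (g.norm() + 1e-7)                   # keep spectral norm below 1 for convergence
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if transposed else x

def muon_step(weight, grad, momentum_buf, lr=0.02, beta=0.95):
    """One simplified Muon update for a 2-D weight matrix."""
    momentum_buf.mul_(beta).add_(grad)
    weight.add_(newton_schulz(momentum_buf), alpha=-lr)
```

Muon is defined for 2-D weight matrices, which is consistent with AdamW staying in place for embeddings, the prediction head, and normalization weights.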
1.6T total parameters but only 49B activate per token. Specialized expert networks handle different types of knowledge while a learned router selects the most relevant experts for each query. Full frontier intelligence without paying for 1.6T parameters on every inference call.
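A toy router makes the sparse-activation idea concrete. The sizes and top-2 routing below are illustrative, not V4-Pro's actual configuration:

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Toy sparse MoE layer: each token runs only its top-k experts."""
    def __init__(self, d=64, n_experts=16, k=2):
        super().__init__()
        self.router = nn.Linear(d, n_experts)    # learned router
        self.experts = nn.ModuleList(nn.Linear(d, d) for _ in range(n_experts))
        self.k = k

    def forward(self, x):                        # x: (tokens, d)
        weights, idx = self.router(x).softmax(dim=-1).topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e in idx[:, slot].unique():      # only the chosen experts execute
                rows = idx[:, slot] == e
                out[rows] += weights[rows, slot, None] * self.experts[int(e)](x[rows])
        return out

y = TinyMoE()(torch.randn(8, 64))                # 16 experts exist, ~2 run per token
```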
First frontier model with FP4 quantization-aware training applied to MoE expert weights and the indexer QK path during pre-training itself — not as post-training quantization. MoE expert parameters use FP4; most other parameters use FP8. Reduces memory and inference cost without the accuracy loss of post-hoc quantization.
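The standard QAT trick is a straight-through estimator: snap weights to the quantization grid on the forward pass while letting gradients flow through the full-precision copy. The sketch below uses the E2M1 (FP4) magnitude grid with per-tensor scaling; DeepSeek's actual recipe (block sizes, scaling granularity) is not specified here, so treat this as the generic technique only.

```python
import torch

FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes

def fake_quant_fp4(w):
    """Straight-through FP4 fake quantization with per-tensor scaling."""
    scale = w.abs().max() / 6.0 + 1e-12              # map max |w| to the top grid value
    mags = (w.abs() / scale).unsqueeze(-1)           # (..., 1) against the (8,) grid
    q = FP4_GRID[(mags - FP4_GRID).abs().argmin(-1)] * w.sign() * scale
    return w + (q - w).detach()                      # forward: q, backward: grad w.r.t. w

w = torch.randn(4, 4, requires_grad=True)
fake_quant_fp4(w).sum().backward()                   # gradients flow as if unquantized
```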
Verified scores from public evaluations. V4-Pro leads all open-source models and competes with the best proprietary models at a fraction of the price.
V4-Pro supports three reasoning effort levels per request — dynamically control latency vs accuracy without switching models.
Instant, intuitive responses with no internal chain-of-thought: the model answers immediately from learned patterns. Best for chat, Q&A, summarization, translation, and real-time applications where sub-second latency matters.
Think High: conscious analytical reasoning. The model applies structured logical analysis before answering and is significantly more accurate on complex coding, data analysis, and technical problem-solving. Recommended for most professional use cases.
Think Max: the full reasoning budget. The model explores the problem space exhaustively before answering, achieving the headline benchmark scores (80.6% SWE-bench, 93.5 LiveCodeBench). Token-intensive: ~190M output tokens per benchmark run. Use for the hardest agentic coding and scientific reasoning tasks.
Switching modes via API
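A request sketch using the OpenAI-compatible schema. The field name reasoning_effort and its values are assumptions for illustration; check DeepSeek's API docs for the actual parameter.

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com/v1", api_key="YOUR_DEEPSEEK_KEY")

resp = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[{"role": "user", "content": "Find the bug in this diff: ..."}],
    extra_body={"reasoning_effort": "high"},  # hypothetical field and values
)
print(resp.choices[0].message.content)
```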
No monthly fee. Pay only for tokens. A 75% promotional discount applies until May 31, 2026.
Promotional pricing applies until May 31, 2026: input costs $0.435/1M tokens (75% off the regular $1.74/1M). Use model deepseek-v4-pro via the official API.
Standard pricing after May 31, 2026 promotion ends. Still 7× cheaper than Claude Opus 4.7 on output tokens.
Full access to V4-Pro (Expert Mode) at chat.deepseek.com. No subscription, no ads, no hidden limits for normal use.
Download full weights (865 GB, FP8) from Hugging Face. No API fees ever. Commercial use allowed without contacting DeepSeek.
Full benchmark and pricing comparison against the top proprietary and open-source frontier models — May 2026.
| Model | SWE-bench | LiveCodeBench | HLE | Input /1M tok | Output /1M tok | Context | Open? |
|---|---|---|---|---|---|---|---|
| DeepSeek V4-Pro | 80.6% | 93.5 | 37.7% | $1.74 | $3.48 | 1M | ✓ MIT |
| Claude Opus 4.7 | 80.8% | 88.8 | 40.0% | $5.00 | $25.00 | 200K | ✗ Closed |
| GPT-5.5 | 74%+ | ~86 | 39.8% | $5.00 | $20.00 | 128K | ✗ Closed |
| Gemini 3.1 Pro | 80.6% | ~87 | 44.4% | $1.25 | $5.00 | 1M | ✗ Closed |
| Qwen 3.6 Plus | ~76% | ~88 | ~35% | $0.50 | $2.00 | 128K | Partial |
| DeepSeek V3.2 | ~74% | ~85 | ~32% | $0.28 | $0.42 | 128K | ✓ MIT |
V4-Pro scores 37.7% on Humanity's Last Exam, trailing Claude Opus 4.7 (40.0%), GPT-5.5 (39.8%), and Gemini 3.1 Pro (44.4%). For cross-domain expert-level reasoning requiring broad real-world knowledge, closed models still lead.
SimpleQA-Verified: V4-Pro 57.9% vs Gemini 75.6%. For workloads requiring accurate real-world fact retrieval across diverse domains, Gemini holds a meaningful edge.
Think Max mode generates ~190M output tokens per benchmark run — far above the 47M median. Monitor output token usage carefully in production; Think Max costs scale with reasoning depth.
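At the standard output price, that reasoning volume is material. A back-of-envelope check using only numbers from this page:

```python
output_tokens = 190e6                 # ~Think Max output tokens per benchmark run
price_per_m = 3.48                    # standard output price, $/1M tokens
print(f"${output_tokens / 1e6 * price_per_m:,.2f}")   # ~$661.20 in output tokens alone
```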
Hosted API data may be stored on servers in China. Not suitable for HIPAA/SOC2-regulated data without self-hosting or routing through AWS Bedrock / Azure AI with data residency guarantees.
DeepSeek V4-Pro is fully OpenAI API compatible. Change base_url and api_key — nothing else.
The API uses the standard /v1/chat/completions endpoint with the OpenAI-compatible request schema. All existing code for streaming, function calling, and structured outputs works unchanged.
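For example, an existing streaming client needs only the two changed fields; the loop itself is untouched:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepseek.com/v1",   # changed from the OpenAI endpoint
    api_key="YOUR_DEEPSEEK_KEY",              # changed key; everything else is identical
)

stream = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[{"role": "user", "content": "Summarize this contract: ..."}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")
```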
Legacy model retirement: deepseek-chat and deepseek-reasoner retire July 24, 2026. Migrate to deepseek-v4-pro or deepseek-v4-flash now.
MIT licensed — download weights freely. Full-size V4-Pro requires serious GPU infrastructure. Distilled variants serve most use cases.
Quick local setup with Ollama
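A minimal sketch with the official ollama Python package (pip install ollama), assuming a running Ollama server. The model tag is hypothetical; given the 865 GB full weights, a distilled variant is the realistic local choice.

```python
import ollama

# Hypothetical tag: substitute whichever V4-Pro distillation Ollama actually lists.
ollama.pull("deepseek-v4-pro")

resp = ollama.chat(
    model="deepseek-v4-pro",
    messages=[{"role": "user", "content": "Explain mHC in one paragraph."}],
)
print(resp["message"]["content"])
```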
V4-Pro is DeepSeek's flagship 1.6T parameter MoE model, released April 24, 2026. It is not a scaled-up V3 — it introduces four genuinely new architectural innovations: (1) Hybrid attention (CSA + HCA) that cuts inference FLOPs to 27% and KV cache to 10% of V3.2 at 1M context. (2) Manifold-Constrained Hyper-Connections (mHC) for training stability at trillion-parameter scale. (3) The Muon optimizer replacing AdamW for faster convergence. (4) FP4 quantization-aware training on MoE expert weights. It was pre-trained on 33T tokens (vs 14.8T for V3) and scores 80.6% on SWE-bench Verified.
On coding tasks: V4-Pro matches Claude Opus 4.7 on SWE-bench (80.6% vs 80.8%, a 0.2-point gap), beats Claude on LiveCodeBench (93.5 vs 88.8), and leads all models on Codeforces (rating 3206). It also beats Claude on Terminal-Bench 2.0 for agentic coding. However, Claude leads on HLE (40.0% vs 37.7%) and HMMT 2026 math (96.2% vs 95.2%), and Gemini leads on factual recall. For most coding and software engineering use cases, V4-Pro is a viable alternative to closed models at 7× lower cost.
1 million tokens is roughly 750,000 words — enough to fit the entire Harry Potter series, a large codebase, or months of conversation history in a single request. Most importantly, V4-Pro's CSA+HCA hybrid attention makes this practical: at 1M context, it requires only 10% of the KV cache memory that V3.2 needed. This means 1M context is economically viable in production, not just a benchmark number. DeepSeek recommends setting context to at least 384K tokens when using Think Max mode.
Think High applies structured analytical reasoning with a fixed budget — faster, suitable for most complex tasks, recommended for production coding agents. Think Max (Pro-Max mode) gives the model unlimited reasoning budget, exhaustively exploring the problem space. This achieves the headline benchmark scores but generates ~190M output tokens per benchmark run — far above the 47M median. Monitor output costs carefully in Think Max mode. Set context window to at least 384K tokens for best results.
Three ways: (1) Free web chat at chat.deepseek.com — enable Expert Mode. Full V4-Pro, free, no subscription. (2) API at platform.deepseek.com — model name deepseek-v4-pro, $1.74/1M input (promo: $0.435 until May 31). New accounts get 5M free tokens. (3) Self-host from Hugging Face — 865 GB weights under MIT license, requires 8×H100 80GB minimum for full V4-Pro.
The model weights are fully open under the MIT license at huggingface.co/deepseek-ai/DeepSeek-V4-Pro. This means you can download, run, fine-tune, and build commercial products without restrictions or fees. The training code and full dataset are not published (standard for large model releases). For practical purposes: open weights enable self-hosting, auditing, and fine-tuning — everything most developers and enterprises need.
When using Think High or Think Max mode, the response includes a reasoning_content field in addition to the standard content field. reasoning_content contains the model's internal chain-of-thought — the full reasoning process before the final answer. This is useful for debugging, educational applications, and verifying the model's logic. Note: a common gotcha reported by developers — many OpenAI-compatible client libraries don't expose reasoning_content by default and require accessing the raw response object.
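A defensive way to read the field that works whether or not the client's typed message object exposes it (the extra_body parameter name remains the assumption noted earlier):

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com/v1", api_key="YOUR_DEEPSEEK_KEY")
resp = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    extra_body={"reasoning_effort": "high"},   # hypothetical field name, as above
)

msg = resp.choices[0].message
reasoning = getattr(msg, "reasoning_content", None)          # typed attribute, if exposed
if reasoning is None:
    reasoning = msg.model_dump().get("reasoning_content")    # fall back to the raw payload
print(reasoning)
print(msg.content)                                           # the final answer
```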
80.6% SWE-bench. Codeforces #1 (3206). 1M context. MIT licensed. Start in Expert Mode for free — no account needed.