1.6 trillion parameters. 49 billion active per token. 1 million token context. The most capable open-source model ever built, rivaling GPT-5.5 and Claude Opus 4.7 at one-seventh the cost.
V4-Pro isn't V3 scaled up. Four new architectural innovations make 1M context economically viable for the first time.
Replaces standard full attention with two complementary mechanisms. Compressed Sparse Attention (CSA) selects the top 1,024 most relevant KV pairs per query. Heavily Compressed Attention (HCA) provides cheap global context from distant tokens. Together they make 1M-token inference practical — not just a benchmark number.
The CSA+HCA hybrid architecture reduces KV cache memory to just 10% of what V3.2 required at the same 1M-token context length. This makes long-context production deployments — processing entire codebases, legal contracts, or books — economically viable at scale.
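To make the CSA idea concrete, here is a minimal single-head sketch of top-k KV selection in PyTorch. This illustrates the general technique only, not DeepSeek's kernel: the learned indexer, batching, and the HCA path are omitted, and this dense version still computes all scores before discarding them.

```python
import torch

def topk_sparse_attention(q, k, v, k_keep=1024):
    """Single-head sketch: each query attends to only its k_keep best KV pairs."""
    scores = (q @ k.T) / k.shape[-1] ** 0.5          # (n_q, n_kv) similarity
    k_keep = min(k_keep, k.shape[0])
    keep = scores.topk(k_keep, dim=-1).indices       # top-k KV slots per query
    mask = torch.full_like(scores, float("-inf"))
    mask.scatter_(-1, keep, 0.0)                     # 0 where kept, -inf elsewhere
    return torch.softmax(scores + mask, dim=-1) @ v  # (n_q, d)

out = topk_sparse_attention(torch.randn(8, 64), torch.randn(4096, 64), torch.randn(4096, 64))
```

A production kernel gathers only the selected KV pairs rather than masking a dense score matrix, which is where the FLOP and KV-cache savings come from.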
Replaces standard residual connections with mixing matrices constrained to the Birkhoff Polytope (a doubly-stochastic manifold). Prevents signal explosion in deep networks and enables stable training at 1.6T parameter scale. Makes the extreme depth of V4-Pro trainable without gradient instability.
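A doubly stochastic matrix has non-negative entries whose rows and columns each sum to 1, so mixing residual streams with it cannot amplify total signal. As a rough illustration (the Sinkhorn parameterization below is an assumption for the sketch, not DeepSeek's published method), an unconstrained parameter can be projected onto that constraint like this:

```python
import torch

def sinkhorn(logits, n_iters=30):
    """Approximately project exp(logits) onto the Birkhoff polytope."""
    m = logits.exp()                            # positive entries
    for _ in range(n_iters):
        m = m / m.sum(dim=1, keepdim=True)      # rows sum to 1
        m = m / m.sum(dim=0, keepdim=True)      # columns sum to 1
    return m

mix = sinkhorn(torch.randn(4, 4))
print(mix.sum(dim=0), mix.sum(dim=1))           # both ~[1, 1, 1, 1]
```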
Replaces AdamW for most parameters with the Muon optimizer (Momentum + Orthogonalization). Removes redundancy between gradient updates, achieving faster convergence and greater training stability at 32T+ token pre-training scale. AdamW retained for embeddings, prediction head, and normalization weights.
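The core Muon step is public: accumulate momentum, then run a few Newton-Schulz iterations that push the update's singular values toward 1. The sketch below follows the open reference implementation; whether V4-Pro's variant differs is not documented here.

```python
import torch

def newton_schulz(g, steps=5):
    """Orthogonalize a 2-D update: drive its singular values toward 1."""
    a, b, c = 3.4445, -4.7750, 2.0315           # reference Muon coefficients
    x = g / (g.norm() + 1e-7)                   # keep spectral norm below 1 for convergence
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if transposed else x

def muon_step(weight, grad, momentum_buf, lr=0.02, beta=0.95):
    """One simplified Muon update for a 2-D weight matrix."""
    momentum_buf.mul_(beta).add_(grad)
    weight.add_(newton_schulz(momentum_buf), alpha=-lr)
```

Muon is defined for 2-D weight matrices, which is consistent with AdamW staying in place for embeddings, the prediction head, and normalization weights.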
1.6T total parameters but only 49B activate per token. Specialized expert networks handle different types of knowledge while a learned router selects the most relevant experts for each query. Full frontier intelligence without paying for 1.6T parameters on every inference call.
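A toy router makes the sparse-activation idea concrete. The sizes and top-2 routing below are illustrative, not V4-Pro's actual configuration:

```python
import torch
import torch.nn as nn

class TinyMoE(nn.Module):
    """Toy sparse MoE layer: each token runs only its top-k experts."""
    def __init__(self, d=64, n_experts=16, k=2):
        super().__init__()
        self.router = nn.Linear(d, n_experts)    # learned router
        self.experts = nn.ModuleList(nn.Linear(d, d) for _ in range(n_experts))
        self.k = k

    def forward(self, x):                        # x: (tokens, d)
        weights, idx = self.router(x).softmax(dim=-1).topk(self.k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e in idx[:, slot].unique():      # only the chosen experts execute
                rows = idx[:, slot] == e
                out[rows] += weights[rows, slot, None] * self.experts[int(e)](x[rows])
        return out

y = TinyMoE()(torch.randn(8, 64))                # 16 experts exist, ~2 run per token
```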
First frontier model with FP4 quantization-aware training applied to MoE expert weights and the indexer QK path during pre-training itself — not as post-training quantization. MoE expert parameters use FP4; most other parameters use FP8. Reduces memory and inference cost without the accuracy loss of post-hoc quantization.
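The standard QAT trick is a straight-through estimator: snap weights to the quantization grid on the forward pass while letting gradients flow through the full-precision copy. The sketch below uses the E2M1 (FP4) magnitude grid with per-tensor scaling; DeepSeek's actual recipe (block sizes, scaling granularity) is not specified here, so treat this as the generic technique only.

```python
import torch

FP4_GRID = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes

def fake_quant_fp4(w):
    """Straight-through FP4 fake quantization with per-tensor scaling."""
    scale = w.abs().max() / 6.0 + 1e-12              # map max |w| to the top grid value
    mags = (w.abs() / scale).unsqueeze(-1)           # (..., 1) against the (8,) grid
    q = FP4_GRID[(mags - FP4_GRID).abs().argmin(-1)] * w.sign() * scale
    return w + (q - w).detach()                      # forward: q, backward: grad w.r.t. w

w = torch.randn(4, 4, requires_grad=True)
fake_quant_fp4(w).sum().backward()                   # gradients flow as if unquantized
```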
Verified scores from public evaluations. V4-Pro leads all open-source models and competes with the best proprietary models at a fraction of the price.
V4-Pro supports three reasoning effort levels per request — dynamically control latency vs accuracy without switching models.
Instant, intuitive responses with no internal chain-of-thought: the model answers immediately from learned patterns. Best for chat, Q&A, summarization, translation, and real-time applications where sub-second latency matters.
Think High: conscious analytical reasoning. The model applies structured logical analysis before answering and is significantly more accurate on complex coding, data analysis, and technical problem-solving. Recommended for most professional use cases.
Think Max: the full reasoning budget. The model explores the problem space exhaustively before answering, achieving the headline benchmark scores (80.6% SWE-bench, 93.5 LiveCodeBench). Token-intensive: ~190M output tokens per benchmark run. Use for the hardest agentic coding and scientific reasoning tasks.
Switching modes via API
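A request sketch using the OpenAI-compatible schema. The field name reasoning_effort and its values are assumptions for illustration; check DeepSeek's API docs for the actual parameter.

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com/v1", api_key="YOUR_DEEPSEEK_KEY")

resp = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[{"role": "user", "content": "Find the bug in this diff: ..."}],
    extra_body={"reasoning_effort": "high"},  # hypothetical field and values
)
print(resp.choices[0].message.content)
```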
No monthly fee. Pay only for tokens. A 75% promotional discount applies until May 31, 2026.
Promotional pricing applies until May 31, 2026: input costs $0.435/1M tokens (75% off the regular $1.74/1M). Use model deepseek-v4-pro via the official API.
Standard pricing after May 31, 2026 promotion ends. Still 7× cheaper than Claude Opus 4.7 on output tokens.
Full access to V4-Pro (Expert Mode) at chat.deepseek.com. No subscription, no ads, no hidden limits for normal use.
Download full weights (865 GB, FP8) from Hugging Face. No API fees ever. Commercial use allowed without contacting DeepSeek.
Full benchmark and pricing comparison against the top proprietary and open-source frontier models — May 2026.
| Model | SWE-bench | LiveCodeBench | HLE | Input /1M tok | Output /1M tok | Context | Open? |
|---|---|---|---|---|---|---|---|
| DeepSeek V4-Pro | 80.6% | 93.5 | 37.7% | $1.74 | $3.48 | 1M | ✓ MIT |
| Claude Opus 4.7 | 80.8% | 88.8 | 40.0% | $5.00 | $25.00 | 200K | ✗ Closed |
| GPT-5.5 | 74%+ | ~86 | 39.8% | $5.00 | $20.00 | 128K | ✗ Closed |
| Gemini 3.1 Pro | 80.6% | ~87 | 44.4% | $1.25 | $5.00 | 1M | ✗ Closed |
| Qwen 3.6 Plus | ~76% | ~88 | ~35% | $0.50 | $2.00 | 128K | Partial |
| DeepSeek V3.2 | ~74% | ~85 | ~32% | $0.28 | $0.42 | 128K | ✓ MIT |
V4-Pro scores 37.7% on Humanity's Last Exam, trailing Claude Opus 4.7 (40.0%), GPT-5.5 (39.8%), and Gemini 3.1 Pro (44.4%). For cross-domain expert-level reasoning requiring broad real-world knowledge, closed models still lead.
SimpleQA-Verified: V4-Pro 57.9% vs Gemini 75.6%. For workloads requiring accurate real-world fact retrieval across diverse domains, Gemini holds a meaningful edge.
Think Max mode generates ~190M output tokens per benchmark run — far above the 47M median. Monitor output token usage carefully in production; Think Max costs scale with reasoning depth.
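At the standard output price, that reasoning volume is material. A back-of-envelope check using only numbers from this page:

```python
output_tokens = 190e6                 # ~Think Max output tokens per benchmark run
price_per_m = 3.48                    # standard output price, $/1M tokens
print(f"${output_tokens / 1e6 * price_per_m:,.2f}")   # ~$661.20 in output tokens alone
```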
Hosted API data may be stored on servers in China. Not suitable for HIPAA/SOC2-regulated data without self-hosting or routing through AWS Bedrock / Azure AI with data residency guarantees.
DeepSeek V4-Pro is fully OpenAI API compatible. Change base_url and api_key — nothing else.
The API uses the standard /v1/chat/completions endpoint with the OpenAI-compatible request schema. All existing code for streaming, function calling, and structured outputs works unchanged.
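For example, an existing streaming client needs only the two changed fields; the loop itself is untouched:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepseek.com/v1",   # changed from the OpenAI endpoint
    api_key="YOUR_DEEPSEEK_KEY",              # changed key; everything else is identical
)

stream = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[{"role": "user", "content": "Summarize this contract: ..."}],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")
```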
Legacy model retirement: deepseek-chat and deepseek-reasoner retire July 24, 2026. Migrate to deepseek-v4-pro or deepseek-v4-flash now.
MIT licensed — download weights freely. Full-size V4-Pro requires serious GPU infrastructure. Distilled variants serve most use cases.
Quick local setup with Ollama
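A minimal sketch with the official ollama Python package (pip install ollama), assuming a running Ollama server. The model tag is hypothetical; given the 865 GB full weights, a distilled variant is the realistic local choice.

```python
import ollama

# Hypothetical tag: substitute whichever V4-Pro distillation Ollama actually lists.
ollama.pull("deepseek-v4-pro")

resp = ollama.chat(
    model="deepseek-v4-pro",
    messages=[{"role": "user", "content": "Explain mHC in one paragraph."}],
)
print(resp["message"]["content"])
```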
V4-Pro is DeepSeek's flagship 1.6T parameter MoE model, released April 24, 2026. It is not a scaled-up V3 — it introduces four genuinely new architectural innovations: (1) Hybrid attention (CSA + HCA) that cuts inference FLOPs to 27% and KV cache to 10% of V3.2 at 1M context. (2) Manifold-Constrained Hyper-Connections (mHC) for training stability at trillion-parameter scale. (3) The Muon optimizer replacing AdamW for faster convergence. (4) FP4 quantization-aware training on MoE expert weights. It was pre-trained on 33T tokens (vs 14.8T for V3) and scores 80.6% on SWE-bench Verified.
On coding tasks: V4-Pro matches Claude Opus 4.7 on SWE-bench (80.6% vs 80.8%, a 0.2-point gap), beats Claude on LiveCodeBench (93.5 vs 88.8), and leads all models on Codeforces (rating 3206). It also beats Claude on Terminal-Bench 2.0 for agentic coding. However, Claude leads on HLE (40.0% vs 37.7%) and HMMT 2026 math (96.2% vs 95.2%), and Gemini leads on factual recall. For most coding and software engineering use cases, V4-Pro is a viable alternative to closed models at 7× lower cost.
1 million tokens is roughly 750,000 words — enough to fit the entire Harry Potter series, a large codebase, or months of conversation history in a single request. Most importantly, V4-Pro's CSA+HCA hybrid attention makes this practical: at 1M context, it requires only 10% of the KV cache memory that V3.2 needed. This means 1M context is economically viable in production, not just a benchmark number. DeepSeek recommends setting context to at least 384K tokens when using Think Max mode.
Think High applies structured analytical reasoning with a fixed budget — faster, suitable for most complex tasks, recommended for production coding agents. Think Max (Pro-Max mode) gives the model unlimited reasoning budget, exhaustively exploring the problem space. This achieves the headline benchmark scores but generates ~190M output tokens per benchmark run — far above the 47M median. Monitor output costs carefully in Think Max mode. Set context window to at least 384K tokens for best results.
Three ways: (1) Free web chat at chat.deepseek.com — enable Expert Mode. Full V4-Pro, free, no subscription. (2) API at platform.deepseek.com — model name deepseek-v4-pro, $1.74/1M input (promo: $0.435 until May 31). New accounts get 5M free tokens. (3) Self-host from Hugging Face — 865 GB weights under MIT license, requires 8×H100 80GB minimum for full V4-Pro.
The model weights are fully open under the MIT license at huggingface.co/deepseek-ai/DeepSeek-V4-Pro. This means you can download, run, fine-tune, and build commercial products without restrictions or fees. The training code and full dataset are not published (standard for large model releases). For practical purposes: open weights enable self-hosting, auditing, and fine-tuning — everything most developers and enterprises need.
When using Think High or Think Max mode, the response includes a reasoning_content field in addition to the standard content field. reasoning_content contains the model's internal chain-of-thought — the full reasoning process before the final answer. This is useful for debugging, educational applications, and verifying the model's logic. Note: a common gotcha reported by developers — many OpenAI-compatible client libraries don't expose reasoning_content by default and require accessing the raw response object.
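A defensive way to read the field that works whether or not the client's typed message object exposes it (the extra_body parameter name remains the assumption noted earlier):

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com/v1", api_key="YOUR_DEEPSEEK_KEY")
resp = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[{"role": "user", "content": "Prove that sqrt(2) is irrational."}],
    extra_body={"reasoning_effort": "high"},   # hypothetical field name, as above
)

msg = resp.choices[0].message
reasoning = getattr(msg, "reasoning_content", None)          # typed attribute, if exposed
if reasoning is None:
    reasoning = msg.model_dump().get("reasoning_content")    # fall back to the raw payload
print(reasoning)
print(msg.content)                                           # the final answer
```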
80.6% SWE-bench. Codeforces #1 (3206). 1M context. MIT licensed. Start in Expert Mode for free — no account needed.