83.3 tok/s · 1.11s TTFT · Available now

Fast.
V4-Flash.

284B parameter MoE. 13B active per token. 79% SWE-bench. 1M context. $0.14/1M input tokens. The default model for high-throughput production AI — not a compromise, a choice.

83.3 tokens / sec
1.11s time to first token
💰 $0.14 input / 1M tokens
🎯 79.0% SWE-bench Verified
Try in Instant Mode · Get API Key · 🤗 Open Weights
Why V4-Flash

Not Lite. Just Right.

V4-Flash isn't a downgrade from V4-Pro — it's purpose-built for the 90% of production workloads where speed and cost matter more than the final 1.6 benchmark points.

Speed
83 Tokens Per Second

Above-average generation speed vs models of the same size class (median: 54 tok/s). Faster throughput means snappier chat UIs, quicker batch jobs, and lower latency for end users in production.

54% faster than the 284B class median
💰
Cost
12.4× Cheaper Than Pro

$0.28/1M output vs V4-Pro's $3.48. That's not a small discount — it's a fundamentally different cost structure. A workload costing $1,000/month on Pro runs for $80 on Flash. Same OpenAI-compatible API endpoint.

$0.28 per 1M output tokens
📏
Context
Same 1M Token Window

Flash shares the full 1M token context window with Pro. Both implement CSA+HCA hybrid attention — the architecture that makes long-context inference economically viable. Process entire codebases at $0.14/1M.

1M tokens — same as V4-Pro
🎯
Quality
Only 1.6pts Behind Pro

79.0% SWE-bench vs Pro's 80.6%. 91.6% LiveCodeBench vs 93.5%. For most developer coding tasks, these are functionally equivalent results. The gap widens on complex agentic workflows — not on everyday coding.

1.6pt gap vs Pro on SWE-bench
Benchmarks

Where Flash Stands

Honest numbers — including where Flash leads, where it matches Pro, and where the gap matters.

SWE-bench Verified (real-world GitHub issue resolution)
V4-Flash 79.0% · V4-Pro 80.6% (gap: 1.6pt) · GPT-5.4 72.0% · Claude Op. 80.8%
LiveCodeBench (live competitive programming, Pass@1)
V4-Flash 91.6 · V4-Pro 93.5 · Claude Op. 88.8
Terminal-Bench 2.0 (agentic CLI tool use; Flash's main gap vs Pro)
V4-Pro 67.9% · Claude Op. 65.4% · V4-Flash 56.9%
Artificial Analysis Intelligence Index (multi-benchmark composite, max reasoning mode)
V4-Flash 47 / 75 · class median 29 / 75
SimpleQA-Verified (factual recall; Flash's largest gap vs Pro)
V4-Pro 57.9% · V4-Flash 34.1%
Flash-Max vs Pro-Max
Max reasoning mode — Flash approaches Pro quality
Flash-Max ≈ Pro
DeepSeek states that V4-Flash-Max achieves comparable reasoning performance to V4-Pro when given a larger thinking budget (Think Max mode). The smaller parameter scale places it slightly behind on pure knowledge tasks and the most complex agentic workflows — but on most reasoning benchmarks, the gap narrows significantly with increased thinking budget.
⚡ Where Flash is better or equal: output speed (83.3 tok/s) · cost efficiency (12.4×) · time to first token (1.11s) · simple agent tasks (≈ Pro)
🔵 Where Pro is better: Terminal-Bench 2.0 (67.9%) · SimpleQA recall (57.9%) · complex multi-step workflows (clear lead)
Performance Metrics

Speed That Matters

83 tokens per second. 1.11s time to first token. Well above the median for open-weight models of the same class — and faster than many proprietary alternatives.

🏎
83.3 tok/s
Output Speed

Generation throughput on DeepSeek's own API (streaming). Well above the 284B-class median of 54.1 tok/s. Fast enough for smooth chat UIs without perceptible buffering.

↑ Above average for class
1.11s TTFT
Time to First Token

Very competitive vs the 284B-class median of 2.38s. First-chunk latency is critical for chat UIs — Flash feels responsive because the first token arrives quickly.

↑ Very competitive (median: 2.38s)
📊
46.5 / 75
Intelligence Index

Scores 46.5 on the Artificial Analysis Intelligence Index (non-reasoning mode). Ranked #16 of 525 published models — "near-flagship intelligence" at a fraction of flagship cost.

#16 of 525 models
🔁
1M ctx
Context Window

Full 1 million token context — identical to V4-Pro. Process complete codebases, books, or long conversation histories. Same CSA+HCA hybrid attention for efficient long-context inference.

= V4-Pro context
🏎 Output speed comparison (tokens/sec): V4-Flash 83.3 · V3.2 ~60 · class median 54.1 · V4-Pro ~40
⏱ Time to first token (seconds, lower is better): V4-Flash 1.11 · GPT-5.5 ~1.4 · class median 2.38 · V4-Pro ~1.9
Pricing

Production Scale Without the Bill

$0.14 input. $0.28 output. Automatic 90% cache discount. No monthly fees.

V4-Pro (comparison)
$3.48/1M out

12.4× more expensive per output token. Use when benchmarks show a quality gap on your specific workload.

Input: $1.74/1M
Cache hit: $0.174/1M
Output: $3.48/1M
vs Flash cost: 12.4× per output token
See Flash vs Pro
✓ Free Forever
Web Chat (Instant Mode)
$0/month

Instant Mode at chat.deepseek.com is V4-Flash. Free, no account required, no subscription.

Model: V4-Flash
DeepThink: ✓ Included
File uploads: ✓ Included
Ads: None
Open Chat Free
Self-Host (MIT)
Open Weights

160 GB FP8 weights. MIT licensed. Runs on a single H100 (tight) or comfortably on 4×H200 — practical for enterprise self-hosting.

License: MIT · Commercial ✓
Weight size: 160 GB (FP8)
GPU min: 1× H100 80GB
Comfortable: 4× H200
Hugging Face ↗
🧮 Cost Calculator
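If you want to sanity-check the savings for your own traffic, the arithmetic is simple enough to script. A minimal sketch using the list prices on this page (the automatic cache discount is ignored, so real Flash bills can come in lower); the token volumes are illustrative only:

# Rough monthly cost estimate from the list prices on this page (USD per 1M tokens).
# Cache hits ($0.014/1M input on Flash) are ignored here, so this slightly overestimates Flash.
PRICES = {
  "deepseek-v4-flash": {"input": 0.14, "output": 0.28},
  "deepseek-v4-pro":   {"input": 1.74, "output": 3.48},
}

def monthly_cost(model, input_millions, output_millions):
  # Token volumes are given in millions of tokens per month.
  p = PRICES[model]
  return input_millions * p["input"] + output_millions * p["output"]

# Example: 500M input + 100M output tokens per month (illustrative volumes only)
flash = monthly_cost("deepseek-v4-flash", 500, 100)
pro = monthly_cost("deepseek-v4-pro", 500, 100)
print(f"Flash ${flash:,.0f}/mo · Pro ${pro:,.0f}/mo · {pro / flash:.1f}× difference")

At those example volumes the gap works out to roughly $98 vs $1,218 per month, the same 12.4× ratio quoted above, since Flash's input and output prices are each about 12.4× lower than Pro's.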
Pro vs Flash

Which One Should You Use?

The honest answer: start with Flash. Upgrade to Pro only if evaluations reveal a quality gap on your specific task.

Decision Guide
🆕
Starting a new project? Use Flash first. Run your own benchmark on 100 representative prompts (a minimal harness sketch follows this guide). Only switch to Pro if the gap is meaningful for your users.
💬
Chat / Q&A / summarization: Flash wins. Speed and cost dominate. The quality difference is negligible for conversational tasks.
💻
Everyday coding & code review: Flash wins. 79.0% vs 80.6% SWE-bench is not a meaningful difference for most real-world code.
🤖
Complex multi-step agents: Pro recommended. The Terminal-Bench gap is 11 points (67.9% vs 56.9%). Pro's deeper expert network handles multi-hop tool chains better.
📚
Factual knowledge recall: Pro recommended. SimpleQA gap: 57.9% vs 34.1%. If accuracy on specific facts matters, Pro has a clear edge.
🖥
Self-hosting on limited GPU: Flash is the only option. Flash fits on 1× H100; Pro requires an 8× H100 cluster at minimum.
📊
High-volume production (1M+ calls/mo): Flash wins. A 12.4× cost saving at scale creates meaningful infrastructure ROI.
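A minimal harness for the "run your own benchmark" step above: send the same prompts to both model strings through the one OpenAI-compatible endpoint and compare scores. PROMPTS and score_response are placeholders to replace with your own task and metric; this is a sketch, not a full eval framework:

# A/B eval sketch: same prompts, same OpenAI-compatible endpoint, two model strings.
# PROMPTS and score_response() are placeholders — substitute your own task and metric.
import os
from openai import OpenAI

client = OpenAI(
  api_key=os.getenv("DEEPSEEK_API_KEY"),
  base_url="https://api.deepseek.com/v1"
)

PROMPTS = ["..."]  # ~100 representative prompts pulled from your real workload

def score_response(prompt, answer):
  # Replace with your metric: exact match, rubric grading, unit tests, etc.
  return 1.0 if answer else 0.0

def run_eval(model):
  scores = []
  for prompt in PROMPTS:
    resp = client.chat.completions.create(
      model=model,
      messages=[{"role": "user", "content": prompt}],
      max_tokens=1024,
    )
    scores.append(score_response(prompt, resp.choices[0].message.content))
  return sum(scores) / len(scores)

for model in ("deepseek-v4-flash", "deepseek-v4-pro"):
  print(model, run_eval(model))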
Spec | ⚡ Flash | 🔵 Pro
Total params | 284B | 1.6T
Active params / token | 13B | 49B
Context window | 1M tokens | 1M tokens
Max output | 384K | 384K
Output speed | 83 tok/s | ~40 tok/s
TTFT | 1.11s | ~1.9s
Input price | $0.14/1M | $1.74/1M
Output price | $0.28/1M | $3.48/1M
SWE-bench | 79.0% | 80.6%
LiveCodeBench | 91.6 | 93.5
Terminal-Bench 2.0 | 56.9% | 67.9%
SimpleQA recall | 34.1% | 57.9%
Self-host GPU min | 1× H100 | 8× H100
Self-host weight size | 160 GB | 865 GB
License | MIT | MIT
Thinking modes | 3 modes | 3 modes
API Integration

One Model Name Change

Already on OpenAI or V3.2? Migrate to V4-Flash by changing one string. Everything else stays identical.

base_url = "https://api.deepseek.com/v1"
model     = "deepseek-v4-flash"
OpenAI Python / Node SDK compatible
Anthropic Messages API format supported
Streaming SSE responses (83 tok/s)
Function calling / tool use
JSON mode / structured outputs
Auto context caching (90% discount on cache hits)
Three thinking modes (Non-Think, High, Max)
reasoning_content in Think modes
5M free tokens on new accounts
⚠️ Legacy aliases retire July 24, 2026: deepseek-chat currently routes to deepseek-v4-flash (non-thinking). Migrate now.
Examples: Python · Node.js · Streaming · Think Max
# pip install openai — no new package needed
from openai import OpenAI
import os

client = OpenAI(
  api_key=os.getenv("DEEPSEEK_API_KEY"),
  base_url="https://api.deepseek.com/v1"
)

response = client.chat.completions.create(
  model="deepseek-v4-flash", # ← only change
  messages=[
    {"role": "system",
     "content": "You are a helpful assistant."},
    {"role": "user",
     "content": "Explain context caching"}
  ],
  max_tokens=1024
)
print(response.choices[0].message.content)
# Cache hits: response.usage.prompt_cache_hit_tokens
// npm install openai
import OpenAI from 'openai';

const client = new OpenAI({
  apiKey: process.env.DEEPSEEK_API_KEY,
  baseURL: 'https://api.deepseek.com/v1',
});

const res = await client.chat.completions.create({
  model: 'deepseek-v4-flash',
  messages: [{
    role: 'user',
    content: 'Hello from Node.js!'
  }],
});

console.log(res.choices[0].message.content);
# Real-time streaming: 83 tok/s
from openai import OpenAI

client = OpenAI(
  api_key="<your-key>",
  base_url="https://api.deepseek.com/v1"
)

stream = client.chat.completions.create(
  model="deepseek-v4-flash",
  messages=[{
    "role":"user",
    "content":"Write a haiku about speed"
  }],
  stream=True  # ← 83 tok/s streaming
)

for chunk in stream:
  delta = chunk.choices[0].delta.content
  if delta:
    print(delta, end="", flush=True)
# Think Max on Flash — approaches Pro quality
from openai import OpenAI

client = OpenAI(
  api_key="<your-key>",
  base_url="https://api.deepseek.com/v1"
)

response = client.chat.completions.create(
  model="deepseek-v4-flash",
  messages=[{
    "role": "user",
    "content": "Hard reasoning problem..."
  }],
  max_tokens=32768,
  extra_body={
    "thinking": {
      "type": "enabled",
      "budget": "max" # Think Max
    }
  }
)
# Flash-Max ≈ Pro-Max on most reasoning
chain = response.choices[0].message.reasoning_content
Self-Hosting

Run Flash on Your Hardware

160 GB FP8 weights under MIT license. The practical self-hosting target — V4-Pro requires 8× more GPU memory.

Consumer / Dev
Distilled 7B
~4.5 GB · Via Ollama
VRAM: 8 GB
RAM: 8 GB
Hardware: RTX 3060 / M1
Command: ollama run deepseek-r1:7b
Local laptop
Recommended Workstation
Distilled 14B
~9 GB · Best local quality
VRAM: 16 GB
RAM: 16 GB
Hardware: RTX 4080 / M2 Pro
Command: ollama run deepseek-r1:14b
★ Best local choice
Production Server
Full V4-Flash (FP8)
160 GB · Single H100 (tight)
Min GPU: 1× H100 80GB
Comfortable: 4× H200
vs Pro GPU: 8× cheaper
Framework: vLLM / SGLang
★ Full Flash quality
No GPU / Cloud
Cloud Providers
AWS Bedrock · Azure · Together
Data residency: your region
Compliance: SOC2 / HIPAA
Cost vs API: ~30% higher
SLA: enterprise 99.9%
Compliance use

Quick start with Ollama

# macOS / Linux
curl -fsSL https://ollama.com/install.sh | sh

# Run DeepSeek R1 (V4-distilled variant) locally
ollama run deepseek-r1:14b # 9 GB · 16 GB RAM
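Once the full FP8 model is up under vLLM or SGLang, the server speaks the same OpenAI-compatible API as DeepSeek's hosted endpoint, so client code only needs a different base URL. A minimal sketch, assuming a vLLM server already running on its default local port and serving the Hugging Face repo deepseek-ai/DeepSeek-V4-Flash:

# Point the same OpenAI SDK at a self-hosted vLLM endpoint instead of api.deepseek.com.
# Assumes `vllm serve deepseek-ai/DeepSeek-V4-Flash` is already running on port 8000.
from openai import OpenAI

client = OpenAI(
  api_key="not-needed-locally",        # local vLLM accepts any key unless --api-key is set
  base_url="http://localhost:8000/v1"  # vLLM's OpenAI-compatible server
)

response = client.chat.completions.create(
  model="deepseek-ai/DeepSeek-V4-Flash",  # served model id defaults to the repo path
  messages=[{"role": "user", "content": "Sanity check: reply with OK"}],
  max_tokens=16
)
print(response.choices[0].message.content)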
FAQ

Questions About V4-Flash

What is DeepSeek-V4-Flash and why should I use it over V4-Pro?

V4-Flash is a 284B parameter MoE model (13B active per token) released April 24, 2026. It's the efficiency-optimized model in the V4 family — 12.4× cheaper per output token than V4-Pro, 2× faster (83 tok/s vs ~40), and scores 79.0% on SWE-bench vs Pro's 80.6%. For the vast majority of production workloads — chat, Q&A, code review, summarization, RAG pipelines — the 1.6 benchmark point gap is not detectable by users. Start with Flash; only switch to Pro if evaluations on your specific task reveal a meaningful quality gap.

How fast is V4-Flash? Is it suitable for real-time chat UIs?

Yes. V4-Flash generates 83.3 tokens per second with a 1.11s time to first token on DeepSeek's API. Both metrics are well ahead of the median for open-weight models of the same class (54 tok/s output, 2.38s TTFT). At 83 tok/s, a 200-word response (roughly 270 tokens) finishes streaming about 3 seconds after the first token — smooth enough for streaming chat UIs without perceptible buffering.

What's the exact pricing for V4-Flash?

V4-Flash costs $0.14/1M input tokens (cache miss), $0.014/1M (cache hit — 90% discount on repeated prefixes), and $0.28/1M output tokens. No monthly fees. No promo discounts apply to Flash (unlike V4-Pro's 75% promo through May 31). New accounts receive 5M free tokens. Context caching is automatic — consistent system prompts cost 90% less after the first call. Always verify current prices at api-docs.deepseek.com/quick_start/pricing.
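Because cache hits bill at a tenth of the miss price, the effective input cost per request depends on how much of the prompt was already cached. A small sketch that prices a single response from its usage fields using the numbers above (prompt_cache_hit_tokens is the counter referenced in the Python example on this page; verify the exact field names against the API docs):

# Price one response from its usage fields, using the list prices above (USD per 1M tokens).
# prompt_tokens / completion_tokens are standard OpenAI usage fields; prompt_cache_hit_tokens
# is the cache counter referenced in the Python example earlier on this page.
def request_cost(usage):
  cache_hit = getattr(usage, "prompt_cache_hit_tokens", 0) or 0
  cache_miss = usage.prompt_tokens - cache_hit
  return (cache_miss * 0.14 + cache_hit * 0.014 + usage.completion_tokens * 0.28) / 1_000_000

# Usage, given a response from client.chat.completions.create(...):
# print(f"${request_cost(response.usage):.6f}")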

Does V4-Flash support Think Max (V4-Flash-Max)?

Yes. V4-Flash supports all three reasoning modes: Non-Think (instant), Think High (analytical), and Think Max. DeepSeek states that V4-Flash-Max achieves comparable reasoning performance to V4-Pro-Max when given a larger thinking budget. The main caveat: Think Max is very token-intensive — Flash-Max generated 240M output tokens on the Intelligence Index benchmark (average: 42M). Monitor output costs carefully in Think Max mode. For Think Max, set context window to at least 384K tokens.

What are the real limitations of V4-Flash vs V4-Pro?

Three main gaps: (1) Terminal-Bench 2.0 (agentic CLI): 56.9% vs 67.9% — an 11-point gap on complex multi-step tool use. (2) SimpleQA-Verified (factual recall): 34.1% vs 57.9% — Flash is significantly weaker at precise real-world fact retrieval. (3) Complex multi-step agents: Flash performs on par with Pro on simple agent tasks but falls further behind on complex workflows with many tool calls. For everyday coding, chat, and RAG, Flash matches Pro closely enough that the cost difference almost always wins.

Is V4-Flash the same as the deepseek-chat alias?

Currently yes — deepseek-chat routes to V4-Flash in non-thinking mode, and deepseek-reasoner routes to V4-Flash in thinking mode. However, both aliases retire permanently on July 24, 2026 at 15:59 UTC. After that date they will return errors with no fallback. Migrate to deepseek-v4-flash (or deepseek-v4-pro) now. Only the model name needs to change — base URL, authentication, and request format are identical.

Can I self-host V4-Flash? What hardware do I need?

Yes — full weights are available at huggingface.co/deepseek-ai/DeepSeek-V4-Flash under the MIT license (commercial use free). The FP8 weight file is 160 GB. Minimum: 1× H100 80GB (tight fit). Comfortable: 4× H200. This compares very favorably to V4-Pro, which needs 8× H100. For local development, run the distilled R1-14B variant via Ollama on a 16 GB RAM machine — not full Flash quality, but a practical stand-in for prototyping at zero hardware cost.

Get Started

Fast. Cheap.
Frontier Quality.

79% SWE-bench. $0.14/1M tokens. 83 tok/s. MIT open-source. The default production AI for teams that ship fast and scale wide.

Try Instant Mode Free · Get API Key · 🤗 Download Weights

model: deepseek-v4-flash · $0.14 input · $0.28 output · MIT License