284B parameter MoE. 13B active per token. 79% SWE-bench. 1M context. $0.14/1M input tokens. The default model for high-throughput production AI — not a compromise, a choice.
V4-Flash isn't a downgrade from V4-Pro — it's purpose-built for the 90% of production workloads where speed and cost matter more than the final 1.6 benchmark points.
Above-average generation speed vs models of the same size class (median: 54 tok/s). Faster throughput means snappier chat UIs, quicker batch jobs, and lower latency for end users in production.
$0.28/1M output vs V4-Pro's $3.48. That's not a small discount — it's a fundamentally different cost structure. A workload costing $1,000/month on Pro runs for $80 on Flash. Same OpenAI-compatible API endpoint.
Flash shares the full 1M token context window with Pro. Both implement CSA+HCA hybrid attention — the architecture that makes long-context inference economically viable. Process entire codebases at $0.14/1M.
79.0% SWE-bench vs Pro's 80.6%. 91.6% LiveCodeBench vs 93.5%. For most developer coding tasks, these are functionally equivalent results. The gap widens on complex agentic workflows — not on everyday coding.
Honest numbers — including where Flash leads, where it matches Pro, and where the gap matters.
83 tokens per second. 1.11s time to first token. Both well ahead of the median for open-weight models of the same class — and faster than many proprietary alternatives.
Generation throughput on DeepSeek's own API (streaming). Well above the 284B-class median of 54.1 tok/s. Fast enough for smooth chat UIs without perceptible buffering.
Very competitive vs the 284B-class median of 2.38s. First-chunk latency is critical for chat UIs — Flash feels responsive because the first token arrives quickly.
Scores 46.5 on the Artificial Analysis Intelligence Index (non-reasoning mode). Ranked #16 of 525 published models — "near-flagship intelligence" at a fraction of flagship cost.
Full 1 million token context — identical to V4-Pro. Process complete codebases, books, or long conversation histories. Same CSA+HCA hybrid attention for efficient long-context inference.
$0.14 input. $0.28 output. Automatic 90% cache discount. No monthly fees.
The best price-performance model in production. Model: deepseek-v4-flash
12.4× more expensive per output token. Use when benchmarks show a quality gap on your specific workload.
Instant Mode at chat.deepseek.com is V4-Flash. Free, no account required, no subscription.
160 GB FP8 weights. MIT licensed. Runs on a single H100 (tight) or comfortably on 4×H200 — practical for enterprise self-hosting.
The honest answer: start with Flash. Upgrade to Pro only if evaluations reveal a quality gap on your specific task.
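One way to act on that advice: run a sample of your real prompts through both models on the same endpoint and compare. A minimal sketch, assuming the standard `openai` Python SDK, DeepSeek's OpenAI-compatible base URL, and a placeholder environment variable and scoring function you would replace with your own harness:

```python
import os
from openai import OpenAI

# Same endpoint and credentials for both models; only the model string differs.
client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],  # placeholder env var name
    base_url="https://api.deepseek.com",
)

MODELS = ["deepseek-v4-flash", "deepseek-v4-pro"]
prompts = [
    "Summarize this bug report in two sentences: ...",
    "Review this function for off-by-one errors: ...",
]  # replace with a sample of real production prompts

def score(answer: str) -> float:
    """Placeholder grader: substitute your task-specific evaluation logic."""
    return float(bool(answer.strip()))

for model in MODELS:
    total = 0.0
    for prompt in prompts:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        total += score(resp.choices[0].message.content or "")
    print(f"{model}: {total / len(prompts):.2f} mean score")
```

If Flash scores within your tolerance of Pro on that sample, the 12.4× output-price gap settles the choice.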
| Spec | ⚡ Flash | 🔵 Pro |
|---|---|---|
| Total params | 284B | 1.6T |
| Active params / token | 13B | 49B |
| Context window | 1M tokens | 1M tokens |
| Max output | = 384K | = 384K |
| Output speed | 83 tok/s | ~40 tok/s |
| TTFT | 1.11s | ~1.9s |
| Input price | $0.14/1M | $1.74/1M |
| Output price | $0.28/1M | $3.48/1M |
| SWE-bench | 79.0% | 80.6% |
| LiveCodeBench | 91.6 | 93.5 |
| Terminal-Bench 2.0 | 56.9% | 67.9% |
| SimpleQA recall | 34.1% | 57.9% |
| Self-host GPU min | 1× H100 | 8× H100 |
| Self-host weight size | 160 GB | 865 GB |
| License | = MIT | = MIT |
| Thinking modes | = 3 modes | = 3 modes |
Already on OpenAI or V3.2? Migrate to V4-Flash by changing one string. Everything else stays identical.
deepseek-chat currently routes to deepseek-v4-flash (non-thinking). Migrate now.
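The change itself, sketched with the `openai` Python SDK (the environment variable name is a placeholder; base URL and request shape are unchanged):

```python
import os
from openai import OpenAI

# Point the stock OpenAI SDK at DeepSeek's OpenAI-compatible endpoint.
client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],  # placeholder env var name
    base_url="https://api.deepseek.com",
)

resp = client.chat.completions.create(
    # model="gpt-4o-mini",       # before: OpenAI
    # model="deepseek-chat",     # before: legacy alias, retires July 24, 2026
    model="deepseek-v4-flash",   # after: the one string that changes
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)
```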
160 GB FP8 weights under MIT license. The practical self-hosting target — V4-Pro requires 8× more GPU memory.
Quick start with Ollama
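A sketch of that quick start using the official `ollama` Python client (`pip install ollama`, with the Ollama daemon running). The model tag below is an assumption; check the Ollama library for the published tag, and note that per the FAQ further down, the distilled R1-14B variant is the realistic laptop-class target:

```python
import ollama

MODEL = "deepseek-v4-flash"  # assumed tag; substitute the tag Ollama actually publishes

ollama.pull(MODEL)  # download weights on first use

# Stream the reply token by token, the way a chat UI would.
for chunk in ollama.chat(
    model=MODEL,
    messages=[{"role": "user", "content": "Explain MoE routing in two sentences."}],
    stream=True,
):
    print(chunk["message"]["content"], end="", flush=True)
print()
```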
V4-Flash is a 284B parameter MoE model (13B active per token) released April 24, 2026. It's the efficiency-optimized model in the V4 family — 12.4× cheaper per output token than V4-Pro, 2× faster (83 tok/s vs ~40), and scores 79.0% on SWE-bench vs Pro's 80.6%. For the vast majority of production workloads — chat, Q&A, code review, summarization, RAG pipelines — the 1.6 benchmark point gap is not detectable by users. Start with Flash; only switch to Pro if evaluations on your specific task reveal a meaningful quality gap.
Yes. V4-Flash generates 83.3 tokens per second with a 1.11s time to first token on DeepSeek's API. Both metrics beat the medians for open-weight models of the same class (54 tok/s output, 2.38s TTFT). At 83 tok/s, a 200-word response (roughly 260 tokens) streams out in about 3 seconds after the first token — smooth enough for streaming chat UIs without perceptible buffering.
V4-Flash costs $0.14/1M input tokens (cache miss), $0.014/1M (cache hit — 90% discount on repeated prefixes), and $0.28/1M output tokens. No monthly fees. No promo discounts apply to Flash (unlike V4-Pro's 75% promo through May 31). New accounts receive 5M free tokens. Context caching is automatic — consistent system prompts cost 90% less after the first call. Always verify current prices at api-docs.deepseek.com/quick_start/pricing.
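To see what the cache discount means for a bill, here's a small estimate built only from the prices quoted above; the traffic volumes and cache-hit fraction are illustrative, not measurements:

```python
# Per-1M-token prices quoted above; verify at api-docs.deepseek.com before relying on them.
INPUT_MISS = 0.14    # $/1M input tokens, cache miss
INPUT_HIT  = 0.014   # $/1M input tokens, cache hit (90% discount on repeated prefixes)
OUTPUT     = 0.28    # $/1M output tokens

def monthly_cost(input_mtok: float, output_mtok: float, cached_fraction: float) -> float:
    """Dollars per month; volumes in millions of tokens, cached_fraction in [0, 1]."""
    cached = input_mtok * cached_fraction
    fresh = input_mtok - cached
    return fresh * INPUT_MISS + cached * INPUT_HIT + output_mtok * OUTPUT

# Hypothetical RAG workload: 500M input / 60M output tokens per month,
# with ~70% of input tokens coming from a repeated system prompt + shared context.
print(f"${monthly_cost(500, 60, 0.70):,.2f}/mo with caching")   # ~$42.70
print(f"${monthly_cost(500, 60, 0.00):,.2f}/mo without")        # ~$86.80
```

The discount applies to repeated prefixes, so keep the system prompt and any shared context at the front of the request and identical between calls.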
Yes. V4-Flash supports all three reasoning modes: Non-Think (instant), Think High (analytical), and Think Max. DeepSeek states that V4-Flash-Max achieves comparable reasoning performance to V4-Pro-Max when given a larger thinking budget. The main caveat: Think Max is very token-intensive — Flash-Max generated 240M output tokens on the Intelligence Index benchmark (average: 42M). Monitor output costs carefully in Think Max mode. For Think Max, set context window to at least 384K tokens.
Three main gaps: (1) Terminal-Bench 2.0 (agentic CLI): 56.9% vs 67.9% — an 11-point gap on complex multi-step tool use. (2) SimpleQA-Verified (factual recall): 34.1% vs 57.9% — Flash is significantly weaker at precise real-world fact retrieval. (3) Complex multi-step agents: Flash performs on par with Pro on simple agent tasks but falls further behind on complex workflows with many tool calls. For everyday coding, chat, and RAG, Flash matches Pro closely enough that the cost difference almost always wins.
Currently yes — deepseek-chat routes to V4-Flash in non-thinking mode, and deepseek-reasoner routes to V4-Flash in thinking mode. However, both aliases retire permanently on July 24, 2026 at 15:59 UTC. After that date they will return errors with no fallback. Migrate to deepseek-v4-flash (or deepseek-v4-pro) now. Only the model name needs to change — base URL, authentication, and request format are identical.
Yes — full weights are available at huggingface.co/deepseek-ai/DeepSeek-V4-Flash under MIT license (commercial use free). The FP8 weight file is 160 GB. Minimum: 1× H100 80GB (tight fit). Comfortable: 4× H200. This compares very favorably to V4-Pro which needs 8× H100. For local development use, run the distilled R1-14B variant via Ollama on a 16 GB RAM machine — functionally similar quality for most tasks at zero hardware cost.
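For self-hosting, fetching the checkpoint is step one; a sketch with `huggingface_hub` using the repo ID named above (the local directory is a placeholder, and the serving stack you load it into is up to you):

```python
# pip install huggingface_hub
from huggingface_hub import snapshot_download

# Downloads the full FP8 checkpoint (~160 GB); make sure the target disk has room.
local_path = snapshot_download(
    repo_id="deepseek-ai/DeepSeek-V4-Flash",   # repo named in the answer above
    local_dir="/models/deepseek-v4-flash",     # placeholder path
)
print("weights at:", local_path)
```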
79% SWE-bench. $0.14/1M tokens. 83 tok/s. MIT open-source. The default production AI for teams that ship fast and scale wide.
model: deepseek-v4-flash · $0.14 input · $0.28 output · MIT License