284B parameter MoE. 13B active per token. 79% SWE-bench. 1M context. $0.14/1M input tokens. The default model for high-throughput production AI — not a compromise, a choice.
V4-Flash isn't a downgrade from V4-Pro — it's purpose-built for the 90% of production workloads where speed and cost matter more than the final 1.6 benchmark points.
Above-average generation speed vs models of the same size class (median: 54 tok/s). Faster throughput means snappier chat UIs, quicker batch jobs, and lower latency for end users in production.
$0.28/1M output vs V4-Pro's $3.48. That's not a small discount — it's a fundamentally different cost structure. A workload costing $1,000/month on Pro runs for $80 on Flash. Same OpenAI-compatible API endpoint.
Flash shares the full 1M token context window with Pro. Both implement CSA+HCA hybrid attention — the architecture that makes long-context inference economically viable. Process entire codebases at $0.14/1M.
79.0% SWE-bench vs Pro's 80.6%. 91.6% LiveCodeBench vs 93.5%. For most developer coding tasks, these are functionally equivalent results. The gap widens on complex agentic workflows — not on everyday coding.
Honest numbers — including where Flash leads, where it matches Pro, and where the gap matters.
83 tokens per second. 1.11s time to first token. Both well ahead of the median for open-weight models of the same class — and faster than many proprietary alternatives.
Generation throughput on DeepSeek's own API (streaming). Well above the 284B-class median of 54.1 tok/s. Fast enough for smooth chat UIs without perceptible buffering.
Very competitive vs the 284B-class median of 2.38s. First-chunk latency is critical for chat UIs — Flash feels responsive because the first token arrives quickly.
Scores 46.5 on the Artificial Analysis Intelligence Index (non-reasoning mode). Ranked #16 of 525 published models — "near-flagship intelligence" at a fraction of flagship cost.
Full 1 million token context — identical to V4-Pro. Process complete codebases, books, or long conversation histories. Same CSA+HCA hybrid attention for efficient long-context inference.
$0.14 input. $0.28 output. Automatic 90% cache discount. No monthly fees.
The best price-performance model in production. Model: deepseek-v4-flash
12.4× more expensive per output token. Use when benchmarks show a quality gap on your specific workload.
Instant Mode at chat.deepseek.com is V4-Flash. Free, no account required, no subscription.
160 GB FP8 weights. MIT licensed. Runs on a single H100 (tight) or comfortably on 4×H200 — practical for enterprise self-hosting.
The honest answer: start with Flash. Upgrade to Pro only if evaluations reveal a quality gap on your specific task.
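One way to act on that advice: run a sample of your real prompts through both models on the same endpoint and compare. A minimal sketch, assuming the standard `openai` Python SDK, DeepSeek's OpenAI-compatible base URL, and a placeholder environment variable and scoring function you would replace with your own harness:

```python
import os
from openai import OpenAI

# Same endpoint and credentials for both models; only the model string differs.
client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],  # placeholder env var name
    base_url="https://api.deepseek.com",
)

MODELS = ["deepseek-v4-flash", "deepseek-v4-pro"]
prompts = [
    "Summarize this bug report in two sentences: ...",
    "Review this function for off-by-one errors: ...",
]  # replace with a sample of real production prompts

def score(answer: str) -> float:
    """Placeholder grader: substitute your task-specific evaluation logic."""
    return float(bool(answer.strip()))

for model in MODELS:
    total = 0.0
    for prompt in prompts:
        resp = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
        )
        total += score(resp.choices[0].message.content or "")
    print(f"{model}: {total / len(prompts):.2f} mean score")
```

If Flash scores within your tolerance of Pro on that sample, the 12.4× output-price gap settles the choice.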
| Spec | ⚡ Flash | 🔵 Pro |
|---|---|---|
| Total params | 284B | 1.6T |
| Active params / token | 13B | 49B |
| Context window | 1M tokens | 1M tokens |
| Max output | = 384K | = 384K |
| Output speed | 83 tok/s | ~40 tok/s |
| TTFT | 1.11s | ~1.9s |
| Input price | $0.14/1M | $1.74/1M |
| Output price | $0.28/1M | $3.48/1M |
| SWE-bench | 79.0% | 80.6% |
| LiveCodeBench | 91.6 | 93.5 |
| Terminal-Bench 2.0 | 56.9% | 67.9% |
| SimpleQA recall | 34.1% | 57.9% |
| Self-host GPU min | 1× H100 | 8× H100 |
| Self-host weight size | 160 GB | 865 GB |
| License | = MIT | = MIT |
| Thinking modes | = 3 modes | = 3 modes |
Already on OpenAI or V3.2? Migrate to V4-Flash by changing one string. Everything else stays identical.
deepseek-chat currently routes to deepseek-v4-flash (non-thinking). Migrate now.
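The change itself, sketched with the `openai` Python SDK (the environment variable name is a placeholder; base URL and request shape are unchanged):

```python
import os
from openai import OpenAI

# Point the stock OpenAI SDK at DeepSeek's OpenAI-compatible endpoint.
client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],  # placeholder env var name
    base_url="https://api.deepseek.com",
)

resp = client.chat.completions.create(
    # model="gpt-4o-mini",       # before: OpenAI
    # model="deepseek-chat",     # before: legacy alias, retires July 24, 2026
    model="deepseek-v4-flash",   # after: the one string that changes
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(resp.choices[0].message.content)
```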
160 GB FP8 weights under MIT license. The practical self-hosting target — V4-Pro requires 8× more GPU memory.
Quick start with Ollama
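A sketch of that quick start using the official `ollama` Python client (`pip install ollama`, with the Ollama daemon running). The model tag below is an assumption; check the Ollama library for the published tag, and note that per the FAQ further down, the distilled R1-14B variant is the realistic laptop-class target:

```python
import ollama

MODEL = "deepseek-v4-flash"  # assumed tag; substitute the tag Ollama actually publishes

ollama.pull(MODEL)  # download weights on first use

# Stream the reply token by token, the way a chat UI would.
for chunk in ollama.chat(
    model=MODEL,
    messages=[{"role": "user", "content": "Explain MoE routing in two sentences."}],
    stream=True,
):
    print(chunk["message"]["content"], end="", flush=True)
print()
```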
V4-Flash is a 284B parameter MoE model (13B active per token) released April 24, 2026. It's the efficiency-optimized model in the V4 family — 12.4× cheaper per output token than V4-Pro, 2× faster (83 tok/s vs ~40), and scores 79.0% on SWE-bench vs Pro's 80.6%. For the vast majority of production workloads — chat, Q&A, code review, summarization, RAG pipelines — the 1.6 benchmark point gap is not detectable by users. Start with Flash; only switch to Pro if evaluations on your specific task reveal a meaningful quality gap.
Yes. V4-Flash generates 83.3 tokens per second with a 1.11s time to first token on DeepSeek's API. Both metrics beat the medians for open-weight models of the same class (54 tok/s output, 2.38s TTFT). At 83 tok/s, a 200-word response (roughly 260 tokens) streams out in about 3 seconds after the first token — smooth enough for streaming chat UIs without perceptible buffering.
V4-Flash costs $0.14/1M input tokens (cache miss), $0.014/1M (cache hit — 90% discount on repeated prefixes), and $0.28/1M output tokens. No monthly fees. No promo discounts apply to Flash (unlike V4-Pro's 75% promo through May 31). New accounts receive 5M free tokens. Context caching is automatic — consistent system prompts cost 90% less after the first call. Always verify current prices at api-docs.deepseek.com/quick_start/pricing.
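To see what the cache discount means for a bill, here's a small estimate built only from the prices quoted above; the traffic volumes and cache-hit fraction are illustrative, not measurements:

```python
# Per-1M-token prices quoted above; verify at api-docs.deepseek.com before relying on them.
INPUT_MISS = 0.14    # $/1M input tokens, cache miss
INPUT_HIT  = 0.014   # $/1M input tokens, cache hit (90% discount on repeated prefixes)
OUTPUT     = 0.28    # $/1M output tokens

def monthly_cost(input_mtok: float, output_mtok: float, cached_fraction: float) -> float:
    """Dollars per month; volumes in millions of tokens, cached_fraction in [0, 1]."""
    cached = input_mtok * cached_fraction
    fresh = input_mtok - cached
    return fresh * INPUT_MISS + cached * INPUT_HIT + output_mtok * OUTPUT

# Hypothetical RAG workload: 500M input / 60M output tokens per month,
# with ~70% of input tokens coming from a repeated system prompt + shared context.
print(f"${monthly_cost(500, 60, 0.70):,.2f}/mo with caching")   # ~$42.70
print(f"${monthly_cost(500, 60, 0.00):,.2f}/mo without")        # ~$86.80
```

The discount applies to repeated prefixes, so keep the system prompt and any shared context at the front of the request and identical between calls.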
Yes. V4-Flash supports all three reasoning modes: Non-Think (instant), Think High (analytical), and Think Max. DeepSeek states that V4-Flash-Max achieves comparable reasoning performance to V4-Pro-Max when given a larger thinking budget. The main caveat: Think Max is very token-intensive — Flash-Max generated 240M output tokens on the Intelligence Index benchmark (average: 42M). Monitor output costs carefully in Think Max mode. For Think Max, set context window to at least 384K tokens.
Three main gaps: (1) Terminal-Bench 2.0 (agentic CLI): 56.9% vs 67.9% — an 11-point gap on complex multi-step tool use. (2) SimpleQA-Verified (factual recall): 34.1% vs 57.9% — Flash is significantly weaker at precise real-world fact retrieval. (3) Complex multi-step agents: Flash performs on par with Pro on simple agent tasks but falls further behind on complex workflows with many tool calls. For everyday coding, chat, and RAG, Flash matches Pro closely enough that the cost difference almost always wins.
Currently yes — deepseek-chat routes to V4-Flash in non-thinking mode, and deepseek-reasoner routes to V4-Flash in thinking mode. However, both aliases retire permanently on July 24, 2026 at 15:59 UTC. After that date they will return errors with no fallback. Migrate to deepseek-v4-flash (or deepseek-v4-pro) now. Only the model name needs to change — base URL, authentication, and request format are identical.
Yes — full weights are available at huggingface.co/deepseek-ai/DeepSeek-V4-Flash under MIT license (commercial use free). The FP8 weight file is 160 GB. Minimum: 1× H100 80GB (tight fit). Comfortable: 4× H200. This compares very favorably to V4-Pro which needs 8× H100. For local development use, run the distilled R1-14B variant via Ollama on a 16 GB RAM machine — functionally similar quality for most tasks at zero hardware cost.
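For self-hosting, fetching the checkpoint is step one; a sketch with `huggingface_hub` using the repo ID named above (the local directory is a placeholder, and the serving stack you load it into is up to you):

```python
# pip install huggingface_hub
from huggingface_hub import snapshot_download

# Downloads the full FP8 checkpoint (~160 GB); make sure the target disk has room.
local_path = snapshot_download(
    repo_id="deepseek-ai/DeepSeek-V4-Flash",   # repo named in the answer above
    local_dir="/models/deepseek-v4-flash",     # placeholder path
)
print("weights at:", local_path)
```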
79% SWE-bench. $0.14/1M tokens. 83 tok/s. MIT open-source. The default production AI for teams that ship fast and scale wide.
model: deepseek-v4-flash · $0.14 input · $0.28 output · MIT License