DeepSeek API Pricing in 2026: V4-Pro vs V4-Flash Costs
If you landed here to find out what a production workload on DeepSeek will actually cost, the short answer is: much less than the closed-source frontier incumbents, but only if you understand the three token buckets DeepSeek bills. The current DeepSeek API pricing is split across two V4 tiers — V4-Flash for high-volume work, V4-Pro for frontier-tier coding and agents — with separate rates for cache-hit input, cache-miss input and output. This guide walks through the published per-million-token rates, runs worked examples for both tiers, shows how context caching changes the math, and flags where the legacy `deepseek-chat` and `deepseek-reasoner` IDs fit during the migration window that closes on July 24, 2026.
What DeepSeek API pricing covers in 2026
DeepSeek bills API usage on a per-token basis, separately for input and output, with a discounted rate when the provider detects a repeated prefix in your messages array. The current generation is DeepSeek-V4 Preview, shipping as two open-weight Mixture-of-Experts models: DeepSeek-V4-Pro with 1.6T total / 49B active parameters, and DeepSeek-V4-Flash with 284B total / 13B active parameters. Both tiers share the same API surface, the same 1M-token context window, and the same three reasoning-effort modes; they differ on price and, to a lesser degree, on benchmark performance.
Two operational points matter for cost modelling before the numbers:
- The API is stateless. Every request must carry the full conversation history, which means long chats grow the input bill linearly per turn unless context caching kicks in; a sketch of that growth follows this list.
- Thinking mode is a request parameter, not a separate model. The rate you pay depends on which model ID you pick (`deepseek-v4-pro` or `deepseek-v4-flash`); enabling thinking just increases the output-token count you consume at that rate.
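To see why statelessness matters for the bill, here is a toy sketch of input-token growth over a ten-turn chat. The token counts are illustrative, not real tokenizer output:

```python
# Every request resends the full history, so per-turn billed input grows
# linearly and cumulative billed input grows quadratically over the chat.
system_tokens = 2_000      # fixed system prompt
per_turn_tokens = 500      # user message + model reply appended per turn
new_message_tokens = 200   # the fresh user message on each call

total_input = 0
for turn in range(1, 11):
    prompt = system_tokens + per_turn_tokens * (turn - 1) + new_message_tokens
    total_input += prompt
    print(f"turn {turn:2d}: prompt = {prompt:,} tokens, cumulative = {total_input:,}")
```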
The V4 rate card
The official pricing page lists these per-million-token rates in USD, as verified on April 24, 2026. DeepSeek charges $0.14 per million cache-miss input tokens and $0.28 per million output tokens for Flash, and $1.74 per million cache-miss input and $3.48 per million output for Pro. Cache-hit input is discounted 80% from cache-miss input on Flash and roughly 92% on Pro.
DeepSeek V4-Flash rates
| Token bucket | Rate (USD per 1M tokens) |
|---|---|
| Input, cache hit | $0.028 |
| Input, cache miss | $0.14 |
| Output | $0.28 |
DeepSeek V4-Pro rates
| Token bucket | Rate (USD per 1M tokens) |
|---|---|
| Input, cache hit | $0.145 |
| Input, cache miss | $1.74 |
| Output | $3.48 |
Verdict on the spread: V4-Pro is roughly 12× the input cost and 12× the output cost of V4-Flash. That premium buys stronger agentic and coding performance, but for chat, classification, summarisation, and most RAG workloads, V4-Flash is the sensible default. For the full DeepSeek V4-Flash and DeepSeek V4-Pro spec sheets, see each model’s page.
Quickstart: what a minimal billed request looks like
Chat requests hit `POST /chat/completions`, the OpenAI-compatible endpoint at `https://api.deepseek.com`. The OpenAI Python SDK works unchanged after swapping `base_url` and `api_key`; the following snippet sends a single non-thinking call to V4-Flash and shows what a minimal billed request looks like in practice.
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepseek.com",  # OpenAI-compatible endpoint
    api_key="sk-...",
)

resp = client.chat.completions.create(
    model="deepseek-v4-flash",  # the cheap default tier
    messages=[
        {"role": "system", "content": "You answer concisely."},
        {"role": "user", "content": "Summarise the V4 pricing tiers."},
    ],
    max_tokens=400,   # caps the expensive output bucket
    temperature=1.3,
)
print(resp.choices[0].message.content)
```
To switch the same request to thinking mode on V4-Pro, change the model, add `reasoning_effort="high"`, and pass `extra_body={"thinking": {"type": "enabled"}}`, as in the sketch below. The response then includes `reasoning_content` alongside the final `content`, and your output-token count will climb accordingly. For a deeper walkthrough, see our DeepSeek API getting started tutorial.
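A minimal sketch of that thinking-mode call, reusing the `client` from the quickstart; the parameter names follow the description above, but verify the exact response fields against the live docs before relying on them:

```python
resp = client.chat.completions.create(
    model="deepseek-v4-pro",                       # frontier tier, ~12x Flash rates
    reasoning_effort="high",                       # more thinking = more output tokens
    extra_body={"thinking": {"type": "enabled"}},  # turns thinking mode on
    messages=[
        {"role": "user", "content": "Summarise the V4 pricing tiers."},
    ],
    max_tokens=4000,  # budget must cover reasoning_content plus the final answer
)
print(resp.choices[0].message.reasoning_content)  # chain of thought, billed as output
print(resp.choices[0].message.content)            # final answer
```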
Worked example A: 1 million calls on V4-Flash
Imagine a classifier or lightweight chatbot making 1,000,000 calls per month, each with a 2,000-token cached system prompt, a 200-token user message (uncached by definition — it’s new on every call), and a 300-token model response. On V4-Flash:
- Cached input: 2,000 × 1,000,000 = 2,000,000,000 tokens × $0.028/M = $56.00
- Uncached input: 200 × 1,000,000 = 200,000,000 tokens × $0.14/M = $28.00
- Output: 300 × 1,000,000 = 300,000,000 tokens × $0.28/M = $84.00
- Monthly total: $168.00
Two points worth stressing. First, the user message on each call is a cache miss against the system prefix — you cannot treat the whole input bucket as cached. Second, cache-hit pricing is automatic: every request with a repeated prefix against the same account benefits with no opt-in, though the prefix must be at least 1,024 tokens long and must match byte-for-byte. Shorter or variable system prompts won't cache. For a reusable model, try the DeepSeek cost estimator.
Worked example B: the same workload on V4-Pro
Same 1M calls, same token shape, priced on V4-Pro:
- Cached input: 2,000,000,000 tokens × $0.145/M = $290.00
- Uncached input: 200,000,000 tokens × $1.74/M = $348.00
- Output: 300,000,000 tokens × $3.48/M = $1,044.00
- Monthly total: $1,682.00
Roughly 10× the Flash bill for the same traffic. Reserve Pro for work where the benchmark lift pays for itself — agentic coding, long-horizon tool use, and the complex SWE-Bench-style tasks where V4-Pro posts 80.6 on SWE-Verified, within a fraction of a point of Claude at 80.8 and matching Gemini at 80.6. For everyday chat, the gap does not usually justify the spend.
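Both worked examples reduce to the same arithmetic. Here is a minimal sketch, with the rates hard-coded from the tables above:

```python
# Monthly-cost sketch for the worked examples. Rates in USD per 1M tokens,
# taken from the V4 rate card above.
RATES = {
    "v4-flash": {"cache_hit": 0.028, "cache_miss": 0.14, "output": 0.28},
    "v4-pro":   {"cache_hit": 0.145, "cache_miss": 1.74, "output": 3.48},
}

def monthly_cost(tier, calls, cached_in, uncached_in, out_tokens):
    """USD cost for `calls` requests with the given per-call token shape."""
    r = RATES[tier]
    millions = calls / 1_000_000  # per-call token counts -> millions of tokens
    return (cached_in * millions * r["cache_hit"]
            + uncached_in * millions * r["cache_miss"]
            + out_tokens * millions * r["output"])

print(monthly_cost("v4-flash", 1_000_000, 2000, 200, 300))  # 168.0
print(monthly_cost("v4-pro",   1_000_000, 2000, 200, 300))  # 1682.0
```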
Where context caching actually helps
The cache-hit input discount only applies to the portion of your prompt the provider recognises as a repeated prefix. In practice, that means:
- A long, fixed system prompt above 1,024 tokens — good candidate.
- A few-shot block of worked examples reused across requests — good candidate.
- A retrieval-augmented prompt where the retrieved chunks change every call — not cached; you pay full miss rate on the retrieved portion.
- Any dynamic content (timestamps, user IDs) threaded through the prefix — breaks byte-for-byte matching and kills the hit.
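A sketch of a cache-friendly layout under those rules, with stable content first and dynamic content after the prefix (names here are illustrative):

```python
# Stable prefix: byte-for-byte identical on every call and >= 1,024 tokens,
# so it is eligible for cache-hit pricing. Names are illustrative.
STABLE_SYSTEM_PROMPT = "...fixed instructions, never edited per request..."
FEW_SHOT_BLOCK = "...worked examples, also fixed..."

def build_messages(retrieved_chunks: str, question: str) -> list[dict]:
    return [
        # Cacheable prefix: no timestamps, user IDs or retrieved text up here
        {"role": "system", "content": f"{STABLE_SYSTEM_PROMPT}\n\n{FEW_SHOT_BLOCK}"},
        # Dynamic tail: billed at the cache-miss rate on every call
        {"role": "user", "content": f"Context:\n{retrieved_chunks}\n\nQuestion: {question}"},
    ]
```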
If caching is core to your cost model, read the dedicated DeepSeek context caching write-up before optimising your prompt structure.
Off-peak discounts and free credits — what’s real in 2026
A few persistent myths are worth shutting down:
- There is no active off-peak API discount. DeepSeek ended the 50%/75% night-time discount on September 5, 2025 and has not reintroduced it with the V4 launch.
- “Free credits for new accounts” is not a documented rule. DeepSeek’s billing system includes a “granted balance” concept — a small promotional credit that can expire. Check the billing console for current offers; do not plan around a specific number.
- Format does not change pricing. The Anthropic-format endpoint at `api.deepseek.com/anthropic` bills at the same rates as the OpenAI-format endpoint.
How V4 compares with the legacy IDs
If your integration still points at `deepseek-chat` or `deepseek-reasoner`, you are already being billed at V4-Flash rates. Both legacy names are deprecated; for compatibility they currently map to the non-thinking and thinking modes of `deepseek-v4-flash` respectively. The hard cutoff is 2026-07-24 15:59 UTC, after which those IDs stop working. Migration is a one-line `model=` swap, shown below; `base_url` does not change. If you are still running V3.x patterns, our DeepSeek OpenAI SDK compatibility notes cover the edge cases.
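The swap itself, in place; everything else in the request stays the same:

```python
# Before the cutoff (2026-07-24 15:59 UTC) this still works:
#   model="deepseek-chat"        # legacy ID, already billed at V4-Flash rates
# After migrating, the only change is the model ID:
resp = client.chat.completions.create(
    model="deepseek-v4-flash",   # same non-thinking behaviour, same rates
    messages=[{"role": "user", "content": "ping"}],
)
# deepseek-reasoner maps to deepseek-v4-flash with thinking enabled instead.
```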
V3.2 → V4-Flash price comparison (historical context)
| Tier | Input (miss) $/M | Output $/M | Notes |
|---|---|---|---|
| V3.2 (retired 2026-04-24) | $0.28 | $0.42 | Previous generation rate card. |
| V4-Flash (current) | $0.14 | $0.28 | Undercuts V3.2 by 50% on input and 33% on output. |
| V4-Pro (current) | $1.74 | $3.48 | Frontier tier, priced well above V3.2. |
In other words, the default-tier price has dropped with V4, while a new premium tier has been introduced for work that genuinely needs it.
How DeepSeek compares with closed-source frontier models
Pricing-only comparisons are easy to overstate — benchmark lift, latency, regional availability and data-handling all factor in — but the headline gap is real. Simon Willison’s April 2026 comparison concluded that DeepSeek-V4-Flash is the cheapest of the small models and DeepSeek-V4-Pro is the cheapest of the larger frontier models. For a task-by-task breakdown against the major competitors, see DeepSeek vs ChatGPT and DeepSeek vs Claude. Always verify competitor rates on their own pricing pages before a procurement decision — closed-source pricing shifts frequently.
Simon Willison’s V4 write-up (April 24, 2026) and DeepSeek’s official pricing page are the two primary sources to re-check before you sign off on a budget.
Four rules that keep the bill predictable
- Pick the right tier first. Default to V4-Flash. Move individual endpoints to V4-Pro only when a benchmark or evaluation justifies the cost.
- Design for cache hits. Put stable instructions and few-shot examples at the top of
messages; avoid injecting timestamps, user IDs or retrieved content above the 1,024-token mark. - Cap
max_tokens. Output is the expensive bucket on both tiers. Setting a realistic cap protects against runaway responses, especially in thinking mode wherereasoning_contentcounts as output. - Monitor thinking-mode usage.
reasoning_effort="max"can generate tens of thousands of tokens on a hard prompt. Use it deliberately on problems that need it.
Rate limits and headroom
Rate limits are not priced directly, but they shape how much of the cheap rate you can actually consume in a day. DeepSeek publishes per-account limits that vary by tier and that tighten under traffic spikes. Before committing to a volume assumption in your cost model, skim the DeepSeek API rate limits reference and plan retry logic around the documented error codes.
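A minimal retry sketch, assuming the endpoint signals throttling through the OpenAI SDK's standard `RateLimitError`; check the documented error codes for your account tier before settling on delays:

```python
import time

from openai import OpenAI, RateLimitError

client = OpenAI(base_url="https://api.deepseek.com", api_key="sk-...")

def create_with_backoff(max_retries=5, **kwargs):
    """Retry a chat completion on rate-limit errors with exponential backoff."""
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(**kwargs)
        except RateLimitError:
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, 8s, 16s
    raise RuntimeError("still rate-limited after retries")
```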
Last verified: 2026-04-24. DeepSeek AI Guide is an independent resource and is not affiliated with DeepSeek or its parent company. Model IDs, pricing and API behaviour change; check the official DeepSeek documentation and pricing page before committing to a production decision.
Frequently asked questions
How much does the DeepSeek API cost per 1M tokens?
V4-Flash costs $0.028 for cache-hit input, $0.14 for cache-miss input, and $0.28 for output per 1M tokens. V4-Pro costs $0.145, $1.74 and $3.48 respectively. Both tiers share the same 1M-token context window and the same API surface; only the per-token rate and the underlying model change. See our DeepSeek API docs and guides for the live rate card.
What is the difference between V4-Flash and V4-Pro pricing?
V4-Pro is roughly 12× the price of V4-Flash across all three token buckets. The premium buys stronger agentic coding, long-horizon tool use, and frontier-tier reasoning; V4-Flash handles chat, classification, extraction and most RAG workloads at a fraction of the cost. Compare capabilities on the DeepSeek V4-Pro and DeepSeek V4-Flash pages.
Does DeepSeek still offer an off-peak discount?
No. DeepSeek ended the off-peak 50%/75% API discount on September 5, 2025 and did not reintroduce it with the V4 launch. The current rate card is flat across the day. Context caching remains the main way to reduce input costs, with cache-hit tokens priced 80% below cache-miss on Flash and roughly 92% below on Pro — see our notes on DeepSeek context caching.
Does thinking mode cost more on the DeepSeek API?
Thinking mode does not change the per-token rate — the rate is set by the model ID. It does, however, generate more output tokens because the response includes `reasoning_content` alongside the final `content`. In practice, enabling `reasoning_effort="high"` or `"max"` can multiply output volume several times over. See our DeepSeek API best practices for guidance.
Can I still use deepseek-chat and deepseek-reasoner?
Yes, until 2026-07-24 at 15:59 UTC. Both legacy IDs currently route to V4-Flash — `deepseek-chat` in non-thinking mode and `deepseek-reasoner` in thinking mode — and are billed at V4-Flash rates. After the cutoff, requests using those IDs will fail; migration is a one-line `model=` change. For help getting a key, see get a DeepSeek API key.
