DeepSeek Pricing Calculator for V4-Flash and V4-Pro API Costs
You opened a DeepSeek pricing calculator because the rate card alone does not answer the question you actually have: what will my monthly bill look like at this volume? The published numbers give you dollars per million tokens, but a real estimate has to split input into cache hits and cache misses, add output separately, and pick the right model tier for the workload. Get any of those wrong and the forecast is off by an order of magnitude.
This guide walks through the V4 rate card as published on April 24, 2026, gives you a copy-paste calculation template for both `deepseek-v4-flash` and `deepseek-v4-pro`, and shows two worked examples — a chatbot with cached system prompts and an agentic coding loop — so you can sanity-check your own numbers before committing to production.
The V4 rate card you are calculating against
DeepSeek’s current generation is DeepSeek V4, released April 24, 2026 and shipped as two open-weight Mixture-of-Experts models that both support the OpenAI ChatCompletions format and the Anthropic API format. Any pricing calculator has to handle both tiers separately because the per-token rates differ by roughly 12x on output.
The two model IDs are DeepSeek V4-Pro (frontier tier) and DeepSeek V4-Flash (cost-efficient tier). Both are Mixture-of-Experts models with a 1M-token context window: Pro is 1.6T total parameters with 49B active, Flash is 284B total with 13B active. Both ship under the standard MIT license.
Per-million-token rates as of April 2026
| Model | Input cache hit ($/M) | Input cache miss ($/M) | Output ($/M) |
|---|---|---|---|
| `deepseek-v4-flash` | $0.028 | $0.14 | $0.28 |
| `deepseek-v4-pro` | $0.145 | $1.74 | $3.48 |
The Flash numbers come from DeepSeek’s announcement: V4-Flash costs $0.14 per million input tokens and $0.28 per million output tokens, undercutting GPT-5.4 Nano, Gemini 3.1 Flash, GPT-5.4 Mini, and Claude Haiku 4.5. V4-Pro runs at $1.74 input and $3.48 output, and both carry a 1M-token context window with up to 384K output tokens. Verify the live rates on the official DeepSeek pricing page before you finalise a budget — V4 is still flagged as a Preview release.
Two important footnotes that calculators routinely get wrong:
- The off-peak discount is gone. DeepSeek ended the night-time API discount on September 5, 2025, and did not reintroduce it with V4. Any calculator that still applies a 50% or 75% night discount is producing fiction.
- Prices are the same whether you are in thinking mode or non-thinking mode: the model ID sets the rate; the reasoning mode just changes how many tokens you burn at that rate. Thinking mode emits a `reasoning_content` field alongside the final `content`, and those reasoning tokens are billed as output. A turn that emits 900 reasoning tokens plus a 300-token answer therefore bills 1,200 output tokens.
How a correct cost calculation is structured
Every accurate estimate enumerates three token buckets and names one model tier. Skipping any of these is a math error that invalidates the whole forecast.
```
Input, cache hit  : X tokens × hit_rate  = $A
Input, cache miss : Y tokens × miss_rate = $B
Output            : Z tokens × out_rate  = $C
                                           ------
Total per call    : $(A + B + C)
```
The cache-hit bucket is what you save when a repeated prefix is detected. Cache-hit pricing is automatic: every request with a repeated prefix against the same account benefits, with no opt-in required. But prefixes must be at least 1,024 tokens long and must match byte-for-byte. That detail matters: a system prompt that you paraphrase between calls will not hit the cache.
The cache-miss bucket is the part of each request that the provider has not seen before — typically the user’s new message plus any per-call context. Do not pretend a cached system prompt covers the user message. Each new user turn is a miss against that prefix until the model sees it again.
For deeper background on prefix matching, see our notes on DeepSeek context caching, and pair this with a DeepSeek token counter if you do not yet have token counts for your prompts.
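To make the template concrete, here is a minimal Python sketch of the three-bucket maths. The `RATES` table and `cost_per_call` name are illustrative, not part of any DeepSeek SDK; the numbers are copied from the rate card above and should be re-verified against the live pricing page.

```python
# Per-million-token rates from the April 2026 rate card above.
# Illustrative constants, not an official SDK: verify against the
# live DeepSeek pricing page before committing to a budget.
RATES = {
    "deepseek-v4-flash": {"hit": 0.028, "miss": 0.14, "out": 0.28},
    "deepseek-v4-pro":   {"hit": 0.145, "miss": 1.74, "out": 3.48},
}

def cost_per_call(model: str, hit_tokens: int, miss_tokens: int,
                  out_tokens: int) -> float:
    """Dollar cost of one request, enumerated as the three buckets."""
    r = RATES[model]
    return (hit_tokens * r["hit"]
            + miss_tokens * r["miss"]
            + out_tokens * r["out"]) / 1_000_000
```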
Worked example A: chatbot on V4-Flash
Assume one million API calls a month, each with a 2,000-token system prompt that benefits from caching, a 200-token user message that does not, and a 300-token response. The math on deepseek-v4-flash:
| Bucket | Tokens | Rate ($/M) | Cost |
|---|---|---|---|
| Cached input | 2,000,000,000 | $0.028 | $56.00 |
| Uncached input | 200,000,000 | $0.14 | $28.00 |
| Output | 300,000,000 | $0.28 | $84.00 |
| Total | — | — | $168.00 |
One million chatbot turns for $168 is not a typo — it is what V4-Flash is for. Notice the cache hit alone saves about $224 versus paying the miss rate on the whole prefix.
Worked example B: agentic coding loop on V4-Pro
Same workload at Pro rates:
| Bucket | Tokens | Rate ($/M) | Cost |
|---|---|---|---|
| Cached input | 2,000,000,000 | $0.145 | $290.00 |
| Uncached input | 200,000,000 | $1.74 | $348.00 |
| Output | 300,000,000 | $3.48 | $1,044.00 |
| Total | — | — | $1,682.00 |
Pro costs roughly 10x what Flash does on identical traffic. That is the trade-off: DeepSeek-V4-Pro significantly advances the knowledge capabilities of open-source models and achieves top-tier performance on coding benchmarks while bridging the gap with leading closed-source models on reasoning and agentic tasks. Reserve Pro for tasks where the benchmark lift justifies the spend; default to Flash for chat and standard workloads.
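Both tables fall out of the `cost_per_call` sketch above; the 2,000/200/300-token call shape is the one from the worked examples:

```python
calls = 1_000_000  # one month of traffic from the examples above

for model in ("deepseek-v4-flash", "deepseek-v4-pro"):
    monthly = calls * cost_per_call(model, hit_tokens=2_000,
                                    miss_tokens=200, out_tokens=300)
    print(f"{model}: ${monthly:,.2f}/month")

# deepseek-v4-flash: $168.00/month
# deepseek-v4-pro: $1,682.00/month, roughly 10x Flash
```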
Plugging the calculator into real code
Chat requests hit `POST /chat/completions`, the OpenAI-compatible endpoint at `https://api.deepseek.com`. The API is stateless: clients must resend the full conversation history with every request, unlike the web chat, which keeps session history server-side. A minimal Python call looks like this:
```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepseek.com",
    api_key="sk-...",
)

resp = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Summarise this PR."},
    ],
    max_tokens=500,
    temperature=0.0,
)

print(resp.usage)  # prompt_tokens, completion_tokens, prompt_cache_hit_tokens
```
The `usage` object is what feeds your calculator at runtime. Multiply each bucket by the right rate from the table above and log the result per request. For `temperature`, DeepSeek’s official guidance is 0.0 for code and math, 1.0 for data analysis, 1.3 for general chat and translation, and 1.5 for creative writing. To enable thinking mode on either V4 model, add `reasoning_effort="high"` with `extra_body={"thinking": {"type": "enabled"}}`; for maximum reasoning effort use `reasoning_effort="max"` with at least 384K context.
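At runtime you can feed `resp.usage` straight into the `cost_per_call` sketch from earlier. One assumption to flag: we read the hit count from the `prompt_cache_hit_tokens` field shown in the snippet above and treat every other prompt token as a cache miss.

```python
def log_request_cost(model: str, usage) -> float:
    # usage is resp.usage from the call above. Assumption: prompt
    # tokens split cleanly into cache hits and cache misses.
    hits = getattr(usage, "prompt_cache_hit_tokens", 0) or 0
    misses = usage.prompt_tokens - hits
    cost = cost_per_call(model, hit_tokens=hits, miss_tokens=misses,
                         out_tokens=usage.completion_tokens)
    print(f"{model}: {hits} hit / {misses} miss / "
          f"{usage.completion_tokens} out -> ${cost:.6f}")
    return cost

log_request_cost("deepseek-v4-flash", resp.usage)
```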
If you are still on legacy IDs, DeepSeek’s public API documentation now lists `deepseek-v4-pro` and `deepseek-v4-flash` as the current model names, while the older `deepseek-chat` and `deepseek-reasoner` aliases are scheduled for deprecation on July 24, 2026 at 15:59 UTC. The retirement is hard: until then both legacy IDs route to `deepseek-v4-flash`; after that date, they fail. Migration is a one-line `model=` swap; the `base_url` does not change. See the DeepSeek API pricing reference for the live rate card.
Common calculator mistakes that produce wrong numbers
- Mixing tiers in one example. Quoting Flash’s $0.14 input alongside Pro’s $3.48 output is a math error. Pick one tier per calculation and label it.
- Skipping the uncached-input line. A 2,000-token cached system prompt does not absorb the user’s new message. The user turn is a miss every time.
- Counting reasoning tokens as input. When thinking mode is on, the model emits `reasoning_content` before the final answer. Those tokens bill as output.
- Forgetting the 1,024-token cache floor. Short prompts never hit the cache. If your system prompt is under that threshold, drop the cache-hit row entirely and bill the whole prefix as a miss (see the sketch after this list).
- Modelling an off-peak discount. It does not exist any more. Charges are flat 24/7.
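The cache-floor guard referenced above is small enough to inline. A sketch built on the rules stated in this guide (the 1,024-token floor and byte-for-byte matching), not an official check:

```python
def billable_input_buckets(prefix_tokens: int, turn_tokens: int):
    # Prefixes under 1,024 tokens never hit the cache, so the whole
    # prompt bills at the miss rate. Above the floor, assume the
    # prefix matched byte-for-byte and only the new turn is a miss.
    if prefix_tokens < 1024:
        return 0, prefix_tokens + turn_tokens
    return prefix_tokens, turn_tokens
```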
JSON mode, output caps and the budget headroom you need
If you are calling the API with `response_format={"type": "json_object"}`, set `max_tokens` high enough that the response cannot truncate mid-object: a truncated response is invalid JSON, and you have paid for unusable output. JSON mode is designed to return valid JSON but does not guarantee it; the model may occasionally return empty content, and the prompt should include the word “json” plus a small example schema. Detailed patterns live in the DeepSeek API JSON mode guide.
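A defensive call shape, as a sketch; the schema, prompt wording and headroom figure are illustrative, and `client` is the one constructed earlier:

```python
import json

resp = client.chat.completions.create(
    model="deepseek-v4-flash",
    response_format={"type": "json_object"},
    messages=[
        {"role": "system",
         "content": 'Reply in json, e.g. {"summary": "...", "risk": "low"}'},
        {"role": "user", "content": "Summarise this PR."},
    ],
    max_tokens=2000,  # generous headroom so the object cannot truncate
)

try:
    data = json.loads(resp.choices[0].message.content or "")
except json.JSONDecodeError:
    data = None  # paid-for but unusable output: retry or fall back
```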
For broader budgeting beyond a single endpoint, the DeepSeek cost estimator covers monthly and annual projections, and the wider set of DeepSeek tools and utilities includes hardware sizing for self-hosted V4 weights if your workload pushes you off the hosted API.
Last verified: 2026-04-25. DeepSeek AI Guide is an independent resource and is not affiliated with DeepSeek or its parent company. Model IDs, pricing and API behaviour change; check the official DeepSeek documentation and pricing page before committing to a production decision.
Frequently asked questions
How accurate is a DeepSeek pricing calculator for production budgets?
An accurate calculator splits input into cache-hit and cache-miss tokens, adds output separately, and uses the right rate column for the chosen model tier. As long as those three buckets are enumerated and the model ID is named, the estimate will be within a few percent of your real bill. The full rate card lives on the DeepSeek API pricing page; verify before you commit to volume.
What is the difference between V4-Flash and V4-Pro pricing?
V4-Flash costs $0.14 per million cache-miss input tokens and $0.28 per million output, while V4-Pro costs $1.74 and $3.48 respectively, roughly a 12x premium on both columns. Both share a 1M-token context window. Default to DeepSeek V4-Flash for chat workloads; reserve DeepSeek V4-Pro for frontier-tier reasoning and agentic coding where the quality lift earns the spend.
Does cache hit pricing apply automatically?
Yes. Cache-hit pricing is automatic — every request with a repeated prefix against the same account benefits, with no opt-in required, but prefixes must be at least 1,024 tokens and match byte-for-byte. Paraphrasing the system prompt between calls will break the match. The DeepSeek context caching guide covers prefix structure in detail.
Why are my legacy deepseek-chat costs identical to V4-Flash?
Because they are V4-Flash now. The legacy `deepseek-chat` and `deepseek-reasoner` IDs currently route to `deepseek-v4-flash` in non-thinking and thinking mode respectively, and will be fully retired on July 24, 2026 at 15:59 UTC. Migrating is a one-line `model=` swap. Walk through the migration in the DeepSeek API getting started tutorial.
Is there still a night-time discount on the DeepSeek API?
No. DeepSeek ended the off-peak discount on September 5, 2025, and did not reintroduce it with the V4 Preview. Any calculator or third-party guide still applying a 50% or 75% night discount is out of date. Rates are flat across all hours and apply equally to thinking and non-thinking modes. For other current cost-cutting tactics see the DeepSeek API best practices reference.
