The DeepSeek Advanced Guide for V4 Pro, V4 Flash and Thinking Mode

DeepSeek advanced guide for V4 Pro and Flash: thinking modes, 1M context, real cost math, and the migration steps for moving production workloads off V3.2.

Guides · April 25, 2026 · By DS Guide Editorial

You already know how to send a chat request to DeepSeek. The question this DeepSeek advanced guide answers is the next one: how do you actually run V4 in production without burning budget, hitting truncation on long contexts, or shipping JSON that occasionally arrives empty? I have been running DeepSeek V4 Pro and V4 Flash since the Preview dropped on April 24, 2026, after migrating workloads off V3.2 the same week. The patterns below come from billable traffic, not benchmark theatre: model selection, thinking-mode plumbing, cache-aware cost math, and the migration window before legacy IDs retire on July 24, 2026. Expect concrete numbers, working code, and a few opinions about where V4 still falls short.

What changed with DeepSeek V4 — and why it matters for advanced users

DeepSeek V4 is not one model. It is a family of two open-weight Mixture-of-Experts tiers, both shipped on the same day with a shared feature set. V4-Pro is 1.6T total parameters with 49B active per token, and V4-Flash is 284B total with 13B active. Both support a context length of one million tokens, and both are open-weight under the MIT license.

Two architectural decisions matter for production workloads. First, the new Hybrid Attention combines Compressed Sparse Attention (CSA) and Heavily Compressed Attention (HCA) to dramatically improve long-context efficiency. Second, the efficiency lift is large enough to make 1M context affordable: in the 1M-token context setting, DeepSeek-V4-Pro requires only 27% of single-token inference FLOPs and 10% of KV cache compared with DeepSeek-V3.2. V4-Flash drops further still.

If you are coming from V3.2 or earlier, see the DeepSeek V3.2 reference for the previous-generation baseline; the rest of this article assumes V4 as the default.

Model selection: when to pick V4 Pro vs V4 Flash

V4 Pro is roughly 12× the per-output-token cost of Flash. That ratio drives almost every selection decision. The honest answer for most teams is: start on Flash, escalate specific routes to Pro when a benchmark gap maps to a real revenue or quality lever.

Dimension                        | V4-Flash                       | V4-Pro
Total / active params            | 284B / 13B                     | 1.6T / 49B
Default context                  | 1,000,000 tokens               | 1,000,000 tokens
Max output tokens                | 384,000                        | 384,000
SWE-Bench Verified               | 79.0%                          | 80.6%
Terminal-Bench 2.0               | 56.9%                          | 67.9%
Output price per 1M tokens (USD) | $0.28                          | $3.48
Best fit                         | Chat, RAG, high-volume agents  | Frontier coding, complex multi-step agents

The Terminal-Bench gap is the single most useful data point for agent builders. Terminal-Bench 2.0 involves real autonomous terminal execution with a 3-hour timeout, and that gap matters more for agentic workflows than any single-turn coding benchmark. If your agent shells out to a sandbox and runs for tens of minutes, Pro pays for itself. If it answers a single tool call and returns, Flash is almost always the right choice.
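
As a rough illustration of that routing rule, here is a hypothetical helper; nothing in the DeepSeek SDK looks like this, and the WorkloadProfile fields and thresholds are placeholders you would tune to your own traffic:

from dataclasses import dataclass

@dataclass
class WorkloadProfile:
    is_agentic: bool            # does the route run multi-step tool loops?
    expected_runtime_min: int   # how long a sandbox/terminal session typically runs
    revenue_critical: bool      # does answer quality map to a real revenue lever?

def route_model(w: WorkloadProfile) -> str:
    # Default to Flash; escalate to Pro only for long-running or high-stakes agent routes.
    if w.is_agentic and (w.expected_runtime_min >= 10 or w.revenue_critical):
        return "deepseek-v4-pro"
    return "deepseek-v4-flash"

print(route_model(WorkloadProfile(True, 30, True)))   # -> deepseek-v4-pro
print(route_model(WorkloadProfile(False, 1, False)))  # -> deepseek-v4-flash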

For deeper head-to-heads, the DeepSeek V4-Pro and DeepSeek V4-Flash reference pages break the benchmarks down by task type.

Thinking mode: a parameter, not a model

The biggest API change from V3.x is that thinking mode is no longer a separate model ID. DeepSeek-V4-Pro and DeepSeek-V4-Flash both support three reasoning effort modes: non-thinking (the default), thinking with reasoning_effort="high", and a maximum-effort tier with reasoning_effort="max". The mode is set per request.

Chat requests hit POST /chat/completions, the OpenAI-compatible endpoint at https://api.deepseek.com. Here is a minimal Python example using the OpenAI SDK to call V4-Pro in thinking mode:

from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepseek.com",
    api_key="YOUR_KEY",
)

resp = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[
        {"role": "system", "content": "You are a senior SRE. Be terse."},
        {"role": "user", "content": "Plan a zero-downtime Postgres 14->16 upgrade."},
    ],
    reasoning_effort="high",                       # per-request effort: "high" or "max"
    extra_body={"thinking": {"type": "enabled"}},  # ask for the reasoning trace in the response
    max_tokens=8000,
)

print(resp.choices[0].message.reasoning_content)  # the thinking trace
print(resp.choices[0].message.content)            # the final answer

When thinking is enabled the response returns reasoning_content alongside the final content. Two operational notes: max-effort thinking needs headroom — for the Think Max reasoning mode, the recommended context window is at least 384K tokens. And DeepSeek also exposes an Anthropic-compatible surface against the same base URL, so the Anthropic SDK works against DeepSeek by swapping base_url and api_key.
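
If you want to try the Anthropic-compatible surface, here is a minimal sketch using the Anthropic SDK; the assumption that the V4 model IDs and the base URL carry over unchanged is mine, so confirm the exact path in the official docs before depending on it.

from anthropic import Anthropic

# Assumption: the Anthropic-compatible endpoint accepts the same key and V4 model IDs.
client = Anthropic(
    base_url="https://api.deepseek.com",
    api_key="YOUR_KEY",
)

msg = client.messages.create(
    model="deepseek-v4-flash",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarise the V4 thinking modes in one line."}],
)
print(msg.content[0].text)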

For sampling defaults, DeepSeek’s official guidance is worth memorising: temperature 0.0 for code and maths, 1.0 for data analysis, 1.3 for general conversation and translation, 1.5 for creative writing. The full reference lives on the DeepSeek API best practices page.
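
Encoded as a lookup table for a request builder (the task labels below are mine; the temperatures are the published defaults quoted above):

# DeepSeek's recommended sampling temperatures by task type.
TEMPERATURE_BY_TASK = {
    "code": 0.0,
    "math": 0.0,
    "data_analysis": 1.0,
    "conversation": 1.3,
    "translation": 1.3,
    "creative_writing": 1.5,
}

def pick_temperature(task: str, default: float = 1.0) -> float:
    return TEMPERATURE_BY_TASK.get(task, default)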

Stateless API: the rule that breaks naive ports

The /chat/completions endpoint is stateless. The server does not remember prior turns. Every multi-turn request must resend the full messages array — system, user, assistant, user, and so on — or the model will respond as if the conversation just started.

This contrasts with the web chat and mobile app, which maintain session history server-side for the user. If you are porting a workflow that “just worked” in DeepSeek chat, the most common bug at the API boundary is forgetting to persist and resend conversation state. Build a ring buffer of recent turns, or rely on context caching (next section) to make the resend cheap.
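
A minimal sketch of that ring buffer, with the system prompt pinned so the cached prefix stays stable across calls; the ConversationBuffer class is illustrative, not part of any SDK.

from collections import deque

class ConversationBuffer:
    """Pin the system prompt and resend only the most recent turns on every call."""

    def __init__(self, system_prompt: str, max_turns: int = 20):
        self.system = {"role": "system", "content": system_prompt}
        self.turns = deque(maxlen=max_turns * 2)  # each turn = one user + one assistant message

    def add(self, role: str, content: str) -> None:
        self.turns.append({"role": role, "content": content})

    def messages(self) -> list:
        # The API is stateless: this full list goes out with every request.
        return [self.system, *self.turns]

buf = ConversationBuffer("You are a senior SRE. Be terse.")
buf.add("user", "Plan a zero-downtime Postgres 14->16 upgrade.")
# resp = client.chat.completions.create(model="deepseek-v4-flash", messages=buf.messages())
# buf.add("assistant", resp.choices[0].message.content)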

Cost math: how to price a workload without lying to yourself

Every cost example below names the tier explicitly. Mixing rates across V4-Flash and V4-Pro is the single most common mistake I see in slide decks.

V4-Flash worked example

Workload: 1,000,000 calls per month, 2,000-token cached system prompt, 200-token user message (uncached), 300-token response.

  • Cached input: 2,000,000,000 tokens × $0.028/M = $56.00
  • Uncached input: 200,000,000 tokens × $0.14/M = $28.00
  • Output: 300,000,000 tokens × $0.28/M = $84.00
  • Total: $168.00

V4-Pro worked example

Same workload, Pro rates:

  • Cached input: 2,000,000,000 × $0.145/M = $290.00
  • Uncached input: 200,000,000 × $1.74/M = $348.00
  • Output: 300,000,000 × $3.48/M = $1,044.00
  • Total: $1,682.00

Two things people forget. The user message on each call is an uncached miss against the cached prefix — you cannot pretend the system-prompt cache hit covers the whole input. And output dominates Pro’s bill: 62% of the total in the example above. If you cannot get the model to answer in fewer tokens, Pro will hurt. The DeepSeek context caching guide and the DeepSeek pricing calculator will let you plug in your own ratios.
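
If you prefer to sanity-check your own ratios in code before opening the calculator, the arithmetic above fits in a few lines; the rates hard-coded here are the list prices quoted in this article, so re-verify them against the live pricing page.

# Per-million-token USD rates as quoted above.
RATES = {
    "deepseek-v4-flash": {"cached_in": 0.028, "in": 0.14, "out": 0.28},
    "deepseek-v4-pro":   {"cached_in": 0.145, "in": 1.74, "out": 3.48},
}

def monthly_cost(model: str, calls: int, cached_in: int, uncached_in: int, out: int) -> float:
    """Cost in USD for `calls` requests with the given per-call token counts."""
    r = RATES[model]
    per_call = (cached_in * r["cached_in"] + uncached_in * r["in"] + out * r["out"]) / 1_000_000
    return round(calls * per_call, 2)

# Reproduces the worked examples above.
print(monthly_cost("deepseek-v4-flash", 1_000_000, 2_000, 200, 300))  # 168.0
print(monthly_cost("deepseek-v4-pro",   1_000_000, 2_000, 200, 300))  # 1682.0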

Note that off-peak discounts ended on September 5, 2025 and have not returned with V4. Quoting them as active is a common error — do not.

JSON mode, tool calling, FIM and the rest of the surface

DeepSeek’s JSON mode is designed to return valid JSON, not guaranteed to. Three rules I enforce on every JSON-mode route (the sketch after the list wires them together):

  1. Include the word “json” in the prompt and a small example schema. The docs require it; the model is more reliable when you do.
  2. Set max_tokens high enough that the JSON cannot be truncated. Truncated JSON is invalid JSON.
  3. Handle the empty-content case explicitly. The API can occasionally return an empty string, and a try/except around json.loads is not enough.
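
A minimal sketch combining all three rules, assuming the OpenAI-compatible response_format={"type": "json_object"} switch applies to the V4 model IDs as it did on V3.x:

import json
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_KEY")

resp = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[
        # Rule 1: say "json" and show the schema you expect back.
        {"role": "system", "content": 'Reply in JSON like {"severity": "low", "summary": "..."}'},
        {"role": "user", "content": "Classify this incident: primary DB failover at 02:14 UTC."},
    ],
    response_format={"type": "json_object"},
    max_tokens=2000,  # Rule 2: enough headroom that the JSON cannot be truncated
)

raw = resp.choices[0].message.content
if not raw or not raw.strip():  # Rule 3: treat an empty string as a failure, not a parse error
    raise RuntimeError("empty JSON-mode response; retry the request")
data = json.loads(raw)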

Tool calling works in OpenAI-compatible format and is supported in both thinking and non-thinking modes. FIM (Fill-In-the-Middle) completion is in Beta and runs in non-thinking mode only; it is genuinely useful for code-completion plug-ins, paired with DeepSeek with VS Code or your own editor surface. Streaming is enabled with stream=true; when thinking is on, reasoning content streams alongside final content. Chat Prefix Completion is also Beta and useful for continuation-style prompts.
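
For the streaming path, a short sketch of how the two channels interleave when thinking is on, reusing the client from the thinking-mode example above; the assumption that reasoning deltas arrive on a reasoning_content field, mirroring the non-streaming response, is mine.

stream = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[{"role": "user", "content": "Diagnose a flapping BGP session."}],
    reasoning_effort="high",
    extra_body={"thinking": {"type": "enabled"}},
    stream=True,
)

thinking, answer = [], []
for chunk in stream:
    if not chunk.choices:
        continue  # some chunks (e.g. usage-only) carry no choices
    delta = chunk.choices[0].delta
    if getattr(delta, "reasoning_content", None):  # thinking trace streams first
        thinking.append(delta.reasoning_content)
    if delta.content:                              # then the final answer
        answer.append(delta.content)

print("".join(thinking))
print("".join(answer))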

Migration checklist for V3.2 → V4

Three integration variables matter, and the work is small. The retirement window for legacy IDs is the only hard deadline:

  1. Swap the model field. Replace deepseek-chat with deepseek-v4-flash, and deepseek-reasoner with deepseek-v4-flash plus the thinking parameters above (see the sketch after this list). Both legacy IDs currently route to deepseek-v4-flash in non-thinking and thinking mode respectively, and will be fully retired and inaccessible after July 24, 2026, 15:59 UTC.
  2. Keep the base URL. No change to https://api.deepseek.com.
  3. Re-tune max_tokens for the new context budget. 1M input is the default; output can run to 384,000.
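
The swap from item 1, side by side; msgs stands in for your existing messages array, and the thinking parameters mirror the example earlier in this guide.

# Before (V3.2-era IDs, retired after July 24, 2026, 15:59 UTC):
# resp = client.chat.completions.create(model="deepseek-reasoner", messages=msgs)

# After (one V4 model ID per tier; thinking is a per-request parameter):
resp = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=msgs,
    reasoning_effort="high",
    extra_body={"thinking": {"type": "enabled"}},
    max_tokens=16000,  # re-tune for the 1M-input / 384K-output budget
)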

If you also use the web chat, V4 is now the default there. DeepSeek-V4 launched on Hugging Face, the DeepSeek API, and chat.deepseek.com (Expert Mode = V4-Pro, Instant Mode = V4-Flash).

Where V4 still falls short

Two honest gaps. HLE (Humanity’s Last Exam) at 37.7% puts V4-Pro below Claude (40.0%), GPT-5.4 (39.8%), and well below Gemini-3.1-Pro (44.4%) — HLE tests expert-level cross-domain reasoning. And SimpleQA-Verified at 57.9% versus Gemini’s 75.6% reveals a meaningful factual knowledge retrieval gap; if your use case requires accurate real-world knowledge recall, Gemini holds a clear edge. Build retrieval into your stack — see the DeepSeek RAG tutorial — rather than asking V4 to memorise the world.

For a side-by-side with the dominant closed alternatives, the DeepSeek vs Claude and DeepSeek vs ChatGPT comparisons go deeper than space allows here. Patterns and prompt templates for production routes live in the broader DeepSeek beginner guides hub.

Last verified: 2026-04-25. DeepSeek AI Guide is an independent resource and is not affiliated with DeepSeek or its parent company. Model IDs, pricing and API behaviour change; check the official DeepSeek documentation and pricing page before committing to a production decision.

How is DeepSeek V4 different from V3.2 in practice?

V4 ships as two MoE tiers — V4-Pro (1.6T/49B active) and V4-Flash (284B/13B active) — instead of V3.2’s single chat/reasoner pair. Both default to a 1,000,000-token context with up to 384,000-token output, and thinking mode is a request parameter rather than a separate model ID. Migration is a model-field swap; see the DeepSeek V4 overview for the full delta.

What does the DeepSeek API return when thinking mode is on?

The response carries two fields: reasoning_content with the model’s thinking trace, and content with the final answer. Both stream when stream=true. Set reasoning_effort="high" with extra_body={"thinking": {"type": "enabled"}} on either V4 model. The DeepSeek API streaming guide covers buffering and partial-token handling.

Can I keep using deepseek-chat and deepseek-reasoner?

Yes, but only until July 24, 2026 at 15:59 UTC. Both legacy IDs currently route to deepseek-v4-flash in non-thinking and thinking mode respectively. After retirement, requests using those IDs will fail. The migration is a one-line model= change; the base URL does not change. Step-by-step instructions sit on the DeepSeek API documentation page.

Does DeepSeek’s API remember conversation history like the web chat?

No. The /chat/completions endpoint is stateless — clients must resend the full message history with every request. The web chat and mobile app keep session history for the user, but the developer surface does not. Persist conversation state on your side, or use context caching to keep the resend cheap. The DeepSeek context caching page documents the cache-hit rates.

Is DeepSeek V4 actually open source?

Yes — V4-Pro and V4-Flash are both published as open weights under the MIT license, available on Hugging Face. That covers running them locally, fine-tuning, and commercial deployment, subject to the MIT terms. Older releases sometimes split code (MIT) from weights (separate DeepSeek Model License); the is DeepSeek open source guide tracks licensing per model.
