A Practitioner’s Guide to DeepSeek Prompt Engineering on V4
Why does the same prompt produce a tight, accurate answer one day and a rambling guess the next? Usually it is not the model — it is the prompt. This guide to DeepSeek prompt engineering is written from inside a production stack that runs both `deepseek-v4-pro` and `deepseek-v4-flash` daily. It assumes you have used DeepSeek’s web chat or API at least once and now want repeatable results: shorter outputs, fewer retries, lower bills.
I will cover prompt structure, the temperature settings DeepSeek itself recommends, when to switch on thinking mode, how to write a prompt that does not break JSON output, and how prompt shape affects your token bill. By the end you will have a small library of patterns you can paste into a real project.
What you’ll build (and why DeepSeek V4 changes the playbook)
By the end of this tutorial you will have: a default system-prompt template that works on either V4 tier, a non-thinking prompt for fast structured tasks, a thinking-mode prompt for harder reasoning, a JSON-mode prompt that does not silently truncate, and a worked cost example so you know what each pattern costs at scale.
The current generation is DeepSeek V4, released on April 24, 2026, and it ships as two open-weight Mixture-of-Experts models under the MIT license. DeepSeek V4-Pro has 1.6T total parameters with 49B active per token; DeepSeek V4-Flash has 284B total with 13B active. According to DeepSeek’s own thinking-mode docs, both models share a feature set and a default 1,000,000-token context window, with output up to 384,000 tokens.
The big shift for prompt engineers: thinking mode is now a request parameter, not a separate model ID. Per DeepSeek's docs, the model can emit a chain-of-thought before the final answer to improve accuracy, and the thinking toggle defaults to enabled with effort set to high for regular requests. That means the same prompt can be sent in three flavours: non-thinking, thinking-high, or thinking-max, and your prompt should change with the mode.
Prerequisites
- A DeepSeek account and a working API key — the get a DeepSeek API key walkthrough covers signup and billing.
- Python 3.9+ with the `openai` SDK installed (`pip install openai`). DeepSeek is OpenAI-compatible, so the same SDK works against `https://api.deepseek.com`.
- Comfort with a terminal and a basic understanding of system, user and assistant message roles.
- If you only know the chat UI so far, run through how to use DeepSeek first; this article assumes you have moved on to the API.
Chat requests hit POST /chat/completions, the OpenAI-compatible endpoint. DeepSeek also exposes an Anthropic-compatible surface against the same base URL, so an Anthropic SDK works with a base_url swap.
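For teams with existing Anthropic-flavoured code, the swap looks like the minimal sketch below; the exact model ID accepted on that surface is an assumption, so confirm it against the docs before relying on it:
from anthropic import Anthropic
# Hypothetical: point the Anthropic SDK at DeepSeek's Anthropic-compatible surface.
client = Anthropic(base_url="https://api.deepseek.com", api_key="YOUR_KEY")
msg = client.messages.create(
    model="deepseek-v4-flash",  # assumed model ID on the Anthropic-compatible surface
    max_tokens=400,
    system="You are a senior Python reviewer.",
    messages=[{"role": "user", "content": "def add(a,b): return a+b"}],
)
print(msg.content[0].text)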
The DeepSeek prompt anatomy
Every effective DeepSeek prompt has six parts. Skip any of them and quality drops.
- Role — who the model is acting as (“senior security reviewer”, “patient maths tutor”).
- Task — one verb-led instruction. Not three, not five.
- Context — the data, code, or background the model needs.
- Constraints — length, tone, banned phrases, must-include items.
- Output format — markdown table, numbered list, JSON schema, plain prose.
- Examples — one or two input/output pairs when the format is non-obvious.
That order matters. DeepSeek (like most LLMs) weighs early tokens more heavily, so the role and task should land before any long context block. If your context is several thousand tokens of code or document text, put a one-line restatement of the task after the context too — a “task sandwich” — so the instruction is not buried.
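As a sketch, the sandwich shape for a long code-review prompt looks like this (bracketed parts are placeholders):
You are a senior security reviewer.
Task: list every unvalidated input in the code below.
[several thousand tokens of code]
Reminder: list every unvalidated input in the code above. Reply as a markdown table.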
Pick a temperature on purpose
DeepSeek publishes its own temperature recommendations, and they are different from OpenAI’s defaults. Memorise this table; it is the biggest single-knob improvement most people can make.
| Use case | Temperature | Notes |
|---|---|---|
| Code generation, mathematics | 0.0 | Deterministic; rerun = same answer. |
| Data analysis, data cleaning | 1.0 | Default. Some variation, mostly grounded. |
| General conversation, translation | 1.3 | Natural-sounding output without losing accuracy. |
| Creative writing, poetry | 1.5 | More surprising word choice; expect re-rolls. |
One critical caveat: thinking mode does not support the temperature, top_p, presence_penalty, or frequency_penalty parameters. For compatibility, setting them will not trigger an error, but they have no effect. If you turn on thinking, drop those knobs from your prompt-tuning vocabulary and tune the prompt itself instead.
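To keep those recommendations in one place, a small lookup helps; this is a sketch only, and the names TEMPERATURE_BY_TASK and sampling_params are mine, not part of any SDK:
# DeepSeek-recommended sampling temperatures by task type (see the table above).
TEMPERATURE_BY_TASK = {
    "code": 0.0,      # code generation, mathematics
    "data": 1.0,      # data analysis, data cleaning
    "chat": 1.3,      # general conversation, translation
    "creative": 1.5,  # creative writing, poetry
}
def sampling_params(task: str, thinking_enabled: bool) -> dict:
    """Kwargs for chat.completions.create; sampling knobs are no-ops in thinking mode."""
    if thinking_enabled:
        return {}
    return {"temperature": TEMPERATURE_BY_TASK[task]}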
A minimal V4 prompt in Python
This Python snippet shows the simplest correct V4 call against POST /chat/completions, with thinking explicitly disabled for a fast, deterministic structured task:
from openai import OpenAI
client = OpenAI(
base_url="https://api.deepseek.com",
api_key="YOUR_KEY",
)
resp = client.chat.completions.create(
model="deepseek-v4-flash",
messages=[
{"role": "system", "content":
"You are a senior Python reviewer. "
"Reply with a markdown table: column 1 'Issue', column 2 'Fix'. "
"If the code is fine, reply with the single word: clean."},
{"role": "user", "content": "def add(a,b): return a+b"},
],
temperature=0.0,
max_tokens=400,
extra_body={"thinking": {"type": "disabled"}},
)
print(resp.choices[0].message.content)
Two things to notice. First, the system prompt does the heavy lifting — role, task, output format, edge case. The user message is just data. Second, the API is stateless. To carry on a conversation you must resend the full messages array on every call. The web chat keeps history for you; the API does not. DeepSeek API best practices covers history-trimming patterns once your conversations get long.
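A sketch of one follow-up turn, reusing the client from above and resending the full history each time:
history = [
    {"role": "system", "content": "You are a senior Python reviewer."},
    {"role": "user", "content": "def add(a,b): return a+b"},
]
first = client.chat.completions.create(model="deepseek-v4-flash", messages=history)
# Append the assistant reply, then the new user turn, and resend everything.
history.append({"role": "assistant", "content": first.choices[0].message.content})
history.append({"role": "user", "content": "Now add type hints and a docstring."})
second = client.chat.completions.create(model="deepseek-v4-flash", messages=history)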
When (and how) to use thinking mode
Thinking mode swaps speed for quality. The model emits a chain-of-thought trace, then the final answer. The trace comes back in a reasoning_content field at the same level as content, so your code reads the reasoning alongside the final answer from the same message object.
Use thinking mode when:
- The task involves multi-step reasoning the user expects to be auditable (legal, finance, complex maths).
- You are running an agent loop where wrong intermediate decisions cascade.
- You are willing to pay for more output tokens because the trace counts as completion tokens.
Skip thinking mode for chat replies, classification, formatting, translation, and most retrieval-augmented generation answers. The latency and the bill are not worth it.
Here is the same call in thinking-high mode:
resp = client.chat.completions.create(
model="deepseek-v4-pro",
messages=[
{"role": "system", "content":
"You are a senior architect. Plan the migration in 5 numbered steps. "
"Be explicit about rollback for each step."},
{"role": "user", "content": migration_brief},
],
reasoning_effort="high",
extra_body={"thinking": {"type": "enabled"}},
)
reasoning = resp.choices[0].message.reasoning_content # the trace
answer = resp.choices[0].message.content # the plan
For the hardest agentic and competition-maths workloads, set reasoning_effort="max". The official Hugging Face card recommends setting the context window to at least 384K tokens for Think Max so the trace does not get truncated.
Don’t repeat the trace back into context
A common bug: developers append the assistant’s full message, including reasoning_content, to messages for the next turn. If the model did not call a tool between two user messages, the intermediate reasoning_content does not need to be part of the next turn’s context; if you do pass it, the API ignores it. The exception is a tool call mid-turn, in which case you must pass the trace back. Pay your token bill once, not twice.
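A sketch of the safe append, assuming no tool call happened in the turn (reuses resp and history from the snippets above):
msg = resp.choices[0].message
# Keep only the final answer in the running history; drop the trace.
history.append({"role": "assistant", "content": msg.content})
# Anti-pattern (no tool call in the turn): do not echo the trace back.
# history.append({"role": "assistant", "content": msg.content,
#                 "reasoning_content": msg.reasoning_content})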
Prompting JSON mode without breaking it
JSON mode is designed to return valid JSON, not guaranteed to. The official spec is blunt about the failure modes: you must instruct the model to produce JSON yourself via a system or user message; without that, the model may generate an unending stream of whitespace until it reaches the token limit, leaving a long-running and seemingly “stuck” request, and the message content may be partially cut off if finish_reason is "length".
That gives you three rules:
- Include the literal word “json” in your system or user message.
- Show a small example schema in the prompt — not just describe it.
- Set `max_tokens` high enough that the JSON cannot be truncated mid-string.
A working pattern:
system = """You extract contact details and return json.
Return exactly this shape, with no prose:
{"name": "string", "email": "string|null", "phone": "string|null"}
If a field is missing, use null."""
resp = client.chat.completions.create(
model="deepseek-v4-flash",
messages=[{"role": "system", "content": system},
{"role": "user", "content": email_body}],
response_format={"type": "json_object"},
temperature=0.0,
max_tokens=512,
)
Always wrap the parse in a try/except and handle empty content. Even at temperature 0 the model occasionally returns nothing when it cannot satisfy the schema. DeepSeek API JSON mode documents the failure modes and retry patterns in more detail.
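A sketch of the defensive parse around that call; the fallback behaviour is yours to decide:
import json
raw = resp.choices[0].message.content
try:
    contact = json.loads(raw) if raw else None
except json.JSONDecodeError:
    contact = None  # keep the raw text around for logging before retrying
if contact is None:
    pass  # retry once with the same prompt, or queue for manual review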
Six patterns I keep in a snippet file
1. The role+constraint sandwich
You are a [ROLE].
Task: [VERB-LED INSTRUCTION].
Constraints:
- [LENGTH or TONE]
- [MUST INCLUDE]
- [MUST AVOID]
Reply only with [FORMAT]. No preamble.
2. Few-shot for tone
Two or three labelled examples beat any number of adjectives. “Write in our brand voice” rarely works; pasting two real product emails always does.
3. Verifier prompt
For high-stakes outputs, send a second call: “Here is a draft answer and the original question. List any factual claims unsupported by the source. Reply with json: {issues: […]}.” This catches more errors than a single longer prompt.
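A sketch of that second call; question, source and draft are placeholders for your own strings:
verifier = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[
        {"role": "system", "content":
         "You are a fact checker. List any factual claims in the draft that are "
         "unsupported by the source. Reply with json: {\"issues\": [\"string\"]}."},
        {"role": "user", "content": f"QUESTION:\n{question}\n\nSOURCE:\n{source}\n\nDRAFT:\n{draft}"},
    ],
    response_format={"type": "json_object"},
    temperature=0.0,
    max_tokens=512,
)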
4. The “explain why” suffix
Adding “Explain in one sentence why each item belongs in the list” forces tighter selection without enabling full thinking mode. Cheaper than reasoning mode for a meaningful quality bump on classification tasks.
5. Negative examples
If the model keeps making the same mistake, add an explicit “Wrong: … Right: …” pair. DeepSeek pays attention to negatives more reliably than to “do not” instructions alone.
6. Output anchors
End your prompt with the first characters of the expected output (e.g. “Begin your reply with the line: ## Summary”). This dramatically reduces preamble drift on V4-Flash.
If you want curated starter prompts, our DeepSeek prompt templates library has tested versions for coding, writing and research. Builders who want to iterate inside a chat can use the DeepSeek prompt generator.
Verify your prompts work
“Looks good” is not a test. Three checks I run before any prompt ships:
- Run it five times at temperature 0 and at 1.3. If outputs vary wildly at 0, your prompt is ambiguous, not the model. A minimal harness sketch follows this list.
- Adversarial inputs. Empty string, very long input, input in a different language, input with prompt-injection (“ignore your instructions and …”). The prompt should fail gracefully.
- Token budget. Use a DeepSeek token counter to confirm your prompt fits comfortably within the 1M context, and that your `max_tokens` matches the realistic output length.
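A minimal sketch of the temperature-0 half of that first check, reusing the client from earlier; it counts exact-match distinct outputs, which is crude but catches obvious ambiguity:
def rerun_check(messages, n=5):
    """Call the same prompt n times at temperature 0 and report distinct outputs."""
    outputs = set()
    for _ in range(n):
        r = client.chat.completions.create(
            model="deepseek-v4-flash",
            messages=messages,
            temperature=0.0,
            max_tokens=400,
            extra_body={"thinking": {"type": "disabled"}},
        )
        outputs.add(r.choices[0].message.content.strip())
    print(f"{len(outputs)} distinct outputs out of {n} runs at temperature 0")
    return outputs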
What prompt shape costs you — a worked example on V4-Flash
Prompt engineering and cost engineering are the same job. Imagine a customer-support bot making 1,000,000 calls per month with a 2,000-token system prompt (cached across calls), a 200-token user message, and a 300-token reply, on deepseek-v4-flash:
| Bucket | Tokens | Rate per 1M | Cost |
|---|---|---|---|
| Input, cache hit (system prompt) | 2,000,000,000 | $0.028 | $56.00 |
| Input, cache miss (each new user msg) | 200,000,000 | $0.14 | $28.00 |
| Output | 300,000,000 | $0.28 | $84.00 |
| Total | 2,500,000,000 | | $168.00 |
Same workload on deepseek-v4-pro at $0.145 cache-hit / $1.74 cache-miss / $3.48 output per 1M tokens lands at $1,682.00, roughly ten times more. That price difference is why most chat workloads should default to V4-Flash, with V4-Pro reserved for frontier coding or agentic work where the benchmark lift earns the spend. These rates are current as of April 2026; verify them on the DeepSeek API pricing page before committing budget.
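The same arithmetic as a quick script, so you can swap in your own traffic numbers; the rates are hard-coded from the table above:
calls = 1_000_000
cached_in, fresh_in, out_tokens = 2_000, 200, 300   # tokens per call
flash = {"hit": 0.028, "miss": 0.14, "out": 0.28}   # $ per 1M tokens
pro = {"hit": 0.145, "miss": 1.74, "out": 3.48}
def monthly_cost(rates):
    million_calls = calls / 1_000_000
    return million_calls * (cached_in * rates["hit"] + fresh_in * rates["miss"] + out_tokens * rates["out"])
print(f"V4-Flash: ${monthly_cost(flash):,.2f}")  # $168.00
print(f"V4-Pro:   ${monthly_cost(pro):,.2f}")    # $1,682.00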
Two prompt-shape levers move that bill more than anything else. First, cache prefixes aggressively: put your stable system prompt and few-shot examples at the very start of messages so the cache-hit tier kicks in. Second, cap output with max_tokens and prompt-side instructions (“reply in at most 80 words”) — output tokens cost ten times what cached input does on Flash and twenty-four times on Pro.
Legacy IDs and the migration window
If your code still says model="deepseek-chat" or model="deepseek-reasoner", those legacy IDs currently route to deepseek-v4-flash (non-thinking and thinking respectively) but will be retired on 2026-07-24 at 15:59 UTC. After that, requests using those IDs will fail. Migration is a one-line change to the model= field — base_url stays the same. Prompt content rarely needs editing, but if you were prompt-tuning around quirks of the old V3.2-class model, retest at temperature 0 and recheck output structure.
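In code, the migration for a non-thinking workload is the one-line swap sketched below (msgs stands in for your existing messages list):
# Before (legacy ID, routes to V4-Flash until 2026-07-24):
# resp = client.chat.completions.create(model="deepseek-chat", messages=msgs)
# After (explicit V4 ID; base_url and everything else unchanged):
resp = client.chat.completions.create(model="deepseek-v4-flash", messages=msgs)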
Common errors and quick fixes
| Symptom | Likely cause | Fix |
|---|---|---|
| Empty content in JSON mode | Prompt did not include the word “json” or a schema | Add an example schema and the literal word “json” to the system prompt. |
| Truncated JSON ending with a comma | max_tokens too low | Raise max_tokens; check finish_reason. |
| Model ignores temperature | Thinking mode is enabled | Sampling params are no-ops in thinking mode; tune the prompt instead. |
| Output drifts across reruns at temperature=0 | Ambiguous task or conflicting constraints | Rewrite for one clear task; remove conflicting “be brief but detailed” pairs. |
| Costs spiking after enabling thinking | Long traces in completion_tokens | Lower reasoning_effort, cap max_tokens, or move the task to non-thinking with stronger few-shots. |
Next steps
Two natural follow-ons: build a small benchmark harness using these patterns with the DeepSeek API getting started tutorial, then plug your best prompts into a retrieval pipeline using the DeepSeek RAG tutorial. For the bigger picture across all DeepSeek how-tos, the DeepSeek tutorials hub indexes everything by skill level.
Last verified: 2026-04-25. DeepSeek AI Guide is an independent resource and is not affiliated with DeepSeek or its parent company. Model IDs, pricing and API behaviour change; check the official DeepSeek documentation and pricing page before committing to a production decision.
How is DeepSeek prompt engineering different on V4 compared to V3?
The biggest change is that thinking mode is now a request parameter on either deepseek-v4-pro or deepseek-v4-flash, not a separate model ID. You toggle it with reasoning_effort and extra_body={"thinking": {"type": "enabled"}}. Context is also 1M tokens by default. See the DeepSeek V4 overview for the full list of architecture changes.
What temperature should I use for DeepSeek?
DeepSeek’s own guidance is 0.0 for code and maths, 1.0 for data analysis, 1.3 for general chat and translation, and 1.5 for creative writing. Pick on purpose; do not leave temperature at whatever default your SDK uses. In thinking mode, temperature has no effect — tune the prompt itself. The DeepSeek API best practices guide has more parameter tips.
Does DeepSeek remember previous messages between API calls?
No. The API is stateless — every POST /chat/completions call must include the full messages array if you want continuity. The web chat and mobile app keep session state for you, but the API does not. This is one of the most common newcomer mistakes; for a clearer walkthrough see the DeepSeek API documentation overview.
Can I use the same prompt on V4-Flash and V4-Pro?
Usually yes — both models share a feature set, prompt format and 1M context. Pro tends to need fewer few-shot examples to hit the same quality, and Flash benefits from more explicit formatting anchors. Always test in both. Pricing differs roughly tenfold, so start on Flash and only escalate when a benchmark or eval shows Pro pays for itself. Compare specs on DeepSeek comparisons.
Why does my JSON-mode prompt sometimes return empty content?
JSON mode is designed to return valid JSON, not guaranteed. If the prompt does not include the word “json” plus an example schema, or if max_tokens is too low, the model can return empty content or truncated output. Always include both, set generous max_tokens, and wrap parsing in try/except. The DeepSeek API JSON mode reference details every failure mode.
