DeepSeek Prompt Templates: A Practitioner’s Library for V4

Tools·April 25, 2026·By DS Guide Editorial

You opened the DeepSeek console, pasted “write me a marketing email,” and got something generic. The model is not the problem. The prompt is. After six months running DeepSeek V4-Pro and V4-Flash through production workloads — extraction pipelines, code review, RAG over legal docs, weekly reports — I’ve collected a small library of DeepSeek prompt templates that consistently outperform ad-hoc requests. They are short, structured, and tuned to the parameters DeepSeek’s docs actually recommend.

This article gives you twelve copy-paste templates organised by task — extraction, code, reasoning, writing, agentic tool use — plus the temperature settings, system prompt structure, and JSON-mode caveats that make them work. Every template has been tested against `deepseek-v4-pro` and `deepseek-v4-flash` since the V4 Preview shipped on April 24, 2026.

Why prompt templates matter more on DeepSeek V4

V4 changed the rules. The legacy model names deepseek-chat and deepseek-reasoner will be deprecated on 2026-07-24, and they now correspond to the non-thinking and thinking modes of deepseek-v4-flash respectively. That migration matters for templates because thinking is no longer a separate model — it is a request parameter on either deepseek-v4-pro or deepseek-v4-flash. The same prompt produces measurably different outputs depending on whether you set reasoning_effort="high" with extra_body={"thinking": {"type": "enabled"}}, or leave it off.

The other thing V4 changed: context. Both tiers default to a 1,000,000-token window with output up to 384,000 tokens. That means templates can be longer, carry more examples, and embed more reference material than they could on V3.2. But long prompts are also wasted spend if you do not structure them so DeepSeek’s context cache can hit. We will cover the cache-friendly layout below.

For background on the underlying models, see DeepSeek V4-Pro and DeepSeek V4-Flash. If you want a deeper dive on prompt theory rather than ready-made templates, the companion piece on DeepSeek prompt engineering is the better starting point.

The four-block template structure

Every template in this article follows the same four-block layout. It is the structure I find easiest to maintain and the easiest for context caching to fingerprint.

System block — role, constraints, output format. Stable across calls.
Reference block — schemas, examples, retrieved documents. Mostly stable.
Task block — what to do this turn. Changes per call.
User input — the volatile content (a question, a file, a row).

Blocks 1 and 2 belong in the system message or the front of the first user message. Block 3 and 4 go in subsequent user messages. Because the default temperature is 1.0 and DeepSeek recommends adjusting per use case — close to 0.0 for code and structured analysis, around 1.3 for general conversation, and above 1.0 for creative work, every template below pairs the prompt with a recommended temperature.

Temperature settings to pair with each template

Task type	Temperature	Notes
Code generation, math	0.0	Deterministic, fewer hallucinated APIs
Data extraction, cleaning	1.0	DeepSeek’s recommended setting for analytical work
General Q&A, translation	1.3	Default-ish; balanced
Creative writing, poetry	1.5	More variety, more revisions needed

These match DeepSeek’s official guidance on the API parameter docs page. Pick one and stick to it for that template — do not also push top_p; DeepSeek and the OpenAI-compatible spec generally recommend altering temperature or top_p but not both.

Template 1 — Structured extraction (JSON mode)

Use this when you need typed output. The DeepSeek API uses an OpenAI/Anthropic-compatible format, so you can use the OpenAI SDK against the DeepSeek base URL by changing only the configuration. Set response_format={"type": "json_object"}, set max_tokens high enough to avoid truncation, and include the word “json” plus a small example schema in the prompt.

System:
You are a structured-data extractor. Return only valid JSON matching the schema.
Do not include prose. If a field is unknown, return null.

Schema (example):
{
  "company": "Acme Inc",
  "amount_usd": 12500.00,
  "invoice_date": "2026-04-12",
  "line_items": [{"sku": "A-1", "qty": 2, "price": 19.99}]
}

User:
Extract a json record from the email below. Use the schema above.

[paste email]

JSON mode is designed to return valid JSON, not guaranteed. The model may occasionally return empty content, so wrap parsing in a retry. Full caveats are documented in the DeepSeek API JSON mode guide. Recommended model: deepseek-v4-flash, non-thinking, temperature 1.0.

Template 2 — Code generation with constraints

Code is where the temperature-0 setting earns its keep. Tell the model the language, the runtime, the libraries you allow, and the failure modes you do not want.

System:
You are a senior Python engineer. Write Python 3.12 code only.
Allowed libs: stdlib, httpx, pydantic. No requests, no global state.
Return one fenced code block plus a 3-line explanation.

User:
Write a function fetch_invoices(api_key: str, since: date) -> list[Invoice]
that pages /v1/invoices?since=... and returns a list of Pydantic Invoice
models with id, amount_cents, currency, paid_at fields.

For multi-file refactors, switch to deepseek-v4-pro with thinking enabled. Pair this template with the workflow described in DeepSeek for coding.

Template 3 — Reasoning with thinking mode

Thinking mode is a parameter, not a model. Set reasoning_effort="high" and extra_body={"thinking": {"type": "enabled"}}; the response returns reasoning_content alongside the final content. Use "max" for hardest problems — the docs note this requires max_model_len >= 393216 to avoid truncation.

System:
You are a careful analyst. Think through the problem step by step in
reasoning_content. In content, return only the final answer in the
format requested.

User:
A SaaS charges $0.14 per 1M input tokens and $0.28 per 1M output.
We make 1M calls/day, each with 2,000 cached input + 200 fresh input
+ 300 output tokens. Cache-hit input is $0.028/M. What is the daily
spend, broken down by bucket? Return a markdown table.

That is essentially the V4-Flash worked example from BASELINE FACTS — the daily total is $168 (cached $56, miss $28, output $84). Showing the math beats trusting any single model output.

Template 4 — Cache-friendly system prompt

Context caching cuts input cost from $0.14 to $0.028 per million tokens on V4-Flash, and from $1.74 to $0.145 on V4-Pro. The cache-hit price applies automatically when DeepSeek detects a repeated prefix. To trigger it, keep the prefix bit-identical across calls.

Put the entire static instruction set, schemas, and few-shot examples in messages[0] (system) or the very front of messages[1] (user).
Never interpolate timestamps, request IDs, or per-user names into the prefix.
If the user-specific bit must come early, push it into a later message and keep the system block stable.

For the math on what cache hits actually save, the DeepSeek context caching reference walks through a worked example.

Template 5 — RAG with cited sources

System:
You answer questions only from the provided sources. Cite each claim
with the source id in square brackets, e.g. [S2]. If the sources do
not contain the answer, reply: "Not in sources." Do not invent facts.

Sources:
[S1] {chunk 1 text}
[S2] {chunk 2 text}
[S3] {chunk 3 text}

User:
{question}

Pair with retrieval as covered in the DeepSeek RAG tutorial. Temperature 1.0; deepseek-v4-flash is fine for most retrievals, escalate to V4-Pro when answers must be exhaustive.

Template 6 — Long-document summarisation

With a 1M-token context, you rarely need chunking just to fit. You still need to direct what gets summarised.

System:
Summarise the document in three layers:
1. One-sentence verdict
2. Five bullet key findings, each <= 20 words
3. Risks and unknowns, with section references like (p. 12)

Do not editorialise. Do not add information that is not in the document.

Template 7 — Role-based critique

Role prompts work because they constrain vocabulary and what the model treats as in-scope. Be specific.

You are a database administrator with 15 years of Postgres experience.
Review the SQL below. List, in order of severity:
1. Correctness bugs
2. Performance issues at > 10M rows
3. Style problems
For each, quote the offending line and propose a fix.

Template 8 — Tool calling skeleton

Function calling works in the OpenAI format. Declare tools, let the model emit a tool call, run the call client-side, append the result to messages, and call again. The API is stateless — chat requests hit POST /chat/completions, the OpenAI-compatible endpoint, and you must resend the conversation history on every turn.

from openai import OpenAI
client = OpenAI(base_url="https://api.deepseek.com", api_key="...")

tools = [{
  "type": "function",
  "function": {
    "name": "get_weather",
    "description": "Current weather for a city",
    "parameters": {
      "type": "object",
      "properties": {"city": {"type": "string"}},
      "required": ["city"]
    }
  }
}]

resp = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Weather in Dublin?"}],
    tools=tools,
)

For the full pattern see DeepSeek API function calling.

Template 9 — Translation with register control

Translate the text below from English to {target}.
Register: {formal|conversational|technical}.
Preserve: code blocks, URLs, numbers, proper nouns.
Output only the translation, no commentary.

Temperature 1.3 produces less wooden output than 0; the official docs flag 1.3 as the recommended setting for translation.

Template 10 — Creative writing brief

Write a {format} of {length} words about {topic}.
Audience: {who}. Tone: {adjective + adjective}.
Constraints: {must include}. Avoid: {must not}.
Provide two distinct drafts, then a one-line note on the difference.

Temperature 1.5. If both drafts feel the same, raise to 1.7 or rerun — V4 at low temperature can lock onto a single voice.

Template 11 — Test-case generator

For the function signature below, generate pytest tests covering:
- Happy path (3 cases)
- Edge cases (empty, single element, max size)
- Error cases (invalid type, out-of-range)
Return one fenced code block. No commentary.

{paste signature + docstring}

Template 12 — Self-critique pass

Cheap on V4-Flash, often worth running. The first call drafts; the second critiques and rewrites.

Pass 1 (temperature 1.3): Draft the response.
Pass 2 (temperature 0.0): "Critique the draft below for factual errors,
weak claims, and structural problems. Rewrite to fix them. Output only
the rewritten version."

Endpoint and parameter reference

Every template above goes to POST /chat/completions at https://api.deepseek.com. DeepSeek’s API is OpenAI- and Anthropic-compatible, so you can swap base URL and key in either SDK. The parameters that show up in the templates above:

model — deepseek-v4-pro or deepseek-v4-flash
temperature — see the table above
max_tokens — set high (e.g. 8000+) when using JSON mode
reasoning_effort — "high" or "max" with extra_body={"thinking": {"type": "enabled"}}
response_format — {"type": "json_object"} for Template 1
tools — for Template 8
stream — true for chat UIs

For setup, see how to get a DeepSeek API key. For the broader collection of utilities including the prompt generator, see the DeepSeek tools and utilities hub.

What I would not do

Inline timestamps in the system prompt — kills cache hits.
Set both temperature and top_p aggressively — pick one.
Use thinking mode for short factual questions — you pay for output tokens you do not need.
Trust JSON mode without a parser fallback — it is “designed to” return JSON, not guaranteed.
Quote a single SWE-Bench number from a third-party blog without checking the split (Verified vs Full vs Lite).

Last verified: 2026-04-25. DeepSeek AI Guide is an independent resource and is not affiliated with DeepSeek or its parent company. Model IDs, pricing and API behaviour change; check the official DeepSeek documentation and pricing page before committing to a production decision.

Frequently asked questions

How do I write a system prompt for DeepSeek V4?

Put the role, constraints, and output format in the system message and keep it bit-identical across calls so context caching can hit. Add schemas or few-shot examples in the same block. Save volatile content (timestamps, user names, today’s question) for later user messages. The four-block structure used in our DeepSeek prompt engineering guide works for both V4-Pro and V4-Flash.

What temperature should I use with DeepSeek prompt templates?

DeepSeek’s docs recommend 0.0 for code and math, 1.0 for data analysis, 1.3 for general conversation and translation, and 1.5 for creative writing. The default is 1.0. Match the temperature to the template you are running — code templates at 0.0, creative templates at 1.5. More guidance lives in the DeepSeek API best practices reference.

Can I use the same prompt for V4-Pro and V4-Flash?

Yes — both tiers share the same API surface and prompt format. The difference is cost and capability ceiling: V4-Pro is roughly six to seven times more expensive on output and worth it for frontier coding or agentic work. For chat, extraction, and most RAG, V4-Flash is the default. Compare both on the DeepSeek pricing calculator.

Does DeepSeek JSON mode guarantee valid JSON output?

No. DeepSeek’s docs describe JSON mode as designed to return valid JSON, not guaranteed. The model may occasionally return empty content, so always wrap parsing in a try/except with a retry. Include the word “json” plus an example schema in the prompt, and set max_tokens high enough that the response cannot be truncated mid-object. The DeepSeek API JSON mode doc covers the failure cases in detail.

Why is my DeepSeek prompt template more expensive than expected?

Almost always one of three things: the system prompt is changing across calls (no cache hits), max_tokens is uncapped so thinking mode produces enormous traces, or you are on V4-Pro when V4-Flash would do. Check token counts with the DeepSeek token counter and review the cost breakdown by bucket — cached input, uncached input, and output — separately.