DeepSeek API Examples: Working V4 Snippets for Real Workloads
You opened the DeepSeek docs, found a quickstart, and now you want the next ten things — streaming, JSON output, tool calls, thinking mode, cost math that survives a code review. This page collects the DeepSeek API examples I actually run in production against V4-Flash and V4-Pro, the two models that replaced the legacy `deepseek-chat` and `deepseek-reasoner` IDs in April 2026. Every snippet has been tested against the live OpenAI-compatible endpoint at `https://api.deepseek.com`, with notes on what breaks, what costs more than you think, and which features are still Beta. By the end you will have copy-paste code for the eight most common API patterns, plus a worked cost calculation for each tier.
What the DeepSeek API actually is
DeepSeek’s API is an HTTP service that speaks the OpenAI Chat Completions wire format. The canonical chat endpoint is POST /chat/completions against the base URL https://api.deepseek.com. DeepSeek’s official quickstart shows the same shape — a JSON body with model, messages, and the usual sampling knobs.
Two practical points before any code:
- The API is stateless. Unlike the web chat or mobile app, the API does not remember prior turns. Each request must include the full conversation history in `messages`.
- Two current model IDs. DeepSeek V4-Flash (284B total / 13B active) and DeepSeek V4-Pro (1.6T total / 49B active). Both are open-weight MoE under MIT, and both default to a 1,000,000-token context with output up to 384,000 tokens.
Legacy `deepseek-chat` and `deepseek-reasoner` still work, but both legacy IDs will be fully retired on July 24, 2026 at 15:59 UTC. Until then they route to `deepseek-v4-flash` in non-thinking and thinking mode respectively. Migration is a one-line change to your `model=` argument, as the sketch below shows; `base_url` stays put. For the full migration story see our DeepSeek V4 overview.
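The swap in code, reusing the `client` object from Example 1 below (`msgs` stands in for any messages array):

# Before: legacy ID, routes to deepseek-v4-flash (non-thinking) until 2026-07-24
resp = client.chat.completions.create(model="deepseek-chat", messages=msgs)

# After: the one-line change; base_url is untouched
resp = client.chat.completions.create(model="deepseek-v4-flash", messages=msgs)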
Example 1: Minimal request — curl and Python
The simplest possible call. First in curl:
curl https://api.deepseek.com/chat/completions \
  -H "Authorization: Bearer $DEEPSEEK_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-v4-flash",
    "messages": [
      {"role": "system", "content": "You are a terse senior engineer."},
      {"role": "user", "content": "What is a MoE router, in two sentences?"}
    ]
  }'
Then the equivalent in Python with the OpenAI SDK — no rewrites, just swap the base URL:
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com",
)

resp = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[
        {"role": "system", "content": "You are a terse senior engineer."},
        {"role": "user", "content": "What is a MoE router, in two sentences?"},
    ],
    temperature=1.3,
)
print(resp.choices[0].message.content)
Note temperature=1.3. DeepSeek publishes task-specific defaults: 0.0 for code and math, 1.0 for data analysis, 1.3 for general chat and translation, 1.5 for creative writing. Importing GPT-style defaults of 0.7 will give you flatter answers than the model is capable of.
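If you route several task types through one code path, a small lookup keeps those defaults honest. The task labels are my own; the values are DeepSeek's published ones:

# DeepSeek's published per-task temperature defaults (labels are mine)
TEMPERATURE_BY_TASK = {
    "code": 0.0,
    "math": 0.0,
    "data_analysis": 1.0,
    "chat": 1.3,
    "translation": 1.3,
    "creative_writing": 1.5,
}

def temperature_for(task: str) -> float:
    return TEMPERATURE_BY_TASK.get(task, 1.0)  # conservative fallback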
If your codebase already speaks Anthropic instead, you are covered too: alongside the OpenAI Chat Completions shape, DeepSeek exposes an Anthropic-style endpoint, so the Anthropic SDK works against the same service. More on that in our DeepSeek OpenAI SDK compatibility guide.
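A sketch with the Anthropic SDK. The /anthropic base path is an assumption on my part; check the compatibility guide for the exact mount point:

import os
import anthropic

# Assumption: the Anthropic-compatible surface is mounted at /anthropic
anthro = anthropic.Anthropic(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com/anthropic",
)
msg = anthro.messages.create(
    model="deepseek-v4-flash",
    max_tokens=256,
    messages=[{"role": "user", "content": "What is a MoE router, in two sentences?"}],
)
print(msg.content[0].text)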
Example 2: Multi-turn conversation (resending history)
Because the API is stateless, you maintain the conversation client-side:
history = [
    {"role": "system", "content": "You are a Postgres tutor."},
]

def ask(user_msg: str) -> str:
    history.append({"role": "user", "content": user_msg})
    resp = client.chat.completions.create(
        model="deepseek-v4-flash",
        messages=history,
        temperature=1.3,
    )
    answer = resp.choices[0].message.content
    history.append({"role": "assistant", "content": answer})
    return answer

print(ask("Explain CTEs in one paragraph."))
print(ask("Show a recursive example."))
Every call resends the entire history array. That sounds wasteful, but DeepSeek’s automatic context cache detects the repeated prefix and bills the older tokens at the cache-hit rate. We dig into that further in DeepSeek context caching.
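You can watch the cache fire from the usage object. The field names below are the ones the V3-era API reported (prompt_cache_hit_tokens / prompt_cache_miss_tokens); re-check them against the current docs before relying on them:

resp = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=history,
)
u = resp.usage
# On a repeat call with the same prefix, hit tokens should cover the system
# prompt and earlier turns; only the new suffix shows up as misses.
print("cache hit:", getattr(u, "prompt_cache_hit_tokens", None))
print("cache miss:", getattr(u, "prompt_cache_miss_tokens", None))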
Example 3: Streaming responses
For chat UIs you want tokens as they generate. Set stream=True and iterate:
stream = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Write a haiku about caching."}],
    stream=True,
    temperature=1.5,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content or ""
    print(delta, end="", flush=True)
When thinking mode is enabled, reasoning content streams in a separate delta.reasoning_content field before final content begins. Detailed patterns live in DeepSeek API streaming.
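A sketch of handling both streams, using the thinking flags from Example 4 below (getattr guards against SDK versions that do not model the extra field):

stream = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[{"role": "user", "content": "Which is larger, 9.11 or 9.9?"}],
    reasoning_effort="high",
    extra_body={"thinking": {"type": "enabled"}},
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta
    if getattr(delta, "reasoning_content", None):
        print(delta.reasoning_content, end="", flush=True)  # reasoning trace
    elif delta.content:
        print(delta.content, end="", flush=True)  # final answer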
Example 4: Thinking mode (reasoning_effort)
V4 collapses the old “chat vs reasoner” split into a parameter on either model. Both variants support a 1M-token context and three reasoning modes: non-thinking, thinking, thinking_max. You enable thinking with reasoning_effort plus the thinking flag passed through extra_body:
resp = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[{"role": "user",
               "content": "Plan a zero-downtime migration from MySQL 5.7 to 8.0."}],
    reasoning_effort="high",
    extra_body={"thinking": {"type": "enabled"}},
    max_tokens=8192,
)
msg = resp.choices[0].message
print("REASONING:\n", msg.reasoning_content)
print("ANSWER:\n", msg.content)
The response returns reasoning_content alongside the final content. Use reasoning_effort="max" for the heaviest workloads — that mode requires max_model_len >= 393216 to avoid truncation. Keep thinking off for simple chat: it spends output tokens on a reasoning trace you may not need.
Example 5: JSON mode
For structured extraction set response_format={"type": "json_object"}. JSON mode is designed to return valid JSON, not guaranteed — three rules apply:
- Include the word “json” in the system or user prompt.
- Show a small example of the schema you want.
- Set max_tokens high enough that output cannot be truncated mid-object.
SYSTEM = """Extract invoice fields as json.
Example:
{"invoice_no": "INV-001", "total_usd": 42.50, "line_items": 3}
"""
resp = client.chat.completions.create(
model="deepseek-v4-flash",
messages=[
{"role": "system", "content": SYSTEM},
{"role": "user", "content": open("invoice.txt").read()},
],
response_format={"type": "json_object"},
max_tokens=2048,
temperature=0.0,
)
import json
data = json.loads(resp.choices[0].message.content or "{}")
Handle the empty-content case explicitly — that or "{}" exists for a reason. More patterns in DeepSeek API JSON mode.
Example 6: Tool calling (function calling)
Tool calling uses the OpenAI-compatible schema, supported in both thinking and non-thinking modes:
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Weather in Dublin?"}],
    tools=tools,
)
call = resp.choices[0].message.tool_calls[0]
print(call.function.name, call.function.arguments)
Resolve the call locally, then send the result back as a {"role": "tool", "tool_call_id": ..., "content": ...} message and call again. See DeepSeek API function calling for the full round-trip.
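A sketch of that second leg, continuing from the snippet above (get_weather here is a stand-in for your real lookup):

import json

def get_weather(city: str) -> str:
    return f"14 C, light rain in {city}"  # stand-in for a real weather API

args = json.loads(call.function.arguments)
followup = [
    {"role": "user", "content": "Weather in Dublin?"},
    resp.choices[0].message,  # the assistant turn that requested the tool
    {"role": "tool", "tool_call_id": call.id, "content": get_weather(**args)},
]
final = client.chat.completions.create(model="deepseek-v4-flash", messages=followup)
print(final.choices[0].message.content)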
Example 7: FIM completion (Beta)
Fill-In-the-Middle is for code editors and autocomplete. It is non-thinking-mode only and currently Beta — point at the Beta base URL:
beta = OpenAI(api_key=os.environ["DEEPSEEK_API_KEY"],
              base_url="https://api.deepseek.com/beta")

resp = beta.completions.create(
    model="deepseek-v4-flash",
    prompt="def fibonacci(n):\n    if n < 2:\n        return n\n    return ",
    suffix="\n\nprint(fibonacci(10))",
    max_tokens=64,
)
print(resp.choices[0].text)
Example 8: Chat Prefix Completion (Beta)
When you need the model to continue from a specific opening, set prefix: True on the trailing assistant message and hit the Beta endpoint:
messages = [
    {"role": "user", "content": "Write quicksort in Python."},
    {"role": "assistant", "content": "```python\n", "prefix": True},
]
resp = beta.chat.completions.create(
    model="deepseek-v4-pro",
    messages=messages,
    stop=["```"],
)
print(resp.choices[0].message.content)
Quick parameter reference
| Parameter | Type | Notes |
|---|---|---|
| `model` | string | `deepseek-v4-flash` or `deepseek-v4-pro` |
| `temperature` | 0.0–2.0 | 0.0 code/math, 1.0 analysis, 1.3 chat, 1.5 creative |
| `top_p` | 0–1 | Nucleus sampling; alternative to temperature |
| `max_tokens` | int | Up to 384,000 on V4; required when using JSON mode |
| `reasoning_effort` | string | `"high"` or `"max"`; pair with the thinking flag |
| `stream` | bool | SSE chunks; reasoning streams separately when enabled |
| `response_format` | object | `{"type": "json_object"}` for JSON mode |
| `tools` | array | OpenAI-shaped function definitions |
Cost worked example — both tiers
Pricing as of April 2026, from DeepSeek's official pricing page: $0.14/million input tokens and $0.28/million output for Flash, and $1.74/million input and $3.48/million output for Pro. Cache-hit input is several times to an order of magnitude cheaper, depending on tier.
Workload: 1,000,000 calls per month with a 2,000-token system prompt (cached), 200-token user message (uncached), 300-token response.
| Token bucket | Volume | V4-Flash | V4-Pro |
|---|---|---|---|
| Cached input | 2.0B tokens | $56.00 | $290.00 |
| Uncached input | 0.2B tokens | $28.00 | $348.00 |
| Output | 0.3B tokens | $84.00 | $1,044.00 |
| Total / month | | $168.00 | $1,682.00 |
Two things people get wrong: skipping the uncached-input row (each new user message misses the cache against the system prefix), and mixing tiers inside one estimate. Pick one and stick with it. Off-peak discounts ended on September 5, 2025 and have not returned with V4. For interactive estimates use the DeepSeek pricing calculator.
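The same arithmetic in code. The per-million cache-hit prices are back-solved from the table above, not published numbers:

# $/million tokens; "hit" rates inferred from the cached-input row above
PRICES = {
    "v4-flash": {"hit": 0.028, "miss": 0.14, "out": 0.28},
    "v4-pro":   {"hit": 0.145, "miss": 1.74, "out": 3.48},
}
# monthly volume in millions of tokens:
# 1M calls x (2,000 cached + 200 fresh input + 300 output)
cached_m, fresh_m, out_m = 2_000, 200, 300

for tier, p in PRICES.items():
    total = cached_m * p["hit"] + fresh_m * p["miss"] + out_m * p["out"]
    print(f"{tier}: ${total:,.2f}/month")
# v4-flash: $168.00/month
# v4-pro: $1,682.00/month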
Error handling pattern
The API returns OpenAI-compatible error envelopes. The codes you will hit most often:
- `401 Unauthorized` — bad or missing API key.
- `402 Insufficient Balance` — top up the account.
- `429 Too Many Requests` — back off and retry with jitter.
- `finish_reason="length"` — output was truncated; raise `max_tokens`.
- `finish_reason="content_filter"` — request blocked by the safety layer.
import time, random
from openai import RateLimitError, APIStatusError

def call_with_retry(**kw):
    for attempt in range(5):
        try:
            return client.chat.completions.create(**kw)
        except RateLimitError:
            # 429: exponential backoff with jitter
            time.sleep((2 ** attempt) + random.random())
        except APIStatusError as e:
            # only APIStatusError carries status_code; retry 5xx, surface the rest
            if 500 <= e.status_code < 600:
                time.sleep(2 ** attempt)
                continue
            raise
    raise RuntimeError("exhausted retries")
For a complete code-by-code playbook see DeepSeek API error codes.
Where to go next
If you are still wiring up authentication, start with our walk-through on how to get a DeepSeek API key, then the DeepSeek API getting started tutorial. For broader context on every endpoint, parameter and rate-limit tier, the DeepSeek API docs and guides hub indexes the lot. If you need a JavaScript counterpart to the snippets above, see DeepSeek Node.js integration.
Last verified: 2026-04-25. DeepSeek AI Guide is an independent resource and is not affiliated with DeepSeek or its parent company. Model IDs, pricing and API behaviour change; check the official DeepSeek documentation and pricing page before committing to a production decision.
Frequently asked questions
How do I make my first DeepSeek API call?
Get an API key from the DeepSeek platform console, install the OpenAI SDK (pip install openai), and instantiate the client with base_url="https://api.deepseek.com". Then call client.chat.completions.create with model="deepseek-v4-flash" and a messages array. The Example 1 snippet above is a working minimal call. Full setup steps live in our DeepSeek API getting started tutorial.
What is the difference between deepseek-v4-flash and deepseek-v4-pro?
Flash is 284B total / 13B active parameters and costs $0.14 input miss / $0.28 output per million tokens. Pro is 1.6T / 49B active and costs $1.74 / $3.48. Both share the same 1M-token context, thinking modes, JSON mode, tool calling and streaming. Choose Flash for chat and high-volume work, Pro for frontier-tier coding and agentic tasks. Compare in detail at the DeepSeek models hub.
Does the DeepSeek API remember previous messages?
No. The API is stateless: every POST /chat/completions request must include the full conversation history in messages. The web chat and mobile app maintain session history server-side, but the developer API does not. Repeated prefixes are billed at the cheaper cache-hit rate automatically. See DeepSeek context caching for how to structure prompts so the cache fires.
Can I use the OpenAI Python SDK with DeepSeek?
Yes. DeepSeek’s API matches the OpenAI Chat Completions wire format. Change only base_url to https://api.deepseek.com and supply your DeepSeek API key — every existing call site keeps working. DeepSeek also exposes an Anthropic-compatible surface for teams already using that SDK. The compatibility caveats are catalogued in our DeepSeek OpenAI SDK compatibility notes.
Why does DeepSeek return reasoning_content in some responses?
When you enable thinking mode with reasoning_effort="high" and extra_body={"thinking": {"type": "enabled"}}, the model returns reasoning_content alongside the final content. The reasoning field holds the chain the model worked through; the content field is the user-facing answer. Legacy deepseek-reasoner produced the same shape and retires on July 24, 2026. More on prompt design in our DeepSeek prompt engineering guide.
