DeepSeek API Getting Started: A Hands-On V4 Quickstart
If you have an OpenAI or Anthropic client already wired up, the DeepSeek API getting-started path is short enough to finish before your coffee cools: change a base URL, paste an API key, pick a model ID, and ship a request. That’s the good news. The rest of this guide covers what the quickstart pages usually gloss over — which of the two V4 tiers to pick, how thinking mode actually toggles, why the API is stateless even though the web chat isn’t, and how to cost a workload without surprise bills.
I run DeepSeek V4-Flash and V4-Pro in production alongside Claude and GPT-5, so every recommendation below is grounded in live usage as of April 2026. By the end you’ll have a working Python call, a correct cost estimate, and a migration plan for legacy model IDs.
What you’ll build in this guide
Three concrete outcomes: a working POST /chat/completions request in Python against deepseek-v4-flash, a second call that enables thinking mode on deepseek-v4-pro, and a cost estimate for a realistic workload with cache-hit, cache-miss and output tokens broken out. Everything uses the official OpenAI SDK — DeepSeek’s chat surface is wire-compatible, so you do not need a DeepSeek-specific client library.
A short orientation before the steps. The current generation is DeepSeek V4, released on April 24, 2026, and it ships as two open-weight MoE models under the MIT license: deepseek-v4-pro (1.6T total / 49B active, frontier tier) and deepseek-v4-flash (284B / 13B active, cost-efficient tier). Both share the same feature set, a 1,000,000-token context window, and up to 384,000 tokens of output. Thinking mode is a request parameter on either tier, not a separate model ID.
Prerequisites
- Python 3.9+ with `pip` available, or Node.js 18+ if you prefer JavaScript.
- A DeepSeek platform account and an API key. If you do not have one yet, walk through our companion piece on how to get a DeepSeek API key first.
- About $1 of credit on the billing console. Realistic Flash-tier testing costs pennies; keep Pro-tier experiments short until you’ve seen the cost math in Step 6.
- Familiarity with environment variables. Never hardcode the key in a file you might commit.
Step 1: Install the OpenAI SDK
Because DeepSeek exposes an OpenAI-compatible wire format, any OpenAI client works once you redirect it. Install the official SDK with pip:
```bash
pip install openai
```
Node.js users can run npm install openai instead. DeepSeek also ships an Anthropic-compatible surface at the same base URL, so the Anthropic SDK is a drop-in alternative if your codebase already uses it. See our notes on DeepSeek OpenAI SDK compatibility for the parameter-by-parameter mapping.
Step 2: Store your API key safely
Export the key to your shell so it lives outside your source tree:
```bash
export DEEPSEEK_API_KEY="sk-..."
```
On Windows PowerShell, use $env:DEEPSEEK_API_KEY = "sk-...". For longer-term setups, put it in a .env file that is in .gitignore. If the key ever leaks, rotate it from the billing console immediately — there’s more detail in our writeup on DeepSeek API authentication.
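If you go the `.env` route, python-dotenv is the conventional loader. A minimal sketch (the `load_dotenv()` call comes from python-dotenv itself, nothing DeepSeek-specific):

```python
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads DEEPSEEK_API_KEY=... from .env into os.environ
api_key = os.environ["DEEPSEEK_API_KEY"]
```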
Step 3: Make your first request
The canonical chat endpoint is POST /chat/completions, reached via base_url="https://api.deepseek.com". Here’s a minimal Python script that sends a non-thinking request to V4-Flash:
```python
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepseek.com",
    api_key=os.environ["DEEPSEEK_API_KEY"],
)

resp = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[
        {"role": "system", "content": "You are a terse assistant."},
        {"role": "user", "content": "Summarise MoE architecture in two sentences."},
    ],
    temperature=1.3,
    max_tokens=400,
)

print(resp.choices[0].message.content)
```
If you prefer curl, the same call looks like this:
```bash
curl https://api.deepseek.com/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPSEEK_API_KEY" \
  -d '{
    "model": "deepseek-v4-flash",
    "messages": [{"role":"user","content":"Hello"}],
    "temperature": 1.3
  }'
```
Picking the right temperature
DeepSeek publishes task-specific guidance for temperature, and matching it saves you a lot of prompt tuning:
- 0.0 — code generation and mathematics.
- 1.0 — data analysis and data cleaning.
- 1.3 — general conversation and translation.
- 1.5 — creative writing and poetry.
You can use top_p as an alternative or complement, and max_tokens to cap output length — especially important with JSON mode, where truncation produces invalid output.
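If your application mixes task types, it can help to centralise the published guidance in one map and pick the temperature per request. The task labels below are our own, purely illustrative:

```python
# Temperature guidance from above, keyed by task type (labels are illustrative).
TASK_TEMPERATURE = {
    "code": 0.0,
    "math": 0.0,
    "data_analysis": 1.0,
    "conversation": 1.3,
    "translation": 1.3,
    "creative_writing": 1.5,
}

resp = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Translate 'good morning' into French."}],
    temperature=TASK_TEMPERATURE["translation"],
    max_tokens=100,
)
```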
Step 4: Enable thinking mode on V4-Pro
Thinking mode is a request parameter, not a separate model ID. Both V4 models accept three reasoning-effort settings: non-thinking (default), reasoning_effort="high" with extra_body={"thinking": {"type": "enabled"}}, or reasoning_effort="max" for maximum-effort thinking. When thinking is enabled the API returns reasoning_content alongside the final content.
```python
resp = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[{"role": "user", "content": "Plan a zero-downtime Postgres 14-to-16 upgrade."}],
    reasoning_effort="high",
    extra_body={"thinking": {"type": "enabled"}},
)

print("Reasoning:", resp.choices[0].message.reasoning_content)
print("Answer:", resp.choices[0].message.content)
```
For reasoning_effort="max", ensure your runtime allows a max_model_len of at least 393,216 tokens (384K) so the reasoning trace is not truncated. Use non-thinking mode for latency-sensitive or high-volume chat; use thinking mode when the task has real planning, multi-step reasoning, or agentic tool calls.
Step 5: Know which model ID to send
V4 reshaped the model picker. Here’s the current state, including the legacy IDs that are still accepted during the migration window:
| Model ID | Tier | Params (total / active) or routing | Status |
|---|---|---|---|
| `deepseek-v4-pro` | Frontier | 1.6T / 49B | Current |
| `deepseek-v4-flash` | Cost-efficient | 284B / 13B | Current (recommended default) |
| `deepseek-chat` | Legacy alias | Routes to V4-Flash (non-thinking) | Retires 2026-07-24 15:59 UTC |
| `deepseek-reasoner` | Legacy alias | Routes to V4-Flash (thinking) | Retires 2026-07-24 15:59 UTC |
If you have an integration built around the legacy IDs, migration is a one-line model= swap — base_url does not change. Make the change before the retirement deadline to avoid failed requests. For deep background on the architecture shift, see our DeepSeek V4 explainer.
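Concretely, the two swaps look like this, reusing the client from Step 3. The second mapping adds the Step 4 thinking parameters so a former deepseek-reasoner call keeps its reasoning behaviour:

```python
msgs = [{"role": "user", "content": "Hello"}]

# deepseek-chat -> deepseek-v4-flash (same non-thinking behaviour)
resp = client.chat.completions.create(model="deepseek-v4-flash", messages=msgs)

# deepseek-reasoner -> deepseek-v4-flash with thinking enabled
resp = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=msgs,
    reasoning_effort="high",
    extra_body={"thinking": {"type": "enabled"}},
)
```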
Step 6: Cost your workload correctly
V4-Flash and V4-Pro have three token buckets each: cache-hit input, cache-miss input, and output. Omitting any of them — particularly the uncached user message on each call — produces an optimistic estimate that reality will not honour.
V4-Flash rates (per 1M tokens, as of April 2026)
| Bucket | Rate |
|---|---|
| Input, cache hit | $0.028 |
| Input, cache miss | $0.14 |
| Output | $0.28 |
V4-Pro rates (per 1M tokens, as of April 2026)
| Bucket | Rate |
|---|---|
| Input, cache hit | $0.145 |
| Input, cache miss | $1.74 |
| Output | $3.48 |
Worked example: 1,000,000 V4-Flash calls
Assume a 2,000-token cached system prompt, 200 uncached user-message tokens per call, and a 300-token response:
- Cached input: 2,000 × 1,000,000 = 2,000,000,000 tokens × $0.028/M = $56.00
- Uncached input: 200 × 1,000,000 = 200,000,000 tokens × $0.14/M = $28.00
- Output: 300 × 1,000,000 = 300,000,000 tokens × $0.28/M = $84.00
- Total: $168.00
The same workload on deepseek-v4-pro would cost $290 + $348 + $1,044 = $1,682.00 — roughly an order of magnitude more. Pick Pro when the frontier benchmark lift justifies the spend (agentic coding, deep reasoning) and stick with Flash for chat, summarisation, and standard tool-calling. Note that off-peak discounts are no longer active — DeepSeek ended them on September 5, 2025. Check the live DeepSeek API pricing page before committing numbers to a business case, and our DeepSeek pricing calculator for arbitrary workloads.
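To rerun this math for your own traffic, a small helper keeps the three buckets explicit. This is plain arithmetic over the rates tabled above (April 2026 numbers that will drift), not an API call:

```python
# Per-1M-token rates from the tables above (April 2026; check the live pricing page).
RATES = {
    "deepseek-v4-flash": {"cache_hit": 0.028, "cache_miss": 0.14, "output": 0.28},
    "deepseek-v4-pro": {"cache_hit": 0.145, "cache_miss": 1.74, "output": 3.48},
}

def workload_cost(model: str, calls: int, cached_in: int, uncached_in: int, out: int) -> float:
    """Estimated USD cost: per-call token counts in each bucket, times call volume."""
    r = RATES[model]
    per_call = cached_in * r["cache_hit"] + uncached_in * r["cache_miss"] + out * r["output"]
    return calls * per_call / 1_000_000

# The worked example above:
print(workload_cost("deepseek-v4-flash", 1_000_000, 2_000, 200, 300))  # ≈ 168.0
print(workload_cost("deepseek-v4-pro", 1_000_000, 2_000, 200, 300))    # ≈ 1682.0
```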
Step 7: Handle statelessness
The API is stateless — you must resend the conversation history with each request. That contrasts with the web chat and mobile app, which maintain session history for you. A multi-turn pattern looks like this:
```python
history = [{"role": "system", "content": "You are a terse assistant."}]

def turn(user_text):
    history.append({"role": "user", "content": user_text})
    resp = client.chat.completions.create(
        model="deepseek-v4-flash",
        messages=history,
    )
    reply = resp.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply
```
Because DeepSeek bills cache hits at a fraction of the miss rate, repeated stable prefixes — a long system prompt, a retrieval preamble, a tool schema — get cheap on subsequent calls. Our guide to DeepSeek context caching covers the prefix rules.
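To watch caching work, compare the usage block across consecutive calls that share a prefix. The standard OpenAI usage fields are reliable; the cache-split field names below follow what DeepSeek's earlier API versions reported and are an assumption here, so confirm them against the current API reference:

```python
resp = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=history,
)
u = resp.usage
print("prompt:", u.prompt_tokens, "completion:", u.completion_tokens)
# Cache-split fields: assumed names, carried over from earlier DeepSeek API versions.
print("cache hit:", getattr(u, "prompt_cache_hit_tokens", "n/a"))
print("cache miss:", getattr(u, "prompt_cache_miss_tokens", "n/a"))
```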
Step 8: Add JSON mode, tools, and streaming when you need them
Three features cover most production needs beyond simple chat:
- JSON mode. Set `response_format={"type": "json_object"}`. It is designed to return valid JSON, not guaranteed. Include the word “json” plus a small example schema in the prompt, set `max_tokens` high enough to avoid truncation, and handle occasional empty content gracefully.
- Tool calling. Declare tools in the OpenAI-compatible format; supported in both thinking and non-thinking modes. Useful for routing to search, calculators, or internal APIs.
- Streaming. Pass `stream=True` to get server-sent-events chunks. When thinking is enabled, reasoning content streams alongside the final answer — render them into separate UI panes if you show them to end users. JSON mode and streaming are both shown in the sketch after this list.
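Here is a minimal sketch of JSON mode and streaming together, using the client from Step 3. The prompt deliberately contains the word “json” and a tiny schema example, per the guidance above:

```python
import json

# JSON mode: ask for machine-readable output and parse defensively.
resp = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{
        "role": "user",
        "content": 'Return json like {"language": "...", "year": 1991}: when was Python first released?',
    }],
    response_format={"type": "json_object"},
    max_tokens=200,  # a generous cap: truncated JSON is invalid JSON
)
raw = resp.choices[0].message.content
data = json.loads(raw) if raw else {}  # tolerate the occasional empty-content edge case

# Streaming: print tokens as they arrive.
stream = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "One-sentence history of SQL."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
```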
Two Beta features are worth knowing about: FIM (Fill-In-the-Middle) completion for code use cases (non-thinking mode only, so call V4-Flash or V4-Pro without reasoning_effort), and Chat Prefix Completion for continuation-style prompts.
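DeepSeek’s previous generation served FIM through a completions-style Beta surface (base URL suffixed with /beta, a prompt plus a suffix). The sketch below assumes V4 keeps that shape; treat it as unverified and check the official Beta docs before depending on it:

```python
import os

from openai import OpenAI

# Assumption: V4 keeps the /beta FIM surface from earlier DeepSeek releases.
beta = OpenAI(
    base_url="https://api.deepseek.com/beta",
    api_key=os.environ["DEEPSEEK_API_KEY"],
)
resp = beta.completions.create(
    model="deepseek-v4-flash",   # FIM is non-thinking only
    prompt="def fib(n):\n    ",  # code before the gap
    suffix="\n    return a",     # code after the gap
    max_tokens=128,
)
print(resp.choices[0].text)
```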
Verify it worked
A quick checklist after your first successful call:
- Response status was 200 and `choices[0].message.content` is non-empty.
- The `usage` field reports token counts you can reconcile with the billing console.
- For thinking-mode calls, `reasoning_content` is populated in addition to `content`.
- You can reproduce the cost math in Step 6 against the `usage` numbers.
Common errors and fixes
| Symptom | Likely cause | Fix |
|---|---|---|
| `401 Unauthorized` | Missing or wrong API key | Re-export `DEEPSEEK_API_KEY`; confirm the key is active in the console. |
| `402 Insufficient Balance` | Zero granted balance and no top-up | Top up from the billing console; DeepSeek may offer a granted balance that can expire, so verify your current balance there. |
| `429 Rate Limited` | Burst above the per-minute cap | Add exponential backoff; see our DeepSeek API rate limits guide. |
| Empty JSON output | JSON mode edge case or truncation | Raise `max_tokens`, include “json” and a schema example, retry once on empty content. |
| Legacy ID failing after July 2026 | `deepseek-chat` or `deepseek-reasoner` past the 2026-07-24 15:59 UTC retirement | Swap to `deepseek-v4-flash` or `deepseek-v4-pro`. |
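For the 429 row, a minimal exponential-backoff wrapper looks like this, reusing the client from Step 3. The OpenAI SDK raises `RateLimitError` on HTTP 429; the retry count and delays are illustrative defaults, not DeepSeek guidance:

```python
import time

from openai import RateLimitError

def create_with_backoff(max_retries: int = 5, **kwargs):
    """Retry chat.completions.create on 429, doubling the wait each attempt."""
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(**kwargs)
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # 1s, 2s, 4s, 8s, ...

resp = create_with_backoff(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Hello"}],
)
```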
Next steps
You now have a working integration, a correct cost model, and a migration plan. From here, two directions are worth your time: harden the developer ergonomics (streaming, tool calling, JSON mode) with the examples in our DeepSeek API code examples, or push into agentic and RAG workloads with the DeepSeek RAG tutorial. If you want the full landscape of developer resources, the DeepSeek tutorials hub has the rest of the sequence.
Last verified: 2026-04-24. DeepSeek AI Guide is an independent resource and is not affiliated with DeepSeek or its parent company. Model IDs, pricing and API behaviour change; check the official DeepSeek documentation and pricing page before committing to a production decision.
How do I get started with the DeepSeek API in the simplest way?
Install the OpenAI SDK, set base_url="https://api.deepseek.com" and your API key, then send a POST /chat/completions request with model="deepseek-v4-flash". That’s the whole path. Full walkthroughs are in our DeepSeek API documentation hub, and key creation is covered in get a DeepSeek API key.
What’s the difference between deepseek-v4-flash and deepseek-v4-pro?
V4-Flash (284B total / 13B active) is the cost-efficient tier and the recommended default for chat, summarisation, and standard tool calling. V4-Pro (1.6T / 49B) is the frontier tier — roughly 12× more expensive on output ($3.48 vs $0.28 per 1M tokens) — for agentic coding and deep reasoning. Both share the 1M-token context and the same feature set. Our DeepSeek V4-Flash and DeepSeek V4-Pro pages go deeper.
Is the DeepSeek API stateless like OpenAI’s?
Yes. The API is stateless — you must resend the full conversation history with every request. The web chat and mobile app keep session history for you, but the API does not. This is why context caching matters: cached prefixes drop from $0.14 to $0.028 per 1M input tokens on V4-Flash. See DeepSeek context caching for the prefix rules.
Can I keep using deepseek-chat and deepseek-reasoner in my existing code?
Yes, but only until 2026-07-24 at 15:59 UTC. Both legacy IDs currently route to deepseek-v4-flash — deepseek-chat in non-thinking mode, deepseek-reasoner in thinking mode. After the deadline they stop working. Migration is a one-line model= change; the base URL does not move. Background in our DeepSeek latest updates.
How do I enable thinking mode and what does the response look like?
Set reasoning_effort="high" and extra_body={"thinking": {"type": "enabled"}}, or use reasoning_effort="max" for maximum effort (which requires a 384K max_model_len). The response returns reasoning_content alongside the final content. Works on both V4 tiers. For prompt patterns that take advantage of it, see DeepSeek prompt engineering.
