DeepSeek Python Integration with V4: A Practical Tutorial

Set up DeepSeek Python integration with the OpenAI SDK, V4 model IDs, streaming, JSON mode and cost math. Build your first call now.

Tutorials·April 25, 2026·By DS Guide Editorial

You have a DeepSeek API key, an existing Python project, and a question that gets in the way of every first call: do you really need a new SDK, or does the OpenAI client already cover it? This tutorial answers that, then walks through a complete DeepSeek Python integration against the V4 generation — the family that replaced `deepseek-chat` and `deepseek-reasoner` on April 24, 2026. By the end you will have a working client, a streaming example, a JSON-mode call that does not silently truncate, a thinking-mode pattern, and a cost calculation you can paste into a finance review. Every snippet is copy-paste runnable on Python 3.10 or later.

What you will build

You will build a small Python module that talks to the DeepSeek API using the official OpenAI SDK — no DeepSeek-specific package needed. The module will support both V4 model tiers, switch between non-thinking and thinking mode with a single argument, stream tokens, request structured JSON, and log token usage so you can keep an eye on spend. The whole thing is roughly 120 lines.

This is the same shape I run in production today, after migrating off the legacy deepseek-chat and deepseek-reasoner IDs. Those names will be fully retired and inaccessible after July 24, 2026, 15:59 UTC, so any new DeepSeek Python integration should target deepseek-v4-pro or deepseek-v4-flash from day one. If you have not picked up an API key yet, the get a DeepSeek API key walkthrough covers the platform.deepseek.com signup and the first top-up.

Prerequisites

  • Python 3.10+ (3.11 or 3.12 recommended for better async behaviour).
  • OpenAI Python SDK v1.40 or later: pip install "openai>=1.40,<2.0".
  • A DeepSeek API key with at least a small balance — calls return HTTP 402 if the account has zero balance.
  • An environment variable DEEPSEEK_API_KEY. Do not paste keys into source files.
  • Optional but recommended: python-dotenv for local development; httpx already ships with the OpenAI SDK.

Why the OpenAI SDK works

The DeepSeek API uses an OpenAI-compatible format (an Anthropic-compatible surface is also available), so any SDK or software that speaks the OpenAI API can reach DeepSeek with a configuration change. In practice that means swapping two values — base_url and api_key — and pointing the model field at a DeepSeek ID. Chat requests hit POST /chat/completions, the OpenAI-compatible endpoint, with the same JSON body shape you already know. The full reference lives in the DeepSeek OpenAI SDK compatibility guide.

If you prefer the Anthropic SDK, that works too: DeepSeek exposes an Anthropic-compatible surface against the same base URL. This article sticks with the OpenAI client because most existing Python codebases already depend on it.

Step 1 — Install the SDK and set environment variables

Create a virtual environment and install the OpenAI client. The shell commands below are bash-flavoured; PowerShell users should use $env:DEEPSEEK_API_KEY syntax.

python -m venv .venv
source .venv/bin/activate
pip install "openai>=1.40,<2.0" python-dotenv
export DEEPSEEK_API_KEY="sk-..."

For project-level config, put the key in a .env file at the repo root and add .env to .gitignore. The DeepSeek API authentication guide covers key rotation and revocation in more detail.
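If you go the .env route, loading it at startup is a two-liner. A minimal sketch, assuming python-dotenv is installed and the .env file sits in the directory you run from:

import os

from dotenv import load_dotenv

load_dotenv()  # reads DEEPSEEK_API_KEY (and anything else) from .env
api_key = os.environ["DEEPSEEK_API_KEY"]  # fail fast if the key is still missing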

Step 2 — Your first DeepSeek Python integration call

Save the following Python script as hello_deepseek.py. It targets deepseek-v4-flash, which is the cost-efficient default I recommend for most workloads.

import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com",
)

response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[
        {"role": "system", "content": "You are a precise technical assistant."},
        {"role": "user", "content": "In one sentence, what is a Mixture-of-Experts model?"},
    ],
    temperature=0.0,
    max_tokens=200,
)

print(response.choices[0].message.content)
print("usage:", response.usage)

Run it with python hello_deepseek.py. You should see a one-sentence answer plus a usage block listing prompt, completion and total tokens.

Picking V4-Flash vs V4-Pro

The two V4 model IDs share the same API surface but target different price points. V4-Pro packs 1.6 trillion total parameters with 49 billion activated per token; V4-Flash is the efficient sibling at 284 billion total / 13 billion active. Both support native 1M context, and both are open weights.

| Field | deepseek-v4-flash | deepseek-v4-pro |
| --- | --- | --- |
| Total / active params | 284B / 13B | 1.6T / 49B |
| Default context | 1,000,000 tokens | 1,000,000 tokens |
| Max output | 384,000 tokens | 384,000 tokens |
| Input (cache hit), per 1M | $0.028 | $0.145 |
| Input (cache miss), per 1M | $0.14 | $1.74 |
| Output, per 1M | $0.28 | $3.48 |
| Best for | Chat, classification, RAG | Frontier coding, agents |

Pricing as of April 2026, per the official DeepSeek pricing page. At cache-miss rates, Flash costs $0.14 per million input tokens and $0.28 per million output tokens; Pro costs $1.74 and $3.48. Pro is roughly 12× more expensive on output, so default to Flash and only escalate when you have a measured quality lift.

Step 3 — Multi-turn conversations (the API is stateless)

This trips up almost everyone moving from the chat app to the API. The web chat at chat.deepseek.com keeps your history server-side. The API does not. You must resend the full messages array on every request.

def chat_loop():
    history = [{"role": "system", "content": "You are a precise technical assistant."}]
    while True:
        user_input = input("you: ").strip()
        if not user_input:
            break
        history.append({"role": "user", "content": user_input})
        resp = client.chat.completions.create(
            model="deepseek-v4-flash",
            messages=history,
            temperature=1.3,  # general conversation
            max_tokens=800,
        )
        reply = resp.choices[0].message.content
        history.append({"role": "assistant", "content": reply})
        print(f"deepseek: {reply}\n")

Two practical notes. First, history grows on every turn — once it threatens to exceed your budget, summarise older turns into a single system message. Second, the temperature recommendations DeepSeek publishes are: 0.0 for code and maths, 1.0 for data analysis, 1.3 for general chat and translation, 1.5 for creative writing. Use them as a starting point, not gospel.
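Here is one way to do the summarisation trick — a sketch only; the trim_history name, the 20-message threshold, and the keep-the-last-eight-turns rule are my own choices, not DeepSeek guidance:

def trim_history(history, max_messages=20):
    """Collapse older turns into one summary message once history gets long."""
    if len(history) <= max_messages:
        return history
    old_turns = history[1:-8]  # keep the system prompt and the last 8 turns verbatim
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in old_turns)
    summary = client.chat.completions.create(
        model="deepseek-v4-flash",
        messages=[
            {"role": "system", "content": "Summarise this conversation in under 150 words."},
            {"role": "user", "content": transcript},
        ],
        temperature=0.0,
        max_tokens=300,
    ).choices[0].message.content
    return [
        history[0],
        {"role": "system", "content": f"Summary of earlier turns: {summary}"},
        *history[-8:],
    ]

Call history = trim_history(history) once per loop iteration, before the main request, and the prompt stays bounded no matter how long the session runs.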

Step 4 — Streaming responses

For anything user-facing, stream. It cuts perceived latency from “the model is broken” to “the model is fast” without any actual speedup.

stream = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Explain CSA attention in three short paragraphs."}],
    stream=True,
    max_tokens=600,
)

for chunk in stream:
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
print()

Server-sent events arrive as ChatCompletionChunk objects. Catch openai.APIConnectionError around the iterator if you are running over a flaky network — broken streams should not crash your worker. The DeepSeek API streaming reference walks through SSE framing and reconnection patterns.
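A minimal version of that guard, wrapping the same loop so a dropped connection yields partial text instead of a crashed worker:

import openai

collected = []
try:
    for chunk in stream:
        delta = chunk.choices[0].delta
        if delta.content:
            collected.append(delta.content)
            print(delta.content, end="", flush=True)
except openai.APIConnectionError as exc:
    # The SSE stream died mid-response; keep what arrived and let the caller retry.
    print(f"\n[stream interrupted: {exc}]")
partial_text = "".join(collected)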

Step 5 — Thinking mode

V4 collapses the old “chat vs reasoner” split into a single parameter: keep base_url as-is and point model at deepseek-v4-pro or deepseek-v4-flash — both support 1M context and both dual modes (thinking / non-thinking), over the OpenAI and Anthropic API surfaces alike. The model returns reasoning_content alongside the final content when thinking is on:

response = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[
        {"role": "user", "content": "A train leaves Chicago at 60mph; another leaves Denver at 75mph. They start 1,000 miles apart. When do they meet?"},
    ],
    reasoning_effort="high",
    extra_body={"thinking": {"type": "enabled"}},
    max_tokens=4000,
)

msg = response.choices[0].message
print("THINKING:", getattr(msg, "reasoning_content", None))
print("ANSWER:", msg.content)

Three settings are accepted: omit both arguments for non-thinking (fastest, cheapest); use reasoning_effort="high" with the thinking flag for standard reasoning; use reasoning_effort="max" for the heaviest setting. Max-effort thinking benefits from raising max_tokens well above the default — DeepSeek’s docs recommend a working window of at least 384K tokens to avoid truncating the reasoning trace. Legacy code using the old deepseek-chat or deepseek-reasoner IDs still works during the migration window (they currently route to deepseek-v4-flash non-thinking and thinking respectively), but both will be fully retired and inaccessible after July 24, 2026, 15:59 UTC.
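To make the mode a single argument, as promised at the top, a small wrapper helps. A sketch — the ask() name and the "off"/"high"/"max" strings are my conventions; the reasoning_effort and extra_body plumbing is the pattern shown above:

def ask(prompt, model="deepseek-v4-pro", mode="off", max_tokens=4000):
    """mode is "off" (non-thinking), "high", or "max" (heaviest reasoning)."""
    kwargs = {}
    if mode in ("high", "max"):
        kwargs["reasoning_effort"] = mode
        kwargs["extra_body"] = {"thinking": {"type": "enabled"}}
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
        **kwargs,
    )
    msg = resp.choices[0].message
    # reasoning_content is only present when thinking mode is on
    return getattr(msg, "reasoning_content", None), msg.content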

Step 6 — JSON mode without footguns

JSON mode is designed to return valid JSON, not guaranteed to. The API can return empty content or — more often — content that was truncated mid-object because max_tokens was too low. Three rules keep it boring:

  1. Set response_format={"type": "json_object"}.
  2. Include the word json in the system or user prompt, with a small example schema.
  3. Set max_tokens high enough that the response cannot truncate.

import json

schema_hint = """
Return JSON in this exact shape:
{"sentiment": "positive|negative|neutral", "confidence": 0.0-1.0, "key_phrases": ["..."]}
"""

resp = client.chat.completions.create(
    model="deepseek-v4-flash",
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": "You are a sentiment classifier. Respond only with JSON. " + schema_hint},
        {"role": "user", "content": "The dashboard finally loads in under a second. I'm impressed."},
    ],
    temperature=0.0,
    max_tokens=400,
)

raw = resp.choices[0].message.content
if not raw:
    raise RuntimeError("Empty JSON response — retry or fall back.")
result = json.loads(raw)
print(result)

For deeper patterns — including retry-on-empty and Pydantic validation — see the DeepSeek API JSON mode reference.
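If you want the retry to live in your own code, a minimal retry-on-empty sketch looks like this — it reuses client and schema_hint from above; the three-attempt cap and the backoff are arbitrary choices:

import json
import time

def classify_with_retry(text, attempts=3):
    for attempt in range(attempts):
        resp = client.chat.completions.create(
            model="deepseek-v4-flash",
            response_format={"type": "json_object"},
            messages=[
                {"role": "system", "content": "You are a sentiment classifier. Respond only with JSON. " + schema_hint},
                {"role": "user", "content": text},
            ],
            temperature=0.0,
            max_tokens=400,
        )
        choice = resp.choices[0]
        raw = choice.message.content
        # Reject empty responses and anything cut off by the token budget.
        if raw and choice.finish_reason != "length":
            try:
                return json.loads(raw)
            except json.JSONDecodeError:
                pass  # malformed JSON; fall through and retry
        time.sleep(2 ** attempt)
    raise RuntimeError("JSON mode failed after retries")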

Step 7 — Tool calling

Tool calling uses the OpenAI tools array verbatim. Both V4 tiers support it in non-thinking and thinking mode.

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "What's the weather in Dublin?"}],
    tools=tools,
)

call = resp.choices[0].message.tool_calls[0]
print(call.function.name, call.function.arguments)

The model returns a tool_calls entry; your code executes the function, appends a {"role": "tool", ...} message with the result, and calls the API again. The DeepSeek API function calling guide has a full agent loop example.
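A sketch of that full round trip — get_weather here is a local stub, and appending the assistant message is what lets the second call see its own tool request:

import json

def get_weather(city: str) -> str:
    return f"Sunny, 14°C in {city}"  # stub — replace with a real lookup

messages = [{"role": "user", "content": "What's the weather in Dublin?"}]
first = client.chat.completions.create(
    model="deepseek-v4-flash", messages=messages, tools=tools,
)
call = first.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)

messages.append(first.choices[0].message)  # the assistant turn containing the tool call
messages.append({
    "role": "tool",
    "tool_call_id": call.id,
    "content": get_weather(**args),
})
final = client.chat.completions.create(
    model="deepseek-v4-flash", messages=messages, tools=tools,
)
print(final.choices[0].message.content)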

Step 8 — Verify it worked

Run a smoke test that exercises every path you actually use in production. The minimum I run before deploying (a runnable sketch follows the list):

  • A non-streaming call returns a non-empty content and a usage block with non-zero token counts.
  • A streaming call yields at least one chunk with non-empty delta.content.
  • A JSON-mode call parses cleanly through json.loads.
  • A thinking-mode call exposes reasoning_content on the message object.
  • An invalid API key surfaces openai.AuthenticationError rather than crashing.
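A compact version of that checklist — it reuses client from Step 2 and skips the thinking-mode check to keep the run cheap; adapt models and prompts to your own paths:

import json
import openai

def smoke_test():
    # 1. Non-streaming call returns content and non-zero usage.
    r = client.chat.completions.create(
        model="deepseek-v4-flash",
        messages=[{"role": "user", "content": "Say OK."}],
        max_tokens=20,
    )
    assert r.choices[0].message.content and r.usage.total_tokens > 0

    # 2. Streaming yields at least one non-empty delta.
    stream = client.chat.completions.create(
        model="deepseek-v4-flash",
        messages=[{"role": "user", "content": "Say OK."}],
        stream=True, max_tokens=20,
    )
    assert any(c.choices[0].delta.content for c in stream if c.choices)

    # 3. JSON mode parses cleanly.
    r = client.chat.completions.create(
        model="deepseek-v4-flash",
        response_format={"type": "json_object"},
        messages=[{"role": "user", "content": 'Return json: {"ok": true}'}],
        max_tokens=50,
    )
    json.loads(r.choices[0].message.content)

    # 4. An invalid key raises AuthenticationError rather than crashing.
    bad = openai.OpenAI(api_key="sk-invalid", base_url="https://api.deepseek.com")
    try:
        bad.chat.completions.create(
            model="deepseek-v4-flash",
            messages=[{"role": "user", "content": "hi"}], max_tokens=5,
        )
        raise AssertionError("expected AuthenticationError")
    except openai.AuthenticationError:
        pass

smoke_test()
print("all smoke tests passed")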

Common errors and fixes

| Error | Likely cause | Fix |
| --- | --- | --- |
| 401 Unauthorized | Bad or missing key | Re-export DEEPSEEK_API_KEY; rotate if leaked. |
| 402 Insufficient Balance | Account at $0 | Top up at platform.deepseek.com; key is fine. |
| 429 Rate limit | Burst too high | Exponential backoff; cap concurrency. |
| Empty content in JSON mode | Prompt missing “json” / truncation | Add schema example; raise max_tokens. |
| Truncated answer mid-sentence | max_tokens too low | Raise it; check finish_reason == "length". |
| model_not_found | Old ID after retirement | Switch to deepseek-v4-flash or deepseek-v4-pro. |
| NotFoundError on reasoning_effort | OpenAI SDK too old | Upgrade to openai>=1.40. |

The DeepSeek API error codes reference lists every status code with retry guidance.

Costing your integration honestly

Every Python integration eventually faces the question “what does this cost at 1M calls a day?” Here is the worked example for deepseek-v4-flash with a 2,000-token system prompt that gets cached, a 200-token user message that does not, and a 300-token reply:

  • Cached input: 2,000 tokens × 1,000,000 calls = 2,000M tokens; 2,000M × $0.028/M = $56.00
  • Uncached input: 200 tokens × 1,000,000 calls = 200M tokens; 200M × $0.14/M = $28.00
  • Output: 300 tokens × 1,000,000 calls = 300M tokens; 300M × $0.28/M = $84.00
  • Total: $168.00 per 1M calls

The same workload on deepseek-v4-pro costs $290 + $348 + $1,044 = $1,682.00 — roughly 10× more. The trap is forgetting the uncached-input line: each new user message is a cache miss against the cached prefix, even when the system prompt is reused. Skip that line and you under-budget by roughly a sixth.
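The same arithmetic as a helper you can drop into a notebook. Prices are hard-coded from the April 2026 table above, so re-check them against the official pricing page before trusting the output:

# Per-1M-token prices from the April 2026 table above — verify before use.
PRICES = {
    "deepseek-v4-flash": {"cache_hit": 0.028, "cache_miss": 0.14, "output": 0.28},
    "deepseek-v4-pro":   {"cache_hit": 0.145, "cache_miss": 1.74, "output": 3.48},
}

def cost_usd(model, calls, cached_in, uncached_in, out_tokens):
    """Cost for `calls` requests with per-call token counts."""
    p = PRICES[model]
    millions = lambda per_call: calls * per_call / 1_000_000  # total tokens, in millions
    return (millions(cached_in) * p["cache_hit"]
            + millions(uncached_in) * p["cache_miss"]
            + millions(out_tokens) * p["output"])

print(cost_usd("deepseek-v4-flash", 1_000_000, 2000, 200, 300))  # 168.0
print(cost_usd("deepseek-v4-pro",   1_000_000, 2000, 200, 300))  # 1682.0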

Two mechanical levers reduce the bill: keep system prompts stable so the cache hit rate stays high (see DeepSeek context caching), and trim max_tokens to the actual output you need. Most production answers fit in 1,000–2,000 tokens; 384K is for niche long-form work.

Production hardening

Three habits separate prototypes from running services (a retry-and-concurrency sketch follows the list):

  1. Wrap every call in a retry helper that handles 429 and 5xx with exponential backoff. Do not retry 4xx — they are bugs in your code, not transient failures.
  2. Log token usage on every call. Ship prompt_tokens, completion_tokens, and any reasoning_tokens to your observability stack. An alert on a sudden reasoning-token spike catches drifted prompts before the bill arrives.
  3. Cap concurrency. The OpenAI SDK uses httpx under the hood; an asyncio.Semaphore around the async client prevents a runaway worker from melting your rate limit.
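A sketch of habits 1 and 3 combined — the retry count, backoff constants, and semaphore size are illustrative defaults, not DeepSeek recommendations:

import asyncio
import os
import random

import openai

aclient = openai.AsyncOpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com",
)
semaphore = asyncio.Semaphore(8)  # cap in-flight requests per worker

async def call_with_retry(messages, retries=5):
    async with semaphore:
        for attempt in range(retries):
            try:
                return await aclient.chat.completions.create(
                    model="deepseek-v4-flash", messages=messages, max_tokens=800,
                )
            except (openai.RateLimitError, openai.InternalServerError):
                # 429 and 5xx are transient; other 4xx errors propagate immediately.
                await asyncio.sleep(2 ** attempt + random.random())
        raise RuntimeError("exhausted retries")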

For framework integrations — agent loops, document Q&A, vector retrieval — there are dedicated tutorials for DeepSeek with LangChain and DeepSeek with LlamaIndex. Both libraries treat DeepSeek as a drop-in OpenAI-compatible provider.

Next steps

Once the basic integration is solid, the highest-leverage follow-ups are: deploying behind a Streamlit or FastAPI front-end, adding retrieval, or migrating an existing OpenAI codebase. Try the DeepSeek RAG tutorial for retrieval, the DeepSeek Streamlit app walkthrough for a quick UI, or browse the full set of DeepSeek tutorials for adjacent topics. If your codebase is JavaScript-first, the parallel DeepSeek Node.js integration guide mirrors this one.

Last verified: 2026-04-25. DeepSeek AI Guide is an independent resource and is not affiliated with DeepSeek or its parent company. Model IDs, pricing and API behaviour change; check the official DeepSeek documentation and pricing page before committing to a production decision.

Frequently asked questions

How do I install the DeepSeek Python SDK?

There is no separate DeepSeek Python SDK. DeepSeek’s API is OpenAI-compatible, so you install the OpenAI client with pip install "openai>=1.40,<2.0" and point it at https://api.deepseek.com with your DeepSeek API key. The same pattern works against the Anthropic SDK if you prefer that surface. The DeepSeek SDKs reference covers every supported client.

What model ID should I use in Python?

Use deepseek-v4-flash for chat, classification and most production workloads, or deepseek-v4-pro for frontier-tier coding and agentic tasks. The legacy deepseek-chat and deepseek-reasoner IDs still work but retire on July 24, 2026 at 15:59 UTC; until then they route to deepseek-v4-flash. See the DeepSeek V4 model page for the full lineage.

Does DeepSeek work with the OpenAI Python client without code changes?

Almost — you change two values. Set base_url="https://api.deepseek.com" on the OpenAI() constructor and pass your DeepSeek API key. The wire format is identical for chat completions, streaming, JSON mode, and tool calling. Only DeepSeek-specific parameters like reasoning_effort and the thinking flag require the extra_body argument. The DeepSeek OpenAI SDK compatibility guide details every edge case.

How do I enable thinking mode from Python?

Pass reasoning_effort="high" together with extra_body={"thinking": {"type": "enabled"}} on either V4 model. The response object then exposes reasoning_content alongside the final content. Use reasoning_effort="max" for the heaviest setting, and raise max_tokens so the reasoning trace is not truncated. The DeepSeek API best practices guide covers when each effort level is worth the latency cost.

Can I stream tokens with the DeepSeek Python integration?

Yes. Set stream=True on client.chat.completions.create() and iterate the returned object — each chunk is a ChatCompletionChunk with a delta.content field. When thinking mode is enabled, reasoning content streams alongside final content on separate fields. Wrap the iterator in try/except for APIConnectionError so flaky networks do not crash your worker. The DeepSeek API streaming reference walks through reconnection patterns and SSE framing.
