How to Use DeepSeek with LangChain: A V4 Python Tutorial

Tutorials·April 25, 2026·By DS Guide Editorial

You want a working LangChain pipeline that calls DeepSeek V4, returns clean structured output, and does not break the moment the legacy `deepseek-chat` ID retires. That’s a fair ask, and the docs scattered across LangChain, DeepSeek and a dozen blog posts still mostly show V3-era examples. This guide walks through using DeepSeek with LangChain end to end: installing the official `langchain-deepseek` package, wiring it to `deepseek-v4-flash` and `deepseek-v4-pro`, enabling thinking mode, streaming tokens, calling tools, and dropping the model into a small RAG chain. By the end you will have a Python script that runs, plus the cost numbers to decide which V4 tier to use in production.

What you’ll build

The goal is a single Python project that demonstrates the four patterns most teams actually need: a plain chat call, a streaming call, a tool-calling call, and a retrieval-augmented chain. All four use the same ChatDeepSeek object, just configured differently. By the end you can swap the model= string between deepseek-v4-flash and deepseek-v4-pro without touching anything else.

Before any of that works, two facts to anchor on. First, the current generation is DeepSeek V4, released April 24, 2026, shipped as two open-weight MoE model IDs: deepseek-v4-pro (1.6T total / 49B active parameters, frontier tier) and deepseek-v4-flash (284B / 13B active, cost-efficient tier). Both are MIT-licensed. Second, LangChain ships an official integration package, langchain-deepseek, maintained jointly with the LangChain core team. The package is published on PyPI under MIT license with langchain-core and langchain-openai as required dependencies — which tells you the wrapper subclasses LangChain’s OpenAI chat-model base.

Prerequisites

  • Python 3.10 or later (LangChain 0.3+ needs it).
  • A DeepSeek API key from the developer console — see our walkthrough on how to get a DeepSeek API key if you have not done it yet.
  • A virtual environment (venv, uv, or poetry) to keep dependencies isolated.
  • Roughly $1 of credit to follow along; the V4-Flash example calls in this article cost fractions of a cent each.
  • Familiarity with the basics of the DeepSeek API getting started flow helps but is not required.

Step 1: Install the LangChain DeepSeek package

The LangChain DeepSeek integration lives in the langchain-deepseek package. Install it with pip in your virtual environment:

pip install -U langchain-deepseek langchain-core

If you plan to follow the RAG example in Step 5, also install a vector store and an embedding provider — DeepSeek does not currently publish a first-party embeddings API, so pair it with whatever embedder you already use:

pip install -U langchain-community langchain-text-splitters faiss-cpu sentence-transformers

Then export your key as the DEEPSEEK_API_KEY environment variable:

export DEEPSEEK_API_KEY="sk-..."
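
If you work in a notebook rather than a shell, you can also set the key from Python before constructing the model; a minimal sketch using only the standard library:

import getpass
import os

# Prompt for the key once if it is not already set in the environment.
if "DEEPSEEK_API_KEY" not in os.environ:
    os.environ["DEEPSEEK_API_KEY"] = getpass.getpass("DeepSeek API key: ")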

Step 2: A minimal ChatDeepSeek call

The wrapper class is ChatDeepSeek. Under the hood it inherits from BaseChatOpenAI, which is why every OpenAI-style argument you already know — temperature, max_tokens, timeout — works as expected. The class reads the API key from the DEEPSEEK_API_KEY env var and points at the DeepSeek API base URL by default, so the only required argument is the model ID.

Save this as quickstart.py. It uses the cost-efficient tier:

from langchain_deepseek import ChatDeepSeek

llm = ChatDeepSeek(
    model="deepseek-v4-flash",
    temperature=0,
    max_tokens=512,
    timeout=60,
    max_retries=2,
)

messages = [
    ("system", "You are a concise technical writer."),
    ("human", "Explain MoE routing in two sentences."),
]

response = llm.invoke(messages)
print(response.content)
print(response.usage_metadata)

Two things to notice. The messages list uses LangChain’s tuple shorthand instead of explicit HumanMessage / SystemMessage objects — both work; the tuples are just shorter. And response.usage_metadata returns a dict like {"input_tokens": …, "output_tokens": …, "total_tokens": …}, which is what you wire into your own cost tracking.

Behind the wrapper, every call hits POST /chat/completions on https://api.deepseek.com — the OpenAI-compatible endpoint. The API is stateless: DeepSeek does not remember prior turns server-side, so multi-turn conversations require resending the full messages array every request. The web chat and mobile app behave differently — they keep session history for you. The API does not.
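
In practice that means your code keeps the history and resends it on every turn. A minimal two-turn sketch, reusing the llm object from quickstart.py:

from langchain_core.messages import AIMessage, HumanMessage, SystemMessage

history = [
    SystemMessage("You are a concise technical writer."),
    HumanMessage("Explain MoE routing in two sentences."),
]

first = llm.invoke(history)
history.append(AIMessage(first.content))

# The API is stateless, so the second turn sends the full history again.
history.append(HumanMessage("Now contrast it with dense models in one sentence."))
second = llm.invoke(history)
print(second.content)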

Step 3: Pick the right V4 tier

The two V4 tiers share the same API surface and the same 1,000,000-token context window with up to 384,000 output tokens. The difference is cost-per-task and benchmark ceiling. Here is the current rate card, taken from the official DeepSeek pricing page as of April 2026:

| Model | Input (cache hit) | Input (cache miss) | Output | Best for |
| --- | --- | --- | --- | --- |
| deepseek-v4-flash | $0.028 / 1M | $0.14 / 1M | $0.28 / 1M | Standard chat, RAG, classification, drafting |
| deepseek-v4-pro | $0.145 / 1M | $1.74 / 1M | $3.48 / 1M | Frontier coding, agentic loops, hard reasoning |

The output column is the one to watch: V4-Pro is roughly 12× more expensive on output than V4-Flash. For a worked example covering both tiers, see our DeepSeek API pricing page or run the numbers in our DeepSeek pricing calculator.

Worked example — V4-Flash at scale

Imagine 1,000,000 LangChain calls per month with a 2,000-token cached system prompt, a 200-token user message (uncached against that prefix), and a 300-token answer. On deepseek-v4-flash the math breaks down as:

  • Cached input: 2,000 × 1,000,000 = 2.0B tokens × $0.028/M = $56.00
  • Uncached input: 200 × 1,000,000 = 0.2B tokens × $0.14/M = $28.00
  • Output: 300 × 1,000,000 = 0.3B tokens × $0.28/M = $84.00
  • Total: $168.00 / month

Run the same workload through deepseek-v4-pro and you land at $1,682 — same call pattern, ~10× the bill. Use Pro where the benchmark lift earns it; default to Flash everywhere else.
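
The arithmetic is easy to script if you want to test other call patterns; a small sketch with the rates from the table above hard-coded (update them if the pricing page changes):

# Per-million-token rates in USD, copied from the table above.
RATES = {
    "deepseek-v4-flash": {"cache_hit": 0.028, "cache_miss": 0.14, "output": 0.28},
    "deepseek-v4-pro":   {"cache_hit": 0.145, "cache_miss": 1.74, "output": 3.48},
}

def monthly_cost(model, calls, cached_in, uncached_in, out_tokens):
    """Estimate monthly spend for `calls` requests with per-call token counts."""
    r = RATES[model]
    per_call = (cached_in * r["cache_hit"]
                + uncached_in * r["cache_miss"]
                + out_tokens * r["output"])
    return per_call * calls / 1_000_000

print(monthly_cost("deepseek-v4-flash", 1_000_000, 2000, 200, 300))  # 168.0
print(monthly_cost("deepseek-v4-pro",   1_000_000, 2000, 200, 300))  # 1682.0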

Step 4: Enable thinking mode

In V4, thinking mode is a request parameter, not a separate model ID. You can flip it on for either tier. With LangChain you pass it through as extra_body, since the wrapper inherits from the OpenAI base class:

from langchain_deepseek import ChatDeepSeek

reasoner = ChatDeepSeek(
    model="deepseek-v4-pro",
    temperature=0,
    max_tokens=8192,
    reasoning_effort="high",
    extra_body={"thinking": {"type": "enabled"}},
)

result = reasoner.invoke(
    "A train leaves Chicago at 9:00 averaging 60 mph. "
    "A second train leaves the same station at 10:30 averaging 80 mph "
    "on a parallel track. When does the second train catch up?"
)

print("Thinking trace:")
print(result.additional_kwargs.get("reasoning_content"))
print("nAnswer:")
print(result.content)

Thinking-mode responses return reasoning_content alongside the final content. LangChain surfaces the reasoning trace on response.additional_kwargs["reasoning_content"]; the user-facing answer stays on response.content. For maximum effort use reasoning_effort="max" — note the docs require max_model_len of at least 384K tokens to avoid truncating the trace.

Temperature guidance from DeepSeek’s own documentation, which is worth pinning to the wall:

  • 0.0 — code generation, math
  • 1.0 — data analysis, cleaning
  • 1.3 — general conversation, translation
  • 1.5 — creative writing
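
If one pipeline serves several task types, it can help to encode that guidance once; a small sketch (the task labels are just illustrative):

from langchain_deepseek import ChatDeepSeek

# Suggested temperatures from DeepSeek's documentation, keyed by task type.
TASK_TEMPERATURE = {
    "code": 0.0,
    "data_analysis": 1.0,
    "conversation": 1.3,
    "creative_writing": 1.5,
}

def llm_for(task: str) -> ChatDeepSeek:
    return ChatDeepSeek(model="deepseek-v4-flash",
                        temperature=TASK_TEMPERATURE[task])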

Legacy IDs are still accepted — for now

If you maintain an older codebase you may see deepseek-chat or deepseek-reasoner in examples, along with V3-era guidance that DeepSeek-R1 (model="deepseek-reasoner") does not support tool calling or structured output while model="deepseek-chat" does. Today both legacy IDs route to deepseek-v4-flash (non-thinking and thinking respectively) and will be fully retired on July 24, 2026 at 15:59 UTC. After that date, requests with the old IDs will fail. Migration is a one-line model= swap; the base_url does not change.
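
A before/after sketch of that swap:

from langchain_deepseek import ChatDeepSeek

# Before (V3-era IDs, retired July 24, 2026):
llm = ChatDeepSeek(model="deepseek-chat")
reasoner = ChatDeepSeek(model="deepseek-reasoner")

# After (V4 IDs; thinking mode is a request parameter, not a separate model):
llm = ChatDeepSeek(model="deepseek-v4-flash")
reasoner = ChatDeepSeek(model="deepseek-v4-flash",
                        extra_body={"thinking": {"type": "enabled"}})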

Step 5: Streaming, tool calling, and RAG

Streaming tokens

For chat UIs, streaming is non-negotiable. ChatDeepSeek implements LangChain’s standard .stream() method:

for chunk in llm.stream("Write a haiku about MoE routing."):
    print(chunk.content, end="", flush=True)

When thinking mode is enabled, reasoning chunks stream alongside answer chunks — inspect chunk.additional_kwargs on each event to separate them.
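
A sketch of that separation, assuming the reasoning chunks arrive under the same reasoning_content key as in the non-streaming case (reusing the reasoner object from Step 4):

for chunk in reasoner.stream("Prove that 0.999 repeating equals 1."):
    reasoning = chunk.additional_kwargs.get("reasoning_content")
    if reasoning:
        print(reasoning, end="", flush=True)      # thinking trace
    elif chunk.content:
        print(chunk.content, end="", flush=True)  # answer tokens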

Tool calling

ChatDeepSeek exposes bind_tools() in the standard LangChain shape. The wrapper supports two methods for structured output: "function_calling", which uses DeepSeek's tool-calling features, and "json_mode", which uses DeepSeek's JSON mode feature. A minimal example:

from langchain_core.tools import tool
from langchain_deepseek import ChatDeepSeek

@tool
def get_weather(city: str) -> str:
    """Return current weather for a city."""
    return f"{city}: 14°C, light rain"

llm = ChatDeepSeek(model="deepseek-v4-flash", temperature=0)
agent = llm.bind_tools([get_weather])

reply = agent.invoke("What's the weather in Manchester?")
print(reply.tool_calls)
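
The reply contains the tool call request, not the tool result; your code runs the tool and sends the output back. A minimal hand-rolled round trip (LangGraph's prebuilt agents automate this loop for you):

from langchain_core.messages import HumanMessage, ToolMessage

messages = [HumanMessage("What's the weather in Manchester?")]
ai_msg = agent.invoke(messages)
messages.append(ai_msg)

# Execute each requested tool and return its output as a ToolMessage.
for call in ai_msg.tool_calls:
    result = get_weather.invoke(call["args"])
    messages.append(ToolMessage(content=result, tool_call_id=call["id"]))

final = agent.invoke(messages)
print(final.content)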

One caveat from the LangChain source itself: strict-mode schema adherence is included for compatibility with other chat models, and if specified will be passed to the API in accordance with the OpenAI specification — however, the DeepSeek API may ignore the parameter. Treat strict mode as best-effort, not a guarantee.

Structured JSON output

For response_format={"type": "json_object"}, remember that JSON mode is designed to return valid JSON, but it is not guaranteed to. Always include the word "json" plus a small example schema in the system prompt, and set max_tokens high enough that the response cannot truncate mid-object. The model may also occasionally return empty content — handle that path explicitly. Our DeepSeek API JSON mode guide covers the gotchas.
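
For typed output, with_structured_output wraps either method; a minimal sketch with a Pydantic schema, passing method="function_calling" explicitly (the schema and field names are just illustrative):

from pydantic import BaseModel, Field
from langchain_deepseek import ChatDeepSeek

class ModelSpec(BaseModel):
    """Facts extracted about a DeepSeek model tier."""
    name: str = Field(description="Model ID or tier name")
    active_params_b: float = Field(description="Active parameters, in billions")

llm = ChatDeepSeek(model="deepseek-v4-flash", temperature=0)
structured = llm.with_structured_output(ModelSpec, method="function_calling")

spec = structured.invoke("V4-Flash is 284B total / 13B active parameters. Extract the spec.")
print(spec)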

A small RAG chain

Drop ChatDeepSeek into a standard LangChain retriever + prompt + LLM pipeline. Note that DeepSeek does not currently ship a first-party embeddings API, so pair it with a separate embedder (here, a local sentence-transformer). For a deeper walkthrough see our DeepSeek RAG tutorial.

from langchain_deepseek import ChatDeepSeek
from langchain_community.vectorstores import FAISS
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

docs = ["DeepSeek V4 launched April 24, 2026.",
        "V4-Flash is 284B total / 13B active parameters.",
        "V4-Pro is 1.6T total / 49B active parameters."]

emb = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
store = FAISS.from_texts(docs, emb)
retriever = store.as_retriever(search_kwargs={"k": 2})

prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer using only the provided context. If unsure, say so."),
    ("human", "Context:n{context}nnQuestion: {question}"),
])

llm = ChatDeepSeek(model="deepseek-v4-flash", temperature=0)

chain = (
    {"context": retriever, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

print(chain.invoke("How many active parameters does V4-Flash use?"))

Step 6: Verify it worked

A working integration shows three signs:

  1. response.content contains a sensible answer.
  2. response.usage_metadata is populated with non-zero token counts.
  3. response.response_metadata["model_name"] reflects the model you requested (or the alias it resolved to, for legacy IDs).

If you enabled thinking mode, also check that additional_kwargs["reasoning_content"] is non-empty. Empty reasoning with a populated answer usually means the request fell back to non-thinking mode — verify your extra_body made it through.
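
A quick sanity check covering those signs, run against the quickstart response object:

assert response.content, "empty answer"
usage = response.usage_metadata or {}
assert usage.get("total_tokens", 0) > 0, "usage metadata missing"
print("model:", response.response_metadata.get("model_name"))
print("reasoning present:",
      bool(response.additional_kwargs.get("reasoning_content")))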

Common errors and fixes

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| AuthenticationError: 401 | Missing or wrong API key | Re-export DEEPSEEK_API_KEY; restart the shell |
| NotFoundError: model not found | Typo in model ID, or using a deprecated name after July 24, 2026 | Use deepseek-v4-flash or deepseek-v4-pro |
| Empty content in JSON mode | Prompt missing the word "json", or max_tokens too low | Add a JSON schema example; raise max_tokens |
| Truncated thinking trace | reasoning_effort="max" with insufficient context budget | Set max_tokens ≥ 384,000 or step down to "high" |
| Tool not called | Tool description too vague | Rewrite the docstring to describe when to use the tool |
| RateLimitError | Burst over your account's RPM | Retry with exponential backoff; see DeepSeek API rate limits |
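
ChatDeepSeek already retries transient failures up to max_retries, but heavy batch jobs often want their own backoff on top. A sketch using the tenacity library (an extra dependency; any retry helper works):

from tenacity import retry, stop_after_attempt, wait_exponential

@retry(wait=wait_exponential(multiplier=1, max=30), stop=stop_after_attempt(5))
def ask(prompt: str) -> str:
    # Exponential backoff capped at 30s, up to five attempts, then re-raise.
    return llm.invoke(prompt).content

print(ask("One-line summary of MoE routing."))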

When LangChain is the wrong choice

Not every DeepSeek project benefits from a LangChain wrapper. For a single-shot API call from a script or a Lambda, the official OpenAI SDK pointed at https://api.deepseek.com is shorter and has fewer moving parts. LangChain earns its weight when you need orchestration: chains of prompts, retrievers, tool calls, agents, memory, or the LangSmith tracing surface. If your project is “send one prompt, get one answer”, call POST /chat/completions directly. If your project is “retrieve, plan, call tools, summarise”, LangChain saves real time.
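
For comparison, here is the same single-shot call without LangChain, using the OpenAI SDK pointed at DeepSeek's OpenAI-compatible base URL:

import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["DEEPSEEK_API_KEY"],
                base_url="https://api.deepseek.com")

resp = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Explain MoE routing in two sentences."}],
)
print(resp.choices[0].message.content)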

Next steps

Last verified: 2026-04-25. DeepSeek AI Guide is an independent resource and is not affiliated with DeepSeek or its parent company. Model IDs, pricing and API behaviour change; check the official DeepSeek documentation and pricing page before committing to a production decision.

Frequently asked questions

How do I install the LangChain DeepSeek integration?

Run pip install -U langchain-deepseek in a Python 3.10+ virtual environment, then export your key as DEEPSEEK_API_KEY. The package pulls in langchain-core and langchain-openai as dependencies. Import ChatDeepSeek from langchain_deepseek. For a full walkthrough including how to obtain a key, see our guide on getting a DeepSeek API key.

What model name should I use with ChatDeepSeek today?

Use deepseek-v4-flash for standard chat, RAG and classification, or deepseek-v4-pro for frontier coding and agentic work. The legacy IDs deepseek-chat and deepseek-reasoner still work but are scheduled for retirement on July 24, 2026 at 15:59 UTC, after which they will fail. See the full DeepSeek V4 overview for tier guidance.

Does ChatDeepSeek support tool calling and structured output?

Yes for V4 models on both tiers. ChatDeepSeek inherits bind_tools() and with_structured_output() from the OpenAI base class and routes to DeepSeek’s function-calling or JSON-mode endpoints. Treat strict=True as best-effort because the DeepSeek API may ignore that flag. For deeper detail, browse our DeepSeek API function calling reference.

Can I run DeepSeek with LangChain locally instead of via the API?

Yes. The V4 weights are MIT-licensed and open. Most teams run them through Ollama or vLLM and point ChatDeepSeek (or LangChain’s ChatOllama) at the local endpoint by overriding the base URL. Hardware demands are significant — 13B active parameters on Flash is the more practical local target. Our running DeepSeek on Ollama walkthrough covers the setup.
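
As a rough sketch, assuming your local vLLM (or similar) server exposes an OpenAI-compatible endpoint on port 8000 and that your installed langchain-deepseek version accepts an api_base override (check the class signature before relying on it):

from langchain_deepseek import ChatDeepSeek

llm = ChatDeepSeek(
    model="deepseek-v4-flash",            # whatever name your local server registers
    api_base="http://localhost:8000/v1",  # local OpenAI-compatible endpoint (assumed)
    api_key="not-needed-locally",
)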

How much does a typical LangChain + DeepSeek workflow cost?

For one million V4-Flash calls with a 2,000-token cached system prompt, a 200-token user message and a 300-token answer, expect about $168 per month: $56 cached input, $28 uncached input, $84 output. The same workload on V4-Pro costs roughly $1,682. Estimate your own numbers with our DeepSeek pricing calculator.
