How to use DeepSeek with LlamaIndex for RAG (V4 tutorial)
You have a folder of PDFs, a DeepSeek API key, and a question: how do I get LlamaIndex to answer questions over my documents using DeepSeek instead of OpenAI? This tutorial walks through wiring up DeepSeek with LlamaIndex end-to-end on the current V4 API — from `pip install` to a working query engine that retrieves chunks, calls `deepseek-v4-flash`, and streams an answer back. We’ll cover the official `llama-index-llms-deepseek` package, the more flexible `OpenAILike` route, model selection between V4-Flash and V4-Pro, thinking mode, embedding choices (since DeepSeek does not ship an embedding model), and the legacy `deepseek-chat` / `deepseek-reasoner` migration window. Code examples are Python 3.10+.
What you’ll build
A minimal but production-shaped Retrieval-Augmented Generation pipeline: load a directory of documents, chunk and embed them into a vector store, retrieve the most relevant passages for a user query, and pass them to DeepSeek V4 through LlamaIndex. By the end you’ll have a script that answers questions over your own data, plus a clear understanding of which DeepSeek model tier to pick and how to switch on thinking mode when a query genuinely needs reasoning.
Two integration paths exist. The first uses the dedicated llama-index-llms-deepseek package, which subclasses LlamaIndex’s OpenAILike class. The second uses OpenAILike directly. Both hit the same endpoint — POST /chat/completions, the OpenAI-compatible surface at https://api.deepseek.com — so the choice is mostly ergonomic. Snippets for both follow.
Prerequisites
- Python 3.10 or newer, in a virtualenv.
- A DeepSeek API key. If you don’t have one yet, see get a DeepSeek API key.
- An embedding provider. DeepSeek does not currently ship an embedding model, so you’ll need either OpenAI’s text-embedding-3-small, a local model (BGE, Jina, or similar through Hugging Face), or Ollama-served embeddings.
- About 200 MB of disk for the LlamaIndex install, more if you go local-embedding.
- Familiarity with the LlamaIndex basics — if not, read DeepSeek Python integration first.
Pick a model: V4-Flash or V4-Pro
DeepSeek V4 (released April 24, 2026) ships as two open-weight Mixture-of-Experts models under the MIT license: deepseek-v4-flash (284B total / 13B active parameters) and deepseek-v4-pro (1.6T total / 49B active). Both share a 1,000,000-token default context window and support output up to 384,000 tokens. Both support thinking mode through the same parameter — it is not a separate model ID.
For most RAG workloads, deepseek-v4-flash is the right default. Output is roughly 12× cheaper than V4-Pro per million tokens, latency is lower, and synthesis-from-context tasks rarely benefit from frontier-tier reasoning. Reach for V4-Pro when the retrieved context contains code, dense financial tables, or multi-document reasoning chains where you’ve measured Flash falling short.
| Model | Active params | Input miss ($/M) | Output ($/M) | Best for |
|---|---|---|---|---|
| `deepseek-v4-flash` | 13B | $0.14 | $0.28 | Default RAG, chat, summarisation |
| `deepseek-v4-pro` | 49B | $1.74 | $3.48 | Coding agents, complex multi-doc reasoning |
Pricing as of April 2026; verify on the official DeepSeek API pricing page before committing.
One legacy note: if you’re maintaining an older script that still uses deepseek-chat or deepseek-reasoner, both IDs continue to work and currently route to deepseek-v4-flash. They retire on 2026-07-24 at 15:59 UTC; migrating is a one-line model= change. The base_url stays the same.
Step 1: Install the packages
Use the dedicated DeepSeek wrapper plus a vector store and an embedding provider. The shell commands below install LlamaIndex core, the DeepSeek LLM package, OpenAI embeddings (used only to embed text; generation still goes through DeepSeek), and the file readers:
```shell
pip install llama-index-core
pip install llama-index-llms-deepseek
pip install llama-index-embeddings-openai
pip install llama-index-readers-file
```
If you’d rather avoid OpenAI entirely, swap the third line for llama-index-embeddings-huggingface and use a local model such as BAAI/bge-small-en-v1.5. That keeps embeddings on your own hardware and the only outbound traffic is the DeepSeek call itself.
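As a sketch of that local route (assuming the llama-index-embeddings-huggingface swap above is installed), the embedding side of the configuration becomes:

```python
from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Downloads the model once from Hugging Face, then embeds entirely on local hardware.
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
```

With this in place, only the final generation call leaves your machine.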
Step 2: Configure the DeepSeek LLM
The dedicated package gives you a DeepSeek class that extends OpenAILike and accepts a model plus api_key. Set it on the global Settings object so every retriever and query engine in the script uses DeepSeek by default:

```python
import os

from llama_index.core import Settings
from llama_index.llms.deepseek import DeepSeek
from llama_index.embeddings.openai import OpenAIEmbedding

Settings.llm = DeepSeek(
    model="deepseek-v4-flash",
    api_key=os.environ["DEEPSEEK_API_KEY"],
    temperature=1.3,  # general conversation default
    max_tokens=2048,
)
Settings.embed_model = OpenAIEmbedding(
    model="text-embedding-3-small",
    api_key=os.environ["OPENAI_API_KEY"],
)
```
The temperature values follow DeepSeek’s official guidance: 0.0 for code or maths, 1.0 for data analysis, 1.3 for general conversation and translation, 1.5 for creative writing. Choose the value that matches your retrieval task; for fact-grounded RAG over technical documents, 0.0 reduces drift. See the broader parameter discussion in DeepSeek prompt engineering.
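That guidance can be captured in a small lookup so the rest of the script stays declarative; a convenience sketch, not part of the DeepSeek API:

```python
# DeepSeek's published temperature guidance, keyed by task type.
TEMPERATURE_GUIDE = {
    "code": 0.0,
    "math": 0.0,
    "data_analysis": 1.0,
    "conversation": 1.3,
    "translation": 1.3,
    "creative_writing": 1.5,
}

def pick_temperature(task: str) -> float:
    """Return the recommended temperature, defaulting to 0.0 for grounded RAG."""
    return TEMPERATURE_GUIDE.get(task, 0.0)
```

Passing `pick_temperature("conversation")` to the DeepSeek constructor keeps the choice explicit and auditable.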
Alternative: configure with OpenAILike directly
If you don’t want a second package, point OpenAILike at DeepSeek’s base URL. This is the same code path the dedicated wrapper uses internally:
```python
from llama_index.llms.openai_like import OpenAILike

Settings.llm = OpenAILike(
    model="deepseek-v4-flash",
    api_base="https://api.deepseek.com/v1",
    api_key=os.environ["DEEPSEEK_API_KEY"],
    is_chat_model=True,
    is_function_calling_model=True,
    context_window=1_000_000,
    max_tokens=2048,
)
```
Setting is_chat_model=True matters — without it, LlamaIndex routes through the legacy completion path and metadata inference breaks. is_function_calling_model=True enables tool calling, which V4 supports in both thinking and non-thinking modes. For the lower-level details, the DeepSeek OpenAI SDK compatibility reference is the place to start. DeepSeek also exposes an Anthropic-compatible surface against the same base URL if you’d prefer to drive it from the Anthropic SDK.
Step 3: Build the index
Drop your source files into a data/ directory (PDFs, Markdown, HTML, .txt, .docx are all read by SimpleDirectoryReader). Then build a vector index. The API is stateless — LlamaIndex sends the conversation history and retrieved context on every request, since DeepSeek does not retain prior turns server-side.
```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)
index.storage_context.persist(persist_dir="./storage")
```
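On later runs you can reload the persisted index instead of re-embedding the corpus; a sketch using the same ./storage directory (run it with the Step 2 Settings in place, since loading still needs the embed model):

```python
from llama_index.core import StorageContext, load_index_from_storage

# Rebuild the index object from the ./storage directory persisted above.
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)
```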
For a deeper look at chunking strategies, hybrid retrieval, and reranking against DeepSeek, the DeepSeek RAG tutorial covers the architectural choices. The minimal path here is enough for a working prototype.
Step 4: Query the index
Wrap the index in a query engine and ask a question. The retriever pulls the top-k chunks, LlamaIndex stuffs them into a prompt, and DeepSeek synthesises an answer:

```python
query_engine = index.as_query_engine(similarity_top_k=4)
response = query_engine.query(
    "Summarise the refund policy in three bullet points."
)
print(response)
```
Streaming is a one-flag change — set streaming=True on as_query_engine and iterate response.response_gen. That gives token-by-token output, which makes user-facing chat UIs feel responsive. For a deeper streaming reference see the DeepSeek API streaming docs.
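The one-flag change looks like this in full (a sketch; assumes the index from Step 3 is in scope and an API key is configured):

```python
# Stream the answer token-by-token instead of waiting for the full response.
streaming_engine = index.as_query_engine(similarity_top_k=4, streaming=True)
streaming_response = streaming_engine.query(
    "Summarise the refund policy in three bullet points."
)
for token in streaming_response.response_gen:
    print(token, end="", flush=True)
print()
```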
Step 5: Switch on thinking mode (when you actually need it)
Thinking mode in V4 is a request parameter, not a separate model ID. To turn it on through LlamaIndex, pass reasoning_effort and the thinking extra-body flag through to the underlying client. With the dedicated DeepSeek wrapper:
```python
Settings.llm = DeepSeek(
    model="deepseek-v4-pro",
    api_key=os.environ["DEEPSEEK_API_KEY"],
    additional_kwargs={
        "reasoning_effort": "high",
        "extra_body": {"thinking": {"type": "enabled"}},
    },
)
```
With thinking enabled, the response returns reasoning_content alongside the final content. LlamaIndex surfaces the final content in response.response; the reasoning trace is available on the underlying message object if you want to log or display it. Use reasoning_effort="max" for the highest setting, and raise max_tokens toward the 384,000-token output cap to avoid truncating very long traces.
A practical rule: enable thinking for queries that involve multi-step inference over retrieved documents (legal cross-references, financial reconciliation, multi-hop research) and leave it off for retrieval-and-summarise tasks. Thinking-mode responses cost more — you pay for the reasoning tokens as output. For a deeper comparison, see DeepSeek R1 vs OpenAI o1.
Step 6: Cost calculation worked example
Suppose a customer-support RAG application runs 100,000 queries per month against deepseek-v4-flash. A typical request: a 1,500-token system prompt (cached after the first call), 400 tokens of retrieved context plus a 100-token user query (uncached), and a 250-token answer. Enumerating all three buckets at V4-Flash rates:
```
Cached input  : 1,500 × 100,000 = 150,000,000 tokens × $0.028/M = $ 4.20
Uncached input:   500 × 100,000 =  50,000,000 tokens × $0.14/M  = $ 7.00
Output        :   250 × 100,000 =  25,000,000 tokens × $0.28/M  = $ 7.00
                                                                  ------
                                                          Total   $18.20
```
The same workload at V4-Pro rates ($0.145 / $1.74 / $3.48 per million) comes to about $195 — roughly 11× higher. Pick the tier that matches your quality threshold; for grounded RAG, Flash usually wins on price-per-acceptable-answer. The DeepSeek pricing calculator automates this for arbitrary token mixes. Note that context caching kicks in only when LlamaIndex actually reuses a prefix — keep your system prompt and tool schemas at the start of messages for the cache to hit.
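The arithmetic above is easy to script for arbitrary token mixes; a sketch using the rates quoted in this article (verify current pricing before relying on them):

```python
QUERIES = 100_000
CACHED_IN, UNCACHED_IN, OUT = 1_500, 500, 250  # tokens per request
FLASH = {"cached": 0.028, "uncached": 0.14, "output": 0.28}  # $/M tokens
PRO = {"cached": 0.145, "uncached": 1.74, "output": 3.48}

def monthly_cost(rates, queries=QUERIES):
    """Dollar cost per month for the three token buckets at the given rates."""
    def millions(tokens_per_request):
        return tokens_per_request * queries / 1_000_000
    return (millions(CACHED_IN) * rates["cached"]
            + millions(UNCACHED_IN) * rates["uncached"]
            + millions(OUT) * rates["output"])

flash_total = monthly_cost(FLASH)  # the $18.20 worked above
pro_total = monthly_cost(PRO)      # roughly 11x higher
```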
Verify it worked
Three checks before you call the integration done:
- Confirm the request hits DeepSeek. Set llama_index.core.set_global_handler("simple") or wrap with an HTTP debug proxy and confirm the URL is https://api.deepseek.com/v1/chat/completions.
- Confirm retrieved context is in the prompt. Print response.source_nodes — you should see the chunks that were retrieved for the question.
- Confirm thinking mode (if enabled). The latency jumps noticeably and the underlying response object carries a non-empty reasoning_content.
Common errors and fixes
| Error | Cause | Fix |
|---|---|---|
| The model `gpt-3.5-turbo` does not exist | You used the plain OpenAI class instead of OpenAILike or DeepSeek | Switch to OpenAILike with is_chat_model=True |
| 401 Unauthorized | Wrong key or wrong env var | Print os.environ["DEEPSEEK_API_KEY"][:6]; rotate if leaked |
| Empty response.response in JSON mode | Truncation or missing schema example | Raise max_tokens; include the word “json” plus a small example schema in the prompt |
| Slow first call, fast subsequent calls | Context cache warming up | Expected — the system-prompt prefix is now cached at the cheaper tier |
| model_not_found on deepseek-chat after July 2026 | Legacy ID retired 2026-07-24 15:59 UTC | Change model to deepseek-v4-flash |
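The legacy-ID migration is mechanical enough to automate in configs that still carry old model names; a small helper (a convenience sketch, not part of any SDK):

```python
# deepseek-chat and deepseek-reasoner route to V4-Flash until they retire on
# 2026-07-24; after that, requests fail with model_not_found.
LEGACY_MODELS = {
    "deepseek-chat": "deepseek-v4-flash",
    "deepseek-reasoner": "deepseek-v4-flash",
}

def resolve_model(model_id: str) -> str:
    """Translate a retired DeepSeek model ID to its V4 replacement."""
    return LEGACY_MODELS.get(model_id, model_id)
```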
Next steps
You have a working DeepSeek-backed RAG pipeline. Reasonable extensions:
- Swap VectorStoreIndex for a hosted vector DB (Qdrant, Weaviate, Pinecone) once your corpus exceeds memory.
- Add reranking with a cross-encoder before the LLM call — it almost always beats raising similarity_top_k.
- Compare against the LangChain equivalent in DeepSeek with LangChain if your team already standardises on LangChain.
- If you want to keep everything off the public internet, point OpenAILike at a local Ollama server running an open DeepSeek weight — see running DeepSeek on Ollama.
- Browse the rest of our DeepSeek tutorials for end-to-end app builds.
Last verified: 2026-04-25. DeepSeek AI Guide is an independent resource and is not affiliated with DeepSeek or its parent company. Model IDs, pricing and API behaviour change; check the official DeepSeek documentation and pricing page before committing to a production decision.
Frequently asked questions
How do I use DeepSeek with LlamaIndex in Python?
Install llama-index-llms-deepseek, set Settings.llm = DeepSeek(model="deepseek-v4-flash", api_key=...), configure an embedding model (DeepSeek does not ship one), and build a VectorStoreIndex from your documents. Query it with index.as_query_engine().query(...). The package wraps LlamaIndex’s OpenAILike class against DeepSeek’s OpenAI-compatible POST /chat/completions endpoint. Full walkthrough in this DeepSeek Python integration guide.
What model should I use with LlamaIndex — V4-Flash or V4-Pro?
Default to deepseek-v4-flash for retrieval-augmented generation, chat, and summarisation. Output costs $0.28 per million tokens versus $3.48 for deepseek-v4-pro, and most RAG queries are bottlenecked by retrieval quality, not generation reasoning. Reach for V4-Pro when you have measured Flash failing on multi-document reasoning or complex coding tasks. Compare the tiers in detail at the DeepSeek V4 overview.
Does DeepSeek provide an embedding model for LlamaIndex?
No — DeepSeek does not currently ship an embedding model, so you’ll need an external provider. Common choices are OpenAI’s text-embedding-3-small, a local Hugging Face model such as BGE or Jina, or Ollama-hosted embeddings. Pick locally hosted embeddings if you want to keep document content off third-party servers. The DeepSeek API documentation covers the chat surface but not embeddings.
How do I enable DeepSeek thinking mode in LlamaIndex?
Thinking is a request parameter on either V4 model, not a separate model ID. Pass reasoning_effort="high" and extra_body={"thinking": {"type": "enabled"}} through additional_kwargs on the DeepSeek or OpenAILike client. The response then returns reasoning_content alongside the final content. Use it for multi-step reasoning queries and leave it off for straight summarisation. More detail in the DeepSeek API best practices.
Why am I getting a “model not found” error when using DeepSeek with LlamaIndex?
Two common causes. First, you may be using LlamaIndex’s plain OpenAI class, which restricts model names to GPT family — switch to OpenAILike or the dedicated DeepSeek wrapper. Second, after 2026-07-24 15:59 UTC the legacy IDs deepseek-chat and deepseek-reasoner are retired; update model to deepseek-v4-flash or deepseek-v4-pro. See the DeepSeek API error codes reference for the full list.
