How to use DeepSeek with LlamaIndex for RAG (V4 tutorial)
You have a folder of PDFs, a DeepSeek API key, and a question: how do I get LlamaIndex to answer questions over my documents using DeepSeek instead of OpenAI? This tutorial walks through wiring up DeepSeek with LlamaIndex end-to-end on the current V4 API — from `pip install` to a working query engine that retrieves chunks, calls `deepseek-v4-flash`, and streams an answer back. We’ll cover the official `llama-index-llms-deepseek` package, the more flexible `OpenAILike` route, model selection between V4-Flash and V4-Pro, thinking mode, embedding choices (since DeepSeek does not ship an embedding model), and the legacy `deepseek-chat` / `deepseek-reasoner` migration window. Code examples are Python 3.10+.
What you’ll build
A minimal but production-shaped Retrieval-Augmented Generation pipeline: load a directory of documents, chunk and embed them into a vector store, retrieve the most relevant passages for a user query, and pass them to DeepSeek V4 through LlamaIndex. By the end you’ll have a script that answers questions over your own data, plus a clear understanding of which DeepSeek model tier to pick and how to switch on thinking mode when a query genuinely needs reasoning.
Two integration paths exist. The first uses the dedicated llama-index-llms-deepseek package, which subclasses LlamaIndex’s OpenAILike class. The second uses OpenAILike directly. Both hit the same endpoint — POST /chat/completions, the OpenAI-compatible surface at https://api.deepseek.com — so the choice is mostly ergonomic. Snippets for both follow.
Prerequisites
- Python 3.10 or newer, in a virtualenv.
- A DeepSeek API key. If you don’t have one yet, see get a DeepSeek API key.
- An embedding provider. DeepSeek does not currently ship an embedding model, so you’ll need either OpenAI’s text-embedding-3-small, a local model (BGE, Jina, or similar through Hugging Face), or Ollama-served embeddings.
- About 200 MB of disk for the LlamaIndex install, more if you go local-embedding.
- Familiarity with the LlamaIndex basics — if not, read DeepSeek Python integration first.
Pick a model: V4-Flash or V4-Pro
DeepSeek V4 (released April 24, 2026) ships as two open-weight Mixture-of-Experts models under the MIT license: deepseek-v4-flash (284B total / 13B active parameters) and deepseek-v4-pro (1.6T total / 49B active). Both share a 1,000,000-token default context window and support output up to 384,000 tokens. Both support thinking mode through the same parameter — it is not a separate model ID.
For most RAG workloads, deepseek-v4-flash is the right default. Output is roughly 12× cheaper than V4-Pro per million tokens, latency is lower, and synthesis-from-context tasks rarely benefit from frontier-tier reasoning. Reach for V4-Pro when the retrieved context contains code, dense financial tables, or multi-document reasoning chains where you’ve measured Flash falling short.
| Model | Active params | Input miss ($/M) | Output ($/M) | Best for |
|---|---|---|---|---|
| `deepseek-v4-flash` | 13B | $0.14 | $0.28 | Default RAG, chat, summarisation |
| `deepseek-v4-pro` | 49B | $1.74 | $3.48 | Coding agents, complex multi-doc reasoning |
Pricing as of April 2026; verify on the official DeepSeek API pricing page before committing.
One legacy note: if you’re maintaining an older script that still uses deepseek-chat or deepseek-reasoner, both IDs continue to work and currently route to deepseek-v4-flash. They retire on 2026-07-24 at 15:59 UTC; migrating is a one-line model= change. The base_url stays the same.
Step 1: Install the packages
Use the dedicated DeepSeek wrapper plus a vector store and an embedding provider. The shell commands below install LlamaIndex core, the DeepSeek LLM package, OpenAI embeddings (used only to embed text; generation still goes through DeepSeek), and the file readers:
```shell
pip install llama-index-core
pip install llama-index-llms-deepseek
pip install llama-index-embeddings-openai
pip install llama-index-readers-file
```
If you’d rather avoid OpenAI entirely, swap the third line for llama-index-embeddings-huggingface and use a local model such as BAAI/bge-small-en-v1.5. That keeps embeddings on your own hardware and the only outbound traffic is the DeepSeek call itself.
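As a sketch of that local route (assuming the llama-index-embeddings-huggingface swap above is installed), the embedding side of the configuration becomes:

```python
from llama_index.core import Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

# Downloads the model once from Hugging Face, then embeds entirely on local hardware.
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-small-en-v1.5")
```

With this in place, only the final generation call leaves your machine.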
Step 2: Configure the DeepSeek LLM
The dedicated package gives you a DeepSeek class that extends OpenAILike and accepts a model plus api_key. Set it on the global Settings object so every retriever and query engine in the script uses DeepSeek by default:

```python
import os

from llama_index.core import Settings
from llama_index.llms.deepseek import DeepSeek
from llama_index.embeddings.openai import OpenAIEmbedding

Settings.llm = DeepSeek(
    model="deepseek-v4-flash",
    api_key=os.environ["DEEPSEEK_API_KEY"],
    temperature=1.3,  # general conversation default
    max_tokens=2048,
)
Settings.embed_model = OpenAIEmbedding(
    model="text-embedding-3-small",
    api_key=os.environ["OPENAI_API_KEY"],
)
```
The temperature values follow DeepSeek’s official guidance: 0.0 for code or maths, 1.0 for data analysis, 1.3 for general conversation and translation, 1.5 for creative writing. Choose the value that matches your retrieval task; for fact-grounded RAG over technical documents, 0.0 reduces drift. See the broader parameter discussion in DeepSeek prompt engineering.
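That guidance can be captured in a small lookup so the rest of the script stays declarative; a convenience sketch, not part of the DeepSeek API:

```python
# DeepSeek's published temperature guidance, keyed by task type.
TEMPERATURE_GUIDE = {
    "code": 0.0,
    "math": 0.0,
    "data_analysis": 1.0,
    "conversation": 1.3,
    "translation": 1.3,
    "creative_writing": 1.5,
}

def pick_temperature(task: str) -> float:
    """Return the recommended temperature, defaulting to 0.0 for grounded RAG."""
    return TEMPERATURE_GUIDE.get(task, 0.0)
```

Passing `pick_temperature("conversation")` to the DeepSeek constructor keeps the choice explicit and auditable.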
Alternative: configure with OpenAILike directly
If you don’t want a second package, point OpenAILike at DeepSeek’s base URL. This is the same code path the dedicated wrapper uses internally:
```python
from llama_index.llms.openai_like import OpenAILike

Settings.llm = OpenAILike(
    model="deepseek-v4-flash",
    api_base="https://api.deepseek.com/v1",
    api_key=os.environ["DEEPSEEK_API_KEY"],
    is_chat_model=True,
    is_function_calling_model=True,
    context_window=1_000_000,
    max_tokens=2048,
)
```
Setting is_chat_model=True matters — without it, LlamaIndex routes through the legacy completion path and metadata inference breaks. is_function_calling_model=True enables tool calling, which V4 supports in both thinking and non-thinking modes. For the lower-level details, the DeepSeek OpenAI SDK compatibility reference is the place to start. DeepSeek also exposes an Anthropic-compatible surface against the same base URL if you’d prefer to drive it from the Anthropic SDK.
Step 3: Build the index
Drop your source files into a data/ directory (PDFs, Markdown, HTML, .txt, .docx are all read by SimpleDirectoryReader). Then build a vector index. The API is stateless — LlamaIndex sends the conversation history and retrieved context on every request, since DeepSeek does not retain prior turns server-side.
```python
from llama_index.core import SimpleDirectoryReader, VectorStoreIndex

documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)
index.storage_context.persist(persist_dir="./storage")
```
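On later runs you can reload the persisted index instead of re-embedding the corpus; a sketch using the same ./storage directory (run it with the Step 2 Settings in place, since loading still needs the embed model):

```python
from llama_index.core import StorageContext, load_index_from_storage

# Rebuild the index object from the ./storage directory persisted above.
storage_context = StorageContext.from_defaults(persist_dir="./storage")
index = load_index_from_storage(storage_context)
```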
For a deeper look at chunking strategies, hybrid retrieval, and reranking against DeepSeek, the DeepSeek RAG tutorial covers the architectural choices. The minimal path here is enough for a working prototype.
Step 4: Query the index
Wrap the index in a query engine and ask a question. The retriever pulls the top-k chunks, LlamaIndex stuffs them into a prompt, and DeepSeek synthesises an answer:

```python
query_engine = index.as_query_engine(similarity_top_k=4)
response = query_engine.query(
    "Summarise the refund policy in three bullet points."
)
print(response)
```
Streaming is a one-flag change — set streaming=True on as_query_engine and iterate response.response_gen. That gives token-by-token output, which makes user-facing chat UIs feel responsive. For a deeper streaming reference see the DeepSeek API streaming docs.
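The one-flag change looks like this in full (a sketch; assumes the index from Step 3 is in scope and an API key is configured):

```python
# Stream the answer token-by-token instead of waiting for the full response.
streaming_engine = index.as_query_engine(similarity_top_k=4, streaming=True)
streaming_response = streaming_engine.query(
    "Summarise the refund policy in three bullet points."
)
for token in streaming_response.response_gen:
    print(token, end="", flush=True)
print()
```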
Step 5: Switch on thinking mode (when you actually need it)
Thinking mode in V4 is a request parameter, not a separate model ID. To turn it on through LlamaIndex, pass reasoning_effort and the thinking extra-body flag through to the underlying client. With the dedicated DeepSeek wrapper:
```python
Settings.llm = DeepSeek(
    model="deepseek-v4-pro",
    api_key=os.environ["DEEPSEEK_API_KEY"],
    additional_kwargs={
        "reasoning_effort": "high",
        "extra_body": {"thinking": {"type": "enabled"}},
    },
)
```
With thinking enabled, the response returns reasoning_content alongside the final content. LlamaIndex surfaces the final content in response.response; the reasoning trace is available on the underlying message object if you want to log or display it. Use reasoning_effort="max" for the highest setting, and raise max_tokens toward the 384,000-token output cap to avoid truncating very long traces.
A practical rule: enable thinking for queries that involve multi-step inference over retrieved documents (legal cross-references, financial reconciliation, multi-hop research) and leave it off for retrieval-and-summarise tasks. Thinking-mode responses cost more — you pay for the reasoning tokens as output. For a deeper comparison, see DeepSeek R1 vs OpenAI o1.
Step 6: Cost calculation worked example
Suppose a customer-support RAG application runs 100,000 queries per month against deepseek-v4-flash. A typical request: a 1,500-token system prompt (cached after the first call), 400 tokens of retrieved context plus a 100-token user query (uncached), and a 250-token answer. Enumerating all three buckets at V4-Flash rates:
```
Cached input  : 1,500 × 100,000 = 150,000,000 tokens × $0.028/M = $ 4.20
Uncached input:   500 × 100,000 =  50,000,000 tokens × $0.14/M  = $ 7.00
Output        :   250 × 100,000 =  25,000,000 tokens × $0.28/M  = $ 7.00
                                                                  ------
                                                          Total   $18.20
```
The same workload at V4-Pro rates ($0.145 / $1.74 / $3.48 per million) comes to about $195 — roughly 11× higher. Pick the tier that matches your quality threshold; for grounded RAG, Flash usually wins on price-per-acceptable-answer. The DeepSeek pricing calculator automates this for arbitrary token mixes. Note that context caching kicks in only when LlamaIndex actually reuses a prefix — keep your system prompt and tool schemas at the start of messages for the cache to hit.
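The arithmetic above is easy to script for arbitrary token mixes; a sketch using the rates quoted in this article (verify current pricing before relying on them):

```python
QUERIES = 100_000
CACHED_IN, UNCACHED_IN, OUT = 1_500, 500, 250  # tokens per request
FLASH = {"cached": 0.028, "uncached": 0.14, "output": 0.28}  # $/M tokens
PRO = {"cached": 0.145, "uncached": 1.74, "output": 3.48}

def monthly_cost(rates, queries=QUERIES):
    """Dollar cost per month for the three token buckets at the given rates."""
    def millions(tokens_per_request):
        return tokens_per_request * queries / 1_000_000
    return (millions(CACHED_IN) * rates["cached"]
            + millions(UNCACHED_IN) * rates["uncached"]
            + millions(OUT) * rates["output"])

flash_total = monthly_cost(FLASH)  # the $18.20 worked above
pro_total = monthly_cost(PRO)      # roughly 11x higher
```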
Verify it worked
Three checks before you call the integration done:
- Confirm the request hits DeepSeek. Set llama_index.core.set_global_handler("simple") or wrap with an HTTP debug proxy and confirm the URL is https://api.deepseek.com/v1/chat/completions.
- Confirm retrieved context is in the prompt. Print response.source_nodes — you should see the chunks that were retrieved for the question.
- Confirm thinking mode (if enabled). The latency jumps noticeably and the underlying response object carries a non-empty reasoning_content.
Common errors and fixes
| Error | Cause | Fix |
|---|---|---|
| The model `gpt-3.5-turbo` does not exist | You used the plain OpenAI class instead of OpenAILike or DeepSeek | Switch to OpenAILike with is_chat_model=True |
| 401 Unauthorized | Wrong key or wrong env var | Print os.environ["DEEPSEEK_API_KEY"][:6]; rotate if leaked |
| Empty response.response in JSON mode | Truncation or missing schema example | Raise max_tokens; include the word “json” plus a small example schema in the prompt |
| Slow first call, fast subsequent calls | Context cache warming up | Expected — the system-prompt prefix is now cached at the cheaper tier |
| model_not_found on deepseek-chat after July 2026 | Legacy ID retired 2026-07-24 15:59 UTC | Change model to deepseek-v4-flash |
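The legacy-ID migration is mechanical enough to automate in configs that still carry old model names; a small helper (a convenience sketch, not part of any SDK):

```python
# deepseek-chat and deepseek-reasoner route to V4-Flash until they retire on
# 2026-07-24; after that, requests fail with model_not_found.
LEGACY_MODELS = {
    "deepseek-chat": "deepseek-v4-flash",
    "deepseek-reasoner": "deepseek-v4-flash",
}

def resolve_model(model_id: str) -> str:
    """Translate a retired DeepSeek model ID to its V4 replacement."""
    return LEGACY_MODELS.get(model_id, model_id)
```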
Next steps
You have a working DeepSeek-backed RAG pipeline. Reasonable extensions:
- Swap VectorStoreIndex for a hosted vector DB (Qdrant, Weaviate, Pinecone) once your corpus exceeds memory.
- Add reranking with a cross-encoder before the LLM call — it almost always beats raising similarity_top_k.
- Compare against the LangChain equivalent in DeepSeek with LangChain if your team already standardises on LangChain.
- If you want to keep everything off the public internet, point OpenAILike at a local Ollama server running an open DeepSeek weight — see running DeepSeek on Ollama.
- Browse the rest of our DeepSeek tutorials for end-to-end app builds.
Last verified: 2026-04-25. DeepSeek AI Guide is an independent resource and is not affiliated with DeepSeek or its parent company. Model IDs, pricing and API behaviour change; check the official DeepSeek documentation and pricing page before committing to a production decision.
Frequently asked questions
How do I use DeepSeek with LlamaIndex in Python?
Install llama-index-llms-deepseek, set Settings.llm = DeepSeek(model="deepseek-v4-flash", api_key=...), configure an embedding model (DeepSeek does not ship one), and build a VectorStoreIndex from your documents. Query it with index.as_query_engine().query(...). The package wraps LlamaIndex’s OpenAILike class against DeepSeek’s OpenAI-compatible POST /chat/completions endpoint. Full walkthrough in this DeepSeek Python integration guide.
What model should I use with LlamaIndex — V4-Flash or V4-Pro?
Default to deepseek-v4-flash for retrieval-augmented generation, chat, and summarisation. Output costs $0.28 per million tokens versus $3.48 for deepseek-v4-pro, and most RAG queries are bottlenecked by retrieval quality, not generation reasoning. Reach for V4-Pro when you have measured Flash failing on multi-document reasoning or complex coding tasks. Compare the tiers in detail at the DeepSeek V4 overview.
Does DeepSeek provide an embedding model for LlamaIndex?
No — DeepSeek does not currently ship an embedding model, so you’ll need an external provider. Common choices are OpenAI’s text-embedding-3-small, a local Hugging Face model such as BGE or Jina, or Ollama-hosted embeddings. Pick locally hosted embeddings if you want to keep document content off third-party servers. The DeepSeek API documentation covers the chat surface but not embeddings.
How do I enable DeepSeek thinking mode in LlamaIndex?
Thinking is a request parameter on either V4 model, not a separate model ID. Pass reasoning_effort="high" and extra_body={"thinking": {"type": "enabled"}} through additional_kwargs on the DeepSeek or OpenAILike client. The response then returns reasoning_content alongside the final content. Use it for multi-step reasoning queries and leave it off for straight summarisation. More detail in the DeepSeek API best practices.
Why am I getting a “model not found” error when using DeepSeek with LlamaIndex?
Two common causes. First, you may be using LlamaIndex’s plain OpenAI class, which restricts model names to GPT family — switch to OpenAILike or the dedicated DeepSeek wrapper. Second, after 2026-07-24 15:59 UTC the legacy IDs deepseek-chat and deepseek-reasoner are retired; update model to deepseek-v4-flash or deepseek-v4-pro. See the DeepSeek API error codes reference for the full list.
