How to Build a RAG Pipeline with DeepSeek V4 (Step-by-Step)
You have a folder of PDFs, internal docs, or support tickets. You want a chatbot that answers questions from that content — accurately, with citations, without hallucinating. That is what this DeepSeek RAG tutorial covers, end to end, using the current V4 generation released on April 24, 2026. We will chunk documents, embed them with a dedicated embedding model, store the vectors in FAISS, retrieve the top matches at query time, and pass them to `deepseek-v4-flash` for grounded answers. By the end you will have a working Python pipeline you can extend with LangChain, LlamaIndex, or your own framework, plus a costed example that reflects current V4 pricing.
What you will build
A minimal but production-shaped retrieval-augmented generation (RAG) pipeline that grounds DeepSeek V4 answers in your own documents. The architecture has two halves. First, an offline indexing step: split documents into chunks, generate embeddings with a dedicated embedding model, store them in a vector index. Second, an online query step: embed the user’s question, find the top-k matching chunks, and send those chunks plus the question to DeepSeek V4-Flash through the chat completions API.
One important caveat up front: DeepSeek’s reasoning models like R1 are not suitable for generating embeddings — they are reasoning engines, not semantic similarity models, and unless fine-tuned for embeddings they should not be used as a retrieval embedding model for RAG. The same applies to V4. Use a dedicated embedding model (Sentence Transformers, BGE, GTE-Qwen2, or similar) for the retrieval side, and DeepSeek for the generation side.
Prerequisites
- Python 3.10 or newer.
- A DeepSeek account with a funded balance — see how to get a DeepSeek API key.
- The OpenAI Python SDK (the DeepSeek API is OpenAI-compatible).
- FAISS for the vector index, and sentence-transformers for embeddings.
- Roughly 200 MB of free disk space for the embedding model and a small index.
- A folder of source documents — Markdown, plain text, or extracted PDF text.
Install everything in one line:
pip install openai faiss-cpu sentence-transformers tiktoken pypdf
Step 1: Set up the DeepSeek client
The DeepSeek chat surface is OpenAI-compatible: chat requests hit POST /chat/completions against the base URL https://api.deepseek.com. Because the API follows OpenAI ChatCompletions conventions (with Anthropic-style endpoints also available), existing client code usually needs nothing more than a new base_url and model ID. For this tutorial we will use the OpenAI SDK pattern.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com",
)
Two notes on model selection. The current generation is DeepSeek V4, shipped as two model IDs: deepseek-v4-pro (1.6T total / 49B active parameters, frontier tier) and deepseek-v4-flash (284B / 13B active, cost-efficient tier). Both are open-weight MoE models under the MIT license, and both default to a 1,000,000-token context window with up to 384,000 tokens of output. For most RAG workloads, V4-Flash is the right default — Pro is roughly 7× the output cost and only justifies the spend on hard agentic or coding tasks.
If you have older code referencing deepseek-chat or deepseek-reasoner, note that those IDs currently route to deepseek-v4-flash equivalents and will be retired by July 24, 2026, at 15:59 UTC. Migrating is a one-line model= swap; base_url does not change. New code should use the V4 IDs directly. For a deeper walkthrough of the SDK setup, see the DeepSeek API getting started guide.
Step 2: Prepare and chunk your documents
Chunking is unglamorous and matters more than the model. If chunks are too big, retrieval pulls in noise and hurts precision. Too small, and the model loses the surrounding context it needs to answer. A reasonable starting point is 500–800 tokens per chunk with a 50–100 token overlap between adjacent chunks.
from pathlib import Path

def load_documents(folder: str) -> list[dict]:
    docs = []
    for path in Path(folder).rglob("*.txt"):
        docs.append({"id": str(path), "text": path.read_text(encoding="utf-8")})
    return docs
def chunk_text(text: str, size: int = 700, overlap: int = 80) -> list[str]:
    # Word-based splitting as a rough proxy for the 500-800 token target above.
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        end = min(start + size, len(words))
        chunks.append(" ".join(words[start:end]))
        if end == len(words):
            break
        start = end - overlap
    return chunks
documents = load_documents("./data")
chunks = []
for doc in documents:
    for i, c in enumerate(chunk_text(doc["text"])):
        chunks.append({"source": doc["id"], "chunk_id": i, "text": c})
For PDFs, run them through pypdf first and feed the extracted text into load_documents. For HTML, strip tags before chunking. Keep the source path on every chunk so you can show citations back to the user later.
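As a minimal sketch of that PDF path — using pypdf's PdfReader and a helper name (extract_pdf_folder) and folder names of our own choosing — extract each PDF to a .txt file that load_documents will then pick up:

from pathlib import Path
from pypdf import PdfReader

def extract_pdf_folder(src: str, dst: str) -> None:
    # Extract text from every PDF under src and write .txt siblings into dst.
    Path(dst).mkdir(parents=True, exist_ok=True)
    for pdf_path in Path(src).rglob("*.pdf"):
        reader = PdfReader(str(pdf_path))
        text = "\n".join(page.extract_text() or "" for page in reader.pages)
        (Path(dst) / f"{pdf_path.stem}.txt").write_text(text, encoding="utf-8")

extract_pdf_folder("./pdfs", "./data")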
Step 3: Generate embeddings and build the index
We will use all-MiniLM-L6-v2 from Sentence Transformers — small, fast, 384 dimensions, and a good trade-off between encoding speed and retrieval quality for demonstrating the pipeline. For higher quality, swap it later for BAAI/bge-large-en-v1.5 or Alibaba-NLP/gte-Qwen2-7B-instruct at the cost of more memory.
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer
embedder = SentenceTransformer("all-MiniLM-L6-v2")
texts = [c["text"] for c in chunks]
vectors = embedder.encode(texts, normalize_embeddings=True, show_progress_bar=True)
vectors = np.asarray(vectors, dtype="float32")
dim = vectors.shape[1]
index = faiss.IndexFlatIP(dim) # inner product on L2-normalised vectors = cosine
index.add(vectors)
faiss.write_index(index, "rag.index")
Persist the chunks alongside the index — for a small corpus a JSON file is fine; for anything larger, use SQLite or a real vector database. The DeepSeek with LangChain and DeepSeek with LlamaIndex tutorials cover Chroma, Pinecone, and Milvus integrations if you want a managed store.
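A minimal sketch of that persistence step, assuming the chunks list from Step 2 and plain JSON on disk (the file names here are arbitrary):

import json
import faiss

# Save chunk metadata next to the FAISS index so row i in the index
# maps back to chunks[i] at query time.
with open("chunks.json", "w", encoding="utf-8") as f:
    json.dump(chunks, f, ensure_ascii=False)

# Later, at query time, reload both halves of the store.
index = faiss.read_index("rag.index")
with open("chunks.json", encoding="utf-8") as f:
    chunks = json.load(f)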
Step 4: Retrieve the top-k chunks
Embed the query with the same model used for indexing, then search the FAISS index. Returning four to six chunks is a sensible starting band — enough context, not so much that you blow the budget on irrelevant tokens.
def retrieve(query: str, k: int = 5) -> list[dict]:
    q = embedder.encode([query], normalize_embeddings=True)
    q = np.asarray(q, dtype="float32")
    scores, idx = index.search(q, k)
    return [
        {**chunks[i], "score": float(s)}
        for s, i in zip(scores[0], idx[0])
        if i != -1
    ]
Cosine scores below roughly 0.3 are usually noise. If your top result is below that threshold, return “I don’t know” rather than forcing the model to invent an answer from weak context.
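A minimal guard along those lines — the 0.3 cut-off is a heuristic, not an official threshold, so tune it against your own corpus:

MIN_SCORE = 0.3  # heuristic; tune on your own queries

def retrieve_or_refuse(query: str, k: int = 5) -> list[dict] | None:
    hits = retrieve(query, k=k)
    # If even the best match is weak, signal "no answer" instead of
    # handing the model irrelevant context to hallucinate from.
    if not hits or hits[0]["score"] < MIN_SCORE:
        return None
    return hits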
Step 5: Generate the grounded answer
Now wire retrieval into the chat call. Two things matter here: the system prompt must constrain the model to the context, and the user message must clearly delimit context from question. The structure below is the same one Milvus recommends in its DeepSeek tutorial.
SYSTEM_PROMPT = (
    "You are a documentation assistant. Answer using ONLY the context "
    "between <context> tags. If the answer is not in the context, "
    "reply exactly: 'I don't know based on the provided documents.' "
    "Cite the source path in square brackets after each claim."
)
def answer(query: str) -> str:
    hits = retrieve(query, k=5)
    context = "\n\n".join(
        f"[{h['source']}#chunk{h['chunk_id']}]\n{h['text']}" for h in hits
    )
    user_prompt = f"<context>\n{context}\n</context>\n\nQuestion: {query}"
    resp = client.chat.completions.create(
        model="deepseek-v4-flash",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
        temperature=0.0,  # factual extraction
        max_tokens=800,
    )
    return resp.choices[0].message.content
print(answer("What is our refund policy for annual subscriptions?"))
A few production notes baked into that snippet. temperature=0.0 matches DeepSeek’s official guidance for factual or code-like tasks. The API is stateless: DeepSeek does not retain prior turns, so for multi-turn RAG you must resend the conversation history with every request — this is different from the web chat, which keeps session state for you.
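A minimal sketch of what that looks like for follow-up questions — assuming you keep a plain history list in your own application code:

# Conversation state lives in your app, not on DeepSeek's side.
history = [{"role": "system", "content": SYSTEM_PROMPT}]

def answer_turn(query: str) -> str:
    hits = retrieve(query, k=5)
    context = "\n\n".join(h["text"] for h in hits)
    history.append(
        {"role": "user", "content": f"<context>\n{context}\n</context>\n\nQuestion: {query}"}
    )
    resp = client.chat.completions.create(
        model="deepseek-v4-flash",
        messages=history,  # resend the full history on every call
        temperature=0.0,
        max_tokens=800,
    )
    reply = resp.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply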
Step 6: Add structured citations with JSON mode
Free-text answers are fine for chat. For agents, evaluation, or UI components, you usually want structured output. DeepSeek supports JSON mode via response_format={"type": "json_object"}. Important caveat: JSON mode is designed to return valid JSON, but validity is not guaranteed. The prompt must include the word "json" plus an example schema, and max_tokens needs to be high enough to avoid truncation — truncated JSON is invalid JSON.
JSON_SYSTEM = (
    "Reply with valid json matching this schema: "
    '{"answer": str, "citations": [str], "confidence": float}. '
    "Use ONLY the provided context. If unknown, set answer to "
    "'I don't know' and citations to []."
)
resp = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[
        {"role": "system", "content": JSON_SYSTEM},
        {"role": "user", "content": user_prompt},
    ],
    response_format={"type": "json_object"},
    temperature=0.0,
    max_tokens=1200,
)
Handle the rare empty-content case in client code. See the DeepSeek API JSON mode reference for full details.
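A minimal sketch of that client-side guard, assuming the json_object response above and a refusal payload of our own choosing as the fallback:

import json

raw = resp.choices[0].message.content

try:
    # Empty or truncated content fails here; fall back to a refusal payload.
    payload = json.loads(raw) if raw else {}
except json.JSONDecodeError:
    payload = {}

if not payload:
    payload = {"answer": "I don't know", "citations": [], "confidence": 0.0}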
Step 7: Cost the pipeline honestly
Pricing as of April 2026, taken from the DeepSeek pricing page, for deepseek-v4-flash per 1M tokens: $0.028 cache hit, $0.14 cache miss, $0.28 output. The same workload on deepseek-v4-pro costs $0.145 / $1.74 / $3.48 — roughly 7× more on output.
Worked example for a customer-support RAG handling 100,000 queries per month, with a stable 1,500-token system prompt (cached after the first call), 3,000 tokens of retrieved context plus a 100-token question per call (uncached), and a 400-token answer:
| Bucket | Tokens (per month) | Rate (V4-Flash) | Cost |
|---|---|---|---|
| Input, cache hit (system) | 150,000,000 | $0.028 / 1M | $4.20 |
| Input, cache miss (context + question) | 310,000,000 | $0.14 / 1M | $43.40 |
| Output | 40,000,000 | $0.28 / 1M | $11.20 |
| Total | 500,000,000 | — | $58.80 |
The same workload on V4-Pro would cost roughly $700 at the listed rates. The retrieved context dominates spend — so chunking strategy and top-k tuning are the highest-leverage cost levers, not the model choice: dropping from five retrieved chunks to three, for example, shrinks the largest bucket by nearly 40%. The off-peak discount that existed during V3.x is no longer available; DeepSeek discontinued it on September 5, 2025. For more cost modelling, the DeepSeek cost estimator handles arbitrary workloads.
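To reproduce the table (and re-run it for your own workload), a short sketch of the arithmetic under the same assumptions:

QUERIES = 100_000

# Per-query token counts from the worked example above.
system_cached = 1_500       # stable system prompt, cache hit after the first call
context_miss = 3_000 + 100  # retrieved chunks plus the user question
output = 400

# V4-Flash rates per 1M tokens: cache hit / cache miss / output.
HIT, MISS, OUT = 0.028, 0.14, 0.28

cost = (
    QUERIES * system_cached / 1e6 * HIT
    + QUERIES * context_miss / 1e6 * MISS
    + QUERIES * output / 1e6 * OUT
)
print(f"${cost:.2f} per month")  # -> $58.80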
Step 8: Verify it worked
Three quick sanity checks before you ship anything:
- Retrieval recall. Hand-write 20 questions whose answers you know are in the corpus. Confirm the correct chunk is in the top-5 at least 18 times. If not, the embedding model or chunk size is wrong, not the LLM — a minimal evaluation harness sketch follows this list.
- Refusal behaviour. Ask 5 questions whose answers are not in the corpus. The model should refuse. If it hallucinates, tighten the system prompt and lower the temperature.
- Citation accuracy. Spot-check that the bracketed source paths in the answer actually contain the claimed fact. This is where most prototypes quietly fail.
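The retrieval-recall check is easy to automate. A minimal harness sketch, assuming you maintain a small list of (question, source-path) pairs of your own:

# Hand-written evaluation set: each question paired with the source file
# known to contain its answer. The entries here are placeholders.
EVAL_SET = [
    ("What is the refund window for annual plans?", "data/billing-policy.txt"),
    # ... 19 more pairs
]

hits_at_5 = 0
for question, expected_source in EVAL_SET:
    results = retrieve(question, k=5)
    if any(r["source"] == expected_source for r in results):
        hits_at_5 += 1

print(f"recall@5: {hits_at_5}/{len(EVAL_SET)}")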
Common errors and fixes
| Symptom | Likely cause | Fix |
|---|---|---|
| 402 Insufficient Balance | Account has no funded balance | Top up via the billing console; even a small amount unlocks the API |
| Model invents facts not in context | Temperature too high, or weak system prompt | Set temperature=0.0, add explicit “use ONLY the context” instruction |
| JSON mode returns empty content | Prompt missing the word “json” or schema example | Include both, raise max_tokens above any plausible output length |
| Slow embeddings on large corpora | Single-process encoding | Batch with batch_size=64, or move to a GPU machine for indexing |
| Retrieval returns unrelated chunks | Embedding model mismatch with content language or domain | Switch to BGE-large or a domain-specific model; try hybrid BM25 + dense retrieval |
Where to go from here
Three upgrades pay off quickly once the basic pipeline is working. First, hybrid retrieval: combine dense vectors with BM25 to catch exact-match queries that embeddings miss. Second, re-ranking: send the top 20 candidates from FAISS through a cross-encoder (such as bge-reranker-large) and keep only the top 5 — this typically lifts answer quality more than swapping the LLM. Third, context caching: structure prompts so the system message and any stable instructions sit at the start, where DeepSeek’s cache will detect the repeated prefix and bill at the cache-hit rate.
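As an illustration of the re-ranking upgrade, here is a sketch using sentence-transformers' CrossEncoder with the BAAI/bge-reranker-large checkpoint; treat the model choice and candidate counts as starting points rather than recommendations:

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-large")

def retrieve_reranked(query: str, recall_k: int = 20, final_k: int = 5) -> list[dict]:
    # Cast a wide net with the cheap dense index...
    candidates = retrieve(query, k=recall_k)
    # ...then let the cross-encoder score each (query, chunk) pair jointly.
    pairs = [(query, c["text"]) for c in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(scores, candidates), key=lambda x: x[0], reverse=True)
    return [c for _, c in ranked[:final_k]]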
If you want to host a chat UI on top of this pipeline, the DeepSeek Streamlit app walkthrough plugs straight into the answer() function above. For Discord or Telegram front ends, see the DeepSeek Discord bot tutorial. For other patterns and end-to-end recipes, browse the full library of DeepSeek tutorials.
Last verified: 2026-04-25. DeepSeek AI Guide is an independent resource and is not affiliated with DeepSeek or its parent company. Model IDs, pricing and API behaviour change; check the official DeepSeek documentation and pricing page before committing to a production decision.
Frequently asked questions
Can DeepSeek be used for retrieval-augmented generation?
Yes. DeepSeek V4 is well-suited as the generator in a RAG pipeline — it accepts retrieved context through standard chat messages and follows grounded-answer instructions reliably. Use a dedicated embedding model (Sentence Transformers, BGE, or GTE-Qwen2) for the retrieval side, since reasoning models are not designed for semantic similarity. Full setup steps are covered in the DeepSeek with LangChain tutorial.
What is the best DeepSeek model for RAG: V4-Flash or V4-Pro?
For most RAG workloads, V4-Flash is the right default. It handles long retrieved contexts well at a fraction of the cost — roughly 7× cheaper on output than V4-Pro. Reach for V4-Pro only when answers require multi-step reasoning over the retrieved evidence, such as legal synthesis or complex troubleshooting. Compare both on the DeepSeek V4 page.
How does the DeepSeek API differ from the web chat for RAG?
The API is stateless — every request must include the full conversation history and any retrieved context. The web chat and mobile app, by contrast, maintain session history server-side. For RAG you almost always want the API, since you control retrieval and prompt construction. Authentication and quickstart code live in the DeepSeek API documentation.
Does DeepSeek provide its own embedding model?
No. DeepSeek publishes chat and reasoning models, not a dedicated embedding model. RAG pipelines pair DeepSeek’s generator with an external embedding model — Sentence Transformers MiniLM for fast prototypes, BGE-large or GTE-Qwen2 for higher recall. The OpenAI-compatible client only exposes chat completions on DeepSeek’s surface. See the DeepSeek OpenAI SDK compatibility notes for what is and is not exposed.
How much does a DeepSeek RAG application cost to run?
For 100,000 queries per month with a typical 3,000-token retrieved context and 400-token answer, V4-Flash works out to roughly $59 per month at April 2026 pricing — $0.14 per 1M cache-miss input and $0.28 per 1M output, with a portion of the system prompt billed at the $0.028 cache-hit rate. Model the numbers for your own workload with the DeepSeek API pricing reference.
