DeepSeek Offline Setup: Run V4 and R1 on Your Own Hardware
Can you actually run DeepSeek without sending a single token to a Chinese server? Yes — and a working DeepSeek offline setup is the only honest answer to the privacy and compliance questions that keep coming up in client meetings. The open weights are on Hugging Face under the MIT licence; the harder question is which model your hardware can realistically host, and which runtime to use.
This guide walks through three deployment paths that I run in production today: a small distilled R1 on a single consumer GPU, the new DeepSeek V4-Flash on a serious workstation or single-node server, and a sketch of what it takes to host V4-Pro. Expect concrete VRAM numbers, real commands, and clear warnings about where the marketing departs from reality.
What you’ll build
By the end of this tutorial you will have a DeepSeek model running on hardware you control, exposing an OpenAI-compatible endpoint on localhost. Your prompts will not leave the machine. You will be able to swap your existing OpenAI client over by changing two lines: base_url and model.
Three deployment tiers, by hardware budget:
- Tier 1 — Laptop / single consumer GPU. A DeepSeek R1 distill (1.5B to 32B) via Ollama. Reasoning quality on a $200 used GPU.
- Tier 2 — Workstation or single-node server. `deepseek-v4-flash` via vLLM. The current-generation MoE model, MIT-licensed weights, OpenAI-compatible API.
- Tier 3 — Multi-node cluster. `deepseek-v4-pro` via vLLM with tensor and pipeline parallel. Frontier-tier; not a desktop install.
If you want a comparison of the API hosted by DeepSeek vs running it yourself, see the AI comparison hub and the DeepSeek privacy writeup.
Prerequisites
- Operating system: Linux (Ubuntu 22.04+ tested), macOS 14+ on Apple Silicon, or Windows 11 with WSL2.
- Disk: 10 GB for an 8B distill, 200 GB+ for V4-Flash, 1.5 TB+ for V4-Pro.
- Python: 3.10 or 3.11 (vLLM is fussy about newer versions at time of writing).
- GPU drivers: NVIDIA driver 550+ with CUDA 12.4+ for vLLM. Apple Silicon needs nothing extra for Ollama.
- A Hugging Face account and access token for downloading V4 weights (the R1 distills can be pulled via Ollama without one).
- Optional but recommended: DeepSeek hardware calculator to size your GPU before you commit to a checkpoint.
Hardware reality check
The phrase “run DeepSeek locally” hides an order-of-magnitude difference between models. Pick from this table before you start downloading anything.
| Model | Total / active params | Practical local VRAM (quantised) | Realistic host |
|---|---|---|---|
| R1 Distill 1.5B | 1.5B dense | ~2 GB | Any modern laptop |
| R1 Distill 8B | 8B dense | ~6 GB | RTX 3060 12GB / M-series Mac |
| R1 Distill 14B | 14B dense | ~10–12 GB | RTX 3080 / 4070 |
| R1 Distill 32B | 32B dense | ~20–24 GB | RTX 4090 / Mac Studio 32GB+ |
| R1 Distill 70B | 70B dense | ~40 GB | 2× RTX 3090 / Mac Studio 128GB |
| DeepSeek V4-Flash | 284B / 13B MoE | ~150–200 GB (FP4+FP8) | 2× A100 80GB or 1× H200 |
| DeepSeek V4-Pro | 1.6T / 49B MoE | ~800 GB+ (FP4+FP8) | Multi-node H100/H200 cluster |
Two facts worth absorbing. First, on V4 only 13B (Flash) or 49B (Pro) parameters are active per token, which makes inference compute manageable — but the full weights still have to fit somewhere, so total memory dominates the cost. DeepSeek’s own local-deployment notes point to model weight conversion, multi-process torchrun execution, model-parallel settings and multi-node inference — a very different operating model from a desktop app or an Ollama pull. Second, the KV cache consumes more RAM than the model itself for long contexts, so the 1M-token context window is theoretical until you allocate memory for it.
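To see why, here is a back-of-envelope KV cache estimator. The layer and head counts below are illustrative placeholders, not V4's published architecture (DeepSeek's latent-attention designs compress the cache well below this naive bound), but the scaling with context length is the point:

```python
# Naive KV cache sizing for a dense transformer: 2 tensors (K and V)
# per layer, each [kv_heads x head_dim] per token. All architecture
# numbers here are ILLUSTRATIVE, not DeepSeek V4's real config.

def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 context_len: int, bytes_per_elem: int = 2) -> float:
    total = 2 * layers * kv_heads * head_dim * context_len * bytes_per_elem
    return total / 1024**3

# Hypothetical 60-layer model, 8 KV heads of dim 128, FP16 cache:
for ctx in (131_072, 393_216, 1_048_576):   # 128K, 384K, 1M tokens
    print(f"{ctx:>9} tokens -> ~{kv_cache_gib(60, 8, 128, ctx):.0f} GiB per sequence")
```

Even with aggressive compression, a million-token sequence carries a cache measured in tens of gigabytes, which is why `--max-model-len` is the first knob to turn when memory runs out.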
Tier 1: DeepSeek R1 distill via Ollama
This is the simplest DeepSeek offline setup and the right starting point for almost everyone. The R1 distillations are smaller dense models (Qwen and Llama backbones) trained on R1’s reasoning traces. They keep most of the chain-of-thought behaviour and run on consumer hardware.
Step 1 — Install Ollama
Run the appropriate shell command:
# Linux
curl -fsSL https://ollama.com/install.sh | sh
# macOS — use the desktop app from ollama.com, or:
brew install ollama
# Windows — download the installer from ollama.com/download
Step 2 — Pull a model
Pick the largest variant that fits your VRAM. For 12 GB cards, the 14B is the sweet spot; for 24 GB cards, the 32B; for anything smaller, the 8B.
ollama pull deepseek-r1:8b # ~5 GB on disk
ollama pull deepseek-r1:14b # ~9 GB on disk
ollama pull deepseek-r1:32b # ~20 GB on disk
Step 3 — Run it
ollama run deepseek-r1:14b 'A bat and a ball cost $1.10. The bat costs $1 more than the ball. How much is the ball?'
You should see <think> tags wrapping the model’s reasoning before the final answer ($0.05). If the answer is $0.10, the model skipped reasoning — drop the temperature and try again.
Step 4 — Expose an OpenAI-compatible endpoint
Ollama already serves one. The endpoint at http://localhost:11434/v1 implements the OpenAI chat completions API. Point the OpenAI SDK at this URL and use deepseek-r1:14b as the model name. Ollama does not validate API keys, so any string works.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # any key string works
resp = client.chat.completions.create(
    model="deepseek-r1:14b",
    messages=[{"role": "user", "content": "Refactor this Python script."}],
)
print(resp.choices[0].message.content)  # may include the <think> trace; strip it if you parse downstream
For a deeper walkthrough of the Ollama path including Modelfile customisation, see running DeepSeek on Ollama.
Tier 2: DeepSeek V4-Flash via vLLM
If you want the current generation rather than R1, V4-Flash is the model to host yourself. DeepSeek-V4-Flash has a total of 284 billion parameters and 13 billion active parameters and is released on Hugging Face under the MIT License. The instruct checkpoint uses FP4 for MoE expert weights and FP8 for everything else, which keeps the on-disk footprint manageable for a frontier-class model.
Step 1 — Provision the hardware
Realistic single-node configurations for V4-Flash with a 128K context (not the full 1M):
- 2× NVIDIA A100 80GB
- 1× NVIDIA H200 141GB
- 4× RTX 6000 Ada 48GB (tight; expect to cap context lower)
Step 2 — Download the weights
pip install huggingface_hub
huggingface-cli login # paste your HF token
huggingface-cli download deepseek-ai/DeepSeek-V4-Flash \
  --local-dir ./deepseek-v4-flash
This pulls the FP4+FP8 instruct checkpoint (around 150–180 GB depending on shard layout).
Step 3 — Install vLLM
Use a fresh virtualenv. vLLM 0.9 or later is required for V4 architecture support:
python -m venv .venv && source .venv/bin/activate
pip install --upgrade pip
pip install "vllm>=0.9.0"
Step 4 — Serve the model
The command below starts an OpenAI-compatible server on port 8000 with two-way tensor parallel and a 128K context window:
python -m vllm.entrypoints.openai.api_server \
  --model ./deepseek-v4-flash \
  --tensor-parallel-size 2 \
  --max-model-len 131072 \
  --trust-remote-code \
  --port 8000
For local sampling, the model card recommends setting temperature = 1.0 and top_p = 1.0, and for the Think Max reasoning mode setting the context window to at least 384K tokens. Bump --max-model-len to 393216 if you need max-effort thinking, and accept that KV cache memory will jump accordingly.
Step 5 — Talk to it from Python
vLLM exposes the same Chat Completions wire format DeepSeek’s hosted API uses. Chat requests hit POST /chat/completions, the OpenAI-compatible endpoint, and the SDK swap is two lines:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")
resp = client.chat.completions.create(
    model="./deepseek-v4-flash",
    messages=[{"role": "user", "content": "Plan the migration."}],
    reasoning_effort="high",
    extra_body={"thinking": {"type": "enabled"}},
    temperature=1.0,
    top_p=1.0,
    max_tokens=8000,
)
print(resp.choices[0].message.content)
When thinking is enabled, the response returns reasoning_content alongside the final content. Other parameters worth knowing: temperature, top_p, max_tokens, reasoning_effort, plus JSON mode and tool calling on the same endpoint. JSON mode is designed to return valid JSON but does not guarantee it — include the word “json” and a small example schema in your prompt, and set max_tokens high enough to avoid truncation.
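A minimal JSON-mode sketch against the local server, assuming your vLLM build accepts the OpenAI-style response_format field (recent releases do); the ticket text and schema are made up for illustration:

```python
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")
resp = client.chat.completions.create(
    model="./deepseek-v4-flash",
    messages=[{
        "role": "user",
        "content": (
            "Summarise this ticket as json. "
            'Example schema: {"component": "...", "severity": "low|medium|high"}\n'
            "Ticket: GPU node 3 keeps dropping off the InfiniBand fabric."
        ),
    }],
    response_format={"type": "json_object"},  # JSON mode, per the OpenAI wire format
    max_tokens=512,  # headroom so the object is not truncated mid-brace
)
print(json.loads(resp.choices[0].message.content))
```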
Step 6 — Verify it worked
Three quick checks:
- `curl http://localhost:8000/v1/models` should return a JSON list including your model path.
- `nvidia-smi` should show GPU memory pinned across all cards involved in tensor parallel.
- A test prompt should complete in under five seconds for short outputs. If it takes 30+ seconds, you are probably swapping to system RAM — reduce `--max-model-len`.
Tier 3: DeepSeek V4-Pro on a cluster
V4-Pro is not a single-machine workload. DeepSeek-V4-Pro has a total of 1.6 trillion parameters and 49 billion active parameters, and even at FP4+FP8 mixed precision the weights fill multiple H100 or H200 nodes. Plan for tensor parallel within a node and pipeline parallel across nodes, plus a fast interconnect (InfiniBand or NVLink Switch).
The hosted API is dramatically cheaper than the cluster you would otherwise build. As context, V4-Flash on the official DeepSeek API pricing is $0.028 / $0.14 / $0.28 per million tokens (cache hit / cache miss / output) and V4-Pro is $0.145 / $1.74 / $3.48. For 1,000,000 calls with a 2,000-token cached system prompt, a 200-token user message and a 300-token response, V4-Flash totals $168.00 ($56 cached input + $28 uncached input + $84 output). The same workload on V4-Pro costs $1,682.00. For most teams, “offline” means tier 1 or tier 2, with the hosted API used as a fallback for the heaviest jobs.
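A few lines of Python reproduce that arithmetic and let you rerun it for your own traffic shape; the prices are the ones quoted above, so check them against the official pricing page before deciding anything:

```python
# Cost per workload = tokens in each bucket x price per million, summed.
def workload_cost(calls, cached_in, uncached_in, out, hit, miss, out_price):
    return calls * (cached_in * hit + uncached_in * miss + out * out_price) / 1_000_000

flash = workload_cost(1_000_000, 2000, 200, 300, 0.028, 0.14, 0.28)
pro = workload_cost(1_000_000, 2000, 200, 300, 0.145, 1.74, 3.48)
print(f"V4-Flash: ${flash:,.2f}   V4-Pro: ${pro:,.2f}")  # $168.00 vs $1,682.00
```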
Common errors and fixes
| Symptom | Likely cause | Fix |
|---|---|---|
| `CUDA out of memory` on vLLM startup | Context window too large for VRAM | Reduce `--max-model-len` by half; restart |
| Ollama responds at 1–2 tok/s on a GPU machine | Model is running on CPU | Run `ollama ps`; reinstall NVIDIA drivers if GPU not listed |
| JSON mode returns empty content | Prompt missing the word “json” or schema; `max_tokens` too low | Add an example schema; raise `max_tokens` |
| `<think>` blocks break downstream parsers | Reasoning trace leaking into structured output | Strip `<think>...</think>` in a post-processing step |
| vLLM refuses to load V4 weights | vLLM version too old | Upgrade to 0.9+; `pip install -U vllm` |
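For the `<think>` leakage row, a minimal post-processing sketch; it assumes the trace arrives inline in the content string, as in the Tier 1 example earlier:

```python
import re

THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_think(text: str) -> str:
    """Drop the reasoning trace before handing output to a JSON parser."""
    return THINK_RE.sub("", text)

raw = "<think>Let the ball cost x; then the bat is x + 1...</think>The ball costs $0.05."
print(strip_think(raw))  # -> "The ball costs $0.05."
```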
Local vs hosted: when to switch
Local inference removes the network and the third party from your data path. It does not remove the operations work. A few patterns from production:
- Privacy-bound workloads (legal, healthcare, internal source code) → local, every time. Even the smallest R1 distill running offline beats sending the same prompt to any hosted API.
- Bursty, low-volume traffic → hosted. A workstation idling at 400 W for two requests a day is poor economics.
- High-volume batch processing → run the math both ways. Local V4-Flash on a depreciated H200 can undercut the hosted price after a few months.
- Mixed setup → Tier 1 locally for sensitive snippets, hosted V4-Pro for the heavy reasoning jobs that don’t touch private data.
A note on the API surface
The hosted DeepSeek API behaves differently from the web app: it is stateless, so the client must resend the conversation history with every request. Your local vLLM server works the same way: stateless. DeepSeek’s API documentation also lists V4 Pro and V4 Flash as current model options, with support for thinking and non-thinking modes, JSON output, tool calls, OpenAI format access, and Anthropic format access. The legacy IDs deepseek-chat and deepseek-reasoner still work against the hosted API and route to deepseek-v4-flash, but they retire on 2026-07-24 at 15:59 UTC — migrating is a one-line model= swap.
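Statelessness in practice means appending each assistant turn to your own message list and resending the lot. A minimal sketch against the local Tier 2 server:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")
history = [{"role": "user", "content": "Name one risk in our DB migration plan."}]

first = client.chat.completions.create(model="./deepseek-v4-flash", messages=history)

# The server keeps no session state: append the assistant turn ourselves,
# then send the entire history again with the follow-up question.
history.append({"role": "assistant", "content": first.choices[0].message.content})
history.append({"role": "user", "content": "How would you mitigate it?"})

second = client.chat.completions.create(model="./deepseek-v4-flash", messages=history)
print(second.choices[0].message.content)
```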
Next steps
Once your DeepSeek offline setup is stable, the natural follow-ons are:
- install DeepSeek locally — a deeper dive into platform-specific install steps if you hit issues above.
- DeepSeek RAG tutorial — wire your local model to a vector store so it can answer questions about your private documents.
- fine-tuning DeepSeek — adapt a distill to your domain once the base setup proves out.
- DeepSeek V4-Flash — full model page with architecture details and benchmark context.
Last verified: 2026-04-25. DeepSeek AI Guide is an independent resource and is not affiliated with DeepSeek or its parent company. Model IDs, pricing and API behaviour change; check the official DeepSeek documentation and pricing page before committing to a production decision.
Frequently asked questions
Is a DeepSeek offline setup actually private?
Yes. When you run an open-weight checkpoint on hardware you control, no prompts or responses leave your machine. There is no telemetry from the model files themselves. The privacy concerns raised about DeepSeek concern the hosted chatbot at chat.deepseek.com and the API, both subject to Chinese data law. For a fuller breakdown, see DeepSeek privacy and the is DeepSeek safe guide.
Can I run DeepSeek V4-Pro on a laptop?
No. V4-Pro is a 1.6 trillion parameter MoE model and even with FP4+FP8 mixed precision its weights fill multiple H100/H200 GPUs across more than one node. Realistic single-machine offline setups stop at V4-Flash on a workstation with 2× A100 80GB or an H200, or at the R1 distills on a single consumer GPU. Use the hosted API or a managed cluster for V4-Pro. The DeepSeek V4-Pro page covers the architecture trade-offs.
How does Ollama compare to vLLM for DeepSeek?
Ollama is the simplest path for the R1 distill family — single command to pull, single command to run, sensible defaults. vLLM is what you want for V4-Flash and any production deployment: faster inference, better batching, full control over tensor parallel and context length. Use Ollama to prototype, vLLM to serve. The running DeepSeek on Ollama tutorial covers the Ollama side in detail.
What hardware do I need to run the DeepSeek R1 32B distill?
Around 20–24 GB of VRAM at Q4_K_M quantisation. In practice this means an RTX 4090 (24 GB), an RTX 5090 (32 GB), or an Apple Silicon Mac with 32 GB or more of unified memory. The 32B distill is the quality inflection point — it captures most of full R1’s reasoning ability on math and coding benchmarks. See the DeepSeek R1 Distill model page and the DeepSeek system requirements guide.
Why use offline DeepSeek instead of the hosted API?
Three reasons: data residency (prompts never leave your network), no per-token cost after the hardware is paid for, and no dependency on an external API’s uptime or rate limits. The trade-offs are the up-front capital cost, electricity, and operations overhead. For low-volume bursty workloads the hosted API usually wins; for steady high-volume or privacy-bound work, local can be cheaper after a few months. The DeepSeek tutorials hub has more deployment patterns to compare.
