How to Install DeepSeek Locally on Windows, macOS, or Linux

Learn how to install DeepSeek locally with Ollama, vLLM, or Hugging Face. Hardware specs, V4 weights, and step-by-step commands. Start in minutes.

Tutorials · April 25, 2026 · By DS Guide Editorial

You want your prompts to stay on your machine, your latency to stop depending on someone else’s data centre, and your bill to stop depending on token counts. That is what learning how to install DeepSeek locally buys you — at the cost of disk space, GPU memory, and an evening of setup. The catch in 2026 is that DeepSeek’s flagship V4 family is enormous: DeepSeek-V4-Pro has 1.6T parameters (49B activated) and DeepSeek-V4-Flash has 284B parameters (13B activated), both supporting a context length of one million tokens. So most readers will pick a smaller distilled R1 variant for laptops and reserve V4 for serious GPU rigs. This tutorial walks through both paths — Ollama for the easy route, vLLM and Hugging Face for the production route — with the hardware reality checks in between.

What you will end up with

By the end of this guide you will have a DeepSeek model serving completions on http://localhost:11434 (Ollama) or http://localhost:8000/v1 (vLLM), reachable from any OpenAI-compatible client by changing two lines of code. You will know which DeepSeek variant fits your hardware, how to download it without surprises, and how to verify it is running properly. The same setup powers offline coding assistants, private chatbots, and local RAG pipelines — see our DeepSeek RAG tutorial for that follow-on.

Two facts before you start. First, the DeepSeek API is a separate hosted service: it speaks POST /chat/completions at https://api.deepseek.com and is stateless, meaning the client resends the full conversation history on every call. A local install replicates that wire format on your own machine. Second, the V4 family is open-weight under the MIT license, but the full models are nowhere near laptop-class; even V4-Flash is a serious multi-GPU deployment. Plan accordingly.
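
If the stateless part feels abstract, here is what it means in code. A minimal sketch, assuming you already have Path A below running with deepseek-r1:8b: the server remembers nothing between calls, so the second turn only knows the name because the client resends the whole transcript.

from openai import OpenAI

# Works against Ollama (port 11434) or vLLM (port 8000); the wire format is the same
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

history = [{"role": "user", "content": "My name is Ada. Remember it."}]
first = client.chat.completions.create(model="deepseek-r1:8b", messages=history)
history.append({"role": "assistant", "content": first.choices[0].message.content})

# Stateless means no server-side session: we resend the full history ourselves
history.append({"role": "user", "content": "What is my name?"})
second = client.chat.completions.create(model="deepseek-r1:8b", messages=history)
print(second.choices[0].message.content)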

Prerequisites

  • Operating system: Windows 10/11, macOS 12+ (Apple Silicon strongly preferred), or a recent Linux (Ubuntu 22.04/24.04, Rocky 10, Debian 13).
  • Disk space: 5 GB for a tiny distilled R1, 20–50 GB for mid-size models, 400 GB+ for V4 quantized weights.
  • RAM/VRAM: see the hardware table below.
  • GPU drivers: recent NVIDIA driver + CUDA 12.x for Linux GPU paths; nothing extra for Apple Silicon (Metal is built in).
  • Software: Python 3.10+, git, git-lfs, plus either Ollama (easy path) or pip for vLLM (advanced path).
  • Optional: a Hugging Face account if you want the official upstream weights instead of community quantizations.

Pick the right DeepSeek variant for your hardware

This is the step most tutorials skip. The “DeepSeek” you can actually run depends on what GPU or unified memory you have. Here is an honest matrix based on the official model cards and the community quantization guides.

Model | Total params | Active params | Realistic minimum hardware | Best for
--- | --- | --- | --- | ---
DeepSeek-R1-Distill-Qwen 1.5B | 1.5B | 1.5B | 4 GB RAM, CPU-only OK | Quick experiments, scripts
DeepSeek-R1-Distill-Llama 8B | 8B | 8B | 8–16 GB RAM, optional 8 GB VRAM | Laptop reasoning, offline Q&A
DeepSeek-R1-Distill-Qwen 32B | 32B | 32B | 24 GB VRAM (RTX 3090/4090) at 4-bit | Serious local coding
DeepSeek-R1-Distill-Llama 70B | 70B | 70B | 2× 24 GB GPUs or 64 GB unified memory | Workstation reasoning
DeepSeek-V4-Flash | 284B (MoE) | 13B | 2× A100 80 GB or 1× H200 (FP4/FP8) | Production self-hosting
DeepSeek-V4-Pro | 1.6T (MoE) | 49B | Multi-node cluster | Frontier research, labs

The full R1 model has 671 billion parameters and requires serious GPU infrastructure. For practical local use, DeepSeek released distilled variants based on Qwen 2.5 and Llama architectures, ranging from 1.5B to 70B parameters. The distilled models retain the chain-of-thought reasoning behavior while running on commodity hardware. If you want to size your rig before downloading, our DeepSeek hardware calculator takes a model ID and returns realistic VRAM and disk numbers.
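
If you just want the back-of-the-envelope arithmetic the calculator automates: weight memory is roughly parameter count times bytes per weight, plus headroom for the KV cache and runtime buffers. A rough sketch; the 20% overhead factor is our assumption, not an official figure.

def estimate_vram_gb(params_billion: float, bits_per_weight: int = 4,
                     overhead: float = 1.2) -> float:
    """Weights at the given quantization plus ~20% assumed headroom
    for KV cache and runtime buffers."""
    weight_gb = params_billion * bits_per_weight / 8  # 1B params at 8-bit ~= 1 GB
    return weight_gb * overhead

# 32B distill at 4-bit: ~19 GB, which is why the table pairs it with a 24 GB GPU
print(round(estimate_vram_gb(32), 1))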

Quick rule: if you are on a laptop, install an R1 distill via Ollama. If you have a 24 GB GPU or a 64 GB+ Apple Silicon Mac, you can stretch to 32B or 70B. If you have multi-GPU servers, you can attempt V4-Flash. V4-Pro belongs in clusters.

Path A — Install DeepSeek locally with Ollama (easiest)

Ollama is the fastest way to go from zero to a running model. It downloads and organizes model weights in GGUF format, auto-detects GPU acceleration (NVIDIA, AMD, Apple Silicon), and exposes a REST endpoint compatible with OpenAI client libraries. If your goal is “I just want to chat with DeepSeek offline,” stop here.

Step 1: Install Ollama

Download the installer for your platform from ollama.com. On macOS and Windows it is a normal app installer. On Linux, run the official one-line shell installer. After install, restart your terminal so ollama is on your PATH, then verify with this Bash command:

ollama --version

Step 2: Pull a DeepSeek model

Pick a tag that matches your hardware. For a laptop, the 8B distilled R1 is the practical sweet spot. Run this in your terminal:

ollama pull deepseek-r1:8b

The download is about 5.2 GB. On a 100 Mbps connection that works out to roughly seven minutes at full speed; real-world mirrors are often slower. Larger variants (deepseek-r1:32b, deepseek-r1:70b) take proportionally longer and need correspondingly more RAM/VRAM. For DeepSeek V4 you can try the Ollama-hosted deepseek-v4-flash:cloud tag, but be aware it routes inference through Ollama’s cloud, not your machine — read the tag description before assuming it is fully local.

Step 3: Run an interactive session

Start chatting from the terminal:

ollama run deepseek-r1:8b

The first run takes 10–30 seconds to load the model into memory. After that, you get a prompt where you can type questions. Type /bye to exit the session. Reasoning output from R1 distills appears inside <think> blocks before the final answer; it is the same reasoning/answer split the hosted API exposes as reasoning_content alongside content (see the parsing sketch at the end of Step 4).

Step 4: Use the OpenAI-compatible API

Ollama exposes a REST server on port 11434 by default. Any OpenAI SDK can talk to it by overriding base_url. Here is a minimal Python example:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # Ollama ignores the key, but the SDK requires one
)

resp = client.chat.completions.create(
    model="deepseek-r1:8b",
    messages=[{"role": "user", "content": "Explain MoE in two sentences."}],
    temperature=0.6,
)
print(resp.choices[0].message.content)

The same pattern works in Node.js, Go, and any other language with an OpenAI client. For a deeper walkthrough of the parameters, see our running DeepSeek on Ollama tutorial.
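
One wrinkle specific to R1 distills: the <think> block from Step 3 arrives inline in message.content, not as a separate field, at least with the stock distill template (newer Ollama builds may split it out). A minimal sketch of separating reasoning from the final answer; the split logic is ours, not part of the Ollama API:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
resp = client.chat.completions.create(
    model="deepseek-r1:8b",
    messages=[{"role": "user", "content": "Is 97 prime? Answer yes or no."}],
)

raw = resp.choices[0].message.content
# Assumes the template wraps reasoning in <think>...</think>; if the tag is
# missing, everything lands in `reasoning` and `answer` stays empty
reasoning, _, answer = raw.partition("</think>")
print("reasoning:", reasoning.replace("<think>", "").strip()[:200])
print("answer:", answer.strip())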

Path B — Install DeepSeek V4 locally with vLLM (production)

If you have multi-GPU hardware and want the official V4 weights, vLLM is the standard serving stack. This path replicates what DeepSeek’s hosted API does behind the scenes: an OpenAI-compatible server on port 8000 backed by the real upstream weights.

Step 1: Install system dependencies (Linux)

You need recent NVIDIA drivers, CUDA 12.x, and Python with build tools. Install git, git-lfs, python3, python3-venv, and python3-pip with apt, then run git lfs install once. Create a virtual environment to keep things tidy:

python3 -m venv dsv4-env
source dsv4-env/bin/activate
pip install -U pip wheel setuptools
pip install -U huggingface_hub torch transformers safetensors vllm
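
Before you commit to a multi-hundred-gigabyte download, confirm the environment imports cleanly and can see your GPUs. A quick check, run from inside the activated venv:

import torch
import vllm

# Both imports must resolve from dsv4-env, not the system Python
print("vllm", vllm.__version__)
print("CUDA available:", torch.cuda.is_available(),
      "| GPUs visible:", torch.cuda.device_count())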

Step 2: Download the weights from Hugging Face

Use the huggingface-cli tool to pull the official V4-Flash repository. The command below downloads the full instruct model into a local folder, which vLLM can load directly:

huggingface-cli download deepseek-ai/DeepSeek-V4-Flash \
  --local-dir ./deepseek-v4-flash

Plan for hundreds of gigabytes of disk for V4-Flash and substantially more for V4-Pro. Avoid downloading directly to your system disk.
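
If you would rather script the download (for automatic resume, or inside a provisioning job), the same pull works from Python via huggingface_hub; this is equivalent to the CLI call above:

from huggingface_hub import snapshot_download

# Same repository as the CLI example; interrupted downloads resume automatically
path = snapshot_download(
    repo_id="deepseek-ai/DeepSeek-V4-Flash",
    local_dir="./deepseek-v4-flash",
)
print("weights in:", path)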

Step 3: Serve the model with vLLM

vLLM ships an OpenAI-compatible server. The minimum sensible launch command for a 2× A100 80 GB box looks like this:

python -m vllm.entrypoints.openai.api_server \
  --model ./deepseek-v4-flash \
  --tensor-parallel-size 2 \
  --max-model-len 131072 \
  --trust-remote-code \
  --port 8000

This starts tensor parallelism across the two GPUs (--tensor-parallel-size 2) with a 128K context window (--max-model-len 131072) on port 8000. You can push --max-model-len closer to 1,000,000 tokens (the V4 default), but only if you have the KV-cache headroom; for the Think Max reasoning mode, the recommended context window is at least 384K tokens.
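
Client-side, it is the same two-line change as Path A: swap the port and pass the path you served as the model name. A minimal sketch; by default vLLM registers the model under whatever you passed to --model:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="vllm")  # key is ignored

resp = client.chat.completions.create(
    model="./deepseek-v4-flash",  # must match the --model argument
    messages=[{"role": "user", "content": "Say hello in five words."}],
)
print(resp.choices[0].message.content)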

Step 4: Set sampling parameters correctly

DeepSeek’s official guidance for V4 differs from older models. For local deployment, the recommendation is to set sampling parameters to temperature = 1.0 and top_p = 1.0. For task-specific work on the hosted API the recommended temperatures are 0.0 for code and maths, 1.0 for data analysis, 1.3 for general conversation and translation, and 1.5 for creative writing. Match those if you are mirroring hosted behaviour.
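
If one local endpoint serves several workloads, it is worth encoding those recommendations rather than trusting memory. A small sketch; the task labels are our own, the numbers are the figures above:

# Temperature guidance by task; labels are ours, values are from the text above
TEMPERATURE_BY_TASK = {
    "code": 0.0,
    "math": 0.0,
    "data_analysis": 1.0,
    "conversation": 1.3,
    "translation": 1.3,
    "creative_writing": 1.5,
}

def sampling_params(task: str) -> dict:
    # Unknown tasks fall back to the local-deployment default of 1.0/1.0
    return {"temperature": TEMPERATURE_BY_TASK.get(task, 1.0), "top_p": 1.0}

print(sampling_params("code"))  # {'temperature': 0.0, 'top_p': 1.0}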

Verify the install worked

Whichever path you took, run a smoke test against your local endpoint with curl:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-r1:8b",
    "messages": [{"role": "user", "content": "Reply with the single word OK."}]
  }'

For vLLM, swap the port to 8000 and the model name to ./deepseek-v4-flash. A clean response with "OK" in choices[0].message.content means the stack is healthy. If you get a connection-refused error, the server is not running. If you get a CUDA out-of-memory error, drop to a smaller variant or a shorter context length.

Common errors and fixes

Error | Likely cause | Fix
--- | --- | ---
CUDA out of memory | Model + KV cache exceed VRAM | Use a smaller variant, lower --max-model-len, or add GPUs and raise --tensor-parallel-size
Connection refused on 11434/8000 | Server not started or crashed | Re-run the launch command; check logs for OOM or driver errors
No module named vllm | Wrong Python environment | Re-activate your venv; pip install -U vllm
Hugging Face 401 | Repo gated or token missing | Run huggingface-cli login with a valid access token
Slow generation on CPU | No GPU acceleration available | Expect 1–6 tok/s; switch to a 1.5B variant or rent a GPU
Garbled output from V4 | Wrong sampling defaults | Set temperature=1.0, top_p=1.0 per the model card

Local install vs the hosted API

Self-hosting is the right answer for privacy, offline work, and long-term volume. It is the wrong answer when you want the latest model behaviour, no GPU bills, or a one-off experiment. The hosted DeepSeek API uses two model IDs — deepseek-v4-pro and deepseek-v4-flash — and still accepts the legacy deepseek-chat and deepseek-reasoner IDs, both of which remap to deepseek-v4-flash when they are retired on 2026-07-24 at 15:59 UTC. Chat requests hit POST /chat/completions, the OpenAI-compatible endpoint at https://api.deepseek.com; an Anthropic-compatible surface lives at the same base URL.

A worked cost example for the hosted deepseek-v4-flash tier — 1,000,000 calls with a 2,000-token cached system prompt, 200-token user message, and 300-token response — comes out as follows (recomputed in the sketch after the list):

  • Cached input: 2,000 × 1,000,000 = 2,000,000,000 tokens × $0.028/M = $56.00
  • Uncached input: 200 × 1,000,000 = 200,000,000 tokens × $0.14/M = $28.00
  • Output: 300 × 1,000,000 = 300,000,000 tokens × $0.28/M = $84.00
  • Total: $168.00 (per DeepSeek’s pricing page, as of April 2026)
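
The arithmetic is easy to rerun when the pricing page changes; a sketch using the April 2026 rates quoted above:

CALLS = 1_000_000
# $ per million tokens, per DeepSeek's pricing page as of April 2026
RATES = {"cached_input": 0.028, "uncached_input": 0.14, "output": 0.28}
TOKENS_PER_CALL = {"cached_input": 2_000, "uncached_input": 200, "output": 300}

total = 0.0
for kind, rate in RATES.items():
    tokens = TOKENS_PER_CALL[kind] * CALLS
    cost = tokens / 1_000_000 * rate
    total += cost
    print(f"{kind}: {tokens:,} tokens -> ${cost:,.2f}")
print(f"total: ${total:,.2f}")  # $168.00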

Compare that with the amortised cost of a workstation GPU plus electricity. For low-volume personal use, the hosted API usually wins on dollars; for high-volume or sensitive workloads, local wins. If you want a richer breakdown, see our DeepSeek free vs paid guide and the DeepSeek API pricing reference.

Plug your local install into real tools

Once Ollama or vLLM is exposing an OpenAI-compatible endpoint, you can wire it into almost anything that accepts a custom base URL: coding assistants, chat front-ends, and RAG frameworks all speak the same wire format, so pointing them at http://localhost:11434/v1 or http://localhost:8000/v1 is usually a one-line config change.

Next steps

You now have a working local install. The next moves are: tune prompts (our DeepSeek prompt engineering guide is the fast track), build a retrieval layer for your own documents, or fine-tune a smaller distill on domain data. For the broader catalogue of walkthroughs, browse the DeepSeek tutorials hub. If you decide cloud beats local for your case after all, our DeepSeek API getting started guide gets you on the hosted endpoint in five minutes.

Last verified: 2026-04-25. DeepSeek AI Guide is an independent resource and is not affiliated with DeepSeek or its parent company. Model IDs, pricing and API behaviour change; check the official DeepSeek documentation and pricing page before committing to a production decision.

Can I install DeepSeek V4 on a laptop?

Not the full V4 models — both V4-Pro (1.6T total parameters) and V4-Flash (284B) need server-class GPUs. On a laptop, install a DeepSeek-R1 distilled variant via Ollama instead; the 1.5B and 8B distills run on CPU or modest GPUs. For Apple Silicon Macs with 64 GB+ unified memory, heavily quantized 32B or 70B distills are realistic. See our DeepSeek system requirements page for specifics.

Is DeepSeek free to install locally?

Yes. DeepSeek’s recent releases — V4-Pro, V4-Flash, V3.2, V3.1 and R1 — publish open weights under the MIT license, so you can download and run them at no cost beyond your own hardware and electricity. Some older releases (V3 base, Coder-V2, VL2) split MIT-licensed code from a separate DeepSeek Model License for weights; check each Hugging Face repo. More detail in is DeepSeek open source.

How much disk space do I need to install DeepSeek locally?

It depends on the variant. The 1.5B distill is around 1 GB, the 8B distill about 5 GB, the 32B around 20 GB, and the 70B around 40 GB at 4-bit quantization. Official V4-Flash weights run into the hundreds of gigabytes; V4-Pro is larger still. Always provision extra space for converted weights and the Hugging Face cache. Our DeepSeek hardware calculator helps size this.

Does a local install work offline?

Yes, once the model is downloaded. Ollama and vLLM both load weights from disk and run inference locally; no outbound network call is required for completions. You only need internet for the initial model pull and for software updates. For an air-gapped workflow with traffic blocking and version pinning, follow our DeepSeek offline setup walkthrough.

What is the difference between local DeepSeek and the DeepSeek API?

The API is a hosted, stateless service at https://api.deepseek.com serving deepseek-v4-pro and deepseek-v4-flash; you pay per token but get the latest weights with no hardware. A local install runs the open-weight version on your own GPU, which costs nothing per token but limits you to the snapshot you downloaded. The wire format is the same — both speak POST /chat/completions. See our DeepSeek API documentation for the hosted side.
