How to Run DeepSeek on Ollama: Local Setup, Models, and Tips
You want a DeepSeek model on your laptop — answering prompts without sending tokens to a remote API, working on a flight, or just keeping a private notebook of experiments. Running DeepSeek on Ollama is the most direct way to get there: one binary, one `pull`, and a chat session in the terminal. This guide is written from hands-on use across an M-series Mac, a 16 GB RTX 4070 desktop, and a 64 GB Linux workstation. By the end you will have picked a DeepSeek variant that fits your hardware, pulled it through Ollama, run a first prompt, exposed it to VS Code or a Python script, and learned where local hits a wall and the hosted API takes over.
What you’ll build
By the end of this tutorial you will have:
- Ollama installed on macOS, Windows or Linux, with the background service listening on `http://localhost:11434`.
- At least one DeepSeek model — R1, R1 Distill, V3.2, V4-Flash (cloud), or DeepSeek Coder — pulled and running locally or via Ollama Cloud.
- A working REST call against Ollama’s API, plus a Python snippet using the `ollama` package.
- An honest sense of when running DeepSeek on Ollama is the right call, and when you should be hitting DeepSeek’s hosted API instead.
Two notes before we start. First, “DeepSeek” on Ollama is a family — R1 and its distills, the older V3 and V3.2, the new V4-Flash preview, plus DeepSeek Coder and DeepSeek-OCR. They have very different VRAM profiles. Second, Ollama is not the DeepSeek API. The official hosted API on https://api.deepseek.com is OpenAI-compatible and has its own pricing, parameter set, and the V4 thinking-mode flags described in the DeepSeek API documentation. Local and hosted are complementary, not interchangeable.
Prerequisites
- Operating system: macOS 12+, Windows 10/11, or a recent Linux distribution.
- Disk space: 5 GB minimum for a small distill, 40 GB+ for mid-range models, 400 GB+ for the full R1 671B weights.
- RAM: 8 GB will run a 1.5B distill; 16 GB is the practical floor for 7B–8B; 32 GB+ for anything in the 14B–32B range.
- GPU (recommended): a CUDA-capable NVIDIA GPU, an Apple Silicon Mac, or an AMD ROCm-supported card. CPU-only works but is slow.
- Optional: an Ollama account if you want to use cloud-hosted DeepSeek models; a DeepSeek API key if you intend to fall back to hosted V4. See our guide on how to get a DeepSeek API key.
If you are unsure what your machine can handle, our DeepSeek hardware calculator estimates VRAM and RAM needs by model size and quantisation.
Pick the right DeepSeek model for your hardware
This is the step most people skip and then regret. The DeepSeek family on Ollama spans three orders of magnitude in size. The table below summarises the variants you are most likely to pull.
| Ollama tag | Type | Approx. size on disk | Best for | Runs locally? |
|---|---|---|---|---|
| `deepseek-r1:1.5b` | R1 distill (Qwen base) | ~1.1 GB | Quick math/logic, low-end laptops | Yes — 8 GB RAM |
| `deepseek-r1:8b` | R1 distill (Llama/Qwen base) | ~5 GB | General reasoning on a 16 GB machine | Yes — 16 GB RAM / 8 GB VRAM |
| `deepseek-r1:14b` | R1 distill | ~9 GB | Stronger reasoning, mid-range GPU | Yes — 24 GB RAM / 12 GB VRAM |
| `deepseek-r1:32b` | R1 distill | ~20 GB | Workstation-grade reasoning | Yes — 32 GB+ RAM / 24 GB VRAM |
| `deepseek-r1:671b` | Full R1 MoE | ~404 GB | Multi-GPU servers | Rarely — server-class only |
| `deepseek-coder:6.7b` | Code completion | ~3.8 GB | FIM and code chat | Yes — 16 GB RAM |
| `deepseek-v3.2:cloud` | Cloud-hosted V3.2 | n/a (cloud) | Frontier quality without local hardware | Cloud only |
| `deepseek-v4-flash:cloud` | Cloud-hosted V4-Flash | n/a (cloud) | Latest preview, agentic workflows | Cloud only |
About the distills. DeepSeek’s team has shown that the reasoning patterns of larger models can be distilled into smaller ones, and that this outperforms running RL directly on small models; the distilled dense models hold up well on reasoning benchmarks. The Qwen-based distills inherit Qwen’s Apache 2.0 licence with DeepSeek’s reasoning fine-tune layered on top, while the DeepSeek-R1 weights themselves are MIT-licensed, and the series permits commercial use, modification and derivative works, including distillation for training other LLMs. For a deeper look at the distill family, see DeepSeek R1 Distill.
About V4 on Ollama. DeepSeek-V4-Flash is a preview of the DeepSeek-V4 series, a Mixture-of-Experts model with 284B total parameters and 13B activated, built for efficient reasoning across a 1M-token context window. Ollama currently exposes it as deepseek-v4-flash:cloud, meaning the request runs against Ollama’s hosted backend rather than your machine — you still need an Ollama account, and you cannot run those weights on a typical laptop in 2026. Local V4-Pro (1.6T total / 49B active) is realistic only on multi-GPU servers.
Step 1 — Install Ollama
Ollama publishes installers for macOS, Windows and Linux. On Linux, the one-line installer is the simplest path:
curl -fsSL https://ollama.com/install.sh | sh
On Windows, run irm https://ollama.com/install.ps1 | iex in PowerShell, or download the installer manually. macOS and Windows users can also grab a GUI installer from ollama.com/download; Ollama ships a desktop app for both platforms, but this guide focuses on the command-line tool.
Verify the install:
ollama --version
ollama list
The Ollama service should be running in the background — normally you don’t need to start it manually, and it runs on port 11434 by default. If ollama list errors out with a connection refused, start the daemon with ollama serve in another terminal.
Step 2 — Pull a DeepSeek model
Pulling a model downloads the weights and registers them with the local Ollama daemon. For a first run on a 16 GB laptop, the 8B R1 distill is a good balance of capability and size:
ollama pull deepseek-r1:8b
If you have less RAM, swap in deepseek-r1:1.5b. If you are coding and want completion plus chat, pull deepseek-coder:6.7b instead — DeepSeek Coder is trained from scratch on 87% code and 13% natural language in English and Chinese, with each model pre-trained on 2 trillion tokens. For broader context on the coder line, see our DeepSeek Coder V2 page.
To use the cloud-hosted V4-Flash preview through Ollama, sign in first:
ollama signin
ollama pull deepseek-v4-flash:cloud
Cloud tags do not download weights to your disk; they route requests through Ollama’s hosted infrastructure. That is useful when your hardware cannot fit the model but you want to keep the same client code path.
Step 3 — Run your first prompt
The fastest sanity check is an interactive session:
ollama run deepseek-r1:8b
You will get a chat prompt where each message is sent to the local model. Ask it something deliberately reasoning-flavoured — “If a train leaves Manchester at 14:05 travelling at 90 km/h…” — and you should see R1’s thinking-style output before the final answer. Type /bye to exit.
For a one-shot prompt without entering interactive mode, append the prompt directly:
ollama run deepseek-r1:8b "Summarise the CAP theorem in 80 words."
Step 4 — Call the local API from code
Ollama exposes an HTTP API on port 11434. The chat endpoint mirrors OpenAI’s request shape closely enough that wiring up clients is trivial.
A minimal curl call:
curl http://localhost:11434/api/chat -d '{
"model": "deepseek-r1:8b",
"messages": [{"role": "user", "content": "Hello!"}],
"stream": false
}'
The same call from Python using the official ollama package (pip install ollama):
from ollama import chat
response = chat(
model="deepseek-r1:8b",
messages=[{"role": "user", "content": "Hello!"}],
)
print(response.message.content)
Both shapes are taken from the model’s own page on the Ollama library; the JS equivalent uses npm i ollama and the same chat() call. For a deeper Python walk-through that goes beyond a hello-world, see our DeepSeek Python integration guide.
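If you want tokens to print as they are generated rather than arrive as one blob, the same package supports a streaming flag. A minimal sketch, assuming a recent ollama package where responses expose message.content as attributes, as in the example above:
from ollama import chat

# Stream the reply token-by-token instead of waiting for the full response.
stream = chat(
    model="deepseek-r1:8b",
    messages=[{"role": "user", "content": "Explain the CAP theorem briefly."}],
    stream=True,
)

for chunk in stream:
    # Each chunk carries a partial message; print without a trailing newline.
    print(chunk.message.content, end="", flush=True)
print()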
Local vs hosted, in one paragraph. The Ollama endpoint above runs on your machine and has no per-token cost. The hosted DeepSeek API at https://api.deepseek.com is a different surface entirely — chat requests there hit POST /chat/completions, the OpenAI-compatible endpoint, and the API is stateless: clients must resend the full conversation history with every request. The current generation hosted there is DeepSeek V4 (released April 24, 2026), shipped as deepseek-v4-pro (1.6T total / 49B active) and deepseek-v4-flash (284B / 13B active), both open-weight MoE under MIT. Thinking mode on V4 is a request parameter — reasoning_effort="high" with extra_body={"thinking": {"type": "enabled"}} — not a separate model ID, and the response returns reasoning_content alongside the final content. Default context is 1,000,000 tokens with output up to 384,000 tokens. Legacy IDs deepseek-chat and deepseek-reasoner still work but route to deepseek-v4-flash until July 24, 2026 at 15:59 UTC, after which they fail.
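For contrast, here is what the hosted path looks like. This is a sketch based on the parameter names described above (reasoning_effort, the thinking flag in extra_body, and the reasoning_content field); verify the exact shapes against the current DeepSeek API documentation before relying on them:
from openai import OpenAI

# The hosted DeepSeek API is OpenAI-compatible; only the base URL and key change.
client = OpenAI(api_key="YOUR_DEEPSEEK_API_KEY", base_url="https://api.deepseek.com")

response = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Hello!"}],
    reasoning_effort="high",                        # V4 thinking-mode effort level
    extra_body={"thinking": {"type": "enabled"}},   # enables thinking mode
)

message = response.choices[0].message
print(message.reasoning_content)  # the model's thinking trace
print(message.content)            # the final answer
Remember the hosted API is stateless: append each assistant reply to messages and resend the full history on every call, as noted above.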
Step 5 — Wire it into your editor
The most common reason to run DeepSeek locally is code assistance without leaking source to a third party. Two paths work well:
- VS Code with a local-LLM extension. Continue, Cline, and similar plugins accept an Ollama base URL and a model name. Point them at `http://localhost:11434` and the `deepseek-coder:6.7b` or `deepseek-r1:14b` model. Full walk-through in our DeepSeek with VS Code tutorial.
- Ollama’s launch integrations. Running `ollama launch` prompts you to run a model or connect Ollama to your existing agents or applications such as Claude Code, OpenClaw, OpenCode, Codex and Copilot; to launch a specific integration, run `ollama launch claude`. The same pattern works with DeepSeek tags, e.g. `ollama launch claude --model deepseek-r1:14b`.
Verify it worked
Three quick checks confirm a healthy local install:
- `ollama list` shows your DeepSeek model with a non-zero size.
- `ollama ps` shows the model loaded into memory after a request, listing currently running models and sessions (useful to debug “why is my VRAM full?”); a typical line reads NAME ID SIZE PROCESSOR CONTEXT UNTIL, with the percentage on GPU in the PROCESSOR column.
- The curl call from Step 4 returns JSON with a non-empty `message.content`.
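The same checks are available programmatically through the Python client; a quick sketch (inspect the returned objects for the exact field names in your package version):
import ollama

# Programmatic equivalents of `ollama list` and `ollama ps`.
print(ollama.list())  # installed models
print(ollama.ps())    # models currently loaded in memory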
Common errors and fixes
| Symptom | Likely cause | Fix |
|---|---|---|
| `Error: connection refused` on port 11434 | Daemon not running | Run `ollama serve` in a separate terminal, or restart the desktop app |
| Model loads but generates one token per second | Layers spilling to CPU | Drop to a smaller size or quantisation, e.g. `deepseek-r1:8b` over `:14b` |
| `out of memory` at first prompt | Context window too large for VRAM | Lower `num_ctx` in the request, or close other GPU apps |
| Cloud tag fails with “unauthorised” | Not signed in to Ollama | Run `ollama signin` or set `OLLAMA_API_KEY` |
| R1 emits long `<think>` sections that get cut off | Output token cap too low | Raise `num_predict` / `max_tokens` in the request body |
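Two of those fixes, num_ctx and num_predict, are request options rather than CLI flags. A minimal sketch of passing them through the Python client; the values are illustrative, not recommendations:
from ollama import chat

# Cap the context window and raise the output budget to avoid truncated <think> blocks.
response = chat(
    model="deepseek-r1:8b",
    messages=[{"role": "user", "content": "Walk through the CAP theorem step by step."}],
    options={
        "num_ctx": 8192,      # context window in tokens; lower it if you hit out-of-memory
        "num_predict": 2048,  # maximum tokens to generate
    },
)
print(response.message.content)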
When local is the wrong answer
Running DeepSeek on Ollama is the right tool for privacy-sensitive work, offline use, and latency-bound workflows on capable hardware. It is the wrong tool for three cases.
First, frontier quality. The 8B and 14B distills are not the same model as DeepSeek-V4-Pro or even full R1. If you need V4-Pro’s SWE-Bench Verified or Terminal-Bench performance, run it through the hosted API and read our DeepSeek V4-Pro notes for the trade-offs.
Second, 1M-token context. The V4 hosted API ships a 1,000,000-token context window with up to 384,000-token output. Local quantised distills typically run in the 4K–32K range and will silently truncate.
Third, cost-per-call at scale. Once you cross a few thousand calls a day, hosted V4-Flash often beats your electricity bill. As of April 2026, V4-Flash lists at $0.028 cache-hit, $0.14 cache-miss and $0.28 output per 1M tokens; V4-Pro lists at $0.145 / $1.74 / $3.48. A worked example: 1,000,000 V4-Flash calls with a 2,000-token cached system prompt, a 200-token user message, and a 300-token reply costs $56.00 (cached) + $28.00 (uncached input) + $84.00 (output) = $168.00 total. The same workload on V4-Pro is $290.00 + $348.00 + $1,044.00 = $1,682.00. Numbers are from DeepSeek’s pricing page; verify before committing real spend, and use our DeepSeek pricing calculator for your own shape of traffic.
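To sanity-check that arithmetic, here is the same calculation as a few lines of Python; prices and token counts are the figures quoted above, so swap in current pricing before using it for real planning:
# 1,000,000 calls: 2,000 cached input tokens, 200 uncached input tokens, 300 output tokens each.
CALLS = 1_000_000
CACHED_IN, UNCACHED_IN, OUT = 2_000, 200, 300

def workload_cost(price_cache_hit, price_cache_miss, price_output):
    # Prices are USD per 1M tokens.
    return (
        CALLS * CACHED_IN / 1e6 * price_cache_hit
        + CALLS * UNCACHED_IN / 1e6 * price_cache_miss
        + CALLS * OUT / 1e6 * price_output
    )

print(f"V4-Flash: ${workload_cost(0.028, 0.14, 0.28):,.2f}")  # $168.00
print(f"V4-Pro:   ${workload_cost(0.145, 1.74, 3.48):,.2f}")  # $1,682.00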
Next steps
Three things to do next, in order of likely value:
- If you want a fully air-gapped setup, follow the DeepSeek offline setup guide.
- If your goal is document Q&A on local files, the DeepSeek RAG tutorial shows how to plug an Ollama-hosted DeepSeek into a vector store.
- If you also want to compare against other open-weight families on the same machine, browse the DeepSeek tutorials hub for parallel walk-throughs and the DeepSeek models hub for the full lineup.
Last verified: 2026-04-25. DeepSeek AI Guide is an independent resource and is not affiliated with DeepSeek or its parent company. Model IDs, pricing and API behaviour change; check the official DeepSeek documentation and pricing page before committing to a production decision.
How do I install DeepSeek R1 on Ollama?
Install Ollama from ollama.com/download (or via the Linux one-liner curl -fsSL https://ollama.com/install.sh | sh), then pull the size that fits your machine: ollama pull deepseek-r1:8b for a 16 GB laptop, :1.5b for low-end hardware, :14b or :32b for a workstation. Run it with ollama run deepseek-r1:8b. For a fuller local-first walk-through, see our guide on how to install DeepSeek locally.
What hardware do I need to run DeepSeek on Ollama?
The 1.5B distill runs on 8 GB of RAM, the 8B distill needs 16 GB of RAM or roughly 8 GB of VRAM, the 14B needs 24 GB, and 32B comfortably uses a workstation with 24 GB+ of VRAM. The full 671B R1 is server-class and effectively unrunnable on consumer hardware. Our DeepSeek hardware calculator gives a per-model estimate and our DeepSeek system requirements page covers GPU and OS specifics.
Is running DeepSeek on Ollama free?
Yes for the local tags. Ollama itself is free, the open-weight DeepSeek distills are MIT-licensed, and you pay nothing per token because inference runs on your hardware. Cloud-tagged models like deepseek-v4-flash:cloud route through Ollama’s hosted backend and may have plan limits. Hosted access through DeepSeek’s own API is metered separately. See is DeepSeek free for the broader picture across surfaces.
Can I use DeepSeek with VS Code through Ollama?
Yes. Install a local-LLM extension such as Continue or Cline, point its provider settings at http://localhost:11434, and select a pulled DeepSeek tag — deepseek-coder:6.7b for completion, or deepseek-r1:14b for chat reasoning. Ollama also ships a launch command that wires popular agents to a local model. Step-by-step instructions live in our DeepSeek with VS Code tutorial.
Why is my Ollama DeepSeek response so slow?
Almost always because the model does not fit in VRAM and layers are spilling to CPU. Run ollama ps after a prompt — if the PROCESSOR column shows a CPU percentage above zero, drop down a size (e.g. from :14b to :8b) or switch to a smaller quantisation. Closing browser tabs and other GPU consumers helps too. The general DeepSeek troubleshooting guide has more diagnostic steps.
