Sizing Local Hardware for DeepSeek: A Practical Calculator

Tools·April 25, 2026·By DS Guide Editorial

“Will it fit on my GPU?” is the first question anyone asks before downloading a 400 GB checkpoint. This guide is the working notebook behind our DeepSeek hardware calculator — the same arithmetic we use before committing a workstation, a rented H200 node, or a Mac Studio to a DeepSeek deployment. We will walk through how parameter count, quantization format, KV cache and context length combine into a real VRAM number, then map those numbers to actual GPUs. The focus is on DeepSeek V4-Pro, V4-Flash, the V3 family, and R1’s distilled variants, because those are the checkpoints currently published on Hugging Face. By the end you will be able to estimate, in under a minute, whether a given DeepSeek model will run on the hardware in front of you.

How a DeepSeek hardware calculator actually works

Every sizing exercise reduces to four numbers added together: model weights, KV cache, activation overhead, and a small safety margin. Get those right and you can predict VRAM use to within a gigabyte or two. Get them wrong — usually by ignoring the KV cache at long context — and your model loads but crashes on the first long prompt.

The model-weight number is the easiest. Each parameter takes a fixed number of bytes that depends on the data type:

  • FP16 / BF16 — 2 bytes per parameter (the native checkpoint format for most LLMs).
  • FP8 — 1 byte per parameter. DeepSeek V4 ships non-MoE weights at FP8 natively.
  • FP4 — 0.5 bytes per parameter. DeepSeek V4’s instruct models use FP4 for MoE expert weights and FP8 for everything else.
  • Q4_K_M (4-bit GGUF) — roughly 0.5–0.6 bytes per parameter including metadata.
  • Q8_0 (8-bit GGUF) — roughly 1 byte per parameter; near-lossless.

Parameter count and quantization level determine the memory footprint of a model during inference. At full FP16/BF16 precision, each parameter consumes 2 bytes, so a 14B parameter model needs roughly 28 GB of VRAM just for the weights. At 4-bit quantization (Q4), each parameter occupies approximately 0.5 bytes for the weight data alone, bringing the same 14B model down to roughly 7 to 8 GB. A useful rule of thumb is that Q4 quantization requires approximately 0.5 to 1 GB of VRAM per billion parameters, with the range accounting for overhead from the KV cache and varying implementations.
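As a minimal sketch of that arithmetic (byte counts taken from the list above; real file sizes vary slightly with metadata and tokenizer files):

# Rough weight-only estimator; ignores KV cache and runtime overhead.
BYTES_PER_PARAM = {
    "fp16": 2.0, "bf16": 2.0, "fp8": 1.0, "fp4": 0.5,
    "q4_k_m": 0.55,  # 4-bit GGUF including metadata
    "q8_0": 1.0,
}

def weight_vram_gb(params_billion: float, fmt: str) -> float:
    """Approximate VRAM for the weights alone, in GB."""
    return params_billion * BYTES_PER_PARAM[fmt]

print(weight_vram_gb(14, "fp16"))    # ~28 GB
print(weight_vram_gb(14, "q4_k_m"))  # ~7.7 GB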

The MoE wrinkle: total parameters vs active parameters

Mixture-of-Experts models confuse first-time sizing. The full V4-Pro checkpoint has 1.6T total parameters with 49B activated, while V4-Flash has 284B parameters with 13B activated, and both support a context length of one million tokens. The 49B “active” figure is the compute cost per token; the 1.6T figure is the storage cost. Inference engines need every expert resident somewhere — VRAM, system RAM, or NVMe — because routing is data-dependent. You cannot fit only the active experts in VRAM and stream the rest at speed.
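A toy illustration of why "store the active experts only" fails: even with uniform random routing (a simplification; real routers are learned, but the coverage effect is the same), a single MoE layer touches essentially every expert within a few hundred tokens.

import random

NUM_EXPERTS = 256  # routed experts per MoE layer (R1-style figure, illustrative)
TOP_K = 8          # experts activated per token (illustrative)

def experts_touched(num_tokens: int) -> int:
    """Count distinct experts one MoE layer activates over a sequence."""
    touched = set()
    for _ in range(num_tokens):
        touched.update(random.sample(range(NUM_EXPERTS), TOP_K))
    return len(touched)

print(experts_touched(32))    # typically ~160 of 256 already
print(experts_touched(1024))  # effectively all 256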

DeepSeek V4 sizing: Pro and Flash on real hardware

V4 is unusual because four checkpoints are on the Hub — the instruct models use FP4 for MoE expert weights and FP8 for everything else, and the base models are FP8 throughout. That mixed-precision design is deliberate: it shaves the storage cost of the experts (where most of the parameters live) without sacrificing precision in attention layers.

Weight-only memory budgets, before KV cache:

| Model | Total params | Native format | Approx. weight VRAM | Realistic minimum |
| --- | --- | --- | --- | --- |
| DeepSeek V4-Pro (instruct) | 1.6T | FP4 MoE + FP8 dense | ~870 GB | 8× H200 141 GB or equivalent |
| DeepSeek V4-Pro (base) | 1.6T | FP8 | ~1.6 TB | 16× H100/H200 cluster |
| DeepSeek V4-Flash (instruct) | 284B | FP4 MoE + FP8 dense | ~155 GB | 2× H100 80 GB / 2× A100 80 GB |
| DeepSeek V4-Flash (base) | 284B | FP8 | ~284 GB | 4× H100 80 GB |

Community sizing threads line up with this: the smallest DeepSeek-V4 model, Flash, is quoted at 160 GB for FP16 or 120 GB for Q4_K_M — that is how much VRAM or RAM you need just to hold the weights. A typical V4-Flash production setup matches it: serving V4-Flash under vLLM on two A100 80 GB GPUs with a 128K context window and tensor-parallel-size 2.
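A rough sketch of that configuration through vLLM's offline Python API (the Hugging Face model ID below is an assumption, not a confirmed repo name; the vllm serve CLI takes the equivalent --tensor-parallel-size and --max-model-len flags):

from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-V4-Flash",  # hypothetical repo ID; check the Hub
    tensor_parallel_size=2,                 # split weights across the two 80 GB cards
    max_model_len=131072,                   # 128K context window
)
out = llm.generate(["Refactor this function."], SamplingParams(max_tokens=256))
print(out[0].outputs[0].text)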

Why V4 is dramatically cheaper to run than V3.2 at long context

The architecture change is what makes million-token contexts viable. In the 1M-token context setting, DeepSeek-V4-Pro requires only 27% of the single-token inference FLOPs and 10% of the KV cache of DeepSeek-V3.2. V4-Flash drops those numbers even further: 10% of the FLOPs and 7% of the KV cache. If you sized a V3.2 deployment last year and assumed V4 would need similar memory, you can budget roughly a tenth of the KV cache for the same context length: a cache that would have been 100 GB on V3.2 shrinks to about 10 GB on V4-Pro and roughly 7 GB on V4-Flash. For workloads near the full 1M window, that difference is what separates a working agent from one that OOMs halfway through a trace.

R1 and the distilled variants — consumer-hardware territory

Most readers running locally will not touch V4-Pro. R1 and its distilled checkpoints are where the calculator gets practical. The distilled models (1.5B, 7B, 8B, 14B, 32B, 70B) were produced by fine-tuning Qwen2.5 and Llama 3 series checkpoints on reasoning traces generated by the full R1 model. They are dense transformer networks, not MoE, which makes them easier to quantize and deploy on single GPUs. The full 671B model activates 37B parameters per forward pass, routed across 256 experts in each MoE layer — but you still need to store all 671B parameters in memory.

| Variant | Q4_K_M weight size | Q8_0 weight size | Recommended GPU | Context headroom |
| --- | --- | --- | --- | --- |
| R1-Distill 1.5B | ~1 GB | ~1.6 GB | Any 6 GB GPU; CPU works | 32K+ |
| R1-Distill 7B | ~4 GB | ~7.5 GB | RTX 3060 12 GB | 16K–32K |
| R1-Distill 14B | ~8 GB | ~15 GB | RTX 4070 Ti / 4080 | 16K |
| R1-Distill 32B | ~18 GB | ~34 GB | RTX 4090 24 GB | 8K–16K |
| R1-Distill 70B | ~40 GB | ~74 GB | RTX 5090 32 GB (tight) / 64 GB Mac | 4K–8K |
| R1 671B (full MoE) | ~404 GB | ~720 GB | Multi-GPU server or 192 GB Mac at extreme quant | limited |

The RTX 4090 with 24 GB VRAM is the current practical ceiling for single-card consumer deployment. The RTX 5090 (32 GB VRAM) expands the ceiling modestly: a 4-bit 70B still slightly exceeds 32 GB on weights alone, so expect light offloading or a sub-4-bit quant with a modest context window, but it is usable where the 4090 is not. For the 14B and smaller models, mid-range cards work well: an RTX 3060 (12 GB) is comfortable with 7B at Q8 or 14B at Q4.

The full 671B: not impossible, just slow

If you are determined to run the full R1 weights at home, the community has documented two viable paths. A four-way RTX 4090 workstation (4 × 24 GB) with quad-channel DDR5-5600 (4 × 96 GB) and a Threadripper 7980X gets 2–4 tokens/s for short generations on R1-Q4_K_M, slowing to 1–2 tokens/s for long outputs. The cheaper option is CPU-only: a dual-socket EPYC system with 24 × 16 GB DDR5 (384 GB) running the IQ4_XS quant under llama.cpp manages 5–8 tokens/s, with a total system cost a bit over $4,000 using engineering-sample CPUs from eBay.

Don’t forget the KV cache — it’s the silent killer

The KV cache scales linearly with context length and is often the reason a model that “should” fit suddenly OOMs. In a long-context MoE deployment it can rival the weights themselves: at a context length of 32,092 tokens, the cache alone takes around 220 GB of RAM on full-precision DeepSeek R1.
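A back-of-the-envelope estimator for a standard attention stack follows (an upper bound: DeepSeek's MLA compresses the cache well below this, and the layer/head counts are illustrative, not DeepSeek's real config):

def kv_cache_gb(context_len, num_layers, num_kv_heads, head_dim,
                bytes_per_elem=2, batch_size=1):
    """Upper-bound KV cache size: 2 (K and V) * layers * KV heads * head_dim * tokens."""
    total = (2 * num_layers * num_kv_heads * head_dim
             * context_len * bytes_per_elem * batch_size)
    return total / 1e9

# Illustrative 70B-class dense model: 80 layers, 8 KV heads (GQA), head_dim 128, FP16 cache.
print(kv_cache_gb(context_len=32768, num_layers=80, num_kv_heads=8, head_dim=128))  # ~10.7 GB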

Two practical rules:

  1. Always reserve at least 20% of your total memory budget for the KV cache before you commit to a context length.
  2. If you plan to use V4 at the full 1M window, the architectural KV-cache savings are doing real work — but you still need to budget for it. For the Think Max reasoning mode, DeepSeek recommends setting the context window to at least 384K tokens. That is non-trivial cache memory even at V4’s compressed rates.

Apple Silicon: the dark horse

Unified memory changes the calculation entirely, removing the fixed-VRAM bottleneck that constrains NVIDIA consumer cards. For single-user interactive use, Apple Silicon is highly competitive, especially at the 70B level: the ~40 GB of Q4 weights that force offloading or an uncomfortably tight squeeze on a 24–32 GB card sit comfortably in an M4 Max's unified memory, and generation stays at interactive speeds that most NVIDIA consumer cards cannot match for that model size. A 192 GB Mac Studio is one of the cheapest legitimate ways to fit an aggressively quantized full R1 in memory, though prefill speed at long context will lag a server GPU.

The cloud break-even — when local stops making sense

Local hardware looks attractive until you do the cost math against API rates. The entry-level V4-Pro deployment described above (8× H200 at roughly $3/GPU/hr on a major cloud) costs about $24/hour, or $17,000+/month if you keep it pinned. By comparison, a million V4-Flash API calls with a cached 2,000-token system prompt, 200-token user input, and 300-token output run $168 total at public list pricing. Unless your throughput needs are seven figures of requests per day, the API wins on pure cost. Self-hosting earns its keep on data residency, latency floors, or fine-tuning needs — not on raw price-per-token. For a more granular breakdown, see our DeepSeek cost estimator and the dedicated DeepSeek API pricing page.
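A minimal break-even sketch; the per-million-token rates below are placeholders, so substitute the current figures from the pricing page before relying on the output:

def monthly_self_host_usd(gpus=8, usd_per_gpu_hour=3.0, hours=730):
    """Cost of keeping a dedicated node pinned for a month."""
    return gpus * usd_per_gpu_hour * hours

def monthly_api_usd(calls, in_tokens=200, cached_tokens=2000, out_tokens=300,
                    usd_per_m_in=0.27, usd_per_m_cached=0.03, usd_per_m_out=1.10):
    """API cost for a month of calls; per-million-token rates are illustrative placeholders."""
    per_call = (in_tokens * usd_per_m_in
                + cached_tokens * usd_per_m_cached
                + out_tokens * usd_per_m_out) / 1e6
    return calls * per_call

print(monthly_self_host_usd())     # ~$17,520 for a pinned 8x H200 node
print(monthly_api_usd(1_000_000))  # a few hundred dollars at placeholder rates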

A worked sizing example: V4-Flash on 2× H100 80 GB

Imagine a team running V4-Flash for an internal coding assistant via a self-hosted endpoint. The OpenAI-compatible chat surface is at POST /chat/completions, the same endpoint exposed by the public DeepSeek API. A minimal sanity check uses the OpenAI Python SDK:

from openai import OpenAI

# Point the SDK at the self-hosted, OpenAI-compatible server; the API key is a dummy.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")
resp = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Refactor this function."}],
    max_tokens=2048,
    temperature=0.0,
)
print(resp.choices[0].message.content)

Memory budget on 2× H100 80 GB (160 GB total):

  • Weights (FP4 MoE + FP8 dense, instruct): ~155 GB
  • Available for KV cache and activations: ~5 GB per GPU
  • Practical max context: ~128K tokens (matches the vLLM setup shown earlier)

For 1M-token context on V4-Flash you need three or four H100s, or two H200s. That is the whole reason the API still wins for most teams: the weights barely fit, and you spend the rest of the budget on cache.

Mapping your build to a DeepSeek model

A short decision tree from the same calculator we used in production (a code sketch of the same mapping follows the list):

  • 8 GB VRAM — R1-Distill-7B at Q4. See the DeepSeek R1 Distill page for variant choices.
  • 12–16 GB VRAM — R1-Distill-14B at Q4 or 7B at Q8.
  • 24 GB VRAM — R1-Distill-32B at Q4. The sweet spot. Pair with our running DeepSeek on Ollama tutorial.
  • 32 GB VRAM — R1-Distill-70B at Q4 with tight context.
  • 64–192 GB unified memory (Mac) — 70B comfortably; full R1 at 1.58-bit with patience.
  • 2× H100/H200 + cluster — V4-Flash territory; see DeepSeek V4-Flash.
  • 8× H200 or larger — V4-Pro instruct. See DeepSeek V4-Pro.
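The same mapping as a minimal sketch; the thresholds are the rough figures from this article, not hard limits:

def recommend_model(memory_gb: float, unified_memory: bool = False) -> str:
    """Map a VRAM or unified-memory budget to a DeepSeek variant, per the tree above."""
    if unified_memory and memory_gb >= 64:
        return "R1-Distill-70B at Q4; full R1 at 1.58-bit from 192 GB up"
    if memory_gb >= 1000:
        return "V4-Pro instruct (8x H200 class)"
    if memory_gb >= 160:
        return "V4-Flash (2x H100/H200)"
    if memory_gb >= 32:
        return "R1-Distill-70B at Q4 (tight context)"
    if memory_gb >= 24:
        return "R1-Distill-32B at Q4"
    if memory_gb >= 12:
        return "R1-Distill-14B at Q4 or 7B at Q8"
    if memory_gb >= 8:
        return "R1-Distill-7B at Q4"
    return "R1-Distill-1.5B"

print(recommend_model(24))                       # R1-Distill-32B at Q4
print(recommend_model(192, unified_memory=True))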

If self-hosting is overkill, the cleanest fallback is the hosted API; legacy IDs deepseek-chat and deepseek-reasoner currently route to deepseek-v4-flash and retire on July 24, 2026 at 15:59 UTC, so production code should be on the V4 IDs by then. Browse the full lineup on the DeepSeek tools and utilities hub or read the DeepSeek system requirements guide for the chat and app side.

External primary sources

Two primary references underpin the V4 numbers in this article: the DeepSeek V4-Pro model card on Hugging Face and the DeepSeek V4 architecture write-up on the Hugging Face blog. Cross-check any sizing decision against those before purchasing hardware — Preview-stage releases sometimes ship updated checkpoints in their first weeks.

Last verified: 2026-04-25. DeepSeek AI Guide is an independent resource and is not affiliated with DeepSeek or its parent company. Model IDs, pricing and API behaviour change; check the official DeepSeek documentation and pricing page before committing to a production decision.

How much VRAM do I need to run DeepSeek locally?

It depends entirely on which DeepSeek you mean. The R1-Distill-7B fits on 8 GB VRAM at Q4. The 32B distill needs 24 GB. The full V4-Flash needs roughly 155 GB across multiple GPUs. A 7B model quantized to 4-bit might only need 4–6 GB VRAM, while V4-Pro needs server-class clusters. Use our DeepSeek model comparison tool to pick a variant before sizing.

Can I run the full DeepSeek R1 671B at home?

Yes, but slowly. Using Unsloth’s 1.58-bit dynamic quantization with CPU offloading, the full R1 will technically run on a single RTX 4090, but expect under 5 tokens/second — too slow for interactive use. The R1-Distill-32B on the same card is a much better experience. The install DeepSeek locally tutorial covers both routes.

What is the difference between active and total parameters for DeepSeek V4?

V4-Pro has 1.6T parameters with 49B activated, while V4-Flash has 284B parameters with 13B activated. Active parameters drive compute per token; total parameters drive storage. You must hold every expert in memory, but only the activated subset participates in each forward pass. See the DeepSeek V4 page for the full architecture summary.

Does DeepSeek V4 really run on Huawei chips?

Partially. The DeepSeek V4 paper notes that the company validated its fine-grained Expert Parallel scheme on both Nvidia GPUs and Ascend NPU platforms — but this does not mean the model was trained entirely on Huawei hardware, only that DeepSeek validated those accelerators to serve it. For most readers it confirms Nvidia compatibility while opening a non-Nvidia path. The DeepSeek latest updates feed tracks this.

Is it cheaper to run DeepSeek locally or use the API?

For most workloads, the API is cheaper. A self-hosted V4-Flash on 2× H100 80 GB runs roughly $4–6/hour on cloud rentals, while the same throughput via the public API costs cents per thousand calls thanks to context caching. Self-hosting wins on data residency, latency floors, and fine-tuning. For a worked breakdown see the DeepSeek pricing calculator.
