Using DeepSeek for Math: Models, Prompts and Worked Examples
If you are stuck on a tricky integral, an olympiad-style number-theory problem, or a 30-row spreadsheet of dosage calculations, which model should you actually open? Using DeepSeek for math is one of the better-supported workflows in the current model lineup — the lab has shipped a dedicated math model, two reasoning-trained chat models, and now a frontier MoE family with a thinking switch you can flip per request. But not every model is right for every problem. A 7B math specialist beats a frontier chat model on some olympiad-style questions and loses on word problems that need broader context.
This guide covers which DeepSeek model to pick for which kind of math, how to prompt it, what it costs through the API, and where it still gets things wrong.
The concrete problem: math is not one task
“Doing math” with a language model is a stack of very different jobs. Arithmetic on long numbers, algebraic manipulation, calculus, statistics, formal proofs, contest problems, and applied word problems each stress a model in different ways. A model that gets 95% on grade-school word problems can still hallucinate a derivative, and a contest-winning reasoner can still misread a units conversion in an applied physics question.
The DeepSeek lineup reflects this. There is a small, dedicated math model from 2024; a reasoning model that posts near-ceiling scores on contest math; a formal theorem prover; and the new V4 chat family with a thinking switch. The right pick depends on whether you need speed, depth of reasoning, code-assisted calculation, or formal verification.
Which DeepSeek model for which kind of math
The short version: for everyday algebra, calculus, and applied word problems, DeepSeek V4-Flash in thinking mode is the default. For the hardest contest problems where every percentage point matters, R1 still leads the open-weights field. For specialised theorem-proving in Lean 4, the Prover line is the only sensible choice.
| Task | Best DeepSeek pick | Why |
|---|---|---|
| Homework, calculus, statistics | V4-Flash, thinking on | Cheapest tier, 1M context, full reasoning trace |
| Olympiad / AIME / contest math | DeepSeek R1 or V4-Pro (max) | R1's 97.3% on MATH-500 (and 90.8 MMLU) makes it the current open-weights reasoning leader |
| Tool-using math (Python, SymPy) | V4-Pro, thinking on | Best at multi-step agentic loops with code execution |
| Formal proofs in Lean 4 | DeepSeek Prover | Trained specifically for verified theorem proving |
| Local, offline math on a laptop | DeepSeek Math 7B or an R1 distill | Runs on a single 24 GB GPU; surprisingly strong on MATH |
Why V4 is the new default for most math
DeepSeek V4 launched on April 24, 2026 as two open-weight MoE models under the MIT license. DeepSeek-V4-Pro has 1.6T parameters (49B activated) and DeepSeek-V4-Flash has 284B parameters (13B activated), both supporting a context length of one million tokens. Both expose a single API parameter for thinking mode, which is what you want for math: switch it on for hard problems, leave it off for trivial ones.
Either model can be addressed via POST /chat/completions, the OpenAI-compatible endpoint at https://api.deepseek.com. DeepSeek also exposes an Anthropic-compatible surface against the same base URL. If you maintain an older integration on deepseek-chat or deepseek-reasoner, those IDs still work and currently route to deepseek-v4-flash, but they retire on 2026-07-24 at 15:59 UTC. Migrate by changing one line — the model= argument — and leave base_url alone.
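As a sketch of that one-line migration, assuming the OpenAI Python SDK (which the later examples in this guide also use):

```python
from openai import OpenAI

# Same base URL as before; only the model ID changes.
client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_API_KEY")

resp = client.chat.completions.create(
    model="deepseek-v4-flash",  # was: "deepseek-chat" or "deepseek-reasoner"
    messages=[{"role": "user", "content": "What is 17 * 23?"}],
)
print(resp.choices[0].message.content)
```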
Where R1 still wins
For the hardest math, R1's training pipeline is purpose-built for the job: multi-stage reinforcement learning without early supervised fine-tuning produces reasoning chains that self-verify on mathematical problems in a way V-series models don't replicate. If you are working through Putnam problems, AIME-style number theory, or olympiad geometry, R1 in DeepThink mode is still the open-weights leader on MATH-500. For an architectural deep-dive, see the DeepSeek R1 page.
Five workflows that actually work
What follows is what I run in production day-to-day. Each one is a prompt pattern, not a one-shot trick — small adjustments matter.
1. The “show your work” calculus prompt
Default V4-Flash with thinking enabled, plus an explicit final-answer directive. The directive is the one DeepSeek publishes in R1's own evaluation guide: for mathematical problems, include a sentence such as "Please reason step by step, and put your final answer within \boxed{}." That single sentence cuts ambiguity in the answer parser to near zero.
```
System: You are a math tutor. Use LaTeX for equations.
User: Compute the indefinite integral of x*ln(x) dx.
      Please reason step by step, and put your final answer within \boxed{}.
```
For temperature, use 0.0 on math — the official DeepSeek guidance is 0.0 for code generation and mathematics, 1.3 for general conversation, 1.5 for creative writing. Determinism matters when you intend to verify the result.
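Put together as an API call, the prompt above looks roughly like this (a sketch assuming the OpenAI-compatible endpoint described earlier; the extra_body thinking toggle follows the form shown in §3):

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_API_KEY")

resp = client.chat.completions.create(
    model="deepseek-v4-flash",
    temperature=0.0,  # official guidance for math: fully deterministic
    messages=[
        {"role": "system",
         "content": "You are a math tutor. Use LaTeX for equations."},
        {"role": "user",
         "content": "Compute the indefinite integral of x*ln(x) dx. "
                    "Please reason step by step, and put your final "
                    "answer within \\boxed{}."},
    ],
    extra_body={"thinking": {"type": "enabled"}},  # thinking mode on
)
print(resp.choices[0].message.content)
```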
2. Tool-augmented arithmetic for ugly numbers
Language models are imperfect calculators. For anything involving more than four-digit multiplication, long division, or exact fractions, ask the model to write Python rather than compute in its head. V4 supports tool calling in OpenAI-compatible format. Declare a `run_python` tool and let the model call it; the model writes `from sympy import integrate, symbols` and you execute it.
For full setup, see the DeepSeek API function calling guide.
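As a sketch, declaring a hypothetical run_python tool looks like this in that format (the tool name and description are mine to illustrate; sandboxing and executing the returned code is up to you):

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_API_KEY")

# Hypothetical tool: the schema shape is the standard OpenAI-compatible one.
tools = [{
    "type": "function",
    "function": {
        "name": "run_python",
        "description": "Execute a Python snippet and return its stdout.",
        "parameters": {
            "type": "object",
            "properties": {
                "code": {"type": "string",
                         "description": "Python source code to execute"},
            },
            "required": ["code"],
        },
    },
}]

resp = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[{"role": "user",
               "content": "Compute 987654321 * 123456789 exactly."}],
    tools=tools,
)

# If the model opted to call the tool, run its code in a sandbox and send
# the result back in a "tool" role message to get the final answer.
for call in resp.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```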
3. Verified contest-math attempts
Contest problems reward depth over breadth. Use V4-Pro at reasoning_effort="max" and ask for two solutions by different methods, then check that they agree. The pattern looks like this in Python:
```python
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="...")

resp = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[{
        "role": "user",
        "content": "Find all integer solutions to x^2 + y^2 = 2024. "
                   "Solve twice using two different methods, then verify "
                   "the answers agree. \\boxed{} the result.",
    }],
    reasoning_effort="max",
    extra_body={"thinking": {"type": "enabled"}},
)

print(resp.choices[0].message.reasoning_content)  # the thinking
print(resp.choices[0].message.content)            # the answer
```
The response returns reasoning_content alongside the final content, so you can inspect the chain of working separately from the answer.
4. Word problems with hidden constraints
Applied math fails most often on units, edge cases, and unstated assumptions. The fix is to ask the model to list assumptions before solving. Prompt: “List every assumption you are making about units, sign conventions, and edge cases. Then solve. Then check that none of your assumptions contradicts the original problem.”
5. Long datasets with the 1M-token window
If you have a CSV of survey results, exam scores, or experimental data, you no longer need RAG for most of it: V4-Pro's 1M-token context window takes the raw data directly, and its strength on knowledge, math, and long-running agent tasks makes it the right pick for this job. Paste the data, ask for the analysis, and read the working. For repeating workflows, cache the system prompt — see DeepSeek context caching.
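Here is a sketch of the paste-and-ask pattern, where exam_scores.csv is a hypothetical file standing in for your data:

```python
import pathlib
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_API_KEY")
csv_text = pathlib.Path("exam_scores.csv").read_text()  # hypothetical file

resp = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[
        {"role": "system",
         "content": "You are a statistician. Show your working."},
        {"role": "user",
         "content": "Here is a CSV of exam scores:\n\n" + csv_text +
                    "\n\nCompute the mean, median, and standard deviation "
                    "per cohort, and flag outliers. Reason step by step."},
    ],
    extra_body={"thinking": {"type": "enabled"}},
)
print(resp.choices[0].message.content)
```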
What it costs to use DeepSeek for math at scale
Reasoning is verbose. A single AIME-style problem in thinking mode can emit 5,000–15,000 tokens of reasoning_content plus a few hundred tokens of content. That changes the economics. Below is a worked example for a tutoring-bot workload at V4-Flash rates: 100,000 problems, a 1,500-token cached system prompt, a 250-token user question, and a 4,000-token thinking-plus-answer response.
| Bucket | Tokens | Rate (per 1M) | Cost |
|---|---|---|---|
| Input, cache hit | 1,500 × 100,000 = 150,000,000 | $0.028 | $4.20 |
| Input, cache miss | 250 × 100,000 = 25,000,000 | $0.14 | $3.50 |
| Output (incl. reasoning) | 4,000 × 100,000 = 400,000,000 | $0.28 | $112.00 |
| Total | | | $119.70 |
If you swap to V4-Pro for the same workload (rates: $0.145 cache-hit / $1.74 cache-miss / $3.48 output per 1M), the bill rises to roughly $1,458. Pro is worth it for olympiad-grade reasoning; Flash is the right default for tutoring, homework help, and most applied calculation. For a calculator that handles your specific numbers, see the DeepSeek pricing calculator. Quoted rates are as of April 2026 — verify on the official pricing page before committing.
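The arithmetic behind the table, as a snippet you can re-run with your own token counts and whatever the current rates are:

```python
# Reproduces the V4-Flash table above. Rates in $ per 1M tokens, April 2026.
PROBLEMS = 100_000
RATES  = {"cache_hit": 0.028, "cache_miss": 0.14, "output": 0.28}
TOKENS = {"cache_hit": 1_500,  # cached system prompt
          "cache_miss": 250,   # fresh user question
          "output": 4_000}     # thinking + answer

total = sum(TOKENS[k] * PROBLEMS / 1e6 * RATES[k] for k in RATES)
print(f"${total:,.2f}")  # -> $119.70
```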
One trap to avoid: when thinking mode is enabled, set max_tokens high. Explicit chain-of-thought reasoning typically improves performance on math and complex reasoning at the cost of higher latency and token usage, and a truncated reasoning trace produces wrong answers as often as no reasoning at all. The 1M-token window leaves plenty of headroom; for max-effort thinking the platform requires max_model_len >= 393216 (384K tokens).
The legacy specialists: DeepSeek Math 7B and DeepSeek Prover
Before V4 and R1, the lab shipped a dedicated 7B math model that still has its uses. DeepSeekMath 7B was initialized from DeepSeek-Coder-v1.5 7B and continued pre-training on math-related tokens sourced from Common Crawl, together with natural language and code data, for 500B tokens. It scores 51.7% on the competition-level MATH benchmark without external toolkits or voting techniques, approaching the performance of Gemini-Ultra and GPT-4.
Two reasons to still care about it in 2026: it runs on a single consumer GPU with bf16 weights, and its paper introduced GRPO — the reinforcement-learning recipe that later powered R1. GRPO is a variant of Proximal Policy Optimization (PPO) that foregoes the critic model and instead estimates the baseline from group scores, significantly reducing training resources. If you need a local, offline math model and cannot run a 70B distill, this is the one. For offline setup, see the guide on how to install DeepSeek locally.
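A minimal local-inference sketch, assuming the published deepseek-ai/deepseek-math-7b-instruct checkpoint on Hugging Face and a GPU with enough free memory for the bf16 weights:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "deepseek-ai/deepseek-math-7b-instruct"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user",
             "content": "Compute the indefinite integral of x*ln(x) dx. "
                        "Please reason step by step, and put your final "
                        "answer within \\boxed{}."}]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True,
                                 return_tensors="pt").to(model.device)
out = model.generate(inputs, max_new_tokens=512)
print(tok.decode(out[0][inputs.shape[1]:], skip_special_tokens=True))
```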
For Lean 4 theorem proving — a much narrower use case — the DeepSeek Prover line is purpose-built and beats general chat models on verified proofs.
Where DeepSeek still gets math wrong
Honest list, from a year of production use:
- Long-form arithmetic without code. Numbers with more than 8–10 digits get errors. Always route through Python.
- Geometry from word descriptions. Without a diagram, the model often misreads orientations. Sketch first, paste the figure if you have it.
- Probability with subtle conditioning. Conditional-probability traps fool reasoning models almost as often as they fool humans. Ask for the sample space explicitly.
- Statistics on real data. The model can compute a t-test but rarely catches when your data violates the test’s assumptions. You still need to know stats.
- “Slick” proofs. R1 will produce a 4,000-token proof when a two-line one exists. Expect verbosity; ask for “the shortest valid argument” if brevity matters.
For a broader breakdown, see DeepSeek limitations.
Honest alternatives for specific math sub-tasks
I would not pretend DeepSeek wins every category. For step-by-step homework explanations aimed at high-school students, GPT-5’s tutoring UI in ChatGPT is more polished. For symbolic computation and worked theory, Wolfram Alpha is still the right tool — and you can let DeepSeek call it as a function. For Lean 4 work outside of DeepSeek Prover’s training distribution, Anthropic’s Claude family handles novel theorem statements competitively. The honest framing: pick DeepSeek for cost-efficient API math at scale, R1 for olympiad depth, and external tools for symbolic exactness.
For broader head-to-heads, see DeepSeek vs ChatGPT and DeepSeek vs Claude.
Getting started for math
If you have never used DeepSeek for math before, the cheapest way to test the API is:
- Sign up and get a DeepSeek API key.
- Try a single call against `POST /chat/completions` with `model="deepseek-v4-flash"` and `reasoning_effort="high"`.
- Add the "step by step, \boxed{} the answer" directive from §1.
- Read the `reasoning_content` as well as the `content` — the working is where errors hide.
- For repeated workloads, move shared instructions into a cached system prompt.
Note that the API is stateless: you must resend the conversation history with each request. The web chat and mobile app keep session state for you, but the API does not. JSON mode is designed to return valid JSON but does not guarantee it — include the word "json" in your prompt with a small example schema, and set max_tokens high enough that the response cannot truncate, as in the sketch below.
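A sketch of that JSON-mode pattern, assuming the OpenAI-compatible response_format field:

```python
from openai import OpenAI

client = OpenAI(base_url="https://api.deepseek.com", api_key="YOUR_API_KEY")

resp = client.chat.completions.create(
    model="deepseek-v4-flash",
    response_format={"type": "json_object"},
    max_tokens=4096,  # generous ceiling so the JSON cannot truncate
    messages=[{"role": "user",
               "content": 'Solve 3x + 5 = 17. Respond in json '
                          'like {"x": <number>}.'}],
)
print(resp.choices[0].message.content)  # e.g. {"x": 4}
```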
This use case lives in the broader DeepSeek use cases hub if you want to see how math compares to coding, research, or education workflows.
Last verified: 2026-04-25. DeepSeek AI Guide is an independent resource and is not affiliated with DeepSeek or its parent company. Model IDs, pricing and API behaviour change; check the official DeepSeek documentation and pricing page before committing to a production decision.
Is DeepSeek good at math?
Yes, on most benchmarks it is competitive with or ahead of similarly priced closed-source models. R1 reaches near-ceiling scores on the MATH-500 benchmark and the V4 family inherits strong math performance with thinking mode enabled. For everyday calculus, statistics, and word problems, V4-Flash with reasoning_effort="high" handles them well. See the DeepSeek performance review for full task-by-task results.
Which DeepSeek model is best for math homework?
For typical school and university homework, DeepSeek V4-Flash in thinking mode is the cheapest and fastest option. It returns reasoning_content alongside the final content, so students can read the working and check each step. For contest-level problems, switch to DeepSeek R1 or V4-Pro at maximum reasoning effort.
How do I turn on thinking mode for math problems?
In the web chat or app, toggle “DeepThink” on. Through the API, set reasoning_effort="high" with extra_body={"thinking": {"type": "enabled"}} on either deepseek-v4-pro or deepseek-v4-flash. Use "max" for the hardest problems, but raise max_tokens to avoid truncation. Worked examples live in the DeepSeek API getting started tutorial.
Can DeepSeek do calculus and linear algebra step by step?
Yes. With thinking mode enabled and a "reason step by step, \boxed{} the final answer" directive, V4-Flash and R1 produce a full chain of working for derivatives, integrals, eigenvalue problems, and matrix manipulations. For exact symbolic results, ask the model to call SymPy via tool use rather than computing in natural language. See DeepSeek prompt engineering for the exact prompt patterns.
What does it cost to solve a thousand math problems with DeepSeek?
At V4-Flash rates as of April 2026 ($0.028 cache-hit / $0.14 miss / $0.28 output per 1M tokens), a thousand thinking-mode problems with a cached system prompt and 4,000-token responses runs about $1.20. V4-Pro for the same workload runs about $14.50. The DeepSeek cost estimator gives a per-workload figure once you plug in your token counts.
