DeepSeek API Streaming: SSE on V4-Pro and V4-Flash

Master DeepSeek API streaming with V4-Pro and V4-Flash. SSE setup, reasoning_content deltas, code samples, error handling. Start streaming today.

API·April 25, 2026·By DS Guide Editorial

You hit “send” and your user stares at a spinner for fifteen seconds while a 1,500-token answer assembles in silence. That is the default `POST /chat/completions` experience — and it is the single biggest reason teams turn on DeepSeek API streaming before anything else. With `stream=true`, tokens land in the browser as the model generates them, perceived latency drops to the time-to-first-token, and thinking-mode traces become readable in real time instead of arriving as a wall of text.

This guide covers the practical mechanics on **DeepSeek V4** (released April 24, 2026): how the SSE protocol works on `api.deepseek.com`, how to consume `reasoning_content` deltas alongside the final `content`, the gotchas that bite production deployments (proxy timeouts, the V4 round-trip rule for tool calls), and a worked cost example so you can plan spend before you ship.

What DeepSeek API streaming actually is

DeepSeek API streaming is the Server-Sent Events (SSE) mode of the chat endpoint. Set stream: true in your JSON body and the server replies with Content-Type: text/event-stream, pushing partial message deltas — each token, or small batch of tokens — as a data-only data: line as soon as it becomes available, until it sends a final data: [DONE] sentinel.

The wire format is the OpenAI-compatible Chat Completions schema. Each chunk is a small JSON object whose choices[0].delta carries the new token text rather than a full message. That means your client appends deltas as they arrive instead of replacing a buffer.

Two facts shape everything else in this article:

  • The current generation is DeepSeek V4, shipped as two open-weight MoE models under MIT — deepseek-v4-pro (1.6T total parameters / 49B active) and deepseek-v4-flash (284B / 13B active) — both supporting a context length of one million tokens.
  • The API is stateless. Streaming changes how the response arrives, not how state is managed. You still resend the full conversation history on every request — unlike the web chat, which keeps history server-side for the user’s session.

If you have not yet provisioned access, start with how to get a DeepSeek API key; the rest of this guide assumes a valid key in $DEEPSEEK_API_KEY.

Quickstart: streaming with curl and Python

Chat requests hit POST /chat/completions, the OpenAI-compatible endpoint at https://api.deepseek.com. The minimal curl call:

curl https://api.deepseek.com/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $DEEPSEEK_API_KEY" \
  -d '{
    "model": "deepseek-v4-flash",
    "messages": [{"role": "user", "content": "Stream a haiku about caching."}],
    "stream": true
  }'

You will see a sequence of data: {...} lines, each containing a chunk like {"choices":[{"delta":{"content":"a "}}]}, ending with data: [DONE].

The same call with the OpenAI Python SDK — no library swap required, just point base_url at DeepSeek:

import os

from openai import OpenAI

# Point the OpenAI SDK at DeepSeek's OpenAI-compatible base URL.
client = OpenAI(
    api_key=os.environ["DEEPSEEK_API_KEY"],
    base_url="https://api.deepseek.com",
)

stream = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Stream a 200-word essay on MoE."}],
    stream=True,
)

for chunk in stream:
    delta = chunk.choices[0].delta.content or ""
    print(delta, end="", flush=True)

For a full primer on auth, base URL, and SDK installation, see DeepSeek API getting started and DeepSeek OpenAI SDK compatibility. DeepSeek also exposes an Anthropic-compatible surface against the same base URL if your stack already speaks that schema.

Streaming with thinking mode (reasoning_content)

In V4, thinking mode is a request parameter on either model, not a separate model ID. Enable it with reasoning_effort="high" plus extra_body={"thinking": {"type": "enabled"}}; the response then returns reasoning_content — the reasoning contents of the assistant message, emitted before the final answer and present only in thinking mode — alongside the final content.

When streaming, those traces arrive on a separate field of the delta: each chat completion chunk may carry reasoning_content in its delta, just as it carries content. A pattern that handles both fields:

stream = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=[{"role": "user", "content": "Plan a database migration."}],
    reasoning_effort="high",
    extra_body={"thinking": {"type": "enabled"}},
    stream=True,
)

reasoning, content = "", ""
for chunk in stream:
    delta = chunk.choices[0].delta
    if getattr(delta, "reasoning_content", None):
        reasoning += delta.reasoning_content
        # render in a collapsible UI panel
    if delta.content:
        content += delta.content
        # render in the main answer panel

Two practical notes:

  • Reasoning streams first, content second. The model emits the full reasoning trace before any answer tokens. UIs that show both should render reasoning_content in a separate, dim, collapsible panel so users see something moving while the model thinks.
  • V4 changed the round-trip rule. If a turn includes a tool call, you must pass the assistant’s reasoning_content back in the next request, or the API returns a 400 invalid_request_error: “The reasoning_content in the thinking mode must be passed back to the API.” The legacy deepseek-reasoner ID had the opposite rule (strip reasoning_content from history); make sure your client library handles V4 correctly. A sketch of the history handling follows this list.
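
A minimal sketch of that round-trip, reusing the client from the quickstart and assuming you buffered assistant_reasoning and assistant_tool_calls from the previous stream (those names, and the weather tool result, are illustrative only):

# V4: keep reasoning_content on the assistant message whenever a tool call happened,
# otherwise the follow-up request is rejected with a 400.
messages = [
    {"role": "user", "content": "What is the weather in Hangzhou?"},
    {
        "role": "assistant",
        "content": "",
        "reasoning_content": assistant_reasoning,  # buffered from the previous stream
        "tool_calls": assistant_tool_calls,        # buffered from the previous stream
    },
    {"role": "tool", "tool_call_id": assistant_tool_calls[0]["id"], "content": "22°C, clear"},
]

follow_up = client.chat.completions.create(
    model="deepseek-v4-pro",
    messages=messages,
    extra_body={"thinking": {"type": "enabled"}},
    stream=True,
)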

For the high-effort variant, reasoning_effort="max", set the context window to at least 384K tokens to avoid truncating the trace. See DeepSeek V4-Pro for the full feature matrix.

Reference: chunks, deltas, and the [DONE] sentinel

Each streamed chunk is a JSON object that mirrors the non-streaming response shape, with two key differences:

  1. choices[0].message is replaced by choices[0].delta — the incremental fields for this chunk.
  2. The whole stream terminates with a literal data: [DONE] line that is not JSON. Don’t try to parse it.

A representative chunk:

data: {"id":"...","object":"chat.completion.chunk","created":1761000000,
       "model":"deepseek-v4-flash",
       "choices":[{"index":0,"delta":{"content":" Hi"},"finish_reason":null}]}

| Field | Streaming behaviour |
| --- | --- |
| delta.role | Sent once on the first chunk ("assistant"), omitted thereafter. |
| delta.content | Incremental text. Concatenate across chunks for the final answer. |
| delta.reasoning_content | Thinking-mode trace. Streams before content begins. |
| delta.tool_calls | Function-call arguments stream as JSON fragments — buffer before parsing. |
| finish_reason | Null until the last content chunk, then the reason generation stopped: stop (natural stop point or a provided stop sequence), length (max_tokens reached), tool_calls (the model called a tool), content_filter (content omitted by the content filters), or insufficient_system_resource (interrupted for lack of inference capacity). |
| usage | Null on every chunk by default. Set stream_options={"include_usage": true} to receive one additional usage chunk before data: [DONE]. |
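
To make the delta.tool_calls row concrete, here is one way to buffer streamed function-call arguments before parsing — a sketch that assumes stream came from a request declaring tools; the accumulator shape is an illustration, not an official pattern:

import json
from collections import defaultdict

tool_names = {}                 # index -> function name
tool_args = defaultdict(str)    # index -> accumulated argument fragments

for chunk in stream:
    if not chunk.choices:
        continue                # e.g. the final usage-only chunk when include_usage is set
    for call in chunk.choices[0].delta.tool_calls or []:
        fn = call.function
        if fn is None:
            continue
        if fn.name:
            tool_names[call.index] = fn.name
        tool_args[call.index] += fn.arguments or ""

# Parse only once the stream is complete — individual fragments are not valid JSON.
parsed = {i: json.loads(args) for i, args in tool_args.items()}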

The full request/response schema lives in the DeepSeek API documentation; see also DeepSeek API code examples for ready-to-paste snippets in Node, Go, and Rust.

Production-grade streaming: timeouts, proxies, and keep-alives

The single most common production failure with streaming is not in the client code — it is in the network path. Long generations on V4-Pro thinking-max can run for minutes, and any reverse proxy, CDN, serverless gateway, or WAF in front of your service can silently kill the connection.

DeepSeek’s rate-limit documentation describes how the API behaves under scheduling pressure: non-streaming requests may return empty lines while waiting, streaming requests may return : keep-alive comments while waiting, and if inference has not started after 10 minutes the server closes the connection. A robust client must therefore:

  • Treat lines starting with : (the SSE comment syntax) as keep-alives — ignore them, don’t error.
  • Set a read timeout longer than your longest expected gap between tokens, not your total response budget. Sixty seconds is a reasonable floor for thinking-mode traffic.
  • Configure your reverse proxy (nginx, Cloudflare, ALB, API Gateway) to disable response buffering and allow long idle connections, so no gateway or serverless layer kills a long-running streamed response early.
  • Handle 429 as a retryable signal — DeepSeek throttles by dynamic concurrency, not a fixed RPM. See DeepSeek API rate limits for the current behaviour.

For SSE parsing in Python, two acceptable patterns: the OpenAI SDK (which handles framing for you) or requests with manual line iteration. If you prefer a helper, sseclient-py works, but rolling your own here is fine as long as you guard against partial lines and timeouts. In Node, fetch + a ReadableStream reader works; in browsers, the native EventSource API does not support custom Authorization headers, so use fetch + a streaming reader instead.
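
If you take the manual route, a minimal requests-based consumer might look like the sketch below; the timeout values are illustrative and error handling is pared down to the essentials:

import json
import os

import requests

payload = {
    "model": "deepseek-v4-flash",
    "messages": [{"role": "user", "content": "Stream a haiku about caching."}],
    "stream": True,
}

with requests.post(
    "https://api.deepseek.com/chat/completions",
    headers={"Authorization": f"Bearer {os.environ['DEEPSEEK_API_KEY']}"},
    json=payload,
    stream=True,
    timeout=(10, 60),  # connect timeout, read timeout between chunks
) as resp:
    resp.raise_for_status()
    for raw in resp.iter_lines(decode_unicode=True):
        if not raw or raw.startswith(":"):
            continue  # skip blank lines and ": keep-alive" comments
        if raw.startswith("data: "):
            data = raw[len("data: "):]
            if data == "[DONE]":
                break
            delta = json.loads(data)["choices"][0]["delta"]
            print(delta.get("content") or "", end="", flush=True)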

Cancelling in flight

Closing the HTTP connection cancels generation server-side. In Python with the OpenAI SDK, call stream.close(); in Node, call controller.abort() on the AbortController you passed in. You stop being billed for output tokens you never receive — useful when a user hits “stop” in your UI.
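
A short sketch of the cancel flow in Python, reusing the client from the quickstart; user_pressed_stop() stands in for whatever stop signal your UI provides:

stream = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Write a very long story."}],
    stream=True,
)

for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="", flush=True)
    if user_pressed_stop():   # hypothetical UI callback
        stream.close()        # closes the connection; generation stops server-side
        break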

Cost worked example: streaming does not change the bill

Streaming is purely a transport choice — DeepSeek bills the same tokens whether you stream or not. The price difference between tiers is what matters. As of April 2026, the V4-Flash rates are $0.028 cache-hit / $0.14 cache-miss / $0.28 output per 1M tokens, and V4-Pro is $0.145 / $1.74 / $3.48. Always verify on the live DeepSeek API pricing page before quoting.

Worked example for a streaming chat product on deepseek-v4-flash: 1,000,000 calls per month with a 2,000-token system prompt (cached), a 200-token user message (uncached), and a 300-token streamed response.

| Bucket | Tokens | Rate (V4-Flash) | Cost |
| --- | --- | --- | --- |
| Input, cache hit | 2,000,000,000 | $0.028 / 1M | $56.00 |
| Input, cache miss | 200,000,000 | $0.14 / 1M | $28.00 |
| Output | 300,000,000 | $0.28 / 1M | $84.00 |
| Total | | | $168.00 |

The same workload on deepseek-v4-pro at $0.145 / $1.74 / $3.48 lands at $1,682.00 — about ten times the bill, justified only when the agentic or coding lift on V4-Pro pays for itself. Off-peak discounts ended on September 5, 2025 and have not returned with V4. For a live calculator, see the DeepSeek cost estimator.
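
A back-of-the-envelope check of those totals (rates hard-coded from the table above — verify against the live pricing page before reusing them):

# Token volume per bucket for 1,000,000 calls/month (from the worked example above)
tokens = {"cache_hit": 2_000_000_000, "cache_miss": 200_000_000, "output": 300_000_000}

# USD per 1M tokens as of April 2026 — check the live pricing page before reusing
rates = {
    "deepseek-v4-flash": {"cache_hit": 0.028, "cache_miss": 0.14, "output": 0.28},
    "deepseek-v4-pro":   {"cache_hit": 0.145, "cache_miss": 1.74, "output": 3.48},
}

for model, rate in rates.items():
    total = sum(tokens[b] / 1_000_000 * rate[b] for b in tokens)
    print(f"{model}: ${total:,.2f}")
# deepseek-v4-flash: $168.00
# deepseek-v4-pro: $1,682.00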

Note that usage.prompt_cache_hit_tokens and usage.prompt_cache_miss_tokens on the final chunk tell you exactly which bucket each request fell into — set stream_options={"include_usage": true} if you want those numbers from a streamed call. Read more about prefix reuse in DeepSeek context caching.
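
A small sketch of reading those fields from a streamed call with the quickstart client; the getattr guard is there because the cache counters are DeepSeek-specific extras on the usage object:

stream = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[{"role": "user", "content": "Summarise our caching strategy."}],
    stream=True,
    stream_options={"include_usage": True},
)

usage = None
for chunk in stream:
    if chunk.usage:          # only the final chunk before [DONE] carries usage
        usage = chunk.usage
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)

print("\ncache hit:", getattr(usage, "prompt_cache_hit_tokens", None),
      "| cache miss:", getattr(usage, "prompt_cache_miss_tokens", None))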

Error handling patterns

Streaming failures fall into three buckets, each with a different fix:

| Failure mode | Symptom | Fix |
| --- | --- | --- |
| Pre-stream HTTP error | Non-200 status before any chunk arrives (401, 402, 429, 5xx) | Read the body as JSON, surface the error, retry with backoff for 429/5xx. |
| Mid-stream disconnect | Connection drops after partial content | Buffer received tokens, retry with the full original request — DeepSeek does not support resume. |
| Truncated finish | finish_reason: "length" on the final chunk | Increase max_tokens; for thinking-max, ensure a context budget ≥ 384K. |
| 400 on multi-turn | V4 “reasoning_content in the thinking mode must be passed back” | Round-trip the assistant’s reasoning_content in subsequent turns. |
| Empty / whitespace stream | Long-running stream with no real content | JSON mode without the word “json” in the prompt — see below. |

The last row deserves emphasis. JSON mode aims to return valid JSON but does not guarantee it, and the docs warn explicitly that when using JSON Output you must also instruct the model to produce JSON via a system or user message — without this, the model may generate an unending stream of whitespace until it hits the token limit, leaving a long-running and seemingly “stuck” request. Always include the literal word “json” plus an example schema in your prompt, set max_tokens high enough to avoid truncation, and handle occasional empty content. Full pattern in DeepSeek API JSON mode; for status codes, see DeepSeek API error codes.
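
A hedged sketch of a streaming JSON-mode call that follows those rules, assuming the response_format={"type": "json_object"} parameter from DeepSeek’s JSON Output docs; the schema in the system prompt is purely illustrative:

import json

stream = client.chat.completions.create(
    model="deepseek-v4-flash",
    messages=[
        # The word "json" and an example schema must appear in the prompt.
        {"role": "system", "content": 'Reply in json, shaped like {"city": "...", "temp_c": 0}.'},
        {"role": "user", "content": "Weather in Hangzhou?"},
    ],
    response_format={"type": "json_object"},
    max_tokens=500,  # generous enough that the JSON is not truncated mid-object
    stream=True,
)

buffer = ""
for chunk in stream:
    buffer += chunk.choices[0].delta.content or ""

try:
    result = json.loads(buffer)
except json.JSONDecodeError:
    result = None  # occasional empty or malformed output — retry or fall back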

Legacy IDs and the migration window

If you still have integrations using deepseek-chat or deepseek-reasoner, they continue to work — both currently route to deepseek-v4-flash (non-thinking and thinking respectively). The retirement date is 2026-07-24 at 15:59 UTC; after that, requests with those IDs fail. Migration is a one-line model= swap; base_url does not change.

For streaming code specifically, the only behavioural change to test for is the V4 reasoning_content round-trip rule mentioned earlier. Legacy deepseek-reasoner required you to strip reasoning_content from history; V4 requires you to keep it whenever a tool call was involved. Audit your client library before flipping the switch.

For broader API patterns post-migration, the DeepSeek API best practices guide covers retries, batching, and prefix design. The full reference set lives at the DeepSeek API docs and guides hub.

Last verified: 2026-04-25. DeepSeek AI Guide is an independent resource and is not affiliated with DeepSeek or its parent company. Model IDs, pricing and API behaviour change; check the official DeepSeek documentation and pricing page before committing to a production decision.

Frequently asked questions

How do I enable streaming in the DeepSeek API?

Set "stream": true in your JSON body when calling POST /chat/completions. The server replies with Content-Type: text/event-stream and pushes tokens as data: SSE chunks until a final data: [DONE] line. Both deepseek-v4-flash and deepseek-v4-pro support streaming, with or without thinking mode. See DeepSeek API documentation for the full request schema.

What does reasoning_content look like when streaming?

When thinking mode is enabled, each streamed chunk’s delta may carry a reasoning_content field with the chain-of-thought trace, separate from content (the final answer). Reasoning streams first; content follows. Buffer them into two strings if your UI shows both. The DeepSeek V4-Pro page documents the three reasoning-effort modes that produce these traces.

Does streaming cost more than a regular request?

No. DeepSeek bills tokens, not bytes on the wire — streaming and non-streaming requests with identical inputs and outputs cost the same. The advantage is perceived latency, not price. To estimate spend across token buckets, use the DeepSeek cost estimator, and confirm current rates on the DeepSeek API pricing page before committing.

Why does my stream hang and then disconnect?

Most often a proxy or gateway closing the idle connection. DeepSeek sends : keep-alive comments under scheduling pressure and closes its own end if inference has not started in ten minutes. Disable response buffering on your reverse proxy, set generous read timeouts, and treat 429 responses as retryable. The DeepSeek API rate limits guide covers concurrency behaviour in detail.

Can I stream tool calls and function arguments?

Yes. Tool-call arguments arrive as incremental JSON fragments on delta.tool_calls — buffer them across chunks before parsing, since a single argument string is split arbitrarily. V4 also requires you to round-trip reasoning_content on assistant messages that include tool calls, or the next request fails with 400. The DeepSeek API function calling guide has the full pattern.
