- The gateway starts sending responses as soon as the model begins generating content
- Each response chunk contains a delta (increment) of the content
- The final chunk indicates the completion of the response
Examples
You can enable streaming by setting the `stream` parameter to `true` in your inference request. The response will be returned as a Server-Sent Events (SSE) stream, followed by a final `[DONE]` message.
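For instance, here is a minimal sketch that consumes the stream over raw HTTP with Python's `requests` library. The endpoint URL, function name, and request fields are illustrative placeholders rather than the gateway's exact schema; see the API Reference for the actual fields.

```python
import json

import requests

# Hypothetical gateway endpoint and request body; adjust to your deployment.
resp = requests.post(
    "http://localhost:3000/inference",
    json={
        "function_name": "my_function",  # placeholder function name
        "input": {"messages": [{"role": "user", "content": "Hello!"}]},
        "stream": True,  # enable SSE streaming
    },
    stream=True,  # tell requests not to buffer the whole response
)

for line in resp.iter_lines():
    if not line:
        continue  # skip blank separator lines between SSE events
    payload = line.decode("utf-8").removeprefix("data: ")
    if payload == "[DONE]":
        break  # final message: the stream is complete
    chunk = json.loads(payload)
    print(chunk)  # each chunk carries a delta of the content
```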
When using a client library, the client will handle the SSE stream under the hood and return a stream of chunk objects.
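As a rough sketch, client-side iteration might look like the following. The client class, constructor, and method names here are hypothetical stand-ins, not the real client API; consult your client library's documentation for its actual interface.

```python
# Hypothetical client library usage; all names below are illustrative.
from my_gateway_client import GatewayClient  # placeholder import

client = GatewayClient(base_url="http://localhost:3000")  # placeholder constructor

stream = client.inference(
    function_name="my_function",  # placeholder function name
    input={"messages": [{"role": "user", "content": "Hello!"}]},
    stream=True,
)

# The client parses the SSE stream under the hood and yields chunk objects
# until the final [DONE] message.
for chunk in stream:
    print(chunk)
```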
See API Reference for more details.
You can also find a runnable example on GitHub.
Chat Functions
In chat functions, typically each chunk will contain a delta (increment) of the text content:
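For example, you might accumulate the streamed text like this. This sketch assumes each chunk carries a list of content-block deltas with a `text` field; the actual chunk schema is documented in the API Reference.

```python
from typing import Iterable


def collect_text(chunks: Iterable[dict]) -> str:
    """Accumulate streamed text deltas into the full message."""
    full_text = ""
    for chunk in chunks:
        # Assumed chunk shape: a list of content-block deltas with a "text" field.
        for block in chunk.get("content", []):
            if block.get("type") == "text":
                full_text += block["text"]  # append this chunk's delta
    return full_text


# Example with two simulated chunks:
print(collect_text([
    {"content": [{"type": "text", "text": "Hello"}]},
    {"content": [{"type": "text", "text": " world!"}]},
]))  # -> "Hello world!"
```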
JSON Functions
For JSON functions, each chunk contains a portion of the JSON string being generated. Note that the chunks may not be valid JSON on their own - you’ll need to concatenate them to get the complete JSON response. The gateway doesn’t return parsed or validated JSON objects when streaming.
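A minimal sketch of reassembling the streamed JSON, assuming each chunk exposes its portion of the string under a hypothetical `raw` field (check the API Reference for the actual field name):

```python
import json
from typing import Iterable


def collect_json(chunks: Iterable[dict]) -> dict:
    """Concatenate streamed JSON string portions, then parse the result."""
    raw_json = ""
    for chunk in chunks:
        # Assumed chunk shape: this chunk's portion of the JSON string
        # lives under a hypothetical "raw" field.
        raw_json += chunk.get("raw", "")
    # Only the fully concatenated string is guaranteed to parse.
    return json.loads(raw_json)


# Example with chunks that are not valid JSON on their own:
print(collect_json([{"raw": '{"answer": '}, {"raw": "42}"}]))  # -> {'answer': 42}
```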
Technical Notes
- Token usage information is only available in the final chunk with content (before the `[DONE]` message), as shown in the sketch after this list
- Streaming may not be available with certain inference-time optimizations
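For instance, you could capture token usage while iterating. This sketch assumes chunks expose an optional `usage` field; the actual field name and its contents are defined in the API Reference.

```python
from typing import Iterable, Optional


def final_usage(chunks: Iterable[dict]) -> Optional[dict]:
    """Return token usage from the last chunk that carries it, if any."""
    usage = None
    for chunk in chunks:
        # Earlier chunks omit usage; only the final content chunk carries it.
        if chunk.get("usage") is not None:
            usage = chunk["usage"]
    return usage


# Example: only the last chunk reports usage (field names are assumptions).
print(final_usage([
    {"content": [{"type": "text", "text": "Hi"}], "usage": None},
    {"content": [], "usage": {"input_tokens": 5, "output_tokens": 2}},
]))
```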