
What is Streaming?

Streaming delivers AI responses incrementally as they’re generated, rather than waiting for the complete response.

User experience:
  • Non-streaming: Wait 3 seconds → Full response appears
  • Streaming: Text appears word-by-word as generated
Why this matters: Perceived latency is drastically reduced. Users see progress immediately.

Why Server-Sent Events?

DualMind uses Server-Sent Events (SSE), not WebSocket.

SSE vs WebSocket

| Feature      | SSE                       | WebSocket                      |
|--------------|---------------------------|--------------------------------|
| Direction    | Server → Client only      | Bidirectional                  |
| Protocol     | HTTP                      | Custom (upgrade from HTTP)     |
| Reconnection | Automatic                 | Manual implementation required |
| Debugging    | Standard browser DevTools | Specialized tools              |
| Complexity   | Low                       | Higher                         |
| Use case     | Server pushes data        | Client-server chat             |
DualMind only needs server-to-client streaming during response generation. There’s no client data to stream back during inference, making SSE the perfect protocol.

Why Not WebSocket?

WebSocket adds unnecessary complexity:
  • Requires connection upgrade handshake
  • No automatic reconnection on disconnect
  • Harder to debug (non-HTTP protocol)
  • Bidirectional capability unused
DualMind principle: Use simplest protocol that meets requirements.

SSE Event Format

Events follow a strict format:
data: {JSON_PAYLOAD}\n\n
Rules:
  • Each event prefixed with data:
  • JSON must be single-line (no newlines in payload)
  • Each event terminated with double newline (\n\n)
  • UTF-8 encoding required
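These rules can be captured in a small serializer. The helper below is a hypothetical JavaScript sketch of the wire format, not DualMind’s actual implementation:

```javascript
// Hypothetical helper illustrating the SSE event format rules above.
// JSON.stringify never emits raw newlines, so the payload stays single-line.
function formatSseEvent(payload) {
  const json = JSON.stringify(payload);
  return `data: ${json}\n\n`; // "data: " prefix, double-newline terminator
}

// Example: a delta event as it appears on the wire
const wire = formatSseEvent({ type: 'ai.stream.delta', delta: { text: 'Hello' } });
// wire === 'data: {"type":"ai.stream.delta","delta":{"text":"Hello"}}\n\n'
```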

Event Types

All events use the ai.stream.* namespace, except the error event, which uses ai.error:
ai.stream.start

Purpose: Stream initialization; provides model metadata.
Payload:
{
  "type": "ai.stream.start",
  "model": {
    "name": "llama-3.3-70b-versatile",
    "provider": "groq"
  }
}
When sent: First event immediately after connection established
ai.stream.delta

Purpose: Incremental text chunk.
Payload:
{
  "type": "ai.stream.delta",
  "delta": {
    "text": "Quantum computing"
  }
}
When sent: Repeatedly as the model generates text (0+ times)
Client handling: Append delta.text to the displayed response
ai.stream.done

Purpose: Stream completion with final metrics.
Payload:
{
  "type": "ai.stream.done",
  "usage": {
    "promptTokens": 15,
    "completionTokens": 120,
    "totalTokens": 135
  },
  "timing": {
    "responseTimeMs": 2145
  }
}
When sent: After final delta event, stream terminates
ai.error

Purpose: Stream failure notification.
Payload:
{
  "type": "ai.error",
  "code": "PROVIDER_ERROR",
  "message": "Groq API timeout after 45 seconds"
}
When sent: On provider failure, timeout, or other errors
Stream behavior: Stream terminates after the error event

Event Lifecycle

A successful stream follows this sequence:
  1. Connection established: Client opens an HTTP connection to the streaming endpoint with JWT authentication.
  2. Start event: Server sends ai.stream.start with model information.
  3. Delta events: Server sends 0+ ai.stream.delta events as text generates.
  4. Done event: Server sends ai.stream.done with usage and timing statistics.
  5. Connection closes: HTTP connection terminates cleanly.
Failure path: start → error (connection closes)
Stream MUST NOT send delta events after done or error. Event ordering is strictly enforced.
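The ordering rule can be enforced on the client with a small guard. This is an illustrative sketch, not part of the DualMind client; createStreamGuard is a hypothetical helper:

```javascript
// Enforces the ordering rule: start → delta* → (done | error), nothing after.
function createStreamGuard() {
  let state = 'idle'; // idle → started → closed
  return function accept(event) {
    switch (event.type) {
      case 'ai.stream.start':
        if (state !== 'idle') throw new Error('duplicate ai.stream.start');
        state = 'started';
        return true;
      case 'ai.stream.delta':
        if (state !== 'started') throw new Error('delta before start or after close');
        return true;
      case 'ai.stream.done':
      case 'ai.error':
        if (state === 'closed') throw new Error('event after stream closed');
        state = 'closed';
        return true;
      default:
        throw new Error('unknown event type: ' + event.type);
    }
  };
}
```

Feeding each parsed event through such a guard turns any ordering violation into an immediate, debuggable error instead of silent corruption of the displayed response.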

Client Implementation

Browser (Fetch API with ReadableStream)

Modern browsers can consume SSE via the Fetch API:
// Note: EventSource does NOT support custom headers or POST requests.
// Use the Fetch API with ReadableStream instead:
const response = await fetch('http://localhost:5079/api/arena/chat/stream', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${jwt}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({ prompt: 'Hello' })
});

const reader = response.body.getReader();
const decoder = new TextDecoder();
let fullResponse = '';
let buffer = ''; // events can be split across network chunks

while (true) {
  const { done, value } = await reader.read();
  if (done) break;

  buffer += decoder.decode(value, { stream: true });
  const events = buffer.split('\n\n');
  buffer = events.pop(); // keep any incomplete trailing event for the next read

  for (const event of events) {
    if (event.startsWith('data: ')) {
      const data = JSON.parse(event.substring(6));

      switch (data.type) {
        case 'ai.stream.start':
          console.log('Model:', data.model.name);
          break;

        case 'ai.stream.delta':
          fullResponse += data.delta.text;
          displayText(fullResponse); // Update UI
          break;

        case 'ai.stream.done':
          console.log('Usage:', data.usage);
          break;

        case 'ai.error':
          console.error('Stream error:', data.message);
          break;
      }
    }
  }
}

Custom HTTP Client

For non-browser environments (e.g., Node.js or custom clients), the same pattern applies:
const response = await fetch('/api/arena/chat/stream', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${jwt}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({ prompt: 'Hello' })
});

const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = ''; // events can be split across network chunks

while (true) {
  const { done, value } = await reader.read();
  if (done) break;

  buffer += decoder.decode(value, { stream: true });
  const events = buffer.split('\n\n');
  buffer = events.pop(); // keep any incomplete trailing event for the next read

  for (const event of events) {
    if (event.startsWith('data: ')) {
      const json = event.substring(6); // Remove "data: " prefix
      handleEvent(JSON.parse(json));
    }
  }
}

Server Behavior

Connection Lifecycle

  1. Request received: Backend validates the JWT, extracts the user ID, performs user sync.
  2. Headers set: Response headers are configured: Content-Type: text/event-stream.
  3. Provider streaming: AI provider streams the response via a callback function.
  4. Event transformation: Provider chunks are transformed into SSE event format.
  5. Client detection: Backend monitors the HttpContext.RequestAborted token for disconnects.

Disconnect Handling

If client disconnects during streaming:
  1. RequestAborted cancellation token fires
  2. Provider receives cancellation signal
  3. Stream processing terminates immediately
  4. No database writes occur for partial streams
Partial streams are not persisted to thread messages. Only complete non-streaming requests write to database.
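On the client side, a stream can also be cancelled deliberately: aborting the fetch closes the HTTP connection, which is what fires RequestAborted on the server. This sketch reuses the endpoint from the examples above; streamWithCancel is a hypothetical helper:

```javascript
const controller = new AbortController();

async function streamWithCancel(jwt, prompt) {
  const response = await fetch('/api/arena/chat/stream', {
    method: 'POST',
    headers: { 'Authorization': `Bearer ${jwt}`, 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt }),
    signal: controller.signal // aborting this signal tears down the stream
  });
  const reader = response.body.getReader();
  try {
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      // ...parse SSE events as shown in the client examples above...
    }
  } catch (err) {
    if (err.name === 'AbortError') {
      // Expected on cancel: the server discards the partial stream
    } else {
      throw err;
    }
  }
}

// e.g. wired to a "Stop generating" button:
// stopButton.onclick = () => controller.abort();
```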

Timeout Behavior

Streaming requests have the same 45-second timeout and fallback chain as non-streaming requests:
  1. Provider timeout (45s)
  2. Fallback to alternative model (45s)
  3. Fallback to Bytez provider (45s)
  4. Return error if all fail
Difference from non-streaming: Error delivered as ai.error event instead of JSON response.

Streaming vs Non-Streaming

| Aspect               | Non-Streaming              | Streaming                |
|----------------------|----------------------------|--------------------------|
| Response time        | Wait for full response     | Immediate first token    |
| Perceived latency    | High                       | Low                      |
| Response format      | Single JSON object         | SSE event stream         |
| Database persistence | Yes (if threadId provided) | No                       |
| Client complexity    | Simple (await response)    | Higher (event handling)  |
| Bandwidth            | Single burst               | Gradual delivery         |

When to Use Streaming

Use streaming when:
  • User experience priority (perceived speed)
  • Long responses (>500 tokens)
  • Interactive chat interfaces
  • Real-time feedback important
Avoid streaming when:
  • Database persistence required immediately
  • Processing full response programmatically
  • Client doesn’t support SSE
  • Response needs to be votable (dual-chat)
Streaming endpoint does NOT write to thread_messages table. For persistent threads, client must call non-streaming endpoint separately after stream completes.

Dual-Chat Streaming

Current status: Not supported

Why not?
  • Two parallel SSE streams complicate client-side handling
  • Arena comparisons require complete responses for fair voting
  • Use case unclear (voters need full text anyway)
Workaround: If streaming needed for dual-chat, call single-chat streaming endpoint twice sequentially.

Error Scenarios

Provider Timeout

If provider exceeds 45 seconds:
data: {"type":"ai.stream.start","model":{...}}\n\n
data: {"type":"ai.error","code":"PROVIDER_TIMEOUT","message":"Groq API timeout"}\n\n
Stream terminates. Client should display error.

Network Interruption

If the network drops mid-stream:
  • A native EventSource client would auto-reconnect (sending a Last-Event-ID header)
  • DualMind doesn’t support resumption (no event IDs are assigned)
  • The client must start a new request

Malformed Events

If the provider sends invalid JSON:
  • Backend catches parse errors
  • Sends ai.error event
  • Stream terminates

Performance Considerations

Bandwidth

Streaming uses more total bandwidth than non-streaming:
  • Event overhead: data: prefix + \n\n suffix per chunk
  • JSON overhead: Repeated {"type":"ai.stream.delta","delta":{"text":"..."}}
Example: A 100-word response might generate 50-100 delta events vs a single non-streaming response.
Tradeoff: Higher bandwidth for better perceived latency.
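The overhead is easy to estimate from the rules in the SSE Event Format section. A back-of-envelope calculation (the 75-event chunk count is illustrative):

```javascript
// Fixed SSE framing: "data: " prefix plus "\n\n" terminator per event
const framingBytes = 'data: '.length + '\n\n'.length;

// JSON envelope around an empty delta payload
const envelopeBytes = JSON.stringify({ type: 'ai.stream.delta', delta: { text: '' } }).length;

const perEventOverhead = framingBytes + envelopeBytes;

// A 100-word response split into ~75 delta events:
const totalOverhead = 75 * perEventOverhead;
console.log(`~${perEventOverhead} bytes/event, ~${totalOverhead} bytes total overhead`);
```

A few kilobytes of overhead per response, which is negligible next to the latency benefit for interactive use.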

Server Resources

Streaming holds HTTP connection open longer:
  • Non-streaming: 2-3 seconds
  • Streaming: 2-3 seconds (same inference time, different delivery)
Impact: Minimal. Inference time dominates connection time.

Next Steps

Chat Modes

Single vs dual chat explained

Thread Management

Persisting conversations

System Overview

Architecture decisions