
What is Streaming?

Streaming delivers AI responses incrementally as they’re generated, rather than waiting for the complete response.

User experience:
  • Non-streaming: Wait 3 seconds → Full response appears
  • Streaming: Text appears word-by-word as generated
Why this matters: Perceived latency is drastically reduced. Users see progress immediately.

Why Server-Sent Events?

DualMind uses Server-Sent Events (SSE), not WebSocket.

SSE vs WebSocket

| Feature      | SSE                       | WebSocket                      |
|--------------|---------------------------|--------------------------------|
| Direction    | Server → Client only      | Bidirectional                  |
| Protocol     | HTTP                      | Custom (upgrade from HTTP)     |
| Reconnection | Automatic                 | Manual implementation required |
| Debugging    | Standard browser DevTools | Specialized tools              |
| Complexity   | Low                       | Higher                         |
| Use case     | Server pushes data        | Client-server chat             |
DualMind only needs server-to-client streaming during response generation. There’s no client data to stream back during inference, making SSE the perfect protocol.

Why Not WebSocket?

WebSocket adds unnecessary complexity:
  • Requires connection upgrade handshake
  • No automatic reconnection on disconnect
  • Harder to debug (non-HTTP protocol)
  • Bidirectional capability unused
DualMind principle: Use simplest protocol that meets requirements.

SSE Event Format

Events follow a strict format:
data: {JSON_PAYLOAD}\n\n
Rules:
  • Each event prefixed with data:
  • JSON must be single-line (no newlines in payload)
  • Each event terminated with double newline (\n\n)
  • UTF-8 encoding required
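These rules can be captured in a small serializer. The helper below is a hypothetical JavaScript sketch of the wire format, not DualMind’s actual implementation:

```javascript
// Hypothetical helper illustrating the SSE event format rules above.
// JSON.stringify never emits raw newlines, so the payload stays single-line.
function formatSseEvent(payload) {
  const json = JSON.stringify(payload);
  return `data: ${json}\n\n`; // "data: " prefix, double-newline terminator
}

// Example: a delta event as it appears on the wire
const wire = formatSseEvent({ type: 'ai.stream.delta', delta: { text: 'Hello' } });
// wire === 'data: {"type":"ai.stream.delta","delta":{"text":"Hello"}}\n\n'
```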

Event Types

All events use the ai.stream.* namespace, except the error event, which uses ai.error:
ai.stream.start

Purpose: Stream initialization; provides model metadata.
Payload:
{
  "type": "ai.stream.start",
  "model": {
    "name": "llama-3.3-70b-versatile",
    "provider": "groq"
  }
}
When sent: First event immediately after connection established
ai.stream.delta

Purpose: Incremental text chunk.
Payload:
{
  "type": "ai.stream.delta",
  "delta": {
    "text": "Quantum computing"
  }
}
When sent: Repeatedly as the model generates text (0+ times)
Client handling: Append delta.text to the displayed response
ai.stream.done

Purpose: Stream completion with final metrics.
Payload:
{
  "type": "ai.stream.done",
  "usage": {
    "promptTokens": 15,
    "completionTokens": 120,
    "totalTokens": 135
  },
  "timing": {
    "responseTimeMs": 2145
  }
}
When sent: After final delta event, stream terminates
ai.error

Purpose: Stream failure notification.
Payload:
{
  "type": "ai.error",
  "code": "PROVIDER_ERROR",
  "message": "Groq API timeout after 45 seconds"
}
When sent: On provider failure, timeout, or other errors
Stream behavior: Stream terminates after the error event

Event Lifecycle

A successful stream follows this sequence:
  1. Connection established: Client opens an HTTP connection to the streaming endpoint with JWT authentication.
  2. Start event: Server sends ai.stream.start with model information.
  3. Delta events: Server sends 0+ ai.stream.delta events as text generates.
  4. Done event: Server sends ai.stream.done with usage and timing statistics.
  5. Connection closes: HTTP connection terminates cleanly.
Failure path: start → error (connection closes)
Stream MUST NOT send delta events after done or error. Event ordering is strictly enforced.
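The ordering rule can be enforced on the client with a small guard. This is an illustrative sketch, not part of the DualMind client; createStreamGuard is a hypothetical helper:

```javascript
// Enforces the ordering rule: start → delta* → (done | error), nothing after.
function createStreamGuard() {
  let state = 'idle'; // idle → started → closed
  return function accept(event) {
    switch (event.type) {
      case 'ai.stream.start':
        if (state !== 'idle') throw new Error('duplicate ai.stream.start');
        state = 'started';
        return true;
      case 'ai.stream.delta':
        if (state !== 'started') throw new Error('delta before start or after close');
        return true;
      case 'ai.stream.done':
      case 'ai.error':
        if (state === 'closed') throw new Error('event after stream closed');
        state = 'closed';
        return true;
      default:
        throw new Error('unknown event type: ' + event.type);
    }
  };
}
```

Feeding each parsed event through such a guard turns any ordering violation into an immediate, debuggable error instead of silent corruption of the displayed response.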

Client Implementation

Browser (Fetch API with ReadableStream)

Modern browsers can consume SSE via the Fetch API:
// Note: EventSource does NOT support custom headers or POST requests.
// Use the Fetch API with ReadableStream instead:
const response = await fetch('http://localhost:5079/api/arena/chat/stream', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${jwt}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({ prompt: 'Hello' })
});

const reader = response.body.getReader();
const decoder = new TextDecoder();
let fullResponse = '';
let buffer = ''; // events can be split across network chunks

while (true) {
  const { done, value } = await reader.read();
  if (done) break;

  buffer += decoder.decode(value, { stream: true });
  const events = buffer.split('\n\n');
  buffer = events.pop(); // keep any incomplete trailing event for the next read

  for (const event of events) {
    if (event.startsWith('data: ')) {
      const data = JSON.parse(event.substring(6));

      switch (data.type) {
        case 'ai.stream.start':
          console.log('Model:', data.model.name);
          break;

        case 'ai.stream.delta':
          fullResponse += data.delta.text;
          displayText(fullResponse); // Update UI
          break;

        case 'ai.stream.done':
          console.log('Usage:', data.usage);
          break;

        case 'ai.error':
          console.error('Stream error:', data.message);
          break;
      }
    }
  }
}

Custom HTTP Client

For non-browser environments (e.g., Node.js or custom clients), the same pattern applies:
const response = await fetch('/api/arena/chat/stream', {
  method: 'POST',
  headers: {
    'Authorization': `Bearer ${jwt}`,
    'Content-Type': 'application/json'
  },
  body: JSON.stringify({ prompt: 'Hello' })
});

const reader = response.body.getReader();
const decoder = new TextDecoder();
let buffer = ''; // events can be split across network chunks

while (true) {
  const { done, value } = await reader.read();
  if (done) break;

  buffer += decoder.decode(value, { stream: true });
  const events = buffer.split('\n\n');
  buffer = events.pop(); // keep any incomplete trailing event for the next read

  for (const event of events) {
    if (event.startsWith('data: ')) {
      const json = event.substring(6); // Remove "data: " prefix
      handleEvent(JSON.parse(json));
    }
  }
}

Server Behavior

Connection Lifecycle

  1. Request received: Backend validates the JWT, extracts the user ID, performs user sync.
  2. Headers set: Response headers are configured: Content-Type: text/event-stream.
  3. Provider streaming: AI provider streams the response via a callback function.
  4. Event transformation: Provider chunks are transformed into SSE event format.
  5. Client detection: Backend monitors the HttpContext.RequestAborted token for disconnects.

Disconnect Handling

If client disconnects during streaming:
  1. RequestAborted cancellation token fires
  2. Provider receives cancellation signal
  3. Stream processing terminates immediately
  4. No database writes occur for partial streams
Partial streams are not persisted to thread messages. Only complete non-streaming requests write to database.
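On the client side, a stream can also be cancelled deliberately: aborting the fetch closes the HTTP connection, which is what fires RequestAborted on the server. This sketch reuses the endpoint from the examples above; streamWithCancel is a hypothetical helper:

```javascript
const controller = new AbortController();

async function streamWithCancel(jwt, prompt) {
  const response = await fetch('/api/arena/chat/stream', {
    method: 'POST',
    headers: { 'Authorization': `Bearer ${jwt}`, 'Content-Type': 'application/json' },
    body: JSON.stringify({ prompt }),
    signal: controller.signal // aborting this signal tears down the stream
  });
  const reader = response.body.getReader();
  try {
    while (true) {
      const { done, value } = await reader.read();
      if (done) break;
      // ...parse SSE events as shown in the client examples above...
    }
  } catch (err) {
    if (err.name === 'AbortError') {
      // Expected on cancel: the server discards the partial stream
    } else {
      throw err;
    }
  }
}

// e.g. wired to a "Stop generating" button:
// stopButton.onclick = () => controller.abort();
```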

Timeout Behavior

Streaming requests have the same 45-second timeout and fallback chain as non-streaming requests:
  1. Provider timeout (45s)
  2. Fallback to alternative model (45s)
  3. Fallback to Bytez provider (45s)
  4. Return error if all fail
Difference from non-streaming: Error delivered as ai.error event instead of JSON response.

Streaming vs Non-Streaming

| Aspect               | Non-Streaming              | Streaming                |
|----------------------|----------------------------|--------------------------|
| Response time        | Wait for full response     | Immediate first token    |
| Perceived latency    | High                       | Low                      |
| Response format      | Single JSON object         | SSE event stream         |
| Database persistence | Yes (if threadId provided) | No                       |
| Client complexity    | Simple (await response)    | Higher (event handling)  |
| Bandwidth            | Single burst               | Gradual delivery         |

When to Use Streaming

Use streaming when:
  • User experience priority (perceived speed)
  • Long responses (>500 tokens)
  • Interactive chat interfaces
  • Real-time feedback important
Avoid streaming when:
  • Database persistence required immediately
  • Processing full response programmatically
  • Client doesn’t support SSE
  • Response needs to be votable (dual-chat)
Streaming endpoint does NOT write to thread_messages table. For persistent threads, client must call non-streaming endpoint separately after stream completes.

Dual-Chat Streaming

Current status: Not supported

Why not?
  • Two parallel SSE streams complicate client-side handling
  • Arena comparisons require complete responses for fair voting
  • Use case unclear (voters need full text anyway)
Workaround: If streaming needed for dual-chat, call single-chat streaming endpoint twice sequentially.

Error Scenarios

Provider Timeout

If provider exceeds 45 seconds:
data: {"type":"ai.stream.start","model":{...}}\n\n
data: {"type":"ai.error","code":"PROVIDER_TIMEOUT","message":"Groq API timeout"}\n\n
Stream terminates. Client should display error.

Network Interruption

If the network drops mid-stream:
  • A native EventSource client would auto-reconnect (sending a Last-Event-ID header)
  • DualMind doesn’t support resumption (no event IDs are assigned)
  • The client must start a new request

Malformed Events

If the provider sends invalid JSON:
  • Backend catches parse errors
  • Sends ai.error event
  • Stream terminates

Performance Considerations

Bandwidth

Streaming uses more total bandwidth than non-streaming:
  • Event overhead: data: prefix + \n\n suffix per chunk
  • JSON overhead: Repeated {"type":"ai.stream.delta","delta":{"text":"..."}}
Example: A 100-word response might generate 50-100 delta events vs a single non-streaming response.
Tradeoff: Higher bandwidth for better perceived latency.
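The overhead is easy to estimate from the rules in the SSE Event Format section. A back-of-envelope calculation (the 75-event chunk count is illustrative):

```javascript
// Fixed SSE framing: "data: " prefix plus "\n\n" terminator per event
const framingBytes = 'data: '.length + '\n\n'.length;

// JSON envelope around an empty delta payload
const envelopeBytes = JSON.stringify({ type: 'ai.stream.delta', delta: { text: '' } }).length;

const perEventOverhead = framingBytes + envelopeBytes;

// A 100-word response split into ~75 delta events:
const totalOverhead = 75 * perEventOverhead;
console.log(`~${perEventOverhead} bytes/event, ~${totalOverhead} bytes total overhead`);
```

A few kilobytes of overhead per response, which is negligible next to the latency benefit for interactive use.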

Server Resources

Streaming holds HTTP connection open longer:
  • Non-streaming: 2-3 seconds
  • Streaming: 2-3 seconds (same inference time, different delivery)
Impact: Minimal. Inference time dominates connection time.

Next Steps

Chat Modes

Single vs dual chat explained

Thread Management

Persisting conversations

System Overview

Architecture decisions