Provider Architecture
Interface Abstraction
GroqProvider: Primary LPU-accelerated inference
BytezProvider: Fallback for reliability
Provider Factory
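The interface abstraction and factory can be sketched in Python (the actual implementation language and names may differ; `ChatProvider`, `create_provider`, and the method signatures here are illustrative assumptions):

```python
from abc import ABC, abstractmethod

class ChatProvider(ABC):
    """Common interface both providers implement (names are illustrative)."""

    @abstractmethod
    def complete(self, model: str, messages: list) -> str:
        """Return the assistant's reply for a chat transcript."""

class GroqProvider(ChatProvider):
    def complete(self, model: str, messages: list) -> str:
        # Real implementation calls the Groq API here.
        return f"groq:{model}"

class BytezProvider(ChatProvider):
    def complete(self, model: str, messages: list) -> str:
        # Real implementation calls the Bytez API here.
        return f"bytez:{model}"

# Factory: map a provider name to a constructed instance.
_PROVIDERS = {"groq": GroqProvider, "bytez": BytezProvider}

def create_provider(name: str) -> ChatProvider:
    try:
        return _PROVIDERS[name]()
    except KeyError:
        raise ValueError(f"unknown provider: {name}") from None
```

Keeping both providers behind one interface is what makes the failover chain below a simple loop rather than provider-specific branching.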
Groq Provider
API Configuration
Chat Completion Request
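Groq exposes an OpenAI-compatible chat completions endpoint. A minimal stdlib sketch of building the request (the payload fields shown are the common OpenAI-style ones; the project's actual request builder may differ):

```python
import json
import urllib.request

GROQ_URL = "https://api.groq.com/openai/v1/chat/completions"

def build_request(api_key: str, model: str, messages: list,
                  temperature: float = 0.7) -> urllib.request.Request:
    """Build an OpenAI-compatible chat completion request for Groq."""
    payload = {"model": model, "messages": messages, "temperature": temperature}
    return urllib.request.Request(
        GROQ_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
```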
Streaming Implementation
Error Handling
Bytez Provider
API Configuration
Implementation Differences
Request Format: Similar to OpenAI API
Response Parsing: Expects OpenAI-compatible response structure
Timeout: Same 45-second limit
Streaming: Not currently implemented (fallback is non-streaming only)

Failover Chain
Single Chat Failover
Dual Chat Failover
Each model has an independent fallback chain:
- ✅ Both succeed → Full dual-chat response
- ⚠️ One succeeds → Partial response with error note
- ❌ Both fail → 500 error
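The three outcomes above follow from running each model's chain independently and only failing the whole request when both chains are exhausted. A sketch:

```python
def dual_chat(prompt, chains):
    """Run independent fallback chains; tolerate one side failing.

    chains maps a model label to its ordered list of provider callables.
    Returns (results, errors); a 500 is only warranted when results is empty.
    """
    results, errors = {}, {}
    for label, providers in chains.items():
        for provider in providers:
            try:
                results[label] = provider(prompt)
                errors.pop(label, None)  # a success clears earlier failures
                break
            except Exception as exc:
                errors[label] = exc
    return results, errors
```

With one side succeeding, `results` holds the partial response and `errors` holds the note for the failed side.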
Performance Characteristics
Response Times
Groq (LPU):
- Simple prompt: 500-1500ms
- Complex prompt: 1500-3000ms
- Streaming first token: 100-300ms
Bytez:
- Simple prompt: 1500-3000ms
- Complex prompt: 3000-5000ms
Timeout Strategy
45-Second Rationale:
- Balances user patience with completion probability
- Most responses complete within 30 seconds
- Allows fallback attempts within reasonable total time
Alternative timeouts:
- Shorter timeout (30s): More frequent fallbacks
- Longer timeout (60s): Fewer fallbacks but slower failover
Rate Limits
Groq Free Tier:
- 30 requests/minute
- 14,400 tokens/minute
Groq Paid Tiers:
- Higher limits (check API dashboard)

Bytez:
- Provider-specific limits (not documented here)
A 429 Too Many Requests response triggers the fallback chain.
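A client-side guard can avoid hitting the 30 requests/minute free-tier ceiling before the server returns a 429. A sliding-window sketch (class and parameter names are illustrative):

```python
import time
from collections import deque

class RateLimiter:
    """Client-side guard for Groq's free-tier 30 requests/minute."""
    def __init__(self, max_requests=30, window_s=60.0):
        self.max_requests = max_requests
        self.window_s = window_s
        self._times = deque()  # timestamps of requests in the current window

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        # Drop timestamps that have aged out of the window.
        while self._times and now - self._times[0] >= self.window_s:
            self._times.popleft()
        if len(self._times) < self.max_requests:
            self._times.append(now)
            return True
        return False  # caller should fall back or wait
```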
Model Registry Integration
Model Lookup
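Model lookup resolves a model id to its primary and fallback providers. A sketch with a hypothetical registry (the model ids and dict shape here are assumptions, not the project's actual registry):

```python
# Hypothetical registry: model id -> primary/fallback provider names.
MODEL_REGISTRY = {
    "llama-3.1-8b": {"provider": "groq", "fallback": "bytez"},
    "mixtral-8x7b": {"provider": "groq", "fallback": "bytez"},
}

def lookup_model(model_id):
    """Resolve a model id to its (primary, fallback) provider pair."""
    entry = MODEL_REGISTRY.get(model_id)
    if entry is None:
        raise KeyError(f"unknown model: {model_id}")
    return entry["provider"], entry["fallback"]
```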
Provider Assignment
Static Mapping (current):

Provider-Specific Features
Groq LPU Advantages
Low Latency: Hardware-optimized tensor processing
Fast Streaming: Sub-300ms first token latency
Cost: Competitive pricing on a per-token basis
Models: Llama, Mixtral, Gemma families

Bytez Reliability
Uptime: Independent from Groq (diversification)
Fallback: Critical for production availability
Models: Variety beyond Groq's catalog

Monitoring & Observability
Metrics to Track
Provider Success Rate:
- How often does primary fail?
- How often does secondary succeed?

Latency Percentiles:
- p50 (median)
- p95
- p99
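These metrics can be accumulated with a small per-provider tracker; a sketch using a nearest-rank percentile (class and field names are illustrative):

```python
class ProviderMetrics:
    """Track per-provider latency percentiles and success/fallback rates."""
    def __init__(self):
        self.latencies_ms = []
        self.successes = 0
        self.failures = 0
        self.fallbacks = 0

    def record(self, latency_ms, ok, used_fallback=False):
        self.latencies_ms.append(latency_ms)
        self.successes += ok
        self.failures += not ok
        self.fallbacks += used_fallback

    def percentile(self, p):
        """Nearest-rank percentile over recorded latencies."""
        xs = sorted(self.latencies_ms)
        if not xs:
            return None
        idx = min(len(xs) - 1, int(round(p / 100 * (len(xs) - 1))))
        return xs[idx]

    def success_rate(self):
        total = self.successes + self.failures
        return self.successes / total if total else None
```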
Alerting Thresholds
High Timeout Rate: > 10% of requests time out
Low Success Rate: < 95% for primary provider
Fallback Dependency: > 20% of requests use fallback

Configuration
Environment Variables
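A sketch of loading provider configuration from the environment (the variable names `GROQ_API_KEY`, `BYTEZ_API_KEY`, and `PROVIDER_TIMEOUT_S` are hypothetical; adjust to the project's actual configuration):

```python
import os

def load_config(env=None):
    """Load provider credentials and timeout from environment variables."""
    env = os.environ if env is None else env
    missing = [k for k in ("GROQ_API_KEY", "BYTEZ_API_KEY") if k not in env]
    if missing:
        raise RuntimeError(f"missing environment variables: {missing}")
    return {
        "groq_api_key": env["GROQ_API_KEY"],
        "bytez_api_key": env["BYTEZ_API_KEY"],
        "timeout_s": float(env.get("PROVIDER_TIMEOUT_S", "45")),
    }
```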
HttpClient Configuration
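The key configuration point is a single shared client carrying the 45-second timeout and common headers, so both providers behave consistently. A stdlib Python sketch of that shape (the actual project may use a platform HttpClient; this class is illustrative):

```python
import json
import urllib.request

class HttpClient:
    """Minimal shared HTTP client: one timeout, common headers."""
    def __init__(self, api_key, timeout_s=45.0):
        self.timeout_s = timeout_s
        self.headers = {
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        }

    def post_json(self, url, payload):
        """POST a JSON payload and decode the JSON response."""
        req = urllib.request.Request(
            url,
            data=json.dumps(payload).encode("utf-8"),
            headers=self.headers,
            method="POST",
        )
        with urllib.request.urlopen(req, timeout=self.timeout_s) as resp:
            return json.loads(resp.read())
```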
Future Enhancements
Adaptive Routing
Route based on:- Model performance metrics
- Current provider latency
- Rate limit status
- Cost optimization
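A routing decision over those signals might be sketched as follows (the candidate dict shape and scoring rule are assumptions for illustration):

```python
def choose_provider(candidates):
    """Pick the best provider from live metrics.

    candidates: dicts with latency_ms, success_rate, rate_limited keys.
    Returns None when nothing is eligible (caller falls back to static order).
    """
    eligible = [c for c in candidates if not c["rate_limited"]]
    if not eligible:
        return None
    # Prefer high success rate, then low latency.
    return min(eligible, key=lambda c: (-c["success_rate"], c["latency_ms"]))
```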
Circuit Breaker
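A circuit breaker stops sending requests to a provider after repeated failures and probes it again after a cooldown. A minimal sketch (threshold and cooldown values are illustrative defaults):

```python
import time

class CircuitBreaker:
    """Open after N consecutive failures; retry after a cooldown."""
    def __init__(self, threshold=5, cooldown_s=30.0):
        self.threshold = threshold
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at = None

    def allow(self, now=None):
        if self.opened_at is None:
            return True  # circuit closed: requests flow normally
        now = time.monotonic() if now is None else now
        if now - self.opened_at >= self.cooldown_s:
            # Half-open: let one request through to probe recovery.
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record(self, ok, now=None):
        if ok:
            self.failures = 0
            self.opened_at = None
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic() if now is None else now
```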
Response Caching
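Since a request with temperature = 0 is deterministic, identical inputs yield identical outputs and the response can be cached under a key derived from the full request. A sketch (function names are hypothetical):

```python
import hashlib
import json

_cache = {}

def cache_key(model, messages, temperature):
    """Stable key over the full request; only temperature == 0 is cacheable."""
    blob = json.dumps({"model": model, "messages": messages, "t": temperature},
                      sort_keys=True)
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()

def cached_complete(provider, model, messages, temperature=0.0):
    if temperature != 0.0:
        return provider(model, messages)  # non-deterministic: never cache
    key = cache_key(model, messages, temperature)
    if key not in _cache:
        _cache[key] = provider(model, messages)
    return _cache[key]
```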
Deterministic Requests (temperature = 0):

Next Steps
- Request Lifecycle: Provider execution in request flow
- System Invariants: Provider timeout and fallback invariants