Realtime Voice
Overview
Support reference for the current realtime voice server and client flows.
Runtime entrypoints
- HTTP/WebSocket runtime:
  - `backend/src/realtimeServer.ts`
- WebSocket orchestration:
  - `backend/src/components/realtime/RealtimeProxy.ts`
- Provider/runtime implementations:
  - `backend/src/components/realtime/agents/realtime/OpenAIRealtimeAgent.ts`
  - `backend/src/components/realtime/agents/realtime/GoogleAIRealtimeAgent.ts`
  - `backend/src/components/realtime/agents/tts/TtsAgent.ts`
- WebSocket path:
  - `/chat/realtime?session={sessionId}`
- Health endpoint:
  - `GET /chat/realtime/health`
Session creation endpoints
- Public webchat voice:
  - `POST /chat/realtime/create-session/{employeeId}`
  - The widget currently passes its configured `eventId` in that path slot.
- Internal skill voice:
  - `POST /chat/realtime/skill/create-session/{organization_id}/{skill_id}`
- Internal AI teammate voice:
  - `POST /chat/realtime/employee/create-session/{organization_id}/{employee_id}`
- Internal Ally voice:
  - `POST /chat/realtime/ally/create-session/{organization_id}`
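The routes above, together with the WebSocket path from the runtime entrypoints, can be sketched as a pair of path builders. This is a minimal illustration only: `SessionTarget`, `createSessionPath`, and `realtimeSocketPath` are hypothetical names that do not come from the codebase, and URL-encoding each path segment is an assumption.

```typescript
// Hypothetical helper types; only the route shapes come from this reference.
type SessionTarget =
  | { kind: "webchat"; employeeId: string }
  | { kind: "skill"; organizationId: string; skillId: string }
  | { kind: "employee"; organizationId: string; employeeId: string }
  | { kind: "ally"; organizationId: string };

// Builds the POST path for a create-session request.
function createSessionPath(target: SessionTarget): string {
  const seg = encodeURIComponent; // assumption: segments are URL-encoded
  switch (target.kind) {
    case "webchat":
      // The widget currently passes its configured eventId in this slot.
      return `/chat/realtime/create-session/${seg(target.employeeId)}`;
    case "skill":
      return `/chat/realtime/skill/create-session/${seg(target.organizationId)}/${seg(target.skillId)}`;
    case "employee":
      return `/chat/realtime/employee/create-session/${seg(target.organizationId)}/${seg(target.employeeId)}`;
    case "ally":
      return `/chat/realtime/ally/create-session/${seg(target.organizationId)}`;
  }
}

// Builds the WebSocket path for a session ID returned by create-session.
function realtimeSocketPath(sessionId: string): string {
  return `/chat/realtime?session=${encodeURIComponent(sessionId)}`;
}
```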
Request rules and responses
- Employee and Ally session creation require exactly one of:
  - `conversation_id`
  - `create_new_conversation`
- Public webchat session creation can include:
  - `contact`
  - `multi_conversations`
- Successful create-session responses include:
  - `sessionId`
  - `isAudioAutoTurnOn`
- Employee and Ally responses also include `conversation_id`.
- Session IDs are one-time use and currently match the created workflow run ID.
- Pending session configs are stored in Redis with a 20-minute TTL and deleted on the first websocket connect.
- All realtime session routes reject `structured` workflows. The published starting step must resolve to an `agent` step with a configured TTS-capable employee.
- Supported `locale` values are currently:
  - `en-US`
  - `ru-RU`
  - `es-ES`
- The default fallback locale is `en-US`.
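Two of the rules above lend themselves to a short sketch: the exactly-one-of check for Employee and Ally requests, and the locale fallback. The function names are hypothetical, and the field types (`conversation_id` as a string, `create_new_conversation` as a boolean) are assumptions, as is treating `create_new_conversation: false` as absent and applying the fallback to unsupported as well as missing locales.

```typescript
// Locale values listed in this reference.
const SUPPORTED_LOCALES = ["en-US", "ru-RU", "es-ES"] as const;
type Locale = (typeof SUPPORTED_LOCALES)[number];

// Returns the requested locale if supported, otherwise the en-US fallback.
function resolveLocale(requested?: string): Locale {
  const match = SUPPORTED_LOCALES.find((locale) => locale === requested);
  return match ?? "en-US";
}

// Assumed field types; the reference only names the two fields.
interface ConversationSelector {
  conversation_id?: string;
  create_new_conversation?: boolean;
}

// True when exactly one of the two fields is effectively set.
function hasExactlyOneConversationField(body: ConversationSelector): boolean {
  const hasId =
    typeof body.conversation_id === "string" && body.conversation_id.length > 0;
  const wantsNew = body.create_new_conversation === true;
  return hasId !== wantsNew;
}
```

A client could run `hasExactlyOneConversationField` before posting, so a malformed body is caught locally rather than by the server.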
Provider routing
- If the selected TTS model is realtime and the provider is Google:
  - `GoogleAIRealtimeAgent`
  - `GeminiLiveVoice`
- If the selected TTS model is realtime and not Google:
  - `OpenAIRealtimeAgent`
- If the TTS model is non-realtime but still TTS-capable:
  - `TtsAgent`
  - OpenAI realtime STT is used when an STT model is configured.
- The websocket runtime rebuilds a `RequestContext` from `workflowRun.state.runtimeContext` and adds `organizationId`, `employeeId`, and `runId` before the session starts.
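The routing rules above reduce to a small decision function. The sketch below is illustrative only: `TtsModelInfo` and `pickAgent` are hypothetical names, and representing the provider as a case-insensitive string is an assumption about how the model metadata is stored.

```typescript
// Hypothetical model descriptor; the real metadata shape is not documented here.
interface TtsModelInfo {
  provider: string;
  isRealtime: boolean;
}

type AgentName = "GoogleAIRealtimeAgent" | "OpenAIRealtimeAgent" | "TtsAgent";

// Mirrors the routing list: realtime + Google, realtime + other, non-realtime TTS.
function pickAgent(model: TtsModelInfo): AgentName {
  if (!model.isRealtime) {
    return "TtsAgent";
  }
  return model.provider.toLowerCase() === "google"
    ? "GoogleAIRealtimeAgent"
    : "OpenAIRealtimeAgent";
}
```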
Audio and message flow
- Clients send binary microphone audio over the websocket.
- The websocket wire format expects mono signed `16-bit PCM` audio at `24kHz`.
- The internal frontend explicitly resamples to `24kHz` before sending.
- Clients can also send JSON:
  - `type: "user_message"`
  - `data.text`
  - optional `data.attachments`
- The server validates attachments for websocket `user_message` payloads before processing them.
- Attachment guardrails on websocket messages:
  - maximum 5 files
  - maximum 10 MB each
  - allowed MIME families: images, PDF, plain text, XML, and SQLite
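As a rough client-side illustration of the flow above, the sketch below converts already-resampled 24 kHz mono float samples to signed 16-bit PCM, builds a `user_message` payload, and pre-checks attachments against the documented guardrails. Several details are assumptions rather than documented behavior: little-endian byte order, 10 MB meaning 10 * 1024 * 1024 bytes, the specific MIME strings chosen for each family, and the `AttachmentMeta` shape. All function names are hypothetical.

```typescript
// Converts float samples in [-1, 1] to signed 16-bit PCM.
// Assumption: little-endian byte order; input is already mono at 24 kHz.
function floatToPcm16(samples: Float32Array): ArrayBuffer {
  const buffer = new ArrayBuffer(samples.length * 2);
  const view = new DataView(buffer);
  samples.forEach((sample, i) => {
    const clamped = Math.max(-1, Math.min(1, sample));
    const scaled = clamped < 0 ? clamped * 0x8000 : clamped * 0x7fff;
    view.setInt16(i * 2, scaled, true);
  });
  return buffer;
}

// Serializes the documented JSON message; attachments are passed through as-is.
function buildUserMessage(text: string, attachments?: unknown[]): string {
  const data = attachments ? { text, attachments } : { text };
  return JSON.stringify({ type: "user_message", data });
}

// Hypothetical attachment descriptor used only for this pre-check.
interface AttachmentMeta {
  mimeType: string;
  sizeBytes: number;
}

const MAX_FILES = 5;
const MAX_BYTES = 10 * 1024 * 1024; // assumption: binary megabytes
// Assumption: these exact MIME strings stand in for the documented families.
const ALLOWED_EXACT = new Set([
  "application/pdf",
  "text/plain",
  "application/xml",
  "text/xml",
  "application/vnd.sqlite3",
  "application/x-sqlite3",
]);

// Returns null when the attachments pass, otherwise a short reason.
function checkAttachments(files: AttachmentMeta[]): string | null {
  if (files.length > MAX_FILES) return "too many files";
  for (const file of files) {
    if (file.sizeBytes > MAX_BYTES) return "file too large";
    const allowed =
      file.mimeType.startsWith("image/") || ALLOWED_EXACT.has(file.mimeType);
    if (!allowed) return "unsupported MIME type";
  }
  return null;
}
```

The server remains the authority on validation; a local check like this only avoids sending payloads that would be rejected anyway.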
Server -> client events
- The realtime websocket currently sends JSON events such as:
  - `session.created`
  - `speech_started`
  - `speech_stopped`
  - `response.cancelled`
  - `error`
- Audio responses are streamed back as binary websocket frames.
- `TtsAgent` additionally sends text-sidecar events such as:
  - `text` with `role: user|assistant`
  - `speaking.done`
- Internal chat also listens to org websocket events from the TTS path:
  - `voice_thinking_started`
  - `voice_thinking_ended`
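Because the socket carries both binary audio and JSON events, a client typically has to classify each incoming frame first. The sketch below shows one way to do that. It assumes JSON events carry a top-level `type` field (mirroring the client-side `user_message` shape), which this reference does not state explicitly; `ServerFrame` and `classifyFrame` are hypothetical names.

```typescript
// Hypothetical result type for an incoming websocket frame.
type ServerFrame =
  | { kind: "audio"; bytes: ArrayBuffer }
  | { kind: "event"; type: string; payload: Record<string, unknown> }
  | { kind: "invalid"; reason: string };

// Binary frames are treated as audio; text frames are parsed as JSON events.
function classifyFrame(data: string | ArrayBuffer): ServerFrame {
  if (typeof data !== "string") {
    return { kind: "audio", bytes: data };
  }
  try {
    const parsed: unknown = JSON.parse(data);
    if (typeof parsed === "object" && parsed !== null) {
      const payload = parsed as Record<string, unknown>;
      const type = payload["type"];
      // Assumption: every JSON event has a string `type` such as "session.created".
      if (typeof type === "string") {
        return { kind: "event", type, payload };
      }
    }
    return { kind: "invalid", reason: "missing type" };
  } catch {
    return { kind: "invalid", reason: "not JSON" };
  }
}
```

A caller would then switch on `kind`, queueing `audio` frames for playback and routing `event` frames by their `type`.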
Automation and takeover behavior
- Realtime agents refresh automation state from the conversation while the session is running.
- If automation is disabled on the conversation, the voice session stays connected and still saves user messages, but skips AI response generation.
- If the organization is out of tokens when voice starts, the server sets the conversation to `waiting_for_operator` and closes the realtime websocket with code `4008`.
- Native realtime paths also re-check token usage after each response and close with `4008` when the balance reaches zero.
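The behaviors above can be summarized as three small decisions. This is a sketch under stated assumptions, not the server's implementation: the function names and the `TurnPlan` shape are hypothetical, and treating any balance at or below zero as exhausted is an assumption.

```typescript
// Close code documented for the out-of-tokens case.
const OUT_OF_TOKENS_CLOSE_CODE = 4008;

// Hypothetical description of what happens to one user turn.
interface TurnPlan {
  saveUserMessage: boolean;
  generateResponse: boolean;
}

// Automation off: the session stays open and the user turn is saved,
// but AI response generation is skipped.
function planTurn(automationEnabled: boolean): TurnPlan {
  return { saveUserMessage: true, generateResponse: automationEnabled };
}

// Returns the close code to use when the balance is exhausted, otherwise null.
// Assumption: a balance at or below zero counts as exhausted.
function closeCodeForBalance(tokenBalance: number): number | null {
  return tokenBalance <= 0 ? OUT_OF_TOKENS_CLOSE_CODE : null;
}

// Client side: distinguishes the out-of-tokens close from other closes.
function isOutOfTokensClose(code: number): boolean {
  return code === OUT_OF_TOKENS_CLOSE_CODE;
}
```

Since session IDs are one-time use, a client that sees this close code would need a new session rather than a reconnect to the same one.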