Realtime Voice
Public reference generated from tech docs/realtime-voice.md.
Overview
Support reference for the current realtime voice server and client flows.
Runtime entrypoints
- HTTP/WebSocket runtime:
  - `backend/src/realtimeServer.ts`
- WebSocket orchestration:
  - `backend/src/components/realtime/RealtimeProxy.ts`
- Provider/runtime implementations:
  - `backend/src/components/realtime/agents/realtime/OpenAIRealtimeAgent.ts`
  - `backend/src/components/realtime/agents/realtime/GoogleAIRealtimeAgent.ts`
  - `backend/src/components/realtime/agents/tts/TtsAgent.ts`
- WebSocket path:
  - `/chat/realtime?session={sessionId}`
Session creation endpoints
- Public webchat voice:
  - `POST /chat/realtime/create-session/{employeeId}`
  - The widget currently passes its configured `eventId` in that path slot.
- Internal skill voice:
  - `POST /chat/realtime/skill/create-session/{organization_id}/{skill_id}`
- Internal AI teammate voice:
  - `POST /chat/realtime/employee/create-session/{organization_id}/{employee_id}`
- Internal Ally voice:
  - `POST /chat/realtime/ally/create-session/{organization_id}`
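As a rough sketch, the create-session paths above and the websocket path can be assembled like this. Only the URL shapes come from this document; the helper names are illustrative, not part of the actual codebase.

```typescript
// Hypothetical helpers that assemble the documented create-session paths.
function webchatSessionPath(employeeId: string): string {
  // The public widget passes its configured eventId in this slot.
  return `/chat/realtime/create-session/${employeeId}`;
}

function skillSessionPath(organizationId: string, skillId: string): string {
  return `/chat/realtime/skill/create-session/${organizationId}/${skillId}`;
}

function employeeSessionPath(organizationId: string, employeeId: string): string {
  return `/chat/realtime/employee/create-session/${organizationId}/${employeeId}`;
}

function allySessionPath(organizationId: string): string {
  return `/chat/realtime/ally/create-session/${organizationId}`;
}

// After creating a session, clients connect to the websocket runtime
// using the one-time sessionId returned by the create-session endpoint.
function realtimeSocketUrl(host: string, sessionId: string): string {
  return `wss://${host}/chat/realtime?session=${encodeURIComponent(sessionId)}`;
}
```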
Request rules and responses
- Employee and Ally session creation require exactly one of:
  - `conversation_id`
  - `create_new_conversation`
- Public webchat session creation can include:
  - `contact`
  - `multi_conversations`
- Successful create-session responses include:
  - `sessionId`
  - `isAudioAutoTurnOn`
- Employee and Ally responses also include `conversation_id`.
- Session IDs are one-time use and currently match the created workflow run ID.
- Pending session configs are stored in Redis with a 20-minute TTL and deleted on the first websocket connect.
- All realtime session routes reject `structured` workflows; the published starting step must resolve to an `agent` step with a configured TTS-capable employee.
- Supported `locale` values are currently:
  - `en-US`
  - `ru-RU`
  - `es-ES`
- The default fallback locale is `en-US`.
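The "exactly one of" rule and the locale fallback can be sketched as small validators. This is illustrative only, assuming a minimal request-body shape; the real route handlers may structure this differently.

```typescript
interface CreateSessionBody {
  conversation_id?: string;
  create_new_conversation?: boolean;
  locale?: string;
}

const SUPPORTED_LOCALES = ["en-US", "ru-RU", "es-ES"];

// Employee/Ally session creation requires exactly one of
// conversation_id or create_new_conversation.
function hasValidConversationTarget(body: CreateSessionBody): boolean {
  const hasId = body.conversation_id !== undefined;
  const wantsNew = body.create_new_conversation === true;
  return hasId !== wantsNew; // exactly one must be set
}

// Locale falls back to en-US when missing or unsupported.
function resolveLocale(locale?: string): string {
  return locale !== undefined && SUPPORTED_LOCALES.includes(locale) ? locale : "en-US";
}
```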
Provider routing
- If the selected TTS model is realtime and the provider is Google:
  - `GoogleAIRealtimeAgent`
  - `GeminiLiveVoice`
- If the selected TTS model is realtime and the provider is not Google:
  - `OpenAIRealtimeAgent`
- If the TTS model is non-realtime but still TTS-capable:
  - `TtsAgent`
- OpenAI realtime STT is used when an STT model is configured.
- The websocket runtime rebuilds a `RequestContext` from `workflowRun.state.runtimeContext` and adds `organizationId`, `employeeId`, and `runId` before the session starts.
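The routing rules above can be condensed into a single decision function. This is a minimal sketch of the documented branching, assuming a simplified model descriptor; the actual selection logic in `RealtimeProxy` is more involved.

```typescript
interface TtsModelInfo {
  isRealtime: boolean;
  provider: string; // e.g. "google", "openai"
}

type RealtimeRuntime = "GoogleAIRealtimeAgent" | "OpenAIRealtimeAgent" | "TtsAgent";

// Documented routing: realtime + Google -> Gemini Live via GoogleAIRealtimeAgent,
// realtime + any other provider -> OpenAIRealtimeAgent,
// non-realtime but TTS-capable -> TtsAgent.
function selectRuntime(model: TtsModelInfo): RealtimeRuntime {
  if (model.isRealtime) {
    return model.provider === "google" ? "GoogleAIRealtimeAgent" : "OpenAIRealtimeAgent";
  }
  return "TtsAgent";
}
```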
Audio and message flow
- Clients send binary microphone audio over the websocket.
- The websocket wire format expects mono signed `16-bit PCM` audio at `24kHz`.
- The internal frontend explicitly resamples to `24kHz` before sending.
- Clients can also send JSON messages:
  - `type: "user_message"`
  - `data.text`
  - optional `data.attachments`
- The server validates attachments on websocket `user_message` payloads before processing them.
- Attachment guardrails on websocket messages:
  - maximum 5 files
  - maximum 10 MB each
  - allowed MIME families: images, PDF, plain text, XML, and SQLite
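Preparing microphone audio for this wire format amounts to resampling to 24 kHz and converting float samples to signed 16-bit PCM. The sketch below uses naive linear-interpolation resampling purely for illustration; the internal frontend's actual resampler is not specified here.

```typescript
const TARGET_SAMPLE_RATE = 24_000;

// Naive linear-interpolation resampler (illustrative only).
function resampleLinear(input: Float32Array, fromRate: number, toRate: number): Float32Array {
  const outLength = Math.round((input.length * toRate) / fromRate);
  const out = new Float32Array(outLength);
  for (let i = 0; i < outLength; i++) {
    const pos = (i * fromRate) / toRate;
    const i0 = Math.floor(pos);
    const i1 = Math.min(i0 + 1, input.length - 1);
    const frac = pos - i0;
    out[i] = input[i0] * (1 - frac) + input[i1] * frac;
  }
  return out;
}

// Convert float samples in [-1, 1] to mono signed 16-bit PCM,
// the format the realtime websocket expects as binary frames.
function floatToPcm16(samples: Float32Array): Int16Array {
  const pcm = new Int16Array(samples.length);
  for (let i = 0; i < samples.length; i++) {
    const clamped = Math.max(-1, Math.min(1, samples[i]));
    pcm[i] = Math.round(clamped * 0x7fff);
  }
  return pcm;
}
```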
Server -> client events
- The realtime websocket currently sends JSON events such as:
  - `session.created`
  - `speech_started`
  - `speech_stopped`
  - `response.cancelled`
  - `error`
- Audio responses are streamed back as binary websocket frames.
- `TtsAgent` additionally sends text-sidecar events such as:
  - `text` with `role: user|assistant`
  - `speaking.done`
- Internal chat also listens for org websocket events from the TTS path:
  - `voice_thinking_started`
  - `voice_thinking_ended`
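A client needs to dispatch on these event types. The discriminated union below is a sketch built only from the event names listed above; the exact payload fields (beyond `type`) are assumptions.

```typescript
// Event names come from the docs; payload shapes are illustrative assumptions.
type RealtimeEvent =
  | { type: "session.created" }
  | { type: "speech_started" }
  | { type: "speech_stopped" }
  | { type: "response.cancelled" }
  | { type: "error"; message?: string }
  | { type: "text"; role: "user" | "assistant"; text?: string }
  | { type: "speaking.done" };

function describeEvent(event: RealtimeEvent): string {
  switch (event.type) {
    case "session.created":
      return "session ready";
    case "speech_started":
      return "user started speaking";
    case "speech_stopped":
      return "user stopped speaking";
    case "response.cancelled":
      return "assistant response cancelled";
    case "error":
      return `error: ${event.message ?? "unknown"}`;
    case "text":
      return `${event.role} transcript chunk`;
    case "speaking.done":
      return "assistant finished speaking";
  }
}
```

Binary frames (streamed audio) arrive separately and should be routed to playback before JSON parsing is attempted.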
Automation and takeover behavior
- Realtime agents refresh automation state from the conversation while the session is running.
- If automation is disabled on the conversation, the voice session stays connected and still saves user messages, but skips AI response generation.
- If the organization is out of tokens when voice starts, the server sets the conversation to `waiting_for_operator` and closes the realtime websocket with code `4008`.
- Native realtime paths also re-check token usage after each response and close with `4008` when the balance reaches zero.
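On the client side, close code `4008` should be treated differently from a normal disconnect, since the conversation has already been moved to `waiting_for_operator`. The helper below is an illustrative sketch; only the `4008` code itself comes from this document.

```typescript
// 4008 is the documented close code for an exhausted token balance.
const CLOSE_OUT_OF_TOKENS = 4008;

// When the server closes with 4008, the conversation is already in
// waiting_for_operator, so the UI should surface a human-handoff state
// instead of silently reconnecting.
function shouldOfferOperatorHandoff(closeCode: number): boolean {
  return closeCode === CLOSE_OUT_OF_TOKENS;
}
```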