Chatbot with Voice – Real-Time AI Conversations with Video & Voice Cloning
The client operated a customer support platform used by mid-market SaaS companies, with a text chatbot product that was losing usage to live agent escalation. Their data showed that more than half of all conversations ended with the user typing a variation of "talk to a human" within three turns. The board had given the product team two quarters to demonstrate a measurable lift in self-service resolution or the chatbot line would be sunset.
Service: Generative AI Solutions
Industries: Customer Support, EdTech, SaaS
Timeline: 12 weeks from kickoff to general availability
Team: 5 specialists
The full story
The user problem behind the metric was that the chatbot felt mechanical. Users could not tell when it had understood them, there was no sense of presence, and the interaction modality forced typing on mobile where most support traffic originated. Long answers buried critical steps in walls of text. Multilingual users got translated output that read as machine-translated and lost trust in the responses.
We proposed shifting the modality from text to voice and video, but only after solving the latency problem honestly. Round-trip time from user speech to synthesized response had to land under nine hundred milliseconds for a conversation to feel real, including STT, LLM inference, and voice synthesis. We architected a streaming pipeline that began TTS playback while the LLM was still generating, used a fine-tuned RVC voice per brand, and ran emotion classification on the live video feed to adjust response phrasing.
What shipped was a browser-based voice and video assistant embedded via a single script tag. The agent speaks in a brand-specific voice, reads micro-expressions from the user’s webcam (with explicit opt-in), and shifts to a calmer cadence when frustration is detected. Multilingual users get native synthesis rather than translated text, with twelve languages live at launch. Self-service resolution moved decisively and average session length expanded as users explored more topics in a single visit.
The text chatbot was losing the support funnel at turn three: users were either escalating or abandoning the session entirely.
More than half of conversations contained an explicit "agent" request within the first three turns, regardless of the topic or tenant.
Mobile users typed short, ambiguous queries because typing on a small keyboard discouraged the full context the model needed.
Translated responses for non-English users read as machine output, which suppressed trust and lifted escalation rates further.
Long procedural answers were rendered as walls of text, so users abandoned the conversation rather than scroll through them.
There was no signal back to the agent about user emotional state, so the tone of responses stayed flat through frustration spikes.
How we structured the engagement
Modality change first, then latency discipline, then brand voice, solved in that order so each gain could be measured cleanly.
- Phase 01 (Weeks 1-2): Discovery
Recorded forty real support sessions with consent, transcribed the points where users escalated, and built a taxonomy of frustration triggers. Benchmarked Azure, AssemblyAI, and Whisper for streaming STT latency on noisy mobile audio. Picked Whisper-streaming as the default for cost, with Azure for the premium tier.
- Phase 02 (Weeks 3-4): Architecture
Designed a four-stage streaming pipeline with overlapping execution: STT begins emitting partial transcripts at three hundred milliseconds, the LLM starts generation on the partial, and TTS begins playback on the first sentence boundary. Picked RVC for voice cloning because it ran on a single A10G per tenant.
- Phase 03 (Weeks 5-10): Build
Shipped the streaming pipeline, the WebRTC media layer, and the emotion classifier in parallel. The emotion classifier was a small CNN on facial landmarks rather than full frames, which kept inference under thirty milliseconds per frame. Voice cloning required a four-minute sample per brand and a six-hour training run.
- Phase 04 (Weeks 11-12): Launch
Rolled out to two design-partner tenants for two weeks, monitored escalation rate and audio quality complaints daily, then enabled the remaining waitlist. Shipped a fallback to text-only when bandwidth dropped below a threshold so users on poor connections still got a response.
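The text-only fallback from the launch phase can be sketched as a small mode selector. This is a minimal illustration, not the shipped logic: the source only says the fallback triggers when bandwidth drops below a threshold, so the `choose_mode` name, both threshold values, and the use of two thresholds (hysteresis) are assumptions.

```python
from enum import Enum


class Mode(Enum):
    VOICE_VIDEO = "voice_video"
    TEXT_ONLY = "text_only"


# Hypothetical thresholds in kilobits per second; the real values are not given.
DOWNGRADE_BELOW_KBPS = 150.0
UPGRADE_ABOVE_KBPS = 250.0


def choose_mode(current: Mode, measured_kbps: float) -> Mode:
    """Pick the session modality from the latest bandwidth estimate.

    Two thresholds (hysteresis) keep a connection that hovers around the
    cutoff from flipping modes on every measurement.
    """
    if current is Mode.VOICE_VIDEO and measured_kbps < DOWNGRADE_BELOW_KBPS:
        return Mode.TEXT_ONLY
    if current is Mode.TEXT_ONLY and measured_kbps > UPGRADE_ABOVE_KBPS:
        return Mode.VOICE_VIDEO
    return current
```

With these assumed values, a voice session measured at 100 kbps drops to text, and it returns to voice only once the estimate climbs above 250 kbps.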
What we built, component by component
- 01 WebRTC media gateway
Handles inbound audio and video streams from the browser, with adaptive bitrate and a fallback to text-only on poor connections.
- 02 Streaming STT
Whisper-streaming by default, or Azure for the premium tier, with partial transcripts emitted at three hundred milliseconds for early LLM start.
- 03 Dialogue orchestrator
Manages turn state, routes to the configured LLM, holds the per-tenant system prompt, and gates on safety filters before TTS.
- 04 RVC voice synthesizer
Per-tenant cloned voice that runs on a dedicated A10G GPU and streams audio frames as soon as the first sentence is generated.
- 05 Emotion classifier
Small CNN on facial landmarks from the video feed that emits a frustration score, which the orchestrator uses to adjust prompt tone.
- 06 Knowledge retriever
Per-tenant vector store seeded from the help center and past resolved tickets, with hybrid keyword plus dense retrieval.
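The knowledge retriever combines keyword and dense results, but the source does not say how the two rankings are merged. One common option is reciprocal rank fusion, sketched below as an assumption rather than the method actually used; the document ids and the `reciprocal_rank_fusion` helper are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k = 60 is a commonly used default.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score, breaking ties by id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists for one query against one tenant's store.
keyword_hits = ["reset-password", "billing-faq", "sso-setup"]
dense_hits = ["sso-setup", "reset-password", "api-keys"]
fused = reciprocal_rank_fusion([keyword_hits, dense_hits])
```

Documents that rank well in both lists rise to the top, which suits short, ambiguous mobile queries where either signal alone can miss.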
The browser opens a WebRTC session and begins streaming audio and video to the gateway. Whisper emits partial transcripts that the orchestrator feeds to the LLM, which begins generation immediately while retrieval runs in parallel. The first sentence of the LLM response is handed to RVC and the user hears speech before the model has finished thinking, with the emotion classifier shaping later turns.
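The overlap between generation and speech in that flow can be illustrated with a producer and a consumer joined by a queue. This sketch covers only the LLM-to-TTS handoff and leaves out STT and retrieval; `llm_sentences` and `tts_worker` are hypothetical stand-ins, and the sleep simulates generation time rather than measuring anything.

```python
import asyncio
from typing import AsyncIterator, List, Optional


async def llm_sentences() -> AsyncIterator[str]:
    """Stand-in for a streaming LLM: yields one finished sentence at a time."""
    for sentence in ("Open Settings.", "Choose Reset.", "You are done."):
        await asyncio.sleep(0.01)  # simulated generation time
        yield sentence


async def tts_worker(queue: "asyncio.Queue[Optional[str]]", events: List[str]) -> None:
    """Stand-in for the synthesizer: speaks sentences as they are queued."""
    while (sentence := await queue.get()) is not None:
        events.append(f"tts:{sentence}")


async def handle_turn() -> List[str]:
    """Run one turn and return the order in which events happened."""
    events: List[str] = []
    queue: "asyncio.Queue[Optional[str]]" = asyncio.Queue()
    # Start the consumer first so playback can begin on the first sentence.
    tts_task = asyncio.create_task(tts_worker(queue, events))
    async for sentence in llm_sentences():
        events.append(f"llm:{sentence}")
        await queue.put(sentence)
    await queue.put(None)  # sentinel: generation finished
    await tts_task
    return events
```

Running `asyncio.run(handle_turn())` returns an event log in which the first sentence is spoken before the last one has been generated, which is the property the pipeline depends on.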
The trade-offs we made and why
Used RVC over ElevenLabs for voice cloning
ElevenLabs would have produced marginally cleaner output, but the per-character pricing made the unit economics impossible at the expected session length. RVC ran on a single A10G per tenant and cost roughly a tenth as much per minute of synthesized speech.
Streamed TTS from the first sentence boundary
Waiting for the full LLM response before synthesis pushed total latency past one and a half seconds and the conversation felt dead. Beginning playback on the first sentence cut perceived latency in half at the cost of a small acoustic seam between sentences, which users did not notice in testing.
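Cutting the token stream at sentence boundaries might look like the generator below. It is a sketch under a simplifying assumption: a sentence ends at `.`, `!`, or `?` followed by whitespace, which will mis-split abbreviations. The `sentences_from_tokens` name is hypothetical and a production splitter would likely need more rules.

```python
import re
from typing import Iterable, Iterator

# A sentence ends at ., ! or ? followed by whitespace (simplified rule).
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as the token stream closes them."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _BOUNDARY.split(buffer)
        # Everything but the last part is a finished sentence; the last
        # part may still be growing, so it stays in the buffer.
        for sentence in parts[:-1]:
            yield sentence
        buffer = parts[-1]
    # Flush whatever remains when the stream ends.
    if buffer.strip():
        yield buffer.strip()
```

A sentence is released only once the whitespace after its final punctuation arrives, so the synthesizer never receives a fragment that the model later extends.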
Emotion classification from landmarks rather than full frames
Full-frame inference would have required a heavier model and pushed GPU cost per session above the support-tier price point. Landmarks captured the signal we needed for frustration detection without sending raw video off the user device.
Made the camera opt-in with a visible indicator
The emotion feature only worked with consent and we wanted users to trust the surface. A visible indicator and an explicit toggle made the trade-off legible. Adoption among consenting users was high enough that the feature still moved the metric.
What changed for the client
Interaction time: median session duration grew from one minute fifty seconds to six minutes thirty seconds across the design-partner cohort.
Speech clarity: mean opinion score (MOS) from a blind panel of one hundred listeners comparing cloned voices against the reference recording.
Round-trip latency: measured at p95 from end of user speech to start of audible response, for sessions on standard broadband.
Languages live: twelve native synthesis languages at launch, with per-tenant voice cloning available in English, Spanish, French, and German.
“The voice agent retained users we used to lose in three turns. The frustration detection is the part our support leads talk about most in QBRs.”
The tools behind the system
Built with a deliberate stack chosen for production reliability and operational velocity.
Lessons learned from the build
Latency is the product when the modality is voice. We made every architectural decision against a sub-second budget and the feel of the conversation followed. If we had treated latency as something to optimize after launch, the feature would have shipped feeling slow and never recovered the impression.
Voice cloning quality matters less than voice consistency. After the first round of user testing we stopped chasing marginal acoustic improvements and focused on making sure the voice never broke character across long sessions. Consistency is what builds trust in the agent.
Emotion classification needs an exit hatch. The model is good but not infallible and a mis-read user gets a worse experience than a flat-toned agent. We added a manual override and a confidence threshold below which the orchestrator simply ignores the signal.
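The exit hatch can be sketched as a gate in front of the tone decision. The cutoff values, the `EmotionReading` type, and the idea that the override arrives as a tone string are assumptions for illustration; the source states only that a manual override and a confidence threshold exist.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EmotionReading:
    frustration: float  # 0.0 (calm) to 1.0 (very frustrated)
    confidence: float   # classifier confidence in the reading, 0.0 to 1.0


# Hypothetical cutoffs; the source gives no numeric values.
MIN_CONFIDENCE = 0.7
FRUSTRATION_CUTOFF = 0.6


def select_tone(reading: Optional[EmotionReading],
                manual_override: Optional[str] = None) -> str:
    """Return the tone hint the orchestrator would add to the prompt."""
    if manual_override is not None:
        return manual_override  # the override always wins
    if reading is None or reading.confidence < MIN_CONFIDENCE:
        return "neutral"  # ignore a missing or low-confidence signal
    if reading.frustration >= FRUSTRATION_CUTOFF:
        return "calm"
    return "neutral"
```

Falling back to a neutral tone on low confidence reflects the lesson above: a flat response costs less than one shaped by a misread.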
