Generative AI Solutions — Case Study

Chatbot with Voice – Real-Time AI Conversations with Video & Voice Cloning

The client operated a customer support platform used by mid-market SaaS companies, with a text chatbot product that was losing usage to live agent escalation. Their data showed that more than half of all conversations ended with the user typing a variation of "talk to a human" within three turns. The board had given the product team two quarters to demonstrate a measurable lift in self-service resolution or the chatbot line would be sunset.

3.5x interaction time
95% speech clarity
<900 ms round-trip latency
12 languages live
Category: Generative AI Solutions

Industry: Customer Support, EdTech, SaaS

Timeline: 12 weeks from kickoff to general availability

Team size: 5 specialists

Project Overview

The full story

The user problem behind the metric was that the chatbot felt mechanical. Users could not tell when it had understood them, there was no sense of presence, and the interaction modality forced typing on mobile where most support traffic originated. Long answers buried critical steps in walls of text. Multilingual users got translated output that read as machine-translated and lost trust in the responses.

We proposed shifting the modality from text to voice and video, but only after solving the latency problem honestly. Round-trip time from user speech to synthesized response had to land under nine hundred milliseconds for a conversation to feel real, including STT, LLM inference, and voice synthesis. We architected a streaming pipeline that began TTS playback while the LLM was still generating, used a fine-tuned RVC voice per brand, and ran emotion classification on the live video feed to adjust response phrasing.

What shipped was a browser-based voice and video assistant embedded via a single script tag. The agent speaks in a brand-specific voice, reads micro-expressions from the user’s webcam (with explicit opt-in), and shifts to a calmer cadence when frustration is detected. Multilingual users get native synthesis rather than translated text, with twelve languages live at launch. Self-service resolution moved decisively and average session length expanded as users explored more topics in a single visit.

The Problem

The text chatbot was losing the support funnel at turn three — users were either escalating or churning out of the session entirely.

01 Friction point

More than half of conversations contained an explicit "agent" request within the first three turns, regardless of the topic or tenant.

02 Friction point

Mobile users typed short, ambiguous queries because typing on a small keyboard discouraged the full context the model needed.

03 Friction point

Translated responses for non-English users read as machine output, which suppressed trust and lifted escalation rates further.

04 Friction point

Long procedural answers were rendered as walls of text, so users abandoned the conversation rather than scroll through them.

05 Friction point

There was no signal back to the agent about user emotional state, so the tone of responses stayed flat through frustration spikes.

Our Approach

How we structured the engagement

Modality change first, then latency discipline, then brand voice — solved in that order so each gain could be measured cleanly.

  1. Phase 01, Weeks 1-2

    Discovery

    Recorded forty real support sessions with consent, transcribed the points where users escalated, and built a taxonomy of frustration triggers. Benchmarked Azure, AssemblyAI, and Whisper for streaming STT latency on noisy mobile audio. Picked Whisper-streaming for cost, with Azure reserved for the premium tier.

  2. Phase 02, Weeks 3-4

    Architecture

    Designed a four-stage streaming pipeline with overlapping execution: STT begins emitting partial transcripts at three hundred milliseconds, the LLM starts generation on the partial, and TTS begins playback on the first sentence boundary. Picked RVC for voice cloning because it ran on a single A10G per tenant.

  3. Phase 03, Weeks 5-10

    Build

    Shipped the streaming pipeline, the WebRTC media layer, and the emotion classifier in parallel. The emotion classifier was a small CNN on facial landmarks rather than full-frame, which kept inference under thirty milliseconds per frame. Voice cloning required a four-minute sample per brand and a six-hour training run.

  4. Phase 04, Weeks 11-12

    Launch

    Rolled out to two design-partner tenants for two weeks, monitored escalation rate and audio quality complaints daily, then enabled the remaining waitlist. Shipped a fallback to text-only when bandwidth dropped below a threshold so users on poor connections still got a response.
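The text-only fallback described in the launch phase can be sketched as a small modality selector. This is an illustration under stated assumptions, not the shipped logic: the write-up does not give the bandwidth threshold, so `FALLBACK_KBPS`, `RECOVER_KBPS`, and the `ModalitySelector` class are hypothetical, and the second (recovery) threshold is an added assumption to keep the mode from flapping.

```python
from dataclasses import dataclass

# Hypothetical values: the case study does not state the real threshold.
FALLBACK_KBPS = 150.0   # drop to text-only below this bandwidth estimate
RECOVER_KBPS = 250.0    # return to voice only once bandwidth clearly recovers


@dataclass
class ModalitySelector:
    """Chooses 'voice' or 'text' from a running bandwidth estimate."""
    mode: str = "voice"

    def update(self, estimated_kbps: float) -> str:
        # Hysteresis: separate down/up thresholds avoid flapping between
        # modes when the estimate hovers around a single cutoff.
        if self.mode == "voice" and estimated_kbps < FALLBACK_KBPS:
            self.mode = "text"
        elif self.mode == "text" and estimated_kbps >= RECOVER_KBPS:
            self.mode = "voice"
        return self.mode


selector = ModalitySelector()
modes = [selector.update(kbps) for kbps in (400, 120, 200, 260)]
print(modes)  # ['voice', 'text', 'text', 'voice']
```

In this sketch the session stays in text mode at 200 kbps even though that is above the fallback cutoff, and only returns to voice at 260 kbps.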

System Architecture

What we built, component by component

  1. 01

    WebRTC media gateway

    Handles inbound audio and video streams from the browser, with adaptive bitrate and a fallback to text-only on poor connections.

  2. 02

    Streaming STT

    Whisper-streaming by default, with Azure for the premium tier; partial transcripts are emitted at three hundred milliseconds so the LLM can start early.

  3. 03

    Dialogue orchestrator

    Manages turn state, routes to the configured LLM, holds the per-tenant system prompt, and gates on safety filters before TTS.

  4. 04

    RVC voice synthesizer

    Per-tenant cloned voice, runs on a dedicated A10G GPU, streams audio frames as soon as the first sentence is generated.

  5. 05

    Emotion classifier

    Small CNN on facial landmarks from the video feed, emits a frustration score that the orchestrator uses to adjust prompt tone.

  6. 06

    Knowledge retriever

    Per-tenant vector store seeded from the help center and past resolved tickets, with hybrid keyword plus dense retrieval.
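The knowledge retriever combines keyword and dense retrieval, but the write-up does not say how the two result lists are merged. One common option is reciprocal rank fusion, sketched below as an assumption rather than the method actually used. The function name, the document ids, and the constant `k = 60` are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked lists of document ids; higher fused score ranks first."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every id it contains.
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score descending, then by id for a deterministic order on ties.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical help-center article ids returned by each retriever.
keyword_hits = ["reset-password", "billing-faq", "sso-setup"]
dense_hits = ["sso-setup", "reset-password", "api-keys"]
fused = reciprocal_rank_fusion([keyword_hits, dense_hits])
print(fused)  # ['reset-password', 'sso-setup', 'billing-faq', 'api-keys']
```

Documents that appear in both lists rise to the top, which is the behavior a hybrid retriever is usually after. Rank-based fusion also avoids having to calibrate keyword scores against embedding similarities.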

Data Flow

The browser opens a WebRTC session and begins streaming audio and video to the gateway. Whisper emits partial transcripts that the orchestrator feeds to the LLM, which begins generation immediately while retrieval runs in parallel. The first sentence of the LLM response is handed to RVC and the user hears speech before the model has finished thinking, with the emotion classifier shaping later turns.
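The overlap between generation and synthesis can be sketched with two concurrent stages joined by a queue. This is a toy model under assumptions: `llm_stage` and `tts_stage` are hypothetical stand-ins that only log events, the sentences and delays are invented, and the real gateway, STT, and retrieval stages are omitted.

```python
import asyncio

END = None  # sentinel marking the end of the sentence stream


async def llm_stage(sentences_out: asyncio.Queue, log: list[str]) -> None:
    # Stand-in for a streaming LLM: emits one sentence at a time.
    for sentence in ("Open Settings.", "Choose Billing.", "Click Update card."):
        await asyncio.sleep(0.01)  # simulated generation time per sentence
        log.append(f"llm:{sentence}")
        await sentences_out.put(sentence)
    log.append("llm:done")
    await sentences_out.put(END)


async def tts_stage(sentences_in: asyncio.Queue, log: list[str]) -> None:
    # Stand-in for the synthesizer: starts on the first sentence it receives.
    while (sentence := await sentences_in.get()) is not END:
        log.append(f"tts:{sentence}")


async def run_turn() -> list[str]:
    log: list[str] = []
    queue: asyncio.Queue = asyncio.Queue()
    # Both stages run concurrently, so synthesis does not wait for the
    # full response to be generated.
    await asyncio.gather(llm_stage(queue, log), tts_stage(queue, log))
    return log


events = asyncio.run(run_turn())
print(events)
```

In the resulting log, the first `tts:` event appears before `llm:done`, which is the property the data flow relies on: playback begins while later sentences are still being generated.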

Key Decisions

The trade-offs we made and why

Decision 01 (lead trade-off)

Used RVC over ElevenLabs for voice cloning

ElevenLabs would have produced marginally cleaner output but the per-character pricing made the unit economics impossible at the expected session length. RVC ran on a single A10G per tenant and cost roughly a tenth per minute of synthesized speech.

Decision 02

Streamed TTS from the first sentence boundary

Waiting for the full LLM response before synthesis pushed total latency past one and a half seconds and the conversation felt dead. Beginning playback on the first sentence cut perceived latency in half at the cost of a small acoustic seam between sentences, which users did not notice in testing.
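Detecting the first sentence boundary in a token stream can be sketched as a small generator. The boundary rule here (a period, exclamation mark, or question mark followed by whitespace) is an assumption, as is the function name. A production splitter would also need to handle abbreviations, decimals, and languages with other punctuation.

```python
import re
from typing import Iterable, Iterator

# Assumed boundary rule: ., ! or ? followed by whitespace.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield each sentence once its closing punctuation is followed by whitespace."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _BOUNDARY.split(buffer)
        # Every part except the last is a complete sentence.
        for sentence in parts[:-1]:
            yield sentence
        buffer = parts[-1]
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains at end of stream


stream = ["Open ", "Settings. ", "Then choose", " Billing.", " Done!"]
chunks = list(sentences_from_tokens(stream))
print(chunks)  # ['Open Settings.', 'Then choose Billing.', 'Done!']
```

Each yielded sentence could be handed to the synthesizer immediately, so time to first audio depends on the length of the first sentence rather than the whole response.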

Decision 03

Emotion classification from landmarks rather than full frames

Full-frame inference would have required a heavier model and pushed GPU cost per session above the support-tier price point. Landmarks captured the signal we needed for frustration detection without sending raw video off the user device.

Decision 04

Made the camera opt-in with a visible indicator

The emotion feature only worked with consent and we wanted users to trust the surface. A visible indicator and an explicit toggle made the trade-off legible. Adoption among consenting users was high enough that the feature still moved the metric.

Outcomes

What changed for the client

3.5x interaction time

Median session duration grew from one minute fifty seconds to six minutes thirty seconds across the design-partner cohort.

95% speech clarity

Mean opinion score (MOS) from a blind panel of one hundred listeners who compared cloned voices against the reference recording.

<900 ms round-trip latency

Measured from the end of user speech to the start of audible response, at the 95th percentile (p95) of sessions on standard broadband.

12 languages live

Native synthesis languages at launch, with per-tenant voice cloning available in English, Spanish, French, and German.
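For the round-trip latency figure, a p95 check against the nine hundred millisecond budget can be sketched as follows. The nearest-rank method is one of several percentile definitions and is an assumption here. The sample latencies are invented for illustration and are not the client's data.

```python
import math


def percentile_nearest_rank(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical per-session latencies in milliseconds (not the client's data).
latencies_ms = [610, 640, 655, 670, 690, 700, 720, 745, 760, 780,
                790, 800, 810, 820, 830, 840, 850, 860, 880, 1240]
p95 = percentile_nearest_rank(latencies_ms, 95)
print(p95, p95 < 900)  # 880 True
```

A p95 target tolerates a slow tail: the single 1240 ms session in this sample does not break the budget, whereas a maximum-latency target would.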

In their words
"The voice agent retained users we used to lose in three turns. The frustration detection is the part our support leads talk about most in QBRs."
VP of Product, customer support SaaS platform
Tech Stack

The tools behind the system

Built with a deliberate stack chosen for production reliability and operational velocity.

6 components, production-grade
Python, FastAPI, Azure AI Studio, Gemini API, RVC Voice Cloning, AWS
What we’d carry forward

Lessons learned from the build

01Lesson

Latency is the product when the modality is voice. We made every architectural decision against a sub-second budget and the feel of the conversation followed. If we had treated latency as something to optimize after launch, the feature would have shipped feeling slow and never recovered the impression.

02Lesson

Voice cloning quality matters less than voice consistency. After the first round of user testing we stopped chasing marginal acoustic improvements and focused on making sure the voice never broke character across long sessions. Consistency is what builds trust in the agent.

03Lesson

Emotion classification needs an exit hatch. The model is good but not infallible and a mis-read user gets a worse experience than a flat-toned agent. We added a manual override and a confidence threshold below which the orchestrator simply ignores the signal.
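The exit hatch in the third lesson can be sketched as a gate in front of the tone adjustment. The cutoff values, the `EmotionReading` type, and the tone labels are hypothetical. The write-up states only that a manual override and a confidence threshold exist, not how they are set or who triggers the override.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical cutoffs; the case study does not publish the real values.
MIN_CONFIDENCE = 0.7
FRUSTRATION_CUTOFF = 0.6


@dataclass
class EmotionReading:
    frustration: float  # 0.0 (calm) to 1.0 (very frustrated)
    confidence: float   # classifier confidence in this reading


def select_tone(reading: Optional[EmotionReading], override: Optional[str] = None) -> str:
    """Return the tone hint the orchestrator would apply to the next response."""
    if override is not None:
        return override      # manual override always wins
    if reading is None or reading.confidence < MIN_CONFIDENCE:
        return "neutral"     # ignore a missing or low-confidence signal
    return "calm" if reading.frustration >= FRUSTRATION_CUTOFF else "neutral"


print(select_tone(EmotionReading(frustration=0.9, confidence=0.95)))  # calm
print(select_tone(EmotionReading(frustration=0.9, confidence=0.40)))  # neutral
```

Defaulting to the neutral tone when the signal is missing or uncertain matches the lesson's point that a misread user is worse off than one who gets a flat-toned agent.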

