Generative AI Solutions — Case Study

VoiceAI Application for Advanced Voice Synthesis and Cloning

The client was a mid-sized media production house that ran a steady volume of explainer videos, training content, and brand-voice ad spots. Their cost structure was dominated by talent booking and studio time, which made tight-turnaround projects unprofitable. A single revision often required rebooking the original voice talent, and licensing celebrity voices for higher-tier clients was either impossible or budget-breaking on every project.

-85% voice-over cost
95% clone accuracy
<200 ms STS latency
4 min sample for cloning
Category

Generative AI Solutions

Industry

Media, Virtual Assistants

Timeline

14 weeks from kickoff to studio rollout

Team size

4 specialists

Project Overview

The full story

The practical problem broke into three pieces. Standard text-to-speech sounded too synthetic for client-facing work. Speech-to-speech conversion existed in research code but had not been packaged for a production team that needed predictable output across thousands of utterances. And the legal framework around celebrity voice cloning required an audit trail that no off-the-shelf tool provided, which kept the studio from offering the feature even where talent had signed off.

We built a single workstation product that handled three modalities — TTS, STS, and per-talent voice cloning — behind a consistent project structure. Each project carried a license artifact and a usage ledger so the studio could prove compliance per minute of generated audio. Emotion and tone controls were exposed as continuous sliders rather than discrete labels, which gave directors something that felt closer to giving notes to a voice actor.

What shipped was a desktop-class web application where a producer uploads a script, picks a voice from the licensed library or clones a new one from a four-minute sample, and tunes emotion line by line before exporting. Real-time STS lets a director read a line in their own voice and have it re-rendered in the target voice with under 200 ms of round-trip latency, preserving timing and inflection. Every generated minute is logged with the source license, talent identifier, and operator for downstream compliance review.

The Problem

Studio economics broke on revisions and licensing — rebooking talent for ten seconds of new copy was the default failure mode.

01 Friction point

Each minor script revision required a re-booking, a studio session, and engineer time, turning small changes into multi-day delays.

02 Friction point

Off-the-shelf TTS sounded synthetic enough to lose pitches with brand-conscious clients, especially on emotional reads.

03 Friction point

Speech-to-speech existed in research papers but no productized tool offered the consistency a daily workflow demanded.

04 Friction point

Celebrity voice usage had no audit surface, so the studio could not safely sell the feature even where rights were cleared.

05 Friction point

Tone direction was discrete in existing tools — happy, sad, neutral — which did not map to how directors actually give notes.

Our Approach

How we structured the engagement

We treated this as a media-production tool, not a model demo, so workflow consistency and license auditability drove the design.

  1. Phase 01, Weeks 1-2

    Discovery

    Spent a week embedded with the studio across two active productions. Cataloged every place a revision triggered a re-booking and every license artifact that already existed in their files. Output: a project schema that mirrored studio session structure and a compliance ledger requirement.

  2. Phase 02, Weeks 3-4

    Architecture

    Designed three engines behind a common project layer: an RVC-based clone engine, a speech-to-speech engine sharing the same voice embeddings, and a TTS engine fine-tuned per voice. All three wrote to the same export pipeline and logged to the same compliance ledger.

  3. Phase 03, Weeks 5-12

    Build

    Shipped the cloning pipeline first, then TTS, then STS. Built a slider-based emotion control on top of a continuous embedding rather than discrete tags. Implemented the license ledger as an append-only Postgres table with per-utterance hash chaining so audits could prove no record had been altered.

  4. Phase 04, Weeks 13-14

    Launch

    Rolled out to the in-house production team first, ran two paid client projects entirely on the platform, and only then opened access to the studio’s external partners. Shipped a real-time STS booth mode in week thirteen after directors asked for live read-throughs.

System Architecture

What we built, component by component

  1. 01

    Voice library

    Per-talent voice embeddings, license artifact, and consent records, with row-level scoping per studio operator.

  2. 02

    TTS engine

    Fine-tuned per voice with prosody conditioning, exposes continuous emotion and pace controls per line of dialogue.

  3. 03

    STS engine

    Speech-to-speech that preserves the source timing and inflection while substituting the target voice embedding.

  4. 04

    Clone trainer

    RVC-based pipeline that takes a four-minute clean sample and produces a voice embedding plus a fine-tuned head.

  5. 05

    Project workspace

    Script-aware workspace with per-line emotion sliders, takes management, and exportable session bundles.

  6. 06

    Compliance ledger

    Append-only Postgres log with hash-chained entries, recording every generated minute and its license provenance.
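The hash chaining behind the compliance ledger can be illustrated with a minimal in-memory sketch. This is not the studio's Postgres schema: the field names, the SHA-256 choice, the all-zero genesis hash, and the helper names `append_entry` and `verify_chain` are all assumptions made for illustration. Each entry's hash covers its own fields plus the previous entry's hash, so editing any earlier record breaks every link after it.

```python
import hashlib
import json
from dataclasses import asdict, dataclass

GENESIS_HASH = "0" * 64  # assumed sentinel used as prev_hash for the first entry


@dataclass(frozen=True)
class LedgerEntry:
    # Hypothetical field set, loosely based on what the case study says is logged.
    utterance_id: str
    license_id: str
    talent_id: str
    operator: str
    seconds: float
    prev_hash: str
    entry_hash: str


def _digest(fields: dict, prev_hash: str) -> str:
    # Canonical JSON (sorted keys) so identical fields always hash identically.
    payload = json.dumps(fields, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_entry(ledger: list, **fields) -> LedgerEntry:
    """Append a new entry whose hash is chained to the previous entry."""
    prev_hash = ledger[-1].entry_hash if ledger else GENESIS_HASH
    entry = LedgerEntry(
        prev_hash=prev_hash, entry_hash=_digest(fields, prev_hash), **fields
    )
    ledger.append(entry)
    return entry


def verify_chain(ledger: list) -> bool:
    """Recompute every hash from the start; any altered record returns False."""
    prev_hash = GENESIS_HASH
    for entry in ledger:
        fields = asdict(entry)
        stored_prev = fields.pop("prev_hash")
        stored_hash = fields.pop("entry_hash")
        if stored_prev != prev_hash or stored_hash != _digest(fields, prev_hash):
            return False
        prev_hash = stored_hash
    return True
```

In a database-backed version, the same check would run over rows read in insertion order, and append-only behavior would have to be enforced separately (for example through permissions), which this sketch does not cover.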

Data Flow

A producer creates a project bound to a licensed voice, types or imports a script, and tunes emotion line by line. The TTS or STS engine generates audio per take, the clone trainer runs only when a new voice is being added, and every export writes a hash-chained entry to the compliance ledger before the audio file leaves the workspace.
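The ordering in the last step of this flow (ledger entry first, file release second) can be sketched as follows. The function name `export_take`, the `record_usage` callback, and the `LedgerWriteError` exception are hypothetical names introduced here, not part of the shipped product. The point of the sketch is only that a failed ledger write blocks the export.

```python
from pathlib import Path
from typing import Callable


class LedgerWriteError(RuntimeError):
    """Raised when the usage record could not be written (hypothetical)."""


def export_take(
    audio: bytes,
    out_path: Path,
    record_usage: Callable[[str, int], None],
) -> Path:
    """Record usage first; write the audio file only if recording succeeded.

    `record_usage` stands in for whatever writes the ledger entry. Here it
    receives the file name and byte count, which is an assumption.
    """
    try:
        record_usage(out_path.name, len(audio))
    except Exception as exc:
        # No ledger entry means no file leaves the workspace.
        raise LedgerWriteError(f"export of {out_path.name} blocked") from exc
    out_path.write_bytes(audio)
    return out_path
```

A real implementation would also need to decide what happens if the file write fails after the ledger entry is committed; this sketch leaves that case open.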

[Data flow diagram: Voice library, TTS engine, STS engine, Clone trainer, Project workspace]

Key Decisions

The trade-offs we made and why

Decision 01 (Lead trade-off)

Built one project layer over three engines

Treating TTS, STS, and cloning as separate products would have fragmented the workflow. A single project that can switch modality mid-session matches how directors actually move between approaches when a line is not landing the way they want it.
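One way to picture a single project layer over several engines is a shared interface that the project dispatches to by modality. The sketch below is an assumption about the shape of that layer, not the actual codebase: `Engine`, `Project`, `render_take`, and the two stub engines are hypothetical names, and the stubs return placeholder bytes instead of audio.

```python
from dataclasses import dataclass, field
from typing import Protocol


class Engine(Protocol):
    """Anything that can render audio for a voice from some source input."""

    def render(self, voice_id: str, source: str) -> bytes: ...


class StubTTS:
    def render(self, voice_id: str, source: str) -> bytes:
        # Placeholder: a real engine would synthesize speech from script text.
        return f"tts:{voice_id}:{source}".encode("utf-8")


class StubSTS:
    def render(self, voice_id: str, source: str) -> bytes:
        # Placeholder: a real engine would convert a recorded performance.
        return f"sts:{voice_id}:{source}".encode("utf-8")


@dataclass
class Project:
    voice_id: str
    engines: dict[str, Engine]
    takes: list[tuple[str, bytes]] = field(default_factory=list)

    def render_take(self, modality: str, source: str) -> bytes:
        """Render one take with the chosen modality and keep it in the session."""
        engine = self.engines[modality]  # KeyError if the modality is not registered
        audio = engine.render(self.voice_id, source)
        self.takes.append((modality, audio))
        return audio
```

Because takes from every modality land in the same list, switching from TTS to STS mid-session is just a different `modality` argument on the next call.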

Decision 02

Used continuous emotion sliders over discrete tags

Discrete tags forced directors to pick between happy and sad when they wanted a touch warmer. Continuous sliders mapped to a learned latent space gave the same kind of dial you would give an actor in a booth, which is the existing mental model.
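A rough sketch of how slider values might map onto a continuous embedding: start from a neutral vector and add a scaled direction per slider. The case study does not describe the actual latent space, so the linear blend, the slider range of -1 to 1, and the names `blend_emotion`, `neutral`, and `directions` are assumptions for illustration only.

```python
def blend_emotion(
    neutral: list[float],
    directions: dict[str, list[float]],
    sliders: dict[str, float],
) -> list[float]:
    """Return neutral + sum(slider * direction) over all active sliders.

    Each slider is assumed to lie in [-1.0, 1.0]; 0.0 leaves the read neutral.
    """
    blended = list(neutral)
    for name, amount in sliders.items():
        if not -1.0 <= amount <= 1.0:
            raise ValueError(f"slider {name!r} out of range: {amount}")
        direction = directions[name]
        if len(direction) != len(neutral):
            raise ValueError(f"direction {name!r} has the wrong dimension")
        for i, component in enumerate(direction):
            blended[i] += amount * component
    return blended
```

With this shape, "a touch warmer" is a small positive value on a warmth slider rather than a jump to a different discrete tag.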

Decision 03

Made the compliance ledger append-only and hash-chained

Auditability is a precondition for selling celebrity voice work to enterprise buyers. A chained ledger makes tampering provable rather than just discouraged, which gave the studio a defensible story when their legal teams reviewed the contract terms.

Decision 04

Shipped to the in-house team before external partners

Production workflows have edge cases the spec never captures. Having the in-house team run two paid client projects on the platform surfaced a dozen friction points around takes, retakes, and export naming that we would not have caught in user testing.

Outcomes

What changed for the client

-85% voice-over cost

Change in per-minute production cost on internal projects after rollout, including platform time amortized across the cohort.

95% clone accuracy

Speaker verification score against original talent reference clips, averaged across the licensed voice library.

<200 ms STS latency

Round-trip from operator microphone to rendered target voice in real-time booth mode for live read-throughs.

4 min sample for cloning

Clean reference audio required to produce a production-quality voice embedding plus fine-tuned synthesis head.
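The clone accuracy figure is described as a speaker verification score averaged across the library. The case study does not say which verification model or metric was used, so the sketch below assumes the common approach of cosine similarity between speaker embeddings of a reference clip and a cloned clip. `cosine_similarity` and `library_score` are hypothetical helper names, and the embeddings are plain lists of floats.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("zero-length embedding")
    return dot / norm


def library_score(pairs: list[tuple[list[float], list[float]]]) -> float:
    """Average similarity over (reference, clone) embedding pairs."""
    if not pairs:
        raise ValueError("no voices to score")
    scores = [cosine_similarity(ref, clone) for ref, clone in pairs]
    return sum(scores) / len(scores)
```

How a raw similarity is turned into a percentage like the one reported would depend on the verification system's calibration, which is not covered here.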

In their words
We stopped rebooking talent for revisions and started accepting client briefs we used to decline on margin. The ledger is what got our legal team onside.
Executive Producer, mid-market media production studio
Tech Stack

The tools behind the system

Built with a deliberate stack chosen for production reliability and operational velocity.

5 components, production-grade
Python, RVC, TensorFlow, PyTorch, Docker
What we’d carry forward

Lessons learned from the build

01 Lesson

Workflow shape matters more than model fidelity in production tools. The studio kept us on the project precisely because the workspace mirrored their existing session structure. A more accurate model in a worse workspace would not have shipped client work.

02 Lesson

Compliance was a feature, not overhead. Once the ledger existed, the studio leaned on it as a sales tool with enterprise clients. We would build the ledger and the legal-facing reports earlier next time and treat them as first-class product surface.

03 Lesson

Real-time STS surprised us as the most-used feature. Directors used it as a live coaching tool, reading lines in their own voice to demonstrate intent. We had scoped it as a secondary feature and the production team treated it as primary within two weeks of launch.

