VoiceAI Application for Advanced Voice Synthesis and Cloning
The client was a mid-sized media production house that ran a steady volume of explainer videos, training content, and brand-voice ad spots. Their cost structure was dominated by talent booking and studio time, which made tight-turnaround projects unprofitable. A single revision often required rebooking the original voice talent, and licensing celebrity voices for higher-tier clients was either impossible or budget-breaking on every project.
Generative AI Solutions
Media, Virtual Assistants
14 weeks from kickoff to studio rollout
4 specialists
The full story
The practical problem broke into three pieces. Standard text-to-speech sounded too synthetic for client-facing work. Speech-to-speech conversion existed in research code but had not been packaged for a production team that needed predictable output across thousands of utterances. And the legal frame around celebrity voice cloning required an audit trail that no off-the-shelf tool provided, which kept the studio from offering the feature even where talent had signed off.
We built a single workstation product that handled three modalities — TTS, STS, and per-talent voice cloning — behind a consistent project structure. Each project carried a license artifact and a usage ledger so the studio could prove compliance per minute of generated audio. Emotion and tone controls were exposed as continuous sliders rather than discrete labels, which gave directors something that felt closer to giving notes to a voice actor.
What shipped was a desktop-class web application where a producer uploads a script, picks a voice from the licensed library or clones a new one from a four-minute sample, and tunes emotion line by line before exporting. Real-time STS lets a director read a line in their own voice and have it re-rendered in the target voice instantly, preserving timing and inflection. Every generated minute is logged with the source license, talent identifier, and operator for downstream compliance review.
Studio economics broke on revisions and licensing — rebooking talent for ten seconds of new copy was the default failure mode.
Each minor script revision required a re-booking, a studio session, and engineer time, turning small changes into multi-day delays.
Off-the-shelf TTS sounded synthetic enough to lose pitches with brand-conscious clients, especially on emotional reads.
Speech-to-speech existed in research papers but no productized tool offered the consistency a daily workflow demanded.
Celebrity voice usage had no audit surface, so the studio could not safely sell the feature even where rights were cleared.
Tone direction was discrete in existing tools — happy, sad, neutral — which did not map to how directors actually give notes.
How we structured the engagement
Treated this as a media-production tool, not a model demo, so workflow consistency and license auditability drove the design.
- 01 Phase 01, Weeks 1-2
Discovery
Spent a week embedded with the studio across two active productions. Cataloged every place a revision triggered a re-booking and every license artifact that already existed in their files. Output: a project schema that mirrored studio session structure and a compliance ledger requirement.
- 02 Phase 02, Weeks 3-4
Architecture
Designed three engines behind a common project layer: a clone engine based on RVC (Retrieval-based Voice Conversion), a speech-to-speech engine sharing the same voice embeddings, and a TTS engine fine-tuned per voice. All three wrote to the same export pipeline and logged to the same compliance ledger.
- 03 Phase 03, Weeks 5-12
Build
Shipped the cloning pipeline first, then TTS, then STS. Built a slider-based emotion control on top of a continuous embedding rather than discrete tags. Implemented the license ledger as an append-only Postgres table with per-utterance hash chaining so audits could prove no record had been altered.
- 04 Phase 04, Weeks 13-14
Launch
Rolled out to the in-house production team first, ran two paid client projects entirely on the platform, and only then opened access to the studio’s external partners. Shipped a real-time STS booth mode in week thirteen after directors asked for live read-throughs.
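The hash-chained ledger from the Build phase can be illustrated with a small in-memory sketch. This is not the production Postgres implementation: the field names, the `GENESIS_HASH` sentinel, the choice of SHA-256, and the helper functions are all assumptions made for illustration. The idea it shows is that each entry's hash covers both its own fields and the previous entry's hash, so altering any earlier record breaks every later link.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import List

GENESIS_HASH = "0" * 64  # assumed placeholder "previous hash" for the first entry
_CHAIN_FIELDS = ("prev_hash", "entry_hash")


@dataclass(frozen=True)
class LedgerEntry:
    # Hypothetical per-utterance record; the real schema is not given in the case study.
    utterance_id: str
    license_id: str
    talent_id: str
    operator: str
    seconds: float
    prev_hash: str
    entry_hash: str


def _digest(fields: dict, prev_hash: str) -> str:
    # Canonical JSON (sorted keys, fixed separators) so identical fields hash identically.
    body = json.dumps(fields, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append_entry(ledger: List[LedgerEntry], **fields) -> LedgerEntry:
    # Link the new entry to the hash of the current tail (or the genesis sentinel).
    prev_hash = ledger[-1].entry_hash if ledger else GENESIS_HASH
    entry = LedgerEntry(prev_hash=prev_hash,
                        entry_hash=_digest(fields, prev_hash),
                        **fields)
    ledger.append(entry)
    return entry


def verify_chain(ledger: List[LedgerEntry]) -> bool:
    # Recompute every hash from the start; any edited record or broken link fails.
    prev_hash = GENESIS_HASH
    for entry in ledger:
        fields = {k: v for k, v in asdict(entry).items() if k not in _CHAIN_FIELDS}
        if entry.prev_hash != prev_hash:
            return False
        if entry.entry_hash != _digest(fields, prev_hash):
            return False
        prev_hash = entry.entry_hash
    return True
```

In a database-backed version, the same check would run over rows read in insertion order, and append-only behavior would be enforced by the database (for example, by withholding update and delete privileges on the table) rather than by application code.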
What we built, component by component
- 01
Voice library
Per-talent voice embeddings, license artifact, and consent records, with row-level scoping per studio operator.
- 02
TTS engine
Fine-tuned per voice with prosody conditioning, exposes continuous emotion and pace controls per line of dialogue.
- 03
STS engine
Speech-to-speech that preserves the source timing and inflection while substituting the target voice embedding.
- 04
Clone trainer
RVC-based pipeline that takes a four-minute clean sample and produces a voice embedding plus a fine-tuned head.
- 05
Project workspace
Script-aware workspace with per-line emotion sliders, takes management, and exportable session bundles.
- 06
Compliance ledger
Append-only Postgres log with hash-chained entries, recording every generated minute and its license provenance.
A producer creates a project bound to a licensed voice, types or imports a script, and tunes emotion line by line. The TTS or STS engine generates audio per take, the clone trainer runs only when a new voice is being added, and every export writes a hash-chained entry to the compliance ledger before the audio file leaves the workspace.
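The flow above can be sketched as a small data model plus two functions. Everything named here (`Project`, `Line`, `Modality`, `render_take`, `export_take`, and the callback signatures) is hypothetical and only meant to show the ordering constraint: a take is generated by whichever engine the operator picks, and the ledger write happens before the audio is released.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List


class Modality(Enum):
    TTS = "tts"
    STS = "sts"


@dataclass
class Line:
    text: str
    emotion: Dict[str, float] = field(default_factory=dict)  # per-line slider values
    takes: List[bytes] = field(default_factory=list)         # rendered audio per take


@dataclass
class Project:
    voice_id: str      # the licensed voice this project is bound to
    license_id: str
    lines: List[Line] = field(default_factory=list)


# Assumed engine interface: (voice_id, line) -> rendered audio bytes.
Engine = Callable[[str, Line], bytes]


def render_take(project: Project, line_index: int, modality: Modality,
                engines: Dict[Modality, Engine]) -> bytes:
    # Either engine can serve the same line, so modality can change mid-session.
    line = project.lines[line_index]
    audio = engines[modality](project.voice_id, line)
    line.takes.append(audio)
    return audio


def export_take(project: Project, line_index: int, take_index: int, operator: str,
                write_ledger: Callable[[dict], None],
                release: Callable[[bytes], None]) -> None:
    audio = project.lines[line_index].takes[take_index]
    # The ledger write comes first; if it raises, release() is never reached.
    write_ledger({
        "voice_id": project.voice_id,
        "license_id": project.license_id,
        "operator": operator,
        "line": line_index,
        "take": take_index,
    })
    release(audio)
```

Passing the ledger writer and the release step as callbacks keeps the ordering rule in one place, independent of how either is actually implemented.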
The trade-offs we made and why
Built one project layer over three engines
Treating TTS, STS, and cloning as separate products would have fragmented the workflow. A single project that can switch modality mid-session matches how directors actually move between approaches when a line is not landing the way they want it.
Used continuous emotion sliders over discrete tags
Discrete tags forced directors to pick between happy and sad when they wanted a touch warmer. Continuous sliders mapped to a learned latent space gave the same kind of dial you would give an actor in a booth, which is the existing mental model.
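One simple way to picture a continuous slider over a latent space is as a weighted offset from a neutral conditioning vector. The sketch below assumes each slider axis has a direction vector and that slider values are clamped to [-1, 1]; the linear mixing, the axis names, and the function `emotion_conditioning` are illustrative assumptions, not a description of how the shipped model conditions on emotion.

```python
from typing import Dict, List, Sequence


def emotion_conditioning(neutral: Sequence[float],
                         directions: Dict[str, Sequence[float]],
                         sliders: Dict[str, float]) -> List[float]:
    """Return neutral + sum(slider_value * axis_direction) over the given sliders."""
    out = list(neutral)
    for axis, value in sliders.items():
        if axis not in directions:
            raise KeyError(f"unknown slider axis: {axis}")
        direction = directions[axis]
        if len(direction) != len(out):
            raise ValueError(f"direction for {axis} has the wrong dimension")
        clamped = max(-1.0, min(1.0, value))  # keep sliders inside the assumed range
        for i, component in enumerate(direction):
            out[i] += clamped * component
    return out


# "A touch warmer" becomes a small positive value on one axis,
# rather than a jump from one discrete label to another.
vector = emotion_conditioning(
    neutral=[0.0, 0.0],
    directions={"warmth": [1.0, 0.0], "pace": [0.0, 2.0]},
    sliders={"warmth": 0.25, "pace": -0.5},
)
```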
Made the compliance ledger append-only and hash-chained
Auditability is a precondition for selling celebrity voice work to enterprise buyers. A chained ledger makes tampering provable rather than just discouraged, which gave the studio a defensible story when their legal teams reviewed the contract terms.
Shipped to the in-house team before external partners
Production workflows have edge cases the spec never captures. Running two client projects on our own platform surfaced a dozen friction points around takes, retakes, and export naming that we would not have caught in user testing.
What changed for the client
Voice-over cost: per-minute production cost on internal projects after rollout, including platform time amortized across the cohort.
Clone accuracy: speaker verification score against original talent reference clips, averaged across the licensed voice library.
STS latency: round-trip from operator microphone to rendered target voice in real-time booth mode for live read-throughs.
Sample for cloning: clean reference audio required to produce a production-quality voice embedding plus fine-tuned synthesis head.
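As an illustration of how a library-wide clone accuracy figure could be computed, the sketch below averages a per-voice similarity score. It assumes the score is the cosine similarity between a speaker embedding of the original reference clip and one of the cloned output, which is a common setup for speaker verification but is not stated in the case study; the embeddings themselves would come from a separate speaker-verification model that is out of scope here.

```python
import math
from typing import Dict, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def library_score(pairs: Dict[str, Tuple[Sequence[float], Sequence[float]]]) -> float:
    """Average similarity over (reference_embedding, clone_embedding) pairs keyed by voice."""
    if not pairs:
        raise ValueError("no voices to score")
    scores = [cosine_similarity(ref, clone) for ref, clone in pairs.values()]
    return sum(scores) / len(scores)
```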
“We stopped rebooking talent for revisions and started accepting client briefs we used to decline on margin. The ledger is what got our legal team onside.”
Lessons learned from the build
Workflow shape matters more than model fidelity in production tools. The studio kept us on the project precisely because the workspace mirrored their existing session structure. A more accurate model in a worse workspace would not have shipped client work.
Compliance was a feature, not overhead. Once the ledger existed, the studio leaned on it as a sales tool with enterprise clients. We would build the ledger and the legal-facing reports earlier next time and treat them as first-class product surface.
Real-time STS surprised us as the most-used feature. Directors used it as a live coaching tool, reading lines in their own voice to demonstrate intent. We had scoped it as a secondary feature and the production team treated it as primary within two weeks of launch.
