
AI Voice Generation Tool

Production-grade TTS pipeline with sub-200ms latency

Tech Stack

Python · FastAPI · Redis Streams · Kubernetes · Docker · PostgreSQL · S3 · OpenAI TTS · Coqui TTS

Overview

A cloud-native voice synthesis platform that converts text to natural-sounding speech using fine-tuned neural models, serving 50k+ daily requests.

The Problem

The team needed a flexible, latency-sensitive voice synthesis API capable of supporting multiple languages and custom voice cloning — all within strict SLA requirements for a B2B product.

The Solution

Built a streaming TTS microservice that wraps multiple model backends (OpenAI TTS, Coqui, ElevenLabs) behind a single unified API. Added model-level caching of synthesized audio, async request queuing via Redis Streams, and a voice-fingerprint registry for custom voice cloning. Dockerised the service for horizontal scaling on Kubernetes.
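The core of the unified API is a common backend interface plus a cache keyed on (backend, voice, text), so repeated requests never hit a model twice. The sketch below is illustrative, not the production code: `TTSBackend`, `StubBackend`, and `TTSService` are hypothetical names, and the stub stands in for the real OpenAI/Coqui/ElevenLabs adapters.

```python
import hashlib
from abc import ABC, abstractmethod


class TTSBackend(ABC):
    """Common interface every model backend adapter implements."""

    name: str

    @abstractmethod
    def synthesize(self, text: str, voice: str) -> bytes:
        """Return raw audio bytes for the given text and voice."""


class StubBackend(TTSBackend):
    """Stand-in for a real adapter (OpenAI TTS, Coqui, ElevenLabs)."""

    def __init__(self, name: str):
        self.name = name
        self.calls = 0  # counts real synthesis calls, i.e. cache misses

    def synthesize(self, text: str, voice: str) -> bytes:
        self.calls += 1
        return f"{self.name}:{voice}:{text}".encode()


class TTSService:
    """Routes requests to a named backend, caching audio per model+voice+text."""

    def __init__(self, backends: dict[str, TTSBackend]):
        self.backends = backends
        self.cache: dict[str, bytes] = {}

    def synthesize(self, backend: str, voice: str, text: str) -> bytes:
        # Hash the full request so the key is compact and collision-resistant.
        key = hashlib.sha256(f"{backend}|{voice}|{text}".encode()).hexdigest()
        if key not in self.cache:
            self.cache[key] = self.backends[backend].synthesize(text, voice)
        return self.cache[key]
```

In the real service the cache would live in Redis rather than a process-local dict, so all replicas share hits; the routing logic is the same either way.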

Architecture

Stateless FastAPI service → Redis Stream queue → Worker pool (GPU nodes) → S3 audio storage → signed CDN URLs. Each voice model runs in isolated containers with shared GPU scheduling. A Postgres-backed registry tracks voice profiles, usage, and per-tenant rate limits.
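The queue-to-worker leg of that pipeline can be sketched in miniature with stdlib primitives: a producer enqueues jobs (an `XADD` onto a Redis Stream in production), workers consume, synthesize, and persist the audio (an S3 `put_object` in production). Everything below is an in-memory stand-in with hypothetical names, not the deployed code.

```python
import queue
import threading


def enqueue_request(q: "queue.Queue[dict]", text: str, voice: str) -> None:
    """Producer side: in production this is an XADD onto a Redis Stream."""
    q.put({"text": text, "voice": voice})


def worker(q: "queue.Queue", storage: dict) -> None:
    """Consumer in the worker pool: synthesize, persist, repeat until sentinel."""
    while True:
        job = q.get()
        if job is None:  # sentinel: shut this worker down
            q.task_done()
            break
        # Fake synthesis; a real worker would run the GPU model here.
        audio = f"pcm({job['voice']}:{job['text']})".encode()
        # Deterministic object key; production uses S3 with signed CDN URLs.
        key = f"audio/{job['voice']}/{len(storage)}.wav"
        storage[key] = audio  # stand-in for s3.put_object(...)
        q.task_done()
```

The sentinel-and-join shutdown mirrors how the real workers drain their stream before a rolling Kubernetes deploy, which is what keeps scale-ups and deploys zero-downtime.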

Outcome & Impact

P95 latency reduced to 180ms. Scaled to 50k+ daily requests with zero downtime. Three enterprise clients onboarded within the first month of launch.