What Cartesia Does (And Why Latency Actually Matters)
Most people evaluate text-to-speech APIs by listening to a demo and asking 'does it sound human?' That is the wrong question for Cartesia. The right question is 'how fast does it start talking after I send text?'
I have been building voice agents since early 2025 — first with ElevenLabs, then Play.ht, then Deepgram — and the consistent problem across all of them was the dead air between turns. User asks a question. LLM generates a response. TTS API renders the audio. Total round trip: 1.5-3 seconds. In a phone conversation, 3 seconds feels like 30. Users hang up.
Cartesia solves this with their Sonic model. Time-to-first-audio is 80-120ms. The voice starts speaking while the LLM is still generating the rest of the response. The result is a conversation that flows — not perfectly, but close enough that users forget they are talking to a machine.
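If you want to check a time-to-first-audio figure against your own stack, time the gap between requesting speech and receiving the first audio chunk. Here is a minimal sketch, assuming your TTS client hands back audio as a Python iterable of byte chunks. `fake_tts_stream` is a made-up stand-in for illustration, not a Cartesia SDK call.

```python
import time
from typing import Callable, Iterable, Iterator, Tuple


def time_to_first_chunk(start_stream: Callable[[], Iterable[bytes]]) -> Tuple[float, int]:
    """Return (seconds until the first chunk arrived, total bytes received)."""
    started = time.perf_counter()
    first_chunk_at = None
    total_bytes = 0
    for chunk in start_stream():
        if first_chunk_at is None:
            first_chunk_at = time.perf_counter()
        total_bytes += len(chunk)
    if first_chunk_at is None:
        raise RuntimeError("stream produced no audio")
    return first_chunk_at - started, total_bytes


def fake_tts_stream() -> Iterator[bytes]:
    """Hypothetical stand-in for a real streaming TTS request."""
    time.sleep(0.05)  # simulated delay before the first audio is ready
    for _ in range(3):
        yield b"\x00" * 320  # simulated audio chunk


ttfa, size = time_to_first_chunk(fake_tts_stream)
print(f"time to first audio: {ttfa * 1000:.0f} ms, {size} bytes total")
```

Swap `fake_tts_stream` for a function that opens your real stream, and run it from the same region as your production server, since network distance will dominate the number.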
Here is the thing most reviews will not tell you: latency is the number one killer of voice agent adoption. Better voice quality with 2-second gaps between sentences will lose to average voice quality with 100ms gaps every time. I have A/B tested this with real callers. The numbers do not lie.
---
The Developer Experience: What Actually Works
The Sonic Streaming Pipeline
Cartesia's streaming API accepts text chunks and returns audio chunks in real time. The integration pattern looks like this:
- User speaks → speech-to-text (I use Deepgram for this) → text
- Text goes to LLM (GPT-4o or Claude) → LLM starts generating response
- As each sentence is generated, it streams to Cartesia → Cartesia starts speaking immediately
- LLM continues generating the next sentence while Cartesia speaks the current one
This overlapping pipeline is what makes the agent feel real-time. The first time I got it working, I called the bot myself and honestly forgot I was testing — I just had a conversation. That is the bar you are shooting for.
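The overlap described above can be sketched with two `asyncio` tasks and a queue between them. This is a toy model, not working integration code: both stages are simulated with `asyncio.sleep`, and the names `llm_sentences` and `tts_speaker` are hypothetical placeholders for your real LLM and TTS calls.

```python
import asyncio
from typing import List, Optional


async def llm_sentences(queue: "asyncio.Queue[Optional[str]]") -> None:
    """Hypothetical LLM stage: emits one finished sentence at a time."""
    for sentence in ["Hi there.", "Thanks for calling.", "How can I help?"]:
        await asyncio.sleep(0.01)  # simulated generation time
        await queue.put(sentence)
    await queue.put(None)  # sentinel: no more text for this turn


async def tts_speaker(queue: "asyncio.Queue[Optional[str]]", spoken: List[str]) -> None:
    """Hypothetical TTS stage: 'speaks' each sentence as soon as it arrives."""
    while True:
        sentence = await queue.get()
        if sentence is None:
            break
        await asyncio.sleep(0.01)  # simulated synthesis and playback
        spoken.append(sentence)


async def run_turn() -> List[str]:
    queue: "asyncio.Queue[Optional[str]]" = asyncio.Queue()
    spoken: List[str] = []
    # Both stages run concurrently, so sentence N+1 is being generated
    # while sentence N is being spoken.
    await asyncio.gather(llm_sentences(queue), tts_speaker(queue, spoken))
    return spoken


result = asyncio.run(run_turn())
print(result)
```

The queue is the design choice that matters here: the LLM never waits for speech to finish, and the speaker never waits for the full response.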
Setting this up is about 150-200 lines of Python with their SDK. Their cartesia package has clean abstractions: Voice, OutputFormat, and a Context object that manages the WebSocket connection. The hardest part is coordinating the LLM streaming with the TTS streaming — you need to buffer text until you have a complete sentence, then send it, because Cartesia sounds weird if you send mid-sentence fragments.
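The sentence buffering step can be sketched in a few lines. This version assumes the LLM yields text fragments as plain strings and treats `.`, `!`, or `?` followed by whitespace as a sentence boundary. That rule is a deliberate simplification, and `sentences_from_tokens` is a name chosen for illustration.

```python
import re
from typing import Iterable, Iterator, List

# A sentence ends at ., ! or ? followed by whitespace (a deliberately simple rule).
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Accumulate streamed text and yield it one complete sentence at a time."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last is a complete sentence.
        for sentence in parts[:-1]:
            if sentence:
                yield sentence
        buffer = parts[-1]  # keep the unfinished tail for the next token
    if buffer.strip():
        yield buffer.strip()  # flush whatever is left when the stream ends


stream = ["Sure", ", I can", " help. ", "What is", " your budget", "? Let me", " check."]
sentences: List[str] = list(sentences_from_tokens(stream))
print(sentences)
```

The simple rule will split on abbreviations such as "Dr. Smith", so a production version would need a smarter boundary check or a short list of exceptions.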
Voice Cloning That Actually Ships
Cartesia's voice cloning endpoint needs 3-5 minutes of clean audio from one speaker. That is more than ElevenLabs asks for (it can sometimes work with 1 minute) but still perfectly acceptable for a professional service offering.
I recorded a real estate client reading their own website copy for 3 minutes. Uploaded the audio. Waited about 30 seconds. Got back a voice ID. Used it in their lead qualification bot. The result was good enough that the client's wife, who occasionally handles overflow calls, asked 'when did you record all those greeting messages?'
The cloning is not perfect — emotional range is limited, and the clone drifts on words the original speaker never said in the training sample. But for business applications (standard greeting scripts, FAQ responses, confirmation messages), it is extremely useful. Clients love hearing 'their own voice' handling customer interactions 24/7.
One hard-earned tip: do not use phone-quality audio for cloning. A client sent me a WhatsApp voice note recorded in their car and expected a perfect clone. It sounded like a robot with a cold. Clean audio, quiet room, no processing, at least 3 minutes — that is the minimum bar.
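Part of that bar can be checked automatically before you upload anything. A rough pre-flight sketch using Python's standard `wave` module, assuming the sample is an uncompressed WAV file. The 180-second floor comes from the tip above; the 16 kHz threshold for "phone-quality" is my own assumption, and both helper names are hypothetical. It cannot detect background noise or processing, so you still have to listen to the file.

```python
import io
import wave
from typing import List

MIN_SECONDS = 180        # the 3-minute floor from the tip above
MIN_SAMPLE_RATE = 16000  # assumption: anything below this is phone-grade audio


def check_clone_sample(wav_file) -> List[str]:
    """Return a list of problems; an empty list means the file passes.

    wav_file may be a path string or a binary file object.
    """
    problems = []
    with wave.open(wav_file, "rb") as wav:
        rate = wav.getframerate()
        seconds = wav.getnframes() / rate
    if seconds < MIN_SECONDS:
        problems.append(f"too short: {seconds:.0f}s < {MIN_SECONDS}s")
    if rate < MIN_SAMPLE_RATE:
        problems.append(f"sample rate too low: {rate} Hz")
    return problems


def make_silent_wav(seconds: int, rate: int) -> io.BytesIO:
    """Build an in-memory mono 16-bit WAV of silence, for demonstration only."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * rate * seconds)
    buf.seek(0)
    return buf


# A 2-second, 8 kHz clip fails both checks.
problems = check_clone_sample(make_silent_wav(2, 8000))
print(problems)
```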
---
Building a Voice Agent Business on Cartesia
The Unit Economics
Let me break down a real deployment. My real estate lead qualification bot:
- Monthly client fee: $500/month
- Call volume: ~40 calls/day, averaging 3,000 characters per call
- Cartesia cost: 40 calls × 3K chars × $0.05/1K chars × 30 days = $180/month
- LLM cost (GPT-4o): 40 calls × 3K tokens × 30 days = 3.6M tokens/month; priced at the $10/1M output rate (input is cheaper at $2.50/1M) ≈ $36/month
- STT cost (Deepgram): 40 calls × 2 minutes × $0.0059/min × 30 days ≈ $14/month
- Server cost (cheap VPS): $10/month
- Total infra: $240/month
- Net margin: $260/month
At 5 clients (which is manageable for one person), that is $1,300/month net from voice agents alone. The key insight: the client pays $500/month for a bot that answers calls 24/7. Hiring a part-time receptionist costs $1,500-$2,500/month. The value proposition sells itself — you are not competing with other AI services, you are competing with human labor costs.
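The same arithmetic as a small script, so you can plug in your own call volume. The rates are the ones quoted in the breakdown. The $0.03 per call LLM figure is simply the $36/month estimate divided by 1,200 calls, and scaling it linearly with volume is an assumption.

```python
from typing import Dict


def monthly_costs(calls_per_day: float, days: int = 30) -> Dict[str, float]:
    """Infrastructure cost per client per month, using the rates quoted above."""
    calls = calls_per_day * days
    costs = {
        "tts": calls * 3 * 0.05,    # 3K chars per call at $0.05 per 1K chars
        "llm": calls * 0.03,        # $36/month over 1,200 calls = $0.03 per call
        "stt": calls * 2 * 0.0059,  # 2 minutes per call at $0.0059/min
        "server": 10.0,             # flat VPS fee
    }
    costs["total"] = sum(costs.values())
    return costs


CLIENT_FEE = 500.0
costs = monthly_costs(calls_per_day=40)
margin = CLIENT_FEE - costs["total"]

print(f"infra:  ${costs['total']:.2f}")
print(f"margin: ${margin:.2f} per client, ${margin * 5:.2f} at 5 clients")
```

With the inputs above this comes out to about $240 of infrastructure and about $260 of margin per client. Note that TTS is three quarters of the cost and scales with call volume, so a client whose calls double eats most of the margin at a flat $500 fee.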
Scaling Model
The business model works because voice agents are fundamentally configurable, not custom-built per client. The real estate bot's core pipeline (STT → LLM → Cartesia TTS → phone system) is the same for the restaurant bot, the medical appointment bot, and the insurance quote bot. What changes is the LLM prompt, the knowledge base, and the voice.
Build one solid agent template. Clone it for each new client with a custom prompt and a cloned voice. Charge $300-$500/month. At 8-10 clients, that is $2,400-$5,000/month in revenue (before infrastructure costs) for maybe 10-15 hours of maintenance per week: mostly monitoring call logs, tweaking prompts when the LLM hallucinates, and handling client requests for new features.
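One way to keep the template idea honest in code is to make the per-client differences an explicit config object and nothing else. A minimal sketch with Python dataclasses follows. Every field name, path, and voice ID is a hypothetical placeholder, not something taken from the Cartesia SDK.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class AgentConfig:
    """Everything that changes between clients; the pipeline code stays shared."""
    client_name: str
    system_prompt: str
    knowledge_base_path: str
    voice_id: str
    monthly_fee: int = 500


# Hypothetical base template.
TEMPLATE = AgentConfig(
    client_name="template",
    system_prompt="You are a friendly phone assistant.",
    knowledge_base_path="kb/template.md",
    voice_id="default-voice",
)

# A new client is the template with three or four fields swapped out.
restaurant = replace(
    TEMPLATE,
    client_name="Example Bistro",
    system_prompt="You take reservations for Example Bistro.",
    knowledge_base_path="kb/example_bistro.md",
    voice_id="cloned-voice-bistro",
    monthly_fee=300,
)

clients = [restaurant, replace(TEMPLATE, client_name="Example Realty")]
revenue = sum(c.monthly_fee for c in clients)
print(f"{len(clients)} clients, ${revenue}/month revenue")
```

Freezing the dataclass means a client config cannot be mutated by accident at runtime, which matters once several bots share one process.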
The hard part is not the technology. The hard part is finding clients who understand what a voice agent can actually do, and setting realistic expectations. Most small business owners have never interacted with a good AI voice agent. Give them a demo call. Let them talk to the bot. When they experience a real conversation where the voice starts responding in roughly 100ms, they get it immediately.
---
Honest Comparison: Cartesia vs Everyone Else
| Feature | Cartesia | ElevenLabs | Play.ht | Deepgram |
|---|---|---|---|---|
| TTS Latency (time to first audio) | 80-120ms | 400-800ms | 600ms-1.2s | 300-600ms |
| Voice Quality | Good (not great) | Excellent | Very Good | Good |
| Voice Cloning | 3-5 min audio | 1-3 min audio | 1 min audio | N/A (STT) |
| Emotion Range | Limited | Good | Moderate | N/A |
| Price (500K chars/mo) | $25 | $22 (Creator) | $39 (Creator) | $0.0059/min (STT rate, not per character) |
| SDK Quality | Good | Excellent | Good | Excellent |
| Best For | Voice agents | Audiobooks, creative | Content, YouTube | Transcription |
Cartesia wins on latency. ElevenLabs wins on voice quality and expressiveness. Play.ht wins on value for content creators (built-in editor, voice library). Deepgram is a transcription company that added TTS — their latency is good but their TTS voice quality lags behind the dedicated platforms.
My setup: Cartesia for real-time voice agents, ElevenLabs for pre-recorded content and client voice-overs, Deepgram for speech-to-text (their STT is genuinely excellent). Three tools, three purposes, $80/month total. One voice agent client pays for all of it.
---
Bottom Line
Cartesia is not the best TTS API for every use case. If you are producing an audiobook, use ElevenLabs. If you are making YouTube videos, use Play.ht. If you just need captions, use Deepgram.
But if you are building a voice agent — something that needs to respond to human speech in real time, over the phone, without making people wait — Cartesia is the best option right now. The latency advantage is real and measurable. I have the call duration data to prove it.
The catch: you need to commit to building on their platform, accept the pricing model, and deal with the fact that they are a younger company than ElevenLabs with a smaller voice library and less mature tooling. If you can live with that, the technology delivers.
For monetization: voice agents for SMBs at $300-$500/month per client is a viable solo business. The technology works today. The market is not saturated — most small businesses have never spoken to an AI voice agent that actually sounds good. Go build one and show them.