Finally, you can control speech word by word.
(Using a new 100% open-source TTS model)
Every TTS system before this had the same core limitation.
You'd say "speak in an angry tone" and the whole sentence shifted. There was no way to say "be calm here, then laugh right at this word, then drop to a whisper for this specific phrase."
Fish Audio S2 breaks that completely.
Here's how it works:
When you train a TTS model, you need annotated data. Most systems use coarse, global labels: "this clip is angry," "this clip is a whisper."
S2 does something different.
They built a transcription model that injects inline vocal tags at the exact position they occur in the text. The transcript doesn't say "this clip sounds angry." It says:
"I can't believe [angry] you did that [inhale] right in front of everyone."
The model trains on millions of hours of audio annotated exactly this way. So it doesn't learn global style. It learns that control is local, precise, and positional.
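To make the idea concrete, here is a minimal sketch of what parsing such an inline-tagged transcript could look like. This is illustrative only: the tag syntax is taken from the example above, but the function name and data layout are assumptions, not Fish Audio's actual pipeline.

```python
import re

# Illustrative sketch, not Fish Audio's actual code: split a transcript
# with inline vocal tags like [angry] or [inhale] into plain text plus
# the character position of each tag in that plain text.
TAG_PATTERN = re.compile(r"\[(\w+)\]\s*")

def extract_inline_tags(tagged_transcript):
    """Return (clean_text, tags), where each tag records its position
    in the clean text -- so style cues stay local and positional."""
    tags = []
    clean_parts = []
    cursor = 0   # position in the clean (tag-free) text
    last_end = 0
    for match in TAG_PATTERN.finditer(tagged_transcript):
        segment = tagged_transcript[last_end:match.start()]
        clean_parts.append(segment)
        cursor += len(segment)
        tags.append({"tag": match.group(1), "position": cursor})
        last_end = match.end()
    clean_parts.append(tagged_transcript[last_end:])
    return "".join(clean_parts), tags

text, tags = extract_inline_tags(
    "I can't believe [angry] you did that [inhale] right in front of everyone."
)
# text: the sentence with tags stripped out
# tags: [angry] anchored where "you did that" begins,
#       [inhale] anchored where "right in front" begins
```

The point of the positional representation: a global label ("this clip is angry") throws that locality away, while a position-indexed tag list preserves exactly where each vocal event happens.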
But here's where it gets even smarter.
That same transcription model is reused as a reward signal during RL training. Most systems build their reward models separately, which creates a mismatch between training and evaluation.
S2 eliminates that by design.
The RL setup uses three rewards so the model can't game one without the others catching it:
- Semantic accuracy: did the model say the right words in the right way?
- Acoustic quality: does it sound clean, or are there artifacts and noise?
- Timbre similarity: does the generated voice still sound like the reference speaker?
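One way to see why three rewards resist gaming is to combine them multiplicatively rather than additively. The sketch below is a hedged illustration of that design principle -- the weights, scoring functions, and the use of a geometric mean are my assumptions, not the paper's published reward formula.

```python
# Hedged sketch of a multi-signal reward combiner. The weights and the
# geometric-mean choice are illustrative assumptions, not Fish Audio's
# published RL reward. Each score is assumed normalized to [0, 1].

def combined_reward(semantic_score, acoustic_score, timbre_score,
                    weights=(1.0, 1.0, 1.0)):
    """Weighted geometric mean of the three reward signals.

    Because the combination is multiplicative, driving any one score
    toward zero collapses the total -- the model can't maximize one
    reward while neglecting the other two.
    """
    w_sem, w_ac, w_tim = weights
    total_weight = w_sem + w_ac + w_tim
    return (semantic_score ** w_sem
            * acoustic_score ** w_ac
            * timbre_score ** w_tim) ** (1.0 / total_weight)

# Balanced output keeps its score; tanking one signal zeroes the reward.
balanced = combined_reward(0.8, 0.8, 0.8)   # stays at 0.8
gamed = combined_reward(1.0, 1.0, 0.0)      # collapses to 0.0
```

An additive sum would let a perfect semantic score paper over a destroyed timbre score; a multiplicative combination is one standard way to make each signal a hard constraint.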
The results:
↳ In a human vs. AI audio test, S2 fooled humans more often than not. GPT-4o barely registers on the same test.
↳ Head-to-head against GPT-4o and Gemini, S2 won 8 out of 10 times.
↳ On the hardest dimension (breaths, laughs, hesitations), it beat every model on the list, including closed-source ones.
↳ Nearly 5x faster than real time, with first audio in under a tenth of a second.
Model weights, fine-tuning code, and the full inference engine are all open source.
Link to the paper in the next tweet.
The combination of position-level control, dual-purpose reward design, and open release puts this in a different category from everything else right now.