A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

Code; Paper

Abstract: This paper introduces F5-TTS, a fully non-autoregressive text-to-speech system based on flow matching with a Diffusion Transformer (DiT). Without requiring complex designs such as a duration model, a text encoder, or phoneme alignment, the text input is simply padded with filler tokens to the same length as the input speech, and denoising is then performed to generate speech, an approach originally shown to be feasible by E2 TTS. However, the original design of E2 TTS is hard to follow because of its slow convergence and low robustness. To address these issues, we first model the input with ConvNeXt to refine the text representation, making it easy to align with the speech. We further propose an inference-time Sway Sampling strategy, which significantly improves our model’s performance and efficiency. This sampling strategy for flow steps can be easily applied to existing flow-matching-based models without retraining. Our design allows faster training and achieves an inference RTF of 0.15, a large improvement over state-of-the-art diffusion-based TTS models. Trained on a public 100K-hour multilingual dataset, our Fairytaler Fakes Fluent and Faithful speech with Flow matching (F5-TTS) exhibits highly natural and expressive zero-shot ability, seamless code-switching capability, and efficient speed control. Demo samples can be found at https://SWivid.github.io/F5-TTS. We release all code and checkpoints to promote community development.
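The Sway Sampling idea mentioned in the abstract can be sketched in a few lines. This page does not state the formula, so the mapping below, t = u + s * (cos(pi/2 * u) - 1 + u) with a default coefficient s = -1, is an assumption taken from our reading of the paper, and the function names are hypothetical. It is a minimal illustration of warping uniform flow steps, not the released implementation.

```python
import math


def sway_sampling(u: float, s: float = -1.0) -> float:
    """Map a uniform flow step u in [0, 1] to a swayed step t in [0, 1].

    Assumed form: t = u + s * (cos(pi/2 * u) - 1 + u).
    The endpoints are preserved (up to floating-point error) for any s.
    """
    return u + s * (math.cos(math.pi / 2 * u) - 1 + u)


def flow_steps(nfe: int, s: float = -1.0) -> list[float]:
    """Return nfe + 1 time points from 0 to 1 after applying the sway."""
    return [sway_sampling(i / nfe, s) for i in range(nfe + 1)]


# With s = -1 the time points are packed more densely near t = 0.
steps = flow_steps(32)
```

Because the warp only changes which time points the ODE solver visits, it can be applied at inference time without touching the trained model, which matches the "without retraining" claim above.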


This page is for research demonstration purposes only.

Model Overview

Figure 1: An overview of F5-TTS training (left) and inference (right). The model is trained on a text-guided speech-infilling task with a conditional flow matching loss. The input text is converted to a character sequence, padded with filler tokens to the same length as the input speech, and refined by ConvNeXt blocks before concatenation with the speech input. Inference uses Sway Sampling to choose the flow steps, and the model together with an ODE solver generates speech from sampled noise.
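The text-side preparation described in the caption can be illustrated as follows. This is a minimal sketch: the filler symbol "<F>" and the function name are hypothetical, and the real model works on token ids and mel-spectrogram frames rather than Python strings.

```python
FILLER = "<F>"  # hypothetical filler token, chosen only for this example


def pad_text_to_speech_length(text: str, n_frames: int,
                              filler: str = FILLER) -> list[str]:
    """Convert text to characters and pad with filler tokens to n_frames."""
    chars = list(text)
    if len(chars) > n_frames:
        raise ValueError("text has more characters than speech frames")
    # Pad on the right so the sequence matches the speech length exactly.
    return chars + [filler] * (n_frames - len(chars))


padded = pad_text_to_speech_length("hello", n_frames=12)
```

The padded sequence has one entry per speech frame, which is what lets it be concatenated with the speech input without a separate duration model or alignment step.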

All samples on this demo page are generated with F5-TTS (NFE=32, CFG=2, with Sway Sampling) in a single pass (no cuts), using pretrained Vocos as the vocoder.
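To make the NFE and CFG settings concrete, here is a toy sketch of an Euler ODE solver with classifier-free guidance. Everything in it is an assumption for illustration: the guidance formula is the common form v = v_cond + w * (v_cond - v_uncond), each Euler step is treated as one function evaluation, uniform steps are used for simplicity, and the scalar velocity field stands in for the actual model.

```python
def cfg_velocity(v_cond: float, v_uncond: float, cfg_strength: float) -> float:
    """Combine conditional and unconditional velocities (assumed CFG form)."""
    return v_cond + cfg_strength * (v_cond - v_uncond)


def toy_velocity(x: float, t: float, cfg_strength: float = 2.0,
                 cond_target: float = 1.0, uncond_target: float = 0.0) -> float:
    """Toy stand-in for the model: straight-line flows toward fixed targets."""
    v_cond = (cond_target - x) / (1.0 - t)
    v_uncond = (uncond_target - x) / (1.0 - t)
    return cfg_velocity(v_cond, v_uncond, cfg_strength)


def euler_sample(x0: float, velocity, nfe: int) -> float:
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 in nfe Euler steps."""
    x = x0
    for i in range(nfe):
        t = i / nfe  # uniform steps; t never reaches 1, so 1 - t stays positive
        x = x + velocity(x, t) / nfe
    return x


sample = euler_sample(x0=0.3, velocity=toy_velocity, nfe=32)
```

In this toy setting the guided flow ends at (1 + w) * cond_target - w * uncond_target, which shows how guidance pushes the result away from the unconditional prediction. With cfg_strength set to 0 it ends at cond_target.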

Zero-shot Generation

Prompt and text from the demo page of Seed-TTS.

Language Prompt Same Language Generation Cross-lingual Generation
I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring. Respect me and I'll nurture you; ignore me and you shall face the consequences.


Perhaps they are driven by the delicious blend of flavors, or it could be the appealing visual presentation. At the end of the day, our choices in food reflect our personal preferences and sometimes, even our lifestyle or belief system.


Your safety and the pack's reputation are at stake. Your bravery is admirable, but sometimes bravery is knowing when to retreat. Please, consider returning with me. We can work out a plan, but only if you're willing to listen.


Suddenly, there was a burst of laughter beside me. I looked at them, stood up straight with high spirit, shook the slightly fleshy arms, and smiled lightly, saying, "The flesh on my body is to hide my bursting charm. Otherwise, wouldn't it scare you?"


Suddenly, the atmosphere became gloomy. At first glance, all the troubles seemed to surround me. I frowned, feeling that pressure, but I know I can't give up, can't admit defeat. So, I took a deep breath, and the voice in my heart told me, "Anyway, must calm down and start again."


The emperor's complexion did not change, remaining as still as a sculpture, and a touch of touching warmth flashed in his eyes. He deeply glanced at the loyal minister, and finally spoke: "Well, I will consider it again." His voice was low and firm, leaving a faint hint of helplessness and tenderness in the air.

Code-Switch, Text from FireRedTTS demo page.

Prompt Text Code-Switched Generation

Speed Control

Prompt and Text from Seed-TTS, same as those used in MaskGCT demo page.

F5-TTS only needs the total duration; the position and duration of each character are assigned automatically by the model.
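One simple way to turn a speed factor into a total duration is sketched below. The heuristic is an assumption for illustration (this page does not say how the total duration is chosen): the generated part is estimated from the prompt's frames-per-character rate, scaled by the target text length and divided by the speed. All names are hypothetical.

```python
def total_duration_frames(prompt_frames: int, prompt_chars: int,
                          target_chars: int, speed: float = 1.0) -> int:
    """Estimate total frames (prompt plus generated speech) for a given speed."""
    if prompt_chars <= 0 or speed <= 0:
        raise ValueError("prompt_chars and speed must be positive")
    # Frames per character in the prompt, applied to the target text.
    generated = int(prompt_frames / prompt_chars * target_chars / speed)
    return prompt_frames + generated


slow = total_duration_frames(500, 50, 100, speed=0.7)
normal = total_duration_frames(500, 50, 100, speed=1.0)
fast = total_duration_frames(500, 50, 100, speed=1.3)
```

A lower speed yields a longer total duration and a higher speed a shorter one. The model then distributes the characters within that duration on its own.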

Prompt Text 0.7x Speed 1.0x Speed 1.3x Speed
I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring. Respect me and I'll nurture you; ignore me and you shall face the consequences.
Dealing with family secrets is never easy. Yet, sometimes, omission is a form of protection, intending to safeguard some from the harsh truths. One day, I hope you understand the reasons behind my actions. Until then, Anna, please, bear with me.

Prompt and Text from E2 TTS demo page.

Prompt Text 0.7x Speed 1.0x Speed 1.3x Speed
He gave way to the others very readily and retreated unperceived by the Squire and Mistress Fitzooth to the rear of the tent.
“How cheerfully he seems to grin, How neatly spread his claws, And welcome little fishes in With gently smiling jaws”!
Yes; then something better, something still grander, will surely follow, or wherefore should they thus ornament me?
And, though I have grown serene And strong since then, I think that God has willed A still renewable fear…
He wore blue silk stockings, blue knee pants with gold buckles, a blue ruffled waist and a jacket of bright blue braided with gold.
Not only this, but on the table I found a small ball of black dough or clay, with specks of something which looks like sawdust in it.


Comparison with the most expressive results on the E2 TTS demo page. Prompts are from the RAVDESS dataset.

Emotion Prompt Text E2 TTS F5-TTS
Calm So, I was, like, at the, um, grocery store, and, uh, I saw this, like, really yummy-looking, um, cake, y’know? And I, uh, totally wanted to, like, buy it, but, um, I was, like, on a diet, so, uh, I just, like, stared at it for a while, y’know?
Happy I was, like, talking to my friend, and she’s all, um, excited about her, uh, trip to Europe, and I’m just, like, so jealous, right?


The first prompt (Wukong) is from the FireRedTTS demo page; the second to fourth are from Bailing-TTS. Text is from the Seed-TTS hard test set, the same as on the MaskGCT demo page.

Prompt Text F5-TTS

Hard sentences from ELLA-V. The audio prompts are the same as on the E2 TTS demo page, taken from LibriSpeech-PC test-clean.

Prompt Text F5-TTS
Active artists always appreciate artistic achievements and applaud awesome artworks.
Brave bakers boldly baked big batches of brownies in beautiful bakeries.
Daring dancers dazzled during dynamic dance displays, drawing delighted crowds.
Excited engineers eagerly enjoyed exploring enormous engineering exhibits.
Friendly farmers faithfully fostered fields, favoring fruitful crops.
Gallant gophers gracefully gambled golden gooseberries on grandiose glaciers.
Happy hikers harmoniously hiked through hilly landscapes on heavenly holidays.
Inquisitive individuals ingeniously invented innovative inventions.
Jovial joggers joyfully joined jogging jaunts, justifying joyful jolliness.
Keen kids keenly knitted knotted knots in kindergartens.
F one F two F four F eight H sixteen H thirty two H sixty four.
Clever cats carefully crafted colorful collages creating cheerful compositions.