Text-to-speech systems can now patch their own mispronunciations on the fly, without touching the underlying model.
Researchers have built FlowEdit, an adaptation layer that sits on top of frozen flow-matching TTS models and stores pronunciation corrections in a Modern Hopfield Network, a type of associative memory that can retrieve approximate matches at inference time. When a correction is submitted, FlowEdit optimizes a small perturbation in the text embedding space and saves it to that memory rather than updating any model weights. At inference, it soft-matches incoming text against the stored corrections and applies the corresponding perturbation when it finds a close enough match. The whole correction process takes roughly 15 seconds on a single GPU.
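To make the retrieval step concrete, here is a minimal sketch of softmax-weighted lookup in the style of a Modern Hopfield Network. It is not FlowEdit's actual code: the class name `CorrectionMemory`, the inverse temperature `beta`, the cosine-similarity `threshold`, and the use of separate key and value vectors are all assumptions made for illustration.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class CorrectionMemory:
    """Hypothetical store mapping text-embedding keys to perturbations."""

    def __init__(self, beta=8.0, threshold=0.8):
        self.beta = beta            # assumed inverse temperature of the softmax
        self.threshold = threshold  # assumed minimum cosine similarity for a hit
        self.keys = []
        self.values = []

    def store(self, key, perturbation):
        """Save one correction; no model weights are involved."""
        self.keys.append(normalize(key))
        self.values.append(list(perturbation))

    def retrieve(self, query):
        """Return a blended perturbation, or zeros when nothing is close."""
        if not self.keys:
            return [0.0] * len(query)
        q = normalize(query)
        sims = [sum(a * b for a, b in zip(q, k)) for k in self.keys]
        dim = len(self.values[0])
        if max(sims) < self.threshold:
            return [0.0] * dim  # no close match: leave the embedding alone
        # Hopfield-style update: softmax over scaled similarities,
        # then a weighted sum of the stored perturbations.
        weights = softmax([self.beta * s for s in sims])
        return [sum(w * v[i] for w, v in zip(weights, self.values))
                for i in range(dim)]


memory = CorrectionMemory(beta=50.0)
memory.store([1.0, 0.0], [0.5, -0.5])   # toy 2-D embeddings, not real data
memory.store([0.0, 1.0], [-1.0, 1.0])
near_hit = memory.retrieve([0.9, 0.1])  # close to the first stored key
no_hit = memory.retrieve([0.7, 0.7])    # equally far from both keys
```

A large `beta` makes retrieval behave almost like an exact lookup of the nearest stored correction, while a small one blends several corrections together. The threshold is one simple way to keep unrelated text from picking up a fix it should not get.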
This matters because proper nouns (brand names, place names, people's names) are exactly where deployed TTS systems fall apart, and until now the only real fix was retraining. Retraining is expensive, slow, and impractical for a system handling thousands of edge cases across many languages. On a benchmark of 312 multilingual proper nouns across 18 language families, FlowEdit showed a 92.7% relative reduction in Phoneme Error Rate against the zero-shot baseline, with no degradation to general speech quality.
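A relative reduction is measured against the baseline error rate, not in absolute percentage points. The short sketch below shows the arithmetic. The baseline value in it is a made-up number chosen only to illustrate the calculation, since the absolute error rates are not given here.

```python
def relative_reduction(baseline, corrected):
    """Fraction of the baseline error that was removed."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline - corrected) / baseline


# Hypothetical baseline Phoneme Error Rate of 20%; NOT a reported figure.
hypothetical_baseline = 0.20
# A 92.7% relative reduction leaves 7.3% of the baseline error.
implied_corrected = hypothetical_baseline * (1 - 0.927)
reduction = relative_reduction(hypothetical_baseline, implied_corrected)
```

Under that assumed baseline, the corrected error rate would be about 1.5%. A different baseline would give a different absolute figure for the same 92.7% relative reduction.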
The comparison to fine-tuning favors FlowEdit in one obvious way: the model never changes, so catastrophic forgetting is not a concern. Whether the associative memory scales gracefully as the correction library grows into the thousands, or whether retrieval starts to degrade, is a question the paper does not answer.