I've been experimenting with TTS (text-to-speech) for a while now, mostly because I wanted to replace Google Assistant on my phone with something better, something more private and personal.
But the lack of quality local TTS models held me back. While I had already set up the infrastructure to serve STT and LLMs with tool calling, TTS was the one thing I was still missing.
Now obviously I could have just used cloud services, but I wanted something more local, something I could run on my own hardware, where I could control the data. And honestly, I really didn't feel like paying the big corpos even more.
After not finding anything really suitable, I decided to just do it myself.
The testing took a while: I scanned through basically every open TTS model and noted its quality, strengths, weaknesses, and so on. While there were some pretty great, high-quality options, they lacked consistency and coherence, even when fine-tuned on hours of data.
I plan on eventually talking more about the architecture I chose and the data I used, and even releasing the training code. For now I'm just sticking to sharing progress updates, since I don't want to be overconfident and end up failing, or release something subpar that I'm not proud of.
While I was working on this, F5-TTS was released, which was surprisingly similar tech-wise but still below my quality threshold. Now obviously, the consistency and quality of my model come with the drawback of versatility: I'm stuck with one language and one voice. But for my uses? That is more than enough.
I'm still working on improving it further and figuring out how to cram more voices into one model, to avoid relying on hot-swapping or running multiple models simultaneously. But that will have to wait, as there are still plenty of quality improvements to be made.
Anyway, thanks for reading this and for listening to me yap and ramble. I'll leave you with some examples and a promise that it'll get better.