Voice Fine-tuning Project: Enhancing Speech Recognition and Synthesis Models

Text-to-Speech (TTS) technology, which converts written text into spoken words, has advanced significantly in recent years and now powers applications such as virtual assistants and language learning tools. However, creating natural-sounding synthetic voices remains challenging, especially for tonal languages such as Vietnamese.


InfinityAI's voice fine-tuning project aims to address these challenges by enhancing TTS models for Vietnamese speech synthesis. Team NoChill’s focus is to fine-tune the Tortoise TTS model, known for its multi-voice capabilities and realistic prosody, to generate high-quality Vietnamese speech. The project involves collecting diverse Vietnamese voice recordings and carefully preprocessing them for fine-tuning. By leveraging techniques such as transfer learning, we aim to improve the model's ability to capture the tonal nuances that are critical in Vietnamese.
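
As a small illustration of what transcript preprocessing for a tonal language can involve, the sketch below normalizes Vietnamese text to Unicode NFC so that each tone-marked syllable always uses the same character sequence. This is an assumption about one plausible cleaning step, not the team's actual pipeline; the function name `normalize_transcript` is hypothetical, and lowercasing is an illustrative choice rather than a requirement.

```python
import re
import unicodedata


def normalize_transcript(text: str) -> str:
    """Normalize one Vietnamese transcript line (illustrative sketch only)."""
    # NFC composes base letters and combining tone marks into single code
    # points, so visually identical syllables compare equal.
    text = unicodedata.normalize("NFC", text)
    # Collapse runs of whitespace into single spaces and trim the ends.
    text = re.sub(r"\s+", " ", text).strip()
    # Lowercasing is an assumption made here to shrink the symbol set.
    return text.lower()


# Simulate input whose diacritics arrive in decomposed (NFD) form,
# with stray whitespace around and between the words.
decomposed = unicodedata.normalize("NFD", "  Tiếng   Việt ")
cleaned = normalize_transcript(decomposed)
```

Without a step like this, the same word can appear in a dataset under several byte-level spellings, which makes it harder for a model to learn a consistent mapping from text to tone.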


Our final goal is to develop a user-friendly web application that uses the enhanced TTS model to convert written Vietnamese text into clear, natural-sounding audio. While the primary focus is on improving TTS technology for Vietnamese, the project also aims to provide a practical tool for local applications such as e-learning, accessibility, and customer service. By addressing the unique challenges of Vietnamese speech synthesis, this project contributes to the growing field of AI in Vietnam and could serve as a foundation for further innovations in digital content creation and user interaction.

