Amazon.com Inc. researchers have developed a new text-to-speech model, BASE TTS, that can pronounce words more naturally than earlier neural networks.
TechCrunch reported on the project late Wednesday. The researchers detailed the architecture of BASE TTS in an academic paper published on Monday.
Besides generating more natural-sounding audio than its predecessors, the model is believed to be the largest neural network in its category. The most advanced version of BASE TTS features about 1 billion parameters, which are the configuration settings that determine how an artificial intelligence processes data. Generally, increasing an AI model’s parameter count expands the range of tasks it can perform.
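For a concrete sense of what a parameter count means, the PyTorch snippet below tallies the weights of a small toy model the same way practitioners arrive at figures like the roughly 1 billion parameters cited above. The model here is an illustrative stand-in, since BASE TTS itself has not been publicly released.

```python
import torch.nn as nn

# Toy stand-in for a large model; BASE TTS is not publicly available,
# so this only illustrates what a parameter count measures in practice.
model = nn.Sequential(
    nn.Embedding(50_000, 1_024),  # token embeddings
    nn.Linear(1_024, 4_096),
    nn.Linear(4_096, 1_024),
)

# Every weight and bias tensor contributes to the total.
total = sum(p.numel() for p in model.parameters())
print(f"{total:,} parameters")  # ~59.6 million for this toy model
```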
Amazon’s researchers trained BASE TTS on 100,000 hours’ worth of audio sourced from the public web. English-language recordings account for about 90% of the dataset. To streamline the training process, the researchers split the audio into small files that each included no more than 40 seconds of speech.
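The paper does not publish its preprocessing pipeline, but a minimal sketch of that chunking step might look like the following, assuming mono waveforms stored as NumPy arrays and simple fixed-length cuts. A real pipeline would more likely cut at silences or sentence boundaries.

```python
import numpy as np

def split_into_chunks(waveform: np.ndarray, sample_rate: int,
                      max_seconds: float = 40.0) -> list[np.ndarray]:
    """Split a mono waveform into consecutive segments of at most
    `max_seconds` each. Fixed-offset cuts are a simplification; cutting
    at pauses would preserve sentence integrity."""
    max_samples = int(max_seconds * sample_rate)
    return [waveform[i:i + max_samples]
            for i in range(0, len(waveform), max_samples)]

# Example: a 95-second recording at 24 kHz yields 40s, 40s and 15s chunks.
audio = np.zeros(95 * 24_000, dtype=np.float32)
chunks = split_into_chunks(audio, sample_rate=24_000)
print([len(c) / 24_000 for c in chunks])  # [40.0, 40.0, 15.0]
```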
“Echoing the widely-reported ‘emergent abilities’ of large language models when trained on increasing volume of data, we show that BASE TTS variants built with 10K+ hours and 500M+ parameters begin to demonstrate natural prosody on textually complex sentences,” the researchers wrote in the paper detailing the system.
At the architectural level, BASE TTS comprises two separate AI models. The first turns text entered by the user into abstract mathematical representations dubbed speechcodes. The second neural network, in turn, transforms those mathematical representations into audio.
The first model relies on the Transformer architecture that underpins OpenAI’s GPT-4. Developed by Google LLC in 2017, the architecture allows neural networks to consider the context of a word when trying to determine its meaning. That feature enables Transformer-based neural networks to interpret input data more accurately than earlier algorithms.
The Transformer model in BASE TTS turns text inputted by the user into speechcodes, mathematical representations that the other components of the system can more easily process. The model also performs two other tasks. According to the researchers, it compresses speechcodes to speed up processing and ensures that the audio BASE TTS produces doesn’t include unnecessary elements such as background noise.
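Amazon has not released the model’s code, but the rough shape of this first stage can be sketched in PyTorch as below. The vocabulary sizes, dimensions, and the single-pass layout are illustrative assumptions; the actual model generates speechcodes autoregressively, GPT-style.

```python
import torch
import torch.nn as nn

class TextToSpeechcodes(nn.Module):
    """Minimal Transformer mapping text tokens to discrete speechcodes.
    All sizes are illustrative, not the values used by BASE TTS."""
    def __init__(self, text_vocab=10_000, code_vocab=4_096, dim=512):
        super().__init__()
        self.text_emb = nn.Embedding(text_vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=6)
        self.to_codes = nn.Linear(dim, code_vocab)

    def forward(self, text_tokens: torch.Tensor) -> torch.Tensor:
        h = self.backbone(self.text_emb(text_tokens))
        # Each position predicts one discrete speechcode index.
        return self.to_codes(h).argmax(dim=-1)

model = TextToSpeechcodes()
codes = model(torch.randint(0, 10_000, (1, 32)))  # (batch, sequence)
print(codes.shape)  # torch.Size([1, 32])
```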
Once the speechcodes are ready, they move into the second AI model that underpins BASE TTS. That model turns the data into spectrograms, which are graphs used to visualize sound waves. Those graphs can be relatively easily turned into AI-generated speech.
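BASE TTS uses its own neural decoder for that last step, but the general contract of spectrogram inversion can be illustrated with the classical Griffin-Lim algorithm as implemented in librosa. The snippet below round-trips a test tone through a mel spectrogram and back to a waveform; production TTS systems swap in a neural vocoder, but the interface is the same: spectrogram in, audio out.

```python
import librosa

# Build a mel spectrogram from a reference signal, then invert it back
# to audio with Griffin-Lim (used here purely for illustration).
sr = 22_050
y = librosa.tone(440.0, sr=sr, duration=1.0)             # 1-second test tone
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)
y_rec = librosa.feature.inverse.mel_to_audio(mel, sr=sr)  # Griffin-Lim inside
print(y_rec.shape)  # roughly sr samples, i.e. about 1 second of audio
```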
Amazon’s researchers assessed BASE TTS’s capabilities with the help of an expert linguist, as well as a subjective listening test methodology called MUSHRA. They determined that the model can read input text aloud in a more natural-sounding way than earlier models.
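MUSHRA (MUltiple Stimuli with Hidden Reference and Anchor) asks human listeners to score several renditions of the same utterance on a 0-to-100 scale alongside a hidden human-recorded reference. The sketch below shows how such ratings are typically summarized; the numbers are hypothetical and serve only to illustrate the aggregation.

```python
import statistics

# Hypothetical MUSHRA-style ratings (0-100) from a small listener panel;
# each listener scores the same utterance rendered by different systems.
ratings = {
    "hidden_reference": [92, 95, 90, 97, 94],
    "system_a":         [78, 82, 75, 80, 79],
    "system_b":         [85, 88, 84, 90, 86],
}

for system, scores in ratings.items():
    print(f"{system}: {statistics.mean(scores):.1f} "
          f"± {statistics.stdev(scores):.1f}")
```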
During the evaluation, BASE TTS successfully pronounced the @ sign and other symbols, along with paralinguistic sounds such as “shh.” It also managed to read aloud English-language sentences that contained foreign words, as well as questions. According to Amazon, BASE TTS accomplished the task even though it wasn’t specifically trained to process some of the sentence types included in the evaluation dataset.