Today at Microsoft Inspire 2023, we're excited to announce the general availability (GA) of the new multi-style and multi-lingual custom neural voice (CNV) features inside Text to Speech, part of the Azure AI Speech capability.

Custom neural voice, a feature of Azure AI Speech, is a great way to help you create a one-of-a-kind voice that is natural and sounds identical to your voice actor. This new technology allows you to create a natural branded voice capable of expressing different emotions and speaking different languages. Since its launch, custom neural voice has empowered organizations such as AT&T, Progressive, Vodafone, and Swisscom to develop branded speech solutions that delight users in various scenarios, including voice assistants, customer service bots, audiobooks, language learning, news reading, and many more. (For more details, read the Innovation Stories blog.)

To answer the customer requests to support more expressive voices across the globe, we have already released a preview version of the CNV multi-style and multi-lingual capabilities. Now, with the general availability of these features, we are upgrading the models to support more languages and enhanced voice naturalness. In this blog post, we introduce what's new and share a step-by-step guide to help you harness the power of these new features.

Multi-style CNV GA: enable your voice to convey emotions

Branded voices have become increasingly popular in various scenarios, such as voice assistants, news reading, and audiobook creation. However, customers often request support for voice emotions and styles that would enhance their end users' experience. Using the multi-style CNV feature, customers can create voices that speak in multiple styles and emotions without adding new training data, through Style Transfer.

Style Transfer is a method to apply the speaking tone and prosody (i.e., pace, intonation, rhythm) of one speaker (the source speaker) to another speaker (the target speaker). As a result of the Style Transfer, the target speaker adopts the tone and prosody of the source speaker yet keeps their own voice timbre.

With its GA, we have updated the Style Transfer model for English (US) and expanded its support to Chinese (Mandarin, Simplified) and Japanese. The prebuilt styles for en-US include Angry, Cheerful, Excited, Friendly, Hopeful, Sad, Shouting, Terrified, Unfriendly, and Whispering. You can review the previous samples in this blog post. With the GA update, we released a more robust English (US) multi-style model which increased the naturalness of the speaking styles created.

Chinese is a new language we support for CNV with multiple styles. For zh-CN, the prebuilt styles are Angry, Calm, Chat, Cheerful, Disgruntled, Fearful, Sad, and Serious. Finally, for ja-JP, another new language that we support for multi-style CNV, the prebuilt styles are Angry, Happy, and Sad. Beyond these styles, you can also create your own speaking style for the same voice, with your style training data available.

To create a multi-style voice, you only need to prepare a small set of voice samples (about 300+ utterances) in its default style. After your data is imported to the Speech Studio portal, select 'Neural – multi style' as the training method. Then select the target speaking styles you want to enable from a preset style list. This is also when you can provide your own style data to create new speaking styles for the same voice. The training could take 20-40 hours to finish, depending on the training data size, the language, and the styles you select. Once the model is created, you can review the system-generated audio samples to test the voice quality.

After you've tested the voice and styles, you can deploy the voice to a custom neural voice endpoint (see how to deploy and use your voice model) and use the Audio Content Creation tool to create audio with your deployed voice in different speaking styles. Or, specify the speaking styles using SSML in your code via the Speech SDK (see details here).

Multi-lingual CNV GA: adapt your voice to speak different languages

In today's interconnected world, developers are expected to build voice-enabled applications that can reach a global audience. Enabled with the cross-lingual adaptation technology, the CNV feature allows for the creation of a custom voice that can speak dozens of languages without adding language-specific training data. The cross-lingual model is a single unified model that is trained with data from different speakers and languages. The backbone of the cross-lingual model is based on Conformer, which combines convolutional neural networks and transformers to efficiently model both local and global dependencies in data sequences. This allows the model to transfer the voice of a speaker from one language to another.
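As a minimal sketch of how a speaking style can be requested via SSML at synthesis time, the snippet below builds an SSML document using the Speech service's mstts:express-as element. The voice name `MyBrandNeural`, the sample text, and the `build_styled_ssml` helper are illustrative placeholders, not part of the product; the style value must be one your deployed voice actually supports.

```python
# Sketch: wrap text in SSML that asks a (hypothetical) deployed custom neural
# voice to speak in a given style, using the mstts:express-as extension.

def build_styled_ssml(text: str, voice_name: str, style: str, lang: str = "en-US") -> str:
    """Return an SSML string requesting `style` from `voice_name`."""
    return (
        f'<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
        f'xmlns:mstts="https://www.w3.org/2001/mstts" xml:lang="{lang}">'
        f'<voice name="{voice_name}">'
        f'<mstts:express-as style="{style}">{text}</mstts:express-as>'
        f"</voice></speak>"
    )

ssml = build_styled_ssml("Your order has shipped.", "MyBrandNeural", "cheerful")

# To actually synthesize, the SSML would be passed to the Speech SDK, roughly:
#   speech_config = speechsdk.SpeechConfig(subscription=key, region=region)
#   speech_config.endpoint_id = "<your custom voice deployment ID>"
#   synthesizer = speechsdk.SpeechSynthesizer(speech_config=speech_config)
#   synthesizer.speak_ssml_async(ssml).get()
```

The same SSML-building approach applies whether you call the endpoint from the Speech SDK or author content in the Audio Content Creation tool, which generates equivalent markup for you.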