A voice replicator is a powerful tool for people at risk of losing their ability to speak, including those with a recent diagnosis of amyotrophic lateral sclerosis (ALS) or other conditions that can progressively impact speaking ability. First announced in May 2023 and made available on iOS 17 in September 2023, Personal Voice is a tool that creates a synthesized voice for such users to speak in FaceTime, phone calls, assistive communication apps, and in-person conversations.
To start, the user reads aloud a randomized set of text prompts to record 150 sentences on the latest iPhone, iPad, or Mac. The voice model is then trained with machine learning techniques overnight, directly on the device, while the device is charging, locked, and connected to Wi-Fi (the Wi-Fi connection is needed only to download the pre-trained asset). By the next day, the person can type what they want to say using the Live Speech text-to-speech (TTS) feature, as illustrated in Figure 1, and be heard in conversation in a voice that sounds like theirs. Because model training and inference are performed entirely on-device, users can take advantage of Personal Voice whenever they want, and keep their information both private and secure.
In this research highlight, we discuss the three machine learning approaches behind Personal Voice:
- Personal Voice TTS system
- Voice model pretraining and fine-tuning
- On-device speech recording enhancement
Figure 1: Personal Voice feature in iOS 17. Image A shows the instructions to create a Personal Voice and how to use it.
![Figure 1: Personal voice feature in iOS17. Image B shows what a recording process looks like.](https://mlr.cdn-apple.com/media/Fig1b_Personal_Voice_6a5ac03cf7.jpg)
Figure 1: Personal Voice feature in iOS 17. Image B shows what the recording process looks like.
Personal Voice TTS System
The first machine learning approach we will discuss is a typical neural TTS system, which takes in text and produces speech output. A TTS system consists of three major components:
- Text processing: Converts graphemes (written text) to phonemes, a written notation that represents distinct units of sound (such as the h of hat and the c of cat in English)
- Acoustic model: Converts phonemes to acoustic features (for example, to the Mel spectrum, a frequency representation of sound engineered to characterize the range of human speech)
- Vocoder model: Converts acoustic features to speech waveforms, providing a representation of the audio signal over time
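This division of labor can be sketched end to end. The following is a toy illustration of the three stages, not Apple's implementation: the lexicon, the frames-per-phoneme count, and the hop length are invented placeholders.

```python
# Toy sketch of a three-stage TTS pipeline: text -> phonemes -> Mel frames -> waveform.
# All names and numbers here are illustrative placeholders.

# Text processing: map graphemes to phonemes via a tiny made-up lexicon.
TOY_LEXICON = {"hat": ["HH", "AE", "T"], "cat": ["K", "AE", "T"]}

def text_to_phonemes(text: str) -> list[str]:
    phonemes = []
    for word in text.lower().split():
        phonemes.extend(TOY_LEXICON.get(word, ["UNK"]))
    return phonemes

# Acoustic model: phonemes -> Mel-spectrum frames (stubbed as zeros here).
def acoustic_model(phonemes: list[str], frames_per_phoneme: int = 5,
                   n_mels: int = 80) -> list[list[float]]:
    return [[0.0] * n_mels for _ in phonemes for _ in range(frames_per_phoneme)]

# Vocoder: Mel frames -> waveform samples (stubbed; one hop of audio per frame).
def vocoder(mel_frames: list[list[float]], hop_length: int = 256) -> list[float]:
    return [0.0] * (len(mel_frames) * hop_length)

phonemes = text_to_phonemes("hat cat")
mel = acoustic_model(phonemes)
audio = vocoder(mel)
print(len(phonemes), len(mel), len(audio))  # 6 phonemes -> 30 frames -> 7680 samples
```

The point of the sketch is the interface between stages: each component only needs the previous stage's output, which is what makes it possible to fine-tune the acoustic and vocoder models independently.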
To develop Personal Voice, Apple researchers worked with the Open SLR LibriTTS dataset. The cleaned dataset consists of 300 hours of speech from 1,000 speakers with very different speaking styles and accents. Personal Voice must produce speech output that others can recognize as the voice of the target speaker. In a typical TTS system, both the acoustic model and the vocoder model are speaker-dependent. To clone the target speaker’s voice, we fine-tuned the acoustic model with on-device training. For the vocoder model, we considered both a universal model and on-device adaptation. Our team found that fine-tuning only the acoustic model and using a universal vocoder often generates poorer voice quality: unusual prosody, audio glitches, and noise were more prevalent when tested on unseen speakers. Fine-tuning both models, as seen in Figure 2, requires additional training time on device but results in better overall quality.
![A personal voice text-to-speech system diagram.](https://mlr.cdn-apple.com/media/Fig2_Personal_Voice_618aa292a2.png)
Listening tests confirmed that fine-tuning both models achieves the best voice quality and similarity to the target speaker’s voice, as measured by mean opinion score (MOS) and voice similarity (VS) score, respectively. The MOS is 0.43 higher, on average, than with the universal vocoder. In addition, fine-tuning can reduce the actual model size enough to achieve real-time speech synthesis, for a faster and more satisfying conversation experience.
Voice Model Pretraining and Fine-Tuning
The next machine learning approaches we will discuss are voice model pretraining and fine-tuning. The models consist of two parts:
- Modified FastSpeech2-based acoustic model
- WaveRNN-based vocoder model
The acoustic model follows an architecture similar to FastSpeech2. However, we add a speaker ID as part of the decoder input to learn universal voice information during the pretraining stage. Further, our team uses dilated convolution layers for decoding instead of transformer-based layers. This results in faster training and inference, as well as reduced memory consumption, making the models shippable on iPhone and iPad.
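To see why dilated convolutions suit a lightweight decoder, consider how the receptive field grows with depth. The kernel size and dilation schedule below are illustrative assumptions, not the shipped architecture:

```python
# Receptive field of a stack of 1-D convolutions: with dilation d, each layer of
# kernel size k adds (k - 1) * d positions of context. Doubling the dilation per
# layer grows context exponentially with depth at the same parameter cost.
def receptive_field(kernel_size: int, dilations: list[int]) -> int:
    field = 1
    for d in dilations:
        field += (kernel_size - 1) * d
    return field

dilated = receptive_field(3, [1, 2, 4, 8])  # exponential dilation schedule
plain = receptive_field(3, [1, 1, 1, 1])    # same depth, no dilation
print(dilated, plain)  # 31 vs. 9 positions of context for four layers
```

Wide context from only a few convolution layers, with no attention matrices to store, is what yields the faster inference and smaller memory footprint noted above.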
We use a common pretraining and fine-tuning strategy for Personal Voice. Both the acoustic and vocoder models are pretrained with the same Open SLR LibriTTS dataset.
During the fine-tuning stage with target-speaker data, we fine-tune only the acoustic model’s decoder and variance adapters. The variance adapters are used to predict the target speaker’s phoneme-wise duration, pitch, and energy. For the vocoder model, however, we do a full model adaptation, in which all parameters are fine-tuned. Moreover, the entire fine-tuning stage (and the Personal Voice TTS system) runs on the user’s Apple device, not on a server. To speed up on-device training, we use bfloat16 precision with fp32 accumulation for vocoder model fine-tuning, with a batch size of 32. Each batch contains 10 ms audio samples.
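The benefit of keeping the accumulator in fp32 can be simulated in pure Python by rounding values to bfloat16, i.e., truncating the low 16 bits of their float32 representation. The values and loop below are a toy demonstration of the numerical effect, not the actual training code:

```python
import struct

def to_bfloat16(x: float) -> float:
    """Round a float32 value to bfloat16 by truncating the low 16 mantissa bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

values = [to_bfloat16(0.001)] * 10_000  # bfloat16 inputs (~3 significant digits)

# Naive: keep the running sum in bfloat16 too. Once the sum is large enough,
# each small addend falls below the sum's precision and is lost entirely.
bf16_sum = 0.0
for v in values:
    bf16_sum = to_bfloat16(bf16_sum + v)

# Mixed precision: bfloat16 inputs, high-precision accumulator (Python float).
fp32_sum = sum(values)

print(bf16_sum, fp32_sum)  # the bfloat16 sum stalls at 0.25; the other reaches ~9.99
```

This is the same reason gradient and loss accumulations are kept in fp32 during low-precision training: the individual terms fit comfortably in bfloat16, but their running sum does not.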
On-Device Speech Recording Enhancement
The final machine learning technique we will discuss is on-device speech recording enhancement. Those who use the Personal Voice feature can record their voice samples wherever they choose. Consequently, these recordings may include unwanted sounds, such as traffic noise or other people’s voices nearby. In our research, we found that the quality of the generated, or synthesized, voice is highly correlated with the quality of the user’s recordings. Hence, we apply speech enhancement to the target-speaker data to achieve the best voice quality.
Our speech enhancement pipeline contains four major components, as seen in Figure 3:
- Sound pressure level (SPL) and signal-to-noise ratio (SNR) filtering: Screens out very noisy recordings that are difficult to enhance
- Voice isolation: Removes general noise and leaves only speech
- Mel spectrum enhancement: Model-based solution that provides a cleaner Mel spectrum with better audio fidelity
- Audio restoration: Model-based solution to recover the audio signal from the enhanced Mel spectrum
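As a rough sketch of the first filtering step, the snippet below computes SNR from separated speech and noise estimates, plus a level estimate from the RMS of the mixture (a digital dBFS proxy for SPL, since true SPL requires a calibrated microphone). The 10 dB and −40 dB thresholds are invented for illustration and are not Apple's actual cutoffs:

```python
import numpy as np

def snr_db(speech: np.ndarray, noise: np.ndarray) -> float:
    """Signal-to-noise ratio in dB from separated speech and noise signals."""
    return 10.0 * np.log10(np.mean(speech**2) / np.mean(noise**2))

def level_db(signal: np.ndarray) -> float:
    """RMS level in dB relative to full scale (a stand-in for SPL)."""
    return 20.0 * np.log10(np.sqrt(np.mean(signal**2)))

def passes_filter(speech, noise, min_snr_db=10.0, min_level_db=-40.0) -> bool:
    """Screen out recordings that are too noisy or too quiet to enhance."""
    mixture = speech + noise
    return snr_db(speech, noise) >= min_snr_db and level_db(mixture) >= min_level_db

rng = np.random.default_rng(0)
speech = 0.1 * rng.standard_normal(16_000)       # 1 s of synthetic "speech" at 16 kHz
quiet_noise = 0.001 * rng.standard_normal(16_000)  # ~40 dB SNR
loud_noise = 0.1 * rng.standard_normal(16_000)     # ~0 dB SNR

print(passes_filter(speech, quiet_noise))  # True
print(passes_filter(speech, loud_noise))   # False: too noisy to enhance reliably
```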
![The four components of the on-device speech enhancement pipeline.](https://mlr.cdn-apple.com/media/Fig3_Personal_Voice_9baf9ab9e2.png)
Our Mel spectrum enhancement model is based on U-Net, trained with a noisy Mel spectrum as the input and a clean Mel spectrum as the output. The audio restoration model is a simple Chunked Autoregressive GAN (CARGAN) model that converts a clean Mel spectrum to an audio signal.
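For context on the representation these two models pass between them, here is a minimal NumPy computation of a log-Mel spectrogram. The FFT size, hop, and 80 Mel bands are common defaults chosen for illustration, not the shipped configuration:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(sr: int, n_fft: int, n_mels: int) -> np.ndarray:
    """Triangular filters spaced evenly on the Mel scale, mapped to FFT bins."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        left, center, right = bins[i], bins[i + 1], bins[i + 2]
        for k in range(left, center):               # rising slope
            fb[i, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):              # falling slope
            fb[i, k] = (right - k) / max(right - center, 1)
    return fb

def mel_spectrogram(audio, sr=16_000, n_fft=1024, hop=256, n_mels=80):
    """Windowed power spectra projected onto the Mel filterbank, in log scale."""
    fb = mel_filterbank(sr, n_fft, n_mels)
    window = np.hanning(n_fft)
    frames = []
    for start in range(0, len(audio) - n_fft + 1, hop):
        spectrum = np.abs(np.fft.rfft(audio[start:start + n_fft] * window)) ** 2
        frames.append(fb @ spectrum)
    return np.log(np.array(frames).T + 1e-10)  # shape: (n_mels, n_frames)

t = np.arange(16_000) / 16_000.0
mel = mel_spectrogram(np.sin(2 * np.pi * 440.0 * t))  # 1 s, 440 Hz tone
print(mel.shape)  # (80, 59)
```

The enhancement U-Net maps one such (n_mels, n_frames) array to a cleaner one; CARGAN then inverts the clean array back to a waveform, which a simple filterbank projection like this cannot do losslessly.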
With the speech enhancement pipeline, we found the generated voice quality improved significantly, especially with the real-world iPhone-recorded data that we collected from internal and external speakers. The MOS is 0.25 higher compared with the baseline pipeline, which does not include audio enhancement.
Figure 4 shows the final quality evaluation results for Personal Voice, in both mean opinion score and voice similarity score.
Conclusion
In this research highlight, we covered the technical details behind the Personal Voice feature, which accessibility users can use to create their own voice overnight, fully on device, and use with real-time speech synthesis to communicate with others. Our hope is that people at risk of losing their ability to speak, such as those with ALS or other conditions that can diminish speaking ability, may benefit greatly from the Personal Voice feature.
Acknowledgments
Many people contributed to this work, including Dipjyoti Paul, Jiangchuan Li, Luke Chang, Petko Petkov, Pierre Su, Shifas Padinjaru Veettil, and Ye Tian.
Apple Resources
Apple Developer. 2023. “Extend Speech Synthesis with Personal and Custom Voices.” [link.]
Apple Newsroom. 2023. “Apple Introduces New Features for Cognitive Accessibility, Including Live Speech, Personal Voice, and Point and Speak in Magnifier.” [link.]
Apple Support. 2023. “Create a Personal Voice on Your iPhone, iPad, or Mac.” [link.]
Apple YouTube. 2023. “Personal Voice on iPhone – The Lost Voice.” [link.]
References
Kalchbrenner, Nal, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, et al. 2018. “Efficient Neural Audio Synthesis.” [link.]
Morrison, Max, Rithesh Kumar, Kundan Kumar, Prem Seetharaman, Aaron Courville, and Yoshua Bengio. 2022. “Chunked Autoregressive GAN for Conditional Waveform Synthesis.” March. [link.]
Open SLR. n.d. “LibriTTS Corpus.” [link.]
Ren, Yi, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. 2021. “FastSpeech 2: Fast and High-Quality End-to-End Text to Speech.” March. [link.]
Silva-Rodríguez, J., M. F. Dolz, M. Ferrer, A. Castelló, V. Naranjo, and G. Piñero. 2021. “Acoustic Echo Cancellation Using Residual U-Nets.” September. [link.]