A neural network has been taught to replicate the human voice almost perfectly

Over the past year, DeepMind, a company engaged in the development of artificial intelligence technology, has shared details about WaveNet, its deep-learning neural network project that can be used to synthesize realistic human speech. An upgraded version of this technology was recently released and will serve as the basis of the Google Assistant digital mobile assistant.

Voice synthesis systems (also known as text-to-speech, or TTS) are usually built on one of two basic methods. The concatenative (or composite) method constructs phrases by stitching together separate fragments of words and partial words pre-recorded by a voice actor. The main disadvantage of this method is that the sound library must be re-recorded every time any updates or changes are needed.
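To make the concatenative idea concrete, here is a minimal sketch in Python. The unit inventory and file names are hypothetical; real systems select units by phonetic and prosodic context and smooth the joins.

    # Minimal sketch of concatenative TTS: look up pre-recorded units
    # and join their raw samples. File names and the unit inventory are
    # hypothetical and assumed to exist on disk in a shared format.
    import wave

    UNIT_FILES = {
        "h-e": "units/h-e.wav",
        "e-l": "units/e-l.wav",
        "l-o": "units/l-o.wav",
    }

    def concatenate(units, out_path="hello.wav"):
        frames, params = [], None
        for unit in units:
            with wave.open(UNIT_FILES[unit], "rb") as w:
                params = params or w.getparams()
                frames.append(w.readframes(w.getnframes()))
        with wave.open(out_path, "wb") as out:
            out.setparams(params)  # all units share one audio format
            out.writeframes(b"".join(frames))

    concatenate(["h-e", "e-l", "l-o"])

The sketch also shows why the method is brittle: every phrase is only as good as the recorded inventory, so any change to the voice means re-recording the library.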

The other method is called parametric TTS, and its distinguishing feature is that the computer generates the desired phrase from a set of parameters. The drawback of this method is that the result most often sounds unnatural, or "robotic."
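A toy illustration of the parametric idea: the waveform is computed from a few numeric parameters (pitch, duration) rather than assembled from recordings. The crude, flat-pitch excitation below is hypothetical, but it shows why naive parametric output can sound robotic.

    # Minimal sketch of the parametric idea: audio is generated from a
    # handful of parameters rather than recordings. The flat pitch and
    # simple harmonic excitation are exactly what makes naive
    # parametric output sound buzzy and monotone.
    import numpy as np

    def parametric_tone(f0_hz=120.0, dur_s=0.5, sr=16000):
        t = np.arange(int(dur_s * sr)) / sr
        # Pulse-train-like excitation: a few harmonics of f0
        wave = sum(np.sin(2 * np.pi * k * f0_hz * t) / k
                   for k in range(1, 6))
        return (wave / np.abs(wave).max()).astype(np.float32)

    audio = parametric_tone()  # 0.5 s of monotone "voiced" sound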

WaveNet, by contrast, produces sound waves from scratch using a system based on convolutional neural networks, in which the audio is generated layer by layer. To train the platform to synthesize "live" speech, it is first "fed" a huge number of samples, learning along the way which audio signals sound realistic and which do not. This lets the voice synthesizer reproduce naturalistic intonation and even such details as the sound of smacking lips. Depending on which samples are run through the system, it develops a unique "accent," which could eventually be used to create many different voices.
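The published WaveNet paper describes the generator as a stack of dilated causal convolutions, so each new sample is predicted from a rapidly growing window of past samples. The following is a minimal, illustrative sketch of that building block only, with random weights; the real model adds gated activations, residual and skip connections, and a softmax over quantized sample values.

    # Simplified sketch of WaveNet's core building block: dilated
    # causal convolutions stacked so the receptive field doubles with
    # every layer. Illustrative only, not the full published model.
    import numpy as np

    def causal_dilated_conv(x, w, dilation):
        # y[t] = sum_k w[k] * x[t - k*dilation]; zero-padded on the
        # left so each output depends only on past samples (causal).
        pad = dilation * (len(w) - 1)
        xp = np.concatenate([np.zeros(pad), x])
        return sum(w[k] * xp[pad - k * dilation : pad - k * dilation + len(x)]
                   for k in range(len(w)))

    def wavenet_stack(x, layers=8, kernel=2, seed=0):
        # Dilations 1, 2, 4, ..., 128 give a receptive field of 256
        # samples here, which is what lets the model capture structure
        # in raw audio one sample at a time.
        rng = np.random.default_rng(seed)
        for i in range(layers):
            w = rng.normal(size=kernel) * 0.5
            x = np.tanh(causal_dilated_conv(x, w, dilation=2 ** i))
        return x

    out = wavenet_stack(np.random.default_rng(1).normal(size=1000))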

A sharp tongue

Perhaps the greatest limitation of the WaveNet system was that it required a huge amount of computing power, and even then it was far from fast: generating just 0.02 seconds of audio took about one second of processing time.

After a year of work, DeepMind's engineers found ways to improve and optimize the system, and it can now produce one second of raw audio in just 50 milliseconds, 1,000 times faster than the original. The team also managed to raise the audio resolution from 8 bits to 16 bits per sample, which had a positive impact in listening tests. These gains opened the road for WaveNet's integration into consumer products such as Google Assistant.
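The 1,000x figure follows directly from the two throughput numbers quoted above, as a quick back-of-the-envelope check shows:

    # Throughput, in seconds of audio produced per second of compute
    old = 0.02 / 1.0   # original: 0.02 s of audio per 1 s of compute
    new = 1.0 / 0.05   # optimized: 1 s of audio per 50 ms of compute
    print(new / old)   # 1000.0 -- matching the reported speedup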

At the moment, WaveNet is used to generate English and Japanese voices for Google Assistant and all platforms that run the digital assistant. Because the system develops a particular type of voice depending on which set of samples it was trained on, Google will most likely soon add WaveNet support for synthesizing realistic speech in other languages as well, including their local dialects.

Speech interfaces are becoming more and more common across a variety of platforms, but their distinctly unnatural sound puts off many potential users. DeepMind's efforts to improve this technology will certainly contribute to the wider adoption of such voice systems and improve the experience of using them.

Examples of English and Japanese speech synthesized by the WaveNet neural network can be found at this link.

Nikolai Khizhnyak

