How does it work? | Speech synthesis

In the last issue we talked about speech recognition; today we will discuss the inverse problem. So how does speech synthesis work, or, in other words, how is arbitrary text converted to voice? Read about it in today's issue!

The task of speech synthesis is solved in several stages. First, a special algorithm prepares the text so that the robot can read it comfortably: it writes out all numbers in words and expands abbreviations. Then the text is split into individual phrases, each of which should be read with a continuous intonation; to do this, the system relies on punctuation marks and set expressions.
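The preparation stage can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not the algorithm any real synthesizer uses: the abbreviation table, the 0 to 99 number range, and the function names `normalize` and `split_phrases` are all invented for this example, and set expressions are ignored.

```python
import re

# Hypothetical, tiny abbreviation table; a real system would need a far larger one.
ABBREVIATIONS = {"Dr.": "Doctor", "km": "kilometers"}

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer from 0 to 99 (the only range this sketch covers)."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]}-{ONES[ones]}"


def normalize(text):
    """Expand known abbreviations and write one- and two-digit numbers in words."""
    for abbreviation, full in ABBREVIATIONS.items():
        pattern = r"(?<!\w)" + re.escape(abbreviation) + r"(?!\w)"
        text = re.sub(pattern, full, text)
    # Longer numbers are left untouched by this simplified pattern.
    return re.sub(r"\b\d{1,2}\b", lambda m: number_to_words(int(m.group())), text)


def split_phrases(text):
    """Split normalized text into phrases at punctuation marks."""
    parts = re.split(r"[,.;:!?]+", text)
    return [part.strip() for part in parts if part.strip()]


prepared = normalize("Dr. Smith ran 25 km, then rested.")
print(prepared)
print(split_phrases(prepared))
```

Abbreviations are expanded before phrases are split so that the period in "Dr." is not mistaken for the end of a phrase.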

Next, a phonetic transcription is made for every word. To understand how to read a word and where to place the stress, the system first consults its built-in, hand-compiled dictionary. If the word is missing, the computer builds the transcription itself, based on rules taken from academic reference books. If those are not enough, statistical rules come into play: the system looks through recordings of speakers and determines which syllable they stressed.
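The three-step fallback described above (dictionary, then rules, then statistics) might be sketched as follows. All of the data here is hypothetical: the dictionary entries, the single suffix rule, and the "observations" standing in for speaker recordings are made up for illustration, and `find_stress` is not a function from any real library.

```python
from collections import Counter

# Hypothetical hand-compiled dictionary: word -> index of the stressed syllable.
STRESS_DICTIONARY = {"synthesis": 0, "computer": 1}

# Hypothetical rule: words ending in "-ation" stress the second-to-last syllable.
# A negative index counts syllables from the end of the word.
SUFFIX_RULES = {"ation": -2}

# Hypothetical statistics: stressed-syllable indices heard in speaker recordings.
SPEAKER_OBSERVATIONS = {"vocoder": [0, 0, 1, 0]}


def find_stress(word):
    """Return (stressed syllable index, source of the answer) for a word."""
    word = word.lower()
    # 1. The dictionary has priority.
    if word in STRESS_DICTIONARY:
        return STRESS_DICTIONARY[word], "dictionary"
    # 2. Otherwise try the written rules.
    for suffix, position in SUFFIX_RULES.items():
        if word.endswith(suffix):
            return position, "rule"
    # 3. Otherwise pick the stress heard most often in the recordings.
    observed = SPEAKER_OBSERVATIONS.get(word)
    if observed:
        return Counter(observed).most_common(1)[0][0], "statistics"
    return None, "unknown"


for example in ("computer", "information", "vocoder"):
    print(example, find_stress(example))
```

Returning the source alongside the answer makes it easy to see which stage resolved each word.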

Once the transcription is ready, the computer calculates how many frames, that is, fragments 25 milliseconds long, are needed. Each frame is then described by a set of parameters: which phoneme it is part of, what place it occupies, and which syllable that phoneme belongs to. If the phoneme is a vowel, the description also records whether it is stressed or unstressed. In addition, the system recreates the correct intonation using information about the phrase and the sentence.
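As a rough illustration of the frame stage, the sketch below cuts phonemes into 25-millisecond frames and attaches a few descriptive parameters to each. The phoneme durations, the `Frame` fields, and the `build_frames` helper are assumptions made for this example; the article does not say how durations are obtained or which parameters a real system stores.

```python
import math
from dataclasses import dataclass

FRAME_MS = 25  # frame length given in the article


@dataclass
class Frame:
    phoneme: str            # phoneme this frame belongs to
    index_in_phoneme: int   # position of the frame inside the phoneme
    frames_in_phoneme: int  # total number of frames in the phoneme
    stressed: bool          # meaningful for vowels only


def build_frames(phonemes):
    """phonemes: list of (symbol, duration in ms, stressed) tuples."""
    frames = []
    for symbol, duration_ms, stressed in phonemes:
        # Round up so that even a very short phoneme gets at least one frame.
        count = max(1, math.ceil(duration_ms / FRAME_MS))
        for i in range(count):
            frames.append(Frame(symbol, i, count, stressed))
    return frames


# Hypothetical durations: a 50 ms consonant followed by a 110 ms stressed vowel.
frames = build_frames([("k", 50, False), ("a", 110, True)])
print(len(frames))
print(frames[2])
```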

The system then uses an acoustic model to read the prepared text. The model establishes a correspondence between phonemes with particular characteristics and sounds. Thanks to machine learning, the acoustic model knows how to pronounce a phoneme correctly and how to give the sentence the right intonation. The more data the model is trained on, the better the result it produces.

As for voices, what makes them recognizable is first of all the timbre, which depends on the structure of the speaker's vocal organs. The timbre of any voice can be modeled, that is, its characteristics can be described; it is enough to record a small amount of text in a studio. After that, the timbre can be used to synthesize speech in any language. When the system needs to say something, it uses a generator of sound waves, the vocoder. The vocoder is fed information about the frequency characteristics of the phrase, obtained from the acoustic model, as well as data on the timbre, which gives the voice its recognizable color.
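To give a feel for what "a generator of sound waves" means, here is a deliberately simplified Python toy. It is not how a production vocoder works: it assumes the frequency information is a single pitch value per frame and the timbre is a short list of harmonic weights, and the names `synthesize_frame`, `SAMPLE_RATE`, and the example values are invented for this sketch.

```python
import math

SAMPLE_RATE = 16000  # samples per second (an assumed value)
FRAME_MS = 25        # frame length given in the article


def synthesize_frame(pitch_hz, harmonic_weights):
    """Generate one frame of sound as a weighted sum of harmonics.

    pitch_hz stands in for the frequency information from the acoustic model;
    harmonic_weights (non-negative, not all zero) stand in for the timbre.
    """
    sample_count = SAMPLE_RATE * FRAME_MS // 1000
    total_weight = sum(harmonic_weights)
    samples = []
    for i in range(sample_count):
        t = i / SAMPLE_RATE
        value = sum(
            weight * math.sin(2 * math.pi * pitch_hz * (k + 1) * t)
            for k, weight in enumerate(harmonic_weights)
        )
        # Divide by the total weight to keep samples within [-1, 1].
        samples.append(value / total_weight)
    return samples


# Same pitch, two different "timbres": only the harmonic weights change.
bright = synthesize_frame(120.0, [1.0, 0.8, 0.6])
dull = synthesize_frame(120.0, [1.0, 0.1, 0.0])
print(len(bright), len(dull))
```

Changing only the weights while keeping the pitch fixed mirrors the idea that the same phrase can be spoken in different voices.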

It is worth noting that modern speech synthesis technology has some problems. The first is artificiality. Synthesized speech is hard for a person to perceive, and the listener has to spend additional effort to understand it. As a result, people can comfortably listen to synthesized speech for only about 20 minutes. Synthesized speech also, as a rule, lacks emotional coloring and has low noise immunity. In other words, any noise, even the slightest, interferes with a person's perception of synthesized speech.

Hi-News.ru