
How to fake a voice using a neural network


There are more and more ways to identify a person by voice. At the same time, researchers keep finding ways to bypass these mechanisms, both to protect their own personal information and to break into systems protected this way. Let's look at scientists' most recent achievements in this area.


Generating voice

The human voice is produced by the movement of the vocal cords, tongue, and lips. A computer has only numbers representing the wave recorded by a microphone. How, then, does a computer create sound that we can hear from speakers or headphones?

Text to speech

One of the most popular and well-studied methods of generating speech is converting the text to be spoken directly into sound. The earliest programs of this kind glued individual letters into words, and words into sentences.
As synthesizers developed, the pre-recorded set of letters became a set of syllables, and then whole words.
The advantages of such utilities are obvious: they are easy to write, use, and maintain; they can reproduce every word in the language; and they are predictable. All of this once made them commercially viable. But the quality of the voice created this way leaves much to be desired. We all remember the telltale features of such a generator: flat, emotionless speech, incorrect stress, and words and letters torn apart from each other.
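The splicing idea above can be sketched in a few lines. This is a toy illustration, not any real synthesizer: the "pre-recorded units" are stand-in sine bursts, and the 16 kHz sample rate is an assumption.

```python
import numpy as np

SR = 16000  # assumed sample rate, Hz


def tone(freq, dur):
    """Stand-in for a studio-recorded unit: a short sine burst."""
    t = np.linspace(0, dur, int(SR * dur), endpoint=False)
    return 0.5 * np.sin(2 * np.pi * freq * t)


# hypothetical "pre-recorded" units keyed by letter
units = {"a": tone(440, 0.1), "b": tone(330, 0.1), "c": tone(550, 0.1)}


def naive_tts(text):
    """Glue units end to end -- the source of the choppy, robotic sound."""
    return np.concatenate([units[ch] for ch in text if ch in units])


wave = naive_tts("abc")
print(len(wave))  # 3 units of 0.1 s at 16 kHz -> 4800 samples
```

The abrupt joins between units are exactly why such output sounds robotic: there is no transition between one clip's tail and the next clip's head.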

Sounds in speech

This method of speech generation replaced the first one relatively quickly, since it imitates human speech better: we pronounce sounds, not letters. That is why systems based on the International Phonetic Alphabet (IPA) sound higher quality and are more pleasant to listen to.
The method relies on individual sounds pre-recorded in a studio, which are then joined into words. Compared with the first method, the improvement is clear: instead of simply splicing audio clips together, the sounds are blended, using either mathematical rules or neural networks.
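One simple mathematical blending rule is a linear crossfade: the tail of one unit ramps down while the head of the next ramps up. A minimal sketch, with the overlap length chosen arbitrarily:

```python
import numpy as np


def crossfade(a, b, overlap):
    """Blend the tail of `a` into the head of `b` with linear ramps,
    instead of butt-splicing them."""
    fade = np.linspace(0.0, 1.0, overlap)
    mixed = a[-overlap:] * (1 - fade) + b[:overlap] * fade
    return np.concatenate([a[:-overlap], mixed, b[overlap:]])


a = np.ones(100)
b = np.ones(100)
out = crossfade(a, b, 20)
print(len(out))  # 100 + 100 - 20 = 180
```

Because the two ramps always sum to one, a constant signal passes through the join unchanged, which is what removes the audible click of a hard splice.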

Speech to speech

A relatively new approach is based entirely on neural networks. WaveNet, an autoregressive architecture built by researchers at DeepMind, can convert text or sound directly into other sound, without relying on pre-recorded building blocks (research paper).
The key to this technology is a stack of dilated causal convolutions: each output sample depends only on the samples before it, and the growing dilations let the network take into account context far back in the waveform.
WaveNet scheme
In general, this architecture works with any kind of sound wave, regardless of whether it is music or a human voice.
There are several projects based on WaveNet.
To recreate speech, such systems combine a generator that turns text into phonetic notation with an intonation generator (stress, pauses) to produce a natural-sounding voice.
This is the most advanced speech-generation technology: instead of merely gluing or mixing sounds that mean nothing to the machine, it creates the transitions between sounds and the pauses between words on its own, and changes the pitch, strength, and timbre of the voice for the sake of correct pronunciation, or for any other purpose.
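The building block WaveNet stacks, the dilated causal convolution, can be sketched in plain numpy. This is a minimal illustration of the operation itself, not DeepMind's implementation:

```python
import numpy as np


def causal_dilated_conv(x, w, dilation):
    """1-D causal convolution with dilation: the output at time t sees
    only x[t], x[t-d], x[t-2d], ... -- never future samples."""
    pad = dilation * (len(w) - 1)
    xp = np.concatenate([np.zeros(pad), x])  # left-pad so output is causal
    taps = [w[k] * xp[pad - k * dilation: pad - k * dilation + len(x)]
            for k in range(len(w))]
    return np.sum(taps, axis=0)


# an impulse reveals the two taps, spaced `dilation` samples apart
y = causal_dilated_conv(np.array([1., 0., 0., 0., 0.]),
                        np.array([1., 1.]), dilation=2)
print(y)  # [1. 0. 1. 0. 0.]
```

Stacking such layers with dilations 1, 2, 4, 8, ... doubles the receptive field at each layer, which is how the network sees far into the past at a small computational cost.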

Making a fake voice

For the simplest identification, almost any method will work: for particularly lucky attackers, even five seconds of raw recorded voice may be enough. But to bypass a more serious system, built, for example, on neural networks, we need a real, high-quality voice generator.

How the voice simulator works

It takes a lot of effort to build a plausible voice-to-voice model on top of WaveNet: you have to record a large amount of text spoken by two different people, aligned second for second, which is hard to do. However, there is another way.
Using the same principles as the speech-synthesis technology above, an equally realistic transfer of all the voice's parameters can be achieved. This is how a program that clones a voice from a short speech recording came about, and it is what we will use.
The program consists of several important parts that run in sequence, so let's go through them stage by stage.

Voice coding

Each person's voice has a number of characteristics; they cannot always be recognized by ear, but they matter. To reliably tell one speaker from another, the right approach is to train a dedicated neural network, an encoder, that forms its own set of features for each person.
This encoder makes it possible not only to transfer the voice later, but also to compare the results against the target.
This is what 256 voice characteristics look like
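How such 256-dimensional embeddings get compared can be sketched with cosine similarity: two clips from the same speaker should land close together in the embedding space. The embeddings below are random stand-ins, not the output of a real encoder:

```python
import numpy as np


def cosine_similarity(u, v):
    """Similarity of two speaker embeddings; values near 1.0 suggest
    the same speaker."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


rng = np.random.default_rng(0)
emb_a = rng.normal(size=256)                       # hypothetical speaker A
emb_a2 = emb_a + rng.normal(scale=0.1, size=256)   # same speaker, new clip
emb_b = rng.normal(size=256)                       # hypothetical speaker B

print(cosine_similarity(emb_a, emb_a2) > cosine_similarity(emb_a, emb_b))
```

A real system would threshold this similarity to decide "same speaker or not"; the cloning program instead feeds the embedding forward to condition the synthesizer.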

Creating a spectrogram

Based on these characteristics, a mel spectrogram of the sound can be generated from the text. This is done by a synthesizer based on Tacotron 2, which builds on WaveNet.
An example of a generated spectrogram
The generated spectrogram contains all the information about pauses, sounds, and pronunciation, with all the pre-computed voice characteristics already baked in.
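A mel spectrogram itself is just framed FFT magnitudes passed through triangular filters spaced evenly on the mel scale. The following is a bare-bones numpy sketch (parameters are illustrative, not the ones Tacotron 2 uses):

```python
import numpy as np


def mel_spectrogram(x, sr=16000, n_fft=512, hop=128, n_mels=40):
    """Minimal mel spectrogram: framed FFT magnitudes mapped through a
    triangular mel filter bank. A sketch, not a tuned implementation."""
    # frame the signal and take magnitude spectra
    frames = [x[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(x) - n_fft + 1, hop)]
    mag = np.abs(np.fft.rfft(frames, axis=1))        # shape (T, n_fft//2+1)

    # triangular filters spaced evenly on the mel scale
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    inv = lambda m: 700 * (10 ** (m / 2595) - 1)
    pts = inv(np.linspace(0, mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = np.linspace(0, 1, c - l, endpoint=False) if c > l else 0
        fb[i, c:r] = np.linspace(1, 0, r - c, endpoint=False) if r > c else 0
    return mag @ fb.T                                 # shape (T, n_mels)


x = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000)  # 1 s, 440 Hz tone
S = mel_spectrogram(x)
print(S.shape)
```

The mel scale compresses high frequencies the way human hearing does, which is why this representation is a convenient intermediate target for the synthesizer.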

Sound synthesis

Now another neural network, based on WaveRNN, gradually builds a sound wave from the mel spectrogram. That wave is played back as the finished audio.
All characteristics of the main voice are preserved in the synthesized sound, which, albeit not without difficulty, recreates the original human voice on any text.
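The "gradually" above is the key property: the vocoder is autoregressive, emitting one sample at a time conditioned on what it has already produced. A toy sketch of that loop shape, with a trivial stand-in for the trained network:

```python
import numpy as np


def generate(step, n_samples, seed=0.0):
    """Autoregressive generation: each new sample is predicted from the
    previous one, the way WaveRNN emits a waveform sample by sample.
    `step` stands in for the trained network's one-step prediction."""
    out = [seed]
    for _ in range(n_samples - 1):
        out.append(step(out[-1]))
    return np.array(out)


# stand-in "network": a decaying recurrence, just to show the loop shape
wave = generate(lambda s: 0.99 * s + 0.01, 100)
print(wave.shape)  # (100,)
```

In the real model, `step` would also be conditioned on the current mel-spectrogram frame, which is what steers the waveform toward the target speech.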

Method testing

Now that we know how to create a believable voice imitation, let's put it into practice. There are two simple but working methods of identifying a person by voice: analyzing mel-cepstral coefficients (MFCCs) and using a neural network specially trained to recognize a particular person. Let's find out how well fake recordings can fool these systems.
Let's take a five-second recording of a man's voice and create two recordings with our tool.
Let's compare these recordings using the mel-cepstral coefficients.
The coefficients on the chart
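A toy version of such a numeric comparison: treat the recordings as MFCC matrices and accept the probe if the average per-frame distance stays under a threshold. The matrices and the threshold here are made up for illustration:

```python
import numpy as np


def verify(ref_mfcc, probe_mfcc, threshold=5.0):
    """Toy verification: accept the probe if the mean per-frame distance
    between mel-cepstral feature matrices is below a threshold."""
    dist = float(np.mean(np.linalg.norm(ref_mfcc - probe_mfcc, axis=1)))
    return dist, dist < threshold


# hypothetical 10-frame, 13-coefficient MFCC matrices
ref = np.zeros((10, 13))
close_fake = ref + 0.1   # a good clone: features nearly match
crude_fake = ref + 2.0   # a crude fake: features clearly differ

print(verify(ref, close_fake)[1], verify(ref, crude_fake)[1])  # True False
```

Real systems are more elaborate (they normalize features and align frames in time), but the principle is the same: a good clone keeps the distance below the acceptance threshold.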
The difference in the coefficients is also visible in the numbers. How will the neural network react to such a good fake?
It turned out to be possible to convince the neural network, though not perfectly. Serious security systems, such as those installed in banks, will most likely detect the fake, but a person, especially over the phone, is unlikely to distinguish a real interlocutor from a computer imitation.

Conclusions

Faking a voice is no longer as difficult as it used to be, and this opens up great opportunities not only for attackers but also for content creators: indie game developers will be able to produce high-quality, inexpensive voice acting, animators will be able to voice their characters, and film directors will be able to make convincing documentaries.

And although high-quality speech-synthesis technologies are still developing, their potential is already breathtaking. Soon every voice assistant will have its own personal voice, not metallic and cold but filled with emotion; tech-support chats will stop being annoying; and you will be able to have your phone answer unwanted calls for you.

Source - https://tech-geek.ru/voice-cloning-with-neural-network/
