Voice Conversion: Definition, Usage & Concerns

Now let’s go one layer deeper and talk about a specific category of biometric data and the role AI plays in it: voice.

Definition

According to Merriam-Webster, a voice means:

“a sound produced by vertebrates by means of lungs, larynx, or syrinx, especially sound so produced by human beings.”

It’s the unique sound of a person, including their tone and pitch.

Voice Conversion, then, is the means of converting one voice to another: a technique that modifies a speech waveform to change the non-linguistic information while preserving the linguistic information (i.e., same words, different voice).

But wait!

Human speech has more than just the voice and words: there is rhythm, accent, speed, pauses, and vocal and verbal habits (everyone knows that one person who drops “uhm” every 2 seconds, or the ones who emphasize the wrong syllables on certain words). Speech Conversion takes voice conversion to the next level by carrying the emotion, emphasis, and delivery style of the speaker over into another voice.

In short, the original words, emotion, and style are carried over and conveyed in a new voice. This is what the general population thinks of when they hear the phrase Voice Conversion. (From here on, I will use the two phrases interchangeably.)
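To make that split concrete, here is a toy sketch in Python. It is not how any real system works internally (real systems operate on audio waveforms, not labels); the `Utterance` class, its field names, and `convert_voice` are all made up for illustration. It only shows which parts change and which parts stay.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Utterance:
    """Hypothetical stand-in for a piece of recorded speech."""
    words: str  # linguistic information: what is said
    voice: str  # who it sounds like (tone, pitch)
    style: str  # how it is said (emotion, emphasis, rhythm)


def convert_voice(utterance: Utterance, target_voice: str) -> Utterance:
    """Swap only the voice; words and style are carried over untouched."""
    return replace(utterance, voice=target_voice)


original = Utterance(words="see you tomorrow", voice="speaker_a", style="cheerful")
converted = convert_voice(original, "speaker_b")
print(converted)
```

Same words, same delivery, different voice: that is the whole idea, minus the very hard signal processing.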

Technology

The most common technology currently available is Text-to-Speech (TTS), where a speech synthesis system converts text into speech. I will explain where this is used in a later section.

[Figure: Basic architecture of modern text-to-speech systems.]

Think of it as a computer reading the given text in a preset voice chosen by the user. Because the system doesn’t know what emotion, accent, or delivery the text is supposed to be read in, it must stick to whatever default delivery the chosen voice has (which is usually a calm monotone).

On the flip side is Speech-to-Text (STT), most often called speech recognition. Again, I will explain the use cases in the next section.

The system hears speech, extracts the words, and transcribes them. The emotion from the speech is, of course, lost.

In the sweet spot is Speech-to-Speech (STS). This technology is somewhat new: the system, in theory, hears the speech, separates the text from the rest of the input (voice, emotion, style, etc.), transcribes the text, has a new voice read it, and then adds the emotion and style of the original speaker back into the new voice. The result would be much more authentic and useful.

I say this is somewhat new because a lot of companies claim they can do STS when in reality they are simply chaining STT and TTS, in which case the emotion and style of the original speaker are already lost in the transition from speech to text.
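The difference between the two pipelines can be sketched with a toy model. Everything here is hypothetical and invented for illustration (the `Speech` class, the function names, the `NEUTRAL` default); real systems work on audio, not on labels. The point is only to show where the style goes missing.

```python
from dataclasses import dataclass

NEUTRAL = "calm monotone"  # assumed default delivery of a preset TTS voice


@dataclass(frozen=True)
class Speech:
    """Hypothetical stand-in for an audio clip."""
    words: str
    voice: str
    style: str


def speech_to_text(speech: Speech) -> str:
    # STT keeps only the words; voice and style are dropped here.
    return speech.words


def text_to_speech(text: str, voice: str) -> Speech:
    # TTS only sees text, so it falls back to the voice's default delivery.
    return Speech(words=text, voice=voice, style=NEUTRAL)


def cascade_stt_tts(speech: Speech, target_voice: str) -> Speech:
    # "STS" built by chaining STT and TTS: style is lost at the text step.
    return text_to_speech(speech_to_text(speech), target_voice)


def speech_to_speech(speech: Speech, target_voice: str) -> Speech:
    # Idealized STS: new voice, original words and style preserved.
    return Speech(words=speech.words, voice=target_voice, style=speech.style)


original = Speech(words="I can't believe we won!", voice="speaker_a", style="excited")
print(cascade_stt_tts(original, "speaker_b").style)   # calm monotone
print(speech_to_speech(original, "speaker_b").style)  # excited
```

Both pipelines produce the right words in the new voice. Only the second one still sounds like someone who just won.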

Usage

So where is this used? EVERYWHERE.

TTS is used in smart speakers (Alexa & Google), smartphones (Siri & Bixby), navigation systems (anyone else’s dad try to have a conversation with their navigation system?), bus stop announcements (“next stop is Downtown Berkeley BART”), smart pens and notepads, and basically any IoT or smart device that talks to you that you can think of. It is also widely used to create advertisements, social media content, explainer videos, YouTube videos, and corporate training and education materials.

STT is also used in most smart devices, like when your smart speakers understand your commands, your conference call app gives you the text version of your call recordings, or when Apple, Facebook, and Google are listening in on you. (shhhhhhh)

Aside from these, there are voice filters (e.g., Snapchat and other free apps) that make you sound like a cat, a hairy monster, or a dolphin. These voice filters are to current speech synthesis tech as Australopithecus and Neanderthals were to Homo sapiens and Homo sapiens sapiens 🙂

Concern

Some people are alarmed by the possibility of their voices being used without their knowledge, or in malicious, illegal ways. Others are concerned that the technology is going to take away jobs. While they are right to be concerned, it shouldn’t be a divisive confrontation. Just as we must be wary of how our images, names, affiliations, and identities are used by ourselves or others, we must ensure that our laws and ethics develop in step with the technology to create a safe environment.

AI is on track to become much more advanced, but in the near future the wide range of human emotions and traits cannot be expressed and mimicked perfectly by machines alone. Humans will always be in the loop, and it is up to those involved to create a community.

Afterthought

Once we get STS tech right, so that it preserves emotion, style, and delivery, produces more natural-sounding voices, and places minimal constraints on the input, the potential is endless: content creation, media, entertainment, business, health, meetings, IoT, you name it.

In the next article, I’ll talk about what kinds of companies are in the market, where LOVO fits in, and how LOVO is tackling the potential use cases mentioned above. I can’t give away everything… yet.

If you are eager to get started, check out Genny for free today!