How to Use Multiple Text-to-Speech Voices in the Same Audio

When you hear the term “text-to-speech” (TTS), it is tempting to imagine a robot reading out a piece of text. Think of the little widget that pops up on a Google search that tells you how a word is pronounced. You click the button and a flat, robotic voice reads out the chosen word.

While this type of TTS is useful for getting an idea of how words might be pronounced, or for audio descriptions where only the words themselves are important, it is lacklustre, to say the least, when it comes to producing audio for videos, podcasts, or virtual assistants.

But now, with the help of artificial intelligence (AI), you can create multiple text-to-speech voices that sound just like real people, giving your projects an authentic, human-sounding touch. In this guide, we’ll show you how to use multiple TTS voices to create a single, integrated audio track.

Why Use AI for Multiple Text-to-Speech Voices?

You may be wondering why you would need to use an AI-driven TTS program in the first place. There are three good reasons for doing so.

First, the AI voices produced by TTS programs sound far more natural. In fact, the best of them can be very hard to tell apart from human voices. This allows you to create audio tracks that sound as if they have been performed by a voice actor, in far less time and at a far lower cost.

Second, you are able to customize your voice as you wish. Good AI TTS tools offer a range of voice options, allowing you to choose a male or female voice, voices in different languages, and voices with different accents, among many other options. This adds variety and can be helpful if you have a specific voice or target audience in mind.

Third, if you are creating a track with multiple speakers, you can choose multiple, different-sounding voices. This helps listeners tell who is talking at any given time, just as they would in a real conversation.

Building on that final point, let’s look at how you can use multiple text-to-speech voices in the same audio track. For this example, we’ll walk you through how to do it using LOVO AI’s Genny.

How to Create Multiple Text-to-Speech Voices

Step 1

Log into your LOVO AI account and open Genny. From there, you can begin a new project or jump back into one that you have been working on.

Click “New Project” and select the “AI Voice and Video” project type. We will be using “AI Voice and Video” for this example, as the “Short Voiceover” option only allows for one voice and one text block per project.

Step 2

Once you are in the new project, it is time to select the voices you would like to use. You can do this in two ways:

Use the “Speaker Selection” tab on the left side of the screen to select a voice from our library. These voices come in a range of styles, accents, and languages, and can be tweaked to your tastes.
Use the “Voice Cloning” tab to create a new AI voice. If you choose this option, you will need to upload some good-quality voice clips of your chosen voice in order for Genny to create your voice. Read our full guide on creating the perfect voice clone here!

When creating a track with multiple voices, it is important to select a different voice for each speaker. If you choose the same voice, it will be almost impossible for a listener to tell who is speaking at a given moment.

Step 3

With your voices chosen, you can start creating your audio track. To do this, type or paste your text into the boxes next to each speaker. Once you have the text in place, hit the generate button on the right and Genny will create your TTS track. This may take a couple of minutes depending on how long your text is.

Once the track is generated, you can listen to it by clicking the “Play” button on the right. Each new piece of audio will also be dropped into the timeline at the bottom of the screen. Then, click “Add a new block” to add another block of text, select your speaker, and add your next section of text.

Repeat this process for all of the text that you would like each of your speakers to say. Again, each new section of audio will be dropped into the timeline next to the respective speaker.
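
If your script starts out as a plain-text dialogue, it can help to split it into per-speaker blocks before pasting each one into Genny. The Python sketch below is only an illustration and is not part of Genny. It assumes every new line of dialogue looks like `Name: text`, and the function names `split_script` and `check_distinct_voices`, as well as the voice labels in the usage note, are hypothetical.

```python
def split_script(script):
    """Split 'Speaker: text' lines into an ordered list of (speaker, text) blocks.

    Assumption: a line without a colon continues the previous speaker's block.
    """
    blocks = []
    for raw in script.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        speaker, sep, text = line.partition(":")
        if not sep:
            # No colon found: treat as a continuation of the previous block.
            if blocks:
                prev_speaker, prev_text = blocks[-1]
                blocks[-1] = (prev_speaker, prev_text + " " + line)
            continue
        blocks.append((speaker.strip(), text.strip()))
    return blocks


def check_distinct_voices(voice_for_speaker):
    """Return True if every speaker has been given a different voice."""
    voices = list(voice_for_speaker.values())
    return len(voices) == len(set(voices))
```

For example, `split_script("Anna: Hi there.\nBen: Hello!")` gives one block per speaker in the order they talk, and `check_distinct_voices({"Anna": "voice_a", "Ben": "voice_b"})` confirms that no two speakers share a voice before you start generating audio.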

Step 4

Once you have all of your audio in place, it is time to arrange and customize it. In the timeline at the bottom of the screen, you will see all of the audio tracks laid out in the order you generated them. You can listen to the full track with the “Play” button above the timeline, as well as skip forwards and back, change the play speed, and use the yellow bar to scrub precisely. You are also able to zoom in on the timeline for a closer view.

Now, you can arrange your individual voice tracks into one audio track in whatever way you desire. You can change the order of the voice tracks, choose when they start, and even have them overlap with one another.
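
To make the ideas of order, start time, and overlap more concrete, here is a small Python sketch of a timeline. It is a conceptual model only and does not reflect how Genny stores or exports projects. The `Clip` class, its fields, and the helper functions are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    speaker: str
    start: float     # seconds from the beginning of the track
    duration: float  # length of the generated audio in seconds

    @property
    def end(self):
        return self.start + self.duration


def total_length(clips):
    """Length of the combined track: the latest end time of any clip."""
    return max((c.end for c in clips), default=0.0)


def overlapping_pairs(clips):
    """Return (speaker, speaker) pairs for clips that play at the same time."""
    ordered = sorted(clips, key=lambda c: c.start)
    pairs = []
    for i, a in enumerate(ordered):
        for b in ordered[i + 1:]:
            if b.start >= a.end:
                break  # later clips start even later, so none can overlap a
            pairs.append((a.speaker, b.speaker))
    return pairs
```

Moving a clip in the timeline corresponds to changing its `start` value here. Two clips overlap when one starts before the other has ended, which is useful when you want one speaker to cut in on another.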

When you have arranged your tracks in the correct order and with the correct timing, you can listen back to the track and ensure it sounds exactly how you were hoping.

And with that, you’re finished. Now, you can export your multiple-speaker audio track and integrate it into your next project!

The Best Choice for Multiple Text-to-Speech Voices

With its vast range of TTS voices and powerful timeline editor, Genny is your one-stop platform for all your text-to-speech needs. Create high-quality projects with our online video editor, which lets you go beyond generating human-like AI voices and creating your own custom voices. With Genny’s AI tools, you can create images, write scripts, edit videos, and add subtitles in just a few clicks.