Computer programmer obsessed with audio and synthetic media.
If you thought voice cloning and deepfakes were recent buzzwords, think again. The first recorded attempt to mimic the human voice dates back to 1779, in Russia, where Professor Christian Kratzenstein, working in his lab in St. Petersburg, built acoustic resonators that mimicked the human vocal tract when activated by vibrating reeds (much like wind instruments).
Nowadays, voice cloning with artificial intelligence is used for a myriad of applications in industries such as film, video game production, audiobooks and podcasts, and more.
Still, we cannot ignore the high likelihood of unethical applications of this technology, through which people could be made to believe that someone said something they never said.
This raises the need for safeguards to prevent malevolent uses of synthesized voices. These safeguards may pertain to various domains:
- technological ones, like authentication by means of an audio fingerprint layered atop generated voices
- deepfake detection through high-level representations of voice samples
- mathematical ones, such as algorithms that determine whether a voice was produced by a human vocal tract, based on features like prosody, phase, frequency, and tone
- training humans’ ability to detect deepfakes by repeated exposure
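The first safeguard in the list, an audio fingerprint layered atop generated voices, can be illustrated with a toy spread-spectrum watermark: a low-amplitude pseudorandom signal, keyed by a secret seed, is added to the synthetic audio and later detected by correlation. This is a minimal, illustrative sketch (the function names and parameters are my own, and production systems use far more robust, perceptually shaped schemes):

```python
import numpy as np

def embed_watermark(audio, seed, strength=0.02):
    """Add a low-amplitude pseudorandom watermark keyed by `seed`."""
    rng = np.random.default_rng(seed)
    mark = rng.choice([-1.0, 1.0], size=audio.shape)  # pseudorandom +/-1 sequence
    return audio + strength * mark

def detect_watermark(audio, seed, threshold=0.01):
    """Correlate the audio against the keyed sequence; a high score means 'watermarked'."""
    rng = np.random.default_rng(seed)
    mark = rng.choice([-1.0, 1.0], size=audio.shape)
    score = float(np.mean(audio * mark))  # ~= strength if watermark present, ~= 0 otherwise
    return score > threshold

# Demo on one second of synthetic "speech" (a 440 Hz sine wave at 16 kHz)
sr = 16000
t = np.arange(sr) / sr
clean = 0.5 * np.sin(2 * np.pi * 440 * t)
marked = embed_watermark(clean, seed=42)

print(detect_watermark(marked, seed=42))  # True: watermark detected with the right key
print(detect_watermark(clean, seed=42))   # False: no watermark in the clean signal
```

Note that detection requires knowing the secret seed: correlating with the wrong key yields a near-zero score, which is what makes such a fingerprint usable for authentication rather than mere tagging.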
How does voice cloning with artificial intelligence actually work?
First, you need a large amount of recorded speech to build a dataset, which is then used to train a voice model. The embeddings of each speaker can then be used to synthetically articulate phrases. Users can also feed their own text to an AI tool, which gives it voice. The novelty resides in the fact that the tool can utter even bits of discourse that the original speaker never said.
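The pipeline just described can be sketched in miniature: extract features from each utterance, average them into a speaker embedding, and compare (or condition a synthesizer on) those embeddings. The sketch below stands in a fixed random projection for a real encoder, so every name and the feature scheme are illustrative assumptions, not a real TTS stack:

```python
import numpy as np

def utterance_features(audio, n_feats=64, seed=0):
    # Stand-in for a real encoder (e.g., spectrogram features fed to a neural net):
    # here we simply project the waveform onto a fixed random basis.
    rng = np.random.default_rng(seed)
    basis = rng.standard_normal((n_feats, audio.size))
    return basis @ audio

def speaker_embedding(utterances):
    """Average per-utterance feature vectors into one unit-norm speaker embedding."""
    feats = np.stack([utterance_features(u) for u in utterances])
    emb = feats.mean(axis=0)
    return emb / np.linalg.norm(emb)

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "speakers": one-second sines at different pitches, plus a little noise
sr = 16000
t = np.arange(sr) / sr
rng = np.random.default_rng(1)

def make_utterance(pitch_hz):
    return np.sin(2 * np.pi * pitch_hz * t) + 0.05 * rng.standard_normal(sr)

alice = speaker_embedding([make_utterance(120) for _ in range(3)])
bob = speaker_embedding([make_utterance(220) for _ in range(3)])
alice2 = speaker_embedding([make_utterance(120) for _ in range(3)])

print(cosine_similarity(alice, alice2))  # high: same speaker
print(cosine_similarity(alice, bob))     # low: different speakers
```

In a real system, the embedding from `speaker_embedding` would condition a neural synthesizer, which is what lets the cloned voice speak text the original speaker never recorded.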
Practical use cases of voice cloning with AI
Let’s dive deeper and discuss some representative use cases for voice cloning with artificial intelligence.
1. Give people back their ability to speak naturally
Conditions like amyotrophic lateral sclerosis (ALS), apraxia, Huntington’s disease, autism, strokes, or traumatic brain injuries can sadly leave people unable to use their voice. If they bank their voice before losing it, it can later be recreated via voice cloning with AI.
Additionally, crowdsourced speech and vocalization samples captured from people who can’t speak normally can be matched with nonverbal people who are likely to sound similar. This way, even people who were born without the ability to speak can eventually utter words.
2. Foster the development of a working alliance with clients
This is especially relevant for healthcare professionals and social workers, particularly when they venture into the online space. Voice is a crucial element in the relationship between professionals in these fields and their clients, primarily because it elicits clients’ trust.
Consequently, voice cloning may be used to build digital avatars on the web and within apps. The working alliance can thus grow stronger even in the absence of face-to-face contact. And this is definitely something to take seriously after the experience of physical distancing during the Coronavirus pandemic.
3. Facilitate actors’ work as brand voices
This application area is a good example of the positive commercial implications of voice synthesis technology. Rather frequently, brand voices have to record phone trees for interactive voice response systems or scripts for corporate training videos, and, when necessary, deal with the mistakes and modifications that creep into voiceover scripts.
Voice cloning with AI reduces the need for additional recordings, and thus allows actors to make better, more creative use of their time. More pragmatically, voice synthesis also raises actors’ chances of being paid residuals.
4. Provide interactive content for online learning courses
The importance of this use case for voice cloning with AI became even more salient during COVID-19 lockdowns. Voice conversion technology makes recording audio notes easier: there is no need to re-record for every new session, or to manually fix the mistakes of previous sessions.
The operational costs of professionally recorded lectures are dramatically reduced, and students can really benefit from the educational materials as if they were in a regular classroom.
5. Replicate anybody’s voice to create a perfect match in film and TV
Voice synthesis helps you dub an actor’s voice in post-production, or bring back on screen the voice of an actor who has sadly passed away. The former saves time because you no longer need to wait until a hard-to-get actor can make it to the recording studio. Speech synthesis technology lets you scale voices and record new lines anytime, which means you are no longer bound by strict adherence to the original script.
6. Create speech that’s indistinguishable from the original speaker for game developers
If high-demand actors know that they won’t have to spend ages in the recording studio because their cloned voice can ‘take over’, they are more likely to consider working with you on a game. No more “now or never”: voice cloning offers more flexibility, allowing you to make changes even after the recordings.
The ability to ‘give old voices new life’ also benefits game developers, whether that means adding historical voices to the game script or being able to finish a game with an actor who has unfortunately passed away.
7. Set the ideal tone for your ad
Voice cloning streamlines the workflow for the production of advertisements. All you need in order to start a commercial video is a high-quality recording of the voice you’d like to replicate. Replication allows the use of voices which would otherwise be very difficult to record, e.g., unavailable actors, kids, historical figures. This can contribute significantly to lowering production time and costs.
8. Speed up and ease the dubbing process
Voice conversion technology saves the time you would otherwise have to invest in voiceover work during post-production. Given the monotony of that process, the spared nerves are worth counting among the benefits of voice synthesis, too.
If you choose a language-agnostic technology, you can easily record the voice you need in any language and then translate it in an automated fashion. You can also adapt to your target audience by using precisely the accent that is presumably best received in a particular region.
If you’re curious about recent voice cloning projects, take a look at the Nixon project. A group of researchers, journalists, and artists at MIT teamed up with voice cloning company Respeecher and video dialogue replacement (VDR) company Canny AI to create an alternate history of the first venture to the moon, in which astronauts Neil Armstrong and Edwin “Buzz” Aldrin fail their mission and are stranded on the moon.
They created a posthumous deepfake by altering an actual video of President Nixon, making it possible to actually hear him inform the world that the journey to the moon had a tragic outcome.