Digital sound-alikes
When human testing cannot determine whether a voice is a synthetic imitation of a person's voice or an actual recording of that person's real voice, it is a digital sound-alike.
Living people can defend¹ themselves against a digital sound-alike by denying the things it says, provided those are presented to them, but dead people cannot. Digital sound-alikes offer criminals new disinformation attack vectors and wreak havoc on provability.
Timeline of digital sound-alikes
- In 2016 Adobe Inc.'s Voco, an unreleased prototype, was publicly demonstrated. (View and listen to Adobe MAX 2016 presentation of Voco)
- In 2016 WaveNet, by the Google-owned DeepMind, also demonstrated the ability to steal people's voices.
- At the 2018 Conference on Neural Information Processing Systems, the work 'Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis' (at arXiv.org) was presented. The pre-trained model is able to steal voices from a sample of only 5 seconds with almost convincing results.
- As of 2019 Symantec research knows of 3 cases where digital sound-alike technology has been used for crimes.[1]
Examples of speech synthesis software not quite able to fool a human yet
There are some other contenders for creating digital sound-alikes, though as of 2019 their speech synthesis does not yet fool a human in most use scenarios, because the results contain telltale signs that give them away as synthesized speech.
- Lyrebird.ai (listen)
- CandyVoice.com (test with your choice of text)
- Merlin, a neural network based speech synthesis system by the Centre for Speech Technology Research at the University of Edinburgh
Documented digital sound-alike attacks
- 'An artificial-intelligence first: Voice-mimicking software reportedly used in a major theft', a 2019 Washington Post article
Example of a hypothetical digital sound-alike attack
A very simple example of a digital sound-alike attack is as follows:
Someone uses a digital sound-alike to call somebody's voicemail from an unknown number and to speak, for example, illegal threats. In this example there are at least two victims:
- Victim #1 - The person whose voice has been stolen into a covert model, from which a digital sound-alike has been made to frame them for crimes
- Victim #2 - The person to whom the illegal threat is presented in a recorded form by a digital sound-alike that deceptively sounds like victim #1
- Victim #3 - It could also be argued that victim #3 is our law enforcement system, as it is made to chase after and interrogate the innocent victim #1
- Victim #4 - Our judiciary, which prosecutes and possibly convicts the innocent victim #1
Thus it is high time to act and to criminalize the covert modeling of human appearance and voice!
See also in Ban Covert Modeling! wiki
Footnote 1. Whether a suspect can defend against faked synthetic speech that sounds like him/her depends on how up-to-date the judiciary is. If no information and instructions about digital sound-alikes have been given to the judiciary, they likely will not believe the defense of denying that the recording is of the suspect's voice.