In September 2016, Google DeepMind published a blog post titled “WaveNet: A Generative Model for Raw Audio”. In my opinion, this was the second most significant deep learning announcement of 2016 (after AlphaGo vs Lee Sedol). While the model can be used for speech recognition and other related tasks, it was initially focused on text-to-speech synthesis (TTS). The glitching and buzzing artificial voices that are synonymous with TTS have been a problem for over five decades, and the issue is becoming increasingly prominent with the number of smartphones and smart speakers in use.
Whenever a technology leap occurs, it can be difficult for some in the industry to adjust. As we saw with the emergence of deep learning, specialists, often with decades of experience, can be dismissive of a new technology because it challenges the need for their specialty. Perhaps it’s simply human nature: no one wants to feel that their talents or efforts have become dramatically less valuable, or automated, overnight.
As a result, WaveNet was received with mixed opinions. Some criticized both its speed and the noise caused by the 8-bit encoding it used. Others were simply skeptical, as very few samples were released to prove how consistent it was.
WaveNet still made many companies and universities strongly reconsider their current approaches to deep learning for speech signal generation. In the past year we’ve seen new publications and demos from several organisations. Audio links are provided below to the most relevant samples.
For reference, here are the two samples from Google DeepMind’s WaveNet version 1:
After Google published their WaveNet paper, several other organisations followed up with samples from the systems they were working on.
A sample from Baidu’s Deep Voice (source):
Samples from Facebook’s loop system (source). Note: this system was not trained on data designed for TTS.
And samples from the Montreal Institute for Learning Algorithms (MILA), which many would consider the best academic AI lab in the world today (source). This is an end-to-end system, but signal generation is its weakness.
While each of the above systems had different requirements and data, the broad trend is clearly towards using bigger, carefully optimized models for audio generation. Each of these systems is impressive in its own right, and links to the source articles are provided for each. Yet if you had to pick a single system to use in a real product, it is clear that Google’s WaveNet is far superior. This is despite the fact that all three are leading global organisations with no shortage of talent or resources.
Open source implementations also exist. However, these simply try to make human-like sounds rather than speak real language, and even at that, Google’s WaveNet is vastly superior:
Open source TensorFlow WaveNet:
Open source SampleRNN:
Listening to the various samples above, it is clear that Google DeepMind’s WaveNet produces superior-quality, human-sounding voices. So many have tried, yet none of the other samples come close. Interestingly, Apple announced their new TTS for Siri just last month. While it makes definite progress, it’s still fundamentally based on the previous generation of synthesis techniques (source). It is also clear that Amazon and Microsoft are using the same style of “concatenative” synthesis, which originated in the early 1990s.
Naturally, people start to wonder: what gives? Why hasn’t anybody reproduced WaveNet?
In August, at Interspeech 2017 (the largest speech technology conference), I spoke to several domain experts who questioned whether WaveNet was real or just a PR stunt. Clearly, no one had come close to reproducing Google’s results a full year after they were published. And this is at a time when the five biggest platform companies have annual budgets of close to a billion dollars, in addition to thousands of staff working on their speech and language technology platforms.
Yet, there is sufficient evidence that WaveNet is the next generation of speech recognition, synthesis, and voice-activity detection (VAD). Recently, at Made by Google 2017, Google announced that they have WaveNet production ready! Speed? 1000 times faster. Quality? Same as Compact Disc. Sounds too good to be true?…
With my background in speech synthesis, I was delighted to see this technology leap emerge. And with Voysis being the only independent complete voice AI platform, TTS is one of the core topics we work on.
We’ve decided to publish some of our Voysis WaveNet samples to demonstrate that the technology is indeed real. We developed it completely independently of Google. Our architecture has evolved a bit beyond the original WaveNet, and we suspect it may be similar in several ways to Google’s newest version, which they have not published yet. WaveNet is, without doubt, the future of how machines will talk to people and vice versa.
Have a listen to the sample below of an audio book (Anna Sewell’s Black Beauty). We made it using a popular research dataset (Blizzard) so that others can compare it to the systems they’re working on. Like Facebook’s samples above, this dataset was not designed for TTS, so the quality suffers as a result. It was trained using less than half the amount of data used by Google’s WaveNet system. It uses 16 kHz audio, natural prosody (durations and f0) and a crude mu-law decoder (which we’re in the process of replacing), yet it’s clearly a real WaveNet.
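For readers unfamiliar with the mu-law step mentioned above, here is a minimal Python sketch of 8-bit mu-law companding, the scheme described in the original WaveNet paper. It is an illustration only and not the Voysis decoder; the function names are hypothetical, and it assumes samples are already scaled to the range [-1, 1].

```python
import math

MU = 255  # 8-bit mu-law: 256 classes, as described in the WaveNet paper


def mu_law_encode(x: float, mu: int = MU) -> int:
    """Map a sample in [-1, 1] to an integer class in [0, mu]."""
    x = max(-1.0, min(1.0, x))  # clip out-of-range samples
    # Logarithmic compression: fine resolution near zero, coarse near +/-1.
    compressed = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    # Shift from [-1, 1] to [0, mu] and round to the nearest class.
    return int(round((compressed + 1.0) / 2.0 * mu))


def mu_law_decode(index: int, mu: int = MU) -> float:
    """Map an integer class in [0, mu] back to a sample in [-1, 1]."""
    compressed = 2.0 * index / mu - 1.0
    # Inverse of the compression step above.
    return math.copysign(math.expm1(abs(compressed) * math.log1p(mu)) / mu, compressed)


if __name__ == "__main__":
    for sample in (-1.0, -0.1, 0.0, 0.01, 0.5, 1.0):
        code = mu_law_encode(sample)
        print(sample, code, mu_law_decode(code))
```

The round trip is lossy: with only 256 classes, the reconstruction error grows with amplitude, which is one source of the audible noise that critics of the 8-bit encoding pointed to.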
For reference, here’s an audio sample of Google DeepMind’s WaveNet version 2 from their Google I/O 2017 announcement. This sample is trained from data designed for TTS, but uses a similar amount of training data to the Voysis WaveNet sample above.
If any researchers are interested in comparing our sample to theirs, the experiment setup was as follows: the Blizzard dataset was split so that the first 20 sentences were used for testing, the following 980 were omitted from both testing and training, and the remainder was used for training. We believe this gives a fairer result than selecting one of every n sentences for testing, because with that kind of split the system benefits from having seen the neighboring sentences. Using natural durations makes it easy for others to compare to our sample too.
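The split described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sentences are assumed to be in their original corpus order, and the function and parameter names are hypothetical rather than taken from our codebase.

```python
from typing import List, Sequence, Tuple


def split_with_gap(
    sentences: Sequence[str], n_test: int = 20, n_gap: int = 980
) -> Tuple[List[str], List[str]]:
    """Return (train, test) with a buffer of unused sentences between them.

    The first n_test sentences are held out for testing, the next n_gap
    are discarded, and everything after that is used for training.
    """
    if len(sentences) <= n_test + n_gap:
        raise ValueError("not enough sentences to leave any training data")
    test = list(sentences[:n_test])
    train = list(sentences[n_test + n_gap:])
    return train, test


if __name__ == "__main__":
    # Placeholder corpus; real use would load the sentences in book order.
    corpus = [f"sentence {i}" for i in range(5000)]
    train, test = split_with_gap(corpus)
    print(len(train), len(test))
```

The discarded block acts as a buffer, so no training sentence sits next to a test sentence in the source text.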
WaveNet is real and the end product is clearly far more advanced than its rivals. I strongly believe it is the future of speech synthesis and recognition, and likely many other domains of speech and language technology.
The era of glitching and buzzing artificial voices has ended.