Tuesday, October 25, 2016

States of the Arts

Fall 2016 has seen important improvements to the state of the art in both speech synthesis and speech recognition.

In September, Google DeepMind unveiled WaveNet, a speech synthesis system that generates exceptionally natural-sounding speech.  In October, Microsoft Research announced that they had developed a speech recognition system that matches the word error rate of human transcribers.
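For readers who have not met the metric, word error rate is the word-level edit distance between a reference transcript and the recognizer's output, divided by the number of reference words.  The sketch below is a generic illustration with a hypothetical function name and simple whitespace tokenization; it is not Microsoft's scoring pipeline, and real evaluations apply text normalization first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
```

Because insertions count as errors, the rate can exceed 1.0 when the hypothesis is much longer than the reference.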

A few observations.

Life moves pretty fast.
It’s a time of rapid progress in speech and spoken language processing.  Microsoft, IBM and Baidu have all posted better and better speech recognition numbers in the last few years.

Deep Learning has the goods.
It’s very easy to dismiss “deep learning” as over-hyped.  However, both of these advances rely heavily on deep neural networks.  So far, these networks continue to deliver on their promise.

One of the first important ASR papers showing that DNNs can outperform traditional GMM acoustic models on a hard task (i.e. Switchboard) was presented at Interspeech 2011.  This means the work was done at least 6 months earlier.  Both of this fall’s advances are described not only by press releases and glossy webpages, but also by technical papers [WaveNet paper, MSR paper].  Both were posted to arXiv.  There’s no doubt that immediate self-publication is flooding the scientific engine with oxygen.  Progress is rapid not only because we’re still learning the limits of neural networks, but also because groups are able to compete and learn from each other much more quickly than semi-annual conferences enable.

WaveNet is a new way of doing things.
WaveNet synthesizes speech in a novel way.  The resulting waveform is generated one sample at a time, with each sample conditioned on the samples that came before it.  This is essentially doing parametric speech synthesis without a vocoder.  Not only does this approach work surprisingly well, it’s also exciting in its newness.  There’s other novelty in this work too (first, using a classification approach to predict discretized mu-law values instead of predicting a continuous value; second, the dilated convolution layers), but this work is most important for showing the promise of this approach to generating audio data.
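A minimal Python sketch of those two ideas follows.  It is not DeepMind's code: the function names are hypothetical, and it assumes 8-bit mu-law (256 classes) and kernel-size-2 convolutions with dilations doubling from 1 to 512.  The first pair of functions turns a continuous sample into a class label and back, which is what lets the network treat each sample as a classification problem.  The last function shows how quickly stacked dilated layers widen the window of past samples a prediction can see.

```python
import math

MU = 255  # assumed 8-bit mu-law, so class labels run from 0 to 255


def mu_law_encode(x: float, mu: int = MU) -> int:
    """Map a sample in [-1, 1] to an integer class in [0, mu]."""
    x = max(-1.0, min(1.0, x))
    # Compress: small amplitudes get finer resolution than large ones.
    compressed = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    # Shift [-1, 1] to [0, mu] and round to the nearest class.
    return math.floor((compressed + 1) / 2 * mu + 0.5)


def mu_law_decode(k: int, mu: int = MU) -> float:
    """Map a class in [0, mu] back to an approximate sample in [-1, 1]."""
    compressed = 2 * k / mu - 1
    return math.copysign(math.expm1(abs(compressed) * math.log1p(mu)) / mu, compressed)


def receptive_field(dilations: list[int], kernel_size: int = 2) -> int:
    """Number of input samples one output of the stacked layers depends on."""
    return 1 + (kernel_size - 1) * sum(dilations)


label = mu_law_encode(0.5)        # class the network would be trained to predict
approx = mu_law_decode(label)     # close to 0.5, up to quantization error
window = receptive_field([2 ** i for i in range(10)])  # 1024 samples
```

Ten layers cover 1024 samples here, whereas ten undilated kernel-size-2 layers would cover only 11.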

The Microsoft ASR work is not a new way of doing things.
The work is of the highest quality, without a doubt.  This paper represents an exceptionally well-engineered collection of effective speech recognition tools that have hit an interesting milestone.  However, the individual pieces will seem familiar to anyone up to date on the current state of the art.  The major improvements come from language modeling (an LSTM LM specifically), LACE units (which MSR showed at Interspeech this year), lattice-free MMI, and spatial smoothing (which is similar to Cambridge’s “stimulated training”).  The Microsoft team has put these parts together more effectively than anyone else, and it’s an important achievement.  But compared to the WaveNet development, it’s a more incremental step.