Monday, October 13, 2014

Interspeech 2014 Recap

This year's Interspeech was in Singapore.  Singapore is, in some ways, a very easy venue to travel to.  It's a modern, cosmopolitan city.  They speak English.  It's tropical, but you're never more than a hundred meters from air conditioning.  In other ways, it's a hard one: it's just so far away.  Over 20 hours of travel each way.  I like airplanes.  They're as magical as any technology we've got.  But 20 hours is a long time to sit still.  Think about how many steps you take in 20 hours.  How many different faces you see.  Then reduce that to about 200 steps, and 10 people.

"Because we are all poets or babies in the middle of the night, struggling with being." - Martin Amis "London Fields"

Interspeech 2014 was a well-run conference.  The quality of papers was generally quite high.  The venue easily handled the size of the event and the wifi was steady.  It was difficult to find enough food at the Welcome Reception, but easy to find enough beer.  The banquet was flawed -- segregating vegetarians is pretty rude -- but they all are, and at least there was enough food to go around, and everyone ate promptly.  And of course, there were Mambo and Jambo.  I'm not going to go into it here, but find someone who attended the opening ceremony and ask them to describe it.  Then don't believe them, and ask someone else to do the same.  It was "odd" at best.  But rest assured, I'll be attending the Dresden opening ceremony to see how they one-up it.

Deep Learning
A few years ago, DNNs invaded speech conferences.  Deep Learning is still a significant buzzword and a hot topic, but, for the better of everyone involved, the intensity has cooled.  Now the interest in DNNs seems to have shifted toward 1) understanding how they work and how best to train them for a task, and 2) Long Short-Term Memory (LSTM) units to model sequential data.  The latter really broke out at this year's conference.  A number of papers found them to be an effective alternative to traditional recurrent nets trained with back-propagation through time.
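
For anyone who hasn't looked inside one, here is a minimal sketch of a single LSTM step in numpy.  The weight names, shapes, and gate ordering are my own illustrative assumptions, not any particular paper's; the additive cell-state update is the part that lets gradients survive over long sequences where plain BPTT-trained recurrent nets struggle.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lstm_step(x, h_prev, c_prev, W, U, b):
        # One LSTM time step.  W: (4H, D) input weights, U: (4H, H) recurrent
        # weights, b: (4H,) bias, packed as [input, forget, output, candidate].
        # Names and shapes are illustrative assumptions, not from a specific paper.
        H = h_prev.shape[0]
        z = W @ x + U @ h_prev + b      # all four gate pre-activations at once
        i = sigmoid(z[0:H])             # input gate
        f = sigmoid(z[H:2*H])           # forget gate
        o = sigmoid(z[2*H:3*H])         # output gate
        g = np.tanh(z[3*H:4*H])         # candidate cell update
        c = f * c_prev + i * g          # additive cell update: gradients can flow
        h = o * np.tanh(c)              # hidden state passed to the next time step
        return h, c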

BABEL
I've been involved with the IARPA-BABEL program, so my view is pretty biased on this front, but I felt like the presence of BABEL at this year's Interspeech was particularly large.  The program's central task is performing keyword search on low-resource languages.  It has an aggressive evaluation schedule with an increased number of *new* languages involved each year.  There were at least two sessions devoted to keyword search, and papers evaluated on either BABEL-proper or the NIST OpenKWS challenge seemed to be all over the conference.  (Searching the paper index suggests that there are between 30 and 40 BABEL papers, and another 10 or so OpenKWS papers.)  It seems clear that this program has had a large impact on ASR and KWS research.  2014 was certainly the high-water mark here, as the program shrank by 50% last year, but it's worth noting its effect on the field.

Some standout papers

I don't mean to suggest that these are "the best" papers, but they're ones that caught my eye for one reason or another.

Acoustic Modeling with Deep Neural Networks Using Raw Time Signal for LVCSR by Zoltán Tüske, Pavel Golik, Ralf Schlüter and Hermann Ney.   Part of the promise of Deep Learning is the ability to "learn feature representations" directly from data.  This is frequently touted as a description of what is happening in the first hidden layer of a deep net.  So, the logic goes, do we need MFCC/PLP/etc. features, or can we do speech recognition directly on a raw acoustic signal?  This is the first paper I'm aware of to affirmatively show that "yes, yes we can".  It requires a good amount of training data, and rectified linear (ReLU) neurons work better for this, but 1) it works competitively with traditional features and 2) many of the first hidden layer neurons can be shown to be learning a filterbank.  Very cool.
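
To make the raw-signal idea concrete, here is a rough sketch of that kind of first layer, under my own assumptions about frame sizes and layout rather than the authors' exact architecture: the layer multiplies windows of raw samples by a learned weight matrix and applies a ReLU, and you can inspect what it learned by looking at the frequency response of each weight row.

    import numpy as np

    def first_layer_on_raw_audio(signal, W, b, frame_len=400, hop=160):
        # Apply a learned first hidden layer directly to raw samples.
        # frame_len/hop of 400/160 samples is roughly 25 ms / 10 ms at 16 kHz.
        # W has shape (num_units, frame_len); all sizes here are assumptions.
        frames = np.stack([signal[t:t + frame_len]
                           for t in range(0, len(signal) - frame_len + 1, hop)])
        return np.maximum(0.0, frames @ W.T + b)   # ReLU hidden activations

    def unit_frequency_responses(W):
        # After training, the FFT magnitude of each weight row shows which band
        # that unit responds to, which is how a learned filterbank shows up.
        return np.abs(np.fft.rfft(W, axis=1))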

Backoff Inspired Features for Maximum Entropy Language Models by Fadi Biadsy, Keith Hall, Pedro Moreno and Brian Roark.  In n-gram language modeling, when a sequence of words A, B, C has never been observed, its n-gram probability P(C|A,B) can be approximated by the probability P(C|B).  But to be sensible about it, you've got to apply a backoff penalty.  Discriminative language models can seamlessly incorporate backoff features, F(A,B,C), F(B,C), etc., and learn appropriate weights.  The key insight in this paper is that when a discriminative model uses these backoff estimates, it incurs no penalty, so it essentially overestimates the probability of uncommon n-grams in the context of common (n-k)-grams.  This paper seeks to fix that, and does.
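
As a toy illustration of the backoff penalty in the generative setting (a stupid-backoff-style estimate with a fixed penalty, not the paper's MaxEnt model; the function name and alpha value are made up for the example):

    from collections import Counter

    def backoff_trigram_prob(tri, bi, uni, a, b, c, alpha=0.4):
        # Toy estimate of P(c | a, b): use the trigram count if we have it,
        # otherwise fall back to the bigram, then the unigram, multiplying in a
        # fixed penalty alpha at each backoff.  tri, bi, uni are Counters over
        # (a, b, c), (a, b)/(b, c), and (c,) tuples from the same corpus, so the
        # context count is nonzero whenever the higher-order count is.
        if tri[(a, b, c)] > 0:
            return tri[(a, b, c)] / bi[(a, b)]
        if bi[(b, c)] > 0:
            return alpha * bi[(b, c)] / uni[(b,)]
        total = sum(uni.values())
        return alpha * alpha * uni[(c,)] / total if total else 0.0

Drop the alpha factors and a rare trigram that backs off to a very common bigram looks just as probable as a trigram you've actually seen many times; that overestimation is the analogue of what the paper patches up on the discriminative side.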

Additional favorites, some from my students:

  • Word Embeddings for Speech Recognition by Samy Bengio and Georg Heigold.  Far and away the most popular poster at the conference.  This is high on my list to read closely.  It promises to learn a Euclidean space in which word decoding can happen, so that words that sound similar are closer together in that space.
  • Learning Small-Size DNN with Output-Distribution-Based Criteria by Jinyu Li, Rui Zhao, Jui-Ting Huang and Yifan Gong.  How do you effectively train a small DNN without ruining its performance?  This paper out of MSR suggests training a large DNN first, then using its output distribution to train the small one.  It'll take me some closer reading to fully understand why this works, but I'm intrigued; a rough sketch of the idea follows below.
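
Here is a rough sketch of that output-distribution idea as I read it: generic teacher-student training, where the small net is fit to the large net's soft outputs.  The function names are mine, and the paper's exact criteria may differ.

    import numpy as np

    def softmax(logits, axis=-1):
        z = logits - logits.max(axis=axis, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=axis, keepdims=True)

    def soft_target_loss(student_logits, teacher_logits):
        # Cross-entropy between the large (teacher) DNN's output distribution and
        # the small (student) DNN's predictions, averaged over frames.  A generic
        # teacher-student sketch; not necessarily the paper's exact criterion.
        p_teacher = softmax(teacher_logits)                 # soft labels
        log_q_student = np.log(softmax(student_logits) + 1e-12)
        return -np.mean(np.sum(p_teacher * log_q_student, axis=-1))

    # Training idea: run both nets on the same frames and update only the small
    # net's weights (e.g. with SGD/backprop) to minimize soft_target_loss.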