Spoken Language Processing: Interspeech 2010 Recap

Interspeech was in Makuhari, Japan, last week. Makuhari is about 40 minutes from Tokyo, and I'd say staying in Tokyo is totally worth the commute. The conference center was large and clean and (after the first day) had functional wireless, but Makuhari offers a lot less than Tokyo does.

Interspeech is probably the speech conference with the broadest scope and largest draw. This makes it a great place to learn what is going on in the field.

One of the most striking things about the work at Interspeech 2010 was the lack of a Hot Topic. Acoustic modeling for automatic speech recognition is a mainstay of any speech conference, and it was there in spades. There was some nice work on prosody analysis. Recognition of age, affect, and gender was highlighted in the INTERSPEECH 2010 Paralinguistic Challenge, but outside the special session focusing on this, there wasn't an exceptional amount of work on these questions. Although no major new theme emerged this year, there was some very high-quality, interesting work.

Here is some of the work that I found particularly compelling.

Married Couples' speech
Sri Narayanan's group, with collaborators from USC and UCLA, has collected a set of married couples' dialog speech recorded during couples therapy. This is already compelling data to look at: naturally occurring emotional speech is rare, and here it is emotion in dialog. They had (at least) two papers on this data at the conference, one looking at prosodic entrainment during these dialogs, and the other classifying qualities like blame, acceptance, and humor in either spouse. Both are very compelling first looks at this data. There are obviously some serious privacy issues with sharing this data, but hopefully it will be possible eventually.

Automatic Classification of Married Couples’ Behavior using Audio Features, Matthew Black, et al.

Quantification of Prosodic Entrainment in Affective Spontaneous Spoken Interactions of Married Couples, Chi-Chun Lee, et al.

Ferret auditory cortex data for phone recognition
Hynek Hermansky and colleagues have done a lot of compelling work on phone recognition. To my eye, a lot of it has been banging away at alternatives to MFCC representations for speech recognition. Some of them work better than others, obviously, but it's great to see that kind of scientific creativity applied to a core task for speech recognition. This time the idea was to take "spectro-temporal receptive fields" empirically observed from ferrets that have been trained to be accurate phone recognizers, and use these responses to train a phone classifier. Yeah, that's right. They used ferret neural activity to try to recognize human speech. Way out there. If that weren't compelling enough, the results are good!

Prosodic Timing Analysis for Articulatory Re-Synthesis Using a Bank of Resonators with an Adaptive Oscillator, Michael C. Brady
A pet project of mine has been to find a nice way to process rhythm in speech for prosodic analysis. Most people use some statistic based on the intervocalic intervals, but this is unsatisfying. While it captures the degree of stability of the speaking rate, it doesn't tell you anything about which syllables are evenly spaced and which are anomalous. This paper uses an adaptive oscillator to find the frequency that best describes the speech data. One of the nicest results (that Michael didn't necessarily highlight) was that deaccented words in his example utterance were not "on the beat". In the near term I'm planning on replicating this approach for analyzing phrasing, on the idea that, in addition to other acoustic resets, the prosodic timing resets at phrase boundaries. A very cool approach.
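
As a rough illustration of the contrast between a global interval statistic and a per-syllable "on the beat" measure, here is a minimal sketch. It is not the adaptive oscillator from the paper: it swaps in a plain grid search over candidate beat periods. The onset times, the candidate range, the tolerance, and all function names are assumptions made up for the example.

```python
import math

def interval_cv(onsets):
    """Global statistic: coefficient of variation of inter-onset intervals."""
    intervals = [b - a for a, b in zip(onsets, onsets[1:])]
    mean = sum(intervals) / len(intervals)
    variance = sum((x - mean) ** 2 for x in intervals) / len(intervals)
    return math.sqrt(variance) / mean

def cycles_off_beat(t, period, start):
    """Distance from time t to the nearest beat, as a fraction of one period."""
    phase = (t - start) / period
    return abs(phase - round(phase))

def fit_period(onsets, candidates):
    """Pick the candidate period whose beat grid lies closest to the onsets."""
    start = onsets[0]  # assumption: the first onset defines beat zero
    return min(
        candidates,
        key=lambda p: sum(cycles_off_beat(t, p, start) for t in onsets),
    )

def off_beat_indices(onsets, period, tolerance=0.2):
    """Indices of onsets more than `tolerance` periods away from any beat."""
    start = onsets[0]
    return [
        i for i, t in enumerate(onsets)
        if cycles_off_beat(t, period, start) > tolerance
    ]

# Hypothetical syllable onsets in seconds; the fourth one arrives late.
onsets = [0.0, 0.25, 0.50, 0.85, 1.00, 1.25, 1.50]
candidates = [i / 100 for i in range(15, 41)]  # 0.15 s to 0.40 s
period = fit_period(onsets, candidates)
print(round(interval_cv(onsets), 3), period, off_beat_indices(onsets, period))
```

The interval statistic only says the timing is somewhat irregular overall, while the fitted beat grid points at the specific syllable that is off. The candidate range is kept narrow on purpose, because half of the true period fits the same onsets just as well.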

Compressive Sensing
There was a special session on compressive sensing that was populated mostly by IBM speech team folks. I hadn't heard of compressive sensing before this conference, and it's always nice to learn a new technique. At its core, compressive sensing as used here is an exemplar-based learning algorithm. Where it gets clever is this: k-nearest neighbors uses a fixed number, k, of exemplars with equal weight, and SVMs use a fixed set of support vectors to make decisions, but compressive sensing uses a dynamic set of exemplars to classify each data point. The candidate exemplars (possibly the whole training data set) are weighted with some L1-ish regularization that drives most of the weights to zero, selecting a subset of all candidates for classification. Then a weighted k-nearest-neighbors vote is performed using the selected exemplars and weights. The dynamic selection and weighting of exemplars outperforms vanilla SVMs, but the process is fairly computationally expensive.
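
Here is a minimal sketch of that idea, under several assumptions: it is not the IBM team's implementation, the weights come from plain L1-regularized least squares (lasso) solved by coordinate descent, and the voting rule (sum of absolute weights per class) is my own simplification. The toy exemplars, labels, and function names are hypothetical.

```python
def soft_threshold(x, lam):
    """Shrink x toward zero by lam; values within [-lam, lam] become exactly 0."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0

def lasso_weights(exemplars, target, lam, n_iter=200):
    """Coordinate descent for
    min_w 0.5 * ||target - sum_j w[j] * exemplars[j]||^2 + lam * sum_j |w[j]|."""
    w = [0.0] * len(exemplars)
    residual = list(target)  # target minus the current reconstruction (w = 0)
    norms = [sum(v * v for v in e) for e in exemplars]
    for _ in range(n_iter):
        for j, e in enumerate(exemplars):
            if norms[j] == 0.0:
                continue
            # Correlation with the residual, adding back exemplar j's own share.
            rho = sum(ei * ri for ei, ri in zip(e, residual)) + norms[j] * w[j]
            new_w = soft_threshold(rho, lam) / norms[j]
            delta = new_w - w[j]
            if delta != 0.0:
                residual = [ri - delta * ei for ri, ei in zip(residual, e)]
                w[j] = new_w
    return w

def classify(exemplars, labels, target, lam=0.05):
    """Vote with the surviving exemplars, each weighted by |w[j]| (an assumption)."""
    w = lasso_weights(exemplars, target, lam)
    scores = {}
    for wj, label in zip(w, labels):
        if wj != 0.0:
            scores[label] = scores.get(label, 0.0) + abs(wj)
    best = max(scores, key=scores.get) if scores else None
    return best, w

# Hypothetical 2-D training exemplars from two classes.
exemplars = [[1.0, 0.0], [0.7, 0.1], [0.0, 1.0], [0.1, 0.7]]
labels = ["a", "a", "b", "b"]
label, weights = classify(exemplars, labels, [0.8, 0.1])
print(label, weights)
```

The L1 penalty is what makes the exemplar set dynamic: a different test point zeroes out a different subset of candidates. The cost is that one optimization problem is solved per test point, which is where the computational expense comes from.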

Interspeech 2011 is in Florence, and I can't wait -- now I've just got to get banging on some papers for it.

Tuesday, October 05, 2010
