Spoken Language Processing: 09/01/2011

Interspeech 2011 was held in Florence, Italy about a week ago, August 28-August 31. A vacation in Italy was too good to pass up, so The Lady joined me, and we stayed until Labor Day.

I ended up sending a lot of work to Interspeech, so I spent more time than usual in the sessions I was presenting in, rather than seeing a lot of papers.

Two interesting themes stood out to me this year. They represent novel ideas on two fronts: speech science, through a better understanding of dialog, and speech engineering.

Entrainment
Julia Hirschberg, my former advisor, received the ISCA medal for her years of work in speech. Her talk was on current work with Agustín Gravano and Ani Nenkova on entrainment. Entrainment is the phenomenon by which people's speech becomes more similar as they speak to each other. This can show up in the words used to describe a concept, as well as in speaking rate, pitch, and intensity. How this happens isn't totally understood, and measures of entrainment are still being developed. This research theme is still in its early phases, but I haven't seen an idea spread around a conference as quickly or as thoroughly as this one did. There were questions and discussions about this phenomenon all over the place (including at Tom Mitchell's keynote about fMRI data and word meaning). The more engineering-minded folks weren't as convinced of its utility for speech processing, but within the speech perception and production communities, and specifically among the dialog and IVR folks, it was all the rage. It'll be something to see how this develops.

I-vectors
I-vectors are a topic that I need to spend some time learning. I hadn't heard of them before this conference, where there were no fewer than a dozen papers that used this approach. Essentially the idea is this: the location of the mixture components in a GMM is composed of (in at least one form) a UBM, a channel component, and an "interesting" component. This "interesting" component can be the contribution of a particular speaker, or a language/dialect, or anything else you're trying to model. Joint Factor Analysis is used to decompose the observation into these components in an unsupervised fashion. It's in this part where my understanding of the math is still limited. The crux is that the "interesting" component can be represented by a \[Dx\] transformation, where the dimensionality of x can be set by the user. In comparison to a supervector representation, where the dimensionality of the supervector is constrained to be equal to the number of parameters (or means) of the GMM, i-vectors can be significantly smaller, leading to better estimation and smaller models. I'll be reading this tutorial by Howard Lei over the next few weeks to get up to speed.
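
To make the dimensionality argument concrete for myself, here is a minimal toy sketch in plain Python of the additive decomposition described above: the stacked GMM means (the supervector) are written as UBM means plus a channel offset plus Dx. All names and sizes here (compose_supervector, the 8-component GMM, the 2-dimensional x) are my own illustrative assumptions, and the values are random. It only shows how a short x expands into a long supervector; it does not show how D or x would actually be estimated from data, which is the hard part.

```python
import random

def compose_supervector(ubm_means, channel, D, x):
    """Return ubm_means + channel + D x as a flat list.

    ubm_means, channel: lists of length S (the supervector dimension).
    D: S rows, each of length R (hypothetical low-rank matrix).
    x: list of length R (the low-dimensional "interesting" factor).
    """
    return [
        m + c + sum(d * xj for d, xj in zip(row, x))
        for m, c, row in zip(ubm_means, channel, D)
    ]

random.seed(0)

# Toy sizes (assumptions): 8 mixture components, 3-dim features, 2-dim factor.
n_components, feat_dim, factor_dim = 8, 3, 2
sv_dim = n_components * feat_dim  # one entry per mean parameter of the GMM

ubm_means = [random.gauss(0.0, 1.0) for _ in range(sv_dim)]
channel = [random.gauss(0.0, 0.05) for _ in range(sv_dim)]
D = [[random.gauss(0.0, 0.1) for _ in range(factor_dim)] for _ in range(sv_dim)]

# The compact representation: only factor_dim numbers per speaker/language.
x = [0.5, -1.0]

supervector = compose_supervector(ubm_means, channel, D, x)
print("supervector dimension:", len(supervector))
print("factor dimension:", len(x))
```

With these toy sizes the supervector has one entry per mean parameter of the GMM, while x has only as many entries as I chose to give it, which is the whole appeal.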

There were few specific papers that stood out to me this conference. I'm intrigued by Functional Data Analysis as a way to model continuous time-value observations. Michele Gubian gave a tutorial on this that I sadly missed, and used it in at least one paper, "Predicting Taiwan Mandarin tone shapes from their duration" by Chierh Chung and Michele Gubian. This paper didn't totally convince me of the technique's utility, but there may be more appropriate applications.

It was a satisfying and inspiring conference, to be sure. I think I was more interested in talking to people than in papers in particular this time. If anyone has particular favorites that I missed, please use the comments to share or just email me.


Tuesday, September 06, 2011

Interspeech 2011 Recap
