Wednesday, June 05, 2013

Deep Thoughts on ICASSP 2013

ICASSP 2013 is wrapping up today in Vancouver.  Unfortunately, I missed the last day (and sessions on speech synthesis and prosody that I would have enjoyed).  But a wedding on Saturday brought me back a day early.

I hadn't been to ICASSP before, mostly due to timing oddities and writing grants over summers rather than writing papers that would hit the deadline.  It is a very large conference, about twice the size of Interspeech, but the scope is also much broader: speech and language work made up at most 30% of the conference, and even that is a generous count that includes machine learning and other work on audio.

So take this recap with a grain of salt.  I missed the last day of the conference, and my impressions are speech focused.  (I think I've described all conference recaps as blind-men-and-the-elephant problems and this one is no exception.)

Deep Learning.
OK, I pointed out that Deep Neural Nets were a "hot topic" at last year's Interspeech.  It's hard to believe it's possible, but they're even hotter now.  Geoffrey Hinton gave the first plenary talk.  This was followed by an oral session called "Automatic Speech Recognition using Neural Networks", which was followed by a Special Session titled "New Types of Deep Neural Network Learning for Speech Recognition and Related Applications".  The next morning, you could attend "Acoustic Modeling with Neural Networks".  And this is just at the session level.  Even more applications of multilayer neural networks were scattered around other oral and poster sessions.  Some of these oral sessions were so crowded that people were standing along the walls and sitting in the aisles.  Nothing else that I saw received nearly so much attention.

It's easy to view "deep" learning as a silver bullet -- the next great machine learning technique that will solve all of our problems.  It's almost certainly not.  However, a wide array of research groups are seeing similar impressive performance gains by using deep network models for a broad spectrum of spoken language processing tasks.  This is especially true for acoustic modeling in speech recognition.  Given this, deep learning shouldn't be ignored.

Hinton's Coursera course is a solid place to start.  (Though resist drinking the Kool-Aid.  To my mind, perceptrons are bad approximations of neurons and worse approximations of the brain, and they do little to advance our understanding of human intelligence.)

Another highlight
One paper which caught my attention for its simplicity came out of Google: "Language Model Verbalization for Automatic Speech Recognition".  Essentially, "verbalization" is a sort of inverse text normalization.  In text normalization for speech synthesis we have to translate "10" to "TEN", and "7:11" to "SEVEN ELEVEN" or "ELEVEN PAST SEVEN".  For ASR, the idea of verbalization is to convert decoding output like "SEVEN ELEVEN" into "7:11" or "7-11".  Why bother?  Well, Google (and everyone else) has big language models built from text data.  You could run a text normalizer over all of this data, but the proposition here is instead to convert the ASR output into a form that looks more like the source material in your language model.
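To make the direction of the mapping concrete, here is a toy sketch of my own (not the paper's system): text normalization maps written forms to spoken forms for synthesis, while verbalization runs the table the other way for recognition, and is generally one-to-many -- one spoken form, several plausible written forms.  The table entries are hypothetical.

```python
# Hypothetical verbalization table: spoken form -> candidate written forms.
# Note the one-to-many mapping, which is the crux of the problem.
VERBALIZE = {
    "TEN": ["10"],
    "SEVEN ELEVEN": ["7:11", "7-11"],
}

def verbalize(spoken_token):
    """Return candidate written forms for a recognized spoken form;
    unknown tokens pass through unchanged."""
    return VERBALIZE.get(spoken_token, [spoken_token])

print(verbalize("SEVEN ELEVEN"))  # ['7:11', '7-11']
print(verbalize("STORE"))         # ['STORE']
```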

The Verbalizer solution to this problem is remarkably elegant.  A traditional WFST decoder can be expressed as D = C • L • G, where C comes off the acoustic model mapping context-dependent phones to context-independent phones, L is the pronunciation model, and G is the language model.  The "Verbalized" model adds a WFST V which maps ASR realizations like "SEVEN ELEVEN" to text-like realizations like "7-11" or "7:11" (and since it's a WFST, it can do both simultaneously).  The new decoder looks like D = C • L • V • G.  No fuss, no muss.  Except that you have to write the Verbalizer rules by hand.
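The insertion of V can be sketched with ordinary Python sets standing in for WFSTs: model each (unweighted) transducer as a relation -- a set of input/output string pairs -- and composition as relation composition, so slotting V in between L and G just chains one more relation.  This is a toy of my own, not the paper's actual pipeline.

```python
def compose(r1, r2):
    """Relation composition, the set-theoretic analogue of WFST
    composition: (a, c) survives if r1 maps a -> b and r2 maps b -> c."""
    return {(a, c) for (a, b) in r1 for (b2, c) in r2 if b == b2}

# V: verbalizer -- ASR realization -> text-like written forms (one-to-many)
V = {("SEVEN ELEVEN", "7:11"), ("SEVEN ELEVEN", "7-11")}

# G: toy "language model" as an identity relation over its written vocabulary
G = {("7:11", "7:11"), ("7-11", "7-11"), ("STORE", "STORE")}

# Composing V with G keeps both written variants in the search space,
# just as a single composed WFST would
print(sorted(compose(V, G)))
# [('SEVEN ELEVEN', '7-11'), ('SEVEN ELEVEN', '7:11')]
```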

The paper focused on terms involving numbers, but the framework is very extensible.  And it's great to see work coming out of Google that doesn't have Google-scale data as a prerequisite.

The acceptance rate at this year's ICASSP was 52%.  This means that the ICASSP and Interspeech acceptance rates are identical for the first time.  I know that the Interspeech organizers have been working to lower their acceptance rate, while it sounds like there has been pressure to keep ICASSP large, even at the expense of a higher acceptance rate.  IEEE (ICASSP's parent organization) is a much larger bureaucracy than ISCA (Interspeech's).  IEEE has clear expectations about the revenue a conference should generate, which translates to expectations on attendance, and therefore on the number of accepted papers, regardless of the number of submissions.

Despite the near constant rain, I genuinely enjoyed Vancouver and ICASSP 2013.  I'm looking forward to the next.