Spoken Language Processing: Interspeech 2012 Recap

Portland proved to be a great venue for this year's Interspeech. (Though people who attended ACL 2011 probably already could have guessed that.)

Setting up three simultaneous poster sessions in the parking garage may not sound like the mark of a good conference, but it was perfect. There was loads of space between each presenter. It allowed for all three sessions to be in the same place. And the folks at the Hilton did a great job of making it fairly unrecognizable as a parking lot. (In fact, Alejna Brugos didn't realize it until they were removing the carpets and "walls" on Thursday afternoon.)

Deep Neural Networks.
For "trends", there's really nothing hotter right now than Deep Neural Networks or Deep Belief Nets. This isn't an area that I do research in, but the story goes more or less like this. Neural Networks with more than a few hidden layers don't train very well with back-propagation. Geoff Hinton and his group figured out how to overcome this limitation not too long ago. (I think this 2006 paper explains it, but I can't be 100% sure.) Then at ASRU 2011 and ICASSP 2011 and 2012, the folks at Microsoft showed that you can use Deep Neural Networks to generate *very* useful front end features. (Tara Sainath has a nice recap of ICASSP 2012 here.) Now, everyone wants a piece.

The field has expanded from Microsoft to include IBM and Stanford/Berkeley/Google and RWTH Aachen. Joining them with posters on Deep Neural Nets for ASR are Tsinghua, CMU, Karlsruhe, NTT, INESC-ID, UWashington, and Georgia Tech. At this point, there's no way to deny that this approach is receiving significant research attention. The results seem to be holding up. If only they didn't take so long to train...

Prominence Special Session.
I was particularly looking forward to the Special Session on Prominence. On balance I was happy about the session. It brought together work and discussion of prosody that can sometimes feel diffuse and unfocused at a large conference like Interspeech.

I found this session to be surprising in a few ways.

It's been my understanding that "prominence" was used as a catch-all term to cover diverse prosodic phenomena including stress, emphasis, and pitch accenting. The first surprising element of this Session came in a review of the paper I submitted to it. The paper is on the use of automatically predicted pitch accents and intonational phrase boundaries to improve pronunciation modeling. The review, while generally positive, found the paper to not be appropriate for a prominence session because it explored the use of "pitch accents" rather than "prominence". I still haven't gotten a good explanation of the difference, and the reviews are blind.

A second surprise is that there seems to be a movement away from a phonological theory of prosody. Mark Hasegawa-Johnson and Jennifer Cole have been doing work over the last few years investigating how naive listeners perceive prominence. They've consistently found that listeners respond to different qualities, sometimes at different thresholds, when assessing prominence. I've found this line of research to be interesting and generally informative, but not a clear indictment of the theory that perceptual and productive prosodic categories exist. The panel (which I was a part of) on balance seemed comfortable with the idea that prominence is a continuous rather than categorical phenomenon. This view was most directly expressed by Denis Arnold, who said approximately: focus can be categorical, stress can be categorical, while prominence is still continuous. I didn't understand this statement then, and still don't. But again, this may be due to a different definition of prominence than I use.

The last surprise comes from finding out that there is a direction of pursuing language universals in prominence and prosody more broadly. Petra Wagner and Fabio Tamburini (the session organizers) are planning a workshop to investigate this. In my experience, while the dimensions of prosodic variation may be used in multiple languages and some of these (e.g. increased intensity or duration) may be used to indicate prominence in all languages, it is extremely unlikely that either the communicative impacts of prosodic variation or its realization and perception are language-universal. From that perspective, I'm not quite clear about what this line of research hopes to accomplish, but I'm curious about where it ends up.

Dynamic Decoding.
It appears that every year, I find myself sitting in on an oral session on a topic that I know very little about. Last year it was the language identification session. This year it was Dynamic Decoding. I was most intrigued by this because I hadn't heard the term before. When I asked someone what it was, they said "I don't know, Viterbi?".

I'm not quite sure this is a good enough distillation of the topic, but the papers in this session were about how to make on-the-fly (or post-training) modifications to language or pronunciation models. This is a cool idea with clear practical importance -- how do you add words to a recognizer on a mobile device and have this appropriately incorporated into the LM and pronunciation model? These two papers have some interesting WFST-based approaches to this task. I'll be curious to learn more about this. Also, if anyone has a more precise definition of this research area, I'd love to hear it.

Finally, some comments on two of the keynotes.

There were four keynotes at this year's Interspeech. Two were about interesting inter/multi-disciplinary questions about how speech processing intersects with music and animal vocalization, respectively.

Chin-Hui Lee: An Information-Extraction Approach to Speech Analysis and Processing
A third was delivered by this year's ISCA medalist, Chin-Hui Lee. Prof. Lee's most famous accomplishment is MAP adaptation in acoustic modeling. This is a researcher who spent a career treating speech recognition as a pattern matching problem. This is a view embodied by the Fred Jelinek quote: "Every time I fire a linguist, the performance of the speech recognizer goes up". What struck me is that despite this view, in a talk summarizing a successful career, Prof. Lee presented a view of speech recognition that says that linguistic knowledge and speech science should be incorporated into the task. This is an alternate perspective that has been investigated by a lot of talented researchers, including Hynek Hermansky, Jennifer Cole, Mark Hasegawa-Johnson, Alex Waibel, Hermann Ney (via speech-to-speech translation), Mari Ostendorf, Elizabeth Shriberg, Andreas Stolcke, Rene Beutler, Karen Livescu and many more (my apologies to anyone I missed).

I was struck by the evolution of perspective from someone who represents the statistical pattern matching approach to recognizing the potential importance of linguistic knowledge.

However, this talk was not so well received by some members of the audience for fairly obvious reasons. Firstly, it over-played the importance of Prof. Lee's own contributions. In a slide on "My contributions", virtually all major improvements to ASR over the last 20 years were mentioned, including most styles of adaptation (including MAP) and virtually all major forms of discriminative training. Secondly, it failed to recognize that the linguistically inspired approach that he was advocating for the future had been extensively researched by other talented peers.

On balance, I found it a compelling message. In its delivery, though, it understandably rubbed some people the wrong way.

Michael Riley: Weighted Transducers in Speech and Language Processing
I should preface my comments about Michael Riley's keynote by saying that we worked together while I was interning at Google. I'm a fan. Michael has the rare quality of being the smartest guy in the room without letting anyone know until it's genuinely useful.

The best part of this keynote was the history of the Weighted Finite State Transducer. This was a great story that takes place largely at Bell Labs in the 90s and features Fernando Pereira, Mehryar Mohri and, naturally, Michael Riley. This section was appropriately personal, while presenting this relevant recent history. The WFST is so ubiquitous in speech and NLP applications that it's easy to forget that it has a human context.

Much of the rest of the keynote felt like a 3-hour tutorial compressed into 40 minutes. This involved showing algorithms and example WFSTs, and describing all of the things that they can be used for. While a successful demonstration of the breadth of application, it was presented at such a pace that it was difficult to get anything out of it if you didn't know it already. I'd point the interested to the references found on the OpenFST page for more thorough tutorials that can be digested at your own pace.
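
For readers who haven't met WFSTs before, here is a toy sketch of my own (not something from the keynote, and not the OpenFst API) of the operation that does most of the work: composition. It assumes epsilon-free transducers over the tropical semiring (path weights add, the best path is the cheapest one). The names Wfst, compose and best_path are made up for illustration, as are the tiny "observed", "channel" and "lexicon" machines.

```python
import heapq
from collections import defaultdict


class Wfst:
    """Toy epsilon-free weighted transducer, tropical semiring (min, +)."""

    def __init__(self, start, finals):
        self.start = start
        self.finals = dict(finals)      # state -> final weight
        self.arcs = defaultdict(list)   # state -> [(in, out, weight, next)]

    def add_arc(self, src, ilabel, olabel, weight, dst):
        self.arcs[src].append((ilabel, olabel, weight, dst))


def compose(a, b):
    """Match a's output labels against b's input labels; weights add."""
    start = (a.start, b.start)
    result = Wfst(start, {})
    stack, seen = [start], {start}
    while stack:
        qa, qb = stack.pop()
        if qa in a.finals and qb in b.finals:
            result.finals[(qa, qb)] = a.finals[qa] + b.finals[qb]
        for ilabel, mid, w1, na in a.arcs.get(qa, []):
            for mid2, olabel, w2, nb in b.arcs.get(qb, []):
                if mid == mid2:
                    nxt = (na, nb)
                    result.add_arc((qa, qb), ilabel, olabel, w1 + w2, nxt)
                    if nxt not in seen:
                        seen.add(nxt)
                        stack.append(nxt)
    return result


def best_path(t):
    """Dijkstra over non-negative weights: (cost, inputs, outputs) or None."""
    tick = 0                            # tie-breaker so states are never compared
    heap = [(0.0, tick, t.start, (), ())]
    done = set()
    while heap:
        cost, _, q, ins, outs = heapq.heappop(heap)
        if q is None:                   # popped the virtual super-final state
            return cost, ins, outs
        if q in done:
            continue
        done.add(q)
        if q in t.finals:
            tick += 1
            heapq.heappush(heap, (cost + t.finals[q], tick, None, ins, outs))
        for ilabel, olabel, w, nxt in t.arcs.get(q, []):
            tick += 1
            heapq.heappush(
                heap, (cost + w, tick, nxt, ins + (ilabel,), outs + (olabel,)))
    return None


# Acceptor for the observed string "a a".
observed = Wfst(0, {2: 0.0})
observed.add_arc(0, "a", "a", 0.0, 1)
observed.add_arc(1, "a", "a", 0.0, 2)

# Noisy channel: keep a symbol for free, or swap it at cost 1.
channel = Wfst(0, {0: 0.0})
for x, y in (("a", "b"), ("b", "a")):
    channel.add_arc(0, x, x, 0.0, 0)
    channel.add_arc(0, x, y, 1.0, 0)

# "Lexicon" accepting only "a b" (cost 0.5) or "b b" (cost 0).
lexicon = Wfst(0, {2: 0.0})
lexicon.add_arc(0, "a", "a", 0.5, 1)
lexicon.add_arc(0, "b", "b", 0.0, 1)
lexicon.add_arc(1, "b", "b", 0.0, 2)

result = best_path(compose(compose(observed, channel), lexicon))
print(result)  # -> (1.5, ('a', 'a'), ('a', 'b'))
```

The point of the toy is that three independently built models (what was heard, how symbols get confused, what strings are allowed) snap together with one generic operation, and decoding is just a shortest-path search over the result. Real toolkits also handle epsilons, other semirings, determinization and minimization, which this sketch leaves out.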

Interspeech 2012 was successful and fun. Portland and the Hilton (and its solid wifi) were excellent hosts. There was good work and, as ever, more than I could see. If you have great or favorite papers that I missed, please let me know!

Monday, September 17, 2012