Tuesday, June 08, 2010

HLT-NAACL 2010 Recap

I was at HLT-NAACL in Los Angeles last week.  HLT isn't always a perfect fit for someone sitting towards the speech end of the Human Language Technologies spectrum.  Every year, it seems, the organizers try (or claim to try) to attract more speech and spoken language processing work.  It hasn't quite caught on yet, and the conference tends to be dominated by Machine Translation and Parsing.  However, the (median) quality of the work is quite high. This year I kept pretty close to the Machine Learning sessions and got turned on to the wealth of unsupervised structured learning which I've overlooked over the last N>5 years.

There were two new trends that I found particularly compelling this year:
  • Noisy Genre
    This pretty clunky term covers genres of language which are not well-formed.  As far as I can tell, this covers everything other than newswire, broadcast news, and read speech. This is what I would call "language in the wild" or, in a snarkier mood, "language" (sans modifier). For the purposes of HLT-NAACL, it covers Twitter messages, email, forum comments, and ... speech recognition output.  It's this kind of language that got me into NLP and why I ended up working on speech, so I'm pretty excited that this is receiving more attention from the NLP community at large.
  • Mechanical Turk for language tasks
    Like the excitement over Wikipedia a few years ago, NLP folks have fallen in love with Amazon's Mechanical Turk. Mechanical Turk was used for speech transcription, sentence compression, paraphrasing, and quite a lot more; there was even a workshop day dedicated solely to this topic. I didn't go to it, but will catch up on the papers this week or so.  This work is very cool, particularly when it comes to automatically detecting and dealing with outlier annotations.  The resource and corpora development uses of Mechanical Turk are obvious and valuable. It's in the development of "high confidence" or "gold standard" resources that I think this work has an opportunity to intersect very nicely with work on ensemble techniques and classifier combination/fusion.  If each Turker is considered to be an annotator, the task of identifying a gold standard corpus is identical to generating a high-confidence prediction from an ensemble.
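
To make the Turker-as-ensemble analogy concrete, here is a minimal Python sketch of the simplest possible combiner: an unweighted majority vote that keeps only items whose agreement clears a threshold. The function name `aggregate_labels`, the `min_agreement` value, and the toy labels are all made up for illustration; none of this is taken from any of the workshop papers, and a real system would likely weight annotators by estimated reliability.

```python
from collections import Counter


def aggregate_labels(annotations, min_agreement=0.7):
    """Treat each Turker as one ensemble member and combine by majority vote.

    annotations: dict mapping item id -> list of labels from different Turkers.
    min_agreement: hypothetical threshold; items below it are not trusted.
    Returns (gold, uncertain), each mapping item id -> (label, agreement).
    """
    gold, uncertain = {}, {}
    for item_id, labels in annotations.items():
        if not labels:
            continue  # nothing to vote on
        # Most frequent label and how many annotators chose it.
        label, votes = Counter(labels).most_common(1)[0]
        agreement = votes / len(labels)
        target = gold if agreement >= min_agreement else uncertain
        target[item_id] = (label, agreement)
    return gold, uncertain


# Toy data: three items, each labeled by a few hypothetical Turkers.
annotations = {
    "utt1": ["yes", "yes", "yes"],
    "utt2": ["yes", "no", "yes", "yes"],
    "utt3": ["no", "yes"],
}
gold, uncertain = aggregate_labels(annotations)
```

Items that land in `uncertain` are the ones you would send back for more annotation or hand to an expert. Swapping the plain vote for a reliability-weighted one is exactly where the classifier combination literature would come in.
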
I had a sense of HLT-NAACL that was unfair:  My impression was that the quality of work was fairly modest.  I attribute this to three factors. 1) In the past there has been a lot of work of the type -- "I found this data set. I used this off-the-shelf ML algorithm. I got these results".  There's nothing particularly wrong with this type of work, except that it's boring, it's not intellectually rigorous, it's not scientifically creative, and it doesn't illuminate the task with any particular clarity. (Ok, so there're at least four things wrong with this kind of work.)  2) HLT-NAACL accepts 4 page short papers.  With its formatting guidelines, it is almost impossible to fit more than a single idea in a 4 page ACL paper.  This leads to a good number of simple or undeveloped ideas. (I've written a fair number of these 4 page papers because they are accepted at a later deadline, but it's always frustrating when you realize you have more to say.)  3) And I think this is probably the most significant -- I've had a good amount of luck getting papers accepted to HLT-NAACL, including my first publication in my first year of grad school.  This is probably just "I don't want to belong to any club that will accept people like me as a member"-syndrome, but it left me underestimating the caliber of this conference.

A couple of specific highlights of papers I liked this year:

  • “cba to check the spelling”: Investigating Parser Performance on Discussion Forum Posts
    Jennifer Foster.  This might be the first time I fully agree with a best paper award.  This paper looked at parsing outrageously sloppy forum comments. These are rife with spelling errors, grammatical errors, weird exclamations (lol).  The paper is a really nice example of the difficulty that "noisy genres" of text pose to traditional (i.e., trained on WSJ text) models.  The error analysis is clear and the paper proposes some nice solutions to bridge this gap by adding noise to the WSJ data. Also, bonus points for subtly including 
  • Cheap, Fast and Good Enough: Automatic Speech Recognition with Non-Expert Transcription
    Scott Novotney and Chris Callison-Burch.  A nice example of using Mechanical Turk to generate training data for a speech recognizer.  High quality transcription of speech is pretty expensive and critically important to speech recognizer performance.  Novotney and Callison-Burch found that Turkers are able to transcribe speech fairly well, and at a fraction of the cost.  This paper includes a really nice evaluation of Turker performance and some interesting approaches to ranking Turker performance.
  • The Simple Truth about Dependency and Phrase Structure Representations: An Opinion Piece
    Owen Rambow.  This paper was probably my favorite in terms of bringing joy and being a breath of fresh air. The argument Rambow lays out is that Dependency and Phrase Structure Representations of syntax are meaningless in isolation.  Moreover, these are simply alternate representations of identical syntactic phenomena.  Linguists love to fight over a "correct" representation of syntax.  This paper takes the position that the distinction between the representations is merely preference not substantive -- fighting over the correct representation of a phenomenon is a distraction to understanding the phenomenon itself.  Full disclosure: I've known Owen for years, and like him personally as well as his work.
  • Type-Based MCMC
    Percy Liang, Michael I. Jordan and Dan Klein.  Over the last few years, I've been boning up on MCMC methods.  I haven't applied them to my own work yet, but it's really only a matter of time.  This work does a nice job of pointing out a limitation of token-based MCMC -- specifically that sampling on a token-by-token basis can make it overly difficult to get out of local minima.  Some of this difficulty can be overcome by sampling based on types, that is, sampling based on a higher level feature across the whole data set, as opposed to within a particular token.  This makes intuitive sense and was empirically well motivated.

As a side note, I'd like to thank all you wonderful machine learning folks who have been doing a remarkable amount of unsupervised structured learning that I should have been paying better attention to over the last few years.  Now I've got to hit the books.

1 comment:

Taniya said...

This is an excellent recap of NAACL-HLT. I am really glad that I came across it while doing a search for noisy genre at NAACL. It really helped round out my general impression of the conference.

Besides the two trends that you mentioned, I noticed that there was quite a buzz about Twitter too. People seemed to be doing all kinds of stuff with it. Detecting new events; detecting controversies. Annotating conversations with dialog tags, named entities, personalized annotation tags based on user’s interests and concerns...

BTW, I also really appreciated your frank observation "people who let me in can't be that great". I feel the same way! Perhaps a new-PhD phenomenon :)