Monday, October 22, 2012

Reading and Reconnecting

With my own travel finally settling down, even as it ramps up for my wife's book tour, I'm able to settle into some long overdue reading, thinking and planning.

Also, after an exciting conversation with Dogan Can, during a trip to USC's SAIL lab, I'm trying to get more on top of sharing ideas, progress and information here.

First up: some drill-down reading from Paul Mineiro's blog post on Bagging!

Ensemble methods work too well for me to understand them so poorly, so:

  • How out-of-bag estimates can be used to get at generalization error (better than cross-validation can).  
  • The relationship between the bias-variance tradeoff and ensemble methods from this lecture.  This is a nicely framed discussion of ensemble methods that I hadn't seen before.
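A minimal sketch of why out-of-bag estimates come essentially for free with bagging (made-up dataset size, and no actual classifier here, just the sampling):

```python
import random

random.seed(0)
n_points = 1000   # hypothetical dataset size
n_rounds = 200    # number of bagged classifiers

# Each bagging round draws a bootstrap sample (with replacement); the
# points it never draws form that round's "out-of-bag" set, and can be
# scored by a classifier that never saw them during training.
oob_counts = [0] * n_points
for _ in range(n_rounds):
    in_bag = {random.randrange(n_points) for _ in range(n_points)}
    for i in range(n_points):
        if i not in in_bag:
            oob_counts[i] += 1

# On average a point is out-of-bag in (1 - 1/n)^n ~ 1/e ~ 36.8% of
# rounds, so aggregating OOB predictions gives a generalization
# estimate without holding out a separate fold.
oob_fraction = sum(oob_counts) / (n_points * n_rounds)
print(round(oob_fraction, 3))  # roughly 0.368
```

Because every point gets predictions from the (many) classifiers that excluded it, the aggregate OOB error behaves like a held-out estimate while still training on all the data.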

Lying Words: Predicting Deception From Linguistic Styles.  This paper describes a common pattern of language use in deceptive story-telling:  Less self-reference. More negative emotion words. Less cognitive complexity.  

I'm looking forward to verifying these claims on some old deception data. And taking a look at debate transcripts through this lens. 

Monday, September 17, 2012

Interspeech 2012 Recap

Portland proved to be a great venue for this year's Interspeech.  (Though people who attended ACL 2011 probably already could have guessed that.)

Setting up three simultaneous poster sessions in the parking garage may not sound like the mark of a good conference, but it was perfect.  There was loads of space between each presenter.  It allowed for all three sessions to be in the same place.  And the folks at the Hilton did a great job of making it fairly unrecognizable as a parking lot.  (In fact, Alejna Brugos didn't realize it until they were removing the carpets and "walls" on Thursday afternoon.)

Deep Neural Networks.
For "trends", there's really nothing hotter right now than Deep Neural Networks or Deep Belief Nets.  This isn't an area that I do research in, but the story goes more or less like this: neural networks with more than a few hidden layers don't train very well with back-propagation. Geoff Hinton and his group figured out how to overcome this limitation not too long ago.  (I think this 2006 paper explains it, but I can't be 100% sure.) Then at ASRU 2011 and ICASSP 2011 and 2012, the folks at Microsoft showed that you can use Deep Neural Networks to generate *very* useful front end features.  (Tara Sainath has a nice recap of ICASSP 2012 here.) Now, everyone wants a piece.
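Again, this isn't my area, but the greedy layer-wise idea can be sketched with simple tied-weight autoencoders standing in for Hinton's RBMs: train each layer to reconstruct the output of the layer below, then stack the learned weights. (All sizes and hyperparameters here are made up for illustration.)

```python
import numpy as np

rng = np.random.default_rng(0)

def train_autoencoder(X, hidden, epochs=50, lr=0.01):
    """One tied-weight autoencoder layer trained by gradient descent."""
    n, d = X.shape
    W = rng.normal(scale=0.1, size=(d, hidden))
    for _ in range(epochs):
        H = np.tanh(X @ W)   # encode
        R = H @ W.T          # decode with tied weights
        err = R - X
        # Gradient of 0.5 * ||R - X||^2 w.r.t. W: a decoder term plus a
        # term backpropagated through the encoder nonlinearity.
        dH = (err @ W) * (1.0 - H ** 2)
        grad = X.T @ dH + err.T @ H
        W -= lr * grad / n
    return W

# Greedy layer-wise pretraining: train each layer to reconstruct the
# encoded output of the layer below, then stack the learned weights.
X = rng.normal(size=(200, 20))
layers, inp = [], X
for hidden in (16, 8):
    W = train_autoencoder(inp, hidden)
    layers.append(W)
    inp = np.tanh(inp @ W)

print([W.shape for W in layers])  # [(20, 16), (16, 8)]
```

After pretraining, the stacked weights initialize a deep network that is then fine-tuned end-to-end with ordinary back-propagation, which is the part that didn't work well from a random start.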

The field has expanded from Microsoft to include IBM and Stanford/Berkeley/Google and RWTH Aachen.  Joining them with posters on Deep Neural Nets for ASR are Tsinghua, CMU, Karlsruhe, NTT, INESC-ID, UWashington, and Georgia Tech.  At this point, there's no way to deny that this approach is receiving significant research attention.  The results seem to be holding up.  If only they didn't take so long to train...

Prominence Special Session.
I was particularly looking forward to the Special Session on Prominence.  On balance, I was happy with the session.  It gave focus to work and discussion of prosody that can sometimes feel diffuse and scattered at a large conference like Interspeech.

I found this session to be surprising in a few ways.

It's been my understanding that "prominence" was used as a catch-all term to cover diverse prosodic phenomena including stress, emphasis, and pitch accenting.  The first surprising element of this session came in a review of the paper I submitted to it.  The paper is on the use of automatically predicted pitch accents and intonational phrase boundaries to improve pronunciation modeling.  The review, while generally positive, found the paper not to be appropriate for a prominence session because it explored the use of "pitch accents" rather than "prominence".  I still haven't gotten a good explanation of the difference, and the reviews are blind.

A second surprise is that there seems to be a movement away from a phonological theory of prosody. Mark Hasegawa-Johnson and Jennifer Cole have been doing work over the last few years investigating how naive listeners perceive prominence.  They've consistently found that listeners respond to different qualities, sometimes at different thresholds, when assessing prominence.  I've found this line of research to be interesting and generally informative, but not a clear indictment of the theory that perceptual and productive prosodic categories exist.  The panel (which I was a part of) on balance seemed comfortable with the idea that prominence is a continuous rather than categorical phenomenon.  This view was most directly expressed by Denis Arnold, who said approximately: focus can be categorical, stress can be categorical, while prominence is still continuous. I didn't understand this statement then, and still don't.  But again, this may be due to a different definition of prominence than I use.

The last surprise comes from finding out that there is a line of research pursuing language universals in prominence, and in prosody more broadly. Petra Wagner and Fabio Tamburini (the session organizers) are planning a workshop to investigate this.  In my experience, while the dimensions of prosodic variation may be used in multiple languages, and some of these (e.g. increased intensity or duration) may be used to indicate prominence in all languages, it is extremely unlikely that either the communicative impacts of prosodic variation or its realization and perception are language-universal.  From that perspective, I'm not quite clear about what this line of research hopes to accomplish, but I'm curious about where it ends up.

Dynamic Decoding.
It appears that every year, I find myself sitting in on an oral session on a topic that I know very little about.  Last year it was the language identification session.  This year it was Dynamic Decoding.  I was most intrigued by this because I hadn't heard the term before.  When I asked someone what it was, they said "I don't know, Viterbi?".

I'm not quite sure this is a good enough distillation of the topic, but the papers in this session were about how to make on-the-fly (or post-training) modifications to language or pronunciation models.  This is a cool idea with clear practical importance -- how do you add words to a recognizer on a mobile device and have them appropriately incorporated into the LM and pronunciation model?  These two papers have some interesting WFST-based approaches to this task.  I'll be curious to learn more about this.  Also, if anyone has a more precise definition of this research area, I'd love to hear it.
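As a toy illustration of the bookkeeping involved (no WFSTs here, and nothing from the actual papers): adding a word means touching both the LM and the lexicon, and renormalizing so the LM remains a distribution. The function name and the fixed probability-mass scheme are made up.

```python
def add_word(lm, lexicon, word, pronunciation, prob_mass=0.01):
    """Splice a new word into a unigram LM and a pronunciation lexicon.

    Reserves a little probability mass for the new word and rescales the
    existing entries so the LM still sums to one.  (The weighting scheme
    here is invented for illustration.)
    """
    scale = 1.0 - prob_mass
    new_lm = {w: p * scale for w, p in lm.items()}
    new_lm[word] = prob_mass
    new_lexicon = dict(lexicon)
    new_lexicon[word] = pronunciation
    return new_lm, new_lexicon

lm = {"the": 0.6, "cat": 0.4}
lexicon = {"the": ["DH", "AH"], "cat": ["K", "AE", "T"]}
lm, lexicon = add_word(lm, lexicon, "portland",
                       ["P", "AO", "R", "T", "L", "AH", "N", "D"])
print(round(sum(lm.values()), 6))  # 1.0
```

The hard part the papers address is doing this splice efficiently inside an already-composed and optimized decoding graph, rather than in a dictionary.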

Finally, some comments on two of the keynotes.

There were four keynotes at this year's Interspeech; two were about interesting inter/multi-disciplinary questions about how speech processing intersects with music and animal vocalization, respectively.

Chin-Hui Lee: An Information-Extraction Approach to Speech Analysis and Processing
A third was delivered by this year's ISCA medalist, Chin-Hui Lee.  Prof. Lee's most famous accomplishment is MAP adaptation in acoustic modeling.  This is a researcher who spent a career treating speech recognition as a pattern matching problem, a view embodied by the Fred Jelinek quote: "Every time I fire a linguist, the performance of the speech recognizer goes up."  Yet in a talk summarizing a successful career, Prof. Lee presented a view of speech recognition in which linguistic knowledge and speech science should be incorporated into the task.  This is an alternate perspective that has been investigated by a lot of talented researchers, including Hynek Hermansky, Jennifer Cole, Mark Hasegawa-Johnson, Alex Waibel, Hermann Ney (via speech-to-speech translation), Mari Ostendorf, Elizabeth Shriberg, Andreas Stolcke, Rene Beutler, Karen Livescu and many more (my apologies to anyone I missed).

I was struck by the evolution of perspective from someone who represents the statistical pattern matching approach to recognizing the potential importance of linguistic knowledge.

However, this talk was not so well received by some members of the audience, for fairly obvious reasons.  First, it over-played the importance of Prof. Lee's own contributions.  In a slide on "My contributions", virtually all major improvements to ASR over the last 20 years were mentioned, including most styles of adaptation (MAP among them) and virtually all major forms of discriminative training.  Second, it failed to recognize that the linguistically inspired approach he was advocating for the future has been extensively researched by other talented peers.

On balance, I found it a compelling message, even if the delivery understandably rubbed some people the wrong way.

Michael Riley: Weighted Transducers in Speech and Language Processing
I should preface my comments about Michael Riley's keynote by saying that we worked together while I was interning at Google.  I'm a fan.  Michael has the rare quality of being the smartest guy in the room without letting anyone know until it's genuinely useful.

The best part of this keynote was the history of the Weighted Finite State Transducer.  This was a great story that takes place largely at Bell Labs in the 90s and features Fernando Pereira, Mehryar Mohri and, naturally, Michael Riley.  This section was appropriately personal, while presenting relevant recent history.  The WFST is so ubiquitous in speech and NLP applications that it's easy to forget that it has a human context.

Much of the rest of the keynote felt like a 3 hour tutorial compressed into 40 minutes.  This involved showing algorithms and example WFSTs, and describing all of the things they can be used for.  While a successful demonstration of the breadth of application, it was presented at such a pace that it was difficult to get anything out of it if you didn't know it already.  I'd point interested readers to the references on the OpenFST page for more thorough tutorials that can be digested at your own pace.

Interspeech 2012 was successful and fun.  Portland and the Hilton (and its solid wifi) were excellent hosts.  There was good work, and, as ever, more than I could see.  If you have great or favorite papers that I missed, please let me know!

Wednesday, August 29, 2012

Overview of Speech and Spoken Language Processing

Here's the premise: I was invited to give a guest lecture in Advanced Natural Language Processing.   The students will get one week out of 14 focusing on speech and spoken language processing. But it's early in the semester, so there's an opportunity to give a perspective about how speech fits into the lessons that they'll be learning in more detail later in the semester.

Here's the question: how do you spend 75 minutes to provide a useful survey of speech and spoken language processing?

My answer, in powerpoint form, can be found here.

I spent about 2/3 or so of the material on speech recognition.  I figured most students are fascinated by the idea of a machine being able to get words from speech, so let's go through the fundamentals of the technology behind it.

In the remaining 1/3 or so, I focus on the notion that speech recognition is not sufficient for speech understanding.  There's a lot of other information in speech that is either 1) unavailable in text, or 2) unavailable in ASR transcripts.  The premise of this section is to convince students that speech isn't just a noisy string of unadorned words: there's a lot of information about structure and intention available from the speech signal. What's more, we can use it in spoken language processing.

There is an outrageous number of important concepts that get almost no attention here, including but not limited to: digital signal processing, human speech production and perception, speech synthesis, multimodal speech processing, speaker identification, language identification, building speech corpora, linguistic annotation, discourse and dialog, and conversational agents.

Would you do it differently?  I'm curious what some other takes on this problem might look like.

Thursday, August 16, 2012

AuToBI v1.3

This release of AuToBI is a more traditional milestone release than v1.2 was.  Trained models and a new .jar file will be available on the AuToBI site shortly.

There are improvements to performance that are thoroughly documented in a submission to IEEE SLT 2012.  These improvements were achieved from two sources.

First, AuToBI uses importance weighting to improve classification performance on skewed distributions.  I found this to be a more useful approach than the standard under- or over-sampling.  This is discussed in a paper that will appear at Interspeech next month.
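For the curious, one common importance-weighting scheme (not necessarily the exact one in the paper) gives each class a weight inversely proportional to its frequency, so every class contributes equal total mass to training:

```python
from collections import Counter

def importance_weights(labels):
    """Per-class weights inversely proportional to class frequency.

    With these weights the total weighted mass of every class is equal,
    so a skewed distribution doesn't swamp the minority class.
    """
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * counts[c]) for c in counts}

# Toy skewed distribution: 90 unaccented words vs. 10 accented ones
labels = ["unaccented"] * 90 + ["accented"] * 10
weights = importance_weights(labels)
print(weights["accented"])  # 5.0
```

Unlike under-sampling, nothing is thrown away; unlike over-sampling, no duplicate points are manufactured -- minority examples simply count for more in the training objective.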

Second, inspired by features that Taniya Mishra, Vivek Sridhar and Alistair Conkie developed at AT&T, I included some new features which had a big payoff.  (They describe these features in an upcoming Interspeech 2012 paper.)  One of the most significant was the area under a normalized intensity curve.  This has a strong correlation with duration, but is more robust.  You could argue that it approximates "loudness" by incorporating duration and intensity.  That's a pretty weak psycholinguistic or perceptual argument, so I wouldn't make it too strongly, but it could be part of the story.
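A minimal sketch of the computation (simplified from what AuToBI actually does; the frame step and peak normalization here are illustrative choices):

```python
def intensity_auc(intensity, frame_step=0.01):
    """Area under a peak-normalized intensity contour.

    Normalizing by the peak keeps the feature comparable across
    speakers and recordings; the area then grows with both the duration
    of the region and how long the intensity stays high.
    """
    peak = max(intensity)
    if peak <= 0:
        return 0.0
    norm = [v / peak for v in intensity]
    # Trapezoidal integration, assuming a fixed frame step in seconds
    return sum((a + b) / 2.0 for a, b in zip(norm, norm[1:])) * frame_step

# A longer region with sustained intensity accrues more area
short = [0.2, 0.5, 0.4]
long_ = [0.2, 0.6, 0.9, 0.8, 0.5, 0.3]
print(intensity_auc(short) < intensity_auc(long_))  # True
```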

Here is a recap of speaker-independent acoustic-only performance on the six ToBI classification tasks on BURNC speaker f2b.

  • Pitch Accent Detection: v1.2 81.01% (F1: 83.28), v1.3 84.83% (F1: 86.58)
  • Intermediate Phrase Detection: v1.2 75.41% (F1: 43.15), v1.3 77.97% (F1: 44.43)
  • Intonational Phrase Detection: v1.2 86.91% (F1: 74.50), v1.3 90.36% (F1: 76.49)
  • Pitch Accent Classification: v1.2 18.46% (Avg. Recall: 18.97), v1.3 16.33% (Avg. Recall: 21.06)
  • Phrase Accent Classification: v1.2 48.34% (Avg. Recall: 47.99), v1.3 47.44% (Avg. Recall: 48.31)
  • Phrase Accent/Boundary Tone Classification: v1.2 73.18% (Avg. Recall: 25.92), v1.3 74.47% (Avg. Recall: 26.02)

There are also a number of improvements to AuToBI from a technical side and as a piece of code.

First of all, unit test coverage has increased from ~11% to ~73% between v1.2 and v1.3.

Second, there was a bug in the PitchExtractor code causing a pretty serious under-prediction of unvoiced frames.  (A big thanks to Victor Soto for finding this bug.)

Third, memory use is much lower, through more aggressive deletion of prediction attributes and a modification of how WavReader works.

I'd like to thank Victor Soto, Fabio Tesser, Samuel Sanchez, Jay Liang, Ian Kaplan, Erica Cooper and, as ever, Julia Hirschberg and anyone else who has been using AuToBI, for their patience and feedback.

I've been pretty lax about posting here.  I'll try to get better about it in the coming academic year.

This fall is full of travel which will lead to a lot of ideas and not enough time to work on them.

Tuesday, January 10, 2012

AuToBI Version 1.2

I hadn't really planned for this current improvement to AuToBI to be a milestone release.

I'm about halfway through an effort to get test coverage up to 90-95% of lines and 100% of classes.  I promise it'll get there eventually.

But in the meantime, I was playing with an improvement to how attributes are associated with data points.  I knew this was a significant source of inefficiency, but didn't quite expect this much of one.
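I won't reproduce the actual change here, but the flavor of this kind of fix can be sketched: rather than giving each data point its own name-to-value map, keep the feature names once in a shared registry and store only a flat list of values per point. (Entirely hypothetical layout and names, in Python rather than AuToBI's Java.)

```python
import sys

N_FEATURES = 136  # roughly the feature count from the experiment below

# Heavier layout: every data point carries its own name -> value map
point_as_dict = {"feature_%03d" % i: float(i) for i in range(N_FEATURES)}

# Leaner layout: feature names live once in a shared registry, and each
# point stores only a flat list of values
shared_names = ["feature_%03d" % i for i in range(N_FEATURES)]
point_as_list = [float(i) for i in range(N_FEATURES)]

# The container overhead alone is much smaller for the list, and with
# ~22k points the per-point savings add up quickly
print(sys.getsizeof(point_as_list) < sys.getsizeof(point_as_dict))  # True
```

`sys.getsizeof` only measures the container itself, not the values it holds, which is exactly the overhead that a shared registry eliminates.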

Here are memory usage graphs for training a Pitch Accent Detection model on the Boston University Radio News Corpus -- about 22k data points and 136 features.  The first one is on my MacBook Pro laptop with 4GB RAM (and a lot of other nonsense running).

The max memory usage of Version 1.1 was 1914MB; with this improvement it tops out at 1049MB, an improvement of about 45%.  (You'll notice it also ends a little quicker, but this is probably because of fewer or quicker garbage collection calls.)

I figured I'd check on a compute server too, one of the Speech Lab @ Queens College's quad-core Intel Xeon E5450 machines (3.0GHz, 2x6M L2, 1333) with 4GB RAM.
Similar results here: max memory usage of version 1.1 was 2343MB, and with the improvement, 1392MB, an improvement of about 40%.  (The speed improvement shows up here too.)  I don't have a good explanation for why the linux version takes more memory to run, but for now I'll assume it has something to do with differences in the JVM.

There are some other bugfixes in this version, but this is the big reason to upgrade.

Version 1.2 is available from GitHub:
git clone

Tuesday, January 03, 2012

English Pronunciation by G. Nolst Trenité

This is a repost of a poem that's been going around Facebook today.

It's an incredibly elegant set of examples about why grapheme-to-phoneme (letter-to-sound) conversion is so difficult in English.  (Maybe this should be a required regression test for any TTS frontend...)
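The classic offender is "ough": five words are enough to break any context-free letter-to-sound rule (the ARPAbet-ish transcriptions are my own):

```python
# The same letter string "ough" maps to several different phoneme
# strings, so no context-free letter-to-sound rule can cover them.
ough_words = {
    "though":  ["OW"],
    "through": ["UW"],
    "tough":   ["AH", "F"],
    "cough":   ["AO", "F"],
    "bough":   ["AW"],
}

def naive_rule(word):
    """Pretend "ough" is always pronounced /ow/."""
    return ["OW"]

errors = sum(naive_rule(w) != p for w, p in ough_words.items())
print(errors, "of", len(ough_words))  # 4 of 5
```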

Please enjoy.
English Pronunciation by G. Nolst Trenité  (after the break)