Wednesday, June 16, 2010

Does intensity correlate with prominence in French?

According to a bunch of French researchers who study prosody: No.

I learned this at a prominence workshop at Speech Prosody 2010.  I asked Mathieu Avanzi why, in his paper "A Corpus-based Learning Method for Prominence Detection in Spontaneous Speech", he and his co-authors looked at pitch, duration and pause features, but not intensity or spectral emphasis.  The response: "Intensity does not correlate with prominence in French".

Now, I don't speak French, so far be it from me to comment on what is perceived as intonational prominence in French by French speakers.

But...

Intensity correlates with prominence in (at least) English, Dutch and Italian.  So my curiosity was piqued.

And...

At the same workshop, Mathieu and others released C-PROM, a corpus of French speech which has been annotated for prominence, and labeled by French speakers no less!

So I figured it would only take a few minutes to check it out.  Using the feature extraction routines in AuToBI, I pulled out mean values of pitch, intensity and duration for each annotated syllable.  Armed with a t-test and R, I looked to see which, if any, of these features correlate with the prominence labels.  (For this analysis, I collapsed the annotations for strong and weak prominence.)
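
The original numbers come from R, but for concreteness here is a minimal Python sketch of the same comparison.  The file name and column names (c_prom_features.csv, mean_pitch, mean_intensity, duration, prominent) are hypothetical stand-ins for the per-syllable feature table dumped from AuToBI.

```python
# Minimal sketch (not the original R analysis): two-sample t-tests comparing
# prominent vs. non-prominent syllables on each acoustic feature.
# The CSV name and column names are hypothetical.
import pandas as pd
from scipy import stats

syllables = pd.read_csv("c_prom_features.csv")

for feature in ["mean_pitch", "mean_intensity", "duration"]:
    prominent = syllables.loc[syllables["prominent"] == 1, feature]
    non_prominent = syllables.loc[syllables["prominent"] == 0, feature]
    t, p = stats.ttest_ind(prominent, non_prominent, equal_var=False)
    print(f"{feature}: prominent mean = {prominent.mean():.2f}, "
          f"non-prominent mean = {non_prominent.mean():.2f}, "
          f"t = {t:.3f}, p = {p:.2e}")
```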

So, let's look at the data.
Bold Claim 1: Pitch correlates with prominence in French.

The bimodal distribution of mean pitch is almost certainly due to the presence of male and female speakers in the C-PROM material.  But even without any speaker or gender normalization of pitch, we can still see evidence of the correlation between mean pitch and prominence.  The mean pitch of prominent syllables is 185.6Hz compared to 158.0Hz for non-prominent syllables.  This has an associated t-score of 24.117 (p < 2.2*10^-16).
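
If one did want to control for that bimodality, the usual first step is per-speaker z-score normalization of pitch.  A minimal sketch, assuming the same hypothetical per-syllable table as above also carries a speaker column:

```python
# Hypothetical per-speaker z-score normalization of mean pitch; assumes the
# per-syllable table also records which speaker produced each syllable.
import pandas as pd

syllables = pd.read_csv("c_prom_features.csv")
by_speaker = syllables.groupby("speaker")["mean_pitch"]
syllables["mean_pitch_z"] = (
    syllables["mean_pitch"] - by_speaker.transform("mean")
) / by_speaker.transform("std")
```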

Bold Claim 2: Duration correlates with prominence in French.

This result is even clearer.  Prominent syllables are on average 97ms longer (261ms) than non-prominent syllables (164ms). This has a t-value of 54.240 (p < 2.2*10^-16).

Bold Claim 3: Intensity correlates with prominence in French.

Well, there it is.  The difference is not as pronounced as the differences in pitch or duration, but the data show a clear correlation between the mean intensity of a syllable and whether the syllable is prominent or not.  Prominent syllables are on average 1.6dB louder than non-prominent syllables (72.08dB vs. 70.48dB).  This corresponds to a t-value (15.174) that is lower than those seen in the pitch and duration analyses, but still significant (p < 2.2*10^-16).

Now... this is clearly a very basic analysis of correlates of prominence in French speech.  But based on these results, I'm comfortable answering the question.

Does intensity correlate with prominence in French? Yes.

[edited at 12:43pm 6/16/2010]

Following up on a comment from Raul Fernandez, I thought I'd post a parallel plot on the correlation of intensity and prominence in English.
Note that this chart is based on data on American English *words* from the Boston Directions Corpus.  Because these are words, the prominent distribution includes some data from non-prominent syllables, so it's not exactly a one-to-one comparison.  But there is evidence that acoustic aggregations drawn from words make *better* predictors of prominence (cf. Rosenberg & Hirschberg 2009).
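
For concreteness, here is a rough sketch of what rolling syllable-level measurements up into word-level aggregations could look like; the word_id column and the particular aggregation choices are assumptions for illustration, not the features used in the cited work.

```python
# Hypothetical roll-up of syllable-level features into word-level aggregations;
# assumes each syllable row also records the word it belongs to ("word_id").
import pandas as pd

syllables = pd.read_csv("c_prom_features.csv")
words = syllables.groupby("word_id").agg(
    mean_intensity=("mean_intensity", "mean"),  # average intensity over the word
    duration=("duration", "sum"),               # total word duration
    prominent=("prominent", "max"),             # prominent if any syllable is
)
```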

Here we find a similar difference in mean intensity (1.9dB) between prominent (60.8dB) and non-prominent words (58.9dB). This has an associated t-value of 21.234 (p < 2.2*10^-16).

There is little controversy about the correlation of intensity with prominence in English.  (In the last few years there has even been work suggesting that intensity is a better predictor of prominence than pitch; cf. Kochanski et al. 2005, Silipo & Greenberg 2000, and Rosenberg & Hirschberg 2009.)  Of course, this chart doesn't indicate that the relationship between intensity and prominence is equivalent in French and English -- merely that the French correlation deserves more attention.

Tuesday, June 08, 2010

HLT-NAACL 2010 Recap

I was at HLT-NAACL in Los Angeles last week.  HLT isn't always a perfect fit for someone sitting towards the speech end of the Human Language Technologies spectrum.  Every year, it seems, the organizers try (or claim to try) to attract more speech and spoken language processing work.  It hasn't quite caught on yet and the conference tends to be dominated by Machine Translation and Parsing.  However...The (median) quality of the work is quite high. This year I kept pretty close to the Machine Learning sessions and got turned on to the wealth of unsupervised structured learning which I've overlooked over the last N>5 years.


There were two new trends that I found particularly compelling this year:
  • Noisy Genre
    This pretty clunky term covers genres of language which are not well-formed.  As far as I can tell, this covers everything other than newswire, broadcast news, and read speech. This is what I would call "language in the wild" or, in a snarkier mood, "language" (sans modifier). For the purposes of HLT-NAACL, it covers Twitter messages, email, forum comments, and ... speech recognition output.  It's this kind of language that got me into NLP and why I ended up working on speech, so I'm pretty excited that it is receiving more attention from the NLP community at large.
  • Mechanical Turk for language tasks
    Like the excitement over Wikipedia a few years ago, NLP folks have fallen in love with Amazon's Mechanical Turk. Mechanical Turk was used for speech transcription, sentence compression, paraphrasing, and quite a lot more; there was even a workshop day dedicated solely to this topic. I didn't go to it, but will catch up on the papers in the next week or so.  This work is very cool, particularly when it comes to automatically detecting and dealing with outlier annotations.  The resource and corpora development uses of Mechanical Turk are obvious and valuable. It's in the development of "high confidence" or "gold standard" resources that I think this work has an opportunity to intersect very nicely with work on ensemble techniques and classifier combination/fusion.  If each Turker is considered to be an annotator, the task of identifying a gold standard corpus is identical to generating a high-confidence prediction from an ensemble (a toy sketch of this view is below).
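
To make that analogy concrete, here is a toy sketch, not any system presented at the workshop, of treating each Turker as one annotator in an ensemble and accepting a label as gold only when a reliability-weighted vote clears a confidence threshold.  The annotator names, weights, and threshold are made up.

```python
# Toy sketch of treating each Turker as one annotator in an ensemble: a label
# becomes "gold" only when a reliability-weighted vote clears a confidence
# threshold.  Annotator names and reliability weights are purely illustrative.
from collections import defaultdict

def vote(annotations, weights, threshold=0.75):
    """annotations: {annotator: label}; weights: {annotator: reliability}."""
    scores = defaultdict(float)
    total = 0.0
    for annotator, label in annotations.items():
        w = weights.get(annotator, 1.0)
        scores[label] += w
        total += w
    best_label, best_score = max(scores.items(), key=lambda kv: kv[1])
    # Return None when no label has enough weighted support to call it gold.
    return best_label if total > 0 and best_score / total >= threshold else None

print(vote({"turker_1": "dog", "turker_2": "dog", "turker_3": "cat"},
           {"turker_1": 0.9, "turker_2": 0.8, "turker_3": 0.4}))  # -> dog
```
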
I had a sense of HLT-NAACL that was unfair: my impression was that the quality of the work was fairly modest.  I attribute this to three factors. 1) In the past there has been a lot of work of the type "I found this data set. I used this off-the-shelf ML algorithm. I got these results."  There's nothing particularly wrong with this type of work, except that it's boring, it's not intellectually rigorous, it's not scientifically creative, and it doesn't illuminate the task with any particular clarity. (Ok, so there are at least four things wrong with this kind of work.)  2) HLT-NAACL accepts 4-page short papers.  With its formatting guidelines, it is almost impossible to fit more than a single idea in a 4-page ACL paper.  This leads to a good number of simple or undeveloped ideas. (I've written a fair number of these 4-page papers because they are accepted at a later deadline, but it's always frustrating when you realize you have more to say.)  3) And, I think this is probably the most significant: I've had a good amount of luck getting papers accepted to HLT-NAACL, including my first publication in my first year of grad school.  This is probably just "I don't want to belong to any club that will accept people like me as a member" syndrome, but it left me underestimating the caliber of this conference.

A couple of specific highlights of papers I liked this year:

  • “cba to check the spelling”: Investigating Parser Performance on Discussion Forum Posts
    Jennifer Foster.  This might be the first time I fully agree with a best paper award.  This paper looked at parsing outrageously sloppy forum comments, which are rife with spelling errors, grammatical errors, and weird exclamations (lol).  The paper is a really nice example of the difficulty that "noisy genres" of text pose to traditional (i.e., trained on WSJ text) models.  The error analysis is clear, and the paper proposes some nice solutions to bridge this gap by adding noise to the WSJ data. Also, bonus points for subtly working an example of the genre into the paper's title.
  • Cheap, Fast and Good Enough: Automatic Speech Recognition with Non-Expert Transcription
    Scott Novotney and Chris Callison-Burch.  A nice example of using Mechanical Turk to generate training data for a speech recognizer.  High-quality transcription of speech is pretty expensive and critically important to speech recognizer performance.  Novotney and Callison-Burch found that Turkers are able to transcribe speech fairly well, and at a fraction of the cost.  This paper includes a really nice evaluation of Turker transcription quality and some interesting approaches to ranking Turkers.
  • The Simple Truth about Dependency and Phrase Structure Representations: An Opinion Piece
    Owen Rambow.  This paper was probably my favorite in terms of bringing joy and being a breath of fresh air. The argument Rambow lays out is that Dependency and Phrase Structure Representations of syntax are meaningless in isolation.  Moreover, they are simply alternate representations of identical syntactic phenomena.  Linguists love to fight over a "correct" representation of syntax.  This paper takes the position that the distinction between the representations is merely a matter of preference, not substance -- fighting over the correct representation of a phenomenon is a distraction from understanding the phenomenon itself.  Full disclosure: I've known Owen for years, and like him personally as well as his work.
  • Type-Based MCMC
    Percy Liang, Michael I. Jordan and Dan Klein.  Over the last few years, I've been boning up on MCMC methods.  I haven't applied them to my own work yet, but it's really only a matter of time.  This work does a nice job of pointing out a limitation of token-based MCMC -- specifically, that sampling on a token-by-token basis can make it overly difficult to get out of local minima.  Some of this difficulty can be overcome by sampling based on types, that is, sampling based on a higher-level feature across the whole data set, as opposed to within a particular token.  This makes intuitive sense and was empirically well motivated.

As a side note, I'd like to thank all you wonderful machine learning folks who have been doing a remarkable amount of unsupervised structured learning that I should have been paying better attention to over the last few years.  Now I've got to hit the books.

Wednesday, June 02, 2010

Mistakes I've made: The first of an N-part series.

In this installment: some mistakes made, and lessons learned, teaching Machine Learning.

  1. Using too few Examples.

    Everyone, myself especially, learns best from examples. Hands-on examples are even better. My class used hardly any.  I think that explains many of the blank stares I got in response to "are there any questions?"  It's very easy to ask a question about an example -- "Wait, why does the entropy equal 2.3?". It's much more difficult to ask a question like "Could you clarify the advantages and disadvantages of L2 vs. L1 regularization? You lost me at 'gradient'."
  2. Starting with the Math.

    I spent the first two classes "reviewing" the linear algebra and calculus that would be necessary to really get into the details of some of the more complicated algorithms later in the course.  Big mistake.  First of all, this material wasn't review for many students -- an unexpected (but rapid) realization.  Second of all, I had already lost sight of the point of the math. The math is there to support the big ideas of modeling and evaluation.  These can't be accomplished without the math, but I put the cart way before the horse.  In the future, I'll be starting with generalization with as little math as possible, and then bringing it in as needed.
  3. Ending with Evaluation.

    The class included material on mean-squared error and classification error rates far earlier than I introduced the ideas of evaluation.  Sure, accuracy is a pretty intuitive notion, but it's a big assumption that everybody in the seats will know what I'm talking about.  Even the relatively simple distinction between linear and squared error only takes 10 minutes to discuss, but it goes a long way towards instilling greater understanding of what's going on.
  4. Ambitious and unclear goals and expectations.

    While this was never explicit, on reflection it is obvious to me that my goal for the course was for the students to "know as much as I do about machine learning".  It should have been "understand the fundamentals of machine learning".  Namely, 1) how can we generalize from data (statistical modeling), 2) how can we apply machine learning (feature extraction), and 3) how do we know if the system works (evaluation).

    For instance, I spent a lot of time showing how frequentists and Bayesians can come to the same end point w.r.t. L2 regularization in linear regression. I think this is way cool, but is it more important than spending an hour describing k-nearest neighbors?  Only for me, not for the students.  Was it more helpful to describe MAP adaptation than decision trees?  Almost definitely not.  Decision trees are disarmingly intuitive.  They can be represented as if-statements (see the sketch after this list), and they provide an easy example of overfitting (without requiring that students know that a degree-n polynomial can pass through n+1 points).  But I thought they were too simple, and not "statistical" enough to fit in with the rest of the class.  Oops.
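
To illustrate just how intuitive decision trees can be, here is a toy learned tree written out as plain if-statements.  The task, features, and thresholds are entirely made up for illustration.

```python
# A toy decision tree written out as if-statements; the task, feature names,
# and thresholds are made up purely to show how readable the learned model is.
def classify_email(num_links: int, sender_known: bool, all_caps_subject: bool) -> str:
    if num_links > 3:
        if all_caps_subject:
            return "spam"          # many links + shouty subject
        return "suspicious"        # many links, but otherwise plain
    if sender_known:
        return "ham"               # few links, known sender
    return "suspicious"            # few links, unknown sender

print(classify_email(num_links=5, sender_known=False, all_caps_subject=True))  # -> spam
```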

Well, that's (at least) four mistakes -- hopefully next time they'll be all new mistakes.