Some thoughts on Spoken Language Processing, with tangents on Natural Language Processing, Machine Learning, and Signal Processing thrown in for good measure.
Friday, December 03, 2010
Well, no one wanted to join me for Paper Writing Month, so I can only report on my own progress for November.
I got one full draft written, and the experiments for two others are more or less done running.
I can point out a number of factors that kept me from being more productive, none of which make me disappointed in myself. All in all, this was a good exercise, and one I'll do again. Maybe next time I can find a buddy or two to join me on it.
Now that I've written four (now five) sentences including a first-person personal pronoun, here are some more general thoughts about research.
Alex and Aki at Ideas in Food have a blog post about Creativity in the Kitchen. They loosely break this down into Inspiration, Flexibility, Motivation, Adaptation and Refinement. Both these top-level categories and the specifics they address apply almost as well to creativity in research. So check it out and cross-pollinate a little.
Thursday, November 11, 2010
Cross-validation with one model
This is essentially a repost of Rob J Hyndman's blog post on the relevance of cross-validation for statisticians.
Within this very nice piece, Rob drops this bomb of mathematical knowledge:
It is not necessary to actually fit separate models when computing the CV statistic for linear models.
Say what?
Here is a broader excerpt and the method itself (after the jump).
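For the curious, the trick rests on the hat matrix $H = X(X'X)^{-1}X'$: for a linear model, the leave-one-out CV statistic is just $\frac{1}{n}\sum_i \left(\frac{e_i}{1-h_{ii}}\right)^2$, where the $e_i$ are the residuals of the single full fit and the $h_{ii}$ are the diagonal entries of $H$. Here's a minimal numpy sketch of that identity -- my own illustration, not code from Rob's post:

```python
import numpy as np

def loocv_linear(X, y):
    """Leave-one-out CV error for ordinary least squares from a single fit.

    Uses CV = (1/n) * sum_i (e_i / (1 - h_ii))^2, where e are the residuals of
    the full fit and h_ii the diagonal of the hat matrix H = X (X'X)^{-1} X'.
    No model is ever refit.
    """
    X = np.column_stack([np.ones(len(X)), X])       # add an intercept column
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)    # one OLS fit on all the data
    residuals = y - X @ beta
    h = np.diag(X @ np.linalg.solve(X.T @ X, X.T))  # leverages h_ii
    return np.mean((residuals / (1.0 - h)) ** 2)

# toy data; brute-force refitting n leave-one-out models gives the same number
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=50)
print(loocv_linear(X, y))
```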
Saturday, November 06, 2010
Semantically Related Term Challenge
Joseph Turian over at MetaOptimize.com has posted a fun NLP challenge.
The task is to identify semantically related words from a shared corpus.
So you're thinking, sure, no problem. I'll look for common co-occurrences. Maybe I'll start with some seed pairs and do some bootstrapping. Or you could do LSA, if you're into that sort of thing.
But here's the rub: there are a few million documents, so you've got to get clever if you're going to use LSA (because that would require the SVD of an impossibly large, sparse matrix).
As if that weren't challenging enough, these "documents" are only a word or two long, so the co-occurrences you find are going to be pretty sparse.
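One note on scale while we're here: truncated SVD routines sidestep the dense decomposition entirely -- they only need sparse matrix-vector products -- so an LSA-style approach isn't automatically off the table. A rough sketch of that direction (the matrix below is a random stand-in; in practice you'd build the sparse term-document or co-occurrence counts from the corpus):

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# Stand-in for a sparse term-document count matrix built from the corpus.
counts = sparse_random(2000, 10000, density=1e-3, format="csr", random_state=0)

# Truncated SVD (LSA-style): only the top-k singular triplets are computed,
# via sparse matrix-vector products -- the full dense SVD is never formed.
U, s, Vt = svds(counts, k=50)
term_vecs = U * s                      # low-rank term representations

def most_similar(i, topn=5):
    """Indices of terms whose vectors are most cosine-similar to term i."""
    norms = np.linalg.norm(term_vecs, axis=1) + 1e-12
    sims = (term_vecs @ term_vecs[i]) / (norms * norms[i])
    return np.argsort(-sims)[1:topn + 1]

print(most_similar(42))
```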
So, that's it. Have at it.
Monday, November 01, 2010
National Novel Writing Month
November is National Novel Writing Month. In past years, I've known two or three people who have set out to write a complete novel within the month. To all of those writers who push out 100,000 words in a month, you have a huge gold star in my book.
Now, I don't quite have the motivation, creativity or time to try to dig a novel out of my head, but I'm going to put a spin on it. This November will be Personal Paper Writing Month.
If anyone out there reads this, and wants to join me in the effort, I'll post semi-regular updates about our progress. If I'm all alone out in this...well, so be it. You can still keep tabs on how it's going here though.
Thursday, October 21, 2010
Are segmental durations normally distributed?
What's the best way to normalize duration to account for speaker and speaking rate differences?
If segmental durations are normally distributed, z-score normalization has some nice properties. It's easy to estimate from a relatively small set of data. And since the z-score normalization parameters are simply MLE univariate Gaussian parameters fit to the feature (here, duration), you can adapt them using maximum a posteriori adaptation to a new speaker or new utterance -- if there is reason to believe that the speaking rate might have changed.
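To make that adaptation step concrete, here's one simple version of it -- a relevance-MAP style interpolation between the speaker-independent parameters and the statistics of the new material, with a pseudo-count tau controlling how much the prior is trusted. This is my own sketch of a reasonable scheme, not a prescription:

```python
import numpy as np

def map_adapt_zscore_params(prior_mu, prior_sigma, new_durations, tau=10.0):
    """Adapt z-score (Gaussian) parameters toward a new speaker or utterance.

    tau is a pseudo-count: tau -> 0 trusts only the new data, while a large tau
    keeps the speaker-independent prior essentially unchanged.
    """
    x = np.asarray(new_durations, dtype=float)
    n = len(x)
    mu = (tau * prior_mu + x.sum()) / (tau + n)
    var = (tau * prior_sigma ** 2 + ((x - mu) ** 2).sum()) / (tau + n)
    return mu, np.sqrt(var)

# e.g., speaker-independent syllable duration stats adapted to a slower utterance
mu, sigma = map_adapt_zscore_params(0.20, 0.07, [0.31, 0.28, 0.35, 0.30])
print(mu, sigma)
```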
I've used z-score normalization of word duration in the past, acknowledging that it's poorly motivated. There's no expectation that word length should be normally distributed -- phone counts of words are not normally distributed, so why should their durations be? In fact, phone counts may be more log-normal than normal. Brief tangent: here is the distribution of phone counts per word.
So I took a look at a speaker (f2) from the Boston University Radio News Corpus, and looked at the distributions of word, syllable, vowel and phone durations to see if any look more or less Gaussian.
The distribution of word durations is decidedly non-Gaussian. We can see evidence of the bimodality that is likely coming from the abundance of monosyllabic words in the data (~59% of words in this material). Also, comparing the histogram of phone counts and word durations, the relationship is fairly clear.
Syllable durations don't look too bad. There's a bit of skew to the right -- the model overestimates longer durations -- but this isn't terrible. There is a strange spikiness to the distribution, but I blame this more on an interaction between the resolution of the duration information (in units of 10ms) and the histogram than on an underlying phenomenon. If someone were to take a look at this data and decide to go ahead and use z-score normalization on syllable durations, it wouldn't strike me as a terrible way to go.
There has been a good amount of discussion (and some writing, cf. C. Wightman, et al. "Segmental durations in the vicinity of prosodic phrase boundaries." JASA, 1992) about the inherent duration of phones, and how phone identity is an important conditioning variable when examining (and normalizing) phone duration. I haven't reexamined that here, but suffice it to say, the distribution of phone durations isn't well modeled by a Gaussian or likely any other simple parameterized distribution. In all likelihood phone identity is an important factor to consider here, but are all phone identities equally important, or can we just look at vowel vs. consonant or other articulatory clusterings -- frontness, openness, fricatives, etc.? I'll have to come back to that.
But...if we look closer at the distribution of vowel durations, while it isn't all that Gaussian, it looks like a pretty decent fit to a half-normal distribution. I don't see any obvious multimodality which would suggest a major effect of another conditioning variable -- either phone identity or lexical stress -- but it's possible that the dip at around the 20th percentile is evidence of this. Or it could just be noise.
To properly use a half-normal distribution, you would have to discount the probability mass at zero. Working with vowels also has the advantage that vowel onsets and offsets are much easier to detect than consonant, syllable or word boundaries.
So this might be my new normalization scheme for segmental durations: Gaussian MLE z-score normalization for syllables if I have them (either from automatic or manual transcription), and half-normal MLE z-score normalization on vowels that are acoustically approximated (or when ASR output is too poor to be trusted).
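A minimal sketch of how I'd implement that scheme. The half-normal MLE scale is $\hat{\sigma} = \sqrt{\frac{1}{n}\sum_i x_i^2}$, and the "z-score" analogue I have in mind simply expresses each vowel duration in units of that scale (that interpretation is mine; other variants are possible):

```python
import numpy as np

def zscore_gaussian(durations):
    """Standard z-score normalization with MLE Gaussian parameters."""
    x = np.asarray(durations, dtype=float)
    return (x - x.mean()) / x.std()

def normalize_half_normal(durations):
    """Half-normal analogue: rescale by the MLE scale sqrt(mean(x^2))."""
    x = np.asarray(durations, dtype=float)
    return x / np.sqrt(np.mean(x ** 2))

syllable_durations = [0.12, 0.20, 0.31, 0.18, 0.26]   # toy values, in seconds
vowel_durations = [0.05, 0.09, 0.14, 0.07, 0.11]
print(zscore_gaussian(syllable_durations))
print(normalize_half_normal(vowel_durations))
```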
Tuesday, October 05, 2010
Interspeech 2010 Recap
Interspeech was in Makuhari, Japan last week. Makuhari is about 40 minutes from Tokyo, and I'd say totally worth the commute. The conference center was large and clean, and (after the first day) had functional wireless, but Makuhari offers a lot less than Tokyo does.
Interspeech is probably the speech conference with the broadest scope and largest draw. This makes it a great place to learn what is going on in the field.
One of the things that was most striking about the work at Interspeech 2010 was the lack of a Hot Topic. Acoustic modeling for automatic speech recognition is a mainstay of any speech conference, and it was there in spades. There was some nice work on prosody analysis. Recognition of age, affect and gender was highlighted in the INTERSPEECH 2010 Paralinguistics Challenge, but outside the special session focusing on this, there wasn't an exceptional amount of work on these questions. Despite the lack of a major new theme this year, there was some very high quality, interesting work.
Here is some of the work that I found particularly compelling.
- Married Couples' speech
Sri Narayanan's group, with other collaborators from USC and UCLA, has collected a set of married couples' dialog speech recorded during couples' therapy. So this is already compelling data to look at. You've got naturally occurring emotional speech, which is a rare thing, and it's emotion in dialog. They had (at least) 2 papers on this data at the conference, one looking at prosodic entrainment during these dialogs, and the other classifying qualities like blame, acceptance, and humor in either spouse. Both are very compelling first looks at this data. There are obviously some serious privacy issues with sharing this data, but hopefully it will be possible eventually.
Automatic Classification of Married Couples’ Behavior using Audio Features Matthew Black, et al.
Quantification of Prosodic Entrainment in Affective Spontaneous Spoken Interactions of Married Couples Chi-Chun Lee, et al.
- Ferret auditory cortex data for phone recognition
Hynek Hermansky and colleagues have done a lot of compelling work on phone recognition. To my eye, a lot of it has been banging away at techniques other than MFCC representations for speech recognition. Some of them work better than others, obviously, but it's great to see that kind of scientific creativity applied to a core task for speech recognition. This time the idea was to take "spectro temporal receptive fields" empirically observed from ferrets that have been trained to be accurate phone recognizers, and use these responses to train a phone classifier. Yeah, that's right. They used ferret neural activity to try to recognize human speech. Way out there. If that weren't compelling enough, the results are good!
- Prosodic Timing Analysis for Articulatory Re-Synthesis Using a Bank of Resonators with an Adaptive Oscillator Michael C. Brady
A pet project has been to find a nice way to process rhythm in speech for prosodic analysis. Most people use some statistic based on the intervocalic intervals, but this is unsatisfying. While it captures the degree of stability of the speaking rate, it doesn't tell you anything about which syllables are evenly spaced, and which are anomalous. This paper uses an adaptive oscillator to find the frequency that best describes the speech data. One of the nicest results (that Michael didn't necessarily highlight) was that deaccented words in his example utterance were not "on the beat". In the near term I'm planning on replicating this approach for analyzing phrasing, on the idea that in addition to other acoustic resets, the prosodic timing resets at phrase boundaries. A very cool approach.
- Compressive Sensing
There was a special session on compressive sensing that was populated mostly by IBM speech team folks. I hadn't heard of compressive sensing before this conference, and it's always nice to learn a new technique. At its core, compressive sensing is an exemplar-based learning algorithm. Where it gets clever is that while k-means uses a fixed number, k, of exemplars with equal weight, and SVMs use a fixed set of support vectors to make decisions, in compressive sensing a dynamic set of exemplars is used to classify each data point. The set of candidate exemplars (possibly the whole training data set) is weighted with some L1-ish regularization to drive most of the weights to zero -- selecting a subset of all candidates for classification. Then a weighted k-means is performed using the selected exemplars and weights. The dynamic selection and weighting of exemplars outperforms vanilla SVMs, but the process is fairly computationally expensive.
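To make the exemplar-weighting idea concrete, here's a rough sketch -- mine, not the IBM systems from the session -- that uses an L1-regularized solve to pick and weight exemplars for a single test point. The decision step here just sums the surviving weights per class, a common sparse-representation shortcut, rather than the weighted k-means step described above:

```python
import numpy as np
from sklearn.linear_model import Lasso

def sparse_exemplar_classify(test_vec, exemplars, labels, alpha=0.01):
    """Represent one test vector as a sparse nonnegative combination of training
    exemplars; L1 regularization zeroes out most exemplar weights, and the class
    whose surviving exemplars carry the most weight wins."""
    # columns of `exemplars` are training examples; rows are feature dimensions
    lasso = Lasso(alpha=alpha, positive=True, max_iter=5000)
    lasso.fit(exemplars, test_vec)       # min ||D w - x||^2 + alpha * ||w||_1
    weights = lasso.coef_
    classes = np.unique(labels)
    scores = [weights[labels == c].sum() for c in classes]
    return classes[int(np.argmax(scores))]

# toy data: 20-dimensional features, two classes with different means
rng = np.random.default_rng(3)
train = np.hstack([rng.normal(0.0, 1.0, (20, 30)), rng.normal(2.0, 1.0, (20, 30))])
labels = np.array([0] * 30 + [1] * 30)
test = rng.normal(2.0, 1.0, 20)
print(sparse_exemplar_classify(test, train, labels))
```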
Wednesday, August 18, 2010
matplotlib on Mac OS X (thanks StackOverflow)
The pyplot package in matplotlib wasn't working on my MacBook Pro.
It wasn't raising an error, but it wasn't showing a plot either. Pretty much the most frustrating kind of error.
The solution was easy -- once I found it.
Create a file: "~/.matplotlib/matplotlibrc"
and include the line "backend: MacOSX"
and then it works.
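If you want a quick sanity check that the rc file took effect (just something I'd run, nothing official):

```python
import matplotlib
print(matplotlib.get_backend())   # should report 'MacOSX' once ~/.matplotlib/matplotlibrc is in place

import matplotlib.pyplot as plt
plt.plot([1, 2, 3])
plt.show()                        # a window should actually appear now
```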
StackOverflow has consistently been a great place to find answers to needling problems like this.
Monday, August 09, 2010
P != NP
Well, someone seems to have finally cracked it.
Vinay Deolalikar claims to have proved that P != NP. Now, this doesn't have quite the practical impact that a proof of P=NP would, but it's a remarkable achievement. I haven't given the paper a thorough read, and I suspect it'll take some external reading to really digest the guts of it.
Assuming the proof holds up, this is a big deal.
Sure this doesn't have anything to do with speech, or natural language processing. However, somewhat surprisingly, the proof uses graphical models, statistics, conditional independence and graph ensembles. So we've got some machine learning ideas being invited to the party.
And, like I said, it's a big deal, so who cares if it's on topic.
Friday, August 06, 2010
Z-score normalization of pitch and intensity.
You may have noticed a post that was up here a few weeks ago about z-score normalization of pitch. Well, my calculation of KL-divergence had a bug in it, and the conclusions of the post were wrong. So I deleted it and am replacing it with this.
So the question is this: when normalizing pitch and intensity values across speakers, what is the best approach? Z-score normalization has a number of attractive properties -- it's relatively robust to outliers, and it can be efficiently calculated -- however, it assumes that the values being normalized follow a normal distribution. Specifically, z-score normalized values are calculated as follows:
$x^* = \frac{x-\mu}{\sigma}$
Where $\mu$ is the mean, and $\sigma$ the standard deviation of the known population of values. So, the normalized value $x^*$ is how far the initial value $x$ was from the mean $\mu$ expressed in terms of a number of standard deviations.
So the question is: when speaker-normalizing pitch and intensity, is it better to use pitch or log pitch? Intensity or log intensity (or exp intensity, for that matter)? The intuition for investigating these variations on Hz and dB comes from perceptual qualities. Human hearing perceives pitch and loudness on a logarithmic scale. While this doesn't mean that the two qualities follow a log-normal distribution, working in logarithmic units gives some explanatory advantage.
To answer this question I calculated the KL divergence between the histogram of pitch and intensity values (with 100 equally spaced bins) and the maximum likelihood estimate Gaussian distribution. I did this for the 4 speakers in the Boston Directions Corpus (BDC), the 6 speakers in the Boston University Radio News Corpus (BURNC), and 4 speakers of Mandarin Chinese speaking English. This was repeated for log pitch and log intensity and exponentiated pitch and intensity.
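For reference, here's roughly what that computation looks like in Python -- a minimal sketch; the exact binning choices can shift the numbers a bit:

```python
import numpy as np
from scipy.stats import norm

def kl_to_gaussian(values, n_bins=100):
    """KL divergence between the empirical histogram of `values` and the
    maximum-likelihood Gaussian fit to the same data."""
    x = np.asarray(values, dtype=float)
    mu, sigma = x.mean(), x.std()
    counts, edges = np.histogram(x, bins=n_bins)
    p = counts / counts.sum()                          # empirical bin probabilities
    q = np.diff(norm.cdf(edges, loc=mu, scale=sigma))  # Gaussian mass in each bin
    mask = p > 0                                       # 0 * log(0/q) contributes 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

# toy illustration: log-normal "pitch" values look more Gaussian after log()
pitch = np.random.default_rng(1).lognormal(mean=5.0, sigma=0.2, size=5000)
print(kl_to_gaussian(pitch), kl_to_gaussian(np.log(pitch)))
```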
The pitch results are mixed. For 7 of 10 native American English speakers, log-pitch is more gaussian than pitch. For the remaining three, the raw pitch is more gaussian. (The exponentiated pitch values are never better.) On average, the KL-divergence of log-pitch is lower (.1455 vs. .1683). While not decisive, it is better to perform z-score normalization of log-pitch on American English speech. Here are some histograms and normal distributions from speaker h1 from the BDC.
But something interesting happens with the native Mandarin speakers. Here, the raw pitch (.271) has a more gaussian distribution than log pitch (.397). This could be a recording condition effect, but the fact that Mandarin is a tonal language could also be coming into play here. I'll be looking into this more...
For intensity, the results are clear. It's better to z-score normalize raw intensity (dB) than log intensity. Across all speakers (but one -- h3 from BDC) raw intensity (.154) is more gaussian than log intensity (.230). But intensity (dB) is already a logarithmic unit. So we also compared raw intensity with exponentiated intensity values, and found that exp intensity is slightly more gaussian than raw intensity (.143). I'm skeptical that normalizing by exp intensity will have a dramatic effect, but it's better on all speakers but one.
However, the intensity distributions look pretty bimodal or maybe even trimodal. It might be better to normalize with a mixture of gaussians rather than a standard z-score. Something to keep in mind.
Aside: About speaker h3 from the BDC. The intensity values from this speaker show that the log intensity is most gaussian (.346), more so than raw intensity (.696), while the exp intensity is the least gaussian (1.16). I don't yet have a good explanation for what makes this speaker different, but it's worth noting that even when trying to normalize out speaker differences, the normalization routine can behave inconsistently across speakers.
Saturday, July 03, 2010
MetaOptimize
The ML blogs I follow are all atwitter about MetaOptimize (optimizing the process of optimizing the process of...).
Joseph Turian has a nice blog about all sorts of issues -- jobs, programming, open source community -- but certainly the most exciting part of the site is the QA section.
There's already an active community up and running over there, and the publicity it's getting today will only help that. So check it out, ask a question, answer a question even. It's got the potential to be a very valuable forum for machine learning and NLP questions. Only time will tell.
UCSD Data Mining Contest
UCSD is hosting a classification challenge! UCSD Data Mining Contest.
There are cash prizes for undergraduate and graduate students (including post-docs), and it's open until Labor Day.
You get a bunch of data about customers and non-customers for training, then you classify a new person as True (a customer) or False (not a customer).
They can call it Data Mining, or CRM, or anything else -- it's all classification to me.
I forwarded this on to my ML students from last semester, and hopefully one or two of them will take a run at it.
Wednesday, June 16, 2010
Does intensity correlate with prominence in French?
According to a bunch of French researchers who study prosody: No.
I learned this at a prominence workshop at Speech Prosody 2010. I asked Mathieu Avanzi why, in his paper "A Corpus-based Learning Method for Prominence Detection in Spontaneous Speech", he and his co-authors looked at pitch, duration and pause features, but not intensity or spectral emphasis. The response: "Intensity does not correlate with prominence in French".
Now, I don't speak French, so far be it from me to comment on what is perceived as intonational prominence in French by French speakers.
But...
Intensity correlates with prominence in (at least) English, Dutch and Italian. So my curiosity was piqued.
And...
At the same workshop, Mathieu and others released C-PROM, a corpus of French speech which has been annotated for prominence, and labeled by French speakers no less!
So I figured it would only take a few minutes to check it out. Using the feature extraction routines in AuToBI, I pulled out mean values of pitch, intensity and duration for each annotated syllable. Armed with a t-test and R, I looked to see which, if any, of these features correlate with the labels of prominence. (For this analysis, I collapsed the annotations for strong and weak prominence.)
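The analysis was done in R, but the shape of it is simple enough to show in a few lines of Python with scipy -- here with toy data whose means match the numbers reported below (the spreads are made up), standing in for the real per-syllable table extracted with AuToBI:

```python
import numpy as np
from scipy.stats import ttest_ind

# Toy stand-in for the per-syllable feature table:
# columns are (mean pitch in Hz, mean intensity in dB, duration in s).
rng = np.random.default_rng(7)
prominent = rng.normal(loc=[185.6, 72.08, 0.261], scale=[45.0, 6.0, 0.09], size=(800, 3))
nonprominent = rng.normal(loc=[158.0, 70.48, 0.164], scale=[45.0, 6.0, 0.07], size=(3200, 3))

for i, name in enumerate(["mean pitch (Hz)", "mean intensity (dB)", "duration (s)"]):
    t, p = ttest_ind(prominent[:, i], nonprominent[:, i], equal_var=False)  # Welch's t-test
    print(f"{name}: t = {t:.2f}, p = {p:.3g}")
```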
So, let's look at the data.
Bold Claim 1: Pitch correlates with prominence in French.
The bimodal distribution of mean pitch is almost certainly due to the presence of male and female speakers in the C-PROM material. But even without any speaker or gender normalization of pitch, we can still see evidence of the correlation between mean pitch and prominence. The mean value of prominent syllables is 185.6Hz compared to 158.0Hz for non-prominent syllables. This has an associated t-score of 24.117 (p < 2.2*10^-16).
Bold Claim 2: Duration correlates with prominence in French.
This result is even clearer. Prominent syllables are on average 97ms longer (261ms) than non-prominent syllables (164ms). This has a t-value of 54.240 (p < 2.2*10^-16).
Bold Claim 3: Intensity correlates with prominence in French.
Well there it is. It's not as pronounced a difference as the difference in pitch or duration, but the data shows a clear correlation between mean intensity in a syllable and whether the syllable is prominent or not. Prominent syllables are on average 1.6dB louder than non-prominent syllables (72.08dB vs. 70.48dB). This corresponds to a t-value (15.174) that is lower than that seen in the pitch and duration analyses, but still significant (p < 2.2*10^-16).
Now...This is clearly a very basic analysis of correlates of prominence in French speech. But based on these results, I'm comfortable answering the question now.
Does intensity correlate with prominence in French? Yes.
[edited at 12:43pm 6/16/2010]
Following up on a comment from Raul Fernandez, I thought I'd post a parallel plot on the correlation of intensity and prominence in English.
Note that this chart is based on data on American English *words* from the Boston Directions Corpus. Because these are words, the prominent distribution includes some data from non-prominent syllables, so it's not exactly a one-to-one comparison. But there is evidence that acoustic aggregations drawn from words make *better* predictors of prominence (cf. Rosenberg & Hirschberg 2009).
Here we find a similar difference in mean intensity (1.9dB) between prominent (60.8dB) and non-prominent words (58.9dB). This has an associated t-value of 21.234 (p < 2.2*10^-16).
There is little controversy about the correlation of intensity with prominence in English. (In the last few years, there has been work even suggesting that intensity is a better predictor of prominence than pitch, (cf. Kochanski et al. 2005, Silipo & Greenberg 2000, and Rosenberg & Hirschberg 2009).) Of course, this chart doesn't indicate that there are equivalent relationships between intensity and prominence in French and English -- merely that the French correlation deserves more attention.
Tuesday, June 08, 2010
HLT-NAACL 2010 Recap
I was at HLT-NAACL in Los Angeles last week. HLT isn't always a perfect fit for someone sitting towards the speech end of the Human Language Technologies spectrum. Every year, it seems, the organizers try (or claim to try) to attract more speech and spoken language processing work. It hasn't quite caught on yet and the conference tends to be dominated by Machine Translation and Parsing. However...The (median) quality of the work is quite high. This year I kept pretty close to the Machine Learning sessions and got turned on to the wealth of unsupervised structured learning which I've overlooked over the last N>5 years.
There were two new trends that I found particularly compelling this year:
- Noisy Genre
This pretty clunky term covers genres of language which are not well-formed. As far as I can tell this covers everything other than newswire, broadcast news, and read speech. This is what I would call "language in the wild" or, in a snarkier mood, "language" (sans modifier). For the purposes of HLT-NAACL, it covers Twitter messages, email, forum comments, and ... speech recognition output. It's this kind of language that got me into NLP and why I ended up working on speech, so I'm pretty excited that this is receiving more attention from the NLP community at large.
- Mechanical Turk for language tasks
Like the excitement over Wikipedia a few years ago, NLP folks have fallen in love with Amazon's Mechanical Turk. Mechanical Turk was used for speech transcription, sentence compression, paraphrasing, and quite a lot more; there was even a workshop day dedicated solely to this topic. I didn't go to it, but will catch up on the papers this week or so. This work is very cool, particularly when it comes to automatically detecting and dealing with outlier annotations. The resource and corpora development uses of Mechanical Turk are obvious and valuable. It's in the development of "high confidence" or "gold standard" resources that I think this work has an opportunity to intersect very nicely with work on ensemble techniques and classifier combination/fusion. If each turker is considered to be an annotator, the task of identifying a gold standard corpus is identical to generating a high-confidence prediction from an ensemble.
A couple of specific highlights of papers I liked this year:
- “cba to check the spelling”: Investigating Parser Performance on Discussion Forum Posts Jennifer Foster. This might be the first time I fully agree with a best paper award. This paper looked at parsing outrageously sloppy forum comments. These are rife with spelling errors, grammatical errors, weird exclamations (lol). The paper is a really nice example of the difficulty that "noisy genres" of text pose to traditional (i.e., trained on WSJ text) models. The error analysis is clear and the paper proposes some nice solutions to bridge this gap by adding noise to the WSJ data. Also, bonus points for subtly including
- Cheap, Fast and Good Enough: Automatic Speech Recognition with Non-Expert Transcription
Scott Novotney and Chris Callison-Burch. A nice example of using Mechanical Turk to generate training data for a speech recognizer. High quality transcription of speech is pretty expensive and critically important to speech recognizer performance. Novotney and Callison-Burch found that Turkers are able to transcribe speech fairly well, and at a fraction of the cost. This paper includes a really nice evaluation of Turker performance and some interesting approaches to ranking Turker performance.
- The Simple Truth about Dependency and Phrase Structure Representations: An Opinion Piece
Owen Rambow. This paper was probably my favorite in terms of bringing joy and being a breath of fresh air. The argument Rambow lays out is that Dependency and Phrase Structure Representations of syntax are meaningless in isolation. Moreover, these are simply alternate representations of identical syntactic phenomena. Linguists love to fight over a "correct" representation of syntax. This paper takes the position that the distinction between the representations is merely preference, not substance -- fighting over the correct representation of a phenomenon is a distraction from understanding the phenomenon itself. Full disclosure: I've known Owen for years, and like him personally as well as his work.
- Type-Based MCMC
Percy Liang, Michael I. Jordan and Dan Klein. Over the last few years, I've been boning up on MCMC methods. I haven't applied them to my own work yet, but it's really only a matter of time. This work does a nice job of pointing out a limitation of token based MCMC -- specifically that sampling on a token by token basis can make it overly difficult to get out of local minima. Some of this difficulty can be overcome by sampling based on types, that is, sampling based on a higher level feature across the whole data set, as opposed to within a particular token. This makes intuitive sense and was empirically well motivated.
As a side note, I'd like to thank all you wonderful machine learning folks who have been doing a remarkable amount of unsupervised structured learning that I should have been paying better attention to over the last few years. Now I've got to hit the books.
Wednesday, June 02, 2010
Mistakes I've made: The first of an N part series.
In this installment, some mistakes (and lessons) from teaching Machine Learning.
- Using too few Examples.
Everyone, myself especially, learns best from examples. Hands-on examples are even better. My class used hardly any. I think that explains many of the blank stares I got in response to "are there any questions?" It's very easy to ask a question about an example -- "Wait, why does the entropy equal 2.3?". It's much more difficult to ask a question like "Could you clarify the advantages and disadvantages of L2 vs. L1 regularization? You lost me at 'gradient'."
- Starting with the Math.
I spent the first two classes "reviewing" the linear algebra and calculus that would be necessary to really get into the details of some of the more complicated algorithms later in the course. Big mistake. First of all, this material wasn't review for many students -- an unexpected (but rapid) realization. Second of all, I had already lost sight of the point of the math. The math is there to support the big ideas of modeling and evaluation. These can't be accomplished without the math, but I put the cart way before the horse. In the future, I'll be starting with generalization with as little math as possible, and then bringing it in as needed.
- Ending with Evaluation.
The class included material on mean-squared error and classification error rates far earlier than I introduced the ideas of evaluation. Sure, accuracy is a pretty intuitive notion, but there's a big assumption in assuming that everybody in the seats will know what I'm talking about. Even the relatively simple distinction between linear and squared error only takes 10 minutes to discuss, but it goes a long way towards instilling greater understanding of what's going on.
- Ambitious and unclear goals and expectations.
While this was never explicit, in reflection, it is obvious to me that my goal for the course was for the students to "know as much as I do about machine learning". It should have been "understand the fundamentals of machine learning". Namely, 1) how can we generalize from data (statistical modeling), 2) how can we apply machine learning (feature extraction) and 3) how do we know if the system works (evaluation).
For instance, I spent a lot of time showing how frequentists and Bayesians can come to the same end point w.r.t. L2 regularization in linear regression. I think this is way cool, but is it more important than spending an hour describing k-nearest neighbors? Only for me, not for the students. Was it more helpful to describe MAP adaptation than decision trees? Almost definitely not. Decision trees are disarmingly intuitive. They can be represented as if-statements, and provide an easy example of overfitting (without requiring that students know that an n-degree polynomial can intersect n+1 points). But I thought they were too simple, and not "statistical" enough to fit in with the rest of the class. Oops.
Well, that's (at least) four mistakes -- hopefully next time they'll be all new mistakes.
Wednesday, May 26, 2010
Panic Averted
If you are running mac os x 10.6.3...
and you install the new java update from Apple (Java for Mac OS X 10.6 Update 2)...
and everything that touches java stops working including your favorite IDE (IDEA, Eclipse) and even Java Preferences...
Do not panic.
I repeat.
Do not panic.
Just go to Disk Utility.
And hit Repair Disk Permissions.
There.
Isn't that better?
Tuesday, May 25, 2010
Google Predict -- open machine learning service.
Google has taken a step into open machine learning with Google Predict. Rather than release a toolkit, in typical Google fashion, they've set it up as a web service. This is great. There needs to be greater interaction with machine learning from all walks of life. Currently, the bar to entry is pretty high. Even with open source tools like Weka, current interfaces are intimidating at best and require knowledge of the field. The Prediction API strips all that away: label some rows of data and let 'er rip.
The downside of this simplified process is that Google Predict works as a black box classifier (and maybe regressor?). It "Automatically selects from several available machine learning techniques", and it supports numerical values and unstructured text as input. There are no parameters to set and you can't get a confidence score out.
In all likelihood this uses the Seti infrastructure to do the heavy lifting, but there's at least a little bit of feature extraction thrown in to handle the text input.
It'll be interesting to see if anyone can suss out what is going on under the hood. I signed up for the access waiting list. When I get in, I'll post some comparison results between Google Predict and a variety of other open source tools here.
Thanks to Slashdot via Machine Learning (Theory) for the heads up on this one.
Tuesday, May 18, 2010
Genre vs. Corpus Effects
The term "genre" gets used to broadly describe the context of speech -- read speech, spontaneous speech, broadcast news speech, telephone conversation speech, presentation speech, meeting speech, multiparty meeting speech, etc. The list goes on because it's not particularly rigorously defined term. The observations here also apply to text in NLP where genre can be used to characterize newswire text, blog posts, blog comments, email, IM, fictional prose, etc.
We'd all like to make claims about the nature of speech. Big bold claims of the sort "Conversation speech is like X" (for descriptive statistics) or "This algorithm performs with accuracy Y +- Z on broadcast conversation speech" (for evaluation tasks). These claims are inherently satisfying. Either you've made some broad (hopefully interesting or impactful) observation about speech or you're able to claim expected performance of an approach on unseen examples of speech -- from the same genre. There's a problem though. It's usually impossible to know if the effects are broad enough to be consistent across the genre of speech, or if they are specific to the examined material -- the corpus. This isn't terrible, just an overly broad claim.
Where this gets to be more of a problem is when corpus effects are considered to be genre effects. When we make claims like "it's harder to recognize spontaneous speech than read speech" usually what's being said is "my performance is lower on the spontaneous material I tested than on the read material I tested."
I was reminded about this issue around Anna Margolis, Mari Ostendorf and Karen Livescu's paper at Speech Prosody (see last post). They were looking at the impact of genre on prosodic event detection. They found that cross-genre/corpus training led to poor results, but that by combining training material from both genres/corpora, performance improved.
However, the impact of differences between the corpora other than genre effects get muddled. The two corpora in this paper are the Boston University Radio News Corpus, and the Switchboard corpus. One is carefully recorded professionally read news speech and the other is telephone conversations. In addition to genre, the recording conditions, conversational participants and labelers are all distinct. I really like this paper, even if its results show that joint training can overcome the corpus disparities (including genre differences). These are exactly the differences likely to be found between training data and any unseen data! And this is what system evaluations seek to measure in the first place.
Monday, May 17, 2010
Speech Prosody 2010 Recap
Speech Prosody is a biennial conference held by a special interest group of ISCA. Despite working on intonation and prosody since about 2006, this is the first year I've attended.
The conference has only been held five times and has the feeling of a workshop -- no parallel sessions is nice, but the quality of the work was quite variable. I'm sympathetic to conference organizers, and particularly sympathetic when you're coordinating a conference that is still relatively new. I'm sure you want to make sure people attend, and the easiest way to do that is to accept work for presentation. But by casting a wider net, some work gets in that probably could have used another round of revision. Mark Hasegawa-Johnson and his team logistically executed a very successful conference. But (at the risk of having my own work rejected) I think the conference is mature enough that Speech Prosody 2012 could stand more wheat, less chaff.
Two major themes stuck out:
- Recognizing emotion from speech -- A lot of work, but very little novelty in corpus, machine learning approach, or findings. It's easy to recognize high vs. low activation/arousal, and hard to recognize high vs. low valence. (I'll return to this theme in a post on Interspeech reviewing...)
- Analyzing non-native intonation -- There were scads of papers on this topic covering both how intonation is perceived and produced by non-native speakers of a language (including two by me!). I had no anticipation that this would be so popular, but if this conference were any indication, I would expect to see a lot of work on non-native spoken language processing at Interspeech and SLT.
Here are some of my favorite papers. This list is *heavily* biased towards work I might cite in the future more than the "best" papers at the conference.
- The effect of global F0 contour shape on the perception of tonal timing contrasts in American English intonation Jonathan Barnes, Nanette Veilleux, Alejna Brugos and Stefanie Shattuck-Hufnagel The paper examines the question of peak timing in the perception of L+H* vs. L*+H accents. It has generally been assumed (including in the ToBI guidelines) that the difference is in the peak timing relative to the accented vowel or syllable onset -- with L*+H having later peaks. Barnes et al. found that this can be manipulated such that if you keep the peak timing the same, but make the F0 approach to that peak more "dome-y" or "scoop-y", subjects will perceive a difference in accent type. This is interesting to me, though *very* inside baseball. What I liked most about the paper and presentation was the use of Tonal Center of Gravity or ToCG to identify a time associated with the greatest mass of f0 information. This aggregation is more robust to spurious f0 estimates than a simple maximum, and is a really interesting way to parameterize an acoustic contour. Oh, and it's incredibly simple to calculate: \[ T_{cog} = \frac{\sum_i f_i t_i}{\sum_i f_i}\] (there's a small code sketch of the computation after this list).
- Cross-genre training for automatic prosody classification Anna Margolis, Mari Ostendorf, Karen Livescu Mari Ostendorf and her students have been looking at automatic analysis of prosody for as long as anyone. In this work, her student Anna looks at how accent and break detection performance degrades across corpora, looking at the BURNC (radio news speech) and Switchboard (spontaneous telephone speech) data. This paper found that you suffer a significant degradation of performance when models are trained on one corpus and evaluated on the other, compared to within-genre training -- with lexical features suffering more dramatically than acoustic features (unsurprisingly...). However, the interesting effect is that by using a combined training set from both genres, the performance doesn't suffer very much. This is encouraging for the notion that when analyzing prosody, we can include diverse genres of training data and increase robustness. It's not so much that this is a surprising result as it is a comforting one.
- C-PROM: An Annotated Corpus for French Prominence Study M. Avanzi, A.C. Simon, J.-P. Goldman & A. Auchlin This was presented at the Prominence Workshop preceding the conference. The title pretty much says it all. But it deserves acknowledgement for sharing data with the community. Everyone working with prosody bemoans the lack of intonationally annotated data. It's time consuming, exhausting work. (Until last year I was a graduate student -- I've done more than my share of ToBI labeling, and I'm sure there will be more to come.) These lovely folks from France (mostly) have labeled about an hour of data for prominence from 7 different genres, and ... wait for it ... posted it online for all the world to see, download and abuse. They've set a great example, and even though I haven't worked with French before and have no intuitions about it, I appreciate open data a lot.
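Since I made a point of how simple ToCG is to calculate (first paper above), here's a literal few-line version of the formula; treating unvoiced frames (f0 of zero) by dropping them is my own assumption:

```python
import numpy as np

def tonal_center_of_gravity(times, f0):
    """Tonal Center of Gravity: the f0-weighted mean time over a region,
    T_cog = sum_i(f_i * t_i) / sum_i(f_i). Unvoiced frames (f0 <= 0) are dropped."""
    t, f = np.asarray(times, dtype=float), np.asarray(f0, dtype=float)
    voiced = f > 0
    return np.sum(f[voiced] * t[voiced]) / np.sum(f[voiced])

# toy contour sampled every 10 ms, with a "dome-y" peak near 180 ms
t = np.arange(0.0, 0.30, 0.01)
f0 = 180 + 60 * np.exp(-((t - 0.18) ** 2) / (2 * 0.05 ** 2))
print(tonal_center_of_gravity(t, f0))
```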
Welcome
I've been happily reading NLP (http://nlpers.blogspot.com/) and Machine Learning (http://hunch.net/, http://pleasescoopme.com/) blogs for the past few years. It's been a great way to see what other researchers are thinking about. When reading conference papers or journal articles, the ideas are (more or less) completely formed. I was initially skeptical, but research blog posts give an added and (more importantly) faster layer of insight and sharing of information.
So here's my drop in the ocean.
I expect the broad theme to be Spoken Language Processing research. But it's impossible to discuss this without drawing heavily from Natural Language Processing, Machine Learning and Signal Processing. So I expect a fair amount of that too.