Spoken Language Processing: Speech Prosody 2010 Recap

Speech Prosody is a biennial conference held by a special interest group of ISCA. Although I've been working on intonation and prosody since about 2006, this is the first year I've attended.

The conference has only been held five times and has the feeling of a workshop -- having no parallel sessions is nice, but the quality of the work was quite uneven. I'm sympathetic to conference organizers, and particularly sympathetic when they're coordinating a conference that is still relatively new. I'm sure you want to make sure people attend, and the easiest way to do that is to accept work for presentation. But by casting a wider net, some work gets in that probably could have used another round of revision. Mark Hasegawa-Johnson and his team ran a logistically very successful conference. But (at the risk of having my own work rejected) I think the conference is mature enough that Speech Prosody 2012 could stand more wheat, less chaff.

Two major themes stuck out:

Recognizing emotion from speech -- A lot of work, but very little novelty in corpus, machine learning approach, or findings. It's easy to recognize high vs. low activation/arousal, and hard to recognize high vs. low valence. (I'll return to this theme in a post on Interspeech reviewing...)
Analyzing non-native intonation -- There were scads of papers on this topic, covering both how intonation is perceived and how it is produced by non-native speakers of a language (including two by me!). I did not anticipate that this would be so popular, but if this conference is any indication, I would expect to see a lot of work on non-native spoken language processing at Interspeech and SLT.

Here are some of my favorite papers. This list is *heavily* biased towards work I might cite in the future rather than the "best" papers at the conference.

"The effect of global F0 contour shape on the perception of tonal timing contrasts in American English intonation" (Jonathan Barnes, Nanette Veilleux, Alejna Brugos and Stefanie Shattuck-Hufnagel) -- The paper examines the question of peak timing in the perception of L+H* vs. L*+H accents. It has generally been assumed (including in the ToBI guidelines) that the difference is in the peak timing relative to the accented vowel or syllable onset -- with L*+H having later peaks. Barnes et al. found that this can be manipulated: if you keep the peak timing the same, but make the F0 approach to that peak more "dome-y" or "scoop-y", subjects will perceive a difference in accent type. This is interesting to me, though *very* inside baseball. What I liked most about the paper and presentation was the use of Tonal Center of Gravity, or ToCG, to identify a time associated with the greatest mass of f0 information. This aggregation is more robust to spurious f0 estimates than a simple maximum, and is a really interesting way to parameterize an acoustic contour. Oh, and it's incredibly simple to calculate: \[ T_{cog} = \frac{\sum_i f_i t_i}{\sum_i f_i} \] where \(f_i\) is the f0 estimate at time \(t_i\).
"Cross-genre training for automatic prosody classification" (Anna Margolis, Mari Ostendorf, Karen Livescu) -- Mari Ostendorf and her students have been looking at automatic analysis of prosody for as long as anyone. In this work, her student Anna looks at how the performance of accent and break detection degrades across corpora, using the BURNC (radio news speech) and Switchboard (spontaneous telephone speech) data. The paper found that you suffer a significant degradation of performance when models are trained on one corpus and evaluated on the other, compared to within-genre training -- with lexical features suffering more dramatically than acoustic features (unsurprisingly...). However, the interesting effect is that when you use a combined training set from both genres, performance doesn't suffer very much. This is encouraging for the notion that when analyzing prosody, we can include diverse genres of training data and increase robustness. It's not so much that this is a surprising result as that it is a comforting one.
"C-PROM: An Annotated Corpus for French Prominence Study" (M. Avanzi, A.C. Simon, J.-P. Goldman & A. Auchlin) -- This was presented at the Prominence Workshop preceding the conference. The title pretty much says it all, but it deserves acknowledgement for sharing data with the community. Everyone working with prosody bemoans the lack of intonationally annotated data. Annotation is time-consuming, exhausting work. (Until last year I was a graduate student -- I've done more than my share of ToBI labeling, and I'm sure there will be more to come.) These lovely folks from France (mostly) have labeled about an hour of data for prominence from 7 different genres, and ... wait for it ... posted it online for all the world to see, download and abuse. They've set a great example, and even though I haven't worked with French before and have no intuitions about it, I appreciate open data a lot.
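
Since the Tonal Center of Gravity formula from the Barnes et al. paper is so short, here is a rough Python sketch of it. This is only my reading of the formula as written in this post, not the authors' implementation: the function name and the two toy contours are made up for illustration, and I'm assuming raw f0 values in Hz sampled at known times, with no baseline subtraction or restriction to a particular region of the contour.

```python
def tonal_center_of_gravity(times, f0):
    """Return the f0-weighted mean time: sum(f_i * t_i) / sum(f_i).

    Hypothetical helper for illustration. `times` and `f0` are
    equal-length sequences; frames with f0 == 0 (e.g. unvoiced frames)
    add nothing to either sum, so they do not affect the result.
    """
    if len(times) != len(f0):
        raise ValueError("times and f0 must have the same length")
    total = sum(f0)
    if total <= 0:
        raise ValueError("f0 values must sum to a positive number")
    return sum(f * t for t, f in zip(times, f0)) / total


# Toy contours (invented numbers): both peak at t = 0.3 s, but the
# "dome" rises early and the "scoop" rises late.
times = [0.0, 0.1, 0.2, 0.3, 0.4]
dome = [100.0, 180.0, 195.0, 200.0, 100.0]
scoop = [100.0, 105.0, 120.0, 200.0, 100.0]

print("dome  ToCG:", tonal_center_of_gravity(times, dome))
print("scoop ToCG:", tonal_center_of_gravity(times, scoop))
```

With these toy numbers the scooped contour gets a later ToCG than the domed one even though the peak time is identical, which is the kind of difference a simple maximum cannot capture.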

Full disclosure: I had two papers at Speech Prosody 2010, one in the main conference (circuitously placed in a Special Session) and the other in a workshop on prominence the day before. And I would certainly include these in the "varied" description of the caliber of work. Not that I'd describe my papers as shoddy (though there were some of those at SP2010, to be sure), but they are quite limited in scope, reporting on two small perception and production studies of non-native intonation.

Monday, May 17, 2010
