Wednesday, May 26, 2010

Panic Averted

If you are running Mac OS X 10.6.3...

and you install the new Java update from Apple (Java for Mac OS X 10.6 Update 2)...

and everything that touches Java stops working, including your favorite IDE (IDEA, Eclipse) and even Java Preferences...

Do not panic.

I repeat.

Do not panic.

Just go to Disk Utility.

And hit Repair Disk Permissions.

There.

Isn't that better?
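
(If you'd rather skip the GUI: on 10.6 the same repair is exposed through the diskutil command line tool, so you can script it.  A minimal sketch in Python, assuming diskutil's repairPermissions verb, which exists on 10.6 but was removed in later releases:)

    # Kick off Disk Utility's permission repair from a script.
    # Assumes `diskutil repairPermissions`, present on Mac OS X 10.6.
    import subprocess

    def repair_permissions(volume="/"):
        """Run `diskutil repairPermissions` on the given volume."""
        subprocess.check_call(["diskutil", "repairPermissions", volume])

    if __name__ == "__main__":
        repair_permissions()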

Tuesday, May 25, 2010

Google Predict -- open machine learning service.

Google has taken a step into open machine learning with Google Predict.  Rather than release a toolkit, in typical Google fashion they've set it up as a web service.  This is great.  There needs to be greater interaction with machine learning from all walks of life.  Currently, the bar to entry is pretty high.  Even with open source tools like Weka, current interfaces are intimidating at best and require knowledge of the field.  The Prediction API strips all that away: label some rows of data and let 'er rip.
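
To give a sense of how low that bar is: reportedly the service takes training data as plain CSV rows, with the label in the first column and the features (numbers or unstructured text) after it.  I'm still on the waiting list, so treat the details here as my assumption rather than gospel.  A minimal sketch:

    # Build the kind of training file the Prediction API reportedly takes:
    # one example per row, label first, then the features.  The exact
    # format is an assumption on my part -- I don't have API access yet.
    import csv

    examples = [
        ("spam", "click here for cheap meds"),
        ("ham",  "speech prosody 2010 recap"),
        ("spam", "you have won a free laptop"),
    ]

    with open("training.csv", "w") as f:
        writer = csv.writer(f)
        for label, text in examples:
            writer.writerow([label, text])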


The downside of this simplified process is that Google Predict works as a black-box classifier (and maybe regressor?).  It "automatically selects from several available machine learning techniques," and it supports numerical values and unstructured text as input.  There are no parameters to set, and you can't get a confidence score out.
In all likelihood this uses the Seti infrastructure to do the heavy lifting, but there's at least a little bit of feature extraction thrown in to handle the text input.


It'll be interesting to see if anyone can suss out what is going on under the hood.  I signed up for the access waiting list.  When I get in, I'll post some comparison results between Google Predict and a variety of other open source tools here.


Thanks to Slashdot via Machine Learning (Theory) for the heads-up on this one.

Tuesday, May 18, 2010

Genre vs. Corpus Effects

The term "genre" gets used to broadly describe the context of speech -- read speech, spontaneous speech, broadcast news speech, telephone conversation speech, presentation speech, meeting speech, multiparty meeting speech, etc. The list goes on because it's not particularly rigorously defined term.  The observations here also apply to text in NLP where genre can be used to characterize newswire text, blog posts, blog comments, email, IM, fictional prose, etc.

We'd all like to make claims about the nature of speech.  Big bold claims of the sort "Conversation speech is like X" (for descriptive statistics) or "This algorithm performs with accuracy Y ± Z on broadcast conversation speech" (for evaluation tasks).  These claims are inherently satisfying.  Either you've made some broad (hopefully interesting or impactful) observation about speech, or you're able to claim expected performance of an approach on unseen examples of speech -- from the same genre.  There's a problem though.  It's usually impossible to know if the effects are broad enough to be consistent across the genre of speech, or if they are specific to the examined material -- the corpus.  This isn't terrible, just an overly broad claim.

Where this gets to be more of a problem is when corpus effects are considered to be genre effects.  When we make claims like "it's harder to recognize spontaneous speech than read speech", usually what's being said is "my performance is lower on the spontaneous material I tested than on the read material I tested."

I was reminded of this issue by Anna Margolis, Mari Ostendorf and Karen Livescu's paper at Speech Prosody (see last post).  They were looking at the impact of genre on prosodic event detection.  They found that cross-genre/corpus training led to poor results, but that by combining training material from both genres/corpora, performance improved.

However, the impact of differences between the corpora other than genre effects gets muddled.  The two corpora in this paper are the Boston University Radio News Corpus and the Switchboard corpus.  One is carefully recorded, professionally read news speech; the other is telephone conversation.  In addition to genre, the recording conditions, conversational participants and labelers are all distinct.  I really like this paper all the same; its results show that joint training can overcome the corpus disparities (including genre differences).  These are exactly the differences likely to be found between training data and any unseen data!  And this is what system evaluations seek to measure in the first place.
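
To make the experimental design concrete, here is a sketch of the three training conditions at play -- within-corpus, cross-corpus, and combined.  The data, features, and classifier are placeholders (I'm using scikit-learn for convenience), not what Margolis et al. actually used:

    # Sketch of within-corpus, cross-corpus, and combined training.
    # Random matrices stand in for prosodic feature vectors from two
    # corpora (say, BURNC as A and Switchboard as B); the classifier
    # is a placeholder, not the one from the paper.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score

    def evaluate(train_X, train_y, test_X, test_y):
        model = LogisticRegression().fit(train_X, train_y)
        return accuracy_score(test_y, model.predict(test_X))

    rng = np.random.RandomState(0)
    A_X, A_y = rng.randn(200, 10), rng.randint(0, 2, 200)  # corpus A
    B_X, B_y = rng.randn(200, 10), rng.randint(0, 2, 200)  # corpus B

    # Hold out the last 50 examples of corpus A as the test set.
    test_X, test_y = A_X[150:], A_y[150:]
    within = evaluate(A_X[:150], A_y[:150], test_X, test_y)
    cross = evaluate(B_X, B_y, test_X, test_y)
    combined = evaluate(np.vstack([A_X[:150], B_X]),
                        np.concatenate([A_y[:150], B_y]),
                        test_X, test_y)
    print(within, cross, combined)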

Monday, May 17, 2010

Speech Prosody 2010 Recap

Speech Prosody is a biennial conference held by a special interest group of ISCA.  Though I've been working on intonation and prosody since about 2006, this is the first year I've attended.

The conference has only been held five times and has the feeling of a workshop -- no parallel sessions is nice, but the quality of the work varied widely.  I'm sympathetic to conference organizers, and particularly sympathetic to those coordinating a conference that is still relatively new.  You want to make sure people attend, and the easiest way to do that is to accept work for presentation.  But by casting a wider net, some work gets in that probably could have used another round of revision.  Mark Hasegawa-Johnson and his team logistically executed a very successful conference.  But (at the risk of having my own work rejected) I think the conference is mature enough that Speech Prosody 2012 could stand more wheat, less chaff.

Two major themes stuck out:

  • Recognizing emotion from speech -- A lot of work, but very little novelty in corpus, machine learning approach, or findings.  It's easy to recognize high vs. low activation/arousal, and hard to recognize high vs. low valence.  (I'll return to this theme in a post on Interspeech reviewing...)
  • Analyzing non-native intonation -- There were scads of papers on this topic, covering how intonation is both perceived and produced by non-native speakers of a language (including two by me!).  I hadn't anticipated that this would be so popular, but if this conference were any indication, I would expect to see a lot of work on non-native spoken language processing at Interspeech and SLT.

Here are some of my favorite papers.  This list is *heavily* biased towards work I might cite in the future, rather than the "best" papers at the conference.

  • The effect of global F0 contour shape on the perception of tonal timing contrasts in American English intonation -- Jonathan Barnes, Nanette Veilleux, Alejna Brugos and Stefanie Shattuck-Hufnagel.  The paper examines the question of peak timing in the perception of L+H* vs. L*+H accents.  It has generally been assumed (including in the ToBI guidelines) that the difference is in the peak timing relative to the accented vowel or syllable onset -- with L*+H having later peaks.  Barnes et al. found that this can be manipulated such that if you keep the peak timing the same, but make the F0 approach to that peak more "dome-y" or "scoop-y", subjects will perceive a difference in accent type.  This is interesting to me, though *very* inside baseball.  What I liked most about the paper and presentation was the use of Tonal Center of Gravity (ToCG) to identify a time associated with the greatest mass of f0 information.  This aggregation is more robust to spurious f0 estimates than a simple maximum, and is a really interesting way to parameterize an acoustic contour.  Oh, and it's incredibly simple to calculate: \[ T_{cog} = \frac{\sum_i f_i t_i}{\sum_i f_i}\] (see the sketch just after this list).
  • Cross-genre training for automatic prosody classification -- Anna Margolis, Mari Ostendorf, Karen Livescu.  Mari Ostendorf and her students have been looking at automatic analysis of prosody for as long as anyone.  In this work, her student Anna looks at how accent and break detection performance degrades across corpora, using the BURNC (radio news speech) and Switchboard (spontaneous telephone speech) data.  This paper found that you suffer a significant degradation in performance when models are trained on one corpus and evaluated on the other, compared to within-genre training -- with lexical features suffering more dramatically than acoustic features (unsurprisingly...).  However, the interesting effect is that with a combined training set from both genres, performance doesn't suffer very much.  This is encouraging for the notion that when analyzing prosody, we can include diverse genres of training data and increase robustness.  It's not so much that this is a surprising result as it is a comforting one.
  • C-PROM: An Annotated Corpus for French Prominence Study -- M. Avanzi, A.C. Simon, J.-P. Goldman & A. Auchlin.  This was presented at the Prominence Workshop preceding the conference.  The title pretty much says it all, but it deserves acknowledgement for sharing data with the community.  Everyone working with prosody bemoans the lack of intonationally annotated data.  It's time-consuming, exhausting work.  (Until last year I was a graduate student -- I've done more than my share of ToBI labeling, and I'm sure there will be more to come.)  These lovely folks from France (mostly) have labeled about an hour of data for prominence from 7 different genres, and ... wait for it ... posted it online for all the world to see, download and abuse.  They've set a great example, and even though I haven't worked with French before and have no intuitions about it, I appreciate open data a lot.
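
Since the ToCG formula above is so simple, here is a quick sketch of it in code.  The function and variable names are mine, and the toy numbers are made up for illustration:

    # Tonal Center of Gravity (ToCG), per the formula above: a weighted
    # mean of sample times, where each f0 value serves as its time's weight.
    def tonal_center_of_gravity(times, f0s):
        """times: sample times (s); f0s: f0 estimates (Hz) at those times."""
        numerator = sum(f * t for f, t in zip(f0s, times))
        denominator = sum(f0s)
        return numerator / denominator

    # A toy contour: the f0 peak near 0.3s pulls the ToCG toward it.
    times = [0.1, 0.2, 0.3, 0.4]
    f0s = [120.0, 180.0, 220.0, 150.0]
    print(tonal_center_of_gravity(times, f0s))  # ~0.26
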
Full disclosure: I had two papers at Speech Prosody 2010, one in the main conference (circuitously placed in a Special Session) and the other in the workshop on prominence the day before.  And I would certainly include these in the "varied" description of the caliber of work.  Not that I'd describe my papers as shoddy (though there were some shoddy papers at SP2010, to be sure), but they are quite limited in scope, reporting on two small perception and production studies of non-native intonation.

Welcome

I've been happily reading NLP (http://nlpers.blogspot.com/) and Machine Learning (http://hunch.net/, http://pleasescoopme.com/) blogs for the past few years.  It's been a great way to see what other researchers are thinking about.  By the time ideas reach conference papers or journal articles, they are (more or less) completely formed.  I was initially skeptical, but research blog posts add another and (more importantly) faster layer of insight and information sharing.


So here's my drop in the ocean.


I expect the broad theme to be Spoken Language Processing research.  But it's impossible to discuss this without drawing heavily from Natural Language Processing, Machine Learning and Signal Processing.  So I expect a fair amount of that too.