The term "genre" gets used to broadly describe the context of speech -- read speech, spontaneous speech, broadcast news speech, telephone conversation speech, presentation speech, meeting speech, multiparty meeting speech, etc. The list goes on because it's not a particularly rigorously defined term. The observations here also apply to text in NLP, where genre can be used to characterize newswire text, blog posts, blog comments, email, IM, fictional prose, etc.
We'd all like to make claims about the nature of speech. Big bold claims of the sort "Conversational speech is like X" (for descriptive statistics) or "This algorithm performs with accuracy Y ± Z on broadcast conversation speech" (for evaluation tasks). These claims are inherently satisfying: either you've made some broad (hopefully interesting or impactful) observation about speech, or you're able to claim expected performance of an approach on unseen examples of speech from the same genre. There's a problem, though. It's usually impossible to know whether the effects are broad enough to be consistent across the genre, or whether they are specific to the examined material -- the corpus. This isn't terrible, just an overly broad claim.
Where this gets to be more of a problem is when corpus effects are considered to be genre effects. When we make claims like "it's harder to recognize spontaneous speech than read speech" usually what's being said is "my performance is lower on the spontaneous material I tested than on the read material I tested."
I was reminded of this issue by Anna Margolis, Mari Ostendorf and Karen Livescu's paper at Speech Prosody (see last post). They looked at the impact of genre on prosodic event detection and found that cross-genre/corpus training led to poor results, but that combining training material from both genres/corpora improved performance.
However, genre effects get muddled with all the other differences between the corpora. The two corpora in this paper are the Boston University Radio News Corpus and the Switchboard corpus: one is carefully recorded, professionally read news speech, the other is telephone conversation. In addition to genre, the recording conditions, conversational participants and labelers are all distinct. I really like this paper, and its results show that joint training can overcome the corpus disparities (including genre differences). These are exactly the differences likely to be found between training data and any unseen data! And this is what system evaluations seek to measure in the first place.
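To make the comparison concrete, here is a minimal sketch of the three training conditions being contrasted (within-corpus, cross-corpus, and joint training). Everything here is a stand-in: the synthetic "corpora" just simulate the same underlying task under a shifted feature distribution, and the logistic regression classifier is a placeholder, not the authors' actual prosodic event detector or features.

```python
# Sketch of within-corpus vs. cross-corpus vs. joint training.
# Two synthetic "corpora" share the same task but differ in feature
# distribution, loosely mimicking corpus effects (recording conditions,
# speakers, labelers) layered on top of any genre effect.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)

def fake_corpus(shift, n=2000, dim=10):
    """Synthetic corpus: same labeling rule, shifted feature distribution."""
    X = rng.normal(loc=shift, scale=1.0, size=(n, dim))
    w = np.ones(dim)
    y = (X @ w + rng.normal(scale=2.0, size=n) > shift * dim).astype(int)
    return X, y

# Stand-ins for read news speech vs. telephone conversation.
Xa, ya = fake_corpus(shift=0.0)   # "corpus A"
Xb, yb = fake_corpus(shift=1.5)   # "corpus B"

def split(X, y, frac=0.8):
    k = int(len(y) * frac)
    return (X[:k], y[:k]), (X[k:], y[k:])

(Xa_tr, ya_tr), (Xa_te, ya_te) = split(Xa, ya)
(Xb_tr, yb_tr), (Xb_te, yb_te) = split(Xb, yb)

def evaluate(X_tr, y_tr, X_te, y_te):
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))

print("within-corpus A->A:", evaluate(Xa_tr, ya_tr, Xa_te, ya_te))
print("within-corpus B->B:", evaluate(Xb_tr, yb_tr, Xb_te, yb_te))
print("cross-corpus  A->B:", evaluate(Xa_tr, ya_tr, Xb_te, yb_te))
print("joint        A+B->B:", evaluate(np.vstack([Xa_tr, Xb_tr]),
                                       np.concatenate([ya_tr, yb_tr]),
                                       Xb_te, yb_te))
```

In a setup like this, cross-corpus accuracy drops well below the within-corpus baselines while joint training recovers much of the gap -- the same qualitative pattern as the paper -- but note that the distribution shift driving it is entirely a corpus effect, with no genre involved at all.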