Here's the premise: I was invited to give a guest lecture in Advanced Natural Language Processing. The students will get one week out of 14 focusing on speech and spoken language processing. But it's early in the semester, so there's an opportunity to offer a perspective on how speech fits into the lessons that they'll be learning in more detail later in the semester.
Here's the question: how do you spend 75 minutes to provide a useful survey of speech and spoken language processing?
My answer, in PowerPoint form, can be found here.
I spent about two-thirds of the material on speech recognition. I figured most students are fascinated by the idea of a machine being able to get words from speech, so let's walk through the fundamentals of the technology behind it.
In the remaining third or so, I focus on the notion that speech recognition is not sufficient for speech understanding. There is a lot of other information in speech that is either 1) unavailable in text, or 2) unavailable in ASR transcripts. The premise of this section is to convince students that speech isn't just a noisy string of unadorned words: there's a lot of information about structure and intention available from the speech signal. What's more, we can use it in spoken language processing.
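To make that concrete, here's a toy sketch (plain NumPy, nothing to do with any particular toolkit) of pulling two of those signal-level cues out of a waveform: per-frame energy and a rough autocorrelation-based pitch estimate. Contours like these carry prominence and phrasing information that never makes it into a bare ASR transcript. The synthetic "utterance", frame sizes, and pitch range below are all placeholders, and a real system would use a proper pitch tracker.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Slice a waveform into overlapping frames (25 ms / 10 ms at 16 kHz)."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n_frames)])

def rms_energy(frames):
    """Per-frame RMS energy -- a crude correlate of prominence."""
    return np.sqrt(np.mean(frames ** 2, axis=1))

def autocorr_f0(frame, sr=16000, fmin=75.0, fmax=400.0):
    """Very rough per-frame pitch estimate from the autocorrelation peak."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)
    lag = lo + np.argmax(ac[lo:hi])
    return sr / lag

# Toy "utterance": a 150 Hz tone whose amplitude rises and falls,
# standing in for a real recording loaded from disk.
sr = 16000
t = np.arange(0, 1.0, 1.0 / sr)
x = np.sin(2 * np.pi * 150 * t) * np.hanning(len(t))

frames = frame_signal(x)
energy = rms_energy(frames)
f0 = np.array([autocorr_f0(f, sr) for f in frames])
print(energy.round(3)[:5], f0.round(1)[:5])
```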
There is an outrageous number of important concepts that get almost no attention here, including but not limited to: digital signal processing, human speech production and perception, speech synthesis, multimodal speech processing, speaker identification, language identification, building speech corpora, linguistic annotation, discourse and dialog, and conversational agents.
Would you do it differently? I'm curious what some other takes on this problem might look like.
Thursday, August 16, 2012
AuToBI v1.3
This release of AuToBI is a more traditional milestone release than v1.2 was. Trained models and a new .jar file will be available on the AuToBI site shortly.
There are improvements to performance that are thoroughly documented in a submission to IEEE SLT 2012. These improvements came from two sources.
First, AuToBI uses importance weighting to improve classification performance on skewed distributions. I found this to be a more useful approach than the standard under- or over-sampling. This is discussed in a paper that will appear at Interspeech next month.
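For readers who want the flavor of the approach, here is a minimal sketch of importance weighting on a toy skewed problem. This is not AuToBI's implementation (AuToBI is Java; the details are in the Interspeech paper): it uses scikit-learn's LogisticRegression and made-up feature data just to show the mechanics. Each example is weighted by the inverse of its class frequency, so the rare class contributes to the loss as much as the majority class, with no examples discarded or duplicated.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Skewed toy data: ~90% class 0, ~10% class 1, standing in for
# something like unaccented vs. accented words.
n = 2000
y = (rng.random(n) < 0.1).astype(int)
X = rng.normal(loc=y[:, None] * 0.75, scale=1.0, size=(n, 4))

# Importance weighting: weight each example by the inverse of its
# class frequency instead of under- or over-sampling the data.
freq = np.bincount(y) / n
w = 1.0 / freq[y]

unweighted = LogisticRegression().fit(X, y)
weighted = LogisticRegression().fit(X, y, sample_weight=w)

# The weighted model typically trades a little overall accuracy for
# much better recall on the rare class (measured here on the
# training data, purely for illustration).
for name, clf in [("unweighted", unweighted), ("weighted", weighted)]:
    pred = clf.predict(X)
    recall = (pred[y == 1] == 1).mean()
    print(name, "rare-class recall:", round(float(recall), 3))
```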
Second, inspired by features that Taniya Mishra, Vivek Sridhar, and Alistair Conkie developed at AT&T (described in an upcoming Interspeech 2012 paper of theirs), I included some new features that had a big payoff. One of the most significant calculates the area under a normalized intensity curve. This correlates strongly with duration, but is more robust. You could argue that it approximates "loudness" by incorporating duration and intensity; that's a pretty weak psycholinguistic or perceptual argument, so I wouldn't push it too strongly, but it could be part of the story.
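Here is a rough sketch of my reading of that feature, not the AT&T or AuToBI code: normalize a word's intensity contour (here by the utterance maximum, which is an assumption on my part) and sum it over the word's frames, so that both how loud and how long the word is feed into the value. The contours, frame step, and normalization constant below are all hypothetical.

```python
import numpy as np

def area_under_normalized_intensity(intensity, norm_max, frame_dur=0.01):
    """
    Area under a normalized intensity contour for one word.

    intensity : per-frame intensity values (e.g. dB) within the word
    norm_max  : normalization constant (e.g. the utterance's max intensity)
    frame_dur : frame step in seconds

    A long, loud word accumulates a large area; a short or quiet one
    does not -- which is why the feature tracks duration while being
    more robust than duration alone.
    """
    normalized = np.asarray(intensity, dtype=float) / norm_max
    return float(np.sum(normalized) * frame_dur)

# Hypothetical intensity contours (dB) for two words in one utterance.
utterance_max = 80.0
short_quiet_word = [55, 58, 57, 54]                 # 4 frames, 40 ms
long_loud_word = [70, 74, 78, 79, 77, 73, 68, 62]   # 8 frames, 80 ms

print(area_under_normalized_intensity(short_quiet_word, utterance_max))
print(area_under_normalized_intensity(long_loud_word, utterance_max))
```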
Here is a recap of speaker-independent acoustic-only performance on the six ToBI classification tasks on BURNC speaker f2b.
| Task | Version 1.2 | Version 1.3 |
|---|---|---|
| Pitch Accent Detection | 81.01% (F1: 83.28) | 84.83% (F1: 86.58) |
| Intermediate Phrase Detection | 75.41% (F1: 43.15) | 77.97% (F1: 44.43) |
| Intonational Phrase Detection | 86.91% (F1: 74.50) | 90.36% (F1: 76.49) |
| Pitch Accent Classification | 18.46% (Average Recall: 18.97) | 16.33% (Average Recall: 21.06) |
| Phrase Accent Classification | 48.34% (Average Recall: 47.99) | 47.44% (Average Recall: 48.31) |
| Phrase Accent/Boundary Tone Classification | 73.18% (Average Recall: 25.92) | 74.47% (Average Recall: 26.02) |
There are also a number of technical improvements to AuToBI as a piece of software.
First of all, unit test coverage has increased from ~11% to ~73% between v1.2 and v1.3.
Second, there was a bug in the PitchExtractor code that caused a pretty serious under-prediction of unvoiced frames. (A big thanks to Victor Soto for finding this bug.)
Third, memory use is much lower, thanks to more aggressive deletion of prediction attributes and a modification of how WavReader works.
I'd like to thank Victor Soto, Fabio Tesser, Samuel Sanchez, Jay Liang, Ian Kaplan, Erica Cooper, and, as ever, Julia Hirschberg, along with anyone else who has been using AuToBI, for their patience and feedback.
I've been pretty lax about posting here. I'll try to get better about it in the coming academic year.
This fall is full of travel, which will lead to a lot of ideas and not enough time to work on them.