Here's the premise: I was invited to give a guest lecture in Advanced Natural Language Processing. The students will get one week out of 14 focusing on speech and spoken language processing. But it's early in the semester, so there's an opportunity to give a perspective on how speech fits into the lessons that they'll be learning in more detail later in the semester.
Here's the question: how do you spend 75 minutes to provide a useful survey of speech and spoken language processing?
My answer, in PowerPoint form, can be found here.
I spend about two-thirds of the material on speech recognition. I figured most students are fascinated by the idea of a machine being able to get words from speech, so let's go through the fundamentals of the technology behind it.
In the remaining third or so, I focus on the notion that speech recognition is not sufficient for speech understanding. There is a lot of other information in speech that is either 1) unavailable in text, or 2) unavailable in ASR transcripts. The goal of this section is to convince students that speech isn't just a noisy string of unadorned words, but that there's a lot of information about structure and intention available from the speech signal. What's more, we can use it in spoken language processing.
There is an outrageous number of important concepts that get almost no attention here, including but not limited to: digital signal processing, human speech production and perception, speech synthesis, multimodal speech processing, speaker identification, language identification, building speech corpora, linguistic annotation, discourse and dialog, and conversational agents.
Would you do it differently? I'm curious what some other takes on this problem might look like.