I hadn't really planned for this current improvement to AuToBI to be a milestone release.
I'm about halfway through an effort to get test coverage up to 90-95% of lines and 100% of classes. I promise it'll get there eventually.
But in the meantime, I was playing with an improvement to how attributes are associated with data points. I knew this was a significant source of inefficiency, but I didn't quite expect this much.
Here are memory usage graphs for training a Pitch Accent Detection model on the Boston University Radio News Corpus -- about 22k data points and 136 features. The first one is from my MacBook Pro laptop with 4GB of RAM (and a lot of other nonsense running).
The max memory usage of version 1.1 was 1914MB; with this improvement it tops out at 1049MB, an improvement of about 45%. (You'll notice it also finishes a little quicker, but this is probably because of fewer or faster garbage collection calls.)
I figured I'd check on a compute server too, one of the Speech Lab @ Queens College's quad-core Intel Xeon E5450 machines (3.0GHz, 2x6MB L2, 1333MHz FSB) with 4GB of RAM.
Similar results here. Max memory usage of version 1.1 was 2343MB, and with the improvement, 1392MB, an improvement of about 40%. (The speed improvement shows up here too.) I don't have a good explanation for why the Linux version takes more memory to run, but for now I'll assume it has something to do with differences in the JVM.
There are some other bugfixes in this version, but this is the big reason to upgrade.
Version 1.2 is available from github:
git clone git@github.com:AndrewRosenberg/AuToBI.git
Tuesday, January 10, 2012
Tuesday, January 03, 2012
English Pronunciation by G. Nolst Trenité
This is a repost of a poem posted on spelling.wordpress.com that's been going around Facebook today.
It's an incredibly elegant set of examples about why grapheme-to-phoneme (letter-to-sound) conversion is so difficult in English. (Maybe this should be a required regression test for any TTS frontend...)
Please enjoy.
English Pronunciation by G. Nolst Trenité (after the break)
Wednesday, December 21, 2011
Evaluating multi-class classification performance
Classification performance is generally easy to evaluate. Just use accuracy. It's the most intuitive evaluation measure you can think of.
$\frac{\text{Correct Predictions}}{\text{Number of Data Points}}$
We can also represent classification results as a contingency matrix $A$, with $A_{ij}$ for $i, j \in \vec{t} = \{t_1,\ldots, t_k\}$, where $k$ is the number of possible target values and $A_{ij}$ is the number of times that a data point with true label $t_i$ was classified as label $t_j$. Then accuracy is defined as follows:
$Accuracy = \frac{\sum_i A_{ii}}{\sum_i \sum_j A_{ij}}$
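In code, this is just the trace of the contingency matrix over the grand total. A quick sketch, with hypothetical counts for a 3-class task:

```python
import numpy as np

# Hypothetical contingency matrix: A[i, j] = number of points with
# true label t_i that were classified as t_j.
A = np.array([[50,  2,  3],
              [ 4, 40,  6],
              [ 1,  5, 30]])

# Correct predictions (the diagonal) over all predictions.
accuracy = np.trace(A) / A.sum()
print(accuracy)  # 120/141 ≈ 0.851
```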
This is great until the balance of classes gets out of whack.
This is most commonly seen in cases where you're not really classifying but detecting something. Say, a classification task where 95% of data points are non-cancerous and 5% are cancerous. A degenerate classifier that assigns the majority class label to all points will achieve 95% accuracy (seemingly very good) but is completely uninformative.
When the distribution of class labels is skewed like this, accuracy becomes a poor evaluation measure. When there are only 2 labels, there are a variety of evaluation choices that have been developed through investigation of detection problems, including F-measure and ROC curves.
There are fewer available choices with more than 2 labels, and the community (at least in natural language processing and speech) has not yet settled on a consistent evaluation measure for these tasks.
Here are some options:
* Ian Read and Stephen Cox proposed the use of "Balanced Error Rate" in "Automatic Pitch Accent Prediction for Text-To-Speech Synthesis", Interspeech 2007.
\[
BER = 1 - \frac{1}{k}\sum_i\frac{A_{ii}}{\sum_j A_{ij}}
\]
This is one minus the average recall (correct predictions of a class / true instances of that class), treating each class evenly regardless of its size. It's worth noting that this measure -- average recall -- was also used in the Interspeech Challenges in 2009, 2010, and 2011. The organizers of these challenges, for reasons that escape explanation, chose to call the evaluation measure "unweighted accuracy" rather than "average recall" or "balanced error rate".
* In my thesis, I put a spin on this measure and defined "Combined Error Rate": the average of the false positive and false negative rates, each weighted by class frequency and summed across classes.
\[
CER = \frac{\sum_i \frac{n_i}{n} FP_i + \sum_i \frac{n_i}{n} FN_i}{2}
\]
This was fairly ad hoc, and not all that well motivated.
* There are also micro- and macro-averaged F-measures, which extend the idea of precision and recall from a single class to a set of classes.
* Mutual Information treats the class and hypothesis distributions as multinomial random variables, and calculates their, well... mutual information. The downside to this technique is that a set of perfectly bad predictions will have perfect MI, just as a set of perfectly good predictions will, since MI is insensitive to any consistent relabeling of the hypotheses. This is a serious problem, making it significantly less useful.
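The MI failure mode is easy to demonstrate. Here's a sketch using a small hand-rolled empirical-MI helper (my own toy function, not any particular library's API), applied to a perfect predictor and a consistently wrong one:

```python
import math
from collections import Counter

def mutual_information(y_true, y_pred):
    """Empirical mutual information (in bits) between true labels and hypotheses."""
    n = len(y_true)
    joint = Counter(zip(y_true, y_pred))  # joint counts over (true, hyp) pairs
    px = Counter(y_true)                  # true-label counts
    py = Counter(y_pred)                  # hypothesis counts
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), with probabilities as count ratios
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi

truth = [0, 0, 1, 1, 2, 2]
good  = [0, 0, 1, 1, 2, 2]   # every prediction correct
bad   = [1, 1, 2, 2, 0, 0]   # every prediction wrong, but consistently so
print(mutual_information(truth, good))  # log2(3) ≈ 1.585 bits
print(mutual_information(truth, bad))   # exactly the same value
```

The "bad" predictor is just a permutation of the labels, so it transmits exactly as much information as the perfect one -- which is the objection above.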
With the exception of Mutual Information, each of these measures has the same form: examine the contingency table, and weight the contribution of a correct prediction (or error) based on its true or hypothesized class membership.
This allows us to generalize (many of) these measures into something like
\[
Measure = \frac{1}{Z(A)}\sum_i \sum_j w(i,j) A_{ij}
\]
where $w(i,j)$ is a function that determines how much a classification of a point of class $i$ as class $j$ contributes to the measure, and $Z(A)$ is a normalizing function.
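A sketch of this generalized form (the matrix values are hypothetical; the two $w$ functions below just recover accuracy and average recall, i.e. one minus BER, as special cases):

```python
import numpy as np

def generalized_measure(A, w, Z):
    """Compute (1/Z(A)) * sum_ij w(i,j) * A[i,j] over a contingency matrix A."""
    A = np.asarray(A, dtype=float)
    k = A.shape[0]
    total = sum(w(i, j) * A[i, j] for i in range(k) for j in range(k))
    return total / Z(A)

# Hypothetical 3-class contingency matrix (rows = true labels).
A = np.asarray([[50,  2,  3],
                [ 4, 40,  6],
                [ 1,  5, 30]], dtype=float)

# Accuracy: w(i,j) = delta(i = j), Z(A) = total count.
acc = generalized_measure(A, lambda i, j: float(i == j), lambda M: M.sum())

# Average recall (1 - BER): w(i,j) = delta(i = j) / n_i, Z(A) = k.
n = A.sum(axis=1)  # true instances per class
avg_recall = generalized_measure(A, lambda i, j: float(i == j) / n[i],
                                 lambda M: float(M.shape[0]))
print(acc, avg_recall)
```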
Essentially we're trying to determine how important it is for each point to be classified correctly, on the basis of its class membership, and how damaging an incorrect classification is.
Information theory can give us a value for how important each point is. It seems as though $w(i,j) := \delta(i = j)\cdot(-p_i \log p_i)$ would be a valuable evaluation measure, where $\delta(i = j)$ is a Kronecker delta function, which is 1 if $i = j$ and 0 otherwise. This says that correct classifications contribute the amount of information in their true class to the measure function, while incorrect classifications contribute nothing. This instantiation of an accuracy-like measure weights correct classifications by the amount of information they transmit to the user.
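Here's a minimal sketch of what that might look like, with two assumptions of my own beyond the proposal above: the priors $p_i$ are estimated from the true-label counts, and the normalizer $Z(A)$ is set to the score a perfect classifier would achieve, so the measure lands in $[0, 1]$:

```python
import numpy as np

def info_weighted_accuracy(A):
    """Accuracy-like measure where a correct prediction of class i
    contributes -p_i * log2(p_i), and errors contribute nothing.
    Assumes every class appears at least once among the true labels."""
    A = np.asarray(A, dtype=float)
    n_i = A.sum(axis=1)           # true counts per class (rows = true labels)
    p = n_i / A.sum()             # empirical class priors
    info = -p * np.log2(p)        # information weight of each class label
    score = (np.diag(A) * info).sum()
    perfect = (n_i * info).sum()  # what a flawless classifier would score
    return score / perfect

# The degenerate majority-class classifier from the cancer example:
# 95 non-cancerous points (all predicted correctly), 5 cancerous (all missed).
A = [[95, 0],
     [5, 0]]
print(info_weighted_accuracy(A))  # ≈ 0.86, vs. 0.95 raw accuracy
```

On the degenerate cancer classifier this scores about 0.86 rather than 0.95 -- a penalty for ignoring the rare class, though perhaps not as steep a one as you'd like.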
I don't know of anyone who has done this, but it seems like someone should have. I'll keep looking in the literature until I find something, or I give up and write a little paper on this.
Tuesday, September 06, 2011
Interspeech 2011 Recap
Interspeech 2011 was held in Florence, Italy about a week ago, August 28-31. A vacation in Italy was too good to pass up, so The Lady joined me, and we stayed until Labor Day.
I ended up sending a bulk of my work to Interspeech, so I spent more time than usual in sessions I was presenting in rather than seeing a lot of papers.
Two interesting themes stood out to me this year. Not for nothing, but these represent some novel ideas about speech science (via understanding dialog) and about speech engineering.
Entrainment
Julia Hirschberg, my former advisor, received the ISCA medal for her years of work in speech. Her talk was on current work with Agustín Gravano and Ani Nenkova on entrainment. Entrainment is the phenomenon by which, when people are speaking to each other, their speech becomes more similar. This can be realized in the words that are used to describe a concept, as well as in speaking rate, pitch, and intensity. How this happens isn't totally understood, and measures of entrainment are still being developed. This research theme is still in its early phases, but I haven't seen an idea spread around a conference as quickly or as thoroughly as this did. There were questions and discussions all over the place about this phenomenon (even around Tom Mitchell's keynote about fMRI data and word meaning). The more engineering-minded folks weren't as compelled by its utility for speech processing, but within the speech perception and production communities, and specifically among the dialog and IVR folks, it was all the rage. It'll be something to see how this develops.
I-vectors
I-vectors are a topic that I need to spend some time learning. I hadn't heard of them before this conference, where no fewer than a dozen papers used the approach. Essentially the idea is this: the locations of mixture components in a GMM are composed of (in at least one form) a UBM, a channel component, and an "interesting" component. This "interesting" component can be the contribution of a particular speaker, or a language/dialect, or anything else you're trying to model. Joint Factor Analysis is used to decompose the observation into these components in an unsupervised fashion. This is the part where my understanding of the math is still limited. The crux is that the "interesting" component can be represented by a $Dx$ transformation, where the dimensionality of $x$ can be set by the user. In comparison to a supervector representation, where the dimensionality of the supervector is constrained to equal the number of mean parameters of the GMM, i-vectors can be significantly smaller, leading to better estimation and smaller models. I'll be reading this tutorial by Howard Lei over the next few weeks to get up to speed.
There were few specific papers that stood out to me this conference. I'm intrigued by Functional Data Analysis as a way to model continuous time-value observations. Michele Gubian gave a tutorial on this that I sadly missed, and included it in at least one paper, Predicting Taiwan Mandarin tone shapes from their duration by Chierh Chung and Michele Gubian. This paper didn't totally convince me of the utility of the technique, but there may be more appropriate applications.
It was a satisfying and inspiring conference, to be sure. I think I was more interested in talking to people than in papers in particular this time. If anyone has particular favorites that I missed, please use the comments to share or just email me.
Wednesday, August 24, 2011
Interspeech Tutorial
I'm giving my first tutorial at Interspeech on Saturday in Florence.
It's titled "More than Words can Say: Prosodic Analysis Techniques and Applications".
The (very likely) final version of the slides is available here.
Now off to Italy. Interspeech Recap to come.
Thursday, May 12, 2011
Because Research is a Creative Act.
This quote by Ira Glass has been going around design and writing blogs and tumblrs. It's equally applicable to research.
Do good work. But maybe more importantly, work.
Nobody tells this to people who are beginners, I wish someone told me. All of us who do creative work, we get into it because we have good taste. But there is this gap. For the first couple years you make stuff, it’s just not that good. It’s trying to be good, it has potential, but it’s not. But your taste, the thing that got you into the game, is still killer. And your taste is why your work disappoints you. A lot of people never get past this phase, they quit. Most people I know who do interesting, creative work went through years of this. We know our work doesn’t have this special thing that we want it to have. We all go through this. And if you are just starting out or you are still in this phase, you gotta know its normal and the most important thing you can do is do a lot of work. Put yourself on a deadline so that every week you will finish one story. It is only by going through a volume of work that you will close that gap, and your work will be as good as your ambitions. And I took longer to figure out how to do this than anyone I’ve ever met. It’s gonna take awhile. It’s normal to take awhile. You’ve just gotta fight your way through.
Friday, February 18, 2011
I'll take "The Humans" for $1,000,000, Alex. (The obligatory Watson post.)
Full disclosure: I worked one day a week for a year on improving Watson's speech synthesis. While I think the voice added flair and personality to the exhibition, everyone knows that question-answering is what Watson is all about. And I didn't have any hand in that.
Wednesday marked the end of the three-day exhibition Jeopardy! match between Watson, IBM's question answering system, and Ken Jennings and Brad Rutter. Watson decisively beat the human competitors.
There is no shortage of commentary on the implications of Watson's victory. One of my favorite pieces was by Ken Jennings himself. He's played more games of Jeopardy! than anyone else, and describes the experience in a very measured way.
One of his observations sticks out: "I expected Watson's bag of cognitive tricks to be fairly shallow, but I felt an uneasy sense of familiarity as its programmers briefed us before the big match: The computer's techniques for unraveling Jeopardy! clues sounded just like mine."
It is in this that Watson represents a step forward for Artificial Intelligence.
There is a human exceptionalism that most people have when discussing "intelligence". This exceptionalism leads to a dismissal of the success of Deep Blue's defeat of Kasparov as computation not intelligence. It leads to descriptions of Watson as a "big database". The same exceptionalism leads to the Turing Test as the final arbiter of intelligence.
Prediction: No machine will ever pass the Turing Test.
The test is explicitly defined to defend the exceptionalism of human intelligence. It all but explicitly says, "humans are intelligent. To be intelligent, you must be indistinguishable from a human."
Before this test is passed, machines will continue to perform tasks which we consider to demonstrate intelligence better than humans can. Chess: done. Quiz shows: done. Mathematical theorem proving: done. What's next? Perfect SAT scores? Perfect GRE scores? Writing a five-paragraph essay? All conceivable in the next, say, five years. Performing basic research and writing an accepted peer-reviewed publication of its findings? Certainly further off, but not inconceivable. But none of these systems would pass the Turing Test; people would still ask, "can it make me coffee?" or "can it tell me why Ronnie and Sam are still together?" And yet, progress in artificial intelligence will be undeterred.
Ray Kurzweil's prediction of a self-aware system will almost certainly predate one that can convince us that it is human. Despite its ability to describe its "thought" process, and outperform humans on tasks that were formerly considered representative behavior, there will be those that won't consider this system "intelligent" until it passes the Turing Test.
Watson's success comes from some very impressive machine learning and natural language processing, not only in its ability to generate answers quickly, but in its collection of knowledge. The ML and NLP components used in Watson may individually be incremental improvements on previously existing approaches. Yet this doesn't mean that Watson is itself an incremental milestone. Watson bested a human at a task that we believed, until Wednesday night, required intelligence. But more impressively, it won in a way that was similar (in spirit if not in hardware) to the way humans play Jeopardy!
Requisite CYA: As always, this post doesn't represent anyone's views other than my own.