Wednesday, December 21, 2011

Evaluating multi-class classification performance

Classification performance is generally easy to evaluate.  Just use accuracy.  It's the most intuitive evaluation measure you can think of.

$Accuracy = \frac{\text{Correct Predictions}}{\text{Number of Data Points}}$

We can also represent classification results as a contingency matrix $A$, with entries $A_{ij}$ for $i, j \in \{1,\ldots, k\}$, where $k$ is the number of possible target values $\{t_1,\ldots, t_k\}$ and $A_{ij}$ is the number of times that a data point with true label $t_i$ was classified as label $t_j$.  Then accuracy is defined as follows:

$Accuracy = \frac{\sum_i A_{ii}}{\sum_i \sum_j A_{ij}}$
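As a quick sketch (hypothetical counts; I'll use the convention of true labels in rows and hypothesized labels in columns):

```python
import numpy as np

# A hypothetical 3-class contingency matrix: A[i, j] counts points of
# true class i classified as class j.
A = np.array([[50,  2,  3],
              [ 4, 40,  6],
              [ 1,  5, 30]])

def accuracy(A):
    # Sum of the diagonal (correct predictions) over the total count.
    return np.trace(A) / A.sum()

print(accuracy(A))   # 120 correct of 141 points, ~0.851
```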

This is great until the balance of classes gets out of whack.

This is most commonly seen in cases where you're not really classifying but detecting something.  Say, a classification task where 95% of data points are non-cancerous and 5% are cancerous.   A degenerate classifier which assigns the majority class label to all points achieves 95% accuracy (seemingly very good) but is completely uninformative.

When the distribution of class labels is skewed like this, accuracy becomes a poor evaluation measure.  When there are only 2 labels, there is a variety of evaluation choices that have been developed through investigation of detection problems, including F-measure and ROC curves.

There are fewer available choices with more than 2 labels, and the community (at least in natural language processing and speech) has not yet settled on a consistent evaluation measure for these tasks.

Here are some options:

* Ian Read and Stephen Cox proposed the use of "Balanced Error Rate" in "Automatic Pitch Accent Prediction for Text-To-Speech Synthesis", Interspeech 2007.

$BER = 1 - \frac{1}{k}\sum_i\frac{A_{ii}}{\sum_j A_{ij}}$

This is one minus the average recall (correct predictions of a class / true instances of that class), treating each class equally regardless of its size.  It's worth noting that this measure -- average recall -- was also used in the Interspeech Challenges in 2009, 2010, and 2011.  The organizers of these challenges, for reasons that escape explanation, chose to call the evaluation measure "unweighted accuracy" rather than "average recall" or "balanced error rate".
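A minimal sketch of BER from a contingency matrix with true labels in rows:

```python
import numpy as np

# Balanced Error Rate: one minus the unweighted average of per-class recall.
def balanced_error_rate(A):
    recalls = np.diag(A) / A.sum(axis=1)   # per-class recall
    return 1.0 - recalls.mean()

# The degenerate majority-class detector on a 95/5 split: 95% accuracy,
# but a BER of 0.5 -- no better than guessing.
A = np.array([[95, 0],
              [ 5, 0]])
print(balanced_error_rate(A))   # 0.5
```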

* In my thesis, I put a spin on this measure and defined "Combined Error Rate": the average of the false positive and false negative rates, each weighted by class frequency and summed across classes.

$CER = \frac{\sum_i \frac{n_i}{n} FP_i + \sum_i\frac{n_i}{n} FN_i}{2}$

This was fairly ad hoc, and not all that well motivated.
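Since the definition was ad hoc, here is just one plausible reading of the formula as code (my assumptions: $FP_i$ and $FN_i$ are class $i$'s false positive and false negative *rates*, and $n_i$ is the number of true instances of class $i$):

```python
import numpy as np

def combined_error_rate(A):
    # A: contingency matrix, true labels in rows.
    A = A.astype(float)
    n_i = A.sum(axis=1)                                  # true instances per class
    n = A.sum()
    fn_rate = 1.0 - np.diag(A) / n_i                     # misses / true instances
    fp_rate = (A.sum(axis=0) - np.diag(A)) / (n - n_i)   # false alarms / non-instances
    w = n_i / n
    return ((w * fp_rate).sum() + (w * fn_rate).sum()) / 2.0

# Majority-class predictor on a 95/5 split again scores 0.5.
A = np.array([[95, 0],
              [ 5, 0]])
print(combined_error_rate(A))
```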

* There are also micro- and macro-averaged F-measures, which extend the idea of precision and recall from a single class to a set of classes.
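The distinction can be sketched as follows (true labels in rows; macro averages per-class F1, weighting classes equally, while micro pools the counts first -- with exactly one label per point, micro-F1 reduces to accuracy):

```python
import numpy as np

def f_measures(A):
    tp = np.diag(A).astype(float)
    fp = A.sum(axis=0) - tp            # predicted as class i, but wrong
    fn = A.sum(axis=1) - tp            # truly class i, but missed
    f1 = 2 * tp / (2 * tp + fp + fn)   # per-class F1
    macro = f1.mean()
    micro = 2 * tp.sum() / (2 * tp.sum() + fp.sum() + fn.sum())
    return macro, micro

A = np.array([[50,  2,  3],
              [ 4, 40,  6],
              [ 1,  5, 30]])
macro, micro = f_measures(A)
```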

* Mutual Information treats the true and hypothesized class distributions as multinomial random variables, and calculates their, well, mutual information.  The downside to this technique is that a set of perfectly bad predictions will have perfect MI, just as a set of perfectly good predictions will -- any one-to-one relabeling of the hypotheses preserves MI.  This is a serious problem, making it significantly less useful.
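The pitfall is easy to demonstrate: swapping the hypothesized labels -- turning every correct prediction into an error -- leaves the mutual information unchanged. A sketch:

```python
import numpy as np

def mutual_information(A):
    # MI (in bits) between true and hypothesized labels, from the
    # normalized contingency matrix.
    P = A / A.sum()                         # joint distribution
    p_true = P.sum(axis=1, keepdims=True)
    p_hyp = P.sum(axis=0, keepdims=True)
    with np.errstate(divide='ignore', invalid='ignore'):
        terms = P * np.log2(P / (p_true * p_hyp))
    return np.nansum(terms)                 # zero-probability cells contribute 0

perfect = np.array([[50, 0], [0, 50]])      # every prediction correct: MI = 1 bit
flipped = perfect[:, ::-1]                  # every prediction wrong: MI still 1 bit
```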

With the exception of Mutual Information, each of these has the same form: examine the contingency table, and weight the contribution of a correct prediction (or error) based on its true or hypothesized class membership.

This allows us to generalize (many of) these measures into something like

$Measure = \frac{1}{Z(A)}\sum_i \sum_j w(i,j) A_{ij}$

where $w(i,j)$ is a function that determines how much a classification of a point of class $i$ as class $j$ contributes to the measure, and $Z(A)$ is a normalizing function.
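A sketch of this family of measures; instantiating $w$ with the Kronecker delta and $Z$ with the total count recovers plain accuracy:

```python
import numpy as np

def general_measure(A, w, Z):
    # w(i, j): weight of a point of true class i hypothesized as class j.
    # Z(A): normalizing function over the contingency matrix.
    k = A.shape[0]
    total = sum(w(i, j) * A[i, j] for i in range(k) for j in range(k))
    return total / Z(A)

A = np.array([[95, 0],
              [ 5, 0]])
acc = general_measure(A,
                      w=lambda i, j: 1.0 if i == j else 0.0,
                      Z=lambda A: A.sum())
print(acc)   # 0.95, plain accuracy
```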

Essentially we're trying to determine how important it is for each point to be classified correctly, and how damaging an incorrect classification is, on the basis of its class membership.

Information theory can give us values for how important each point is.  It seems as though $w(i,j) := \delta(i = j)(-p_i \log p_i)$ would be a valuable evaluation measure, where $p_i$ is the empirical probability of class $i$, and $\delta(i = j)$ is a Kronecker delta function, which is 1 if $i = j$ and 0 otherwise. This says that a correct classification contributes the amount of information in its true class to the measure function, while an incorrect classification contributes nothing.  This instantiation of an accuracy-like measure weights correct classifications by the amount of information they transmit to the user.
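A sketch of this proposal. The normalizer is my assumption -- the post leaves $Z(A)$ open -- chosen as the total information of the dataset so that a perfect classifier scores exactly 1:

```python
import numpy as np

def info_weighted_accuracy(A):
    A = np.asarray(A, dtype=float)
    n_i = A.sum(axis=1)                 # true instances per class
    p = n_i / n_i.sum()                 # empirical class priors
    info = -p * np.log2(p)              # per-point weight of each class
    score = (np.diag(A) * info).sum()   # only correct predictions contribute
    Z = (n_i * info).sum()              # assumed normalizer: total information
    return score / Z

# The 95/5 majority-class predictor scores ~0.86 here rather than 0.95,
# since correct predictions of the rare class carry more information.
A = np.array([[95, 0],
              [ 5, 0]])
print(info_weighted_accuracy(A))
```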

I don't know of anyone who has done this, but it seems like someone should have.  I'll keep looking in the literature until I find something, or I give up and write a little paper on this.

Tuesday, September 06, 2011

Interspeech 2011 Recap

Interspeech 2011 was held in Florence, Italy about a week ago, August 28-31.  A vacation in Italy was too good to pass up, so The Lady joined me, and we stayed until Labor Day.

I ended up sending a bulk of work to Interspeech, so I spent more time than usual in sessions I was presenting in, rather than seeing a lot of papers.

Two interesting themes stood out to me this year.  Not for nothing, these represent some novel ideas in both speech science -- through the understanding of dialog -- and speech engineering.

Julia Hirschberg, my former advisor, received the ISCA medal for her years of work in speech.  Her talk was on current work with Agustín Gravano and Ani Nenkova on entrainment.  Entrainment is the phenomenon by which, when people are speaking to each other, their speech becomes more similar.  This can be realized in terms of the words that are used to describe a concept, as well as speaking rate, pitch, and intensity.  How this happens isn't totally understood, and measures of entrainment are still being developed.  This research theme is still in its early phases, but I haven't seen an idea spread around a conference as quickly or as thoroughly as this did.  There were questions and discussions about this phenomenon all over the place, including at Tom Mitchell's keynote about fMRI data and word meaning.  The more engineering-minded folks weren't as compelled by its utility for speech processing, but within the speech perception and production communities, and specifically among the dialog and IVR folks, it was all the rage.  It'll be something to see how this develops.

I-vectors are a topic that I need to spend some time learning.  I hadn't heard of them before this conference, where no fewer than a dozen papers used this approach.  Essentially the idea is this:  the location of mixture components in a GMM is composed of (in at least one formulation) a UBM component, a channel component, and an "interesting" component.  This "interesting" component can be the contribution of a particular speaker, or a language/dialect, or anything else you're trying to model.  Joint Factor Analysis is used to decompose the observation into these components in an unsupervised fashion.  It's in this part where my understanding of the math is still limited.  The crux is that the "interesting" component can be represented by a transformation $Dx$, where the dimensionality of $x$ can be set by the user.  In comparison to a supervector representation, where the dimensionality of the supervector is constrained to equal the number of parameters (or means) of the GMM, i-vectors can be significantly smaller, leading to better estimation and smaller models.  I'll be reading this tutorial by Howard Lei over the next few weeks to get up to speed.

There were few specific papers that stood out to me at this conference.  I'm intrigued by Functional Data Analysis as a way to model continuous time-value observations. Michele Gubian gave a tutorial on this that I sadly missed, and used it in at least one paper, "Predicting Taiwan Mandarin tone shapes from their duration" by Chierh Chung and Michele Gubian.  This paper wasn't totally convincing on the utility of the technique, but there may be more appropriate applications.

It was a satisfying and inspiring conference, to be sure.  I think I was more interested in talking to people than in papers in particular this time.  If anyone has particular favorites that I missed, please use the comments to share or just email me.

Wednesday, August 24, 2011

Interspeech Tutorial

I'm giving my first tutorial at Interspeech on Saturday in Florence.

It's titled "More than Words can Say: Prosodic Analysis Techniques and Applications".

The (very likely) final version of the slides is available here.

Now off to Italy.  Interspeech Recap to come.

Thursday, May 12, 2011

Because Research is a Creative Act.

This quote by Ira Glass has been going around design and writing blogs and tumblrs.  It's equally applicable to research.

Nobody tells this to people who are beginners, I wish someone told me. All of us who do creative work, we get into it because we have good taste. But there is this gap. For the first couple years you make stuff, it’s just not that good. It’s trying to be good, it has potential, but it’s not. But your taste, the thing that got you into the game, is still killer. And your taste is why your work disappoints you. A lot of people never get past this phase, they quit. Most people I know who do interesting, creative work went through years of this. We know our work doesn’t have this special thing that we want it to have. We all go through this. And if you are just starting out or you are still in this phase, you gotta know its normal and the most important thing you can do is do a lot of work. Put yourself on a deadline so that every week you will finish one story. It is only by going through a volume of work that you will close that gap, and your work will be as good as your ambitions. And I took longer to figure out how to do this than anyone I’ve ever met. It’s gonna take awhile. It’s normal to take awhile. You’ve just gotta fight your way through.

Do good work.  But maybe more importantly, work.

Friday, February 18, 2011

I'll take "The Humans" for $1,000,000, Alex. (The obligatory Watson post.)

Full disclosure: I worked one day a week for a year on improving Watson's speech synthesis.  While I think the voice added flair and personality to the exhibition, everyone knows that question-answering is what Watson is all about. And I didn't have any hand in that.

Wednesday marked the end of the three day exhibition Jeopardy! match between Watson, IBM's question answering system, and Ken Jennings and Brad Rutter.  Watson decisively beat the human competitors.

There is no shortage of commentary on the implications of Watson's victory.   One of my favorite pieces was by Ken Jennings himself.  He's played more games of Jeopardy! than anyone else, and describes the experience in a very measured way.

One of his observations sticks out: "I expected Watson's bag of cognitive tricks to be fairly shallow, but I felt an uneasy sense of familiarity as its programmers briefed us before the big match: The computer's techniques for unraveling Jeopardy! clues sounded just like mine."

It is in this that Watson represents a step forward for Artificial Intelligence.

There is a human exceptionalism that most people have when discussing "intelligence".  This exceptionalism leads to a dismissal of the success of Deep Blue's defeat of Kasparov as computation, not intelligence.  It leads to descriptions of Watson as a "big database".  The same exceptionalism leads to the Turing Test as the final arbiter of intelligence.

Prediction: No machine will ever pass the Turing Test.

The test is explicitly defined to defend the exceptionalism of human intelligence.  It all but explicitly says, "humans are intelligent. To be intelligent, you must be indistinguishable from a human."

Before this test is passed, machines will continue to outperform humans at tasks which we consider to demonstrate intelligence.  Chess: done.  Quiz shows: done. Mathematical theorem proving: done.  What's next? Perfect SAT scores? Perfect GRE scores?  Writing a five-paragraph essay?  All conceivable in the next, say, five years.  Performing basic research and writing an accepted peer-reviewed publication of its findings?  Certainly further off, but not inconceivable.  But none of these systems would pass the Turing Test; people would still ask, "can it make me coffee?" or "can it tell me why Ronnie and Sam are still together?"  And yet, progress in artificial intelligence will be undeterred.

Ray Kurzweil's prediction of a self-aware system will almost certainly predate one that can convince us that it is human.  Despite its ability to describe its "thought" process, and outperform humans on tasks that were formerly considered representative behavior, there will be those that won't consider this system "intelligent" until it passes the Turing Test.

Watson's success comes from some very impressive machine learning and natural language processing, not only in its ability to generate answers quickly, but in its collection of knowledge.  The ML and NLP components used in Watson may individually be incremental improvements on previously existing approaches.  Yet this doesn't mean that Watson is itself an incremental milestone.  Watson bested a human at a task that we believed, until Wednesday night, required intelligence.  But more impressively, it won in a way that was similar (in spirit if not in hardware) to the way humans play Jeopardy!

Requisite CYA: As always, this post doesn't represent anyone's views other than my own.

Thursday, January 20, 2011

P == NP. The battle rages on.

Last summer, Vinay Deolalikar claimed to have proved that P != NP.  This kicked up some terrific math and computer science dust, and a lot of people got excited, myself included.  But this wasn't exactly a surprising result... most people suspect that P != NP, after all, but a proof has been elusive.

Well, evening out the playing field, Vladimir Romanov claims to have come up with a polynomial-time algorithm for 3-SAT -- one of the foundational NP-complete problems. Here's the paper describing the work.  But what's more... there's source code for it.  If he's right, then P == NP.

I fully expect that people will be picking this work apart pretty rigorously over the next few weeks, months, and years.  But if it holds up, it's a really big deal.  Much bigger than a proof that P != NP would have been, most specifically because it implies that integer factorization can be solved in polynomial time, which means all of our encryption is broken.

Tuesday, January 18, 2011

AuToBI Version 1.1

I've made enough improvements to AuToBI to consider the toolkit a milestone more mature.

In addition to some uninteresting bug fixes and refactoring, version 1.1 is made up of 3 significant changes.  As ever, the AuToBI homepage includes milestone releases, and the project itself is hosted on github.

1) Package restructuring.

The internal structure makes more sense now.  Classes are divided into 'core', 'feature extractors', 'classifiers', 'feature sets', 'io' and 'utilities'.  With over 100 classes, the code had outgrown a flat package structure.  Unfortunately, this restructuring changed the serialization signatures of the classes.  This means that old models won't work with the version 1.1 release.  But I have new models trained on the Boston Directions Corpus and Boston Radio News Corpus, which will be available from the AuToBI homepage shortly.

2) Implemented reference counting for Feature maintenance.

The feature extraction process allows a user to specify the features they want without explicitly going through all the intermediate steps to extract them.  For example, extracting the mean speaker-normalized pitch from the second half of a word requires pitch to be extracted, speaker normalization parameters to be calculated or retrieved, the pitch to be normalized, the second half of the word to be identified, and finally the mean to be calculated.  In AuToBI each of these steps is treated as a feature that is required by the extraction of the user-desired feature.  In version 1.0, AuToBI was able to identify which features were required, but never recognized when a feature was no longer needed.  Version 1.1 includes functionality that maintains a reference count for each feature based on how many still-unextracted features are going to need it -- speaker normalization parameters, for example, may be required by many requested features.  This allows for tidier memory management, which should, in turn, allow AuToBI to operate on more material.
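A toy sketch of the reference-counting idea (the feature names and functions here are hypothetical illustrations, not AuToBI's actual classes): count how many requested features transitively need each intermediate, and free its storage when the count hits zero.

```python
from collections import defaultdict

DEPS = {  # feature -> the intermediate features it requires (hypothetical)
    "mean_norm_pitch_2nd_half": ["norm_pitch", "2nd_half"],
    "max_norm_pitch": ["norm_pitch"],
    "norm_pitch": ["pitch", "speaker_norm_params"],
}

def requirements(feature):
    # All features transitively required by `feature`.
    out = []
    for dep in DEPS.get(feature, []):
        out.append(dep)
        out.extend(requirements(dep))
    return out

def extract_all(requested, storage):
    # Count how many requested features still need each intermediate.
    refcount = defaultdict(int)
    for feat in requested:
        for dep in requirements(feat):
            refcount[dep] += 1
    for feat in requested:
        # ... `feat` would actually be extracted here ...
        for dep in requirements(feat):
            refcount[dep] -= 1
            if refcount[dep] == 0:
                storage.pop(dep, None)   # no pending feature needs it: free it
    return refcount

storage = {f: [] for f in ["pitch", "speaker_norm_params", "norm_pitch", "2nd_half"]}
counts = extract_all(["mean_norm_pitch_2nd_half", "max_norm_pitch"], storage)
```

After the first requested feature finishes, "norm_pitch" (and its inputs) survive because the second still needs them; once both finish, everything is freed.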

3) Reduced storage for acoustic contours.

In version 1.0, pitch and intensity contours were stored as lists of time-value pairs -- a storage class containing 2 doubles.  Assuming 10ms samples and 16 bytes per sample, this is about 1,600 bytes per second.  Not unacceptably inefficient, but it could definitely be improved.  The other approach would be to store only values, and let the time of each point be specified by a start time (t0) and step size (dt).  This allows storage with 8 bytes per sample plus 16 bytes for the parameters.  The problem with this approach is that pitch points are invalid when there is no periodic material detected.  The time-value pair approach handled this simply and easily.  The new solution uses an array of one-bit booleans, one per sample in the contour, describing whether it was a 'valid' point or not.  If the ratio of periodic to aperiodic material is less than 32:1, the new approach will reduce the memory requirements.  (As I'm writing this, I realize that this ratio could be explicitly tested during pitch extraction, selecting the most efficient storage solution at runtime.  Keep an eye out in a new release.)
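As a back-of-the-envelope comparison of the two schemes (a sketch under the assumption that pairs are stored only for valid samples; the byte counts ignore JVM object and array overhead, which shifts the real breakeven):

```python
import numpy as np

def pair_storage_bytes(valid_mask):
    # One (time, value) pair of doubles per valid sample.
    return 16 * int(valid_mask.sum())

def compact_storage_bytes(valid_mask):
    # One double per frame, t0 and dt doubles, one validity bit per frame.
    n = valid_mask.size
    return 8 * n + 16 + (n + 7) // 8

# One second of fully voiced pitch at 10 ms frames:
mask = np.ones(100, dtype=bool)
print(pair_storage_bytes(mask), compact_storage_bytes(mask))   # 1600 829
```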

The changes are mostly inside-baseball kinds of things, but possibly useful for AuToBI users who get their hands dirty in the code -- whoever you are.   For everyone else, the most obvious changes you'll notice are that version 1.0 models don't work any more, and a speedup of ~6%.