Wednesday, December 21, 2011

Evaluating multi-class classification performance

Classification performance is generally easy to evaluate.  Just use accuracy.  It's the most intuitive evaluation measure you can think of.

$\frac{Correct Predictions}{Number of Data Points}$

We can also represent classification results as a contingency matrix A, with $A_{ij}$ for $i, j \in \vec{t} = \{t_1,\ldots, t_k\}$ where $k$ is the number of possible target values, and A_{ij} is the number of times that a data point from with true label $t_i$ was classified as label $t_j$.  Then accuracy is defined as follows:

$Accuracy = \frac{\sum_i A_{ii}}{\sum_i \sum_j A_{ij}}$

This is great until the balance of classes gets out of whack.

This is most commonly seen in cases, where you're not really classifying but detecting something.  Say, a classification task where 95% of data points are non-cancerous, and 5% are cancerous.   A degenerate classifier which assigns the majority class label to all points will lead to 95% accuracy (seemingly very good) but is completely uninformative.

When the distribution of class labels is skewed, like this, accuracy becomes a poor evaluation measure.  When there are only 2 labels, there are a variety of choices for evaluation that have been developed through investigation of detection problems, including F-measure, and ROC curves.

There are fewer available choices with more than 2 labels, and the community (at least in natural language processing and speech) has not yet settled on a consistent evaluation measure for these tasks.

Here are some options:

* Ian Read and Stephen Cox proposed the use of "Balanced Error Rate" in "Automatic Pitch Accent Prediction for Text-To-Speech Synthesis", Interspeech 2007.

BER = 1 - \frac{1}{k}\sum_i\frac{A_{ii}}{\sum_j A_{ij}}

This is one minus the average recall (correct predictions of a class / true instances) treating each class evenly, regardless of its class membership.  It's worth noting that this measure -- average recall -- was also used in the Interspeech Challenges in 2009, 2010, and 2011.  The organizers of this challenge, for reasons that escape explanation, chose to call the evaluation measure "unweighted accuracy" rather than "average recall" or "balanced error rate".

* In my thesis, I put a spin on this measure and defined, "Combined Error Rate".  This was the average of the total false positive and false negative rates taken across each class.

CER = \frac{\sum_i \frac{n_i}{n} * FP_i + \sum_i\frac{n_i}{n} * FN_i}{2}

This was fairly ad hoc, and not all that well motivated.

* There is also micro and macro-averaged F-measure, which extend the idea of precision and recall from a single class to a set of classes.

* Mutual Information treats the class and hypothesis distributions as random multinomial variables, and calculates their, well, information.  The downside to this technique is that perfectly bad predictions will have perfect MI, just as a set of perfectly good predictions will.  This is a serious problem, making it significantly less useful.

With the exception of Mutual Information, each of these have the form, of examining the contingency table, and weighting the contribution of a correct prediction (or error) based on its true or hypothesized class membership.

This allows us to generalize (many of) these measures into something like

Measure = \frac{1}{Z(A)}\sum_i \sum_j w(i,j) A_{ij}

where w(i,j) is a function that determines how much a classification of a point of class $i$ as class $j$ contributes to the measure, and Z(A) is a normalizing function.

Essentially we're trying to determine how important it is for each point to be classified correctly, on the basis of its class membership, and how damaging an incorrect classification.

Information theory can give us a values for how important each point is. It seems as though w(i,j) := $\delta(i = j)*(-p_i \log p_i)$ would be a valuable evaluation measure.  Where $\delta(i = j)$ is a Kronecker delta function, which is 1 if $i = j$ and 0 otherwise. This says that correct classifications contribute the amount of information in its true class to the measure function, while incorrect classifications contribute nothing.  This instantiation of an accuracy-like measure weights correct classifications by the amount of information they transmit to the user.

I don't know of anyone who has done this, but it seems like someone should have.  I'll keep looking in the literature until I find something, or I give up and write a little paper on this.