Evaluation metrics for multiclass classification

“In machine learning, multiclass or multinomial classification is the problem of classifying instances into one of three or more classes (classifying instances into one of two classes is called binary classification).” (Ref: Wikipedia)

As an evaluation scheme for multiclass classification, you can compute Micro- and Macro-averaged metric scores. For example, precision scores can be computed as follows.
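Written out (using C for the set of classes and TP_c, FP_c for the per-class counts; this notation is introduced here for compactness and is not from the original post), the two averaging schemes are:

```latex
\text{Macro-precision} = \frac{1}{|C|} \sum_{c \in C} \frac{TP_c}{TP_c + FP_c},
\qquad
\text{Micro-precision} = \frac{\sum_{c \in C} TP_c}{\sum_{c \in C} \left( TP_c + FP_c \right)}
```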

Suppose the result of the multiclass classification is the following:

Class | True Positive (TP) | False Positive (FP)
A     | 1                  | 1
B     | 10                 | 90
C     | 1                  | 1
D     | 1                  | 1
Example 1: The classifier performed poorly on class B, where the majority of the instances are.

The Macro-precision is simply the average of the per-class precisions, where each class's precision is TP / (TP + FP) (for class B, 10 / (10 + 90) = 0.1). That is, (0.5 + 0.1 + 0.5 + 0.5) / 4 = 0.4.

The Micro-precision computes the overall performance by pooling the counts across classes, i.e. sum(TP) / (sum(TP) + sum(FP)). That is, 13 / 106 ≈ 0.123.
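As a minimal sketch (plain Python, with the counts hard-coded from the table above), both averages for Example 1 can be computed directly:

```python
# TP/FP counts from Example 1
tp = {"A": 1, "B": 10, "C": 1, "D": 1}
fp = {"A": 1, "B": 90, "C": 1, "D": 1}

# Macro: average the per-class precisions
macro = sum(tp[c] / (tp[c] + fp[c]) for c in tp) / len(tp)

# Micro: pool all counts, then compute a single precision
micro = sum(tp.values()) / (sum(tp.values()) + sum(fp.values()))

print(f"macro-precision: {macro:.3f}")  # 0.400
print(f"micro-precision: {micro:.3f}")  # 0.123
```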

In this example, Macro-precision > Micro-precision. This is because the classifier performed well on the minority classes (A, C, D) but poorly on the majority class (B). Macro-precision does not consider the number of instances in each class, so even though the classifier performed poorly on most of the instances, the Macro-level score (0.4) is not so bad. Micro-precision, however, does take the number of instances into account: since the classifier performed poorly on the class holding the majority of the instances, the Micro-precision score is low (0.123).
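As a cross-check against a standard library, the same numbers can be reproduced with scikit-learn's precision_score. The label vectors below are a synthetic reconstruction of Example 1's counts; the true labels assigned to the false positives are arbitrary, since precision only depends on the predicted label:

```python
from sklearn.metrics import precision_score

# Reconstruct label vectors that realize Example 1's counts.
tp = {"A": 1, "B": 10, "C": 1, "D": 1}
fp = {"A": 1, "B": 90, "C": 1, "D": 1}
classes = list(tp)

y_true, y_pred = [], []
for c in classes:
    # True positives: instances of class c predicted as c.
    y_true += [c] * tp[c]
    y_pred += [c] * tp[c]
    # False positives: instances predicted as c whose true label is
    # some other class (which one is irrelevant for precision).
    other = next(k for k in classes if k != c)
    y_true += [other] * fp[c]
    y_pred += [c] * fp[c]

print(precision_score(y_true, y_pred, average="macro"))  # 0.4
print(precision_score(y_true, y_pred, average="micro"))  # ~0.123
```

Here average="macro" and average="micro" correspond exactly to the two averaging schemes described above.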


Take a look at another example. Suppose the result of the multiclass classification is the following:

Class | True Positive (TP) | False Positive (FP)
A     | 1                  | 1
B     | 90                 | 10
C     | 1                  | 1
D     | 1                  | 1
Example 2: The classifier performed well on class B, where the majority of the instances are.

The Macro-precision is again the average of the per-class precisions. That is, (0.5 + 0.9 + 0.5 + 0.5) / 4 = 0.6.

The Micro-precision again computes the overall performance from the pooled counts, i.e. sum(TP) / (sum(TP) + sum(FP)). That is, 93 / 106 ≈ 0.877.
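Plugging Example 2's counts into the same sketch as before:

```python
# TP/FP counts from Example 2
tp = {"A": 1, "B": 90, "C": 1, "D": 1}
fp = {"A": 1, "B": 10, "C": 1, "D": 1}

macro = sum(tp[c] / (tp[c] + fp[c]) for c in tp) / len(tp)
micro = sum(tp.values()) / (sum(tp.values()) + sum(fp.values()))

print(f"macro-precision: {macro:.3f}")  # 0.600
print(f"micro-precision: {micro:.3f}")  # 0.877
```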

In this example, Macro-precision < Micro-precision. This is because the classifier performed well on the majority class (B), which dominates the pooled counts that Micro-precision is computed from, while Macro-precision is pulled down toward the 0.5 precision of the three small classes.

Reference

https://datascience.stackexchange.com/a/24051