The challenge is over, but a new challenge using the same datasets is ongoing; check it out!
The results are evaluated according to the following performance measures. The validation set is used for ranking during the development period. The test set will be used for the final ranking.
The results for a classifier can be represented in a confusion matrix, where a, b, c, and d represent the number of examples falling into each possible outcome:
|       |          | Prediction |          |
|-------|----------|------------|----------|
|       |          | Class -1   | Class +1 |
| Truth | Class -1 | a          | b        |
|       | Class +1 | c          | d        |
The balanced error rate is the average of the errors on each class: BER = 0.5*(b/(a+b) + c/(c+d)). During the development period, the ranking is performed according to the validation BER.
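As an illustration only (not the official scoring code), the BER can be computed from the four confusion-matrix counts defined above; the function and variable names here are just for this sketch:

```python
def confusion_counts(truth, predictions):
    """Confusion-matrix counts for labels in {-1, +1}.

    Returns (a, b, c, d) as in the table above:
    a = true -1 predicted -1, b = true -1 predicted +1,
    c = true +1 predicted -1, d = true +1 predicted +1.
    """
    a = sum(1 for t, p in zip(truth, predictions) if t == -1 and p == -1)
    b = sum(1 for t, p in zip(truth, predictions) if t == -1 and p == +1)
    c = sum(1 for t, p in zip(truth, predictions) if t == +1 and p == -1)
    d = sum(1 for t, p in zip(truth, predictions) if t == +1 and p == +1)
    return a, b, c, d


def balanced_error_rate(a, b, c, d):
    """BER = 0.5 * (b/(a+b) + c/(c+d)): average of the per-class error rates."""
    return 0.5 * (b / (a + b) + c / (c + d))
```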
The area under curve (AUC) is defined as the area under the ROC curve. This is equivalent to the area under the curve obtained by plotting a/(a+b) against d/(c+d) for each confidence value, starting at (0,1) and ending at (1,0); the area under this curve is computed with the trapezoid method. When no confidence values are supplied for the classification, the curve reduces to the three points {(0,1), (d/(c+d), a/(a+b)), (1,0)}, and AUC = 1 - BER.
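The following sketch traces that curve by sweeping a threshold over the confidence values and applies the trapezoid rule. It assumes (this is not stated above) that a higher confidence value means a more confident prediction of class +1:

```python
def auc_from_confidences(truth, confidences):
    """Trapezoid-rule area under the curve of a/(a+b) (y) versus d/(c+d) (x)."""
    n_pos = sum(1 for t in truth if t == +1)
    n_neg = sum(1 for t in truth if t == -1)
    # Sweep thresholds from the largest confidence down to the smallest,
    # tracing the curve from (0,1) (everything predicted -1) to (1,0)
    # (everything predicted +1).
    points = [(0.0, 1.0)]
    for thr in sorted(set(confidences), reverse=True):
        d = sum(1 for t, s in zip(truth, confidences) if s >= thr and t == +1)
        a = sum(1 for t, s in zip(truth, confidences) if s < thr and t == -1)
        points.append((d / n_pos, a / n_neg))
    points.append((1.0, 0.0))
    # Trapezoid rule over consecutive points.
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += 0.5 * (y0 + y1) * (x1 - x0)
    return area
```

Note that for the three-point curve of the no-confidence case, the trapezoid rule gives exactly 0.5 * (a/(a+b) + d/(c+d)) = 1 - BER, consistent with the statement above.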
The BER guess error (deltaBER) is the absolute value of the difference between the BER you obtained on the test set (testBER) and the BER you predicted (predictedBER). The predicted BER is the value supplied in the .guess file.
deltaBER = abs(predictedBER - testBER)
The final ranking is based on the "test score" computed from the test set balanced error rate (testBER) and the "BER guess error" (deltaBER), both of which should be made as low as possible. The test score is computed according to the formula:
E = testBER + deltaBER * (1 - exp(-deltaBER/sigma))
where sigma is the standard error on testBER. See the FAQ for details.
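A direct transcription of this formula in Python is shown below; sigma (the standard error on testBER, whose computation is described in the FAQ) is taken as an input here, not computed:

```python
import math

def test_score(test_ber, predicted_ber, sigma):
    """E = testBER + deltaBER * (1 - exp(-deltaBER / sigma))."""
    delta_ber = abs(predicted_ber - test_ber)  # BER guess error
    return test_ber + delta_ber * (1.0 - math.exp(-delta_ber / sigma))
```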