The challenge is over, but a new challenge using the same datasets is ongoing; check it out!
The aim of the performance prediction challenge is to find methods that predict how accurately a given predictive model will perform on test data, on ALL five benchmark datasets. To make it easy to enter results for all five datasets, every task is a two-class classification problem. You can download the datasets from the table below:
Dataset | Size | Type | Features | Training Examples | Validation Examples | Test Examples |
---|---|---|---|---|---|---|
ADA | 0.6 MB | Dense | 48 | 4147 | 415 | 41471 |
GINA | 19.4 MB | Dense | 970 | 3153 | 315 | 31532 |
HIVA | 7.6 MB | Dense | 1617 | 3845 | 384 | 38449 |
NOVA | 2.3 MB | Sparse binary | 16969 | 1754 | 175 | 17537 |
SYLVA | 15.6 MB | Dense | 216 | 13086 | 1308 | 130858 |
At the start of the challenge, participants had access only to labeled training data and unlabeled validation and test data. Submissions were evaluated on validation data only. The validation labels were made available one month before the end of the challenge: *** DOWNLOAD THE VALIDATION SET LABELS ***. The final ranking will be based on test data results, to be revealed only when the challenge is over.
All the datasets share the same format and include 5 files in ASCII format:
The matrix data formats used are (in all cases, each line represents a pattern):
If you are a Matlab user, you can download sample code to read and check the data.
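For non-Matlab users, the matrix formats described above can be read with a few lines of code. The sketch below is a minimal illustration under two assumptions not spelled out here: dense files hold whitespace-separated feature values, and sparse binary files list the 1-based indices of the nonzero features, with one pattern per line in both cases.

```python
from io import StringIO

def load_dense(f):
    """Read a dense matrix: one pattern per line, whitespace-separated values."""
    return [[float(v) for v in line.split()] for line in f if line.strip()]

def load_sparse_binary(f, n_features):
    """Read a sparse binary matrix: each line lists the 1-based indices
    of the features that are 1 for that pattern (assumed format)."""
    X = []
    for line in f:
        if not line.strip():
            continue
        row = [0] * n_features
        for idx in line.split():
            row[int(idx) - 1] = 1  # convert 1-based index to 0-based position
        X.append(row)
    return X

# Toy stand-ins for a dense file (e.g. ADA) and a sparse binary file (e.g. NOVA).
dense = load_dense(StringIO("1 0 2.5\n0 1 3.0\n"))
sparse = load_sparse_binary(StringIO("1 3\n2\n"), n_features=4)
print(dense)   # [[1.0, 0.0, 2.5], [0.0, 1.0, 3.0]]
print(sparse)  # [[1, 0, 1, 0], [0, 1, 0, 0]]
```

In practice you would pass open file handles instead of `StringIO`, and for the larger datasets a sparse matrix structure (rather than dense row lists) keeps memory use manageable.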