Domains
A domain defines a particular machine learning task. Each domain is equipped with the following:
- a particular dataset format,
- a particular program interface, and
- standard evaluation metrics.
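There are two main types of domains:
- Supervised learning (e.g., classification, regression): for these domains, a run consists of two phases: learn and predict. Datasets must be split into two shards (train and test).
- Performing (e.g., clustering, optimization): for these domains, there is only one phase: perform. Datasets contain only one shard (raw).
Two of the domains listed below (OnlineLearningMulticlass and BanditMulticlass) have a third type, interactive-learning, in which the program exchanges predictions and feedback through STDOUT and STDIN.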
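The two phases of a supervised-learning run can be sketched as a pair of functions. This is only an illustration: the names `learn` and `predict` as Python functions are hypothetical, the actual program interface is not specified in this section, and the sketch assumes a dataset in which the first token of each line is the output (as in the classification domains below). The "model" is a majority-class baseline.

```python
from collections import Counter


def learn(train_lines):
    """Learn phase: remember the most frequent output in the training shard."""
    labels = [line.split()[0] for line in train_lines if line.strip()]
    return Counter(labels).most_common(1)[0][0]


def predict(model, test_lines):
    """Predict phase: emit one prediction per test example."""
    return [model for line in test_lines if line.strip()]
```

A performing domain would have a single function in place of these two, reading the raw shard and writing its output directly.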
BinaryClassification
Type: supervised-learning
Task description: The goal of this task is to learn how to classify data points represented as real vectors into one of two classes (positive or negative).
Dataset format:
output featureIndex:featureValue ... featureIndex:featureValue
where featureIndex is a positive integer, featureValue is a real number, and output ∈ {-1, +1}. The feature indices must be sorted in increasing order. For the test file, output is 0. The predictions file contains a line for each test example:
predicted-output
Click here to see a sample dataset.
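A minimal sketch of reading one line of this format in Python. The function name `parse_example` is hypothetical and not part of any provided interface. It assumes whitespace-separated tokens and treats "sorted in increasing order" as strictly increasing (no repeated indices).

```python
def parse_example(line):
    """Parse 'output index:value ... index:value' into (output, {index: value})."""
    tokens = line.split()
    output = int(tokens[0])  # -1 or +1 in training data, 0 in the test file
    features = {}
    previous_index = 0
    for token in tokens[1:]:
        index_text, value_text = token.split(":", 1)
        index = int(index_text)
        # Indices must be positive and (by assumption) strictly increasing.
        if index <= previous_index:
            raise ValueError(f"feature indices not in increasing order: {line!r}")
        features[index] = float(value_text)
        previous_index = index
    return output, features
```

The same parser applies to the MulticlassClassification format below, where the output is an integer class; for Regression the output would be read with `float` instead of `int`.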
MulticlassClassification
Type: supervised-learning
Task description: The goal of this task is to learn how to classify data points represented as real vectors into one of K classes.
Dataset format:
output featureIndex:featureValue ... featureIndex:featureValue
where featureIndex is a positive integer, featureValue is a real number, and output ∈ {1, 2, ..., K}. The feature indices must be sorted in increasing order. For the test file, output is 0. The predictions file contains a line for each test example:
predicted-output
Click here to see a sample dataset.
Regression
Type: supervised-learning
Task description: The goal of this task is to learn how to predict a real value Y given an input vector X.
Dataset format:
output featureIndex:featureValue ... featureIndex:featureValue
where featureIndex is a positive integer, featureValue is a real number, and output is a real number. The feature indices must be sorted in increasing order. For the test file, output is 0. The predictions file contains a line for each test example:
predicted-output
Click here to see a sample dataset.
SequenceTagging
Type: supervised-learning
Task description: In this task, the input is a sequence (e.g., a sentence) and the output is a tag label for each position of the sequence (e.g., part-of-speech tags for each word in the sentence). The key part of this problem is that there are dependencies between the various labels. This is the canonical structured prediction task.
Dataset format:
input ... input output
where the input and output are strings. For example, for named-entity recognition:
France NNP B-LOC
If labels include B-X and I-X, then this is treated as a segmentation task where a sequence of B-X I-X ... I-X denotes a segment labeled as X. For these tasks (e.g., named-entity recognition), F1 is an appropriate evaluation metric. For the test file, output is "-". The predictions file contains lines parallel to the input, like the following:
predicted-output
Click here to see a sample dataset.
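The B-X / I-X convention and the F1 metric mentioned above can be illustrated with a short sketch. The function names are hypothetical. Two assumptions are made that the text does not state: any tag not starting with B- or I- (for example "O") is treated as outside every segment, and a stray I-X that does not continue a segment with the same label starts a new one. The exact metric used by the domain may differ.

```python
def extract_segments(tags):
    """Return the set of (label, start, end) segments encoded by B-X / I-X tags.

    `end` is exclusive. Tags that start with neither B- nor I- are outside.
    """
    segments = set()
    label, start = None, None
    # A trailing outside tag acts as a sentinel that closes the last segment.
    for position, tag in enumerate(list(tags) + ["O"]):
        continues = tag.startswith("I-") and label == tag[2:]
        if label is not None and not continues:
            segments.add((label, start, position))
            label = None
        if tag.startswith("B-") or (tag.startswith("I-") and not continues):
            label, start = tag[2:], position
    return segments


def segment_f1(true_tags, predicted_tags):
    """F1 over exactly matching segments for one sequence."""
    true_segments = extract_segments(true_tags)
    predicted_segments = extract_segments(predicted_tags)
    correct = len(true_segments & predicted_segments)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted_segments)
    recall = correct / len(true_segments)
    return 2 * precision * recall / (precision + recall)
```

Scoring whole segments rather than individual positions means a prediction earns credit only when both the boundaries and the label of a segment match.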
CollaborativeFiltering
Type: supervised-learning
Task description: Given some entries of a matrix (e.g., where rows are users and columns are movies and each entry is a numeric rating), predict other entries.
Dataset format:
row-index column-index value
where row-index and column-index are positive integers and value is a real number. For the test file, value is 0. The predictions file contains a line for each test example:
predicted-value
Click here to see a sample dataset.
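As a rough illustration of how the train and test files relate, the sketch below parses the triples and predicts each test entry with a simple row-mean baseline. Both function names are hypothetical, and the baseline is only a placeholder for a real collaborative filtering method.

```python
def parse_entries(lines):
    """Parse 'row-index column-index value' lines into (row, column, value) tuples."""
    entries = []
    for line in lines:
        if not line.strip():
            continue
        row_text, column_text, value_text = line.split()
        entries.append((int(row_text), int(column_text), float(value_text)))
    return entries


def predict_with_row_means(train_entries, test_entries):
    """Predict each test entry as its row's mean training value.

    Rows that never appear in training fall back to the global mean.
    """
    totals, counts = {}, {}
    for row, _, value in train_entries:
        totals[row] = totals.get(row, 0.0) + value
        counts[row] = counts.get(row, 0) + 1
    global_mean = sum(totals.values()) / max(sum(counts.values()), 1)
    return [totals[row] / counts[row] if row in counts else global_mean
            for row, _, _ in test_entries]
```

The returned list has one value per test example, in the same order, which matches the one-line-per-example layout of the predictions file.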
DocumentClassification
Type: supervised-learning
Task description: The goal of this task is to learn how to classify text documents as one of K classes.
Dataset format:
label1/doc1
label1/doc2
label2/doc3
...
At test time, your program will be passed unlabeled examples, which are arranged in a datashard like this:
unlabeled/doc1
unlabeled/doc2
unlabeled/doc3
...
Your predictions should be written to a directory with the following structure, where each file can be empty (only the file name matters):
label1/doc1
label1/doc2
label2/doc3
...
Click here to see a sample dataset.
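Because predictions are expressed as a directory layout rather than a file, a small sketch may help. The function names and the `predictions` dictionary (document name to predicted label) are hypothetical, and the sketch assumes the datashard and prediction locations are ordinary directories on disk.

```python
import os
import tempfile


def list_unlabeled_docs(datashard_dir):
    """Return the document names found under datashard_dir/unlabeled, sorted."""
    return sorted(os.listdir(os.path.join(datashard_dir, "unlabeled")))


def write_predictions(predictions, output_dir):
    """Create output_dir/<label>/<doc> as an empty file for each prediction."""
    for doc_name, label in predictions.items():
        label_dir = os.path.join(output_dir, label)
        os.makedirs(label_dir, exist_ok=True)
        # Only the file name matters, so the file is left empty.
        open(os.path.join(label_dir, doc_name), "w").close()


# Demo in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    write_predictions({"doc1": "label1", "doc3": "label2"}, tmp)
    print(sorted(os.listdir(tmp)))  # ['label1', 'label2']
```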
WordSegmentation
Type: performing
Task description: This is an unsupervised learning task where we are given an unsegmented sequence of characters (phonemes) as input and the goal is to determine the word boundaries and output the words.
Dataset format:
thisisatest
Example output:
this is a test
Click here to see a sample dataset.
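One property follows directly from the example above: removing the spaces from the output should give back the input. The sketch below checks that and converts a segmentation into boundary offsets, which is one possible basis for scoring. Both function names are hypothetical, and the domain's actual evaluation metric is not described here.

```python
def is_valid_segmentation(raw_line, segmented_line):
    """True if removing the spaces from the output recovers the input exactly."""
    return "".join(segmented_line.split()) == raw_line.strip()


def boundary_positions(segmented_line):
    """Character offsets in the unsegmented input where a word boundary was placed."""
    positions, offset = [], 0
    for word in segmented_line.split()[:-1]:  # no boundary after the last word
        offset += len(word)
        positions.append(offset)
    return positions
```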
ConstituencyParsingTest
Type: performing
Task description: The goal of this task is labeled constituency parsing with integrated part-of-speech tagging.
Dataset format:
Click here to see a sample dataset.
DependencyParsingTest
Type: performing
Task description: The goal of this task is labeled dependency parsing with integrated part-of-speech tagging.
Dataset format:
Click here to see a sample dataset.
OnlineLearningMulticlass
Type: interactive-learning
Task description: The goal of this task is to learn how to classify data points represented as real vectors into one of K classes.
Dataset format:
featureIndex:featureValue ... featureIndex:featureValue
where featureIndex is a positive integer and featureValue is a real number. The feature indices must be sorted in increasing order. The program should output its prediction to STDOUT:
predicted-output
where predicted-output ∈ {1, 2, ..., K}. The program does not know that there are K classes and must find out through experience. Once the prediction is received, the correct label is presented through STDIN:
correct-label
where correct-label ∈ {1, 2, ..., K}.
Click here to see a sample dataset.
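The predict-then-feedback exchange can be sketched as a loop over two streams. This sketch assumes that each example's feature line also arrives on STDIN, immediately before the program answers (the text above only states that the correct label does). The function name `run_online` is hypothetical, and the policy is a trivial baseline that ignores the features.

```python
import io
from collections import Counter


def run_online(stdin, stdout):
    """Alternate between predicting and receiving the correct label until input ends."""
    seen = Counter()  # labels observed so far; K is discovered through feedback
    while True:
        example = stdin.readline()
        if not example:
            break  # end of stream
        # Baseline: predict the most frequent label so far (1 before any feedback).
        prediction = seen.most_common(1)[0][0] if seen else 1
        stdout.write(f"{prediction}\n")
        stdout.flush()  # the other side is waiting for this line
        correct = stdin.readline().strip()
        if not correct:
            break
        seen[int(correct)] += 1


# Simulated session: feature lines interleaved with the labels sent back.
session = io.StringIO("1:0.5 4:1\n2\n2:1.5\n2\n1:1 3:2\n3\n")
out = io.StringIO()
run_online(session, out)
print(out.getvalue().split())  # ['1', '2', '2']
```

In a real program the call would be `run_online(sys.stdin, sys.stdout)`. Flushing after every prediction matters because both sides block while waiting for each other.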
BanditMulticlass
Type: interactive-learning
Task description: The goal of this task is to learn how to classify data points represented as real vectors into one of K classes. It is similar to OnlineLearningMulticlass, except that the oracle only tells you whether your prediction was correct or not.
Dataset format:
featureIndex:featureValue ... featureIndex:featureValue
where featureIndex is a positive integer and featureValue is a real number. The feature indices must be sorted in increasing order. The program should output its prediction to STDOUT:
predicted-output
where predicted-output ∈ {1, 2, ..., K}. Unlike in OnlineLearningMulticlass, the program knows that there are K classes (K is passed as the first argument to ./run). Once the prediction is received, either "yes" or "no" is presented through STDIN, depending on whether the prediction was correct:
oracle-answer
where oracle-answer ∈ {yes, no}.
Click here to see a sample dataset.
SemiSupervisedMulticlass
Type: supervised-learning
Task description: The goal of this task is to learn a multiclass classifier from a dataset that contains both labeled and unlabeled instances, using both kinds of instances.
Dataset format:
label1 feature:value feature:value...
label2 feature:value feature:value...
label3 feature:value feature:value...
...
Test data consists of:
label1 feature:value feature:value...
label2 feature:value feature:value...
label3 feature:value feature:value...
...
Predictions should be:
label1
label2
label3
...
Click here to see a sample dataset.
Creating a New Domain
If you have a task that does not fall into one of the standard categories, you can create a new domain. Follow the instructions below:
- Decide whether your domain type is supervised-learning (involves separate train and test phases) or performing (only one phase).
- Decide on the dataset format (input) and the format for the output of a program that operates in that domain.
- Create a sample dataset (for example, like the one for document classification). This dataset should be small but not degenerate: just complicated enough to demonstrate the characteristics of the format and domain.
- Create the helper program for this domain.
The program should support the following operations:
- inspect datashardPath: checks that the given datashard conforms to the format and, if so, extracts summary statistics and writes them in YAML format to a status file.
- split rawDatashardPath trainDatashardPath testDatashardPath: reads in the raw datashard, splits the examples into training and test sets, and writes each set to the corresponding file. This operation is only for supervised-learning domains.
- stripLabels inDatashardPath outDatashardPath: reads in a datashard from inDatashardPath, removes the labels (e.g., replacing -1 and +1 with 0), and writes the result to outDatashardPath.
- evaluate datashardPath predictionPath: reads in the true outputs from datashardPath and a program's predicted outputs from predictionPath, and computes any error metrics suitable for that domain. Results should be written to the status file.
- Write a configuration file, which should be a YAML file (for example, like the configuration file for document classification).
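The four helper operations described above can be sketched for a line-based format such as the BinaryClassification one. Everything specific here is an assumption made for illustration: the function names, the explicit `status_path` parameter (how the status file is located is not stated), the YAML keys `numExamples`, `maxFeatureIndex`, and `errorRate`, the 70/30 split, and error rate as the metric. Command-line dispatch is left out.

```python
import random


def inspect_shard(datashard_path, status_path):
    """inspect: validate the format, then write summary statistics as YAML."""
    num_examples, max_index = 0, 0
    with open(datashard_path) as shard:
        for line in shard:
            tokens = line.split()
            if not tokens:
                continue
            int(tokens[0])  # raises ValueError on a malformed output
            indices = []
            for token in tokens[1:]:
                index_text, value_text = token.split(":", 1)
                float(value_text)  # raises ValueError on a malformed value
                indices.append(int(index_text))
            if indices != sorted(set(indices)) or (indices and indices[0] < 1):
                raise ValueError(f"bad feature indices: {line!r}")
            num_examples += 1
            max_index = max([max_index] + indices)
    with open(status_path, "w") as status:
        status.write(f"numExamples: {num_examples}\n")
        status.write(f"maxFeatureIndex: {max_index}\n")


def split_shard(raw_path, train_path, test_path, train_fraction=0.7, seed=0):
    """split: shuffle the raw examples and write train and test shards."""
    with open(raw_path) as raw:
        lines = [line.rstrip("\n") + "\n" for line in raw if line.strip()]
    random.Random(seed).shuffle(lines)
    cut = int(len(lines) * train_fraction)
    with open(train_path, "w") as train:
        train.writelines(lines[:cut])
    with open(test_path, "w") as test:
        test.writelines(lines[cut:])


def strip_labels(in_path, out_path):
    """stripLabels: replace each example's output with 0."""
    with open(in_path) as source, open(out_path, "w") as destination:
        for line in source:
            tokens = line.split()
            if tokens:
                destination.write(" ".join(["0"] + tokens[1:]) + "\n")


def evaluate(datashard_path, prediction_path, status_path):
    """evaluate: compare true and predicted outputs and write the error rate."""
    with open(datashard_path) as shard:
        truth = [int(line.split()[0]) for line in shard if line.strip()]
    with open(prediction_path) as predictions:
        predicted = [int(line.split()[0]) for line in predictions if line.strip()]
    if len(truth) != len(predicted):
        raise ValueError("prediction count does not match example count")
    errors = sum(t != p for t, p in zip(truth, predicted))
    with open(status_path, "w") as status:
        status.write(f"errorRate: {errors / max(len(truth), 1)}\n")
```

A fixed seed keeps the split reproducible, so that repeated runs on the same raw datashard compare programs on the same test examples.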