TPOT on the command line

To use TPOT via the command line, enter the following command with a path to the data file:

tpot /path_to/data_file.csv

TPOT offers several arguments that can be provided at the command line:

-is INPUT_SEPARATOR
    Valid values: any string
    Character used to separate columns in the input file.

-target TARGET_NAME
    Valid values: any string
    Name of the target column in the input file.

-mode TPOT_MODE
    Valid values: 'classification' or 'regression'
    Whether TPOT is being used for a supervised classification or regression problem.

-o OUTPUT_FILE
    Valid values: string path to a file
    File to export the code for the final optimized pipeline.

-g GENERATIONS
    Valid values: any positive integer
    Number of iterations to run the pipeline optimization process. Generally, TPOT will work better when you give it more generations (and therefore time) to optimize the pipeline. TPOT will evaluate POPULATION_SIZE + GENERATIONS x OFFSPRING_SIZE pipelines in total.

-p POPULATION_SIZE
    Valid values: any positive integer
    Number of individuals to retain in the GP population every generation. Generally, TPOT will work better when you give it more individuals (and therefore time) to optimize the pipeline. TPOT will evaluate POPULATION_SIZE + GENERATIONS x OFFSPRING_SIZE pipelines in total.

-os OFFSPRING_SIZE
    Valid values: any positive integer
    Number of offspring to produce in each GP generation. By default, OFFSPRING_SIZE = POPULATION_SIZE.

-mr MUTATION_RATE
    Valid values: [0.0, 1.0]
    GP mutation rate in the range [0.0, 1.0]. This tells the GP algorithm how many pipelines to apply random changes to every generation. We recommend using the default parameter unless you understand how the mutation rate affects GP algorithms.

-xr CROSSOVER_RATE
    Valid values: [0.0, 1.0]
    GP crossover rate in the range [0.0, 1.0]. This tells the GP algorithm how many pipelines to "breed" every generation. We recommend using the default parameter unless you understand how the crossover rate affects GP algorithms.

-scoring SCORING_FN
    Valid values: 'accuracy', 'adjusted_rand_score', 'average_precision', 'balanced_accuracy', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'log_loss', 'mean_absolute_error', 'mean_squared_error', 'median_absolute_error', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'r2', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'roc_auc'
    Function used to evaluate the quality of a given pipeline for the problem. By default, accuracy is used for classification and mean squared error (MSE) is used for regression. TPOT assumes that any function with "error" or "loss" in its name is meant to be minimized, whereas all other functions will be maximized. See the section on scoring functions for more details.

-cv NUM_CV_FOLDS
    Valid values: any integer > 1
    Number of folds to evaluate each pipeline over in k-fold cross-validation during the TPOT optimization process.

-njobs NUM_JOBS
    Valid values: any positive integer or -1
    Number of CPUs for evaluating pipelines in parallel during the TPOT optimization process. Assigning this to -1 will use as many cores as are available on the computer.

-maxtime MAX_TIME_MINS
    Valid values: any positive integer
    How many minutes TPOT has to optimize the pipeline. If provided, this setting will override the GENERATIONS parameter and allow TPOT to run until it runs out of time.

-maxeval MAX_EVAL_MINS
    Valid values: any positive integer
    How many minutes TPOT has to evaluate a single pipeline. Setting this parameter to higher values will allow TPOT to explore more complex pipelines, but will also allow TPOT to run longer.

-s RANDOM_STATE
    Valid values: any positive integer
    Random number generator seed for reproducibility. Set this seed if you want your TPOT run to be reproducible with the same seed and data set in the future.

-config CONFIG_FILE
    Valid values: string path to a file
    Configuration file for customizing the operators and parameters that TPOT uses in the optimization process. See the custom configuration section for more information and examples.

-v VERBOSITY
    Valid values: {0, 1, 2, 3}
    How much information TPOT communicates while it is running. 0 = none, 1 = minimal, 2 = high, 3 = all. A setting of 2 or higher will add a progress bar during the optimization procedure.

--no-update-check
    Flag indicating whether the TPOT version checker should be disabled.

--version
    Show TPOT's version number and exit.

--help
    Show TPOT's help documentation and exit.

An example command-line call to TPOT may look like:

tpot data/mnist.csv -is , -target class -o tpot_exported_pipeline.py -g 5 -p 20 -cv 5 -s 42 -v 2
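
With these settings (-g 5 -p 20, and OFFSPRING_SIZE defaulting to POPULATION_SIZE), TPOT will evaluate 20 + 5 x 20 = 120 pipelines in total.

A regression run looks the same apart from -mode; as a sketch, with data/boston.csv and the medv target column as hypothetical placeholders for your own data:

tpot data/boston.csv -is , -target medv -mode regression -o tpot_regression_pipeline.py -g 5 -p 20 -cv 5 -s 42 -v 2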

TPOT with code

We've taken care to design the TPOT interface to be as similar as possible to scikit-learn.

TPOT can be imported just like any regular Python module. To import TPOT, type:

from tpot import TPOTClassifier

then create an instance of TPOT as follows:

from tpot import TPOTClassifier

pipeline_optimizer = TPOTClassifier()

It's also possible to use TPOT for regression problems with the TPOTRegressor class. Other than the class name, a TPOTRegressor is used the same way as a TPOTClassifier.
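
For instance, a regressor is created the same way (a minimal sketch):

from tpot import TPOTRegressor

pipeline_optimizer = TPOTRegressor()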

Note that you can pass several parameters to the TPOT instantiation call:

generations
    Valid values: any positive integer
    Number of iterations to run the pipeline optimization process. Generally, TPOT will work better when you give it more generations (and therefore time) to optimize the pipeline. TPOT will evaluate POPULATION_SIZE + GENERATIONS x OFFSPRING_SIZE pipelines in total.

population_size
    Valid values: any positive integer
    Number of individuals to retain in the GP population every generation. Generally, TPOT will work better when you give it more individuals (and therefore time) to optimize the pipeline. TPOT will evaluate POPULATION_SIZE + GENERATIONS x OFFSPRING_SIZE pipelines in total.

offspring_size
    Valid values: any positive integer
    Number of offspring to produce in each GP generation. By default, offspring_size = population_size.

mutation_rate
    Valid values: [0.0, 1.0]
    Mutation rate for the genetic programming algorithm in the range [0.0, 1.0]. This parameter tells the GP algorithm how many pipelines to apply random changes to every generation. We recommend using the default parameter unless you understand how the mutation rate affects GP algorithms.

crossover_rate
    Valid values: [0.0, 1.0]
    Crossover rate for the genetic programming algorithm in the range [0.0, 1.0]. This parameter tells the GP algorithm how many pipelines to "breed" every generation. We recommend using the default parameter unless you understand how the crossover rate affects GP algorithms.

scoring
    Valid values: 'accuracy', 'adjusted_rand_score', 'average_precision', 'balanced_accuracy', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weighted', 'log_loss', 'mean_absolute_error', 'mean_squared_error', 'median_absolute_error', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'r2', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'roc_auc', or a callable function with signature scorer(y_true, y_pred)
    Function used to evaluate the quality of a given pipeline for the problem. By default, accuracy is used for classification and mean squared error (MSE) is used for regression. TPOT assumes that any function with "error" or "loss" in its name is meant to be minimized, whereas all other functions will be maximized. See the section on scoring functions for more details.

cv
    Valid values: any integer > 1
    Number of folds to evaluate each pipeline over in k-fold cross-validation during the TPOT optimization process.

n_jobs
    Valid values: any positive integer or -1
    Number of CPUs for evaluating pipelines in parallel during the TPOT optimization process. Assigning this to -1 will use as many cores as are available on the computer.

max_time_mins
    Valid values: any positive integer
    How many minutes TPOT has to optimize the pipeline. If provided, this setting will override the generations parameter and allow TPOT to run until it runs out of time.

max_eval_time_mins
    Valid values: any positive integer
    How many minutes TPOT has to evaluate a single pipeline. Setting this parameter to higher values will allow TPOT to explore more complex pipelines, but will also allow TPOT to run longer.

random_state
    Valid values: any positive integer
    Random number generator seed for TPOT. Use this to make sure that TPOT will give you the same results each time you run it against the same data set with that seed.

config_dict
    Valid values: Python dictionary
    Configuration dictionary for customizing the operators and parameters that TPOT uses in the optimization process. See the custom configuration section for more information and examples.

warm_start
    Valid values: [True, False]
    Flag indicating whether the TPOT instance will reuse the population from previous calls to fit().

verbosity
    Valid values: {0, 1, 2, 3}
    How much information TPOT communicates while it's running. 0 = none, 1 = minimal, 2 = high, 3 = all. A setting of 2 or higher will add a progress bar during the optimization procedure.

disable_update_check
    Valid values: [True, False]
    Flag indicating whether the TPOT version checker should be disabled.

Some example code with custom TPOT parameters might look like:

from tpot import TPOTClassifier

pipeline_optimizer = TPOTClassifier(generations=5, population_size=20, cv=5,
                                    random_state=42, verbosity=2)
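
Note also the warm_start parameter from the list above: when set to True, later calls to fit() (introduced below) continue from the population of the previous call rather than starting over. A minimal sketch, assuming training_features and training_classes are already defined:

from tpot import TPOTClassifier

pipeline_optimizer = TPOTClassifier(generations=5, population_size=20,
                                    warm_start=True, verbosity=2)
pipeline_optimizer.fit(training_features, training_classes)

# A second fit() call continues optimizing from the previous population
pipeline_optimizer.fit(training_features, training_classes)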

Now TPOT is ready to optimize a pipeline for you. You can tell TPOT to optimize a pipeline based on a data set with the fit function:

from tpot import TPOTClassifier

pipeline_optimizer = TPOTClassifier(generations=5, population_size=20, cv=5,
                                    random_state=42, verbosity=2)
pipeline_optimizer.fit(training_features, training_classes)

The fit() function takes in a training data set and uses k-fold cross-validation when evaluating pipelines. It then initializes the genetic programming algorithm to find the best pipeline based on the average k-fold score.
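
fit() expects a feature matrix and a target vector. As a sketch of how the training_features and training_classes arrays used above might be built, here is the same scikit-learn train_test_split setup the later examples in this document use, applied to the digits data set:

from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()

# Hold out 25% of the data for the score() call shown below
training_features, testing_features, training_classes, testing_classes = \
    train_test_split(digits.data, digits.target, train_size=0.75, test_size=0.25)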

You can then proceed to evaluate the final pipeline on the testing set with the score() function:

from tpot import TPOTClassifier

pipeline_optimizer = TPOTClassifier(generations=5, population_size=20, cv=5,
                                    random_state=42, verbosity=2)
pipeline_optimizer.fit(training_features, training_classes)
print(pipeline_optimizer.score(testing_features, testing_classes))

Finally, you can tell TPOT to export the corresponding Python code for the optimized pipeline to a text file with the export() function:

from tpot import TPOTClassifier

pipeline_optimizer = TPOTClassifier(generations=5, population_size=20, cv=5,
                                    random_state=42, verbosity=2)
pipeline_optimizer.fit(training_features, training_classes)
print(pipeline_optimizer.score(testing_features, testing_classes))
pipeline_optimizer.export('tpot_exported_pipeline.py')

Once this code finishes running, tpot_exported_pipeline.py will contain the Python code for the optimized pipeline.
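
The exported file is an ordinary scikit-learn script. Its exact contents depend on your TPOT version and on the pipeline that was found, but as a purely hypothetical illustration it will look something along these lines (the data-loading path is a placeholder):

# Hypothetical illustration of an exported pipeline -- the real file is
# generated by TPOT and will differ in its details.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Placeholder: point this at your own data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE')
features = tpot_data.drop('class', axis=1)
training_features, testing_features, training_classes, testing_classes = \
    train_test_split(features, tpot_data['class'], random_state=42)

exported_pipeline = GaussianNB()
exported_pipeline.fit(training_features, training_classes)
results = exported_pipeline.predict(testing_features)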

Check our examples to see TPOT applied to some specific data sets.

Scoring functions

TPOT makes use of sklearn.model_selection.cross_val_score for evaluating pipelines, and as such offers the same support for scoring functions. There are two ways to make use of scoring functions with TPOT:

  1. You can pass in a string to the scoring parameter from the list above. Any other strings will cause TPOT to throw an exception.

  2. You can pass a function with the signature scorer(y_true, y_pred), where y_true are the true target values and y_pred are the predicted target values from an estimator. To do this, you should implement your own function. See the example below for further explanation.

from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25)

# Custom scoring function with the signature scorer(y_true, y_pred):
# the fraction of predictions that match the true labels
def accuracy(y_true, y_pred):
    return float(sum(y_pred == y_true)) / len(y_true)

tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2,
                      scoring=accuracy)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_mnist_pipeline.py')

Customizing TPOT's operators and parameters

TPOT comes with a handful of default operators and parameter configurations that we believe work well for optimizing machine learning pipelines. However, in some cases it is useful to limit the algorithms and parameters that TPOT explores. For that reason, we allow users to provide TPOT with a custom configuration for its operators and parameters.

The custom TPOT configuration must be in nested dictionary format, where the first-level key is the path and name of the operator (e.g., sklearn.naive_bayes.MultinomialNB) and the second-level key is the corresponding parameter name for that operator (e.g., fit_prior). Each second-level key should point to a list of values to try for that parameter, e.g., 'fit_prior': [True, False].

For a simple example, the configuration could be:

classifier_config_dict = {
    'sklearn.naive_bayes.GaussianNB': {
    },
    'sklearn.naive_bayes.BernoulliNB': {
        'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],
        'fit_prior': [True, False]
    },
    'sklearn.naive_bayes.MultinomialNB': {
        'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],
        'fit_prior': [True, False]
    }
}

in which case TPOT would only explore pipelines containing GaussianNB, BernoulliNB, and MultinomialNB, and tune those algorithms' parameters over the ranges provided (GaussianNB's entry is an empty dictionary, so it is used with its default parameters). This dictionary can be passed directly within the code to the config_dict parameter of TPOTClassifier/TPOTRegressor, described above. For example:

from tpot import TPOTClassifier
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X_train, X_test, y_train, y_test = train_test_split(digits.data, digits.target,
                                                    train_size=0.75, test_size=0.25)

classifier_config_dict = {
    'sklearn.naive_bayes.GaussianNB': {
    },
    'sklearn.naive_bayes.BernoulliNB': {
        'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],
        'fit_prior': [True, False]
    },
    'sklearn.naive_bayes.MultinomialNB': {
        'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],
        'fit_prior': [True, False]
    }
}

tpot = TPOTClassifier(generations=5, population_size=20, verbosity=2,
                      config_dict=classifier_config_dict)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export('tpot_mnist_pipeline.py')

Command-line users must create a separate .py file with the custom configuration and pass the path to that file to the tpot call via the -config argument. For example, if the simple example configuration above is saved in tpot_classifier_config.py, that configuration could be used on the command line with the command:

tpot data/mnist.csv -is , -target class -config tpot_classifier_config.py -g 5 -p 20 -v 2 -o tpot_exported_pipeline.py
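
Here, tpot_classifier_config.py would simply contain the dictionary shown above; a sketch, assuming TPOT picks up the dictionary defined in the file (the exact variable name TPOT expects may vary between versions, so check the documentation for your release):

# Contents of tpot_classifier_config.py -- the same dictionary as above
classifier_config_dict = {
    'sklearn.naive_bayes.GaussianNB': {
    },
    'sklearn.naive_bayes.BernoulliNB': {
        'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],
        'fit_prior': [True, False]
    },
    'sklearn.naive_bayes.MultinomialNB': {
        'alpha': [1e-3, 1e-2, 1e-1, 1., 10., 100.],
        'fit_prior': [True, False]
    }
}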

For more detailed examples of how to customize TPOT's operator configuration, see the default configurations for classification and regression in TPOT's source code.

Note that you must have all of the corresponding packages for the operators installed on your computer, otherwise TPOT will not be able to use them. For example, if XGBoost is not installed on your computer, then TPOT will simply not import or use XGBoost in the pipelines it explores.