autoBOTLib.optimization package

Submodules

autoBOTLib.optimization.optimization_engine module

class autoBOTLib.optimization.optimization_engine.GAlearner(train_sequences_raw, train_targets, time_constraint, num_cpu='all', device='cpu', task_name='Super cool task.', latent_dim=512, sparsity=0.1, hof_size=1, initial_separate_spaces=True, scoring_metric=None, top_k_importances=15, representation_type='neurosymbolic', binarize_importances=False, memory_storage='memory', learner=None, n_fold_cv=5, random_seed=8954, learner_hyperparameters=None, use_checkpoints=True, visualize_progress=False, custom_transformer_pipeline=None, combine_with_existing_representation=False, default_importance=0.05, learner_preset='default', task='classification', contextual_model='all-mpnet-base-v2', upsample=False, verbose=1, framework='scikit', normalization_norm='l2', validation_percentage=0.2, validation_type='cv')

Bases: object

The core GA class. It includes methods for evolving a learner assembly. Each instance of autoBOT must first be instantiated. In general, the workflow for this class is: 1.) instantiate the class, 2.) evolve, 3.) predict.
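A minimal usage sketch of this workflow (the import path follows this module's layout; the data and the one-hour time budget are illustrative placeholders):

    from autoBOTLib.optimization.optimization_engine import GAlearner

    train_texts = ["a positive example", "a negative example"]  # placeholder data
    train_labels = [1, 0]

    learner = GAlearner(train_texts, train_labels, time_constraint=1)  # 1.) instantiate
    learner.evolve()                                                   # 2.) evolve
    predictions = learner.predict(["an unseen text"])                  # 3.) predict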

__init__(train_sequences_raw, train_targets, time_constraint, num_cpu='all', device='cpu', task_name='Super cool task.', latent_dim=512, sparsity=0.1, hof_size=1, initial_separate_spaces=True, scoring_metric=None, top_k_importances=15, representation_type='neurosymbolic', binarize_importances=False, memory_storage='memory', learner=None, n_fold_cv=5, random_seed=8954, learner_hyperparameters=None, use_checkpoints=True, visualize_progress=False, custom_transformer_pipeline=None, combine_with_existing_representation=False, default_importance=0.05, learner_preset='default', task='classification', contextual_model='all-mpnet-base-v2', upsample=False, verbose=1, framework='scikit', normalization_norm='l2', validation_percentage=0.2, validation_type='cv')
The object initialization method; specify the core optimization parameters with this method.

Parameters
  • train_sequences_raw (list/pd.Series) – a list of texts

  • train_targets (list/np.array) – a list of natural numbers (targets, multiclass), a list of lists (multilabel)

  • device (str) – Specification of the computation backend device

  • time_constraint (int) – Number of hours to evolve.

  • num_cpu (int/str) – Number of threads to exploit

  • task_name (str) – Task identifier for logging

  • latent_dim (int) – The latent dimension of embeddings

  • sparsity (float) – The assumed sparsity of the induced space (see paper)

  • hof_size (int) – How many final models (hall of fame) to consider?

  • initial_separate_spaces (bool) – Whether to include separate spaces as part of the initial population.

  • scoring_metric (str) – The type of metric to optimize (sklearn-compatible)

  • top_k_importances (int) – How many top importances to remember for explanations.

  • representation_type (str) – “symbolic”, “neural”, “neurosymbolic”, “neurosymbolic-default”, “neurosymbolic-lite” or “custom”. The “symbolic” feature space includes only feature types that humans directly comprehend. The “neural” one includes the embedding-based types. The “neurosymbolic-default” space includes the types from the original MLJ paper, while “neurosymbolic” is the current alpha version with some new additions (constantly updated/developed). The “neurosymbolic-lite” version includes language-agnostic features but does not consider document graphs (due to space constraints)

  • framework (str) – The framework used for obtaining the final models (torch, scikit)

  • binarize_importances (bool) – Feature selection instead of ranking as explanation

  • memory_storage (str) – The storage of the gzipped (TSV) triplets (SPO).

  • learner (obj) – Custom learner. If None, linear learners are used.

  • learner_hyperparameters (obj) – The space to be optimized w.r.t. the learner param.

  • random_seed (int) – The random seed used.

  • contextual_model (str) – The language model string compatible with sentence-transformers library (this is in beta)

  • visualize_progress (bool) – Progress visualization (progress.pdf, requires matplotlib).

  • task (str) – Either “classification” - SGDClassifier, or “regression” - SGDRegressor

  • n_fold_cv (int) – The number of folds to be used for model evaluation.

  • learner_preset (str) – The learner preset to consider: “default” (as in the paper), or “mini-l1”/“mini-l2” for very lightweight regression with an emphasis on space exploration.

  • default_importance (float) – Minimum possible initial weight.

  • upsample (bool) – Whether to equalize the number of instances by upsampling.

  • validation_percentage (float) – The percentage of data to be used as the held-out validation set if validation_type=”train_val”

  • validation_type (str) – The type of validation: either “train_val” (train-validation split) or “cv” (cross-validation)

get_label_map(train_targets)

Identify unique target labels and remember them.

Parameters

train_targets (list/np.array) – The training target space (or any other for that matter)

Return label_map, inverse_label_map

Two dicts, mapping to and from the encoded space suitable for the autoML loop.
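For intuition, a hypothetical pair of maps for a binary task (illustrative values only):

    label_map = {"negative": 0, "positive": 1}          # original -> encoded
    inverse_label_map = {0: "negative", 1: "positive"}  # encoded -> original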

apply_label_map(targets, inverse=False)

A simple mapping back from encoded target space.

Parameters
  • targets (list/np.array) – The target space

  • inverse (bool) – Whether to map back to the original label space (by default, targets are encoded)

Return list new_targets

Encoded target space

update_global_feature_importances()

Aggregate feature importances across top learners to obtain the final ranking.

compute_time_diff()

A method for approximate time monitoring.

prune_redundant_info()

A method for removing redundant additional information that increases the final object’s size.

parallelize_dataframe(df, func)

A method for parallel traversal of a given dataframe.

Parameters
  • df (pd.DataFrame) – dataframe of text (Pandas object)

  • func (obj) – function to be executed (a function)

upsample_dataset(X, Y)

Perform very basic upsampling of under-represented classes.

Parameters
  • X (list) – Input list of documents

  • Y (np.array/list) – Targets

Return X,Y

Return upsampled data.

return_dataframe_from_text(text)

A helper method that returns a dataframe built from raw text.

Parameters

text (list/pd.Series) – list of texts.

Return parsed df

The parsed texts (a DataFrame)

generate_random_initial_state(weights_importances)

The initialization method, capable of generating individuals.

summarise_dataset(list_of_texts, targets)
custom_initialization()

Custom initialization employs random uniform prior. See the paper for more details.

apply_weights(parameters, custom_feature_space=False, custom_feature_matrix=None)

This method applies weights to individual parts of the feature space.

Parameters
  • parameters (np.array) – a vector of real-valued parameters - solution=an individual

  • custom_feature_space (bool) – Custom feature space, relevant during making of predictions.

Return np.array tmp_space

Temporary weighted space (individual)

cross_val_scores(tmp_feature_space, final_run=False)

Compute the learnability of the representation.

Parameters
  • tmp_feature_space (np.array) – An individual’s solution space.

  • final_run (bool) – Last run is more extensive.

Return float performance_score, clf

F1 performance and the learned learner.

evaluate_fitness(individual, max_num_feat=1000, return_clf_and_vec=False)

A helper method for evaluating an individual solution. Given a real-valued vector, this constructs the representations and evaluates a given learner.

Parameters
  • individual (np.array) – an individual (solution)

  • max_num_feat (int) – maximum number of features that are outputted

  • return_clf_and_vec (bool) – return learner and vectorizer? This is useful for deployment.

Return float score

The fitness score.

generate_and_update_stats(fits)

A helper method for generating stats.

Parameters

fits (list) – fitness values of the current population

Return float meanScore

The mean of the fitnesses

report_performance(fits, gen=0)

A helper method for performance reports.

Parameters
  • fits (np.array) – fitness values (vector of floats)

  • gen (int) – generation to be reported (int)

get_feature_space()

Extract final feature space considered for learning purposes.

predict_proba(instances)

Predict on new instances. Note that the prediction is a max-vote across the hall of fame.

Parameters

instances (list/pd.Series) – New instances (texts) to predict labels for.

probability_extraction(pred_matrix)

Predict probabilities for individual classes. Probabilities are computed as the proportion of learners that predict a particular label.

Parameters

pred_matrix (np.array) – Matrix of predictions.

Return pd.DataFrame prob_df

A DataFrame of probabilities for each class.
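A minimal sketch of proportion-based probabilities (an illustration of the idea, not the internal implementation):

    import numpy as np
    import pandas as pd

    # Rows are instances, columns are hall-of-fame learners' predictions.
    pred_matrix = np.array([[0, 0, 1],
                            [1, 1, 1]])

    # A class's probability is the proportion of learners predicting it.
    classes = np.unique(pred_matrix)
    prob_df = pd.DataFrame({c: (pred_matrix == c).mean(axis=1) for c in classes})
    # Row 0 -> class 0: 2/3, class 1: 1/3; row 1 -> class 1: 1.0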

transform(instances)

Generate only the representations (obtain a feature matrix subject to evolution in autoBOT)

Parameters

instances (list/pd.DataFrame) – A collection of instances to be transformed into feature matrix.

Return sparseMatrix output_representation

Representation of the documents.
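A minimal sketch pairing transform() with a custom downstream model (assumes the evolved learner and data from the workflow sketch above; the downstream classifier is an arbitrary choice):

    from sklearn.linear_model import LogisticRegression

    X_train = learner.transform(train_texts)       # evolved representation
    X_new = learner.transform(["an unseen text"])

    clf = LogisticRegression().fit(X_train, train_labels)
    print(clf.predict(X_new))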

predict(instances)

Predict on new instances. Note that the prediction is a max-vote across the hall of fame.

Parameters

instances (list/pd.Series) – New instances (texts) to predict labels for.

Return np.array all_predictions

Vector of predictions (decoded)

mode_pred(prediction_matrix)

Obtain most frequent elements for each row.

Parameters

prediction_matrix (np.array) – Matrix of predictions.

Return np.array prediction_vector

Vector of aggregate predictions.
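A minimal sketch of a row-wise majority vote in the spirit of this method (illustrative, assuming non-negative integer-encoded labels):

    import numpy as np

    prediction_matrix = np.array([[0, 1, 1],
                                  [2, 2, 0]])

    # Most frequent element per row (ties resolve to the smaller label).
    prediction_vector = np.array(
        [np.bincount(row).argmax() for row in prediction_matrix])
    print(prediction_vector)  # [1 2]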

summarise_final_learners()
generate_id_intervals()

Generate independent intervals.

get_feature_importance_report(individual, fitnesses)

Report feature importances.

Parameters
  • individual (np.array) – an individual solution (a vector of floats)

  • fitnesses (list) – fitness space (list of reals)

mutReg(individual, p=1)

Custom mutation operator used for regularization optimization.

Parameters

individual (np.array) – an individual (vector of floats)

Return individual

An individual solution.

update_intermediary_feature_space(custom_space=None)

Create the subset of the original feature space based on the starting_feature_numbers vector that gets evolved.

visualize_learners(learner_dataframe, image_path)

A generic hyperparameter visualization method. It helps the user understand the overall optimization.

Parameters
  • learner_dataframe (pd.DataFrame) – The learner dataframe.

  • image_path (str) – The output file’s path.

Returns

None

visualize_global_importances(importances_object, job_id, output_folder)
generate_report(output_folder='./report', job_id='genericJobId')

An auxiliary method for creating a report

Parameters
  • output_folder (str) – The folder containing the report

  • job_id (str) – The identifier of a given job

Returns

None

instantiate_validation_env()

This method refreshes the feature space. This is needed to maximize efficiency.

feature_type_importances(solution_index=0)

A method which prints feature type importances as a pandas df.

Parameters

solution_index – Which hall-of-fame individual to inspect (by rank).

Return feature_ranking

Final table of rankings
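A minimal usage sketch (assumes an evolved learner, as in the workflow sketch above):

    # Inspect which feature types mattered for the best individual.
    feature_ranking = learner.feature_type_importances(solution_index=0)
    print(feature_ranking)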

get_topic_explanation()

A method for extracting the key topics.

Return pd.DataFrame topicList

A list of topic-id tuples.

visualize_fitness(image_path='fitnessExample.png')

A method for visualizing fitness.

Parameters

image_path – Path to the output file; the extension determines the file type. If set to None, only the DataFrame of statistics is returned.

Return dfx

DataFrame of evolution evaluations

store_top_solutions()

A method for storing the hall of fame (HOF)

load_top_solutions()

Load the top solutions as HOF
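A minimal persistence sketch using the two methods above (assumes an evolved learner):

    learner.store_top_solutions()  # persist the hall of fame
    learner.load_top_solutions()   # restore it later, e.g., after a restart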

evolve(nind=10, crossover_proba=0.4, mutpb=0.15, stopping_interval=20, strategy='evolution', representation_step_only=False)

The core evolution method. It first constrains the maximum number of features to be taken into account by lowering the bound w.r.t. performance; next, it evolves. A usage sketch follows the parameter list.

Parameters
  • nind (int) – number of individuals (int)

  • crossover_proba (float) – crossover probability (float)

  • mutpb (float) – mutation probability (float)

  • stopping_interval (int) – stopping interval: for how long no improvement is tolerated before a hard reset (int)

  • strategy (str) – type of evolution (str)

  • representation_step_only (bool) – Learn only the feature transformations, skip the evolution. Suitable for custom experiments with transform()
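A minimal usage sketch, including the representation-only mode (assumes the learner from the workflow sketch above):

    # Full evolution with an explicit population size.
    learner.evolve(nind=10, crossover_proba=0.4, mutpb=0.15)

    # Alternatively, learn only the feature transformations and use transform().
    learner.evolve(representation_step_only=True)
    X = learner.transform(train_texts)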

autoBOTLib.optimization.optimization_feature_constructors module

AutoBOT. Skrlj et al. 2021

autoBOTLib.optimization.optimization_feature_constructors.remove_punctuation(text)

This method removes punctuation

autoBOTLib.optimization.optimization_feature_constructors.remove_stopwords(text)

This method removes stopwords

Parameters

text (str) – Input string of text

Return str string

Preprocessed text

autoBOTLib.optimization.optimization_feature_constructors.remove_mentions(text, replace_token)

This method removes mentions (relevant for tweets)

Parameters
  • text (str) – Input string of text

  • replace_token (str) – A token to be replaced

Return str string

A new text string

autoBOTLib.optimization.optimization_feature_constructors.remove_hashtags(text, replace_token)

This method removes hashtags

Parameters
  • text (str) – Input string of text

  • replace_token (str) – The token to be replaced

Return str string

A new text

autoBOTLib.optimization.optimization_feature_constructors.remove_url(text, replace_token)

Removal of URLs

Parameters
  • text (str) – Input string of text

  • replace_token (str) – The token to be replaced

Return str string

A new text
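A minimal sketch chaining the tweet-oriented helpers above (the replace_token values are illustrative):

    from autoBOTLib.optimization.optimization_feature_constructors import (
        remove_hashtags, remove_mentions, remove_url)

    text = "@user check https://example.com #nlp"
    text = remove_mentions(text, "<MENTION>")
    text = remove_hashtags(text, "<HASHTAG>")
    text = remove_url(text, "<URL>")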

autoBOTLib.optimization.optimization_feature_constructors.get_affix(text)

This method gets the affix information

Parameters

text (str) – Input text.

autoBOTLib.optimization.optimization_feature_constructors.get_pos_tags(text)

This method yields POS tags

Parameters

text (str) – Input string of text

Return str string

Space-delimited POS tags.

autoBOTLib.optimization.optimization_feature_constructors.ttr(text)

Type-token ratio: the ratio of unique to all tokens

Parameters

text (str) – Input string of text

Return float floatValue

Ratio of the unique/overall tokens
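For intuition, the type-token ratio on a toy string (a sketch of the computation, not necessarily the library's exact tokenization):

    tokens = "the cat sat on the mat".split()
    ttr_value = len(set(tokens)) / len(tokens)  # 5 unique / 6 total ≈ 0.83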

class autoBOTLib.optimization.optimization_feature_constructors.text_col(key)

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

A helper processor class

Parameters
  • BaseEstimator (obj) – Core estimator

  • TransformerMixin (obj) – Transformer object

Return obj object

Returns a particular text column

__init__(key)

Initialize self. See help(type(self)) for accurate signature.

fit(x, y=None)
transform(data_dict)
class autoBOTLib.optimization.optimization_feature_constructors.digit_col

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Dealing with numeric features

Parameters
  • BaseEstimator (obj) – Core estimator

  • TransformerMixin (obj) – Transformer object

Return obj object

Returns transformed (scaled) space

fit(x, y=None)
transform(hd_searches)
autoBOTLib.optimization.optimization_feature_constructors.parallelize(data, method)

Helper method for parallelization

Parameters
  • data (pd.DataFrame) – Input data to be transformed

  • method (obj) – The method to parallelize

Return pd.DataFrame data

Returns the transformed data

autoBOTLib.optimization.optimization_feature_constructors.build_dataframe(data_docs)

One of the core methods, responsible for constructing a dataframe object.

Parameters

data_docs (list/pd.Series) – The input data documents

Return pd.DataFrame df_data

A dataframe corresponding to text representations

class autoBOTLib.optimization.optimization_feature_constructors.FeaturePrunner(max_num_feat=2048)

Bases: object

A helper class for pruning the feature space to at most max_num_feat features.

__init__(max_num_feat=2048)

Initialize self. See help(type(self)) for accurate signature.

fit(input_data, y=None)
transform(input_data)
get_feature_names()
autoBOTLib.optimization.optimization_feature_constructors.fast_screening_sgd(training, targets)
autoBOTLib.optimization.optimization_feature_constructors.get_subset(indice_list, data_matrix, vectorizer)
autoBOTLib.optimization.optimization_feature_constructors.get_simple_features(df_data, max_num_feat=10000)
autoBOTLib.optimization.optimization_feature_constructors.get_features(df_data, representation_type='neurosymbolic', targets=None, sparsity=0.1, embedding_dim=512, memory_location='memory', custom_pipeline=None, random_seed=54324, normalization_norm='l2', contextual_model='all-mpnet-base-v2', combine_with_existing_representation=False)

Method that computes various TF-IDF-like features. A usage sketch follows the parameter list.

Parameters
  • df_data (list/pd.Series) – The input collection of texts

  • representation_type (str) – Type of representation to be used.

  • targets (list/np.array) – The target space (optional)

  • sparsity (float) – The hyperparameter determining the dimensionalities of separate subspaces

  • normalization_norm (str) – The normalization of each subspace

  • embedding_dim (int) – The latent dimension for doc. embeddings

  • memory_location (str) – Location of the gzipped ConceptNet-like memory.

  • custom_pipeline (obj) – Custom pipeline to be used for features if needed.

  • contextual_model (str) – The language model string compatible with sentence-transformers library (this is in beta)

  • random_seed (int) – The seed for the pseudo-random parts.

  • combine_with_existing_representation (bool) – Whether to use existing representations + user-specified ones.

Return obj/list/matrix

The transformer pipeline, the feature names, and the feature matrix.
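A minimal standalone sketch based on the signature above (the tuple order follows the Return description and is an assumption; data are placeholders):

    from autoBOTLib.optimization.optimization_feature_constructors import (
        build_dataframe, get_features)

    docs = ["first document", "second document"]
    df_data = build_dataframe(docs)
    pipeline, feature_names, feature_matrix = get_features(
        df_data, representation_type="symbolic")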

autoBOTLib.optimization.optimization_metrics module

autoBOTLib.optimization.optimization_metrics.get_metric_report(y_true, y_prediction)

A generic metric report; suitable for multi-objective experiments (not part of the core paper)

autoBOTLib.optimization.optimization_random module

autoBOTLib.optimization.optimization_utils module

class autoBOTLib.optimization.optimization_utils.DataProcessor

Bases: object

Base class for data converters for sequence classification data sets.

get_train_examples(data_dir)

Gets a collection of `InputExample`s for the train set.

get_dev_examples(data_dir)

Gets a collection of `InputExample`s for the dev set.

get_labels()

Gets the list of labels for this data set.

read_pandas_tsv(input_file)
class autoBOTLib.optimization.optimization_utils.genericProcessor

Bases: autoBOTLib.optimization.optimization_utils.DataProcessor

get_train_examples(data_dir)

See base class.

get_dev_examples(data_dir)

See base class.

get_test_examples(data_dir)

See base class.

autoBOTLib.optimization.optimization_utils.simple_accuracy(preds, labels)
autoBOTLib.optimization.optimization_utils.acc_and_f1(preds, labels, average=None)
autoBOTLib.optimization.optimization_utils.pearson_and_spearman(preds, labels)
autoBOTLib.optimization.optimization_utils.compute_metrics(task_name, preds, labels)

Module contents