autoBOTLib.optimization package¶

Submodules¶

autoBOTLib.optimization.optimization_engine module¶

class autoBOTLib.optimization.optimization_engine.GAlearner(train_sequences_raw, train_targets, time_constraint, num_cpu='all', device='cpu', task_name='Super cool task.', latent_dim=512, sparsity=0.1, hof_size=1, initial_separate_spaces=True, scoring_metric=None, top_k_importances=15, representation_type='neurosymbolic', binarize_importances=False, memory_storage='memory', learner=None, n_fold_cv=5, random_seed=8954, learner_hyperparameters=None, use_checkpoints=True, visualize_progress=False, custom_transformer_pipeline=None, combine_with_existing_representation=False, default_importance=0.05, learner_preset='default', task='classification', contextual_model='all-mpnet-base-v2', upsample=False, verbose=1, framework='scikit', normalization_norm='l2', validation_percentage=0.2, validation_type='cv')¶

Bases: object

The core GA class. It includes methods for evolving an assembly of learners. An autoBOT instance must be created before it can be used. In general, the workflow for working with this class is as follows: 1.) Instantiate the class 2.) Evolve 3.) Predict

__init__(train_sequences_raw, train_targets, time_constraint, num_cpu='all', device='cpu', task_name='Super cool task.', latent_dim=512, sparsity=0.1, hof_size=1, initial_separate_spaces=True, scoring_metric=None, top_k_importances=15, representation_type='neurosymbolic', binarize_importances=False, memory_storage='memory', learner=None, n_fold_cv=5, random_seed=8954, learner_hyperparameters=None, use_checkpoints=True, visualize_progress=False, custom_transformer_pipeline=None, combine_with_existing_representation=False, default_importance=0.05, learner_preset='default', task='classification', contextual_model='all-mpnet-base-v2', upsample=False, verbose=1, framework='scikit', normalization_norm='l2', validation_percentage=0.2, validation_type='cv')¶

The object initialization method; specify the core optimization parameters with this method.

Parameters

train_sequences_raw (list/PandasSeries) – a list of texts
train_targets (list/np.array) – a list of natural numbers (targets, multiclass), a list of lists (multilabel)
device (str) – Specification of the computation backend device
time_constraint (int) – Number of hours to evolve.
num_cpu (int/str) – Number of threads to exploit
task_name (str) – Task identifier for logging
latent_dim (int) – The latent dimension of embeddings
sparsity (float) – The assumed sparsity of the induced space (see paper)
hof_size (int) – How many final models to consider (the size of the hall-of-fame).
initial_separate_spaces (bool) – Whether to include separate spaces as part of the initial population.
scoring_metric (str) – The type of metric to optimize (sklearn-compatible)
top_k_importances (int) – How many top importances to remember for explanations.
representation_type (str) – “symbolic”, “neural”, “neurosymbolic”, “neurosymbolic-default”, “neurosymbolic-lite” or “custom”. The “symbolic” feature space will only include feature types that we humans directly comprehend. The “neural” will include the embedding-based ones. The “neurosymbolic-default” will include the ones based on the original MLJ paper, the “neurosymbolic” is the current alpha version with some new additions (constantly updated/developed). The “neurosymbolic-lite” version includes language-agnostic features but does not consider document graphs (due to space constraints).
framework (str) – The framework used for obtaining the final models (torch, scikit)
binarize_importances (bool) – Feature selection instead of ranking as explanation
memory_storage (str) – The storage of the gzipped (TSV) triplets (SPO).
learner (obj) – custom learner. If none, linear learners are used.
learner_hyperparameters (obj) – The space to be optimized w.r.t. the learner param.
random_seed (int) – The random seed used.
contextual_model (str) – The language model string compatible with sentence-transformers library (this is in beta)
visualize_progress (bool) – Progress visualization (progress.pdf, requires MPL).
task (str) – Either “classification” - SGDClassifier, or “regression” - SGDRegressor
n_fold_cv (int) – The number of folds to be used for model evaluation.
learner_preset (str) – Type of classification to be considered (default=paper), “mini-l1” or “mini-l2” -> very lightweight regression, emphasis on space exploration.
default_importance (float) – Minimum possible initial weight.
upsample (bool) – Whether to equalize the number of instances by upsampling.
validation_percentage (float) – The percentage of data to be used as the test set if validation_type=”train_test”
validation_type (str) – Type of validation, either train_val (train-val split) or cv (cross validation)

get_label_map(train_targets)¶

Identify unique target labels and remember them.

Parameters: train_targets (list/np.array) – The training target space (or any other for that matter)
Return label_map, inverse_label_map: Two dicts, mapping to and from encoded space suitable for autoML loopings.

apply_label_map(targets, inverse=False)¶

A simple mapping back from encoded target space.

Parameters

targets (list/np.array) – The target space
inverse (bool) – Whether to map back to the original space (the default encodes into the continuum)

Return list new_targets

Encoded target space
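
The two label-mapping methods above can be illustrated with a minimal sketch. This is not the library's implementation: the helper names `build_label_maps` and `map_labels` are hypothetical, and the sketch assumes that labels are encoded as consecutive integers in sorted order.

```python
def build_label_maps(train_targets):
    """Return (label_map, inverse_label_map) for the distinct labels."""
    unique_labels = sorted(set(train_targets))
    # Original label -> integer code.
    label_map = {label: code for code, label in enumerate(unique_labels)}
    # Integer code -> original label.
    inverse_label_map = {code: label for label, code in label_map.items()}
    return label_map, inverse_label_map


def map_labels(targets, label_map, inverse_label_map, inverse=False):
    """Encode labels by default; decode integer codes when inverse=True."""
    mapping = inverse_label_map if inverse else label_map
    return [mapping[t] for t in targets]


label_map, inverse_label_map = build_label_maps(["spam", "ham", "spam"])
encoded = map_labels(["spam", "ham", "spam"], label_map, inverse_label_map)
decoded = map_labels(encoded, label_map, inverse_label_map, inverse=True)
print(encoded)  # [1, 0, 1]
print(decoded)  # ['spam', 'ham', 'spam']
```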

update_global_feature_importances()¶: Aggregate feature importances across top learners to obtain the final ranking.

compute_time_diff()¶: A method for approximate time monitoring.

prune_redundant_info()¶: A method for removing redundant additional info which increases the final object’s size.

parallelize_dataframe(df, func)¶

A method for parallel traversal of a given dataframe.

Parameters

df (pd.DataFrame) – dataframe of text (Pandas object)
func (obj) – function to be executed (a function)

upsample_dataset(X, Y)¶

Perform very basic upsampling of less-present classes.

Parameters

X (list) – Input list of documents
Y (np.array/list) – Targets

Return X,Y

Return upsampled data.
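
One plausible reading of "very basic upsampling" is random duplication of minority-class documents until every class matches the largest one. The sketch below follows that assumption; the function name `upsample_minority` and the `seed` argument are hypothetical and not part of autoBOTLib.

```python
import random
from collections import Counter


def upsample_minority(X, Y, seed=0):
    """Duplicate random documents of smaller classes up to the largest class size."""
    rng = random.Random(seed)
    counts = Counter(Y)
    target_size = max(counts.values())
    X_out, Y_out = list(X), list(Y)
    for label, count in counts.items():
        # All documents that carry this label.
        pool = [x for x, y in zip(X, Y) if y == label]
        # Sample with replacement until the class reaches target_size.
        extra = [rng.choice(pool) for _ in range(target_size - count)]
        X_out.extend(extra)
        Y_out.extend([label] * len(extra))
    return X_out, Y_out


X_up, Y_up = upsample_minority(["a", "b", "c", "d"], [0, 0, 0, 1])
print(Counter(Y_up))  # both classes now have 3 instances
```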

return_dataframe_from_text(text)¶

A helper method that returns a dataframe built from the given text.

Parameters: text (list/pd.Series) – list of texts.
Return parsed df: A parsed text (a DataFrame)

generate_random_initial_state(weights_importances)¶: The initialization method, capable of generation of individuals.

summarise_dataset(list_of_texts, targets)¶

custom_initialization()¶: Custom initialization employs random uniform prior. See the paper for more details.

apply_weights(parameters, custom_feature_space=False, custom_feature_matrix=None)¶

This method applies weights to individual parts of the feature space.

Parameters

parameters (np.array) – a vector of real-valued parameters - solution=an individual
custom_feature_space (bool) – Custom feature space, relevant during making of predictions.

Return np.array tmp_space

Temporary weighted space (individual)
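
As a rough illustration of weighting parts of a feature space, the sketch below scales contiguous column blocks by one weight each. It assumes that every feature type occupies a contiguous interval of columns; the function `weight_subspaces` and its `intervals` argument are hypothetical, and the library itself operates on matrices rather than plain lists.

```python
def weight_subspaces(feature_rows, intervals, weights):
    """Multiply each column block [start, end) by its corresponding weight."""
    weighted = []
    for row in feature_rows:
        new_row = list(row)
        for (start, end), weight in zip(intervals, weights):
            for j in range(start, end):
                new_row[j] = row[j] * weight
        weighted.append(new_row)
    return weighted


rows = [[1.0, 2.0, 3.0, 4.0]]
# Two subspaces: columns 0-1 weighted by 0.5, columns 2-3 weighted by 2.0.
weighted = weight_subspaces(rows, intervals=[(0, 2), (2, 4)], weights=[0.5, 2.0])
print(weighted)  # [[0.5, 1.0, 6.0, 8.0]]
```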

cross_val_scores(tmp_feature_space, final_run=False)¶

Compute the learnability of the representation.

Parameters

tmp_feature_space (np.array) – An individual’s solution space.
final_run (bool) – Last run is more extensive.

Return float performance_score, clf

F1 performance and the learned learner.

evaluate_fitness(individual, max_num_feat=1000, return_clf_and_vec=False)¶

A helper method for evaluating an individual solution. Given a real-valued vector, this constructs the representations and evaluates a given learner.

Parameters

individual (np.array) – an individual (solution)
max_num_feat (int) – maximum number of features that are outputted
return_clf_and_vec (bool) – return learner and vectorizer? This is useful for deployment.

Return float score

The fitness score.

generate_and_update_stats(fits)¶

A helper method for generating stats.

Parameters: fits (list) – fitness values of the current population
Return float meanScore: The mean of the fitnesses

report_performance(fits, gen=0)¶

A helper method for performance reports.

Parameters

fits (np.array) – fitness values (vector of floats)
gen (int) – generation to be reported (int)

get_feature_space()¶: Extract final feature space considered for learning purposes.

predict_proba(instances)¶

Predict on new instances. Note that the prediction is actually a maxvote across the hall-of-fame.

Parameters: instances (list/pd.Series) – predict labels for new instances=texts.

probability_extraction(pred_matrix)¶

Predict probabilities for individual classes. Probabilities are computed as the proportion of learners that predicted a particular label.

Parameters: pred_matrix (np.array) – Matrix of predictions.
Return pd.DataFrame prob_df: A DataFrame of probabilities for each class.
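
The proportion-based probabilities described above can be sketched as follows. The sketch assumes that each row of the prediction matrix holds the labels predicted for one instance by the individual learners; `vote_proportions` is a hypothetical helper that returns plain dictionaries instead of a DataFrame.

```python
from collections import Counter


def vote_proportions(pred_matrix, classes):
    """For each row, return the share of learners that predicted each class."""
    probabilities = []
    for row in pred_matrix:
        counts = Counter(row)
        probabilities.append({c: counts[c] / len(row) for c in classes})
    return probabilities


# Two instances, four learners each.
pred_matrix = [[0, 0, 1, 0], [1, 1, 1, 1]]
probs = vote_proportions(pred_matrix, classes=[0, 1])
print(probs)  # [{0: 0.75, 1: 0.25}, {0: 0.0, 1: 1.0}]
```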

transform(instances)¶

Generate only the representations (obtain a feature matrix subject to evolution in autoBOT)

Parameters: instances (list/pd.DataFrame) – A collection of instances to be transformed into feature matrix.
Return sparseMatrix output_representation: Representation of the documents.

predict(instances)¶

Predict on new instances. Note that the prediction is actually a maxvote across the hall-of-fame.

Parameters: instances (list/pd.Series) – predict labels for new instances=texts.
Return np.array all_predictions: Vector of predictions (decoded)

mode_pred(prediction_matrix)¶

Obtain most frequent elements for each row.

Parameters: prediction_matrix (np.array) – Matrix of predictions.
Return np.array prediction_vector: Vector of aggregate predictions.
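
A row-wise majority vote of this kind can be sketched with the standard library. The helper name `row_modes` is hypothetical, and ties are resolved in favor of the label seen first in the row, which may differ from the library's behavior.

```python
from collections import Counter


def row_modes(prediction_matrix):
    """Return the most frequent label in each row."""
    return [Counter(row).most_common(1)[0][0] for row in prediction_matrix]


modes = row_modes([[0, 0, 1], [2, 1, 2], [1, 1, 1]])
print(modes)  # [0, 2, 1]
```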

summarise_final_learners()¶

generate_id_intervals()¶: Generate independent intervals.

get_feature_importance_report(individual, fitnesses)¶

Report feature importances.

Parameters

individual (np.array) – an individual solution (a vector of floats)
fitnesses (list) – fitness space (list of reals)

mutReg(individual, p=1)¶

Custom mutation operator used for regularization optimization.

Parameters: individual – individual (vector of floats)
Return individual: An individual solution.

update_intermediary_feature_space(custom_space=None)¶: Create the subset of the original feature space based on the starting_feature_numbers vector that gets evolved.

visualize_learners(learner_dataframe, image_path)¶

A generic hyperparameter visualization method. This helps the user with understanding of the overall optimization.

Parameters

learner_dataframe (pd.DataFrame) – The learner dataframe.
image_path (str) – The output file’s path.

Returns

None

visualize_global_importances(importances_object, job_id, output_folder)¶

generate_report(output_folder='./report', job_id='genericJobId')¶

An auxiliary method for creating a report.

Parameters

output_folder (string) – The folder containing the report
job_id (string) – The identifier of a given job

Returns

None

instantiate_validation_env()¶: This method refreshes the feature space. This is needed to maximize efficiency.

feature_type_importances(solution_index=0)¶

A method which prints feature type importances as a pandas df.

Parameters: solution_index – Index of the individual (solution) to inspect.
Return feature_ranking: Final table of rankings

get_topic_explanation()¶

A method for extracting the key topics.

Return pd.DataFrame topicList: A list of topic-id tuples.

visualize_fitness(image_path='fitnessExample.png')¶

A method for visualizing fitness.

Parameters: image_path – Path to file, ending denotes file type. If set to None, only DataFrame of statistics is returned.
Return dfx: DataFrame of evolution evaluations

store_top_solutions()¶: A method for storing the HOF

load_top_solutions()¶: Load the top solutions as HOF

evolve(nind=10, crossover_proba=0.4, mutpb=0.15, stopping_interval=20, strategy='evolution', representation_step_only=False)¶

The core evolution method. First, constrain the maximum number of features to be taken into account by lowering the bound w.r.t. performance. Next, evolve.

Parameters

nind (int) – number of individuals (int)
crossover_proba (float) – crossover probability (float)
mutpb (float) – mutation probability (float)
stopping_interval (int) – stopping interval -> for how long no improvement is tolerated before a hard reset (int)
strategy (str) – type of evolution (str)
representation_step_only (bool) – Learn only the feature transformations, skip the evolution. Suitable for custom experiments with transform()
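
The instantiate, evolve, predict workflow can be sketched as below, assuming autoBOTLib is installed. The keyword arguments are taken from the signatures documented on this page; the wrapper `run_autobot_workflow`, its argument names and the chosen values are hypothetical and only illustrate the call order.

```python
def run_autobot_workflow(train_texts, train_labels, test_texts, hours=1):
    """Instantiate a GAlearner, evolve it, and predict labels for new texts."""
    # Imported lazily so this sketch can be read without autoBOTLib installed.
    from autoBOTLib.optimization.optimization_engine import GAlearner

    # 1.) Instantiate the class (time_constraint is given in hours).
    learner = GAlearner(
        train_texts,
        train_labels,
        time_constraint=hours,
        representation_type="symbolic",
        n_fold_cv=3,
    )
    # 2.) Evolve, here with the documented default settings spelled out.
    learner.evolve(
        nind=10,
        crossover_proba=0.4,
        mutpb=0.15,
        stopping_interval=20,
        strategy="evolution",
    )
    # 3.) Predict labels for unseen documents.
    return learner.predict(test_texts)
```

Calling the function starts a real optimization run, so it is defined here rather than executed.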

autoBOTLib.optimization.optimization_feature_constructors module¶

AutoBOT. Skrlj et al. 2021

autoBOTLib.optimization.optimization_feature_constructors.remove_punctuation(text)¶: This method removes punctuation

autoBOTLib.optimization.optimization_feature_constructors.remove_stopwords(text)¶

This method removes stopwords

Parameters: text (list/pd.Series) – Input string of text
Return str string: Preprocessed text

autoBOTLib.optimization.optimization_feature_constructors.remove_mentions(text, replace_token)¶

This method removes mentions (relevant for tweets)

Parameters

text (str) – Input string of text
replace_token (str) – A token to be replaced

Return str string

A new text string

autoBOTLib.optimization.optimization_feature_constructors.remove_hashtags(text, replace_token)¶

This method removes hashtags

Parameters

text (str) – Input string of text
replace_token (str) – The token to be replaced

Return str string

A new text

autoBOTLib.optimization.optimization_feature_constructors.remove_url(text, replace_token)¶

Removal of URLs

Parameters

text (str) – Input string of text
replace_token (str) – The token to be replaced

Return str string

A new text
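
The three token-replacement helpers above (mentions, hashtags, URLs) can be approximated with regular expressions. The patterns and function names below are assumptions made for illustration; the library's own patterns may differ.

```python
import re


def replace_urls(text, replace_token):
    """Replace http(s) URLs with replace_token."""
    return re.sub(r"https?://\S+", replace_token, text)


def replace_hashtags(text, replace_token):
    """Replace #hashtags with replace_token."""
    return re.sub(r"#\w+", replace_token, text)


def replace_mentions(text, replace_token):
    """Replace @mentions with replace_token."""
    return re.sub(r"@\w+", replace_token, text)


tweet = "see https://example.org #nlp via @alice"
# URLs are handled first so that '#' or '@' inside a URL is not matched later.
cleaned = replace_mentions(
    replace_hashtags(replace_urls(tweet, "URL"), "HASHTAG"), "MENTION"
)
print(cleaned)  # see URL HASHTAG via MENTION
```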

autoBOTLib.optimization.optimization_feature_constructors.get_affix(text)¶

This method gets the affix information

Parameters: text (str) – Input text.

autoBOTLib.optimization.optimization_feature_constructors.get_pos_tags(text)¶

This method yields pos tags

Parameters: text (str) – Input string of text
Return str string: space delimited pos tags.

autoBOTLib.optimization.optimization_feature_constructors.ttr(text)¶

Type-token ratio: the number of unique tokens relative to the overall number of tokens.

Parameters: text (str) – Input string of text
Return float floatValue: Ratio of the unique/overall tokens
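
A minimal sketch of a type-token ratio, assuming whitespace tokenization (the library may tokenize differently); `type_token_ratio` is a hypothetical name.

```python
def type_token_ratio(text):
    """Return the number of unique tokens divided by the total number of tokens."""
    tokens = text.split()
    if not tokens:
        # Avoid division by zero for empty input.
        return 0.0
    return len(set(tokens)) / len(tokens)


ratio = type_token_ratio("the cat saw the dog")
print(ratio)  # 0.8 (4 unique tokens out of 5)
```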

class autoBOTLib.optimization.optimization_feature_constructors.text_col(key)¶

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

A helper processor class

Parameters

BaseEstimator (obj) – Core estimator
TransformerMixin (obj) – Transformer object

Return obj object

Returns particular text column

__init__(key)¶: Initialize self. See help(type(self)) for accurate signature.

fit(x, y=None)¶

transform(data_dict)¶

class autoBOTLib.optimization.optimization_feature_constructors.digit_col¶

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

Dealing with numeric features

Parameters

BaseEstimator (obj) – Core estimator
TransformerMixin (obj) – Transformer object

Return obj object

Returns transformed (scaled) space

fit(x, y=None)¶

transform(hd_searches)¶

autoBOTLib.optimization.optimization_feature_constructors.parallelize(data, method)¶

Helper method for parallelization

Parameters

data (pd.DataFrame) – Input data to be transformed
method (obj) – The method to parallelize

Return pd.DataFrame data

Returns the transformed data

autoBOTLib.optimization.optimization_feature_constructors.build_dataframe(data_docs)¶

One of the core methods responsible for construction of a dataframe object.

Parameters: data_docs (list/pd.Series) – The input data documents
Return pd.DataFrame df_data: A dataframe corresponding to text representations

class autoBOTLib.optimization.optimization_feature_constructors.FeaturePrunner(max_num_feat=2048)¶

Bases: object

Core class describing sentence embedding methodology employed here.

__init__(max_num_feat=2048)¶: Initialize self. See help(type(self)) for accurate signature.

fit(input_data, y=None)¶

transform(input_data)¶

get_feature_names()¶

autoBOTLib.optimization.optimization_feature_constructors.fast_screening_sgd(training, targets)¶

autoBOTLib.optimization.optimization_feature_constructors.get_subset(indice_list, data_matrix, vectorizer)¶

autoBOTLib.optimization.optimization_feature_constructors.get_simple_features(df_data, max_num_feat=10000)¶

autoBOTLib.optimization.optimization_feature_constructors.get_features(df_data, representation_type='neurosymbolic', targets=None, sparsity=0.1, embedding_dim=512, memory_location='memory', custom_pipeline=None, random_seed=54324, normalization_norm='l2', contextual_model='all-mpnet-base-v2', combine_with_existing_representation=False)¶

Method that computes various TF-IDF-alike features.

Parameters

df_data (list/pd.Series) – The input collection of texts
representation_type (str) – Type of representation to be used.
targets (list/np.array) – The target space (optional)
sparsity (float) – The hyperparameter determining the dimensionalities of separate subspaces
normalization_norm (str) – The normalization of each subspace
embedding_dim (int) – The latent dimension for doc. embeddings
memory_location (str) – Location of the gzipped ConceptNet-like memory.
custom_pipeline (obj) – Custom pipeline to be used for features if needed.
contextual_model (str) – The language model string compatible with sentence-transformers library (this is in beta)
random_seed (int) – The seed for the pseudo-random parts.
combine_with_existing_representation (bool) – Whether to use existing representations + user-specified ones.

Return obj/list/matrix

Transformer pipeline, feature names and the feature matrix.

autoBOTLib.optimization.optimization_metrics module¶

autoBOTLib.optimization.optimization_metrics.get_metric_report(y_true, y_prediction)¶: A generic metric report; suitable for multiobjective experiments (not the core paper)

autoBOTLib.optimization.optimization_random module¶

autoBOTLib.optimization.optimization_utils module¶

class autoBOTLib.optimization.optimization_utils.DataProcessor¶

Bases: object

Base class for data converters for sequence classification data sets.

get_train_examples(data_dir)¶: Gets a collection of `InputExample`s for the train set.

get_dev_examples(data_dir)¶: Gets a collection of `InputExample`s for the dev set.

get_labels()¶: Gets the list of labels for this data set.

read_pandas_tsv(input_file)¶

class autoBOTLib.optimization.optimization_utils.genericProcessor¶

Bases: autoBOTLib.optimization.optimization_utils.DataProcessor

get_train_examples(data_dir)¶: See base class.

get_dev_examples(data_dir)¶: See base class.

get_test_examples(data_dir)¶: See base class.

autoBOTLib.optimization.optimization_utils.simple_accuracy(preds, labels)¶

autoBOTLib.optimization.optimization_utils.acc_and_f1(preds, labels, average=None)¶

autoBOTLib.optimization.optimization_utils.pearson_and_spearman(preds, labels)¶

autoBOTLib.optimization.optimization_utils.compute_metrics(task_name, preds, labels)¶

Module contents¶