autoBOTLib.optimization package
Submodules
autoBOTLib.optimization.optimization_engine module
-
class
autoBOTLib.optimization.optimization_engine.
GAlearner
(train_sequences_raw, train_targets, time_constraint, num_cpu='all', device='cpu', task_name='Super cool task.', latent_dim=512, sparsity=0.1, hof_size=1, initial_separate_spaces=True, scoring_metric=None, top_k_importances=15, representation_type='neurosymbolic', binarize_importances=False, memory_storage='memory', learner=None, n_fold_cv=5, random_seed=8954, learner_hyperparameters=None, use_checkpoints=True, visualize_progress=False, custom_transformer_pipeline=None, combine_with_existing_representation=False, default_importance=0.05, learner_preset='default', task='classification', contextual_model='all-mpnet-base-v2', upsample=False, verbose=1, framework='scikit', normalization_norm='l2', validation_percentage=0.2, validation_type='cv')¶ Bases:
object
The core GA (genetic algorithm) class. It includes methods for the evolution of a learner assembly. An autoBOT object must be instantiated before it can be used. In general, the workflow for working with this class is as follows:
1.) Instantiate the class
2.) Evolve
3.) Predict
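The workflow described above can be sketched as follows. This is a minimal sketch, assuming autoBOTLib is installed; the wrapper `run_autobot_workflow` and its argument names are hypothetical and introduced only for illustration, while the class path and the keyword names are taken from the signatures documented in this section.

```python
def run_autobot_workflow(train_texts, train_targets, test_texts, hours=1):
    """Instantiate, evolve and predict with GAlearner (hypothetical wrapper)."""
    # Imported lazily so this sketch can be loaded without autoBOTLib installed.
    from autoBOTLib.optimization.optimization_engine import GAlearner

    # 1.) Instantiate the class; time_constraint is given in hours.
    learner = GAlearner(train_texts, train_targets, time_constraint=hours)

    # 2.) Evolve (the values below are the documented defaults).
    learner.evolve(nind=10, crossover_proba=0.4, mutpb=0.15)

    # 3.) Predict labels for unseen texts.
    return learner.predict(test_texts)
```

Calling the wrapper starts an evolution that may run for up to the given number of hours, so it is best tried on a small dataset first.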
-
__init__
(train_sequences_raw, train_targets, time_constraint, num_cpu='all', device='cpu', task_name='Super cool task.', latent_dim=512, sparsity=0.1, hof_size=1, initial_separate_spaces=True, scoring_metric=None, top_k_importances=15, representation_type='neurosymbolic', binarize_importances=False, memory_storage='memory', learner=None, n_fold_cv=5, random_seed=8954, learner_hyperparameters=None, use_checkpoints=True, visualize_progress=False, custom_transformer_pipeline=None, combine_with_existing_representation=False, default_importance=0.05, learner_preset='default', task='classification', contextual_model='all-mpnet-base-v2', upsample=False, verbose=1, framework='scikit', normalization_norm='l2', validation_percentage=0.2, validation_type='cv')¶ The object initialization method; specify the core optimization parameters with this method.
- Parameters
train_sequences_raw (list/PandasSeries) – a list of texts
train_targets (list/np.array) – a list of natural numbers (targets, multiclass) or a list of lists (multilabel)
device (str) – Specification of the computation backend device
time_constraint (int) – Number of hours to evolve.
num_cpu (int/str) – Number of threads to exploit
task_name (str) – Task identifier for logging
latent_dim (int) – The latent dimension of embeddings
sparsity (float) – The assumed sparsity of the induced space (see paper)
hof_size (int) – How many final models to consider (the hall-of-fame size).
initial_separate_spaces (bool) – Whether to include separate spaces as part of the initial population.
scoring_metric (str) – The type of metric to optimize (sklearn-compatible)
top_k_importances (int) – How many top importances to remember for explanations.
representation_type (str) – “symbolic”, “neural”, “neurosymbolic”, “neurosymbolic-default”, “neurosymbolic-lite” or “custom”. The “symbolic” feature space includes only feature types that humans can directly comprehend. The “neural” space includes the embedding-based ones. The “neurosymbolic-default” space includes the ones based on the original MLJ paper; “neurosymbolic” is the current alpha version with some new additions (constantly updated/developed). The “neurosymbolic-lite” version includes language-agnostic features but does not consider document graphs (due to space constraints).
framework (str) – The framework used for obtaining the final models (torch, scikit)
binarize_importances (bool) – Feature selection instead of ranking as explanation
memory_storage (str) – The storage of the gzipped (TSV) triplets (SPO).
learner (obj) – A custom learner. If None, linear learners are used.
learner_hyperparameters (obj) – The space to be optimized w.r.t. the learner param.
random_seed (int) – The random seed used.
contextual_model (str) – The language model string compatible with sentence-transformers library (this is in beta)
visualize_progress (bool) – Progress visualization (progress.pdf, requires Matplotlib).
task (str) – Either “classification” - SGDClassifier, or “regression” - SGDRegressor
n_fold_cv (int) – The number of folds to be used for model evaluation.
learner_preset (str) – Type of classification to be considered: “default” (the setting from the paper), or “mini-l1” or “mini-l2” (very lightweight regression, with emphasis on space exploration).
default_importance (float) – Minimum possible initial weight.
upsample (bool) – Whether to equalize the number of instances by upsampling.
validation_percentage (float) – The percentage of data to be used as the test set if validation_type=“train_test”
validation_type (str) – Type of validation, either train_val (train-validation split) or cv (cross-validation)
-
get_label_map
(train_targets)¶ Identify unique target labels and remember them.
- Parameters
train_targets (list/np.array) – The training target space (or any other for that matter)
- Return label_map, inverse_label_map
Two dicts that map to and from the encoded space, suitable for the autoML loops.
-
apply_label_map
(targets, inverse=False)¶ A simple mapping back from encoded target space.
- Parameters
targets (list/np.array) – The target space
inverse (bool) – Whether to map back to the original space; by default the targets are encoded.
- Return list new_targets
Encoded target space
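The two label-mapping methods above can be illustrated with a small standalone sketch. It is not the library's implementation: the function names `build_label_maps` and `map_labels` are hypothetical, and assigning integer codes in sorted label order is an assumption made here so that the result is deterministic.

```python
def build_label_maps(targets):
    """Return (label_map, inverse_label_map) for the unique targets."""
    # Assumption: codes are assigned in sorted label order.
    unique_labels = sorted(set(targets))
    label_map = {label: index for index, label in enumerate(unique_labels)}
    inverse_label_map = {index: label for label, index in label_map.items()}
    return label_map, inverse_label_map


def map_labels(targets, label_map, inverse_label_map, inverse=False):
    """Encode targets (default) or decode them back when inverse=True."""
    mapping = inverse_label_map if inverse else label_map
    return [mapping[target] for target in targets]


label_map, inverse_label_map = build_label_maps(["spam", "ham", "spam"])
encoded = map_labels(["spam", "ham", "spam"], label_map, inverse_label_map)
decoded = map_labels(encoded, label_map, inverse_label_map, inverse=True)
print(encoded)  # [1, 0, 1]
print(decoded)  # ['spam', 'ham', 'spam']
```

Keeping both dictionaries makes decoding the predictions a plain lookup, which is presumably why the method returns the pair.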
-
update_global_feature_importances
()¶ Aggregate feature importances across top learners to obtain the final ranking.
-
compute_time_diff
()¶ A method for approximate time monitoring.
-
prune_redundant_info
()¶ A method for removing redundant additional info which increases the final object’s size.
-
parallelize_dataframe
(df, func)¶ A method for parallel traversal of a given dataframe.
- Parameters
df (pd.DataFrame) – dataframe of text (Pandas object)
func (obj) – function to be executed (a function)
-
upsample_dataset
(X, Y)¶ Perform very basic upsampling of less-present classes.
- Parameters
X (list) – Input list of documents
Y (np.array/list) – Targets
- Return X,Y
Return upsampled data.
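A basic form of upsampling can be sketched as below. This is an illustrative sketch only: the function name `upsample`, the `seed` argument and the strategy of duplicating randomly chosen documents until every class matches the largest one are assumptions, not a description of what the library does internally.

```python
import random
from collections import Counter


def upsample(documents, targets, seed=0):
    """Duplicate random documents of rarer classes until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(targets)
    largest = max(counts.values())
    new_documents, new_targets = list(documents), list(targets)
    for label, count in counts.items():
        # Indices of the documents that belong to the current class.
        candidates = [i for i, target in enumerate(targets) if target == label]
        for _ in range(largest - count):
            chosen = rng.choice(candidates)
            new_documents.append(documents[chosen])
            new_targets.append(label)
    return new_documents, new_targets


X, Y = upsample(["a", "b", "c", "d"], [0, 0, 0, 1])
print(Counter(Y))  # both classes now have three instances
```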
-
return_dataframe_from_text
(text)¶ A helper method that returns a dataframe built from the given text.
- Parameters
text (list/pd.Series) – list of texts.
- Return parsed df
A parsed text (a DataFrame)
-
generate_random_initial_state
(weights_importances)¶ The initialization method, capable of generation of individuals.
-
summarise_dataset
(list_of_texts, targets)¶
-
custom_initialization
()¶ Custom initialization employs random uniform prior. See the paper for more details.
-
apply_weights
(parameters, custom_feature_space=False, custom_feature_matrix=None)¶ This method applies weights to individual parts of the feature space.
- Parameters
parameters (np.array) – a vector of real-valued parameters (a solution, i.e. an individual)
custom_feature_space (bool) – Custom feature space, relevant during making of predictions.
- Return np.array tmp_space
Temporary weighted space (individual)
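The idea of weighting individual parts of a feature space can be sketched on a small dense matrix. This is a hedged illustration rather than the library's code: the function `apply_block_weights`, the `(start, end)` column intervals and the dense list-of-lists layout are all assumptions made for the example.

```python
def apply_block_weights(feature_matrix, intervals, weights):
    """Scale each column block [start, end) of a dense matrix by its weight."""
    weighted = [list(row) for row in feature_matrix]  # copy, keep the input intact
    for (start, end), weight in zip(intervals, weights):
        for row in weighted:
            for column in range(start, end):
                row[column] *= weight
    return weighted


matrix = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
# Hypothetical layout: columns 0-1 hold one feature type, column 2 another.
weighted = apply_block_weights(matrix, [(0, 2), (2, 3)], [0.5, 2.0])
print(weighted)  # [[0.5, 1.0, 6.0], [2.0, 2.5, 12.0]]
```

In this picture an individual is simply the vector of block weights, so evaluating an individual amounts to rescaling the blocks and training a learner on the result.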
-
cross_val_scores
(tmp_feature_space, final_run=False)¶ Compute the learnability of the representation.
- Parameters
tmp_feature_space (np.array) – An individual’s solution space.
final_run (bool) – Last run is more extensive.
- Return float performance_score, clf
F1 performance and the learned learner.
-
evaluate_fitness
(individual, max_num_feat=1000, return_clf_and_vec=False)¶ A helper method for evaluating an individual solution. Given a real-valued vector, this constructs the representations and evaluates a given learner.
- Parameters
individual (np.array) – an individual (solution)
max_num_feat (int) – maximum number of features that are outputted
return_clf_and_vec (bool) – return learner and vectorizer? This is useful for deployment.
- Return float score
The fitness score.
-
generate_and_update_stats
(fits)¶ A helper method for generating stats.
- Parameters
fits (list) – fitness values of the current population
- Return float meanScore
The mean of the fitnesses
-
report_performance
(fits, gen=0)¶ A helper method for performance reports.
- Parameters
fits (np.array) – fitness values (vector of floats)
gen (int) – generation to be reported (int)
-
get_feature_space
()¶ Extract final feature space considered for learning purposes.
-
predict_proba
(instances)¶ Predict on new instances. Note that the prediction is actually a maxvote across the hall-of-fame.
- Parameters
instances (list/pd.Series) – New instances (texts) to predict labels for.
-
probability_extraction
(pred_matrix)¶ Predict probabilities for individual classes. Probabilities are computed as the proportions of learners that predicted a particular label.
- Parameters
pred_matrix (np.array) – Matrix of predictions.
- Return pd.DataFrame prob_df
A DataFrame of probabilities for each class.
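Turning votes into proportions can be sketched as follows. The sketch assumes a matrix with one row per instance and one column per learner, and it returns plain dictionaries instead of a DataFrame; the name `vote_proportions` is hypothetical.

```python
from collections import Counter


def vote_proportions(pred_matrix, labels):
    """For each row (instance), return the share of learners voting for each label."""
    probabilities = []
    for row in pred_matrix:
        counts = Counter(row)
        probabilities.append({label: counts[label] / len(row) for label in labels})
    return probabilities


# Three instances, four learners (one column per learner; an assumed layout).
predictions = [[0, 0, 1, 0], [1, 1, 1, 1], [0, 1, 0, 1]]
probabilities = vote_proportions(predictions, labels=[0, 1])
print(probabilities[0])  # {0: 0.75, 1: 0.25}
```

With a hall of fame of size one, such proportions can only be 0 or 1, so they become informative only when several learners vote.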
-
transform
(instances)¶ Generate only the representations (obtain a feature matrix subject to evolution in autoBOT)
- Parameters
instances (list/pd.DataFrame) – A collection of instances to be transformed into feature matrix.
- Return sparseMatrix output_representation
Representation of the documents.
-
predict
(instances)¶ Predict on new instances. Note that the prediction is actually a maxvote across the hall-of-fame.
- Parameters
instances (list/pd.Series) – New instances (texts) to predict labels for.
- Return np.array all_predictions
Vector of predictions (decoded)
-
mode_pred
(prediction_matrix)¶ Obtain most frequent elements for each row.
- Parameters
prediction_matrix (np.array) – Matrix of predictions.
- Return np.array prediction_vector
Vector of aggregate predictions.
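Taking the most frequent element of each row can be sketched with the standard library. The function name `row_modes` is hypothetical, and the tie-breaking rule (the value seen first in the row wins) is a property of this sketch, not a documented behaviour of the library.

```python
from collections import Counter


def row_modes(prediction_matrix):
    """Return the most frequent value of each row; ties go to the value seen first."""
    return [Counter(row).most_common(1)[0][0] for row in prediction_matrix]


votes = [["a", "b", "a"], ["b", "b", "a"], ["c", "a", "b"]]
majority = row_modes(votes)
print(majority)  # ['a', 'b', 'c']
```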
-
summarise_final_learners
()¶
-
generate_id_intervals
()¶ Generate independent intervals.
-
get_feature_importance_report
(individual, fitnesses)¶ Report feature importances.
- Parameters
individual (np.array) – an individual solution (a vector of floats)
fitnesses (list) – fitness space (list of reals)
-
mutReg
(individual, p=1)¶ Custom mutation operator used for regularization optimization.
- Parameters
individual – individual (vector of floats)
- Return individual
An individual solution.
-
update_intermediary_feature_space
(custom_space=None)¶ Create the subset of the original feature space based on the starting_feature_numbers vector that gets evolved.
-
visualize_learners
(learner_dataframe, image_path)¶ A generic hyperparameter visualization method. This helps the user with understanding of the overall optimization.
- Parameters
learner_dataframe (pd.DataFrame) – The learner dataframe.
image_path (str) – The output file’s path.
- Returns
None
-
visualize_global_importances
(importances_object, job_id, output_folder)¶
-
generate_report
(output_folder='./report', job_id='genericJobId')¶ An auxiliary method for creating a report.
- Parameters
output_folder (string) – The folder containing the report
job_id (string) – The identifier of a given job
- Returns
None
-
instantiate_validation_env
()¶ This method refreshes the feature space. This is needed to maximize efficiency.
-
feature_type_importances
(solution_index=0)¶ A method which prints feature type importances as a pandas df.
- Parameters
solution_index – Index of the individual (solution) to inspect.
- Return feature_ranking
Final table of rankings
-
get_topic_explanation
()¶ A method for extracting the key topics.
- Return pd.DataFrame topicList
A list of topic-id tuples.
-
visualize_fitness
(image_path='fitnessExample.png')¶ A method for visualizing fitness.
- Parameters
image_path – Path to file, ending denotes file type. If set to None, only DataFrame of statistics is returned.
- Return dfx
DataFrame of evolution evaluations
-
store_top_solutions
()¶ A method for storing the HOF
-
load_top_solutions
()¶ Load the top solutions as HOF
-
evolve
(nind=10, crossover_proba=0.4, mutpb=0.15, stopping_interval=20, strategy='evolution', representation_step_only=False)¶ The core evolution method. First, constrain the maximum number of features to be taken into account by lowering the bound with respect to performance. Next, evolve.
- Parameters
nind (int) – number of individuals
crossover_proba (float) – crossover probability
mutpb (float) – mutation probability
stopping_interval (int) – stopping interval: how long no improvement is tolerated before a hard reset
strategy (str) – type of evolution
representation_step_only (bool) – Learn only the feature transformations, skip the evolution. Suitable for custom experiments with transform()
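How the parameters above interact can be illustrated with a toy genetic algorithm. This is a generic sketch and not the library's evolution loop: the function `toy_evolve`, the tournament selection, the one-point crossover, the single-weight mutation, the `generations` and `seed` arguments, and the reading of `stopping_interval` as a number of generations are all assumptions made for illustration.

```python
import random


def toy_evolve(fitness, n_weights, nind=10, crossover_proba=0.4, mutpb=0.15,
               stopping_interval=20, generations=50, seed=0):
    """Maximize `fitness` over weight vectors in [0, 1] with a very small GA."""
    rng = random.Random(seed)

    def random_individual():
        return [rng.random() for _ in range(n_weights)]

    population = [random_individual() for _ in range(nind)]
    best = list(max(population, key=fitness))
    stale = 0  # generations without improvement (assumed unit)

    for _ in range(generations):
        # Tournament selection: the better of two random individuals survives.
        offspring = [list(max(rng.sample(population, 2), key=fitness))
                     for _ in range(nind)]
        # One-point crossover on neighbouring pairs.
        for first, second in zip(offspring[::2], offspring[1::2]):
            if n_weights > 1 and rng.random() < crossover_proba:
                cut = rng.randrange(1, n_weights)
                first[cut:], second[cut:] = second[cut:], first[cut:]
        # Mutation: resample a single weight.
        for individual in offspring:
            if rng.random() < mutpb:
                individual[rng.randrange(n_weights)] = rng.random()
        population = offspring

        challenger = max(population, key=fitness)
        if fitness(challenger) > fitness(best):
            best, stale = list(challenger), 0
        else:
            stale += 1
        if stale >= stopping_interval:
            # Hard reset: fresh random population, but keep the best individual.
            population = [random_individual() for _ in range(nind - 1)] + [list(best)]
            stale = 0
    return best


best_weights = toy_evolve(fitness=sum, n_weights=4)
```

In autoBOT the fitness of an individual comes from evaluating a learner on the weighted feature space, which is far more expensive than the `sum` used here; this is why the real evolution is bounded by a time constraint instead of a generation count.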
-
autoBOTLib.optimization.optimization_feature_constructors module
AutoBOT. Skrlj et al. 2021
-
autoBOTLib.optimization.optimization_feature_constructors.
remove_punctuation
(text)¶ This method removes punctuation
-
autoBOTLib.optimization.optimization_feature_constructors.
remove_stopwords
(text)¶ This method removes stopwords
- Parameters
text (list/pd.Series) – Input string of text
- Return str string
Preprocessed text
-
autoBOTLib.optimization.optimization_feature_constructors.
remove_mentions
(text, replace_token)¶ This method removes mentions (relevant for tweets)
- Parameters
text (str) – Input string of text
replace_token (str) – A token to be replaced
- Return str string
A new text string
This method removes hashtags
- Parameters
text (str) – Input string of text
replace_token (str) – The token to be replaced
- Return str string
A new text
-
autoBOTLib.optimization.optimization_feature_constructors.
remove_url
(text, replace_token)¶ Removal of URLs
- Parameters
text (str) – Input string of text
replace_token (str) – The token to be replaced
- Return str string
A new text
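The token-replacement helpers above can be approximated with regular expressions. This sketch assumes that `replace_token` is the string substituted for every match; the patterns and the helper `replace_pattern` are hypothetical simplifications and may differ from the expressions the library actually uses.

```python
import re

# Simplified, assumed patterns for tweets.
MENTION_PATTERN = re.compile(r"@\w+")
HASHTAG_PATTERN = re.compile(r"#\w+")
URL_PATTERN = re.compile(r"https?://\S+")


def replace_pattern(text, pattern, replace_token):
    """Replace every match of `pattern` in `text` with `replace_token`."""
    return pattern.sub(replace_token, text)


tweet = "@ana check https://example.org #nlp"
cleaned = replace_pattern(tweet, MENTION_PATTERN, "MENTION")
cleaned = replace_pattern(cleaned, URL_PATTERN, "URL")
cleaned = replace_pattern(cleaned, HASHTAG_PATTERN, "HASHTAG")
print(cleaned)  # MENTION check URL HASHTAG
```

Replacing with a placeholder token, as opposed to deleting the match, keeps the information that a mention, hashtag or URL was present, which can itself be a useful feature.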
-
autoBOTLib.optimization.optimization_feature_constructors.
get_affix
(text)¶ This method gets the affix information
- Parameters
text (str) – Input text.
This method yields pos tags
- Parameters
text (str) – Input string of text
- Return str string
space delimited pos tags.
-
autoBOTLib.optimization.optimization_feature_constructors.
ttr
(text)¶ Type-token ratio: the number of unique tokens divided by the overall number of tokens
- Parameters
text (str) – Input string of text
- Return float floatValue
Ratio of the unique/overall tokens
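The ratio of unique to overall tokens can be computed as in the sketch below. Whitespace tokenization and the value returned for empty input are assumptions of this example; the function name `type_token_ratio` is hypothetical.

```python
def type_token_ratio(text):
    """Return the number of unique tokens divided by the number of all tokens."""
    tokens = text.split()  # whitespace tokenization is an assumption
    if not tokens:
        return 0.0  # assumed convention for empty input
    return len(set(tokens)) / len(tokens)


# Six tokens, four of them unique ("the" and "cat" repeat).
ratio = type_token_ratio("the cat saw the other cat")
```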
-
class
autoBOTLib.optimization.optimization_feature_constructors.
text_col
(key)¶ Bases:
sklearn.base.BaseEstimator
,sklearn.base.TransformerMixin
A helper processor class
- Parameters
BaseEstimator (obj) – Core estimator
TransformerMixin (obj) – Transformer object
- Return obj object
Returns particular text column
-
__init__
(key)¶ Initialize self. See help(type(self)) for accurate signature.
-
fit
(x, y=None)¶
-
transform
(data_dict)¶
-
class
autoBOTLib.optimization.optimization_feature_constructors.
digit_col
¶ Bases:
sklearn.base.BaseEstimator
,sklearn.base.TransformerMixin
Dealing with numeric features
- Parameters
BaseEstimator (obj) – Core estimator
TransformerMixin (obj) – Transformer object
- Return obj object
Returns transformed (scaled) space
-
fit
(x, y=None)¶
-
transform
(hd_searches)¶
-
autoBOTLib.optimization.optimization_feature_constructors.
parallelize
(data, method)¶ Helper method for parallelization
- Parameters
data (pd.DataFrame) – Input data to be transformed
method (obj) – The method to parallelize
- Return pd.DataFrame data
Returns the transformed data
-
autoBOTLib.optimization.optimization_feature_constructors.
build_dataframe
(data_docs)¶ One of the core methods responsible for construction of a dataframe object.
- Parameters
data_docs (list/pd.Series) – The input data documents
- Return pd.DataFrame df_data
A dataframe corresponding to text representations
-
class
autoBOTLib.optimization.optimization_feature_constructors.
FeaturePrunner
(max_num_feat=2048)¶ Bases:
object
Core class describing sentence embedding methodology employed here.
-
__init__
(max_num_feat=2048)¶ Initialize self. See help(type(self)) for accurate signature.
-
fit
(input_data, y=None)¶
-
transform
(input_data)¶
-
get_feature_names
()¶
-
-
autoBOTLib.optimization.optimization_feature_constructors.
fast_screening_sgd
(training, targets)¶
-
autoBOTLib.optimization.optimization_feature_constructors.
get_subset
(indice_list, data_matrix, vectorizer)¶
-
autoBOTLib.optimization.optimization_feature_constructors.
get_simple_features
(df_data, max_num_feat=10000)¶
-
autoBOTLib.optimization.optimization_feature_constructors.
get_features
(df_data, representation_type='neurosymbolic', targets=None, sparsity=0.1, embedding_dim=512, memory_location='memory', custom_pipeline=None, random_seed=54324, normalization_norm='l2', contextual_model='all-mpnet-base-v2', combine_with_existing_representation=False)¶ Method that computes various TF-IDF-like features.
- Parameters
df_data (list/pd.Series) – The input collection of texts
representation_type (str) – Type of representation to be used.
targets (list/np.array) – The target space (optional)
sparsity (float) – The hyperparameter determining the dimensionalities of separate subspaces
normalization_norm (str) – The normalization of each subspace
embedding_dim (int) – The latent dimension for doc. embeddings
memory_location (str) – Location of the gzipped ConceptNet-like memory.
custom_pipeline (obj) – Custom pipeline to be used for features if needed.
contextual_model (str) – The language model string compatible with sentence-transformers library (this is in beta)
random_seed (int) – The seed for the pseudo-random parts.
combine_with_existing_representation (bool) – Whether to use existing representations + user-specified ones.
- Return obj/list/matrix
Transformer pipeline, feature names and the feature matrix.
autoBOTLib.optimization.optimization_metrics module
-
autoBOTLib.optimization.optimization_metrics.
get_metric_report
(y_true, y_prediction)¶ A generic metric report; suitable for multiobjective experiments (not the core paper)
autoBOTLib.optimization.optimization_random module
autoBOTLib.optimization.optimization_utils module
-
class
autoBOTLib.optimization.optimization_utils.
DataProcessor
¶ Bases:
object
Base class for data converters for sequence classification data sets.
-
get_labels
()¶ Gets the list of labels for this data set.
-
read_pandas_tsv
(input_file)¶
-
-
class
autoBOTLib.optimization.optimization_utils.
genericProcessor
¶ Bases:
autoBOTLib.optimization.optimization_utils.DataProcessor
-
get_train_examples
(data_dir)¶ See base class.
-
get_dev_examples
(data_dir)¶ See base class.
-
get_test_examples
(data_dir)¶ See base class.
-
-
autoBOTLib.optimization.optimization_utils.
simple_accuracy
(preds, labels)¶
-
autoBOTLib.optimization.optimization_utils.
acc_and_f1
(preds, labels, average=None)¶
-
autoBOTLib.optimization.optimization_utils.
pearson_and_spearman
(preds, labels)¶
-
autoBOTLib.optimization.optimization_utils.
compute_metrics
(task_name, preds, labels)¶