autoBOTLib.features package

Submodules

autoBOTLib.features.features_concepts module

class autoBOTLib.features.features_concepts.ConceptFeatures(max_features=10000, targets=None, knowledge_graph='../memory')

Bases: object

Core class implementing the concept-based (knowledge graph) feature construction employed here.

__init__(max_features=10000, targets=None, knowledge_graph='../memory')

Initialize self. See help(type(self)) for accurate signature.

get_grounded_from_path(present_tokens, graph_path)

Method which performs a very simple term grounding: it checks whether both terms of a relation are present in the corpus.

Parameters
  • present_tokens – The present tokens

  • graph_path – Path to the triplet base (compressed)

add_triplet(tokens, index, relations=['is_a'])
concept_graph(document_space, graph_path)

If no prior knowledge graph is supplied, one is constructed.

Parameters
  • document_space – The list of input documents

  • graph_path – The path of the knowledge graph used

Return grounded

Grounded relations.

get_propositionalized_rep(documents)

The method for constructing the representation.

Parameters

documents – The input list of documents.

fit(text_vector, refit=False, knowledge_graph=None)

Fit the model to a text vector.

Parameters

text_vector – Input list of documents.

transform(text_vector, use_conc_docs=False)

Transform the data into suitable form.

get_feature_names()
fit_transform(text_vector, b=None)

A classic fit-transform method.

Parameters

text_vector – The input list of documents.

Return transformedObj

The input texts transformed into the feature space.
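A minimal usage sketch of the class (the toy corpus and max_features value are illustrative; the knowledge-graph path follows the constructor default):

from autoBOTLib.features.features_concepts import ConceptFeatures

# Toy corpus; any list of raw strings works.
documents = [
    "A dog is an animal that barks.",
    "Cats and dogs are common household pets.",
]

# If no triplet base is found at the given path, a knowledge graph is
# constructed from the documents themselves (see concept_graph above).
concept_features = ConceptFeatures(max_features=1000,
                                   knowledge_graph='../memory')

feature_matrix = concept_features.fit_transform(documents)
feature_names = concept_features.get_feature_names()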

autoBOTLib.features.features_contextual module

class autoBOTLib.features.features_contextual.ContextualDocs(model='all-mpnet-base-v2')

Bases: object

__init__(model='all-mpnet-base-v2')

Class initialization method.

Parameters

model – The sentence-transformer model

fit(documents)
Parameters

documents – The input set of documents.

transform(documents)
Parameters

documents – The input set of documents.

fit_transform(documents, b=None)
Parameters

documents – The input set of documents.

get_feature_names()
Return fnames

Feature names (custom API artefact)
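A minimal usage sketch (the documents are illustrative; the default 'all-mpnet-base-v2' model is downloaded by the sentence-transformers backend on first use):

from autoBOTLib.features.features_contextual import ContextualDocs

documents = [
    "This is the first document.",
    "And here is a second, rather different one.",
]

embedder = ContextualDocs(model='all-mpnet-base-v2')

# Each document is encoded into a fixed-size dense vector.
embeddings = embedder.fit_transform(documents)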

autoBOTLib.features.features_contextual_supervised module

autoBOTLib.features.features_document_graph module

class autoBOTLib.features.features_document_graph.RelationalDocs(ndim=128, random_seed=1965123, targets=None, ed_cutoff=-2, verbose=True, neigh_size=None, doc_limit=4096, percentile_threshold=95)

Bases: object

__init__(ndim=128, random_seed=1965123, targets=None, ed_cutoff=-2, verbose=True, neigh_size=None, doc_limit=4096, percentile_threshold=95)

Class initialization method.

Parameters
  • ndim – Number of latent dimensions

  • targets – The target vector

  • random_seed – The random seed used

  • ed_cutoff – Cutoff for fuzzy string matching when comparing documents

  • doc_limit – The max number of documents to be considered.

  • verbose – Whether to print progress information

jaccard_index(set1, set2)

The classic Jaccard index.

Parameters
  • set1 – First set

  • set2 – Second set

Return JaccardIndex
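The classic definition is |set1 ∩ set2| / |set1 ∪ set2|. A short plain-Python illustration (the sets are made up):

set1 = {"dog", "cat", "bird"}
set2 = {"dog", "fish"}

# Intersection has 1 element, union has 4, so the index is 0.25.
jaccard = len(set1 & set2) / len(set1 | set2)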

fit(text_list)

The fit method.

Parameters

text_list – List of input texts

transform(new_documents)

Transform method.

Parameters

new_documents – The new set of documents to be transformed.

Return all_embeddings

The final embedding matrix

fit_transform(documents, b=None)

The sklearn-like fit-transform method.

get_feature_names()
get_graph(wspace, ltl)

A method to obtain a graph from a weighted space of documents.

Parameters
  • wspace – node1,node2 weight mapping

  • ltl – The number of documents

Return G

The document graph
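A minimal usage sketch (the corpus and hyperparameter values are illustrative):

from autoBOTLib.features.features_document_graph import RelationalDocs

documents = [
    "graphs connect documents that share tokens",
    "shared tokens induce edges between documents",
    "a completely unrelated piece of text",
]

relational_docs = RelationalDocs(ndim=64, doc_limit=1024, verbose=False)

# Builds the document graph and returns an embedding matrix with
# one ndim-dimensional row per document.
embedding_matrix = relational_docs.fit_transform(documents)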

autoBOTLib.features.features_keyword module

class autoBOTLib.features.features_keyword.KeywordFeatures(max_features=10000, targets=None)

Bases: object

Core class implementing the keyword-based feature construction employed here.

__init__(max_features=10000, targets=None)

Initialize self. See help(type(self)) for accurate signature.

fit(text_vector, refit=False)

Fit the model to a text vector.

Parameters

text_vector – The input list of texts

transform(text_vector)

Transform the data into suitable form.

Parameters

text_vector – The input list of texts.

Return transformedObject

The transformed input texts (feature space)

get_feature_names()
fit_transform(text_vector, b=None)

A classic fit-transform method.

Parameters

text_vector – Input list of texts.

Return transformedObject

Transformed list of texts
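A minimal usage sketch (texts and targets are illustrative; the constructor accepts an optional target vector):

from autoBOTLib.features.features_keyword import KeywordFeatures

texts = [
    "machine learning for text classification",
    "deep learning improves text classification",
    "cooking recipes for a quiet weekend",
]
targets = [1, 1, 0]

keyword_features = KeywordFeatures(max_features=500, targets=targets)
feature_matrix = keyword_features.fit_transform(texts)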

autoBOTLib.features.features_sentence_embeddings module

class autoBOTLib.features.features_sentence_embeddings.documentEmbedder(max_features=10000, num_cpu=8, dm=1, pretrained_path='doc2vec.bin', ndim=512)

Bases: object

Core class describing sentence embedding methodology employed here. The class functions as a sklearn-like object.

__init__(max_features=10000, num_cpu=8, dm=1, pretrained_path='doc2vec.bin', ndim=512)

Class initialization method.

Parameters
  • max_features – integer, the maximum number of features

  • num_cpu – integer, number of CPUs to be used

  • dm – Whether to use the “distributed memory” model

  • pretrained_path – The path where a pretrained model is located (if any)

  • ndim – Number of latent dimensions

fit(text_vector, b=None, refit=False)

Fit the model to a text vector.

Parameters

text_vector – A list of texts

transform(text_vector)

Transform the data into suitable form.

Parameters

text_vector – The text vector to be transformed via a trained model

get_feature_names()
fit_transform(text_vector, a2=None)

A classic fit-transform method.

Parameters

text_vector – A text vector used to build and transform a corpus
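A minimal usage sketch (the corpus is illustrative; this assumes no pretrained model is available at pretrained_path, so a doc2vec model is trained on the input itself):

from autoBOTLib.features.features_sentence_embeddings import documentEmbedder

texts = [
    "the quick brown fox jumps over the lazy dog",
    "a slow green turtle crawls under the busy bridge",
]

# dm=1 selects the "distributed memory" doc2vec variant.
doc_embedder = documentEmbedder(num_cpu=2, dm=1, ndim=128)
embedding_matrix = doc_embedder.fit_transform(texts)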

autoBOTLib.features.features_token_relations module

class autoBOTLib.features.features_token_relations.relationExtractor(max_features=10000, split_char='|||', witem_separator='&&&&', num_cpu=8, neighborhood_token=64, min_token='bigrams', targets=None, verbose=True)

Bases: object

The main token relation extraction class. Works for arbitrary tokens.

__init__(max_features=10000, split_char='|||', witem_separator='&&&&', num_cpu=8, neighborhood_token=64, min_token='bigrams', targets=None, verbose=True)

Initialize self. See help(type(self)) for accurate signature.

compute_distance(pair, token_dict)

The core routine for computing the index-based distance between a pair of tokens.

Parameters
  • pair – the pair of tokens

  • token_dict – distance map

Return pair[0], pair[1], dist

The two tokens and the distance

witem_kernel(instance)

A simple kernel for traversing a given document.

Parameters

instance – a piece of text

Return global_distances

Distances between tokens

fit(text_vector, b=None)

Fit the model to a text vector.

Parameters

text_vector – The input list of texts.

get_feature_names()

Return exact feature names.

transform(text_vector, custom_shape=None)

Transform the data into suitable form.

Parameters

text_vector – The input list of texts.

fit_transform(text_vector, a2)

A classic fit-transform method.

Parameters

text_vector – Input list of texts.
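A minimal usage sketch (texts are illustrative; passing None as the second positional argument of fit_transform mirrors the b=None placeholder of the sibling classes and is an assumption rather than a documented contract):

from autoBOTLib.features.features_token_relations import relationExtractor

texts = [
    "tokens that appear close together form strong relations",
    "distant tokens form weaker relations",
]

relation_extractor = relationExtractor(max_features=1000,
                                       min_token='bigrams')
feature_matrix = relation_extractor.fit_transform(texts, None)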

autoBOTLib.features.features_topic module

class autoBOTLib.features.features_topic.TopicDocs(ndim=128, random_seed=1965123, topic_tokens=8196, verbose=True)

Bases: object

__init__(ndim=128, random_seed=1965123, topic_tokens=8196, verbose=True)

Class initialization method.

Parameters
  • ndim – Number of latent dimensions

  • random_seed – The random seed used

  • topic_tokens – The number of tokens considered for the topic representation

  • verbose – Whether to print progress information

fit(text_list)

The fit method.

Parameters

text_list – List of input texts

transform(new_documents)

Transform method.

Parameters

new_documents – The new set of documents to be transformed.

Return all_embeddings

The final embedding matrix

fit_transform(documents, b=None)

The sklearn-like fit-transform method.

get_feature_names()

Get feature names.
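A minimal usage sketch (the corpus is illustrative):

from autoBOTLib.features.features_topic import TopicDocs

texts = [
    "stars and planets form in collapsing clouds of gas",
    "galaxies contain billions of stars",
    "fresh bread needs flour, water, yeast and salt",
]

topic_docs = TopicDocs(ndim=32, verbose=False)

# Each document is mapped to an ndim-dimensional topic representation.
topic_matrix = topic_docs.fit_transform(texts)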

Module contents