autoBOTLib.features package¶
Submodules¶
autoBOTLib.features.features_concepts module¶
- class autoBOTLib.features.features_concepts.ConceptFeatures(max_features=10000, targets=None, knowledge_graph='../memory')¶
Bases: object
Core class describing the concept-based feature construction employed here.
- __init__(max_features=10000, targets=None, knowledge_graph='../memory')¶
Initialize self. See help(type(self)) for accurate signature.
- get_grounded_from_path(present_tokens, graph_path)¶
Performs a very simple term grounding: a relation is kept only if both of its terms are present in the corpus.
- Parameters
present_tokens (list) – The present tokens
graph_path (str) – Path to the triplet base (compressed)
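The grounding check described above can be sketched in plain Python. The helper below is hypothetical (the actual triplet-base format is not shown in this reference): a triplet survives only when both of its terms occur in the token set built from the corpus.

```python
def ground_triplets(present_tokens, triplets):
    """Keep only triplets whose head and tail both occur in the corpus tokens."""
    token_set = set(present_tokens)
    return [(head, rel, tail) for (head, rel, tail) in triplets
            if head in token_set and tail in token_set]
```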
- add_triplet(tokens, index, relations=['is_a'])¶
- concept_graph(document_space, graph_path)¶
If no prior knowledge graph is supplied, one is constructed.
- Parameters
document_space – The list of input documents
graph_path – The path of the knowledge graph used.
- Return grounded
Grounded relations.
- get_propositionalized_rep(documents)¶
The method for constructing the representation.
- Parameters
documents – The input list of documents.
- fit(text_vector, refit=False, knowledge_graph=None)¶
Fit the model to a text vector.
- Parameters
text_vector – Input list of documents.
- transform(text_vector, use_conc_docs=False)¶
Transform the data into suitable form.
- get_feature_names()¶
- fit_transform(text_vector, b=None)¶
A classic fit-transform method.
- Parameters
text_vector – The input list of documents.
- Return transformedObj
Transformed texts (to features).
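All feature classes in this package expose the same sklearn-like contract: fit, transform, fit_transform, and get_feature_names. A minimal stdlib-only sketch of that contract, with a toy bag-of-words standing in for the actual concept features (class and attribute names here are illustrative, not part of autoBOTLib):

```python
from collections import Counter

class ToyFeatures:
    """Minimal sketch of the sklearn-like contract used by the feature classes."""

    def __init__(self, max_features=10000):
        self.max_features = max_features
        self.vocabulary_ = []

    def fit(self, text_vector, refit=False):
        # Keep the max_features most frequent tokens across all documents.
        counts = Counter(tok for doc in text_vector for tok in doc.lower().split())
        self.vocabulary_ = [w for w, _ in counts.most_common(self.max_features)]
        return self

    def transform(self, text_vector):
        # One row per document, one count column per vocabulary token.
        rows = []
        for doc in text_vector:
            toks = Counter(doc.lower().split())
            rows.append([toks[w] for w in self.vocabulary_])
        return rows

    def fit_transform(self, text_vector, b=None):
        return self.fit(text_vector).transform(text_vector)

    def get_feature_names(self):
        return self.vocabulary_
```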
autoBOTLib.features.features_contextual module¶
- class autoBOTLib.features.features_contextual.ContextualDocs(model='all-mpnet-base-v2')¶
Bases: object
- __init__(model='all-mpnet-base-v2')¶
Class initialization method.
- Parameters
model – The sentence-transformer model
- fit(documents)¶
- Parameters
documents – The input set of documents.
- transform(documents)¶
- Parameters
documents – The input set of documents.
- fit_transform(documents, b=None)¶
- Parameters
documents – The input set of documents.
- get_feature_names()¶
- Return fnames
Feature names (custom API artefact)
autoBOTLib.features.features_contextual_supervised module¶
autoBOTLib.features.features_document_graph module¶
- class autoBOTLib.features.features_document_graph.RelationalDocs(ndim=128, random_seed=1965123, targets=None, ed_cutoff=-2, verbose=True, neigh_size=None, doc_limit=4096, percentile_threshold=95)¶
Bases: object
- __init__(ndim=128, random_seed=1965123, targets=None, ed_cutoff=-2, verbose=True, neigh_size=None, doc_limit=4096, percentile_threshold=95)¶
Class initialization method.
- Parameters
ndim – Number of latent dimensions
targets – The target vector
random_seed – The random seed used
ed_cutoff – Cutoff for fuzzy string matching when comparing documents
doc_limit – The max number of documents to be considered.
verbose – Whether to print progress output
- jaccard_index(set1, set2)¶
The classic Jaccard index.
- Parameters
set1 – First set
set2 – Second set
- Return JaccardIndex
The Jaccard index of the two sets.
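The Jaccard index above is |set1 ∩ set2| / |set1 ∪ set2|; a direct stdlib sketch (the function name mirrors the method above, but this is an illustration, not the library implementation):

```python
def jaccard_index(set1, set2):
    """Classic Jaccard index: |intersection| / |union| (0.0 for two empty sets)."""
    union = set1 | set2
    if not union:
        return 0.0
    return len(set1 & set2) / len(union)
```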
- fit(text_list)¶
The fit method.
- Parameters
text_list – List of input texts
- transform(new_documents)¶
Transform method.
- Parameters
new_documents – The new set of documents to be transformed.
- Return all_embeddings
The final embedding matrix
- fit_transform(documents, b=None)¶
The sklearn-like fit-transform method.
- get_feature_names()¶
- get_graph(wspace, ltl)¶
A method to obtain a graph from a weighted space of documents.
- Parameters
wspace – A mapping from (node1, node2) pairs to edge weights
ltl – The number of documents
- Return G
The document graph
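The idea behind get_graph can be sketched with a plain adjacency dict in place of a graph library (hypothetical helper; the concrete graph type returned by the real method is not specified in this reference):

```python
def build_graph(wspace):
    """Build an undirected adjacency map from a {(node1, node2): weight} mapping."""
    graph = {}
    for (u, v), weight in wspace.items():
        # Store the edge in both directions so the graph is undirected.
        graph.setdefault(u, {})[v] = weight
        graph.setdefault(v, {})[u] = weight
    return graph
```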
autoBOTLib.features.features_keyword module¶
- class autoBOTLib.features.features_keyword.KeywordFeatures(max_features=10000, targets=None)¶
Bases: object
Core class describing the keyword-based feature construction employed here.
- __init__(max_features=10000, targets=None)¶
Initialize self. See help(type(self)) for accurate signature.
- fit(text_vector, refit=False)¶
Fit the model to a text vector.
- Parameters
text_vector – The input list of texts
- transform(text_vector)¶
Transform the data into suitable form.
- Parameters
text_vector – The input list of texts.
- Return transformedObject
The transformed input texts (feature space)
- get_feature_names()¶
- fit_transform(text_vector, b=None)¶
A classic fit-transform method.
- Parameters
text_vector – Input list of texts.
- Return transformedObject
Transformed list of texts
autoBOTLib.features.features_sentence_embeddings module¶
- class autoBOTLib.features.features_sentence_embeddings.documentEmbedder(max_features=10000, num_cpu=8, dm=1, pretrained_path='doc2vec.bin', ndim=512)¶
Bases: object
Core class describing sentence embedding methodology employed here. The class functions as a sklearn-like object.
- __init__(max_features=10000, num_cpu=8, dm=1, pretrained_path='doc2vec.bin', ndim=512)¶
Class initialization method.
- Parameters
max_features – integer, number of latent dimensions
num_cpu – integer, number of CPUs to be used
dm – Whether to use the “distributed memory” model
pretrained_path – The path where a pretrained model is located (if any)
- fit(text_vector, b=None, refit=False)¶
Fit the model to a text vector.
- Parameters
text_vector – A list of texts
- transform(text_vector)¶
Transform the data into suitable form.
- Parameters
text_vector – The text vector to be transformed via a trained model
- get_feature_names()¶
- fit_transform(text_vector, a2=None)¶
A classic fit-transform method.
- Parameters
text_vector – A text vector used to build and transform a corpus.
autoBOTLib.features.features_token_relations module¶
- class autoBOTLib.features.features_token_relations.relationExtractor(max_features=10000, split_char='|||', witem_separator='&&&&', num_cpu=8, neighborhood_token=64, min_token='bigrams', targets=None, verbose=True)¶
Bases: object
The main token relation extraction class. Works for arbitrary tokens.
- __init__(max_features=10000, split_char='|||', witem_separator='&&&&', num_cpu=8, neighborhood_token=64, min_token='bigrams', targets=None, verbose=True)¶
Initialize self. See help(type(self)) for accurate signature.
- compute_distance(pair, token_dict)¶
The core routine for computing index-based distances between tokens.
- Parameters
pair – The pair of tokens
token_dict – Distance map
- Return pair[0], pair[1], dist
The two tokens and the distance
- witem_kernel(instance)¶
A simple kernel for traversing a given document.
- Parameters
instance – A piece of text
- Return global_distances
Distances between tokens
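The index-based distance idea behind compute_distance and witem_kernel can be sketched as follows (hypothetical helper, not the library code): for each pair of distinct tokens in a document, record the absolute difference of their first positions.

```python
from itertools import combinations

def token_distances(instance):
    """Map each unordered token pair to the absolute difference of their first positions."""
    first_pos = {}
    for idx, tok in enumerate(instance.split()):
        first_pos.setdefault(tok, idx)  # keep the first occurrence only
    return {(a, b): abs(first_pos[a] - first_pos[b])
            for a, b in combinations(sorted(first_pos), 2)}
```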
- fit(text_vector, b=None)¶
Fit the model to a text vector.
- Parameters
text_vector – The input list of texts.
- get_feature_names()¶
Return exact feature names.
- transform(text_vector, custom_shape=None)¶
Transform the data into suitable form.
- Parameters
text_vector – The input list of texts.
- fit_transform(text_vector, a2)¶
A classic fit-transform method.
- Parameters
text_vector – Input list of texts.
autoBOTLib.features.features_topic module¶
- class autoBOTLib.features.features_topic.TopicDocs(ndim=128, random_seed=1965123, topic_tokens=8196, verbose=True)¶
Bases: object
- __init__(ndim=128, random_seed=1965123, topic_tokens=8196, verbose=True)¶
Class initialization method.
- Parameters
ndim – Number of latent dimensions
random_seed – The random seed used
verbose – Whether to print progress output
- fit(text_list)¶
The fit method.
- Parameters
text_list – List of input texts
- transform(new_documents)¶
Transform method.
- Parameters
new_documents – The new set of documents to be transformed.
- Return all_embeddings
The final embedding matrix
- fit_transform(documents, b=None)¶
The sklearn-like fit-transform method.
- get_feature_names()¶
Get feature names.