Automated Feature Engineering
Graphistry's automated feature engineering converts diverse data types, such as text, numbers, and booleans, into AI-ready formats. It encodes model definitions, makes models easier to compare, and produces features that integrate with Graphistry and other machine learning libraries. Key functions include:
- featurize: Extracts features from raw datasets for machine learning compatibility.
- embed: Produces embeddings for graph data, enhancing machine learning applications.
Predefined Models
The API provides predefined models for different applications:
- Ngrams Model: For extracting Ngrams from text data.
- Topic Model: Reliable topic models for features and targets.
- Embedding Model: Sentence embeddings for text, useful for matching paraphrases.
- Search Model: For applications where the search query is shorter than the encoded documents.
- QA Model: Encodings suitable for question answering.
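To make the Ngrams Model entry above more concrete, here is a minimal pure-Python sketch of what extracting word n-grams from text means. It is not Graphistry's implementation; `word_ngrams` is a hypothetical helper written for this illustration, and the whitespace tokenization is a simplifying assumption.

```python
from collections import Counter


def word_ngrams(text, n_min=1, n_max=3):
    """Return all word n-grams of length n_min..n_max, shortest first.

    Hypothetical helper for illustration; tokenization is a plain
    lowercase whitespace split, which real models usually refine.
    """
    tokens = text.lower().split()
    grams = []
    for n in range(n_min, n_max + 1):
        # Slide a window of n tokens across the text
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


# Count unigrams and bigrams in a short sentence
counts = Counter(word_ngrams("graph neural networks embed graph data", 1, 2))
print(counts.most_common(3))
```

Counts like these are the raw material that an n-gram featurizer turns into numeric columns, typically after weighting and filtering rare or very common terms.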
Customizable Parameters
The user can customize various parameters to fine-tune the feature engineering process:
- Featurization: kind, use_scaler, cardinality_threshold, n_topics, etc.
- Scaler Options: impute, n_quantiles, encode, strategy, and more.
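The parameters listed above could be collected and passed to featurize along the following lines. This is a sketch under assumptions: the parameter names come from the list, but the values are illustrative guesses rather than recommended settings, `featurize_nodes` is a hypothetical wrapper, and the call requires Graphistry installed with its AI dependencies.

```python
# Illustrative values only (assumptions); names are taken from the list above.
FEATURIZE_PARAMS = {
    "kind": "nodes",              # featurize the node table
    "use_scaler": "robust",       # assumed scaler choice
    "cardinality_threshold": 10,
    "n_topics": 6,
}

SCALER_PARAMS = {
    "impute": True,
    "n_quantiles": 100,
    "encode": "ordinal",
    "strategy": "uniform",
}


def featurize_nodes(ndf, **overrides):
    """Hypothetical wrapper: featurize a node dataframe with the settings above.

    Keyword overrides replace the defaults defined in this sketch.
    """
    # Imported lazily so the parameter dicts can be inspected without graphistry
    import graphistry

    params = {**FEATURIZE_PARAMS, **SCALER_PARAMS, **overrides}
    return graphistry.nodes(ndf).featurize(**params)
```

Keeping the settings in plain dictionaries makes it easy to compare runs: change one value, call `featurize_nodes(ndf, n_topics=12)`, and the rest stays fixed.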
Example: Embedding a Graph with Features
To demonstrate the embedding functionality, consider the following example:
import pandas as pd
import graphistry

def embed_example():
    # Create an edge dataframe with source, destination, and relationship columns
    edf = pd.DataFrame([[0, 1, 0], [1, 2, 0], [2, 0, 1]], columns=['src', 'dst', 'rel'])
    # Create a node dataframe without explicit node IDs but with features
    ndf_no_ids = pd.DataFrame([['a'], ['a'], ['b']], columns=['feat'])
    # Create a graph from the edge dataframe
    graph_no_feat = graphistry.edges(edf, 'src', 'dst')
    # Add node features to the graph using the node dataframe
    graph_with_feat_no_ids = graph_no_feat.nodes(ndf_no_ids)
    # Embed the graph using 'rel' as the relation column and set the embedding dimension
    embedding_dim = 4
    # Small settings keep this toy example quick to run
    kwargs = {'n_topics': 6, 'cardinality_threshold': 10, 'epochs': 1, 'sample_size': 10, 'num_steps': 10}
    embedded_graph = graph_with_feat_no_ids.embed('rel', embedding_dim=embedding_dim, **kwargs)
    # For illustration, return both the original and the embedded graph
    return graph_with_feat_no_ids, embedded_graph
Then call the embed_example function:
original_graph, embedded_graph = embed_example()