Automated Feature Engineering

Graphistry's Automated Feature Engineering tool converts diverse data types, such as text, numbers, and booleans, into AI-ready formats. It encodes model definitions so that models can be compared consistently, and it produces features that can be used with Graphistry and other machine learning libraries. Key functions include:

  • featurize: Extracts features from raw datasets for machine learning compatibility.
  • embed: Produces embeddings for graph data, enhancing machine learning applications.

Predefined Models

The API provides predefined models for different applications:

  • Ngrams Model: For extracting Ngrams from text data.
  • Topic Model: Reliable topic models for features and targets.
  • Embedding Model: Encodings for text where paraphrases should be treated as similar.
  • Search Model: For applications where search input is smaller than the encoded documents.
  • QA Model: Encodings suitable for question answering.
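
As a rough illustration of what n-gram features look like, the sketch below extracts word n-grams from a few short documents and builds one count vector per document. It is a minimal standard-library sketch, not Graphistry's implementation: the helper names word_ngrams and ngram_counts, the whitespace tokenizer, and the sample documents are all assumptions made for this example.

```python
from collections import Counter


def word_ngrams(text, n_min=1, n_max=2):
    """Return all word n-grams of length n_min..n_max (hypothetical helper)."""
    # Naive tokenizer: lowercase and split on whitespace.
    tokens = text.lower().split()
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def ngram_counts(docs, n_min=1, n_max=2):
    """Build a sorted vocabulary and one count vector per document."""
    per_doc = [Counter(word_ngrams(d, n_min, n_max)) for d in docs]
    vocab = sorted(set().union(*per_doc))
    # Counter returns 0 for n-grams a document does not contain.
    matrix = [[counts[g] for g in vocab] for counts in per_doc]
    return vocab, matrix


# Illustrative data only.
docs = ["graph data is data", "graph embeddings"]
vocab, matrix = ngram_counts(docs)
```

Each row of matrix lines up with vocab, so the result can be read as a small document-by-n-gram feature table.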

Customizable Parameters

You can customize several parameters to fine-tune the feature engineering process:

  • Featurization: kind, use_scaler, cardinality_threshold, n_topics, etc.
  • Scaler Options: impute, n_quantiles, encode, strategy, and more.
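
To make the role of a parameter such as cardinality_threshold more concrete, the sketch below picks an encoding for each column based on its type and its number of distinct values. It assumes that the threshold separates low-cardinality categorical columns (one-hot encoded) from high-cardinality or free-text columns (handled by a topic or text encoder). That reading, the boundary condition, and the helper names plan_encoding and one_hot are assumptions for illustration, not the library's confirmed behavior.

```python
def one_hot(values):
    """One-hot encode a column; returns (categories, rows). Hypothetical helper."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows


def plan_encoding(columns, cardinality_threshold=10):
    """Choose an encoding per column (assumed rule, for illustration only)."""
    plan = {}
    for name, values in columns.items():
        # bool is a subclass of int in Python, so test for it first.
        if all(isinstance(v, bool) for v in values):
            plan[name] = "boolean"
        elif all(isinstance(v, (int, float)) for v in values):
            plan[name] = "numeric"
        elif len(set(values)) <= cardinality_threshold:
            plan[name] = "one_hot"
        else:
            plan[name] = "topic_or_text"
    return plan


# Illustrative data only.
columns = {
    "feat": ["a", "a", "b"],
    "score": [0.5, 1.0, 2.0],
    "active": [True, False, True],
    "note": ["first note", "second note", "third note"],
}
plan = plan_encoding(columns, cardinality_threshold=2)
categories, encoded = one_hot(columns["feat"])
```

With a threshold of 2, the two-valued feat column is one-hot encoded, while the three distinct note strings fall through to the text branch. Raising the threshold moves more columns into the one-hot branch.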

Example: Embedding a Graph with Features

To demonstrate the embedding functionality, consider the following example:


    import pandas as pd
    import graphistry

    def embed_example():
        # Create an edge dataframe with source, destination, and relationship columns
        edf = pd.DataFrame([[0, 1, 0], [1, 2, 0], [2, 0, 1]], columns=['src', 'dst', 'rel'])

        # Create a node dataframe with a feature column but no explicit node IDs
        ndf_no_ids = pd.DataFrame([['a'], ['a'], ['b']], columns=['feat'])

        # Create a graph from the edge dataframe
        graph_no_feat = graphistry.edges(edf, 'src', 'dst')

        # Attach the node dataframe, and with it the node features, to the graph
        graph_with_feat_no_ids = graph_no_feat.nodes(ndf_no_ids)

        # Embed the graph, using the 'rel' column as the relation and a 4-dimensional embedding
        embedding_dim = 4
        kwargs = {'n_topics': 6, 'cardinality_threshold': 10, 'epochs': 1, 'sample_size': 10, 'num_steps': 10}
        embedded_graph = graph_with_feat_no_ids.embed('rel', embedding_dim=embedding_dim, **kwargs)

        # For illustration, return both the original and the embedded graph
        return graph_with_feat_no_ids, embedded_graph


Then, call the embed_example function:


    original_graph, embedded_graph = embed_example()