How to Run Random Walk Algorithms

Goal: Generate network embeddings and node representations using random walk algorithms.

Prerequisites: A loaded network (see How to Load and Build Networks).

Note

Where to find this data

Examples in this guide create networks programmatically for clarity. You can also:

  • Use built-in generators: from py3plex.algorithms import random_generators

  • Load from repository files: datasets/multiedgelist.txt

  • Fetch real-world datasets: from py3plex.datasets import fetch_multilayer

Node2Vec Embeddings

Node2Vec generates vector representations of nodes by simulating biased random walks:

from py3plex.core import multinet
from py3plex.wrappers import train_node2vec

# Create a sample network
network = multinet.multi_layer_network()
network.add_edges([
    ['Alice', 'friends', 'Bob', 'friends', 1],
    ['Bob', 'friends', 'Charlie', 'friends', 1],
    ['Alice', 'colleagues', 'Charlie', 'colleagues', 1],
], input_type="list")

# Train Node2Vec
embeddings = train_node2vec(
    network,
    dimensions=128,    # Embedding dimensionality
    walk_length=80,    # Length of each walk
    num_walks=10,      # Walks per node
    p=1.0,            # Return parameter
    q=1.0,            # In-out parameter
    workers=4
)

# Access embeddings
node = ('Alice', 'friends')
vector = embeddings[node]
print(f"Embedding dimension: {len(vector)}")

Expected output:

Embedding dimension: 128

The p and q parameters control the walk behavior (see the sketch below):

  • p (return parameter): Controls how likely a walk is to revisit the node it just came from; lower values make walks backtrack more often.

  • q (in-out parameter): Controls exploration versus locality; q < 1 biases walks outward (DFS-like), while q > 1 keeps them near the starting neighborhood (BFS-like).
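
For example, q can be varied to produce more exploratory (DFS-like) or more local (BFS-like) walks. A minimal sketch reusing the network defined above; the parameter values here are illustrative, not tuned:

# DFS-like walks: q < 1 biases the walk away from the starting neighborhood
exploratory_embeddings = train_node2vec(
    network,
    dimensions=128,
    walk_length=80,
    num_walks=10,
    p=1.0,
    q=0.5,
    workers=4
)

# BFS-like walks: q > 1 keeps the walk close to the starting neighborhood
local_embeddings = train_node2vec(
    network,
    dimensions=128,
    walk_length=80,
    num_walks=10,
    p=1.0,
    q=2.0,
    workers=4
)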

DeepWalk Embeddings

DeepWalk is a special case of Node2Vec with p=1, q=1:

from py3plex.wrappers import train_deepwalk

embeddings = train_deepwalk(
    network,
    dimensions=128,
    walk_length=80,
    num_walks=10,
    workers=4
)
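
Equivalently, the same unbiased walks can be obtained through the Node2Vec wrapper by fixing both bias parameters to 1; a short sketch:

# Equivalent to DeepWalk: p = q = 1 yields unbiased walks
embeddings = train_node2vec(
    network,
    dimensions=128,
    walk_length=80,
    num_walks=10,
    p=1.0,
    q=1.0,
    workers=4
)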

Using Embeddings for Downstream Tasks

Node Classification

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
import numpy as np

# Prepare data
nodes = list(embeddings.keys())
X = np.array([embeddings[node] for node in nodes])

# Assuming you have labels; get_label is a placeholder (see the sketch after this example)
y = np.array([get_label(node) for node in nodes])

# Train classifier
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

clf = LogisticRegression(max_iter=1000)  # higher max_iter avoids convergence warnings
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)
print(f"Classification accuracy: {accuracy:.2%}")

Node Clustering

from sklearn.cluster import KMeans

# Cluster nodes based on embeddings
nodes = list(embeddings.keys())
X = np.array([embeddings[node] for node in nodes])

kmeans = KMeans(n_clusters=5, random_state=42)
clusters = kmeans.fit_predict(X)

# Map nodes to clusters
node_clusters = dict(zip(nodes, clusters))

print("Cluster assignments:")
for node, cluster in list(node_clusters.items())[:10]:
    print(f"{node} → Cluster {cluster}")

Visualizing Embeddings

Use dimensionality reduction (for example, t-SNE) to project embeddings into 2D:

from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# Reduce to 2D
nodes = list(embeddings.keys())
X = np.array([embeddings[node] for node in nodes])

tsne = TSNE(n_components=2, random_state=42)
X_2d = tsne.fit_transform(X)

# Plot
plt.figure(figsize=(12, 8))
plt.scatter(X_2d[:, 0], X_2d[:, 1], alpha=0.5)

# Label a few nodes
for i, node in enumerate(nodes[:20]):
    plt.annotate(
        str(node),
        (X_2d[i, 0], X_2d[i, 1]),
        fontsize=8
    )

plt.title('Node Embeddings (t-SNE)')
plt.savefig('embeddings_2d.png', dpi=300, bbox_inches='tight')
plt.show()
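
If the embedding keys are (node, layer) tuples, coloring the scatter plot by layer often reveals layer structure; a minimal sketch assuming the nodes and X_2d arrays computed above:

# Color points by layer (second element of each (node, layer) tuple)
layers = [node[1] for node in nodes]
unique_layers = sorted(set(layers))
layer_to_color = {layer: i for i, layer in enumerate(unique_layers)}

plt.figure(figsize=(12, 8))
plt.scatter(
    X_2d[:, 0], X_2d[:, 1],
    c=[layer_to_color[l] for l in layers],
    cmap='tab10',
    alpha=0.6
)
plt.title('Node Embeddings by Layer (t-SNE)')
plt.savefig('embeddings_by_layer.png', dpi=300, bbox_inches='tight')
plt.show()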

Saving and Loading Embeddings

Save to File

import pickle

# Save embeddings
with open('embeddings.pkl', 'wb') as f:
    pickle.dump(embeddings, f)

# Load embeddings
with open('embeddings.pkl', 'rb') as f:
    loaded_embeddings = pickle.load(f)

Export to CSV

import pandas as pd

# Convert to DataFrame
data = []
for node, vector in embeddings.items():
    row = {'node': str(node)}
    for i, val in enumerate(vector):
        row[f'dim_{i}'] = val
    data.append(row)

df = pd.DataFrame(data)
df.to_csv('embeddings.csv', index=False)
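
To reload the CSV export as a node-to-vector dictionary, here is a minimal sketch assuming the column layout produced above (note that node keys were converted to strings on export):

import numpy as np
import pandas as pd

# Read the exported CSV back into a {node: vector} dictionary
df = pd.read_csv('embeddings.csv')
dim_cols = [c for c in df.columns if c.startswith('dim_')]

loaded = {
    row['node']: np.array([row[c] for c in dim_cols])
    for _, row in df.iterrows()
}
print(f"Reloaded {len(loaded)} embeddings")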

Parameter Tuning

Grid Search for Best Parameters

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
import numpy as np

param_grid = {
    'dimensions': [64, 128, 256],
    'walk_length': [40, 80, 120],
    'num_walks': [5, 10, 20]
}

best_score = 0
best_params = None

for dims in param_grid['dimensions']:
    for walk_len in param_grid['walk_length']:
        for num_walks in param_grid['num_walks']:
            # Train embeddings
            emb = train_node2vec(
                network,
                dimensions=dims,
                walk_length=walk_len,
                num_walks=num_walks
            )

            # Evaluate with the nodes list and labels y prepared in the classification example
            X = np.array([emb[n] for n in nodes])
            scores = cross_val_score(
                LogisticRegression(),
                X, y, cv=5
            )

            mean_score = scores.mean()
            if mean_score > best_score:
                best_score = mean_score
                best_params = {
                    'dimensions': dims,
                    'walk_length': walk_len,
                    'num_walks': num_walks
                }

print(f"Best parameters: {best_params}")
print(f"Best score: {best_score:.3f}")

Layer-Specific Embeddings

Generate embeddings for individual layers:

from py3plex.core import multinet
from py3plex.dsl import Q, L

layer_embeddings = {}

for layer in network.get_layers():
    # Select this layer's nodes and extract the corresponding subgraph
    layer_nodes = Q.nodes().from_layers(L[layer]).execute(network)
    layer_subgraph = network.core_network.subgraph(layer_nodes.keys())

    # Wrap the subgraph in a network object and train embeddings on it
    layer_net = multinet.multi_layer_network(directed=False)
    layer_net.core_network = layer_subgraph.copy()

    emb = train_node2vec(layer_net, dimensions=128)
    layer_embeddings[layer] = emb

    print(f"Generated embeddings for layer: {layer}")

Query and Filter Nodes Before Embedding with DSL

Goal: Use py3plex’s DSL to select specific node subsets for targeted embedding generation.

Random walks and embeddings on large networks can be computationally expensive. The DSL allows you to filter and extract subnetworks before running embedding algorithms, focusing on nodes of interest.

Filter High-Degree Nodes

Generate embeddings only for hub nodes:

from py3plex.core import multinet
from py3plex.wrappers import train_node2vec
from py3plex.dsl import Q, execute_query

# Load network
network = multinet.multi_layer_network(directed=False)
network.load_network(
    "py3plex/datasets/_data/synthetic_multilayer.edges",
    input_type="multiedgelist"
)

# Use DSL to find high-degree nodes (hubs)
hubs = (
    Q.nodes()
     .compute("degree")
     .where(degree__gt=8)  # Degree > 8
     .execute(network)
)

print(f"Found {len(hubs)} hub nodes")

# Extract subgraph containing only hubs
hub_subgraph = network.core_network.subgraph(hubs.keys())

# Train embeddings on hub subgraph
hub_network = multinet.multi_layer_network(directed=False)
hub_network.core_network = hub_subgraph.copy()

hub_embeddings = train_node2vec(
    hub_network,
    dimensions=64,
    walk_length=40,
    num_walks=10
)

print(f"Generated embeddings for {len(hub_embeddings)} hubs")

Expected output:

Found 15 hub nodes
Generated embeddings for 15 hubs

Layer-Specific Node Selection

Generate embeddings for nodes active in multiple layers:

from py3plex.dsl import Q, L
from collections import Counter

# Count layer participation for each node
node_layers = Counter()
for node, layer in network.get_nodes():
    node_layers[node] += 1

# Find nodes in 2+ layers
multilayer_nodes = {
    node for node, count in node_layers.items()
    if count >= 2
}

print(f"Nodes in 2+ layers: {len(multilayer_nodes)}")

# Convert back to (node, layer) tuples
multilayer_node_tuples = [
    (node, layer)
    for node, layer in network.get_nodes()
    if node in multilayer_nodes
]

# Extract subgraph
multi_subgraph = network.core_network.subgraph(multilayer_node_tuples)

# Generate embeddings
multi_network = multinet.multi_layer_network(directed=False)
multi_network.core_network = multi_subgraph.copy()

multi_embeddings = train_node2vec(multi_network, dimensions=64)
print(f"Generated embeddings for {len(multi_embeddings)} multilayer nodes")

Expected output:

Nodes in 2+ layers: 35
Generated embeddings for 105 multilayer nodes

Query Nodes by Centrality

Focus embeddings on central nodes:

from py3plex.dsl import Q

# Find nodes with high betweenness centrality
central_nodes = (
    Q.nodes()
     .compute("betweenness_centrality", "degree")
     .where(betweenness_centrality__gt=0.01)
     .execute(network)
)

print(f"High-centrality nodes: {len(central_nodes)}")

# Extract and embed
central_subgraph = network.core_network.subgraph(central_nodes.keys())
central_network = multinet.multi_layer_network(directed=False)
central_network.core_network = central_subgraph.copy()

central_embeddings = train_node2vec(
    central_network,
    dimensions=128,
    walk_length=80,
    num_walks=10
)

# Compare embedding similarity for central nodes
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

node_list = list(central_embeddings.keys())[:5]
vectors = np.array([central_embeddings[n] for n in node_list])

sim_matrix = cosine_similarity(vectors)
print("\nSimilarity matrix (first 5 central nodes):")
print(sim_matrix)
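
To pick out the most similar pair among those central nodes, mask the diagonal and take the largest off-diagonal entry; a minimal sketch:

# Find the most similar pair (ignore self-similarity on the diagonal)
masked = sim_matrix.copy()
np.fill_diagonal(masked, -np.inf)
i, j = np.unravel_index(np.argmax(masked), masked.shape)
print(f"Most similar pair: {node_list[i]} and {node_list[j]} "
      f"(cosine similarity {masked[i, j]:.3f})")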

Combine Embeddings with Node Attributes

Query embedded nodes with specific properties:

# After generating embeddings, store derived values (here, the embedding norm) as node attributes
for node, vector in embeddings.items():
    # Store embedding norm as an attribute
    embedding_norm = np.linalg.norm(vector)
    network.core_network.nodes[node]['embedding_norm'] = embedding_norm

# Query nodes with large embedding norms
large_norm_nodes = execute_query(
    network,
    'SELECT nodes WHERE embedding_norm > 10.0'
)

print(f"Nodes with large embedding norms: {len(large_norm_nodes)}")

Layer-Specific Embedding Analysis

Compare embeddings across layers:

from py3plex.dsl import Q, L

# Generate embeddings per layer
layer_embeddings = {}
layer_stats = {}

for layer in network.get_layers():
    # Extract layer
    layer_nodes = Q.nodes().from_layers(L[layer]).execute(network)
    layer_edges = Q.edges().from_layers(L[layer]).execute(network)

    print(f"\nLayer: {layer}")
    print(f"  Nodes: {len(layer_nodes)}")
    print(f"  Edges: {len(layer_edges)}")

    # Extract layer subgraph
    layer_subgraph = network.core_network.subgraph(layer_nodes.keys())

    # Skip if too few nodes
    if len(layer_nodes) < 5:
        print(f"  [SKIP] Too few nodes")
        continue

    # Create layer network
    layer_net = multinet.multi_layer_network(directed=False)
    layer_net.core_network = layer_subgraph.copy()

    # Generate embeddings
    try:
        emb = train_node2vec(
            layer_net,
            dimensions=64,
            walk_length=40,
            num_walks=10,
            workers=1
        )

        layer_embeddings[layer] = emb

        # Compute statistics
        vectors = np.array(list(emb.values()))
        mean_norm = np.mean([np.linalg.norm(v) for v in vectors])

        layer_stats[layer] = {
            'n_nodes': len(emb),
            'mean_embedding_norm': mean_norm
        }

        print(f"  ✓ Embeddings generated: {len(emb)} nodes")
        print(f"  Mean embedding norm: {mean_norm:.2f}")
    except Exception as e:
        print(f"  [ERROR] {e}")

Expected output:

Layer: layer1
  Nodes: 40
  Edges: 95
  ✓ Embeddings generated: 40 nodes
  Mean embedding norm: 12.34

Layer: layer2
  Nodes: 40
  Edges: 87
  ✓ Embeddings generated: 40 nodes
  Mean embedding norm: 11.89

Layer: layer3
  Nodes: 40
  Edges: 102
  ✓ Embeddings generated: 40 nodes
  Mean embedding norm: 13.01
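
The collected layer_stats dictionary can be summarized into a single comparison table; a short sketch using pandas:

import pandas as pd

# One row per layer: node count and mean embedding norm
stats_df = pd.DataFrame.from_dict(layer_stats, orient='index')
stats_df.index.name = 'layer'
print(stats_df)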

Export Embeddings with Metadata Using DSL

Create analysis-ready embedding exports:

import pandas as pd
from py3plex.dsl import Q

# Compute node metrics
metrics = (
    Q.nodes()
     .compute("degree", "betweenness_centrality")
     .execute(network)
)

# Combine embeddings with metrics
data = []
for node in embeddings.keys():
    row = {
        'node': node[0],
        'layer': node[1],
        'degree': metrics[node]['degree'],
        'betweenness': metrics[node]['betweenness_centrality']
    }

    # Add embedding dimensions
    for i, val in enumerate(embeddings[node]):
        row[f'emb_{i}'] = val

    data.append(row)

# Create DataFrame
df = pd.DataFrame(data)

# Export
df.to_csv('embeddings_with_metrics.csv', index=False)
print(f"Exported {len(df)} node embeddings with metadata")

Why use DSL for embedding workflows?

  • Targeted embedding: Focus on relevant node subsets, reducing computation time

  • Layer-aware: Generate layer-specific embeddings seamlessly

  • Metric integration: Combine embeddings with centrality and other network metrics

  • Filtering: Select nodes by degree, centrality, or custom attributes before embedding

  • Reproducible: Declarative queries document node selection criteria

Next Steps