SyntheticGraphDataset

Overview
Class Definition
Parameters
1. Constructor Parameters
Methods
1. process
2. _generate_features
  1. Parameters
3. static generate_features
  1. Parameters
  2. Returns
4. __getitem__
  1. Parameters
  2. Returns
  3. Raises
5. __len__
  1. Returns
Usage Examples
Implementation Details
Related Components

Overview

The SyntheticGraphDataset class is a DGLDataset subclass that generates a single synthetic graph with controllable properties such as homophily and community structure. It is intended for benchmarking and analyzing graph neural networks under controlled conditions.

Class Definition

from typing import Optional

import numpy as np
from dgl.data import DGLDataset


class SyntheticGraphDataset(DGLDataset):
    def __init__(
        self,
        n: int = 100,
        k: int = 3,
        h: float = 0.8,
        d_mean: float = 3,
        sigma_intra_scalar: float = 0.1,
        sigma_inter_scalar: float = -0.05,
        tau_scalar: float = 1,
        eta_scalar: float = 1,
        in_feats: int = 5,
        d_in: Optional[int] = None,
        alpha: Optional[float] = None,
        sym: bool = True,
        mu: Optional[np.ndarray] = None
    ):
        # Implementation details...
        ...

Parameters

Constructor Parameters

Parameter	Type	Description
`n`	int	Number of nodes
`k`	int	Number of communities (classes/blocks)
`h`	float	Homophily ratio (higher values favor intra-community edges)
`d_mean`	float	Mean degree scaling factor
`sigma_intra_scalar`	float	Scalar for intra-class covariance
`sigma_inter_scalar`	float	Scalar for inter-class covariance
`tau_scalar`	float	Scalar for the global covariance (shared across nodes)
`eta_scalar`	float	Scalar for node-wise noise covariance
`in_feats`	int	Dimensionality of node features
`d_in`	Optional[int]	Input dimension for feature generation (defaults to k if not provided)
`alpha`	Optional[float]	Dirichlet concentration parameter for block proportions
`sym`	bool	If True, creates an undirected (symmetric) graph
`mu`	Optional[np.ndarray]	Class-specific mean matrix (if None, defaults are used)
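
The role of `alpha` can be illustrated with a small sketch. It assumes, since this page does not spell it out, that a symmetric Dirichlet with concentration `alpha` is sampled to obtain community proportions and that uniform proportions are used when `alpha` is None. The helper name `sample_block_proportions` is hypothetical and not part of the library.

```python
import numpy as np

def sample_block_proportions(k, alpha=None, rng=None):
    """Hypothetical helper: community proportions for k blocks.

    Assumption: alpha=None means uniform proportions; otherwise a
    symmetric Dirichlet(alpha, ..., alpha) is sampled.
    """
    rng = np.random.default_rng() if rng is None else rng
    if alpha is None:
        return np.full(k, 1.0 / k)
    return rng.dirichlet(np.full(k, alpha))

rng = np.random.default_rng(0)
uniform = sample_block_proportions(3)
skewed = sample_block_proportions(3, alpha=0.5, rng=rng)
balanced = sample_block_proportions(3, alpha=50.0, rng=rng)

# Assign 100 nodes to communities according to the sampled proportions
labels = rng.choice(3, size=100, p=balanced)
```

Under this assumption, small values of `alpha` tend to produce unbalanced communities, while large values concentrate the proportions near uniform.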

Methods

process

def process(self) -> None

Generates the synthetic SBM graph and its features. This method builds a block connectivity matrix, generates community assignments, creates the graph adjacency, constructs a DGL graph, and assigns node features.
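
The steps above can be sketched with NumPy alone. This is an illustration under stated assumptions, not the library's implementation: the function name `sketch_sbm`, the uniform community assignment, and the scaling of the block matrix by `d_mean * k / n` are all assumptions made for this example.

```python
import numpy as np

def sketch_sbm(n=100, k=3, h=0.8, d_mean=3.0, sym=True, seed=0):
    """Illustrative SBM sampler; not the library's implementation."""
    rng = np.random.default_rng(seed)

    # Block connectivity: h on the diagonal, (1 - h) / (k - 1) elsewhere
    B = np.full((k, k), (1.0 - h) / (k - 1))
    np.fill_diagonal(B, h)

    # Assumed scaling: with balanced communities the expected degree
    # of a node is then approximately d_mean
    B = np.clip(B * d_mean * k / n, 0.0, 1.0)

    # Community assignment (uniform proportions assumed here)
    labels = rng.integers(0, k, size=n)

    # P[i, j] is the edge probability for the node pair (i, j)
    P = B[labels][:, labels]
    A = (rng.random((n, n)) < P).astype(np.int8)
    np.fill_diagonal(A, 0)  # no self-loops

    if sym:
        # Keep the upper triangle and mirror it to get an undirected graph
        A = np.triu(A, 1)
        A = A + A.T
    return A, labels, B

A, labels, B = sketch_sbm(n=200, k=3, h=0.8, d_mean=5.0)
src, dst = np.nonzero(A)  # edge list, usable e.g. with dgl.graph((src, dst), num_nodes=200)
```

The returned adjacency matrix can be converted to an edge list as in the last line and passed to a graph constructor; node features would then be attached in a separate step.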

_generate_features

def _generate_features(self, num_mu_samples: int = 1) -> None

Generates node features using a block-based covariance model, where features are a sum of class-specific mean vectors, global random variation, and node-specific noise.

Parameters

Parameter	Type	Description
`num_mu_samples`	int	Number of independent realizations to generate

static generate_features

@staticmethod
def generate_features(
    num_nodes: int,
    num_features: int,
    labels: np.ndarray,
    inter_class_cov: np.ndarray,
    intra_class_cov: np.ndarray,
    global_cov: np.ndarray,
    noise_cov: np.ndarray,
    mu_repeats: int = 1
) -> np.ndarray

Static method that generates node features according to a block-based model. Class-specific mean vectors are drawn from a multivariate normal whose covariance has intra-class and inter-class blocks, and each node receives the mean vector of its class. A global variation term and node-specific noise are then added.
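
A minimal sketch of this block-based model follows. It is one plausible reading of the description, not the library's code: the function name `sketch_generate_features`, the zero-mean draws, and the Kronecker-product construction of the joint covariance are assumptions made for illustration.

```python
import numpy as np

def sketch_generate_features(labels, intra_class_cov, inter_class_cov,
                             global_cov, noise_cov, mu_repeats=1, seed=0):
    """Illustrative block-covariance feature sampler (assumption-laden sketch)."""
    rng = np.random.default_rng(seed)
    n = labels.shape[0]
    k = int(labels.max()) + 1
    d = intra_class_cov.shape[0]

    # Joint covariance of the k stacked class means:
    # intra_class_cov on the diagonal blocks, inter_class_cov elsewhere
    eye_k = np.eye(k)
    joint_cov = np.kron(eye_k, intra_class_cov) + np.kron(1.0 - eye_k, inter_class_cov)

    zeros_d = np.zeros(d)
    out = np.empty((n, d, mu_repeats))
    for r in range(mu_repeats):
        # One mean vector per class, drawn jointly
        mu = rng.multivariate_normal(np.zeros(k * d), joint_cov).reshape(k, d)
        # Global variation, shared by all nodes
        shift = rng.multivariate_normal(zeros_d, global_cov)
        # Node-specific noise, one draw per node
        noise = rng.multivariate_normal(zeros_d, noise_cov, size=n)
        out[:, :, r] = mu[labels] + shift + noise
    return out

d, k = 4, 3
labels = np.repeat(np.arange(k), 20)  # 60 nodes, 20 per class
X = sketch_generate_features(
    labels,
    intra_class_cov=0.2 * np.eye(d),
    inter_class_cov=-0.05 * np.eye(d),
    global_cov=1.0 * np.eye(d),
    noise_cov=1.0 * np.eye(d),
    mu_repeats=2,
)
```

The scalars in this example are chosen so that the joint covariance is positive definite; a negative inter-class value that is too large relative to the intra-class value would make the joint matrix invalid as a covariance.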

Parameters

Parameter	Type	Description
`num_nodes`	int	Number of nodes in the graph
`num_features`	int	Dimensionality of node features
`labels`	np.ndarray	Node labels/community assignments
`inter_class_cov`	np.ndarray	Covariance matrix for inter-class relationships
`intra_class_cov`	np.ndarray	Covariance matrix for intra-class relationships
`global_cov`	np.ndarray	Covariance matrix for global variations
`noise_cov`	np.ndarray	Covariance matrix for node-specific noise
`mu_repeats`	int	Number of independent realizations to generate

Returns

Return Type	Description
np.ndarray	Generated features of shape (num_nodes, num_features, mu_repeats)

__getitem__

def __getitem__(self, idx: int) -> dgl.DGLGraph

Gets the graph at the specified index. Since the dataset contains only one graph, only index 0 is valid.

Parameters

Parameter	Type	Description
`idx`	int	Index of the graph to retrieve

Returns

Return Type	Description
dgl.DGLGraph	The graph at the specified index

Raises

Exception	Description
IndexError	If the index is out of bounds (only index 0 is valid)

__len__

def __len__(self) -> int

Gets the number of graphs in the dataset (always 1).

Returns

Return Type	Description
int	The number of graphs in the dataset (always 1)
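
The indexing contract described for `__getitem__` and `__len__` can be mimicked with a small stand-in class. `SingleGraphDataset` is hypothetical and holds a placeholder object instead of a DGLGraph; it only illustrates the documented behavior that index 0 is the sole valid index.

```python
class SingleGraphDataset:
    """Hypothetical stand-in that mimics the single-graph indexing contract."""

    def __init__(self, graph):
        self._graph = graph

    def __getitem__(self, idx: int):
        # Only one graph is stored, so any index other than 0 is rejected
        if idx != 0:
            raise IndexError("This dataset holds a single graph; only index 0 is valid")
        return self._graph

    def __len__(self) -> int:
        return 1

ds = SingleGraphDataset(graph={"num_nodes": 100})  # placeholder instead of a DGLGraph
```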

Usage Examples

Basic Usage

import torch

from bridge.datasets import SyntheticGraphDataset

# Create a synthetic dataset with default parameters
dataset = SyntheticGraphDataset()

# Get the generated graph
g = dataset[0]

print(f"Number of nodes: {g.num_nodes()}")
print(f"Number of edges: {g.num_edges()}")
print(f"Feature dimensions: {g.ndata['feat'].shape}")
print(f"Number of classes: {len(torch.unique(g.ndata['label']))}")

Custom Homophily and Size

# Create a heterophilic graph with 1000 nodes and 5 classes
hetero_dataset = SyntheticGraphDataset(
    n=1000,          # 1000 nodes
    k=5,             # 5 classes
    h=0.2,           # Low homophily (heterophilic)
    d_mean=15,       # Higher mean degree
    in_feats=10      # 10-dimensional features
)

g_hetero = hetero_dataset[0]
print(f"Heterophilic graph created with {g_hetero.num_nodes()} nodes and {g_hetero.num_edges()} edges")

Custom Feature Generation

import torch
from bridge.datasets import SyntheticGraphDataset

# Create a dataset with custom feature generation parameters
custom_dataset = SyntheticGraphDataset(
    n=500,
    k=4,
    h=0.6,
    sigma_intra_scalar=0.2,     # Stronger intra-class correlation
    sigma_inter_scalar=-0.1,    # Stronger inter-class distinction
    tau_scalar=0.5,             # Reduced global variation
    eta_scalar=0.8,             # Reduced node-specific noise
    in_feats=8
)

g_custom = custom_dataset[0]

# Analyze the feature distributions by class
labels = g_custom.ndata['label']
features = g_custom.ndata['feat']

# Calculate mean feature vector for each class
for class_id in range(4):
    class_mask = (labels == class_id)
    class_features = features[class_mask]
    class_mean = torch.mean(class_features, dim=0)
    print(f"Class {class_id} mean feature vector: {class_mean}")

Using with GNN Models

import torch
import dgl
from bridge.models import GCN
from bridge.datasets import SyntheticGraphDataset
from bridge.training import train

# Create synthetic dataset with controlled homophily
dataset = SyntheticGraphDataset(
    n=800,
    k=3,
    h=0.7,
    in_feats=5
)
g = dataset[0]

# Create a GCN model
in_feats = g.ndata['feat'].shape[1]
out_feats = len(torch.unique(g.ndata['label']))
model = GCN(
    in_feats=in_feats,
    h_feats=64,
    out_feats=out_feats,
    n_layers=2,
    dropout_p=0.5
)

# Train the model
train_acc, val_acc, test_acc, trained_model = train(
    g=g,
    model=model,
    train_mask=g.ndata['train_mask'],
    val_mask=g.ndata['val_mask'],
    test_mask=g.ndata['test_mask'],
    n_epochs=200,
    early_stopping=30
)

print(f"Train accuracy: {train_acc:.4f}")
print(f"Validation accuracy: {val_acc:.4f}")
print(f"Test accuracy: {test_acc:.4f}")

Implementation Details

The SyntheticGraphDataset class generates synthetic graphs using a Stochastic Block Model (SBM), which is a random graph model that incorporates community structure. The implementation involves:

Block Matrix Construction:
- Builds a block connectivity matrix B where intra-community connections (on the diagonal) have probability proportional to homophily h
- Inter-community connections (off-diagonal) have probability proportional to (1-h)/(k-1)
Node Assignment:
- Assigns each node to a community/class based on specified or uniform proportions
- This assignment determines the node labels
Edge Generation:
- Creates edges according to the probabilities in the block matrix
- The probability of an edge between nodes i and j depends on their community assignments
Feature Generation:
- Features are generated from a block-based covariance model with intra-class, inter-class, global, and node-wise noise components
- Each node’s features are a sum of:
  - Class-specific mean vector (different for each class)
  - Global random shift (shared across all nodes)
  - Node-specific noise
Train/Val/Test Split:
- Creates a random split of nodes for training, validation, and testing
- By default, uses 80% for training, 10% for validation, and 10% for testing
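
The random split in the last step can be sketched as boolean masks over the nodes. The function name `sketch_random_split` and the choice to give the remainder to the test set are assumptions for this example; the library may construct its masks differently.

```python
import numpy as np

def sketch_random_split(n, train_frac=0.8, val_frac=0.1, seed=0):
    """Illustrative random node split into boolean masks."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    n_train = int(round(train_frac * n))
    n_val = int(round(val_frac * n))

    train_mask = np.zeros(n, dtype=bool)
    val_mask = np.zeros(n, dtype=bool)
    test_mask = np.zeros(n, dtype=bool)
    train_mask[perm[:n_train]] = True
    val_mask[perm[n_train:n_train + n_val]] = True
    test_mask[perm[n_train + n_val:]] = True  # remainder goes to test
    return train_mask, val_mask, test_mask

train_mask, val_mask, test_mask = sketch_random_split(100)
```

Because the masks come from a single permutation, they are disjoint and together cover every node.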

The dataset is particularly useful for studying the relationship between graph structure (especially homophily) and GNN performance, as it allows precise control over these properties.
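
When studying homophily, it can help to measure it on the generated graph. The sketch below computes the edge homophily ratio, the fraction of edges whose endpoints share a label. This is one common definition and not necessarily how the library defines `h`; the function name `edge_homophily` is hypothetical.

```python
import numpy as np

def edge_homophily(src, dst, labels):
    """Fraction of edges whose two endpoints share a label."""
    src = np.asarray(src)
    dst = np.asarray(dst)
    labels = np.asarray(labels)
    if src.size == 0:
        return float("nan")  # undefined for a graph without edges
    return float(np.mean(labels[src] == labels[dst]))

# Toy graph: 4 nodes, two classes, three edges
labels = np.array([0, 0, 1, 1])
src = np.array([0, 2, 1])
dst = np.array([1, 3, 2])
# Edges 0-1 and 2-3 are intra-class, edge 1-2 is inter-class
ratio = edge_homophily(src, dst, labels)
```

For a DGL graph, the endpoint arrays could be taken from `g.edges()` and the labels from `g.ndata['label']`, converted to NumPy arrays first.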

Related Components

- run_sensitivity_experiment: Uses synthetic datasets for controlled experiments
- GCN: Graph neural network model that can be trained on synthetic datasets
- run_bridge_pipeline: Pipeline that can be applied to synthetic datasets
- generate_features: Feature generation function used in sensitivity analysis
