Graph Partitioning for Molecular Fragmentation
Introduction
Graph partitioning algorithms divide a molecular graph into disjoint subsets (fragments) while minimizing the weight of edges between partitions. These methods are fundamental to molecular fragmentation, where the graph representation naturally captures atomic connectivity.
Molecular Graph Representation
A molecule is represented as a weighted graph $G = (V, E, w)$ where:
$V = {v_1, \ldots, v_n}$: vertices (atoms)
$E \subseteq V \times V$: edges (bonds)
$w: E \to \mathbb{R}^+$: edge weights (bond strengths)
Edge weights can encode:
Bond order: single (1.0), double (2.0), triple (3.0), aromatic (1.5)
Bond energy: energy cost of breaking the bond
Interaction strength: electrostatic or van der Waals interactions
Minimum Cut Problem
Definition
A cut $(S, \bar{S})$ partitions vertices into two disjoint sets where $S \cup \bar{S} = V$.
The cut weight is the sum of edge weights crossing the partition:
$$ \text{cut}(S, \bar{S}) = \sum_{u \in S, v \in \bar{S}} w(u, v) $$
The minimum cut minimizes this value:
$$ (S^, \bar{S}^) = \arg\min_{S \subset V} \text{cut}(S, \bar{S}) $$
Stoer-Wagner Algorithm
Finds the global minimum cut in $O(|V||E| + |V|^2 \log|V|)$ time:
Start with arbitrary vertex $a$
Grow a maximum adjacency ordering from $a$
The last two vertices $(s, t)$ define a cut-of-the-phase
Contract $s$ and $t$ into a single vertex
Repeat until two vertices remain
Return minimum cut found across all phases
Limitation: Trivial Cuts
Minimum cut tends to isolate single vertices (especially terminal atoms like hydrogen). For molecular fragmentation, we need balanced partitions.
Balanced Partitioning
Ratio Cut
Normalize the cut by partition sizes:
$$ \text{RatioCut}(S, \bar{S}) = \frac{\text{cut}(S, \bar{S})}{|S|} + \frac{\text{cut}(S, \bar{S})}{|\bar{S}|} $$
This penalizes highly imbalanced partitions.
Normalized Cut (NCut)
Normalize by the volume (total edge weight) of each partition:
$$ \text{NCut}(S, \bar{S}) = \frac{\text{cut}(S, \bar{S})}{\text{vol}(S)} + \frac{\text{cut}(S, \bar{S})}{\text{vol}(\bar{S})} $$
where:
$$ \text{vol}(S) = \sum_{i \in S} d_i = \sum_{i \in S} \sum_{j} w_{ij} $$
NCut is equivalent to finding the second smallest eigenvector of the normalized Laplacian (spectral clustering).
Kernighan-Lin Algorithm
Iterative improvement algorithm for balanced bipartitioning:
Initialize: Split vertices into two equal-sized sets $A$ and $B$
Compute gains: For each pair $(a, b)$ with $a \in A$, $b \in B$: $$ g(a, b) = D_a + D_b - 2w(a, b) $$ where $D_a = \sum_{v \in B} w(a, v) - \sum_{v \in A} w(a, v)$ is the external - internal cost
Select best swap: Find $(a^, b^)$ maximizing $g$
Lock and update: Mark swapped vertices, update gains
Repeat: Until all vertices considered
Apply prefix: Keep swaps up to maximum cumulative gain
Complexity: $O(n^2 \log n)$ per pass, typically converges in few passes.
Community Detection
Community detection finds densely connected groups without requiring balance constraints.
Modularity
Modularity measures the quality of a partition by comparing edge density within communities to a random graph:
$$ Q = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \frac{k_i k_j}{2m} \right] \delta(c_i, c_j) $$
where:
$A_{ij}$: adjacency matrix entry
$k_i = \sum_j A_{ij}$: degree of vertex $i$
$m = \frac{1}{2}\sum_{ij} A_{ij}$: total edge weight
$c_i$: community assignment of vertex $i$
$\delta(c_i, c_j) = 1$ if $c_i = c_j$, else $0$
Modularity ranges from $-0.5$ to $1$, with higher values indicating stronger community structure.
Louvain Algorithm
Fast modularity optimization with hierarchical aggregation:
Phase 1 (Local moves):
For each vertex $i$ in random order
Compute modularity gain from moving $i$ to each neighbor’s community: $$ \Delta Q = \left[ \frac{\Sigma_{in} + k_{i,in}}{2m} - \left(\frac{\Sigma_{tot} + k_i}{2m}\right)^2 \right] - \left[ \frac{\Sigma_{in}}{2m} - \left(\frac{\Sigma_{tot}}{2m}\right)^2 - \left(\frac{k_i}{2m}\right)^2 \right] $$ where $\Sigma_{in}$ = sum of weights inside community, $\Sigma_{tot}$ = total weight of community, $k_{i,in}$ = sum of weights from $i$ to community
Move $i$ to community with maximum positive $\Delta Q$
Repeat until no improvement
Phase 2 (Aggregation):
Build new graph where each community becomes a super-vertex
Edge weights between super-vertices = sum of edges between constituent communities
Self-loops = internal edges of original community
Repeat: Apply Phase 1 to aggregated graph until convergence.
Complexity: $O(n \log n)$ for sparse graphs, making it suitable for large molecules.
Resolution Parameter
Modularity has a resolution limit—it may merge small communities. Introduce resolution parameter $\gamma$:
$$ Q_\gamma = \frac{1}{2m} \sum_{ij} \left[ A_{ij} - \gamma\frac{k_i k_j}{2m} \right] \delta(c_i, c_j) $$
$\gamma > 1$: Favor smaller communities (more fragments)
$\gamma < 1$: Favor larger communities (fewer fragments)
METIS Algorithm
METIS is an industry-standard multilevel graph partitioning library optimized for large graphs.
Multilevel Paradigm
1. Coarsening Phase Progressively collapse vertices into coarser graphs:
Heavy Edge Matching (HEM): Match vertices connected by heaviest edges
Sorted Heavy Edge Matching (SHEM): Process vertices in decreasing degree order
After $\ell$ coarsening levels: $|V_\ell| \ll |V_0|$
2. Initial Partitioning Apply exact or heuristic algorithm to small coarsened graph. Common choices:
Greedy bisection
Spectral partitioning
Kernighan-Lin
3. Uncoarsening/Refinement Project partition back to finer levels, refining at each step:
Boundary refinement: Only consider moving vertices at partition boundary
FM refinement: Fiduccia-Mattheyses variant optimized for moves
METIS for Molecules
METIS partitions into $k$ balanced parts minimizing edge cut. For molecules:
import pymetis
# Molecule as adjacency list
adjacency = [[1, 2], [0, 2, 3], [0, 1], [1, 4], [3]]
# Partition into 2 fragments
n_cuts, membership = pymetis.part_graph(2, adjacency)
# membership[i] = fragment assignment for atom i
Advantages:
Very fast: $O(|E|)$ time complexity
Produces balanced partitions
Minimal edge cut (bond breaking)
Considerations:
Requires installation of
pymetisBalance constraint may conflict with chemical constraints
k-way Partitioning
Direct k-way
Partition into $k > 2$ parts directly:
$$ \text{minimize} \quad \sum_{i=1}^{k-1} \sum_{j=i+1}^{k} \text{cut}(V_i, V_j) $$
subject to balance constraints:
$$ |V_i| \leq (1 + \epsilon) \frac{|V|}{k}, \quad \forall i $$
Recursive Bisection
Alternative approach:
Bisect graph into two parts
Recursively bisect each part
Continue until $k$ parts obtained
For $k = 2^p$, requires $k-1$ cuts.
Application to Molecular Fragmentation
Weight Functions for Chemistry
Design edge weights that encode chemical knowledge:
Bond energy weight: $$ w(u, v) = E_{\text{bond}}(u, v) $$ Penalizes breaking strong bonds.
Aromaticity weight: $$ w(u, v) = \begin{cases} \infty & \text{if bond is aromatic} \ E_{\text{bond}} & \text{otherwise} \end{cases} $$ Prevents breaking aromatic systems.
Functional group weight: $$ w(u, v) = \begin{cases} \infty & \text{if bond is within functional group} \ E_{\text{bond}} \cdot f(\text{rotatable}) & \text{otherwise} \end{cases} $$
Comparison of Methods
Method |
Complexity |
Balance |
Quality |
Best For |
|---|---|---|---|---|
Min-cut |
$O(VE + V^2 \log V)$ |
No |
Exact |
Bipartition |
Kernighan-Lin |
$O(n^2 \log n)$ |
Strict |
Good |
Balanced bipartition |
Louvain |
$O(n \log n)$ |
No |
Good |
Natural clusters |
METIS |
$O(E)$ |
Yes |
Excellent |
Large molecules |
References
Stoer, M., & Wagner, F. (1997). A simple min-cut algorithm. JACM, 44(4), 585-591.
Kernighan, B. W., & Lin, S. (1970). An efficient heuristic procedure for partitioning graphs. Bell System Technical Journal, 49(2), 291-307.
Blondel, V. D., et al. (2008). Fast unfolding of communities in large networks. JSTAT, P10008.
Karypis, G., & Kumar, V. (1998). A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 20(1), 359-392.
Shi, J., & Malik, J. (2000). Normalized cuts and image segmentation. IEEE TPAMI, 22(8), 888-905.