Clustering Algorithms for Molecular Fragmentation
Introduction
Clustering algorithms partition atoms or molecular units into groups based on spatial proximity, connectivity, or chemical similarity. In molecular fragmentation, these algorithms help identify natural groupings that can be treated as computational fragments.
AutoFragment implements several clustering approaches, each with distinct advantages:
K-means: Fast, scalable, best for roughly spherical clusters
Spectral clustering: Captures complex connectivity patterns
Hierarchical methods: Produces nested fragment hierarchies
K-Means Clustering
Objective Function
Given $n$ atoms and $k$ target fragments, k-means minimizes the within-cluster sum of squares (WCSS):
$$ J = \sum_{i=1}^{k} \sum_{\mathbf{x} \in C_i} |\mathbf{x} - \boldsymbol{\mu}_i|^2 $$
where:
$C_i$ is the set of atoms in cluster $i$
$\boldsymbol{\mu}i = \frac{1}{|C_i|} \sum{\mathbf{x} \in C_i} \mathbf{x}$ is the centroid of cluster $i$
$|\cdot|$ denotes the Euclidean distance
Algorithm (Lloyd’s Method)
Initialize: Choose $k$ initial centroids $\boldsymbol{\mu}_1^{(0)}, \ldots, \boldsymbol{\mu}_k^{(0)}$
Assign: For each atom $\mathbf{x}j$, assign to nearest centroid: $$ c_j^{(t)} = \arg\min{i} |\mathbf{x}_j - \boldsymbol{\mu}_i^{(t)}|^2 $$
Update: Recompute centroids: $$ \boldsymbol{\mu}i^{(t+1)} = \frac{1}{|C_i^{(t)}|} \sum{\mathbf{x}_j \in C_i^{(t)}} \mathbf{x}_j $$
Repeat: Until convergence ($c_j^{(t+1)} = c_j^{(t)}$ for all $j$)
Initialization Strategies
Random Initialization
Select $k$ atoms uniformly at random as initial centroids. Simple but can lead to poor local minima.
K-means++ (Recommended)
Weighted random selection that favors spread-out centroids:
Choose first centroid uniformly at random
For each atom $\mathbf{x}_j$, compute $D(\mathbf{x}_j)$ = distance to nearest centroid
Choose next centroid with probability: $$ P(\mathbf{x}j) = \frac{D(\mathbf{x}j)^2}{\sum{\ell} D(\mathbf{x}\ell)^2} $$
Repeat until $k$ centroids chosen
K-means++ provides an expected approximation ratio of $O(\log k)$ compared to optimal clustering.
Convergence Properties
Guaranteed convergence: The objective $J$ decreases monotonically and is bounded below
Local minimum: Converges to local (not necessarily global) minimum
Recommendation: Run multiple restarts (typically 10-20) and select best solution
Complexity Analysis
Operation |
Time Complexity |
|---|---|
One iteration |
$O(nkd)$ |
Total (t iterations) |
$O(tnkd)$ |
K-means++ init |
$O(nk)$ |
where $n$ = number of atoms, $k$ = number of clusters, $d$ = dimension (typically 3 for coordinates).
Spectral Clustering
Spectral clustering uses eigenvalues of a graph Laplacian matrix to embed atoms in a space where k-means works well, even for non-convex clusters.
Molecular Graph Representation
Represent the molecule as a weighted graph $G = (V, E, W)$:
Vertices $V$: atoms
Edges $E$: bonds or proximity-based connections
Weights $W$: bond strengths, distances, or interaction energies
Affinity Matrix
The affinity (or similarity) matrix $A \in \mathbb{R}^{n \times n}$ encodes pairwise relationships:
$$ A_{ij} = \begin{cases} w_{ij} & \text{if atoms } i, j \text{ are connected} \ 0 & \text{otherwise} \end{cases} $$
For distance-based affinity, use Gaussian kernel:
$$ A_{ij} = \exp\left(-\frac{|\mathbf{x}_i - \mathbf{x}_j|^2}{2\sigma^2}\right) $$
Degree Matrix
The degree matrix $D \in \mathbb{R}^{n \times n}$ is diagonal:
$$ D_{ii} = \sum_{j=1}^{n} A_{ij} = d_i $$
Graph Laplacian
The unnormalized Laplacian:
$$ L = D - A $$
Properties:
Symmetric, positive semi-definite
Smallest eigenvalue is 0 with eigenvector $\mathbf{1}$
Number of 0 eigenvalues = number of connected components
Normalized Laplacian
The symmetric normalized Laplacian:
$$ L_{\text{sym}} = D^{-1/2} L D^{-1/2} = I - D^{-1/2} A D^{-1/2} $$
The random walk Laplacian:
$$ L_{\text{rw}} = D^{-1} L = I - D^{-1} A $$
Spectral Clustering Algorithm
Compute the first $k$ eigenvectors of $L_{\text{sym}}$ (smallest eigenvalues): $$ L_{\text{sym}} \mathbf{u}_i = \lambda_i \mathbf{u}_i, \quad i = 1, \ldots, k $$
Form matrix $U \in \mathbb{R}^{n \times k}$ with eigenvectors as columns
Normalize rows of $U$ to unit length: $$ \tilde{\mathbf{u}}_j = \frac{\mathbf{u}_j}{|\mathbf{u}_j|} $$
Apply k-means to the rows of $\tilde{U}$ to obtain clusters
Spectral Gap Heuristic
The optimal number of clusters can be estimated from the spectral gap:
$$ k^* = \arg\max_k (\lambda_{k+1} - \lambda_k) $$
A large gap indicates natural cluster structure.
Complexity Analysis
Operation |
Time Complexity |
|---|---|
Laplacian construction |
$O(n^2)$ or $O( |
Eigendecomposition |
$O(n^3)$ or $O(kn^2)$ for $k$ eigenvectors |
K-means on embeddings |
$O(tnk^2)$ |
Hierarchical Clustering
Agglomerative Clustering
Bottom-up approach that builds a hierarchy (dendrogram):
Start with each atom as its own cluster
Merge the two closest clusters
Repeat until single cluster remains
Linkage Criteria
Distance between clusters $C_i$ and $C_j$:
Single linkage (minimum): $$ d(C_i, C_j) = \min_{\mathbf{x} \in C_i, \mathbf{y} \in C_j} |\mathbf{x} - \mathbf{y}| $$
Complete linkage (maximum): $$ d(C_i, C_j) = \max_{\mathbf{x} \in C_i, \mathbf{y} \in C_j} |\mathbf{x} - \mathbf{y}| $$
Average linkage (UPGMA): $$ d(C_i, C_j) = \frac{1}{|C_i||C_j|} \sum_{\mathbf{x} \in C_i} \sum_{\mathbf{y} \in C_j} |\mathbf{x} - \mathbf{y}| $$
Ward’s method (minimize variance): $$ d(C_i, C_j) = \frac{|C_i||C_j|}{|C_i| + |C_j|} |\boldsymbol{\mu}_i - \boldsymbol{\mu}_j|^2 $$
Choosing Number of Clusters
Cut the dendrogram at a chosen height or number of clusters. For molecular fragmentation, Ward’s method often produces the most balanced fragments.
Application to Molecular Fragmentation
Distance Metrics for Molecules
Beyond Euclidean distance, consider:
Through-space distance: $$ d_{\text{space}}(\mathbf{x}_i, \mathbf{x}_j) = |\mathbf{x}_i - \mathbf{x}_j| $$
Through-bond distance: $$ d_{\text{bond}}(i, j) = \text{shortest path in molecular graph} $$
Weighted distance: $$ d_{\text{combined}}(i, j) = \alpha \cdot d_{\text{space}}(i, j) + (1-\alpha) \cdot d_{\text{bond}}(i, j) $$
Constraint-Aware Clustering
Incorporate chemical constraints:
Must-link: Atoms that must be in the same fragment
Cannot-link: Atoms that must be in different fragments
These constraints can be incorporated into spectral clustering by modifying the affinity matrix.
References
Lloyd, S. P. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28(2), 129-137.
Arthur, D., & Vassilvitskii, S. (2007). k-means++: The advantages of careful seeding. SODA, 1027-1035.
Von Luxburg, U. (2007). A tutorial on spectral clustering. Statistics and Computing, 17(4), 395-416.
Ng, A. Y., Jordan, M. I., & Weiss, Y. (2001). On spectral clustering: Analysis and an algorithm. NIPS, 849-856.