# Water Cluster Fragmentation Tutorial

This tutorial demonstrates a complete workflow for fragmenting water clusters and
writing quantum chemistry input files for fragment-based calculations. Water
clusters are ideal starting points because individual water molecules are natural
fragments with well-defined boundaries.

## Overview

We'll cover:
1. Loading a water cluster from an XYZ file
2. Partitioning into fragment groups with `MolecularPartitioner`
3. Choosing a clustering method (including balanced partitioning)
4. Tiered hierarchical fragmentation for large clusters
5. Writing fragment-based input files for GAMESS, Psi4, and other programs
6. Exporting results as QCSchema JSON

## Prerequisites

```python
import autofragment as af
from autofragment.partitioners import MolecularPartitioner
from autofragment.io.writers import (
    write_gamess_fmo,
    write_psi4_sapt,
    write_psi4_fragment,
)
```

## Step 1: Load Water Cluster

### From XYZ File

`read_xyz` auto-detects water molecules (groups of 3 atoms each) by default:

```python
system = af.io.read_xyz("water20.xyz")

print(f"Total atoms: {len(system.atoms)}")
print(f"Molecules: {len(system.metadata.get('molecule_atom_indices', []))}")
```

### Customizing Molecule Detection

For non-water systems or to disable water validation:

```python
# Explicit atoms-per-molecule (e.g., methanol with 6 atoms each)
system = af.io.read_xyz("methanol_cluster.xyz", atoms_per_molecule=6)

# Skip water geometry validation
system = af.io.read_xyz("water20.xyz", validate_water=False)
```

### Example XYZ Format

```text
60
(H2O)20 cluster
O    0.000    0.000    0.000
H    0.957    0.000    0.000
H   -0.240    0.927    0.000
O    2.800    0.000    0.000
...
```

## Step 2: Partition into Fragments

### Basic Partitioning

```python
partitioner = MolecularPartitioner(n_fragments=4, method="kmeans")
tree = partitioner.partition(system)

print(f"Fragments: {tree.n_fragments}")
print(f"Total atoms: {tree.n_atoms}")

for frag in tree.fragments:
    print(f"  {frag.id}: {len(frag.symbols)} atoms")
```

### Available Clustering Methods

`MolecularPartitioner` supports several clustering algorithms via the `method` parameter:

| Method | Description | Extra required |
|--------|-------------|----------------|
| `"kmeans"` | Standard k-means (default) | — |
| `"kmeans_constrained"` | K-means with balanced cluster sizes | `pip install autofragment[balanced]` |
| `"agglomerative"` | Hierarchical agglomerative clustering | — |
| `"spectral"` | Spectral clustering | `pip install autofragment[graph]` |
| `"gmm"` | Gaussian Mixture Model | — |
| `"birch"` | BIRCH clustering | — |
| `"geom_planes"` | Geometric partitioning by cutting planes | — |

### Balanced Partitioning with `kmeans_constrained`

For many-body expansion (MBE) and FMO workflows, equally-sized fragments are
often critical for load balancing and accuracy. The `kmeans_constrained` method
enforces balanced cluster sizes:

```bash
pip install "autofragment[balanced]"
```

```python
partitioner = MolecularPartitioner(
    n_fragments=4,
    method="kmeans_constrained",
)
tree = partitioner.partition(system)

# Verify balanced sizes
sizes = [len(frag.symbols) for frag in tree.fragments]
print(f"Fragment sizes: {sizes}")
# e.g., [15, 15, 15, 15] for 20 waters into 4 fragments
```

You can also enable balanced partitioning explicitly with any method that supports it:

```python
partitioner = MolecularPartitioner(
    n_fragments=4,
    method="kmeans",
    strict_balanced=True,  # automatically uses kmeans_constrained
)
```

### Topology-Aware Refinement

For better chemical locality, enable topology refinement which reassigns
molecules based on neighborhood overlap:

```python
partitioner = MolecularPartitioner(
    n_fragments=4,
    method="kmeans",
    topology_refine=True,
    topology_mode="graph",
    topology_hops=1,
)
tree = partitioner.partition(system)
```

### Tiered Hierarchical Partitioning

For large clusters, tiered partitioning creates a hierarchy of fragments. This is
useful for hierarchical many-body expansion (HMBE) workflows where you need
multiple levels of spatial grouping:

```python
# 2-tier: 4 primary fragments, each with 4 sub-fragments
partitioner = MolecularPartitioner(
    tiers=2, n_primary=4, n_secondary=4, method="kmeans"
)
tree = partitioner.partition(system)

print(f"Primary fragments: {tree.n_primary}")
print(f"Total fragments: {tree.n_fragments}")

for pf in tree.fragments:
    print(f"  {pf.id}: {len(pf.fragments)} sub-fragments")
    for sf in pf.fragments:
        print(f"    {sf.id}: {sf.n_atoms} atoms")
```

For even finer control, use 3-tier partitioning:

```python
partitioner = MolecularPartitioner(
    tiers=3, n_primary=2, n_secondary=2, n_tertiary=2, method="kmeans"
)
tree = partitioner.partition(system)
# PF1 -> PF1_SF1 -> PF1_SF1_TF1, PF1_SF1_TF2, ...
```

### K-Means Seeding Strategies

Control the initial centroids for k-means clustering:

```python
# PCA-based seeding
partitioner = MolecularPartitioner(
    n_fragments=4, method="kmeans", init_strategy="pca"
)

# Per-tier seeding overrides (tiered mode)
partitioner = MolecularPartitioner(
    tiers=2, n_primary=4, n_secondary=4,
    method="kmeans",
    init_strategy="pca",              # default for all tiers
    init_strategy_secondary="radial"  # override for sub-fragments
)
```

Available strategies: `"halfplane"`, `"pca"`, `"axis"`, `"radial"`.

## Step 3: Write Input Files

### GAMESS FMO

```python
write_gamess_fmo(
    tree.fragments,
    "water_fmo.inp",
    basis="aug-cc-pVDZ",
    method="MP2",
    runtype="energy",
    fmo_level=2,
    nbody=2,
)
```

### GAMESS EFMO

```python
from autofragment.io.writers import write_gamess_efmo

write_gamess_efmo(
    tree.fragments,
    "water_efmo.inp",
    basis="aug-cc-pVDZ",
)
```

### Psi4 SAPT (Dimer Interaction Energies)

For symmetry-adapted perturbation theory between fragment pairs:

```python
# Select two fragments for a SAPT calculation
dimer = [tree.fragments[0], tree.fragments[1]]

write_psi4_sapt(
    dimer,
    "water_sapt.dat",
    method="sapt2+",
    basis="aug-cc-pVDZ",
)
```

### Psi4 General Fragment Calculation

```python
write_psi4_fragment(
    tree.fragments,
    "water_frags.dat",
    method="mp2",
    basis="aug-cc-pVDZ",
    runtype="energy",
)
```

### Other Programs

autofragment also provides writers for Q-Chem, NWChem, ORCA, Molpro, Turbomole,
and CFOUR. See [Output Formats](output_formats.md) for the full list.

## Step 4: Export and Save Results

### QCSchema JSON

```python
# Save the full FragmentTree
tree.to_json("water20_fragments.json")

# Reload later
tree = af.FragmentTree.from_json("water20_fragments.json")
```

### QCSchema Writer

```python
from autofragment.io.writers import write_qcschema

write_qcschema(system, "water20_qcschema.json")
```

### XYZ Fragment Output

```python
from autofragment.io.writers import write_xyz_fragments

write_xyz_fragments(tree.fragments, "water20_fragments.xyz")
```

## Step 5: Quick API

For simple workflows, the convenience function handles reading and partitioning
in a single call:

```python
import autofragment as af

tree = af.partition_xyz("water20.xyz", n_fragments=4, method="kmeans")
tree.to_json("water20_fragments.json")

for frag in tree.fragments:
    print(f"Fragment {frag.id}: {len(frag.symbols)} atoms")
```

## Complete Example

```python
#!/usr/bin/env python
"""Fragment a water cluster and write GAMESS FMO input."""

import autofragment as af
from autofragment.partitioners import MolecularPartitioner
from autofragment.io.writers import write_gamess_fmo

def main():
    # Load
    system = af.io.read_xyz("water20.xyz")
    print(f"Loaded {len(system.atoms)} atoms")

    # Partition with balanced clusters
    partitioner = MolecularPartitioner(
        n_fragments=4,
        method="kmeans_constrained",  # requires autofragment[balanced]
    )
    tree = partitioner.partition(system)
    print(f"Created {tree.n_fragments} fragments")

    for frag in tree.fragments:
        print(f"  {frag.id}: {len(frag.symbols)} atoms")

    # Write GAMESS FMO input
    write_gamess_fmo(
        tree.fragments,
        "water20_fmo.inp",
        basis="aug-cc-pVDZ",
        method="MP2",
        fmo_level=2,
        nbody=2,
    )
    print("Wrote water20_fmo.inp")

    # Also save as JSON for later use
    tree.to_json("water20_fragments.json")

if __name__ == "__main__":
    main()
```

## Tips and Best Practices

1. **Basis set**: Use augmented basis sets (aug-cc-pVDZ, aug-cc-pVTZ) for accurate hydrogen bonding.
2. **Balanced fragments**: Use `method="kmeans_constrained"` (or `strict_balanced=True`) when equal-sized fragments matter for your workflow.
3. **Fragment count**: Start with fewer fragments and increase; more fragments means more n-body terms.
4. **Topology refinement**: Enable `topology_refine=True` for better spatial locality in larger clusters.
5. **Parallelization**: Fragment-based QC calculations are embarrassingly parallel — each fragment or pair can run independently.

## Next Steps

- [Algorithms Guide](../guides/algorithms.md): Details on each clustering method
- [Protein Fragmentation](proteins.md): Apply fragmentation to biomolecules
- [Output Formats](output_formats.md): Complete list of QC program writers