# Fragmentation Scoring Functions

## Overview

Scoring functions quantify the quality of a molecular fragmentation to guide optimization algorithms. A good fragmentation should:

1. Minimize breaking of strong chemical bonds
2. Produce fragments of reasonable, uniform size
3. Minimize interfragment interactions
4. Preserve chemical integrity (functional groups, rings)
5. Balance computational cost across fragments

## Total Score Formulation

The total score is a weighted combination of component scores:

$$
S_{\text{total}} = \sum_{i} w_i \cdot S_i
$$

where:
- $w_i$ are tunable weights for each component
- $S_i$ are individual score components
- Higher scores indicate better fragmentations

Alternative formulation as minimization (penalty-based):

$$
P_{\text{total}} = \sum_{i} w_i \cdot P_i
$$

where $P_i$ are penalty terms and lower values are better.

## Component Scores

### Bond Breaking Penalty ($S_{\text{bond}}$)

The most critical component—penalizes breaking covalent bonds:

$$
S_{\text{bond}} = -\sum_{b \in \text{broken}} E_{\text{bond}}(b) \cdot f_{\text{type}}(b)
$$

where:
- $E_{\text{bond}}(b)$ is the estimated bond energy
- $f_{\text{type}}(b)$ is a multiplier based on bond type

#### Bond Type Multipliers

| Bond Type | $f_{\text{type}}$ | Rationale |
|-----------|-------------------|-----------|
| Single (rotatable) | 1.0 | Preferred break points |
| Single (non-rotatable) | 1.5 | Somewhat penalized |
| Aromatic | 10.0 | Strongly discouraged |
| Double | 50.0 | Almost forbidden |
| Triple | 100.0 | Forbidden |

#### Bond Energy Estimation

Approximate bond energies (kcal/mol):

$$
E_{\text{bond}}(X-Y) \approx \sqrt{E(X-X) \cdot E(Y-Y)}
$$

| Bond | Energy (kcal/mol) |
|------|-------------------|
| C-C | 83 |
| C-H | 99 |
| C-N | 73 |
| C-O | 86 |
| C=O | 180 |
| C=C | 146 |
| C≡C | 200 |

### Size Variance ($S_{\text{size}}$)

Encourages fragments of uniform size to balance computational load:

$$
S_{\text{size}} = -\alpha \cdot \frac{\sigma(n_i)}{\bar{n}}
$$

where:
- $n_i$ is the number of atoms in fragment $i$
- $\bar{n} = \frac{1}{k}\sum_i n_i$ is the mean fragment size
- $\sigma(n_i) = \sqrt{\frac{1}{k}\sum_i (n_i - \bar{n})^2}$ is the standard deviation
- $\alpha$ is a scaling parameter

**Alternative**: Use coefficient of variation as penalty:

$$
\text{CV} = \frac{\sigma(n_i)}{\bar{n}}
$$

Normalized score:

$$
S_{\text{size}} = 1 - \text{CV}
$$

### Size Range Penalty

Penalize fragments outside acceptable size range:

$$
P_{\text{range}} = \sum_i \max(0, n_i - n_{\max}) + \sum_i \max(0, n_{\min} - n_i)
$$

where $n_{\min}$ and $n_{\max}$ are target fragment size bounds.

### Interface Score ($S_{\text{interface}}$)

Minimizes interaction energy across fragment boundaries:

$$
S_{\text{interface}} = -\sum_{(F_i, F_j)} E_{\text{int}}(F_i, F_j)
$$

#### Approximations for Interface Energy

**Exposed surface area**:

$$
E_{\text{int}} \approx \gamma \cdot A_{\text{exposed}}
$$

where $\gamma$ is surface tension and $A_{\text{exposed}}$ is solvent-accessible surface area at cut points.

**Electrostatic interaction**:

$$
E_{\text{int}} \approx \sum_{a \in F_i} \sum_{b \in F_j} \frac{q_a q_b}{r_{ab}}
$$

**Number of cut bonds** (simplified):

$$
E_{\text{int}} \approx N_{\text{cut}}
$$

### Chemical Integrity ($S_{\text{chem}}$)

Rewards preservation of chemically meaningful units:

$$
S_{\text{chem}} = \sum_f I_{\text{valid}}(f)
$$

where $I_{\text{valid}}(f)$ is an indicator for chemically sensible fragments.

#### Validity Criteria

1. **No broken rings**: 
   $$
   I_{\text{ring}}(f) = \begin{cases}
   0 & \text{if ring is split} \\
   1 & \text{otherwise}
   \end{cases}
   $$

2. **Complete functional groups**:
   $$
   I_{\text{func}}(f) = \begin{cases}
   0 & \text{if functional group is split} \\
   1 & \text{otherwise}
   \end{cases}
   $$

3. **Proper valence**:
   $$
   I_{\text{valence}}(f) = \begin{cases}
   1 & \text{if all atoms have valid valence (after capping)} \\
   0 & \text{otherwise}
   \end{cases}
   $$

### Computational Cost ($S_{\text{comp}}$)

Estimates total computational cost of the fragmentation:

$$
S_{\text{comp}} = -\sum_i C(n_i)
$$

where $C(n)$ is the cost function for QM calculations.

#### Scaling Estimates

| Method | Scaling | $C(n)$ |
|--------|---------|--------|
| HF | $O(n^4)$ | $n^4$ |
| DFT | $O(n^3)$ | $n^3$ |
| MP2 | $O(n^5)$ | $n^5$ |
| CCSD | $O(n^6)$ | $n^6$ |
| CCSD(T) | $O(n^7)$ | $n^7$ |

For many-body expansion, include pair interaction costs:

$$
S_{\text{comp}}^{\text{MBE}} = -\left[ \sum_i C(n_i) + \frac{1}{2}\sum_{i \neq j} C(n_i + n_j) \right]
$$

### Distance from Target Size ($S_{\text{target}}$)

Encourages fragments close to specified target size:

$$
S_{\text{target}} = -\sum_i \left| n_i - n_{\text{target}} \right|^p
$$

Common choices:
- $p = 1$: Linear penalty (MAE)
- $p = 2$: Quadratic penalty (MSE), more severe for large deviations

### Number of Fragments ($S_{\text{count}}$)

Penalize excessive fragmentation:

$$
S_{\text{count}} = -|k - k_{\text{target}}|
$$

where $k$ is the number of fragments.

## Weight Selection

### Default Weights

Balanced for general molecular fragmentation:

| Component | Weight | Rationale |
|-----------|--------|-----------|
| $w_{\text{bond}}$ | 1.0 | Primary consideration |
| $w_{\text{size}}$ | 0.3 | Moderate importance |
| $w_{\text{interface}}$ | 0.2 | Secondary |
| $w_{\text{chem}}$ | 0.5 | Important for validity |
| $w_{\text{comp}}$ | 0.1 | Use when balancing load |

### Application-Specific Weights

**FMO calculations** (prioritize size balance):
```python
weights = {
    "bond": 1.0,
    "size": 0.8,  # Higher for balanced fragments
    "chem": 1.0,  # Critical for FMO
    "interface": 0.1,
}
```

**Large protein fragmentation** (prioritize computational cost):
```python
weights = {
    "bond": 1.0,
    "size": 0.3,
    "comp": 0.5,  # Consider cost
    "chem": 0.5,
}
```

**Water clusters** (minimize interactions):
```python
weights = {
    "bond": 0.5,  # Weak H-bonds
    "interface": 1.0,  # Minimize water-water
    "size": 0.1,  # Less important
}
```

## Multi-Objective Optimization

### Pareto Optimality

Multiple objectives can conflict. A fragmentation is **Pareto optimal** if no objective can be improved without worsening another.

The **Pareto frontier** contains all Pareto-optimal fragmentations:

$$
\mathcal{P} = \{f : \nexists f' \text{ such that } S_i(f') \geq S_i(f) \forall i \text{ and } S_j(f') > S_j(f) \text{ for some } j\}
$$

### Scalarization

Convert to single objective via weighted sum:

$$
S_{\text{total}}(\mathbf{w}) = \sum_i w_i S_i
$$

Different weight vectors explore different points on Pareto frontier.

### ε-Constraint Method

Optimize one objective subject to constraints on others:

$$
\maximize \quad S_1 \\
\text{subject to} \quad S_i \geq \epsilon_i, \quad i = 2, \ldots, k
$$

## Trade-offs and Considerations

### Bond Breaking vs. Size Balance

- Fewer cuts → larger fragments → worse computational scaling
- More cuts → smaller fragments → more interface energy

Optimal balance depends on:
- Target QM method (scaling exponent)
- System type (proteins need larger fragments than water)
- Accuracy requirements

### Chemical Integrity vs. Flexibility

Strict preservation of all functional groups may produce:
- Fragments that are too large
- Poor size balance
- Suboptimal partitioning

Consider relaxing constraints for:
- Long-chain functional groups
- Repeated patterns (polymers)

### Computational Cost vs. Accuracy

**Many-body expansion trade-off**:
- Fewer, larger fragments → fewer pairs → faster → less accurate
- More, smaller fragments → more pairs → slower → more accurate

Target fragment size should be chosen based on desired accuracy.

## Score Normalization

### Z-Score Normalization

For comparing across systems:

$$
S_i^{\text{norm}} = \frac{S_i - \mu_i}{\sigma_i}
$$

where $\mu_i$ and $\sigma_i$ are mean and std from reference fragmentations.

### Min-Max Scaling

$$
S_i^{\text{scaled}} = \frac{S_i - S_i^{\min}}{S_i^{\max} - S_i^{\min}}
$$

Bounds scores to $[0, 1]$ range.

## Implementation Notes

```python
from autofragment.scoring import FragmentationScorer

scorer = FragmentationScorer(
    weights={
        "bond": 1.0,
        "size": 0.3,
        "interface": 0.2,
        "chem": 0.5,
    }
)

# Score a fragmentation
fragments = partitioner.partition(system)
score = scorer.score(fragments)

# Get component breakdown
breakdown = scorer.score_breakdown(fragments)
for component, value in breakdown.items():
    print(f"{component}: {value:.3f}")
```

## References

1. Fedorov, D. G., & Kitaura, K. (2007). Extending the power of quantum chemistry to large systems with the fragment molecular orbital method. *JPC A*, 111(30), 6904-6914.
2. Gordon, M. S., et al. (2012). Fragmentation methods: A route to accurate calculations on large systems. *Chemical Reviews*, 112(1), 632-672.
3. Collins, M. A., & Bettens, R. P. (2015). Energy-based molecular fragmentation methods. *Chemical Reviews*, 115(12), 5607-5642.