# 10. Performance Tuning and Scaling

## Performance mindset

Optimize from measurements, not assumptions. Distributed performance is typically constrained by communication cost, synchronization frequency, data movement, and load imbalance across ranks.

## High-impact levers

- partition strategy
- placement policy and transfer behavior
- communication granularity and batching
- collective frequency and scope
- local kernel/vectorization efficiency

## Partition and locality

Select partitioning to minimize cross-rank dependencies for your dominant access pattern.

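- block-like partitions (each rank owns a contiguous range) often help workloads with contiguous or neighbor-based access
- cyclic-like partitions (elements dealt round-robin across ranks) can smooth imbalance for irregular workloads, at the cost of locality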
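
The trade-off between the two partition styles can be illustrated with a small standalone sketch. It does not use this library's API: `block_owner`, `cyclic_owner`, `load_per_rank`, and `cross_rank_neighbors` are hypothetical helpers, and the cost list is made-up data in which work grows with the element index.

```python
def block_owner(index, n, num_ranks):
    """Rank that owns `index` when n elements are split into contiguous chunks."""
    chunk = -(-n // num_ranks)  # ceiling division
    return index // chunk


def cyclic_owner(index, num_ranks):
    """Rank that owns `index` when elements are dealt round-robin."""
    return index % num_ranks


def load_per_rank(costs, owner_of, num_ranks):
    """Sum the per-element cost assigned to each rank."""
    load = [0] * num_ranks
    for i, cost in enumerate(costs):
        load[owner_of(i)] += cost
    return load


def cross_rank_neighbors(n, owner_of):
    """Count adjacent pairs (i, i+1) whose elements live on different ranks."""
    return sum(1 for i in range(n - 1) if owner_of(i) != owner_of(i + 1))


# Illustrative input: 8 elements on 4 ranks, cost grows with the index.
costs = list(range(1, 9))
n, ranks = len(costs), 4

block_load = load_per_rank(costs, lambda i: block_owner(i, n, ranks), ranks)
cyclic_load = load_per_rank(costs, lambda i: cyclic_owner(i, ranks), ranks)
block_cross = cross_rank_neighbors(n, lambda i: block_owner(i, n, ranks))
cyclic_cross = cross_rank_neighbors(n, lambda i: cyclic_owner(i, ranks))
```

In this toy input the block layout keeps most neighbor pairs on one rank but leaves the last rank with the heaviest load, while the cyclic layout evens out the load and makes every neighbor pair cross ranks. Which effect dominates depends on the real access pattern, so measure both.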

## Placement and memory movement

- keep data where computation occurs when possible
- minimize host-device round trips
- choose unified or device placement policies deliberately, not out of habit

## Algorithm-level optimization

- perform local aggregation before collective reduction
- batch remote operations
- avoid per-element remote calls in hot loops
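
The effect of batching and local aggregation can be sketched without a real backend. `FakeRemote` below is a hypothetical stand-in that only counts round trips; it is not part of this library, and the method names `put` and `put_many` are assumptions made for the example.

```python
class FakeRemote:
    """Hypothetical remote store that counts round trips instead of doing I/O."""

    def __init__(self):
        self.data = {}
        self.round_trips = 0

    def put(self, key, value):
        self.round_trips += 1
        self.data[key] = value

    def put_many(self, items):
        self.round_trips += 1
        self.data.update(items)


def write_per_element(remote, items):
    """Anti-pattern: one remote call per element inside the loop."""
    for key, value in items.items():
        remote.put(key, value)


def write_batched(remote, items, batch_size):
    """Group elements so that each remote call carries up to batch_size items."""
    keys = list(items)
    for start in range(0, len(keys), batch_size):
        chunk = {k: items[k] for k in keys[start:start + batch_size]}
        remote.put_many(chunk)


def reduce_with_local_aggregation(per_rank_values):
    """Each rank sums locally, so the collective combines one partial per rank."""
    partials = [sum(values) for values in per_rank_values]
    return sum(partials), len(partials)


items = {i: i * i for i in range(10)}
naive, batched = FakeRemote(), FakeRemote()
write_per_element(naive, items)
write_batched(batched, items, batch_size=4)
total, contributions = reduce_with_local_aggregation([[1, 2, 3], [4, 5], [6]])
```

Both writers leave the store in the same state, but the batched version needs 3 round trips instead of 10. The reduction combines 3 partial sums rather than 6 individual values.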

## Synchronization strategy

- reduce unnecessary global barriers
- prefer narrower synchronization domains when semantics allow
- separate control-plane sync from data-plane operations
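
A narrower synchronization domain can be illustrated with threads standing in for ranks. This is a sketch under assumptions: `run_phases` is a hypothetical helper, it uses Python's `threading.Barrier` rather than any barrier this library provides, and it assumes each phase only depends on ranks within the same group.

```python
import threading


def run_phases(num_ranks, group_size, phases):
    """Run `phases` steps where each rank waits only for its own group."""
    if num_ranks % group_size != 0:
        raise ValueError("num_ranks must be a multiple of group_size")

    num_groups = num_ranks // group_size
    # One barrier per group instead of a single barrier over all ranks.
    group_barriers = [threading.Barrier(group_size) for _ in range(num_groups)]
    completed = [0] * num_ranks

    def worker(rank):
        barrier = group_barriers[rank // group_size]
        for _ in range(phases):
            completed[rank] += 1       # stand-in for local work
            barrier.wait(timeout=5)    # blocks on group_size ranks, not num_ranks

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(num_ranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return completed


phase_counts = run_phases(num_ranks=4, group_size=2, phases=3)
```

With group barriers, a slow rank delays only the ranks in its group. This is valid only when no phase reads data produced outside the group; otherwise a wider barrier is still required.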

## Benchmarking approach

1. establish a baseline with fixed input and backend configuration
2. vary one policy/parameter at a time
3. track both throughput and tail latency
4. validate correctness after each optimization change
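
The steps above could be wrapped in a small harness along these lines. It is a minimal sketch using only the Python standard library: `benchmark` is a hypothetical helper, the 99th percentile is taken by nearest rank, and `sum` stands in for the operation under test.

```python
import statistics
import time


def benchmark(fn, arg, expected, repeats=50):
    """Time fn(arg) repeatedly and check every result against `expected`."""
    latencies = []
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(arg)
        latencies.append(time.perf_counter() - start)
        # Step 4: an optimization that changes the answer is not an optimization.
        if result != expected:
            raise AssertionError("result diverged from the baseline")

    latencies.sort()
    p99_index = min(len(latencies) - 1, int(0.99 * len(latencies)))
    total = sum(latencies)
    return {
        "throughput_per_s": repeats / total if total > 0 else float("inf"),
        "median_s": statistics.median(latencies),
        "p99_s": latencies[p99_index],  # tail latency
    }


# Fixed input, one configuration; rerun after changing a single parameter.
fixed_input = list(range(1000))
stats = benchmark(sum, fixed_input, expected=499500, repeats=20)
```

Reporting the median together with the tail makes it visible when a change improves typical runs but makes the slowest runs worse.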

## Common scaling bottlenecks

- rank skew from uneven partitioning
- high-frequency small-message collectives
- implicit sync points hidden in helper abstractions
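
Rank skew can be quantified from per-rank timings with a simple ratio. The sketch below assumes the timings have already been collected by whatever means is available; `imbalance_factor` and `slowest_rank` are hypothetical helpers and the sample timings are made up.

```python
def imbalance_factor(per_rank_times):
    """Slowest rank's time divided by the mean; 1.0 means perfectly balanced."""
    mean = sum(per_rank_times) / len(per_rank_times)
    return max(per_rank_times) / mean


def slowest_rank(per_rank_times):
    """Index of the rank that everyone else ends up waiting for."""
    return max(range(len(per_rank_times)), key=per_rank_times.__getitem__)


balanced = imbalance_factor([1.0, 1.0, 1.0, 1.0])
skewed = imbalance_factor([1.0, 1.0, 1.0, 5.0])
straggler = slowest_rank([1.0, 1.0, 1.0, 5.0])
```

Because synchronized steps finish only when the slowest rank does, a factor well above 1.0 suggests revisiting the partition before tuning communication.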

## Further reading

- `docs/user_guide/performance_tuning.md`

## Deep-dive reference

- [Legacy Deep-Dive: Performance Tuning](performance_tuning.md)