10. Performance Tuning and Scaling
Performance mindset
Optimize with measurement, not assumptions. Distributed performance is typically constrained by communication cost, synchronization frequency, data movement, and load imbalance across ranks.
High-impact levers
partition strategy
placement policy and transfer behavior
communication granularity and batching
collective frequency and scope
local kernel/vectorization efficiency
Partition and locality
Select partitioning to minimize cross-rank dependencies for your dominant access pattern.
block-like partitions preserve locality for contiguous access patterns
cyclic-like partitions spread elements across ranks, which can smooth load imbalance in irregular workloads
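The two partitioning styles can be sketched as index-to-rank mappings. This is an illustrative model, not a real API; the function names are hypothetical.

```python
# Hypothetical sketch: which rank owns global index i under two common
# partitioning schemes (illustrative only, not a library API).

def block_owner(i: int, n: int, ranks: int) -> int:
    """Owner of index i when n elements are split into contiguous blocks."""
    block = -(-n // ranks)  # ceiling division: elements per rank
    return i // block

def cyclic_owner(i: int, ranks: int) -> int:
    """Owner of index i when elements are dealt round-robin across ranks."""
    return i % ranks

# A contiguous scan of [0, 8) with 4 ranks: block keeps neighbors on the
# same rank (fewer cross-rank dependencies), cyclic interleaves them.
print([block_owner(i, 8, 4) for i in range(8)])   # [0, 0, 1, 1, 2, 2, 3, 3]
print([cyclic_owner(i, 4) for i in range(8)])     # [0, 1, 2, 3, 0, 1, 2, 3]
```

For a contiguous stencil or scan, the block mapping keeps most neighbor accesses rank-local; the cyclic mapping trades that locality for an even spread when per-index cost varies.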
Placement and memory movement
keep data where computation occurs when possible
minimize host-device round trips
use unified/device policies intentionally, not by default habit
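The cost of host-device round trips can be made concrete with a toy transfer counter. The `FakeDevice` class below is a stand-in invented for this sketch, not a real device API; it only counts copies.

```python
# Toy model (not a real device API) that counts host<->device transfers,
# to show why staging several operations device-side beats per-op round trips.

class FakeDevice:
    def __init__(self):
        self.transfers = 0

    def to_device(self, data):
        self.transfers += 1
        return list(data)

    def to_host(self, data):
        self.transfers += 1
        return list(data)

OPS = (lambda x: x + 1, lambda x: x * 2)

def per_op_round_trips(dev, data):
    # Anti-pattern: copy results back to the host after every operation.
    for op in OPS:
        d = dev.to_device(data)
        data = dev.to_host([op(x) for x in d])
    return data

def staged(dev, data):
    # Keep intermediates on the device: one upload, one download.
    d = dev.to_device(data)
    for op in OPS:
        d = [op(x) for x in d]
    return dev.to_host(d)

a, b = FakeDevice(), FakeDevice()
assert per_op_round_trips(a, [1, 2]) == staged(b, [1, 2]) == [4, 6]
print(a.transfers, b.transfers)  # 4 transfers vs 2
```

The same result is produced either way; only the number of copies differs, and on real hardware each copy has nontrivial latency and bandwidth cost.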
Algorithm-level optimization
perform local aggregation before collective reduction
batch remote operations
avoid per-element remote calls in hot loops
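Local aggregation before a global reduction can be sketched in pure Python. The simulation below stands in for real communication: `counting_send` is a hypothetical hook that counts how many "messages" each strategy issues.

```python
# Sketch of local aggregation before a global reduction. Ranks are
# simulated as lists; counting_send stands in for a remote operation.

def naive_global_sum(ranks_data, send):
    # Anti-pattern: one remote call per element in the hot loop.
    total = 0
    for rank_data in ranks_data:
        for x in rank_data:
            total = send(total, x)       # per-element message
    return total

def aggregated_global_sum(ranks_data, send):
    # Each rank reduces locally first, then contributes a single value.
    total = 0
    for rank_data in ranks_data:
        total = send(total, sum(rank_data))  # one message per rank
    return total

calls = {"n": 0}
def counting_send(acc, value):
    calls["n"] += 1
    return acc + value

data = [[1, 2, 3], [4, 5], [6]]
assert naive_global_sum(data, counting_send) == 21
print(calls["n"])                 # 6 messages: one per element
calls["n"] = 0
assert aggregated_global_sum(data, counting_send) == 21
print(calls["n"])                 # 3 messages: one per rank
```

Both strategies compute the same sum; the aggregated version reduces message count from the total element count to the rank count, which is exactly what batching remote operations buys you.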
Synchronization strategy
reduce unnecessary global barriers
prefer narrower synchronization domains when semantics allow
separate control-plane sync from data-plane operations
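Reducing barrier frequency can be sketched as syncing every `period` iterations instead of every iteration. Whether a longer period is safe depends entirely on your algorithm's semantics; the loop below only counts synchronization points.

```python
# Sketch: count synchronization points when a global barrier runs every
# iteration versus once per `period` iterations (illustrative only).

def run(iters: int, period: int) -> int:
    barriers = 0
    for step in range(iters):
        # ... local work for this step would go here ...
        if (step + 1) % period == 0:
            barriers += 1      # global barrier would fire here
    return barriers

print(run(100, 1))    # 100 barriers: sync every step
print(run(100, 10))   # 10 barriers: sync once per 10 steps
```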
Benchmarking approach
establish baseline with fixed input and backend config
vary one policy/parameter at a time
track both throughput and tail latency
validate correctness after each optimization change
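A minimal benchmark loop that tracks both throughput and tail latency might look like the sketch below; `workload` is a placeholder for the operation under test, and p99 is approximated by sorting per-iteration latencies.

```python
# Minimal sketch of a benchmark harness reporting throughput and p99
# latency; `workload` stands in for the operation being measured.

import time

def benchmark(workload, iters: int = 1000) -> dict:
    latencies = []
    start = time.perf_counter()
    for _ in range(iters):
        t0 = time.perf_counter()
        workload()
        latencies.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    latencies.sort()
    p99 = latencies[min(iters - 1, int(iters * 0.99))]
    return {"throughput_ops_s": iters / elapsed, "p99_s": p99}

stats = benchmark(lambda: sum(range(100)))
print(stats)
```

Reporting only mean throughput hides stragglers; the p99 figure is what exposes the occasional slow iteration that dominates user-visible latency at scale.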
Common scaling bottlenecks
rank skew from uneven partitioning
high-frequency small-message collectives
implicit sync points hidden in helper abstractions
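Rank skew from uneven partitioning can be quantified with a simple ratio of the largest partition to the mean partition size; a value well above 1.0 means the slowest rank is pacing the whole job. The metric below is one common convention, not a prescribed formula.

```python
# Sketch: measure rank skew as max/mean partition size. 1.0 is perfectly
# balanced; 2.5 means one rank carries 2.5x the average work.

def skew(partition_sizes) -> float:
    mean = sum(partition_sizes) / len(partition_sizes)
    return max(partition_sizes) / mean

print(skew([100, 100, 100, 100]))  # 1.0 -> balanced
print(skew([250, 50, 50, 50]))     # 2.5 -> one hot rank
```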
Further reading
docs/user_guide/performance_tuning.md