10. Performance Tuning and Scaling

Performance mindset

Optimize from measurement, not assumptions. Distributed performance is typically constrained by communication volume, synchronization frequency, data movement, and load imbalance.

High-impact levers

  • partition strategy

  • placement policy and transfer behavior

  • communication granularity and batching

  • collective frequency and scope

  • local kernel/vectorization efficiency

Partition and locality

Select partitioning to minimize cross-rank dependencies for your dominant access pattern.

  • block-like partitions often help contiguous workloads

  • cyclic-like partitions can smooth imbalance for irregular workloads
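The trade-off above can be sketched with two ownership maps. This is a minimal illustration, not any particular library's partitioning API; `block_owner` and `cyclic_owner` are hypothetical helper names.

```python
# Sketch: block vs cyclic ownership of n items across p ranks.
def block_owner(i, n, p):
    # contiguous chunks: each rank owns ~n/p consecutive indices
    chunk = -(-n // p)  # ceiling division
    return i // chunk

def cyclic_owner(i, p):
    # round-robin: index i goes to rank i mod p
    return i % p

n, p = 8, 4
block = [block_owner(i, n, p) for i in range(n)]
cyclic = [cyclic_owner(i, p) for i in range(n)]
# block  -> [0, 0, 1, 1, 2, 2, 3, 3]  (neighbors mostly on the same rank)
# cyclic -> [0, 1, 2, 3, 0, 1, 2, 3]  (hot spots spread across ranks)
```

For a contiguous stencil, the block map keeps most neighbor accesses local; for irregular per-item cost, the cyclic map spreads expensive items across ranks.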

Placement and memory movement

  • keep data where computation occurs when possible

  • minimize host-device round trips

  • use unified/device policies intentionally, not by default habit

Algorithm-level optimization

  • perform local aggregation before collective reduction

  • batch remote operations

  • avoid per-element remote calls in hot loops
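The first two points above can be sketched in a few lines. This simulates ranks with list shards; `simulated_reduce` is a hypothetical stand-in for a backend collective (e.g. an MPI-style reduce), not a real API.

```python
# Sketch: aggregate locally on each "rank", then reduce once.
def simulated_reduce(partials):
    # one contribution per rank instead of one per element
    return sum(partials)

data = list(range(100))
num_ranks = 4
shards = [data[r::num_ranks] for r in range(num_ranks)]

# local aggregation: each rank reduces its own shard first
partials = [sum(shard) for shard in shards]

total = simulated_reduce(partials)
# same result as sum(data), but num_ranks messages instead of len(data)
```

The reduction result is unchanged; only the communication count drops, from one message per element to one per rank.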

Synchronization strategy

  • reduce unnecessary global barriers

  • prefer narrower synchronization domains when semantics allow

  • separate control-plane sync from data-plane operations
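Narrower synchronization domains can be illustrated with threads standing in for ranks: instead of one global barrier, only ranks that actually share data wait on a shared barrier. This is a minimal sketch under that assumption, not a pattern from any specific runtime.

```python
import threading

NUM_RANKS = 4
GROUP_SIZE = 2  # narrower synchronization domain: pairs, not all ranks

# one barrier per group instead of a single global barrier
group_barriers = [threading.Barrier(GROUP_SIZE)
                  for _ in range(NUM_RANKS // GROUP_SIZE)]
results = [0] * NUM_RANKS

def worker(rank):
    local = rank * rank          # data-plane work: local computation
    group_barriers[rank // GROUP_SIZE].wait()  # sync only within the group
    results[rank] = local

threads = [threading.Thread(target=worker, args=(r,))
           for r in range(NUM_RANKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

A straggler in one group delays only its own group, not every rank in the job.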

Benchmarking approach

  1. establish baseline with fixed input and backend config

  2. vary one policy/parameter at a time

  3. track both throughput and tail latency

  4. validate correctness after each optimization change
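The four steps above can be wrapped in a small harness. This is an illustrative single-process sketch; `bench` is a hypothetical helper, and the p99 estimate is the crude sorted-sample kind, adequate for relative comparisons only.

```python
import time

def bench(fn, payload, iters=50):
    # fixed input and iteration count = the baseline configuration
    latencies = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn(payload)
        latencies.append(time.perf_counter() - t0)
    latencies.sort()
    p99 = latencies[min(iters - 1, int(iters * 0.99))]  # tail latency
    throughput = iters / sum(latencies)                  # ops per second
    return throughput, p99

payload = list(range(10_000))          # same input for every candidate
thr, p99 = bench(sum, payload)
assert sum(payload) == 49_995_000      # re-validate correctness each run
```

To follow step 2, rerun `bench` changing exactly one policy or parameter between runs, and compare both numbers: a change that raises throughput but inflates p99 is often a net loss at scale.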

Common scaling bottlenecks

  • rank skew from uneven partitioning

  • high-frequency small-message collectives

  • implicit sync points hidden in helper abstractions
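Rank skew, the first bottleneck above, is easy to quantify. A minimal sketch with a hypothetical `skew` metric (max load over mean load; 1.0 is perfectly balanced):

```python
def skew(loads):
    # >1.0 means the slowest rank gates every synchronized step
    return max(loads) / (sum(loads) / len(loads))

balanced = [100, 100, 100, 100]
skewed = [250, 50, 50, 50]
# skew(balanced) -> 1.0 ; skew(skewed) -> 2.5
```

With barriers or collectives in the loop, effective step time tracks the maximum load, so a skew of 2.5 roughly means the job runs 2.5x slower than its balanced equivalent.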

Further reading

  • docs/user_guide/performance_tuning.md

Deep-dive reference