Legacy Deep-Dive: Performance Tuning
This page is retained as a detailed reference. The canonical user path is now the chaptered handbook.
Primary chapter: 10-performance-tuning-and-scaling.md
Runtime and handles: Runtime and Handle Model
Detailed Reference (Legacy)
This guide covers the key decisions that affect DTL application performance, from view selection and execution policies to memory placement and communication patterns.
View Selection
Choosing the right view is the single most impactful performance decision in DTL. Each view type has fundamentally different communication characteristics.
local_view: The Fastest Path
local_view provides zero-overhead, contiguous access to the local partition. Its iterators are raw pointers (T*), making it fully compatible with STL algorithms, SIMD vectorization, and compiler auto-optimization.
When to use:
All operations that only need local data
STL algorithm integration
Performance-critical inner loops
Any situation where you do not need remote data
auto local = vec.local_view();
// Direct pointer access -- identical to std::vector::data()
double* ptr = local.data();
std::size_t n = local.size();
// Fully vectorizable loop
for (std::size_t i = 0; i < n; ++i) {
ptr[i] = std::sin(ptr[i]);
}
// STL algorithms work at full speed
std::sort(local.begin(), local.end());
auto sum = std::accumulate(local.begin(), local.end(), 0.0);
Performance characteristics:
Zero communication overhead
Contiguous random-access iterators (raw pointers)
Compiler can auto-vectorize loops
Cache-friendly sequential access
Construction cost: O(1)
segmented_view: The Distributed Algorithm Path
segmented_view is the primary iteration substrate for distributed algorithms. It provides iteration over segments (one per rank), where each segment exposes a local_view for data access. A one-time O(p) offset cache is built at construction.
When to use:
Writing distributed algorithms (local compute + collective communication)
Iterating over the distributed structure
When you need both local data and distribution metadata
auto segv = vec.segmented_view();
// Efficient: iterate only local segment, skip remote
segv.for_each_local([](double& x) {
x = std::sqrt(x);
});
// Or use the local_segment() shortcut
auto seg = segv.local_segment();
for (auto& x : seg) {
x *= 2.0;
}
Performance characteristics:
No communication during iteration
O(p) construction cost (offset cache, where p = number of ranks)
O(1) per-segment offset lookups after construction
Local segments provide local_view (same contiguous access)
Slight overhead compared to calling local_view() directly (segment descriptor copy)
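The O(p) offset cache is essentially an exclusive prefix sum over per-rank segment sizes: building it touches every rank once, and afterwards each segment's global offset is a plain array read. A standalone sketch of the idea in plain C++ (illustrative only, not the DTL implementation):

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Build the per-rank global-offset cache: offsets[r] is the global index
// of the first element owned by rank r (an exclusive prefix sum).
std::vector<std::size_t> build_offset_cache(const std::vector<std::size_t>& sizes) {
    std::vector<std::size_t> offsets(sizes.size(), 0);
    std::exclusive_scan(sizes.begin(), sizes.end(), offsets.begin(), std::size_t{0});
    return offsets;  // O(p) construction
}

// After construction, a segment's offset is an O(1) array read; finding
// which rank owns a global index is a binary search over the cache.
std::size_t owner_of(const std::vector<std::size_t>& offsets, std::size_t global_idx) {
    auto it = std::upper_bound(offsets.begin(), offsets.end(), global_idx);
    return static_cast<std::size_t>(it - offsets.begin()) - 1;
}
```

For example, with per-rank sizes {3, 3, 2} the cache is {0, 3, 6}, and global index 4 falls in rank 1's segment.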
When to prefer local_view() over segmented_view():
If you only need the local data and do not need segment metadata (rank, global_offset), call local_view() directly. It avoids the O(p) offset cache construction.
// Prefer this for simple local operations:
auto local = vec.local_view();
std::transform(local.begin(), local.end(), local.begin(),
[](double x) { return x * x; });
// Use segmented_view when you need distribution awareness:
auto segv = vec.segmented_view();
auto seg = segv.local_segment();
index_t global_start = seg.global_offset; // Need this metadata
global_view: The Correctness Path
global_view provides a logical global index space. Every access returns remote_ref<T>, even for local elements. This is by design to make communication costs explicit.
When to use:
Debugging and correctness verification
Sparse, infrequent remote element access
Prototyping before optimization
Algorithms that genuinely need global indexing
When NOT to use:
Performance-critical inner loops
Dense iteration over any elements (local or remote)
Bulk data access patterns
auto global = vec.global_view();
// Every access returns remote_ref -- even local elements
auto ref = global[idx]; // remote_ref<double>
double val = ref.get().value(); // Explicit, may communicate
// NEVER do this in a loop:
// for (index_t i = 0; i < global.size(); ++i) {
// sum += global[i].get().value(); // O(n) communications!
// }
Performance characteristics:
O(1) per-access, but each access returns remote_ref (indirection)
Remote element access triggers communication (RMA get/put)
Even local element access has remote_ref wrapper overhead
No built-in batching – each get()/put() is independent
View Selection Summary
| Scenario | Recommended View | Reason |
|---|---|---|
| Local computation | local_view | Zero overhead, STL compatible |
| Distributed algorithm | segmented_view | Distribution-aware, no communication |
| Sparse remote access | global_view | Explicit communication via remote_ref |
| Performance-critical loop | local_view | Raw pointer iterators, vectorizable |
| Need global offset metadata | segmented_view | Provides global_offset per segment |
| Debugging | global_view | Shows logical global structure |
Rule of thumb: Start with local_view(). Use segmented_view() when you need distribution metadata. Use global_view() only for sparse remote access or prototyping.
Execution Policy Selection
Execution policies control how local computation within an algorithm is dispatched. They do not affect communication patterns – communication is determined by the algorithm itself.
seq: Sequential Execution
Single-threaded, blocking, deterministic.
dtl::for_each(dtl::seq{}, vec, [](double& x) { x *= 2.0; });
Characteristics:
Deterministic element processing order
No threading overhead
SIMD vectorization is still allowed
Easiest to debug
Best for:
Small data sizes (< 10K elements per rank)
Operations with complex dependencies
Debugging and development
When deterministic ordering is required
par: Parallel Execution
Multi-threaded within a rank, blocking completion.
dtl::for_each(dtl::par{}, vec, [](double& x) { x *= 2.0; });
// Explicit thread count
dtl::for_each(dtl::par_n<4>{}, vec, [](double& x) { x *= 2.0; });
Characteristics:
Multi-threaded local computation (uses cpu_executor thread pool)
Non-deterministic element processing order
Call blocks until all threads finish
User-provided functions must be thread-safe
Best for:
Large local partitions (> 100K elements per rank)
Compute-bound operations (not memory-bound)
Independent element operations (map, filter, transform)
When you have spare CPU cores (i.e., not oversubscribed with MPI)
Thread count guidance:
With MPI, each rank is a process. If you have R ranks per node and C cores per node, set the thread count to C / R. Note that par_n<> takes its count as a template parameter, so it must be a compile-time constant; a runtime value such as std::thread::hardware_concurrency() cannot be used directly:
// Example: 16 cores per node, 4 ranks per node --> 4 threads per rank
constexpr unsigned int threads_per_rank = 16 / 4;
dtl::for_each(dtl::par_n<threads_per_rank>{}, vec, f);
async: Asynchronous Execution
Non-blocking, returns distributed_future<T>.
auto future = dtl::async_reduce(vec, 0.0, std::plus<>{});
// ... overlap with other work ...
double result = future.get(); // Block when result is needed
Characteristics:
Returns immediately with a future
Enables overlap of computation and communication
Supports continuation chaining via .then()
Requires polling or background progress for completion
Best for:
Overlapping communication with computation
Pipelining multiple operations
Non-blocking collective operations
Latency hiding in multi-phase algorithms
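Continuation chaining lets you express a multi-stage pipeline without blocking between stages. A sketch, assuming distributed_future<T>::then() passes the completed result to the continuation (the exact signature and the total_elements variable are illustrative, not confirmed API):

```cpp
// total_elements is assumed to hold the global element count
auto pipeline = dtl::async_reduce(vec, 0.0, std::plus<>{})
    .then([total_elements](double sum) {
        // Runs when the reduction completes; its return feeds the next stage
        return sum / static_cast<double>(total_elements);
    });
// ... overlap with other work ...
double mean = pipeline.get();  // Blocks only when the final value is needed
```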
Progress requirements:
Async operations require progress to complete. Either poll explicitly or enable background progress:
// Option 1: Explicit polling
auto future = dtl::async_reduce(vec, 0.0, std::plus<>{});
while (!future.is_ready()) {
dtl::futures::progress_engine::instance().poll();
// do other work
}
// Option 2: Just call .get() -- it polls internally
double result = future.get();
Execution Policy Decision Tree
Is data size < 10K elements per rank?
  YES --> Use seq{}
  NO  --> Is the operation compute-bound (not memory-bound)?
    YES --> Are there spare CPU cores?
      YES --> Use par{}
      NO  --> Use seq{} (avoid oversubscription)
    NO  --> Do you need to overlap with communication?
      YES --> Use async{}
      NO  --> Use seq{} (memory-bound won't benefit from par)
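The tree can be written down directly as code. A standalone helper in plain C++ (the names and thresholds follow the tree above; this is an illustrative sketch, not DTL API):

```cpp
#include <cstddef>

enum class policy_choice { seq, par, async };

// Encodes the decision tree: small data -> seq; compute-bound with spare
// cores -> par; memory-bound -> async only when overlapping communication.
policy_choice choose_policy(std::size_t elems_per_rank,
                            bool compute_bound,
                            bool spare_cores,
                            bool overlap_comm) {
    if (elems_per_rank < 10'000) return policy_choice::seq;
    if (compute_bound) return spare_cores ? policy_choice::par : policy_choice::seq;
    return overlap_comm ? policy_choice::async : policy_choice::seq;
}
```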
Placement Policy Impact
Placement policies determine where container data resides in memory. The choice affects both access patterns and performance.
host_only (Default)
Data resides in CPU memory. This is the default and the simplest option.
dtl::distributed_vector<float> vec(1000); // Implicitly host_only
dtl::distributed_vector<float, dtl::host_only> vec_explicit(1000);
No GPU interaction needed
Direct CPU access, no copies
STL algorithms work directly
Suitable for CPU-only workloads and most MPI-based applications
device_only
Data resides exclusively on a specific GPU device. Host access requires explicit copies.
dtl::distributed_vector<float, dtl::device_only<0>> vec(1000, ctx);
Data allocated via cudaMalloc (or equivalent)
GPU kernels access data directly – no transfer latency
Host access requires cudaMemcpy (explicit)
Best for GPU-resident compute pipelines
Compile-time device selection via template parameter
Performance tip: If your entire pipeline runs on GPU, device_only avoids all host-device transfer overhead. The data never touches host memory.
unified_memory
Data is accessible from both host and device via CUDA Unified Memory (cudaMallocManaged).
dtl::distributed_vector<float, dtl::unified_memory> vec(1000, ctx);
Automatic page migration between host and device
No explicit copies needed – access from either side
First access after migration incurs page fault latency
Use prefetch_to_device() / prefetch_to_host() to reduce migration stalls
Performance tip: Prefetch data before you need it:
auto local = vec.local_view();
// Prefetch to GPU before kernel launch
dtl::unified_memory::prefetch_to_device(local.data(), local.size_bytes(), 0);
// GPU kernel runs without page faults
launch_kernel(local.data(), local.size());
When to use unified_memory:
Mixed host/device access patterns
Prototyping GPU code before optimizing transfers
When migration overhead is acceptable (large, infrequent transfers)
When NOT to use:
Tight host-device ping-pong patterns (migration thrashing)
When you know exactly which side needs data (use device_only or host_only)
Performance-critical GPU kernels (explicit copies give more control)
device_preferred
Data is preferentially placed on GPU with automatic host fallback if GPU memory is exhausted.
dtl::distributed_vector<float, dtl::device_preferred> vec(1000, ctx);
Attempts cudaMalloc first
Falls back to host memory if allocation fails
Useful for workloads that can gracefully degrade
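The fallback behavior can be pictured as a try-then-fallback allocator. A standalone sketch of the mechanism in plain C++ with a stubbed device allocator standing in for cudaMalloc (illustrative only, not the DTL implementation):

```cpp
#include <cstddef>
#include <cstdlib>
#include <utility>

enum class mem_kind { device, host };

// Try the device allocator first; fall back to host memory on failure.
// try_device_alloc stands in for cudaMalloc and may return nullptr.
std::pair<void*, mem_kind> alloc_preferred(std::size_t bytes,
                                           void* (*try_device_alloc)(std::size_t)) {
    if (void* p = try_device_alloc(bytes)) {
        return {p, mem_kind::device};
    }
    return {std::malloc(bytes), mem_kind::host};  // graceful degradation
}
```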
Placement Policy Comparison
| Policy | Location | Host Access | Device Access | Copies Needed | Best For |
|---|---|---|---|---|---|
| host_only | CPU RAM | Direct | Explicit copy | To device | CPU workloads |
| device_only<N> | GPU N | Explicit copy | Direct | To host | GPU-resident pipelines |
| unified_memory | Migrated | Automatic | Automatic | None (implicit) | Mixed access |
| device_preferred | GPU (fallback CPU) | Depends | Depends | Depends | Flexible GPU workloads |
Avoiding remote_ref in Loops
The most common performance mistake in DTL is using remote_ref in a loop. Each get() or put() on a remote element triggers a separate communication operation.
The Anti-Pattern
// DISASTROUS PERFORMANCE: one communication per element
auto global = vec.global_view();
double sum = 0.0;
for (index_t i = 0; i < global.size(); ++i) {
sum += global[i].get().value(); // N communications total!
}
For a vector of 1,000,000 elements across 4 ranks, this loop performs ~750,000 remote get() calls on each rank (every element that is not local) – each one paying network latency.
The Correct Pattern
// FAST: local computation + one collective
// Manual form: accumulate the local partition, then combine across ranks
auto local = vec.local_view();
double local_sum = std::accumulate(local.begin(), local.end(), 0.0);
// Equivalent one-liner: dtl::reduce accumulates locally and performs
// a single allreduce, replacing the per-element remote gets
double global_sum = dtl::reduce(vec, 0.0, std::plus<>{});
When remote_ref Is Acceptable
remote_ref is designed for sparse, infrequent access where the communication is intentional:
// OK: Occasional boundary element exchange
auto global = vec.global_view();
if (my_rank > 0) {
// Get one boundary element from the previous rank
auto boundary = global[my_local_start - 1].get().value();
// Use it in a stencil computation
}
For bulk boundary exchange, use halo regions instead:
// BETTER: Halo exchange for stencil patterns
auto halo = tensor.halo_view(1);
halo.exchange(); // One collective instead of per-element gets
Batch vs Single-Element Operations on distributed_map
distributed_map uses hash-based key distribution. Single-element operations on remote keys require point-to-point communication. Batch operations amortize this cost.
Single-Element (Slow for Remote Keys)
dtl::distributed_map<std::string, int> map(ctx);
// Each insert/lookup on a remote key triggers communication
for (const auto& key : keys) {
map.insert(key, compute_value(key)); // May communicate per insert
}
Batch Operations (Preferred)
// Batch insert: sorts keys by owner rank, sends in bulk
std::vector<std::pair<std::string, int>> batch;
for (const auto& key : keys) {
batch.emplace_back(key, compute_value(key));
}
map.batch_insert(batch); // One round of communication
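The batching step works by computing each key's owner rank from its hash and grouping keys per destination, so each rank receives one bulk message instead of one message per key. A standalone sketch of the grouping in plain C++ (illustrative only, not the DTL implementation):

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hash-based ownership: the same deterministic rule on every rank.
std::size_t owner_rank(const std::string& key, std::size_t num_ranks) {
    return std::hash<std::string>{}(key) % num_ranks;
}

// Group keys by destination rank so they can be sent in bulk.
std::vector<std::vector<std::string>>
group_by_owner(const std::vector<std::string>& keys, std::size_t num_ranks) {
    std::vector<std::vector<std::string>> buckets(num_ranks);
    for (const auto& k : keys) {
        buckets[owner_rank(k, num_ranks)].push_back(k);
    }
    return buckets;
}
```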
Local-Only Iteration
When iterating over a distributed_map, only local key-value pairs are visited. This is always communication-free:
// Iterating local pairs -- no communication
for (const auto& [key, value] : map) {
process(key, value); // All pairs are local to this rank
}
MPI Thread Level Selection
MPI thread support level affects both correctness and performance. DTL requests the thread level via environment_options.
| Level | Constant | Meaning | When to Use |
|---|---|---|---|
| 0 | MPI_THREAD_SINGLE | Only one thread calls MPI | Single-threaded DTL, no par{} |
| 1 | MPI_THREAD_FUNNELED | Only the main thread calls MPI | Threaded compute where only the main thread communicates |
| 2 | MPI_THREAD_SERIALIZED | Any thread may call MPI, but not concurrently | Safe default for most DTL usage |
| 3 | MPI_THREAD_MULTIPLE | Any thread may call MPI concurrently | Async-heavy workloads needing concurrent MPI calls |
Recommendations
For most applications: Request MPI_THREAD_SERIALIZED. This allows using par{} for local computation while DTL serializes MPI calls internally.
auto opts = dtl::environment_options::defaults();
// defaults() already requests MPI_THREAD_SERIALIZED
dtl::environment env(argc, argv, opts);
For async-heavy workloads: Request MPI_THREAD_MULTIPLE if you use async{} execution and the progress engine needs to make concurrent MPI calls.
auto opts = dtl::environment_options::defaults();
opts.mpi_thread_level = 3; // MPI_THREAD_MULTIPLE
dtl::environment env(argc, argv, opts);
Performance note: MPI_THREAD_MULTIPLE may reduce MPI performance on some implementations due to internal locking. Only use it if you genuinely need concurrent MPI calls from multiple threads.
NCCL vs MPI for GPU Collectives
When both NCCL and MPI are available, the choice of collective backend significantly affects GPU communication performance.
When to Use NCCL
GPU-to-GPU collectives (allreduce, broadcast, allgather)
Multi-GPU within a node (NVLink, NVSwitch)
Large message sizes (> 1MB)
Deep learning and dense linear algebra workloads
NCCL is optimized for GPU-resident data and can use NVLink/NVSwitch for intra-node transfers, bypassing PCIe entirely.
When to Use MPI
CPU-to-CPU communication
Small message sizes (< 1KB), where NCCL launch overhead dominates
Point-to-point operations (MPI has richer P2P support)
Mixed CPU/GPU workloads where data is on host
Heterogeneous clusters without NVLink
Hybrid Strategy
For the best performance, use NCCL for GPU collectives and MPI for everything else:
auto opts = dtl::environment_options::defaults();
// Enable both backends
dtl::environment env(argc, argv, opts);
// CPU operations use MPI automatically
auto cpu_ctx = env.make_world_context();
// GPU operations can leverage NCCL
auto gpu_ctx = env.make_world_context(/*device_id=*/0);
Memory Allocation Patterns
Pre-allocate Containers
Avoid repeated allocation and deallocation in loops:
// BAD: Allocates and deallocates each iteration
for (int iter = 0; iter < 1000; ++iter) {
dtl::distributed_vector<double> temp(n, ctx);
compute(temp);
} // temp deallocated here
// GOOD: Allocate once, reuse
dtl::distributed_vector<double> temp(n, ctx);
for (int iter = 0; iter < 1000; ++iter) {
compute(temp);
}
Reuse Views
Views are lightweight (pointer + size), but segmented_view has O(p) construction cost for the offset cache. Reuse views when possible:
// OK for local_view (O(1) construction)
for (int iter = 0; iter < 1000; ++iter) {
auto local = vec.local_view(); // Cheap
process(local);
}
// Better for segmented_view (O(p) construction)
auto segv = vec.segmented_view(); // Build cache once
for (int iter = 0; iter < 1000; ++iter) {
segv.for_each_local([](double& x) { x *= 2.0; });
}
// Note: view is invalid after structural operations (resize, redistribute)
Minimize Structural Operations
resize() and redistribute() are collective operations that reallocate memory and invalidate all views:
// BAD: Resize in a loop
for (int iter = 0; iter < n; ++iter) {
vec.resize(vec.size() + 1); // Collective + realloc each iteration
}
// GOOD: Compute final size, resize once
vec.resize(vec.size() + n);
See Also
Views Guide – View types and semantics
Policies Guide – Full policy reference
Environment Guide – Backend lifecycle