Legacy Deep-Dive: Performance Tuning

This page is retained as a detailed reference. The canonical user path is now the chaptered handbook.

Primary chapter: 10-performance-tuning-and-scaling.md

Runtime and handles: Runtime and Handle Model


Detailed Reference (Legacy)

This guide covers the key decisions that affect DTL application performance, from view selection and execution policies to memory placement and communication patterns.


Table of Contents


View Selection

Choosing the right view is the single most impactful performance decision in DTL. Each view type has fundamentally different communication characteristics.

local_view: The Fastest Path

local_view provides zero-overhead, contiguous access to the local partition. Its iterators are raw pointers (T*), making it fully compatible with STL algorithms, SIMD vectorization, and compiler auto-optimization.

When to use:

  • All operations that only need local data

  • STL algorithm integration

  • Performance-critical inner loops

  • Any situation where you do not need remote data

auto local = vec.local_view();

// Direct pointer access -- identical to std::vector::data()
double* ptr = local.data();
std::size_t n = local.size();

// Fully vectorizable loop
for (std::size_t i = 0; i < n; ++i) {
    ptr[i] = std::sin(ptr[i]);
}

// STL algorithms work at full speed
std::sort(local.begin(), local.end());
auto sum = std::accumulate(local.begin(), local.end(), 0.0);

Performance characteristics:

  • Zero communication overhead

  • Contiguous random-access iterators (raw pointers)

  • Compiler can auto-vectorize loops

  • Cache-friendly sequential access

  • Construction cost: O(1)

segmented_view: The Distributed Algorithm Path

segmented_view is the primary iteration substrate for distributed algorithms. It provides iteration over segments (one per rank), where each segment exposes a local_view for data access. A one-time O(p) offset cache is built at construction.

When to use:

  • Writing distributed algorithms (local compute + collective communication)

  • Iterating over the distributed structure

  • When you need both local data and distribution metadata

auto segv = vec.segmented_view();

// Efficient: iterate only local segment, skip remote
segv.for_each_local([](double& x) {
    x = std::sqrt(x);
});

// Or use the local_segment() shortcut
auto seg = segv.local_segment();
for (auto& x : seg) {
    x *= 2.0;
}

Performance characteristics:

  • No communication during iteration

  • O(p) construction cost (offset cache, where p = number of ranks)

  • O(1) per-segment offset lookups after construction

  • Local segments provide local_view (same contiguous access)

  • Slight overhead compared to local_view() directly (segment descriptor copy)

When to prefer local_view() over segmented_view():

If you only need the local data and do not need segment metadata (rank, global_offset), call local_view() directly. It avoids the O(p) offset cache construction.

// Prefer this for simple local operations:
auto local = vec.local_view();
std::transform(local.begin(), local.end(), local.begin(),
               [](double x) { return x * x; });

// Use segmented_view when you need distribution awareness:
auto segv = vec.segmented_view();
auto seg = segv.local_segment();
index_t global_start = seg.global_offset;  // Need this metadata

global_view: The Correctness Path

global_view provides a logical global index space. Every access returns remote_ref<T>, even for local elements. This is by design to make communication costs explicit.

When to use:

  • Debugging and correctness verification

  • Sparse, infrequent remote element access

  • Prototyping before optimization

  • Algorithms that genuinely need global indexing

When NOT to use:

  • Performance-critical inner loops

  • Dense iteration over any elements (local or remote)

  • Bulk data access patterns

auto global = vec.global_view();

// Every access returns remote_ref -- even local elements
auto ref = global[idx];     // remote_ref<double>
double val = ref.get().value();  // Explicit, may communicate

// NEVER do this in a loop:
// for (index_t i = 0; i < global.size(); ++i) {
//     sum += global[i].get().value();  // O(n) communications!
// }

Performance characteristics:

  • O(1) per-access, but each access returns remote_ref (indirection)

  • Remote element access triggers communication (RMA get/put)

  • Even local element access has remote_ref wrapper overhead

  • No built-in batching – each get()/put() is independent

View Selection Summary

Scenario

Recommended View

Reason

Local computation

local_view()

Zero overhead, STL compatible

Distributed algorithm

segmented_view()

Distribution-aware, no communication

Sparse remote access

global_view()

Explicit communication via remote_ref

Performance-critical loop

local_view()

Raw pointer iterators, vectorizable

Need global offset metadata

segmented_view()

Provides global_offset per segment

Debugging

global_view()

Shows logical global structure

Rule of thumb: Start with local_view(). Use segmented_view() when you need distribution metadata. Use global_view() only for sparse remote access or prototyping.


Execution Policy Selection

Execution policies control how local computation within an algorithm is dispatched. They do not affect communication patterns – communication is determined by the algorithm itself.

seq: Sequential Execution

Single-threaded, blocking, deterministic.

dtl::for_each(dtl::seq{}, vec, [](double& x) { x *= 2.0; });

Characteristics:

  • Deterministic element processing order

  • No threading overhead

  • SIMD vectorization is still allowed

  • Easiest to debug

Best for:

  • Small data sizes (< 10K elements per rank)

  • Operations with complex dependencies

  • Debugging and development

  • When deterministic ordering is required

par: Parallel Execution

Multi-threaded within a rank, blocking completion.

dtl::for_each(dtl::par{}, vec, [](double& x) { x *= 2.0; });

// Explicit thread count
dtl::for_each(dtl::par_n<4>{}, vec, [](double& x) { x *= 2.0; });

Characteristics:

  • Multi-threaded local computation (uses cpu_executor thread pool)

  • Non-deterministic element processing order

  • Call blocks until all threads finish

  • User-provided functions must be thread-safe

Best for:

  • Large local partitions (> 100K elements per rank)

  • Compute-bound operations (not memory-bound)

  • Independent element operations (map, filter, transform)

  • When you have spare CPU cores (i.e., not oversubscribed with MPI)

Thread count guidance:

With MPI, each rank is a process. If you have R ranks per node and C cores per node, set threads to C / R:

unsigned int threads_per_rank = std::thread::hardware_concurrency() / ranks_per_node;
dtl::for_each(dtl::par_n<threads_per_rank>{}, vec, f);

async: Asynchronous Execution

Non-blocking, returns distributed_future<T>.

auto future = dtl::async_reduce(vec, 0.0, std::plus<>{});
// ... overlap with other work ...
double result = future.get();  // Block when result is needed

Characteristics:

  • Returns immediately with a future

  • Enables overlap of computation and communication

  • Supports continuation chaining via .then()

  • Requires polling or background progress for completion

Best for:

  • Overlapping communication with computation

  • Pipelining multiple operations

  • Non-blocking collective operations

  • Latency hiding in multi-phase algorithms

Progress requirements:

Async operations require progress to complete. Either poll explicitly or enable background progress:

// Option 1: Explicit polling
auto future = dtl::async_reduce(vec, 0.0, std::plus<>{});
while (!future.is_ready()) {
    dtl::futures::progress_engine::instance().poll();
    // do other work
}

// Option 2: Just call .get() -- it polls internally
double result = future.get();

Execution Policy Decision Tree

Is data size < 10K elements per rank?
  YES --> Use seq{}
  NO  --> Is the operation compute-bound (not memory-bound)?
            YES --> Are there spare CPU cores?
                      YES --> Use par{}
                      NO  --> Use seq{} (avoid oversubscription)
            NO  --> Do you need to overlap with communication?
                      YES --> Use async{}
                      NO  --> Use seq{} (memory-bound won't benefit from par)

Placement Policy Impact

Placement policies determine where container data resides in memory. The choice affects both access patterns and performance.

host_only (Default)

Data resides in CPU memory. This is the default and the simplest option.

dtl::distributed_vector<float> vec(1000);  // Implicitly host_only
dtl::distributed_vector<float, dtl::host_only> vec_explicit(1000);
  • No GPU interaction needed

  • Direct CPU access, no copies

  • STL algorithms work directly

  • Suitable for CPU-only workloads and most MPI-based applications

device_only

Data resides exclusively on a specific GPU device. Host access requires explicit copies.

dtl::distributed_vector<float, dtl::device_only<0>> vec(1000, ctx);
  • Data allocated via cudaMalloc (or equivalent)

  • GPU kernels access data directly – no transfer latency

  • Host access requires cudaMemcpy (explicit)

  • Best for GPU-resident compute pipelines

  • Compile-time device selection via template parameter

Performance tip: If your entire pipeline runs on GPU, device_only avoids all host-device transfer overhead. The data never touches host memory.

unified_memory

Data is accessible from both host and device via CUDA Unified Memory (cudaMallocManaged).

dtl::distributed_vector<float, dtl::unified_memory> vec(1000, ctx);
  • Automatic page migration between host and device

  • No explicit copies needed – access from either side

  • First access after migration incurs page fault latency

  • Use prefetch_to_device() / prefetch_to_host() to reduce migration stalls

Performance tip: Prefetch data before you need it:

auto local = vec.local_view();
// Prefetch to GPU before kernel launch
dtl::unified_memory::prefetch_to_device(local.data(), local.size_bytes(), 0);

// GPU kernel runs without page faults
launch_kernel(local.data(), local.size());

When to use unified_memory:

  • Mixed host/device access patterns

  • Prototyping GPU code before optimizing transfers

  • When migration overhead is acceptable (large, infrequent transfers)

When NOT to use:

  • Tight host-device ping-pong patterns (migration thrashing)

  • When you know exactly which side needs data (use device_only or host_only)

  • Performance-critical GPU kernels (explicit copies give more control)

device_preferred

Data is preferentially placed on GPU with automatic host fallback if GPU memory is exhausted.

dtl::distributed_vector<float, dtl::device_preferred> vec(1000, ctx);
  • Attempts cudaMalloc first

  • Falls back to host memory if allocation fails

  • Useful for workloads that can gracefully degrade

Placement Policy Comparison

Policy

Location

Host Access

Device Access

Copies Needed

Best For

host_only

CPU RAM

Direct

Explicit copy

To device

CPU workloads

device_only<N>

GPU N

Explicit copy

Direct

To host

GPU-resident pipelines

unified_memory

Migrated

Automatic

Automatic

None (implicit)

Mixed access

device_preferred

GPU (fallback CPU)

Depends

Depends

Depends

Flexible GPU


Avoiding remote_ref in Loops

The most common performance mistake in DTL is using remote_ref in a loop. Each get() or put() on a remote element triggers a separate communication operation.

The Anti-Pattern

// DISASTROUS PERFORMANCE: one communication per element
auto global = vec.global_view();
double sum = 0.0;
for (index_t i = 0; i < global.size(); ++i) {
    sum += global[i].get().value();  // N communications total!
}

For a vector of 1,000,000 elements across 4 ranks, this triggers ~750,000 remote get() calls – each with network latency.

The Correct Pattern

// FAST: local computation + one collective
auto local = vec.local_view();
double local_sum = std::accumulate(local.begin(), local.end(), 0.0);
// One allreduce replaces 750,000 remote gets
double global_sum = dtl::reduce(vec, 0.0, std::plus<>{});

When remote_ref Is Acceptable

remote_ref is designed for sparse, infrequent access where the communication is intentional:

// OK: Occasional boundary element exchange
auto global = vec.global_view();
if (my_rank > 0) {
    // Get one boundary element from the previous rank
    auto boundary = global[my_local_start - 1].get().value();
    // Use it in a stencil computation
}

For bulk boundary exchange, use halo regions instead:

// BETTER: Halo exchange for stencil patterns
auto halo = tensor.halo_view(1);
halo.exchange();  // One collective instead of per-element gets

Batch vs Single-Element Operations on distributed_map

distributed_map uses hash-based key distribution. Single-element operations on remote keys require point-to-point communication. Batch operations amortize this cost.

Single-Element (Slow for Remote Keys)

dtl::distributed_map<std::string, int> map(ctx);

// Each insert/lookup on a remote key triggers communication
for (const auto& key : keys) {
    map.insert(key, compute_value(key));  // May communicate per insert
}

Batch Operations (Preferred)

// Batch insert: sorts keys by owner rank, sends in bulk
std::vector<std::pair<std::string, int>> batch;
for (const auto& key : keys) {
    batch.emplace_back(key, compute_value(key));
}
map.batch_insert(batch);  // One round of communication

Local-Only Iteration

When iterating over a distributed_map, only local key-value pairs are visited. This is always communication-free:

// Iterating local pairs -- no communication
for (const auto& [key, value] : map) {
    process(key, value);  // All pairs are local to this rank
}

MPI Thread Level Selection

MPI thread support level affects both correctness and performance. DTL requests the thread level via environment_options.

Level

Constant

Meaning

When to Use

MPI_THREAD_SINGLE

0

Only one thread calls MPI

Single-threaded DTL, no par{}

MPI_THREAD_FUNNELED

1

Only the main thread calls MPI

par{} for local compute, collectives on main thread

MPI_THREAD_SERIALIZED

2

Any thread may call MPI, but not concurrently

Safe default for most DTL usage

MPI_THREAD_MULTIPLE

3

Any thread may call MPI concurrently

async{} operations with concurrent collectives

Recommendations

For most applications: Request MPI_THREAD_SERIALIZED. This allows using par{} for local computation while DTL serializes MPI calls internally.

auto opts = dtl::environment_options::defaults();
// defaults() already requests MPI_THREAD_SERIALIZED
dtl::environment env(argc, argv, opts);

For async-heavy workloads: Request MPI_THREAD_MULTIPLE if you use async{} execution and the progress engine needs to make concurrent MPI calls.

auto opts = dtl::environment_options::defaults();
opts.mpi_thread_level = 3;  // MPI_THREAD_MULTIPLE
dtl::environment env(argc, argv, opts);

Performance note: MPI_THREAD_MULTIPLE may reduce MPI performance on some implementations due to internal locking. Only use it if you genuinely need concurrent MPI calls from multiple threads.


NCCL vs MPI for GPU Collectives

When both NCCL and MPI are available, the choice of collective backend significantly affects GPU communication performance.

When to Use NCCL

  • GPU-to-GPU collectives (allreduce, broadcast, allgather)

  • Multi-GPU within a node (NVLink, NVSwitch)

  • Large message sizes (> 1MB)

  • Deep learning and dense linear algebra workloads

NCCL is optimized for GPU-resident data and can use NVLink/NVSwitch for intra-node transfers, bypassing PCIe entirely.

When to Use MPI

  • CPU-to-CPU communication

  • Small message sizes (< 1KB), where NCCL launch overhead dominates

  • Point-to-point operations (MPI has richer P2P support)

  • Mixed CPU/GPU workloads where data is on host

  • Heterogeneous clusters without NVLink

Hybrid Strategy

For the best performance, use NCCL for GPU collectives and MPI for everything else:

auto opts = dtl::environment_options::defaults();
// Enable both backends
dtl::environment env(argc, argv, opts);

// CPU operations use MPI automatically
auto cpu_ctx = env.make_world_context();

// GPU operations can leverage NCCL
auto gpu_ctx = env.make_world_context(/*device_id=*/0);

Memory Allocation Patterns

Pre-allocate Containers

Avoid repeated allocation and deallocation in loops:

// BAD: Allocates and deallocates each iteration
for (int iter = 0; iter < 1000; ++iter) {
    dtl::distributed_vector<double> temp(n, ctx);
    compute(temp);
}  // temp deallocated here

// GOOD: Allocate once, reuse
dtl::distributed_vector<double> temp(n, ctx);
for (int iter = 0; iter < 1000; ++iter) {
    compute(temp);
}

Reuse Views

Views are lightweight (pointer + size), but segmented_view has O(p) construction cost for the offset cache. Reuse views when possible:

// OK for local_view (O(1) construction)
for (int iter = 0; iter < 1000; ++iter) {
    auto local = vec.local_view();  // Cheap
    process(local);
}

// Better for segmented_view (O(p) construction)
auto segv = vec.segmented_view();  // Build cache once
for (int iter = 0; iter < 1000; ++iter) {
    segv.for_each_local([](double& x) { x *= 2.0; });
}
// Note: view is invalid after structural operations (resize, redistribute)

Minimize Structural Operations

resize() and redistribute() are collective operations that reallocate memory and invalidate all views:

// BAD: Resize in a loop
for (int iter = 0; iter < n; ++iter) {
    vec.resize(vec.size() + 1);  // Collective + realloc each iteration
}

// GOOD: Compute final size, resize once
vec.resize(vec.size() + n);

See Also