Backend Selection Guide

This guide helps you choose the right DTL backend configuration for your use case.

Available Backends

DTL supports multiple backends that can be combined based on your hardware and requirements.

| Backend | CMake Option | Purpose | Status |
| --- | --- | --- | --- |
| MPI | (always available) | Multi-node distributed computing | Complete |
| CPU | (always available) | CPU thread pool execution | Complete |
| CUDA | DTL_ENABLE_CUDA | NVIDIA GPU acceleration | Complete |
| HIP | DTL_ENABLE_HIP | AMD GPU acceleration | Headers (V1.2) |
| NCCL | DTL_ENABLE_NCCL | GPU-native collective communication | Headers (V1.2) |
| OpenSHMEM | DTL_ENABLE_SHMEM | PGAS communication model | Planned |

Environment Lifecycle (V1.2)

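DTL provides the environment class for unified backend lifecycle management. Construct it one of two ways, depending on whether DTL or your application owns MPI initialization: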

#include <mpi.h>  // only needed for the adopt-external variant
#include <dtl/core/environment.hpp>
#include <dtl/core/environment_options.hpp>

int main() {
    // Option A: DTL manages MPI init/finalize
    dtl::environment env{dtl::environment_options::defaults()};

    // Option B (use instead of Option A): adopt externally initialized MPI.
    // DTL does not finalize an adopted backend, so the caller must call
    // MPI_Finalize() after the environment is destroyed.
    //   MPI_Init(nullptr, nullptr);
    //   dtl::environment env{dtl::environment_options::adopt_mpi()};

    // Query backend availability (instance methods)
    if (env.has_mpi()) { /* ... */ }
    if (env.has_cuda()) { /* ... */ }
}
// Backends finalized on last guard destruction

Backend Ownership Modes:

  • dtl_owns - DTL initializes and finalizes the backend

  • adopt_external - Backend initialized externally; DTL does not finalize

  • optional - DTL tries to initialize; silently ignores failure

  • disabled - Backend not used
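The four modes differ in who initializes the backend, who finalizes it, and what happens on failure. The following is a minimal standalone sketch of that decision logic, not DTL's implementation: the names ownership_mode, lifecycle_plan, and plan_lifecycle are hypothetical, and the choice to throw when a dtl_owns backend fails to initialize is an assumption inferred from the contrast with optional.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical mirror of the four ownership modes; DTL's real enum may differ.
enum class ownership_mode { dtl_owns, adopt_external, optional, disabled };

struct lifecycle_plan {
    bool attempt_init;      // should the library try to initialize the backend?
    bool finalize_on_exit;  // should the library finalize it during teardown?
    bool backend_usable;    // is the backend available afterwards?
};

// init_ok: whether an initialization attempt would succeed (simulated input).
// externally_initialized: whether the caller already initialized the backend.
inline lifecycle_plan plan_lifecycle(ownership_mode mode, bool init_ok,
                                     bool externally_initialized) {
    switch (mode) {
    case ownership_mode::dtl_owns:
        // Assumption: a required backend that fails to start is a hard error.
        if (!init_ok) throw std::runtime_error("backend failed to initialize");
        return {true, true, true};
    case ownership_mode::adopt_external:
        // Never initialize and never finalize: the caller owns both steps.
        return {false, false, externally_initialized};
    case ownership_mode::optional:
        // Try to initialize; on failure carry on without the backend.
        return {true, init_ok, init_ok};
    case ownership_mode::disabled:
        return {false, false, false};
    }
    assert(false && "unknown ownership mode");
    return {false, false, false};
}
```

For example, plan_lifecycle(ownership_mode::adopt_external, true, true) reports a usable backend that the library neither initializes nor finalizes, which is why an application that adopts MPI still has to call MPI_Finalize itself.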

Decision Tree

1. Single Node vs Multi-Node

Single Node (one machine):

  • Use CPU backend for CPU-only workloads

  • Use CUDA/HIP backend for GPU acceleration

  • No MPI initialization required (use local_context)

Multi-Node (cluster/supercomputer):

  • Use MPI backend for inter-node communication

  • Add CUDA/HIP for GPU nodes

  • Add NCCL for optimized GPU collective operations

2. CPU vs GPU

CPU-Only Workloads:

#include <cstddef>  // size_t
#include <thread>   // std::thread::hardware_concurrency
#include <dtl/dtl.hpp>
#include <backends/cpu/cpu_executor.hpp>

// Use CPU thread pool for parallel execution
dtl::cpu::cpu_executor exec(std::thread::hardware_concurrency());
exec.parallel_for(0, n, [&](size_t i) {
    data[i] = compute(i);
});

GPU-Accelerated Workloads:

#include <dtl/dtl.hpp>
#include <backends/cuda/cuda_memory_space.hpp>
#include <backends/cuda/cuda_executor.hpp>

// Allocate on GPU
dtl::cuda::cuda_memory_space gpu_mem;
void* d_data = gpu_mem.allocate(n * sizeof(double));

// Execute on GPU
dtl::cuda::cuda_executor exec;
exec.launch_kernel(...);

3. Communication Requirements

No Communication (embarrassingly parallel):

// Local containers only, no MPI needed
dtl::distributed_vector<double> vec(local_size);
auto local = vec.local_view();
// Work purely on local data

Collective Communication:

// MPI with collective operations
dtl::mpi::mpi_comm_adapter comm;
double local_sum = compute_local_sum();
double global_sum = comm.allreduce_sum_value(local_sum);

Point-to-Point Communication:

// Direct rank-to-rank messaging
comm.send(&data, count, dest_rank, tag);
comm.recv(&buffer, count, src_rank, tag);

Backend Combinations

CPU Cluster (No GPUs)

CMake Configuration:

cmake .. -DCMAKE_BUILD_TYPE=Release

Code Pattern:

#include <numeric>  // std::reduce
#include <dtl/dtl.hpp>
#include <backends/mpi/mpi_comm_adapter.hpp>
#include <backends/cpu/cpu_executor.hpp>

int main(int argc, char** argv) {
    dtl::mpi::scoped_init mpi(argc, argv);
    dtl::mpi::mpi_comm_adapter comm;
    dtl::cpu::cpu_executor exec;

    // Distributed vector across MPI ranks
    dtl::distributed_vector<double> vec(global_size);

    // Local parallel computation
    auto local = vec.local_view();
    exec.parallel_for(local.size(), [&](size_t i) {
        local[i] = compute(i + vec.local_offset());
    });

    // Global reduction
    double local_sum = std::reduce(local.begin(), local.end());
    double global_sum = comm.allreduce_sum_value(local_sum);

    return 0;
}
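The pattern above relies on each rank owning a slice of the global index space, exposed through local_view() and local_offset(). This guide does not state which partitioning DTL uses, so the sketch below assumes a contiguous block distribution in which the remainder is spread over the lowest ranks. The names block_range and block_partition are hypothetical and are not part of DTL.

```cpp
#include <cassert>
#include <cstddef>

// Half-open slice [offset, offset + size) of the global index space.
struct block_range {
    std::size_t offset;
    std::size_t size;
};

// Contiguous block partition: every rank gets global_size / nranks elements,
// and the first (global_size % nranks) ranks get one extra element.
inline block_range block_partition(std::size_t global_size,
                                   std::size_t nranks,
                                   std::size_t rank) {
    assert(nranks > 0 && rank < nranks);
    const std::size_t base = global_size / nranks;
    const std::size_t extra = global_size % nranks;
    const std::size_t size = base + (rank < extra ? 1 : 0);
    // Ranks below `rank` contribute `base` each, plus one for each of the
    // lower ranks that received an extra element.
    const std::size_t offset = rank * base + (rank < extra ? rank : extra);
    return {offset, size};
}
```

With 10 elements on 4 ranks this yields sizes 3, 3, 2, 2 at offsets 0, 3, 6, 8, so local index i on a rank corresponds to global index i + offset, which is the role local_offset() plays in the loop above.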

GPU Cluster (NVIDIA)

CMake Configuration:

cmake .. -DCMAKE_BUILD_TYPE=Release \
         -DDTL_ENABLE_CUDA=ON \
         -DDTL_ENABLE_NCCL=ON

Code Pattern:

#include <cuda_runtime.h>  // cudaSetDevice
#include <dtl/dtl.hpp>
#include <backends/mpi/mpi_comm_adapter.hpp>
#include <backends/cuda/cuda_memory_space.hpp>
#include <backends/cuda/cuda_executor.hpp>
#include <backends/nccl/nccl_communicator.hpp>

int main(int argc, char** argv) {
    dtl::mpi::scoped_init mpi(argc, argv);

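    // Set GPU based on local rank.
    // get_local_rank() is assumed to be a helper that returns this
    // process's rank within its node; it is not defined in this snippet.
    int local_rank = get_local_rank();
    cudaSetDevice(local_rank);

    // Use GPU memory and NCCL for GPU-direct communication
    dtl::cuda::cuda_memory_space gpu_mem;
    dtl::nccl::nccl_communicator nccl_comm;

    // Allocate input and result buffers on GPU
    double* d_data = static_cast<double*>(
        gpu_mem.allocate(local_size * sizeof(double)));
    double* d_result = static_cast<double*>(
        gpu_mem.allocate(local_size * sizeof(double)));

    // Compute on GPU (kernel launch)
    // ...

    // GPU-native collective (no CPU staging)
    nccl_comm.allreduce_sum(d_data, d_result, local_size);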

    return 0;
}
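The GPU pattern above needs the node-local rank before it can select a device. One common way to obtain it, sketched here under the assumption that the job launcher exports one of the usual environment variables, is to read that variable directly. The function names local_rank_from_env and pick_device are hypothetical; they are not DTL functions. The modulo mapping in pick_device is a design choice that keeps the device index valid when a node runs more ranks than it has GPUs.

```cpp
#include <cassert>
#include <cstdlib>

// Returns the node-local rank published by common launchers, or -1 if no
// known variable is set to a non-negative integer.
inline int local_rank_from_env() {
    static const char* const vars[] = {
        "OMPI_COMM_WORLD_LOCAL_RANK",  // Open MPI
        "MV2_COMM_WORLD_LOCAL_RANK",   // MVAPICH2
        "MPI_LOCALRANKID",             // MPICH / Intel MPI (Hydra)
        "SLURM_LOCALID",               // Slurm srun
    };
    for (const char* name : vars) {
        const char* value = std::getenv(name);
        if (value == nullptr || *value == '\0') continue;
        char* end = nullptr;
        const long parsed = std::strtol(value, &end, 10);
        // Accept only a fully parsed, non-negative number.
        if (end != value && *end == '\0' && parsed >= 0) {
            return static_cast<int>(parsed);
        }
    }
    return -1;
}

// Maps a local rank onto a valid device index by wrapping around.
inline int pick_device(int local_rank, int device_count) {
    assert(local_rank >= 0 && device_count > 0);
    return local_rank % device_count;
}
```

The result of pick_device would then be passed to cudaSetDevice in place of the raw local rank. When local_rank_from_env returns -1, a portable fallback is to split the world communicator with MPI_Comm_split_type and MPI_COMM_TYPE_SHARED and use the rank within the resulting communicator.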

GPU Cluster (AMD)

CMake Configuration:

cmake .. -DCMAKE_BUILD_TYPE=Release \
         -DDTL_ENABLE_HIP=ON

Code Pattern:

#include <dtl/dtl.hpp>
#include <backends/mpi/mpi_comm_adapter.hpp>
#include <backends/hip/hip_memory_space.hpp>
#include <backends/hip/hip_executor.hpp>

// Similar to CUDA pattern, using HIP equivalents

Shared Memory (Single Node, No MPI)

CMake Configuration:

cmake .. -DCMAKE_BUILD_TYPE=Release

Code Pattern:

#include <dtl/dtl.hpp>
#include <backends/shared_memory/shared_memory_communicator.hpp>
#include <backends/cpu/cpu_executor.hpp>

int main() {
    // No MPI initialization needed
    dtl::shared_memory::shared_memory_communicator comm;
    dtl::cpu::cpu_executor exec;

    // Use shared memory for inter-thread communication
    // ...

    return 0;
}

Performance Considerations

When to Use Each Backend

| Scenario | Recommended Backend | Rationale |
| --- | --- | --- |
| Small data, single node | CPU only | MPI overhead not justified |
| Large data, single node | CPU + MPI | Process isolation, NUMA awareness |
| Multi-node cluster | MPI + CPU | Standard HPC configuration |
| GPU workloads | CUDA/HIP + NCCL | GPU-native operations |
| Mixed CPU/GPU | MPI + CUDA + NCCL | MPI for CPU, NCCL for GPU |
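The recommendations above can be read as a small decision function. The sketch below encodes them in standalone C++ purely as an illustration: the workload struct, the data_scale enum, and recommend_backends are hypothetical names, and the order in which the rows are tested (GPU use first, then node count, then data size) is an assumption, since the table itself does not rank the scenarios.

```cpp
#include <cassert>
#include <string>

enum class data_scale { small, large };

// Hypothetical description of a workload, for illustration only.
struct workload {
    int nodes;              // number of machines the job spans
    bool uses_gpu;          // any GPU compute at all
    bool uses_cpu_compute;  // significant CPU-side compute alongside the GPU
    data_scale scale;       // rough data size on a single node
};

// Returns the "Recommended Backend" cell matching the workload.
inline std::string recommend_backends(const workload& w) {
    assert(w.nodes >= 1);
    if (w.uses_gpu) {
        // Mixed CPU/GPU keeps MPI for the CPU side and NCCL for the GPU side.
        return w.uses_cpu_compute ? "MPI + CUDA + NCCL" : "CUDA/HIP + NCCL";
    }
    if (w.nodes > 1) {
        return "MPI + CPU";
    }
    // Single node, CPU only: MPI pays off only once the data is large.
    return w.scale == data_scale::large ? "CPU + MPI" : "CPU only";
}
```

A single-node job with small data maps to "CPU only", matching the first row, while the same job with large data maps to "CPU + MPI".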

Communication Backend Selection

| Communication Pattern | Best Backend | Notes |
| --- | --- | --- |
| CPU-to-CPU | MPI | Mature, well-optimized |
| GPU-to-GPU (same node) | NCCL | Uses NVLink/PCIe directly |
| GPU-to-GPU (cross-node) | NCCL | Uses GPU-direct RDMA |
| CPU-to-GPU | MPI + cudaMemcpy | Two-stage transfer |

Memory Space Selection

| Workload | Memory Space | Notes |
| --- | --- | --- |
| CPU computation | host_memory_space | Default, system allocator |
| GPU computation | cuda_memory_space | Device memory |
| Frequent CPU/GPU transfer | cuda_managed_memory_space | Unified memory with prefetch support |
| DMA/RDMA transfers | pinned_memory_space | Page-locked host (V1.2: CUDA/HIP/fallback) |
| AMD GPU computation | hip_memory_space | HIP device memory (V1.2 headers) |
| AMD GPU unified | hip_managed_memory_space | HIP managed memory with prefetch/advise (V1.2) |

Prefetch Policies (V1.2)

For unified/managed memory, DTL provides prefetch hints:

#include <dtl/memory/prefetch_policy.hpp>

// Prefetch policies: none, to_device, to_host, bidirectional
auto device_hint = dtl::make_device_prefetch(device_id, offset, size);
auto host_hint = dtl::make_host_prefetch(offset, size);

Executor Selection

| Workload Type | Recommended Executor | Notes |
| --- | --- | --- |
| Sequential | inline_executor | Zero overhead |
| CPU parallel | cpu_executor | Thread pool |
| GPU parallel | cuda_executor | Kernel launch |
| Mixed | Both | CPU for orchestration, GPU for compute |

Thread Count Guidelines:

// Match hardware threads (typical)
dtl::cpu::cpu_executor exec;  // Uses hardware_concurrency()

// Custom thread count (e.g., one thread per physical core when hyperthreading does not help)
dtl::cpu::cpu_executor exec(num_physical_cores);

// Single-threaded (for debugging)
dtl::cpu::cpu_executor exec(1);
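When computing a thread count yourself, note that std::thread::hardware_concurrency() is allowed to return 0 if the value cannot be determined. The helper below is a small sketch of a defensive default; resolve_thread_count is a hypothetical name, and whether cpu_executor applies a similar fallback internally is not stated in this guide.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>

// Returns `requested` when it is non-zero; otherwise falls back to the
// hardware thread count, or to 1 when the hardware count is unknown.
inline std::size_t resolve_thread_count(std::size_t requested) {
    if (requested > 0) {
        return requested;
    }
    const unsigned hw = std::thread::hardware_concurrency();  // may be 0
    const std::size_t resolved = hw > 0 ? hw : 1;
    assert(resolved >= 1);
    return resolved;
}
```

The result can be passed to the executor constructor shown above, for example as cpu_executor exec(resolve_thread_count(0)), so that the pool never ends up with zero threads.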

Hybrid Configurations

MPI + OpenMP + CUDA

For maximum flexibility on GPU clusters:

// MPI across nodes
dtl::mpi::mpi_comm_adapter mpi_comm;

// OpenMP within node (via cpu_executor)
dtl::cpu::cpu_executor cpu_exec(omp_get_max_threads());

// CUDA for GPU work
dtl::cuda::cuda_executor gpu_exec;

// NCCL for GPU collectives
dtl::nccl::nccl_communicator nccl_comm;

Process Placement

| Layout | Description | Use Case |
| --- | --- | --- |
| 1 rank per node | All GPUs shared by one process | Simple, good for NCCL |
| 1 rank per GPU | Each GPU has dedicated process | Easier resource management |
| Multiple ranks per GPU | GPU sharing via MPS | Memory-limited, multi-tenant |
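Which layout a job is actually running can be derived from two numbers: the ranks launched per node and the GPUs available per node. The following sketch classifies a launch configuration into the layouts above. The placement enum and classify_placement are hypothetical names used only for illustration, and the extra "other" case (fewer ranks than GPUs, but more than one rank) is an addition for completeness rather than a layout this guide describes.

```cpp
#include <cassert>

enum class placement {
    one_rank_per_node,
    one_rank_per_gpu,
    multiple_ranks_per_gpu,
    other  // more than one rank, but fewer ranks than GPUs
};

// Classifies a launch configuration into one of the layouts.
// A single rank on a single-GPU node is reported as one_rank_per_node.
inline placement classify_placement(int ranks_per_node, int gpus_per_node) {
    assert(ranks_per_node >= 1 && gpus_per_node >= 1);
    if (ranks_per_node == 1) {
        return placement::one_rank_per_node;
    }
    if (ranks_per_node == gpus_per_node) {
        return placement::one_rank_per_gpu;
    }
    if (ranks_per_node > gpus_per_node) {
        return placement::multiple_ranks_per_gpu;
    }
    return placement::other;
}
```

A check like this at startup can warn when a job intended as one rank per GPU was launched with a mismatched rank count.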

Troubleshooting

Backend Detection Issues

MPI not found:

# Verify MPI installation
which mpicc
mpicc --version

# Force MPI compiler
cmake .. -DMPI_C_COMPILER=$(which mpicc) -DMPI_CXX_COMPILER=$(which mpicxx)

CUDA not found:

# Verify CUDA installation
nvcc --version
nvidia-smi

# Set CUDA path
export CUDA_HOME=/usr/local/cuda
cmake .. -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc

Runtime Issues

MPI initialization fails:

  • Ensure scoped_init is created before any DTL distributed operations

  • Check that all ranks reach the same code paths

GPU out of memory:

  • Reduce local partition sizes

  • Use unified memory for oversubscription

  • Check for memory leaks with compute-sanitizer --leak-check full (the successor to cuda-memcheck, which was removed in CUDA 12)

NCCL deadlock:

  • Ensure all ranks call collective operations

  • Check that communicators are created in the same order

MPI Send Mode Variants (V1.2)

DTL supports four MPI send modes via the send_mode enum:

#include <dtl/communication/send_mode.hpp>

// send_mode::standard     - MPI_Send (default; the MPI library decides whether to buffer)
// send_mode::synchronous  - MPI_Ssend (completes only after the matching receive has started)
// send_mode::ready        - MPI_Rsend (the matching receive must already be posted)
// send_mode::buffered     - MPI_Bsend (copies into a user buffer attached with MPI_Buffer_attach)

The mpi_comm_adapter provides blocking and non-blocking variants:

  • ssend_impl() / issend_impl() - Synchronous send

  • rsend_impl() / irsend_impl() - Ready send

MPMD Support (V1.2)

DTL supports Multiple Program, Multiple Data (MPMD) patterns:

#include <dtl/mpmd/role_manager.hpp>
#include <dtl/mpmd/inter_group_comm.hpp>

// Define roles
dtl::role_manager mgr;
mgr.register_role("worker", dtl::role_assignment::first_n_ranks(3));
mgr.register_role("coordinator", dtl::role_assignment::last_rank_only());

// Initialize (assigns ranks to roles based on predicates)
mgr.initialize(world_comm);

// Query roles
if (mgr.has_role("worker")) {
    auto& group = mgr.get_group("worker");
    // group.local_rank(), group.size(), group.members()
}

// Inter-group communication via rank translation
auto world_rank = dtl::translate_to_world_rank(dest_group, local_rank);
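To make the role assignment concrete, the sketch below models it without MPI: each role is a predicate over (rank, world size), a group is the ordered list of world ranks that satisfy it, and translating a group-local rank to a world rank is a lookup in that list. This is an illustrative model only. The free functions first_n_ranks, last_rank_only, assign_roles, and to_world_rank are local stand-ins written for this example, not the dtl::role_assignment or dtl::translate_to_world_rank implementations.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <vector>

// A role is a predicate deciding whether a world rank belongs to it.
using role_predicate = std::function<bool(int rank, int world_size)>;

inline role_predicate first_n_ranks(int n) {
    return [n](int rank, int) { return rank < n; };
}

inline role_predicate last_rank_only() {
    return [](int rank, int world_size) { return rank == world_size - 1; };
}

// Evaluates every predicate on every world rank and returns, per role,
// the member world ranks in ascending order.
inline std::map<std::string, std::vector<int>> assign_roles(
        const std::map<std::string, role_predicate>& roles, int world_size) {
    assert(world_size > 0);
    std::map<std::string, std::vector<int>> groups;
    for (const auto& entry : roles) {
        std::vector<int>& members = groups[entry.first];
        for (int rank = 0; rank < world_size; ++rank) {
            if (entry.second(rank, world_size)) members.push_back(rank);
        }
    }
    return groups;
}

// Group-local rank -> world rank: the index into the group's member list.
inline int to_world_rank(const std::vector<int>& members, int group_rank) {
    assert(group_rank >= 0 &&
           static_cast<std::size_t>(group_rank) < members.size());
    return members[static_cast<std::size_t>(group_rank)];
}
```

With a world of four ranks and the two roles registered above, the worker group holds world ranks 0, 1, and 2, and the coordinator group holds world rank 3, so the coordinator's group-local rank 0 translates to world rank 3.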

Runtime Library

DTL requires libdtl_runtime.so at runtime for backend lifecycle management. This shared library contains the process-global singleton (runtime_registry) that manages MPI, CUDA, HIP, NCCL, and SHMEM initialization and finalization.

When using CMake with DTL::dtl, the runtime library is linked automatically (transitive dependency). For manual builds, add -ldtl_runtime to your linker flags.

See Also