Backend Selection Guide
This guide helps you choose the right DTL backend configuration for your use case.
Available Backends
DTL supports multiple backends that can be combined based on your hardware and requirements.
Backend |
CMake Option |
Purpose |
Status |
|---|---|---|---|
MPI |
(always available) |
Multi-node distributed computing |
Complete |
CPU |
(always available) |
CPU thread pool execution |
Complete |
CUDA |
|
NVIDIA GPU acceleration |
Complete |
HIP |
|
AMD GPU acceleration |
Headers (V1.2) |
NCCL |
|
GPU-native collective communication |
Headers (V1.2) |
OpenSHMEM |
|
PGAS communication model |
Planned |
Environment Lifecycle (V1.2)
DTL provides environment for unified backend lifecycle management:
#include <dtl/core/environment.hpp>
#include <dtl/core/environment_options.hpp>
int main() {
// DTL manages MPI init/finalize
dtl::environment env{dtl::environment_options::defaults()};
// Or adopt externally-initialized MPI
MPI_Init(nullptr, nullptr);
dtl::environment env{dtl::environment_options::adopt_mpi()};
// Query backend availability (instance methods)
if (env.has_mpi()) { /* ... */ }
if (env.has_cuda()) { /* ... */ }
}
// Backends finalized on last guard destruction
Backend Ownership Modes:
dtl_owns- DTL initializes and finalizes the backendadopt_external- Backend initialized externally; DTL does not finalizeoptional- DTL tries to initialize; silently ignores failuredisabled- Backend not used
Decision Tree
1. Single Node vs Multi-Node
Single Node (one machine):
Use CPU backend for CPU-only workloads
Use CUDA/HIP backend for GPU acceleration
No MPI initialization required (use
local_context)
Multi-Node (cluster/supercomputer):
Use MPI backend for inter-node communication
Add CUDA/HIP for GPU nodes
Add NCCL for optimized GPU collective operations
2. CPU vs GPU
CPU-Only Workloads:
#include <dtl/dtl.hpp>
#include <backends/cpu/cpu_executor.hpp>
// Use CPU thread pool for parallel execution
dtl::cpu::cpu_executor exec(std::thread::hardware_concurrency());
exec.parallel_for(0, n, [&](size_t i) {
data[i] = compute(i);
});
GPU-Accelerated Workloads:
#include <dtl/dtl.hpp>
#include <backends/cuda/cuda_memory_space.hpp>
#include <backends/cuda/cuda_executor.hpp>
// Allocate on GPU
dtl::cuda::cuda_memory_space gpu_mem;
void* d_data = gpu_mem.allocate(n * sizeof(double));
// Execute on GPU
dtl::cuda::cuda_executor exec;
exec.launch_kernel(...);
3. Communication Requirements
No Communication (embarrassingly parallel):
// Local containers only, no MPI needed
dtl::distributed_vector<double> vec(local_size);
auto local = vec.local_view();
// Work purely on local data
Collective Communication:
// MPI with collective operations
dtl::mpi::mpi_comm_adapter comm;
double local_sum = compute_local_sum();
double global_sum = comm.allreduce_sum_value(local_sum);
Point-to-Point Communication:
// Direct rank-to-rank messaging
comm.send(&data, count, dest_rank, tag);
comm.recv(&buffer, count, src_rank, tag);
Backend Combinations
CPU Cluster (No GPUs)
CMake Configuration:
cmake .. -DCMAKE_BUILD_TYPE=Release
Code Pattern:
#include <dtl/dtl.hpp>
#include <backends/mpi/mpi_comm_adapter.hpp>
#include <backends/cpu/cpu_executor.hpp>
int main(int argc, char** argv) {
dtl::mpi::scoped_init mpi(argc, argv);
dtl::mpi::mpi_comm_adapter comm;
dtl::cpu::cpu_executor exec;
// Distributed vector across MPI ranks
dtl::distributed_vector<double> vec(global_size);
// Local parallel computation
auto local = vec.local_view();
exec.parallel_for(local.size(), [&](size_t i) {
local[i] = compute(i + vec.local_offset());
});
// Global reduction
double local_sum = std::reduce(local.begin(), local.end());
double global_sum = comm.allreduce_sum_value(local_sum);
return 0;
}
GPU Cluster (NVIDIA)
CMake Configuration:
cmake .. -DCMAKE_BUILD_TYPE=Release \
-DDTL_ENABLE_CUDA=ON \
-DDTL_ENABLE_NCCL=ON
Code Pattern:
#include <dtl/dtl.hpp>
#include <backends/mpi/mpi_comm_adapter.hpp>
#include <backends/cuda/cuda_memory_space.hpp>
#include <backends/cuda/cuda_executor.hpp>
#include <backends/nccl/nccl_communicator.hpp>
int main(int argc, char** argv) {
dtl::mpi::scoped_init mpi(argc, argv);
// Set GPU based on local rank
int local_rank = get_local_rank();
cudaSetDevice(local_rank);
// Use GPU memory and NCCL for GPU-direct communication
dtl::cuda::cuda_memory_space gpu_mem;
dtl::nccl::nccl_communicator nccl_comm;
// Allocate on GPU
double* d_data = static_cast<double*>(
gpu_mem.allocate(local_size * sizeof(double)));
// Compute on GPU (kernel launch)
// ...
// GPU-native collective (no CPU staging)
nccl_comm.allreduce_sum(d_data, d_result, local_size);
return 0;
}
GPU Cluster (AMD)
CMake Configuration:
cmake .. -DCMAKE_BUILD_TYPE=Release \
-DDTL_ENABLE_HIP=ON
Code Pattern:
#include <dtl/dtl.hpp>
#include <backends/mpi/mpi_comm_adapter.hpp>
#include <backends/hip/hip_memory_space.hpp>
#include <backends/hip/hip_executor.hpp>
// Similar to CUDA pattern, using HIP equivalents
Performance Considerations
When to Use Each Backend
Scenario |
Recommended Backend |
Rationale |
|---|---|---|
Small data, single node |
CPU only |
MPI overhead not justified |
Large data, single node |
CPU + MPI |
Process isolation, NUMA awareness |
Multi-node cluster |
MPI + CPU |
Standard HPC configuration |
GPU workloads |
CUDA/HIP + NCCL |
GPU-native operations |
Mixed CPU/GPU |
MPI + CUDA + NCCL |
MPI for CPU, NCCL for GPU |
Communication Backend Selection
Communication Pattern |
Best Backend |
Notes |
|---|---|---|
CPU-to-CPU |
MPI |
Mature, well-optimized |
GPU-to-GPU (same node) |
NCCL |
Uses NVLink/PCIe directly |
GPU-to-GPU (cross-node) |
NCCL |
Uses GPU-direct RDMA |
CPU-to-GPU |
MPI + cudaMemcpy |
Two-stage transfer |
Memory Space Selection
Workload |
Memory Space |
Notes |
|---|---|---|
CPU computation |
|
Default, system allocator |
GPU computation |
|
Device memory |
Frequent CPU/GPU transfer |
|
Unified memory with prefetch support |
DMA/RDMA transfers |
|
Page-locked host (V1.2: CUDA/HIP/fallback) |
AMD GPU computation |
|
HIP device memory (V1.2 headers) |
AMD GPU unified |
|
HIP managed memory with prefetch/advise (V1.2) |
Prefetch Policies (V1.2)
For unified/managed memory, DTL provides prefetch hints:
#include <dtl/memory/prefetch_policy.hpp>
// Prefetch policies: none, to_device, to_host, bidirectional
auto hint = dtl::make_device_prefetch(device_id, offset, size);
auto hint = dtl::make_host_prefetch(offset, size);
Executor Selection
Workload Type |
Recommended Executor |
Notes |
|---|---|---|
Sequential |
|
Zero overhead |
CPU parallel |
|
Thread pool |
GPU parallel |
|
Kernel launch |
Mixed |
Both |
CPU for orchestration, GPU for compute |
Thread Count Guidelines:
// Match hardware threads (typical)
dtl::cpu::cpu_executor exec; // Uses hardware_concurrency()
// Custom thread count (e.g., for hyperthreading)
dtl::cpu::cpu_executor exec(num_physical_cores);
// Single-threaded (for debugging)
dtl::cpu::cpu_executor exec(1);
Hybrid Configurations
MPI + OpenMP + CUDA
For maximum flexibility on GPU clusters:
// MPI across nodes
dtl::mpi::mpi_comm_adapter mpi_comm;
// OpenMP within node (via cpu_executor)
dtl::cpu::cpu_executor cpu_exec(omp_get_max_threads());
// CUDA for GPU work
dtl::cuda::cuda_executor gpu_exec;
// NCCL for GPU collectives
dtl::nccl::nccl_communicator nccl_comm;
Process Placement
Layout |
Description |
Use Case |
|---|---|---|
1 rank per node |
All GPUs shared by one process |
Simple, good for NCCL |
1 rank per GPU |
Each GPU has dedicated process |
Easier resource management |
Multiple ranks per GPU |
GPU sharing via MPS |
Memory-limited, multi-tenant |
Troubleshooting
Backend Detection Issues
MPI not found:
# Verify MPI installation
which mpicc
mpicc --version
# Force MPI compiler
cmake .. -DMPI_C_COMPILER=$(which mpicc) -DMPI_CXX_COMPILER=$(which mpicxx)
CUDA not found:
# Verify CUDA installation
nvcc --version
nvidia-smi
# Set CUDA path
export CUDA_HOME=/usr/local/cuda
cmake .. -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc
Runtime Issues
MPI initialization fails:
Ensure
scoped_initis created before any DTL distributed operationsCheck that all ranks reach the same code paths
GPU out of memory:
Reduce local partition sizes
Use unified memory for oversubscription
Check for memory leaks with
cuda-memcheck
NCCL deadlock:
Ensure all ranks call collective operations
Check that communicators are created in the same order
MPI Send Mode Variants (V1.2)
DTL supports four MPI send modes via send_mode enum:
#include <dtl/communication/send_mode.hpp>
// send_mode::standard - MPI_Send (default)
// send_mode::synchronous - MPI_Ssend (handshake, guaranteed no buffering)
// send_mode::ready - MPI_Rsend (receiver must have pre-posted recv)
// send_mode::buffered - MPI_Bsend (user-managed buffer)
The mpi_comm_adapter provides blocking and non-blocking variants:
ssend_impl()/issend_impl()- Synchronous sendrsend_impl()/irsend_impl()- Ready send
MPMD Support (V1.2)
DTL supports Multiple Program, Multiple Data (MPMD) patterns:
#include <dtl/mpmd/role_manager.hpp>
#include <dtl/mpmd/inter_group_comm.hpp>
// Define roles
dtl::role_manager mgr;
mgr.register_role("worker", dtl::role_assignment::first_n_ranks(3));
mgr.register_role("coordinator", dtl::role_assignment::last_rank_only());
// Initialize (assigns ranks to roles based on predicates)
mgr.initialize(world_comm);
// Query roles
if (mgr.has_role("worker")) {
auto& group = mgr.get_group("worker");
// group.local_rank(), group.size(), group.members()
}
// Inter-group communication via rank translation
auto world_rank = dtl::translate_to_world_rank(dest_group, local_rank);
Runtime Library
DTL requires libdtl_runtime.so at runtime for backend lifecycle management. This shared library contains the process-global singleton (runtime_registry) that manages MPI, CUDA, HIP, NCCL, and SHMEM initialization and finalization.
When using CMake with DTL::dtl, the runtime library is linked automatically (transitive dependency). For manual builds, add -ldtl_runtime to your linker flags.
See Also
Backend Concepts - Concept definitions
Implementing a Backend - Adding new backends