# Backend Comparison

**Last Updated:** 2026-03-06

## Overview

DTL supports multiple backends that can be combined for heterogeneous distributed computing. These tables reflect the public, supported surface, not internal experiments or partially implemented parity work.

## Backend Summary

| Backend | Purpose | Header Prefix | CMake Flag | Status |
|---------|---------|---------------|------------|--------|
| [CPU](cpu_guide.md) | Multi-threaded local execution | `backends/cpu/` | Always available | Production |
| [MPI](mpi_guide.md) | Distributed communication | `backends/mpi/` | `DTL_ENABLE_MPI` | Production |
| [CUDA](cuda_guide.md) | NVIDIA GPU execution | `backends/cuda/` | `DTL_ENABLE_CUDA` | Production |
| [HIP](hip_guide.md) | AMD GPU execution | `backends/hip/` | `DTL_ENABLE_HIP` | Production |
| [NCCL](nccl_backend.md) | Explicit GPU-native collectives | `backends/nccl/` | `DTL_ENABLE_NCCL` | Experimental |
| [OpenSHMEM](shmem_backend.md) | PGAS one-sided communication | `backends/shmem/` | `DTL_ENABLE_SHMEM` | Production |

## Execution Capabilities

| Feature | CPU | CUDA | HIP | MPI | NCCL | SHMEM |
|---------|-----|------|-----|-----|------|-------|
| Local parallel execution | ✅ | ✅ | ✅ | — | — | — |
| Thread pool | ✅ | — | — | — | — | — |
| Stream-based async | — | ✅ | ✅ | — | — | — |
| Kernel dispatch | — | ✅ | ✅ | — | — | — |
| Execution policies | `seq`/`par`/`async` | `on_stream` | `on_stream` | — | — | — |

## Communication Capabilities

| Feature | CPU | CUDA | HIP | MPI | NCCL | SHMEM |
|---------|-----|------|-----|-----|------|-------|
| Point-to-point | — | — | — | ✅ | ✅ device buffers only | ✅ |
| Broadcast | — | — | — | ✅ | ✅ device buffers only | ✅ |
| Reduce / Allreduce | — | — | — | ✅ | ✅ explicit device-buffer paths | ✅ |
| Gather / Scatter | — | — | — | ✅ | ✅ fixed-size device-buffer paths | — |
| All-to-all | — | — | — | ✅ | ✅ fixed-size device-buffer paths | — |
| Barrier | — | — | — | ✅ | ✅ explicit NCCL path | ✅ |
| Variable-size collectives | — | — | — | ✅ | — | — |
| Scan / Exscan | — | CUDA local algorithms only | HIP local algorithms only | ✅ | — | — |
| One-sided (RMA) | — | — | — | ✅ | — | ✅ |

## Memory Capabilities

| Feature | CPU | CUDA | HIP | MPI | NCCL | SHMEM |
|---------|-----|------|-----|-----|------|-------|
| Host memory | ✅ | — | — | ✅ | — | ✅ |
| Device memory | — | ✅ | ✅ | — | ✅ | — |
| Unified memory | — | ✅ | ✅ | — | — | — |
| Pinned memory | — | ✅ | ✅ | — | — | — |
| Symmetric memory | — | — | — | — | — | ✅ |
| RMA windows | — | — | — | ✅ | — | ✅ |

## Placement Policy Support

| Placement | CPU | CUDA | HIP |
|-----------|-----|------|-----|
| `host_only` | ✅ Default | ✅ | ✅ |
| `device_only` | — | ✅ | ✅ |
| `device_only_runtime` | — | ✅ | ✅ |
| `unified_memory` | — | ✅ | ✅ |
| `device_preferred` | — | ✅ | ✅ |
| `explicit_placement` | ✅ | ✅ | ✅ |

## Common Backend Combinations

### MPI + CUDA

Use this for distributed GPU work when your generic algorithm path still needs MPI semantics or host-resident coordination.

### MPI + CUDA + NCCL

Use this when you have an explicit NCCL communication path over CUDA device-resident buffers. Do not assume a context with NCCL will implicitly reroute generic distributed algorithms away from MPI.

## Decision Guide

```
Need distributed computing?
├── No → CPU backend
└── Yes
    ├── Need generic distributed algorithms or host-buffer collectives?
    │   └── MPI backend
    ├── Need NVIDIA GPU execution?
    │   ├── Local GPU algorithms → CUDA backend
    │   └── Explicit GPU-native collectives on device buffers → CUDA + NCCL
    ├── Need AMD GPU execution?
    │   └── HIP backend
    └── Need one-sided communication?
        └── SHMEM or MPI RMA
```

## Notes on NCCL

- NCCL is explicit and device-buffer-only.
- NCCL is not the generic default communicator for contexts.
- Unsupported MPI-style helpers remain unsupported rather than being emulated with host-side scratch or scalar wrappers.
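The decision guide above can be read as a small selection function. The sketch below encodes that tree in Python purely for illustration; `choose_backends` and its keyword flags are hypothetical names invented here, not part of the DTL API.

```python
# Illustrative encoding of the decision guide. The function and its
# keyword flags are hypothetical, not part of the DTL API.
def choose_backends(distributed=False, nvidia_gpu=False, amd_gpu=False,
                    device_collectives=False, one_sided=False):
    """Map requirements to a list of backend names, mirroring the decision tree."""
    if not (distributed or nvidia_gpu or amd_gpu):
        return ["CPU"]
    backends = []
    if distributed:
        # Generic distributed algorithms and host-buffer collectives need MPI.
        backends.append("MPI")
    if nvidia_gpu:
        backends.append("CUDA")
        if device_collectives:
            # Explicit GPU-native collectives on device buffers: add NCCL.
            backends.append("NCCL")
    if amd_gpu:
        backends.append("HIP")
    if one_sided:
        # One-sided communication: SHMEM (MPI RMA is covered by MPI above).
        backends.append("SHMEM")
    return backends

print(choose_backends())                             # ['CPU']
print(choose_backends(distributed=True,
                      nvidia_gpu=True,
                      device_collectives=True))      # ['MPI', 'CUDA', 'NCCL']
```

Note that NCCL only appears alongside MPI and CUDA, matching the "MPI + CUDA + NCCL" combination: it is an explicit device-buffer path, never a replacement for the generic MPI communicator.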
## See Also

- [CUDA Backend Guide](cuda_guide.md)
- [MPI Backend Guide](mpi_guide.md)
- [NCCL Backend](nccl_backend.md)
- [NCCL/CUDA Audit](nccl_cuda_audit.md)