# Backend Comparison
Last Updated: 2026-03-06
## Overview

DTL supports multiple backends that can be combined for heterogeneous distributed computing. The tables below reflect the public, supported surface, not internal experiments or partially implemented parity work.
## Backend Summary

| Backend | Purpose | Header Prefix | CMake Flag | Status |
|---|---|---|---|---|
| CPU | Multi-threaded local execution | | Always available | Production |
| MPI | Distributed communication | | | Production |
| CUDA | NVIDIA GPU execution | | | Production |
| HIP | AMD GPU execution | | | Production |
| NCCL | Explicit GPU-native collectives | | | Experimental |
| SHMEM | PGAS one-sided communication | | | Production |
## Execution Capabilities

| Feature | CPU | CUDA | HIP | MPI | NCCL | SHMEM |
|---|---|---|---|---|---|---|
| Local parallel execution | ✅ | ✅ | ✅ | — | — | — |
| Thread pool | ✅ | — | — | — | — | — |
| Stream-based async | — | ✅ | ✅ | — | — | — |
| Kernel dispatch | — | ✅ | ✅ | — | — | — |
| Execution policies | | | | — | — | — |
## Communication Capabilities

| Feature | CPU | CUDA | HIP | MPI | NCCL | SHMEM |
|---|---|---|---|---|---|---|
| Point-to-point | — | — | — | ✅ | ✅ (device buffers only) | ✅ |
| Broadcast | — | — | — | ✅ | ✅ (device buffers only) | ✅ |
| Reduce / Allreduce | — | — | — | ✅ | ✅ (explicit device-buffer paths) | ✅ |
| Gather / Scatter | — | — | — | ✅ | ✅ (fixed-size device-buffer paths) | — |
| All-to-all | — | — | — | ✅ | ✅ (fixed-size device-buffer paths) | — |
| Barrier | — | — | — | ✅ | ✅ (explicit NCCL path) | ✅ |
| Variable-size collectives | — | — | — | ✅ | — | — |
| Scan / Exscan | — | local algorithms only | local algorithms only | ✅ | — | — |
| One-sided (RMA) | — | — | — | ✅ | — | ✅ |
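For a programmatic view of the matrix above, the distributed columns can be encoded as a small lookup table. This is only a sketch: the `COMM_SUPPORT` dictionary and `supports` helper below are hypothetical illustrations, not part of DTL's API.

```python
# Hypothetical encoding of the communication-capability matrix above.
# Values: True (supported), False (unsupported), or a string noting a restriction.
COMM_SUPPORT = {
    "point_to_point": {"mpi": True, "nccl": "device buffers only", "shmem": True},
    "broadcast":      {"mpi": True, "nccl": "device buffers only", "shmem": True},
    "allreduce":      {"mpi": True, "nccl": "explicit device-buffer paths", "shmem": True},
    "gather_scatter": {"mpi": True, "nccl": "fixed-size device-buffer paths", "shmem": False},
    "all_to_all":     {"mpi": True, "nccl": "fixed-size device-buffer paths", "shmem": False},
    "barrier":        {"mpi": True, "nccl": "explicit NCCL path", "shmem": True},
    "variable_size":  {"mpi": True, "nccl": False, "shmem": False},
    "one_sided_rma":  {"mpi": True, "nccl": False, "shmem": True},
}

def supports(feature: str, backend: str) -> bool:
    """True if the backend supports the feature (possibly with restrictions)."""
    return bool(COMM_SUPPORT[feature].get(backend, False))
```

A restriction string is truthy, so `supports("allreduce", "nccl")` is `True` while the table still records that the path is device-buffer-only.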
## Memory Capabilities

| Feature | CPU | CUDA | HIP | MPI | NCCL | SHMEM |
|---|---|---|---|---|---|---|
| Host memory | ✅ | — | — | ✅ | — | ✅ |
| Device memory | — | ✅ | ✅ | — | ✅ | — |
| Unified memory | — | ✅ | ✅ | — | — | — |
| Pinned memory | — | ✅ | ✅ | — | — | — |
| Symmetric memory | — | — | — | — | — | ✅ |
| RMA windows | — | — | — | ✅ | — | ✅ |
## Placement Policy Support

| Placement | CPU | CUDA | HIP |
|---|---|---|---|
| | ✅ Default | ✅ | ✅ |
| | — | ✅ | ✅ |
| | — | ✅ | ✅ |
| | — | ✅ | ✅ |
| | — | ✅ | ✅ |
| | ✅ | ✅ | ✅ |
## Common Backend Combinations
### MPI + CUDA
Use this for distributed GPU work when your generic algorithm path still needs MPI semantics or host-resident coordination.
### MPI + CUDA + NCCL
Use this when you have an explicit NCCL communication path over CUDA device-resident buffers. Do not assume a context with NCCL will implicitly reroute generic distributed algorithms away from MPI.
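One way to picture this rule is a context that holds an optional NCCL path but never routes generic algorithms through it implicitly. The sketch below is an illustrative model only: `Context`, `allreduce`, and the `via` parameter are hypothetical names, not DTL's real API.

```python
# Illustrative model of explicit NCCL routing; names are hypothetical.
class Context:
    def __init__(self, has_nccl: bool = False):
        self.has_nccl = has_nccl

    def allreduce(self, buf, *, via: str = "mpi") -> str:
        """Return the backend that would service this collective."""
        if via == "nccl":
            # The NCCL path must be requested explicitly, and only works
            # when the context was built with NCCL enabled.
            if not self.has_nccl:
                raise RuntimeError("NCCL path not enabled in this context")
            return "nccl"
        # Generic distributed algorithms stay on MPI even when NCCL exists.
        return "mpi"

ctx = Context(has_nccl=True)
assert ctx.allreduce([1.0]) == "mpi"                # NCCL presence does not reroute
assert ctx.allreduce([1.0], via="nccl") == "nccl"   # only explicit requests use NCCL
```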
## Decision Guide

    Need distributed computing?
    ├── No → CPU backend
    └── Yes
        ├── Need generic distributed algorithms or host-buffer collectives?
        │   └── MPI backend
        ├── Need NVIDIA GPU execution?
        │   ├── Local GPU algorithms → CUDA backend
        │   └── Explicit GPU-native collectives on device buffers → CUDA + NCCL
        ├── Need AMD GPU execution?
        │   └── HIP backend
        └── Need one-sided communication?
            └── SHMEM or MPI RMA
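The guide above can be encoded as a small selection helper. This is a sketch of the decision logic only; the flag names and returned backend names are ours, not a DTL API.

```python
def choose_backends(distributed=False, host_collectives=False,
                    nvidia_gpu=False, gpu_collectives=False,
                    amd_gpu=False, one_sided=False):
    """Map the decision-guide questions onto a set of backend names."""
    if not distributed:
        return {"CPU"}
    backends = set()
    if host_collectives:
        backends.add("MPI")
    if nvidia_gpu:
        backends.add("CUDA")
        if gpu_collectives:
            # Explicit GPU-native collectives on device buffers.
            backends.add("NCCL")
    if amd_gpu:
        backends.add("HIP")
    if one_sided:
        backends.add("SHMEM")  # MPI RMA is the alternative here
    return backends
```

For example, a distributed NVIDIA job that needs GPU-native collectives selects `{"CUDA", "NCCL"}`, matching the CUDA + NCCL branch of the tree.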
## Notes on NCCL

- NCCL is explicit and device-buffer-only.
- NCCL is not the generic default communicator for contexts.
- Unsupported MPI-style helpers remain unsupported rather than being emulated with host-side scratch or scalar wrappers.