# Backend Comparison
Last Updated: 2026-03-06
## Overview

DTL supports multiple backends that can be combined for heterogeneous distributed computing. The tables below reflect the public, supported surface, not internal experiments or partially implemented parity work.
## Backend Summary

| Backend | Purpose | Header Prefix | CMake Flag | Status |
|---|---|---|---|---|
| CPU | Multi-threaded local execution | | Always available | Production |
| MPI | Distributed communication | | | Production |
| CUDA | NVIDIA GPU execution | | | Production |
| HIP | AMD GPU execution | | | Production |
| NCCL | Explicit GPU-native collectives | | | Experimental |
| SHMEM | PGAS one-sided communication | | | Production |
## Execution Capabilities

| Feature | CPU | CUDA | HIP | MPI | NCCL | SHMEM |
|---|---|---|---|---|---|---|
| Local parallel execution | ✅ | ✅ | ✅ | — | — | — |
| Thread pool | ✅ | — | — | — | — | — |
| Stream-based async | — | ✅ | ✅ | — | — | — |
| Kernel dispatch | — | ✅ | ✅ | — | — | — |
| Execution policies | | | | — | — | — |
## Communication Capabilities

| Feature | CPU | CUDA | HIP | MPI | NCCL | SHMEM |
|---|---|---|---|---|---|---|
| Point-to-point | — | — | — | ✅ | ✅ (device buffers only) | ✅ |
| Broadcast | — | — | — | ✅ | ✅ (device buffers only) | ✅ |
| Reduce / Allreduce | — | — | — | ✅ | ✅ (explicit device-buffer paths) | ✅ |
| Gather / Scatter | — | — | — | ✅ | ✅ (fixed-size device-buffer paths) | — |
| All-to-all | — | — | — | ✅ | ✅ (fixed-size device-buffer paths) | — |
| Barrier | — | — | — | ✅ | ✅ (explicit NCCL path) | ✅ |
| Variable-size collectives | — | — | — | ✅ | — | — |
| Scan / Exscan | — | local algorithms only | local algorithms only | ✅ | — | — |
| One-sided (RMA) | — | — | — | ✅ | — | ✅ |
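For a programmatic view of the matrix above, the distributed columns can be encoded as a small lookup table. This is only a sketch: the `COMM_SUPPORT` dictionary and `supports` helper below are hypothetical illustrations, not part of DTL's API.

```python
# Hypothetical encoding of the communication-capability matrix above.
# Values: True (supported), False (unsupported), or a string noting a restriction.
COMM_SUPPORT = {
    "point_to_point": {"mpi": True, "nccl": "device buffers only", "shmem": True},
    "broadcast":      {"mpi": True, "nccl": "device buffers only", "shmem": True},
    "allreduce":      {"mpi": True, "nccl": "explicit device-buffer paths", "shmem": True},
    "gather_scatter": {"mpi": True, "nccl": "fixed-size device-buffer paths", "shmem": False},
    "all_to_all":     {"mpi": True, "nccl": "fixed-size device-buffer paths", "shmem": False},
    "barrier":        {"mpi": True, "nccl": "explicit NCCL path", "shmem": True},
    "variable_size":  {"mpi": True, "nccl": False, "shmem": False},
    "one_sided_rma":  {"mpi": True, "nccl": False, "shmem": True},
}

def supports(feature: str, backend: str) -> bool:
    """True if the backend supports the feature (possibly with restrictions)."""
    return bool(COMM_SUPPORT[feature].get(backend, False))
```

A restriction string is truthy, so `supports("allreduce", "nccl")` is `True` while the table still records that the path is device-buffer-only.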
## Memory Capabilities

| Feature | CPU | CUDA | HIP | MPI | NCCL | SHMEM |
|---|---|---|---|---|---|---|
| Host memory | ✅ | — | — | ✅ | — | ✅ |
| Device memory | — | ✅ | ✅ | — | ✅ | — |
| Unified memory | — | ✅ | ✅ | — | — | — |
| Pinned memory | — | ✅ | ✅ | — | — | — |
| Symmetric memory | — | — | — | — | — | ✅ |
| RMA windows | — | — | — | ✅ | — | ✅ |
## Placement Policy Support

| Placement | CPU | CUDA | HIP |
|---|---|---|---|
| | ✅ Default | ✅ | ✅ |
| | — | ✅ | ✅ |
| | — | ✅ | ✅ |
| | — | ✅ | ✅ |
| | — | ✅ | ✅ |
| | ✅ | ✅ | ✅ |
## Common Backend Combinations
### MPI + CUDA
Use this for distributed GPU work when your generic algorithm path still needs MPI semantics or host-resident coordination.
### MPI + CUDA + NCCL
Use this when you have an explicit NCCL communication path over CUDA device-resident buffers. Do not assume a context with NCCL will implicitly reroute generic distributed algorithms away from MPI.
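One way to picture this rule is a context that holds an optional NCCL path but never routes generic algorithms through it implicitly. The sketch below is an illustrative model only: `Context`, `allreduce`, and the `via` parameter are hypothetical names, not DTL's real API.

```python
# Illustrative model of explicit NCCL routing; names are hypothetical.
class Context:
    def __init__(self, has_nccl: bool = False):
        self.has_nccl = has_nccl

    def allreduce(self, buf, *, via: str = "mpi") -> str:
        """Return the backend that would service this collective."""
        if via == "nccl":
            # The NCCL path must be requested explicitly, and only works
            # when the context was built with NCCL enabled.
            if not self.has_nccl:
                raise RuntimeError("NCCL path not enabled in this context")
            return "nccl"
        # Generic distributed algorithms stay on MPI even when NCCL exists.
        return "mpi"

ctx = Context(has_nccl=True)
assert ctx.allreduce([1.0]) == "mpi"                # NCCL presence does not reroute
assert ctx.allreduce([1.0], via="nccl") == "nccl"   # only explicit requests use NCCL
```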
## Decision Guide

    Need distributed computing?
    ├── No → CPU backend
    └── Yes
        ├── Need generic distributed algorithms or host-buffer collectives?
        │   └── MPI backend
        ├── Need NVIDIA GPU execution?
        │   ├── Local GPU algorithms → CUDA backend
        │   └── Explicit GPU-native collectives on device buffers → CUDA + NCCL
        ├── Need AMD GPU execution?
        │   └── HIP backend
        └── Need one-sided communication?
            └── SHMEM or MPI RMA
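The guide above can be encoded as a small selection helper. This is a sketch of the decision logic only; the flag names and returned backend names are ours, not a DTL API.

```python
def choose_backends(distributed=False, host_collectives=False,
                    nvidia_gpu=False, gpu_collectives=False,
                    amd_gpu=False, one_sided=False):
    """Map the decision-guide questions onto a set of backend names."""
    if not distributed:
        return {"CPU"}
    backends = set()
    if host_collectives:
        backends.add("MPI")
    if nvidia_gpu:
        backends.add("CUDA")
        if gpu_collectives:
            # Explicit GPU-native collectives on device buffers.
            backends.add("NCCL")
    if amd_gpu:
        backends.add("HIP")
    if one_sided:
        backends.add("SHMEM")  # MPI RMA is the alternative here
    return backends
```

For example, a distributed NVIDIA job that needs GPU-native collectives selects `{"CUDA", "NCCL"}`, matching the CUDA + NCCL branch of the tree.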
## Notes on NCCL

- NCCL is explicit and device-buffer-only.
- NCCL is not the generic default communicator for contexts.
- Unsupported MPI-style helpers remain unsupported rather than being emulated with host-side scratch or scalar wrappers.