NCCL Backend

The NCCL backend provides explicit GPU-to-GPU collective communication for CUDA device buffers. It is not a generic MPI-parity communication layer, and it is not selected implicitly for the generic distributed algorithm layer.

Overview

Use NCCL when all of the following are true:

Data is already resident in CUDA device memory.
You need GPU-native collectives between ranks.
You are calling the NCCL domain/communicator/adapter explicitly.

Keep using MPI for generic multi-rank algorithms, host-buffer collectives, and contexts passed into the existing distributed algorithm layer.

Requirements

CUDA Toolkit 11+
NCCL 2.7+ for send/recv support
MPI for communicator bootstrap and domain creation
Compatible NVIDIA GPUs

CMake Configuration

cmake -DDTL_ENABLE_NCCL=ON -DDTL_ENABLE_CUDA=ON -DDTL_ENABLE_MPI=ON ..

Support Model

What Is Supported

nccl_domain::from_mpi(...)
context::with_nccl(device_id[, mode])
context::split_nccl(...[, mode])
Explicit nccl_domain::adapter() access
Device-buffer point-to-point operations (send/recv, isend/irecv)
Device-buffer collectives:
- broadcast
- scatter
- gather
- allgather
- alltoall
- barrier
- typed/allreduce-sum and reduce-sum paths that are actually backed by NCCL

What Is Not Supported

Generic algorithm auto-dispatch from context to NCCL
Host-buffer collectives
Scalar convenience helpers such as allreduce_sum_value
Scan/exscan on the NCCL adapter
Variable-size collectives (gatherv, scatterv, allgatherv, alltoallv)
Logical reductions (land, lor)
MPI-style feature parity claims beyond the explicitly supported device-buffer API

Domain Creation

Create an NCCL domain from MPI:

#include <dtl/core/domain.hpp>
#include <dtl/core/domain_impl.hpp>

dtl::mpi_domain mpi;
int device_id = mpi.rank() % num_gpus;

auto nccl_result = dtl::nccl_domain::from_mpi(mpi, device_id);
if (!nccl_result) {
    throw std::runtime_error(nccl_result.error().message());
}

dtl::nccl_domain nccl = std::move(*nccl_result);

Context Integration

Add NCCL explicitly to an MPI context:

#include <dtl/core/context.hpp>

dtl::mpi_context ctx;
int device_id = ctx.rank() % num_gpus;

auto nccl_result = ctx.with_nccl(device_id);
if (!nccl_result) {
    throw std::runtime_error(nccl_result.error().message());
}

auto& nccl_ctx = *nccl_result;
auto& mpi = nccl_ctx.get<dtl::mpi_domain>();
auto& nccl = nccl_ctx.get<dtl::nccl_domain>();

Important:

Context rank/size queries remain MPI-oriented for generic distributed code.
Having an NCCL domain in a context does not imply that generic algorithms will switch to NCCL.
C/Python/Fortran bindings also expose with_nccl and split_nccl with explicit mode controls via _ex/mode-aware APIs.

Example programs:

C: examples/c/nccl_modes.c
Python: examples/python/scripts/nccl_modes.py

Communicator Layers

`nccl_communicator`

This is the low-level result-returning API. It is the right layer when you want direct NCCL control and explicit status handling.

#include <backends/nccl/nccl_communicator.hpp>

double* d_send = /* device buffer */;
double* d_recv = /* device buffer */;

auto result = comm.allreduce(d_send, d_recv, count, dtl::nccl::nccl_op::sum);
if (!result) {
    throw std::runtime_error(result.error().message());
}

`nccl_comm_adapter`

This is the explicit device-buffer adapter exposed from nccl_domain::adapter(). It throws nccl::communication_error on failure and is intentionally narrower than MPI.

#include <backends/nccl/nccl_comm_adapter.hpp>

auto& adapter = nccl_domain.adapter();
adapter.allreduce_sum(d_send, d_recv, count);
adapter.broadcast(d_buf, count, 0);

The adapter should be read as:

explicit
device-buffer-only
suitable for NCCL-native communication paths
not a drop-in replacement for host-oriented MPI convenience APIs

Buffer and Synchronization Semantics

NCCL collectives require CUDA device memory.
Host pointers, stack scalars, and host scratch buffers are invalid.
Blocking NCCL operations synchronize the CUDA stream before returning.
Non-blocking NCCL operations complete through CUDA event tracking, and wait() / test() surface CUDA/NCCL failures rather than silently hiding them.

C API

NCCL context and mode APIs are exposed in the C bindings:

#include <dtl/bindings/c/dtl_context.h>

dtl_context_t ctx;
dtl_context_create_default(&ctx);

dtl_context_t nccl_ctx;
dtl_status status = dtl_context_with_nccl(ctx, device_id, &nccl_ctx);
dtl_status status2 = dtl_context_with_nccl_ex(
    ctx, device_id, DTL_NCCL_MODE_HYBRID_PARITY, &nccl_ctx);

Split and capability introspection are also available:

dtl_context_t split_ctx;
dtl_context_split_nccl_ex(nccl_ctx, color, key, device_id,
                          DTL_NCCL_MODE_HYBRID_PARITY, &split_ctx);

int mode = dtl_context_nccl_mode(split_ctx);
int can_native = dtl_context_nccl_supports_native(split_ctx, DTL_NCCL_OP_ALLREDUCE);
int can_hybrid = dtl_context_nccl_supports_hybrid(split_ctx, DTL_NCCL_OP_SCAN);

Unsupported Operations

Feature	Status
Message tags	Accepted for API compatibility; ignored by NCCL
`probe` / `iprobe`	Unsupported
`ssend` / `rsend` / `issend` / `irsend`	Unsupported
Scalar convenience helpers	Unsupported on the NCCL adapter
Logical reductions	Unsupported
Bitwise reductions	Unsupported
Scan/exscan	Unsupported on the NCCL adapter
Variable-size collectives	Unsupported
RMA / one-sided communication	Unsupported
Generic context auto-dispatch to NCCL	Unsupported by design

Testing

Relevant coverage should focus on:

device-buffer collectives on real NCCL contexts
rejection of host buffers
blocking completion semantics
explicit context/domain usage

Run NCCL-tagged integration tests when MPI, CUDA, and NCCL are available:

ctest -L nccl --output-on-failure

Design Notes

NCCL remains separate from CUDA execution and generic algorithm dispatch.
MPI remains the primary communicator for generic multi-rank algorithms.
Unsupported MPI features stay unsupported instead of being approximated with unsafe host-side emulation.