NCCL Backend

The NCCL backend provides explicit GPU-to-GPU collective communication for CUDA device buffers. It does not aim for feature parity with MPI, and the generic distributed algorithm layer never selects it implicitly.

Overview

Use NCCL when all of the following are true:

  • Data is already resident in CUDA device memory.

  • You need GPU-native collectives between ranks.

  • You are calling the NCCL domain/communicator/adapter explicitly.

Keep using MPI for generic multi-rank algorithms, host-buffer collectives, and contexts passed into the existing distributed algorithm layer.
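
The three conditions above can be read as a single predicate. The following is a minimal sketch only: comm_path, call_site, and choose_comm_path are hypothetical names for illustration and are not part of the library.

```cpp
#include <cassert>

// Hypothetical names for illustration; none of these are DTL types.
enum class comm_path { mpi, nccl };

struct call_site {
    bool data_on_device;         // buffers already live in CUDA device memory
    bool needs_gpu_collectives;  // GPU-native collectives between ranks are required
    bool explicit_nccl_call;     // code calls the NCCL domain/communicator/adapter directly
};

// NCCL only when every condition holds; otherwise stay on MPI.
constexpr comm_path choose_comm_path(const call_site& s) {
    return (s.data_on_device && s.needs_gpu_collectives && s.explicit_nccl_call)
               ? comm_path::nccl
               : comm_path::mpi;
}

// Device-resident data passed to generic code (no explicit NCCL call) stays on MPI.
static_assert(choose_comm_path({true, true, false}) == comm_path::mpi,
              "generic code paths must not pick NCCL");
```

The point of the sketch is that the choice is made by the caller at each call site, never inferred by the library.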

Requirements

  • CUDA Toolkit 11+

  • NCCL 2.7+ for send/recv support

  • MPI for communicator bootstrap and domain creation

  • Compatible NVIDIA GPUs

CMake Configuration

cmake -DDTL_ENABLE_NCCL=ON -DDTL_ENABLE_CUDA=ON -DDTL_ENABLE_MPI=ON ..

Support Model

What Is Supported

  • nccl_domain::from_mpi(...)

  • context::with_nccl(device_id[, mode])

  • context::split_nccl(...[, mode])

  • Explicit nccl_domain::adapter() access

  • Device-buffer point-to-point operations (send/recv, isend/irecv)

  • Device-buffer collectives:

    • broadcast

    • scatter

    • gather

    • allgather

    • alltoall

    • barrier

    • typed allreduce-sum and reduce-sum, where the path is actually backed by NCCL

What Is Not Supported

  • Generic algorithm auto-dispatch from context to NCCL

  • Host-buffer collectives

  • Scalar convenience helpers such as allreduce_sum_value

  • Scan/exscan on the NCCL adapter

  • Variable-size collectives (gatherv, scatterv, allgatherv, alltoallv)

  • Logical reductions (land, lor)

  • MPI feature parity beyond the explicitly supported device-buffer API

Domain Creation

Create an NCCL domain from MPI:

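#include <dtl/core/domain.hpp>
#include <dtl/core/domain_impl.hpp>

#include <stdexcept>  // std::runtime_error
#include <utility>    // std::move

dtl::mpi_domain mpi;
// num_gpus: number of CUDA devices visible to this process (defined by the caller).
int device_id = mpi.rank() % num_gpus;

// from_mpi returns a result object; check it before dereferencing.
auto nccl_result = dtl::nccl_domain::from_mpi(mpi, device_id);
if (!nccl_result) {
    throw std::runtime_error(nccl_result.error().message());
}

dtl::nccl_domain nccl = std::move(*nccl_result);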
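
The snippet above leaves num_gpus undefined. A minimal sketch of the rank-to-device mapping with basic validation follows. device_for_rank is a hypothetical helper, not a library function. In real code the device count would typically come from the CUDA runtime call cudaGetDeviceCount, and the modulo mapping assumes that ranks are placed on each node in contiguous blocks.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical helper, not part of DTL: maps a rank to a local CUDA device index.
// Assumes ranks are assigned to nodes in contiguous blocks, so rank % num_gpus
// spreads the ranks of one node across that node's devices.
inline int device_for_rank(int rank, int num_gpus) {
    if (rank < 0) {
        throw std::invalid_argument("rank must be non-negative");
    }
    if (num_gpus <= 0) {
        throw std::invalid_argument("no CUDA devices visible");
    }
    return rank % num_gpus;
}
```

For example, with 4 devices per node, ranks 0 to 3 map to devices 0 to 3 and rank 5 maps to device 1. Rejecting a zero device count up front avoids a division by zero on nodes without a visible GPU.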

Context Integration

Add NCCL explicitly to an MPI context:

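#include <dtl/core/context.hpp>

#include <stdexcept>  // std::runtime_error

dtl::mpi_context ctx;
// num_gpus: number of CUDA devices visible to this process (defined by the caller).
int device_id = ctx.rank() % num_gpus;

auto nccl_result = ctx.with_nccl(device_id);
if (!nccl_result) {
    throw std::runtime_error(nccl_result.error().message());
}

// The returned context holds both domains; fetch each one explicitly.
auto& nccl_ctx = *nccl_result;
auto& mpi = nccl_ctx.get<dtl::mpi_domain>();
auto& nccl = nccl_ctx.get<dtl::nccl_domain>();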

Important:

  • Context rank/size queries remain MPI-oriented for generic distributed code.

  • Having an NCCL domain in a context does not imply that generic algorithms will switch to NCCL.

  • The C, Python, and Fortran bindings also expose with_nccl and split_nccl, with explicit mode control through their _ex or mode-aware variants.

Example programs:

  • C: examples/c/nccl_modes.c

  • Python: examples/python/scripts/nccl_modes.py

Communicator Layers

nccl_communicator

This is the low-level result-returning API. It is the right layer when you want direct NCCL control and explicit status handling.

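#include <backends/nccl/nccl_communicator.hpp>

#include <stdexcept>  // std::runtime_error

// comm: an existing nccl_communicator; count: number of elements per buffer.
double* d_send = /* device buffer */;
double* d_recv = /* device buffer */;

// Failure is reported through the returned result, not by throwing.
auto result = comm.allreduce(d_send, d_recv, count, dtl::nccl::nccl_op::sum);
if (!result) {
    throw std::runtime_error(result.error().message());
}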

nccl_comm_adapter

This is the explicit device-buffer adapter exposed from nccl_domain::adapter(). It throws nccl::communication_error on failure and is intentionally narrower than MPI.

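#include <backends/nccl/nccl_comm_adapter.hpp>

// nccl: a dtl::nccl_domain; d_send, d_recv, d_buf: device buffers holding count elements.
auto& adapter = nccl.adapter();
adapter.allreduce_sum(d_send, d_recv, count);  // throws nccl::communication_error on failure
adapter.broadcast(d_buf, count, 0);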

Treat the adapter as:

  • explicit

  • device-buffer-only

  • suitable for NCCL-native communication paths

  • not a drop-in replacement for host-oriented MPI convenience APIs
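
The difference between the two layers is mainly how failure reaches the caller: the communicator returns a result, the adapter throws. The sketch below models that relationship with host-only stand-ins so it can run without a GPU. comm_status, low_level_comm, throwing_adapter, and this communication_error are all hypothetical; the library's real result and error types are not reproduced here, and the real adapter accepts device buffers only.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a result type: empty error means success.
struct comm_status {
    std::optional<std::string> error;
    explicit operator bool() const { return !error.has_value(); }
};

// Hypothetical stand-in for the adapter's exception type.
struct communication_error : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Low-level layer: reports failure through the return value.
struct low_level_comm {
    bool fail = false;  // switch used here to simulate a backend failure

    comm_status allreduce_sum(const double* send, double* recv, std::size_t count) {
        if (fail) {
            return comm_status{std::string("simulated NCCL failure")};
        }
        // With a single rank, the sum over ranks equals the input.
        for (std::size_t i = 0; i < count; ++i) {
            recv[i] = send[i];
        }
        return comm_status{};
    }
};

// Adapter layer: same operation, but a failed status becomes an exception.
struct throwing_adapter {
    low_level_comm& comm;

    void allreduce_sum(const double* send, double* recv, std::size_t count) {
        comm_status st = comm.allreduce_sum(send, recv, count);
        if (!st) {
            throw communication_error(*st.error);
        }
    }
};
```

Code that needs to recover from individual failures fits the result-returning layer better; code that treats any communication failure as fatal reads more simply against the throwing layer.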

Buffer and Synchronization Semantics

  • NCCL collectives require CUDA device memory.

  • Host pointers, stack scalars, and host scratch buffers are invalid.

  • Blocking NCCL operations synchronize the CUDA stream before returning.

  • Non-blocking NCCL operations complete through CUDA event tracking, and wait() / test() surface CUDA/NCCL failures rather than silently hiding them.
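
The last point, that wait() and test() surface failures, can be modelled without a GPU. The sketch below is a hypothetical host-only model, not the library's request type: pending_op and poll_state are illustrative names, and the poll callback stands in for what a real backend would obtain from an event query such as cudaEventQuery (which reports cudaErrorNotReady while work is still in flight).

```cpp
#include <cassert>
#include <functional>
#include <stdexcept>
#include <utility>

// Hypothetical model of an in-flight operation; not DTL's request type.
enum class poll_state { pending, done, failed };

class pending_op {
public:
    explicit pending_op(std::function<poll_state()> poll) : poll_(std::move(poll)) {}

    // Non-blocking check: returns true once complete, throws if the operation failed.
    bool test() {
        if (done_) {
            return true;
        }
        switch (poll_()) {
            case poll_state::done:
                done_ = true;
                return true;
            case poll_state::failed:
                throw std::runtime_error("asynchronous operation failed");
            case poll_state::pending:
                break;
        }
        return false;
    }

    // Blocking completion: polls until done; failures propagate out of test().
    void wait() {
        while (!test()) {
            // A real backend would block on the event here instead of spinning.
        }
    }

private:
    std::function<poll_state()> poll_;
    bool done_ = false;
};
```

Because both entry points go through the same check, a failure cannot be missed by a caller that only ever polls with test().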

C API

NCCL context and mode APIs are exposed in the C bindings:

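#include <dtl/bindings/c/dtl_context.h>

dtl_context_t ctx;
dtl_context_create_default(&ctx);

/* Default mode. Check the returned status before using nccl_ctx. */
dtl_context_t nccl_ctx;
dtl_status status = dtl_context_with_nccl(ctx, device_id, &nccl_ctx);

/* Explicit mode. A separate handle keeps the first context from being overwritten. */
dtl_context_t nccl_ctx_ex;
dtl_status status_ex = dtl_context_with_nccl_ex(
    ctx, device_id, DTL_NCCL_MODE_HYBRID_PARITY, &nccl_ctx_ex);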

Split and capability introspection are also available:

dtl_context_t split_ctx;
dtl_context_split_nccl_ex(nccl_ctx, color, key, device_id,
                          DTL_NCCL_MODE_HYBRID_PARITY, &split_ctx);

int mode = dtl_context_nccl_mode(split_ctx);
int can_native = dtl_context_nccl_supports_native(split_ctx, DTL_NCCL_OP_ALLREDUCE);
int can_hybrid = dtl_context_nccl_supports_hybrid(split_ctx, DTL_NCCL_OP_SCAN);

Unsupported Operations

Feature                                  Status
---------------------------------------  -----------------------------------------------
Message tags                             Accepted for API compatibility; ignored by NCCL
probe / iprobe                           Unsupported
ssend / rsend / issend / irsend          Unsupported
Scalar convenience helpers               Unsupported on the NCCL adapter
Logical reductions                       Unsupported
Bitwise reductions                       Unsupported
Scan/exscan                              Unsupported on the NCCL adapter
Variable-size collectives                Unsupported
RMA / one-sided communication            Unsupported
Generic context auto-dispatch to NCCL    Unsupported by design
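
Callers that share code between MPI and NCCL paths sometimes want to guard against these gaps before issuing a call. A minimal C++17 sketch of such a guard follows, with the rows of the table above encoded as a lookup. All names here (support, feature_status, k_nccl_limits, is_unsupported) are hypothetical, and the library's own capability queries should be preferred where available.

```cpp
#include <cassert>
#include <string_view>

// Hypothetical names throughout; this only mirrors the table in this section.
enum class support { ignored, unsupported };

struct feature_status {
    std::string_view feature;
    support status;
};

// Rows copied from the Unsupported Operations table.
inline constexpr feature_status k_nccl_limits[] = {
    {"Message tags", support::ignored},
    {"probe / iprobe", support::unsupported},
    {"ssend / rsend / issend / irsend", support::unsupported},
    {"Scalar convenience helpers", support::unsupported},
    {"Logical reductions", support::unsupported},
    {"Bitwise reductions", support::unsupported},
    {"Scan/exscan", support::unsupported},
    {"Variable-size collectives", support::unsupported},
    {"RMA / one-sided communication", support::unsupported},
    {"Generic context auto-dispatch to NCCL", support::unsupported},
};

// Returns true when the named feature is listed in the table as unsupported.
constexpr bool is_unsupported(std::string_view feature) {
    for (const auto& row : k_nccl_limits) {
        if (row.feature == feature) {
            return row.status == support::unsupported;
        }
    }
    return false;  // not listed in the table
}
```

Message tags are the one row that does not fail: they are accepted and ignored, so the guard reports them as not unsupported.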

Testing

Test coverage should focus on:

  • device-buffer collectives on real NCCL contexts

  • rejection of host buffers

  • blocking completion semantics

  • explicit context/domain usage

Run NCCL-tagged integration tests when MPI, CUDA, and NCCL are available:

ctest -L nccl --output-on-failure

Design Notes

  • NCCL remains separate from CUDA execution and generic algorithm dispatch.

  • MPI remains the primary communicator for generic multi-rank algorithms.

  • Unsupported MPI features stay unsupported instead of being approximated with unsafe host-side emulation.

See Also