NCCL/CUDA Audit

This audit records the supported surface and the current remediation status for the CUDA and NCCL backends as of 2026-03-06.

Supported Operations Matrix

| Surface | MPI | CUDA | NCCL |
| --- | --- | --- | --- |
| Generic distributed algorithms | Supported | Local execution only | Not selected implicitly |
| Context rank/size for generic distributed code | Primary | N/A | Not a selector |
| Device memory allocation | N/A | Supported | N/A |
| Device local algorithms | N/A | Supported | N/A |
| with_nccl(device_id) | N/A | Requires CUDA | Supported |
| split_nccl(...) | Bootstrap via MPI split | Requires CUDA | Supported in C++ and bindings via mode-aware _ex APIs |
| Point-to-point | Supported | N/A | Supported for device buffers |
| Broadcast | Supported | N/A | Supported for device buffers |
| Gather / Scatter | Supported | N/A | Supported for fixed-size device buffers |
| Allgather / Alltoall | Supported | N/A | Supported for fixed-size device buffers |
| Reduce / Allreduce | Supported | N/A | Supported for explicit device-buffer paths |
| Variable-size collectives | Supported | N/A | Hybrid parity supported in explicit C device APIs (*_device_ex) |
| Scan / Exscan | Supported | Local-only helpers | Hybrid parity supported in explicit C device APIs (*_device_ex) |
| Scalar convenience reductions | Supported | N/A | Unsupported |
| Logical reductions | Supported | N/A | Hybrid parity supported in explicit C device APIs (*_device_ex) |
| Host-buffer collectives | Supported | N/A | Unsupported |
| RMA / one-sided | Supported | N/A | Unsupported |
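The device-buffer restriction in the NCCL column mirrors raw NCCL itself, which accepts only CUDA device pointers and enqueues collectives asynchronously on a stream. As a hedged illustration using the plain NCCL/CUDA runtime API (not this library's own wrappers — `device_allreduce` is a hypothetical name):

```cuda
#include <cuda_runtime.h>
#include <nccl.h>

// Hypothetical sketch: the pattern the matrix describes, expressed
// directly against raw NCCL.
void device_allreduce(ncclComm_t comm, cudaStream_t stream,
                      const float* d_send, float* d_recv, size_t count) {
  // Both d_send and d_recv must be CUDA device allocations; passing
  // host memory here is invalid use of NCCL.
  ncclAllReduce(d_send, d_recv, count, ncclFloat, ncclSum, comm, stream);
  // NCCL only enqueues work on `stream`; a blocking wrapper must
  // synchronize the stream before returning to the caller.
  cudaStreamSynchronize(stream);
}
```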

Bug List

High Severity

| Subsystem | Issue | Status |
| --- | --- | --- |
| Generic algorithm dispatch | Context-based NCCL auto-selection routed generic host-oriented algorithms through NCCL | Remediated by removing umbrella exposure and restoring MPI-primary dispatch |
| NCCL adapter semantics | Blocking methods returned before CUDA stream completion | Remediated by explicit synchronization before return |
| NCCL adapter memory contract | Host scalars and host buffers could enter NCCL collectives | Remediated by device-buffer validation and removal of scalar helpers |
| Scan/exscan parity | Host scratch emulation for NCCL scan/exscan was invalid | Remediated by removing the emulation and marking the operations unsupported |
| Async completion | wait() / test() did not propagate CUDA errors | Remediated |
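The blocking-semantics and error-propagation remediations reduce to two small contracts at the CUDA runtime level. A minimal sketch with hypothetical helper names (the library's actual methods may differ):

```cuda
#include <cuda_runtime.h>
#include <stdexcept>

// Hypothetical helper: a blocking call must synchronize the stream
// before returning, and surface any CUDA error instead of swallowing it.
void blocking_wait(cudaStream_t stream) {
  cudaError_t err = cudaStreamSynchronize(stream);
  if (err != cudaSuccess)
    throw std::runtime_error(cudaGetErrorString(err));
}

// Hypothetical helper: a non-blocking test() distinguishes
// "still running" from a genuine CUDA failure.
bool async_test(cudaStream_t stream) {
  cudaError_t err = cudaStreamQuery(stream);
  if (err == cudaSuccess) return true;        // work finished
  if (err == cudaErrorNotReady) return false; // still in flight
  throw std::runtime_error(cudaGetErrorString(err)); // real error
}
```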

Medium Severity

| Subsystem | Issue | Status |
| --- | --- | --- |
| Documentation | NCCL docs overstated MPI-like parity and implicit context behavior | Remediated in backend docs |
| Public API scope | Mode-aware binding parity for with_nccl/split_nccl was missing | Remediated with _ex APIs and binding updates |
| Test coverage | No explicit contract tests for host-buffer rejection and blocking completion | Remediated with explicit NCCL adapter contract coverage |

Low Severity

| Subsystem | Issue | Status |
| --- | --- | --- |
| Warning hygiene | NCCL communicator still emits reorder/sign-conversion warnings | Open |
| Broader CUDA docs | Some higher-level docs outside the backend pages still describe NCCL optimistically | Open follow-up |

Parity Gaps

Must-Fix Correctness

  • Keep generic distributed algorithms MPI-primary until there is a device-aware distributed container and algorithm path.

  • Preserve the rule that NCCL only accepts CUDA device buffers.

  • Keep blocking and async completion semantics explicit and error-reporting.
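The rule that NCCL only accepts CUDA device buffers can be enforced at the adapter boundary with cudaPointerGetAttributes. A hypothetical validation helper, shown only as a sketch:

```cuda
#include <cuda_runtime.h>
#include <stdexcept>

// Hypothetical helper: reject any pointer that is not CUDA device memory
// before it reaches an NCCL collective.
void require_device_buffer(const void* ptr) {
  cudaPointerAttributes attr{};
  cudaError_t err = cudaPointerGetAttributes(&attr, ptr);
  // Plain host allocations report cudaMemoryTypeUnregistered; pinned
  // host memory reports cudaMemoryTypeHost. Both must be rejected.
  if (err != cudaSuccess || attr.type != cudaMemoryTypeDevice)
    throw std::invalid_argument(
        "NCCL adapter requires CUDA device memory; host buffer rejected");
}
```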

Must-Document Unsupported

  • Scalar convenience reductions on the NCCL adapter

  • Scan/exscan on the NCCL adapter

  • Variable-size collectives

  • Logical reductions

  • Host-buffer communication

  • split_nccl(...) — no longer unsupported: now available in C/Python/Fortran via mode-aware bindings

Future Parity Work

  • Explicit device-resident distributed container support

  • Opt-in NCCL-aware algorithm entry points

  • Continued expansion of high-level Python convenience wrappers for explicit NCCL device collectives

  • Additional multi-rank GPU integration coverage once MPI-enabled CI/build environments are available consistently