NCCL/CUDA Audit

This audit records the supported surface and the current remediation status for the CUDA and NCCL backends as of 2026-03-06.

Supported Operations Matrix

| Surface | MPI | CUDA | NCCL |
| --- | --- | --- | --- |
| Generic distributed algorithms | Supported | Local execution only | Not selected implicitly |
| Context rank/size for generic distributed code | Primary | N/A | Not a selector |
| Device memory allocation | N/A | Supported | N/A |
| Device local algorithms | N/A | Supported | N/A |
| with_nccl(device_id) | N/A | Requires CUDA | Supported |
| split_nccl(...) | Bootstrap via MPI split | Requires CUDA | Supported in C++ and bindings via mode-aware _ex APIs |
| Point-to-point | Supported | N/A | Supported for device buffers |
| Broadcast | Supported | N/A | Supported for device buffers |
| Gather / Scatter | Supported | N/A | Supported for fixed-size device buffers |
| Allgather / Alltoall | Supported | N/A | Supported for fixed-size device buffers |
| Reduce / Allreduce | Supported | N/A | Supported for explicit device-buffer paths |
| Variable-size collectives | Supported | N/A | Hybrid parity supported in explicit C device APIs (*_device_ex) |
| Scan / Exscan | Supported | Local-only helpers | Hybrid parity supported in explicit C device APIs (*_device_ex) |
| Scalar convenience reductions | Supported | N/A | Unsupported |
| Logical reductions | Supported | N/A | Hybrid parity supported in explicit C device APIs (*_device_ex) |
| Host-buffer collectives | Supported | N/A | Unsupported |
| RMA / one-sided | Supported | N/A | Unsupported |
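The device-buffer restriction in the NCCL column mirrors raw NCCL itself, which accepts only CUDA device pointers and enqueues collectives asynchronously on a stream. As a hedged illustration using the plain NCCL/CUDA runtime API (not this library's own wrappers — `device_allreduce` is a hypothetical name):

```cuda
#include <cuda_runtime.h>
#include <nccl.h>

// Hypothetical sketch: the pattern the matrix describes, expressed
// directly against raw NCCL.
void device_allreduce(ncclComm_t comm, cudaStream_t stream,
                      const float* d_send, float* d_recv, size_t count) {
  // Both d_send and d_recv must be CUDA device allocations; passing
  // host memory here is invalid use of NCCL.
  ncclAllReduce(d_send, d_recv, count, ncclFloat, ncclSum, comm, stream);
  // NCCL only enqueues work on `stream`; a blocking wrapper must
  // synchronize the stream before returning to the caller.
  cudaStreamSynchronize(stream);
}
```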

Bug List

High Severity

| Subsystem | Issue | Status |
| --- | --- | --- |
| Generic algorithm dispatch | Context-based NCCL auto-selection routed generic host-oriented algorithms through NCCL | Remediated by removing umbrella exposure and restoring MPI-primary dispatch |
| NCCL adapter semantics | Blocking methods returned before CUDA stream completion | Remediated by explicit synchronization before return |
| NCCL adapter memory contract | Host scalars and host buffers could enter NCCL collectives | Remediated by device-buffer validation and removal of scalar helpers |
| Scan/exscan parity | Host scratch emulation for NCCL scan/exscan was invalid | Remediated by removing the emulation and marking the operations unsupported |
| Async completion | wait() / test() did not propagate CUDA errors | Remediated |
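The blocking-semantics and error-propagation remediations reduce to two small contracts at the CUDA runtime level. A minimal sketch with hypothetical helper names (the library's actual methods may differ):

```cuda
#include <cuda_runtime.h>
#include <stdexcept>

// Hypothetical helper: a blocking call must synchronize the stream
// before returning, and surface any CUDA error instead of swallowing it.
void blocking_wait(cudaStream_t stream) {
  cudaError_t err = cudaStreamSynchronize(stream);
  if (err != cudaSuccess)
    throw std::runtime_error(cudaGetErrorString(err));
}

// Hypothetical helper: a non-blocking test() distinguishes
// "still running" from a genuine CUDA failure.
bool async_test(cudaStream_t stream) {
  cudaError_t err = cudaStreamQuery(stream);
  if (err == cudaSuccess) return true;        // work finished
  if (err == cudaErrorNotReady) return false; // still in flight
  throw std::runtime_error(cudaGetErrorString(err)); // real error
}
```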

Medium Severity

| Subsystem | Issue | Status |
| --- | --- | --- |
| Documentation | NCCL docs overstated MPI-like parity and implicit context behavior | Remediated in backend docs |
| Public API scope | Mode-aware binding parity for with_nccl/split_nccl was missing | Remediated with _ex APIs and binding updates |
| Test coverage | No explicit contract tests for host-buffer rejection and blocking completion | Remediated with explicit NCCL adapter contract coverage |

Low Severity

| Subsystem | Issue | Status |
| --- | --- | --- |
| Warning hygiene | NCCL communicator still emits reorder/sign-conversion warnings | Open |
| Broader CUDA docs | Some higher-level docs outside the backend pages still describe NCCL optimistically | Open follow-up |

Parity Gaps

Must-Fix Correctness

  • Keep generic distributed algorithms MPI-primary until there is a device-aware distributed container and algorithm path.

  • Preserve the rule that NCCL only accepts CUDA device buffers.

  • Keep blocking and async completion semantics explicit and error-reporting.
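The rule that NCCL only accepts CUDA device buffers can be enforced at the adapter boundary with cudaPointerGetAttributes. A hypothetical validation helper, shown only as a sketch:

```cuda
#include <cuda_runtime.h>
#include <stdexcept>

// Hypothetical helper: reject any pointer that is not CUDA device memory
// before it reaches an NCCL collective.
void require_device_buffer(const void* ptr) {
  cudaPointerAttributes attr{};
  cudaError_t err = cudaPointerGetAttributes(&attr, ptr);
  // Plain host allocations report cudaMemoryTypeUnregistered; pinned
  // host memory reports cudaMemoryTypeHost. Both must be rejected.
  if (err != cudaSuccess || attr.type != cudaMemoryTypeDevice)
    throw std::invalid_argument(
        "NCCL adapter requires CUDA device memory; host buffer rejected");
}
```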

Must-Document Unsupported

  • Scalar convenience reductions on the NCCL adapter

  • Scan/exscan on the NCCL adapter

  • Variable-size collectives

  • Logical reductions

  • Host-buffer communication

  • split_nccl(...) — no longer unsupported: now available in C/Python/Fortran via mode-aware bindings

Future Parity Work

  • Explicit device-resident distributed container support

  • Opt-in NCCL-aware algorithm entry points

  • Continued expansion of high-level Python convenience wrappers for explicit NCCL device collectives

  • Additional multi-rank GPU integration coverage once MPI-enabled CI/build environments are available consistently