# NCCL/CUDA Audit This audit records the supported surface and the current remediation status for the CUDA and NCCL backends as of 2026-03-06. ## Supported Operations Matrix | Surface | MPI | CUDA | NCCL | |---|---|---|---| | Generic distributed algorithms | Supported | Local execution only | Not selected implicitly | | Context rank/size for generic distributed code | Primary | N/A | Not a selector | | Device memory allocation | N/A | Supported | N/A | | Device local algorithms | N/A | Supported | N/A | | `with_nccl(device_id)` | N/A | Requires CUDA | Supported | | `split_nccl(...)` | Bootstrap via MPI split | Requires CUDA | Supported in C++ and bindings via mode-aware `_ex` APIs | | Point-to-point | Supported | N/A | Supported for device buffers | | Broadcast | Supported | N/A | Supported for device buffers | | Gather / Scatter | Supported | N/A | Supported for fixed-size device buffers | | Allgather / Alltoall | Supported | N/A | Supported for fixed-size device buffers | | Reduce / Allreduce | Supported | N/A | Supported for explicit device-buffer paths | | Variable-size collectives | Supported | N/A | Hybrid parity supported in explicit C device APIs (`*_device_ex`) | | Scan / Exscan | Supported | Local-only helpers | Hybrid parity supported in explicit C device APIs (`*_device_ex`) | | Scalar convenience reductions | Supported | N/A | Unsupported | | Logical reductions | Supported | N/A | Hybrid parity supported in explicit C device APIs (`*_device_ex`) | | Host-buffer collectives | Supported | N/A | Unsupported | | RMA / one-sided | Supported | N/A | Unsupported | ## Bug List ### High Severity | Subsystem | Issue | Status | |---|---|---| | Generic algorithm dispatch | Context-based NCCL auto-selection routed generic host-oriented algorithms through NCCL | Remediated by removing umbrella exposure and restoring MPI-primary dispatch | | NCCL adapter semantics | Blocking methods returned before CUDA stream completion | Remediated by explicit synchronization before return | | NCCL adapter memory contract | Host scalars and host buffers could enter NCCL collectives | Remediated by device-buffer validation and removal of scalar helpers | | Scan/exscan parity | Host scratch emulation for NCCL scan/exscan was invalid | Remediated by removing/marking unsupported | | Async completion | `wait()` / `test()` did not propagate CUDA errors | Remediated | ### Medium Severity | Subsystem | Issue | Status | |---|---|---| | Documentation | NCCL docs overstated MPI-like parity and implicit context behavior | Remediated in backend docs | | Public API scope | Mode-aware binding parity for `with_nccl/split_nccl` was missing | Remediated with `_ex` APIs and binding updates | | Test coverage | No explicit contract tests for host-buffer rejection and blocking completion | Remediated with explicit NCCL adapter contract coverage | ### Low Severity | Subsystem | Issue | Status | |---|---|---| | Warning hygiene | NCCL communicator still emits reorder/sign-conversion warnings | Open | | Broader CUDA docs | Some higher-level docs outside the backend pages still describe NCCL optimistically | Open follow-up | ## Parity Gaps ### Must-Fix Correctness - Keep generic distributed algorithms MPI-primary until there is a device-aware distributed container and algorithm path. - Preserve the rule that NCCL only accepts CUDA device buffers. - Keep blocking and async completion semantics explicit and error-reporting. ### Must-Document Unsupported - Scalar convenience reductions on the NCCL adapter - Scan/exscan on the NCCL adapter - Variable-size collectives - Logical reductions - Host-buffer communication - `split_nccl(...)` now available in C/Python/Fortran via mode-aware bindings ### Future Parity Work - Explicit device-resident distributed container support - Opt-in NCCL-aware algorithm entry points - Continued expansion of high-level Python convenience wrappers for explicit NCCL device collectives - Additional multi-rank GPU integration coverage once MPI-enabled CI/build environments are available consistently