Legacy Deep-Dive: Troubleshooting

This page is retained as a detailed reference. The canonical user path is now the chaptered handbook.

Primary chapter: 11-troubleshooting-and-diagnostics.md

Runtime and handles: Runtime and Handle Model

Detailed Reference (Legacy)

This guide covers common issues when building, configuring, and running DTL applications, along with their solutions.

MPI Initialization Failures

Conda MPI Conflicts

Symptom: Build fails with “x86_64-conda-linux-gnu-cc: not found” or runtime crashes with MPI version mismatches.

Cause: Conda installs its own OpenMPI (often 5.0.x) with broken compiler wrappers. Conda’s mpicc cannot find its expected cross-compiler, and conda’s linker (compiler_compat/ld) cannot resolve system OpenMPI’s shared libraries.

Solution: Always pass explicit system compiler paths to CMake when conda is on PATH:

cmake .. \
  -DCMAKE_C_COMPILER=/usr/bin/gcc \
  -DCMAKE_CXX_COMPILER=/usr/bin/g++ \
  -DMPI_C_COMPILER=/usr/bin/mpicc \
  -DMPI_CXX_COMPILER=/usr/bin/mpicxx

For Python packages that link MPI (e.g., mpi4py), build from source against system MPI:

MPICC=/usr/bin/mpicc CC=/usr/bin/gcc LDSHARED="/usr/bin/gcc -shared" \
  pip install --no-binary mpi4py --force-reinstall mpi4py

Verification:

# Ensure system MPI is used, NOT conda's
which mpicc          # Should be /usr/bin/mpicc
mpicc --version      # Should show system GCC, not conda wrapper
mpirun --version     # Should match the expected OpenMPI version

Thread Level Mismatch

Symptom: A runtime warning such as “MPI thread level requested X but only Y provided,” or unexpected behavior with the par{} or async{} execution policies.

Cause: The MPI implementation does not support the requested thread safety level, for example when MPI_THREAD_MULTIPLE is requested but the MPI library was built without thread support.

Solution: Check what your MPI supports and request an appropriate level:

auto opts = dtl::environment_options::defaults();
// Check what level was actually provided
dtl::environment env(argc, argv, opts);
std::cout << "Thread level: " << env.mpi_thread_level_name() << "\n";

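If you need MPI_THREAD_MULTIPLE, make sure your MPI library was built with support for it. Older Open MPI releases need --enable-mpi-thread-multiple at configure time, while recent releases build this support by default; other MPI implementations have similar build options.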
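When the provided level is lower than the requested one, the application usually has to fall back to what it was given. The following is a minimal sketch of that comparison. The thread_level enum and the helper names are hypothetical stand-ins, not DTL or MPI identifiers; the only fact it relies on is that the MPI standard orders the four levels from MPI_THREAD_SINGLE up to MPI_THREAD_MULTIPLE.

```cpp
#include <string_view>

// Hypothetical stand-in for the four MPI thread levels. The MPI standard
// guarantees the same ordering: SINGLE < FUNNELED < SERIALIZED < MULTIPLE.
enum class thread_level { single = 0, funneled, serialized, multiple };

// True when the level MPI actually provided is at least the level requested.
constexpr bool satisfies(thread_level provided, thread_level requested) {
    return static_cast<int>(provided) >= static_cast<int>(requested);
}

// Human-readable name for log messages.
constexpr std::string_view name_of(thread_level level) {
    switch (level) {
        case thread_level::single:     return "MPI_THREAD_SINGLE";
        case thread_level::funneled:   return "MPI_THREAD_FUNNELED";
        case thread_level::serialized: return "MPI_THREAD_SERIALIZED";
        case thread_level::multiple:   return "MPI_THREAD_MULTIPLE";
    }
    return "unknown";
}

// Level the application may rely on: never more than what was provided.
constexpr thread_level usable_level(thread_level provided, thread_level requested) {
    return satisfies(provided, requested) ? requested : provided;
}
```

An application could use usable_level to decide whether multi-threaded execution policies are safe, and fall back to a single-threaded policy when the result is below MPI_THREAD_MULTIPLE.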

MPI Already Initialized

Symptom: Error “MPI_Init has already been called” or “environment construction failed.”

Cause: Another library or framework initialized MPI before DTL’s environment constructor.

Solution: Use adopt_external mode to let DTL use the existing MPI initialization:

auto opts = dtl::environment_options::defaults();
opts.mpi_mode = dtl::backend_ownership::adopt_external;
dtl::environment env(argc, argv, opts);

Or, for library authors, inject an existing communicator:

// Your library receives an MPI communicator from the application
auto env = dtl::environment::from_comm(app_comm);

MPI Not Found During Build

Symptom: CMake error: “Could NOT find MPI” or MPI headers not found.

Cause: CMake’s FindMPI module cannot locate your MPI installation.

Solution:

Ensure MPI is installed: sudo apt install openmpi-bin libopenmpi-dev
Check for an empty cmake/FindMPI.cmake file in the DTL source tree. If it exists and is empty, delete it – it shadows CMake’s built-in FindMPI module.
Set explicit MPI compiler paths:

cmake .. -DMPI_C_COMPILER=/usr/bin/mpicc -DMPI_CXX_COMPILER=/usr/bin/mpicxx

CUDA Device Selection Issues

No CUDA Devices Available

Symptom: nvidia-smi shows “No devices found” or DTL reports has_cuda() == false.

Cause (WSL2): GPU passthrough requires:

NVIDIA driver on the Windows host (not in WSL2)
CUDA toolkit installed in WSL2 (toolkit only, not the driver)
The /dev/dxg device accessible in WSL2

Solution (WSL2):

Install NVIDIA driver on Windows (not in WSL2)
Install only the CUDA toolkit in WSL2:

sudo apt install cuda-toolkit-12-6

Do NOT install nvidia-driver-* packages inside WSL2
Verify:

nvidia-smi               # Should show GPU via passthrough
nvcc --version            # Should show CUDA toolkit version

Cause (native Linux): Driver not installed or not loaded.

Solution: Install the NVIDIA driver and verify with nvidia-smi.

Wrong Device Selected

Symptom: Operations execute on the wrong GPU, or memory is allocated on an unexpected device.

Cause: The device_only<N> template parameter does not match the runtime CUDA device context.

Solution: Ensure the device ID in your placement policy matches your environment:

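// Verify device count (and that the query itself succeeded)
int device_count = 0;
if (cudaGetDeviceCount(&device_count) != cudaSuccess || device_count == 0) {
    std::cerr << "No usable CUDA device found\n";
}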

// Use the correct device
auto ctx = env.make_world_context(/*device_id=*/0);
dtl::distributed_vector<float, dtl::device_only<0>> vec(1000, ctx);

For multi-GPU nodes, map ranks to devices:

int local_rank = get_local_rank();  // Rank within the node
int device_id = local_rank % device_count;
auto ctx = env.make_world_context(device_id);

NVCC Not on PATH

Symptom: CMake error: “Could NOT find CUDA” or nvcc: command not found.

Cause: The CUDA toolkit is installed but its bin/ directory is not in PATH.

Solution: Add CUDA to PATH:

export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH}

Or pass the NVCC path explicitly to CMake:

cmake .. -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc

NCCL Communicator Creation Failures

NCCL Not Found

Symptom: CMake warning: “NCCL not found” or has_nccl() == false at runtime.

Solution:

Install NCCL: sudo apt install libnccl2 libnccl-dev
If NCCL is installed to a non-standard location, set the CMake hint:

cmake .. -DNCCL_ROOT=/path/to/nccl

Verify at runtime:

dtl::environment env(argc, argv);
std::cout << "NCCL available: " << env.has_nccl() << "\n";

Single-GPU Testing

Note: NCCL supports single-GPU testing for correctness verification. Multiple MPI ranks share the same GPU, and NCCL uses shared memory transport between them. This does NOT test multi-GPU performance paths.

# Correctness test with 2 ranks on 1 GPU
mpirun -np 2 ./test_nccl_collectives

Deadlocks in Collective Operations

Mismatched Collective Calls

Symptom: Program hangs indefinitely at a collective operation (reduce, barrier, allgather, etc.).

Cause: Not all ranks participate in the collective, or ranks call different collectives.

Example of the bug:

// DEADLOCK: only rank 0 calls reduce
if (ctx.rank() == 0) {
    auto result = dtl::reduce(vec, 0.0, std::plus<>{});  // Hangs!
}

Solution: Collective operations must be called by ALL ranks in the communicator:

// CORRECT: all ranks participate
auto result = dtl::reduce(vec, 0.0, std::plus<>{});
if (ctx.rank() == 0) {
    std::cout << "Result: " << result << "\n";
}
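When the mismatch is not obvious from the source, one debugging approach is to have each rank record the name of every collective it is about to enter and compare the records afterwards. The sketch below is a hypothetical offline helper, not a DTL facility: it takes one log per rank and reports the first position at which the ranks disagree.

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

// One entry per rank: the ordered names of the collectives that rank entered.
using call_log = std::vector<std::string>;

// Index of the first collective call at which the ranks disagree, or nullopt
// if every rank logged the same sequence. A rank whose log ends early counts
// as disagreeing at that index, because it never entered the collective.
std::optional<std::size_t> first_divergence(const std::vector<call_log>& logs) {
    if (logs.empty()) return std::nullopt;
    std::size_t longest = 0;
    for (const auto& log : logs) longest = std::max(longest, log.size());
    for (std::size_t i = 0; i < longest; ++i) {
        for (const auto& log : logs) {
            const bool missing = i >= log.size() || i >= logs.front().size();
            if (missing || log[i] != logs.front()[i]) return i;
        }
    }
    return std::nullopt;
}
```

For the buggy example above, rank 0 would log one reduce and every other rank would log nothing, so the helper reports a divergence at index 0.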

Barrier Placement Errors

Symptom: Data corruption or stale values after a collective.

Cause: Missing barrier between write and read phases.

auto local = vec.local_view();
for (auto& x : local) x = compute(x);

// Missing barrier! Other ranks may not be done writing yet
auto result = dtl::reduce(vec, 0.0, std::plus<>{});  // May read stale data

Solution: Insert a barrier when needed between phases:

auto local = vec.local_view();
for (auto& x : local) x = compute(x);

vec.barrier();  // Ensure all ranks finish writing

auto result = dtl::reduce(vec, 0.0, std::plus<>{});

Note: Many DTL collective algorithms include an implicit barrier. Check the algorithm documentation.

Conditional Collective Calls

Symptom: Deadlock when a collective is inside a conditional that evaluates differently on different ranks.

// DEADLOCK: condition may differ across ranks
if (local_error_detected) {
    dtl::reduce(error_counts, 0, std::plus<>{});  // Not all ranks reach this
}

Solution: Move the collective outside the conditional, or ensure the condition is the same on all ranks:

// CORRECT: all ranks participate, then check locally
auto total_errors = dtl::reduce(error_counts, 0, std::plus<>{});
if (total_errors > 0) {
    handle_errors();
}

Futures Timeout Diagnostics

Future Never Completes

Symptom: future.get() blocks indefinitely or future.is_ready() never returns true.

Possible causes:

Missing progress: The underlying async operation needs polling to complete.
Deadlocked collective: The async operation wraps a collective that deadlocked.
MPI progress issue: MPI non-blocking operations need MPI_Test or MPI_Wait calls.

Solution:

auto future = dtl::async_reduce(vec, 0.0, std::plus<>{});

// Ensure progress is being made, but give up after a bounded number of polls
constexpr int max_polls = 1000000;
int polls = 0;
while (!future.is_ready()) {
    dtl::futures::progress_engine::instance().poll();

    // Debugging aid: report and stop waiting instead of spinning forever
    if (++polls > max_polls) {
        std::cerr << "WARNING: future not completing after 1M polls\n";
        break;
    }
}
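A poll count depends on how fast the loop spins, so a wall-clock deadline is often easier to reason about. The following is a minimal sketch of a generic helper that does this with std::chrono::steady_clock; poll_until and its parameters are illustrative names, not DTL API.

```cpp
#include <chrono>
#include <thread>

// Call `poll_once` repeatedly until `is_ready` returns true or `timeout`
// elapses. Returns true if the operation became ready, false on timeout.
template <class IsReady, class PollOnce>
bool poll_until(IsReady is_ready, PollOnce poll_once,
                std::chrono::milliseconds timeout) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (!is_ready()) {
        if (std::chrono::steady_clock::now() >= deadline) return false;
        poll_once();
        std::this_thread::yield();  // Let other threads run between polls
    }
    return true;
}
```

With DTL, is_ready would typically wrap future.is_ready() and poll_once would wrap dtl::futures::progress_engine::instance().poll(). A false return is the point at which to log diagnostics rather than keep waiting.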

Progress Engine Not Polled

Symptom: Async operations appear to hang, and nothing in the application polls the progress engine.

Cause: DTL’s async operations register callbacks with the progress engine. If nobody polls, the callbacks never execute.

Solution: Do one of the following:

Call future.get() (which polls internally)
Periodically call dtl::futures::progress_engine::instance().poll()
Use the blocking variants instead of async if you do not need overlap

Build System Issues

FindMPI.cmake Empty File

Symptom: MPI is not detected despite being installed. CMake silently skips MPI.

Cause: An empty cmake/FindMPI.cmake file in the DTL source tree shadows CMake’s built-in FindMPI module.

Solution: Delete the empty file:

rm cmake/FindMPI.cmake  # If it exists and is empty

FindNCCL.cmake Not Found

Symptom: CMake cannot find NCCL even though it is installed.

Solution: DTL provides its own FindNCCL.cmake module. If CMake does not pick it up, ensure the cmake/ directory is in the CMake module path:

cmake .. -DCMAKE_MODULE_PATH=/path/to/dtl/cmake

Compiler Version Too Old

Symptom: Build errors related to C++20 features: concepts, <source_location>, requires clauses, std::span.

Minimum requirements:

GCC 11+ (GCC 13+ recommended)
Clang 15+
MSVC 19.29+ (Visual Studio 2019 16.10+ with /std:c++20)
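To see which of these C++20 features a given toolchain actually advertises, a small test program can query the standard feature-test macros. The following is a sketch; the function names are illustrative, and the thresholds are the values the C++20 standard assigns to __cpp_concepts, __cpp_lib_span, and __cpp_lib_source_location.

```cpp
#if defined(__has_include)
#  if __has_include(<version>)
#    include <version>  // C++20 header that defines library feature-test macros
#  endif
#endif
#include <string>

// Language feature: concepts and requires clauses.
constexpr bool has_concepts() {
#if defined(__cpp_concepts) && __cpp_concepts >= 201907L
    return true;
#else
    return false;
#endif
}

// Library feature: std::span.
constexpr bool has_span() {
#if defined(__cpp_lib_span) && __cpp_lib_span >= 202002L
    return true;
#else
    return false;
#endif
}

// Library feature: <source_location>.
constexpr bool has_source_location() {
#if defined(__cpp_lib_source_location) && __cpp_lib_source_location >= 201907L
    return true;
#else
    return false;
#endif
}

// One-line summary suitable for printing from a small test program.
std::string feature_report() {
    auto yes_no = [](bool b) { return b ? "yes" : "no"; };
    return std::string("concepts: ") + yes_no(has_concepts()) +
           ", span: " + yes_no(has_span()) +
           ", source_location: " + yes_no(has_source_location());
}
```

Compile it with the same compiler and -std flag used for the DTL build; any "no" in the report points at the feature behind the build errors.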

Solution: Upgrade your compiler:

# Ubuntu/Debian
sudo apt install g++-13

# Use explicitly
cmake .. -DCMAKE_CXX_COMPILER=/usr/bin/g++-13

libdtl_runtime.so Not Found at Runtime

Symptom: Runtime error: “error while loading shared libraries: libdtl_runtime.so: cannot open shared object file.”

Cause: libdtl_runtime.so is not in the library search path.

Solution:

# Option 1: Set LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/path/to/dtl/build/runtime:$LD_LIBRARY_PATH

# Option 2: Install DTL system-wide
sudo cmake --install build/

# Option 3: Record the linked library directories in the installed RPATH
cmake .. -DCMAKE_INSTALL_RPATH_USE_LINK_PATH=ON

Python Binding Import Errors

Module Not Found

Symptom: import dtl fails with ModuleNotFoundError.

Cause: The Python extension module was not built or not installed.

Solution:

# Build Python bindings
cmake .. -DDTL_BUILD_PYTHON=ON
make _dtl

# Install into Python environment
make python_install

# Or set PYTHONPATH
export PYTHONPATH=/path/to/dtl/build/bindings/python:$PYTHONPATH

MPI Library Mismatch

Symptom: Python crashes on import dtl with a segfault, or mpi4py initialization fails.

Cause: mpi4py was built against a different MPI library than DTL. This commonly happens when conda provides its own MPI.

Solution: Rebuild mpi4py against the system MPI:

MPICC=/usr/bin/mpicc CC=/usr/bin/gcc LDSHARED="/usr/bin/gcc -shared" \
  pip install --no-binary mpi4py --force-reinstall mpi4py

Verification:

python3 -c "
import mpi4py
print('mpi4py version:', mpi4py.__version__)
from mpi4py import MPI
print('MPI library:', MPI.Get_library_version())
"

The MPI library version printed should match your system MPI (e.g., OpenMPI 4.1.6), not conda’s.

NumPy Version Incompatibility

Symptom: Import error related to NumPy ABI version mismatch.

Cause: DTL’s Python bindings were built against a different NumPy version than what is currently installed.

Solution: Rebuild the Python bindings after updating NumPy:

pip install numpy  # Ensure desired version
cd build
cmake .. -DDTL_BUILD_PYTHON=ON
make _dtl
make python_install

General Debugging Tips

Enable Verbose Output

Set environment variables for verbose MPI and CUDA output:

# Open MPI: print all MCA parameter values during MPI_Init
export OMPI_MCA_mpi_show_mca_params=all

# CUDA: make kernel launches synchronous so errors surface at the failing call
export CUDA_LAUNCH_BLOCKING=1

# DTL debug mode (if built with Debug)
export DTL_DEBUG=1

Check Backend Availability

Print a diagnostic report at startup:

dtl::environment env(argc, argv);
std::cout << "MPI:   " << env.has_mpi() << " (thread level: " << env.mpi_thread_level_name() << ")\n";
std::cout << "CUDA:  " << env.has_cuda() << "\n";
std::cout << "HIP:   " << env.has_hip() << "\n";
std::cout << "NCCL:  " << env.has_nccl() << "\n";
std::cout << "SHMEM: " << env.has_shmem() << "\n";

Run with Address Sanitizer

For memory issues, build with sanitizers:

cmake .. -DCMAKE_BUILD_TYPE=Debug \
  -DCMAKE_CXX_FLAGS="-fsanitize=address -fno-omit-frame-pointer"

Run with MPI Error Handler

Make MPI aborts more informative:

# Open MPI: print a stack trace when MPI_Abort is invoked
export OMPI_MCA_mpi_abort_print_stack=1
