Legacy Deep-Dive: Troubleshooting
This page is retained as a detailed reference. The canonical user path is now the chaptered handbook.
Primary chapter: 11-troubleshooting-and-diagnostics.md
Runtime and handles: Runtime and Handle Model
Detailed Reference (Legacy)
This guide covers common issues when building, configuring, and running DTL applications, along with their solutions.
MPI Initialization Failures
Conda MPI Conflicts
Symptom: Build fails with “x86_64-conda-linux-gnu-cc: not found” or runtime crashes with MPI version mismatches.
Cause: Conda installs its own OpenMPI (often 5.0.x) with broken compiler wrappers. Conda’s mpicc cannot find its expected cross-compiler, and conda’s linker (compiler_compat/ld) cannot resolve system OpenMPI’s shared libraries.
Solution: Always pass explicit system compiler paths to CMake when conda is on PATH:
cmake .. \
-DCMAKE_C_COMPILER=/usr/bin/gcc \
-DCMAKE_CXX_COMPILER=/usr/bin/g++ \
-DMPI_C_COMPILER=/usr/bin/mpicc \
-DMPI_CXX_COMPILER=/usr/bin/mpicxx
For Python packages that link MPI (e.g., mpi4py), build from source against system MPI:
MPICC=/usr/bin/mpicc CC=/usr/bin/gcc LDSHARED="/usr/bin/gcc -shared" \
pip install --no-binary mpi4py --force-reinstall mpi4py
Verification:
# Ensure system MPI is used, NOT conda's
which mpicc # Should be /usr/bin/mpicc
mpicc --version # Should show system GCC, not conda wrapper
mpirun --version # Should match the expected OpenMPI version
Thread Level Mismatch
Symptom: Runtime warning: “MPI thread level requested X but only Y provided.” Or unexpected behavior with par{} or async{} execution policies.
Cause: The MPI implementation does not support the requested thread safety level. For example, requesting MPI_THREAD_MULTIPLE when the MPI library was built without thread support.
Solution: Check what your MPI supports and request an appropriate level:
auto opts = dtl::environment_options::defaults();
// Check what level was actually provided
dtl::environment env(argc, argv, opts);
std::cout << "Thread level: " << env.mpi_thread_level_name() << "\n";
If you need MPI_THREAD_MULTIPLE, ensure your MPI library was built with thread support (for OpenMPI, configure with --enable-mpi-thread-multiple; running ompi_info | grep -i thread shows what the installed build provides).
MPI Already Initialized
Symptom: Error “MPI_Init has already been called” or “environment construction failed.”
Cause: Another library or framework initialized MPI before DTL’s environment constructor.
Solution: Use adopt_external mode to let DTL use the existing MPI initialization:
auto opts = dtl::environment_options::defaults();
opts.mpi_mode = dtl::backend_ownership::adopt_external;
dtl::environment env(argc, argv, opts);
Or, for library authors, inject an existing communicator:
// Your library receives an MPI communicator from the application
auto env = dtl::environment::from_comm(app_comm);
MPI Not Found During Build
Symptom: CMake error: “Could NOT find MPI” or MPI headers not found.
Cause: CMake’s FindMPI module cannot locate your MPI installation.
Solution:
Ensure MPI is installed:
sudo apt install openmpi-bin libopenmpi-dev
Check for an empty cmake/FindMPI.cmake file in the DTL source tree. If it exists and is empty, delete it: it shadows CMake's built-in FindMPI module.
Set explicit MPI compiler paths:
cmake .. -DMPI_C_COMPILER=/usr/bin/mpicc -DMPI_CXX_COMPILER=/usr/bin/mpicxx
CUDA Device Selection Issues
No CUDA Devices Available
Symptom: nvidia-smi shows “No devices found” or DTL reports has_cuda() == false.
Cause (WSL2): GPU passthrough requires:
NVIDIA driver on the Windows host (not in WSL2)
CUDA toolkit installed in WSL2 (toolkit only, not the driver)
The /dev/dxg device accessible in WSL2
Solution (WSL2):
Install NVIDIA driver on Windows (not in WSL2)
Install only the CUDA toolkit in WSL2:
sudo apt install cuda-toolkit-12-6
Do NOT install nvidia-driver-* packages inside WSL2
Verify:
nvidia-smi # Should show GPU via passthrough
nvcc --version # Should show CUDA toolkit version
Cause (native Linux): Driver not installed or not loaded.
Solution: Install the NVIDIA driver and verify with nvidia-smi.
Wrong Device Selected
Symptom: Operations execute on the wrong GPU, or memory is allocated on an unexpected device.
Cause: The device_only<N> template parameter does not match the runtime CUDA device context.
Solution: Ensure the device ID in your placement policy matches your environment:
// Verify device count
int device_count;
cudaGetDeviceCount(&device_count);
// Use the correct device
auto ctx = env.make_world_context(/*device_id=*/0);
dtl::distributed_vector<float, dtl::device_only<0>> vec(1000, ctx);
For multi-GPU nodes, map ranks to devices:
int local_rank = get_local_rank(); // Rank within the node
int device_id = local_rank % device_count;
auto ctx = env.make_world_context(device_id);
NVCC Not on PATH
Symptom: CMake error: “Could NOT find CUDA” or nvcc: command not found.
Cause: The CUDA toolkit is installed but its bin/ directory is not in PATH.
Solution: Add CUDA to PATH:
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:${LD_LIBRARY_PATH}
Or pass the NVCC path explicitly to CMake:
cmake .. -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc
NCCL Communicator Creation Failures
NCCL Not Found
Symptom: CMake warning: “NCCL not found” or has_nccl() == false at runtime.
Solution:
Install NCCL:
sudo apt install libnccl2 libnccl-dev
If NCCL is installed to a non-standard location, set the CMake hint:
cmake .. -DNCCL_ROOT=/path/to/nccl
Verify at runtime:
dtl::environment env(argc, argv);
std::cout << "NCCL available: " << env.has_nccl() << "\n";
Single-GPU Testing
Note: NCCL supports single-GPU testing for correctness verification. Multiple MPI ranks share the same GPU, and NCCL uses shared memory transport between them. This does NOT test multi-GPU performance paths.
# Correctness test with 2 ranks on 1 GPU
mpirun -np 2 ./test_nccl_collectives
Deadlocks in Collective Operations
Mismatched Collective Calls
Symptom: Program hangs indefinitely at a collective operation (reduce, barrier, allgather, etc.).
Cause: Not all ranks participate in the collective, or ranks call different collectives.
Example of the bug:
// DEADLOCK: only rank 0 calls reduce
if (ctx.rank() == 0) {
auto result = dtl::reduce(vec, 0.0, std::plus<>{}); // Hangs!
}
Solution: Collective operations must be called by ALL ranks in the communicator:
// CORRECT: all ranks participate
auto result = dtl::reduce(vec, 0.0, std::plus<>{});
if (ctx.rank() == 0) {
std::cout << "Result: " << result << "\n";
}
Barrier Placement Errors
Symptom: Data corruption or stale values after a collective.
Cause: Missing barrier between write and read phases.
auto local = vec.local_view();
for (auto& x : local) x = compute(x);
// Missing barrier! Other ranks may not be done writing yet
auto result = dtl::reduce(vec, 0.0, std::plus<>{}); // May read stale data
Solution: Insert a barrier when needed between phases:
auto local = vec.local_view();
for (auto& x : local) x = compute(x);
vec.barrier(); // Ensure all ranks finish writing
auto result = dtl::reduce(vec, 0.0, std::plus<>{});
Note: Many DTL collective algorithms include an implicit barrier. Check the algorithm documentation.
Conditional Collective Calls
Symptom: Deadlock when a collective is inside a conditional that evaluates differently on different ranks.
// DEADLOCK: condition may differ across ranks
if (local_error_detected) {
dtl::reduce(error_counts, 0, std::plus<>{}); // Not all ranks reach this
}
Solution: Move the collective outside the conditional, or ensure the condition is the same on all ranks:
// CORRECT: all ranks participate, then check locally
auto total_errors = dtl::reduce(error_counts, 0, std::plus<>{});
if (total_errors > 0) {
handle_errors();
}
Futures Timeout Diagnostics
Future Never Completes
Symptom: future.get() blocks indefinitely or future.is_ready() never returns true.
Possible causes:
Missing progress: The underlying async operation needs polling to complete.
Deadlocked collective: The async operation wraps a collective that deadlocked.
MPI progress issue: MPI non-blocking operations need MPI_Test or MPI_Wait calls to make progress.
Solution:
auto future = dtl::async_reduce(vec, 0.0, std::plus<>{});
// Ensure progress is being made
while (!future.is_ready()) {
dtl::futures::progress_engine::instance().poll();
// Add a timeout for debugging
static int iterations = 0;
if (++iterations > 1000000) {
std::cerr << "WARNING: future not completing after 1M polls\n";
break;
}
}
Progress Engine Not Polled
Symptom: Async operations appear to hang, but the progress engine is never polled.
Cause: DTL’s async operations register callbacks with the progress engine. If nobody polls, the callbacks never execute.
Solution: Either:
Call future.get() (which polls internally)
Periodically call dtl::futures::progress_engine::instance().poll()
Use the blocking variants instead of async if you do not need overlap
Build System Issues
FindMPI.cmake Empty File
Symptom: MPI is not detected despite being installed. CMake silently skips MPI.
Cause: An empty cmake/FindMPI.cmake file in the DTL source tree shadows CMake’s built-in FindMPI module.
Solution: Delete the empty file:
rm cmake/FindMPI.cmake # If it exists and is empty
FindNCCL.cmake Not Found
Symptom: CMake cannot find NCCL even though it is installed.
Solution: DTL provides its own FindNCCL.cmake module. If it is missing, ensure the cmake/ directory is in the CMake module path:
cmake .. -DCMAKE_MODULE_PATH=/path/to/dtl/cmake
Compiler Version Too Old
Symptom: Build errors related to C++20 features: concepts, <source_location>, requires clauses, std::span.
Minimum requirements:
GCC 11+ (GCC 13+ recommended)
Clang 15+
MSVC 19.29+ (Visual Studio 2019 16.10+ with /std:c++20)
Solution: Upgrade your compiler:
# Ubuntu/Debian
sudo apt install g++-13
# Use explicitly
cmake .. -DCMAKE_CXX_COMPILER=/usr/bin/g++-13
libdtl_runtime.so Not Found at Runtime
Symptom: Runtime error: “error while loading shared libraries: libdtl_runtime.so: cannot open shared object file.”
Cause: libdtl_runtime.so is not in the library search path.
Solution:
# Option 1: Set LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/path/to/dtl/build/runtime:$LD_LIBRARY_PATH
# Option 2: Install DTL system-wide
sudo cmake --install build/
# Option 3: Use rpath during build
cmake .. -DCMAKE_INSTALL_RPATH_USE_LINK_PATH=ON
Python Binding Import Errors
Module Not Found
Symptom: import dtl fails with ModuleNotFoundError.
Cause: The Python extension module was not built or not installed.
Solution:
# Build Python bindings
cmake .. -DDTL_BUILD_PYTHON=ON
make _dtl
# Install into Python environment
make python_install
# Or set PYTHONPATH
export PYTHONPATH=/path/to/dtl/build/bindings/python:$PYTHONPATH
MPI Library Mismatch
Symptom: Python crashes on import dtl with a segfault, or mpi4py initialization fails.
Cause: mpi4py was built against a different MPI library than DTL. This commonly happens when conda provides its own MPI.
Solution: Rebuild mpi4py against the system MPI:
MPICC=/usr/bin/mpicc CC=/usr/bin/gcc LDSHARED="/usr/bin/gcc -shared" \
pip install --no-binary mpi4py --force-reinstall mpi4py
Verification:
python3 -c "
import mpi4py
print('mpi4py version:', mpi4py.__version__)
from mpi4py import MPI
print('MPI library:', MPI.Get_library_version())
"
The MPI library version printed should match your system MPI (e.g., OpenMPI 4.1.6), not conda’s.
NumPy Version Incompatibility
Symptom: Import error related to NumPy ABI version mismatch.
Cause: DTL’s Python bindings were built against a different NumPy version than what is currently installed.
Solution: Rebuild the Python bindings after updating NumPy:
pip install numpy # Ensure desired version
cd build
cmake .. -DDTL_BUILD_PYTHON=ON
make _dtl
make python_install
General Debugging Tips
Enable Verbose Output
Set environment variables for verbose MPI and CUDA output:
# MPI verbose output
export OMPI_MCA_mpi_show_mca_params=all
# CUDA error checking
export CUDA_LAUNCH_BLOCKING=1
# DTL debug mode (if built with Debug)
export DTL_DEBUG=1
Check Backend Availability
Print a diagnostic report at startup:
dtl::environment env(argc, argv);
std::cout << "MPI: " << env.has_mpi() << " (thread level: " << env.mpi_thread_level_name() << ")\n";
std::cout << "CUDA: " << env.has_cuda() << "\n";
std::cout << "HIP: " << env.has_hip() << "\n";
std::cout << "NCCL: " << env.has_nccl() << "\n";
std::cout << "SHMEM: " << env.has_shmem() << "\n";
Run with Address Sanitizer
For memory issues, build with sanitizers:
cmake .. -DCMAKE_BUILD_TYPE=Debug \
-DCMAKE_CXX_FLAGS="-fsanitize=address -fno-omit-frame-pointer"
Run with MPI Error Handler
Enable MPI error handlers for better diagnostics:
# OpenMPI: enable error messages instead of abort
export OMPI_MCA_mpi_abort_print_stack=1
See Also
Environment Guide – Backend lifecycle
Performance Tuning Guide – Optimization strategies
Migration Guide – Upgrading from V1.0 to V1.5
Contributing Guide – Development environment and contribution workflow