Distributed Template Library (DTL)

DTL is a C++20 header-only template library providing STL-inspired abstractions for distributed and heterogeneous computing. It uses policy-based design with C++20 concepts to enable portable code across CPUs, GPUs, and multi-node systems.

Current Version: 0.1.0-alpha.1
Status: Alpha pre-release

New to DTL? Start with the README for a quick overview and installation guide.



Key Design Philosophy

DTL does not pretend distribution is transparent. Remote access is “syntactically loud” via remote_ref<T> (no implicit T& conversions), and communication costs are explicit in the API.

Core Principles

  1. STL familiarity where honest - Local views and iterators align with STL expectations

  2. Explicit distribution - Ownership, partitioning, and communication are first-class concepts

  3. HPC-first architecture - Compile-time binding and thin wrappers by default

  4. Orthogonal policy system - Partition, placement, execution, consistency, and error policies are independent axes

  5. Segmented iteration - The primary performance substrate for distributed algorithms


Documentation Structure

For New Users

Start with the Getting Started Guide to build DTL and run your first distributed program.

User Guide

The User Guide covers day-to-day usage of DTL:

  • Environment - Backend lifecycle and context creation

  • Containers - distributed_vector, distributed_array, distributed_span, distributed_tensor, distributed_map

  • Views - local_view, global_view, segmented_view, remote_ref

  • Policies - Partition, placement, execution, consistency, error

  • Algorithms - for_each, transform, reduce, scan, sort

  • Error Handling - Result-based vs throwing patterns

  • Language Bindings - C, Python, and Fortran bindings

Language Bindings

DTL provides C, Python, and Fortran bindings alongside the native C++ API.

Migration Guide

  • From STL - Mapping STL patterns to DTL equivalents

Backend Guide

For backend developers and advanced users.

Reference Documentation

Complete API reference for the containers, views, policies, and algorithms listed above.


Quick Example

#include <dtl/dtl.hpp>
#include <iostream>

int main() {
    // Create a distributed vector (standalone mode, no MPI required)
    dtl::distributed_vector<int> vec(100, /*num_ranks=*/1, /*my_rank=*/0);

    // Fill with values using local view (STL-compatible, no communication)
    auto local = vec.local_view();
    for (dtl::size_type i = 0; i < local.size(); ++i) {
        local[i] = static_cast<int>(i);
    }

    // Use DTL's for_each algorithm
    dtl::for_each(vec, [](int& x) { x = x * x; });

    // Local reduce (no MPI communication needed in single-rank mode)
    int sum = dtl::local_reduce(vec, 0, std::plus<>{});
    std::cout << "Sum of squares: " << sum << "\n";

    return 0;
}

Alpha Limitations

The following features are planned for future releases:

  • distributed_vector::redistribute() runtime partition changes

  • remote_ref cross-rank get/put operations

  • Network topology multi-host discovery

  • Distributed map binding surface (C++ core available, bindings deferred)

  • Remote/RPC binding surface (deferred)


Supported Backends

Backend          Status        Description
CPU (host_only)  Complete      Single-node, CPU-only execution
MPI              Complete      Multi-node distributed execution
MPI RMA          Complete      One-sided remote memory access
CUDA             Experimental  GPU acceleration (NVIDIA)
HIP              Experimental  GPU acceleration (AMD)
NCCL             Experimental  GPU collective communication

CUDA Placement Policies

DTL manages GPU memory through placement policies, which make device residency explicit in the container type:

// GPU-resident data on device 0
dtl::distributed_vector<float, dtl::device_only<0>> gpu_vec0(N, 1, 0);

// GPU-resident data on device 1 (different type, different device)
dtl::distributed_vector<float, dtl::device_only<1>> gpu_vec1(N, 1, 0);

// Each container carries device affinity
assert(gpu_vec0.device_id() == 0);
assert(gpu_vec1.device_id() == 1);

// Unified memory (host+device accessible)
dtl::distributed_vector<float, dtl::unified_memory> unified_vec(N, 1, 0);
auto local = unified_vec.local_view();  // Direct host access
local[0] = 42.0f;  // Write from host
transform_kernel<<<grid, block>>>(unified_vec.local_data(), n);  // Use on GPU

// GPU with host fallback (uses device if CUDA enabled, host otherwise)
dtl::distributed_vector<float, dtl::device_preferred> vec(N, 1, 0);

Device Selection: Allocations for device_only<N> are scoped to device N using RAII device guards. The caller’s current CUDA context device is preserved across container construction and destruction.

See examples/gpu/ for complete GPU examples.


Requirements

  • C++ Standard: C++20

  • Compilers: GCC 11+, Clang 15+, MSVC 19.29+

  • Optional: MPI (OpenMPI, MPICH), CUDA 11.4+

See Getting Started for detailed build instructions.