Distributed Template Library (DTL)

DTL is a C++20 header-only template library providing STL-inspired abstractions for distributed and heterogeneous computing. It uses policy-based design with C++20 concepts to enable portable code across CPUs, GPUs, and multi-node systems.

Current Version: 0.1.0-alpha.1
Status: Alpha pre-release

New to DTL? Start with the README for a quick overview and installation guide.



Key Design Philosophy

DTL does not pretend distribution is transparent. Remote access is “syntactically loud” via remote_ref<T> (no implicit T& conversions), and communication costs are explicit in the API.

Core Principles

  1. STL familiarity where honest - Local views and iterators align with STL expectations

  2. Explicit distribution - Ownership, partitioning, and communication are first-class concepts

  3. HPC-first architecture - Compile-time binding and thin wrappers by default

  4. Orthogonal policy system - Partition, placement, execution, consistency, and error policies are independent axes

  5. Segmented iteration - The primary performance substrate for distributed algorithms


Documentation Structure

For New Users

Start with the Getting Started Guide to build DTL and run your first distributed program.

User Guide

The User Guide covers day-to-day usage of DTL:

  • Environment - Backend lifecycle and context creation

  • Containers - distributed_vector, distributed_array, distributed_span, distributed_tensor, distributed_map

  • Views - local_view, global_view, segmented_view, remote_ref

  • Policies - Partition, placement, execution, consistency, error

  • Algorithms - for_each, transform, reduce, scan, sort

  • Error Handling - Result-based vs throwing patterns

  • Language Bindings - C, Python, and Fortran bindings

Language Bindings

DTL provides C, Python, and Fortran bindings alongside the native C++ API.

Migration Guide

  • From STL - Mapping STL patterns to DTL equivalents

Backend Guide

For backend developers and advanced users.

Reference Documentation

Complete API reference for the containers, views, policies, and algorithms listed above.


Quick Example

#include <dtl/dtl.hpp>
#include <iostream>

int main() {
    // Create a distributed vector (standalone mode, no MPI required)
    dtl::distributed_vector<int> vec(100, /*num_ranks=*/1, /*my_rank=*/0);

    // Fill with values using local view (STL-compatible, no communication)
    auto local = vec.local_view();
    for (dtl::size_type i = 0; i < local.size(); ++i) {
        local[i] = static_cast<int>(i);
    }

    // Use DTL's for_each algorithm
    dtl::for_each(vec, [](int& x) { x = x * x; });

    // Local reduce (no MPI communication needed in single-rank mode)
    int sum = dtl::local_reduce(vec, 0, std::plus<>{});
    std::cout << "Sum of squares: " << sum << "\n";

    return 0;
}

Alpha Limitations

The following features are planned for future releases:

  • distributed_vector::redistribute() runtime partition changes

  • remote_ref cross-rank get/put operations

  • Network topology multi-host discovery

  • Distributed map binding surface (C++ core available, bindings deferred)

  • Remote/RPC binding surface (deferred)


Supported Backends

Backend          Status        Description
CPU (host_only)  Complete      Single-node, CPU-only execution
MPI              Complete      Multi-node distributed execution
MPI RMA          Complete      One-sided remote memory access
CUDA             Experimental  GPU acceleration (NVIDIA)
HIP              Experimental  GPU acceleration (AMD)
NCCL             Experimental  GPU collective communication

CUDA Placement Policies

DTL manages GPU memory through placement policies, which make device residency explicit in the container type:

// GPU-resident data on device 0
dtl::distributed_vector<float, dtl::device_only<0>> gpu_vec0(N, 1, 0);

// GPU-resident data on device 1 (different type, different device)
dtl::distributed_vector<float, dtl::device_only<1>> gpu_vec1(N, 1, 0);

// Each container carries device affinity
assert(gpu_vec0.device_id() == 0);
assert(gpu_vec1.device_id() == 1);

// Unified memory (host+device accessible)
dtl::distributed_vector<float, dtl::unified_memory> unified_vec(N, 1, 0);
auto local = unified_vec.local_view();  // Direct host access
local[0] = 42.0f;  // Write from host
transform_kernel<<<grid, block>>>(unified_vec.local_data(), n);  // Use on GPU

// GPU with host fallback (uses device if CUDA enabled, host otherwise)
dtl::distributed_vector<float, dtl::device_preferred> vec(N, 1, 0);

Device Selection: Allocations for device_only<N> are scoped to device N using RAII device guards. The caller’s current CUDA context device is preserved across container construction and destruction.

See examples/gpu/ for complete GPU examples.


Requirements

  • C++ Standard: C++20

  • Compilers: GCC 11+, Clang 15+, MSVC 19.29+

  • Optional: MPI (OpenMPI, MPICH), CUDA 11.4+

See Getting Started for detailed build instructions.