Backend Concepts

DTL uses C++20 concepts to define backend requirements. This ensures compile-time verification that backend implementations satisfy required interfaces.

Overview

DTL’s backend architecture has three primary concept hierarchies:

Concept Family	Purpose	Key Implementations
Communicator	Point-to-point and collective communication	`mpi_comm_adapter`
MemorySpace	Memory allocation and properties	`host_memory_space`, `cuda_memory_space`
Executor	Computation dispatch	`cpu_executor`, `inline_executor`

Communicator Concepts

Communicators provide distributed communication capabilities across ranks.

Communicator (Base)

The core concept for point-to-point communication:

template <typename T>
concept Communicator = requires(T& comm, const T& ccomm,
                                void* buf, const void* cbuf,
                                size_type count, rank_t rank,
                                int tag, request_handle& req) {
    typename T::size_type;

    // Query operations
    { ccomm.rank() } -> std::same_as<rank_t>;
    { ccomm.size() } -> std::same_as<rank_t>;

    // Blocking point-to-point
    { comm.send(cbuf, count, rank, tag) } -> std::same_as<void>;
    { comm.recv(buf, count, rank, tag) } -> std::same_as<void>;

    // Non-blocking point-to-point
    { comm.isend(cbuf, count, rank, tag) } -> std::same_as<request_handle>;
    { comm.irecv(buf, count, rank, tag) } -> std::same_as<request_handle>;

    // Request completion
    { comm.wait(req) } -> std::same_as<void>;
    { comm.test(req) } -> std::same_as<bool>;
};

Required Types:

size_type: Type for counts and sizes

Required Operations:

Operation	Description
`rank()`	Returns this process’s rank (0 to size-1)
`size()`	Returns total number of ranks
`send()`	Blocking send to destination rank
`recv()`	Blocking receive from source rank
`isend()`	Non-blocking send, returns request handle
`irecv()`	Non-blocking receive, returns request handle
`wait()`	Wait for non-blocking operation to complete
`test()`	Test if non-blocking operation completed

CollectiveCommunicator

Extends Communicator with collective operations:

template <typename T>
concept CollectiveCommunicator = Communicator<T> &&
    requires(T& comm, void* buf, const void* cbuf,
             size_type count, rank_t root) {
    { comm.barrier() } -> std::same_as<void>;
    { comm.broadcast(buf, count, root) } -> std::same_as<void>;
    { comm.scatter(cbuf, buf, count, root) } -> std::same_as<void>;
    { comm.gather(cbuf, buf, count, root) } -> std::same_as<void>;
    { comm.allgather(cbuf, buf, count) } -> std::same_as<void>;
    { comm.alltoall(cbuf, buf, count) } -> std::same_as<void>;
};

Additional Operations:

Operation	Description
`barrier()`	Synchronize all ranks
`broadcast()`	Root sends data to all ranks
`scatter()`	Root distributes data to all ranks
`gather()`	All ranks send data to root
`allgather()`	Gather data to all ranks
`alltoall()`	All-to-all exchange

ReducingCommunicator

Extends CollectiveCommunicator with reduction operations:

template <typename T>
concept ReducingCommunicator = CollectiveCommunicator<T> &&
    requires(T& comm, const void* sendbuf, void* recvbuf,
             size_type count, rank_t root) {
    { comm.reduce_sum(sendbuf, recvbuf, count, root) } -> std::same_as<void>;
    { comm.allreduce_sum(sendbuf, recvbuf, count) } -> std::same_as<void>;
};

Additional Operations:

Operation	Description
`reduce_sum()`	Sum reduction to root rank
`allreduce_sum()`	Sum reduction to all ranks

Communicator Tags

Tag types identify communicator implementations:

struct mpi_communicator_tag {};
struct shared_memory_communicator_tag {};
struct gpu_communicator_tag {};

MemorySpace Concepts

Memory spaces abstract memory allocation across different address spaces.

MemorySpace (Base)

template <typename T>
concept MemorySpace = requires(T& space, const T& cspace,
                               size_type size, size_type alignment,
                               void* ptr) {
    typename T::pointer;
    typename T::size_type;

    // Allocation
    { space.allocate(size) } -> std::same_as<void*>;
    { space.allocate(size, alignment) } -> std::same_as<void*>;
    { space.deallocate(ptr, size) } -> std::same_as<void>;

    // Properties
    { cspace.properties() } -> std::same_as<memory_space_properties>;
    { cspace.name() } -> std::convertible_to<const char*>;
};

Required Types:

pointer: Raw pointer type (typically void*)
size_type: Size type for counts

Required Operations:

Operation	Description
`allocate(size)`	Allocate `size` bytes
`allocate(size, alignment)`	Allocate with alignment requirement
`deallocate(ptr, size)`	Free allocation
`properties()`	Returns `memory_space_properties`
`name()`	Returns human-readable name

Memory Space Properties

struct memory_space_properties {
    bool host_accessible = true;    // CPU can access
    bool device_accessible = false; // GPU can access
    bool unified = false;           // Coherent across host/device
    bool supports_atomics = true;   // Atomic operations work
    bool pageable = true;           // Pageable (vs pinned)
    size_type alignment = alignof(std::max_align_t);
};

TypedMemorySpace

Adds typed allocation support:

template <typename Space, typename T>
concept TypedMemorySpace = MemorySpace<Space> &&
    requires(Space& space, size_type count, T* ptr) {
    { space.template allocate_typed<T>(count) } -> std::same_as<T*>;
    { space.deallocate_typed(ptr, count) } -> std::same_as<void>;
    { space.template construct<T>(ptr) } -> std::same_as<void>;
    { space.destroy(ptr) } -> std::same_as<void>;
};

Memory Space Tags

struct host_memory_space_tag {};    // CPU memory
struct device_memory_space_tag {};  // GPU memory
struct unified_memory_space_tag {}; // Managed memory
struct pinned_memory_space_tag {};  // Page-locked host memory

Executor Concepts

Executors dispatch computation to processing resources.

Executor (Base)

template <typename T>
concept Executor = requires(T& exec, const T& cexec,
                            std::function<void()> f) {
    { exec.execute(f) } -> std::same_as<void>;
    { cexec.name() } -> std::convertible_to<const char*>;
};

Required Operations:

Operation	Description
`execute(f)`	Execute callable `f`
`name()`	Returns executor name

ParallelExecutor

Adds parallel execution capabilities:

template <typename T>
concept ParallelExecutor = Executor<T> &&
    requires(T& exec, const T& cexec,
             size_type count, std::function<void(size_type)> f) {
    { exec.parallel_for(count, f) } -> std::same_as<void>;
    { cexec.max_parallelism() } -> std::same_as<size_type>;
    { cexec.suggested_parallelism() } -> std::same_as<size_type>;
};

Additional Operations:

Operation	Description
`parallel_for(count, f)`	Execute `f(i)` for i in [0, count)
`max_parallelism()`	Maximum concurrent work items
`suggested_parallelism()`	Recommended parallelism level

BulkExecutor

Optimized for bulk operations with chunking:

template <typename T>
concept BulkExecutor = ParallelExecutor<T> &&
    requires(T& exec, size_type count,
             std::function<void(size_type, size_type)> f) {
    { exec.bulk_execute(count, f) } -> std::same_as<void>;
};

Executor Properties

struct executor_properties {
    size_type max_concurrency = 1;    // Max concurrent work items
    bool in_order = true;              // Ordered execution
    bool owns_threads = false;         // Manages thread pool
    bool supports_work_stealing = false;
};

Executor Tags

struct inline_executor_tag {};
struct thread_pool_executor_tag {};
struct single_thread_executor_tag {};
struct gpu_executor_tag {};

Standard Executors

DTL provides these built-in executors:

inline_executor

Executes work immediately in the calling thread:

class inline_executor {
public:
    template <typename F>
    void execute(F&& f) { std::forward<F>(f)(); }

    static constexpr const char* name() noexcept { return "inline"; }
    static constexpr bool is_synchronous() noexcept { return true; }
};

sequential_executor

Sequential execution with parallel interface:

class sequential_executor {
public:
    template <typename F>
    void parallel_for(size_type count, F&& f) {
        for (size_type i = 0; i < count; ++i) { f(i); }
    }

    static constexpr size_type max_parallelism() noexcept { return 1; }
};

Concept Verification

Implementations use static_assert to verify concept satisfaction:

// In mpi_comm_adapter.hpp
static_assert(Communicator<mpi_comm_adapter>,
              "mpi_comm_adapter must satisfy Communicator concept");
static_assert(CollectiveCommunicator<mpi_comm_adapter>,
              "mpi_comm_adapter must satisfy CollectiveCommunicator concept");
static_assert(ReducingCommunicator<mpi_comm_adapter>,
              "mpi_comm_adapter must satisfy ReducingCommunicator concept");

// In cuda_memory_space.hpp
static_assert(MemorySpace<cuda_memory_space>,
              "cuda_memory_space must satisfy MemorySpace concept");

// In cpu_executor.hpp
static_assert(Executor<cpu_executor>,
              "cpu_executor must satisfy Executor concept");
static_assert(ParallelExecutor<cpu_executor>,
              "cpu_executor must satisfy ParallelExecutor concept");

Concept Header Locations

Concept	Header
Communicator	`include/dtl/backend/concepts/communicator.hpp`
MemorySpace	`include/dtl/backend/concepts/memory_space.hpp`
Executor	`include/dtl/backend/concepts/executor.hpp`
MemoryTransfer	`include/dtl/backend/concepts/memory_transfer.hpp`
Serializer	`include/dtl/backend/concepts/serializer.hpp`

Backend Concepts

Overview

Communicator Concepts

Communicator (Base)

CollectiveCommunicator

ReducingCommunicator

Communicator Tags

MemorySpace Concepts

MemorySpace (Base)

Memory Space Properties

TypedMemorySpace

Memory Space Tags

Executor Concepts

Executor (Base)

ParallelExecutor

BulkExecutor

Executor Properties

Executor Tags

Standard Executors

inline_executor

sequential_executor

Concept Verification

Concept Header Locations

See Also