Legacy Deep-Dive: Policies

This page is retained as a detailed reference. The canonical user path is now the chaptered handbook.

Primary chapter: 06-policies-and-execution-control.md

Runtime and handles: Runtime and Handle Model


Detailed Reference (Legacy)

DTL uses a policy-based design that separates concerns into orthogonal configuration axes. This allows flexible, compile-time configuration of distributed behavior.


Table of Contents


Overview

Distributed programming entangles multiple concerns:

  • How data is partitioned across ranks

  • Where data resides (host vs device memory)

  • How operations execute (sync vs async)

  • When writes become visible

  • How errors are handled

DTL separates these into five orthogonal policy axes, allowing you to configure each independently.

Why Policies?

// Without policies: hardcoded behavior
distributed_vector<double> vec(1000);  // What partition? What memory? What error handling?

// With policies: explicit, configurable behavior
distributed_vector<double, block_partition<>, host_only> vec(1000);
// Or using policy_set for runtime composition

The Five Policy Axes

Axis

Question

Default

Partition

How is data divided across ranks?

block_partition

Placement

Where does data live (host/device)?

host_only

Execution

How do operations execute?

seq (synchronous)

Consistency

When are writes visible?

bulk_synchronous

Error

How are errors reported?

expected (result-based)


Partition Policies

Partition policies determine how global indices map to ranks.

block_partition (Default)

Divides data into contiguous chunks:

// 1000 elements across 4 ranks:
// Rank 0: indices [0, 250)
// Rank 1: indices [250, 500)
// Rank 2: indices [500, 750)
// Rank 3: indices [750, 1000)

dtl::distributed_vector<double, dtl::block_partition<>> vec(1000, size, rank);

// Block partition is the default
dtl::distributed_vector<double> vec_default(1000, size, rank);  // Same as above

Properties:

  • Contiguous local storage

  • Good cache locality

  • Simple ownership queries

  • Best for sequential access patterns

cyclic_partition

Round-robin element distribution (planned):

// 1000 elements across 4 ranks:
// Rank 0: indices 0, 4, 8, 12, ...
// Rank 1: indices 1, 5, 9, 13, ...
// Rank 2: indices 2, 6, 10, 14, ...
// Rank 3: indices 3, 7, 11, 15, ...

dtl::distributed_vector<double, dtl::cyclic_partition<>> vec(1000, size, rank);

Properties:

  • Better load balancing for irregular access

  • Non-contiguous local storage

  • Higher overhead for sequential access

block_cyclic_partition

Combines block and cyclic (planned):

// Block size 64, cyclic distribution:
// Rank 0: indices [0,64), [256,320), ...
// Rank 1: indices [64,128), [320,384), ...
// etc.

dtl::distributed_vector<double, dtl::block_cyclic_partition<64>> vec(1000, size, rank);

Properties:

  • Balance between locality and load balancing

  • Standard in scientific computing (ScaLAPACK)

hash_partition

Hash-based distribution (for associative containers):

// Elements distributed by hash of key
dtl::distributed_unordered_map<std::string, int, dtl::hash_partition<>> map(size, rank);

// Custom hash function
dtl::distributed_unordered_map<Key, Value, dtl::hash_partition<MyHash>> map(size, rank);

replicated

Full copy on each rank:

// Every rank has complete copy
dtl::distributed_vector<double, dtl::replicated> lookup_table(1000, size, rank);

Properties:

  • No communication for reads

  • Writes require synchronization

  • Memory scales with rank count


Placement Policies

Placement policies determine where data resides physically.

host_only (Default)

Data resides in host (CPU) memory:

dtl::distributed_vector<double, dtl::block_partition<>, dtl::host_only> vec(1000, size, rank);

// host_only is the default
dtl::distributed_vector<double> vec_default(1000, size, rank);  // Same as above

Properties:

  • Universal compatibility

  • No GPU required

  • Standard allocators

device_only

Data resides in device (GPU) memory:

// Requires DTL_ENABLE_CUDA or DTL_ENABLE_HIP
dtl::distributed_vector<double, dtl::block_partition<>, dtl::device_only<0>> vec(1000, size, rank);

// Access requires GPU kernels or explicit transfer
auto local = vec.local_view();  // Returns device pointer

Properties:

  • Data stays on GPU

  • Host access requires transfer

  • Best for GPU-only workflows

device_preferred

Prefers device memory with automatic fallback:

dtl::distributed_vector<double, dtl::block_partition<>, dtl::device_preferred> vec(1000, size, rank);

// Uses GPU memory if available, host memory otherwise

unified_memory

CUDA Unified Memory (managed memory):

dtl::distributed_vector<double, dtl::block_partition<>, dtl::unified_memory> vec(1000, size, rank);

// Accessible from both host and device
// Automatic page migration

Properties:

  • Convenience for mixed host/device access

  • Performance implications from page faults

  • Requires CUDA unified memory support


Execution Policies

Execution policies control how operations are performed.

seq (Default)

Synchronous, blocking execution:

// Operation completes before returning
dtl::for_each(dtl::seq, vec, [](double& x) { x *= 2; });

// seq is the default
dtl::for_each(vec, [](double& x) { x *= 2; });  // Same as above

Properties:

  • Simple to reason about

  • Deterministic completion

  • No concurrent execution

par

Parallel execution (blocking):

// Uses multiple threads, but still blocks until complete
dtl::for_each(dtl::par, vec, [](double& x) { x *= 2; });

Properties:

  • Utilizes multiple CPU cores

  • Still blocks caller

  • Thread-safe functor required

par_unseq

Parallel and vectorized (blocking):

// Enables SIMD and multi-threading
dtl::for_each(dtl::par_unseq, vec, [](double& x) { x *= 2; });

Properties:

  • Maximum CPU parallelism

  • Functor must be vectorization-safe

  • No synchronization in functor

async

Non-blocking execution:

// Returns immediately with a future
auto future = dtl::for_each(dtl::async, vec, [](double& x) { x *= 2; });

// Do other work...

// Wait for completion
future.wait();

Properties:

  • Enables overlap of computation and communication

  • Returns future/event handle

  • Requires explicit synchronization

Usage with Algorithms

// Transform with parallel execution
dtl::transform(dtl::par, vec, output, [](double x) { return x * x; });

// Reduce with async execution
auto future = dtl::reduce(dtl::async, vec, 0.0, std::plus<>{});
// ... do other work ...
double result = future.get();

Consistency Policies

Consistency policies define when writes become visible to other ranks.

bulk_synchronous (Default)

BSP model with explicit barriers:

// Writes not visible until barrier
dtl::distributed_vector<double, ..., dtl::bulk_synchronous> vec(1000, size, rank);

auto local = vec.local_view();
local[0] = 42.0;  // Local write

// Writes become visible after barrier
vec.barrier();

Properties:

  • Clear synchronization points

  • Simple reasoning about visibility

  • Standard HPC model

sequential_consistent

Strongest consistency (planned):

dtl::distributed_vector<double, ..., dtl::sequential_consistent> vec(1000, size, rank);

// All operations appear in a single global order
// Higher synchronization overhead

release_acquire

C++ memory model consistency (planned):

// Writes in release-ordered operations visible to acquire-ordered readers

relaxed

Minimal ordering (planned):

// Only atomicity guaranteed, no ordering
// Maximum performance, complex reasoning

Error Policies

Error policies determine how errors are reported.

expected (Default)

Result-based error handling:

dtl::distributed_vector<double, ..., dtl::expected> vec(1000, size, rank);

auto global = vec.global_view();
auto result = global[500].get();

if (result.has_value()) {
    double val = result.value();
} else {
    auto error = result.error();
    // Handle error
}

Properties:

  • No exceptions

  • Explicit error checking

  • Compile-time enforced handling

throwing

Exception-based error handling:

dtl::distributed_vector<double, ..., dtl::throwing> vec(1000, size, rank);

try {
    auto global = vec.global_view();
    double val = global[500].get();  // Throws on error
} catch (const dtl::communication_error& e) {
    // Handle error
}

Properties:

  • Familiar exception patterns

  • Automatic propagation

  • Cannot be ignored


Policy Composition

Using policy_set

Combine policies into a single set:

using my_policies = dtl::policy_set<
    dtl::block_partition<>,
    dtl::host_only,
    dtl::par,
    dtl::bulk_synchronous,
    dtl::expected
>;

dtl::distributed_vector<double, my_policies> vec(1000, size, rank);

Partial Specification

Unspecified axes use defaults:

// Only specify partition, others use defaults
dtl::distributed_vector<double, dtl::cyclic_partition<>> vec(1000, size, rank);
// Equivalent to:
// dtl::distributed_vector<double, cyclic_partition<>, host_only, seq, bulk_synchronous, expected>

Call-Site Override

Override policies per-operation:

dtl::distributed_vector<double> vec(1000, size, rank);  // Default policies

// Override execution policy for this call
dtl::for_each(dtl::par, vec, [](double& x) { x *= 2; });

// Override multiple policies
dtl::for_each(
    dtl::policy_set<dtl::par, dtl::async>{},
    vec,
    [](double& x) { x *= 2; }
);

Policy Precedence

When multiple policy sources exist, precedence is:

  1. Call-site policy_set (highest priority)

  2. Container-level defaults

  3. Context default policy_set

  4. Library defaults (lowest priority)

// Context with default parallel execution
auto ctx = dtl::context(dtl::policy_set<dtl::par>{});

// Container uses context default (par)
dtl::distributed_vector<double> vec(ctx, 1000, size, rank);

// Operation uses container default (par)
dtl::for_each(vec, func);  // Parallel execution

// Call-site override beats all
dtl::for_each(dtl::seq, vec, func);  // Sequential execution

Conflict Detection

Conflicting policies at the same level cause errors:

// COMPILE ERROR: two partition policies
dtl::distributed_vector<double,
    dtl::policy_set<dtl::block_partition<>, dtl::cyclic_partition<>>
> vec(1000, size, rank);

Common Policy Combinations

High-Performance Computing (Default)

using hpc_policies = dtl::policy_set<
    dtl::block_partition<>,
    dtl::host_only,
    dtl::par,
    dtl::bulk_synchronous,
    dtl::expected
>;

GPU Accelerated

using gpu_policies = dtl::policy_set<
    dtl::block_partition<>,
    dtl::device_only<0>,
    dtl::par,
    dtl::bulk_synchronous,
    dtl::expected
>;

Development/Debugging

using debug_policies = dtl::policy_set<
    dtl::block_partition<>,
    dtl::host_only,
    dtl::seq,           // Sequential for easier debugging
    dtl::bulk_synchronous,
    dtl::throwing       // Exceptions for stack traces
>;

Maximum Throughput

using throughput_policies = dtl::policy_set<
    dtl::block_partition<>,
    dtl::device_preferred,
    dtl::par_unseq,
    dtl::bulk_synchronous,
    dtl::expected
>;

See Also