Legacy Deep-Dive: Policies

This page is retained as a detailed reference. The canonical user path is now the chaptered handbook.

Primary chapter: 06-policies-and-execution-control.md

Runtime and handles: Runtime and Handle Model

Detailed Reference (Legacy)

DTL uses a policy-based design that separates concerns into orthogonal configuration axes. This allows flexible, compile-time configuration of distributed behavior.

Overview

Distributed programming entangles multiple concerns:

How data is partitioned across ranks
Where data resides (host vs device memory)
How operations execute (sync vs async)
When writes become visible
How errors are handled

DTL separates these into five orthogonal policy axes, allowing you to configure each independently.

Why Policies?

// Without policies: hardcoded behavior
distributed_vector<double> vec(1000);  // What partition? What memory? What error handling?

// With policies: explicit, configurable behavior
distributed_vector<double, block_partition<>, host_only> vec(1000);
// Or bundle the policies into a policy_set (see Policy Composition); composition still happens at compile time

The Five Policy Axes

Axis	Question	Default
Partition	How is data divided across ranks?	`block_partition`
Placement	Where does data live (host/device)?	`host_only`
Execution	How do operations execute?	`seq` (synchronous)
Consistency	When are writes visible?	`bulk_synchronous`
Error	How are errors reported?	`expected` (result-based)

Partition Policies

Partition policies determine how global indices map to ranks.

block_partition (Default)

Divides data into contiguous chunks:

// 1000 elements across 4 ranks:
// Rank 0: indices [0, 250)
// Rank 1: indices [250, 500)
// Rank 2: indices [500, 750)
// Rank 3: indices [750, 1000)

dtl::distributed_vector<double, dtl::block_partition<>> vec(1000, size, rank);

// Block partition is the default
dtl::distributed_vector<double> vec_default(1000, size, rank);  // Same as above

Properties:

Contiguous local storage
Good cache locality
Simple ownership queries
Best for sequential access patterns
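The index arithmetic behind a block layout can be sketched in a few lines. This is a minimal standalone illustration, not DTL's implementation: `block_range` and `block_owner` are hypothetical helper names, and the rule that any remainder goes one element each to the lowest ranks is an assumption (the text above only covers the evenly divisible case).

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Half-open range [first, second) of global indices owned by `rank`.
// Assumption: a remainder is spread one element each over the lowest ranks.
std::pair<std::size_t, std::size_t>
block_range(std::size_t n, std::size_t size, std::size_t rank) {
    assert(size > 0 && rank < size);
    const std::size_t base = n / size;   // elements every rank receives
    const std::size_t extra = n % size;  // ranks [0, extra) receive one more
    const std::size_t begin = rank * base + (rank < extra ? rank : extra);
    const std::size_t count = base + (rank < extra ? 1 : 0);
    return {begin, begin + count};
}

// Rank that owns global index i under the same layout.
std::size_t block_owner(std::size_t n, std::size_t size, std::size_t i) {
    assert(size > 0 && i < n);
    const std::size_t base = n / size;
    const std::size_t extra = n % size;
    // First global index that belongs to a rank holding only `base` elements.
    const std::size_t boundary = extra * (base + 1);
    return i < boundary ? i / (base + 1) : extra + (i - boundary) / base;
}
```

With 1000 elements on 4 ranks, `block_range(1000, 4, 1)` yields `[250, 500)`, matching the layout listed above. Both functions are O(1), which is what makes ownership queries cheap for this partition.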

cyclic_partition

Round-robin element distribution (planned):

// 1000 elements across 4 ranks:
// Rank 0: indices 0, 4, 8, 12, ...
// Rank 1: indices 1, 5, 9, 13, ...
// Rank 2: indices 2, 6, 10, 14, ...
// Rank 3: indices 3, 7, 11, 15, ...

dtl::distributed_vector<double, dtl::cyclic_partition<>> vec(1000, size, rank);

Properties:

Better load balancing for irregular access
Non-contiguous local storage
Higher overhead for sequential access
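Since cyclic_partition is still planned, the following is only a sketch of the usual round-robin mapping, written as free functions with hypothetical names. It shows how a global index splits into an owning rank and a local slot, and how the inverse mapping recovers the global index.

```cpp
#include <cassert>
#include <cstddef>

// Rank that owns global index i when elements are dealt round-robin.
std::size_t cyclic_owner(std::size_t i, std::size_t size) {
    assert(size > 0);
    return i % size;
}

// Position of global index i inside its owner's local storage.
std::size_t cyclic_local_index(std::size_t i, std::size_t size) {
    assert(size > 0);
    return i / size;
}

// Inverse mapping: global index of local slot `local` on `rank`.
std::size_t cyclic_global_index(std::size_t local, std::size_t size, std::size_t rank) {
    assert(rank < size);
    return local * size + rank;
}

// Number of elements stored on `rank` for n global elements.
std::size_t cyclic_local_count(std::size_t n, std::size_t size, std::size_t rank) {
    assert(rank < size);
    return n / size + (rank < n % size ? 1 : 0);
}
```

Neighbouring global indices land on different ranks, which is why sequential traversal of the global index space touches every rank in turn.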

block_cyclic_partition

Combines block and cyclic (planned):

// Block size 64, blocks dealt round-robin across 4 ranks:
// Rank 0: indices [0,64), [256,320), ...
// Rank 1: indices [64,128), [320,384), ...
// etc.

dtl::distributed_vector<double, dtl::block_cyclic_partition<64>> vec(1000, size, rank);

Properties:

Balance between locality and load balancing
Standard in scientific computing (ScaLAPACK)
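The block-cyclic mapping combines the two previous schemes: indices are grouped into fixed-size blocks, and the blocks are dealt round-robin. The sketch below uses hypothetical function names and is not taken from DTL, where this policy is still planned.

```cpp
#include <cassert>
#include <cstddef>

// Rank that owns global index i for a given block size.
std::size_t block_cyclic_owner(std::size_t i, std::size_t block, std::size_t size) {
    assert(block > 0 && size > 0);
    return (i / block) % size;
}

// Position of global index i inside its owner's local storage.
std::size_t block_cyclic_local_index(std::size_t i, std::size_t block, std::size_t size) {
    assert(block > 0 && size > 0);
    const std::size_t global_block = i / block;           // which block i falls in
    const std::size_t local_block = global_block / size;  // how many earlier blocks the owner holds
    return local_block * block + i % block;
}
```

For block size 64 on 4 ranks, index 256 maps back to rank 0 and index 320 to rank 1, in line with the ranges in the comment above.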

hash_partition

Hash-based distribution (for associative containers):

// Elements distributed by hash of key
dtl::distributed_unordered_map<std::string, int, dtl::hash_partition<>> map(size, rank);

// Custom hash function
dtl::distributed_unordered_map<Key, Value, dtl::hash_partition<MyHash>> map(size, rank);
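How a key might be mapped to its owning rank can be sketched as a hash followed by a modulo. The names `hash_owner` and `first_char_hash` are illustrative assumptions; the source does not say how DTL reduces a hash value to a rank.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Owning rank for `key`: hash the key, then reduce modulo the rank count.
template <class Key, class Hash = std::hash<Key>>
std::size_t hash_owner(const Key& key, std::size_t size, Hash hash = Hash{}) {
    assert(size > 0);
    return hash(key) % size;
}

// Example custom hash: uses only the first character of the string.
struct first_char_hash {
    std::size_t operator()(const std::string& s) const {
        return s.empty() ? 0 : static_cast<unsigned char>(s[0]);
    }
};
```

Every rank must use the same hash function, otherwise ranks would disagree about who owns a key. Note that `std::hash` values are not guaranteed to be identical across different standard library implementations.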

replicated

Full copy on each rank:

// Every rank has complete copy
dtl::distributed_vector<double, dtl::replicated> lookup_table(1000, size, rank);

Properties:

No communication for reads
Writes require synchronization
Total memory grows with rank count, since every rank stores all elements

Placement Policies

Placement policies determine where data resides physically.

host_only (Default)

Data resides in host (CPU) memory:

dtl::distributed_vector<double, dtl::block_partition<>, dtl::host_only> vec(1000, size, rank);

// host_only is the default
dtl::distributed_vector<double> vec_default(1000, size, rank);  // Same as above

Properties:

Universal compatibility
No GPU required
Standard allocators

device_only

Data resides in device (GPU) memory:

// Requires DTL_ENABLE_CUDA or DTL_ENABLE_HIP
dtl::distributed_vector<double, dtl::block_partition<>, dtl::device_only<0>> vec(1000, size, rank);

// Access requires GPU kernels or explicit transfer
auto local = vec.local_view();  // Returns device pointer

Properties:

Data stays on GPU
Host access requires transfer
Best for GPU-only workflows

device_preferred

Prefers device memory with automatic fallback:

dtl::distributed_vector<double, dtl::block_partition<>, dtl::device_preferred> vec(1000, size, rank);

// Uses GPU memory if available, host memory otherwise

unified_memory

CUDA Unified Memory (managed memory):

dtl::distributed_vector<double, dtl::block_partition<>, dtl::unified_memory> vec(1000, size, rank);

// Accessible from both host and device
// Automatic page migration

Properties:

Convenience for mixed host/device access
Performance implications from page faults
Requires CUDA unified memory support

Execution Policies

Execution policies control how operations are performed.

seq (Default)

Synchronous, blocking execution:

// Operation completes before returning
dtl::for_each(dtl::seq, vec, [](double& x) { x *= 2; });

// seq is the default
dtl::for_each(vec, [](double& x) { x *= 2; });  // Same as above

Properties:

Simple to reason about
Deterministic completion
No concurrent execution

par

Parallel execution (blocking):

// Uses multiple threads, but still blocks until complete
dtl::for_each(dtl::par, vec, [](double& x) { x *= 2; });

Properties:

Utilizes multiple CPU cores
Still blocks caller
Thread-safe functor required

par_unseq

Parallel and vectorized (blocking):

// Enables SIMD and multi-threading
dtl::for_each(dtl::par_unseq, vec, [](double& x) { x *= 2; });

Properties:

Maximum CPU parallelism
Functor must be vectorization-safe
No synchronization in functor

async

Non-blocking execution:

// Returns immediately with a future
auto future = dtl::for_each(dtl::async, vec, [](double& x) { x *= 2; });

// Do other work...

// Wait for completion
future.wait();

Properties:

Enables overlap of computation and communication
Returns future/event handle
Requires explicit synchronization
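The non-blocking pattern can be imitated on a single node with the standard library. The sketch below is an analogy built on `std::async`, not DTL's mechanism; `for_each_async` is a hypothetical helper and operates on a plain `std::vector`.

```cpp
#include <cassert>
#include <future>
#include <utility>
#include <vector>

// Applies fn to every element on another thread and returns immediately.
// The caller must keep `data` alive and must not touch it until the
// returned future has been waited on.
template <class F>
std::future<void> for_each_async(std::vector<double>& data, F fn) {
    return std::async(std::launch::async, [&data, fn = std::move(fn)]() mutable {
        for (double& x : data) {
            fn(x);
        }
    });
}
```

A caller would start the operation, do unrelated work, and then call `wait()` on the future before reading the data, which is the explicit synchronization step listed above.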

Usage with Algorithms

// Transform with parallel execution
dtl::transform(dtl::par, vec, output, [](double x) { return x * x; });

// Reduce with async execution
auto future = dtl::reduce(dtl::async, vec, 0.0, std::plus<>{});
// ... do other work ...
double result = future.get();

Consistency Policies

Consistency policies define when writes become visible to other ranks.

bulk_synchronous (Default)

BSP model with explicit barriers:

// Writes not visible until barrier
dtl::distributed_vector<double, ..., dtl::bulk_synchronous> vec(1000, size, rank);

auto local = vec.local_view();
local[0] = 42.0;  // Local write

// Writes become visible after barrier
vec.barrier();

Properties:

Clear synchronization points
Simple reasoning about visibility
Standard HPC model
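The visibility rule can be pictured with a toy single-process model in which writes are staged and only published at a barrier. The class `bsp_buffer` is purely illustrative and says nothing about how DTL implements its barrier; `read` here stands for what another rank would observe.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model of bulk-synchronous visibility (not DTL's implementation).
class bsp_buffer {
public:
    explicit bsp_buffer(std::size_t n) : visible_(n, 0.0), pending_(n, 0.0) {}

    // Local write: staged, not yet observable through read().
    void write(std::size_t i, double v) {
        assert(i < pending_.size());
        pending_[i] = v;
    }

    // Value another rank would observe: the state as of the last barrier.
    double read(std::size_t i) const {
        assert(i < visible_.size());
        return visible_[i];
    }

    // Superstep boundary: publish every staged write.
    void barrier() { visible_ = pending_; }

private:
    std::vector<double> visible_;
    std::vector<double> pending_;
};
```

Writing 42.0 to slot 0 leaves `read(0)` at its old value until `barrier()` is called, after which the new value is returned.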

sequential_consistent

Strongest consistency (planned):

dtl::distributed_vector<double, ..., dtl::sequential_consistent> vec(1000, size, rank);

// All operations appear in a single global order
// Higher synchronization overhead

release_acquire

Release/acquire ordering, modeled on the C++ memory model (planned):

// Writes in release-ordered operations visible to acquire-ordered readers

relaxed

Minimal ordering (planned):

// Only atomicity guaranteed, no ordering
// Maximum performance, complex reasoning

Error Policies

Error policies determine how errors are reported.

expected (Default)

Result-based error handling:

dtl::distributed_vector<double, ..., dtl::expected> vec(1000, size, rank);

auto global = vec.global_view();
auto result = global[500].get();

if (result.has_value()) {
    double val = result.value();
} else {
    auto error = result.error();
    // Handle error
}

Properties:

No exceptions
Explicit error checking
Compile-time enforced handling

throwing

Exception-based error handling:

dtl::distributed_vector<double, ..., dtl::throwing> vec(1000, size, rank);

try {
    auto global = vec.global_view();
    double val = global[500].get();  // Throws on error
} catch (const dtl::communication_error& e) {
    // Handle error
}

Properties:

Familiar exception patterns
Automatic propagation
Cannot be ignored
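One way an error policy can switch a single accessor between result-based and exception-based reporting is sketched below. All names (`result`, `access_error`, `expected_policy`, `throwing_policy`, `checked_get`) are assumptions made for this illustration and are not DTL's types; the result type is a small C++17 stand-in built on `std::variant`.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>
#include <variant>
#include <vector>

struct access_error { std::string message; };

// Minimal expected-like result: holds either a value or an error.
template <class T>
class result {
public:
    result(T value) : state_(std::move(value)) {}
    result(access_error err) : state_(std::move(err)) {}
    bool has_value() const { return std::holds_alternative<T>(state_); }
    const T& value() const { assert(has_value()); return std::get<T>(state_); }
    const access_error& error() const { assert(!has_value()); return std::get<access_error>(state_); }
private:
    std::variant<T, access_error> state_;
};

// Result-based policy: failures travel in the return value.
struct expected_policy {
    template <class T> using return_type = result<T>;
    template <class T> static result<T> success(T v) { return result<T>(std::move(v)); }
    template <class T> static result<T> failure(access_error e) { return result<T>(std::move(e)); }
};

// Exception-based policy: failures are thrown.
struct throwing_policy {
    template <class T> using return_type = T;
    template <class T> static T success(T v) { return v; }
    template <class T> [[noreturn]] static T failure(access_error e) {
        throw std::runtime_error(e.message);
    }
};

// One accessor; the policy decides both the return type and the failure path.
template <class ErrorPolicy>
typename ErrorPolicy::template return_type<double>
checked_get(const std::vector<double>& data, std::size_t i) {
    if (i >= data.size()) {
        return ErrorPolicy::template failure<double>(access_error{"index out of range"});
    }
    return ErrorPolicy::template success<double>(data[i]);
}
```

`checked_get<expected_policy>(v, i)` returns a `result<double>` that the caller inspects, while `checked_get<throwing_policy>(v, i)` returns a plain `double` and throws on failure. The policy is chosen at compile time, so neither style pays for the other.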

Policy Composition

Using policy_set

Combine policies into a single set:

using my_policies = dtl::policy_set<
    dtl::block_partition<>,
    dtl::host_only,
    dtl::par,
    dtl::bulk_synchronous,
    dtl::expected
>;

dtl::distributed_vector<double, my_policies> vec(1000, size, rank);

Partial Specification

Unspecified axes use defaults:

// Only specify partition, others use defaults
dtl::distributed_vector<double, dtl::cyclic_partition<>> vec(1000, size, rank);
// Equivalent to:
// dtl::distributed_vector<double, dtl::cyclic_partition<>, dtl::host_only, dtl::seq, dtl::bulk_synchronous, dtl::expected>
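Default filling of this kind is commonly done by tagging each policy with its axis and searching the parameter pack per axis. The sketch below shows that technique for two axes only; everything in namespace `sketch` is a stand-in written for this example and is not DTL's actual `policy_set`.

```cpp
#include <cassert>
#include <type_traits>

namespace sketch {

// Axis tags: every policy names the axis it belongs to.
struct partition_axis {};
struct execution_axis {};

struct block_partition  { using axis = partition_axis; };
struct cyclic_partition { using axis = partition_axis; };
struct seq { using axis = execution_axis; };
struct par { using axis = execution_axis; };

// find_policy<Axis, Default, Ps...>::type is the first policy in Ps that
// belongs to Axis, or Default when the pack has none.
template <class Axis, class Default, class... Ps>
struct find_policy { using type = Default; };

template <class Axis, class Default, class P, class... Rest>
struct find_policy<Axis, Default, P, Rest...> {
    using type = std::conditional_t<
        std::is_same_v<typename P::axis, Axis>,
        P,
        typename find_policy<Axis, Default, Rest...>::type>;
};

// Each axis is resolved independently; unspecified axes fall back to defaults.
template <class... Ps>
struct policy_set {
    using partition = typename find_policy<partition_axis, block_partition, Ps...>::type;
    using execution = typename find_policy<execution_axis, seq, Ps...>::type;
};

static_assert(std::is_same_v<policy_set<cyclic_partition>::partition, cyclic_partition>);
static_assert(std::is_same_v<policy_set<cyclic_partition>::execution, seq>);
static_assert(std::is_same_v<policy_set<>::partition, block_partition>);

}  // namespace sketch
```

Because the lookup is keyed by axis rather than by position, the policies can be listed in any order.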

Call-Site Override

Override policies per-operation:

dtl::distributed_vector<double> vec(1000, size, rank);  // Default policies

// Override execution policy for this call
dtl::for_each(dtl::par, vec, [](double& x) { x *= 2; });

// Override policies on several axes (at most one policy per axis)
dtl::for_each(
    dtl::policy_set<dtl::par, dtl::throwing>{},
    vec,
    [](double& x) { x *= 2; }
);

Policy Precedence

When multiple policy sources exist, precedence is:

Call-site policy_set (highest priority)
Container-level defaults
Context default policy_set
Library defaults (lowest priority)

// Context with default parallel execution
auto ctx = dtl::context(dtl::policy_set<dtl::par>{});

// Container uses context default (par)
dtl::distributed_vector<double> vec(ctx, 1000, size, rank);

// Operation uses container default (par)
dtl::for_each(vec, func);  // Parallel execution

// Call-site override beats all
dtl::for_each(dtl::seq, vec, func);  // Sequential execution
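The precedence chain amounts to "first source that specifies the axis wins". The runtime model below illustrates that rule for the execution axis only; DTL resolves policies through its type system, and `policy_sources`, `execution`, and `resolve` are hypothetical names.

```cpp
#include <cassert>
#include <optional>

enum class execution { seq, par, par_unseq, async };

// One optional setting per source, listed from highest to lowest priority.
struct policy_sources {
    std::optional<execution> call_site;
    std::optional<execution> container;
    std::optional<execution> context;
};

constexpr execution library_default = execution::seq;

// Returns the setting from the highest-priority source that provides one.
execution resolve(const policy_sources& s) {
    if (s.call_site) return *s.call_site;
    if (s.container) return *s.container;
    if (s.context)   return *s.context;
    return library_default;
}
```

With only the context set to `par`, `resolve` returns `par`; adding a call-site value of `seq` makes it return `seq`, mirroring the two `for_each` calls above.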

Conflict Detection

Conflicting policies at the same level cause errors:

// COMPILE ERROR: two partition policies
dtl::distributed_vector<double,
    dtl::policy_set<dtl::block_partition<>, dtl::cyclic_partition<>>
> vec(1000, size, rank);
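A compile-time check of this kind can be written by counting how many policies in the set claim each axis and rejecting any count above one. The sketch below is an assumption about how such a check could look, with stand-in types in namespace `sketch`; it is not DTL's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

namespace sketch {

struct partition_axis {};
struct execution_axis {};

struct block_partition  { using axis = partition_axis; };
struct cyclic_partition { using axis = partition_axis; };
struct par { using axis = execution_axis; };

// Number of policies in Ps that belong to Axis (fold expression, C++17).
template <class Axis, class... Ps>
constexpr std::size_t count_on_axis =
    (std::size_t{0} + ... + (std::is_same_v<typename Ps::axis, Axis> ? 1 : 0));

// True when no axis is claimed by more than one policy.
template <class... Ps>
constexpr bool is_conflict_free =
    (count_on_axis<partition_axis, Ps...> <= 1) &&
    (count_on_axis<execution_axis, Ps...> <= 1);

// Instantiating a conflicting set fails to compile with a readable message.
template <class... Ps>
struct policy_set {
    static_assert(is_conflict_free<Ps...>, "at most one policy per axis");
};

}  // namespace sketch
```

`sketch::policy_set<sketch::block_partition, sketch::par>` instantiates cleanly, whereas `sketch::policy_set<sketch::block_partition, sketch::cyclic_partition>` triggers the `static_assert`.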

Common Policy Combinations

High-Performance Computing

using hpc_policies = dtl::policy_set<
    dtl::block_partition<>,
    dtl::host_only,
    dtl::par,
    dtl::bulk_synchronous,
    dtl::expected
>;

GPU Accelerated

using gpu_policies = dtl::policy_set<
    dtl::block_partition<>,
    dtl::device_only<0>,
    dtl::par,
    dtl::bulk_synchronous,
    dtl::expected
>;

Development/Debugging

using debug_policies = dtl::policy_set<
    dtl::block_partition<>,
    dtl::host_only,
    dtl::seq,           // Sequential for easier debugging
    dtl::bulk_synchronous,
    dtl::throwing       // Exceptions for stack traces
>;

Maximum Throughput

using throughput_policies = dtl::policy_set<
    dtl::block_partition<>,
    dtl::device_preferred,
    dtl::par_unseq,
    dtl::bulk_synchronous,
    dtl::expected
>;
