Legacy Deep-Dive: Policies
This page is retained as a detailed reference. The canonical user path is now the chaptered handbook.
Primary chapter: 06-policies-and-execution-control.md
Runtime and handles: Runtime and Handle Model
Detailed Reference (Legacy)
DTL uses a policy-based design that separates concerns into orthogonal configuration axes. This allows flexible, compile-time configuration of distributed behavior.
Overview
Distributed programming entangles multiple concerns:
How data is partitioned across ranks
Where data resides (host vs device memory)
How operations execute (sync vs async)
When writes become visible
How errors are handled
DTL separates these into five orthogonal policy axes, allowing you to configure each independently.
Why Policies?
// Without policies: hardcoded behavior
distributed_vector<double> vec(1000); // What partition? What memory? What error handling?
// With policies: explicit, configurable behavior
distributed_vector<double, block_partition<>, host_only> vec(1000);
// Or use policy_set to compose several policies at compile time
The Five Policy Axes
| Axis | Question | Default |
|---|---|---|
| Partition | How is data divided across ranks? | block_partition |
| Placement | Where does data live (host/device)? | host_only |
| Execution | How do operations execute? | seq |
| Consistency | When are writes visible? | bulk_synchronous |
| Error | How are errors reported? | expected |
Partition Policies
Partition policies determine how global indices map to ranks.
block_partition (Default)
Divides data into contiguous chunks:
// 1000 elements across 4 ranks:
// Rank 0: indices [0, 250)
// Rank 1: indices [250, 500)
// Rank 2: indices [500, 750)
// Rank 3: indices [750, 1000)
dtl::distributed_vector<double, dtl::block_partition<>> vec(1000, size, rank);
// Block partition is the default
dtl::distributed_vector<double> vec_default(1000, size, rank); // Same as above
Properties:
Contiguous local storage
Good cache locality
Simple ownership queries
Best for sequential access patterns
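The index-to-rank math behind a block partition can be sketched as follows. This is an illustrative helper, not DTL's actual implementation; in particular, the remainder handling (first ranks get one extra element when the size does not divide evenly) is an assumption.

```cpp
#include <cassert>
#include <cstddef>

// Sketch of block-partition ownership (hypothetical helper, not DTL code).
// When n % nranks != 0, the first `rem` ranks each hold one extra element.
std::size_t block_owner(std::size_t index, std::size_t n, std::size_t nranks) {
    std::size_t base = n / nranks;       // minimum elements per rank
    std::size_t rem  = n % nranks;       // ranks 0..rem-1 get base + 1
    std::size_t cut  = rem * (base + 1); // first index owned by a base-sized rank
    if (index < cut) return index / (base + 1);
    return rem + (index - cut) / base;
}
```

With 1000 elements on 4 ranks this reproduces the ranges shown above: indices [0, 250) map to rank 0, [250, 500) to rank 1, and so on.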
cyclic_partition
Round-robin element distribution (planned):
// 1000 elements across 4 ranks:
// Rank 0: indices 0, 4, 8, 12, ...
// Rank 1: indices 1, 5, 9, 13, ...
// Rank 2: indices 2, 6, 10, 14, ...
// Rank 3: indices 3, 7, 11, 15, ...
dtl::distributed_vector<double, dtl::cyclic_partition<>> vec(1000, size, rank);
Properties:
Better load balancing for irregular access
Non-contiguous local storage
Higher overhead for sequential access
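Since this policy is still planned, the mapping below is only a sketch of the intended round-robin semantics, not DTL code:

```cpp
#include <cassert>
#include <cstddef>

// Illustrative cyclic mapping (assumed semantics of the planned policy).
std::size_t cyclic_owner(std::size_t index, std::size_t nranks) {
    return index % nranks;   // rank that owns the global index
}

std::size_t cyclic_local_index(std::size_t index, std::size_t nranks) {
    return index / nranks;   // position within the owner's local storage
}
```

For example, with 4 ranks global index 13 lands on rank 1 (its fourth local element), matching the listing above.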
block_cyclic_partition
Combines block and cyclic (planned):
// Block size 64, cyclic distribution:
// Rank 0: indices [0,64), [256,320), ...
// Rank 1: indices [64,128), [320,384), ...
// etc.
dtl::distributed_vector<double, dtl::block_cyclic_partition<64>> vec(1000, size, rank);
Properties:
Balance between locality and load balancing
Standard in scientific computing (ScaLAPACK)
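The ownership rule generalizes both previous schemes: blocks of size B are dealt round-robin to ranks. A minimal sketch, mirroring the ScaLAPACK convention (again a hypothetical helper, not DTL's implementation):

```cpp
#include <cassert>
#include <cstddef>

// Sketch of block-cyclic ownership with block size B: find the block the
// index falls in, then deal blocks round-robin across ranks.
std::size_t block_cyclic_owner(std::size_t index, std::size_t B, std::size_t nranks) {
    return (index / B) % nranks;
}
```

With B = 64 and 4 ranks this reproduces the ranges above: indices [0, 64) and [256, 320) belong to rank 0, [64, 128) and [320, 384) to rank 1.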
hash_partition
Hash-based distribution (for associative containers):
// Elements distributed by hash of key
dtl::distributed_unordered_map<std::string, int, dtl::hash_partition<>> map(size, rank);
// Custom hash function
dtl::distributed_unordered_map<Key, Value, dtl::hash_partition<MyHash>> map(size, rank);
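The placement rule behind hash partitioning can be sketched with standard C++: the owning rank is derived purely from the key's hash, so any rank can locate a key without a directory lookup. This is an assumed mechanism, not DTL's actual code:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Sketch of hash-based ownership: same key always maps to the same rank.
template <class Key, class Hash = std::hash<Key>>
std::size_t hash_owner(const Key& key, std::size_t nranks, Hash h = {}) {
    return h(key) % nranks;
}
```

A custom hasher (like MyHash above) simply replaces the default `std::hash<Key>`.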
replicated
Full copy on each rank:
// Every rank has complete copy
dtl::distributed_vector<double, dtl::replicated> lookup_table(1000, size, rank);
Properties:
No communication for reads
Writes require synchronization
Memory scales with rank count
Placement Policies
Placement policies determine where data resides physically.
host_only (Default)
Data resides in host (CPU) memory:
dtl::distributed_vector<double, dtl::block_partition<>, dtl::host_only> vec(1000, size, rank);
// host_only is the default
dtl::distributed_vector<double> vec_default(1000, size, rank); // Same as above
Properties:
Universal compatibility
No GPU required
Standard allocators
device_only
Data resides in device (GPU) memory:
// Requires DTL_ENABLE_CUDA or DTL_ENABLE_HIP
dtl::distributed_vector<double, dtl::block_partition<>, dtl::device_only<0>> vec(1000, size, rank);
// Access requires GPU kernels or explicit transfer
auto local = vec.local_view(); // Returns device pointer
Properties:
Data stays on GPU
Host access requires transfer
Best for GPU-only workflows
device_preferred
Prefers device memory with automatic fallback:
dtl::distributed_vector<double, dtl::block_partition<>, dtl::device_preferred> vec(1000, size, rank);
// Uses GPU memory if available, host memory otherwise
unified_memory
CUDA Unified Memory (managed memory):
dtl::distributed_vector<double, dtl::block_partition<>, dtl::unified_memory> vec(1000, size, rank);
// Accessible from both host and device
// Automatic page migration
Properties:
Convenience for mixed host/device access
Performance overhead from page-fault-driven migration
Requires CUDA unified memory support
Execution Policies
Execution policies control how operations are performed.
seq (Default)
Synchronous, blocking execution:
// Operation completes before returning
dtl::for_each(dtl::seq, vec, [](double& x) { x *= 2; });
// seq is the default
dtl::for_each(vec, [](double& x) { x *= 2; }); // Same as above
Properties:
Simple to reason about
Deterministic completion
No concurrent execution
par
Parallel execution (blocking):
// Uses multiple threads, but still blocks until complete
dtl::for_each(dtl::par, vec, [](double& x) { x *= 2; });
Properties:
Utilizes multiple CPU cores
Still blocks caller
Thread-safe functor required
par_unseq
Parallel and vectorized (blocking):
// Enables SIMD and multi-threading
dtl::for_each(dtl::par_unseq, vec, [](double& x) { x *= 2; });
Properties:
Maximum CPU parallelism
Functor must be vectorization-safe
No synchronization in functor
async
Non-blocking execution:
// Returns immediately with a future
auto future = dtl::for_each(dtl::async, vec, [](double& x) { x *= 2; });
// Do other work...
// Wait for completion
future.wait();
Properties:
Enables overlap of computation and communication
Returns future/event handle
Requires explicit synchronization
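The same launch-then-synchronize pattern can be shown with standard C++, using std::async as a stand-in for the async policy (none of this is DTL code; async_sum is a hypothetical example):

```cpp
#include <cassert>
#include <future>
#include <numeric>
#include <vector>

// Standard-C++ analogue of the async execution policy: the call returns a
// future immediately, the caller overlaps other work, then synchronizes.
double async_sum(const std::vector<double>& v) {
    auto fut = std::async(std::launch::async, [&v] {
        return std::accumulate(v.begin(), v.end(), 0.0);  // runs concurrently
    });
    // ... caller could do unrelated work here, overlapping the reduction ...
    return fut.get();  // explicit synchronization, like future.wait()
}
```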
Usage with Algorithms
// Transform with parallel execution
dtl::transform(dtl::par, vec, output, [](double x) { return x * x; });
// Reduce with async execution
auto future = dtl::reduce(dtl::async, vec, 0.0, std::plus<>{});
// ... do other work ...
double result = future.get();
Consistency Policies
Consistency policies define when writes become visible to other ranks.
bulk_synchronous (Default)
BSP model with explicit barriers:
// Writes not visible until barrier
dtl::distributed_vector<double, ..., dtl::bulk_synchronous> vec(1000, size, rank);
auto local = vec.local_view();
local[0] = 42.0; // Local write
// Writes become visible after barrier
vec.barrier();
Properties:
Clear synchronization points
Simple reasoning about visibility
Standard HPC model
sequential_consistent
Strongest consistency (planned):
dtl::distributed_vector<double, ..., dtl::sequential_consistent> vec(1000, size, rank);
// All operations appear in a single global order
// Higher synchronization overhead
release_acquire
C++ memory model consistency (planned):
// Writes made before a release operation become visible to readers after a matching acquire
relaxed
Minimal ordering (planned):
// Only atomicity guaranteed, no ordering
// Maximum performance, complex reasoning
Error Policies
Error policies determine how errors are reported.
expected (Default)
Result-based error handling:
dtl::distributed_vector<double, ..., dtl::expected> vec(1000, size, rank);
auto global = vec.global_view();
auto result = global[500].get();
if (result.has_value()) {
  double val = result.value();
} else {
  auto error = result.error();
  // Handle error
}
Properties:
No exceptions
Explicit error checking
Compile-time enforced handling
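The result-type pattern can be sketched in a few lines of standard C++. The class below is a deliberately minimal stand-in for an std::expected-style type (checked_get and its error string are hypothetical, not DTL API):

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <variant>

// Minimal expected-like result type: holds either a value or an error.
template <class T, class E>
class expected_like {
    std::variant<T, E> v_;
public:
    expected_like(T value) : v_(std::move(value)) {}
    expected_like(E error) : v_(std::move(error)) {}
    bool has_value() const { return v_.index() == 0; }
    const T& value() const { return std::get<0>(v_); }
    const E& error() const { return std::get<1>(v_); }
};

// A remote read either yields the element or an error description.
expected_like<double, std::string> checked_get(bool remote_ok) {
    if (remote_ok) return 42.0;
    return std::string("communication_error: rank unreachable");
}
```

The caller must inspect has_value() before touching the payload, which is what makes the error path explicit rather than exceptional.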
throwing
Exception-based error handling:
dtl::distributed_vector<double, ..., dtl::throwing> vec(1000, size, rank);
try {
  auto global = vec.global_view();
  double val = global[500].get(); // Throws on error
} catch (const dtl::communication_error& e) {
  // Handle error
}
Properties:
Familiar exception patterns
Automatic propagation
Cannot be ignored
Policy Composition
Using policy_set
Combine policies into a single set:
using my_policies = dtl::policy_set<
  dtl::block_partition<>,
  dtl::host_only,
  dtl::par,
  dtl::bulk_synchronous,
  dtl::expected
>;
dtl::distributed_vector<double, my_policies> vec(1000, size, rank);
Partial Specification
Unspecified axes use defaults:
// Only specify partition, others use defaults
dtl::distributed_vector<double, dtl::cyclic_partition<>> vec(1000, size, rank);
// Equivalent to:
// dtl::distributed_vector<double, cyclic_partition<>, host_only, seq, bulk_synchronous, expected>
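One way such default filling could be implemented is a per-axis search over the supplied policies. The metaprogram below is a hypothetical sketch (axis tags and pick_or_default are illustrative names, not DTL internals):

```cpp
#include <type_traits>

// Each policy is tagged with the axis it configures.
struct partition_axis {};
struct placement_axis {};

struct block_partition  { using axis = partition_axis; };
struct cyclic_partition { using axis = partition_axis; };
struct host_only        { using axis = placement_axis; };

// Select the first policy in Ps... whose axis matches, else the default.
template <class Axis, class Default, class... Ps>
struct pick_or_default { using type = Default; };

template <class Axis, class Default, class P, class... Ps>
struct pick_or_default<Axis, Default, P, Ps...> {
    using type = std::conditional_t<
        std::is_same_v<typename P::axis, Axis>,
        P,
        typename pick_or_default<Axis, Default, Ps...>::type>;
};
```

A container would run this selection once per axis, which is how specifying only cyclic_partition<> can still yield host_only, seq, and the other defaults.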
Call-Site Override
Override policies per-operation:
dtl::distributed_vector<double> vec(1000, size, rank); // Default policies
// Override execution policy for this call
dtl::for_each(dtl::par, vec, [](double& x) { x *= 2; });
// Override multiple policies
dtl::for_each(
  dtl::policy_set<dtl::par, dtl::async>{},
  vec,
  [](double& x) { x *= 2; }
);
Policy Precedence
When multiple policy sources exist, precedence is:
Call-site policy_set (highest priority)
Container-level defaults
Context default policy_set
Library defaults (lowest priority)
// Context with default parallel execution
auto ctx = dtl::context(dtl::policy_set<dtl::par>{});
// Container uses context default (par)
dtl::distributed_vector<double> vec(ctx, 1000, size, rank);
// Operation uses container default (par)
dtl::for_each(vec, func); // Parallel execution
// Call-site override beats all
dtl::for_each(dtl::seq, vec, func); // Sequential execution
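The four-level resolution order can be modeled as a first-match chain. A runtime sketch of the rule described above (resolve and the exec enum are illustrative, not DTL API):

```cpp
#include <cassert>
#include <optional>

enum class exec { seq, par };

// First present source wins: call-site, then container, then context,
// then the library default.
exec resolve(std::optional<exec> call_site,
             std::optional<exec> container,
             std::optional<exec> context,
             exec library_default = exec::seq) {
    if (call_site) return *call_site;
    if (container) return *container;
    if (context)   return *context;
    return library_default;
}
```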
Conflict Detection
Conflicting policies at the same level cause errors:
// COMPILE ERROR: two partition policies
dtl::distributed_vector<double,
  dtl::policy_set<dtl::block_partition<>, dtl::cyclic_partition<>>
> vec(1000, size, rank);
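One way such a check could be expressed is a compile-time count of policies per axis, rejected by a static_assert. This is an assumed mechanism, not DTL's actual code:

```cpp
#include <cstddef>
#include <type_traits>

struct partition_axis {};
struct block_partition  { using axis = partition_axis; };
struct cyclic_partition { using axis = partition_axis; };

// Count the supplied policies that configure a given axis.
template <class Axis, class... Ps>
constexpr std::size_t axis_count =
    ((std::is_same_v<typename Ps::axis, Axis> ? std::size_t{1} : std::size_t{0})
     + ... + std::size_t{0});

// Instantiating this with two partition policies fails to compile.
template <class... Ps>
struct checked_policy_set {
    static_assert(axis_count<partition_axis, Ps...> <= 1,
                  "conflicting partition policies in policy_set");
};
```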
Common Policy Combinations
High-Performance Computing (Default)
using hpc_policies = dtl::policy_set<
  dtl::block_partition<>,
  dtl::host_only,
  dtl::par,
  dtl::bulk_synchronous,
  dtl::expected
>;
GPU Accelerated
using gpu_policies = dtl::policy_set<
  dtl::block_partition<>,
  dtl::device_only<0>,
  dtl::par,
  dtl::bulk_synchronous,
  dtl::expected
>;
Development/Debugging
using debug_policies = dtl::policy_set<
  dtl::block_partition<>,
  dtl::host_only,
  dtl::seq,            // Sequential for easier debugging
  dtl::bulk_synchronous,
  dtl::throwing        // Exceptions for stack traces
>;
Maximum Throughput
using throughput_policies = dtl::policy_set<
  dtl::block_partition<>,
  dtl::device_preferred,
  dtl::par_unseq,
  dtl::bulk_synchronous,
  dtl::expected
>;
See Also
Containers Guide - Using containers with policies
Algorithms Guide - Algorithm execution policies
Error Handling Guide - Error policy details