Legacy Deep-Dive: Error Handling
This page is retained as a detailed reference. The canonical user path is now the chaptered handbook.
Primary chapter: 09-error-handling-and-reliability.md
Runtime and handles: Runtime and Handle Model
Detailed Reference (Legacy)
DTL provides two error handling modes: result-based (default) and exception-based. This guide covers both approaches and how to handle distributed errors.
Overview
Distributed systems introduce error scenarios not present in single-process code:
Communication failures
Rank failures
Collective operation mismatches
Structural invalidation across ranks
DTL handles these through configurable error policies.
Error Policy Selection
// Result-based (default)
dtl::distributed_vector<double, ..., dtl::expected> vec(1000, size, rank);
// Exception-based
dtl::distributed_vector<double, ..., dtl::throwing> vec(1000, size, rank);
Error Categories
DTL defines standard error codes for all operations:
| Error Code | Description |
|---|---|
| ok | Operation succeeded |
| invalid_argument | Invalid parameter |
| out_of_range | Index out of bounds |
| | Operation not supported |
| | Backend (MPI, CUDA) error |
| collective_failure | Collective operation failed |
| | Serialization failed |
| | Consistency policy violated |
| structural_invalidation | View/iterator invalidated |
| timeout | Operation timed out |
| | Operation canceled |
Checking Error Codes
dtl::error err = some_operation();
switch (err.code()) {
case dtl::error_code::ok:
// Success
break;
case dtl::error_code::out_of_range:
// Handle bounds error
break;
case dtl::error_code::collective_failure:
// Handle distributed failure
break;
default:
// Handle other errors
break;
}
Result-Based Error Handling
The default mode uses result<T> (similar to std::expected).
Basic Usage
dtl::distributed_vector<double> vec(1000, size, rank); // default: expected policy
auto global = vec.global_view();
// get() returns result<T>
auto result = global[500].get();
if (result.has_value()) {
double val = result.value();
std::cout << "Value: " << val << "\n";
} else {
dtl::error err = result.error();
std::cerr << "Error: " << err.message() << "\n";
}
Result Type API
dtl::result<T> result = operation();
// Check success
if (result) { ... } // Explicit bool conversion
if (result.has_value()) { ... }
// Access value (precondition: has_value())
T val = result.value();
T val = *result; // Same as value()
// Access value with default
T val = result.value_or(default_value);
// Access error (precondition: !has_value())
dtl::error err = result.error();
Chaining Operations
auto result = global[idx].get()
.and_then([](double x) -> dtl::result<double> {
return x * 2.0;
})
.or_else([](dtl::error e) -> dtl::result<double> {
std::cerr << "Error: " << e.message() << "\n";
return 0.0; // Default value
});
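To make the short-circuit behavior of the chain concrete, here is a minimal, self-contained sketch. The `failure` struct and the `result` class below are hypothetical stand-ins written for this illustration, not DTL's actual definitions; only the `and_then`/`or_else` semantics are meant to carry over.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <utility>
#include <variant>

// Hypothetical error payload; dtl::error carries more than a message.
struct failure {
    std::string message;
};

// Hypothetical, simplified stand-in for dtl::result<T>.
template <typename T>
class result {
    std::variant<T, failure> state_;

public:
    result(T value) : state_(std::move(value)) {}
    result(failure f) : state_(std::move(f)) {}

    bool has_value() const { return std::holds_alternative<T>(state_); }

    const T& value() const {
        assert(has_value());  // precondition: a value is present
        return std::get<T>(state_);
    }
    const failure& error() const {
        assert(!has_value());  // precondition: an error is present
        return std::get<failure>(state_);
    }

    // Runs f on the value; an existing error skips f and passes through.
    template <typename F>
    auto and_then(F&& f) const -> std::invoke_result_t<F, const T&> {
        if (has_value()) return std::forward<F>(f)(value());
        return error();
    }

    // Runs f on the error; an existing value skips f and passes through.
    template <typename F>
    result or_else(F&& f) const {
        if (has_value()) return *this;
        return std::forward<F>(f)(error());
    }
};

// Doubles the input, or falls back to 0.0 if the input already holds an error.
inline double doubled_or_zero(const result<double>& input) {
    return input
        .and_then([](double x) -> result<double> { return x * 2.0; })
        .or_else([](const failure&) -> result<double> { return 0.0; })
        .value();
}
```

With this stand-in, a successful input flows through `and_then` and skips `or_else`, while a failed input skips `and_then` and is replaced by the fallback in `or_else`.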
Void Results
Operations that don’t return a value use result<void>:
dtl::result<void> result = global[500].put(42.0);
if (!result) {
std::cerr << "Put failed: " << result.error().message() << "\n";
}
Exception-Based Error Handling
Enable exceptions with the throwing error policy.
Configuration
// Container with throwing policy
dtl::distributed_vector<double, dtl::block_partition<>, dtl::host_only,
dtl::seq, dtl::bulk_synchronous, dtl::throwing> vec(1000, size, rank);
Basic Usage
try {
auto global = vec.global_view();
double val = global[500].get(); // Throws on error
global[500].put(42.0); // Throws on error
} catch (const dtl::dtl_exception& e) {
std::cerr << "DTL error: " << e.what() << "\n";
std::cerr << "Error code: " << static_cast<int>(e.code()) << "\n";
}
Exception Hierarchy
std::exception
└── dtl::dtl_exception
├── dtl::invalid_argument_error
├── dtl::out_of_range_error
├── dtl::communication_error
│ └── dtl::collective_error
├── dtl::serialization_error
├── dtl::consistency_error
└── dtl::invalidation_error
Catching Specific Exceptions
try {
// ... operations ...
} catch (const dtl::collective_error& e) {
// Handle collective failure specifically
std::cerr << "Collective failed: " << e.what() << "\n";
// e.failing_ranks() may be available
} catch (const dtl::communication_error& e) {
// Handle any communication error
std::cerr << "Communication error: " << e.what() << "\n";
} catch (const dtl::dtl_exception& e) {
// Handle any DTL error
std::cerr << "DTL error: " << e.what() << "\n";
}
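The handler order above matters because C++ picks the first matching catch clause, so derived types must be listed before their bases. The sketch below shows this with hypothetical stand-in classes that mirror part of the hierarchy; they are not the DTL types, and `handled_by` is a helper invented for the illustration.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins mirroring part of the DTL exception hierarchy.
struct dtl_exception : std::runtime_error {
    explicit dtl_exception(const std::string& msg) : std::runtime_error(msg) {}
};
struct communication_error : dtl_exception {
    explicit communication_error(const std::string& msg) : dtl_exception(msg) {}
};
struct collective_error : communication_error {
    explicit collective_error(const std::string& msg) : communication_error(msg) {}
};
struct out_of_range_error : dtl_exception {
    explicit out_of_range_error(const std::string& msg) : dtl_exception(msg) {}
};

// Runs op and reports which handler caught its exception.
// Handlers are ordered from most derived to least derived.
template <typename Op>
std::string handled_by(Op&& op) {
    try {
        op();
    } catch (const collective_error&) {
        return "collective_error";
    } catch (const communication_error&) {
        return "communication_error";
    } catch (const dtl_exception&) {
        return "dtl_exception";
    }
    return "none";
}

// Self-check: each exception reaches the most specific handler available.
inline void check_dispatch() {
    const std::string a = handled_by([] { throw collective_error("reduce failed"); });
    const std::string b = handled_by([] { throw communication_error("link down"); });
    const std::string c = handled_by([] { throw out_of_range_error("index 500"); });
    assert(a == "collective_error");
    assert(b == "communication_error");
    assert(c == "dtl_exception");
}
```

If the `dtl_exception` handler were listed first, it would catch every case and the more specific handlers would never run.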
Collective Errors
A distributed operation may fail on only a subset of ranks. DTL provides collective_error to aggregate the failure information.
Collective Error Structure
struct collective_error {
error_code code;
std::string message;
// Did any rank fail?
bool any_failed() const;
// Get representative error
error representative_error() const;
// Get failing ranks (if available)
std::optional<std::vector<rank_t>> failing_ranks() const;
};
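As an illustration of what such an aggregate might hold, the sketch below builds a summary from one status per rank. Everything here is an assumption made for the example: `collective_summary`, `summarize`, the `rank_t` alias, the reduced `error_code` enum, and the choice of the lowest failing rank as the representative error. In a real run the per-rank statuses would be gathered through the backend rather than passed in as a vector.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

using rank_t = int;  // assumption: ranks are small non-negative integers

// Hypothetical subset of error codes, enough for the illustration.
enum class error_code { ok, collective_failure, timeout };

// Hypothetical aggregate of per-rank outcomes.
struct collective_summary {
    std::vector<rank_t> failed;                // ranks whose status was not ok
    error_code first_failure = error_code::ok; // status of the lowest failing rank

    bool any_failed() const { return !failed.empty(); }
    error_code representative_error() const { return first_failure; }
    std::optional<std::vector<rank_t>> failing_ranks() const {
        if (failed.empty()) return std::nullopt;
        return failed;
    }
};

// Builds the summary from one status per rank (vector index == rank).
inline collective_summary summarize(const std::vector<error_code>& per_rank) {
    collective_summary s;
    for (std::size_t r = 0; r < per_rank.size(); ++r) {
        if (per_rank[r] == error_code::ok) continue;
        if (s.failed.empty()) s.first_failure = per_rank[r];
        s.failed.push_back(static_cast<rank_t>(r));
    }
    // Invariant: a representative error exists exactly when some rank failed.
    assert(s.any_failed() == (s.first_failure != error_code::ok));
    return s;
}
```

Returning `std::nullopt` when nothing failed keeps "no failures" distinct from "failure details unavailable", which is one plausible reason for the optional return type in the structure above.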
Handling Collective Errors
// Result-based
auto result = dtl::distributed_reduce(vec, 0.0, std::plus<>{});
if (!result) {
auto& err = result.error();
if (err.code() == dtl::error_code::collective_failure) {
// Some ranks failed
if (auto ranks = err.failing_ranks()) {
std::cerr << "Failing ranks: ";
for (auto r : *ranks) std::cerr << r << " ";
std::cerr << "\n";
}
}
}
// Exception-based
try {
double sum = dtl::distributed_reduce(vec, 0.0, std::plus<>{});
} catch (const dtl::collective_error& e) {
std::cerr << "Collective failed: " << e.what() << "\n";
// Access failing rank information
}
Collective Semantics
Every rank must call a collective operation. If only some ranks make the call, they block waiting for the ranks that never arrive:
// WRONG: Only some ranks call reduce
if (rank < 2) {
auto sum = dtl::distributed_reduce(vec, 0.0, std::plus<>{}); // DEADLOCK
}
// CORRECT: All ranks call
auto sum = dtl::distributed_reduce(vec, 0.0, std::plus<>{});
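The participation rule can be modeled without any backend. The toy below is a hypothetical single-process simulation, not DTL or MPI code: a collective "completes" only when every rank has entered it, and the two helper functions compare a rank-guarded call with an unguarded one.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model of a collective: it completes only once every rank has entered.
// Real collectives block inside the backend instead of reporting a flag.
struct toy_collective {
    std::size_t world_size;
    std::vector<bool> arrived;

    explicit toy_collective(std::size_t n) : world_size(n), arrived(n, false) {}

    void enter(std::size_t rank) {
        assert(rank < world_size);  // precondition: rank belongs to this world
        arrived[rank] = true;
    }

    // True when no rank is missing, i.e. nobody would wait forever.
    bool completes() const {
        for (bool a : arrived) {
            if (!a) return false;
        }
        return true;
    }
};

// Guarded call: only ranks below `limit` enter the collective.
inline bool guarded_call_completes(std::size_t world_size, std::size_t limit) {
    toy_collective c(world_size);
    for (std::size_t rank = 0; rank < world_size; ++rank) {
        if (rank < limit) c.enter(rank);
    }
    return c.completes();
}

// Unguarded call: every rank enters; rank-specific work can follow afterwards.
inline bool unguarded_call_completes(std::size_t world_size) {
    toy_collective c(world_size);
    for (std::size_t rank = 0; rank < world_size; ++rank) c.enter(rank);
    return c.completes();
}
```

With four ranks and a guard of `rank < 2`, the guarded variant never completes, which corresponds to the deadlock in the first snippet above. The usual fix is to hoist the collective out of the branch and keep only the rank-specific use of its result inside it.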
Structural Invalidation
Views and iterators are invalidated by structural operations. Using invalidated views produces deterministic errors.
What Causes Invalidation
| Operation | Invalidates Views? |
|---|---|
| resize() | Yes |
| | Yes |
| Element modification | No |
| Obtaining new views | No |
Detection
auto local = vec.local_view();
// ... use local ...
vec.resize(2000); // Structural operation
// local is now invalid
// Result-based:
auto result = local[0]; // Returns error with structural_invalidation
// Exception-based:
try {
auto val = local[0]; // Throws dtl::invalidation_error
} catch (const dtl::invalidation_error& e) {
std::cerr << "View invalidated: " << e.what() << "\n";
}
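One common way to make stale-view access fail deterministically is a generation counter: the container bumps a counter on every structural change, and each view remembers the counter value it was created with. The sketch below illustrates that technique with hypothetical `toy_vector`, `toy_view`, and `invalidation_error` types. It is an assumption about how such detection can work, not a description of DTL's implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for dtl::invalidation_error.
struct invalidation_error : std::runtime_error {
    invalidation_error() : std::runtime_error("view used after structural change") {}
};

class toy_vector;

// A view remembers the container generation it was created from.
class toy_view {
    toy_vector* owner_;
    std::size_t generation_;

public:
    toy_view(toy_vector& owner, std::size_t generation)
        : owner_(&owner), generation_(generation) {}

    bool valid() const;
    double& operator[](std::size_t i);  // throws invalidation_error if stale
};

class toy_vector {
    std::vector<double> data_;
    std::size_t generation_ = 0;
    friend class toy_view;

public:
    explicit toy_vector(std::size_t n) : data_(n, 0.0) {}

    toy_view local_view() { return toy_view(*this, generation_); }

    // Structural operation: bump the generation so older views become stale.
    void resize(std::size_t n) {
        data_.resize(n);
        ++generation_;
    }
};

inline bool toy_view::valid() const {
    return generation_ == owner_->generation_;
}

inline double& toy_view::operator[](std::size_t i) {
    if (!valid()) throw invalidation_error();
    assert(i < owner_->data_.size());  // precondition: index within bounds
    return owner_->data_[i];
}
```

Writing through a view leaves the generation untouched, which matches the table above: element modification does not invalidate views, while `resize` does.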
Safe Pattern
void process(dtl::distributed_vector<double>& vec) {
auto local = vec.local_view();
// Phase 1: Use view
for (double& x : local) {
x *= 2.0;
}
// Phase 2: Structural change
if (needs_resize()) {
vec.resize(new_size);
local = vec.local_view(); // Get fresh view
}
// Phase 3: Continue with valid view
for (double& x : local) {
x += 1.0;
}
}
Best Practices
1. Choose Error Policy Based on Use Case
// For production/library code: result-based
// - Forces explicit error handling
// - No unexpected stack unwinding
dtl::distributed_vector<double, ..., dtl::expected> prod_vec(1000, size, rank);
// For development/debugging: exception-based
// - Stack traces on error
// - Easier to debug
dtl::distributed_vector<double, ..., dtl::throwing> debug_vec(1000, size, rank);
2. Always Handle Collective Errors
auto result = dtl::distributed_reduce(vec, 0.0, std::plus<>{});
// Don't ignore collective failures
if (!result) {
// Log and potentially abort
MPI_Abort(MPI_COMM_WORLD, 1);
}
3. Use RAII for View Management
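// Keeps a container and its current local view together so the view can be
// re-acquired after a structural change.
class view_scope {
    dtl::distributed_vector<double>& vec_;
    decltype(vec_.local_view()) view_;
public:
    explicit view_scope(dtl::distributed_vector<double>& vec)
        : vec_(vec), view_(vec.local_view()) {}
    auto& get() { return view_; }
    // Call after resize() or any other structural operation.
    void refresh() { view_ = vec_.local_view(); }
};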
4. Propagate Errors in Helper Functions
// Result-based
dtl::result<double> compute_sum(const dtl::distributed_vector<double>& vec) {
auto result = dtl::distributed_reduce(vec, 0.0, std::plus<>{});
if (!result) {
return result.error(); // Propagate error
}
return result.value();
}
// Exception-based: errors propagate automatically
double compute_sum(const dtl::distributed_vector<double>& vec) {
return dtl::distributed_reduce(vec, 0.0, std::plus<>{});
}
5. Document Error Conditions
/// @brief Computes global average
/// @param vec Input vector
/// @return Global average value
/// @throws dtl::collective_error If collective operation fails
/// @throws dtl::invalid_argument_error If vector is empty
double global_average(const dtl::distributed_vector<double>& vec);
6. Use Error Policy Consistently
// All containers in a program should typically use the same policy
template<typename T>
using my_vector = dtl::distributed_vector<T, dtl::block_partition<>,
dtl::host_only, dtl::seq,
dtl::bulk_synchronous,
MY_ERROR_POLICY>;
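When two parts of a program cannot share a policy, the conversion is best kept at a single boundary. The sketch below shows one way to translate in both directions. All names are hypothetical: `result` is a plain `std::variant` alias standing in for `dtl::result<T>`, `toy_exception` stands in for `dtl::dtl_exception`, and `value_or_throw` and `capture` are helpers invented for this example.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <utility>
#include <variant>

// Hypothetical error payload and result stand-in.
struct failure {
    std::string message;
};
template <typename T>
using result = std::variant<T, failure>;

// Hypothetical stand-in for dtl::dtl_exception.
struct toy_exception : std::runtime_error {
    explicit toy_exception(const std::string& msg) : std::runtime_error(msg) {}
};

// Result -> exception: return the value or rethrow the error as an exception.
template <typename T>
T value_or_throw(result<T> r) {
    if (auto* f = std::get_if<failure>(&r)) throw toy_exception(f->message);
    return std::get<T>(std::move(r));
}

// Exception -> result: run a throwing callable and capture its failure.
template <typename F>
auto capture(F&& f) -> result<decltype(f())> {
    try {
        return std::forward<F>(f)();
    } catch (const toy_exception& e) {
        return failure{e.what()};
    }
}

// Usage: both directions at a module boundary.
inline void demo_boundaries() {
    const result<double> ok = capture([] { return 3.5; });
    assert(value_or_throw(ok) == 3.5);

    const result<double> bad =
        capture([]() -> double { throw toy_exception("collective failed"); });
    assert(std::holds_alternative<failure>(bad));
}
```

Keeping these two helpers as the only crossing points means the rest of each module sees a single error style, which is the goal of using one policy consistently.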
Error Information
Error Object API
dtl::error err = ...;
// Error code
dtl::error_code code = err.code();
// Human-readable message
std::string msg = err.message();
// Check specific codes
bool is_timeout = (err.code() == dtl::error_code::timeout);
// Optional: backend-specific info
if (auto* mpi_err = err.backend_error<mpi_error>()) {
int mpi_code = mpi_err->code;
}
Custom Error Messages
dtl::error custom_err(dtl::error_code::invalid_argument,
"Custom error message: value out of range");
See Also
Policies Guide - Error policy configuration
Views Guide - View invalidation details