9. Error Handling and Reliability
Error model overview
DTL surfaces errors through status/result patterns and, in some APIs, exception-style or callback-configurable behavior.
C++ guidance
prefer explicit result/status checking in distributed control paths
enrich error context near backend boundaries
avoid suppressing backend or collective participation errors
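The three points above can be sketched as follows. This is a minimal illustration, not DTL's API: the `sketch::Result` and `sketch::Error` types, the `backend_send` function, and the error code 22 are all hypothetical stand-ins, and DTL's real result/status types may look different.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <variant>

// Hypothetical stand-ins; DTL's real result/status types may differ.
namespace sketch {

struct Error {
    int code;             // illustrative backend code
    std::string context;  // grows as the error crosses layers
};

// Holds either a value or an error; callers must check ok() first.
template <typename T>
class Result {
public:
    Result(T value) : data_(std::move(value)) {}
    Result(Error error) : data_(std::move(error)) {}

    bool ok() const { return std::holds_alternative<T>(data_); }
    const T& value() const { assert(ok()); return std::get<T>(data_); }
    const Error& error() const { assert(!ok()); return std::get<Error>(data_); }

private:
    std::variant<T, Error> data_;
};

// Pretend backend call: fails when the peer rank is out of range.
inline Result<int> backend_send(int peer, int world_size) {
    if (peer < 0 || peer >= world_size) {
        return Error{22, "backend_send: invalid peer"};
    }
    return peer;  // echo the peer on success (illustrative only)
}

// Boundary wrapper: checks the result explicitly and prepends rank
// context instead of suppressing the backend failure.
inline Result<int> send_with_context(int rank, int peer, int world_size) {
    Result<int> r = backend_send(peer, world_size);
    if (!r.ok()) {
        Error e = r.error();
        e.context = "rank " + std::to_string(rank) + ": " + e.context;
        return e;
    }
    return r;
}

}  // namespace sketch
```

The wrapper returns the failure to its caller with the original code intact, so upper layers can still branch on it while logs show which rank hit the problem.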
C ABI guidance
check every returned dtl_status
translate status to logs/messages at call boundaries
treat *_UNAVAILABLE and *_FAILED distinctly
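A small sketch of why the *_UNAVAILABLE and *_FAILED families deserve separate handling. Only those two suffixes come from the guidance above; the enumerator names, the `init_backend` entry point, and the fallback-versus-abort policy are assumptions made for illustration and are not taken from the DTL headers.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-ins for a C-style status API.
namespace sketch {

enum Status {
    STATUS_OK = 0,
    STATUS_INVALID_ARGUMENT,
    STATUS_BACKEND_UNAVAILABLE,  // backend not built in or not present
    STATUS_BACKEND_FAILED        // backend present but broke during use
};

enum class Action { Proceed, Fallback, Abort };

// Pretend C ABI entry point that reports its outcome as a status code.
inline Status init_backend(bool present, bool healthy) {
    if (!present) return STATUS_BACKEND_UNAVAILABLE;
    if (!healthy) return STATUS_BACKEND_FAILED;
    return STATUS_OK;
}

// Status-to-message translation, done once at the call boundary.
inline std::string message(Status s) {
    switch (s) {
        case STATUS_OK:                  return "ok";
        case STATUS_INVALID_ARGUMENT:    return "invalid argument";
        case STATUS_BACKEND_UNAVAILABLE: return "backend unavailable";
        case STATUS_BACKEND_FAILED:      return "backend failed";
    }
    return "unknown status";
}

// Assumed policy: an unavailable backend allows a fallback, while a
// failed one (or a caller bug) stops the run.
inline Action decide(Status s) {
    switch (s) {
        case STATUS_OK:                  return Action::Proceed;
        case STATUS_BACKEND_UNAVAILABLE: return Action::Fallback;
        case STATUS_INVALID_ARGUMENT:
        case STATUS_BACKEND_FAILED:      return Action::Abort;
    }
    return Action::Abort;  // unknown codes are treated as fatal
}

// Usage: every returned status is inspected, never discarded.
inline void demo() {
    Status s = init_backend(/*present=*/false, /*healthy=*/true);
    assert(decide(s) == Action::Fallback);
    assert(message(s) == "backend unavailable");
}

}  // namespace sketch
```

Collapsing both families into a single generic error would lose the distinction between "try another backend" and "stop now", which is the reason for keeping them apart.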
Python guidance
catch binding exceptions and preserve actionable context
keep async request failures observable at await/wait boundaries
Reliability practices
validate inputs/handles early
use explicit synchronization points where required
fail fast on collective contract violations
ensure deterministic cleanup paths for partially initialized states
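Early validation and deterministic cleanup can both be illustrated with plain RAII. The sketch below assumes a hypothetical `Context` that owns two resources. None of these names come from DTL, and the event log exists only to make the cleanup order visible.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

namespace sketch {

// Records acquire/release events so the cleanup order can be inspected.
using EventLog = std::vector<std::string>;

// RAII guard for one hypothetical resource (for example a communicator).
class Resource {
public:
    Resource(std::string name, bool should_fail, EventLog& log)
        : name_(std::move(name)), log_(log) {
        assert(!name_.empty());
        if (should_fail) {
            throw std::runtime_error("init failed: " + name_);
        }
        log_.push_back("acquire " + name_);
    }
    ~Resource() { log_.push_back("release " + name_); }
    Resource(const Resource&) = delete;
    Resource& operator=(const Resource&) = delete;

private:
    std::string name_;
    EventLog& log_;
};

// Owns two resources. If the second constructor throws, the first member
// is destroyed automatically, so no partially initialized state leaks.
class Context {
public:
    Context(int rank, bool fail_second, EventLog& log)
        : comm_(validated_name(rank, "comm"), false, log),
          buffer_(validated_name(rank, "buffer"), fail_second, log) {}

private:
    // Input validation happens before anything is acquired.
    static std::string validated_name(int rank, const char* what) {
        if (rank < 0) throw std::invalid_argument("negative rank");
        return std::string(what) + "@" + std::to_string(rank);
    }
    Resource comm_;
    Resource buffer_;
};

// Runs one initialization attempt and returns the recorded events.
inline EventLog try_init(int rank, bool fail_second) {
    EventLog log;
    try {
        Context ctx(rank, fail_second, log);
        log.push_back("ready");
    } catch (const std::exception& e) {
        log.push_back(std::string("error: ") + e.what());
    }
    return log;
}

}  // namespace sketch
```

When the second resource fails, the log shows the first resource acquired and then released before the error is reported. On success, members are released in reverse order of construction.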
Failure categories to handle explicitly
invalid argument / null pointer
backend unavailable or failed initialization
communication/collective failure
timeout/cancellation paths (where applicable)
Operational recommendations
centralize status-to-message translation in application adapters
record rank/context identifiers in logs
include backend capability snapshot in startup diagnostics
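One possible shape for such an application adapter is sketched below. The `Status` enumerators mirror the failure categories listed earlier in this chapter, but their names, the message strings, and the `Capabilities` flags are assumptions for illustration, not DTL definitions.

```cpp
#include <cassert>
#include <sstream>
#include <string>

namespace sketch {

// Hypothetical application-level status, one value per failure category.
enum class Status {
    Ok,
    InvalidArgument,
    BackendUnavailable,
    CommFailure,
    Timeout
};

// Hypothetical capability flags gathered once at startup.
struct Capabilities {
    bool network_backend;
    bool gpu_backend;
};

// Single place where statuses become human-readable text.
inline const char* to_message(Status s) {
    switch (s) {
        case Status::Ok:                 return "ok";
        case Status::InvalidArgument:    return "invalid argument or null pointer";
        case Status::BackendUnavailable: return "backend unavailable or failed initialization";
        case Status::CommFailure:        return "communication or collective failure";
        case Status::Timeout:            return "timeout or cancellation";
    }
    return "unknown status";
}

// Every log line carries the rank and the call site.
inline std::string format_log(int rank, const std::string& where, Status s) {
    assert(rank >= 0);
    std::ostringstream out;
    out << "[rank " << rank << "] " << where << ": " << to_message(s);
    return out.str();
}

// One-line capability snapshot for startup diagnostics.
inline std::string startup_snapshot(int rank, const Capabilities& c) {
    assert(rank >= 0);
    std::ostringstream out;
    out << "[rank " << rank << "] backends:"
        << " network=" << (c.network_backend ? "yes" : "no")
        << " gpu=" << (c.gpu_backend ? "yes" : "no");
    return out.str();
}

}  // namespace sketch
```

Keeping the translation in one function means a new status value needs a message in exactly one place, and the rank prefix makes interleaved multi-process logs attributable.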
Next step
Proceed to Chapter 10.