9. Error Handling and Reliability

Error model overview

DTL reports errors primarily through status and result return values. Some APIs additionally use exception-style reporting or let the caller configure error behavior through a callback.

C++ guidance

prefer explicit result/status checking in distributed control paths
enrich error context near backend boundaries
avoid suppressing backend or collective participation errors
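
The three points above can be sketched in a few lines. This is a minimal illustration, not DTL's API: Status, Error, Result, backend_reduce, and checked_reduce are hypothetical names invented for the example, and DTL's real result types may look different. The wrapper checks the result explicitly, prefixes the error with the rank at the backend boundary, and returns the failure to the caller instead of dropping it.

```cpp
#include <optional>
#include <string>

// Hypothetical types for illustration only; DTL's real status/result
// types may differ.
enum class Status { ok, backend_failed, collective_failed };

struct Error {
    Status code;
    std::string context;  // grows as the error crosses layers
};

template <typename T>
struct Result {
    std::optional<T> value;
    std::optional<Error> error;
    bool ok() const { return value.has_value(); }
};

// Stand-in for a backend call that can fail.
Result<int> backend_reduce(int local, bool simulate_failure) {
    if (simulate_failure) {
        return {std::nullopt, Error{Status::backend_failed, "reduce"}};
    }
    return {local * 2, std::nullopt};
}

// Wrapper at the backend boundary: checks the result explicitly and
// adds rank context instead of suppressing the failure.
Result<int> checked_reduce(int local, int rank, bool simulate_failure) {
    Result<int> r = backend_reduce(local, simulate_failure);
    if (!r.ok()) {
        r.error->context =
            "rank " + std::to_string(rank) + ": " + r.error->context;
    }
    return r;  // the caller still sees the original status code
}
```

A caller would branch on ok() before touching value, and would log or propagate error->context otherwise.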

C ABI guidance

check every returned dtl_status
translate status to logs/messages at call boundaries
treat *_UNAVAILABLE and *_FAILED statuses as distinct conditions rather than one generic error
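
As a rough sketch of the guidance above, the following C-style code (written so it also compiles as C++) checks a returned status and handles the two families differently. Everything named demo_* is a stand-in invented for this example; the real dtl_status enumerators live in the DTL C headers and may be named differently. The sketch also assumes that an unavailable status means the backend is not present and a fallback is reasonable, while a failed status means the backend was present but broke. Confirm that reading against the DTL headers before relying on it.

```cpp
#include <cstdio>

// Stand-in for dtl_status; names and values are hypothetical.
typedef enum {
    DEMO_OK = 0,
    DEMO_BACKEND_UNAVAILABLE,
    DEMO_BACKEND_FAILED
} demo_status;

typedef enum { ACTION_CONTINUE, ACTION_FALLBACK, ACTION_ABORT } demo_action;

// Hypothetical init call that reports why it could not start a backend.
demo_status demo_backend_init(int compiled_in, int healthy) {
    if (!compiled_in) return DEMO_BACKEND_UNAVAILABLE;
    if (!healthy) return DEMO_BACKEND_FAILED;
    return DEMO_OK;
}

// Call-boundary handler: every status is checked, logged, and mapped
// to a distinct action.
demo_action handle_init(demo_status st) {
    switch (st) {
    case DEMO_OK:
        return ACTION_CONTINUE;
    case DEMO_BACKEND_UNAVAILABLE:
        std::fprintf(stderr, "backend unavailable, using fallback\n");
        return ACTION_FALLBACK;
    case DEMO_BACKEND_FAILED:
        std::fprintf(stderr, "backend initialization failed\n");
        return ACTION_ABORT;
    }
    return ACTION_ABORT;  // unknown status: treat as fatal
}
```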

Python guidance

catch binding exceptions and preserve actionable context
keep async request failures observable at await/wait boundaries

Reliability practices

validate inputs/handles early
use explicit synchronization points where required
fail fast on collective contract violations
ensure deterministic cleanup paths for partially initialized states
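
One common way to get early validation and deterministic cleanup in C++ is to hold each acquired resource in an owning object, so that a failure partway through setup releases whatever was already acquired. The sketch below assumes nothing about DTL itself: Resource, Session, make_session, and the live_resources counter are hypothetical names used only to make the behavior visible.

```cpp
#include <memory>

// Counts resources that are acquired but not yet released.
int live_resources = 0;

// Hypothetical resource, standing in for a communicator or a buffer.
struct Resource {
    Resource() { ++live_resources; }
    ~Resource() { --live_resources; }
    Resource(const Resource&) = delete;
    Resource& operator=(const Resource&) = delete;
};

struct Session {
    std::unique_ptr<Resource> comm;
    std::unique_ptr<Resource> buffer;
};

// Returns nullptr on failure. Nothing leaks on any path.
std::unique_ptr<Session> make_session(int buffer_elems) {
    // Validate early, before acquiring anything.
    if (buffer_elems <= 0) return nullptr;

    auto session = std::make_unique<Session>();
    session->comm = std::make_unique<Resource>();

    // Simulated second-stage failure: the partially initialized session
    // is destroyed on return, which releases comm.
    if (buffer_elems > 1024) return nullptr;

    session->buffer = std::make_unique<Resource>();
    return session;
}
```

Because ownership is tied to scope, the failure paths need no manual release calls, and the order of destruction is fixed by the language rather than by the caller.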

Failure categories to handle explicitly

invalid argument / null pointer
backend unavailable or failed initialization
communication/collective failure
timeout/cancellation paths (where applicable)

Operational recommendations

centralize status-to-message translation in application adapters
record rank/context identifiers in logs
include backend capability snapshot in startup diagnostics
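
A centralized translator can be as small as one switch plus one formatting helper. The sketch below is an assumption-laden illustration: the Status enumerators mirror the failure categories listed earlier in this chapter but are not DTL's actual codes, and status_message and format_log are hypothetical adapter functions an application would own.

```cpp
#include <sstream>
#include <string>

// Hypothetical status codes mirroring the failure categories above.
enum class Status {
    ok,
    invalid_argument,
    backend_unavailable,
    backend_failed,
    communication_failed,
    timeout,
    cancelled
};

// Single place where statuses become human-readable text.
const char* status_message(Status s) {
    switch (s) {
    case Status::ok:                   return "ok";
    case Status::invalid_argument:     return "invalid argument";
    case Status::backend_unavailable:  return "backend unavailable";
    case Status::backend_failed:       return "backend failed";
    case Status::communication_failed: return "communication failed";
    case Status::timeout:              return "operation timed out";
    case Status::cancelled:            return "operation cancelled";
    }
    return "unknown status";
}

// Log line that always carries the rank and the operation name.
std::string format_log(Status s, int rank, const std::string& op) {
    std::ostringstream out;
    out << "[rank " << rank << "] " << op << ": " << status_message(s);
    return out.str();
}
```

For example, format_log(Status::timeout, 2, "allreduce") yields "[rank 2] allreduce: operation timed out". Keeping the mapping in one adapter means a new status code needs only one new case.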

Next step

Proceed to Chapter 10.

Deep-dive reference