9. Error Handling and Reliability

Error model overview

DTL reports errors through status and result values. Some APIs also support exception-style reporting or callback-configurable error handling.

C++ guidance

  • prefer explicit result/status checking in distributed control paths

  • enrich error context near backend boundaries

  • avoid suppressing backend or collective participation errors
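
The guidance above can be sketched as follows. This is a minimal illustration, not DTL's actual API: Error, Result, backend_send, and send_checked are hypothetical names introduced here, and the numeric error code is made up. The sketch shows a result being checked explicitly and the error being enriched with rank and operation context at the backend boundary rather than being dropped.

```cpp
#include <string>
#include <utility>
#include <variant>

// Hypothetical error type: a code plus accumulated human-readable context.
struct Error {
    int code;
    std::string context;
};

// Hypothetical result type: holds either a value or an Error.
template <typename T>
class Result {
public:
    Result(T value) : data_(std::move(value)) {}
    Result(Error error) : data_(std::move(error)) {}

    bool ok() const { return std::holds_alternative<T>(data_); }
    const T& value() const { return std::get<T>(data_); }
    const Error& error() const { return std::get<Error>(data_); }

private:
    std::variant<T, Error> data_;
};

// Stand-in for a raw backend call (assumption for this sketch only).
// It fails when the destination rank is negative.
Result<int> backend_send(int dest_rank, int bytes) {
    if (dest_rank < 0) {
        return Error{22, "invalid destination rank"};
    }
    return bytes;  // number of bytes "sent"
}

// Boundary wrapper: checks the result explicitly and adds caller context
// before propagating, instead of suppressing the backend error.
Result<int> send_checked(int my_rank, int dest_rank, int bytes) {
    Result<int> r = backend_send(dest_rank, bytes);
    if (!r.ok()) {
        Error e = r.error();
        e.context = "rank " + std::to_string(my_rank) + ": send to " +
                    std::to_string(dest_rank) + " failed: " + e.context;
        return e;
    }
    return r;
}
```

In this sketch, send_checked(3, -1, 64) yields an error whose context reads "rank 3: send to -1 failed: invalid destination rank". The original backend message is kept at the end so that nothing is lost as context accumulates.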

C ABI guidance

  • check every returned dtl_status

  • translate status to logs/messages at call boundaries

  • treat *_UNAVAILABLE and *_FAILED statuses as distinct conditions rather than one generic error
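
A minimal sketch of these checks, written in C++ against a C-style interface. The dtl_status defined here is only a stand-in so the example compiles on its own; the enumerator names DTL_OK, DTL_BACKEND_UNAVAILABLE, and DTL_BACKEND_FAILED, their assumed meanings, and the functions init_backend and select_backend are all hypothetical. The fallback policy shown is one possible choice, not documented DTL behavior.

```cpp
#include <string>
#include <vector>

// Stand-in for the real dtl_status type. The enumerator names, values, and
// the meanings noted in the comments are assumptions made for this sketch.
enum dtl_status {
    DTL_OK = 0,
    DTL_BACKEND_UNAVAILABLE = 1,  // assumed: backend not present on this host/build
    DTL_BACKEND_FAILED = 2        // assumed: backend present but did not initialize
};

// Hypothetical C-style entry point. It pretends "gpu" is unavailable and
// "net" fails to initialize; any other name succeeds.
dtl_status init_backend(const char* name) {
    std::string n(name);
    if (n == "gpu") return DTL_BACKEND_UNAVAILABLE;
    if (n == "net") return DTL_BACKEND_FAILED;
    return DTL_OK;
}

// Tries candidates in order and checks every returned status.
// UNAVAILABLE moves on to the next candidate; FAILED is returned at once
// so that a real failure is not masked by a silent fallback.
dtl_status select_backend(const std::vector<std::string>& candidates,
                          std::string& chosen) {
    for (const std::string& name : candidates) {
        dtl_status st = init_backend(name.c_str());
        switch (st) {
            case DTL_OK:
                chosen = name;
                return DTL_OK;
            case DTL_BACKEND_UNAVAILABLE:
                continue;  // try the next candidate in the for loop
            case DTL_BACKEND_FAILED:
                return st;
        }
    }
    return DTL_BACKEND_UNAVAILABLE;  // no candidate was usable
}
```

With candidates "gpu" then "cpu", this sketch selects "cpu". With "gpu", "net", "cpu", it stops at "net" and returns the failure status without trying "cpu".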

Python guidance

  • catch binding exceptions and preserve actionable context

  • keep async request failures observable at await/wait boundaries

Reliability practices

  1. validate inputs/handles early

  2. use explicit synchronization points where required

  3. fail fast on collective contract violations

  4. ensure deterministic cleanup paths for partially initialized states
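
Practices 1 and 4 can be illustrated with a short RAII sketch. Resource, Session, open_session, and the g_events log are hypothetical names introduced for this example; they are not DTL types. Exceptions are used for brevity, and the same ownership structure applies when errors are returned as status values.

```cpp
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Records acquire/release events so the cleanup order can be inspected.
std::vector<std::string> g_events;

// Hypothetical resource: acquired in the constructor, released in the
// destructor. A constructor that throws never runs the destructor.
class Resource {
public:
    Resource(std::string name, bool fail) : name_(std::move(name)) {
        if (fail) throw std::runtime_error("init failed: " + name_);
        g_events.push_back("acquire " + name_);
    }
    ~Resource() { g_events.push_back("release " + name_); }
    Resource(const Resource&) = delete;
    Resource& operator=(const Resource&) = delete;

private:
    std::string name_;
};

// Hypothetical session owning two resources. Members are destroyed in
// reverse declaration order: buffer first, then comm.
struct Session {
    std::unique_ptr<Resource> comm;
    std::unique_ptr<Resource> buffer;
};

// Validates inputs before acquiring anything. If the second acquisition
// throws, the unique_ptr holding the first one releases it during stack
// unwinding, so no partially initialized Session escapes.
Session open_session(int num_ranks, bool buffer_fails) {
    if (num_ranks <= 0) {
        throw std::invalid_argument("num_ranks must be positive");
    }
    Session s;
    s.comm = std::make_unique<Resource>("comm", false);
    s.buffer = std::make_unique<Resource>("buffer", buffer_fails);
    return s;
}
```

When the buffer acquisition fails in this sketch, the event log contains "acquire comm" followed by "release comm". An invalid num_ranks is rejected before any resource is touched, so the log stays empty.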

Failure categories to handle explicitly

  • invalid argument / null pointer

  • backend unavailable or failed initialization

  • communication/collective failure

  • timeout/cancellation paths (where applicable)

Operational recommendations

  • centralize status-to-message translation in application adapters

  • record rank/context identifiers in logs

  • include backend capability snapshot in startup diagnostics
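
One way to apply these recommendations is a small adapter that owns all status wording and log formatting. Everything in this sketch is hypothetical: the Status enumeration, the Capabilities fields, and the function names are placeholders, and a real adapter would map from the library's own status type and capability queries.

```cpp
#include <sstream>
#include <string>

// Hypothetical application-level status codes.
enum class Status { Ok, InvalidArgument, BackendUnavailable, CommFailure, Timeout };

// Single place where status codes become text, so wording stays consistent
// across the application.
const char* status_message(Status s) {
    switch (s) {
        case Status::Ok: return "ok";
        case Status::InvalidArgument: return "invalid argument";
        case Status::BackendUnavailable: return "backend unavailable";
        case Status::CommFailure: return "communication failure";
        case Status::Timeout: return "timeout";
    }
    return "unknown status";
}

// Formats a log line that always carries the rank and an operation label.
std::string format_log(int rank, const std::string& op, Status s) {
    std::ostringstream out;
    out << "[rank " << rank << "] " << op << ": " << status_message(s);
    return out.str();
}

// Hypothetical capability snapshot taken once at startup.
struct Capabilities {
    bool network;
    bool accelerator;
};

// Formats the snapshot as one diagnostic line per rank.
std::string format_capabilities(int rank, const Capabilities& c) {
    std::ostringstream out;
    out << "[rank " << rank << "] capabilities:"
        << " network=" << (c.network ? "yes" : "no")
        << " accelerator=" << (c.accelerator ? "yes" : "no");
    return out.str();
}
```

For example, format_log(2, "allreduce", Status::CommFailure) produces "[rank 2] allreduce: communication failure". Keeping the rank prefix in one helper makes interleaved logs from many ranks easier to filter.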

Next step

Proceed to Chapter 10.

Deep-dive reference