# 9. Error Handling and Reliability

## Error model overview

DTL surfaces errors through status/result patterns and, in some APIs, exception-style or callback-configurable behavior.

## C++ guidance

- prefer explicit `result/status` checking in distributed control paths
- enrich error context near backend boundaries
- avoid suppressing backend or collective participation errors

## C ABI guidance

- check every returned `dtl_status`
- translate status to logs/messages at call boundaries
- treat `*_UNAVAILABLE` and `*_FAILED` distinctly

## Python guidance

- catch binding exceptions and preserve actionable context
- keep async request failures observable at await/wait boundaries

## Reliability practices

1. validate inputs/handles early
2. use explicit synchronization points where required
3. fail fast on collective contract violations
4. ensure deterministic cleanup paths for partially initialized states

## Failure categories to handle explicitly

- invalid argument / null pointer
- backend unavailable or failed initialization
- communication/collective failure
- timeout/cancellation paths (where applicable)

## Operational recommendations

- centralize status-to-message translation in application adapters
- record rank/context identifiers in logs
- include backend capability snapshot in startup diagnostics

## Next step

Proceed to [Chapter 10](10-performance-tuning-and-scaling.md).

## Deep-dive reference

- [Legacy Deep-Dive: Error Handling](error_handling.md)
- [Runtime and Handle Model](13-runtime-and-handle-model.md)