# C Bindings Guide

This guide covers the C ABI bindings for DTL. They provide a stable interface for C programs and serve as the foundation for bindings in other languages.

---

## Table of Contents

- [Overview](#overview)
- [Installation](#installation)
- [Core Concepts](#core-concepts)
- [API Reference](#api-reference)
  - [Context Operations](#context-operations)
  - [Communicator Operations](#communicator-operations)
  - [Container Operations](#container-operations)
  - [Collective Operations](#collective-operations)
  - [Algorithm Operations](#algorithm-operations)
  - [Policy Selection](#policy-selection)
  - [RMA Operations](#rma-operations)
- [Error Handling](#error-handling)
- [Thread Safety](#thread-safety)
- [Building FFI for Other Languages](#building-ffi-for-other-languages)
- [Complete Example](#complete-example)

---

## Overview

The C bindings provide:

- **ABI Stability**: Binary compatibility across minor versions
- **Language Neutrality**: Works with C99 and C11 compilers
- **Clear Ownership**: Explicit memory management via naming conventions
- **Error Transparency**: All errors surfaced via status codes

### Design Goals

1. Enable C programs to use DTL
2. Provide a stable foundation for bindings in other languages (Python, Fortran, Julia, etc.)
3. Maintain minimal overhead over native C++ calls

---

## Installation

### Building the C Library

```bash
cmake .. -DDTL_BUILD_C_BINDINGS=ON
make dtl_c
```

This produces `libdtl_c.so` (Linux/macOS) or `dtl_c.dll` (Windows).
### Linking

```bash
# Compile C program
gcc -std=c99 -o my_program my_program.c -ldtl_c -lmpi

# With explicit paths
gcc -std=c99 -I/path/to/dtl/include -L/path/to/dtl/lib \
    -o my_program my_program.c -ldtl_c -lmpi
```

### Headers

```c
// Master header includes everything
#include

// Or include specific headers
#include
#include
#include
```

---

## Core Concepts

### Opaque Handles

All DTL objects are accessed through opaque handle types:

```c
typedef struct dtl_context_s* dtl_context_t;
typedef struct dtl_vector_s* dtl_vector_t;
typedef struct dtl_tensor_s* dtl_tensor_t;
typedef struct dtl_request_s* dtl_request_t;
```

### Memory Ownership Conventions

Function naming indicates ownership semantics:

| Suffix | Meaning | Caller Responsibility |
|--------|---------|----------------------|
| `_create` | Creates new owned object | MUST call `_destroy` |
| `_destroy` | Releases owned object | Object becomes invalid |
| `_get` | Returns borrowed pointer | MUST NOT free |
| `_take` | Transfers ownership to caller | Caller MUST free/destroy |
| `_give` | Transfers ownership from caller | Caller MUST NOT use after |

### Basic Types

```c
typedef int32_t dtl_rank_t;    // MPI rank
typedef uint64_t dtl_size_t;   // Size type
typedef int64_t dtl_index_t;   // Index type (signed for offsets)
typedef int32_t dtl_status;    // Status code
```

### Data Types

```c
typedef enum dtl_dtype {
    DTL_DTYPE_INT8 = 0,
    DTL_DTYPE_INT16 = 1,
    DTL_DTYPE_INT32 = 2,
    DTL_DTYPE_INT64 = 3,
    DTL_DTYPE_UINT8 = 4,
    DTL_DTYPE_UINT16 = 5,
    DTL_DTYPE_UINT32 = 6,
    DTL_DTYPE_UINT64 = 7,
    DTL_DTYPE_FLOAT32 = 8,
    DTL_DTYPE_FLOAT64 = 9,
    DTL_DTYPE_BYTE = 10
} dtl_dtype;
```

---

## API Reference

### Environment Operations

The environment manages backend lifecycle (MPI, CUDA, HIP, NCCL, SHMEM) using reference-counted RAII semantics. The first `create` call initializes backends; the last `destroy` finalizes them.
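The reference-counted lifecycle described above can be illustrated without DTL itself. The following is a minimal sketch, not DTL's implementation: `mock_env_create`, `mock_env_destroy`, and the three counters are hypothetical stand-ins that only show how the first create initializes backends and the last destroy finalizes them.

```c
#include <stddef.h>

/* Hypothetical stand-in state; DTL's real bookkeeping is not shown in this guide. */
static size_t mock_ref_count = 0;    /* number of live environment handles */
static int mock_init_calls = 0;      /* how many times backends were initialized */
static int mock_finalize_calls = 0;  /* how many times backends were finalized */

/* Acquire a reference; only the first acquisition initializes backends. */
static void mock_env_create(void) {
    if (mock_ref_count == 0) {
        mock_init_calls++;           /* stands in for backend initialization */
    }
    mock_ref_count++;
}

/* Release a reference; only the last release finalizes backends. */
static void mock_env_destroy(void) {
    if (mock_ref_count == 0) {
        return;                      /* unbalanced destroy is ignored in this sketch */
    }
    mock_ref_count--;
    if (mock_ref_count == 0) {
        mock_finalize_calls++;       /* stands in for backend finalization */
    }
}
```

Calling `mock_env_create()` twice and `mock_env_destroy()` twice initializes once and finalizes once. In the real API, the `dtl_environment_ref_count()` query listed below presumably exposes the equivalent of `mock_ref_count`.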
```c
// Lifecycle
dtl_status dtl_environment_create(dtl_environment_t* env);
dtl_status dtl_environment_create_with_args(dtl_environment_t* env, int* argc, char*** argv);
void dtl_environment_destroy(dtl_environment_t env);

// State queries
int dtl_environment_is_initialized(void);
dtl_size_t dtl_environment_ref_count(void);

// Backend availability
int dtl_environment_has_mpi(void);
int dtl_environment_has_cuda(void);
int dtl_environment_has_hip(void);
int dtl_environment_has_nccl(void);
int dtl_environment_has_shmem(void);
int dtl_environment_mpi_thread_level(void);

// Context factories
dtl_status dtl_environment_make_world_context(dtl_environment_t env, dtl_context_t* ctx);
dtl_status dtl_environment_make_world_context_gpu(dtl_environment_t env, int device_id, dtl_context_t* ctx);
dtl_status dtl_environment_make_cpu_context(dtl_environment_t env, dtl_context_t* ctx);
```

**Example:**

```c
int main(int argc, char** argv) {
    dtl_environment_t env;
    dtl_status status = dtl_environment_create_with_args(&env, &argc, &argv);
    if (status != DTL_SUCCESS) {
        fprintf(stderr, "Init failed: %s\n", dtl_status_message(status));
        return 1;
    }

    dtl_context_t ctx;
    status = dtl_environment_make_world_context(env, &ctx);

    printf("Rank %d of %d\n", dtl_context_rank(ctx), dtl_context_size(ctx));

    dtl_context_destroy(ctx);
    dtl_environment_destroy(env);
    return 0;
}
```

### Context Operations

The context encapsulates MPI communicator and device selection. Contexts can be created directly or via environment factory methods (preferred).
```c
// Create context with default options (MPI_COMM_WORLD, no GPU)
dtl_status dtl_context_create_default(dtl_context_t* ctx);

// Create context with options
typedef struct {
    int device_id;      // GPU device ID (-1 for CPU only)
    int init_mpi;       // Whether to initialize MPI (default: 1)
    int finalize_mpi;   // Whether to finalize MPI on destruction (default: 0)
    int reserved[4];    // ABI-stable extension fields
} dtl_context_options;

dtl_status dtl_context_create(dtl_context_t* ctx, const dtl_context_options* opts);

// Destroy context
void dtl_context_destroy(dtl_context_t ctx);

// Query properties
dtl_rank_t dtl_context_rank(dtl_context_t ctx);  // Current rank
dtl_rank_t dtl_context_size(dtl_context_t ctx);  // Total ranks

// Synchronization
dtl_status dtl_context_barrier(dtl_context_t ctx);
```

#### Mode-Aware CUDA/NCCL Context APIs

```c
// Add CUDA/NCCL domains
dtl_status dtl_context_with_cuda(dtl_context_t ctx, int device_id, dtl_context_t* out);
dtl_status dtl_context_with_nccl(dtl_context_t ctx, int device_id, dtl_context_t* out);
dtl_status dtl_context_with_nccl_ex(
    dtl_context_t ctx, int device_id, dtl_nccl_operation_mode mode, dtl_context_t* out);

// Split with NCCL domain
dtl_status dtl_context_split_nccl(dtl_context_t ctx, int color, int key, dtl_context_t* out);
dtl_status dtl_context_split_nccl_ex(
    dtl_context_t ctx, int color, int key, int device_id,
    dtl_nccl_operation_mode mode, dtl_context_t* out);

// NCCL mode/capability introspection
int dtl_context_nccl_mode(dtl_context_t ctx);
int dtl_context_nccl_supports_native(dtl_context_t ctx, dtl_nccl_operation op);
int dtl_context_nccl_supports_hybrid(dtl_context_t ctx, dtl_nccl_operation op);
```

`DTL_NCCL_MODE_NATIVE_ONLY` rejects non-native NCCL operation families. `DTL_NCCL_MODE_HYBRID_PARITY` enables explicit hybrid parity paths where available.
**Example:**

```c
dtl_context_t ctx;
dtl_status status = dtl_context_create_default(&ctx);
if (status != DTL_SUCCESS) {
    fprintf(stderr, "Error: %s\n", dtl_status_message(status));
    return 1;
}

printf("Rank %d of %d\n", dtl_context_rank(ctx), dtl_context_size(ctx));

dtl_context_barrier(ctx);
dtl_context_destroy(ctx);
```

### Container Operations

#### Distributed Vector

```c
// Create vector
dtl_status dtl_vector_create(
    dtl_context_t ctx,
    dtl_dtype dtype,
    dtl_size_t global_size,
    dtl_vector_t* vec
);

// Create with fill value
dtl_status dtl_vector_create_fill(
    dtl_context_t ctx,
    dtl_dtype dtype,
    dtl_size_t global_size,
    const void* fill_value,
    dtl_vector_t* vec
);

// Destroy vector
void dtl_vector_destroy(dtl_vector_t vec);

// Size queries
dtl_size_t dtl_vector_global_size(dtl_vector_t vec);
dtl_size_t dtl_vector_local_size(dtl_vector_t vec);
dtl_size_t dtl_vector_local_offset(dtl_vector_t vec);

// Data access (borrowed pointers - do not free)
const void* dtl_vector_local_data(dtl_vector_t vec);
void* dtl_vector_local_data_mut(dtl_vector_t vec);

// Index queries
dtl_rank_t dtl_vector_owner(dtl_vector_t vec, dtl_size_t global_idx);
int dtl_vector_is_local(dtl_vector_t vec, dtl_size_t global_idx);
```

**Example:**

```c
dtl_vector_t vec;
dtl_status status = dtl_vector_create(ctx, DTL_DTYPE_FLOAT64, 1000, &vec);
if (status != DTL_SUCCESS) {
    fprintf(stderr, "Failed to create vector: %s\n", dtl_status_message(status));
    return 1;
}

// Get local data pointer
double* data = (double*)dtl_vector_local_data_mut(vec);
dtl_size_t local_size = dtl_vector_local_size(vec);

// Fill with values
for (size_t i = 0; i < local_size; i++) {
    data[i] = (double)(dtl_vector_local_offset(vec) + i);
}

dtl_vector_destroy(vec);
```

#### Distributed Span (`dtl_span_t`)

The C ABI exposes a first-class non-owning distributed span handle:

- create from containers: `dtl_span_from_vector`, `dtl_span_from_array`, `dtl_span_from_tensor`
- create from raw local buffer + metadata: `dtl_span_create`
- subspan operations: `dtl_span_first`, `dtl_span_last`, `dtl_span_subspan`
- local access: `dtl_span_data(_mut)`, `dtl_span_get_local`, `dtl_span_set_local`
- metadata: `dtl_span_size`, `dtl_span_local_size`, `dtl_span_rank`, `dtl_span_num_ranks`

`dtl_span_t` is explicitly non-owning. The backing container/storage must outlive every span created from it.

```c
dtl_vector_t vec = NULL;
dtl_span_t span = NULL;

dtl_status status = dtl_vector_create(ctx, DTL_DTYPE_FLOAT64, 1024, &vec);
if (status != DTL_SUCCESS) return 1;

status = dtl_span_from_vector(vec, &span);
if (status != DTL_SUCCESS) return 1;

double value = 3.14;
status = dtl_span_set_local(span, 0, &value);

dtl_span_destroy(span);
dtl_vector_destroy(vec);
```

#### Distributed Tensor

```c
// Create tensor
dtl_status dtl_tensor_create(
    dtl_context_t ctx,
    dtl_dtype dtype,
    const dtl_size_t* shape,
    int ndim,
    dtl_tensor_t* tensor
);

// Destroy tensor
void dtl_tensor_destroy(dtl_tensor_t tensor);

// Shape queries
int dtl_tensor_ndim(dtl_tensor_t tensor);
void dtl_tensor_shape(dtl_tensor_t tensor, dtl_size_t* shape);
void dtl_tensor_local_shape(dtl_tensor_t tensor, dtl_size_t* shape);
dtl_size_t dtl_tensor_global_size(dtl_tensor_t tensor);
dtl_size_t dtl_tensor_local_size(dtl_tensor_t tensor);

// Data access
const void* dtl_tensor_local_data(dtl_tensor_t tensor);
void* dtl_tensor_local_data_mut(dtl_tensor_t tensor);
```

**Example:**

```c
dtl_size_t shape[] = {100, 64, 64};  // 100x64x64 tensor
dtl_tensor_t tensor;
dtl_status status = dtl_tensor_create(ctx, DTL_DTYPE_FLOAT32, shape, 3, &tensor);

float* data = (float*)dtl_tensor_local_data_mut(tensor);
// ... fill data ...

dtl_tensor_destroy(tensor);
```

#### Distributed Array

Fixed-size distributed array (size cannot be changed after creation).
```c
// Create array
dtl_status dtl_array_create(
    dtl_context_t ctx,
    dtl_dtype dtype,
    dtl_size_t size,
    dtl_array_t* arr
);

// Create with fill value
dtl_status dtl_array_create_fill(
    dtl_context_t ctx,
    dtl_dtype dtype,
    dtl_size_t size,
    const void* fill_value,
    dtl_array_t* arr
);

// Destroy array
void dtl_array_destroy(dtl_array_t arr);

// Size queries
dtl_size_t dtl_array_size(dtl_array_t arr);  // Global size (fixed)
dtl_size_t dtl_array_local_size(dtl_array_t arr);
dtl_index_t dtl_array_local_offset(dtl_array_t arr);

// Data access
const void* dtl_array_local_data(dtl_array_t arr);
void* dtl_array_local_data_mut(dtl_array_t arr);

// Index queries
dtl_rank_t dtl_array_owner(dtl_array_t arr, dtl_index_t global_idx);
int dtl_array_is_local(dtl_array_t arr, dtl_index_t global_idx);
```

**Example:**

```c
dtl_array_t arr;
dtl_status status = dtl_array_create(ctx, DTL_DTYPE_INT32, 1000, &arr);

int32_t* data = (int32_t*)dtl_array_local_data_mut(arr);
for (size_t i = 0; i < dtl_array_local_size(arr); i++) {
    data[i] = (int32_t)i;  // explicit narrowing from size_t
}

// Note: No resize() method - arrays are fixed size
dtl_array_destroy(arr);
```

### Collective Operations

```c
// Reduction operations
typedef enum dtl_reduce_op {
    DTL_OP_SUM = 0,
    DTL_OP_PROD = 1,
    DTL_OP_MIN = 2,
    DTL_OP_MAX = 3,
    DTL_OP_LAND = 4,  // Logical AND
    DTL_OP_LOR = 5,   // Logical OR
    DTL_OP_BAND = 6,  // Bitwise AND
    DTL_OP_BOR = 7    // Bitwise OR
} dtl_reduce_op;

// Broadcast
dtl_status dtl_broadcast(
    dtl_context_t ctx,
    void* buf,
    dtl_size_t count,
    dtl_dtype dtype,
    dtl_rank_t root
);

// Reduce to root
dtl_status dtl_reduce(
    dtl_context_t ctx,
    const void* sendbuf,
    void* recvbuf,
    dtl_size_t count,
    dtl_dtype dtype,
    dtl_reduce_op op,
    dtl_rank_t root
);

// Reduce to all ranks
dtl_status dtl_allreduce(
    dtl_context_t ctx,
    const void* sendbuf,
    void* recvbuf,
    dtl_size_t count,
    dtl_dtype dtype,
    dtl_reduce_op op
);

// Gather to root
dtl_status dtl_gather(
    dtl_context_t ctx,
    const void* sendbuf,
    dtl_size_t sendcount,
    dtl_dtype sendtype,
    void* recvbuf,
    dtl_size_t recvcount,
    dtl_dtype recvtype,
    dtl_rank_t root
);

// Scatter from root
dtl_status dtl_scatter(
    dtl_context_t ctx,
    const void* sendbuf,
    dtl_size_t sendcount,
    dtl_dtype sendtype,
    void* recvbuf,
    dtl_size_t recvcount,
    dtl_dtype recvtype,
    dtl_rank_t root
);

// Gather to all
dtl_status dtl_allgather(
    dtl_context_t ctx,
    const void* sendbuf,
    dtl_size_t sendcount,
    dtl_dtype sendtype,
    void* recvbuf,
    dtl_size_t recvcount,
    dtl_dtype recvtype
);
```

**Example:**

```c
double local_sum = compute_local_sum(vec);
double global_sum;

dtl_status status = dtl_allreduce(
    ctx,
    &local_sum,         // send buffer
    &global_sum,        // receive buffer
    1,                  // count
    DTL_DTYPE_FLOAT64,  // data type
    DTL_OP_SUM          // reduction operation
);

if (status == DTL_SUCCESS) {
    printf("Global sum: %f\n", global_sum);
}
```

### Algorithm Operations

DTL provides distributed algorithm operations on containers.

```c
// Callback types
typedef void (*dtl_unary_func)(void* element, dtl_size_t index, void* user_data);
typedef int (*dtl_predicate)(const void* element, void* user_data);

// for_each - apply function to each local element
dtl_status dtl_for_each_vector(dtl_vector_t vec, dtl_unary_func func, void* user_data);
dtl_status dtl_for_each_array(dtl_array_t arr, dtl_unary_func func, void* user_data);

// copy - copy data between containers
dtl_status dtl_copy_vector(dtl_vector_t src, dtl_vector_t dst);
dtl_status dtl_copy_array(dtl_array_t src, dtl_array_t dst);

// fill - fill container with value
dtl_status dtl_fill_vector(dtl_vector_t vec, const void* value);
dtl_status dtl_fill_array(dtl_array_t arr, const void* value);

// find - find first matching element
dtl_index_t dtl_find_vector(dtl_vector_t vec, const void* value);
dtl_index_t dtl_find_if_vector(dtl_vector_t vec, dtl_predicate pred, void* user_data);
dtl_index_t dtl_find_array(dtl_array_t arr, const void* value);
dtl_index_t dtl_find_if_array(dtl_array_t arr, dtl_predicate pred, void* user_data);

// count - count matching elements
dtl_size_t dtl_count_vector(dtl_vector_t vec, const void* value);
dtl_size_t dtl_count_if_vector(dtl_vector_t vec, dtl_predicate pred, void* user_data);
dtl_size_t dtl_count_array(dtl_array_t arr, const void* value);
dtl_size_t dtl_count_if_array(dtl_array_t arr, dtl_predicate pred, void* user_data);

// reduce - local reduction
dtl_status dtl_reduce_local_vector(dtl_vector_t vec, dtl_reduce_op op, void* result);
dtl_status dtl_reduce_local_array(dtl_array_t arr, dtl_reduce_op op, void* result);

// sort - local sort
dtl_status dtl_sort_vector(dtl_vector_t vec);
dtl_status dtl_sort_vector_descending(dtl_vector_t vec);
dtl_status dtl_sort_array(dtl_array_t arr);
dtl_status dtl_sort_array_descending(dtl_array_t arr);

// minmax - find local min and max
dtl_status dtl_minmax_vector(dtl_vector_t vec, void* min_val, void* max_val);
dtl_status dtl_minmax_array(dtl_array_t arr, void* min_val, void* max_val);
```

**Example:**

```c
// Predicate for elements greater than 10 (define at file scope)
int predicate_gt_10(const void* elem, void* user_data) {
    (void)user_data;  // unused
    return *(const double*)elem > 10.0;
}

// Fill vector with value
double value = 42.0;
dtl_fill_vector(vec, &value);

// Count elements greater than 10
dtl_size_t count = dtl_count_if_vector(vec, predicate_gt_10, NULL);

// Local reduction
double local_sum;
dtl_reduce_local_vector(vec, DTL_OP_SUM, &local_sum);

// Sort ascending
dtl_sort_vector(vec);
```

### Policy Selection

DTL supports policy selection at container creation time.
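To make the `DTL_PARTITION_BLOCK` and `DTL_PARTITION_CYCLIC` policies listed below more concrete, here is a sketch of how block and cyclic partitioning conventionally map a global index to an owning rank. It assumes the textbook definitions (contiguous chunks with the remainder spread over the lowest ranks, and round-robin assignment). DTL's exact layout is not specified in this guide, and `block_owner` and `cyclic_owner` are hypothetical helpers, not part of the C API.

```c
#include <stdint.h>

/* Hypothetical helper: owner of global_idx under a conventional block partition.
 * Preconditions: num_ranks > 0 and global_idx < global_size. */
static int32_t block_owner(uint64_t global_idx, uint64_t global_size, int32_t num_ranks) {
    uint64_t p = (uint64_t)num_ranks;
    uint64_t base = global_size / p;         /* minimum elements per rank */
    uint64_t extra = global_size % p;        /* ranks [0, extra) hold base + 1 elements */
    uint64_t boundary = extra * (base + 1);  /* first index owned by a rank holding base elements */

    if (global_idx < boundary) {
        return (int32_t)(global_idx / (base + 1));
    }
    /* base > 0 here: if base were 0, every valid index would lie below boundary. */
    return (int32_t)(extra + (global_idx - boundary) / base);
}

/* Hypothetical helper: owner of global_idx under a conventional cyclic partition. */
static int32_t cyclic_owner(uint64_t global_idx, int32_t num_ranks) {
    return (int32_t)(global_idx % (uint64_t)num_ranks);
}
```

With 10 elements on 4 ranks, this sketch gives ranks 0 and 1 three elements each and ranks 2 and 3 two each under the block layout, while the cyclic layout deals indices out round-robin. For a real container, `dtl_vector_owner` remains the authoritative query.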
```c
// Partition policies
typedef enum dtl_partition_policy {
    DTL_PARTITION_BLOCK = 0,
    DTL_PARTITION_CYCLIC = 1,
    DTL_PARTITION_BLOCK_CYCLIC = 2,
    DTL_PARTITION_HASH = 3,
    DTL_PARTITION_REPLICATED = 4
} dtl_partition_policy;

// Placement policies
typedef enum dtl_placement_policy {
    DTL_PLACEMENT_HOST = 0,
    DTL_PLACEMENT_DEVICE = 1,           // CUDA only
    DTL_PLACEMENT_UNIFIED = 2,          // CUDA only
    DTL_PLACEMENT_DEVICE_PREFERRED = 3  // CUDA only
} dtl_placement_policy;

// Execution policies
typedef enum dtl_execution_policy {
    DTL_EXEC_SEQ = 0,
    DTL_EXEC_PAR = 1,
    DTL_EXEC_ASYNC = 2
} dtl_execution_policy;

// Container options
typedef struct dtl_container_options {
    dtl_partition_policy partition;
    dtl_placement_policy placement;
    dtl_execution_policy execution;
} dtl_container_options;

// Initialize options to defaults
void dtl_container_options_init(dtl_container_options* opts);

// Create container with options
dtl_status dtl_vector_create_with_options(
    dtl_context_t ctx,
    dtl_dtype dtype,
    dtl_size_t size,
    const dtl_container_options* opts,
    dtl_vector_t* vec
);

// Query container policies
dtl_partition_policy dtl_vector_partition_policy(dtl_vector_t vec);
dtl_placement_policy dtl_vector_placement_policy(dtl_vector_t vec);

// Check policy availability
int dtl_placement_available(dtl_placement_policy placement);
```

**Example:**

```c
dtl_container_options opts;
dtl_container_options_init(&opts);
opts.partition = DTL_PARTITION_CYCLIC;

dtl_vector_t vec;
dtl_vector_create_with_options(ctx, DTL_DTYPE_FLOAT64, 10000, &opts, &vec);

// Query policy
dtl_partition_policy policy = dtl_vector_partition_policy(vec);
printf("Partition: %s\n", policy == DTL_PARTITION_CYCLIC ? "cyclic" : "other");
```

### RMA Operations

Remote Memory Access (one-sided communication) operations.
#### Window Management

```c
typedef struct dtl_window_s* dtl_window_t;

typedef enum dtl_lock_mode {
    DTL_LOCK_EXCLUSIVE = 0,
    DTL_LOCK_SHARED = 1
} dtl_lock_mode;

// Create window from existing memory
dtl_status dtl_window_create(
    dtl_context_t ctx,
    void* base,
    dtl_size_t size,
    dtl_window_t* win
);

// Allocate window with new memory
dtl_status dtl_window_allocate(
    dtl_context_t ctx,
    dtl_size_t size,
    dtl_window_t* win
);

// Destroy window
void dtl_window_destroy(dtl_window_t win);

// Query properties
void* dtl_window_base(dtl_window_t win);
dtl_size_t dtl_window_size(dtl_window_t win);
int dtl_window_is_valid(dtl_window_t win);
```

#### Synchronization

```c
// Active-target synchronization (collective)
dtl_status dtl_window_fence(dtl_window_t win);

// Passive-target synchronization (per-rank)
dtl_status dtl_window_lock(dtl_window_t win, dtl_rank_t target, dtl_lock_mode mode);
dtl_status dtl_window_unlock(dtl_window_t win, dtl_rank_t target);
dtl_status dtl_window_lock_all(dtl_window_t win);
dtl_status dtl_window_unlock_all(dtl_window_t win);

// Flush pending operations
dtl_status dtl_window_flush(dtl_window_t win, dtl_rank_t target);
dtl_status dtl_window_flush_all(dtl_window_t win);
dtl_status dtl_window_flush_local(dtl_window_t win, dtl_rank_t target);
dtl_status dtl_window_flush_local_all(dtl_window_t win);
```

#### Data Transfer

```c
// Put data to remote window
dtl_status dtl_rma_put(
    dtl_window_t win,
    dtl_rank_t target,
    dtl_size_t target_offset,
    const void* origin,
    dtl_size_t size
);

// Get data from remote window
dtl_status dtl_rma_get(
    dtl_window_t win,
    dtl_rank_t target,
    dtl_size_t target_offset,
    void* buffer,
    dtl_size_t size
);

// Async versions
dtl_status dtl_rma_put_async(dtl_window_t win, dtl_rank_t target, dtl_size_t offset,
                             const void* data, dtl_size_t size, dtl_request_t* req);
dtl_status dtl_rma_get_async(dtl_window_t win, dtl_rank_t target, dtl_size_t offset,
                             void* buffer, dtl_size_t size, dtl_request_t* req);
```

#### Atomic Operations

```c
// Atomic accumulate
dtl_status dtl_rma_accumulate(
    dtl_window_t win,
    dtl_rank_t target,
    dtl_size_t offset,
    const void* origin,
    dtl_size_t size,
    dtl_dtype dtype,
    dtl_reduce_op op
);

// Atomic fetch-and-op
dtl_status dtl_rma_fetch_and_op(
    dtl_window_t win,
    dtl_rank_t target,
    dtl_size_t offset,
    const void* origin,
    void* result,
    dtl_dtype dtype,
    dtl_reduce_op op
);

// Atomic compare-and-swap
dtl_status dtl_rma_compare_and_swap(
    dtl_window_t win,
    dtl_rank_t target,
    dtl_size_t offset,
    const void* compare,
    const void* swap,
    void* result,
    dtl_dtype dtype
);
```

**Example:**

```c
// Create window
dtl_window_t win;
dtl_window_allocate(ctx, 1024, &win);

// Active-target: use fence
dtl_window_fence(win);  // Start epoch

double data[10] = {1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0};
dtl_rank_t my_rank = dtl_context_rank(ctx);
dtl_rma_put(win, my_rank, 0, data, sizeof(data));

dtl_window_fence(win);  // Complete epoch

// Passive-target: use lock/unlock
dtl_window_lock(win, my_rank, DTL_LOCK_EXCLUSIVE);

int64_t old_val;
int64_t addend = 1;
dtl_rma_fetch_and_op(win, my_rank, 0, &addend, &old_val, DTL_DTYPE_INT64, DTL_OP_SUM);

dtl_window_flush(win, my_rank);
dtl_window_unlock(win, my_rank);

// Cleanup
dtl_window_destroy(win);
```

---

## Error Handling

### Status Codes

All functions that can fail return `dtl_status`:

```c
#define DTL_SUCCESS 0

// Communication (100-199)
#define DTL_ERROR_COMMUNICATION 100
#define DTL_ERROR_SEND_FAILED 101
#define DTL_ERROR_RECV_FAILED 102
#define DTL_ERROR_BARRIER_FAILED 105
#define DTL_ERROR_TIMEOUT 106

// Memory (200-299)
#define DTL_ERROR_MEMORY 200
#define DTL_ERROR_ALLOCATION_FAILED 201
#define DTL_ERROR_OUT_OF_MEMORY 202

// Bounds (400-499)
#define DTL_ERROR_BOUNDS 400
#define DTL_ERROR_OUT_OF_BOUNDS 401
#define DTL_ERROR_INVALID_ARGUMENT 410

// Backend (500-599)
#define DTL_ERROR_BACKEND 500
#define DTL_ERROR_MPI 530

// Internal (900-999)
#define DTL_ERROR_NOT_IMPLEMENTED 901
```

### Error Messages

```c
// Get human-readable error message
const char* dtl_status_message(dtl_status status);

// Get error category name
const char* dtl_status_category(dtl_status status);

// Get error category code
int dtl_status_category_code(dtl_status status);
```

### Error Handling Pattern

```c
dtl_status status = dtl_some_operation(...);
if (status != DTL_SUCCESS) {
    fprintf(stderr, "[%s] Error %d: %s\n",
            dtl_status_category(status),
            status,
            dtl_status_message(status));
    // Handle error...
}
```

---

## Thread Safety

| Category | Guarantee |
|----------|-----------|
| Version queries | Thread-safe |
| Status messages | Thread-safe (static strings) |
| Context operations | NOT thread-safe within same context |
| Container operations | NOT thread-safe within same container |

**Multi-threaded usage**: Different contexts/containers MAY be used from different threads simultaneously. The same context/container MUST NOT be accessed from multiple threads without external synchronization.

---

## Building FFI for Other Languages

The C ABI is designed to be easily wrapped by other languages:

### Symbol Naming

All symbols use the `dtl_` prefix with predictable patterns:

- Types: `dtl__t`
- Functions: `dtl__`
- Constants: `DTL_`

### Handle Pattern

Opaque handles are pointers to forward-declared structs:

```c
// In header
typedef struct dtl_context_s* dtl_context_t;

// Actual implementation hidden in .cpp
struct dtl_context_s {
    // ... implementation details ...
};
```

This ensures:

- ABI stability (pointer size is fixed)
- No need to expose internal structure layout
- Safe to change internals without recompilation

### Feature Detection

```c
// Check available backends at runtime
int dtl_has_mpi(void);   // Returns 1 if MPI available
int dtl_has_cuda(void);  // Returns 1 if CUDA available
int dtl_has_hip(void);   // Returns 1 if HIP available
int dtl_has_nccl(void);  // Returns 1 if NCCL available
```

### Version Queries

```c
int dtl_version_major(void);
int dtl_version_minor(void);
int dtl_version_patch(void);
int dtl_abi_version(void);
const char* dtl_version_string(void);  // e.g., "1.0.0"
```

---

## Complete Example

```c
/**
 * DTL C Bindings Example: Distributed Sum
 *
 * Compile: gcc -std=c99 -o dist_sum dist_sum.c -ldtl_c -lmpi
 * Run: mpirun -np 4 ./dist_sum
 */
#include
#include
#include

int main(int argc, char** argv) {
    dtl_context_t ctx;
    dtl_vector_t vec;
    dtl_status status;

    // Create context
    status = dtl_context_create_default(&ctx);
    if (status != DTL_SUCCESS) {
        fprintf(stderr, "Failed to create context: %s\n", dtl_status_message(status));
        return 1;
    }

    dtl_rank_t rank = dtl_context_rank(ctx);
    dtl_rank_t size = dtl_context_size(ctx);
    printf("[Rank %d/%d] Started\n", rank, size);

    // Create distributed vector with 10000 elements
    const dtl_size_t global_size = 10000;
    status = dtl_vector_create(ctx, DTL_DTYPE_FLOAT64, global_size, &vec);
    if (status != DTL_SUCCESS) {
        fprintf(stderr, "Failed to create vector: %s\n", dtl_status_message(status));
        dtl_context_destroy(ctx);
        return 1;
    }

    // Get local data pointer
    double* data = (double*)dtl_vector_local_data_mut(vec);
    dtl_size_t local_size = dtl_vector_local_size(vec);
    dtl_size_t local_offset = dtl_vector_local_offset(vec);

    printf("[Rank %d] Local size: %lu, offset: %lu\n",
           rank, (unsigned long)local_size, (unsigned long)local_offset);

    // Initialize with global indices
    for (dtl_size_t i = 0; i < local_size; i++) {
        data[i] = (double)(local_offset + i);
    }

    // Compute local sum
    double local_sum = 0.0;
    for (dtl_size_t i = 0; i < local_size; i++) {
        local_sum += data[i];
    }

    // Global sum via allreduce
    double global_sum;
    status = dtl_allreduce(ctx, &local_sum, &global_sum, 1, DTL_DTYPE_FLOAT64, DTL_OP_SUM);
    if (status != DTL_SUCCESS) {
        fprintf(stderr, "Allreduce failed: %s\n", dtl_status_message(status));
        dtl_vector_destroy(vec);
        dtl_context_destroy(ctx);
        return 1;
    }

    // Verify result (sum of 0..N-1 = N*(N-1)/2)
    double expected = (double)(global_size * (global_size - 1) / 2);

    if (rank == 0) {
        printf("\nGlobal sum: %.0f\n", global_sum);
        printf("Expected: %.0f\n", expected);
        printf("Result: %s\n", (global_sum == expected) ? "SUCCESS" : "FAILURE");
    }

    // Cleanup
    dtl_vector_destroy(vec);
    dtl_context_destroy(ctx);

    return (global_sum == expected) ? 0 : 1;
}
```

---

## References

- [Python Bindings Guide](python_bindings.md) (uses C ABI internally)
- [Fortran Bindings Guide](fortran_bindings.md) (via ISO_C_BINDING)

## Thread Safety

A `dtl_context_t` handle is **not safe for concurrent use** from multiple threads. However, multi-threaded programs can use DTL safely by following these patterns:

### Pattern 1: One context per thread

```c
void* worker(void* arg) {
    int thread_id = *(int*)arg;

    // Each thread creates its own context
    dtl_context_t ctx;
    dtl_context_create_default(&ctx);

    // Create and work with containers independently
    dtl_vector_t vec;
    dtl_vector_create(ctx, DTL_DTYPE_FLOAT64, 1000, &vec);
    // ... compute ...
    dtl_vector_destroy(vec);

    dtl_context_destroy(ctx);
    return NULL;
}
```

### Pattern 2: Duplicate context for threads

```c
// Main thread creates the primary context
dtl_context_t main_ctx;
dtl_context_create_default(&main_ctx);

// Before spawning threads, duplicate for each
dtl_context_t thread_ctx;
dtl_context_dup(main_ctx, &thread_ctx);

// Pass thread_ctx to the worker thread
// Each thread uses its own duplicated context
```

### Pattern 3: External synchronization

```c
pthread_mutex_t ctx_lock = PTHREAD_MUTEX_INITIALIZER;
dtl_context_t shared_ctx;

void* worker(void* arg) {
    // Serialize access to shared context
    pthread_mutex_lock(&ctx_lock);
    dtl_context_barrier(shared_ctx);
    pthread_mutex_unlock(&ctx_lock);
    return NULL;
}
```

### Recommended approach

**Pattern 1** (one context per thread) is recommended for most use cases. It provides complete isolation with no synchronization overhead. Each context duplicates the MPI communicator internally, ensuring message isolation between threads.

For GPU applications with multiple CUDA streams, use Pattern 2 and set different device IDs for each context:

```c
dtl_context_options opts;
dtl_context_options_init(&opts);
opts.device_id = thread_id % num_gpus;

dtl_context_t ctx;
dtl_context_create(&ctx, &opts);
```