# HIP Backend Guide **Status:** Production-Ready **Since:** DTL 0.1.0-alpha.1 **Last Updated:** 2026-02-07 ## Overview The HIP (Heterogeneous-Compute Interface for Portability) backend enables DTL to run on AMD GPUs via the ROCm software stack. HIP provides a CUDA-like programming model, and DTL's HIP backend mirrors the CUDA backend's architecture with AMD-specific adaptations. Key capabilities: - **AMD GPU memory management** via `hip_memory_space` - **Stream-based execution** via `hip_executor` for asynchronous kernel dispatch - **API parity** with the CUDA backend for most DTL operations - **Portability** — HIP code can also target NVIDIA GPUs as a compilation target ## Requirements - **ROCm** 5.0 or later (6.x recommended) - **AMD GPU** with GFX9 or later architecture (MI100, MI200, MI300, RX 7000 series) - **C++20 compiler** (GCC 10+, Clang 14+ with ROCm support) - **CMake** 3.21+ (for HIP language support) ## CMake Configuration Enable the HIP backend at configure time: ```bash cmake -DDTL_ENABLE_HIP=ON \ -DCMAKE_HIP_ARCHITECTURES=gfx90a \ .. ``` ### Common CMake Flags | Flag | Default | Description | |------|---------|-------------| | `DTL_ENABLE_HIP` | `OFF` | Enable HIP backend | | `CMAKE_HIP_ARCHITECTURES` | Auto | Target GPU architectures (e.g., `gfx908;gfx90a;gfx942`) | ### ROCm Installation Install ROCm following AMD's official documentation for your distro: ```bash # Ubuntu 22.04+ sudo apt install rocm-dev hipcc ``` Verify the installation: ```bash hipcc --version rocminfo # Lists available AMD GPUs ``` ## CUDA to HIP Porting ### API Mapping HIP mirrors the CUDA API with a `hip` prefix. DTL abstracts most of these differences behind its backend interfaces: | CUDA | HIP | DTL Abstraction | |------|-----|-----------------| | `cudaMalloc` | `hipMalloc` | `hip_memory_space::allocate()` | | `cudaFree` | `hipFree` | `hip_memory_space::deallocate()` | | `cudaMemcpy` | `hipMemcpy` | `hip_memory_transfer::copy()` | | `cudaMemcpyAsync` | `hipMemcpyAsync` | `hip_memory_transfer::copy_async()` | | `cudaStreamCreate` | `hipStreamCreate` | `hip_stream` RAII wrapper | | `cudaEventCreate` | `hipEventCreate` | `hip_event` RAII wrapper | | `cudaDeviceSynchronize` | `hipDeviceSynchronize` | `hip_executor::synchronize()` | | `cudaGetDeviceCount` | `hipGetDeviceCount` | `dtl::hip::device_count()` | ### The hipify Tool AMD provides `hipify-perl` and `hipify-clang` to automatically convert CUDA source to HIP: ```bash # Convert a CUDA source file to HIP hipify-perl my_kernel.cu > my_kernel.hip.cpp # Or use the clang-based converter for more accuracy hipify-clang my_kernel.cu -o my_kernel.hip.cpp ``` DTL's backend code uses `#if DTL_ENABLE_HIP` guards rather than hipified CUDA code, ensuring each backend is a clean, first-class implementation. ## Memory Management ### `hip_memory_space` The `hip_memory_space` class manages device memory allocations on AMD GPUs. It satisfies DTL's `MemorySpace` concept. ```cpp #include dtl::hip::hip_memory_space mem_space; // Allocate device memory auto alloc_result = mem_space.allocate(1024); if (alloc_result.has_value()) { void* ptr = alloc_result.value(); // Use device memory... mem_space.deallocate(ptr, 1024); } ``` Properties: - `host_accessible = false` — device memory is not directly host-accessible - `device_accessible = true` — accessible from GPU kernels - Supports per-device allocation ### Memory Transfers ```cpp #include // Host to device dtl::hip::hip_memory_transfer::copy(d_ptr, h_ptr, size, dtl::hip::transfer_kind::host_to_device); // Device to host dtl::hip::hip_memory_transfer::copy(h_ptr, d_ptr, size, dtl::hip::transfer_kind::device_to_host); // Async copy on a stream dtl::hip::hip_memory_transfer::copy_async(d_ptr, h_ptr, size, dtl::hip::transfer_kind::host_to_device, stream); ``` ## Executor Patterns ### `hip_executor` The `hip_executor` provides stream-based asynchronous execution on AMD GPUs: ```cpp #include // Create executor with a non-blocking stream dtl::hip::hip_stream stream(dtl::hip::stream_flags::non_blocking); dtl::hip::hip_executor exec(std::move(stream)); // Execute work on the GPU exec.execute([](hipStream_t s) { my_kernel<<>>(d_data, n); }); // Wait for completion exec.synchronize(); ``` ### Stream Management DTL provides RAII stream wrappers for HIP: ```cpp #include // Default stream dtl::hip::hip_stream default_stream; // Non-blocking stream (owned) dtl::hip::hip_stream stream(dtl::hip::stream_flags::non_blocking); // Wrap an existing hipStream_t (non-owning) hipStream_t external_stream; hipStreamCreate(&external_stream); dtl::hip::hip_stream wrapped(external_stream, false); // does not destroy on destruct ``` ### Event Tracking ```cpp #include dtl::hip::hip_event event; event.record(stream); // Check completion if (event.query()) { // Work is done } // Or block until done event.synchronize(); ``` ## DTL HIP-Specific Considerations ### Placement Policies with HIP When HIP is enabled but CUDA is not, the placement policies map to HIP equivalents: - `device_only` — Allocates via `hipMalloc` on device N - `unified_memory` — Uses `hipMallocManaged` for managed memory - `device_preferred` — Uses unified memory with device-preferred hints ```cpp // HIP device allocation dtl::distributed_vector> vec(1000, ctx); // HIP unified memory dtl::distributed_vector vec_unified(1000, ctx); ``` ### Build System When building with HIP: - Use `hipcc` as the compiler or set up CMake's HIP language support - ROCm's `amdclang++` can compile HIP sources with `--offload-arch=` - DTL detects HIP availability via `find_package(hip)` in CMake ### Error Handling HIP errors are mapped to DTL status codes: ```cpp // HIP errors translated to DTL status dtl::status_code::hip_error // = 520 ``` ## Known Differences from CUDA Backend | Feature | CUDA Backend | HIP Backend | |---------|-------------|-------------| | Device memory | `cudaMalloc` | `hipMalloc` | | Unified memory | `cudaMallocManaged` | `hipMallocManaged` | | Streams | `cudaStream_t` | `hipStream_t` | | Events | `cudaEvent_t` | `hipEvent_t` | | Error type | `cudaError_t` | `hipError_t` | | Status code | `status_code::cuda_error` (510) | `status_code::hip_error` (520) | | Warp size | 32 | 64 (AMD GCN/CDNA) | | Shared memory | 48KB (default) | 64KB (typical) | | Architecture flag | `CMAKE_CUDA_ARCHITECTURES=80` | `CMAKE_HIP_ARCHITECTURES=gfx90a` | | Progress engine | CUDA event-based | HIP event-based | | NCCL integration | Yes | RCCL (ROCm equivalent) | ### Warp Size Differences AMD GPUs use a wavefront size of 64 (vs. NVIDIA's warp size of 32). This affects: - Warp-level primitives (`__ballot`, `__shfl`) - Shared memory bank conflicts - Occupancy calculations - Reduction patterns within a warp/wavefront ### Performance Considerations - AMD GPUs may have different optimal block sizes than NVIDIA GPUs - Memory coalescing rules are similar but not identical - Use `rocprof` instead of `nsight-compute` for profiling - Prefer `hipcc` for compilation to ensure proper HIP runtime linkage ## See Also - [CUDA Backend Guide](cuda_guide.md) — NVIDIA GPU support - [NCCL Backend](nccl_backend.md) — GPU-to-GPU collectives (NVIDIA) - [Backend Comparison](comparison.md) — Feature comparison across backends - [AMD ROCm Documentation](https://rocm.docs.amd.com/) — Official ROCm docs