HIP Backend Guide
Status: Production-Ready
Since: DTL 0.1.0-alpha.1
Last Updated: 2026-02-07
Overview
The HIP (Heterogeneous-Compute Interface for Portability) backend enables DTL to run on AMD GPUs via the ROCm software stack. HIP provides a CUDA-like programming model, and DTL’s HIP backend mirrors the CUDA backend’s architecture with AMD-specific adaptations.
Key capabilities:
- AMD GPU memory management via `hip_memory_space`
- Stream-based execution via `hip_executor` for asynchronous kernel dispatch
- API parity with the CUDA backend for most DTL operations
- Portability — HIP code can also be compiled for NVIDIA GPUs
Requirements
- ROCm 5.0 or later (6.x recommended)
- AMD GPU with GFX9 or later architecture (MI100, MI200, MI300, RX 7000 series)
- C++20 compiler (GCC 10+, Clang 14+ with ROCm support)
- CMake 3.21+ (for HIP language support)
CMake Configuration
Enable the HIP backend at configure time:
cmake -DDTL_ENABLE_HIP=ON \
-DCMAKE_HIP_ARCHITECTURES=gfx90a \
..
Common CMake Flags
| Flag | Default | Description |
|---|---|---|
| `DTL_ENABLE_HIP` | `OFF` | Enable HIP backend |
| `CMAKE_HIP_ARCHITECTURES` | Auto | Target GPU architectures (e.g., `gfx90a`) |
ROCm Installation
Install ROCm following AMD’s official documentation for your distro:
# Ubuntu 22.04+
sudo apt install rocm-dev hipcc
Verify the installation:
hipcc --version
rocminfo # Lists available AMD GPUs
CUDA to HIP Porting
API Mapping
HIP mirrors the CUDA API with a `hip` prefix. DTL abstracts most of these differences behind its backend interfaces:
| CUDA | HIP | DTL Abstraction |
|---|---|---|
| `cudaMalloc` | `hipMalloc` | `hip_memory_space::allocate` |
| `cudaFree` | `hipFree` | `hip_memory_space::deallocate` |
| `cudaMemcpy` | `hipMemcpy` | `hip_memory_transfer::copy` |
| `cudaMemcpyAsync` | `hipMemcpyAsync` | `hip_memory_transfer::copy_async` |
| `cudaStream_t` | `hipStream_t` | `hip_stream` |
| `cudaEvent_t` | `hipEvent_t` | `hip_event` |
| `cudaStreamSynchronize` | `hipStreamSynchronize` | `hip_executor::synchronize` |
| `cudaMallocManaged` | `hipMallocManaged` | `unified_memory` placement policy |
The hipify Tool
AMD provides hipify-perl and hipify-clang to automatically convert CUDA source to HIP:
# Convert a CUDA source file to HIP
hipify-perl my_kernel.cu > my_kernel.hip.cpp
# Or use the clang-based converter for more accuracy
hipify-clang my_kernel.cu -o my_kernel.hip.cpp
DTL’s backend code uses #if DTL_ENABLE_HIP guards rather than hipified CUDA code, ensuring each backend is a clean, first-class implementation.
Memory Management
hip_memory_space
The hip_memory_space class manages device memory allocations on AMD GPUs. It satisfies DTL’s MemorySpace concept.
#include <backends/hip/hip_memory_space.hpp>
dtl::hip::hip_memory_space mem_space;
// Allocate device memory
auto alloc_result = mem_space.allocate(1024);
if (alloc_result.has_value()) {
void* ptr = alloc_result.value();
// Use device memory...
mem_space.deallocate(ptr, 1024);
}
Properties:
- `host_accessible = false` — device memory is not directly host-accessible
- `device_accessible = true` — accessible from GPU kernels
- Supports per-device allocation
Memory Transfers
#include <backends/hip/hip_memory_transfer.hpp>
// Host to device
dtl::hip::hip_memory_transfer::copy(d_ptr, h_ptr, size,
dtl::hip::transfer_kind::host_to_device);
// Device to host
dtl::hip::hip_memory_transfer::copy(h_ptr, d_ptr, size,
dtl::hip::transfer_kind::device_to_host);
// Async copy on a stream
dtl::hip::hip_memory_transfer::copy_async(d_ptr, h_ptr, size,
dtl::hip::transfer_kind::host_to_device, stream);
Executor Patterns
hip_executor
The hip_executor provides stream-based asynchronous execution on AMD GPUs:
#include <backends/hip/hip_executor.hpp>
// Create executor with a non-blocking stream
dtl::hip::hip_stream stream(dtl::hip::stream_flags::non_blocking);
dtl::hip::hip_executor exec(std::move(stream));
// Execute work on the GPU
exec.execute([](hipStream_t s) {
my_kernel<<<grid, block, 0, s>>>(d_data, n);
});
// Wait for completion
exec.synchronize();
Stream Management
DTL provides RAII stream wrappers for HIP:
#include <backends/hip/hip_executor.hpp>
// Default stream
dtl::hip::hip_stream default_stream;
// Non-blocking stream (owned)
dtl::hip::hip_stream stream(dtl::hip::stream_flags::non_blocking);
// Wrap an existing hipStream_t (non-owning)
hipStream_t external_stream;
hipStreamCreate(&external_stream);
dtl::hip::hip_stream wrapped(external_stream, false); // does not destroy on destruct
Event Tracking
#include <backends/hip/hip_event.hpp>
dtl::hip::hip_event event;
event.record(stream);
// Check completion
if (event.query()) {
// Work is done
}
// Or block until done
event.synchronize();
DTL HIP-Specific Considerations
Placement Policies with HIP
When HIP is enabled but CUDA is not, the placement policies map to HIP equivalents:
- `device_only<N>` — allocates via `hipMalloc` on device N
- `unified_memory` — uses `hipMallocManaged` for managed memory
- `device_preferred` — uses unified memory with device-preferred hints
// HIP device allocation
dtl::distributed_vector<float, dtl::device_only<0>> vec(1000, ctx);
// HIP unified memory
dtl::distributed_vector<float, dtl::unified_memory> vec_unified(1000, ctx);
Build System
When building with HIP:
- Use `hipcc` as the compiler, or set up CMake's HIP language support
- ROCm's `amdclang++` can compile HIP sources with `--offload-arch=<gfx...>`
- DTL detects HIP availability via `find_package(hip)` in CMake
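Putting those pieces together, a consumer project might configure HIP roughly as follows. This is a hypothetical fragment; `hip::host` is the runtime target exported by ROCm's `hip` package, and the DTL target name is an assumption to be checked against DTL's installed package config:

```cmake
# Hypothetical consumer CMakeLists.txt for a HIP-enabled application.
cmake_minimum_required(VERSION 3.21)
project(my_hip_app LANGUAGES CXX HIP)

set(CMAKE_HIP_ARCHITECTURES gfx90a)   # match your GPU (see rocminfo)

find_package(hip REQUIRED)            # ROCm's HIP CMake package

add_executable(my_hip_app main.hip.cpp)
target_link_libraries(my_hip_app PRIVATE hip::host)   # HIP runtime linkage
# target_link_libraries(my_hip_app PRIVATE dtl::dtl)  # hypothetical DTL target
```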
Error Handling
HIP errors are mapped to DTL status codes:
// HIP errors translated to DTL status
dtl::status_code::hip_error // = 520
Known Differences from CUDA Backend
| Feature | CUDA Backend | HIP Backend |
|---|---|---|
| Device memory | `cudaMalloc` | `hipMalloc` |
| Unified memory | `cudaMallocManaged` | `hipMallocManaged` |
| Streams | `cudaStream_t` | `hipStream_t` |
| Events | `cudaEvent_t` | `hipEvent_t` |
| Error type | `cudaError_t` | `hipError_t` |
| Status code | `dtl::status_code::cuda_error` | `dtl::status_code::hip_error` (520) |
| Warp size | 32 | 64 (AMD GCN/CDNA) |
| Shared memory | 48KB (default) | 64KB (typical) |
| Architecture flag | `sm_XX` (e.g., `sm_80`) | `gfx...` (e.g., `gfx90a`) |
| Progress engine | CUDA event-based | HIP event-based |
| NCCL integration | Yes | RCCL (ROCm equivalent) |
Warp Size Differences
AMD GPUs use a wavefront size of 64 (vs. NVIDIA’s warp size of 32). This affects:
- Warp-level primitives (`__ballot`, `__shfl`)
- Shared memory bank conflicts
- Occupancy calculations
- Reduction patterns within a warp/wavefront
Performance Considerations
- AMD GPUs may have different optimal block sizes than NVIDIA GPUs
- Memory coalescing rules are similar but not identical
- Use `rocprof` instead of `nsight-compute` for profiling
- Prefer `hipcc` for compilation to ensure proper HIP runtime linkage
See Also
- CUDA Backend Guide — NVIDIA GPU support
- NCCL Backend — GPU-to-GPU collectives (NVIDIA)
- Backend Comparison — feature comparison across backends
- AMD ROCm Documentation — official ROCm docs