HIP Backend Guide

Status: Production-Ready Since: DTL 0.1.0-alpha.1 Last Updated: 2026-02-07

Overview

The HIP (Heterogeneous-Compute Interface for Portability) backend enables DTL to run on AMD GPUs via the ROCm software stack. HIP provides a CUDA-like programming model, and DTL’s HIP backend mirrors the CUDA backend’s architecture with AMD-specific adaptations.

Key capabilities:

AMD GPU memory management via hip_memory_space
Stream-based execution via hip_executor for asynchronous kernel dispatch
API parity with the CUDA backend for most DTL operations
Portability — HIP code can also target NVIDIA GPUs as a compilation target

Requirements

ROCm 5.0 or later (6.x recommended)
AMD GPU with GFX9 or later architecture (MI100, MI200, MI300, RX 7000 series)
C++20 compiler (GCC 10+, Clang 14+ with ROCm support)
CMake 3.21+ (for HIP language support)

CMake Configuration

Enable the HIP backend at configure time:

cmake -DDTL_ENABLE_HIP=ON \
      -DCMAKE_HIP_ARCHITECTURES=gfx90a \
      ..

Common CMake Flags

Flag	Default	Description
`DTL_ENABLE_HIP`	`OFF`	Enable HIP backend
`CMAKE_HIP_ARCHITECTURES`	Auto	Target GPU architectures (e.g., `gfx908;gfx90a;gfx942`)

ROCm Installation

Install ROCm following AMD’s official documentation for your distro:

# Ubuntu 22.04+
sudo apt install rocm-dev hipcc

Verify the installation:

hipcc --version
rocminfo  # Lists available AMD GPUs

CUDA to HIP Porting

API Mapping

HIP mirrors the CUDA API with a hip prefix. DTL abstracts most of these differences behind its backend interfaces:

CUDA	HIP	DTL Abstraction
`cudaMalloc`	`hipMalloc`	`hip_memory_space::allocate()`
`cudaFree`	`hipFree`	`hip_memory_space::deallocate()`
`cudaMemcpy`	`hipMemcpy`	`hip_memory_transfer::copy()`
`cudaMemcpyAsync`	`hipMemcpyAsync`	`hip_memory_transfer::copy_async()`
`cudaStreamCreate`	`hipStreamCreate`	`hip_stream` RAII wrapper
`cudaEventCreate`	`hipEventCreate`	`hip_event` RAII wrapper
`cudaDeviceSynchronize`	`hipDeviceSynchronize`	`hip_executor::synchronize()`
`cudaGetDeviceCount`	`hipGetDeviceCount`	`dtl::hip::device_count()`

The hipify Tool

AMD provides hipify-perl and hipify-clang to automatically convert CUDA source to HIP:

# Convert a CUDA source file to HIP
hipify-perl my_kernel.cu > my_kernel.hip.cpp

# Or use the clang-based converter for more accuracy
hipify-clang my_kernel.cu -o my_kernel.hip.cpp

DTL’s backend code uses #if DTL_ENABLE_HIP guards rather than hipified CUDA code, ensuring each backend is a clean, first-class implementation.

Memory Management

`hip_memory_space`

The hip_memory_space class manages device memory allocations on AMD GPUs. It satisfies DTL’s MemorySpace concept.

#include <backends/hip/hip_memory_space.hpp>

dtl::hip::hip_memory_space mem_space;

// Allocate device memory
auto alloc_result = mem_space.allocate(1024);
if (alloc_result.has_value()) {
    void* ptr = alloc_result.value();
    // Use device memory...
    mem_space.deallocate(ptr, 1024);
}

Properties:

host_accessible = false — device memory is not directly host-accessible
device_accessible = true — accessible from GPU kernels
Supports per-device allocation

Memory Transfers

#include <backends/hip/hip_memory_transfer.hpp>

// Host to device
dtl::hip::hip_memory_transfer::copy(d_ptr, h_ptr, size,
    dtl::hip::transfer_kind::host_to_device);

// Device to host
dtl::hip::hip_memory_transfer::copy(h_ptr, d_ptr, size,
    dtl::hip::transfer_kind::device_to_host);

// Async copy on a stream
dtl::hip::hip_memory_transfer::copy_async(d_ptr, h_ptr, size,
    dtl::hip::transfer_kind::host_to_device, stream);

Executor Patterns

`hip_executor`

The hip_executor provides stream-based asynchronous execution on AMD GPUs:

#include <backends/hip/hip_executor.hpp>

// Create executor with a non-blocking stream
dtl::hip::hip_stream stream(dtl::hip::stream_flags::non_blocking);
dtl::hip::hip_executor exec(std::move(stream));

// Execute work on the GPU
exec.execute([](hipStream_t s) {
    my_kernel<<<grid, block, 0, s>>>(d_data, n);
});

// Wait for completion
exec.synchronize();

Stream Management

DTL provides RAII stream wrappers for HIP:

#include <backends/hip/hip_executor.hpp>

// Default stream
dtl::hip::hip_stream default_stream;

// Non-blocking stream (owned)
dtl::hip::hip_stream stream(dtl::hip::stream_flags::non_blocking);

// Wrap an existing hipStream_t (non-owning)
hipStream_t external_stream;
hipStreamCreate(&external_stream);
dtl::hip::hip_stream wrapped(external_stream, false);  // does not destroy on destruct

Event Tracking

#include <backends/hip/hip_event.hpp>

dtl::hip::hip_event event;
event.record(stream);

// Check completion
if (event.query()) {
    // Work is done
}

// Or block until done
event.synchronize();

DTL HIP-Specific Considerations

Placement Policies with HIP

When HIP is enabled but CUDA is not, the placement policies map to HIP equivalents:

device_only<N> — Allocates via hipMalloc on device N
unified_memory — Uses hipMallocManaged for managed memory
device_preferred — Uses unified memory with device-preferred hints

// HIP device allocation
dtl::distributed_vector<float, dtl::device_only<0>> vec(1000, ctx);

// HIP unified memory
dtl::distributed_vector<float, dtl::unified_memory> vec_unified(1000, ctx);

Build System

When building with HIP:

Use hipcc as the compiler or set up CMake’s HIP language support
ROCm’s amdclang++ can compile HIP sources with --offload-arch=<gfx...>
DTL detects HIP availability via find_package(hip) in CMake

Error Handling

HIP errors are mapped to DTL status codes:

// HIP errors translated to DTL status
dtl::status_code::hip_error  // = 520

Known Differences from CUDA Backend

Feature	CUDA Backend	HIP Backend
Device memory	`cudaMalloc`	`hipMalloc`
Unified memory	`cudaMallocManaged`	`hipMallocManaged`
Streams	`cudaStream_t`	`hipStream_t`
Events	`cudaEvent_t`	`hipEvent_t`
Error type	`cudaError_t`	`hipError_t`
Status code	`status_code::cuda_error` (510)	`status_code::hip_error` (520)
Warp size	32	64 (AMD GCN/CDNA)
Shared memory	48KB (default)	64KB (typical)
Architecture flag	`CMAKE_CUDA_ARCHITECTURES=80`	`CMAKE_HIP_ARCHITECTURES=gfx90a`
Progress engine	CUDA event-based	HIP event-based
NCCL integration	Yes	RCCL (ROCm equivalent)

Warp Size Differences

AMD GPUs use a wavefront size of 64 (vs. NVIDIA’s warp size of 32). This affects:

Warp-level primitives (__ballot, __shfl)
Shared memory bank conflicts
Occupancy calculations
Reduction patterns within a warp/wavefront

Performance Considerations

AMD GPUs may have different optimal block sizes than NVIDIA GPUs
Memory coalescing rules are similar but not identical
Use rocprof instead of nsight-compute for profiling
Prefer hipcc for compilation to ensure proper HIP runtime linkage