DeepSeek‑V4‑Flash runs on AMD MI300X within 10% of H100 latency

DeepSeek‑V4‑Flash is now operational on AMD’s MI300X accelerator.

A blog entry published on June 2 details the steps required to compile the model’s FlashAttention kernels for the MI300X, install the necessary ROCm libraries, and run inference with the model’s 7‑billion‑parameter checkpoint. The author reports batch‑size‑1 latency of roughly 120 ms for a 512‑token prompt, which is within 10% of the same workload on an NVIDIA H100. Memory usage peaks at 78 GB, well within the MI300X’s 192 GB of HBM3.
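For readers who want to check a latency figure like this on their own hardware, here is a minimal timing sketch. It is not the blog author's benchmark: `run_once` is a hypothetical stand-in for whatever callable performs one batch‑size‑1 forward pass on a 512‑token prompt, and the helper names are illustrative assumptions.

```python
import statistics
import time
from typing import Callable


def median_latency_ms(run_once: Callable[[], object],
                      warmup: int = 3,
                      iters: int = 20) -> float:
    """Return the median wall-clock latency of run_once() in milliseconds.

    run_once is a hypothetical callable that performs one inference pass.
    On a GPU it must block until the device has finished (for example by
    calling torch.cuda.synchronize() before returning); otherwise only the
    kernel launch time is measured.
    """
    for _ in range(warmup):
        run_once()  # discard warm-up runs (kernel compilation, cache fills)

    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        run_once()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def relative_gap(candidate_ms: float, baseline_ms: float) -> float:
    """Fractional latency difference: 0.10 means the candidate is 10% slower."""
    return (candidate_ms - baseline_ms) / baseline_ms
```

Running `median_latency_ms` once per accelerator and passing both results to `relative_gap` gives the kind of comparison the post describes. The post as summarized here does not state the H100 number itself, so the baseline has to come from your own measurement.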

This matters because most LLM deployments still depend on CUDA‑only software stacks. Demonstrating a viable AMD path opens the door for cost‑conscious teams to tap the MI300X’s lower TCO and higher memory bandwidth without rewriting models from scratch. It also nudges the rest of the ecosystem (frameworks, quantization tools, and inference servers) toward wider hardware support.

Even so, the setup remains fiddly: the post notes several patches to the FlashAttention repo and a custom ROCm driver version. Until AMD’s tooling catches up, early adopters will still need to wrestle with compatibility quirks.
