Blackwell‑Optimised Diffusion Engine & Perf Harness
A production‑grade CUDA solver bundle targeting Blackwell‑class GPUs, combining:
Performance test harness – runs a suite of 3D grid sizes, measures seconds per step, grid points per second, and device memory usage, and writes the results to a CSV file (perf_results.csv) so benchmarks can be reproduced.
Mixed‑precision diffusion kernels – FP16 field storage with FP32 accumulation, for roughly 1.5–3× higher throughput than FP32‑only kernels without giving up stability.
Geometric multigrid V‑cycle – a shared‑memory smoother and a multilevel grid hierarchy that replace plain Jacobi iteration in implicit diffusion solves and cut iteration counts by roughly 5–20×.
Integration hooks + script – ready‑to‑paste TuringSolver entry points and a one‑command perf suite script to build, run, and compare on the Blackwell SKU.
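The harness metrics listed above reduce to simple arithmetic, which the following host-side Python sketch illustrates. It is not the bundle's harness: the column names, the `measure` helper, the stand-in `fake_step` workload, and the memory estimate (two FP16 buffers per grid) are all assumptions made for illustration. Only the file name perf_results.csv comes from the description.

```python
import csv
import io
import time


def measure(step_fn, grid, steps):
    """Time `steps` calls of step_fn and derive per-grid metrics."""
    nx, ny, nz = grid
    points = nx * ny * nz
    start = time.perf_counter()
    for _ in range(steps):
        step_fn(points)
    elapsed = time.perf_counter() - start
    sec_per_step = elapsed / steps
    return {
        "nx": nx,
        "ny": ny,
        "nz": nz,
        "steps": steps,
        "sec_per_step": sec_per_step,
        # Throughput: grid points advanced per second of wall time.
        "points_per_sec": points / sec_per_step if sec_per_step > 0 else float("inf"),
        # Estimate only (assumption): two FP16 buffers, 2 bytes per value.
        "est_field_bytes": points * 2 * 2,
    }


def fake_step(points):
    """Stand-in workload (hypothetical); a real harness would launch a kernel."""
    return sum(range(points // 64))


grids = [(16, 16, 16), (32, 32, 32)]
rows = [measure(fake_step, g, steps=5) for g in grids]

# Written to an in-memory buffer here; a real run would open
# "perf_results.csv" for writing instead.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

On a GPU, kernel launches are asynchronous, so a timer like this would need to bracket a device synchronisation (or use device-side event timers) for the seconds-per-step figure to be meaningful.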
Position this as a discrete, license‑ready bundle for GPU vendors, cloud HPC teams, or simulation OEMs who want provable Blackwell‑class performance out of the box.
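The idea behind the mixed‑precision kernels (FP16 storage, FP32 accumulation) can be illustrated with a small host-side Python sketch. This is an emulation under stated assumptions, not the bundle's CUDA kernel: it uses a 1D explicit stencil, the helper names `round_f16`, `round_f32`, and `diffusion_step` are hypothetical, and the arithmetic runs in Python doubles that are then rounded to FP32, which approximates but is not bit-identical to native FP32 math.

```python
import struct


def round_f16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def round_f32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def diffusion_step(field_f16, r):
    """One explicit 1D diffusion step with fixed end values.

    field_f16 holds FP16-representable values; r = alpha * dt / dx**2
    and should stay at or below 0.5 for the explicit scheme to be stable.
    """
    out = list(field_f16)
    for i in range(1, len(field_f16) - 1):
        # Widen the stored values to the accumulation precision.
        left = round_f32(field_f16[i - 1])
        mid = round_f32(field_f16[i])
        right = round_f32(field_f16[i + 1])
        # Accumulate the stencil at (emulated) FP32 precision.
        acc = round_f32(mid + r * (left - 2.0 * mid + right))
        # Narrow once, only when writing back to storage.
        out[i] = round_f16(acc)
    return out


n = 33
field = [0.0] * n
field[n // 2] = 1.0  # unit spike in the middle of the domain

for _ in range(50):
    field = diffusion_step(field, r=0.25)

print("bytes per stored value:", struct.calcsize("<e"), "vs FP32:", struct.calcsize("<f"))
print("peak after 50 steps:", max(field))
```

The storage format halves the bytes moved per grid point, which is where a bandwidth-bound stencil would be expected to gain, while the update itself is accumulated at the wider precision and rounded down only once per write.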