▲ 1 Custom FP4 CUDA Kernel – 129 Tflops on DGX Spark with Pre-Quantized Weight Cache (forums.developer.nvidia.com) by vkaufmann | Feb 25, 2026 | 1 comment on HN