Pulse: A Profiling and Visualization Infrastructure for Heterogeneous Managed Systems.

Published in MPLR, 2026

TornadoVM is an open-source framework that enables Java applications to execute on heterogeneous hardware accelerators, such as GPUs and FPGAs. While TornadoVM simplifies programmability, understanding and optimizing the performance of applications on heterogeneous systems remains a significant challenge.

This paper introduces Pulse, a profiling infrastructure designed to collect, correlate, and visualize detailed performance metrics for application components offloaded to accelerators. At its core, Pulse includes a profiler that integrates information from both the managed runtime and the underlying hardware. The collected metrics include fine-grained execution timing, data-transfer costs, compilation overheads, and power consumption during accelerator execution. In addition, Pulse provides an interactive visualization layer that presents these metrics in an accessible and actionable manner, helping developers analyze performance bottlenecks and improve the efficiency of Java applications running on heterogeneous hardware through TornadoVM.

Using a Java-native implementation of the Llama3 inference pipeline (GPULlama3.java) as a case study, this paper demonstrates how Pulse facilitates profiling-guided optimization. Guided by the profiling information that Pulse generates, developers were able to refactor and fuse fine-grained GPU kernels, reducing kernel launches per layer from 19 to 13, cutting data-transfer cost by 38%, and improving per-layer execution time by 18.4% overall. Finally, the instrumentation overhead of Pulse is measured at 1.5% for coarse-grained microbenchmarks and up to 30% for workloads dominated by short-lived kernels such as GPULlama3.java, which represents a worst-case scenario for fine-grained profiling.
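
As a worked check of the figures above, the drop from 19 to 13 kernel launches per layer corresponds to a relative reduction of roughly 31.6%. The second expression is only a sketch of how instrumentation overhead is commonly defined; the symbols T_profiled and T_base are assumptions introduced here for illustration, and the paper may define overhead differently.

```latex
% Relative reduction in kernel launches per layer (figures from the abstract)
\[
\frac{19 - 13}{19} = \frac{6}{19} \approx 31.6\%
\]

% Assumed definition of instrumentation overhead, not taken from the paper:
% T_profiled = run time with Pulse enabled, T_base = run time without it
\[
\mathrm{overhead} = \frac{T_{\mathrm{profiled}} - T_{\mathrm{base}}}{T_{\mathrm{base}}}
\]
```

Under that assumed definition, a 30% overhead would mean that a profiled run takes about 1.3 times as long as an unprofiled one, while a 1.5% overhead corresponds to a factor of about 1.015.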
