Linux Perf: Measuring Specific Code Sections with Pause/Resume APIs

Sunday, 21st April 2024: While working with different HPC performance tools, Linux perf has always intrigued me. Over the years I have watched Brendan Gregg's talks with curiosity and learned a lot from various examples. As a performance engineer, I find Linux perf's capabilities fascinating. However, I have struggled to fully engage with it, for a few reasons: perf is often unavailable on HPC systems or cannot be used there without extra permissions, and it focuses primarily on single-server analysis rather than distributed applications running across a large number of servers.

When working on a local desktop, laptop, or server with the necessary permissions, a common step in optimizing an application is measuring performance metrics for specific code sections, known as regions of interest. Linux perf offers comprehensive performance analysis and profiling, but most examples cover the entire application or arbitrary time ranges rather than specific regions of interest. CLI options such as --delay msecs can postpone the start of measurement, but there is often a need for finer-grained control. In HPC, this level of control is something we use frequently.

Years ago, I remember looking for pause/resume APIs for perf, but I couldn't easily find a way to control the measurements programmatically. It was surprising to me that such APIs weren't there. Last month, while digging into details for my previous blog post, I stumbled upon this question, where Parsa Amini has provided a succinct but complete example. I also read through the patch discussion from July 2020, when Alexey Budankov contributed a feature that helps implement pause/resume functionality. After going through these resources and scanning the source files, it was clear how to use this option to pause and resume recording from within the application.

Even though this feature is already over three years old, introduced in 2020 with Linux kernel version 5.9, most of the older perf examples do not explicitly discuss it. I have searched for this myself multiple times, so I thought it would be a good candidate for a short summary blog post.

C++ Implementation

Let's dive straight into the implementation. We will start by creating the PerfManager class, which will utilize FIFOs to interact with perf. These FIFOs are established externally to the application, as we will explore later on. perf responds to enable and disable commands to control the recording of profiling data. Upon successful execution of these commands, perf sends an acknowledgment message in the form of ack\n.
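As a rough illustration of the enable/disable/ack exchange described above, here is a minimal sketch of such a helper. It is not the class from the original post: the constructor taking two already-open descriptors, the bool return values, and the private helper names are assumptions made for this example. NUL bytes in the reply are skipped on the assumption that some perf versions send the string terminator together with ack\n.

```cpp
#include <cassert>
#include <cerrno>
#include <cstddef>
#include <string>
#include <unistd.h>

// Hypothetical sketch of a pause/resume helper (names and layout are assumptions).
class PerfManager {
  public:
    // ctl_fd: descriptor perf reads commands from.
    // ack_fd: descriptor perf writes acknowledgments to (negative: do not wait).
    // A negative ctl_fd turns every call into a no-op, e.g. when running without perf.
    PerfManager(int ctl_fd, int ack_fd) : ctl_fd_(ctl_fd), ack_fd_(ack_fd) {}

    bool resume() { return command("enable\n"); }   // ask perf to enable events
    bool pause() { return command("disable\n"); }   // ask perf to disable events

  private:
    // Send one newline-terminated command and, if possible, wait for the ack.
    bool command(const std::string& cmd) {
        assert(!cmd.empty() && cmd.back() == '\n');
        if (ctl_fd_ < 0) return false;
        if (!write_all(cmd)) return false;
        return ack_fd_ < 0 || wait_for_ack();
    }

    // write() may be interrupted or partial, so loop until everything is sent.
    bool write_all(const std::string& cmd) {
        std::size_t done = 0;
        while (done < cmd.size()) {
            ssize_t n = ::write(ctl_fd_, cmd.data() + done, cmd.size() - done);
            if (n < 0) {
                if (errno == EINTR) continue;
                return false;
            }
            done += static_cast<std::size_t>(n);
        }
        return true;
    }

    // Read one line from the ack descriptor and compare it with "ack\n".
    // NUL bytes are ignored (assumption: perf may append the string terminator).
    bool wait_for_ack() {
        std::string reply;
        char c = 0;
        while (true) {
            ssize_t n = ::read(ack_fd_, &c, 1);
            if (n < 0 && errno == EINTR) continue;
            if (n <= 0) return false;   // error or end of file
            if (c == '\0') continue;
            reply.push_back(c);
            if (c == '\n') break;
        }
        return reply == "ack\n";
    }

    int ctl_fd_;
    int ack_fd_;
};
```

In an application, the two descriptors would typically be inherited from the process that launched perf, for example by passing their numbers through environment variables of your choosing. Waiting for the acknowledgment keeps the region of interest from starting before perf has actually enabled the events.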

We will reuse the dummy application code from our previous blog post for demonstration purposes. You can find it here.

By examining the main() function, the purpose of PerfManager is quite straightforward. Many applications contain initialization or finalization routines that are not typically relevant for optimizing the performance of compute-intensive kernels. Therefore, we are focusing our performance recording solely on the simulate() function, which is where the computational workload lies.
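The control flow described above can be sketched as follows. This is an illustration only, not the post's application: RecordingPerfManager is a stand-in that merely records calls (a real helper would send enable and disable to perf), and initialize(), simulate(), and run_application() are hypothetical placeholders.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in for the perf helper: it only records which call happened when.
// A real helper would send "enable"/"disable" to perf at these points.
struct RecordingPerfManager {
    std::vector<std::string>& trace;
    void resume() { trace.push_back("perf:enable"); }
    void pause() { trace.push_back("perf:disable"); }
};

// Hypothetical setup phase: not interesting for kernel optimization.
void initialize(std::vector<double>& data) { data.assign(1000, 1.0); }

// Hypothetical compute kernel: the region of interest.
double simulate(const std::vector<double>& data) {
    assert(!data.empty());
    double sum = 0.0;
    for (double v : data) sum += v * v;
    return sum;
}

// Body of a typical main(): only simulate() sits between resume() and pause().
template <typename Pmon>
double run_application(Pmon& pmon, std::vector<std::string>& trace) {
    std::vector<double> data;

    trace.push_back("initialize");
    initialize(data);              // excluded from the measurement

    pmon.resume();                 // measurement starts here
    trace.push_back("simulate");
    double result = simulate(data);
    pmon.pause();                  // measurement stops here

    trace.push_back("finalize");
    data.clear();                  // excluded from the measurement
    return result;
}
```

Calling run_application() with a RecordingPerfManager leaves a trace in which the enable and disable entries bracket simulate only, which is the property we want from the real measurement as well.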

Example in Action

Let's compile the application simply by:

Once we have the application binary ready, we need to set up the FIFOs through which perf and the application will interact. The script is self-explanatory with the help of comments:
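The same setup can also be driven from a small C++ launcher instead of a shell script. The sketch below is an alternative under stated assumptions, not the post's script: open_fifo(), perf_stat_command(), and exec_perf() are hypothetical helpers, and the environment variable names PERF_CTL_FD and PERF_ACK_FD are placeholders for however your application learns the descriptor numbers. The perf options used (-D -1 to start with events disabled, and --control fd:ctl,ack) are the ones documented in the perf-stat man page.

```cpp
#include <cassert>
#include <cerrno>
#include <cstdlib>
#include <fcntl.h>
#include <string>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include <vector>

// Create the FIFO if it does not exist and open it read-write.
// On Linux, O_RDWR lets open() return without waiting for a peer.
int open_fifo(const std::string& path) {
    if (::mkfifo(path.c_str(), 0600) != 0 && errno != EEXIST) return -1;
    return ::open(path.c_str(), O_RDWR);
}

// Build the perf command line: start with events disabled (-D -1) and
// let perf listen on ctl_fd and acknowledge on ack_fd.
std::vector<std::string> perf_stat_command(int ctl_fd, int ack_fd,
                                           const std::string& app) {
    assert(ctl_fd >= 0 && ack_fd >= 0);
    return {"perf", "stat", "-D", "-1",
            "--control",
            "fd:" + std::to_string(ctl_fd) + "," + std::to_string(ack_fd),
            "--", app};
}

// Replace the current process with perf. The open descriptors are inherited
// by perf and by the application it starts. Returns only if exec fails.
int exec_perf(const std::vector<std::string>& cmd, int ctl_fd, int ack_fd) {
    // Hypothetical variable names: the application would read these
    // to find the descriptors it should write to and read from.
    ::setenv("PERF_CTL_FD", std::to_string(ctl_fd).c_str(), 1);
    ::setenv("PERF_ACK_FD", std::to_string(ack_fd).c_str(), 1);

    std::vector<char*> argv;
    for (const std::string& arg : cmd) argv.push_back(const_cast<char*>(arg.c_str()));
    argv.push_back(nullptr);
    ::execvp(argv[0], argv.data());
    return -1;
}
```

A launcher's main() would open two FIFOs with open_fifo(), pass the resulting descriptors to perf_stat_command() together with the application path, and hand the result to exec_perf(). Swapping "stat" for "record" in the command should work the same way, since perf record accepts the same --control option.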

Once we run the above script, perf will output something like the following:

The initial lines indicate when perf enables and disables the profile recording, corresponding to the pmon.resume() and pmon.pause() calls in the main() function. After the application finishes, perf prints summary statistics for the recorded metrics. Note that these metrics cover only the region of interest for which we enabled profiling (the simulate() function in our case). The rest of the output should be self-explanatory if you are familiar with perf.

It's worth noting that perf record functions in the same manner:

This approach enables us to profile complex applications while maintaining a focus on specific regions of interest targeted for optimization. Revisit the previous post to see how you can leverage these pause/resume APIs.

Credits

I am not someone with a great understanding of the perf developer ecosystem, but as far as I know, Alexey Budankov introduced the --control fd: option to perf while working at Intel. So credit to Alexey for extending this invaluable tool (and also Intel VTune).
In addition, I would like to extend my appreciation to Parsa Amini, whose comprehensive example, provided in response to a StackOverflow question, served as a good reference for building upon the functionality outlined in the man page.
Thanks, Alexey and Parsa!