Sunday, 21st April 2024: While working with different HPC performance tools, I have always been intrigued by Linux perf. Over the years I have curiously watched Brendan Gregg's talks and have learned a lot from various examples. As a performance engineer, I find Linux perf's capabilities fascinating. However, I have struggled to fully engage with Linux perf. This disconnect can be attributed to various factors. For instance, perf is often unavailable or lacks the necessary permissions on HPC systems, and its focus is primarily on single-server analysis rather than on distributed applications running across a large number of servers.
While working on a local desktop/laptop or a server with the necessary permissions, a common scenario in optimizing an application involves measuring performance metrics for specific code sections known as regions of interest. While Linux perf offers comprehensive performance analysis and profiling, most examples focus on the entire application or arbitrary time ranges rather than specific regions of interest. One can utilize CLI options such as `--delay msecs`, but there is often a need for finer-grained control over performance metric measurements. In the realm of HPC, this level of control is something that we use frequently.
In the old days, I remember looking for pause/resume APIs for perf, but I couldn't easily find a way to control the measurements programmatically. It was surprising to me that such APIs weren't there. Last month, while digging into details for my last blog post, I stumbled upon this question, where Parsa Amini provided a succinct but complete example. I also read through the patch discussion from July 2020, when Alexey Budankov contributed a feature that could be used to implement pause/resume functionality. After looking through these resources and scanning the source files, it was clear how to use this option to pause and resume recording from the application.
Even though this feature is already over three years old, introduced in 2020 with Linux kernel version 5.9, most of the older perf examples do not explicitly discuss it. I have searched for this myself multiple times, so I thought it would be a good candidate for a short blog post to summarize.
C++ Implementation
Let's dive straight into the implementation. We will start by creating the `PerfManager` class, which will utilize FIFOs to interact with `perf`. These FIFOs are established externally to the application, as we will explore later on. `perf` responds to `enable` and `disable` commands to control the recording of profiling data. Upon successful execution of these commands, `perf` sends an acknowledgment message in the form of `ack\n`.
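The command/acknowledgment exchange described above can be tried out without `perf` itself. The following is a minimal sketch in which a stand-in function plays the role of `perf` on a pair of ordinary pipes. The names `Channels`, `send_command`, `fake_perf_step` and `wait_for_ack` are hypothetical and exist only for this illustration. The stand-in writes five bytes (`ack\n` plus the terminating NUL) because the `PerfManager` class in this post reads five bytes and compares them as a C string.

```cpp
#include <string>
#include <unistd.h>

// Hypothetical stand-in for the two FIFOs. Plain pipes are used so that the
// sketch runs without perf being present.
struct Channels {
    int ctl[2] = {-1, -1};  // ctl[1]: application writes, ctl[0]: "perf" reads
    int ack[2] = {-1, -1};  // ack[1]: "perf" writes, ack[0]: application reads

    bool create() { return pipe(ctl) == 0 && pipe(ack) == 0; }

    ~Channels() {
        for (int i = 0; i < 2; ++i) {
            if (ctl[i] >= 0) close(ctl[i]);
            if (ack[i] >= 0) close(ack[i]);
        }
    }
};

// Application side: send one command ("enable" or "disable").
bool send_command(const Channels& ch, const std::string& cmd) {
    const ssize_t n = write(ch.ctl[1], cmd.data(), cmd.size());
    return n == static_cast<ssize_t>(cmd.size());
}

// Stand-in for perf (an assumption, not perf's real code): read one command
// and answer with "ack\n" including the terminating NUL byte.
// Returns the command that was received, or an empty string on error.
std::string fake_perf_step(const Channels& ch) {
    char buf[16] = {};
    const ssize_t n = read(ch.ctl[0], buf, sizeof(buf) - 1);
    if (n <= 0) return "";
    const char ack[] = "ack\n";
    if (write(ch.ack[1], ack, sizeof(ack)) != static_cast<ssize_t>(sizeof(ack))) return "";
    return std::string(buf, static_cast<size_t>(n));
}

// Application side: block until the acknowledgment arrives and validate it.
bool wait_for_ack(const Channels& ch) {
    char buf[5] = {};
    const ssize_t n = read(ch.ack[0], buf, sizeof(buf));
    return n == 5 && std::string(buf) == "ack\n";
}
```

Calling `send_command`, `fake_perf_step` and `wait_for_ack` in that order completes one round trip. In the real setup the middle step is carried out by `perf` in a separate process.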
We will leverage a dummy application code from our previous blog post for demonstration purposes. You can find it here.
```cpp
#include <iostream>
#include <string>
#include <cassert>
#include <cstdlib>   // std::getenv
#include <unistd.h>
#include <cmath>
#include <cstring>

// Class for managing perf monitoring
class PerfManager {
    // control and ack fifo from perf
    int ctl_fd = -1;
    int ack_fd = -1;

    // if perf is enabled
    bool enable = false;

    // commands and acks to/from perf
    static constexpr const char* enable_cmd = "enable";
    static constexpr const char* disable_cmd = "disable";
    static constexpr const char* ack_cmd = "ack\n";

    // send command to perf via fifo and confirm ack
    void send_command(const char* command) {
        if (enable) {
            write(ctl_fd, command, strlen(command));
            // zero-initialized so the comparison is well defined on a short read
            char ack[5] = {};
            read(ack_fd, ack, 5);
            assert(strcmp(ack, ack_cmd) == 0);
        }
    }

  public:
    PerfManager() {
        // setup fifo file descriptors
        char* ctl_fd_env = std::getenv("PERF_CTL_FD");
        char* ack_fd_env = std::getenv("PERF_ACK_FD");
        if (ctl_fd_env && ack_fd_env) {
            enable = true;
            ctl_fd = std::stoi(ctl_fd_env);
            ack_fd = std::stoi(ack_fd_env);
        }
    }

    // public apis
    void pause() {
        send_command(disable_cmd);
    }

    void resume() {
        send_command(enable_cmd);
    }
};

// Sample Application
void dummy_work(int factor) {
    const size_t num_iter = 30000000 * factor;
    volatile double result = 0;
    for (size_t i = 0; i < num_iter; ++i) {
        result += std::exp(1.1);
    }
}

void initialize() {
    dummy_work(5);
}

void finalise() {
    dummy_work(8);
}

void timestep() {
    dummy_work(3);
}

void simulate() {
    for (int i = 0; i < 10; ++i) {
        timestep();
    }
}

int main(int argc, char **argv) {
    // pause profiling at the beginning
    PerfManager pmon;
    pmon.pause();

    initialize();

    // resume profiling for a region of interest
    pmon.resume();
    simulate();
    pmon.pause();

    finalise();

    return 0;
}
```
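A side note on the constructor in the listing above: `std::stoi` throws `std::invalid_argument` when the environment variable does not hold a number, which would terminate the application if the variables were set incorrectly. Below is a hedged sketch of a more defensive parse using `std::from_chars` (C++17). The helper name `parse_fd` is hypothetical and not part of the original code.

```cpp
#include <charconv>
#include <cstring>
#include <optional>
#include <system_error>

// Hypothetical helper: parse a non-negative file descriptor from a C string.
// Returns std::nullopt for null, empty, non-numeric, partially numeric, or
// negative input instead of throwing.
std::optional<int> parse_fd(const char* text) {
    if (text == nullptr || *text == '\0') return std::nullopt;

    int value = -1;
    const char* end = text + std::strlen(text);
    const auto [ptr, ec] = std::from_chars(text, end, value);

    // Reject conversion errors, trailing characters, and negative values.
    if (ec != std::errc{} || ptr != end || value < 0) return std::nullopt;
    return value;
}
```

With such a helper, the constructor could call `parse_fd(std::getenv("PERF_CTL_FD"))` and `parse_fd(std::getenv("PERF_ACK_FD"))`, and enable the manager only if both return a value, so that a malformed variable simply leaves profiling control switched off.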
The `main()` function makes the purpose of `PerfManager` quite clear. Many applications contain initialization or finalization routines that are not typically relevant for optimizing the performance of compute-intensive kernels. Therefore, we focus our performance recording solely on the `simulate()` function, which is where the computational workload lies.
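The resume/pause pair around `simulate()` in `main()` could also be expressed as a scope guard, so that recording is paused again even on an early return or an exception. This is only a sketch of one possible design. `ScopedRegion`, `RecordingManager` and `run_with_guard` are hypothetical names, and `RecordingManager` is a stand-in that logs calls instead of talking to `perf`.

```cpp
#include <string>
#include <vector>

// Hypothetical RAII helper: resumes recording on construction and pauses it
// again when the enclosing scope ends.
template <typename Manager>
class ScopedRegion {
    Manager& mgr_;

  public:
    explicit ScopedRegion(Manager& mgr) : mgr_(mgr) { mgr_.resume(); }
    ~ScopedRegion() { mgr_.pause(); }

    // A guard owns one region; copying it would pause twice.
    ScopedRegion(const ScopedRegion&) = delete;
    ScopedRegion& operator=(const ScopedRegion&) = delete;
};

// Stand-in for PerfManager (same pause()/resume() interface) that records
// the order of calls instead of writing to the control FIFO.
struct RecordingManager {
    std::vector<std::string> calls;
    void pause() { calls.push_back("pause"); }
    void resume() { calls.push_back("resume"); }
};

// Mirrors the structure of main() in the listing; the application phases
// are left as comments.
std::vector<std::string> run_with_guard() {
    RecordingManager pmon;
    pmon.pause();  // start with recording switched off
    // initialize();
    {
        ScopedRegion<RecordingManager> roi(pmon);  // resume
        // simulate();
    }  // roi is destroyed here, which pauses recording
    // finalise();
    return pmon.calls;
}
```

Because the guard is a template over the manager type, it would work unchanged with the `PerfManager` class from the listing, for example `ScopedRegion<PerfManager> roi(pmon);`.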
Example in Action
Let's compile the application simply by:
```bash
g++ sample_callgraph.cpp -O1 -g -o sample_callgraph
```
Once we have the application binary ready, we need to set up the FIFOs through which `perf` and the application will interact. The script is self-explanatory with the help of comments:
```bash
# name of the FIFOs
FIFO_PREFIX="perf_fd"

# remove dangling files if any
rm -rf ${FIFO_PREFIX}.*

# create two fifos
mkfifo ${FIFO_PREFIX}.ctl
mkfifo ${FIFO_PREFIX}.ack

# associate file descriptors
exec {perf_ctl_fd}<>${FIFO_PREFIX}.ctl
exec {perf_ack_fd}<>${FIFO_PREFIX}.ack

# set env vars for application
export PERF_CTL_FD=${perf_ctl_fd}
export PERF_ACK_FD=${perf_ack_fd}

# start perf with the associated file descriptors
perf stat \
    --event=task-clock,instructions,LLC-loads,LLC-stores \
    --delay=-1 \
    --control fd:${perf_ctl_fd},${perf_ack_fd} \
    -- ./sample_callgraph
```
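The `<>` redirection in the script opens each FIFO for both reading and writing. On Linux this lets the open succeed immediately, whereas a read-only or write-only open of a FIFO blocks until a peer opens the other end. The sketch below reproduces that step in C++ purely for illustration. `make_and_open_fifo` and `fifo_round_trip` are hypothetical helpers, and the temporary directory under `/tmp` is an assumption of the example, not something the script uses.

```cpp
#include <cstdlib>
#include <fcntl.h>
#include <string>
#include <sys/stat.h>
#include <unistd.h>

// Create a FIFO at `path` and open it read-write, similar to what
// `mkfifo file` followed by `exec {fd}<>file` does in the script.
// Returns the descriptor, or -1 on failure.
int make_and_open_fifo(const std::string& path) {
    if (mkfifo(path.c_str(), 0600) != 0) return -1;
    return open(path.c_str(), O_RDWR);
}

// Demonstration only: write a short message (under 64 bytes) into a FIFO in
// a fresh temporary directory and read it back through the same descriptor.
// Returns what was read, or an empty string on failure.
std::string fifo_round_trip(const std::string& message) {
    char dir_template[] = "/tmp/perf_fifo_demo_XXXXXX";
    char* dir = mkdtemp(dir_template);
    if (dir == nullptr) return "";

    const std::string path = std::string(dir) + "/demo.ctl";
    std::string result;

    const int fd = make_and_open_fifo(path);
    if (fd >= 0) {
        char buf[64] = {};
        const ssize_t written = write(fd, message.data(), message.size());
        if (written == static_cast<ssize_t>(message.size())) {
            const ssize_t n = read(fd, buf, sizeof(buf));
            if (n > 0) result.assign(buf, static_cast<size_t>(n));
        }
        close(fd);
    }

    // Clean up the FIFO and the temporary directory.
    unlink(path.c_str());
    rmdir(dir);
    return result;
}
```

In the actual workflow the two ends belong to different processes: the shell opens the descriptors, and both `perf` and the profiled application inherit them.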
Once we run the above script, `perf` will output something like the following:
```
Events disabled
Events enabled
Events disabled

 Performance counter stats for './sample_callgraph':

           1807.47 msec task-clock                #    0.696 CPUs utilized
        5405812469      cpu_core/instructions/
             18420      cpu_core/LLC-loads/       #   10.191 K/sec
              3984      cpu_core/LLC-stores/      #    2.204 K/sec

       2.596550438 seconds time elapsed
```
The initial lines indicate when `perf` disables and enables recording, corresponding to the `pmon.pause()` and `pmon.resume()` calls in the `main()` function. After the application finishes, `perf` prints summary statistics for the recorded metrics. It's important to note that these metrics cover only the region of interest for which we have enabled profiling (the `simulate()` function in our case). The rest of the output should be self-explanatory if you are familiar with `perf`.
It's worth noting that `perf record` works in the same manner:
```bash
perf record -g \
    --delay=-1 \
    --control fd:${perf_ctl_fd},${perf_ack_fd} \
    -- ./sample_callgraph

perf script | c++filt | gprof2dot -f perf --strip \
    | dot -Tpng -o sample_callgraph.png
```
This approach enables us to profile complex applications while maintaining a focus on specific regions of interest targeted for optimization. Revisit the previous post to see how you can leverage these pause/resume APIs.
Credits
I am not someone with a great understanding of the perf developer ecosystem, but as far as I know, Alexey Budankov introduced the `--control fd:` option to perf while working at Intel. So credit to Alexey for extending this invaluable tool (and also Intel VTune).
In addition, I would like to extend my appreciation to Parsa Amini, whose comprehensive example, provided in response to a Stack Overflow question, served as a good reference for building upon the functionality outlined in the man page.
Thanks Alexey and Parsa!