Linux Perf: Measuring Specific Code Sections with Pause/Resume APIs

Sunday, 21st April 2024: Throughout my work with different HPC performance tools, Linux perf has always intrigued me. Over the years I have watched Brendan Gregg's talks with curiosity and learned a lot from various examples. As a performance engineer, I find Linux perf's capabilities fascinating. However, I have struggled to fully engage with it, for several reasons: perf is often unavailable or lacks the necessary permissions on HPC systems, and its focus is primarily on single-server analysis rather than distributed applications running on a large number of servers.

While working on a local desktop/laptop or a server with the necessary permissions, a common scenario in optimizing an application involves measuring performance metrics for specific code sections known as regions of interest. While Linux perf offers comprehensive performance analysis and profiling, most examples focus on the entire application or arbitrary time ranges rather than specific regions of interest. One can utilize CLI options such as --delay msecs, but there is often a need for finer-grained control over performance metric measurements. In the realm of HPC, this level of control is something that we use frequently.

Years ago, I remember looking for pause/resume APIs for perf but couldn't easily find a way to control the measurements programmatically. It surprised me that such APIs weren't there. Last month, while digging into details for my last blog post, I stumbled upon this question, where Parsa Amini has provided a succinct but complete example. I also read through the patch discussion from July 2020, when Alexey Budankov contributed a feature that makes pause/resume functionality possible. After looking through these resources and scanning the source files, it was clear how to use this option to pause and resume recording from the application.

Even though this feature is already over three years old, introduced in 2020 with Linux kernel version 5.9, most of the older perf examples do not explicitly discuss it. I have searched for this myself multiple times, so I thought it would be a good candidate for a short summary blog post.

C++ Implementation

Let's dive straight into the implementation. We will start by creating the PerfManager class, which will utilize FIFOs to interact with perf. These FIFOs are established externally to the application, as we will explore later on. perf responds to enable and disable commands to control the recording of profiling data. Upon successful execution of these commands, perf sends an acknowledgment message in the form of ack\n.
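As a rough illustration of this handshake on its own, here is a minimal sketch of a free function that writes one command and waits for the acknowledgment. The name perf_command and its signature are my own invention for this example and are not part of perf. The sketch assumes the reply arrives in a single read, which is reasonable for such a short message on a FIFO, and it reports failure through its return value so that the check also survives builds where assert is compiled out.

```cpp
#include <cerrno>
#include <cstring>
#include <string>
#include <unistd.h>

// Hypothetical helper (not part of perf): send one control command on ctl_fd
// and wait for the reply on ack_fd. Returns true only if the reply starts
// with "ack\n".
bool perf_command(int ctl_fd, int ack_fd, const std::string& command) {
    // write the whole command, retrying on short writes and EINTR
    size_t written = 0;
    while (written < command.size()) {
        ssize_t n = write(ctl_fd, command.data() + written,
                          command.size() - written);
        if (n < 0) {
            if (errno == EINTR) continue;
            return false;
        }
        written += static_cast<size_t>(n);
    }

    // assumption: the short acknowledgment arrives in a single read
    char reply[8] = {};  // zero-filled, so it is always NUL-terminated
    ssize_t n;
    do {
        n = read(ack_fd, reply, sizeof(reply) - 1);
    } while (n < 0 && errno == EINTR);

    return n >= 4 && std::strncmp(reply, "ack\n", 4) == 0;
}
```

A caller would pass the two descriptors obtained from the environment, for example perf_command(ctl_fd, ack_fd, "enable"), and decide for itself how to react to a false return.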

For demonstration purposes, we will reuse the dummy application code from our previous blog post. You can find it here.


#include <iostream>
#include <string>
#include <cassert>
#include <unistd.h>
#include <cmath>
#include <cstring>
#include <cstdlib>   // std::getenv

// Class for managing perf monitoring
class PerfManager {

    // control and ack fifo from perf
    int ctl_fd = -1;
    int ack_fd = -1;

    // if perf is enabled
    bool enable = false;

    // commands and acks to/from perf
    static constexpr const char* enable_cmd = "enable";
    static constexpr const char* disable_cmd = "disable";
    static constexpr const char* ack_cmd = "ack\n";

    // send command to perf via fifo and confirm ack
    void send_command(const char* command) {
        if (enable) {
            write(ctl_fd, command, strlen(command));
            char ack[5] = {};  // zero-filled so strcmp() below always sees a terminated string
            read(ack_fd, ack, 5);
            assert(strcmp(ack, ack_cmd) == 0);
        }
    }

  public:

    PerfManager() {
        // setup fifo file descriptors
        char* ctl_fd_env = std::getenv("PERF_CTL_FD");
        char* ack_fd_env = std::getenv("PERF_ACK_FD");
        if (ctl_fd_env && ack_fd_env) {
            enable = true;
            ctl_fd = std::stoi(ctl_fd_env);
            ack_fd = std::stoi(ack_fd_env);
        }
    }

    // public apis

    void pause() {
        send_command(disable_cmd);
    }

    void resume() {
        send_command(enable_cmd);
    }
};

// Sample Application

void dummy_work(int factor) {
    const size_t num_iter = 30000000 * factor;
    volatile double result = 0;
    for (size_t i = 0; i < num_iter; ++i) {
        result += std::exp(1.1);
    }
}

void initialize() { dummy_work(5); }
void finalise() { dummy_work(8); }
void timestep() { dummy_work(3); }
void simulate() {
    for (int i = 0; i < 10; ++i) {
        timestep();
    }
}

int main(int argc, char **argv) {

    // pause profiling at the beginning
    PerfManager pmon;
    pmon.pause();

    initialize();

    // resume profiling for a region of interest
    pmon.resume();
    simulate();
    pmon.pause();

    finalise();
    return 0;
}

A look at the main() function makes the purpose of PerfManager clear. Many applications contain initialization and finalization routines that are not relevant when optimizing compute-intensive kernels. We therefore restrict performance recording to the simulate() function, where the computational workload lies.
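One possible refinement, shown here only as a sketch, is to wrap the resume/pause pair in a small RAII guard so that recording is paused again even when the region exits early. ScopedRegion is a hypothetical name introduced for this example. It is written as a template so that it works with any type offering resume() and pause(), including the PerfManager class above.

```cpp
// Hypothetical RAII helper (not from the original code): resumes recording
// when constructed and pauses it again when the enclosing scope ends,
// including on early return or exception.
template <typename Manager>
class ScopedRegion {
    Manager& manager;

  public:
    explicit ScopedRegion(Manager& m) : manager(m) { manager.resume(); }
    ~ScopedRegion() { manager.pause(); }

    // non-copyable: one guard owns exactly one resume/pause pair
    ScopedRegion(const ScopedRegion&) = delete;
    ScopedRegion& operator=(const ScopedRegion&) = delete;
};
```

With such a guard, the region of interest in main() could be written as a block that creates ScopedRegion<PerfManager> roi(pmon); and then calls simulate();. The explicit pmon.pause() at the start of main() would still be needed to disable recording before initialize().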

Example in Action

Let's compile the application simply by:


g++ sample_callgraph.cpp -O1 -g -o sample_callgraph

Once we have the application binary ready, we need to set up the FIFOs through which perf and the application will interact. The script is self-explanatory with the help of comments:


# name of the FIFOs
FIFO_PREFIX="perf_fd"

# remove dangling files if any
rm -rf ${FIFO_PREFIX}.*

# create two fifos
mkfifo ${FIFO_PREFIX}.ctl
mkfifo ${FIFO_PREFIX}.ack

# associate file descriptors
exec {perf_ctl_fd}<>${FIFO_PREFIX}.ctl
exec {perf_ack_fd}<>${FIFO_PREFIX}.ack

# set env vars for application
export PERF_CTL_FD=${perf_ctl_fd}
export PERF_ACK_FD=${perf_ack_fd}

# start perf with the associated file descriptors
perf stat \
    --event=task-clock,instructions,LLC-loads,LLC-stores \
    --delay=-1 \
    --control fd:${perf_ctl_fd},${perf_ack_fd} \
    -- ./sample_callgraph

Once we run the above script, perf will output something like the following:


Events disabled
Events enabled
Events disabled

Performance counter stats for './sample_callgraph':

           1807.47 msec task-clock                       #    0.696 CPUs utilized
        5405812469      cpu_core/instructions/
             18420      cpu_core/LLC-loads/              #   10.191 K/sec
              3984      cpu_core/LLC-stores/             #    2.204 K/sec

       2.596550438 seconds time elapsed

The initial lines indicate when perf enables and disables recording, corresponding to the pmon.pause() and pmon.resume() calls in main(). After the application finishes, perf prints summary statistics for the recorded metrics. It's important to note that these metrics cover only the region of interest for which we enabled profiling (the simulate() function in our case). The rest of the output should be self-explanatory if you are familiar with perf.

It's worth noting that perf record functions in the same manner:


perf record -g \
    --delay=-1 \
    --control fd:${perf_ctl_fd},${perf_ack_fd} \
    -- ./sample_callgraph
perf script | c++filt | gprof2dot -f perf --strip \
    | dot -Tpng -o sample_callgraph.png

This approach enables us to profile complex applications while maintaining a focus on specific regions of interest targeted for optimization. Revisit the previous post to see how you can leverage these pause/resume APIs.

Credits

I do not have a deep understanding of the perf developer ecosystem, but as far as I know, Alexey Budankov introduced the --control fd: option to perf while working at Intel. So credit goes to Alexey for extending this invaluable tool (and also Intel VTune).
In addition, I would like to extend my appreciation to Parsa Amini, whose comprehensive example, provided in response to a StackOverflow question, served as a good reference for building upon the functionality outlined in the man page.
Thanks, Alexey and Parsa!
