pahole: Analysing Memory Layout of Complex Data Structures With Ease

Sunday, 5th November 2023: Putting together this blog post feels like a positive stride! As I mentioned in the previous post (core-to-core latency tool), I'm aiming to integrate more consistent writing into my routine. While it took a month to pen this down, it's progress from the previous year 😇. Hoping that the upward trend will continue...!

During the past few weekends, I've been reading about the capabilities of the perf C2C (Cache To Cache) tool. As I went through the documentation and examples, a tool called pahole caught my attention. I'm a bit surprised I hadn't stumbled upon it earlier. When optimizing scientific codes, we carefully examine the memory layout of data structures for cache efficiency. Although, as developers, we typically consider padding and alignment requirements when designing data structures, I had never used tools like pahole to inspect the memory layout of C/C++ data structures. Such a tool could be handy when working on a large codebase or on code developed by someone else, so I thought it would be a good candidate for a short blog post!

Background

I won't delve into the why and how of padding and alignment. Data structures can contain data members of various types and sizes, each subject to different alignment requirements. To make sure each data member satisfies these alignment requirements, compilers need to insert additional space, or "holes". As developers, we also want to look into data structure layout for several reasons:

  • Based on the target architecture's alignment requirements, compilers automatically add padding so that each data member satisfies the alignment requirement of its type. This can increase the total size of the object, and we want to be aware of that overhead.
  • Developers might want to explicitly add padding to achieve a specific alignment or memory layout. For example, CPUs fetch data in cache-line-sized blocks, so aligning data structures to reduce the chance of straddling cache lines can improve cache utilization.
  • Without proper padding, independent variables that share the same cache line can lead to false sharing. This causes unnecessary cache coherence traffic and can result in a significant performance penalty.
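
To make the first and last points concrete, here is a minimal C++ sketch. The Sample and PaddedCounter types are hypothetical (they are not taken from any real codebase), the offsets in the comments assume a typical x86-64 Linux ABI, and the 64-byte cacheline size is an assumption as well.

```cpp
#include <cstddef>  // std::size_t, offsetof

// Hypothetical type: a 1-byte flag followed by an 8-byte double.
struct Sample {
    char   flag;   // offset 0, size 1, followed by a 7-byte hole
    double value;  // offset 8, size 8 (double needs 8-byte alignment)
    int    count;  // offset 16, size 4, followed by 4 bytes of tail padding
};

// Padding overhead = total size minus the sum of the member sizes.
constexpr std::size_t padding_bytes() {
    return sizeof(Sample) - (sizeof(char) + sizeof(double) + sizeof(int));
}

// Assumed cacheline size; adjust for the target CPU.
constexpr std::size_t kCacheLine = 64;

// Explicit alignment: each counter occupies its own cacheline, so two
// threads updating neighbouring counters do not share a line.
struct alignas(kCacheLine) PaddedCounter {
    long value;
};

// The size of a type is always a multiple of its alignment.
static_assert(sizeof(PaddedCounter) == kCacheLine,
              "one counter per cacheline");
```

Under these assumptions, sizeof(Sample) is 24 bytes for only 13 bytes of data, so padding_bytes() returns 11.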

When developing data structures with these considerations in mind, an important aspect is to understand the memory layout of these structures. If you're dealing with your own codebase, you likely possess a good understanding of the existing choices. However, what if you encounter new code or complex data structures where manual inspection isn't a feasible option? How do you systematically explore and understand the memory layouts of these structures? This is precisely where pahole becomes invaluable!

Pahole In Action

The pahole (Poke-a-hole) tool is developed as part of the dwarves suite of utilities. When we compile applications with debugging information (e.g. the -g compiler flag), compilers include DWARF information in the object file or binary. This information includes details about types, variables, and their locations in memory. pahole reads this DWARF information and presents it in an intuitive and easily understandable form. This helps developers understand the memory layout, alignment, and padding of data structures and track down "holes" inserted by compilers.

The best way to understand its usefulness is by looking at the examples. So let's dive in!

Install

On Linux distributions like Ubuntu, we can use a standard system package manager to install dwarves:

Installing pahole from source is not difficult either:

With this, we have the pahole command available. Here are some of the CLI options that I found interesting or relevant to this blog post:

Examples

To demonstrate the use of pahole, let's consider the data structures below:

Feel free to disregard specific terminology used in the above structures. I'm using the terminology of ion channels, compartments, and neurons, both for a change and to have some meaningful types (from the computational neuroscience domain where I currently work).

The first step is to compile our example test.cpp with debugging information using the flag "-g":

This will produce an object, test.o, which can be used with pahole. Here are specific questions that can effectively demonstrate the utility of pahole.

What is the arrangement of structs in memory, and what is the size and location of paddings?

The first obvious task is to understand the memory layout of the data structure. This can be easily accomplished by invoking the pahole command with an object file. pahole adds C-style comments to explain the memory layout. Following each member, two numbers are provided within C-style comments, indicating the offset and size of the data member in bytes. If the compiler has introduced "holes," pahole mentions their sizes. At the end, a summary of sizes and holes is provided.

In the above example, we used the --class_name CLI option to only analyze the Channel type (for brevity). The output report is not difficult to understand:

  • name member starts at offset 0 and has a size of 4 bytes.
  • conductance starts at offset 8 and has a size of 8 bytes. It is preceded by a 4-byte hole due to the 8-byte alignment requirement of the double type.
  • id starts at offset 16 and has a size of 4 bytes.
  • rpotential, starting at offset 24 with an 8-byte size, is also preceded by a 4-byte hole to adhere to the memory alignment requirements for the double type.
  • current member starts at offset 32 and occupies 4 bytes.
  • prop, preceded by a 4-byte hole, starts at offset 40 with a size of 32 bytes.

Once all members are listed, the data structure summary is provided:

  • The total size of the struct (including padding) is 72 bytes and spans over 2 cachelines (assuming a default cacheline size of 64 bytes)
  • The combined size of all members is 60 bytes. There are a total of 3 holes and their total size is 12 bytes.
  • This type uses only 8 bytes of the second cacheline (denoted here as the "last cacheline")
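
The numbers reported above can also be cross-checked from within C++. The sketch below is only an approximation of the Channel type: the member names, offsets, and sizes come from the report, but the member types (and the Properties stand-in for the 32-byte prop member) are assumptions chosen to match those sizes on x86-64 Linux.

```cpp
#include <cstddef>  // std::size_t, offsetof

// Hypothetical stand-in for the type of `prop`: 32 bytes, 8-byte aligned.
struct Properties {
    double values[4];
};

// Member types are assumptions; only names, offsets, and sizes are known.
struct Channel {
    int        name;         // offset 0,  size 4, then a 4-byte hole
    double     conductance;  // offset 8,  size 8
    int        id;           // offset 16, size 4, then a 4-byte hole
    double     rpotential;   // offset 24, size 8
    float      current;      // offset 32, size 4, then a 4-byte hole
    Properties prop;         // offset 40, size 32
};

// Combined size of all members, ignoring padding.
constexpr std::size_t member_bytes() {
    return sizeof(Channel::name) + sizeof(Channel::conductance) +
           sizeof(Channel::id) + sizeof(Channel::rpotential) +
           sizeof(Channel::current) + sizeof(Channel::prop);
}

// Total size of the holes inserted by the compiler.
constexpr std::size_t hole_bytes() {
    return sizeof(Channel) - member_bytes();
}
```

With these assumed types, sizeof(Channel) is 72, member_bytes() is 60, and hole_bytes() is 12, matching the summary. Checks like static_assert(sizeof(Channel) == 72, "...") can then guard against accidental layout changes.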

Just as a bonus example, we can look at the output when the test is compiled for a 32-bit target architecture. There, the alignment requirement for a double is typically 4 bytes (as on 32-bit x86 Linux), which changes how the data structure is laid out.
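
Without a 32-bit toolchain at hand, the effect of a 4-byte alignment requirement can be approximated on a 64-bit machine by capping member alignment with #pragma pack(4). This is only an emulation and not the same as compiling for a 32-bit target (pointer and long sizes would also differ there), and the member types below are assumptions, as is the Properties32 stand-in for the prop member.

```cpp
#include <cstddef>  // std::size_t, offsetof

#pragma pack(push, 4)  // cap the alignment of every member at 4 bytes

// Hypothetical stand-in for the 32-byte `prop` member.
struct Properties32 {
    double values[4];
};

// Same assumed members as before, now with at most 4-byte alignment.
struct Channel32 {
    int          name;         // offset 0,  size 4
    double       conductance;  // offset 4,  size 8 (no hole before it)
    int          id;           // offset 12, size 4
    double       rpotential;   // offset 16, size 8 (no hole before it)
    float        current;      // offset 24, size 4
    Properties32 prop;         // offset 28, size 32
};

#pragma pack(pop)  // restore the default alignment rules

// Holes left in the emulated layout.
constexpr std::size_t hole_bytes32() {
    return sizeof(Channel32) - (3 * 4 + 2 * 8 + 32);
}
```

In this emulated layout every member follows the previous one directly, so the struct shrinks to 60 bytes with no holes.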

How can data members be rearranged for optimal memory layout?

In the preceding output, we observed that the Channel type extends over two cachelines, totaling 72 bytes, despite the actual size of its members being 60 bytes. What if we instruct pahole to rearrange the members for a more compact structure?

If we read the comments, it's evident what pahole has done. By moving 'id' from its original position after 'conductance' to follow 'name', it eliminated two 4-byte holes (8 bytes in total). As a result, the Channel type now fits neatly into a single cacheline. While this is a trivial example, I believe pahole can be quite practical for quickly experimenting with data structure reorganizations in more complex codebases.
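
Applied by hand, the reorganization described above might look like the sketch below. As before, the member types and the Properties stand-in are assumptions that match the reported sizes on x86-64 Linux; only the member order reflects what the text describes.

```cpp
#include <cstddef>  // std::size_t, offsetof

// Hypothetical stand-in for the 32-byte `prop` member.
struct Properties {
    double values[4];
};

// `id` now follows `name`, so the two 4-byte members share one
// 8-byte slot instead of each being followed by a hole.
struct ChannelReorganized {
    int        name;         // offset 0,  size 4
    int        id;           // offset 4,  size 4
    double     conductance;  // offset 8,  size 8
    double     rpotential;   // offset 16, size 8
    float      current;      // offset 24, size 4, then a 4-byte hole
    Properties prop;         // offset 32, size 32
};

// Assumed cacheline size of 64 bytes.
constexpr std::size_t kCacheLine = 64;

// True when the whole object fits into one (aligned) cacheline.
constexpr bool fits_in_one_cacheline() {
    return sizeof(ChannelReorganized) <= kCacheLine;
}
```

With these assumed types the struct shrinks from 72 to 64 bytes. One 4-byte hole remains before prop, but the object now occupies exactly one 64-byte cacheline.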

How will the data structure be laid out with a cacheline size of X bytes?

Nowadays, the most widely used Intel, AMD, and ARM CPUs have 64-byte cachelines, which makes this question less interesting. But CPUs with different cacheline sizes are not foreign to us (e.g. POWER9 with a 128-byte cacheline). pahole provides a CLI option, --cacheline_size, for exploring memory layouts with an arbitrary cacheline size.
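
The cacheline figures in the summary are simple arithmetic on the struct size. The sketch below reproduces that arithmetic. The CachelineUsage type and cacheline_usage function are hypothetical helpers rather than anything provided by pahole, and the calculation assumes the object starts on a cacheline boundary.

```cpp
#include <cstddef>  // std::size_t

// Hypothetical helper type describing how an object maps onto cachelines.
struct CachelineUsage {
    std::size_t cachelines;       // number of cachelines the object spans
    std::size_t last_line_bytes;  // bytes used in the last cacheline
};

// Assumes size > 0, line_size > 0, and an object that starts on a
// cacheline boundary.
constexpr CachelineUsage cacheline_usage(std::size_t size,
                                         std::size_t line_size) {
    return CachelineUsage{
        (size + line_size - 1) / line_size,                    // round up
        size % line_size == 0 ? line_size : size % line_size   // remainder
    };
}
```

For the 72-byte Channel type, cacheline_usage(72, 64) gives 2 cachelines with 8 bytes in the last one, while cacheline_usage(72, 128) gives a single cacheline holding all 72 bytes.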

Understanding nested data structures and their organization in memory layout

Until now, we have focused on a single type, Channel, for brevity. Let's now look at the Neuron type, which contains the Compartment and Channel types. We will use the --expand_types CLI option to show a detailed memory layout for the Neuron type:

Isn't that useful?
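
To see what an expanded view has to compute, note that the position of a nested member inside the outermost type is the sum of the offsets along the nesting path. The sketch below illustrates this with entirely hypothetical definitions: the real Compartment and Neuron types are not reproduced here, and the Channel member types are assumptions matching the sizes reported earlier on x86-64 Linux.

```cpp
#include <cstddef>  // std::size_t, offsetof

// Hypothetical stand-in for the 32-byte `prop` member.
struct Properties { double values[4]; };

// Assumed member types; sizeof(Channel) is 72 on x86-64 Linux.
struct Channel {
    int        name;
    double     conductance;
    int        id;
    double     rpotential;
    float      current;
    Properties prop;
};

// Hypothetical nesting: a compartment owns one channel.
struct Compartment {
    int     id;       // offset 0, then a 4-byte hole
    double  voltage;  // offset 8
    Channel channel;  // offset 16
};

// Hypothetical nesting: a neuron owns one compartment.
struct Neuron {
    int         id;    // offset 0, then a 4-byte hole
    Compartment soma;  // offset 8
};

// Offset of soma.channel.conductance from the start of a Neuron:
// add up the offsets of each level of nesting.
constexpr std::size_t conductance_offset_in_neuron() {
    return offsetof(Neuron, soma) + offsetof(Compartment, channel) +
           offsetof(Channel, conductance);
}
```

With these made-up definitions, the conductance member of the nested channel sits 8 + 16 + 8 = 32 bytes from the start of a Neuron, and the holes of every nested type add up in the size of the outer one.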


That's all for today! This is what I have gathered so far, and I believe it provides sufficient background information for you to dive into the details if you find this tool relevant to your needs. You can read more about this tool in the 2007 paper in the Proceedings of the Linux Symposium. The author has provided some other examples and use cases that you might find interesting.

By the way, when using pahole with shared libraries from a large production codebase, be sure to explore the CLI options for filtering data structures of particular interest. Alternatively, consider using pahole with individual object files. In my own project experimentation, a naive run of pahole libfoo.so produced 49k lines of output! 😀

Credit

I'm not very familiar with the developer community of performance tools outside HPC, but discovering and learning new tools like pahole is always exciting (even though git log says it was born 17 years ago!). This post is mostly for my self-learning, using examples and documentation put together by the community. So, full credit and gratitude to Arnaldo Carvalho de Melo for developing pahole and for his ongoing efforts! 🥂