Linux Perf: Measuring Specific Code Sections with Pause/Resume APIs

Sunday, 21st April 2024: While working with different HPC performance tools, Linux perf has always intrigued me. Over the years I have curiously watched Brendan Gregg's talks and have learned a lot from various examples. As a performance engineer, I find Linux perf's capabilities fascinating. However, I have struggled to fully engage with Linux perf. This disconnect can be attributed to various factors. For instance, perf is often unavailable or lacks necessary permissions on HPC systems, its focus is primarily on single-server analysis rather than distributed applications executing on a large number of servers, etc.… Read the rest

Navigating the Complexity of Large Codebases Using VTune + xdot (or perf + gprof2dot)

Sunday, 7th April 2024: Back in September 2013, when I started my journey at the Blue Brain, I was navigating relatively large codebases for the first time. I was eager to gain a comprehensive understanding of code structures, their execution workflow, and performance aspects. During this period I started using Intel VTune with xdot/gprof2dot and found it extremely useful. With this combination, I could generate detailed execution call graphs of applications and then sit down with the senior engineers to dive deep into both the structural and performance aspects of the code.… Read the rest

LinkTest : Measuring Communication Latency and Bandwidth At Scale

October 2020: What makes supercomputers special? They have state-of-the-art processors, fast parallel file systems, specialized power & cooling infrastructure and a complex software stack to run. But a high-speed interconnect that tightly integrates thousands of nodes is what differentiates a supercomputer from a commodity cluster. Data movement within a node or across nodes is an important aspect of many scientific applications, and hence a low-latency, high-bandwidth interconnect is one of the key elements of HPC systems.

Setting up and performance-tuning such a system with tens of thousands of nodes is not an easy task, especially during the early days of deployment and acceptance benchmarking, when we often have to run various tests for weeks to identify issues, fix them and reach expected performance.… Read the rest

Understanding CPU Architecture And Performance Using LIKWID

March 2020: I had been planning to write about CPU microarchitecture analysis for a long time. I started writing this post more than a year ago, just before the beginning of COVID-19. But with so many things happening around (and new parenting responsibilities 👧), this got delayed for quite a long time. I am finally getting some weekend time to get this out!

Like previous blog posts, this also became longer and longer as I started writing details.… Read the rest

I/O Performance Analysis with Darshan

When optimizing parallel applications at scale, we often focus on computation and communication aspects, and I/O gets limited attention. With the increasing performance gap between compute and I/O subsystems, improving I/O performance remains one of the major challenges. As the filesystem is a shared resource, a few jobs running on a system can significantly impact the performance of other applications. In such a scenario, even if we use a profiling tool (see list here) to identify slow I/O routines, it's difficult to understand the real cause. For example, there might be other applications dominating the filesystem, resulting in poor I/O performance.… Read the rest

Blade : Cube’s OTF2 Trace Visualizer

My first experience with the Vampir trace visualiser was in 2010 during my studies at EPCC. While working on the exercises and samples, I was excited by the possibility of finding out what every process or thread (from thousands) is doing at any point in time. Over the years I have used the TAU + Score-P + Vampir toolset with different applications on various systems. When it comes to trace visualisation for scientific applications at scale, Vampir is very impressive. If you haven't used it before, give it a try!

One of the missing pieces in the profiling toolset (in my opinion) is an open source alternative for OTF2 trace visualisation.… Read the rest

Summary Of Python Profiling Tools – Part I

If you are working in the area of scientific computing, in academia or industry, most likely you are using Python in some form. Traditionally Python is described as slow when it comes to performance, and there are a number of discussions about its speed compared to native C/C++ applications [1] [2]. The goal of this post is not to argue about performance but to summarise various tools that can help to identify performance bottlenecks before coming to such conclusions. In the previous post, I summarised more than 90 profiling tools that can be used for analysing the performance of C/C++/Fortran applications.… Read the rest

Summary of Profiling Tools for Parallel Applications

Many scientific/industrial applications run on everything from workstations to the largest supercomputers in the world. With the continuous evolution of hardware platforms, achieving good performance is a challenging task. There are many profiling tools available to analyse and optimise performance. But not all tools/methods are available on every platform, especially in high performance computing. The first step in a performance engineering workflow is to understand which tools are available and when they can be used. There is no one-size-fits-all solution: some tools are designed with a broad feature list for high-level analysis, and others target a specific platform with low-level hardware metrics.

While choosing a profiling tool, one needs to consider different aspects:

  • Goal : Are you interested in high-level performance metrics?
Read the rest