Summary of Debugging Tools for Parallel Applications

Nowadays it’s not uncommon to run parallel applications with hundreds of thousands of processes on supercomputing platforms. Debugging these parallel applications with sporadic crashes, deadlocks, memory errors or incorrect results is a challenging task. There are a number of tools available that help identify and fix bugs, but one needs to understand the tools, their capabilities and when they can be used. This post summarises various debugging tools (open source as well as commercial).

Note that not all tools can be used with distributed applications. For example, open source tools like GDB and Valgrind are commonly used for debugging serial or multi-threaded applications. However, one can debug small scale MPI applications by launching multiple GDB instances with the help of the MPI launcher and a terminal emulator like xterm. Similarly, Valgrind can be used to debug distributed applications with the help of tools like MemcheckView. In certain cases the issue can be reproduced at small scale, where serial tools are helpful. Hence, along with better commercial alternatives, many serial tools are included in this post.

If you are interested in performance analysis, take a look at the Summary of Profiling Tools post, which summarises more than 90 profiling tools for performance analysis and optimisation.

The tools below are listed in alphabetical order.

Abnormal Termination Processing (ATP)

Summary ATP is a tool developed by Cray to help debug applications at scale. When an application crashes with tens or hundreds of thousands of ranks, generating and analysing core dumps from every rank is not practical. In the event of an application crash, ATP performs analysis on the dying application: the stack backtraces of all ranks are gathered into a merged stack backtrace tree and written to disk as a single dot file. This dot file can then be visualised with the stat-view tool from STAT and gives a concise yet comprehensive view of what the application was doing at the time of its termination. ATP uses MRNet to co-ordinate analysis at scale and the Stackwalker API to collect the backtraces. It can be configured to attach a debugger like Totalview to perform detailed debugging.
Platforms Cray systems
License Required Yes
Documentation User Guide
When To Use • Your application is failing on a Cray system at scale and generating/analysing core dumps from thousands of ranks is not possible • You want a lightweight monitoring framework that merges stacks on application crash and dumps out concise information in a readable format
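
To make the idea of a merged stack backtrace tree concrete, here is a minimal sketch in Python. It is not ATP's implementation and does not use MRNet or the Stackwalker API; the function names and the example backtraces are hypothetical. It only illustrates how per-rank backtraces can be folded into one tree and written out as DOT text.

```python
from collections import defaultdict


def merge_backtraces(backtraces):
    """Merge per-rank backtraces into a prefix tree.

    `backtraces` maps rank -> list of frame names, outermost frame first.
    Returns a dict mapping each call path (a tuple of frames) to the sorted
    list of ranks whose backtrace passes through that path.
    """
    tree = defaultdict(list)
    for rank, frames in sorted(backtraces.items()):
        for depth in range(1, len(frames) + 1):
            tree[tuple(frames[:depth])].append(rank)
    return dict(tree)


def to_dot(tree):
    """Render the merged tree as Graphviz DOT text, one edge per call."""
    lines = ["digraph merged_backtrace {"]
    for path in sorted(tree):
        ranks = tree[path]
        parent = "/".join(path[:-1]) or "<root>"
        child = "/".join(path)
        label = f"{len(ranks)} rank(s): {ranks}"
        lines.append(f'  "{parent}" -> "{child}" [label="{label}"];')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical backtraces from four ranks at the time of a crash.
example = {
    0: ["main", "solve", "MPI_Allreduce"],
    1: ["main", "solve", "MPI_Allreduce"],
    2: ["main", "solve", "MPI_Allreduce"],
    3: ["main", "write_output", "abort"],
}
merged = merge_backtraces(example)
print(to_dot(merged))
```

In this toy case the tree immediately shows that three ranks were waiting in a collective while one rank aborted elsewhere, which is the kind of overview a single merged file gives compared to thousands of core dumps.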

AQtime

Summary AQtime is a profiling and debugging tool developed by SmartBear Software. It helps to identify performance bottlenecks, memory leaks and code coverage gaps, and to track resource utilisation. AQtime provides multiple modes for performance analysis: one can start with lightweight sampling and then drill deeper into the hotspots using the more accurate profiling mode. It allows monitoring threads (Windows, .NET and COM) and profiling on a per-thread basis. AQtime can be used as a standalone performance profiler or integrated into IDEs like Microsoft Visual Studio and RAD Studio, allowing analysis of the application without leaving the development environment.
Platforms Windows
Languages/Models C, C++, Delphi, .NET, Java
License Required Yes
Documentation User Guide
When To Use You have a serial or threaded application and are looking for a debugger or profiler on the Windows platform

Archer

Summary Archer is an open source, portable data race detector for OpenMP applications. It combines static and dynamic analysis techniques to detect data races with high accuracy and low runtime overhead, even for large codebases. Archer is built on top of open source tools such as the LLVM-Clang OpenMP runtime, ThreadSanitizer and Polly. It helps to find data races and non-deterministic application behaviour (crashes, program exceptions, wrong results, etc.) that are hard to find with traditional debugging tools. Depending on the application workload, the runtime could slow down by 2x-20x.
Platforms Unix/Linux, Mac OS, Windows
Languages/Models C, C++, OpenMP
License Required No
Documentation GitHub, README
When To Use You have an OpenMP application and are looking for an open source tool to find nasty data races and non-deterministic application behaviour

ARM Distributed Debugging Tool (DDT)

Summary Arm DDT is a graphical debugger developed by Allinea Software (now part of ARM). It is primarily used for debugging parallel applications on clusters and supercomputing platforms. DDT supports simultaneous debugging on heterogeneous architectures, for example CPU and GPU code together. It helps to find memory issues including out-of-bound accesses and memory leaks. DDT provides an intuitive GUI to browse source code and to examine scalar/array variables and call stacks across processes in a single view. It provides remote debugging functionality and supports an offline mode to debug applications non-interactively with batch systems. DDT uses GDB underneath. Along with Totalview, DDT is a commonly used debugger on supercomputing platforms.
Platforms Linux (Intel x86, Intel Xeon Phi, IBM Power, Arm, NVIDIA GPU)
Languages/Models C, C++, Fortran, MPI, OpenMP, OpenACC, CUDA, UPC
License Required Yes
Documentation User Guide
When To Use You are looking for a debugging tool for serial, multi-threaded or parallel applications on a desktop, a cluster or the largest supercomputers

bgq_stack

Summary bgq_stack is a utility to print a symbolic version of the stack from ASCII core files. When an application terminates abnormally on the BlueGene platform, core files (one for each rank) are generated in plain text format. The core file contains frame addresses representing the function call stack, which can help to identify the line and function executing at the time of program termination. bgq_stack generates the source file and line number from the core file using debug information from the executable and the addr2line utility.
Platforms IBM BlueGene-Q
License Required No
Documentation ALCF Instructions
When To Use You have a text core file generated on a BlueGene system and need to identify which line of which routine was executing when the error occurred

Line Mode Debugger (LGDB) and Cray Comparative Debugger (CCDB)

Summary LGDB and CCDB are debugging tools developed by Cray. LGDB is a GDB-based parallel debugger with a command line interface. In addition to many GDB features, it includes extensions to handle parallel execution. CCDB is a GUI tool for comparative debugging and uses LGDB underneath. It allows the user to run two versions of the application simultaneously: one that generates correct results and another that generates incorrect results. The user can define expressions to be compared between the runs. By comparing the data structures of the two, CCDB can help to identify the location where the two codes start to differ from each other. This methodology can be used across different run-time environments or different hardware, for example when a code is ported from CPU to GPU.
Platforms Cray systems
Languages/Models C, C++, Fortran, MPI
License Required Yes
Documentation Cray Debugger User Guide
When To Use • You want to debug an application by comparing it against a working older version • You want to run two versions side by side, possibly at different scale or on different hardware, and compare data structures between them
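
The comparative debugging idea can be sketched in a few lines of Python. This is only an illustration of the methodology, not CCDB's interface: the function name, the snapshot layout and the example data are assumptions made for this sketch. Given the same data structure captured at each step of a "good" and a "bad" run, it reports the first place where the two differ beyond a tolerance.

```python
import math


def first_divergence(reference, candidate, rel_tol=1e-9, abs_tol=0.0):
    """Return (step, index, ref_value, cand_value) for the first mismatch.

    `reference` and `candidate` are sequences of snapshots, one per time
    step; each snapshot is a sequence of floats. Returns None when all
    compared values agree within the tolerance.
    """
    for step, (ref_snap, cand_snap) in enumerate(zip(reference, candidate)):
        for index, (r, c) in enumerate(zip(ref_snap, cand_snap)):
            if not math.isclose(r, c, rel_tol=rel_tol, abs_tol=abs_tol):
                return step, index, r, c
    return None


# Hypothetical snapshots of one array taken at three time steps.
good_run = [[1.0, 2.0], [1.5, 2.5], [2.0, 3.0]]
bad_run = [[1.0, 2.0], [1.5, 2.5000001], [2.0, 9.0]]

print(first_divergence(good_run, bad_run))                # strict tolerance
print(first_divergence(good_run, bad_run, rel_tol=1e-6))  # looser tolerance
```

The tolerance matters when the two runs use different hardware or compilers: a strict comparison flags harmless rounding differences, while a looser one skips them and points at the step where the results really diverge.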

Coreprocessor

Summary Coreprocessor is a basic parallel debugger that can help to debug problems at the application, kernel or hardware level on the BlueGene platform. It uses the low-level hardware JTAG interface to read and organise hardware information. Coreprocessor can sort nodes based on their stack traceback and kernel status, which can help to isolate a failing or problem node quickly. It can attach to running processes for deadlock determination and can be used to analyse, sort and view text core files.
Platforms IBM BlueGene-Q
License Required No
Documentation Blue Gene/Q System Administration Guide, ALCF Instructions
When To Use You need a tool to examine a large number of core files generated by an application on a BlueGene system

CUDA-GDB

Summary CUDA-GDB is a debugging tool developed by NVIDIA to assist simultaneous debugging of both GPU and CPU code. It is based on the x86-64 port of GDB with additional features for debugging CUDA applications on actual GPU hardware. CUDA-GDB allows the user to set breakpoints and to watch and modify variables (local, shared, global) and memory of any thread running on the device. It supports source/assembly level debugging on multi-GPU systems. CUDA-GDB can be integrated with DDD, Emacs or Nsight Eclipse Edition.
Platforms Linux, Mac OS, Android
Languages/Models C, C++, Fortran, CUDA
License Required No
Documentation User Guide, GitHub
When To Use You are familiar with GDB for CPU debugging and want a similar command line debugger for CUDA applications running on the GPU

CUDA-MEMCHECK

Summary CUDA-MEMCHECK is a correctness checking tool suite included in the CUDA toolkit. It helps to identify the cause of memory access and runtime execution errors in GPU code. CUDA-MEMCHECK monitors threads running on the GPU device and detects various errors such as out-of-bound accesses, misaligned memory accesses, stack overflows, illegal instructions and potential race conditions. It can display stack backtraces on host and device for errors, with source file and line number.
Platforms Linux, Mac OS
Languages/Models C, C++, CUDA
License Required No
Documentation User Guide
When To Use You are looking for a Valgrind-like tool to find memory access errors on the GPU

Curses Debugger (cgdb), Data Display Debugger (DDD) and KDbg

Summary cgdb, DDD and KDbg are graphical front-ends for command line debuggers. cgdb provides a lightweight curses interface to GDB. DDD can be used with a number of command line debuggers like GDB, DBX, JDB, XDB etc. KDbg is a KDE-based graphical user interface for GDB. They provide a lot of basic functionality: searching, viewing and stepping through source code, inspecting data structures, setting/clearing/enabling/disabling breakpoints, displaying/watching arbitrary expressions etc.
Platforms Unix/Linux
Languages/Models C, C++, Fortran, Java, Perl, PHP, Python
License Required No
Documentation cgdb User Guide, DDD User Guide, KDbg User Guide
When To Use You are using command line debuggers and are looking for a graphical user interface

DELEAKER

Summary DELEAKER is a memory profiler and leak detection tool for Windows applications. It intercepts all resource allocations such as memory, GDI objects and handles, and records the corresponding call stack. DELEAKER allows taking snapshots during application execution and provides a GUI to compare/analyse them with a full stack view. It can be used as a standalone application or integrated with Visual Studio.
Platforms Windows
Languages/Models C, C++, C#, .NET
License Required Yes
Documentation User Guide
When To Use You are looking for a memory/GDI/Handle/FileView leak detection tool on the Windows platform

Dr. Memory

Summary Dr. Memory is a cross-platform memory monitoring tool. It helps to identify memory errors like uninitialized accesses, out-of-bound accesses, double frees, memory leaks etc. Dr. Memory uses the DynamoRIO code manipulation framework underneath for dynamic instrumentation. It also provides the drstrace tool for Windows, which offers system call tracing functionality similar to strace.
Platforms Unix/Linux, Mac OS, Windows, Android
License Required No
Documentation User Guide, Publication, GitHub
When To Use You are looking for a faster memory correctness tool than tools like Valgrind’s Memcheck

DTrace

Summary DTrace is a dynamic tracing framework for analysing applications on production systems in real time. It was originally developed for Solaris and has been ported to several Unix-like systems. DTrace is a scriptable framework: one can attach “probes” to a running system and peek inside to see what it is doing. It helps to understand the memory utilisation, CPU time, filesystem and network resources used by active processes.
Platforms Linux, Mac OS
Languages/Models Assembly, C, C++, Java, Erlang, JavaScript, Perl, PHP, Python, Ruby, shell script, Tcl
License Required No
Documentation DTrace Guide
When To Use You want to diagnose application issues (on a workstation, server or cloud environment) and need a tool capable of tracing in both user and kernel space

GPU PerfStudio

Summary GPU PerfStudio is a performance and debugging tool developed by AMD. It was originally developed for Direct3D and OpenGL applications on Windows and later ported to Linux. GPU PerfStudio consists of five important tools for graphics developers: Frame Debugger (to visualise the graphics state and resources in the frame), Frame Profiler (to identify per draw call performance issues at the hardware counter level), Shader Debugger (to step through and debug shader code and its output), API Tracer (to show CPU timing information) and Shader Analyzer (to help optimise shader code).
Platforms Linux, Windows, AMD GPU
Languages/Models C, C++, DirectX, OpenGL, Vulkan
License Required No
Documentation Project Page
When To Use You want to analyse and optimise game applications for AMD GPUs

Helgrind and Data Race Detector (DRD)

Summary Helgrind and DRD are error detection tools for multi-threaded applications. These tools are part of the Valgrind framework and can be used with applications using POSIX threading primitives directly or libraries built on top of POSIX threading primitives (e.g. Boost.Thread, C++11 std::thread, QThreads). Helgrind and DRD help to find various synchronisation errors, incorrect API usage, lock contention and data races. Both tools provide similar functionality, but DRD could have better performance while Helgrind produces more comprehensible reports. As Valgrind emulates the code and records read/write/API calls, execution could be significantly slower.
Platforms Unix/Linux, Mac OS
Languages/Models C, C++, Pthread
License Required No
Documentation Helgrind User Guide, DRD User Manual
When To Use • You have developed a multi-threaded application but it produces incorrect results and occasionally locks up • You want to try a commonly available tool for diagnosing thread hazards

Insure++, PurifyPlus

Summary Insure++ and PurifyPlus are runtime memory analysis and error detection tools. They help to identify various memory errors such as heap corruption, memory leaks, array out-of-bound accesses and buffer overflows. Insure++ can be used in two modes: source instrumentation mode and link mode. In source instrumentation mode, Insure++ performs source-code instrumentation that helps to find errors that other tools might miss. It also provides a GUI that shows memory allocations and possible outstanding leaks over time. PurifyPlus works by instrumenting object code and can detect errors occurring inside third-party libraries.
Platforms Unix/Linux, Windows
Languages/Models C, C++
License Required Yes
Documentation Project Page
When To Use You are looking for a memory analysis and debugging tool on the Windows platform

Intel Inspector

Summary Intel Inspector (successor of Intel Thread Checker) is a code correctness tool that helps to identify threading and memory errors. It performs dynamic instrumentation and analyses the execution to find intermittent, non-deterministic errors. Intel Inspector helps to find threading errors (like deadlocks and race conditions) and memory errors (like memory leaks, memory corruption, dangling pointers and uninitialized variables).
Platforms Linux, Windows
Languages/Models C, C++, Fortran, TBB, OpenMP, Pthread, Win32 threads
License Required Yes
Documentation Documentation
When To Use • You want to analyse memory issues (leaks, dangling pointers, uninitialized variables) • You have a threaded application and want to find issues like race conditions, deadlocks etc.

Floating-point Litmus Tests (FLiT)

Summary FLiT is an infrastructure for detecting variability in floating-point computations caused by variations in compiler optimisation, hardware and execution environments. Unlike other tools, FLiT is not a debugging tool but a framework to detect discrepancies in floating-point computation across hardware, compilers and libraries. It allows developers to create reproducibility tests for their application and then compiles them under a set of configured compilers and a large range of compiler flags. The results from the tests under different compilations are then compared against the results from a “ground truth” compilation (e.g. an unoptimised compilation). This helps developers to determine which compilations are safe and to navigate the tradeoff between reproducibility and performance.
Platforms Unix/Linux, Mac OS
Languages/Models C, C++
License Required No
Documentation GitHub, README
When To Use • You are writing an application and are concerned about floating-point discrepancies • You are looking for a framework that lets you write compute kernels and test them with different compilers and different optimisation levels to ensure code correctness as well as reproducibility
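
The compare-against-ground-truth idea can be illustrated with a small Python sketch. It does not use FLiT (which is a C++ framework) or its API; all function names here are hypothetical. The "variants" stand in for different compilations: reordering a floating-point sum is one of the transformations an optimising compiler may apply when flags permit reassociation, and it can change the result.

```python
import math


def sum_left_to_right(values):
    """Baseline: accumulate in the given order (the "ground truth" here)."""
    total = 0.0
    for v in values:
        total += v
    return total


def sum_descending_magnitude(values):
    """Variant: accumulate the largest magnitudes first."""
    return sum_left_to_right(sorted(values, key=abs, reverse=True))


def sum_exact(values):
    """Variant: correctly rounded summation from the standard library."""
    return math.fsum(values)


def compare_variants(values, baseline, variants):
    """Return {variant name: variant result - baseline result}."""
    truth = baseline(values)
    return {name: fn(values) - truth for name, fn in variants.items()}


data = [1e16, 1.0, -1e16]
report = compare_variants(
    data,
    sum_left_to_right,
    {"descending_magnitude": sum_descending_magnitude, "fsum": sum_exact},
)
for name, diff in report.items():
    status = "matches" if diff == 0.0 else f"differs by {diff}"
    print(f"{name}: {status}")
```

Here the baseline loses the `1.0` because it is absorbed when added to `1e16`, while both variants keep it. A non-zero difference does not say which answer is more accurate, only that the result is not reproducible across the variants, which is the question FLiT is designed to answer.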

GNU Project debugger (GDB)

Summary GDB is a widely used, portable command line debugger for applications written in various programming languages. It provides rich functionality for monitoring, tracing and altering program execution at runtime. GDB supports debugging multi-threaded applications (see threads) as well as multiple processes simultaneously (see inferiors). It can be integrated into IDEs (e.g. Codelite, Code::Blocks, Dev-C++, Qt Creator, Eclipse, NetBeans, Visual Studio) or used via front-ends like UltraGDB, DDD, Nemiver and KDbg. One can use GDB to debug MPI applications with the help of a terminal emulator like xterm (see OpenMPI instructions). Other tools like CUDA-GDB and DDT use GDB underneath.
Platforms Unix/Linux, Mac OS, Windows
Languages/Models Ada, C, C++, Objective-C, Free Pascal, Fortran, Go, Java, Python (and others)
License Required No
Documentation User Guide
When To Use • You are looking for a readily available debugger for your application on any given platform • You want to debug a parallel application (multi-threaded or multi-process) at small scale

LaunchMON

Summary LaunchMON is a software framework that helps other tools to launch daemons on remote nodes at scale. Many debuggers and performance analysis tools need to launch and control middleware daemons on the compute nodes for scalable communication. LaunchMON provides a general purpose, efficient, portable and secure infrastructure to achieve this. It can interact with a resource manager like SLURM to determine when, where and how to perform the operations. Many other tools like STAT and DDT use LaunchMON underneath.
Platforms Unix/Linux
License Required No
Documentation README, GitHub
When To Use • You are developing a parallel debugging tool • You need a library to identify the remote nodes and processes of a parallel program, and to deploy tool daemons onto the right remote nodes

LLDB

Summary LLDB is a debugging tool built on top of reusable software libraries from the LLVM toolchain. It uses the LLVM disassembler and the Clang expression parser, which can handle complex C++ code better than other debuggers like GDB. LLDB has the advantage of modern libraries from the LLVM project and a permissive software licence (UIUC) that allows easy integration with proprietary software.
Platforms Linux, Mac OS, Windows
Languages/Models C, Objective-C, C++, Swift
License Required No
Documentation Tutorial
When To Use You are using the LLVM compiler toolchain and are looking for an alternative debugger to GDB

Linux Trace Toolkit Next Generation (LTTng)

Summary LTTng is a tracing framework for standalone applications, libraries and the kernel with minimal overhead. It is the successor of Linux Trace Toolkit (LTT) and is available on many desktop, server and embedded Linux distributions. Similar to perf/DTrace, it can be used for system-wide introspection to understand interactions among multiple applications. Visualisation tools like Trace Compass and Sourcery Analyzer can be used to visualise the collected traces.
Platforms Linux
License Required No
Documentation User Guide
When To Use You want to trace a single process or perform system-wide introspection with minimal overhead

Memcheck

Summary Memcheck is the default memory debugging tool of the Valgrind framework. It helps to find memory issues such as uninitialized memory access, read/write after deallocation, double free, memory leaks and mismatched malloc/new vs free/delete. All memory accesses (read/write) are checked, and calls to malloc/new/free/delete are intercepted. As a result, it could significantly slow down the execution (5x-100x). Memcheck can be used to debug parallel MPI applications by launching Valgrind under the MPI launcher and then redirecting the report to a per-process log file. Alternatively, one can use the memcheckview graphical tool (part of ToolGear) to interpret Memcheck’s results.
Platforms Linux, Mac OS
Languages/Models C, C++
License Required No
Documentation User Manual, Quick Start Guide
When To Use • You want to pinpoint the cause of a sporadic memory crash or unpredictable application behaviour • You want a readily available tool for debugging memory issues in a serial application or a small scale parallel application
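
As a rough illustration of launching Valgrind under the MPI launcher with one log file per process, the following Python helper assembles such a command line. The helper name, the `mpirun -np` launcher syntax, the executable `./my_app` and the log file prefix are assumptions for this sketch (other launchers use different flags); `--tool=memcheck`, `--leak-check=full` and the `%p` placeholder in `--log-file` are standard Valgrind options.

```python
import shlex


def memcheck_mpi_command(executable, nprocs, launcher="mpirun",
                         log_prefix="memcheck"):
    """Build an argv list that runs each MPI rank under Valgrind's Memcheck.

    Valgrind expands `%p` in --log-file to the process ID, so every rank
    writes its report to its own file.
    """
    return [
        launcher, "-np", str(nprocs),
        "valgrind", "--tool=memcheck", "--leak-check=full",
        f"--log-file={log_prefix}.%p.log",
        executable,
    ]


cmd = memcheck_mpi_command("./my_app", nprocs=4)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where both the MPI launcher and Valgrind are installed; the sketch itself only prints the command.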

MTuner

Summary MTuner is a memory profiler and memory leak finder for C/C++ applications. It records the entire history of memory operations over time with minimal impact on run-time performance. With its intuitive GUI, MTuner provides insight into the memory-related behaviour of an application and helps to quickly narrow down sources of memory leaks, spikes, high allocation counts, etc. Using the MTuner SDK instrumentation API, one can insert event markers, memory tags and named allocators for precise memory profiling.
Platforms Windows (partial support for Linux)
Languages/Models C, C++
License Required No
Documentation User Guide, GitHub
When To Use You want to profile and analyse memory usage with entire time-based history of all memory operations

Marmot Umpire Scalable Tool (MUST)

Summary MUST is a runtime error detection tool for MPI applications. It automatically detects use of MPI constructs that does not comply with the standard. MUST intercepts all MPI calls and checks for local and non-local correctness errors such as invalid arguments, data type matching errors, overlapping communication buffers, resource leaks and actual/potential deadlocks. It combines the features of the older Marmot and Umpire tools with improved scalability.
Platforms Linux
Languages/Models C, C++, Fortran, MPI
License Required No
Documentation Project Page
When To Use You have an MPI application and want to detect violations of the MPI standard that might manifest only on certain systems or with a different MPI implementation

Nsight

Summary Nsight is a development tool from NVIDIA for heterogeneous computing. It provides simultaneous debugging and profiling capabilities for the CPU as well as the GPU. Nsight helps to identify/analyse bottlenecks and monitor the activity of the entire system. It can be integrated with Eclipse and Microsoft Visual Studio.
Platforms Linux, Windows, Mac OS
Languages/Models C, C++, CUDA, Direct3D, Vulkan, OpenGL
License Required No
Documentation User Guide
When To Use You are looking for an IDE with debugging and profiling capabilities for NVIDIA GPUs

ReMPI

Summary ReMPI is a record and replay tool for MPI applications. As network/system noise can affect the order of received messages, applications can take different computation paths depending on the messages received. This complicates debugging, as computation paths and the associated results may vary between the original run (where a bug manifested itself) and the debugged runs. ReMPI helps to debug such non-deterministic MPI applications by reproducing the order of message receives. Even if a bug manifests only under a particular order of message receives, ReMPI can consistently reproduce the target bug. It uses the PMPI interface to trace the message receive order. ReMPI can be used with existing tools like Totalview, DDT and STAT.
Platforms Linux, Mac OS
Languages/Models C, C++, Fortran, MPI
License Required No
Documentation README, GitHub
When To Use • Your MPI application has a non-deterministic communication pattern • You want a mechanism to re-run the application while preserving the MPI message communication order
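
The record and replay principle can be shown with a toy simulation in Python. This is not how ReMPI is implemented (the real tool hooks MPI calls through PMPI); the function names are hypothetical, the "network" is a random shuffle, and the sketch assumes exactly one message per sender. The record run logs the order in which senders were matched, and the replay run holds arrivals back until they can be delivered in that same order.

```python
import random


def receive_any_order(messages, rng):
    """Simulate wildcard receives whose arrival order depends on noise.

    `messages` is a list of (sender, payload) pairs, one per sender.
    """
    arrived = list(messages)
    rng.shuffle(arrived)
    return arrived


def record(messages, rng):
    """Record run: deliver in arrival order and log the sender sequence."""
    delivered = receive_any_order(messages, rng)
    trace = [sender for sender, _ in delivered]
    return delivered, trace


def replay(messages, rng, trace):
    """Replay run: buffer arrivals, then deliver in the recorded order."""
    pending = dict(receive_any_order(messages, rng))
    return [(sender, pending[sender]) for sender in trace]


def fold(delivered):
    """An order-sensitive computation standing in for the application."""
    acc = 0
    for _, payload in delivered:
        acc = acc * 10 + payload
    return acc


messages = [(0, 1), (1, 2), (2, 3)]
original, trace = record(messages, random.Random(1))
for seed in range(5):
    again = replay(messages, random.Random(seed), trace)
    print(seed, fold(again) == fold(original))
```

Whatever arrival order the simulated network produces in a replay run, the application sees the recorded order, so an order-dependent bug shows up in every debugging session rather than occasionally.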

RenderDoc

Summary RenderDoc is a frame-capture based graphics API debugger designed for quick and easy introspection of any graphics application. RenderDoc allows capturing a single frame of an application and then loading that capture in an analysis tool to inspect the API use and GPU work in detail.
Platforms Linux, Windows
Languages/Models Vulkan, Direct3D, OpenGL
License Required No
Documentation User Guide, GitHub
When To Use You are developing a rendering application and need a tool for frame analysis and debugging, graphics inspection and detailed examination of API usage

Stack Trace Analysis Tool (STAT)

Summary STAT is a lightweight, scalable tool to aid in debugging parallel applications at extreme scale. It is not intended to be a full-featured debugger, but it can help to pinpoint the root cause of deadlocks even when running with hundreds of thousands of processes. STAT gathers stack traces from the parallel application’s processes and merges them into a compact form. The merging process groups processes that exhibit similar behaviour into process equivalence classes. It provides a GUI to navigate the process groups and allows attaching a full-featured debugger like DDT or Totalview for in-depth analysis.
Platforms Linux
Languages/Models C, C++, Fortran, MPI (and other programming models)
License Required No
Documentation User Guide, GitHub
When To Use • You are running an application at scale and suspect a deadlock • You need a tool to attach to a running application and show the execution state of every process in a compact view
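
The notion of process equivalence classes can be sketched in Python as follows. This is only an illustration with hypothetical function names and made-up traces, not STAT's algorithm or data format: ranks with identical stack traces are grouped together, and each group is printed with a compact rank list so that an outlier stands out.

```python
from collections import defaultdict


def equivalence_classes(traces):
    """Group ranks whose stack traces are identical.

    `traces` maps rank -> tuple of frame names (outermost first).
    Returns a dict mapping each distinct trace to a sorted list of ranks.
    """
    classes = defaultdict(list)
    for rank in sorted(traces):
        classes[tuple(traces[rank])].append(rank)
    return dict(classes)


def compact_ranks(ranks):
    """Format a non-empty sorted rank list as ranges, e.g. "0-2,5"."""
    parts = []
    start = prev = ranks[0]
    for r in ranks[1:]:
        if r == prev + 1:
            prev = r
            continue
        parts.append(str(start) if start == prev else f"{start}-{prev}")
        start = prev = r
    parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)


# Hypothetical hang: seven ranks wait in a barrier, one is stuck in a receive.
traces = {r: ("main", "exchange", "MPI_Barrier") for r in range(8)}
traces[5] = ("main", "exchange", "MPI_Recv")

for trace, ranks in equivalence_classes(traces).items():
    print(f"[{compact_ranks(ranks)}] {' > '.join(trace)}")
```

With eight ranks the output is two lines; the same grouping applied to hundreds of thousands of ranks is what makes it feasible to spot the one process that behaves differently.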

Oracle Studio Thread Analyser

Summary Thread Analyser is a tool that is part of Oracle Developer Studio (formerly Sun Studio) and helps to pinpoint race and deadlock conditions in multi-threaded applications. When an application is compiled with Oracle Studio, the compiler adds instrumentation code to the executable that helps to detect errors at runtime. It provides a GUI integrated into Performance Analyzer.
Platforms Linux/Solaris (Intel, AMD and SPARC)
Languages/Models OpenMP, Pthread, Solaris thread API, Cray(R) parallel directive
License Required No
Documentation User Guide
When To Use You are developing an application on the Solaris platform and are looking for a data race detection tool

Temanejo

Summary Temanejo is a graphical tool for analysing and debugging task-parallel, data-dependency-driven programming models. It displays the task-dependency graph of application components and allows simple interaction with the runtime system in order to control some aspects of parallel execution. Temanejo is able to assist debugging (to a varying extent) for programming models like SMPSs, OmpSs, StarPU, PaRSEC and OpenMP. It uses the Ayudame library to collect information, so-called events, from supporting runtime systems, and to exert control over a runtime system.
Platforms Linux, Mac OS
Languages/Models OpenMP, OmpSs, SMPSs, StarPU, PaRSEC
License Required No
Documentation User Guide
When To Use • You have a task-parallel application and are not sure about the dependencies and runtime scheduling of tasks • You are looking for a visual debugging tool to understand how dependencies are executed at runtime

ThreadSanitizer (TSan) and AddressSanitizer (ASan)

Summary TSan is a fast data race detector for multi-threaded C and C++ applications. It performs compile-time instrumentation to record information about each memory access, and then checks whether that access participates in a race. Compared to other tools, TSan better understands builtin atomics and synchronisation constructs and therefore provides more accurate results with no real false positives. The overhead could vary from application to application, but typically memory usage may increase by 5-10x and runtime by 2-20x. ASan is a fast memory error detector. It helps to find errors such as use-after-free, heap/stack/global buffer overflow, out-of-bounds accesses and invalid/double free. ASan typically slows execution by ~2x and increases memory usage by ~3x. TSan and ASan were originally developed by Google for the LLVM toolchain and have since been ported to the GNU toolchain.
Platforms Linux, Mac OS, Windows
Languages/Models C, C++, Fortran, Pthread
License Required No
Documentation TSan User Guide, ASan User Guide
When To Use You need a faster data race or memory error detection tool to find the cause of sporadic crashes and memory corruption

Record and Replay Debugger (rr)

Summary rr is a record and replay framework developed by Mozilla. During the record phase, rr records all inputs to the process and logs them to disk as a trace. This trace can be replayed as many times as needed during debugging, and all state will be reproduced exactly. During the replay phase, rr provides an enhanced gdb debugging experience that supports reverse execution. As a bug can be replayed over and over again, it helps to debug issues that are very difficult to solve with traditional debuggers. This functionality is similar to the ReplayEngine of Totalview. rr can be integrated with IDEs like Visual Studio Code, QtCreator, Eclipse and CLion.
Platforms Linux
Languages/Models C, C++, Fortran, Pthread
License Required No
Documentation Wiki, GitHub
When To Use • You have a non-deterministic, difficult to reproduce bug in your application • You want a tool that is capable of recording the execution once and then replaying it multiple times with reverse-debugging functionality

Totalview

Summary Totalview is a debugger developed by Rogue Wave Software for both serial and parallel programs. It helps to analyse and debug serial, parallel, multi-process, multi-threaded and hybrid applications on a variety of HPC architectures. Totalview has a memory analysis tool called MemoryScape (for detecting memory leaks and memory corruption) and a reverse debugging tool called ReplayEngine (providing record and replay debugging functionality like rr). It provides remote debugging functionality and both a graphical user interface and a command line interface. Along with DDT, Totalview is a commonly used debugger on supercomputing platforms.
Platforms Unix/Linux, Mac OS (Intel x86, Intel Xeon Phi, IBM Power, Arm, NVIDIA GPU)
Languages/Models C, C++, Fortran, MPI, OpenMP, OpenACC, CUDA
License Required Yes
Documentation User Guide
When To Use You are looking for a debugging tool for serial, multi-threaded or parallel applications on a desktop, a cluster or the largest supercomputers

UndoDB

Summary UndoDB is a reversible debugger developed by Undo. Similar to rr and newer versions of GDB, it supports rewinding and replaying through the program’s execution history. One can set breakpoints and watchpoints in the past, and then rewind to them. UndoDB uses GDB as its default front-end but can be configured with IDEs like Eclipse and CLion.
Platforms Linux (x86, AArch64)
Languages/Models C, C++
License Required Yes
Documentation User Guide
When To Use You like the reverse debugging feature of GDB but find it slow, and hence are looking for a better alternative

Valgrind

Summary Valgrind is an instrumentation framework for building dynamic analysis tools. It provides a number of simulation-based debugging and profiling tools : Memcheck (memory-management error detector), Cachegrind (cache profiler), Callgrind (extends Cachegrind with call graphs), Massif (heap profiler), DRD/Helgrind (data race detectors). Valgrind is in essence a virtual machine that performs dynamic recompilation of the binary using a JIT compilation technique : it first translates the application into a simpler Intermediate Representation (IR), then the selected tool performs whatever transformations it needs on the IR, and finally Valgrind translates the IR back to machine code and lets the host processor run it. Even though Valgrind is primarily used with a single process, one can use it to debug MPI programs at moderate scale with the help of Tool Gear’s MemcheckView.
Platforms Unix/Linux, Mac OS
Languages/Models C, C++, Fortran, Pthread
License Required No
Documentation User Guide
When To Use You want a single, readily available tool for detecting memory errors and threading bugs and for profiling your programs
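The translate, instrument and execute pipeline described in the Valgrind summary can be illustrated with a toy sketch in Python. This is only an assumption-laden model: the tuple-based "IR", the `instrument` and `execute` functions and the uninitialised-read check are hypothetical and far simpler than Valgrind's real IR or Memcheck.

```python
def check_access(op, defined, errors):
    """Tool logic: flag loads from addresses that were never stored to."""
    kind, addr = op[0], op[1]
    if kind == "load" and addr not in defined:
        errors.append(f"read of uninitialised address {addr}")
    elif kind == "store":
        defined.add(addr)


def instrument(ir):
    """The tool's transformation: insert a check before every memory access."""
    instrumented = []
    for op in ir:
        if op[0] in ("load", "store"):
            instrumented.append(("check", op))
        instrumented.append(op)
    return instrumented


def execute(ir):
    """Run the (possibly instrumented) IR and return the errors the tool found."""
    memory, defined, errors = {}, set(), []
    for op in ir:
        if op[0] == "check":
            check_access(op[1], defined, errors)
        elif op[0] == "store":
            memory[op[1]] = op[2]
        elif op[0] == "load":
            memory.get(op[1], 0)
    return errors


# Toy program: address 0 is written before it is read, address 8 is not.
program = [("store", 0, 42), ("load", 0), ("load", 8)]
errors = execute(instrument(program))
```

Running the uninstrumented `program` reports nothing, because the checks only exist in the code the tool inserted.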

Python Debuggers

There are a number of tools available for debugging Python applications. For multiprocessing applications in Python, I haven’t used anything other than the logger in the multiprocessing module. Here are a few other commonly used debuggers :

pdb : Interactive source code debugger included in the standard library
pudb : A visual, console-based, full-screen debugger for Python
Winpdb : Platform independent Python Debugger
pydbgr : A gdb-like debugger for Python
• A number of Python IDEs like Spyder, PyCharm, Atom provide built-in debugger integration
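As a small illustration of the logger in the multiprocessing module mentioned above, the sketch below starts a few worker processes and lets each of them log through `multiprocessing.get_logger()`. The `worker` and `run` functions are hypothetical, and the example assumes a Unix-like system where the "fork" start method is available.

```python
import logging
import multiprocessing as mp


def worker(rank, results):
    """Each process logs through the logger shared by the multiprocessing module."""
    logger = mp.get_logger()
    logger.info("worker %d starting", rank)
    results.put((rank, rank * rank))
    logger.info("worker %d done", rank)


def run(n_workers=3):
    # Assumption: a Unix-like system where the "fork" start method exists.
    ctx = mp.get_context("fork")
    # Attach a stderr handler; records carry the name of the emitting process.
    mp.log_to_stderr(logging.INFO)
    results = ctx.Queue()
    procs = [ctx.Process(target=worker, args=(rank, results))
             for rank in range(n_workers)]
    for proc in procs:
        proc.start()
    # Collect results before joining so that no worker blocks on a full queue.
    collected = [results.get() for _ in procs]
    for proc in procs:
        proc.join()
    return sorted(collected)


if __name__ == "__main__":
    print(run())
```

Each log record is prefixed with the level and the process name, which is often enough to see which process stopped making progress.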

This wiki page provides a number of other alternatives.

Other Tools

Below is a list of additional tools not included in this post. Some of these tools are deprecated or not in active development or have better alternatives.

AutomaDeD : Tool for automatic diagnosis of performance and correctness problems in MPI applications
DBX : Source level debugger for C/Fortran/Pascal primarily on Solaris, AIX and BSD systems
jdb : Simple command-line, GDB equivalent debugger for Java
IDB : Debugger developed by Intel supporting parallel programming models including MPI, OpenMP, and Pthreads (Deprecated)
Intel Static Security Analysis : Tool developed by Intel to identify security vulnerabilities including buffer overflows, uninitialised variables, memory leaks (Deprecated)
Marmot : MPI error detection tool for checking MPI calls, their arguments and non-portable constructs (superseded by MUST)
DHAT : Heap analysis tool to understand memory block lifetimes, block utilisation, memory access ratios and layout inefficiencies
WinDbg : A kernel-mode and user-mode debugger on Windows platform
• IDEs : Many IDEs like NetBeans, Eclipse, CLion, Visual Studio have inbuilt debugger or provide integration with third-party debuggers for serial/multi-threaded applications

If you have any questions or suggestions, or would like to improve the post with your favourite tool, I will be glad to hear!

About

Over the years I have used different performance analysis and optimisation tools for High Performance Computing platforms. I work with different scientific applications written in C, C++, Fortran, Python, Matlab, Octave and Java. Often these applications run on workstations, small clusters and the largest supercomputers in the world. This gives me an opportunity to work with different systems including Intel/IBM/ARM CPUs, NVIDIA/AMD GPUs, Intel MIC, IBM BlueGene, Fujitsu and Cray.

I have a great interest in learning, understanding and experimenting with performance tools. After scattered notes and wiki pages over the years, I have been planning to write this blog for a long time. And finally I am trying to put all this together!

This blog is part of my learning process and hence everything may not be perfect. If you find this useful, thanks go to the brilliant people developing the tools and tutorials from which I am borrowing this information. If there are any inaccuracies or missing details, it’s because I am still learning.

If you have any question or suggestion or want to discuss specific aspects, I will be happy to hear!

Summary of Profiling Tools for Parallel Applications

Many scientific/industrial applications run on everything from workstations to the largest supercomputers in the world. With the continuous evolution of hardware platforms, achieving good performance is a challenging task. There are many profiling tools available to analyse and optimise the performance. But not all tools/methods are available on every platform, especially in high performance computing. The first step in a performance engineering workflow is to understand which tools are available and when they can be used. There is no one-size-fits-all solution : some are designed with a broad feature list for high level analysis and others for a specific platform with low level hardware metrics.

When choosing a profiling tool one needs to consider different aspects:

  • Goal : Are you interested in high level performance metrics? Or, do you want to dive into low level hardware details?
  • Application : Is it a serial or parallel application? Do you have access to the source code to re-build the software?
  • Programming Language : Which programming language is used? Is it supported by the tool?
  • Programming Model : If it is parallelised, what is the programming model? Is it supported by the tool?
  • Hardware Platform : What is the target platform (CPU/GPU/FPGA)? Do you have administrative permissions on the system?

In this (long) post we will summarise different profiling tools (open source as well as commercial). If you are interested in parallel debugging tools, Summary of Debugging Tools will be useful.

The tools are listed in alphabetical order.

Aftermath

Summary Aftermath is a performance analysis tool for task parallel programs such as OpenMP and OpenStream applications. It provides an instrumented OpenMP runtime (based on the LLVM OpenMP runtime) that collects performance data during execution and generates a trace file at program termination. The graphical user interface helps to filter and analyse the traces interactively. Aftermath provides timeline visualisation to correlate performance data with the execution of OpenMP constructs like loops and tasks. Different metrics like task creation, execution, synchronisation and performance counters can be mapped on the timeline.
Platforms Unix/Linux
Languages/Models C, C++, OpenMP, OpenStream
License Required No
Documentation User Guide
When To Use You have an OpenMP/OpenStream application and are looking for a tool to assist in performance debugging
Note Tracing is not supported for all OpenMP constructs and requires use of the Clang compiler (see documentation)

Arm MAP

Summary Arm MAP is a profiling tool developed by Allinea Software (now part of Arm). MAP is primarily used for analysing parallel applications written in C, C++, Fortran. It uses statistical sampling to record CPU/memory usage, thread activity, I/O, communication and synchronisation across MPI processes.
Platforms Intel x86, Intel Xeon Phi, IBM Power, Armv8, NVIDIA GPU
Languages/Models C, C++, Fortran, MPI, OpenMP, OpenSHMEM, OpenACC, CUDA
License Required Yes
Documentation User Guide
When To Use • You are looking for a profiling tool for parallel applications with an easy to use user interface • You have used Arm Performance Reports to understand high level characteristics and now want to dive into details to understand hardware bottlenecks

Arm Performance Reports

Summary Arm Performance Reports is a high level performance analysis tool to characterise application performance and pinpoint inefficiencies. It is also developed by Allinea Software (now part of Arm). The main goal of Arm Performance Reports is to show whether the application is fully utilising the target hardware resources. It generates a single page text/HTML report which helps the developer to understand computation, communication and I/O characteristics of the application.
Platforms Intel x86, Intel Xeon Phi, Armv8, NVIDIA GPU
Languages/Models C, C++, Fortran, MPI
License Required Yes
Documentation User Guide
When To Use • You haven’t analysed the application before and are wondering if it is utilising the hardware fully • With minimal effort you want to understand the computing characteristics of your application

Apprentice2

Summary Apprentice2 is a performance visualisation tool for profiling data collected by CrayPat. Depending on the instrumentation and analysis type, Apprentice2 helps to generate a variety of reports/graphs including load imbalance, serialisation, I/O activity, process-pairs communication, execution timeline, call-graphs, function/region profile, cpu/memory utilisation.
Platforms All Cray systems
License Required Yes
Documentation User Guide
When To Use • You have used CrayPat to collect performance data • Even though CrayPat provides excellent command line utilities to analyse data, you prefer a visual approach and want to understand profile data with an interactive graphical tool

Automatic Performance Collection (AutoPerf)

Summary AutoPerf is a library for collecting hardware performance counters and MPI usage information on IBM BlueGene-Q systems. AutoPerf is available on BG-Q systems deployed at ALCF. There is an x86 port on GitHub.
Platforms IBM BlueGene-Q
License Required No
Documentation ALCF Instructions
When To Use • You are running an application on IBM BlueGene-Q systems at ALCF (Mira, Cetus, Vesta) • You ran the application and want to understand its performance and MPI usage

Blue Gene Performance Monitoring (BGPM)

Summary Blue Gene Performance Monitoring API (BGPM) provides a programming interface for accessing hardware performance counters on IBM BlueGene-Q. BGPM provides a C interface to monitor the main hardware counter sources : P Unit (CPU events), L2 Unit (L2 cache events), I/O Unit, Network Unit and Compute Node Kernel.
Platforms IBM BlueGene-Q
Languages/Models C, C++
License Required No
Documentation Doxygen documentation in /bgsys/drivers/ppcfloor/bgpm/docs/, PAPI BG-Q Report
When To Use • You are a performance tool developer • You want to use a low level API for accessing hardware performance counters

Caliper

Summary Caliper is a program instrumentation and performance measurement framework for HPC applications. Unlike traditional profiling tools, Caliper can be used to embed performance analysis capabilities into the application itself. It can allow third-party tools to access application context information, or it can be configured as a stand-alone performance recorder.
Platforms Unix/Linux
Languages/Models C, C++, Fortran
License Required No
Documentation GitHub, User Guide, Publication
When To Use Instead of third-party profiling tools, you want to embed performance analysis capabilities into your application itself (Note : still in development / research stage)

Cachegrind

Summary Cachegrind is a cache and branch prediction profiler that is part of the Valgrind framework. Cachegrind simulates how the application interacts with the cache hierarchy and branch predictor. It collects different events like data reads and cache misses and attributes them back to the source code. It performs dynamic recompilation of the binary at runtime and hence runtime overhead can be huge (10-100x slower). There is a Windows port on GitHub.
Platforms Unix/Linux/MacOS (x86, PowerPC)
Languages/Models C, C++, Fortran
License Required No
Documentation User Guide
When To Use • You want to understand how your (serial) application interacts with the cache hierarchy • You want to use a tool available with common Linux distributions • As run-time overhead could be huge, you have a small, representative input dataset so that you can run the application for a short time

Callgrind

Summary Callgrind is a profiling tool that is part of the Valgrind framework. Callgrind records function executions and caller-callee relations and presents them as a call graph. The output of Callgrind is a flat call graph but a tool like KCachegrind can be used to visualise the profiling data. It also provides an option to use Cachegrind capabilities to improve the profiling information. As the application runs on a virtual processor emulated by Valgrind, execution takes considerably longer under Callgrind than it typically would.
Platforms Unix/Linux/MacOS (x86, AMD64)
Languages/Models C, C++, Fortran
License Required No
Documentation User Guide
When To Use • You want to profile a (serial) application with a tool available with common Linux distributions • As run-time overhead could be huge, you have a small, representative input dataset so that you can run the application for a short time

Codelet Extractor and REplayer (CERE)

Summary CERE is a code isolation framework based on LLVM. CERE is used to find out hotspots of the application and then extract these hotspots into standalone kernels called codelets. The codelet can be modified, compiled, run, and measured independently from the original application. This helps to do piecewise optimisation of the application and reduces the benchmarking cost.
Platforms Linux (x86, Aarch64)
Languages/Models C, C++, Fortran, D
License Required No
Documentation GitHub, Tutorial, Publication
When To Use You have large code base and want to extract kernels to analyse and optimise separately
Note Still in development / research tool

CodeAnalyst / CodeXL

Summary CodeAnalyst was a profiler developed by AMD for x86/x86_64 processors and is now replaced by CodeXL. It has a CPU/GPU profiler, static kernel analyser, HSA profiler and graphics frame analyser. CodeXL also includes a debugger for CPU as well as GPU.
Platforms Linux, Windows
Languages/Models C, C++, OpenCL, OpenGL, DirectX, Vulkan
License Required No
Documentation GitHub, User Guides
When To Use • You are targeting an AMD platform • You are looking for an IDE to profile/debug applications on AMD CPU/GPU/APU

Cray Performance Analysis Toolkit (CrayPat)

Summary CrayPat is a performance measurement and analysis tool available on Cray systems. One can run a variety of analysis experiments including profiling, tracing, hardware performance counter analysis, load-imbalance analysis. There is also an easy-to-use, simplified version called CrayPat-lite for basic performance analysis. CrayPat has a feature called Automatic Program Analysis (APA) which uses profiling information to gather hotspots and then performs tracing of selected hotspot routines. The profile data generated by CrayPat can be analysed using the command line tool pat_report or GUI tools like Apprentice2 and Reveal.
Platforms Linux (Cray Systems)
Languages/Models C, C++, Fortran, MPI, OpenMP, OpenACC, SHMEM, UPC, Pthreads
License Required Yes
Documentation User Guide
When To Use • You are running application on Cray systems • You are looking for tool capable of doing high level profiling as well as low level hardware counter analysis

CUBE Uniform Behavioral Encoding (CUBE)

Summary CUBE is a performance data explorer tool used with the Score-P, TAU and Scalasca frameworks. It is designed for interactive exploration of the performance data in a scalable fashion. CUBE presents application performance in a multi-dimensional performance space : performance metric, call path and system resource. It also provides various command line tools and a library for reading/writing/manipulating profile data.
Platforms Linux, MacOS
Languages/Models C, C++, OpenMP, MPI, OpenACC, CUDA
License Required No
Documentation Project Website
When To Use • You have generated profile data from Score-P, Scalasca (or TAU) • You want to visualise the profile data or need command line tool/library to manipulate the profile data

DAGViz

Summary DAGViz is a performance visualisation tool for task parallel programs. Task-based programming models allow developers to expose logical parallelism by creating fine-grained tasks. The runtime system takes care of thread management, task scheduling, load balancing etc. and hence the order of execution of tasks can differ from run to run. The nondeterministic nature of task parallel execution hides runtime scheduling mechanisms from programmers. This makes it challenging for programmers to understand the cause of suboptimal performance of their programs. DAGViz helps to visualise logical tasks and their runtime execution.
Platforms Linux, MacOS
Languages/Models C, C++, OpenMP, Cilk Plus, Intel TBB, Qthreads, MassiveThreads
License Required No
Documentation GitHub, Publication
When To Use • You have parallelised application using task parallel programming model • You want to analyse runtime scheduling of tasks

Darshan

Summary Darshan is a lightweight I/O profiling tool for HPC applications. It helps to understand I/O characteristics of application including I/O access patterns, sizes, number of operations etc. Darshan has two components : Darshan-runtime (used to collect performance data on target system) and Darshan-util (used for analysing the data collected by Darshan-runtime). Darshan can be deployed on production systems for I/O characterisation of entire workload.
Platforms Unix/Linux
Languages/Models C, C++, Fortran, MPI, HDF5, NetCDF
License Required No
Documentation Project Page, Documentation
When To Use You have an MPI application and want to understand its I/O performance

Dimemas

Summary Dimemas is a performance prediction tool for MPI applications. It is a trace driven simulator that helps to perform what-if analysis. Developers can define architectural parameters (cpu, node, network, filesystem etc.) for a non-existent target machine and then Dimemas reconstructs the time behaviour of the application on the new target platform. It uses traces generated by the Extrae profiling tool.
Platforms Unix/Linux
Languages/Models C, C++, Fortran, MPI
License Required No
Documentation Project Page, GitHub
When To Use • You have used Extrae/Paraver for performance measurement and analysis • You want to understand (or predict) how your application will perform on a system to which you don’t have access or which simply doesn’t exist yet
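A trace-driven what-if analysis can be sketched in Python as follows. This is a deliberately simplified model and not Dimemas itself: the `Machine` parameters, the single-rank trace format and the linear "latency + size / bandwidth" cost of a message are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    cpu_speedup: float  # relative to the machine where the trace was recorded
    latency: float      # seconds per message
    bandwidth: float    # bytes per second


def predict_time(trace, machine):
    """Replay one rank's trace records on a hypothetical target machine."""
    total = 0.0
    for kind, amount in trace:
        if kind == "compute":    # amount = seconds measured on the original machine
            total += amount / machine.cpu_speedup
        elif kind == "send":     # amount = message size in bytes
            total += machine.latency + amount / machine.bandwidth
        else:
            raise ValueError(f"unknown trace record: {kind}")
    return total


# Hypothetical trace: two compute phases, each followed by a 1 MB message.
trace = [("compute", 2.0), ("send", 1_000_000),
         ("compute", 1.0), ("send", 1_000_000)]

current = Machine(cpu_speedup=1.0, latency=1e-6, bandwidth=1e9)
faster_cpu = Machine(cpu_speedup=2.0, latency=1e-6, bandwidth=1e9)

baseline = predict_time(trace, current)
what_if = predict_time(trace, faster_cpu)
print(f"baseline {baseline:.6f} s, with 2x faster CPU {what_if:.6f} s")
```

Changing one parameter at a time (CPU speed, latency, bandwidth) shows which resource the application is most sensitive to, which is the kind of question a what-if analysis is meant to answer.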

DTrace

Summary DTrace is a dynamic tracing framework for analysing applications on production systems in real time. It was originally developed for Solaris and has been ported to several Unix-like systems. DTrace is a scriptable framework : one can attach “probes” to a running system and peek inside to see what it is doing. It helps to understand memory utilisation, CPU time, filesystem and network resources used by active processes. Unlike other tools, it is not a profiler but an excellent tracing tool and hence included in this post.
Platforms Linux, MacOS
Languages/Models Assembly, C, C++, Java, Erlang, JavaScript, Perl, PHP, Python, Ruby, shell script, Tcl
License Required No
Documentation DTrace Guide
When To Use You want to diagnose performance issue (on workstation, server or cloud environment) and need a tool capable of tracing at user/kernel space

Dyninst

Summary The Dyninst library provides a machine independent interface to analyse and instrument binaries. Performance analysis tools often need to instrument an application at runtime. Dyninst allows this instrumentation after the linking phase or during execution. Multiple performance tools like TAU, STAT, Open|SpeedShop, Extrae use Dyninst underneath.
Platforms Unix/Linux, MacOS
License Required No
Documentation Project Page, GitHub
When To Use You are developing performance measurement tool, debugger or computational steering application where you want to modify the binary

Extra-P

Summary Extra-P is an automatic performance modeling tool developed with the Scalasca tool set. The main goal of Extra-P is to help find scalability bugs. The user runs small-scale performance experiments at different processor configurations to generate profiles which are used as input for model generation. Extra-P generates a list of potential scalability bugs and human-readable models for different performance metrics.
Platforms Linux, MacOS
Languages/Models C, C++, Fortran, MPI, OpenMP
License Required No
Documentation User Guide
When To Use • You have used Score-P or Scalasca for performance analysis • You are curious about possible scalability issues at scale but don’t have access to system • You want to generate performance models to find out the potential scalability issues
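The workflow of measuring at small scale and extrapolating with a fitted model can be illustrated with a minimal Python sketch. It assumes a single power-law model t(p) = c * p^a fitted by least squares in log-log space, which is a simplification and not the model generator Extra-P actually uses; the measurements are synthetic.

```python
import math


def fit_power_law(procs, times):
    """Least-squares fit of t(p) = coeff * p**exponent in log-log space."""
    xs = [math.log(p) for p in procs]
    ys = [math.log(t) for t in times]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    exponent = sxy / sxx
    coeff = math.exp(mean_y - exponent * mean_x)
    return coeff, exponent


# Synthetic small-scale measurements of one routine (seconds per process count).
procs = [2, 4, 8, 16, 32]
times = [0.5 * p ** 0.5 for p in procs]

coeff, exponent = fit_power_law(procs, times)
predicted = coeff * 4096 ** exponent  # extrapolate to a scale never measured
print(f"t(p) = {coeff:.3f} * p^{exponent:.3f}, t(4096) = {predicted:.2f} s")
```

A routine whose fitted exponent is clearly positive keeps getting slower as processes are added, which marks it as a candidate scalability bug even though it looked harmless in the small runs.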

Extrae

Summary Extrae is an instrumentation and performance measurement tool for HPC applications. It uses different interposition mechanisms like library preload, binary instrumentation (using Dyninst) and programming model specific instrumentation layer to inject probes into the target application. The profiles/traces can be analysed using Paraver and Dimemas.
Platforms Intel, BlueGene-Q, Cray, GPU, ARM, Fujitsu
Languages/Models C, C++, Fortran, Java, Python, MPI, OpenMP, Pthreads, OmpSs, CUDA, OpenCL
License Required No
Documentation Project Page, GitHub
When To Use • You want to analyse the performance of a parallel application on different architectures • Possibly you want to perform what-if analysis using Dimemas and need an input timeline trace

gperftools

Summary gperftools (originally Google Performance Tools) is a collection of performance profiling and memory checking tools. It consists of a sampling based cpu profiler, a heap checker and a heap profiler. The profile data can be visualised with pprof.
Platforms Unix/Linux
Languages/Models C, C++
License Required No
Documentation Documentation
When To Use You are looking for simple to use profiling tool for analysing (serial) C/C++ application

GPROF

Summary GPROF is a commonly available performance analysis tool on most Unix/Linux platforms. It uses sampling as well as instrumentation techniques : it instruments the application during compilation time but uses sampling technique during runtime to gather performance hotspots. GPROF is an extended version of the older “prof” tool with the ability to generate call graph information. For non-trivial applications the overhead could be high due to instrumentation.
Platforms Unix/Linux
Languages/Models C, C++, Fortran
License Required No
Documentation Project Page
When To Use You have not-so-complex C/C++ application and want to quickly get an idea about hotspots using system tool

GPU PerfStudio

Summary GPU PerfStudio is a performance and debugging tool developed by AMD. It was originally developed for Direct3D and OpenGL applications on Windows and later ported to Linux. GPU PerfStudio consists of five important tools for graphics developers : Frame Debugger (to visualise the graphics state and resources in the frame), Frame Profiler (to identify per draw call performance issues at the hardware counter level), Shader Debugger (to step through and debug shader code and its output), API Tracer (to show CPU timing information) and Shader Analyzer (to help optimise shader code).
Platforms Linux, Windows, AMD GPU
Languages/Models C, C++, DirectX, OpenGL, Vulkan
License Required No
Documentation Project Page
When To Use You want to analyse and optimise game applications for AMD GPUs

Flow Graph Analyzer

Summary Flow Graph Analyzer is a visualisation tool for designing and analysing applications built using the Intel TBB flow graph interface. The flow graph interface allows developers to expose parallelism at a high level using data flow algorithms and dependency graphs. Flow Graph Analyzer has a designer component that allows one to visually create flow graph diagrams which can be converted into C++ stubs for application development. The runtime execution of these graph dependencies can differ between runs. Flow Graph Analyzer allows one to visualise the runtime behaviour using application traces. One can look into the timeline executions and runtime behaviour of the dependency graphs.
Platforms Linux, Windows
Languages/Models C++, Intel TBB
License Required Yes (part of Intel Advisor or Intel Parallel Studio)
Documentation Project Page, User Guide
When To Use • You have parallelised application using Intel TBB flow graph interface • You want to understand runtime execution of node dependencies to debug performance issues

HPCToolkit

Summary HPCToolkit is a performance measurement and analysis tool for parallel applications. It supports profiling and fine-grained tracing with minimal overhead for a reasonable choice of sampling period. It can collect hardware performance counters and derived metrics and attribute them back to the source code. The profile/trace data generated can be visualised using hpcviewer/hpctraceviewer.
Platforms Linux, Intel, IBM Power, BlueGene, Cray
Languages/Models C, C++, Fortran, OpenMP, MPI, Pthread
License Required No
Documentation Project Page, GitHub, User Guide
When To Use You want to profile application on workstation or large supercomputing platform

Hardware Performance Monitor (HPM)

Summary Hardware Performance Monitor (HPM) is a high level software layer for measuring hardware counters on IBM architectures. Compared to BGPM, HPM provides an easy to use API for configuring, controlling and reading hardware performance counters. It transparently handles multiplexing and overflows and outputs data in a human readable format.
Platforms Linux/AIX, IBM Power, IBM BlueGene
Languages/Models C, C++, Fortran, OpenMP, MPI, Pthread
License Required No
Documentation Project Page, User Guide
When To Use • You are running application on IBM systems • Instead of using profiling tool, you want simple API to measure performance counters

hpcviewer and hpctraceviewer

Summary hpcviewer and hpctraceviewer are performance visualisation tools for HPCToolkit. They are used to interactively explore profile and trace data respectively.
Platforms Linux, Windows, MacOS
Languages/Models C, C++, Fortran, OpenMP, MPI, Pthread
License Required No
Documentation GitHub, User Guide
When To Use You have used HPCToolkit to generate profile/traces and need a visualisation tool

IBM High Performance Computing Toolkit (HPCT)

Summary HPCT is a set of libraries and tools developed by IBM for performance measurement and analysis. It provides a high level interface for MPI profiling, MPI tracing and hardware performance monitoring using HPM. HPCT helps in application optimisation and tuning of serial as well as parallel applications.
Platforms IBM Systems (especially BlueGene-Q)
Languages/Models C, C++, Fortran, OpenMP, MPI, Pthread
License Required No
Documentation Project Page, User Guide
When To Use • You are running application on IBM system (e.g. BG-Q) • You are looking for easy to use API to collect performance data for whole execution or parts of the execution

Instruments

Summary Instruments is a performance analysis tool that is part of Xcode on MacOS. It replaces the previous profiler called Shark. Instruments is based on DTrace and can profile Mac OS as well as iOS applications. Instruments helps to collect traces and displays a timeline with cpu, memory, network, filesystem activity on Apple devices.
Platforms MacOS
Languages/Models C, C++, Objective-C, Objective-C++, Swift
License Required No
Documentation User Guide
When To Use You are developing/running application on Mac OS and want to profile with tool available at hand

Intel Advisor

Summary Intel Advisor or Advisor XE is a code vectorisation and threading assistance tool developed by Intel. It helps to improve vectorisation by analysing scalar/vector code generated by auto-vectorising compilers like GCC, Intel, LLVM, Microsoft. Advisor can perform various analyses like loop carried dependencies and memory access patterns. It generates detailed reports about inefficiencies, suggests code improvements and provides speedup estimates. Intel Advisor also contains a Threading Advisor which can help to find scalability issues and synchronisation errors.
Platforms Linux, Windows
Languages/Models C, C++, Fortran, C#, TBB, Cilk Plus, OpenMP
License Required Yes
Documentation User Guide
When To Use • Application kernels are vectorised by compiler but you are not sure about the vectorisation efficiency • You want to perform detailed memory access pattern analysis • You want to understand if loops can be vectorised

Intel Inspector

Summary Intel Inspector (successor of Intel Thread Checker) is a code correctness tool that helps to identify threading and memory errors. It performs dynamic instrumentation and analyses the execution to find intermittent, non-deterministic errors. Intel Inspector helps to find threading errors (like deadlocks, race conditions) and memory errors (like memory leaks, memory corruption, dangling pointers, uninitialized variables).
Platforms Linux, Windows
Languages/Models C, C++, Fortran, TBB, OpenMP, Pthread, Win32 threads
License Required Yes
Documentation Documentation
When To Use • You want to analyse memory issues (leaks, dangling pointers, un-initialized variables) • You have threaded application and want to find out issues like race conditions, deadlocks etc.

Intel Trace Analyzer and Collector (ITAC)

Summary ITAC is a tool for profiling and tuning MPI applications. It helps to identify hotspots and the causes of poor scaling performance. ITAC consists of three components : trace collector, trace analyser and message checker. It can collect a trace of an MPI application and helps to visualise the communication structure. Using the message checker one can also find incorrect or inefficient use of MPI constructs.
Platforms Linux, Windows
Languages/Models C, C++, Fortran, MPI
License Required Yes
Documentation User Guide
When To Use • You have an MPI application and want to understand its communication structure • You want to find inefficient/incorrect use of MPI constructs

Intel Compiler’s Profiler

Summary Intel compilers provide options to profile loops/functions in the application. This is an easy way to identify where the application is spending cpu cycles. Developer can decide level of instrumentation (loop, function, loop bodies) during compilation. Once the application is executed, text/xml report is generated which can be visualised with utility called loopprofileviewer.sh.
Platforms Linux, Windows, Mac OS
Languages/Models C, C++, Fortran
License Required Yes
Documentation Compiler Guide
When To Use • You are using Intel compiler for application development • Instead of using standalone profiler, you want to quickly find out most time consuming parts or ‘hotspots’ of the code

Intel Graphics Performance Analysers

Summary Intel Graphics Performance Analysers is a tool suite for analysing and optimising game/interactive 3D graphic applications. It consists of System Analyser (for real time performance feedback of the CPU and GPU), Frame Analyser (to determine where each frame is taking the most time), Platform Analyser (to identify the workloads that are running on the CPU & GPU) and Graphics Trace Analyser (for process level event traces on CPU & GPU).
Platforms Linux, Windows, Mac OS
Languages/Models Microsoft DirectX, Apple Metal, OpenGL
License Required No
Documentation User Guide
When To Use You want to analyse and optimise graphics applications especially on Intel CPU and Intel HD Graphics

Intel VTune Amplifier

Summary Intel VTune Amplifier is a low level performance analysis tool especially for Intel CPUs. Many features can work on AMD CPUs but advanced hardware-based sampling requires an Intel-manufactured CPU. Intel VTune supports different experiments like hotspot, memory access analysis, locks & wait analysis, concurrency analysis, bandwidth usage, HPC characterisation. By looking at hotspots, one can drill down to the instruction and hardware performance counter level analysis. Along with hardware performance counters, it presents many derived metrics to easily identify hardware bottlenecks.
Platforms Linux, Windows
Languages/Models C, C++, C#, Fortran, Java, Python, Go, OpenCL, OpenMP, Intel TBB, MPI
License Required Yes
Documentation User Guide
When To Use • You are running application on Intel CPUs and want to find out hotspots • You want to perform micro-architecture level analysis to find out hardware resource bottlenecks

Integrated Performance Monitoring (IPM)

Summary Integrated Performance Monitoring is an infrastructure developed by NERSC for high level performance analysis of parallel applications. It is designed with the goals of ease-of-use and scalability. IPM can be deployed as a monitoring framework for the entire workload on clusters/supercomputers. It generates a report with wall clock time, MPI communication, memory usage and floating point operations. It can be configured to generate a detailed XML report which can be visualised using a web interface.
Platforms Linux
Languages/Models C, C++, Fortran, MPI
License Required No
Documentation Project Website, GitHub
When To Use You want to deploy a low-overhead performance monitoring framework on a cluster or supercomputing platform

JProfiler

Summary JProfiler is a profiling tool developed by ej-technologies for Java applications. It can be used to analyse CPU & memory usage, dynamic memory allocations, thread execution, race conditions etc. JProfiler can perform live analysis on local/remote servers or offline analysis of profile data. It also provides a visual representation of virtual machine load in terms of active and total bytes, instances, threads/classes execution and garbage collector activity. JProfiler can be used as a standalone tool or integrated with IDEs like Eclipse, IntelliJ, NetBeans and JDeveloper.
Platforms Unix/Linux, Windows, Mac
Languages/Models Java SE, Java EE Subsystems and Databases
License Required Yes
Documentation User Guide
When To Use • You are developing a Java application • You want to find performance bottlenecks on a local workstation or remote server

Jumpshot

Summary Jumpshot is a performance visualisation tool that is part of the Multi-Processing Environment (MPE) package. It is used for timeline visualisation of SLOG-2 traces. Jumpshot helps to understand hotspots, communication patterns and load imbalance across MPI processes.
Platforms Linux, Windows, Mac
Languages/Models C, C++, Fortran, MPI
Documentation Project Website
License Required No
When To Use You have generated CLOG2/SLOG2 traces and are looking for an open source tool to visualise timeline traces
Note This is quite an old tool, but there are not many choices for timeline visualisation of MPI applications

JVM Profiler

Summary JVM Profiler is a performance analysis tool developed by Uber for analysing JVM applications in a distributed environment. JVM Profiler can attach a Java agent to the executors of a Spark/Hadoop application in a distributed way and collect various metrics at runtime. It can trace arbitrary Java methods/arguments without source code changes (similar to DTrace). It helps to analyse and debug memory usage, CPU usage and I/O issues at scale.
Platforms Unix/Linux, Windows, Mac
Languages/Models Java, Spark
Documentation Project Website, GitHub
License Required No
When To Use You want to understand performance bottlenecks of a standalone Java application or a Spark/Hadoop application at scale

Kcachegrind

Summary Kcachegrind (or QCacheGrind) is a performance visualisation tool for profilers like Cachegrind, Callgrind. It can present profile data in different ways (e.g. tree map, call graph) and perform source annotation. There are open source tools available for converting profile data to callgrind-format (e.g. for OProfile) and then Kcachegrind can be used for visualisation.
Platforms Linux, Windows, Mac OS
License Required No
Documentation User Guide, GitHub
When To Use You have used Valgrind tools (e.g. callgrind, cachegrind) for profiling and now want to visualise the performance data

Kerncraft

Summary Kerncraft is a performance modelling framework to investigate the data reuse and cache requirements of an application. It uses loop kernel analysis and static code analysis techniques. When combined with IACA, Kerncraft can give a good overview of both in-core and memory bottlenecks.
Platforms Unix/Linux
Languages/Models C, C++, Fortran
License Required No
Documentation GitHub
When To Use • You want to construct Roofline/ECM models for loops in the application • You want to understand CPU resource bottlenecks and optimization opportunities
Note Still in development / research tool
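
The Roofline model mentioned above bounds a loop's attainable performance by the smaller of the machine's peak compute rate and the rate at which memory can feed it. The following is a minimal sketch of that bound only, not Kerncraft's own implementation; the function name and all machine numbers are hypothetical and chosen purely for illustration.

```python
def roofline_gflops(peak_gflops, mem_bandwidth_gbs, flops_per_byte):
    """Classic Roofline bound: min(peak compute, bandwidth * arithmetic intensity)."""
    return min(peak_gflops, mem_bandwidth_gbs * flops_per_byte)


# Hypothetical machine: 100 GFLOP/s peak, 50 GB/s memory bandwidth.
PEAK_GFLOPS = 100.0
BANDWIDTH_GBS = 50.0

# Triad-like loop a[i] = b[i] + s * c[i] on doubles:
# 2 flops per iteration, 3 * 8 = 24 bytes moved (ignoring write-allocate traffic).
triad_intensity = 2 / 24

bound = roofline_gflops(PEAK_GFLOPS, BANDWIDTH_GBS, triad_intensity)
limiter = "memory" if bound < PEAK_GFLOPS else "compute"
print(f"attainable: {bound:.2f} GFLOP/s ({limiter} bound)")
```

A kernel whose bound falls on the bandwidth side of the roof benefits more from improving data reuse than from better vectorisation, which is the kind of conclusion such models are meant to support.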

Linux Trace Toolkit Next Generation (LTTng)

Summary LTTng is a low-overhead tracing framework for standalone applications, libraries and the kernel. It is the successor of Linux Trace Toolkit (LTT) and is available on many desktop, server and embedded Linux distributions. Similar to perf/DTrace, it can be used for system-wide introspection to understand interactions among multiple applications. Visualisation tools like Trace Compass and Sourcery Analyzer can be used for visualising collected traces.
Platforms Linux
License Required No
Documentation User Guide
When To Use You want to trace a single process or perform system-wide introspection with minimal overhead

Modular Assembly Quality Analyzer and Optimizer (MAQAO)

Summary MAQAO is a framework for static and dynamic analysis of binaries. As it operates at the binary level, MAQAO is programming model agnostic and is commonly used for single node performance analysis. It has a static analyser plugin that can assess the quality of loops with respect to vectorisation using micro-architecture performance models. MAQAO generates a report with suggestions to improve code performance (e.g. loop transformations, compiler hints).
Platforms Linux (x86, ARM)
Languages/Models C, C++, Fortran, OpenMP, Pthread, MPI
License Required No
Documentation User Guide
When To Use You want to analyse a binary for performance optimisation opportunities (e.g. vectorisation, inlining)

Massif

Summary Massif is a heap profiling tool that is part of the Valgrind framework. It helps to understand the memory usage of an application, including allocated memory, program stack and the extra bytes allocated for book-keeping and alignment. Massif also helps to identify critical memory leaks. The profiling data generated by Massif can be visualised using massif-visualizer.
Platforms Linux, MacOS
Languages/Models C, C++, Fortran
License Required No
Documentation User Guide
When To Use • You want to perform detailed memory usage analysis to reduce memory footprint • You want to identify critical memory leaks

memP

Summary memP is a lightweight, parallel heap profiling tool for MPI applications. It helps to find the heap allocations that cause an MPI rank to reach its memory-in-use high water mark (HWM). memP generates two types of report : a summary report and a task report. The summary report describes the memory HWM of each task over the execution of an application. The task report can be generated for each rank based on specific criteria and provides a snapshot of the heap memory currently in use. The generated report can be a plain text file or in XML format that can be visualised using mpiPview (part of Tool Gear).
Platforms Linux
Languages/Models C, C++, Fortran, MPI
License Required No
Documentation Project Page
When To Use You have an MPI application and want to analyse memory allocations reaching HWM
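
To make the high water mark idea concrete, here is a minimal single-process sketch using Python's standard `tracemalloc` module. It is only an analogy for what memP reports per MPI rank, not memP itself; the helper names `measure_high_water_mark` and `workload` are hypothetical.

```python
import tracemalloc


def measure_high_water_mark(workload):
    """Run workload and return (bytes still in use, peak bytes) as seen by tracemalloc."""
    tracemalloc.start()
    try:
        workload()
        current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return current, peak


def workload():
    # A short-lived 1 MB buffer: it raises the peak but is freed before we measure.
    temporary = bytearray(1_000_000)
    temporary[0] = 1
    del temporary


current, peak = measure_high_water_mark(workload)
print(f"in use at end  : {current} bytes")
print(f"high water mark: {peak} bytes")
```

The gap between the two numbers is the reason HWM-oriented tools exist: memory in use at exit says little about the largest footprint a process reached during the run.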

mpiP

Summary mpiP is a lightweight profiling tool for MPI applications. It collects MPI communication statistics local to each rank and hence has very low overhead at scale. The report generated at the end of execution can be in plain text format or XML format. The XML report can be visualised with mpiPview (part of Tool Gear).
Platforms Linux, Intel, IBM BlueGene, Cray Platforms
Languages/Models C, C++, Fortran, MPI
License Required No
Documentation Project Page, GitHub
When To Use You want to profile MPI communication on anything from a workstation to the largest supercomputing systems

Nsight

Summary Nsight is a development tool from NVIDIA for heterogeneous computing. It provides simultaneous debugging and profiling capabilities for CPU as well as GPU. Nsight helps to identify/analyse bottlenecks and monitor the activities of entire system. It can be integrated with Eclipse and Microsoft Visual Studio.
Platforms Linux, Windows, Mac OS
Languages/Models C, C++, CUDA, Direct3D, Vulkan, OpenGL
License Required No
Documentation User Guide
When To Use You are looking for an IDE with debugging and profiling capabilities for NVIDIA GPUs

nvprof

Summary nvprof is a command-line profiling tool for CUDA applications. It helps to collect and analyse profile data for both CPU and GPU. nvprof records the execution timeline including kernel execution, memory transfers, CUDA API calls and different hardware metrics. One can generate summary profiles or detailed traces to be visualised with nvvp. nvprof is capable of profiling CUDA kernels irrespective of the language they are written in.
Platforms Linux, Windows, Mac OS
Languages/Models CUDA
License Required No
Documentation User Guide
When To Use You are looking for a readily available command line tool for profiling CUDA applications

NVIDIA Visual Profiler (nvvp)

Summary NVIDIA Visual Profiler is a performance profiling tool developed by NVIDIA. It helps to analyse and optimise C/C++ applications using CUDA/OpenACC programming models. nvvp shows CPU and GPU activity in a unified timeline, including CUDA API calls, kernel launches and memory transfers. One can look at low-level performance metrics generated from hardware counters and software instrumentation. nvvp can analyse application execution, identify bottlenecks and suggest actions to eliminate or reduce them.
Platforms Linux, Windows, Mac OS
Languages/Models C, C++, CUDA, OpenACC
Documentation User Guide
License Required No
When To Use You are looking for a profiling tool to analyse and optimise the performance of a CUDA application on a local or remote system

ompP

Summary ompP is a profiling tool for OpenMP applications. It uses Opari for instrumenting OpenMP directives. ompP can be configured to use PAPI underneath to support measurement of hardware counters. ompP can perform overhead analysis and detect common inefficiencies in OpenMP applications.
Platforms Linux
Languages/Models C, C++, Fortran, OpenMP
License Required No
Documentation User Guide
When To Use You are looking for a simple tool for profiling OpenMP applications

OpenMP Pragma And Region Instrumentor 2 (OPARI2)

Summary OPARI2 is a source-to-source instrumentor for OpenMP applications. It is not used as a standalone profiler but as an instrumentor by many performance analysis tools like TAU, ompP and Score-P. It uses the POMP2 interface to surround OpenMP directives and OpenMP runtime calls.
Platforms Linux
Languages/Models C, C++, Fortran, OpenMP
License Required No
Documentation User Guide
When To Use You are building a profiling tool for OpenMP applications and need a source-to-source instrumentor

Open|SpeedShop (O|SS)

Summary Open|SpeedShop is a performance analysis toolset for applications running on workstations, clusters and supercomputing platforms. It uses both statistical sampling and tracing techniques to record performance information. O|SS can be used for analysing sequential, multi-threaded, MPI, CUDA and hybrid applications. It supports various analysis types including Program Counter Sampling, Hardware Performance Counters, MPI, I/O, OpenMP, Memory Usage and CUDA tracing.
Platforms Linux, Intel, AMD, ARM, IBM Power, NVIDIA GPU
Languages/Models C, C++, Fortran, OpenMP, MPI, Pthread, CUDA
License Required No
Documentation User Guide, GitHub
When To Use You are looking for a performance analysis tool for a workstation or supercomputing system
Note Installation is a bit heavy; prefer to use an existing installation if available

OProfile

Summary OProfile is a low-overhead, statistical profiling tool introduced in Linux kernel 2.4. Along with perf, OProfile is one of the most commonly used performance counter monitoring tools on Linux platforms. It uses the performance monitoring unit (PMU) available on the processor to retrieve information about different events (e.g. memory references, L2 cache requests, hardware interrupts). It can be used to profile a standalone executable or for system-wide introspection.
Platforms Linux
License Required No
Documentation User Guide
When To Use You want to analyse a single process or the entire system with a commonly available Linux tool
Note Administrative permissions are needed for hardware event monitoring or full system introspection

Oracle Developer Studio’s Performance Analyzer

Summary Performance Analyzer is a profiling tool that is part of Oracle Developer Studio (formerly Oracle Solaris Studio). It is primarily used on Solaris for x86 and SPARC architectures. The performance data is collected from various sources including statistical sampling, MPI communication, thread synchronisation, I/O calls and memory allocation.
Platforms Linux/Solaris
Languages/Models C, C++, Fortran, Java, Scala, OpenMP, MPI
License Required No
Documentation User Guide
When To Use You are developing an application on the Solaris platform and looking for an IDE with performance analysis capabilities

Performance Application Programming Interface (PAPI)

Summary PAPI provides a consistent, high-level programming interface for using the hardware performance counters found in major microprocessors. It is widely used by performance tools (e.g. TAU, Score-P, HPCToolkit, O|SS) to collect hardware metrics like flops, clock cycles, instruction counts and cache misses. PAPI provides access to native events but also defines many derived metrics like flops. It handles counter multiplexing and overflow transparently.
Platforms UNIX/Linux
Languages/Models C, C++
License Required No
Documentation User Guide
When To Use You are looking for a high-level API to access hardware counters without worrying too much about architecture details

Parallel Profile analysis (Paraprof)

Summary Paraprof is a performance analysis tool that is part of TAU. It supports profile data generated from different tools like TAU, mpiP, ompP, gprof, Score-P, HPCToolkit etc. Paraprof allows the user to load multiple experiments and compare them simultaneously. It provides different views like call graph, histogram, bar chart, call trees and 3-D topology/communication metric view.
Platforms Unix/Linux, MacOS, Windows
License Required No
Documentation User Guide
When To Use • You have used TAU for collecting performance data • You want to visualise/compare performance data from different experiments

Paraver

Summary Paraver is a flexible trace manipulation and visualisation tool. It has its own Paraver trace format and is commonly used with the Extrae tool. Paraver helps to get a qualitative global picture of the application behaviour by visual inspection. It has a flexible timeline trace visualiser for comparative analysis of multiple traces. Paraver can simultaneously show timelines with different performance metrics like communication and hardware counters.
Platforms Linux, Windows, Mac OS
Languages/Models C, C++, Fortran, MPI, OpenMP, OpenCL, pthreads, OmpSs, CUDA
License Required No
Documentation User Guide, GitHub
When To Use You have used Extrae for collecting performance data and now want to visualise it

perf

Summary perf (perf_events or perf tools) is a performance analysis tool introduced in Linux kernel version 2.6.31. It supports a wide range of analyses including hardware counters, tracepoints and dynamic probes. Perf is natively supported in many popular distributions including Red Hat and Debian. It provides per-task, per-CPU and per-workload counters and source code event annotation. Perf abstracts away CPU hardware differences and presents a generalised command line interface for performance measurement.
Platforms Linux
Languages/Models C, C++, Fortran, Java, Matlab
Documentation Wiki Page
License Required No
When To Use • You want to analyse a standalone application or the entire system using a readily available tool • You want to perform hotspot analysis or low-level hardware counter analysis
Note Administrative permissions are needed for hardware event monitoring or full system introspection

Perfworks

Summary Perfworks provides a C++ API for collecting hardware performance metrics for NVIDIA GPUs. It provides actionable, high-level metrics that help to recognise bottlenecks quickly. Other performance tools including Nsight and Visual Profiler use the Perfworks API underneath. Developers can call the Perfworks API to access low-level performance metrics.
Platforms Linux, Windows
Languages/Models C++, CUDA, OpenGL
License Required No
Documentation Project Page
When To Use Instead of a full-fledged profiler (like nvprof, nvvp), you are looking for a library to read performance metrics

Periscope Tuning Framework (PTF)

Summary Periscope Tuning Framework is a toolset for automated performance analysis and tuning of HPC applications. It is designed to assist developers in the performance optimisation workflow. Periscope provides various tuning plugins to automatically find the optimal combination of settings such as compiler flags, MPI settings, number of OpenMP threads in each parallel section, etc. It also supports workflow optimisation, where the whole process of optimising an application (for example running jobs on the HPC system, adjusting the job’s settings and recording data) can be formalised and partially automated.
Platforms Linux
Languages/Models C, C++, Fortran, MPI, OpenMP
License Required No
Documentation User Guide
When To Use You are looking for an auto-tuning framework to run jobs with different settings to find the optimal combination

PGI Profiler (PGPROF)

Summary PGPROF is a profiling tool shipped with the PGI compiler. It can profile applications running on CPU and GPU simultaneously. PGPROF displays a timeline of CPU and GPU activity, and includes an automated analysis engine to identify optimisation opportunities.
Platforms Linux, Mac OS, Windows, Intel, IBM Power
Languages/Models C, C++, Fortran, MPI, OpenACC
License Required No (Community Edition)
Documentation User Guide
When To Use You are developing a CPU/GPU application using the PGI compiler suite and looking for a profiling tool

pprof

Summary pprof is a performance visualisation tool developed by Google. It reads profiling samples in profile.proto format and can be used to visualise profile data generated by sampling tools like gperftools and perf. pprof can read profiles from a local file or over HTTP and generate text reports or graphical reports like flame graphs. If the profile samples contain machine addresses, pprof can annotate the samples with source using native binutils tools.
Platforms Unix/Linux
License Required No
Documentation Readme
When To Use You have generated profile data using a sampling tool like perf or gperftools and are looking for a visualisation tool
Note This tool should not be confused with command line tool pprof provided by TAU

Radeon GPU Profiler (RGP)

Summary Radeon GPU Profiler is a low-level optimisation tool developed by AMD for Radeon GPUs. It provides built-in hardware thread tracing, timing and occupancy information. RGP helps to analyze graphics, async compute usage, event timing, pipeline stalls, barriers, bottlenecks and other performance inefficiencies.
Platforms Linux, Windows
Languages/Models DirectX, Vulkan
License Required No
Documentation GitHub
When To Use You are a game developer looking for a profiling tool to understand how an AMD GPU is actually executing your application at the hardware level

Radeon GPU Analyzer (RGA)

Summary Radeon GPU Analyzer is an offline compiler and performance analysis tool developed by AMD to help developers optimise their shaders for AMD APUs/GPUs. Using RGA, developers can compile code for various AMD GPUs/APUs independently of the one physically installed on the system. It generates AMD ISA disassembly, performance statistics and static analysis reports for each target platform.
Platforms Linux, Windows
Languages/Models DirectX, Vulkan, OpenGL, OpenCL
License Required No
Documentation GitHub
When To Use You are targeting GPU/APU and want to investigate how different compiler optimizations and compilation chains affect the performance of shaders

PurifyPlus

Summary PurifyPlus is a runtime analysis tool that monitors application execution and reports key aspects of its behaviour. It analyses a program based on what it actually does when it runs. PurifyPlus supports memory debugging and performance analysis.
Platforms Unix/Linux, Windows
Languages/Models C, C++
License Required Yes
Documentation Product Page
When To Use You are looking for a tool that can monitor application execution and help to detect the cause of performance or memory bottlenecks

Reveal

Summary Reveal is an analysis and code restructuring assistant tool developed by Cray. It uses profile data generated by Craypat and performs static analysis of source code. With this it helps to identify time-consuming loops and provides feedback on loop dependencies and vectorisation. Reveal can also automatically mark up loops for OpenMP parallelisation. It performs variable scoping and creates directives with the appropriate private and shared clauses. The process is semi-automatic and still requires programmer input.
Platforms Linux, Cray Systems
Languages/Models C, C++, Fortran, OpenMP, MPI
License Required Yes
Documentation User Guide
When To Use You are developing an application on a Cray platform and looking for a tool to assist in OpenMP parallelisation or vectorisation

SCalable performance Analysis of LArge SCale parallel Applications (Scalasca)

Summary Scalasca is a performance measurement, analysis, and optimisation tool for MPI, OpenMP & Hybrid parallel applications. It has automated trace analysis capability that helps to identify wait states (caused by imbalanced workloads) and potential scaling bottlenecks (from communication and synchronisation). It uses Score-P for performance measurement.
Platforms Unix/Linux
Languages/Models C, C++, Fortran, MPI, OpenMP, Pthread
License Required No
Documentation User Guide
When To Use You are looking for a scalable performance analysis tool with automated trace analysis capabilities
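
As a rough illustration of how imbalanced workloads turn into wait states, the sketch below computes how long each rank would idle at a barrier while waiting for the slowest rank. This is a deliberately simplified model with hypothetical function names and made-up timings, not the trace analysis algorithm Scalasca actually uses.

```python
def barrier_wait_times(compute_times):
    """Time each rank idles at a barrier while waiting for the slowest rank."""
    slowest = max(compute_times)
    return [slowest - t for t in compute_times]


def imbalance_factor(compute_times):
    """Slowest rank divided by the average rank; 1.0 means perfectly balanced."""
    average = sum(compute_times) / len(compute_times)
    return max(compute_times) / average


# Hypothetical per-rank compute times (seconds) before a barrier.
times = [1.0, 2.0, 4.0, 1.0]
waits = barrier_wait_times(times)

print("wait per rank:", waits)
print("total wait   :", sum(waits))
print("imbalance    :", imbalance_factor(times))
```

In this toy case three of the four ranks spend most of the step waiting, which is the pattern an automated wait-state analysis is designed to surface without manual inspection of the timeline.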

Scalable Performance Measurement Infrastructure for Parallel Codes (Score-P)

Summary Score-P is a scalable tool suite for profiling, event tracing, and online analysis of HPC applications. It is used by other profiling tools including Scalasca, TAU and Vampir. Score-P uses OTF2 for traces and Cube4 for profiles. Score-P has a plugin architecture, and its functionality can be extended for specific use-cases (plugins are available on GitHub).
Platforms Unix/Linux
Languages/Models C, C++, Fortran, MPI, OpenMP, CUDA, OpenACC, OpenCL, SHMEM, Pthreads
License Required No
Documentation User Guide
When To Use You are looking for a scalable performance measurement tool for HPC applications

TAU (Tuning and Analysis Utilities)

Summary TAU is a portable performance analysis toolkit for instrumentation, performance measurement and performance analysis. It supports various instrumentation methods : source-to-source instrumentation (using PDT), compiler instrumentation, manual instrumentation using the API, and library interception at runtime. The profile data generated can be analysed using the command line tool pprof or the GUI tool Paraprof. The profile/trace data can also be analysed using tools like Cube, Vampir and Jumpshot. TAU also provides a tool called PerfExplorer for performance data mining.
Platforms Unix/Linux, Mac OS, Windows
Languages/Models C, C++, Fortran, Java, Python, MPI, OpenMP, CUDA, OpenACC, OpenCL, SHMEM, Pthreads, PGAS
License Required No
Documentation User Guide
When To Use You are looking for a portable and scalable performance analysis tool for parallel applications

Temanejo

Summary Temanejo is a graphical tool for analysing and debugging task-parallel, data-dependency-driven programming models. It allows one to display the task-dependency graph of application components, and allows simple interaction with the runtime system in order to control some aspects of parallel execution. Temanejo is able to assist debugging (to varying extents) for programming models like SMPSs, OmpSs, StarPU, PaRSEC and OpenMP. It uses the Ayudame library to collect information (so-called events) from supporting runtime systems, and to exert control over a runtime system.
Platforms Linux, Mac OS
Languages/Models OpenMP, OmpSs, SMPSs, StarPU, PaRSEC
License Required No
Documentation User Guide
When To Use • You have task-parallel applications and you are not sure about dependencies and runtime scheduling of tasks • You are looking for a visual debugging tool to understand how dependencies are resolved at runtime

ToolGear

Summary ToolGear is a framework for developing GUI tools with minimal effort. It provides a high-level, language-agnostic XML interface for designing user interfaces. ToolGear is shipped with visualisation tools like mpiPview (for visualising mpiP performance data) and Memcheckview (for visualising the output of Valgrind’s memcheck).
Platforms Unix/Linux
License Required No
Documentation Readme
When To Use You are looking for a visualisation tool for Valgrind’s memcheck or mpiP

Trace Compass

Summary Trace Compass is a tool for viewing and analysing traces from various tracing tools. It is first and foremost a framework with many built-in analyses and support for trace sources including LTTng, perf, GDB, Best Trace Format (BTF), libpcap (Packet CAPture) and user-defined text/XML traces. Trace Compass provides different views and graphs that present profile data in a more user-friendly and informative way than huge text dumps. There is a new Eclipse project called Trace Compass Incubator. It complements Trace Compass and includes additional features that are under development, contributed and maintained by the community.
Platforms Linux, Mac OS, Windows
License Required No
Documentation User Guide
When To Use You want to visually inspect traces collected by tracing tools like perf and LTTng

Vampir

Summary Vampir is a scalable trace analysis and visualisation framework for parallel applications. Vampir has optimised event analysis algorithms and customisable displays that enable fast and interactive rendering of very complex performance data. Many profiling tools like Score-P, TAU and Open|SpeedShop generate traces in OTF format that can be visualised by Vampir. It provides a large set of chart representations to analyse message passing characteristics, I/O behaviour and performance counters with timeline visualisations. This enables developers to quickly display and analyse arbitrary program behaviour at any level of detail. For analysing very large performance data, Vampir can be used with VampirServer in a client/server architecture.
Platforms Linux, Mac OS, Windows
Languages/Models MPI, SHMEM, OpenMP, Pthreads, OpenACC, CUDA, OpenCL
License Required Yes
Documentation User Guide
When To Use You are looking for a full-featured tool suite for analysing the performance and message passing characteristics of parallel applications

Very Sleepy

Summary Very Sleepy is a CPU profiling tool for C/C++ applications on the Windows platform. It is derived from the Sleepy profiler and uses a statistical sampling technique. Very Sleepy records the instruction pointer, and the sampled memory addresses are then mapped to functions/line numbers using debug information (PDB or DWARF2). It provides a graphical user interface that shows call graph information and timings, with options to export profiling data in CSV format.
Platforms Windows (x86/64)
License Required No
Documentation GitHub, Project Website
When To Use You are looking for a simple, standalone profiling tool for C/C++ applications on Windows
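
The statistical sampling technique can be sketched in a few lines: a helper thread periodically looks at where the profiled thread is executing and counts how often each function is seen. The example below is a toy Python analogue, not how Very Sleepy is implemented; `sample_function` and `busy_loop` are hypothetical names, and it relies on CPython's `sys._current_frames()`.

```python
import collections
import sys
import threading
import time


def busy_loop(seconds):
    """CPU-bound work that should dominate the collected samples."""
    deadline = time.perf_counter() + seconds
    counter = 0
    while time.perf_counter() < deadline:
        counter += 1
    return counter


def sample_function(func, *args, interval=0.001):
    """Run func in the calling thread while a helper thread samples its top frame."""
    target_id = threading.get_ident()
    samples = collections.Counter()
    done = threading.Event()

    def sampler():
        while not done.is_set():
            # Snapshot of the current top frame of every thread, keyed by thread id.
            frame = sys._current_frames().get(target_id)
            if frame is not None:
                samples[frame.f_code.co_name] += 1
            time.sleep(interval)

    thread = threading.Thread(target=sampler, daemon=True)
    thread.start()
    try:
        result = func(*args)
    finally:
        done.set()
        thread.join()
    return result, samples


result, samples = sample_function(busy_loop, 0.3)
for name, hits in samples.most_common(3):
    print(f"{name}: {hits} samples")
```

The sample counts approximate where time is spent without instrumenting every call, which is why sampling profilers tend to have low overhead; the exact counts vary from run to run.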

Windows Performance Toolkit

Summary Windows Performance Toolkit is a performance monitoring tool for Windows platform. It consists of two components: Windows Performance Recorder (WPR) and Windows Performance Analyzer (WPA). WPR is a powerful recording tool that creates Event Tracing for Windows (ETW) recordings. WPR provides built-in profiles with specific events to be recorded. WPA is a powerful analysis tool with very flexible UI. It has extensive graphing capabilities and data tables with full text search capabilities.
Platforms Windows
License Required No
Documentation User Guide
When To Use You are looking for a system introspection tool for Windows (similar to perf on Linux)

YourKit

Summary YourKit is a performance analysis tool for Java and .NET applications. It supports various analyses including CPU usage, memory usage, memory leaks, thread synchronisation and exception profiling. YourKit can be used for high-level analysis (to see application behaviour) or low-level detail (to pinpoint performance issues). It provides high-level monitoring of web, I/O and database activity. The profiling reports can be exported to other formats (e.g. XML, HTML, CSV) for third-party applications.
Platforms Linux, Windows, Mac OS
Languages/Models Java, .NET
License Required Yes
Documentation Project Website
When To Use You are looking for a tool to analyse Java (SE/EE) or .NET applications in development or production environment

Python Profiling Tools

Below is a list of profiling tools for analysing Python applications. These will be covered in a separate blog post, but here is a brief summary :

cProfile : Built in python module for profiling python scripts
pycallgraph : Python module that creates call graph visualizations for python applications
gprof2dot : Handy python script to convert the output from many profilers into a dot graph.
RunSnakeRun : GUI utility to view cProfile dumps in a sortable GUI view
SnakeViz : Browser based graphical viewer for the output of Python’s cProfile module
vprof : Visual profiler package providing rich and interactive visualizations for various Python program characteristics
line_profiler : Python module for doing line-by-line profiling of functions
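
As a small taste of the first entry, the following sketch profiles a function with the standard library's `cProfile` and formats the result with `pstats`. The helper `profile_call` and the deliberately naive `slow_sum` are hypothetical names used only for this example.

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Deliberately naive loop so that it shows up in the profile."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def profile_call(func, *args):
    """Run func under cProfile and return (result, text report)."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = func(*args)
    profiler.disable()

    # Send the report to a string buffer instead of stdout.
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(5)  # show at most 5 entries
    return result, buffer.getvalue()


result, report = profile_call(slow_sum, 100_000)
print(report)
```

For a whole script, running `python -m cProfile -o out.prof script.py` writes a dump that viewers such as SnakeViz or RunSnakeRun from the list above can open.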

Other Tools

Below is a list of additional tools. Some are not in active development, have better alternatives or come from a different domain.

PerfExpert : Easy-to-use automatic performance diagnosis and optimization tool for HPC applications
Perfsuite : Collection of tools, utilities, and libraries for software performance analysis
PapiEX : Tool designed to transparently and passively measure the hardware performance counters using PAPI
ravel : Trace visualisation tool for MPI applications
Zoom : Performance analysis tool for applications running on Linux and Mac OS
Sourcery Analyzer : Powerful tool for embedded design with profiling and analysis engine
ThreadSpotter : Tool for diagnosing performance issues related to data locality, cache usage, and thread interaction
dotTrace : Performance profiler for .NET applications
RedGate ANTS : Profiler for .NET desktop applications and ASP.NET MVC applications
JustTrace : Profiling tool for .NET applications
Java Profilers : Many Java profilers not covered in this post (JMemProf, JMP, JBoss Profiler, Java Interactive Profiler (JIP), Profiler4j, JMeasurement)
IDEs : Many IDEs like NetBeans, Eclipse, CLion have inbuilt profiling tools not covered in this post

CREDIT

Thanks to the following people who suggested missing tools or their favourites after this post was published :
    • Richard Neill from University of Manchester pointed out Aftermath
    • u/SantaCruzDad on reddit suggested Very Sleepy

If you have any questions or suggestions, or would like to improve the post with your favourite tool, I will be glad to hear from you!