Summary of Profiling Tools for Parallel Applications

Many scientific and industrial applications run on everything from workstations to the largest supercomputers in the world. With the continuous evolution of hardware platforms, achieving good performance is a challenging task. Many profiling tools are available to analyse and optimise performance, but not every tool or method is available on every platform, especially in high performance computing. The first step in a performance engineering workflow is to understand which tools are available and when they can be used. There is no one-size-fits-all solution: some tools offer a broad feature list for high level analysis, while others target a specific platform and expose low level hardware metrics.

When choosing a profiling tool, one needs to consider several aspects:

  • Goal : Are you interested in high level performance metrics? Or do you want to dive into low level hardware details?
  • Application : Is it a serial or a parallel application? Do you have access to the source code to rebuild the software?
  • Programming Language : Which programming language is used? Is it supported by the tool?
  • Programming Model : If the application is parallelised, which programming model does it use? Is it supported by the tool?
  • Hardware Platform : What is the target platform (CPU/GPU/FPGA)? Do you have administrative permissions on the system?

In this (long) post we will summarise different profiling tools (open source as well as commercial). If you are interested in parallel debugging tools, Summary of Debugging Tools will be useful.

The tools are listed in alphabetical order.

Aftermath

Summary Aftermath is a performance analysis tool for task parallel programs such as OpenMP and OpenStream applications. It provides an instrumented OpenMP runtime (based on the LLVM OpenMP runtime) that collects performance data during execution and generates a trace file at program termination. The graphical user interface helps to filter and analyse the traces interactively. Aftermath provides a timeline visualisation to correlate performance data with the execution of OpenMP constructs such as loops and tasks. Metrics such as task creation, execution, synchronisation and performance counters can be mapped onto the timeline.
Platforms Unix/Linux
Languages/Models C, C++, OpenMP, OpenStream
License Required No
Documentation User Guide
When To Use You have an OpenMP/OpenStream application and are looking for a tool to assist with performance debugging
Note Tracing is not supported for all OpenMP constructs and requires the clang compiler (see documentation)

Arm MAP

Summary Arm MAP is a profiling tool developed by Allinea Software (now part of Arm). MAP is primarily used for analysing parallel applications written in C, C++ and Fortran. It uses statistical sampling to record CPU/memory usage, thread activity, I/O, communication and synchronisation across MPI processes.
Platforms Intel x86, Intel Xeon Phi, IBM Power, Armv8, NVIDIA GPU
Languages/Models C, C++, Fortran, MPI, OpenMP, OpenSHMEM, OpenACC, CUDA
License Required Yes
Documentation User Guide
When To Use • You are looking for a profiling tool for parallel applications with an easy-to-use user interface • You have used Arm Performance Reports to understand high level characteristics and now want to dive into the details to understand hardware bottlenecks
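
To make the idea of statistical sampling concrete, here is a minimal sketch in Python. It is not how Arm MAP is implemented: it replays a made-up timeline of (function, duration) segments and takes a sample at a fixed period. The names `sample_profile`, `timeline` and `period` are hypothetical and only serve the illustration.

```python
from collections import Counter


def sample_profile(timeline, period):
    """Estimate the time share of each function by sampling at a fixed period.

    timeline: list of (function_name, duration) segments executed back to back.
    """
    hits = Counter()
    t_next = period / 2        # first sample, offset so it avoids segment boundaries
    start = 0.0
    for name, duration in timeline:
        end = start + duration
        while t_next < end:    # every sample that falls inside this segment
            hits[name] += 1
            t_next += period
        start = end
    total = sum(hits.values())
    return {name: count / total for name, count in hits.items()}


# Hypothetical run: 6 s of compute, 3 s waiting in MPI, 1 s of I/O.
shares = sample_profile([("compute", 6.0), ("mpi_wait", 3.0), ("io", 1.0)], period=1.0)
print(shares)
```

A segment shorter than the sampling period can be missed entirely, which is why sampling profilers trade some accuracy for low overhead.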

Arm Performance Reports

Summary Arm Performance Reports is a high level performance analysis tool to characterise application performance and pinpoint inefficiencies. It was also developed by Allinea Software (now part of Arm). The main goal of Arm Performance Reports is to show whether the application is fully utilising the target hardware resources. It generates a single-page text/HTML report which helps the developer to understand the computation, communication and I/O characteristics of the application.
Platforms Intel x86, Intel Xeon Phi, Armv8, NVIDIA GPU
Languages/Models C, C++, Fortran, MPI
License Required Yes
Documentation User Guide
When To Use • You haven't analysed the application before and are wondering whether it is fully utilising the hardware • You want to understand the computing characteristics of your application with minimal effort

Apprentice2

Summary Apprentice2 is a performance visualisation tool for profiling data collected by CrayPat. Depending on the instrumentation and analysis type, Apprentice2 can generate a variety of reports/graphs including load imbalance, serialisation, I/O activity, process-pair communication, execution timeline, call graphs, function/region profile and CPU/memory utilisation.
Platforms All Cray systems
License Required Yes
Documentation User Guide
When To Use • You have used CrayPat to collect performance data • CrayPat provides excellent command line utilities to analyse the data, but you prefer to explore profile data visually with an interactive graphical tool

Automatic Performance Collection (AutoPerf)

Summary AutoPerf is a library for collecting hardware performance counters and MPI usage information on IBM BlueGene-Q systems. AutoPerf is available on the BG-Q systems deployed at ALCF. There is an x86 port on GitHub.
Platforms IBM BlueGene-Q
License Required No
Documentation ALCF Instructions
When To Use • You are running an application on the IBM BlueGene-Q systems at ALCF (Mira, Cetus, Vesta) • You want to understand the performance and MPI usage of an application you have run

Blue Gene Performance Monitoring (BGPM)

Summary Blue Gene Performance Monitoring API (BGPM) provides a programming interface for accessing hardware performance counters on IBM BlueGene-Q. BGPM provides a C interface to monitor the main hardware counter sources: P Unit (CPU events), L2 Unit (L2 cache events), I/O Unit, Network Unit and Compute Node Kernel (CNK).
Platforms IBM BlueGene-Q
Languages/Models C, C++
License Required No
Documentation Doxygen documentation in /bgsys/drivers/ppcfloor/bgpm/docs/, PAPI BG-Q Report
When To Use • You are a performance tool developer • You want to use a low level API for accessing hardware performance counters

Caliper

Summary Caliper is a program instrumentation and performance measurement framework for HPC applications. Unlike traditional profiling tools, Caliper can be used to embed performance analysis capabilities into the application itself. It can allow third-party tools to access application context information, or it can be configured as a stand-alone performance recorder.
Platforms Unix/Linux
Languages/Models C, C++, Fortran
License Required No
Documentation GitHub, User Guide, Publication
When To Use Instead of using third-party profiling tools, you want to embed performance analysis capabilities into your application itself (Note: still in development / research stage)
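
The general idea of embedding measurement into the application can be sketched with a few lines of Python. This is not Caliper's API: `region` and `region_stats` are hypothetical names for a home-grown annotation that accumulates call counts and wall time per named code region.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Per-region statistics, filled in while the application runs.
region_stats = defaultdict(lambda: {"calls": 0, "seconds": 0.0})


@contextmanager
def region(name):
    """Mark a named code region and accumulate its call count and wall time."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stats = region_stats[name]
        stats["calls"] += 1
        stats["seconds"] += time.perf_counter() - start


# The annotations live inside the application code itself.
with region("solve"):
    for _ in range(3):
        with region("solve/kernel"):
            sum(range(1000))

for name, stats in region_stats.items():
    print(name, stats["calls"], f"{stats['seconds']:.6f}s")
```

Because the annotations stay in the source, the same regions can be measured on every run without attaching an external profiler.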

Cachegrind

Summary Cachegrind is a cache and branch prediction profiling tool that is part of the Valgrind framework. Cachegrind simulates how an application interacts with the cache hierarchy and branch predictor. It collects events such as data reads and cache misses and attributes them back to the source code. It performs dynamic recompilation of the binary at runtime and hence the runtime overhead can be huge (10-100x slower). There is a Windows port on GitHub.
Platforms Unix/Linux/MacOS (x86, PowerPC)
Languages/Models C, C++, Fortran
License Required No
Documentation User Guide
When To Use • You want to understand how your (serial) application interacts with the cache hierarchy • You want to use a tool available with common Linux distributions • As the runtime overhead can be huge, you have a small, representative input dataset so that the application runs only for a short time
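
The kind of cache simulation described above can be illustrated with a toy model. The sketch below is not Cachegrind's simulator: it assumes a single direct-mapped cache with made-up sizes (`LINE_BYTES`, `NUM_LINES`) and replays the addresses touched when sweeping a matrix in two different loop orders.

```python
# Toy direct-mapped data cache: 64 lines of 64 bytes each (hypothetical sizes).
LINE_BYTES = 64
NUM_LINES = 64


def count_misses(addresses):
    """Replay byte addresses through the cache and count the misses."""
    tags = [None] * NUM_LINES          # memory block currently held by each line
    misses = 0
    for addr in addresses:
        block = addr // LINE_BYTES     # memory block containing this byte
        index = block % NUM_LINES      # cache line the block maps to
        if tags[index] != block:       # cold or conflict miss
            tags[index] = block
            misses += 1
    return misses


def matrix_addresses(n, row_major, elem_bytes=8):
    """Addresses touched when sweeping an n x n array stored row by row,
    using either a row-major or a column-major loop order."""
    for i in range(n):
        for j in range(n):
            row, col = (i, j) if row_major else (j, i)
            yield (row * n + col) * elem_bytes


n = 128
print("row-major misses:   ", count_misses(matrix_addresses(n, True)))
print("column-major misses:", count_misses(matrix_addresses(n, False)))
```

With these sizes the row-major sweep misses once per cache line (2048 misses), while the column-major sweep misses on every one of its 16384 accesses. A cache simulator attributes exactly this kind of difference back to the offending loop.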

Callgrind

Summary Callgrind is a profiling tool that is part of the Valgrind framework. Callgrind records function execution and caller-callee relations and presents them as a call graph. The output of Callgrind is a flat call graph, but a tool like KCachegrind can be used to visualise the profiling data. It also provides an option to use Cachegrind capabilities to enrich the profiling information. As the application runs on a virtual processor emulated by Valgrind, execution takes considerably longer under Callgrind than it typically would.
Platforms Unix/Linux/MacOS (x86, AMD64)
Languages/Models C, C++, Fortran
License Required No
Documentation User Guide
When To Use • You want to profile a (serial) application with a tool available with common Linux distributions • As the runtime overhead can be huge, you have a small, representative input dataset so that the application runs only for a short time
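
To show what caller-callee data looks like, here is a small Python sketch that counts call edges with the interpreter's `sys.setprofile` hook. Callgrind itself works on machine code under Valgrind, so this only illustrates the shape of the data; `record_call_graph` and the three example functions are hypothetical.

```python
import sys
from collections import Counter


def record_call_graph(func, *args):
    """Run func(*args) and count caller -> callee edges between Python functions."""
    edges = Counter()

    def tracer(frame, event, arg):
        # "call" fires when a Python function is entered; f_back is its caller.
        if event == "call" and frame.f_back is not None:
            caller = frame.f_back.f_code.co_name
            callee = frame.f_code.co_name
            edges[(caller, callee)] += 1

    sys.setprofile(tracer)
    try:
        func(*args)
    finally:
        sys.setprofile(None)   # always remove the hook again
    return edges


def leaf():
    return 1


def middle():
    return leaf() + leaf()


def top():
    return middle() + leaf()


for (caller, callee), count in sorted(record_call_graph(top).items()):
    print(f"{caller} -> {callee}: {count}")
```

Each edge with its count is one row of a flat call graph; a visualiser then turns these rows into the familiar tree or graph view.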

Codelet Extractor and REplayer (CERE)

Summary CERE is a code isolation framework based on LLVM. CERE is used to find the hotspots of an application and then extract them into standalone kernels called codelets. A codelet can be modified, compiled, run and measured independently from the original application. This enables piecewise optimisation of the application and reduces the benchmarking cost.
Platforms Linux (x86, Aarch64)
Languages/Models C, C++, Fortran, D
License Required No
Documentation GitHub, Tutorial, Publication
When To Use You have a large code base and want to extract kernels to analyse and optimise separately
Note Research tool, still in development

CodeAnalyst / CodeXL

Summary CodeAnalyst was a profiler developed by AMD for x86/x86_64 processors and has now been replaced by CodeXL. CodeXL has a CPU/GPU profiler, a static kernel analyser, an HSA profiler and a graphics frame analyser. CodeXL also includes a debugger for CPU as well as GPU.
Platforms Linux, Windows
Languages/Models C, C++, OpenCL, OpenGL, DirectX, Vulkan
License Required No
Documentation GitHub, User Guides
When To Use • You are targeting an AMD platform • You are looking for an IDE to profile/debug applications on AMD CPU/GPU/APU

Cray Performance Analysis Toolkit (CrayPat)

Summary CrayPat is a performance measurement and analysis tool available on Cray systems. One can run a variety of analysis experiments including profiling, tracing, hardware performance counter analysis and load-imbalance analysis. There is also an easy-to-use, simplified version called CrayPat-lite for basic performance analysis. CrayPat has a feature called Automatic Program Analysis (APA) which uses profiling information to identify hotspots and then traces the selected hotspot routines. The profile data generated by CrayPat can be analysed using the command line tool pat_report or GUI tools like Apprentice2 and Reveal.
Platforms Linux (Cray Systems)
Languages/Models C, C++, Fortran, MPI, OpenMP, OpenACC, SHMEM, UPC, Pthreads
License Required Yes
Documentation User Guide
When To Use • You are running an application on Cray systems • You are looking for a tool capable of high level profiling as well as low level hardware counter analysis

CUBE Uniform Behavioral Encoding (CUBE)

Summary CUBE is a performance data explorer used with the Score-P, TAU and Scalasca frameworks. It is designed for interactive exploration of performance data in a scalable fashion. CUBE presents application performance in a multi-dimensional performance space: performance metric, call path and system resource. It also provides various command line tools and a library for reading/writing/manipulating profile data.
Platforms Linux, MacOS
Languages/Models C, C++, OpenMP, MPI, OpenACC, CUDA
License Required No
Documentation Project Website
When To Use • You have generated profile data with Score-P, Scalasca (or TAU) • You want to visualise the profile data or need a command line tool/library to manipulate it

DAGViz

Summary DAGViz is a performance visualisation tool for task parallel programs. Task-based programming models allow developers to expose logical parallelism by creating fine-grained tasks. The runtime system takes care of thread management, task scheduling, load balancing etc., and hence the order in which tasks execute can differ from run to run. The nondeterministic nature of task parallel execution hides the runtime scheduling mechanisms from programmers, which makes it challenging to understand the cause of suboptimal performance. DAGViz helps to visualise logical tasks and their runtime execution.
Platforms Linux, MacOS
Languages/Models C, C++, OpenMP, Cilk Plus, Intel TBB, Qthreads, MassiveThreads
License Required No
Documentation GitHub, Publication
When To Use • You have parallelised an application using a task parallel programming model • You want to analyse the runtime scheduling of tasks

Darshan

Summary Darshan is a lightweight I/O profiling tool for HPC applications. It helps to understand the I/O characteristics of an application including I/O access patterns, sizes, number of operations etc. Darshan has two components: Darshan-runtime (used to collect performance data on the target system) and Darshan-util (used for analysing the data collected by Darshan-runtime). Darshan can be deployed on production systems for I/O characterisation of the entire workload.
Platforms Unix/Linux
Languages/Models C, C++, Fortran, MPI, HDF5, NetCDF
License Required No
Documentation Project Page, Documentation
When To Use You have an MPI application and want to understand its I/O performance

Dimemas

Summary Dimemas is a performance prediction tool for MPI applications. It is a trace-driven simulator that helps to perform what-if analysis. Developers can define architectural parameters (CPU, node, network, filesystem etc.) for a non-existent target machine, and Dimemas then reconstructs the time behaviour of the application on the new target platform. It uses traces generated by the Extrae profiling tool.
Platforms Unix/Linux
Languages/Models C, C++, Fortran, MPI
License Required No
Documentation Project Page, GitHub
When To Use • You have used Extrae/Paraver for performance measurement and analysis • You want to understand (or predict) how your application will perform on a system to which you don't have access or which simply doesn't exist yet
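
A trace-driven what-if analysis can be sketched in a few lines. The code below is a deliberately simplified illustration, not Dimemas: it assumes a bulk-synchronous trace (per-rank compute times followed by one message per step) and a linear network model of latency plus size over bandwidth. The `Machine` fields and the `replay` function are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    cpu_speedup: float    # CPU speed relative to the machine the trace was taken on
    latency_s: float      # per-message network latency in seconds
    bandwidth_Bps: float  # network bandwidth in bytes per second


def replay(trace, machine):
    """Predict the runtime of a bulk-synchronous trace on a hypothetical machine.

    trace: list of steps, each (compute_seconds_per_rank, message_bytes).
    """
    total = 0.0
    for compute_per_rank, message_bytes in trace:
        # Ranks synchronise after every step, so the slowest rank sets the pace.
        total += max(compute_per_rank) / machine.cpu_speedup
        total += machine.latency_s + message_bytes / machine.bandwidth_Bps
    return total


# Hypothetical two-rank trace with two steps.
trace = [([1.0, 2.0], 1_000_000), ([0.5, 0.5], 0)]
traced = Machine(cpu_speedup=1.0, latency_s=0.0, bandwidth_Bps=1e6)
faster = Machine(cpu_speedup=2.0, latency_s=0.0, bandwidth_Bps=2e6)
print("traced machine:", replay(trace, traced))
print("what-if machine:", replay(trace, faster))
```

Changing only the `Machine` parameters and replaying the same trace is the essence of the what-if workflow: the application is never rerun.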

DTrace

Summary DTrace is a dynamic tracing framework for analysing applications on production systems in real time. It was originally developed for Solaris and has been ported to several Unix-like systems. DTrace is a scriptable framework: one can attach “probes” to a running system and peek inside to see what it is doing. It helps to understand the memory, CPU time, filesystem and network resources used by active processes. Strictly speaking it is a tracing tool rather than a profiler, but it is an excellent one and hence included in this post.
Platforms Linux, MacOS
Languages/Models Assembly, C, C++, Java, Erlang, JavaScript, Perl, PHP, Python, Ruby, shell script, Tcl
License Required No
Documentation DTrace Guide
When To Use You want to diagnose a performance issue (on a workstation, server or cloud environment) and need a tool capable of tracing in both user and kernel space

Dyninst

Summary The Dyninst library provides a machine independent interface to analyse and instrument binaries. Performance analysis tools often need to instrument an application at runtime. Dyninst allows this instrumentation after the linking phase or during execution. Multiple performance tools such as TAU, STAT, Open|SpeedShop and Extrae use Dyninst underneath.
Platforms Unix/Linux, MacOS
License Required No
Documentation Project Page, GitHub
When To Use You are developing a performance measurement tool, debugger or computational steering application that needs to modify the binary

Extra-P

Summary Extra-P is an automatic performance modelling tool developed alongside the Scalasca tool set. The main goal of Extra-P is to help find scalability bugs. The user runs small-scale performance experiments at different processor configurations to generate profiles, which are used as input for model generation. Extra-P generates a list of potential scalability bugs and human-readable models for different performance metrics.
Platforms Linux, MacOS
Languages/Models C, C++, Fortran, MPI, OpenMP
License Required No
Documentation User Guide
When To Use • You have used Score-P or Scalasca for performance analysis • You are curious about possible scalability issues at scale but don't have access to a large enough system • You want to generate performance models to find potential scalability issues
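
The modelling idea, fitting a human-readable scaling function to a handful of small-scale measurements, can be sketched as follows. This is a simplified illustration rather than Extra-P's actual algorithm: it assumes a single-term model t(p) = c0 + c1 * p^i * log2(p)^j, tries a small hand-picked set of exponents (`EXPONENTS`, an assumption) and keeps the hypothesis with the smallest squared error. The measurements are synthetic.

```python
import math

# Candidate exponents (i, j) for the model  t(p) = c0 + c1 * p**i * log2(p)**j
EXPONENTS = [(0.5, 0), (1, 0), (2, 0), (0, 1), (1, 1)]


def fit_line(xs, ys):
    """Ordinary least squares for y = c0 + c1 * x; returns (c0, c1, squared error)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    c1 = sxy / sxx
    c0 = my - c1 * mx
    err = sum((c0 + c1 * x - y) ** 2 for x, y in zip(xs, ys))
    return c0, c1, err


def best_model(procs, times):
    """Return (i, j, c0, c1) for the hypothesis with the smallest squared error."""
    best = None
    for i, j in EXPONENTS:
        xs = [p ** i * math.log2(p) ** j for p in procs]
        c0, c1, err = fit_line(xs, times)
        if best is None or err < best[0]:
            best = (err, i, j, c0, c1)
    return best[1:]


# Synthetic measurements that follow 3 + 0.5 * p * log2(p).
procs = [2, 4, 8, 16, 32]
times = [3 + 0.5 * p * math.log2(p) for p in procs]
i, j, c0, c1 = best_model(procs, times)
print(f"t(p) = {c0:.2f} + {c1:.2f} * p^{i} * log2(p)^{j}")
```

Once a model is selected, evaluating it at a much larger process count gives an early warning of a scalability bug without access to the large machine.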

Extrae

Summary Extrae is an instrumentation and performance measurement tool for HPC applications. It uses different interposition mechanisms such as library preloading, binary instrumentation (using Dyninst) and programming model specific instrumentation layers to inject probes into the target application. The profiles/traces can be analysed using Paraver and Dimemas.
Platforms Intel, BlueGene-Q, Cray, GPU, ARM, Fujitsu
Languages/Models C, C++, Fortran, Java, Python, MPI, OpenMP, Pthreads, OmpSs, CUDA, OpenCL
License Required No
Documentation Project Page, GitHub
When To Use • You want to analyse the performance of a parallel application on different architectures • You may want to perform what-if analysis using Dimemas and need an input timeline trace

gperftools

Summary gperftools (originally Google Performance Tools) is a collection of performance profiling and memory checking tools. It consists of a sampling-based CPU profiler, a heap checker and a heap profiler. The profile data can be visualised with pprof.
Platforms Unix/Linux
Languages/Models C, C++
License Required No
Documentation Documentation
When To Use You are looking for a simple-to-use profiling tool for analysing a (serial) C/C++ application

GPROF

Summary GPROF is a performance analysis tool commonly available on most Unix/Linux platforms. It combines instrumentation and sampling: it instruments the application at compile time but uses sampling at runtime to find performance hotspots. GPROF is an extended version of the older "prof" tool with the ability to generate call graph information. For non-trivial applications the overhead can be high due to instrumentation.
Platforms Unix/Linux
Languages/Models C, C++, Fortran
License Required No
Documentation Project Page
When To Use You have a not-so-complex C/C++ application and want to quickly get an idea of its hotspots using a system tool

GPU PerfStudio

Summary GPU PerfStudio is a performance and debugging tool developed by AMD. It was originally developed for Direct3D and OpenGL applications on Windows and later ported to Linux. GPU PerfStudio consists of five tools for graphics developers: Frame Debugger (to visualise the graphics state and resources in the frame), Frame Profiler (to identify per draw call performance issues at the hardware counter level), Shader Debugger (to step through and debug shader code and its output), API Tracer (to show CPU timing information) and Shader Analyzer (to help optimise shader code).
Platforms Linux, Windows, AMD GPU
Languages/Models C, C++, DirectX, OpenGL, Vulkan
License Required No
Documentation Project Page
When To Use You want to analyse and optimise game applications for AMD GPUs

Flow Graph Analyzer

Summary Flow Graph Analyzer is a visualisation tool for designing and analysing applications built using the Intel TBB flow graph interface. The flow graph interface allows developers to expose parallelism at a high level using data flow algorithms and dependency graphs. Flow Graph Analyzer has a designer component that allows one to visually create flow graph diagrams, which can be converted into C++ stubs for application development. The runtime execution of these graph dependencies can vary, so Flow Graph Analyzer also lets one visualise the runtime behaviour using application traces: one can inspect the execution timeline and the runtime behaviour of the dependency graphs.
Platforms Linux, Windows
Languages/Models C++, Intel TBB
License Required Yes (part of Intel Advisor or Intel Parallel Studio)
Documentation Project Page, User Guide
When To Use • You have parallelised an application using the Intel TBB flow graph interface • You want to understand the runtime execution of node dependencies to debug performance issues

HPCToolkit

Summary HPCToolkit is a performance measurement and analysis tool for parallel applications. It supports profiling and fine-grained tracing with minimal overhead, given a reasonable choice of sampling period. It can collect hardware performance counters and derived metrics and attribute them back to the source code. The profile/trace data generated can be visualised using hpcviewer/hpctraceviewer.
Platforms Linux, Intel, IBM Power, BlueGene, Cray
Languages/Models C, C++, Fortran, OpenMP, MPI, Pthread
License Required No
Documentation Project Page, GitHub, User Guide
When To Use You want to profile an application on a workstation or a large supercomputing platform

Hardware Performance Monitor (HPM)

Summary Hardware Performance Monitor (HPM) is a high level software layer for measuring hardware counters on IBM architectures. Compared to BGPM, HPM provides an easy-to-use API for configuring, controlling and reading hardware performance counters. It transparently handles multiplexing and overflows, and outputs data in a human readable format.
Platforms Linux/AIX, IBM Power, IBM BlueGene
Languages/Models C, C++, Fortran, OpenMP, MPI, Pthread
License Required No
Documentation Project Page, User Guide
When To Use • You are running an application on IBM systems • Instead of using a profiling tool, you want a simple API to measure performance counters

hpcviewer and hpctraceviewer

Summary hpcviewer and hpctraceviewer are performance visualisation tools for HPCToolkit. They are used to interactively explore profile and trace data respectively.
Platforms Linux, Windows, MacOS
Languages/Models C, C++, Fortran, OpenMP, MPI, Pthread
License Required No
Documentation GitHub, User Guide
When To Use You have used HPCToolkit to generate profile/traces and need a visualisation tool

IBM High Performance Computing Toolkit (HPCT)

Summary HPCT is a set of libraries and tools developed by IBM for performance measurement and analysis. It provides a high level interface for MPI profiling, MPI tracing and hardware performance monitoring (using HPM). HPCT helps with optimisation and tuning of serial as well as parallel applications.
Platforms IBM Systems (especially BlueGene-Q)
Languages/Models C, C++, Fortran, OpenMP, MPI, Pthread
License Required No
Documentation Project Page, User Guide
When To Use • You are running an application on an IBM system (e.g. BG-Q) • You are looking for an easy-to-use API to collect performance data for the whole execution or parts of it

Instruments

Summary Instruments is a performance analysis tool that is part of Xcode on MacOS. It replaces the previous profiler called Shark. Instruments is based on DTrace and can profile MacOS as well as iOS applications. It collects traces and displays a timeline with CPU, memory, network and filesystem activity on Apple devices.
Platforms MacOS
Languages/Models C, C++, Objective-C, Objective-C++, Swift
License Required No
Documentation User Guide
When To Use You are developing/running an application on MacOS and want to profile it with a tool that is readily at hand

Intel Advisor

Summary Intel Advisor (or Advisor XE) is a code vectorisation and threading assistance tool developed by Intel. It helps to improve vectorisation by analysing the scalar/vector code generated by auto-vectorising compilers such as GCC, Intel, LLVM and Microsoft. Advisor can perform various analyses such as loop carried dependencies and memory access patterns. It generates detailed reports about inefficiencies, suggests code improvements and provides speedup estimates. Intel Advisor also contains a Threading Advisor which can help to find scalability issues and synchronisation errors.
Platforms Linux, Windows
Languages/Models C, C++, Fortran, C#, TBB, Cilk Plus, OpenMP
License Required Yes
Documentation User Guide
When To Use • Application kernels are vectorised by the compiler but you are not sure about the vectorisation efficiency • You want to perform detailed memory access pattern analysis • You want to understand whether loops can be vectorised

Intel Inspector

Summary Intel Inspector (successor of Intel Thread Checker) is a code correctness tool that helps to identify threading and memory errors. It performs dynamic instrumentation and analyses the execution to find intermittent, non-deterministic errors. Intel Inspector helps to find threading errors (like deadlocks and race conditions) and memory errors (like memory leaks, memory corruption, dangling pointers and uninitialised variables).
Platforms Linux, Windows
Languages/Models C, C++, Fortran, TBB, OpenMP, Pthread, Win32 threads
License Required Yes
Documentation Documentation
When To Use • You want to analyse memory issues (leaks, dangling pointers, uninitialised variables) • You have a threaded application and want to find issues like race conditions, deadlocks etc.

Intel Trace Analyzer and Collector (ITAC)

Summary ITAC is a tool for profiling and tuning MPI applications. It helps to identify hotspots and the causes of poor scaling. ITAC consists of three components: trace collector, trace analyser and message checker. It can collect a trace of an MPI application and helps to visualise its communication structure. Using the message checker one can also find incorrect or inefficient use of MPI constructs.
Platforms Linux, Windows
Languages/Models C, C++, Fortran, MPI
License Required Yes
Documentation User Guide
When To Use • You have an MPI application and want to understand its communication structure • You want to find inefficient/incorrect use of MPI constructs

Intel Compiler's Profiler

Summary Intel compilers provide options to profile loops/functions in the application. This is an easy way to identify where the application is spending CPU cycles. The developer chooses the level of instrumentation (loops, functions, loop bodies) at compile time. Once the application has been executed, a text/XML report is generated which can be visualised with a utility called loopprofileviewer.sh.
Platforms Linux, Windows, Mac OS
Languages/Models C, C++, Fortran
License Required Yes
Documentation Compiler Guide
When To Use • You are using the Intel compiler for application development • Instead of using a standalone profiler, you want to quickly find the most time consuming parts or 'hotspots' of the code

Intel Graphics Performance Analysers

Summary Intel Graphics Performance Analysers is a tool suite for analysing and optimising games and interactive 3D graphics applications. It consists of System Analyser (for real time performance feedback of the CPU and GPU), Frame Analyser (to determine where each frame spends the most time), Platform Analyser (to identify the workloads that are running on the CPU & GPU) and Graphics Trace Analyser (for process level event traces on CPU & GPU).
Platforms Linux, Windows, Mac OS
Languages/Models Microsoft DirectX, Apple Metal, OpenGL
License Required No
Documentation User Guide
When To Use You want to analyse and optimise graphics applications especially on Intel CPU and Intel HD Graphics

Intel VTune Amplifier

Summary Intel VTune Amplifier is a low level performance analysis tool especially for Intel CPUs. Many features can work on AMD CPUs but advanced hardware-based sampling requires an Intel-manufactured CPU. Intel VTune supports different experiments like hotspot, memory access analysis, locks & wait analysis, concurrency analysis, bandwidth usage, HPC characterisation. By looking at hotspots, one can drill down to the instruction and hardware performance counter level analysis. Along with hardware performance counters, it presents many derived metrics to easily identify hardware bottlenecks.
Platforms Linux, Windows
Languages/Models C, C++, C#, Fortran, Java, Python, Go, OpenCL, OpenMP, Intel TBB, MPI
License Required Yes
Documentation User Guide
When To Use • You are running an application on Intel CPUs and want to find its hotspots • You want to perform micro-architecture level analysis to find hardware resource bottlenecks

Integrated Performance Monitoring (IPM)

Summary Integrated Performance Monitoring is an infrastructure developed by NERSC for high level performance analysis of parallel applications. It is designed with ease of use and scalability in mind. IPM can be deployed as a monitoring framework for the entire workload on clusters/supercomputers. It generates a report with wall clock time, MPI communication, memory usage and floating point operations. It can be configured to generate a detailed XML report which can be visualised using a web interface.
Platforms Linux
Languages/Models C, C++, Fortran, MPI
License Required No
Documentation Project Website, GitHub
When To Use You want to deploy a low overhead performance monitoring framework on a cluster/supercomputing platform

JProfiler

Summary JProfiler is a profiling tool developed by ej-technologies for Java applications. It can be used to analyse CPU & memory usage, dynamic memory allocations, thread execution, race conditions etc. JProfiler can perform live analysis on local/remote servers or offline analysis of profile data. It also provides a visual representation of the virtual machine load in terms of active and total bytes, instances, thread/class execution and garbage collector activity. JProfiler can be used as a standalone tool or integrated with IDEs like Eclipse, IntelliJ, NetBeans and JDeveloper.
Platforms Unix/Linux, Windows, Mac
Languages/Models Java SE, Java EE Subsystems and Databases
License Required Yes
Documentation User Guide
When To Use • You are developing a Java application • You want to find performance bottlenecks on a local workstation or a remote server

Jumpshot

Summary Jumpshot is a performance visualisation tool that is part of the Multi-Processing Environment (MPE) package. It is used for timeline visualisation of SLOG-2 traces. Jumpshot helps to understand hotspots, communication patterns and load imbalance across MPI processes.
Platforms Linux, Windows, Mac
Languages/Models C, C++, Fortran, MPI
License Required No
Documentation Project Website
When To Use You have generated CLOG2/SLOG2 traces and are looking for an open source tool to visualise timeline traces
Note This is quite an old tool, but there are not many choices for timeline visualisation of MPI applications

JVM Profiler

Summary JVM Profiler is a performance analysis tool developed by Uber for analysing JVM applications in a distributed environment. JVM Profiler can attach a Java agent to the executors of a Spark/Hadoop application in a distributed way and collect various metrics at runtime. It can trace arbitrary Java methods/arguments without source code changes (similar to DTrace). It helps to analyse and debug memory usage, CPU usage and I/O issues at scale.
Platforms Unix/Linux, Windows, Mac
Languages/Models Java, Spark
License Required No
Documentation Project Website, GitHub
When To Use You want to understand the performance bottlenecks of a standalone Java application or a Spark/Hadoop application at scale

KCachegrind

Summary KCachegrind (or QCacheGrind) is a performance visualisation tool for profilers like Cachegrind and Callgrind. It can present profile data in different ways (e.g. tree map, call graph) and perform source annotation. There are open source tools for converting profile data to the callgrind format (e.g. for OProfile), after which KCachegrind can be used for visualisation.
Platforms Linux, Windows, Mac OS
License Required No
Documentation User Guide, GitHub
When To Use You have used Valgrind tools (e.g. callgrind, cachegrind) for profiling and now want to visualise the performance data

Kerncraft

Summary Kerncraft is a performance modelling framework to investigate the data reuse and cache requirements of an application. It uses loop kernel analysis and static code analysis techniques. When combined with IACA, Kerncraft can give a good overview of both in-core and memory bottlenecks.
Platforms Unix/Linux
Languages/Models C, C++, Fortran
License Required No
Documentation GitHub
When To Use • You want to construct Roofline/ECM models for loops in the application • You want to understand CPU resource bottlenecks and optimisation opportunities
Note Research tool, still in development
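
For readers unfamiliar with the Roofline model mentioned above, its core fits in one function: attainable performance is the minimum of the peak compute rate and the memory bandwidth multiplied by the arithmetic intensity of the loop. The sketch below is not Kerncraft code, and the machine numbers (100 GFLOP/s peak, 48 GB/s bandwidth) are made up for illustration.

```python
def roofline(peak_gflops, bandwidth_gbs, flops_per_iter, bytes_per_iter):
    """Attainable GFLOP/s for one loop kernel under the Roofline model."""
    intensity = flops_per_iter / bytes_per_iter   # FLOP per byte moved
    return min(peak_gflops, intensity * bandwidth_gbs)


# Triad-like loop a[i] = b[i] + s * c[i] on doubles:
# 2 FLOP per iteration, 24 bytes moved (two loads and one store, ignoring write-allocate).
bound = roofline(peak_gflops=100.0, bandwidth_gbs=48.0,
                 flops_per_iter=2, bytes_per_iter=24)
print(f"attainable: {bound:.1f} GFLOP/s")
```

When the result is far below the peak, as here, the loop is memory bound and vectorising the arithmetic further will not help; reducing data traffic will.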

Linux Trace Toolkit Next Generation (LTTng)

Summary LTTng is a tracing framework for standalone applications, libraries and the kernel with minimal overhead. It is the successor of Linux Trace Toolkit (LTT) and is available on many desktop, server and embedded Linux distributions. Similar to perf/DTrace, it can be used for system-wide introspection to understand interactions among multiple applications. Visualisation tools like Trace Compass and Sourcery Analyzer can be used for visualising collected traces.
Platforms Linux
License Required No
Documentation User Guide
When To Use You want to trace a single process or perform system-wide introspection with minimal overhead

Modular Assembly Quality Analyzer and Optimizer (MAQAO)

Summary MAQAO is a framework for static and dynamic analysis of binaries. As it operates at the binary level, MAQAO is programming model agnostic and is commonly used for single node performance analysis. It has a static analyser plugin that can assess the quality of loops with respect to vectorisation using micro-architecture performance models. MAQAO generates a report with suggestions to improve code performance (e.g. loop transformations, compiler hints).
Platforms Linux (x86, ARM)
Languages/Models C, C++, Fortran, OpenMP, Pthread, MPI
License Required No
Documentation User Guide
When To Use You want to analyse a binary for performance optimisation opportunities (e.g. vectorisation, inlining)

Massif

Summary Massif is a heap profiling tool that is part of the Valgrind framework. It helps to understand the memory usage of an application, including allocated memory, the program stack and the extra bytes allocated for book-keeping and alignment. Massif also helps to identify critical memory leaks. The profiling data generated by Massif can be visualised using massif-visualizer.
Platforms Linux, MacOS
Languages/Models C, C++, Fortran
License Required No
Documentation User Guide
When To Use • You want to perform detailed memory usage analysis to reduce memory footprint • You want to identify critical memory leaks

memP

Summary memP is a lightweight, parallel heap profiling tool for MPI applications. It helps to find the heap allocations that cause an MPI rank to reach its memory-in-use high water mark (HWM). memP generates two types of report: a summary report and a task report. The summary report describes the memory HWM of each task over the execution of an application. The task report can be generated for each rank based on specific criteria and provides a snapshot of the heap memory currently in use. The generated report can be a plain text file or in XML format, which can be visualised using mpiPview (part of Tool Gear).
Platforms Linux
Languages/Models C, C++, Fortran, MPI
License Required No
Documentation Project Page
When To Use You have an MPI application and want to analyse memory allocations reaching HWM
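The high water mark idea can be illustrated on a single process with Python's standard `tracemalloc` module. This is only an analogy: memP works on the heap allocations of MPI ranks, whereas the sketch below tracks Python-level allocations, and the helper names `peak_heap_bytes` and `example_workload` are hypothetical.

```python
import tracemalloc


def peak_heap_bytes(workload) -> int:
    """Run workload() and return the peak traced allocation size in bytes."""
    tracemalloc.start()
    try:
        workload()
        # get_traced_memory() returns (current, peak) in bytes.
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def example_workload() -> None:
    # A large temporary buffer sets the high water mark ...
    big = [0] * 1_000_000
    del big
    # ... even though much less memory is in use afterwards.
    small = [0] * 1_000
    del small


if __name__ == "__main__":
    print(f"high water mark: {peak_heap_bytes(example_workload)} bytes")
```

The peak is reached while the large temporary list is alive, which is the kind of allocation a HWM report is meant to point at.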

mpiP

Summary mpiP is a lightweight profiling tool for MPI applications. It collects MPI communication statistics local to each rank and hence has very low overhead at scale. The report generated at the end of execution can be in plain text format or XML format. The XML report can be visualised with mpiPview (part of Tool Gear).
Platforms Linux, Intel, IBM BlueGene, Cray Platforms
Languages/Models C, C++, Fortran, MPI
License Required No
Documentation Project Page, GitHub
When To Use You want to profile MPI communication on anything from a workstation to the largest supercomputing systems

Nsight

Summary Nsight is a development tool from NVIDIA for heterogeneous computing. It provides simultaneous debugging and profiling capabilities for CPU as well as GPU. Nsight helps to identify and analyse bottlenecks and to monitor the activities of the entire system. It can be integrated with Eclipse and Microsoft Visual Studio.
Platforms Linux, Windows, Mac OS
Languages/Models C, C++, CUDA, Direct3D, Vulkan, OpenGL
License Required No
Documentation User Guide
When To Use You are looking for an IDE with debugging and profiling capabilities for NVIDIA GPUs

nvprof

Summary nvprof is a command-line profiling tool for CUDA applications. It helps to collect and analyse profile data for both CPU and GPU. nvprof records the execution timeline, including kernel execution, memory transfers and CUDA API calls, along with different hardware metrics. One can generate summary profiles or detailed traces to be visualised with nvvp. nvprof is capable of profiling CUDA kernels irrespective of the language they are written in.
Platforms Linux, Windows, Mac OS
Languages/Models CUDA
License Required No
Documentation User Guide
When To Use You are looking for a readily available command-line tool for profiling CUDA applications

NVIDIA Visual Profiler (nvvp)

Summary NVIDIA Visual Profiler is a performance profiling tool developed by NVIDIA. It helps to analyse and optimise C/C++ applications using CUDA/OpenACC programming models. nvvp shows CPU and GPU activity in a unified timeline, including CUDA API calls, kernel launches and memory transfers. One can look at low-level performance metrics generated from hardware counters and software instrumentation. nvvp can analyse application execution to identify bottlenecks and suggest actions to eliminate or reduce them.
Platforms Linux, Windows, Mac OS
Languages/Models C, C++, CUDA, OpenACC
Documentation User Guide
License Required No
When To Use You are looking for a profiling tool to analyse and optimise the performance of a CUDA application on a local or remote system

ompP

Summary ompP is a profiling tool for OpenMP applications. It uses Opari for instrumenting OpenMP directives. ompP can be configured to use PAPI underneath to support measurement of hardware counters. ompP can perform overhead analysis and detect common inefficiencies in OpenMP applications.
Platforms Linux
Languages/Models C, C++, Fortran, OpenMP
License Required No
Documentation User Guide
When To Use You are looking for a simple tool for profiling OpenMP applications

OpenMP Pragma And Region Instrumentor 2 (OPARI2)

Summary OPARI2 is a source-to-source instrumentor for OpenMP applications. It is not used as a standalone profiler but as an instrumentor by many performance analysis tools like TAU, ompP and Score-P. It uses the POMP2 interface to surround OpenMP directives and OpenMP runtime calls.
Platforms Linux
Languages/Models C, C++, Fortran, OpenMP
License Required No
Documentation User Guide
When To Use You are building a profiling tool for OpenMP applications and need a source-to-source instrumentor

Open|SpeedShop (O|SS)

Summary Open|SpeedShop is a toolset for performance analysis of applications running on workstations, clusters and supercomputing platforms. It uses both statistical sampling and tracing techniques to record performance information. O|SS can be used for analysing sequential, multi-threaded, MPI, CUDA and hybrid applications. It supports various analysis types including Program Counter Sampling, Hardware Performance Counters, MPI, I/O, OpenMP, Memory Usage and CUDA tracing.
Platforms Linux, Intel, AMD, ARM, IBM Power, NVIDIA GPU
Languages/Models C, C++, Fortran, OpenMP, MPI, Pthread, CUDA
License Required No
Documentation User Guide, GitHub
When To Use You are looking for a performance analysis tool for a workstation or a supercomputing system
Note Installation is a bit heavy; prefer an existing installation if available

OProfile

Summary OProfile is a low overhead, statistical profiling tool introduced in Linux kernel 2.4. Along with perf, OProfile is among the most commonly used performance counter monitoring tools on Linux platforms. It uses the performance monitoring unit (PMU) available on the processor to retrieve information about different events (e.g. memory references, L2 cache requests, hardware interrupts). It can be used to profile a standalone executable or for system-wide introspection.
Platforms Linux
License Required No
Documentation User Guide
When To Use You want to analyse a single process or the entire system with a commonly available Linux tool
Note Needs administrative permissions for hardware event monitoring or full system introspection

Oracle Developer Studio's Performance Analyzer

Summary Performance Analyzer is a profiling tool that is part of Oracle Developer Studio (formerly Oracle Solaris Studio). It is primarily used on Solaris for x86 and SPARC architectures. The performance data is collected from various sources including statistical sampling, MPI communication, thread synchronisation, I/O calls and memory allocation.
Platforms Linux/Solaris
Languages/Models C, C++, Fortran, Java, Scala, OpenMP, MPI
License Required No
Documentation User Guide
When To Use You are developing an application on the Solaris platform and looking for an IDE with performance analysis capabilities

Performance Application Programming Interface (PAPI)

Summary PAPI provides a consistent, high-level programming interface for using the hardware performance counters found in major microprocessors. It is widely used by performance tools (e.g. TAU, Score-P, HPCToolkit, O|SS) to collect hardware metrics like flops, clock cycles, instruction counts and cache misses. PAPI provides access to native events but also defines many derived metrics like flops. It handles counter multiplexing and overflow transparently.
Platforms UNIX/Linux
Languages/Models C, C++
License Required No
Documentation User Guide
When To Use You are looking for a high-level API to access hardware counters without worrying too much about architecture details

Parallel Profile analysis (Paraprof)

Summary Paraprof is a performance analysis tool that is part of TAU. It supports profile data generated from different tools like TAU, mpiP, ompP, gprof, Score-P, HPCToolkit etc. Paraprof allows the user to load multiple experiments and compare them simultaneously. It provides different views like call graph, histogram, bar chart, call trees and a 3-D topology/communication metric view.
Platforms Unix/Linux, MacOS, Windows
License Required No
Documentation User Guide
When To Use • You have used TAU for collecting performance data • You want to visualise/compare performance data from different experiments

Paraver

Summary Paraver is a flexible trace manipulation and visualisation tool. It has its own Paraver trace format and is commonly used with the Extrae tool. Paraver helps to get a qualitative global picture of the application behaviour by visual inspection. It has a flexible timeline trace visualiser for comparative analysis of multiple traces. Paraver can simultaneously show timelines with different performance metrics such as communication and hardware counters.
Platforms Linux, Windows, Mac OS
Languages/Models C, C++, Fortran, MPI, OpenMP, OpenCL, pthreads, OmpSs, CUDA
License Required No
Documentation User Guide, GitHub
When To Use You have used Extrae for collecting performance data and now want to visualise it

perf

Summary perf (perf_events or perf tools) is a performance analysis tool introduced in Linux kernel version 2.6.31. It supports a wide range of analyses including hardware counters, tracepoints and dynamic probes. Perf is natively supported in many popular distributions including Red Hat and Debian. It provides per-task, per-CPU and per-workload counters and source code event annotation. Perf abstracts away CPU hardware differences and presents a generalised command-line interface for performance measurement.
Platforms Linux
Languages/Models C, C++, Fortran, Java, Matlab
Documentation Wiki Page
License Required No
When To Use • You want to analyse a standalone application or the entire system using a readily available tool • You want to perform hotspot analysis or low-level hardware counter analysis
Note Needs administrative permissions for hardware event monitoring or full system introspection

Perfworks

Summary Perfworks provides a C++ API for collecting hardware performance metrics for NVIDIA GPUs. It provides actionable, high-level metrics that help to recognise bottlenecks quickly. Other performance tools, including Nsight and Visual Profiler, use the Perfworks API underneath. Developers can call the Perfworks API to access low-level performance metrics.
Platforms Linux, Windows
Languages/Models C++, CUDA, OpenGL
License Required No
Documentation Project Page
When To Use Instead of a full-fledged profiler (like nvprof, nvvp), you are looking for a library to read performance metrics

Periscope Tuning Framework (PTF)

Summary Periscope Tuning Framework is a toolset for automated performance analysis and tuning of HPC applications. It is designed to assist developers in the performance optimisation workflow. Periscope provides various tuning plugins to automatically find the optimal combination of settings such as compiler flags, MPI settings, number of OpenMP threads in each parallel section, etc. It also supports workflow optimisation, where the whole process of optimising an application (for example running jobs on the HPC system, adjusting the job’s settings and recording data) can be formalised and partially automated.
Platforms Linux
Languages/Models C, C++, Fortran, MPI, OpenMP
License Required No
Documentation User Guide
When To Use You are looking for an auto-tuning framework to run jobs with different settings to find the optimal combination

PGI Profiler (PGPROF)

Summary PGPROF is a profiling tool shipped with the PGI compiler. It can profile applications running on CPU and GPU simultaneously. PGPROF displays a timeline of CPU and GPU activity, and includes an automated analysis engine to identify optimisation opportunities.
Platforms Linux, Mac OS, Windows, Intel, IBM Power
Languages/Models C, C++, Fortran, MPI, OpenACC
License Required No (Community Edition)
Documentation User Guide
When To Use You are developing a CPU/GPU application using the PGI compiler suite and looking for a profiling tool

pprof

Summary pprof is a performance visualisation tool developed by Google. It reads profiling samples in profile.proto format and can be used to visualise profile data generated by sampling tools like gperftools and perf. pprof can read profiles from a local file or over HTTP, and generate text reports or graphical reports such as flame graphs. If the profile samples contain machine addresses, pprof can annotate the samples with source using native binutils tools.
Platforms Unix/Linux
License Required No
Documentation Readme
When To Use You have generated profile data using a sampling tool like perf or gperftools and are looking for a visualisation tool
Note This tool should not be confused with the command-line tool pprof provided by TAU

Radeon GPU Profiler (RGP)

Summary Radeon GPU Profiler is a low-level optimisation tool developed by AMD for Radeon GPUs. It provides built-in hardware thread tracing, timing and occupancy information. RGP helps to analyze graphics, async compute usage, event timing, pipeline stalls, barriers, bottlenecks and other performance inefficiencies.
Platforms Linux, Windows
Languages/Models DirectX, Vulkan
License Required No
Documentation GitHub
When To Use You are a game developer looking for a profiling tool to understand how an AMD GPU is actually executing your application at hardware level

Radeon GPU Analyzer (RGA)

Summary Radeon GPU Analyzer is an offline compiler and performance analysis tool developed by AMD to help developers optimise their shaders for AMD APUs/GPUs. Using RGA, developers can compile the code for various AMD GPUs/APUs independent of the one physically installed on the system. It generates AMD ISA disassembly, performance statistics and static analysis reports for each target platform.
Platforms Linux, Windows
Languages/Models DirectX, Vulkan, OpenGL, OpenCL
License Required No
Documentation GitHub
When To Use You are targeting a GPU/APU and want to investigate how different compiler optimizations and compilation chains affect the performance of shaders

PurifyPlus

Summary PurifyPlus is a runtime analysis tool that helps to monitor application execution and report key aspects of its behaviour. It looks at a program's behaviour based on what it does when it runs. PurifyPlus supports memory debugging and performance analysis.
Platforms Unix/Linux, Windows
Languages/Models C, C++
License Required Yes
Documentation Product Page
When To Use You are looking for a tool that can monitor application execution and help to detect the cause of performance or memory bottlenecks

Reveal

Summary Reveal is an analysis and code restructuring assistant tool developed by Cray. It uses profile data generated by CrayPat and performs static analysis of source code. With this it helps to identify time consuming loops and provides feedback on loop dependency and vectorisation. Reveal can also automatically mark up loops for OpenMP parallelisation. It performs variable scoping and creates directives with the appropriate private and shared clauses. The process is semi-automatic and still requires programmer input.
Platforms Linux, Cray Systems
Languages/Models C, C++, Fortran, OpenMP, MPI
License Required Yes
Documentation User Guide
When To Use You are developing an application on a Cray platform and looking for a tool to assist in OpenMP parallelisation or vectorisation

SCalable performance Analysis of LArge SCale parallel Applications (Scalasca)

Summary Scalasca is a performance measurement, analysis, and optimisation tool for MPI, OpenMP & Hybrid parallel applications. It has automated trace analysis capability that helps to identify wait states (caused by imbalanced workloads) and potential scaling bottlenecks (from communication and synchronisation). It uses Score-P for performance measurement.
Platforms Unix/Linux
Languages/Models C, C++, Fortran, MPI, OpenMP, Pthread
License Required No
Documentation User Guide
When To Use You are looking for a scalable performance analysis tool with automated trace analysis capabilities

Scalable Performance Measurement Infrastructure for Parallel Codes (Score-P)

Summary Score-P is a scalable tool suite for profiling, event tracing, and online analysis of HPC applications. It is used by other profiling tools including Scalasca, TAU and Vampir. Score-P uses OTF2 for traces and Cube4 for profiles. Score-P has a plugin architecture, so its functionality can be extended for specific use cases (available on GitHub).
Platforms Unix/Linux
Languages/Models C, C++, Fortran, MPI, OpenMP, CUDA, OpenACC, OpenCL, SHMEM, Pthreads
License Required No
Documentation User Guide
When To Use You are looking for a scalable performance measurement tool for HPC applications

TAU (Tuning and Analysis Utilities)

Summary TAU is a portable performance analysis toolkit for instrumentation, performance measurement and performance analysis. It supports various instrumentation methods: source-to-source instrumentation (using PDT), compiler instrumentation, manual instrumentation using an API, and library interception at runtime. The profile data generated can be analysed using the command-line tool pprof or the GUI tool Paraprof. The profile/trace data can also be analysed using tools like Cube, Vampir and Jumpshot. TAU also provides a tool called PerfExplorer for performance data mining.
Platforms Unix/Linux, Mac OS, Windows
Languages/Models C, C++, Fortran, Java, Python, MPI, OpenMP, CUDA, OpenACC, OpenCL, SHMEM, Pthreads, PGAS
License Required No
Documentation User Guide
When To Use You are looking for a portable and scalable performance analysis tool for parallel applications

Temanejo

Summary Temanejo is a graphical tool for analysing and debugging task-parallel, data-dependency-driven programming models. It allows one to display the task-dependency graph of application components, and allows simple interaction with the runtime system in order to control some aspects of parallel execution. Temanejo is able to assist debugging (to varying extent) for programming models such as SMPSs, OmpSs, StarPU, PaRSEC and OpenMP. It uses the Ayudame library to collect information, so-called events, from supporting runtime systems, and to exert control over a runtime system.
Platforms Linux, Mac OS
Languages/Models OpenMP, OmpSs, SMPSs, StarPU, PaRSEC
License Required No
Documentation User Guide
When To Use • You have task-parallel applications and you are not sure about dependencies and runtime scheduling of tasks • You are looking for a visual debugging tool to understand the dependency execution at runtime

ToolGear

Summary ToolGear is a framework for developing GUI tools with minimal effort. It provides a high-level, language agnostic XML interface for designing user interfaces. ToolGear is shipped with visualisation tools like mpiPview (for visualising mpiP performance data) and Memcheckview (for visualising Valgrind's memcheck performance data).
Platforms Unix/Linux
License Required No
Documentation Readme
When To Use You are looking for a visualisation tool for Valgrind's memcheck or the mpiP tool

Trace Compass

Summary Trace Compass is a tool for viewing and analysing traces from various tracing tools. It is first and foremost a framework with many built-in analyses, with support for LTTng, perf, GDB, Best Trace Format (BTF), libpcap (Packet CAPture) and user-defined text/XML traces. Trace Compass provides different views and graphs that present profile data in a more user-friendly and informative way than huge text dumps. There is a new Eclipse project called Trace Compass Incubator. It is a complement to Trace Compass and includes additional features that are under development, contributed and maintained by the community.
Platforms Linux, Mac OS, Windows
License Required No
Documentation User Guide
When To Use You want to visually inspect traces collected by tracing tools like perf and LTTng

Vampir

Summary Vampir is a scalable trace analysis and visualisation framework for parallel applications. Vampir has optimised event analysis algorithms and customisable displays that enable fast and interactive rendering of very complex performance data. Many profiling tools like Score-P, TAU and Open|SpeedShop generate traces in OTF format that can be visualised by Vampir. It provides a large set of chart representations to analyse message passing characteristics, I/O behaviour and performance counters with timeline visualisations. This enables developers to quickly display and analyse arbitrary program behaviour at any level of detail. For analysing very large performance data, Vampir can be used with VampirServer, which has a client/server architecture.
Platforms Linux, Mac OS, Windows
Languages/Models MPI, SHMEM, OpenMP, Pthreads, OpenACC, CUDA, OpenCL
License Required Yes
Documentation User Guide
When To Use You are looking for a full-featured tool suite for analysing the performance and message passing characteristics of parallel applications

Very Sleepy

Summary Very Sleepy is a CPU profiling tool for C/C++ applications on the Windows platform. It is derived from the Sleepy profiler and uses a statistical sampling technique. Very Sleepy records the instruction pointer, and the sampled memory addresses are then mapped to functions/line numbers using debug information (PDB or DWARF2). It provides a graphical user interface that shows call graph information and timings, with options to export profiling data in CSV format.
Platforms Windows (x86/64)
License Required No
Documentation GitHub, Project Website
When To Use You are looking for a simple, standalone profiling tool for C/C++ applications on Windows
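The statistical sampling technique used by tools of this kind can be sketched in a few lines of Python: a background thread periodically looks at what the target thread is executing and counts how often each function is seen. This is only an illustration of the idea, not how Very Sleepy is implemented; it samples Python frames rather than instruction pointers, it relies on the CPython-specific `sys._current_frames()`, and the names `sample_stack` and `hot_loop` are hypothetical.

```python
import collections
import sys
import threading
import time


def sample_stack(workload, interval_s: float = 0.005) -> collections.Counter:
    """Run workload() in the calling thread while a sampler thread
    periodically records which function that thread is executing."""
    target_id = threading.get_ident()
    samples = collections.Counter()
    done = threading.Event()

    def sampler() -> None:
        # wait() returns False on timeout, so we take one sample per interval
        # until the workload signals completion.
        while not done.wait(interval_s):
            frame = sys._current_frames().get(target_id)
            if frame is not None:
                samples[frame.f_code.co_name] += 1

    thread = threading.Thread(target=sampler, daemon=True)
    thread.start()
    try:
        workload()
    finally:
        done.set()
        thread.join()
    return samples


def hot_loop(duration_s: float = 0.2) -> int:
    """Busy loop standing in for a CPU-bound hotspot."""
    deadline = time.perf_counter() + duration_s
    total = 0
    while time.perf_counter() < deadline:
        total += 1
    return total


if __name__ == "__main__":
    for name, count in sample_stack(hot_loop).most_common():
        print(f"{name}: {count} samples")
```

Functions that consume more time are seen in more samples, which is why sampling gives a good picture of hotspots at low overhead but only approximate timings.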

Windows Performance Toolkit

Summary Windows Performance Toolkit is a performance monitoring tool for the Windows platform. It consists of two components: Windows Performance Recorder (WPR) and Windows Performance Analyzer (WPA). WPR is a recording tool that creates Event Tracing for Windows (ETW) recordings and provides built-in profiles with specific events to be recorded. WPA is an analysis tool with a very flexible UI. It has extensive graphing capabilities and data tables with full text search capabilities.
Platforms Windows
License Required No
Documentation User Guide
When To Use You are looking for a system introspection tool for Windows (similar to perf for Linux)

YourKit

Summary YourKit is a performance analysis tool for Java and .NET applications. It supports various analyses including CPU usage, memory usage, memory leaks, thread synchronisation and exception profiling. YourKit can be used for high-level analysis (to see application behaviour) or low-level detail (to pinpoint performance issues). It provides high-level monitoring of web, I/O and database activity. The profiling reports can be exported to other formats (e.g. XML, HTML, CSV) for 3rd party applications.
Platforms Linux, Windows, Mac OS
Languages/Models Java, .NET
License Required Yes
Documentation Project Website
When To Use You are looking for a tool to analyse Java (SE/EE) or .NET applications in development or production environment

Python Profiling Tools

Below is a list of profiling tools for analysing Python applications. These will be covered in a separate blog post, but here is a brief summary:

cProfile : Built-in Python module for profiling Python scripts
pycallgraph : Python module that creates call graph visualizations for Python applications
gprof2dot : Handy Python script to convert the output from many profilers into a dot graph
RunSnakeRun : GUI utility to view cProfile dumps in a sortable view
SnakeViz : Browser based graphical viewer for the output of Python’s cProfile module
vprof : Visual profiler package providing rich and interactive visualizations for various Python program characteristics
line_profiler : Python module for doing line-by-line profiling of functions
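As a quick taste of the first entry in the list, here is a minimal sketch that profiles a function with the standard library's cProfile module and prints the most expensive calls with pstats. The functions `slow_sum` and `profile_report` are made up for this example.

```python
import cProfile
import io
import pstats


def slow_sum(n: int) -> int:
    """Toy workload: sum of squares computed in a plain Python loop."""
    total = 0
    for i in range(n):
        total += i * i
    return total


def profile_report(n: int = 200_000, top: int = 5) -> str:
    """Profile slow_sum(n) and return a text report of the top entries."""
    profiler = cProfile.Profile()
    profiler.enable()
    slow_sum(n)
    profiler.disable()

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so callers of expensive functions show up first.
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


if __name__ == "__main__":
    print(profile_report())
```

For whole scripts, `python -m cProfile -o out.prof script.py` writes a dump that viewers such as SnakeViz or RunSnakeRun from the list above can open.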

Other Tools

Below is a list of additional tools; some are not in active development, have better alternatives or come from a different domain.

PerfExpert : Easy-to-use automatic performance diagnosis and optimization tool for HPC applications
Perfsuite : Collection of tools, utilities, and libraries for software performance analysis
PapiEX : Tool designed to transparently and passively measure the hardware performance counters using PAPI
ravel : Trace visualisation tool for MPI applications
Zoom : Performance analysis tool for applications running on Linux and Mac OS
Sourcery Analyzer : Powerful tool for embedded design with profiling and analysis engine
ThreadSpotter : Tool for diagnosing performance issues related to data locality, cache usage, and thread interaction
dotTrace : Performance profiler for .NET applications
RedGate ANTS : Profiler for .NET desktop applications and ASP.NET MVC applications
JustTrace : Profiling tool for .NET applications
Java Profilers : Many Java profilers not covered in this post (JMemProf, JMP, JBoss Profiler, Java Interactive Profiler (JIP), Profiler4j, JMeasurement)
IDEs : Many IDEs like NetBeans, Eclipse, CLion have built-in profiling tools not covered in this post

CREDIT

Thanks to the following people who suggested missing or favourite tools after this post was published:
    • Richard Neill from University of Manchester pointed out Aftermath
    • u/SantaCruzDad on reddit suggested Very Sleepy

If you have any questions or suggestions, or would like to improve the post with your favourite tool, I will be glad to hear from you!