Summary of Debugging Tools for Parallel Applications

Nowadays it's not uncommon to run parallel applications with hundreds of thousands of processes on supercomputing platforms. Debugging these parallel applications with sporadic crashes, deadlocks, memory errors or incorrect results is a challenging task. There are a number of tools available that help to identify and fix bugs, but one needs to understand the tools, their capabilities and when they can be used. This post summarises various debugging tools (open source as well as commercial).

Note that not all tools can be used with distributed applications. For example, open source tools like GDB and Valgrind are commonly used for debugging serial and multi-threaded applications. But one can debug small-scale MPI applications by launching multiple GDB instances with the help of an MPI launcher and a terminal emulator like xterm. Similarly, Valgrind can be used to debug distributed applications with the help of tools like MemcheckView. In certain cases the issue can be reproduced at small scale, where serial tools are helpful. Hence, along with better commercial alternatives, many serial tools are included in this post.

If you are interested in performance analysis, take a look at the Summary of Profiling Tools post, which summarises more than 90 profiling tools for performance analysis and optimisation.

The tools below are listed in alphabetical order.

Abnormal Termination Processing (ATP)

Summary ATP is a tool developed by Cray to help debug applications at scale. When an application crashes with tens or hundreds of thousands of ranks, generating and analysing core dumps from every rank is not practical. In the event of an application crash, ATP performs analysis on the dying application : all stack backtraces of the ranks are gathered into a merged stack backtrace tree and written to disk as a single dot file. This dot file can then be visualised with the stat-view tool from STAT and gives a concise yet comprehensive view of what the application was doing at the time of its termination. ATP uses MRNet to co-ordinate analysis at scale and the StackWalker API to collect the backtraces. It can be configured to attach a debugger like Totalview to perform detailed debugging.
Platforms Cray systems
License Required Yes
Documentation User Guide
When To Use • Your application is failing on a Cray system at scale and generating/analysing core dumps from thousands of ranks is not possible • You want a lightweight monitoring framework that merges stacks on application crash and dumps out concise information in a readable format
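
To make the idea of a merged stack backtrace tree concrete, here is a minimal Python sketch. It is not ATP's implementation: the function names `merge_backtraces` and `to_dot` and the four example backtraces are made up for illustration. It only shows how per-rank backtraces can be folded into one prefix tree and written out as dot text.

```python
from collections import defaultdict


def merge_backtraces(backtraces):
    """Merge per-rank backtraces (outermost frame first) into a prefix tree.

    Returns a dict mapping each call path (a tuple of frame names) to the
    sorted list of ranks whose backtrace passes through that path.
    """
    tree = defaultdict(set)
    for rank, frames in backtraces.items():
        for depth in range(1, len(frames) + 1):
            tree[tuple(frames[:depth])].add(rank)
    return {path: sorted(ranks) for path, ranks in tree.items()}


def to_dot(tree):
    """Render the merged tree as Graphviz dot text, one node per call path."""
    def node_id(path):
        return '"' + "/".join(path) + '"'

    lines = ["digraph backtraces {"]
    for path in sorted(tree):
        ranks = tree[path]
        label = f"{path[-1]}\\n{len(ranks)} rank(s): {ranks}"
        lines.append(f'  {node_id(path)} [label="{label}"];')
        if len(path) > 1:
            lines.append(f"  {node_id(path[:-1])} -> {node_id(path)};")
    lines.append("}")
    return "\n".join(lines)


# Hypothetical backtraces collected from four ranks at the time of a crash.
example = {
    0: ["main", "solve", "MPI_Allreduce"],
    1: ["main", "solve", "MPI_Allreduce"],
    2: ["main", "solve", "MPI_Allreduce"],
    3: ["main", "read_input", "abort"],
}
merged = merge_backtraces(example)
print(to_dot(merged))
```

Three of the four ranks collapse into a single branch, so the odd one out (rank 3 in `read_input`) stands out immediately. That is the property that makes a merged tree readable at hundreds of thousands of ranks.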

AQtime

Summary AQtime is a profiling and debugging tool developed by SmartBear Software. It helps to identify performance bottlenecks, memory leaks and code coverage gaps, and to track resource utilisation. AQtime provides multiple modes for performance analysis : one can start with lightweight sampling and then drill deeper into the hotspots using the more accurate profiling mode. It allows monitoring threads (Windows, .NET and COM) and profiling on a per-thread basis. AQtime can be used as a standalone performance profiler or integrated into IDEs like Microsoft Visual Studio and RAD Studio, allowing analysis of the application without leaving the development environment.
Platforms Windows
Languages/Models C, C++, Delphi, .NET, Java
License Required Yes
Documentation User Guide
When To Use You have a serial or threaded application and are looking for a debugger or profiler on the Windows platform

Archer

Summary Archer is an open source, portable data race detector for OpenMP applications. It combines static and dynamic analysis techniques to detect data races with high accuracy and lower runtime overhead, even for large codebases. Archer is built on top of open source tools such as LLVM-Clang OpenMP runtime, ThreadSanitizer and Polly. It helps to find data races and non-deterministic application behaviour (crashes, program exceptions, wrong results, etc.) that are hard to find with traditional debugging tools. Depending on the application workload, the runtime could slow down by 2x-20x.
Platforms Unix/Linux, Mac OS, Windows
Languages/Models C, C++, OpenMP
License Required No
Documentation GitHub, README
When To Use You have an OpenMP application and are looking for an open source tool to find nasty data races and non-deterministic application behaviour

ARM Distributed Debugging Tool (DDT)

Summary Arm DDT is a graphical debugger developed by Allinea Software (now part of Arm). It is primarily used for debugging parallel applications on clusters and supercomputing platforms. DDT supports simultaneous debugging on heterogeneous architectures, for example of CPU and GPU codes together. It helps to find memory issues including out-of-bounds accesses and memory leaks. DDT provides an intuitive GUI to browse source code and to examine scalar/array variables and call stacks across processes in a single view. It provides remote debugging functionality and supports an offline mode to debug applications non-interactively with batch systems. DDT uses GDB underneath. Along with Totalview, DDT is a commonly used debugger on supercomputing platforms.
Platforms Linux (Intel x86, Intel Xeon Phi, IBM Power, Arm, NVIDIA GPU)
Languages/Models C, C++, Fortran, MPI, OpenMP, OpenACC, CUDA, UPC
License Required Yes
Documentation User Guide
When To Use You are looking for a debugging tool for serial, multi-threaded or parallel applications on a desktop, a cluster or the largest supercomputers

bgq_stack

Summary bgq_stack is a utility to print a symbolic version of the stack from ASCII core files. When an application terminates abnormally on the BlueGene platform, core files (one for each rank) are generated in plain text format. The core file contains frame addresses representing the function call stack, which can help to identify the line and function executing at the time of program termination. bgq_stack generates the source file and line number from the core file using debug information from the executable and the addr2line utility.
Platforms IBM BlueGene-Q
License Required No
Documentation ALCF Instructions
When To Use You have a text core file generated on a BlueGene system and need to identify what line of what routine was executing when the error occurred
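
The address-to-source step can be sketched in a few lines of Python. This is only an illustration of the idea, not bgq_stack itself. The layout of the stack section (a `+++STACK` / `---STACK` block with a frame address and a saved link register per row), the sample addresses and the executable name `./app.exe` are all assumptions made for the example, so check them against a real core file. The `-e` and `-f` options of addr2line select the executable and ask for function names.

```python
import re
import shlex

# Assumed layout: rows inside the stack section hold a frame address followed
# by the saved link register, both written as 16-digit hex values.
FRAME_ROW = re.compile(r"^\s*([0-9a-fA-F]{16})\s+([0-9a-fA-F]{16})\s*$")


def saved_return_addresses(core_text):
    """Return the saved link-register values found between the STACK markers."""
    addresses = []
    in_stack = False
    for line in core_text.splitlines():
        if line.startswith("+++STACK"):
            in_stack = True
        elif line.startswith("---STACK"):
            in_stack = False
        elif in_stack:
            match = FRAME_ROW.match(line)
            if match:
                addresses.append(hex(int(match.group(2), 16)))
    return addresses


def addr2line_command(executable, addresses):
    """Build the addr2line argument list that resolves the given addresses."""
    return ["addr2line", "-e", executable, "-f"] + list(addresses)


# Hypothetical excerpt of a text core file (addresses are invented).
sample_core = """\
+++STACK
Frame Address     Saved Link Reg
0000001fffff5ac0  000000000100a2b4
0000001fffff5b60  0000000001000f18
---STACK
"""
addrs = saved_return_addresses(sample_core)
print(shlex.join(addr2line_command("./app.exe", addrs)))
```

The sketch only prints the command instead of running it, since the result depends on having the matching executable built with debug information.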

Line Mode Debugger (LGDB) and Cray Comparative Debugger (CCDB)

Summary LGDB and CCDB are debugging tools developed by Cray. LGDB is a GDB-based parallel debugger with a command line interface. In addition to many GDB features, it includes extensions to handle parallel execution. CCDB is a GUI tool for comparative debugging and uses LGDB underneath. It allows the user to run two versions of the application simultaneously : one that generates correct results and another that generates incorrect results. The user can define expressions to be compared between runs. By comparing the data structures of the two, CCDB can help to identify the location where the two codes start to differ from each other. This methodology can be used across different run-time environments and different hardware, for example when a code is ported from a CPU to a GPU.
Platforms Cray systems
Languages/Models C, C++, Fortran, MPI
License Required Yes
Documentation Cray Debugger User Guide
When To Use • You want to debug an application by comparing it against a working older version • You want to run two versions side by side, possibly at different scales or on different hardware, and compare data structures between them
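
The core of comparative debugging is locating the first point at which two runs disagree. The following Python sketch shows that idea on made-up data. The snapshot format, the variable name `u` and the function `first_divergence` are assumptions for illustration and have nothing to do with CCDB's actual interface.

```python
import math


def first_divergence(reference, candidate, rel_tol=1e-9, abs_tol=0.0):
    """Return (step, name, index) of the first value that differs, or None.

    Both arguments are lists of snapshots; each snapshot maps a variable name
    to a list of floats captured at the same point in the two runs.
    """
    for step, (ref_snap, cand_snap) in enumerate(zip(reference, candidate)):
        for name in sorted(ref_snap):
            pairs = zip(ref_snap[name], cand_snap[name])
            for index, (ref_value, cand_value) in enumerate(pairs):
                if not math.isclose(ref_value, cand_value,
                                    rel_tol=rel_tol, abs_tol=abs_tol):
                    return step, name, index
    return None


# Hypothetical snapshots from a working build and a ported build.
working = [{"u": [1.0, 2.0, 3.0]}, {"u": [1.5, 2.5, 3.5]}, {"u": [2.0, 3.0, 4.0]}]
ported = [{"u": [1.0, 2.0, 3.0]}, {"u": [1.5, 2.5, 3.6]}, {"u": [2.0, 3.1, 4.2]}]

print(first_divergence(working, ported))
```

The later snapshots differ in more places, but the first mismatch (step 1, element 2 of `u`) is usually the one worth investigating, because everything after it may just be the error propagating.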

Coreprocessor

Summary Coreprocessor is a basic parallel debugger that can help to debug problems at application, kernel or hardware level on BlueGene platform. It uses the low-level hardware JTAG interface to read and organise hardware information. Coreprocessor can sort nodes based on their stack traceback and kernel status, which can help isolate a failing or problem node quickly. It can attach to running processes for deadlock determination and can be used to analyze, sort and view text core files.
Platforms IBM BlueGene-Q
License Required No
Documentation Blue Gene/Q System Administration Guide, ALCF Instructions
When To Use You need a tool to examine a large number of core files generated by an application on a BlueGene system

CUDA-GDB

Summary CUDA-GDB is a debugging tool developed by NVIDIA to assist simultaneous debugging of both GPU and CPU code. It is based on the x86-64 port of GDB with additional features for debugging CUDA applications on actual GPU hardware. CUDA-GDB allows the user to set breakpoints and to watch and modify variables (local, shared, global) and memory of any thread running on the device. It supports source/assembly level debugging on multi-GPU systems. CUDA-GDB can be integrated with DDD, Emacs or Nsight Eclipse Edition.
Platforms Linux, Mac OS, Android
Languages/Models C, C++, Fortran, CUDA
License Required No
Documentation User Guide, GitHub
When To Use You are familiar with GDB for CPU debugging and want a similar command line debugger for CUDA applications running on GPUs

CUDA-MEMCHECK

Summary CUDA-MEMCHECK is a correctness checking tool suite included in the CUDA toolkit. It helps to identify the cause of memory access and runtime execution errors in GPU codes. CUDA-MEMCHECK monitors threads running on the GPU device and detects various errors such as out-of-bounds accesses, misaligned memory accesses, stack overflows, illegal instructions and potential race conditions. For each error it can display stack backtraces on host and device with source file and line number.
Platforms Linux, Mac OS
Languages/Models C, C++, CUDA
License Required No
Documentation User Guide
When To Use You are looking for a Valgrind-like tool to find memory access errors on GPUs

Curses Debugger (cgdb), Data Display Debugger (DDD) and KDbg

Summary cgdb, DDD and KDbg are graphical front-ends for command line debuggers. cgdb provides a lightweight curses interface to GDB. DDD can be used with a number of command line debuggers like GDB, DBX, JDB, XDB etc. KDbg is a KDE-based graphical user interface for GDB. They provide the basic functionality one expects, such as searching, viewing and stepping through source code, inspecting data structures, setting/clearing/enabling/disabling breakpoints and displaying/watching arbitrary expressions.
Platforms Unix/Linux
Languages/Models C, C++, Fortran, Java, Perl, PHP, Python
License Required No
Documentation cgdb User Guide, DDD User Guide, KDbg User Guide
When To Use You are using command line debuggers and looking for a graphical user interface

DELEAKER

Summary DELEAKER is a memory profiler and leak detection tool for Windows applications. It intercepts all resource allocations such as memory, GDI objects and handles, and records the corresponding call stack. DELEAKER allows taking snapshots during application execution and provides a GUI to compare/analyse them with a full stack view. It can be used as a standalone application or integrated with Visual Studio.
Platforms Windows
Languages/Models C, C++, C#, .NET
License Required Yes
Documentation User Guide
When To Use You are looking for a memory/GDI/Handle/FileView leak detection tool on the Windows platform

Dr. Memory

Summary Dr. Memory is a cross-platform memory monitoring tool. It helps to identify memory errors like uninitialized accesses, out-of-bounds accesses, double frees, memory leaks etc. Dr. Memory uses the DynamoRIO code manipulation framework underneath for dynamic instrumentation. It also provides the drstrace tool for Windows, which offers system call tracing functionality similar to strace.
Platforms Unix/Linux, Mac OS, Windows, Android
License Required No
Documentation User Guide, Publication, GitHub
When To Use You are looking for a faster memory correctness tool compared to tools like Valgrind's Memcheck

DTrace

Summary DTrace is a dynamic tracing framework for analysing applications on production systems in real time. It was originally developed for Solaris and has been ported to several Unix-like systems. DTrace is a scriptable framework : one can attach “probes” to a running system and peek inside to see what it is doing. It helps to understand the memory, CPU time, filesystem and network resources used by active processes.
Platforms Linux, MacOS
Languages/Models Assembly, C, C++, Java, Erlang, JavaScript, Perl, PHP, Python, Ruby, shell script, Tcl
License Required No
Documentation DTrace Guide
When To Use You want to diagnose application issues (on a workstation, server or cloud environment) and need a tool capable of tracing in both user and kernel space

GPU PerfStudio

Summary GPU PerfStudio is a performance and debugging tool developed by AMD. It was originally developed for Direct3D and OpenGL applications on Windows and later ported to Linux. GPU PerfStudio consists of five important tools for graphics developers : Frame Debugger (to visualise the graphics state and resources in the frame), Frame Profiler (to identify per draw call performance issues at the hardware counter level), Shader Debugger (to step through and debug shader code and its output), API Tracer (to show CPU timing information) and Shader Analyzer (to help optimise shader code).
Platforms Linux, Windows, AMD GPU
Languages/Models C, C++, DirectX, OpenGL, Vulkan
License Required No
Documentation Project Page
When To Use You want to analyse and optimise game applications for AMD GPUs

Helgrind and Data Race Detector (DRD)

Summary Helgrind and DRD are error detection tools for multi-threaded applications. These tools are part of the Valgrind framework and can be used with applications that use POSIX threading primitives directly or through libraries built on top of them (e.g. Boost.Thread, C++11 std::thread, QThreads). Helgrind and DRD help to find various synchronisation errors, incorrect API usage, lock contention and data races. Both tools provide similar functionality, but DRD could have better performance while Helgrind produces more comprehensible reports. As Valgrind emulates the code and records reads, writes and API calls, the execution could be significantly slower.
Platforms Unix/Linux, Mac OS
Languages/Models C, C++, Pthread
License Required No
Documentation Helgrind User Guide, DRD User Manual
When To Use • You have developed a multi-threaded application but it produces incorrect results and occasionally locks up • You want to try a commonly available tool for diagnosing thread hazards
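
One class of synchronisation error these tools look for is inconsistent lock ordering, where two threads take the same pair of locks in opposite orders and can therefore deadlock. The Python sketch below illustrates the principle on a hand-written trace. It is a simplified pairwise check with a made-up trace format and the hypothetical function name `lock_order_violations`, not a description of how Helgrind or DRD are implemented.

```python
def lock_order_violations(acquisitions):
    """Find pairs of locks that were taken in both orders.

    `acquisitions` is a list of (thread, event, lock) tuples in program order,
    where event is "acquire" or "release".
    """
    held = {}       # thread -> locks currently held, in acquisition order
    order = set()   # (first, second): second was acquired while first was held
    for thread, event, lock in acquisitions:
        stack = held.setdefault(thread, [])
        if event == "acquire":
            for outer in stack:
                order.add((outer, lock))
            stack.append(lock)
        else:
            stack.remove(lock)
    # A pair seen in both orders is a potential deadlock.
    return sorted((a, b) for a, b in order if (b, a) in order and a < b)


# Hypothetical trace: T1 takes A then B, T2 takes B then A.
trace = [
    ("T1", "acquire", "A"), ("T1", "acquire", "B"),
    ("T1", "release", "B"), ("T1", "release", "A"),
    ("T2", "acquire", "B"), ("T2", "acquire", "A"),
    ("T2", "release", "A"), ("T2", "release", "B"),
]
print(lock_order_violations(trace))
```

Note that the problem is reported even though this particular run did not deadlock. That is why such tools are useful for applications that only lock up occasionally.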

Insure++, PurifyPlus

Summary Insure++ and PurifyPlus are runtime memory analysis and error detection tools. They help to identify various memory errors such as heap corruption, memory leaks, out-of-bounds array accesses and buffer overflows. Insure++ can be used in two modes : source instrumentation mode and link mode. In source instrumentation mode, Insure++ instruments the source code, which helps to find errors that other tools might miss. It also provides a GUI that shows memory allocations and possible outstanding leaks over time. PurifyPlus works by instrumenting object code and can detect errors occurring inside third-party libraries.
Platforms Unix/Linux, Windows
Languages/Models C, C++
License Required Yes
Documentation Project Page
When To Use You are looking for a memory analysis and debugging tool on the Windows platform

Intel Inspector

Summary Intel Inspector (successor of Intel Thread Checker) is a code correctness tool that helps to identify threading and memory errors. It performs dynamic instrumentation and analyses the execution to find intermittent, non-deterministic errors. Intel Inspector helps to find threading errors (like deadlocks and race conditions) and memory errors (like memory leaks, memory corruption, dangling pointers and uninitialized variables).
Platforms Linux, Windows
Languages/Models C, C++, Fortran, TBB, OpenMP, Pthread, Win32 threads
License Required Yes
Documentation Documentation
When To Use • You want to analyse memory issues (leaks, dangling pointers, uninitialized variables) • You have a threaded application and want to find issues like race conditions, deadlocks etc.

Floating-point Litmus Tests (FLiT)

Summary FLiT is an infrastructure for detecting variability in floating-point computations caused by variations in compiler optimisation, hardware and execution environments. Unlike other tools, FLiT is not a debugging tool but a framework to detect discrepancies in floating-point computation across hardware, compilers and libraries. It allows developers to create reproducibility tests for their application and then compiles them under a set of configured compilers and a large range of compiler flags. The results from the tests under different compilations are then compared against the results from a "ground truth" compilation (e.g. an unoptimised compilation). This helps developers to determine which compilations are safe and to navigate the tradeoff between reproducibility and performance.
Platforms Unix/Linux, Mac OS
Languages/Models C, C++
License Required No
Documentation GitHub, README
When To Use • You are writing an application and are concerned about floating-point discrepancies • You are looking for a framework that lets you write compute kernels and test them with different compilers and optimisation levels to ensure code correctness as well as reproducibility
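
Why would the same source give different floating-point results under different compilations? One common reason is that optimisations may reorder additions, and floating-point addition is not associative. The Python sketch below imitates the "compare against ground truth" workflow described above. It does not use FLiT: the three summation orders are stand-ins for what different compiler flags might produce, `math.fsum` plays the role of the ground truth, and all function names are made up for this example.

```python
import math


def sum_left_to_right(xs):
    """Accumulate in the order given."""
    total = 0.0
    for x in xs:
        total += x
    return total


def sum_pairwise(xs):
    """Split the list in half recursively and add the two partial sums."""
    if len(xs) == 1:
        return xs[0]
    mid = len(xs) // 2
    return sum_pairwise(xs[:mid]) + sum_pairwise(xs[mid:])


def sum_smallest_first(xs):
    """Accumulate after ordering the values by increasing magnitude."""
    return sum_left_to_right(sorted(xs, key=abs))


def compare_to_ground_truth(variants, ground_truth, xs, tolerance=0.0):
    """Map each variant name to (result, matches_ground_truth)."""
    expected = ground_truth(xs)
    report = {}
    for name, fn in variants.items():
        result = fn(xs)
        report[name] = (result, abs(result - expected) <= tolerance)
    return report


values = [1e16, 1.0, -1e16, 1.0]
variants = {
    "left_to_right": sum_left_to_right,
    "pairwise": sum_pairwise,
    "smallest_first": sum_smallest_first,
}
report = compare_to_ground_truth(variants, math.fsum, values)
for name, (result, ok) in report.items():
    print(name, result, "matches" if ok else "differs")
```

The exact sum is 2.0, yet two of the three orderings lose one or both of the small terms. A reproducibility test with a tolerance appropriate to the application is what decides whether such a difference is acceptable.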

GNU Project debugger (GDB)

Summary GDB is a widely used, portable, command line debugger for applications written in various programming languages. It provides rich functionality for monitoring, tracing and altering program execution at runtime. GDB supports debugging multi-threaded applications (see threads) as well as multiple processes simultaneously (see inferiors). It can be integrated into IDEs (e.g. Codelite, Code::Blocks, Dev-C++, Qt Creator, Eclipse, NetBeans, Visual Studio) or used via front-ends like UltraGDB, DDD, Nemiver and KDbg. One can use GDB to debug MPI applications with the help of a terminal emulator like xterm (see OpenMPI instructions). Other tools like CUDA-GDB and DDT use GDB underneath.
Platforms Unix/Linux, Mac OS, Windows
Languages/Models Ada, C, C++, Objective-C, Free Pascal, Fortran, Go, Java, Python (and others)
License Required No
Documentation User Guide
When To Use • You are looking for a readily available debugger for your application on any given platform • You want to debug a parallel application (multi-threaded or multi-process) at small scale

LaunchMON

Summary LaunchMON is a software framework that helps other tools to launch daemons on remote nodes at scale. Debuggers and performance analysis tools often need to launch and control middleware daemons on the compute nodes for scalable communication. LaunchMON provides a general purpose, efficient, portable and secure infrastructure to achieve this. It can interact with a resource manager like SLURM to determine when, where and how to perform the operations. Many other tools like STAT and DDT use LaunchMON underneath.
Platforms Unix/Linux
License Required No
Documentation README, GitHub
When To Use • You are developing a parallel debugging tool • You need a library to identify the remote nodes and processes of a parallel program, and to deploy tool daemons onto the right remote nodes

LLDB

Summary LLDB is a debugging tool built on top of reusable software libraries from the LLVM toolchain. It uses the LLVM disassembler and the Clang expression parser, which can handle complex C++ code better than other debuggers like GDB. LLDB benefits from the modern libraries of the LLVM project and a permissive software licence (UIUC) that allows easy integration with proprietary software.
Platforms Linux, Mac OS, Windows
Languages/Models C, Objective-C, C++, Swift
License Required No
Documentation Tutorial
When To Use You are using the LLVM compiler toolchain and looking for a debugger alternative to GDB

Linux Trace Toolkit Next Generation (LTTng)

Summary LTTng is a tracing framework for standalone applications, libraries and the kernel with minimal overhead. It is the successor of Linux Trace Toolkit (LTT) and is available on many desktop, server and embedded Linux distributions. Similar to perf/DTrace, it can be used for system-wide introspection to understand interactions among multiple applications. Tools like Trace Compass and Sourcery Analyzer can be used for visualising the collected traces.
Platforms Linux
License Required No
Documentation User Guide
When To Use You want to trace a single process or perform system-wide introspection with minimal overhead

Memcheck

Summary Memcheck is the default memory debugging tool of the Valgrind framework. It helps to find memory issues such as uninitialized memory access, read/write after deallocation, double free, memory leaks and mismatched use of malloc/new vs free/delete. All memory accesses (read/write) are checked, and calls to malloc/new/free/delete are intercepted. As a result, it could significantly slow down the execution (5x-100x). Memcheck can be used to debug parallel MPI applications by launching Valgrind under the MPI launcher and redirecting the report to a per-process log file. Alternatively, one can use the memcheckview graphical tool (part of ToolGear) to interpret Memcheck's results.
Platforms Linux, Mac OS
Languages/Models C, C++
License Required No
Documentation User Manual, Quick Start Guide
When To Use • You want to pinpoint the cause of sporadic memory crashes and unpredictable application behaviour • You want a readily available tool for debugging memory issues in a serial application or a small-scale parallel application
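
As a rough illustration of the "Valgrind under the MPI launcher with per-process log files" approach, the Python sketch below composes such a command line. The launcher name `mpiexec`, the rank count, the application arguments and the helper name `valgrind_mpi_command` are assumptions for the example, so adapt them to your system. The `%p` in Valgrind's `--log-file` option is expanded to the process ID, which gives one report per rank.

```python
import shlex


def valgrind_mpi_command(launcher, num_ranks, application,
                         log_pattern="memcheck.%p.log"):
    """Compose an MPI launch line that runs every rank under Memcheck."""
    return [
        launcher, "-n", str(num_ranks),
        "valgrind", "--tool=memcheck", "--leak-check=full",
        f"--log-file={log_pattern}",   # %p becomes the PID of each rank
    ] + list(application)


# Hypothetical application and input file.
command = valgrind_mpi_command("mpiexec", 4, ["./app", "input.dat"])
print(shlex.join(command))
```

The sketch only prints the command. Given the slowdown mentioned above, it is usually worth picking the smallest rank count and input that still reproduce the problem.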

MTuner

Summary MTuner is a memory profiler and memory leak finder for C/C++ applications. It records the entire history of memory operations over time with minimal impact on run-time performance. With its intuitive GUI, MTuner provides insight into the memory-related behaviour of an application and helps to quickly narrow down the sources of memory leaks, spikes, high allocation counts, etc. Using the MTuner SDK instrumentation API one can insert event markers, memory tags and named allocators for precise memory profiling.
Platforms Windows (partial support for Linux)
Languages/Models C, C++
License Required No
Documentation User Guide, GitHub
When To Use You want to profile and analyse memory usage with the entire time-based history of all memory operations

Marmot Umpire Scalable Tool (MUST)

Summary MUST is a runtime error detection tool for MPI applications. It automatically detects usage of MPI constructs that does not comply with the standard. MUST intercepts all MPI calls and checks for local and non-local correctness errors such as invalid arguments, data type matching errors, overlapping communication buffers, resource leaks and actual/potential deadlocks. It combines the features of the older Marmot and Umpire tools with improved scalability.
Platforms Linux
Languages/Models C, C++, Fortran, MPI
License Required No
Documentation Project Page
When To Use You have an MPI application and want to detect violations of the MPI standard that might only manifest on certain systems or with a different MPI implementation
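
To give a feel for what a non-local correctness check looks like, here is a small Python sketch that matches recorded sends against receives and reports type mismatches and leftovers. It is a toy under strong assumptions: the call records are hypothetical dictionaries, wildcards such as MPI_ANY_SOURCE, collectives and derived datatypes are ignored, and datatypes are compared by name only. It is not how MUST is implemented.

```python
from collections import defaultdict, deque


def check_point_to_point(calls):
    """Match sends to receives and report mismatches and unmatched calls.

    Each call is a dict with keys: op ("send" or "recv"), rank, peer, tag,
    datatype and count. Messages on the same (source, destination, tag)
    channel are matched in the order they were posted.
    """
    sends = defaultdict(deque)
    recvs = defaultdict(deque)
    for call in calls:
        if call["op"] == "send":
            sends[(call["rank"], call["peer"], call["tag"])].append(call)
        else:
            recvs[(call["peer"], call["rank"], call["tag"])].append(call)

    errors = []
    for channel in sorted(set(sends) | set(recvs)):
        send_queue, recv_queue = sends[channel], recvs[channel]
        while send_queue and recv_queue:
            send, recv = send_queue.popleft(), recv_queue.popleft()
            if send["datatype"] != recv["datatype"]:
                errors.append(("datatype mismatch", channel))
            elif send["count"] > recv["count"]:
                errors.append(("message truncated", channel))
        errors.extend(("unmatched send", channel) for _ in send_queue)
        errors.extend(("unmatched recv", channel) for _ in recv_queue)
    return errors


# Hypothetical recorded calls from a two-rank run.
recorded_calls = [
    {"op": "send", "rank": 0, "peer": 1, "tag": 7, "datatype": "MPI_DOUBLE", "count": 10},
    {"op": "recv", "rank": 1, "peer": 0, "tag": 7, "datatype": "MPI_FLOAT", "count": 10},
    {"op": "send", "rank": 1, "peer": 0, "tag": 9, "datatype": "MPI_INT", "count": 1},
]
for error in check_point_to_point(recorded_calls):
    print(error)
```

Neither problem is visible from a single rank's point of view, which is why such checks need information from both sides of the communication.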

Nsight

Summary Nsight is a development tool from NVIDIA for heterogeneous computing. It provides simultaneous debugging and profiling capabilities for CPUs as well as GPUs. Nsight helps to identify and analyse bottlenecks and to monitor the activities of the entire system. It can be integrated with Eclipse and Microsoft Visual Studio.
Platforms Linux, Windows, Mac OS
Languages/Models C, C++, CUDA, Direct3D, Vulkan, OpenGL
License Required No
Documentation User Guide
When To Use You are looking for an IDE with debugging and profiling capabilities for NVIDIA GPUs

ReMPI

Summary ReMPI is a record and replay tool for MPI applications. As network/system noise can affect the order of received messages, applications can take different computation paths depending on the messages received. This complicates debugging, as computation paths and the associated results may vary between the original run (where a bug manifested itself) and the debugged runs. ReMPI helps to debug such non-deterministic MPI applications by reproducing the order of message receives. Even if a bug manifests only in a particular order of message receives, ReMPI can consistently reproduce the target bug. It uses the PMPI interface for tracing the message receive order. ReMPI can be used with existing tools like Totalview, DDT and STAT.
Platforms Linux, Mac OS
Languages/Models C, C++, Fortran, MPI
License Required No
Documentation README, GitHub
When To Use • Your MPI application has a non-deterministic communication pattern • You want a mechanism to re-run the application while preserving the MPI message communication order
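
The record-and-replay idea can be demonstrated without MPI. In the Python sketch below, a shuffled delivery order stands in for network noise, and a floating-point sum stands in for a computation whose result depends on arrival order. The function `receive_all`, the message values and the seed are all made up for illustration. A real tool records the order inside the MPI library via PMPI rather than in application code.

```python
import random


def receive_all(messages, rng=None, recorded_order=None):
    """Deliver messages and return (order_of_senders, accumulated_result).

    Without `recorded_order`, arrival order is shuffled to mimic network
    noise. With it, messages are delivered in exactly the recorded order.
    """
    if recorded_order is None:
        order = list(messages)
        (rng or random.Random()).shuffle(order)
    else:
        order = list(recorded_order)
    total = 0.0
    for sender in order:
        total += messages[sender]   # floating-point sum depends on the order
    return order, total


# Hypothetical payloads keyed by sender rank.
messages = {0: 1e16, 1: 1.0, 2: -1e16, 3: 1.0}

# Record phase: whatever order the "network" produced is saved.
recorded = receive_all(messages, rng=random.Random(2024))

# Replay phase: the saved order is enforced, so the result is reproduced.
replayed = receive_all(messages, recorded_order=recorded[0])
print(recorded, replayed)
```

Different arrival orders can give different totals here (for example 1.0 for the order 0, 1, 2, 3 and 2.0 for 1, 3, 0, 2), but a replay always reproduces the recorded one.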

RenderDoc

Summary RenderDoc is a frame-capture based graphics API debugger designed for quick and easy introspection of any graphics application. RenderDoc lets you capture a single frame of an application, then load that capture in an analysis tool to inspect the API use and GPU work in detail.
Platforms Linux, Windows
Languages/Models Vulkan, Direct3D, OpenGL
License Required No
Documentation User Guide, GitHub
When To Use You are developing a rendering application and need a tool for frame analysis and debugging, graphics inspection and detailed examination of API usage

Stack Trace Analysis Tool (STAT)

Summary STAT is a lightweight, scalable tool to aid in debugging parallel applications at extreme scale. It is not intended to be a full-featured debugger but can help to pinpoint the root cause of deadlocks even when running with hundreds of thousands of processes. STAT gathers stack traces from the parallel application's processes and merges them into a compact form. The merging process groups processes that exhibit similar behaviour into process equivalence classes. It provides a GUI to navigate process groups and allows attaching a full-featured debugger like DDT or Totalview for in-depth analysis.
Platforms Linux
Languages/Models C, C++, Fortran, MPI (and other programming models)
License Required No
Documentation User Guide, GitHub
When To Use • You are running an application at scale and you suspect a deadlock • You need a tool to attach to a running application and show the execution state of every process in a compact view
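
The notion of process equivalence classes can be shown with a few lines of Python. This sketch is not STAT's algorithm, which merges traces through a tree-based overlay network at scale. The function names `equivalence_classes` and `compress_ranks` and the eight example traces are assumptions for illustration. It simply groups ranks with identical traces and prints each group with a compact rank list.

```python
from collections import defaultdict


def equivalence_classes(stack_traces):
    """Group ranks whose stack traces are identical.

    Returns a list of (trace, ranks) pairs, largest group first.
    """
    groups = defaultdict(list)
    for rank in sorted(stack_traces):
        groups[tuple(stack_traces[rank])].append(rank)
    return sorted(groups.items(), key=lambda item: (-len(item[1]), item[0]))


def compress_ranks(ranks):
    """Format sorted ranks as ranges, so that 0, 1, 2, 5 becomes "0-2,5"."""
    parts = []
    start = prev = ranks[0]
    for rank in ranks[1:]:
        if rank != prev + 1:
            parts.append(str(start) if start == prev else f"{start}-{prev}")
            start = rank
        prev = rank
    parts.append(str(start) if start == prev else f"{start}-{prev}")
    return ",".join(parts)


# Hypothetical hang: seven ranks wait in a barrier, one is stuck in a receive.
traces = {rank: ["main", "exchange", "MPI_Barrier"] for rank in range(7)}
traces[7] = ["main", "exchange", "MPI_Recv"]

classes = equivalence_classes(traces)
for trace, ranks in classes:
    print(" > ".join(trace), "ranks", compress_ranks(ranks))
```

With two classes instead of eight individual traces, the outlier is obvious, and a full-featured debugger only needs to be attached to a representative of each class.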

Oracle Studio Thread Analyser

Summary Thread Analyser is a tool that is part of Oracle Developer Studio (formerly Sun Studio) and helps to pinpoint race and deadlock conditions in multi-threaded applications. When an application is compiled with Oracle Studio, the compiler adds instrumentation code to the executable that helps to detect errors at runtime. It provides a GUI integrated into Performance Analyzer.
Platforms Linux/Solaris (Intel, AMD and SPARC)
Languages/Models OpenMP, Pthread, Solaris thread API, Cray(R) parallel directive
License Required No
Documentation User Guide
When To Use You are developing an application on the Solaris platform and looking for a data race detection tool

Temanejo

Summary Temanejo is a graphical tool for analysing and debugging task-parallel, data-dependency-driven programming models. It displays the task-dependency graph of application components and allows simple interaction with the runtime system in order to control some aspects of parallel execution. Temanejo is able to assist debugging (to a varying extent) for programming models like SMPSs, OmpSs, StarPU, PaRSEC and OpenMP. It uses the Ayudame library to collect information, so-called events, from supporting runtime systems, and to exert control over a runtime system.
Platforms Linux, Mac OS
Languages/Models OpenMP, OmpSs, SMPSs, StarPU, PaRSEC
License Required No
Documentation User Guide
When To Use • You have task-parallel applications and you are not sure about the dependencies and runtime scheduling of tasks • You are looking for a visual debugging tool to understand how dependencies are resolved and executed at runtime

ThreadSanitizer (TSan) and AddressSanitizer (ASan)

Summary TSan is a fast data race detector for multi-threaded C and C++ applications. It performs compile-time instrumentation to record information about each memory access, and then checks whether that access participates in a race. Compared to other tools, TSan has a better understanding of builtin atomics and synchronisation constructs and therefore provides more accurate results with no real false positives. The overhead varies from application to application, but typically memory usage may increase by 5-10x and runtime by 2-20x. ASan is a fast memory error detector. It helps to find errors such as use-after-free, heap/stack/global buffer overflow, out-of-bounds accesses and invalid/double free. ASan typically slows execution by ~2x and increases memory usage by ~3x. TSan and ASan were originally developed by Google for the LLVM toolchain and have since been ported to the GNU toolchain.
Platforms Linux, Mac OS, Windows
Languages/Models C, C++, Fortran, Pthread
License Required No
Documentation TSan User Guide, ASan User Guide
When To Use You need a faster tool for detecting data races and memory errors behind sporadic crashes and memory corruption

Record and Replay Debugger (rr)

Summary rr is a record and replay framework developed by Mozilla. During the record phase, rr records all inputs to the process and logs them to disk as a trace. This trace can be replayed as many times as needed during debugging, and all state will be reproduced exactly. During the replay phase, rr provides an enhanced gdb debugging experience that supports reverse execution. As a bug can be replayed over and over again, it helps to debug issues that are very difficult to solve with traditional debuggers. This functionality is similar to ReplayEngine of Totalview. rr can be integrated with IDEs like Visual Studio Code, QtCreator, Eclipse and CLion.
Platforms Linux
Languages/Models C, C++, Fortran, Pthread
License Required No
Documentation Wiki, GitHub
When To Use • You have a non-deterministic, difficult to reproduce bug in your application • You want a tool that is capable of recording the execution once and then replaying it multiple times with reverse-debugging functionality

Totalview

Summary Totalview is a debugger developed by Rogue Wave Software for both serial and parallel programs. It helps to analyse and debug serial, parallel, multi-process, multi-threaded and hybrid applications on a variety of HPC architectures. Totalview has a memory analysis tool called MemoryScape (for detecting memory leaks and memory corruption) and a reverse debugging tool called ReplayEngine (providing record and replay debugging functionality like rr). It provides remote debugging functionality and both a graphical user interface and a command line interface. Along with DDT, Totalview is a commonly used debugger on supercomputing platforms.
Platforms Unix/Linux, Mac OS (Intel x86, Intel Xeon Phi, IBM Power, Arm, NVIDIA GPU)
Languages/Models C, C++, Fortran, MPI, OpenMP, OpenACC, CUDA
License Required Yes
Documentation User Guide
When To Use You are looking for a debugging tool for serial, multi-threaded or parallel applications on a desktop, a cluster or the largest supercomputers

UndoDB

Summary UndoDB is a reversible debugger developed by Undo. Similar to rr and newer versions of GDB, it supports rewinding and replaying through the program's execution history. One can set breakpoints and watchpoints in the past, and then rewind to them. UndoDB uses GDB as its default front-end but can be configured with IDEs like Eclipse and CLion.
Platforms Linux (x86, AArch64)
Languages/Models C, C++
License Required No
Documentation User Guide
When To Use You like the reverse debugging feature of GDB but find it slow, and hence are looking for a better alternative

Valgrind

Summary Valgrind is an instrumentation framework for building dynamic analysis tools. It provides a number of simulation-based debugging and profiling tools : Memcheck (memory-management error detector), Cachegrind (cache profiler), Callgrind (extends Cachegrind with callgraphs), Massif (heap profiler) and DRD/Helgrind (data race detectors). Valgrind in essence is a virtual machine that performs dynamic recompilation of the binary using a JIT compilation technique : it first translates the application into a simpler Intermediate Representation (IR), then the selected tool performs whatever transformations it would like on the IR, and finally Valgrind translates the IR back to machine code and lets the host processor run it. Even though Valgrind is primarily used with a single process, one can use it to debug MPI programs at moderate scale with the help of Tool Gear's MemcheckView.
Platforms Unix/Linux, Mac OS
Languages/Models C, C++, Fortran, Pthread
License Required No
Documentation User Guide
When To Use You want a single, readily available tool for detecting memory errors and threading bugs and for profiling your programs

Python Debuggers

There are a number of tools available for debugging Python applications. For multiprocessing applications in Python, I haven't used anything other than the logger in the multiprocessing module. Here are a few other commonly used debuggers :

• pdb : Interactive source code debugger included in the standard library
• pudb : A visual, console-based, full-screen debugger for Python
• Winpdb : Platform independent Python debugger
• pydbgr : A gdb-like debugger for Python
• A number of Python IDEs like Spyder, PyCharm and Atom provide built-in debugger integration

This wiki page provides a number of other alternatives.
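
For the multiprocessing logger mentioned at the start of this section, a minimal sketch could look like the following. The worker function `square` and the helper `run` are made up for illustration. `multiprocessing.log_to_stderr()` and `multiprocessing.get_logger()` are the standard library calls that attach a stderr handler to the module's logger and return that logger.

```python
import logging
import multiprocessing as mp


def square(x):
    """Worker that logs which process handled which input."""
    logger = mp.get_logger()
    logger.info("squaring %d in %s", x, mp.current_process().name)
    return x * x


def run(values, workers=2):
    """Map `square` over `values` in a pool with logging sent to stderr."""
    mp.log_to_stderr(logging.INFO)
    with mp.Pool(workers) as pool:
        return pool.map(square, values)


if __name__ == "__main__":
    print(run([1, 2, 3, 4]))
```

Each log line carries the name of the process that emitted it, which is often enough to see which worker misbehaves. For interactive inspection of a single process, calling the built-in `breakpoint()` at the point of interest drops into pdb.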

Other Tools

Below is a list of additional tools not included in this post. Some of these tools are deprecated or not in active development or have better alternatives.

• AutomaDeD : Tool for automatic diagnosis of performance and correctness problems in MPI applications
• DBX : Source level debugger for C/Fortran/Pascal primarily on Solaris, AIX and BSD systems
• jdb : Simple command-line, GDB equivalent debugger for Java
• IDB : Debugger developed by Intel supporting parallel programming models including MPI, OpenMP and Pthreads (Deprecated)
• Intel Static Security Analysis : Tool developed by Intel to identify security vulnerabilities including buffer overflows, uninitialised variables and memory leaks (Deprecated)
• Marmot : MPI error detection tool for checking MPI calls, their arguments and non-portable constructs (superseded by MUST)
• DHAT : Heap analysis tool to understand memory block lifetimes, block utilisation, memory access ratios and layout inefficiencies
• WinDbg : A kernel-mode and user-mode debugger on the Windows platform
• IDEs : Many IDEs like NetBeans, Eclipse, CLion and Visual Studio have a built-in debugger or provide integration with third-party debuggers for serial/multi-threaded applications

If you have any questions or suggestions, or would like to improve the post with your favourite tool, I will be glad to hear from you!