I was planning to write about CPU microarchitecture analysis for a long time. I started writing this post more than a year ago, just before the beginning of COVID-19. But with so many things happening around (and new parenting responsibilities 👧), this got delayed for quite a long time. Finally getting some weekend time to get this out!
What makes supercomputers special? They have state-of-the-art processors, fast parallel file systems, specialized power & cooling infrastructure and complex software stack to run. But, a high-speed interconnect that tightly integrates thousands of nodes differentiate a supercomputer from a commodity cluster. Data movement within a node or across nodes is an important aspect for many scientific applications and hence low latency, high bandwidth interconnect technology is one of the key elements of the HPC systems.
Setting up such a system with tens of thousands of nodes and performance tuning is not an easy task. Especially during the early days of deployment and acceptance benchmarking where we often have to run various tests for weeks to identify issues, fix them and reach expected performance.… Read the rest
When optimizing parallel applications at scale, we often focus on computation-communication aspects and I/O often gets limited attention. With increasing performance gap between compute and I/O subsystems, improving I/O performance remains one of the major challenge. As filesystem is a shared resource, few jobs running on a system can significantly impact performance of other applications. In such scenario, even if we use profiling tool (see list here) to identify slow I/O routines, it's difficult to understand real cause. For example, there might be other applications dominating filesystem resulting in poor I/O performance.… Read the rest
What is One API? This has been a common question since Intel announced One API vision during Intel Architecture Day back in December 2018. The aim of this is to deliver uniform software experience with optimal performance across broad range of Intel hardware. There has been some press releases and high level presentations depicting how One API is going to solve programming challenge for scalar processors (CPUs), vector processors (GPUs), matrix processors (AI accelerators) as well as spatial processing elements (FPGAs). For people waiting for Intel Xe as Xeon Phi successor, this is exciting.… Read the rest
Almost a decade ago I was involved in number of technical training activities, and I enjoyed it lot. Actually, I wanted to be a trainer before moving to academia/research. Let's keep that for later discussion 🙂 .
I didn't think about creating online videos (aka screencasting). Recently I came across Screenflow and saw how easy it is to create reasonably good quality videos/tutorials. During this weekend I decided to give it a try, and this is my first attempt!
During one of the recent computing conference I heard the author saying:
With all the laws of computing fighting against us, the industry needs open interfaces to allow innovations...
I was curious about who all are fighting against us 🙂 . In this post I tried to put together summary of different laws related to computing (especially computing hardware trend, parallel performance and efficiency) that I came across over the years. Lets get started...!
My first experience with the Vampir trace visualiser was in 2010 during my studies at EPCC. While working on the exercises and samples, I was excited by the possibility of finding out what every process or thread (from thousands) is doing at any point in time. Over the years I have used TAU + Score-P + Vampir toolset with different applications on various systems. When it comes to trace visualisation for scientific applications at scale, Vampir is very impressive. If you haven't used it before, give a try!
If you are working in the area of scientific computing, in academia or industry, most likely you are using Python in some form. Traditionally Python is described as slow when it comes to performance and there are number of discussions about speed compared to native C/C++ applications 1 2. The goal of this post is not to argue about performance but to summarise various tools that can help to find out performance bottlenecks before coming to such conclusions. In the previous post, I summarised more than 90 profiling tools that can be used for analysing performance of C/C++/Fortran applications.… Read the rest
Different python profiling tools use different methodologies for gathering performance data and hence have different runtime overhead. Before choosing a profiler tool it is helpful to understand two commonly employed techniques for collecting performance data :
- Deterministic profiling Deterministic profilers execute trace functions at various points of interest (e.g. function call, function return) and record precise timings of these events. Typically this requires source code instrumentation but python provides hooks (optional callbacks) which can be used to insert trace functions.
- Statistical profiling Instead of tracking every event (e.g. call to every function), statistical profilers interrupt application periodically and collect samples of the execution state (call stack snapshots).
Nowadays it's not uncommon to run parallel applications with hundreds of thousands of processes on supercomputing platforms. Debugging these parallel applications with sporadic crashes, deadlocks, memory errors or incorrect results is a challenging task. There are number of tools available that help identifying and fixing bugs but one needs to understand tools, their capabilities and when they can be used. This post tries to summarise various debugging tools (open source as well as commercial).
Note that not all tools can be used with distributed applications. For example, open source tools like GDB and Valgrind are commonly used for debugging serial, multi-threaded applications.… Read the rest