Workshop Programme December 4, 2018

Preliminary Workshop Programme:
10:00 - 10:30 Welcome with coffee
10:30 - 12:30 Presentations
12:30 - 13:30 Lunch
13:30 - 16:00 Presentations
16:00 - 17:00 Drinks, Snacks & Networking

Confirmed Presentations:

Trevor McDonell (University of Utrecht) - A Functional Programming Language for GPUs
Although graphics processing units (GPUs) were primarily designed for the efficient rendering of computer graphics, their highly parallel architectures are increasingly used to tackle demanding computational problems in many non-graphics domains. However, GPU applications typically need to be programmed at a very low level, and their specialised hardware requires expert knowledge in order to be used effectively. These barriers make it difficult for domain scientists to leverage GPUs in their applications without first becoming GPU programming experts. This talk discusses our work on the programming language _Accelerate_, in which computations are expressed in a high-level functional style but compiled down to efficient low-level GPU code. I will discuss some of the aspects of making such a language practicable as well as performant.

Valeriu Codreanu (SURFSara) - Design and Performance Evaluation of a Commodity GPU Cluster for HPC and Deep Learning Workloads
The LISA GPU cluster is a heterogeneous system mixing Pascal- and Volta-grade GPU accelerators, motivated by the increased computational requirements of modern workloads such as deep learning. I will start by presenting some of the design decisions for the various cluster components (e.g. CPUs, GPUs, networking, storage) that aim to support the fast-growing deep learning community, and will also look beyond it, towards other communities using non-FP64 GPU computing. I will showcase a variety of software packages that can be used efficiently on top of this infrastructure, and the ways in which it supports Dutch research and education. I will present single- and multi-node performance evaluations and best practices for scaling several deep learning frameworks (e.g. TensorFlow, PyTorch, MXNet) and HPC workloads (e.g. GROMACS). Finally, I will share some of the successful applications already benefitting from this new infrastructure, and conclude with the lessons learnt through the design and management process.

Henk Dreuning (University of Amsterdam) - A Beginner's Guide to Estimating and Improving Performance Portability
Given the increasing diversity of multi- and many-core processors, portability is a desirable feature of applications designed and implemented for such platforms. Portability is unanimously seen as a productivity enabler, but it is also considered a major performance blocker. Thus, performance portability has emerged as the property of an application to preserve similar form and similar performance on a set of platforms; a first metric, based on extensive evaluation, has been proposed to quantify performance portability for a given application on a set of given platforms. In this work, we explore the challenges and limitations of this performance portability metric (PPM) on two levels. We first use 5 OpenACC applications and 3 platforms, and we demonstrate how to compute and interpret PPM in this context. Our results indicate specific challenges in parameter selection and results interpretation. Second, we use controlled experiments to assess the impact of platform-specific optimizations on both performance and performance portability. Our results illustrate, for our 5 OpenACC applications, a clear tension between performance improvement and performance portability improvement.
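As an illustration of how such a metric can be computed, the sketch below assumes the harmonic-mean formulation commonly attributed to Pennycook, Sewall and Lee. The abstract does not name the metric it studies, so this choice is an assumption, as are the function name, the platform labels and the efficiency values.

```python
from typing import Mapping, Optional


def performance_portability(efficiencies: Mapping[str, Optional[float]]) -> float:
    """Harmonic mean of per-platform efficiencies (assumed PPM formulation).

    `efficiencies` maps a platform name to the application's efficiency on
    that platform, as a fraction in (0, 1]. Use None for a platform on which
    the application does not run; the metric is then defined as 0.
    """
    if not efficiencies:
        raise ValueError("at least one platform is required")
    values = list(efficiencies.values())
    # An unsupported platform (None) or a zero efficiency gives a score of 0.
    if any(e is None or e <= 0.0 for e in values):
        return 0.0
    return len(values) / sum(1.0 / e for e in values)


# Hypothetical efficiencies for one application on three platforms.
example = {"platform_a": 0.8, "platform_b": 0.4, "platform_c": 0.5}
score = performance_portability(example)
```

Because the harmonic mean is dominated by the lowest efficiency, the choice of platform set and of efficiency definition (relative to the best known implementation, or to hardware peak) has a strong effect on the score, which is one reason parameter selection matters.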

Pieter Hijma (VU University Amsterdam) - Optimization Effectiveness: A Case-Study in Relating Performance to Programming Effort
Given the ever-increasing complexity of our computing hardware, it becomes more difficult to achieve high performance and to apply optimizations effectively. In this paper we propose a novel quantitative measure for the effectiveness of optimizations. We have implemented a real-world application in the forensics domain and applied several optimizations to make the best use of our heterogeneous many-core cluster. We analyze the programming effort for these optimizations qualitatively and quantitatively and relate it to the achieved performance to arrive at a measure of effectiveness for optimizations. We introduce measures for programming effort and effectiveness of optimizations, propose a structured approach for optimizing applications, and give a detailed explanation of the optimizations that lead to high performance and excellent scalability for our use case.

Sagar Dolas (SURFSara) - Exploring the Potential of the ROCm Software Stack for High Performance Computing and Deep Learning on AMD GPUs
ROCm (Radeon Open Compute) is one of the first open source platforms for GPU computing. It makes heavy use of the Heterogeneous System Architecture and is programming-language independent. This is a much-anticipated effort to accelerate open source software development for GPU computing. In these exciting times for high performance computing and deep learning, this effort can prove to be disruptive and innovative in its own way. The ROCm initiative has motivated community-wide development of various essential software packages that accelerate scientific software development and present competitive alternatives. In this talk, I will present some of the most significant features of the ROCm software stack and its challenges, and will showcase the performance of matrix multiplication kernels and of training deep neural networks such as ResNet-50 with TensorFlow on MI25 GPU-accelerated AMD EPYC systems. We will also see how using an open source assembly-level kernel library (Tensile) for general matrix multiplication can get us closer to the metal and increase performance.

Maxwell Cai (Leiden University) - GPU-accelerated Research in Astrophysics
In modern astronomy and astrophysical research, one of the outstanding problems is the big data challenge. Huge amounts of data are being generated continuously by large telescopes and HPC systems. Fortunately, many calculations can be parallelized on the GPU. In this presentation, I would like to highlight several increasingly important applications of GPU computing in astrophysical research, including large-scale N-body simulations, high-performance data processing pipelines, and deep learning.

Ehsan Sharifi Esfahani (University of Amsterdam) - A Survey on Energy Efficiency in GPUs
The high energy consumption of Graphics Processing Units (GPUs) in high performance computing environments is a critical issue nowadays. It has a negative impact on the environment, increases operational costs, and limits the feasibility of designing future exascale machines. This fosters the development of more energy-efficient parallel programs. In this literature study, we investigated the motivations and related challenges of green computing on GPUs. We then surveyed the GPU power breakdown, the metrics used for evaluating energy efficiency in this environment, and different related energy models. We also summarized the proposed solutions for green computing on GPUs and classified them according to our suggested taxonomies.

Merijn Verstraaten (Netherlands eScience Center) - Mix-and-Match: A Model-driven Runtime Optimisation Strategy for BFS on GPUs
The performance of graph algorithms is heavily dependent on the algorithm, execution platform, and structure of the input graph. This variability remains difficult to predict and hinders the choice of the right algorithm for a given problem. We show the results of a case study on breadth-first search (BFS) on GPUs. We demonstrate the severity of this variability by comparing 5 implementation strategies for GPU-enabled BFS, and showing how selecting one single algorithm for the entire traversal can significantly limit performance. We propose to mix-and-match different algorithms at runtime, to compose the best performing BFS traversal. Our approach is based on two novel elements: a predictive model, based on a decision tree, which is able to dynamically select the best performing algorithm for each BFS level, and a quick context switch between algorithms, which limits the overhead of the combined BFS. We demonstrate empirically that our dynamic switching BFS outperforms our non-switching implementations by 2x and existing state-of-the-art GPU BFS implementations by 3x. We conclude that mix-and-match BFS is a competitive approach for performing fast graph traversal, while being easily extended to include more BFS implementations and easily adaptable to other types of processors or specific types of graphs.
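The idea of switching algorithms per BFS level can be sketched on the CPU as follows. This is not the speaker's GPU implementation: it assumes an undirected graph, uses the two classical top-down and bottom-up steps in place of the five GPU strategies, and replaces the decision-tree model with a hypothetical threshold rule (`choose_step`). All names are illustrative.

```python
from typing import Callable, Dict, List, Set

Graph = Dict[int, List[int]]  # undirected graph as adjacency lists
Step = Callable[[Graph, Set[int], Dict[int, int], int], Set[int]]


def top_down_step(graph: Graph, frontier: Set[int],
                  levels: Dict[int, int], depth: int) -> Set[int]:
    """Expand every frontier vertex and claim its unvisited neighbours."""
    nxt: Set[int] = set()
    for u in frontier:
        for v in graph[u]:
            if v not in levels:
                levels[v] = depth
                nxt.add(v)
    return nxt


def bottom_up_step(graph: Graph, frontier: Set[int],
                   levels: Dict[int, int], depth: int) -> Set[int]:
    """Let every unvisited vertex look for a neighbour in the frontier."""
    nxt = {v for v in graph
           if v not in levels and any(u in frontier for u in graph[v])}
    for v in nxt:
        levels[v] = depth
    return nxt


def choose_step(frontier_size: int, unvisited: int) -> Step:
    """Hypothetical stand-in for the predictive model: a fixed threshold."""
    return bottom_up_step if frontier_size * 4 > unvisited else top_down_step


def mixed_bfs(graph: Graph, source: int,
              choose: Callable[[int, int], Step] = choose_step) -> Dict[int, int]:
    """Level-synchronous BFS that picks a step function for each level."""
    levels = {source: 0}
    frontier = {source}
    depth = 0
    while frontier:
        depth += 1
        step = choose(len(frontier), len(graph) - len(levels))
        frontier = step(graph, frontier, levels, depth)
    return levels


# Small example graph: a path 0-1-2-3 with an extra leaf 4 attached to 1.
graph = {0: [1], 1: [0, 2, 4], 2: [1, 3], 3: [2], 4: [1]}
levels = mixed_bfs(graph, 0)
```

Both step functions share one interface and operate on the same `levels` map, so switching between them costs nothing beyond the selection itself. On a GPU the equivalent switch would have to keep the data representations of the different kernels compatible.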