Program

26 February 2023


Room: St-Laurent 8

Session 1: Chair: Yaoqing Gao, Huawei Canada
8:00 - 8:30 Vectorizing sparse matrix computations with partially-strided codelets — Kazem Cheshmi, Zachary Cetinic and Maryam Mehri Dehnavi
8:40 - 9:10 Towards higher SIMD utilization in Control-Divergent Loop Nests via a compiler-only Active-Lane Consolidation Implementation — João Paulo Labegalini de Carvalho, Rouzbeh Paktinatkeleshteri, Wyatt Praharenka and José Nelson Amaral
9:20 - 9:50 Matrix Computation Acceleration in the Presence of Data Layout Conversions — Chenchen Tang, Fang Gao and Kai-Ting Amy Wang
10:00-10:20 Nutrition Break
Session 2: Chair: João Paulo Labegalini de Carvalho, University of Alberta
10:20-10:42 Active mask computed in predication instruction vs stored in branch unit for SIMT execution — Kevin Lin and Guansong Zhang
10:50-11:12 Structure Peeling Using Runtime Memory Identifiers — Rayson Ho, Henry Kao, Ehsan Amiri and Tomasz Czajkowski
11:20-11:42 SASA: A Scalable and Automatic Stencil Acceleration Framework for Optimized Hybrid Spatial and Temporal Parallelism on HBM-based FPGAs — Xingyu Tian, Zhifan Ye, Alec Lu, Licheng Guo, Yuze Chi and Zhenman Fang
11:50-12:12 Coarse-Grained Resource Sharing for Entire Neural Networks — Tzung-Han Juang, Christof Schlaak and Christophe Dubach
12:20 - 1:20 Lunch
Session 3: Chair: Fernando Quintao Pereira, Universidade Federal de Minas Gerais, Brazil
1:20 - 1:50 Processor-Centric Contention Slowdown Model for Heterogeneous System-on-chips — Yuanchao Xu, Mehmet Belviranili and Xipeng Shen
2:00 - 2:30 SODA Synthesizer: an Open-Source, End-To-End Hardware Compiler — Nicolas Bohm Agostini, Serena Curzel, Ankur Limaye, Vinay Amatya, Marco Minutoli, Vito Giovanni Castellana, Joseph Manzano, Fabrizio Ferrandi and Antonino Tumeo
2:40 - 3:10 ICHNITE: Predicting memory footprint at compile-time — Nafis Mustakin and Daniel Wong
3:20 - 3:40 Nutrition Break
Session 4: Chair: Xipeng Shen, North Carolina State University, USA
3:40 - 4:10 A SYCL Extension for User-Driven Online Kernel Fusion — Víctor Pérez-Carrasco, Lukas Sommer, Victor Lomüller, Kumudha Narasimhan and Mehdi Goli
4:20 - 4:50 Reusable semantics for implementation of Python optimizing compilers — Olivier Melançon
5:00 - 5:30 Interfacing with CPython from Gambit Scheme — Marc-André Bélanger and Marc Feeley

Vectorizing sparse matrix computations with partially-strided codelets
Kazem Cheshmi, Zachary Cetinic and Maryam Mehri Dehnavi

The compact data structures and irregular computation patterns in sparse matrix computations introduce challenges to vectorizing these codes. Available approaches primarily vectorize strided computation regions of a sparse code. In this work, we propose a locality-based codelet mining (LCM) algorithm that efficiently searches for strided and partially strided regions in sparse matrix computations for vectorization. We also present a classification of partially strided codelets and a differentiation-based approach to generate codelets from memory accesses in the sparse computation. LCM is implemented as an inspector-executor framework called LCM I/E that generates vectorized code for the sparse matrix-vector multiplication (SpMV), sparse matrix times dense matrix (SpMM), and sparse triangular solver (SpTRSV). LCM I/E outperforms the MKL library with an average speedup of 1.67×, 4.1×, and 1.75× for SpMV, SpTRSV, and SpMM, respectively. It is also faster than the state-of-the-art inspector-executor framework Sympiler [1] for the SpTRSV kernel with an average speedup of 1.9×.
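
As a point of reference for the kernels discussed, the sketch below shows a plain scalar CSR SpMV in C++. It is illustrative only, not LCM-generated code, and all names in it are made up; it simply exposes the indirect, irregular accesses that make these codes hard to vectorize.

    #include <vector>

    // Scalar CSR sparse matrix-vector multiply: y = A * x.
    // The gather x[col[j]] and the variable-length rows are what make
    // strided vectorization of SpMV difficult; an LCM-style inspector
    // searches for (partially) strided regions in this access pattern.
    void spmv_csr(int n,
                  const std::vector<int>& row_ptr,   // size n + 1
                  const std::vector<int>& col,       // column index per nonzero
                  const std::vector<double>& val,    // value per nonzero
                  const std::vector<double>& x,
                  std::vector<double>& y) {
        for (int i = 0; i < n; ++i) {
            double sum = 0.0;
            for (int j = row_ptr[i]; j < row_ptr[i + 1]; ++j)
                sum += val[j] * x[col[j]];           // gather: irregular stride
            y[i] = sum;
        }
    }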

Towards higher SIMD utilization in Control-Divergent Loop Nests via a compiler-only Active-Lane Consolidation Implementation
João Paulo Labegalini de Carvalho, Rouzbeh Paktinatkeleshteri, Wyatt Praharenka and José Nelson Amaral

Vectorization is a widely applied compiler transformation to exploit Single-Instruction Multiple-Data (SIMD) parallelism available on modern CPU and (GP)GPU architectures. However, control-flow constructs commonly found in loop nests of real-world applications can prevent vectorization on commodity processors that lack predicated instructions. Moreover, control-flow linearization performed by IF-conversion, a traditional compiler technique to vectorize loop nests with control flow, leads to low SIMD utilization and the execution of unnecessary instructions. Branch-on-Superword-Condition-Codes (BOSCC) methods have shown promising results by eliding the execution of vector code for inactive vector lanes. BOSCC-based methods optimistically expect uniform vectors, in which all lanes are active, to be common at run time. Nevertheless, high control-flow divergence makes uniform vectors infrequent. In addition, modern architectures have long vector registers (up to 2048 bits on Arm processors with the Scalable Vector Extension, SVE), so the likelihood of uniform-vector materialization diminishes as vector registers grow longer. This talk will discuss Active-Lane Consolidation (ALC), a technique to increase SIMD utilization on long-vector architectures, and the challenges of implementing ALC as a compiler-only transformation in LLVM.
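
To make the contrast concrete, the following scalar C++ sketch mimics what if-conversion and a BOSCC-style guard do at the vector level. It is an illustration under simplified assumptions (fixed width W, boolean mask array), not the ALC implementation discussed in the talk.

    #include <algorithm>
    #include <array>
    #include <cmath>

    constexpr int W = 8;                      // illustrative vector width (lanes)

    // If-converted (linearized) body: every lane evaluates the guarded work
    // and the mask selects which results are kept -- low SIMD utilization
    // when few lanes are active.
    void linearized(std::array<double, W>& a, const std::array<bool, W>& mask) {
        for (int lane = 0; lane < W; ++lane) {
            double t = std::sqrt(a[lane]);    // executed regardless of mask
            a[lane] = mask[lane] ? t : a[lane];
        }
    }

    // BOSCC-style guard: skip the whole vector body when no lane is active.
    // ALC goes further by packing active lanes from several iterations into
    // one dense vector; that packing step is not shown here.
    void with_boscc(std::array<double, W>& a, const std::array<bool, W>& mask) {
        if (std::none_of(mask.begin(), mask.end(), [](bool m) { return m; }))
            return;                           // elide the vector body entirely
        linearized(a, mask);
    }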

Matrix Computation Acceleration in the Presence of Data Layout Conversions
Chenchen Tang, Fang Gao and Kai-Ting Amy Wang

Accelerators for performing matrix multiplications have flourished due to the importance of GEMM in applications from the HPC and AI/ML domains. The matrix multiplication hardware often requires a special data layout, for instance the row-interleaved layout in Intel's Advanced Matrix Extensions (AMX) and the 4D fractal layout in Huawei's DaVinci cube unit. However, programmers and traditional software stacks are accustomed to 2D data layouts such as row major and column major, and commonly used BLAS libraries support only these two. Thus, to leverage the power of the accelerators, pre- and post-processing on the host is needed to convert data into the required accelerator layout. This additional processing is overhead that eats into end-to-end application performance. To minimize it, the application is modified to either perform the conversions on demand or, when possible, push them to the entry and exit of the application so that data is kept in the accelerator layout throughout.
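
As an illustration of the kind of host-side pre-processing the abstract refers to, the sketch below converts a row-major matrix into a generic tiled layout. It is a hypothetical example and does not reproduce the exact AMX row-interleaved or DaVinci fractal layouts; the matching post-processing step is the inverse copy.

    #include <cstddef>
    #include <vector>

    // Convert a row-major M x N matrix into a blocked layout in which each
    // T x T tile is stored contiguously (tiles themselves in row-major
    // order, zero-padded at the edges).  This stands in for the conversion
    // a host performs before handing data to an accelerator that expects a
    // non-2D layout.
    std::vector<float> to_tiled(const std::vector<float>& a,
                                std::size_t M, std::size_t N, std::size_t T) {
        std::size_t tiles_per_row = (N + T - 1) / T;
        std::size_t tiles_per_col = (M + T - 1) / T;
        std::vector<float> out(tiles_per_col * tiles_per_row * T * T, 0.0f);
        for (std::size_t i = 0; i < M; ++i)
            for (std::size_t j = 0; j < N; ++j) {
                std::size_t ti = i / T, tj = j / T;         // tile coordinates
                std::size_t tile = ti * tiles_per_row + tj; // linear tile index
                std::size_t off = (i % T) * T + (j % T);    // offset inside tile
                out[tile * T * T + off] = a[i * N + j];
            }
        return out;
    }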

Active mask computed in predication instruction vs stored in branch unit for SIMT execution
Kevin Lin and Guansong Zhang

This paper explores the trade-off between using an active mask explicitly generated by compiler predication instructions and one implicitly maintained in the divergent-PC table of a GPU branch unit. To quantify the results, we constructed a GPU simulator to evaluate different approaches for handling divergent execution in common applications under SIMT execution.
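
A minimal software model of the explicit-mask alternative, with made-up names and a fixed warp width, may help make the trade-off concrete; it is not the simulator used in the paper.

    #include <array>
    #include <cstdint>

    constexpr int WARP = 8;                          // illustrative warp width

    // Explicit-predication view: the compiler materializes the active mask
    // as a register value and every instruction consults it.  The other
    // design point in the talk keeps per-path masks implicitly in the
    // branch unit's divergent-PC table instead of in a register.
    void simt_if_else(std::array<int, WARP>& r, const std::array<int, WARP>& x) {
        uint32_t mask_then = 0, mask_else = 0;
        for (int lane = 0; lane < WARP; ++lane)      // predicate evaluation
            (x[lane] > 0 ? mask_then : mask_else) |= 1u << lane;

        for (int lane = 0; lane < WARP; ++lane)      // "then" path under mask
            if (mask_then & (1u << lane)) r[lane] = x[lane] * 2;

        for (int lane = 0; lane < WARP; ++lane)      // "else" path under mask
            if (mask_else & (1u << lane)) r[lane] = -x[lane];
    }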

Structure Peeling Using Runtime Memory Identifiers
Rayson Ho, Henry Kao, Ehsan Amiri and Tomasz Czajkowski

Structure Peeling is a compiler performance optimization that transforms an array-of-structures (AoS) into a structure-of-arrays (SoA). Instead of structures being placed contiguously in AoS form, Structure Peeling transforms the memory layout of the AoS so that the same fields of contiguous structures are grouped together in their own contiguous memory regions, i.e., SoA form. This transformation can improve the spatial locality of memory accesses and hence the performance of an application. We propose a novel method of Structure Peeling that allows us to safely peel multiple AoSs when static analysis cannot determine a single memory region that uses of an AoS may point to. We introduce a unique identifier, the memory ID, as a tag for each live copy of an AoS that exists in the program. The memory ID is set and referenced at runtime to determine which of the multiple memory regions is accessed, eliminating the need to statically determine where each AoS originates from. Compared to a state-of-the-art technique, we obtain 13% more speedup on the SPEC CPU2017 MCF application.
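
For readers unfamiliar with the transformation, the C++ sketch below shows the AoS and SoA layouts and a hand-written peel. The struct and field names are invented for illustration, and the runtime memory-ID mechanism itself is not shown.

    #include <vector>

    // Array-of-structures: fields of one element are adjacent, so a loop
    // that touches only `weight` still drags the other fields through the
    // cache.
    struct NodeAoS { int id; double weight; double cost; };

    // Structure-of-arrays ("peeled") form: each field lives in its own
    // contiguous array, improving spatial locality for field-wise loops.
    struct NodesSoA {
        std::vector<int>    id;
        std::vector<double> weight;
        std::vector<double> cost;
    };

    // The layout change the compiler performs automatically, written by
    // hand.  The paper's contribution is doing this safely when several
    // live copies of the AoS exist, by tagging each copy with a memory ID.
    NodesSoA peel(const std::vector<NodeAoS>& aos) {
        NodesSoA soa;
        soa.id.reserve(aos.size());
        soa.weight.reserve(aos.size());
        soa.cost.reserve(aos.size());
        for (const NodeAoS& n : aos) {
            soa.id.push_back(n.id);
            soa.weight.push_back(n.weight);
            soa.cost.push_back(n.cost);
        }
        return soa;
    }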

SASA: A Scalable and Automatic Stencil Acceleration Framework for Optimized Hybrid Spatial and Temporal Parallelism on HBM-based FPGAs
Xingyu Tian, Zhifan Ye, Alec Lu, Licheng Guo, Yuze Chi and Zhenman Fang

Stencil computation is one of the fundamental computing patterns in many application domains such as scientific computing and image processing. While there are promising studies that accelerate stencils on FPGAs, an automated acceleration framework that systematically explores both spatial and temporal parallelism for iterative stencils, which can be either computation-bound or memory-bound, has been lacking. In this talk, we present SASA, a scalable and automatic stencil acceleration framework for modern HBM-based FPGAs. SASA takes a high-level stencil DSL and the FPGA platform as inputs, automatically determines the best spatial and temporal parallelism configuration based on our accurate analytical model, and generates the optimized FPGA design with that configuration in TAPA high-level synthesis C++ together with its corresponding host code. Compared to the state-of-the-art automatic stencil acceleration framework SODA, which exploits only temporal parallelism, SASA achieves an average speedup of 3.41× and up to 15.73× on the HBM-based Xilinx Alveo U280 FPGA board for a wide range of stencil kernels. The SASA paper has recently been accepted by ACM Transactions on Reconfigurable Technology and Systems. Dr. Zhenman Fang plans to present this work at the LATHC workshop.
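
For context, the sketch below is a plain scalar 5-point Jacobi stencil step in C++. It only defines the kind of kernel SASA accelerates and says nothing about the framework's generated TAPA code; the grid names and coefficient are illustrative.

    #include <cstddef>
    #include <vector>

    // One iteration of a 5-point Jacobi stencil on an H x W grid.  Spatial
    // parallelism processes many (i, j) points of one iteration at once;
    // temporal parallelism chains several iterations into a pipeline so
    // that intermediate grids never leave the FPGA.  SASA explores how to
    // split resources between the two.
    void jacobi_step(const std::vector<float>& in, std::vector<float>& out,
                     std::size_t H, std::size_t W) {
        for (std::size_t i = 1; i + 1 < H; ++i)
            for (std::size_t j = 1; j + 1 < W; ++j)
                out[i * W + j] = 0.2f * (in[i * W + j]
                                         + in[(i - 1) * W + j]
                                         + in[(i + 1) * W + j]
                                         + in[i * W + j - 1]
                                         + in[i * W + j + 1]);
    }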

Coarse-Grained Resource Sharing for Entire Neural Networks
Tzung-Han Juang, Christof Schlaak and Christophe Dubach

Designing hardware accelerators is notoriously hard due to the use of low-level HDLs (Hardware Description Languages). HLS (High-Level Synthesis) tools remove the need for HDLs by offering high-level language abstractions such as C or OpenCL to describe the hardware. However, users still have to fine-tune the hardware architecture based on vendor-specific directives. Moreover, workload-specific hardware accelerators tend to be specialized and might not be able to handle different input data sizes. This can lead to inefficient designs due to unused resources, or simply to being unable to handle larger input sizes. This work proposes to generate designs automatically using coarse-grained resource function sharing, starting from a functional IR (Intermediate Representation). The use of a functional IR facilitates the exploitation of opportunities to share functions, given their explicit representation in the IR. Furthermore, applications such as neural networks fit particularly well with functional abstractions, given their composable nature. When a function is called multiple times, this presents an opportunity to share the hardware resources associated with the function; this is achieved by simply inserting multiplexers into the circuits to select between the different function inputs. However, it is only legal to do so when we can guarantee that the function will never be used concurrently. This work uses interference graphs to detect when the hardware resources associated with a coarse-grained function can be shared. Furthermore, an optimization pass is introduced, using rewriting on the functional IR, to propagate and eliminate multiplexers in order to reduce resource usage. Preliminary results on an Intel Arria 10 FPGA show that it is possible to fit the entire VGG16 neural network on the FPGA, starting from a high-level functional description of the network. The resulting performance is on par with a baseline in which each layer would have its own dedicated FPGA.
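
The following C++ fragment is a rough software analogy, with invented names, of the mux-based coarse-grained sharing described above: one instance of a function serves two call sites, with a multiplexer choosing whose inputs it consumes. In hardware this is only legal when an interference analysis proves the two uses never overlap in time.

    #include <array>
    #include <cstddef>

    // Stand-in for an expensive, resource-hungry layer: in hardware, one
    // shared instance of this unit would exist on the FPGA.
    std::array<float, 4> conv_layer(const std::array<float, 4>& in) {
        std::array<float, 4> out{};
        for (std::size_t i = 0; i < in.size(); ++i) out[i] = 2.0f * in[i];
        return out;
    }

    // The "multiplexer": select which call site's data reaches the shared
    // unit.  Two original call sites of conv_layer collapse onto this one.
    std::array<float, 4> shared_call(bool select_second,
                                     const std::array<float, 4>& in_a,
                                     const std::array<float, 4>& in_b) {
        const std::array<float, 4>& muxed = select_second ? in_b : in_a;
        return conv_layer(muxed);
    }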

Processor-Centric Contention Slowdown Model for Heterogeneous System-on-chips
Yuanchao Xu, Mehmet Belviranili and Xipeng Shen

Heterogeneity is becoming the norm in modern computer architectures. How to optimize the designs of such architectures and how to make programs run efficiently on such systems are two fundamental problems that have drawn much attention in the last decade. Most of those studies, however, have assumed a clean single-program execution scenario. The reality is often different: multiple programs commonly share a single machine, each competing for all kinds of resources, from computing units to memory and bandwidth. As a result, many careful designs and optimizations end up delivering poor performance in practice. This talk will examine the challenges, present some promising findings, and point out some future directions.

SODA Synthesizer: an Open-Source, End-To-End Hardware Compiler
Nicolas Bohm Agostini, Serena Curzel, Ankur Limaye, Vinay Amatya, Marco Minutoli, Vito Giovanni Castellana, Joseph Manzano, Fabrizio Ferrandi and Antonino Tumeo

Enabling autonomous control in novel scientific experimental workflows requires the ability to generate highly specialized systems for data analysis and artificial intelligence, providing the low-latency reasoning capabilities needed to make real-time decisions. This paper presents the SODA (Software Defined Accelerators) framework, an open-source, modular, multi-level, no-human-in-the-loop hardware compiler that enables end-to-end generation of specialized accelerators from high-level data science frameworks. SODA is composed of SODA-OPT, a high-level frontend developed in MLIR that interfaces with domain-specific programming environments and allows performing system-level design, and Bambu, a state-of-the-art high-level synthesis (HLS) engine that can target different device technologies, including field programmable gate arrays (FPGAs) and application specific integrated circuits (ASICs). The framework implements design space exploration as compiler optimization passes. The SODA Synthesizer integrates with open-source logic synthesis and physical layout tools such as OpenROAD to provide an end-to-end solution from high-level algorithmic description to evaluation of the floorplan. We show how the modular, yet tight, integration of the high-level optimizer and lower-level HLS tools enables the generation of specialized accelerators for low-latency data analytics at the edge. We then discuss some of the research opportunities that such a framework allows.

ICHNITE: Predicting memory footprint at compile-time
Nafis Mustakin and Daniel Wong

Programming heterogeneous computer systems places a significant burden on programmers, particularly in managing complex data structures and manually moving data. A shared virtual address space for CPUs and GPUs (such as Unified Virtual Memory, UVM) aims to relieve much of this burden by enabling on-demand data migration between CPU and GPU, which in turn gives rise to new problems such as page thrashing and high latency from page walks. To avoid such problems, modern programming interfaces such as the CUDA API offer functions to guide memory management with hints, such as cudaMemAdvise(), or to guide prefetching of memory, such as cudaMemPrefetchAsync(). However, this again increases the programmer's responsibility for obtaining optimal performance. We argue that most issues plaguing heterogeneous computing with UVM can be alleviated if we can correctly estimate the memory footprints of individual programs at compile time. To that effect, we propose ICHNITE, a compiler tool to estimate a function's memory footprint. At its core, ICHNITE is a program-specific RNN-based memory footprint predictor. ICHNITE takes in the source code and predicts the read and write sets of individual functions, and is thereby able to differentiate between the memory-region demands of CPU functions and GPU kernels. This information can be used directly to automatically advise memory prefetching in UVM. ICHNITE can also alleviate programmer burden by identifying read-only and write-only regions to guide memory management hints and pin pages to certain devices. The analysis by ICHNITE can further improve the performance of heterogeneous programs by i) identifying data dependencies between heterogeneous devices, ii) identifying potential data movement issues (thrashing) in heterogeneous programs, and iii) guiding the placement of data pages across heterogeneous devices. ICHNITE can therefore increase performance by reducing data transfer overhead through smarter prefetching, and increase ease of use by reducing programmer burden. In this presentation, we will present our vision for ICHNITE and the potential benefits it can provide.
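
The hints named in the abstract are standard CUDA runtime calls; the host-side C++ sketch below shows how footprint information could drive them. The function name, sizes, and device id are placeholders, not ICHNITE output.

    #include <cstddef>
    #include <cuda_runtime.h>

    // Host-side sketch of the UVM hints mentioned in the abstract.  A tool
    // that knows a kernel's read/write sets at compile time could emit
    // calls like these automatically for a managed allocation `data`.
    void advise_and_prefetch(float* data, std::size_t bytes, int device,
                             cudaStream_t stream) {
        // Mark a region the GPU will only read, so pages can be replicated
        // instead of migrated back and forth (avoids thrashing).
        cudaMemAdvise(data, bytes, cudaMemAdviseSetReadMostly, device);

        // Prefetch the region to the GPU before the kernel launch, hiding
        // on-demand page-fault latency.
        cudaMemPrefetchAsync(data, bytes, device, stream);
    }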

A SYCL Extension for User-Driven Online Kernel Fusion
Víctor Pérez-Carrasco, Lukas Sommer, Victor Lomüller, Kumudha Narasimhan and Mehdi Goli

Heterogeneous programming models such as SYCL allow developers to integrate a variety of accelerators found in today’s heterogeneous systems into an application with ease. However, while offloading specific tasks to specialized accelerators can deliver significant performance improvements for many applications, short-running device kernels remain a challenge for most heterogeneous programming models. Each invocation of a device kernel is linked to some overhead, caused by the necessary data-transfers, kernel launch and synchronization between host and device. In particular, for a sequence of short-running kernels, this can lead to an unfavourable ratio of overhead and actual computation, resulting in performance degradation.

One potential solution to address this problem is to merge multiple small, memory-bound, short-running kernels into a single larger kernel. This leads to better use of the device’s resources and amortizes the device launch overhead. Yet, manually creating fused kernels can be an error-prone, challenging task for developers, and the resulting kernels are less reusable and maintainable. The extension to the SYCL API presented in this talk aims to automate the creation of fused kernels. It provides a mechanism for users or software frameworks using SYCL to instruct the SYCL runtime to automatically fuse multiple device kernels at runtime, without the need for manual implementation of the fused kernel. Users or software frameworks can use their application and domain knowledge, as well as runtime context information, to determine when fusion of kernels is legal and profitable, while the actual process of creating a fused kernel is automated by the SYCL runtime.

Reducing the kernel launch overhead is, however, not the only way kernel fusion can improve application performance. The LLVM-based JIT compiler integrated into the SYCL runtime implementation for automatic creation of fused kernels can perform further optimizations. One such optimization is the internalization of dataflow. Intermediate results that originally needed to be communicated via global memory between the different kernels now become internal dataflow of the fused kernel. Replacing slow global memory accesses for this internalized dataflow with faster accesses to local memory or even registers can yield significant performance improvements for many applications.

The extension presented in this talk is currently an experimental vendor extension, targeting SYCL version 2020. The initial proof-of-concept implementation was based on Codeplay’s ComputeCpp SYCL implementation and has also been contributed and open-sourced as part of the DPC++ SYCL implementation.

To demonstrate the performance improvements unlocked by the extension, two different types of workloads are evaluated on an Intel CPU and integrated Intel GPUs. For a set of sixteen typical operator sequences from neural networks with various input sizes, kernel fusion achieves speedups between 0.9x and 2.26x on GPU (geo.-mean 1.35x), and between 1.02x and 3.2x on CPU (geo.-mean 1.78x). For complete neural networks, this translates to 1.19x (Resnet 50) and 1.68x (VGG 16) speedup on CPU, and 1.15x (Resnet 50) and 1.02x (VGG 16) speedup on GPU. For the six benchmarks 3mm, bicg, correlation, covariance, fdtd2d and gramschmidt from the SYCL Bench benchmark suite with different input sizes, fusion achieves speedups between 0.98x and 4.91x on GPU (geo.-mean 1.34x), and speedups between 0.82x and 3.28x on CPU (geo.-mean 1.06x).

In summary, this talk presents a SYCL extension automating the creation of fused kernels on user request and shows the potential performance benefits of such an extension on different workloads.
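
To illustrate where the dataflow-internalization benefit comes from, the sketch below contrasts two separate SYCL 2020 kernels with a hand-fused equivalent. It uses only standard SYCL, not the proposed extension API, and assumes the pointers are USM allocations (e.g., from sycl::malloc_shared); all names are illustrative.

    #include <cstddef>
    #include <sycl/sycl.hpp>

    // Two short element-wise kernels versus a hand-fused equivalent.  The
    // extension described in the talk would produce the fused form at
    // runtime; here the fusion is done manually to show the benefit.
    void run(sycl::queue& q, float* a, float* b, float* out, float* tmp,
             std::size_t n) {
        // Unfused: the intermediate `tmp` round-trips through global memory,
        // and two launches pay the launch/synchronization overhead twice.
        q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
            tmp[i] = a[i] + b[i];
        }).wait();                          // make the dependency explicit
        q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
            out[i] = tmp[i] * 2.0f;
        }).wait();

        // Fused: one launch, and the intermediate lives in a register.
        q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
            float t = a[i] + b[i];          // internalized dataflow
            out[i] = t * 2.0f;
        }).wait();
    }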

Reusable semantics for implementation of Python optimizing compilers
Olivier Melançon

Python is among the most popular programming languages in the world due to its accessibility and extensive standard library. Paradoxically, Python is also known for its poor performance on many tasks. Hence, more efficient implementations of the language are required. The development of such optimized implementations is nevertheless hampered by the complex semantics of Python and the lack of an official formal semantics. We address this issue by presenting a formal semantics for Python focused on the development of optimizing compilers. This semantics is written so as to be easily reusable by existing compilers. We also introduce semPy, a partial evaluator of our formal semantics. This tool automatically targets and removes redundant operations from the semantics of Python, yielding a semantics that naturally executes more efficiently. Finally, we present Zipi, a Python optimizing compiler developed with the aid of semPy. On some tasks, Zipi displays performance competitive with that of PyPy, a Python compiler known for its good performance. These results open the door to optimizations based on a partial-evaluation technique that generates specialized implementations for frequent use cases.

Interfacing with CPython from Gambit Scheme
Marc-André Bélanger and Marc Feeley

We have previously presented an integration between Gambit Scheme and CPython [1]. That work combined a syntactic interface relying on a custom parser to facilitate writing Python expressions directly in a Scheme program, as well as a low-level integration typical of other Foreign Function Interfaces (FFI). The presented FFI uses a Foreign Procedure Call (FPC) mechanism to bridge the Gambit and CPython threading models. Python packages from the Python Package Index (PyPI) can now be used from Gambit Scheme, opening a world of mature Python libraries to Scheme developers. The FFI is freely accessible as the 'python' module from the https://github.com/gambit/python GitHub repository. Since publishing at the 2022 Scheme Workshop [1], we have received user feedback which will be addressed in this updated presentation. We will first explain the mechanics of our implementation, then go through all of the steps required to get up and running with the necessary software, and end with a few examples. The demonstrations will be done on Linux, but Gambit Scheme and its 'python' module can also be used on macOS and Windows. In addition to the static examples shown in the accompanying slides, we wish to give a live demo of the FFI so that participants get a good overview of its usage. [1]: Marc-André Bélanger and Marc Feeley. 2022. A Foreign Function Interface between Gambit Scheme and CPython. In Scheme and Functional Programming Workshop 2022.