Software Stack for Hardware Accelerators Workshop (SSHAW)

Workshop Program

Monday, August 17, 8:00PM-9:30PM (EDT / GMT-4)

Convolution Gradient Code Generation for AI Accelerators with Specialized Data Layout Requirements
Linh H. Tran, Amy Wang, Zichun Ye and Giancarlo Colmenares
Video Link Slides Link

Exploring Agile Hardware/Software Co-Design Methodology
Billy Mengxuan Cai, Shruthi Ashwathnarayan, Farhan Shafiq, Ahmed Eltantawy, Reza Azimi and Yaoqing Gao
Video Link Slides Link

RISE: A functional pattern-based language in MLIR
Martin Lucke, Michel Steuwer and Aaron Smith
Video Link Slides Link

Enabling Multi-FPGA Clusters as OpenMP Acceleration Devices
Ramon Nepomuceno, Renan Sterle and Guido Araujo
Video Link Slides Link

Accelerating Pooling through Im2col and Col2im Instructions in the DaVinci Architecture
Caio Salvador Rohwedder, Amy Wang, Giancarlo Colmenares, Jose Nelson Amaral and Guido Araujo
Video Link Slides Link


Convolution Gradient Code Generation for AI Accelerators with Specialized Data Layout Requirements
Linh H. Tran, Amy Wang, Zichun Ye and Giancarlo Colmenares
Video Link Slides Link

Convolution is the most popular and extensively optimized operator in modern deep neural networks. As machine learning frameworks such as TensorFlow [1] enable end users to train networks on commodity hardware, effort is being invested in optimizing the backward, or gradient, kernels for convolution to speed up training. The current method of computing the convolution data gradient suffers from low efficiency because it relies on the column-to-image (col2im) function, which performs a multiple-to-one gradient aggregation. Overcoming this inefficiency would require new generations of tensor processors with better support for vector operations.
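To make the multiple-to-one aggregation concrete, the following is a minimal NumPy sketch of a col2im routine (2-D, single channel, no padding; the function name and signature are illustrative, not the paper's implementation). Where sliding windows overlap, several columns scatter into the same output pixel and must be accumulated, which is the source of the inefficiency the abstract describes.

```python
import numpy as np

def col2im(cols, H, W, K, stride=1):
    """Scatter K x K patch columns back into an H x W image,
    accumulating values where patches overlap (multiple-to-one)."""
    out = np.zeros((H, W))
    out_h = (H - K) // stride + 1
    out_w = (W - K) // stride + 1
    col = 0
    for i in range(out_h):
        for j in range(out_w):
            patch = cols[:, col].reshape(K, K)
            # Overlapping windows accumulate into the same pixels.
            out[i*stride:i*stride+K, j*stride:j*stride+K] += patch
            col += 1
    return out
```

For a 3x3 image with 2x2 patches of all ones, the center pixel is touched by all four windows and accumulates a value of 4, while each corner is touched only once.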

In this paper, we present an alternative approach to generating convolution backward kernels with respect to the data and weight inputs that reuses the forward convolution kernel, which consists of image-to-column (im2col) and matrix multiplications. This approach requires non-trivial data format, or layout, conversions on the inputs and outputs surrounding the use of the forward convolution kernel. Such conversions can become even more complex, and possibly inefficient, when dealing with tensor processors or accelerators that require peculiar data formats to begin with. We therefore formulate an iterator method to systematically perform the required data conversions while taking hardware-specific optimizations in data movement into account. We illustrate the iterator method using Huawei DaVinci's tensor data layout [2].

Our test results using the shapes from ResNet-50 [3] show that on CPU, using library kernels [4] to perform column-to-image, image-to-column, matrix multiplications and the needed format conversions, our approach delivers better performance than the original approach that uses col2im. Furthermore, our approach outperforms TVM's automatically generated backward kernels [5]. We also investigate how fast native hardware support for image-to-column can affect the overall performance of the backward kernels.



Exploring Agile Hardware/Software Co-Design Methodology
Billy Mengxuan Cai, Shruthi Ashwathnarayan, Farhan Shafiq, Ahmed Eltantawy, Reza Azimi and Yaoqing Gao
Video Link Slides Link

With a very competitive marketplace and rapidly evolving application requirements, production of new hardware chips is under tight deadlines due to shortened time-to-market. This puts more emphasis on tools and technologies for automating hardware/software co-design, co-development, and co-verification. Although the problem of design automation has been around for decades, recent developments in machine learning techniques, accompanied by increased processing power, bring this problem back to light with possible new solutions. Still, the challenge remains of how to develop techniques that demonstrably shorten the design, development, and verification cycle of new chip architectures.

A main challenge in designing complex systems with many hardware and software components is how to develop and maintain unambiguous and consistent system specifications, normally in the form of the Instruction Set Architecture (ISA), that can be used by a large number of engineers from multiple teams (hardware development and verification, simulation, compiler development, operating system, tools, etc.). This is key because a number of tools must be developed from such an ISA specification in order to enable the usage and evaluation of the system. Moreover, the chip ISA is expected to go through many iterations within the development cycle. If this process is not automated properly, a vast amount of manual effort must be spent to debug and resolve issues caused by inconsistent interpretations and human errors in dealing with many versions of the specification. Thus, a formal definition of the ISA is highly desirable so that (a) its consistency and correctness can be analyzed and verified automatically as much as possible, and (b) the tools based on the ISA can also be generated automatically from the formal definition.

Secondly, current practices in designing a new chip product rely heavily on prior experience and insights from designing similar solutions or from previous iterations of the same chip. These risk-aware practices are established mainly due to (a) the complexity of the possible solutions that can meet all the requirements (functionality, performance, power, area, etc.), and (b) hard time-to-market constraints. Such experience and insights, which come from a deep understanding of the bottlenecks and pain points of previous designs, are indeed precious. However, they tend to yield incremental improvements and are unlikely to achieve substantial competitive advantages. Therefore, any tools and techniques that allow designers to systematically explore a vast HW/SW design space to arrive at more optimized solutions would be highly beneficial.

In this short paper, we present our experience in addressing both of these challenges. First, we describe the design and implementation of a Re-targetable SDK (RSDK) based on a semi-formal DSL that we designed to define the ISA. Our RSDK is capable of automatically generating a number of software tools (assembler/disassembler, functional simulator, documentation, compiler, etc.) from the ISA definition, thus eliminating a vast amount of manual effort in both hardware and software development. Second, we present our preliminary solution for leveraging machine learning (ML) to build automatic HW/SW co-design space exploration tools.
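The idea of deriving tools from a single ISA specification can be sketched in miniature. The table below is a purely hypothetical two-instruction ISA with an invented 8-bit field encoding (nothing here reflects the paper's actual DSL or RSDK); the point is that a disassembler generated from the table can never drift out of sync with the specification.

```python
# Hypothetical, minimal ISA table in the spirit of a semi-formal DSL;
# the opcode names and the 8-bit field encoding are invented for illustration.
ISA = {
    0x01: ("ADD", ["rd", "rs1", "rs2"]),
    0x02: ("LDI", ["rd", "imm"]),
}

def make_disassembler(isa):
    """Generate a disassembler from the ISA table, so the tool is always
    consistent with the single ISA specification."""
    def disasm(word):
        # Opcode lives in the top byte of a 32-bit word; operand
        # fields follow in successive bytes.
        opcode = (word >> 24) & 0xFF
        mnemonic, operands = isa[opcode]
        shift = 16
        parts = []
        for name in operands:
            parts.append(f"{name}={(word >> shift) & 0xFF}")
            shift -= 8
        return mnemonic + " " + ", ".join(parts)
    return disasm

disasm = make_disassembler(ISA)
```

An assembler, a functional simulator, or documentation could be generated from the same table in the same fashion, which is what eliminates the version-skew errors described above.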



RISE: A functional pattern-based language in MLIR
Martin Lucke, Michel Steuwer and Aaron Smith
Video Link Slides Link

We need new intermediate representations that break up today's monolithic and inflexible machine learning kernels into more flexible and finer-grained compute primitives. This talk presents one such representation, called RISE, which provides a set of data-parallel high-level patterns for describing computations over higher-dimensional arrays (tensors) in an abstract way. We will discuss the implementation of the RISE dialect in MLIR and show examples of lowering TensorFlow operations using RISE.



Enabling Multi-FPGA Clusters as OpenMP Acceleration Devices
Ramon Nepomuceno, Renan Sterle and Guido Araujo
Video Link Slides Link

FPGA-based hardware accelerators have received increasing attention in recent years, mainly due to their reconfiguration capabilities, which facilitate adapting the accelerator to distinct types of workloads and thus result in higher computational performance and energy efficiency. It has been reported that offloading computation to FPGAs achieves better performance than GPU or CPU execution for some stencil and pipelined applications (e.g., Fast Fourier Transforms). This performance could be even higher if multiple FPGAs were interconnected in a Multi-FPGA cluster. However, programming such a heterogeneous architecture is a challenging endeavor and still requires research and development effort to make it productive. This work aims at making a Multi-FPGA cluster architecture work as a coordinated set of acceleration devices through the OpenMP task-based programming model. This enables the programmer to expose a higher degree of parallelism while leveraging OpenMP task constructs to simplify programming.



Accelerating Pooling through Im2col and Col2im Instructions in the DaVinci Architecture
Caio Salvador Rohwedder, Amy Wang, Giancarlo Colmenares, Jose Nelson Amaral and Guido Araujo
Video Link Slides Link

Image-to-column (Im2col) and column-to-image (Col2im) are data transformation operations that have been extensively used to map convolution to matrix multiplication. These operations rearrange convolution's inputs to provide a friendlier data layout to CPUs and GPUs. They also make it possible to leverage highly optimized linear algebra libraries that implement matrix multiplication, such as OpenBLAS, ATLAS, and Eigen.
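The mapping from convolution to matrix multiplication can be sketched in a few lines of NumPy (2-D, single channel, stride 1, no padding; function names are illustrative): im2col unfolds each sliding window into a column, after which the whole convolution collapses into one matrix product with the flattened filter.

```python
import numpy as np

def im2col(x, K):
    """Unfold the K x K sliding windows of x into columns (stride 1, no padding)."""
    H, W = x.shape
    out_h, out_w = H - K + 1, W - K + 1
    cols = np.empty((K * K, out_h * out_w))
    col = 0
    for i in range(out_h):
        for j in range(out_w):
            cols[:, col] = x[i:i+K, j:j+K].ravel()
            col += 1
    return cols

def conv2d_via_im2col(x, w):
    """Convolution (cross-correlation) as a single matrix multiplication."""
    K = w.shape[0]
    out_h, out_w = x.shape[0] - K + 1, x.shape[1] - K + 1
    return (w.ravel() @ im2col(x, K)).reshape(out_h, out_w)
```

Because the matrix product does all the arithmetic, any tuned GEMM library can be dropped in for it, which is exactly why this rewrite has been so popular on CPUs and GPUs.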

Nevertheless, such transformations are memory intensive and create considerable performance overhead. As a result, Im2col and Col2im have not been explored for CNN operators other than convolution, whose large number of operations amortizes the overhead.

DaVinci is a neural-network acceleration architecture released by Huawei. It features the AI Core, an engine that provides hardware support for efficient execution of Im2col and Col2im operations. This presentation describes the inner workings of DaVinci's Im2col and Col2im instructions and shows that they can be used to considerably improve the performance of pooling operators. Speed-ups of up to 4x were achieved compared with implementations that do not use these instructions.
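A small NumPy sketch shows why pooling is a natural fit for Im2col (single channel, no padding; this illustrates the general reformulation, not DaVinci's instructions): once the windows are unfolded into columns, max pooling is just a reduction down each column, so hardware that performs the unfolding natively removes the dominant data-movement cost.

```python
import numpy as np

def im2col(x, K, stride):
    """Unfold K x K windows of x (with the given stride) into columns."""
    H, W = x.shape
    out_h = (H - K) // stride + 1
    out_w = (W - K) // stride + 1
    cols = np.empty((K * K, out_h * out_w))
    col = 0
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            cols[:, col] = x[r:r+K, c:c+K].ravel()
            col += 1
    return cols, out_h, out_w

def max_pool(x, K, stride):
    """Max pooling as a column-wise reduction over the im2col output."""
    cols, out_h, out_w = im2col(x, K, stride)
    return cols.max(axis=0).reshape(out_h, out_w)
```

Average pooling follows the same pattern with `mean` in place of `max`.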