Monday, August 17th
8:00am-10:00am SRMPDS: International Workshop on Scheduling and Resource Management for Parallel and Distributed...
Zoom Meeting B
Workshop
10:00am-11:00am EMS: International Workshop on Embedded Multicore Systems
Zoom Meeting A
Workshop
12:00pm-2:45pm P2S2: International Workshop on Parallel Programming Models and Systems Software for High-End Com...
Zoom Meeting C
Workshop
8:00pm-9:30pm Software Stack for Hardware Accelerators Workshop (SSHAW)
Zoom Meeting C
Workshop
10:00pm-11:00pm AWASN: International Workshop on Applications of Wireless Ad hoc and Sensor Networks
Meeting Link Provided by Chairs
Workshop
Asynchronous EXA_PMRA: International Workshop on Performance modelling, Runtime System and Applications at the Exascale
Workshop
Tuesday, August 18th
11:00am-11:15am Opening Remarks
Zoom Meeting A
J. Nelson Amaral
Opening Remark
11:15am-11:55am Keynote-1: Kathy Yelick, U.C. Berkeley
Zoom Meeting A
Xipeng Shen
Genomic Analysis and Learning at Scale: Mapping Irregular Computations to Advanced Architectures
Genomic Analysis and Learning at Scale: Mapping Irregular Computations to Advanced Architectures
Kathy Yelick (University of California, Berkeley; Lawrence Berkeley National Laboratory) Abstract Genomic data sets are growing dramatically as the cost of sequencing continues to decline and small sequencing devices become widely available. Enormous community databases store and share this data with the research community, and some of the data analysis problems require large-scale computational platforms to meet both the memory and computational requirements. These computations range from analysis and correction of raw genomic data to higher-level machine learning approaches. These applications differ from scientific simulations and place different requirements on programming support, software libraries, and parallel architectural design. For example, they involve irregular communication patterns such as asynchronous updates to shared data. The ExaBiome project, part of the Exascale Computing Project, is developing high performance tools for analyzing microbial data, which is especially challenging as hundreds of species may be collected and sequenced in a single sample from a human, animal or environmental microbiome. I will give an overview of several high performance genomics analysis problems, including alignment, profiling, clustering, and assembly, and describe some of the challenges and opportunities of mapping these to current petascale and future exascale architectures, including GPU-based systems. I will also describe some of the common computational patterns or "motifs" that inform parallelization strategies and can be useful in understanding architectural requirements, algorithmic approaches, and benchmarking current and future systems. Keynote 12:05pm-12:45pm Best-Paper Candidates Zoom Meeting A Lizy John Huffman Coding with Gap Arrays for GPU Acceleration
Huffman Coding with Gap Arrays for GPU Acceleration Best Paper Candidate
Naoya Yamamoto, Koji Nakano, Yasuaki Ito, and Daisuke Takafuji (Hiroshima University) and Akihiko Kasagi and Tsuguchika Tabaru (Fujitsu Laboratories Ltd.) Abstract Video Link Slides Link Huffman coding is a fundamental lossless data compression algorithm used in many data compression file formats such as gzip, zip, png, and jpeg. Huffman encoding converts an 8-bit symbol sequence into a codeword sequence with variable-length codewords based on a codebook. Huffman encoding is easy to parallelize, because all 8-bit symbols can be converted into codewords independently. On the other hand, since an encoded codeword sequence has no separator to identify each codeword, Huffman decoding is hard to parallelize. The main contribution of this paper is to improve previously presented GPU implementations for Huffman encoding and decoding. We also present a new data structure called a gap array to be attached to an encoded codeword sequence of Huffman coding for accelerating parallel Huffman decoding. Our improved GPU implementations for Huffman encoding and decoding use several acceleration techniques: (1) the Single Kernel Soft Synchronization (SKSS), (2) wordwise global memory access, and (3) compact codebooks. The experimental results for 10 files on an NVIDIA Tesla V100 GPU show that our GPU Huffman encoding and decoding are 2.87x-7.70x and 1.26x-2.63x faster than previously presented GPU Huffman encoding and decoding, respectively. Also, Huffman decoding can be further accelerated by a factor of 1.67x-6450x if a gap array is attached to an encoded codeword sequence. Since the size and computing overhead of gap arrays in Huffman encoding are quite small, we conclude that gap arrays should be introduced for GPU Huffman encoding and decoding. CapelliniSpTRSV: A Thread-Level Synchronization-Free Sparse Triangular Solve on GPUs
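The gap-array idea lends itself to a compact illustration. The Python sketch below is only a hedged sketch of the concept described in the abstract: the fixed segment length, the dictionary-based Huffman tree, and all helper names are assumptions, not the paper's GPU implementation (which additionally relies on SKSS, wordwise memory access, and compact codebooks).

```python
# Hedged sketch of the gap-array idea for parallel Huffman decoding.
# SEGMENT_BITS and the dict-based Huffman tree are illustrative assumptions.

SEGMENT_BITS = 4096  # assumed fixed segment length in bits

def encode_with_gaps(symbols, codebook):
    """Encode symbols; record, for each segment boundary, the bit offset (gap)
    from that boundary to the first codeword that starts at or after it."""
    bits, gaps, next_seg = [], {}, 0
    for s in symbols:
        start = len(bits)                      # this codeword starts at bit `start`
        while next_seg * SEGMENT_BITS <= start:
            gaps[next_seg] = start - next_seg * SEGMENT_BITS
            next_seg += 1
        bits.extend(codebook[s])               # codeword given as a string of '0'/'1'
    return bits, gaps

def decode_segment(bits, gaps, seg, tree):
    """Decode all codewords that start inside segment `seg`. With the gap array,
    each segment (a thread/block on a GPU) can start decoding without scanning
    the whole stream from bit 0 to find a codeword boundary."""
    pos = seg * SEGMENT_BITS + gaps[seg]
    end = min((seg + 1) * SEGMENT_BITS, len(bits))
    out = []
    while pos < end:
        node = tree
        while isinstance(node, dict):          # walk the Huffman tree bit by bit
            node = node[bits[pos]]
            pos += 1
        out.append(node)                       # leaf reached: emit the symbol
    return out
```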
CapelliniSpTRSV: A Thread-Level Synchronization-Free Sparse Triangular Solve on GPUs Best Paper Candidate
Jiya Su and Feng Zhang (Renmin University of China), Weifeng Liu (China University of Petroleum), Bingsheng He (National University of Singapore), Ruofan Wu and Xiaoyong Du (Renmin University of China), and Rujia Wang (Illinois Institute of Technology) Abstract Video Link Slides Link Sparse triangular solves (SpTRSVs) have been extensively used in linear algebra fields, and many GPU-based SpTRSV algorithms have been proposed. Synchronization-free SpTRSVs, due to their short preprocessing time and high performance, are currently the most popular SpTRSV algorithms. However, we observe that the performance of those SpTRSV algorithms on different matrices can vary by up to 845 times. Our further studies show that when the average number of components per level is high and the average number of nonzero elements per row is low, those SpTRSVs exhibit extremely low performance. The reason is that they use a warp on the GPU to process a row in sparse matrices, and such warp-level designs have severe underutilization of the GPU. To solve this problem, we propose CapelliniSpTRSV, a thread-level synchronization-free SpTRSV algorithm. Particularly, CapelliniSpTRSV has three novel features. First, unlike previous studies, CapelliniSpTRSV does not need preprocessing to calculate levels. Second, CapelliniSpTRSV exhibits high performance on matrices that previous SpTRSVs cannot handle efficiently. Third, CapelliniSpTRSV's optimization does not rely on a specific sparse matrix storage format. Instead, it achieves very good performance on the most popular sparse matrix storage format, compressed sparse row (CSR), so users do not need to conduct format conversion. We evaluate CapelliniSpTRSV with 245 matrices from the Florida Sparse Matrix Collection on three GPU platforms, and experiments show that our SpTRSV achieves 6.84 GFLOPS, which is a 4.97x speedup over the state-of-the-art synchronization-free SpTRSV algorithm and a 4.74x speedup over the SpTRSV in cuSPARSE. CapelliniSpTRSV is open-sourced at https://github.com/JiyaSu/CapelliniSpTRSV. SkyChain: A Deep Reinforcement Learning-Empowered Dynamic Blockchain Sharding System
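For reference, the row-wise recurrence that any CSR-based SpTRSV must evaluate can be written as a short sequential sketch. CapelliniSpTRSV's contribution is to run this recurrence on the GPU with one thread per row and without synchronization or preprocessing, which the sketch below does not attempt to reproduce; the layout assumption (diagonal stored as each row's last nonzero, i.e., sorted column indices in a lower-triangular matrix) is ours.

```python
import numpy as np

def sptrsv_csr_lower(row_ptr, col_idx, vals, b):
    """Sequential reference: solve L x = b for lower-triangular L in CSR,
    assuming sorted column indices so the diagonal is each row's last nonzero."""
    n = len(b)
    x = np.zeros(n)
    for i in range(n):
        s = b[i]
        for k in range(row_ptr[i], row_ptr[i + 1] - 1):
            s -= vals[k] * x[col_idx[k]]       # depends on previously solved rows
        x[i] = s / vals[row_ptr[i + 1] - 1]    # divide by the diagonal entry
    return x

# Example: L = [[2, 0], [1, 4]], b = [2, 6]  ->  x = [1.0, 1.25]
# sptrsv_csr_lower([0, 1, 3], [0, 0, 1], [2.0, 1.0, 4.0], np.array([2.0, 6.0]))
```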
SkyChain: A Deep Reinforcement Learning-Empowered Dynamic Blockchain Sharding System Best Paper Candidate
Jianting Zhang, Zicong Hong, and Xiaoyu Qiu (Sun Yat-sen University); Yufeng Zhan and Song Guo (The Hong Kong Polytechnic University); and Wuhui Chen (Sun Yat-sen University, zhangjt26@mail2.sysu.edu.cn) Abstract Video Link Slides Link To overcome the limitations on the scalability of current blockchain systems, sharding is widely considered as a promising solution that divides the network into multiple disjoint groups processing transactions in parallel to improve throughput while decreasing the overhead of communication, computation, and storage. However, most existing blockchain sharding systems adopt a static sharding policy that cannot efficiently deal with the dynamic environment in the blockchain system, i.e., joining and leaving of nodes, and malicious attack. This paper presents SkyChain, a novel dynamic sharding-based blockchain framework to achieve a good balance between performance and security without compromising scalability under the dynamic environment. We first propose an adaptive ledger protocol to guarantee that the ledgers can merge or split efficiently based on the dynamic sharding policy. Then, to optimize the sharding policy under dynamic environment with high dimensional system states, a deep reinforcement learning-based sharding approach has been proposed, the goals of which include: 1) building a framework to evaluate the blockchain sharding systems from the aspects of performance and security; 2) adjusting the re-sharding interval, shard number and block size to maintain a long-term balance of the system's performance and security. Experimental results show that SkyChain can effectively improve the performance and security of the sharding system without compromising scalability under the dynamic environment in the blockchain system. GOSH: Embedding Big Graphs on Small Hardware
GOSH: Embedding Big Graphs on Small Hardware Best Paper Candidate
Taha Atahan Akyildiz, Amro Alabsi Aljundi, and Kamer Kaya (Sabancı University) Abstract Video Link Slides Link In graph embedding, the connectivity information of a graph is used to represent each vertex as a point in a d-dimensional space. Unlike the original irregular structural information, such a representation can be used for a multitude of machine learning tasks. Although the process is extremely useful in practice, it is indeed expensive, and unfortunately the graphs are becoming larger and harder to embed. Attempts at scaling up the process to larger graphs have been successful but often at a steep price in hardware requirements. We present GOSH, an approach for embedding graphs of arbitrary sizes on a single GPU with minimum constraints. GOSH utilizes a novel coarsening approach to minimize the work required for embedding, delivering high-quality embeddings at a fraction of the time compared to the state-of-the-art. In addition to this, it incorporates a decomposition schema that enables any arbitrarily large graph to be embedded using a single GPU with minimum constraints on the memory size. With these techniques, GOSH is able to embed a graph with over 65 million vertices and 1.8 billion edges in less than an hour on a single GPU and obtains a 93% AUCROC for link prediction, which can be increased to 95% by running the tool for 80 minutes. Paper 12:55pm-1:25pm 1A: Distributed Systems Zoom Meeting A Martin Kong CARD: A Congestion-Aware Request Dispatching Scheme for Replicated Metadata Server Cluster
CARD: A Congestion-Aware Request Dispatching Scheme for Replicated Metadata Server Cluster
Shangming Cai (Tsinghua University), Dongsheng Wang (Tsinghua University and Peng Cheng Laboratory), and Zhanye Wang and Haixia Wang (Tsinghua University) Abstract Video Link Slides Link A replicated metadata server cluster (RMSC) is highly efficient for distributed filesystems facing data-driven scenarios (e.g., massive-scale distributed machine learning tasks). Yet, when considering cost-effectiveness and system utilization, the cluster scale is commonly restricted in practice. Within this context, servers in the cluster start to suffer from load-oscillations at higher system utilization due to clients' congestion-unaware behaviors and unintelligent selection strategies (i.e., servers in the cluster are preferred and then evaded intermittently). The consequences brought by load-oscillations degrade the overall performance of the whole system to some extent. One solution to this problem is to have clients share part of the responsibility and behave more wisely for the sake of stability. In this paper, we present a Congestion-Aware Request Dispatching scheme, CARD, which is mainly conducted at clients and directed by a rate control mechanism. Through extensive experiments, we verify that CARD is highly efficient in resolving load-oscillations in RMSC. Apart from this, our results show that RMSC with our congestion-aware optimization achieves better scalability compared to previous implementations under targeted workloads, especially in heterogeneous environments. Safe, Fast Sharing of memcached as a Protected Library
Safe, Fast Sharing of memcached as a Protected Library
Chris Kjellqvist, Mohammad Hedayati, and Michael L. Scott (University of Rochester) Abstract Video Link Slides Link Memcached is a widely used key-value store. It is structured as a multithreaded user-level server, accessed over socket connections by a potentially distributed collection of clients. Because socket communication is so much more expensive than a single operation on a K-V store, much of the client library is devoted to batching of requests. Batching is not always feasible, however, and the cost of communication seems particularly unfortunate when -- as is often the case -- clients are co-located on a single machine with the server, and have access to the same physical memory. DQEMU: A Scalable Emulator with Retargetable DBT on Distributed Platforms
DQEMU: A Scalable Emulator with Retargetable DBT on Distributed Platforms
Ziyi Zhao, Zhang Jiang, Ximing Liu, and Xiaoli Gong (Nankai University); Wenwen Wang (University of Georgia); and Pen-Chung Yew (University of Minnesota at Twin Cities) Abstract Video Link Slides Link The scalability of a dynamic binary translation (DBT) system has become important due to the prevalence of multicore systems and large multi-threaded applications. Several recent efforts have addressed some critical issues in extending a DBT system to run on multicore platforms for better scalability. In this paper, we present a distributed DBT framework, called DQEMU, that goes beyond a single-node multicore processor and can be scaled up to a cluster of multi-node servers. Paper 1B: Edge Learning and Inference Zoom Meeting B Chih-Chieh Yang ShadowTutor: Distributed Partial Distillation for Mobile Video DNN Inference
ShadowTutor: Distributed Partial Distillation for Mobile Video DNN Inference
Jae-Won Chung, Jae-Yun Kim, and Soo-Mook Moon (Seoul National University) Abstract Video Link Slides Link Following the recent success of deep neural networks (DNN) on video computer vision tasks, performing DNN inferences on videos that originate from mobile devices has gained practical significance. As such, previous approaches developed methods to offload DNN inference computations for images to cloud servers to manage the resource constraints of mobile devices. However, when it comes to video data, communicating information of every frame consumes excessive network bandwidth and renders the entire system susceptible to adverse network conditions such as congestion. Thus, in this work, we seek to exploit the temporal coherence between nearby frames of a video stream to mitigate network pressure. That is, we propose ShadowTutor, a distributed video DNN inference framework that reduces the number of network transmissions through intermittent knowledge distillation to a student model. Moreover, we update only a subset of the student's parameters, which we call partial distillation, to reduce the data size of each network transmission. Specifically, the server runs a large and general teacher model, and the mobile device only runs an extremely small but specialized student model. On sparsely selected key frames, the server partially trains the student model by targeting the teacher's response and sends the updated part to the mobile device. We investigate the effectiveness of ShadowTutor with HD video semantic segmentation. Evaluations show that network data transfer is reduced by 95% on average. Moreover, the throughput of the system is improved by over three times and shows robustness to changes in network bandwidth. FEEL: A Federated Edge Learning System for Efficient and Privacy-Preserving Mobile Healthcare
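The partial-distillation step can be pictured with a generic PyTorch sketch; the module names ("backbone", "head"), the MSE loss, and the choice of which parameter subset to update are assumptions for illustration and do not reproduce ShadowTutor's implementation.

```python
# Generic partial-distillation step: update only a designated subset of the
# student's parameters against the teacher's output on one key frame.
import torch
import torch.nn.functional as F

def partial_distill_step(student, teacher, frame, optimizer):
    # Freeze everything except the part we intend to ship over the network.
    for p in student.backbone.parameters():
        p.requires_grad = False
    for p in student.head.parameters():
        p.requires_grad = True

    with torch.no_grad():
        target = teacher(frame)            # teacher's dense prediction

    pred = student(frame)
    loss = F.mse_loss(pred, target)        # match the teacher's response
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Only the updated subset needs to be transmitted to the mobile device.
    return {k: v.detach().cpu() for k, v in student.head.state_dict().items()}
```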
FEEL: A Federated Edge Learning System for Efficient and Privacy-Preserving Mobile Healthcare
Yeting Guo (National University of Defense Technology), Fang Liu (Sun Yat-Sen University), Zhiping Cai (National University of Defense Technology), Li Chen (University of Louisiana at Lafayette), and Nong Xiao (National University of Defense Technology) Abstract Video Link Slides Link With the prosperity of artificial intelligence, neural networks have been increasingly applied in healthcare for a variety of tasks for medical diagnosis and disease prevention. Mobile wearable devices, widely adopted by hospitals and health organizations, serve as emerging sources of medical data and participate in the training of neural network models for accurate model inference. Since the medical data are privacy-sensitive and non-shareable, federated learning has been proposed to train a model across decentralized data, which involves each mobile device running a training task with its own data. However, due to the ever-increasing size and complexity of modern neural network models, it becomes inefficient, and may even be infeasible, to perform training tasks on wearable devices that are resource-constrained. In this paper, we propose a FEderated Edge Learning system, FEEL, for efficient privacy-preserving mobile healthcare. Specifically, we design an edge-based training task offloading strategy to improve the training efficiency. Further, we build our system on the basis of federated learning to make use of private user data on multiple devices with the guarantee that the data is kept locally. In addition, during model training, we provide a differential privacy scheme to strengthen the privacy protection. A prototype system has been implemented to evaluate the efficiency, inference performance and noise sensitivity of mobile medical model training, and the results demonstrate that our proposal can train models in an efficient and privacy-preserving way. Adaptive Distributed Convolutional Neural Network Inference at the Network Edge with ADCNN
Adaptive Distributed Convolutional Neural Network Inference at the Network Edge with ADCNN
Sai Qian Zhang (Harvard University), Jieyu Lin (University of Toronto), and Qi Zhang (Microsoft) Abstract Video Link Slides Link The emergence of the Internet of Things (IoT) has led to a remarkable increase in the volume of data generated at the network edge. In order to support real-time smart IoT applications, massive amounts of data generated from edge devices need to be processed using methods such as deep neural networks (DNNs) with low latency. To improve application performance and minimize resource cost, enterprises have begun to adopt Edge computing, a computation paradigm that advocates processing input data locally at the network edge. However, as edge nodes are often resource-constrained, running data-intensive DNN inference tasks on each individual edge node often incurs high latency, which seriously limits the practicality and effectiveness of this model. Paper 1C: Memory Systems Zoom Meeting C Alaa Alameldeen An Efficient Wear-level Architecture using Self-adaptive Wear Leveling
An Efficient Wear-level Architecture using Self-adaptive Wear Leveling
Jianming Huang, Yu Hua, Pengfei Zuo, Wen Zhou, and Fangting Huang (Huazhong University of Science & Technology) Abstract Video Link Slides Link Non-volatile memory (NVM) is becoming the main device for next-generation memory due to its high density, near-zero standby power, non-volatility and byte-addressability. The multi-level cell (MLC) technique has been used in non-volatile memory to significantly increase device density and capacity, which however leads to much weaker endurance than the single-level cell (SLC) counterpart. Although wear-leveling techniques can mitigate this weakness in MLC, the improvement for MLC-based NVM is very limited because a uniform write distribution is not achieved before some cells are actually worn out. To address this problem, our paper proposes a self-adaptive wear-leveling (SAWL) scheme for MLC-based NVM. The idea behind SAWL is to dynamically tune the wear-leveling granularities and balance the writes across the cells of the entire memory, thus achieving a suitable tradeoff between lifetime and cache hit rate. Moreover, to reduce the size of the address-mapping table, SAWL maintains a few recently-accessed mappings in a small on-chip cache. Experimental results demonstrate that SAWL significantly improves the NVM lifetime and performance, compared with state-of-the-art schemes. CCHL: Compression-Consolidation Hardware Logging for Efficient Failure-Atomic Persistent Memory Updates
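The on-chip cache of recently accessed mappings can be illustrated with a small sketch; the table organization, the LRU policy, and the cache size below are illustrative assumptions rather than SAWL's actual hardware design.

```python
# Sketch of an address-remapping lookup backed by a small cache of hot mappings.
from collections import OrderedDict

class RemapCache:
    def __init__(self, full_table, capacity=64):
        self.full_table = full_table      # complete logical->physical map (off-chip)
        self.cache = OrderedDict()        # small on-chip cache of recent mappings
        self.capacity = capacity

    def translate(self, logical_block):
        if logical_block in self.cache:            # hit: serve from the small cache
            self.cache.move_to_end(logical_block)
            return self.cache[logical_block]
        physical = self.full_table[logical_block]  # miss: consult the full table
        self.cache[logical_block] = physical
        if len(self.cache) > self.capacity:        # evict the least recently used entry
            self.cache.popitem(last=False)
        return physical
```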
CCHL: Compression-Consolidation Hardware Logging for Efficient Failure-Atomic Persistent Memory Updates
Xueliang Wei, Dan Feng, Wei Tong, Jingning Liu, Chengning Wang, and Liuqing Ye (Huazhong University of Science and Technology) Abstract Video Link Slides Link Non-volatile memory (NVM) is emerging as a fast byte-addressable persistent memory (PM) that promises data persistence at the main memory level. One of the common choices for providing failure-atomic updates in PM is the write-ahead logging (WAL) technique. To mitigate logging overhead, recent studies propose WAL-based hardware logging designs that overlap log writes with transaction execution. However, existing hardware logging designs incur a large number of unnecessary log writes. Many log writes are still performed in the critical path, which causes high performance overhead, particularly for the multi-core systems with many threads. Balancing Fairness and Efficiency for Cache Sharing in Semi-external Memory System
Balancing Fairness and Efficiency for Cache Sharing in Semi-external Memory System
Shanjiang Tang, Qifei Chai, and Ce Yu (Tianjin University); Yusen Li (Nankai University); and Chao Sun (Tianjin University) Abstract Video Link Slides Link Data caching and sharing is an effective approach for achieving high performance for many applications in shared platforms such as the cloud. RAM and SSD are two popular caching devices widely used by many large-scale data application systems such as Hadoop and Spark. Due to the limited size of RAM as well as the large access latency of SSD, there is a trend of integrating RAM and SSD (called semi-external memory) together for large-scale data caching. Paper 1:35pm-2:05pm 2A: Fault-Tolerance Zoom Meeting A Ben Karsin Algorithm-Based Checkpoint-Recovery for the Conjugate Gradient Method
Algorithm-Based Checkpoint-Recovery for the Conjugate Gradient Method
Carlos Pachajoa, Christina Pacher, Markus Levonyak, and Wilfried N. Gansterer (University of Vienna) Abstract Video Link Slides Link As computers reach exascale and beyond, the incidence of faults will increase. Solutions to this problem are an active research topic. We focus on strategies to make the preconditioned conjugate gradient (PCG) solver resilient against node failures, specifically, the exact state reconstruction (ESR) method, which exploits redundancies in PCG. Robustness of the Young/Daly formula for stochastic iterative applications
Robustness of the Young/Daly formula for stochastic iterative applications
Yishu Du (Tongji University Shanghai, ENS Lyon); Loris Marchal (ENS Lyon); Guillaume Pallez (INRIA Bordeaux); and Yves Robert (ENS Lyon, University of Tennessee) Abstract Video Link Slides Link The Young/Daly formula for periodic checkpointing is known to hold for a divisible-load application where one can checkpoint at any time-step. In a nutshell, the optimal period is P = \sqrt{2 MUF C}, where MUF is the Mean Time Between Failures (MTBF) and C the checkpoint time. Assuming unit execution time, P is also the amount of work executed. This paper assesses the accuracy of the formula for applications decomposed into computational segments where: (i) the duration of a segment is stochastic, i.e., obeys a probability distribution law Distrib of mean MUD; and (ii) one can checkpoint only at the end of a segment. We first consider static strategies where checkpoints are taken after a given number of segments k, and we show that using the Young/Daly formula to compute k (as k MUD = P) is asymptotically optimal among such strategies, and remains accurate even when the distribution Distrib has a large standard deviation. We also consider dynamic strategies where one decides to checkpoint at the end of a segment only if the total amount of work since the last checkpoint exceeds a threshold WTH, and otherwise proceeds to the next segment; we show that an approximation of the optimal value of WTH is P. Finally, we provide an extensive set of simulations where Distrib is either Uniform, Gamma or truncated Normal, which show the global accuracy of the Young/Daly formula and establish that its relevance goes well beyond its original framework. Energy-aware strategies for reliability-oriented real-time task allocation on heterogeneous platforms
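Restated in display form with the abstract's own symbols (MUF for the MTBF, C for the checkpoint time, MUD for the mean segment duration), the period and the static checkpointing rule read:

```latex
P = \sqrt{2 \cdot \mathrm{MUF} \cdot C},
\qquad
k \cdot \mathrm{MUD} = P
\;\;\Longrightarrow\;\;
k = \frac{\sqrt{2 \cdot \mathrm{MUF} \cdot C}}{\mathrm{MUD}} .
```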
Energy-aware strategies for reliability-oriented real-time task allocation on heterogeneous platforms
Li Han (East China Normal University, ENS Lyon); Yiqin Gao (ENS Lyon); Jing Liu (East China Normal University); Yves Robert (ENS Lyon, University of Tennessee Knoxville); and Frédéric Vivien (ENS Lyon) Abstract Video Link Slides Link Low energy consumption and high reliability are widely identified as increasingly relevant issues in real-time systems on heterogeneous platforms. In this paper, we propose a multi-criteria optimization strategy to minimize the expected energy consumption while enforcing the reliability threshold and meeting all task deadlines. The tasks are replicated to ensure a prescribed reliability threshold. The platforms are composed of processors with different (and possibly unrelated) characteristics, including speed profile, energy cost and failure rate. We provide several mapping and scheduling heuristics towards this challenging optimization problem. Specifically, a novel approach is designed to control (i) how many replicas to use for each task, (ii) on which processor to map each replica and (iii) when to schedule each replica on its assigned processor. Different mappings achieve different levels of reliability and consume different amounts of energy. Scheduling matters because once a task replica is successful, the other replicas of that task are cancelled, which calls for minimizing the amount of temporal overlap between any replica pair. The experiments are conducted for a comprehensive set of execution scenarios, with a wide range of processor speed profiles and failure rates. The comparison results reveal that our strategies perform better than the random baseline, with a gain of 40% in energy consumption, for nearly all cases. The absolute performance of the heuristics is assessed by a comparison with a lower bound; the best heuristics achieve an excellent performance, with an average value only 4% higher than the lower bound. Paper 2B: Scheduling and Placement in Networks Zoom Meeting B Bin Ren Cooperative Game for Multiple Chargers with Dynamic Network Topology
Cooperative Game for Multiple Chargers with Dynamic Network Topology
Chi Lin, Ziwei Yang, and Yu Sun (Dalian University of Technology); Jing Deng (University of North Carolina); Lei Wang (University of North Carolina at Greensboro); and Guowei Wu (Dalian University of Technology) Abstract Video Link Slides Link Recent breakthroughs in wireless power transfer technology have enabled wireless sensor networks to operate virtually forever with the help of mobile chargers (MCs), thus giving rise to the concept of wireless rechargeable sensor networks (WRSNs). However, existing studies mainly focus on developing charging tours with a fixed network topology, most of which are not suitable for networks with dynamic topology, usually leading to massive packet/data loss. In this work, we explore the problem of charging scheduling for WRSNs with multiple MCs when confronted with dynamic topology. To minimize the energy cost and prolong the network lifetime, we convert the charging scheduling problem into a vehicle routing problem, which is proved to be NP-hard. Then we model the problem as a cooperative game played among sensors and propose a cooperative game theoretical charging scheduling (CGTCS) algorithm to construct the optimal coalition structure. Then, we design an adaptive optimal coalition structure updating algorithm (AOCSU) to update the optimal coalition structure, which works well with network dynamics. We discuss the rationale and feasibility of guaranteeing cooperation among sensors by carefully designing the characteristic function and allocating cost based on the Shapley value. Finally, test-bed experiments and simulations are conducted, revealing that CGTCS outperforms other related works in terms of expenditure ratio, total traveling cost, and charging time. Optimizing Flow Bandwidth Consumption with Traffic-diminishing Middlebox Placement
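The Shapley-value cost allocation mentioned above can be made concrete with a generic (exponential-time) sketch; the toy characteristic function in the final comment is hypothetical, and the paper's actual characteristic function and allocation procedure are not reproduced here.

```python
import math
from itertools import permutations

def shapley_values(players, cost):
    """Average each player's marginal cost contribution over all join orders.
    `cost` maps a frozenset of players (a coalition) to that coalition's total cost."""
    phi = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            phi[p] += cost(coalition | {p}) - cost(coalition)
            coalition = coalition | {p}
    n_fact = math.factorial(len(players))
    return {p: v / n_fact for p, v in phi.items()}

# Toy example with a hypothetical charging-cost function:
#   shapley_values(["s1", "s2", "s3"], lambda S: 0 if not S else 10 + 2 * len(S))
```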
Optimizing Flow Bandwidth Consumption with Traffic-diminishing Middlebox Placement
Yang Chen, Jie Wu, and Bo Ji (Temple University) Abstract Video Link Slides Link With the evolution of Network Function Virtualization (NFV), the implementation of network services has shifted from dedicated hardware to software middleboxes. The placement of such middleboxes is complicated not only by the selection among multiple available hosting servers, but also by the traffic-changing effect of middleboxes. In this paper, we address the placement problem of a single type of traffic-diminishing middlebox, where the objective is to minimize the total bandwidth consumption when the total number of placed middleboxes is limited. We prove the NP-hardness of checking the feasibility of our problem in general topologies. Then we propose a greedy solution and prove that it is performance-guaranteed when it generates a feasible deployment. Next, we narrow our focus to tree-structured networks and propose an optimal dynamic-programming-based strategy. To improve time efficiency, we also introduce an efficient greedy solution built on an intuitive insight. Extensive simulations are conducted on a real-world dataset to evaluate the performance of our algorithms. Towards High-Efficiency Data Centers via Job-Aware Network Scheduling
Towards High-Efficiency Data Centers via Job-Aware Network Scheduling
Yang Shi, Mei Wen, and Chunyuan Zhang (National University of Defense Technology) Abstract Video Link Slides Link Distributed jobs typically face competition for multiple resources in modern data centers, especially for the network. Without effective network scheduling, this competition can cause low efficiency of the data center. Previous work on network scheduling has focused on reducing flow completion time or improving per-flow fairness. Yet, its effect on improving jobs' performance is limited by its unawareness of the relationships between communication and computation. In this paper, we focus on the problem of scheduling network resources for multiple jobs, with the specific objective of reducing the job completion time (JCT), which also makes the data center more efficient. With an in-depth investigation of communication and computation, we identify an opportunity to accelerate complicated, DAG-based modern jobs in a way that occupies less bandwidth. Accordingly, this paper proposes JIT, a job-aware network scheduler that leverages the computational graph to accelerate jobs effectively. To serve the goal of JIT, we first develop a mathematical model and formulate the scheduling problem as an integer linear programming (ILP) problem. We further prove, through rigorous theoretical analysis, that it has an equivalent linear programming (LP) problem, in order to solve the ILP problem efficiently. Some reasonable simplifications are also adopted to reduce the solving time of JIT to only 1 second. The proposed JIT is simulated and compared against several state-of-the-art designs, and the simulation results demonstrate that JIT can achieve an acceleration of up to 1.55x, which successfully improves the efficiency of the data center. Paper 2C: Systems for Machine Learning Zoom Meeting C Dingwen Tao DIESEL: A Dataset-Based Distributed Storage and Caching System for Large-Scale Deep Learning Training
DIESEL: A Dataset-Based Distributed Storage and Caching System for Large-Scale Deep Learning Training
Lipeng Wang (Hong Kong University of Science and Technology), Songgao Ye (SenseTime Research), Baichen Yang (Hong Kong University of Science and Technology), Youyou Lu (Tsinghua University), Hequan Zhang and Shengen Yan (SenseTime Research), and Qiong Luo (Hong Kong University of Science and Technology) Abstract Video Link Slides Link We observe three problems in existing storage and caching systems for deep-learning training (DLT) tasks: (1) accessing a dataset containing a large number of small files takes a long time, (2) global in-memory caching systems are vulnerable to node failures and slow to recover, and (3) repeatedly reading a dataset of files in shuffled orders is inefficient when the dataset is too large to be cached in memory. Therefore, we propose DIESEL, a dataset-based distributed storage and caching system for DLT tasks. Our approach is via a storage-caching system co-design. Firstly, since accessing small files is a metadata-intensive operation, DIESEL decouples the metadata processing from metadata storage, and introduces metadata snapshot mechanisms for each dataset. This approach speeds up metadata access significantly. Secondly, DIESEL deploys a task-grained distributed cache across the worker nodes of a DLT task. This way node failures are contained within each DLT task. Furthermore, the files are grouped into large chunks in storage, so the recovery time of the caching system is reduced greatly. Thirdly, DIESEL provides chunk-based shuffle so that the performance of random file access is improved without sacrificing training accuracy. Our experiments show that DIESEL achieves a linear speedup on metadata access, and outperforms an existing distributed caching system in both file caching and file reading. In real DLT tasks, DIESEL halves the data access time of an existing storage system, and reduces the training time by hours without changing any training code. E-LAS: Design and Analysis of Completion-Time Agnostic Scheduling for Distributed Deep Learning Cluster
E-LAS: Design and Analysis of Completion-Time Agnostic Scheduling for Distributed Deep Learning Cluster
Abeda Sultana and Li Chen (University of Louisiana at Lafayette), Fei Xu (East China Normal University), and Xu Yuan (University of Louisiana at Lafayette) Abstract Video Link Slides Link With the prosperity of deep learning, enterprises and large platform providers, such as Microsoft, Amazon, and Google, have built and provided GPU clusters to facilitate distributed deep learning training. As deep learning training workloads are heterogeneous, with a diverse range of characteristics and resource requirements, it becomes increasingly crucial to design an efficient and optimal scheduler for distributed deep learning jobs in the GPU cluster. This paper aims to propose a simple and yet effective scheduler, called E-LAS, with the objective of reducing the average training completion time of deep learning jobs. Without relying on the estimation or prior knowledge of job running times, E-LAS leverages the real-time epoch progress rate, unique to distributed deep learning training jobs, as well as the attained services from temporal and spatial domains, to guide the scheduling decisions. A theoretical analysis of E-LAS is conducted to offer a deeper understanding of the components of the scheduling criteria. Furthermore, we present a placement algorithm to achieve better resource utilization without involving much implementation overhead, complementary to the scheduling algorithm. Extensive simulations have been conducted, demonstrating that E-LAS improves the average job completion time (JCT) by 10x over an Apache YARN-based resource manager used in production. Moreover, E-LAS outperforms Tiresias, the state-of-the-art scheduling algorithm customized for deep learning jobs, by almost 1.5x for the average JCT as well as queuing time. ParSecureML: An Efficient Parallel Secure Machine Learning Framework on GPUs
ParSecureML: An Efficient Parallel Secure Machine Learning Framework on GPUs
Zheng Chen and Feng Zhang (Renmin University of China), Amelie Chi Zhou (Shenzhen University), Jidong Zhai (Tsinghua University), and Chenyang Zhang and Xiaoyong Du (Renmin University of China) Abstract Video Link Slides Link Machine learning has been widely used in our daily lives. Large amounts of data have been continuously produced and transmitted to the cloud for model training and data processing, which raises a problem: how to preserve the security of the data. Recently, a secure machine learning system named SecureML has been proposed to solve this issue using two-party computation. However, due to the excessive computation expenses of two-party computation, the secure machine learning is about 2x slower than the original machine learning methods. Previous work on secure machine learning mostly focused on novel protocols or improving accuracy, while the performance metric has been ignored. In this paper, we propose a GPU-based framework ParSecureML to improve the performance of secure machine learning algorithms based on two-party computation. The main challenges of developing ParSecureML lie in the complex computation patterns, frequent intra-node data transmission between CPU and GPU, and complicated inter-node data dependence. To handle these challenges, we propose a series of novel solutions, including profiling-guided adaptive GPU utilization, fine-grained double pipeline for intra-node CPU-GPU cooperation, and compressed transmission for inter-node communication. As far as we know, this is the first GPU-based secure machine learning framework. Compared to the state-of-the-art framework, ParSecureML achieves an average of 32.2x speedup. Paper 2:15pm-3:00pm
Poster Session Zoom Meeting C Paul Lu EPMA: Efficient Partial Message Access in IoT Era
EPMA: Efficient Partial Message Access in IoT Era
Hao Wu, Ziyue Jiang, Wei Liu, Yifan Gong, and Jiangming Jin (TuSimple) Abstract Video Link Slides Link With the development of communication technology and ubiquitous computing devices, the realization of internet of things (IoT) systems has become feasible for applications such as environment sensing, object tracking and system monitoring. As the number of sensors and devices increases, the data produced is vast. To efficiently process such amounts of data, data accesses play a crucial role. However, with cutting-edge industry middleware, such as ROS2, data access is inefficient for partial messages due to limitations of the data access interfaces, leading to significant waste of computation resources. This work proposes and implements a novel interface, named EPMA, to facilitate efficient partial message access. By evaluating an image persisting pipeline in an IoT system, the results show that EPMA lowers CPU usage by 87% compared to ROS2. Also, the EPMA interface is non-intrusive to the ROS2 ecosystem. Towards Parallelization of a Texture Description Algorithm for Breast Lesion Classification using OpenMP and CUDA
Towards Parallelization of a Texture Description Algorithm for Breast Lesion Classification using OpenMP and CUDA
Lisa Pal and Amilcar Meneses Viveros (Centro de Investigación y de Estudios Avanzados, Zacatenco) and Wilfrido Gómez Flores (Centro de Investigación y de Estudios Avanzados, Tamaulipas) Abstract Video Link Slides Link In this work, we improve the performance of a novel texture extraction method based on auto-mutual information (AMI) for classifying breast lesions. This particular algorithm was chosen since it was experimentally shown to outperform other commonly used texture classification methods. However, its time performance is also the lowest in comparison. The objective of this work is to make this method scalable and viable to apply to larger medical images, such as mammograms. To achieve this, we study the algorithm (which consists of ranklet transforms, with subsequent computation of AMI for each transformed image) and find the sections which can be executed in parallel. We parallelized this on the CPU using OpenMP and on the GPU using CUDA. The dataset used was composed of four sets of square images containing the region of interest (ROI) obtained from mammograms. The ROIs are of size 128, 256, 512, and 1024 pixels per side. Each of the four sets contained 100 images of the corresponding sizes. With OpenMP, a speedup of 8.37-10.27 was achieved using one computing node containing 12 Xeon processors. The CUDA implementation of the algorithm is currently in progress, but a performance much higher than OpenMP is expected. We expect the ranklet transform to be reduced from O(M^2 r^2 log r^2) to O(r^2 log r^2) and the mutual information to be reduced from the time required for 3n_r2(M-1)(M^2 log M^2) operations to that of n_r(M^2 log M^2), where M is the side of a square image, n_r is the number of window resolutions, and r is the window resolution. Jeor: Accelerate Linear Algebra Operation in SSDs
Jeor: Accelerate Linear Algebra Operation in SSDs
Xiaorui Wu and Hong Xu (City University of Hong Kong) and Yi Wang (Peng Cheng Laboratory) Abstract Slides Link GPUs are widely used as powerful computing devices in machine learning systems. However, current GPU memory is limited (∼10-32GB), which is much smaller than CPU memory (up to TBs). Supporting large matrix operations within limited GPU memory can enable data analysis pipelines with large datasets. We propose a library called Jeor to accelerate the execution of large matrix operations in SSDs within limited GPU memory. We also propose a pluggable scheduling policy manager, data reuse, and buffer reuse to help Jeor achieve high execution efficiency. We implement Jeor on top of existing computing libraries, such as cuBLAS. Testbed experiments show that Jeor achieves better GPU memory utilization than cublasXT while still achieving a 1.34x speedup. Saec: Similarity-Aware Embedding Compression in Recommendation Systems
Saec: Similarity-Aware Embedding Compression in Recommendation Systems
Xiaorui Wu and Hong Xu (City University of Hong Kong) and Honglin Zhang, Huaming Chen, and Jian Wang (Tencent) Abstract Slides Link Embedding vectors are commonly used to represent features of users and items in recommender systems. A practical challenge is that a large number of embedding vectors incurs a substantial memory footprint for serving queries, especially as the number of features grows. We propose an embedding compression framework called Saec to address this challenge. Saec exploits the similarity among features within a field, as they represent the same attribute of a user or item, and uses clustering to compress the embeddings. We propose a new fast clustering method that relies on the empirical heavy-tailed nature of features to drastically reduce the clustering overhead. We implement Saec on a production system and evaluate it with 10 days' worth of feature data from a large Internet company. Testbed experiments show that Saec consistently reduces the number of embedding vectors by two orders of magnitude without any performance degradation, with a compression rate of up to 27x. Poster
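The core compression step (clustering the embeddings within a field and keeping one centroid per cluster) can be sketched as follows; plain k-means is used here for clarity, whereas Saec's fast, heavy-tail-aware clustering and its choice of cluster count are not reproduced.

```python
import numpy as np
from sklearn.cluster import KMeans  # standard k-means, used only for illustration

def compress_field(embeddings: np.ndarray, n_clusters: int):
    """embeddings: (num_features_in_field, dim). Returns the centroid table and
    the per-feature centroid assignment, which together replace the full table."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(embeddings)
    return km.cluster_centers_, km.labels_

def lookup(centroids: np.ndarray, assignment: np.ndarray, feature_id: int):
    # Serving path: one extra indirection instead of storing every embedding row.
    return centroids[assignment[feature_id]]
```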
Wednesday, August 19th
11:00am-11:40am Keynote-2: Michael Schulte, AMD
Zoom Meeting A
Lizy John
Challenges and Opportunities for Extreme-Scale Computing
Challenges and Opportunities for Extreme-Scale Computing
Michael Schulte (AMD) Abstract Steady increases in computer performance, power efficiency, programmability, and scalability are needed to enable key discoveries in diverse fields ranging from medical science to astrophysics and climate modeling. In this talk, I will discuss major challenges in the design of extreme-scale computing systems, give an overview of emerging architectures and technologies to overcome these challenges, and describe important research in this area. I will also discuss Grand Challenge Problems for extreme-scale systems and how the combination of scientific computing, artificial intelligence, and data analytics can enable scientists to solve problems that previously were intractable. Keynote 11:50am-12:20pm 3A: Graph Processing and Concurrent Data Structures Zoom Meeting A Michela Becchi Graffix: Efficient Graph Processing with a Tinge of GPU-Specific Approximations
Graffix: Efficient Graph Processing with a Tinge of GPU-Specific Approximations
Somesh Singh and Rupesh Nasre (Indian Institute of Technology Madras) Abstract Video Link Slides Link Parallelizing graph algorithms on GPUs is challenging due to the irregular memory accesses involved in graph traversals. In particular, three important GPU-specific aspects affect performance: memory coalescing, memory latency, and thread divergence. In this work, we attempt to tame these challenges using approximate computing. We target graph applications on GPUs that can tolerate some degradation in the quality of the output for obtaining the result in short order. We propose three techniques for boosting the performance of graph processing on the GPU by injecting approximations in a controlled manner. The first one creates a graph isomorph that brings relevant nodes nearby in memory and adds controlled replicas of nodes to improve coalescing. The second reduces memory latency by adding edges among specific nodes and processing well-connected subgraphs iteratively inside shared memory. The third technique normalizes degrees across nodes assigned to a warp to reduce thread divergence. Each of the techniques offers notable performance benefits and provides a knob to control the amount of inaccuracy added to an execution. We demonstrate the effectiveness of the proposed techniques using a suite of five large graphs with varied characteristics and five popular graph algorithms. Optimizing Linearizable Bulk Operations on Data Structures
Optimizing Linearizable Bulk Operations on Data Structures
Matthew A. Rodriguez and Michael F. Spear (Lehigh University) Abstract Video Link Slides Link We study the problem of ensuring the correctness of concurrent programs that perform mutating foreach and range operations over concurrent data structures. We introduce three algorithms which vary in the location and the granularity of concurrency control metadata. Our algorithms make the linearization of bulk operations visible to concurrent elemental operations, which enables them to scale well, keep overhead low, and operate within tight memory bounds. In our experimental evaluation, we demonstrate that our techniques do not hinder the performance of elemental operations in elemental-only workloads, and allow scalability among concurrent mutating bulk operations. Furthermore, in mixed workloads, our algorithms outperform the baseline, sometimes by an order of magnitude or more. GraBi: Communication-Efficient and Workload-Balanced Partitioning for Bipartite Graphs
GraBi: Communication-Efficient and Workload-Balanced Partitioning for Bipartite Graphs
Feng Sheng and Qiang Cao (Huazhong University of Science and Technology, Wuhan National Laboratory for Optoelectronics); Hong Jiang (University of Texas at Arlington, Department of Computer Science and Engineering); and Jie Yao (Huazhong University of Science and Technology, School of Computer Science and Technology) Abstract Video Link Slides Link Machine Learning and Data Mining (MLDM) applications generally represent their input data in bipartite graphs with two disjoint vertex-subsets connected only by edges between them. Despite the prevalence of bipartite graphs, existing graph partitioning frameworks have rarely sufficiently exploited their unique structures, especially the highly lopsided subset sizes and extremely skewed vertex degrees. As a result of poor partitioning quality, high communication cost and severe workload imbalance arise during subsequent computation over these bipartite graphs in distributed environments, significantly hampering the performance of MLDM applications. Paper 3B: Large-Scale Applications on Supercomputers Zoom Meeting B Kamesh Madduri Large-scale Simulations of Peridynamics on Sunway Taihulight Supercomputer
Large-scale Simulations of Peridynamics on Sunway Taihulight Supercomputer
Xinyuan Li (Computer Network Information Center, Chinese Academy of Science; University of Chinese Academy of Science) and Huang Ye and Jian Zhang (Computer Network Information Center, Chinese Academy of Science) Abstract Video Link Slides Link Peridynamics (PD) methods are good at describing solid mechanical behaviours and are superior at simulating discontinuous problems. They can be applied to many fields, such as materials science, human health, and industrial manufacturing, which motivates us to provide their efficient numerical simulations on the Sunway TaihuLight supercomputer. However, the massive and complex calculations of PD simulations and the characteristics of Sunway TaihuLight bring challenges to efficient parallel PD simulation. In this paper, we present a series of performance optimization techniques to perform a large-scale parallel PD simulation application on Sunway TaihuLight. We first design data grouping and SPM-based caching to increase data transmission bandwidth and reduce main memory access time. Further, we design and implement vectorization and instruction-level optimization for PD applications to improve computational performance. Finally, we offer strategies for overlapping data transmission and computation so that data transmission can be hidden by computation. Our work in a core group improves the performance of the serial version on the SW26010 processor by 181 times. Compared to the serial and single-CPU Peridigm-based simulations on an Intel Xeon E5-2680 V3, our work gets a speedup of 60 times and 6 times, respectively. Near-linear scalability is also obtained. When testing weak scaling, the simulation of a 296,222,720-point example achieves 1.14 PFLOPS with 8192 processes (532,480 cores). When testing strong scaling, 90% parallel efficiency is observed as the number of processes increases 64 times to 4096 processes. Toward Large-Scale Image Segmentation on Summit
Toward Large-Scale Image Segmentation on Summit
Sudip Seal (Oak Ridge National Laboratory) and Seung-Hwan Lim, Dali Wang, Jacob Hinkle, Dalton Lunga, and Aristeidis Tsaris (ORNL) Abstract Video Link Slides Link Semantic segmentation of images is an important computer vision task that emerges in a variety of application domains such as medical imaging, robotic vision and autonomous vehicles, to name a few. While these domain-specific image analysis tasks involve relatively small image sizes (∼ 100 × 100), there are many applications that need to train machine learning models on image data with extents that are orders of magnitude larger (∼ 10000 × 10000). Training deep neural network (DNN) models on large-extent images is extremely memory-intensive and often exceeds the memory limitations of a single graphical processing unit, a hardware accelerator of choice for computer vision workloads. Here, an efficient, sample-parallel approach to train U-Net models on large-extent image datasets is presented. Its advantages and limitations are analyzed, and near-linear strong-scaling speedup is demonstrated on 256 nodes (1536 GPUs) of the Summit supercomputer. Using a single node of the Summit supercomputer, an early evaluation of a recently released model parallel framework called GPipe is demonstrated to deliver ∼ 2X speedup in executing a U-Net model with an order of magnitude larger number of trainable parameters than reported before. Performance bottlenecks for pipelined training of U-Net models are identified and mitigation strategies to improve the speedups are discussed. Together, these results open up the possibility of combining both approaches into a unified scalable pipelined and data parallel algorithm to efficiently train U-Net models with very large receptive fields on datasets of ultra-large extent images. SWMapper: Scalable Read Mapper on SunWay TaihuLight
SWMapper: Scalable Read Mapper on SunWay TaihuLight
Kai Xu, Xiaohui Duan, Xiangxu Meng, and Xin Li (Shandong University); Bertil Schmidt (Johannes Gutenberg University); and Weiguo Liu (Shandong University) Abstract Video Link Slides Link With the rapid development of next-generation sequencing (NGS) technologies, high throughput sequencing platforms continuously produce large amounts of short read DNA data at low cost. Read mapping is a performance-critical task, being one of the first stages required for many different types of NGS analysis pipelines. We present SWMapper, a scalable and efficient read mapper for the Sunway TaihuLight supercomputer. A number of optimization techniques are proposed to achieve high performance on its heterogeneous architecture, centered around a memory-efficient succinct hash index data structure and including seed filtration, duplicate removal, dynamic scheduling, asynchronous data transfer, and overlapping of I/O and computation. Furthermore, a vectorized version of the banded Myers algorithm for pairwise alignment with 256-bit vector registers is presented to fully exploit the computational power of the SW26010 processor. Our performance evaluation shows that SWMapper using all 4 compute groups of a single Sunway TaihuLight node outperforms S-Aligner on the same hardware by a factor of 6.2. In addition, compared to the state-of-the-art CPU-based mappers RazerS3, BitMapper2, and Hobbes3 running on a 4-core Xeon W-2123v3 CPU, SWMapper achieves speedups of 26.5, 7.8, and 2.6, respectively. Our optimizations achieve an aggregated speedup of 11 compared to the naïve implementation on one compute group of an SW26010 processor, as well as a strong-scaling efficiency of 74% on 128 compute groups. Paper 3C: Machine Learning for Computing Zoom Meeting C Eunjung Park An Online Learning-Based Task Offloading Framework for 5G Small Cell Networks
An Online Learning-Based Task Offloading Framework for 5G Small Cell Networks
Xueying Zhang (Wuhan University); Ruiting Zhou (Wuhan University, Chinese University of Hong Kong); Zhi Zhou (Sun Yat-sen University); John C.S. Lui (Chinese University of Hong Kong); and Zongpeng Li (Huawei, Wuhan University) Abstract Video Link Slides Link Small cells are deployed in 5G to complement the macro cell network for improving coverage and capacity. Small cells and edge computing are natural partners, which can benefit users' experience. Small cell nodes (SCNs) equipped with edge servers can support emerging computing services, such as virtual reality, which impose low-latency and precise contextual requirements. With the proliferation of wireless devices, there is an increasing demand for offloading tasks to SCNs. Given limited computation and communication resources, the fundamental problem for a small cell network is how to select computing tasks to maximize effective rewards in an uncertain and stochastic environment. To this end, we propose an online learning framework, LFSC, which has a performance guarantee to guide task offloading in a small cell network. LFSC balances between reward and constraint violations, and it consists of three subroutines: i) a randomized algorithm which calculates the selection probability of each task based on task weights; ii) a greedy assignment algorithm which cooperatively allocates tasks among different SCNs based on the selection probability; iii) an update algorithm which exploits the multi-armed bandit (MAB) technique to update task weights according to the feedback. Theoretical analysis shows that both the regret and the violations produced by LFSC are sub-linear. Extensive simulation studies based on real-world data confirm that LFSC achieves a close-to-optimal reward with low violations, and outperforms many state-of-the-art algorithms. A Reinforcement Learning Based System for Minimizing Cloud Storage Service Cost
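To make the weights, selection probabilities, and bandit-feedback loop concrete, here is a generic EXP3-style update; it is not LFSC's randomized, greedy-assignment, and update subroutines, and the exploration constant and reward range are assumptions.

```python
import math
import random

def select_and_update(weights, rewards, gamma=0.1):
    """weights: dict task -> weight; rewards: callable(task) -> observed reward in [0, 1]."""
    total = sum(weights.values())
    k = len(weights)
    # 1) randomized selection probabilities: weights mixed with uniform exploration
    probs = {t: (1 - gamma) * w / total + gamma / k for t, w in weights.items()}
    # 2) sample one task according to these probabilities
    r, acc, chosen = random.random(), 0.0, None
    for t, p in probs.items():
        acc += p
        if r <= acc:
            chosen = t
            break
    chosen = chosen if chosen is not None else next(iter(probs))
    # 3) importance-weighted reward estimate and exponential weight update
    x_hat = rewards(chosen) / probs[chosen]
    weights[chosen] *= math.exp(gamma * x_hat / k)
    return chosen, probs
```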
A Reinforcement Learning Based System for Minimizing Cloud Storage Service Cost
Haoyu Wang, Haiying Shen, Qi Liu, and Kevin Zheng (University of Virginia) and Jie Xu (George Mason University) Abstract Video Link Slides Link Currently, many web applications are deployed on cloud storage services provided by cloud service providers (CSPs). A CSP offers different types of storage, including hot, cold and archive storage, and sets unit prices for these different types, which vary substantially. By properly assigning the data files of a web application to different types of storage based on their usage profiles and the CSP's pricing policy, a cloud customer can potentially achieve substantial cost savings and minimize the payment to the CSP. However, no previous research handles this problem. Towards this goal, we present a Markov Decision Process formulation of the cost minimization problem, and then develop a reinforcement learning based approach to effectively solve it, which changes the type of storage of each data file periodically to minimize monetary cost in the long term. We then propose a method to aggregate concurrently requested data files to further reduce the cloud storage service payment for a web application. Our experiments with Wikipedia traces show the effectiveness of the proposed methods for minimizing cloud customer cost in comparison with other methods. Deep Reinforcement Learning based Elasticity-compatible Heterogeneous Resource Management for Time-critical Computing
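A toy tabular version of the periodic storage-type decision can be sketched as follows; the state definition (a discretized access-count bucket), the reward (negative billed cost), and all constants are illustrative assumptions, not the paper's MDP formulation or its learning algorithm.

```python
import random
from collections import defaultdict

TIERS = ["hot", "cold", "archive"]

def train(env_step, episodes=1000, alpha=0.1, gamma=0.9, eps=0.1):
    """env_step(state, tier) -> (cost, next_state); cost is the bill for the period."""
    Q = defaultdict(lambda: {t: 0.0 for t in TIERS})
    state = "bucket_0"                      # e.g. a discretized recent-access count
    for _ in range(episodes):
        # epsilon-greedy choice of storage type for the current period
        tier = (random.choice(TIERS) if random.random() < eps
                else max(Q[state], key=Q[state].get))
        cost, next_state = env_step(state, tier)
        reward = -cost                      # minimizing payment == maximizing -cost
        best_next = max(Q[next_state].values())
        Q[state][tier] += alpha * (reward + gamma * best_next - Q[state][tier])
        state = next_state
    return Q
```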
Deep Reinforcement Learning based Elasticity-compatible Heterogeneous Resource Management for Time-critical Computing
Zixia Liu and Liqiang Wang (Department of Computer Science, University of Central Florida) and Gang Quan (Electrical and Computer Engineering Department, Florida International University) Abstract Video Link Slides Link Rapidly generated data and the sheer volume of data-analytical jobs put great pressure on the underlying computing facilities. A distributed multi-cluster computing environment such as a hybrid cloud consequently becomes necessary, owing to its advantages in accommodating geographically distributed and potentially cloud-based computing resources. Different clusters forming such an environment could be heterogeneous and may be resource-elastic as well. From an analytical perspective, in accordance with increasing needs on streaming applications and timely analytical demands, many data analytical jobs nowadays are time-critical in terms of their temporal urgency. Moreover, the overall workload of the computing environment can be hybrid, containing both time-critical and general applications. These all call for an efficient resource management approach capable of apprehending both computing-environment and application features. Paper 12:30pm-1:00pm4A: Performance Tools and Methodology Zoom Meeting A Tanzima Islam Generating Robust Parallel Programs via Model Driven Prediction of Compiler Optimizations for Non-determinism
Generating Robust Parallel Programs via Model Driven Prediction of Compiler Optimizations for Non-determinism
Girish Mururu (Georgia Institute of Technology); Kaushik Ravichandran (Georgia Institute of Technology, Facebook); and Ada Gavrilovska and Santosh Pande (Georgia Institute of Technology) Abstract Video Link Slides Link Execution orders in parallel programs are governed by non-determinism and can vary substantially across different executions even on the same input. Thus, a highly non-deterministic program can exhibit rare execution orders never observed during testing. It is desirable to reduce non-determinism to suppress corner case behavior in the production cycle (making the execution robust or bug-free) and increase non-determinism for reproducing bugs in the development cycle. Performance-wise, different optimization levels (e.g., from O0 to O3) are enabled during development; however, non-determinism-wise, developers have no way to select the right compiler optimization level in order to increase non-determinism for debugging or to decrease it for robustness. Memory-Centric Communication Mechanism for Real-time Autonomous Navigation Applications
Memory-Centric Communication Mechanism for Real-time Autonomous Navigation Applications
Wei Liu, Yifan Gong, and Hao Wu (Tusimple); Jidong Zhai (Tsinghua University, BNRist); and Jiangming Jin (Tusimple) Abstract Video Link Slides Link There has been a remarkable increase in the speed of AI development over the past few years. Artificial intelligence and deep learning techniques are blooming and expanding in all forms to every sector possible. With the emerging intelligent autonomous navigation systems, both memory allocation and data movement are becoming the main bottleneck in the inter-process communication procedure, especially in supporting various types of messages between multiple programming languages. To reduce significant memory allocation and data movement cost, we propose a novel memory-centric mechanism, which includes a virtual layer based architecture and a pre-record memory allocation algorithm. Furthermore, we have implemented a memory-centric communication framework called Z-framework based on our proposed techniques to achieve highly efficient IPC procedures in autonomous navigation systems. Experimental results show that Z-framework is able to gain up to 41% and 35% performance improvement compared with the approach used in ROS2, an industry standard, and the state-of-the-art approach used in CyberRT, respectively. Automatic Identification and Precise Attribution of DRAM Bandwidth Contention
Automatic Identification and Precise Attribution of DRAM Bandwidth Contention
Christian Helm and Kenjiro Taura (The University of Tokyo) Abstract Video Link Slides Link The limited DRAM bandwidth of today's computing systems is a bottleneck for many applications. But the identification of DRAM bandwidth contention in applications is difficult. The measured bandwidth consumption of an application can not identify bandwidth contention. In theory, NUMA systems can provide higher memory bandwidth. But applications often make poor use of the resources. To address these challenges, we introduce a novel method to identify DRAM bandwidth contention and bad usage of NUMA resources. It consists of metrics to judge the severity of bandwidth contention and the degree of imbalanced resource usage in NUMA systems. Our tool can automatically scan an application for DRAM contention problems. Together with the precise location of the origin, intuitive optimization guidance is given. This approach for finding DRAM contention is based on the memory access latency. By comparing the experienced latency of an application with the uncontended hardware latency, we can find the contention. Hardware instruction sampling enables precise identification of the origin. It also provides information about accessed memories, which we use to calculate a NUMA imbalance metric. In a detailed evaluation with several micro-benchmarks, we show that our new method does indeed quantify the severity of DRAM contention beyond the possibilities of simple bandwidth consumption measurement and existing tools. We also apply our approach to real applications and confirm that it gives useful optimization advice to users. Paper 4B: Storage Reliability & Memory Security Zoom Meeting B Radu Teodorescu An Adaptive Erasure-Coded Storage Scheme with an Efficient Code-Switching Algorithm
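The core idea behind this contention analysis, comparing observed memory access latency against an uncontended baseline and checking how evenly traffic spreads across NUMA nodes, can be sketched as follows. This is illustrative only: the paper's exact metrics and hardware-sampling pipeline are not shown, and the example numbers are made up.

def contention_severity(sampled_latencies_ns, uncontended_latency_ns):
    # Ratio of observed to uncontended DRAM latency; values above 1 suggest contention.
    avg = sum(sampled_latencies_ns) / len(sampled_latencies_ns)
    return avg / uncontended_latency_ns

def numa_imbalance(bytes_per_node):
    # 1.0 means memory traffic is perfectly balanced across NUMA nodes.
    peak = max(bytes_per_node)
    mean = sum(bytes_per_node) / len(bytes_per_node)
    return mean / peak

# Example: samples averaging ~180 ns against a 90 ns uncontended baseline give severity ~2.0.
print(contention_severity([150, 200, 190], 90), numa_imbalance([10e9, 2e9]))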
An Adaptive Erasure-Coded Storage Scheme with an Efficient Code-Switching Algorithm
Zizhong Wang, Haixia Wang, and Airan Shao (Tsinghua University) and Dongsheng Wang (Tsinghua University, Peng Cheng Laboratory) Abstract Video Link Slides Link Many distributed storage systems use erasure codes rather than replication for higher reliability at significantly lower storage costs. However, using traditional erasure codes increases consumption of network traffic and disk I/O tremendously when systems recover data, resulting in high latency of degraded reads. In order to mitigate this problem, we present an adaptive storage scheme based on data access skew, i.e., the fact that most data accesses target a small fraction of the data. In this scheme, we use both a Local Reconstruction Code (LRC) to store frequently accessed data, and a Hitchhiker (HH) code to store infrequently accessed data. Besides, an efficient switching algorithm between LRC and HH code with low network and computation costs is provided. The whole system will benefit from low degraded read latency while keeping a low storage overhead, and code-switching will not become a bottleneck. Experimental evaluation shows that this adaptive storage scheme's performance was in line with expectations. First Time Miss: Low Overhead Mitigation For Shared Memory Cache Side Channels
First Time Miss: Low Overhead Mitigation For Shared Memory Cache Side Channels
Kartik Ramkrishnan, Antonia Zhai, Stephen McCamant, and Pen Yew (University of Minnesota) Abstract Video Link Slides Link Cache hits are an important source of information leakage in cache side channel attacks. An attacker observes a much faster cache access time if the cache line had previously been filled in by the victim, and a much slower memory access time if the victim had not accessed this cache line, thus revealing to the attacker whether the victim had accessed the cache line or not. A Rack-aware Pipeline Repair Scheme for Erasure-coded Distributed Storage Systems
A Rack-aware Pipeline Repair Scheme for Erasure-coded Distributed Storage Systems
Tong Liu, Shakeel Alibhai, and Xubin He (Temple University) Abstract Video Link Slides Link Nowadays, modern industry data centers have employed erasure codes to provide reliability for large amounts of data at a low cost. Although erasure codes provide optimal storage efficiency, they suffer from high repair costs compared to traditional three-way replication: when a data miss occurs in a data center, erasure codes would require high disk usage and network bandwidth consumption across nodes and racks to repair the failed data. In this paper, we propose RPR, a rack-aware pipeline repair scheme for erasure-coded distributed storage systems. RPR is the first to investigate rack-level insights, exploring the connection between the node level and the rack level to help improve repair performance when single or multiple failures occur in a data center. The evaluation results on several common RS code configurations show that, for single-block failures, our RPR scheme reduces the total repair time by up to 81.5% compared to the traditional RS code repair method and 50.2% compared to the state-of-the-art CAR algorithm. For multi-block failures, RPR reduces the total repair time and cross-rack data transfer traffic by up to 64.5% and 50%, respectively, over the traditional RS code repair. Paper 4C: Supporting Efficient Machine Learning Zoom Meeting C Seyong Lee Extremely Low-bit Convolution Optimization for Quantized Neural Network on Modern Computer Architectures
Extremely Low-bit Convolution Optimization for Quantized Neural Network on Modern Computer Architectures
Qingchang Han (Beihang University, SenseTime Research); Yongmin Hu (Beihang University); Fengwei Yu (SenseTime Research); Hailong Yang (Beihang University); Bing Liu (SenseTime Research); Peng Hu (SenseTime Research, Beihang University); Ruihao Gong (Beihang University, SenseTime Research); Yanfei Wang (SenseTime Research); and Rui Wang, Zhongzhi Luan, and Depei Qian (Beihang University) Abstract Video Link Slides Link With the continuous demand for higher accuracy of deep neural networks, the model size has been increasing significantly. Quantization is one of the most widely used model compression methods, which can effectively reduce the model size without severe accuracy loss. Modern processors such as ARM CPU and NVIDIA GPU have already provided the support of low-bit arithmetic instructions. However, efficient and practical optimizations for extremely low-bit convolution (e.g., 2~8-bit on ARM CPU and 4-bit/8-bit on NVIDIA GPU) are still lacking. This paper explores the performance optimization methods of extremely low-bit convolution on diverse architectures. On ARM CPU, we propose two instruction schemes for 2~3-bit and 4~8-bit convolution with corresponding register allocation methods. In addition, we re-design the GEMM computation with data padding and packing optimizations. We also implement the Winograd algorithm for convolution with specific bit widths (e.g., 4~6-bit) to achieve higher performance. On NVIDIA GPU, we propose a data partition mechanism and multi-level memory access optimizations, to better adapt the computation to GPU thread and memory hierarchy. We also propose quantization fusion to eliminate unnecessary data access. The experimental results demonstrate that our implementations achieve better performance for extremely low-bit convolution than state-of-the-art frameworks and libraries such as ncnn and cuDNN. To the best of our knowledge, this is the first work that provides efficient implementations of extremely low-bit convolutions covering 2~8-bit on ARM CPU and 4-bit/8-bit on NVIDIA GPU. Vector Forward Mode Automatic Differentiation on SIMD/SIMT architectures
Vector Forward Mode Automatic Differentiation on SIMD/SIMT architectures
Jan Hueckelheim, Michel Schanen, Sri Hari Krishna Narayanan, and Paul Hovland (Argonne National Laboratory) Abstract Video Link Slides Link Automatic differentiation, back-propagation, differentiable programming and related methods have received widespread attention, due to their ability to compute accurate gradients of numerical programs for optimization, uncertainty quantification, and machine learning. Two strategies are commonly used: the forward mode, which is easy to implement but has an overhead, relative to the original program, that grows linearly with the number of inputs; and the reverse mode, which can compute gradients for an arbitrary number of program inputs with a constant-factor overhead, although the constant can be large, more memory is required, and the implementation is often challenging. Previous literature has shown that the forward mode can be more easily parallelized and vectorized than the reverse mode, but case studies investigating when either mode is the best choice are lacking, especially for modern CPUs and GPUs. In this paper, we demonstrate that the forward mode can outperform the reverse mode for programs with tens or hundreds of directional derivatives, a number that may yet increase if current hardware trends continue. Delta-DNN: Efficiently Compressing Deep Neural Networks via Exploiting Floats Similarity
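Vector forward mode can be illustrated with a small dual-number class whose derivative part is a vector of directional derivatives, so a single evaluation of the program propagates several seed directions at once. This is a generic sketch, not the authors' implementation; only addition and multiplication are overloaded here.

import numpy as np

class Dual:
    # Primal value plus a vector of directional derivatives (vector forward mode).
    def __init__(self, val, dot):
        self.val = val
        self.dot = np.asarray(dot, dtype=float)

    def __add__(self, other):
        other = _lift(other, len(self.dot))
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other, len(self.dot))
        return Dual(self.val * other.val,
                    self.val * other.dot + other.val * self.dot)
    __rmul__ = __mul__

def _lift(x, n):
    # Treat plain numbers as constants with zero derivative.
    return x if isinstance(x, Dual) else Dual(x, np.zeros(n))

def value_and_grad(f, xs):
    # One evaluation with n-wide duals yields the value and all n partial derivatives.
    n = len(xs)
    duals = [Dual(x, np.eye(n)[i]) for i, x in enumerate(xs)]
    out = f(*duals)
    return out.val, out.dot

# Example: f(x, y) = x*y + x  ->  value 9.0, gradient [3.0, 3.0] at (3, 2)
val, g = value_and_grad(lambda x, y: x * y + x, [3.0, 2.0])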
Delta-DNN: Efficiently Compressing Deep Neural Networks via Exploiting Floats Similarity
Zhenbo Hu, Xiangyu Zou, and Wen Xia (Harbin Institute of Technology, Shenzhen); Sian Jin (University of Alabama); Dingwen Tao (The University of Alabama, Tuscaloosa, AL, USA); and Yang Liu, Weizhe Zhang, and Zheng Zhang (Harbin Institute of Technology, Shenzhen) Abstract Video Link Slides Link Deep neural networks (DNNs) have gained considerable attention in various real-world applications due to the strong performance on representation learning. However, a DNN needs to be trained many epochs for pursuing a higher inference accuracy, which requires storing sequential versions of DNNs and releasing the updated versions to users. As a result, large amounts of storage and network resources are required, which significantly hamper DNN utilization on resource-constrained platforms (e.g., IoT, mobile phone, etc.). Paper 1:10pm-1:40pm5A: Data Center Networking Zoom Meeting A Scott Atchley AMRT: Anti-ECN Marking to Improve Utilization of Receiver-driven Transmission in Data Center
AMRT: Anti-ECN Marking to Improve Utilization of Receiver-driven Transmission in Data Center
Jinbin Hu, Jiawei Huang, Zhaoyi Li, and Jianxin Wang (Central South University) and Tian He (University of Minnesota) Abstract Video Link Slides Link Cloud applications generate a variety of workloads ranging from delay-sensitive flows to bandwidth-hungry ones in data centers. Existing reactive or proactive congestion control protocols are hard to simultaneously achieve ultra-low latency and high link utilization across all workloads in data center networks. We present a new receiver-driven transport scheme using anti-ECN (Explicit Congestion Notification) marking to achieve both near-zero queueing delay and full link utilization by reasonably increasing sending rate in the case of under-utilization. Specifically, switches mark the ECN bit of data packets once detecting spare bandwidth. When receiving the anti-ECN marked packet, the receiver generates the corresponding marked grant to trigger more data packets. The experimental results of small-scale testbed implementation and large-scale NS2 simulation show that AMRT effectively reduces the average flow completion time (AFCT) by up to 40.8% and improves the link utilization by up to 36.8% under high workload over the state-of-the-art receiver-driven transmission schemes. PS : Periodic Strategy for the 40-100Gbps Energy Efficient Ethernet
PS: Periodic Strategy for the 40-100Gbps Energy Efficient Ethernet
Wanchun Jiang, Kaiqin Liao, Yulong Yan, and Jianxin Wang (School of Computer Science and Engineering, Central South University) Abstract Video Link Slides Link The 40-100Gbps Energy Efficient Ethernet (EEE) standardized by IEEE 802.3bj contains not only the DeepSleep mode, which is 10% of the normal power consumption, but also the FastWake mode, which has much less state transition time and 70% of the normal power consumption. Correspondingly, the strategy of determining the state transitions is crucial to both the power consumption and the incurred latency of frames. As EEE applications always place a tail latency constraint on frame transmission, existing strategies for the 40-100Gbps EEE are rarely configured with proper parameters for good power saving under the given tail latency constraint, especially in face of the variable traffic load. To address this problem, we design the Periodic Strategy (PS) for the 40-100Gbps EEE. Specifically, PS works periodically and optimizes the power saving independently in each cycle. In this way, PS decouples the latency constraint and energy saving. 1) It limits the largest incurred latency by the length of the cycle to satisfy the given latency constraint. 2) It achieves good power saving by selecting the proper low power state based on the prediction of incoming frames and entering the selected state just once in each cycle. What's more, the good performance of PS is insensitive to the traffic pattern. Extensive simulations confirm that PS achieves better energy saving than all the existing strategies under the given latency constraint, under several kinds of different traffic patterns. Polo: Receiver-Driven Congestion Control for Low Latency over Commodity Network Fabric
Polo: Receiver-Driven Congestion Control for Low Latency over Commodity Network Fabric
Chang Ruan, Jianxin Wang, and Wanchun Jiang (Central South University) and Tao Zhang (Changsha University) Abstract Video Link Slides Link Recently, numerous novel transport protocols have been proposed for the low latency of applications deployed in data center networks, e.g., web search and retail recommendation systems. The state-of-the-art receiver-driven protocols, e.g., Homa and NDP, show superior performance in achieving the lowest possible latency. However, Homa assumes that the core layer in the data center network has no congestion, which limits its application for the existing over-subscribed networks. NDP requires modifications to the switch hardware, since it trims packets to headers when the packets cause the switch buffer to overflow, resulting in high deployment cost. In this paper, we present Polo to realize low latency for flows over commodity network fabric, relying on Explicit Congestion Notification (ECN) and priority queues. Based on packets with ECN marking, the Polo receiver obtains congestion information and dynamically adjusts the number of data packets in the network to maintain a small switch queue. The adjustment is carried out periodically. The time interval is determined by keeping an extra high-priority packet always in flight rather than by a fine-grained timer. Further, Polo designs packet recovery mechanisms to retransmit lost packets as soon as possible. Simulation results show that Polo outperforms the state-of-the-art receiver-driven protocols in a wide range of scenarios including incast. Paper 5B: Parallel Algorithms I Zoom Meeting B Grey Ballard Prune the Unnecessary: Parallel Pull-Push Louvain Algorithms with Automatic Edge Pruning
Prune the Unnecessary: Parallel Pull-Push Louvain Algorithms with Automatic Edge Pruning
Jesmin Jahan Tithi, Andrzej Stasiak, Sriram Aananthakrishnan, and Fabrizio Petrini (Intel) Abstract Video Link Slides Link Community detection algorithms try to identify the underlying community structure (i.e., clearly distinguishable closely interacting groups of vertices) in a given graph representing complex systems such as social networks, protein-protein interaction networks, and the World-Wide-Web. The Louvain algorithm iteratively moves vertices from one community to another to construct disjoint sets of vertices to form communities such that the vertices within the same community have more edges within themselves compared to their connections to the vertices outside the community. A property of the Louvain algorithm is that the number of vertex moves drops significantly just after the first few iterations as the community-membership also stabilizes quickly. In this paper, we present a parallel pull-and-push Louvain algorithm that exploits this property to prune unnecessary edge explorations without sacrificing the quality of the solution. We present a collection of parallel Louvain algorithms that prune many edges and vertices, speeding up convergence by an order of magnitude over the previously best-known implementation of Louvain algorithm from Grappolo, while producing similar or better results. Fast Spectral Graph Layout on Multicore Platforms
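The pruning idea exploited here, revisiting only vertices whose neighborhoods were affected by a recent community move, can be sketched with a simple sequential worklist. This is a hedged illustration of the general technique rather than the paper's parallel pull-push algorithms; graph.nodes(), graph.neighbors(), and modularity_gain are assumed to be supplied by the caller (e.g., a networkx-like graph and a precomputed gain function).

from collections import deque

def pruned_louvain_pass(graph, community, modularity_gain):
    # One local-move pass that only revisits vertices whose neighborhood changed.
    active = deque(graph.nodes())
    in_queue = set(active)
    while active:
        v = active.popleft()
        in_queue.discard(v)
        candidates = {community[u] for u in graph.neighbors(v)} | {community[v]}
        best_c = max(candidates, key=lambda c: modularity_gain(v, c))
        if best_c != community[v]:
            community[v] = best_c
            # Only neighbors of a moved vertex can change their best community,
            # so everything else stays pruned from the worklist.
            for u in graph.neighbors(v):
                if u not in in_queue:
                    active.append(u)
                    in_queue.add(u)
    return community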
Fast Spectral Graph Layout on Multicore Platforms
Ashirbad Mishra (Pennsylvania State University), Shad Kirmani (EBay Inc.), and Kamesh Madduri (Pennsylvania State University) Abstract Video Link Slides Link We present ParHDE, a shared-memory parallelization of the High-Dimensional Embedding (HDE) graph algorithm. Originally proposed as a graph drawing algorithm, HDE characterizes the global structure of a graph and is closely related to spectral graph computations such as computing the eigenvectors of the graph Laplacian. We identify compute- and memory-intensive steps in HDE and parallelize these steps for efficient execution on shared-memory multicore platforms. ParHDE can process graphs with billions of edges in minutes, is up to 18X faster than a prior parallel implementation of HDE, and achieves up to a 24X relative speedup on a 28-core system. We also implement several extensions of ParHDE and demonstrate its utility in diverse graph computation-related applications. Revisiting Sparse Dynamic Programming for the 0/1 Knapsack Problem
Revisiting Sparse Dynamic Programming for the 0/1 Knapsack Problem
Tarequl Islam Sifat (Corespeq Inc, Colorado State University); Nirmal Prajapati (LANL, Colorado State University); and Sanjay Rajopadhye (Colorado State University) Abstract Video Link Slides Link The 0/1-Knapsack Problem is a classic NP-hard problem. There are two common approaches to obtain the exact solution: branch-and-bound (BB) and dynamic programming (DP). A so-called "sparse" DP algorithm (SKPDP) that performs fewer operations than the standard algorithm (KPDP) is well known. To the best of our knowledge, there has been no quantitative analysis of the benefits of sparsity. We provide a careful empirical evaluation of SKPDP and observe that for a "large enough" capacity, C, the number of operations performed by SKPDP is invariant with respect to C for many problem instances. This leads to the possibility of an exponential improvement over the conventional KPDP. We experimentally explore SKPDP over a large range of knapsack problem instances and provide a detailed study of the attributes that impact the performance. Paper 5C: Parallel and Distributed Machine Learning Zoom Meeting C Agnieszka K. Miedlar Developing a Loss Prediction-based Asynchronous Stochastic Gradient Descent Algorithm for Distributed Training of Deep Neural Networks
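The sparsity that SKPDP exploits comes from keeping only non-dominated (weight, value) states instead of a dense table over all capacities, which is why the operation count can stop growing with C. A minimal sequential sketch of that idea (not the authors' implementation) follows.

def sparse_knapsack(items, capacity):
    # Sparse 0/1 knapsack DP: keep only Pareto-optimal (weight, value) states.
    states = [(0, 0)]  # sorted by weight, with strictly increasing values
    for w, v in items:
        shifted = [(sw + w, sv + v) for sw, sv in states if sw + w <= capacity]
        merged, best = [], -1
        i = j = 0
        # Merge two weight-sorted lists, dropping dominated states on the fly.
        while i < len(states) or j < len(shifted):
            if j >= len(shifted) or (i < len(states) and states[i][0] <= shifted[j][0]):
                cand = states[i]; i += 1
            else:
                cand = shifted[j]; j += 1
            if cand[1] > best:
                best = cand[1]
                merged.append(cand)
        states = merged
    return states[-1][1]  # maximum achievable value within capacity

# Example: items (weight, value); best choice is {(3,4), (2,3)} for value 7.
assert sparse_knapsack([(3, 4), (4, 5), (2, 3)], capacity=5) == 7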
Developing a Loss Prediction-based Asynchronous Stochastic Gradient Descent Algorithm for Distributed Training of Deep Neural Networks
Junyu Li and Ligang He (University of Warwick), Shenyuan Ren (University of Oxford), and Rui Mao (Shenzhen University) Abstract Video Link Slides Link Training a Deep Neural Network is a computation-intensive and time-consuming task. Asynchronous Stochastic Gradient Descent (ASGD) is an effective solution to accelerate the training process since it enables the network to be trained in a distributed fashion, but with a main issue of the delayed gradient update. A recent notable work called DC-ASGD improves the performance of ASGD by compensating the delay using a cheap approximation of the Hessian matrix. DC-ASGD works well with a short delay; however, the performance drops considerably with an increasing delay between the workers and the server. In real-life large-scale distributed training, such gradient delay experienced by the workers is usually high and volatile. In this paper, we propose a novel algorithm called LC-ASGD to compensate for the delay, based on loss prediction. It effectively extends the tolerable delay duration for the compensation mechanism. Specifically, LC-ASGD utilizes additional models that reside in the parameter server and predict the loss to compensate for the delay, based on historical losses collected from each worker. The algorithm is evaluated on popular networks and benchmark datasets. The experimental results show that our LC-ASGD significantly improves over existing methods, especially when the networks are trained with a large number of workers. Federated Learning with Proximal Stochastic Variance Reduced Gradient Algorithms
Federated Learning with Proximal Stochastic Variance Reduced Gradient Algorithms
Canh T. Dinh and Nguyen H. Tran (The University of Sydney); Tuan Dung Nguyen (The University of Melbourne); and Wei Bao, Albert Y. Zomaya, and Bing B. Zhou (The University of Sydney) Abstract Video Link Slides Link Federated Learning (FL) is a fast-developing distributed machine learning technique involving the participation of a massive number of user devices. While FL has benefits of data privacy and the abundance of user-generated data, its challenges of heterogeneity across users' data and devices complicate algorithm design and convergence analysis. To tackle these challenges, we propose an algorithm that exploits proximal stochastic variance reduced gradient methods for non-convex FL. The proposed algorithm consists of two nested loops, which allow user devices to update their local models approximately up to an accuracy threshold (inner loop) before sending these local models to the server for global model update (outer loop). We characterize the convergence conditions for both local and global model updates and extract various insights from these conditions via the algorithm's parameter control. We also propose how to optimize these parameters such that the training time of FL is minimized. Experimental results not only validate the theoretical convergence but also show that the proposed algorithm outperforms existing Stochastic Gradient Descent-based methods in terms of convergence speed in FL setting. Dual-Way Gradient Sparsification for Asynchronous Distributed Deep Learning
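The nested-loop structure described above (approximate local solves in an inner loop, server-side aggregation in an outer loop) can be sketched as follows. Note that this uses a plain FedProx-style proximal SGD step for the inner loop rather than the paper's proximal variance-reduced gradient updates, and grad_fn, users, and the hyperparameters below are placeholders, not values from the paper.

import numpy as np

def local_update(w_global, data, grad_fn, lr=0.01, mu=0.1, inner_steps=20):
    # Inner loop: approximate local solve with a proximal term pulling toward the global model.
    w = w_global.copy()
    for _ in range(inner_steps):
        x, y = data[np.random.randint(len(data))]
        w -= lr * (grad_fn(w, x, y) + mu * (w - w_global))  # proximal regularization
    return w

def federated_round(w_global, users, grad_fn):
    # Outer loop: the server aggregates the users' approximate local models.
    local_models = [local_update(w_global, d, grad_fn) for d in users]
    return np.mean(local_models, axis=0)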
Dual-Way Gradient Sparsification for Asynchronous Distributed Deep Learning
Zijie Yan, Danyang Xiao, Mengqiang Chen, Jieying Zhou, and Weigang Wu (Sun Yat-sen University) Abstract Video Link Slides Link Distributed parallel training using computing clusters is desirable for large scale deep neural networks. One of the key challenges in distributed training is the communication cost for exchanging information, such as stochastic gradients, among training nodes. Recently, gradient sparsification techniques have been proposed to reduce the amount of data exchanged and thus alleviate the network overhead. However, most existing gradient sparsification approaches consider only synchronous parallelism and cannot be applied in asynchronous distributed training. Paper |
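A common building block for gradient sparsification schemes like the one motivated above is top-k selection with error feedback, sketched below. This shows only the generic single-direction technique; the paper's dual-way scheme for asynchronous parameter-server training is not reproduced, and the sizes used are arbitrary.

import numpy as np

def topk_sparsify(grad, k, residual):
    # Keep the k largest-magnitude entries of (grad + residual); accumulate the rest.
    acc = grad + residual
    idx = np.argpartition(np.abs(acc), -k)[-k:]   # indices of the k largest magnitudes
    sparse = np.zeros_like(acc)
    sparse[idx] = acc[idx]
    new_residual = acc - sparse                   # error feedback carried to the next round
    return idx, acc[idx], new_residual

# Example: send only 1% of a 10,000-element gradient.
g = np.random.randn(10_000)
r = np.zeros_like(g)
idx, vals, r = topk_sparsify(g, k=100, residual=r)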
Thursday, August 20th11:00am-11:40amKeynote-3: Saman Amarasinghe, MIT
Zoom Meeting A
Xipeng Shen
How to Make Sparse Fast How to Make Sparse Fast
Saman Amarasinghe (MIT) Abstract Achieving high performance is no easy task. When it comes to programs operating on sparse data, where there is very little hardware, language or compiler support, getting high performance is nearly impossible. As important modern applications such as machine learning, data analytics and simulations operate on sparse data, lack of performance is becoming a critical issue. Achieving high performance has been so important since the early days of computing that many researchers have spent their lifetimes trying to extract more FLOPS out of these critical codes. Hardcore performance engineers try to get to this performance nirvana single-handedly without any help from languages, compilers or tools. In this talk, using two examples, TACO and GraphIt, I'll argue that domain specific languages and compiler technology can reduce most of the performance optimization burden even in a very difficult domain such as sparse computations. TACO is an optimizing code generator for linear and tensor algebra. TACO introduces a new technique for compiling compound tensor algebra expressions into efficient loops. TACO-generated code has competitive performance to best-in-class hand-written codes for tensor and matrix operations. GraphIt is a DSL and compiler for high-performance graph computing. GraphIt separates algorithm, schedule and physical data layout, providing the programmer with the ultimate control over optimization. GraphIt outperforms the state-of-the-art libraries and DSLs by up to 2.4× on scale-free graphs and 4.7× on road graphs. Keynote 11:50am-12:20pm6A: Heterogeneous Systems Zoom Meeting A Vivek Kale Balancing Graph Processing Workloads Using Work Stealing on Heterogeneous CPU-FPGA Systems
Balancing Graph Processing Workloads Using Work Stealing on Heterogeneous CPU-FPGA Systems
Matthew Agostini, Francis O'Brien, and Tarek Abdelrahman (University of Toronto) Abstract Video Link Slides Link We propose, implement and evaluate a work stealing based scheduler, called HWS, for graph processing on heterogeneous CPU-FPGA systems that tightly couple the CPU and the FPGA to share system memory. HWS addresses unique concerns that arise with work stealing in the context of our target system. Our evaluation is conducted on the Intel Heterogeneous Architecture Research Platform (HARPv2), using three key processing kernels and seven real-world graphs. We show that HWS effectively balances workloads. Further, the use of HWS results in better graph processing performance compared to static scheduling and a representative of existing adaptive partitioning techniques, called HAP. Improvements vary by graph processing application, input graph and number of threads, and can be up to 100% over static scheduling, and up to 17% over HAP. We also compare to an oracle chunk self-scheduler, in which the best chunk size is known a priori for each number of threads and each input graph. HWS performs within 1-3% of this oracle in most cases. Finally, our graph processing throughput scales well with increasing threads. These results collectively demonstrate the effectiveness of work stealing for graph processing on our heterogeneous target platform. Enabling performance portability of data-parallel OpenMP applications on asymmetric multicore processors
Enabling performance portability of data-parallel OpenMP applications on asymmetric multicore processors
Juan Carlos Saez, Fernando Castro, and Manuel Prieto-Matias (Complutense University of Madrid) Abstract Video Link Slides Link Asymmetric multicore processors (AMPs) couple high-performance big cores and low-power small cores with the same instruction-set architecture but different features, such as clock frequency or microarchitecture. Previous work has shown that asymmetric designs may deliver higher energy efficiency than symmetric multicores for diverse workloads. Despite their benefits, AMPs pose significant challenges to runtime systems of parallel programming models. While previous work has mainly explored how to efficiently execute task-based parallel applications on AMPs, via enhancements in the runtime system, improving the performance of unmodified data-parallel applications on these architectures is still a big challenge. In this work we analyze the particular case of loop-based OpenMP applications, which are widely used today in scientific and engineering domains, and constitute the dominant application type in many parallel benchmark suites used for performance evaluation on multicore systems. We observed that conventional loop-scheduling OpenMP approaches are unable to efficiently cope with the load imbalance that naturally stems from the different performance delivered by big and small cores. Detecting Anomalous Computation with RNNs on GPU-Accelerated HPC Machines
Detecting Anomalous Computation with RNNs on GPU-Accelerated HPC Machines
Pengfei Zou (Clemson University), Ang Li and Kevin Barker (Pacific Northwest National Laboratory), and Rong Ge (Clemson University) Abstract Video Link Slides Link This paper presents a workload classification framework that accurately discriminates illicit computation from authorized workloads on GPU-accelerated HPC systems at runtime. As such systems become increasingly powerful and widely-adopted, attackers have begun to run illicit and for-profit programs that typically require extremely high computing capability to be successful, depriving mission-critical and authorized workloads of execution cycles and increasing risks of data leaking and empowered attacks. Traditional measures on CPU hosts are oblivious to such attacks. Our classification framework leverages the distinctive signatures between illicit and authorized GPU workloads, and explores machine learning methods and workload profiling to classify them. We face multiple challenges in designing the framework: achieving high detection accuracy, maintaining low profiling and inference overhead, and overcoming the limitation of lacking data types and volumes typically required by deep learning models. To address these challenges, we use lightweight, non-intrusive, high-level workload profiling, collect multiple sequences of easily obtainable multimodal input data, and build recurrent neural networks (RNNs) to learn from history for online anomalous workload detection. Evaluation results on three generations of GPU machines demonstrate that the workload classification framework can tell apart the illicit workloads with a high accuracy of over 95%. The collected dataset, detection framework, and neural network models will be released on github. Paper 6B: Performance Evaluation and Characterization Zoom Meeting B Probir Roy Experiences on the characterization of parallel applications in embedded systems with Extrae/Paraver
Experiences on the characterization of parallel applications in embedded systems with Extrae/Paraver
Adrian Munera, Sara Royuela, German Llort, and Estanislao Mercadal (Barcelona Supercomputing Center); Franck Wartel (Airbus Defence and Space); and Eduardo Quiñones (Barcelona Supercomputing Center) Abstract Video Link Slides Link The computing needs of current embedded systems are on the increase. Cutting-edge functionalities, such as those in autonomous vehicles, require the use of multi- and many-core architectures to meet their real-time requirements. This introduces a new layer in the software stack typically deployed in embedded systems: the parallel programming model. In this regard, the OpenMP tasking model is being increasingly considered to exploit the capabilities of the most advanced embedded architectures by virtue of its productivity, and its newly discovered time analyzability and schedulability. SPECcast: A Methodology for Fast Performance Evaluation with SPEC CPU 2017 Multiprogrammed Workloads
SPECcast: A Methodology for Fast Performance Evaluation with SPEC CPU 2017 Multiprogrammed Workloads
Pablo Prieto, Pablo Abad, Jose Angel Herrero, Jose Angel Gregorio, and Valentin Puente (University of Cantabria) Abstract Video Link Slides Link Performance comparison of alternative computing platforms is a fundamental task in computer architecture research. Evaluation tools must adapt to current computing environments where resources are shared among multiple and dissimilar applications running simultaneously. In many cases, well known benchmarking tools such as SPEC CPU2017 do not provide evaluation metrics reflecting such scenarios. Previous attempts to improve workload realism rely on the random combination of applications, making use of statistical models to limit population size. The computational cost of this kind of solution is not negligible, given the huge amount of combinations required to guarantee the correctness of these statistical models. In this paper we present SPECcast, an alternative methodology for performance comparison making use of SPEC workloads. Our proposal also relies on application combination, but at a much lower computational cost. Using source code annotation for sampling, we first obtain a small portion of code from each SPEC application, belonging to its Region of Interest (ROI), that resembles the whole-program characteristics with high accuracy. Then, making use of synchronization mechanisms, we are able to run any combination of such small pieces of code, obtaining performance results that not only emulate the execution of multiprogrammed SPEC workloads up to 95% faster but also open the way to executing a large number of workload combinations, making the performance comparison results statistically reliable. The Art of CPU-Pinning: Evaluating and Improving the Performance of Virtualization and Containerization Platforms
The Art of CPU-Pinning: Evaluating and Improving the Performance of Virtualization and Containerization Platforms
Davood Ghatrehsamani, Chavit Denninnart, and Mohsen Amini Salehi (University of Louisiana at Lafayette) and Josef Bacik (Facebook) Abstract Video Link Slides Link Cloud providers offer a variety of execution platforms in the form of bare-metal, VM, and containers. However, due to the pros and cons of each execution platform, choosing the appropriate platform for a specific cloud-based application has become a challenge for solution architects. The possibility of combining these platforms (e.g., deploying containers within VMs) offers new capabilities that make the challenge even more complicated. However, there is little study in the literature on the pros and cons of deploying different application types on various execution platforms. In particular, evaluation of diverse hardware configurations and different CPU provisioning methods, such as CPU pinning, has not been sufficiently studied in the literature. In this work, the performance overhead of container, VM, and bare-metal execution platforms is measured and analyzed for four categories of real-world applications, namely video processing, parallel processing (MPI), web processing, and No-SQL, respectively representing CPU-intensive, parallel-processing, and two IO-intensive workloads. Our analyses reveal a set of interesting and sometimes counterintuitive findings that can be used as best practices by solution architects to efficiently deploy cloud-based applications. Some notable findings are: (A) Under specific circumstances, containers can impose a higher overhead than VMs; (B) Containers on top of VMs can mitigate the overhead of VMs for certain applications; (C) Containers with a large number of cores impose a lower overhead than those with a few cores. Paper 6C: Routing and Mapping in Networks Zoom Meeting C Feng Zhang XShot: Light-weight Link Failure Localization using Crossed Probing Cycles in SDN
XShot: Light-weight Link Failure Localization using Crossed Probing Cycles in SDN
Hongyun Gao, Laiping Zhao, Huanbin Wang, Zhao Tian, Lihai Nie, and Keqiu Li (Tianjin Key Laboratory of Advanced Networking (TANKLab), College of Intelligence and Computing (CIC), Tianjin University) Abstract Video Link Slides Link Accurate and quick failure localization is critical for automatic network troubleshooting. While it is particularly difficult to solve the problem in the traditional network due to the uncertain routing, Software Defined Networking (SDN) enables the deterministic routing for packet transmission through the traffic engineering algorithm in the centralized controller. On Network Locality in MPI-Based HPC Applications
On Network Locality in MPI-Based HPC Applications
Felix Zahn and Holger Fröning (Heidelberg University) Abstract Video Link Slides Link Data movements through interconnection networks exceed local memory accesses in terms of latency as well as energy by multiple orders of magnitude. While many optimizations make great effort to improve memory accesses, large distances in the network can easily dash these improvements, resulting in increasing overall costs. Therefore, a deep understanding of network locality is key for further optimizations, such as improved mapping of ranks to physical entities. In this work, we are looking at locality in the hardware-independent application level and at locality aspects of common network structures. In order to quantify the former, two new metrics are introduced, namely rank locality and selectivity. Our studies are performed on a selection of 16 exascale proxy mini apps, with scale ranging from eight to 1152 ranks. These traces are statically analyzed regarding their spatial communication pattern at MPI level. The resulting practice in actual hardware is evaluated with a network model, which implements topologies such as tori, fat tree, and dragonfly, and an according minimal routing. As a result, this work is founded on a large set of experimental configurations, based on different applications, scales, and topologies. Overall, our findings indicate that static analyses could assist to select an advanced mapping, which assigns groups of heavily communicating ranks to physical entities in close proximity. This could help to minimize the total number of packet hops and, thereby, improve latency and reduce the probability of congestion. DeepHop on Edge: Hop-by-hop Routing by Distributed Learning with Semantic Attention
DeepHop on Edge: Hop-by-hop Routing by Distributed Learning with Semantic Attention
Bo He, Jingyu Wang, Qi Qi, Haifeng Sun, and Zirui Zhuang (Beijing University of Posts and Telecommunications); Cong Liu (China Mobile Research Institute); and Jianxin Liao (Beijing University of Posts and Telecommunications) Abstract Video Link Slides Link Multi-access Edge Computing (MEC) and ubiquitous smart devices help serve end-users efficiently and optimally through providing emerging edge-deployed services. Meanwhile, more heavy and time-varying traffic loads are produced in the edge network, so an efficient traffic forwarding mechanism is required. In this paper, we propose a parallel and distributed learning approach, DeepHop, to adapt to volatile environments and realize hop-by-hop routing. Multi-Agent Deep Reinforcement Learning (MADRL) is used to alleviate edge network congestion and maximize the utilization of network resources. DeepHop determines the routing among edge network nodes for heterogeneous types of traffic according to the current workload and capability. By joining with an attention model, DeepHop obtains the semantics from the elements of the network state to help the agents learn the importance of each element on routing. Experiment results show that DeepHop achieves an increase of 15% in successfully transmitted packets compared with the state-of-the-art algorithms. Besides, DeepHop with an attention model reduces convergence time by nearly half compared with commonly used neural network structures. Paper 12:30pm-1:00pm7A: Microarchitecture and Power Management Zoom Meeting A Hyeran Jeon A GPU Register File using Static Data Compression
A GPU Register File using Static Data Compression
Alexandra Angerd, Erik Sintorn, and Per Stenström (Chalmers University of Technology) Abstract Video Link Slides Link GPUs rely on large register files to unlock thread-level parallelism for high throughput. Unfortunately, large register files are power hungry, making it important to seek new approaches to improve their utilization. HCAPP: Scalable Power Control for Heterogeneous 2.5D Integrated Systems
HCAPP: Scalable Power Control for Heterogeneous 2.5D Integrated Systems
Kramer Straube, Jason Lowe-Power, Christopher Nitta, Matthew Farrens, and Venkatesh Akella (University of California, Davis) Abstract Video Link Slides Link Package pin allocation is becoming a key bottleneck in the capabilities of designs due to the increased bandwidth requirements. 2.5D integration compounds these package-level requirements while introducing an increased number of compute units within the package. We propose a decentralized power control implementation called Heterogeneous Constant Average Power Processing (HCAPP) to maintain the power limit while maximizing the efficiency of the package pins allocated for power. HCAPP uses a hardware-based decentralized design to handle fast power limits, maintain scalability and enable simplified control for heterogeneous systems while maximizing performance. As extensions, we evaluate a software interface and the impact of different accelerator designs. Overall, HCAPP achieves 7% speedup over a RAPL-like implementation. The power utilization improves from 79.7% (RAPL-like) to 93.9% (HCAPP) with this design. A priority-based static software control methodology alongside HCAPP provides average speedups of 8.3% (CPU), 5.4% (GPU), and 12% (Accelerator) for the prioritized component compared to the unprioritized version. DNNARA: A Deep Neural Network Accelerator using Residue Arithmetic and Integrated Photonics
DNNARA: A Deep Neural Network Accelerator using Residue Arithmetic and Integrated Photonics
Jiaxin Peng (The George Washington University); Yousra Alkabani (Halmstad University); and Shuai Sun, Volker Sorger, and Tarek El-Ghazawi (The George Washington University) Abstract Video Link Slides Link Deep Neural Networks (DNNs) are currently used in many fields, including critical real-time applications. Due to its compute-intensive nature, speeding up DNNs has become an important topic in current research. We propose a hybrid opto-electronic computing architecture targeting the acceleration of DNNs based on the residue number system (RNS). In this novel architecture, we combine the use of Wavelength Division Multiplexing (WDM) and RNS for efficient execution. WDM is used to enable a high level of parallelism while reducing the number of optical components needed to decrease the area of the accelerator. Moreover, RNS is used to generate optical components with short optical critical paths. In addition to speed, this has the advantage of lowering the optical losses and reducing the need for high laser power. Our RNS compute modules use one-hot encoding and thus enable fast switching between the electrical and optical domains. Paper 7B: Parallel Algorithms II Zoom Meeting B Abdou Guermouche Adaptive Bulk Search: Solving Quadratic Unconstrained Binary Optimization Problems on Multiple GPUs
Adaptive Bulk Search: Solving Quadratic Unconstrained Binary Optimization Problems on Multiple GPUs
Ryota Yasudo, Koji Nakano, and Yasuaki Ito (Hiroshima University) and Masaru Tatekawa, Ryota Katsuki, Takashi Yazane, and Yoko Inaba (NTT DATA Corporation) Abstract Video Link Slides Link The quadratic unconstrained binary optimization (QUBO) problem has recently been gathering attention in conjunction with quantum annealing (QA), since it is equivalent to finding the ground state of an Ising model. Due to the limitation of current QA systems, classical computers may outperform them, and thus solving QUBO on FPGAs, GPUs, and special purpose processors has been proposed. In this paper, we propose an adaptive bulk search (ABS), a framework for solving QUBO that can perform many searches in parallel on multiple GPUs. It supports fully-connected Ising models with up to 32k (= 32,768) spins and 16-bit weights. In our ABS, a CPU host performs a genetic algorithm (GA) while GPUs asynchronously perform local searches. A bottleneck for solving QUBO exists in the evaluation of the energy function, which requires O(n^2) computational cost for each solution. We show this can be reduced to O(1) in our ABS. The experimental results show that, with four NVIDIA GeForce RTX 2080 Ti GPUs, our framework can search up to 1.24 × 10^12 solutions per second. Also, we show that our system quickly solves maximum cut and traveling salesman problems. Efficient Block Algorithms for Parallel Sparse Triangular Solve
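The O(n^2)-to-O(1) reduction mentioned above is typically achieved by maintaining per-variable local fields so that the energy change of a single bit flip can be scored in constant time, with only an O(n) update when a flip is accepted. The CPU-side sketch below illustrates this incremental-evaluation idea generically; it is not the ABS GPU implementation.

import numpy as np

class QuboState:
    # Incremental QUBO energy: O(1) to score a single-bit flip, O(n) to apply it.
    def __init__(self, Q, x):
        self.Q = np.asarray(Q, dtype=float)   # symmetric QUBO matrix
        self.x = np.asarray(x, dtype=int)     # binary assignment
        # h[k] = sum over j != k of Q[k, j] * x[j]
        self.h = self.Q @ self.x - np.diag(self.Q) * self.x
        self.energy = float(self.x @ self.Q @ self.x)

    def flip_delta(self, k):
        d = 1 - 2 * self.x[k]                 # +1 if flipping 0->1, -1 if 1->0
        return d * (2 * self.h[k] + self.Q[k, k])

    def flip(self, k):
        d = 1 - 2 * self.x[k]
        self.energy += self.flip_delta(k)
        self.x[k] += d
        self.h += d * self.Q[:, k]
        self.h[k] -= d * self.Q[k, k]         # h[k] excludes its own diagonal term

# Example: flipping bit 1 of x = [1, 0] under Q = [[1, 2], [2, 3]] raises the energy from 1 to 8.
s = QuboState([[1, 2], [2, 3]], [1, 0])
s.flip(1)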
Efficient Block Algorithms for Parallel Sparse Triangular Solve
Zhengyang Lu, Yuyao Niu, and Weifeng Liu (China University of Petroleum, Beijing) Abstract Video Link Slides Link The sparse triangular solve (SpTRSV) kernel is an important building block for a number of linear algebra routines such as sparse direct and iterative solvers. The major challenge of accelerating SpTRSV lies in the difficulties of finding higher parallelism. Existing work mainly focuses on reducing dependencies and synchronizations in the level-set methods. However, the 2D block layout of the input matrix has been largely ignored in designing more efficient SpTRSV algorithms. Selective Coflow Completion for Time-sensitive Distributed Applications with Poco
Selective Coflow Completion for Time-sensitive Distributed Applications with Poco
Shouxi Luo, Pingzhi Fan, and Huanlai Xing (Southwest Jiaotong University) and Hongfang Yu (University of Electronic Science and Technology of China) Abstract Video Link Slides Link Recently, the abstraction of coflow is introduced to capture the collective data transmission patterns among distributed data-parallel applications. During the processing, coflows generally act as barriers; accordingly, time-sensitive applications prefer their coflows to complete within deadlines and deadline-aware coflow scheduling becomes very crucial. Regarding these data-parallel applications, we notice that many of them, including large-scale query system, distributed iterative training, and erasure codes enabled storage, are able to tolerate loss-bounded incomplete inputs by design. This tolerance indeed brings a flexible design space for the schedule of their coflows: when getting overloaded, the network can trade coflow completeness for timeliness, and balance the completenesses of different coflows on demand. Unfortunately, existing coflow schedulers neglect this tolerance, resulting in inflexible and inefficient bandwidth allocations. In this paper, we explore this fundamental trade-off and design Poco, a POlicy-based COflow scheduler, to achieve customizable selective coflow completions for these emerging time-sensitive distributed applications. Internally, Poco employs a suite of novel designs along with admission controls to make flexible, work-conserving, and performance-guaranteed rate allocation to online coflow requests very efficiently. Extensive trace-based simulations indicate that Poco is highly flexible and achieves optimal coflow schedules respecting the requirements specified by applications. Paper 7C: Resource Management on the Cloud Zoom Meeting C Madhusudhan Govindaraju Improving Load Balance via Resource Exchange in Large-Scale Search Engines
Improving Load Balance via Resource Exchange in Large-Scale Search Engines
Kaiyue Duan, Yusen Li, Trent Marbach, Gang Wang, and Xiaoguang Liu (College of Computer Science, Nankai University) Abstract Video Link Slides Link Load balance is one of the major issues in large-scale search engines. A commonly used load balancing approach in search engine datacenters is to reassign index shards among machines. However, reassigning shards under stringent resource environments is challenging due to transient resource constraints (during reassignment, some resources are consumed simultaneously by a shard on the initial machine and its copy on the target machine). Rendering Server Allocation for MMORPG Players in Cloud Gaming
Rendering Server Allocation for MMORPG Players in Cloud Gaming
Iryanto Jaya and Wentong Cai (School of Computer Science and Engineering, Nanyang Technological University) and Yusen Li (School of Computer Science, Nankai University) Abstract Video Link Slides Link Cloud gaming services enable users with heterogeneous device capabilities to get access to game titles with hardware demanding specifications. While visual quality requirements are to be satisfied by the cloud rendering servers, latency is one of the major problems which arises as most of the tasks are offloaded remotely. Cloud gaming players must experience latency below a certain acceptable limit for the game to be responsive with playable visual quality under limited bandwidth capacity. Playing multiplayer games in the cloud needs to cater for player interactions and their commonality especially if they are playing in the same virtual space. The main characteristic of cloud gaming is that the whole rendering pipeline occurs in the cloud; therefore, rendering optimization by reusing common information across players is possible by taking advantage of multi-view rendering. In this paper, we propose a cloud gaming architecture for MMORPGs which involves hundreds of players as well as online optimization heuristic algorithms on rendering server allocations in order to minimize rendering server rental cost from game provider's point of view. In addition, we also compare the performance of those online heuristics with an offline lower bound in order to observe the scalability of the online scenario. Impact of Memory DoS Attacks on Cloud Applications and Real-Time Detection Schemes
Impact of Memory DoS Attacks on Cloud Applications and Real-Time Detection Schemes
Zhuozhao Li (University of Chicago), Tanmoy Sen and Haiying Shen (University of Virginia), and Mooi Choo Chuah (Lehigh University) Abstract Video Link Slides Link Even though memory-based denial-of-service attacks can cause severe performance degradations on co-located virtual machines, a previous detection scheme against such attacks cannot accurately detect the attacks and also generates high detection delay and high performance overhead, since it assumes that cache-related statistics of an application follow the same probability distribution at all times, which may not be true for all types of applications. In this paper, we present experimental results showing the impacts of memory DoS attacks on different types of cloud-based applications. Based on these results, we propose two lightweight, responsive Statistical based Detection Schemes (SDS/B and SDS/P) that can detect such attacks accurately. SDS/B constructs a profile of the normal range of cache-related statistics for all applications and uses statistical methods to infer an attack when the real-time collected statistics exceed this normal range, while SDS/P exploits the increased periods of access patterns for periodic applications to infer an attack. Our evaluation results show that SDS/B and SDS/P outperform the state-of-the-art detection scheme, e.g., with 65% higher specificity, 40% shorter detection delay, and 7% less performance overhead. Paper 1:10pm-1:40pm8A: GPU-Accelerated Applications Zoom Meeting A Ana Lucia Varbanescu Parallel Shift-Invert Spectrum Slicing on Distributed Architectures with GPU Accelerators
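The profile-then-threshold idea behind SDS/B can be sketched as a simple range check over cache-related statistics. This is a hedged illustration with made-up parameters; the paper's exact statistical tests and the periodicity-based SDS/P variant are not shown.

import numpy as np

def build_profile(baseline_samples, k=3.0):
    # Normal operating range for each cache-related statistic (mean +/- k * std),
    # built from measurements taken while no attack is present.
    m = np.mean(baseline_samples, axis=0)
    s = np.std(baseline_samples, axis=0)
    return m - k * s, m + k * s

def is_anomalous(sample, low, high, min_violations=2):
    # Flag a possible memory DoS attack if several statistics leave their normal range.
    violations = np.sum((sample < low) | (sample > high))
    return violations >= min_violations

# Example with two hypothetical statistics (e.g., cache miss rate, access latency).
low, high = build_profile(np.array([[0.10, 95.0], [0.12, 101.0], [0.11, 98.0]]))
print(is_anomalous(np.array([0.35, 240.0]), low, high))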
Parallel Shift-Invert Spectrum Slicing on Distributed Architectures with GPU Accelerators
David B. Williams-Young and Chao Yang (Lawrence Berkeley National Laboratory) Abstract Video Link Slides Link The solution of large scale eigenvalue problems (EVP) is often the computational bottleneck for many scientific and engineering applications. Traditional eigensolvers, such as direct (e.g. ScaLAPACK) and Krylov subspace (e.g. Lanczos) methods, have struggled in achieving high scalability on large computing resources due to communication and synchronization bottlenecks which are inherent in their implementation. This includes a difficulty in developing well-performing ports of these algorithms to architectures which rely on the use of accelerators, such as graphics processing units (GPU), for the majority of their floating point operations. Recently, there has been significant research into the development of eigensolvers based on spectrum slicing, in particular shift-invert spectrum slicing, to alleviate the communication and synchronization bottlenecks of traditional eigensolvers. In general, spectrum slicing trades the global EVP for many smaller, independent EVPs which may be combined to assemble some desired subset of the entire eigenspectrum. The result is a method which utilizes more floating point operations than traditional eigensolvers, but in a way which allows for the expression of massive concurrency leading to an overall improvement in time-to-solution on large computing resources. In this work, we will examine the performance of parallel shift-invert spectrum slicing on modern GPU clusters using state-of-the-art linear algebra software. Detailed Analysis and Optimization of CUDA K-means Algorithm
Detailed Analysis and Optimization of CUDA K-means Algorithm
Martin Kruliš and Miroslav Kratochvíl (Charles University) Abstract Video Link Slides Link K-means is one of the most frequently used algorithms for unsupervised clustering data analysis. Individual steps of the k-means algorithm include nearest neighbor finding, efficient distance computation, and cluster-wise reduction, which may be generalized to many other purposes in data analysis, visualization, and machine learning. The efficiency of the available implementations of k-means computation steps therefore directly affects many other applications. In this work, we examine the performance limits in the context of modern massively parallel GPU accelerators. Despite the existence of many published papers on this topic, we have found that crucial performance aspects of the GPU implementations remain unaddressed, including the optimizations for memory bandwidth, cache limits, and workload dispatching on problem instances of varying cluster count, dataset size, and dimensionality. We present a detailed analysis of individual computation steps and propose several optimizations that improve the overall performance on contemporary GPU architectures. Our open-source prototype exhibits significant speedup over the current state-of-the-art implementations in virtually all practical scenarios. Performance Portable Supernode-based Sparse Triangular Solver for Manycore Architectures
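The two computation steps this abstract highlights, nearest-centroid assignment and cluster-wise reduction, make up one Lloyd iteration. The NumPy sketch below shows that algorithmic structure only; it includes none of the paper's CUDA-specific memory, cache, or dispatch optimizations.

import numpy as np

def kmeans_step(points, centroids):
    # One Lloyd iteration: nearest-centroid assignment, then cluster-wise reduction.
    # Squared distances expanded as ||p||^2 - 2 p.c + ||c||^2 to stay vectorized.
    d2 = (np.sum(points**2, axis=1, keepdims=True)
          - 2.0 * points @ centroids.T
          + np.sum(centroids**2, axis=1))
    assign = np.argmin(d2, axis=1)
    new_centroids = np.zeros_like(centroids)
    counts = np.bincount(assign, minlength=len(centroids)).astype(float)
    np.add.at(new_centroids, assign, points)        # cluster-wise reduction (scatter-add)
    nonempty = counts > 0
    new_centroids[nonempty] /= counts[nonempty, None]
    return assign, new_centroids

# Example: 1,000 random 2-D points clustered around 4 centroids.
pts = np.random.rand(1000, 2)
cent = pts[np.random.choice(len(pts), 4, replace=False)]
labels, cent = kmeans_step(pts, cent)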
Performance Portable Supernode-based Sparse Triangular Solver for Manycore Architectures
Ichitaro Yamazaki, Sivasankaran Rajamanickam, and Nathan Ellingwood (Sandia National Labs) Abstract Video Link Slides Link The sparse triangular solve is an important kernel in many applications. Instead of developing a general sparse triangular solver, in this paper we take advantage of the supernodal structure of the triangular matrices that come from the direct factorization of a sparse matrix. We compare the effects of different scheduling schemes on performance, and also investigate an algorithmic variant called the partitioned inverse method. We implemented our solver using Kokkos, and it is therefore portable to different machine architectures from a single code base. Our experimental results on CPUs and a GPU demonstrate the trade-offs between the different approaches in terms of stability, storage, and performance. Paper
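To make the supernodal structure concrete, the sketch below performs forward substitution where each supernode is a contiguous block of columns stored densely: a small dense triangular solve on the diagonal block followed by a dense update of the remaining right-hand side. The supernode partition is assumed to be given, and this illustrates the general technique rather than the authors' Kokkos implementation.

```python
# Supernodal forward substitution for a lower-triangular L.
import numpy as np
from scipy.linalg import solve_triangular

def supernodal_lower_solve(L, b, supernodes):
    """supernodes: list of (start, end) column ranges partitioning L."""
    x = b.astype(float).copy()
    for s, e in supernodes:
        # Dense triangular solve on the supernode's diagonal block.
        x[s:e] = solve_triangular(L[s:e, s:e], x[s:e], lower=True)
        # Dense update (GEMV) of the rows below the supernode.
        x[e:] -= L[e:, s:e] @ x[s:e]
    return x

L = np.tril(np.random.default_rng(0).normal(size=(8, 8))) + 4 * np.eye(8)
b = np.arange(8.0)
x = supernodal_lower_solve(L, b, [(0, 3), (3, 6), (6, 8)])
print(np.allclose(L @ x, b))   # True
```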
1:10pm-1:50pm8B: Data Centers and the Edge Zoom Meeting B Sudhanva Gurumurthi
OVERSEE: Outsourcing Verification to Enable Resource Sharing in Edge Environment
Xiaoqing Cai, Jiuchen Shi, Rui Yuan, Chang Liu, Wenli Zheng, Quan Chen, Chao Li, Jingwen Leng, and Minyi Guo (Shanghai Jiao Tong University) Abstract Video Link Slides Link Multi-tenant (or colocation) data centers are a good way to support edge computing, since each enterprise or organization usually has only a limited number of servers at an edge site. When a data center tenant faces a workload burst, renting resources from other tenants in the same data center can provide the required resources while keeping the merits of edge computing, but the reliability and performance challenges this raises are daunting. In this paper, we propose OVERSEE, an outsourcing verification mechanism that enables resource sharing in multi-tenant data centers, fully exploiting the benefits of edge computing. OVERSEE addresses the above two challenges by making skillful use of the Intel SGX suite. OVERSEE consists of two sub-schemes, a Report-Proof mechanism and a Sampling-Challenging mechanism. The Report-Proof mechanism guarantees that a task outsourced by a tenant is executed correctly, i.e., completely and without modification, in the operating environment provided by another tenant. The Sampling-Challenging mechanism can be used to verify that sufficient computing capacity is provided to achieve the required QoS according to the resource lease agreement between the tenants. Theoretical analysis shows the effectiveness of OVERSEE, and experimental results show that it introduces minimal overhead.
Reducing Latency in Multi-Tenant Data Centers via Cautious Congestion Watch
Ahmed M. Abdelmoniem (Hong Kong University of Science and Technology; Assiut University, Egypt) and Hengky Susanto and Brahim Bensaou (Hong Kong University of Science and Technology) Abstract Video Link Slides Link Modern data centers host a plethora of interactive, data-intensive applications that are known to generate large numbers of short-lived parallel flows. These short-lived flows must complete their transfer quickly to meet the stringent performance requirements of interactive applications. Unfortunately, network resources (e.g., switch buffer space) in the data center are scarce, and congestion results in long latency. As a first solution, following a traditional design rationale, Data Center TCP (DCTCP) was proposed to speed up the completion of short-lived flows by maintaining a low buffer occupancy in the switch. In general, DCTCP performs well in homogeneous environments. However, its performance degrades quickly in heterogeneous environments or when it uses a large initial congestion window. To resolve this problem, we propose a Hypervisor-based congestion watching mechanism (HWatch), which measures the network load in the data center via ECN and uses the resulting statistics to determine an appropriate initial congestion window size that avoids congestion. HWatch requires neither modifications to the TCP stack in the VMs nor specialized network hardware features to meet its targets. In our evaluation, we demonstrate the benefits of HWatch in improving the performance of TCP flows through large-scale ns2 simulations and testbed experiments in a small data center.
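The core idea of watching ECN marks to size the initial window can be illustrated in a few lines: track the recent fraction of marked packets and hand new flows a window that shrinks as marking grows. The window bounds and the linear mapping below are assumptions made for illustration, not HWatch's actual policy.

```python
# Sketch of ECN-driven selection of the initial congestion window.
from collections import deque

class CongestionWatch:
    MIN_IW, MAX_IW = 4, 64                   # initial window bounds in MSS (assumed)

    def __init__(self, history=1000):
        self.marks = deque(maxlen=history)   # 1 = ECN-marked packet, 0 = unmarked

    def record(self, ecn_marked):
        self.marks.append(1 if ecn_marked else 0)

    def initial_window(self):
        """Pick an initial cwnd from the observed ECN marking fraction."""
        if not self.marks:
            return self.MAX_IW
        frac = sum(self.marks) / len(self.marks)
        iw = round(self.MAX_IW * (1.0 - frac))
        return max(self.MIN_IW, min(self.MAX_IW, iw))

watch = CongestionWatch()
for marked in [0] * 900 + [1] * 100:         # ~10% marking rate observed
    watch.record(marked)
print(watch.initial_window())                # 58 MSS under light congestion
```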
URSA: Precise Capacity Planning and Fair Scheduling based on Low-level Statistics for Public Clouds
Wei Zhang, Ningxin Zheng, and Quan Chen (Shanghai Jiao Tong University); Yong Yang, Zhuo Song, and Tao Ma (Alibaba Cloud); and Jingwen Leng and Minyi Guo (Shanghai Jiao Tong University) Abstract Video Link Slides Link Database platform-as-a-service (dbPaaS) is developing rapidly, and a large number of databases have been migrated to run on Clouds for the low cost and flexibility. Emerging Clouds rely on tenants to provide the resource specifications for their database workloads. However, tenants tend to over-estimate the resource requirements of their databases, resulting in unnecessarily high cost and low Cloud utilization. A methodology that automatically suggests the "just-enough" resource specification that fulfills the performance requirement of every database workload is therefore profitable.
Reliability Augmentation of Requests with Service Function Chain Requirements in Mobile Edge-Cloud Networks
Weifa Liang and Yu Ma (The Australian National University), Wenzheng Xu (Sichuan University), Xiaohua Jia (City University of Hong Kong), and Sid Chi-Kin Chau (The Australian National University) Abstract Video Link Slides Link Provisioning reliable network services for mobile users in a mobile edge computing environment is a top priority for most network service providers, as unreliable or severely failed services result in tremendous losses of revenue and customers. In this paper, we study a novel service reliability augmentation problem in a Mobile Edge-Cloud (MEC) network, where mobile users request network services by issuing requests with service function chain (SFC) requirements and reliability expectations, and an admitted request may not meet its reliability expectation initially. To enhance a request's service reliability toward its expectation, it is common practice to make use of redundant backups, that is, to place redundant instances of each Virtual Network Function (VNF) in its SFC in case the primary VNF instance fails. In this paper, we aim to augment the reliability of each admitted request as much as possible, with the ultimate objective of reaching its reliability expectation, subject to the computing capacity of each cloudlet in the network. To this end, we first formulate a novel service reliability augmentation problem. We then propose an integer linear program (ILP) solution for the admitted request, and develop a randomized algorithm with a provable approximation ratio at the cost of a moderate resource-constraint violation. We also devise an efficient heuristic algorithm for the problem without any resource-constraint violation. Finally, we evaluate the performance of the proposed algorithms through experimental simulations. The results demonstrate that the proposed algorithms are promising. Paper
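The standard reliability model behind redundant VNF backups helps make the problem concrete: a VNF with per-instance reliability r and k backups survives with probability 1 - (1 - r)^(k + 1), and an SFC's reliability is the product over its VNFs. The sketch below computes this and greedily places backups where they raise chain reliability the most, under a single assumed capacity budget; it is a naive stand-in for the paper's ILP, randomized, and heuristic algorithms, with made-up numbers.

```python
# Reliability of an SFC with redundant VNF backups, plus a naive greedy placer.
def chain_reliability(rel, backups):
    prod = 1.0
    for r, k in zip(rel, backups):
        prod *= 1.0 - (1.0 - r) ** (k + 1)   # VNF survives if any of its instances survives
    return prod

def greedy_backups(rel, capacity):
    backups = [0] * len(rel)
    for _ in range(capacity):                # place one backup instance per unit of capacity
        gains = []
        for i in range(len(rel)):
            trial = backups.copy()
            trial[i] += 1
            gains.append(chain_reliability(rel, trial) - chain_reliability(rel, backups))
        backups[max(range(len(rel)), key=gains.__getitem__)] += 1
    return backups

rel = [0.95, 0.90, 0.99]                     # per-instance reliability of a 3-VNF SFC (assumed)
b = greedy_backups(rel, capacity=3)
print(b, round(chain_reliability(rel, b), 4))
```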
8C: Storage and I/O Optimization Zoom Meeting C Sriram Krishnamoorthy
OPS: Optimized Shuffle Management System for Apache Spark
Yuchen Cheng, Chunghsuan Wu, Yanqiang Liu, and Rui Ren (Shanghai Jiao Tong University); Hong Xu (City University of Hong Kong); Bin Yang (Intel Corporation); and Zhengwei Qi (Shanghai Jiao Tong University) Abstract Video Link Slides Link In recent years, distributed computing frameworks such as Hadoop MapReduce and Spark have been widely used for big data processing. With the explosive growth in data volume, companies tend to store the intermediate data of the shuffle phase on disk instead of in memory, so the shuffle phase involves both intensive network and disk I/O. To reduce this overhead, we propose OPS, an open-source distributed shuffle management system based on Spark that provides an independent shuffle service for Spark. By using early-merge and early-shuffle strategies, OPS alleviates the I/O overhead of the shuffle phase and efficiently schedules I/O and computing resources. OPS also uses a slot-based scheduling algorithm to predict and compute the optimal scheduling of reduce tasks. In addition, OPS provides a taint-redo strategy to ensure the fault tolerance of computing jobs. We evaluated the performance of OPS on a 100-node Amazon AWS EC2 cluster. Overall, OPS reduces shuffle overhead by nearly 50%. In the HiBench test cases, OPS improves end-to-end completion time by nearly 30% on average.
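As a generic illustration of the early-merge idea mentioned above, the snippet below appends each finished map task's output into one file per reduce partition, so a reducer later reads a single merged stream instead of many small fragments. The file layout and naming are assumptions for illustration only, not the OPS implementation.

```python
# Generic "early merge" of shuffle output into per-reduce-partition files.
import os

def early_merge(map_output, shuffle_dir):
    """map_output: dict {reduce_partition_id: bytes} produced by one map task."""
    os.makedirs(shuffle_dir, exist_ok=True)
    for partition_id, data in map_output.items():
        # Append incrementally as soon as the map task finishes.
        with open(os.path.join(shuffle_dir, f"reduce-{partition_id}.data"), "ab") as f:
            f.write(data)

# Usage: call early_merge as each map task completes; the reducer for
# partition p then reads shuffle_dir/reduce-p.data sequentially.
early_merge({0: b"k1\tv1\n", 1: b"k2\tv2\n"}, "ops-shuffle-demo")
early_merge({0: b"k3\tv3\n"}, "ops-shuffle-demo")
```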
SeRW: Adaptively Separating Read and Write upon SSDs of Hybrid Storage Server in Clouds
Fan Deng, Qiang Cao, Shucheng Wang, Shuyang Liu, and Jie Yao (Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology) and Yuanyuan Dong and Puyuan Yang (Alibaba Inc.) Abstract Video Link Slides Link Nowadays, cloud providers embrace hybrid storage servers to reap both the high I/O performance of solid-state drives (SSDs) and the low cost of hard disk drives (HDDs). These hybrid storage servers generally employ SSDs as primary storage, directly serving requests from front-end applications, while using HDDs as secondary storage to provide sufficient capacity.
Scalable Coordination of Hierarchical Parallelism
Vinay Devadas and Matthew Curtis-Maury (NetApp, Inc) Abstract Video Link Slides Link Given continually increasing core counts, multiprocessor software scaling becomes critical. Applications that operate on hierarchical data are especially difficult to parallelize efficiently. In such applications, correct execution relies on all threads coordinating their accesses within the hierarchy. At the same time, high-performance execution requires that this coordination happen efficiently while maximizing parallelism.
Mass: Workload-Aware Storage Policy for OpenStack Swift
Yu Chen, Wei Tong, Dan Feng, and Zike Wang (Huazhong University of Science and Technology) Abstract Video Link Slides Link A cloud object store must serve workloads from multiple tenants. Different applications have different access characteristics and performance requirements, so it is necessary for the cloud object store to provide tenant-specific policies. OpenStack Swift offers a storage policy mechanism that enables separate storage configurations for different applications. However, the limited configurability of this policy mechanism makes it difficult to provide efficient and flexible policies that meet the evolving needs of applications. First, the original policies control only the frontend paths of request forwarding, without covering the backend paths of request processing, and thus cannot provide sufficient optimization of workload performance. Second, these policies must be pre-defined and remain unchanged at runtime, so they lack the flexibility to adapt to workload changes during runtime. In addition, effective execution of tenant-specific policies assumes that all storage layers are aware of a request's source, whereas the storage server layer lacks the information required to distinguish requests from different tenants. Paper
|