Monday, August 17th
8:00am-10:00am SRMPDS: International Workshop on Scheduling and Resource Management for Parallel and Distributed...
Zoom Meeting B
Workshop
10:00am-11:00am EMS: International Workshop on Embedded Multicore Systems
Zoom Meeting A
Workshop
12:00pm-2:45pm P2S2: International Workshop on Parallel Programming Models and Systems Software for High-End Com...
Zoom Meeting C
Workshop
8:00pm-9:30pm Software Stack for Hardware Accelerators Workshop (SSHAW)
Zoom Meeting C
Workshop
10:00pm-11:00pm AWASN: International Workshop on Applications of Wireless Ad hoc and Sensor Networks
Meeting Link Provided by Chairs
Workshop
Asynchronous EXA_PMRA: International Workshop on Performance modelling, Runtime System and Applications at the Exascale
Workshop
Tuesday, August 18th
11:00am-11:15am Opening Remarks
Zoom Meeting A
J. Nelson Amaral
Opening Remark
11:15am-11:55am Keynote-1: Kathy Yelick, U.C. Berkeley
Zoom Meeting A
Xipeng Shen
Genomic Analysis and Learning at Scale: Mapping Irregular Computations to Advanced Architectures
Genomic Analysis and Learning at Scale: Mapping Irregular Computations to Advanced Architectures
Kathy Yelick (University of California, Berkeley; Lawrence Berkeley National Laboratory) Abstract Genomic data sets are growing dramatically as the cost of sequencing continues to decline and small sequencing devices become widely available. Enormous community databases store and share this data with the research community, and some of the data analysis problems require large-scale computational platforms to meet both the memory and computational requirements. These computations range from analysis and correction of raw genomic data to higher-level machine learning approaches. These applications differ from scientific simulations and place different requirements on programming support, software libraries, and parallel architectural design. For example, they involve irregular communication patterns such as asynchronous updates to shared data. The ExaBiome project, part of the Exascale Computing Project, is developing high performance tools for analyzing microbial data, which is especially challenging as hundreds of species may be collected and sequenced in a single sample from a human, animal or environmental microbiome. I will give an overview of several high performance genomics analysis problems, including alignment, profiling, clustering, and assembly, and describe some of the challenges and opportunities of mapping these to current petascale and future exascale architectures, including GPU-based systems. I will also describe some of the common computational patterns or "motifs" that inform parallelization strategies and can be useful in understanding architectural requirements, algorithmic approaches, and benchmarking current and future systems. Keynote 12:05pm-12:45pm Best-Paper Candidates Zoom Meeting A Lizy John Huffman Coding with Gap Arrays for GPU Acceleration
Huffman Coding with Gap Arrays for GPU Acceleration Best Paper Candidate
Naoya Yamamoto, Koji Nakano, Yasuaki Ito, and Daisuke Takafuji (Hiroshima University) and Akihiko Kasagi and Tsuguchika Tabaru (Fujitsu Laboratories Ltd.) Abstract Video Link Slides Link Huffman coding is a fundamental lossless data compression algorithm used in many data compression file formats such as gzip, zip, png, and jpeg. Huffman encoding converts an 8-bit symbol sequence into a codeword sequence with variable-length codewords based on a codebook. Huffman encoding is easy to parallelize, because all 8-bit symbols can be converted into codewords independently. On the other hand, since an encoded codeword sequence has no separator to identify each codeword, Huffman decoding is hard to parallelize. The main contribution of this paper is to improve previously presented GPU implementations for Huffman encoding and decoding. We also present a new data structure called a gap array to be attached to an encoded codeword sequence of Huffman coding for accelerating parallel Huffman decoding. Our improved GPU implementations for Huffman encoding and decoding use several acceleration techniques: (1) the Single Kernel Soft Synchronization (SKSS), (2) wordwise global memory access, and (3) compact codebooks. The experimental results for 10 files on an NVIDIA Tesla V100 GPU show that our GPU Huffman encoding and decoding are 2.87x-7.70x and 1.26x-2.63x faster than previously presented GPU Huffman encoding and decoding, respectively. Also, Huffman decoding can be further accelerated by a factor of 1.67x-6450x if a gap array is attached to an encoded codeword sequence. Since the size and computing overhead of gap arrays in Huffman encoding are quite small, we conclude that gap arrays should be introduced for GPU Huffman encoding and decoding. CapelliniSpTRSV: A Thread-Level Synchronization-Free Sparse Triangular Solve on GPUs
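The gap-array idea lends itself to a compact illustration. The Python sketch below is only a hedged sketch of the concept described in the abstract: the fixed segment length, the dictionary-based Huffman tree, and all helper names are assumptions, not the paper's GPU implementation (which additionally relies on SKSS, wordwise memory access, and compact codebooks).

```python
# Hedged sketch of the gap-array idea for parallel Huffman decoding.
# SEGMENT_BITS and the dict-based Huffman tree are illustrative assumptions.

SEGMENT_BITS = 4096  # assumed fixed segment length in bits

def encode_with_gaps(symbols, codebook):
    """Encode symbols; record, for each segment boundary, the bit offset (gap)
    from that boundary to the first codeword that starts at or after it."""
    bits, gaps, next_seg = [], {}, 0
    for s in symbols:
        start = len(bits)                      # this codeword starts at bit `start`
        while next_seg * SEGMENT_BITS <= start:
            gaps[next_seg] = start - next_seg * SEGMENT_BITS
            next_seg += 1
        bits.extend(codebook[s])               # codeword given as a string of '0'/'1'
    return bits, gaps

def decode_segment(bits, gaps, seg, tree):
    """Decode all codewords that start inside segment `seg`. With the gap array,
    each segment (a thread/block on a GPU) can start decoding without scanning
    the whole stream from bit 0 to find a codeword boundary."""
    pos = seg * SEGMENT_BITS + gaps[seg]
    end = min((seg + 1) * SEGMENT_BITS, len(bits))
    out = []
    while pos < end:
        node = tree
        while isinstance(node, dict):          # walk the Huffman tree bit by bit
            node = node[bits[pos]]
            pos += 1
        out.append(node)                       # leaf reached: emit the symbol
    return out
```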
CapelliniSpTRSV: A Thread-Level Synchronization-Free Sparse Triangular Solve on GPUs Best Paper Candidate
Jiya Su and Feng Zhang (Renmin University of China), Weifeng Liu (China University of Petroleum), Bingsheng He (National University of Singapore), Ruofan Wu and Xiaoyong Du (Renmin University of China), and Rujia Wang (Illinois Institute of Technology) Abstract Video Link Slides Link Sparse triangular solves (SpTRSVs) have been extensively used in linear algebra fields, and many GPU-based SpTRSV algorithms have been proposed. Synchronization-free SpTRSVs, due to their short preprocessing time and high performance, are currently the most popular SpTRSV algorithms. However, we observe that the performance of those SpTRSV algorithms on different matrices can vary by up to 845 times. Our further studies show that when the average number of components per level is high and the average number of nonzero elements per row is low, those SpTRSVs exhibit extremely low performance. The reason is that they use a warp on the GPU to process a row in sparse matrices, and such warp-level designs have severe underutilization of the GPU. To solve this problem, we propose CapelliniSpTRSV, a thread-level synchronization-free SpTRSV algorithm. Particularly, CapelliniSpTRSV has three novel features. First, unlike previous studies, CapelliniSpTRSV does not need preprocessing to calculate levels. Second, CapelliniSpTRSV exhibits high performance on matrices that previous SpTRSVs cannot handle efficiently. Third, CapelliniSpTRSV's optimization does not rely on a specific sparse matrix storage format. Instead, it achieves very good performance on the most popular sparse matrix storage format, compressed sparse row (CSR), so users do not need to conduct format conversion. We evaluate CapelliniSpTRSV with 245 matrices from the Florida Sparse Matrix Collection on three GPU platforms, and experiments show that our SpTRSV achieves 6.84 GFLOPS, which is a 4.97x speedup over the state-of-the-art synchronization-free SpTRSV algorithm and a 4.74x speedup over the SpTRSV in cuSPARSE. CapelliniSpTRSV is open-sourced at https://github.com/JiyaSu/CapelliniSpTRSV. SkyChain: A Deep Reinforcement Learning-Empowered Dynamic Blockchain Sharding System
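For reference, the row-wise recurrence that any CSR-based SpTRSV must evaluate can be written as a short sequential sketch. CapelliniSpTRSV's contribution is to run this recurrence on the GPU with one thread per row and without synchronization or preprocessing, which the sketch below does not attempt to reproduce; the layout assumption (diagonal stored as each row's last nonzero, i.e., sorted column indices in a lower-triangular matrix) is ours.

```python
import numpy as np

def sptrsv_csr_lower(row_ptr, col_idx, vals, b):
    """Sequential reference: solve L x = b for lower-triangular L in CSR,
    assuming sorted column indices so the diagonal is each row's last nonzero."""
    n = len(b)
    x = np.zeros(n)
    for i in range(n):
        s = b[i]
        for k in range(row_ptr[i], row_ptr[i + 1] - 1):
            s -= vals[k] * x[col_idx[k]]       # depends on previously solved rows
        x[i] = s / vals[row_ptr[i + 1] - 1]    # divide by the diagonal entry
    return x

# Example: L = [[2, 0], [1, 4]], b = [2, 6]  ->  x = [1.0, 1.25]
# sptrsv_csr_lower([0, 1, 3], [0, 0, 1], [2.0, 1.0, 4.0], np.array([2.0, 6.0]))
```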
SkyChain: A Deep Reinforcement Learning-Empowered Dynamic Blockchain Sharding System Best Paper Candidate
Jianting Zhang, Zicong Hong, and Xiaoyu Qiu (Sun Yat-sen University); Yufeng Zhan and Song Guo (The Hong Kong Polytechnic University); and Wuhui Chen (Sun Yat-sen University, zhangjt26@mail2.sysu.edu.cn) Abstract Video Link Slides Link To overcome the limitations on the scalability of current blockchain systems, sharding is widely considered as a promising solution that divides the network into multiple disjoint groups processing transactions in parallel to improve throughput while decreasing the overhead of communication, computation, and storage. However, most existing blockchain sharding systems adopt a static sharding policy that cannot efficiently deal with the dynamic environment in the blockchain system, i.e., joining and leaving of nodes, and malicious attack. This paper presents SkyChain, a novel dynamic sharding-based blockchain framework to achieve a good balance between performance and security without compromising scalability under the dynamic environment. We first propose an adaptive ledger protocol to guarantee that the ledgers can merge or split efficiently based on the dynamic sharding policy. Then, to optimize the sharding policy under dynamic environment with high dimensional system states, a deep reinforcement learning-based sharding approach has been proposed, the goals of which include: 1) building a framework to evaluate the blockchain sharding systems from the aspects of performance and security; 2) adjusting the re-sharding interval, shard number and block size to maintain a long-term balance of the system's performance and security. Experimental results show that SkyChain can effectively improve the performance and security of the sharding system without compromising scalability under the dynamic environment in the blockchain system. GOSH: Embedding Big Graphs on Small Hardware
GOSH: Embedding Big Graphs on Small Hardware Best Paper Candidate
Taha Atahan Akyildiz, Amro Alabsi Aljundi, and Kamer Kaya (Sabancı University) Abstract Video Link Slides Link In graph embedding, the connectivity information of a graph is used to represent each vertex as a point in a d-dimensional space. Unlike the original irregular structural information, such a representation can be used for a multitude of machine learning tasks. Although the process is extremely useful in practice, it is indeed expensive, and unfortunately the graphs are becoming larger and harder to embed. Attempts at scaling up the process to larger graphs have been successful but often at a steep price in hardware requirements. We present GOSH, an approach for embedding graphs of arbitrary sizes on a single GPU with minimum constraints. GOSH utilizes a novel coarsening approach to minimize the work required for embedding, delivering high-quality embeddings at a fraction of the time compared to the state-of-the-art. In addition to this, it incorporates a decomposition schema that enables any arbitrarily large graph to be embedded using a single GPU with minimum constraints on the memory size. With these techniques, GOSH is able to embed a graph with over 65 million vertices and 1.8 billion edges in less than an hour on a single GPU and obtains a 93% AUCROC for link prediction, which can be increased to 95% by running the tool for 80 minutes. Paper 12:55pm-1:25pm 1A: Distributed Systems Zoom Meeting A Martin Kong CARD: A Congestion-Aware Request Dispatching Scheme for Replicated Metadata Server Cluster
CARD: A Congestion-Aware Request Dispatching Scheme for Replicated Metadata Server Cluster
Shangming Cai (Tsinghua University), Dongsheng Wang (Tsinghua University and Peng Cheng Laboratory), and Zhanye Wang and Haixia Wang (Tsinghua University) Abstract Video Link Slides Link A replicated metadata server cluster (RMSC) is highly efficient for distributed filesystems facing data-driven scenarios (e.g., massive-scale distributed machine learning tasks). Yet, when considering cost-effectiveness and system utilization, the cluster scale is commonly restricted in practice. Within this context, servers in the cluster start to suffer from load-oscillations at higher system utilization due to clients' congestion-unaware behaviors and unintelligent selection strategies (i.e., servers in the cluster are preferred and then evaded intermittently). The consequences brought by load-oscillations degrade the overall performance of the whole system to some extent. One solution to this problem is to have clients share part of the responsibility and behave more wisely for the sake of stability. In this paper, we present a Congestion-Aware Request Dispatching scheme, CARD, which is mainly conducted at clients and directed by a rate control mechanism. Through extensive experiments, we verify that CARD is highly efficient in resolving load-oscillations in RMSC. Apart from this, our results show that RMSC with our congestion-aware optimization achieves better scalability compared to previous implementations under targeted workloads, especially in heterogeneous environments. Safe, Fast Sharing of memcached as a Protected Library
Safe, Fast Sharing of memcached as a Protected Library
Chris Kjellqvist, Mohammad Hedayati, and Michael L. Scott (University of Rochester) Abstract Video Link Slides Link Memcached is a widely used key-value store. It is structured as a multithreaded user-level server, accessed over socket connections by a potentially distributed collection of clients. Because socket communication is so much more expensive than a single operation on a K-V store, much of the client library is devoted to batching of requests. Batching is not always feasible, however, and the cost of communication seems particularly unfortunate when -- as is often the case -- clients are co-located on a single machine with the server, and have access to the same physical memory. DQEMU: A Scalable Emulator with Retargetable DBT on Distributed Platforms
DQEMU: A Scalable Emulator with Retargetable DBT on Distributed Platforms
Ziyi Zhao, Zhang Jiang, Ximing Liu, and Xiaoli Gong (Nankai University); Wenwen Wang (University of Georgia); and Pen-Chung Yew (University of Minnesota at Twin Cities) Abstract Video Link Slides Link The scalability of a dynamic binary translation (DBT) system has become important due to the prevalence of multicore systems and large multi-threaded applications. Several recent efforts have addressed some critical issues in extending a DBT system to run on multicore platforms for better scalability. In this paper, we present a distributed DBT framework, called DQEMU, that goes beyond a single-node multicore processor and can be scaled up to a cluster of multi-node servers. Paper 1B: Edge Learning and Inference Zoom Meeting B Chih-Chieh Yang ShadowTutor: Distributed Partial Distillation for Mobile Video DNN Inference
ShadowTutor: Distributed Partial Distillation for Mobile Video DNN Inference
Jae-Won Chung, Jae-Yun Kim, and Soo-Mook Moon (Seoul National University) Abstract Video Link Slides Link Following the recent success of deep neural networks (DNN) on video computer vision tasks, performing DNN inferences on videos that originate from mobile devices has gained practical significance. As such, previous approaches developed methods to offload DNN inference computations for images to cloud servers to manage the resource constraints of mobile devices. However, when it comes to video data, communicating information of every frame consumes excessive network bandwidth and renders the entire system susceptible to adverse network conditions such as congestion. Thus, in this work, we seek to exploit the temporal coherence between nearby frames of a video stream to mitigate network pressure. That is, we propose ShadowTutor, a distributed video DNN inference framework that reduces the number of network transmissions through intermittent knowledge distillation to a student model. Moreover, we update only a subset of the student's parameters, which we call partial distillation, to reduce the data size of each network transmission. Specifically, the server runs a large and general teacher model, and the mobile device only runs an extremely small but specialized student model. On sparsely selected key frames, the server partially trains the student model by targeting the teacher's response and sends the updated part to the mobile device. We investigate the effectiveness of ShadowTutor with HD video semantic segmentation. Evaluations show that network data transfer is reduced by 95% on average. Moreover, the throughput of the system is improved by over three times and shows robustness to changes in network bandwidth. FEEL: A Federated Edge Learning System for Efficient and Privacy-Preserving Mobile Healthcare
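The partial-distillation step can be pictured with a generic PyTorch sketch; the module names ("backbone", "head"), the MSE loss, and the choice of which parameter subset to update are assumptions for illustration and do not reproduce ShadowTutor's implementation.

```python
# Generic partial-distillation step: update only a designated subset of the
# student's parameters against the teacher's output on one key frame.
import torch
import torch.nn.functional as F

def partial_distill_step(student, teacher, frame, optimizer):
    # Freeze everything except the part we intend to ship over the network.
    for p in student.backbone.parameters():
        p.requires_grad = False
    for p in student.head.parameters():
        p.requires_grad = True

    with torch.no_grad():
        target = teacher(frame)            # teacher's dense prediction

    pred = student(frame)
    loss = F.mse_loss(pred, target)        # match the teacher's response
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Only the updated subset needs to be transmitted to the mobile device.
    return {k: v.detach().cpu() for k, v in student.head.state_dict().items()}
```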
FEEL: A Federated Edge Learning System for Efficient and Privacy-Preserving Mobile Healthcare
Yeting Guo (National University of Defense Technology), Fang Liu (Sun Yat-Sen University), Zhiping Cai (National University of Defense Technology), Li Chen (University of Louisiana at Lafayette), and Nong Xiao (National University of Defense Technology) Abstract Video Link Slides Link With the prosperity of artificial intelligence, neural networks have been increasingly applied in healthcare for a variety of tasks for medical diagnosis and disease prevention. Mobile wearable devices, widely adopted by hospitals and health organizations, serve as emerging sources of medical data and participate in the training of neural network models for accurate model inference. Since the medical data are privacy-sensitive and non-shareable, federated learning has been proposed to train a model across decentralized data, which involves each mobile device running a training task with its own data. However, due to the ever-increasing size and complexity of modern neural network models, it becomes inefficient, and may even be infeasible, to perform training tasks on wearable devices that are resource-constrained. In this paper, we propose a FEderated Edge Learning system, FEEL, for efficient privacy-preserving mobile healthcare. Specifically, we design an edge-based training task offloading strategy to improve the training efficiency. Further, we build our system on the basis of federated learning to make use of private user data on multiple devices with the guarantee that the data is kept locally. In addition, during model training, we provide a differential privacy scheme to strengthen the privacy protection. A prototype system has been implemented to evaluate the efficiency, inference performance and noise sensitivity of mobile medical model training, and the results demonstrate that our proposal can train models in an efficient and privacy-preserving way. Adaptive Distributed Convolutional Neural Network Inference at the Network Edge with ADCNN
Adaptive Distributed Convolutional Neural Network Inference at the Network Edge with ADCNN
Sai Qian Zhang (Harvard University), Jieyu Lin (University of Toronto), and Qi Zhang (Microsoft) Abstract Video Link Slides Link The emergence of the Internet of Things (IoT) has led to a remarkable increase in the volume of data generated at the network edge. In order to support real-time smart IoT applications, massive amounts of data generated from edge devices need to be processed using methods such as deep neural networks (DNNs) with low latency. To improve application performance and minimize resource cost, enterprises have begun to adopt Edge computing, a computation paradigm that advocates processing input data locally at the network edge. However, as edge nodes are often resource-constrained, running data-intensive DNN inference tasks on each individual edge node often incurs high latency, which seriously limits the practicality and effectiveness of this model. Paper 1C: Memory Systems Zoom Meeting C Alaa Alameldeen An Efficient Wear-level Architecture using Self-adaptive Wear Leveling
An Efficient Wear-level Architecture using Self-adaptive Wear Leveling
Jianming Huang, Yu Hua, Pengfei Zuo, Wen Zhou, and Fangting Huang (Huazhong University of Science & Technology) Abstract Video Link Slides Link Non-volatile memory (NVM) is becoming the main device for next-generation memory due to its high density, near-zero standby power, non-volatility and byte-addressability. The multi-level cell (MLC) technique has been used in non-volatile memory to significantly increase device density and capacity, which however leads to much weaker endurance than the single-level cell (SLC) counterpart. Although wear-leveling techniques can mitigate this weakness in MLC, the improvement for MLC-based NVM is very limited because a uniform write distribution is not achieved before some cells are actually worn out. To address this problem, our paper proposes a self-adaptive wear-leveling (SAWL) scheme for MLC-based NVM. The idea behind SAWL is to dynamically tune the wear-leveling granularities and balance the writes across the cells of the entire memory, thus achieving a suitable tradeoff between lifetime and cache hit rate. Moreover, to reduce the size of the address-mapping table, SAWL maintains a few recently-accessed mappings in a small on-chip cache. Experimental results demonstrate that SAWL significantly improves the NVM lifetime and performance, compared with state-of-the-art schemes. CCHL: Compression-Consolidation Hardware Logging for Efficient Failure-Atomic Persistent Memory Updates
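The on-chip cache of recently accessed mappings can be illustrated with a small sketch; the table organization, the LRU policy, and the cache size below are illustrative assumptions rather than SAWL's actual hardware design.

```python
# Sketch of an address-remapping lookup backed by a small cache of hot mappings.
from collections import OrderedDict

class RemapCache:
    def __init__(self, full_table, capacity=64):
        self.full_table = full_table      # complete logical->physical map (off-chip)
        self.cache = OrderedDict()        # small on-chip cache of recent mappings
        self.capacity = capacity

    def translate(self, logical_block):
        if logical_block in self.cache:            # hit: serve from the small cache
            self.cache.move_to_end(logical_block)
            return self.cache[logical_block]
        physical = self.full_table[logical_block]  # miss: consult the full table
        self.cache[logical_block] = physical
        if len(self.cache) > self.capacity:        # evict the least recently used entry
            self.cache.popitem(last=False)
        return physical
```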
CCHL: Compression-Consolidation Hardware Logging for Efficient Failure-Atomic Persistent Memory Updates
Xueliang Wei, Dan Feng, Wei Tong, Jingning Liu, Chengning Wang, and Liuqing Ye (Huazhong University of Science and Technology) Abstract Video Link Slides Link Non-volatile memory (NVM) is emerging as a fast byte-addressable persistent memory (PM) that promises data persistence at the main memory level. One of the common choices for providing failure-atomic updates in PM is the write-ahead logging (WAL) technique. To mitigate logging overhead, recent studies propose WAL-based hardware logging designs that overlap log writes with transaction execution. However, existing hardware logging designs incur a large number of unnecessary log writes. Many log writes are still performed in the critical path, which causes high performance overhead, particularly for the multi-core systems with many threads. Balancing Fairness and Efficiency for Cache Sharing in Semi-external Memory System
Balancing Fairness and Efficiency for Cache Sharing in Semi-external Memory System
Shanjiang Tang, Qifei Chai, and Ce Yu (Tianjin University); Yusen Li (Nankai University); and Chao Sun (Tianjin University) Abstract Video Link Slides Link Data caching and sharing is an effective approach for achieving high performance for many applications in shared platforms such as the cloud. RAM and SSD are two popular caching devices widely used by many large-scale data application systems such as Hadoop and Spark. Due to the limited size of RAM as well as the large access latency of SSD, there is a trend of integrating RAM and SSD (called semi-external memory) together for large-scale data caching. Paper 1:35pm-2:05pm 2A: Fault-Tolerance Zoom Meeting A Ben Karsin Algorithm-Based Checkpoint-Recovery for the Conjugate Gradient Method
Algorithm-Based Checkpoint-Recovery for the Conjugate Gradient Method
Carlos Pachajoa, Christina Pacher, Markus Levonyak, and Wilfried N. Gansterer (University of Vienna) Abstract Video Link Slides Link As computers reach exascale and beyond, the incidence of faults will increase. Solutions to this problem are an active research topic. We focus on strategies to make the preconditioned conjugate gradient (PCG) solver resilient against node failures, specifically, the exact state reconstruction (ESR) method, which exploits redundancies in PCG. Robustness of the Young/Daly formula for stochastic iterative applications
Robustness of the Young/Daly formula for stochastic iterative applications
Yishu Du (Tongji University Shanghai, ENS Lyon); Loris Marchal (ENS Lyon); Guillaume Pallez (INRIA Bordeaux); and Yves Robert (ENS Lyon, University of Tennessee) Abstract Video Link Slides Link The Young/Daly formula for periodic checkpointing is known to hold for a divisible-load application where one can checkpoint at any time-step. In a nutshell, the optimal period is P = \sqrt{2 MUF C}, where MUF is the Mean Time Between Failures (MTBF) and C the checkpoint time. Assuming unit execution time, P is also the amount of work executed. This paper assesses the accuracy of the formula for applications decomposed into computational segments where: (i) the duration of a segment is stochastic, i.e., obeys a probability distribution law Distrib of mean MUD; and (ii) one can checkpoint only at the end of a segment. We first consider static strategies where checkpoints are taken after a given number of segments k, and we show that using the Young/Daly formula to compute k (as k MUD = P) is asymptotically optimal among such strategies, and remains accurate even when the distribution Distrib has a large standard deviation. We also consider dynamic strategies where one decides to checkpoint at the end of a segment only if the total amount of work since the last checkpoint exceeds a threshold WTH, and otherwise proceeds to the next segment; we show that an approximation of the optimal value of WTH is P. Finally, we provide an extensive set of simulations where Distrib is either Uniform, Gamma or truncated Normal, which show the global accuracy of the Young/Daly formula and establish that its relevance goes well beyond its original framework. Energy-aware strategies for reliability-oriented real-time task allocation on heterogeneous platforms
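Restated in display form with the abstract's own symbols (MUF for the MTBF, C for the checkpoint time, MUD for the mean segment duration), the period and the static checkpointing rule read:

```latex
P = \sqrt{2 \cdot \mathrm{MUF} \cdot C},
\qquad
k \cdot \mathrm{MUD} = P
\;\;\Longrightarrow\;\;
k = \frac{\sqrt{2 \cdot \mathrm{MUF} \cdot C}}{\mathrm{MUD}} .
```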
Energy-aware strategies for reliability-oriented real-time task allocation on heterogeneous platforms
Li Han (East China Normal University, ENS Lyon); Yiqin Gao (ENS Lyon); Jing Liu (East China Normal University); Yves Robert (ENS Lyon, University of Tennessee Knoxville); and Frédéric Vivien (ENS Lyon) Abstract Video Link Slides Link Low energy consumption and high reliability are widely identified as increasingly relevant issues in real-time systems on heterogeneous platforms. In this paper, we propose a multi-criteria optimization strategy to minimize the expected energy consumption while enforcing the reliability threshold and meeting all task deadlines. The tasks are replicated to ensure a prescribed reliability threshold. The platforms are composed of processors with different (and possibly unrelated) characteristics, including speed profile, energy cost and failure rate. We provide several mapping and scheduling heuristics towards this challenging optimization problem. Specifically, a novel approach is designed to control (i) how many replicas to use for each task, (ii) on which processor to map each replica and (iii) when to schedule each replica on its assigned processor. Different mappings achieve different levels of reliability and consume different amounts of energy. Scheduling matters because once a task replica is successful, the other replicas of that task are cancelled, which calls for minimizing the amount of temporal overlap between any replica pair. The experiments are conducted for a comprehensive set of execution scenarios, with a wide range of processor speed profiles and failure rates. The comparison results reveal that our strategies perform better than the random baseline, with a gain of 40% in energy consumption, for nearly all cases. The absolute performance of the heuristics is assessed by a comparison with a lower bound; the best heuristics achieve an excellent performance, with an average value only 4% higher than the lower bound. Paper 2B: Scheduling and Placement in Networks Zoom Meeting B Bin Ren Cooperative Game for Multiple Chargers with Dynamic Network Topology
Cooperative Game for Multiple Chargers with Dynamic Network Topology
Chi Lin, Ziwei Yang, and Yu Sun (Dalian University of Technology); Jing Deng (University of North Carolina); Lei Wang (University of North Carolina at Greensboro); and Guowei Wu (Dalian University of Technology) Abstract Video Link Slides Link Recent breakthroughs in wireless power transfer technology have enabled wireless sensor networks to operate virtually forever with the help of mobile chargers (MCs), thus giving rise to the concept of wireless rechargeable sensor networks (WRSNs). However, existing studies mainly focus on developing charging tours with a fixed network topology, most of which are not suitable for networks with dynamic topology, usually leading to massive packet/data loss. In this work, we explore the problem of charging scheduling for WRSNs with multiple MCs when confronted with dynamic topology. To minimize the energy cost and prolong the network lifetime, we convert the charging scheduling problem into a vehicle routing problem, which is proved to be NP-hard. Then we model the problem as a cooperative game played among sensors and propose a cooperative game theoretical charging scheduling (CGTCS) algorithm to construct the optimal coalition structure. Then, we design an adaptive optimal coalition structure updating algorithm (AOCSU) to update the optimal coalition structure, which works well with network dynamics. We discuss the rationale and feasibility of guaranteeing cooperation among sensors by carefully designing the characteristic function and allocating cost based on the Shapley value. Finally, test-bed experiments and simulations are conducted, revealing that CGTCS outperforms other related works in terms of expenditure ratio, total traveling cost, and charging time. Optimizing Flow Bandwidth Consumption with Traffic-diminishing Middlebox Placement
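The Shapley-value cost allocation mentioned above can be made concrete with a generic (exponential-time) sketch; the toy characteristic function in the final comment is hypothetical, and the paper's actual characteristic function and allocation procedure are not reproduced here.

```python
import math
from itertools import permutations

def shapley_values(players, cost):
    """Average each player's marginal cost contribution over all join orders.
    `cost` maps a frozenset of players (a coalition) to that coalition's total cost."""
    phi = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            phi[p] += cost(coalition | {p}) - cost(coalition)
            coalition = coalition | {p}
    n_fact = math.factorial(len(players))
    return {p: v / n_fact for p, v in phi.items()}

# Toy example with a hypothetical charging-cost function:
#   shapley_values(["s1", "s2", "s3"], lambda S: 0 if not S else 10 + 2 * len(S))
```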
Optimizing Flow Bandwidth Consumption with Traffic-diminishing Middlebox Placement
Yang Chen, Jie Wu, and Bo Ji (Temple University) Abstract Video Link Slides Link With the evolution of Network Function Virtualization (NFV), the implementation of network services has shifted from dedicated hardware to software middleboxes. The placement of such middleboxes is complicated not only by the selection among multiple available hosting servers, but also by the traffic-changing effect of middleboxes. In this paper, we address the placement problem of a single type of traffic-diminishing middlebox, where the objective is to minimize the total bandwidth consumption when the total number of placed middleboxes is limited. We prove the NP-hardness of checking the feasibility of our problem in general topologies. Then we propose a greedy solution and prove that it is performance-guaranteed when it generates a feasible deployment. Next, we narrow our focus to tree-structured networks and propose an optimal dynamic-programming-based strategy. To improve time efficiency, we also introduce an efficient greedy solution built on an intuitive insight. Extensive simulations are conducted on a real-world dataset to evaluate the performance of our algorithms. Towards High-Efficiency Data Centers via Job-Aware Network Scheduling
Towards High-Efficiency Data Centers via Job-Aware Network Scheduling
Yang Shi, Mei Wen, and Chunyuan Zhang (National University of Defense Technology) Abstract Video Link Slides Link Distributed jobs typically face competition for multiple resources in modern data centers, especially for the network. Without effective network scheduling, this competition can cause low efficiency of the data center. Previous work on network scheduling has focused on reducing flow completion time or improving per-flow fairness. Yet, its effect on improving jobs' performance is limited by its unawareness of the relationships between communication and computation. In this paper, we focus on the problem of scheduling network resources for multiple jobs, with the specific objective of reducing the job completion time (JCT), which also makes the data center more efficient. With an in-depth investigation of communication and computation, we identify an opportunity to accelerate complicated, DAG-based modern jobs in a way that occupies less bandwidth. Accordingly, this paper proposes JIT, a job-aware network scheduler that leverages the computational graph to accelerate jobs effectively. To serve the goal of JIT, we first develop a mathematical model and formulate the scheduling problem as an integer linear programming (ILP) problem. We further prove, through rigorous theoretical analysis, that it has an equivalent linear programming (LP) problem, in order to solve the ILP problem efficiently. Some reasonable simplifications are also adopted to reduce the solving time of JIT to only 1 second. The proposed JIT is simulated and compared against several state-of-the-art designs, and the simulation results demonstrate that JIT can achieve an acceleration of up to 1.55x, which successfully improves the efficiency of the data center. Paper 2C: Systems for Machine Learning Zoom Meeting C Dingwen Tao DIESEL: A Dataset-Based Distributed Storage and Caching System for Large-Scale Deep Learning Training
DIESEL: A Dataset-Based Distributed Storage and Caching System for Large-Scale Deep Learning Training
Lipeng Wang (Hong Kong University of Science and Technology), Songgao Ye (SenseTime Research), Baichen Yang (Hong Kong University of Science and Technology), Youyou Lu (Tsinghua University), Hequan Zhang and Shengen Yan (SenseTime Research), and Qiong Luo (Hong Kong University of Science and Technology) Abstract Video Link Slides Link We observe three problems in existing storage and caching systems for deep-learning training (DLT) tasks: (1) accessing a dataset containing a large number of small files takes a long time, (2) global in-memory caching systems are vulnerable to node failures and slow to recover, and (3) repeatedly reading a dataset of files in shuffled orders is inefficient when the dataset is too large to be cached in memory. Therefore, we propose DIESEL, a dataset-based distributed storage and caching system for DLT tasks. Our approach is via a storage-caching system co-design. Firstly, since accessing small files is a metadata-intensive operation, DIESEL decouples the metadata processing from metadata storage, and introduces metadata snapshot mechanisms for each dataset. This approach speeds up metadata access significantly. Secondly, DIESEL deploys a task-grained distributed cache across the worker nodes of a DLT task. This way node failures are contained within each DLT task. Furthermore, the files are grouped into large chunks in storage, so the recovery time of the caching system is reduced greatly. Thirdly, DIESEL provides chunk-based shuffle so that the performance of random file access is improved without sacrificing training accuracy. Our experiments show that DIESEL achieves a linear speedup on metadata access, and outperforms an existing distributed caching system in both file caching and file reading. In real DLT tasks, DIESEL halves the data access time of an existing storage system, and reduces the training time by hours without changing any training code. E-LAS: Design and Analysis of Completion-Time Agnostic Scheduling for Distributed Deep Learning Cluster
E-LAS: Design and Analysis of Completion-Time Agnostic Scheduling for Distributed Deep Learning Cluster
Abeda Sultana and Li Chen (University of Louisiana at Lafayette), Fei Xu (East China Normal University), and Xu Yuan (University of Louisiana at Lafayette) Abstract Video Link Slides Link With the prosperity of deep learning, enterprises and large platform providers, such as Microsoft, Amazon, and Google, have built and provided GPU clusters to facilitate distributed deep learning training. As deep learning training workloads are heterogeneous, with a diverse range of characteristics and resource requirements, it becomes increasingly crucial to design an efficient and optimal scheduler for distributed deep learning jobs in the GPU cluster. This paper aims to propose a simple and yet effective scheduler, called E-LAS, with the objective of reducing the average training completion time of deep learning jobs. Without relying on the estimation or prior knowledge of job running times, E-LAS leverages the real-time epoch progress rate, unique to distributed deep learning training jobs, as well as the attained services from temporal and spatial domains, to guide the scheduling decisions. A theoretical analysis of E-LAS is conducted to offer a deeper understanding of the components of the scheduling criteria. Furthermore, we present a placement algorithm to achieve better resource utilization without involving much implementation overhead, complementary to the scheduling algorithm. Extensive simulations have been conducted, demonstrating that E-LAS improves the average job completion time (JCT) by 10x over an Apache YARN-based resource manager used in production. Moreover, E-LAS outperforms Tiresias, the state-of-the-art scheduling algorithm customized for deep learning jobs, by almost 1.5x for the average JCT as well as queuing time. ParSecureML: An Efficient Parallel Secure Machine Learning Framework on GPUs
ParSecureML: An Efficient Parallel Secure Machine Learning Framework on GPUs
Zheng Chen and Feng Zhang (Renmin University of China), Amelie Chi Zhou (Shenzhen University), Jidong Zhai (Tsinghua University), and Chenyang Zhang and Xiaoyong Du (Renmin University of China) Abstract Video Link Slides Link Machine learning has been widely used in our daily lives. Large amounts of data have been continuously produced and transmitted to the cloud for model training and data processing, which raises a problem: how to preserve the security of the data. Recently, a secure machine learning system named SecureML has been proposed to solve this issue using two-party computation. However, due to the excessive computation expenses of two-party computation, the secure machine learning is about 2x slower than the original machine learning methods. Previous work on secure machine learning mostly focused on novel protocols or improving accuracy, while the performance metric has been ignored. In this paper, we propose a GPU-based framework ParSecureML to improve the performance of secure machine learning algorithms based on two-party computation. The main challenges of developing ParSecureML lie in the complex computation patterns, frequent intra-node data transmission between CPU and GPU, and complicated inter-node data dependence. To handle these challenges, we propose a series of novel solutions, including profiling-guided adaptive GPU utilization, fine-grained double pipeline for intra-node CPU-GPU cooperation, and compressed transmission for inter-node communication. As far as we know, this is the first GPU-based secure machine learning framework. Compared to the state-of-the-art framework, ParSecureML achieves an average of 32.2x speedup. Paper 2:15pm-3:00pm
Poster Session Zoom Meeting C Paul Lu EPMA: Efficient Partial Message Access in IoT Era
EPMA: Efficient Partial Message Access in IoT Era
Hao Wu, Ziyue Jiang, Wei Liu, Yifan Gong, and Jiangming Jin (TuSimple) Abstract Video Link Slides Link With the development of communication technology and ubiquitous computing devices, the realization of internet of things (IoT) systems has become feasible for applications such as environment sensing, object tracking and system monitoring. As the number of sensors and devices increases, the data produced is vast. To efficiently process such amounts of data, data accesses play a crucial role. However, with cutting-edge industry middleware, such as ROS2, data access is inefficient for partial messages due to limitations of the data access interfaces, leading to significant waste of computation resources. This work proposes and implements a novel interface, named EPMA, to facilitate efficient partial message access. By evaluating an image persisting pipeline in an IoT system, the results show that EPMA lowers CPU usage by 87% compared to ROS2. Also, the EPMA interface is non-intrusive to the ROS2 ecosystem. Towards Parallelization of a Texture Description Algorithm for Breast Lesion Classification using OpenMP and CUDA
Towards Parallelization of a Texture Description Algorithm for Breast Lesion Classification using OpenMP and CUDA
Lisa Pal and Amilcar Meneses Viveros (Centro de Investigación y de Estudios Avanzados, Zacatenco) and Wilfrido Gómez Flores (Centro de Investigación y de Estudios Avanzados, Tamaulipas) Abstract Video Link Slides Link In this work, we improve the performance of a novel texture extraction method based on auto-mutual information (AMI) for classifying breast lesions. This particular algorithm was chosen since it was experimentally shown to outperform other commonly used texture classification methods. However, its time performance is also the lowest in comparison. The objective of this work is to make this method scalable and viable to apply to larger medical images, such as mammograms. To achieve this, we study the algorithm (which consists of ranklet transforms, with subsequent computation of AMI for each transformed image) and find the sections which can be executed in parallel. We parallelized this on the CPU using OpenMP and on the GPU using CUDA. The dataset used was composed of four sets of square images containing the region of interest (ROI) obtained from mammograms. The ROIs are of size 128, 256, 512, and 1024 pixels per side. Each of the four sets contained 100 images of the corresponding sizes. With OpenMP, a speedup of 8.37-10.27 was achieved using one computing node containing 12 Xeon processors. The CUDA implementation of the algorithm is currently in progress, but a performance much higher than OpenMP is expected. We expect the ranklet transform to be reduced from O(M^2 r^2 log r^2) to O(r^2 log r^2) and the mutual information to be reduced from the time required for 3n_r2(M-1)(M^2 log M^2) operations to that of n_r(M^2 log M^2), where M is the side of a square image, n_r is the number of window resolutions, and r is the window resolution. Jeor: Accelerate Linear Algebra Operation in SSDs
Jeor: Accelerate Linear Algebra Operation in SSDs
Xiaorui Wu and Hong Xu (City University of Hong Kong) and Yi Wang (Peng Cheng Laboratory) Abstract Slides Link GPUs are widely used as powerful computing devices in machine learning systems. However, current GPU memory is limited (∼10-32GB), which is much smaller than CPU memory (up to TBs). Supporting large matrix operations within limited GPU memory can enable data analysis pipelines with large datasets. We propose a library called Jeor to accelerate the execution of large matrix operations in SSDs within limited GPU memory. We also propose a pluggable scheduling policy manager, data reuse, and buffer reuse to help Jeor achieve high execution efficiency. We implement Jeor on top of existing computing libraries, such as cuBLAS. Testbed experiments show that Jeor achieves better GPU memory utilization than cublasXT while still achieving a 1.34x speedup. Saec: Similarity-Aware Embedding Compression in Recommendation Systems
Saec: Similarity-Aware Embedding Compression in Recommendation Systems
Xiaorui Wu and Hong Xu (City University of Hong Kong) and Honglin Zhang, Huaming Chen, and Jian Wang (Tencent) Abstract Slides Link Embedding vectors are commonly used to represent features of users and items in recommender systems. A practical challenge is that a large number of embedding vectors incurs a substantial memory footprint for serving queries, especially as the number of features grows. We propose an embedding compression framework called Saec to address this challenge. Saec exploits the similarity among features within a field, as they represent the same attribute of a user or item, and uses clustering to compress the embeddings. We propose a new fast clustering method that relies on the empirical heavy-tailed nature of features to drastically reduce the clustering overhead. We implement Saec on a production system and evaluate it with 10 days' worth of feature data from a large Internet company. Testbed experiments show that Saec consistently reduces the number of embedding vectors by two orders of magnitude without any performance degradation, with a compression rate of up to 27x. Poster
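The core compression step (clustering the embeddings within a field and keeping one centroid per cluster) can be sketched as follows; plain k-means is used here for clarity, whereas Saec's fast, heavy-tail-aware clustering and its choice of cluster count are not reproduced.

```python
import numpy as np
from sklearn.cluster import KMeans  # standard k-means, used only for illustration

def compress_field(embeddings: np.ndarray, n_clusters: int):
    """embeddings: (num_features_in_field, dim). Returns the centroid table and
    the per-feature centroid assignment, which together replace the full table."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(embeddings)
    return km.cluster_centers_, km.labels_

def lookup(centroids: np.ndarray, assignment: np.ndarray, feature_id: int):
    # Serving path: one extra indirection instead of storing every embedding row.
    return centroids[assignment[feature_id]]
```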
Wednesday, August 19th
11:00am-11:40am Keynote-2: Michael Schulte, AMD
Zoom Meeting A
Lizy John
Challenges and Opportunities for Extreme-Scale Computing
Challenges and Opportunities for Extreme-Scale Computing
Michael Schulte (AMD) Abstract Steady increases in computer performance, power efficiency, programmability, and scalability are needed to enable key discoveries in diverse fields ranging from medical science to astrophysics and climate modeling. In this talk, I will discuss major challenges in the design of extreme-scale computing systems, give an overview of emerging architectures and technologies to overcome these challenges, and describe important research in this area. I will also discuss Grand Challenge Problems for extreme-scale systems and how the combination of scientific computing, artificial intelligence, and data analytics can enable scientists to solve problems that previously were intractable. Keynote 11:50am-12:20pm 3A: Graph Processing and Concurrent Data Structures Zoom Meeting A Michela Becchi Graffix: Efficient Graph Processing with a Tinge of GPU-Specific Approximations
Graffix: Efficient Graph Processing with a Tinge of GPU-Specific Approximations
Somesh Singh and Rupesh Nasre (Indian Institute of Technology Madras) Abstract Video Link Slides Link Parallelizing graph algorithms on GPUs is challenging due to the irregular memory accesses involved in graph traversals. In particular, three important GPU-specific aspects affect performance: memory coalescing, memory latency, and thread divergence. In this work, we attempt to tame these challenges using approximate computing. We target graph applications on GPUs that can tolerate some degradation in the quality of the output for obtaining the result in short order. We propose three techniques for boosting the performance of graph processing on the GPU by injecting approximations in a controlled manner. The first one creates a graph isomorph that brings relevant nodes nearby in memory and adds controlled replicas of nodes to improve coalescing. The second reduces memory latency by adding edges among specific nodes and processing well-connected subgraphs iteratively inside shared memory. The third technique normalizes degrees across nodes assigned to a warp to reduce thread divergence. Each of the techniques offers notable performance benefits and provides a knob to control the amount of inaccuracy added to an execution. We demonstrate the effectiveness of the proposed techniques using a suite of five large graphs with varied characteristics and five popular graph algorithms. Optimizing Linearizable Bulk Operations on Data Structures
Optimizing Linearizable Bulk Operations on Data Structures
Matthew A. Rodriguez and Michael F. Spear (Lehigh University) Abstract Video Link Slides Link We study the problem of ensuring the correctness of concurrent programs that perform mutating foreach and range operations over concurrent data structures. We introduce three algorithms which vary in the location and the granularity of concurrency control metadata. Our algorithms make the linearization of bulk operations visible to concurrent elemental operations, which enables them to scale well, keep overhead low, and operate within tight memory bounds. In our experimental evaluation, we demonstrate that our techniques do not hinder the performance of elemental operations in elemental-only workloads, and allow scalability among concurrent mutating bulk operations. Furthermore, in mixed workloads, our algorithms outperform the baseline, sometimes by an order of magnitude or more. GraBi: Communication-Efficient and Workload-Balanced Partitioning for Bipartite Graphs
GraBi: Communication-Efficient and Workload-Balanced Partitioning for Bipartite Graphs
Feng Sheng and Qiang Cao (Huazhong University of Science and Technology, Wuhan National Laboratory for Optoelectronics); Hong Jiang (University of Texas at Arlington, Department of Computer Science and Engineering); and Jie Yao (Huazhong University of Science and Technology, School of Computer Science and Technology) Abstract Video Link Slides Link Machine Learning and Data Mining (MLDM) applications generally represent their input data in bipartite graphs with two disjoint vertex-subsets connected only by edges between them. Despite the prevalence of bipartite graphs, existing graph partitioning frameworks have rarely sufficiently exploited their unique structures, especially the highly lopsided subset sizes and extremely skewed vertex degrees. As a result of poor partitioning quality, high communication cost and severe workload imbalance arise during subsequent computation over these bipartite graphs in distributed environments, significantly hampering the performance of MLDM applications. Paper 3B: Large-Scale Applications on Supercomputers Zoom Meeting B Kamesh Madduri Large-scale Simulations of Peridynamics on Sunway Taihulight Supercomputer
Large-scale Simulations of Peridynamics on Sunway Taihulight Supercomputer
Xinyuan Li (Computer Network Information Center, Chinese Academy of Science; University of Chinese Academy of Science) and Huang Ye and Jian Zhang (Computer Network Information Center, Chinese Academy of Science) Abstract Video Link Slides Link Peridynamics (PD) methods are good at describing solid mechanical behaviours and are superior at simulating discontinuous problems. They can be applied to many fields, such as materials science, human health, and industrial manufacturing, which motivates us to provide their efficient numerical simulations on the Sunway TaihuLight supercomputer. However, the massive and complex calculations of PD simulations and the characteristics of Sunway TaihuLight bring challenges to efficient parallel PD simulation. In this paper, we present a series of performance optimization techniques to perform a large-scale parallel PD simulation application on Sunway TaihuLight. We first design data grouping and SPM-based caching to increase data transmission bandwidth and reduce main memory access time. Further, we design and implement vectorization and instruction-level optimization for PD applications to improve computational performance. Finally, we offer strategies for overlapping data transmission and computation so that data transmission can be hidden by computation. Our work in a core group improves the performance of the serial version on the SW26010 processor by 181 times. Compared to the serial and single-CPU Peridigm-based simulations on an Intel Xeon E5-2680 V3, our work gets a speedup of 60 times and 6 times, respectively. Near-linear scalability is also obtained. When testing weak scaling, the simulation of a 296,222,720-point example achieves 1.14 PFLOPS with 8192 processes (532,480 cores). When testing strong scaling, 90% parallel efficiency is observed as the number of processes increases 64 times to 4096 processes. Toward Large-Scale Image Segmentation on Summit
Toward Large-Scale Image Segmentation on Summit
Sudip Seal (Oak Ridge National Laboratory) and Seung-Hwan Lim, Dali Wang, Jacob Hinkle, Dalton Lunga, and Aristeidis Tsaris (ORNL) Abstract Video Link Slides Link Semantic segmentation of images is an important computer vision task that emerges in a variety of application domains such as medical imaging, robotic vision and autonomous vehicles, to name a few. While these domain-specific image analysis tasks involve relatively small image sizes (∼ 100 × 100), there are many applications that need to train machine learning models on image data with extents that are orders of magnitude larger (∼ 10000 × 10000). Training deep neural network (DNN) models on large-extent images is extremely memory-intensive and often exceeds the memory limitations of a single graphical processing unit, a hardware accelerator of choice for computer vision workloads. Here, an efficient, sample-parallel approach to train U-Net models on large-extent image datasets is presented. Its advantages and limitations are analyzed, and near-linear strong-scaling speedup is demonstrated on 256 nodes (1536 GPUs) of the Summit supercomputer. Using a single node of the Summit supercomputer, an early evaluation of a recently released model parallel framework called GPipe is demonstrated to deliver ∼ 2X speedup in executing a U-Net model with an order of magnitude larger number of trainable parameters than reported before. Performance bottlenecks for pipelined training of U-Net models are identified and mitigation strategies to improve the speedups are discussed. Together, these results open up the possibility of combining both approaches into a unified scalable pipelined and data parallel algorithm to efficiently train U-Net models with very large receptive fields on datasets of ultra-large extent images. SWMapper: Scalable Read Mapper on SunWay TaihuLight
SWMapper: Scalable Read Mapper on SunWay TaihuLight
Kai Xu, Xiaohui Duan, Xiangxu Meng, and Xin Li (Shandong University); Bertil Schmidt (Johannes Gutenberg University); and Weiguo Liu (Shandong University) Abstract Video Link Slides Link With the rapid development of next-generation sequencing (NGS) technologies, high throughput sequencing platforms continuously produce large amounts of short read DNA data at low cost. Read mapping is a performance-critical task, being one of the first stages required for many different types of NGS analysis pipelines. We present SWMapper, a scalable and efficient read mapper for the Sunway TaihuLight supercomputer. A number of optimization techniques are proposed to achieve high performance on its heterogeneous architecture, centered around a memory-efficient succinct hash index data structure and including seed filtration, duplicate removal, dynamic scheduling, asynchronous data transfer, and overlapping of I/O and computation. Furthermore, a vectorized version of the banded Myers algorithm for pairwise alignment with 256-bit vector registers is presented to fully exploit the computational power of the SW26010 processor. Our performance evaluation shows that SWMapper using all 4 compute groups of a single Sunway TaihuLight node outperforms S-Aligner on the same hardware by a factor of 6.2. In addition, compared to the state-of-the-art CPU-based mappers RazerS3, BitMapper2, and Hobbes3 running on a 4-core Xeon W-2123v3 CPU, SWMapper achieves speedups of 26.5, 7.8, and 2.6, respectively. Our optimizations achieve an aggregated speedup of 11 compared to the naïve implementation on one compute group of an SW26010 processor, as well as a strong-scaling efficiency of 74% on 128 compute groups. Paper 3C: Machine Learning for Computing Zoom Meeting C Eunjung Park An Online Learning-Based Task Offloading Framework for 5G Small Cell Networks
An Online Learning-Based Task Offloading Framework for 5G Small Cell Networks
Xueying Zhang (Wuhan University); Ruiting Zhou (Wuhan University, Chinese University of Hong Kong); Zhi Zhou (Sun Yat-sen University); John C.S. Lui (Chinese University of Hong Kong); and Zongpeng Li (Huawei, Wuhan University) Abstract Video Link Slides Link Small cells are deployed in 5G to complement the macro cell network for improving coverage and capacity. Small cells and edge computing are natural partners, which can benefit users' experience. Small cell nodes (SCNs) equipped with edge servers can support emerging computing services, such as virtual reality, which impose low-latency and precise contextual requirements. With the proliferation of wireless devices, there is an increasing demand for offloading tasks to SCNs. Given limited computation and communication resources, the fundamental problem for a small cell network is how to select computing tasks to maximize effective rewards in an uncertain and stochastic environment. To this end, we propose an online learning framework, LFSC, which has a performance guarantee to guide task offloading in a small cell network. LFSC balances between reward and constraint violations, and it consists of three subroutines: i) a randomized algorithm which calculates the selection probability of each task based on task weights; ii) a greedy assignment algorithm which cooperatively allocates tasks among different SCNs based on the selection probability; iii) an update algorithm which exploits the multi-armed bandit (MAB) technique to update task weights according to the feedback. Theoretical analysis shows that both the regret and the violations produced by LFSC are sub-linear. Extensive simulation studies based on real-world data confirm that LFSC achieves a close-to-optimal reward with low violations, and outperforms many state-of-the-art algorithms. A Reinforcement Learning Based System for Minimizing Cloud Storage Service Cost
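To make the weights, selection probabilities, and bandit-feedback loop concrete, here is a generic EXP3-style update; it is not LFSC's randomized, greedy-assignment, and update subroutines, and the exploration constant and reward range are assumptions.

```python
import math
import random

def select_and_update(weights, rewards, gamma=0.1):
    """weights: dict task -> weight; rewards: callable(task) -> observed reward in [0, 1]."""
    total = sum(weights.values())
    k = len(weights)
    # 1) randomized selection probabilities: weights mixed with uniform exploration
    probs = {t: (1 - gamma) * w / total + gamma / k for t, w in weights.items()}
    # 2) sample one task according to these probabilities
    r, acc, chosen = random.random(), 0.0, None
    for t, p in probs.items():
        acc += p
        if r <= acc:
            chosen = t
            break
    chosen = chosen if chosen is not None else next(iter(probs))
    # 3) importance-weighted reward estimate and exponential weight update
    x_hat = rewards(chosen) / probs[chosen]
    weights[chosen] *= math.exp(gamma * x_hat / k)
    return chosen, probs
```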
A Reinforcement Learning Based System for Minimizing Cloud Storage Service Cost
Haoyu Wang, Haiying Shen, Qi Liu, and Kevin Zheng (University of Virginia) and Jie Xu (George Mason University) Abstract Video Link Slides Link Currently, many web applications are deployed on cloud storage services provided by cloud service providers (CSPs). A CSP offers different types of storage, including hot, cold and archive storage, and sets unit prices for these different types, which vary substantially. By properly assigning the data files of a web application to different types of storage based on their usage profiles and the CSP's pricing policy, a cloud customer can potentially achieve substantial cost savings and minimize the payment to the CSP. However, no previous research handles this problem. Towards this goal, we present a Markov Decision Process formulation of the cost minimization problem, and then develop a reinforcement learning based approach to effectively solve it, which changes the type of storage of each data file periodically to minimize monetary cost in the long term. We then propose a method to aggregate concurrently requested data files to further reduce the cloud storage service payment for a web application. Our experiments with Wikipedia traces show the effectiveness of the proposed methods for minimizing cloud customer cost in comparison with other methods. Deep Reinforcement Learning based Elasticity-compatible Heterogeneous Resource Management for Time-critical Computing
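A toy tabular version of the periodic storage-type decision can be sketched as follows; the state definition (a discretized access-count bucket), the reward (negative billed cost), and all constants are illustrative assumptions, not the paper's MDP formulation or its learning algorithm.

```python
import random
from collections import defaultdict

TIERS = ["hot", "cold", "archive"]

def train(env_step, episodes=1000, alpha=0.1, gamma=0.9, eps=0.1):
    """env_step(state, tier) -> (cost, next_state); cost is the bill for the period."""
    Q = defaultdict(lambda: {t: 0.0 for t in TIERS})
    state = "bucket_0"                      # e.g. a discretized recent-access count
    for _ in range(episodes):
        # epsilon-greedy choice of storage type for the current period
        tier = (random.choice(TIERS) if random.random() < eps
                else max(Q[state], key=Q[state].get))
        cost, next_state = env_step(state, tier)
        reward = -cost                      # minimizing payment == maximizing -cost
        best_next = max(Q[next_state].values())
        Q[state][tier] += alpha * (reward + gamma * best_next - Q[state][tier])
        state = next_state
    return Q
```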
Deep Reinforcement Learning based Elasticity-compatible Heterogeneous Resource Management for Time-critical Computing
Zixia Liu and Liqiang Wang (Department of Computer Science, University of Central Florida) and Gang Quan (Electrical and Computer Engineering Department, Florida International University) Abstract Video Link Slides Link Rapidly generated data and the sheer volume of data-analytical jobs put great pressure on the underlying computing facilities. A distributed multi-cluster computing environment such as a hybrid cloud consequently becomes necessary, owing to its advantages in accommodating geographically distributed and potentially cloud-based computing resources. Different clusters forming such an environment could be heterogeneous and may be resource-elastic as well. From an analytical perspective, in accordance with increasing needs on streaming applications and timely analytical demands, many data analytical jobs nowadays are time-critical in terms of their temporal urgency. Moreover, the overall workload of the computing environment can be hybrid, containing both time-critical and general applications. These all call for an efficient resource management approach capable of apprehending both computing-environment and application features. Paper 12:30pm-1:00pm4A: Performance Tools and Methodology Zoom Meeting A Tanzima Islam Generating Robust Parallel Programs via Model Driven Prediction of Compiler Optimizations for Non-determinism
Generating Robust Parallel Programs via Model Driven Prediction of Compiler Optimizations for Non-determinism
Girish Mururu (Georgia Institute of Technology); Kaushik Ravichandran (Georgia Institute of Technology, Facebook); and Ada Gavrilovska and Santosh Pande (Georgia Institute of Technology) Abstract Video Link Slides Link Execution orders in parallel programs are governed by non-determinism and can vary substantially across different executions even on the same input. Thus, a highly non-deterministic program can exhibit rare execution orders never observed during testing. It is desirable to reduce non-determinism to suppress corner case behavior in the production cycle (making the execution robust or bug-free) and increase non-determinism for reproducing bugs in the development cycle. Performance-wise, different optimization levels (e.g., from O0 to O3) are enabled during development; however, non-determinism-wise, developers have no way to select the right compiler optimization level in order to increase non-determinism for debugging or to decrease it for robustness. Memory-Centric Communication Mechanism for Real-time Autonomous Navigation Applications
Memory-Centric Communication Mechanism for Real-time Autonomous Navigation Applications
Wei Liu, Yifan Gong, and Hao Wu (Tusimple); Jidong Zhai (Tsinghua University, BNRist); and Jiangming Jin (Tusimple) Abstract Video Link Slides Link There has been a remarkable increase in the speed of AI development over the past few years. Artificial intelligence and deep learning techniques are blooming and expanding in all forms to every sector possible. With the emerging intelligent autonomous navigation systems, both memory allocation and data movement are becoming the main bottleneck in the inter-process communication procedure, especially in supporting various types of messages between multiple programming languages. To reduce significant memory allocation and data movement cost, we propose a novel memory-centric mechanism, which includes a virtual layer based architecture and a pre-record memory allocation algorithm. Furthermore, we have implemented a memory-centric communication framework called Z-framework based on our proposed techniques to achieve highly efficient IPC procedures in autonomous navigation systems. Experimental results show that Z-framework is able to gain up to 41% and 35% performance improvement compared with the approach used in ROS2, an industry standard, and the state-of-the-art approach used in CyberRT, respectively. Automatic Identification and Precise Attribution of DRAM Bandwidth Contention
Automatic Identification and Precise Attribution of DRAM Bandwidth Contention
Christian Helm and Kenjiro Taura (The University of Tokyo) Abstract Video Link Slides Link The limited DRAM bandwidth of today's computing systems is a bottleneck for many applications. But the identification of DRAM bandwidth contention in applications is difficult. The measured bandwidth consumption of an application can not identify bandwidth contention. In theory, NUMA systems can provide higher memory bandwidth. But applications often make poor use of the resources. To address these challenges, we introduce a novel method to identify DRAM bandwidth contention and bad usage of NUMA resources. It consists of metrics to judge the severity of bandwidth contention and the degree of imbalanced resource usage in NUMA systems. Our tool can automatically scan an application for DRAM contention problems. Together with the precise location of the origin, intuitive optimization guidance is given. This approach for finding DRAM contention is based on the memory access latency. By comparing the experienced latency of an application with the uncontended hardware latency, we can find the contention. Hardware instruction sampling enables precise identification of the origin. It also provides information about accessed memories, which we use to calculate a NUMA imbalance metric. In a detailed evaluation with several micro-benchmarks, we show that our new method does indeed quantify the severity of DRAM contention beyond the possibilities of simple bandwidth consumption measurement and existing tools. We also apply our approach to real applications and confirm that it gives useful optimization advice to users. Paper 4B: Storage Reliability & Memory Security Zoom Meeting B Radu Teodorescu An Adaptive Erasure-Coded Storage Scheme with an Efficient Code-Switching Algorithm
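The core idea behind this contention analysis, comparing observed memory access latency against an uncontended baseline and checking how evenly traffic spreads across NUMA nodes, can be sketched as follows. This is illustrative only: the paper's exact metrics and hardware-sampling pipeline are not shown, and the example numbers are made up.

def contention_severity(sampled_latencies_ns, uncontended_latency_ns):
    # Ratio of observed to uncontended DRAM latency; values above 1 suggest contention.
    avg = sum(sampled_latencies_ns) / len(sampled_latencies_ns)
    return avg / uncontended_latency_ns

def numa_imbalance(bytes_per_node):
    # 1.0 means memory traffic is perfectly balanced across NUMA nodes.
    peak = max(bytes_per_node)
    mean = sum(bytes_per_node) / len(bytes_per_node)
    return mean / peak

# Example: samples averaging ~180 ns against a 90 ns uncontended baseline give severity ~2.0.
print(contention_severity([150, 200, 190], 90), numa_imbalance([10e9, 2e9]))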
An Adaptive Erasure-Coded Storage Scheme with an Efficient Code-Switching Algorithm
Zizhong Wang, Haixia Wang, and Airan Shao (Tsinghua University) and Dongsheng Wang (Tsinghua University, Peng Cheng Laboratory) Abstract Video Link Slides Link Many distributed storage systems use erasure codes rather than replication for higher reliability at significantly lower storage costs. However, using traditional erasure codes increases consumption of network traffic and disk I/O tremendously when systems recover data, resulting in high latency of degraded reads. In order to mitigate this problem, we present an adaptive storage scheme based on data access skew, i.e., the fact that most data accesses target a small fraction of the data. In this scheme, we use both a Local Reconstruction Code (LRC) to store frequently accessed data, and a Hitchhiker (HH) code to store infrequently accessed data. Besides, an efficient switching algorithm between LRC and HH code with low network and computation costs is provided. The whole system will benefit from low degraded read latency while keeping a low storage overhead, and code-switching will not become a bottleneck. Experimental evaluation shows that this adaptive storage scheme's performance was in line with expectations. First Time Miss: Low Overhead Mitigation For Shared Memory Cache Side Channels
First Time Miss: Low Overhead Mitigation For Shared Memory Cache Side Channels
Kartik Ramkrishnan, Antonia Zhai, Stephen McCamant, and Pen Yew (University of Minnesota) Abstract Video Link Slides Link Cache hits are an important source of information leakage in cache side channel attacks. An attacker observes a much faster cache access time if the cache line had previously been filled in by the victim, and a much slower memory access time if the victim had not accessed this cache line, thus revealing to the attacker whether the victim had accessed the cache line or not. A Rack-aware Pipeline Repair Scheme for Erasure-coded Distributed Storage Systems
A Rack-aware Pipeline Repair Scheme for Erasure-coded Distributed Storage Systems
Tong Liu, Shakeel Alibhai, and Xubin He (Temple University) Abstract Video Link Slides Link Nowadays, modern industry data centers have employed erasure codes to provide reliability for large amounts of data at a low cost. Although erasure codes provide optimal storage efficiency, they suffer from high repair costs compared to traditional three-way replication: when a data miss occurs in a data center, erasure codes would require high disk usage and network bandwidth consumption across nodes and racks to repair the failed data. In this paper, we propose RPR, a rack-aware pipeline repair scheme for erasure-coded distributed storage systems. RPR is the first to investigate rack-level insights, exploring the connection between the node level and the rack level to help improve repair performance when single or multiple failures occur in a data center. The evaluation results on several common RS code configurations show that, for single-block failures, our RPR scheme reduces the total repair time by up to 81.5% compared to the traditional RS code repair method and 50.2% compared to the state-of-the-art CAR algorithm. For multi-block failures, RPR reduces the total repair time and cross-rack data transfer traffic by up to 64.5% and 50%, respectively, over the traditional RS code repair. Paper 4C: Supporting Efficient Machine Learning Zoom Meeting C Seyong Lee Extremely Low-bit Convolution Optimization for Quantized Neural Network on Modern Computer Architectures
Extremely Low-bit Convolution Optimization for Quantized Neural Network on Modern Computer Architectures
Qingchang Han (Beihang University, SenseTime Research); Yongmin Hu (Beihang University); Fengwei Yu (SenseTime Research); Hailong Yang (Beihang University); Bing Liu (SenseTime Research); Peng Hu (SenseTime Research, Beihang University); Ruihao Gong (Beihang University, SenseTime Research); Yanfei Wang (SenseTime Research); and Rui Wang, Zhongzhi Luan, and Depei Qian (Beihang University) Abstract Video Link Slides Link With the continuous demand for higher accuracy of deep neural networks, the model size has been increasing significantly. Quantization is one of the most widely used model compression methods, which can effectively reduce the model size without severe accuracy loss. Modern processors such as ARM CPU and NVIDIA GPU have already provided the support of low-bit arithmetic instructions. However, efficient and practical optimizations for extremely low-bit convolution (e.g., 2~8-bit on ARM CPU and 4-bit/8-bit on NVIDIA GPU) are still lacking. This paper explores the performance optimization methods of extremely low-bit convolution on diverse architectures. On ARM CPU, we propose two instruction schemes for 2~3-bit and 4~8-bit convolution with corresponding register allocation methods. In addition, we re-design the GEMM computation with data padding and packing optimizations. We also implement the Winograd algorithm for convolution with specific bit widths (e.g., 4~6-bit) to achieve higher performance. On NVIDIA GPU, we propose a data partition mechanism and multi-level memory access optimizations, to better adapt the computation to GPU thread and memory hierarchy. We also propose quantization fusion to eliminate unnecessary data access. The experimental results demonstrate that our implementations achieve better performance for extremely low-bit convolution than state-of-the-art frameworks and libraries such as ncnn and cuDNN. To the best of our knowledge, this is the first work that provides efficient implementations of extremely low-bit convolutions covering 2~8-bit on ARM CPU and 4-bit/8-bit on NVIDIA GPU. Vector Forward Mode Automatic Differentiation on SIMD/SIMT architectures
Vector Forward Mode Automatic Differentiation on SIMD/SIMT architectures
Jan Hueckelheim, Michel Schanen, Sri Hari Krishna Narayanan, and Paul Hovland (Argonne National Laboratory) Abstract Video Link Slides Link Automatic differentiation, back-propagation, differentiable programming and related methods have received widespread attention, due to their ability to compute accurate gradients of numerical programs for optimization, uncertainty quantification, and machine learning. Two strategies are commonly used: the forward mode, which is easy to implement but has an overhead, relative to the original program, that grows linearly with the number of inputs; and the reverse mode, which can compute gradients for an arbitrary number of program inputs with a constant-factor overhead, although the constant can be large, more memory is required, and the implementation is often challenging. Previous literature has shown that the forward mode can be more easily parallelized and vectorized than the reverse mode, but case studies investigating when either mode is the best choice are lacking, especially for modern CPUs and GPUs. In this paper, we demonstrate that the forward mode can outperform the reverse mode for programs with tens or hundreds of directional derivatives, a number that may yet increase if current hardware trends continue. Delta-DNN: Efficiently Compressing Deep Neural Networks via Exploiting Floats Similarity
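Vector forward mode can be illustrated with a small dual-number class whose derivative part is a vector of directional derivatives, so a single evaluation of the program propagates several seed directions at once. This is a generic sketch, not the authors' implementation; only addition and multiplication are overloaded here.

import numpy as np

class Dual:
    # Primal value plus a vector of directional derivatives (vector forward mode).
    def __init__(self, val, dot):
        self.val = val
        self.dot = np.asarray(dot, dtype=float)

    def __add__(self, other):
        other = _lift(other, len(self.dot))
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other, len(self.dot))
        return Dual(self.val * other.val,
                    self.val * other.dot + other.val * self.dot)
    __rmul__ = __mul__

def _lift(x, n):
    # Treat plain numbers as constants with zero derivative.
    return x if isinstance(x, Dual) else Dual(x, np.zeros(n))

def value_and_grad(f, xs):
    # One evaluation with n-wide duals yields the value and all n partial derivatives.
    n = len(xs)
    duals = [Dual(x, np.eye(n)[i]) for i, x in enumerate(xs)]
    out = f(*duals)
    return out.val, out.dot

# Example: f(x, y) = x*y + x  ->  value 9.0, gradient [3.0, 3.0] at (3, 2)
val, g = value_and_grad(lambda x, y: x * y + x, [3.0, 2.0])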
Delta-DNN: Efficiently Compressing Deep Neural Networks via Exploiting Floats Similarity
Zhenbo Hu, Xiangyu Zou, and Wen Xia (Harbin Institute of Technology, Shenzhen); Sian Jin (University of Alabama); Dingwen Tao (The University of Alabama, Tuscaloosa, AL, USA); and Yang Liu, Weizhe Zhang, and Zheng Zhang (Harbin Institute of Technology, Shenzhen) Abstract Video Link Slides Link Deep neural networks (DNNs) have gained considerable attention in various real-world applications due to the strong performance on representation learning. However, a DNN needs to be trained many epochs for pursuing a higher inference accuracy, which requires storing sequential versions of DNNs and releasing the updated versions to users. As a result, large amounts of storage and network resources are required, which significantly hamper DNN utilization on resource-constrained platforms (e.g., IoT, mobile phone, etc.). Paper 1:10pm-1:40pm5A: Data Center Networking Zoom Meeting A Scott Atchley AMRT: Anti-ECN Marking to Improve Utilization of Receiver-driven Transmission in Data Center
AMRT: Anti-ECN Marking to Improve Utilization of Receiver-driven Transmission in Data Center
Jinbin Hu, Jiawei Huang, Zhaoyi Li, and Jianxin Wang (Central South University) and Tian He (University of Minnesota) Abstract Video Link Slides Link Cloud applications generate a variety of workloads ranging from delay-sensitive flows to bandwidth-hungry ones in data centers. Existing reactive or proactive congestion control protocols are hard to simultaneously achieve ultra-low latency and high link utilization across all workloads in data center networks. We present a new receiver-driven transport scheme using anti-ECN (Explicit Congestion Notification) marking to achieve both near-zero queueing delay and full link utilization by reasonably increasing sending rate in the case of under-utilization. Specifically, switches mark the ECN bit of data packets once detecting spare bandwidth. When receiving the anti-ECN marked packet, the receiver generates the corresponding marked grant to trigger more data packets. The experimental results of small-scale testbed implementation and large-scale NS2 simulation show that AMRT effectively reduces the average flow completion time (AFCT) by up to 40.8% and improves the link utilization by up to 36.8% under high workload over the state-of-the-art receiver-driven transmission schemes. PS : Periodic Strategy for the 40-100Gbps Energy Efficient Ethernet
PS: Periodic Strategy for the 40-100Gbps Energy Efficient Ethernet
Wanchun Jiang, Kaiqin Liao, Yulong Yan, and Jianxin Wang (School of Computer Science and Engineering, Central South University) Abstract Video Link Slides Link The 40-100Gbps Energy Efficient Ethernet (EEE) standardized by IEEE 802.3bj contains not only the DeepSleep mode, which is 10% of the normal power consumption, but also the FastWake mode, which has much less state transition time and 70% of the normal power consumption. Correspondingly, the strategy of determining the state transitions is crucial to both the power consumption and the incurred latency of frames. As EEE applications always place a tail latency constraint on frame transmission, existing strategies for the 40-100Gbps EEE are rarely configured with proper parameters for good power saving under the given tail latency constraint, especially in face of the variable traffic load. To address this problem, we design the Periodic Strategy (PS) for the 40-100Gbps EEE. Specifically, PS works periodically and optimizes the power saving independently in each cycle. In this way, PS decouples the latency constraint and energy saving. 1) It limits the largest incurred latency by the length of the cycle to satisfy the given latency constraint. 2) It achieves good power saving by selecting the proper low power state based on the prediction of incoming frames and entering the selected state just once in each cycle. What's more, the good performance of PS is insensitive to the traffic pattern. Extensive simulations confirm that PS achieves better energy saving than all the existing strategies under the given latency constraint, under several kinds of different traffic patterns. Polo: Receiver-Driven Congestion Control for Low Latency over Commodity Network Fabric
Polo: Receiver-Driven Congestion Control for Low Latency over Commodity Network Fabric
Chang Ruan, Jianxin Wang, and Wanchun Jiang (Central South University) and Tao Zhang (Changsha University) Abstract Video Link Slides Link Recently, numerous novel transport protocols have been proposed for the low latency of applications deployed in data center networks, e.g., web search and retail recommendation systems. The state-of-the-art receiver-driven protocols, e.g., Homa and NDP, show superior performance in achieving the lowest possible latency. However, Homa assumes that the core layer in the data center network has no congestion, which limits its application for the existing over-subscribed networks. NDP requires modifications to the switch hardware, since it trims packets to headers when the packets cause the switch buffer to overflow, resulting in high deployment cost. In this paper, we present Polo to realize low latency for flows over commodity network fabric, relying on Explicit Congestion Notification (ECN) and priority queues. Based on packets with ECN marking, the Polo receiver obtains congestion information and dynamically adjusts the number of data packets in the network to maintain a small switch queue. The adjustment is carried out periodically. The time interval is determined by keeping an extra high-priority packet always in flight rather than by a fine-grained timer. Further, Polo designs packet recovery mechanisms to retransmit lost packets as soon as possible. Simulation results show that Polo outperforms the state-of-the-art receiver-driven protocols in a wide range of scenarios including incast. Paper 5B: Parallel Algorithms I Zoom Meeting B Grey Ballard Prune the Unnecessary: Parallel Pull-Push Louvain Algorithms with Automatic Edge Pruning
Prune the Unnecessary: Parallel Pull-Push Louvain Algorithms with Automatic Edge Pruning
Jesmin Jahan Tithi, Andrzej Stasiak, Sriram Aananthakrishnan, and Fabrizio Petrini (Intel) Abstract Video Link Slides Link Community detection algorithms try to identify the underlying community structure (i.e., clearly distinguishable closely interacting groups of vertices) in a given graph representing complex systems such as social networks, protein-protein interaction networks, and the World-Wide-Web. The Louvain algorithm iteratively moves vertices from one community to another to construct disjoint sets of vertices to form communities such that the vertices within the same community have more edges within themselves compared to their connections to the vertices outside the community. A property of the Louvain algorithm is that the number of vertex moves drops significantly just after the first few iterations as the community-membership also stabilizes quickly. In this paper, we present a parallel pull-and-push Louvain algorithm that exploits this property to prune unnecessary edge explorations without sacrificing the quality of the solution. We present a collection of parallel Louvain algorithms that prune many edges and vertices, speeding up convergence by an order of magnitude over the previously best-known implementation of Louvain algorithm from Grappolo, while producing similar or better results. Fast Spectral Graph Layout on Multicore Platforms
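The pruning idea exploited here, revisiting only vertices whose neighborhoods were affected by a recent community move, can be sketched with a simple sequential worklist. This is a hedged illustration of the general technique rather than the paper's parallel pull-push algorithms; graph.nodes(), graph.neighbors(), and modularity_gain are assumed to be supplied by the caller (e.g., a networkx-like graph and a precomputed gain function).

from collections import deque

def pruned_louvain_pass(graph, community, modularity_gain):
    # One local-move pass that only revisits vertices whose neighborhood changed.
    active = deque(graph.nodes())
    in_queue = set(active)
    while active:
        v = active.popleft()
        in_queue.discard(v)
        candidates = {community[u] for u in graph.neighbors(v)} | {community[v]}
        best_c = max(candidates, key=lambda c: modularity_gain(v, c))
        if best_c != community[v]:
            community[v] = best_c
            # Only neighbors of a moved vertex can change their best community,
            # so everything else stays pruned from the worklist.
            for u in graph.neighbors(v):
                if u not in in_queue:
                    active.append(u)
                    in_queue.add(u)
    return community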
Fast Spectral Graph Layout on Multicore Platforms
Ashirbad Mishra (Pennsylvania State University), Shad Kirmani (EBay Inc.), and Kamesh Madduri (Pennsylvania State University) Abstract Video Link Slides Link We present ParHDE, a shared-memory parallelization of the High-Dimensional Embedding (HDE) graph algorithm. Originally proposed as a graph drawing algorithm, HDE characterizes the global structure of a graph and is closely related to spectral graph computations such as computing the eigenvectors of the graph Laplacian. We identify compute- and memory-intensive steps in HDE and parallelize these steps for efficient execution on shared-memory multicore platforms. ParHDE can process graphs with billions of edges in minutes, is up to 18X faster than a prior parallel implementation of HDE, and achieves up to a 24X relative speedup on a 28-core system. We also implement several extensions of ParHDE and demonstrate its utility in diverse graph computation-related applications. Revisiting Sparse Dynamic Programming for the 0/1 Knapsack Problem
Revisiting Sparse Dynamic Programming for the 0/1 Knapsack Problem
Tarequl Islam Sifat (Corespeq Inc, Colorado State University); Nirmal Prajapati (LANL, Colorado State University); and Sanjay Rajopadhye (Colorado State University) Abstract Video Link Slides Link The 0/1-Knapsack Problem is a classic NP-hard problem. There are two common approaches to obtain the exact solution: branch-and-bound (BB) and dynamic programming (DP). A so-called "sparse" DP algorithm (SKPDP) that performs fewer operations than the standard algorithm (KPDP) is well known. To the best of our knowledge, there has been no quantitative analysis of the benefits of sparsity. We provide a careful empirical evaluation of SKPDP and observe that for a "large enough" capacity, C, the number of operations performed by SKPDP is invariant with respect to C for many problem instances. This leads to the possibility of an exponential improvement over the conventional KPDP. We experimentally explore SKPDP over a large range of knapsack problem instances and provide a detailed study of the attributes that impact the performance. Paper 5C: Parallel and Distributed Machine Learning Zoom Meeting C Agnieszka K. Miedlar Developing a Loss Prediction-based Asynchronous Stochastic Gradient Descent Algorithm for Distributed Training of Deep Neural Networks
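The sparsity that SKPDP exploits comes from keeping only non-dominated (weight, value) states instead of a dense table over all capacities, which is why the operation count can stop growing with C. A minimal sequential sketch of that idea (not the authors' implementation) follows.

def sparse_knapsack(items, capacity):
    # Sparse 0/1 knapsack DP: keep only Pareto-optimal (weight, value) states.
    states = [(0, 0)]  # sorted by weight, with strictly increasing values
    for w, v in items:
        shifted = [(sw + w, sv + v) for sw, sv in states if sw + w <= capacity]
        merged, best = [], -1
        i = j = 0
        # Merge two weight-sorted lists, dropping dominated states on the fly.
        while i < len(states) or j < len(shifted):
            if j >= len(shifted) or (i < len(states) and states[i][0] <= shifted[j][0]):
                cand = states[i]; i += 1
            else:
                cand = shifted[j]; j += 1
            if cand[1] > best:
                best = cand[1]
                merged.append(cand)
        states = merged
    return states[-1][1]  # maximum achievable value within capacity

# Example: items (weight, value); best choice is {(3,4), (2,3)} for value 7.
assert sparse_knapsack([(3, 4), (4, 5), (2, 3)], capacity=5) == 7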
Developing a Loss Prediction-based Asynchronous Stochastic Gradient Descent Algorithm for Distributed Training of Deep Neural Networks
Junyu Li and Ligang He (University of Warwick), Shenyuan Ren (University of Oxford), and Rui Mao (Shenzhen University) Abstract Video Link Slides Link Training a Deep Neural Network is a computation-intensive and time-consuming task. Asynchronous Stochastic Gradient Descent (ASGD) is an effective solution to accelerate the training process since it enables the network to be trained in a distributed fashion, but with a main issue of the delayed gradient update. A recent notable work called DC-ASGD improves the performance of ASGD by compensating the delay using a cheap approximation of the Hessian matrix. DC-ASGD works well with a short delay; however, the performance drops considerably with an increasing delay between the workers and the server. In real-life large-scale distributed training, such gradient delay experienced by the workers is usually high and volatile. In this paper, we propose a novel algorithm called LC-ASGD to compensate for the delay, based on loss prediction. It effectively extends the tolerable delay duration for the compensation mechanism. Specifically, LC-ASGD utilizes additional models that reside in the parameter server and predict the loss to compensate for the delay, based on historical losses collected from each worker. The algorithm is evaluated on popular networks and benchmark datasets. The experimental results show that our LC-ASGD significantly improves over existing methods, especially when the networks are trained with a large number of workers. Federated Learning with Proximal Stochastic Variance Reduced Gradient Algorithms
Federated Learning with Proximal Stochastic Variance Reduced Gradient Algorithms
Canh T. Dinh and Nguyen H. Tran (The University of Sydney); Tuan Dung Nguyen (The University of Melbourne); and Wei Bao, Albert Y. Zomaya, and Bing B. Zhou (The University of Sydney) Abstract Video Link Slides Link Federated Learning (FL) is a fast-developing distributed machine learning technique involving the participation of a massive number of user devices. While FL has benefits of data privacy and the abundance of user-generated data, its challenges of heterogeneity across users' data and devices complicate algorithm design and convergence analysis. To tackle these challenges, we propose an algorithm that exploits proximal stochastic variance reduced gradient methods for non-convex FL. The proposed algorithm consists of two nested loops, which allow user devices to update their local models approximately up to an accuracy threshold (inner loop) before sending these local models to the server for global model update (outer loop). We characterize the convergence conditions for both local and global model updates and extract various insights from these conditions via the algorithm's parameter control. We also propose how to optimize these parameters such that the training time of FL is minimized. Experimental results not only validate the theoretical convergence but also show that the proposed algorithm outperforms existing Stochastic Gradient Descent-based methods in terms of convergence speed in FL setting. Dual-Way Gradient Sparsification for Asynchronous Distributed Deep Learning
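The nested-loop structure described above (approximate local solves in an inner loop, server-side aggregation in an outer loop) can be sketched as follows. Note that this uses a plain FedProx-style proximal SGD step for the inner loop rather than the paper's proximal variance-reduced gradient updates, and grad_fn, users, and the hyperparameters below are placeholders, not values from the paper.

import numpy as np

def local_update(w_global, data, grad_fn, lr=0.01, mu=0.1, inner_steps=20):
    # Inner loop: approximate local solve with a proximal term pulling toward the global model.
    w = w_global.copy()
    for _ in range(inner_steps):
        x, y = data[np.random.randint(len(data))]
        w -= lr * (grad_fn(w, x, y) + mu * (w - w_global))  # proximal regularization
    return w

def federated_round(w_global, users, grad_fn):
    # Outer loop: the server aggregates the users' approximate local models.
    local_models = [local_update(w_global, d, grad_fn) for d in users]
    return np.mean(local_models, axis=0)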
Dual-Way Gradient Sparsification for Asynchronous Distributed Deep Learning
Zijie Yan, Danyang Xiao, Mengqiang Chen, Jieying Zhou, and Weigang Wu (Sun Yat-sen University) Abstract Video Link Slides Link Distributed parallel training using computing clusters is desirable for large scale deep neural networks. One of the key challenges in distributed training is the communication cost for exchanging information, such as stochastic gradients, among training nodes. Recently, gradient sparsification techniques have been proposed to reduce the amount of data exchanged and thus alleviate the network overhead. However, most existing gradient sparsification approaches consider only synchronous parallelism and cannot be applied in asynchronous distributed training. Paper |
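A common building block for gradient sparsification schemes like the one motivated above is top-k selection with error feedback, sketched below. This shows only the generic single-direction technique; the paper's dual-way scheme for asynchronous parameter-server training is not reproduced, and the sizes used are arbitrary.

import numpy as np

def topk_sparsify(grad, k, residual):
    # Keep the k largest-magnitude entries of (grad + residual); accumulate the rest.
    acc = grad + residual
    idx = np.argpartition(np.abs(acc), -k)[-k:]   # indices of the k largest magnitudes
    sparse = np.zeros_like(acc)
    sparse[idx] = acc[idx]
    new_residual = acc - sparse                   # error feedback carried to the next round
    return idx, acc[idx], new_residual

# Example: send only 1% of a 10,000-element gradient.
g = np.random.randn(10_000)
r = np.zeros_like(g)
idx, vals, r = topk_sparsify(g, k=100, residual=r)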
Thursday, August 20th11:00am-11:40amKeynote-3: Saman Amarasinghe, MIT
Zoom Meeting A
Xipeng Shen
How to Make Sparse Fast How to Make Sparse Fast
Saman Amarasinghe (MIT) Abstract Achieving high performance is no easy task. When it comes to programs operating on sparse data, where there is very little hardware, language or compiler support, getting high performance is nearly impossible. As important modern applications such as machine learning, data analytics and simulations operate on sparse data, lack of performance is becoming a critical issue. Achieving high performance has been so important since the early days of computing that many researchers have spent their lifetimes trying to extract more FLOPS out of these critical codes. Hardcore performance engineers try to get to this performance nirvana single-handedly without any help from languages, compilers or tools. In this talk, using two examples, TACO and GraphIt, I'll argue that domain specific languages and compiler technology can reduce most of the performance optimization burden even in a very difficult domain such as sparse computations. TACO is an optimizing code generator for linear and tensor algebra. TACO introduces a new technique for compiling compound tensor algebra expressions into efficient loops. TACO-generated code has competitive performance to best-in-class hand-written codes for tensor and matrix operations. GraphIt is a DSL and compiler for high-performance graph computing. GraphIt separates algorithm, schedule and physical data layout, providing the programmer with the ultimate control over optimization. GraphIt outperforms the state-of-the-art libraries and DSLs by up to 2.4× on scale-free graphs and 4.7× on road graphs. Keynote 11:50am-12:20pm6A: Heterogeneous Systems Zoom Meeting A Vivek Kale Balancing Graph Processing Workloads Using Work Stealing on Heterogeneous CPU-FPGA Systems
Balancing Graph Processing Workloads Using Work Stealing on Heterogeneous CPU-FPGA Systems
Matthew Agostini, Francis O'Brien, and Tarek Abdelrahman (University of Toronto) Abstract Video Link Slides Link We propose, implement and evaluate a work stealing based scheduler, called HWS, for graph processing on heterogeneous CPU-FPGA systems that tightly couple the CPU and the FPGA to share system memory. HWS addresses unique concerns that arise with work stealing in the context of our target system. Our evaluation is conducted on the Intel Heterogeneous Architecture Research Platform (HARPv2), using three key processing kernels and seven real-world graphs. We show that HWS effectively balances workloads. Further, the use of HWS results in better graph processing performance compared to static scheduling and a representative of existing adaptive partitioning techniques, called HAP. Improvements vary by graph processing application, input graph and number of threads, and can be up to 100% over static scheduling, and up to 17% over HAP. We also compare to an oracle chunk self-scheduler, in which the best chunk size is known a priori for each number of threads and each input graph. HWS performs within 1-3% of this oracle in most cases. Finally, our graph processing throughput scales well with increasing threads. These results collectively demonstrate the effectiveness of work stealing for graph processing on our heterogeneous target platform. Enabling performance portability of data-parallel OpenMP applications on asymmetric multicore processors
Enabling performance portability of data-parallel OpenMP applications on asymmetric multicore processors
Juan Carlos Saez, Fernando Castro, and Manuel Prieto-Matias (Complutense University of Madrid) Abstract Video Link Slides Link Asymmetric multicore processors (AMPs) couple high-performance big cores and low-power small cores with the same instruction-set architecture but different features, such as clock frequency or microarchitecture. Previous work has shown that asymmetric designs may deliver higher energy efficiency than symmetric multicores for diverse workloads. Despite their benefits, AMPs pose significant challenges to runtime systems of parallel programming models. While previous work has mainly explored how to efficiently execute task-based parallel applications on AMPs, via enhancements in the runtime system, improving the performance of unmodified data-parallel applications on these architectures is still a big challenge. In this work we analyze the particular case of loop-based OpenMP applications, which are widely used today in scientific and engineering domains, and constitute the dominant application type in many parallel benchmark suites used for performance evaluation on multicore systems. We observed that conventional loop-scheduling OpenMP approaches are unable to efficiently cope with the load imbalance that naturally stems from the different performance delivered by big and small cores. Detecting Anomalous Computation with RNNs on GPU-Accelerated HPC Machines
Detecting Anomalous Computation with RNNs on GPU-Accelerated HPC Machines
Pengfei Zou (Clemson University), Ang Li and Kevin Barker (Pacific Northwest National Laboratory), and Rong Ge (Clemson University) Abstract Video Link Slides Link This paper presents a workload classification framework that accurately discriminates illicit computation from authorized workloads on GPU-accelerated HPC systems at runtime. As such systems become increasingly powerful and widely-adopted, attackers have begun to run illicit and for-profit programs that typically require extremely high computing capability to be successful, depriving mission-critical and authorized workloads of execution cycles and increasing risks of data leaking and empowered attacks. Traditional measures on CPU hosts are oblivious to such attacks. Our classification framework leverages the distinctive signatures between illicit and authorized GPU workloads, and explores machine learning methods and workload profiling to classify them. We face multiple challenges in designing the framework: achieving high detection accuracy, maintaining low profiling and inference overhead, and overcoming the limitation of lacking data types and volumes typically required by deep learning models. To address these challenges, we use lightweight, non-intrusive, high-level workload profiling, collect multiple sequences of easily obtainable multimodal input data, and build recurrent neural networks (RNNs) to learn from history for online anomalous workload detection. Evaluation results on three generations of GPU machines demonstrate that the workload classification framework can tell apart the illicit workloads with a high accuracy of over 95%. The collected dataset, detection framework, and neural network models will be released on github. Paper 6B: Performance Evaluation and Characterization Zoom Meeting B Probir Roy Experiences on the characterization of parallel applications in embedded systems with Extrae/Paraver
Experiences on the characterization of parallel applications in embedded systems with Extrae/Paraver
Adrian Munera, Sara Royuela, German Llort, and Estanislao Mercadal (Barcelona Supercomputing Center); Franck Wartel (Airbus Defence and Space); and Eduardo Quiñones (Barcelona Supercomputing Center) Abstract Video Link Slides Link The computing needs of current embedded systems are on the increase. Cutting-edge functionalities, such as those in autonomous vehicles, require the use of multi- and many-core architectures to meet their real-time requirements. This introduces a new layer in the software stack typically deployed in embedded systems: the parallel programming model. In this regard, the OpenMP tasking model is being increasingly considered to exploit the capabilities of the most advanced embedded architectures by virtue of its productivity, and its newly discovered time analyzability and schedulability. SPECcast: A Methodology for Fast Performance Evaluation with SPEC CPU 2017 Multiprogrammed Workloads
SPECcast: A Methodology for Fast Performance Evaluation with SPEC CPU 2017 Multiprogrammed Workloads
Pablo Prieto, Pablo Abad, Jose Angel Herrero, Jose Angel Gregorio, and Valentin Puente (University of Cantabria) Abstract Video Link Slides Link Performance comparison of alternative computing platforms is a fundamental task in computer architecture research. Evaluation tools must adapt to current computing environments where resources are shared among multiple and dissimilar applications running simultaneously. In many cases, well known benchmarking tools such as SPEC CPU2017 do not provide evaluation metrics reflecting such scenarios. Previous attempts to improve workload realism rely on the random combination of applications, making use of statistical models to limit population size. The computational cost of this kind of solution is not negligible, given the huge amount of combinations required to guarantee the correctness of these statistical models. In this paper we present SPECcast, an alternative methodology for performance comparison making use of SPEC workloads. Our proposal also relies on application combination, but at a much lower computational cost. Using source code annotation for sampling, we first obtain a small portion of code from each SPEC application, belonging to its Region of Interest (ROI), that resembles the whole-program characteristics with high accuracy. Then, making use of synchronization mechanisms, we are able to run any combination of such small pieces of code, obtaining performance results that not only emulate the execution of multiprogrammed SPEC workloads up to 95% faster but also open the way to executing a large number of workload combinations, making the performance comparison results statistically reliable. The Art of CPU-Pinning: Evaluating and Improving the Performance of Virtualization and Containerization Platforms
The Art of CPU-Pinning: Evaluating and Improving the Performance of Virtualization and Containerization Platforms
Davood Ghatrehsamani, Chavit Denninnart, and Mohsen Amini Salehi (University of Louisiana at Lafayette) and Josef Bacik (Facebook) Abstract Video Link Slides Link Cloud providers offer a variety of execution platforms in the form of bare-metal, VM, and containers. However, due to the pros and cons of each execution platform, choosing the appropriate platform for a specific cloud-based application has become a challenge for solution architects. The possibility of combining these platforms (e.g., deploying containers within VMs) offers new capabilities that make the challenge even more complicated. However, there is little study in the literature on the pros and cons of deploying different application types on various execution platforms. In particular, evaluation of diverse hardware configurations and different CPU provisioning methods, such as CPU pinning, has not been sufficiently studied in the literature. In this work, the performance overhead of container, VM, and bare-metal execution platforms is measured and analyzed for four categories of real-world applications, namely video processing, parallel processing (MPI), web processing, and No-SQL, respectively representing CPU-intensive, parallel-processing, and two IO-intensive workloads. Our analyses reveal a set of interesting and sometimes counterintuitive findings that can be used as best practices by solution architects to efficiently deploy cloud-based applications. Some notable findings are: (A) Under specific circumstances, containers can impose a higher overhead than VMs; (B) Containers on top of VMs can mitigate the overhead of VMs for certain applications; (C) Containers with a large number of cores impose a lower overhead than those with a few cores. Paper 6C: Routing and Mapping in Networks Zoom Meeting C Feng Zhang XShot: Light-weight Link Failure Localization using Crossed Probing Cycles in SDN
XShot: Light-weight Link Failure Localization using Crossed Probing Cycles in SDN
Hongyun Gao, Laiping Zhao, Huanbin Wang, Zhao Tian, Lihai Nie, and Keqiu Li (Tianjin Key Laboratory of Advanced Networking (TANKLab), College of Intelligence and Computing (CIC), Tianjin University) Abstract Video Link Slides Link Accurate and quick failure localization is critical for automatic network troubleshooting. While it is particularly difficult to solve the problem in the traditional network due to the uncertain routing, Software Defined Networking (SDN) enables the deterministic routing for packet transmission through the traffic engineering algorithm in the centralized controller. On Network Locality in MPI-Based HPC Applications
On Network Locality in MPI-Based HPC Applications
Felix Zahn and Holger Fröning (Heidelberg University) Abstract Video Link Slides Link Data movements through interconnection networks exceed local memory accesses in terms of latency as well as energy by multiple orders of magnitude. While many optimizations make great effort to improve memory accesses, large distances in the network can easily dash these improvements, resulting in increasing overall costs. Therefore, a deep understanding of network locality is key for further optimizations, such as improved mapping of ranks to physical entities. In this work, we are looking at locality in the hardware-independent application level and at locality aspects of common network structures. In order to quantify the former, two new metrics are introduced, namely rank locality and selectivity. Our studies are performed on a selection of 16 exascale proxy mini apps, with scale ranging from eight to 1152 ranks. These traces are statically analyzed regarding their spatial communication pattern at MPI level. The resulting practice in actual hardware is evaluated with a network model, which implements topologies such as tori, fat tree, and dragonfly, and an according minimal routing. As a result, this work is founded on a large set of experimental configurations, based on different applications, scales, and topologies. Overall, our findings indicate that static analyses could assist to select an advanced mapping, which assigns groups of heavily communicating ranks to physical entities in close proximity. This could help to minimize the total number of packet hops and, thereby, improve latency and reduce the probability of congestion. DeepHop on Edge: Hop-by-hop Routing by Distributed Learning with Semantic Attention
DeepHop on Edge: Hop-by-hop Routing by Distributed Learning with Semantic Attention
Bo He, Jingyu Wang, Qi Qi, Haifeng Sun, and Zirui Zhuang (Beijing University of Posts and Telecommunications); Cong Liu (China Mobile Research Institute); and Jianxin Liao (Beijing University of Posts and Telecommunications) Abstract Video Link Slides Link Multi-access Edge Computing (MEC) and ubiquitous smart devices help serve end-users efficiently and optimally through providing emerging edge-deployed services. Meanwhile, more heavy and time-varying traffic loads are produced in the edge network, so an efficient traffic forwarding mechanism is required. In this paper, we propose a parallel and distributed learning approach, DeepHop, to adapt to volatile environments and realize hop-by-hop routing. Multi-Agent Deep Reinforcement Learning (MADRL) is used to alleviate edge network congestion and maximize the utilization of network resources. DeepHop determines the routing among edge network nodes for heterogeneous types of traffic according to the current workload and capability. By joining with an attention model, DeepHop obtains the semantics from the elements of the network state to help the agents learn the importance of each element on routing. Experiment results show that DeepHop achieves an increase of 15% in successfully transmitted packets compared with the state-of-the-art algorithms. Besides, DeepHop with an attention model reduces convergence time by nearly half compared with commonly used neural network structures. Paper 12:30pm-1:00pm7A: Microarchitecture and Power Management Zoom Meeting A Hyeran Jeon A GPU Register File using Static Data Compression
A GPU Register File using Static Data Compression
Alexandra Angerd, Erik Sintorn, and Per Stenström (Chalmers University of Technology) Abstract Video Link Slides Link GPUs rely on large register files to unlock thread-level parallelism for high throughput. Unfortunately, large register files are power hungry, making it important to seek new approaches to improve their utilization. HCAPP: Scalable Power Control for Heterogeneous 2.5D Integrated Systems
HCAPP: Scalable Power Control for Heterogeneous 2.5D Integrated Systems
Kramer Straube, Jason Lowe-Power, Christopher Nitta, Matthew Farrens, and Venkatesh Akella (University of California, Davis) Abstract Video Link Slides Link Package pin allocation is becoming a key bottleneck in the capabilities of designs due to the increased bandwidth requirements. 2.5D integration compounds these package-level requirements while introducing an increased number of compute units within the package. We propose a decentralized power control implementation called Heterogeneous Constant Average Power Processing (HCAPP) to maintain the power limit while maximizing the efficiency of the package pins allocated for power. HCAPP uses a hardware-based decentralized design to handle fast power limits, maintain scalability and enable simplified control for heterogeneous systems while maximizing performance. As extensions, we evaluate a software interface and the impact of different accelerator designs. Overall, HCAPP achieves 7% speedup over a RAPL-like implementation. The power utilization improves from 79.7% (RAPL-like) to 93.9% (HCAPP) with this design. A priority-based static software control methodology alongside HCAPP provides average speedups of 8.3% (CPU), 5.4% (GPU), and 12% (Accelerator) for the prioritized component compared to the unprioritized version. DNNARA: A Deep Neural Network Accelerator using Residue Arithmetic and Integrated Photonics
DNNARA: A Deep Neural Network Accelerator using Residue Arithmetic and Integrated Photonics
Jiaxin Peng (The George Washington University); Yousra Alkabani (Halmstad University); and Shuai Sun, Volker Sorger, and Tarek El-Ghazawi (The George Washington University) Abstract Video Link Slides Link Deep Neural Networks (DNNs) are currently used in many fields, including critical real-time applications. Due to its compute-intensive nature, speeding up DNNs has become an important topic in current research. We propose a hybrid opto-electronic computing architecture targeting the acceleration of DNNs based on the residue number system (RNS). In this novel architecture, we combine the use of Wavelength Division Multiplexing (WDM) and RNS for efficient execution. WDM is used to enable a high level of parallelism while reducing the number of optical components needed to decrease the area of the accelerator. Moreover, RNS is used to generate optical components with short optical critical paths. In addition to speed, this has the advantage of lowering the optical losses and reducing the need for high laser power. Our RNS compute modules use one-hot encoding and thus enable fast switching between the electrical and optical domains. Paper 7B: Parallel Algorithms II Zoom Meeting B Abdou Guermouche Adaptive Bulk Search: Solving Quadratic Unconstrained Binary Optimization Problems on Multiple GPUs
Adaptive Bulk Search: Solving Quadratic Unconstrained Binary Optimization Problems on Multiple GPUs
Ryota Yasudo, Koji Nakano, and Yasuaki Ito (Hiroshima University) and Masaru Tatekawa, Ryota Katsuki, Takashi Yazane, and Yoko Inaba (NTT DATA Corporation) Abstract Video Link Slides Link The quadratic unconstrained binary optimization (QUBO) problem has recently been gathering attention in conjunction with quantum annealing (QA), since it is equivalent to finding the ground state of an Ising model. Due to the limitation of current QA systems, classical computers may outperform them, and thus solving QUBO on FPGAs, GPUs, and special purpose processors has been proposed. In this paper, we propose an adaptive bulk search (ABS), a framework for solving QUBO that can perform many searches in parallel on multiple GPUs. It supports fully-connected Ising models with up to 32k (= 32,768) spins and 16-bit weights. In our ABS, a CPU host performs a genetic algorithm (GA) while GPUs asynchronously perform local searches. A bottleneck for solving QUBO exists in the evaluation of the energy function, which requires O(n^2) computational cost for each solution. We show this can be reduced to O(1) in our ABS. The experimental results show that, with four NVIDIA GeForce RTX 2080 Ti GPUs, our framework can search up to 1.24 × 10^12 solutions per second. Also, we show that our system quickly solves maximum cut and traveling salesman problems. Efficient Block Algorithms for Parallel Sparse Triangular Solve
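The O(n^2)-to-O(1) reduction mentioned above is typically achieved by maintaining per-variable local fields so that the energy change of a single bit flip can be scored in constant time, with only an O(n) update when a flip is accepted. The CPU-side sketch below illustrates this incremental-evaluation idea generically; it is not the ABS GPU implementation.

import numpy as np

class QuboState:
    # Incremental QUBO energy: O(1) to score a single-bit flip, O(n) to apply it.
    def __init__(self, Q, x):
        self.Q = np.asarray(Q, dtype=float)   # symmetric QUBO matrix
        self.x = np.asarray(x, dtype=int)     # binary assignment
        # h[k] = sum over j != k of Q[k, j] * x[j]
        self.h = self.Q @ self.x - np.diag(self.Q) * self.x
        self.energy = float(self.x @ self.Q @ self.x)

    def flip_delta(self, k):
        d = 1 - 2 * self.x[k]                 # +1 if flipping 0->1, -1 if 1->0
        return d * (2 * self.h[k] + self.Q[k, k])

    def flip(self, k):
        d = 1 - 2 * self.x[k]
        self.energy += self.flip_delta(k)
        self.x[k] += d
        self.h += d * self.Q[:, k]
        self.h[k] -= d * self.Q[k, k]         # h[k] excludes its own diagonal term

# Example: flipping bit 1 of x = [1, 0] under Q = [[1, 2], [2, 3]] raises the energy from 1 to 8.
s = QuboState([[1, 2], [2, 3]], [1, 0])
s.flip(1)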
Efficient Block Algorithms for Parallel Sparse Triangular Solve
Zhengyang Lu, Yuyao Niu, and Weifeng Liu (China University of Petroleum, Beijing) Abstract Video Link Slides Link The sparse triangular solve (SpTRSV) kernel is an important building block for a number of linear algebra routines such as sparse direct and iterative solvers. The major challenge of accelerating SpTRSV lies in the difficulties of finding higher parallelism. Existing work mainly focuses on reducing dependencies and synchronizations in the level-set methods. However, the 2D block layout of the input matrix has been largely ignored in designing more efficient SpTRSV algorithms. Selective Coflow Completion for Time-sensitive Distributed Applications with Poco
Selective Coflow Completion for Time-sensitive Distributed Applications with Poco
Shouxi Luo, Pingzhi Fan, and Huanlai Xing (Southwest Jiaotong University) and Hongfang Yu (University of Electronic Science and Technology of China) Abstract Video Link Slides Link Recently, the abstraction of coflow is introduced to capture the collective data transmission patterns among distributed data-parallel applications. During the processing, coflows generally act as barriers; accordingly, time-sensitive applications prefer their coflows to complete within deadlines and deadline-aware coflow scheduling becomes very crucial. Regarding these data-parallel applications, we notice that many of them, including large-scale query system, distributed iterative training, and erasure codes enabled storage, are able to tolerate loss-bounded incomplete inputs by design. This tolerance indeed brings a flexible design space for the schedule of their coflows: when getting overloaded, the network can trade coflow completeness for timeliness, and balance the completenesses of different coflows on demand. Unfortunately, existing coflow schedulers neglect this tolerance, resulting in inflexible and inefficient bandwidth allocations. In this paper, we explore this fundamental trade-off and design Poco, a POlicy-based COflow scheduler, to achieve customizable selective coflow completions for these emerging time-sensitive distributed applications. Internally, Poco employs a suite of novel designs along with admission controls to make flexible, work-conserving, and performance-guaranteed rate allocation to online coflow requests very efficiently. Extensive trace-based simulations indicate that Poco is highly flexible and achieves optimal coflow schedules respecting the requirements specified by applications. Paper 7C: Resource Management on the Cloud Zoom Meeting C Madhusudhan Govindaraju Improving Load Balance via Resource Exchange in Large-Scale Search Engines
Improving Load Balance via Resource Exchange in Large-Scale Search Engines
Kaiyue Duan, Yusen Li, Trent Marbach, Gang Wang, and Xiaoguang Liu (College of Computer Science, Nankai University) Abstract Video Link Slides Link Load balance is one of the major issues in large-scale search engines. A commonly used load balancing approach in search engine datacenters is to reassign index shards among machines. However, reassigning shards under stringent resource environments is challenging due to transient resource constraints (during reassignment, some resources are consumed simultaneously by a shard on the initial machine and its copy on the target machine). Rendering Server Allocation for MMORPG Players in Cloud Gaming
Rendering Server Allocation for MMORPG Players in Cloud Gaming
Iryanto Jaya and Wentong Cai (School of Computer Science and Engineering, Nanyang Technological University) and Yusen Li (School of Computer Science, Nankai University) Abstract Video Link Slides Link Cloud gaming services enable users with heterogeneous device capabilities to get access to game titles with hardware demanding specifications. While visual quality requirements are to be satisfied by the cloud rendering servers, latency is one of the major problems which arises as most of the tasks are offloaded remotely. Cloud gaming players must experience latency below a certain acceptable limit for the game to be responsive with playable visual quality under limited bandwidth capacity. Playing multiplayer games in the cloud needs to cater for player interactions and their commonality especially if they are playing in the same virtual space. The main characteristic of cloud gaming is that the whole rendering pipeline occurs in the cloud; therefore, rendering optimization by reusing common information across players is possible by taking advantage of multi-view rendering. In this paper, we propose a cloud gaming architecture for MMORPGs which involves hundreds of players as well as online optimization heuristic algorithms on rendering server allocations in order to minimize rendering server rental cost from game provider's point of view. In addition, we also compare the performance of those online heuristics with an offline lower bound in order to observe the scalability of the online scenario. Impact of Memory DoS Attacks on Cloud Applications and Real-Time Detection Schemes
Impact of Memory DoS Attacks on Cloud Applications and Real-Time Detection Schemes
Zhuozhao Li (University of Chicago), Tanmoy Sen and Haiying Shen (University of Virginia), and Mooi Choo Chuah (Lehigh University) Abstract Video Link Slides Link Even though memory-based denial-of-service attacks can cause severe performance degradations on co-located virtual machines, a previous detection scheme against such attacks cannot accurately detect the attacks and also generates high detection delay and high performance overhead, since it assumes that cache-related statistics of an application follow the same probability distribution at all times, which may not be true for all types of applications. In this paper, we present experimental results showing the impacts of memory DoS attacks on different types of cloud-based applications. Based on these results, we propose two lightweight, responsive Statistical based Detection Schemes (SDS/B and SDS/P) that can detect such attacks accurately. SDS/B constructs a profile of the normal range of cache-related statistics for all applications and uses statistical methods to infer an attack when the real-time collected statistics exceed this normal range, while SDS/P exploits the increased periods of access patterns for periodic applications to infer an attack. Our evaluation results show that SDS/B and SDS/P outperform the state-of-the-art detection scheme, e.g., with 65% higher specificity, 40% shorter detection delay, and 7% less performance overhead. Paper 1:10pm-1:40pm8A: GPU-Accelerated Applications Zoom Meeting A Ana Lucia Varbanescu Parallel Shift-Invert Spectrum Slicing on Distributed Architectures with GPU Accelerators
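The profile-then-threshold idea behind SDS/B can be sketched as a simple range check over cache-related statistics. This is a hedged illustration with made-up parameters; the paper's exact statistical tests and the periodicity-based SDS/P variant are not shown.

import numpy as np

def build_profile(baseline_samples, k=3.0):
    # Normal operating range for each cache-related statistic (mean +/- k * std),
    # built from measurements taken while no attack is present.
    m = np.mean(baseline_samples, axis=0)
    s = np.std(baseline_samples, axis=0)
    return m - k * s, m + k * s

def is_anomalous(sample, low, high, min_violations=2):
    # Flag a possible memory DoS attack if several statistics leave their normal range.
    violations = np.sum((sample < low) | (sample > high))
    return violations >= min_violations

# Example with two hypothetical statistics (e.g., cache miss rate, access latency).
low, high = build_profile(np.array([[0.10, 95.0], [0.12, 101.0], [0.11, 98.0]]))
print(is_anomalous(np.array([0.35, 240.0]), low, high))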
Parallel Shift-Invert Spectrum Slicing on Distributed Architectures with GPU Accelerators
David B. Williams-Young and Chao Yang (Lawrence Berkeley National Laboratory) Abstract Video Link Slides Link The solution of large scale eigenvalue problems (EVP) is often the computational bottleneck for many scientific and engineering applications. Traditional eigensolvers, such as direct (e.g. ScaLAPACK) and Krylov subspace (e.g. Lanczos) methods, have struggled in achieving high scalability on large computing resources due to communication and synchronization bottlenecks which are inherent in their implementation. This includes a difficulty in developing well-performing ports of these algorithms to architectures which rely on the use of accelerators, such as graphics processing units (GPU), for the majority of their floating point operations. Recently, there has been significant research into the development of eigensolvers based on spectrum slicing, in particular shift-invert spectrum slicing, to alleviate the communication and synchronization bottlenecks of traditional eigensolvers. In general, spectrum slicing trades the global EVP for many smaller, independent EVPs which may be combined to assemble some desired subset of the entire eigenspectrum. The result is a method which utilizes more floating point operations than traditional eigensolvers, but in a way which allows for the expression of massive concurrency leading to an overall improvement in time-to-solution on large computing resources. In this work, we will examine the performance of parallel shift-invert spectrum slicing on modern GPU clusters using state-of-the-art linear algebra software. Detailed Analysis and Optimization of CUDA K-means Algorithm
Detailed Analysis and Optimization of CUDA K-means Algorithm
Martin Kruliš and Miroslav Kratochvíl (Charles University) Abstract Video Link Slides Link K-means is one of the most frequently used algorithms for unsupervised clustering data analysis. Individual steps of the k-means algorithm include nearest neighbor finding, efficient distance computation, and cluster-wise reduction, which may be generalized to many other purposes in data analysis, visualization, and machine learning. The efficiency of the available implementations of k-means computation steps therefore directly affects many other applications. In this work, we examine the performance limits in the context of modern massively parallel GPU accelerators. Despite the existence of many published papers on this topic, we have found that crucial performance aspects of the GPU implementations remain unaddressed, including the optimizations for memory bandwidth, cache limits, and workload dispatching on problem instances of varying cluster count, dataset size, and dimensionality. We present a detailed analysis of individual computation steps and propose several optimizations that improve the overall performance on contemporary GPU architectures. Our open-source prototype exhibits significant speedup over the current state-of-the-art implementations in virtually all practical scenarios. Performance Portable Supernode-based Sparse Triangular Solver for Manycore Architectures
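The two computation steps this abstract highlights, nearest-centroid assignment and cluster-wise reduction, make up one Lloyd iteration. The NumPy sketch below shows that algorithmic structure only; it includes none of the paper's CUDA-specific memory, cache, or dispatch optimizations.

import numpy as np

def kmeans_step(points, centroids):
    # One Lloyd iteration: nearest-centroid assignment, then cluster-wise reduction.
    # Squared distances expanded as ||p||^2 - 2 p.c + ||c||^2 to stay vectorized.
    d2 = (np.sum(points**2, axis=1, keepdims=True)
          - 2.0 * points @ centroids.T
          + np.sum(centroids**2, axis=1))
    assign = np.argmin(d2, axis=1)
    new_centroids = np.zeros_like(centroids)
    counts = np.bincount(assign, minlength=len(centroids)).astype(float)
    np.add.at(new_centroids, assign, points)        # cluster-wise reduction (scatter-add)
    nonempty = counts > 0
    new_centroids[nonempty] /= counts[nonempty, None]
    return assign, new_centroids

# Example: 1,000 random 2-D points clustered around 4 centroids.
pts = np.random.rand(1000, 2)
cent = pts[np.random.choice(len(pts), 4, replace=False)]
labels, cent = kmeans_step(pts, cent)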
Performance Portable Supernode-based Sparse Triangular Solver for Manycore Architectures
Ichitaro Yamazaki, Sivasankaran Rajamanickam, and Nathan Ellingwood (Sandia National Labs) Abstract Video Link Slides Link The sparse triangular solve is an important kernel in many applications. Instead of developing a general sparse triangular solver, in this paper we take advantage of the supernodal structure of the triangular matrices that come from the direct factorization of a sparse matrix. We compare the effects of different scheduling schemes on performance, and also investigate an algorithmic variant called the partitioned inverse method. We implemented our solver using Kokkos, and it is therefore portable to different machine architectures from a single code base. Our experimental results on CPUs and a GPU demonstrate the trade-offs between the different approaches in terms of stability, storage, and performance. Paper
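To make the supernodal structure concrete, the sketch below performs forward substitution where each supernode is a contiguous block of columns stored densely: a small dense triangular solve on the diagonal block followed by a dense update of the remaining right-hand side. The supernode partition is assumed to be given, and this illustrates the general technique rather than the authors' Kokkos implementation.

```python
# Supernodal forward substitution for a lower-triangular L.
import numpy as np
from scipy.linalg import solve_triangular

def supernodal_lower_solve(L, b, supernodes):
    """supernodes: list of (start, end) column ranges partitioning L."""
    x = b.astype(float).copy()
    for s, e in supernodes:
        # Dense triangular solve on the supernode's diagonal block.
        x[s:e] = solve_triangular(L[s:e, s:e], x[s:e], lower=True)
        # Dense update (GEMV) of the rows below the supernode.
        x[e:] -= L[e:, s:e] @ x[s:e]
    return x

L = np.tril(np.random.default_rng(0).normal(size=(8, 8))) + 4 * np.eye(8)
b = np.arange(8.0)
x = supernodal_lower_solve(L, b, [(0, 3), (3, 6), (6, 8)])
print(np.allclose(L @ x, b))   # True
```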
1:10pm-1:50pm8B: Data Centers and the Edge Zoom Meeting B Sudhanva Gurumurthi
OVERSEE: Outsourcing Verification to Enable Resource Sharing in Edge Environment
Xiaoqing Cai, Jiuchen Shi, Rui Yuan, Chang Liu, Wenli Zheng, Quan Chen, Chao Li, Jingwen Leng, and Minyi Guo (Shanghai Jiao Tong University) Abstract Video Link Slides Link Multi-tenant (or colocation) data centers are a good way to support edge computing, since each enterprise or organization usually has only a limited number of servers at an edge site. When a data center tenant faces a workload burst, renting resources from other tenants in the same data center can provide the required resources while keeping the merits of edge computing, but the reliability and performance challenges this raises are daunting. In this paper, we propose OVERSEE, an outsourcing verification mechanism that enables resource sharing in multi-tenant data centers, fully exploiting the benefits of edge computing. OVERSEE addresses the above two challenges by making skillful use of the Intel SGX suite. OVERSEE consists of two sub-schemes, a Report-Proof mechanism and a Sampling-Challenging mechanism. The Report-Proof mechanism guarantees that a task outsourced by a tenant is executed correctly, i.e., completely and without modification, in the operating environment provided by another tenant. The Sampling-Challenging mechanism can be used to verify that sufficient computing capacity is provided to achieve the required QoS according to the resource lease agreement between the tenants. Theoretical analysis shows the effectiveness of OVERSEE, and experimental results show that it introduces minimal overhead.
Reducing Latency in Multi-Tenant Data Centers via Cautious Congestion Watch
Ahmed M. Abdelmoniem (Hong Kong University of Science and Technology; Assiut University, Egypt) and Hengky Susanto and Brahim Bensaou (Hong Kong University of Science and Technology) Abstract Video Link Slides Link Modern data centers host a plethora of interactive, data-intensive applications that are known to generate large numbers of short-lived parallel flows. These short-lived flows must complete their transfer quickly to meet the stringent performance requirements of interactive applications. Unfortunately, network resources (e.g., switch buffer space) in the data center are scarce, and congestion results in long latency. As a first solution, following a traditional design rationale, Data Center TCP (DCTCP) was proposed to speed up the completion of short-lived flows by maintaining a low buffer occupancy in the switch. In general, DCTCP performs well in homogeneous environments. However, its performance degrades quickly in heterogeneous environments or when it uses a large initial congestion window. To resolve this problem, we propose a Hypervisor-based congestion watching mechanism (HWatch), which measures the network load in the data center via ECN and uses the resulting statistics to determine an appropriate initial congestion window size that avoids congestion. HWatch requires neither modifications to the TCP stack in the VMs nor specialized network hardware features to meet its targets. In our evaluation, we demonstrate the benefits of HWatch in improving the performance of TCP flows through large-scale ns2 simulations and testbed experiments in a small data center.
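The core idea of watching ECN marks to size the initial window can be illustrated in a few lines: track the recent fraction of marked packets and hand new flows a window that shrinks as marking grows. The window bounds and the linear mapping below are assumptions made for illustration, not HWatch's actual policy.

```python
# Sketch of ECN-driven selection of the initial congestion window.
from collections import deque

class CongestionWatch:
    MIN_IW, MAX_IW = 4, 64                   # initial window bounds in MSS (assumed)

    def __init__(self, history=1000):
        self.marks = deque(maxlen=history)   # 1 = ECN-marked packet, 0 = unmarked

    def record(self, ecn_marked):
        self.marks.append(1 if ecn_marked else 0)

    def initial_window(self):
        """Pick an initial cwnd from the observed ECN marking fraction."""
        if not self.marks:
            return self.MAX_IW
        frac = sum(self.marks) / len(self.marks)
        iw = round(self.MAX_IW * (1.0 - frac))
        return max(self.MIN_IW, min(self.MAX_IW, iw))

watch = CongestionWatch()
for marked in [0] * 900 + [1] * 100:         # ~10% marking rate observed
    watch.record(marked)
print(watch.initial_window())                # 58 MSS under light congestion
```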
URSA: Precise Capacity Planning and Fair Scheduling based on Low-level Statistics for Public Clouds
Wei Zhang, Ningxin Zheng, and Quan Chen (Shanghai Jiao Tong University); Yong Yang, Zhuo Song, and Tao Ma (Alibaba Cloud); and Jingwen Leng and Minyi Guo (Shanghai Jiao Tong University) Abstract Video Link Slides Link Database platform-as-a-service (dbPaaS) is developing rapidly, and a large number of databases have been migrated to run on Clouds for the low cost and flexibility. Emerging Clouds rely on tenants to provide the resource specifications for their database workloads. However, tenants tend to over-estimate the resource requirements of their databases, resulting in unnecessarily high cost and low Cloud utilization. A methodology that automatically suggests the "just-enough" resource specification that fulfills the performance requirement of every database workload is therefore profitable.
Reliability Augmentation of Requests with Service Function Chain Requirements in Mobile Edge-Cloud Networks
Weifa Liang and Yu Ma (The Australian National University), Wenzheng Xu (Sichuan University), Xiaohua Jia (City University of Hong Kong), and Sid Chi-Kin Chau (The Australian National University) Abstract Video Link Slides Link Provisioning reliable network services for mobile users in a mobile edge computing environment is a top priority for most network service providers, as unreliable or severely failed services result in tremendous losses of revenue and customers. In this paper, we study a novel service reliability augmentation problem in a Mobile Edge-Cloud (MEC) network, where mobile users request network services by issuing requests with service function chain (SFC) requirements and reliability expectations, and an admitted request may not meet its reliability expectation initially. To enhance a request's service reliability toward its expectation, it is common practice to make use of redundant backups, that is, to place redundant instances of each Virtual Network Function (VNF) in its SFC in case the primary VNF instance fails. In this paper, we aim to augment the reliability of each admitted request as much as possible, with the ultimate objective of reaching its reliability expectation, subject to the computing capacity of each cloudlet in the network. To this end, we first formulate a novel service reliability augmentation problem. We then propose an integer linear program (ILP) solution for the admitted request, and develop a randomized algorithm with a provable approximation ratio at the cost of a moderate resource-constraint violation. We also devise an efficient heuristic algorithm for the problem without any resource-constraint violation. Finally, we evaluate the performance of the proposed algorithms through experimental simulations. The results demonstrate that the proposed algorithms are promising. Paper
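The standard reliability model behind redundant VNF backups helps make the problem concrete: a VNF with per-instance reliability r and k backups survives with probability 1 - (1 - r)^(k + 1), and an SFC's reliability is the product over its VNFs. The sketch below computes this and greedily places backups where they raise chain reliability the most, under a single assumed capacity budget; it is a naive stand-in for the paper's ILP, randomized, and heuristic algorithms, with made-up numbers.

```python
# Reliability of an SFC with redundant VNF backups, plus a naive greedy placer.
def chain_reliability(rel, backups):
    prod = 1.0
    for r, k in zip(rel, backups):
        prod *= 1.0 - (1.0 - r) ** (k + 1)   # VNF survives if any of its instances survives
    return prod

def greedy_backups(rel, capacity):
    backups = [0] * len(rel)
    for _ in range(capacity):                # place one backup instance per unit of capacity
        gains = []
        for i in range(len(rel)):
            trial = backups.copy()
            trial[i] += 1
            gains.append(chain_reliability(rel, trial) - chain_reliability(rel, backups))
        backups[max(range(len(rel)), key=gains.__getitem__)] += 1
    return backups

rel = [0.95, 0.90, 0.99]                     # per-instance reliability of a 3-VNF SFC (assumed)
b = greedy_backups(rel, capacity=3)
print(b, round(chain_reliability(rel, b), 4))
```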
8C: Storage and I/O Optimization Zoom Meeting C Sriram Krishnamoorthy
OPS: Optimized Shuffle Management System for Apache Spark
Yuchen Cheng, Chunghsuan Wu, Yanqiang Liu, and Rui Ren (Shanghai Jiao Tong University); Hong Xu (City University of Hong Kong); Bin Yang (Intel Corporation); and Zhengwei Qi (Shanghai Jiao Tong University) Abstract Video Link Slides Link In recent years, distributed computing frameworks such as Hadoop MapReduce and Spark have been widely used for big data processing. With the explosive growth in data volume, companies tend to store the intermediate data of the shuffle phase on disk instead of in memory, so the shuffle phase involves both intensive network and disk I/O. To reduce this overhead, we propose OPS, an open-source distributed shuffle management system based on Spark that provides an independent shuffle service for Spark. By using early-merge and early-shuffle strategies, OPS alleviates the I/O overhead of the shuffle phase and efficiently schedules I/O and computing resources. OPS also uses a slot-based scheduling algorithm to predict and compute the optimal scheduling of reduce tasks. In addition, OPS provides a taint-redo strategy to ensure the fault tolerance of computing jobs. We evaluated the performance of OPS on a 100-node Amazon AWS EC2 cluster. Overall, OPS reduces shuffle overhead by nearly 50%. In the HiBench test cases, OPS improves end-to-end completion time by nearly 30% on average.
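As a generic illustration of the early-merge idea mentioned above, the snippet below appends each finished map task's output into one file per reduce partition, so a reducer later reads a single merged stream instead of many small fragments. The file layout and naming are assumptions for illustration only, not the OPS implementation.

```python
# Generic "early merge" of shuffle output into per-reduce-partition files.
import os

def early_merge(map_output, shuffle_dir):
    """map_output: dict {reduce_partition_id: bytes} produced by one map task."""
    os.makedirs(shuffle_dir, exist_ok=True)
    for partition_id, data in map_output.items():
        # Append incrementally as soon as the map task finishes.
        with open(os.path.join(shuffle_dir, f"reduce-{partition_id}.data"), "ab") as f:
            f.write(data)

# Usage: call early_merge as each map task completes; the reducer for
# partition p then reads shuffle_dir/reduce-p.data sequentially.
early_merge({0: b"k1\tv1\n", 1: b"k2\tv2\n"}, "ops-shuffle-demo")
early_merge({0: b"k3\tv3\n"}, "ops-shuffle-demo")
```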
SeRW: Adaptively Separating Read and Write upon SSDs of Hybrid Storage Server in Clouds
Fan Deng, Qiang Cao, Shucheng Wang, Shuyang Liu, and Jie Yao (Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology) and Yuanyuan Dong and Puyuan Yang (Alibaba Inc.) Abstract Video Link Slides Link Nowadays, cloud providers embrace hybrid storage servers to reap both the high I/O performance of solid-state drives (SSDs) and the low cost of hard disk drives (HDDs). These hybrid storage servers generally employ SSDs as primary storage, directly serving requests from front-end applications, while using HDDs as secondary storage to provide sufficient capacity.
Scalable Coordination of Hierarchical Parallelism
Vinay Devadas and Matthew Curtis-Maury (NetApp, Inc) Abstract Video Link Slides Link Given continually increasing core counts, multiprocessor software scaling becomes critical. Applications that operate on hierarchical data are especially difficult to parallelize efficiently. In such applications, correct execution relies on all threads coordinating their accesses within the hierarchy. At the same time, high-performance execution requires that this coordination happen efficiently while maximizing parallelism.
Mass: Workload-Aware Storage Policy for OpenStack Swift
Yu Chen, Wei Tong, Dan Feng, and Zike Wang (Huazhong University of Science and Technology) Abstract Video Link Slides Link A cloud object store must serve workloads from multiple tenants. Different applications have different access characteristics and performance requirements, so it is necessary for the cloud object store to provide tenant-specific policies. OpenStack Swift offers a storage policy mechanism that enables separate storage configurations for different applications. However, the limited configurability of this policy mechanism makes it difficult to provide efficient and flexible policies that meet the evolving needs of applications. First, the original policies control only the frontend paths of request forwarding, without covering the backend paths of request processing, and thus cannot provide sufficient optimization of workload performance. Second, these policies must be pre-defined and remain unchanged at runtime, so they lack the flexibility to adapt to workload changes during runtime. In addition, effective execution of tenant-specific policies assumes that all storage layers are aware of a request's source, whereas the storage server layer lacks the information required to distinguish requests from different tenants. Paper
|