

# Enabling performance portability of data-parallel OpenMP applications on asymmetric multicore processors

### Juan Carlos Sáez\*, Fernando Castro\*, Manuel Prieto-Matías\*,†

\*Facultad de Informática, <sup>†</sup>Instituto de Tecnología del Conocimiento (ITC), Complutense University of Madrid, Spain

49th International Conference on Parallel Processing (ICPP '20)

August 17-20, 2020







This research has been supported by



Grant references: RTI2018-093684-B-I00 and S2018/TCS-4423







Comunidad de Madrid



# Asymmetric Multicore Processors (AMPs)

- Performance asymmetry: big cores + small cores
- Same Instruction Set Architecture (ISA) but different features:
  - Processor frequency and power consumption
  - Microarchitecture
    - In-order vs. out-of-order pipeline
    - Retirement/issue width
  - Cache(s) size and hierarchy









# Example: ARM big.LITTLE processor






# Intel Lakefield's hybrid processor



1 Sunny Cove core + 4 Tremont cores



Samsung Galaxy Book S

Microsoft Surface Neo







Goal: Automatically deliver good performance to data-parallel loop-based OpenMP programs on AMPs




















- Main limiting factors for scalability of loop-based OpenMP programs
  1. Phases with limited parallelism (e.g. sequential sections)
  2. Load imbalance in iteration distribution
  3. Shared-resource contention (last-level cache, memory bandwidth)

### **Issue on AMPs**

Cores with different performance inherently introduce load imbalance



Application with a single parallel loop runs on AMP (2 big cores + 2 small cores)



- Legacy OpenMP code targets symmetric multicores
- The static schedule is used, as iterations have a similar amount of work
  - Each thread runs the same number of iterations
- Execution of the *unmodified* application on an AMP




Application with a single parallel loop runs on sCMP (4 small cores)















# Addressing the load imbalance

Can't we just assign more iterations to big-core threads in proportion to the big-to-small relative performance?

- Speedup Factor $(SF)^1 \Rightarrow$ big-to-small relative performance: $SF = \frac{Ctime_{small}}{Ctime_{big}}$

SF is not only platform- and application-specific but may also vary across loops

<sup>1</sup>For these experiments, the SF was measured as the ratio of completion times (small-to-big) registered for each loop running with a single thread







- We proposed three asymmetry-aware loop-scheduling methods
  - AID: Asymmetric Iteration Distribution
  - Replacements for static and dynamic methods on AMP
    - Cater to the demands of different applications








### Features

- Implemented in *libgomp* (GNU OpenMP runtime system)
- Applications need to be recompiled, but no changes required in source code
- The same binary can be used on different platforms with the same ISA
  - The runtime system automatically adapts to the platform







**2** Design and implementation of AID

**3** Experimental Evaluation










# **AID** loop-scheduling methods

### 3 variants of Asymmetric Iteration Distribution (AID)

1. AID-static: replacement for static on AMPs
2. AID-hybrid: "safer" version of AID-static
3. AID-dynamic: replacement for dynamic on AMPs






### **Common aspects**


- Usually assign more loop iterations to big-core threads than to small-core threads
  - Based on the loop's SF (predicted at runtime)
- Designed for scenarios with no oversubscription
- There is no need to modify applications to activate them
  - Environment variables for enabling and setting parameters





[Figure: shared pool of loop iterations it0 to it15]



### Lock-free implementation

- 2 shared counters: next and end
- chunk (default value 1)
- Uses fetch-and-add
  - Atomic: next += chunk
- Each thread invokes gomp\_iter\_dynamic\_next() until next >= end
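The counter-based scheme above can be sketched with C11 atomics. This is a simplified illustration and not libgomp's actual code: the `work_pool` struct and the `pool_init()` and `pool_next_chunk()` helpers are hypothetical names introduced here, and details such as counter overflow handling are left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical shared state for one loop: iterations in [next, end)
 * have not been handed out yet. */
struct work_pool {
    atomic_long next;  /* first iteration not yet handed out */
    long end;          /* one past the last iteration (read-only in the loop) */
    long chunk;        /* iterations handed out per request (default 1) */
};

static void pool_init(struct work_pool *p, long start, long end, long chunk)
{
    assert(chunk > 0 && start <= end);
    atomic_init(&p->next, start);
    p->end = end;
    p->chunk = chunk;
}

/* Claims the next chunk with a single fetch-and-add, without taking a lock.
 * Returns false once the pool is exhausted; otherwise stores the claimed
 * half-open range in [*first, *last). The counter may overshoot end, which
 * is harmless here because a thread stops asking after the first failure. */
static bool pool_next_chunk(struct work_pool *p, long *first, long *last)
{
    long begin = atomic_fetch_add(&p->next, p->chunk);  /* atomic: next += chunk */
    if (begin >= p->end)
        return false;
    long stop = begin + p->chunk;
    *first = begin;
    *last = stop < p->end ? stop : p->end;  /* clamp the final, partial chunk */
    return true;
}
```

Each thread would call `pool_next_chunk()` in a loop and run the iterations of every range it obtains, stopping when the function returns false.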

# **AID-Static**

Designed for loops where iterations have the same amount of work

### static schedule

- All threads are allotted "the same" number of iterations
- Big-core threads complete their share earlier, causing imbalance

### AID-static

- Small-core threads $\rightarrow k$ iterations
- Big-core threads $\rightarrow SF \cdot k$ iterations
- total\_iterations = $N_{big} \cdot SF \cdot k + N_{small} \cdot k$





 $k = \frac{total\_iterations}{N_{big} \cdot SF + N_{small}}$ 
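As a worked instance of the formula for $k$, the following sketch computes the per-thread shares. The `aid_static_shares` struct and the `aid_static_split()` helper are hypothetical names, and rounding both shares down and reporting the remainder separately is an assumption made for illustration; the slides do not say how rounding is handled.

```c
#include <assert.h>

/* Hypothetical result type: per-thread shares under AID-static. */
struct aid_static_shares {
    long small_share;  /* k: iterations per small-core thread */
    long big_share;    /* SF * k (rounded down): iterations per big-core thread */
    long leftover;     /* iterations not covered because of rounding */
};

/* Applies k = total_iterations / (N_big * SF + N_small).
 * Assumption: shares are rounded down and the remainder is returned
 * so the caller can distribute it some other way. */
static struct aid_static_shares
aid_static_split(long total_iterations, int n_big, int n_small, double sf)
{
    assert(total_iterations >= 0 && n_big >= 0 && n_small >= 0);
    assert(n_big + n_small > 0 && sf >= 1.0);

    struct aid_static_shares s;
    double k = (double)total_iterations / (n_big * sf + n_small);
    s.small_share = (long)k;
    s.big_share = (long)(sf * k);
    s.leftover = total_iterations
                 - (n_big * s.big_share + n_small * s.small_share);
    return s;
}
```

For example, with 1200 iterations, 2 big cores, 2 small cores and SF = 2, this gives k = 200: each small-core thread receives 200 iterations and each big-core thread 400.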





# **AID-Static**

[Figure: AID-static execution timeline, from Begin Loop to End Loop]





















- Efficient lock-free implementation
- Threads complete iterations even during the sampling phase (δ<sub>i</sub>)
- Each thread needs to gather 2 timestamps (vsyscall)
- Shared counters to maintain aggregate completion time

# **AID-Static: Implementation**

- Threads in 3 possible states
  - A state transition may occur when the thread "steals" work from the shared pool







# **AID-static: Limitations**

- Predicted SF may not be representative throughout the loop
  - Processing varies slightly across iterations
  - SF misprediction





### AID-Static could introduce load imbalance

# **AID-Hybrid: Implementation**



- AID-hybrid: AID-static + OpenMP's dynamic
  - *f* is a configurable parameter (percentage)
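A minimal sketch of how the two schemes might share a loop, assuming that *f* is the percentage of iterations distributed up front with AID-static and that the remainder is left to dynamic. The slide only states that *f* is a percentage, so this reading is an assumption, and the `hybrid_split` struct and `aid_hybrid_split()` helper are hypothetical names.

```c
#include <assert.h>

/* Hypothetical result type: how a loop is divided under AID-hybrid. */
struct hybrid_split {
    long aid_static_iters;  /* assumed: distributed up front with AID-static */
    long dynamic_iters;     /* assumed: left in the shared pool for dynamic */
};

/* Splits the iteration space according to the percentage f (0..100).
 * Integer arithmetic rounds the AID-static part down; overflow for very
 * large iteration counts is not handled in this sketch. */
static struct hybrid_split aid_hybrid_split(long total_iterations, int f_percent)
{
    assert(total_iterations >= 0);
    assert(f_percent >= 0 && f_percent <= 100);

    struct hybrid_split h;
    h.aid_static_iters = total_iterations * f_percent / 100;
    h.dynamic_iters = total_iterations - h.aid_static_iters;
    return h;
}
```

Under this reading, a smaller *f* leaves more iterations to dynamic, which can absorb an SF misprediction at the cost of more runtime calls.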







# **AID-dynamic**

- Goal: To make a good replacement for dynamic on AMPs
- It relies on two configurable *chunk* values:
  - major (M): Used for AID phases (variant of dynamic)
    - small-core threads  $\rightarrow M$  iterations
    - big-core threads  $\rightarrow M \cdot R$  iterations
    - *R* = g(SF)
  - minor (m): Used in between AID phases and at the end of the loop's execution












$$R(t+1) = \begin{cases} SF & t = 0\\ R(t) \cdot \frac{AvgTimeAID_{small}(t)}{AvgTimeAID_{big}(t)} & t > 0 \end{cases}$$
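The recurrence can be written as a small function. The name `aid_dynamic_next_ratio()` and its parameters are hypothetical; the two average times stand for $AvgTimeAID_{small}(t)$ and $AvgTimeAID_{big}(t)$, and how the runtime collects them is not shown.

```c
#include <assert.h>

/* Returns R(t+1) following the recurrence above.
 * At t = 0 there are no measurements yet, so the ratio starts at SF.
 * For t > 0 the previous ratio is scaled by how long small-core threads
 * took in the last AID phase relative to big-core threads. */
static double aid_dynamic_next_ratio(int t, double sf, double r_t,
                                     double avg_time_small,
                                     double avg_time_big)
{
    assert(t >= 0);
    if (t == 0)
        return sf;
    assert(avg_time_big > 0.0);
    return r_t * (avg_time_small / avg_time_big);
}
```

For instance, if R(t) = 3 and big-core threads needed twice as long as small-core threads to finish their share in the last AID phase, the function returns 1.5, so big-core threads receive fewer iterations per small-core iteration in the next phase.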



# Required changes in the GCC compiler



- To guarantee performance portability with our proposal:
  - The runtime system must be deployed as a dynamic library (libgomp.so)
  - The compiled program must invoke loop-related runtime API calls
- Issue: GCC omits loop-related API calls when the schedule clause is not provided

```
...
#pragma omp for
for (j = 0; j < grid_points[1]; j++) {
    eta = (double)j * dnym1;
    for (k = 0; k < grid_points[2]; k++) {
        zeta = (double)k * dnzm1;
        exact_solution(xi, eta, zeta, temp);
        for (m = 0; m < 5; m++) {
            u[i][j][k][m] = temp[m];
        }
    }
}
```

```
$ nm -u bt.B | grep -i GOMP_
U GOMP_barrier@@GOMP_1.0
U GOMP_parallel@@GOMP_4.0
```



The runtime system cannot control the schedule of those loops



- We changed the *default* value for the schedule clause in GCC: static $\rightarrow$ runtime
  - If the clause is omitted, the runtime uses the schedule defined in the OMP\_SCHEDULE env. variable
  - Very simple change in GCC 8.3: omp\_extract\_for\_data() at gcc/omp-general.c

```
#pragma omp for
for (j = 0; j < grid_points[1]; j++) {
    eta = (double)j * dnym1;
    for (k = 0; k < grid_points[2]; k++) {
        zeta = (double)k * dnzm1;
        exact_solution(xi, eta, zeta, temp);
        for (m = 0; m < 5; m++) {
            u[i][j][k][m] = temp[m];
        }
    }
}
```

| Terminal                                          |
|---------------------------------------------------|
| <pre>\$ nm -u bt.B_modified \| grep -i GOMP_</pre> |
| U GOMP_loop_end@@GOMP_1.0                         |
| U GOMP_loop_end_nowait@@GOMP_1.0                  |
| U GOMP_loop_runtime_next@@GOMP_1.0                |
| U GOMP_loop_runtime_start@@GOMP_1.0               |
| U GOMP_parallel@@GOMP_4.0                         |

Runtime system is now notified when each loop begins (GOMP\_loop\_\*\_start()) and when each thread requests work to be assigned to it (GOMP\_loop\_\*\_next())
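For comparison, an unmodified GCC can be made to defer the scheduling decision for an individual loop by writing the clause explicitly. The sketch below uses the standard `schedule(runtime)` clause; the `sum_array()` function is a made-up example and is not taken from the benchmarks. Unlike the compiler change described above, this requires editing the source.

```c
#include <assert.h>
#include <stddef.h>

/* Sums an array with a work-sharing loop whose schedule is chosen when the
 * program runs. With schedule(runtime), the loop is driven through calls
 * into the OpenMP runtime library instead of being split statically by the
 * compiler; the exact runtime symbols depend on the GCC version. */
static double sum_array(const double *v, size_t n)
{
    double total = 0.0;

    #pragma omp parallel
    {
        #pragma omp for schedule(runtime) reduction(+:total)
        for (long i = 0; i < (long)n; i++)
            total += v[i];
    }
    return total;
}
```

When built with `gcc -fopenmp`, the schedule for this loop is taken from the OMP\_SCHEDULE environment variable (for example `dynamic,4`). Without `-fopenmp` the pragmas are ignored and the loop runs sequentially.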








# **Experimental platforms**

#### Platform A

- 32-bit ARM big.LITTLE processor
  - 4 × Cortex-A15 *big* cores @ 2.0 GHz
  - 4 × Cortex-A7 *small* cores @ 1.5 GHz
- 2 GB LPDDR3 SDRAM @ 933 MHz

#### Platform B (Intel server platform)

- 64-bit Intel Xeon E5-2620 v4 (Broadwell-EP)
  - 4 × fast cores @ 2.1 GHz
  - 4 × slow cores @ 1.2 GHz and 87.5% duty cycle
- 32 GB DDR4 SDRAM @ 2133 MHz

# Applications and thread-to-core mappings

- 21 OpenMP benchmarks
  - NAS Parallel
  - PARSEC 3
  - Rodinia
- **GCC** 8.3 + Linux kernel 4.14.165
- Evaluated loop-scheduling methods
  - static (BS and SB)
  - dynamic (BS and SB)
  - guided (BS and SB)
  - AID-static
  - AID-hybrid
  - AID-dynamic

[Figure: BS and SB thread-to-core mappings (threads T0 to T7)]



# Relative performance on Platform A





- Running the master thread on a big core brings substantial improvements in some cases
- AID-static and AID-hybrid make good replacements for static (up to 30.7% and 56% improvement)
- OpenMP dynamic and AID-dynamic perform in a close range, but a $>10\%$ improvement is observed



# Relative performance on Platform B





- Smaller big-to-small performance ratios (max. 2.3x vs. 8.9x on Platform A)
- The overhead of dynamic negates its benefits in some cases due to lower SF values
  - AID-dynamic delivers higher gains vs. dynamic on this platform (22% on average)



# Average relative performance







# Dynamic vs AID-dynamic: different chunk values



- The average improvement with best chunk settings for AID-dynamic vs. static is 5.5%
- AID-dynamic delivers up to a 21.9% performance improvement
- With AID-dynamic, performance is less sensitive to the choice of chunk values





- **2** Design and implementation of AID
- **3** Experimental Evaluation
- **4** Conclusions and Future Work







- Conventional OpenMP loop-scheduling methods are not suitable for AMPs
  - static introduces load imbalance
  - dynamic better than static but subject to high overhead







- We proposed 3 alternative asymmetry-aware loop-scheduling methods
  - Implemented in *libgomp* (GCC 8.3)
  - No changes required in application code
  - Applications must be recompiled with our modified compiler







- Our experimental evaluation on real AMP hardware reveals their effectiveness
  - AID-static, AID-hybrid outperform static by up to 30.7% and 56%, respectively
  - AID-dynamic improves dynamic by up to 16.8%
    - Higher relative improvements when using the best chunk settings for each application





- **1** Explore the potential of using multiple AID methods in the same application
  - Loops with same-sized iterations $\rightarrow$ AID-static or AID-hybrid
  - Loops amenable to dynamic $\rightarrow$ AID-dynamic
  - Requires making changes in the application, plus parameter tuning and profiling





- **2** Leverage AID in multi-application scenarios
  - Devise interaction mechanisms between the OS and the runtime system





- **3** Evaluate the effectiveness of AID in other types of applications and heterogeneous platforms

