#### **TORSTEN HOEFLER**

# Progress in automatic GPU compilation and why you want to run MPI on your GPU

with Tobias Grosser and Tobias Gysi @ SPCL, presented at High Performance Computing, Cetraro, Italy, 2016





#### Evading various "ends" – the hardware view



Data partially collected by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond

















### Holy grail – auto-parallelization heterogenization



Regression Free

High Performance





## **Tool: Polyhedral Modeling**



#### Program Code

$$N = 4$$

$$(i, j) = (4,4)$$

#### Iteration Space
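
A minimal sketch of how a loop nest turns into an integer set. The slide's program code is not reproduced here, so the sketch assumes a triangular nest `for (i = 0; i <= N; i++) for (j = 0; j <= i; j++) S(i, j);`, whose iteration space is {(i, j) : 0 <= j <= i <= N}. The nest and both function names are assumptions for illustration only.

```c
#include <assert.h>

/* Assumed loop nest (illustrative, not taken from the slide):
 *   for (i = 0; i <= n; i++)
 *     for (j = 0; j <= i; j++)
 *       S(i, j);
 */

/* Membership test for the iteration space {(i, j) : 0 <= j <= i <= n}. */
int in_iteration_space(int i, int j, int n)
{
    return 0 <= j && j <= i && i <= n;
}

/* Count the integer points of the set by scanning its bounding box. */
int count_iterations(int n)
{
    assert(n >= 0);
    int count = 0;
    for (int i = 0; i <= n; i++)
        for (int j = 0; j <= n; j++)
            count += in_iteration_space(i, j, n);
    return count;
}
```

Under this assumption, N = 4 gives 15 statement instances, and (i, j) = (4, 4) is the last one in execution order.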







## **Mapping Computation to Device**



#### Iteration Space





$$BID = \{(i,j) \rightarrow \left( \left\lfloor \frac{i}{4} \right\rfloor \% 2, \left\lfloor \frac{j}{3} \right\rfloor \% 2 \right) \}$$

$$TID = \{(i,j) \rightarrow (i \% 4, j \% 3)\}$$
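
The two maps can be sketched as plain integer functions. This assumes non-negative i and j, since C's `/` and `%` only agree with floor and modulo in that range. The type `id2` and the function names are hypothetical. The constants in the maps imply a 2×2 grid of blocks with 4×3 threads each.

```c
#include <assert.h>

typedef struct { int x; int y; } id2; /* hypothetical helper type */

/* BID = {(i,j) -> (floor(i/4) % 2, floor(j/3) % 2)} */
id2 block_id(int i, int j)
{
    assert(i >= 0 && j >= 0);
    id2 b = { (i / 4) % 2, (j / 3) % 2 };
    return b;
}

/* TID = {(i,j) -> (i % 4, j % 3)} */
id2 thread_id(int i, int j)
{
    assert(i >= 0 && j >= 0);
    id2 t = { i % 4, j % 3 };
    return t;
}
```

For example, iteration (5, 4) lands in block (1, 1) as thread (1, 1). Iteration (9, 7) wraps around to block (0, 0), so a block may process more than one tile.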

#### Device Blocks & Threads







## Memory Hierarchy of a Heterogeneous System







#### **Host-device data transfers**


















# Mapping onto fast memory


















    for (i = 1; i <= 6; i++)
      for (j = 1; j <= 4; j++)
        ... = A[i+1][j] + A[i-1][j] + A[i][j] + A[i][j-1] + A[i][j+1];






























- Data needed on device
- 12 elements
- Minimal data, but complex transfer










- One-dimensional hull
- 20 elements
- Simple transfer, but redundant data









# Modeling multi-dimensional access behavior is important



- Two-dimensional hull
- 16 elements
- Simple transfer, less redundant data
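
The three element counts can be reproduced with a small sketch. It assumes one tile of 2×2 iterations of the 5-point stencil and a row length of 6 elements (j runs from 0 to 5 once the halo is included). Both values are inferred because they yield 12, 20, and 16; they are not stated on the slides. All names are hypothetical.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <string.h>

#define MAX_DIM 16 /* hypothetical cap on the tile's bounding box */

typedef struct {
    int exact;   /* distinct elements the tile reads */
    int hull_1d; /* contiguous range in row-major memory */
    int hull_2d; /* rectangular bounding box */
} footprint;

static int min_int(int a, int b) { return a < b ? a : b; }
static int max_int(int a, int b) { return a > b ? a : b; }

/* Footprint of a 5-point stencil over the tile
 * i in [i0, i0 + ti), j in [j0, j0 + tj), array rows of row_len elements. */
footprint stencil_footprint(int i0, int j0, int ti, int tj, int row_len)
{
    static const int di[5] = {1, -1, 0, 0, 0};
    static const int dj[5] = {0, 0, 0, -1, 1};
    bool seen[MAX_DIM][MAX_DIM];
    int lo = INT_MAX, hi = INT_MIN;
    int imin = INT_MAX, imax = INT_MIN, jmin = INT_MAX, jmax = INT_MIN;
    footprint f = {0, 0, 0};

    assert(ti >= 1 && tj >= 1 && ti + 2 <= MAX_DIM && tj + 2 <= MAX_DIM);
    memset(seen, 0, sizeof seen);

    for (int i = i0; i < i0 + ti; i++)
        for (int j = j0; j < j0 + tj; j++)
            for (int k = 0; k < 5; k++) {
                int a = i + di[k], b = j + dj[k];
                /* Shift by the halo so indices start at 0. */
                bool *cell = &seen[a - i0 + 1][b - j0 + 1];
                if (!*cell) { *cell = true; f.exact++; }
                lo = min_int(lo, a * row_len + b);
                hi = max_int(hi, a * row_len + b);
                imin = min_int(imin, a); imax = max_int(imax, a);
                jmin = min_int(jmin, b); jmax = max_int(jmax, b);
            }

    f.hull_1d = hi - lo + 1;
    f.hull_2d = (imax - imin + 1) * (jmax - jmin + 1);
    return f;
}
```

Calling `stencil_footprint(1, 1, 2, 2, 6)` yields 12 exact elements, a one-dimensional hull of 20, and a two-dimensional hull of 16, matching the counts above under the stated assumptions.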





## **Profitability Heuristic**









#### Some results: Polybench 3.2



Xeon E5-2690 (10 cores, 0.5 Tflop/s) vs. Titan Black Kepler GPU (2.9k cores, 1.7 Tflop/s)





## Compiles all of SPEC CPU 2006 – Example: LBM







#### Brave new compiler world!?

#### Unfortunately not ...

- Limited to affine code regions
- Maybe generalizes to control-restricted programs
- No distributed anything!!



#### Good news:

- Much of traditional HPC fits that model
- Infrastructure is coming along





#### Bad news:

- Modern data-driven HPC and Big Data fit less well
- Need a programming model for <u>distributed</u> heterogeneous machines!









# How do we program GPUs today?





#### **CUDA**

- over-subscribe hardware
- use spare parallel slack for latency hiding

#### **MPI**

- host controlled
- full device synchronization

Figure legend: device, compute core, active thread, instruction latency





## Latency hiding at the cluster level?



#### dCUDA (distributed CUDA)

- unified programming model for GPU clusters
- avoid unnecessary device synchronization to enable system-wide latency hiding





# dCUDA extends CUDA with MPI-3 RMA and notifications

```
for (int i = 0; i < steps; ++i) {
  /* computation: 5-point stencil over this thread's points */
  for (int idx = from; idx < to; idx += jstride)
    out[idx] = -4.0 * in[idx] +
      in[idx + 1] + in[idx - 1] +
      in[idx + jstride] + in[idx - jstride];

  /* communication: put halo data to the left and right neighbor ranks */
  if (lsend)
    dcuda_put_notify(ctx, wout, rank - 1,
      len + jstride, jstride, &out[jstride], tag);
  if (rsend)
    dcuda_put_notify(ctx, wout, rank + 1,
      0, jstride, &out[len], tag);

  /* synchronization: wait until lsend + rsend notifications have arrived */
  dcuda_wait_notifications(ctx, wout,
    DCUDA_ANY_SOURCE, tag, lsend + rsend);

  swap(in, out); swap(win, wout);
}
```

- iterative stencil kernel
- thread-specific idx





- map ranks to blocks
- device-side put/get operations
- notifications for synchronization
- shared and distributed memory
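
To illustrate what a notified put provides, here is a toy single-process model in plain C. It is not the dCUDA API: `toy_window`, `toy_put_notify`, and `toy_test_notifications` are hypothetical names, and the sketch omits the matching by source and tag that the kernel above relies on.

```c
#include <assert.h>
#include <string.h>

enum { WIN_SIZE = 16 }; /* hypothetical window size in elements */

typedef struct {
    double mem[WIN_SIZE]; /* memory exposed to other ranks */
    int pending;          /* notifications delivered but not yet consumed */
} toy_window;

/* Copy count elements into the target window at offset, then notify. */
void toy_put_notify(toy_window *target, int offset, int count,
                    const double *src)
{
    assert(offset >= 0 && count >= 0 && offset + count <= WIN_SIZE);
    memcpy(&target->mem[offset], src, (size_t)count * sizeof *src);
    target->pending++; /* the notification follows the data */
}

/* Non-blocking stand-in for a wait: consume count notifications if that
 * many have arrived. Returns 1 on success, 0 otherwise. */
int toy_test_notifications(toy_window *w, int count)
{
    if (w->pending < count)
        return 0;
    w->pending -= count;
    return 1;
}
```

The point of the design is that the receiver never polls the data itself: once the expected number of notifications has been consumed, the halo is known to be in place, so no separate barrier is needed.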





### Hardware supported communication overlap







#### Implementation of the dCUDA runtime system





# Overlap of a copy kernel with halo exchange communication





# Weak scaling of MPI-CUDA and dCUDA for a stencil program







# Weak scaling of MPI-CUDA and dCUDA for a particle simulation







# Weak scaling of MPI-CUDA and dCUDA for sparse matrix-vector multiplication





#### http://spcl.inf.ethz.ch/Polly-ACC

#### dCUDA – distributed memory





try now: https://translate.google.de/#en/de/a%20bad%20day%20for%20Europe





