

# **High-Performance Distributed RMA Locks**

PATRICK SCHMID, MACIEJ BESTA, TORSTEN HOEFLER



















































# ETHzürich

# Locks



Various performance penalties













#### LOCKS: CHALLENGES





#### LOCKS: CHALLENGES































#### LOCKS: CHALLENGES

We need intra- and inter-node topology-awareness

We need to cover arbitrary topologies











# LOCKS: CHALLENGES

We need to distinguish between readers and writers

Reader

Reader

Reader

Write

Reader

# LOCKS: CHALLENGES



We need to distinguish between readers and writers

Reader

Reader

Reader

Writer

We need flexible performance for both types of processes









# spcl.inf.ethz.ch @spcl\_eth











































# WHAT WE WILL USE Reader-Writer Locks





# WHAT WE WILL USE Reader-Writer Locks





# WHAT WE WILL USE Reader-Writer Locks

R

\_\_\_





#### spcl.inf.ethz.ch @spcl\_eth

# WHAT WE WILL USE Reader-Writer Locks









What mechanism to use for efficient implementation?





How to ensure tunable performance?

What mechanism to use for efficient implementation?





How to ensure tunable performance?

What mechanism to use for efficient implementation?

























Cray BlueWaters







Cray BlueWaters















Cray BlueWaters



































Supported by many HPC libraries and languages



Supported by many HPC libraries and languages



Supported by many HPC libraries and languages







How to ensure tunable performance?

What mechanism to use for efficient implementation?



How to ensure tunable performance?

What mechanism to use for efficient implementation?





How to ensure tunable performance?

What mechanism to use for efficient implementation?

























































































































Modular design



















Each element has its own distributed MCS queue (DQ) of writers



























Each element has its own distributed MCS queue (DQ) of writers



























Each element has its own distributed MCS queue (DQ) of writers







Modular design



















Each element has its own distributed MCS queue (DQ) of writers

























Each element has its own distributed MCS queue (DQ) of writers



Modular design





















Each element has its own distributed MCS queue (DQ) of writers























Each element has its own distributed MCS queue (DQ) of writers

Readers and writers synchronize with a distributed counter (DC)























Each element has its own distributed MCS queue (DQ) of writers

Readers and writers synchronize with a distributed counter (DC)











































A tradeoff parameter for every structure



















Each DQ: fairness vs throughput of writers



A tradeoff parameter for every structure



















Each DQ: fairness vs throughput of writers



A tradeoff parameter for every structure





DT: a
parameter for
the
throughput of
readers vs
writers















R1 R9 W3 > W8

Each DQ: fairness vs throughput of writers

DC: a parameter for the latency of readers vs writers

A tradeoff parameter for every structure





DT: a
parameter for
the
throughput of
readers vs
writers





























Each DQ: The maximum number of lock passings within a DQ at level i, before it is passed to another DQ at i.

















Each DQ: The maximum number of lock passings within a DQ at level i, before it is passed to another DQ at i.

















Each DQ: The maximum number of lock passings within a DQ at level i, before it is passed to another DQ at i.

















 $T_{L,3}$ 



Each DQ: The maximum number of lock passings within a DQ at level i, before it is passed to another DQ at i.

















Larger  $T_{L,i}$ : more throughput at level i. Smaller  $T_{L,i}$ : more fairness at level i.

 $T_{L,3}$ 



Each DQ: The maximum number of lock passings within a DQ at level i, before it is passed to another DQ at i.















## DISTRIBUTED TREE OF QUEUES (DT) Throughput of readers vs writers



















## DISTRIBUTED TREE OF QUEUES (DT) Throughput of readers vs writers



DT: The maximum number of consecutive lock passings within readers (  $T_R$  ).

















## DISTRIBUTED TREE OF QUEUES (DT) Throughput of readers vs writers



DT: The maximum number of consecutive lock passings within readers (  $T_R$  ).































DC: every *k*th compute node hosts a partial counter, all of which constitute the DC.

$$k = T_{DC}$$

















DC: every *k*th compute node hosts a partial counter, all of which constitute the DC.



b|x|y

















DC: every *k*th compute node hosts a partial counter, all of which constitute the DC.



A writer holds the lock

b|x|y

















DC: every *k*th compute node hosts a partial counter, all of which constitute the DC.



A writer holds the lock

- plxla

Readers that / arrived at the CS

















DC: every *k*th compute node hosts a partial counter, all of which constitute the DC.



A writer holds the lock

Readers that / arrived at the CS

















DC: every *k*th compute node hosts a partial counter, all of which constitute the DC.



A writer holds the lock

Readers that / arrived at the CS















DC: every *k*th compute node hosts a partial counter, all of which constitute the DC.



A writer holds the lock

Readers that / arrived at the CS



















DC: every *k*th compute node hosts a partial counter, all of which constitute the DC.



A writer holds the lock

Readers that / arrived at the CS



















DC: every *k*th compute node hosts a partial counter, all of which constitute the DC.



A writer holds the lock

Readers that / arrived at the CS















DC: every *k*th compute node hosts a partial counter, all of which constitute the DC.



A writer holds the lock

Readers that / arrived at the CS



$$T_{DC}=2$$













DC: every *k*th compute node hosts a partial counter, all of which constitute the DC.



A writer holds the lock

Readers that / arrived at the CS



$$T_{DC}=2$$

































A lightweight acquire protocol for readers: only one atomic fetch-and-add (FAA) operation







A lightweight acquire protocol for readers: only one atomic fetch-and-add (FAA) operation







A lightweight acquire protocol for readers: only one atomic fetch-and-add (FAA) operation







A lightweight acquire protocol for readers: only one atomic fetch-and-add (FAA) operation

0|7|7 R1 0|1|1 R4 R2





A lightweight acquire protocol for readers: only one atomic fetch-and-add (FAA) operation







A lightweight acquire protocol for readers: only one atomic fetch-and-add (FAA) operation









A lightweight acquire protocol for readers: only one atomic fetch-and-add (FAA) operation







A lightweight acquire protocol for readers: only one atomic fetch-and-add (FAA) operation







A lightweight acquire protocol for readers: only one atomic fetch-and-add (FAA) operation







A lightweight acquire protocol for readers: only one atomic fetch-and-add (FAA) operation







A lightweight acquire protocol for readers: only one atomic fetch-and-add (FAA) operation





A lightweight acquire protocol for readers: only one atomic fetch-and-add (FAA) operation











































































































# **EVALUATION**CONSIDERED BENCHMARKS



# **EVALUATION**CONSIDERED BENCHMARKS

The latency benchmark



### **EVALUATION**Considered Benchmarks

The latency benchmark

### Throughput benchmarks:

**Empty-critical-section** 

Single-operation

Wait-after-release

Workload-critical-section

## **EVALUATION**CONSIDERED BENCHMARKS

The latency benchmark

#### DHT

Distributed hashtable evaluation

### Throughput benchmarks:

**Empty-critical-section** 

Single-operation

Wait-after-release

Workload-critical-section





### **EVALUATION DISTRIBUTED COUNTER ANALYSIS**







Throughput, 2% writers
Single-operation benchmark





### **EVALUATION**READER THRESHOLD ANALYSIS

Throughput, 0.2% writers, Empty-critical-section benchmark



### **EVALUATION**Comparison to the State-of-the-Art



[1] R. Gerstenberger et al. Enabling Highly-scalable Remote Memory Access Programming with MPI-3 One Sided. ACM/IEEE Supercomputing 2013.



### **EVALUATION**Comparison to the State-of-the-Art

Throughput, single-operation benchmark



[1] R. Gerstenberger et al. Enabling Highly-scalable Remote Memory Access Programming with MPI-3 One Sided. ACM/IEEE Supercomputing 2013.



## **EVALUATION**DISTRIBUTED HASHTABLE



[1] R. Gerstenberger et al. Enabling Highly-scalable Remote Memory Access Programming with MPI-3 One Sided. ACM/IEEE Supercomputing 2013.



#### OTHER ANALYSES

#### **OTHER ANALYSES**





@spcl\_eth

#### **CONCLUSIONS**



Modular distributed RMA lock, correctness with SPIN





Modular distributed RMA lock, correctness with SPIN



Parameter-based design, feasible with various RMA libs/languages





Modular distributed RMA lock, correctness with SPIN



Parameter-based design, feasible with various RMA libs/languages







throughput over state-of-the-art



Modular distributed RMA lock, correctness with SPIN

COMPARISON TO THE STATE-OF-THE-ART

Throughput [mln locks/s]

Percentages ar

MPI processes (P)



Improves latency and throughput over state-of-the-art



Parameter-based design, feasible with various RMA libs/languages



Enables high-performance distributed hashtabled



ble

es

#### spcl.inf.ethz.ch @spcl\_eth

#### **CONCLUSIONS**

Modular (

corre



Thank you for your attention

COMPARISON TO THE STATE-OF-THE-ART

Throughput [mln locks/s]

Percentages as

MPI processes (P)







Enables high-performance distributed hashtabled























DT: The maximum number of consecutive lock passings within writers (  $T_W$ ) and readers (  $T_R$ ).



















DT: The maximum number of consecutive lock passings within writers (  $T_W$ ) and readers (  $T_R$ ).

$$T_W = \prod_{i=1}^N T_{L,i}$$



















DT: The maximum number of consecutive lock passings within writers (  $T_W$ ) and readers (  $T_R$ ).

$$T_W = \prod_{i=1}^N T_{L,i}$$





























#### THE SPACE OF DESIGNS



#### THE SPACE OF DESIGNS















#### **EVALUATION D-MCS** vs others

#### Latency (LB)



#### Throughput (ECSB)





#### **EVALUATION**WRITER THRESHOLD ANALYSIS





#### **EVALUATION**FAIRNESS VS THROUGHPUT ANALYSIS

Throughput, 25% of writers, Single-operation benchmark





#### **EVALUATION**READER THRESHOLD ANALYSIS

Throughput, 2% and 5% writers, Empty-critical-section benchmark





#### **EVALUATION**DISTRIBUTED HASHTABLE



[1] R. Gerstenberger et al. Enabling Highly-scalable Remote Memory Access Programming with MPI-3 One Sided. ACM/IEEE Supercomputing 2013.



#### **FEASIBILITY ANALYSIS**



#### **FEASIBILITY ANALYSIS**

|                                                 | UPC (standard) [44]                        | Berkeley UPC [1]                                                                                        | SHMEM [4]                                             |
|-------------------------------------------------|--------------------------------------------|---------------------------------------------------------------------------------------------------------|-------------------------------------------------------|
| Put<br>Get                                      | UPC_SET<br>UPC_GET                         | <pre>bupc_atomicX_set_RS bupc_atomicX_read_RS</pre>                                                     | shmem_swap<br>shmem_mswap                             |
| Accumulate<br>FAO (SUM)<br>FAO (REPLACE)<br>CAS | UPC_INC UPC_INC, UPC_DEC UPC_SET UPC_CSWAP | <pre>bupc_atomicX_fetchadd_RS bupc_atomicX_fetchadd_RS bupc_atomicX_swap_RS bupc_atomicX_cswap_RS</pre> | shmem_fadd<br>shmem_fadd<br>shmem_swap<br>shmem_cswap |



#### **FEASIBILITY ANALYSIS**

|                                                | UPC (standard) [44]                                        | Berkeley UPC [1]                                                                                                                                 | SHMEM [4]                                                           |
|------------------------------------------------|------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------|
| Put Get Accumulate FAO (SUM) FAO (REPLACE) CAS | UPC_SET UPC_GET UPC_INC UPC_INC, UPC_DEC UPC_SET UPC_CSWAP | <pre>bupc_atomicX_set_RS bupc_atomicX_read_RS bupc_atomicX_fetchadd_RS bupc_atomicX_fetchadd_RS bupc_atomicX_swap_RS bupc_atomicX_cswap_RS</pre> | shmem_swap<br>shmem_fadd<br>shmem_fadd<br>shmem_swap<br>shmem_cswap |

|               | Fortran 2008 [27] | Linux RDMA/IB [33, 43] | iWARP [18,41]  |
|---------------|-------------------|------------------------|----------------|
| Put           | atomic_define     | MskCmpSwap             | masked CmpSwap |
| Get           | atomic_ref        | MskCmpSwap             | masked CmpSwap |
| Accumulate    | atomic_add        | FetchAdd               | FetchAdd       |
| FAO (SUM)     | atomic_add        | FetchAdd               | FetchA.dd      |
| FAO (REPLACE) | atomic_define*    | MskCmpSwap             | masked CmpSwap |
| CAS           | atomic_cas        | CmpSwap                | CmpSwap        |







# **Process p** Memory

































