

### **High-Performance** Distributed RMA Locks

PATRICK SCHMID, MACIEJ BESTA, TORSTEN HOEFLER

Presented at ARM Research, Austin, Texas, USA, Feb. 2017



#### **NEED FOR EFFICIENT LARGE-SCALE SYNCHRONIZATION**













#### Locks An example structure R Proc p Proc q Inuitive lock semantics accesses ... 10CK ---unlock accesses Various performance penalties



#### LOCKS: CHALLENGES



Calciu et al.: NUMA-aware reader-writer locks, PPoPP'13





#### LOCKS: CHALLENGES

We need intra- and inter-node topologyawareness

We need to cover arbitrary topologies



#### **LOCKS: CHALLENGES**

Reader

# We need to distinguish between readers and writers

Reader

Reader

We need flexible performance for both types of processes

[1] V. Venkataramani et al. Tao: How facebook serves the social graph. SIGMOD'12.



# What will we use in the design?

8



#### WHAT WE WILL USE MCS Locks



Mellor-Crummey and Scott: Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors, ACM TOCS'91



#### WHAT WE WILL USE Reader-Writer Locks





# How to manage the design complexity?

#### How to ensure tunable performance?

#### What mechanism to use for efficient implementation?



### **REMOTE MEMORY ACCESS (RMA) PROGRAMMING**



TH, J. Dinan, R. Thakur, B. Barrett, P. Balaji, W. Gropp, K. Underwood: Remote Memory Access Programming in MPI-3, ACM TOPC'15



### **REMOTE MEMORY ACCESS PROGRAMMING**

Implemented in hardware in NICs in the majority of HPC networks (RDMA support).





# How to manage the design complexity?

#### How to ensure tunable performance?

What mechanism to use for efficient implementation?





Each element has its

spcl.inf.ethz.ch



P. Schmid, M. Besta, TH: High-Performance Distributed RMA Locks, ACM HPDC'16, best paper

How to manage the design complexity?







#### ETHzürich



Each DQ: The

•

#### **DISTRIBUTED MCS QUEUES (DQS)** Throughput vs Fairness





#### **DISTRIBUTED TREE OF QUEUES (DT)** Throughput of readers vs writers

DT: The maximum number of consecutive lock passings within readers ( $T_R$ ).



#### EHzürich



#### **DISTRIBUTED COUNTER (DC)** Latency of readers vs writers

DC: every *k*th compute node hosts a partial counter, all of which constitute the DC.











#### LOCK ACQUIRE BY READERS

A lightweight acquire protocol for readers: only one atomic fetch-and-add (FAA) operation







spcl.inf.ethz.ch Ƴ @spcl\_eth

#### LOCK ACQUIRE BY WRITERS





CRAY

CRAY

CRAN

spcl.inf.ethz.ch

#### **EVALUATION**

- CSCS Piz Daint (Cray XC30)
- 5272 compute nodes
- 8 cores per node
- 169TB memory
- Microbenchmarks: acquire/release: latency, throughput
- Distributed hashtable



#### **EVALUATION** DISTRIBUTED COUNTER ANALYSIS





Throughput, 2% writers

Single-operation benchmark





#### spcl.inf.ethz.ch Y @spcl\_eth

#### **EVALUATION** READER THRESHOLD ANALYSIS





#### **EVALUATION** COMPARISON TO THE STATE-OF-THE-ART



[1] R. Gerstenberger et al. Enabling Highly-scalable Remote Memory Access Programming with MPI-3 One Sided. ACM/IEEE Supercomputing 2013.



#### **EVALUATION** COMPARISON TO THE STATE-OF-THE-ART

Throughput, single-operation benchmark



[1] R. Gerstenberger et al. Enabling Highly-scalable Remote Memory Access Programming with MPI-3 One Sided. ACM/IEEE Supercomputing 2013.



#### **EVALUATION** DISTRIBUTED HASHTABLE



[1] R. Gerstenberger et al. Enabling Highly-scalable Remote Memory Access Programming with MPI-3 One Sided. ACM/IEEE Supercomputing 2013.



#### **EVALUATION DISTRIBUTED HASHTABLE**

# 2% of writers

#### 0% of writers



[1] R. Gerstenberger et al. Enabling Highly-scalable Remote Memory Access Programming with MPI-3 One Sided. ACM/IEEE Supercomputing 2013.



34

#### **Another application area - Databases**

MPI-RMA for distributed databases?



#### Hash-Join

#### Sort-Join



### **Another application area - Databases**

MPI-RMA for distributed databases on Piz Daint



C. Barthels, et al., TH: Distributed Join Algorithms on Thousands of Cores presented in Munich, Germany, VLDB Endowment, Aug. 2017



### **Another application area - Databases**

MPI-RMA for distributed databases on Piz Daint



C. Barthels, et al., TH: Distributed Join Algorithms on Thousands of Cores presented in Munich, Germany, VLDB Endowment, Aug. 2017

36



#### **OTHER ANALYSES**





#### CONCLUSIONS





### Modular o

# Thank you for your attention

COMPARISON TO THE STATE-OF-THE-ART

Throughput [min locks/s]

Percentages a

alues of F<sub>W</sub>

16

Latency (LB

foMPI-RW

64

MPI processes (P)

256

1024



Improves latency and throughput over state-of-the-art

P. Schmid, M. Besta, TH: High-Performance Distributed RMA Locks, ACM HPDC'16, best paper



ble

les

AN - S -

Accelerates distributed hashtabled