

## **High-Performance Distributed RMA Locks**

PATRICK SCHMID, MACIEJ BESTA, TORSTEN HOEFLER





### **N**EED FOR EFFICIENT LARGE-SCALE SYNCHRONIZATION









### Locks



Various performance penalties







### LOCKS: CHALLENGES



### LOCKS: CHALLENGES

We need intra- and inter-node topology-awareness

We need to cover arbitrary topologies

### LOCKS: CHALLENGES



We need to distinguish between readers and writers

Reader

Reader

Reader

Writer

We need flexible performance for both types of processes





# WHAT WE WILL USE MCS Locks





# WHAT WE WILL USE Reader-Writer Locks







How to manage the design complexity?

How to ensure tunable performance?

What mechanism to use for efficient implementation?



## REMOTE MEMORY ACCESS (RMA) PROGRAMMING





### REMOTE MEMORY ACCESS PROGRAMMING

Implemented in hardware in NICs in the majority of HPC





How to manage the design complexity?

How to ensure tunable performance?

What mechanism to use for efficient implementation?







#### How to manage the design complexity?

R1 R9 W3>W8

Each element has its own distributed MCS queue (DQ) of writers

Readers and writers synchronize with a distributed counter (DC)







MCS
queues form
a distributed
tree (DT)









P. Schmid, M. Besta, TH: High-Performance Distributed RMA Locks, ACM HPDC'16, best paper







#### How to ensure tunable performance?

R1 R9 W3 - W8

Each DQ: fairness vs throughput of writers

DC: a parameter for the latency of readers vs writers

A tradeoff parameter for every structure





DT: a
parameter for
the
throughput of
readers vs
writers













# DISTRIBUTED MCS QUEUES (DQS) Throughput vs Fairness

Larger  $T_{L,i}$ : more throughput at level i. Smaller  $T_{L,i}$ : more fairness at level i.

 $T_{L,3}$ 



Each DQ: The maximum number of lock passings within a DQ at level i, before it is passed to another DQ at i.

 $T_{L,i}$ 













P. Schmid, M. Besta, TH: High-Performance Distributed RMA Locks, ACM HPDC'16, best paper





# DISTRIBUTED TREE OF QUEUES (DT) Throughput of readers vs writers



DT: The maximum number of consecutive lock passings within readers (  $T_R$  ).













P. Schmid, M. Besta, TH: High-Performance Distributed RMA Locks, ACM HPDC'16, best paper

### DISTRIBUTED COUNTER (DC) Latency of readers vs writers



DC: every kth compute node hosts a partial counter, all of which constitute the DC.



A writer holds the lock

> Readers that arrived at the CS



Readers that left the CS



$$T_{DC}=2$$













#### LOCK ACQUIRE BY READERS

A lightweight acquire protocol for readers: only one atomic fetch-and-add (FAA) operation





#### LOCK ACQUIRE BY WRITERS



P. Schmid, M. Besta, TH: High-Performance Distributed RMA Locks, ACM HPDC'16, best paper







# **EVALUATION**DISTRIBUTED COUNTER ANALYSIS







Throughput, 2% writers
Single-operation benchmark





# **EVALUATION**READER THRESHOLD ANALYSIS

Throughput, 0.2% writers, Empty-critical-section benchmark



# **EVALUATION**Comparison to the State-of-the-Art



[1] R. Gerstenberger et al. Enabling Highly-scalable Remote Memory Access Programming with MPI-3 One Sided. ACM/IEEE Supercomputing 2013.



## **EVALUATION**Comparison to the State-of-the-Art

Throughput, single-operation benchmark



[1] R. Gerstenberger et al. Enabling Highly-scalable Remote Memory Access Programming with MPI-3 One Sided. ACM/IEEE Supercomputing 2013.

# **EVALUATION**DISTRIBUTED HASHTABLE



[1] R. Gerstenberger et al. Enabling Highly-scalable Remote Memory Access Programming with MPI-3 One Sided. ACM/IEEE Supercomputing 2013.



# **EVALUATION**DISTRIBUTED HASHTABLE



[1] R. Gerstenberger et al. Enabling Highly-scalable Remote Memory Access Programming with MPI-3 One Sided. ACM/IEEE Supercomputing 2013.

### **Another application area - Databases**

MPI-RMA for distributed databases?





### **Another application area - Databases**

MPI-RMA for distributed databases on Piz Daint





### **Another application area - Databases**

MPI-RMA for distributed databases on Piz Daint



#### **OTHER ANALYSES**



P. Schmid, M. Besta, TH: High-Performance Distributed RMA Locks, ACM HPDC'16, best paper



#### **CONCLUSIONS**

Modular (

corre



Thank you for your attention







ble

es

Improves latency and throughput over state-of-the-art

Accelerates distributed hashtabled