Towards Remote Memory Access Programming for Data Analytics

With R. Gerstenberger, M. Besta, R. Belli @ SPCL
presented at Lawrence Berkeley National Laboratory, Berkeley, CA, June 2015
COMMUNICATION IN TODAY’S HPC SYSTEMS

- The de-facto programming model: MPI-1
  - Using send/recv messages and collectives

- The de-facto network standard: RDMA, SHM
  - Zero-copy, user-level, os-bypass, fuzz-bang
MPI-1 MESSAGE PASSING – SIMPLE EAGER

Producer

Consumer

MPI-1 MESSAGE PASSING – SIMPLE EAGER

1. Data transfer to intermediate buffer

MPI-1 MESSAGE PASSING – SIMPLE EAGER

Producer  

Send

1. Data transfer to intermediate buffer

2. Acknowledgement

Consumer  

Mailbox

: origin aware of completion

MPI-1 MESSAGE PASSING – SIMPLE EAGER

Critical path: 1 latency + 1 copy

MPI-1 MESSAGE PASSING – SIMPLE RENDEZVOUS

Producer ➔ Consumer

MPI-1 MESSAGE PASSING – SIMPLE RENDEZVOUS

1. Transfer of communication parameters

Producer

Consumer

Mailbox

Send

MPI-1 MESSAGE PASSING – SIMPLE RENDEZVOUS

MPI-1 MESSAGE PASSING – SIMPLE RENDEZVOUS

1. Transfer of communication parameters
2. Message matching
3. Request

MPI-1 Message Passing – Simple Rendezvous

1. Transfer of communication parameters
2. Message matching
3. Request
4. Data transfer

*: target aware of completion

Critical path: 3 latencies

A Critique of RDMA
by Patrick Geoffray, Ph.D.

Do you remember VIA, the Virtual Interface Architecture? I do. In 1998, according to its promoters — Intel, Compaq, and Microsoft — VIA was supposed to change the face of high-performance networking. VIA was a buzzword at the time; Venture Capital was flowing, and startups multiplying. Many HPC pundits were rallying behind this low-level programming interface, which promised scalable, low-overhead, high-throughput communication, initially for HPC and eventually for the data center. The hype was on and doom was spelled for the non-believers.

It turned out that VIA, based on RDMA (Remote Direct Memory Access, or Remote DMA), was not an improvement on existing APIs to support widely used application-software interfaces such as MPI and Sockets. After a while, VIA faded away, overtaken by other developments.

VIA was eventually reborn into the RDMA programming model that is the basis of various InfiniBand Verbs implementations, as well as DAPL (Direct Access Provider Library) and iWARP (Internet Wide Area RDMA Protocol). The pundits have returned, VCs are spending their money, and RDMA is touted as an ideal solution for the efficiency of high-performance networks.

However, the evidence I'll present here shows that the revamped RDMA model is more a problem than a solution. What's more, the objective that RDMA pretends to address of efficient user-level communication between computing nodes is already solved by the two-sided Send/Recv model in products such as Quadrics QsNet, Cray SeaStar (implementing Sandia Portals), Qlogic InfiniPath, and Myricom's Myrinet Express (MX).

Send/Recv versus RDMA

The difference between these two paradigms, Send/Receive (Send/Recv) and RDMA, resides essentially in the
REMOTE MEMORY ACCESS PROGRAMMING

- Why not use these RDMA features more directly?
  - A global address space may simplify programming
  - ... and accelerate communication
  - ... and there could be a widely accepted standard

- MPI-3 RMA (“MPI One Sided”) [1] was born
  - Just one among many others (UPC, CAF, …)
  - Designed to react to hardware trends, learn from others
  - Direct (hardware-supported) remote access
  - New way of thinking for programmers

“Traditionally, HPC programming models are following hardware developments” (IPDPS’15)

MPI-3 RMA SUMMARY

- MPI-3 updates RMA ("MPI One Sided")
  - Significant change from MPI-2

- Communication is „one sided” (no involvement of destination)
  - Utilize direct memory access

- RMA decouples communication & synchronization
  - Fundamentally different from message passing

MPI-3 RMA COMMUNICATION OVERVIEW

- **Process A (passive)**
  - Memory
  - MPI window

- **Process B (active)**
  - Memory
  - MPI window
  - Put
  - Non-atomic communication calls (put, get)

- **Process C (active)**
  - Atomic
  - Get

- **Process D (active)**
  - Atomic communication calls (Acc, Get & Acc, CAS, FAO)

Gropp, Hoefler, Thakur, Lusk: Using Advanced MPI
MPI-3 RMA COMMUNICATION OVERVIEW

- Process A (passive)
  - Memory
  - MPI window

- Process B (active)
  - Memory
  - MPI window
  - Non-atomic communication calls (put, get)

- Process C (active)
  - Atomic
  - Atomic communication calls (Acc, Get & Acc, CAS, FAO)

- Process D (active)
  - Get

Gropp, Hoefler, Thakur, Lusk: Using Advanced MPI
MPI-3 RMA COMMUNICATION OVERVIEW

Process A (passive)

Memory

Put

Non-atomic communication calls (put, get)

Process B (active)

Memory

MPI window

Process C (active)

Atomic communication calls (Acc, Get & Acc, CAS, FAO)

Process D (active)

Gropp, Hoefler, Thakur, Lusk: Using Advanced MPI
MPI-3 RMA COMMUNICATION OVERVIEW

- Process A (passive)
  - Memory
  - MPI window

- Process B (active)
  - Memory
  - MPI window

- Process C (active)
  - Non-atomic communication calls (put, get)

- Process D (active)
  - Atomic communication calls (Acc, Get & Acc, CAS, FAO)

Gropp, Hoefler, Thakur, Lusk: Using Advanced MPI
MPI-3 RMA Communication Overview

Process A (passive)

Memory

Put

Process B (active)

Memory

Non-atomic communication calls (put, get)

MPI window

Get

Process C (active)

Atomic communication calls (Acc, Get & Acc, CAS, FAO)

Process D (active)

Atomic

Gropp, Hoefler, Thakur, Lusk: Using Advanced MPI
MPI-3 RMA SYNCHRONIZATION OVERVIEW

Active Target Mode
- Fence
- Post/Start/Complete/Wait
- Active process
- Passive process

Passive Target Mode
- Lock
- Lock All
- Synchronization
- Communication

Gropp, Hoefler, Thakur, Lusk: Using Advanced MPI, MIT Press, 2014
MPI-3 RMA SYNCHRONIZATION OVERVIEW

Active Target Mode
- Fence
- Post/Start/Complete/Wait

Passive Target Mode
- Lock

Active process
Passive process
Synchronization
Communication

Gropp, Hoefler, Thakur, Lusk: Using Advanced MPI, MIT Press, 2014
MPI-3 RMA Synchronization Overview

Active Target Mode:
- Fence
- Post/Start/Complete/Wait

Passive Target Mode:
- Lock
- Lock All

Gropp, Hoefler, Thakur, Lusk: Using Advanced MPI, MIT Press, 2014
MPI-3 RMA SYNCHRONIZATION OVERVIEW

Active Target Mode
- Fence
- Post/Start/Complete/Wait

Passive Target Mode
- Lock
- Lock All

Active process
Passive process
Synchronization
Communication

Gropp, Hoefler, Thakur, Lusk: Using Advanced MPI, MIT Press, 2014
MPI-3 RMA SYNCHRONIZATION OVERVIEW

Active Target Mode

- Fence
- Post/Start/Complete/Wait

- Active process
- Passive process

Passive Target Mode

- Lock
- Lock All

Gropp, Hoefler, Thakur, Lusk: Using Advanced MPI, MIT Press, 2014
SCALABLE PROTOCOLS & REFERENCE IMPLEMENTATION

- Scalable & generic protocols
  - Can be used on any RDMA network (e.g., OFED/IB)
  - Window creation, communication and synchronization

- foMPI, a fully functional MPI-3 RMA implementation
  - DMAPP: lowest-level networking API for Cray Gemini/Aries systems
  - XPMEM, a portable Linux kernel module

http://spcl.inf.ethz.ch/Research/Parallel_Programming/foMPI
**SCALABLE PROTOCOLS & REFERENCE IMPLEMENTATION**

- Scalable & generic protocols
  - Can be used on any RDMA network (e.g., OFED/IB)
  - Window creation, communication and synchronization

- foMPI, a fully functional MPI-3 RMA implementation
  - DMAPP: lowest-level networking API for Cray Gemini/Aries systems
  - XPMEM, a portable Linux kernel module

http://spcl.inf.ethz.ch/Research/Parallel_Programming/foMPI
SCALABLE PROTOCOLS & REFERENCE IMPLEMENTATION

- Scalable & generic protocols
  - Can be used on any RDMA network (e.g., OFED/IB)
  - Window creation, communication and synchronization

- foMPI, a fully functional MPI-3 RMA implementation
  - DMAPP: lowest-level networking API for Cray Gemini/Aries systems
  - XPMEM: a portable Linux kernel module

http://spcl.inf.ethz.ch/Research/Parallel_Programming/foMPI
PERFORMANCE INTER-NODE: LATENCY

Put Inter-Node

Get Inter-Node

80% faster

20% faster

Half ping-pong
Proc 0
Proc 1
put
sync memory

Gerstenberger, Besta, Hoefler: Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided, SC13
**Performance Intra-node: Latency**

Put/Get Intra-Node

![Graph showing latency comparison for different transport layers](image)

**Transport Layer**
- FOMPI MPI-3.0
- Cray UPC
- Cray MPI-2.2
- Cray MPI-1
- Cray CAF

![Diagram showing half ping-pong](image)

Gerstenberger, Besta, Hoefler: Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided, SC13
Performance: Message Rate

Inter-Node

Transport Layer
- FOMPI MPI-3.0
- Cray UPC
- Cray MPI-2.2
- Cray MPI-1
- Cray CAF

Message Rate [Million Mess./Sec.]

Message Size [Bytes]

Intra-Node

Transport Layer
- FOMPI MPI-3.0
- Cray UPC
- Cray MPI-2.2
- Cray MPI-1
- Cray CAF

Message Rate [Million Mess./Sec.]

Message Size [Bytes]
PART 3: SYNCHRONIZATION

Active Target Mode

- Fence
- Post/Start/Complete/Wait

Passive Target Mode

- Lock
- Lock All

- Active process
- Passive process
- Synchronization
- Communication
SCALABLE FENCE PERFORMANCE

Time bound \( \mathcal{O}(\log p) \)
Memory bound \( \mathcal{O}(1) \)

\( \sim \frac{1}{2} \) latency
**Flush Synchronization**

- Guarantees remote completion
- Performs a remote bulk synchronization and an x86 mfence
- One of the most performance critical functions, we add only 78 x86 CPU instructions to the critical path
**Flush Synchronization**

- Guarantees remote completion
- Performs a remote bulk synchronization and an x86 mfence
- One of the most performance critical functions, we add only 78 x86 CPU instructions to the critical path

**Time bound** \(O(1)\)

**Memory bound** \(O(1)\)

Gerstenberger, Besta, Hoefler: Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided, SC13
APPLICATION PERFORMANCE

- Evaluation on Blue Waters System
  - 22,640 computing Cray XE6 nodes
  - 724,480 schedulable cores
- One nearly full-scale run 😊
PERFORMANCE: APPLICATIONS


NAS 3D FFT [1] Performance

MILC [2] Application Execution Time

[1] Nishtala et al.: Scaling communication-intensive applications on BlueGene/P using one-sided communication and overlap. IPDPS’09
[2] Shan et al.: Accelerating applications at scale using one-sided communication. PGAS’12
IN CASE YOU WANT TO LEARN MORE

- Available in most MPI libraries today
- Some are even fast!

Using Advanced MPI
Modern Features of the Message-Passing Interface

How to implement producer/consumer in passive mode?

William Gropp
Torsten Hoefler
Rajeev Thakur
Ewing Lusk
PRODUCER-CONSUMER RELATIONS

- Most important communication idiom
  - Some examples:

- Perfectly supported by MPI-1 Message Passing
  - But how does this actually work over RDMA?

ONE SIDED – PUT + SYNCHRONIZATION

Producer

Consumer
ONE SIDED – PUT + SYNCHRONIZATION

Producer

Consumer

Put

1. Data transfer

ONE SIDED – PUT + SYNCHRONIZATION

Producer

Put

Flush

Consumer

1. Data transfer

2. Producer waits for remote completion

: origin aware of completion

ONE SIDED – PUT + SYNCHRONIZATION

Critical path: 3 latencies
COMPARING APPROACHES

Message Passing
1 latency + copy / 3 latencies

One Sided
3 latencies

IDEA: RMA NOTIFICATIONS

- First seen in Split-C (1992)
- Combine communication and synchronization using RDMA
- RDMA networks can provide various notifications
  - Flags
  - Counters
  - Event Queues
**Comparing Approaches**

**Message Passing**
- 1 latency + copy / 3 latencies

**One Sided**
- 3 latencies

**Notified Access**
- 1 latency

COMPARING APPROACHES

Message Passing
1 latency + copy / 3 latencies

One Sided
3 latencies

Notified Access
1 latency

But how to notify?
Sources:  
- Flags (polling at the remote side)  
  - Used in GASPI, DMAPP, NEON

**Disadvantages**
- Location of the flag chosen at the sender side
- Consumer needs at least one flag for every process
- Polling a high number of flags is inefficient
PREVIOUS WORK: COUNTING INTERFACE

- Atomic counters (accumulate notifications → scalable)
  - Used in Split-C, LAPI, SHMEM - Counting Puts, ...

**Disadvantages**
- Dataflow applications may require many counters
- High polling overhead to identify accesses
- Does not preserve order (may not be linearizable)
WHAT IS A GOOD NOTIFICATION INTERFACE?

- Scalable to yotta-scale
  - Does memory or polling overhead grow with # of processes?

- Computation/communication overlap
  - Do we support maximum asynchrony? (better than MPI-1)

- Complex data flow graphs
  - Can we distinguish between different accesses locally?
  - Can we avoid starvation?
  - What about load balancing?

- Ease-of-use
  - Does it use standard mechanisms?
Our Approach: Notified Access

- Notifications with MPI-1 (queue-based) matching
  - Retains benefits of previous notification schemes
  - Poll only head of queue
  - Provides linearizable semantics

NOTIFIED ACCESS – AN MPI INTERFACE

- Minor interface evolution
  - Leverages MPI two sided <source, tag> matching
  - Wildcards matching with FIFO semantics

Example Communication Primitives

```c
int MPI_Put_notify(void *origin_addr, int origin_count, MPI_Datatype origin_type, int target_rank, 
                   MPI_Aint target_disp, int target_count, MPI_Datatype target_type, MPI_Win win, 
                   int tag);
int MPI_Get_notify(void *origin_addr, int origin_count, MPI_Datatype origin_type, int target_rank, 
                   MPI_Aint target_disp, int target_count, MPI_Datatype target_type, MPI_Win win, 
                   int tag);
```

Example Synchronization Primitives

```c
int MPI_Notify_init(MPI_Win win, int src_rank, int tag, int expected_count, MPI_Request *request);
/*Functions already available in MPI*/
int MPI_Start(MPI_Request *request);
int MPI_Test(MPI_Request *request, int *flag, MPI_Status *status);
int MPI_Wait(MPI_Request *request, MPI_Status *status);
```
**EXPERIMENTAL SETTING**

- **Piz Daint [1]**
  - Cray XC30, Aries interconnect
  - 5,272 computing nodes (Intel Xeon E5-2670 + NVIDIA Tesla K20X)
  - Theoretical Peak Performance 7.787 Petaflops
  - Peak Network Bisection Bandwidth 33 TB/s

PING PONG PERFORMANCE (INTER-NODE)

- 1000 repetitions, each timed separately, RDTSC timer
- 95% confidence interval always within 1% of median

CHOLESKY – MANY-TO-MANY SYNCHRONIZATION

- 1,000 repetitions, each timed separately, RDTSC timer
- 95% confidence interval always within 10% of median

So what if we drive tasking to the extreme of data analytics?

[1]: J. Kurzak, H. Ltaief, J. Dongarra, R. Badia: "Scheduling dense linear algebra operations on multicore processors", CCPE 2010
LARGE-SCALE IRREGULAR GRAPH PROCESSING

- Becoming more important [1]
  - Machine learning
  - Computational science
  - Social network analysis

SYNCHRONIZATION MECHANISMS
COARSE LOCKS

Simple protocols

Serialization

Detrimental performance

An example graph

Proc p

lock

accesses

unlock

Proc q

lock

accesses
SYNCHRONIZATION MECHANISMS
FINE LOCKS

Higher performance possible

Complex protocols

Risk of deadlocks

Complex access patterns 😊
SYNCHRONIZATION MECHANISMS

ATOMIC OPERATIONS

High performance (may be challenging to get)

Complex protocols

Subtle issues (ABA, ...)

Complex access patterns 😊
SYNCHRONIZATION MECHANISMS
SOFTWARE TRANSACTIONAL MEMORY (STM) [1]

Conflicts solved with rollbacks and/or serialization.

SYNCHRONIZATION MECHANISMS
HARDWARE TRANSACTIONAL MEMORY (HTM)

Conflicts solved with rollbacks and/or HW serialization.

High performance? For graphs?
Simple protocols

Besta, Hoefler: Accelerating Irregular Computations with Hardware Transactional Memory and Active Messages, HPDC’15
HARDWARE TRANSACTIONAL MEMORY

They offer programmability, how about performance?
ACTIVE MESSAGES (AM)


AM + HTM = ATOMIC ACTIVE MESSAGES

AM handlers run as HTM transactions

Node A
Proc p

Node B
Proc q

Besta, Hoefler: Accelerating Irregular Computations with Hardware Transactional Memory and Active Messages, HPDC’15
ACCESSING MULTIPLE VERTICES ATOMICALLY
Example: BFS

Size (the number of vertices) must be appropriate to minimize overheads from both commits and rollbacks
Transactions must be appropriately coalesced to minimize communication overheads.
EXECUTING TRANSACTIONS ON MULTIPLE NODES

Vertices must be appropriately relocated to enable execution of a hardware transaction
PERFORMANCE ANALYSIS
RESEARCH QUESTIONS

- What are the advantages of HTM over atomics for AAM?
- How can we implement AAM handlers to run most efficiently?
- What are performance tradeoffs related to HTM?
- What are the optimal transaction sizes?

Besta, Hoefler: Accelerating Irregular Computations with Hardware Transactional Memory and Active Messages, HPDC’15
PERFORMANCE ANALYSIS

TYPES OF MACHINES

- Evaluation on 3 machines
  - Intel Haswell server
  - InfiniBand cluster
  - IBM BlueGene/Q

Commodity machines

Supercomputing machines

HPC clusters

Besta, Hoefler: Accelerating Irregular Computations with Hardware Transactional Memory and Active Messages, HPDC’15
PERFORMANCE ANALYSIS
CONSIDERED MECHANISMS

Haswell HTM
- Intel
- Deployed in L1
- 32KB per core
- 8-way associative
- RTM (Restricted Transactional Memory)
- HLE (Hardware Lock Elision)

BlueGene/Q HTM
- IBM
- Deployed in L2
- 2MB per core
- 16-way associative
- The Long Running Mode
- The Short Running Mode

Besta, Hoefler: Accelerating Irregular Computations with Hardware Transactional Memory and Active Messages, HPDC'15
**SINGLE-VERTEX TRANSACTIONS**

**MARKING A VERTEX AS VISITED**

Lower contention
(10 racing accesses/vertex)

// start handler
if (!v.visited) {
  v.visited = 1;
}
// finish handler

Very few aborts

Atomics (CAS) slightly faster than HTM

Commit overheads dominate

Used in BFS, SSSP, ...

Besta, Hoefler: Accelerating Irregular Computations with Hardware Transactional Memory and Active Messages, HPDC’15
SINGLE-VERTEX TRANSACTIONS
MARKING A VERTEX AS VISITED

Higher contention
(100 racing accesses/vertex)

Still very few aborts

// start handler
if(!v.visited) {
  v.visited = 1;
}
// finish handler

Besta, Hoefler: Accelerating Irregular Computations with Hardware Transactional Memory and Active Messages, HPDC'15
SINGLE-VERTEX TRANSACTIONS
INCREMENTING VERTEX RANK

Atomics always outperform HTM

The reason: each transaction always modifies some memory cell, increasing the number of conflicts

// start handler
v.rank++;
// finish handler

Used in PageRank

Besta, Hoefler: Accelerating Irregular Computations with Hardware Transactional Memory and Active Messages, HPDC’15
**Performance Model**

**Atoms vs Transactions**

Time to modify $N$ vertices with atomics:

$$T_{AT}(N) = A_{AT}N + B_{AT}$$

Time to modify $N$ vertices with a transaction

$$T_{HTM}(N) = A_{HTM}N + B_{HTM}$$

Overhead per vertex

Startup overheads

We predict that:

$$B_{AT} < B_{HTM}$$

$$A_{AT} > A_{HTM}$$

Transactions’ cost grows slower

Transaction startup overheads dominate

Besta, Hoefler: Accelerating Irregular Computations with Hardware Transactional Memory and Active Messages, HPDC’15
**PERFORMANCE MODEL**

**ATOMICS VS TRANSACTIONS**

- Can we amortize HTM startup/commit overheads with larger transaction sizes?

Indeed:

\[ B_{AT} < B_{HTM} \]

\[ A_{AT} > A_{HTM} \]

Yes, we can!
MULTI-VERTEX TRANSACTIONS
MARKING VERTICES AS VISITED

The sweetspot! (144 vertices)

Abort and rollback overheads

Startup and commit overheads

BGQ mechanism

HTM-Long-Mode

HTM-Short-Mode

Total time [s]

Transaction size (M) [vertices]

HTM: Besta, Hoefler: Accelerating Irregular Computations with Hardware Transactional Memory and Active Messages, HPDC’15
**Multi-Vertex Transactions**

**Marking Vertices as Visited**

- **Startup and commit overheads**
- **The sweetspot! (2 vertices)**
- **Abort and rollback overheads**
- **Numbers: % of aborts due to HTM capacity overflows**

Majority of aborts are due to HTM capacity overflows (small cache size & associativity)

Besta, Hoefler: Accelerating Irregular Computations with Hardware Transactional Memory and Active Messages, HPDC'15
**Performance Analysis**

**Questions Answered**

- What are the advantages of HTM over atomics for AAM?
- What are the optimal transaction sizes?
- What are the performance tradeoffs related to HTM?
- How can we implement AAM handlers most effectively?

> “It really depends” 😊.

> But... There are some regularities

> Handlers most effectively?

> “May fail”

> For some algorithms (BFS) HTM is better

> atomics of

> “Always succeed”

> For others (PageRank) atomics give more performance

> AAM establishes a whole hierarchy of algorithms; check the paper 😊

> Larger cache & associativity → fewer aborts & more coarsening

> Larger (L2) cache → higher latency

> Size for BG/Q ~100 > Size for Haswell ~10

> Same for other graphs


---

Besta, Hoefler: Accelerating Irregular Computations with Hardware Transactional Memory and Active Messages, HPDC’15
EVALUATION

CONSIDERED ENGINES


GRAPh 500

[2] Runtimes that exploit amorphous data-parallelism

PBGL [4]

Improving Graph500 design

AAM +

Distributed HPC libraries

HAMA [3]

Hadoop-based BSP engines

EVALUATION

CONSIDERED TYPES OF GRAPHS

Real-world SNAP graphs [3]

Social networks

Road networks

Comm. graphs

Citation graphs

Web graphs

Purchase networks

Synthetic graphs

Kronecker [1]

Erdös-Rényi [2]

ACCELERATING STATE-OF-THE-ART GRAPH500 + AAM (BLUEGENE/Q) IBM

Fill the whole memory

Numbers are speedups of AAM over Graph500 for a given data point

Implementation

- Graph500–BGQ
- AAM–BGQ

Besta, Hoefler: Accelerating Irregular Computations with Hardware Transactional Memory and Active Messages, HPDC’15
ACCELERATING STATE-OF-THE-ART GRAPH500 + AAM (HASWELL)

Besta, Hoefler: Accelerating Irregular Computations with Hardware Transactional Memory and Active Messages, HPDC'15

Implementation
- Graph500–Haswell
  - Graphs with 2M vertices: 1.09, 1.12, 1.15
  - Graphs with 8M vertices: 1.10, 1.14, 1.15

AAM–Haswell
- Numbers are speedups of AAM over Graph500 for a given data point
- Fill the whole memory

Edges per vertex ($\bar{d}$)
- Total time [s]

Graphs with 8M vertices
- 1.09, 1.10, 1.11, 1.15, 1.27

Graphs with 2M vertices
- 1.03, 1.05, 1.09, 1.14, 1.15

...64M...

1.23
OUTPERFORMING STATE-OF-THE-ART

😊 No, you don’t have to read it. All details are in the paper. Here: just a summary.
OUTPERFORMING STATE-OF-THE-ART

Average overall speedup (geometric mean) over Graph 500: 1.07, Galois: 1.40, HAMA ~1000

1.85x on average, up to 4.3x

Besta, Hoefler: Accelerating Irregular Computations with Hardware Transactional Memory and Active Messages, HPDC’15
OUTPERFORMING STATE-OF-THE-ART
SCALABILITY ANALYSIS: DISTRIBUTED-MEMORY

Besta, Hoefler: Accelerating Irregular Computations with Hardware Transactional Memory and Active Messages, HPDC’15
DISCUSSION AND CONCLUSIONS

- MPI-3 RMA [1] standardizes weak remote memory
  - Builds on existing practice (UPC, CAF, ARMCI etc.)
  - Rich set of synchronization mechanisms
- Notified Access [2] can support producer/consumer
  - Maintains benefits of RDMA
- HTM can accelerate parallel applications [3]
  - Uses optimistic coarsened irregular parallelism
- Atomic Active Messages use HTM on DM systems
  - First steps towards software-emulated TM over RDMA
  - Thinking about hardware support
- Significant speedups over highly-tuned graph frameworks
  - Haswell: 7% over Graph 500, 40% over Galois, 1000x over Hama
- What next? Discussions?
  - GraphBlas using RMA+HTM?

[3] Besta, Hoefler: Accelerating Irregular Computations with Hardware Transactional Memory and Active Messages, HPDC’15
ACKNOWLEDGMENTS