Non quia difficilia sunt non audemus, sed quia non audemus difficilia sunt
Home -> Publications
Home
  Publications
    
all years
    2017
    2016
    2015
    2014
    2013
    2012
    2011
    2010
    2009
    2008
    2007
    2006
    2005
    2004
    theses
    techreports
    presentations
    edited volumes
    conferences
  Awards
  Research
  Teaching
  BLOG
  Miscellaneous
  Full CV [pdf]






  Events








  Past Events





Publications of Torsten Hoefler
Copyright Notice:

The documents distributed by this server have been provided by the contributing authors as a means to ensure timely dissemination of scholarly and technical work on a noncommercial basis. Copyright and all rights therein are maintained by the authors or by other copyright holders, notwithstanding that they have offered their works here electronically. It is understood that all persons copying this information will adhere to the terms and constraints invoked by each author's copyright. These works may not be reposted without the explicit permission of the copyright holder.

H. Schweizer, M. Besta, T. Hoefler:

 Evaluating the Cost of Atomic Operations on Modern Architectures

(presented in San Francisco, CA, USA, ACM, Oct. 2015, Accepted at the 24th International Conference on Parallel Architectures and Compilation (PACT'15) )

Abstract

Atomic operations (atomics) such as Compare-and-Swap (CAS) or Fetch-and-Add (FAA) are ubiquitous in parallel programming. Yet, performance of these operations has not been thoroughly analyzed, hindering the development and the reasoning of high performance codes. In this paper we establish an evaluation methodology, develop a performance model, and present a set of detailed benchmarks for latency and bandwidth of different atomics. We consider various state-of-the-art x86 architectures: Intel Haswell, Xeon Phi, Ivy Bridge, and AMD Bulldozer. The results unveil surprising performance relationships between the considered atomics and architectural properties such as the coherence state of the accessed cache lines. One key finding is that all the tested atomics have comparable latency and bandwidth even if they are characterized by different consensus numbers. Another insight is that the hardware implementation of atomics prevents any instruction-level parallelism even if there are no dependencies between the issued operations. Finally, we discuss solutions to the discovered performance issues in the analyzed architectures. Our analysis enables simpler and more effective parallel programming and accelerates data processing on various architectures deployed in both off-the-shelf machines and large compute systems.

Documents

download article:
 

BibTeX

@inproceedings{atomics-cost,
  author={H. Schweizer and M. Besta and T. Hoefler},
  title={{Evaluating the Cost of Atomic Operations on Modern Architectures}},
  year={2015},
  month={Oct.},
  location={San Francisco, CA, USA},
  publisher={ACM},
  note={Accepted at the 24th International Conference on Parallel Architectures and Compilation (PACT'15)},
  source={http://www.unixer.de/~htor/publications/},
}

serving: 54.81.198.150:59162© Torsten Hoefler