Indiana University’s new Datacenter (the bunker)

I attended the dedication ceremony of the new $32 million datacenter. I have to say this building is impressive for a data center and reminds me of my time in the German army. I think everybody would agree to call it “the bunker”. It is designed to withstand an F5 tornado, but with its three feet (one meter) thick concrete walls it might also survive plane crashes or bomb raids. The new datacenter is really amazing and solves all the problems of Wrubel (some of us might remember the power problems ;-)).

There is a nice video on YouTube: http://www.youtube.com/watch?v=zdHvnt3D7Tc

And this nice picture: 7984_h

UITS moved all servers and services in a single day! Well done, guys!

ICPP 2009 in Vienna

I presented our initial work on Offloading Collective Operations, which defines an assembly language for group operations (GOAL), at ICPP’09 in Vienna. I was rather disappointed by this year’s ICPP. We already had some problems with the program selection before the conference (I’ll happily tell you the details on request) and the final program was not great. Some talks were very entertaining though. I really enjoyed the P2S2 workshop, especially Pete Beckman’s keynote. Other highlights (in my opinion) include:

  • Mondrian’s “A resource optimized remote-memory-access architecture for low-latency communication” (I needed to talk to those guys, and I did ;))
  • Argonne’s “Improving Resource Availability By Relaxing Network Allocation Constraints on the Blue Gene/P” (I need to read the paper because I missed the talk due to chaotic re-scheduling, but Narayan’s 5-minute elevator pitch summary seemed very interesting)
  • Prof. Resch’s keynote on “Simulation Performance through Parallelism - Challenges and Options” (he even mentioned the German Pirate party, which I really enjoyed!)
  • Brice’s work with Argonne on “Cache-Efficient, Intranode Large-Message MPI Communication with MPICH2-Nemesis”
  • Argonne’s “End-to-End Study of Parallel Volume Rendering on the IBM Blue Gene/P” (yes, another excellent Argonne talk right before my presentation :))

Here are some nice pictures:
vienna1
My talk on the last day was a real success (very well attended, even though it was the last talk of the conference)! It’s good to have friends (and a good talk from Argonne right before mine :-)). Btw., two of the three talks in the (only) “Information Retrieval” session were completely misplaced and had nothing to do with the topic, weird …

vienna2
My co-author (and friendly driver and camera-man) and me in front of the parliament.

EuroPVM/MPI 2009 report

This year’s EuroPVM/MPI was held in Helsinki (not quite, but close to it). I stayed in Hanasaari, a beautiful island with a small hotel and conference center on it. It’s a bit remote but nicely surrounded by nature.

The conference was nice; I learned about formal verification of MPI programs in the first day’s tutorial. This technique seems really useful for non-deterministic MPI programs (how many are there?) but there are certainly some open problems (similar to the state explosion of thread-checkers). The remainder of the conference was very nice and it feels good to meet the usual MPI suspects again. Some highlights, in my opinion, were:

  • Edgar’s “VolpexMPI: an MPI Library for Execution of Parallel Applications on Volatile Nodes” (non-determinism is an interesting discussion point in this context)
  • Rusty’s keynote on “Using MPI to Implement Scalable Libraries” (which I suspect could use collectives)
  • Argonne’s “Processing MPI Datatypes Outside MPI” (could be very very useful for LibNBC)
  • and Steven’s invited talk on “Formal Verification for Scientific Computing: Trends and Progress” (an excellent overview for our community)

The whole crowd:
image_preview

Unfortunately, I had to leave before the MPI Forum information session to catch my flight.
Videos of many talks are available online. All in all, it was worth attending. Next year’s EuroMPI (yes, the conference was finally renamed after the second year in a row without a PVM paper) will be in Stuttgart. So stay tuned and submit papers!

MPI 2.2 is now officially ratified by the MPI Forum!

I just came back from lunch after the MPI Forum meeting in Helsinki. This meeting focused (for the last time) on MPI 2.2. We finished the review of the final document and edited several minor things. Bill did a great job chairing and pushing the MPI 2.2 work and the overall editing. Unfortunately, we did not meet our own deadlines, i.e., the chapters and reviews were not finished two weeks in advance as planned (I tried to push my chapters (5 and 7) as hard as possible, but getting the necessary reviews was certainly not easy). However, the whole document was reviewed (read) by forum members during the meeting and my confidence is high that everybody did a good job.

Here are the results of the official vote on the main document:
yes: 14
no: 1
abstain: 2 (did not participate)

The votes by chapter will be online soon.

The feature set of the standard did not change. I posted it earlier here, and so did Jeff. But it’s official now! Implementors should now get everything implemented so that all users can enjoy the new features.

Here is a local copy (mirror) of the official document: mpi-report-2_2.pdf (the creation date might change)

One downside is that we already have errata items for things that were discovered too late in the process. This seems odd; however, we decided that we should not break our own rules. Even if the standard says that MPI_C_BOOL is 4 bytes, we had to close the door for changes at some point. The errata item (MPI_C_BOOL is one byte) will be voted on and posted soon on the main webpage.

Rolf will publish the MPI 2.2 book (like he did for MPI 2.1) and it will be available at Supercomputing 2009. I already ordered my copy :).

And now we’re moving on to MPI 3, so stay tuned (or participate in the forum)!

mpi-report-2.2-2009-09-04-as-1book

Hot Interconnects (HOTI’09) in New York

I attended the Hot Interconnects conference for the second time and it was as great as last year! This conference is rather compelling because it is a single-track conference with only a small number of highly interesting papers. And still, the attendance is huge, unlike at some other conferences where people only come when they have to present a paper and the audience is often sparse.

I gave a talk on static oblivious InfiniBand routing which was well received (I received a lot of questions and had very interesting discussions, especially during the breaks). Other highlights of the conference (in my opinion) were:

  • Fulcrum’s impressive 10 Gb/s FocalPoint switch design (the switch has full bandwidth at 10 Gb/s line rate, real 10 Gb/s, not 8 ;))
  • A paper about the implementation of collective communication on BG/P (I was hoping for a bit more theoretical background and a precise model of the BG/P network)
  • Some talks on optical networking were rather interesting
  • The panel about remote memory access over converged Ethernet was rather funny. Some people are seriously trying to implement the <irony> simple and intuitive </irony> OFED interface on top of Ethernet. I am wondering which real application (other than MPI) uses OFED as a communication API?

Here are some commented pictures:
ny1
View from the Empire State Building (credits go to Patrick!).

ny2
Another view from Empire State, the red arrow points at the conference location (the Credit Suisse Palace).

ny2
Times square (I think).

ny2
The Empire State “tip”.

ny2
Another downtown view.

ny6
The Empire State foyer.

ny2
I feel like I’m in Paris ;).

ny2
We saw this scary building without windows as we walked from the Empire State Building down to the World Trade Center site (weird).
ny9
*yeah* (seen next to the WTC site).

ny2
All we could see of the WTC site. It was not worth the long walk … but we talked most of the time anyway, so it paid off.

ny2
Wall Street (should be closed immediately!).

ny2
The subway is rather scary … seriously, New York!? Why do they have such a bad subway …

ny2
View from my (extremely cheap) hotel room. It was awesome, really!

ny14
The Credit Suisse Palace from the inside. Somebody has too much money (still).

The MPI Standard MPI-2.2 is fixed now

We just finished all voting on the last MPI-2.2 tickets! This means that MPI-2.2 is fixed now; no changes are possible. The remaining work is simply to merge the accepted tickets into the final draft that will be voted on next time in Helsinki. I just finished editing my parts of the standard draft. Everything (substantial) that I proposed made it in with a reasonable quorum. The new graph topology interface was nicely accepted this time (I think I explained it better and I presented an optimized implementation). However, other tickets didn’t go that smoothly. The process seems very interesting from a social perspective (the way we vote has a substantial impact on the results, etc.).

Some tickets that I think are worth discussing are:

Add a local Reduction Function – this enables the user to use MPI reduction operations locally (without communication). This is very useful for library implementors (e.g., implementing new collective routines on top of MPI) – PASSED!
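
To illustrate, here is a minimal sketch of how a library could use the new call (the buffer sizes and values are made up):

    /* Minimal sketch: combine two local arrays with MPI_SUM, no communication
       involved (MPI-2.2's MPI_Reduce_local). Buffer sizes are arbitrary. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        double partial[4] = {1, 2, 3, 4};
        double accum[4]   = {10, 20, 30, 40};
        /* accum[i] = accum[i] + partial[i], applied purely locally */
        MPI_Reduce_local(partial, accum, 4, MPI_DOUBLE, MPI_SUM);
        printf("accum[0] = %f\n", accum[0]);  /* prints 11.0 */
        MPI_Finalize();
        return 0;
    }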


Regular (non-vector) version of MPI_Reduce_scatter – this addresses a kind of missing functionality. The current Reduce_scatter should really be called Reduce_scatterv … but it isn’t. Anyway, if you ever asked yourself why the heck you should use Reduce_scatter, think about parallel matrix multiplication! An example is attached to the ticket. – PASSED!
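
A hedged sketch of what a call to the new routine looks like (array sizes and contents are made up):

    /* Hedged sketch of the regular variant, MPI_Reduce_scatter_block: every
       process contributes nprocs*COUNT elements and receives the reduced
       block of COUNT elements corresponding to its rank. */
    #include <mpi.h>
    #include <stdlib.h>

    #define COUNT 8

    int main(int argc, char **argv) {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        double *sendbuf = malloc(nprocs * COUNT * sizeof(double));
        double recvbuf[COUNT];
        for (int i = 0; i < nprocs * COUNT; i++) sendbuf[i] = rank + i;

        /* same block size for every process -- no recvcounts[] vector needed */
        MPI_Reduce_scatter_block(sendbuf, recvbuf, COUNT, MPI_DOUBLE,
                                 MPI_SUM, MPI_COMM_WORLD);
        free(sendbuf);
        MPI_Finalize();
        return 0;
    }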

Add MPI_IN_PLACE option to Alltoall – nobody knows why this is not in MPI-2. I suppose it seemed complicated to implement (an optimized in-place implementation is indeed NP-hard), but we have a simple (non-optimal, linear-time) algorithm to do it. It’s attached to the ticket :). – PASSED!
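
For completeness, a minimal sketch of the in-place call as MPI-2.2 allows it (sendcount and sendtype are ignored when MPI_IN_PLACE is passed; the buffer contents here are made up):

    /* Hedged sketch: in-place MPI_Alltoall as allowed by MPI-2.2. Each
       process exchanges COUNT ints with every other process, using a
       single buffer of nprocs*COUNT ints. */
    #include <mpi.h>
    #include <stdlib.h>

    #define COUNT 4

    int main(int argc, char **argv) {
        int rank, nprocs;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        int *buf = malloc(nprocs * COUNT * sizeof(int));
        for (int i = 0; i < nprocs * COUNT; i++) buf[i] = rank * 1000 + i;

        /* sendcount and sendtype are ignored when MPI_IN_PLACE is used */
        MPI_Alltoall(MPI_IN_PLACE, 0, MPI_DATATYPE_NULL,
                     buf, COUNT, MPI_INT, MPI_COMM_WORLD);
        free(buf);
        MPI_Finalize();
        return 0;
    }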


Fix Scalability Issues in Graph Topology Interface – this is, in my opinion, the most interesting/important addition in MPI-2.2. The graph topology interface in MPI-2.1 is horribly broken in that every process needs to provide the *full* graph to the library (which even for sparse graphs leads to $\Omega(P)$ memory *per node*). I think we have an elegant fix that enables a fully distributed specification of the graph, where each node specifies only its own neighbors. This will be even more interesting in MPI-3, when we start to use the topology as a communication context. – PASSED!
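
To give an idea of the new interface, here is a minimal sketch using the adjacent variant, where every process names only its own neighbors (a simple ring; the topology is of course made up):

    /* Hedged sketch of the scalable interface: each process passes only its
       own neighbors (here: a ring), nobody needs the full graph. */
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, nprocs;
        MPI_Comm ring;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        int left  = (rank - 1 + nprocs) % nprocs;
        int right = (rank + 1) % nprocs;
        int sources[2]      = {left, right};   /* who sends to me */
        int destinations[2] = {left, right};   /* whom I send to  */

        MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD,
                                       2, sources, MPI_UNWEIGHTED,
                                       2, destinations, MPI_UNWEIGHTED,
                                       MPI_INFO_NULL, 0 /* no reorder */, &ring);
        MPI_Comm_free(&ring);
        MPI_Finalize();
        return 0;
    }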

Extending MPI_COMM_CREATE to create several disjoint sub-communicators from an intracommunicator – Neat feature that allows you to create multiple communicators with a single call! – PASSED!
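
Roughly, each process passes the (disjoint) group it wants to end up in, and a single call returns a different sub-communicator per group. A hedged sketch that splits even and odd ranks:

    /* Hedged sketch: with MPI-2.2, a single MPI_Comm_create can create several
       disjoint communicators if processes pass disjoint groups -- here the
       even and the odd ranks end up in two separate sub-communicators. */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv) {
        int rank, nprocs;
        MPI_Group world_grp, my_grp;
        MPI_Comm subcomm;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        MPI_Comm_group(MPI_COMM_WORLD, &world_grp);

        /* collect all ranks with the same parity as mine */
        int n = 0, *ranks = malloc(nprocs * sizeof(int));
        for (int r = rank % 2; r < nprocs; r += 2) ranks[n++] = r;
        MPI_Group_incl(world_grp, n, ranks, &my_grp);

        /* every process calls this once; two disjoint communicators result */
        MPI_Comm_create(MPI_COMM_WORLD, my_grp, &subcomm);

        MPI_Comm_free(&subcomm);
        MPI_Group_free(&my_grp);
        MPI_Group_free(&world_grp);
        free(ranks);
        MPI_Finalize();
        return 0;
    }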

Add MPI_IN_PLACE option to Exscan – again, I don’t know why this is missing in MPI-2.0. The rationale that is given is not convincing. – PASSED!
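
A minimal sketch of the in-place form (the input is taken from, and the result written back into, the receive buffer; the values are made up):

    /* Hedged sketch: in-place exclusive prefix sum over one int per process.
       With MPI_IN_PLACE the input is taken from recvbuf and replaced by the
       output; the result on rank 0 is undefined. */
    #include <mpi.h>

    int main(int argc, char **argv) {
        int rank, val;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        val = rank + 1;
        MPI_Exscan(MPI_IN_PLACE, &val, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
        /* on rank r > 0, val now holds 1 + 2 + ... + r */
        MPI_Finalize();
        return 0;
    }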

Define a new MPI_Count Datatype – MPI-2.1 can’t send more than 2^31 (about 2 billion) objects on 32-bit systems right now – we should fix that! However, we had to move this to MPI-3 due to several issues that came up during the implementation (most likely ABI issues). POSTPONED! It feels really good to have this strict implementation requirement! We will certainly have this important fix in MPI-3!

Add const Keyword to the C bindings – the most discussed feature, I guess 🙂 – I am not sure about the consequences yet, but it seems nice to me (so far). – POSTPONED! We moved this to MPI-3 because part of the Forum wasn’t sure about the consequences. I am personally also going back and forth; the issue with strided datatypes seems really worrisome.

Allow concurrent access to send buffer – most programmers probably did not know that this is illegal, but it certainly is in MPI<=2.1. For example: int sendbuf; MPI_Request req[2]; MPI_Isend(&sendbuf, 1, MPI_INT, 1, 1, MPI_COMM_WORLD, &req[0]); MPI_Isend(&sendbuf, 1, MPI_INT, 2, 1, MPI_COMM_WORLD, &req[1]); MPI_Waitall(2, req, MPI_STATUSES_IGNORE); is not valid! Two threads are also not allowed to concurrently send from the same buffer. This proposal allows such access. – PASSED!

MPI_Request_free bad advice to users – I personally think that MPI_Request_free is dangerous (especially in the context of threads) and does not provide much value to the user. But we can’t get rid of it … so let’s discourage users from using it! – PASSED!

Deprecate the C++ bindings – that’s funny, isn’t it? But look at the current C++ bindings: they’re nothing more than pimped C bindings and only create problems. Real C++ programmers would use Boost.MPI (which internally uses the C bindings ;)), right? – PASSED (even though I voted against it ;))

Something odd happened to New Predefined Datatypes. We found a small typo in the ticket (MPI_C_BOOL should be 1 byte instead of 4 bytes). However, it wasn’t small enough that we could just change it (the process doesn’t allow significant changes after the first vote). It was now voted in with this bug (I abstained after the heavy discussion though) and it’s also too late to file a new ticket to fix it. However, we will have an errata item that clarifies this. It might sound strange, but I’m very happy that we stick to our principles and don’t change anything without proper reviews (these reviews between the meetings, where vendors could get user feedback, have influenced tickets quite a lot in the past). But still PASSED!

For all tickets and votes, see MPI Forum votes!

I’m very satisfied with the way the Forum works (Bill Gropp is doing a great job with MPI-2.2). I hear about other standardization bodies and have to say that our rules seem very sophisticated. I think MPI-2.2 will be a nice new standard which is not only a bugfix but also offers new opportunities to library developers and users (see the tickets above). We are also planning to have a book again (perhaps with an editorial comment addressing the issue in ticket 18 (MPI_C_BOOL))!

IPDPS’09 report

I’m just back from IPDPS 2009. Overall, it was a nice conference, some ups and downs included as usual. I had several papers at workshops, of which I had to present three (I was planning on only two, but one of my co-authors fell sick and couldn’t attend). They were all very well received (better than I hoped/expected).

I have been attending the CAC workshop for several years now and have been pleasantly surprised each year. It only has high-quality papers and about a 50% acceptance rate (be very careful with this metric, some of the best conferences in CS have a very high rate ;)). This year’s program was nicely laid out. The keynote speaker, Wu Feng, presented his view on green computing, and my talk was next. It was a perfect fit: Wu pretty much asked for more data, and I presented the data from our (purely empirical) study. My other talk presented the work on NBC by the group in Aachen – nicely done, I like the idea with the Hamiltonian-path numbering but I am wondering if one could do better (suggestions for a proof idea are welcome!).

Some talks were remarkable: Ashild’s talk about “Deadlock-Free Reconfiguration” was very interesting for me. Brice’s talk about “Decoupling Memory Pinning from the Application” reminded me a bit of the pipelined protocol in Open MPI; I’m not sure if I like it or not because it seems to hinder overlapping of computation and communication. The last talk, about improving the RDMA-based eager protocol, described a hybrid between eager and rendezvous for often-used buffers (each buffer has a usage count and is registered after some number of uses); a rough sketch of the idea follows below. However, the empirical data seemed to indicate that this only makes sense for larger buffers, and I agree with D. K. Panda’s comment that one could just decrease the protocol switching point for all considered applications. Still, the idea could be very interesting for some applications with varying buffer usage.
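
A rough sketch of the decision logic as I understood it from the talk; the names, structures, and the threshold below are made up for illustration (read the paper for the real scheme):

    /* Rough sketch (my reconstruction, not the authors' code): a per-buffer
       usage counter decides when pinning the buffer for zero-copy RDMA pays
       off; until then, the eager copy path is used. */
    #include <stddef.h>

    #define REGISTER_THRESHOLD 8   /* hypothetical: register after 8 uses */

    struct buffer_info {
        void   *addr;
        size_t  len;
        int     use_count;
        int     registered;        /* pinned for RDMA? */
    };

    void send_from(struct buffer_info *b) {
        b->use_count++;
        if (!b->registered && b->use_count >= REGISTER_THRESHOLD) {
            /* register_memory(b->addr, b->len);  cost amortized over future sends */
            b->registered = 1;
        }
        if (b->registered) {
            /* rdma_put_direct(b);  zero-copy path, no intermediate copy */
        } else {
            /* copy_to_bounce_buffer_and_send(b);  eager path via pre-registered memory */
        }
    }

    int main(void) {
        static char payload[4096];
        struct buffer_info buf = { payload, sizeof(payload), 0, 0 };
        for (int i = 0; i < 10; i++)
            send_from(&buf);       /* switches to the zero-copy path at use 8 */
        return 0;
    }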

IPDPS was in Rome this year and I don’t like Rome. I think it’s the dirtiest European city I know, and I had to stay for a week. The catering at IPDPS was bad as usual (only not-so-good cookies in the coffee breaks and an unspectacular dinner). But I wasn’t there for the food anyway.

The main track was ok. I didn’t agree with some of the best-paper selections. The OS jitter talk was interesting and contained some new data; however, it wasn’t clear what the new fundamental findings were. I suppose I have to read the paper. Some other theoretical papers seemed interesting, but I also need to read those articles. The panel was nice; I mostly agreed with Prof. Kale, who stated that caches are getting much less important, and Prof. Pingali, who wants to consider locality. I seriously wonder what happened to all those dataflow architectures – I think they are a worthwhile alternative to multicore systems. I was already following Nir Shavit’s activities, and I liked his keynote presentation about TM, even though there are obvious open problems.

Friday’s LSPP workshop was very interesting too. This was my second year at this workshop and I like it a lot (large-scale processing seems to gain importance). I enjoyed Abhinav’s talk, which perfectly motivated my talk (it was right after his), and I enjoyed the lively discussion during and after my talk (sorry for delaying the schedule). I’m also happy to see that there is now an asynchronous profiling layer for the cell messaging layer (mini-cell-MPI).

I did not enjoy the flight back … Italy was awful (the train ran late, the airport was overcrowded and super-slow, boarding was a catastrophe because I was on the waiting list until 5 minutes before departure, …). But I was able to upgrade to first class in the US so that my last flight was at least comfortable. Here are some pictures from a five-hour walk through Rome. We didn’t really pay attention because we were busy chatting :):

spanish_steps
The Spanish steps (don’t ask … it was on the map).

river
Some random river …

me
Yep, I was there (we think it’s the Vatican in the background).

collosseum
That’s simple: the Colosseum (and some arch).

balcony
The view from my hotel. I couldn’t stay in the conference hotel because it was overbooked. I wasn’t mad because this one was significantly cheaper and nicer :).

Cluster Challenge 2008 – an adviser’s perspective

I did the Cluster Challenge again this year. Last year was fun, but this year was better – we won! Here’s the story:

We started Saturday morning in Bloomington. The travel went pretty
smooth and Guido picked us up at the airport in Austin. We went directly
to the Conference location. The time before that was very stressful
because the machine wasn’t working quite as nicely as we would have liked.
The biggest problem was that we could not change the CPU frequency
of the new iDataplex system. However, we were able to change it on the
older version and saw significant gains. We benchmarked that we could
run 16 nodes during the challenge and use 12 of them for HPCC (while 4
were idle) with our power constraints (2×13 A). So we convinced IBM to
give us a pre-release BIOS update which was supposed to enable CPU
frequency scaling. And it looked good! We were able to reduce the CPU
clock to 2.0 GHz (as on the older systems). However, it was 4am and we
had to ship at 6, so we didn’t have the time to test more. But back to
Austin …

The guys from Dresden were already waiting for us because the organizers
did not allow them to unpack the cluster alone (it was supposed to be a
team effort). We unpacked our huge box and rolled our 900 pound cluster into
our booth.

Our Cluster

We spent the rest of the day installing the system and pimping (;-))
our booth. It went pretty well. Then we began to deploy our hardware and
boot it from USB to do some performance benchmarks.

Installing the fragile Fiber Myrinet equipment (we didn’t break anything!)

We started with HPCC and were shocked twice. Number one was that the CPU
scaling that cost us so many sleepless nights did not seem to help. All
tools and /proc/cpuinfo showed 2.0 GHz – but the power consumption was
still as high as with 2.5 GHz. So we wrote a small RDTSC benchmark to
check the CPU frequency – it still ran at 2.5 GHz. The BIOS was lying to
us :-(.  The second shock was that HPL was twice as slow as it should
be. So much for sleep …
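
For the curious, here is a minimal sketch of that kind of check (a reconstruction, not our exact code): read the time-stamp counter, wait a known wall-clock interval, and read it again. Note that on CPUs with a constant-rate TSC this reports the nominal rather than the currently scaled clock.

    /* Minimal sketch (reconstruction, not our exact code): estimate the CPU
       clock by counting TSC ticks over one wall-clock second. Caveat: on
       CPUs with a constant-rate TSC this yields the nominal frequency, not
       the currently scaled one. */
    #include <stdio.h>
    #include <stdint.h>
    #include <unistd.h>

    static inline uint64_t rdtsc(void) {
        uint32_t lo, hi;
        __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
        return ((uint64_t)hi << 32) | lo;
    }

    int main(void) {
        uint64_t start = rdtsc();
        sleep(1);                              /* one second of wall clock */
        uint64_t end = rdtsc();
        printf("approx. CPU clock: %.2f GHz\n", (end - start) / 1e9);
        return 0;
    }
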
Quite some time after midnight … still hacking on stuff. I’m trying to motivate our guys to go on (I am a good slave driver).

The students tried to fix it … all night long. The conclusion was that
we had to drop our cool boot-from-USB idea due to the performance
loss. Later, it turned out that shared memory communication uses /tmp
(which was mounted via NFS) and was thus really slow (WTF!). Anyway, we
decided about one hour before the challenge started to fall back to
disks. This worked.

How high can one stack hard drives? Not too high, actually ;). Man, it was hard to plug them all back into the system.

The second problem was a tough one. The BIOS … lying to us. We were
finally able to get hold of an engineer from IBM. He tried hard but
couldn’t help us either. So the students had to make the hard decision
to run with two fewer nodes :-(.

In the meantime, Bob and I had fun biking in order to power
laptops ;).

Bob Beck (UA) generating power on our fancy machine ;).

I was powering my laptop with the sandwiches I ate earlier :).

The Challenge finally starts

The challenge was about to start, the advisors couldn’t do anything
anymore, so we decided to get some fuel from the opening banquet for our
students in the nightshift ;).
Guido and me thinking about getting some good stuff for the students!

We finally found some good stuff on the showfloor *yay*. Advisor’s success!

Some of us were not totally up to speed all the time 😉 – It looks like somebody missed the start:

So the Challenge ran, and we had nothing to do (especially the advisers
who were just hanging around to feed and motivate the students). So we
did all kinds of weird things over night – and we had a bike ;).

I also started some coding during the challenge because I didn’t really
do anything, but it was way too noisy to work on papers. I had to pose
inside the Microsoft booth while my laptop finished up some cool
things! Thanks to Erez for taking the picture at exactly the right time.

Some Linux-based “research” performed/finished inside the Microsoft booth.
Guido explains Vampir to the other teams on one of our three ultra-cool
41” displays (again, around midnight ;)). We had really nice speakers
at the challenge. Especially on Sunday, when all the others left, we
cranked them up and listened to the soundtrack of Black Hawk down. The
security guys seemed kind of confused to hear really loud bass at 4am ;).

Guido! Don’t help the “enemies” ;).

YouTube also made it onto our display :). And it nearly cost us a point by
disturbing the sound output of our power warning system. Fortunately, Jens
noticed it.

Watch yourself:  Achmed the Dead Terrorist

Oh, and there was this Novell penguin that spontaneously caught fire. I
guess this happens when experienced computer scientists spend two days
installing a completely hopeless operating system (with InfiniBand – ask
me about details if you’re interested). I love Linux, but it’s a shame
that the abbreviation SLES has the word “Linux” in it. Debian or Ubuntu
is so much better! But apparently, SLES is better prepared for the
applications (clearly not for administration or software maintenance
though).

Each booth was “armed” with at least one student at all times.
Here are some images from after midnight.

The MIT booth – doesn’t it look more like Stony Brook?

The folks from Arizona State – they had a neat Cray – with Windows though. But it seems that it worked for them.

The guys from Colorado with Aspen systems (don’t ask them about their vendor partner).

The National Tsing Hua University – excellent people but their system was more of a jet engine than a cluster.

Our booth … note the image on the big screen ;).

The Alberta folks – last year’s champions. Darn good hackers!

Purdue with their SiCortex – they seemed rather annoyed all the time.

Our social corner: At 2am, most students didn’t have to do a lot (just
watching jobs). So they all gathered in front of our booth and played
cards :).

Two fluffy spectators were watching our oscilloscope animation during
the show-off on Thursday.

The team of judges, led by Jack Dongarra, talked to our students to
assess their abilities.

After that, we won! We don’t have a picture of our fabulous win yet, but I’ll post it along with some more links once I get it.

Living in the UITS datacenter

Timo (my visiting student from Germany) and I are pretty much living in the UITS data center (Wrubel building) these days. It’s a rather nice server room and our cluster sits directly next to the Data Capacitor and Big Red.

The last few days were nice; we stayed overnight during the weekend (more than 16 hours a day while others celebrated Halloween) and some nights during last week. Prof. Lumsdaine was generous enough to buy us pizza at night. Thanks a lot!
The fun fact is that some caterpillar ran over a huge power transformer yesterday and cut off part of the power supply to the building. The immediate result was that they had to switch off the air conditioning and some systems (Big Red was among them). It was rather funny to be in a really big data center without air conditioning. It got hotter literally every hour; I guess it reached about 30-35 degrees Celsius at the end (we were sweating while just sitting there). The biggest problem was that we were doing power measurements, but our system drew 0.3-0.4 A more than usual :-(. And we don’t know if the effect is linear :-/. We still ran some benchmarks. Anyway, the power and cold air came back around 2am. The IU physical plant people were really quick in getting a new transformer from Cincinnati by truck (in less than 6 hours).

We then tried to get some food on Wednesday night at 2:30am … man, all pizza places were closed. So we decided to drive to Taco Bell and have a “fourth meal”. We parked next to it and tried to walk into the restaurant – but it was locked. Ok, the drive-through was still open (how weird). Shortly after we left, the police stopped us and asked us if we had gasoline in the car (what a *really stupid question* – no, our car runs on hydrogen … man, those people). Anyway, we realized that they were searching for a firebug and that the Taco Bell people had told them we looked suspicious. WTF! Again, man, those people.

Cluster Challenge is really fun again. I’m happy that Guido and Jupp are here to help us. It’s getting really close but our system is really nice ;).