Dedicated to
High Processor Count Computing
Welcome to the very first entry for “Big N Computing,” Matt Reilly's High Processor Count Computing blog.
My name is Matt Reilly, and I'm the Chief Engineer at SiCortex, Inc. SiCortex designs and builds high processor count computer systems that solve problems in a variety of engineering and science disciplines.
But this blog is less about SiCortex, and more about trends in technical computing, system design, technology, and programming techniques. All the material here is mine; it doesn't come from the marketing department, it doesn't come from the sales team, it doesn't come from the PR consultants. So, if some of what I have to say annoys you, don't blame it on SiCortex. If, on the other hand, what I have to say resonates, take a look at SiCortex and call the sales department: the viewpoint I'm trying to articulate has a lot to do with SiCortex' vision of what technical computing can be, and will be.
So much for the commercial.
The historical and steady advance in uniprocessor performance gains by the grace of Moore's law has ended. For the foreseeable future increased computing capability per chip will come from more processors, not faster processors. At the same time our appetite for computes, our need for higher resolution results, and our burgeoning datasets are all growing faster than Moore's law will accommodate.
And so, this blog is dedicated to a simple proposition:
The key to advancing technical computing is high processor count computing.
"High processor count" doesn't mean eight or sixteen or thirty-two processors. It means thousands of processors. Tens of thousands. Big N computing.
"High processor count" doesn't mean machines that cost a hundred million dollars. It means machines that your company or department or team would buy. That you would use. That you will build your business, research, and engineering on.
In the coming months, I'll be writing about where I think the technical computing industry is headed, where it needs to be going, and how we can get there. Each entry will include a commentary segment, and a technical segment. This is the first commentary segment.

The technical segment will take the form of "Experiments in High Processor Count Computing" that explore some aspect of exploiting parallelism. I'll present results produced from experiments run on SiCortex cluster systems, and from time to time, a pizzabox or blade cluster. You can unpack the attached tarball each week and try these experiments at home. Where the commentary section will be breezy and typically brief, the technical section will dive into details and the nuts and bolts of a problem.
Many of the experiments are inspired by a simple model of a computation:
Tsolution = Tcalculation + Tcommunication + Tio
The object is to improve time to solution by applying thousands of processors to the problem. But we often find that one or another of the factors "won't scale." Communications becomes the bottleneck, or IO begins to dominate, or the application just can't be carved up into thousands of pieces.
This first experiment looks at simple MPI send/receive performance, as that is a large part of the Tcomm factor for many applications. In fact, we can often predict where a parallel solution will run out of steam just by calculating the ratio of Tcalc to Tcomm for various values of N.
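One way to make the ratio argument concrete is a toy model. The sketch below is not from the tarball. It assumes the calculation divides perfectly across N processors while the communication and IO costs per step stay fixed, which real applications only approximate. All function names are hypothetical.

```c
/* Toy scaling model (an assumption, not the author's code):
 * calculation time shrinks as 1/N, communication and IO do not. */

/* Modeled time to solution on n processors. */
double model_t_solution(double t_calc_serial, double t_comm, double t_io, int n)
{
    return t_calc_serial / n + t_comm + t_io;
}

/* Largest power-of-two processor count, up to n_max, at which the
 * per-processor calculation time is still at least the communication
 * time. Returns 1 when communication already dominates on one processor. */
int largest_useful_n(double t_calc_serial, double t_comm, int n_max)
{
    int best = 1;
    for (long long n = 1; n <= n_max; n *= 2) {
        if (t_calc_serial / (double)n >= t_comm)
            best = (int)n;
    }
    return best;
}
```

With a one second serial kernel and a one millisecond communication step, this model says the Tcalc to Tcomm ratio drops below one somewhere between 512 and 1024 processors. Past that point, adding processors mostly adds communication.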
The first step in finding the optimal balance is to find the cost of the calculation kernel. We all know how to do that, and it is more or less specific to the problem we're trying to solve at the moment.
The second step is to know how much a communications operation will cost. This is less a matter of the application and more a question of the size of the data transfer. We can express that either in latency or in bandwidth. I've found bandwidth is usually the number that I can keep in my head and apply most generally.
There are benchmarks (especially in the HPC Challenge set) that attempt to measure the communications capability of a system. I find these benchmarks don't give me enough information to really help in planning an application. Ping-pong latency, for instance, tells me little about how the network will behave with messages longer than 1K bytes. Nor does it tell me much about network contention. Random ring bandwidth, on the other hand, tells me what to expect from very large message sizes in the presence of contention, but the ring nature of the code (that is, every process is dependent on every other process to complete the next step) can confound the bandwidth measurement.
The first set of experiments that we’ll try is contained in the tarball BigNComputing_1.tgz. To run the experiments, you’ll need a Linux or Unix system with an MPI library and a C compiler. Unpack the tarball like this:
tar xzvf BigNComputing_1.tgz
Then take a look at the README file. Every system has a different approach to compiling and running MPI applications, so I can’t provide much in the way of detail here. Note that the code for this experiment needs to link with the math library, so you may need to add the ‘-lm’ switch to the compile command. The SiCortex MPI implementation is based on MPICH. You can learn more about MPICH on the MPICH project site. If your cluster doesn’t have an MPI library installed, MPICH is a good place to start. Once you get the program compiled, run it on an even number of processors and take a look at the data. The program itself takes no arguments.
Experiment 1 explores the cost of communication between processors in the absence of contention. Pairs of processors are chosen at random from the pool of available MPI ranks. When a pair is chosen, it exchanges messages of sizes varying from 4 bytes to 1MB. All other ranks are silent. The pair reports the achieved (measured) bandwidth for each size of message. The program picks 1000 pairs at random. The result is a scatter plot with message size on the X axis and message bandwidth on the Y axis. Each point in the plot represents the reported one way bandwidth between a pair of processes.
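For readers who want to see the moving parts before opening the tarball, here is a minimal sketch of the bookkeeping such a test needs. It is not the code from BigNComputing_1.tgz. The helper names are hypothetical, the doubling sweep from 4 bytes to 1MB is an assumption about how the sizes are stepped, and the bandwidth formula assumes a ping-pong style exchange timed over whole round trips.

```c
#include <stddef.h>
#include <stdlib.h>

/* Fill sizes[] with 4, 8, 16, ... up to max_bytes (doubling is assumed).
 * Returns the number of entries written, never more than capacity. */
size_t message_sizes(size_t *sizes, size_t capacity, size_t max_bytes)
{
    size_t count = 0;
    for (size_t s = 4; s <= max_bytes && count < capacity; s *= 2)
        sizes[count++] = s;
    return count;
}

/* Pick two distinct ranks in [0, nranks). Requires nranks >= 2.
 * rand() % n has a slight modulo bias, which is fine for a sketch. */
void pick_pair(int nranks, int *a, int *b)
{
    *a = rand() % nranks;
    *b = rand() % (nranks - 1);
    if (*b >= *a)
        (*b)++;            /* skip over a so the two ranks differ */
}

/* One-way bandwidth in bytes per second. Each round trip moves the
 * message twice (out and back), so the one-way time is half the total. */
double one_way_bandwidth(size_t bytes, int round_trips, double elapsed_seconds)
{
    return (2.0 * (double)bytes * round_trips) / elapsed_seconds;
}
```

In an MPI program the elapsed time would typically come from MPI_Wtime() calls wrapped around the MPI_Send and MPI_Recv loop on one member of the pair, while every other rank waits.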
Let’s take a look at the results from Experiment 1 when I ran the test program on an SC5832. The SC5832 contains 5832 MIPS cores on 972 nodes in a single cabinet. (See the product information on the SiCortex website for more details.) The test was run on 5000 processors distributed over 834 nodes.

Note that we see some dispersion: not all the chosen pairs saw the same achievable bandwidth. This is partly due to the effects of network path length, and partly due to the “stuff happens” principle of all experiments. Whenever I see tight distributions, I wonder if enough samples were taken. OS interference may be playing a role here, but I have no evidence to support or refute such an argument. In any case, the actual measured contention-free bandwidth is quite impressive, at 1.4GB/s for a 512KB message in several cases; the average is in the neighborhood of 1.3GB/s.
I ran the same experiment on 512 processors distributed over 128 nodes of the TACC Lonestar cluster. (Thanks to TACC for making this system available.) The Lonestar system uses Infiniband and a modified Fat-Tree to connect its nodes.

Here too, the peak bandwidth is quite impressive at almost 1GB/s. This is pretty much 100% of the advertised IB bandwidth. Note however the loose clustering as the message size grows.
Both the SiCortex system and the Lonestar system show a divot between 10KB and 100KB messages. I’ll ask the folks working on SiCortex’ MPI implementation what they think of the dip.
Experiment 2 modifies the first experiment so that we can measure the result of contention among the processors.
I find myself using the result from Experiment 2 quite often in the design or analysis of parallel programs. Where the normal benchmark numbers that we see posted in the HPC Challenge give us “not to exceed” figures, Experiment 2 gives us a measure of the interconnect system under a realistic load where all the processors in an application are communicating at the same time.

The plot above reports the achieved bandwidth when 5000 processors are active in an SC5832. Note that six processors on each node are now sharing the communications engine, so instead of each reporting a peak bandwidth of 1.4GB/s, the six share that peak. By that reasoning, we should see a peak of a bit over 200MB/s for each processor. What happened?
Contention happened. In the case of the SC5832, the network diameter is 6. The average message path is about 6 hops as well. Consider that each node has three outbound links available. If the average message traverses six links, then each outbound link will carry six “remotely originated” messages for every message that is sourced from its own node. (I think I got that right. Is it six? or five?) So, we start with a node output bandwidth of 4.2GB/s (3 links times 1.4GB/s per link), divide it by 7 to get 600MB/s to be shared among 6 processors. This gives an upper bound of 100MB/s per processor when the fabric is fully occupied.
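The estimate above is easy to write out as a small helper. This is only a restatement of the back-of-envelope arithmetic, not a model of the real fabric, and the function and parameter names are hypothetical. Because the text itself wonders whether the divisor should be built from six or five remote messages, the divisor is left as a parameter.

```c
/* Upper bound on per-processor bandwidth under full load.
 *   link_bw           bandwidth of one outbound link (any unit, e.g. MB/s)
 *   links_per_node    outbound links on each node
 *   messages_per_link messages sharing a link: remote ones plus the
 *                     locally sourced one (7 in the text's estimate)
 *   procs_per_node    processors sharing the node's links
 * The result is in the same unit as link_bw. */
double contended_bw_per_proc(double link_bw, int links_per_node,
                             int messages_per_link, int procs_per_node)
{
    double node_bw = link_bw * links_per_node;      /* 3 x 1400 = 4200 MB/s */
    double sourced_bw = node_bw / messages_per_link; /* 4200 / 7 = 600 MB/s */
    return sourced_bw / procs_per_node;              /* 600 / 6 = 100 MB/s  */
}
```

Calling contended_bw_per_proc(1400.0, 3, 7, 6) reproduces the 100MB/s bound. If the right count were five remote messages plus one local, a divisor of 6 would raise the bound to roughly 117MB/s, so the conclusion does not change much either way.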
Happily enough, that’s pretty much what we see for 8KB messages, and then things start to decline. I suspect that the transition between 8KB and 16KB is the result of a performance optimization that helps the SiCortex system’s IO performance, but causes some degradation to MPI send/receive performance in the presence of very high contention. (Note that Experiment 1 showed no such transition.) The other dip between 256 bytes and 1K bytes is apparently in the region of the “eager” to “rendezvous” switchover. The SiCortex folks will be looking at this behavior too. Again, I suspect it is the outcome of a tradeoff between high and low contention dynamics. Real engineering is always a matter of careful compromise.
Finally, let’s compare the measured results from the TACC cluster with the results from the SiCortex test.

As one would expect, the Fat Tree provides very good performance at large message sizes. However, each Infiniband port on a node is being shared among four processors. Perfect sharing will limit each processor to 250MB/s. It is unlikely that contention in the Fat Tree was much of a factor, as this test used less than 10% of all the nodes in the cluster, and Lonestar didn’t appear to be very busy on the morning of the test run. (Though I would not claim an absolutely rigorous measurement regime. It is hard for me to get the entire Lonestar machine allocated to a single experiment; I am a guest after all.)
I’m intrigued by the comparison at the low end of the range of message sizes. The SiCortex fabric boasts a very low PingPong delay, which means the fixed cost of a short message is in the neighborhood of 2.5 μs, and that fits the curve very well for messages up to about 128 bytes. The Infiniband measurements suggest a fixed cost of about 50 μs for a short transfer, which tracks the actual measured bandwidth quite well up to about 4KB.
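To see what a fixed per-message cost implies, here is a minimal sketch of the usual two-term model: a message costs a fixed time plus its size divided by a peak rate. This is my reading of the curve fit described above, not code from the tarball. The function names are hypothetical, and any peak rate passed in is an illustrative assumption rather than a measured value.

```c
#include <stddef.h>

/* Bandwidth (bytes/s) when only the fixed cost matters, which is the
 * regime where short messages live: bytes / fixed_cost. */
double latency_bound_bw(size_t bytes, double fixed_cost_s)
{
    return (double)bytes / fixed_cost_s;
}

/* Bandwidth (bytes/s) under the two-term model:
 * time = fixed_cost + bytes / peak_bw. */
double effective_bw(size_t bytes, double fixed_cost_s, double peak_bw)
{
    double seconds = fixed_cost_s + (double)bytes / peak_bw;
    return (double)bytes / seconds;
}
```

With a 2.5 μs fixed cost, a 128 byte message can do no better than about 51MB/s under this model. With a 50 μs fixed cost, the same 128 byte message is held to about 2.6MB/s, and it takes a 4KB message to reach about 82MB/s. In the two-term model, a message of size fixed_cost times peak_bw achieves exactly half the peak rate, which is a handy single number for comparing fabrics.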
This difference is significant: as we distribute a problem among more and more processors, the average message must get smaller. But as messages shrink, the efficiency of the fabric decreases, in some cases quite dramatically.
This is why we need to start looking at network performance in terms of a matrix of characteristics (bandwidth vs. message size vs. contention rate), not just as a “ping-pong, ring-bandwidth” tuple.
Try out the experiment in the tarball. Let me know what you find out. My email address is matt at BigNComputing dot org.
4/14/08
The computing industry turned a corner sometime back: the GHz march is over, and future advances in high performance computing will rely on our ability to exploit parallel architectures and develop parallel solutions.