<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:iweb="http://www.apple.com/iweb" version="2.0">
  <channel>
    <title>High Processor Count Computing</title>
    <link>http://www.bigncomputing.org/Big_N_Computing/Big_N_Computing/Big_N_Computing.html</link>
    <description>Thanks for coming to “Big N Computing,” a blog about parallel processing with large numbers of processors. &lt;br/&gt;&lt;br/&gt;Big N Computing isn’t just the future of technical computing. It is where technical computing is now.</description>
    <generator>iWeb 3.0.1</generator>
    <image>
      <url>http://www.bigncomputing.org/Big_N_Computing/Big_N_Computing/Big_N_Computing_files/SC5832.jpg</url>
      <title>High Processor Count Computing</title>
      <link>http://www.bigncomputing.org/Big_N_Computing/Big_N_Computing/Big_N_Computing.html</link>
    </image>
    <item>
      <title>Cleaning the Attic</title>
      <link>http://www.bigncomputing.org/Big_N_Computing/Big_N_Computing/Entries/2010/12/9_Cleaning_the_Attic.html</link>
      <guid isPermaLink="false">c3e969c4-9eb0-41fa-b063-552180a3796c</guid>
      <pubDate>Thu, 9 Dec 2010 15:54:03 -0500</pubDate>
      <description>&lt;a href=&quot;http://www.bigncomputing.org/Big_N_Computing/Big_N_Computing/Entries/2010/12/9_Cleaning_the_Attic_files/droppedImage.jpg&quot;&gt;&lt;img src=&quot;http://www.bigncomputing.org/Big_N_Computing/Big_N_Computing/Media/object002_1.jpg&quot; style=&quot;float:left; padding-right:10px; padding-bottom:10px; width:254px; height:135px;&quot;/&gt;&lt;/a&gt;Intermittent might be too kind a word for a blog that gets updated at six-month intervals. &lt;br/&gt;&lt;br/&gt;Today’s entry will be the last for a while, and it is aimed particularly at folks designing microprocessors. If it is a useful insight, then have at it -- I make no claim on it. If it is a well-known insight, then we’ve found another example of independent development in parallel cultures: it is likely that fire got discovered in a lot of places at about the same time. &lt;br/&gt;I’ve been packing up the office and house here at Matt Reilly Consulting’s galactic headquarters, as we’re getting ready to move to Maryland and to a new job. That’s involved a lot of sifting through dusty books and faded papers. Some have been quite entertaining. One stack of material is the specification for the Alpha EV8 processor that was cancelled when Intel bought the Alpha Development Group back in 2001. I asked permission of the appropriate folks and they were gracious enough to allow me to keep the document as a historical monument. &lt;br/&gt;I am struck by the amount of machinery in the design devoted to handling cache coherence intrusions and other inter-processor memory transactions. (Alpha processors used a split load-locked/store-conditional pair to implement “atomic” memory references -- these and the memory consistency rules required processors to do some pretty fancy book-keeping to ensure that memory writes occurred in the appropriate order and that reads were evident in conformance to the memory rules.)
Almost all the machinery involved keeping track of memory references that were issued by the issue unit and had passed through the execution stages. These were checked against incoming references from other processors. Load instructions, for instance, that might have issued out of order, might be cancelled by an intervening write from another processor. Stores that hadn’t made it out to “main memory” might need to be checked against incoming requests. &lt;br/&gt;That kind of stuff always made my head hurt. I don’t think I ever saw an engineering conversation around these issues that didn’t use two or three colors of markers and span a whole lot of white-board real-estate. &lt;br/&gt;Further, the complexity of the intrusion management suggests that verification of the intrusion mechanisms might be rather complex. One must manage to insert remote intrusion requests “before,” “after,” and “during” conflicting local operations -- that is, the verification effort must ensure that it tickles lots and lots of cross products that have both functional and temporal dimensions. Alpha and MIPS implementations with which I am familiar injected the intrusions into a very late stage of the pipe -- quite remote from the instruction issue stage. This means that the verification scaffolding may have very little direct control over the rendezvous between an intrusion and a relevant instruction sequence. (“Pushing on a rope” may be an apt image.)&lt;br/&gt;None of this is magic, or even beyond the state of the art: there are lots of very very smart people out there who have spent their careers figuring out how to test this stuff. &lt;br/&gt;But, it seems to me that managing the cross-product generation process would be much easier if the intrusions entered the pipeline at the instruction scheduling/dispatch stage rather than later.* And modern multithreaded processor designs make this approach less radical than it might have been two or three generations ago. 
&lt;br/&gt;Specifically, imagine a processor whose memory consistency model requires just a few “remote” operations: &lt;br/&gt;	•	Flush Outstanding Writes and Dirty L1 Cache Block to “Memory”&lt;br/&gt;	•	Invalidate Lock Flag for Address X&lt;br/&gt;	•	Invalidate Cache Block X&lt;br/&gt;	•	Update Cache Block X with Data Word W&lt;br/&gt;	•	Fill Cache Block X&lt;br/&gt;Note that the first and last items aren’t necessarily peculiar to multiprocessors. &lt;br/&gt;My guess is that you can implement most of a cache coherence regime with just those operations and perhaps one or two more. In the traditional approach, all of these operations enter the processor in the MBox or cache management logic very late in the pipeline. In an out-of-order machine they need to be checked against the actual  program order of all updates (and reads) of the cache to ensure that no memory ordering rules are broken. &lt;br/&gt;But what if the intrusions were introduced as fundamental “hardware only” instructions at the instruction dispatch/schedule stage as an instruction stream for an “intrusion thread?”  That is, what if we create a thread that exists only for the purpose of injecting intrusions into the instruction stream. Such a thread would have few, if any, physical registers or architectural state. It exists purely to inject marker or coherence instructions in issue time order. These special instructions would cause the cache controller and memory pipeline to perform the related tasks, but they’d flow through the pipeline interspersed with the normal instruction flow. &lt;br/&gt;This would allow the verification team far more control over which cross products get exercised -- no more rope pushing. More importantly, it creates a great deal of common ground between the problem of designing the inter-processor intrusion logic and the problem of designing the inter-thread memory reference reconciliation logic. (Threads must obey the memory ordering rules too.) 
There must be some virtue in reducing the apparent architectural complexity of inter-processor memory ordering. &lt;br/&gt;I never got a chance to implement a scheme like this. Perhaps I might someday. But gee, this seems like it makes a lot of things a whole lot simpler. &lt;br/&gt;If you are a microprocessor designer, and this makes sense to you, drop me a line. &lt;br/&gt;If you aren’t a microprocessor designer, this was probably more tedium than you deserved. I hope to offer something more interesting next time.&lt;br/&gt;.................&lt;br/&gt;postscript/update:&lt;br/&gt;* SiCortex alums may recognize this approach, as it was a feature suggested for the second generation Ice-T core design which was not yet completed when the company shut its doors. </description>
      <enclosure url="http://www.bigncomputing.org/Big_N_Computing/Big_N_Computing/Entries/2010/12/9_Cleaning_the_Attic_files/droppedImage.jpg" length="57874" type="image/jpeg"/>
    </item>
    <item>
      <title>Give Me a SINE</title>
      <link>http://www.bigncomputing.org/Big_N_Computing/Big_N_Computing/Entries/2010/6/21_Give_Me_a_SINE.html</link>
      <guid isPermaLink="false">0dbce614-936e-4808-bc0a-5541cde99085</guid>
      <pubDate>Mon, 21 Jun 2010 16:00:12 -0400</pubDate>
      <description>&lt;a href=&quot;http://www.bigncomputing.org/Big_N_Computing/Big_N_Computing/Entries/2010/6/21_Give_Me_a_SINE_files/sin.jpg&quot;&gt;&lt;img src=&quot;http://www.bigncomputing.org/Big_N_Computing/Big_N_Computing/Media/object001_1.jpg&quot; style=&quot;float:left; padding-right:10px; padding-bottom:10px; width:257px; height:193px;&quot;/&gt;&lt;/a&gt;I was coding up a set of FFT routines recently when the question of sin vs. native_sin accuracy came up. Specifically, on the Radeon 5870 widget on my desk, the native_sin is very very fast, on the order of a mul-add operation or two, while the normal sin operation takes at least 20 times as long. But the OpenCL document suggests that native_sin may produce results that are in error by an implementation defined amount. (No mention is made as to where this “implementation definition” might be found, or even if the implementer of the OpenCL run time is obliged to share the information with us.)  &lt;br/&gt;&lt;br/&gt;So how accurate is native_sin? What do we mean by accuracy?  Some suggest measuring the difference between the test implementation and a “known good and accurate” implementation in terms of “units in the last place.” In its simplest form, compare native_sin to a reference and count the number of times we’d have to increment/decrement the mantissa of the reference before it agreed with the native_sin result. &lt;br/&gt;&lt;br/&gt;For a test framework, I generated a long vector of a few million arguments “uniformly distributed” over a range. In the tests reported here, the range was never larger than the 2pi range surrounding zero.  (Going beyond this range introduces the error caused by taking the argument mod pi, and in the applications that I care about, the argument is always within [-pi, pi].) The argument vector is then passed to a kernel that calculates the value for sin(arg[i]) and native_sin(arg[i]). 
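&lt;br/&gt;&lt;br/&gt;The mantissa-counting idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code from the test kit: the helper names float32_bits, ordered, and ulp_distance are made up here, and the sketch assumes finite IEEE 754 single precision values.

```python
import struct


def float32_bits(x):
    """Round x to single precision and return its bit pattern as an integer."""
    # struct rounds the Python double to the nearest float32 when packing.
    return struct.unpack("=I", struct.pack("=f", x))[0]


def ordered(bits):
    """Map a float32 bit pattern to an integer that rises with the float value."""
    sign = bits // 0x80000000        # 1 for negative values, 0 otherwise
    magnitude = bits % 0x80000000    # exponent and mantissa fields
    return magnitude if sign == 0 else -magnitude


def ulp_distance(test, reference):
    """Count the single precision steps between test and the rounded reference."""
    # Assumption: both inputs are finite (no NaN or infinity handling here).
    return abs(ordered(float32_bits(test)) - ordered(float32_bits(reference)))
```

Calling ulp_distance(gpu_value, reference_value) with a GPU result and a double precision reference gives the number of mantissa increments or decrements separating the two once the reference is rounded to single precision.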
&lt;br/&gt;&lt;br/&gt;I compare both sets of results (calculated on the GPU) against a “reference” implementation of sin -- the version in the standard GNU implementation of libm for x86_64, with all arguments and results calculated in double precision. The double precision value is then rounded to single precision (by the default mode: round-to-nearest) and compared with the GPU result. ULP error (error in terms of the “unit in the last place”) is calculated by dividing the difference between the reference and test values by DELTA, where DELTA is defined as the difference between the reference value and the reference value with its low order mantissa bit flipped. &lt;br/&gt;&lt;br/&gt;A plot of the ULP error for sin(arg) shows that the sin implementation for this version of the AMD SDK is within one bit or so of the reference for all the argument values that were tested. &lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;So the GPU implementation of sin() is reasonably good across the range. All errors are within one bit of the “reference” value. &lt;br/&gt;&lt;br/&gt;Now let’s look at the plot for native_sin().  I tested 1e6 points over the range. &lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;Obviously, there are some problems in native_sin() for small angles or angles near multiples of Pi. Note that an error of 1e6 ULP suggests that the low 20 bits of the mantissa are in error. &lt;br/&gt;&lt;br/&gt;Clearly the problem is in expressing results that are close to zero. Let’s zoom in around native_sin() for very small numbers: &lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;Ahhh.. Now we’re beginning to see a pattern.  It is clear that close to zero, even the exponent is in error! Why? Well, recall that sin(x) approaches x as x -&gt; 0.
Let’s see what native_sin(x) looks like in this region: &lt;br/&gt;&lt;br/&gt;&lt;br/&gt;&lt;br/&gt;The GPU rounds the value of native_sin() to the nearest multiple of 3.57e-7 -- which is about 3x the value of FLT_EPSILON, the difference between 1.0 and the next higher representable floating point number. &lt;br/&gt;&lt;br/&gt;So what does this mean?  Probably very little, unless you do a lot of calculations with very small angles. &lt;br/&gt;&lt;br/&gt;But that is exactly what happens in large FFT operations.  Calculating twiddle factors for large FFTs may involve taking the sine of angles smaller than Pi/1024. In this range the native_sin error may be as large as 1000 ULP -- that is, ten bits of the 24-bit mantissa are in error. What does this mean for the accuracy of large FFT operations? I’ll leave that to the numerical experts. In the meantime, I’ll be careful. &lt;br/&gt;&lt;br/&gt;Would you like to explore this topic further?  The tarball here &lt;a href=&quot;Entries/2010/6/21_Give_Me_a_SINE_files/sin_test_kit.tgz&quot;&gt;sin_test_kit.tgz&lt;/a&gt; contains all the experimental code and a makefile. Edit the makefile to point to your OpenCL installation, and type make.  Then fix the problems.  The file “do_sin_exp.sh” will produce the plots shown here. &lt;br/&gt;&lt;br/&gt;And speaking of the OpenCL specification...&lt;br/&gt;&lt;br/&gt;A specification is more than a catalog of interesting features:&lt;br/&gt;&lt;br/&gt;The OpenCL specification is an entertaining read, but somewhat frustrating.  &lt;br/&gt;&lt;br/&gt;In fact, to call it a specification is somewhat premature. At best the current specification, identified as version 1.1 &lt;a href=&quot;http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf&quot;&gt;here,&lt;/a&gt; is a report on an interesting API and programming language.  It fails to rise to the necessary level of precision or discipline that will be required if OpenCL is to become a viable platform.
&lt;br/&gt;&lt;br/&gt;The problem is, of course, that the report is silent on a wide variety of topics. For instance, I can find no part of the specification that describes what the acceptable behavior is for a kernel that initiates an access off the end of a vector.  The document does not even suggest that the behavior is undefined.  As far as the specification writers are concerned, the event doesn’t occur in practice. You and I might assume that out-of-bounds references are “undefined.”  But does that mean that the kernel will abort? I would not assume so, but would the developer in the next cubicle make the same assumption?&lt;br/&gt;&lt;br/&gt;Nor does the Khronos-hosted document indicate what we should expect of the contents of a block of newly allocated GPU memory.  The implementation that I use has a peculiar behavior: if I allocate a block of GPU memory with clCreateBuffer and then write some data to it, that data will often be there the next time I run the program! Is this “bad” behavior?  Perhaps. Is it behavior that I would rely on? Of course not. But the language document doesn’t contain a whisper about what is supposed to happen, or what it is safe to assume will happen. Not even a statement that “the contents of a newly allocated buffer are undefined.” &lt;br/&gt;&lt;br/&gt;Why does this matter? It matters because OpenCL is being implemented by multiple teams. Each team will interpret the language note in its own way. Application writers will do the same. A good specification will limit the number of erroneous or mismatched interpretations. The OpenCL 1.1 document does not. Nor does the document identify any authority or process for resolving ambiguities. One might look to the Khronos-hosted OpenCL community forum, but the problem with community forums and language specifications is that “on the web, nobody knows you’re a dog.” Or a precocious teen. Or an implementer with tunnel vision.
A discussion forum on its own is not an authoritative source.&lt;br/&gt;&lt;br/&gt;You and I are smart enough and disciplined enough that we’d never assume that data in the GPU would persist from one invocation to another. And neither of us would ever rely on a particular behavior in the wake of an out-of-bounds vector element index. But what about more subtle distinctions? &lt;br/&gt;&lt;br/&gt;What about the order in which worker IDs are issued?  I was recently working on an algorithm that would have taken advantage of the behavior I’d observed in some early tests. Fortunately, I knew enough to check to see if the language document made any guarantees about ID ordering and assignment -- it doesn’t. (If the document were a full-fledged specification, it would at least indicate that the order in which IDs are assigned is “implementation dependent.”)  Someday somebody is going to make an unwarranted assumption about the behavior of an OpenCL program and it will be correct for their development platform and wrong in the production device. And it may not get caught in beta testing. Given the configuration cross products that are aggravated by the OpenCL JIT compiler, the failure may only manifest itself in the field. &lt;br/&gt;&lt;br/&gt;Most engineers who’ve been in the industry long enough have encountered this problem. When the specification is fuzzy, flexible, or “purposefully vague” (in the words of an apparently well-meaning developer) users will experiment with the current implementation to “reverse engineer” their way to a private specification. In the end, the actual behavior of an OpenCL program will be defined by the “dominant” implementation. Someday there will be a disagreement between the two or three major OpenCL implementations and the community will have to resolve the problem. Some old codes will break. That’s not a good omen for folks who are considering an investment in OpenCL libraries and infrastructure.
&lt;br/&gt;&lt;br/&gt;This wouldn’t be such a big deal if OpenCL applications were limited to game playing and entertainment. But they aren’t. We can expect OpenCL to be used in devices that steer, that inspect, that search, and that fly. &lt;br/&gt;&lt;br/&gt;The solution isn’t at all complicated. The OpenCL committee needs to resolve to prepare a legitimate language specification. OpenCL is a good idea. I even like the programming model and its control API. OpenCL programs have a chance of running on lots of different hardware platforms. But OpenCL is doomed if we’re only going to see an update to the defining document every 18 months and if the defining document makes a fetish out of “flexibility” and “ambiguity” (freedom for the OpenCL implementers! Hooray!) at the expense of the predictability and portability that are required if application and library developers are to see some return from their investment. &lt;br/&gt;&lt;br/&gt;Here are a few simple steps that could demonstrate that the OpenCL consortium is serious about this problem:&lt;br/&gt;	1.	 Identify a specific authority responsible for resolving questions about the OpenCL language and API in a timely manner. Make a channel to that authority available to the user community. &lt;br/&gt;	2.	 Assign resources to completing the description of thread launch behavior (including ID assignments) with particular attention paid to which assumptions are permissible and which are not. &lt;br/&gt;	3.	 Assign resources to define and codify the memory ordering rules, especially those related to the clEnqueueMap operations. What are the guarantees around memory contents vs. ordering of updates from the host and from the executing device? &lt;br/&gt;	4.	 Identify a mechanism by which vendors are required to codify or document “implementation dependent” accuracy limits for the native_xxx operations. &lt;br/&gt;	5.	 Rework the descriptions of the buffer allocation operations. 
Specifically, the definitions of the various allocation options  CL_MEM_USE/ALLOC/COPY_HOST_PTR need clarification. (The various OpenCL forums frequently address this issue.) &lt;br/&gt;	6.	 Release the next version of the document before May of 2011. &lt;br/&gt;&lt;br/&gt;OpenCL is a good solution to an interesting problem. Now the vendors behind OpenCL need to step up and produce a specification document that does justice to the name. </description>
      <enclosure url="http://www.bigncomputing.org/Big_N_Computing/Big_N_Computing/Entries/2010/6/21_Give_Me_a_SINE_files/sin.jpg" length="56553" type="image/jpeg"/>
    </item>
    <item>
      <title>Small Brick, Big ‘N’</title>
      <link>http://www.bigncomputing.org/Big_N_Computing/Big_N_Computing/Entries/2010/2/22_Small_Brick,_Big_%E2%80%98N%E2%80%99.html</link>
      <guid isPermaLink="false">58ecdcf3-8f33-453c-aa3a-0c784e971883</guid>
      <pubDate>Mon, 22 Feb 2010 21:28:59 -0500</pubDate>
      <description>&lt;a href=&quot;http://www.bigncomputing.org/Big_N_Computing/Big_N_Computing/Entries/2010/2/22_Small_Brick,_Big_%E2%80%98N%E2%80%99_files/JohnMucci-LG.jpg&quot;&gt;&lt;img src=&quot;http://www.bigncomputing.org/Big_N_Computing/Big_N_Computing/Media/object000_1.jpg&quot; style=&quot;float:left; padding-right:10px; padding-bottom:10px; width:199px; height:212px;&quot;/&gt;&lt;/a&gt;Today’s post will introduce SOCL -- a quick and dirty wrapper library to get folks started with OpenCL. &lt;br/&gt;--&lt;br/&gt;Up until now, these intermittent entries have concerned themselves with big iron. But the GPGPU fans would tell us that there are many ways to express parallelism, and many ways to share the work over a large number of processors. &lt;br/&gt;On the face of it, the model for GPGPUs recalls Dalibor Vrsalovic’s summary of a big N parallel machine being constructed back in the mid 80’s: “instead of building one big tractor, they have harnessed thousands of chickens to a plow.” The dig had been that arrays of small processors sacrifice per-processor horsepower in hope that the collective effort will more than compensate.&lt;br/&gt;But that was before we’d hit the memory wall. Now that arithmetic units are small, cheap, and much faster than large memory arrays, the real problem in computing is moving the bits around, not bashing them together. (Have I said that before?) It doesn’t take much of a processor to generate addresses into an array or vector. The GPU’s chicken-like processing elements are more than adequate to the real task: generating the next address. &lt;br/&gt;So, on the face of it, GPGPUs are worth a close look. Most of the high end units offer more than 100GB/s of external memory bandwidth, some kind of programming environment, and Linux support. So, I took the plunge and bought an ATI/AMD Radeon 5870. &lt;br/&gt;It rocks. &lt;br/&gt;I’ll be writing a few posts, time permitting, on my adventures in GPUland. This is the first. 
(And it comes with a free tarball inside!)&lt;br/&gt;--&lt;br/&gt;CUDA followers, note that I’ve got nothing against nVidia; I just happened to choose the ATI hardware because of a consulting opportunity. But I was attracted by the relatively low price for a pretty capable widget: the 5870 sells for about $400. On the other hand, the OpenCL ecosystem is still developing: in fact, compared with CUDA, OpenCL is in the embryonic stage. &lt;br/&gt;This is a huge barrier to acceptance for the geek demographic that I live in. To get a picture of just how primitive things are, take a look at any of the OpenCL tutorial pages. For some reason, tutorial writers seem to be compelled to spend lots of time at the outset on all the machinery to set up the GPU. (It is as if a host ushered his guests into the dining room, and asked them to paint the ceiling and polish the silverware before sitting down to dinner.) Enough with the machinery already!  Show me a vector sum!  &lt;br/&gt;To make it easier to take OpenCL out for a test drive, I set out to abstract as much of the OpenCL boilerplate as I could and distill it down to a few simple operations. The goal was to get something like “hello world” that I could use in various simple and quick experiments. The result is a set of routines, macros, structures, and the like called “Simplified OpenCL” or “SOCL” for short. (Don’t make too big a deal of this; think of SOCL as training wheels for the OpenCL bicycle. We should eventually jettison the training wheels.)  SOCL simplifies the process of loading and compiling the OpenCL kernel, setting up GPU memory buffers, copying data to and from the GPU, and launching kernel operations.  This is, by no means, professional quality software -- I whipped this up over a couple of days and expect it to be thrown away.  The documentation is limited, and the release control is non-existent. &lt;br/&gt;That said, the widgets are pretty cool.  They get the basic OpenCL framework set up in a few simple calls. 
The wrappers for functions like clEnqueueNDRangeKernel measure the time to complete key operations and store the measurements in a log that can be dumped at the end of the run. Here’s the SOCL.log output from the socl_demo.c program included in the tarball. Each log entry starts with an event type, optionally followed by one or two fields of additional information about the event, and finally the wallclock time consumed by the event. &lt;br/&gt;LDPROG 1.5409e-05&lt;br/&gt;COMPILE 0.1339&lt;br/&gt;CREATE_BUF gpu_a 33554432 0.000686912&lt;br/&gt;CREATE_BUF gpu_b 33554432 0.00550139&lt;br/&gt;CREATE_BUF gpu_c 33554432 0.0115236&lt;br/&gt;WRITE_BUF gpu_a 1.30768e+09 0.0256595&lt;br/&gt;WRITE_BUF gpu_b 1.42136e+09 0.0236073&lt;br/&gt;SET_ARG vector_add 0 4.478e-06&lt;br/&gt;SET_ARG vector_add 1 1.78e-07&lt;br/&gt;SET_ARG vector_add 2 2.01e-07&lt;br/&gt;CALL_KERN vector_add 5.42603e+09 0.00154599&lt;br/&gt;CALL_KERN vector_add 6.0735e+09 0.00138118&lt;br/&gt;CALL_KERN vector_add 6.36995e+09 0.0013169&lt;br/&gt;CALL_KERN vector_add 7.42195e+09 0.00113024&lt;br/&gt;CALL_KERN vector_add 7.94338e+09 0.00105605&lt;br/&gt;READ_BUF gpu_c 8.05025e+08 0.0416812&lt;br/&gt;The problem with toys like this is that we can spend far too much time polishing them and adding features. My real goal here is to get the code out so that more people can try OpenCL and the AMD SDK. So here it is, quick, dirty, and ready to go.&lt;br/&gt;You can find the AMD OpenCL SDK &lt;a href=&quot;http://developer.amd.com/gpu/ATIStreamSDK/Pages/default.aspx&quot;&gt;here&lt;/a&gt;. The SOCL tarball is &lt;a href=&quot;Entries/2010/2/22_Small_Brick,_Big_%E2%80%98N%E2%80%99_files/socl_v0r01_kit.tgz&quot;&gt;socl_v0r01_kit.tgz&lt;/a&gt;. Please note that there is minimal documentation. Read the README file and look at the sample code. (A note to owners of earlier AMD/ATI widgets -- the OpenCL SDK support is spotty, at best, for older cards -- ymmv.)&lt;br/&gt;Let me know how it all works for you. 
I’ll be using SOCL for lots of little experiments and hope to write up a few of them. &lt;br/&gt;Happy coding. &lt;br/&gt;&lt;br/&gt;&amp;lt;&amp;lt; Addendum: For more information, take a look at the &lt;a href=&quot;../../tools/OpenCL.html&quot;&gt;documentation page&lt;/a&gt;, a work in progress. &gt;&gt; &lt;br/&gt;&lt;br/&gt;UPDATE!  Please note the new version of the kit here: &lt;a href=&quot;Entries/2010/2/22_Small_Brick,_Big_%E2%80%98N%E2%80%99_files/socl_v0r01_kit.tgz&quot;&gt;socl_v0r01_kit.tgz&lt;/a&gt; -- the v0r0 edition of the tarball had some code that was missing return values and such.  There’s nothing like building a chunk of software on a new platform to flush out some problems.  Let me know if you find issues when you try to build the kit. -- </description>
      <enclosure url="http://www.bigncomputing.org/Big_N_Computing/Big_N_Computing/Entries/2010/2/22_Small_Brick,_Big_%E2%80%98N%E2%80%99_files/JohnMucci-LG.jpg" length="14226" type="image/jpeg"/>
    </item>
    <item>
      <title>SC09 -- Postscript</title>
      <link>http://www.bigncomputing.org/Big_N_Computing/Big_N_Computing/Entries/2009/11/20_SC09_-_Postscript.html</link>
      <guid isPermaLink="false">db245cb0-c46b-4231-86d3-28802db5a57f</guid>
      <pubDate>Fri, 20 Nov 2009 08:01:22 -0500</pubDate>
      <description>&lt;a href=&quot;http://www.bigncomputing.org/Big_N_Computing/Big_N_Computing/Entries/2009/11/20_SC09_-_Postscript_files/droppedImage.jpg&quot;&gt;&lt;img src=&quot;http://www.bigncomputing.org/Big_N_Computing/Big_N_Computing/Media/object039.jpg&quot; style=&quot;float:left; padding-right:10px; padding-bottom:10px; width:254px; height:201px;&quot;/&gt;&lt;/a&gt;Coming home from a SuperComputing conference is always something of a relief. (Though the journey home, for me, has almost always included the realization that I’ve caught a cold in the SC conference floor incubator: this year was no exception.)&lt;br/&gt;The big take-away from SC09 might have been that there was no big take-away from SC09.  The user community was well represented on the exhibit floor. No vendor unveiled any breakthrough technology (despite what their breathless marketing departments might have said). I didn’t win an iPod, or a car, or a pony. &lt;br/&gt;The continued march of the heterogeneous vanguard isn’t news, unless you expected the heterogeneity movement to collapse in the face of improved memory subsystems in Intel’s Xeon products. It is notable however, that there is little evidence of a scalable programming model for heterogeneous systems that can be used by any but the most heroic programmers. &lt;br/&gt;Intel has clearly taken back the dominant position over AMD in the technical computing world: the Nehalem generation of Xeon processors really is a substantial improvement over the previous lackluster Intel products. AMD is preparing an answer, but Intel is alone right now in providing a quad x86 with more than 15GB/s of stream triad bandwidth. Intel’s advantage was clearly represented on the exhibit floor. But even the most casual observer of the market has known that for several months.&lt;br/&gt;If there was any evidence of Sony/IBM Cell products on the floor, I didn’t see it. 
Not news, however, as the Cell processor’s eventual phase-out has been rumored and expected for some time. &lt;br/&gt;Did anyone else notice that there was much more attention being paid to IO among the vendors? This  may be the real news story for SC09: IO solutions are here, and not just for the lunatic fringe. SSD (solid state disk) storage has become the economical alternative for applications that need read/write bandwidth more than storage capacity, and vendors are beginning to field useful and affordable products. &lt;br/&gt;No blinding insights, but there it is. &lt;br/&gt;-----------&lt;br/&gt;&lt;br/&gt;Overheard in the hallways: &lt;br/&gt;One startup employee to another about second round stock valuations: “Flat is the new ‘Up’.”&lt;br/&gt;Security guard to attendee at the Poster Session reception: “Sir, the buffet is for people with ‘T badges’ only.” Attendee eating a corn chip: “Do I need to return this?”&lt;br/&gt;</description>
      <enclosure url="http://www.bigncomputing.org/Big_N_Computing/Big_N_Computing/Entries/2009/11/20_SC09_-_Postscript_files/droppedImage.jpg" length="39469" type="image/jpeg"/>
    </item>
    <item>
      <title>SC09 -- Tuesday</title>
      <link>http://www.bigncomputing.org/Big_N_Computing/Big_N_Computing/Entries/2009/11/16_SC09_-_Tuesday.html</link>
      <guid isPermaLink="false">628967b2-ebed-4944-a2dc-8c626e4330cc</guid>
      <pubDate>Mon, 16 Nov 2009 19:13:01 -0500</pubDate>
      <description>&lt;a href=&quot;http://www.bigncomputing.org/Big_N_Computing/Big_N_Computing/Entries/2009/11/16_SC09_-_Tuesday_files/Hiking_boots_on_sand.jpg&quot;&gt;&lt;img src=&quot;http://www.bigncomputing.org/Big_N_Computing/Big_N_Computing/Media/object040.jpg&quot; style=&quot;float:left; padding-right:10px; padding-bottom:10px; width:254px; height:217px;&quot;/&gt;&lt;/a&gt;If there is a key to success in wandering the exhibit hall at SC09 it is “wear comfortable shoes.” The exhibit hall seems larger than in previous years. Certainly the venue, downtown Portland, encourages walking. There is much to see and there are many to meet. &lt;br/&gt;I’m here this year to see what’s going on -- a bit of a change from recent years where I spent most of my time in sales presentations. Being an exhibitor was quite a lot of fun, and I’m finding that being an attendee has its moments as well. &lt;br/&gt;The buzz in the exhibit hall is all about GPGPUs. You can’t walk more than a dozen feet (4m) without bumping into a system provider, software vendor, graduate student, or component supplier who’s touting one GPGPU related feature or another. In previous years you could hear the steady drumbeat of the heterogeneous computing folks -- now it is a loud chorus of hosannas and a brass band whooping it up and proclaiming a new day. That’s good: the industry needs new approaches if we’re going to expand the reach and utility of high performance technical computing. &lt;br/&gt;But brass bands and angelic choirs do not a solution make. While the GPGPU concept offers a much better way of moving data from memory to ALUs (which, after all the hype is peeled away, is the real advantage of the GPU architecture) it leaves open the question of scaling. Current products offer 4 way GPGPU units connected to x86 hosts. Most often, parallelism beyond 4 units is achieved by linking their x86 hosts together in a cluster fabric via MPI code running on the host.  
This may suffice for much of the market, but is hardly the programming model that we’d want to see for assemblies of hundreds of GPGPUs. &lt;br/&gt;So I’ve been asking folks -- especially members of the band and the choir -- about scalable heterogeneous programming models. I’ve gotten a few answers, but see no dominant solution emerging. Responses have ranged from “I don’t care about parallelism -- I’m just happy that I can run my app on the graphics chip in my laptop” to “Yup, we looked around and decided to roll our own solution that allows every computing element to invoke a small set of communication primitives.” &lt;br/&gt;The answer is important most of all to the purveyors of GPGPU and other heterogeneous solutions. Heterogeneity is a response to inadequacies in the dominant PC-based cluster model. As much of the GPGPU advantage rises from careful management of “a plurality” of processor-DRAM pipelines, the x86 vendors are quite capable of reacting and improving their own competitive position. While the bigN story for homogeneous clusters is by no means complete or even adequate, it is far better developed than the heterogeneous bigN story. The x86 cadre is smart, resourceful, and will subsume any computing functions and features into the CPU if there is a profit in it. The next five years will tell the story: will heterogeneity take root, or will industry’s natural tendency toward processor hegemony draw the best features of GPGPUs into the CPU? &lt;br/&gt;If bigN heterogeneous computing is to take root, it is time for the energy and innovation that produced CUDA and the like to turn to the problem of bigN parallelism. On Wednesday (between meetings and meals) I’m going to see what I can ferret out in my quest for an accessible heterogeneous bigN computing model. &lt;br/&gt;I only wish I’d brought my boots. 
&lt;br/&gt;&lt;br/&gt;Photo Source: &lt;a href=&quot;http://commons.wikimedia.org/wiki/User:Florian_Prischl&quot;&gt;Florian Prischl&lt;/a&gt; &lt;a href=&quot;http://commons.wikimedia.org/wiki/File:Hiking_boots_on_sand.jpg&quot;&gt;http://commons.wikimedia.org/wiki/File:Hiking_boots_on_sand.jpg&lt;/a&gt;</description>
      <enclosure url="http://www.bigncomputing.org/Big_N_Computing/Big_N_Computing/Entries/2009/11/16_SC09_-_Tuesday_files/Hiking_boots_on_sand.jpg" length="103225" type="image/jpeg"/>
    </item>
  </channel>
</rss>

