Bragging

By cdwan, on August 14th, 2007

There’s a core piece of advice that I give to developers on compute clusters … which is to “stage” data motion as much as possible. This requires that the code take the architecture of the cluster into account, but it frequently gives large improvements in both performance and stability.

Case in Point:

I was working with a group who own both a compute cluster and a big SAN. Their workhorse code does the following:

* Start a single thread.
* Read some input state and data.
* Open a file for writing
* Trickle results (rows of tab delimited integers, a few K at a time) into that file over about 12 hours.
* Close the file.

They have an infinite stack of input data to process.

They found that the more concurrent jobs they started, the longer the jobs took and (more disturbingly) that if they kicked off enough jobs at the same time, there would be the occasional bogus row in their output. The problem wasn’t confined to any particular machine in the cluster (we had seen it on each machine at least once), nor was any particular piece of input data doomed to fail (we developed a test set that ran to completion on a single node).

I walked into a meeting with the developers and took something of a beating because “my SAN” and “my cluster” didn’t work right. Words like “shoddy” were bandied about.

I suggested that, rather than writing output data directly to the SAN … a single large shared filesystem … they trickle the output into a file on the local disk of each compute node and then *copy* it at the end.

Such resistance. Oh the horror, because:

* This was a “machine specific” change to the code (yep)
* This didn’t address the core issue that the SAN was “unreliable,” and couldn’t I just build something that works reliably? (nope, think of it as an operating point if it makes you feel better)
* It would require such *changes* to the code (nope, two conditionals, one up top and one at the bottom)
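For the curious, here is a minimal sketch of what those two conditionals might look like, written in Python. The group's actual code isn't shown in this post, so the function name `run_job`, the `scratch_dir` parameter, and the tab-delimited writer are all hypothetical stand-ins. The point is only the shape of the change: trickle to local disk, then do one bulk copy to shared storage.

```python
import os
import shutil
import tempfile


def run_job(rows, final_path, scratch_dir=None):
    """Write rows of integers to final_path, optionally staging on local disk.

    scratch_dir is a hypothetical node-local directory (say /tmp or /scratch).
    When it is None, the code behaves as before and writes straight to
    final_path on the shared filesystem.
    """
    # Conditional one, up top: decide where the slow trickle of writes lands.
    if scratch_dir is not None:
        fd, work_path = tempfile.mkstemp(dir=scratch_dir, suffix=".tsv")
        os.close(fd)
    else:
        work_path = final_path

    # The long-running part of the job is unchanged.
    with open(work_path, "w") as out:
        for row in rows:
            out.write("\t".join(str(v) for v in row) + "\n")

    # Conditional two, at the bottom: one bulk copy to shared storage,
    # then clean up the local scratch file.
    if work_path != final_path:
        shutil.copyfile(work_path, final_path)
        os.remove(work_path)
    return final_path
```

Calling `run_job(rows, "/san/results/out.tsv", scratch_dir="/tmp")` would stage on local disk (both paths are made up for the example). Leaving `scratch_dir` out keeps the old write-directly behavior, which is why the change can be switched on per machine.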

Eventually, after a fair bit of persuasion, one developer consented to *try* the modification I had suggested. The next day, I awoke to a jubilant email saying that when writing to the local disk there were ZERO errors, and – by the way – the code ran in 6 hours rather than 12.

A day after that, another developer quietly wrote in to say that the change yielded the same sort of improvements in his code as well.

Doubling performance *and* sidestepping a nasty, context-dependent debugging problem.

Score. They’ll never say “thank you,” but it feels good to be right.
