Thursday, April 29, 2010

The Ramp

I love stories about performance problems. Recently, my friend Debra Lilley sent me this one:
I went to see a very large publishing company about 6 months after they went live. I asked them what their biggest issue was, and they told me querying in GL was very slow, and I was able to fix quite easily. (There was a very simple concatenated index trick for the Chart of Accounts segments that people just never used.) Then I asked if there was anything else. The manager said no but the clerk who sat behind him said, “I have a problem.” His manager seemed embarrassed, but when I pressed him, the clerk continued, “Every day I throw away reams of paper from our invoice listing.”

I asked to look at the request, which ran a simple listing of all invoices entered at a scheduled time each day. I opened up the schedule screen and there was a tick box to “Increment date on each run.” This was not ticked, and they were running the report from day 1, every day. When they accepted the system at go live there was no issue. I think all system implementations should include a 3- or 6-month review. Regardless of how good the implementers are, their setup is based on the information known at the time. In production, that information (volumes, etc.) often changes, and when it does, it can affect your decisions.
My friends Connie Smith and Lloyd Williams call this performance antipattern The Ramp. With the ramp, processing duration increases as the system is used. This invoicing system exhibited ramp behavior, because every invoicing process execution would take just a little bit longer and print just a few more pages than the prior execution did.

The problem of the ramp reminds me of a joke I heard when I was young. A boy, one who is athletically very talented but not too bright, takes on a job as a stripe painter for the highway department. The department gives him bucket of paint and a brush and drives him out to the highway he’s supposed to paint. His first day on the job, he paints a stripe almost seven miles long. This is an utterly stunning feat, for no one previously had ever painted more than five miles in a day. The department was ecstatic. Apparently, this boy’s true calling was to paint roadways.

The excitement abated a little bit on the second day, when the boy painted only five miles of highway. But still, five miles is the best that anyone had ever done before him. But on the third day, the distance dropped to two miles, and on the fourth day, it fell to less than one mile.

The department managers were gravely concerned, especially after having been so excited on the first couple of days. So they had a driver go out to fetch the boy, to bring him back to the office to explain why his productivity had been so outstanding at first but had then declined so horribly.

The reason was easy to understand, the boy explained. Every day he painted, he kept getting farther and farther away from where he had set his paint bucket on the first day.

I’ve known people who’ve written linked list insertion algorithms this way. Joel Spolsky has written about string library functions in C that work this way. I’ve seen people write joins in SQL that work this way. And Debra’s publishing company ran their invoices this way.

When you have the ramp problem, individual response times increase linearly. ...Which is bad. But overall response time—through the history of using such an application—varies in proportion to the square of the number of items being processed. ...Which is super-duper bad.

Imagine, in the invoicing problem that Debra solved, that the system had been processing just one invoice per day and that each invoice is only one page long. Given that she was at a “very large publishing company,” it’s certain that the volume was greater than this, but for the sake of simplifying my argument, let’s assume that there was just one new invoice each day. Then, with the “Increment date on each run” box left unchecked, there would be one invoice to print on day 1, two on day 2, etc. On any day n, there would be n invoices to print.

Obviously, the response time on any given day n would thus be n times longer than it needed to be. At the end of the first year of operation with the new application, an invoice would take 365 times longer to print than on the first day of the year.

But the pain each day of invoice generation is not all there is to the problem. The original concern was expressed in terms of all the paper that was wasted. That paper waste is important, not just because of the environmental impact of unnecessary paper consumption, but also because of all the computing power expended over the operational history of the application required to generate those pages. That includes the resources (the electrical power, the CPU cycles, the memory, the disk and network I/Os, etc.) that could have been put to better use doing something else.

In the grossly over-simplified invoicing system I’ve asked you to imagine (which creates only one invoice per day), the total number of pages printed as of the end of day n is 1 + 2 + ... + n, which is n(n + 1)/2. All but n of those pages are unnecessary. Thus the total number of wasted pages that will have been printed by the end of day n is n(n + 1)/2 – n, which is n(n – 1)/2, or (n2n)/2. The number of invoices that should never been printed is proportional to the square of the number of days using the application.

To get a sense for what that means, think about this (remember, all these points refer to a grossly over-simplified system that creates only one invoice per day):
  • By the end of the first month, you'll have printed 465 pages when you only needed 30. That’s 435 unnecessary pages.
  • But by the end of the first year, you’ll have printed 66,795 pages instead of 365. That’s 66,430 unnecessary pages. It’s 27 unnecessary 2,500-page boxes of paper.
  • And by the end of the fifth year, you’ll have used 668 boxes of paper to print 1,668,508 pages instead of using just one box to print 1,826 pages. The picture below shows how tremendously wasteful this is.

When total effort varies as the square of something (like the number of items to process, or the number of days you’ve been using an application), it’s bad, bad news for efficiency. It means that every time your something doubles, your performance (time, materials consumption, etc.) will degrade by a factor of four. Every time your something increases by a factor of ten, your performance will degrade by a factor of a hundred. When your something increases a hundred fold, performance will degrade by a factor of 10,000.

Algorithm analysts characterize algorithms that behave this way as O(n2), pronounced “big-oh of n-squared.” O(n2) performance is no way to live. The good news is that you can usually break yourself out of a O(n2) regime. Sometimes, as Debra’s story illustrates, the solution isn’t even technical: she solved her client’s problem by using an option designed into the end-user interface.

No matter where the problem is—whether it’s problem with use, setup, implementation, design, or concept—it’s worth significant time and effort to find the O(n2) problems in your system and eliminate them. Whenever you need reassurance of that idea, just glance again at the image of the paper boxes shown here.

And by the way, do you remember my post about “Just go look at it?” Tally one for Debra, for the win.


Tony said...

Math is cool. This story would have been interesting without it, but it is just that much better with it. This is the real world stuff they should provide in schools.

Cary Millsap said...

Tony, I agree. ...And I want to help make that happen.

pudge said...

A thing of beauty.

Joel Garry said...

The ramps I see look more like a quarter pipe.

Around 1982 I was on a project where, of course, every manager wanted huge reports, sorted every possible way, because that made them important, even though they just would look at the last page. So I suggested we add page estimates to the report layouts submitted for approval. Since the controller was, well, an accountant, that solved the operations problem before coding.

In my current position, every few months people complain about a particular end-of-month process. It reads all the data into several in-memory structures, manipulates and verifies it, then reports on it. The vendor did it this way because, well, everyone knows that memory is faster than disk. Well, they didn't seem to think there would be a scaling problem as the various parts are repeatedly scanned linearly. The net effect is the run time is all over the place, depending on what else is running, how many transactions of what type, etc. The harder part is getting from management the desired requirement.

word: villa
word: rexpsive