Friday, April 30, 2010

The Ramp

I love stories about performance problems. Recently, my friend Debra Lilley sent me this one:
I went to see a very large publishing company about 6 months after they went live. I asked them what their biggest issue was, and they told me querying in GL was very slow, and I was able to fix quite easily. (There was a very simple concatenated index trick for the Chart of Accounts segments that people just never used.) Then I asked if there was anything else. The manager said no but the clerk who sat behind him said, “I have a problem.” His manager seemed embarrassed, but when I pressed him, the clerk continued, “Every day I throw away reams of paper from our invoice listing.”

I asked to look at the request, which ran a simple listing of all invoices entered at a scheduled time each day. I opened up the schedule screen and there was a tick box to “Increment date on each run.” This was not ticked, and they were running the report from day 1, every day. When they accepted the system at go live there was no issue. I think all system implementations should include a 3- or 6-month review. Regardless of how good the implementers are, their setup is based on the information known at the time. In production, that information (volumes, etc.) often changes, and when it does, it can affect your decisions.
My friends Connie Smith and Lloyd Williams call this performance antipattern The Ramp. With the ramp, processing duration increases as the system is used. This invoicing system exhibited ramp behavior, because every invoicing process execution would take just a little bit longer and print just a few more pages than the prior execution did.

The problem of the ramp reminds me of a joke I heard when I was young. A boy, one who is athletically very talented but not too bright, takes on a job as a stripe painter for the highway department. The department gives him bucket of paint and a brush and drives him out to the highway he’s supposed to paint. His first day on the job, he paints a stripe almost seven miles long. This is an utterly stunning feat, for no one previously had ever painted more than five miles in a day. The department was ecstatic. Apparently, this boy’s true calling was to paint roadways.

The excitement abated a little bit on the second day, when the boy painted only five miles of highway. But still, five miles is the best that anyone had ever done before him. But on the third day, the distance dropped to two miles, and on the fourth day, it fell to less than one mile.

The department managers were gravely concerned, especially after having been so excited on the first couple of days. So they had a driver go out to fetch the boy, to bring him back to the office to explain why his productivity had been so outstanding at first but had then declined so horribly.

The reason was easy to understand, the boy explained. Every day he painted, he kept getting farther and farther away from where he had set his paint bucket on the first day.

I’ve known people who’ve written linked list insertion algorithms this way. Joel Spolsky has written about string library functions in C that work this way. I’ve seen people write joins in SQL that work this way. And Debra’s publishing company ran their invoices this way.

When you have the ramp problem, individual response times increase linearly. ...Which is bad. But overall response time—through the history of using such an application—varies in proportion to the square of the number of items being processed. ...Which is super-duper bad.

Imagine, in the invoicing problem that Debra solved, that the system had been processing just one invoice per day and that each invoice is only one page long. Given that she was at a “very large publishing company,” it’s certain that the volume was greater than this, but for the sake of simplifying my argument, let’s assume that there was just one new invoice each day. Then, with the “Increment date on each run” box left unchecked, there would be one invoice to print on day 1, two on day 2, etc. On any day n, there would be n invoices to print.

Obviously, the response time on any given day n would thus be n times longer than it needed to be. At the end of the first year of operation with the new application, an invoice would take 365 times longer to print than on the first day of the year.

But the pain each day of invoice generation is not all there is to the problem. The original concern was expressed in terms of all the paper that was wasted. That paper waste is important, not just because of the environmental impact of unnecessary paper consumption, but also because of all the computing power expended over the operational history of the application required to generate those pages. That includes the resources (the electrical power, the CPU cycles, the memory, the disk and network I/Os, etc.) that could have been put to better use doing something else.

In the grossly over-simplified invoicing system I’ve asked you to imagine (which creates only one invoice per day), the total number of pages printed as of the end of day n is 1 + 2 + ... + n, which is n(n + 1)/2. All but n of those pages are unnecessary. Thus the total number of wasted pages that will have been printed by the end of day n is n(n + 1)/2 – n, which is n(n – 1)/2, or (n2n)/2. The number of invoices that should never been printed is proportional to the square of the number of days using the application.

To get a sense for what that means, think about this (remember, all these points refer to a grossly over-simplified system that creates only one invoice per day):
  • By the end of the first month, you'll have printed 465 pages when you only needed 30. That’s 435 unnecessary pages.
  • But by the end of the first year, you’ll have printed 66,795 pages instead of 365. That’s 66,430 unnecessary pages. It’s 27 unnecessary 2,500-page boxes of paper.
  • And by the end of the fifth year, you’ll have used 668 boxes of paper to print 1,668,508 pages instead of using just one box to print 1,826 pages. The picture below shows how tremendously wasteful this is.

When total effort varies as the square of something (like the number of items to process, or the number of days you’ve been using an application), it’s bad, bad news for efficiency. It means that every time your something doubles, your performance (time, materials consumption, etc.) will degrade by a factor of four. Every time your something increases by a factor of ten, your performance will degrade by a factor of a hundred. When your something increases a hundred fold, performance will degrade by a factor of 10,000.

Algorithm analysts characterize algorithms that behave this way as O(n2), pronounced “big-oh of n-squared.” O(n2) performance is no way to live. The good news is that you can usually break yourself out of a O(n2) regime. Sometimes, as Debra’s story illustrates, the solution isn’t even technical: she solved her client’s problem by using an option designed into the end-user interface.

No matter where the problem is—whether it’s problem with use, setup, implementation, design, or concept—it’s worth significant time and effort to find the O(n2) problems in your system and eliminate them. Whenever you need reassurance of that idea, just glance again at the image of the paper boxes shown here.

And by the way, do you remember my post about “Just go look at it?” Tally one for Debra, for the win.

Friday, March 5, 2010

On "Is a computer science degree a good goal?"

Dan Fink's "Is a computer science degree a good goal?" has gotten my wheels going. I think it's important to note this:
Computer Science ≠ Information Technology
Not only are these two disciplines not equal, neither is a subset of the other.

One of my most memorable culture shocks coming out of school into the Oracle domain was how many people didn't understand the difference between computer science, which is a specialized branch of mathematics, and information technology, which is a specialized branch of business administration. They both deal with computers (the IT major more than the CS one, actually), so of course there's risk that people will miss the distinction.

Over dinner Friday night with some of my friends from Percona, we touched on one of the problems. It's difficult for a technical major in school to explain even to his family and friends back home what he's studying. I remember saying once during my senior year as a math major, "I haven't seen a number bigger than 1 since I was a sophomore." I heard a new one tonight: "I got to the level where the only numbers in my math books were the page numbers."

It's difficult for people who don't study computer science to understand who you are or how the min/max kd-trees and deterministic finite automata and predicate calculus and closures that you're studying are different from the COBOL and SQL and MTBFs and ITIL that the IT majors are studying. It's easy to see why laypeople don't understand how these sets of topics arrange into distinctly different categories. What continually surprises me is how often even IT specialists don't understand the distinction. I guess even the computer science graduates soften that distinction when they take jobs doing tasks (to make a living, of course) that will be automated within ten years by other computer scientist graduates.

I agree with Dan and the comments from Tim, Robyn, Noons, Gary, and David about where the IT career path is ultimately headed in the general case. What I don't believe is that the only career path for computer scientists and mathematicians is IT. It's certainly not the only career path for the ones who can actually create things.

I believe that college (by which I mean "University" in the European sense) is a place where the most valuable skill you learn is how to learn, and that, no matter what your major, as long as you work hard and apply yourself to overcoming the difficult challenges, there will be things in this world for you to do to earn your way.

I really hope that the net effect of a depressed, broken, and downward-trending IT industry is not that it further discourages kids from engaging in math and computer science studies in school. But I don't want for so many of our kids today who'll be our adults of tomorrow to become just compartmentalized, highly specialized robots with devastatingly good skills at things that nobody's really willing to pay good money for. I think that the successful human of the future will need to be able to invent, design, create, empathize, teach, see (really see), listen (not just hear), learn, adapt, and solve.

...Just exactly like the successful human of the past.

Wednesday, March 3, 2010

RobB's Question about M/M/m

Today, user RobB left a comment on my recent blog post, asking this:
I have some doubts on how valid an M/M/m model is in a typical database performance scenario. Taking the example from the wait chapter of your book where you have a 0.49 second (service_time) query that you want to perform in less than a second, 95% percent of the time. The most important point here is the assumption of an exponential distribution for service_time immediately states that about 13% of the queries will take more than 2X(Average Service Time), and going the other way most likely to take 0 seconds. From just this assumption only, it is immediately clear that it is impossible to meet the design criteria without looking at anything else. From your article and link to the Kendall notation, wouldn’t an M/D/m model be more appropriate when looking at something like SQL query response time?? Something like M/M/m seems more suited to queueing at the supermarket, for example, and probably many other ‘human interactive’ scenarios compared to a single sub-component of an IT system.
Here’s my answer, part 1.

RobB,

First, I believe your observation about the book example is correct. It is correct that if service times are exponentially distributed, then about 13% (13.5335%, more precisely) of those times will be 2S, where S is the mean service time. So in the problem I stated, it would be impossible to achieve subsecond response time in more than about 86% of executions, even if there were no competing workload at all. You’re right: you don't need a complicated model to figure that out. You could get that straight from the CDF of the exponential distribution.

However, I think the end of the example provides significant value, where it demonstrates how to use the M/M/m model to prove that you're not going to be able to meet your design criteria unless you can work the value of S down to .103 seconds or less (Optimizing Oracle Performance Fig 9-26, p277). I’ve seen lots of people argue, “You need to tune that task,” but until M/M/m, I had never seen anyone be able to say what the necessary service time goal was, which of course varies as a function of the anticipated arrival rate. A numerical goal is what you need when you have developers who want to know when they’re finished writing code.

With regard to whether real-life service times are really exponentially distributed, you’ve got me wondering now, myself. If service times are exponentially distributed, then for any mean service time S, there’s a 9.5% probability that a randomly selected service time will be less than .1S (in Mathematica, CDF[ExponentialDistribution[1/s],.1s] is 0.0951626 if 0.1s > 0). I’ve got to admit that at the moment, I’m baffled as to how this kind of distribution would model any real-life service process, human, IT, or otherwise.

On its face, it seems like a distribution that prohibits service times smaller than a certain minimum value would be a better model (or perhaps, as you suggest, even fixed service times, as in M/D/m). I think I’m missing something right now that I used to know, because I remember thinking about this previously.

I have two anecdotal pieces of evidence to consider.

One, nowhere in my library of books dedicated to the application of queueing theory to modeling computer software performance (that’s more than 6,000 pages, over 14 inches of material) does Kleinrock, Allen, Jain, Gunther, Menascé, et al mention an M/D/m queueing system. That’s no proof that M/D/m is not the right answer, but it’s information that implies that an awful lot of thinking has gone into the application of queueing theory to software applications without anyone deciding that M/D/m is important enough to write about.

Two, I’ve used M/M/m before in modeling a trading system for a huge investment management company. The response time predictions that M/M/m produced were spectacularly accurate. We did macro-level testing only, comparing response times predicted by M/M/m to actual response times measured by Tuxedo. We didn’t check to see whether service times were exponentially distributed, because the model results were consistently within 5% of perfect accuracy.

Neither of these is proof, of course, that M/M/m is superior in routine applicability to M/D/m. One question I want to answer is whether an M/D/m system would provide better or worse performance than a similar M/M/m system. My intuition is leaning in favor of believing that the M/M/m system would give better performance. If that’s true, then M/M/m is an optimistic model compared to M/D/m, which means that if a real-life system is M/D/m and an M/M/m model says it’s not going to meet requirements, then it assuredly won’t.

I did find a paper online by G. J. Franx about M/D/m queueing. Maybe that paper contains an R=f(λ,μ) function that I can use to model an M/D/m system, which would enable me to do the comparison. I’ll look into it.

Then there’s the issue of whether M/M/m or M/D/m is a more appropriate model for a given real circumstance. The answer to that is simple: test your service times to see if they’re exponentially distributed. The Perl code in Optimizing Oracle Performance, pages 248–254 will do that for you.

Monday, February 22, 2010

Thinking Clearly About Performance, revised to include Skew

I’ve just updated the “Thinking Clearly” paper to include an absolutely vital section that was, regrettably, missing from the first revision. It’s a section on the subject of skew.

I hope you enjoy.

Wednesday, February 10, 2010

Thinking Clearly About Performance

I’ve posted a new paper at method-r.com called “Thinking Clearly About Performance.” It’s a topic I’ll be presenting this year at:
The paper is only 13 pages long, and I think you’ll be pleased with its information density. Here is the table of contents:
  1. An Axiomatic Approach
  2. What is Performance?
  3. Response Time vs Throughput
  4. Percentile Specifications
  5. Problem Diagnosis
  6. The Sequence Diagram
  7. The Profile
  8. Amdahl’s Law
  9. Minimizing Risk
  10. Load
  11. Queueing Delay
  12. The Knee
  13. Relevance of the Knee
  14. Capacity Planning
  15. Random Arrivals
  16. Coherency Delay
  17. Performance Testing
  18. Measuring
  19. Performance is a Feature
As usual, I learned a lot writing it. I hope you’ll find it to be a useful distillation of how performance works.

Tuesday, December 22, 2009

The “Do What You Love” Mirage

I am inspired by having read an article called “Do what you love mirage” by Denis Basaric. It begins...
“Do what you love” is advice I hear exclusively from financially secure people. And it rings hollow to me. When you need money to survive, you do any work that is available, love does not play into that choice. Desperation does.
Please read it before you go on.

Welcome back.

This article puts a very important cycle within my life into words. I believe, as Denis says, that a lot of times, we get the cause-effect relationship mixed up when we think about loving what we do.

I love what I do. Well, a lot of it. But Denis is right: I didn’t choose what I do out of love. I chose what I love out of doing. Some examples:
  • I love mathematics. But I most assuredly did not always love it. I learned to love it through working hard at it.
  • It’s the same thing with writing. I love it, but I didn’t always. At first, writing was unrewarding drudgery, which is how most people I meet seem to feel about it.
  • I love public speaking, but I sure didn’t love it when my speech class made me sick to my stomach three mornings a week for a whole semester my freshman year.
  • I love being an Oracle performance specialist, but I sure didn’t love being airlifted into crisis after crisis throughout the early 1990s.
I could go on. The point is, my life would be unrecognizably different if not for several really painful situations that I decided to endure with the resolve to get really good at what I hated. Until I loved it.

In retrospect, I seem to have been very lucky in many important situations. Of course, I have. But you make your own luck. Although I believe deeply in the idea of, “The harder I work, the luckier I get,” that is not what I’m talking about here. I’m talking about the power that you have to define for yourself whether something that happened was lucky for you or not. Your situations do not define your life. You create your life based on how you regard your situations.

I could have rebelled against Jimmy Harkey and hated math for the rest of my life. Lots of kids did. I could have rebelled against Lewis Parkhill and never become a writer. I could have refused Craig Newberger’s advice to take his second speech course and never become comfortable in front of an audience. I could have left Oracle in 1991 and found a job where they had more mature products....

One of the most important questions that I ever asked my wife before our engagement was this:
If you were forced to wash cars for 12 hours a day, just to make a living, could you enjoy it?
This is a “soulmate” kind of question for me. My wife’s attitude about it is, for our children and me, possibly the most valuable gift in our lives.

Loving what you do can be difficult. I think Denis hits the nail on the head by suggesting that,
By doing good work, you just might find out that what you are doing, is what you are supposed to do. And if you don’t, quality work will get you to where you want to be.
I hope you will find love in what you do today. Do it well, and it’ll definitely improve your odds.

Monday, December 21, 2009

My Whole System Is Slow. Now What?

At CMG'09 a couple of weeks ago, I presented "Measuring Response Times of Code on Oracle Systems." The paper for this presentation was a subset of "For Developers: Making Friends with the Oracle Database." In the presentation, I spent a few minutes talking about why to measure response times in Oracle, and then I spent a lot of minutes talking about how. As usual, I focused heavily on the importance of measuring response times of individual business tasks executed by individual end users.

At the end of the talk, a group of people came to the podium to ask questions (always a good sign). The first question was the question that a lot of people ask. It was:
My whole system is slow. That's all my users will tell me. So then, how do I begin to do what you're describing?
Here's the answer:
Ask your users to show you what they're doing. Just go look at it.
The results of this simple advice are routinely spectacular. Just go look at it: I'm surprised whenever someone doesn't think of doing that, but I shouldn't be. That's because I didn't do it either, for the longest time. I had to learn to. And that's the story I want to tell you here.

In the early 1990s, I was a consultant with Oracle Corporation visiting clients with performance problems at a pace of more than 30 per year. Back then, I did Oracle performance work the old fashioned way: I checked everything I knew how to check, and then I fixed everything I knew how to fix. All billable by the hour. (Note: When I was doing it this way, I had not yet been taught by Dave Ensor, who changed me forever.)

On weeks when I was lucky, I'd be finished checking and fixing by sometime Wednesday, leaving a couple of days to find out what people thought of my work. If I were lucky again (that's two "lucky"s now), everyone would be thrilled with the results. I'd get my hug (so to speak), and I'd catch my flight.

But I wasn't always lucky. Some weeks, I wouldn't find anything suspicious in my checking and fixing. Some weeks, I'd find plenty, but still not everyone would be thrilled with the work. Having people be less than thrilled with my work caused pain for me, which motivated me to figure out how to take more control of my consulting engagements, to drive luck out of the equation.

The most important thing I figured out was...
People knew before I came on-site how they were going to measure on Thursday whether they liked the results of my work.
And...
They were willing to tell me on Monday.
All I had to do was be honest, like this:
On the day I'm done working here, I'd imagine you're going to want to run something that will demonstrate whether I accomplished what you were hoping for while I was here. Would you mind telling me about that now? Maybe even showing me?
I could ask that on Monday, and people were glad to tell me. I'd watch the things run and record how long they ran, and then I'd know how to prioritize my time on site. I'd record how long they ran so at the end of my engagement, I'd be able to show very clearly what improvements I had made.

Sometimes, there would be thirty different things that people would expect to measure on Thursday. If I might not have time to fix them all, then I needed to make sure that I knew the priority of the things I was being asked to fix.

That one step alone—knowing on Monday that prioritized list of what tasks needed to be fast by Thursday—drastically reduced my reliance on luck as a success factor in my job at these sites. Knowing that list on Monday is just like when your teacher in school tells you exactly what's going to be on your next test. It allows you to focus your attention on exactly what you need to do to optimize your reward for the week. (Note to fellow education enthusiasts: Please don't interpret this paragraph as my advocating the idea that GPA should be a student's sole—or even dominant—optimization constraint.)

So, what I learned is that the very first step of any good performance optimization method is necessarily this:
1. Identify the task that's the most important to you.
When I say "task," think "program" or "click" or "batch job" if you want to. What I mean is "a useful unit of work that makes sense to the business." ...Something that a business user would show you if you just went and watched her work for a few minutes.

Then comes step two:
2. Measure its response time (R). In detail.
Why is response time so important? Because that's what's important to the person who'll be watching it run on Thursday, assessing whether she thinks you've done a good job or not. That person's going to click and then wait. Happiness will be inversely proportional to how long the wait is. That's it. That's what "performance" means at 99% of sites I've ever visited.

(If you're interested in the other 1% of sites I've visited, they're interested in throughput, which I've written about in another blog post.)

Measuring response time is vital. You must be able to measure response time if you're going to nail that test on Thursday.

The key is to understand that the term response time doesn't even have a definition except in the context of a task. You can't measure response time if you don't first decide what task you're going to measure. In other words, you cannot do step 2 before you do step 1. With Oracle, for example, you can collect ASH data (if you're licensed to use it) or even trace data for a whole bunch of Oracle processes, but you won't have a single response time until you define which tasks buried within that data are the ones you want to extract and pay attention to.

You get that by visiting a user and watching what she does.

There are lots of excuses for not watching your users. Like these...
  • "I don't know my users." I know. But you should. You'd do your job better if you did. And your users would, too.
  • "My users aren't here." I know. They're on the web. They're in Chicago and Singapore and Istanbul, buying plane tickets or baseball caps or stock shares. But if you can't watch at least a simulation of the things those users actually do with the system you help manage, then I can't imagine how you would possibly succeed at providing good performance to them.
  • "I'm supposed to be able to manage performance with my dashboard." I know. I was supposed to have a hover car by the year 2000.
The longer you stay mired in excuses like these, the longer it's going to be before you can get the benefit of my point here. Your users are running something, and whatever that is that they're running is your version of my Thursday test. You can check and fix all you want, but unless you get lucky and fix the exact tooth that's hurting, your efforts aren't going to be perceived as "helpful." Checking and fixing everything you can think of is far less efficient and effective than targeting exactly what your user needs you to target.

Lots of performance analysts (DBAs, developers, architects, sysadmins, and so on) assume that when someone says, "The whole system is slow," it means there must be a single parameter somewhere in the bowels of the system that needs adjustment, and if you can just make that adjustment, everything is going to be ok. It might mean that, but in my experience, the overwhelming majority of cases are not that way. (Pages 25–29 of Optimizing Oracle Performance has more information about this.)

The great thing about measuring response time is that no matter what the problem is, you'll see it. If the program you're watching is poorly written, you'll see it. If some other program is hogging too much of a resource that your program needs, you'll see it. If you have a bad disk controller, you'll see it. If some parameter needs adjusting, you'll see it.

Realize that when a business user says "system," she doesn't mean what you would mean if you said "system." She means that the thing she runs is slow. Look at that thing. Maybe there are seventeen of them. And sure, maybe all seventeen suffer from the same root cause. If that's the case, then fantastic, because fixing the first problem will magically fix the other sixteen, too. If it's not, then fantastic anyway, because now all of them are on your prioritized list of tasks to optimize, and you'll probably surprise yourself how quickly you'll be able to pick them all off when you focus on one task at a time.