Monday, December 21, 2009

My Whole System Is Slow. Now What?

At CMG'09 a couple of weeks ago, I presented "Measuring Response Times of Code on Oracle Systems." The paper for this presentation was a subset of "For Developers: Making Friends with the Oracle Database." In the presentation, I spent a few minutes talking about why to measure response times in Oracle, and then I spent a lot of minutes talking about how. As usual, I focused heavily on the importance of measuring response times of individual business tasks executed by individual end users.

At the end of the talk, a group of people came to the podium to ask questions (always a good sign). The first question was the question that a lot of people ask. It was:
My whole system is slow. That's all my users will tell me. So then, how do I begin to do what you're describing?
Here's the answer:
Ask your users to show you what they're doing. Just go look at it.
The results of this simple advice are routinely spectacular. Just go look at it: I'm surprised whenever someone doesn't think of doing that, but I shouldn't be. That's because I didn't do it either, for the longest time. I had to learn to. And that's the story I want to tell you here.

In the early 1990s, I was a consultant with Oracle Corporation visiting clients with performance problems at a pace of more than 30 per year. Back then, I did Oracle performance work the old-fashioned way: I checked everything I knew how to check, and then I fixed everything I knew how to fix. All billable by the hour. (Note: When I was doing it this way, I had not yet been taught by Dave Ensor, who changed me forever.)

On weeks when I was lucky, I'd be finished checking and fixing by sometime Wednesday, leaving a couple of days to find out what people thought of my work. If I were lucky again (that's two "lucky"s now), everyone would be thrilled with the results. I'd get my hug (so to speak), and I'd catch my flight.

But I wasn't always lucky. Some weeks, I wouldn't find anything suspicious in my checking and fixing. Some weeks, I'd find plenty, but still not everyone would be thrilled with the work. Having people be less than thrilled with my work caused pain for me, which motivated me to figure out how to take more control of my consulting engagements, to drive luck out of the equation.

The most important thing I figured out was...
People knew before I came on-site how they were going to measure on Thursday whether they liked the results of my work.
And...
They were willing to tell me on Monday.
All I had to do was be honest, like this:
On the day I'm done working here, I'd imagine you're going to want to run something that will demonstrate whether I accomplished what you were hoping for while I was here. Would you mind telling me about that now? Maybe even showing me?
I could ask that on Monday, and people were glad to tell me. I'd watch the things run and record how long they ran, and then I'd know how to prioritize my time on site. Those recorded times also meant that at the end of my engagement, I'd be able to show very clearly what improvements I had made.
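As a rough illustration of that before-and-after bookkeeping, here is a minimal Python sketch. The task names and durations are made up, and TaskTiming and improvement_report are hypothetical names chosen for this example; the only idea taken from the text is recording each task's duration at the start and comparing it with the duration at the end.

```python
from dataclasses import dataclass


@dataclass
class TaskTiming:
    """Before/after durations for one business task (hypothetical structure)."""
    name: str
    before_s: float  # duration observed on Monday, in seconds
    after_s: float   # duration observed at the end of the engagement


def improvement_report(timings):
    """Return (name, seconds saved, percent reduction) per task, biggest saving first."""
    rows = []
    for t in timings:
        saved = t.before_s - t.after_s
        pct = 100.0 * saved / t.before_s if t.before_s > 0 else 0.0
        rows.append((t.name, saved, pct))
    return sorted(rows, key=lambda r: r[1], reverse=True)


# Made-up numbers, purely for illustration.
timings = [
    TaskTiming("month-end close report", before_s=1800.0, after_s=240.0),
    TaskTiming("order entry: book order", before_s=12.0, after_s=1.5),
]
report = improvement_report(timings)
for name, saved, pct in report:
    print(f"{name}: saved {saved:.1f} s ({pct:.1f}% reduction)")
```

Sorting by seconds saved is just one possible choice; sorting by the business priority the users gave you on Monday would be equally reasonable.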

Sometimes, there would be thirty different things that people would expect to measure on Thursday. If I might not have time to fix them all, then I needed to make sure that I knew the priority of the things I was being asked to fix.

That one step alone—knowing on Monday that prioritized list of what tasks needed to be fast by Thursday—drastically reduced my reliance on luck as a success factor in my job at these sites. Knowing that list on Monday is just like when your teacher in school tells you exactly what's going to be on your next test. It allows you to focus your attention on exactly what you need to do to optimize your reward for the week. (Note to fellow education enthusiasts: Please don't interpret this paragraph as my advocating the idea that GPA should be a student's sole—or even dominant—optimization constraint.)

So, what I learned is that the very first step of any good performance optimization method is necessarily this:
1. Identify the task that's the most important to you.
When I say "task," think "program" or "click" or "batch job" if you want to. What I mean is "a useful unit of work that makes sense to the business." ...Something that a business user would show you if you just went and watched her work for a few minutes.

Then comes step two:
2. Measure its response time (R). In detail.
Why is response time so important? Because that's what's important to the person who'll be watching it run on Thursday, assessing whether she thinks you've done a good job or not. That person's going to click and then wait. Happiness will be inversely proportional to how long the wait is. That's it. That's what "performance" means at 99% of sites I've ever visited.
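To make "in detail" a little more concrete, here is a minimal Python sketch of timing one task and breaking its response time down by step. TaskProfile, the step names, and the sleep calls standing in for real work are all assumptions made for illustration; this is not how Oracle instruments anything, just one simple way to time your own code.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class TaskProfile:
    """Accumulates wall-clock time per named step of one task (hypothetical helper)."""

    def __init__(self, task_name):
        self.task_name = task_name
        self.step_seconds = defaultdict(float)

    @contextmanager
    def step(self, step_name):
        # Time the enclosed block and add it to the step's running total.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.step_seconds[step_name] += time.perf_counter() - start

    def response_time(self):
        # Sum of the measured steps only; time spent between steps is not counted.
        return sum(self.step_seconds.values())

    def breakdown(self):
        """Steps sorted by time consumed, largest first."""
        return sorted(self.step_seconds.items(), key=lambda kv: kv[1], reverse=True)


profile = TaskProfile("book order")
with profile.step("database call"):
    time.sleep(0.05)  # stand-in for real work
with profile.step("render page"):
    time.sleep(0.01)  # stand-in for real work

for step_name, seconds in profile.breakdown():
    print(f"{step_name}: {seconds:.3f} s")
print(f"R = {profile.response_time():.3f} s")
```

The breakdown is what tells you where to spend your effort: the step at the top of the list is the one contributing most to the wait the user experiences.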

(If you're interested in the other 1% of sites I've visited, they're interested in throughput, which I've written about in another blog post.)

Measuring response time is vital. You must be able to measure response time if you're going to nail that test on Thursday.

The key is to understand that the term response time doesn't even have a definition except in the context of a task. You can't measure response time if you don't first decide what task you're going to measure. In other words, you cannot do step 2 before you do step 1. With Oracle, for example, you can collect ASH data (if you're licensed to use it) or even trace data for a whole bunch of Oracle processes, but you won't have a single response time until you define which tasks buried within that data are the ones you want to extract and pay attention to.
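Here is a small Python sketch of why the task has to come first. The records below are a hypothetical, heavily simplified stand-in for timing data; real ASH or trace data looks nothing like this, and the execution IDs and task names are invented. The point is only that a response time appears once you pick a task and group the timings by execution of that task.

```python
from collections import defaultdict

# Hypothetical, simplified timing records: (execution_id, task_name, elapsed_seconds).
# Real ASH or trace data is far richer; only the grouping idea matters here.
records = [
    ("exec-1", "book order", 0.8),
    ("exec-2", "print invoice", 2.0),
    ("exec-1", "book order", 1.7),
    ("exec-3", "book order", 0.4),
    ("exec-2", "print invoice", 3.5),
]


def response_times(records, task_name):
    """Response time per execution of one named task, in seconds."""
    totals = defaultdict(float)
    for execution_id, name, elapsed in records:
        if name == task_name:
            totals[execution_id] += elapsed
    return dict(totals)


book_order = response_times(records, "book order")
for execution_id, seconds in sorted(book_order.items()):
    print(f"{execution_id}: R = {seconds:.1f} s")
```

Without the task_name argument, the function has nothing to group by, which is the code-level version of not being able to do step 2 before step 1.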

You get that by visiting a user and watching what she does.

There are lots of excuses for not watching your users. Like these...
  • "I don't know my users." I know. But you should. You'd do your job better if you did. And your users would, too.
  • "My users aren't here." I know. They're on the web. They're in Chicago and Singapore and Istanbul, buying plane tickets or baseball caps or stock shares. But if you can't watch at least a simulation of the things those users actually do with the system you help manage, then I can't imagine how you would possibly succeed at providing good performance to them.
  • "I'm supposed to be able to manage performance with my dashboard." I know. I was supposed to have a hover car by the year 2000.
The longer you stay mired in excuses like these, the longer it's going to be before you can get the benefit of my point here. Your users are running something, and whatever that is that they're running is your version of my Thursday test. You can check and fix all you want, but unless you get lucky and fix the exact tooth that's hurting, your efforts aren't going to be perceived as "helpful." Checking and fixing everything you can think of is far less efficient and effective than targeting exactly what your user needs you to target.

Lots of performance analysts (DBAs, developers, architects, sysadmins, and so on) assume that when someone says, "The whole system is slow," it means there must be a single parameter somewhere in the bowels of the system that needs adjustment, and if you can just make that adjustment, everything is going to be ok. It might mean that, but in my experience, the overwhelming majority of cases are not that way. (Pages 25–29 of Optimizing Oracle Performance have more information about this.)

The great thing about measuring response time is that no matter what the problem is, you'll see it. If the program you're watching is poorly written, you'll see it. If some other program is hogging too much of a resource that your program needs, you'll see it. If you have a bad disk controller, you'll see it. If some parameter needs adjusting, you'll see it.

Realize that when a business user says "system," she doesn't mean what you would mean if you said "system." She means that the thing she runs is slow. Look at that thing. Maybe there are seventeen of them. And sure, maybe all seventeen suffer from the same root cause. If that's the case, then fantastic, because fixing the first problem will magically fix the other sixteen, too. If it's not, then fantastic anyway, because now all of them are on your prioritized list of tasks to optimize, and you'll probably surprise yourself how quickly you'll be able to pick them all off when you focus on one task at a time.

12 comments:

Ross said...

Thanks for the great article. I don't work with Oracle but I'm still going to take your ideas in my own business and apply them for my clients. I'm sure my clients will appreciate your ideas as I apply them to match or exceed their expectations when Thursday comes around.

Noons said...

"Lots of performance analysts (DBAs, developers, architects, sysadmins, and so on) assume that when someone says, "The whole system is slow," it means there must be a single parameter somewhere in the bowels of the system that needs adjustment"

Correction, Cary. Most DBAs, architects, sysadmins and in general anyone with a bit of nous in IT know that is precisely NOT the case.

This "single magic parameter" is invariably the characteristic of the pretenders who think they know a lot about IT but have never seen a command line in their sorry lives.

Rarely of the ones who already are in the thick of it.

Typically: project managers and other managerial types.

Let's call a spade a spade, OK?
Enough blaming techos for the faults that are not their own!

Cary Millsap said...

Thank you, Ross.

Nuno, I take your point, and I wish I agreed that the belief is restricted to the subset of people that you describe. I hope not to offend the people I'm trying to reach, but I have met so many technical people—hundreds—who have been trained well to believe the silver bullet theory (though most people who believe in it would never call it that) that I have to stand by my "lots" quantifier.

Cary Millsap said...

Nuno, I don't want to edit the post and lose the context of your comment. But the more I think about your point, the more I want to change my words.

Were I to do it again, I would write, "Lots of people (technical and non-technical alike) assume that when someone says, 'The whole system is slow,' it means..."

—Cary

Marcin Przepiorowski said...

Hi,

I have to support Cary. It's very rare to see system architects or developers thinking about performance at an early stage, because in their test/dev environment the data volume is small and everything works well. And when the solution is deployed to production (with a huge volume of data), I have very often heard that the system is running slow because there is not enough CPU/RAM/IO and that it will work well on a better machine. For me, this is the "silver bullet" approach.
I think this is part of the old "war" between admins and the rest of the IT world. Admins typically know that there is no single solution for performance problems.
When I first read about the Oracle Wait Interface in 2002, I presented that approach at the Polish OUG, and the general reaction was: we have a good cache hit ratio, so our performance is OK. There were very few people who said that we have a problem because our customer had to wait too long.
Of course I don't want to blame everybody, and I see that more people are now using a better approach, but there are still some old myths, and I hope that people like Cary will keep fighting them.

regards,
Marcin

Noons said...

Look: my comments are not important.

What is important is that the message gets across that the "silver bullet" thing must stop, wherever it comes from.

In my experience here, I see it mostly in the management and project damagers.

Heck, only in the last week I've been asked to "install Oracle" for some application someone has purchased.

Simple details like which hardware, which OS, what version, what release, what license, how much CPU/memory/disk, is the license paid, do we need a dev/test/prod setup, what is the HA requirement, does it need a DR presence, all that was glossed over by what is supposed to be an IT-savvy PM.

When I asked the obvious, simple questions, I was told "I was not being helpful".

For as long as folks like these are encouraged to join and work in IT, there is no hope for this industry.

But yes: there is also a number of "techos" that use the same approach you describe.

Nowadays I've noticed them being less and less. But it still happens...

uhesse said...

Although we are maybe inclined to think that the focus on "Response Time" should be common sense and practiced in general - it ain't so in reality often.
So I think your posting is really helpful to spread and to repeat this knowledge, especially because of your very high reputation inside of the Oracle Community.

Kind regards
Uwe

Dan Norris said...

Something else that fits this same model: the job you have right now. Many companies/managers give performance appraisals (or whatever they're called in your organization). If you don't know the criteria on which you will be evaluated, how can you ever expect to get a good review? Your job has specific requirements and if you meet or exceed those, that's typically cause for a favorable review. As you say, Cary, it all starts with knowing your target.

Great post and I particularly like the application of examples and the inclusion of real-life excuses and scenarios. You've obviously seen this stuff once or twice. :)

Debra Lilley said...

Not being technical, I have a different view on the silver bullet theory. In my experience, a service or project manager calls in the 'expert' to fix the problem, the 'expert' looks at the system and identifies the first obvious thing that needs fixing, and there is an immediate improvement. A good 'expert' says it won't be the only issue, but the purse-holding manager says that is enough, go home. So my point is that the silver bullet theory is even more relevant to the non-technical stakeholders.

Kerry Osborne said...

Very good post. A couple of thoughts.

1. Most technical people seem to prefer talking to computers over talking to other people. I know I do. I think there are several reasons for this, not the least of which is that computers always do what you tell them to, people ..., not so much. Seeking out people that are not even in the IT group (end users) takes a lot more effort than just sitting at a computer typing away. Which is why we so often solve the wrong problems.

2. Many organizations put up big walls between real users and IT folks. This seems to be the rule, rather than the exception. Oftentimes the IT folks are not even in the same physical location as the users. This makes it even more difficult to communicate and see what is going on from their perspective.

3. There is also a similar situation where someone says they are having a problem and they would like some help, but they don't actually want you to log on to a system and see what it's doing. This is often blamed on security issues, but more often than not (in my experience) it is due to a controlling personality that doesn't really want someone to touch "their" system. This can be very frustrating and unless it can be resolved, severely limits the possibility of getting a good grade on the "test". It's a little like going to the doctor and saying that you're sick, and you think it's your kidney. And when the doctor asks to take a look, you say "No, I think I'll just describe all the things that I think are relevant and then you can tell me what you think I should do." The doctor will probably make you sign a waiver saying you won't sue before he gives you any advice.

Keep up the good work.

Kerry

Michael Fontana said...

My take on why the "silver bullet" theory persists:

This might surprise you, but I think your fine work reinforces it. By the time we consultants are brought in to fix such problems, it has usually been communicated to upper management in such a way that they are willing to pay to get it done.

From a management perspective, the middle-level managers who sold our services to the executives who have to pay for it sold it as a single unit of work (call it the "consulting silver bullet" if you will), and when it's fixed, the details and the many steps involved are completely lost.

I might go so far as to say that the manager who put you in this position got where he was by recognizing he needed help and verbally encapsulated it to his executives in such a manner. I would go so far as to say the executives don't really want to know.

Many times, consultants get where they are because out of the box thinking is required; they might even be especially good at it. So they come into a shop where no one has actually bothered to measure and/or quantify bad performance, and may not even be in a position, politically, where they can communicate with the people needed to do so. Even if they do, they may take risks (even though it would be ultimately rewarding).

Even worse, by the time the consultant arrives, several expensive steps (like buying expensive new hardware, or even making extensive application changes) have already been attempted and failed.

Sometimes you have to tread delicately or watch what words you use to describe how or what was fixed. You may not want to offend someone who would view it as criticism. A consultant might even be tempted to say he simply adjusted a dial or parameter, as it feeds the beastly legend.

Good post, Cary!

Cary Millsap said...

Thanks Michael. I think you've got it.