
Friday, February 27, 2015

What happened to “when the application is fast enough to meet users’ requirements?”

On January 5, I received an email called “Video” from my friend and former employee Guðmundur Jósepsson from Iceland. His friends call him Gummi (rhymes with “do me”). Gummi is the guy whose name is set in the ridiculous monospace font on page xxiv of Optimizing Oracle Performance, apparently because O’Reilly’s Linotype Birka font didn’t have the letter eth (ð) in it. Gummi once modestly teased me that this is what he is best known for. But I digress...

His email looked like this:

[Screenshot from Gummi’s email]

It’s a screen shot of frame 3:12 from my November 2014 video called “Why you need a profiler for Oracle.” At frame 3:12, I am answering the question of how you can know when you’re finished optimizing a given application function. Gummi’s question is, «Oi! What happened to “when the application is fast enough to meet users’ requirements?”»

Gummi noticed (the good ones will do that) that the video says something different from what he had heard me say for years. It’s a fair question. Why, in the video, have I said this new thing? It was not an accident.

When are you finished optimizing?

The question in focus is, “When are you finished optimizing?” Since 2003, I have actually used three different answers:
When are you finished optimizing?
  A. When the cost of call reduction and latency reduction exceeds the cost of the performance you’re getting today.
    Source: Optimizing Oracle Performance (2003) pages 302–304.
  B. When the application is fast enough to meet your users’ requirements.
    Source: I have taught this in various courses, conferences, and consulting calls since 1999 or so.
  C. When there are no unnecessary calls, and the calls that remain run at hardware speed.
    Source: “Why you need a profiler for Oracle” (2014) frames 2:51–3:20.
My motive behind answers A and B was the idea that optimizing beyond what your business needs can be wasteful. I created these answers to deter people from misdirecting time and money toward perfecting something when those resources might be better invested improving something else. This idea was important, and it still is.

So, then, where did C come from? I’ll begin with a picture. The following figure allows you to plot the response time for a single application function, whatever “given function” you’re looking at. You could draw a similar figure for every application function on your system (although I wouldn’t suggest it).

[Figure: a response time axis for a given function; RT, the function’s top speed, sits at the bottom of the blue zone]

Somewhere on this response time axis for your given function is the function’s actual response time. I haven’t marked that response time’s location specifically, but I know it’s in the blue zone, because at the bottom of the blue zone is the special response time RT. This value RT is the function’s top speed on the hardware you own today. Your function can’t go faster than this without upgrading something.

It so happens that this top speed is the speed at which your function will run if and only if (i) it contains no unnecessary calls and (ii) the calls that remain run at hardware speed. ...Which, of course, is the idea behind this new answer C.
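
To make that idea concrete, here is a minimal sketch of the arithmetic, with invented call counts and latencies (nothing here comes from a real system): the top speed RT is what remains when every unnecessary call is gone and each remaining call costs only its hardware-speed latency.

```python
# Hypothetical profile of one application function. Each row is:
# (call name, calls observed, calls actually necessary,
#  observed seconds per call, hardware-speed seconds per call)
profile = [
    ("db file sequential read", 1200, 1200, 0.0004, 0.0002),
    ("SQL*Net round-trip",      3214,   33, 0.0005, 0.0004),
    ("parse call",               200,    1, 0.0020, 0.0020),
]

# Actual response time: every observed call at its observed latency.
actual = sum(n_obs * s_obs for _, n_obs, _, s_obs, _ in profile)

# Top speed RT: only the necessary calls, each at hardware speed.
top_speed = sum(n_req * s_hw for _, _, n_req, _, s_hw in profile)

print(f"actual response time: {actual:.3f}s")
print(f"top speed (RT):       {top_speed:.3f}s")
```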

Where, exactly, is your “requirement”?

Answer B (“When the application is fast enough to meet your users’ requirements”) requires that you know the users’ response time requirement for your function, so, next, let’s locate that value on our response time axis.

This is where the trouble begins. Most DBAs don’t know what their users’ response time requirements really are. Don’t despair, though; most users don’t either.

At banks, airlines, hospitals, telcos, and nuclear plants, you need strict service level agreements, so those businesses invest in quantifying them. But realize: quantifying all your functions’ response time requirements isn’t about a bunch of users sitting in a room arguing over which subjective speed limits sound the best. It’s about knowing your technological speed limits and understanding how close to those values your business needs to pay to be. It’s an expensive process. At some companies, it’s worth the effort; at most companies, it’s just not.

How about using, “well, nobody complains about it,” as all the evidence you need that a given function is meeting your users’ requirement? It’s how a lot of people do it. You might get away with doing it this way if your systems weren’t growing. But systems do grow. More data, more users, more application functions: these are all forms of growth, and you can probably measure every one of them happening where you’re sitting right now. All these forms of growth put you on a collision course with failing to meet your users’ response time requirements, whether you and your users know exactly what they are, or not.

In any event, if you don’t know exactly what your users’ response time requirements are, then you won’t be able to use “meets your users’ requirement” as your finish line that tells you when to stop optimizing. This very practical problem is the demise of answer B for most people.

Knowing your top speed

Even if you do know exactly what your users’ requirements are, it’s not enough. You need to know something more.

Imagine for a minute that you do know your users’ response time requirement for a given function, and let’s say that it’s this: “95% of executions of this function must complete within 5 seconds.” Now imagine that this morning when you started looking at the function, it would typically run for 10 seconds in your Oracle SQL Developer worksheet, but now after spending an hour or so with it, you have it down to where it runs pretty much every time in just 4 seconds. So, you’ve eliminated 60% of the function’s response time. That’s a pretty good day’s work, right? The question is, are you done? Or do you keep going?
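
As an aside, here is a tiny sketch of how you might check a percentile-style requirement like that one against measured response times; the measurements below are simulated stand-ins, not real data.

```python
import random

# Simulated measurements of our hypothetical function, now ~4 seconds each.
response_times = [random.gauss(4.0, 0.2) for _ in range(1000)]

sla_seconds, sla_fraction = 5.0, 0.95   # "95% within 5 seconds"
within = sum(rt <= sla_seconds for rt in response_times) / len(response_times)
status = "meets" if within >= sla_fraction else "misses"
print(f"{within:.1%} of executions finish within {sla_seconds}s -> {status} the requirement")
```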

Here is the reason that answer C is so important. You cannot responsibly answer whether you’re done without knowing that function’s top speed. Even if you know how fast people want it to run, you can’t know whether you’re finished without knowing how fast it can run.

Why? Imagine that 85% of those 4 seconds are consumed by Oracle enqueue, or latch, or log file sync calls, or by hundreds of parse calls, or 3,214 network round-trips to return 3,214 rows. If any of these things is the case, then no, you’re absolutely not done yet. If you were to allow some ridiculous code path like that to survive on a production system, you’d be diminishing the whole system’s effectiveness for everybody (even people who are running functions other than the one you’re fixing).
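
Here is what a profile of that hypothetical 4-second function might look like; the call names and times below are invented to match the scenario, but they show why a function that meets its requirement can still be mostly waste.

```python
# Invented profile: where the 4 seconds of response time actually go.
profile_seconds = {
    "SQL*Net round-trips (3,214 of them)": 2.6,  # one per row; array fetch removes most
    "enqueue waits":                       0.8,  # contention, not useful work
    "CPU doing necessary work":            0.6,
}

total = sum(profile_seconds.values())
for call, sec in sorted(profile_seconds.items(), key=lambda kv: -kv[1]):
    print(f"{sec:4.1f}s  {sec / total:5.1%}  {call}")
# 85% of this function's response time is calls that don't need to exist.
```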

Now, sure, if there’s something else on the system that has a higher priority than finishing the fix on this function, then you should jump to it. But you should at least leave this function on your to-do list. Your analysis of the higher priority function might even reveal that this function’s inefficiencies are causing the higher-priority function’s problems. Such can be the nature of inefficient code under conditions of high load.

On the other hand, if your function is running in 4 seconds and (i) its profile shows no unnecessary calls, and (ii) the calls that remain are running at hardware speeds, then you’ve reached a milestone:
  1. if your code meets your users’ requirement, then you’re done;
  2. otherwise, either you’ll have to reimagine how to implement the function, or you’ll have to upgrade your hardware (or both).
There’s that “users’ requirement” thing again. You see why it has to be there, right?

Well, here’s what most people do. They get their functions’ response times reasonably close to their top speeds (which, with good people, isn’t usually as expensive as it sounds), and then they worry about requirements only if those requirements are so important that it’s worth a project to quantify them. A requirement is usually considered really important if it’s close to your top speed or if it’s really expensive when you violate a service level requirement.

This strategy works reasonably well.

It is interesting to note here that knowing a function’s top speed is actually more important than knowing your users’ requirements for that function. A lot of companies can work just fine not knowing their users’ requirements, but without knowing your top speeds, you really are in the dark. A second observation that I find particularly amusing is this: not only is your top speed more important to know, your top speed is actually easier to compute than your users’ requirement (…if you have a profiler, which was my point in the video).

Better and easier is a good combination.

Tomorrow is important, too

When are you finished optimizing?
  A. When the cost of call reduction and latency reduction exceeds the cost of the performance you’re getting today.
  B. When the application is fast enough to meet your users’ requirements.
  C. When there are no unnecessary calls, and the calls that remain run at hardware speed.
Answer A is still a pretty strong answer. Notice that it actually maps closely to answer C. Answer C’s prescription for “no unnecessary calls” yields answer A’s goal of call reduction, and answer C’s prescription for “calls that remain run at hardware speed” yields answer A’s goal of latency reduction. So, in a way, C is a more action-oriented version of A, but A goes further to combat the perfectionism trap with its emphasis on the cost of action versus the cost of inaction.

One thing I’ve grown to dislike about answer A, though, is its emphasis on today in “…exceeds the cost of the performance you’re getting today.” After years of experience with the question of when optimization is complete, I think that answer A under-emphasizes the importance of tomorrow. Unplanned tomorrows can quickly become ugly todays, and as important as tomorrow is to businesses and the people who run them, it’s even more important to another community: database application developers.

Subjective goals are treacherous for developers

Many developers have no way to test, today, the true production response time behavior of their code, which they won’t learn until tomorrow. ...And perhaps not until some remote, distant tomorrow.

Imagine you’re a developer using 100-row tables on your desktop to test code that will access 100,000,000,000-row tables on your production server. Or maybe you’re testing your code’s performance only in isolation from other workload. Both of these are problems; they’re procedural mistakes, but they are everyday real-life for many developers. When this is how you develop, telling you that “your users’ response time requirement is n seconds” accidentally implies that you are finished optimizing when your query finishes in less than n seconds on your no-load system of 100-row test tables.

If you are a developer writing high-risk code—and any code that will touch huge database segments in production is high-risk code—then of course you must aim for the “no unnecessary calls” part of the top speed target. And you must aim for the “and the calls that remain run at hardware speed” part, too, but you won’t be able to measure your progress against that goal until you have access to full data volumes and full user workloads.

Notice that to do both of these things, you must have access to full data volumes and full user workloads in your development environment. To build high-performance applications, you must do full data volume testing and full user workload testing in each of your functional development iterations.

This is where agile development methods yield a huge advantage: agile methods provide a project structure that encourages full performance testing for each new product function as it is developed. Contrast this with the terrible project planning approach of putting all your performance testing at the end of your project, when it’s too late to actually fix anything (if there’s even enough budget left over by then to do any testing at all). If you want a high-performance application with great performance diagnostics, then performance instrumentation should be an important part of your feedback for each development iteration of each new function you create.
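
One lightweight shape such instrumentation can take (the decorator and task name below are illustrative assumptions, not a prescription) is to wrap each business task so that every execution reports its own response time:

```python
import functools
import time

def instrumented(task_name):
    """Report the response time of one business task per execution."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                elapsed = time.perf_counter() - start
                print(f"{task_name}: {elapsed:.3f}s")   # or log/store it
        return wrapper
    return decorate

@instrumented("post invoice")   # hypothetical business task
def post_invoice(invoice_id):
    time.sleep(0.05)            # stand-in for the real work

post_invoice(42)                # prints something like "post invoice: 0.050s"
```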

My answer

So, when are you finished optimizing?
  A. When the cost of call reduction and latency reduction exceeds the cost of the performance you’re getting today.
  B. When the application is fast enough to meet your users’ requirements.
  C. When there are no unnecessary calls and the calls that remain run at hardware speed.
There is some merit in all three answers, but as Dave Ensor taught me inside Oracle many years ago, the correct answer is C. Answer A specifically restricts your scope of concern to today, which is especially dangerous for developers. Answer B permits you to promote horrifically bad code, unhindered, into production, where it can hurt the performance of every function on the system. Answers A and B both presume that you know information that you probably don’t know and that you may not need to know. Answer C is my favorite answer because it tells you exactly when you’re done, using units you can measure and that you should be measuring.

Answer C is usually a tougher standard than answer A or B, and when it’s not, it is the best possible standard you can meet without upgrading or redesigning something. In light of this “tougher standard” kind of talk, it is still important to understand that what is optimal from a software engineering perspective is not always optimal from a business perspective. The term optimized must ultimately be judged within the constraints of what the business chooses to pay for. In the spirit of answer A, you can still make the decision not to optimize all your code to the last picosecond of its potential. How perfect you make your code should be a business decision. That decision should be informed by facts, and these facts should include knowledge of your code’s top speed.

Thank you, Guðmundur Jósepsson, of Iceland, for your question. Thank you for waiting patiently for several weeks while I struggled putting these thoughts into words.

Monday, December 21, 2009

My Whole System Is Slow. Now What?

At CMG'09 a couple of weeks ago, I presented "Measuring Response Times of Code on Oracle Systems." The paper for this presentation was a subset of "For Developers: Making Friends with the Oracle Database." In the presentation, I spent a few minutes talking about why to measure response times in Oracle, and then I spent a lot of minutes talking about how. As usual, I focused heavily on the importance of measuring response times of individual business tasks executed by individual end users.

At the end of the talk, a group of people came to the podium to ask questions (always a good sign). The first question was the question that a lot of people ask. It was:
My whole system is slow. That's all my users will tell me. So then, how do I begin to do what you're describing?
Here's the answer:
Ask your users to show you what they're doing. Just go look at it.
The results of this simple advice are routinely spectacular. Just go look at it: I'm surprised whenever someone doesn't think of doing that, but I shouldn't be. That's because I didn't do it either, for the longest time. I had to learn to. And that's the story I want to tell you here.

In the early 1990s, I was a consultant with Oracle Corporation visiting clients with performance problems at a pace of more than 30 per year. Back then, I did Oracle performance work the old-fashioned way: I checked everything I knew how to check, and then I fixed everything I knew how to fix. All billable by the hour. (Note: When I was doing it this way, I had not yet been taught by Dave Ensor, who changed me forever.)

On weeks when I was lucky, I'd be finished checking and fixing by sometime Wednesday, leaving a couple of days to find out what people thought of my work. If I were lucky again (that's two "lucky"s now), everyone would be thrilled with the results. I'd get my hug (so to speak), and I'd catch my flight.

But I wasn't always lucky. Some weeks, I wouldn't find anything suspicious in my checking and fixing. Some weeks, I'd find plenty, but still not everyone would be thrilled with the work. Having people be less than thrilled with my work caused pain for me, which motivated me to figure out how to take more control of my consulting engagements, to drive luck out of the equation.

The most important thing I figured out was...
People knew before I came on-site how they were going to measure on Thursday whether they liked the results of my work.
And...
They were willing to tell me on Monday.
All I had to do was be honest, like this:
On the day I'm done working here, I'd imagine you're going to want to run something that will demonstrate whether I accomplished what you were hoping for while I was here. Would you mind telling me about that now? Maybe even showing me?
I could ask that on Monday, and people were glad to tell me. I'd watch the things run and record how long they ran, and then I'd know how to prioritize my time on site. I'd record how long they ran so at the end of my engagement, I'd be able to show very clearly what improvements I had made.

Sometimes, there would be thirty different things that people would expect to measure on Thursday. If I might not have time to fix them all, then I needed to make sure that I knew the priority of the things I was being asked to fix.

That one step alone—knowing on Monday that prioritized list of what tasks needed to be fast by Thursday—drastically reduced my reliance on luck as a success factor in my job at these sites. Knowing that list on Monday is just like when your teacher in school tells you exactly what's going to be on your next test. It allows you to focus your attention on exactly what you need to do to optimize your reward for the week. (Note to fellow education enthusiasts: Please don't interpret this paragraph as my advocating the idea that GPA should be a student's sole—or even dominant—optimization constraint.)

So, what I learned is that the very first step of any good performance optimization method is necessarily this:
1. Identify the task that's the most important to you.
When I say "task," think "program" or "click" or "batch job" if you want to. What I mean is "a useful unit of work that makes sense to the business." ...Something that a business user would show you if you just went and watched her work for a few minutes.

Then comes step two:
2. Measure its response time (R). In detail.
Why is response time so important? Because that's what's important to the person who'll be watching it run on Thursday, assessing whether she thinks you've done a good job or not. That person's going to click and then wait. Happiness will be inversely proportional to how long the wait is. That's it. That's what "performance" means at 99% of sites I've ever visited.

(If you're interested in the other 1% of sites I've visited, they're interested in throughput, which I've written about in another blog post.)

Measuring response time is vital. You must be able to measure response time if you're going to nail that test on Thursday.

The key is to understand that the term response time doesn't even have a definition except in the context of a task. You can't measure response time if you don't first decide what task you're going to measure. In other words, you cannot do step 2 before you do step 1. With Oracle, for example, you can collect ASH data (if you're licensed to use it) or even trace data for a whole bunch of Oracle processes, but you won't have a single response time until you define which tasks buried within that data are the ones you want to extract and pay attention to.
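
Here is a minimal sketch of that point, using made-up event records rather than real ASH or trace data: until you name a task, there is no response time to compute.

```python
# Made-up timed events from many processes; no single "response time" yet.
events = [
    {"task": "book ticket",   "seconds": 0.42},
    {"task": "nightly batch", "seconds": 3.90},
    {"task": "book ticket",   "seconds": 1.10},
]

task = "book ticket"   # step 1: identify the task that matters
R = sum(e["seconds"] for e in events if e["task"] == task)   # step 2: measure R
print(f"R({task}) = {R:.2f}s")
```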

You get that by visiting a user and watching what she does.

There are lots of excuses for not watching your users. Like these...
  • "I don't know my users." I know. But you should. You'd do your job better if you did. And your users would, too.
  • "My users aren't here." I know. They're on the web. They're in Chicago and Singapore and Istanbul, buying plane tickets or baseball caps or stock shares. But if you can't watch at least a simulation of the things those users actually do with the system you help manage, then I can't imagine how you would possibly succeed at providing good performance to them.
  • "I'm supposed to be able to manage performance with my dashboard." I know. I was supposed to have a hover car by the year 2000.
The longer you stay mired in excuses like these, the longer it's going to be before you can get the benefit of my point here. Your users are running something, and whatever that is that they're running is your version of my Thursday test. You can check and fix all you want, but unless you get lucky and fix the exact tooth that's hurting, your efforts aren't going to be perceived as "helpful." Checking and fixing everything you can think of is far less efficient and effective than targeting exactly what your user needs you to target.

Lots of performance analysts (DBAs, developers, architects, sysadmins, and so on) assume that when someone says, "The whole system is slow," it means there must be a single parameter somewhere in the bowels of the system that needs adjustment, and if you can just make that adjustment, everything is going to be ok. It might mean that, but in my experience, the overwhelming majority of cases are not that way. (Pages 25–29 of Optimizing Oracle Performance have more information about this.)

The great thing about measuring response time is that no matter what the problem is, you'll see it. If the program you're watching is poorly written, you'll see it. If some other program is hogging too much of a resource that your program needs, you'll see it. If you have a bad disk controller, you'll see it. If some parameter needs adjusting, you'll see it.

Realize that when a business user says "system," she doesn't mean what you would mean if you said "system." She means that the thing she runs is slow. Look at that thing. Maybe there are seventeen of them. And sure, maybe all seventeen suffer from the same root cause. If that's the case, then fantastic, because fixing the first problem will magically fix the other sixteen, too. If it's not, then fantastic anyway, because now all of them are on your prioritized list of tasks to optimize, and you'll probably surprise yourself how quickly you'll be able to pick them all off when you focus on one task at a time.