
Friday, February 27, 2015

What happened to “when the application is fast enough to meet users’ requirements?”

On January 5, I received an email called “Video” from my friend and former employee Guðmundur Jósepsson from Iceland. His friends call him Gummi (rhymes with “do me”). Gummi is the guy whose name is set in the ridiculous monospace font on page xxiv of Optimizing Oracle Performance, apparently because O’Reilly’s Linotype Birka font didn’t have the letter eth (ð) in it. Gummi once modestly teased me that this is what he is best known for. But I digress...

His email looked like this:


It’s a screen shot of frame 3:12 from my November 2014 video called “Why you need a profiler for Oracle.” At frame 3:12, I am answering the question of how you can know when you’re finished optimizing a given application function. Gummi’s question is, «Oi! What happened to “when the application is fast enough to meet users’ requirements?”»

Gummi noticed (the good ones will do that) that the video says something different than the thing he had heard me say for years. It’s a fair question. Why, in the video, have I said this new thing? It was not an accident.

When are you finished optimizing?

The question in focus is, “When are you finished optimizing?” Over the years, I have actually used three different answers:
When are you finished optimizing?
  A. When the cost of call reduction and latency reduction exceeds the cost of the performance you’re getting today.
    Source: Optimizing Oracle Performance (2003) pages 302–304.
  B. When the application is fast enough to meet your users’ requirements.
    Source: I have taught this in various courses, conferences, and consulting calls since 1999 or so.
  C. When there are no unnecessary calls, and the calls that remain run at hardware speed.
    Source: “Why you need a profiler for Oracle” (2014) frames 2:51–3:20.
My motive behind answers A and B was the idea that optimizing beyond what your business needs can be wasteful. I created these answers to deter people from misdirecting time and money toward perfecting something when those resources might be better invested improving something else. This idea was important, and it still is.

So, then, where did C come from? I’ll begin with a picture. The following figure allows you to plot the response time for a single application function, whatever “given function” you’re looking at. You could draw a similar figure for every application function on your system (although I wouldn’t suggest it).


Somewhere on this response time axis for your given function is the function’s actual response time. I haven’t marked that response time’s location specifically, but I know it’s in the blue zone, because at the bottom of the blue zone is the special response time RT. This value RT is the function’s top speed on the hardware you own today. Your function can’t go faster than this without upgrading something.

It so happens that this top speed is the speed at which your function will run if and only if (i) it contains no unnecessary calls and (ii) the calls that remain run at hardware speed. ...Which, of course, is the idea behind this new answer C.
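Here is a minimal sketch of that idea in Python, with every number invented purely for illustration: estimate a function’s top speed by discarding its unnecessary calls and pricing the calls that remain at hardware latency.

# Hypothetical illustration of answer C. Every number here is invented.
# A function's top speed is what remains after you (i) remove the
# unnecessary calls and (ii) price the necessary calls at hardware speed.

calls = [
    # (contributor, count, measured seconds, necessary?, hardware sec/call)
    ("db file sequential read",   120, 0.90, True,  0.0002),
    ("SQL*Net round-trip",       3214, 2.50, False, 0.0004),  # row-by-row fetching
    ("parse calls",               300, 0.45, False, 0.0005),  # avoidable re-parsing
    ("CPU: execute/fetch",          1, 0.15, True,  0.1500),
]

measured_rt = sum(t for _, _, t, _, _ in calls)
top_speed   = sum(n * hw for _, n, _, needed, hw in calls if needed)

print(f"measured response time: {measured_rt:.2f} s")  # 4.00 s
print(f"estimated top speed:    {top_speed:.2f} s")    # 0.17 s

On numbers like these, a function that everyone tolerates at 4 seconds still has a top speed under a fifth of a second. Hold that thought; a 4-second function will come up again shortly.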

Where, exactly, is your “requirement”?

Answer B (“When the application is fast enough to meet your users’ requirements”) requires that you know the users’ response time requirement for your function, so, next, let’s locate that value on our response time axis.

This is where the trouble begins. Most DBAs don’t know what their users’ response time requirements really are. Don’t despair, though; most users don’t either.

At banks, airlines, hospitals, telcos, and nuclear plants, you need strict service level agreements, so those businesses invest in quantifying them. But realize: quantifying all your functions’ response time requirements isn’t about a bunch of users sitting in a room arguing over which subjective speed limits sound the best. It’s about knowing your technological speed limits and understanding how close to those values your business needs to pay to be. It’s an expensive process. At some companies, it’s worth the effort; at most companies, it’s just not.

How about using, “well, nobody complains about it,” as all the evidence you need that a given function is meeting your users’ requirement? It’s how a lot of people do it. You might get away with doing it this way if your systems weren’t growing. But systems do grow. More data, more users, more application functions: these are all forms of growth, and you can probably measure every one of them happening where you’re sitting right now. All these forms of growth put you on a collision course with failing to meet your users’ response time requirements, whether you and your users know exactly what they are, or not.

In any event, if you don’t know exactly what your users’ response time requirements are, then you won’t be able to use “meets your users’ requirement” as your finish line that tells you when to stop optimizing. This very practical problem is the demise of answer B for most people.

Knowing your top speed

Even if you do know exactly what your users’ requirements are, it’s not enough. You need to know something more.

Imagine for a minute that you do know your users’ response time requirement for a given function, and let’s say that it’s this: “95% of executions of this function must complete within 5 seconds.” Now imagine that this morning when you started looking at the function, it would typically run for 10 seconds in your Oracle SQL Developer worksheet, but now after spending an hour or so with it, you have it down to where it runs pretty much every time in just 4 seconds. So, you’ve eliminated 60% of the function’s response time. That’s a pretty good day’s work, right? The question is, are you done? Or do you keep going?

Here is the reason that answer C is so important. You cannot responsibly answer whether you’re done without knowing that function’s top speed. Even if you know how fast people want it to run, you can’t know whether you’re finished without knowing how fast it can run.

Why? Imagine that 85% of those 4 seconds are consumed by Oracle enqueue, or latch, or log file sync calls, or by hundreds of parse calls, or 3,214 network round-trips to return 3,214 rows. If any of these things is the case, then no, you’re absolutely not done yet. If you were to allow some ridiculous code path like that to survive on a production system, you’d be diminishing the whole system’s effectiveness for everybody (even people who are running functions other than the one you’re fixing).
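And those 3,214 round trips are usually nothing more exotic than a client fetching one row per network call. Here is a minimal sketch using the python-oracledb driver (the connection details, table, and column names are placeholders, not from any real system) showing the one attribute that eliminates most of them:

import oracledb  # the python-oracledb driver

# Placeholder credentials and DSN, for illustration only.
conn = oracledb.connect(user="app", password="secret", dsn="dbhost/pdb1")
cur = conn.cursor()

# arraysize controls how many rows return per network round trip.
# Fetching 3,214 rows one at a time costs 3,214 round trips; at
# arraysize=1000, the same result set needs about 4.
cur.arraysize = 1000

cur.execute("select order_id, status from orders where customer_id = :1", [42])
rows = cur.fetchall()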

Now, sure, if there’s something else on the system that has a higher priority than finishing the fix on this function, then you should jump to it. But you should at least leave this function on your to-do list. Your analysis of the higher priority function might even reveal that this function’s inefficiencies are causing the higher-priority function’s problems. Such can be the nature of inefficient code under conditions of high load.

On the other hand, if your function is running in 4 seconds and (i) its profile shows no unnecessary calls, and (ii) the calls that remain are running at hardware speeds, then you’ve reached a milestone:
  1. if your code meets your users’ requirement, then you’re done;
  2. otherwise, either you’ll have to reimagine how to implement the function, or you’ll have to upgrade your hardware (or both).
There’s that “users’ requirement” thing again. You see why it has to be there, right?

Well, here’s what most people do. They get their functions’ response times reasonably close to their top speeds (which, with good people, isn’t usually as expensive as it sounds), and then they worry about requirements only if those requirements are so important that it’s worth a project to quantify them. A requirement is usually considered really important if it’s close to your top speed or if violating it is really expensive.

This strategy works reasonably well.

It is interesting to note here that knowing a function’s top speed is actually more important than knowing your users’ requirements for that function. A lot of companies can work just fine not knowing their users’ requirements, but without knowing your top speeds, you really are in the dark. A second observation that I find particularly amusing is this: not only is your top speed more important to know, your top speed is actually easier to compute than your users’ requirement (…if you have a profiler, which was my point in the video).

Better and easier is a good combination.

Tomorrow is important, too

When are you finished optimizing?
  A. When the cost of call reduction and latency reduction exceeds the cost of the performance you’re getting today.
  B. When the application is fast enough to meet your users’ requirements.
  C. When there are no unnecessary calls, and the calls that remain run at hardware speed.
Answer A is still a pretty strong answer. Notice that it actually maps closely to answer C. Answer C’s prescription for “no unnecessary calls” yields answer A’s goal of call reduction, and answer C’s prescription for “calls that remain run at hardware speed” yields answer A’s goal of latency reduction. So, in a way, C is a more action-oriented version of A, but A goes further to combat the perfectionism trap with its emphasis on the cost of action versus the cost of inaction.

One thing I’ve grown to dislike about answer A, though, is its emphasis on today in “…exceeds the cost of the performance you’re getting today.” After years of experience with the question of when optimization is complete, I think that answer A under-emphasizes the importance of tomorrow. Unplanned tomorrows can quickly become ugly todays, and as important as tomorrow is to businesses and the people who run them, it’s even more important to another community: database application developers.

Subjective goals are treacherous for developers

Many developers have no way to test, today, the true production response time behavior of their code, which they won’t learn until tomorrow. ...And perhaps not until some remote, distant tomorrow.

Imagine you’re a developer using 100-row tables on your desktop to test code that will access 100,000,000,000-row tables on your production server. Or maybe you’re testing your code’s performance only in isolation from other workload. Both of these are problems; they’re procedural mistakes, but they are everyday real-life for many developers. When this is how you develop, telling you that “your users’ response time requirement is n seconds” accidentally implies that you are finished optimizing when your query finishes in less than n seconds on your no-load system of 100-row test tables.

If you are a developer writing high-risk code—and any code that will touch huge database segments in production is high-risk code—then of course you must aim for the “no unnecessary calls” part of the top speed target. And you must aim for the “and the calls that remain run at hardware speed” part, too, but you won’t be able to measure your progress against that goal until you have access to full data volumes and full user workloads.

Notice that to do both of these things, you must have access to full data volumes and full user workloads in your development environment. To build high-performance applications, you must do full data volume testing and full user workload testing in each of your functional development iterations.

This is where agile development methods yield a huge advantage: agile methods provide a project structure that encourages full performance testing for each new product function as it is developed. Contrast this with the terrible project planning approach of putting all your performance testing at the end of your project, when it’s too late to actually fix anything (if there’s even enough budget left over by then to do any testing at all). If you want a high-performance application with great performance diagnostics, then performance instrumentation should be an important part of your feedback for each development iteration of each new function you create.

My answer

So, when are you finished optimizing?
  A. When the cost of call reduction and latency reduction exceeds the cost of the performance you’re getting today.
  B. When the application is fast enough to meet your users’ requirements.
  C. When there are no unnecessary calls and the calls that remain run at hardware speed.
There is some merit in all three answers, but as Dave Ensor taught me inside Oracle many years ago, the correct answer is C. Answer A specifically restricts your scope of concern to today, which is especially dangerous for developers. Answer B permits you to promote horrifically bad code, unhindered, into production, where it can hurt the performance of every function on the system. Answers A and B both presume that you know information that you probably don’t know and that you may not need to know. Answer C is my favorite answer because it tells you exactly when you’re done, using units you can measure and that you should be measuring.

Answer C is usually a tougher standard than answer A or B, and when it’s not, it is the best possible standard you can meet without upgrading or redesigning something. In light of this “tougher standard” kind of talk, it is still important to understand that what is optimal from a software engineering perspective is not always optimal from a business perspective. The term optimized must ultimately be judged within the constraints of what the business chooses to pay for. In the spirit of answer A, you can still make the decision not to optimize all your code to the last picosecond of its potential. How perfect you make your code should be a business decision. That decision should be informed by facts, and these facts should include knowledge of your code’s top speed.

Thank you, Guðmundur Jósepsson, of Iceland, for your question. Thank you for waiting patiently for several weeks while I struggled putting these thoughts into words.

Friday, November 18, 2011

I Can Help You Trace It

The first product I ever created after leaving Oracle Corporation in 1999 was a 3-day course about optimizing Oracle performance. The experiences of teaching this course from 2000 through 2003 (heavily revising the material each time I taught it) added up to the knowledge that Jeff Holt and I needed to write Optimizing Oracle Performance (2003).

Between 2000 and 2006, I spent many weeks on the road teaching this 3-day course. I stopped teaching it in 2006. An opportunity to take or teach a course ought to be a joyous experience, and this one had become too much of a grind. I didn’t figure out how to fix it until this year. How I fixed it is the story I’d like to tell you.

The Problem

The problem was simply inefficiency. The inefficiency began with the structure of the course, the 3-day lecture marathon. Realize: 6 hours a day × 3 days = 18 hours of sitting in a chair, listening attentively to a single voice (my voice), is the equivalent of a 6-week university term of a 3-credit-hour course, taught straight through in three days. No hour-plus homework assignment after each hour of lecture to reinforce the lessons; just a full term’s worth of listening to one voice, straight through, for three days. What retention rate would you expect from a university course compressed into just 3 days?

So, I optimized. I have created a new course that lasts one day (not even an exhausting full day at that). But how can a student possibly learn as much in 1 day as we used to teach in 3 days? Isn’t a 1-day event bound to be a significantly reduced-value experience?

On the contrary, I believe our students benefit even more now than they used to. Here are the big differences, so you can see why.

The Time Savings

In the 3-day course, I would spend half a day explaining why people should abandon their old system-wide-ratio-based ways of managing system performance. In the new 1-day course, I spend less than an hour explaining the Method R approach to thinking about performance. The point of the new course is not to convince people to abandon anything they’re already doing; it’s to show students the tremendous additional opportunities that are available to them if they’ll just look at what Oracle trace files have to offer. Time savings: 2 hours.

In the 3-day course, I would spend a full day explaining how to interpret trace data. By hand. There were a few little lab exercises, about an hour’s worth. Students would enter dozens of numbers from trace files into laptops or pocket calculators and write results on worksheets. In the new 1-day course, the software tools that a student needs to interpret files of any size—or even directories full of files—are included in the price of the course. Time savings: 5 hours.

In the 3-day course, I would spend half a day explaining how to collect trace data. In the new 1-day course, the software tools that a student needs to get started collecting trace files are included in the price of the course. For software architectures that require more work than our software can do for you, there’s detailed instruction in the course book. Time savings: 3 hours.

In the 3-day course, I would spend half a day working through about five example cases using a software tool to which students would have access for 30 days after they had gone home. In the new 1-day course, I spend one hour working through about eight example cases using software tools that every student will take home and keep forever. I can spend less time per case yet teach more because the cases are thoroughly documented in the course book. So, in class, we focus on the high-level decision making instead of the gnarly technical details you’ll want to look up later anyway. Time savings: 3 hours.

...That’s 13 classroom hours we’ve eliminated from the old 3-day experience. I believe that in these 13 hours, I was teaching material that students weren’t retaining to begin with.
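In case you are wondering what collecting trace data even looks like in the simplest case: for a session you control, it is a couple of calls to Oracle’s standard DBMS_SESSION package. Here is a minimal sketch (placeholder connection details; architectures that need more work than this are the ones the course book covers in detail):

import oracledb  # the python-oracledb driver

conn = oracledb.connect(user="app", password="secret", dsn="dbhost/pdb1")
cur = conn.cursor()

# Tag the trace file so it's easy to find, then enable extended SQL
# trace (including wait events) for this session only.
cur.execute("alter session set tracefile_identifier = 'motd_demo'")
cur.execute("begin dbms_session.session_trace_enable("
            "waits => true, binds => false); end;")

# ... run the business task you want to measure ...

cur.execute("begin dbms_session.session_trace_disable; end;")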

The Book

The next big difference: the book.

In the old 3-day course, I distributed two books: (1) the “Course Notebook,” which was a black and white listing of the course PowerPoint slides, and (2) a copy of Optimizing Oracle Performance (O’Reilly 2003). The O’Reilly book was great, because it contained a lot of detail that you would want to look up after the course. But of course it doesn’t contain any new knowledge we’ve learned since 2003. The Course Notebook, in my opinion, was never worth much to begin with. (In my opinion, no PowerPoint slide printout is worth much to begin with.)

The Mastering Oracle Trace Data (MOTD) book we give each student in my new 1-day course is a full-color, perfect-bound book that explains the course material and far more in deep detail. It is full-color for an important reason. It’s not gratuitous or decorative; it’s because I’ve been studying Edward Tufte. I use color throughout the book to communicate detailed, high-resolution information faster to your brain.

Color in the book helps to reduce student workload and deliver value long after a student has left the classroom. In this class, there is no collection of slide printouts like you’ve archived after every Oracle class you’ve been to since the 1980s. The MOTD book is way better than any other material I’ve ever distributed in my career. I’ve heard students tell their friends that you have to see it to believe it.
“A paper record tells your audience that you are serious, responsible, exact, credible. For deep analysis of evidence and reasoning about complex matters, permanent high-resolution displays [that is, paper] are an excellent start.” —Edward Tufte

The Software

So, where does a student recoup all the time we used to spend going through trace files, and studying how to collect trace data on half a dozen different software architectures? In the thousands of man-hours we’ve invested in the software that we give you when you come to the course. Instead of explaining every little detail about quirks in Oracle trace data that change between Oracle versions 10.1, 10.2, and 11.2, or between 11.2.0.2 and 11.2.0.4, the software does the work for you. Instead of having to explain all the detail work, we have time to explain how to use the results of our software to make decisions about your data.

What’s the catch? Of course, we hope you’ll love our software and want to buy it. The software we give you is completely full-featured and yours to keep forever, but the license limits you to using it only with one login id, and it doesn’t include patches and upgrades, which we release a few times each year. We hope you’ll love our software so much that you’ll want to buy a license that lets you use it on any of your systems and that includes the right to upgrade as we fix bugs and add features. We hope you’ll love it so much that you encourage your colleagues to buy it.

But there’s really no catch. You get software and a course (and a book and a shirt) for less than the daily rate that we used to charge for just a course.

A Shirt?

MOTD London 2011-09-08: “I can help you trace it.”
Yes, a shirt. Each student receives a Method R T-shirt that says, “I can help you trace it.” We don’t give these things away to anyone except for students in my MOTD course. So if you see one, the person wearing it can, in actual fact, Help You Trace It.

The Net Result

The net result of all this optimization is benefits on several fronts:
  • The course costs a lot less than it used to. The fee is presently only about 25% of the 3-day course’s price, and the whole experience requires less than ⅓ of the time away from work that the original course did.
  • In the new course, our students don’t have to work so hard to make productive use of the course material. The book and the software take so much of the pressure off. We do talk about what the fields in raw trace data mean—I think it’s necessary to know that in order to use the data properly, and have productive debates with your sys/SAN/net/etc. administration colleagues. But we don’t spend your time doing exercises to untangle nested (recursive) calls by hand. The software you take home does that for you. That’s why it is so much easier for a student to put this course to work right away.
  • Since the course duration is only one day, I can visit far more cities and meet far more students each year. That’s good for students who want to participate, and it’s great for me, because I get to meet more people.

Plans

The only thing missing from our Mastering Oracle Trace Data course right now is you. I have taught the event now in Southlake, Texas (our home town), in Copenhagen, and in London. It’s field-tested and ready to roll. We have several cities on my schedule right now. I’ll be teaching the course in Birmingham UK on the day after UKOUG wraps up, December 8. I’ll be doing Orlando and Tampa in mid-December. I’ll teach two courses this coming January in Manhattan and Long Island. There’s Billund (Legoland) DK in April. We have more plans in the works for Seattle, Portland, Dallas, and Cleveland, and we’re looking for more opportunities.

Spread the word by linking the official MOTD sticker to http://method-r.com/.
My wish is for you to help me book more cities in North America and Europe (I hope to expand beyond that soon). If you are part of a company or a user group with colleagues who would be interested in attending the course, I would love to hear from you. Registering en masse saves you money. The magic number for discounting is 10 students on a single registration from one company or user group.

I can help you trace it.

Monday, December 21, 2009

My Whole System Is Slow. Now What?

At CMG'09 a couple of weeks ago, I presented "Measuring Response Times of Code on Oracle Systems." The paper for this presentation was a subset of "For Developers: Making Friends with the Oracle Database." In the presentation, I spent a few minutes talking about why to measure response times in Oracle, and then I spent a lot of minutes talking about how. As usual, I focused heavily on the importance of measuring response times of individual business tasks executed by individual end users.

At the end of the talk, a group of people came to the podium to ask questions (always a good sign). The first question was the question that a lot of people ask. It was:
My whole system is slow. That's all my users will tell me. So then, how do I begin to do what you're describing?
Here's the answer:
Ask your users to show you what they're doing. Just go look at it.
The results of this simple advice are routinely spectacular. Just go look at it: I'm surprised whenever someone doesn't think of doing that, but I shouldn't be. That's because I didn't do it either, for the longest time. I had to learn to. And that's the story I want to tell you here.

In the early 1990s, I was a consultant with Oracle Corporation visiting clients with performance problems at a pace of more than 30 per year. Back then, I did Oracle performance work the old fashioned way: I checked everything I knew how to check, and then I fixed everything I knew how to fix. All billable by the hour. (Note: When I was doing it this way, I had not yet been taught by Dave Ensor, who changed me forever.)

On weeks when I was lucky, I'd be finished checking and fixing by sometime Wednesday, leaving a couple of days to find out what people thought of my work. If I were lucky again (that's two "lucky"s now), everyone would be thrilled with the results. I'd get my hug (so to speak), and I'd catch my flight.

But I wasn't always lucky. Some weeks, I wouldn't find anything suspicious in my checking and fixing. Some weeks, I'd find plenty, but still not everyone would be thrilled with the work. Having people be less than thrilled with my work caused pain for me, which motivated me to figure out how to take more control of my consulting engagements, to drive luck out of the equation.

The most important thing I figured out was...
People knew before I came on-site how they were going to measure on Thursday whether they liked the results of my work.
And...
They were willing to tell me on Monday.
All I had to do was be honest, like this:
On the day I'm done working here, I'd imagine you're going to want to run something that will demonstrate whether I accomplished what you were hoping for while I was here. Would you mind telling me about that now? Maybe even showing me?
I could ask that on Monday, and people were glad to tell me. I'd watch the things run and record how long they ran, and then I'd know how to prioritize my time on site. I'd record how long they ran so at the end of my engagement, I'd be able to show very clearly what improvements I had made.

Sometimes, there would be thirty different things that people would expect to measure on Thursday. If I might not have time to fix them all, then I needed to make sure that I knew the priority of the things I was being asked to fix.

That one step alone—knowing on Monday that prioritized list of what tasks needed to be fast by Thursday—drastically reduced my reliance on luck as a success factor in my job at these sites. Knowing that list on Monday is just like when your teacher in school tells you exactly what's going to be on your next test. It allows you to focus your attention on exactly what you need to do to optimize your reward for the week. (Note to fellow education enthusiasts: Please don't interpret this paragraph as my advocating the idea that GPA should be a student's sole—or even dominant—optimization constraint.)

So, what I learned is that the very first step of any good performance optimization method is necessarily this:
1. Identify the task that's the most important to you.
When I say "task," think "program" or "click" or "batch job" if you want to. What I mean is "a useful unit of work that makes sense to the business." ...Something that a business user would show you if you just went and watched her work for a few minutes.

Then comes step two:
2. Measure its response time (R). In detail.
Why is response time so important? Because that's what's important to the person who'll be watching it run on Thursday, assessing whether she thinks you've done a good job or not. That person's going to click and then wait. Happiness will be inversely proportional to how long the wait is. That's it. That's what "performance" means at 99% of sites I've ever visited.

(If you're interested in the other 1% of sites I've visited, they're interested in throughput, which I've written about in another blog post.)

Measuring response time is vital. You must be able to measure response time if you're going to nail that test on Thursday.

The key is to understand that the term response time doesn't even have a definition except in the context of a task. You can't measure response time if you don't first decide what task you're going to measure. In other words, you cannot do step 2 before you do step 1. With Oracle, for example, you can collect ASH data (if you're licensed to use it) or even trace data for a whole bunch of Oracle processes, but you won't have a single response time until you define which tasks buried within that data are the ones you want to extract and pay attention to.
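To make that concrete, here is a deliberately oversimplified Python sketch of the kind of aggregation a profiler performs over one session’s trace file. (The file name is a placeholder; trace line formats vary across Oracle versions, and a real profiler must also attribute time to individual tasks and avoid double-counting waits that occur inside database calls.)

import re
from collections import Counter

# Extended SQL trace lines look roughly like:
#   WAIT #5140...: nam='db file sequential read' ela= 403 ...
#   FETCH #5140...:c=0,e=341,p=0,cr=7,...   (e = elapsed microseconds)
wait_re = re.compile(r"WAIT #\d+: nam='([^']+)' ela= (\d+)")
call_re = re.compile(r"(PARSE|EXEC|FETCH) #\d+:.*?,e=(\d+),")

profile = Counter()
with open("orcl_ora_12345_motd_demo.trc") as trc:  # placeholder file name
    for line in trc:
        if m := wait_re.match(line):
            profile[m.group(1)] += int(m.group(2))             # microseconds
        elif m := call_re.match(line):
            profile["call: " + m.group(1)] += int(m.group(2))  # see caveat above

total = sum(profile.values())
for name, usec in profile.most_common():
    print(f"{name:35s} {usec/1e6:9.3f} s {usec/total:6.1%}")

Even a toy like this makes the point: until you decide which trace data belongs to which task, all you have is a pile of microseconds.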

You get that by visiting a user and watching what she does.

There are lots of excuses for not watching your users. Like these...
  • "I don't know my users." I know. But you should. You'd do your job better if you did. And your users would, too.
  • "My users aren't here." I know. They're on the web. They're in Chicago and Singapore and Istanbul, buying plane tickets or baseball caps or stock shares. But if you can't watch at least a simulation of the things those users actually do with the system you help manage, then I can't imagine how you would possibly succeed at providing good performance to them.
  • "I'm supposed to be able to manage performance with my dashboard." I know. I was supposed to have a hover car by the year 2000.
The longer you stay mired in excuses like these, the longer it's going to be before you can get the benefit of my point here. Your users are running something, and whatever that is that they're running is your version of my Thursday test. You can check and fix all you want, but unless you get lucky and fix the exact tooth that's hurting, your efforts aren't going to be perceived as "helpful." Checking and fixing everything you can think of is far less efficient and effective than targeting exactly what your user needs you to target.

Lots of performance analysts (DBAs, developers, architects, sysadmins, and so on) assume that when someone says, "The whole system is slow," it means there must be a single parameter somewhere in the bowels of the system that needs adjustment, and if you can just make that adjustment, everything is going to be ok. It might mean that, but in my experience, the overwhelming majority of cases are not that way. (Pages 25–29 of Optimizing Oracle Performance have more information about this.)

The great thing about measuring response time is that no matter what the problem is, you'll see it. If the program you're watching is poorly written, you'll see it. If some other program is hogging too much of a resource that your program needs, you'll see it. If you have a bad disk controller, you'll see it. If some parameter needs adjusting, you'll see it.

Realize that when a business user says "system," she doesn't mean what you would mean if you said "system." She means that the thing she runs is slow. Look at that thing. Maybe there are seventeen of them. And sure, maybe all seventeen suffer from the same root cause. If that's the case, then fantastic, because fixing the first problem will magically fix the other sixteen, too. If it's not, then fantastic anyway, because now all of them are on your prioritized list of tasks to optimize, and you'll probably surprise yourself how quickly you'll be able to pick them all off when you focus on one task at a time.

Thursday, November 12, 2009

Why We Made Method R

Twenty years ago (well, a month or so more than that), I entered the Oracle ecosystem. I went to work as a consultant for Oracle Corporation in September 1989. Before Oracle, I had been a language designer and compiler developer. I wrote code in lex, yacc, and C for a living. My responsibilities had also included improving other people's C code: making it more reliable, more portable, easier to read, easier to prove, and easier to maintain; and it was my job to teach other people in my department how to do these things themselves. I loved all of these duties.

In 1987, I decided to leave what I loved for a little while, to earn an MBA. Fortunately, at that time, it was possible to earn an MBA in a year. After a year of very difficult work, I had my degree and a new perspective on business. I interviewed with Oracle, and about a week later I had a job with a company that a month prior I had never heard of.

By the mid-1990s, circumstances and my natural gravity had matched to create a career in which I was again a software developer, optimizer, and teacher. By 1998, I was the manager of a group of 85 performance specialists called the System Performance Group (SPG). And I was the leader of the system architecture and system management consulting service line within Oracle Consulting's Global Steering Committee.

My job in the SPG role was to respond to all the system performance-related issues in the USA for Oracle's largest accounts. My job in the Global Steering Committee was to package the success of SPG so that other practices around the world could repeat it. The theory was that if a country manager in, say, Venezuela, wanted his own SPG, then he could use the financial models, budgets, hiring plans, training plans, etc. created by my steering committee group. Just add water.

But there was a problem. My own group of 85 people consisted of two very different types of people. About ten of these 85 people were spectacularly successful optimizers whom I could send anywhere with confidence that they'd thrive at either improving performance or proving that performance improvements weren't possible. The other 75 were very smart, very hard-working people who would grow into the tip of my pyramid over the course of more years, but they weren't there yet.

The problem was, how do you convert good, smart, hard-working people in the base of the SPG pyramid into people in the tip? The practice manager in Venezuela would need to know that. The answer, of course, is supposed to be the Training Plan. Optimally, the Training Plan consists of a curriculum of a few courses, a little on-the-job training, and then, presto: tip of the pyramid. Just add water.

But unfortunately that wasn't the way things worked. What I had been getting instead, within my own elite group, was a process that took many years to convert a smart, hard-working person into a reasonably reliable performance optimizer whom you could send anywhere. Worse yet, the peculiar stresses of the job—like being away from home 80% of the time, continually visiting angry people each week, and having to work for me—caused an outflow of talent that approximately equaled the inflow of people who made it to the tip of the pyramid. The tip of my pyramid never grew beyond roughly 10 people.

The problem, by definition, was the Training Plan. It just wasn't good enough. It wasn't that the instructors of Oracle's internal "tuning" courses were doing a poor job of teaching courses. And it wasn't that the course developers had done a poor job of creating courses. On the contrary, the instructors and course developers were doing excellent work. The problem was that the courses were focusing on the wrong thing. The reason that the courses weren't getting the job done was that the very subject matter that needed teaching hadn't been invented yet.

I expect the people who write, say, the course called "Braking System Repair for Boeing 777" to have themselves invented the braking system they write about. So, the question was, who should be responsible for inventing the subject matter on how to optimize Oracle? I decided that I wanted that person to be me. I deliberated carefully and decided that my best chance of doing that the way I wanted to do it would be outside of Oracle. So in October 1999, ten years and one week after I joined the company, I left Oracle with the vision of creating a repeatable, teachable method for optimizing Oracle systems.

Ten years later, this is still the vision for my company, Method R Corporation. We exist not to make your system faster. We exist to make you faster at making all your systems faster. Our work is far from done, but here is what we have done:
  • Written white papers and other articles that explain Method R to you at no cost.
  • Written a book called Optimizing Oracle Performance, where you can learn Method R at a low cost.
  • Created a Method R course (on which the book is based), to teach you how to diagnose and repair response time problems in Oracle-based systems.
  • Spoken at hundreds of public and private events where we help people understand performance and how to manage it.
  • Provided consulting services to make people awesome at making their systems faster and more efficient.
  • Created the first response time profiling software ever for Oracle software applications, to let you analyze hundreds of megabytes of data without drudgery.
  • Created a free instrumentation library so that you can instrument the response times of Oracle-based software that you write.
  • Created software tools to help you be awesome at extracting every drop of information that your Oracle system is willing to give you about your response times.
  • Created a software tool that enables you to record the response time of every business task that runs on your system so you can effortlessly manage end-user performance.
As I said, our work is far from done. It's work that really, really matters to us, and it's work we love doing. I expect it to be a journey that will last long into the future. I hope that our journey will intersect with yours from time to time, and that you will enjoy it when it does.

Wednesday, September 16, 2009

On the Importance of Diagnosing Before Resolving

Today a reader posted a question I like at our Method R website. It's about the story I tell in the article called, "Can you explain Method R so even my boss could understand it?" The story is about sending your son on a shopping trip, and it takes him too long to complete the errand. The point is that an excellent way to fix any kind of performance problem is to profile the response time for the properly chosen task, which is the basis for Method R (both the method and the company).

Here is the profile that details where the boy's time went during his errand:
                   --Duration---
Subtask            minutes    % Executions
------------------ ------- ---- ----------
Talk with friends       37  62%          3
Choose item             10  17%          5
Walk to/from store       8  13%          2
Pay cashier              5   8%          1
------------------ ------- ---- ----------
Total                   60 100%
I went on to describe that the big leverage in this profile is the elimination of the subtask called "Talk with friends," which will reduce response time by 62%.
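If it helps to see it mechanically: a profile is nothing more than (subtask, duration, executions) measurements, aggregated and sorted by descending duration. A tiny Python sketch that reproduces the table above:

# The boy's errand, as measured: (subtask, minutes, executions).
errand = [
    ("Talk with friends",  37, 3),
    ("Choose item",        10, 5),
    ("Walk to/from store",  8, 2),
    ("Pay cashier",         5, 1),
]

total = sum(m for _, m, _ in errand)
print(f"{'Subtask':18s} {'minutes':>7s} {'%':>4s} {'Executions':>10s}")
for name, minutes, execs in sorted(errand, key=lambda r: -r[1]):
    print(f"{name:18s} {minutes:7d} {minutes/total:4.0%} {execs:10d}")
print(f"{'Total':18s} {total:7d} {1:4.0%}")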

The interesting question that a reader posted is this:
Not sure this is always the right approach. For example, lets imagine the son has to pick 50 items
Talk 3 times 37 minutes
Choose item 50 times 45 minutes
Walk 2 times 8 minutes
Pay 1 time 5 minutes
Working on "choose item" is maybe not the right thing to do...
Let's explore it. Here's what the profile would look like if this were to happen:
        --Duration---
Subtask minutes    % Executions
------- ------- ---- ----------
Choose       45  47%         50
Talk         37  39%          3
Walk          8   8%          2
Pay           5   5%          1
------- ------- ---- ----------
Total        95 100%
The point of the inquiry is this:
The right answer in this case, too, is to begin with eliminating Talk from the profile. That's because, even though it's not ranked at the very top of the profile, Talk is completely unnecessary to the real goal (grocery shopping). It's a time-waster that simply shouldn't be in the profile. At all. But with Cary's method of addressing the profile from the top downward, you would instead focus on the "Choose" line, which is the wrong thing.
In chapters 1 through 4 of our book about Method R for Oracle, I explained the method much more thoroughly than I did in the very brief article. In my brevity, I skipped past an important point. Here's a summary of the Method R steps for diagnosing and resolving performance problems using a profile:
  1. (Diagnosis phase) For each subtask (row in the profile), visiting subtasks in order of descending duration...
    a. Can you eliminate any executions without sacrificing required function?
    b. Can you improve (reduce) individual execution latency?
  2. (Resolution phase) Choose the candidate solution with the best net value (that is, the greatest value of benefit minus cost).
Here's a narrative of executing the steps of the diagnostic phase, one at a time, upon the new profile, which—again—is this:
        --Duration---
Subtask minutes    % Executions
------- ------- ---- ----------
Choose       45  47%         50
Talk         37  39%          3
Walk          8   8%          2
Pay           5   5%          1
------- ------- ---- ----------
Total        95 100%
  1. Execution elimination for the Choose subtask: If you really need all 50 items, then no, you can't eliminate any Choose executions.
  2. Latency optimization for the Choose subtask: Perhaps you could optimize the mean latency (which is .9 minutes per item). My wife does this. For example, she knows better where the items in the store are, so she spends less time searching for them. (I, on the other hand, can get lost in my own shower.) If, for example, you could reduce mean latency to, say, .8 minutes per item by giving your boy a map, then you could save (.9 – .8) × 50 = 5 minutes (5%). (Note that we don't execute the solution yet; we're just diagnosing right now.)
  3. Execution elimination for the Talk subtask: Hmm, seems to me like if your true goal is fast grocery shopping, then you don't need your boy executing any of these 3 Talk events. Proposed time savings: 37 minutes (39%).
  4. Latency optimization for the Talk subtask: Since you can eliminate all Talk calls, no need to bother thinking about latency reduction. ...Unless you're prey to some external constraint (like social advancement, say, in attempt to maximize your probability of having rich and beautiful grandchildren someday), in which case you should think about latency reduction instead of execution elimination.
  5. Execution elimination for the Walk subtask: Well, the boy has to get there, and he has to get back, so this "executions=2" figure looks optimal. (Those Oracle applications we often see that process one row per network I/O call would have 50 Walk executions, one for each Choose call.)
  6. Latency optimization for the Walk subtask: Walking takes 4 minutes each way. Driving might take less time, but then again, it might actually take even more. Will driving introduce new dependent subtasks? Warm Up? Park? De-ice? Even driving doesn't eliminate all the walking... Plus, there's not a lot of leverage in optimizing Walk, because it accounts for only 8% of total response time to begin with, so it's not worth a whole lot of bother trying to shave it down by some marginal proportion, especially since inserting a car into your life (or letting your boy drive yours) is no trivial matter.
  7. Execution elimination for the Pay subtask: The execution count on Pay is already optimized down to the legally required minimum. No real opportunity for improvement here without some kind of radical architecture change.
  8. Latency optimization for the Pay subtask: It takes 5 minutes to Pay? That seems a bit much. So you should look at the payment process. Or should you? Even if you totally eliminate Pay from the profile, it's only going to save 5% of your time. But, if every minute counts, then yes, you look at it. ...Especially if there might be an easy way to improve it. If the benefit comes at practically no cost, then you'll take it, even if the benefit is only small. So, imagine that you find out that the reason Pay was so slow is that it was executed by writing a check, which required waiting for store manager approval. Using cash or a credit/debit card might improve response time by, say, 4 minutes (4%).
Now you're done assessing the effects of (1) execution elimination and (2) latency reduction for each line in the profile. That ends the diagnostic phase of the method. The next step is the resolution phase: to determine which of these candidate solutions is the best. Given the analysis I've walked you through, I'd rank the candidate solutions in this order:
  1. Eliminate all 3 executions of Talk. That'll save 37 minutes (39%), and it's easy to implement; you don't have to buy a car, apply for a credit card, train the boy how to shop faster, or change the architecture of how shopping works. You can simply discard the "requirement" to chat, or you can specify that it be performed only during non-errand time windows.
  2. Optimize Pay latency by using cash or a card, if it's easy enough to give your boy access to cash or a card. That will save 4 minutes, which—by the way—will be a more important proportion of the total errand time after you eliminate all the Talk executions.
  3. Finally, consider optimizing Choose latency. Maybe giving your son a map of the store will help. Maybe you should print your grocery list more neatly so he can read it without having to ask for help. Maybe by simply sending him to the store more often, he'll get faster as his familiarity with the place improves.
That's it.

So the point I want to highlight is this:
I'm not saying you should stick to the top line of your profile until you've absolutely conquered it.
It is important to pass completely through your profile to construct your set of candidate solutions. Then, on a separate pass, you evaluate those candidate solutions to determine which ones you want to implement, and in what order. That first full pass is key. You have to do it for Method R to be reliable for solving any performance problem.
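In code form, those two passes over the reader’s 50-item profile might look like the following sketch. The minutes saved come from the analysis above; the cost ranks are invented labels whose only purpose is to make the net-value comparison concrete.

profile = {"Choose": 45, "Talk": 37, "Walk": 8, "Pay": 5}  # minutes
total = sum(profile.values())                              # 95 minutes

# Pass 1 (diagnosis): visit every profile line, top to bottom, and
# collect candidate solutions: (description, minutes saved, cost rank).
candidates = [
    ("Eliminate all 3 Talk executions",        37, 0),  # trivial to implement
    ("Pay with cash/card instead of a check",   4, 1),
    ("Cut Choose latency with a store map",     5, 2),
]

# Pass 2 (resolution): rank by net value (cheapest, biggest savings
# first), not by position in the profile.
for what, saved, cost in sorted(candidates, key=lambda c: (c[2], -c[1])):
    print(f"{what:40s} saves {saved:2d} min ({saved/total:4.0%}), cost rank {cost}")

Notice that Talk wins the resolution phase even though Choose tops the profile, which is precisely the reader’s point, and precisely why the diagnosis pass must visit every line before you choose anything.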

Friday, April 24, 2009

The Most Common Performance Problem I See

At the Percona Performance Conference in Santa Clara this week, the first question an audience member asked our panel was, "What is the most common performance problem you see in the field?"

I figured, being an Oracle guy at a MySQL conference, this might be my only chance to answer something, so I went for the mic. Here is my answer.
The most common performance problem I see is people who think there's a most-common performance problem that they should be looking for, instead of measuring to find out what their actual performance problem actually is.
It's a meta answer, but it's a meta problem. The biggest performance problems I see, and the ones I see most often, are not problems with machines or software. They're problems with people who don't have a reliable process of identifying the right thing to work on in the first place.

That's why the definition of Method R doesn't mention Oracle, or databases, or even computers. It's why Optimizing Oracle Performance spends the first 69 pages talking about red rocks and informed consent and Eli Goldratt instead of Oracle, or databases, or even computers.

The most common performance problem I see is that people guess instead of knowing. The worst cases are when people think they know because they're looking at data, but they really don't know, because they're looking at the wrong data. Unfortunately, every case of guessing that I ever see is this worst case, because nobody in our business goes very far without consulting some kind of data to justify his opinions. Tim Cook from Sun Microsystems pointed me yesterday to a blog post that gives a great example of that illusion of knowing when you really don't.