## Wednesday, September 16, 2009

### On the Importance of Diagnosing Before Resolving

Today a reader posted a question I like at our Method R website. It's about the story I tell in the article called "Can you explain Method R so even my boss could understand it?" The story is about sending your son on a shopping trip that takes him too long to complete. The point is that an excellent way to fix any kind of performance problem is to profile the response time for a properly chosen task, which is the basis for Method R (both the method and the company).

Here is the profile that details where the boy's time went during his errand:

| Subtask | Duration (minutes) | Duration (%) | Executions |
|---|---:|---:|---:|
| Talk with friends | 37 | 62% | 3 |
| Choose item | 10 | 17% | 5 |
| Walk to/from store | 8 | 13% | 2 |
| Pay cashier | 5 | 8% | 1 |
| **Total** | 60 | 100% | |

I went on to describe that the big leverage in this profile is the elimination of the subtask called "Talk with friends," which will reduce response time by 62%.
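
For readers who want to see where such a profile comes from, here is a minimal Python sketch that aggregates a timing log into the table above. The per-execution durations in `events` are my own assumption, chosen so that they add up to the totals in the story; only the totals and execution counts come from the article, and the function name `build_profile` is hypothetical.

```python
from collections import defaultdict

# Hypothetical raw timing log: one (subtask, minutes) entry per execution.
# The split across executions is assumed; only the totals match the story.
events = [
    ("Talk with friends", 20), ("Talk with friends", 10), ("Talk with friends", 7),
    ("Choose item", 2), ("Choose item", 2), ("Choose item", 2),
    ("Choose item", 2), ("Choose item", 2),
    ("Walk to/from store", 4), ("Walk to/from store", 4),
    ("Pay cashier", 5),
]

def build_profile(events):
    """Aggregate events into (subtask, minutes, percent, executions) rows,
    sorted by descending duration, plus the total response time."""
    minutes = defaultdict(float)
    executions = defaultdict(int)
    for subtask, duration in events:
        minutes[subtask] += duration
        executions[subtask] += 1
    total = sum(minutes.values())
    rows = [
        (name, minutes[name], 100.0 * minutes[name] / total, executions[name])
        for name in minutes
    ]
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows, total

profile, total_minutes = build_profile(events)
for name, mins, pct, execs in profile:
    print(f"{name:<20}{mins:>6.0f}{pct:>5.0f}%{execs:>4}")
print(f"{'Total':<20}{total_minutes:>6.0f}")
```

The sort by descending duration is what puts the biggest contributor to response time at the top of the table.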

The interesting question that a reader posted is this:

> Not sure this is always the right approach. For example, let's imagine the son has to pick 50 items:
>
> - Talk: 3 times, 37 minutes
> - Choose item: 50 times, 45 minutes
> - Walk: 2 times, 8 minutes
> - Pay: 1 time, 5 minutes
>
> Working on "choose item" is maybe not the right thing to do...

Let's explore it. Here's what the profile would look like if this were to happen:

| Subtask | Duration (minutes) | Duration (%) | Executions |
|---|---:|---:|---:|
| Choose | 45 | 47% | 50 |
| Talk | 37 | 39% | 3 |
| Walk | 8 | 8% | 2 |
| Pay | 5 | 5% | 1 |
| **Total** | 95 | 100% | |

The point of the inquiry is this:

> The right answer in this case, too, is to begin with eliminating Talk from the profile. That's because, even though it's not ranked at the very top of the profile, Talk is completely unnecessary to the real goal (grocery shopping). It's a time-waster that simply shouldn't be in the profile. At all. But with Cary's method of addressing the profile from the top downward, you would instead focus on the "Choose" line, which is the wrong thing.

In chapters 1 through 4 of our book about Method R for Oracle, I explained the method much more thoroughly than I did in the very brief article. In my brevity, I skipped past an important point. Here's a summary of the Method R steps for diagnosing and resolving performance problems using a profile:

1. (Diagnosis phase) For each subtask (row in the profile), visiting subtasks in order of descending duration...
   1. Can you eliminate any executions without sacrificing required function?
   2. Can you improve (reduce) individual execution latency?
2. (Resolution phase) Choose the candidate solution with the best net value (that is, the greatest value of benefit minus cost).

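
As a rough illustration of how the two phases fit together, here is a small Python sketch. All names in it (`Subtask`, `Candidate`, `diagnose`, `resolve`, and the `propose` callback) are hypothetical, and it assumes that benefit and cost can be expressed in the same unit (minutes here), which is a simplification. The analyst's judgment lives in `propose`; the code only enforces the order of the steps.

```python
from dataclasses import dataclass

@dataclass
class Subtask:
    name: str
    minutes: float
    executions: int

@dataclass
class Candidate:
    description: str
    minutes_saved: float  # estimated benefit
    cost: float           # estimated cost, assumed to be in the same unit

    @property
    def net_value(self):
        return self.minutes_saved - self.cost

QUESTIONS = ("eliminate executions", "reduce latency")

def diagnose(profile, propose):
    """Phase 1: visit every row in descending duration, ask both questions,
    and collect whatever candidates the analyst proposes (None means no idea)."""
    candidates = []
    for subtask in sorted(profile, key=lambda s: s.minutes, reverse=True):
        for question in QUESTIONS:
            candidate = propose(subtask, question)
            if candidate is not None:
                candidates.append(candidate)
    return candidates

def resolve(candidates):
    """Phase 2: choose the candidate with the best net value."""
    return max(candidates, key=lambda c: c.net_value, default=None)

# Usage with the original 60-minute errand; the zero cost is an assumption.
ideas = {
    ("Talk with friends", "eliminate executions"): Candidate("Skip the chats", 37, 0),
}
errand = [
    Subtask("Talk with friends", 37, 3),
    Subtask("Choose item", 10, 5),
    Subtask("Walk to/from store", 8, 2),
    Subtask("Pay cashier", 5, 1),
]
best = resolve(diagnose(errand, lambda s, q: ideas.get((s.name, q))))
print(best.description, best.net_value)
```

Note that `diagnose` never stops early: it walks the whole profile before `resolve` is allowed to pick anything.
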
Here's a narrative of executing the steps of the diagnostic phase, one at a time, upon the new profile, which—again—is this:

| Subtask | Duration (minutes) | Duration (%) | Executions |
|---|---:|---:|---:|
| Choose | 45 | 47% | 50 |
| Talk | 37 | 39% | 3 |
| Walk | 8 | 8% | 2 |
| Pay | 5 | 5% | 1 |
| **Total** | 95 | 100% | |

1. Execution elimination for the Choose subtask: If you really need all 50 items, then no, you can't eliminate any Choose executions.
2. Latency optimization for the Choose subtask: Perhaps you could optimize the mean latency (which is 45 ÷ 50 = 0.9 minutes per item). My wife does this. For example, she knows better where the items in the store are, so she spends less time searching for them. (I, on the other hand, can get lost in my own shower.) If, for example, you could reduce mean latency to, say, 0.8 minutes per item by giving your boy a map, then you could save (0.9 - 0.8) × 50 = 5 minutes (5%). (Note that we don't execute the solution yet; we're just diagnosing right now.)
3. Execution elimination for the Talk subtask: Hmm, seems to me like if your true goal is fast grocery shopping, then you don't need your boy executing any of these 3 Talk events. Proposed time savings: 37 minutes (39%).
4. Latency optimization for the Talk subtask: Since you can eliminate all Talk calls, no need to bother thinking about latency reduction. ...Unless you're prey to some external constraint (like social advancement, say, in an attempt to maximize your probability of having rich and beautiful grandchildren someday), in which case you should think about latency reduction instead of execution elimination.
5. Execution elimination for the Walk subtask: Well, the boy has to get there, and he has to get back, so this "executions=2" figure looks optimal. (Those Oracle applications we often see that process one row per network I/O call would have 50 Walk executions, one for each Choose call.)
6. Latency optimization for the Walk subtask: Walking takes 4 minutes each way. Driving might take less time, but then again, it might actually take even more. Will driving introduce new dependent subtasks? Warm Up? Park? De-ice? Even driving doesn't eliminate all the walking... Plus, there's not a lot of leverage in optimizing Walk, because it accounts for only 8% of total response time to begin with, so it's not worth a whole lot of bother trying to shave it down by some marginal proportion, especially since inserting a car into your life (or letting your boy drive yours) is no trivial matter.
7. Execution elimination for the Pay subtask: The execution count on Pay is already optimized down to the legally required minimum. No real opportunity for improvement here without some kind of radical architecture change.
8. Latency optimization for the Pay subtask: It takes 5 minutes to Pay? That seems a bit much. So you should look at the payment process. Or should you? Even if you totally eliminate Pay from the profile, it's only going to save 5% of your time. But, if every minute counts, then yes, you look at it. ...Especially if there might be an easy way to improve it. If the benefit comes at practically no cost, then you'll take it, even if the benefit is only small. So, imagine that you find out that the reason Pay was so slow is that it was executed by writing a check, which required waiting for store manager approval. Using cash or a credit/debit card might improve response time by, say, 4 minutes (4%).
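
The savings quoted in the walk-through come from two small formulas, one for each diagnostic question. Here is a sketch of that arithmetic in Python. The function names are mine, the elimination formula assumes every execution costs the mean latency, and the 1-minute Pay latency is simply the 5 minutes in the profile minus the 4 minutes of estimated improvement.

```python
def elimination_savings(total_minutes, executions, eliminated):
    """Minutes saved by removing `eliminated` of `executions` calls,
    assuming each call costs the mean latency."""
    mean_latency = total_minutes / executions
    return eliminated * mean_latency

def latency_savings(executions, old_mean, new_mean):
    """Minutes saved by making each remaining call faster."""
    return executions * (old_mean - new_mean)

TOTAL = 95  # total minutes in the 50-item profile

candidates = {
    "Choose: map cuts mean latency 0.9 -> 0.8": latency_savings(50, 45 / 50, 0.8),
    "Talk: eliminate all 3 executions": elimination_savings(37, 3, 3),
    "Pay: cash or card instead of check, 5 -> 1": latency_savings(1, 5, 1),
}

for description, saved in candidates.items():
    print(f"{description:<45}{saved:5.1f} min {100 * saved / TOTAL:5.1f}%")
```

None of these candidates has been implemented at this point; the numbers are estimates collected during diagnosis.
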
Now you're done assessing the effects of (1) execution elimination and (2) latency reduction for each line in the profile. That ends the diagnostic phase of the method. The next step is the resolution phase: to determine which of these candidate solutions is the best. Given the analysis I've walked you through, I'd rank the candidate solutions in this order:
1. Eliminate all 3 executions of Talk. That'll save 37 minutes (39%), and it's easy to implement; you don't have to buy a car, apply for a credit card, train the boy how to shop faster, or change the architecture of how shopping works. You can simply discard the "requirement" to chat, or you can specify that it be performed only during non-errand time windows.
2. Optimize Pay latency by using cash or a card, if it's easy enough to give your boy access to cash or a card. That will save 4 minutes, which—by the way—will be a more important proportion of the total errand time after you eliminate all the Talk executions.
3. Finally, consider optimizing Choose latency. Maybe giving your son a map of the store will help. Maybe you should print your grocery list more neatly so he can read it without having to ask for help. Maybe by simply sending him to the store more often, he'll get faster as his familiarity with the place improves.
That's it.
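
To see why the 4 minutes saved on Pay matters more once Talk is gone, here is a short Python sketch that applies the three fixes in the ranked order and recomputes the remaining errand time after each one. It assumes the estimated savings are independent and simply add up, which real systems do not always honor.

```python
total = 95.0  # total minutes in the 50-item profile

# Candidate fixes in the ranked order above, with the savings estimated
# during diagnosis. Treating them as independent is an assumption.
fixes = [
    ("Eliminate Talk", 37.0),
    ("Pay with cash or card", 4.0),
    ("Speed up Choose with a map", 5.0),
]

remaining = total
history = []
for name, saved in fixes:
    share = 100.0 * saved / remaining  # share of the time left before this fix
    remaining -= saved
    history.append((name, saved, share, remaining))
    print(f"{name:<28}saves {saved:3.0f} min "
          f"({share:4.1f}% of what was left), leaving {remaining:.0f} min")
```

Measured against the original 95 minutes, the Pay fix is worth about 4%; measured against the 58 minutes left after Talk is eliminated, it is worth about 7%.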

So the point I want to highlight is this:
I'm not saying you should stick to the top line of your profile until you've absolutely conquered it.
It is important to pass completely through your profile to construct your set of candidate solutions. Then, on a separate pass, you evaluate those candidate solutions to determine which ones you want to implement, and in what order. That first full pass is key. You have to do it for Method R to be reliable for solving any performance problem.

Marcin Przepiorowski said...

Cary,

The best business-level explanation of an Oracle wait-related method I have ever seen. This is a main point to explain a technical staff to managers.

regards,
Marcin

Brian Tkatch said...

Cary,

Excellent explanation.

When I read Method-R, it seems like common sense. Yet, in practice, it doesn't always happen. Perhaps we need more practice in doing this method. More examples, perhaps?

Cary Millsap said...

Marcin and Brian,

Thanks, guys. I'm hopeful that posts like this one and papers like this one help, too. I appreciate the encouragement. I'll keep chipping away at it.

—Cary

Joel Garry said...

I saw a place where all the programmers were in two rows, with their manager at the end.

No unnecessary talking. Very productive. But, is the productivity measured correctly? Would perhaps some programmers not want to work under such conditions?

That was just a temporary situation due to construction, so it didn't really hurt as much as if that were the norm.

But I saw another place where it was the norm for the desktop support group. Yes, they had to raise their hands to go to the bathroom. People would indeed say, take this job and shove it. I was amazed when one of those came back, having discovered some desktop support jobs are worse. I would just shake my head and be glad I was a DBA.

Many IT managers think that yelling and controlling with an iron fist is a good thing. Please don't give them more ammunition under the guise of "science" (particularly Taylor Scientific Management, if you think I'm being facetious). That's a typical green MBA and old-school IT manager mistake. In the context of method-R, it would be the business incorrectly defining the importance of tasks. Would it really work telling teenagers they would be more productive not talking to their friends?

Cary Millsap said...

In my MBA program, we had to work out a problem in our Probability & Statistics course that went something along the lines of "Mary writes 50 lines of code per day; John writes 30 lines of code per day. How much should you pay each programmer?"

I did the a*x, a*y thing that the instructor was obviously calling for, and then I appended an essay to my answer. I explained why, if John's code performs a function equivalent to Mary's (or better), then, actually, he should probably be paid more than she. I might have mentioned that maybe his talents could be leveraged further by having him teach others in his department how to emulate his efficiency.

I'm pretty sure that most people in the program left thinking that basically you should pay these aliens called computer programmers in linear proportion to the average number of lines of code they can excrete in a day. My command of the language is inadequate to describe how disgusting I find that thought.

So I think I understand your point. ...Which is why I wrote the crack in the original post about the rich and beautiful grandchildren. At the point in the analysis where I wrote that, you have to think about what your constraints and goals really, truly are.

In some cases, urgency is well and truly the right goal. Perhaps you would have found my example less ambiguous if, instead of grocery shopping, I'd had the son in the role of professional courier of the human liver that's needed 10 minutes away for an emergency transplant.

Anonymous said...

"If that’s not enough 'R' for you, there’s another item in the 'Q' [...]"

Log Buffer #162

Kerry Osborne said...

Cary,

Well presented, as usual. Of course, as you point out, the magic is not in being able to create the profile, but knowing what to do with it. You've presented here (and elsewhere) a good description of the process (eliminate unnecessary iterations, re-profile, then speed up the operations of the top contributors). I think the weighting of the options with their costs and risks and potential benefit can be a very complicated task that human brains are better suited to than computer algorithms. Maybe that's why the auto tuning tools struggle with anything that's not fairly simple.

I once tried to use a bubble chart to help identify which items to address first (actually more to use as an argument for convincing people that my opinion was correct). I had Cost on the X axis and Risk on the Y axis. Potential benefit determined the size of the bubble. So you'd look for big bubbles close to the origin of the graph. It was an interesting way to display the data, but the raw data was still somewhat subjective (I had to arrive at a numeric value for risk, for example, which was really not based on anything I could measure).

You also pointed out that you need an understanding of the events that are taking place and how long they should take. Knowing how far it is to the store and what the speed limit is allows you to make reasonable decisions about whether to attempt speeding up that part of the process.

Thanks for taking the time to share.

Kerry

Prem said...

Cary,

That was a simple-neat explanation.
Good analogy. I liked it.

~ prem ~
