Yesterday I finally posted to the Method R website some of the papers I wrote while I was at Oracle Corporation (1989–1999). You can now find these papers where I am.
When I was uploading my OFA Standard paper, I noticed that today—24 September 2009—is the fourteenth birthday of its publication date. So, even though the original OFA paper was around for a few years before 1995, please join me in celebrating the birthday of the final version of the official OFA Standard document.
My web log for things I’m interested in, including design, software development, performance analysis, learning, and running a business.
Thursday, September 24, 2009
Wednesday, September 16, 2009
On the Importance of Diagnosing Before Resolving
Today a reader posted a question I like at our Method R website. It's about the story I tell in the article called, "Can you explain Method R so even my boss could understand it?" The story is about sending your son on a shopping trip, and it takes him too long to complete the errand. The point is that an excellent way to fix any kind of performance problem is to profile the response time for the properly chosen task, which is the basis for Method R (both the method and the company).
Here is the profile that details where the boy's time went during his errand:
The interesting question that a reader posted is this:
So the point I want to highlight is this:
Here is the profile that details where the boy's time went during his errand:
--Duration---I went on to describe that the big leverage in this profile is the elimination of the subtask called "Talk with friends," which will reduce response time by 62%.
Subtask minutes % Executions
------------------ ------- ---- ----------
Talk with friends 37 62% 3
Choose item 10 17% 5
Walk to/from store 8 13% 2
Pay cashier 5 8% 1
------------------ ------- ---- ----------
Total 60 100%
The interesting question that a reader posted is this:
Not sure this is always the right approach. For example, lets imagine the son has to pick 50 itemsLet's explore it. Here's what the profile would look like if this were to happen:
Talk 3 times 37 minutes
Choose item 50 times 45 minutes
Walk 2 times 8 minutes
Pay 1 time 5 minutes
Working on "choose item" is maybe not the right thing to do...
--Duration---The point of the inquiry is this:
Subtask minutes % Executions
------- ------- ---- ----------
Choose 45 47% 50
Talk 37 39% 3
Walk 8 8% 2
Pay 5 5% 1
------- ------- ---- ----------
Total 95 100%
The right answer in this case, too, is to begin with eliminating Talk from the profile. That's because, even though it's not ranked at the very top of the profile, Talk is completely unnecessary to the real goal (grocery shopping). It's a time-waster that simply shouldn't be in the profile. At all. But with Cary's method of addressing the profile from the top downward, you would instead focus on the "Choose" line, which is the wrong thing.In chapters 1 through 4 of our book about Method R for Oracle, I explained the method much more thoroughly than I did in the very brief article. In my brevity, I skipped past an important point. Here's a summary of the Method R steps for diagnosing and resolving performance problems using a profile:
- (Diagnosis phase) For each subtask (row in the profile), visiting subtasks in order of descending duration...
- Can you eliminate any executions without sacrificing required function?
- Can you improve (reduce) individual execution latency?
- (Resolution phase) Choose the candidate solution with the best net value (that is, the greatest value of benefit minus cost).
--Duration---
Subtask minutes % Executions
------- ------- ---- ----------
Choose 45 47% 50
Talk 37 39% 3
Walk 8 8% 2
Pay 5 5% 1
------- ------- ---- ----------
Total 95 100%
- Execution elimination for the Choose subtask: If you really need all 50 items, then no, you can't eliminate any Choose executions.
- Latency optimization for the Choose subtask: Perhaps you could optimize the mean latency (which is .9 minutes per item). My wife does this. For example, she knows better where the items in the store are, so she spends less time searching for them. (I, on the other hand, can get lost in my own shower.) If, for example, you could reduce mean latency to, say, .8 minutes per item by giving your boy a map, then you could save (.9 – .8) × 50 = 5 minutes (5%). (Note that we don't execute the solution yet; we're just diagnosing right now.)
- Execution elimination for the Talk subtask: Hmm, seems to me like if your true goal is fast grocery shopping, then you don't need your boy executing any of these 3 Talk events. Proposed time savings: 37 minutes (39%).
- Latency optimization for the Talk subtask: Since you can eliminate all Talk calls, no need to bother thinking about latency reduction. ...Unless you're prey to some external constraint (like social advancement, say, in attempt to maximize your probability of having rich and beautiful grandchildren someday), in which case you should think about latency reduction instead of execution elimination.
- Execution elimination for the Walk subtask: Well, the boy has to get there, and he has to get back, so this "executions=2" figure looks optimal. (Those Oracle applications we often see that process one row per network I/O call would have 50 Walk executions, one for each Choose call.)
- Latency optimization for the Walk subtask: Walking takes 4 minutes each way. Driving might take less time, but then again, it might actually take even more. Will driving introduce new dependent subtasks? Warm Up? Park? De-ice? Even driving doesn't eliminate all the walking... Plus, there's not a lot of leverage in optimizing Walk, because it accounts for only 8% of total response time to begin with, so it's not worth a whole lot of bother trying to shave it down by some marginal proportion, especially since inserting a car into your life (or letting your boy drive yours) is no trivial matter.
- Execution elimination for the Pay subtask: The execution count on Pay is already optimized down to the legally required minimum. No real opportunity for improvement here without some kind of radical architecture change.
- Latency optimization for the Pay subtask: It takes 5 minutes to Pay? That seems a bit much. So you should look at the payment process. Or should you? Even if you totally eliminate Pay from the profile, it's only going to save 5% of your time. But, if every minute counts, then yes, you look at it. ...Especially if there might be an easy way to improve it. If the benefit comes at practically no cost, then you'll take it, even if the benefit is only small. So, imagine that you find out that the reason Pay was so slow is that it was executed by writing a check, which required waiting for store manager approval. Using cash or a credit/debit card might improve response time by, say, 4 minutes (4%).
- Eliminate all 3 executions of Talk. That'll save 37 minutes (39%), and it's easy to implement; you don't have to buy a car, apply for a credit card, train the boy how to shop faster, or change the architecture of how shopping works. You can simply discard the "requirement" to chat, or you can specify that it be performed only during non-errand time windows.
- Optimize Pay latency by using cash or a card, if it's easy enough to give your boy access to cash or a card. That will save 4 minutes, which—by the way—will be a more important proportion of the total errand time after you eliminate all the Talk executions.
- Finally, consider optimizing Choose latency. Maybe giving your son a map of the store will help. Maybe you should print your grocery list more neatly so he can read it without having to ask for help. Maybe by simply sending him to the store more often, he'll get faster as his familiarity with the place improves.
So the point I want to highlight is this:
I'm not saying you should stick to the top line of your profile until you've absolutely conquered it.It is important to pass completely through your profile to construct your set of candidate solutions. Then, on a separate pass, you evaluate those candidate solutions to determine which ones you want to implement, and in what order. That first full pass is key. You have to do it for Method R to be reliable for solving any performance problem.