Cary Millsap: 2009

Tuesday, December 22, 2009

The “Do What You Love” Mirage

I am inspired by having read an article called “Do what you love mirage” by Denis Basaric. It begins...

“Do what you love” is advice I hear exclusively from financially secure people. And it rings hollow to me. When you need money to survive, you do any work that is available, love does not play into that choice. Desperation does.

Please read it before you go on.

Welcome back.

This article puts a very important cycle within my life into words. I believe, as Denis says, that a lot of times, we get the cause-effect relationship mixed up when we think about loving what we do.

I love what I do. Well, a lot of it. But Denis is right: I didn’t choose what I do out of love. I chose what I love out of doing. Some examples:

I love mathematics. But I most assuredly did not always love it. I learned to love it through working hard at it.
It’s the same thing with writing. I love it, but I didn’t always. At first, writing was unrewarding drudgery, which is how most people I meet seem to feel about it.
I love public speaking, but I sure didn’t love it when my speech class made me sick to my stomach three mornings a week for a whole semester my freshman year.
I love being an Oracle performance specialist, but I sure didn’t love being airlifted into crisis after crisis throughout the early 1990s.

I could go on. The point is, my life would be unrecognizably different if not for several really painful situations that I decided to endure with the resolve to get really good at what I hated. Until I loved it.

In retrospect, I seem to have been very lucky in many important situations. Of course, I have. But you make your own luck. Although I believe deeply in the idea of, “The harder I work, the luckier I get,” that is not what I’m talking about here. I’m talking about the power that you have to define for yourself whether something that happened was lucky for you or not. Your situations do not define your life. You create your life based on how you regard your situations.

I could have rebelled against Jimmy Harkey and hated math for the rest of my life. Lots of kids did. I could have rebelled against Lewis Parkhill and never become a writer. I could have refused Craig Newberger’s advice to take his second speech course and never become comfortable in front of an audience. I could have left Oracle in 1991 and found a job where they had more mature products....

One of the most important questions that I ever asked my wife before our engagement was this:

If you were forced to wash cars for 12 hours a day, just to make a living, could you enjoy it?

This is a “soulmate” kind of question for me. My wife’s attitude about it is, for our children and me, possibly the most valuable gift in our lives.

Loving what you do can be difficult. I think Denis hits the nail on the head by suggesting that,

By doing good work, you just might find out that what you are doing, is what you are supposed to do. And if you don’t, quality work will get you to where you want to be.

I hope you will find love in what you do today. Do it well, and it’ll definitely improve your odds.

Monday, December 21, 2009

My Whole System Is Slow. Now What?

At CMG'09 a couple of weeks ago, I presented "Measuring Response Times of Code on Oracle Systems." The paper for this presentation was a subset of "For Developers: Making Friends with the Oracle Database." In the presentation, I spent a few minutes talking about why to measure response times in Oracle, and then I spent a lot of minutes talking about how. As usual, I focused heavily on the importance of measuring response times of individual business tasks executed by individual end users.

At the end of the talk, a group of people came to the podium to ask questions (always a good sign). The first question was the question that a lot of people ask. It was:

My whole system is slow. That's all my users will tell me. So then, how do I begin to do what you're describing?

Here's the answer:

Ask your users to show you what they're doing. Just go look at it.

The results of this simple advice are routinely spectacular. Just go look at it: I'm surprised whenever someone doesn't think of doing that, but I shouldn't be. That's because I didn't do it either, for the longest time. I had to learn to. And that's the story I want to tell you here.

In the early 1990s, I was a consultant with Oracle Corporation visiting clients with performance problems at a pace of more than 30 per year. Back then, I did Oracle performance work the old fashioned way: I checked everything I knew how to check, and then I fixed everything I knew how to fix. All billable by the hour. (Note: When I was doing it this way, I had not yet been taught by Dave Ensor, who changed me forever.)

On weeks when I was lucky, I'd be finished checking and fixing by sometime Wednesday, leaving a couple of days to find out what people thought of my work. If I were lucky again (that's two "lucky"s now), everyone would be thrilled with the results. I'd get my hug (so to speak), and I'd catch my flight.

But I wasn't always lucky. Some weeks, I wouldn't find anything suspicious in my checking and fixing. Some weeks, I'd find plenty, but still not everyone would be thrilled with the work. Having people be less than thrilled with my work caused pain for me, which motivated me to figure out how to take more control of my consulting engagements, to drive luck out of the equation.

The most important thing I figured out was...

People knew before I came on-site how they were going to measure on Thursday whether they liked the results of my work.

And...

They were willing to tell me on Monday.

All I had to do was be honest, like this:

On the day I'm done working here, I'd imagine you're going to want to run something that will demonstrate whether I accomplished what you were hoping for while I was here. Would you mind telling me about that now? Maybe even showing me?

I could ask that on Monday, and people were glad to tell me. I'd watch the things run and record how long they ran, and then I'd know how to prioritize my time on site. I'd record how long they ran so at the end of my engagement, I'd be able to show very clearly what improvements I had made.

Sometimes, there would be thirty different things that people would expect to measure on Thursday. If I might not have time to fix them all, then I needed to make sure that I knew the priority of the things I was being asked to fix.

That one step alone—knowing on Monday that prioritized list of what tasks needed to be fast by Thursday—drastically reduced my reliance on luck as a success factor in my job at these sites. Knowing that list on Monday is just like when your teacher in school tells you exactly what's going to be on your next test. It allows you to focus your attention on exactly what you need to do to optimize your reward for the week. (Note to fellow education enthusiasts: Please don't interpret this paragraph as my advocating the idea that GPA should be a student's sole—or even dominant—optimization constraint.)

So, what I learned is that the very first step of any good performance optimization method is necessarily this:

1. Identify the task that's the most important to you.

When I say "task," think "program" or "click" or "batch job" if you want to. What I mean is "a useful unit of work that makes sense to the business." ...Something that a business user would show you if you just went and watched her work for a few minutes.

Then comes step two:

2. Measure its response time (R). In detail.

Why is response time so important? Because that's what's important to the person who'll be watching it run on Thursday, assessing whether she thinks you've done a good job or not. That person's going to click and then wait. Happiness will be inversely proportional to how long the wait is. That's it. That's what "performance" means at 99% of sites I've ever visited.

(If you're interested in the other 1% of sites I've visited, they're interested in throughput, which I've written about in another blog post.)

Measuring response time is vital. You must be able to measure response time if you're going to nail that test on Thursday.

The key is to understand that the term response time doesn't even have a definition except in the context of a task. You can't measure response time if you don't first decide what task you're going to measure. In other words, you cannot do step 2 before you do step 1. With Oracle, for example, you can collect ASH data (if you're licensed to use it) or even trace data for a whole bunch of Oracle processes, but you won't have a single response time until you define which tasks buried within that data are the ones you want to extract and pay attention to.

You get that by visiting a user and watching what she does.

There are lots of excuses for not watching your users. Like these...

"I don't know my users." I know. But you should. You'd do your job better if you did. And your users would, too.
"My users aren't here." I know. They're on the web. They're in Chicago and Singapore and Istanbul, buying plane tickets or baseball caps or stock shares. But if you can't watch at least a simulation of the things those users actually do with the system you help manage, then I can't imagine how you would possibly succeed at providing good performance to them.
"I'm supposed to be able to manage performance with my dashboard." I know. I was supposed to have a hover car by the year 2000.

The longer you stay mired in excuses like these, the longer it's going to be before you can get the benefit of my point here. Your users are running something, and whatever that is that they're running is your version of my Thursday test. You can check and fix all you want, but unless you get lucky and fix the exact tooth that's hurting, your efforts aren't going to be perceived as "helpful." Checking and fixing everything you can think of is far less efficient and effective than targeting exactly what your user needs you to target.

Lots of performance analysts (DBAs, developers, architects, sysadmins, and so on) assume that when someone says, "The whole system is slow," it means there must be a single parameter somewhere in the bowels of the system that needs adjustment, and if you can just make that adjustment, everything is going to be ok. It might mean that, but in my experience, the overwhelming majority of cases are not that way. (Pages 25–29 of Optimizing Oracle Performance has more information about this.)

The great thing about measuring response time is that no matter what the problem is, you'll see it. If the program you're watching is poorly written, you'll see it. If some other program is hogging too much of a resource that your program needs, you'll see it. If you have a bad disk controller, you'll see it. If some parameter needs adjusting, you'll see it.

Realize that when a business user says "system," she doesn't mean what you would mean if you said "system." She means that the thing she runs is slow. Look at that thing. Maybe there are seventeen of them. And sure, maybe all seventeen suffer from the same root cause. If that's the case, then fantastic, because fixing the first problem will magically fix the other sixteen, too. If it's not, then fantastic anyway, because now all of them are on your prioritized list of tasks to optimize, and you'll probably surprise yourself how quickly you'll be able to pick them all off when you focus on one task at a time.

Monday, December 7, 2009

C. J. Date: All Systems Go

Enrollment is open for the course taught by Christopher J. Date that we'll host 26–28 January in the Dallas/Fort Worth area. For many of us, this will be a once-in-a-lifetime opportunity to sit in a classroom for three days with one of the pioneers who created the field we live in each day.

I'm looking forward to this course myself. It is so easy to use Oracle in non-relational ways. But not understanding how to use SQL relationally leads to countless troubles and unnecessary complexities. Chris's focus in this course will be the discipline to use Oracle in a truly relational way, the mastery of which will make your applications faster, easier to prove, and more fun to write, maintain, and ehnance.

I'm particularly looking forward to his section about missing values—the ages-old debate about NULL—which is not covered in Chris's book upon which the course is based.

Thursday, December 3, 2009

The Thing (I Don’t Like) About Video

Video is awesome. I like high-bandwidth communication. Even on the cheapest, most un-produced videos, I can see facial expressions and body language that I’d never be able to pick up from text. I can see candidness that’s not going to come through in a document, even a blog that’s written pretty much off the cuff. And videos with high production value, ...well of course it’s awesome to watch a great short movie right at the tips of your fingers.

But...

But when you send me a 7:48 video, I have to budget 7:48 to watch it. (Well, more actually, because of the latency required to buffer it up.) When you send me a 13-page document, I can “read” it in 10 seconds if I want to. I can skim the first and final paragraphs really quickly and look for pictures or sidebars or quotes, and it takes practically no time for me to do it.

With a video, it’s just more difficult to do that. I can watch the first 10 seconds and usually know whether I want to watch the remainder. But skimming through the whole video—like skipping to the end—is more difficult, because I have to sit there un-utilized while the whole video buffers up. Then I have to sit there while words come at me aurally, which is annoyingly sequential compared to reading buckets of text in one eyeful.

So, the bottom line is that the first 10 seconds of your video need to convince me to watch the remainder.

Or I won’t.

Is it just me?

What’s out there to make video browsing a better, more time-efficient, and more fulfilling experience?

Friday, November 20, 2009

Performance Optimization with Global Entry. Or Not?

As I entered the 30-minute "U.S. Citizens" queue for immigration back into the U.S. last week, the helpful "queue manager" handed me a brochure. This is a great place to hand me something to read, because I'm captive for the next 30 minutes as I await my turn with the immigration officer at the Passport Control desk. The brochure said "Roll through Customs faster."

Ok. I'm listening.

Inside the brochure, the first page lays out the main benefits:

bypass the passport lines
no paper Customs declaration
in most major U.S. airports

Well, that's pretty cool. Especially as I'm standing only 5% deep in a queue with a couple hundred people in it. And look, there's a Global Entry kiosk right there with its own special queue, with nobody—nobody!—in it.

If I had this Global Entry thing, I'd have a superpower that would enable me to zap past the couple hundred people in front of me, and get out of the Passport Control queue right now. Fantastic.

So what does this thing cost? It's right there in the brochure:

Apply online at www.globalentry.gov. There is a non-refundable $100 application fee. Membership is valid for five years. That's $20 a year for the queue-bypassing superpower. Not bad. Still listening.
Schedule an in-person interview. Next, I have to book an appointment to meet someone at the airport for a brief interview.
Complete the interview and enrollment. I give my interview, get my photo taken, have my docs verified, and that's it, I'm done.

So, all in all, it doesn't cost too much: a hundred bucks and probably a couple hours one day next month sometime.

What's the benefit of the queue-bypassing superpower? Well, it's clearly going to knock a half-hour off my journey through Passport Control. I immigrate three or four times per year on average, and today's queue is one of the shorter ones I've seen, so that's at least a couple hours per year that I'd save... Wow, that would be spectacular: a couple more hours each year in my family's arms instead of waiting like a lamb at the abattoir to have my passport controlled.

But getting me into my family's arms 30 minutes earlier is not really what happens. The problem is a kind of logic that people I meet get hung up in all the time. When you think about subsystem (or resource) optimization, it looks like your latency savings for the subsystem should go straight to your system's bottom line, but that's often not what happens. That's why I really don't care about subsystem optimization; I care about response time. I could say that a thousand times, but my statement is too abstract to really convey what I mean unless you already know what I mean.

What really happens in the airport story is this: if I had used Global Entry on my recent arrival, it would have saved me only a minute or two. Not half an hour, not even close.

It sounds crazy, doesn't it? How can a service that cuts half an hour off my Passport Control time not get me home at least a half hour earlier?

You'll understand once I show you a sequence diagram of my arrival. Here it is (at right). You can click the image to embiggen it, if you need.

To read this sequence diagram, start at the top. Time flows downward. This sequence diagram shows two competing scenarios. The multicolored bar on the left-hand side represents the timeline of my actual recent arrival at DFW Airport, without using the Global Entry service. The right-hand timeline is what my arrival would have looked like had I been endowed with the Global Entry superpower.

You can see at the very bottom of the timeline on the right that the time I would have saved with Global Entry is minuscule: only a minute or two.

The real problem is easy to see in the diagram: Queue for Baggage Claim is the great equalizer in this system. No matter whether I'm a Global Entrant or not, I'm going to get my baggage when the good people outside with the Day-Glo Orange vests send it up to me. My status in the Global Entry system has absolutely no influence over what time that will occur.

Once I've gotten my baggage, the Global Entry superpower would have again swung into effect, allowing me to pass through the zero-length queue at the Global Entry kiosk instead of waiting behind two families at the Customs queue. And that's the only net benefit I would have received.

Wait: there were only two families in the Customs queue? What about the hundreds of people I was standing behind in the Passport Control queue? Well, many of them were gone already (either they had hand-carry bags only, or their bags had come off earlier than mine). Many others were still awaiting their bags on the Baggage Claim carousel. Because bags trickle out of the baggage claim process, there isn't the huge all-at-once surge of demand at Customs that there is at Passport Control when a plane unloads. So the queues are shorter.

At any rate, there were four queues at Customs, and none of them was longer than three or four families. So the benefit of Global Entry—in exchange for the $100 and the time spent doing the interview—for me, this day, would have been only the savings of a couple of minutes.

Now, if—if, mind you—I had been able to travel with only carry-on luggage, then Global Entry would have provided me significantly more value. But when I'm returning to the U. S. from abroad, I'm almost never allowed to carry on any bag other than my briefcase. Furthermore, I don't remember ever clearing Passport Control to find my bag waiting for me at Baggage Claim. So the typical benefit to me of enrolling in Global Entry, unfortunately, appears to be only a fraction of the duration required to clear Customs, which in my case is almost always approximately zero.

The problem causing the low value (to me) of the Global Entry program is that the Passport Control resource hides the latency of the Baggage Claim resource. No amount of tuning upon the Passport Control resource will affect the timing of the Baggage In Hand milestone; the time at which that milestone occurs is entirely independent of the Passport Control resource. And that milestone—as long as it occurs after I queue for Baggage Claim—is a direct determinant of when I can exit the airport. (Gantt or PERT chart optimizers would say that Queue for Baggage Claim is on the critical path.)

How could a designer make the airport experience better for the customer? Here are a few ideas:

Let me carry on more baggage. This idea would allow me to trot right through Baggage Claim without waiting for my bag. In this environment, the value of Global Entry would be tremendous. Well, nice theory; but allowing more carry-on baggage wouldn't work too well in the aggregate. The overhead bins on my flight were already stuffed to maximum capacity, and we don't need more flight delays induced by passengers who bring more stuff onboard than the cabin can physically accommodate.
Improve the latency of the baggage claim process. The sequence diagram shows clearly that this is where the big win is. It's easy to complain about baggage claim, because it's nearly always noticeably slower than we want it to be, and we can't see what's going on down there. Our imaginations inform us that there's all sorts of horrible waste going on.
Use latency hiding to mask the pain of the baggage claim process. Put TV sets in the Baggage Claim area, and tune them to something interesting instead of infinite loops of advertising. At CPH, they have a Danish hot dog stand in the baggage claim area. They also have a currency exchange office in there. Excellent latency hiding ideas if you need a snack or some DKK walkin'-around-money.

Latency hiding is a weak substitute for improving the speed of the baggage claim process. The killer app would certainly be to make Baggage Claim faster. Note, however, that just making Baggage Claim a little bit faster wouldn't make the Global Entry program any more valuable. To make Global Entry any more valuable, you'd have to make Baggage Claim fast enough that your bag would be waiting for anyone who cleared the full Passport Control queue.

So, my message today: When you optimize, you must first know your goal. So many people optimize subsystems (resources) that they think are important, but optimizing subsystems is often not a path to optimizing what you really want. At the airport, I really don't give a rip about getting out of the Passport Control queue if it just means I'm going to be dumped earlier into a room where I'll have to wait until an affixed time for my baggage.

Once you know what your real optimization goal is (that's Method R step 1), then the sequence diagram is often all you need to get your breakthrough insight that either helps you either (a) solve your problem or (b) understand when there's nothing further that you can really do about it.

Thursday, November 12, 2009

Why We Made Method R

Twenty years ago (well, a month or so more than that), I entered the Oracle ecosystem. I went to work as a consultant for Oracle Corporation in September 1989. Before Oracle, I had been a language designer and compiler developer. I wrote code in lex, yacc, and C for a living. My responsibilities had also included improving other people's C code: making it more reliable, more portable, easier to read, easier to prove, and easier to maintain; and it was my job to teach other people in my department how to do these things themselves. I loved all of these duties.

In 1987, I decided to leave what I loved for a little while, to earn an MBA. Fortunately, at that time, it was possible to earn an MBA in a year. After a year of very difficult work, I had my degree and a new perspective on business. I interviewed with Oracle, and about a week later I had a job with a company that a month prior I had never heard of.

By the mid-1990s, circumstances and my natural gravity had matched to create a career in which I was again a software developer, optimizer, and teacher. By 1998, I was the manager of a group of 85 performance specialists called the System Performance Group (SPG). And I was the leader of the system architecture and system management consulting service line within Oracle Consulting's Global Steering Committee.

My job in the SPG role was to respond to all the system performance-related issues in the USA for Oracle's largest accounts. My job in the Global Steering Committee was to package the success of SPG so that other practices around the world could repeat it. The theory was that if a country manager in, say, Venezuela, wanted his own SPG, then he could use the financial models, budgets, hiring plans, training plans, etc. created by my steering committee group. Just add water.

But there was a problem. My own group of 85 people consisted of two very different types of people. About ten of these 85 people were spectacularly successful optimizers whom I could send anywhere with confidence that they'd thrive at either improving

performance or proving that performance improvements weren't possible. The other 75 were very smart, very hard-working people who would grow into the tip of my pyramid over the course of more years, but they weren't there yet.

The problem was, how to you convert good, smart, hard-working people in the base of the SPG pyramid into people in the tip? The practice manager in Venezuela would need to know that. The answer, of course, is supposed to be the Training Plan. Optimally, the Training Plan consists of a curriculum of a few courses, a little on-the-job training, and then, presto: tip of the pyramid. Just add water.

But unfortunately that wasn't the way things worked. What I had been getting instead, within my own elite group, was a

process that took many years to convert a smart, hard-working person into a reasonably reliable performance optimizer whom you could send anywhere. Worse yet, the peculiar stresses of the job—like being away from home 80% of the time, and continually visiting angry people each week, having to work for me—caused an outflow of talent that approximately equaled the inflow of people who made it to the tip of the pyramid. The tip of my pyramid never grew beyond roughly 10 people.

The problem, by definition, was the Training Plan. It just wasn't good enough. It wasn't that the instructors of Oracle's internal "tuning" courses were doing a poor job of teaching courses. And it wasn't that the course developers had done a poor job of creating courses. On the contrary, the instructors and course developers were doing excellent work. The problem was that the courses were focusing on the wrong thing. The reason that the courses weren't getting the job done was that the very subject matter that needed teaching hadn't been invented yet.

I expect that the people who write, say, the course called "Braking System Repair for Boeing 777" to have themselves invented the braking system they write about. So, the question was, who should be responsible for inventing the subject matter on how to optimize Oracle? I decided that I wanted that person to be me. I deliberated carefully and decided that my best chance of doing that the way I wanted to do it would be outside of Oracle. So in October 1999, ten years and one week after I joined the company, I left Oracle with the vision of creating a repeatable, teachable method for optimizing Oracle systems.

Ten years later, this is still the vision for my company, Method R Corporation. We exist not to make your system faster. We exist to make you faster at making all your systems faster. Our work is far from done, but here is what we have done:

Written white papers and other articles that explain Method R to you at no cost.
Written a book called Optimizing Oracle Performance, where you can learn Method R at a low cost.
Created a Method R course (on which the book is based), to teach you how to diagnose and repair response time problems in Oracle-based systems.
Spoken at hundreds of public and private events where we help people understand performance and how to manage it.
Provided consulting services to make people awesome at making their systems faster and more efficient.
Created the first response time profiling software ever for Oracle software applications, to let you analyze hundreds of megabytes of data without drudgery.
Created a free instrumentation library so that you can instrument the response times of Oracle-based software that you write.
Created software tools to help you be awesome at extracting every drop of information that your Oracle system is willing to give you about your response times.
Created a software tool that enables you to record the response time of every business task that runs on your system so you can effortlessly manage end-user performance.

As I said, our work is far from done. It's work that really, really matters to us, and it's work we love doing. I expect it to be a journey that will last long into the future. I hope that our journey will intersect with yours from time to time, and that you will enjoy it when it does.

Sunday, November 8, 2009

Latency Hiding

A few weeks ago, James Morle posted an article called "Latency hiding for fun and profit." Latency hiding one of the fundamental skills that, I believe, distinguishes the people who are Really On The Ball from the people who Just Don't Get It.

Last night, I was calling to my 12-year old boy Alex to come look at something I wanted him to see my computer. At the same time, his mom was reminding him to hurry up if he wanted something to eat, because he only had five minutes before he had to head up to his bedroom. "Alex, come here," I told him, putting a little extra pressure on him. "Just a second, Dad." I looked up and notice that he was unwrapping his ready-made ham and cheese sandwich that he had gotten out of the freezer. He dropped it into the microwave and initiated its two-minute ride, and then he came over to spend two minutes looking at my computer with me while his sandwich cooked. Latency hiding. Excellent.

James's blog helped me put a name to a game that I realize that I play very, very often. Today, I realized that I play the latency hiding game every time I go through an airport security checkpoint. How you lay your stuff on the X-ray machine conveyor belt determines how long you're going to spend getting your stuff off on the other side. So, while I'm queued for the X-ray, I figure out how to optimize my exit once I get through to the other side.

When I travel every week, I don't really have to think too much about it; I just do the same thing I did a few days ago. When I haven't been through an airport for a while, I go through it all in my mind a little more carefully. And of course, airport rules change regularly, which adds a little spice to the analysis. Some airports require me to carry my boarding pass through the metal detector; others don't. Some airports let me keep my shoes on. Some airports let me keep my computer in my briefcase.

Today, the rules were:

I had my briefcase and my carry-on suitcase.
Boarding pass can go back into the briefcase.
Shoes off.
1-quart ziplock back of liquids and gels: out.
MacBook: out.

Here's how I put my things onto the belt, optimized for latency hiding. I grabbed two plastic boxes and loaded the belt this way:

Plastic box with shoes and ziplock bag.
Suitcase.
Plastic box with MacBook.
Briefcase.

That way, when I cleared the metal detector, I could perform the following operations in this order:

Box with shoes and ziplock bag arrive.
Put my shoes on.
Take the ziplock bag out of the plastic box.
Suitcase arrives.
Put the ziplock bag back into my suitcase.
Box with MacBook arrives.
Take my MacBook out.
Stack the two boxes for the attendant.
Briefcase arrives.
Put the MacBook into the briefcase.
Get the heck out of the way.

Latency hiding helps me exit a slightly uncomfortable experience a little more quickly, and it helps me cope with time spent queueing—a process that's difficult to enjoy—for a process that's itself difficult to enjoy.

I don't know what a lot of the other people in line are thinking while they're standing there for their 15 minutes, watching 30 people ahead of them go through the same process they'll soon endure, 30 identical times. Maybe it's finances or football or cancer or just their own discomfort from being in unusual surroundings. For me, it's usually latency hiding.

Thursday, September 24, 2009

Happy Birthday, OFA Standard

Yesterday I finally posted to the Method R website some of the papers I wrote while I was at Oracle Corporation (1989–1999). You can now find these papers where I am.

When I was uploading my OFA Standard paper, I noticed that today—24 September 2009—is the fourteenth birthday of its publication date. So, even though the original OFA paper was around for a few years before 1995, please join me in celebrating the birthday of the final version of the official OFA Standard document.

Wednesday, September 16, 2009

On the Importance of Diagnosing Before Resolving

Today a reader posted a question I like at our Method R website. It's about the story I tell in the article called, "Can you explain Method R so even my boss could understand it?" The story is about sending your son on a shopping trip, and it takes him too long to complete the errand. The point is that an excellent way to fix any kind of performance problem is to profile the response time for the properly chosen task, which is the basis for Method R (both the method and the company).

Here is the profile that details where the boy's time went during his errand:

                    --Duration---
Subtask             minutes     %  Executions
------------------  -------  ----  ----------
Talk with friends        37   62%           3
Choose item              10   17%           5
Walk to/from store        8   13%           2
Pay cashier               5    8%           1
------------------  -------  ----  ----------
Total                    60  100%

I went on to describe that the big leverage in this profile is the elimination of the subtask called "Talk with friends," which will reduce response time by 62%.

The interesting question that a reader posted is this:

Not sure this is always the right approach. For example, lets imagine the son has to pick 50 items
Talk 3 times 37 minutes
Choose item 50 times 45 minutes
Walk 2 times 8 minutes
Pay 1 time 5 minutes
Working on "choose item" is maybe not the right thing to do...

Let's explore it. Here's what the profile would look like if this were to happen:

          --Duration---
Subtask   minutes     %  Executions
-------   -------  ----  ----------
Choose         45   47%          50
Talk           37   39%           3
Walk            8    8%           2
Pay             5    5%           1
-------   -------  ----  ----------
Total          95  100%

The point of the inquiry is this:

The right answer in this case, too, is to begin with eliminating Talk from the profile. That's because, even though it's not ranked at the very top of the profile, Talk is completely unnecessary to the real goal (grocery shopping). It's a time-waster that simply shouldn't be in the profile. At all. But with Cary's method of addressing the profile from the top downward, you would instead focus on the "Choose" line, which is the wrong thing.

In chapters 1 through 4 of our book about Method R for Oracle, I explained the method much more thoroughly than I did in the very brief article. In my brevity, I skipped past an important point. Here's a summary of the Method R steps for diagnosing and resolving performance problems using a profile:

(Diagnosis phase) For each subtask (row in the profile), visiting subtasks in order of descending duration...

Can you eliminate any executions without sacrificing required function?
Can you improve (reduce) individual execution latency?

(Resolution phase) Choose the candidate solution with the best net value (that is, the greatest value of benefit minus cost).

Here's a narrative of executing the steps of the diagnostic phase, one at a time, upon the new profile, which—again—is this:

          --Duration---
Subtask   minutes     %  Executions
-------   -------  ----  ----------
Choose         45   47%          50
Talk           37   39%           3
Walk            8    8%           2
Pay             5    5%           1
-------   -------  ----  ----------
Total          95  100%

Execution elimination for the Choose subtask: If you really need all 50 items, then no, you can't eliminate any Choose executions.
Latency optimization for the Choose subtask: Perhaps you could optimize the mean latency (which is .9 minutes per item). My wife does this. For example, she knows better where the items in the store are, so she spends less time searching for them. (I, on the other hand, can get lost in my own shower.) If, for example, you could reduce mean latency to, say, .8 minutes per item by giving your boy a map, then you could save (.9 – .8) × 50 = 5 minutes (5%). (Note that we don't execute the solution yet; we're just diagnosing right now.)
Execution elimination for the Talk subtask: Hmm, seems to me like if your true goal is fast grocery shopping, then you don't need your boy executing any of these 3 Talk events. Proposed time savings: 37 minutes (39%).
Latency optimization for the Talk subtask: Since you can eliminate all Talk calls, no need to bother thinking about latency reduction. ...Unless you're prey to some external constraint (like social advancement, say, in attempt to maximize your probability of having rich and beautiful grandchildren someday), in which case you should think about latency reduction instead of execution elimination.
Execution elimination for the Walk subtask: Well, the boy has to get there, and he has to get back, so this "executions=2" figure looks optimal. (Those Oracle applications we often see that process one row per network I/O call would have 50 Walk executions, one for each Choose call.)
Latency optimization for the Walk subtask: Walking takes 4 minutes each way. Driving might take less time, but then again, it might actually take even more. Will driving introduce new dependent subtasks? Warm Up? Park? De-ice? Even driving doesn't eliminate all the walking... Plus, there's not a lot of leverage in optimizing Walk, because it accounts for only 8% of total response time to begin with, so it's not worth a whole lot of bother trying to shave it down by some marginal proportion, especially since inserting a car into your life (or letting your boy drive yours) is no trivial matter.
Execution elimination for the Pay subtask: The execution count on Pay is already optimized down to the legally required minimum. No real opportunity for improvement here without some kind of radical architecture change.
Latency optimization for the Pay subtask: It takes 5 minutes to Pay? That seems a bit much. So you should look at the payment process. Or should you? Even if you totally eliminate Pay from the profile, it's only going to save 5% of your time. But, if every minute counts, then yes, you look at it. ...Especially if there might be an easy way to improve it. If the benefit comes at practically no cost, then you'll take it, even if the benefit is only small. So, imagine that you find out that the reason Pay was so slow is that it was executed by writing a check, which required waiting for store manager approval. Using cash or a credit/debit card might improve response time by, say, 4 minutes (4%).

Now you're done assessing the effects of (1) execution elimination and (2) latency reduction for each line in the profile. That ends the diagnostic phase of the method. The next step is the resolution phase: to determine which of these candidate solutions is the best. Given the analysis I've walked you through, I'd rank the candidate solutions in this order:

Eliminate all 3 executions of Talk. That'll save 37 minutes (39%), and it's easy to implement; you don't have to buy a car, apply for a credit card, train the boy how to shop faster, or change the architecture of how shopping works. You can simply discard the "requirement" to chat, or you can specify that it be performed only during non-errand time windows.
Optimize Pay latency by using cash or a card, if it's easy enough to give your boy access to cash or a card. That will save 4 minutes, which—by the way—will be a more important proportion of the total errand time after you eliminate all the Talk executions.
Finally, consider optimizing Choose latency. Maybe giving your son a map of the store will help. Maybe you should print your grocery list more neatly so he can read it without having to ask for help. Maybe by simply sending him to the store more often, he'll get faster as his familiarity with the place improves.

That's it.

So the point I want to highlight is this:

I'm not saying you should stick to the top line of your profile until you've absolutely conquered it.

It is important to pass completely through your profile to construct your set of candidate solutions. Then, on a separate pass, you evaluate those candidate solutions to determine which ones you want to implement, and in what order. That first full pass is key. You have to do it for Method R to be reliable for solving any performance problem.

Monday, August 31, 2009

How much disk space did Snow Leopard really save?

Like apparently hundreds of thousands of others, I upgraded my machines running Mac OS X from version 10.5 (Leopard) to 10.6 (Snow Leopard) last Friday. I'm now a Snow Leopard user, and I like it just fine.

I was excited about this upgrade, because I love the notion that the people who released it care about optimizing the performance of my system. One of the optimizations I looked forward to was reclaiming over 6 GB of disk space after the upgrade (see Bertrand Serlet's announcement at 00:20:48 to 00:21:11 in the WWDC 2009 keynote video).

Lots of people in the Twittersphere were excited about the space savings, too. Before I upgraded, I checked to see what people were tweeting, just to make sure I wasn't about to walk off a cliff. Many people mentioned tremendous disk space savings that were well in excess of the 6 GB that Apple promised. Pretty exciting.

We have two Mac computers. Here's how the savings went for us:

Mac #1      10.5      10.6      Savings
-------  ---------  ---------  ---------
Total    148.73 GB  159.70 GB
Free      47.70 GB   59.71 GB   12.01 GB

Mac #2      10.5      10.6      Savings
-------  ---------  ---------  ---------
Total    185.99 GB  199.71 GB
Free      68.66 GB   83.95 GB   15.29 GB

So, ...wow, we saved over twice as much space as Apple had advertised. But there's a curiosity in the numbers. Do you see it? How did my total capacity get bigger as the result of a software upgrade? The answer is that my capacity didn't really get bigger; it's just that Apple now measures disk space differently in 10.6.

I knew this was coming because of this article called "Snow Leopard's New Math." Snow Leopard still uses the abbreviation "GB" to refer, now, to 10⁹ bytes, whereas, before, Leopard used the abbreviation "GB" to refer to 2³⁰ bytes. The problem, see, is that 10⁹ ≠ 2³⁰. In fact, 2³⁰ is bigger. So in Snow Leopard, Apple is dividing by a smaller unit than it used to, which results in disk capacities and file sizes looking bigger than they used to. (Here's a good article about that.)

It is misleading that Apple used the same abbreviation—"GB"—to refer to two different units of measure. However, Apple is well justified in using "GB" in Snow Leopard. IEEE 1541-2002 says the right abbreviations would have been "GiB" (gibibytes) in 10.5 and "GB" (gigabytes) in 10.6. By that standard, Snow Leopard is right, and Leopard was wrong. All's well that ends well, I suppose.

Now, back to the space savings question. How much space did I really save when I upgraded to Snow Leopard? To answer that, I need to convert one of the two columns in my analysis (labeled "10.5" and "10.6") to the other column's unit, so I can subtract. Since when I watched the WWDC keynote film, my mindset was of 10.5-style "gigabytes" (properly gibibytes), I'll convert to GiB. Here's the answer:

Mac #1      10.5        10.6      Savings
-------  ----------  ----------  ---------
Total    148.73 GiB  148.73 GiB
Free      47.70 GiB   55.61 GiB   7.91 GiB

Mac #2      10.5        10.6      Savings
-------  ----------  ----------  ---------
Total    185.99 GiB  185.99 GiB
Free      68.66 GiB   78.18 GiB   9.52 GiB

That's still spectacular, and I'm plenty happy with it. I have basically bought a whole bunch of performance enhancements and 17 GiB of disk space for $49 plus tax (I bought the Snow Leopard upgrade family pack). I think that's a pretty good deal.

This whole story reminded me of the old days when I used to install Oracle for a living. People would buy, say, a brand-new 100,000,000-byte disk drive and then be upset when the df utility showed considerably less than 100 "MB" of free space. Part of the explanation was that df reported in mibibytes, not millions of bytes.

It's interesting to note that in Snow Leopard, df -h now reports in Bi/Ki/Mi/Gi units, and df -H reports in B/K/M/G units (defined as IEEE 1541 defines them). Smart.

Thursday, July 2, 2009

Fundamentals of Software Performance Quick Reference Card

I just posted "Fundamentals of Software Performance Quick Reference Card" at the Method R company website:

This two-page quick reference card written by Cary Millsap sums up computer software performance the Method R way. The first page lists definitions of the terms you need to know: efficiency, knee, load, response time, and so on. The second page lists ten principles that are vital to your ability to think clearly about software performance. This document contains meaningful insight in a format that's compact enough to hang on your wall.

It's free, and there's no sign-up required. I hope you will enjoy it.

Tuesday, June 23, 2009

An Essay on Science

Richard Feynman defined science as "the belief in the ignorance of experts." Science begins by questioning established ideas. ...Even those ideas promoted by so-called experts.

The value of science that's obvious to everybody is the chance you might discover some valuable truth that nobody else has discovered before. That's the glamorous idea that might motivate you to begin the hard work that science sometimes requires. Science is also valuable to you when you learn that an established idea, no matter how much you may not like it, really is true after all. That second value of science is not as glamorous, but it's just as important. My little prayer with respect to that possibility is, "If an idea I believe is wrong, please let me find out before anybody else does."

Everyone can do science. Not just "scientists"; all of us. But you need to do science "right," or it's not science. Do it right, and you accumulate a little bit of truth. Do it wrong, and and you've wasted your time, or worse, you've doomed yourself to waste more of your time in the future, too.

The difference between "right" and "wrong" in science is not some snooty, bureaucratic concept. You don't need a license or a blessing to do science right. You just need to ensure that the cause-effect relationships you choose to believe are actually correct. One of the rules for doing science right is that you measure instead of just asserting your opinion.

Different people have different thresholds of skepticism. Some people believe new ideas, whether they're true or false, with very little persuasion. The people who are persuaded easily to believe false things cannot contribute much useful new knowledge to their communities (irrespective of how much they might publish).

People at the opposite end of the spectrum have very strict standards for what they accept as truth. They're careful about what rows they insert and commit into their minds. There aren't as many of those people, but they're more interesting, with respect to science, because they're the people who can contribute new knowledge to the community.

A lot of what we do in the Oracle world comes down to a person demonstrating, say, in sqlplus, that some certain cause produces some certain effect in some certain version of the database on some certain operating system, and so on. Then the next step is when you have to look at that result and decide for yourself (or perhaps with someone's help), how relevant that result is for you.

Innumerable Oracle debates funnel into the argument that, "Yes, I can see plainly that this situation can happen, but that situation will probably never happen for me." This is what happens, for example, when the doubter believes the prover is using an example that is too contrived to be realistic, or when the doubter's context is different from the prover's. ...Like when the doubter is running a data warehouse, but the prover talks only about transaction processing.

That's where another level of value begins, and it's another place where science can help. It's where the issue becomes proving how relevant a given proposal really is for a given circumstance. One of the nice things about software (and Oracle Database software in particular) is that it's usually easy to write code that will tell you how often something happens, or how long it takes when it does. With software, I don't have to guess. I can measure and know. Software is unusual in the world in that it can be used to measure itself.

At this week's ODTUG conference in Monterey, California, a debate has formed that I'm interested in following. The established idea of the experts in this debate is a passionate belief that you should declare, enable, and index foreign key (FK) relationships. The counter-argument is that you should not.

I've heard probably most of the arguments on the pro side of the debate. (a) If you don't declare/enable/index your FKs, then you have to ensure the correct relationships in your application code, which is error-prone in both obvious and highly subtle ways. (b) Not implementing full FK integrity in the database is developmentally inefficient, because it violates the important principle that you should never duplicate code in an application. (c) Furthermore, the absence of declared/enabled/indexed FK integrity is operationally inefficient at run time, because it blocks the Oracle query optimizer from using code paths that can be hundreds of times faster and more scalable than when it can't rely upon database-enforced FK integrity.

I haven't heard a single compelling argument for the other side of the debate. But, you see, there could be a compelling argument that I haven't heard.

This is what makes this new debate interesting for me. One side or the other is on the brink of learning something important, if the debate is conducted properly.

The first thing the two sides will need to do is agree on whether both sides really are in disagreement and, if they are, what exactly the disagreement is. In lots of debates I've seen, we've figured out after defining the terms and deciding the context of the argument (for example, what kind of application we're talking about), that there's really no debate of principle at all. When that happens, it's debate closed. Each side agrees with the other, and maybe the two sides learn a little more about life on the other side of the fence.

If after the suitable definition-of-terms process, the two sides really do still have a debate left, then there will be some kind of attestation of facts as both sides see them. One side, for example, will show sqlplus session output to demonstrate the subtle and not-so-subtle ways in which not doing the declare/enable/index thing causes corrupted data and horrific performance penalties. The other side of the argument will then show some contrary evidence that counts in favor of not doing the declare/enable/index thing.

If the sides can't agree on the truth of the "facts" presented, then the debate will collapse, and at least one side (possibly both) will have learned nothing. If each side succeeds in impressing the other side that the "facts" thus presented are actually true, then the debate will move to the discussions of the relevance of the facts just demonstrated. This is where one guy might say something like, "I know that queries with FKs in the database are faster, but I can't afford the performance penalty at data load time." To which the other guy might say, "Yes, loads are a little slower with referential integrity checking, but that's time spent executing necessary code path to ensure that only correct data can get into your database. And besides, it's not a good trade-off to endure a thousand slow queries a day so that one load can go faster." Rebuttal, counter-rebuttal, and so on. You get the picture.

I am biased in my estimation of how this will turn out. But I sincerely respect when someone thoughtfully and sincerely challenges an important idea in my professional domain, no matter how well entrenched that idea may be. In fact, the more entrenched, the better. The debate will remain interesting as long as the counter-argument is thoughtful and sincere. As soon as the evidence for the challenging new idea reveals itself to be nonsense, or if the counterargument context seems irrelevant to me (and the people I'm trying to represent), then I'll lose interest.

The best debate ends in a handshake (real or virtual) in which the people representing both sides learn something they didn't know before, one side perhaps more than the other. The best debaters value learning more than being right. The best debaters respect each other more after the debate because each has helped the other (and the community around them) to advance.

The worst debaters confuse the principles of factual correctness and personal correctness. When the debate shifts from factual to personal, it may become interesting to some people because of the enhanced drama, but the actual usefulness of the debate evaporates. ([Grin] I guess if the debate were solid to begin with, then the usefulness actually sublimates.)

No matter which idea wins this FK debate (right, I said idea, not people), I expect to be happy for the debate to have occurred. That's because I expect the debate to end with a resolution, and either I'm going to learn something completely new, or I'm going to fortify an existing belief. Something new is obviously exciting, and fortification will make me a more effective teacher of my existing belief. Either result benefits me and, I believe, the community.

With science, you get suspense, drama, plot twists, surprises, fortifications, .... Science is fun. I wish more people knew.

Thursday, June 18, 2009

Profiling with my Boy

We have an article online called "Can you explain Method R so even my boss could understand it?" Today I'm going to raise the stakes, because yesterday I think I explained Method R so that an eleven year-old could understand it.

Yesterday I took my 11 year-old son Alex to lunch. I talked him into eating at one of my favorite restaurants, called Mercado Juarez, over in Irving, so it was a half hour in the car together, just getting over there. It was a big day for the two of us because we were very excited about the new June 17 iPhone OS 3.0 release. I told him about some of the things I've learned about it on the Internet over the past couple of weeks. One subject in particular that we were both interested in was performance. He likes not having to wait for click results just as much as I do.

According to Apple, the new iPhone OS 3.0 software has some important code paths in it that are 3× faster. Then, upgrading to the new iPhone 3G S hardware is supposed to yield yet another 3× performance improvement for some code paths. It's what Philip Schiller talks about at 1:42:00 in the WWDC 2009 keynote video. Very exciting.

Alex of course, like many of us, wants to interpret "3× faster" as "everything I do is going to be 3× faster." As in everything that took 10 seconds yesterday will take 3 seconds tomorrow. It's a nice dream. But it's not what seeing a benchmark run 3× faster means. So we talked about it.

I asked him to imagine that it takes me an hour to do my grocery shopping when I walk to the store. Should I buy a car? He said yes, probably, because a car is a lot faster than walking. So I asked him, what if the total time I spent walking to and from the grocery store was only one minute? Then, he said, having a car wouldn't make that much of a difference. He said you might want a car for other reasons, but he wouldn't recommend it just to make grocery shopping faster.

Good.

I said, what if grocery shopping were taking me five hours, and four of it was time spent walking? "Then you probably should get the car," he told me. "Or a bicycle."

Commit.

On the back of his menu (photo above: click to zoom), I drew him a sequence diagram (A) showing how he, running Safari on an iPhone 3G might look to a performance analyst. I showed him how to read the sequence diagram (time runs top-down, control passes from one tier to another), and I showed him two extreme ways that his sequence diagram might turn out for a given experience. Maybe the majority of the time would be spent on the 3G network tier (B), or maybe the majority of the time would be spent on the Safari software tier (C). We talked about how if B were what was happening, then a 3× faster Safari tier wouldn't really make any difference. Apple wouldn't be lying if they said their software was 3× faster, but he really wouldn't notice a performance improvement. But if C were what was happening, then a 3× faster Safari tier would be a smoking hot upgrade that we'd be thrilled with.

Sequence diagrams, check. Commit.

Now, to profiles. So I drew up a simple profile for him, with 101 seconds of response time consumed by 100 seconds of software and 1 second of 3G (D):

Software  100
3G          1
-------------
Total     101

I asked him, if we made the software 2× faster, what would happen to the total response time? He wrote down "50" in a new column to the right of the "100." Yep. Then I asked him what would happen to total response time. He said to wait a minute, he needed to use the calculator on his iPod Touch. Huh? A few keystrokes later, he came up with a response time of 50.5.

Oops. Rollback.

He made the same mistake that just about every forty year-old I've ever met makes. He figured if one component of response time were 2× faster, then the total response time must be 2× faster, too. Nope. In this case, the wrong answer was close to the right answer, but only because of the particular numbers I had chosen.

So, to illustrate, I drew up another profile (E):

Software    4
3G         10
-------------
Total      14

Now, if we were to make the software 2× faster, what happens to the total? We worked through it together:

Software    4    2
3G         10   10
------------------
Total      14   12

Click. So then we spent the next several minutes doing little quizzes. If this is your profile, and we make this component X times faster, then what's the new response time going to be? Over and over, we did several more of these, some on paper (F), and others just talking.

Commit.

Next step. "What if I told you it takes me an hour to get into my email at home? Do I need to upgrade my network connection?" A couple of minutes of conversation later, he figured out that he couldn't answer that question until he got some more information from me. Specifically, he had to ask me how much of that hour is presently being spent by the network. So we did this over and over a few times. I'd say things like, "It takes me an hour to run my report. Should I spend $4,800 tuning my SQL?" Or, "Clicking this button takes 20 seconds. Should I upgrade my storage area network?"

And, with just a little bit of practice, he learned that he had to say, "How much of the however-long-you-said is being spent by the whatever-it-was-you-said?" I was happy with how he answered, because it illustrated that he had caught onto the pattern. He realized that the specific blah-blah-blah proposed remedies I was asking him about didn't really matter. He had to ask the same question regardless. (He was answering me with a sentence using bind variables.)

Commit.

Alex hears me talk about our Method R Profiler software tool a lot, and he knows conceptually that it helps people make their systems faster, but he's never known in any real detail very much about what it does. So I told him that the profile tables are what our Profiler makes for people. To demonstrate how it does that, I drew him up a list of calls (F), which I told him was a list of visits between a disk and a CPU. ...Just a list that says the same thing that a sequence diagram (annotated with timings) would say:

D 2
C 1
D 6
D 4
D 8
C 3

I told him to make a profile for these calls, and he did (H):

Disk   20
CPU     4
---------
Total  24

Excellent. So I explained that instead of adding up lists in our head all day, we wrote the Profiler to aggregate the call-by-call durations (from an Oracle extended SQL trace file) for you into a profile table that lets you answer the performance questions we had been practicing over lunch. ...Even if there are millions of lines to add up.

The finish-up conversation in the car ride back was about how to use everything we had talked about when you fix people's performance problems. I told him the most vital thing about helping someone solve a performance problem is to make sure that the operation (the business task) that you're analyzing right now is actually the most important business task to fix first. If you're looking at anything other than the most important task first, then you're asking for trouble.

I asked him to imagine that there are five important tasks that are too slow. Maybe every one of those things has its response time dominated by a different component than all the others. Maybe they're all the same. But if they're all different, then no single remedy you can perform is going to fix all five. A given remedy will have a different performance impact upon each of the five tasks, depending on how much of the fixed thing that task was using to begin with.

So the important thing is to know which of the five profiles it is that you ought to be paying attention to first. Maybe one remedy will fix all five tasks, maybe not. You just can't know until you look at the profiles. (Or until you try your proposed remedy. But trial-and-error is an awfully expensive way to find out.)

Commit.

It was a really good lunch. I'll look forward to taking my 9-year-old (Alex's little brother) out the week after next when I get back from ODTUG.

Friday, April 24, 2009

The Most Common Performance Problem I See

At the Percona Performance Conference in Santa Clara this week, the first question an audience member asked our panel was, "What is the most common performance problem you see in the field?"

I figured, being an Oracle guy at a MySQL conference, this might be my only chance to answer something, so I went for the mic. Here is my answer.

The most common performance problem I see is people who think there's a most-common performance problem that they should be looking for, instead of measuring to find out what their actual performance problem actually is.

It's a meta answer, but it's a meta problem. The biggest performance problems I see, and the ones I see most often, are not problems with machines or software. They're problems with people who don't have a reliable process of identifying the right thing to work on in the first place.

That's why the definition of Method R doesn't mention Oracle, or databases, or even computers. It's why Optimizing Oracle Performance spends the first 69 pages talking about red rocks and informed consent and Eli Goldratt instead of Oracle, or databases, or even computers.

The most common performance problem I see is that people guess instead of knowing. The worst cases are when people think they know because they're looking at data, but they really don't know, because they're looking at the wrong data. Unfortunately, every case of guessing that I ever see is this worst case, because nobody in our business goes very far without consulting some kind of data to justify his opinions. Tim Cook from Sun Microsystems pointed me yesterday to a blog post that gives a great example of that illusion of knowing when you really don't.

Monday, April 13, 2009

Maxine Johnson

I want to introduce you to Maxine Johnson, assistant manager of men's sportswear at Nordstrom Galleria Dallas. The reason I think Maxine is important is because she taught my son and me about customer service. I met her several months ago. I still have her card, and I'm still grateful to her. Here's what happened.

A few months ago, my wife and I were in north Dallas with some time to spare, and I convinced her to go with me to pick out one or two pairs of dress slacks. I felt like I was wearing the same pants over and over again when I traveled, and I could use an extra pair or two. We usually go to Nordstrom for that, and so we did again. After some time, I had two pairs of trousers that we both liked, and so we had them measured for hemming and picked them up a few days later.

A week or two passed, and then I packed a pair of my new pants for a trip to Zürich. I put them on in the hotel the first morning I was supposed to speak at an event. On my few-block walk from the hotel to the train station, I caught my reflection in a store window, and—hmmp—my pants were just not... really... quite... long enough. Every step, the whole cuff would come way up above the tops of my shoes. I stopped and tugged them down, and then they seemed alright, but then as soon as I started walking again, they'd ride back up and look too short.

They weren't bad enough that anyone said anything, but I was a little self-consious about it. I kept tugging at them all day.

When I hung them back up in my closet at home, I noticed that when I folded them over the hanger, they didn't reach as far as the other pants that I really liked. Sure enough, when I lined up the waists, these new pants were about an inch shorter than my favorite ones that I had bought at Nordstrom probably four years ago.

Now, pants at Nordstrom cost a little more than maybe at a lot of other places, but they're worth to me what I pay for them because they're nice, and they last a long time. But these new ones made me feel bad, because they were just a little bit off. I could already foresee a future of two new pairs of slacks hanging in my closet for years, never really making the starting rotation because they're just a little bit off, but never making the garage sale pile, either, because they had cost too much.

My wife agreed. They were shorter than the others. They were shorter than they should be. I needed to get them fixed.

Now, this is the part I always hate. Having made the decision, the next step is that step where you take the thing back and try to get the problem fixed. I hate that part. My wife doesn't mind it so much, but these were my pants, and so I was the one that had to go back and put them on so someone could fix them. I really dreaded it though, because I knew that the only way they could fix those pants was to take off the cuff.

It's late in the evening by the time my wife helps me build up a little head of steam, and we both decide (well, she decides, but she's right) that tonight is the perfect night for me to go on a 20-mile drive across town to Nordstrom to get my pants fixed. As a matter of fact, it'd be good if my older boy went with me. That makes it a little more fun, because he's good company for me.

It's late enough by now that before I could leave, I had to phone ahead, just to make sure the store was still open. A nice lady answered the phone. I said my name and told the nice lady that I was having some trouble with some slacks I had bought a few weeks ago, and how late did they stay open? She told me to come right on over.

So my boy and I got into the car, and I drove right on over.

A half hour later, I walked into the store, thankful that the doors were still open, carrying two pairs of slacks on a hanger, with my son walking beside me. A smiling nice lady approached me as I entered the men's department. "Mr. Millsap?" Yes, I am. It surprises me anytime someone remembers my name from that one phase of the conversation where I say real fast, "My name is Cary Millsap, and blah blah blah blah blah," and tell my whole story. The person on the phone hadn't asked me again what my name was. She had caught it in the blur at the beginning of my story.

She proceeded to explain to me what was going to happen. I was going to try on the slacks in the dressing room. The tailor would be there waiting for me. She and the tailor would look them over. If there was enough fabric to make them longer, then they'd do that tonight. If there weren't, then she was going to find two new pairs of slacks for me, and the tailor would have them ready for me tomorrow. If for any reason, those didn't work, then she'd keep preparing new trousers for me until I was satisfied.

Mmm, ok. I was probably grinning a little bit by now, because this was pretty fantastic news. I wasn't going to have to get my pants de-cuffed. I was still a little nervous, though, that when I came out of the dressing room, everyone was going to look at me like, "So what's the problem? I don't see any problem. Those are long enough."

When I came out, Maxine Johnson crossed her arms, put her hand to her chin, shook her head a little, and immediately said something to the effect of, "Oh my, no. That won't do at all." So she brought me two new pairs, which I tried on, and which the tailor measured for me. She gave me a reclaim ticket for the next day. As usual, I had missed her name when she introduced herself as I first entered the men's department. (As you probably already figured out, I have a bad habit of not paying enough attention to that part of the conversation that I think of as "the blur.") I did have the good sense to ask for her business card, which is why I know her name is Maxine Johnson.

My boy and I talked the whole ride home that what we had seen that night had been some real, first-class retail customer care right there, and that we all knew where we'd be buying my next pairs of pants. When I had gotten into the car an hour or so before, I had been very apprehensive about what might happen. I had been especially nervous about how I'd perform during the proving-what's-wrong part of the project. But Maxine Johnson put me completely at ease during my experience. She didn't just do the right thing, she did it in such a manner that I felt glad the whole problem had happened. Here's the thing:

Maxine Johnson made me feel like it was not just ok that I brought the pants back for repair, she made me feel like she was delighted by the opportunity to show me what Nordstrom could do for me under pressure.

I hope that the way Maxine Johnson made me feel is the way that my employees and I make our customers feel. I hope it's the way my children make their customers feel someday when they go to work.

Thank you, Maxine Johnson. Thank you.

Wednesday, April 8, 2009

What would you do with 8 disks?

Yesterday, David Best posted this question at Oracle-L:

If you had 8 disks in a server what would you do? From watching this list I can see alot of people using RAID 5 but i'm wary of the performance implicatons. (http://www.miracleas.com/BAARF/)

I was thinking maybe RAID 5 (3 disks) for the OS, software and
backups. RAID 10 (4 disks + 1 hot spare) for the database files.

Any thoughts?

I do have some thoughts about it.

There are four dimensions in which I have to make considerations as I answer this question:

Volume
Flow
Availability
Change

Just about everybody understands at least a little bit about #1: the reason you bought 8 disks instead of 4 or 16 has something to do with how many bytes of data you're going to store. Most people are clever enough to figure out that if you need to store N bytes of data, then you need to buy N + M bytes of capacity, for some M > 0 (grin).

#2 is where a lot of people fall off the trail. You can't know how many disks you really need to buy unless you know how many I/O calls per second (IOPS) your application is going to generate. You need to ensure that your sustained IOPS rate on each disk will not exceed 50% (see Table 9.3 in Optimizing Oracle Performance for why .5 is special). So, if a disk drive is capable of serving N 8KB IOPS (your disk's capacity for serving I/O calls at your Oracle block size), then you better make sure that the data you put on that disk is so interesting that it motivates your application to execute no more than .5N IOPS to that disk. Otherwise, you're guaranteeing yourself a performance problem.

Your IOPS requirement gets a little trickier, depending on which arrangement you choose for configuring your disks. For example, if you're going to mirror (RAID level 1), then you need to account for the fact that each write call your application makes will motivate two physical writes to disk (one for each copy). Of course, those write calls are going to separate disks, and you better make sure they're going through separate controllers, too. If you're going to do striping with distributed parity (RAID level 5), then you need to realize that each "small" write call is going to generate four physical I/O calls (two reads, and two writes to two different disks).

Of course, RAID level 5 caching complicates the analysis at low loads, but for high enough loads, you can assume away the benefits of cache, and then you're left with an analysis that tell you that for write-intensive data, RAID level 5 is fine as long as you're willing to buy 4× more drives than you thought you needed. ...Which is ironic, because the whole reason you considered RAID level 5 to begin with is that it costs less than buying 2× more drives than you thought you needed, which is why you didn't buy RAID level 1 to begin with.

If you're interested in RAID levels, you should peek at a paper I wrote a long while back, called Configuring Oracle Server for VLDB. It's an old paper, but a lot of what's in there still holds up, and it points you to deeper information if you want it.

You have to think about dimension #3 (availability) so that you can meet your business's requirements for your application to be ready when its users need it. ...Which is why RAID levels 1 and 5 came into the conversation to begin with: because you want a system that keeps running when you lose a disk. Well, different RAID levels have different MTBF and MTTR characteristics, with the bottom line being that RAID level 5 doesn't perform quite as well (or as simply) as RAID level 1 (or, say 1+0 or 0+1), but RAID level 5 has the up-front gratification advantage of being more economical (unless you get a whole bunch of cache, which you pretty much have to, because you want decent performance).

The whole analysis—once you actually go through it—generally funnels you into becoming a BAARF Party member.

Finally, dimension #4 is change. No matter how good your analysis is, it's going to start degrading the moment you put your system together, because from the moment you turn it on, it begins changing. All of your volumes and flows will change. So you need to factor into your analysis how sensitive to change your configuration will be. For example, what % increase in IOPS will require you to add another disk (or pair, or group, etc.)? You need to know in advance, unless you just like surprises. (And you're sure your boss does, too.)

Now, after all this, what would I do with 8 disks? I'd probably stripe and mirror everything, like Juan Loaiza said. Unless I was really, really (I mean really, really) sure I had a low write-rate requirement (think "web page that gets 100 lightweight hits a day"), in which I would consider RAID level 5. I would make sure that my sustained utilization for each drive is less than 50%. In cases where it's not, I would have a performance problem on my hands. In that case, I'd try to balance my workload better across drives, and I would work persistently to find any applications out there that are wasting I/O capacity (naughty users, naughty SQL, etc.). If neither of those actions reduced the load by enough, then I'd put together a justification/requisition for more capacity, and I would brace myself to explain why I thought 8 disks was the right number to begin with.

Friday, April 3, 2009

Cary on Joel on SSD

Joel Spolsky's article on Solid State Disks is a great example of a type of problem my career is dedicated to helping people avoid. Here's what Joel did:

He identified a task needing performance improvement: "compiling is too slow."
He hypothesized that converting from spinning rust disk drives (thanks mwf) to solid state, flash hard drives would improve performance of compiling. (Note here that Joel stated that his "goal was to try spending money, which is plentiful, before [he] spent developer time, which is scarce.")
So he spent some money (which is, um, plentiful) and some of his own time (which is apparently less scarce than that of his developers) replacing a couple of hard drives with SSD. If you follow his Twitter stream, you can see that he started on it 3/25 12:15p and wrote about having finished at 3/27 2:52p.
He was pleased with how much faster the machines were in general, but he was disappointed that his compile times underwent no material performance improvement.

Here's where Method R could have helped. Had he profiled his compile times to see where the time was being spent, he would have known before the upgrade that SSD was not going to improve response time. Given his results, his profile for compiling must have looked like this:

100%  Not disk I/O
  0%  Disk I/O
----  ------------
100%  Total

I'm not judging whether he wasted his time by doing the upgrade. By his own account, he is pleased at how fast his SSD-enabled machines are now. But if, say, the compiling performance problem had been survival-threateningly severe, then he wouldn't have wanted to expend two business days' worth of effort upgrading a component that was destined to make zero difference to the performance of the task he was trying to improve.

So, why would someone embark upon a performance improvement project without first knowing exactly what result he should be able to expect? I can think of some good reasons:

You don't know how to profile the thing that's slow. Hey, if it's going to take you a week to figure out how to profile a given task, then why not spend half that time doing something that your instincts all say is surely going to work?
Um, ...

Ok, after trying to write them all down, I think it really boils down to just one good reason: if profiling is too expensive (that is, you don't know how, or it's too hard, or the tools to do it cost too much), then you're not going to do it. I don't know how I'd profile a compile process on a Microsoft Windows computer. It's probably possible, but I can't think of a good way to do it. It's all about knowing; if you knew how to do it, and it were easy, you'd do it before you spent two days and a few hundred bucks on an upgrade that might not give you what you wanted.

I do know that in the Oracle world, it's not hard anymore, and the tools don't cost nearly as much as they used to. There's no need anymore to upgrade something before you know specifically what's going to happen to your response times. Why guess... when you can know.

Tuesday, March 31, 2009

Last call for C. J. Date course

Note added 3 April 2009: When I wrote this post, we were counting down toward 2 April as the date for our preliminary go/no-go decision. That date is now behind us, and we have made the preliminary decision to Go. We are accepting further enrollments. —Cary Millsap

Thursday 2 April 2009 is our last call for enrollment in C. J. Date's course, "How to write correct SQL, and know it: a relational approach to SQL." I'm looking forward to this course more eagerly than anything I've attended in the past ten years, ...maybe twenty.

SQL and I never really got along too well. When I first joined Oracle Corporation in 1989, I was new to relational databases. I had done one hierarchical database project in college. I enjoyed the project ok, but it wasn't something I ever wanted to do again. When I joined Oracle, I didn't know much about relational technology or SQL. In my formative first couple of years at Oracle, though, I just never learned to like the SQL language. Prior to my Oracle career, I designed languages and wrote compilers for a living. From a language design standpoint, it just seemed that SQL (at least "Oracle SQL") could have become something really cool, but it didn't. For Oracle to treat an empty string as NULL, for example, is a decision which I still can't believe made it into the light of day...

I had a lot of respect over the years for the people I met who knew how to make SQL do what they wanted it to do. Dominic Delmolino was one of the first people I ever met who could make SQL do things I had no idea it could do. I'm still amazed when I see the things that Tom Kyte can do with SQL. I was never one of the SQL people.

Lex de Haan is the first person I ever met who really revealed to me what my problem was. A few years ago, Lex delivered a Miracle presentation in Rødby, Denmark, that dropped my jaw. He explained a better way to write an application with SQL. He showed how to write a completely unambiguous specification using a language I understood, predicate calculus ("this set equals that set," that kind of thing). He then showed how to implement that specification in SQL.

Here's the problem, though. SQL doesn't implement many of the set-theory/predicate-calculus operations that I expect. I'm not looking at Lex's notes as I write this, so I'll show you an example outlined recently by Toon Koppelaars, Lex's coauthor on the brilliant book called Applied Mathematics for Database Professionals (Expert's Voice).

In SQL, there's no "set equality" operator. That's right, although SQL is a set processing language, it has no operator for testing whether one set A equals another set B. But set equality "A = B" can be rewritten as "(A is-a-subset-of B) and (B is-a-subset-of A)".

Unfortunately, SQL doesn't have an is-a-subset-of operator either. But "A is-a-subset-of B" can be rewritten into "A minus B = the-empty-set".

But SQL also lacks the concept of an empty set. The way to express that is to test whether the cardinality of a set is zero, as in "count(*)=0".

Over the course of an hour-long presentation, Lex showed me a dozen or so operators that are missing from SQL, which we really need for expressing our intentions clearly in SQL. He put structure around the negative feelings I had toward the language. And then he showed an equivalent translation for each missing operator that could be implemented in SQL, which invested back into the language a new power. That's the trick that caused my jaw to fall. In Lex's presentation, the game of writing applications in SQL went from this:

Implement complex thoughts in crappy language that requires me to record my thoughts in a format that doesn't much resemble my thinking.
Worry whether the implementation was really right.

...to this:

Record complex thoughts using a language designed well to record exactly such thoughts.
Translate the specification of the program into SQL, using translation patterns.

Since our predicate calculus expressions were explicit enough to be provable, and since we could prove the correctness of the translations we were using to move from our specification to our SQL, we could actually then prove the correctness of our SQL. It was a beam of hope that developers could actually write correct applications ...and know it!

That's the first day I ever got excited thinking about SQL.

So, on April 27–29 in Dallas, I'll get a chance to enter the next phase of that thinking. On top of that, the message will be delivered by Chris Date, who I really enjoyed at the Hotsos Symposium earlier this month, and who is one of the pioneers who invented the whole world our careers live in. I'm looking to forward to it. It should be an interesting classroom, with Chris Date in the front and Karen Morton, Jeff Holt, and some others with me in the back. I hope you won't miss the opportunity.

Like I said, the final day to sign up is this Thursday 2 April 2009. I know that economic times are tough these days, but this is a one-of-a-kind education event that I believe will deliver lasting value to everyone who goes.