Cary Millsap

Monday, February 22, 2010

Thinking Clearly About Performance, revised to include Skew

I’ve just updated the “Thinking Clearly” paper to include an absolutely vital section that was, regrettably, missing from the first revision. It’s a section on the subject of skew.

I hope you enjoy.

Wednesday, February 10, 2010

Thinking Clearly About Performance

I’ve posted a new paper at method-r.com called “Thinking Clearly About Performance.” It’s a topic I’ll be presenting this year at:

RMOUG (Denver) this month
Miracle OpenWorld (Billund DK) in April
UTOUG (Salt Lake City) in May
Kaleidoscope (Washington DC) in June
MOTS (Ann Arbor) in September

The paper is only 13 pages long, and I think you’ll be pleased with its information density. Here is the table of contents:

An Axiomatic Approach
What is Performance?
Response Time vs Throughput
Percentile Specifications
Problem Diagnosis
The Sequence Diagram
The Profile
Amdahl’s Law
Minimizing Risk
Load
Queueing Delay
The Knee
Relevance of the Knee
Capacity Planning
Random Arrivals
Coherency Delay
Performance Testing
Measuring
Performance is a Feature

As usual, I learned a lot writing it. I hope you’ll find it to be a useful distillation of how performance works.

Tuesday, December 22, 2009

The “Do What You Love” Mirage

I am inspired by having read an article called “Do what you love mirage” by Denis Basaric. It begins...

“Do what you love” is advice I hear exclusively from financially secure people. And it rings hollow to me. When you need money to survive, you do any work that is available, love does not play into that choice. Desperation does.

Please read it before you go on.

Welcome back.

This article puts a very important cycle within my life into words. I believe, as Denis says, that a lot of times, we get the cause-effect relationship mixed up when we think about loving what we do.

I love what I do. Well, a lot of it. But Denis is right: I didn’t choose what I do out of love. I chose what I love out of doing. Some examples:

I love mathematics. But I most assuredly did not always love it. I learned to love it through working hard at it.
It’s the same thing with writing. I love it, but I didn’t always. At first, writing was unrewarding drudgery, which is how most people I meet seem to feel about it.
I love public speaking, but I sure didn’t love it when my speech class made me sick to my stomach three mornings a week for a whole semester my freshman year.
I love being an Oracle performance specialist, but I sure didn’t love being airlifted into crisis after crisis throughout the early 1990s.

I could go on. The point is, my life would be unrecognizably different if not for several really painful situations that I decided to endure with the resolve to get really good at what I hated. Until I loved it.

In retrospect, I seem to have been very lucky in many important situations. Of course, I have. But you make your own luck. Although I believe deeply in the idea of, “The harder I work, the luckier I get,” that is not what I’m talking about here. I’m talking about the power that you have to define for yourself whether something that happened was lucky for you or not. Your situations do not define your life. You create your life based on how you regard your situations.

I could have rebelled against Jimmy Harkey and hated math for the rest of my life. Lots of kids did. I could have rebelled against Lewis Parkhill and never become a writer. I could have refused Craig Newberger’s advice to take his second speech course and never become comfortable in front of an audience. I could have left Oracle in 1991 and found a job where they had more mature products....

One of the most important questions that I ever asked my wife before our engagement was this:

If you were forced to wash cars for 12 hours a day, just to make a living, could you enjoy it?

This is a “soulmate” kind of question for me. My wife’s attitude about it is, for our children and me, possibly the most valuable gift in our lives.

Loving what you do can be difficult. I think Denis hits the nail on the head by suggesting that,

By doing good work, you just might find out that what you are doing, is what you are supposed to do. And if you don’t, quality work will get you to where you want to be.

I hope you will find love in what you do today. Do it well, and it’ll definitely improve your odds.

Monday, December 21, 2009

My Whole System Is Slow. Now What?

At CMG'09 a couple of weeks ago, I presented "Measuring Response Times of Code on Oracle Systems." The paper for this presentation was a subset of "For Developers: Making Friends with the Oracle Database." In the presentation, I spent a few minutes talking about why to measure response times in Oracle, and then I spent a lot of minutes talking about how. As usual, I focused heavily on the importance of measuring response times of individual business tasks executed by individual end users.

At the end of the talk, a group of people came to the podium to ask questions (always a good sign). The first question was the question that a lot of people ask. It was:

My whole system is slow. That's all my users will tell me. So then, how do I begin to do what you're describing?

Here's the answer:

Ask your users to show you what they're doing. Just go look at it.

The results of this simple advice are routinely spectacular. Just go look at it: I'm surprised whenever someone doesn't think of doing that, but I shouldn't be. That's because I didn't do it either, for the longest time. I had to learn to. And that's the story I want to tell you here.

In the early 1990s, I was a consultant with Oracle Corporation visiting clients with performance problems at a pace of more than 30 per year. Back then, I did Oracle performance work the old fashioned way: I checked everything I knew how to check, and then I fixed everything I knew how to fix. All billable by the hour. (Note: When I was doing it this way, I had not yet been taught by Dave Ensor, who changed me forever.)

On weeks when I was lucky, I'd be finished checking and fixing by sometime Wednesday, leaving a couple of days to find out what people thought of my work. If I were lucky again (that's two "lucky"s now), everyone would be thrilled with the results. I'd get my hug (so to speak), and I'd catch my flight.

But I wasn't always lucky. Some weeks, I wouldn't find anything suspicious in my checking and fixing. Some weeks, I'd find plenty, but still not everyone would be thrilled with the work. Having people be less than thrilled with my work caused pain for me, which motivated me to figure out how to take more control of my consulting engagements, to drive luck out of the equation.

The most important thing I figured out was...

People knew before I came on-site how they were going to measure on Thursday whether they liked the results of my work.

And...

They were willing to tell me on Monday.

All I had to do was be honest, like this:

On the day I'm done working here, I'd imagine you're going to want to run something that will demonstrate whether I accomplished what you were hoping for while I was here. Would you mind telling me about that now? Maybe even showing me?

I could ask that on Monday, and people were glad to tell me. I'd watch the things run and record how long they ran, and then I'd know how to prioritize my time on site. I'd record how long they ran so at the end of my engagement, I'd be able to show very clearly what improvements I had made.

Sometimes, there would be thirty different things that people would expect to measure on Thursday. If I might not have time to fix them all, then I needed to make sure that I knew the priority of the things I was being asked to fix.

That one step alone—knowing on Monday that prioritized list of what tasks needed to be fast by Thursday—drastically reduced my reliance on luck as a success factor in my job at these sites. Knowing that list on Monday is just like when your teacher in school tells you exactly what's going to be on your next test. It allows you to focus your attention on exactly what you need to do to optimize your reward for the week. (Note to fellow education enthusiasts: Please don't interpret this paragraph as my advocating the idea that GPA should be a student's sole—or even dominant—optimization constraint.)

So, what I learned is that the very first step of any good performance optimization method is necessarily this:

1. Identify the task that's the most important to you.

When I say "task," think "program" or "click" or "batch job" if you want to. What I mean is "a useful unit of work that makes sense to the business." ...Something that a business user would show you if you just went and watched her work for a few minutes.

Then comes step two:

2. Measure its response time (R). In detail.

Why is response time so important? Because that's what's important to the person who'll be watching it run on Thursday, assessing whether she thinks you've done a good job or not. That person's going to click and then wait. Happiness will be inversely proportional to how long the wait is. That's it. That's what "performance" means at 99% of sites I've ever visited.

(If you're interested in the other 1% of sites I've visited, they're interested in throughput, which I've written about in another blog post.)

Measuring response time is vital. You must be able to measure response time if you're going to nail that test on Thursday.

The key is to understand that the term response time doesn't even have a definition except in the context of a task. You can't measure response time if you don't first decide what task you're going to measure. In other words, you cannot do step 2 before you do step 1. With Oracle, for example, you can collect ASH data (if you're licensed to use it) or even trace data for a whole bunch of Oracle processes, but you won't have a single response time until you define which tasks buried within that data are the ones you want to extract and pay attention to.

You get that by visiting a user and watching what she does.

There are lots of excuses for not watching your users. Like these...

"I don't know my users." I know. But you should. You'd do your job better if you did. And your users would, too.
"My users aren't here." I know. They're on the web. They're in Chicago and Singapore and Istanbul, buying plane tickets or baseball caps or stock shares. But if you can't watch at least a simulation of the things those users actually do with the system you help manage, then I can't imagine how you would possibly succeed at providing good performance to them.
"I'm supposed to be able to manage performance with my dashboard." I know. I was supposed to have a hover car by the year 2000.

The longer you stay mired in excuses like these, the longer it's going to be before you can get the benefit of my point here. Your users are running something, and whatever that is that they're running is your version of my Thursday test. You can check and fix all you want, but unless you get lucky and fix the exact tooth that's hurting, your efforts aren't going to be perceived as "helpful." Checking and fixing everything you can think of is far less efficient and effective than targeting exactly what your user needs you to target.

Lots of performance analysts (DBAs, developers, architects, sysadmins, and so on) assume that when someone says, "The whole system is slow," it means there must be a single parameter somewhere in the bowels of the system that needs adjustment, and if you can just make that adjustment, everything is going to be ok. It might mean that, but in my experience, the overwhelming majority of cases are not that way. (Pages 25–29 of Optimizing Oracle Performance has more information about this.)

The great thing about measuring response time is that no matter what the problem is, you'll see it. If the program you're watching is poorly written, you'll see it. If some other program is hogging too much of a resource that your program needs, you'll see it. If you have a bad disk controller, you'll see it. If some parameter needs adjusting, you'll see it.

Realize that when a business user says "system," she doesn't mean what you would mean if you said "system." She means that the thing she runs is slow. Look at that thing. Maybe there are seventeen of them. And sure, maybe all seventeen suffer from the same root cause. If that's the case, then fantastic, because fixing the first problem will magically fix the other sixteen, too. If it's not, then fantastic anyway, because now all of them are on your prioritized list of tasks to optimize, and you'll probably surprise yourself how quickly you'll be able to pick them all off when you focus on one task at a time.

Monday, December 7, 2009

C. J. Date: All Systems Go

Enrollment is open for the course taught by Christopher J. Date that we'll host 26–28 January in the Dallas/Fort Worth area. For many of us, this will be a once-in-a-lifetime opportunity to sit in a classroom for three days with one of the pioneers who created the field we live in each day.

I'm looking forward to this course myself. It is so easy to use Oracle in non-relational ways. But not understanding how to use SQL relationally leads to countless troubles and unnecessary complexities. Chris's focus in this course will be the discipline to use Oracle in a truly relational way, the mastery of which will make your applications faster, easier to prove, and more fun to write, maintain, and ehnance.

I'm particularly looking forward to his section about missing values—the ages-old debate about NULL—which is not covered in Chris's book upon which the course is based.

Thursday, December 3, 2009

The Thing (I Don’t Like) About Video

Video is awesome. I like high-bandwidth communication. Even on the cheapest, most un-produced videos, I can see facial expressions and body language that I’d never be able to pick up from text. I can see candidness that’s not going to come through in a document, even a blog that’s written pretty much off the cuff. And videos with high production value, ...well of course it’s awesome to watch a great short movie right at the tips of your fingers.

But...

But when you send me a 7:48 video, I have to budget 7:48 to watch it. (Well, more actually, because of the latency required to buffer it up.) When you send me a 13-page document, I can “read” it in 10 seconds if I want to. I can skim the first and final paragraphs really quickly and look for pictures or sidebars or quotes, and it takes practically no time for me to do it.

With a video, it’s just more difficult to do that. I can watch the first 10 seconds and usually know whether I want to watch the remainder. But skimming through the whole video—like skipping to the end—is more difficult, because I have to sit there un-utilized while the whole video buffers up. Then I have to sit there while words come at me aurally, which is annoyingly sequential compared to reading buckets of text in one eyeful.

So, the bottom line is that the first 10 seconds of your video need to convince me to watch the remainder.

Or I won’t.

Is it just me?

What’s out there to make video browsing a better, more time-efficient, and more fulfilling experience?

Friday, November 20, 2009

Performance Optimization with Global Entry. Or Not?

As I entered the 30-minute "U.S. Citizens" queue for immigration back into the U.S. last week, the helpful "queue manager" handed me a brochure. This is a great place to hand me something to read, because I'm captive for the next 30 minutes as I await my turn with the immigration officer at the Passport Control desk. The brochure said "Roll through Customs faster."

Ok. I'm listening.

Inside the brochure, the first page lays out the main benefits:

bypass the passport lines
no paper Customs declaration
in most major U.S. airports

Well, that's pretty cool. Especially as I'm standing only 5% deep in a queue with a couple hundred people in it. And look, there's a Global Entry kiosk right there with its own special queue, with nobody—nobody!—in it.

If I had this Global Entry thing, I'd have a superpower that would enable me to zap past the couple hundred people in front of me, and get out of the Passport Control queue right now. Fantastic.

So what does this thing cost? It's right there in the brochure:

Apply online at www.globalentry.gov. There is a non-refundable $100 application fee. Membership is valid for five years. That's $20 a year for the queue-bypassing superpower. Not bad. Still listening.
Schedule an in-person interview. Next, I have to book an appointment to meet someone at the airport for a brief interview.
Complete the interview and enrollment. I give my interview, get my photo taken, have my docs verified, and that's it, I'm done.

So, all in all, it doesn't cost too much: a hundred bucks and probably a couple hours one day next month sometime.

What's the benefit of the queue-bypassing superpower? Well, it's clearly going to knock a half-hour off my journey through Passport Control. I immigrate three or four times per year on average, and today's queue is one of the shorter ones I've seen, so that's at least a couple hours per year that I'd save... Wow, that would be spectacular: a couple more hours each year in my family's arms instead of waiting like a lamb at the abattoir to have my passport controlled.

But getting me into my family's arms 30 minutes earlier is not really what happens. The problem is a kind of logic that people I meet get hung up in all the time. When you think about subsystem (or resource) optimization, it looks like your latency savings for the subsystem should go straight to your system's bottom line, but that's often not what happens. That's why I really don't care about subsystem optimization; I care about response time. I could say that a thousand times, but my statement is too abstract to really convey what I mean unless you already know what I mean.

What really happens in the airport story is this: if I had used Global Entry on my recent arrival, it would have saved me only a minute or two. Not half an hour, not even close.

It sounds crazy, doesn't it? How can a service that cuts half an hour off my Passport Control time not get me home at least a half hour earlier?

You'll understand once I show you a sequence diagram of my arrival. Here it is (at right). You can click the image to embiggen it, if you need.

To read this sequence diagram, start at the top. Time flows downward. This sequence diagram shows two competing scenarios. The multicolored bar on the left-hand side represents the timeline of my actual recent arrival at DFW Airport, without using the Global Entry service. The right-hand timeline is what my arrival would have looked like had I been endowed with the Global Entry superpower.

You can see at the very bottom of the timeline on the right that the time I would have saved with Global Entry is minuscule: only a minute or two.

The real problem is easy to see in the diagram: Queue for Baggage Claim is the great equalizer in this system. No matter whether I'm a Global Entrant or not, I'm going to get my baggage when the good people outside with the Day-Glo Orange vests send it up to me. My status in the Global Entry system has absolutely no influence over what time that will occur.

Once I've gotten my baggage, the Global Entry superpower would have again swung into effect, allowing me to pass through the zero-length queue at the Global Entry kiosk instead of waiting behind two families at the Customs queue. And that's the only net benefit I would have received.

Wait: there were only two families in the Customs queue? What about the hundreds of people I was standing behind in the Passport Control queue? Well, many of them were gone already (either they had hand-carry bags only, or their bags had come off earlier than mine). Many others were still awaiting their bags on the Baggage Claim carousel. Because bags trickle out of the baggage claim process, there isn't the huge all-at-once surge of demand at Customs that there is at Passport Control when a plane unloads. So the queues are shorter.

At any rate, there were four queues at Customs, and none of them was longer than three or four families. So the benefit of Global Entry—in exchange for the $100 and the time spent doing the interview—for me, this day, would have been only the savings of a couple of minutes.

Now, if—if, mind you—I had been able to travel with only carry-on luggage, then Global Entry would have provided me significantly more value. But when I'm returning to the U. S. from abroad, I'm almost never allowed to carry on any bag other than my briefcase. Furthermore, I don't remember ever clearing Passport Control to find my bag waiting for me at Baggage Claim. So the typical benefit to me of enrolling in Global Entry, unfortunately, appears to be only a fraction of the duration required to clear Customs, which in my case is almost always approximately zero.

The problem causing the low value (to me) of the Global Entry program is that the Passport Control resource hides the latency of the Baggage Claim resource. No amount of tuning upon the Passport Control resource will affect the timing of the Baggage In Hand milestone; the time at which that milestone occurs is entirely independent of the Passport Control resource. And that milestone—as long as it occurs after I queue for Baggage Claim—is a direct determinant of when I can exit the airport. (Gantt or PERT chart optimizers would say that Queue for Baggage Claim is on the critical path.)

How could a designer make the airport experience better for the customer? Here are a few ideas:

Let me carry on more baggage. This idea would allow me to trot right through Baggage Claim without waiting for my bag. In this environment, the value of Global Entry would be tremendous. Well, nice theory; but allowing more carry-on baggage wouldn't work too well in the aggregate. The overhead bins on my flight were already stuffed to maximum capacity, and we don't need more flight delays induced by passengers who bring more stuff onboard than the cabin can physically accommodate.
Improve the latency of the baggage claim process. The sequence diagram shows clearly that this is where the big win is. It's easy to complain about baggage claim, because it's nearly always noticeably slower than we want it to be, and we can't see what's going on down there. Our imaginations inform us that there's all sorts of horrible waste going on.
Use latency hiding to mask the pain of the baggage claim process. Put TV sets in the Baggage Claim area, and tune them to something interesting instead of infinite loops of advertising. At CPH, they have a Danish hot dog stand in the baggage claim area. They also have a currency exchange office in there. Excellent latency hiding ideas if you need a snack or some DKK walkin'-around-money.

Latency hiding is a weak substitute for improving the speed of the baggage claim process. The killer app would certainly be to make Baggage Claim faster. Note, however, that just making Baggage Claim a little bit faster wouldn't make the Global Entry program any more valuable. To make Global Entry any more valuable, you'd have to make Baggage Claim fast enough that your bag would be waiting for anyone who cleared the full Passport Control queue.

So, my message today: When you optimize, you must first know your goal. So many people optimize subsystems (resources) that they think are important, but optimizing subsystems is often not a path to optimizing what you really want. At the airport, I really don't give a rip about getting out of the Passport Control queue if it just means I'm going to be dumped earlier into a room where I'll have to wait until an affixed time for my baggage.

Once you know what your real optimization goal is (that's Method R step 1), then the sequence diagram is often all you need to get your breakthrough insight that either helps you either (a) solve your problem or (b) understand when there's nothing further that you can really do about it.