Friday, April 24, 2009

The Most Common Performance Problem I See

At the Percona Performance Conference in Santa Clara this week, the first question an audience member asked our panel was, "What is the most common performance problem you see in the field?"

I figured, being an Oracle guy at a MySQL conference, this might be my only chance to answer something, so I went for the mic. Here is my answer.
The most common performance problem I see is people who think there's a most-common performance problem that they should be looking for, instead of measuring to find out what their performance problem actually is.
It's a meta answer, but it's a meta problem. The biggest performance problems I see, and the ones I see most often, are not problems with machines or software. They're problems with people who don't have a reliable process of identifying the right thing to work on in the first place.

That's why the definition of Method R doesn't mention Oracle, or databases, or even computers. It's why Optimizing Oracle Performance spends the first 69 pages talking about red rocks and informed consent and Eli Goldratt instead of Oracle, or databases, or even computers.

The most common performance problem I see is that people guess instead of knowing. The worst cases are when people think they know because they're looking at data, but they really don't know, because they're looking at the wrong data. Unfortunately, every case of guessing that I ever see is this worst case, because nobody in our business goes very far without consulting some kind of data to justify his opinions. Tim Cook from Sun Microsystems pointed me yesterday to a blog post that gives a great example of that illusion of knowing when you really don't.

Monday, April 13, 2009

Maxine Johnson

I want to introduce you to Maxine Johnson, assistant manager of men's sportswear at Nordstrom Galleria Dallas. I think Maxine is important because she taught my son and me about customer service. I met her several months ago. I still have her card, and I'm still grateful to her. Here's what happened.

A few months ago, my wife and I were in north Dallas with some time to spare, and I convinced her to go with me to pick out one or two pairs of dress slacks. I felt like I was wearing the same pants over and over again when I traveled, and I could use an extra pair or two. We usually go to Nordstrom for that, and so we did again. After some time, I had two pairs of trousers that we both liked, and so we had them measured for hemming and picked them up a few days later.

A week or two passed, and then I packed a pair of my new pants for a trip to Zürich. I put them on in the hotel the first morning I was supposed to speak at an event. On my few-block walk from the hotel to the train station, I caught my reflection in a store window, and—hmmp—my pants were just not... really... quite... long enough. Every step, the whole cuff would come way up above the tops of my shoes. I stopped and tugged them down, and then they seemed alright, but then as soon as I started walking again, they'd ride back up and look too short.

They weren't bad enough that anyone said anything, but I was a little self-conscious about it. I kept tugging at them all day.

When I hung them back up in my closet at home, I noticed that when I folded them over the hanger, they didn't reach as far as the other pants that I really liked. Sure enough, when I lined up the waists, these new pants were about an inch shorter than my favorite ones that I had bought at Nordstrom probably four years ago.

Now, pants at Nordstrom cost a little more than maybe at a lot of other places, but they're worth to me what I pay for them because they're nice, and they last a long time. But these new ones made me feel bad, because they were just a little bit off. I could already foresee a future of two new pairs of slacks hanging in my closet for years, never really making the starting rotation because they're just a little bit off, but never making the garage sale pile, either, because they had cost too much.

My wife agreed. They were shorter than the others. They were shorter than they should be. I needed to get them fixed.

Now, this is the part I always hate. Having made the decision, the next step is that step where you take the thing back and try to get the problem fixed. I hate that part. My wife doesn't mind it so much, but these were my pants, and so I was the one that had to go back and put them on so someone could fix them. I really dreaded it though, because I knew that the only way they could fix those pants was to take off the cuff.

It's late in the evening by the time my wife helps me build up a little head of steam, and we both decide (well, she decides, but she's right) that tonight is the perfect night for me to go on a 20-mile drive across town to Nordstrom to get my pants fixed. As a matter of fact, it'd be good if my older boy went with me. That makes it a little more fun, because he's good company for me.

It was late enough by then that before I could leave, I had to phone ahead, just to make sure the store was still open. A nice lady answered the phone. I said my name and told the nice lady that I was having some trouble with some slacks I had bought a few weeks ago, and how late did they stay open? She told me to come right on over.

So my boy and I got into the car, and I drove right on over.

A half hour later, I walked into the store, thankful that the doors were still open, carrying two pairs of slacks on a hanger, with my son walking beside me. A smiling nice lady approached me as I entered the men's department. "Mr. Millsap?" Yes, I am. It surprises me anytime someone remembers my name from that one phase of the conversation where I say real fast, "My name is Cary Millsap, and blah blah blah blah blah," and tell my whole story. The person on the phone hadn't asked me again what my name was. She had caught it in the blur at the beginning of my story.

She proceeded to explain to me what was going to happen. I was going to try on the slacks in the dressing room. The tailor would be there waiting for me. She and the tailor would look them over. If there was enough fabric to make them longer, then they'd do that tonight. If there wasn't, then she was going to find two new pairs of slacks for me, and the tailor would have them ready for me tomorrow. If for any reason, those didn't work, then she'd keep preparing new trousers for me until I was satisfied.

Mmm, ok. I was probably grinning a little bit by now, because this was pretty fantastic news. I wasn't going to have to get my pants de-cuffed. I was still a little nervous, though, that when I came out of the dressing room, everyone was going to look at me like, "So what's the problem? I don't see any problem. Those are long enough."

When I came out, Maxine Johnson crossed her arms, put her hand to her chin, shook her head a little, and immediately said something to the effect of, "Oh my, no. That won't do at all." So she brought me two new pairs, which I tried on, and which the tailor measured for me. She gave me a reclaim ticket for the next day. As usual, I had missed her name when she introduced herself as I first entered the men's department. (As you probably already figured out, I have a bad habit of not paying enough attention to that part of the conversation that I think of as "the blur.") I did have the good sense to ask for her business card, which is why I know her name is Maxine Johnson.

My boy and I talked the whole ride home that what we had seen that night had been some real, first-class retail customer care right there, and that we all knew where we'd be buying my next pairs of pants. When I had gotten into the car an hour or so before, I had been very apprehensive about what might happen. I had been especially nervous about how I'd perform during the proving-what's-wrong part of the project. But Maxine Johnson put me completely at ease during my experience. She didn't just do the right thing, she did it in such a manner that I felt glad the whole problem had happened. Here's the thing:
Maxine Johnson made me feel like it was not just ok that I brought the pants back for repair, she made me feel like she was delighted by the opportunity to show me what Nordstrom could do for me under pressure.
I hope that the way Maxine Johnson made me feel is the way that my employees and I make our customers feel. I hope it's the way my children make their customers feel someday when they go to work.

Thank you, Maxine Johnson. Thank you.

Wednesday, April 8, 2009

What would you do with 8 disks?

Yesterday, David Best posted this question at Oracle-L:
If you had 8 disks in a server what would you do? From watching this list I can see alot of people using RAID 5 but i'm wary of the performance implicatons. (http://www.miracleas.com/BAARF/)

I was thinking maybe RAID 5 (3 disks) for the OS, software and
backups. RAID 10 (4 disks + 1 hot spare) for the database files.

Any thoughts?
I do have some thoughts about it.

There are four dimensions in which I have to make considerations as I answer this question:
  1. Volume
  2. Flow
  3. Availability
  4. Change
Just about everybody understands at least a little bit about #1: the reason you bought 8 disks instead of 4 or 16 has something to do with how many bytes of data you're going to store. Most people are clever enough to figure out that if you need to store N bytes of data, then you need to buy N + M bytes of capacity, for some M > 0 (grin).

#2 is where a lot of people fall off the trail. You can't know how many disks you really need to buy unless you know how many I/O calls per second (IOPS) your application is going to generate. You need to ensure that your sustained IOPS rate on each disk will not exceed 50% (see Table 9.3 in Optimizing Oracle Performance for why .5 is special). So, if a disk drive is capable of serving N 8KB IOPS (your disk's capacity for serving I/O calls at your Oracle block size), then you better make sure that the data you put on that disk is so interesting that it motivates your application to execute no more than .5N IOPS to that disk. Otherwise, you're guaranteeing yourself a performance problem.
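To make that arithmetic concrete, here is a minimal Python sketch of the 50% rule. The function name and the example figures (a drive that can serve 120 8KB IOPS, a workload of 400 IOPS) are assumptions for illustration only, not measurements of any real drive or application.

```python
import math


def disks_needed(required_iops, disk_iops_capacity, max_utilization=0.5):
    """Smallest number of disks that keeps each disk at or below max_utilization.

    Assumes the workload spreads evenly across all disks, which real
    workloads rarely do without some effort.
    """
    if required_iops < 0 or disk_iops_capacity <= 0 or not 0 < max_utilization <= 1:
        raise ValueError("inputs must be positive, with 0 < max_utilization <= 1")
    usable_iops_per_disk = disk_iops_capacity * max_utilization
    return math.ceil(required_iops / usable_iops_per_disk)


# Hypothetical numbers: each disk serves 120 IOPS, the application needs 400.
print(disks_needed(400, 120))
```

With these made-up numbers, each disk may carry only 60 sustained IOPS, so the count is driven by flow rather than by how many bytes you have to store.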

Your IOPS requirement gets a little trickier, depending on which arrangement you choose for configuring your disks. For example, if you're going to mirror (RAID level 1), then you need to account for the fact that each write call your application makes will motivate two physical writes to disk (one for each copy). Of course, those write calls are going to separate disks, and you better make sure they're going through separate controllers, too. If you're going to do striping with distributed parity (RAID level 5), then you need to realize that each "small" write call is going to generate four physical I/O calls (two reads, and two writes to two different disks).

Of course, RAID level 5 caching complicates the analysis at low loads, but for high enough loads, you can assume away the benefits of cache, and then you're left with an analysis that tells you that for write-intensive data, RAID level 5 is fine as long as you're willing to buy 4× more drives than you thought you needed. ...Which is ironic, because the whole reason you considered RAID level 5 to begin with is that it costs less than buying 2× more drives than you thought you needed, which is why you didn't buy RAID level 1 to begin with.
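The write-amplification arithmetic can be sketched in a few lines of Python. This is a rough model under stated assumptions: cache is ignored (the high-load case), every write is a "small" write, and reads cost one physical I/O each. The dictionary and function names are hypothetical.

```python
# Physical I/O calls generated by one small logical write, per the text:
# RAID 1 writes both copies; RAID 5 does two reads plus two writes.
PHYSICAL_IOS_PER_WRITE = {"raid1": 2, "raid5": 4}


def physical_iops(read_iops, write_iops, raid_level):
    """Physical IOPS the disks must absorb for a given logical workload."""
    return read_iops + write_iops * PHYSICAL_IOS_PER_WRITE[raid_level]


# Hypothetical write-only workload of 100 logical IOPS.
for level in ("raid1", "raid5"):
    print(level, physical_iops(0, 100, level))
```

For a write-only load the multipliers are exactly the 2× and 4× mentioned above; mixing in reads dilutes the penalty, which is why the write rate matters so much to the choice.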

If you're interested in RAID levels, you should peek at a paper I wrote a long while back, called Configuring Oracle Server for VLDB. It's an old paper, but a lot of what's in there still holds up, and it points you to deeper information if you want it.

You have to think about dimension #3 (availability) so that you can meet your business's requirements for your application to be ready when its users need it. ...Which is why RAID levels 1 and 5 came into the conversation to begin with: because you want a system that keeps running when you lose a disk. Well, different RAID levels have different MTBF and MTTR characteristics, with the bottom line being that RAID level 5 doesn't perform quite as well (or as simply) as RAID level 1 (or, say 1+0 or 0+1), but RAID level 5 has the up-front gratification advantage of being more economical (unless you get a whole bunch of cache, which you pretty much have to, because you want decent performance).

The whole analysis—once you actually go through it—generally funnels you into becoming a BAARF Party member.

Finally, dimension #4 is change. No matter how good your analysis is, it's going to start degrading the moment you put your system together, because from the moment you turn it on, it begins changing. All of your volumes and flows will change. So you need to factor into your analysis how sensitive to change your configuration will be. For example, what % increase in IOPS will require you to add another disk (or pair, or group, etc.)? You need to know in advance, unless you just like surprises. (And you're sure your boss does, too.)
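One way to answer the "what % increase" question ahead of time is a small headroom calculation like the sketch below. It assumes load is balanced evenly across drives and reuses the 50% ceiling from dimension #2. The function name and the example figures are hypothetical.

```python
def iops_headroom_percent(current_iops, n_disks, disk_iops_capacity,
                          max_utilization=0.5):
    """Percent growth in IOPS the current disks can absorb before any
    disk exceeds max_utilization. A negative result means you are
    already over the line."""
    if current_iops <= 0:
        raise ValueError("current_iops must be positive")
    ceiling = n_disks * disk_iops_capacity * max_utilization
    return (ceiling / current_iops - 1) * 100


# Hypothetical: 8 disks at 120 IOPS each, currently serving 400 physical IOPS.
print(round(iops_headroom_percent(400, 8, 120), 1))
```

Recomputing this number as your measured IOPS drift gives you warning before the surprise instead of after it.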

Now, after all this, what would I do with 8 disks? I'd probably stripe and mirror everything, like Juan Loaiza said. Unless I was really, really (I mean really, really) sure I had a low write-rate requirement (think "web page that gets 100 lightweight hits a day"), in which case I would consider RAID level 5. I would make sure that my sustained utilization for each drive is less than 50%. In cases where it's not, I would have a performance problem on my hands. In that case, I'd try to balance my workload better across drives, and I would work persistently to find any applications out there that are wasting I/O capacity (naughty users, naughty SQL, etc.). If neither of those actions reduced the load by enough, then I'd put together a justification/requisition for more capacity, and I would brace myself to explain why I thought 8 disks was the right number to begin with.

Friday, April 3, 2009

Cary on Joel on SSD

Joel Spolsky's article on Solid State Disks is a great example of a type of problem my career is dedicated to helping people avoid. Here's what Joel did:
  1. He identified a task needing performance improvement: "compiling is too slow."
  2. He hypothesized that converting from spinning rust disk drives (thanks mwf) to solid state, flash hard drives would improve performance of compiling. (Note here that Joel stated that his "goal was to try spending money, which is plentiful, before [he] spent developer time, which is scarce.")
  3. So he spent some money (which is, um, plentiful) and some of his own time (which is apparently less scarce than that of his developers) replacing a couple of hard drives with SSD. If you follow his Twitter stream, you can see that he started on it 3/25 12:15p and wrote about having finished at 3/27 2:52p.
  4. He was pleased with how much faster the machines were in general, but he was disappointed that his compile times underwent no material performance improvement.
Here's where Method R could have helped. Had he profiled his compile times to see where the time was being spent, he would have known before the upgrade that SSD was not going to improve response time. Given his results, his profile for compiling must have looked like this:
100%  Not disk I/O
  0%  Disk I/O
----  ------------
100%  Total
I'm not judging whether he wasted his time by doing the upgrade. By his own account, he is pleased at how fast his SSD-enabled machines are now. But if, say, the compiling performance problem had been survival-threateningly severe, then he wouldn't have wanted to expend two business days' worth of effort upgrading a component that was destined to make zero difference to the performance of the task he was trying to improve.
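Here is a minimal Python sketch of how a profile lets you predict the result of an upgrade before you pay for it. The durations are invented for illustration (they are not Joel's measurements), and the function names are hypothetical.

```python
def profile(durations):
    """Return (component, seconds, percent of total) rows, largest first."""
    total = sum(durations.values())
    if total <= 0:
        raise ValueError("total duration must be positive")
    rows = sorted(durations.items(), key=lambda item: item[1], reverse=True)
    return [(name, secs, 100 * secs / total) for name, secs in rows]


def predicted_total(durations, component, speedup):
    """Response time if `component` ran `speedup` times faster and
    everything else stayed the same."""
    total = sum(durations.values())
    return total - durations[component] + durations[component] / speedup


# Invented numbers: a 60-second task that spends 3 seconds on disk I/O.
task = {"not_disk_io": 57.0, "disk_io": 3.0}
for name, secs, pct in profile(task):
    print(f"{pct:5.1f}%  {secs:6.1f}s  {name}")
print(predicted_total(task, "disk_io", speedup=10))
```

Even an infinitely fast disk could save no more than the time the task spends on disk I/O, so a profile that shows 0% disk I/O predicts zero improvement from faster disks.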

So, why would someone embark upon a performance improvement project without first knowing exactly what result he should be able to expect? I can think of some good reasons:
  • You don't know how to profile the thing that's slow. Hey, if it's going to take you a week to figure out how to profile a given task, then why not spend half that time doing something that your instincts all say is surely going to work?
  • Um, ...
Ok, after trying to write them all down, I think it really boils down to just one good reason: if profiling is too expensive (that is, you don't know how, or it's too hard, or the tools to do it cost too much), then you're not going to do it. I don't know how I'd profile a compile process on a Microsoft Windows computer. It's probably possible, but I can't think of a good way to do it. It's all about knowing; if you knew how to do it, and it were easy, you'd do it before you spent two days and a few hundred bucks on an upgrade that might not give you what you wanted.

I do know that in the Oracle world, it's not hard anymore, and the tools don't cost nearly as much as they used to. There's no need anymore to upgrade something before you know specifically what's going to happen to your response times. Why guess... when you can know?