Cary Millsap: waste

Wednesday, April 8, 2009

What would you do with 8 disks?

Yesterday, David Best posted this question at Oracle-L:

If you had 8 disks in a server what would you do? From watching this list I can see alot of people using RAID 5 but i'm wary of the performance implicatons. (http://www.miracleas.com/BAARF/)

I was thinking maybe RAID 5 (3 disks) for the OS, software and
backups. RAID 10 (4 disks + 1 hot spare) for the database files.

Any thoughts?

I do have some thoughts about it.

There are four dimensions in which I have to make considerations as I answer this question:

Volume
Flow
Availability
Change

Just about everybody understands at least a little bit about #1: the reason you bought 8 disks instead of 4 or 16 has something to do with how many bytes of data you're going to store. Most people are clever enough to figure out that if you need to store N bytes of data, then you need to buy N + M bytes of capacity, for some M > 0 (grin).

#2 is where a lot of people fall off the trail. You can't know how many disks you really need to buy unless you know how many I/O calls per second (IOPS) your application is going to generate. You need to ensure that your sustained IOPS rate on each disk will not exceed 50% (see Table 9.3 in Optimizing Oracle Performance for why .5 is special). So, if a disk drive is capable of serving N 8KB IOPS (your disk's capacity for serving I/O calls at your Oracle block size), then you better make sure that the data you put on that disk is so interesting that it motivates your application to execute no more than .5N IOPS to that disk. Otherwise, you're guaranteeing yourself a performance problem.

Your IOPS requirement gets a little trickier, depending on which arrangement you choose for configuring your disks. For example, if you're going to mirror (RAID level 1), then you need to account for the fact that each write call your application makes will motivate two physical writes to disk (one for each copy). Of course, those write calls are going to separate disks, and you better make sure they're going through separate controllers, too. If you're going to do striping with distributed parity (RAID level 5), then you need to realize that each "small" write call is going to generate four physical I/O calls (two reads, and two writes to two different disks).

Of course, RAID level 5 caching complicates the analysis at low loads, but for high enough loads, you can assume away the benefits of cache, and then you're left with an analysis that tell you that for write-intensive data, RAID level 5 is fine as long as you're willing to buy 4× more drives than you thought you needed. ...Which is ironic, because the whole reason you considered RAID level 5 to begin with is that it costs less than buying 2× more drives than you thought you needed, which is why you didn't buy RAID level 1 to begin with.

If you're interested in RAID levels, you should peek at a paper I wrote a long while back, called Configuring Oracle Server for VLDB. It's an old paper, but a lot of what's in there still holds up, and it points you to deeper information if you want it.

You have to think about dimension #3 (availability) so that you can meet your business's requirements for your application to be ready when its users need it. ...Which is why RAID levels 1 and 5 came into the conversation to begin with: because you want a system that keeps running when you lose a disk. Well, different RAID levels have different MTBF and MTTR characteristics, with the bottom line being that RAID level 5 doesn't perform quite as well (or as simply) as RAID level 1 (or, say 1+0 or 0+1), but RAID level 5 has the up-front gratification advantage of being more economical (unless you get a whole bunch of cache, which you pretty much have to, because you want decent performance).

The whole analysis—once you actually go through it—generally funnels you into becoming a BAARF Party member.

Finally, dimension #4 is change. No matter how good your analysis is, it's going to start degrading the moment you put your system together, because from the moment you turn it on, it begins changing. All of your volumes and flows will change. So you need to factor into your analysis how sensitive to change your configuration will be. For example, what % increase in IOPS will require you to add another disk (or pair, or group, etc.)? You need to know in advance, unless you just like surprises. (And you're sure your boss does, too.)

Now, after all this, what would I do with 8 disks? I'd probably stripe and mirror everything, like Juan Loaiza said. Unless I was really, really (I mean really, really) sure I had a low write-rate requirement (think "web page that gets 100 lightweight hits a day"), in which I would consider RAID level 5. I would make sure that my sustained utilization for each drive is less than 50%. In cases where it's not, I would have a performance problem on my hands. In that case, I'd try to balance my workload better across drives, and I would work persistently to find any applications out there that are wasting I/O capacity (naughty users, naughty SQL, etc.). If neither of those actions reduced the load by enough, then I'd put together a justification/requisition for more capacity, and I would brace myself to explain why I thought 8 disks was the right number to begin with.

Wednesday, February 18, 2009

Throughput versus Response Time

I like Doug Burns's recent blog post called Time Matters: Throughput vs. Response Time. If you haven't read it, please do. The post and its comment thread are excellent.

The principle Doug has recognized is why the knee in the performance curve is defined as the traffic intensity (think utilization, or load) value at which, essentially, the ratio of response time divided by throughput is minimized. It's not just the place where response time is minimized (which, as Doug observed, is when there's no load at all except for you, ...which is awesome for you, but not so good for business).

I'd like to emphasize a couple of points. First, batch and interactive workloads have wholly different performance requirements, which several people have already noted in their comments to Doug's post. With batch work, people are normally concerned with maximizing throughput. With online work, individual people care more about their own response times than group throughput, although those people's managers probably care more about group throughput. The individual people probably care about group throughput too, but not so much that they're happy about staying late after work to provide it when their individual tasks run so slowly they can't finish them during the normal working day.

In addition to having different performance requirements, batch workload can often be scheduled differently, too. If you're lucky, you can schedule your batch workload deterministically. For example, maybe you can employ a batch workload manager that feeds workload to your system like a carefully timed IV drip, to keep your system's CPU utilization pegged at 100% without causing your CPU run-queue depth to exceed 1.0. But online workload is almost always nondeterministic, which is to say that it can't be scheduled at all. That's why you have to keep some spare un-utilized system capacity handy; otherwise, your system load goes out past the nasty knee in your performance curve, and your users' response times behave exponentially in response to microscopic changes in load, which results in much Pain and Suffering.

My second point is one that I find that a lot of people don't understand very well: Focusing on individual response time—as in profiling—for an individual business task is an essential element in a process to maximize throughput, too. There are good ways to make a task faster, and there are bad ways. Good ways eliminate unnecessary work from the task without causing negative side-effects for tasks you're not analyzing today. Bad ways accidentally degrade the performance of tasks other than the one(s) you're analyzing.

If you stick to the good ways, you don't end up with the see-saw effect that most people seem to think of when they hear "optimize one business task at a time." You know, the idea that tuning A breaks B; then tuning B breaks A again. If this is happening to you, then you're doing it wrong. Trying to respond to performance problems by making global parameter changes commonly causes the see-saw problem. But eliminating wasteful work creates collateral benefits that allow competing tasks on your system to run faster because the task you've optimized now uses fewer resources, giving everything else freer and clearer access to the resources they need, without having to queue so much for them.

Figuring out how to eliminate wasteful work is where the real fun begins. A lot of the tasks we see are fixable by just changing just a little bit of source code. I mean the 2,142,103-latch query that consumes only 9,098 latches after fixing; things like that. A lot more are fixable by simply collecting statistics correctly. Others require adjustments to an application's indexing strategy, which can seem tricky when you need to optimize across a collection of SQL statements (here comes the see-saw), but even that is pretty much a solved problem if you understand Tapio Lahdenmäki's work (except for the inevitable politics of change control).

Back to the idea of Doug's original post, I wholeheartedly agree that you want to optimize both throughput and response time. The business has to decide what mixture is right. And I believe it's crucial to focus on eliminating waste from each individual competing task if you're going to have any hope of optimizing anything, whether you care more about response time, or throughput.

Think about it this way... A task cannot run at its optimal speed unless it is efficient. You cannot know whether a task is efficient without measuring it. And I mean specifically and exactly it, not just part of "it" or "it" plus a bunch of other stuff surrounding it. That's what profiling is: the measurement of exactly one interesting task that allows you to determine exactly where that task spends its time, and thus whether that task is spending your system's time and resources efficiently.

You can improve a system without profiling, and maybe you can even optimize one without profiling. But you can't know whether a system is optimal without knowing whether its tasks are efficient, and you can't know whether a given task is efficient without profiling it.

When you don't know, you waste time and money. This is why I contend that the ability to profile a single task is absolutely vital to anyone wanting to optimize performance.