Tuesday, March 31, 2009

Last call for C. J. Date course

Note added 3 April 2009: When I wrote this post, we were counting down toward 2 April as the date for our preliminary go/no-go decision. That date is now behind us, and we have made the preliminary decision to Go. We are accepting further enrollments. —Cary Millsap
Thursday 2 April 2009 is our last call for enrollment in C. J. Date's course, "How to write correct SQL, and know it: a relational approach to SQL." I'm looking forward to this course more eagerly than anything I've attended in the past ten years, ...maybe twenty.

SQL and I never really got along too well. When I first joined Oracle Corporation in 1989, I was new to relational databases. I had done one hierarchical database project in college. I enjoyed the project ok, but it wasn't something I ever wanted to do again. When I joined Oracle, I didn't know much about relational technology or SQL. In my formative first couple of years at Oracle, though, I just never learned to like the SQL language. Prior to my Oracle career, I designed languages and wrote compilers for a living. From a language design standpoint, it just seemed that SQL (at least "Oracle SQL") could have become something really cool, but it didn't. For Oracle to treat an empty string as NULL, for example, is a decision which I still can't believe made it into the light of day...

I had a lot of respect over the years for the people I met who knew how to make SQL do what they wanted it to do. Dominic Delmolino was one of the first people I ever met who could make SQL do things I had no idea it could do. I'm still amazed when I see the things that Tom Kyte can do with SQL. I was never one of the SQL people.

Lex de Haan is the first person I ever met who really revealed to me what my problem was. A few years ago, Lex delivered a Miracle presentation in Rødby, Denmark, that dropped my jaw. He explained a better way to write an application with SQL. He showed how to write a completely unambiguous specification using a language I understood, predicate calculus ("this set equals that set," that kind of thing). He then showed how to implement that specification in SQL.

Here's the problem, though. SQL doesn't implement many of the set-theory/predicate-calculus operations that I expect. I'm not looking at Lex's notes as I write this, so I'll show you an example outlined recently by Toon Koppelaars, Lex's coauthor on the brilliant book called Applied Mathematics for Database Professionals (Expert's Voice).

In SQL, there's no "set equality" operator. That's right: although SQL is a set-processing language, it has no operator for testing whether one set A equals another set B. But set equality "A = B" can be rewritten as "(A is-a-subset-of B) and (B is-a-subset-of A)".

Unfortunately, SQL doesn't have an is-a-subset-of operator either. But "A is-a-subset-of B" can be rewritten into "A minus B = the-empty-set".

But SQL also lacks the concept of an empty set. The way to express that is to test whether the cardinality of a set is zero, as in "count(*)=0".
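Chained together, those three rewrites turn an inexpressible test into runnable SQL. Here's a sketch of the pattern using SQLite from Python (the table names and data are made-up examples; Oracle spells the set-difference operator MINUS, while SQLite spells it EXCEPT):

```python
import sqlite3

# Hypothetical tables a and b holding the two sets we want to compare.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE a (x INTEGER);
    CREATE TABLE b (x INTEGER);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (3), (1), (2);
""")

# "A = B" rewritten as "(A minus B is empty) and (B minus A is empty)",
# where "is empty" is in turn expressed as count(*) = 0.
sql = """
    SELECT (SELECT count(*) FROM (SELECT x FROM a EXCEPT SELECT x FROM b)) = 0
       AND (SELECT count(*) FROM (SELECT x FROM b EXCEPT SELECT x FROM a)) = 0
"""
equal = con.execute(sql).fetchone()[0]
print(bool(equal))  # True: the two sets contain the same elements
```

Delete a row from either table and the query flips to false, which is exactly the behavior a set-equality operator would have given us directly.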

Over the course of an hour-long presentation, Lex showed me a dozen or so operators that are missing from SQL, operators we really need for expressing our intentions clearly. He put structure around the negative feelings I had toward the language. And then he showed an equivalent translation for each missing operator that could be implemented in SQL, giving the language back a power it had been missing. That's the trick that caused my jaw to fall. In Lex's presentation, the game of writing applications in SQL went from this:
  1. Implement complex thoughts in a crappy language that requires me to record my thoughts in a format that doesn't much resemble my thinking.
  2. Worry whether the implementation was really right.
...to this:
  1. Record complex thoughts using a language designed for recording exactly such thoughts.
  2. Translate the specification of the program into SQL, using translation patterns.
Since our predicate calculus expressions were explicit enough to be provable, and since we could prove the correctness of the translations we were using to move from our specification to our SQL, we could actually then prove the correctness of our SQL. It was a beam of hope that developers could actually write correct applications ...and know it!

That's the first day I ever got excited thinking about SQL.

So, on April 27–29 in Dallas, I'll get a chance to enter the next phase of that thinking. On top of that, the message will be delivered by Chris Date, whom I really enjoyed at the Hotsos Symposium earlier this month, and who is one of the pioneers who invented the whole world our careers live in. I'm looking forward to it. It should be an interesting classroom, with Chris Date in the front and Karen Morton, Jeff Holt, and some others with me in the back. I hope you won't miss the opportunity.

Like I said, the final day to sign up is this Thursday 2 April 2009. I know that economic times are tough these days, but this is a one-of-a-kind education event that I believe will deliver lasting value to everyone who goes.

Friday, February 27, 2009

Dad, do I really need math?

My kids are pretty good about their math homework. They seem to enjoy it for the most part. It wasn't always that way. When the going gets tough, the natural human response, it seems, is to quit. So at times in our kids' school careers, their Mom and I have had to hang tough with them to try to make them do their homework. (The credit here belongs to their Mom.)

I remember when I was in school, the prevailing attitude in the classroom was, "When are we ever going to need to know this?" The much sadder one was, "My Mom and Dad said that I'm never going to need to know this stuff."

I couldn't have told you, when I was 10 years old, that I'd need to understand queueing theory one day in order to finish an Oracle project I had to do for Fidelity Investments. Or that I'd be able to win a Jim Sundberg autographed World Series baseball by using the distributive law of multiplication in my head while he was showing 400 people how Gaylord Perry liked his signs on the mound. It didn't matter to me, because I just had faith that there was a good reason I was supposed to learn everything I could in school. Having that particular faith was no accident.

I don't remember my Mom and Dad ever forcing me into doing math. I knew, of course, that it was My Job to do as well as I could in school ('A's are loafing unless they're '100's). But I don't remember ever feeling forced.

One of the things I fondly remember my Dad doing with me was glide slope calculation. Dad flew for many years for United Airlines. He retired as a 767 captain a long time ago. One of his priorities as a professional was to conserve fuel for his employer. It used to bug him when a pilot would constantly monkey around with the throttle during the approach to a landing. My Dad told me his goal on approach was to dial back the power one time at cruise altitude, at the very beginning of the descent, and then never touch it again until he turned on the thrust reversers after touchdown.

So he played this game with me, especially on car rides, because it was a 30-minute drive each day to where I went to grade school. He'd give me the altitude we were at and the altitude we needed to descend to, and either a time limit or the number of miles outbound we were. Then he'd ask me to calculate the sink rate in my head. He put triangles into my brain that I could see every time he asked me a question like that, and I'd hatch on it with him until we came up with the right sink rate. Or he would ask me things like, if the nose is pointing to heading 026, then what heading is our tail pointed at. So he put circles into my brain, too.
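The arithmetic in those games is simple enough to sketch (the altitudes, times, and headings below are made-up examples, not anything my Dad actually flew):

```python
def sink_rate_fpm(altitude_ft, target_ft, minutes):
    """Feet per minute of descent needed to lose the altitude in the given time."""
    return (altitude_ft - target_ft) / minutes

def tail_heading(nose_heading_deg):
    """The reciprocal of a compass heading: add 180 and wrap around 360."""
    return (nose_heading_deg + 180) % 360

# Descend from 35,000 ft to 11,000 ft in 30 minutes: 800 ft/min.
print(sink_rate_fpm(35000, 11000, 30))  # 800.0
# If the nose points to heading 026, the tail points to heading 206.
print(tail_heading(26))  # 206
```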

Every once in a while—oh, and I loved this—he would give me a paper flight plan form, with dozens of tiny cells to fill in, and I would fill them all in. I was 6 or 7 when we were doing that. I of course didn't know how to do it correctly, but I filled it all in anyway. Whenever I was really worried about doing it "right," I'd ask my Dad, and he'd tell me the kinds of things I should write down and which cells I should write them in.

You know the biggest value of that flight planning experience? It was that I couldn't wait to find out in school someday what point meant. You know, as in "three point five." I remember the day in class when a teacher finally taught us about decimal points. I felt sooo cool because now I knew what "three point five" actually meant.

My Dad did things with me that got me interested and excited about doing math, all on my own, without making me feel like I was being punished by it. Thus the abundance of wonderful opportunities that I have today are largely a continuing gift from him. I hope that another gift he gave me is the ability to be a good enough dad myself for my own kiddos, but of course I worry that I'm not doing it enough, or well enough. Telling stories about it helps remind me how important it is.

What reminded me of all this is a little document called "A Short Course in Human Relations," autographed by Bobby Bragan. It sits here in the foyer of our Method R office. I see it every single time I walk through our door. You've probably heard the following statement:
Say you were standing with one foot in the oven and one foot in an ice bucket. According to the percentage people, you should be perfectly comfortable.
Bobby Bragan said that; I think it was in 1963. It is a classic illustration of skew, which is vitally important to my career. Bobby Bragan, though, is an American hero for lots of good reasons. You should read about him.

Well, one night a few years ago, I got to watch Bobby Bragan speak to a small group. His talk was fascinating. He brought a huge box of stuff up to the podium with him, and he warmed up with a game. He opened by pulling something out of the box and saying whoever can answer this riddle gets the prize. The first one was something like, "What has eighteen legs and two breasts?" Shocker, right? The answer was The Supreme Court. Whoever said that, Bobby Bragan tossed him the first prize of the night.

Pretty deep into his speech, he must have given out twenty prizes to people. Not me. I either didn't know the answer, or I didn't say it loud enough or fast enough. I watched prize after prize go out, until he brought out this autographed document called "A Short Course in Human Relations." He read it aloud. It was an important part of his speech. And then he asked the question that went with it: "Nine ballplayers come out of the dugout before each game, and each ballplayer shakes the hand of every teammate. How many handshakes is that?" The voice that said "thirty-six" was mine. I was doggone lucky that Bobby Bragan had asked a bunch of baseball players a math question, and right on the prize that I really wanted, too.
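The arithmetic behind that answer is the classic handshake count: each of the n players shakes n − 1 hands, and dividing by two keeps each handshake from being counted twice. A quick sketch:

```python
def handshakes(n):
    # Each of the n players shakes hands with each of the other n - 1;
    # dividing by 2 avoids counting each handshake twice.
    return n * (n - 1) // 2

print(handshakes(9))  # 36, the answer that won the prize
```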

Math. You really never know when you're going to need it.

Friday, February 20, 2009

Dang it, people, they're syscalls, not "waits"...

So many times, I see people get really confused about how to attack an Oracle performance problem, resulting in thoughts that look like this:
I don’t understand why my program is so slow. The Oracle wait interface says it’s just not waiting on anything.
The confusion begins with the name "wait event." I wish Oracle hadn't called them that. I wish instead of WAIT in the extended SQL trace output, they had used the token SYSCALL. Ok, that's seven bytes of trace data instead of just four, so maybe OS instead of WAIT. I wish that they had called v$session_wait either v$session_syscall or v$session_os.

Here's why. First, realize that an Oracle "wait event" is basically the instrumentation for one operating system subroutine call ("syscall"). For example, the Oracle event called db file sequential read: that's instrumentation for a pread call on our Linux box. On the same system, a db file scattered read covers a sequence of two syscalls: _llseek and readv (that's one reason why I said basically at the beginning of this paragraph). The event called enqueue: that's a semtimedop call.

Second, the word wait is easy to misinterpret. To the Oracle kernel developer who wrote the word WAIT into the Oracle source code, the word connoted the duration that the code path he was writing would have to "wait" for some syscall to return. But to an end-user or performance analyst, the word wait has lots of other meanings, too, like (to name just two):
  1. How long the user has to wait for a task to complete (this is R in the R = S + W equation from queueing theory).
  2. How long the user's task queues for service on a specific resource (this is W in the R = S + W equation from queueing theory).
The problem is that, as obvious and useful as these two definitions seem, neither one of them means what the word wait means in an Oracle context, which is:
wait n. In an Oracle context, the approximate response time of what is usually a single operating system call (syscall) executed by an Oracle kernel process.
That's a problem. It's a big problem when people try to stick Oracle wait times into the W slot of mathematical queueing models. Because they're not W values; they're R values. (But they're not the same R values as in #1 above.)
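To make the R = S + W distinction concrete, here's a sketch using the standard M/M/1 single-server queueing formulas (the 10 ms service time is a made-up example, and real systems rarely fit M/M/1 exactly):

```python
def mm1_times(service_time, utilization):
    """Return (R, W) for an M/M/1 queue: R = S + W.

    W here is time spent queueing for the resource -- definition #2
    above -- which is NOT what an Oracle "wait" measures."""
    w = service_time * utilization / (1.0 - utilization)  # queueing delay W
    r = service_time + w                                  # response time R = S + W
    return r, w

# A 10 ms service time at 50% vs. 90% utilization:
r50, w50 = mm1_times(0.010, 0.50)
r90, w90 = mm1_times(0.010, 0.90)
print(round(r50, 4), round(r90, 4))  # 0.02 0.1
```

Note how R quintuples between 50% and 90% utilization even though S never changes: all of the growth is in W. Sticking an Oracle "wait" time (which is an R for a syscall) into the W slot of a model like this produces nonsense.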

But that's a digression from a much more important point: I think the word wait simply confuses people into thinking that response time is something different than what it really is. Response time is simply how long it takes to execute a given code path.
To understand response time, you have to understand code path.
This is actually the core tenet that divides people who "tune" into two categories: people who look at code path, and people who look at system resources.

Here's an example of what code path really looks like, for an Oracle process:
begin prepare (dbcall)
  execute Oracle kernel code path (mostly CPU)
  maybe make a syscall or two (e.g., "latch: library cache")
  maybe even make recursive prepare, execute, or fetch calls (e.g., view resolution)
end prepare
maybe make a syscall or two (e.g., "SQL*Net message...")
begin execute (another dbcall)
  execute Oracle kernel code path
  maybe make some syscalls (e.g., "db file sequential read" for updates)
end execute
maybe make a syscall or two
begin fetch (another dbcall)
  execute Oracle kernel code path (acquire latches, visit the buffer cache, ...)
  maybe make some syscalls (e.g., "db file...read")
end fetch
make a syscall or two
The trick is, you can't see this whole picture when you look at v$whatever within Oracle. You have to look at a lot of v$whatevers and do a lot of work reconciling what you find, to come up with anything close to a coherent picture of your code path.

But when you look at the Oracle code path, do you see how the syscalls just kind of blend in with the dbcalls? It's because they're all calls, and they all take time. It's non-orthogonal thinking to call syscalls something other than what they really are: just subroutine calls to another layer in the software stack. Calling all syscalls "waits" diminishes the one distinction that I think really is important: the distinction between syscalls that occur within dbcalls and syscalls that occur between dbcalls.
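That within-versus-between distinction can be sketched with a toy classifier over trace-style WAIT lines. (The line format and data below are simplified, made-up examples; a real extended SQL trace parser must handle far more fields and should track dbcall boundaries rather than keying off event names alone.)

```python
import re

# Made-up lines in the style of Oracle extended SQL trace WAIT entries.
trace = """\
WAIT #1: nam='db file sequential read' ela= 12000
WAIT #1: nam='db file sequential read' ela= 8000
WAIT #1: nam='SQL*Net message to client' ela= 5
WAIT #1: nam='SQL*Net message from client' ela= 950000
"""

within = between = 0  # accumulated elapsed time, in microseconds
for line in trace.splitlines():
    m = re.match(r"WAIT #\d+: nam='([^']+)' ela= (\d+)", line)
    if not m:
        continue
    name, ela = m.group(1), int(m.group(2))
    # Crude simplification: the idle time spent awaiting the client's next
    # request falls between dbcalls; everything else here falls within one.
    if name == "SQL*Net message from client":
        between += ela
    else:
        within += ela

print(within, between)  # 20005 950000
```

Even this toy view tells a story the "everything is a wait" view hides: this session spent about 20 ms doing work inside dbcalls and 950 ms sitting between them, waiting on the client.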

It's the reason I like extended SQL trace data so much: it lets me look at my code path without having to spend a bunch of extra time trying to compose several different perspectives of performance into a coherent view. The coherent view I want is right there in one place, laid out sequentially for me to look at, and that coherent view fits what the business needs to be looking at, as in...

Scene 1:
  • Business person: Our TPS Report is slow.
  • Oracle person: Yes, our system has a lot of waits. We're working on it.
  • (Later...) Oracle person: Great news! The problem with the waits has been solved.
  • Business person: Er, but the TPS Report is still slow.
Scene 2:
  • Business person: Our TPS Report is slow.
  • Oracle person: I'll look at it.
  • (Later...) Oracle person: I figured out your problem. The TPS Report was doing something stupid that it didn't need to do. It doesn't anymore.
  • Business person: Thanks; I noticed. It runs in, like, only a couple seconds now.

Wednesday, February 18, 2009

Throughput versus Response Time

I like Doug Burns's recent blog post called Time Matters: Throughput vs. Response Time. If you haven't read it, please do. The post and its comment thread are excellent.

The principle Doug has recognized is why the knee in the performance curve is defined as the traffic intensity (think utilization, or load) value at which, essentially, the ratio of response time divided by throughput is minimized. It's not just the place where response time is minimized (which, as Doug observed, is when there's no load at all except for you, ...which is awesome for you, but not so good for business).
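For a single-server M/M/1 queue, that knee can be found numerically (M/M/1 is an assumption here for the sake of a small example; multi-server systems have their knees at higher utilizations):

```python
# Find the "knee" for an M/M/1 queue: the utilization that minimizes
# the ratio R/X (response time divided by throughput).
S = 1.0  # service time; the knee's location doesn't depend on its value

def ratio(rho):
    r = S / (1.0 - rho)   # M/M/1 response time at utilization rho
    x = rho / S           # throughput at utilization rho
    return r / x

# Scan utilizations 0.001 .. 0.999 and keep the one with the smallest ratio.
rhos = [i / 1000.0 for i in range(1, 1000)]
knee = min(rhos, key=ratio)
print(knee)  # 0.5: for M/M/1 the knee falls at 50% utilization
```

Notice that the knee is neither at zero load (best response time) nor at 100% utilization (best throughput); it's the balance point between the two, which is exactly Doug's point.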

I'd like to emphasize a couple of points. First, batch and interactive workloads have wholly different performance requirements, which several people have already noted in their comments to Doug's post. With batch work, people are normally concerned with maximizing throughput. With online work, individual people care more about their own response times than group throughput, although those people's managers probably care more about group throughput. The individual people probably care about group throughput too, but not so much that they're happy about staying late after work to provide it when their individual tasks run so slowly they can't finish them during the normal working day.

In addition to having different performance requirements, batch workload can often be scheduled differently, too. If you're lucky, you can schedule your batch workload deterministically. For example, maybe you can employ a batch workload manager that feeds workload to your system like a carefully timed IV drip, to keep your system's CPU utilization pegged at 100% without causing your CPU run-queue depth to exceed 1.0. But online workload is almost always nondeterministic, which is to say that it can't be scheduled at all. That's why you have to keep some spare un-utilized system capacity handy; otherwise, your system load goes out past the nasty knee in your performance curve, and your users' response times behave exponentially in response to microscopic changes in load, which results in much Pain and Suffering.

My second point is one that I find that a lot of people don't understand very well: Focusing on individual response time—as in profiling—for an individual business task is an essential element in a process to maximize throughput, too. There are good ways to make a task faster, and there are bad ways. Good ways eliminate unnecessary work from the task without causing negative side-effects for tasks you're not analyzing today. Bad ways accidentally degrade the performance of tasks other than the one(s) you're analyzing.

If you stick to the good ways, you don't end up with the see-saw effect that most people seem to think of when they hear "optimize one business task at a time." You know, the idea that tuning A breaks B; then tuning B breaks A again. If this is happening to you, then you're doing it wrong. Trying to respond to performance problems by making global parameter changes commonly causes the see-saw problem. But eliminating wasteful work creates collateral benefits that allow competing tasks on your system to run faster because the task you've optimized now uses fewer resources, giving everything else freer and clearer access to the resources they need, without having to queue so much for them.

Figuring out how to eliminate wasteful work is where the real fun begins. A lot of the tasks we see are fixable by changing just a little bit of source code. I mean the 2,142,103-latch query that consumes only 9,098 latches after fixing; things like that. A lot more are fixable by simply collecting statistics correctly. Others require adjustments to an application's indexing strategy, which can seem tricky when you need to optimize across a collection of SQL statements (here comes the see-saw), but even that is pretty much a solved problem if you understand Tapio Lahdenmäki's work (except for the inevitable politics of change control).

Back to the idea of Doug's original post, I wholeheartedly agree that you want to optimize both throughput and response time. The business has to decide what mixture is right. And I believe it's crucial to focus on eliminating waste from each individual competing task if you're going to have any hope of optimizing anything, whether you care more about response time, or throughput.

Think about it this way... A task cannot run at its optimal speed unless it is efficient. You cannot know whether a task is efficient without measuring it. And I mean specifically and exactly it, not just part of "it" or "it" plus a bunch of other stuff surrounding it. That's what profiling is: the measurement of exactly one interesting task that allows you to determine exactly where that task spends its time, and thus whether that task is spending your system's time and resources efficiently.

You can improve a system without profiling, and maybe you can even optimize one without profiling. But you can't know whether a system is optimal without knowing whether its tasks are efficient, and you can't know whether a given task is efficient without profiling it.

When you don't know, you waste time and money. This is why I contend that the ability to profile a single task is absolutely vital to anyone wanting to optimize performance.

Monday, February 16, 2009

Thank you, Tim Sanders

I grew up with the idea that you never, ever write in a book. Unless you're correcting a mistake. (The book never "gets it" except when the book deserves it.)

But I was wrong. Tim Sanders fixed me. His book, Love Is the Killer App: How to Win Business and Influence Friends, just saved me a few hours, thanks to its section about Knowledge (page 82). There, Tim recommended that as I read, I should tag and cliff within the book itself. His advice made me uncomfortable, because it made good sense, but I do not write in books.

So, I was just wrong. I didn't summon the courage to tag or cliff Tim's book, but I resolved to tag and cliff the next few. And wow, it's a Good Thing. It's basically the act of creating an index for stuff you want to find later. I just saved myself a couple of hours looking for a passage in a book that I found in about 30 seconds because I had tagged it inside the front cover. My tag told me the page number and the concept I was looking for, and—poof!—I snapped right to it, there it was. That's index access by rowid for the brain.

Now I write in all the books I read. It works, and I highly recommend it.

Thank you, Karen Morton, for giving me the copy of Tim's book.

Friday, February 13, 2009

Thank you, RMOUG

I want to thank everyone who came to watch my session called "For Developers: Making Friends with the Oracle Database" yesterday at RMOUG in Denver. I don't know how many of you were there, but it felt like at least a couple hundred, and I'm grateful to you for sharing your time with me. I hope you had as much fun as I did.

RMOUG was another excellent show. Lots of happy people learning and sharing what they know: you can't beat that. I think Dan Norris captured the spirit of the event very well in his blog (which is now on my blogroll).

Oracle Optimizer Statistics

Accidentally messing up table statistics is probably the single most common cause for poor SQL query performance that I've ever seen. At Karen Morton's blog, you'll find a link to the best paper on the subject that I've ever seen.

Karen is going to present this topic at the Hotsos Symposium here in Dallas in just a few weeks. I'll hope to see you there. This one's big: Chris Date.