Cary Millsap: Oracle wait interface

The first product I ever created after leaving Oracle Corporation in 1999 was a 3-day course about optimizing Oracle performance. The experiences of teaching this course from 2000 through 2003 (heavily revising the material each time I taught it) added up to the knowledge that Jeff Holt and I needed to write Optimizing Oracle Performance (2003).

Between 2000 and 2006, I spent many weeks on the road teaching this 3-day course. I stopped teaching it in 2006. An opportunity to take or teach a course ought to be a joyous experience, and this one had become too much of a grind. I didn’t figure out how to fix it until this year. How I fixed it is the story I’d like to tell you.

The Problem

The problem was simply inefficiency. The inefficiency began with the structure of the course, the 3-day lecture marathon. Realize, 6 × 3 = 18 hours of sitting in a chair, listening attentively to a single voice (my voice) is the equivalent of a 6-week university term of a 3-credit-hour course, taught straight through in three days. No hour-plus homework assignment after each hour of lecture to reinforce the lessons; but a full semester’s worth of listening to one voice, straight through, for three days. What retention rate would you expect from a university course compressed into just 3 days?

So, I optimized. I have created a new course that lasts one day (not even an exhausting full day at that). But how can a student possibly learn as much in 1 day as we used to teach in 3 days? Isn’t a 1-day event bound to be a significantly reduced-value experience?

On the contrary, I believe our students benefit even more now than they used to. Here are the big differences, so you can see why.

The Time Savings

In the 3-day course, I would spend half a day explaining why people should abandon their old system-wide-ratio-based ways of managing system performance. In the new 1-day course, I spend less than an hour explaining the Method R approach to thinking about performance. The point of the new course is not to convince people to abandon anything they’re already doing; it’s to show students the tremendous additional opportunities that are available to them if they’ll just look at what Oracle trace files have to offer. Time savings: 2 hours.

In the 3-day course, I would spend a full day explaining how to interpret trace data. By hand. These were a few little lab exercises, about an hour’s worth. Students would enter dozens of numbers from trace files into laptops or pocket calculators and write results on worksheets. In the new 1-day course, the software tools that a student needs to interpret files of any size—or even directories full of files—are included in the price of the course. Time savings: 5 hours.

In the 3-day course, I would spend half a day explaining how to collect trace data. In the new 1-day course, the software tools that a student needs to get started collecting trace files are included in the price of the course. For software architectures that require more work than our software can do for you, there’s detailed instruction in the course book. Time savings: 3 hours.

In the 3-day course, I would spend half a day working through about five example cases using a software tool to which students would have access for 30 days after they had gone home. In the new 1-day course, I spend one hour working through about eight example cases using software tools that every student will take home and keep forever. I can spend less time per case yet teach more because the cases are thoroughly documented in the course book. So, in class, we focus on the high-level decision making instead of the gnarly technical details you’ll want to look up later anyway. Time savings: 3 hours.

...That’s 13 classroom hours we’ve eliminated from the old 3-day experience. I believe that in these 13 hours, I was teaching material that students weren’t retaining to begin with.

The Book

The next big difference: the book.

In the old 3-day course, I distributed two books: (1) the “Course Notebook,” which was a black and white listing of the course PowerPoint slides, and (2) a copy of Optimizing Oracle Performance (O’Reilly 2003). The O’Reilly book was great, because it contained a lot of detail that you would want to look up after the course. But of course it doesn’t contain any new knowledge we’ve learned since 2003. The Course Notebook, in my opinion, was never worth much to begin with. (In my opinion, no PowerPoint slide printout is worth much to begin with.)

The Mastering Oracle Trace Data (MOTD) book we give each student in my new 1-day course is a full-color, perfect-bound book that explains the course material and far more in deep detail. It is full-color for an important reason. It’s not gratuitous or decorative; it’s because I’ve been studying Edward Tufte. I use color throughout the book to communicate detailed, high-resolution information faster to your brain.

Color in the book helps to reduce student workload and deliver value long after a student has left the classroom. In this class, there is no collection of slide printouts like you’ve archived after every Oracle class you’ve been to since the 1980s. The MOTD book is way better than any other material I’ve ever distributed in my career. I’ve heard students tell their friends that you have to see it to believe it.

“A paper record tells your audience that you are serious, responsible, exact, credible. For deep analysis of evidence and reasoning about complex matters, permanent high-resolution displays [that is, paper] are an excellent start.” —Edward Tufte

The Software

So, where does a student recoup all the time we used to spend going through trace files, and studying how to collect trace data on half a dozen different software architectures? In the thousands of man-hours we’ve invested into the software that we give you when you come to the course. Instead of explaining every little detail about quirks in Oracle trace data that change between Oracle versions 10.1 and 10.2 and 11.2 or 11.2.0.2 and 11.2.0.4, the software does the work for you. Instead of having to explain all the detail work, we have time to explain how to use the results of our software to make decisions about your data.

What’s the catch? Of course, we hope you’ll love our software and want to buy it. The software we give you is completely full-featured and yours to keep forever, but the license limits you to using it only with one login id, and it doesn’t include patches and upgrades, which we release a few times each year. We hope you’ll love our software so much that you’ll want to buy a license that lets you use it on any of your systems and that includes the right to upgrade as we fix bugs and add features. We hope you’ll love it so much that you encourage your colleagues to buy it.

But there’s really no catch. You get software and a course (and a book and a shirt) for less than the daily rate that we used to charge for just a course.

A Shirt?

MOTD London 2011-09-08: “I can help you trace it.”

Yes, a shirt. Each student receives a Method R T-shirt that says, “I can help you trace it.” We don’t give these things away to anyone except for students in my MOTD course. So if you see one, the person wearing it can, in actual fact, Help You Trace It.

The Net Result

The net result of all this optimization is benefits on several fronts:

The course costs a lot less than it used to. The fee is presently only about 25% of the 3-day course’s price, and the whole experience requires less than ⅓ of time away from work that the original course did.
In the new course, our students don’t have to work so hard to make productive use of the course material. The book and the software take so much of the pressure off. We do talk about what the fields in raw trace data mean—I think it’s necessary to know that in order to use the data properly, and have productive debates with your sys/SAN/net/etc. administration colleagues. But we don’t spend your time doing exercises to untangle nested (recursive) calls by hand. The software you take home does that for you. That’s why it is so much easier for a student to put this course to work right away.
Since the course duration is only one day, I can visit far more cities and meet far more students each year. That’s good for students who want to participate, and it’s great for me, because I get to meet more people.

Plans

The only thing missing from our Mastering Oracle Trace Data course right now is you. I have taught the event now in Southlake, Texas (our home town), in Copenhagen, and in London. It’s field-tested and ready to roll. We have several cities on my schedule right now. I’ll be teaching the course in Birmingham UK on the day after UKOUG wraps up, December 8. I’ll be doing Orlando and Tampa in mid-December. I’ll teach two courses this coming January in Manhattan and Long Island. There’s Billund (Legoland) DK in April. We have more plans in the works for Seattle, Portland, Dallas, and Cleveland, and we’re looking for more opportunities.

Share the word by linking the official
MOTD sticker to http://method-r.com/.

My wish is for you to help me book more cities in North America and Europe (I hope to expand beyond that soon). If you are part of a company or a user group with colleagues who would be interested in attending the course, I would love to hear from you. Registering en masse saves you money. The magic number for discounting is 10 students on a single registration from one company or user group.

I can help you trace it.

So many times, I see people get really confused about how to attack an Oracle performance problem, resulting in thoughts that look like this:

I don’t understand why my program is so slow. The Oracle wait interface says it’s just not waiting on anything. ?

The confusion begins with the name "wait event." I wish Oracle hadn't called them that. I wish instead of WAIT in the extended SQL trace output, they had used the token SYSCALL. Ok, that's seven bytes of trace data instead of just four, so maybe OS instead of WAIT. I wish that they had called v$session_wait either v$session_syscall or v$session_os .

Here's why. First, realize that an Oracle "wait event" is basically the instrumentation for one operating system subroutine call ("syscall"). For example, the Oracle event called db file sequential read: that's instrumentation for a pread call on our Linux box. On the same system, a db file scattered read covers a sequence of two syscalls: _llseek and readv (that's one reason why I said basically at the beginning of this paragraph). The event called enqueue: that's a semtimedop call.

Second, the word wait is easy to misinterpret. To the Oracle kernel developer who wrote the word WAIT into the Oracle source code, the word connoted the duration that the code path he was writing would have to "wait" for some syscall to return. But to an end-user or performance analyst, the word wait has lots of other meanings, too, like (to name just two):

How long the user has to wait for a task to complete (this is R in the R = S + W equation from queueing theory).
How long the user's task queues for service on a specific resource (this is W in the R = S + W equation from queueing theory).

The problem is that, as obvious and useful as these two definitions seem, neither one of them means what the word wait means in an Oracle context, which is:

wait n. In an Oracle context, the approximate response time of what is usually a single operating system call (syscall) executed by an Oracle kernel process.

That's a problem. It's a big problem when people try to stick Oracle wait times into the W slot of mathematical queueing models. Because they're not W values; they're R values. (But they're not the same R values as in #1 above.)

But that's a digression from a much more important point: I think the word wait simply confuses people into thinking that response time is something different than what it really is. Response time is simply how long it takes to execute a given code path.

To understand response time, you have to understand code path.

This is actually the core tenet that divides people who "tune" into two categories: people who look at code path, and people who look at system resources.

Here's an example of what code path really looks like, for an Oracle process:

begin prepare (dbcall)
  execute Oracle kernel code path (mostly CPU)
  maybe make a syscall or two (e.g., "latch: library cache")
  maybe even make recursive prepare, execute, or fetch calls (e.g., view resolution)
end prepare
maybe make a syscall or two (e.g., "SQL*Net message...")
begin execute (another dbcall)
  execute Oracle kernel code path
  maybe make some syscalls (e.g., "db file sequential read" for updates)
end execute
maybe make a syscall or two
begin fetch (another dbcall)
  execute Oracle kernel code path (acquire latches, visit the buffer cache, ...)
  maybe make some syscalls (e.g., "db file...read")
end fetch
make a syscall or two

The trick is, you can't see this whole picture when you look at v$whatever within Oracle. You have to look at a lot of v$whatevers and do a lot of work reconciling what you find, to come up with anything close to a coherent picture of your code path.

But when you look at the Oracle code path, do you see how the syscalls just kind of blend in with the dbcalls? It's because they're all calls, and they all take time. It's non-orthogonal thinking to call syscalls something other than what they really are: just subroutine calls to another layer in the software stack. Calling all syscalls waits diminishes the one distinction that I think really actually is important; that's the distinction between syscalls that occur within dbcalls and the syscalls that occur between dbcalls.

It's the reason I like extended SQL trace data so much: it lets me look at my code path without having to spend a bunch of extra time trying to compose several different perspectives of performance into a coherent view. The coherent view I want is right there in one place, laid out sequentially for me to look at, and that coherent view fits what the business needs to be looking at, as in...

Scene 1:

Business person: Our TPS Report is slow.
Oracle person: Yes, our system has a lot of waits. We're working on it.
(Later...) Oracle person: Great news! The problem with the waits has been solved.
Business person: Er, but the TSP Report is still slow.

Scene 2:

Business person: Our TPS Report is slow.
Oracle person: I'll look at it.
(Later...) Oracle person: I figured out your problem. The TPS Report was doing something stupid that it didn't need to do. It doesn't anymore.
Business person: Thanks; I noticed. It runs in, like, only a couple seconds now.

Cary Millsap

Friday, November 18, 2011

I Can Help You Trace It