Friday, February 20, 2009

Dang it, people, they're syscalls, not "waits"...

So many times, I see people get really confused about how to attack an Oracle performance problem, resulting in thoughts that look like this:
I don’t understand why my program is so slow. The Oracle wait interface says it’s just not waiting on anything.
The confusion begins with the name "wait event." I wish Oracle hadn't called them that. I wish that, instead of WAIT in the extended SQL trace output, they had used the token SYSCALL. OK, that's seven bytes of trace data instead of just four, so maybe OS instead of WAIT. I wish that they had called v$session_wait either v$session_syscall or v$session_os.

Here's why. First, realize that an Oracle "wait event" is basically the instrumentation for one operating system subroutine call ("syscall"). For example, the Oracle event called db file sequential read: that's instrumentation for a pread call on our Linux box. On the same system, a db file scattered read covers a sequence of two syscalls: _llseek and readv (that's one reason why I said basically at the beginning of this paragraph). The event called enqueue: that's a semtimedop call.
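To make that mapping concrete, here is a minimal C sketch (mine, not Oracle source code; the datafile name, block size, and offset are made up) of the kind of call a server process issues while the db file sequential read clock is running:

    /* Minimal sketch of the syscall that a "db file sequential read" times on
     * Linux. The file name, block size, and offset are hypothetical; the real
     * Oracle code path is, of course, more involved. */
    #define _XOPEN_SOURCE 700
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        char block[8192];                        /* one 8 KB database block    */
        int  fd = open("users01.dbf", O_RDONLY); /* hypothetical datafile name */
        if (fd < 0) { perror("open"); return 1; }

        /* The event's clock starts just before this call and stops just after
         * it returns: the "wait" is the elapsed time of one pread(). */
        ssize_t n = pread(fd, block, sizeof block, (off_t)8192 * 42);
        if (n < 0) perror("pread");

        close(fd);
        return 0;
    }

A db file scattered read looks the same in spirit, except that the timed window covers the _llseek and readv pair.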

Second, the word wait is easy to misinterpret. To the Oracle kernel developer who wrote the word WAIT into the Oracle source code, the word connoted the duration that the code path he was writing would have to "wait" for some syscall to return. But to an end-user or performance analyst, the word wait has lots of other meanings, too, like (to name just two):
  1. How long the user has to wait for a task to complete (this is R in the R = S + W equation from queueing theory).
  2. How long the user's task queues for service on a specific resource (this is W in the R = S + W equation from queueing theory).
The problem is that, as obvious and useful as these two definitions seem, neither one of them means what the word wait means in an Oracle context, which is:
wait n. In an Oracle context, the approximate response time of what is usually a single operating system call (syscall) executed by an Oracle kernel process.
That's a problem. It's a big problem when people try to stick Oracle wait times into the W slot of mathematical queueing models. Because they're not W values; they're R values. (But they're not the same R values as in #1 above.)
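A made-up illustration of the trap (the numbers are invented): suppose a single db file sequential read shows ela = 10 ms in your trace data.

    From the Oracle kernel's view:  R(pread) = ela = 10 ms   ...the whole "wait" is an R
    At the disk that serviced it:   R = S + W = 6 ms + 4 ms  ...a split the Oracle data can't show you
    Plugging ela into the W slot:   claims 10 ms of queueing where there were only 4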

But that's a digression from a much more important point: I think the word wait simply confuses people into thinking that response time is something different from what it really is. Response time is simply how long it takes to execute a given code path.
To understand response time, you have to understand code path.
This is actually the core tenet that divides people who "tune" into two categories: people who look at code path, and people who look at system resources.

Here's an example of what code path really looks like, for an Oracle process:
begin prepare (dbcall)
  execute Oracle kernel code path (mostly CPU)
  maybe make a syscall or two (e.g., "latch: library cache")
  maybe even make recursive prepare, execute, or fetch calls (e.g., view resolution)
end prepare
maybe make a syscall or two (e.g., "SQL*Net message...")
begin execute (another dbcall)
  execute Oracle kernel code path
  maybe make some syscalls (e.g., "db file sequential read" for updates)
end execute
maybe make a syscall or two
begin fetch (another dbcall)
  execute Oracle kernel code path (acquire latches, visit the buffer cache, ...)
  maybe make some syscalls (e.g., "db file...read")
end fetch
make a syscall or two
The trick is, you can't see this whole picture when you look at v$whatever within Oracle. You have to look at a lot of v$whatevers and do a lot of work reconciling what you find, to come up with anything close to a coherent picture of your code path.

But when you look at the Oracle code path, do you see how the syscalls just kind of blend in with the dbcalls? It's because they're all calls, and they all take time. It's non-orthogonal thinking to call syscalls something other than what they really are: just subroutine calls to another layer in the software stack. Calling all syscalls waits diminishes the one distinction that I think really is important: the distinction between syscalls that occur within dbcalls and syscalls that occur between dbcalls.

It's the reason I like extended SQL trace data so much: it lets me look at my code path without having to spend a bunch of extra time trying to compose several different perspectives of performance into a coherent view. The coherent view I want is right there in one place, laid out sequentially for me to look at, and that coherent view fits what the business needs to be looking at, as in...

Scene 1:
  • Business person: Our TPS Report is slow.
  • Oracle person: Yes, our system has a lot of waits. We're working on it.
  • (Later...) Oracle person: Great news! The problem with the waits has been solved.
  • Business person: Er, but the TPS Report is still slow.
Scene 2:
  • Business person: Our TPS Report is slow.
  • Oracle person: I'll look at it.
  • (Later...) Oracle person: I figured out your problem. The TPS Report was doing something stupid that it didn't need to do. It doesn't anymore.
  • Business person: Thanks; I noticed. It runs in, like, only a couple seconds now.

16 comments:

Chen Shapira said...

Good distinction. It seems like you have very little visibility into the real waits in the system.

Some of these syscalls are pure waiting time (latch related events for instance), but certainly not all of them.

If you see a 5-second "wait" on "db file sequential read", it could be 5 seconds spent reading LOTS of data, or 1 second spent reading and 4 seconds spent really waiting in the I/O queue.

Cary Millsap said...

Chen: Exactly. When ela=R for a db file sequential read, you can't know by looking at Oracle data how much of that R is S and how much is W. Your observation is spot on.

Dan Norris said...

Too bad the TPS Report is 40 pages long and has over 50,000 lines on it. But then again, that's probably another blog post :).

I think the presentation at RMOUG really helped people who didn't understand the differences between the various stages of "wait" in the environment. I know it helped me, too.

Good stuff!

skymaster said...

Cary-

Excellent analysis on what these performance metrics break down into at the core level. Even 1 second can make a crucial difference in performance for mission critical systems.

Marcin Przepiorowski said...

Cary,
The naming convention depends on where you are sitting as an observer.
Of course, from the OS point of view, Oracle waits are not waits; there is work that has to be done (a syscall). But from the Oracle point of view, waiting for a DB block is just a WAIT (even if during that time the server's CPU or I/O is not idle).
I really appreciate your work and your book about Oracle waits.

regards,
Marcin Przepiorowski

Cary Millsap said...

Marcin, I agree that what you're talking about is a "wait," but only in the very loose definition #1 (from the original blog post) sense of the word. The problem with that perspective begins when someone thinks of W from the R = S + W formula when you say the word "wait."

From the database kernel's perspective the duration of a syscall is a response time. That is, it's an R, not a W. That response time consists of service (sometimes that service includes sleeping, which is also NOT a W) and, potentially, queueing (which is in fact waiting, by the stricter definition #2). The "wait" events that Oracle reports on are just not W values at all.

Chen Shapira said...

Cary,

Can you explain why sleeping is part of S and not W?

At least the Linux kernel will force processes to sleep while waiting for I/O, which will count as W. When would sleep be counted as S?

Cary Millsap said...

Chen, when I wrote that, I was imagining a syscall that has a sleep naturally in its code path, like the read() call you mentioned.

Imagine your read() call executing on a system with no competing workload and thus no queueing at any point in the call chain. The read() politely executes a sleep to free up your proc's CPU for someone else, because the author of read() knew that the hardware device service time is huge compared to the capacity of the CPU whence the call was made.

I think that the duration of the sleep executed in that context would be properly regarded as S time from the caller's (e.g., Oracle kernel's) R = S + W perspective. However, on a system in which there was queueing at the hardware device being read, the duration of that queueing is the duration that I would assign to the W slot.

The problem you've highlighted is that I accidentally mixed layers (in the sequence diagram sense) in my statement that you questioned. Sometimes, the entire duration of the sleep will be S, and sometimes it will include some non-zero W. But the place to measure the S and W values for that part of your read call would be on the device to which control in the code path has been passed, which is the hardware device servicing the call. ...Not the device from which the call was made.

Every time control passes from one tier to a deeper one (Oracle to OS, OS to device), there's a new R being instantiated.
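To put that in equation form (purely symbolic; no real measurements here):

    R(pread, as seen by Oracle) = S(OS kernel work) + W(CPU queue, if any) + R(device I/O)
    R(device I/O)               = S(seek, rotate, transfer) + W(device queue)

The S and W on each line belong to the layer that does the work on that line; Oracle's ela only ever hands you the top-level R.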

Marcin Przepiorowski said...

Cary,

I once again have to agree with you.
If we are talking about the M/M/1 model (or, in general, M/M/n), then Oracle "waits" - syscalls - are response times that include the service time (disk read time) and the queueing time of the underlying OS service. That's of course true.
Using that model of the system, we can try to model our database using math or modeling tools.
BTW, did you ever model an Oracle DB using that approach?

But if we want to find out what happens inside our application, and if we assume that our database can be either executing code (using CPU) or waiting (for I/O, memory, or other resources), then calling a syscall or Oracle internal call a "wait" is not such a bad idea.

ps.
I hope I didn't mix up my answer to you. English is not my first language.

regards,
Marcin Przepiorowski

Cary Millsap said...

Marcin,

In only one project have I used queueing theory to produce the principal deliverable for the project. It delivered results that were excellent approximations of reality. We modeled only at a very high level compared to this conversation, though. We modeled a cluster of Oracle databases as the service provider, and a Tuxedo-based application as the requester.

I do like to think about lower-level details in the context of queueing theory, because it helps me to understand (and explain, I hope) general tendencies more clearly. ...Like why it gives you better leverage to eliminate unnecessary work than to upgrade the capacity of your system.

However, trying to use queueing theory at too low a level of abstraction in our field is dangerous. Queueing systems in real life are very sensitive to small input changes. Queueing models accurately reflect that. So if you choose to model at a low, very granular level of abstraction, then even tiny mistakes in how you compose the model or how you measure the model's input values can produce wildly inaccurate predictions.

One use that intrigues me is what Connie Smith and Lloyd Williams do in their book Performance Solutions: A Practical Guide to Creating Responsive, Scalable Software (Addison-Wesley Object Technology Series).

I appreciate your comments, and I can assure you that your communications with me in English are clearer than would be any attempt I might make to communicate with you in your first language. :-)

Chen Shapira said...

"Every time control passes from one tier to a deeper one (Oracle to OS, OS to device), there's a new R being instantiated."

Exactly!
So Oracle's W is Linux's R, Linux's W is the hardware device's R, and so on.
Aggregating everything and arriving at the exact times spent in W vs S over all the layers seems nearly impossible, and as you wrote, it is usually not necessary.

Marcin Przepiorowski said...

Cary,

So far we have a very similar point of view on Oracle DB, and I have to agree with you on QT modeling, too.
I have delivered two projects based on queueing theory. One, like yours, was at a high level, and it was the most accurate. The second was built on a low-level model (using Oracle syscalls (waits), CPU measurements, etc.); its results were quite OK, but even small independent changes had a big negative impact on them.
If you are interested in modeling, I can recommend this book: R. Jain, "The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling," April 1991, ISBN 0471503361. It is an old one, but it covers a lot of modeling theory, and I have used it many times (even in my MA thesis about modeling ATM networks).

regards,
Marcin

Unknown said...

Excellent article. Do you mean to say that one should not confuse the classic queueing theory equation R = S + Q with the R = S + W used for Oracle response time analysis? Also, at which level (Oracle, OS, or application) does classic queueing theory work well? I mean, where can we apply this equation directly?

Cary Millsap said...

Karthik,

Thank you. Yes, that's exactly what I'm saying. In Oracle, response time is the sum of the time spent executing database calls, plus the time spent executing system calls (which Oracle people, unfortunately, call "waits"), with some double-counting correction for recursive calls. In this context, it's fine to say that "R = S + W", but it's weird that we don't say "R = D + S", where D stands for "database calls" and S stands for "system calls." I guess if we did that, people would incorrectly imagine that the "S" for syscalls stood for "service time." :-)

The response time of a system call executed by an Oracle database kernel process is, in the queueing model R=S+Q, an R (a response time), not a Q (a queueing delay). A syscall is simply not a type of queueing delay for an Oracle kernel process.

Imagine, for example, a "db file sequential read" that maps to a pread OS call. Within that single pread call, there's code path executed in OS kernel mode, which itself can be subject to queueing delay (if the CPU it needs is already busy). That may be followed by a queueing delay at the physical device before the actual seek and rotational latencies kick in (part of your S, the service time for the call), and this kind of behavior continues all the way through the data transfer until the call returns, at which point the Oracle kernel itself may have to queue for CPU in the OS "ready to run" state. The more detail you investigate, the more opportunities for correct application of R = S + Q you'll find.

Queueing theory, specifically the M/M/m model, helps you predict Q (queueing delay) based on a system's configuration and its load characteristics. However, if you want your Q to represent the duration of a syscall (e.g., of a "db file sequential read"), the formulas will not work for you. That's my main point. It's because Oracle "waits" (unfortunate choice of terminology; actually they're syscalls) are not queueing delays.

However, you might use M/M/m to predict the expected duration of a “db file sequential read” call if you know the architecture of the device servicing your reads, and if you know the distribution of read calls that are hitting that device.
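For what it's worth, here is a minimal C sketch of the M/M/m (Erlang C) arithmetic; the arrival rate, service time, and channel count below are invented, purely for illustration. It predicts the probability that a request has to queue, and from that the expected queueing delay Q:

    /* Sketch of the M/M/m ("Erlang C") calculation. All numbers below are
     * invented; they are not measurements from any real system. */
    #include <stdio.h>

    /* Probability that an arriving request must queue, given m channels
     * and an offered load a = lambda * S (in Erlangs). */
    static double erlang_c(int m, double a)
    {
        double term = 1.0, sum = 1.0;       /* the k = 0 term of a^k/k! */
        for (int k = 1; k < m; k++) {
            term *= a / k;                  /* a^k / k! */
            sum  += term;
        }
        term *= a / m;                      /* a^m / m! */
        double top = term / (1.0 - a / m);
        return top / (sum + top);
    }

    int main(void)
    {
        double lambda = 800.0;  /* read requests arriving per second (invented) */
        double S      = 0.005;  /* service time per read: 5 ms (invented)       */
        int    m      = 8;      /* number of service channels (invented)        */

        double a = lambda * S;                  /* offered load                 */
        double C = erlang_c(m, a);              /* P(request must queue)        */
        double Q = C * S / (m * (1.0 - a / m)); /* expected queueing delay      */

        printf("P(queue) = %.3f  Q = %.2f ms  R = S + Q = %.2f ms\n",
               C, Q * 1000.0, (S + Q) * 1000.0);
        return 0;
    }

The point of the sketch is only to show where Q lives in the model: Q is what the formula predicts, and it is never the same thing as the elapsed time of a whole syscall.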

KunwarSingh said...

Cary, one question.
Is "enq: TX - row lock contention" also not a wait but a syscall? I thought row lock contention was a pure database wait event, unlike "db file sequential read".

Rgds,
Kunwar

Cary Millsap said...

Kunwar,

If you’ll strace an Oracle kernel process when it is waiting on a row lock, you’ll see a sequence of enqueue events in the Oracle trace file, and a sequence of associated operating system calls in the strace output. The OS call I’ve seen related to enqueue events is semtimedop. If you look at the manual page for semtimedop, I think it will make sense to you what is going on.