Cary Millsap: exended SQL trace

Showing posts with label exended SQL trace. Show all posts

Thursday, January 13, 2011

New paper "Mastering Performance with Extended SQL Trace"

Happy New Year.

It’s been a busy few weeks. I finally have something tangible to show for it: “Mastering Performance with Extended SQL Trace” is the new paper I’ve written for this year’s RMOUG conference. Think of it a 15-page update to chapter 5 of Optimizing Oracle Performance.

There’s lots of new detail in there. Some highlights:

How to enable and disable traces, even in un-cooperative applications.
How to instrument your application so that tracing the right code path during production operation of your application becomes dead simple.
How to make that instrumentation highly scalable (think 100,000+ tps).
How timestamps since 10.2 allow you to know your recursive call relationships without guessing.
How to create response time profiles for calls and groups of calls, with examples.
Why you don’t want to be on Oracle 11g prior to 11.2.0.2.0.

I hope you’ll be able to make productive use of it.

Monday, August 9, 2010

Mister Trace

For the past several weeks, my team at Method R have been working hard on a new software tool that we released today. It is an extension for Oracle SQL Developer called Method R Trace. We call it MR Trace for short.

MR Trace is for SQL and PL/SQL developers who care about performance. Every time you execute code from a SQL Developer worksheet, MR Trace automatically copies a carefully scoped trace file to your SQL Developer workstation. There, you can open it with any application you want, just by clicking. You can tag it for easy lookup later. There’s a 3-minute video if you’re interested in seeing what it looks like.

I’m particularly excited about MR Trace because it’s the smallest software tool we’ve ever designed. That may sound funny to a lot of people, but it won’t sound funny to you if you’ve read Rework by Jason Fried and David Heinemeier Hansson of 37signals. MR Trace does a seemingly very small thing—it gets your trace file for you—but if you’ve ever done that job yourself, you might get a kick out of seeing it happen so automatically, so simply, and so quickly.

The thing is, the normal process of getting trace files is raw misery for many of our friends. It’s a common story: “If I trace some SQL, then to get my trace files, I have to call up my SA or DBA. I apologize for the interruption and hope he’s in a good mood. I tell him what system I need him to look at. He has to figure out which trace files are the ones I need, and then he FTPs them over to where I can get to them. I try not to bother him, but there’s no other way.”

Most places don’t have any security reasons to prohibit developers from getting their trace files, but they just don’t have the time or the interest to create procedures that developers can use to fetch only the files they’re supposed to see. The resulting bother is so labor-intensive and so demotivating that developers stop fighting and just move on without trace files to guide them.

That’s a big problem: if you can’t see how the code you write consumes response time, then how can you optimize it? How can you even know if you should try to? If you have to guess where your code spends time, then you can’t possibly think clearly about performance.

We have tried to design MR Trace to be a beautiful little application that does exactly what you need by staying out of your way. If we did it right, then you won’t be thinking about MR Trace whenever you use it; you’ll just have the trace files you need, right where and when you need them. And you’ll have them with no extra typing, no extra clicks, and—for goodness’ sake—certainly no more phone calls or trips down the hall. ...Unless it’s to to show off a performance problem you found and fixed before anyone else ever noticed it.

Key information:

	Name:		Method R Trace
	Type:		Extension for Oracle SQL Developer
	Function:		Zero-click trace file collector
	Price:		$49.95 USD
	Risk:		30-day free trial
	URL:		http://method-r.com/software/mrtrace
	Designer:		Method R Corporation

Friday, February 20, 2009

Dang it, people, they're syscalls, not "waits"...

So many times, I see people get really confused about how to attack an Oracle performance problem, resulting in thoughts that look like this:

I don’t understand why my program is so slow. The Oracle wait interface says it’s just not waiting on anything. ?

The confusion begins with the name "wait event." I wish Oracle hadn't called them that. I wish instead of WAIT in the extended SQL trace output, they had used the token SYSCALL. Ok, that's seven bytes of trace data instead of just four, so maybe OS instead of WAIT. I wish that they had called v$session_wait either v$session_syscall or v$session_os .

Here's why. First, realize that an Oracle "wait event" is basically the instrumentation for one operating system subroutine call ("syscall"). For example, the Oracle event called db file sequential read: that's instrumentation for a pread call on our Linux box. On the same system, a db file scattered read covers a sequence of two syscalls: _llseek and readv (that's one reason why I said basically at the beginning of this paragraph). The event called enqueue: that's a semtimedop call.

Second, the word wait is easy to misinterpret. To the Oracle kernel developer who wrote the word WAIT into the Oracle source code, the word connoted the duration that the code path he was writing would have to "wait" for some syscall to return. But to an end-user or performance analyst, the word wait has lots of other meanings, too, like (to name just two):

How long the user has to wait for a task to complete (this is R in the R = S + W equation from queueing theory).
How long the user's task queues for service on a specific resource (this is W in the R = S + W equation from queueing theory).

The problem is that, as obvious and useful as these two definitions seem, neither one of them means what the word wait means in an Oracle context, which is:

wait n. In an Oracle context, the approximate response time of what is usually a single operating system call (syscall) executed by an Oracle kernel process.

That's a problem. It's a big problem when people try to stick Oracle wait times into the W slot of mathematical queueing models. Because they're not W values; they're R values. (But they're not the same R values as in #1 above.)

But that's a digression from a much more important point: I think the word wait simply confuses people into thinking that response time is something different than what it really is. Response time is simply how long it takes to execute a given code path.

To understand response time, you have to understand code path.

This is actually the core tenet that divides people who "tune" into two categories: people who look at code path, and people who look at system resources.

Here's an example of what code path really looks like, for an Oracle process:

begin prepare (dbcall)
  execute Oracle kernel code path (mostly CPU)
  maybe make a syscall or two (e.g., "latch: library cache")
  maybe even make recursive prepare, execute, or fetch calls (e.g., view resolution)
end prepare
maybe make a syscall or two (e.g., "SQL*Net message...")
begin execute (another dbcall)
  execute Oracle kernel code path
  maybe make some syscalls (e.g., "db file sequential read" for updates)
end execute
maybe make a syscall or two
begin fetch (another dbcall)
  execute Oracle kernel code path (acquire latches, visit the buffer cache, ...)
  maybe make some syscalls (e.g., "db file...read")
end fetch
make a syscall or two

The trick is, you can't see this whole picture when you look at v$whatever within Oracle. You have to look at a lot of v$whatevers and do a lot of work reconciling what you find, to come up with anything close to a coherent picture of your code path.

But when you look at the Oracle code path, do you see how the syscalls just kind of blend in with the dbcalls? It's because they're all calls, and they all take time. It's non-orthogonal thinking to call syscalls something other than what they really are: just subroutine calls to another layer in the software stack. Calling all syscalls waits diminishes the one distinction that I think really actually is important; that's the distinction between syscalls that occur within dbcalls and the syscalls that occur between dbcalls.

It's the reason I like extended SQL trace data so much: it lets me look at my code path without having to spend a bunch of extra time trying to compose several different perspectives of performance into a coherent view. The coherent view I want is right there in one place, laid out sequentially for me to look at, and that coherent view fits what the business needs to be looking at, as in...

Scene 1:

Business person: Our TPS Report is slow.
Oracle person: Yes, our system has a lot of waits. We're working on it.
(Later...) Oracle person: Great news! The problem with the waits has been solved.
Business person: Er, but the TSP Report is still slow.

Scene 2:

Business person: Our TPS Report is slow.
Oracle person: I'll look at it.
(Later...) Oracle person: I figured out your problem. The TPS Report was doing something stupid that it didn't need to do. It doesn't anymore.
Business person: Thanks; I noticed. It runs in, like, only a couple seconds now.