Cary Millsap: oracle

Friday, October 1, 2021

How I Spent My COVID Vacation

I haven’t blogged in, what, ...forever. A couple of years.

I have been writing, though. Just not here. A lot, actually. Here’s some of the stuff I’ve done during my lapse here:

2020-01-28 – “Solving the unsolvable performance problem” (2 pages)
2020-02-14 – “Method R Workbench: the pesky, intermittent performance problem” (video 4:57)
2020-03-11 – “Preventing the post-production performance problem” (2 pages)
2020-04-23 – “Better testing, better risk reduction” (2 pages)
2020-05-07 – “Some things you probably didn’t know about tracing (Dallas edition)” (video 1:03:27)
2020-06-10 – “Death to the health check… Long live the health check” (2 pages)
2020-07-01 – Method R Workbench 9.0, a huge new release of my company’s flagship software system. Since 2001, this software has grown from a tkprof replacement to a system for mining and managing 10,000s of trace files at once.
2020-07-01 – “Method R Workbench 9: a whole new way to see Oracle performance” (video 5:36)
2020-08-26 – “Some things you probably didn’t know about tracing (Chicago edition)” (video 1:07:34)
2021-01-19 – CMG IMPACT 2021 “Three tricky performance problems solved with Oracle trace data” (ad video 6:12)
2021-08-09 – Method R Trace 21.2, a re-imagining of our trace file collector extension for Oracle SQL Developer.
2021-08-11 – “Method R Workbench video tips” (8 videos, each 2:21 or less)

But the biggest reason of all: I’ve spent every spare moment outside of my work obligations since March 2020…

…writing a new book.

The working title is How to Make Things Faster. It’s a book about making things (including software) go faster, for readers who aren’t necessarily technical. It will be about the size of a Dilbert book and contain pretty much everything I’ve ever learned in 30 years of work, both technical and political. The audience I’m aiming for is every IT professional, business leader, and consultant on Earth. I hope you’ll be able to find it in airport book stores next to Liz Wiseman and Dr. Phil.

The book is story-driven. You’ll get a fun story or two and then half a dozen bite-sized (~2½-page) chapters teaching lessons inspired by the stories, then a new story and more bite-sized lessons. I think that people will be able to enjoy reading the book both serially from front to back, and randomly two or three short chapters at a time.

To ensure that the material would be accessible to non-technical readers, I hired a very special first reviewer for this project: my mom. My instructions to her were to identify anything that was either unclear or boring. Thanks to her help, other reviewers so far have characterized the book as “fun,” “lively,” “quick,” and “important.”

It’s been a lot of work, and there’s still a lot to do. I’m excited about it, and I hope you’ll be, too. I’ll get it to you as soon as I can.

If you want a peek, you should join me next Thursday, October 7 at the Dallas Oracle Users Group’s “DOUG Training Day 2021,” which will convene live (!) in Grapevine, Texas, not too far from where I’m sitting right now. This will be an important event for me: it’ll be the first time I’ll have presented face-to-face with a live audience in nearly two years, and it’ll be the first time I’ll have presented material from the book. Plus, it’s a keynote for the event, so …the pressure’s on. :-)

If you’re interested in the sound of the material, please consider booking me for a workshop.

Tuesday, August 15, 2017

Words I Don’t Use, Part 5: “Wait”

The fifth “word I do not use” is the Oracle technical term wait.

The Oracle Wait Interface

In 1991, Oracle Corporation released some of the most important software instrumentation of all time: the “wait” statistics that were implemented in Oracle 7.0. Here’s part of the story, in Juan Loaiza’s words, as told in Nørgaard et al. (2004), Oracle Insights: Tales of the Oak Table.

This stuff was developed because we were running a benchmark that we could not get to perform. We had spent several weeks trying to figure out what was happening with no success. The symptoms were clear—the system was mostly idle—we just couldn’t figure out why.

We looked at the statistics and ratios and kept coming up with theories, the trouble was that none of them were right. So we wasted weeks tuning and fixing things that were not the problem. Finally we ran out of ideas and were forced to go back and instrument the code to figure out what the problem was.

Once the waits were instrumented the problem was diagnosed in minutes. We were having “free buffer” waits because the DBWR was not writing blocks fast enough. It’s amazing how hard that was to figure out with statistics, and how easy it was to figure out once the waits were instrumented.

...In retrospect a lot of the names could be greatly improved. The wait interface was added after the freeze date as a “stealth” project so it did not get as well thought through as it should have. Like I said, we were just trying to solve a problem in the course of a benchmark. The trouble is that so many people use this stuff now that if you change the names it will break all sorts of tools, so we have to leave them alone.

Before Juan’s team added this code, the Oracle kernel would show you only how much time its user calls (like parse, exec, and fetch) were taking. The new instrumentation, which included a set of new fixed views like v$session_wait and new WAIT lines in our trace files, showed how much time Oracle’s system calls (like reads, writes, and semops) were taking.

The Working-Waiting Model

The wait interface begat a whole new mental model about Oracle performance, based on the principle of working versus waiting:

Response Time = Service Time + Wait Time

In this formula, Oracle defines service time as the duration of the CPU used by your Oracle session (the duration Oracle spent working), and wait time as the sum of the durations of your Oracle wait events (the duration that Oracle spent waiting). Of course, response time in this formula means the duration spent inside the Oracle Database kernel.
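As a rough illustration of that formula, here is a minimal Python sketch. The durations are made-up numbers for one hypothetical session, not measurements from a real trace, and the helper name response_time is mine.

```python
# Hypothetical figures for one Oracle session, in seconds (illustrative only).
cpu_seconds = 12.4  # "service time": CPU consumed by the session

# "wait time": total duration per wait event name
wait_seconds = {
    "db file sequential read": 31.0,
    "log file sync": 4.1,
    "free buffer waits": 2.5,
}


def response_time(service, waits):
    """Response Time = Service Time + Wait Time."""
    return service + sum(waits.values())


total = response_time(cpu_seconds, wait_seconds)
print(f"response time: {total:.1f} s")
```

Note that this total covers only time spent inside the Oracle Database kernel, which is exactly the limitation the next section is about.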

Why I Don’t Say Wait, Part 1

There are two reasons I don’t use the word wait. The first is simply that it’s ambiguous.

The Oracle formula is okay for talking about database time, but the scope of my attention is almost never just Oracle’s response time—I’m interested in the business’s response time. And when you think about the whole stack (which, of course you do; see holistic), there are events we could call wait events all the way up and down:

The customer waits for an answer from a user.
The user waits for a screen from the browser.
The browser waits for an HTML page from the application server.
The application server waits for a database call from the Oracle kernel.
The Oracle kernel waits for a system call from the operating system.
The operating system’s I/O request waits to clear the device’s queue before receiving service.
...

If I say waits, the users in the room will think I’m talking about application response time, the Oracle people will think I’m talking about Oracle system calls, and the hardware people will think I’m talking about device queueing delays. Even when I’m not.

Why I Don’t Say Wait, Part 2

There is a deeper problem with wait than just ambiguity, though. The word wait invites a mental model that actually obscures your thinking about performance.

Here’s the problem: waiting sounds like something you’d want to avoid, and working sounds like something you’d want more of. Your program is waiting?! Unacceptable. You want it to be working. The connotations of the words working and waiting are unavoidable. It sounds like, if a program is waiting a lot, then you need to fix it; but if it’s working a lot, then it is probably okay. Right?

Actually, no.

The connotations “work is virtuous” and “waits are abhorrent” are false connotations in Oracle. One is not inherently better or worse than the other. Working and waiting are not accurate value judgments about Oracle software. On the contrary, they’re not even meaningful; they’re just arbitrary labels. We could just as well have been taught to say that an Oracle program is “working on disk I/O” and “waiting to finish its CPU instructions.”

The terms working and waiting really just refer to different subroutine call types:

“Oracle is working”	means	“your Oracle kernel process is executing a user call”
“Oracle is waiting”	means	“your Oracle kernel process is executing a system call”

The working-waiting model implies a distinction that does not exist, because these two call types have equal footing. One is no worse than the other, except by virtue of how much time it consumes. It doesn’t matter whether a program is working or waiting; it only matters how long it takes.

Working-Waiting Is a Flawed Analogy

The working-waiting paradigm is a flawed analogy. I’ll illustrate. Imagine two programs that consume 100 seconds apiece when you run them:

Program A		Program B
Duration	Call type	Duration	Call type
98	system calls (waiting)	98	user calls (working)
2	user calls (working)	2	system calls (waiting)
100	Total	100	Total

To improve program A, you should seek to eliminate unnecessary system calls, because that’s where most of A’s time has gone. To improve B, you should seek to eliminate unnecessary user calls, because that’s where most of B’s time has gone. That’s it. Your diagnostic priority shouldn’t be based on your calls’ names; it should be based solely on your calls’ contributions to total duration. Specifically, conclusions like, “Program B is okay because it doesn’t spend much time waiting,” are false.
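The same rule can be written as a few lines of Python. This is only a sketch using the two programs from the table above; the function name profile and the tuple layout are assumptions I made for the example.

```python
def profile(calls):
    """Return (call_type, seconds, share_of_total) rows, largest contributor first."""
    total = sum(seconds for _, seconds in calls)
    ranked = sorted(calls, key=lambda row: row[1], reverse=True)
    return [(name, seconds, seconds / total) for name, seconds in ranked]


# Durations in seconds, taken from the table above.
program_a = [("user calls", 2), ("system calls", 98)]
program_b = [("user calls", 98), ("system calls", 2)]

for label, calls in (("A", program_a), ("B", program_b)):
    name, seconds, share = profile(calls)[0]
    print(f"Program {label}: look first at {name} ({seconds} s, {share:.0%} of total)")
```

Nothing in the ranking looks at whether a call is labeled working or waiting. It sorts by duration, and that is all.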

A Better Model

I find that discarding the working-waiting model helps people optimize better. Here’s how you can do it. First, understand the substitute phrasing: working means executing a user call; and waiting means executing a system call. Second, understand that the excellent ideas people use to optimize other software are excellent ideas for optimizing Oracle, too.

Oracle’s wait interface is vital because it helps us measure an Oracle program’s complete execution duration—not just Oracle’s user calls, but its system calls as well. But I avoid saying wait to help people steer clear of the incorrect bias introduced by the working-waiting analogy.

Monday, March 7, 2016

Loss Aversion and the Setting of DB_BLOCK_CHECKSUM

Within Accenture Enkitec Group, we have recently been discussing the Oracle db_block_checksum parameter and how difficult it is to get clients to set it to a safer setting.

Clients are always concerned about the performance impact of features like this. Several years ago, I met a lot of people who had—in response to some expensive advice with which I strongly disagreed—turned off redo logging with an underscore parameter. The performance they would get from doing this would set the expectation level in their mind, which would cause them to resist (strenuously!) any notion of switching this [now horribly expensive] logging back on. Of course, it makes you wish that it had never even been a parameter.

I believe that the right analysis is to think clearly about risk. Risk is a non-technical word in most people’s minds, but in finance courses they teach that risk is quantifiable as a probability distribution. For example, you can calculate the probability that a disk will go bad in your system today. For disks, it’s not too difficult, because vendors do those calculations (MTTF) for us. But the probability that you’ll wish you had set db_block_checksum=full yesterday is probably more difficult to compute.
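For the disk case, the arithmetic can be sketched like this. I am assuming a constant failure rate (an exponential model) and independent disks, and both the MTTF figure and the disk count are hypothetical. Plug in your vendor's numbers.

```python
import math


def p_fail(hours, mttf_hours):
    """Probability that one disk fails within `hours`, assuming a constant failure rate."""
    return 1.0 - math.exp(-hours / mttf_hours)


def p_any_fail(n_disks, hours, mttf_hours):
    """Probability that at least one of n independent disks fails within `hours`."""
    return 1.0 - (1.0 - p_fail(hours, mttf_hours)) ** n_disks


MTTF_HOURS = 1_000_000  # hypothetical vendor figure
N_DISKS = 100           # hypothetical system size

print(f"one disk fails today:        {p_fail(24, MTTF_HOURS):.6f}")
print(f"any of {N_DISKS} disks fails today: {p_any_fail(N_DISKS, 24, MTTF_HOURS):.6f}")
```

There is no comparable vendor-supplied number for block corruption, which is part of why that probability is harder to put in front of a client.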

From a psychology perspective, customers would be happier if their systems had db_block_checksum set to full or typical to begin with. Then in response to the question,

“Would you like to remove your safety net in exchange for going between 1% and 10% faster? Here’s the horror you might face if you do it...”

...I’d wager that most people would say no, thank you. They will react emotionally to the idea of their safety net being taken away.

But with the baseline of its being turned off to begin with, the question is,

“Would you like to install a safety net in exchange for slowing your system down between 1% and 10%? Here’s the horror you might face if you don’t...”

...I’d wager that most people would answer no, thank you, even though this verdict is opposite to the one I predicted above. They will react emotionally to the idea of their performance being taken away.

Most people have a strong propensity toward loss aversion. They tend to prefer avoiding losses over acquiring gains. If they already have a safety net, they won’t want to lose it. If they don’t have the safety net they need, they’ll feel averse to losing performance to get one. It ends up being a problem more about psychology than technology.

The only tools I know to help people make the right decision are:

Talk to good salespeople about how they overcome the psychology issue. They have to deal with it every day.
Give concrete evidence. Compute the probabilities. Tell the stories of how bad it is to have insufficient protection. Explain that any software feature that provides a benefit is going to cost some system capacity (just like a new report, for example), and that this safety feature is worth the cost. Make sure that when you size systems, you include the incremental capacity cost of switching to db_block_checksum=full.

My teammates get it, of course, because they’ve lived the stories, over and over again, in their roles on the corruption team at Oracle Support. You can get it, too, without leaving your keyboard. If you want to see a fantastic and absolutely horrifying short story about what happens if you do not use Oracle’s db_block_checksum feature properly, read David Loinaz’s article now.

When you read David’s article, you are going to see heavy quoting of my post here in his intro. He did that with my full support. (He wrote his article when my article here wasn’t an article yet.) If you feel like you’ve read it before, just keep reading. You really, really need to see what David has written, beginning with the question:

If I’ve never faced a corruption, and I have good backup strategy, my disks are mirrored, and I have a great database backup strategy, then why do I need to set these kinds of parameters that will impact my performance?

Enjoy.

Friday, February 28, 2014

“How did you learn so much stuff about Oracle?”

On LinkedIn, a new connection asked me a very nice question. He asked, “I know this might sound stupid, but how did you learn so much stuff about Oracle. :)”

Good one. I like the presumption that I know a lot of stuff about Oracle. I suppose that I do, at least about some aspects of it, although I often feel like I don’t know enough. It occurred to me that answering publicly might also be helpful to anyone trying to figure out how to prepare for a career. Here’s my answer.

I took a job with the young consulting division of Oracle Corporation in September 1989, about two weeks after the very first time I had heard the word “Oracle” used as the name of a company. My background had been mathematics and computer science in school. I had two post-graduate degrees: a Master of Science in Computer Science with a focus on language design and compilers, and a Master of Business Administration with a focus on finance.

My first “career job” was as a software engineer, which I started before the MBA. I designed languages and wrote compilers to implement those languages. Yes, people actually pay good money for that, and it’s possibly still the most fun I’ve ever had at work. I wrote software in C, lex, and yacc, and I taught my colleagues how to do it, too. In particular, I spent a lot of time teaching my colleagues how to make their C code faster and more portable (so it would run on more computers than just the one on which you wrote it).

Even though I loved my job, I didn’t see a lot of future in it. At least not in Colorado Springs in the late 1980s. So I took a year off to get the MBA at SMU in Dallas. I went for the MBA because I thought I needed to learn more about money and business. It was the most difficult academic year of my life, because I was not particularly connected to or even interested in most of the subject matter. I hated a lot of my classes, which made it difficult to do as well as I had been accustomed. But I kept grinding away, and finished my degree in the year it was supposed to take. Of course I learned many, many things that year that have been vital to my career.

A couple of weeks after I got my MBA, I went to work for Oracle in Dallas, with a salary that was 168% of what it had been as a compiler designer. My job was to visit Oracle customers and help them with their problems.

It took a while for me to get into a good rhythm at Oracle. My boss was sending me to these local customers that were having problems with the Oracle Financial Applications (the “Finapps,” as we usually called them, which would many years later become the E-Business Suite) on version 6.0.26 of the ORACLE database (it was all caps back then). At first, I couldn’t help them nearly as much as I had wanted to. It was frustrating.

That actually became my rhythm: week after week, I visited these people who were having horrific problems with ORACLE and the Finapps. The database in 1990, although it had some pretty big bugs, was still pretty good. It was the applications that caused most of the problems I saw. There were a lot of problems, both with the software and with how it was sold. My job was to fix the problems. Some of those problems were technical. Many were not.

A lot of the problems were performance problems: problems of the software running “too slowly.” I found those problems particularly interesting. For those, I had some experience and tools at my disposal. I knew a good bit about operating systems and compilers and profilers and linkers and debuggers and all that, and so learning about Oracle indexes and rollback segments (two good examples, continual sources of customer frustration) wasn’t that scary of a step for me.

I hadn’t learned anything about Oracle or relational databases in school. I learned how the database worked at Oracle by reading the documentation, beginning with the excellent Oracle® Database Concepts. Oracle sped me along a bit with a couple of the standard DBA courses.

My real learning came from being in the field. The problems my customers had were immediately interesting by virtue of being important. The resources available to me for solving such problems back in the early 1990s were really just books, email, and the telephone. The Internet didn’t exist yet. (Can you imagine?) The Oracle books available back then, for the most part, were absolutely horrible. Just garbage. Just about the only thing they were good for was creating problems that you could bill lots of consulting hours to fix. The only thing that was left was email and the telephone.

The problem with email and telephones, however, is that there has to be someone on the other end. Fortunately, I had that. The people on the other end of my email and phone calls were my saviors and heroes. In my early Oracle years, those saviors and heroes included people like Darryl Presley, Laurel Jamtgaard, Tom Kemp, Charlene Feldkamp, David Ensor, Willis Ranney, Lyn Pratt, Lawrence To, Roderick Mañalac, Greg Doherty, Juan Loaiza, Bill Bridge, Brom Mahbod, Alex Ho, Jonathan Klein, Graham Wood, Mark Farnham (who didn’t even work for Oracle, but who could cheerfully introduce me to anyone I needed), Anjo Kolk, and Mogens Nørgaard. I could never repay these people, and many more, for what they did for me. ...In some cases, at all hours of the night.

So, how did I learn so much stuff about Oracle? It started by immersing myself into a universe where every working day I had to solve somebody’s real Oracle problems. Uncomfortable, but effective. I survived because I was persistent and because I had a great company behind me, filled with spectacularly intelligent people who loved helping each other. Could I have done that on my own, today, with the advent of the Internet and lots and lots of great and reliable books out there to draw upon? I doubt it. I sincerely do. But maybe if I were young again...

I tell my children, there’s only one place where money comes from: other people. Money comes only from other people. So many things in life are that way.

I’m a natural introvert. I naturally withdraw from group interactions whenever I don’t feel like I’m helping other people. Thankfully, my work and my family draw me out into the world. If you put me into a situation where I need to solve a technical problem that I can’t solve by myself, then I’ll seek help from the wonderful friends I’ve made.

I can never pay it back, but I can try to pay it forward.

(Oddly, as I’m writing this, I realize that I don’t take the same healthy approach to solving business problems. Perhaps it’s because I naturally assume that my friends would have fun helping solve a technical problem, but that solving a business problem would not be fun and therefore I would be imposing upon them if I were to ask for help solving one. I need to work on that.)

So, to my new LinkedIn friend, here’s my advice. Here’s what worked for me:

Educate yourself. Read, study, experiment. Educate yourself especially well in the fundamentals. So many people don’t. Being fantastic at the fundamentals is a competitive advantage, no matter what you do. If it’s Oracle you’re interested in learning about, that’s software, so learn about software: about operating systems, and C, and linkers, and profilers, and debuggers, .... Read the Oracle Database Concepts guide and all the other free Oracle documentation. Read every book there is by Tom Kyte and Christian Antognini and Jonathan Lewis and Lex de Haan and Toon Koppelaars and Tanel Põder and Kerry Osborne and Karen Morton and James Morle and all the other great authors out there today. And read their blogs.
Find a way to hook yourself into a network of people that are willing and able to help you. You can do that online these days. You can earn your way into a community by doing things like asking thoughtful questions, treating people respectfully (even the ones who don’t treat you respectfully), and finding ways to teach others what you’ve learned. Write. Write what you know, for other people to use and improve. And for God’s sake, if you don’t know something, don’t act like you do. That just makes everyone think you’re an asshole, which isn’t helpful.
Immerse yourself into some real problems. Read Scuttle Your Ships Before Advancing if you don’t understand why. You can solve real problems online these days, too (e.g., StackExchange and even Oracle.com), although I think that it’s better to work on real live problems at real live customer sites. Stick with it. Fix things. Help people.

Help people.

That’s my advice.

Friday, April 5, 2013

NoSQL and Oracle, Sex and Marriage

At last week’s Dallas Oracle Users Group meeting, an Oracle DBA asked me, “With all the new database alternatives out there today, like all these open source NoSQL databases, would you recommend for us to learn some of those?”

I told him I had a theory about how these got so popular and that I wanted to share that before I answered his question.

My theory is this. Developers perceive Oracle as being too costly, time-consuming, and complex:

An Oracle Database costs a lot. If you don’t already have an Oracle license, it’s going to take time and money to get one. On the other hand, you can just install Mongo DB today.
Even if you have an Oracle site-wide license, the Oracle software is probably centrally controlled. To get an installation done, you’re probably going to have to negotiate, justify, write a proposal, fill out forms, ...you know, supplicate yourself to—er, I mean negotiate with—your internal IT guys to get an Oracle Database installed. It’s a lot easier to just install MySQL yourself.
Oracle is too complicated. Even if you have a site license and someone who’s happy to install it for you, it’s so big and complicated and proprietary... The only way to run an Oracle Database is with SQL (a declarative language that is alien to many developers) executed through a thick, proprietary, possibly even difficult-to-install layer of software like Oracle Enterprise Manager, Oracle SQL Developer, or sqlplus. Isn’t there an open source database out there that you could just manage from your operating system command line?

When a developer is thinking about installing a database today because he needs one to write his next feature, he wants something cheap, quick, and lightweight. None of those constraints really sounds like Oracle, does it?

So your Java developers install this NoSQL thing, because it’s easy, and then they write a bunch of application code on top of it. Maybe so much code that there’s no affordable way to turn back. Eventually, though, someone will accidentally crash a machine in the middle of something, and there’ll be a whole bunch of partway finished jobs that die. Out of all the rows that are supposed to be in the database, some will be there and some won’t, and so now your company will have to figure out how to delete the parts of those jobs that aren’t supposed to be there.

Because now everyone understands that this kind of thing will probably happen again, too, the exercise may well turn into a feature specification for various “eraser” functions for the application, which (I hope, anyway) will eventually lead to the team discovering the technical term transaction. A transaction is a unit of work that must be atomic, consistent, isolated, and durable (that’s where the acronym ACID comes from). If your database doesn’t guarantee that every arbitrarily complex unit of work (every transaction) makes it either 100% into the database or not at all, then your developers have to write that feature themselves. That’s a big, tremendously complex feature. On an Oracle Database, the transaction is a fundamental right given automatically to every user on the system.
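To make “100% into the database or not at all” concrete, here is a small sketch. It uses Python’s built-in sqlite3 module purely as a stand-in transactional database so that the example runs anywhere; it is not Oracle, and the account table, the transfer function, and the simulated crash are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (name TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [("alice", 100), ("bob", 0)])
conn.commit()


def transfer(conn, src, dst, amount, crash=False):
    """Move `amount` from src to dst as one unit of work."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE account SET balance = balance - ? WHERE name = ?", (amount, src)
            )
            if crash:
                raise RuntimeError("simulated crash between the two updates")
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
    except RuntimeError:
        pass  # the half-finished debit was rolled back; nothing to erase by hand


transfer(conn, "alice", "bob", 30, crash=True)
balances = dict(conn.execute("SELECT name, balance FROM account"))
print(balances)
```

After the simulated crash, the debit from the first update is gone too. That rollback is the “eraser” function the application team would otherwise have to write for every job.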

Let’s look at just that ‘I’ in ACID for a moment: isolation. How big a deal is transaction isolation? Imagine that your system has a query that runs from 1 pm to 2 pm. Imagine that it prints results to paper as it runs. Now suppose that at 1:30 pm, some user on the system updates two rows in your query’s base table: the table’s first row and its last row. At 1:30, the pre-update version of that first row has already been printed to paper (that happened shortly after 1 pm). The question is, what’s supposed to happen at 2 pm when it comes time to print the information for the final row? You should hope for the old value of that final row—the value as of 1 pm—to print out; otherwise, your report details won’t add up to your report totals. However, if your database doesn’t handle that transaction isolation feature for you automatically, then either you’ll have to lock the table when you run the report (creating a 30-minute-long performance problem for the person wanting to update the table at 1:30), or your query will have to make a snapshot of the table at 1 pm, which is going to require both a lot of extra code and that same lock I just described. On an Oracle Database, high-performance, non-locking read consistency is a standard feature.
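Here is a toy model of the 1 pm report, just to show what “the value as of 1 pm” means. It keeps every committed version of a row and lets a reader ask for the value as of a given time. The class and method names are invented for the example, times are minutes after noon, and this is not a description of how Oracle implements read consistency internally.

```python
class VersionedTable:
    """Toy multi-version table: each row keeps every committed (time, value) pair."""

    def __init__(self):
        self._versions = {}  # row key -> list of (commit_time, value), oldest first

    def write(self, key, value, commit_time):
        # Assumes writes arrive in commit-time order, so the list stays sorted.
        self._versions.setdefault(key, []).append((commit_time, value))

    def read_as_of(self, key, snapshot_time):
        """Return the newest value committed at or before snapshot_time."""
        visible = [value for t, value in self._versions[key] if t <= snapshot_time]
        return visible[-1]


table = VersionedTable()
table.write("first", 10, commit_time=0)
table.write("last", 20, commit_time=0)

query_start = 60  # the report starts at 1 pm
first_seen = table.read_as_of("first", query_start)  # printed shortly after 1 pm

# 1:30 pm: another user updates both rows. The writer takes no lock on the reader.
table.write("first", 11, commit_time=90)
table.write("last", 21, commit_time=90)

last_seen = table.read_as_of("last", query_start)  # printed at 2 pm
print(first_seen, last_seen)
```

Both rows come back as they stood when the query started, so the details still add up to the totals, and the 1:30 update never had to wait for the report.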

And what about backups? Backups are intimately related to the read consistency problem, because backups are just really long queries that get persisted to some secondary storage device. Are you going to quiesce your whole database—freeze the whole system—for whatever duration is required to take a cold backup? That’s the simplest sounding approach, but if you’re going to try to run an actual business with this system, then shutting it down every day—taking down time—to back it up is a real operational problem. Anything fancier (for example, rolling downtime, quiescing parts of your database but not the whole thing) will add cost, time, and complexity. On an Oracle Database, high-performance online “hot” backups are a standard feature.

The thing is, your developers could write code to do transactions (read consistency and all) and incremental (“hot”) backups. Of course they could. Oh, and constraints, and triggers (don’t forget to remind them to handle the mutating table problem), and automatic query optimization, and more, ...but to write those features Really Really Well™, it would take them 30 years and a hundred of their smartest friends to help write it, test it, and fund it. Maybe that’s an exaggeration. Maybe it would take them only a couple years. But Oracle has already done all that for you, and they offer it at a cost that doesn’t seem as high once you understand what all is in there. (And of course, if you buy it on May 31, they’ll cut you a break.)

So I looked at the guy who asked me the question, and I told him, it’s kind of like getting married. When you think about getting married, you’re probably focused mostly on the sex. You’re probably not spending too much time thinking, “Oh, baby, this is the woman I want to be doing family budgets with in fifteen years.” But you need to be. You need to be thinking about the boring stuff like transactions and read consistency and backups and constraints and triggers and automatic query optimization when you select the database you’re going to marry.

Of course, my 15-year-old son was in the room when I said this. I think he probably took it the right way.

So my answer to the original question—“Should I learn some of these other technologies?”—is “Yes, absolutely,” for at least three reasons:

Maybe some development group down the hall is thinking of installing Mongo DB this week so they can get their next set of features implemented. If you know something about both Mongo DB and Oracle, you can help that development group and your managers make better informed decisions about that choice. Maybe Mongo DB is all they need. Maybe it’s not. You can help.
You’re going to learn a lot more than you expect when you learn another database technology, just like learning another natural language (like English, Spanish, etc.) teaches you things you didn’t expect to learn about your native language.
Finally, I encourage you to diversify your knowledge, if for no other reason than your own self-confidence. What if market factors conspire in such a manner that you find yourself competing for an Oracle-unrelated job? A track record of having learned at least two database technologies is proof to yourself that you’re not going to have that much of a problem learning your third.

Thursday, June 7, 2012

An Organizational Constraint that Diminishes Software Quality

One of the biggest problems in software performance today occurs when the people who write software are different from the people who are required to solve the performance problems that their software causes. It works like this:

Architects design a system and pass the specification off to the developers.
The developers implement the specs the architects gave them, while the architects move on to design another system.
When the developers are “done” with their phase, they pass the code off to the production operations team. The operators run the system the developers gave them, while the developers move on to write another system.

The process is an assembly line for software: architects specialize in architecture, developers specialize in development, and operators specialize in operating. It sounds like the principle of industrial efficiency taken to its logical conclusion in the software world.

In this waterfall project plan,
architects design systems they never see written,
and developers write systems they never see run.

Sound good? It sounds like how Henry Ford made a lot of money building cars... Isn’t that how they build roads and bridges? So why not?

With software, there’s a horrible problem with this approach. If you’ve ever had to manage a system that was built like this, you know exactly what it is.

The problem is the absence of a feedback loop between actually using the software and building it. It’s a feedback loop that people who design and build software need for their own professional development. Developers who never see their software run don’t learn enough about how to make their software run better. Likewise, architects who never see their systems run have the same problem, only it’s worse, because (1) their involvement is even more abstract, and (2) their feedback loops are even longer.

Who are the performance experts in most Oracle shops these days? Unfortunately, it’s most often the database administrators, not the database developers. It’s the people who operate a system who learn the most about the system’s design and implementation mistakes. That’s unfortunate, because the people who design and write a system have so much more influence over how a system performs than do the people who just operate it.

If you’re an architect or a developer who has never had to support your own software in production, then you’re probably making some of the same mistakes now that you were making five years ago, without even realizing they’re mistakes. On the other hand, if you’re a developer who has to maintain your own software while it’s being operated in production, you’re probably thinking about new ways to make your next software system easier to support.

So, why is software any different than automotive assembly, or roads and bridges? It’s because software design is a process of invention. Almost every time. When is the last time you ever built exactly the same software you built before? No matter how many libraries you’re able to reuse from previous projects, every system you design is different from any system you’ve ever built before. You don’t just stamp out the same stuff over and over.

Software is funny that way, because the cost of copying and distributing it is vanishingly small. When you make great software that everyone in the world needs, you write it once and ship it at practically zero cost to everyone who needs it. Cars and bridges don’t work that way. Mass production and distribution of cars and bridges requires significantly more resources. The thousands of people involved in copying and distributing cars and bridges don’t have to know how to invent or refine cars or bridges to do great work. But with software, since copying and distributing it is so cheap, almost all that’s left is the invention process. And that requires feedback, just like inventing cars and bridges did.

Don’t organize your software project teams so that they’re denied access to this vital feedback loop.

Friday, November 18, 2011

I Can Help You Trace It

The first product I ever created after leaving Oracle Corporation in 1999 was a 3-day course about optimizing Oracle performance. The experiences of teaching this course from 2000 through 2003 (heavily revising the material each time I taught it) added up to the knowledge that Jeff Holt and I needed to write Optimizing Oracle Performance (2003).

Between 2000 and 2006, I spent many weeks on the road teaching this 3-day course. I stopped teaching it in 2006. An opportunity to take or teach a course ought to be a joyous experience, and this one had become too much of a grind. I didn’t figure out how to fix it until this year. How I fixed it is the story I’d like to tell you.

The Problem

The problem was simply inefficiency. The inefficiency began with the structure of the course, the 3-day lecture marathon. Realize, 6 × 3 = 18 hours of sitting in a chair, listening attentively to a single voice (my voice) is the equivalent of a 6-week university term of a 3-credit-hour course, taught straight through in three days. No hour-plus homework assignment after each hour of lecture to reinforce the lessons; but a full semester’s worth of listening to one voice, straight through, for three days. What retention rate would you expect from a university course compressed into just 3 days?

So, I optimized. I have created a new course that lasts one day (not even an exhausting full day at that). But how can a student possibly learn as much in 1 day as we used to teach in 3 days? Isn’t a 1-day event bound to be a significantly reduced-value experience?

On the contrary, I believe our students benefit even more now than they used to. Here are the big differences, so you can see why.

The Time Savings

In the 3-day course, I would spend half a day explaining why people should abandon their old system-wide-ratio-based ways of managing system performance. In the new 1-day course, I spend less than an hour explaining the Method R approach to thinking about performance. The point of the new course is not to convince people to abandon anything they’re already doing; it’s to show students the tremendous additional opportunities that are available to them if they’ll just look at what Oracle trace files have to offer. Time savings: 2 hours.

In the 3-day course, I would spend a full day explaining how to interpret trace data. By hand. There were a few little lab exercises, about an hour’s worth. Students would enter dozens of numbers from trace files into laptops or pocket calculators and write results on worksheets. In the new 1-day course, the software tools that a student needs to interpret files of any size (or even directories full of files) are included in the price of the course. Time savings: 5 hours.

In the 3-day course, I would spend half a day explaining how to collect trace data. In the new 1-day course, the software tools that a student needs to get started collecting trace files are included in the price of the course. For software architectures that require more work than our software can do for you, there’s detailed instruction in the course book. Time savings: 3 hours.

In the 3-day course, I would spend half a day working through about five example cases using a software tool to which students would have access for 30 days after they had gone home. In the new 1-day course, I spend one hour working through about eight example cases using software tools that every student will take home and keep forever. I can spend less time per case yet teach more because the cases are thoroughly documented in the course book. So, in class, we focus on the high-level decision making instead of the gnarly technical details you’ll want to look up later anyway. Time savings: 3 hours.

...That’s 13 classroom hours we’ve eliminated from the old 3-day experience. I believe that in these 13 hours, I was teaching material that students weren’t retaining to begin with.

The Book

The next big difference: the book.

In the old 3-day course, I distributed two books: (1) the “Course Notebook,” which was a black and white listing of the course PowerPoint slides, and (2) a copy of Optimizing Oracle Performance (O’Reilly 2003). The O’Reilly book was great, because it contained a lot of detail that you would want to look up after the course. But of course it doesn’t contain any new knowledge we’ve learned since 2003. The Course Notebook, in my opinion, was never worth much to begin with. (In my opinion, no PowerPoint slide printout is worth much to begin with.)

The Mastering Oracle Trace Data (MOTD) book we give each student in my new 1-day course is a full-color, perfect-bound book that explains the course material and far more in deep detail. It is full-color for an important reason. It’s not gratuitous or decorative; it’s because I’ve been studying Edward Tufte. I use color throughout the book to communicate detailed, high-resolution information faster to your brain.

Color in the book helps to reduce student workload and deliver value long after a student has left the classroom. In this class, there is no collection of slide printouts like you’ve archived after every Oracle class you’ve been to since the 1980s. The MOTD book is way better than any other material I’ve ever distributed in my career. I’ve heard students tell their friends that you have to see it to believe it.

“A paper record tells your audience that you are serious, responsible, exact, credible. For deep analysis of evidence and reasoning about complex matters, permanent high-resolution displays [that is, paper] are an excellent start.” —Edward Tufte

The Software

So, where does a student recoup all the time we used to spend going through trace files, and studying how to collect trace data on half a dozen different software architectures? In the thousands of man-hours we’ve invested into the software that we give you when you come to the course. Instead of explaining every little detail about quirks in Oracle trace data that change between Oracle versions 10.1 and 10.2 and 11.2 or 11.2.0.2 and 11.2.0.4, the software does the work for you. Instead of having to explain all the detail work, we have time to explain how to use the results of our software to make decisions about your data.

What’s the catch? Of course, we hope you’ll love our software and want to buy it. The software we give you is completely full-featured and yours to keep forever, but the license limits you to using it only with one login id, and it doesn’t include patches and upgrades, which we release a few times each year. We hope you’ll love our software so much that you’ll want to buy a license that lets you use it on any of your systems and that includes the right to upgrade as we fix bugs and add features. We hope you’ll love it so much that you encourage your colleagues to buy it.

But there’s really no catch. You get software and a course (and a book and a shirt) for less than the daily rate that we used to charge for just a course.

A Shirt?

MOTD London 2011-09-08: “I can help you trace it.”

Yes, a shirt. Each student receives a Method R T-shirt that says, “I can help you trace it.” We don’t give these things away to anyone except for students in my MOTD course. So if you see one, the person wearing it can, in actual fact, Help You Trace It.

The Net Result

The net result of all this optimization is benefits on several fronts:

The course costs a lot less than it used to. The fee is presently only about 25% of the 3-day course’s price, and the whole experience requires less than ⅓ of the time away from work that the original course did.
In the new course, our students don’t have to work so hard to make productive use of the course material. The book and the software take so much of the pressure off. We do talk about what the fields in raw trace data mean—I think it’s necessary to know that in order to use the data properly, and have productive debates with your sys/SAN/net/etc. administration colleagues. But we don’t spend your time doing exercises to untangle nested (recursive) calls by hand. The software you take home does that for you. That’s why it is so much easier for a student to put this course to work right away.
Since the course duration is only one day, I can visit far more cities and meet far more students each year. That’s good for students who want to participate, and it’s great for me, because I get to meet more people.

Plans

The only thing missing from our Mastering Oracle Trace Data course right now is you. I have taught the event now in Southlake, Texas (our home town), in Copenhagen, and in London. It’s field-tested and ready to roll. We have several cities on my schedule right now. I’ll be teaching the course in Birmingham UK on the day after UKOUG wraps up, December 8. I’ll be doing Orlando and Tampa in mid-December. I’ll teach two courses this coming January in Manhattan and Long Island. There’s Billund (Legoland) DK in April. We have more plans in the works for Seattle, Portland, Dallas, and Cleveland, and we’re looking for more opportunities.

Share the word by linking the official MOTD sticker to http://method-r.com/.

My wish is for you to help me book more cities in North America and Europe (I hope to expand beyond that soon). If you are part of a company or a user group with colleagues who would be interested in attending the course, I would love to hear from you. Registering en masse saves you money. The magic number for discounting is 10 students on a single registration from one company or user group.

I can help you trace it.

Thursday, January 13, 2011

New paper "Mastering Performance with Extended SQL Trace"

Happy New Year.

It’s been a busy few weeks. I finally have something tangible to show for it: “Mastering Performance with Extended SQL Trace” is the new paper I’ve written for this year’s RMOUG conference. Think of it as a 15-page update to chapter 5 of Optimizing Oracle Performance.

There’s lots of new detail in there. Some highlights:

How to enable and disable traces, even in un-cooperative applications.
How to instrument your application so that tracing the right code path during production operation of your application becomes dead simple.
How to make that instrumentation highly scalable (think 100,000+ tps).
How timestamps since 10.2 allow you to know your recursive call relationships without guessing.
How to create response time profiles for calls and groups of calls, with examples.
Why you don’t want to be on Oracle 11g prior to 11.2.0.2.0.
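To give a feel for what a response time profile is, here is a minimal Python sketch. It is not taken from the paper: the function name `response_time_profile` and the sample durations are invented for illustration, and the input is assumed to be already-parsed (call name, duration in seconds) pairs rather than raw trace lines.

```python
from collections import defaultdict

def response_time_profile(calls):
    """Group (call_name, seconds) pairs and rank them by total duration.

    Returns (profile, grand_total), where profile is a list of
    (name, total_seconds, call_count, percent_of_total) tuples sorted
    by total_seconds, largest first.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for name, seconds in calls:
        totals[name] += seconds
        counts[name] += 1
    grand_total = sum(totals.values())
    profile = [
        (name, totals[name], counts[name],
         100.0 * totals[name] / grand_total if grand_total else 0.0)
        for name in totals
    ]
    profile.sort(key=lambda row: row[1], reverse=True)
    return profile, grand_total

# Hypothetical measurements for one user's task (values are made up).
calls = [
    ("db file sequential read", 0.5),
    ("SQL*Net message from client", 0.375),
    ("db file sequential read", 1.0),
    ("CPU", 0.125),
]

profile, total = response_time_profile(calls)
for name, seconds, count, pct in profile:
    print(f"{name:<30} {seconds:8.3f} s {count:5d} calls {pct:6.2f}%")
print(f"{'Total':<30} {total:8.3f} s")
```

The point of sorting by total duration is that the first row tells you where the task's time went, which is usually where to look first.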

I hope you’ll be able to make productive use of it.

Wednesday, October 20, 2010

Virtual Seminar: "Systematic Oracle SQL Optimization in Real Life"

On November 18 and 19, I’ll be presenting along with Tanel Põder, Jonathan Lewis, and Kerry Osborne in a virtual (GoToWebinar) seminar called Systematic Oracle SQL Optimization in Real Life. Here are the essentials:

What:	Systematic Oracle SQL Optimization in Real Life. Learn how to think clearly about Oracle performance, find your performance problems, and then fix them, whether you’re using your own code (which you can modify) or someone else’s (which you can not modify).
Who:	Cary Millsap, Tanel Põder, Jonathan Lewis, Kerry Osborne
When:	8am–12n US Pacific Time Thursday and Friday 18–19 November 2010
How much:	475 USD (375 USD if you register before 1 November 2010)

The format will be two hours per speaker: an hour and a half for presentation time, and a half hour for questions and answers. Here’s our agenda (all times are listed in USA Pacific Time):

Thursday	8:00a–10:00a	Cary Millsap: Thinking Clearly about Performance
	10:00a–12:00n	Tanel Põder: Understanding and Profiling Execution Plans
Friday	8:00a–10:00a	Jonathan Lewis: Writing Your SQL to Help the Optimizer
	10:00a–12:00n	Kerry Osborne: Controlling Execution Plans (without touching the code)

This is going to be a special event. My staff and I can’t wait to see it ourselves. I hope you will join us.

Thursday, October 7, 2010

Agile is Not a Dirty Word

While I was writing Brown Noise in Written Language, Part 2, twice I came across the word “agile.” First, the word “agility” was in the original sentence that I was criticizing. Joel Garry picked up on it and described it as “a code word for ‘sloppy programming.’” Second, if you read my final paragraph, you might have noticed that I used the term “waterfall” to describe one method for producing bad writing. Waterfall is a reliable method for producing bad computer software too, in my experience, and for exactly the same reason. Whenever I disparage “waterfall,” I’m usually thinking fondly of “agile,” which I consider to be “waterfall’s” opposite. I was thinking fondly of “agile,” then, when I wrote that paragraph, which put me at odds with Joel’s disparaging description of the word. Such conflict usually motivates me to write something.

In my career, I’ve almost always had one foot in each of two separate worlds. These days, one foot is in the Oracle world. There, I have all my old buddies from having worked at Oracle Corporation for over a decade, from companies like Miracle and Pythian, the Oracle ACEs and ACE Directors, Oracle OpenWorld, ODTUG, and a couple dozen or so user groups that I visit every year. The other foot is in the business of software. There, I have colleagues and friends from 37signals and Fog Creek and Red Gate and Pragmatic Marketing, the Business of Software conference, and the dozens of blogs and tweets that I study every day in order to fuel a company that makes not just software that meets a list of requirements, but software that makes you feel like something magical has been accomplished when you run it.

In my Oracle world, agile is a dirty word. I have to actually be careful when I use it. To my Oracle practitioner colleagues, the A-word means, as Joel wrote, “sloppy programming.” In my business of software world, though, “agile” means wholesome golden goodness, an elegant solution to the absolutely most difficult problems in our field. I’m not being facetious one little bit here, either. The two most important influences in my professional life in the past decade have been, far and away:

Eli Goldratt’s The Goal: A Process of Ongoing Improvement
Kent Beck’s Extreme Programming Explained: Embrace Change (2nd Edition)

Far and away.

I don’t mention this among most of my Oracle friends. I don’t blurt out the A-word to them, any more than I’d blurt out the F-word at my parents’ dinner table. To talk with my Oracle friends about the goodness of “A-word development” would go over like an enthusiastic hour-long lecture on urophagia.

A lot of really smart people are very anti-“agile.” I’m pretty sure that it’s mostly because they’ve seen project leaders in the Oracle market segment using the A-word to sell—or justify—some really bad decisions (see Table 1). So the word “agile” itself has been co-opted in our Oracle culture now to mean sloppy, stupid, unprofessional, irresponsible, immature, or naive. That’s ok. I’ve had words taken away from me before. (Like “scalability,” which today is little more than some vague synonym for “fast” or “good”; or “methodology,” which apparently people think sounds cooler than “method.” ...Ok, I am actually a little angry at the agile guys for that one.) That doesn’t mean I can’t still use the concepts.

Table 1.

What people think agile means	What agile means
No written requirements specification; therefore, no disciplined way to match software to requirements.	You write your requirements as computer programs that test your software instead of writing your requirements in natural language documents that a human has to read and interpret to re-test your software every time a developer commits new source code.
No testing phase; therefore, no testing.	You test your software before every commit to your source code repository, by running your automated test suite.
No written design specification; therefore, developers just “design” as they go.	You iterate your design along with your code, but design changes are always accompanied by changes to the automated test programs (which, remember, are the specification).
Rapid prototyping always results in the production code being a fragile—well—rapid fragile, prototype.	When you can’t know how (or whether) something will work, you build it and find out—but only the parts you know you’ll really need. You use the knowledge learned from those experiences to build the one you’ll keep.
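
The first two rows of Table 1 can be made concrete with a small Python sketch. Everything in it is hypothetical: the business rule (orders of 50.00 or more ship free), the function `shipping_fee_cents`, and the test names are invented for illustration. The idea is only that each requirement is written as a test that a machine can re-run before every commit.

```python
import unittest

def shipping_fee_cents(order_total_cents):
    """Hypothetical rule: orders of 50.00 or more ship free; others pay 4.95."""
    if order_total_cents < 0:
        raise ValueError("order total cannot be negative")
    return 0 if order_total_cents >= 5000 else 495

class ShippingFeeRequirements(unittest.TestCase):
    """Each test method states one requirement in executable form."""

    def test_orders_of_50_or_more_ship_free(self):
        self.assertEqual(shipping_fee_cents(5000), 0)
        self.assertEqual(shipping_fee_cents(12000), 0)

    def test_smaller_orders_pay_the_flat_fee(self):
        self.assertEqual(shipping_fee_cents(4999), 495)

    def test_negative_totals_are_rejected(self):
        with self.assertRaises(ValueError):
            shipping_fee_cents(-1)

# Run the whole specification; a pre-commit hook could do the same thing.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ShippingFeeRequirements)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

If a design change alters the rule, the tests change with it, which is what keeps the specification and the code from drifting apart.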

Agile is not a synonym for sloppy. On the contrary, you're not really doing agile if you’re not extraordinarily disciplined. I think that is why a lot of people who try agile hit so hard when they fail. I hope you will check out Balancing Agility and Discipline: A Guide for the Perplexed, coauthored by Barry Boehm (yes, that Barry Boehm) if you feel perplexed and in need of guidance.

As with any label, I hope you’ll realize that when you use a word that stands for a complex collection of thought, not everyone who hears or reads the word sees the same mental picture. When this happens, the word ceases being a tool and becomes part of a new problem.

Monday, August 9, 2010

Mister Trace

For the past several weeks, my team at Method R have been working hard on a new software tool that we released today. It is an extension for Oracle SQL Developer called Method R Trace. We call it MR Trace for short.

MR Trace is for SQL and PL/SQL developers who care about performance. Every time you execute code from a SQL Developer worksheet, MR Trace automatically copies a carefully scoped trace file to your SQL Developer workstation. There, you can open it with any application you want, just by clicking. You can tag it for easy lookup later. There’s a 3-minute video if you’re interested in seeing what it looks like.

I’m particularly excited about MR Trace because it’s the smallest software tool we’ve ever designed. That may sound funny to a lot of people, but it won’t sound funny to you if you’ve read Rework by Jason Fried and David Heinemeier Hansson of 37signals. MR Trace does a seemingly very small thing—it gets your trace file for you—but if you’ve ever done that job yourself, you might get a kick out of seeing it happen so automatically, so simply, and so quickly.

The thing is, the normal process of getting trace files is raw misery for many of our friends. It’s a common story: “If I trace some SQL, then to get my trace files, I have to call up my SA or DBA. I apologize for the interruption and hope he’s in a good mood. I tell him what system I need him to look at. He has to figure out which trace files are the ones I need, and then he FTPs them over to where I can get to them. I try not to bother him, but there’s no other way.”

Most places don’t have any security reasons to prohibit developers from getting their trace files, but they just don’t have the time or the interest to create procedures that developers can use to fetch only the files they’re supposed to see. The resulting bother is so labor-intensive and so demotivating that developers stop fighting and just move on without trace files to guide them.

That’s a big problem: if you can’t see how the code you write consumes response time, then how can you optimize it? How can you even know if you should try to? If you have to guess where your code spends time, then you can’t possibly think clearly about performance.

We have tried to design MR Trace to be a beautiful little application that does exactly what you need by staying out of your way. If we did it right, then you won’t be thinking about MR Trace whenever you use it; you’ll just have the trace files you need, right where and when you need them. And you’ll have them with no extra typing, no extra clicks, and, for goodness’ sake, certainly no more phone calls or trips down the hall. ...Unless it’s to show off a performance problem you found and fixed before anyone else ever noticed it.

Key information:

	Name:		Method R Trace
	Type:		Extension for Oracle SQL Developer
	Function:		Zero-click trace file collector
	Price:		$49.95 USD
	Risk:		30-day free trial
	URL:		http://method-r.com/software/mrtrace
	Designer:		Method R Corporation

Friday, May 14, 2010

Filter Early

Yesterday, my 12-year-old son Alex was excited to tell me that he had learned a new trick that made it easier to multiply fractions. Here’s the trick:

The neat thing for me is that this week I’m working on my slides for ODTUG Kaleidoscope 2010 (well, actually, for the Performance Symposium that’ll occur on Sunday 27 June), and I need more examples to help encourage application developers to write code that will Filter Early. This “trick” (it’s actually an application of the Multiplicative Inverse Law) is a good example of the Filter Early principle.

Filter Early is all about throwing away data that you don’t need, as soon as you can know that you don’t need it. That’s what this trick of arithmetic is all about. Without the trick, you would do more work to multiply 4/7 × 3/4 = (4 × 3)/(7 × 4) = 12/28, and then you would do even more work again to figure out that 12 and 28 both share a factor of 4, which is what you need to know before you then divide 12/4 = 3 and then 28/4 = 7 to reduce 12/28 to 3/7. It’s smarter, faster, and more fun to use the trick. Multiplying fractions without the trick is a Filter Late operation, which is just dumb and slow.
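
The same arithmetic can be sketched in Python. This is only an illustration: the function names `multiply_filter_late` and `multiply_filter_early` are made up for this example, and the early version assumes each input fraction is already in lowest terms. Both return the same answer; the early version cancels shared factors before multiplying, so the intermediate numbers never get big.

```python
from math import gcd

def multiply_filter_late(a, b, c, d):
    """Multiply a/b by c/d first, then reduce the larger product afterward."""
    num, den = a * c, b * d           # 4/7 x 3/4 becomes 12/28
    g = gcd(num, den)                 # discover the shared factor after the fact
    return num // g, den // g

def multiply_filter_early(a, b, c, d):
    """Cancel shared factors across the two fractions, then multiply.

    Assumes a/b and c/d are each already in lowest terms.
    """
    g1 = gcd(a, d)                    # first numerator vs. second denominator
    a, d = a // g1, d // g1
    g2 = gcd(c, b)                    # second numerator vs. first denominator
    c, b = c // g2, b // g2
    return a * c, b * d

print(multiply_filter_late(4, 7, 3, 4))    # (3, 7)
print(multiply_filter_early(4, 7, 3, 4))   # (3, 7)
```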

Here are some other examples of the Filter Early pattern’s funnier (unless you’re the victim of it), sinister antipattern, Filter Late. You shouldn’t do these things:

Drop a dozen brass needles into a haystack, shuffle the haystack, and then try to retrieve the needles. (Why did I specifically choose brass? Two reasons. Can you guess?)
Pack everything you own into boxes, hire a moving company to move them to a new home, and then, after moving into your new home, determine that 80% of your belongings are junk that should be thrown away.
Return thousands of rows to the browser, even though the user only wants one or two.
To add further insult to returning thousands of rows to the browser, return the rows in some useless order. Make the user click on an icon that takes time to sort those rows into an order that will allow him to figure out which one or two he actually wanted.
Execute a database join operation in a middle-tier application instead of the database. I’m talking about the Java application that fetches 100,000 rows from table A and 350,000 rows from table B, and then joins the two result sets in a for loop, in an operation that makes 100,000 × 350,000 = 35,000,000,000 comparisons to figure out that the result set of the join contains two rows, which the database could have told you much more efficiently.
Slog row-by-row through a multimillion-row table looking for the four rows you need, instead of using an index scan to compute the addresses of those four rows and then access them directly.
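
Here is a small sketch of the middle-tier join example from the list above, written in Python with an in-memory SQLite database rather than Java and Oracle. The table names, column names, and row counts (100 and 350 rows, scaled down by a factor of 1,000) are assumptions made for illustration. The Filter Late version fetches every row and compares every pair in the application; the Filter Early version lets the database do the join and ships back only the matching rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (id INTEGER PRIMARY KEY, b_id INTEGER)")
conn.execute("CREATE TABLE b (id INTEGER PRIMARY KEY, label TEXT)")

# 100 rows in a, 350 rows in b. Negative b_id values match nothing in b.
conn.executemany("INSERT INTO a VALUES (?, ?)", [(i, -i) for i in range(1, 101)])
conn.executemany("INSERT INTO b VALUES (?, ?)",
                 [(i, f"label-{i}") for i in range(1, 351)])
# Make exactly two rows of a match rows of b.
conn.execute("UPDATE a SET b_id = 10 WHERE id = 1")
conn.execute("UPDATE a SET b_id = 20 WHERE id = 2")

def join_filter_late(conn):
    """Fetch both tables whole, then join them in application code."""
    rows_a = conn.execute("SELECT id, b_id FROM a ORDER BY id").fetchall()
    rows_b = conn.execute("SELECT id, label FROM b ORDER BY id").fetchall()
    comparisons = 0
    result = []
    for a_id, b_id in rows_a:
        for bid, label in rows_b:
            comparisons += 1
            if b_id == bid:
                result.append((a_id, label))
    return result, comparisons, len(rows_a) + len(rows_b)

def join_filter_early(conn):
    """Let the database join, and fetch only the rows that matched."""
    result = conn.execute(
        "SELECT a.id, b.label FROM a JOIN b ON b.id = a.b_id ORDER BY a.id"
    ).fetchall()
    return result, len(result)

late_rows, comparisons, rows_fetched = join_filter_late(conn)
early_rows, rows_shipped = join_filter_early(conn)
print(f"filter late:  {rows_fetched} rows fetched, {comparisons} comparisons")
print(f"filter early: {rows_shipped} rows fetched")
```

Both functions return the same two rows. The difference is how much data crosses the wire and how much work the application does to throw most of it away.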

Converting a Filter Late application into a Filter Early application can make performance unbelievably better.

One of my favorite features of the Oracle Exadata machine is that it applies the Filter Early principle where a lot of people would have never thought to try it. It filters query results in the storage server instead of the database server. Before Exadata, the Oracle Database passed disk blocks (which contain some rows you do need, but also some rows you don’t) from the storage server to the database. Exadata passes only the rows you need back to the database server (Chris Antognini explains).

How many Filter Early and Filter Late examples do you know?

Tuesday, December 22, 2009

The “Do What You Love” Mirage

I am inspired by having read an article called “Do what you love mirage” by Denis Basaric. It begins...

“Do what you love” is advice I hear exclusively from financially secure people. And it rings hollow to me. When you need money to survive, you do any work that is available, love does not play into that choice. Desperation does.

Please read it before you go on.

Welcome back.

This article puts a very important cycle within my life into words. I believe, as Denis says, that a lot of times, we get the cause-effect relationship mixed up when we think about loving what we do.

I love what I do. Well, a lot of it. But Denis is right: I didn’t choose what I do out of love. I chose what I love out of doing. Some examples:

I love mathematics. But I most assuredly did not always love it. I learned to love it through working hard at it.
It’s the same thing with writing. I love it, but I didn’t always. At first, writing was unrewarding drudgery, which is how most people I meet seem to feel about it.
I love public speaking, but I sure didn’t love it when my speech class made me sick to my stomach three mornings a week for a whole semester my freshman year.
I love being an Oracle performance specialist, but I sure didn’t love being airlifted into crisis after crisis throughout the early 1990s.

I could go on. The point is, my life would be unrecognizably different if not for several really painful situations that I decided to endure with the resolve to get really good at what I hated. Until I loved it.

In retrospect, I seem to have been very lucky in many important situations. Of course, I have. But you make your own luck. Although I believe deeply in the idea of, “The harder I work, the luckier I get,” that is not what I’m talking about here. I’m talking about the power that you have to define for yourself whether something that happened was lucky for you or not. Your situations do not define your life. You create your life based on how you regard your situations.

I could have rebelled against Jimmy Harkey and hated math for the rest of my life. Lots of kids did. I could have rebelled against Lewis Parkhill and never become a writer. I could have refused Craig Newberger’s advice to take his second speech course and never become comfortable in front of an audience. I could have left Oracle in 1991 and found a job where they had more mature products....

One of the most important questions that I ever asked my wife before our engagement was this:

If you were forced to wash cars for 12 hours a day, just to make a living, could you enjoy it?

This is a “soulmate” kind of question for me. My wife’s attitude about it is, for our children and me, possibly the most valuable gift in our lives.

Loving what you do can be difficult. I think Denis hits the nail on the head by suggesting that,

By doing good work, you just might find out that what you are doing, is what you are supposed to do. And if you don’t, quality work will get you to where you want to be.

I hope you will find love in what you do today. Do it well, and it’ll definitely improve your odds.

Monday, December 21, 2009

My Whole System Is Slow. Now What?

At CMG'09 a couple of weeks ago, I presented "Measuring Response Times of Code on Oracle Systems." The paper for this presentation was a subset of "For Developers: Making Friends with the Oracle Database." In the presentation, I spent a few minutes talking about why to measure response times in Oracle, and then I spent a lot of minutes talking about how. As usual, I focused heavily on the importance of measuring response times of individual business tasks executed by individual end users.

At the end of the talk, a group of people came to the podium to ask questions (always a good sign). The first question was the question that a lot of people ask. It was:

My whole system is slow. That's all my users will tell me. So then, how do I begin to do what you're describing?

Here's the answer:

Ask your users to show you what they're doing. Just go look at it.

The results of this simple advice are routinely spectacular. Just go look at it: I'm surprised whenever someone doesn't think of doing that, but I shouldn't be. That's because I didn't do it either, for the longest time. I had to learn to. And that's the story I want to tell you here.

In the early 1990s, I was a consultant with Oracle Corporation visiting clients with performance problems at a pace of more than 30 per year. Back then, I did Oracle performance work the old fashioned way: I checked everything I knew how to check, and then I fixed everything I knew how to fix. All billable by the hour. (Note: When I was doing it this way, I had not yet been taught by Dave Ensor, who changed me forever.)

On weeks when I was lucky, I'd be finished checking and fixing by sometime Wednesday, leaving a couple of days to find out what people thought of my work. If I were lucky again (that's two "lucky"s now), everyone would be thrilled with the results. I'd get my hug (so to speak), and I'd catch my flight.

But I wasn't always lucky. Some weeks, I wouldn't find anything suspicious in my checking and fixing. Some weeks, I'd find plenty, but still not everyone would be thrilled with the work. Having people be less than thrilled with my work caused pain for me, which motivated me to figure out how to take more control of my consulting engagements, to drive luck out of the equation.

The most important thing I figured out was...

People knew before I came on-site how they were going to measure on Thursday whether they liked the results of my work.

And...

They were willing to tell me on Monday.

All I had to do was be honest, like this:

On the day I'm done working here, I'd imagine you're going to want to run something that will demonstrate whether I accomplished what you were hoping for while I was here. Would you mind telling me about that now? Maybe even showing me?

I could ask that on Monday, and people were glad to tell me. I'd watch the things run and record how long they ran, and then I'd know how to prioritize my time on-site. Those recorded timings also meant that at the end of my engagement, I'd be able to show very clearly what improvements I had made.

Sometimes, there would be thirty different things that people would expect to measure on Thursday. If I might not have time to fix them all, then I needed to make sure that I knew the priority of the things I was being asked to fix.

That one step alone—knowing on Monday that prioritized list of what tasks needed to be fast by Thursday—drastically reduced my reliance on luck as a success factor in my job at these sites. Knowing that list on Monday is just like when your teacher in school tells you exactly what's going to be on your next test. It allows you to focus your attention on exactly what you need to do to optimize your reward for the week. (Note to fellow education enthusiasts: Please don't interpret this paragraph as my advocating the idea that GPA should be a student's sole—or even dominant—optimization constraint.)
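Here's a minimal sketch in Python of what that Monday list could look like if you kept it in code instead of a notebook. The task names, priorities, and timings are made up for illustration, and `Task` and `thursday_report` are hypothetical names, not part of any tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    name: str
    priority: int                        # 1 = the task the business cares about most
    monday_s: float                      # response time observed on Monday, in seconds
    thursday_s: Optional[float] = None   # filled in once the work is done


def thursday_report(tasks):
    """Return (name, monday_s, thursday_s, seconds_saved) rows in priority order."""
    rows = []
    for t in sorted(tasks, key=lambda t: t.priority):
        saved = None if t.thursday_s is None else t.monday_s - t.thursday_s
        rows.append((t.name, t.monday_s, t.thursday_s, saved))
    return rows


# Hypothetical data: two tasks a user showed me on Monday.
tasks = [
    Task("month-end close batch", priority=2, monday_s=5400.0),
    Task("order entry: book order click", priority=1, monday_s=12.0),
]
tasks[1].thursday_s = 1.5   # re-measured after the fix

for row in thursday_report(tasks):
    print(row)
```

Sorting by priority keeps the most important task at the top, and a task with no Thursday measurement shows up with no claimed improvement rather than being dropped from the list.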

So, what I learned is that the very first step of any good performance optimization method is necessarily this:

1. Identify the task that's the most important to you.

When I say "task," think "program" or "click" or "batch job" if you want to. What I mean is "a useful unit of work that makes sense to the business." ...Something that a business user would show you if you just went and watched her work for a few minutes.

Then comes step two:

2. Measure its response time (R). In detail.

Why is response time so important? Because that's what's important to the person who'll be watching it run on Thursday, assessing whether she thinks you've done a good job or not. That person's going to click and then wait. Happiness will be inversely proportional to how long the wait is. That's it. That's what "performance" means at 99% of sites I've ever visited.

(If you're interested in the other 1% of sites I've visited, they're interested in throughput, which I've written about in another blog post.)

Measuring response time is vital. You must be able to measure response time if you're going to nail that test on Thursday.

The key is to understand that the term response time doesn't even have a definition except in the context of a task. You can't measure response time if you don't first decide what task you're going to measure. In other words, you cannot do step 2 before you do step 1. With Oracle, for example, you can collect ASH data (if you're licensed to use it) or even trace data for a whole bunch of Oracle processes, but you won't have a single response time until you define which tasks buried within that data are the ones you want to extract and pay attention to.

You get that by visiting a user and watching what she does.
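To make the "no response time without a task" point concrete, here's a small Python sketch. It assumes you already have timed events tagged with the task they belong to. The event list, the task names, and the `profile` function are all hypothetical; real trace or ASH data would need its own extraction step first.

```python
from collections import defaultdict

# Hypothetical timed events from several processes: (task, component, seconds).
EVENTS = [
    ("book order", "CPU", 0.8),
    ("book order", "db file sequential read", 2.4),
    ("nightly report", "CPU", 40.0),
    ("book order", "SQL*Net message from client", 0.3),
    ("nightly report", "db file scattered read", 95.0),
]


def profile(events, task):
    """Return (response_time, components) for one task.

    components is a list of (component, seconds), largest contributor first.
    """
    by_component = defaultdict(float)
    for name, component, seconds in events:
        if name == task:                 # ignore everything that isn't this task
            by_component[component] += seconds
    total = sum(by_component.values())
    ranked = sorted(by_component.items(), key=lambda kv: kv[1], reverse=True)
    return total, ranked


total, ranked = profile(EVENTS, "book order")
print(f"book order: {total:.1f} s")
for component, seconds in ranked:
    print(f"  {component}: {seconds:.1f} s")
```

Until you pass a task name, the pile of events has no response time at all, only a lot of durations. Once you name the task, you get one number the user would recognize, plus a breakdown of where that time went.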

There are lots of excuses for not watching your users. Like these...

"I don't know my users." I know. But you should. You'd do your job better if you did. And your users would, too.
"My users aren't here." I know. They're on the web. They're in Chicago and Singapore and Istanbul, buying plane tickets or baseball caps or stock shares. But if you can't watch at least a simulation of the things those users actually do with the system you help manage, then I can't imagine how you would possibly succeed at providing good performance to them.
"I'm supposed to be able to manage performance with my dashboard." I know. I was supposed to have a hover car by the year 2000.

The longer you stay mired in excuses like these, the longer it's going to be before you can get the benefit of my point here. Your users are running something, and whatever that is that they're running is your version of my Thursday test. You can check and fix all you want, but unless you get lucky and fix the exact tooth that's hurting, your efforts aren't going to be perceived as "helpful." Checking and fixing everything you can think of is far less efficient and effective than targeting exactly what your user needs you to target.

Lots of performance analysts (DBAs, developers, architects, sysadmins, and so on) assume that when someone says, "The whole system is slow," it means there must be a single parameter somewhere in the bowels of the system that needs adjustment, and if you can just make that adjustment, everything is going to be ok. It might mean that, but in my experience, the overwhelming majority of cases are not that way. (Pages 25–29 of Optimizing Oracle Performance have more information about this.)

The great thing about measuring response time is that no matter what the problem is, you'll see it. If the program you're watching is poorly written, you'll see it. If some other program is hogging too much of a resource that your program needs, you'll see it. If you have a bad disk controller, you'll see it. If some parameter needs adjusting, you'll see it.

Realize that when a business user says "system," she doesn't mean what you would mean if you said "system." She means that the thing she runs is slow. Look at that thing. Maybe there are seventeen of them. And sure, maybe all seventeen suffer from the same root cause. If that's the case, then fantastic, because fixing the first problem will magically fix the other sixteen, too. If it's not, then fantastic anyway, because now all of them are on your prioritized list of tasks to optimize, and you'll probably surprise yourself with how quickly you'll be able to pick them all off when you focus on one task at a time.

Thursday, November 12, 2009

Why We Made Method R

Twenty years ago (well, a month or so more than that), I entered the Oracle ecosystem. I went to work as a consultant for Oracle Corporation in September 1989. Before Oracle, I had been a language designer and compiler developer. I wrote code in lex, yacc, and C for a living. My responsibilities had also included improving other people's C code: making it more reliable, more portable, easier to read, easier to prove, and easier to maintain; and it was my job to teach other people in my department how to do these things themselves. I loved all of these duties.

In 1987, I decided to leave what I loved for a little while, to earn an MBA. Fortunately, at that time, it was possible to earn an MBA in a year. After a year of very difficult work, I had my degree and a new perspective on business. I interviewed with Oracle, and about a week later I had a job with a company that a month prior I had never heard of.

By the mid-1990s, circumstances and my natural gravity had matched to create a career in which I was again a software developer, optimizer, and teacher. By 1998, I was the manager of a group of 85 performance specialists called the System Performance Group (SPG). And I was the leader of the system architecture and system management consulting service line within Oracle Consulting's Global Steering Committee.

My job in the SPG role was to respond to all the system performance-related issues in the USA for Oracle's largest accounts. My job in the Global Steering Committee was to package the success of SPG so that other practices around the world could repeat it. The theory was that if a country manager in, say, Venezuela, wanted his own SPG, then he could use the financial models, budgets, hiring plans, training plans, etc. created by my steering committee group. Just add water.

But there was a problem. My own group of 85 people consisted of two very different types of people. About ten of these 85 people were spectacularly successful optimizers whom I could send anywhere with confidence that they'd thrive at either improving performance or proving that performance improvements weren't possible. The other 75 were very smart, very hard-working people who would grow into the tip of my pyramid over the course of more years, but they weren't there yet.

The problem was, how do you convert good, smart, hard-working people in the base of the SPG pyramid into people in the tip? The practice manager in Venezuela would need to know that. The answer, of course, is supposed to be the Training Plan. Optimally, the Training Plan consists of a curriculum of a few courses, a little on-the-job training, and then, presto: tip of the pyramid. Just add water.

But unfortunately that wasn't the way things worked. What I had been getting instead, within my own elite group, was a process that took many years to convert a smart, hard-working person into a reasonably reliable performance optimizer whom you could send anywhere. Worse yet, the peculiar stresses of the job (like being away from home 80% of the time, continually visiting angry people each week, and having to work for me) caused an outflow of talent that approximately equaled the inflow of people who made it to the tip of the pyramid. The tip of my pyramid never grew beyond roughly 10 people.

The problem, by definition, was the Training Plan. It just wasn't good enough. It wasn't that the instructors of Oracle's internal "tuning" courses were doing a poor job of teaching courses. And it wasn't that the course developers had done a poor job of creating courses. On the contrary, the instructors and course developers were doing excellent work. The problem was that the courses were focusing on the wrong thing. The reason that the courses weren't getting the job done was that the very subject matter that needed teaching hadn't been invented yet.

I don't expect the people who write, say, the course called "Braking System Repair for Boeing 777" to have themselves invented the braking system they write about. So, the question was, who should be responsible for inventing the subject matter on how to optimize Oracle? I decided that I wanted that person to be me. I deliberated carefully and decided that my best chance of doing that the way I wanted to do it would be outside of Oracle. So in October 1999, ten years and one week after I joined the company, I left Oracle with the vision of creating a repeatable, teachable method for optimizing Oracle systems.

Ten years later, this is still the vision for my company, Method R Corporation. We exist not to make your system faster. We exist to make you faster at making all your systems faster. Our work is far from done, but here is what we have done:

Written white papers and other articles that explain Method R to you at no cost.
Written a book called Optimizing Oracle Performance, where you can learn Method R at a low cost.
Created a Method R course (on which the book is based), to teach you how to diagnose and repair response time problems in Oracle-based systems.
Spoken at hundreds of public and private events where we help people understand performance and how to manage it.
Provided consulting services to make people awesome at making their systems faster and more efficient.
Created the first response time profiling software ever for Oracle software applications, to let you analyze hundreds of megabytes of data without drudgery.
Created a free instrumentation library so that you can instrument the response times of Oracle-based software that you write.
Created software tools to help you be awesome at extracting every drop of information that your Oracle system is willing to give you about your response times.
Created a software tool that enables you to record the response time of every business task that runs on your system so you can effortlessly manage end-user performance.

As I said, our work is far from done. It's work that really, really matters to us, and it's work we love doing. I expect it to be a journey that will last long into the future. I hope that our journey will intersect with yours from time to time, and that you will enjoy it when it does.