Wednesday, October 20, 2010

Virtual Seminar: "Systematic Oracle SQL Optimization in Real Life"

On November 18 and 19, I’ll be presenting along with Tanel Põder, Jonathan Lewis, and Kerry Osborne in a virtual (GoToWebinar) seminar called Systematic Oracle SQL Optimization in Real Life. Here are the essentials:

What: Systematic Oracle SQL Optimization in Real Life.
Learn how to think clearly about Oracle performance, find your performance problems, and then fix them, whether you’re using your own code (which you can modify) or someone else’s (which you can not modify).
Who: Cary Millsap, Tanel Põder, Jonathan Lewis, Kerry Osborne
When: 8am–12n US Pacific Time Thursday and Friday 18–19 November 2010
How much: 475 USD (375 USD if you register before 1 November 2010)

The format will be two hours per speaker: an hour and a half for presentation time, and a half hour for questions and answers. Here’s our agenda (all times are listed in USA Pacific Time):

Thursday8:00a–10:00aCary Millsap: Thinking Clearly about Performance
10:00a–12:00nTanel Põder: Understanding and Profiling Execution Plans
Friday8:00a–10:00aJonathan Lewis: Writing Your SQL to Help the Optimizer
10:00a–12:00nKerry Osborne: Controlling Execution Plans (without touching the code)

This is going to be a special event. My staff and I can’t wait to see it ourselves. I hope you will join us.

Thursday, October 7, 2010

Agile is Not a Dirty Word

While I was writing Brown Noise in Written Language, Part 2, twice I came across the word “agile.” First, the word “agility” was in the original sentence that I was criticizing. Joel Garry picked up on it and described it as “a code word for ‘sloppy programming.’” Second, if you read my final paragraph, you might have noticed that I used the term “waterfall” to describe one method for producing bad writing. Waterfall is a reliable method for producing bad computer software too, in my experience, and for exactly the same reason. Whenever I disparage “waterfall,” I’m usually thinking fondly of “agile,” which I consider to be “waterfall’s” opposite. I was thinking fondly of “agile,” then, when I wrote that paragraph, which put me at odds with Joel’s disparaging description of the word. Such conflict usually motivates me to write something.

In my career, I’ve almost always had one foot in each of two separate worlds. These days, one foot is in the Oracle world. There, I have all my old buddies from having worked at Oracle Corporation for over a decade, from companies like Miracle and Pythian, the Oracle ACEs and ACE Directors, Oracle OpenWorld, ODTUG, and a couple dozen or so user groups that I visit every year. The other foot is in the business of software. There, I have colleagues and friends from 37signals and Fog Creek and Red Gate and Pragmatic Marketing, the Business of Software conference, and the dozens of blogs and tweets that I study every day in order to fuel a company that makes not just software that meets a list of requirements, but software that makes you feel like something magical has been accomplished when you run it.

In my Oracle world, agile is a dirty word. I have to actually be careful when I use it. To my Oracle practitioner colleagues, the A-word means, as Joel wrote, “sloppy programming.” In my business of software world, though, “agile” means wholesome golden goodness, an elegant solution to the absolutely most difficult problems in our field. I’m not being facetious one little bit here, either. The two most important influences in my professional life in the past decade have been, far and away:
  1. Eli Goldratt’s The Goal: A Process of Ongoing Improvement
  2. Kent Beck’s Extreme Programming Explained: Embrace Change (2nd Edition)
Far and away.

I don’t mention this among most of my Oracle friends. I don’t blurt out the A-word to them, any more than I’d blurt out the F-word at my parents’ dinner table. To talk with my Oracle friends about the goodness of “A-word development” would go over like an enthusiastic hour-long lecture on urophagia.

A lot of really smart people are very anti-“agile.” I’m pretty sure that it’s mostly because they’ve seen project leaders in the Oracle market segment using the A-word to sell—or justify—some really bad decisions (see Table 1). So the word “agile” itself has been co-opted in our Oracle culture now to mean sloppy, stupid, unprofessional, irresponsible, immature, or naive. That’s ok. I’ve had words taken away from me before. (Like “scalability,” which today is little more than some vague synonym for “fast” or “good”; or “methodology,” which apparently people think sounds cooler than “method.” ...Ok, I am actually a little angry at the agile guys for that one.) That doesn’t mean I can’t still use the concepts.

Table 1.
What people think agile meansWhat agile means
No written requirements specification; therefore, no disciplined way to match software to requirements.You write your requirements as computer programs that test your software instead of writing your requirements in natural language documents that a human has to read and interpret to re-test your software every time a developer commits new source code.
No testing phase; therefore, no testing.You test your software before every commit to your source code repository, by running your automated test suite.
No written design specification; therefore, developers just “design” as they go.You iterate your design along with your code, but design changes are always accompanied by changes to the automated test programs (which, remember, are the specification).
Rapid prototyping always results in the production code being a fragile—well—rapid fragile, prototype.When you can’t know how (or whether) something will work, you build it and find out—but only the parts you know you’ll really need. You use the knowledge learned from those experiences to build the one you’ll keep.

Agile is not a synonym for sloppy. On the contrary, you're not really doing agile if you’re not extraordinarily disciplined. I think that is why a lot of people who try agile hit so hard when they fail. I hope you will check out Balancing Agility and Discipline: A Guide for the Perplexed, coauthored by Barry Boehm (yes, that Barry Boehm) if you feel perplexed and in need of guidance.

As with any label, I hope you’ll realize that when you use a word that stands for a complex collection of thought, not everyone who hears or reads the word sees the same mental picture. When this happens, the word ceases being a tool and becomes part of a new problem.

Thursday, September 30, 2010

Brown Noise in Written Language, Part 2

Here is some more thinking on the subject of brown noise in written language, stimulated by Joel Garry’s comment to my prior post.

My point is not an appeal for more creative writing in the let’s-use-lots-of-adverbs sense. It’s an appeal for clarity of expression. More fundamentally, it is an appeal for having an idea to express in the first place. If you have an actual idea and express it in a useful way, then maybe you've created something that is not spam (even if it happens to be a mass mailing), because it yields some value to your audience.

My point is about being creative only to the extent that if you haven’t created an interesting thought to convey by the time you’ve written something, then you don’t deserve—and you’re not going to get—my attention. (Except you might get me to criticize your writing in my blog.)

What Lanham calls “the Official Style” is a tool for solving two specific problems: There’s (1) “I have no clear thought to express, yet I'm required to write something today.” And (2) “I have a thought I'd like to express, but I'm afraid that if I just come out and say it, I'll get in trouble.” Problem #1 happens, for example, to school children who are required to write when they really don’t have anything in mind to be passionate about. Problem #2 happens to millions who live out the Emperor’s New Clothes every day of their lives. They don’t “get” what their mission is or why it’s important, so when they’re required to write, they encrypt their material to hide from their audience that they don’t get it. The result includes spam, mission statements, and 98% of the PowerPoint presentations you’ll ever see in your life.

I’m always more successful when I orient my thoughts in the direction of gratitude, so a better Part 1 post from me would have been structured as:
  1. Wow, look at this horrible, horrible sentence. I am so lucky I don't have to live and work in an environment where this kind of expression (and by implication, this kind of thinking) is deemed acceptable.
  2. I highly recommend Lanham's Revising Prose. It is brilliant. It helps you fix this kind of writing, and—more importantly—the kind of thinking that leads to it.
  3. I’m grateful for the work of people like Lanham, Fried, Heinemeier-Hansson, and many others, who help us understand and appreciate clear thinking and courageous writing.
Writing is not just output. Writing is an iterative process—along with thinking, experimenting, testing—that creates new thought. If you try to use the waterfall approach when you write—“Step 1: Do all your thinking; Step 2: Do all your writing”—then you’ll miss the whole point of how writing clarifies and creates new thought. That is why learning how to revise prose is so important. It’s not just about how to make writing better. As Lanham illustrates in dozens of examples throughout his book, revising prose forces improvement in the writer’s thinking, which enriches the writer’s life even more than the writing, however tremendous, will enrich the reader.

Brown Noise in Written Language

Today’s email brought a loaf of spam with this in it:
[Name withheld] is a world-class developer and provider of leading-edge solutions that help customers optimize the physical infrastructure through simplification, agility, and operational efficiency.
This passage is the informational equivalent of this audio file. If you can read it without feeling sad, sarcastic, vaguely scummy, or bitter about humanity’s perverse unwillingness to combine thought and language in a useful way, then I beg you to read Revising Prose and Rework.


That is all.

Sunday, September 26, 2010

My Actual OTN Interview

And now, the actual OTN interview (9:11) is online. Thank you, Justin; it was a lot of fun. And thank you to Oracle Corporation for another great show. It's an ever-growing world we work in, and I'm thrilled to be a part of it.

Friday, September 10, 2010

New Method R Blogs

Today we installed two new blogs over at our web page. We created them to give us—all of us at Method R—a place to talk about our products and professional experiences. I hope you'll come have a look.

I'll still be posting here, too.

Thursday, September 2, 2010

My OTN Interview at OOW2010 (which hasn’t happened yet)

Yesterday, Justin Kestelyn from Oracle Technology Network sent me a note specifying some logistics for our OTN interview together called “Coding for Performance, with Cary Millsap.” It will take place at Oracle OpenWorld on Monday, September 20 at 11:45am Pacific Time.

One of Justin’s requests in his note was, “Topics: Please suggest 5 topics for discussion?” So, I thought for a couple of minutes about questions I’d like him to ask me, and I wrote down the first one. And then I thought to myself that I might as well write down the answer I would hope to say to this; maybe it’ll help me remember everything I want to say. Then I wrote another question, and the answer just flowed, and then another, and another. Fifteen minutes later, I had the whole thing written out.

I told Justin all this, and we agreed that it would be fun to post the whole interview here on my blog, before it ever happened. And then during the actual interview, we’ll see what actually happens. It’ll all be in Justin’s hands by then.

So, here we go. Justin Kestelyn’s interview, “Coding for Performance, with Cary Millsap.” Which hasn’t happened yet.

◆ ◆ ◆

Justin: Hi, Cary. Welcome to the show, etc., etc.

Cary: Hi, Justin. It’s great to be here. Thank you for having me, etc., etc.

Justin: So tell me, ... What is the most important thing to know about performance?

Cary: Performance is all about code path. There are only three ways that a program can consume your time. There’s (1) the actual execution of your program’s instructions. There’s (2) queueing delay, which is what you get when you visit a resource that’s busy serving someone else (CPU, disk, network, etc.). And there’s (3) coherency delay, which is when you await some other process’s permission to execute your next step. The code you’re running controls all three of those ways you can spend time. So understanding performance is all about understanding code, whether it’s Java or PHP or C# that you wrote, or the C code that the Oracle Database kernel developers have written for you.

Justin: Is tuning SQL or PL/SQL any different from tuning Java or PHP or C#?

Cary: The tools are a little different, but the fundamentals are exactly the same. You find out which code path in your application is consuming your time, and then you go after it. The best thing to do is figure out a way to execute that code path less often (because the fastest way to do anything is to not do it at all). The next best thing to do is try to figure out a way to make any instructions that can’t be eliminated, faster. That’s the whole trick.

Justin: You make it sound easy.

Cary: It usually is easy once you can collect the data you need to guide you. ...Once you know how to get the system to tell you where it’s spending your time. People make it hard on themselves anytime they try to use performance data that includes information about anything other than the specific user experience they’re trying to fix. Like when they try to fix the performance of some click on a web form by looking at CPU utilization data on their application server or their database server.

Another thing that makes it really hard is the design of the application. More tiers means more complexity when it comes time to diagnose performance problems. And some User Interface designs are just guaranteed to create performance problems. My presentation here called “Messed-Up Apps” is a showcase of a few of those kinds of designs. The message there is that performance is something that has to be designed into an application from the start, like any other feature. Performance is not something you can paint on at the end.

Justin: What can developers do to maximize the performance of the applications they write?

Cary: The most important thing is to remember a couple of key ideas. First, Barry Boehm showed that the cost of repairing defects increases hyperbolically, the later you find them in your development and deployment life cycle. That’s true for performance defects just like it is for functional defects. Second, what Donald Knuth wrote 40 years ago is still true today: when developers try to guess where their code is slow, they do an awful job. Even great developers, when they profile the response time of their code, they’re often surprised at where that code is spending their (or their users’) time. So, profiling early in the software development life cycle is vital.

Next, it’s important to test. Not just functional requirements, but test performance requirements, too. Finally, it’s important to realize that there’s no way that your testing can catch every performance problem that can go wrong, so it’s important to make your application code easy to diagnose and repair in production. You do that with good instrumentation so your production system managers can profile in production when they need to, just like the developers do on the development and test systems.

Justin: How do you—a developer—profile your code?

Cary: Every development language has profiling tools that go with it. They’re tools that you can point at your application when it runs to show exactly where every smidgen of response time is being consumed within that code. The first profiler I was ever aware of is the -pg flag on C compilers. You gcc -pg to compile your code, and then after you run your code, you can use gprof to profile where your time went. Java has profilers, PHP has them, Perl, C#, C++, all of them.

Even the Oracle Database has a profiling capability, but they don’t call it “profiling” (that name means something else in the Oracle documentation). The extended SQL trace data that Oracle emits when you do the right DBMS_MONITOR.SESSION_TRACE_ENABLE call is a written record of where every bit of your response time went. That’s “profiling,” in the computer science sense of the word. Those files have been the basis of my career as a performance analyst and software tools author for the past 20 years or so.

Justin: Tell us a little bit about your company, Method R. You founded it a couple of years ago?

Cary: Yes, I started a new company called Method R Corporation in April 2008. We’ve had a great time writing tools for people and performing services (teaching and consulting) to help people solve their performance problems. Our core business is building tools that help people do for themselves what we know how to do with performance. The trace data that Oracle emits is very complicated, and we have software tools that make it easy to get what you need from those trace files.

We also have an extension for SQL Developer that makes it easy to get the trace data itself, while you’re developing a new SQL- or PL/SQL-based application. We’re also working on a number of very large development projects for customers in which we’re writing complex application code that has to scale to outrageous workloads. We’re always looking for ways we can help people.

Justin: Well, that’s all the time we have today. I really enjoyed talking with you, etc., etc.

Cary: Oh, it was my pleasure. Thank you for having me, and good luck with the rest of the Show.

◆ ◆ ◆

...Which hasn’t happened yet. ☺

I hope you enjoyed, and I’ll look forward to seeing you at Oracle OpenWorld 2010.

I’ll be presenting “Messed-Up Apps: a study of performance antipatterns” at the ODTUG User Group Forum on Sunday at 3:00pm, and “Thinking Clearly about Performance” at the main event on Tuesday at 12:30pm. See you there!

Monday, August 9, 2010

Mister Trace

For the past several weeks, my team at Method R have been working hard on a new software tool that we released today. It is an extension for Oracle SQL Developer called Method R Trace. We call it MR Trace for short.

MR Trace is for SQL and PL/SQL developers who care about performance. Every time you execute code from a SQL Developer worksheet, MR Trace automatically copies a carefully scoped trace file to your SQL Developer workstation. There, you can open it with any application you want, just by clicking. You can tag it for easy lookup later. There’s a 3-minute video if you’re interested in seeing what it looks like.

I’m particularly excited about MR Trace because it’s the smallest software tool we’ve ever designed. That may sound funny to a lot of people, but it won’t sound funny to you if you’ve read Rework by Jason Fried and David Heinemeier Hansson of 37signals. MR Trace does a seemingly very small thing—it gets your trace file for you—but if you’ve ever done that job yourself, you might get a kick out of seeing it happen so automatically, so simply, and so quickly.

The thing is, the normal process of getting trace files is raw misery for many of our friends. It’s a common story: “If I trace some SQL, then to get my trace files, I have to call up my SA or DBA. I apologize for the interruption and hope he’s in a good mood. I tell him what system I need him to look at. He has to figure out which trace files are the ones I need, and then he FTPs them over to where I can get to them. I try not to bother him, but there’s no other way.”

Most places don’t have any security reasons to prohibit developers from getting their trace files, but they just don’t have the time or the interest to create procedures that developers can use to fetch only the files they’re supposed to see. The resulting bother is so labor-intensive and so demotivating that developers stop fighting and just move on without trace files to guide them.

That’s a big problem: if you can’t see how the code you write consumes response time, then how can you optimize it? How can you even know if you should try to? If you have to guess where your code spends time, then you can’t possibly think clearly about performance.

We have tried to design MR Trace to be a beautiful little application that does exactly what you need by staying out of your way. If we did it right, then you won’t be thinking about MR Trace whenever you use it; you’ll just have the trace files you need, right where and when you need them. And you’ll have them with no extra typing, no extra clicks, and—for goodness’ sake—certainly no more phone calls or trips down the hall. ...Unless it’s to to show off a performance problem you found and fixed before anyone else ever noticed it.

Key information:

Method R Trace
Extension for Oracle SQL Developer
Zero-click trace file collector
$49.95 USD
30-day free trial
Method R Corporation

Monday, July 26, 2010

Thinking Clearly is more important than the Right Answer

Have you ever met anyone who attracted your attention because he had the right idea, but the more you got to know how he arrived at that idea, the less attracted you felt?

All our lives, we learn how important it is to be correct, to have the right answer. You gotta have the right answer to make good grades in school, to nail that interview, to be accepted by your peers and your families and your supervisors, .... But too many people think that an education is merely a sequence of milestones at which you demonstrate that you know the right answer. That view of education is unfortunate.

Here’s a little trick that will help me demonstrate. I’m sure you already know how to “cancel” factors in fractions, like I showed in my Filter Early post, to make division simpler. Like this:But, did you know that you can do this, too?
You never knew you could do that, did you?

Well, that’s because you can’t. Canceling the nines produces the right answer in this case: 95/19 is in fact 5/1. But the trick works only in a few special cases. It doesn’t, for example, work here:
Canceling digits like this is not a reliable technique for reducing fractions. (Here’s a puzzle for you. For how many two-digit number pairs will this digit-canceling trick work? What are they? How did you figure it out?)

The trick’s problem is precisely its lack of reliability. A process is reliable only if it works every time you use it. Incomplete reliability is the most insidious of vices. If you have a tool that never works, you learn quickly never to depend upon it, so it doesn’t hurt you too badly. But if you have a tool that works sometimes, then you can grow to trust it—which increases the stakes—and then it really hurts you when it fails.

Of course, you can make a partially reliable tool useful with some extra work. You can determine under what limited circumstances the tool is reliable, and under what circumstances it isn’t. Engineers do it all the time. Aluminum is structurally unreliable in certain temperature ranges, so when a part needs to operate in those ranges, they don’t build it out of aluminum. In some cases, a tool is so unreliable—like our cancel-the-digits trick—that you’re better off abandoning it entirely.

So, if your student (your child) were to compute 95/19 = 5/1 by using the unreliable cancel-the-digits method, should you mark the problem correctly solved? It’s the right answer; but in this case, the correctness of the answer is actually an unfortunate coincidence.

I say unfortunate, because any feedback that implies, “you can reduce fractions by canceling digits,” helps to create a defect in the student’s mind. It creates a bug—in the software sense—that he’ll need to fix later if he wants to function properly. That’s why showing your work is so important for students. How can someone evaluate your thinking if all you show is your final answer?

Being a good teacher requires many of the same skills as being a good software tester. It’s not just about whether the student can puke out the right answers, it’s whether the process in the student’s mind is reliable. For example, if a student is prone to believing in an unreliable trick like cancel-the-digits, then a test where all the problems submit nicely to that trick is a really bad test.

Likewise, being a good student requires many of the same skills as being a good software developer. It’s not just fitting your mind to the problems in the book; it’s exploring how the things you’re learning (both code path and data) can help you solve other problems, too. Being a good student means finding out “Why?” a lot. Why does this work? Does it always work? When does it not work?

Clear thinking is more important than the right answer. Certainly you want the right answer, but knowing how to find the right answer is far more important. It’s the difference between having a fish and knowing how to catch more.

Friday, May 14, 2010

Filter Early

Yesterday, my 12 year-old son Alex was excited to tell me that he had learned a new trick that made it easier to multiply fractions. Here’s the trick:

The neat thing for me is that this week I’m working on my slides for ODTUG Kaleidoscope 2010 (well, actually, for the Performance Symposium that’ll occur on Sunday 27 June), and I need more examples to help encourage application developers to write code that will Filter Early. This “trick” (it’s actually an application of the Multiplicative Inverse Law) is a good example of the Filter Early principle.

Filter Early is all about throwing away data that you don’t need, as soon as you can know that you don’t need it. That’s what this trick of arithmetic is all about. Without the trick, you would do more work to multiply 4/7 × 3/4 = (4 × 3)/(7 × 4) = 12/28, and then you would do even more work again to figure out that 12 and 28 both share a factor of 4, which is what you need to know before you then divide 12/4 = 3 and then 28/4 = 7 to reduce 12/28 to 3/7. It’s smarter, faster, and more fun to use the trick. Multiplying fractions without the trick is a Filter Late operation, which is just dumb and slow.

Here are some other examples of the Filter Early pattern’s funnier (unless you’re the victim of it), sinister antipattern, Filter Late. You shouldn’t do these things:
  • Drop a dozen brass needles into a haystack, shuffle the haystack, and then try to retrieve the needles. (Why did I specifically choose brass? Two reasons. Can you guess?)
  • Pack everything you own into boxes, hire a moving company to move them to a new home, and then, after moving into your new home, determine that 80% of your belongings are junk that should be thrown away.
  • Return thousands of rows to the browser, even though the user only wants one or two.
  • To add further insult to returning thousands of rows to the browser, return the rows in some useless order. Make the user click on an icon that takes time to sort those rows into an order that will allow him to figure out which one or two he actually wanted.
  • Execute a database join operation in a middle-tier application instead of the database. I’m talking about the Java application that fetches 100,000 rows from table A and 350,000 rows from table B, and then joins the two result sets in a for loop, in an operation that makes 100,000 comparisons to figure out that the result set of the join contains two rows, which the database could have told you much more efficiently.
  • Slog row-by-row through a multimillion-row table looking for the four rows you need, instead of using an index scan to compute the addresses of those four rows and then access them directly.
Converting a Filter Late application into a Filter Early application can make performance unbelievably better.

One of my favorite features of the Oracle Exadata machine is that it applies the Filter Early principle where a lot of people would have never thought to try it. It filters query results in the storage server instead of the database server. Before Exadata, the Oracle Database passed disk blocks (which contain some rows you do need, but also some rows you don’t) from the storage server to the database. Exadata passes only the rows you need back to the database server (Chris Antognini explains).

How many Filter Early and Filter Late examples do you know?

Thursday, April 29, 2010

The Ramp

I love stories about performance problems. Recently, my friend Debra Lilley sent me this one:
I went to see a very large publishing company about 6 months after they went live. I asked them what their biggest issue was, and they told me querying in GL was very slow, and I was able to fix quite easily. (There was a very simple concatenated index trick for the Chart of Accounts segments that people just never used.) Then I asked if there was anything else. The manager said no but the clerk who sat behind him said, “I have a problem.” His manager seemed embarrassed, but when I pressed him, the clerk continued, “Every day I throw away reams of paper from our invoice listing.”

I asked to look at the request, which ran a simple listing of all invoices entered at a scheduled time each day. I opened up the schedule screen and there was a tick box to “Increment date on each run.” This was not ticked, and they were running the report from day 1, every day. When they accepted the system at go live there was no issue. I think all system implementations should include a 3- or 6-month review. Regardless of how good the implementers are, their setup is based on the information known at the time. In production, that information (volumes, etc.) often changes, and when it does, it can affect your decisions.
My friends Connie Smith and Lloyd Williams call this performance antipattern The Ramp. With the ramp, processing duration increases as the system is used. This invoicing system exhibited ramp behavior, because every invoicing process execution would take just a little bit longer and print just a few more pages than the prior execution did.

The problem of the ramp reminds me of a joke I heard when I was young. A boy, one who is athletically very talented but not too bright, takes on a job as a stripe painter for the highway department. The department gives him bucket of paint and a brush and drives him out to the highway he’s supposed to paint. His first day on the job, he paints a stripe almost seven miles long. This is an utterly stunning feat, for no one previously had ever painted more than five miles in a day. The department was ecstatic. Apparently, this boy’s true calling was to paint roadways.

The excitement abated a little bit on the second day, when the boy painted only five miles of highway. But still, five miles is the best that anyone had ever done before him. But on the third day, the distance dropped to two miles, and on the fourth day, it fell to less than one mile.

The department managers were gravely concerned, especially after having been so excited on the first couple of days. So they had a driver go out to fetch the boy, to bring him back to the office to explain why his productivity had been so outstanding at first but had then declined so horribly.

The reason was easy to understand, the boy explained. Every day he painted, he kept getting farther and farther away from where he had set his paint bucket on the first day.

I’ve known people who’ve written linked list insertion algorithms this way. Joel Spolsky has written about string library functions in C that work this way. I’ve seen people write joins in SQL that work this way. And Debra’s publishing company ran their invoices this way.

When you have the ramp problem, individual response times increase linearly. ...Which is bad. But overall response time—through the history of using such an application—varies in proportion to the square of the number of items being processed. ...Which is super-duper bad.

Imagine, in the invoicing problem that Debra solved, that the system had been processing just one invoice per day and that each invoice is only one page long. Given that she was at a “very large publishing company,” it’s certain that the volume was greater than this, but for the sake of simplifying my argument, let’s assume that there was just one new invoice each day. Then, with the “Increment date on each run” box left unchecked, there would be one invoice to print on day 1, two on day 2, etc. On any day n, there would be n invoices to print.

Obviously, the response time on any given day n would thus be n times longer than it needed to be. At the end of the first year of operation with the new application, an invoice would take 365 times longer to print than on the first day of the year.

But the pain each day of invoice generation is not all there is to the problem. The original concern was expressed in terms of all the paper that was wasted. That paper waste is important, not just because of the environmental impact of unnecessary paper consumption, but also because of all the computing power expended over the operational history of the application required to generate those pages. That includes the resources (the electrical power, the CPU cycles, the memory, the disk and network I/Os, etc.) that could have been put to better use doing something else.

In the grossly over-simplified invoicing system I’ve asked you to imagine (which creates only one invoice per day), the total number of pages printed as of the end of day n is 1 + 2 + ... + n, which is n(n + 1)/2. All but n of those pages are unnecessary. Thus the total number of wasted pages that will have been printed by the end of day n is n(n + 1)/2 – n, which is n(n – 1)/2, or (n2n)/2. The number of invoices that should never been printed is proportional to the square of the number of days using the application.

To get a sense for what that means, think about this (remember, all these points refer to a grossly over-simplified system that creates only one invoice per day):
  • By the end of the first month, you'll have printed 465 pages when you only needed 30. That’s 435 unnecessary pages.
  • But by the end of the first year, you’ll have printed 66,795 pages instead of 365. That’s 66,430 unnecessary pages. It’s 27 unnecessary 2,500-page boxes of paper.
  • And by the end of the fifth year, you’ll have used 668 boxes of paper to print 1,668,508 pages instead of using just one box to print 1,826 pages. The picture below shows how tremendously wasteful this is.

When total effort varies as the square of something (like the number of items to process, or the number of days you’ve been using an application), it’s bad, bad news for efficiency. It means that every time your something doubles, your performance (time, materials consumption, etc.) will degrade by a factor of four. Every time your something increases by a factor of ten, your performance will degrade by a factor of a hundred. When your something increases a hundred fold, performance will degrade by a factor of 10,000.

Algorithm analysts characterize algorithms that behave this way as O(n2), pronounced “big-oh of n-squared.” O(n2) performance is no way to live. The good news is that you can usually break yourself out of a O(n2) regime. Sometimes, as Debra’s story illustrates, the solution isn’t even technical: she solved her client’s problem by using an option designed into the end-user interface.

No matter where the problem is—whether it’s problem with use, setup, implementation, design, or concept—it’s worth significant time and effort to find the O(n2) problems in your system and eliminate them. Whenever you need reassurance of that idea, just glance again at the image of the paper boxes shown here.

And by the way, do you remember my post about “Just go look at it?” Tally one for Debra, for the win.

Friday, March 5, 2010

On "Is a computer science degree a good goal?"

Dan Fink's "Is a computer science degree a good goal?" has gotten my wheels going. I think it's important to note this:
Computer Science ≠ Information Technology
Not only are these two disciplines not equal, neither is a subset of the other.

One of my most memorable culture shocks coming out of school into the Oracle domain was how many people didn't understand the difference between computer science, which is a specialized branch of mathematics, and information technology, which is a specialized branch of business administration. They both deal with computers (the IT major more than the CS one, actually), so of course there's risk that people will miss the distinction.

Over dinner Friday night with some of my friends from Percona, we touched on one of the problems. It's difficult for a technical major in school to explain even to his family and friends back home what he's studying. I remember saying once during my senior year as a math major, "I haven't seen a number bigger than 1 since I was a sophomore." I heard a new one tonight: "I got to the level where the only numbers in my math books were the page numbers."

It's difficult for people who don't study computer science to understand who you are or how the min/max kd-trees and deterministic finite automata and predicate calculus and closures that you're studying are different from the COBOL and SQL and MTBFs and ITIL that the IT majors are studying. It's easy to see why laypeople don't understand how these sets of topics arrange into distinctly different categories. What continually surprises me is how often even IT specialists don't understand the distinction. I guess even the computer science graduates soften that distinction when they take jobs doing tasks (to make a living, of course) that will be automated within ten years by other computer scientist graduates.

I agree with Dan and the comments from Tim, Robyn, Noons, Gary, and David about where the IT career path is ultimately headed in the general case. What I don't believe is that the only career path for computer scientists and mathematicians is IT. It's certainly not the only career path for the ones who can actually create things.

I believe that college (by which I mean "University" in the European sense) is a place where the most valuable skill you learn is how to learn, and that, no matter what your major, as long as you work hard and apply yourself to overcoming the difficult challenges, there will be things in this world for you to do to earn your way.

I really hope that the net effect of a depressed, broken, and downward-trending IT industry is not that it further discourages kids from engaging in math and computer science studies in school. But I don't want for so many of our kids today who'll be our adults of tomorrow to become just compartmentalized, highly specialized robots with devastatingly good skills at things that nobody's really willing to pay good money for. I think that the successful human of the future will need to be able to invent, design, create, empathize, teach, see (really see), listen (not just hear), learn, adapt, and solve.

...Just exactly like the successful human of the past.

Wednesday, March 3, 2010

RobB's Question about M/M/m

Today, user RobB left a comment on my recent blog post, asking this:
I have some doubts on how valid an M/M/m model is in a typical database performance scenario. Taking the example from the wait chapter of your book where you have a 0.49 second (service_time) query that you want to perform in less than a second, 95% percent of the time. The most important point here is the assumption of an exponential distribution for service_time immediately states that about 13% of the queries will take more than 2X(Average Service Time), and going the other way most likely to take 0 seconds. From just this assumption only, it is immediately clear that it is impossible to meet the design criteria without looking at anything else. From your article and link to the Kendall notation, wouldn’t an M/D/m model be more appropriate when looking at something like SQL query response time?? Something like M/M/m seems more suited to queueing at the supermarket, for example, and probably many other ‘human interactive’ scenarios compared to a single sub-component of an IT system.
Here’s my answer, part 1.


First, I believe your observation about the book example is correct. It is correct that if service times are exponentially distributed, then about 13% (13.5335%, more precisely) of those times will be 2S, where S is the mean service time. So in the problem I stated, it would be impossible to achieve subsecond response time in more than about 86% of executions, even if there were no competing workload at all. You’re right: you don't need a complicated model to figure that out. You could get that straight from the CDF of the exponential distribution.

However, I think the end of the example provides significant value, where it demonstrates how to use the M/M/m model to prove that you're not going to be able to meet your design criteria unless you can work the value of S down to .103 seconds or less (Optimizing Oracle Performance Fig 9-26, p277). I’ve seen lots of people argue, “You need to tune that task,” but until M/M/m, I had never seen anyone be able to say what the necessary service time goal was, which of course varies as a function of the anticipated arrival rate. A numerical goal is what you need when you have developers who want to know when they’re finished writing code.

With regard to whether real-life service times are really exponentially distributed, you’ve got me wondering now, myself. If service times are exponentially distributed, then for any mean service time S, there’s a 9.5% probability that a randomly selected service time will be less than .1S (in Mathematica, CDF[ExponentialDistribution[1/s],.1s] is 0.0951626 if 0.1s > 0). I’ve got to admit that at the moment, I’m baffled as to how this kind of distribution would model any real-life service process, human, IT, or otherwise.

On its face, it seems like a distribution that prohibits service times smaller than a certain minimum value would be a better model (or perhaps, as you suggest, even fixed service times, as in M/D/m). I think I’m missing something right now that I used to know, because I remember thinking about this previously.

I have two anecdotal pieces of evidence to consider.

One, nowhere in my library of books dedicated to the application of queueing theory to modeling computer software performance (that’s more than 6,000 pages, over 14 inches of material) does Kleinrock, Allen, Jain, Gunther, Menascé, et al mention an M/D/m queueing system. That’s no proof that M/D/m is not the right answer, but it’s information that implies that an awful lot of thinking has gone into the application of queueing theory to software applications without anyone deciding that M/D/m is important enough to write about.

Two, I’ve used M/M/m before in modeling a trading system for a huge investment management company. The response time predictions that M/M/m produced were spectacularly accurate. We did macro-level testing only, comparing response times predicted by M/M/m to actual response times measured by Tuxedo. We didn’t check to see whether service times were exponentially distributed, because the model results were consistently within 5% of perfect accuracy.

Neither of these is proof, of course, that M/M/m is superior in routine applicability to M/D/m. One question I want to answer is whether an M/D/m system would provide better or worse performance than a similar M/M/m system. My intuition is leaning in favor of believing that the M/M/m system would give better performance. If that’s true, then M/M/m is an optimistic model compared to M/D/m, which means that if a real-life system is M/D/m and an M/M/m model says it’s not going to meet requirements, then it assuredly won’t.

I did find a paper online by G. J. Franx about M/D/m queueing. Maybe that paper contains an R=f(λ,μ) function that I can use to model an M/D/m system, which would enable me to do the comparison. I’ll look into it.

Then there’s the issue of whether M/M/m or M/D/m is a more appropriate model for a given real circumstance. The answer to that is simple: test your service times to see if they’re exponentially distributed. The Perl code in Optimizing Oracle Performance, pages 248–254 will do that for you.

Monday, February 22, 2010

Thinking Clearly About Performance, revised to include Skew

I’ve just updated the “Thinking Clearly” paper to include an absolutely vital section that was, regrettably, missing from the first revision. It’s a section on the subject of skew.

I hope you enjoy.

Wednesday, February 10, 2010

Thinking Clearly About Performance

I’ve posted a new paper at called “Thinking Clearly About Performance.” It’s a topic I’ll be presenting this year at:
The paper is only 13 pages long, and I think you’ll be pleased with its information density. Here is the table of contents:
  1. An Axiomatic Approach
  2. What is Performance?
  3. Response Time vs Throughput
  4. Percentile Specifications
  5. Problem Diagnosis
  6. The Sequence Diagram
  7. The Profile
  8. Amdahl’s Law
  9. Minimizing Risk
  10. Load
  11. Queueing Delay
  12. The Knee
  13. Relevance of the Knee
  14. Capacity Planning
  15. Random Arrivals
  16. Coherency Delay
  17. Performance Testing
  18. Measuring
  19. Performance is a Feature
As usual, I learned a lot writing it. I hope you’ll find it to be a useful distillation of how performance works.