
Thursday, September 17, 2015

The Fundamental Challenge of Computer System Performance

The fundamental challenge of computer system performance is for your system to have enough power to handle the work you ask it to do. It sounds really simple, but helping people meet this challenge has been the point of my whole career. It has kept me busy for 26 years, and there’s no end in sight.

Capacity and Workload

Our challenge is the relationship between a computer’s capacity and its workload. I think of capacity as an empty box representing a machine’s ability to do work over time. Workload is the work your computer does, in the form of programs that it runs for you, executed over time. Workload is the content that can fill the capacity box.


Capacity Is the One You Can Control, Right?

When the workload gets too close to filling the box, what do you do? Most people’s instinctive reaction is that, well, we need a bigger box. Slow system? Just add power. It sounds so simple, especially since—as “everyone knows”—computers get faster and cheaper every year. We call that the KIWI response: kill it with iron.

KIWI... Why Not?

As welcome as KIWI may feel, KIWI is expensive, and it doesn’t always work. Maybe you don’t have the budget right now to upgrade to a new machine. Upgrades cost more than just the hardware itself: there’s the time and money it takes to set it up, test it, and migrate your applications to it. Your software may cost more to run on faster hardware. What if your system is already the biggest and fastest one they make?

And as weird as it may sound, upgrading to a more powerful computer doesn’t always make your programs run faster. There are classes of performance problems that adding capacity never solves. (Yes, it is possible to predict when that will happen.) KIWI is not always a viable answer.

So, What Can You Do?

Performance is not just about capacity. Though many people overlook them, there are solutions on the workload side of the ledger, too. What if you could make workload smaller without compromising the value of your system?
It is usually possible to make a computer produce all of the useful results that you need without having to do as much work.
You might be able to make a system run faster by making its capacity box bigger. But you might also make it run faster by trimming down the workload inside your existing box. If you trim off only the wasteful stuff, then nobody gets hurt, and everybody wins.

So, how might one go about doing that?

Workload

“Workload” is a compound of two words. It is useful to think about those two words separately.


The amount of work your system does for a given program execution is determined mostly by how that program is written. A lot of programs make their systems do more work than they should. Your load, on the other hand—the number of program executions people request—is determined mostly by your users. Users can waste system capacity, too; for example, by running reports that nobody ever reads.

Both work and load are variables that, with skill, you can manipulate to your benefit. You do it by improving the code in your programs (reducing work), or by improving your business processes (reducing load). I like workload optimizations because they usually save money and work better than capacity increases. Workload optimization can seem like magic.

The Anatomy of Performance

This simple equation explains why a program consumes the time it does:
r = cl        or        response time = call count × call latency
Think of a call as a computer instruction. Call count, then, is the number of instructions that your system executes when you run a program, and call latency is how long each instruction takes. How long you wait for your answer, then—your response time—is the product of your call count and your call latency.

Some fine print: It’s really a little more complicated than this, but actually not that much. Most response times are composed of many different types of calls, all of which have different latencies (we see these in program execution profiles), so the real equation looks like r = c₁l₁ + c₂l₂ + ... + cₙlₙ. But we’ll be fine with r = cl for this article.
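
To make that concrete, here’s a tiny sketch in Python (the call types, counts, and latencies are all invented) showing how a profile’s calls add up to a response time:

    # A toy execution profile: (call type, call count c, latency per call l).
    # All numbers here are invented for illustration.
    profile = [
        ("CPU instruction block", 1000000, 0.000001),  # c1, l1
        ("disk read",             100,     0.005),     # c2, l2
        ("network round trip",    10,      0.002),     # c3, l3
    ]

    # r = c1*l1 + c2*l2 + ... + cn*ln
    r = sum(c * l for _, c, l in profile)
    print("response time r = %.3f seconds" % r)  # 1.520 seconds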

Call count depends on two things: how the code is written, and how often people run that code.
  • How the code is written (work) — If you were programming a robot to shop for you at the grocery store, you could program it to make one trip from home for each item you purchase. Go get bacon. Come home. Go get milk... It would probably be dumb if you did it that way, because the duration of your shopping experience would be dominated by the execution of clearly unnecessary travel instructions, but you’d be surprised at how often people write programs that act like this.
  • How often people run that code (load) — If you wanted your grocery store robot to buy 42 things for you, it would have to execute more instructions than if you wanted to buy only 7. If you found yourself repeatedly discarding spoiled, unused food, you might be able to reduce the number of things you shop for without compromising anything you really need.
Call latency is influenced by two types of delays: queueing delays and coherency delays.
  • Queueing delays — Whenever you request a resource that is already busy servicing other requests, you wait in line. That’s a queueing delay. It’s what happens when your robot tries to drive to the grocery store, but all the roads are clogged with robots that are going to the store to buy one item at a time. Driving to the store takes only 7 minutes, but waiting in traffic costs you another 13 minutes. The more work your robot does, the greater its chances of being delayed by queueing, and the more such delays your robot will inflict upon others as well.
  • Coherency delays — You endure a coherency delay whenever a resource you are using needs to communicate or coordinate with another resource. For example, if your robot’s cashier at the store has to talk with a specific manager or other cashier (who might already be busy with a customer), the checkout process will take longer. The more times your robot goes to the store, the worse your wait will be, and everyone else’s, too.

The Secret

This r = cl thing sure looks like the equation for a line, but because of queueing and coherency delays, the value of l increases when c increases. This causes response time to act not like a line, but instead like a hyperbola.
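
If you want to see the hyperbola for yourself, here’s a little sketch. It models the latency inflation with the M/M/1 queueing formula; the capacity and latency numbers are invented:

    # Why r = c*l bends into a hyperbola: latency l inflates as the call
    # count c pushes the system toward its capacity. The inflation formula
    # below is the M/M/1 queueing form, used purely for illustration.
    l0 = 0.001        # unloaded latency per call, in seconds (invented)
    capacity = 10000  # calls per second the system can absorb (invented)

    def response_time(c):
        utilization = c / capacity      # fraction of capacity consumed
        l = l0 / (1 - utilization)      # latency grows as utilization -> 1
        return c * l                    # r = c * l

    for c in (1000, 5000, 9000, 9900):
        print("c = %5d   r = %8.3f s" % (c, response_time(c)))
    # Going from c = 5000 to c = 9900 (not even double) multiplies r
    # from 10 seconds to 990 seconds.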


Because our brains tend to conceive of our world as linear, nobody expects everyone’s response times to get seven times worse when you’ve added only a little bit of new workload, but that’s the kind of thing that routinely happens with performance. ...And not just computer performance. Banks, highways, restaurants, amusement parks, and grocery-shopping robots all work the same way.

Response times are tremendously sensitive to your call counts, so the secret to great performance is to keep your call counts small. This principle is the basis for perhaps the best and most famous performance optimization advice ever rendered:
The First Rule of Program Optimization: Don’t do it.

The Second Rule of Program Optimization (for experts only!): Don’t do it yet.

The Problem

Keeping call counts small is really, really important. This makes being a vendor of information services difficult, because it is so easy for application users to make call counts grow. They can do it by running more programs, by adding more users, by adding new features or reports, or even just by the routine process of adding more data every day.

Running your application with other applications on the same computer complicates the problem. What happens when all these applications’ peak workloads overlap? It is a problem that Application Service Providers (ASPs), Software as a Service (SaaS) providers, and cloud computing providers must solve.

The Solution

The solution is a process:
  1. Call counts are sacred. They can be difficult to forecast, so you have to measure them continually. Understand that. Hire people who understand it. Hire people who know how to measure and improve the efficiency of your application programs and the systems they reside on.
  2. Give your people time to fix inefficiencies in your code. An inexpensive code fix might return many times the benefit of an expensive hardware upgrade. If you have bought your software from a software vendor, work with them to make sure they are streamlining the code they ship you.
  3. Learn when to say no. Don’t add new features (especially new long-running programs like reports) that are inefficient, that make more calls than necessary. If your users are already creating as much workload as the system can handle, then start prioritizing which workload you will and won’t allow on your system during peak hours.
  4. If you are an information service provider, charge your customers for the amount of work your systems do for them. The economic incentive to build and buy more efficient programs works wonders.

Thursday, June 7, 2012

An Organizational Constraint that Diminishes Software Quality

One of the biggest problems in software performance today occurs when the people who write software are different from the people who are required to solve the performance problems that their software causes. It works like this:
  1. Architects design a system and pass the specification off to the developers.
  2. The developers implement the specs the architects gave them, while the architects move on to design another system.
  3. When the developers are “done” with their phase, they pass the code off to the production operations team. The operators run the system the developers gave them, while the developers move on to write another system.
The process is an assembly line for software: architects specialize in architecture, developers specialize in development, and operators specialize in operating. It sounds like the principle of industrial efficiency taken to its logical conclusion in the software world.


In this waterfall project plan,
architects design systems they never see written,
and developers write systems they never see run.

Sound good? It sounds like how Henry Ford made a lot of money building cars... Isn’t that how they build roads and bridges? So why not?

With software, there’s a horrible problem with this approach. If you’ve ever had to manage a system that was built like this, you know exactly what it is.

The problem is the absence of a feedback loop between actually using the software and building it. It’s a feedback loop that people who design and build software need for their own professional development. Developers who never see their software run don’t learn enough about how to make their software run better. Likewise, architects who never see their systems run have the same problem, only it’s worse, because (1) their involvement is even more abstract, and (2) their feedback loops are even longer.

Who are the performance experts in most Oracle shops these days? Unfortunately, it’s most often the database administrators, not the database developers. It’s the people who operate a system who learn the most about the system’s design and implementation mistakes. That’s unfortunate, because the people who design and write a system have so much more influence over how a system performs than do the people who just operate it.

If you’re an architect or a developer who has never had to support your own software in production, then you’re probably making some of the same mistakes now that you were making five years ago, without even realizing they’re mistakes. On the other hand, if you’re a developer who has to maintain your own software while it’s being operated in production, you’re probably thinking about new ways to make your next software system easier to support.

So, why is software any different than automotive assembly, or roads and bridges? It’s because software design is a process of invention. Almost every time. When is the last time you ever built exactly the same software you built before? No matter how many libraries you’re able to reuse from previous projects, every system you design is different from any system you’ve ever built before. You don’t just stamp out the same stuff over and over.

Software is funny that way, because the cost of copying and distributing it is vanishingly small. When you make great software that everyone in the world needs, you write it once and ship it at practically zero cost to everyone who needs it. Cars and bridges don’t work that way. Mass production and distribution of cars and bridges requires significantly more resources. The thousands of people involved in copying and distributing cars and bridges don’t have to know how to invent or refine cars or bridges to do great work. But with software, since copying and distributing it is so cheap, almost all that’s left is the invention process. And that requires feedback, just like inventing cars and bridges did.

Don’t organize your software project teams so that they’re denied access to this vital feedback loop.

Friday, January 21, 2011

Describing Performance Improvements (Beware of Ratios)

Recently, I received into my Spam folder an ad claiming that a product could “...improve performance 1000%.” Claims in that format have bugged me for a long time, at least as far back as the 1990s, when some of the most popular Oracle “tips & techniques” books of the era used that format a lot to state claims.

Beware of claims worded like that.

Whenever I see “...improve performance 1000%,” I have to do extra work to decode what the author has encoded in his tidy numerical package with a percent-sign bow. The two performance improvement formulas that make sense to me are these:
  1. Improvement = (b – a)/b, where b is the response time of the task before repair, and a is the response time of the task after repair. This formula expresses the proportion (or percentage, if you multiply by 100%) of the original response time that you have eliminated. It can’t be bigger than 1 (or 100%) without invoking reverse time travel.
  2. Improvement = b/a, where b and a are defined exactly as above. This formula expresses how many times faster the after response time is than the before one.
Since 1000% is bigger than 100%, it can’t have been calculated using formula #1. I assume, then, that when someone says “...improve performance 1000%,” he means that b/a = 10, which, expressed as a percentage, is 1000%. What I really want to know, though, is what were b and a? Were they 1000 and 100? 1 and .1? 6 and .4? (...In which case, I would have to search for a new formula #3.) Why won’t you tell me?
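
If it helps, here are both formulas as code (a sketch; the example b and a values are invented):

    # The two sensible "improvement" formulas, with b = before response time
    # and a = after response time.
    def improvement_proportion(b, a):
        return (b - a) / b   # formula 1: fraction of b eliminated; never > 1

    def improvement_factor(b, a):
        return b / a         # formula 2: how many times faster

    b, a = 10.0, 1.0
    print("%.0f%% reduction" % (100 * improvement_proportion(b, a)))  # 90% reduction
    print("%.0fx faster" % improvement_factor(b, a))                  # 10x faster
    # "1000%" only makes sense as formula 2: b/a = 10, i.e., ten times faster.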

Any time you see a ‘%’ character, beware: you’re looking at a ratio. The principal benefit of ratios is also their biggest flaw. A ratio conceals its denominator. That, of course, is exactly what ratios are meant to do—it’s called normalization—but it’s not always good to normalize. Here’s an example. Imagine two SQL queries A and B that return the exact same result set. What’s better: query A, with a 90% hit ratio on the database buffer cache? or query B, with a 99% hit ratio?

Query    Cache hit ratio
A        90%
B        99%

As tempting as it might be to choose the query with the higher cache hit ratio, the correct answer is...
There’s not enough information given in the problem to answer. It could be either A or B, depending on information that has not yet been revealed.
Here’s why. Consider the two distinct situations listed below. Each situation matches the problem statement. For situation 1, the answer is: query B is better. But for situation 2, the answer is: query A is better, because it does far less overall work. Without knowing more about the situation than just the ratio, you can’t answer the question.

Situation 1
Query    Cache lookups    Cache hits    Cache hit ratio
A        100              90            90%
B        100              99            99%

Situation 2
Query    Cache lookups    Cache hits    Cache hit ratio
A        10               9             90%
B        100              99            99%

Because a ratio hides its denominator, it’s insufficient for explaining your performance results to people (unless your aim is intentionally to hide information, which I’ll suggest is not a sustainable success strategy). It is still useful to show a normalized measure of your result, and a ratio is good for that. I didn’t say you shouldn’t use them. I just said they’re insufficient. You need something more.

The best way to think clearly about performance improvements is with the ratio as a parenthetical additional interesting bit of information, as in:
  • I improved response time of T from 10s to .1s (99% reduction).
  • I improved throughput of T from 42t/s to 420t/s (10-fold increase).
There are three critical pieces of information you need to include here: the before measurement (b), the after measurement (a), and the name of the task (here, T) that you made faster. I’ve talked about b and a before, but I’ve slipped this T thing in on you all of a sudden, haven’t I!

Even authors who give you b and a have a nasty habit of leaving off the T, which is far worse even than leaving off the before and after numbers, because it implies that using their magic has improved the performance of every task on the system by exactly the same proportion (either p% or n-fold), which is almost never true. That is because it’s rare for any two tasks on a given system to have “similar” response time profiles (defining similar in the proportional sense). For example, imagine the following two quite dissimilar profiles:

Task A
Response time    Resource
100%             Total
90%              CPU
10%              Disk I/O

Task B
Response time    Resource
100%             Total
90%              Disk I/O
10%              CPU

No single component upgrade can have equal performance improvement effects upon both these tasks. Making CPU processing 2× faster will speed up task A by 45% and task B by 5%. Likewise, making Disk I/O processing 10× faster will speed up task A by 9% and task B by 81%.
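
You can check that arithmetic with a few lines of code (a sketch; the profile fractions come from the tables above):

    # Speed up one component of a response time profile by a factor k, and
    # see how much of the *total* response time you actually save.
    def total_saving(fraction, k):
        # A component that is `fraction` of response time, made k times
        # faster, shrinks to fraction/k; the rest of the profile is untouched.
        return fraction * (1 - 1.0 / k)

    print("Task A, CPU 2x faster:       %.0f%%" % (100 * total_saving(0.90, 2)))   # 45%
    print("Task B, CPU 2x faster:       %.0f%%" % (100 * total_saving(0.10, 2)))   # 5%
    print("Task A, disk I/O 10x faster: %.0f%%" % (100 * total_saving(0.10, 10)))  # 9%
    print("Task B, disk I/O 10x faster: %.0f%%" % (100 * total_saving(0.90, 10)))  # 81%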

For a vendor to claim any noticeable, homogeneous improvement across the board on any computer system containing tasks A and B would be an outright lie.

Thursday, January 13, 2011

New paper "Mastering Performance with Extended SQL Trace"

Happy New Year.

It’s been a busy few weeks. I finally have something tangible to show for it: “Mastering Performance with Extended SQL Trace” is the new paper I’ve written for this year’s RMOUG conference. Think of it as a 15-page update to chapter 5 of Optimizing Oracle Performance.

There’s lots of new detail in there. Some highlights:
  • How to enable and disable traces, even in un-cooperative applications.
  • How to instrument your application so that tracing the right code path during production operation of your application becomes dead simple.
  • How to make that instrumentation highly scalable (think 100,000+ tps).
  • How timestamps since 10.2 allow you to know your recursive call relationships without guessing.
  • How to create response time profiles for calls and groups of calls, with examples.
  • Why you don’t want to be on Oracle 11g prior to 11.2.0.2.0.
I hope you’ll be able to make productive use of it.

Sunday, September 26, 2010

My Actual OTN Interview

And now, the actual OTN interview (9:11) is online. Thank you, Justin; it was a lot of fun. And thank you to Oracle Corporation for another great show. It's an ever-growing world we work in, and I'm thrilled to be a part of it.

Thursday, September 2, 2010

My OTN Interview at OOW2010 (which hasn’t happened yet)

Yesterday, Justin Kestelyn from Oracle Technology Network sent me a note specifying some logistics for our OTN interview together called “Coding for Performance, with Cary Millsap.” It will take place at Oracle OpenWorld on Monday, September 20 at 11:45am Pacific Time.

One of Justin’s requests in his note was, “Topics: Please suggest 5 topics for discussion?” So, I thought for a couple of minutes about questions I’d like him to ask me, and I wrote down the first one. And then I thought to myself that I might as well write down the answer I would hope to say to this; maybe it’ll help me remember everything I want to say. Then I wrote another question, and the answer just flowed, and then another, and another. Fifteen minutes later, I had the whole thing written out.

I told Justin all this, and we agreed that it would be fun to post the whole interview here on my blog, before it ever happened. And then during the actual interview, we’ll see what actually happens. It’ll all be in Justin’s hands by then.

So, here we go. Justin Kestelyn’s interview, “Coding for Performance, with Cary Millsap.” Which hasn’t happened yet.

◆ ◆ ◆

Justin: Hi, Cary. Welcome to the show, etc., etc.

Cary: Hi, Justin. It’s great to be here. Thank you for having me, etc., etc.


Justin: So tell me, ... What is the most important thing to know about performance?

Cary: Performance is all about code path. There are only three ways that a program can consume your time. There’s (1) the actual execution of your program’s instructions. There’s (2) queueing delay, which is what you get when you visit a resource that’s busy serving someone else (CPU, disk, network, etc.). And there’s (3) coherency delay, which is when you await some other process’s permission to execute your next step. The code you’re running controls all three of those ways you can spend time. So understanding performance is all about understanding code, whether it’s Java or PHP or C# that you wrote, or the C code that the Oracle Database kernel developers have written for you.


Justin: Is tuning SQL or PL/SQL any different from tuning Java or PHP or C#?

Cary: The tools are a little different, but the fundamentals are exactly the same. You find out which code path in your application is consuming your time, and then you go after it. The best thing to do is figure out a way to execute that code path less often (because the fastest way to do anything is to not do it at all). The next best thing to do is try to figure out a way to make any instructions that can’t be eliminated, faster. That’s the whole trick.


Justin: You make it sound easy.

Cary: It usually is easy once you can collect the data you need to guide you. ...Once you know how to get the system to tell you where it’s spending your time. People make it hard on themselves anytime they try to use performance data that includes information about anything other than the specific user experience they’re trying to fix. Like when they try to fix the performance of some click on a web form by looking at CPU utilization data on their application server or their database server.

Another thing that makes it really hard is the design of the application. More tiers means more complexity when it comes time to diagnose performance problems. And some User Interface designs are just guaranteed to create performance problems. My presentation here called “Messed-Up Apps” is a showcase of a few of those kinds of designs. The message there is that performance is something that has to be designed into an application from the start, like any other feature. Performance is not something you can paint on at the end.


Justin: What can developers do to maximize the performance of the applications they write?

Cary: The most important thing is to remember a couple of key ideas. First, Barry Boehm showed that the cost of repairing defects increases hyperbolically the later you find them in your development and deployment life cycle. That’s true for performance defects just like it is for functional defects. Second, what Donald Knuth wrote 40 years ago is still true today: when developers try to guess where their code is slow, they do an awful job. Even great developers, when they profile the response time of their code, are often surprised at where that code is spending their (or their users’) time. So, profiling early in the software development life cycle is vital.

Next, it’s important to test. Not just functional requirements, but test performance requirements, too. Finally, it’s important to realize that there’s no way that your testing can catch every performance problem that can go wrong, so it’s important to make your application code easy to diagnose and repair in production. You do that with good instrumentation so your production system managers can profile in production when they need to, just like the developers do on the development and test systems.


Justin: How do you—a developer—profile your code?

Cary: Every development language has profiling tools that go with it. They’re tools that you can point at your application when it runs to show exactly where every smidgen of response time is being consumed within that code. The first profiler I was ever aware of is the -pg flag on C compilers. You compile your code with gcc -pg, and then after you run your program, you can use gprof to profile where your time went. Java has profilers, PHP has them, Perl, C#, C++, all of them.

Even the Oracle Database has a profiling capability, but they don’t call it “profiling” (that name means something else in the Oracle documentation). The extended SQL trace data that Oracle emits when you do the right DBMS_MONITOR.SESSION_TRACE_ENABLE call is a written record of where every bit of your response time went. That’s “profiling,” in the computer science sense of the word. Those files have been the basis of my career as a performance analyst and software tools author for the past 20 years or so.
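
For the curious, here’s the shape of that workflow in Python, whose standard library ships a profiler called cProfile (the profiled function below is invented; the tooling calls are real):

    import cProfile
    import pstats

    def slow_shopping_trip():
        # Stand-in for application code; imagine each iteration is one
        # needless "trip to the store."
        total = 0
        for item in range(10000):
            total += sum(range(100))   # deliberately wasteful work
        return total

    # Run the code under the profiler and write timing data to a file.
    cProfile.run("slow_shopping_trip()", "profile.out")

    # Show where the response time actually went, worst offenders first.
    pstats.Stats("profile.out").sort_stats("cumulative").print_stats(5)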


Justin: Tell us a little bit about your company, Method R. You founded it a couple of years ago?

Cary: Yes, I started a new company called Method R Corporation in April 2008. We’ve had a great time writing tools for people and performing services (teaching and consulting) to help people solve their performance problems. Our core business is building tools that help people do for themselves what we know how to do with performance. The trace data that Oracle emits is very complicated, and we have software tools that make it easy to get what you need from those trace files.

We also have an extension for SQL Developer that makes it easy to get the trace data itself, while you’re developing a new SQL- or PL/SQL-based application. We’re also working on a number of very large development projects for customers in which we’re writing complex application code that has to scale to outrageous workloads. We’re always looking for ways we can help people.


Justin: Well, that’s all the time we have today. I really enjoyed talking with you, etc., etc.

Cary: Oh, it was my pleasure. Thank you for having me, and good luck with the rest of the Show.

◆ ◆ ◆

...Which hasn’t happened yet. ☺

I hope you enjoyed, and I’ll look forward to seeing you at Oracle OpenWorld 2010.

I’ll be presenting “Messed-Up Apps: a study of performance antipatterns” at the ODTUG User Group Forum on Sunday at 3:00pm, and “Thinking Clearly about Performance” at the main event on Tuesday at 12:30pm. See you there!

Friday, May 14, 2010

Filter Early

Yesterday, my 12-year-old son Alex was excited to tell me that he had learned a new trick that made it easier to multiply fractions. Here’s the trick: before you multiply, cancel any factor that appears in both a numerator and a denominator. In 4/7 × 3/4, for example, the 4s cancel, and the answer is just 3/7.

The neat thing for me is that this week I’m working on my slides for ODTUG Kaleidoscope 2010 (well, actually, for the Performance Symposium that’ll occur on Sunday 27 June), and I need more examples to help encourage application developers to write code that will Filter Early. This “trick” (it’s actually an application of the Multiplicative Inverse Law) is a good example of the Filter Early principle.

Filter Early is all about throwing away data that you don’t need, as soon as you can know that you don’t need it. That’s what this trick of arithmetic is all about. Without the trick, you would do more work to multiply 4/7 × 3/4 = (4 × 3)/(7 × 4) = 12/28, and then you would do even more work again to figure out that 12 and 28 both share a factor of 4, which is what you need to know before you then divide 12/4 = 3 and then 28/4 = 7 to reduce 12/28 to 3/7. It’s smarter, faster, and more fun to use the trick. Multiplying fractions without the trick is a Filter Late operation, which is just dumb and slow.
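
Here’s the trick as code, filter-early versus filter-late (a sketch using Python’s math.gcd):

    # Cancel shared factors before you multiply (filter early) instead of
    # multiplying first and reducing afterward (filter late).
    from math import gcd

    def multiply_filter_late(n1, d1, n2, d2):
        n, d = n1 * n2, d1 * d2     # 4*3 = 12 and 7*4 = 28: bigger numbers...
        g = gcd(n, d)               # ...then extra work to find the factor 4
        return n // g, d // g

    def multiply_filter_early(n1, d1, n2, d2):
        g1 = gcd(n1, d2)            # cancel across the fractions first
        g2 = gcd(n2, d1)
        return (n1 // g1) * (n2 // g2), (d1 // g2) * (d2 // g1)

    print(multiply_filter_late(4, 7, 3, 4))   # (3, 7), the slow way
    print(multiply_filter_early(4, 7, 3, 4))  # (3, 7), smaller intermediate results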

Here are some other examples of the Filter Early pattern’s funnier (unless you’re the victim of it), sinister antipattern, Filter Late. You shouldn’t do these things:
  • Drop a dozen brass needles into a haystack, shuffle the haystack, and then try to retrieve the needles. (Why did I specifically choose brass? Two reasons. Can you guess?)
  • Pack everything you own into boxes, hire a moving company to move them to a new home, and then, after moving into your new home, determine that 80% of your belongings are junk that should be thrown away.
  • Return thousands of rows to the browser, even though the user only wants one or two.
  • To add further insult to returning thousands of rows to the browser, return the rows in some useless order. Make the user click on an icon that takes time to sort those rows into an order that will allow him to figure out which one or two he actually wanted.
  • Execute a database join operation in a middle-tier application instead of the database. I’m talking about the Java application that fetches 100,000 rows from table A and 350,000 rows from table B, and then joins the two result sets in a for loop, in an operation that makes 100,000 comparisons to figure out that the result set of the join contains two rows, which the database could have told you much more efficiently (see the sketch after this list).
  • Slog row-by-row through a multimillion-row table looking for the four rows you need, instead of using an index scan to compute the addresses of those four rows and then access them directly.
Converting a Filter Late application into a Filter Early application can make performance unbelievably better.
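
In code, the rows-to-the-browser and middle-tier join examples boil down to where the filtering happens. Here’s a sketch (the query helper and the orders table are invented):

    # Filter Late: drag every row into the application, then keep the two
    # rows the user actually wanted.
    def orders_filter_late(query, customer_id):
        all_rows = query("SELECT * FROM orders")   # thousands of rows cross the wire
        return [r for r in all_rows if r["customer_id"] == customer_id]

    # Filter Early: let the database discard the unneeded rows, where
    # discarding them is cheapest.
    def orders_filter_early(query, customer_id):
        return query("SELECT * FROM orders WHERE customer_id = :id "
                     "ORDER BY order_date", id=customer_id)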

One of my favorite features of the Oracle Exadata machine is that it applies the Filter Early principle where a lot of people would have never thought to try it. It filters query results in the storage server instead of the database server. Before Exadata, the Oracle Database passed disk blocks (which contain some rows you do need, but also some rows you don’t) from the storage server to the database. Exadata passes only the rows you need back to the database server (Chris Antognini explains).

How many Filter Early and Filter Late examples do you know?

Friday, April 30, 2010

The Ramp

I love stories about performance problems. Recently, my friend Debra Lilley sent me this one:
I went to see a very large publishing company about 6 months after they went live. I asked them what their biggest issue was, and they told me querying in GL was very slow, which I was able to fix quite easily. (There was a very simple concatenated index trick for the Chart of Accounts segments that people just never used.) Then I asked if there was anything else. The manager said no, but the clerk who sat behind him said, “I have a problem.” His manager seemed embarrassed, but when I pressed him, the clerk continued, “Every day I throw away reams of paper from our invoice listing.”

I asked to look at the request, which ran a simple listing of all invoices entered at a scheduled time each day. I opened up the schedule screen and there was a tick box to “Increment date on each run.” This was not ticked, and they were running the report from day 1, every day. When they accepted the system at go live there was no issue. I think all system implementations should include a 3- or 6-month review. Regardless of how good the implementers are, their setup is based on the information known at the time. In production, that information (volumes, etc.) often changes, and when it does, it can affect your decisions.
My friends Connie Smith and Lloyd Williams call this performance antipattern The Ramp. With the ramp, processing duration increases as the system is used. This invoicing system exhibited ramp behavior, because every invoicing process execution would take just a little bit longer and print just a few more pages than the prior execution did.

The problem of the ramp reminds me of a joke I heard when I was young. A boy, one who is athletically very talented but not too bright, takes on a job as a stripe painter for the highway department. The department gives him a bucket of paint and a brush and drives him out to the highway he’s supposed to paint. His first day on the job, he paints a stripe almost seven miles long. This is an utterly stunning feat, for no one previously had ever painted more than five miles in a day. The department was ecstatic. Apparently, this boy’s true calling was to paint roadways.

The excitement abated a little bit on the second day, when the boy painted only five miles of highway. But still, five miles is the best that anyone had ever done before him. But on the third day, the distance dropped to two miles, and on the fourth day, it fell to less than one mile.

The department managers were gravely concerned, especially after having been so excited on the first couple of days. So they had a driver go out to fetch the boy, to bring him back to the office to explain why his productivity had been so outstanding at first but had then declined so horribly.

The reason was easy to understand, the boy explained. Every day he painted, he kept getting farther and farther away from where he had set his paint bucket on the first day.

I’ve known people who’ve written linked list insertion algorithms this way. Joel Spolsky has written about string library functions in C that work this way. I’ve seen people write joins in SQL that work this way. And Debra’s publishing company ran their invoices this way.

When you have the ramp problem, individual response times increase linearly. ...Which is bad. But overall response time—through the history of using such an application—varies in proportion to the square of the number of items being processed. ...Which is super-duper bad.

Imagine, in the invoicing problem that Debra solved, that the system had been processing just one invoice per day and that each invoice is only one page long. Given that she was at a “very large publishing company,” it’s certain that the volume was greater than this, but for the sake of simplifying my argument, let’s assume that there was just one new invoice each day. Then, with the “Increment date on each run” box left unchecked, there would be one invoice to print on day 1, two on day 2, etc. On any day n, there would be n invoices to print.

Obviously, the response time on any given day n would thus be n times longer than it needed to be. At the end of the first year of operation with the new application, an invoice would take 365 times longer to print than on the first day of the year.

But the pain each day of invoice generation is not all there is to the problem. The original concern was expressed in terms of all the paper that was wasted. That paper waste is important, not just because of the environmental impact of unnecessary paper consumption, but also because of all the computing power expended over the operational history of the application required to generate those pages. That includes the resources (the electrical power, the CPU cycles, the memory, the disk and network I/Os, etc.) that could have been put to better use doing something else.

In the grossly over-simplified invoicing system I’ve asked you to imagine (which creates only one invoice per day), the total number of pages printed as of the end of day n is 1 + 2 + ... + n, which is n(n + 1)/2. All but n of those pages are unnecessary. Thus the total number of wasted pages that will have been printed by the end of day n is n(n + 1)/2 – n, which is n(n – 1)/2, or (n² – n)/2. The number of invoices that should never have been printed is proportional to the square of the number of days using the application. (The sketch after the list below checks the arithmetic.)

To get a sense for what that means, think about this (remember, all these points refer to a grossly over-simplified system that creates only one invoice per day):
  • By the end of the first month, you'll have printed 465 pages when you only needed 30. That’s 435 unnecessary pages.
  • But by the end of the first year, you’ll have printed 66,795 pages instead of 365. That’s 66,430 unnecessary pages. It’s 27 unnecessary 2,500-page boxes of paper.
  • And by the end of the fifth year, you’ll have used 668 boxes of paper to print 1,668,051 pages instead of using just one box to print 1,826 pages. The picture below shows how tremendously wasteful this is.
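
Here’s the sketch that checks the arithmetic:

    # Checking the page arithmetic for the one-invoice-per-day ramp.
    def pages_printed(n):
        return n * (n + 1) // 2    # total pages after n days: 1 + 2 + ... + n

    for days, label in ((30, "one month"), (365, "one year"), (1826, "five years")):
        total = pages_printed(days)
        wasted = total - days      # only `days` pages were ever necessary
        boxes = -(-total // 2500)  # ceiling division: 2,500-page boxes used
        print("%s: %s pages printed, %s wasted, %s boxes"
              % (label, format(total, ","), format(wasted, ","), format(boxes, ",")))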

When total effort varies as the square of something (like the number of items to process, or the number of days you’ve been using an application), it’s bad, bad news for efficiency. It means that every time your something doubles, your performance (time, materials consumption, etc.) will degrade by a factor of four. Every time your something increases by a factor of ten, your performance will degrade by a factor of a hundred. When your something increases a hundred fold, performance will degrade by a factor of 10,000.

Algorithm analysts characterize algorithms that behave this way as O(n²), pronounced “big-oh of n-squared.” O(n²) performance is no way to live. The good news is that you can usually break yourself out of an O(n²) regime. Sometimes, as Debra’s story illustrates, the solution isn’t even technical: she solved her client’s problem by using an option designed into the end-user interface.

No matter where the problem is—whether it’s a problem with use, setup, implementation, design, or concept—it’s worth significant time and effort to find the O(n²) problems in your system and eliminate them. Whenever you need reassurance of that idea, just glance again at the image of the paper boxes shown here.

And by the way, do you remember my post about “Just go look at it?” Tally one for Debra, for the win.

Wednesday, March 3, 2010

RobB's Question about M/M/m

Today, user RobB left a comment on my recent blog post, asking this:
I have some doubts on how valid an M/M/m model is in a typical database performance scenario. Taking the example from the wait chapter of your book where you have a 0.49 second (service_time) query that you want to perform in less than a second, 95% of the time. The most important point here is the assumption of an exponential distribution for service_time immediately states that about 13% of the queries will take more than 2X(Average Service Time), and going the other way most likely to take 0 seconds. From just this assumption only, it is immediately clear that it is impossible to meet the design criteria without looking at anything else. From your article and link to the Kendall notation, wouldn’t an M/D/m model be more appropriate when looking at something like SQL query response time?? Something like M/M/m seems more suited to queueing at the supermarket, for example, and probably many other ‘human interactive’ scenarios compared to a single sub-component of an IT system.
Here’s my answer, part 1.

RobB,

First, I believe your observation about the book example is correct. It is correct that if service times are exponentially distributed, then about 13% (13.5335%, more precisely) of those times will exceed 2S, where S is the mean service time. So in the problem I stated, it would be impossible to achieve subsecond response time in more than about 86% of executions, even if there were no competing workload at all. You’re right: you don't need a complicated model to figure that out. You could get that straight from the CDF of the exponential distribution.

However, I think the end of the example provides significant value, where it demonstrates how to use the M/M/m model to prove that you're not going to be able to meet your design criteria unless you can work the value of S down to .103 seconds or less (Optimizing Oracle Performance Fig 9-26, p277). I’ve seen lots of people argue, “You need to tune that task,” but until M/M/m, I had never seen anyone be able to say what the necessary service time goal was, which of course varies as a function of the anticipated arrival rate. A numerical goal is what you need when you have developers who want to know when they’re finished writing code.
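
For anyone who wants to experiment with this, here’s a minimal M/M/m calculator for mean response time (a sketch; percentile computations like the book’s take more machinery, and the workload numbers at the bottom are invented):

    from math import factorial

    def erlang_c(m, a):
        # Probability that an arriving request must queue in an M/M/m
        # system with m servers and offered load a = lambda * S (Erlangs).
        rho = a / m
        numer = (a ** m / factorial(m)) / (1 - rho)
        denom = sum(a ** k / factorial(k) for k in range(m)) + numer
        return numer / denom

    def mmm_mean_response_time(lam, S, m):
        # Mean response time R for arrival rate lam (requests/second),
        # mean service time S (seconds), and m parallel servers.
        a = lam * S
        rho = a / m
        assert rho < 1, "arrival rate exceeds capacity; queue grows without bound"
        Wq = erlang_c(m, a) * S / (m * (1 - rho))   # mean queueing delay
        return S + Wq

    # Invented example: 16 servers, S = 0.49 s, 20 arrivals per second.
    print("R = %.3f s" % mmm_mean_response_time(20, 0.49, 16))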

With regard to whether real-life service times are really exponentially distributed, you’ve got me wondering now, myself. If service times are exponentially distributed, then for any mean service time S, there’s a 9.5% probability that a randomly selected service time will be less than .1S (in Mathematica, CDF[ExponentialDistribution[1/s], .1s] evaluates to 0.0951626 for s > 0). I’ve got to admit that at the moment, I’m baffled as to how this kind of distribution would model any real-life service process, human, IT, or otherwise.
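
Both figures fall straight out of the exponential CDF, 1 − e^(−t/S); here’s a quick check:

    from math import exp

    # For exponentially distributed service times with mean S, these
    # probabilities hold for every S, because the CDF is 1 - exp(-t/S).
    p_over_2S   = exp(-2)        # ~0.1353: about 13.5% of times exceed 2S
    p_under_01S = 1 - exp(-0.1)  # ~0.0952: about 9.5% fall below 0.1*S
    print("%.4f %.4f" % (p_over_2S, p_under_01S))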

On its face, it seems like a distribution that prohibits service times smaller than a certain minimum value would be a better model (or perhaps, as you suggest, even fixed service times, as in M/D/m). I think I’m missing something right now that I used to know, because I remember thinking about this previously.

I have two anecdotal pieces of evidence to consider.

One, nowhere in my library of books dedicated to the application of queueing theory to modeling computer software performance (that’s more than 6,000 pages, over 14 inches of material) do Kleinrock, Allen, Jain, Gunther, Menascé, et al. mention an M/D/m queueing system. That’s no proof that M/D/m is not the right answer, but it’s information that implies that an awful lot of thinking has gone into the application of queueing theory to software applications without anyone deciding that M/D/m is important enough to write about.

Two, I’ve used M/M/m before in modeling a trading system for a huge investment management company. The response time predictions that M/M/m produced were spectacularly accurate. We did macro-level testing only, comparing response times predicted by M/M/m to actual response times measured by Tuxedo. We didn’t check to see whether service times were exponentially distributed, because the model results were consistently within 5% of perfect accuracy.

Neither of these is proof, of course, that M/M/m is superior in routine applicability to M/D/m. One question I want to answer is whether an M/D/m system would provide better or worse performance than a similar M/M/m system. My intuition is leaning in favor of believing that the M/M/m system would give better performance. If that’s true, then M/M/m is an optimistic model compared to M/D/m, which means that if a real-life system is M/D/m and an M/M/m model says it’s not going to meet requirements, then it assuredly won’t.

I did find a paper online by G. J. Franx about M/D/m queueing. Maybe that paper contains an R=f(λ,μ) function that I can use to model an M/D/m system, which would enable me to do the comparison. I’ll look into it.

Then there’s the issue of whether M/M/m or M/D/m is a more appropriate model for a given real circumstance. The answer to that is simple: test your service times to see if they’re exponentially distributed. The Perl code in Optimizing Oracle Performance, pages 248–254 will do that for you.

Monday, February 22, 2010

Thinking Clearly About Performance, revised to include Skew

I’ve just updated the “Thinking Clearly” paper to include an absolutely vital section that was, regrettably, missing from the first revision. It’s a section on the subject of skew.

I hope you enjoy.

Wednesday, February 10, 2010

Thinking Clearly About Performance

I’ve posted a new paper at method-r.com called “Thinking Clearly About Performance.” It’s a topic I’ll be presenting at several events this year.
The paper is only 13 pages long, and I think you’ll be pleased with its information density. Here is the table of contents:
  1. An Axiomatic Approach
  2. What is Performance?
  3. Response Time vs Throughput
  4. Percentile Specifications
  5. Problem Diagnosis
  6. The Sequence Diagram
  7. The Profile
  8. Amdahl’s Law
  9. Minimizing Risk
  10. Load
  11. Queueing Delay
  12. The Knee
  13. Relevance of the Knee
  14. Capacity Planning
  15. Random Arrivals
  16. Coherency Delay
  17. Performance Testing
  18. Measuring
  19. Performance is a Feature
As usual, I learned a lot writing it. I hope you’ll find it to be a useful distillation of how performance works.

Monday, December 21, 2009

My Whole System Is Slow. Now What?

At CMG'09 a couple of weeks ago, I presented "Measuring Response Times of Code on Oracle Systems." The paper for this presentation was a subset of "For Developers: Making Friends with the Oracle Database." In the presentation, I spent a few minutes talking about why to measure response times in Oracle, and then I spent a lot of minutes talking about how. As usual, I focused heavily on the importance of measuring response times of individual business tasks executed by individual end users.

At the end of the talk, a group of people came to the podium to ask questions (always a good sign). The first question was the question that a lot of people ask. It was:
My whole system is slow. That's all my users will tell me. So then, how do I begin to do what you're describing?
Here's the answer:
Ask your users to show you what they're doing. Just go look at it.
The results of this simple advice are routinely spectacular. Just go look at it: I'm surprised whenever someone doesn't think of doing that, but I shouldn't be. That's because I didn't do it either, for the longest time. I had to learn to. And that's the story I want to tell you here.

In the early 1990s, I was a consultant with Oracle Corporation visiting clients with performance problems at a pace of more than 30 per year. Back then, I did Oracle performance work the old fashioned way: I checked everything I knew how to check, and then I fixed everything I knew how to fix. All billable by the hour. (Note: When I was doing it this way, I had not yet been taught by Dave Ensor, who changed me forever.)

On weeks when I was lucky, I'd be finished checking and fixing by sometime Wednesday, leaving a couple of days to find out what people thought of my work. If I were lucky again (that's two "lucky"s now), everyone would be thrilled with the results. I'd get my hug (so to speak), and I'd catch my flight.

But I wasn't always lucky. Some weeks, I wouldn't find anything suspicious in my checking and fixing. Some weeks, I'd find plenty, but still not everyone would be thrilled with the work. Having people be less than thrilled with my work caused pain for me, which motivated me to figure out how to take more control of my consulting engagements, to drive luck out of the equation.

The most important thing I figured out was...
People knew before I came on-site how they were going to measure on Thursday whether they liked the results of my work.
And...
They were willing to tell me on Monday.
All I had to do was be honest, like this:
On the day I'm done working here, I'd imagine you're going to want to run something that will demonstrate whether I accomplished what you were hoping for while I was here. Would you mind telling me about that now? Maybe even showing me?
I could ask that on Monday, and people were glad to tell me. I'd watch the things run and record how long they ran, and then I'd know how to prioritize my time on site. I'd record how long they ran so at the end of my engagement, I'd be able to show very clearly what improvements I had made.

Sometimes, there would be thirty different things that people would expect to measure on Thursday. If I might not have time to fix them all, then I needed to make sure that I knew the priority of the things I was being asked to fix.

That one step alone—knowing on Monday that prioritized list of what tasks needed to be fast by Thursday—drastically reduced my reliance on luck as a success factor in my job at these sites. Knowing that list on Monday is just like when your teacher in school tells you exactly what's going to be on your next test. It allows you to focus your attention on exactly what you need to do to optimize your reward for the week. (Note to fellow education enthusiasts: Please don't interpret this paragraph as my advocating the idea that GPA should be a student's sole—or even dominant—optimization constraint.)

So, what I learned is that the very first step of any good performance optimization method is necessarily this:
1. Identify the task that's the most important to you.
When I say "task," think "program" or "click" or "batch job" if you want to. What I mean is "a useful unit of work that makes sense to the business." ...Something that a business user would show you if you just went and watched her work for a few minutes.

Then comes step two:
2. Measure its response time (R). In detail.
Why is response time so important? Because that's what's important to the person who'll be watching it run on Thursday, assessing whether she thinks you've done a good job or not. That person's going to click and then wait. Happiness will be inversely proportional to how long the wait is. That's it. That's what "performance" means at 99% of sites I've ever visited.

(If you're interested in the other 1% of sites I've visited, they're interested in throughput, which I've written about in another blog post.)

Measuring response time is vital. You must be able to measure response time if you're going to nail that test on Thursday.

The key is to understand that the term response time doesn't even have a definition except in the context of a task. You can't measure response time if you don't first decide what task you're going to measure. In other words, you cannot do step 2 before you do step 1. With Oracle, for example, you can collect ASH data (if you're licensed to use it) or even trace data for a whole bunch of Oracle processes, but you won't have a single response time until you define which tasks buried within that data are the ones you want to extract and pay attention to.

You get that by visiting a user and watching what she does.

There are lots of excuses for not watching your users. Like these...
  • "I don't know my users." I know. But you should. You'd do your job better if you did. And your users would, too.
  • "My users aren't here." I know. They're on the web. They're in Chicago and Singapore and Istanbul, buying plane tickets or baseball caps or stock shares. But if you can't watch at least a simulation of the things those users actually do with the system you help manage, then I can't imagine how you would possibly succeed at providing good performance to them.
  • "I'm supposed to be able to manage performance with my dashboard." I know. I was supposed to have a hover car by the year 2000.
The longer you stay mired in excuses like these, the longer it's going to be before you can get the benefit of my point here. Your users are running something, and whatever that is that they're running is your version of my Thursday test. You can check and fix all you want, but unless you get lucky and fix the exact tooth that's hurting, your efforts aren't going to be perceived as "helpful." Checking and fixing everything you can think of is far less efficient and effective than targeting exactly what your user needs you to target.

Lots of performance analysts (DBAs, developers, architects, sysadmins, and so on) assume that when someone says, "The whole system is slow," it means there must be a single parameter somewhere in the bowels of the system that needs adjustment, and if you can just make that adjustment, everything is going to be ok. It might mean that, but in my experience, the overwhelming majority of cases are not that way. (Pages 25–29 of Optimizing Oracle Performance have more information about this.)

The great thing about measuring response time is that no matter what the problem is, you'll see it. If the program you're watching is poorly written, you'll see it. If some other program is hogging too much of a resource that your program needs, you'll see it. If you have a bad disk controller, you'll see it. If some parameter needs adjusting, you'll see it.

Realize that when a business user says "system," she doesn't mean what you would mean if you said "system." She means that the thing she runs is slow. Look at that thing. Maybe there are seventeen of them. And sure, maybe all seventeen suffer from the same root cause. If that's the case, then fantastic, because fixing the first problem will magically fix the other sixteen, too. If it's not, then fantastic anyway, because now all of them are on your prioritized list of tasks to optimize, and you'll probably surprise yourself how quickly you'll be able to pick them all off when you focus on one task at a time.

Friday, November 20, 2009

Performance Optimization with Global Entry. Or Not?

As I entered the 30-minute "U.S. Citizens" queue for immigration back into the U.S. last week, the helpful "queue manager" handed me a brochure. This is a great place to hand me something to read, because I'm captive for the next 30 minutes as I await my turn with the immigration officer at the Passport Control desk. The brochure said "Roll through Customs faster."

Ok. I'm listening.

Inside the brochure, the first page lays out the main benefits:
  • bypass the passport lines
  • no paper Customs declaration
  • in most major U.S. airports
Well, that's pretty cool. Especially as I'm standing only 5% deep in a queue with a couple hundred people in it. And look, there's a Global Entry kiosk right there with its own special queue, with nobody—nobody!—in it.

If I had this Global Entry thing, I'd have a superpower that would enable me to zap past the couple hundred people in front of me, and get out of the Passport Control queue right now. Fantastic.

So what does this thing cost? It's right there in the brochure:
  1. Apply online at www.globalentry.gov. There is a non-refundable $100 application fee. Membership is valid for five years. That's $20 a year for the queue-bypassing superpower. Not bad. Still listening.
  2. Schedule an in-person interview. Next, I have to book an appointment to meet someone at the airport for a brief interview.
  3. Complete the interview and enrollment. I give my interview, get my photo taken, have my docs verified, and that's it, I'm done.
So, all in all, it doesn't cost too much: a hundred bucks and probably a couple hours one day next month sometime.

What's the benefit of the queue-bypassing superpower? Well, it's clearly going to knock a half-hour off my journey through Passport Control. I immigrate three or four times per year on average, and today's queue is one of the shorter ones I've seen, so that's at least a couple hours per year that I'd save... Wow, that would be spectacular: a couple more hours each year in my family's arms instead of waiting like a lamb at the abattoir to have my passport controlled.

But getting me into my family's arms 30 minutes earlier is not really what happens. The problem is a kind of logic that people I meet get hung up in all the time. When you think about subsystem (or resource) optimization, it looks like your latency savings for the subsystem should go straight to your system's bottom line, but that's often not what happens. That's why I really don't care about subsystem optimization; I care about response time. I could say that a thousand times, but my statement is too abstract to really convey what I mean unless you already know what I mean.

What really happens in the airport story is this: if I had used Global Entry on my recent arrival, it would have saved me only a minute or two. Not half an hour, not even close.

It sounds crazy, doesn't it? How can a service that cuts half an hour off my Passport Control time not get me home at least a half hour earlier?

You'll understand once I show you a sequence diagram of my arrival. Here it is (at right). You can click the image to embiggen it, if you need.

To read this sequence diagram, start at the top. Time flows downward. This sequence diagram shows two competing scenarios. The multicolored bar on the left-hand side represents the timeline of my actual recent arrival at DFW Airport, without using the Global Entry service. The right-hand timeline is what my arrival would have looked like had I been endowed with the Global Entry superpower.

You can see at the very bottom of the timeline on the right that the time I would have saved with Global Entry is minuscule: only a minute or two.

The real problem is easy to see in the diagram: Queue for Baggage Claim is the great equalizer in this system. No matter whether I'm a Global Entrant or not, I'm going to get my baggage when the good people outside with the Day-Glo Orange vests send it up to me. My status in the Global Entry system has absolutely no influence over what time that will occur.

Once I've gotten my baggage, the Global Entry superpower would have again swung into effect, allowing me to pass through the zero-length queue at the Global Entry kiosk instead of waiting behind two families at the Customs queue. And that's the only net benefit I would have received.

Wait: there were only two families in the Customs queue? What about the hundreds of people I was standing behind in the Passport Control queue? Well, many of them were gone already (either they had hand-carry bags only, or their bags had come off earlier than mine). Many others were still awaiting their bags on the Baggage Claim carousel. Because bags trickle out of the baggage claim process, there isn't the huge all-at-once surge of demand at Customs that there is at Passport Control when a plane unloads. So the queues are shorter.

At any rate, there were four queues at Customs, and none of them was longer than three or four families. So the benefit of Global Entry—in exchange for the $100 and the time spent doing the interview—would have been, for me, on that day, only a couple of minutes.

Now, if—if, mind you—I had been able to travel with only carry-on luggage, then Global Entry would have provided me significantly more value. But when I'm returning to the U. S. from abroad, I'm almost never allowed to carry on any bag other than my briefcase. Furthermore, I don't remember ever clearing Passport Control to find my bag waiting for me at Baggage Claim. So the typical benefit to me of enrolling in Global Entry, unfortunately, appears to be only a fraction of the duration required to clear Customs, which in my case is almost always approximately zero.

The problem causing the low value (to me) of the Global Entry program is that the Passport Control resource hides the latency of the Baggage Claim resource. No amount of tuning the Passport Control resource will affect the timing of the Baggage In Hand milestone; the time at which that milestone occurs is entirely independent of the Passport Control resource. And that milestone—as long as it occurs after I queue for Baggage Claim—is a direct determinant of when I can exit the airport. (Gantt or PERT chart optimizers would say that Queue for Baggage Claim is on the critical path.)
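
If you'd rather see that arithmetic than take my word for it, here's a little model of it in Python. The durations are minutes I've made up for illustration (the real ones are in the diagram), but the structure is the point:

# A toy model of clearing the airport (all durations in minutes, invented).
# The key constraint: I can't queue for Customs until BOTH I've cleared
# Passport Control AND my bag has come up at Baggage Claim.

def exit_time(passport_queue, passport_service, bag_arrival,
              customs_queue, customs_service):
    passport_done = passport_queue + passport_service
    bag_in_hand = max(passport_done, bag_arrival)  # the great equalizer
    return bag_in_hand + customs_queue + customs_service

without_ge = exit_time(30, 2, 45, 5, 1)  # a long Passport Control queue
with_ge    = exit_time( 0, 2, 45, 1, 1)  # zero-length queues with Global Entry
print(without_ge - with_ge)              # 4 -- a few minutes, not 30

Shrink passport_queue all you want: as long as bag_arrival dominates, the answer barely moves.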

How could a designer make the airport experience better for the customer? Here are a few ideas:
  • Let me carry on more baggage. This idea would allow me to trot right through Baggage Claim without waiting for my bag. In this environment, the value of Global Entry would be tremendous. Well, nice theory; but allowing more carry-on baggage wouldn't work too well in the aggregate. The overhead bins on my flight were already stuffed to maximum capacity, and we don't need more flight delays induced by passengers who bring more stuff onboard than the cabin can physically accommodate.
  • Improve the latency of the baggage claim process. The sequence diagram shows clearly that this is where the big win is. It's easy to complain about baggage claim, because it's nearly always noticeably slower than we want it to be, and we can't see what's going on down there. Our imaginations inform us that there's all sorts of horrible waste going on.
  • Use latency hiding to mask the pain of the baggage claim process. Put TV sets in the Baggage Claim area, and tune them to something interesting instead of infinite loops of advertising. At CPH, they have a Danish hot dog stand in the baggage claim area. They also have a currency exchange office in there. Excellent latency hiding ideas if you need a snack or some DKK walkin'-around-money.
Latency hiding is a weak substitute for improving the speed of the baggage claim process. The killer app would certainly be to make Baggage Claim faster. Note, however, that just making Baggage Claim a little bit faster wouldn't make the Global Entry program any more valuable. To make Global Entry any more valuable, you'd have to make Baggage Claim fast enough that your bag would be waiting for anyone who cleared the full Passport Control queue.

So, my message today: When you optimize, you must first know your goal. So many people optimize subsystems (resources) that they think are important, but optimizing subsystems is often not a path to optimizing what you really want. At the airport, I really don't give a rip about getting out of the Passport Control queue if it just means I'm going to be dumped earlier into a room where I'll have to wait until an appointed time for my baggage.

Once you know what your real optimization goal is (that's Method R step 1), then the sequence diagram is often all you need to get your breakthrough insight that helps you either (a) solve your problem or (b) understand when there's nothing further that you can really do about it.

Thursday, November 12, 2009

Why We Made Method R

Twenty years ago (well, a month or so more than that), I entered the Oracle ecosystem. I went to work as a consultant for Oracle Corporation in September 1989. Before Oracle, I had been a language designer and compiler developer. I wrote code in lex, yacc, and C for a living. My responsibilities had also included improving other people's C code: making it more reliable, more portable, easier to read, easier to prove, and easier to maintain; and it was my job to teach other people in my department how to do these things themselves. I loved all of these duties.

In 1987, I decided to leave what I loved for a little while, to earn an MBA. Fortunately, at that time, it was possible to earn an MBA in a year. After a year of very difficult work, I had my degree and a new perspective on business. I interviewed with Oracle, and about a week later I had a job with a company that a month prior I had never heard of.

By the mid-1990s, circumstances and my natural gravity had combined to create a career in which I was again a software developer, optimizer, and teacher. By 1998, I was the manager of a group of 85 performance specialists called the System Performance Group (SPG). And I was the leader of the system architecture and system management consulting service line within Oracle Consulting's Global Steering Committee.

My job in the SPG role was to respond to all the system performance-related issues in the USA for Oracle's largest accounts. My job in the Global Steering Committee was to package the success of SPG so that other practices around the world could repeat it. The theory was that if a country manager in, say, Venezuela, wanted his own SPG, then he could use the financial models, budgets, hiring plans, training plans, etc. created by my steering committee group. Just add water.

But there was a problem. My own group of 85 people consisted of two very different types of people. About ten of these 85 people were spectacularly successful optimizers whom I could send anywhere with confidence that they'd thrive at either improving performance or proving that performance improvements weren't possible. The other 75 were very smart, very hard-working people who would grow into the tip of my pyramid over the course of more years, but they weren't there yet.

The problem was, how do you convert good, smart, hard-working people in the base of the SPG pyramid into people in the tip? The practice manager in Venezuela would need to know that. The answer, of course, is supposed to be the Training Plan. Optimally, the Training Plan consists of a curriculum of a few courses, a little on-the-job training, and then, presto: tip of the pyramid. Just add water.

But unfortunately that wasn't the way things worked. What I had been getting instead, within my own elite group, was a process that took many years to convert a smart, hard-working person into a reasonably reliable performance optimizer whom you could send anywhere. Worse yet, the peculiar stresses of the job—like being away from home 80% of the time, continually visiting angry people each week, and having to work for me—caused an outflow of talent that approximately equaled the inflow of people who made it to the tip of the pyramid. The tip of my pyramid never grew beyond roughly 10 people.

The problem, by definition, was the Training Plan. It just wasn't good enough. It wasn't that the instructors of Oracle's internal "tuning" courses were doing a poor job of teaching. And it wasn't that the course developers had done a poor job of creating courses. On the contrary, the instructors and course developers were doing excellent work. The problem was that the courses were focusing on the wrong thing. The reason that the courses weren't getting the job done was that the very subject matter that needed teaching hadn't been invented yet.

I expect the people who write, say, the course called "Braking System Repair for Boeing 777" to have themselves invented the braking system they write about. So, the question was, who should be responsible for inventing the subject matter on how to optimize Oracle? I decided that I wanted that person to be me. I deliberated carefully and decided that my best chance of doing that the way I wanted to do it would be outside of Oracle. So in October 1999, ten years and one week after I joined the company, I left Oracle with the vision of creating a repeatable, teachable method for optimizing Oracle systems.

Ten years later, this is still the vision for my company, Method R Corporation. We exist not to make your system faster. We exist to make you faster at making all your systems faster. Our work is far from done, but here is what we have done:
  • Written white papers and other articles that explain Method R to you at no cost.
  • Written a book called Optimizing Oracle Performance, where you can learn Method R at a low cost.
  • Created a Method R course (on which the book is based), to teach you how to diagnose and repair response time problems in Oracle-based systems.
  • Spoken at hundreds of public and private events where we help people understand performance and how to manage it.
  • Provided consulting services to make people awesome at making their systems faster and more efficient.
  • Created the first response time profiling software ever for Oracle software applications, to let you analyze hundreds of megabytes of data without drudgery.
  • Created a free instrumentation library so that you can instrument the response times of Oracle-based software that you write.
  • Created software tools to help you be awesome at extracting every drop of information that your Oracle system is willing to give you about your response times.
  • Created a software tool that enables you to record the response time of every business task that runs on your system so you can effortlessly manage end-user performance.
As I said, our work is far from done. It's work that really, really matters to us, and it's work we love doing. I expect it to be a journey that will last long into the future. I hope that our journey will intersect with yours from time to time, and that you will enjoy it when it does.

Thursday, July 2, 2009

Fundamentals of Software Performance Quick Reference Card

I just posted "Fundamentals of Software Performance Quick Reference Card" at the Method R company website:
This two-page quick reference card written by Cary Millsap sums up computer software performance the Method R way. The first page lists definitions of the terms you need to know: efficiency, knee, load, response time, and so on. The second page lists ten principles that are vital to your ability to think clearly about software performance. This document contains meaningful insight in a format that's compact enough to hang on your wall.
It's free, and there's no sign-up required. I hope you will enjoy it.

Thursday, June 18, 2009

Profiling with my Boy

We have an article online called "Can you explain Method R so even my boss could understand it?" Today I'm going to raise the stakes, because yesterday I think I explained Method R so that an eleven-year-old could understand it.

Yesterday I took my 11-year-old son Alex to lunch. I talked him into eating at one of my favorite restaurants, called Mercado Juarez, over in Irving, so it was a half hour in the car together, just getting over there. It was a big day for the two of us because we were very excited about the new June 17 iPhone OS 3.0 release. I told him about some of the things I'd learned about it on the Internet over the past couple of weeks. One subject in particular that we were both interested in was performance. He likes not having to wait for click results just as much as I do.

According to Apple, the new iPhone OS 3.0 software has some important code paths in it that are 3× faster. Then, upgrading to the new iPhone 3G S hardware is supposed to yield yet another 3× performance improvement for some code paths. It's what Philip Schiller talks about at 1:42:00 in the WWDC 2009 keynote video. Very exciting.

Alex, of course, like many of us, wants to interpret "3× faster" as "everything I do is going to be 3× faster." As in everything that took 10 seconds yesterday will take 3 seconds tomorrow. It's a nice dream. But it's not what seeing a benchmark run 3× faster means. So we talked about it.

I asked him to imagine that it takes me an hour to do my grocery shopping when I walk to the store. Should I buy a car? He said yes, probably, because a car is a lot faster than walking. So I asked him, what if the total time I spent walking to and from the grocery store was only one minute? Then, he said, having a car wouldn't make that much of a difference. He said you might want a car for other reasons, but he wouldn't recommend it just to make grocery shopping faster.

Good.

I said, what if grocery shopping were taking me five hours, and four of it was time spent walking? "Then you probably should get the car," he told me. "Or a bicycle."

Commit.

On the back of his menu, I drew him a sequence diagram (A) showing how he, running Safari on an iPhone 3G, might look to a performance analyst. I showed him how to read the sequence diagram (time runs top-down; control passes from one tier to another), and I showed him two extreme ways that his sequence diagram might turn out for a given experience. Maybe the majority of the time would be spent on the 3G network tier (B), or maybe the majority of the time would be spent on the Safari software tier (C). We talked about how if B were what was happening, then a 3× faster Safari tier wouldn't really make any difference. Apple wouldn't be lying if they said their software was 3× faster, but he really wouldn't notice a performance improvement. But if C were what was happening, then a 3× faster Safari tier would be a smoking hot upgrade that we'd be thrilled with.

Sequence diagrams, check. Commit.

Now, to profiles. So I drew up a simple profile for him, with 101 seconds of response time consumed by 100 seconds of software and 1 second of 3G (D):
Software  100
3G          1
-------------
Total     101
I asked him, if we made the software 2× faster, what would happen to the software time? He wrote down "50" in a new column to the right of the "100." Yep. Then I asked him what would happen to the total response time. He said to wait a minute, he needed to use the calculator on his iPod Touch. Huh? A few keystrokes later, he came up with a response time of 50.5.

Oops. Rollback.

He made the same mistake that just about every forty-year-old I've ever met makes. He figured if one component of response time were 2× faster, then the total response time must be 2× faster, too. Nope: the right answer is 51, because halving the software component takes 100 down to 50, and the 1 second of 3G stays put. In this case, the wrong answer was close to the right answer, but only because of the particular numbers I had chosen.

So, to illustrate, I drew up another profile (E):
Software    4
3G         10
-------------
Total      14
Now, if we were to make the software 2× faster, what happens to the total? We worked through it together:
Software    4    2
3G         10   10
------------------
Total      14   12
Click. So then we spent the next several minutes doing little quizzes. If this is your profile, and we make this component X times faster, then what's the new response time going to be? Over and over, we did several more of these, some on paper (F), and others just talking.
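
If you want to play the quiz in code, here's a sketch of the arithmetic we were practicing (mine, written afterward; at the table we used only paper and a calculator):

# A profile maps each component to its seconds of response time.
# Speed one component up by some factor, then re-total.

def new_response_time(profile, component, speedup):
    adjusted = dict(profile)
    adjusted[component] = adjusted[component] / speedup
    return sum(adjusted.values())

print(new_response_time({"Software": 100, "3G": 1}, "Software", 2))  # 51.0
print(new_response_time({"Software": 4, "3G": 10}, "Software", 2))   # 12.0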

Commit.

Next step. "What if I told you it takes me an hour to get into my email at home? Do I need to upgrade my network connection?" A couple of minutes of conversation later, he figured out that he couldn't answer that question until he got some more information from me. Specifically, he had to ask me how much of that hour is presently being spent by the network. So we did this over and over a few times. I'd say things like, "It takes me an hour to run my report. Should I spend $4,800 tuning my SQL?" Or, "Clicking this button takes 20 seconds. Should I upgrade my storage area network?"

And, with just a little bit of practice, he learned that he had to say, "How much of the however-long-you-said is being spent by the whatever-it-was-you-said?" I was happy with how he answered, because it illustrated that he had caught onto the pattern. He realized that the specific blah-blah-blah proposed remedies I was asking him about didn't really matter. He had to ask the same question regardless. (He was answering me with a sentence using bind variables.)

Commit.

Alex hears me talk about our Method R Profiler software tool a lot, and he knows conceptually that it helps people make their systems faster, but he's never known in much detail what it does. So I told him that the profile tables are what our Profiler makes for people. To demonstrate how it does that, I drew him up a list of calls (G), which I told him was a list of visits between a disk and a CPU. ...Just a list that says the same thing that a sequence diagram (annotated with timings) would say:
D 2
C 1
D 6
D 4
D 8
C 3
I told him to make a profile for these calls, and he did (H):
Disk   20
CPU     4
---------
Total  24
Excellent. So I explained that instead of adding up lists in our heads all day, we wrote the Profiler to aggregate the call-by-call durations (from an Oracle extended SQL trace file) for you into a profile table that lets you answer the performance questions we had been practicing over lunch. ...Even if there are millions of lines to add up.
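
Here's that aggregation step as a sketch in Python (the real Profiler parses Oracle extended SQL trace files; this toy just adds up a list like the one on the menu):

from collections import defaultdict

# The list of calls (G): resource name and duration for each visit.
calls = [("D", 2), ("C", 1), ("D", 6), ("D", 4), ("D", 8), ("C", 3)]

def profile(calls):
    totals = defaultdict(int)
    for resource, duration in calls:
        totals[resource] += duration  # add each call into its bucket
    return dict(totals)

p = profile(calls)
print(p)                # {'D': 20, 'C': 4} -- Alex's profile (H)
print(sum(p.values()))  # 24 -- the total response time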

The finish-up conversation in the car ride back was about how to use everything we had talked about when you fix people's performance problems. I told him the most vital thing about helping someone solve a performance problem is to make sure that the operation (the business task) that you're analyzing right now is actually the most important business task to fix first. If you're looking at anything other than the most important task first, then you're asking for trouble.

I asked him to imagine that there are five important tasks that are too slow. Maybe every one of those things has its response time dominated by a different component than all the others. Maybe they're all the same. But if they're all different, then no single remedy you can perform is going to fix all five. A given remedy will have a different performance impact upon each of the five tasks, depending on how much of the fixed thing that task was using to begin with.

So the important thing is to know which of the five profiles it is that you ought to be paying attention to first. Maybe one remedy will fix all five tasks, maybe not. You just can't know until you look at the profiles. (Or until you try your proposed remedy. But trial-and-error is an awfully expensive way to find out.)
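
To see why, try it with numbers. Here are five hypothetical task profiles (all invented) and what a single remedy, say a disk upgrade that makes disk work 2× faster, does to each one:

# Five hypothetical slow tasks (seconds by component; numbers invented).
tasks = {
    "Order entry":  {"CPU": 2,  "Disk": 40, "Network": 3},
    "Nightly load": {"CPU": 35, "Disk": 5,  "Network": 2},
    "Report A":     {"CPU": 4,  "Disk": 4,  "Network": 30},
    "Report B":     {"CPU": 10, "Disk": 25, "Network": 1},
    "Lookup":       {"CPU": 1,  "Disk": 2,  "Network": 1},
}

# The remedy halves only the disk component of each task.
for name, p in tasks.items():
    before = sum(p.values())
    after = before - p["Disk"] / 2
    print(f"{name}: {before}s -> {after:g}s")

The disk upgrade takes Order entry from 45 seconds down to 25, but Nightly load only from 42 to 39.5. If Nightly load is the task your business is hurting over, most of that money was wasted.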

Commit.

It was a really good lunch. I'll look forward to taking my 9-year-old (Alex's little brother) out the week after next when I get back from ODTUG.

Friday, April 24, 2009

The Most Common Performance Problem I See

At the Percona Performance Conference in Santa Clara this week, the first question an audience member asked our panel was, "What is the most common performance problem you see in the field?"

I figured, being an Oracle guy at a MySQL conference, this might be my only chance to answer something, so I went for the mic. Here is my answer.
The most common performance problem I see is people who think there's a most-common performance problem that they should be looking for, instead of measuring to find out what their actual performance problem actually is.
It's a meta answer, but it's a meta problem. The biggest performance problems I see, and the ones I see most often, are not problems with machines or software. They're problems with people who don't have a reliable process of identifying the right thing to work on in the first place.

That's why the definition of Method R doesn't mention Oracle, or databases, or even computers. It's why Optimizing Oracle Performance spends the first 69 pages talking about red rocks and informed consent and Eli Goldratt instead of Oracle, or databases, or even computers.

The most common performance problem I see is that people guess instead of knowing. The worst cases are when people think they know because they're looking at data, but they really don't know, because they're looking at the wrong data. Unfortunately, every case of guessing that I ever see is this worst case, because nobody in our business goes very far without consulting some kind of data to justify his opinions. Tim Cook from Sun Microsystems pointed me yesterday to a blog post that gives a great example of that illusion of knowing when you really don't.