Thursday, September 17, 2015

The Fundamental Challenge of Computer System Performance

The fundamental challenge of computer system performance is for your system to have enough power to handle the work you ask it to do. It sounds really simple, but helping people meet this challenge has been the point of my whole career. It has kept me busy for 26 years, and there’s no end in sight.

Capacity and Workload

Our challenge is the relationship between a computer’s capacity and its workload. I think of capacity as an empty box representing a machine’s ability to do work over time. Workload is the work your computer does, in the form of programs that it runs for you, executed over time. Workload is the content that can fill the capacity box.


Capacity Is the One You Can Control, Right?

When the workload gets too close to filling the box, what do you do? Most people’s instinctive reaction is that, well, we need a bigger box. Slow system? Just add power. It sounds so simple, especially since—as “everyone knows”—computers get faster and cheaper every year. We call that the KIWI response: kill it with iron.

KIWI... Why Not?

As welcome as KIWI may feel, KIWI is expensive, and it doesn’t always work. Maybe you don’t have the budget right now to upgrade to a new machine. Upgrades cost more than just the hardware itself: there’s the time and money it takes to set it up, test it, and migrate your applications to it. Your software may cost more to run on faster hardware. What if your system is already the biggest and fastest one they make?

And as weird as it may sound, upgrading to a more powerful computer doesn’t always make your programs run faster. There are classes of performance problems that adding capacity never solves. (Yes, it is possible to predict when that will happen.) KIWI is not always a viable answer.

So, What Can You Do?

Performance is not just about capacity. Though many people overlook them, there are solutions on the workload side of the ledger, too. What if you could make workload smaller without compromising the value of your system?
It is usually possible to make a computer produce all of the useful results that you need without having to do as much work.
You might be able to make a system run faster by making its capacity box bigger. But you might also make it run faster by trimming down that big red workload inside your existing box. If you trim off only the wasteful stuff, then nobody gets hurt, and everybody wins.

So, how might one go about doing that?

Workload

“Workload” is a compound of two words: work and load. It is useful to think about those two words separately.


The amount of work your system does for a given program execution is determined mostly by how that program is written. A lot of programs make their systems do more work than they should. Your load, on the other hand—the number of program executions people request—is determined mostly by your users. Users can waste system capacity, too; for example, by running reports that nobody ever reads.

Both work and load are variables that, with skill, you can manipulate to your benefit. You do it by improving the code in your programs (reducing work), or by improving your business processes (reducing load). I like workload optimizations because they usually save money and work better than capacity increases. Workload optimization can seem like magic.

The Anatomy of Performance

This simple equation explains why a program consumes the time it does:
r = cl        or        response time = call count × call latency
Think of a call as a computer instruction. Call count, then, is the number of instructions that your system executes when you run a program, and call latency is how long each instruction takes. How long you wait for your answer, then—your response time—is the product of your call count and your call latency.

Some fine print: It’s really a little more complicated than this, but actually not that much. Most response times are composed of many different types of calls, all of which have different latencies (we see these in program execution profiles), so the real equation looks like r = c1l1 + c2l2 + ... + cnln. But we’ll be fine with r = cl for this article.
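The fine-print equation can be sketched in a few lines of Python. The call types, counts, and latencies below are invented illustration values, not measurements from any real system:

```python
# A minimal sketch of r = c1*l1 + c2*l2 + ... + cn*ln,
# using made-up call types and latencies for illustration.
profile = [
    # (call type, call count, latency per call in seconds)
    ("CPU instruction batch", 1_000, 0.000_5),
    ("disk read",                50, 0.002),
    ("network round trip",       10, 0.030),
]

# Response time is the sum of (count × latency) over all call types.
response_time = sum(count * latency for _, count, latency in profile)
print(f"response time = {response_time:.3f} s")
```

Notice that the network calls, though few, contribute as much as hundreds of CPU-bound calls; a profile like this tells you which c and which l to attack first.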

Call count depends on two things: how the code is written, and how often people run that code.
  • How the code is written (work) — If you were programming a robot to shop for you at the grocery store, you could program it to make one trip from home for each item you purchase. Go get bacon. Come home. Go get milk... It would probably be dumb if you did it that way, because the duration of your shopping experience would be dominated by the execution of clearly unnecessary travel instructions, but you’d be surprised at how often people write programs that act like this.
  • How often people run that code (load) — If you wanted your grocery store robot to buy 42 things for you, it would have to execute more instructions than if you wanted to buy only 7. If you found yourself repeatedly discarding spoiled, unused food, you might be able to reduce the number of things you shop for without compromising anything you really need.
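The two grocery-robot bullets can be expressed as a toy call-count comparison. The shopping list and trip counts here are invented for illustration:

```python
# Hedged sketch: compare trip counts for a robot that makes one round
# trip per item versus one that batches the whole list into one trip.
shopping_list = ["bacon", "milk", "eggs", "bread", "coffee", "apples", "rice"]

# Work: a naively written program executes one round trip per item.
naive_trips = len(shopping_list)

# A better-written program batches every item into a single trip.
batched_trips = 1

# Load: a 42-item list multiplies the naive robot's travel sixfold,
# while the batching robot still makes exactly one trip.
print(naive_trips, batched_trips)
```

The same shape shows up in software all the time, for example as one database round trip per row instead of one fetch for the whole result set.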
Call latency is influenced by two types of delays: queueing delays and coherency delays.
  • Queueing delays — Whenever you request a resource that is already busy servicing other requests, you wait in line. That’s a queueing delay. It’s what happens when your robot tries to drive to the grocery store, but all the roads are clogged with robots that are going to the store to buy one item at a time. Driving to the store takes only 7 minutes, but waiting in traffic costs you another 13 minutes. The more work your robot does, the greater its chances of being delayed by queueing, and the more such delays your robot will inflict upon others as well.
  • Coherency delays — You endure a coherency delay whenever a resource you are using needs to communicate or coordinate with another resource. For example, if your robot’s cashier at the store has to talk with a specific manager or other cashier (who might already be busy with a customer), the checkout process will take longer. The more times your robot goes to the store, the worse your wait will be, and everyone else’s, too.
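Queueing delay is easy to see in a toy model: one FIFO cashier serving robots that arrive faster than it can check them out. The arrival and service times below are invented:

```python
# Toy single-server FIFO queue: each customer's wait in line grows
# when arrivals outpace the server. Times are illustrative minutes.
def waits(arrivals, service_time):
    """Return each customer's queueing delay at one FIFO server."""
    free_at = 0.0          # when the server next becomes idle
    result = []
    for t in arrivals:
        start = max(t, free_at)        # wait if the server is busy
        result.append(start - t)       # time spent standing in line
        free_at = start + service_time
    return result

# Robots arriving every 1 minute, but checkout takes 2 minutes:
print(waits([0.0, 1.0, 2.0, 3.0], 2.0))
```

Each robot waits longer than the one before it: the queue, and everyone's delay, grows without bound as long as arrivals outpace the server.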

The Secret

This r = cl thing sure looks like the equation for a line, but because of queueing and coherency delays, the value of l increases when c increases. This causes response time to act not like a line, but instead like a hyperbola.


Because our brains tend to conceive of our world as linear, nobody expects everyone’s response times to get seven times worse when you’ve only added some new little bit of workload, but that’s the kind of thing that routinely happens with performance. ...And not just computer performance. Banks, highways, restaurants, amusement parks, and grocery-shopping robots all work the same way.
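One common way to see that hyperbola is the classic single-server (M/M/1) queueing result, where latency at utilization u is s / (1 − u) for an unloaded service time s. The numbers below are illustrative, not from any real system:

```python
# Why r = c*l curves like a hyperbola: in an M/M/1 queueing model,
# latency blows up as utilization u approaches 1.
def latency(s, u):
    """Expected latency for service time s at utilization u (0 <= u < 1)."""
    return s / (1.0 - u)

s = 1.0  # seconds of service time with no contention (illustrative)
for u in (0.50, 0.80, 0.90, 0.95, 0.98):
    print(f"utilization {u:.0%}: latency {latency(s, u):5.1f} s")
```

Going from 90% to 98% busy is a "little bit" more workload, but latency jumps from 10 seconds to 50, which is exactly the nonlinear surprise described above.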

Response times are tremendously sensitive to your call counts, so the secret to great performance is to keep your call counts small. This principle is the basis for perhaps the best and most famous performance optimization advice ever rendered:
The First Rule of Program Optimization: Don’t do it.

The Second Rule of Program Optimization (for experts only!): Don’t do it yet.

The Problem

Keeping call counts small is really, really important. This makes being a vendor of information services difficult, because it is so easy for application users to make call counts grow. They can do it by running more programs, by adding more users, by adding new features or reports, or even just by the routine process of adding more data every day.

Running your application with other applications on the same computer complicates the problem. What happens when all these applications’ peak workloads overlap? It is a problem that Application Service Providers (ASPs), Software as a Service (SaaS) providers, and cloud computing providers must solve.

The Solution

The solution is a process:
  1. Call counts are sacred. They can be difficult to forecast, so you have to measure them continually. Understand that. Hire people who understand it. Hire people who know how to measure and improve the efficiency of your application programs and the systems they reside on.
  2. Give your people time to fix inefficiencies in your code. An inexpensive code fix might return many times the benefit of an expensive hardware upgrade. If you have bought your software from a software vendor, work with them to make sure they are streamlining the code they ship you.
  3. Learn when to say no. Don’t add new features (especially new long-running programs like reports) that are inefficient, that make more calls than necessary. If your users are already creating as much workload as the system can handle, then start prioritizing which workload you will and won’t allow on your system during peak hours.
  4. If you are an information service provider, charge your customers for the amount of work your systems do for them. The economic incentive to build and buy more efficient programs works wonders.

6 comments:

rsiz said...

Brilliant as usual, without unneeded complexity (also as usual). A side issue: proper configuration of the existing box (compared to a lame configuration) might from time to time improve the system for a given load (lower response time, or push the non-linearity point to the right on your response-time-versus-workload graph). Sometimes that sort of optimization is useful, but it can only shift the graph a bit (at best), not change the fundamental underlying principle you highlight. Bravo.

Eric Yen said...

Thank you for sharing.

Cary Millsap said...

Thank you, Mark. And it is my pleasure, Eric.

Thank you both for reading.

Sampath said...

Hello Cary,
Been a while since I visited your blog. Still thriving with your initial initiation into tuning back in 2004. I have a question now. Hope you must be using OEM as an add on to your list of tools. We, in our company use it all the time (Just me!!!). A specific problem I am encountering. It is showing top CPU consumers as the statements/calls that are invoked several times during an interval on the top activity page. But top 3 are pl/sql procedure calls and are very fast in terms of elapsed time as well as average buffer gets per execution. The numbers for them are solid. But we have issues with heavy cpu usage and our whole system comes to a grinding halt. One thing I noticed though is distinct sql ids that are having bind/literal combo. Definitely a problem.
Is there a way to identify the real culprits using the available instrumentation in Oracle?
Greatly appreciate your feedback.
Regards
Kumar Ramalingam

Cary Millsap said...

Kumar,

Seeing that three statements (SQL or PL/SQL) are your "top" consumers isn't a lot of information. I would approach it like this...

If your top three statements are each consuming, say, 10%+ of your CPU (paying careful attention to sync your observed time interval with an interval that you know to be relevant), then you need to ensure that these statements are as efficient as possible. If, as you imply, these statements are already as efficient as you can make them, then either you need to run them less frequently or increase your system capacity.

Otherwise (if your top three statements are not consuming a lot of your CPU), then you will have a large number of distinct statements contributing to your workload. There are different ways this can happen. The one you're implying is probably happening to you is that a large number of distinct statements are being generated by a small number of places in your source code. If this is the case, then it's good news, because you'll be able to make a comparatively big impact with a comparatively small effort focused on just a few places in your code.

One thing you'll want from OEM that I'm not sure it offers is the aggregation statistics for "similar" SQL statements, by which I mean statements that should have been sharable but which aren't. If OEM doesn't aggregate statistics across similar statements, then you'll be left with a long list of similar but distinct statements, each of which may not look too expensive on its own, but together might account for a significant proportion of your workload. Your goal is to find the smallest number of places in your source code that you can manipulate to provide relief to your system. When your tools don't aggregate along the dimension you need (similar statements in this case), it means you have to do a lot more work in your head (in spreadsheets, etc.). Trace data can help you here. You can see how at http://method-r.com/blogs/company-blog/222-quantifying-sql-shareability.

The other thing you could consider is setting cursor_sharing=force. Perhaps with this setting in place, OEM will aggregate properly for you so you can see where in your source code you need to focus first. I don't recommend using cursor_sharing=force as a permanent solution, although the performance of it seems better in 12c than it was in Oracle Database versions 10 and 11.

Hope this helps...


Cary

navakanth Talluri said...

Hi Cary,

As always, a great article! Thank you for sharing.



Regards,
Navakanth