
Monday, December 29, 2008

Performance as a Service, Part 2

Over the holiday weekend, Dallas left a comment on my July 7 post that begins with this:
One of the biggest issues I run into is that most of my customers have no SLAs outside of availability.
It's an idea that resonates with a lot of people that I talk to.

I see the following progressive hierarchy when it comes to measuring performance...
  1. Don't measure response times at all.
  2. Measure response times. Don't alert at all.
  3. Measure response times. Alert against thresholds.
  4. Measure response times. Alert upon variances.
Most people don't measure response times at all (category 1), at least not until there's trouble. Many don't measure them even then, but some do. Not many people fit into what I've called category 2, because once you have a way to collect response time data, it's too tempting to do some kind of alerting with it.
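
To make categories 2 through 4 a little more concrete, I'll sketch what each one might look like in a few lines of Python. Here's category 2: measure and record response times, with no alerting at all. The task name "book_order" and the CSV log are illustrations I've made up for this post, not anything from a real system.

    # Category 2 sketch: measure response times, record them, don't alert.
    import time
    import csv
    from datetime import datetime, timezone

    LOG_PATH = "response_times.csv"   # assumed location for the measurements

    def record_response_time(task_name, seconds):
        """Append one (timestamp, task, response time) row to a CSV log."""
        with open(LOG_PATH, "a", newline="") as f:
            csv.writer(f).writerow(
                [datetime.now(timezone.utc).isoformat(), task_name, f"{seconds:.3f}"]
            )

    def timed(task_name):
        """Decorator that measures how long a business task takes, end to end."""
        def wrap(fn):
            def inner(*args, **kwargs):
                start = time.perf_counter()
                try:
                    return fn(*args, **kwargs)
                finally:
                    record_response_time(task_name, time.perf_counter() - start)
            return inner
        return wrap

    @timed("book_order")
    def book_order(order_id):
        ...  # the real business task would go here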

Category 3 is a world in which people measure response times, and they compare those response times against some pre-specified list of tolerances for those response times. Here's where the big problem that Dallas is talking about hits you: Where does that list of tolerances come from? It takes work to make that list, and preceding that work is the motivation to make that list. Many companies just don't have that motivation.
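
Continuing the Python sketches, category 3 might look something like the following. Notice that the tolerance list is the expensive part: every entry in it represents a negotiation that somebody had to have. The tasks, numbers, and alert mechanism here are all illustrative.

    # Category 3 sketch: compare each measurement to a pre-specified tolerance.
    TOLERANCES = {            # seconds; someone had to negotiate every entry
        "book_order": 2.0,
        "print_invoice": 5.0,
    }

    def alert(message):
        print("ALERT:", message)   # stand-in for paging, email, etc.

    def check_threshold(task_name, seconds):
        tolerance = TOLERANCES.get(task_name)
        if tolerance is None:
            return                 # nobody negotiated a tolerance for this task
        if seconds > tolerance:
            alert(f"{task_name} took {seconds:.3f}s, tolerance is {tolerance:.1f}s")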

I think it's the specter of the difficulty in getting to category 3 that prevents a lot of people from moving into category 2. I think that is Dallas's situation.

A few years ago, I would have listed category 3 at the top of my hierarchy, but at CMG'07, in a paper called "Death to Dashboards...," Peg McMahon and Justin Martin made me aware of another level: this notion of alerting based on variance.

Creating a tolerance for every business task your system executes works fine for a few interesting tasks, but the idea doesn't scale to systems with hundreds or thousands of instrumented tasks. The job of negotiating, setting, and maintaining hundreds of different tolerances is just too labor-intensive.

Peg and Justin's paper described the notion that not bothering with individual tolerances works just as well—and with very low setup cost—because what you really ought to look out for are changes in response times. (It's an idea similar to what Robyn Sands described at Hotsos Symposium 2008.) You can look for variances without defining tolerances, but of course you cannot do it without measuring response times.
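
A category 4 check, in the same sketchy Python style, might compare each new measurement against that task's own recent history, for example flagging anything more than three standard deviations from its trailing mean. To be clear, the window size and the three-sigma rule here are my own arbitrary illustrations of the general idea, not the method from Peg and Justin's paper or from Robyn's work.

    # Category 4 sketch: alert on variances, with no per-task tolerances.
    from collections import defaultdict, deque
    from statistics import mean, stdev

    HISTORY = defaultdict(lambda: deque(maxlen=100))   # recent samples per task

    def alert(message):
        print("ALERT:", message)   # stand-in for paging, email, etc.

    def check_variance(task_name, seconds):
        history = HISTORY[task_name]
        if len(history) >= 30:                         # need a baseline first
            mu, sigma = mean(history), stdev(history)
            if sigma > 0 and abs(seconds - mu) > 3 * sigma:
                alert(f"{task_name}: {seconds:.3f}s vs recent mean of {mu:.3f}s")
        history.append(seconds)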

Dallas ends with:
I think one of the things you might offer as part of the "Performance as a Service" might be assisting customers in developing those performance SLAs, especially since your team is very experienced in knowing what is possible.
I of course love that he made that point, because this is exactly the kind of thing we're in business to do for people. Just contact us through http://method-r.com. We are ready, willing, and able, and now is a great time to schedule something.

There is a lot of value in doing the response time instrumentation exercise, no matter how you do alerting. The value comes in two main ways. First, just the act of measuring often reveals inefficiencies that are easy and inexpensive to fix. We find little mistakes all the time that make systems faster and nicer to use and that allow companies to use their IT budgets more efficiently. Second, response time information is just fascinating. It's valuable for people on both sides of the information supply-and-demand relationship to see how fast things really are, and how often you really run them. Seeing real performance data brings ideas together. It happens even if you don't choose to do alerting at all.

The biggest hurdle is in moving from category 1 to category 2. Once you're at category 2, your hardest work is behind you, and you and your business will have exactly the information you'll need for deciding whether to move on to category 3 or 4.

Friday, July 11, 2008

So how do you FIX the problems that "Performance as a Service" helps you find?

I want to respond carefully to Reuben's comment on my Performance as a Service post from July 7. Reuben asked:
i can see how you actually determine for a customer where the pain points are and you can validate user remarks about poor performance. But i don't see from your post how you are going to attack the problem of fixing the performance issue.

i would be most interested in hearing your thoughts on that. I wonder if you guys are going to touch the actual code behind the "order button" you described.
Under the service model I described in the Performance as a Service post, a client using our performance service would have several choices about how to respond to a problem. They could contract our team as consultants; they could use someone else; they could do it themselves; or of course, they could choose to defer the solution.

Attacking the problem of fixing the performance issue is actually the "easy" part. Well, it's not necessarily always easy, but it's the part that my team has been doing over and over again since the early 1990s. We use Method R. Once we've measured the response time in detail with Oracle's extended SQL trace feature, we know exactly where the task is spending the end user's time. From there, I think it's fair to say we (Jeff, Karen, etc.) are pretty skilled at figuring out what to do next.
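
For readers who haven't seen it, here's roughly what switching on that level of measurement looks like for a single Oracle session. This is only a sketch: it assumes the cx_Oracle driver, a placeholder connect string, and a session with the privileges to call DBMS_MONITOR.

    # Rough sketch of enabling Oracle extended SQL trace for one session,
    # so that every database call and wait event is timed.
    import cx_Oracle

    conn = cx_Oracle.connect("user/password@host/service")   # placeholder credentials
    cur = conn.cursor()

    # Tag the trace file so it's easy to find on the database server.
    cur.execute("ALTER SESSION SET TRACEFILE_IDENTIFIER = 'book_order_trace'")

    # Enable extended SQL trace (timed events plus waits and bind values).
    cur.execute("""
        BEGIN
            DBMS_MONITOR.SESSION_TRACE_ENABLE(waits => TRUE, binds => TRUE);
        END;
    """)

    # ... run the slow business task here, then ...

    cur.execute("BEGIN DBMS_MONITOR.SESSION_TRACE_DISABLE; END;")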

Sometimes, the root cause of a performance problem requires manipulation of the application source code, and sometimes it doesn't. If you do diagnosis right, you should never have to guess which one it is. A lot of people wonder what happens if it's the code that needs modification, but the application is prepackaged, and therefore the source code is out of your control. In my experience, most vendors are very responsive to improving the performance of their products when they're shown unambiguously how to improve them.

If your application is slow, you should be eager to know exactly why it's slow. You should be equally eager to know regardless of whether you wrote the code yourself or someone else wrote it for you. To avoid collecting the right performance diagnostic data for an application because you're afraid of what you might find out is like taking your hands off the wheel and covering your eyes when a child rides his bike out in front of the car you're driving. There's significant time value in information about performance problems. Even if someone else's code is the reason for your performance problem (or whatever truth you might be afraid of learning), you need to know it as early as possible.

The SLA Manager service I talked about is so important because the most difficult part of using Method R is usually the data collection step. The difficulty is almost always more political than technical. It's overcoming the question, "Why should we change the way we collect our data?" I believe the business value of knowing how long your computer system takes to execute tasks for your business is important enough that it will get people into the habit of measuring response time, which is a vital step in solving the data collection problem that's at the heart of every persistent performance problem I've ever seen. I believe the data collection service that I described will help remove the most important remaining barrier to highly manageable application software performance in our market.