Cary Millsap: robyn sands

Over the holiday weekend, Dallas left a comment on my July 7 post that begins with this:

One of the biggest issues I run into is that most of my customers have no SLAs outside of availability.

It's an idea that resonates with a lot of people that I talk to.

I see the following progressive hierarchy when it comes to measuring performance...

Don't measure response times at all.
Measure response times. Don't alert at all.
Measure response times. Alert against thresholds.
Measure response times. Alert upon variances.

Most people don't measure response times at all (category 1), at least not until there's trouble. Most people don't measure response times even then, but some do. Not many people fit into what I've called category 2, because once you have a way to collect response time data, it's too tempting to do some kind of alerting with it.

Category 3 is a world in which people measure response times, and they compare those response times against some pre-specified list of tolerances for those response times. Here's where the big problem that Dallas is talking about hits you: Where does that list of tolerances come from? It takes work to make that list, and preceding that work is the motivation to make that list. Many companies just don't have that motivation.

I think it's the specter of the difficulty in getting to category 3 that prevents a lot of people from moving into category 2. I think that is Dallas's situation.

A few years ago, I would have listed category 3 at the top of my hierarchy, but at CMG'07, in a paper called "Death to Dashboards...," Peg McMahon and Justin Martin made me aware of another level: this notion of alerting based on variance.

The plan of creating a tolerance for every business task you execute on your system works fine for a few interesting tasks, but the idea doesn't scale to systems with hundreds or thousands of instrumented tasks. The task of negotiating, setting, and maintaining hundreds of different tolerances is just too labor-intensive.

Peg and Justin's paper described the notion that not bothering with individual tolerances works just as well—and with very low setup cost—because what you really ought to look out for are changes in response times. (It's an idea similar to what Robyn Sands described at Hotsos Symposium 2008.) You can look for variances without defining tolerances, but of course you cannot do it without measuring response times.

Dallas ends with:

I think one of the things you might offer as part of the "Performance as a Service" might be assisting customers in developing those performance SLAs, especially since your team is very experienced in knowing what is possible.

I of course love that he made that point, because this is exactly the kind of thing we're in business to do for people. Just contact us through http://method-r.com. We are ready, willing, and able, and now is a great time to schedule something.

There is a lot of value in doing the response time instrumentation exercise, no matter how you do alerting. The value comes in two main ways. First, just the act of measuring often reveals inefficiencies that are easy and inexpensive to fix. We find little mistakes all the time that make systems faster and nicer to use and that allow companies to use their IT budgets more efficiently. Second, response time information is just fascinating. It's valuable for people on both sides of the information supply-and-demand relationship to see how fast things really are, and how often you really run them. Seeing real performance data brings ideas together. It happens even if you don't choose to do alerting at all.

The biggest hurdle is in moving from category 1 to category 2. Once you're at category 2, your hardest work is behind you, and you and your business will have exactly the information you'll need for deciding whether to move on to category 3 or 4.

Cary Millsap

Monday, December 29, 2008

Performance as a Service, Part 2