Monday, December 29, 2008

Performance as a Service, Part 2

Over the holiday weekend, Dallas left a comment on my July 7 post that begins with this:
One of the biggest issues I run into is that most of my customers have no SLAs outside of availability.
It's an idea that resonates with a lot of people that I talk to.

I see the following progressive hierarchy when it comes to measuring performance...
  1. Don't measure response times at all.
  2. Measure response times. Don't alert at all.
  3. Measure response times. Alert against thresholds.
  4. Measure response times. Alert upon variances.
Most people don't measure response times at all (category 1), at least not until there's trouble. Even then, most people don't measure them, though some do. Not many people fit into what I've called category 2, because once you have a way to collect response time data, it's too tempting to do some kind of alerting with it.

Category 3 is a world in which people measure response times and compare them against a pre-specified list of tolerances. Here's where the big problem that Dallas is talking about hits you: Where does that list of tolerances come from? It takes work to make that list, and preceding that work is the motivation to make that list. Many companies just don't have that motivation.
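
To make category 3 concrete, here's a minimal sketch (in Python) of threshold-based alerting; the task names and tolerance values are made up for illustration, not taken from any particular product:

    # Category 3 sketch: compare measured response times against
    # pre-specified tolerances. Task names and tolerances are hypothetical.
    TOLERANCES_SEC = {
        "book-order": 2.0,
        "run-payroll": 300.0,
        "customer-lookup": 0.5,
    }

    def check_threshold(task, response_time_sec):
        """Return an alert message if the task exceeded its tolerance, else None."""
        tolerance = TOLERANCES_SEC.get(task)
        if tolerance is None:
            return None  # nobody has negotiated a tolerance for this task
        if response_time_sec > tolerance:
            return (f"ALERT: {task} took {response_time_sec:.2f}s "
                    f"(tolerance {tolerance:.2f}s)")
        return None

    print(check_threshold("book-order", 3.1))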

I think it's the specter of the difficulty in getting to category 3 that prevents a lot of people from moving into category 2. I think that is Dallas's situation.

A few years ago, I would have listed category 3 at the top of my hierarchy, but at CMG'07, in a paper called "Death to Dashboards...," Peg McMahon and Justin Martin made me aware of another level: this notion of alerting based on variance.

The plan of creating a tolerance for every business task you execute on your system works fine for a few interesting tasks, but the idea doesn't scale to systems with hundreds or thousands of instrumented tasks. The task of negotiating, setting, and maintaining hundreds of different tolerances is just too labor-intensive.

Peg and Justin's paper described the notion that not bothering with individual tolerances works just as well—and with very low setup cost—because what you really ought to look out for are changes in response times. (It's an idea similar to what Robyn Sands described at Hotsos Symposium 2008.) You can look for variances without defining tolerances, but of course you cannot do it without measuring response times.
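
As a rough sketch of the category 4 idea (my own simplification, not Peg and Justin's or Robyn's actual method), you can flag a task whenever its latest response time drifts too far from its own history, measured in standard deviations rather than against a negotiated tolerance:

    # Category 4 sketch: alert on variance from a task's own history,
    # with no per-task tolerance to negotiate. The 3-sigma cutoff is an
    # illustrative choice, not a recommendation from the paper.
    from statistics import mean, stdev

    def variance_alert(history_sec, latest_sec, sigmas=3.0):
        """Alert if latest_sec sits more than `sigmas` standard deviations
        above the historical mean; otherwise return None."""
        if len(history_sec) < 2:
            return None  # not enough history to estimate variance yet
        mu, sd = mean(history_sec), stdev(history_sec)
        if sd > 0 and latest_sec > mu + sigmas * sd:
            return (f"ALERT: {latest_sec:.2f}s is "
                    f"{(latest_sec - mu) / sd:.1f} standard deviations above "
                    f"the historical mean of {mu:.2f}s")
        return None

    history = [0.41, 0.39, 0.44, 0.40, 0.42, 0.43]  # seconds
    print(variance_alert(history, 0.95))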

Dallas ends with:
I think one of the things you might offer as part of the "Performance as a Service" might be assisting customers in developing those performance SLAs, especially since your team is very experienced in knowing what is possible.
I of course love that he made that point, because this is exactly the kind of thing we're in business to do for people. Just contact us through http://method-r.com. We are ready, willing, and able, and now is a great time to schedule something.

There is a lot of value in doing the response time instrumentation exercise, no matter how you do alerting. The value comes in two main ways. First, just the act of measuring often reveals inefficiencies that are easy and inexpensive to fix. We find little mistakes all the time that make systems faster and nicer to use and that allow companies to use their IT budgets more efficiently. Second, response time information is just fascinating. It's valuable for people on both sides of the information supply-and-demand relationship to see how fast things really are, and how often you really run them. Seeing real performance data brings ideas together. It happens even if you don't choose to do alerting at all.

The biggest hurdle is in moving from category 1 to category 2. Once you're at category 2, your hardest work is behind you, and you and your business will have exactly the information you'll need for deciding whether to move on to category 3 or 4.

6 comments:

John Brady said...

When I worked at Sun I struggled with the same issue - customers would set SLAs for availability and measure them, but never for performance.

One of the reasons I think that performance is so often ignored by customers is that they cannot go out and buy a product from someone else that addresses it directly.

Performance is part of the trio of service qualities - Performance, Availability and Manageability. Customers can buy products for Availability - clustered servers and warm standby - and products for Manageability - various System Management suites. But there are very few products that directly address Performance.

So IT departments often just ignore it, because they think it cannot be solved. Their reasoning is that if it could be solved, someone would have developed a product that did it, and they could buy it. Since there are few if any products out there that address Performance, it gets ignored.

You will also see that customers tend to address Availability before Manageability, and so leave Performance until last. Yet poor performance can have an impact equivalent to a decrease in availability: a 5% performance degradation over a 24-hour period is roughly equivalent to an hour's downtime (5% of 24 hours is about 1.2 hours), i.e. the same number of 'unprocessed transactions'.

John

Cary Millsap said...

John,

Thank you for your comment. My hope is that more people who are interested will discover that we do have a product for performance that makes it easy to measure response times like I've talked about.

I really appreciate your final paragraph equating the impact of degraded performance to that of an outage.

—Cary

John Brady said...

Cary,

Yes, I am always amazed at how people do everything they can to avoid an availability issue before it occurs by buying lots of extra hardware and software, but do almost nothing about performance.

Personally, I cannot understand why people treat performance so differently from availability - both are about your ability to process business transactions in a timely manner. People go to great lengths to avoid availability issues before they occur, yet the same people will simply ignore performance and react to it only after a problem happens. And that is the wrong time to tackle performance - it is far too late.

I also think about performance management in terms of insurance against the times when something bad happens. Again, you take out house, car, health or life insurance before you have an accident. Not afterwards.

Adrian Cockcroft gave another good analogy of how IT departments deal with performance in a blog post about dealing with fires. It again shows how the general approach to performance - fix it after it occurs - is not the right attitude.

John

Cary Millsap said...

Thank you for the link to Adrian's blog.

—Cary

Dominic Delmolino said...

Gosh I'm so far behind on my reading...

I love the idea of variance measuring -- the old Savant Q tool did this in spades (talk to JB at Oracle about this).

Once you capture the historical data, you can also calibrate your measurements against things like "every Monday at 8am" or "every 3rd month end" -- to see whether performance trends are predictable. Also, you can start to calibrate variance -- not just in absolute terms ("it's varying by 5 seconds") but also in relative terms ("it's varying by 1/2 of a standard deviation") -- which can be more meaningful...
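
A minimal sketch of this per-period calibration, assuming response times are bucketed by weekday and hour -- an illustrative choice, not any particular tool's method:

    # Sketch: calibrate variance per time bucket (e.g. "Monday 08:00"),
    # so a measurement is judged against the history of comparable periods.
    from collections import defaultdict
    from datetime import datetime
    from statistics import mean, stdev

    history = defaultdict(list)  # (weekday, hour) -> response times in seconds

    def record(timestamp, response_time_sec):
        history[(timestamp.weekday(), timestamp.hour)].append(response_time_sec)

    def relative_variance(timestamp, response_time_sec):
        """How many standard deviations this measurement sits above (or below)
        the mean of its own time bucket; None if there is too little history."""
        bucket = history[(timestamp.weekday(), timestamp.hour)]
        if len(bucket) < 2:
            return None
        sd = stdev(bucket)
        return (response_time_sec - mean(bucket)) / sd if sd > 0 else None

    # Example: Mondays at 08:00 usually take about half a second.
    for t in (0.48, 0.52, 0.50, 0.47, 0.53):
        record(datetime(2008, 12, 1, 8, 15), t)
    print(relative_variance(datetime(2008, 12, 8, 8, 5), 0.75))  # roughly +9.8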

Cary Millsap said...

Hi Dom,

I love that idea, too. It has been a popular one at CMG for the past few years, which is where I first became familiar with it.

Thanks for your comment,

—Cary