Monday, March 7, 2016

Loss Aversion and the Setting of DB_BLOCK_CHECKSUM

Within Accenture Enkitec Group, we have recently been discussing the Oracle db_block_checksum parameter and how difficult it is to get clients to set it to a safer value.

Clients are always concerned about the performance impact of features like this. Several years ago, I met a lot of people who had (in response to some expensive advice with which I strongly disagreed) turned off redo logging with an underscore parameter. The performance they got from doing this set the expectation level in their minds, which caused them to resist (strenuously!) any notion of switching this [now horribly expensive] logging back on. Of course, it makes you wish that it had never even been a parameter.

I believe that the right analysis is to think clearly about risk. Risk is a non-technical word in most people’s minds, but in finance courses they teach that risk is quantifiable as a probability distribution. For example, you can calculate the probability that a disk will go bad in your system today. For disks, it’s not too difficult, because vendors do those calculations (MTTF) for us. But the probability that you’ll wish you had set db_block_checksum=full yesterday is probably more difficult to compute.
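The disk calculation mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions: it assumes a constant failure rate (an exponential lifetime model), independent disks, and made-up values for the MTTF and the number of disks. None of these figures come from a vendor or from this post.

```python
import math


def prob_failure_within(hours: float, mttf_hours: float) -> float:
    """Probability that one device fails within `hours`.

    Assumes a constant failure rate of 1 / MTTF (exponential lifetimes).
    """
    return 1.0 - math.exp(-hours / mttf_hours)


def prob_any_failure(hours: float, mttf_hours: float, n_devices: int) -> float:
    """Probability that at least one of `n_devices` independent devices
    fails within `hours`, under the same constant-rate assumption."""
    return 1.0 - math.exp(-n_devices * hours / mttf_hours)


# Hypothetical inputs, chosen only for illustration.
MTTF_HOURS = 1_000_000   # assumed vendor-quoted MTTF for one disk
N_DISKS = 100            # assumed number of disks in the system

p_one = prob_failure_within(24, MTTF_HOURS)
p_any = prob_any_failure(24, MTTF_HOURS, N_DISKS)

print(f"P(one given disk fails today)  = {p_one:.6%}")
print(f"P(at least one of {N_DISKS} fails today) = {p_any:.4%}")
```

The point is not the specific numbers but that the question "will a disk go bad today?" has a computable answer, whereas no comparable vendor-supplied input exists for the db_block_checksum question.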

From a psychology perspective, customers would be happier if their systems had db_block_checksum set to full or typical to begin with. Then in response to the question,
“Would you like to remove your safety net in exchange for going between 1% and 10% faster? Here’s the horror you might face if you do it...”
...I’d wager that most people would say no, thank you. They will react emotionally to the idea of their safety net being taken away.

But when the baseline is that the feature is turned off to begin with, the question becomes,
“Would you like to install a safety net in exchange for slowing your system down between 1% and 10%? Here’s the horror you might face if you don’t...”
...I’d wager that most people would answer no, thank you, even though this verdict is opposite to the one I predicted above. They will react emotionally to the idea of their performance being taken away.

Most people have a strong propensity toward loss aversion. They tend to prefer avoiding losses over acquiring gains. If they already have a safety net, they won’t want to lose it. If they don’t have the safety net they need, they’ll feel averse to losing performance to get one. It ends up being a problem more about psychology than technology.

The only tools I know to help people make the right decision are:
  1. Talk to good salespeople about how they overcome the psychology issue. They have to deal with it every day.
  2. Give concrete evidence. Compute the probabilities. Tell the stories of how bad it is to have insufficient protection. Explain that any software feature that provides a benefit is going to cost some system capacity (just like a new report, for example), and that this safety feature is worth the cost. Make sure that when you size systems, you include the incremental capacity cost of switching to db_block_checksum=full.
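The sizing and "worth the cost" arithmetic in item 2 can be sketched roughly as follows. Every input here is hypothetical: the baseline capacity, the dollar figures, and the function names are invented for illustration, and the 1% to 10% overhead range is simply the range quoted in the questions earlier in this post, not a measured figure for any particular system.

```python
def sized_capacity(baseline_units: float, overhead_fraction: float) -> float:
    """Capacity to provision once the checksum overhead is included.

    `overhead_fraction` is the assumed extra capacity the feature consumes,
    e.g. 0.10 for 10%.
    """
    return baseline_units * (1.0 + overhead_fraction)


def break_even_probability(protection_cost: float, loss_if_unprotected: float) -> float:
    """Probability of a damaging corruption above which paying for the
    protection has a lower expected cost than going without it."""
    return protection_cost / loss_if_unprotected


# Hypothetical sizing: 100 capacity units needed without the feature.
BASELINE_UNITS = 100.0
for overhead in (0.01, 0.05, 0.10):
    units = sized_capacity(BASELINE_UNITS, overhead)
    print(f"overhead {overhead:.0%}: provision {units:.1f} units")

# Hypothetical money figures: extra capacity costs 5,000 per year,
# and an undetected corruption would cost 1,000,000 to recover from.
p_star = break_even_probability(5_000, 1_000_000)
print(f"break-even annual corruption probability: {p_star:.2%}")
```

Framing the decision as a break-even probability gives the client a concrete number to argue about, instead of an emotional reaction to losing performance.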

My teammates get it, of course, because they’ve lived the stories, over and over again, in their roles on the corruption team at Oracle Support. You can get it, too, without leaving your keyboard. If you want to see a fantastic and absolutely horrifying short story about what happens if you do not use Oracle’s db_block_checksum feature properly, read David Loinaz’s article now.

When you read David’s article, you are going to see heavy quoting of my post here in his intro. He did that with my full support. (He wrote his article when my article here wasn’t an article yet.) If you feel like you’ve read it before, just keep reading. You really, really need to see what David has written, beginning with the question:
If I’ve never faced a corruption, and I have good backup strategy, my disks are mirrored, and I have a great database backup strategy, then why do I need to set these kinds of parameters that will impact my performance?
Enjoy.