Monday, August 21, 2023

A Design Decision

This week, my team at Method R devoted some time to an enhancement request that required an interesting design decision. This post is about the analysis behind that decision.

The enhancement request was for our flagship product called Method R Workbench. It's an application that people use to mine, manage, and manipulate Oracle trace files.

One of its features, called mrskew, is a tool that allows a Workbench user to filter, group, and sort the raw dbcall and syscall data that Oracle Database processes write to their trace files. You can use mrskew from within the Workbench application, or from a *nix command line.

Here's an example of a mrskew command. It's what you would use to find out how long your program spent reading Oracle blocks from secondary storage. It will show you which blocks it read, how long they took, and how many times each block was read:

mrskew --name='db.*read' \
  --group='sprintf("%8d %8d", $p1, $p2)' \
  x_ora_1492.trc

Here's the output:

sprintf("%8d %8d", $p1, $p2)  DURATION       %  CALLS      MEAN  ...
----------------------------  --------  ------  -----  --------  
                  2        2  0.072918    1.0%     26  0.002805  ...
                 33   698186  0.051940    0.7%      1  0.051940  ...
                 50   339841  0.049261    0.7%      1  0.049261  ...
...

The important thing in this report is the meaning of the $p1 and $p2 variables. The combination of these two variables happens to represent the data block address (the file number and block number) of an Oracle block that was read by some kind of an Oracle read call. It would be nice for the report to tell you that instead of just telling you that the first two columns of numbers are the output of an sprintf function call.
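The filter, group, and sort steps behind a report like this one can be sketched in a few lines of Python. This is only an illustration of the idea, not how mrskew is implemented: the `skew` function, the record layout, and the call records themselves are invented for the example, and Python's `%` formatting stands in for the Perl-style `sprintf` in the `--group` expression.

```python
import re
from collections import defaultdict

# Invented call records (name, p1, p2, duration in seconds); not real trace data.
calls = [
    ("db file sequential read", 2, 2, 0.003),
    ("db file sequential read", 2, 2, 0.002),
    ("db file scattered read", 33, 698186, 0.004),
    ("SQL*Net message from client", 1413697536, 1, 0.500),
]

def skew(calls, name_pattern, group):
    """Filter calls by name, group them, and sort groups by total duration."""
    totals = defaultdict(lambda: [0.0, 0])  # group key -> [total duration, call count]
    for name, p1, p2, duration in calls:
        if re.search(name_pattern, name):      # the --name filter
            key = group(p1, p2)                # the --group expression
            totals[key][0] += duration
            totals[key][1] += 1
    # Largest total duration first, as in the report above.
    return sorted(((k, d, n) for k, (d, n) in totals.items()),
                  key=lambda row: row[1], reverse=True)

rows = skew(calls, r"db.*read", lambda p1, p2: "%8d %8d" % (p1, p2))
for key, duration, count in rows:
    print("%s  %8.6f  %5d" % (key, duration, count))
```

The group key is just a formatted string, which is why the default column heading is the text of the expression that produced it.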

We have a command-line option for that. The ‑‑group-label option lets you assign your own title for the group column. So, with some careful character counting, you could use…

--group-label='    FILE    BLOCK'

…to get exactly the heading you want:

    FILE    BLOCK  DURATION       %  CALLS      MEAN  ...
-----------------  --------  ------  -----  --------  
       2        2  0.072918    1.0%     26  0.002805  ...
      33   698186  0.051940    0.7%      1  0.051940  ...
      50   339841  0.049261    0.7%      1  0.049261  ...
...

That makes sense. Now it's easy to see that Oracle has read one block (file #2, block #2) 26 times, consuming a total of 0.072918 seconds reading it.

The group label fits the output only because of the careful character counting. The enhancement request was to allow the ‑‑group-label option to take an expression, not just a string. Like this:

--group-label='sprintf("%8s %8s", "FILE", "BLOCK")'

That way, the customer who made the request could print out the header he wanted, perfectly aligned, just by syncing his ‑‑group‑label expression to his ‑‑group expression, without having to count space characters that are literally invisible.
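Here is a minimal sketch of why matching format widths removes the counting. It uses Python's `%` formatting as a stand-in for the Perl-style `sprintf` in the mrskew expressions. The file and block numbers are copied from the report above.

```python
# Python's % operator applies the same width rules as sprintf for %8s and %8d.
header = "%8s %8s" % ("FILE", "BLOCK")
rows = ["%8d %8d" % (p1, p2) for p1, p2 in [(2, 2), (33, 698186), (50, 339841)]]

print(header)
print("-" * len(header))
for row in rows:
    print(row)

# The generated header is identical to the hand-counted string.
hand_counted = "    FILE    BLOCK"
print(header == hand_counted)
```

Because the header and the rows share the same field widths, changing `%8` to another width in both places keeps the columns aligned with no recounting.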

It's a smart idea. The group label option should have been designed that way from the beginning. We eagerly approved the enhancement request and began thinking about the design.

When we thought it through, we ended up with two different ideas about how we could implement this new idea:

  1. Redefine ‑‑group‑label to take an expression instead of a string. mrskew will calculate the value of the expression before printing the column label.
  2. Create a new option, say, ‑‑new‑group‑label, that takes an expression as its argument. And leave ‑‑group‑label as it is.

The first idea is how the enhancement request was worded. The second idea entered our minds because the first idea creates a compatibility problem: if we change the spec of the existing ‑‑group‑label option, it will break some existing mrskew scripts. For example, these will work in Workbench 9.2.5:

--group-label=FILE
--group-label="FILE BLOCK"

But if we redefine ‑‑group‑label to take an expression instead of a string, then these won't work anymore. People will need to quote their string expressions like this:

--group-label='"FILE"'
--group-label='"FILE BLOCK"'
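The breakage comes from two layers: the shell removes one level of quotes, and then the option value is evaluated as an expression. The sketch below imitates both layers in Python. It is an analogy only: `label_from_argv` and `evaluate_label` are hypothetical helpers, Python's `eval` stands in for mrskew's expression evaluation, and the exact failure mode in mrskew may differ.

```python
import shlex

def sprintf(fmt, *args):
    """Stand-in for Perl's sprintf, using Python's %-formatting."""
    return fmt % args

def label_from_argv(fragment):
    """Return the option value the program receives after shell quote removal."""
    (arg,) = shlex.split(fragment)
    return arg.split("=", 1)[1]

def evaluate_label(expr):
    """Hypothetical new behavior: treat the option value as an expression."""
    return eval(expr, {"__builtins__": {}}, {"sprintf": sprintf})

fragments = [
    '--group-label="FILE BLOCK"',     # old style: a bare string reaches the program
    "--group-label='\"FILE BLOCK\"'", # new style: a quoted string expression
    "--group-label='sprintf(\"%8s %8s\", \"FILE\", \"BLOCK\")'",
]
for fragment in fragments:
    received = label_from_argv(fragment)
    try:
        print(repr(received), "->", repr(evaluate_label(received)))
    except (SyntaxError, NameError) as exc:
        print(repr(received), "-> fails:", type(exc).__name__)
```

In this analogy the old-style value arrives as the two bare words FILE BLOCK, which is not a valid expression, while the inner double quotes in the new style survive the shell and form a string literal.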

In the end, we decided to redefine the existing option and live with the compatibility breach.

The way we make decisions like this is that we create strenuous arguments for each idea. Here are some of the arguments we considered en route to our decision.

First, the customer experience (cognitive expenditure).

Everyone who participated in the debate had the customer experience foremost in mind. But how can we objectively measure "customer experience"? How do you structure a scientific debate about the superiority of one experience over another?

One way to do it is to measure cognitive expenditure—the amount of mental effort that a user has to invest to get the desired outcome from our software. We want to minimize cognitive expenditure, to maximize a customer's return on investment of effort.

We began by realizing that responding to this enhancement request with one of our two ideas would necessarily force the user into one of two new regimes:

  1. The syntax of ‑‑group-label has changed.
  2. There's a new ‑‑new-group-label option.

In regime 1, our users would have to learn the new syntax. That's a cognitive expenditure. But it's a one-time expenditure, which is good. The new syntax would be consistent with the existing ‑‑group syntax, which is actually a cognitive savings for our users over what we have now. However, if a customer had saved any scripts that used the old syntax, then the customer would have to convert those scripts. That's a cognitive expenditure in a loop (one for each script), which is bad.

In regime 2, our users would have to learn about ‑‑new-group‑label, which is a cognitive expenditure. They'd still have to remember (or relearn) ‑‑group‑label, too, which is a cognitive expenditure similar to the one in regime 1. They wouldn't have to modify any old scripts, but they would have to choose between ‑‑group‑label and ‑‑new-group‑label every time they wrote a script in the future. That's another cognitive expenditure in a loop (one for each script), which is bad.
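The per-script conversion in regime 1 is mechanical, which limits its cost. As a rough sketch, assuming the new option value is evaluated as a Perl-style expression, an old plain-string label could be rewritten like this. The helper names `to_expression` and `convert_option` are hypothetical and are not part of any Method R tool.

```python
import re
import shlex

def to_expression(label):
    """Wrap a plain label in double quotes, escaping characters that are
    special inside a Perl double-quoted string (backslash, quote, $, @)."""
    return '"' + re.sub(r'([\\"$@])', r'\\\1', label) + '"'

def convert_option(label):
    """Build a shell-safe --group-label argument for the expression form."""
    return "--group-label=" + shlex.quote(to_expression(label))

for old_label in ["FILE", "FILE BLOCK"]:
    print(convert_option(old_label))
```

For the two labels shown earlier, this prints the same quoted forms that the redefined option requires.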

Second, the developer experience (technical debt).

We also need to consider the developer's experience. We don't want to write code that adds technical debt and makes the product unnecessarily difficult to support.

If we redefine ‑‑group-label, there's no long-term effect to worry about. But if we add ‑‑new‑group‑label to the story, I would expect people to wonder: why are there two such similar group label options, when one (the one that takes an expression) is clearly superior? And why does the inferior one have the better name?

At some point in the future, I envision wanting to clean up the cruft and have just the one group label feature. Naturally, the right name for it would be ‑‑group‑label. But of course, changing the spec that way would introduce a compatibility problem. To make things worse, this would occur in the future when—one would hope, if our business is growing—such a decision would impact even more customers than it would today. So then, why create the cruft in the first place? It'll be a worse problem later than it is now.

The question that really seals the deal is this: whom will the change actually affect? It's a probability question about customer experiences.

Most users who use the Workbench application will never experience our group label option directly. It's there for everybody to use, but the Workbench has so many predefined reports built into it that most users never need to touch the group label option themselves. When they do need to modify it, they're usually tweaking a report that we've predefined for them, which is a low–cognitive-expenditure experience.

In the end, Method R Corporation bears almost the entire cost of the ‑‑group‑label redefinition. It required us to revise:

Most users will experience the benefit of the ‑‑group‑label change, without ever knowing that, once upon a time, it changed. And that's the way we want it. We want the product to be as smart as possible so that our customers get the most out of their investment, both cognitive and financial.

2 comments:

Jared said...

Thanks for this explanation.

There's always a lot of thought that goes into such changes.

The amount of effort required before any coding begins may be a surprise to some.

Tomasz Ziss said...

Thanks for sharing this post :)!
I really appreciated the explanation and analysis before coding action.