
Friday, February 27, 2015

What happened to “when the application is fast enough to meet users’ requirements?”

On January 5, I received an email called “Video” from my friend and former employee Guðmundur Jósepsson from Iceland. His friends call him Gummi (rhymes with “do me”). Gummi is the guy whose name is set in the ridiculous monospace font on page xxiv of Optimizing Oracle Performance, apparently because O’Reilly’s Linotype Birka font didn’t have the letter eth (ð) in it. Gummi once modestly teased me that this is what he is best known for. But I digress...

His email looked like this:


It’s a screen shot of frame 3:12 from my November 2014 video called “Why you need a profiler for Oracle.” At frame 3:12, I am answering the question of how you can know when you’re finished optimizing a given application function. Gummi’s question is, «Oi! What happened to “when the application is fast enough to meet users’ requirements?”»

Gummi noticed (the good ones will do that) that the video says something different from what he had heard me say for years. It’s a fair question. Why, in the video, have I said this new thing? It was not an accident.

When are you finished optimizing?

The question in focus is, “When are you finished optimizing?” Since 2003, I have actually used three different answers:
  A. When the cost of call reduction and latency reduction exceeds the cost of the performance you’re getting today.
    Source: Optimizing Oracle Performance (2003) pages 302–304.
  B. When the application is fast enough to meet your users’ requirements.
    Source: I have taught this in various courses, conferences, and consulting calls since 1999 or so.
  C. When there are no unnecessary calls, and the calls that remain run at hardware speed.
    Source: “Why you need a profiler for Oracle” (2014) frames 2:51–3:20.
My motive behind answers A and B was the idea that optimizing beyond what your business needs can be wasteful. I created these answers to deter people from misdirecting time and money toward perfecting something when those resources might be better invested improving something else. This idea was important, and it still is.

So, then, where did C come from? I’ll begin with a picture. The following figure allows you to plot the response time for a single application function, whatever “given function” you’re looking at. You could draw a similar figure for every application function on your system (although I wouldn’t suggest it).


Somewhere on this response time axis for your given function is the function’s actual response time. I haven’t marked that response time’s location specifically, but I know it’s in the blue zone, because at the bottom of the blue zone is the special response time RT. This value RT is the function’s top speed on the hardware you own today. Your function can’t go faster than this without upgrading something.

It so happens that this top speed is the speed at which your function will run if and only if (i) it contains no unnecessary calls and (ii) the calls that remain run at hardware speed. ...Which, of course, is the idea behind this new answer C.
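
If it helps to see that idea as arithmetic, here is a minimal sketch in Python. The counts and latencies are made up for illustration, not taken from any real trace: it models a function’s response time as count × latency summed over its calls, and its top speed as what remains when only the necessary calls survive, priced at hardware-speed latencies.

```python
# A minimal sketch with made-up counts and latencies (not from any real trace).
# Response time is modeled as sum(count x seconds/call) over call types.
# "Top speed" keeps only the necessary calls and prices them at hardware speed.
profile = [
    # (call type, observed count, necessary count, observed s/call, hardware s/call)
    ("CPU",                        1,    1, 1.2000, 1.2000),
    ("db file sequential read",  200,  200, 0.0080, 0.0050),  # reads slowed by a busy device
    ("SQL*Net round-trip",      3214,   33, 0.0004, 0.0004),  # one fetch per row vs. 100-row array fetches
    ("parse",                    400,    1, 0.0020, 0.0020),  # parse once, execute many
]

observed  = sum(cnt * obs for _, cnt, _, obs, _ in profile)
top_speed = sum(nec * hw for _, _, nec, _, hw in profile)

print(f"observed response time: {observed:.2f} s")   # about 4.89 s
print(f"top speed:              {top_speed:.2f} s")   # about 2.22 s
```

The gap between the two numbers is exactly what answer C tells you to attack: calls that don’t need to happen at all, and calls that are running slower than your hardware can run them.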

Where, exactly, is your “requirement”?

Answer B (“When the application is fast enough to meet your users’ requirements”) requires that you know the users’ response time requirement for your function, so, next, let’s locate that value on our response time axis.

This is where the trouble begins. Most DBAs don’t know what their users’ response time requirements really are. Don’t despair, though; most users don’t either.

At banks, airlines, hospitals, telcos, and nuclear plants, you need strict service level agreements, so those businesses invest in quantifying them. But realize: quantifying all your functions’ response time requirements isn’t about a bunch of users sitting in a room arguing over which subjective speed limits sound the best. It’s about knowing your technological speed limits and understanding how close to those limits your business needs to be, and what it is willing to pay to get there. It’s an expensive process. At some companies, it’s worth the effort; at most companies, it’s just not.

How about using, “well, nobody complains about it,” as all the evidence you need that a given function is meeting your users’ requirement? It’s how a lot of people do it. You might get away with doing it this way if your systems weren’t growing. But systems do grow. More data, more users, more application functions: these are all forms of growth, and you can probably measure every one of them happening where you’re sitting right now. All these forms of growth put you on a collision course with failing to meet your users’ response time requirements, whether you and your users know exactly what they are, or not.

In any event, if you don’t know exactly what your users’ response time requirements are, then you won’t be able to use “meets your users’ requirement” as your finish line that tells you when to stop optimizing. This very practical problem is the demise of answer B for most people.

Knowing your top speed

Even if you do know exactly what your users’ requirements are, it’s not enough. You need to know something more.

Imagine for a minute that you do know your users’ response time requirement for a given function, and let’s say that it’s this: “95% of executions of this function must complete within 5 seconds.” Now imagine that this morning when you started looking at the function, it would typically run for 10 seconds in your Oracle SQL Developer worksheet, but now after spending an hour or so with it, you have it down to where it runs pretty much every time in just 4 seconds. So, you’ve eliminated 60% of the function’s response time. That’s a pretty good day’s work, right? The question is, are you done? Or do you keep going?
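
As an aside, checking a requirement stated that way is a percentile calculation, not an average. Here is a minimal sketch with made-up timings (the numbers are purely illustrative):

```python
# A minimal sketch with made-up timings: check "95% of executions must
# complete within 5 seconds" against measured execution times.
import statistics

timings = [3.8, 4.1, 3.9, 4.0, 4.3, 3.7, 4.2, 4.6, 4.0, 3.9]  # seconds per execution

p95 = statistics.quantiles(timings, n=20)[-1]   # 95th percentile
verdict = "meets" if p95 <= 5.0 else "misses"
print(f"p95 = {p95:.2f} s, which {verdict} the 5-second requirement")
```

With these timings the function meets its requirement, which is exactly the situation this example puts you in: fast enough on paper, yet possibly nowhere near its top speed.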

Here is the reason that answer C is so important. You cannot responsibly answer whether you’re done without knowing that function’s top speed. Even if you know how fast people want it to run, you can’t know whether you’re finished without knowing how fast it can run.

Why? Imagine that 85% of those 4 seconds are consumed by Oracle enqueue, or latch, or log file sync calls, or by hundreds of parse calls, or 3,214 network round-trips to return 3,214 rows. If any of these things is the case, then no, you’re absolutely not done yet. If you were to allow some ridiculous code path like that to survive on a production system, you’d be diminishing the whole system’s effectiveness for everybody (even people who are running functions other than the one you’re fixing).

Now, sure, if there’s something else on the system that has a higher priority than finishing the fix on this function, then you should jump to it. But you should at least leave this function on your to-do list. Your analysis of the higher priority function might even reveal that this function’s inefficiencies are causing the higher-priority function’s problems. Such can be the nature of inefficient code under conditions of high load.

On the other hand, if your function is running in 4 seconds and (i) its profile shows no unnecessary calls, and (ii) the calls that remain are running at hardware speeds, then you’ve reached a milestone:
  1. if your code meets your users’ requirement, then you’re done;
  2. otherwise, either you’ll have to reimagine how to implement the function, or you’ll have to upgrade your hardware (or both).
There’s that “users’ requirement” thing again. You see why it has to be there, right?

Well, here’s what most people do. They get their functions’ response times reasonably close to their top speeds (which, with good people, isn’t usually as expensive as it sounds), and then they worry about requirements only if those requirements are so important that it’s worth a project to quantify them. A requirement is usually considered that important when it’s close to your top speed, or when violating it is really expensive.

This strategy works reasonably well.

It is interesting to note here that knowing a function’s top speed is actually more important than knowing your users’ requirements for that function. A lot of companies can work just fine not knowing their users’ requirements, but without knowing your top speeds, you really are in the dark. A second observation that I find particularly amusing is this: not only is your top speed more important to know, your top speed is actually easier to compute than your users’ requirement (…if you have a profiler, which was my point in the video).

Better and easier is a good combination.

Tomorrow is important, too

When are you finished optimizing?
  A. When the cost of call reduction and latency reduction exceeds the cost of the performance you’re getting today.
  B. When the application is fast enough to meet your users’ requirements.
  C. When there are no unnecessary calls, and the calls that remain run at hardware speed.
Answer A is still a pretty strong answer. Notice that it actually maps closely to answer C. Answer C’s prescription for “no unnecessary calls” yields answer A’s goal of call reduction, and answer C’s prescription for “calls that remain run at hardware speed” yields answer A’s goal of latency reduction. So, in a way, C is a more action-oriented version of A, but A goes further to combat the perfectionism trap with its emphasis on the cost of action versus the cost of inaction.

One thing I’ve grown to dislike about answer A, though, is its emphasis on today in “…exceeds the cost of the performance you’re getting today.” After years of experience with the question of when optimization is complete, I think that answer A under-emphasizes the importance of tomorrow. Unplanned tomorrows can quickly become ugly todays, and as important as tomorrow is to businesses and the people who run them, it’s even more important to another community: database application developers.

Subjective goals are treacherous for developers

Many developers have no way to test, today, the true production response time behavior of their code, which they won’t learn until tomorrow. ...And perhaps not until some remote, distant tomorrow.

Imagine you’re a developer using 100-row tables on your desktop to test code that will access 100,000,000,000-row tables on your production server. Or maybe you’re testing your code’s performance only in isolation from other workload. Both of these are problems; they’re procedural mistakes, but they are everyday real-life for many developers. When this is how you develop, telling you that “your users’ response time requirement is n seconds” accidentally implies that you are finished optimizing when your query finishes in less than n seconds on your no-load system of 100-row test tables.
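
If you want to see how treacherous that implication is, here is a minimal sketch. It uses sqlite3 only because it ships with Python and keeps the example self-contained, and the row counts are deliberately tiny so it runs in seconds; the point is simply that the elapsed time of the same unindexed lookup grows with the size of the table it scans.

```python
# A minimal sketch: time the same unindexed lookup at a few table sizes.
# sqlite3 stands in for any database; row counts are tiny so this runs quickly.
import sqlite3
import time

def timed_lookup(rows):
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE t (id INTEGER, payload TEXT)")
    con.executemany("INSERT INTO t VALUES (?, ?)",
                    ((i, "x" * 20) for i in range(rows)))
    start = time.perf_counter()
    # No index on id, so this is a full scan of the table.
    con.execute("SELECT payload FROM t WHERE id = ?", (rows - 1,)).fetchall()
    elapsed = time.perf_counter() - start
    con.close()
    return elapsed

for rows in (100, 10_000, 1_000_000):
    print(f"{rows:>9} rows: {timed_lookup(rows) * 1000:8.2f} ms")
```

A timing taken at 100 rows tells you almost nothing about the same statement at production volume, and even this sketch ignores the second problem above: it runs with no competing workload at all.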

If you are a developer writing high-risk code—and any code that will touch huge database segments in production is high-risk code—then of course you must aim for the “no unnecessary calls” part of the top speed target. And you must aim for the “and the calls that remain run at hardware speed” part, too, but you won’t be able to measure your progress against that goal until you have access to full data volumes and full user workloads.

Notice that to do both of these things, you must have access to full data volumes and full user workloads in your development environment. To build high-performance applications, you must do full data volume testing and full user workload testing in each of your functional development iterations.

This is where agile development methods yield a huge advantage: agile methods provide a project structure that encourages full performance testing for each new product function as it is developed. Contrast this with the terrible project planning approach of putting all your performance testing at the end of your project, when it’s too late to actually fix anything (if there’s even enough budget left over by then to do any testing at all). If you want a high-performance application with great performance diagnostics, then performance instrumentation should be an important part of your feedback for each development iteration of each new function you create.
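
What that instrumentation looks like depends on your stack. Here is a minimal sketch in Python, where the decorator and the post_invoice function are hypothetical and exist only to show the shape of the idea: every instrumented application function records its elapsed time, so each iteration produces response time measurements you can review alongside its functional results.

```python
# A minimal sketch of per-function response time instrumentation.
# The names here (instrumented, post_invoice) are hypothetical.
import functools
import time
from collections import defaultdict

timings = defaultdict(list)   # function name -> list of elapsed seconds

def instrumented(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            timings[fn.__name__].append(time.perf_counter() - start)
    return wrapper

@instrumented
def post_invoice(lines):
    # Stand-in for a real application function.
    time.sleep(0.01 * len(lines))

post_invoice(range(3))
for name, samples in timings.items():
    print(f"{name}: {len(samples)} call(s), worst {max(samples):.3f} s")
```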

My answer

So, when are you finished optimizing?
  A. When the cost of call reduction and latency reduction exceeds the cost of the performance you’re getting today.
  B. When the application is fast enough to meet your users’ requirements.
  C. When there are no unnecessary calls, and the calls that remain run at hardware speed.
There is some merit in all three answers, but as Dave Ensor taught me inside Oracle many years ago, the correct answer is C. Answer A specifically restricts your scope of concern to today, which is especially dangerous for developers. Answer B permits you to promote horrifically bad code, unhindered, into production, where it can hurt the performance of every function on the system. Answers A and B both presume that you know information that you probably don’t know and that you may not need to know. Answer C is my favorite answer because it tells you exactly when you’re done, using units you can measure and that you should be measuring.

Answer C is usually a tougher standard than answer A or B, and when it’s not, it is the best possible standard you can meet without upgrading or redesigning something. In light of this “tougher standard” kind of talk, it is still important to understand that what is optimal from a software engineering perspective is not always optimal from a business perspective. The term optimized must ultimately be judged within the constraints of what the business chooses to pay for. In the spirit of answer A, you can still make the decision not to optimize all your code to the last picosecond of its potential. How perfect you make your code should be a business decision. That decision should be informed by facts, and these facts should include knowledge of your code’s top speed.

Thank you, Guðmundur Jósepsson, of Iceland, for your question. Thank you for waiting patiently for several weeks while I struggled putting these thoughts into words.

Thursday, June 7, 2012

An Organizational Constraint that Diminishes Software Quality

One of the biggest problems in software performance today occurs when the people who write software are different from the people who are required to solve the performance problems that their software causes. It works like this:
  1. Architects design a system and pass the specification off to the developers.
  2. The developers implement the specs the architects gave them, while the architects move on to design another system.
  3. When the developers are “done” with their phase, they pass the code off to the production operations team. The operators run the system the developers gave them, while the developers move on to write another system.
The process is an assembly line for software: architects specialize in architecture, developers specialize in development, and operators specialize in operating. It sounds like the principle of industrial efficiency taken to its logical conclusion in the software world.


In this waterfall project plan,
architects design systems they never see written,
and developers write systems they never see run.

Sound good? It sounds like how Henry Ford made a lot of money building cars... Isn’t that how they build roads and bridges? So why not?

With software, there’s a horrible problem with this approach. If you’ve ever had to manage a system that was built like this, you know exactly what it is.

The problem is the absence of a feedback loop between actually using the software and building it. It’s a feedback loop that people who design and build software need for their own professional development. Developers who never see their software run don’t learn enough about how to make their software run better. Likewise, architects who never see their systems run have the same problem, only it’s worse, because (1) their involvement is even more abstract, and (2) their feedback loops are even longer.

Who are the performance experts in most Oracle shops these days? Unfortunately, it’s most often the database administrators, not the database developers. It’s the people who operate a system who learn the most about the system’s design and implementation mistakes. That’s unfortunate, because the people who design and write a system have so much more influence over how a system performs than do the people who just operate it.

If you’re an architect or a developer who has never had to support your own software in production, then you’re probably making some of the same mistakes now that you were making five years ago, without even realizing they’re mistakes. On the other hand, if you’re a developer who has to maintain your own software while it’s being operated in production, you’re probably thinking about new ways to make your next software system easier to support.

So, why is software any different than automotive assembly, or roads and bridges? It’s because software design is a process of invention. Almost every time. When is the last time you ever built exactly the same software you built before? No matter how many libraries you’re able to reuse from previous projects, every system you design is different from any system you’ve ever built before. You don’t just stamp out the same stuff over and over.

Software is funny that way, because the cost of copying and distributing it is vanishingly small. When you make great software that everyone in the world needs, you write it once and ship it at practically zero cost to everyone who needs it. Cars and bridges don’t work that way. Mass production and distribution of cars and bridges requires significantly more resources. The thousands of people involved in copying and distributing cars and bridges don’t have to know how to invent or refine cars or bridges to do great work. But with software, since copying and distributing it is so cheap, almost all that’s left is the invention process. And that requires feedback, just like inventing cars and bridges did.

Don’t organize your software project teams so that they’re denied access to this vital feedback loop.

Friday, June 17, 2011

Using Agile Practices to Create an Agile Presentation

What’s the best way to make a presentation on Agile practices? Practice Agile practices.

You could write a presentation “big bang” style, delivering version 1.0 in front of your big audience of 200+ people at Kscope 2011 before anybody has seen it. Of course, if you do it that way, you build a lot of risk into your product. But what else can you do?

You can execute the Agile practices of releasing early and often, allowing the reception of your product to guide its design. Whenever you find an aspect of your product that doesn’t get the enthusiastic reception you had hoped for, you fix it for the next release.

That’s one of the reasons that my release schedule for “My Case for Agile Methods” includes a little online webinar hosted by Red Gate Software next week. My release schedule is actually a lot more complicated than just one little pre-ODTUG webinar:

2011-04-15  Show key conceptual graphics to son (age 13)
2011-04-29  Review #1 of paper with employee #1
2011-05-18  Review #2 of paper with customer
2011-05-14  Review #3 of paper with employee #1
2011-05-18  Review #4 of paper with employee #2
2011-05-26  Review #5 of paper with employee #3
2011-06-01  Submit paper to ODTUG web site
2011-06-02  Review #6 of paper with employee #1
2011-06-06  Review #7 of paper with employee #3
2011-06-10  Submit revised paper to ODTUG web site
2011-06-13  Present “My Case for Agile Methods” to twelve people in an on-site customer meeting
2011-06-22  Present “My Case for Agile Methods” in an online webinar hosted by Red Gate Software
2011-06-27  Present “My Case for Agile Methods” at ODTUG Kscope 2011 in Long Beach, California

(By the way, the vast majority of the work here is done in Pages, not Keynote. I think using a word processor, not an operating system for slide projectors.)

Two Agile practices are key to everything I’ve ever done well: incremental design and rapid iteration. Release early, release often, and incorporate what you learn from real world use back into the product. The magic comes from learning how to choose wisely in two dimensions:
  1. Which feature do you include next?
  2. To whom do you release next?
The key is to show your work to other people. Yes, there’s tremendous value in practicing a presentation, but practicing without an audience merely reinforces, it doesn’t inform. What you need while you design something is information—specifically, you need the kind of information called feedback. Some of the feedback I receive generates some pretty energetic arguing. I need that to fortify my understanding of my own arguments so that I’ll be more likely to survive a good Q&A session on stage.

To lots of people who have seen teams run projects into the ground using what they call “Agile,” the word “Agile” has become a synonym for sloppy, irresponsible work habits. When you hear me talk about Agile, you’ll hear about practices that are highly disciplined and that actually require a lot of focus, dedication, commitment, practice, and plain old hard work to execute.

Agile, to me, is about injecting discipline into a process that is inevitably rife with unpredictable change.