
Wednesday, April 3, 2013

And now, ...the video

First, I want to thank everyone who responded to my prior blog post and its accompanying survey, where I asked when video is better than a paper. As I mentioned already in the comment section for that blog post, the results were loud and clear: 53.9% of respondents indicated that they’d prefer reading a paper, and 46.1% indicated that they’d prefer watching a video. Basically a clean 50/50 split.

The comments suggested that people have a lower threshold for “polish” with a video than with a paper, so one of the ways I’ve needed to modify my thinking is to just create decent videos and publish them without expending a lot of effort on editing.

But how?

Well, the idea came to me in the process of agreeing to perform a 1-hour session for my good friends and customers at The Pythian Group. Their education department asked me if I’d mind if they recorded my session, and I told them I would love it if they would. They encouraged me to open it up to the public, which of course I thought (cue light bulb over head) was the best idea ever. So, really, it’s not a video as much as it’s a recorded performance.

I haven’t edited the session at all, so it’ll have rough spots and goofs and imperfect audio... But I’ve been eager ever since the day of the event (2013-02-15) to post it for you.

So here you go. For the “other 50%” who prefer watching a video, I present to you: “The Method R Profiling Ecosystem.” If you would like to follow along in the accompanying paper, you can get it here: “A First Look at Using Method R Workbench Software.”

Sunday, December 4, 2011

Gwen Shapira on SSD

If you haven’t seen Gwen Shapira’s article about de-confusing SSD, I recommend that you read it soon.

One statement stood out as an idea on which I wanted to comment:
If you don’t see significant number of physical reads and sequential read wait events in your AWR report, you won’t notice much performance improvements from using SSD.
I wanted to remind you that you can do better. Even if you do notice a significant number of physical reads and sequential read wait events in your AWR report, it’s still not certain that SSD will improve the performance of the task whose performance you’re hoping to improve. You don’t have to guess about the effect that SSD will have upon any business task you care about. In 2009, I wrote a blog post that explains how.

Friday, November 18, 2011

I Can Help You Trace It

The first product I ever created after leaving Oracle Corporation in 1999 was a 3-day course about optimizing Oracle performance. The experiences of teaching this course from 2000 through 2003 (heavily revising the material each time I taught it) added up to the knowledge that Jeff Holt and I needed to write Optimizing Oracle Performance (2003).

Between 2000 and 2006, I spent many weeks on the road teaching this 3-day course. I stopped teaching it in 2006. An opportunity to take or teach a course ought to be a joyous experience, and this one had become too much of a grind. I didn’t figure out how to fix it until this year. How I fixed it is the story I’d like to tell you.

The Problem

The problem was simply inefficiency. The inefficiency began with the structure of the course: the 3-day lecture marathon. Realize that 6 hours a day × 3 days = 18 hours of sitting in a chair, listening attentively to a single voice (my voice), is the equivalent of a 6-week university term of a 3-credit-hour course, taught straight through in three days. There’s no hour-plus homework assignment after each hour of lecture to reinforce the lessons; just a full semester’s worth of listening to one voice, straight through, for three days. What retention rate would you expect from a university course compressed into just 3 days?

So, I optimized. I have created a new course that lasts one day (not even an exhausting full day at that). But how can a student possibly learn as much in 1 day as we used to teach in 3 days? Isn’t a 1-day event bound to be a significantly reduced-value experience?

On the contrary, I believe our students benefit even more now than they used to. Here are the big differences, so you can see why.

The Time Savings

In the 3-day course, I would spend half a day explaining why people should abandon their old system-wide-ratio-based ways of managing system performance. In the new 1-day course, I spend less than an hour explaining the Method R approach to thinking about performance. The point of the new course is not to convince people to abandon anything they’re already doing; it’s to show students the tremendous additional opportunities that are available to them if they’ll just look at what Oracle trace files have to offer. Time savings: 2 hours.

In the 3-day course, I would spend a full day explaining how to interpret trace data. By hand. There were a few little lab exercises, about an hour’s worth, in which students would enter dozens of numbers from trace files into laptops or pocket calculators and write results on worksheets. In the new 1-day course, the software tools that a student needs to interpret files of any size—or even directories full of files—are included in the price of the course. Time savings: 5 hours.

In the 3-day course, I would spend half a day explaining how to collect trace data. In the new 1-day course, the software tools that a student needs to get started collecting trace files are included in the price of the course. For software architectures that require more work than our software can do for you, there’s detailed instruction in the course book. Time savings: 3 hours.

In the 3-day course, I would spend half a day working through about five example cases using a software tool to which students would have access for 30 days after they had gone home. In the new 1-day course, I spend one hour working through about eight example cases using software tools that every student will take home and keep forever. I can spend less time per case yet teach more because the cases are thoroughly documented in the course book. So, in class, we focus on the high-level decision making instead of the gnarly technical details you’ll want to look up later anyway. Time savings: 3 hours.

...That’s 13 classroom hours we’ve eliminated from the old 3-day experience. I believe that in these 13 hours, I was teaching material that students weren’t retaining to begin with.

The Book

The next big difference: the book.

In the old 3-day course, I distributed two books: (1) the “Course Notebook,” which was a black and white listing of the course PowerPoint slides, and (2) a copy of Optimizing Oracle Performance (O’Reilly 2003). The O’Reilly book was great, because it contained a lot of detail that you would want to look up after the course. But of course it doesn’t contain any new knowledge we’ve learned since 2003. The Course Notebook, in my opinion, was never worth much to begin with. (In my opinion, no PowerPoint slide printout is worth much to begin with.)

The Mastering Oracle Trace Data (MOTD) book we give each student in my new 1-day course is a full-color, perfect-bound book that explains the course material and far more in deep detail. It is full-color for an important reason. It’s not gratuitous or decorative; it’s because I’ve been studying Edward Tufte. I use color throughout the book to communicate detailed, high-resolution information faster to your brain.

Color in the book helps to reduce student workload and deliver value long after a student has left the classroom. In this class, there is no collection of slide printouts like you’ve archived after every Oracle class you’ve been to since the 1980s. The MOTD book is way better than any other material I’ve ever distributed in my career. I’ve heard students tell their friends that you have to see it to believe it.
“A paper record tells your audience that you are serious, responsible, exact, credible. For deep analysis of evidence and reasoning about complex matters, permanent high-resolution displays [that is, paper] are an excellent start.” —Edward Tufte

The Software

So, where does a student recoup all the time we used to spend going through trace files, and studying how to collect trace data on half a dozen different software architectures? In the thousands of man-hours we’ve invested in the software that we give you when you come to the course. Instead of explaining every little detail about quirks in Oracle trace data that change between Oracle versions 10.1, 10.2, and 11.2, or between 11.2.0.2 and 11.2.0.4, the software does the work for you. Instead of having to explain all the detail work, we have time to explain how to use the results of our software to make decisions about your data.

What’s the catch? Of course, we hope you’ll love our software and want to buy it. The software we give you is completely full-featured and yours to keep forever, but the license limits you to using it only with one login id, and it doesn’t include patches and upgrades, which we release a few times each year. We hope you’ll love our software so much that you’ll want to buy a license that lets you use it on any of your systems and that includes the right to upgrade as we fix bugs and add features. We hope you’ll love it so much that you encourage your colleagues to buy it.

But there’s really no catch. You get software and a course (and a book and a shirt) for less than the daily rate that we used to charge for just a course.

A Shirt?

MOTD London 2011-09-08: “I can help you trace it.”
Yes, a shirt. Each student receives a Method R T-shirt that says, “I can help you trace it.” We don’t give these things away to anyone except for students in my MOTD course. So if you see one, the person wearing it can, in actual fact, Help You Trace It.

The Net Result

The net result of all this optimization is benefits on several fronts:
  • The course costs a lot less than it used to. The fee is presently only about 25% of the 3-day course’s price, and the whole experience requires less than ⅓ of the time away from work that the original course did.
  • In the new course, our students don’t have to work so hard to make productive use of the course material. The book and the software take so much of the pressure off. We do talk about what the fields in raw trace data mean—I think it’s necessary to know that in order to use the data properly, and have productive debates with your sys/SAN/net/etc. administration colleagues. But we don’t spend your time doing exercises to untangle nested (recursive) calls by hand. The software you take home does that for you. That’s why it is so much easier for a student to put this course to work right away.
  • Since the course duration is only one day, I can visit far more cities and meet far more students each year. That’s good for students who want to participate, and it’s great for me, because I get to meet more people.

Plans

The only thing missing from our Mastering Oracle Trace Data course right now is you. I have taught the event now in Southlake, Texas (our home town), in Copenhagen, and in London. It’s field-tested and ready to roll. We have several cities on my schedule right now. I’ll be teaching the course in Birmingham UK on the day after UKOUG wraps up, December 8. I’ll be doing Orlando and Tampa in mid-December. I’ll teach two courses this coming January in Manhattan and Long Island. There’s Billund (Legoland) DK in April. We have more plans in the works for Seattle, Portland, Dallas, and Cleveland, and we’re looking for more opportunities.

Spread the word by linking the official MOTD sticker to http://method-r.com/.
My wish is for you to help me book more cities in North America and Europe (I hope to expand beyond that soon). If you are part of a company or a user group with colleagues who would be interested in attending the course, I would love to hear from you. Registering en masse saves you money. The magic number for discounting is 10 students on a single registration from one company or user group.

I can help you trace it.

Friday, June 3, 2011

Why KScope?

Early this year, my friend Mike Riley from ODTUG asked me to write a little essay in response to the question, “Why Kscope?” that he could post on the ODTUG blog. He agreed that cross-posting would help the group reach more people, so I’ve reproduced my response to that question here. I’ll hope to see you at Kscope11 in Long Beach June 26–30. If you develop applications for Oracle systems, you need to be there.

MR: Why KScope?

CM: Most people in the Oracle world who know my name probably think of me as a database administrator. In my heart, I am a software designer and developer. Before my career with Oracle, I worked in the semiconductor industry as a language designer. I wrote compilers for a living. Designing and writing software has always been my professional true love. I’ve never strayed too far away from it; I’ve always found a reason to write software, no matter what my job has been. [Ed: Examples include the Oracle*APS suite and a compiler design project he did for Great West Life in the 1990s, the queueing theory models he worked on in the late 1990s, the Method R Profiler software (Cary wrote all the XSLT code), and finally today, he spends about half of his time designing and writing the MR Tools suite.]

My career as an Oracle performance specialist is really a natural extension of my software development background. It is still really weird to me that in the Oracle market, performance is regarded as a job done primarily by operations people instead of by development people. Developers control at least 90% of the leverage over how fast an application will be able to run. I think that performance became a DBA responsibility in the formative years of our Oracle world because so many early Oracle projects had DBA teams but no professional development teams.

Most of those big projects were people implementing big off-the-shelf applications like Oracle Financial and Manufacturing Applications (which grew into the Oracle E-Business Suite). The only developers that most of those implementation teams had were what I would call nonprofessional developers. Now, I don’t mean people who were in any way unprofessional. I mean they were predominantly businesspeople who had never been educated as software developers, but who’d been told that of course anybody could write computer programs in this new “fourth-generation language” called SQL.

Just about any time you implement a vendor’s highly customizable new application with 20,000+ database objects underneath it, you’re going to run into performance problems. Someone had to attend to those problems, and the DBAs and sysadmins were the only technical people anywhere near the project who could do it. Those DBAs and Oracle sysadmins were also the people who organized the early Oracle conferences, and I think this is where the topic of “performance tuning” became embedded into the DBA track.

The resulting problem that I still see today is that the topic became dominated by “tips and techniques”—lists of tricks that operational people could try to maybe make their systems go a little bit faster. The word “tuning” says it all. I almost never use the word except facetiously, because it’s a cheap imitation of what systems really need, which is performance optimization, which is what designers and developers of software are supposed to do. Even the evolution of Oracle tools for the performance analyst mirrors this post-production tips-and-techniques “tuning” mentality. That’s why most performance management tools you see today are predominantly oriented toward viewing performance from a system resource perspective (the DBA’s perspective), rather than the code path perspective (the developer’s perspective).

The whole key to performance is the application design and development team, especially when you realize that the performance of an application is not just its code path speed, but its overall interaction with the person using it. So many of the performance problems that I’ve found are caused by applications that are just stupid in how they’re designed to interact with me. For example, if you’ve seen my “Messed-up apps” presentation before, you might remember the self-service bus ticket kiosk that made me wait for over a minute while the application tallied the more-than-2,000 different bus trips for which I might want to buy a ticket. That’s an app with a broken specification. There’s nothing that a run-time operations team can do to make that application any fun to use (short of sending it back for redesign).

My goal as a software designer is not just to make software that runs quickly. My goal is also to make applications that are delightful to use. It’s the difference between an application that you use because you must and one that feels like it’s a necessary part of who you are. Making software like that is the kind of thing that a designer learns from studying Don Norman, Edward Tufte, Christopher Alexander, and Jonathan Ive. It’s a level of performance that just isn’t on the menu for operational run-time support staff to even think about, because it’s beyond their control.

So: why Kscope? The ODTUG conferences are the best places I can go in the Oracle market where I can be with people who think and talk about these things. …Or for that matter, who understand that these ideas even exist and deserve to be studied. KScope is just the right place for me to be.

Sunday, September 26, 2010

My Actual OTN Interview

And now, the actual OTN interview (9:11) is online. Thank you, Justin; it was a lot of fun. And thank you to Oracle Corporation for another great show. It's an ever-growing world we work in, and I'm thrilled to be a part of it.

Friday, September 10, 2010

New Method R Blogs

Today we installed two new blogs over at our method-r.com web page. We created them to give us—all of us at Method R—a place to talk about our products and professional experiences. I hope you'll come have a look.

I'll still be posting here, too.

Thursday, September 2, 2010

My OTN Interview at OOW2010 (which hasn’t happened yet)

Yesterday, Justin Kestelyn from Oracle Technology Network sent me a note specifying some logistics for our OTN interview together called “Coding for Performance, with Cary Millsap.” It will take place at Oracle OpenWorld on Monday, September 20 at 11:45am Pacific Time.

One of Justin’s requests in his note was, “Topics: Please suggest 5 topics for discussion?” So, I thought for a couple of minutes about questions I’d like him to ask me, and I wrote down the first one. And then I thought to myself that I might as well write down the answer I would hope to say to this; maybe it’ll help me remember everything I want to say. Then I wrote another question, and the answer just flowed, and then another, and another. Fifteen minutes later, I had the whole thing written out.

I told Justin all this, and we agreed that it would be fun to post the whole interview here on my blog, before it ever happened. And then during the actual interview, we’ll see what actually happens. It’ll all be in Justin’s hands by then.

So, here we go. Justin Kestelyn’s interview, “Coding for Performance, with Cary Millsap.” Which hasn’t happened yet.

◆ ◆ ◆

Justin: Hi, Cary. Welcome to the show, etc., etc.

Cary: Hi, Justin. It’s great to be here. Thank you for having me, etc., etc.


Justin: So tell me, ... What is the most important thing to know about performance?

Cary: Performance is all about code path. There are only three ways that a program can consume your time. There’s (1) the actual execution of your program’s instructions. There’s (2) queueing delay, which is what you get when you visit a resource that’s busy serving someone else (CPU, disk, network, etc.). And there’s (3) coherency delay, which is when you await some other process’s permission to execute your next step. The code you’re running controls all three of those ways you can spend time. So understanding performance is all about understanding code, whether it’s Java or PHP or C# that you wrote, or the C code that the Oracle Database kernel developers have written for you.


Justin: Is tuning SQL or PL/SQL any different from tuning Java or PHP or C#?

Cary: The tools are a little different, but the fundamentals are exactly the same. You find out which code path in your application is consuming your time, and then you go after it. The best thing to do is figure out a way to execute that code path less often (because the fastest way to do anything is to not do it at all). The next best thing to do is try to figure out a way to make any instructions that can’t be eliminated, faster. That’s the whole trick.


Justin: You make it sound easy.

Cary: It usually is easy once you can collect the data you need to guide you. ...Once you know how to get the system to tell you where it’s spending your time. People make it hard on themselves anytime they try to use performance data that includes information about anything other than the specific user experience they’re trying to fix. Like when they try to fix the performance of some click on a web form by looking at CPU utilization data on their application server or their database server.

Another thing that makes it really hard is the design of the application. More tiers means more complexity when it comes time to diagnose performance problems. And some User Interface designs are just guaranteed to create performance problems. My presentation here called “Messed-Up Apps” is a showcase of a few of those kinds of designs. The message there is that performance is something that has to be designed into an application from the start, like any other feature. Performance is not something you can paint on at the end.


Justin: What can developers do to maximize the performance of the applications they write?

Cary: The most important thing is to remember a couple of key ideas. First, Barry Boehm showed that the cost of repairing defects increases hyperbolically, the later you find them in your development and deployment life cycle. That’s true for performance defects just like it is for functional defects. Second, what Donald Knuth wrote 40 years ago is still true today: when developers try to guess where their code is slow, they do an awful job. Even great developers, when they profile the response time of their code, they’re often surprised at where that code is spending their (or their users’) time. So, profiling early in the software development life cycle is vital.

Next, it’s important to test. Not just functional requirements, but test performance requirements, too. Finally, it’s important to realize that there’s no way that your testing can catch every performance problem that can go wrong, so it’s important to make your application code easy to diagnose and repair in production. You do that with good instrumentation so your production system managers can profile in production when they need to, just like the developers do on the development and test systems.


Justin: How do you—a developer—profile your code?

Cary: Every development language has profiling tools that go with it. They’re tools that you can point at your application when it runs to show exactly where every smidgen of response time is being consumed within that code. The first profiler I was ever aware of is the -pg flag on C compilers. You compile your code with gcc -pg, and then after you run your program, you can use gprof to profile where your time went. Java has profilers, PHP has them, Perl, C#, C++, all of them.

Even the Oracle Database has a profiling capability, but they don’t call it “profiling” (that name means something else in the Oracle documentation). The extended SQL trace data that Oracle emits when you do the right DBMS_MONITOR.SESSION_TRACE_ENABLE call is a written record of where every bit of your response time went. That’s “profiling,” in the computer science sense of the word. Those files have been the basis of my career as a performance analyst and software tools author for the past 20 years or so.


Justin: Tell us a little bit about your company, Method R. You founded it a couple of years ago?

Cary: Yes, I started a new company called Method R Corporation in April 2008. We’ve had a great time writing tools for people and performing services (teaching and consulting) to help people solve their performance problems. Our core business is building tools that help people do for themselves what we know how to do with performance. The trace data that Oracle emits is very complicated, and we have software tools that make it easy to get what you need from those trace files.

We also have an extension for SQL Developer that makes it easy to get the trace data itself, while you’re developing a new SQL- or PL/SQL-based application. We’re also working on a number of very large development projects for customers in which we’re writing complex application code that has to scale to outrageous workloads. We’re always looking for ways we can help people.


Justin: Well, that’s all the time we have today. I really enjoyed talking with you, etc., etc.

Cary: Oh, it was my pleasure. Thank you for having me, and good luck with the rest of the Show.

◆ ◆ ◆

...Which hasn’t happened yet. ☺

I hope you enjoyed, and I’ll look forward to seeing you at Oracle OpenWorld 2010.

I’ll be presenting “Messed-Up Apps: a study of performance antipatterns” at the ODTUG User Group Forum on Sunday at 3:00pm, and “Thinking Clearly about Performance” at the main event on Tuesday at 12:30pm. See you there!

Monday, August 9, 2010

Mister Trace

For the past several weeks, my team at Method R has been working hard on a new software tool that we released today. It is an extension for Oracle SQL Developer called Method R Trace. We call it MR Trace for short.

MR Trace is for SQL and PL/SQL developers who care about performance. Every time you execute code from a SQL Developer worksheet, MR Trace automatically copies a carefully scoped trace file to your SQL Developer workstation. There, you can open it with any application you want, just by clicking. You can tag it for easy lookup later. There’s a 3-minute video if you’re interested in seeing what it looks like.

I’m particularly excited about MR Trace because it’s the smallest software tool we’ve ever designed. That may sound funny to a lot of people, but it won’t sound funny to you if you’ve read Rework by Jason Fried and David Heinemeier Hansson of 37signals. MR Trace does a seemingly very small thing—it gets your trace file for you—but if you’ve ever done that job yourself, you might get a kick out of seeing it happen so automatically, so simply, and so quickly.

The thing is, the normal process of getting trace files is raw misery for many of our friends. It’s a common story: “If I trace some SQL, then to get my trace files, I have to call up my SA or DBA. I apologize for the interruption and hope he’s in a good mood. I tell him what system I need him to look at. He has to figure out which trace files are the ones I need, and then he FTPs them over to where I can get to them. I try not to bother him, but there’s no other way.”

Most places don’t have any security reasons to prohibit developers from getting their trace files, but they just don’t have the time or the interest to create procedures that developers can use to fetch only the files they’re supposed to see. The resulting bother is so labor-intensive and so demotivating that developers stop fighting and just move on without trace files to guide them.

That’s a big problem: if you can’t see how the code you write consumes response time, then how can you optimize it? How can you even know if you should try to? If you have to guess where your code spends time, then you can’t possibly think clearly about performance.

We have tried to design MR Trace to be a beautiful little application that does exactly what you need by staying out of your way. If we did it right, then you won’t be thinking about MR Trace whenever you use it; you’ll just have the trace files you need, right where and when you need them. And you’ll have them with no extra typing, no extra clicks, and—for goodness’ sake—certainly no more phone calls or trips down the hall. ...Unless it’s to show off a performance problem you found and fixed before anyone else ever noticed it.

Key information:

  • Name: Method R Trace
  • Type: Extension for Oracle SQL Developer
  • Function: Zero-click trace file collector
  • Price: $49.95 USD
  • Risk: 30-day free trial
  • URL: http://method-r.com/software/mrtrace
  • Designer: Method R Corporation

Monday, December 21, 2009

My Whole System Is Slow. Now What?

At CMG'09 a couple of weeks ago, I presented "Measuring Response Times of Code on Oracle Systems." The paper for this presentation was a subset of "For Developers: Making Friends with the Oracle Database." In the presentation, I spent a few minutes talking about why to measure response times in Oracle, and then I spent a lot of minutes talking about how. As usual, I focused heavily on the importance of measuring response times of individual business tasks executed by individual end users.

At the end of the talk, a group of people came to the podium to ask questions (always a good sign). The first question was the question that a lot of people ask. It was:
My whole system is slow. That's all my users will tell me. So then, how do I begin to do what you're describing?
Here's the answer:
Ask your users to show you what they're doing. Just go look at it.
The results of this simple advice are routinely spectacular. Just go look at it: I'm surprised whenever someone doesn't think of doing that, but I shouldn't be. That's because I didn't do it either, for the longest time. I had to learn to. And that's the story I want to tell you here.

In the early 1990s, I was a consultant with Oracle Corporation visiting clients with performance problems at a pace of more than 30 per year. Back then, I did Oracle performance work the old-fashioned way: I checked everything I knew how to check, and then I fixed everything I knew how to fix. All billable by the hour. (Note: When I was doing it this way, I had not yet been taught by Dave Ensor, who changed me forever.)

On weeks when I was lucky, I'd be finished checking and fixing by sometime Wednesday, leaving a couple of days to find out what people thought of my work. If I were lucky again (that's two "lucky"s now), everyone would be thrilled with the results. I'd get my hug (so to speak), and I'd catch my flight.

But I wasn't always lucky. Some weeks, I wouldn't find anything suspicious in my checking and fixing. Some weeks, I'd find plenty, but still not everyone would be thrilled with the work. Having people be less than thrilled with my work caused pain for me, which motivated me to figure out how to take more control of my consulting engagements, to drive luck out of the equation.

The most important thing I figured out was...
People knew before I came on-site how they were going to measure on Thursday whether they liked the results of my work.
And...
They were willing to tell me on Monday.
All I had to do was be honest, like this:
On the day I'm done working here, I'd imagine you're going to want to run something that will demonstrate whether I accomplished what you were hoping for while I was here. Would you mind telling me about that now? Maybe even showing me?
I could ask that on Monday, and people were glad to tell me. I'd watch the things run and record how long they ran, and then I'd know how to prioritize my time on site. I'd record how long they ran so at the end of my engagement, I'd be able to show very clearly what improvements I had made.

Sometimes, there would be thirty different things that people would expect to measure on Thursday. If I might not have time to fix them all, then I needed to make sure that I knew the priority of the things I was being asked to fix.

That one step alone—knowing on Monday that prioritized list of what tasks needed to be fast by Thursday—drastically reduced my reliance on luck as a success factor in my job at these sites. Knowing that list on Monday is just like when your teacher in school tells you exactly what's going to be on your next test. It allows you to focus your attention on exactly what you need to do to optimize your reward for the week. (Note to fellow education enthusiasts: Please don't interpret this paragraph as my advocating the idea that GPA should be a student's sole—or even dominant—optimization constraint.)

So, what I learned is that the very first step of any good performance optimization method is necessarily this:
1. Identify the task that's the most important to you.
When I say "task," think "program" or "click" or "batch job" if you want to. What I mean is "a useful unit of work that makes sense to the business." ...Something that a business user would show you if you just went and watched her work for a few minutes.

Then comes step two:
2. Measure its response time (R). In detail.
Why is response time so important? Because that's what's important to the person who'll be watching it run on Thursday, assessing whether she thinks you've done a good job or not. That person's going to click and then wait. Happiness will be inversely proportional to how long the wait is. That's it. That's what "performance" means at 99% of sites I've ever visited.

(If you're interested in the other 1% of sites I've visited, they're interested in throughput, which I've written about in another blog post.)

Measuring response time is vital. You must be able to measure response time if you're going to nail that test on Thursday.

The key is to understand that the term response time doesn't even have a definition except in the context of a task. You can't measure response time if you don't first decide what task you're going to measure. In other words, you cannot do step 2 before you do step 1. With Oracle, for example, you can collect ASH data (if you're licensed to use it) or even trace data for a whole bunch of Oracle processes, but you won't have a single response time until you define which tasks buried within that data are the ones you want to extract and pay attention to.
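To make the point concrete, here's a minimal Python sketch of the idea that response time exists only per task. The task names and durations are made up for illustration; the point is that you must first attribute each timed event to a task before you can total anything meaningful.

```python
from collections import defaultdict

# Hypothetical timed events: (task_name, seconds). In a real Oracle trace
# or ASH sample, you would first have to decide which events belong to
# which business task before any "response time" exists at all.
events = [
    ("post-invoice", 0.42), ("post-invoice", 1.10),
    ("nightly-batch", 5.00), ("post-invoice", 0.08),
]

def response_time_by_task(events):
    """Sum event durations per task; response time is defined only per task."""
    totals = defaultdict(float)
    for task, seconds in events:
        totals[task] += seconds
    return dict(totals)

print(response_time_by_task(events))
```

Until the "post-invoice" events are tagged as such, there is no "post-invoice response time" to measure, which is why step 1 must precede step 2.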

You get that by visiting a user and watching what she does.

There are lots of excuses for not watching your users. Like these...
  • "I don't know my users." I know. But you should. You'd do your job better if you did. And your users would, too.
  • "My users aren't here." I know. They're on the web. They're in Chicago and Singapore and Istanbul, buying plane tickets or baseball caps or stock shares. But if you can't watch at least a simulation of the things those users actually do with the system you help manage, then I can't imagine how you would possibly succeed at providing good performance to them.
  • "I'm supposed to be able to manage performance with my dashboard." I know. I was supposed to have a hover car by the year 2000.
The longer you stay mired in excuses like these, the longer it's going to be before you can get the benefit of my point here. Your users are running something, and whatever that is that they're running is your version of my Thursday test. You can check and fix all you want, but unless you get lucky and fix the exact tooth that's hurting, your efforts aren't going to be perceived as "helpful." Checking and fixing everything you can think of is far less efficient and effective than targeting exactly what your user needs you to target.

Lots of performance analysts (DBAs, developers, architects, sysadmins, and so on) assume that when someone says, "The whole system is slow," it means there must be a single parameter somewhere in the bowels of the system that needs adjustment, and if you can just make that adjustment, everything is going to be ok. It might mean that, but in my experience, the overwhelming majority of cases are not that way. (Pages 25–29 of Optimizing Oracle Performance has more information about this.)

The great thing about measuring response time is that no matter what the problem is, you'll see it. If the program you're watching is poorly written, you'll see it. If some other program is hogging too much of a resource that your program needs, you'll see it. If you have a bad disk controller, you'll see it. If some parameter needs adjusting, you'll see it.

Realize that when a business user says "system," she doesn't mean what you would mean if you said "system." She means that the thing she runs is slow. Look at that thing. Maybe there are seventeen of them. And sure, maybe all seventeen suffer from the same root cause. If that's the case, then fantastic, because fixing the first problem will magically fix the other sixteen, too. If it's not, then fantastic anyway, because now all of them are on your prioritized list of tasks to optimize, and you'll probably surprise yourself how quickly you'll be able to pick them all off when you focus on one task at a time.

Friday, November 20, 2009

Performance Optimization with Global Entry. Or Not?

As I entered the 30-minute "U.S. Citizens" queue for immigration back into the U.S. last week, the helpful "queue manager" handed me a brochure. This is a great place to hand me something to read, because I'm captive for the next 30 minutes as I await my turn with the immigration officer at the Passport Control desk. The brochure said "Roll through Customs faster."

Ok. I'm listening.

Inside the brochure, the first page lays out the main benefits:
  • bypass the passport lines
  • no paper Customs declaration
  • in most major U.S. airports
Well, that's pretty cool. Especially as I'm standing only 5% deep in a queue with a couple hundred people in it. And look, there's a Global Entry kiosk right there with its own special queue, with nobody—nobody!—in it.

If I had this Global Entry thing, I'd have a superpower that would enable me to zap past the couple hundred people in front of me, and get out of the Passport Control queue right now. Fantastic.

So what does this thing cost? It's right there in the brochure:
  1. Apply online at www.globalentry.gov. There is a non-refundable $100 application fee. Membership is valid for five years. That's $20 a year for the queue-bypassing superpower. Not bad. Still listening.
  2. Schedule an in-person interview. Next, I have to book an appointment to meet someone at the airport for a brief interview.
  3. Complete the interview and enrollment. I give my interview, get my photo taken, have my docs verified, and that's it, I'm done.
So, all in all, it doesn't cost too much: a hundred bucks and probably a couple hours one day next month sometime.

What's the benefit of the queue-bypassing superpower? Well, it's clearly going to knock a half-hour off my journey through Passport Control. I immigrate three or four times per year on average, and today's queue is one of the shorter ones I've seen, so that's at least a couple hours per year that I'd save... Wow, that would be spectacular: a couple more hours each year in my family's arms instead of waiting like a lamb at the abattoir to have my passport controlled.

But getting me into my family's arms 30 minutes earlier is not really what happens. The problem is a kind of reasoning trap that people I meet fall into all the time. When you think about subsystem (or resource) optimization, it looks like your latency savings for the subsystem should go straight to your system's bottom line, but that's often not what happens. That's why I really don't care about subsystem optimization; I care about response time. I could say that a thousand times, but my statement is too abstract to really convey what I mean unless you already know what I mean.

What really happens in the airport story is this: if I had used Global Entry on my recent arrival, it would have saved me only a minute or two. Not half an hour, not even close.

It sounds crazy, doesn't it? How can a service that cuts half an hour off my Passport Control time not get me home at least a half hour earlier?

You'll understand once I show you a sequence diagram of my arrival. Here it is (at right). You can click the image to embiggen it, if you need.

To read this sequence diagram, start at the top. Time flows downward. This sequence diagram shows two competing scenarios. The multicolored bar on the left-hand side represents the timeline of my actual recent arrival at DFW Airport, without using the Global Entry service. The right-hand timeline is what my arrival would have looked like had I been endowed with the Global Entry superpower.

You can see at the very bottom of the timeline on the right that the time I would have saved with Global Entry is minuscule: only a minute or two.

The real problem is easy to see in the diagram: Queue for Baggage Claim is the great equalizer in this system. No matter whether I'm a Global Entrant or not, I'm going to get my baggage when the good people outside with the Day-Glo Orange vests send it up to me. My status in the Global Entry system has absolutely no influence over what time that will occur.

Once I've gotten my baggage, the Global Entry superpower would have again swung into effect, allowing me to pass through the zero-length queue at the Global Entry kiosk instead of waiting behind two families at the Customs queue. And that's the only net benefit I would have received.

Wait: there were only two families in the Customs queue? What about the hundreds of people I was standing behind in the Passport Control queue? Well, many of them were gone already (either they had hand-carry bags only, or their bags had come off earlier than mine). Many others were still awaiting their bags on the Baggage Claim carousel. Because bags trickle out of the baggage claim process, there isn't the huge all-at-once surge of demand at Customs that there is at Passport Control when a plane unloads. So the queues are shorter.

At any rate, there were four queues at Customs, and none of them was longer than three or four families. So the benefit of Global Entry—in exchange for the $100 and the time spent doing the interview—for me, this day, would have been only the savings of a couple of minutes.

Now, if—if, mind you—I had been able to travel with only carry-on luggage, then Global Entry would have provided me significantly more value. But when I'm returning to the U.S. from abroad, I'm almost never allowed to carry on any bag other than my briefcase. Furthermore, I don't remember ever clearing Passport Control to find my bag waiting for me at Baggage Claim. So the typical benefit to me of enrolling in Global Entry, unfortunately, appears to be only a fraction of the duration required to clear Customs, which in my case is almost always approximately zero.

The problem causing the low value (to me) of the Global Entry program is that the Passport Control resource hides the latency of the Baggage Claim resource. No amount of tuning upon the Passport Control resource will affect the timing of the Baggage In Hand milestone; the time at which that milestone occurs is entirely independent of the Passport Control resource. And that milestone—as long as it occurs after I queue for Baggage Claim—is a direct determinant of when I can exit the airport. (Gantt or PERT chart optimizers would say that Queue for Baggage Claim is on the critical path.)

How could a designer make the airport experience better for the customer? Here are a few ideas:
  • Let me carry on more baggage. This idea would allow me to trot right through Baggage Claim without waiting for my bag. In this environment, the value of Global Entry would be tremendous. Well, nice theory; but allowing more carry-on baggage wouldn't work too well in the aggregate. The overhead bins on my flight were already stuffed to maximum capacity, and we don't need more flight delays induced by passengers who bring more stuff onboard than the cabin can physically accommodate.
  • Improve the latency of the baggage claim process. The sequence diagram shows clearly that this is where the big win is. It's easy to complain about baggage claim, because it's nearly always noticeably slower than we want it to be, and we can't see what's going on down there. Our imaginations inform us that there's all sorts of horrible waste going on.
  • Use latency hiding to mask the pain of the baggage claim process. Put TV sets in the Baggage Claim area, and tune them to something interesting instead of infinite loops of advertising. At CPH, they have a Danish hot dog stand in the baggage claim area. They also have a currency exchange office in there. Excellent latency hiding ideas if you need a snack or some DKK walkin'-around-money.
Latency hiding is a weak substitute for improving the speed of the baggage claim process. The killer app would certainly be to make Baggage Claim faster. Note, however, that just making Baggage Claim a little bit faster wouldn't make the Global Entry program any more valuable. To make Global Entry any more valuable, you'd have to make Baggage Claim fast enough that your bag would be waiting for anyone who cleared the full Passport Control queue.

So, my message today: When you optimize, you must first know your goal. So many people optimize subsystems (resources) that they think are important, but optimizing subsystems is often not a path to optimizing what you really want. At the airport, I really don't give a rip about getting out of the Passport Control queue if it just means I'm going to be dumped earlier into a room where I'll have to wait until an appointed time for my baggage.

Once you know what your real optimization goal is (that's Method R step 1), then the sequence diagram is often all you need to get your breakthrough insight that helps you either (a) solve your problem or (b) understand when there's nothing further that you can really do about it.

Thursday, November 12, 2009

Why We Made Method R

Twenty years ago (well, a month or so more than that), I entered the Oracle ecosystem. I went to work as a consultant for Oracle Corporation in September 1989. Before Oracle, I had been a language designer and compiler developer. I wrote code in lex, yacc, and C for a living. My responsibilities had also included improving other people's C code: making it more reliable, more portable, easier to read, easier to prove, and easier to maintain; and it was my job to teach other people in my department how to do these things themselves. I loved all of these duties.

In 1987, I decided to leave what I loved for a little while, to earn an MBA. Fortunately, at that time, it was possible to earn an MBA in a year. After a year of very difficult work, I had my degree and a new perspective on business. I interviewed with Oracle, and about a week later I had a job with a company that a month prior I had never heard of.

By the mid-1990s, circumstances and my natural gravity had matched to create a career in which I was again a software developer, optimizer, and teacher. By 1998, I was the manager of a group of 85 performance specialists called the System Performance Group (SPG). And I was the leader of the system architecture and system management consulting service line within Oracle Consulting's Global Steering Committee.

My job in the SPG role was to respond to all the system performance-related issues in the USA for Oracle's largest accounts. My job in the Global Steering Committee was to package the success of SPG so that other practices around the world could repeat it. The theory was that if a country manager in, say, Venezuela, wanted his own SPG, then he could use the financial models, budgets, hiring plans, training plans, etc. created by my steering committee group. Just add water.

But there was a problem. My own group of 85 people consisted of two very different types of people. About ten of these 85 people were spectacularly successful optimizers whom I could send anywhere with confidence that they'd thrive at either improving performance or proving that performance improvements weren't possible. The other 75 were very smart, very hard-working people who would grow into the tip of my pyramid over the course of more years, but they weren't there yet.

The problem was, how do you convert good, smart, hard-working people in the base of the SPG pyramid into people in the tip? The practice manager in Venezuela would need to know that. The answer, of course, is supposed to be the Training Plan. Optimally, the Training Plan consists of a curriculum of a few courses, a little on-the-job training, and then, presto: tip of the pyramid. Just add water.

But unfortunately that wasn't the way things worked. What I had been getting instead, within my own elite group, was a process that took many years to convert a smart, hard-working person into a reasonably reliable performance optimizer whom you could send anywhere. Worse yet, the peculiar stresses of the job—like being away from home 80% of the time, continually visiting angry people each week, and having to work for me—caused an outflow of talent that approximately equaled the inflow of people who made it to the tip of the pyramid. The tip of my pyramid never grew beyond roughly 10 people.

The problem, by definition, was the Training Plan. It just wasn't good enough. It wasn't that the instructors of Oracle's internal "tuning" courses were doing a poor job of teaching courses. And it wasn't that the course developers had done a poor job of creating courses. On the contrary, the instructors and course developers were doing excellent work. The problem was that the courses were focusing on the wrong thing. The reason that the courses weren't getting the job done was that the very subject matter that needed teaching hadn't been invented yet.

I expect the people who write, say, the course called "Braking System Repair for Boeing 777" to have themselves invented the braking system they write about. So, the question was, who should be responsible for inventing the subject matter on how to optimize Oracle? I decided that I wanted that person to be me. I deliberated carefully and decided that my best chance of doing that the way I wanted to do it would be outside of Oracle. So in October 1999, ten years and one week after I joined the company, I left Oracle with the vision of creating a repeatable, teachable method for optimizing Oracle systems.

Ten years later, this is still the vision for my company, Method R Corporation. We exist not to make your system faster. We exist to make you faster at making all your systems faster. Our work is far from done, but here is what we have done:
  • Written white papers and other articles that explain Method R to you at no cost.
  • Written a book called Optimizing Oracle Performance, where you can learn Method R at a low cost.
  • Created a Method R course (on which the book is based), to teach you how to diagnose and repair response time problems in Oracle-based systems.
  • Spoken at hundreds of public and private events where we help people understand performance and how to manage it.
  • Provided consulting services to make people awesome at making their systems faster and more efficient.
  • Created the first response time profiling software ever for Oracle software applications, to let you analyze hundreds of megabytes of data without drudgery.
  • Created a free instrumentation library so that you can instrument the response times of Oracle-based software that you write.
  • Created software tools to help you be awesome at extracting every drop of information that your Oracle system is willing to give you about your response times.
  • Created a software tool that enables you to record the response time of every business task that runs on your system so you can effortlessly manage end-user performance.
As I said, our work is far from done. It's work that really, really matters to us, and it's work we love doing. I expect it to be a journey that will last long into the future. I hope that our journey will intersect with yours from time to time, and that you will enjoy it when it does.

Thursday, September 24, 2009

Happy Birthday, OFA Standard

Yesterday I finally posted to the Method R website some of the papers I wrote while I was at Oracle Corporation (1989–1999). You can now find these papers where I am.

When I was uploading my OFA Standard paper, I noticed that today—24 September 2009—is the fourteenth birthday of its publication date. So, even though the original OFA paper was around for a few years before 1995, please join me in celebrating the birthday of the final version of the official OFA Standard document.

Wednesday, September 16, 2009

On the Importance of Diagnosing Before Resolving

Today a reader posted a question I like at our Method R website. It's about the story I tell in the article called, "Can you explain Method R so even my boss could understand it?" The story is about sending your son on a shopping trip, and it takes him too long to complete the errand. The point is that an excellent way to fix any kind of performance problem is to profile the response time for the properly chosen task, which is the basis for Method R (both the method and the company).

Here is the profile that details where the boy's time went during his errand:
                     --Duration---
Subtask            minutes    %  Executions
------------------ ------- ---- ----------
Talk with friends       37  62%          3
Choose item             10  17%          5
Walk to/from store       8  13%          2
Pay cashier              5   8%          1
------------------ ------- ---- ----------
Total                   60 100%
I went on to describe that the big leverage in this profile is the elimination of the subtask called "Talk with friends," which will reduce response time by 62%.
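A profile like this is nothing more than a list of subtask durations sorted in descending order, each with its share of the total. Here's a minimal Python sketch that renders the table above; the numbers are the article's, but the function itself is mine, not any Method R tool:

```python
def profile(subtasks):
    """Print a response-time profile: rows sorted by descending duration."""
    total = sum(minutes for _, minutes, _ in subtasks)
    for name, minutes, execs in sorted(subtasks, key=lambda r: -r[1]):
        print(f"{name:<18} {minutes:>7} {minutes / total:>4.0%} {execs:>10}")
    print(f"{'Total':<18} {total:>7} {1:>4.0%}")

profile([
    ("Talk with friends", 37, 3),
    ("Choose item", 10, 5),
    ("Walk to/from store", 8, 2),
    ("Pay cashier", 5, 1),
])
```

Sorting by duration is what puts the biggest leverage (here, "Talk with friends" at 62%) at the top, where your eye lands first.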

The interesting question that a reader posted is this:
Not sure this is always the right approach. For example, lets imagine the son has to pick 50 items
Talk 3 times 37 minutes
Choose item 50 times 45 minutes
Walk 2 times 8 minutes
Pay 1 time 5 minutes
Working on "choose item" is maybe not the right thing to do...
Let's explore it. Here's what the profile would look like if this were to happen:
         --Duration---
Subtask  minutes    %  Executions
-------  ------- ---- ----------
Choose        45  47%         50
Talk          37  39%          3
Walk           8   8%          2
Pay            5   5%          1
-------  ------- ---- ----------
Total         95 100%
The point of the inquiry is this:
The right answer in this case, too, is to begin with eliminating Talk from the profile. That's because, even though it's not ranked at the very top of the profile, Talk is completely unnecessary to the real goal (grocery shopping). It's a time-waster that simply shouldn't be in the profile. At all. But with Cary's method of addressing the profile from the top downward, you would instead focus on the "Choose" line, which is the wrong thing.
In chapters 1 through 4 of our book about Method R for Oracle, I explained the method much more thoroughly than I did in the very brief article. In my brevity, I skipped past an important point. Here's a summary of the Method R steps for diagnosing and resolving performance problems using a profile:
  1. (Diagnosis phase) For each subtask (row in the profile), visiting subtasks in order of descending duration...
    1. Can you eliminate any executions without sacrificing required function?
    2. Can you improve (reduce) individual execution latency?
  2. (Resolution phase) Choose the candidate solution with the best net value (that is, the greatest value of benefit minus cost).
Here's a narrative of executing the steps of the diagnostic phase, one at a time, upon the new profile, which—again—is this:
         --Duration---
Subtask  minutes    %  Executions
-------  ------- ---- ----------
Choose        45  47%         50
Talk          37  39%          3
Walk           8   8%          2
Pay            5   5%          1
-------  ------- ---- ----------
Total         95 100%
  1. Execution elimination for the Choose subtask: If you really need all 50 items, then no, you can't eliminate any Choose executions.
  2. Latency optimization for the Choose subtask: Perhaps you could optimize the mean latency (which is .9 minutes per item). My wife does this. For example, she knows better where the items in the store are, so she spends less time searching for them. (I, on the other hand, can get lost in my own shower.) If, for example, you could reduce mean latency to, say, .8 minutes per item by giving your boy a map, then you could save (.9 – .8) × 50 = 5 minutes (5%). (Note that we don't execute the solution yet; we're just diagnosing right now.)
  3. Execution elimination for the Talk subtask: Hmm, seems to me like if your true goal is fast grocery shopping, then you don't need your boy executing any of these 3 Talk events. Proposed time savings: 37 minutes (39%).
  4. Latency optimization for the Talk subtask: Since you can eliminate all Talk calls, no need to bother thinking about latency reduction. ...Unless you're prey to some external constraint (like social advancement, say, in an attempt to maximize your probability of having rich and beautiful grandchildren someday), in which case you should think about latency reduction instead of execution elimination.
  5. Execution elimination for the Walk subtask: Well, the boy has to get there, and he has to get back, so this "executions=2" figure looks optimal. (Those Oracle applications we often see that process one row per network I/O call would have 50 Walk executions, one for each Choose call.)
  6. Latency optimization for the Walk subtask: Walking takes 4 minutes each way. Driving might take less time, but then again, it might actually take even more. Will driving introduce new dependent subtasks? Warm Up? Park? De-ice? Even driving doesn't eliminate all the walking... Plus, there's not a lot of leverage in optimizing Walk, because it accounts for only 8% of total response time to begin with, so it's not worth a whole lot of bother trying to shave it down by some marginal proportion, especially since inserting a car into your life (or letting your boy drive yours) is no trivial matter.
  7. Execution elimination for the Pay subtask: The execution count on Pay is already optimized down to the legally required minimum. No real opportunity for improvement here without some kind of radical architecture change.
  8. Latency optimization for the Pay subtask: It takes 5 minutes to Pay? That seems a bit much. So you should look at the payment process. Or should you? Even if you totally eliminate Pay from the profile, it's only going to save 5% of your time. But, if every minute counts, then yes, you look at it. ...Especially if there might be an easy way to improve it. If the benefit comes at practically no cost, then you'll take it, even if the benefit is only small. So, imagine that you find out that the reason Pay was so slow is that it was executed by writing a check, which required waiting for store manager approval. Using cash or a credit/debit card might improve response time by, say, 4 minutes (4%).
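The diagnostic walk-through above can be tallied in a few lines. The savings figures come from the 95-minute profile; the cost labels ("easy", "moderate") and the ranking rule are my own crude stand-in for the resolution phase's "benefit minus cost" calculation:

```python
# Candidate solutions from the walk-through: (description, minutes saved,
# rough implementation cost). The cost labels are assumed for illustration.
candidates = [
    ("Eliminate all 3 Talk executions", 37, "easy"),
    ("Give the boy a store map (Choose: .9 -> .8 min/item)", (0.9 - 0.8) * 50, "moderate"),
    ("Pay with cash or a card instead of a check", 4, "easy"),
]

TOTAL = 95
COST_RANK = {"easy": 0, "moderate": 1}

# Crude proxy for "benefit minus cost": prefer cheap fixes, then big savings.
ranked = sorted(candidates, key=lambda c: (COST_RANK[c[2]], -c[1]))
for desc, saved, cost in ranked:
    print(f"{desc}: saves {saved:.0f} min ({saved / TOTAL:.0%}), cost: {cost}")
```

Notice that this ordering puts the easy 4-minute Pay fix ahead of the moderate 5-minute Choose fix, which is exactly why the resolution phase ranks on net value rather than raw minutes saved.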
Now you're done assessing the effects of (1) execution elimination and (2) latency reduction for each line in the profile. That ends the diagnostic phase of the method. The next step is the resolution phase: to determine which of these candidate solutions is the best. Given the analysis I've walked you through, I'd rank the candidate solutions in this order:
  1. Eliminate all 3 executions of Talk. That'll save 37 minutes (39%), and it's easy to implement; you don't have to buy a car, apply for a credit card, train the boy how to shop faster, or change the architecture of how shopping works. You can simply discard the "requirement" to chat, or you can specify that it be performed only during non-errand time windows.
  2. Optimize Pay latency by using cash or a card, if it's easy enough to give your boy access to cash or a card. That will save 4 minutes, which—by the way—will be a more important proportion of the total errand time after you eliminate all the Talk executions.
  3. Finally, consider optimizing Choose latency. Maybe giving your son a map of the store will help. Maybe you should print your grocery list more neatly so he can read it without having to ask for help. Maybe by simply sending him to the store more often, he'll get faster as his familiarity with the place improves.
That's it.

So the point I want to highlight is this:
I'm not saying you should stick to the top line of your profile until you've absolutely conquered it.
It is important to pass completely through your profile to construct your set of candidate solutions. Then, on a separate pass, you evaluate those candidate solutions to determine which ones you want to implement, and in what order. That first full pass is key. You have to do it for Method R to be reliable for solving any performance problem.

Thursday, June 18, 2009

Profiling with my Boy

We have an article online called "Can you explain Method R so even my boss could understand it?" Today I'm going to raise the stakes, because yesterday I think I explained Method R so that an eleven-year-old could understand it.

Yesterday I took my 11-year-old son Alex to lunch. I talked him into eating at one of my favorite restaurants, called Mercado Juarez, over in Irving, so it was a half hour in the car together, just getting over there. It was a big day for the two of us because we were very excited about the new June 17 iPhone OS 3.0 release. I told him about some of the things I'd learned about it on the Internet over the past couple of weeks. One subject in particular that we were both interested in was performance. He likes not having to wait for click results just as much as I do.

According to Apple, the new iPhone OS 3.0 software has some important code paths in it that are 3× faster. Then, upgrading to the new iPhone 3G S hardware is supposed to yield yet another 3× performance improvement for some code paths. It's what Philip Schiller talks about at 1:42:00 in the WWDC 2009 keynote video. Very exciting.

Alex of course, like many of us, wants to interpret "3× faster" as "everything I do is going to be 3× faster." As in everything that took 10 seconds yesterday will take 3 seconds tomorrow. It's a nice dream. But it's not what seeing a benchmark run 3× faster means. So we talked about it.

I asked him to imagine that it takes me an hour to do my grocery shopping when I walk to the store. Should I buy a car? He said yes, probably, because a car is a lot faster than walking. So I asked him, what if the total time I spent walking to and from the grocery store was only one minute? Then, he said, having a car wouldn't make that much of a difference. He said you might want a car for other reasons, but he wouldn't recommend it just to make grocery shopping faster.

Good.

I said, what if grocery shopping were taking me five hours, and four of it was time spent walking? "Then you probably should get the car," he told me. "Or a bicycle."

Commit.

On the back of his menu (photo above: click to zoom), I drew him a sequence diagram (A) showing how he, running Safari on an iPhone 3G, might look to a performance analyst. I showed him how to read the sequence diagram (time runs top-down, control passes from one tier to another), and I showed him two extreme ways that his sequence diagram might turn out for a given experience. Maybe the majority of the time would be spent on the 3G network tier (B), or maybe the majority of the time would be spent on the Safari software tier (C). We talked about how if B were what was happening, then a 3× faster Safari tier wouldn't really make any difference. Apple wouldn't be lying if they said their software was 3× faster, but he really wouldn't notice a performance improvement. But if C were what was happening, then a 3× faster Safari tier would be a smoking hot upgrade that we'd be thrilled with.

Sequence diagrams, check. Commit.

Now, to profiles. So I drew up a simple profile for him, with 101 seconds of response time consumed by 100 seconds of software and 1 second of 3G (D):
Software  100
3G          1
-------------
Total     101
I asked him, if we made the software 2× faster, what the new software time would be. He wrote down "50" in a new column to the right of the "100." Yep. Then I asked him what would happen to the total response time. He said to wait a minute; he needed to use the calculator on his iPod Touch. Huh? A few keystrokes later, he came up with a response time of 50.5.

Oops. Rollback.

He made the same mistake that just about every forty-year-old I've ever met makes. He figured that if one component of response time got 2× faster, then the total response time must get 2× faster, too. Nope: the right answer was 51 (50 + 1), not 50.5. In this case, the wrong answer was close to the right answer, but only because of the particular numbers I had chosen.

So, to illustrate, I drew up another profile (E):
Software    4
3G         10
-------------
Total      14
Now, if we were to make the software 2× faster, what happens to the total? We worked through it together:
Software    4    2
3G         10   10
------------------
Total      14   12
Click. So then we spent the next several minutes doing little quizzes. If this is your profile, and we make this component X times faster, then what's the new response time going to be? Over and over, we did several more of these, some on paper (F), and others just talking.
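Those little quizzes can be captured in a few lines of code. Here's a minimal sketch of the idea; the function name and the profile dictionary are my own illustration, not anything from the lunch:

```python
def new_response_time(profile, component, speedup):
    """Return the new total response time after making one
    component of the profile `speedup` times faster.
    `profile` maps component name -> seconds spent there."""
    new_profile = dict(profile)
    new_profile[component] = profile[component] / speedup
    return sum(new_profile.values())

# Profile E from above: 4 seconds of software, 10 seconds of 3G.
profile_e = {"software": 4, "3g": 10}

# A 2x-faster software tier shrinks only its own 4 seconds:
print(new_response_time(profile_e, "software", 2))  # 12.0, not 7.0
```

Only the component you speed up contributes any savings; everything else in the profile is untouched, which is the whole point of the quizzes.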

Commit.

Next step. "What if I told you it takes me an hour to get into my email at home? Do I need to upgrade my network connection?" A couple of minutes of conversation later, he figured out that he couldn't answer that question until he got some more information from me. Specifically, he had to ask me how much of that hour is presently being spent by the network. So we did this over and over a few times. I'd say things like, "It takes me an hour to run my report. Should I spend $4,800 tuning my SQL?" Or, "Clicking this button takes 20 seconds. Should I upgrade my storage area network?"

And, with just a little bit of practice, he learned that he had to say, "How much of the however-long-you-said is being spent by the whatever-it-was-you-said?" I was happy with how he answered, because it illustrated that he had caught onto the pattern. He realized that the specific blah-blah-blah proposed remedies I was asking him about didn't really matter. He had to ask the same question regardless. (He was answering me with a sentence using bind variables.)

Commit.

Alex hears me talk about our Method R Profiler software tool a lot, and he knows conceptually that it helps people make their systems faster, but he's never known in much detail what it does. So I told him that the profile tables are what our Profiler makes for people. To demonstrate how it does that, I drew him up a list of calls (G), which I told him was a list of visits between a disk and a CPU. ...Just a list that says the same thing that a sequence diagram (annotated with timings) would say:
D 2
C 1
D 6
D 4
D 8
C 3
I told him to make a profile for these calls, and he did (H):
Disk   20
CPU     4
---------
Total  24
Excellent. So I explained that instead of adding up lists in our head all day, we wrote the Profiler to aggregate the call-by-call durations (from an Oracle extended SQL trace file) for you into a profile table that lets you answer the performance questions we had been practicing over lunch. ...Even if there are millions of lines to add up.
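That aggregation is just a fold over the call records. A minimal sketch of the idea, assuming a simple in-memory list of (tier, duration) pairs rather than a real trace file (the names here are mine, not the Profiler's):

```python
from collections import defaultdict

def profile(calls):
    """Aggregate (tier, duration) call records into a profile:
    total duration per tier, plus the grand total."""
    totals = defaultdict(float)
    for tier, duration in calls:
        totals[tier] += duration
    return dict(totals), sum(totals.values())

# The call list from the napkin: D = disk, C = CPU.
calls = [("D", 2), ("C", 1), ("D", 6), ("D", 4), ("D", 8), ("C", 3)]
by_tier, total = profile(calls)
print(by_tier, total)  # disk time 20, CPU time 4, total 24
```

The same loop works whether the list has six entries or six million, which is why a tool beats adding up lists in your head.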

The finish-up conversation in the car ride back was about how to use everything we had talked about when you fix people's performance problems. I told him the most vital thing about helping someone solve a performance problem is to make sure that the operation (the business task) that you're analyzing right now is actually the most important business task to fix first. If you're looking at anything other than the most important task first, then you're asking for trouble.

I asked him to imagine that there are five important tasks that are too slow. Maybe every one of those things has its response time dominated by a different component than all the others. Maybe they're all the same. But if they're all different, then no single remedy you can perform is going to fix all five. A given remedy will have a different performance impact upon each of the five tasks, depending on how much of the fixed thing that task was using to begin with.

So the important thing is to know which of the five profiles it is that you ought to be paying attention to first. Maybe one remedy will fix all five tasks, maybe not. You just can't know until you look at the profiles. (Or until you try your proposed remedy. But trial-and-error is an awfully expensive way to find out.)

Commit.

It was a really good lunch. I'll look forward to taking my 9-year-old (Alex's little brother) out the week after next when I get back from ODTUG.

Friday, April 24, 2009

The Most Common Performance Problem I See

At the Percona Performance Conference in Santa Clara this week, the first question an audience member asked our panel was, "What is the most common performance problem you see in the field?"

I figured, being an Oracle guy at a MySQL conference, this might be my only chance to answer something, so I went for the mic. Here is my answer.
The most common performance problem I see is people who think there's a most-common performance problem that they should be looking for, instead of measuring to find out what their actual performance problem actually is.
It's a meta answer, but it's a meta problem. The biggest performance problems I see, and the ones I see most often, are not problems with machines or software. They're problems with people who don't have a reliable process of identifying the right thing to work on in the first place.

That's why the definition of Method R doesn't mention Oracle, or databases, or even computers. It's why Optimizing Oracle Performance spends the first 69 pages talking about red rocks and informed consent and Eli Goldratt instead of Oracle, or databases, or even computers.

The most common performance problem I see is that people guess instead of knowing. The worst cases are when people think they know because they're looking at data, but they really don't know, because they're looking at the wrong data. Unfortunately, every case of guessing that I ever see is this worst case, because nobody in our business goes very far without consulting some kind of data to justify his opinions. Tim Cook from Sun Microsystems pointed me yesterday to a blog post that gives a great example of that illusion of knowing when you really don't.

Friday, April 3, 2009

Cary on Joel on SSD

Joel Spolsky's article on Solid State Disks is a great example of a type of problem my career is dedicated to helping people avoid. Here's what Joel did:
  1. He identified a task needing performance improvement: "compiling is too slow."
  2. He hypothesized that converting from spinning rust disk drives (thanks mwf) to solid state, flash hard drives would improve performance of compiling. (Note here that Joel stated that his "goal was to try spending money, which is plentiful, before [he] spent developer time, which is scarce.")
  3. So he spent some money (which is, um, plentiful) and some of his own time (which is apparently less scarce than that of his developers) replacing a couple of hard drives with SSD. If you follow his Twitter stream, you can see that he started on it 3/25 12:15p and wrote about having finished at 3/27 2:52p.
  4. He was pleased with how much faster the machines were in general, but he was disappointed that his compile times underwent no material performance improvement.
Here's where Method R could have helped. Had he profiled his compile times to see where the time was being spent, he would have known before the upgrade that SSD was not going to improve response time. Given his results, his profile for compiling must have looked like this:
100%  Not disk I/O
  0%  Disk I/O
----  ------------
100%  Total
I'm not judging whether he wasted his time by doing the upgrade. By his own account, he is pleased at how fast his SSD-enabled machines are now. But if, say, the compiling performance problem had been survival-threateningly severe, then he wouldn't have wanted to expend two business days' worth of effort upgrading a component that was destined to make zero difference to the performance of the task he was trying to improve.

So, why would someone embark upon a performance improvement project without first knowing exactly what result he should be able to expect? I can think of some good reasons:
  • You don't know how to profile the thing that's slow. Hey, if it's going to take you a week to figure out how to profile a given task, then why not spend half that time doing something that your instincts all say is surely going to work?
  • Um, ...
Ok, after trying to write them all down, I think it really boils down to just one good reason: if profiling is too expensive (that is, you don't know how, or it's too hard, or the tools to do it cost too much), then you're not going to do it. I don't know how I'd profile a compile process on a Microsoft Windows computer. It's probably possible, but I can't think of a good way to do it. It's all about knowing; if you knew how to do it, and it were easy, you'd do it before you spent two days and a few hundred bucks on an upgrade that might not give you what you wanted.

I do know that in the Oracle world, it's not hard anymore, and the tools don't cost nearly as much as they used to. There's no need anymore to upgrade something before you know specifically what's going to happen to your response times. Why guess... when you can know.

Friday, July 11, 2008

So how do you FIX the problems that "Performance as a Service" helps you find?

I want to respond carefully to Reuben's comment on my Performance as a Service post from July 7. Reuben asked:
i can see how you actually determine for a customer where the pain points are and you can validate user remarks about poor performance. But i don't see from your post how you are going to attack the problem of fixing the performance issue.

i would be most interested in hearing your thoughts on that. I wonder if you guys are going to touch the actual code behind the "order button" you described.
Under the service model I described in the Performance as a Service post, a client using our performance service would have several choices about how to respond to a problem. They could contract our team as consultants; they could use someone else; they could do it themselves; or of course, they could choose to defer the solution.

Attacking the problem of fixing the performance issue is actually the "easy" part. Well, it's not necessarily always easy, but it's the part that my team has been doing over and over again since the early 1990s. We use Method R. Once we've measured the response time in detail with Oracle's extended SQL trace function, we know exactly where the task is spending the end user's time. From there, I think it's fair to say we (Jeff, Karen, etc.) are pretty skilled at figuring out what to do next.

Sometimes, the root cause of a performance problem requires manipulation of the application source code, and sometimes it doesn't. If you do diagnosis right, you should never have to guess which one it is. A lot of people wonder what happens if it's the code that needs modification, but the application is prepackaged, and therefore the source code is out of your control. In my experience, most vendors are very responsive to improving the performance of their products when they're shown unambiguously how to improve them.

If your application is slow, you should be eager to know exactly why it's slow. You should be equally eager to know, whether you wrote the code yourself or someone else wrote it for you. To avoid collecting the right performance diagnostic data for an application because you're afraid of what you might find out is like taking your hands off the wheel and covering your eyes when a child rides his bike out in front of the car that you're driving. There's significant time-value upon information about performance problems. Even if someone else's code is the reason for your performance problem (or whatever truth you might be afraid of learning), you need to know it as early as possible.

The SLA Manager service I talked about is so important because the most difficult part of using Method R is usually the data collection step. The difficulty is almost always more political than technical. It's overcoming the question, "Why should we change the way we collect our data?" I believe the business value of knowing how long your computer system takes to execute tasks for your business is important enough that it will get people into the habit of measuring response time. ...Which is a vital step in solving the data collection problem that's at the heart of every persistent performance problem I've ever seen. I believe the data collection service that I described will help remove the most important remaining barrier to highly manageable application software performance in our market.

Wednesday, June 4, 2008

Flash Drives and Databases

I learned today about "Sun to embed flash storage in nearly all its servers." This is supposed to be good news for database professionals all over because, flash storage "...consumes one-fifth the power and is a hundred times faster [than rotating disk drives]."

Hey-hey!

Of course, flash storage is going to cost a little more. Well, I'm not sure, maybe a lot more. But, according to the article:
“The fact that it’s not the same dollars per gigabyte is perfectly okay,” John Fowler, the head of Sun’s servers and storage division, said at a press conference in Boston Tuesday.
Alright, I understand that. Get more, pay more. I'm still on board.

But I predict that a lot of people who buy flash storage are going to be disappointed. Here's why.

We all know now that flash storage is a hundred times faster than rotating disk drives. (Says so right in the article. And consumes one-fifth the power.) We all also "know" that databases are I/O intensive applications. (The article says that, too. But everybody already "knew" that anyway.)

The problem that's going to happen is the people (1) who have a slow database application, (2) who assume that their application is slow because of the I/O it is doing, (3) whose application doesn't really spend much time doing I/O at all (whether it does a "lot" of I/O is irrelevant), and (4) who buy flash storage specifically on the hope that after the installation, their database application will "be 100x faster" (because, of course, the flash storage is 100x faster than the storage it is replacing).

See the problem?

Think about Amdahl's Law: improving the speed of a component will help a user's performance only in proportion to the duration for which that user used that component in the first place. Here's an example. Imagine this response time profile:
Total response time: 100 minutes (100%)
Time spent executing OS read calls: 5 minutes (5%) (e.g., db file sequential read)
Time spent doing other stuff: 95 minutes (95%)
Now, how much time will you save if you upgrade your disk drives to a technology that's 100x faster? The answer is that the new "Time spent executing OS read calls" will be .05 minutes, right? Well, maybe. Let's go with that for a moment. If that were true, then how much time would you save? You'd save 4.95 minutes, which is 4.95% of your original response time. Your application won't be 100x faster (or, equivalently, 99% faster); it'll be 4.95% faster.
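That arithmetic is worth writing down once, because it applies to every upgrade decision in this post. Here's a minimal sketch, with function and variable names of my own choosing:

```python
def speedup_effect(total, component_time, speedup):
    """Amdahl's Law for one component of a response time.
    Returns (new_total, percent_of_total_saved)."""
    new_component = component_time / speedup
    new_total = total - component_time + new_component
    return new_total, 100.0 * (total - new_total) / total

# The profile above: 100 minutes total, 5 of them in OS read
# calls, and storage that's 100x faster.
new_total, pct_saved = speedup_effect(100, 5, 100)
print(new_total)   # about 95.05 minutes
print(pct_saved)   # about 4.95 percent saved, not 99 percent
```

Notice that the `speedup` factor barely matters once `component_time` is small: even an infinitely fast disk could save you only 5 of those 100 minutes.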

The users in this story aren't going to be happy with this if they're thinking that the result of an expensive upgrade is going to be 100x faster performance. If they're expecting 1-minute performance and get 95.05-minute performance instead, they're going to be, um, disappointed.

Now, reality is probably not even going to be that good. Imagine that those 5 minutes our user spent in the 100-minute original application response time were consumed executing 150,000 distinct Oracle db file sequential read calls (which map to 150,000 OS read calls). That makes your single-call I/O latency 0.002 seconds per call (300 seconds divided by 150,000 calls).

That's pretty good, but it's a normal enough latency these days on today's high-powered SAN devices. If you think about rotating disk drives, then 0.002 seconds per call is mind-blowingly excellent. But I/O latencies of 0.002 seconds or better don't come from disk drives, they come from the cache that's sitting in these SANs. The read calls that result in physical disk access are taking much longer, 0.005 seconds or more. An average latency of 0.002 is possible because so many of those read calls are being fulfilled from cache.

And the flash drive upgrades aren't going to improve the latency of those calls being fulfilled from cache.

So, to recap, the best improvement you'll ever get by upgrading to flash drives is a percentage improvement that's equivalent to the percentage of time you spent before the upgrade actually making I/O calls. If a lot of your I/O calls are satisfied by reads from cache to begin with, then upgrading to flash drives will help you less than that.
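To fold the cache effect into the arithmetic: only the physically-served fraction of the I/O time can get faster, so the ceiling drops further. A sketch under a made-up assumption that 40% of the I/O time is physical disk access (that figure is purely for illustration):

```python
def flash_upgrade_ceiling(total, io_time, physical_fraction, io_speedup):
    """Best-case percent improvement from faster storage when only
    the physically-served fraction of I/O time actually speeds up
    (cache hits don't get faster when you swap the drives)."""
    improvable = io_time * physical_fraction
    saved = improvable - improvable / io_speedup
    return 100.0 * saved / total

# Same 100-minute task with 5 minutes of I/O, but suppose only 40%
# of that I/O time is physical disk access; the rest is SAN cache
# hits that a drive swap can't accelerate.
ceiling = flash_upgrade_ceiling(100, 5, 0.4, 100)
print(ceiling)  # under 2 percent
```

The better your SAN cache was doing before the upgrade, the smaller `physical_fraction` is, and the less a faster drive can possibly buy you.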

The biggest performance problem most people have is that they don't know where their users' time is going. They know where their system's time is going, but that doesn't matter. What people need to see is the response time profiles of the tasks that the business regards as the most important things it needs done. That's the cornerstone of what Method R (both the method and the company) is all about.

Flash drives might help you. Maybe a lot. And maybe they'll help you a little, or maybe not at all. If you can't see individual user response times, then you'll have to actually try them to find out whether they'll be good for you or not (imagine cash register sound here).

We built our Profiler software so that when we manage Oracle systems, we can see the users' response times and not have to guess about stuff like this. When you can see your response times, you don't have to guess whether a proposed upgrade is going to help you. You'll know exactly who will be helped, and you'll know by how much.