
Thursday, September 17, 2015

The Fundamental Challenge of Computer System Performance

The fundamental challenge of computer system performance is for your system to have enough power to handle the work you ask it to do. It sounds really simple, but helping people meet this challenge has been the point of my whole career. It has kept me busy for 26 years, and there’s no end in sight.

Capacity and Workload

Our challenge is the relationship between a computer’s capacity and its workload. I think of capacity as an empty box representing a machine’s ability to do work over time. Workload is the work your computer does, in the form of programs that it runs for you, executed over time. Workload is the content that can fill the capacity box.


Capacity Is the One You Can Control, Right?

When the workload gets too close to filling the box, what do you do? Most people’s instinctive reaction is that, well, we need a bigger box. Slow system? Just add power. It sounds so simple, especially since—as “everyone knows”—computers get faster and cheaper every year. We call that the KIWI response: kill it with iron.

KIWI... Why Not?

As welcome as KIWI may feel, KIWI is expensive, and it doesn’t always work. Maybe you don’t have the budget right now to upgrade to a new machine. Upgrades cost more than just the hardware itself: there’s the time and money it takes to set it up, test it, and migrate your applications to it. Your software may cost more to run on faster hardware. What if your system is already the biggest and fastest one they make?

And as weird as it may sound, upgrading to a more powerful computer doesn’t always make your programs run faster. There are classes of performance problems that adding capacity never solves. (Yes, it is possible to predict when that will happen.) KIWI is not always a viable answer.

So, What Can You Do?

Performance is not just about capacity. Though many people overlook them, there are solutions on the workload side of the ledger, too. What if you could make workload smaller without compromising the value of your system?
It is usually possible to make a computer produce all of the useful results that you need without having to do as much work.
You might be able to make a system run faster by making its capacity box bigger. But you might also make it run faster by trimming down that big red workload inside your existing box. If you trim off only the wasteful stuff, then nobody gets hurt, and you’ll have wins all around.

So, how might one go about doing that?

Workload

“Workload” is a compound of two words, work and load. It is useful to think about those two words separately.


The amount of work your system does for a given program execution is determined mostly by how that program is written. A lot of programs make their systems do more work than they should. Your load, on the other hand—the number of program executions people request—is determined mostly by your users. Users can waste system capacity, too; for example, by running reports that nobody ever reads.

Both work and load are variables that, with skill, you can manipulate to your benefit. You do it by improving the code in your programs (reducing work), or by improving your business processes (reducing load). I like workload optimizations because they usually save money and work better than capacity increases. Workload optimization can seem like magic.

The Anatomy of Performance

This simple equation explains why a program consumes the time it does:
r = cl        or        response time = call count × call latency
Think of a call as a computer instruction. Call count, then, is the number of instructions that your system executes when you run a program, and call latency is how long each instruction takes. How long you wait for your answer, then—your response time—is the product of your call count and your call latency.

Some fine print: It’s really a little more complicated than this, but actually not that much. Most response times are composed of many different types of calls, all of which have different latencies (we see these in program execution profiles), so the real equation looks like r = c₁l₁ + c₂l₂ + ⋯ + cₙlₙ. But we’ll be fine with r = cl for this article.
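To make the fine print concrete, here’s a tiny Perl sketch that computes r the way a profile does, as the sum of each call count times its latency. The call types and all the numbers in it are invented for illustration:

#!/usr/bin/perl
# A sketch of r = c1*l1 + c2*l2 + ... + cn*ln. All call types and
# latencies below are made-up illustration values, not measurements.
use strict;
use warnings;

my %profile = (
    # call type => [ call count c_i, latency l_i in seconds ]
    'db file sequential read' => [ 1_420, 0.005 ],
    'CPU service'             => [ 2_088, 0.001 ],
    'SQL*Net round-trip'      => [   191, 0.002 ],
);

my $r = 0;
for my $call (sort keys %profile) {
    my ($c, $l) = @{ $profile{$call} };
    $r += $c * $l;    # add this call type's contribution, c_i * l_i
    printf "%-25s %6d calls x %.4f s = %6.3f s\n", $call, $c, $l, $c * $l;
}
printf "response time r = %.3f s\n", $r;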

Call count depends on two things: how the code is written, and how often people run that code.
  • How the code is written (work) — If you were programming a robot to shop for you at the grocery store, you could program it to make one trip from home for each item you purchase. Go get bacon. Come home. Go get milk... It would probably be dumb if you did it that way, because the duration of your shopping experience would be dominated by the execution of clearly unnecessary travel instructions, but you’d be surprised at how often people write programs that act like this.
  • How often people run that code (load) — If you wanted your grocery store robot to buy 42 things for you, it would have to execute more instructions than if you wanted to buy only 7. If you found yourself repeatedly discarding spoiled, unused food, you might be able to reduce the number of things you shop for without compromising anything you really need. (A toy sketch just below puts numbers on the robot’s trips.)
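Here’s that toy sketch of the robot (every number in it is invented), showing how much of the naive shopping trip is pure call count:

#!/usr/bin/perl
# The grocery robot, two ways: one trip per item versus one batched
# trip. Same useful results; wildly different work. Numbers invented.
use strict;
use warnings;

my $items        = 42;
my $trip_latency = 7 * 60;   # seconds per round trip to the store
my $pick_latency = 30;       # seconds to pick one item off the shelf

# One trip per item: 42 trips and 42 picks.
my $naive = $items * ($trip_latency + $pick_latency);

# One batched trip: 1 trip and 42 picks.
my $batched = $trip_latency + $items * $pick_latency;

printf "one trip per item: %5d s\n", $naive;     # 18900 s
printf "one batched trip:  %5d s\n", $batched;   #  1680 s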
Call latency is influenced by two types of delays: queueing delays and coherency delays.
  • Queueing delays — Whenever you request a resource that is already busy servicing other requests, you wait in line. That’s a queueing delay. It’s what happens when your robot tries to drive to the grocery store, but all the roads are clogged with robots that are going to the store to buy one item at a time. Driving to the store takes only 7 minutes, but waiting in traffic costs you another 13 minutes. The more work your robot does, the greater its chances of being delayed by queueing, and the more such delays your robot will inflict upon others as well.
  • Coherency delays — You endure a coherency delay whenever a resource you are using needs to communicate or coordinate with another resource. For example, if your robot’s cashier at the store has to talk with a specific manager or other cashier (who might already be busy with a customer), the checkout process will take longer. The more times your robot goes to the store, the worse your wait will be, and everyone else’s, too.

The Secret

This r = cl thing sure looks like the equation for a line, but because of queueing and coherency delays, the value of l increases when c increases. This causes response time to act not like a line, but instead like a hyperbola.


Because our brains tend to conceive of our world as linear, nobody expects everyone’s response times to get seven times worse when you’ve added only a little bit of new workload, but that’s the kind of thing that routinely happens with performance. ...And not just computer performance. Banks, highways, restaurants, amusement parks, and grocery-shopping robots all work the same way.
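If you want to see that nonlinearity in numbers, here’s a small Perl sketch. It uses M/M/1, the simplest textbook queueing model (my choice for illustration; real systems are messier), in which latency per call is S/(1 - rho) for service time S and utilization rho:

#!/usr/bin/perl
# Latency per call in an M/M/1 queue: l = S / (1 - rho). As
# utilization rho creeps toward 1, latency balloons; response time
# r = c * l follows it up the hyperbola.
use strict;
use warnings;

my $S = 0.010;    # 10 ms of service time per call (invented)
for my $rho (0.10, 0.50, 0.80, 0.90, 0.95, 0.99) {
    my $l = $S / (1 - $rho);
    printf "utilization %.2f -> latency per call %7.1f ms\n",
        $rho, 1000 * $l;
}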

Response times are tremendously sensitive to your call counts, so the secret to great performance is to keep your call counts small. This principle is the basis for perhaps the best and most famous performance optimization advice ever rendered:
The First Rule of Program Optimization: Don’t do it.

The Second Rule of Program Optimization (for experts only!): Don’t do it yet.

The Problem

Keeping call counts small is really, really important. This makes being a vendor of information services difficult, because it is so easy for application users to make call counts grow. They can do it by running more programs, by adding more users, by adding new features or reports, or even just by the routine process of adding more data every day.

Running your application with other applications on the same computer complicates the problem. What happens when all these applications’ peak workloads overlap? It is a problem that Application Service Providers (ASPs), Software as a Service (SaaS) providers, and cloud computing providers must solve.

The Solution

The solution is a process:
  1. Call counts are sacred. They can be difficult to forecast, so you have to measure them continually. Understand them. Hire people who understand them, and who know how to measure and improve the efficiency of your application programs and the systems they reside on.
  2. Give your people time to fix inefficiencies in your code. An inexpensive code fix might return many times the benefit of an expensive hardware upgrade. If you have bought your software from a software vendor, work with them to make sure they are streamlining the code they ship you.
  3. Learn when to say no. Don’t add new features (especially new long-running programs like reports) that are inefficient, that make more calls than necessary. If your users are already creating as much workload as the system can handle, then start prioritizing which workload you will and won’t allow on your system during peak hours.
  4. If you are an information service provider, charge your customers for the amount of work your systems do for them. The economic incentive to build and buy more efficient programs works wonders.

Friday, November 20, 2009

Performance Optimization with Global Entry. Or Not?

As I entered the 30-minute "U.S. Citizens" queue for immigration back into the U.S. last week, the helpful "queue manager" handed me a brochure. This is a great place to hand me something to read, because I'm captive for the next 30 minutes as I await my turn with the immigration officer at the Passport Control desk. The brochure said "Roll through Customs faster."

Ok. I'm listening.

Inside the brochure, the first page lays out the main benefits:
  • bypass the passport lines
  • no paper Customs declaration
  • in most major U.S. airports
Well, that's pretty cool. Especially as I'm standing only 5% deep in a queue with a couple hundred people in it. And look, there's a Global Entry kiosk right there with its own special queue, with nobody—nobody!—in it.

If I had this Global Entry thing, I'd have a superpower that would enable me to zap past the couple hundred people in front of me, and get out of the Passport Control queue right now. Fantastic.

So what does this thing cost? It's right there in the brochure:
  1. Apply online at www.globalentry.gov. There is a non-refundable $100 application fee. Membership is valid for five years. That's $20 a year for the queue-bypassing superpower. Not bad. Still listening.
  2. Schedule an in-person interview. Next, I have to book an appointment to meet someone at the airport for a brief interview.
  3. Complete the interview and enrollment. I give my interview, get my photo taken, have my docs verified, and that's it, I'm done.
So, all in all, it doesn't cost too much: a hundred bucks and probably a couple hours one day next month sometime.

What's the benefit of the queue-bypassing superpower? Well, it's clearly going to knock a half-hour off my journey through Passport Control. I immigrate three or four times per year on average, and today's queue is one of the shorter ones I've seen, so that's at least a couple hours per year that I'd save... Wow, that would be spectacular: a couple more hours each year in my family's arms instead of waiting like a lamb at the abattoir to have my passport controlled.

But getting me into my family's arms 30 minutes earlier is not really what happens. The problem is a kind of logic that trips up people I meet all the time: when you think about subsystem (or resource) optimization, it looks as though your latency savings for the subsystem should go straight to your system's bottom line, but that's often not what happens. That's why I really don't care about subsystem optimization; I care about response time. I could say that a thousand times, but my statement is too abstract to really convey what I mean unless you already know what I mean.

What really happens in the airport story is this: if I had used Global Entry on my recent arrival, it would have saved me only a minute or two. Not half an hour, not even close.

It sounds crazy, doesn't it? How can a service that cuts half an hour off my Passport Control time not get me home at least a half hour earlier?

You'll understand once I show you a sequence diagram of my arrival.

To read this sequence diagram, start at the top. Time flows downward. This sequence diagram shows two competing scenarios. The multicolored bar on the left-hand side represents the timeline of my actual recent arrival at DFW Airport, without using the Global Entry service. The right-hand timeline is what my arrival would have looked like had I been endowed with the Global Entry superpower.

You can see at the very bottom of the timeline on the right that the time I would have saved with Global Entry is minuscule: only a minute or two.

The real problem is easy to see in the diagram: Queue for Baggage Claim is the great equalizer in this system. No matter whether I'm a Global Entrant or not, I'm going to get my baggage when the good people outside with the Day-Glo Orange vests send it up to me. My status in the Global Entry system has absolutely no influence over what time that will occur.

Once I've gotten my baggage, the Global Entry superpower would have again swung into effect, allowing me to pass through the zero-length queue at the Global Entry kiosk instead of waiting behind two families at the Customs queue. And that's the only net benefit I would have received.

Wait: there were only two families in the Customs queue? What about the hundreds of people I was standing behind in the Passport Control queue? Well, many of them were gone already (either they had hand-carry bags only, or their bags had come off earlier than mine). Many others were still awaiting their bags on the Baggage Claim carousel. Because bags trickle out of the baggage claim process, there isn't the huge all-at-once surge of demand at Customs that there is at Passport Control when a plane unloads. So the queues are shorter.

At any rate, there were four queues at Customs, and none of them was longer than three or four families. So the benefit of Global Entry—in exchange for the $100 and the time spent doing the interview—for me, this day, would have been only the savings of a couple of minutes.

Now, if—if, mind you—I had been able to travel with only carry-on luggage, then Global Entry would have provided me significantly more value. But when I'm returning to the U.S. from abroad, I'm almost never allowed to carry on any bag other than my briefcase. Furthermore, I don't remember ever clearing Passport Control to find my bag waiting for me at Baggage Claim. So the typical benefit to me of enrolling in Global Entry, unfortunately, appears to be only a fraction of the duration required to clear Customs, which in my case is almost always approximately zero.

The problem causing the low value (to me) of the Global Entry program is that the Passport Control resource hides the latency of the Baggage Claim resource. No amount of tuning upon the Passport Control resource will affect the timing of the Baggage In Hand milestone; the time at which that milestone occurs is entirely independent of the Passport Control resource. And that milestone—as long as it occurs after I queue for Baggage Claim—is a direct determinant of when I can exit the airport. (Gantt or PERT chart optimizers would say that Queue for Baggage Claim is on the critical path.)
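If you prefer arithmetic to diagrams, here's a back-of-the-envelope Perl sketch of the same critical-path idea (every duration in it is invented; only the structure matters). Customs can't begin until both Passport Control and Baggage Claim are done:

#!/usr/bin/perl
# Exit time is gated by max(passport control, baggage ready) plus
# Customs. Tuning Passport Control below the baggage-ready time
# buys almost nothing. All durations are invented for illustration.
use strict;
use warnings;

sub exit_time {
    my ($passport_minutes, $customs_minutes) = @_;
    my $baggage_ready = 45;    # bag hits the carousel at t+45 min
    my $gate = $passport_minutes > $baggage_ready
             ? $passport_minutes
             : $baggage_ready;  # the critical path runs through here
    return $gate + $customs_minutes;
}

printf "without Global Entry: exit at t+%d min\n", exit_time(32, 4);
printf "with Global Entry:    exit at t+%d min\n", exit_time( 2, 2);

With these made-up numbers, Global Entry gets me out at t+47 instead of t+49: a minute or two, which is exactly the effect the sequence diagram shows.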

How could a designer make the airport experience better for the customer? Here are a few ideas:
  • Let me carry on more baggage. This idea would allow me to trot right through Baggage Claim without waiting for my bag. In this environment, the value of Global Entry would be tremendous. Well, nice theory; but allowing more carry-on baggage wouldn't work too well in the aggregate. The overhead bins on my flight were already stuffed to maximum capacity, and we don't need more flight delays induced by passengers who bring more stuff onboard than the cabin can physically accommodate.
  • Improve the latency of the baggage claim process. The sequence diagram shows clearly that this is where the big win is. It's easy to complain about baggage claim, because it's nearly always noticeably slower than we want it to be, and we can't see what's going on down there. Our imaginations inform us that there's all sorts of horrible waste going on.
  • Use latency hiding to mask the pain of the baggage claim process. Put TV sets in the Baggage Claim area, and tune them to something interesting instead of infinite loops of advertising. At CPH, they have a Danish hot dog stand in the baggage claim area. They also have a currency exchange office in there. Excellent latency hiding ideas if you need a snack or some DKK walkin'-around-money.
Latency hiding is a weak substitute for improving the speed of the baggage claim process. The killer app would certainly be to make Baggage Claim faster. Note, however, that just making Baggage Claim a little bit faster wouldn't make the Global Entry program any more valuable. To make Global Entry any more valuable, you'd have to make Baggage Claim fast enough that your bag would be waiting for anyone who cleared the full Passport Control queue.

So, my message today: When you optimize, you must first know your goal. So many people optimize subsystems (resources) that they think are important, but optimizing subsystems is often not a path to optimizing what you really want. At the airport, I really don't give a rip about getting out of the Passport Control queue if it just means I'm going to be dumped earlier into a room where I'll have to wait until a fixed time for my baggage.

Once you know what your real optimization goal is (that's Method R step 1), then the sequence diagram is often all you need to get your breakthrough insight that helps you either (a) solve your problem or (b) understand when there's nothing further that you can really do about it.

Friday, April 3, 2009

Cary on Joel on SSD

Joel Spolsky's article on Solid State Disks is a great example of a type of problem my career is dedicated to helping people avoid. Here's what Joel did:
  1. He identified a task needing performance improvement: "compiling is too slow."
  2. He hypothesized that converting from spinning rust disk drives (thanks mwf) to solid state, flash hard drives would improve performance of compiling. (Note here that Joel stated that his "goal was to try spending money, which is plentiful, before [he] spent developer time, which is scarce.")
  3. So he spent some money (which is, um, plentiful) and some of his own time (which is apparently less scarce than that of his developers) replacing a couple of hard drives with SSD. If you follow his Twitter stream, you can see that he started on it 3/25 12:15p and wrote about having finished at 3/27 2:52p.
  4. He was pleased with how much faster the machines were in general, but he was disappointed that his compile times underwent no material performance improvement.
Here's where Method R could have helped. Had he profiled his compile times to see where the time was being spent, he would have known before the upgrade that SSD was not going to improve response time. Given his results, his profile for compiling must have looked like this:
100%  Not disk I/O
  0%  Disk I/O
----  ------------
100%  Total
I'm not judging whether he wasted his time by doing the upgrade. By his own account, he is pleased at how fast his SSD-enabled machines are now. But if, say, the compiling performance problem had been survival-threateningly severe, then he wouldn't have wanted to expend two business days' worth of effort upgrading a component that was destined to make zero difference to the performance of the task he was trying to improve.
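Here's the arithmetic behind that claim as a quick Perl sketch. It's just Amdahl's law: if a component is fraction p of your response time and you speed it up by a factor of s, the new time is old_time * ((1 - p) + p/s). The 60-second compile time is a number I invented; the p values echo the profile above:

#!/usr/bin/perl
# Amdahl's law: the benefit of speeding up one component is capped
# by that component's share of the profile. Numbers are invented.
use strict;
use warnings;

sub predicted_time {
    my ($old_time, $p, $s) = @_;
    return $old_time * ((1 - $p) + $p / $s);
}

my $compile = 60;    # an invented 60-second compile

# Disk I/O at 0% of the profile: even a 100x faster disk changes nothing.
printf "p = 0.00, s = 100: %4.1f s\n", predicted_time($compile, 0.00, 100);

# Had disk I/O been half the profile, the SSD would have nearly halved it.
printf "p = 0.50, s = 100: %4.1f s\n", predicted_time($compile, 0.50, 100);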

So, why would someone embark upon a performance improvement project without first knowing exactly what result he should be able to expect? I can think of some good reasons:
  • You don't know how to profile the thing that's slow. Hey, if it's going to take you a week to figure out how to profile a given task, then why not spend half that time doing something that your instincts all say is surely going to work?
  • Um, ...
Ok, after trying to write them all down, I think it really boils down to just one good reason: if profiling is too expensive (that is, you don't know how, or it's too hard, or the tools to do it cost too much), then you're not going to do it. I don't know how I'd profile a compile process on a Microsoft Windows computer. It's probably possible, but I can't think of a good way to do it. It's all about knowing; if you knew how to do it, and it were easy, you'd do it before you spent two days and a few hundred bucks on an upgrade that might not give you what you wanted.

I do know that in the Oracle world, it's not hard anymore, and the tools don't cost nearly as much as they used to. There's no need anymore to upgrade something before you know specifically what's going to happen to your response times. Why guess... when you can know.

Wednesday, February 18, 2009

Throughput versus Response Time

I like Doug Burns's recent blog post called Time Matters: Throughput vs. Response Time. If you haven't read it, please do. The post and its comment thread are excellent.

The principle Doug has recognized is why the knee in the performance curve is defined as the traffic intensity (think utilization, or load) value at which, essentially, the ratio of response time divided by throughput is minimized. It's not just the place where response time is minimized (which, as Doug observed, is when there's no load at all except for you, ...which is awesome for you, but not so good for business).
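To see the ratio definition in action, here's a small Perl sketch using M/M/1 (the simplest queueing model; I chose it for illustration, and Doug's curves come from richer models). It scans utilization for the point that minimizes response time divided by throughput:

#!/usr/bin/perl
# For M/M/1: response time R = S/(1 - rho), throughput X = rho/S,
# so R/X = S**2 / (rho * (1 - rho)), which bottoms out at rho = 0.5.
use strict;
use warnings;

my $S = 0.010;    # service time in seconds per task (invented)
my ($best_rho, $best_ratio) = (0, 1e18);
for (my $rho = 0.01; $rho < 1; $rho += 0.01) {
    my $R     = $S / (1 - $rho);    # response time
    my $X     = $rho / $S;          # throughput
    my $ratio = $R / $X;
    ($best_rho, $best_ratio) = ($rho, $ratio) if $ratio < $best_ratio;
}
printf "knee at utilization %.2f\n", $best_rho;    # 0.50 for M/M/1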

I'd like to emphasize a couple of points. First, batch and interactive workloads have wholly different performance requirements, which several people have already noted in their comments to Doug's post. With batch work, people are normally concerned with maximizing throughput. With online work, individual people care more about their own response times than group throughput, although those people's managers probably care more about group throughput. The individual people probably care about group throughput too, but not so much that they're happy about staying late after work to provide it when their individual tasks run so slowly they can't finish them during the normal working day.

In addition to having different performance requirements, batch workload can often be scheduled differently, too. If you're lucky, you can schedule your batch workload deterministically. For example, maybe you can employ a batch workload manager that feeds workload to your system like a carefully timed IV drip, to keep your system's CPU utilization pegged at 100% without causing your CPU run-queue depth to exceed 1.0. But online workload is almost always nondeterministic, which is to say that it can't be scheduled at all. That's why you have to keep some spare un-utilized system capacity handy; otherwise, your system load goes out past the nasty knee in your performance curve, and your users' response times behave exponentially in response to microscopic changes in load, which results in much Pain and Suffering.

My second point is one that I find that a lot of people don't understand very well: Focusing on individual response time—as in profiling—for an individual business task is an essential element in a process to maximize throughput, too. There are good ways to make a task faster, and there are bad ways. Good ways eliminate unnecessary work from the task without causing negative side-effects for tasks you're not analyzing today. Bad ways accidentally degrade the performance of tasks other than the one(s) you're analyzing.

If you stick to the good ways, you don't end up with the see-saw effect that most people seem to think of when they hear "optimize one business task at a time." You know, the idea that tuning A breaks B; then tuning B breaks A again. If this is happening to you, then you're doing it wrong. Trying to respond to performance problems by making global parameter changes commonly causes the see-saw problem. But eliminating wasteful work creates collateral benefits that allow competing tasks on your system to run faster because the task you've optimized now uses fewer resources, giving everything else freer and clearer access to the resources they need, without having to queue so much for them.

Figuring out how to eliminate wasteful work is where the real fun begins. A lot of the tasks we see are fixable by changing just a little bit of source code. I mean the 2,142,103-latch query that consumes only 9,098 latches after fixing; things like that. A lot more are fixable by simply collecting statistics correctly. Others require adjustments to an application's indexing strategy, which can seem tricky when you need to optimize across a collection of SQL statements (here comes the see-saw), but even that is pretty much a solved problem if you understand Tapio Lahdenmäki's work (except for the inevitable politics of change control).

Back to the idea of Doug's original post, I wholeheartedly agree that you want to optimize both throughput and response time. The business has to decide what mixture is right. And I believe it's crucial to focus on eliminating waste from each individual competing task if you're going to have any hope of optimizing anything, whether you care more about response time, or throughput.

Think about it this way... A task cannot run at its optimal speed unless it is efficient. You cannot know whether a task is efficient without measuring it. And I mean specifically and exactly it, not just part of "it" or "it" plus a bunch of other stuff surrounding it. That's what profiling is: the measurement of exactly one interesting task that allows you to determine exactly where that task spends its time, and thus whether that task is spending your system's time and resources efficiently.

You can improve a system without profiling, and maybe you can even optimize one without profiling. But you can't know whether a system is optimal without knowing whether its tasks are efficient, and you can't know whether a given task is efficient without profiling it.

When you don't know, you waste time and money. This is why I contend that the ability to profile a single task is absolutely vital to anyone wanting to optimize performance.

Tuesday, February 3, 2009

Reading Knuth

I've been reading a little Knuth for the past couple of days. I've seen the famous statement, "premature optimization is the root of all evil," quoted many times, and yesterday I decided to read it in its full context. You can find Knuth's whole article here (thank you, Wikipedia):
Knuth, Donald: Structured Programming with Goto Statements. Computing Surveys 6:4 (1974), 261–301.
I want to quote the passage for you that contains the famous line about premature optimization. It's on page 268 (the eighth page of the article):
There is no doubt that the grail of efficiency leads to abuse. Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil.
Most people stop here when they quote Knuth. But actually, it just keeps getting better:
Yet we should not pass up our opportunities in that critical 3%. A good programmer will not be lulled into complacency by such reasoning, he will be wise to look carefully at the critical code; but only after that code has been identified. It is often a mistake to make a priori judgments about what parts of a program are really critical, since the universal experience of programmers who have been using measurement tools has been that their intuitive guesses fail. After working with such tools for seven years, I've become convinced that all compilers written from now on should be designed to provide all programmers with feedback indicating what parts of their programs are costing the most; indeed, this feedback should be supplied automatically unless it has been specifically turned off.
What a beautiful expression of the idea that's at the core of what I do for a living.

If you haven't seen "For Developers: Making Friends with the Oracle Database" yet, I hope you'll take a look. My aim is to help Oracle application developers writing Java, PHP, C#—or anything else—understand how to find that critical code that Knuth wrote about in 1974. ...And, just as importantly, how to ignore the other 97% of your code that you shouldn't be worrying about.

Tuesday, December 16, 2008

A Small Adventure in Profiling

Tonight I'm finishing up some code I'm writing. It's a program that reports on directories full of trace files. I can tell you more about that later. Anyway, tonight, I got my code doing pretty much what I wanted it to be doing, and I decided to profile it. This way, I can see where my code is spending its time.

My program is called lstrc. It's written in Perl. Here's how I profiled it:
23:31:09 $ perl -d:DProf /usr/local/bin/lstrc
The output of my program appeared when I ran it. Then I ran dprofpp, and here's what I got:
23:31:23 $ dprofpp
Total Elapsed Time = 0.411082 Seconds
User+System Time = 0.407182 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
64.1 0.261 0.261 18 0.0145 0.0145 TFK::Util::total_cpu
18.1 0.074 0.076 176 0.0004 0.0004 TFK::Util::timdat
5.65 0.023 0.052 9 0.0026 0.0058 main::BEGIN
1.72 0.007 0.007 1348 0.0000 0.0000 File::ReadBackwards::readline
0.98 0.004 0.022 6 0.0007 0.0036 TFK::Util::BEGIN
0.74 0.003 0.011 18 0.0002 0.0006 TFK::Util::tim1
0.74 0.003 0.004 6 0.0005 0.0006 ActiveState::Path::BEGIN
0.74 0.003 0.014 7 0.0004 0.0019 Date::Parse::BEGIN
0.74 0.003 0.359 2 0.0015 0.1797 main::process_files
0.49 0.002 0.002 4 0.0005 0.0006 Config::BEGIN
0.49 0.002 0.002 177 0.0000 0.0000 File::Basename::fileparse
0.49 0.002 0.002 176 0.0000 0.0000 File::Basename::_strip_trailing_sep
0.49 0.002 0.002 3 0.0005 0.0005 Exporter::as_heavy
0.49 0.002 0.002 6 0.0003 0.0004 File::ReadBackwards::BEGIN
0.25 0.001 0.002 24 0.0001 0.0001 Getopt::Long::BEGIN
What this says is that the function called TFK::Util::total_cpu accounts for 64.1% of the program's total execution time. The thing you couldn't have known (except I'm going to tell you) is that this program is not supposed to execute the function TFK::Util::total_cpu. At all. It's because I didn't specify the --cpu command line argument. (I told you that you couldn't have known.)

Given this knowledge that my code was spending 64.1% of my time executing a function that I didn't even want to run, I was able to add the appropriate branch around the call of TFK::Util::total_cpu. Then, when I ran my code again, it produced exactly the same output, but its profile looked like this:
23:33:07 $ dprofpp
Total Elapsed Time = 0.150279 Seconds
User+System Time = 0.147957 Seconds
Exclusive Times
%Time ExclSec CumulS #Calls sec/call Csec/c Name
50.0 0.074 0.076 176 0.0004 0.0004 TFK::Util::timdat
15.5 0.023 0.053 9 0.0026 0.0058 main::BEGIN
4.73 0.007 0.007 1348 0.0000 0.0000 File::ReadBackwards::readline
2.70 0.004 0.022 6 0.0007 0.0036 TFK::Util::BEGIN
2.70 0.004 0.013 18 0.0002 0.0007 TFK::Util::tim1
2.03 0.003 0.004 6 0.0005 0.0006 ActiveState::Path::BEGIN
2.03 0.003 0.013 7 0.0004 0.0019 Date::Parse::BEGIN
2.03 0.003 0.100 2 0.0015 0.0499 main::process_files
1.35 0.002 0.002 4 0.0005 0.0005 Config::BEGIN
1.35 0.002 0.002 177 0.0000 0.0000 File::Basename::fileparse
1.35 0.002 0.002 176 0.0000 0.0000 File::Basename::_strip_trailing_sep
1.35 0.002 0.002 6 0.0003 0.0004 File::ReadBackwards::BEGIN
1.35 0.002 0.002 3 0.0005 0.0005 Exporter::as_heavy
0.68 0.001 0.002 24 0.0001 0.0001 Getopt::Long::BEGIN
0.68 0.001 0.005 176 0.0000 0.0000 File::Basename::basename
Yay.
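In case it helps to see the shape of the fix, here's a sketch of the branch I mean (the option handling and variable names here are my illustration; the real lstrc code differs):

#!/usr/bin/perl
# Guard the expensive call so it runs only when the user asks for it.
use strict;
use warnings;
use Getopt::Long;

my %opt;
GetOptions(\%opt, 'cpu') or die "usage: $0 [--cpu] file...\n";

for my $file (@ARGV) {
    # ... the usual per-file reporting goes here ...
    if ($opt{cpu}) {
        # Pay for the CPU summation only on request:
        # my $cpu = TFK::Util::total_cpu($file);
    }
}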

Let me summarize:
Total Elapsed Time = 0.411082 Seconds — before the fix
Total Elapsed Time = 0.150279 Seconds — after the fix
That's about a 64% improvement in response time, in return for about 30 extra seconds of development work.

Profiling—seeing how your code has spent your time—rocks.