Cary Millsap: An Organizational Constraint that Diminishes Software Quality

Thursday, June 7, 2012

One of the biggest problems in software performance today occurs when the people who write software are different from the people who are required to solve the performance problems that their software causes. It works like this:

Architects design a system and pass the specification off to the developers.
The developers implement the specs the architects gave them, while the architects move on to design another system.
When the developers are “done” with their phase, they pass the code off to the production operations team. The operators run the system the developers gave them, while the developers move on to write another system.

The process is an assembly line for software: architects specialize in architecture, developers specialize in development, and operators specialize in operating. It sounds like the principle of industrial efficiency taken to its logical conclusion in the software world.

In this waterfall project plan,
architects design systems they never see written,
and developers write systems they never see run.

Sound good? It sounds like how Henry Ford made a lot of money building cars... Isn’t that how they build roads and bridges? So why not?

With software, there’s a horrible problem with this approach. If you’ve ever had to manage a system that was built like this, you know exactly what it is.

The problem is the absence of a feedback loop between actually using the software and building it. It’s a feedback loop that people who design and build software need for their own professional development. Developers who never see their software run don’t learn enough about how to make their software run better. Likewise, architects who never see their systems run have the same problem, only it’s worse, because (1) their involvement is even more abstract, and (2) their feedback loops are even longer.

Who are the performance experts in most Oracle shops these days? Unfortunately, it’s most often the database administrators, not the database developers. It’s the people who operate a system who learn the most about the system’s design and implementation mistakes. That’s unfortunate, because the people who design and write a system have so much more influence over how a system performs than do the people who just operate it.

If you’re an architect or a developer who has never had to support your own software in production, then you’re probably making some of the same mistakes now that you were making five years ago, without even realizing they’re mistakes. On the other hand, if you’re a developer who has to maintain your own software while it’s being operated in production, you’re probably thinking about new ways to make your next software system easier to support.

So, why is software any different than automotive assembly, or roads and bridges? It’s because software design is a process of invention. Almost every time. When is the last time you ever built exactly the same software you built before? No matter how many libraries you’re able to reuse from previous projects, every system you design is different from any system you’ve ever built before. You don’t just stamp out the same stuff over and over.

Software is funny that way, because the cost of copying and distributing it is vanishingly small. When you make great software that everyone in the world needs, you write it once and ship it at practically zero cost to everyone who needs it. Cars and bridges don’t work that way. Mass production and distribution of cars and bridges requires significantly more resources. The thousands of people involved in copying and distributing cars and bridges don’t have to know how to invent or refine cars or bridges to do great work. But with software, since copying and distributing it is so cheap, almost all that’s left is the invention process. And that requires feedback, just like inventing cars and bridges did.

Don’t organize your software project teams so that they’re denied access to this vital feedback loop.

18 comments:

Unknown said...: I find it really comes down to the CapEx vs OpEx accounting split and the fact that developers can walk away from their code, until called in on very specific issues or changes.

Developers should be forced into operations for 12 months after release of their product. "That'ld learn 'em"; June 7, 2012 at 10:33 AM
Anonymous said...: It would be nice to keep developers busy with aftercare for a few months to see how the code performs in production... but it seems virtually impossible when the same set of developers has to start coding for the next release of the product to fulfill new functional requirements and at the same time address bugs from the current release.... After go-live, I think regular meetings between operations and development teams can help bridge the gap to some extent.; June 7, 2012 at 12:03 PM
Mladen Gogala said...: Cary, how exactly do you think this should be done? To allow developers access to production, in order to make the feedback available? To merge the functions of a DBA and software architect? DBA is a generalist, not a specialist. In the animal kingdom, its best analogy would be a black bear. Developers would be herbivores: they take input from the people who handle the green stuff and process it into fertilizer (applications). However, in mother nature, it would be very hard to get a bear and a deer working together, on the same shrub, without turning the deer into the main course. The same goes for software engineering: you can't have it both ways. The solution is for the DBA to learn as much about the application as humanly possible. However, it's one thing to say that the developers shouldn't be deprived of the feedback and something else entirely to give them access to production. In the tale, the lion and the lamb lie down together; in reality, they will not wake up together. Having app designers act as DBA-type personnel is the same type of story.; June 7, 2012 at 12:03 PM
Cary Millsap said...: Mladen, one of my favorite examples of integrating developers into operations is the Oracle Corporation idea (Jeff Walker circa 1991) of The Application Developer Wall of Shame. If your code caused a performance problem in production, your code with your name on it got hat-pinned to a public-area fabric cubicle wall in Belmont for everyone to see.

Nobody wanted to be on that wall. The Wall stimulated no end of brown-bag lunches and collaboration between developers and operations people (they even reached out to consultants like me) who might be able to give developers a lead on what to fix today so their name wouldn’t show up on The Wall tomorrow.

For products that don’t have such big ongoing operational responsibilities attached to them (like iPhone apps, for example), the key becomes for the people who write the apps to have a vested interest as users of those apps. You feel the pain/gap yourself, and you refine the product in response to that heightened understanding. I’m convinced that this is what makes Apple so good, and some car manufacturers, too.

I talked about this feedback loop problem, too, in “My case for agile methods.” Another good place I recommend looking: Dominic Delmolino has written down some of what I think are the best ideas available on this subject.; June 7, 2012 at 12:20 PM
Mladen Gogala said...: BTW, I read "The Goal" following the recommendation from your "Optimizing Oracle" book.; June 8, 2012 at 10:06 AM
Mladen Gogala said...: Hi Cary, I read Eli Goldratt's book "The Goal" and I read your paper about the agile methods. I must still say that I disagree. Here is why: agile methodology is a software development methodology, not an administration methodology. DBA personnel usually don't do software development. My assignments are typically to fix a performance problem, refresh a development instance, set up a new instance for some purpose, coordinate performance testing, monitor the database, or resolve a software problem with Oracle Support. Projects include things like upgrading to 11g, transitioning to RAC, switching to a new SAN, or upgrading hardware for the production DB. Agile methodology was created to help with ever-changing requirements. The tasks and projects that I have just described do not have ever-changing requirements, as is the case with most infrastructure projects.
Daily scrum meetings, short sprints and retros impose a lot of overhead on the programming staff, which typically spends at least 1 hour daily in meetings. That is all peachy and well for application development, which has to follow business practices in the ever-changing business world. For infrastructure projects, which have a fixed and immutable set of requirements, agile methodology imposes an unnecessary burden and is therefore unsuitable for the DBA projects described above.
All the agile shops I have ever worked with had serious problems with managing change and doing software upgrades. The databases were always lagging behind, frequently using unsupported releases. Put plainly, DBA personnel don't do software development, and a methodology which is centred around software development is simply unsuitable for DBA work.; June 8, 2012 at 10:08 AM
John Hurley said...: "When you make great software that everyone in the world needs, you write it once and ship it at practically zero cost to everyone who needs it."

??? Can you give a couple of examples of what you are referring to specifically here ... me sorry no idea.

Struggling with several concepts here in this sentence, including the first premise "people need software" ... then sorry, it goes downhill from there; June 8, 2012 at 3:35 PM
Unknown said...: Hi Cary,

I talked about a similar topic at Amazon COMMIT 2011; you were a little late and missed my session.

Control and Responsibility:
http://www.facebook.com/note.php?note_id=409881258919

I still owe you the slides about the best way to do something is not to do it at all. ^_^

Thanks,
Charlie; June 8, 2012 at 4:12 PM
Joel Garry said...: @Charlie:

Maybe you should consider that most places ought to try their software to be sure it works before they distribute it. Most places aren't Facebook, and maybe Facebook is doing it wrong. I know I'm offended by slipstreaming. I know I'm offended by the lousy UI of LinkedIn and Facebook, and even more so to see the problems propagate to other sites like monkeys in the zoo.

The wall of shame is everywhere nowadays, and no one is afraid to be on it.

++ to Mladen's comments.

I'm not a robot, and I am having difficulty proving so.; June 8, 2012 at 4:44 PM
Robyn said...: Hi Cary,

Agreed. Feedback is 100% necessary for anyone to do their job well, and organizational structures everywhere are insulating their development teams from the info that they need to build good applications. This 'syndrome' used to be most apparent with consultants who focused on implementations: they took on a project that lasted 6 months or two years, depending on the new (usually ERP related) application. Once it was done, they moved on and they never had to deal with the systems they built, and their knowledge of the application was limited. As a result, they deployed the same mistakes on systems at multiple locations and organizations, creating a growing market for troubleshooting and performance consultants :)

The same thing is now apparent in the development world. Development tools aim to insulate the developer from the database, and as a result, he has little insight into how his code uses/impacts database, system and network resources. DBAs seem to be focused on their version of the latest cool technology as well. They're more interested in filling up their resume with words like 'RAC', 'Exadata', 'High Availability', or 'Big Data'. Their resumes may mention 'performance tuning' but their skills are more akin to 'performance hacking' by playing with parameters or finding shortcuts to mitigate application performance weaknesses - shortcuts that only delay a reoccurrence of the problem rather than preventing it.

This leaves a huge skill gap in a development team: no one seems to know how to make the application more efficient, yet software is the best place to make huge leaps in performance capability. Ironically, knowledge about how the application works and the business data is also the critical knowledge that makes an IT worker most valuable, and therefore, not subject to outsourcing. Cary, I think you are correct in stating that organizational structures create the problem, and reward systems are feeding it. Smart managers should be looking for ways to bridge the gap and make sure the development team is completely aware of how their product behaves in the real world. Smart IT workers won't wait for their managers. It's absolutely possible for a motivated developer to get the feedback they need and increase the value of the tools they build in spite of the organization. Best of all, they make themselves more valuable in the process. And if DBAs don't find a way to get into the act, they will soon be seen as swappable spares.

cheers ... Robyn; June 8, 2012 at 5:23 PM
Cary Millsap said...: Joel, Mladen, John: ...But is there any doubt that, for example, the experience of using Facebook is immensely superior to using, say, Yahoo Groups (where pictures and comments are still stored in widely separated parts of the application)?

I feel that the average quality of application user experiences has improved stunningly over just the past 4 years. Every day when I use Eventbrite or TripIt or iTunes or Evernote, I feel it, especially on days when I have to compare my experiences using those kinds of apps with the experiences of using something like the Oracle self-service time reporting application or an SAP government vendor registration form.

There’s a quantum productivity difference between using first-class, modern, well-designed applications and using old-style applications that developers implemented in response to fixed and immutable requirements dictated in a one-way flow of information by architects and designers. This “quantum productivity difference” statement I’m making here is not just a statement about performance of the application; it’s a statement about the performance of the person using the application, which is the whole point of software in the first place.

Mladen, to your point, I think that the statement “DBA is a generalist, not a specialist” is ripe for negotiation. There’s a tremendous rainbow of responsibilities that the term “DBA” can describe, and those responsibilities range from virtually zero-creativity-required tasks all the way up to designing applications and business processes. I think that the people who execute only low-creativity tasks (i.e., tasks requiring repeated execution of a thoroughly debugged process) are, exactly as Robyn noted, destined to become swappable spares, just like they have in the manufacturing industry. I believe that the tasks requiring invention are where success and feedback loop lengths are, without a doubt, inversely proportional.; June 8, 2012 at 6:31 PM
Mladen Gogala said...: Cary, being a generalist doesn't necessarily mean being a robot. A generalist these days has to know his platform exceptionally well, has to know Oracle, as many aspects of it as possible, a scripting language or two, probably an additional database or two, and has to have project management skills and enough experience to be able to solve all kinds of problems. As a DBA, I once designed and implemented software that parsed the spreadsheet sent from McGraw-Hill (S&P index) and inserted it into the database by using procmail and Perl. That still doesn't make me a developer. Also, a DBA should not be a programmer; he or she is an administrative resource, also in charge of security. A very large share of major break-ins were committed by an insider, frequently a programmer. DBA and development have to be separated. I am afraid that there isn't much room for negotiation about that.

As for the feedback, I agree that it is necessary, but I have had very bad experiences with agile methodology applied to DBA work as a way of implementing that feedback. There are various other methods of organizing workflow, so that the DBA doesn't have to attend those inane scrum meetings, sprint retro events and other rituals that make sense in the case of development but are utterly ridiculous in the case of DBA projects.
Finally, as for being a swappable spare, I am that even now. My employer can always thank me for my services and hire another DBA, more to his liking. Larry Ellison has been a prophet of doom for the DBA for a long, long time. One of the major Microsoft sales pitches against Oracle is the need for an "expensive DBA". However, neither you nor Jonathan Lewis, Tanel Poder or Tom Kyte are out of work yet. Neither am I. At the present state of technology, DBAs will likely be needed for some time in the future, until there is an optimizer that can take care of everything. I do not worry too much about artificial intelligence software when natural intelligence sometimes blurts out things like "don't misunderestimate me".; June 8, 2012 at 11:47 PM
Cary Millsap said...: I wish I had written this myself: “What is DevOps?” from Mike Loukides.; June 10, 2012 at 11:06 AM
Joel Garry said...: Cherry picking some phrases:

Operations doesn't go away, it becomes part of the development

Well, I don't accept this from the get-go. There's a fundamental conflict between reliability engineering and releasing early and often.

The companies that didn't suffer, including Netflix, knew how to design for reliability; they understood resilience, spreading data across zones, and a whole lot of reliability engineering. Furthermore, they understood that resilience was a property of the application, and they worked with the development teams to ensure that the applications could survive when parts of the network went down.

My problem with this is, most companies will suffer. It's expensive to do it right, and it tends to be done only on the upside of the implementation curve, or else after a disaster. Most companies are not Amazon or Netflix, and it is a severe mistake to scale smaller companies as if they are. Just like an entrepreneurial-sized company needs a different management style than Boeing, a company with midsized operational requirements has different requirements than a startup or a Netflix.

Which leads to another issue: this is ignoring any kind of lifecycle for cloud systems. Even if dev/no/shmegeggy/ops works, at some point the systems need to mature, and all this breaks down. Who's going to be the maintenance ops guy?

do developers wear the beepers, or IT staff?

Beepers? What is this, the guy stuck in the '90s on 30 Rock?

We're still learning how to monitor systems, how to analyze the data generated by modern monitoring tools, and how to build dashboards that let us see and use the results effectively... The amount of information we can capture is tremendous, and far beyond what humans can analyze without techniques like machine learning.

Translated: "We don't really know what we are doing." But we will proselytise it anyways.; June 12, 2012 at 5:24 PM
Joel Garry said...: (continued)
Operations groups have been leaders in taking best practices from older disciplines (control systems theory, manufacturing, medicine) and integrating them into software development.

Well, I can tell you that manufacturing and medicine suffer from best practices silver bullet syndrome, deteriorating as the concepts spread. I've seen it time and again (and overheard people making the same complaints in Arby's): the new MBA comes in and tries to impose what he learned in school on an incompatible set of processes and software.
Don't even get me started on medicine. Control systems theory ought to help, but again, big conflict with whatever-ops. Only very large, new, well-managed and well-funded projects can afford specialized reliability engineers, and that just ain't most projects. Much less a CORE team.

Developers deploying their own code also brings accountability, responsibility, and the requisite authority to influence production.
Of course, I work on stuff that needs to follow legal requirements and have traceability. Must be nice not to have that. I expect someday the software industry will mature to the accountability point of not being able to say "Hey, software has bugs, too effing bad for you." Severable responsibility is a good thing, both in practical terms and scalability of process development. Check out that constitutional government thing.

put one "canary" instance into traffic
I don't even know what to say about this as quality control without being sarcastic.

it's trivial for a developer to create their own cluster of any size without assistance (NoOps again).
I guess it must be nice to not have to answer to a budget.

But is there any doubt that, for example, the experience of using Facebook is immensely superior to using, say, Yahoo Groups

Yeah, but what we have here is the experience of using Facebook for all your base in the cloud.; June 12, 2012 at 5:25 PM
Dominic Delmolino said...: Cary,

Sorry I'm late to the party. I'll try to describe what I've seen in this space that appears to improve things somewhat. As always, your mileage may vary and I'm sure there's more than one way to skin this cat.

I agree that the heart of the problem is the disconnect between the folks that make the decisions regarding how an application's database usage is codified (the developers) and the folks who are designated as responsible for how well it performs (the operations personnel). I think it only makes sense to hold accountable the folks who decided how to use the database for how well it ends up performing.

Yes, I know that there are many ways a poorly installed or configured infrastructure can affect database performance, but in my experience those incidents pale beside poorly designed and codified application/database logic.

There are many ways to try and head off the issue of having poorly designed and codified application/database logic, including making sure the people who do the designs and write the code are held responsible for the ultimate performance of their code during production usage. I've seen ideas that include:
1) Giving developers read-only access to tools and systems that allow them to observe in real-time the performance of their application code and requiring them to spend a portion of their time doing so.
2) Having operations folks provide regular feedback through reports, identification of poor code, "Wall of Shame", filing internal bugs -- and reviewing that feedback with developers.
3) Embedding capable database professionals in the development and design processes so that their experience is applied to dealing with nascent performance issues before they mature.
4) Providing developers with a read-only copy of the production database / data so that they can test their queries and code at scale.

I'm sure there are more ideas and not every idea will work in every environment -- but I think almost any of these ideas are better than having developers and DBAs simply play the victim of each other and point fingers when performance problems occur in production.

Personally, I think most of the best DBAs (like those of you who come here, or to many of the other Oak Table and some Oracle Ace blogs) have talents that are wasted when they are limited to operations instead of being applied to assisting the design and development of database/application logic. Who better to decide whether or not a table should be partitioned or built as an IOT?; June 19, 2012 at 9:06 AM
Dominic Delmolino said...: Changing gears a bit, I'm always puzzled by so-called DBA-only projects. At my last job every system change was only done if it was related to a business need or added business value, and those changes followed the system lifecycle like every other change. We didn't do undisciplined agile -- we constantly developed changes, checked them into source control, and released them into QA and production following a regular schedule (micro-releases every week, mini-releases every month, and major-releases every 3 months). Upgrades and security patches were done as part of major releases -- done first in development, tested just like any other change in QA, and deployed into production just like any other change. We upgraded when it made sense -- often skipping the X.0 releases, and waiting until the X.2.0.3 or X.2.0.4 release. We deployed database capabilities (encryption, compression, Data Guard, etc.) just like any other system change -- involving developers and operators during the process to make sure all issues were addressed.

This does mean that there's effort to make sure environments are similar -- but I've found that such effort pays off in the long run with fewer code and configuration issues. Check out Southwest Airlines' rationale for only using one kind of plane to see how this reduces cost and complexity.

In the end, as the saying often attributed to Voltaire goes: "With great power comes great responsibility" -- the power to decide how to implement database logic via developers should come with the responsibility to make sure it performs well in production. How you choose to impose that responsibility requires creativity and an appreciation of the fact that it may require a cultural change.; June 19, 2012 at 9:07 AM
asliwx said...: Great discussion... the only comprehensive codification of how to quantify and qualify the statement "An organizational constraint that diminishes software quality" ... http://www.amazon.com/Beyond-Goal-Theory-Constraints/dp/B000ELJ9NO

By definition, a constraint (or bottleneck if you like, or conflict) is something that prevents the system from achieving its goal or achieving more goal units ... With any complex system (especially one that resembles a supply chain), or in software, 'The Cloud' ... the activation of the perception of value might be 3 or 4 or 10 degrees of separation, or degrees of freedom, away ... A simple example with Oracle DB is that Oracle DB is only one part of a bigger system ... If you own an online business that uses Oracle, the 'constraint' might be Oracle, or something internal to the system ... but eventually you move the bottleneck outside your system, to the external demand for your product ... you can only increase throughput if you are able to increase the demand for your product ... and as we know, you have to balance that with Operating Expense and Inventory ...

As I have only just realized, Eli's books articulate that after you apply all the right local-optimization procedures, the biggest constraint is the 'resistance to change' ... and I paraphrase: any product or service you provide must diminish a limitation ... in order for the new technology to remove this limitation or constraint, you must change the rules ... otherwise you have a new technology, old rules, same old results ... just like Method C vs. Method R: in order to see the benefits of Method R, the rules (or assumptions or value) have to change ... otherwise the measurement system rewards the old behavior ...

Mind you that software quality is a constraint ... if the measurements support quality assurance vs. quality inspection ...; July 16, 2012 at 2:31 PM