Monday, July 30, 2012

Why are some bugs harder to fix than others?

There are a lot of different factors that affect how long it takes to find and fix a bug. Some of them we’ve already gone over. How good the bug report is – can you understand it, and does it include steps to reproduce the problem? How old the report is – how much could have changed since then, and how much will people remember if you need to ask them for details? How much work is involved in setting up a test environment to reproduce the bug. And whether you can reproduce the bug at all.

There are other factors too. The kind of bug. The size of the code base, how old it is, how ugly it is, and how brittle it is - will it break if you try to change it? How much experience you have with the language, and how well you understand the code - did you write it, or at least work on it before? How good your tools are for debugging, profiling, navigating the code and refactoring. What kind of tests you have in place to catch regressions. And how good you are at problem solving, at narrowing in on the problem, and then coming up with a safe and clean fix - especially in code that you didn’t write and don’t understand.

Gaps and Blind Spots

Marc Eisenstadt, in "My Hairiest Bug War Stories", found that the most important factors contributing to the cost of fixing a bug were:

  • Cause/Effect gap. How far is the failure removed from the actual defect in the code? Sometimes the code fails exactly where the problem is. Other times, you can see that something is broken, but you can’t trace it back to where the problem occurred.
  • How hard or expensive it is to duplicate the bug. Does it involve other systems? How much work is involved in setting up a test system? Does the bug only show up after running the system for a long time under heavy load, or with a lot of concurrent sessions? Is the problem intermittent? Is it configuration-specific – and, if so, do you have access to that configuration? Are you familiar with the tools, and do your debugging or tracing tools show you anything useful? When you enable the debugger, does the problem go away (the Heisenbug problem)?
  • Faulty assumptions. What is wrong and what you think is wrong are very different, or you don’t know enough about the platform or language to understand what’s wrong or where to look, so you start off in the wrong direction. You’ve got a blind spot, and unless you get help from somebody else to see it, you’re not going to be able to fix this problem. Knowledge and stubbornness are both important factors here. You have to know enough to know where to start looking, and you have to be stubborn enough not to give up. But stubbornly sticking with a hypothesis for too long, even – especially – if it’s a good hypothesis, will keep you from moving forward and finding the bug.

Some bugs are easier to fix than others

Capers Jones, as usual, has a lot to say about the costs of fixing bugs – see “The State of Software Quality in 2011”.

How long it takes to fix a bug can depend on what kind of bug it is. The average time to fix a security bug: 10 hours. Design bug: 8.5 hours. Data bug: 6.5 hours. Coding bug: only 3 hours. Invalid bug: 4.75 hours – it takes 4.75 hours on average to figure out that a bug is in fact not a bug, but only 3 hours to fix a coding bug! Wrap your head around that. Duplicate: 1 hour (to figure out that this bug has already been reported so you can ignore it too).

But there’s still the long tail that we talked about earlier: the maximum time to fix any of these kinds of bugs can be >10 times the average. Bugs that cannot be reproduced are the hardest – on average these kinds of bugs take up to 40 hours to fix, if they can be fixed at all.
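To make these numbers concrete, here is a rough sketch of how you might use them to size up a backlog. The per-type averages are the ones quoted above; the function name, the sample backlog and the use of a flat 10x factor for the long tail are my own assumptions for illustration, not anything taken from Capers Jones's data.

```python
# Average hours to fix, by bug type, as quoted in the text above.
AVG_FIX_HOURS = {
    "security": 10.0,
    "design": 8.5,
    "data": 6.5,
    "coding": 3.0,
    "invalid": 4.75,
    "duplicate": 1.0,
}

# Assumption: treat "the maximum can be >10 times the average" as a flat 10x factor.
TAIL_MULTIPLIER = 10.0


def estimate_backlog(counts):
    """Return (expected_hours, worst_single_bug_hours) for a backlog.

    counts maps a bug type to the number of open bugs of that type.
    """
    expected = sum(AVG_FIX_HOURS[kind] * n for kind, n in counts.items())
    # The slowest bug type present in the backlog, stretched by the long tail.
    worst_single = TAIL_MULTIPLIER * max(
        (AVG_FIX_HOURS[kind] for kind, n in counts.items() if n > 0),
        default=0.0,
    )
    return expected, worst_single


# Hypothetical backlog, for illustration only.
backlog = {"security": 1, "coding": 10, "invalid": 4, "duplicate": 5}
expected, worst = estimate_backlog(backlog)
print(expected, worst)  # 64.0 100.0
```

Notice that in this made-up backlog, one security bug that lands in the long tail could cost more than everything else put together, which is why averages alone make poor schedules.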

Costs by Severity

How severe the bug is can also affect how long it takes to find and fix. Critical Severity 1 bugs (the system is down, or your biggest customer - or your CEO - is screaming at you because something isn't right, or your database has been compromised by an attacker) are fixed in an average of 6 hours. Major bugs (Severity 2) take 9 hours. Minor bugs (Severity 3) take 3 hours, and trivial bugs (Severity 4) only 1 hour, if you bother to fix them at all.

What’s interesting is that the most severe bugs (Severity 1) take less time to fix than other major bugs – probably because when the system is down it gets immediate attention from your best people to contain the damage. But fixing a bug like this fast doesn't mean that it is cheap. There are other costs indirectly associated with a Severity 1 emergency, including operations support costs for incident management and escalation, and Root Cause Analysis to figure out what went wrong in the first place, and whatever follow-up actions you need to take to ensure that a problem like this doesn't happen again. Critical bugs are never cheap to fix, at least not if you fix them properly.

It depends on when you find the bug

All of this data assumes that you are fixing bugs found in production. Everyone knows that the earlier that you find a bug in the development cycle, the cheaper it is to fix – if an automated test or static analysis check reports a bug in code that you just changed, of course you can fix it immediately. The famous rule “finding and fixing a software problem after delivery is often 100 times more expensive than finding and fixing it early in development” applies. Or does it?

In “What we have Learned about Fighting Defects”, different studies show that the 100:1 rule of thumb applies for severe and critical defects (because of the direct and indirect costs involved). But for non-severe bugs, the effort multiplier is much lower: as low as 2:1. This is especially the case for teams working in a Continuous Deployment model, where the boundaries between development, testing and production are blurred, and where the costs and time required to push a fix out to production are minimal.
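The difference between those two multipliers is easy to underestimate, so here is the arithmetic as a small sketch. The 100:1 and 2:1 ratios come from the text above; the function name and the half-hour early-fix figure are hypothetical, and real multipliers will fall somewhere in between depending on the team and the defect.

```python
# Effort multipliers quoted above: fixing after delivery vs. early in development.
SEVERE_MULTIPLIER = 100      # severe and critical defects
NON_SEVERE_MULTIPLIER = 2    # non-severe defects, at the low end


def production_fix_hours(early_fix_hours, severe):
    """Estimate hours to fix a bug after delivery, given the cost of fixing it early."""
    multiplier = SEVERE_MULTIPLIER if severe else NON_SEVERE_MULTIPLIER
    return early_fix_hours * multiplier


# Hypothetical bug that would have taken half an hour to fix when the code was written.
print(production_fix_hours(0.5, severe=True))   # 50.0
print(production_fix_hours(0.5, severe=False))  # 1.0
```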

How old and how big the code base is

The cost to fix bugs also depends on how old the system is, and how old the bugs are. In the first few months of operation bugs will be found quickly and usually fixed quickly, often by the same programmer who wrote the code. As time goes on it gets harder to find and harder to fix bugs, partly because the no-brainers, the more common and obvious and easier-to-reproduce problems, have already been reported and fixed, and now you’re left with more difficult edge cases or timing problems. And partly because over time there’s less chance that the programmer fixing the code is the same programmer who wrote it, so it takes longer simply for whoever has to fix the problem to get their head into the game.

And it depends on how big the system is. Bigger systems have more bugs, and it costs more to fix bugs in big systems, especially really big systems. Severity 2 bugs (the hardest to fix on average) take an average of 9 hours or less to fix in systems up to 1,000 function points in size (around 50,000 lines of Java code, give or take). But in much bigger systems (500,000 lines of code or more) the average time goes up to 12 hours. In the biggest systems (another order of magnitude bigger) it can take an average of 24 hours to fix the same kind of bug.

Next, I want to look at the value of programmer experience when it comes to fixing bugs.
