A Practical Guide to Bug Triage : Sinzui

Or: "Why I don't classify bugs as medium"

The process of triaging issues (bugs, features, and tasks) has one crucial principle: Prioritise the work according to need and certainty.

Work is prioritised because there are not enough engineers to do all the work. Some features will never be completed, and some bugs will never be fixed. Triage determines which bugs can and will be fixed, and which features can and will be implemented. Need is generally understood when planning work, but certainty is not, and that often leads to wasted work and unmet expectations.

By need, I mean a measure of severity: what percentage of users does the issue affect, and how severely does it impede them from completing their task?

By certainty, I mean a measure of how certain the engineers are that they can address the issue. Time is also a factor in this measure: the longer an issue takes to address, the more likely it is that the conditions that were first judged will change.
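As a rough illustration of weighing need against certainty, here is a minimal sketch. The 0.0 to 1.0 scales, the field names, and the choice to multiply need by certainty are all assumptions made for the example; the guide itself does not prescribe a formula.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    title: str
    affected_fraction: float  # share of users affected, 0.0 to 1.0 (assumed scale)
    severity: float           # how badly users are impeded, 0.0 to 1.0 (assumed scale)
    certainty: float          # engineers' confidence they can address it, 0.0 to 1.0

    @property
    def need(self) -> float:
        # Need combines how many users are affected with how badly.
        return self.affected_fraction * self.severity

    @property
    def score(self) -> float:
        # Assumption: weight need by certainty, so that a needed but
        # poorly understood issue does not automatically come first.
        return self.need * self.certainty


def rank(issues):
    """Return issues ordered from highest to lowest score."""
    return sorted(issues, key=lambda issue: issue.score, reverse=True)


issues = [
    Issue("bug A", affected_fraction=0.5, severity=1.0, certainty=0.25),
    Issue("bug B", affected_fraction=0.5, severity=0.5, certainty=1.0),
]
ranked = rank(issues)
```

Here bug A has the greater need, but bug B ranks first because the engineers are far more certain they can fix it. A real project would tune or replace this weighting.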

The act of triage is separating work into groups that are being worked on now, next, and last. There can only be as many "now" bugs or features as there are engineers. The amount of "next" work is limited by the velocity of the engineers and by how often plans change. The bugs that are last will probably never be addressed; the last features may never be started.

The corollary to this rule is that there is a finite number of bugs or features in the first two groups. There cannot be more work in these groups than the engineers can do in the given period of time; otherwise the engineers, businesses, and users are being misinformed about when issues will be addressed.
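The capacity limit on the first two groups can be sketched as a simple split of an already prioritised list. The function name, the `next_capacity` parameter (a stand-in for velocity), and the sample backlog are hypothetical.

```python
def partition(ranked_issues, engineers, next_capacity):
    """Split an already prioritised list into now, next, and last groups.

    engineers: number of engineers; this caps the "now" group.
    next_capacity: how many further issues the team expects to finish in
        the planning period (an assumed stand-in for velocity).
    """
    now = ranked_issues[:engineers]
    upcoming = ranked_issues[engineers:engineers + next_capacity]
    # Everything beyond the team's capacity falls into "last".
    last = ranked_issues[engineers + next_capacity:]
    return now, upcoming, last


backlog = ["bug-1", "bug-2", "feature-3", "bug-4", "feature-5", "bug-6"]
now, upcoming, last = partition(backlog, engineers=2, next_capacity=3)
```

With two engineers and room for three more issues, only five items carry any commitment; the sixth lands in "last" however the list is ordered.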

An Example

Consider one engineer and two bugs. He can only work on one bug at a time. One bug is more important than the other. The risk is that he may not be able to fix one of the bugs before users are disappointed and abandon the application. He risks disappointing all users if he does not fix either bug because he chose the one with the most need over the one he was certain he could address.

If he does not know how to fix the bug with the most need, or if the fix takes a long time, he is wasting time he could have spent fixing the bug with more certainty. The only way he can address the bug with the most need is to employ a hack to reduce the need and meet the expectations of some users. The hack also gains time to understand the problem, and thus increases certainty.

Only Assign Work that You Are Committing to do in the Near Future

When work is assigned to an engineer, he is committing to complete the work in the near future. What the "near future" means is different for each project. I suggest that 3 releases is the "near future", because when work is planned, the engineer is thinking about now, next, and last. For some projects this period might be 6 weeks; for others, 6 months.

I prefer to plan for the current release and the next one. As work is reprioritised, it may be rescheduled to the third release. I do not think it is wise to plan a bug or feature to be completed in the third release, because if it slips to the fourth or fifth release, I doubt that it was correctly prioritised as high.

Any high work that is assigned to an engineer for more than 3 releases was not high. If it were, the work would have been reassigned to someone who could complete it in the scheduled time. Any other work that is assigned for more than 1 release is also misprioritised. You are lying to yourself, and to the project's users, when you assign work that you are not committing to fixing.

Practical Classifications of Importance

Work is often classified in relative terms. It is better to classify work according to how it is managed, to convey when and under what terms the bug will be fixed or the feature will be completed. There are three priorities that work can be classified as:

Critical

:: The bug dramatically impairs users. Users may lose their data. Users cannot complete crucial tasks. The feature is needed to encourage adoption or prevent abandonment of the project; Synonyms: required, essential, now, must do; The work is immediately assigned to an engineer. It is his top priority to fix. Team members help the engineer to plan and do the work. The work is released as soon as it is deployable; in the case of a bug, it is released outside of the release schedule.

High

:: The bug prevents users from completing their tasks. The feature provides new kinds of tasks or new ways of completing tasks; Synonyms: expected, next, can do, should do; The work is assigned to an engineer to be completed in the next 3 releases. The engineer may choose to do other work if he believes it is within the scope of the high priority work.

Medium

:: The bug is an inconvenience for many users. The feature provides new ways of completing tasks; Synonyms: preferred; The work is not scheduled, though it is intended to be completed. When the work is assigned, it may also be scheduled, but there is no commitment to complete it for the stated release. The engineer may choose to postpone the work in favour of more important work.

Low

:: The bug is an inconvenience to users, but it does not prevent them from completing their tasks. The feature is a convenience to users; Synonyms: optional, last, may do; The engineer may assign the work to himself while working on high priority work because the high work provides an opportunity to complete the low priority work at less cost. If the low work in any way jeopardises the high priority work, the low work is unassigned. The engineer is thus certain that the work can be completed quickly and without difficulty. A corollary to this rule is that low work that is assigned to an engineer must be in the "in progress" or "fixed" state.

The Problem with "Medium"

It might be argued that when the engineer has an opportunity to fix a low or a medium bug, he must choose the medium one. This rule does not define a practical distinction between medium and low. There is no commitment to fix the medium bug; it will not be scheduled for fixing. An engineer chooses to undertake a low bug because he sees an opportunity to fix it while working in the affected code. The engineer is choosing to do unscheduled work because he is certain it does not jeopardise his scheduled work. The engineer might see an opportunity to fix a medium and a low bug at the same time, but that is unlikely.

It can also be argued that 'critical' is 'high' and that 'high' is 'medium'. True, that is a matter of semantics. The crux of the issue is that there are three practical classifications of work. The words chosen to describe the classifications could use the tofu scale of hard, firm, and soft. People who are unfamiliar with triage will appreciate names that convey the kind of attention the issue will receive.

Some teams with a large number of bugs prefer to keep a pool of medium work from which releases are planned. Items in the pool may be escalated to high if it is perceived that once work is started, there should be a commitment to complete it as scheduled. This work is different from low work because the work makes a substantial improvement to the application, but like low, there is no commitment when the work will be completed. It can be argued that work starts on medium bugs and features because of changes to other priorities, certainties, or the number of users it affects.

Consequences of Misprioritised Work

Stakeholders often use reports that list the prioritised work for a release and for each engineer. When work is misclassified there are two commonly observed consequences: a decrease in certainty, and a decrease in communication.

In the first consequence, the engineer's effort may be wasted; there are issues that have more need and certainty. Engineers, and other stakeholders, are often tempted to complete the misdirected work after the misclassification is discovered because it is assumed that it is better to always deliver something finished than nothing at all. This is a risky choice, because it jeopardises work in future releases. By working on less important work, the engineer is decreasing the certainty of the more important work.

The second consequence is that the engineer ignores the list and works on issues according to some other source, such as the opinion of another stakeholder. While the engineer is working on the correct issue, it is unclear to other parties what work is going on and when it will be completed. Users may abandon the project in frustration. Planners cannot coordinate all the stakeholders.

The first consequence is possibly a failure to reprioritise during the triage process, but the second consequence is a total failure of the triage process. Why would anyone do triage if the prioritisation will be ignored? How can work be coordinated if the work is unknown to all stakeholders? Why would users trust a project if it does not do what it says it will do?

Work must be reprioritised during the triage process to ensure that engineers are working on the issues with the most need and certainty. Engineers must work from the list of prioritised issues.

Indicators of Misprioritised Work

The rules of practical classification provide tests for misprioritised bugs, features, or tasks.

The work is critical, but it is not assigned and targeted for a release.
The work is prioritised as high, but it is not assigned and targeted for a release.
The work is high, but has not been worked on in 3 releases.
The work is low and unassigned, yet it is targeted for a release.
The work is low and assigned, but the engineer is not working on it.
The work is considered to be triaged, but its priority is not critical, high, or low.
An engineer is assigned more work than he can accomplish in 3 releases, and it cannot be reassigned.
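
Most of the indicators above can be checked mechanically against a bug tracker's data. The following is a minimal sketch; the `WorkItem` fields, the status strings, and the message texts are hypothetical and would need to be mapped onto whatever a real tracker records. The last indicator is left out because it requires per-engineer estimates rather than per-item data.

```python
from dataclasses import dataclass
from typing import Optional

TRIAGED_PRIORITIES = {"critical", "high", "low"}
ACTIVE_STATES = {"in progress", "fixed"}


@dataclass
class WorkItem:
    priority: str                   # "critical", "high", "medium", "low", ...
    assignee: Optional[str] = None  # engineer's name, or None if unassigned
    release: Optional[str] = None   # targeted release, or None if untargeted
    status: str = "new"             # assumed status vocabulary
    releases_idle: int = 0          # releases that passed without work on the item
    triaged: bool = True


def misprioritised(item):
    """Return the indicators that apply to a single work item."""
    problems = []
    committed = item.assignee is not None and item.release is not None
    if item.priority == "critical" and not committed:
        problems.append("critical but not assigned and targeted")
    if item.priority == "high" and not committed:
        problems.append("high but not assigned and targeted")
    if item.priority == "high" and item.releases_idle >= 3:
        problems.append("high but not worked on in 3 releases")
    if item.priority == "low" and item.assignee is None and item.release is not None:
        problems.append("low and unassigned but targeted")
    if (item.priority == "low" and item.assignee is not None
            and item.status not in ACTIVE_STATES):
        problems.append("low and assigned but not being worked on")
    if item.triaged and item.priority not in TRIAGED_PRIORITIES:
        problems.append("triaged but not critical, high, or low")
    return problems


report = misprioritised(WorkItem("high", assignee="ann", release="1.2", releases_idle=3))
```

Running a check like this over every open item before a planning meeting gives the triager a list of items to reprioritise, reassign, or untarget.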