Handling Inevitable Failures and Bugs

Every software system will fail at some point. Even the most prominent companies don’t promise 100% uptime. Instead, they describe their uptime in terms of ‘9s’, such as 99.9% (three 9s) or 99.999% (five 9s).

Attaining a high number of 9s in uptime is a significant achievement, a badge of honour for developers. Consider this: four 9s allow about 53 minutes of downtime a year, five 9s only about 5.3 minutes, and six 9s just under 32 seconds. Each additional 9 is considerably harder to reach than the last, but it also signifies a remarkable feat in system resilience.
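The downtime figures follow directly from the percentages. Here is a minimal Python sketch of the arithmetic, assuming a 365-day year; the function name is just for illustration:

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000, ignoring leap years


def downtime_seconds_per_year(nines: int) -> float:
    """Downtime per year allowed by an uptime with the given number of 9s."""
    unavailability = 10 ** -nines  # e.g. 4 nines -> 0.0001 (0.01% downtime)
    return SECONDS_PER_YEAR * unavailability


for nines in range(3, 7):
    seconds = downtime_seconds_per_year(nines)
    print(f"{nines} nines: {seconds / 60:.1f} minutes ({seconds:.0f} seconds)")
```

Each extra 9 divides the allowed downtime by ten, which is why every step is so much harder than the one before.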

The same thing goes for bugs. While bugs are more challenging to measure than uptime, the concept still fits. Stopping 99.9% of bugs is way easier than preventing 99.99%. It’s crucial to fix bugs, but we must accept that some bugs will always slip through. So, is it worth it to pour a ton of resources into squashing every last bug?

At a certain point, it becomes more strategic to make your system resilient to bugs than to chase every last one. This shift in focus lets developers think proactively about bug management.

This means spotting bugs when they pop up and having a way to recover. One common area where this matters is preventing data loss.

Imagine you’ve got a photo-sharing app. There’s a feature that automatically deletes old photos to free up space for users. Making sure this feature works right is super important. But there’s always a tiny chance that a bug could mess things up and delete photos it shouldn’t. Here’s how you can make this feature more resilient:

First, keep backups of the photos and the data stored with them. A backup is a full copy of the data as it existed at a particular time. While backups are usually meant for handling system or hardware issues, you can also use them to restore data for a single user if needed. Remember when Gmail had that issue in 2011 and lost data for 40,000 users? They restored it from tape backups. If it can happen to Gmail, it can happen to anyone. You’re probably not better at making bug-free software than Gmail.

But backups aren’t perfect. They can be expensive, and restoring from one means loading it onto separate hardware, extracting the correct data, and putting it back into the production database, all without human error.

Plus, backups usually happen hourly or daily. Picture a user who spends 20 minutes editing and uploading photos. If a bug deletes them, and the company says, “Oops, we can’t get your work back because we didn’t run a backup in the last 20 minutes,” that user will be pretty mad!

Another tactic is using soft deletes. Instead of permanently deleting data, you just record a deletion timestamp on it. If the timestamp is missing or null, the data isn’t deleted. If it’s a real timestamp, the data is considered “deleted.”

For users, everything looks and works the same. The difference is that the data isn’t gone forever. If a bug deletes more data than it should, it’s just setting more timestamps. Fixing it means figuring out what to restore and clearing those timestamps.
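Here is a minimal sketch of soft deletes using Python’s built-in sqlite3 module. The photos table, the deleted_at column, and the function names are assumptions made for illustration, not a schema from any real app:

```python
import sqlite3
import time


def create_photo_table(conn):
    # deleted_at is NULL for live photos and a Unix timestamp for "deleted" ones.
    conn.execute(
        "CREATE TABLE photos ("
        "id INTEGER PRIMARY KEY, owner TEXT NOT NULL, deleted_at INTEGER)"
    )


def soft_delete(conn, photo_id, now=None):
    """Mark a photo as deleted by setting its timestamp; the row stays put."""
    deleted_at = int(time.time()) if now is None else now
    conn.execute(
        "UPDATE photos SET deleted_at = ? WHERE id = ? AND deleted_at IS NULL",
        (deleted_at, photo_id),
    )


def restore(conn, photo_id):
    """Undo a soft delete by clearing the timestamp."""
    conn.execute("UPDATE photos SET deleted_at = NULL WHERE id = ?", (photo_id,))


def visible_photo_ids(conn, owner):
    """What the user sees: only photos without a deletion timestamp."""
    rows = conn.execute(
        "SELECT id FROM photos WHERE owner = ? AND deleted_at IS NULL ORDER BY id",
        (owner,),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
create_photo_table(conn)
conn.executemany(
    "INSERT INTO photos (id, owner) VALUES (?, ?)",
    [(1, "ana"), (2, "ana"), (3, "ben")],
)

soft_delete(conn, 1)                   # a buggy cleanup job "deletes" photo 1
print(visible_photo_ids(conn, "ana"))  # photo 1 is hidden from the user
restore(conn, 1)                       # recovery is just clearing the timestamp
print(visible_photo_ids(conn, "ana"))  # photo 1 is back
```

The one rule to keep in mind is that every query showing data to users has to filter on the timestamp, as visible_photo_ids does here.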

There is a catch, though. Some countries have regulations that require user data to be permanently deleted if the user asks for it, and they set a deadline for doing so. The deadline depends on the jurisdiction, so check the rules that apply to you. If you have 24 hours, for example, you can keep data soft-deleted for up to 24 hours before permanently deleting it. 24 hours isn’t much, but it’s enough if you catch bugs quickly.
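A retention window like this can be enforced with a small purge job that runs on a schedule. The sketch below assumes a photos table whose deleted_at column holds a Unix timestamp, and uses the 24-hour window from the text; the names are hypothetical:

```python
import sqlite3

RETENTION_SECONDS = 24 * 60 * 60  # assumption: the 24-hour window from the text


def purge_expired(conn, now):
    """Permanently delete rows soft-deleted at least RETENTION_SECONDS ago."""
    cutoff = now - RETENTION_SECONDS
    cursor = conn.execute(
        "DELETE FROM photos WHERE deleted_at IS NOT NULL AND deleted_at <= ?",
        (cutoff,),
    )
    return cursor.rowcount  # number of rows removed for good


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE photos (id INTEGER PRIMARY KEY, deleted_at INTEGER)")
now = 1_700_000_000  # fixed "current time" so the example is repeatable
conn.executemany(
    "INSERT INTO photos (id, deleted_at) VALUES (?, ?)",
    [
        (1, None),                # live photo
        (2, now - 60 * 60),       # soft-deleted an hour ago: still recoverable
        (3, now - 25 * 60 * 60),  # soft-deleted 25 hours ago: past the window
    ],
)

purged = purge_expired(conn, now)
remaining = [row[0] for row in conn.execute("SELECT id FROM photos ORDER BY id")]
print(purged, remaining)
```

Only the photo that has been soft-deleted for longer than the window is removed. The live photo and the recently deleted one are untouched, so a bug caught within the window can still be undone.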

This brings us to robust logging.

When an error happens, the code usually logs an error message. This message tells you where the error occurred in the code but not who it happened to or what data was affected.

That leaves you relying on users to contact you with all the details needed to track down the issue. If you’ve done support, you know that only sometimes happens. Including relevant data in error logs is crucial for recovering from errors. This could include the user’s session ID, the specific action they were performing when the error occurred, and any relevant data from the database or API calls.
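As a rough sketch of what that can look like with Python’s standard logging module, each failure below is logged as a JSON line carrying the session ID, the action, and the affected data. The function names, field names, and the delete_photo callback are hypothetical:

```python
import json
import logging

logger = logging.getLogger("photo_cleanup")


def build_error_record(exc, *, session_id, action, data):
    """Bundle the error with who it happened to and what data was involved."""
    return {
        "error": type(exc).__name__,
        "detail": str(exc),
        "session_id": session_id,
        "action": action,
        "data": data,
    }


def cleanup_photos(session_id, photo_ids, delete_photo):
    """Delete each photo, logging full context for any failure.

    delete_photo is a hypothetical callback that deletes one photo by ID.
    Returns the list of error records so callers can follow up on them.
    """
    failures = []
    for photo_id in photo_ids:
        try:
            delete_photo(photo_id)
        except Exception as exc:
            record = build_error_record(
                exc,
                session_id=session_id,
                action="auto_delete_old_photos",
                data={"photo_id": photo_id},
            )
            # One JSON object per line keeps the log easy to search later.
            logger.error(json.dumps(record, sort_keys=True))
            failures.append(record)
    return failures
```

With records like these, finding every user and photo affected by a bad release becomes a log search rather than a wait for support tickets.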

The best part is that these resilience tactics are not only effective but also easy to implement. They’re not a substitute for ensuring high-quality software, but they are essential fail-safes for when your code inevitably has bugs. Knowing that these tactics are straightforward to put in place can give developers a sense of reassurance and confidence in their bug management strategies.

Article by Dave Fuller
