Antifragility in Software Systems

I am currently devouring Nassim Taleb’s amazing book “Antifragile.” In it, Taleb describes how societies and organisations can survive in the long term not only by becoming robust against adversity, but by thriving in the face of change and volatility. This got me thinking: Can we apply these ideas to software?

Nassim Taleb describes antifragility as a property of systems that goes beyond robustness. A fragile system breaks when the environment changes unexpectedly. A robust system survives adversity: it continues to function when the environment changes and the pressure increases. But according to Taleb, there is something beyond robustness, and he calls it antifragility. An antifragile system not only survives change and adversity, it thrives on them.

Let me give you a few examples: The banking system is often seen as fragile. As banks get bigger and more interconnected, the failure of one bank can (or inevitably will) affect other banks to the point where the whole banking system collapses. We all remember the subprime crisis of 2008, when the collapse of a few banks in the US had repercussions all around the world. This is a very prominent example of a fragile system.
Since then, regulators have worked hard to make banks more robust, i.e. to allow the banking system to continue functioning even if one bank fails. Have they succeeded? Only time will tell.
What would an antifragile banking system look like? In such a system, one bank could fail without affecting the others, and the other banks would learn from that failure. For example, they could improve by increasing their capital ratios, or by becoming leaner and splitting up, because larger banks pose a higher risk to the whole system. But this is clearly not happening. As we have just seen with the acquisition of Credit Suisse by UBS, banks are getting even bigger, increasing the risk of a system-wide collapse.

So much for the point of Taleb’s book. Can we apply these ideas to the software world? Let’s look at some examples.

A good example is the world of operating systems. Here we have a quasi-monopolist, Microsoft. Although its market share is declining, it still has over 75% of the desktop market. This monopoly is good for Microsoft’s shareholders, because it ensures steady profits, and good for users, because they have a wider choice of software to run on Windows. But it is also a huge risk: If Microsoft decided tomorrow to stop making Windows, revoke all its certificates and stop providing updates, most of the world’s computers would become unusable. This does not seem likely, as Microsoft has a huge financial interest in keeping Windows running. But we can imagine a very extreme scenario, such as attackers compromising Microsoft and sending malicious updates to all PCs running Windows. Or something equally damaging to Microsoft’s business, like Elon Musk buying the company (let THAT sink in). Something this extreme may be very unlikely, but it could have a huge impact. A black swan event, as Taleb describes it.
Such an event would have huge repercussions around the world. Businesses would cease to function. Hospitals would close. Government administrations would grind to a halt. All because they rely on one vendor and one operating system.
Computer systems seem very fragile in this light. But all is not lost, because Windows is not the only operating system in the world. There is also macOS (but here we have the same situation, a single vendor) and Linux. Linux’s market share on desktop systems is very small, but it has one big advantage: it is open source. As Open Source Software (OSS), it is distributed not just by one company, but by many companies and non-profit organisations. For Linux users, the failure of one company does not mean the end of the world; they can easily switch to another Linux distribution. This makes the Linux ecosystem much more robust than the Windows (or Mac) ecosystem could ever be.

Next, let’s look at the largest software system in the world: the Internet. It connects literally billions of machines around the world, and has been doing so for decades. It has survived many crises, such as the dotcom bubble, different levels of regulation in countries around the world, and the rise of malware and ransomware. Despite all these problems, today the Internet serves more people and provides more and faster services than ever before.
The Internet has not only survived these crises, which shows how robust it is; it has even gotten better. We can rightly call it an antifragile system.

The Internet shows another property of an antifragile system that I learnt from Nassim Taleb: hormesis. The concept of hormesis comes from biology, where a small dose of a harmful substance makes an organism resistant to larger doses of that substance.
Let’s look at one of the biggest threats to the Internet today: malware and ransomware. These are highly sophisticated computer programs that can cripple the operations of even large companies and other organisations. There are equally sophisticated countermeasures to prevent these programs from taking control of a network. But when the Internet took shape in the 1980s, it had none of these countermeasures. If we were to confront the early Internet with today’s malware, it would be completely corrupted and unusable within seconds. This did not happen because the threats had to develop first. The first few threats, like the earliest computer viruses, were quite harmless compared to today’s. This gave developers the chance to create the first anti-virus programs. In response, the viruses got better, and then the countermeasures got better again. We can think of the security threat as a small dose of poison that was gradually increased over time, and the organism of the Internet grew stronger on this poison. This is hormesis in action, the hallmark of an antifragile system.

So we have seen an example of a fragile software system, single-vendor operating systems, and an example of an antifragile one, the Internet. In the long run, we will all benefit from having antifragile systems, even in software. How can this be achieved?

Fortunately, the operating system world is learning its lessons. There hasn’t been a major Windows outage yet, but there are small doses of such events from time to time. Take the end of support for Windows 10. Many computers can be upgraded to Windows 11, but millions of others no longer meet the requirements of the new operating system. In the Linux ecosystem, however, these old computers can find an alternative operating system that will keep them running for many years to come. We can think of events like the end of support for an operating system as a small dose of poison that triggers the operating system world to adapt, helping it to become antifragile.

Another route to antifragility is open standards. With the advent of online office suites such as Google Docs, it is no longer vital to have Microsoft Word or Excel on your PC. With open document standards, it is possible to exchange data between different implementations from different vendors. The failure of one vendor no longer means that your data is unusable. You can simply switch to another vendor.

In conclusion, it seems to me that the software world is moving towards an antifragile system. Many values shared by software developers, such as freedom of documentation, open source software and open document formats and standards, are gradually leading to a state where there is no single point of failure. We need to continue on this path to ensure that our computer systems are not fragile and prone to black swan events, but instead thrive and get better with each failure and challenge. With software and computers already controlling so much of our daily lives, this evolution is not optional; it is mandatory.
