30/07/2024

THE BRAVE NEW DIGITAL WORLD

By José María Roldán Alegre, Senior Advisor Kreab

 

Schadenfreude. I must confess I have a soft spot for that word, without an equivalent in Spanish. It is usually defined in very negative terms (“the experience of pleasure, joy, or self-satisfaction that comes from learning of or witnessing the troubles, failures, pain, suffering, or humiliation of another”). But the reality is that it is quite a mundane feeling: it is a kind of “thank God it did not happen to me!”

Anyway, that was my feeling when observing the chaos and havoc on 19/07/24 (let's call it “The Incident”). The previous day, a Thursday, I landed at Madrid-Barajas airport at around 10 pm (yes, a close call). So, witnessing the chaotic situation in airports around the world, I thought “poor guys”, but also “thank God it did not happen yesterday, when I was coming back!”

What happened was not, in the end, so terrible: within a few hours the recovery process was almost complete, and people around the world managed to find workarounds to sort out the problems (for instance, handwriting airplane tickets!). And we learned a new acronym to add to our dictionary: the BSOD loop, or Blue Screen Of Death loop, which is what happens when a computer tries to reboot, fails, and tries again in an endless cycle. A kind of “Groundhog Day” in computer terms.

On a more serious note, there are lessons to be learned from “The Incident”.

It is not always financial firms

In the new digital world, incidents abound. Reading the newspapers, you would think that only banks are affected by blackouts, data leaks, cyber-attacks, etc. But that is just the result of banks, and in fact all financial firms (including insurance firms), being IT companies, or fintechs, as of today. As technology penetrates all areas, we will see more and more incidents affecting sectors beyond financials.

It is not always cyber-attacks

The Incident was not the result of malicious foreign agents trying to create economic and social damage. Cyber-attacks are a very real threat, but they are not the only one. The need to upgrade systems almost on a daily basis (ironically, to protect us from cyber risks) represents a major fragility, considering the complexity of the software ecosystem.

This happens in the best of families

The names involved in The Incident were not secondary actors, but “best in class” ones. This must serve as a reminder that no one is safe from making mistakes. No complacency, please.

The unavoidable fragilities of the Hub and Spokes IT paradigm

The IT world is developing in the form of a Hub and Spokes system, in both software and hardware terms. In hardware, we have data centers, servers, etc. The Cloud involves a series of Hubs to which the companies (the spokes) connect. In software terms, we have a system organized around corporations with solid IT frameworks (the Hubs) that work with specialized Third-Party Services Providers (TPSP). And we also have big software companies that let third parties (software developers) offer applications that run on their core systems.

There are inherent risks in that model. In particular, the spokes will never be as strong, resilient, or effective as the hubs, be it in hardware or in software terms (above all, they are more fragile and more exposed to cyber-attacks). But we must embrace these challenges, since there is no alternative to the Hub and Spokes model, or to The Cloud.

Recovery is always feasible

The Incident was extremely serious, hitting all countries and all sectors quickly and on a massive scale. In fact, it is probably the biggest incident on record. But, without underestimating its impact, recovery came very fast. The source of the problem was detected rapidly, short-term fixes were offered and, when that was not possible, human ingenuity took control. I was personally impressed by the handwritten airplane tickets: I had never seen one in my whole life!

Quality Assurance Testing is of the essence

In an IT world that is constantly evolving, adapting, upgrading, etc., a basic foundation for the resilience of the ecosystem is a strong framework of QA testing, making sure that upgrades work as they are intended to work. I vividly recall the fatal crash of a military airplane prototype in Spain. There were no faulty engines or structural failures: the cause was corrupt software loaded into the engines.

Of course, QA testing comes at a cost: it is expensive, and it takes time (and in an IT world of cyber threats, a quick response is essential). But the costs of not doing it can be brutal, as seen in The Incident.

Adequate Communications do limit impact

When confronted with an IT incident, good, clear, transparent communication is fundamental to ensure that affected parties can take the necessary steps to minimize the impact. And, incidentally, it also helps to minimize the reputational costs: they do not disappear, but they are reduced.

But, of course, good communication is never improvised. It has to be planned, rehearsed and, when the time comes, used.

Lessons learned around TPSP (Third Party Services Providers)

Regulators and supervisors are obsessed, and rightly so, with the challenges posed by TPSP in the new IT/Financial ecosystem. Clearly, they are the weakest link in the IT value chain, and are the source of most of the incidents we observe regularly.

Although The Incident was not a financial one, there will surely be a read-across to other sectors, and the pressure on TPSP in the financial sector will increase (in fact, it will increase elsewhere as well). The trend of reinforcing the spokes is here to stay.

But we might as well think about the incentives we put in place. It is reasonable to strengthen the framework for TPSP. But by doing so we are also changing the dynamics of the ecosystem: we are reducing the number of suppliers in the spokes and increasing their average size. We end up with fewer, but stronger, operators in the TPSP realm.

But we must be careful what we wish for. Bigger operators are usually stronger and more solid, since they have the capacity to make the needed investments in their systems. But as we lose diversity, we also get closer to a low-probability, high-impact event (such as the one experienced in The Incident): when many firms depend on the same few providers, a single failure reaches all of them at once. Or, expressed differently, there is a trade-off between diversity (or risk diversification) and resilience.

In fact, this is no different from what we saw after the Global Financial Crisis with the promotion of CCPs (Central Counterparties). They increase resilience, as long as there is no problem with a CCP itself (they are the financial equivalent of a nuclear power plant). No solution comes without side effects.

To sum up, what happened before, during and after The Incident should be analyzed in depth to extract the right lessons and the policy options that should be considered. But it would be a mistake to put in place policy options that reduce the diversity of TPSP in the new IT ecosystem of Hubs and Spokes. We may reduce risks in the short run, but at the expense of brewing a low-probability, high-impact event like the one observed in The Incident.