IT needs awareness!

Last week, the assembly lines at Toyota came to a standstill because a server had insufficient memory. Many people may wonder “how can something like this happen?”, and the incident may even have caused a certain amount of schadenfreude, but many CIOs and IT managers will have breathed a sigh of relief:

“Thank God that didn’t happen to us”.

Managing a modern IT landscape is a Herculean task. You have to deal with structures that have evolved over time, historical legacy issues (“better not switch that thing off!”), little understanding of IT (“we build cars, not software”), budget and resource issues, dangerous half-knowledge, and almost unattainable expectations of IT from top management. Future-proof strategies are to be developed and every hype topic is to be addressed: “of course we use AI”, “of course we are pioneers in big data”, “we have been doing IoT for a long time”, “of course we rely on microservices”, “cloud: old hat with us”… And, incidentally, the countless IT projects running in the company are all supposed to fit together seamlessly and be implemented on time.

And if everything actually runs smoothly in the end, nobody notices because IT – for whatever reason – is expected to work.

We should therefore always be aware of how complex IT systems in companies are. Unlike a production line, where complicated-looking devices are visible to everyone, IT is usually controlled by software whose source code “nobody understands anyway” and which is “just a few lines”.

Every interface between systems increases the risk of failure, because the combined system only works if every system involved works. If system A has a probability of failure of 5% and system B also has a probability of failure of 5%, and the two fail independently of each other, then an interface between the two systems results in system A+B having a probability of failure of at least 1 - (0.95 x 0.95) = 0.0975, i.e. 9.75%. It is “at least” because the interface itself can fail too. The more systems or services we connect, the higher the risk of failure.
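To make the arithmetic tangible, here is a minimal Python sketch of the same calculation. It rests on simplifying assumptions that go beyond the example above: failures are independent of each other, the chain only works if every system works, and the interfaces themselves never fail. The function name chain_failure_probability and the ten-system chain are made up purely for illustration.

```python
from math import prod


def chain_failure_probability(failure_probs):
    """Probability that at least one system in a chain fails.

    Assumes independent failures and that the chain needs every
    system to work (a simplification, not a real reliability model).
    """
    # The chain works only if all systems work, so multiply the
    # individual probabilities of working and take the complement.
    return 1 - prod(1 - p for p in failure_probs)


# Two systems with 5% failure probability each, as in the example above.
two_systems = chain_failure_probability([0.05, 0.05])

# Hypothetical chain of ten such systems.
ten_systems = chain_failure_probability([0.05] * 10)

print(f"2 systems:  {two_systems:.2%}")
print(f"10 systems: {ten_systems:.2%}")
```

With two systems the result is the 9.75% from above. With ten such systems in a row, the risk of failure under these assumptions already climbs to roughly 40%.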

I’m sure that a warning light in some IT service employee’s monitoring tool at Toyota lit up early on to indicate that server XYZ was running out of memory. I am also sure that this was reported accordingly and that an upgrade request was written, probably even approved, budgeted and planned for the next quarter or two. It is likely that the sub-project that carried out the maintenance that led to the error and the sub-project that was responsible for the server did not know about each other.

Above all, however, I am sure that such mistakes can happen anywhere!

IT must be firmly anchored in the core of a company. IT needs “awareness” (sorry, haven’t found a good German word for it yet). IT must be taken seriously, and it needs capable people who work day after day to ensure that such errors occur as rarely as possible!