Fail-safe - Firnkorn & Stortz

What is Reliability?

Reliability (failure safety) in the narrower sense is the property of an IT or other infrastructure component to operate entirely without failure, with neither planned nor unplanned downtime. In a broader sense, including at the German BSI in the context of emergency management, the term is often used synonymously with availability, which describes the desired or measured share of uptime per unit of time. Strictly speaking, however, reliability means 100% availability and therefore ranks above high availability. IT reliability thus becomes the issue whenever the agreed availability is 100%. This can apply to an individual component or to an entire service. What ultimately matters is the availability measured at the customer interface, so combining several components can lift a service beyond high availability toward reliability.
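To make the distinction between availability classes and 100% availability concrete, the following minimal sketch computes measured availability as a share of uptime and the downtime budget a given target leaves per year. The function names and the 8,760-hour year are illustrative assumptions, not part of any standard or of the BSI definition.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours; leap years ignored (assumption)


def availability(uptime_hours: float, total_hours: float) -> float:
    """Measured availability: share of uptime in the observation period."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return uptime_hours / total_hours


def allowed_downtime_hours(target: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Downtime budget per period for a target availability in [0, 1]."""
    if not 0.0 <= target <= 1.0:
        raise ValueError("target must be between 0 and 1")
    return (1.0 - target) * period_hours


if __name__ == "__main__":
    for target in (0.99, 0.999, 0.9999, 1.0):
        budget = allowed_downtime_hours(target)
        print(f"{target:.2%} availability -> {budget:.2f} h downtime per year")
```

Under these assumptions, 99.9% availability still allows roughly 8.76 hours of downtime per year, whereas reliability in the strict sense allows none.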

Objectives and reasons for IT reliability

Reliability is a goal especially in environments with:

Extremely high downtime costs (e.g. in the financial sector or in warehouses with chaotic storage)
Potentially large-scale effects (e.g. in the energy sector, in power generation and distribution, or in specialized areas of transport such as rail or air traffic control)
Risks to life and limb that depend on the availability of the systems (e.g. in hospitals, in aircraft, and in future in autonomous vehicles generally)

There are different failure categories to think about:

Hardware errors
Logical errors (data corruption caused by software or user errors)
Disasters

Hardware errors mainly affect individual components and can be guarded against reasonably well with appropriate architectures. At the system level, it is the other two categories that can cause critical downtime.

Ensuring resilience

Reliability in the sense of 100% availability of a system can be guaranteed only with enormous effort. However, there are architectures, both for individual technical components and for whole systems, that come very close to it. These include cluster technologies, synchronous mirroring, and components connected in parallel. Cloud services are now also opening up ways to meet the highest availability requirements more cost-efficiently and with significantly less complexity for users. Cloud providers benefit from economies of scale, which gradually bring down prices in a competitive market. To find the right answers in this highly dynamic environment, consultants such as Firnkorn and Stortz offer support based on decades of experience. With their help, it is much easier to answer questions such as:

Does a system have to be failsafe?
Which availability classes are necessary for quality service provision?
Should systems be run in-house, or is an alternative approach such as hosting or cloud operation the better choice?
How can reliability (or high availability) be implemented, and how does it integrate with the existing infrastructure and organization?
And many more

Server and storage reliability

Without data there is no data processing, and without data processing there are no IT services. In the classic failure scenario, requests to servers or storage components go unanswered. Cluster environments or synchronous mirroring are well suited here: if one path fails, they bring the parallel path into service and thus keep servers and storage available. Server and storage virtualization opens up new possibilities here, as do cloud services.
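The effect of a parallel path can be estimated with the standard formulas for redundant and chained components. This is a rough sketch that assumes component failures are statistically independent; the function names and the example figures (99% per server, 99.9% for storage) are hypothetical.

```python
from math import prod
from typing import Iterable


def parallel_availability(availabilities: Iterable[float]) -> float:
    """Redundant components: the service is up if at least one is up.

    Assumes independent failures, which rarely holds exactly in practice.
    """
    return 1.0 - prod(1.0 - a for a in availabilities)


def series_availability(availabilities: Iterable[float]) -> float:
    """Chained components: the service is up only if every one is up."""
    return prod(availabilities)


# Hypothetical example: two clustered servers at 99% each ...
cluster = parallel_availability([0.99, 0.99])
# ... in front of a single storage system at 99.9%.
service = series_availability([cluster, 0.999])

if __name__ == "__main__":
    print(f"cluster: {cluster:.4%}")
    print(f"service: {service:.4%}")
```

The independence assumption is the weak point of this estimate. Logical errors and disasters tend to affect both paths at once, and a synchronous mirror copies corrupted data just as faithfully as valid data, so the figures describe protection against hardware errors only.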

Network resilience

Without data highways there is no data traffic. Servers, storage, and other components may run as securely and efficiently as possible, but without access paths there is no service. The network is a highly critical component, whether internally in the LAN or for external communication over the WAN; often nothing works without it. With the advent of cloud services, bandwidth has also become a critical factor and in some places is pushing existing infrastructure to its limits. To make the network reliable, redundant network connections are called for:

Multiple network cards
Multiple physical node access points
Different providers
Various access technologies (e.g. a 4G/5G router as backup network access)
And more
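How redundant connections translate into failover can be sketched as a priority-ordered selection among uplinks. The following is a simplified illustration, not a production design: the `Uplink` type, the uplink names, and the injected `is_healthy` check are all hypothetical, and in practice this decision is usually made by routers or the operating system rather than by application code.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Uplink:
    name: str      # hypothetical label for a connection
    priority: int  # lower value = preferred path


def select_uplink(uplinks: Iterable[Uplink],
                  is_healthy: Callable[[Uplink], bool]) -> Optional[Uplink]:
    """Return the most preferred uplink that passes the health check."""
    for link in sorted(uplinks, key=lambda u: u.priority):
        if is_healthy(link):
            return link
    return None  # every path is down


# Hypothetical setup: two providers plus a 5G router as a last resort.
uplinks = [
    Uplink("fiber-provider-a", 0),
    Uplink("dsl-provider-b", 1),
    Uplink("5g-router", 2),
]

# Simulated outage of the primary line; a real check might be a ping
# or a TCP probe sent through each interface.
down = {"fiber-provider-a"}
active = select_uplink(uplinks, lambda u: u.name not in down)

if __name__ == "__main__":
    print(active.name if active else "no uplink available")
```

Passing the health check in as a function keeps the selection logic independent of how a path is actually probed, which also makes it easy to test without touching the network.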

Reliability of SAP systems

SAP systems are of central importance in many companies and form the basis for many business processes. SAP reliability is therefore particularly important, both for the company and for its IT reliability as a whole. SAP environments, like systems from other software manufacturers, are built on the architecture components mentioned above. As these environments grow more complex, however, the risks to reliability also increase: system-internal dependencies, any number of interfaces to the company's own and external systems, workflows, many parallel data changes, large procedural change projects (transports), and so on. Protection for the individual components and systems follows the measures described above. For the special case of procedural change projects within the SAP environment, SAP offers its own tools such as ChaRM, HALM and Focused Build. Third-party tools have also become established to close the gaps in SAP's own tools. These include the TIA (Transport Impact Analysis) tool from Firnkorn and Stortz, which supports reliability at the logical level by checking dependencies and can be used either as a service within a project or as a permanent solution. The whole subject of reliability, with all its perspectives, is extremely complex. The consultants at Firnkorn & Stortz have been supporting companies of all sizes with every detail for decades, evaluating and advising holistically and independently of any particular tool.


Your contact person at Firnkorn & Stortz on the subject of failure safety

Firnkorn + Stortz GmbH

Thomas Firnkorn
