High availability - Firnkorn & Stortz

Definition of high availability

High availability includes a set of technologies, procedures and design principles that support the goal of business continuity. For systems or services, the goal of high availability means a defined, maximum tolerated downtime of generally 0.01%. In other words: 99.99% availability is considered to be highly available. Ideally, systems and services are 100% available, i.e. accessible at any time during the agreed operating hours. However, this is rarely realistic and can only be mapped with enormous effort. For this reason, companies in many areas accept defined, maximum downtimes to keep the costs for services within realistic ranges. In the operational environment, availability requirements are often agreed in the form of service level agreements. The operating time is usually defined as a basis (e.g. 7×24, i.e. around the clock, or Monday to Friday on weekdays from 8 a.m. to 4 p.m., i.e. 5×8), with planned downtimes (e.g. maintenance windows) being included with exception of. In addition to the operating time, the minimum percentage of the time that the system or the service should be accessible (uptime) is defined, e.g. this 99.99% as high availability. Such specifications are intended to ensure that all components required for business operations can be used with a sufficiently high degree of probability during business hours.

availability classes

In practice, in addition to percentage specifications and measurements, the availability classes of the Harvard Research Group, the Availability Environment Classification (AEC)

AEC class	Explanation
AEC-0 / Conventional	Service can be interrupted, data integrity not essential eg: systems that do not store productive data, are not critical to business operations. In many SAP environments, for example, sandbox systems
AEC-1 / Highly Reliable	Service can be interrupted, but data integrity must be guaranteed Service can be interrupted, but data integrity must be guaranteed e.g. systems that are not involved in productive processes but whose function can have an indirect/medium-term impact on them. In the SAP environment, it is often the regular development and QA systems, since dependent data structures and test cases are mapped in these, or BW/BI systems.
AEC-2 / High Availability	Service may only be interrupted minimally during agreed times or during main operating hours, eg systems that are critical during the productive times of a business process. In SAP environments, these are often the productive systems
AEC-3 / Fault Resilient	Service must be maintained uninterrupted during agreed times or during peak hours of operation, eg systems whose processes and data are absolutely critical to the business process. In SAP environments, for example, these are often the systems in the warehouse management environment, especially for chaotic warehouses, since a failure here would often lead to a full inventory
AEC-4 / Fault Tolerant	Service must be maintained uninterrupted, 24/7 operation (24 hours a day, 7 days a week) must be guaranteed As above, but with no restriction on uptime. This often includes systems for online services, where customers can quickly switch to alternatives, or KRITIS environments.
AEC-5 / Disaster Tolerant	Service must be available under all circumstances Systems where life and limb are at stake, often in a critical environment, e.g. in nuclear power plant environments, etc.

Table 1: Availability classes according to AEC Source: Harvard Research Group, https://hrgresearch.com/High%20Availability.html

In addition, there are other classifications, such as the BSI availability classes commonly used in Germany:

Figure 1: Availability classes according to BSI Source: Bundesamt für Sicherheit in der Informaionstechnologie, HV-Kompendium, Band G, Kapitel 2: https://www.bsi.bund.de/SharedDocs/Downloads/DE/BSI/Hochverfuegbarkeit/BandG/G2_Definitionen.html

Calculation of high availability

Availability is usually given as a percentage. In general, the availability – and thus also the high availability – is calculated by dividing the uninterrupted service time of a service by the total (agreed) operating time. Regular maintenance windows, which are also explicitly regulated as planned downtimes in SLAs, do not count towards the service time. 100% availability means uninterrupted operation during the agreed service period. With a 7×24 operation, the typical “9s” mean the following maximum downtimes:

99% ≡ 438 minutes/month, i.e. 7:18:18 hours/month = 87.7 hours/year, i.e. 3 days and 15:39:36 h
99.9% ≡ 43:48 minutes/month, i.e. 8:45:58 hours/year
99.99% ≡ 4:23 minutes/month, i.e. 52:36 minutes/year
99.999% ≡ 26.3 seconds/month, i.e. 5:16 minutes/year
99.9999% ≡ 2.63 seconds/month, i.e. 31.6 seconds/year

Achieving 99.9% availability for a single system is a challenge. However, the same availability for a process with three components, for example, makes the task far greater: Availability requirements multiply. So, three components with 99.9% availability each have a total availability of only 99.9% x 99.9% x 99.9% = 99.7%. Instead of nine hours a year, this is already well over a day. If one of the components has an availability of only 99%, the maximum downtime even increases to almost four and a half days with 98.8% availability. In the case of complex services, the individual availability of each critical component must therefore be considered and set at a correspondingly high level.

Areas of application for high availability

The higher the specified availability, the more effort and costs are involved in the infrastructure to be set up. This pays off above all in sectors and applications in which

Downtimes entail high financial costs (e.g. in stock exchange trading or in the financial sector in general, or in the case of chaotically organized large warehouses),
Life and limb depend on the availability of the infrastructure (e.g. in the healthcare or airport environment) or
large-scale consequences are to be expected in the event of failures (typically in the KRITIS environments, or with providers for large infrastructure environments, e.g. in the cloud sector, in electricity plants).

In addition to environments in which general high availability is required, there are also fields of application in which specific requirements are defined at a specific point in time. This applies, for example, to reporting systems that may only have their big appearance quarterly, but then must be available within a certain time window. Failure to do so will result in fines or worse, bad press. Here, too, it is important to weigh up how specific requirements are covered.

SAP Hana high availability

A special case is the use of HANA databases in the SAP environment. In the case of high availability requirements for S/4, BW/4 HANA or similar, many classic methods only work to a limited extent: As an in-memory database, the HANA-DB does not write all data immediately to physical storage media, but only keeps it in memory, i.e. per se fleeting. For this reason, the HANA database technology offers its own mechanisms in the form of synchronous mirroring with the HANA system replication, supplemented by cluster technologies from the underlying Linux distributions.

costs of high availability

The costs of high availability increase with the requirements. The maximum tolerated downtimes, but also the complexity of the services, play a central role. There are only a few options to reduce costs:

Check availability requirements and reduce if necessary – possibly at the expense of service quality if permitted downtimes are actually (must) be used
Simplify business processes and services to reduce the number of critical components involved and thus the availability factors
Use more cost-effective technologies
Use more cost-efficient components
Conclude more cost-efficient contracts for external service providers

Implementation of high availability

In many cases, the technical implementation of a high-availability environment has something to do with the redundancy (i.e. duplication) of critical system components in order to avoid single points of failure (SPOF), the failure of which would lead to the failure of the entire service. In on-premises operation – i.e. in self-operated environments – the following should be mentioned in particular:

Backup data centers to double the system locations in the event of large-scale incidents (disasters)
Cluster technologies to duplicate servers in case of hardware, software or network problems
SAN or RAID technologies for duplicating storage components in the event of hardware problems
UPS (uninterruptible power supply) to double the power supply
Redundant network access points to double the Internet connection

In addition to redundant components, there is also the class of fail-safe systems , which are structured in such a way that they guarantee 100% availability, but entail corresponding costs. When using cloud structures, all hypervisor technologies offer different availability classes, including local and/or global redundancies. Hybrid availability scenarios have also become established on the market, in which companies secure on-premises with cloud environments, for example. However, the central challenge for the technical structure of high-availability services is their planning based on the actual requirements. For this purpose, it is necessary to analyze the real processes and their requirements and to derive specifications for the individual components of a process chain from them. This has both technical and economic dimensions – against the background of pricing your own, market-compliant service offers, the motto is usually “as much as necessary, as little as possible”. In practice, however, there are often service level agreements that do not meet the actual requirements of high availability:

unnecessarily high specifications, which carries corresponding cost potential,
insufficient availability requirements for critical components – which means a (too) high overall risk for your own service.
undefined, “perceived” service level requirements that can swing arbitrarily in both dimensions.

This is where external consultants often come into play, who analyze processes both technically and organizationally and provide expert advice. Such providers and service providers usually guarantee objective support. The consultants from Firnkorn and Stortz can support you with the introduction and operation of high availability with specialist knowledge, know-how and a lot of practical experience. So that you only pay for high availability where you need it.

Also, read more on related topics

Your contact at Firnkorn & Stortz on the subject of high availability

Firnkorn + Stortz GmbH

Thomas Firnkorn

Read in this article:

Definition of high availability

availability classes

Calculation of high availability

Areas of application for high availability

SAP Hana high availability

costs of high availability

Implementation of high availability

Also, read more on related topics

Your contact at Firnkorn & Stortz on the subject of high availability

Treten Sie mit uns in Kontakt

Stellen Sie uns Ihre Fragen!

Get in contact with us

Ask us your questions!