We rely heavily on computer systems remaining operational in today’s world. In such a highly online world, a system going down is a problem, and often a big one. Long gone are the days when a call of “the system’s down” was a welcome excuse to head on down to the canteen or café for a coffee with your colleagues, or to the lobby for a smoke, while the IT administrators casually sauntered into the server room to figure out what the problem was. Nowadays, a down system leads to calls of frustration and angst from customers, users, and managers, concerns around having to provide explanations to government bodies, and highly elevated stress levels amongst the IT staff.
While the IBM i has an enviable reputation for being incredibly reliable, it is still very important that users of the platform put in place measures to protect themselves in the highly unlikely event of the system becoming unavailable. Most companies realise this and, at a bare minimum, do regular backups of their systems. However, some do not go much beyond this, leaving them vulnerable to extended outages while they try to resolve the cause of the problem, then restore their backups.
Yes, even IBM i operators need to consider how to deal with situations when their servers may, or will, become unavailable. And given that IBM i systems host some of the most business-critical applications in the world, downtime in the IBM i world is often simply unacceptable.
The Costs Of Downtime
Depending upon the organisation, the costs of your IBM i being offline can be huge. Sure, if you only use your IBM i internally, your business may be able to get through a minor outage. But even such businesses need to ask themselves whether an IBM i outage will cause headaches such as:
- Not being able to respond promptly to customer requests
- The inconvenience, for example, of not being able to easily process the arrivals and exits of items at their warehouse, leading to delays and possible backlogs of goods received or delays in delivering products to customers
- Having to revert to rudimentary systems to manage on-going business (like, recording business processes using pen and paper – remember that?)
In brief, the cost of a system being offline can be allocated into one of the following 3 categories:
- Revenue loss – for example, being unable to process orders, product information not being available to customers, or staff being unable to access information on behalf of customers (for example, at call centres);
- Productivity loss – due to staff not being able to operate efficiently, or at all. Consider, for example, the warehouse operations staff unable to access their warehouse management system. Or a bank’s staff not being able to process customer requests;
- Reputation loss – in our online world, people expect information now! If an outage results in information not being available to customers (whether directly by online apps or indirectly through call-centre or frontline staff), an organisation’s reputation will suffer. And that is money – with the potential loss of that customer to the competition.
The costs of downtime will affect different businesses and industries differently. Financial markets and the likes of credit card companies tend to be more negatively affected, in terms of the cost of downtime per hour, than the likes of manufacturing or logistics companies. Regardless of the industry, however, there is a cost, and individual companies should have an understanding of what those costs are.
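As a rough illustration of putting a number on those costs, the three categories above can be added together into an hourly figure. This is only a minimal sketch: the class name, field names, and dollar amounts are hypothetical, and every organisation would need to substitute its own estimates.

```python
from dataclasses import dataclass


@dataclass
class DowntimeCost:
    """Hypothetical hourly cost estimates for one organisation."""
    revenue_per_hour: float       # orders and sales that cannot be processed
    productivity_per_hour: float  # staff who are paid but cannot work
    reputation_per_hour: float    # estimated value of customers lost to competitors

    def per_hour(self) -> float:
        # Total hourly cost is the sum of the three categories.
        return (self.revenue_per_hour
                + self.productivity_per_hour
                + self.reputation_per_hour)

    def total(self, outage_hours: float) -> float:
        # Simplifying assumption: cost grows linearly with outage length.
        return outage_hours * self.per_hour()


# Made-up figures, for illustration only.
example = DowntimeCost(revenue_per_hour=5000.0,
                       productivity_per_hour=1200.0,
                       reputation_per_hour=800.0)
print(example.total(4))  # cost of a four-hour outage
```

In practice the reputation figure is the hardest to estimate, and the cost of a long outage is unlikely to grow in a straight line, so treat the result as a starting point for discussion rather than a precise number.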
Types Of System Downtime
System downtime can be due to a number of different reasons – some more easily managed than others. Here are some examples:
Planned Outages
For such events, we know when an outage is going to happen, and why. They are generally the most common event that an HA solution is actually used for. Such planned outages might include:
- Taking systems off-line for nightly backups (the period being known as the backup window);
- Hardware and software upgrades;
- Other general maintenance tasks
In most situations, planned downtime concerns can be mitigated by posting advance notice to users, and we are all probably familiar with that. However, this may not always be a viable option, and it may result in the loss of valuable business.
With an HA system in place, there is no need for systems to be unavailable to users during such planned downtime. For example, the backups can be run off the secondary system while the primary live system carries on doing its job. Or, in the case of an upgrade, the maintenance can be performed on the secondary system first, then the systems can be swapped, then the upgrade can be performed on the primary. With an HA system in place, there should be no, or very limited, disruption to the users.
Unplanned Outages
While unplanned outages are less frequent (well, I hope they are!), they are probably the ones that concern us the most, and they lead to the greatest stress for IT departments when they actually do occur. Again, they can be due to a number of reasons:
- Human errors – interestingly, this is the most common cause of unplanned outages – warnings ignored, procedures not followed, lack of knowledge and education, and communication breakdowns;
- Software issues – bugs, version control issues;
- Hardware faults – mechanical failures, electrical failures, cable damage, and loose connections;
- Environmental issues – due to issues like power failures, networks going down, or aircon failures.
Whatever the reason for a system going down, a robust high availability solution will provide automated failover to a backup system, ensuring that users can continue working with minimal-to-no disruption.
Issues To Be Considered For High Availability
Exploring high availability requires a great deal of deliberation, and there are a number of items to be considered.
Like all things, budget is one of those key considerations. HA solutions are an investment. They are an investment in a business’ reputation. The longer it takes for an organisation to recover from an outage, the greater the potential to lose business, and the harder it is to recover a hard-earned reputation. If you want zero (or close to zero) downtime, then it is going to cost. First off, you will need another IBM i server – either on-prem or one accessed in the cloud. Then you will need some software to manage the replication. Then you will need to allocate staff to manage the system and to come up with plans around handling outages.
Other key areas to be considered include:
- Up time requirements – what are your commitments to your end-users?
- Recovery Time Objective (RTO) – how soon do you need to get your systems back running again after a failure?
- Recovery Point Objective (RPO) – how much data (often expressed as time) can you afford to lose in the event of a failure?
- Do you require automated fail over or switch over?
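To make the RTO and RPO questions above concrete, the sketch below checks whether a given recovery approach would meet a pair of objectives. The names and all the figures are hypothetical, chosen only to contrast a backup-only approach with a replicated one; they are not measurements from any real system.

```python
from dataclasses import dataclass


@dataclass
class RecoveryObjectives:
    rto_minutes: float  # longest acceptable time to get systems running again
    rpo_minutes: float  # largest acceptable window of lost data, expressed as time


def meets_objectives(objectives: RecoveryObjectives,
                     recovery_minutes: float,
                     data_loss_minutes: float) -> bool:
    """Return True only if both the RTO and the RPO are satisfied."""
    return (recovery_minutes <= objectives.rto_minutes
            and data_loss_minutes <= objectives.rpo_minutes)


# Hypothetical targets: back up within an hour, lose at most 15 minutes of data.
targets = RecoveryObjectives(rto_minutes=60, rpo_minutes=15)

# Assumed worst case for nightly backups only: an 8-hour restore and
# up to 24 hours of data entered since the last backup.
backup_only_ok = meets_objectives(targets, recovery_minutes=480, data_loss_minutes=1440)

# Assumed figures for a replicated secondary system with a short role swap.
replicated_ok = meets_objectives(targets, recovery_minutes=10, data_loss_minutes=1)

print(backup_only_ok, replicated_ok)
```

Writing the objectives down in this form, even on paper, helps make clear whether regular backups alone are enough or whether replication to a second system is needed.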
High Availability as a Service as an alternative approach to HA
Setting up an effective high availability solution requires an investment in money and effort. And admittedly, it is an investment that can be quite costly. But that investment needs to be considered against the costs of the system being offline. And that’s a cost/benefit analysis that each organisation must perform itself.
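One simple way to frame that cost/benefit analysis is to ask how many hours of downtime per year an HA solution would need to prevent in order to pay for itself. The function name and the figures below are hypothetical, and the calculation deliberately ignores harder-to-quantify factors.

```python
def break_even_hours(annual_ha_cost: float, downtime_cost_per_hour: float) -> float:
    """Hours of avoided downtime per year at which HA pays for itself."""
    if downtime_cost_per_hour <= 0:
        raise ValueError("downtime_cost_per_hour must be positive")
    return annual_ha_cost / downtime_cost_per_hour


# Made-up figures: HA costs 84,000 a year and an outage costs 7,000 an hour.
hours_needed = break_even_hours(annual_ha_cost=84000.0,
                                downtime_cost_per_hour=7000.0)
print(hours_needed)
```

If the expected downtime avoided each year, planned and unplanned, exceeds the break-even figure, the investment is easier to justify.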
Upon coming to the conclusion that having a high availability system in place makes solid financial sense, companies will often realise that they do not have the experience to implement such solutions themselves, or simply do not desire the headache involved in doing so. For such organisations, outsourcing their HA implementation can be an attractive alternative to managing it entirely in house. And there are multiple benefits. First off, it means that you do not need to train your team in yet another skill set – they can remain focused on their primary jobs. Ok, you will need to allocate some staff to work with such a vendor to advise what libraries, objects, and so forth you need replicated. But the work of actually implementing, monitoring, and managing the system, as well as the planning and executing of switchover/role-swap testing, can be outsourced to experts.
In addition to allowing you to focus on business-as-usual, engaging in an HAaaS model makes for a big change to your budget – moving the costs of your HA deployment from a CAPEX model to an OPEX one. While for many organisations holding all your production data on-premise (even your backup data) is a requirement, there are some organisations that will be able to take advantage of moving even their HA server to the cloud, as many IBM i companies are beginning to explore and do, further reducing their CAPEX costs.
Joule Tech and HAaaS
Joule Tech has a great deal of experience in implementing HA and DR systems and, drawing from this, the company is able to work with you to design a High Availability service agreement that meets your organisation’s unique requirements – while reducing your workload and requiring no upfront investment (CAPEX).
We will also work with you to tailor the best hardware deployment model for your HA & DR environment – whether on-prem or in the cloud.
On-Premise Model
Cloud Model