Disaster Recovery Planning
This document provides guidance for developing disaster recovery plans for systems, services and applications supported and maintained by the OCIO. It is segmented into the following areas:
- Guiding Principles
- Off-Site Disaster Recovery Types
- Planning Process (NEXT STEPS)
Guiding Principles
- We will consider the foundation of Disaster Recovery to be Data Backup.
- We will plan, build and maintain disaster resilient infrastructures to minimize incidents.
- Defining RTOs and RPOs will be an iterative process because of the competing forces of available budget and required recovery objectives.
- The best DR plan is the plan that we will never have to use.
- We will reduce risks like downtime and data loss by reducing complexity and simplifying architectures.
- Disaster plans will be location-independent plans that focus on the procedures needed to recover a system at the current or an alternate location.
- To minimize risk for our applications, data, and infrastructure, we will understand our single points of failure, dependencies and current redundancies.
- We will test: Disaster Recovery plans that are not tested are not valid plans.
Disaster Recovery: Disaster Recovery (DR) is the ability of an organization to recover or continue to deliver vital services and supporting systems (e.g., technology infrastructure, servers, applications, databases, storage, etc.) within agreed targets after a disruption caused by an incident, emergency or disaster.
- Disaster Recovery plans guide IT and the business areas in recovering disrupted IT and telecommunications capabilities in priority order, so that services resume within the period of downtime pre-determined by the organization, with the highest level of data recovery possible, and at planned levels of operation.
Business Continuity: Business Continuity (BC) is the term used to define the plans and processes that focus on sustaining an organization’s mission and business processes during and after a disruption.
- Business Continuity prepares you for events that disrupt your staff’s normal way of doing business, regardless of whether they cause physical damage to your buildings.
- Business Continuity is about identifying the most urgent activities that underpin key services and, once that analysis is complete, devising plans and strategies that enable you to continue business operations and recover quickly and effectively from any type of disruption, whatever its size or cause.
- Business Continuity gives you a solid guideline to lean on in times of crisis and provides stability and security.
- The BC Plans will reference the DR Plans if services are impacted.
High Availability: High Availability (HA) refers to the continued availability of resources in a computer system after components of that system fail.
- There are a few key characteristics of highly available services: availability, scalability, and fault tolerance.
- Although these characteristics are interrelated, it is important to understand each and how they contribute to the overall availability of the solution.
RTO – RPO: Recovery time objective (RTO) and recovery point objective (RPO) are the key metrics to determine the DR level required to recover. RTO is the maximum time allowed for recovery of a service following an interruption. RPO is the maximum amount of data that may be lost when a service is restored after an interruption. RPO is expressed as a length of time before the failure when data, including transactions, could be lost.
- They are inversely related to the cost of DR: the closer RTO and RPO need to be to zero, the more expensive DR provisioning will be.
- Determining the necessary RTOs and RPOs is the single most important exercise that we as an organization need to perform to ensure the right level of DR without wasting money.
- RTOs and RPOs are derived through business impact analysis of business processes and applications to determine the value of business processes and the anticipated financial impact if they become unavailable.
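As a rough illustration of how these two metrics can be checked against an existing setup, the sketch below compares a backup schedule and an estimated restore time with the agreed targets. It assumes the worst-case data loss equals the interval between backups (a failure just before the next backup runs). The class and function names are hypothetical, and the figures are invented for the example.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class RecoveryTargets:
    """Hypothetical container for the targets agreed for one service."""
    rto: timedelta  # maximum time allowed to restore the service
    rpo: timedelta  # maximum window of data that may be lost


def meets_rpo(backup_interval: timedelta, targets: RecoveryTargets) -> bool:
    # Assumption: a failure just before the next backup loses everything
    # written since the previous one, so the interval is the worst-case loss.
    return backup_interval <= targets.rpo


def meets_rto(estimated_restore_time: timedelta, targets: RecoveryTargets) -> bool:
    # The estimate should come from a tested restore, not a guess.
    return estimated_restore_time <= targets.rto


# Invented example: 4-hour RTO and 1-hour RPO.
targets = RecoveryTargets(rto=timedelta(hours=4), rpo=timedelta(hours=1))
nightly_ok = meets_rpo(timedelta(hours=24), targets)      # nightly backups
frequent_ok = meets_rpo(timedelta(minutes=15), targets)   # 15-minute backups
restore_ok = meets_rto(timedelta(hours=3), targets)       # 3-hour restore
print(nightly_ok, frequent_ok, restore_ok)
```

In this example the nightly schedule misses the 1-hour RPO while the 15-minute schedule meets it, which shows why tighter objectives call for more frequent (and more expensive) data protection.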
Maximum Tolerable Downtime (MTD): The MTD represents the total amount of time the system owner is willing to accept for a mission/business process outage or disruption and includes all impact considerations.
- The recovery point objective (RPO) and the recovery time objective (RTO) are two very specific parameters that are closely associated with recovery.
- The RTO is essentially how long you can go without a specific application.
- This is often associated with your maximum allowable or maximum tolerable outage.
- Determining the MTD is important because leaving it undefined could give contingency planners imprecise direction on (1) the selection of an appropriate recovery method, and (2) the depth of detail required when developing recovery procedures, including their scope and content.
System Backup: System Backup is the ability to recover the OS, Application/Middleware binaries and configuration data of a server.
- These would include traditional backups for a system, i.e. Avamar, VM snapshots, etc.
- System backups enable the restoration of a system via bare-metal restore or a traditional restore method that requires an initial OS install followed by a restore from the backup system.
- These backups should reside at an offsite location.
Data Backup: This includes the transactional data, i.e. databases, file systems, etc.
- These backups should reside at an offsite location.
- Data backup is foundational and required for all disaster recovery plans and initiatives.
Availability: Ability of a service or application to perform its agreed function when required.
- An available application considers the availability of its underlying infrastructure and dependent services.
- Available applications remove single points of failure through redundancy and resilient design.
- When we talk about availability, it is important to understand the concept of the effective availability of the platform. Effective availability considers each dependent service and their cumulative effect on the total system availability.
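The cumulative effect described above can be made concrete with a small calculation. The sketch below assumes the simplest model: the application needs every dependent service to be up (a serial chain) and the services fail independently, so the effective availability is the product of the individual availabilities. The component names and figures are invented for illustration.

```python
from math import prod


def effective_availability(availabilities):
    """Multiply the availability of every service the application depends on.

    Assumes serial dependencies with independent failures; redundant
    (parallel) components would need a different formula.
    """
    values = list(availabilities)
    for value in values:
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"availability must be between 0 and 1, got {value}")
    return prod(values)


# Hypothetical dependencies, each available 99.9% of the time.
components = {"network": 0.999, "database": 0.999, "application": 0.999}
total = effective_availability(components.values())
print(f"effective availability: {total:.4%}")
```

Three services at 99.9% each yield an effective availability of about 99.7%, lower than any single component, which is why each added dependency matters.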
Scalability: The ability of an application to perform its agreed function when the workload or scope changes.
- Scalable applications are able to meet increased demand with consistent results in acceptable time windows.
- Scalability directly affects availability—an application that fails under increased load is no longer available.
- When a system is scalable, it scales horizontally or vertically to manage increases in load while maintaining consistent performance. In basic terms, horizontal scaling adds more machines of the same size (processor, memory, bandwidth, etc.) while vertical scaling increases the size of the existing machines.
Fault Tolerance: A fault-tolerant system is one that has the ability to continue service in spite of a hardware or software failure.
- Fault tolerance is not a degree of availability so much as a method for achieving very high levels of availability.
- A fault-tolerant system is characterized by redundancy in most hardware components, including CPU, memory, I/O subsystems, and other elements.
- However, even fault-tolerant systems are subject to outages from human error.
- Note that High Availability does not imply fault tolerance.
Continuous Availability: Continuous availability means non-stop service, that is, there are no planned or unplanned outages at all.
- This is a much more ambitious goal than HA, because there can be no lapse in service.
- In effect, continuous availability is an ideal state rather than a characteristic of any real-world system.
- This term is sometimes used to indicate a very high level of availability in which only a very small known quantity of downtime is acceptable.
- Note that HA does not imply continuous availability.
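To put "a very small known quantity of downtime" in concrete terms, an availability target can be converted into the downtime it permits per year. The sketch below is simple arithmetic that assumes a 365-day year and counts planned and unplanned outages alike; the function name is hypothetical.

```python
from datetime import timedelta

HOURS_PER_YEAR = 365 * 24  # assumption: leap years are ignored


def allowed_downtime_per_year(availability_percent: float) -> timedelta:
    """Return the downtime per year permitted by an availability target."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    unavailable_fraction = 1 - availability_percent / 100
    return timedelta(hours=HOURS_PER_YEAR * unavailable_fraction)


for target in (99.0, 99.9, 99.99):
    print(f"{target}% -> {allowed_downtime_per_year(target)}")
```

Under these assumptions, 99% availability permits roughly 87.6 hours of downtime per year, 99.9% roughly 8.76 hours, and 99.99% a little under 53 minutes.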
Reduced Capabilities or Limited Capacity: During some events and periods of service recovery, services may be degraded but not completely unavailable. During these lesser crises, or while an organization focuses on restoring normal operations following a disaster, an impacted service may operate at limited capacity, with reduced capabilities, or with restricted resources.
Off-Site Disaster Recovery Types
Hot Sites are facilities appropriately sized to support system requirements and configured with the necessary system hardware, supporting infrastructure, and support personnel. This would be a requirement for Continuous Availability.
Warm Sites are partially equipped office spaces that contain some of the system hardware, software, telecommunications, and power sources. Currently TNC.
Cold Sites are typically facilities with adequate space and infrastructure (electric power, telecommunications connections, and environmental controls) to support information system recovery activities. Currently Wright State.
Disaster Recovery as a Service (DRaaS) offerings come from cloud providers that use cloud-based infrastructure to deliver DR services. These types of services can reduce the investment required to provide DR.
Planning Process (NEXT STEPS)
The planning process that we need to begin to build and mature can be divided into 4 major areas:
(1) Identify Business Recovery Requirements (MTD). The MTD is typically calculated as part of a business impact analysis (BIA).
(2) Determine speed of Recovery (RTO)
(3) Create Plan and Address Plan Gaps
(4) Test and Maintain Plan
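One way to picture how steps (1) through (3) fit together is sketched below: once each system has an MTD from the BIA and an RTO that the current plan can achieve, any system whose RTO exceeds its MTD is a plan gap to address, with the least tolerant systems first. This is a minimal illustration and not a prescribed method; the record layout, system names and durations are all hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class SystemRequirement:
    name: str        # hypothetical system name
    mtd: timedelta   # maximum tolerable downtime from the BIA (step 1)
    rto: timedelta   # recovery time the current plan can achieve (step 2)


def find_plan_gaps(systems):
    """Return systems whose RTO exceeds their MTD, smallest MTD first."""
    gaps = [system for system in systems if system.rto > system.mtd]
    return sorted(gaps, key=lambda system: system.mtd)


# Invented figures for illustration only.
systems = [
    SystemRequirement("payroll", mtd=timedelta(days=3), rto=timedelta(days=1)),
    SystemRequirement("email", mtd=timedelta(hours=8), rto=timedelta(hours=24)),
    SystemRequirement("web portal", mtd=timedelta(hours=4), rto=timedelta(hours=12)),
]
gap_names = [system.name for system in find_plan_gaps(systems)]
print(gap_names)
```

Here the payroll system already recovers within its MTD, while the web portal and email are flagged as gaps for step (3), in that order of urgency.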
Sample Planning Grid
- System / Application Name
- Days / Hours
- Threats:
  1. Hardware Failure
  2. Software Failure
  3. Facility Unavailable
  4. Security/DDoS attack
- Steps we have taken to reduce the risk of this threat affecting our system.
- Actions we will take should a failure occur.
- Actions we will take to restore the application to normal operations.
Identify roles and responsibilities: Document and agree on who will do what in the planning process. Include customers, business analysts, service owners, system administrators, engineers, architects, DR coordinator, etc.
Documentation of plans: DR plans for systems, applications, etc. should be centrally coordinated, stored and managed.
Plan Dependencies: Ensure that DR plans identify and consider dependencies (network, batch, etc.).
Testing of plans: Plans must be tested to confirm their viability and to identify deficiencies. Tests can encompass both virtual (table-top) and actual tests. Test results should be documented and stored with the plan documentation.