Disaster Recovery tiers

February 6, 2017

by Kevin Balogh at 4:24pm

Working definitions of Disaster Recovery tiers (subject to change as we recognize the criticality of more services):

Tier 0, Critical to our ability to deliver most services and must be restored first in order to bring up all other services.

– Technical or Supporting services in catalog that support many other services

– 2010 flu pandemic ranking

– 2013 TNC business continuity plan

– Enterprise Security’s BIA

– Incident Management priority model

– Presented to leaders meeting for review Nov. 2015

– Tier 0 Services: Active Directory Services, Data Center Networking, Server Hosting, Database Services, Enterprise Data Storage, Batch Process Management, Account and Identity Management, Kerberos Authentication, Web Single Sign On, Network Services

Tier 1, Important or essential to our ability to deliver multiple services

– Used count of Depends-on services between 16 and 3 that were not already Tier 0

– Presented to leaders meeting for review Nov. 2015

– Tier 1 Services: Human Resource Applications, Telephone Services, Financial Applications, IT Service Management System Administration, Student Information Applications, Email (University E-mail Service), Web Hosting, Identity Management Connectors, Multi-Factor Tokens (RSA), Multi-Factor Tokens (Duo), eReports, eMaterials Support, Carmen (Alpha, D2L, Canvas)

Tier 2, Important to our ability to deliver a few services

– Used count of Depends-on services between 3 and 1 that were not already Tier 0 or Tier 1

– Tier 2 Services: Data Warehouse, Telephone VOIP Services, Enterprise Integration Services, Call Center Services, Buckeye Oasis Support, Telemanagement Administration & Support, Log Management Hosting, Co-location, Mailing Lists, Wireless Networking Connectivity

Tier 3, the rest of the services found in our catalog
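
The tiering rules above can be sketched as a small function. This is only an illustration: it reads the "Depends-on" count as the number of other catalog services that depend on a service, and because the two ranges overlap at 3, it assumes a count of 3 or more means Tier 1 and a count of 1 or 2 means Tier 2. The function name, the sample counts and "Example Service" are hypothetical.

```python
# Hypothetical sketch: assign DR tiers from depends-on counts.
# The thresholds are an assumed reading of the overlapping ranges in the post.

def assign_tier(service, dependents, tier0):
    """Return the DR tier (0-3) for a service.

    dependents: assumed to be the number of other services that depend on it.
    tier0: services already designated Tier 0 through the leadership review.
    """
    if service in tier0:
        return 0
    if dependents >= 3:   # assumed: "between 16 and 3" includes 3
        return 1
    if dependents >= 1:   # assumed: one or two dependents
        return 2
    return 3              # the rest of the catalog


tier0 = {"Active Directory Services", "Data Center Networking"}

# Dependent counts below are made up for illustration only.
counts = {
    "Active Directory Services": 40,
    "Web Hosting": 5,
    "Mailing Lists": 2,
    "Example Service": 0,
}

tiers = {name: assign_tier(name, n, tier0) for name, n in counts.items()}
print(tiers)
```

Keeping Tier 0 as an explicit set reflects that those services were chosen from several sources (the catalog, the BIA, the priority model) rather than from a count alone.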

OCIO Disaster Recovery Planning Definitions and Guiding Principles

January 6, 2017

by Kevin Balogh at 12:23pm

Disaster Recovery Planning

Overview

This document is intended to be guidance for the development of disaster recovery plan(s) for systems, services and applications supported and maintained by the OCIO. It is segmented into the following areas:

· Guiding Principles

· Definitions

· Off-Site Disaster Recovery Types

· Planning Process (NEXT STEPS)

Guiding Principles

We will consider the foundation of Disaster Recovery to be Data Backup.
We will plan, build and maintain disaster resilient infrastructures to minimize incidents.
Defining RTOs and RPOs will be an iterative process because of the competing forces of available budget and required recovery objectives.
The best DR plan is the plan that we will never have to use.
We will reduce risks like downtime and data loss by reducing complexity and simplifying architectures.
Disaster plans will be location-independent plans that focus on the procedures needed to recover a system at the current or an alternate location.
To minimize risk for our applications, data, and infrastructure, we will understand our single points of failure, dependencies and current redundancies.
We will test: Disaster Recovery plans that are not tested are not valid plans.

Definitions

Disaster Recovery: Disaster Recovery (DR) is the ability of an organization to recover or continue to deliver vital services and supporting systems (e.g., technology infrastructure, servers, applications, databases, storage) after a disruption by an incident, emergency or disaster within agreed targets.

Disaster Recovery plans guide IT and the business areas in recovering disrupted IT and telecommunications capabilities in a prioritized way, so that services can resume within a minimum period of downtime predetermined by the organization, with the highest level of data recovery possible, and at planned levels of operations.

Business Continuity: Business Continuity (BC) is the term used to define the plans and processes that focus on sustaining an organization’s mission and business processes during and after a disruption.

Business Continuity prepares you for events that disrupt your staff’s normal way of doing business, regardless of whether they do physical damage to your buildings.
Business Continuity is about identifying the most urgent activities that underpin key services and then, once that analysis is complete, devising plans and strategies that will enable you to continue your business operations and recover quickly and effectively from any type of disruption, whatever its size or cause.
Business Continuity gives you a solid guideline to lean on in times of crisis and provides stability and security.
The BC Plans will reference the DR Plans if services are impacted.

High Availability: High Availability (HA) refers to the availability of resources in a computer system, in the wake of component failures in the system.

There are a few key characteristics of highly available services: availability, scalability, and fault tolerance.
Although these characteristics are interrelated, it is important to understand each and how they contribute to the overall availability of the solution.

RTO – RPO: Recovery time objective (RTO) and recovery point objective (RPO) are the key metrics to determine the DR level required to recover. RTO is the maximum time allowed for recovery of a service following an interruption. RPO is the maximum amount of data that may be lost when a service is restored after an interruption. RPO is expressed as a length of time before the failure when data, including transactions, could be lost.

RTO and RPO are inversely related to the cost of DR: the closer RTO and RPO need to be to zero, the more expensive DR provisioning will be.
Determining the necessary RTOs and RPOs is the single most important exercise that we as an organization need to perform to ensure the right level of DR without wasting money.
RTOs and RPOs are derived through business impact analysis of business processes and applications to determine the value of business processes and the anticipated financial impact if they become unavailable.
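
As a rough illustration of how the two metrics are measured, the sketch below compares a single outage against its targets. The function name, field names and timestamps are hypothetical; it assumes the RPO window is measured from the last recoverable copy of the data to the moment of failure.

```python
from datetime import datetime, timedelta


def check_objectives(last_good_copy, failure, restored, rto, rpo):
    """Compare one outage against its RTO and RPO targets."""
    downtime = restored - failure                # how long the service was down
    data_loss_window = failure - last_good_copy  # data written here may be lost
    return {
        "downtime": downtime,
        "data_loss_window": data_loss_window,
        "rto_met": downtime <= rto,
        "rpo_met": data_loss_window <= rpo,
    }


# Illustrative timestamps only.
result = check_objectives(
    last_good_copy=datetime(2017, 1, 6, 2, 0),
    failure=datetime(2017, 1, 6, 9, 30),
    restored=datetime(2017, 1, 6, 12, 30),
    rto=timedelta(hours=4),
    rpo=timedelta(hours=1),
)
print(result["rto_met"], result["rpo_met"])
```

In this example the three hours of downtime fit within the four-hour RTO, but the last good copy is seven and a half hours old, so the one-hour RPO is missed. Closing that gap means more frequent backups or replication, which is where the cost rises.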

Maximum Tolerable Downtime (MTD): The MTD represents the total amount of time the system owner is willing to accept for a mission/business process outage or disruption and includes all impact considerations.

The recovery point objective (RPO) and the recovery time objective (RTO) are two very specific parameters that are closely associated with recovery.
The RTO is basically how long you can go without a specific application.
This is often associated with your maximum allowable or maximum tolerable outage.
Determining MTD is important because leaving it undetermined could leave contingency planners with imprecise direction on (1) selection of an appropriate recovery method, and (2) the depth of detail which will be required when developing recovery procedures, including their scope and content.

System Backup: System Backup is the ability to recover the OS, Application/Middleware binaries and configuration data of a server.

These would include traditional backups for a system, i.e. Avamar, VM snapshots, etc.
These backups enable the restoration of a system via bare-metal restore or a traditional restore methodology that requires an initial OS install followed by a restore from the backup system.
These backups should reside at an offsite location.

Data Backup: This includes the transactional data, i.e. databases, file systems, etc.

These backups should reside at an offsite location.
Data backup is foundational and required for all disaster recovery plans and initiatives.

Availability: Ability of a service or application to perform its agreed function when required.

An available application considers the availability of its underlying infrastructure and dependent services.
Available applications remove single points of failure through redundancy and resilient design.
When we talk about availability, it is important to understand the concept of the effective availability of the platform. Effective availability considers each dependent service and their cumulative effect on the total system availability.
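
A minimal sketch of the cumulative effect, assuming every dependency is required (no redundancy between them) and that failures are independent, so the individual availabilities simply multiply. The service names and figures are made up for illustration.

```python
import math


def effective_availability(availabilities):
    """Multiply the availability of a service and each thing it depends on."""
    return math.prod(availabilities)


# Illustrative figures only; each value is a fraction of time available.
chain = {
    "application": 0.999,
    "database": 0.999,
    "network": 0.9995,
    "storage": 0.9999,
}

total = effective_availability(chain.values())
print(f"Effective availability: {total:.4%}")
```

The result is always lower than the weakest single component, which is why an application cannot promise more availability than the infrastructure beneath it.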

Scalability: The ability of an application to perform its agreed function when the workload or scope changes.

Scalable applications are able to meet increased demand with consistent results in acceptable time windows.
Scalability directly affects availability—an application that fails under increased load is no longer available.
When a system is scalable, it scales horizontally or vertically to manage increases in load while maintaining consistent performance. In basic terms, horizontal scaling adds more machines of the same size (processor, memory, bandwidth, etc.) while vertical scaling increases the size of the existing machines.

Fault Tolerance: A fault-tolerant system is one that has the ability to continue service in spite of a hardware or software failure.

Fault tolerance is not a degree of availability so much as a method for achieving very high levels of availability.
A fault-tolerant system is characterized by redundancy in most hardware components, including CPU, memory, I/O subsystems, and other elements.
However, even fault-tolerant systems are subject to outages from human error.
Note that High Availability does not imply fault tolerance.

Continuous Availability: Continuous availability means non-stop service, that is, there are no planned or unplanned outages at all.

This is a much more ambitious goal than HA, because there can be no lapse in service.
In effect, continuous availability is an ideal state rather than a characteristic of any real-world system.
This term is sometimes used to indicate a very high level of availability in which only a very small known quantity of downtime is acceptable.
Note that HA does not imply continuous availability.
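
To make "a very small known quantity of downtime" concrete, the sketch below converts an availability target into a yearly downtime budget. It assumes a 365-day year and counts planned and unplanned outages alike; the targets listed are common examples, not OCIO commitments.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def allowed_downtime_hours(availability_percent):
    """Hours per year a service may be down and still meet the target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99, 99.999):
    hours = allowed_downtime_hours(target)
    print(f"{target}% -> {hours:.2f} hours/year ({hours * 60:.1f} minutes)")
```

A target of 100% leaves a budget of zero, which is why continuous availability is treated as an ideal rather than something a real system achieves.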

Reduced Capabilities or Limited Capacity: During some events and periods of service recovery, services may be degraded but not completely unavailable. During these lesser crises, or while an organization focuses on restoring normal operations following a disaster, an impacted service may be operating at limited capacity, with reduced capabilities, or with restricted resources.

Off-Site Disaster Recovery Types

Hot Sites are facilities appropriately sized to support system requirements and configured with the necessary system hardware, supporting infrastructure, and support personnel. This would be a requirement for Continuous Availability.

Warm Sites are partially equipped office spaces that contain some of the system hardware, software, telecommunications, and power sources. Currently TNC.

Cold Sites are typically facilities with adequate space and infrastructure (electric power, telecommunications connections, and environmental controls) to support information system recovery activities. Currently Wright State.

Disaster Recovery as a Service (DRaaS) providers are cloud providers that offer cloud-based infrastructure to deliver DR services. These types of services can reduce the investment required to provide DR.

Planning Process

The planning process that we need to begin to build and mature can be divided into 4 major areas:

(1) Identify Business Recovery Requirements (MTD). MTD is typically calculated as part of a business impact analysis (BIA).

(2) Determine speed of Recovery (RTO)

(3) Create Plan and Address Plan Gaps

(4) Test and Maintain Plan
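
Steps (1) through (3) can be sketched as a simple gap check. It assumes, as is common practice, that a service's RTO should not exceed the MTD found in its business impact analysis, and it treats a missing RTO as a gap too. The service names and durations are hypothetical.

```python
from datetime import timedelta


def find_plan_gaps(services):
    """Return names of services whose RTO is missing or exceeds the MTD."""
    gaps = []
    for name, targets in services.items():
        mtd = targets["mtd"]
        rto = targets.get("rto")  # None when no RTO has been set yet
        if rto is None or rto > mtd:
            gaps.append(name)
    return gaps


# Hypothetical services and targets, for illustration only.
services = {
    "Example Payroll App": {"mtd": timedelta(hours=24), "rto": timedelta(hours=8)},
    "Example Reporting App": {"mtd": timedelta(hours=4), "rto": timedelta(hours=12)},
    "Example Wiki": {"mtd": timedelta(days=3)},
}

print(find_plan_gaps(services))
```

Here the reporting application is flagged because its RTO is longer than the business can tolerate, and the wiki is flagged because no RTO has been agreed yet.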

Sample Planning Grid

| Critical System | RTO/RPO | Threat | Prevention Strategy | Response Strategy | Recovery Strategy |
| --- | --- | --- | --- | --- | --- |
| System / Application Name | Days / Hours | 1. Hardware Failure; 2. Software Failure; 3. Facility Unavailable; 4. Security/DDOS attack | Steps we have taken to reduce the risk of this threat affecting our system. | Actions we will take should a failure occur. | Actions we will take to restore the application to normal operations. |
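
One way to keep the grid in a form that can be checked for completeness is a small record per system and threat. This is a sketch only: the class name, field names and sample entries are hypothetical, and the columns mirror the sample grid.

```python
from dataclasses import dataclass


@dataclass
class GridRow:
    """One system/threat combination from the planning grid."""
    system: str
    rto_rpo: str
    threat: str
    prevention: str = ""
    response: str = ""
    recovery: str = ""

    def missing(self):
        """List the strategy columns that are still empty."""
        columns = ("prevention", "response", "recovery")
        return [name for name in columns if not getattr(self, name)]


THREATS = [
    "Hardware Failure",
    "Software Failure",
    "Facility Unavailable",
    "Security/DDOS attack",
]

# Hypothetical system and targets, for illustration only.
rows = [GridRow("Example App", "4 hours / 1 hour", threat) for threat in THREATS]
rows[0].prevention = "Redundant power supplies"

incomplete = {row.threat: row.missing() for row in rows if row.missing()}
print(incomplete)
```

Listing the empty columns per threat gives a quick view of where a plan still has gaps before it is tested.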

Additional steps…

Identify roles and responsibilities: Document and agree on who will do what in the planning process. Include customers, business analysts, service owners, system administrators, engineers, architects, DR coordinator, etc.

Documentation of plans: DR plans for systems, applications, etc. should be centrally coordinated, stored and managed.

Plan Dependencies: Ensure that DR plans identify and consider dependencies (network, batch, etc.)

Testing of plans: Plans must be tested to ensure viability of the plan and identification of plan deficiencies. Tests can encompass both virtual (table-top) and actual tests. Testing should be documented and stored with plan documentation.
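
The dependency point can be illustrated by deriving a recovery order from a depends-on map, so that supporting services are restored before the services that need them. The dependency graph below is illustrative and is not the real catalog; the sketch uses `graphlib` from the Python standard library (3.9 or later).

```python
from graphlib import TopologicalSorter

# Each service maps to the services it depends on (illustrative only).
depends_on = {
    "Data Center Networking": [],
    "Enterprise Data Storage": ["Data Center Networking"],
    "Active Directory Services": ["Data Center Networking"],
    "Database Services": ["Enterprise Data Storage", "Active Directory Services"],
    "Financial Applications": ["Database Services", "Active Directory Services"],
}

# static_order() yields every service after all of its dependencies.
recovery_order = list(TopologicalSorter(depends_on).static_order())
print(recovery_order)
```

If two services depend on each other, `static_order()` raises `graphlib.CycleError`, which is itself a useful finding to record in the plan.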

PROBLEM MANAGEMENT: WHAT ROLE DO YOU PLAY AT OCIO?

February 28, 2014 (updated August 30, 2019)

by Kevin Balogh at 1:32pm

Problem Management is about discovering the root cause of an outage, adjusting processes to make sure it is avoided in the future and spreading that knowledge to the rest of the organization.

There are three roles defined in the Problem Management process. Do you know what role you play?

Problem Analyst:

A person who can recognize or “propose” a Problem
- Searches the Knowledge Base for Known Errors, prior to declaring a Problem
- Picks the most representative example, the one that describes the issue best, when proposing a Problem based upon receipt of multiple Incidents
- Clearly states Problems with focus on one issue, while avoiding “ands” in the description

Problem Manager:

A person who can “declare” or take a proposed Problem and open it, thereby committing OCIO resources to investigate root cause.
- Owns the Problem through the Root Cause process
- Gathers the appropriate people to work on a Problem and thoroughly states and specifies the Problem

Process Owner: (Kevin Balogh)

Has overall accountability for the performance of the Root Cause Analysis process
Provides Incident trending and monthly reporting on the progress and health of the process

If you’re curious about the results of past Root Cause Analysis findings, search the Knowledge Base in ServiceNow for any of these related phrases: RCA, Root Cause Analysis, or Lessons Learned.

Kevin Balogh
Defender of Stability
Services Management, Quality & Process Management
247-ITIL
balogh.5@osu.edu
Find me on Lync or in 018 Enarson Classrooms Building

Problem Management and Root Cause Analysis at OCIO

January 21, 2014 (updated March 5, 2020)

by Kevin Balogh at 9:33am

As much as we don’t want Incidents or outages to occur with OCIO services, they do occur. The key to avoiding future Incidents and outages is to fully understand their root cause during the Problem Management process.

Problem Management is about discovering the root cause of an outage, working out how to avoid it in the future and spreading that knowledge to the rest of the organization.

Root Cause Analysis (RCA) is an activity within Problem Management where we aim to reach a shared understanding of the problem and determine the root cause.

If you are invited to attend RCA discussions and meetings, please don’t hesitate to participate. These meetings are not intended to be mysterious or scary.

We have ground rules to help people feel comfortable, safe and to foster creative thinking, including:

Do aim to reach shared understanding
Do learn from the experience
Do share the outcomes with the rest of the OCIO
Do not blame or point fingers
Do not jump to conclusions without data or facts
Do not judge possible root causes too early

What to expect during a root cause analysis session:

We take our time and describe the problem thoroughly before reaching for root cause.
If there is data we need during the session and we can get it quickly, I ask people to retrieve it right away; otherwise we’ll circle back.
If we have too few or too many possible root causes, we’ll take a deliberate look at the distinctions between what, where and when it happened and what, where and when it didn’t happen.
If we still have too few or too many possible causes, we look at recent changes to the affected or supporting systems.

Discovering root cause isn’t new to OCIO. My aim with the Problem Management Process is to give you a place to store your root cause activities (Problem Records), some structured root cause analysis when needed and a place to share what you’ve learned.

Problem Management
Improving services by investigating root cause

Kevin Balogh
Defender of Stability
Services Management, Quality & Process Management
balogh.5@osu.edu

Problem Management – ServiceNow Wiki

November 14, 2013

by Kevin Balogh at 12:38pm

Problem Management – ServiceNow Wiki.

Problem Management helps to identify the cause of an error in the IT infrastructure that is usually reported as occurrences of related incidents. Resolving a problem means fixing the error that will stop these incidents from occurring in the future.
