Dynamic Failure Detection and Recovery

Dynamic Failure Detection and Recovery (Erl, Naserpour)

How can the notification and recovery of IT resource failure be automated?

Problem

When cloud-based IT resources fail, manual intervention may be unacceptably inefficient.

Solution

A watchdog system is established to monitor IT resource status and perform notifications and/or recovery attempts during failure conditions.

Application

Different intelligent monitoring and recovery technologies can be used to automate failure detection and recovery tasks, with a focus on watching, deciding upon, acting upon, reporting, and escalating IT resource failure conditions.

Problem

Cloud environments can comprise vast quantities of IT resources that are accessed by numerous cloud consumers. Any of those IT resources can experience predictable failure conditions that require intervention to resolve. Manually administering and resolving standard IT resource failures in cloud environments is generally inefficient and impractical.

Solution

An automated watchdog system is established to monitor and respond to a wide range of pre-defined failure scenarios. This system can also notify and escalate failure conditions that it cannot resolve automatically.

Application

The resilient watchdog system relies on a specialized cloud usage monitor (which can be referred to as the intelligent watchdog monitor) to actively monitor IT resources and take pre-defined actions in response to pre-defined events.

Figure 1 - The SLA monitor keeps track of cloud consumer requests (1) and detects that a cloud service has failed (2).

Figure 2 - The SLA monitor notifies the watchdog system (3), which restores the cloud service based on predefined policies (4).
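
The interaction in Figures 1 and 2 can be sketched roughly as follows. This is a minimal illustration, not a reference implementation: the SlaMonitor class, its record method, and the rule that a fixed number of consecutive failed requests counts as a service failure are all assumptions made for the example.

```python
from typing import Callable


class SlaMonitor:
    """Hypothetical SLA monitor that tracks request outcomes for one cloud service."""

    def __init__(self, service: str, notify: Callable[[str], None], threshold: int = 3):
        self.service = service
        self.notify = notify        # callback into the watchdog system (step 3)
        self.threshold = threshold  # assumed rule: N consecutive failures = service failed
        self.consecutive_failures = 0
        self.failed = False

    def record(self, succeeded: bool) -> None:
        """Track one cloud consumer request (step 1) and detect failure (step 2)."""
        if succeeded:
            self.consecutive_failures = 0
            self.failed = False
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.threshold and not self.failed:
            self.failed = True
            self.notify(self.service)  # notify only once per failure episode
```

Notifying once per failure episode is a design choice in this sketch; it keeps the watchdog system from being flooded while a recovery attempt is already under way.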

The resilient watchdog system, together with the intelligent watchdog monitor, performs the following five core functions:

  • watching
  • deciding upon an event
  • acting upon an event
  • reporting
  • escalating
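
As an illustration only, the five functions listed above could be arranged in a single monitoring pass like this. The Watchdog class, its probe and recover callbacks, and the in-memory log are hypothetical names chosen for the sketch; a real product would define its own interfaces.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Watchdog:
    """Hypothetical watchdog covering the five core functions in one pass."""
    probe: Callable[[], bool]    # returns True when the IT resource is healthy
    recover: Callable[[], bool]  # returns True when the recovery action ran successfully
    log: List[str] = field(default_factory=list)

    def run_once(self) -> str:
        # Watching: check the current status of the IT resource.
        if self.probe():
            return "healthy"
        # Deciding upon an event: a failed probe is treated as a failure event.
        # Reporting: record what was observed.
        self.log.append("report: failure detected")
        # Acting upon an event: attempt recovery, then verify with a second probe.
        if self.recover() and self.probe():
            self.log.append("report: recovered")
            return "recovered"
        # Escalating: automation could not resolve the failure.
        self.log.append("escalate: manual intervention required")
        return "escalated"
```

Calling run_once on a schedule would give a simple polling watchdog; the second probe after recovery is there so that a recovery action that reports success without actually restoring the resource still leads to escalation.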

Sequential recovery policies can be defined for each IT resource to determine how the intelligent watchdog monitor should behave when encountering a failure condition. For example, a recovery policy may state that before issuing a notification, one recovery attempt should be carried out automatically.

Figure 3 - In the event of any failures, the active monitor refers to its predefined policies to recover the service step by step, escalating the processes as the problem proves to be deeper than expected.
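
A sequential recovery policy of this kind can be pictured as an ordered list of steps that is walked until the resource is healthy again. The sketch below assumes a hypothetical apply_policy helper and represents each step as a name plus a callable; neither comes from a specific product.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, Callable[[], None]]  # (step name, action to run)


def apply_policy(steps: List[Step], is_healthy: Callable[[], bool]) -> List[str]:
    """Run policy steps in order until the resource is healthy.

    Returns the names of the steps that were actually carried out.
    """
    taken = []
    for name, action in steps:
        if is_healthy():
            break  # an earlier step fixed the problem; stop escalating
        action()
        taken.append(name)
    return taken
```

For the example policy described above, the list would hold one automatic recovery step followed by a notification step, so the notification is issued only if the recovery attempt leaves the resource unhealthy.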

When the intelligent watchdog monitor escalates an issue, it commonly takes actions such as:

  • running a batch file
  • sending a console message
  • sending a text message
  • sending an email message
  • sending an SNMP trap
  • logging a ticket in a ticketing and event monitoring system
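
One way to wire such actions together is a small dispatcher that maps action names to handlers. The following is a sketch under stated assumptions: the handler names, the escalate function, and the example addresses are hypothetical, only two of the listed action types are implemented, and the email handler merely builds a message with Python's standard email library instead of sending it.

```python
import sys
from email.message import EmailMessage
from typing import Callable, Dict, List

Handler = Callable[[str, str], None]  # (resource name, failure detail) -> None


def console_handler(write: Callable[[str], object] = sys.stderr.write) -> Handler:
    """Console message action; `write` is injectable so output can be captured."""
    def send(resource: str, detail: str) -> None:
        write(f"[watchdog] {resource}: {detail}\n")
    return send


def email_handler(outbox: List[EmailMessage], sender: str, recipient: str) -> Handler:
    """Email action; appends to an outbox instead of contacting a mail server."""
    def send(resource: str, detail: str) -> None:
        msg = EmailMessage()
        msg["Subject"] = f"Failure escalated: {resource}"
        msg["From"] = sender
        msg["To"] = recipient
        msg.set_content(detail)
        outbox.append(msg)  # a real system would hand this to a mail server
    return send


def escalate(handlers: Dict[str, Handler], actions: List[str],
             resource: str, detail: str) -> List[str]:
    """Run each configured action; return the names that have no registered handler."""
    missing = []
    for name in actions:
        handler = handlers.get(name)
        if handler is None:
            missing.append(name)
        else:
            handler(resource, detail)
    return missing
```

Returning the unhandled action names, instead of raising an error, lets the remaining escalation actions still run when one channel is not configured.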

A variety of programs and products can act as an intelligent watchdog monitor. Most can be integrated with standard ticketing and event management systems.

NIST Reference Architecture Mapping

This pattern relates to the highlighted parts of the NIST reference architecture, as follows:

Dynamic Failure Detection and Recovery: NIST Reference Architecture Mapping

This pattern is covered in CCP Module 5: Advanced Cloud Architecture.

For more information regarding the Cloud Certified Professional (CCP) curriculum, visit www.cloudschool.com.