Dynamic Failure Detection and Recovery (Erl, Naserpour)
How can the notification and recovery of IT resource failure be automated?
Problem: When cloud-based IT resources fail, manual intervention may be unacceptably inefficient.
Solution: A watchdog system is established to monitor IT resource status and perform notifications and/or recovery attempts during failure conditions.
Application: Different intelligent monitoring and recovery technologies can be used to automate failure detection and recovery tasks, with a focus on watching, deciding upon, acting upon, reporting, and escalating IT resource failure conditions.
Compound Patterns: Burst In, Burst Out to Private Cloud, Burst Out to Public Cloud, Elastic Environment, Infrastructure-as-a-Service (IaaS), Multitenant Environment, Platform-as-a-Service (PaaS), Private Cloud, Public Cloud, Resilient Environment, Software-as-a-Service (SaaS)
Cloud environments can comprise vast quantities of IT resources that are accessed by numerous cloud consumers. Any of those IT resources can experience predictable failure conditions that require intervention to resolve. Manually administering and resolving standard IT resource failures in cloud environments is generally inefficient and impractical.
An automated watchdog system is established to monitor and respond to a wide range of pre-defined failure scenarios. This system is further able to notify and escalate certain failure conditions that it cannot automatically solve itself.
The resilient watchdog system relies on a specialized cloud usage monitor (which can be referred to as the intelligent watchdog monitor) to actively monitor IT resources and take pre-defined actions in response to pre-defined events.
Figure 1 - The SLA monitor keeps track of cloud consumer requests (1) and detects that a cloud service has failed (2).
Figure 2 - The SLA monitor notifies the watchdog system (3), which restores the cloud service based on predefined policies (4).
The resilient watchdog system, together with the intelligent watchdog monitor, performs the following five core functions:
- watching
- deciding upon an event
- acting upon an event
- reporting
- escalating
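The five functions this pattern names (watching, deciding upon, acting upon, reporting, and escalating) can be sketched as a single monitoring pass. This is a minimal illustration, not a product's behavior: the `Watchdog` class and its `probe`, `recover`, and `escalate` callables are hypothetical names introduced here, and the decision rule (always try one recovery before escalating) is an assumption.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Watchdog:
    """Hypothetical watchdog monitor for one IT resource."""
    probe: Callable[[], bool]        # watching: True when the resource is healthy
    recover: Callable[[], bool]      # acting: True when the recovery attempt worked
    escalate: Callable[[str], None]  # escalating: hand the failure to a person or system
    log: List[str] = field(default_factory=list)  # reporting: record of events

    def check_once(self) -> str:
        """Run one watch/decide/act/report/escalate pass and return the outcome."""
        if self.probe():                       # watch
            return "healthy"
        self.log.append("failure detected")    # report
        # Decide: this sketch assumes one automatic recovery attempt comes first.
        if self.recover():                     # act
            self.log.append("recovered")
            return "recovered"
        self.log.append("recovery failed")
        self.escalate("automatic recovery failed")  # escalate
        return "escalated"
```

In practice `check_once` would be called on a schedule, and `probe` would wrap whatever health check the monitored resource exposes.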
Sequential recovery policies can be defined for each IT resource to determine how the intelligent watchdog monitor should behave when encountering a failure condition. For example, a recovery policy may state that before issuing a notification, one recovery attempt should be carried out automatically.
Figure 3 - In the event of any failures, the active monitor refers to its predefined policies to recover the service step by step, escalating the processes as the problem proves to be deeper than expected.
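A sequential recovery policy of this kind can be modeled as an ordered list of recovery steps followed by a notification that is issued only if every step fails. The sketch below is one possible shape, under assumptions: `run_policy`, the step names, and the `notify` callback are hypothetical and do not come from any specific monitoring product.

```python
from typing import Callable, List, Tuple

# A step is a (name, action) pair; the action returns True if the resource recovered.
Step = Tuple[str, Callable[[], bool]]


def run_policy(steps: List[Step],
               notify: Callable[[List[str]], None]) -> Tuple[bool, List[str]]:
    """Try each recovery step in order and stop at the first success.

    If every step fails, call notify with the names of the attempted steps.
    Returns (recovered, attempted_step_names).
    """
    attempted: List[str] = []
    for name, action in steps:
        attempted.append(name)
        if action():
            return True, attempted   # recovered, so no notification is issued
    notify(attempted)                # all steps failed: notify
    return False, attempted
```

The policy from the example above, one automatic recovery attempt before a notification, would then be a single-step list such as `[("restart service", restart)]`, where `restart` is a placeholder for the real recovery action. Deeper problems can be handled by appending further steps.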
When the intelligent watchdog monitor escalates an issue, there are common types of actions it may take, such as:
- running a batch file
- sending a console message
- sending a text message
- sending an email message
- sending an SNMP trap
- logging a ticket in a ticketing and event monitoring system
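Escalation actions like these are often wired up as a table of handlers keyed by channel name. The following sketch assumes that design. The channel names, the `escalate` function, and the placeholder address are hypothetical, and the email handler only builds a message with the standard library's `email.message.EmailMessage` without sending it.

```python
import sys
from email.message import EmailMessage
from typing import Callable, Dict, List


def console_message(text: str) -> None:
    """Write the escalation text to the console (stderr)."""
    print(f"[watchdog] {text}", file=sys.stderr)


def build_email(text: str) -> EmailMessage:
    """Build, but do not send, an email describing the failure."""
    msg = EmailMessage()
    msg["Subject"] = "IT resource failure"
    msg["To"] = "ops@example.com"  # placeholder address, an assumption
    msg.set_content(text)
    return msg


def escalate(text: str, channels: List[str],
             handlers: Dict[str, Callable[[str], object]]) -> List[str]:
    """Invoke the handler for each requested channel; return the channels used."""
    used: List[str] = []
    for channel in channels:
        handler = handlers.get(channel)
        if handler is None:
            continue  # no handler configured for this channel in this sketch
        handler(text)
        used.append(channel)
    return used
```

Handlers for the remaining actions (batch file, text message, SNMP trap, ticket) would be registered in the same dictionary, each wrapping whatever client the environment provides.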
A variety of programs and products can act as an intelligent watchdog monitor. Most can be integrated with standard ticketing and event management systems.
NIST Reference Architecture Mapping
This pattern relates to the highlighted parts of the NIST reference architecture, as follows: