Dynamic Failure Detection and Recovery (Erl, Naserpour)

How can notification of, and recovery from, IT resource failures be automated?

Problem

When cloud-based IT resources fail, relying on manual intervention to detect and resolve the failure may be unacceptably inefficient.

Solution

A watchdog system is established to monitor IT resource status and perform notifications and/or recovery attempts during failure conditions.

Application

Various intelligent monitoring and recovery technologies can be used to automate failure detection and recovery. These technologies focus on five tasks: watching, deciding upon, acting upon, reporting, and escalating IT resource failure conditions.
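
The five tasks named above can be illustrated with a minimal sketch. It is not taken from the pattern itself: the Watchdog class, its probe and recover callables, the max_attempts limit, and the in-memory log are all hypothetical names chosen for illustration, and a real watchdog system would rely on platform-specific monitoring and recovery mechanisms.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Watchdog:
    """Hypothetical watchdog: watch, decide, act, report, escalate."""

    probe: Callable[[], bool]    # watch: returns True when the resource is healthy
    recover: Callable[[], bool]  # act: returns True when a recovery attempt succeeds
    max_attempts: int = 3        # decide: recovery attempts allowed before escalating
    log: List[str] = field(default_factory=list)  # report: in-memory event log

    def check_once(self) -> str:
        # Watch: poll the IT resource status.
        if self.probe():
            return "healthy"
        self.log.append("failure detected")

        # Decide and act: retry recovery up to the configured limit.
        for attempt in range(1, self.max_attempts + 1):
            if self.recover():
                self.log.append(f"recovered on attempt {attempt}")
                return "recovered"
            self.log.append(f"recovery attempt {attempt} failed")

        # Escalate: automated recovery is exhausted, so hand over to an operator.
        self.log.append("escalated to operator")
        return "escalated"


if __name__ == "__main__":
    # Simulated resource that is down; recovery fails once, then succeeds.
    outcomes = iter([False, True])
    dog = Watchdog(probe=lambda: False, recover=lambda: next(outcomes))
    print(dog.check_once())
    print(dog.log)
```

In this sketch, escalation is reached only after every automated recovery attempt has failed, which keeps manual intervention as the last resort rather than the first response.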

The SLA monitor keeps track of cloud consumer requests (1) and detects that a cloud service has failed (2).

The SLA monitor notifies the watchdog system (3), which restores the cloud service based on predefined policies (4).
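
The four numbered steps can be sketched as follows. This is an illustrative assumption rather than the pattern's prescribed implementation: the SlaMonitor and WatchdogSystem classes, the POLICIES table, the failure type names, and the consecutive-failure threshold are all hypothetical.

```python
from typing import Dict, List, Optional

# Hypothetical predefined policies: failure type -> recovery action name.
POLICIES: Dict[str, str] = {
    "timeout": "restart_service",
    "crash": "failover_to_standby",
}


class WatchdogSystem:
    """Hypothetical watchdog that selects a recovery action from policies."""

    def __init__(self, policies: Dict[str, str]) -> None:
        self.policies = policies
        self.actions_taken: List[str] = []

    def on_failure(self, service: str, failure_type: str) -> str:
        # (4) Restore the service based on predefined policies;
        # unknown failure types fall back to notifying an operator.
        action = self.policies.get(failure_type, "notify_operator")
        self.actions_taken.append(f"{service}:{action}")
        return action


class SlaMonitor:
    """Hypothetical SLA monitor that counts consecutive failed requests."""

    def __init__(self, watchdog: WatchdogSystem, failure_threshold: int = 3) -> None:
        self.watchdog = watchdog
        self.failure_threshold = failure_threshold
        self.consecutive_failures: Dict[str, int] = {}

    def record_request(
        self, service: str, ok: bool, failure_type: str = "timeout"
    ) -> Optional[str]:
        # (1) Keep track of each cloud consumer request outcome.
        if ok:
            self.consecutive_failures[service] = 0
            return None
        count = self.consecutive_failures.get(service, 0) + 1
        self.consecutive_failures[service] = count

        # (2) Treat the service as failed once the threshold is reached.
        if count >= self.failure_threshold:
            self.consecutive_failures[service] = 0
            # (3) Notify the watchdog system.
            return self.watchdog.on_failure(service, failure_type)
        return None


if __name__ == "__main__":
    watchdog = WatchdogSystem(POLICIES)
    monitor = SlaMonitor(watchdog, failure_threshold=2)
    monitor.record_request("billing", ok=False, failure_type="crash")
    print(monitor.record_request("billing", ok=False, failure_type="crash"))
```

Keeping the policies in a plain lookup table separates the decision about how to recover from the detection logic, so policies can be changed without touching the monitor.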

NIST Reference Architecture Mapping

This pattern relates to the highlighted parts of the NIST reference architecture, as follows:

Dynamic Failure Detection and Recovery: NIST Reference Architecture Mapping