Dynamic Failure Detection and Recovery
How can notification of, and recovery from, IT resource failures be automated?
Problem
When cloud-based IT resources fail, manual intervention may be unacceptably inefficient.
Solution
A watchdog system is established to monitor IT resource status and perform notifications and/or recovery attempts during failure conditions.
Application
Various intelligent monitoring and recovery technologies can be used to automate failure detection and recovery. They focus on five tasks: watching for IT resource failure conditions, deciding how to respond to them, acting on that decision, reporting the outcome, and escalating failures that remain unresolved.
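As a rough illustration of how the watching, deciding, acting, reporting and escalating tasks might fit together, the following minimal Python sketch models a single watchdog cycle. The `Watchdog` class, its `probe` and `recover` callbacks, and the `failure_threshold` parameter are all hypothetical names chosen for this example; the pattern itself does not prescribe any particular API or threshold.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Watchdog:
    """Hypothetical watchdog; every name here is an assumption for illustration."""
    probe: Callable[[], bool]      # watching: returns True while the resource is healthy
    recover: Callable[[], bool]    # acting: returns True if the recovery attempt worked
    failure_threshold: int = 3     # deciding: consecutive failed probes before acting
    log: List[str] = field(default_factory=list)  # reporting: simple in-memory record
    _failures: int = 0

    def tick(self) -> str:
        """Run one monitoring cycle and return the resulting status."""
        if self.probe():
            self._failures = 0
            return "healthy"

        self._failures += 1
        if self._failures < self.failure_threshold:
            # Not enough evidence yet; keep watching before acting.
            self.log.append(
                f"probe failed ({self._failures}/{self.failure_threshold})"
            )
            return "suspect"

        self.log.append("failure confirmed, attempting recovery")
        if self.recover():
            self._failures = 0
            self.log.append("recovery succeeded")
            return "recovered"

        # Escalating: automated recovery did not work, so hand over.
        self.log.append("recovery failed, escalating")
        return "escalated"
```

Calling `tick()` on a schedule would give a basic polling watchdog. Requiring several consecutive failed probes before acting is one possible way to avoid reacting to a single transient error.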
Mechanisms
Audit Monitor, Cloud Usage Monitor, Failover System, SLA Management System, SLA Monitor
Compound Patterns
Burst In, Burst Out to Private Cloud, Burst Out to Public Cloud, Cloud Authentication, Cloud Balancing, Elastic Environment, Infrastructure-as-a-Service (IaaS), Isolated Trust Boundary, Multitenant Environment, Platform-as-a-Service (PaaS), Private Cloud, Public Cloud, Resilient Environment, Resource Workload Management, Secure Burst Out to Private Cloud/Public Cloud, Software-as-a-Service (SaaS)
The intelligent watchdog monitor keeps track of cloud consumer requests (1) and detects that a cloud service has failed (2).
The intelligent watchdog monitor notifies the resilient watchdog system (3), which restores the cloud service based on predefined policies (4).
If a failure occurs, the active monitor refers to its predefined policies to recover the service step by step, escalating the process when the problem proves to be deeper than expected.
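The step-by-step recovery with escalation described above could be sketched as an ordered list of recovery actions, tried from least to most disruptive. This is only an illustrative sketch: the function `recover_with_escalation`, the `RecoveryStep` type, and the idea of expressing a policy as an ordered list are assumptions made for this example, not part of the pattern definition.

```python
from typing import Callable, List, Optional, Tuple

# A recovery step is a (name, action) pair; the action returns True if it ran
# without error. Both the type and the step names used with it are hypothetical.
RecoveryStep = Tuple[str, Callable[[], bool]]


def recover_with_escalation(
    steps: List[RecoveryStep],
    is_healthy: Callable[[], bool],
    attempted: Optional[List[str]] = None,
) -> Optional[str]:
    """Try each step in policy order until the service is healthy again.

    Returns the name of the step that restored the service, or None if every
    step was exhausted and the failure must be escalated to a human operator.
    """
    for name, action in steps:
        if attempted is not None:
            attempted.append(name)  # record what was tried, for reporting
        ran_ok = action()
        # A step only counts as successful if the service is actually healthy
        # afterwards; otherwise escalate to the next, more drastic step.
        if ran_ok and is_healthy():
            return name
    return None
```

A policy might, for instance, list a service restart first, a host reboot second, and a failover to a standby instance last. A `None` result signals that automated recovery has been exhausted and manual escalation is needed.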
NIST Reference Architecture Mapping
This pattern relates to the highlighted parts of the NIST reference architecture, as follows: