Dynamic Data Normalization (Erl, Naserpour)
How can redundant data within cloud storage devices be automatically avoided?
Problem
Cloud consumers may store large volumes of redundant data within cloud storage devices, thereby bloating the storage architecture and compromising data access performance.
Solution
Data received from cloud consumers is automatically normalized so that redundant data is avoided and cloud storage device capacity and performance are optimized.
Application
Data de-duplication technology is used to detect and eliminate redundant data at block or file-based levels.
Mechanisms
Cloud Storage Device
Compound Patterns
Burst In, Burst Out to Private Cloud, Burst Out to Public Cloud, Elastic Environment, Infrastructure-as-a-Service (IaaS), Multitenant Environment, Platform-as-a-Service (PaaS), Private Cloud, Public Cloud, Resilient Environment, Software-as-a-Service (SaaS)
Problem
Redundant data can cause a range of issues in cloud environments, such as:
- Increased time required to store and catalog files
- Increased required storage and backup space
- Increased costs due to increased data volume
- Increased time required for replication to secondary storage
- Increased time required to back up data
For example, a cloud consumer copies 100 MB of files onto a cloud storage device. If it stores the same data ten times over, the consequences can be considerable:
- The cloud consumer will be charged for using 1,000 MB (1 GB) of storage space even though it is only storing 100 MB of unique data.
- The cloud provider needs to provide an unnecessary 900 MB of space on both the online cloud storage device and any backup storage systems (such as tape drives).
- Storing and cataloging the data takes roughly ten times as long as it would for a single copy.
- If the cloud provider is performing a site recovery, the data replication duration and performance will suffer, since 1,000 MB need to be replicated instead of 100 MB.
In multitenant public clouds, these impacts can be significantly amplified.
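The arithmetic behind the example above can be checked with a short sketch. The function name `redundancy_cost` and the keys it returns are hypothetical, chosen only for illustration.

```python
def redundancy_cost(unique_mb: int, copies: int) -> dict:
    """Return storage figures for `copies` identical copies of `unique_mb` of data."""
    stored_mb = unique_mb * copies       # what the consumer is billed for
    wasted_mb = stored_mb - unique_mb    # space that de-duplication could reclaim
    return {
        "stored_mb": stored_mb,
        "wasted_mb": wasted_mb,
        "amplification": stored_mb / unique_mb,  # also scales replication volume
    }

cost = redundancy_cost(unique_mb=100, copies=10)
print(cost)  # {'stored_mb': 1000, 'wasted_mb': 900, 'amplification': 10.0}
```

The same amplification factor applies to any step that is proportional to data volume, such as replication or backup.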
Solution
A data de-duplication system is established to prevent cloud consumers from inadvertently saving redundant copies of data. This system detects and eliminates exact duplicates of data on cloud storage devices, and can be applied to both block-based and file-based storage devices (although it works most effectively on the former). The data de-duplication system checks each block it receives to determine whether it duplicates a block that has already been received. Redundant blocks are replaced with pointers to the equivalent blocks that are already stored.
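One plausible reason block-level de-duplication tends to be more effective is that files which differ only slightly still share most of their blocks. The sketch below compares the two granularities on three tiny in-memory "files". The 4-byte `BLOCK_SIZE`, the use of SHA-256, and all function names are assumptions made for illustration and do not describe any particular product.

```python
import hashlib

BLOCK_SIZE = 4  # hypothetical, unrealistically small so the example is easy to follow

def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def file_level_unique_bytes(files: list) -> int:
    """Bytes stored if only whole-file duplicates are eliminated."""
    seen = {}
    for content in files:
        seen.setdefault(digest(content), len(content))
    return sum(seen.values())

def block_level_unique_bytes(files: list, block_size: int = BLOCK_SIZE) -> int:
    """Bytes stored if duplicate fixed-size blocks are eliminated."""
    seen = {}
    for content in files:
        for offset in range(0, len(content), block_size):
            block = content[offset:offset + block_size]
            seen.setdefault(digest(block), len(block))
    return sum(seen.values())

files = [
    b"AAAABBBBCCCC",  # original file
    b"AAAABBBBDDDD",  # same file with the last block changed
    b"AAAABBBBCCCC",  # exact copy of the original
]

raw_bytes = sum(len(f) for f in files)        # 36
file_level = file_level_unique_bytes(files)   # 24: only the exact copy is removed
block_level = block_level_unique_bytes(files) # 16: AAAA, BBBB, CCCC, DDDD stored once
print(raw_bytes, file_level, block_level)
```

File-level comparison treats the edited file as entirely new, while block-level comparison stores only the one block that actually changed.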
Application
A de-duplication system examines received data prior to passing it to storage controllers. As part of the examination process, it assigns a hash code to every piece of data that has been processed and stored. It also maintains an index that maps each hash to its stored piece of data. When a new block of data is received, its generated hash is compared with the stored hashes to decide whether it is a new or duplicate block of data.
If it is a new block, it is saved. If the data is a duplicate, it is eliminated and a link (or pointer) to the original data block is created and saved in the cloud storage device. If a request for the data block is received at a later point, the pointer forwards the request to the original data block.
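The hash-index-and-pointer flow described above can be sketched as a small in-memory store. This is a minimal illustration under stated assumptions, not a definitive implementation: the class name `DedupStore`, its methods, the string block identifiers, and the choice of SHA-256 are all hypothetical. For uniformity, this sketch records a pointer for every logical block, including new ones.

```python
import hashlib

class DedupStore:
    """Toy de-duplicating block store (illustrative only)."""

    def __init__(self):
        self._blocks = {}    # hash -> block bytes (the index of hashes and pieces)
        self._pointers = {}  # logical block id -> hash of the stored block

    def write(self, block_id: str, data: bytes) -> bool:
        """Store a block; return True if it was new, False if it was a duplicate."""
        block_hash = hashlib.sha256(data).hexdigest()
        is_new = block_hash not in self._blocks
        if is_new:
            self._blocks[block_hash] = data      # new block: save the data itself
        self._pointers[block_id] = block_hash    # duplicate: only a pointer is kept
        return is_new

    def read(self, block_id: str) -> bytes:
        """Follow the pointer to the original data block."""
        return self._blocks[self._pointers[block_id]]

    def physical_bytes(self) -> int:
        return sum(len(block) for block in self._blocks.values())

store = DedupStore()
first = store.write("consumer-a/file1/0", b"hello")   # True: new block is saved
second = store.write("consumer-b/file9/0", b"hello")  # False: duplicate, pointer only
print(first, second, store.read("consumer-b/file9/0"), store.physical_bytes())
```

Both logical blocks remain readable, but only five bytes of block data are physically held.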
Figure 1 - In Part A, data sets containing redundant data unnecessarily bloat data storage. The Dynamic Data Normalization pattern results in the constant and automatic streamlining of data as shown in Part B, regardless of how denormalized the data received from the cloud consumer is.
This pattern can be applied to both disk storage and backup tape drives. Some cloud providers may decide to prevent redundant data only on backup cloud storage devices, while others may more aggressively implement the data de-duplication system on all cloud storage devices.
There are different methods and algorithms for comparing blocks of data and deciding whether they are duplications of other blocks.
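As one illustration of how comparison methods can differ, the sketch below contrasts a hash-only comparison with a comparison that also verifies the bytes when the hashes match. The function names are hypothetical, and `weak_checksum` is a deliberately poor hash introduced here only to make a collision easy to show; it is not something the pattern prescribes.

```python
import hashlib

def weak_checksum(block: bytes) -> int:
    """Deliberately weak hash: byte sum modulo 256 (collides easily)."""
    return sum(block) % 256

def strong_hash(block: bytes) -> bytes:
    return hashlib.sha256(block).digest()

def is_duplicate_hash_only(new: bytes, stored: bytes, hash_fn) -> bool:
    """Treat blocks as duplicates whenever their hashes match."""
    return hash_fn(new) == hash_fn(stored)

def is_duplicate_verified(new: bytes, stored: bytes, hash_fn) -> bool:
    """Use the hash as a fast filter, then confirm with a byte-for-byte comparison."""
    return hash_fn(new) == hash_fn(stored) and new == stored

a, b = b"ab", b"ba"  # different blocks with the same byte sum

print(is_duplicate_hash_only(a, b, weak_checksum))  # True: a false positive
print(is_duplicate_verified(a, b, weak_checksum))   # False: verification catches it
print(is_duplicate_hash_only(a, b, strong_hash))    # False
```

The trade-off is that byte-level verification costs an extra read of the stored block, whereas a hash-only check relies entirely on the strength of the hash function.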
NIST Reference Architecture Mapping
This pattern relates to the highlighted parts of the NIST reference architecture, as follows:
This pattern is covered in CCP Module 5: Advanced Cloud Architecture.
For more information regarding the Cloud Certified Professional (CCP) curriculum, visit www.cloudschool.com.
Arcitura IT Certified Professionals (AITCP)
