Deduplication, the process of eliminating redundant data segments across files, brings value to all parts of the data center. It allows backup targets to approach the price of tape libraries and permits all-flash arrays to compete favorably with hard disk-based systems. The savings show up not only in capacity but also in performance, because redundant writes are eliminated. Deduplication can add even more value in hyper-converged architectures.
Most hyper-converged architectures are hybrid, meaning they use both flash and hard disk-based storage and transparently move data between those tiers. Deduplicating the flash tier delivers the most return on an organization's deduplication investment because flash has a higher cost per gigabyte than hard disk. As a result, many vendors decide not to spend the compute resources required to deduplicate the hard-disk tier. The hard-disk tier is also slower, so it needs a more efficient deduplication process to avoid degrading performance, and building that process takes extra development resources. But if that investment is made, there is a payoff. While squeezing additional capacity out of the hard-disk tier does not deliver the dollar-per-gigabyte savings that flash does, it can help in the following ways.
Compute inefficiency: Hyper-converged systems scale capacity by adding nodes to the cluster. Each additional node typically provides a set amount of flash, hard-disk storage and additional compute resources. In most data centers, demand for capacity grows faster than demand for compute. Organizations therefore add nodes to get the storage they need and sacrifice compute efficiency in the process.
Using a deduplication process helps resolve, or at least limit, the compute inefficiency problem by enabling the architecture to densely pack data on both the flash and hard-disk tiers. This density means IT does not need to add nodes as quickly to keep up with capacity demands, so the cluster may not end up with excess compute resources as quickly. There is also the physical advantage of not having to take up as much data center floor space. While storage may be cheap, new data centers are not.
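The effect on node count can be illustrated with simple arithmetic. The sketch below is only an illustration: the function name `nodes_needed`, the 200 TB data set, the 20 TB of usable capacity per node and the 4:1 deduplication ratio are all hypothetical figures chosen for the example, not numbers from any vendor.

```python
import math

def nodes_needed(logical_tb, usable_tb_per_node, dedup_ratio=1.0):
    """Return how many nodes are required to hold `logical_tb` of data.

    `dedup_ratio` is logical size divided by physical size (4.0 means 4:1).
    All parameters are hypothetical inputs for illustration only.
    """
    if dedup_ratio < 1.0:
        raise ValueError("dedup_ratio must be at least 1.0")
    physical_tb = logical_tb / dedup_ratio
    # A partial node cannot be bought, so round up.
    return math.ceil(physical_tb / usable_tb_per_node)

# Assumed example: 200 TB of data, 20 TB usable per node.
without_dedup = nodes_needed(200, 20)        # no deduplication
with_dedup = nodes_needed(200, 20, 4.0)      # assumed 4:1 ratio
print(without_dedup, with_dedup)
```

Under these assumed figures, the cluster needs 10 nodes without deduplication and 3 with it. The seven nodes avoided are also seven nodes' worth of compute that would have been bought only for the storage attached to it.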
Network efficiency: Hyper-converged architectures are busy networks of nodes. The architecture writes new data in segments, and each segment goes to a specific node. Inline deduplication identifies redundant data before it is sent across the cluster, so segments the cluster already holds never cross the network. The reduction in network traffic is roughly proportional to the deduplication ratio.
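A minimal sketch of the idea follows, assuming fixed-size segments fingerprinted with SHA-256. Real products differ in segment size, hashing scheme and how segments are placed on nodes; the `Cluster` class, `SEGMENT_SIZE` and the `bytes_sent` counter are hypothetical names used only to show that duplicate segments are detected before any data is transferred.

```python
import hashlib

SEGMENT_SIZE = 4096  # hypothetical fixed segment size, in bytes

class Cluster:
    """Toy stand-in for a cluster that stores each unique segment once."""

    def __init__(self):
        self.store = {}       # fingerprint -> segment bytes
        self.bytes_sent = 0   # bytes that actually crossed the "network"

    def write(self, data):
        """Send only unseen segments; return the list of fingerprints."""
        recipe = []
        for offset in range(0, len(data), SEGMENT_SIZE):
            segment = data[offset:offset + SEGMENT_SIZE]
            fingerprint = hashlib.sha256(segment).hexdigest()
            if fingerprint not in self.store:
                # Only a segment the cluster has never seen is transferred.
                self.store[fingerprint] = segment
                self.bytes_sent += len(segment)
            recipe.append(fingerprint)
        return recipe

    def read(self, recipe):
        """Rebuild the original data from its fingerprint list."""
        return b"".join(self.store[fp] for fp in recipe)

cluster = Cluster()
data = b"A" * SEGMENT_SIZE * 3 + b"B" * SEGMENT_SIZE  # 4 segments, 2 unique
recipe = cluster.write(data)
print(len(data), cluster.bytes_sent)
```

In this example 16,384 bytes are written but only 8,192 bytes are transferred, a 2:1 reduction that matches the 2:1 deduplication ratio of the sample data.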
Data protection: When implemented correctly, the deduplication process can enable the hyper-converged architecture to meet much of an organization's data protection requirements. In almost any cluster architecture, making a secondary copy of data within the cluster provides a relatively safe way to protect that data. Without deduplication, capacity consumption grows every time a copy is made. With deduplication, the copy is likely 100% identical to the original, so only the deduplication metadata is updated and no data is actually written.
This is protection within the same system, so organizations still need to make an external copy locally and a copy for disaster recovery. The vendor should also protect the deduplication metadata table very carefully, because losing that table can mean losing all the data.
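The metadata-only copy can be sketched as follows. This is a simplified model, not any vendor's implementation: the `DedupStore` class, its reference counts and the in-memory `files` table are assumptions made for illustration, with `files` standing in for the deduplication metadata table.

```python
import hashlib

class DedupStore:
    """Toy deduplicated store: unique segments plus a metadata table."""

    def __init__(self, segment_size=4096):
        self.segment_size = segment_size
        self.segments = {}   # fingerprint -> segment bytes (physical data)
        self.refcount = {}   # fingerprint -> number of references
        self.files = {}      # name -> list of fingerprints (metadata table)

    def write(self, name, data):
        if name in self.files:
            raise ValueError("overwrite is not modeled in this sketch")
        recipe = []
        for offset in range(0, len(data), self.segment_size):
            segment = data[offset:offset + self.segment_size]
            fp = hashlib.sha256(segment).hexdigest()
            if fp not in self.segments:
                self.segments[fp] = segment  # physical write happens here only
            self.refcount[fp] = self.refcount.get(fp, 0) + 1
            recipe.append(fp)
        self.files[name] = recipe

    def copy(self, src, dst):
        """Copy a file by updating metadata only; no segment is written."""
        if dst in self.files:
            raise ValueError("overwrite is not modeled in this sketch")
        recipe = list(self.files[src])
        for fp in recipe:
            self.refcount[fp] += 1
        self.files[dst] = recipe

    def read(self, name):
        return b"".join(self.segments[fp] for fp in self.files[name])

    def physical_bytes(self):
        return sum(len(segment) for segment in self.segments.values())

store = DedupStore()
original = b"".join(bytes([i]) * 4096 for i in range(4))  # 4 unique segments
store.write("vm.img", original)
before = store.physical_bytes()
store.copy("vm.img", "vm-copy.img")
print(before, store.physical_bytes())
```

Physical capacity is the same before and after the copy, because only the `files` table and the reference counts change. The sketch also shows why the metadata needs protecting: if `files` were lost, the segments would still be on disk but nothing would record how to reassemble them.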
As the cost per gigabyte of hard-disk storage, and especially of flash-based storage, continues to decline, IT may regard deduplication as an unnecessary technology whose expense may not equal its potential payoff. But dedupe has other benefits, such as slowing the growth in the number of cluster nodes and improving storage media and network performance through write elimination. Some vendors have even gone so far as to integrate data protection into their hyper-converged architectures by leveraging deduplication to make data protection nearly cost-free. Given these capabilities, deduplication is more valuable to hyper-converged architectures than ever.