What is data deduplication?
Data deduplication is a data compression technique in which redundant copies of data are removed from a system. It is used in both data backup and network data schemes, and enables the storage of a single unique instance of data within a database or broader information system. Data deduplication is also known as intelligent compression, single-instance storage, commonality factoring or data reduction.
Data deduplication works by examining incoming data pieces and comparing them with data that is already stored. If a piece of data is already present, the deduplication algorithm discards the new copy and replaces it with a reference to the data already in place.
For example, when an existing file is backed up again with some changes, the previous file plus the applied changes are added to the stored data. If there is no difference, the newer file is discarded and a reference to the existing copy is created.
Data deduplication is one technology that storage vendors rely on to make better use of storage space; the other is compression. These features are usually grouped into a larger category called data reduction. Both serve the same goal: increased storage efficiency. With proper deduplication techniques, businesses can effectively store more data than their raw storage capacity might suggest. For example, a business with 15 TB of storage that achieves a 4:1 reduction ratio through deduplication and compression could store 60 TB of logical data on that 15 TB array.
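The reduction-ratio arithmetic above is simple enough to express directly. This is a minimal sketch (the function name and units are my own, not a vendor's API):

```python
def effective_capacity(raw_tb: float, reduction_ratio: float) -> float:
    """Logical data that fits on a given raw capacity at a given
    data reduction ratio (e.g. 4.0 for a 4:1 ratio)."""
    return raw_tb * reduction_ratio

# The example from the text: 15 TB of raw storage at a 4:1 ratio.
print(effective_capacity(15, 4.0))  # → 60.0 TB of logical data
```

Real-world ratios vary widely with the data: virtual machine images and backups dedupe well, while already-compressed or encrypted data may see little benefit.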
Data deduplication case study
Consider this scenario as a practical example of deduplication's benefit: an organization runs a virtual desktop environment with hundreds of nearly identical workstations, all stored on an expensive storage array purchased specifically to support them. The organization is running hundreds of copies of Windows 8, Office 2013, ERP software and any other tools that users might require. Each individual workstation image consumes, say, 25 GB of disk space. With just 200 such workstations, these images alone would consume 5 TB of capacity.
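The capacity math in the scenario above, spelled out (using decimal units, where 1 TB = 1,000 GB):

```python
# 200 workstation images, 25 GB each, with no deduplication.
workstations = 200
image_gb = 25

total_gb = workstations * image_gb
print(total_gb)         # 5000 GB consumed
print(total_gb / 1000)  # 5.0 TB, matching the scenario
```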
With deduplication, just one copy of the data common to these virtual machines needs to be stored. Every time the engine discovers a piece of data that is already stored elsewhere in the storage environment, the storage system saves a small pointer in the duplicate's place, freeing up the blocks that the copy would normally occupy.
Data deduplication types
As you might expect, different vendors handle deduplication in different ways. In fact, there are two primary deduplication techniques that deserve discussion:
Inline deduplication occurs the moment data is written to storage. While the data is in flight, the deduplication engine fingerprints it. This process, while effective, does create computing overhead: the system has to fingerprint each incoming piece of data and then swiftly determine whether that fingerprint matches something already in the system. If it does, a pointer to the existing data is written in its place; if it doesn't, the block is saved unchanged. Inline deduplication is a major feature of many storage devices and, while it does introduce overhead, the benefits generally far outweigh the costs.
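The fingerprint-on-write flow above can be sketched in a few lines. This is a toy illustration, not any vendor's implementation; the class and method names are hypothetical, and SHA-256 stands in for whatever fingerprinting scheme a real engine uses:

```python
import hashlib

class InlineDedupStore:
    """Toy inline deduplication: each block is fingerprinted as it is
    written; duplicates become pointers to the already-stored block."""

    def __init__(self):
        self.blocks = {}  # fingerprint -> unique block contents
        self.index = []   # logical layout: one fingerprint per write

    def write(self, block: bytes) -> bool:
        """Returns True if the block was new, False if deduplicated."""
        fp = hashlib.sha256(block).hexdigest()
        is_new = fp not in self.blocks
        if is_new:
            self.blocks[fp] = block  # first copy: store the data
        self.index.append(fp)        # always record only a pointer
        return is_new

store = InlineDedupStore()
results = [store.write(b) for b in (b"A", b"B", b"A", b"A")]
print(results)            # [True, True, False, False]
print(len(store.blocks))  # 2 unique blocks kept for 4 logical writes
```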
Post-process deduplication, also known as asynchronous deduplication, occurs after data has been written in full. At regular intervals, the deduplication system scans the new data, fingerprints it, removes duplicate copies, and replaces them with pointers to the original copy.
Post-process deduplication lets businesses use data reduction without the constant processing overhead that inline deduplication imposes, and it lets them schedule deduplication so that it happens during off hours.
The largest downside of post-process deduplication is that all data is initially stored in its complete form (often called fully hydrated), so it requires all the space that non-deduplicated data needs. Capacity is reclaimed only after the scheduled deduplication pass runs. Businesses using post-process dedupe therefore need to maintain extra storage headroom at all times.
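A post-process pass can be sketched the same way, except the scan runs over data already sitting on disk in full. Again a toy illustration with hypothetical names; a real system would work on disk blocks, not an in-memory list:

```python
import hashlib

def post_process_dedup(volume):
    """Sketch of a scheduled post-process pass: the volume is fully
    hydrated; this job fingerprints every block and collapses
    duplicates into pointers to a single stored copy."""
    blocks, layout = {}, []
    for block in volume:
        fp = hashlib.sha256(block).hexdigest()
        blocks.setdefault(fp, block)  # keep only the first copy
        layout.append(fp)             # pointer preserves logical order
    return blocks, layout

# Fully hydrated data written during the day...
volume = [b"os", b"app", b"os", b"os", b"app"]
# ...deduplicated by the scheduled off-hours job.
blocks, layout = post_process_dedup(volume)
print(len(volume), "logical blocks ->", len(blocks), "stored")
```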
Client-side data deduplication is a data deduplication technique that is used, for example, on a backup-archive client to remove redundant data during backup and archive processing before the data is transferred to the server. Using client-side data deduplication can reduce the amount of data that is sent over a local area network.
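The network saving from client-side deduplication can be illustrated with a sketch in which the client fingerprints its blocks and transfers only those the server does not already hold. This is a hypothetical protocol for illustration only, not the behavior of any particular backup-archive client:

```python
import hashlib

def client_side_backup(local_blocks, server_fingerprints):
    """Sketch of client-side dedup: send only blocks whose
    fingerprints the server doesn't already know, so duplicate
    data never crosses the LAN."""
    to_send = []
    for block in local_blocks:
        fp = hashlib.sha256(block).hexdigest()
        if fp not in server_fingerprints:
            server_fingerprints.add(fp)
            to_send.append(block)  # new data: transfer it
        # else: server already has this block; send only the fingerprint
    return to_send

# Server already holds the OS image from a previous backup.
server = {hashlib.sha256(b"os-image").hexdigest()}
sent = client_side_backup([b"os-image", b"report.doc", b"os-image"], server)
print(len(sent))  # only 1 block actually transferred
```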
Hardware-based deduplication versus software-based deduplication
Purpose-built deduplication appliances lower the processing burden associated with software-based products. Hardware-based deduplication can also be built into other forms of data protection hardware, such as backup appliances, virtual tape libraries (VTLs) or NAS storage.
While software-based deduplication can eliminate redundancy at its source, hardware-based methods perform data reduction at the storage target. As a result, hardware-based deduplication won't bring the bandwidth savings obtained by deduplicating at the source, but this drawback is offset by faster processing.
Hardware-based data deduplication brings high performance, scalability and relatively nondisruptive deployment. It is best suited to enterprise-class deployments rather than SME or remote office applications.
Software-based deduplication is generally less costly to run and doesn't require significant changes to a business's physical network infrastructure. However, it can be more difficult to install and maintain: agents have to be installed to allow communication between the local site and a backup server running the same software.