Data
deduplication -- often called intelligent compression or single-instance
storage -- is a process that eliminates redundant copies of data and reduces
storage overhead. Data deduplication techniques ensure that only one unique
instance of data is retained on storage media, such as disk, flash or tape.
Redundant data blocks are replaced with a pointer to the unique data copy. In
that way, data deduplication closely aligns with incremental backup, which
copies only the data that has changed since the previous backup.
For
example, a typical email system might contain 100 instances of the same 1
megabyte (MB) file attachment. If the email platform is backed up or archived,
all 100 instances are saved, requiring 100 MB of storage space. With data
deduplication, only one instance of the attachment is stored; each subsequent
instance is referenced back to the one saved copy. In this example, a 100 MB
storage demand drops to 1 MB.
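To make the mechanism concrete, here is a minimal sketch of block-level deduplication in Python. The class name DedupStore, the fixed 4 KB chunk size and the SHA-256 fingerprints are illustrative assumptions, not any vendor's implementation; real products typically use variable-size chunking and persistent indexes.

```python
import hashlib

CHUNK_SIZE = 4096  # fixed-size chunks for simplicity; many products use variable-size chunking

class DedupStore:
    """Toy content-addressed store: each unique chunk is kept exactly once."""

    def __init__(self):
        self.chunks = {}   # fingerprint -> chunk bytes (the single stored copy)
        self.files = {}    # file name -> list of fingerprints (the "pointers")

    def write(self, name, data):
        pointers = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            fp = hashlib.sha256(chunk).hexdigest()
            # Store the chunk only if this fingerprint has not been seen before.
            self.chunks.setdefault(fp, chunk)
            pointers.append(fp)
        self.files[name] = pointers

    def read(self, name):
        # Reassemble the file by following the pointers back to the unique chunks.
        return b"".join(self.chunks[fp] for fp in self.files[name])

    def stored_bytes(self):
        return sum(len(c) for c in self.chunks.values())


# The email example from the text: 100 copies of the same 1 MB attachment.
store = DedupStore()
attachment = b"x" * (1024 * 1024)
for n in range(100):
    store.write(f"mailbox-{n}/attachment", attachment)

print(store.stored_bytes())  # about 1 MB of unique data retained, not 100 MB
```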
Target vs. source deduplication
Data
deduplication can occur at the source or target level.
Source-based
dedupe removes redundant blocks at the client or server level before transmitting
data to a backup target, and it requires no additional hardware.
Deduplicating at the source reduces both bandwidth and storage use.
In
target-based dedupe, backups are transmitted across a network to disk-based
hardware in a remote location. Using deduplication targets increases costs,
although it generally provides a performance advantage compared to source
dedupe, particularly for petabyte-scale data sets.
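The bandwidth saving of source-based dedupe comes from checking fingerprints before data leaves the client. The sketch below illustrates that idea under simplifying assumptions: target_index stands in for the set of fingerprints already held by the backup target, which in a real product would be a network lookup against the backup server.

```python
import hashlib

def fingerprint(chunk):
    return hashlib.sha256(chunk).hexdigest()

def source_side_backup(chunks, target_index):
    """Transmit only chunks whose fingerprints the backup target does not hold yet."""
    sent = []
    for chunk in chunks:
        fp = fingerprint(chunk)
        if fp not in target_index:
            sent.append((fp, chunk))   # new data: send it over the network
            target_index.add(fp)
        # else: duplicate block; only the fingerprint/pointer needs to be recorded
    return sent

# Two nearly identical backup runs: the second transmits almost nothing.
target_index = set()
day1 = [b"block-%d" % i for i in range(1000)]
day2 = day1[:990] + [b"changed-%d" % i for i in range(10)]

print(len(source_side_backup(day1, target_index)))  # 1000 chunks sent
print(len(source_side_backup(day2, target_index)))  # only the 10 changed chunks sent
```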
Techniques to deduplicate data
There
are two main methods used to deduplicate redundant data: inline and
post-processing deduplication. Your backup environment will dictate which
method you use.
Inline
deduplication analyzes data as it is ingested into a backup system. Redundancies
are removed as the data is written to backup storage. Inline dedupe requires
less backup storage, but it can create processing bottlenecks. Some storage array
vendors recommend that their inline data deduplication tools be turned off for
high-performance primary storage.
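A rough sketch of the inline path, using hypothetical names (inline_ingest, store, index): the duplicate check happens before the write, so redundant blocks never land on backup storage, at the cost of doing the fingerprint lookup in the ingest path.

```python
import hashlib

def inline_ingest(blocks, store, index):
    """Inline dedupe: check each block's fingerprint before writing it."""
    bytes_written = 0
    for block in blocks:
        fp = hashlib.sha256(block).hexdigest()
        if fp not in index:        # only previously unseen blocks are written
            index.add(fp)
            store.append(block)
            bytes_written += len(block)
    return bytes_written

store, index = [], set()
backup = [b"A" * 4096, b"B" * 4096, b"A" * 4096, b"A" * 4096]
print(inline_ingest(backup, store, index))  # 8192: only the two unique 4 KB blocks hit storage
```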
Post-processing
dedupe is an asynchronous backup process that removes redundant data after it
is written to storage. Duplicate data is removed and replaced with a pointer to
the first iteration of the block. The post-processing approach gives users the
flexibility to dedupe specific workloads and to quickly recover the most recent
backup without the need to rehydrate data. The trade-off is that more backup
storage capacity is required than with inline deduplication.
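By contrast, a post-processing pass works on blocks that have already been written at full size; a later background job finds repeats and replaces them with pointers to the first copy. The function name post_process_dedupe and the in-memory layout below are illustrative assumptions only.

```python
import hashlib

def post_process_dedupe(stored_blocks):
    """Background pass: replace already-written duplicate blocks with pointers."""
    unique = {}     # fingerprint -> index of the first occurrence
    layout = []     # per-block: either the block itself or a pointer to an earlier one
    reclaimed = 0
    for i, block in enumerate(stored_blocks):
        fp = hashlib.sha256(block).hexdigest()
        if fp in unique:
            layout.append(("pointer", unique[fp]))  # duplicate: keep only a pointer
            reclaimed += len(block)
        else:
            unique[fp] = i
            layout.append(("block", block))
    return layout, reclaimed

blocks = [b"A" * 4096, b"B" * 4096, b"A" * 4096, b"A" * 4096]
layout, reclaimed = post_process_dedupe(blocks)
print(reclaimed)  # 8192 bytes reclaimed after the backup completes
```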