
Data deduplication


Data deduplication, often called intelligent compression or single-instance storage, is a process that eliminates redundant copies of data and reduces storage overhead. Data deduplication techniques ensure that only one unique instance of data is retained on storage media, such as disk, flash or tape. Redundant data blocks are replaced with a pointer to the unique data copy. In that respect, data deduplication closely aligns with incremental backup, which copies only the data that has changed since the previous backup.
For example, a typical email system might contain 100 instances of the same 1 megabyte (MB) file attachment. If the email platform is backed up or archived, all 100 instances are saved, requiring 100 MB of storage space. With data deduplication, only one instance of the attachment is stored; each subsequent instance is referenced back to the one saved copy. In this example, a 100 MB storage demand drops to 1 MB.
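The arithmetic in the email example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `SingleInstanceStore` is a hypothetical in-memory class, not any product's API, and it identifies duplicates by SHA-256 digest, which is one common choice rather than something the example above specifies.

```python
import hashlib


class SingleInstanceStore:
    """Hypothetical file-level dedupe store: one stored copy per unique content."""

    def __init__(self):
        self.blobs = {}      # digest -> bytes (the single retained copy)
        self.pointers = []   # one digest per saved item, acting as the pointer

    def save(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        # Keep the payload only the first time this content is seen.
        self.blobs.setdefault(digest, data)
        self.pointers.append(digest)
        return digest

    def stored_bytes(self) -> int:
        return sum(len(blob) for blob in self.blobs.values())


attachment = b"x" * 1_000_000   # stand-in for the 1 MB attachment
store = SingleInstanceStore()
for _ in range(100):            # 100 mailboxes hold the same file
    store.save(attachment)

logical_bytes = len(store.pointers) * len(attachment)   # what a plain backup would write
physical_bytes = store.stored_bytes()                    # what the dedupe store keeps
print(logical_bytes, physical_bytes)
```

All 100 saves leave 100 pointers but a single stored blob, so the physical footprint stays at the size of one attachment.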

Target vs. source deduplication

Data deduplication can occur at the source or target level.
Source-based dedupe removes redundant blocks at the client or server level, before the data is transmitted to a backup target. No additional hardware is required. Deduplicating at the source reduces both bandwidth and storage use.
In target-based dedupe, backups are transmitted across a network to disk-based hardware in a remote location. Using deduplication targets increases costs, although it generally provides a performance advantage compared to source dedupe, particularly for petabyte-scale data sets.
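To make the bandwidth saving of source-based dedupe concrete, here is a minimal sketch. The names `BackupTarget`, `missing`, `put` and `source_side_backup` are hypothetical, the 4-byte fixed block size is chosen only to keep the example readable, and real products typically use much larger or variable-size blocks.

```python
import hashlib

BLOCK_SIZE = 4  # unrealistically small, for illustration only


class BackupTarget:
    """Hypothetical backup target that stores blocks keyed by digest."""

    def __init__(self):
        self.blocks = {}

    def missing(self, digests):
        # Report which digests the target does not hold yet.
        return {d for d in digests if d not in self.blocks}

    def put(self, digest, block):
        self.blocks[digest] = block


def split_blocks(data, size=BLOCK_SIZE):
    return [data[i:i + size] for i in range(0, len(data), size)]


def source_side_backup(data, target):
    """Hash blocks on the client and send only those the target lacks."""
    blocks = split_blocks(data)
    digests = [hashlib.sha256(b).hexdigest() for b in blocks]
    needed = target.missing(digests)
    bytes_sent = 0
    for digest, block in zip(digests, blocks):
        if digest in needed:
            target.put(digest, block)
            needed.discard(digest)   # do not resend a repeat within this backup
            bytes_sent += len(block)
    return digests, bytes_sent       # the digest list is the restore recipe


target = BackupTarget()
recipe1, sent1 = source_side_backup(b"AAAABBBBAAAACCCC", target)
recipe2, sent2 = source_side_backup(b"AAAABBBBDDDD", target)
restored = b"".join(target.blocks[d] for d in recipe1)
print(sent1, sent2, restored)
```

In this sketch the first backup transmits three unique blocks out of four, and the second transmits only the one block the target has not seen. A target-based design would instead send every block across the network and let the target hardware do the hashing.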

Techniques to deduplicate data

There are two main methods used to deduplicate redundant data: inline and post-processing deduplication. Your backup environment will dictate which method you use.
Inline deduplication analyzes data as it is ingested into a backup system. Redundancies are removed as the data is written to backup storage. Inline dedupe requires less backup storage, but the extra processing in the write path can cause bottlenecks. Storage array vendors recommend that their inline data deduplication tools be turned off for high-performance primary storage.
Post-processing dedupe is an asynchronous backup process that removes redundant data after it is written to storage. Duplicate data is removed and replaced with a pointer to the first iteration of the block. The post-processing approach gives users the flexibility to dedupe specific workloads and to quickly recover the most recent backup without rehydration, that is, without first reassembling deduplicated blocks into their original form. The trade-off is that it needs more backup storage capacity than inline deduplication.
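The capacity trade-off between the two methods can be illustrated with a small sketch. It assumes fixed blocks and SHA-256 digests, and the function names are hypothetical. Both functions end with the same deduplicated store; they differ only in how much storage is occupied at the peak.

```python
import hashlib


def digest(block: bytes) -> str:
    return hashlib.sha256(block).hexdigest()


def inline_dedupe(blocks):
    """Dedupe while writing: a duplicate block never lands on storage."""
    store, recipe, peak = {}, [], 0
    for block in blocks:
        d = digest(block)
        if d not in store:
            store[d] = block
        recipe.append(d)
        peak = max(peak, sum(len(b) for b in store.values()))
    return store, recipe, peak


def post_process_dedupe(blocks):
    """Write everything first, then dedupe in a later pass."""
    landing_area = list(blocks)                 # full, undeduplicated copy
    peak = sum(len(b) for b in landing_area)    # capacity needed before the pass runs
    store, recipe = {}, []
    for block in landing_area:
        d = digest(block)
        store.setdefault(d, block)              # keep the first iteration of the block
        recipe.append(d)
    return store, recipe, peak


blocks = [b"AAAA", b"BBBB", b"AAAA", b"AAAA"]
inline_store, inline_recipe, inline_peak = inline_dedupe(blocks)
post_store, post_recipe, post_peak = post_process_dedupe(blocks)
print(inline_peak, post_peak)
```

Here the inline path never holds more than the two unique blocks, while the post-processing path must first hold all four before the duplicates are removed.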

