Skip to main content

Data deduplication


Data deduplication -- often called intelligent compression or single-instance storage -- is a process that eliminates redundant copies of data and reduces storage overhead. Data deduplication techniques ensure that only one unique instance of data is retained on storage media, such as disk, flash or tape. Redundant data blocks are replaced with a pointer to the unique data copy. In that way, data deduplication closely aligns with incremental backup, which copies only the data that has changed since the previous backup.
For example, a typical email system might contain 100 instances of the same 1 megabyte (MB) file attachment. If the email platform is backed up or archived, all 100 instances are saved, requiring 100 MB of storage space. With data deduplication, only one instance of the attachment is stored; each subsequent instance is referenced back to the one saved copy. In this example, a 100 MB storage demand drops to 1 MB.

Target vs. source deduplication

Data deduplication can occur at the source or target level.
Source-based dedupe removes redundant blocks before transmitting data to a backup target at the client or server level. There is no additional hardware required. Deduplicating at the source reduces bandwidth and storage use.
In target-based dedupe, backups are transmitted across a network to disk-based hardware in a remote location. Using deduplication targets increases costs, although it generally provides a performance advantage compared to source dedupe, particularly for petabyte-scale data sets.

Techniques to deduplicate data

There are two main methods used to deduplicate redundant data: inline and post-processing deduplication. Your backup environment will dictate which method you use.
Inline deduplication analyzes data as it is ingested in a backup system. Redundancies are removed as the data is written to backup storage. Inline dedupe requires less backup storage, but can cause bottlenecks. Storage array vendors recommend that their inline data deduplication tools be turned off for high-performance primary storage.
Post-processing dedupe is an asynchronous backup process that removes redundant data after it is written to storage. Duplicate data is removed and replaced with a pointer to the first iteration of the block. The post-processing approach gives users the flexibility to dedupe specific workloads and to quickly recover the most recent backup without hydration. The trade-off is a larger backup storage capacity than is required with inline deduplication.


Comments

Popular posts from this blog

Understanding the Evolution: AI, ML, Deep Learning, and Gen AI

In the ever-evolving landscape of artificial intelligence (AI) and machine learning (ML), one of the most intriguing advancements is the emergence of General AI (Gen AI). To grasp its significance, it's essential to first distinguish between these interconnected but distinct technologies. AI, ML, and Deep Learning: The Building Blocks Artificial Intelligence refers to the simulation of human intelligence in machines that are programmed to think like humans and mimic their actions. Machine Learning, a subset of AI, empowers machines to learn from data and improve over time without explicit programming. Deep Learning, a specialized subset of ML, involves neural networks with many layers (hence "deep"), capable of learning intricate patterns from vast amounts of data. Enter General AI (Gen AI): Unraveling the Next Frontier Unlike traditional AI systems that excel in specific tasks (narrow AI), General AI aims to replicate human cognitive abilities across various domains. I...

Normalization of Database

Database Normalisation is a technique of organizing the data in the database. Normalization is a systematic approach of decomposing tables to eliminate data redundancy and undesirable characteristics like Insertion, Update and Deletion Anamolies. It is a multi-step process that puts data into tabular form by removing duplicated data from the relation tables. Normalization is used for mainly two purpose, Eliminating reduntant(useless) data. Ensuring data dependencies make sense i.e data is logically stored. Problem Without Normalization Without Normalization, it becomes difficult to handle and update the database, without facing data loss. Insertion, Updation and Deletion Anamolies are very frequent if Database is not Normalized. To understand these anomalies let us take an example of  Student  table. S_id S_Name S_Address Subject_opted 401 Adam Noida Bio 402 Alex Panipat Maths 403 Stuart Jammu Maths 404 Adam Noida Physics Updation Anamoly :  To upda...

How to deal with a toxic working environment

Handling a toxic working environment can be challenging, but there are steps you can take to address the situation and improve your experience at work: Recognize the Signs : Identify the specific behaviors or situations that contribute to the toxicity in your workplace. This could include bullying, harassment, micromanagement, negativity, or lack of support from management. Maintain Boundaries : Set boundaries to protect your mental and emotional well-being. This may involve limiting interactions with toxic individuals, avoiding gossip or negative conversations, and prioritizing self-care outside of work. Seek Support : Reach out to trusted colleagues, friends, or family members for support and advice. Sharing your experiences with others can help you feel less isolated and provide perspective on the situation. Document Incidents : Keep a record of any incidents or behaviors that contribute to the toxic environment, including dates, times, and specific details. This documentation may b...