Skip to main content

Big Data

Big data is a combination of structured, semi structured and unstructured data collected by organizations that can be mined for information and used in machine learning projects, predictive modelling and other advanced analytics applications.

Systems that process and store big data have become a common component of data management architectures in organizations. Big data is often characterized by the 3Vs: the large volume of data in many environments, the wide variety of data types stored in big data systems and the velocity at which the data is generated, collected and processed.

Although big data doesn't equate to any specific volume of data, big data deployments often involve terabytes (TB), petabytes (PB) and even exabytes (EB) of data captured over time.

Importance of big data

Companies use the big data accumulated in their systems to improve operations, provide better customer service, create personalized marketing campaigns based on specific customer preferences and, ultimately, increase profitability. Businesses that utilize big data hold a potential competitive advantage over those that don't since they're able to make faster and more informed business decisions, provided they use the data effectively.

Furthermore, utilizing big data enables companies to become increasingly customer-centric. Historical and real-time data can be used to assess the evolving preferences of consumers, consequently enabling businesses to update and improve their marketing strategies and become more responsive to customer desires and needs.

Breaking down the Vs of big data

Volume is the most commonly cited characteristic of big data. A big data environment doesn't have to contain a large amount of data, but most do because of the nature of the data being collected and stored in them. Clickstreams, system logs and stream processing systems are among the sources that typically produce massive volumes of big data on an ongoing basis.

Big data also encompasses a wide variety of data types, including the following:

  • structured data in databases and data warehouses based on Structured Query Language (SQL);
  • unstructured data, such as text and document files held in Hadoop clusters or NoSQL database systems; and
  • semi structured data, such as web server logs or streaming data from sensors.

All of the various data types can be stored together in a data lake, which typically is based on Hadoop or a cloud object storage service. In addition, big data applications often include multiple data sources that may not otherwise be integrated. For example, a big data analytics project may attempt to gauge a product's success and future sales by correlating past sales data, return data and online buyer review data for that product.

Velocity refers to the speed at which big data is generated and must be processed and analyzed. In many cases, sets of big data are updated on a real- or near-real-time basis, instead of the daily, weekly or monthly updates made in many traditional data warehouses. Big data analytics applications ingest, correlate and analyze the incoming data and then render an answer or result based on an overarching query. This means data scientists and other data analysts must have a detailed understanding of the available data and possess some sense of what answers they're looking for to make sure the information they get is valid and up to date.

Managing data velocity is also important as big data analysis expands into fields like machine learning and artificial intelligence (AI), where analytical processes automatically find patterns in the collected data and use them to generate insights.

More characteristics of big data

Looking beyond the original 3Vs, data veracity refers to the degree of certainty in data sets. Uncertain raw data collected from multiple sources -- such as social media platforms and webpages -- can cause serious data quality issues that may be difficult to pinpoint. For example, a company that collects sets of big data from hundreds of sources may be able to identify inaccurate data, but its analysts need data lineage information to trace where the data is stored so they can correct the issues.

Bad data leads to inaccurate analysis and may undermine the value of business analytics because it can cause executives to mistrust data as a whole. The amount of uncertain data in an organization must be accounted for before it is used in big data analytics applications. IT and analytics teams also need to ensure that they have enough accurate data available to produce valid results.

Some data scientists also add value to the list of characteristics of big data. As explained above, not all data collected has real business value, and the use of inaccurate data can weaken the insights provided by analytics applications. It's critical that organizations employ practices such as data cleansing and confirm that data relates to relevant business issues before they use it in a big data analytics project.

Variability also often applies to sets of big data, which are less consistent than conventional transaction data and may have multiple meanings or be formatted in different ways from one data source to another -- factors that further complicate efforts to process and analyze the data. Some people ascribe even more Vs to big data; data scientists and consultants have created various lists with between seven and 10 Vs.

How big data is stored and processed

The need to handle big data velocity imposes unique demands on the underlying compute infrastructure. The computing power required to quickly process huge volumes and varieties of data can overwhelm a single server or server cluster. Organizations must apply adequate processing capacity to big data tasks in order to achieve the required velocity. This can potentially demand hundreds or thousands of servers that can distribute the processing work and operate collaboratively in a clustered architecture, often based on technologies like Hadoop and Apache Spark.

Achieving such velocity in a cost-effective manner is also a challenge. Many enterprise leaders are reticent to invest in an extensive server and storage infrastructure to support big data workloads, particularly ones that don't run 24/7. As a result, public cloud computing is now a primary vehicle for hosting big data systems. A public cloud provider can store petabytes of data and scale up the required number of servers just long enough to complete a big data analytics project. The business only pays for the storage and compute time actually used, and the cloud instances can be turned off until they're needed again.

Big data challenges

Besides the processing capacity and cost issues, designing a big data architecture is another common challenge for users. Big data systems must be tailored to an organization's particular needs, a DIY undertaking that requires IT teams and application developers to piece together a set of tools from all the available technologies. Deploying and managing big data systems also require new skills compared to the ones possessed by database administrators (DBAs) and developers focused on relational software.

Both of those issues can be eased by using a managed cloud service, but IT managers need to keep a close eye on cloud usage to make sure costs don't get out of hand. Also, migrating on-premises data sets and processing workloads to the cloud is often a complex process for organizations.

Making the data in big data systems accessible to data scientists and other analysts is also a challenge, especially in distributed environments that include a mix of different platforms and data stores. To help analysts find relevant data, IT and analytics teams are increasingly working to build data catalogues that incorporate metadata management and data lineage functions. Data quality and data governance also need to be priorities to ensure that sets of big data are clean, consistent and used properly.

Big data collection practices and regulations

For many years, companies had few restrictions on the data they collected from their customers. However, as the collection and use of big data have increased, so has data misuse. Concerned citizens who have experienced the mishandling of their personal data or have been victims of a data breach are calling for laws around data collection transparency and consumer data privacy.

The outcry about personal privacy violations led the European Union to pass the General Data Protection Regulation (GDPR), which took effect in May 2018; it limits the types of data that organizations can collect and requires opt-in consent from individuals or compliance with other specified lawful grounds for collecting personal data. GDPR also includes a right-to-be-forgotten provision, which lets EU residents ask companies to delete their data.

The human side of big data analytics

Ultimately, the value and effectiveness of big data depend on the workers tasked with understanding the data and formulating the proper queries to direct big data analytics projects. Some big data tools meet specialized niches and enable less technical users to use everyday business data in predictive analytics applications. Other technologies -- such as Hadoop-based big data appliances -- help businesses implement a suitable compute infrastructure to tackle big data projects, while minimizing the need for hardware and distributed software know-how.

Big data can be contrasted with small data, another evolving term that's often used to describe data whose volume and format can be easily used for self-service analytics. A commonly quoted axiom is that "big data is for machines; small data is for people."

Comments

Popular posts from this blog

Understanding the Evolution: AI, ML, Deep Learning, and Gen AI

In the ever-evolving landscape of artificial intelligence (AI) and machine learning (ML), one of the most intriguing advancements is the emergence of General AI (Gen AI). To grasp its significance, it's essential to first distinguish between these interconnected but distinct technologies. AI, ML, and Deep Learning: The Building Blocks Artificial Intelligence refers to the simulation of human intelligence in machines that are programmed to think like humans and mimic their actions. Machine Learning, a subset of AI, empowers machines to learn from data and improve over time without explicit programming. Deep Learning, a specialized subset of ML, involves neural networks with many layers (hence "deep"), capable of learning intricate patterns from vast amounts of data. Enter General AI (Gen AI): Unraveling the Next Frontier Unlike traditional AI systems that excel in specific tasks (narrow AI), General AI aims to replicate human cognitive abilities across various domains. I...

Normalization of Database

Database Normalisation is a technique of organizing the data in the database. Normalization is a systematic approach of decomposing tables to eliminate data redundancy and undesirable characteristics like Insertion, Update and Deletion Anamolies. It is a multi-step process that puts data into tabular form by removing duplicated data from the relation tables. Normalization is used for mainly two purpose, Eliminating reduntant(useless) data. Ensuring data dependencies make sense i.e data is logically stored. Problem Without Normalization Without Normalization, it becomes difficult to handle and update the database, without facing data loss. Insertion, Updation and Deletion Anamolies are very frequent if Database is not Normalized. To understand these anomalies let us take an example of  Student  table. S_id S_Name S_Address Subject_opted 401 Adam Noida Bio 402 Alex Panipat Maths 403 Stuart Jammu Maths 404 Adam Noida Physics Updation Anamoly :  To upda...

How to deal with a toxic working environment

Handling a toxic working environment can be challenging, but there are steps you can take to address the situation and improve your experience at work: Recognize the Signs : Identify the specific behaviors or situations that contribute to the toxicity in your workplace. This could include bullying, harassment, micromanagement, negativity, or lack of support from management. Maintain Boundaries : Set boundaries to protect your mental and emotional well-being. This may involve limiting interactions with toxic individuals, avoiding gossip or negative conversations, and prioritizing self-care outside of work. Seek Support : Reach out to trusted colleagues, friends, or family members for support and advice. Sharing your experiences with others can help you feel less isolated and provide perspective on the situation. Document Incidents : Keep a record of any incidents or behaviors that contribute to the toxic environment, including dates, times, and specific details. This documentation may b...