Big data is a combination of structured, semi-structured and unstructured data collected by organizations that can be mined for information and used in machine learning projects, predictive modelling and other advanced analytics applications.
Systems that process and store big data have become
a common component of data management architectures in organizations.
Big data is often characterized by the 3Vs: the large volume of
data in many environments, the wide variety of data types stored in
big data systems and the velocity at which the data is generated,
collected and processed.
Although big data doesn't equate to any specific
volume of data, big data deployments often involve terabytes (TB), petabytes (PB)
and even exabytes (EB) of data captured over time.
Importance of big data
Companies use the big data accumulated in their
systems to improve operations, provide better customer service, create
personalized marketing campaigns based on specific customer preferences and,
ultimately, increase profitability. Businesses that utilize big data hold a
potential competitive advantage over those that don't since they're
able to make faster and more informed business decisions, provided they use the
data effectively.
Furthermore, utilizing big data enables companies
to become increasingly customer-centric. Historical and real-time data can
be used to assess the evolving preferences of consumers, consequently enabling
businesses to update and improve their marketing strategies and become more
responsive to customer desires and needs.
Breaking down the Vs of big data
Volume is the most commonly cited characteristic of
big data. A big data environment doesn't have to contain a large amount of
data, but most do because of the nature of the data being collected and stored
in them. Clickstreams, system logs and stream processing systems are among the
sources that typically produce massive volumes of big data on an ongoing basis.
Big data also encompasses a wide variety of data
types, including the following:
- structured data in databases and data warehouses based on Structured Query Language (SQL);
- unstructured data, such as text and document files held in Hadoop clusters or NoSQL database systems; and
- semi-structured data, such as web server logs or streaming data from sensors (see the parsing sketch below).
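As a concrete illustration of the semi-structured case, the sketch below uses plain Python to parse one web server log line into structured fields that could then be loaded into a SQL table. The Apache-style combined log format, the regular expression and the field names are assumptions made for this example, not part of any particular product.

```python
import re
from typing import Optional

# Illustrative pattern for an Apache-style combined log line (an assumed format).
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d{3}) (?P<bytes>\d+|-)'
)

def parse_log_line(line: str) -> Optional[dict]:
    """Turn one semi-structured log line into a structured record."""
    match = LOG_PATTERN.match(line)
    if not match:
        return None  # malformed lines could instead be routed to a quarantine area
    record = match.groupdict()
    record["status"] = int(record["status"])
    record["bytes"] = 0 if record["bytes"] == "-" else int(record["bytes"])
    return record

sample = '203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] "GET /products/42 HTTP/1.1" 200 2326'
print(parse_log_line(sample))
```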
All of the various data types can be stored
together in a data lake, which typically is based on Hadoop or a cloud
object storage service. In addition, big data applications often include
multiple data sources that may not otherwise be integrated. For example, a big
data analytics project may attempt to gauge a product's success and future
sales by correlating past sales data, return data and online buyer review data
for that product.
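To make that scenario concrete, here is a minimal sketch of such a correlation using PySpark, one common engine for this kind of work. The data lake paths and the column names (product_id, units_sold, units_returned, rating) are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("product-outcomes").getOrCreate()

# Hypothetical data lake locations; any object store or HDFS path would work.
sales = spark.read.parquet("s3a://example-lake/sales")      # product_id, units_sold
returns = spark.read.parquet("s3a://example-lake/returns")  # product_id, units_returned
reviews = spark.read.parquet("s3a://example-lake/reviews")  # product_id, rating

# Correlate the three sources per product to gauge how well each product performed.
product_outcomes = (
    sales.groupBy("product_id").agg(F.sum("units_sold").alias("total_sold"))
    .join(
        returns.groupBy("product_id").agg(F.sum("units_returned").alias("total_returned")),
        "product_id", "left")
    .join(
        reviews.groupBy("product_id").agg(F.avg("rating").alias("avg_rating")),
        "product_id", "left")
    .withColumn("return_rate", F.col("total_returned") / F.col("total_sold"))
)

product_outcomes.show()
```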
Velocity refers to the speed at which big data is
generated and must be processed and analyzed. In many cases, sets of big data
are updated on a real- or near-real-time basis, instead of the daily, weekly or
monthly updates made in many traditional data warehouses. Big data analytics
applications ingest, correlate and analyze the incoming data and then render an
answer or result based on an overarching query. This means data
scientists and other data analysts must have a detailed understanding of
the available data and possess some sense of what answers they're looking for
to make sure the information they get is valid and up to date.
Managing data velocity is also important as big data analysis expands into fields like machine learning and artificial intelligence (AI), where analytical processes automatically find patterns in the collected data and use them to generate insights.
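As a rough illustration of velocity, the sketch below uses Spark Structured Streaming to aggregate a clickstream as it arrives rather than in a daily batch. The Kafka broker address, topic name and JSON field are assumptions, and running it would also require the Spark Kafka connector package to be available.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("clickstream-velocity").getOrCreate()

# Hypothetical Kafka topic carrying clickstream events as JSON strings.
events = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "clickstream")
    .load()
)

# Count clicks per page over one-minute windows as the data arrives.
clicks = events.select(
    F.get_json_object(F.col("value").cast("string"), "$.page").alias("page"),
    F.col("timestamp"),
)
counts = (
    clicks
    .withWatermark("timestamp", "2 minutes")
    .groupBy(F.window("timestamp", "1 minute"), "page")
    .count()
)

# Emit the running aggregates continuously instead of in a nightly batch job.
query = counts.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```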
More characteristics of big data
Looking beyond the original 3Vs, data veracity
refers to the degree of certainty in data sets. Uncertain raw data collected
from multiple sources -- such as social media platforms and webpages -- can
cause serious data quality issues that may be difficult to pinpoint.
For example, a company that collects sets of big data from hundreds of sources
may be able to identify inaccurate data, but its analysts need data
lineage information to trace where the data originated so they can correct
the issues.
Bad data leads to inaccurate analysis and may
undermine the value of business analytics because it can cause
executives to mistrust data as a whole. The amount of uncertain data in an
organization must be accounted for before it is used in big data analytics
applications. IT and analytics teams also need to ensure that they have enough
accurate data available to produce valid results.
Some data scientists also add value to the list of
characteristics of big data. As explained above, not all data collected has
real business value, and the use of inaccurate data can weaken the insights
provided by analytics applications. It's critical that organizations employ
practices such as data cleansing and confirm that data relates to
relevant business issues before they use it in a big data analytics project.
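A minimal data cleansing sketch along these lines, again in PySpark, might deduplicate records, drop rows missing key fields and normalize formats before the data reaches an analytics project. The input path and column names here are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("cleanse-customers").getOrCreate()

# Hypothetical raw feed gathered from multiple sources; column names are illustrative.
raw = spark.read.parquet("s3a://example-lake/raw/customers")

cleansed = (
    raw
    .dropDuplicates(["customer_id"])                    # remove repeated records
    .na.drop(subset=["customer_id", "email"])           # discard rows missing key fields
    .withColumn("email", F.lower(F.trim("email")))      # normalize formatting
    .withColumn("country", F.upper("country"))
    .filter(F.col("signup_date") <= F.current_date())   # drop impossible values
)

cleansed.write.mode("overwrite").parquet("s3a://example-lake/clean/customers")
```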
Variability also often applies to sets of big data,
which are less consistent than conventional transaction data and may have
multiple meanings or be formatted in different ways from one data source to
another -- factors that further complicate efforts to process and analyze the
data. Some people ascribe even more Vs to big data; data scientists and
consultants have created various lists with between seven and 10 Vs.
How big data is stored and processed
The need to handle big data velocity imposes unique
demands on the underlying compute infrastructure. The computing power required
to quickly process huge volumes and varieties of data can overwhelm a single
server or server cluster. Organizations must apply adequate processing
capacity to big data tasks in order to achieve the required velocity. This can
potentially demand hundreds or thousands of servers that can distribute the
processing work and operate collaboratively in a clustered architecture, often
based on technologies like Hadoop and Apache Spark.
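The sketch below shows one way an application might request that kind of distributed capacity, assuming a Spark job submitted to a YARN-managed cluster; the executor counts and sizes are illustrative, not recommendations.

```python
from pyspark.sql import SparkSession

# Ask the cluster manager (YARN is assumed here) to spread one job's work
# across many executors; the sizing numbers are purely illustrative.
spark = (
    SparkSession.builder
    .appName("distributed-aggregation")
    .master("yarn")
    .config("spark.executor.instances", "200")  # hundreds of workers share the job
    .config("spark.executor.cores", "4")
    .config("spark.executor.memory", "8g")
    .getOrCreate()
)

# Each executor reads and aggregates its own slice of the files in parallel.
events = spark.read.parquet("s3a://example-lake/events")
events.groupBy("event_type").count().write.mode("overwrite").parquet(
    "s3a://example-lake/summaries/event_counts"
)
```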
Achieving such velocity in a cost-effective manner
is also a challenge. Many enterprise leaders are reluctant to invest in an
extensive server and storage infrastructure to support big data workloads,
particularly ones that don't run 24/7. As a result, public cloud computing
is now a primary vehicle for hosting big data systems. A public cloud provider
can store petabytes of data and scale up the required number of servers just
long enough to complete a big data analytics project. The business only pays
for the storage and compute time actually used, and the cloud instances can be
turned off until they're needed again.
Big data challenges
Besides the processing capacity and cost issues,
designing a big data architecture is another common challenge for users. Big
data systems must be tailored to an organization's particular needs, a DIY
undertaking that requires IT teams and application developers to piece together
a set of tools from all the available technologies. Deploying and managing big
data systems also calls for skills beyond those of database
administrators (DBAs) and developers focused on relational software.
Both of those issues can be eased by using a
managed cloud service, but IT managers need to keep a close eye on cloud usage to
make sure costs don't get out of hand. Also, migrating on-premises data sets
and processing workloads to the cloud is often a complex process for
organizations.
Making the data in big data systems accessible to
data scientists and other analysts is also a challenge, especially in
distributed environments that include a mix of different platforms and data
stores. To help analysts find relevant data, IT and analytics teams are
increasingly working to build data catalogues that incorporate
metadata management and data lineage functions. Data quality and data
governance also need to be priorities to ensure that sets of big data are
clean, consistent and used properly.
Big data collection practices and regulations
For many years, companies had few restrictions on
the data they collected from their customers. However, as the collection and
use of big data have increased, so has data misuse. Concerned citizens who have
experienced the mishandling of their personal data or have been victims of
a data breach are calling for laws around data collection
transparency and consumer data privacy.
The outcry about personal privacy violations led
the European Union to pass the General Data Protection Regulation (GDPR), which
took effect in May 2018; it limits the types of data that organizations can
collect and requires opt-in consent from individuals or compliance with other
specified lawful grounds for collecting personal data. GDPR also includes a
right-to-be-forgotten provision, which lets EU residents ask companies to
delete their data.
The human side of big data analytics
Ultimately, the value and effectiveness of big data
depend on the workers tasked with understanding the data and formulating the
proper queries to direct big data analytics projects. Some big data tools fill
specialized niches and enable less technical users to apply everyday business
data in predictive analytics applications. Other technologies -- such as
Hadoop-based big data appliances -- help businesses implement a suitable
compute infrastructure to tackle big data projects, while minimizing the need
for hardware and distributed software know-how.
Big data can be contrasted with small data,
another evolving term that's often used to describe data sets small enough in
volume and simple enough in format for self-service analytics. A commonly quoted
axiom is that "big data is for machines; small data is for people."