A data
lake is a storage repository that holds a vast amount of raw data in its
native format until it is needed. While a hierarchical data warehouse stores
data in files or folders, a data lake uses a flat architecture to store data.
Each data element in a lake is assigned a unique identifier and tagged with a
set of extended metadata. When a business question arises, the data lake
can be queried for relevant data, and that smaller set of data can then be
analyzed to help answer the question.
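The identifier-plus-metadata scheme can be sketched as a tiny in-memory catalog. This is an illustrative model only, not any vendor's API: the `LakeCatalog` class, its `put`/`query`/`get` methods, and the metadata fields are all hypothetical.

```python
import uuid

class LakeCatalog:
    """Toy model of a flat data-lake catalog: each object keeps its raw,
    native-format bytes, a unique identifier, and free-form metadata tags."""

    def __init__(self):
        self._objects = {}  # identifier -> (raw bytes, metadata dict)

    def put(self, raw, **metadata):
        """Store raw data as-is and return its unique identifier."""
        obj_id = str(uuid.uuid4())
        self._objects[obj_id] = (raw, metadata)
        return obj_id

    def query(self, **criteria):
        """Return identifiers of objects whose metadata matches every tag."""
        return [
            obj_id
            for obj_id, (_, meta) in self._objects.items()
            if all(meta.get(k) == v for k, v in criteria.items())
        ]

    def get(self, obj_id):
        """Retrieve the raw data for later analysis."""
        return self._objects[obj_id][0]

# Data lands in whatever format it arrives in; tags are attached at load time.
lake = LakeCatalog()
lake.put(b'{"sku": 1, "qty": 3}', source="pos", domain="sales", year=2023)
lake.put(b"<stock-level/>", source="erp", domain="inventory", year=2023)

# When a business question arises, filter by tags to get the relevant subset.
sales_ids = lake.query(domain="sales")
```

The point of the sketch is the flat layout: nothing is organized into a hierarchy up front, and no schema is imposed on the stored bytes; selection happens only at query time, via the metadata.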
The
term data lake is often associated with Hadoop-oriented object storage. In such
a scenario, an organization's data is first loaded into the Hadoop platform,
and then business analytics and data mining tools are applied to the data where
it resides on Hadoop's cluster nodes of commodity computers.
Like
big data, the term data lake is sometimes disparaged as being simply a marketing
label for a product that supports Hadoop. Increasingly, however, the term is
being accepted as a way to describe any large data pool in which the schema and
data requirements are not defined until the data is queried.
Data lake vs. data warehouse
Data lakes and data warehouses are both used for storing big data, but each approach has its own uses. Typically, a data warehouse is a relational database housed on an enterprise mainframe server or in the cloud. The data stored in a warehouse is extracted from various online transaction processing (OLTP) applications to support business analytics (BA) queries and data marts for specific internal business groups, such as sales or inventory teams.
Data
warehouses are useful when there is a massive amount of data from operational
systems that needs to be readily available for analysis. Because the data in a
lake is often uncurated and can originate from sources outside of the company's
operational systems, lakes are not a good fit for the average business analytics
user.