What is a data lake?
Today, the promise of big data to transform business outcomes is immense. Companies rely on big data to gain key insights and competitive advantage, since it can drive growth in new and unexpected ways.
What is Big Data?
Your data is simply data until it is processed. Big data, in simple words, is a large amount of data stored in databases, the cloud, and other file storage systems.
Webopedia defines big data as
a large volume of both structured and unstructured data sets that is difficult to process using traditional database and software techniques.
This means that big data can be structured or unstructured: it is the whole collection of both kinds of data held in digital format.
What is structured data?
As the name suggests, this data follows a structure. It is stored according to a pre-defined data model, most commonly as tables in a relational database.
This makes it very easy to search.
Structured Query Language (SQL) is usually used for performing operations on structured databases. Databases typically maintain indexes on the stored records so that searches run faster. The user just needs to submit a query to retrieve the matching records. Those records can then be filtered, transformed, merged, split, and processed in many other ways, because they are stored in a well-defined structure.
Examples of relational databases with structured data include airline reservation systems, sales transaction and CRM systems, and financial databases.
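To make the idea of querying structured data concrete, here is a minimal sketch using Python's built-in sqlite3 module. The reservations table, its columns, and the sample rows are all made up for illustration; a real airline system would be far more involved.

```python
import sqlite3

# In-memory database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE reservations ("
    "id INTEGER PRIMARY KEY, passenger TEXT, flight TEXT, fare REAL)"
)
conn.executemany(
    "INSERT INTO reservations (passenger, flight, fare) VALUES (?, ?, ?)",
    [
        ("Ada", "BA117", 420.0),
        ("Grace", "BA117", 510.0),
        ("Linus", "LH400", 380.0),
    ],
)
# An index on the searched column lets the engine avoid scanning every row.
conn.execute("CREATE INDEX idx_reservations_flight ON reservations (flight)")


def passengers_on(flight):
    """Return the passenger names booked on a flight, sorted by name."""
    rows = conn.execute(
        "SELECT passenger FROM reservations WHERE flight = ? ORDER BY passenger",
        (flight,),
    ).fetchall()
    return [name for (name,) in rows]


def total_fare(flight):
    """Return the sum of fares for a flight (None if there are no rows)."""
    (total,) = conn.execute(
        "SELECT SUM(fare) FROM reservations WHERE flight = ?", (flight,)
    ).fetchone()
    return total
```

Because every row has the same columns, a question such as "who is on flight BA117?" becomes a single query, for example passengers_on("BA117").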
What is Unstructured Data?
Unstructured data is essentially everything else. It may have an internal structure of its own, but it is not organized according to a pre-defined data model such as a relational database schema.
Unstructured data can contain text, numbers, or any other kind of data, and can be human- or machine-generated.
Some examples of human-generated unstructured data are text files (Word documents, spreadsheets, presentations, email, logs) and social media data from Facebook, Twitter, LinkedIn, etc.
Some examples of machine-generated unstructured data are weather data and data from traffic signals.
Because unstructured data does not have a fixed format, it is difficult to analyze and work with.
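A small sketch can show why the lack of a fixed format is a problem. The messages and the pattern below are invented for illustration: to pull an order number out of free text, we have to guess how people might write it, and any guess will miss some cases.

```python
import re

# Hypothetical free-text messages; nothing about their layout is guaranteed.
messages = [
    "Hi, my order #10452 arrived damaged. Please call me back.",
    "Order number 20931 never showed up!!",
    "Great service, thank you.",
]

# A guess at how customers write order numbers; real text has more variants.
ORDER_PATTERN = re.compile(r"order\s*(?:number\s*|#\s*)?(\d+)", re.IGNORECASE)


def extract_order_id(text):
    """Return the first order number found in the text, or None."""
    match = ORDER_PATTERN.search(text)
    return int(match.group(1)) if match else None


order_ids = [extract_order_id(m) for m in messages]
```

With structured data the order number would simply be a column. Here it has to be recovered by pattern matching, and a message that phrases things differently yields nothing.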
Big Data Storage: Data Lake and Data Warehouses
Now that you know what structured and unstructured data are, it becomes easier to understand how big data is stored.
Data Warehouses
A data warehouse is used to store a large amount of structured data.
All data in a data warehouse follows a defined structure, which makes it easy to process and work with. The data is cleaned and transformed before it is stored. We also know what type of information is stored there, so business professionals can easily run queries and extract relevant information.
Data Lakes
A data lake is used to store a large amount of raw data, much of it unstructured.
Data lakes contain a large amount of raw and unprocessed data. Because of this, data lakes typically require much larger storage capacity than data warehouses, and processing the information is much harder. Mining of the data is usually done by data scientists, since the purpose of the data is not yet determined.
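The contrast between the two kinds of storage can be sketched in a few lines of Python. This is a toy illustration rather than how any real product works: the event strings and field names are hypothetical, and the two loaders stand in for what is often called schema-on-write (warehouse) and schema-on-read (lake).

```python
import json

# Raw events as they might arrive; field names are hypothetical.
raw_events = [
    '{"user": "ada", "amount": "19.99", "ts": "2024-01-05"}',
    '{"user": "grace", "note": "called support, no amount recorded"}',
    "free-form log line: sensor 7 offline",
]


def load_into_warehouse(events):
    """Warehouse style: clean and keep only records that fit the schema."""
    table = []
    for line in events:
        try:
            record = json.loads(line)
            row = {"user": str(record["user"]), "amount": float(record["amount"])}
        except (ValueError, KeyError, TypeError):
            continue  # does not fit the schema, so it is rejected at load time
        table.append(row)
    return table


def load_into_lake(events):
    """Lake style: keep every record exactly as it arrived."""
    return list(events)


def read_amounts_from_lake(lake):
    """Structure is imposed only when a question is finally asked."""
    amounts = []
    for line in lake:
        try:
            record = json.loads(line)
        except ValueError:
            continue  # not JSON; a different question might still use it
        if isinstance(record, dict) and "amount" in record:
            amounts.append(float(record["amount"]))
    return amounts


warehouse = load_into_warehouse(raw_events)
lake = load_into_lake(raw_events)
```

The warehouse ends up smaller and immediately queryable, while the lake keeps everything, including the lines nobody knows how to use yet, at the cost of doing the parsing work later.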
What is the difference between a Data Lake and a Data Warehouse?
Besides the obvious difference between storing in a relational database and storing outside of one, the biggest difference is the ease of analyzing structured data vs. unstructured data.
The benefit of a data warehouse is that the data has already been processed, which makes it easier to interpret. Mature analytics tools exist in the market for structured data, but analytics tools for mining unstructured data are nascent and still developing.
However, there is simply much more unstructured data than structured data in the world today. Unstructured data is often estimated to make up about 80% of data and to be growing at a rate of 55% to 65% per year.
Data lakes contain large amounts of unstructured data. But the lack of an orderly internal structure defeats traditional data mining tools, so enterprises get little value from potentially valuable data sources like rich media, network logs or weblogs, customer interactions, and social media data. Even though unstructured data analytics tools are in the marketplace, no one vendor or toolset is a clear winner, and many customers are reluctant to invest in analytics tools with uncertain development roadmaps. It is best to consult a managed service provider in this regard.
Choosing a data lake versus a data warehouse depends on the type of organization and the industry. For example, the healthcare industry has used data warehouses for a long time, but they have never been hugely successful there. Because of the unstructured nature of much of the data in healthcare (physicians' notes, clinical data, etc.) and the need for real-time insights, data lakes may be more useful.