
Data Lake | What is a Data Lake?
28.08.2022 | 4 min read | Tag: #data lake
A data lake is suited to storing all forms of data for analytical use cases. It is more of a concept than a technical solution, but in short, the idea is that we should be able to retain all types of data, including data we are not yet sure we will use.

A data lake allows us to store different types of data
A data lake can encompass structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video).
A data lake has a storage system where data is stored in its original format, typically as files (ORC, Parquet) or objects. Hadoop technology enables inexpensive storage, and is now built into the underlying infrastructure of the major cloud providers’ storage services. Data lake services are generally cheaper when we need to store large volumes of data.
If we have a lot of unstructured data, we may need an object storage service to store the data instead. An object store can be thought of as a large Dropbox account, where we store items such as images and video as objects with associated metadata. From there, we can process the data to make it more structured, for example by having a machine interpret what it sees in the images and storing the result as metadata (data about data).
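The idea of objects with associated metadata can be sketched in a few lines of Python. This is only an illustration: the `ObjectStore` class, its `put`, `get` and `annotate` methods, the object key and the labels are all hypothetical and do not correspond to any cloud provider's API.

```python
from dataclasses import dataclass, field


@dataclass
class StoredObject:
    """One object: the raw bytes plus free-form metadata (data about data)."""
    data: bytes
    metadata: dict = field(default_factory=dict)


class ObjectStore:
    """Hypothetical in-memory stand-in for an object storage service."""

    def __init__(self):
        self._objects = {}

    def put(self, key, data, **metadata):
        # Store the bytes unchanged, in their original format.
        self._objects[key] = StoredObject(data, dict(metadata))

    def get(self, key):
        return self._objects[key]

    def annotate(self, key, **metadata):
        # Add structure later without touching the original bytes.
        self._objects[key].metadata.update(metadata)


store = ObjectStore()
store.put("images/cat-001.jpg", b"\xff\xd8\xff", content_type="image/jpeg")

# Assumed result of an image-recognition step; the labels are made up.
store.annotate("images/cat-001.jpg", labels=["cat", "sofa"])

obj = store.get("images/cat-001.jpg")
print(obj.metadata)
```

The point of the sketch is that the original bytes are never rewritten; the structure arrives afterwards as metadata attached to the same key.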
Data scientists and other analysts use the data in a data lake to perform analyses and gain new insights, and in some cases to build data-driven services that are put into production. The processing engines in data lakes, i.e. the technology that reads and transforms the data, are typically built on Spark.
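What "read and transform" means for semi-structured data can be shown with a small sketch. It uses only the Python standard library as a stand-in for a real processing engine such as Spark, and the sample log lines, the `COLUMNS` tuple and the `to_rows` helper are assumptions made for illustration.

```python
import json

# Raw, semi-structured log lines as they might sit in the lake (made-up sample data).
raw_lines = [
    '{"user": "anna", "event": "login", "ms": 120}',
    '{"user": "ben", "event": "login"}',
    'not valid json',
    '{"user": "anna", "event": "purchase", "ms": 340, "amount": 49.5}',
]

# The structure we choose to impose when reading, not when storing.
COLUMNS = ("user", "event", "ms")


def to_rows(lines):
    """Apply a schema at read time: keep known columns, fill gaps with None."""
    rows = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that cannot be parsed
        if not isinstance(record, dict):
            continue  # skip valid JSON that is not an object
        rows.append(tuple(record.get(col) for col in COLUMNS))
    return rows


rows = to_rows(raw_lines)

# A simple analysis on the structured rows: number of events per user.
events_per_user = {}
for user, _event, _ms in rows:
    events_per_user[user] = events_per_user.get(user, 0) + 1

print(rows)
print(events_per_user)
```

The raw lines stay as they were stored; the structure is decided by whoever reads them, which is what makes the approach flexible and also what makes it easy to lose track of.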
A data lake can be challenging to navigate
Data lakes are not well suited to creating different data layers, as in a data warehouse, and they often lack good descriptions of the data. As a concept they have therefore received much criticism for being “data swamps”, where it is difficult to find your way around and make use of the data. Data lakes typically have neither defined structures and schemas nor data catalogues with definitions or owners.
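To make the missing pieces concrete, here is a minimal sketch of what a data catalogue entry could record, and how undocumented datasets could be flagged. The `CatalogueEntry` fields, the `undocumented` helper and the example paths and owners are hypothetical, not taken from any particular catalogue product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CatalogueEntry:
    """Hypothetical catalogue record for one dataset in the lake."""
    path: str
    description: str = ""
    owner: Optional[str] = None
    schema: Optional[dict] = None


def undocumented(entries):
    """Return paths of datasets missing a description, an owner or a schema."""
    return [
        e.path
        for e in entries
        if not (e.description and e.owner and e.schema)
    ]


catalogue = [
    CatalogueEntry(
        "raw/sales/2022/",
        description="Daily sales export",
        owner="finance-team",
        schema={"order_id": "int", "amount": "float"},
    ),
    CatalogueEntry("raw/sensor_dump/"),
    CatalogueEntry("raw/emails/", description="Support inbox archive"),
]

swamp_candidates = undocumented(catalogue)
print(swamp_candidates)
```

Datasets that show up in such a list are the ones most likely to turn the lake into a swamp, since nobody can tell what they contain or who is responsible for them.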
Is the data lake concept dead?
The ability to store all kinds of data, in large volumes, is the reason why data lakes as a term took off a few years ago, and became synonymous with the architecture behind “Big Data”.
As with so many other things, we have matured and learned. We now take it for granted that we can store all data, in large volumes. We no longer use the term “Big Data”, but have gone back to simply talking about data.
Many now consider the data lake as an architectural pattern to be somewhat too simplistic: we need other types of services as well, such as data warehouse capabilities and ML processing.
Data lake architecture and technologies are now built in as a central part of cloud providers’ data services. For Azure Data Lake Storage Gen2 and Amazon S3 it is fairly obvious that the data lake concept lies behind them, while for other solutions, such as Google BigQuery, it is more hidden.
We will probably not talk much about the data lake as a concept going forward – the data lakehouse pattern is now in the process of taking over.
Advantages and disadvantages of a data lake
Below we summarise some important advantages and disadvantages of using a data lake as storage for reporting and analytics:
| Advantages |
|---|
|
| Disadvantages |
|---|
|

