Data Lake | What is a Data Lake?

28.08.2022 | 4 min Read
Tag: #data lake

A data lake is suited to storing all forms of data for analytical use cases. It is more of a concept than a technical solution, but in short, the idea is that we should be able to retain all types of data, including data we are not yet sure we will use.

A data lake enables analysis of different types of data

A data lake allows us to store different types of data

A data lake can encompass structured data from relational databases (rows and columns), semi-structured data (CSV, logs, XML, JSON), unstructured data (emails, documents, PDFs) and binary data (images, audio, video).
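As a minimal sketch of the four categories above, the following Python snippet sorts incoming files by extension. The mapping and the example paths are hypothetical, and treating Parquet and ORC files as structured table extracts is an assumption made for illustration; a real lake would usually inspect content rather than trust file names.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from file extension to the categories named in the text.
# Parquet/ORC are assumed here to hold table extracts from relational databases.
CATEGORY_BY_EXTENSION = {
    ".parquet": "structured",
    ".orc": "structured",
    ".csv": "semi-structured",
    ".log": "semi-structured",
    ".xml": "semi-structured",
    ".json": "semi-structured",
    ".eml": "unstructured",
    ".docx": "unstructured",
    ".pdf": "unstructured",
    ".jpg": "binary",
    ".png": "binary",
    ".mp3": "binary",
    ".mp4": "binary",
}


def classify(path: str) -> str:
    """Return the data category for a file path, or 'unknown'."""
    suffix = PurePosixPath(path).suffix.lower()
    return CATEGORY_BY_EXTENSION.get(suffix, "unknown")


if __name__ == "__main__":
    for example in ["raw/crm/customers.parquet", "raw/web/clicks.json", "raw/scans/invoice.pdf"]:
        print(example, "->", classify(example))
```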

A data lake has a storage system where data is kept in its original format, typically as files (ORC, Parquet) or objects. Hadoop technology made this kind of storage inexpensive, and comparable capabilities are now built into the major cloud providers’ storage services. Data lake services are generally cheaper than databases and data warehouses when we need to store large volumes of data.
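Keeping data in its original format means that structure is applied when the data is read, not when it is written. The sketch below illustrates that idea (often called schema-on-read, a term this article does not use) with plain Python. The function name, the field names and the sample records are all hypothetical.

```python
import json
from typing import Any, Callable, Iterable, Iterator


def read_with_schema(
    raw_lines: Iterable[str],
    schema: dict[str, Callable[[Any], Any]],
) -> Iterator[dict[str, Any]]:
    """Parse raw JSON lines and keep only the fields named in `schema`.

    Each schema value is a converter such as int or float. Fields that are
    missing or null in the raw record become None; extra fields are ignored.
    """
    for line in raw_lines:
        record = json.loads(line)
        row = {}
        for name, convert in schema.items():
            value = record.get(name)
            row[name] = None if value is None else convert(value)
        yield row


if __name__ == "__main__":
    # The raw data stays untouched; two consumers could apply different schemas.
    raw = ['{"id": "1", "amount": "9.5", "note": "first"}', '{"id": "2"}']
    for row in read_with_schema(raw, {"id": int, "amount": float}):
        print(row)
```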

If we have a lot of unstructured data, we may need an object storage service instead. An object store can be thought of as a large Dropbox account, where we store items such as images and video as objects with associated metadata. From there, we can process the data to make it more structured, for example by having a model interpret what it sees in the images and storing the result as metadata (data about data).
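The object-plus-metadata idea can be sketched with a toy in-memory store. This is not the API of any cloud service: the class and method names are hypothetical, and the "detected" label stands in for whatever an image model might produce.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class StoredObject:
    data: bytes
    metadata: dict[str, str] = field(default_factory=dict)


class ObjectStore:
    """Toy object store: opaque bytes addressed by key, plus key/value metadata."""

    def __init__(self) -> None:
        self._objects: dict[str, StoredObject] = {}

    def put(self, key: str, data: bytes, metadata: Optional[dict[str, str]] = None) -> None:
        self._objects[key] = StoredObject(data, dict(metadata or {}))

    def annotate(self, key: str, **labels: str) -> None:
        """Attach derived metadata, e.g. labels produced by an image model."""
        self._objects[key].metadata.update(labels)

    def find(self, **criteria: str) -> list[str]:
        """Return the sorted keys whose metadata matches every criterion."""
        return sorted(
            key
            for key, obj in self._objects.items()
            if all(obj.metadata.get(name) == value for name, value in criteria.items())
        )


if __name__ == "__main__":
    store = ObjectStore()
    store.put("images/0001.jpg", b"\xff\xd8", {"content_type": "image/jpeg"})
    store.annotate("images/0001.jpg", detected="cat")
    print(store.find(detected="cat"))
```

Once the labels are stored as metadata, the images can be searched without opening the binary content again.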

Data scientists and other analysts use the data in a data lake to perform analyses and gain new insights, and in some cases to build data-driven services that are put into production. The processing engines in data lakes, i.e. the technology that reads and transforms the data, are typically built on Spark.

A data lake can be challenging to navigate

Data lakes are not well suited to creating different data layers, as in a data warehouse, and they often lack good descriptions of the data. The concept has therefore been widely criticised for producing “data swamps”, where it is difficult to find your way around and make use of the data. Data lakes typically have neither defined structures and schemas, nor data catalogues with definitions or owners.
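To make the missing pieces concrete, here is a minimal sketch of what a catalogue entry might record, and of a check that lists datasets in the lake nobody has described. The class, its fields and the paths are hypothetical and not taken from any particular catalogue product.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class CatalogueEntry:
    path: str                 # where the dataset lives in the lake
    owner: str                # who is accountable for it
    description: str          # what the data means
    columns: tuple[str, ...]  # the defined structure


def uncatalogued(lake_paths: Iterable[str], catalogue: Iterable[CatalogueEntry]) -> list[str]:
    """Return the lake paths that have no catalogue entry, sorted."""
    known = {entry.path for entry in catalogue}
    return sorted(path for path in lake_paths if path not in known)


if __name__ == "__main__":
    catalogue = [
        CatalogueEntry("raw/crm/customers", "sales-team", "One row per customer", ("id", "name")),
    ]
    lake = ["raw/crm/customers", "raw/web/clicks", "raw/tmp/export_final_v2"]
    print(uncatalogued(lake, catalogue))
```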

Is the data lake concept dead?

The ability to store all kinds of data, in large volumes, is the reason why data lakes as a term took off a few years ago, and became synonymous with the architecture behind “Big Data”.

As with so many other things, we have matured and learned. We now take it for granted that we can store all data, in large volumes. We no longer use the term “Big Data”, but have gone back to simply talking about data.

Many now consider the data lake as an architectural pattern to be somewhat too simplistic, because we need other types of services as well, such as data warehouse capabilities and ML processing.

Data lake architecture and technologies are now built in as a central part of cloud providers’ data services. For Azure Data Lake Storage Gen2 and Amazon S3, it is fairly obvious that the data lake concept lies behind them, while for other solutions, such as Google BigQuery, it is more hidden.

We will probably not talk much about the data lake as a concept going forward – the data lakehouse pattern is now in the process of taking over.

Advantages and disadvantages of a data lake

Below we summarise some important advantages and disadvantages of using a data lake as storage for reporting and analytics:

Advantages
  • Contributes to data consolidation – can store both structured and unstructured data in one place
  • Provides flexibility – can store all forms of data without a schema or data modelling, and can preserve data in its original format for future processing, depending on the requirements of specific use cases
  • Is a less expensive form of storage than traditional databases and data warehouses, both on-premises and in the cloud
  • Provides support for data scientists and analysts who need sandbox environments with original data for various machine learning and deep learning algorithms
Disadvantages
  • Can provide poorer performance than other alternatives for use cases related to reporting and self-service analytics
  • Lacks transactional (ACID) support for updating and deleting data, which makes it demanding to meet GDPR and privacy requirements such as erasure requests
  • Lack of data consistency can make it challenging to deliver sufficient data reliability and security
  • Lack of up-front work on the data (through data modelling and solution design) often means that data is not sufficiently described and defined. This can result in data being stored that can never be put to use
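
The point about updates and GDPR can be illustrated with a small sketch. When data lives in plain files without a transactional layer, removing one person's records typically means rewriting the whole file. The example below does this for a JSON Lines file; the function name and the `customer_id` field are hypothetical, and a real lake would have to repeat this for every file and copy that holds the person's data.

```python
import json
import os
import tempfile


def erase_subject(path: str, subject_id: str, id_field: str = "customer_id") -> int:
    """Rewrite a JSON Lines file without the given subject's records.

    The new content is written to a temporary file in the same directory and
    then swapped in with os.replace, so readers see either the old file or the
    new one. Blank lines are dropped. Returns the number of records removed.
    """
    removed = 0
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as out, open(path, encoding="utf-8") as src:
            for line in src:
                if not line.strip():
                    continue
                record = json.loads(line)
                if record.get(id_field) == subject_id:
                    removed += 1
                    continue
                out.write(json.dumps(record) + "\n")
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything went wrong before the swap.
        os.unlink(tmp_path)
        raise
    return removed
```

The cost of this approach grows with file size, not with the number of records removed, which is one reason erasure requests are demanding in a plain data lake.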

Magne Bakkeli

Magne has over 20 years of experience as an advisor, architect and project manager in data & analytics, and has a strong understanding of both business and technical challenges.