A data lake is nothing more than a central repository to store an entire organization’s information regardless of its format or origin.

For those who are used to developing traditional data warehouses, the change in focus is evident. In a simple data lake, there is no need to clean, validate, and prepare the data before it is loaded, nor is it necessary to adapt it to a specific schema or apply corporate integration criteria, as would be done when loading a data warehouse.

This also implies a very different scenario in terms of prior analysis, development time, and cost, all of which are much lower than in traditional architectures. When a data lake is built, the main purpose is to store all the data the organization generates, regardless of how it will be used or even whether it currently has a use at all.

Herein lies the great difference between the two approaches. In a traditional data warehouse, the data is cleansed, validated, and adapted to the predefined schema of the data model into which it must be loaded. In a data lake, the data is initially stored regardless of its schema and characteristics; it is not adapted and processed until a real business need has been identified.

On the other hand, a brief look at the technological components that surround data lakes shows that the advantages of new Big Data technologies, such as distributed environments, horizontal scalability, flexibility, and real-time processing, imply a reduction in costs compared to traditional systems.


Common Layers of a Data Lake

The most common layers that can be found in a data lake are the following:

Data Ingestion

A temporary loading layer in which the data passes basic checks before being stored in the raw data layer. Although not required, it can be implemented to perform:

  1. Basic quality controls, such as filtering by the origin of the data to rule out unknown sources.
  2. Data encryption processes if required for security reasons.
  3. Simple metadata and traceability records through tags, storing the origin of the data, the date and time of loading, the format, the privacy and security level, the encryption algorithm, and other technical characteristics.
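The checks above can be sketched as a small landing function. This is a minimal illustration, not a production ingestion pipeline: the source allowlist, the path layout, and the metadata fields are all hypothetical.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical allowlist of known data origins
KNOWN_SOURCES = {"crm", "erp", "web_logs"}

def ingest(payload: bytes, source: str, fmt: str, raw_zone: Path) -> Path:
    """Run basic checks, tag metadata, and land the file in the raw data layer."""
    # Quality control: rule out data coming from unknown sources
    if source not in KNOWN_SOURCES:
        raise ValueError(f"unknown source: {source}")
    ts = datetime.now(timezone.utc)
    target = raw_zone / source / f"{ts:%Y%m%dT%H%M%S%f}.{fmt}"
    target.parent.mkdir(parents=True, exist_ok=True)
    # The payload is stored exactly as received, with no schema applied
    target.write_bytes(payload)
    # Traceability tags stored alongside the data itself
    meta = {
        "source": source,
        "loaded_at": ts.isoformat(),
        "format": fmt,
        "size_bytes": len(payload),
    }
    (target.parent / (target.name + ".meta.json")).write_text(json.dumps(meta))
    return target
```

Encryption, when required, would slot in just before the data is written, so that only encrypted bytes ever reach storage.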

Data Storage (Raw Data)

A layer without an established schema where all data, structured or unstructured, is stored without adaptation. Working with this layer requires expert analysts who perform data discovery using big data tools.

Data Processing (Trusted Zone)

Once data analysts have performed data discovery on the raw data, they may identify the need to process and adapt certain data sets and host them in a layer for recurrent use. Advanced data quality checks, integrity controls, and other adaptations take place in this layer to provide a trusted data exploration layer that other users can access.
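A promotion step from the raw layer to the trusted zone might look like the following sketch. The quality rules (required fields, numeric casting, deduplication) and the JSON-lines-to-CSV conversion are illustrative assumptions, not a prescribed format.

```python
import csv
import json
from pathlib import Path

def promote_to_trusted(raw_file: Path, trusted_zone: Path) -> Path:
    """Clean raw JSON-lines records and publish them as a typed CSV
    in the trusted zone (hypothetical quality rules)."""
    rows, seen = [], set()
    for line in raw_file.read_text().splitlines():
        rec = json.loads(line)
        # Quality rule: required fields must be present and non-empty
        if not rec.get("customer_id") or rec.get("amount") is None:
            continue
        # Integrity rule: drop duplicate (customer, order) records
        key = (rec["customer_id"], rec.get("order_id"))
        if key in seen:
            continue
        seen.add(key)
        rows.append({
            "customer_id": rec["customer_id"],
            "order_id": rec.get("order_id", ""),
            "amount": float(rec["amount"]),  # adapt to a known type
        })
    out = trusted_zone / (raw_file.stem + ".csv")
    out.parent.mkdir(parents=True, exist_ok=True)
    with out.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["customer_id", "order_id", "amount"])
        writer.writeheader()
        writer.writerows(rows)
    return out
```

The key design point is that this transformation happens only after a business need has been identified; until then, the raw records stay untouched in the storage layer.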

Data Access (Consumption Zone)

This is a more advanced layer where the data is finally made available to business analysts, who can generate reports and analyses to answer business questions and strengthen decision-making.
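At this stage, answering a business question can be as simple as aggregating a curated data set. The sketch below assumes the hypothetical trusted-zone CSV layout from earlier; the question itself (revenue per customer) is just an example.

```python
import csv
from collections import defaultdict
from pathlib import Path

def revenue_by_customer(trusted_csv: Path) -> dict:
    """Answer a business question from the consumption zone:
    total amount per customer (hypothetical report)."""
    totals = defaultdict(float)
    with trusted_csv.open() as f:
        for row in csv.DictReader(f):
            totals[row["customer_id"]] += float(row["amount"])
    return dict(totals)
```

In practice this layer is usually served through BI and reporting tools rather than ad hoc scripts, but the principle is the same: business users consume prepared, trustworthy data.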

Since these layers are optional, each organization can define its own strategy when implementing a data lake. However, regardless of the number of layers you choose to implement, there is always a basic common strategy: store all of the organization's data, whether it is currently in use or not, leaving it available for future needs that had not been identified when the data lake was developed.

In this way, as new business needs arise, data will move from the raw data layer to more advanced layers, making new data sets available to business users. That is why a data lake is a living, constantly evolving data repository. If your organization is looking for big data consulting or data architecture consulting, feel free to contact us.