For those who are used to developing traditional data warehouses, the shift in focus is evident. In a simple data lake, there is no need to clean, validate, and prepare the data before it is loaded, nor to adapt it to a specific schema or apply corporate integration criteria, as would be required when loading it into a data warehouse.
Herein lies the great difference between the two approaches: in a traditional data warehouse, the data is cleansed, validated, and adapted to the predefined schema of the data model into which it is loaded (schema-on-write). In a data lake, the data is initially stored regardless of its schema and characteristics, and is not adapted or processed until a real business need has been identified (schema-on-read).
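A minimal sketch of this contrast in Python; the field names, file paths, and helper functions are illustrative assumptions, not part of any particular product. The warehouse-style path rejects records that do not fit a fixed schema before loading, while the lake-style path stores the raw payload untouched and interprets it only when a consumer reads it.

```python
import json

# --- Schema-on-write (warehouse style): validate before loading ---
REQUIRED_FIELDS = {"customer_id": int, "amount": float}  # assumed schema

def load_into_warehouse(record: dict) -> dict:
    """Reject any record that does not match the predefined schema."""
    for field, expected_type in REQUIRED_FIELDS.items():
        if not isinstance(record.get(field), expected_type):
            raise ValueError(f"schema violation on field {field!r}")
    return record  # only clean, conformed records get this far

# --- Schema-on-read (lake style): store as-is, interpret later ---
def store_in_lake(raw_payload: bytes, path: str) -> None:
    """Persist the payload untouched; no schema is enforced here."""
    with open(path, "wb") as f:
        f.write(raw_payload)

def read_from_lake(path: str) -> dict:
    """Apply a schema only at consumption time, when a need is known."""
    with open(path, "rb") as f:
        return json.loads(f.read())  # interpretation happens on read
```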
Common Layers of a Data Lake
Temporary Loading (Landing Zone): A temporary loading layer in which the data passes basic checks before being stored in the raw data layer. Although not required, it can be implemented to perform the following (see the sketch after this list):
- Basic quality controls, such as filtering data by its origin to rule out unknown sources.
- Data encryption, if required for security reasons.
- Simple metadata and traceability records, stored as tags: the origin of the data, the date and time of loading, the format and other technical characteristics, its privacy and security level, the encryption algorithm used, etc.
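A minimal sketch of such a landing-zone check in Python; the set of known sources, the tag fields, and the function name are illustrative assumptions.

```python
import hashlib
from datetime import datetime, timezone

# Illustrative assumption: the set of origins this lake accepts.
KNOWN_SOURCES = {"crm", "erp", "web_clickstream"}

def land_payload(payload: bytes, source: str, fmt: str) -> dict:
    """Run basic landing-zone checks and build metadata tags.

    Returns the tags to store alongside the raw payload; raises if
    the payload comes from an unknown source.
    """
    # Basic quality control: rule out unknown sources.
    if source not in KNOWN_SOURCES:
        raise ValueError(f"unknown source {source!r}, payload rejected")

    # Simple metadata / traceability tags.
    return {
        "source": source,
        "loaded_at": datetime.now(timezone.utc).isoformat(),
        "format": fmt,
        "size_bytes": len(payload),
        "checksum_sha256": hashlib.sha256(payload).hexdigest(),
        "privacy_level": "internal",   # assumption: default classification
        "encrypted": False,            # an encryption step would set this
    }
```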
Data Storage (Raw Data): The layer in which all the data is stored permanently, in its original format and without any transformation, exactly as it was received.
Data Processing (Trusted Zone): The layer in which, once a real business need has been identified, the relevant raw data is cleaned, validated, and transformed into a trusted, usable form.
Data Access (Consumption Zone): The layer in which the processed data is made available to the analysts, applications, and reporting tools that consume it.
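A compact sketch of how data might flow through these three layers, again in Python; the directory layout, the cleaning rule, and the function names are assumptions for illustration, with a local directory standing in for object storage.

```python
import csv
import io
import json
from pathlib import Path

LAKE = Path("datalake")  # assumption: local directory in place of object storage

def store_raw(payload: bytes, source: str, name: str) -> Path:
    """Raw data layer: persist the payload exactly as received."""
    path = LAKE / "raw" / source / name
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(payload)
    return path

def promote_to_trusted(raw_path: Path) -> Path:
    """Trusted zone: clean and validate once a business need exists.

    Illustrative rule: keep only CSV rows that carry a customer_id.
    """
    rows = list(csv.DictReader(io.StringIO(raw_path.read_text())))
    clean = [r for r in rows if r.get("customer_id")]
    out = (LAKE / "trusted" / raw_path.parent.name / raw_path.stem).with_suffix(".json")
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(clean))
    return out

def read_for_consumption(trusted_path: Path) -> list[dict]:
    """Consumption zone: expose trusted data to analysts and tools."""
    return json.loads(trusted_path.read_text())

# Usage: land a small CSV, promote it, and consume it.
raw = store_raw(b"customer_id,amount\n42,9.99\n,3.50\n", "crm", "orders.csv")
trusted = promote_to_trusted(raw)
print(read_for_consumption(trusted))  # [{'customer_id': '42', 'amount': '9.99'}]
```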
Since these layers are optional, each organization can define its own strategy when implementing a data lake. However, regardless of the number of layers you choose to implement, there is always one basic common strategy: store all of the organization's data, whether it is currently in use or not, leaving it available for possible future needs that were not identified at the time the data lake was developed.