Data lake: store your data without drowning in a lake of data
At a time when the mass of information generated by a business can grow by 50 to 150% from one year to the next, it makes sense to want to get the most out of it.
Many businesses are still put off by the infrastructures and architectures needed to manage Big Data, particularly what is often defined as its heart: the Data Lake.
What is a data lake? How does it differ from a data warehouse? Which data lake solutions should you choose? Read on for the answers.
What is a data lake? Definition
A data lake can be defined first and foremost as a reservoir of raw data, only minimally qualified, in structured or unstructured form. This data can include:
- extracts from relational databases,
- images,
- PDF files,
- feeds or events from business applications,
- semi-structured CSV files or logs, etc.
Why use a data lake? Advantages of a data lake
The first task of the data lake is to ingest this raw data en masse in order to preserve its history for future use:
- analysing changes in behaviour (of a customer or an application),
- predictive AI or machine learning,
- or, more pragmatically, monetising this information with new partners.
In addition to this main characteristic, there are other key criteria such as:
- its structuring, to make it navigable and avoid the data swamp,
- its elasticity, which will allow it to grow (and in theory shrink) at high speed in terms of storage and computing power,
- its security, to ensure that the data is used correctly.
Data lake, data warehouse: what's the difference?
Unlike the data lake, the primary aim of the data warehouse is to deliver refined data for a precise, recurring need. This requires solid aggregation performance and makes it possible to serve reporting, analysis and sometimes new business applications.
But with a cost per terabyte stored more than ten times that of a data lake, the data warehouse has reached its limits as the cornerstone of enterprise data.
How can we get the best of both worlds?
What data lake solutions should you consider?
Many large companies, having invested significant sums in their data warehouse, have decided to make a smooth transition to the data lake, with an on-premise solution and a custom-assembled set of tools to manage it.
An on-premise solution like the Hadoop data lake
The Apache Foundation has provided the Hadoop open-source framework, which is at the heart of the data lake's ability to ingest data en masse by parallelizing and distributing storage and processing.
This framework is enhanced by a large number of open-source tools that have made data lake implementation financially affordable (a minimal sketch of how two of these pieces fit together follows this list):
- Kafka for ingestion,
- YARN for resource allocation,
- Spark for high-performance processing,
- MongoDB as a NoSQL database,
- Elasticsearch and Kibana for content indexing and retrieval,
- and a plethora of other tools (graph databases, audit, security) that are emerging and sometimes disappearing as this market becomes more concentrated.
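To make this toolchain more concrete, here is a minimal sketch, assuming a hypothetical Kafka topic named raw-events, a local broker and HDFS paths of our own invention (none of these names come from the article), of how Spark Structured Streaming could land raw events in the lake as date-partitioned Parquet files, ready for later analysis or machine learning:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, current_date

# Hypothetical names: the topic, broker address and HDFS paths are
# illustrative assumptions, not values taken from this article.
spark = (SparkSession.builder
         .appName("raw-ingestion-sketch")
         .getOrCreate())

# Read raw events from Kafka without transforming them: the whole point of
# the lake is to keep the original payload for future, still-unknown uses.
raw_events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "localhost:9092")
              .option("subscribe", "raw-events")
              .load()
              .select(col("key").cast("string"),
                      col("value").cast("string"),
                      col("timestamp"))
              .withColumn("ingestion_date", current_date()))

# Land the stream as date-partitioned Parquet so history is preserved and
# later batch jobs (reporting, ML pipelines) can replay it cheaply.
query = (raw_events.writeStream
         .format("parquet")
         .option("path", "hdfs:///datalake/raw/events")
         .option("checkpointLocation", "hdfs:///datalake/checkpoints/events")
         .partitionBy("ingestion_date")
         .start())
```

Even this small example chains only two of the bricks listed above (Kafka and Spark, plus the spark-sql-kafka connector on the classpath), which gives a sense of how quickly a fully customised stack can grow.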
But in the end, the sheer number of tools and the possibility of creating an ultra-customised environment can lead to very high ownership costs, particularly if you're betting on a technology with an uncertain future.
Logically, then, we may prefer packaged solutions such as Cloudera, which has absorbed Hortonworks and retained an open-source distribution, while of course offering a paid model with better support.
A close partnership with IBM also aims to provide robust on-premise solutions.
MapR, taken over in 2019 by Hewlett Packard Enterprise, will be integrated into HPE GreenLake, a cloud solution designed to compete with the giants Amazon, Microsoft, Google and Oracle. These players are stepping up their partnerships, acquisitions and new developments to build cloud platforms that rival the best on-premise data analysis tools.
A cloud solution like the AWS or Azure data lake
Amazon AWS, Microsoft Azure, Google BigQuery and Oracle Cloud Infrastructure Data Flow all incorporate more or less sophisticated data management tools (migration, lineage, monitoring) and analysis tools (real-time transformation, aggregation, traditional analytics or AI models), but this time in the cloud.
The big advantage of the shared cloud is that it puts aside the hardware issue, which can quickly become a headache when you anticipate a large increase in data.
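As an illustration of how little infrastructure this requires on the customer side, here is a minimal sketch, assuming a hypothetical S3 bucket named acme-datalake-raw and a local folder of exports (both invented for the example), of how raw documents might be pushed into an AWS landing zone with boto3:

```python
import boto3
from pathlib import Path
from datetime import date

# Hypothetical bucket and prefix: purely illustrative assumptions.
BUCKET = "acme-datalake-raw"
PREFIX = f"landing/{date.today():%Y/%m/%d}"

s3 = boto3.client("s3")

# Push raw files (database extracts, PDFs, CSV logs...) as-is into the
# landing zone: no transformation, and storage elasticity is the cloud's problem.
for path in Path("./exports").iterdir():
    if path.is_file():
        key = f"{PREFIX}/{path.name}"
        s3.upload_file(str(path), BUCKET, key)
        print(f"uploaded {path.name} -> s3://{BUCKET}/{key}")
```

The equivalent against Azure Blob Storage or Google Cloud Storage is similarly short; the real differences show up in the management and analysis layers mentioned above.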
However, the shared cloud has shown its limitations, with cases of large-scale hacking. IBM's Private Cloud can guarantee the integrity of your data (industrial property, confidential contracts, etc.), and the Azure Stack solution offers an on-premise version of Microsoft's main tools in this area.
Teradata, another world leader in data warehousing, has also begun to move towards a cloud solution in the hope of winning back a customer base worn down by the costs of its powerful on-premise servers.
The challenge of good governance
All solutions have their advantages and disadvantages. You must not lose sight of your company's commitments to its customers (GDPR, industrial or professional secrecy) and must weigh them against this quest for elasticity, which can carry significant structural and human costs.
Assessing this balance must be part of the essential work of data governance, which must define and structure the data lake and therefore:
- provide a human, technical and technological framework for the data engineers who will be handling terabytes of data on a daily basis,
- facilitate the investigative work of data scientists for their AI and machine learning engines,
- enable users to trace and validate their sources to guarantee the results of their analyses.
This governance will make it possible to understand the real needs of your core business, while at the same time allowing data to be used more widely. The aim is to:
- develop new uses and a new understanding of data,
- provide your customers with the benefits of greater responsiveness, and even anticipation, in complete security.
Good governance can result in architectures that are complex at first sight, but which can be both technically and financially beneficial.
Choosing the data mesh for a successful big data transition
So, while the data lake may be useful, it does not necessarily mean that other data management structures will disappear: from the data swamp upstream, to the data warehouse and data marts downstream, right through to the dialogue between several of these structures in an international context, good data governance can, on the contrary, enable the range of tools to be broadened.
By encouraging dialogue between these data storage and processing elements, the company can make the most of each of them:
- historical systems that are considered indispensable and reliable will continue to work,
- and will be able to draw on the data lake, for example to archive cold data or to secure raw sources for better auditing and possible recovery, and so on (a short sketch follows this list).
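As an illustration of this dialogue, here is a minimal sketch, assuming a hypothetical JDBC-accessible warehouse table named sales, invented credentials and an object-store path of our own choosing (none of them taken from the article), of how cold partitions could be copied from the warehouse into the lake as Parquet:

```python
from pyspark.sql import SparkSession

# Hypothetical connection details, table and cutoff date: illustrative assumptions only.
spark = SparkSession.builder.appName("cold-data-archiving-sketch").getOrCreate()

# Read only the cold slice of the warehouse table over JDBC; wrapping the
# query in the dbtable option lets the warehouse filter before sending rows.
cold_sales = (spark.read
              .format("jdbc")
              .option("url", "jdbc:postgresql://warehouse:5432/dwh")
              .option("dbtable",
                      "(SELECT * FROM sales WHERE sale_date < DATE '2023-01-01') AS cold_sales")
              .option("user", "archiver")
              .option("password", "***")
              .load())

# Archive it in the lake as Parquet: cheap storage, still queryable for audits
# or recovery, and the warehouse can then drop what it no longer needs.
(cold_sales.write
 .mode("append")
 .parquet("s3a://acme-datalake/archive/sales"))
```

The warehouse keeps doing what it does best, while the lake quietly takes over history, auditability and recovery.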
This data mesh, in the context of strong governance, will prevent a company from ruining an existing system by embarking on an "all data lake" or even "all cloud" migration that is sometimes impractical and often inappropriate.
The data mesh will then be a guarantee of acceptance and success in the transition to Big Data.