Our client is an online virtual world and social networking site with many MySQL database servers. All these servers are used in production to work their social game platform.
But there was a problem: developers had to build analytics from data in these MySQL databases. But forming analytics directly on those servers would lead to a slowing down or a complete stop of requests to MySQL made by their social game. End customers and players would suffer, and the game would have slowed down. There are a lot of database servers, dozens of hosts, and an analytical request to everyone would also be challenging.
A new generation of technology allowing us to work with "data lakes" as tables is currently represented by Apache Hudi and Delta Lake technologies. They take a similar approach to using metadata to do "hard" work. Metadata structures are used to define:
what is the table
what is the diagram of the table
how the table is divided
what data files make up the table.
Delta Lake allows us to create open-source format tables. This makes it possible to perform analytics using an open architecture in a "data lake" using various mechanisms and tools.
Hudi (Hoodie) is a multifunctional platform for creating lakes of streaming data with incremental data pipelines at the level of a self-managed DBMS with optimization of regular batch processing.
Both products support adding, deleting, modifying, and validating data stored in HDFS. Of course, not only is HDFS supported but other HDFS-compatible storage systems, such as S3, are also supported.
Apache Hudi and Delta Lake are best-in-class formats explicitly designed for data lakes. All of them solve three problems:
consistency of updates
scalability of data and metadata.
Debezium represents the CDC (Change Data Capture) software category, a set of connectors for various DBMS compatible with the Apache Kafka Connect framework. It allows users to take advantage of the critical advantages of this Big Data platform: fault tolerance, scalability, and reliable real-time processing of large amounts of data.
So our solution is to upload data to the analytical storage using Hudi.
This allows developers to collect live business data in real time from many independent MySQL servers and send it to Data Lake for further analysis. Therefore, production databases MySQL do not slow down and do not subside when performing analytical tasks.