Data Evolution - Databases to Data Lakehouse
Muhammad Adnan Khan
Posted on January 19, 2024
In this blog post, we will discuss the evolution of data and data analytics solutions, and how fast things have changed recently. We will start from the granular details so that the concepts introduced later are easier to understand.
Data is the new oil!
Let's first understand what data is and how it became useful for so many organizations.
Data is a term often used to describe information that can be stored in some format and transmitted. It can take the form of text, numbers, or facts.
It is not a new term: our ancestors used data in different forms, whether as oral tradition, in written form on paper, or, today, in electronic form stored somewhere.
Before the invention of writing, people carried information orally, as stories, knowledge, and history passed from generation to generation. Later, this information was written down on stone and leather, and with the invention of the printing press in the 15th century it was stored in books and documents. The medium kept changing over time: from the printing press to library catalogs, then punch cards, early computers, and databases. Now we are in the era of big data, where everyone has a personal device, every click generates data, and that data is stored somewhere in the world.
Around 328.77 million terabytes of data were generated each day in 2023, roughly 120 zettabytes for the year, and this is expected to rise to 180 zettabytes by 2025.
Welcome to the world of data
Now that you know the history of data and how fast it has evolved, consider how organizations have used it to gain an edge over their competitors.
The data you generate is used by organizations to generate profit. Every industry uses it, whether it's a social media platform, an e-commerce store, or a movie platform. They track your history, analyze the patterns, and make recommendations to keep you engaged on their platform and to sell their content or products. And it's not just these industries: data has use cases in healthcare, oil & gas, pharma, and almost any other industry you can name.
This is why data is called the new oil: it drives the world.
The Big Data Era
Data processing and analytics systems have evolved over several decades. In the 1980s, data was processed in nightly batch runs.
With the increase in the use of databases, organizations found themselves running tens or even hundreds of databases to support the business. These were transactional (OLTP) databases. As a result, in the 1990s data warehousing came into the picture for analytical purposes.
The early 21st century witnessed the era of big data, when data grew exponentially and in different formats (structured, unstructured, and semi-structured), produced by modern digital platforms: mobile, web, sensors, IoT devices, social media, and many others. All of it needed to be stored somewhere so analysis could be performed on it. In the early 2010s, a new technology for big data processing became popular: Hadoop, an open-source framework for processing large-scale datasets on clusters of computers. These clusters contain machines with attached disks that manage terabytes of data under a single distributed file system, the Hadoop Distributed File System (HDFS). The main bottleneck of these on-prem Hadoop and Spark systems is scalability, which requires a high upfront investment, along with other factors like latency, hardware management, and general complexity.
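To make the processing model concrete, here is a minimal PySpark sketch of a distributed aggregation over files in HDFS; the path and column names are hypothetical, not from a real dataset.

```python
# A minimal PySpark sketch: counting events per user across a large,
# distributed dataset. The HDFS path and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("event-counts").getOrCreate()

# Spark reads the files in parallel across the cluster; each node
# processes the blocks of data stored on its local disks.
events = spark.read.json("hdfs:///data/clickstream/2023/*.json")

counts = (
    events
    .groupBy("user_id")                      # shuffle by key across nodes
    .agg(F.count("*").alias("click_count"))  # aggregate in parallel
)

counts.write.parquet("hdfs:///output/click_counts", mode="overwrite")
```

Each stage runs across the whole cluster, which is what made this model attractive for datasets too large for a single machine.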
During this time, cloud-based data warehouses (Redshift, BigQuery, Snowflake, and Synapse) came into the picture. They involved fewer management tasks, resolved the scalability and latency issues, and offered a usage-based cost model.
After that, the modern data stack started to evolve toward a data lake architecture built on highly durable, inexpensive, and virtually limitless cloud object stores, where you can store any type of data without any transformation. Data lakes became the single source of truth for organizations. In this approach, all the data is ingested into the data lake, and a hot subset of the data is moved from the data lake to the data warehouse to support low-latency queries.
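Here is a minimal boto3 sketch of that pattern, assuming a hypothetical bucket, table, cluster, and IAM role: raw events land in the S3 data lake untransformed, and a hot subset is copied into Redshift for low-latency queries.

```python
import json
import boto3

s3 = boto3.client("s3")

# 1. Ingest raw data into the data lake as-is, with no transformation.
#    The bucket name and key layout are hypothetical.
event = {"user_id": "u123", "action": "click", "ts": "2024-01-19T10:00:00Z"}
s3.put_object(
    Bucket="my-data-lake",
    Key="raw/clickstream/2024/01/19/event-0001.json",
    Body=json.dumps(event),
)

# 2. Load a "hot" subset into the warehouse for low-latency queries.
#    Cluster, database, target table, and IAM role are placeholders.
redshift = boto3.client("redshift-data")
redshift.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="analytics",
    DbUser="etl_user",
    Sql="""
        COPY clickstream_hot
        FROM 's3://my-data-lake/raw/clickstream/2024/01/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopyRole'
        FORMAT AS JSON 'auto';
    """,
)
```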
By integrating the best capabilities of both the data warehouse and the data lake, a new architecture came into the picture: the data lakehouse. It overcomes the bottlenecks of both, supporting any type of data along with ACID transactions and low latency, which a data lake alone can't provide.
Now let's go through each of the concepts defined above and the associated AWS services.
OLTP (Online Transaction Processing)
The source systems where business transactions are stored.
AWS services: RDS, Aurora, DynamoDB, and others.
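As a quick illustration, here is a minimal OLTP-style write and read against DynamoDB with boto3; the table and attribute names are hypothetical.

```python
import boto3

# A minimal OLTP-style interaction: recording a single business
# transaction in DynamoDB. Table and attribute names are hypothetical.
dynamodb = boto3.resource("dynamodb")
orders = dynamodb.Table("orders")

orders.put_item(
    Item={
        "order_id": "ord-1001",   # partition key
        "customer_id": "cust-42",
        "amount_usd": "59.99",
        "status": "PLACED",
    }
)

# OLTP systems are optimized for many small reads and writes like these.
response = orders.get_item(Key={"order_id": "ord-1001"})
print(response.get("Item"))
```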
OLAP (Online Analytical Processing)
The systems used for analytical purposes.
AWS service: Amazon Redshift.
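For contrast, here is a minimal OLAP-style aggregation run against Redshift through the Data API; the cluster, database, and table names are placeholders, and the query assumes orders have already been loaded into the warehouse.

```python
import boto3

# A minimal OLAP-style query: an aggregation over many rows, run
# against Redshift through the Data API. Identifiers are placeholders.
client = boto3.client("redshift-data")

resp = client.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="analytics",
    DbUser="analyst",
    Sql="""
        SELECT customer_id, SUM(amount_usd) AS total_spend
        FROM orders
        GROUP BY customer_id
        ORDER BY total_spend DESC
        LIMIT 10;
    """,
)
print(resp["Id"])  # statement id; results come from get_statement_result
```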
ETL (Extract, Transform, Load)
The process used to transfer data from OLTP to OLAP systems.
AWS services: AWS Glue and AWS Data Pipeline.
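Here is a minimal sketch of what a Glue (PySpark) ETL job can look like, assuming a hypothetical catalog database, table, and output path.

```python
# A minimal AWS Glue (PySpark) job sketch: extract from a cataloged
# source table, transform, and load Parquet into S3. All names are
# hypothetical.
from awsglue.context import GlueContext
from awsglue.transforms import ApplyMapping
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Extract: read a source table registered in the Glue Data Catalog.
orders = glue_context.create_dynamic_frame.from_catalog(
    database="oltp_exports", table_name="orders"
)

# Transform: keep and rename only the columns the warehouse needs.
mapped = ApplyMapping.apply(
    frame=orders,
    mappings=[
        ("order_id", "string", "order_id", "string"),
        ("amount_usd", "double", "amount", "double"),
    ],
)

# Load: write the result to S3 in an analytics-friendly format.
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-data-lake/curated/orders/"},
    format="parquet",
)
```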
Data warehouse
A single source of truth that stores structured data only, with ACID properties, used for analytical purposes.
AWS service: Redshift.
Data lake
A central repository that stores data from multiple source systems in any structure. It doesn't support ACID transactions and has high latency.
AWS service: S3.
Data lakehouse
Combines the best capabilities of the data warehouse and the data lake, with support for ACID transactions, low latency, and any type of data.
AWS services: Redshift Spectrum and Lake Formation.
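To show the lakehouse idea in practice, here is a hedged Redshift Spectrum sketch: an external schema maps S3 data through the Glue Data Catalog so it can be joined with warehouse tables in place. All identifiers (cluster, schemas, tables, IAM role) are placeholders.

```python
import boto3

# A sketch of the lakehouse pattern with Redshift Spectrum: an external
# schema points at the Glue Data Catalog, so files in S3 can be queried
# and joined with warehouse tables in place. Names are placeholders.
client = boto3.client("redshift-data")

client.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="analytics",
    DbUser="analyst",
    Sql="""
        CREATE EXTERNAL SCHEMA IF NOT EXISTS lake
        FROM DATA CATALOG DATABASE 'oltp_exports'
        IAM_ROLE 'arn:aws:iam::123456789012:role/SpectrumRole';
    """,
)

client.execute_statement(
    ClusterIdentifier="analytics-cluster",
    Database="analytics",
    DbUser="analyst",
    Sql="""
        -- Join warehouse data (orders) with lake data (lake.clickstream)
        SELECT o.customer_id, COUNT(c.event_id) AS clicks
        FROM orders o
        JOIN lake.clickstream c ON c.user_id = o.customer_id
        GROUP BY o.customer_id;
    """,
)
```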
I will continue this series by exploring each of the AWS services mentioned above in depth: their architecture, how they work, and how to combine multiple services to build a data warehouse, a data lake, and a data lakehouse on AWS.