Intro to Data Ingestion and Data Lakes
Flo Comuzzi
Posted on August 9, 2019
I landed in the data engineering space by a bit of luck and a bit of blind faith. As graduation approached, I got a job through a new grad program. When asked which areas I'd be interested in working in, I mentioned "Big Data" because a friend had told me her mentor advised her to pursue it. It's the hot thing right now, she said. Now, years later, I'm glad I chose this route because data engineering is a superset of software engineering with a focus on performance, and that can be super fun.
In this first post of the series, I'll go through what a data lake is and how it relates to data ingestion. I start with data ingestion because it gives a look into the work commonly done on data engineering teams and into trends in the field.
In the rest of the series, I look to give you a view into the concerns that mire my work life, and I'll go through how to plan, design, and build data pipelines. I hope you'll get a solid mix of business and engineering perspectives. I haven't seen much writing about data engineering that is accessible to many folks, so I also hope to provide some of that here and, of course, I am open to feedback!
This series is motivated by and dedicated to my greatest mentors. I am grateful to them for walking this long road alongside me.
What is a data ingestion pipeline?
With any data pipeline, the objective is getting data from A to B and sometimes even C, D, E, etc. Ingestion is a term specific to data lakes.
What is a data lake?
Well, a data warehouse is usually a small set of curated datasets for specific purposes. Data in data warehouses typically lives for a short time, e.g. 30 days, and is used very often. A data warehouse may fit into a traditional database.
To have good query performance, the working set (the set of records your query touches) should typically be able to fit into memory.
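As a rough back-of-the-envelope illustration, you can estimate whether a working set fits in memory by multiplying the number of rows a query touches by the average row width. All of the numbers below are made up for the example:

```python
# Back-of-the-envelope check: does a query's working set fit in memory?
# Every number here is hypothetical, just to illustrate the idea.

rows_touched = 50_000_000   # rows the query scans
avg_row_bytes = 200         # average width of a row in bytes
available_ram_gb = 64       # memory on the database host

working_set_gb = rows_touched * avg_row_bytes / 1024**3

print(f"working set: {working_set_gb:.1f} GB")
print("fits in memory" if working_set_gb < available_ram_gb else "spills to disk")
```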
In comparison, what we think of as "Big Data" (think any amount of data that doesn't all fit into memory at once) needs to be processed in some distributed way across several machines. We have indeed developed frameworks to do this kind of distributed processing, like MapReduce and Spark. Because "Big Data" by definition doesn't fit into the memory of a single machine, it typically doesn't fit in a traditional database either. It should live in another location... So goes the idea of a data lake.
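To make that concrete, here is a minimal PySpark sketch (the path and column names are hypothetical) that aggregates a dataset too big for any one machine; Spark splits the reads and the computation across the cluster's executors:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Spark distributes both storage reads and computation across executors,
# so no single machine ever needs to hold the full dataset in memory.
spark = SparkSession.builder.appName("big-data-example").getOrCreate()

# Hypothetical path to a large dataset spread across many files.
events = spark.read.parquet("s3://my-bucket/events/")

# Each executor aggregates its own partitions; partial results get merged.
daily_counts = (
    events
    .groupBy(F.to_date("event_time").alias("day"))
    .count()
    .orderBy("day")
)

daily_counts.show()
```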
Data lakes store massive amounts of data, typically historical data going back far in time. Whereas we would call data in a data warehouse hot because it is the most relevant and therefore the most used/queried, data in a data lake may not be so hot. This colder data can be stored on storage volumes that have slower retrieval times. Slower hardware (usually) = cheaper hardware...
For now, think of a data lake as a place where you store large amounts of data. Yes, there is more to the concept of a data lake, and you can read about it in The Enterprise Big Data Lake by Alex Gorelik.
Ok, so what's ingestion refer to?
Most data lakes are organized into several zones. For now, we will think of 3 zones:
- landing/dump zone
- raw zone
- transformed zone
A common pattern is to dump data in the dump zone and have something ingest the data into the raw zone. Data in the raw zone should remain as close to its original form as possible. Datasets that you have changed should go in the transformed zone.
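As a sketch of that pattern (the paths and dataset name here are hypothetical, and a real lake would likely use object storage rather than a local filesystem), an ingestion job moves landed files into a date-partitioned location in the raw zone without altering their contents:

```python
import shutil
from datetime import date
from pathlib import Path

# Hypothetical local layout standing in for the lake's zones.
LANDING = Path("/data/lake/landing/orders")
RAW = Path("/data/lake/raw/orders")

def ingest(run_date: date) -> None:
    """Move landed files into a date-partitioned raw location, unmodified."""
    target = RAW / f"ingest_date={run_date.isoformat()}"
    target.mkdir(parents=True, exist_ok=True)
    for src in LANDING.glob("*.csv"):
        # Copy byte-for-byte: the raw zone should preserve the original form.
        shutil.copy2(src, target / src.name)
        src.unlink()  # clear the landing zone once ingested

ingest(date.today())
```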
So, ingestion refers to the process of bringing in data from some location into the raw zone of the data lake, where it can be queried, using technology like Presto and Hive, for understanding so that other datasets can be built off of it.
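For example, once the raw data is registered as a table, exploring it from Python might look like this minimal sketch using the PyHive client; the host, port, schema, and table name are all assumptions for illustration:

```python
from pyhive import presto

# Connection details are hypothetical; point these at your Presto coordinator.
conn = presto.connect(host="presto.internal", port=8080,
                      catalog="hive", schema="raw")
cursor = conn.cursor()

# Explore a raw dataset's shape before building other datasets on top of it.
cursor.execute("SELECT order_status, COUNT(*) AS n FROM orders GROUP BY order_status")
for row in cursor.fetchall():
    print(row)
```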
Ok, now show me how to do the thing...
In the next post, I'll go through how to think through the design of a data ingestion pipeline!