Data Engineering Roadmap 2023
Muinde Esther Ndunge
Posted on November 2, 2023
Introduction
Data engineering is a crucial field within the broader realm of data science and analytics. It involves the collection, transformation, and storage of data to make it accessible and useful for analysis. As a beginner in data engineering, you may feel daunted and wonder how to get started and build a successful career in this dynamic and in-demand field. This roadmap will guide you through the essential steps and concepts you need to master as you embark on your data engineering journey.
Data engineers use tools such as Java to build APIs, Python to write ETL pipelines, and SQL to access data in source systems and move it to target locations.
This roadmap has been broken down into monthly deliverables.
Month 1: Basics of Programming
The first thing to master as a data engineer is a programming language. The most common choice is Python, which is a good language to kickstart your data engineering journey.
Python is versatile: it is easy to use, has many supporting libraries, and is used across most data engineering processes.
- Understand Python basics: operators, variables, and data types
- Learn to work with data files, including Python libraries like pandas, which are widely used for reading and manipulating data
- Learn the basics of relational databases, such as SQL Server, MySQL, or PostgreSQL
Learn the fundamentals of computing:
- Master version control with Git and GitHub
- Focus on shell scripting in Linux. You'll use shell scripts for cron jobs, setting up environments, and similar tasks
- Learn web scraping, which is often part of a data engineer's job. Sometimes we need to extract data from websites that do not offer a straightforward API
Month 2: Databases
Relational databases are one of the most common components used for data storage. You need a good understanding of them to work with large amounts of data.
One needs to master the following:
- Keys in SQL
- Joins in SQL
- Ranking window functions
- Normalization
- Aggregations
- Data wrangling and analysis
- Data modeling for warehouse
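Several of the topics above (keys, joins, aggregations, and ranking window functions) can be tried together in one small query. The following is a minimal sketch using Python's built-in sqlite3 module, assuming a SQLite version with window function support (3.25 or later). The table names, column names, and sample rows are invented for illustration.

```python
import sqlite3

# In-memory database; tables, columns, and rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,  -- primary key
        name        TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customers(customer_id),  -- foreign key
        amount      REAL NOT NULL
    );
    INSERT INTO customers VALUES (1, 'Amina'), (2, 'Brian'), (3, 'Chao');
    INSERT INTO orders VALUES (1, 1, 50.0), (2, 1, 20.0), (3, 2, 70.0);
""")

# LEFT JOIN keeps customers that have no orders.
# COUNT and SUM aggregate per customer.
# RANK() is a window function that ranks customers by total spend.
query = """
    SELECT c.name,
           COUNT(o.order_id)          AS order_count,
           COALESCE(SUM(o.amount), 0) AS total_spent,
           RANK() OVER (ORDER BY COALESCE(SUM(o.amount), 0) DESC) AS spend_rank
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.customer_id
    GROUP BY c.customer_id, c.name
    ORDER BY spend_rank, c.name
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Two customers tie on total spend in the sample data, so RANK() gives them the same rank and skips the next one. Swapping in DENSE_RANK() or ROW_NUMBER() is a quick way to see how the ranking functions differ.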
Month 3: Cloud Computing
Learn about cloud platforms that deliver computing services over the internet.
The three main choices available are:
- Amazon Web Services(AWS)
- Microsoft Azure
- Google Cloud Platform(GCP)
You can pick any cloud platform. Once you learn one, it will be easier to master the others. The fundamental concepts are similar, with differences mainly in the user interface, cost, and other factors.
At this point, you understand the basics of programming, SQL, web scraping, and APIs. This is enough to work on your first project, which could be bringing in data from a website, transforming it using Python, and storing it in a relational database. You can then move the data to the cloud platform you have chosen to work with.
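A first project along these lines could be sketched as three small functions: extract, transform, and load. The example below uses only the Python standard library. The HTML snippet stands in for a page you would normally download, and the `products` table, the `TableCellParser` class, and the sample values are all hypothetical.

```python
import sqlite3
from html.parser import HTMLParser

# Stand-in for a downloaded web page (hypothetical content).
SAMPLE_HTML = """
<table>
  <tr><td>Laptop</td><td>$1,200.00</td></tr>
  <tr><td>Mouse</td><td>$25.50</td></tr>
</table>
"""

class TableCellParser(HTMLParser):
    """Collects the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and data.strip():
            self.rows[-1].append(data.strip())

def extract(html):
    """Extract: pull raw table rows out of the HTML."""
    parser = TableCellParser()
    parser.feed(html)
    return parser.rows

def transform(raw_rows):
    """Transform: turn a price string like "$1,200.00" into a float."""
    return [(name, float(price.replace("$", "").replace(",", "")))
            for name, price in raw_rows]

def load(records, conn):
    """Load: store the cleaned records in a relational table."""
    conn.execute("CREATE TABLE IF NOT EXISTS products (name TEXT, price REAL)")
    conn.executemany("INSERT INTO products VALUES (?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(SAMPLE_HTML)), conn)
stored = conn.execute("SELECT name, price FROM products ORDER BY name").fetchall()
print(stored)
```

SQLite keeps the sketch self-contained. In a real project you would point the load step at MySQL, PostgreSQL, or a managed database on your chosen cloud platform.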
Month 4: Data Processing
Learn how to process big data. Big data processing comes in two forms: batch and streaming. Such data-intensive workloads need specialized tools, and one of the most popular is Apache Spark. Focus on the following when learning Apache Spark:
- Spark architecture
- RDDs in Spark
- Working with Spark Dataframes
- Understand Spark Execution
- Broadcast and Accumulators
- Spark SQL
Learn to build ETL pipelines using PySpark and data preprocessing libraries like NumPy and pandas.
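Before installing Spark, it can help to see the processing model it is built on: data is split into partitions, each partition is transformed independently, and records with the same key are then brought together and combined. The sketch below imitates that model in plain Python. It does not use the Spark API, and the function names are invented for illustration.

```python
from collections import defaultdict

def split_into_partitions(records, num_partitions):
    """Spread records round-robin, as a cluster would spread them over workers."""
    partitions = [[] for _ in range(num_partitions)]
    for i, record in enumerate(records):
        partitions[i % num_partitions].append(record)
    return partitions

def map_partition(lines):
    """Per-partition step: each partition is processed on its own."""
    return [(word.lower(), 1) for line in lines for word in line.split()]

def reduce_by_key(mapped_partitions):
    """Shuffle step: values with the same key are gathered and combined."""
    grouped = defaultdict(list)
    for partition in mapped_partitions:
        for key, value in partition:
            grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}

lines = ["spark makes big data simple", "big data needs big tools"]
partitions = split_into_partitions(lines, num_partitions=2)
word_counts = reduce_by_key([map_partition(p) for p in partitions])
print(word_counts)
```

In PySpark, the same word count is usually written with the RDD methods `flatMap`, `map`, and `reduceByKey`, and Spark takes care of partitioning and of moving data between workers.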
Month 5: Big Data Engineering
Here we will build on what we did during the previous month. Learn big data engineering with Spark, optimization in Spark, and workflow scheduling.
The ETL pipelines you build to get data into databases and data warehouses must themselves be managed. We need a workflow scheduling tool to run pipelines and handle errors.
Learn the following concepts in Apache Airflow:
- DAGs
- Task dependencies
- Operators
- Scheduling
- Branching
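The idea behind DAGs and task dependencies can be sketched without Airflow: a DAG lists which tasks must finish before another may start, and the scheduler runs tasks in an order that respects those dependencies. The example below uses Python's standard graphlib module (Python 3.9 or later). It is not Airflow code, and the task names are hypothetical.

```python
from graphlib import TopologicalSorter

# Each key is a task; the set holds its upstream tasks (what it depends on).
# Task names are hypothetical.
dependencies = {
    "extract": set(),
    "validate": {"extract"},
    "transform": {"extract"},
    "load": {"validate", "transform"},
}

def run_task(name, log):
    # A real task would run an operator: a SQL statement, a Python
    # function, a shell command, and so on.
    log.append(name)

def run_dag(dependencies):
    """Run every task once, only after all of its upstream tasks have run."""
    log = []
    for task in TopologicalSorter(dependencies).static_order():
        run_task(task, log)
    return log

execution_order = run_dag(dependencies)
print(execution_order)
```

Here `validate` and `transform` both depend only on `extract`, so a scheduler is free to run them in either order or in parallel, while `load` always runs last. Airflow adds scheduling, retries, and monitoring on top of this dependency model.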
Month 6: Data warehousing
Getting data into databases is one thing; the challenge is aggregating and storing data in a central repository. You will first need to understand the differences between a database, a data warehouse, and a data lake, as well as the differences between OLTP and OLAP.
There are several data warehousing tools available:
- Redshift
- Databricks
- Snowflake
Month 7: Handling data streaming
Data streaming is the continuous flow of data as it is generated, enabling real-time processing and analysis for immediate insights.
To ensure that data is ingested reliably while it is being generated, we use Apache Kafka.
- Learn Kafka architecture
- Learn about Producers and Consumers
- Create topics in Kafka
There are other tools for streaming data, such as AWS Kinesis. Again, you are not limited to a single tool.
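The roles of topics, producers, and consumers can be illustrated with a toy in-memory model: a topic is an append-only log, producers append messages to it, and each consumer remembers the offset it has read up to. This is a simplified sketch, not the Kafka client API. Real Kafka topics are split into partitions and replicated across brokers, and the class names here are invented.

```python
class InMemoryBroker:
    """Toy stand-in for a broker: each topic is an append-only list."""

    def __init__(self):
        self.topics = {}

    def create_topic(self, name):
        self.topics.setdefault(name, [])

    def append(self, topic, message):
        self.topics[topic].append(message)
        return len(self.topics[topic]) - 1  # offset of the new message

    def read(self, topic, offset):
        return self.topics[topic][offset:]

class Producer:
    def __init__(self, broker):
        self.broker = broker

    def send(self, topic, message):
        return self.broker.append(topic, message)

class Consumer:
    """Tracks its own offset so it only receives messages it has not seen."""

    def __init__(self, broker, topic):
        self.broker = broker
        self.topic = topic
        self.offset = 0

    def poll(self):
        messages = self.broker.read(self.topic, self.offset)
        self.offset += len(messages)
        return messages

broker = InMemoryBroker()
broker.create_topic("page_views")  # hypothetical topic name
producer = Producer(broker)
consumer = Consumer(broker, "page_views")

producer.send("page_views", {"user": 1, "page": "/home"})
producer.send("page_views", {"user": 2, "page": "/pricing"})
first_batch = consumer.poll()

producer.send("page_views", {"user": 1, "page": "/signup"})
second_batch = consumer.poll()
print(first_batch, second_batch)
```

Because the log is never modified in place, a second consumer starting at offset 0 would see all three messages, which is what lets several downstream systems read the same stream independently.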
Month 8: Processing streaming data
After learning how to ingest streaming data, learn how to process it in real time. You can do this with Kafka, but it is not as flexible for ETL purposes as Spark Streaming.
Focus on:
- DStreams
- Stateless vs. Stateful transformation
- Checkpointing
- Structured Streaming
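The difference between stateless and stateful transformations, and the reason checkpointing exists, can be shown with a few micro-batches in plain Python. This is a conceptual sketch rather than Spark Streaming code. The event fields, function names, and the JSON checkpoint file are assumptions made for illustration.

```python
import json
import os
import tempfile

def stateless_filter(batch):
    """Stateless: the result depends only on the current micro-batch."""
    return [event for event in batch if event["amount"] > 0]

def stateful_running_total(batch, state):
    """Stateful: totals carry over from earlier micro-batches via `state`."""
    for event in batch:
        state[event["user"]] = state.get(event["user"], 0) + event["amount"]
    return state

def save_checkpoint(state, path):
    with open(path, "w") as f:
        json.dump(state, f)

def load_checkpoint(path):
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)

micro_batches = [
    [{"user": "a", "amount": 10}, {"user": "b", "amount": -5}],
    [{"user": "a", "amount": 7}, {"user": "b", "amount": 3}],
]

checkpoint_path = os.path.join(tempfile.mkdtemp(), "state.json")
for batch in micro_batches:
    # Reloading on every batch mimics recovering state after a restart.
    state = load_checkpoint(checkpoint_path)
    state = stateful_running_total(stateless_filter(batch), state)
    save_checkpoint(state, checkpoint_path)

final_state = load_checkpoint(checkpoint_path)
print(final_state)
```

The filter could be restarted at any time without losing anything, while the running total is only correct if its state survives between batches. That is the problem checkpointing solves in a streaming engine.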
Month 9: Data transformation
Every data engineer has to transform data into a form that the other members of the organization can use. Data transformation tools make it easy for data engineers to do so.
Focus on dbt, as many companies use it:
- Learn how to use compiler and runner components
- Model data transformation
Month 10: Reporting and Dashboards
This is usually the end product of the data work: the data has been transformed, insights have been derived from it, and it is ready to be presented to stakeholders. You can use any tool to visualize data and create dashboards. Such tools include:
- Power BI
- Tableau
- Looker
Month 11: NoSQL
Relational databases require structured data, and queries can slow down on very large datasets, which is one reason NoSQL databases exist. These databases can handle both structured and unstructured data.
You can focus on learning one NoSQL database, such as MongoDB, since it is widely used in the industry and is easy to learn.
Focus on:
- CAP theorem
- CRUD operations
- Documents and Collections
- Working with different types of operators
- Aggregation Pipeline
- Sharding and Replication in MongoDB
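To get a feel for documents, collections, and the aggregation pipeline before installing MongoDB, the idea can be mimicked in plain Python. The pipeline below is written in MongoDB's stage syntax (`$match`, then `$group` with `$sum`), but the `run_pipeline` evaluator is a toy written for this example. It supports only equality matches and a single sum, and the collection contents are invented.

```python
from collections import defaultdict

# A collection holds documents; here, a list of dicts with made-up fields.
orders = [
    {"_id": 1, "city": "Nairobi", "status": "paid", "amount": 40},
    {"_id": 2, "city": "Mombasa", "status": "paid", "amount": 25},
    {"_id": 3, "city": "Nairobi", "status": "pending", "amount": 90},
    {"_id": 4, "city": "Nairobi", "status": "paid", "amount": 10},
]

# Each stage transforms the output of the previous one: filter, then group.
pipeline = [
    {"$match": {"status": "paid"}},
    {"$group": {"_id": "$city", "total": {"$sum": "$amount"}}},
]

def run_pipeline(documents, pipeline):
    """Toy evaluator: equality $match and $group with one $sum field."""
    for stage in pipeline:
        if "$match" in stage:
            criteria = stage["$match"]
            documents = [d for d in documents
                         if all(d.get(k) == v for k, v in criteria.items())]
        elif "$group" in stage:
            spec = stage["$group"]
            key_field = spec["_id"].lstrip("$")
            out_name = next(k for k in spec if k != "_id")
            sum_field = spec[out_name]["$sum"].lstrip("$")
            totals = defaultdict(int)
            for d in documents:
                totals[d[key_field]] += d[sum_field]
            documents = [{"_id": k, out_name: v} for k, v in totals.items()]
        else:
            raise ValueError(f"unsupported stage: {stage}")
    return documents

result = run_pipeline(orders, pipeline)
print(result)
```

With a real MongoDB instance and the PyMongo driver, the same `pipeline` list would be passed to a collection's `aggregate` method, and the server would do the filtering and grouping.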
Month 12: Building projects
Even though you will build projects in each step, by now you have an understanding of the essential tools in data engineering. To showcase your skills, build a capstone project and keep learning.
Conclusion
This breakdown allows you to progressively build your data engineering skills over the year. You can adjust the pace of your learning based on your personal preferences and the time you have available. Consistent practice and hands-on experience will be crucial in mastering the field of data engineering.