Practical data engineering concepts and skills
Hunter Johnson
Posted on August 30, 2022
Data engineers are the backbone of modern data-driven businesses. They are responsible for wrangling, manipulating, and streaming data to fuel insights and better decision-making. So, what skills and concepts do data engineers use in order to be successful?
Today, we'll be going over what data engineers do, their role in a data-driven business, and the skills, concepts, and tools they use in day-to-day operations.
Data engineering is a rapidly growing field, and these skills are in high demand, so if you're looking to make a career change and become a data engineer or develop your existing skill set, this is the article for you.
Let's dive right in!
We'll cover:
- What is a data engineer, and what do they do?
- Data engineer responsibilities
- How do data engineers support decision-making?
- Processes, concepts, and skills for data engineering
- Wrapping up and next steps
What is a data engineer, and what do they do?
Data engineers sit at the intersection of software engineering and data science: they collect raw data and turn it into clean, usable data that other data professionals can draw insights from.
Data engineer responsibilities
A data engineer’s responsibilities include, but are not limited to:
- Collecting raw data from a variety of sources to process and store in a data repository
- Selecting the best type of database, storage system, and cloud architecture/platform for each project
- Designing, maintaining, and optimizing systems for data ingestion, processing, warehousing, and analysis
- Ensuring that data is highly available, secure, and compliant with organizational standards
- Automating and monitoring data pipelines to ensure timely delivery of insights
How do data engineers support decision-making?
Data engineers play a critical role in data-driven decision-making by ensuring that data is high quality, easily accessible, and trustworthy. If the data they provide is inaccurate or of poor quality, then an organization runs the risk of making bad decisions that can have costly consequences.
For data scientists and analysts to do their job, they need access to high-quality data that has been cleaned and processed by data engineers. This data needs to be correctly structured and formatted to an organization's standards so that it can be analyzed easily.
Data engineers enable both data scientists and analysts to focus on their jobs by taking care of the tedious and time-consuming tasks of data preparation and processing.
Processes, concepts, and skills for data engineering
Now that we're all on the same page about what data engineers do, let's look at some of the skills, concepts, and tools they use in their work. These are the things you need to know if you're interested in becoming a data engineer, and if you're already in the field, this will serve as a good refresher.
3 core processes
These are some of the key processes that data engineers use in their work, and you'll need to be familiar with them if you plan on interviewing for data engineering roles.
Step 1: Data acquisition
Data acquisition refers to collecting data from multiple sources. This is typically accomplished through some form of data ingestion, which refers to the process of moving data from one system to another.
There are two main types of data ingestion: batch and real-time.
Batch data ingestion is the process of collecting and storing data in batches, typically at a scheduled interval. This is often used for data that doesn't need to be processed in real-time, such as historical data.
Real-time data ingestion, on the other hand, is the process of collecting and storing data immediately as it's generated. This is often used for data that needs to be processed in real-time, such as streaming data.
Data acquisition can be a complex process due to the numerous data sources and the different formats in which data can be stored.
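To make the difference between the two ingestion styles concrete, here's a minimal Python sketch. The CSV path, the in-memory "warehouse," and the fake event generator are hypothetical stand-ins for real sources and targets.

```python
import csv
import json
from datetime import datetime, timezone

def ingest_batch(csv_path, destination):
    """Batch ingestion: read a whole file on a schedule and load it in one go."""
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))
    destination.extend(rows)  # stand-in for a bulk load into a data repository
    print(f"Loaded {len(rows)} rows at {datetime.now(timezone.utc).isoformat()}")

def ingest_stream(event_source, destination):
    """Real-time ingestion: handle each event as soon as it arrives."""
    for event in event_source:  # e.g. messages from a queue or streaming platform
        destination.append(json.loads(event))

# Usage with in-memory stand-ins for real systems
warehouse = []
fake_events = (json.dumps({"sensor": "t1", "value": i}) for i in range(3))
ingest_stream(fake_events, warehouse)
print(warehouse)
```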
Step 2: Data processing
Data processing refers to the process of transforming data into the desired format.
This is typically done through some form of data transformation, also known as data wrangling or data munging, which refers to the process of converting data from one format to another.
Types of data transformation include:
Data cleaning involves identifying and correcting incorrect, incomplete, or otherwise invalid data. It's a necessary step for data quality assurance, the process of ensuring that data meets certain standards, which is critical in data engineering because it keeps data accurate and reliable.
Data normalization involves converting data into a cohesive, standard format by eliminating redundancies, unstructured fields, and other inconsistencies. Normalization is closely related to data cleaning but differs in focus: normalization makes data more consistent, while cleaning makes it more accurate.
Data reduction involves filtering out any irrelevant data to accelerate the data analysis process. This can be done using several methods, such as de-duplication, sampling, and filtering by specific criteria.
Data extraction involves separating out data from a larger dataset. This can be done using a number of methods, such as SQL queries, APIs, and web scraping. Data extraction is often necessary when data is not readily available in the desired format.
Data aggregation involves combining data from multiple sources into a single dataset. It is a key step in data integration, the process of bringing data from multiple sources together into a unified view.
There are many ways to process data, and the best approach will depend on the data you're working with and the goals of your project.
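As a rough, combined illustration of these transformations, here's a minimal sketch using pandas (assuming it's available) on a made-up customer dataset; the column names and cleaning rules are hypothetical.

```python
import pandas as pd

# Hypothetical raw data pulled in during the acquisition step
raw = pd.DataFrame({
    "name": [" Ada ", "Grace", "Grace", None],
    "spend": ["100", "250", "250", "n/a"],
    "country": ["us", "US", "US", "gb"],
})

# Data cleaning: drop rows with missing names, coerce invalid numbers to NaN
df = raw.dropna(subset=["name"]).copy()
df["spend"] = pd.to_numeric(df["spend"], errors="coerce")

# Data normalization: standardize formats so values are consistent
df["name"] = df["name"].str.strip()
df["country"] = df["country"].str.upper()

# Data reduction: de-duplicate and keep only the columns the analysis needs
df = df.drop_duplicates(subset=["name", "country"])[["name", "spend", "country"]]

# Data aggregation: summarize into a single view per country
summary = df.groupby("country", as_index=False)["spend"].sum()
print(summary)
```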
Step 3: Data storage
Data storage in the context of data engineering refers to the process of storing data in a format that is accessible and usable by humans or machines. Data storage is a critical step in data engineering, as it helps to ensure that data can be accessed and used by other data professionals to generate insights.
Data can be structured, semi-structured, or unstructured, and the type of data will largely determine what kind of data repository you'll need.
Structured data is organized in a predefined format and can be easily processed by computers. Structured data is typically stored in databases, such as relational databases, columnar databases, and document-oriented databases. Examples of structured data include customer, product, and financial data.
Semi-structured data has some organizing structure, such as tags or key-value pairs, but doesn't conform to the rigid schema of structured data. Semi-structured data is often stored in XML, JSON, or CSV files. Examples of semi-structured data are emails, social media posts, and blog posts.
Unstructured data does not have a predefined format and is often unorganized. Examples of unstructured data are images, videos, and audio files.
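For a quick feel of the difference, here's a small Python sketch contrasting a structured record (fixed columns, like a row in a table) with a semi-structured one (nested JSON whose fields can vary between records); the fields are invented for illustration.

```python
import json

# Structured: every record has the same fixed columns, like a row in a table
structured_row = ("C-1001", "Ada Lovelace", "ada@example.com", 250.0)

# Semi-structured: self-describing and nested; fields can differ between records
semi_structured = {
    "customer_id": "C-1001",
    "name": "Ada Lovelace",
    "orders": [
        {"id": "O-1", "total": 100.0},
        {"id": "O-2", "total": 150.0, "gift_wrap": True},
    ],
}
print(json.dumps(semi_structured, indent=2))
```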
There is a wide variety of options for storing data, which are often referred to as data stores or data repositories.
Other factors to consider when choosing a data repository include cost, performance, and reliability.
Examples of data repositories are:
- Relational databases: MySQL, PostgreSQL, Microsoft SQL Server, Oracle Database, IBM DB2
- NoSQL databases: MongoDB, Apache Cassandra, Amazon DynamoDB, Couchbase, Apache HBase, Apache Accumulo, Microsoft Azure Cosmos DB
- Big data platforms often used alongside them: Apache Hadoop, Apache Hive, Cloudera Distribution for Hadoop
22 key data engineering concepts
We'll review some key data engineering concepts that you'll want to familiarize yourself with as you explore this career path.
1- Big data is a term used to describe large, complex datasets that are difficult to process using traditional computing techniques. Big data is often characterized by its high volume, velocity, and variety.
2- Business intelligence (BI) is defined as the collection of processes and strategies for analyzing data to generate insights used to make business decisions.
3- Data architecture involves the process of designing, constructing, and maintaining data systems. Data architecture includes the design of data models, database management systems, and data warehouses. Data engineers often work with data architects to design and implement data systems, but they can also work independently.
4- Containerization is the process of packaging an application so that it can run in isolated environments known as containers. Containerization allows for better resource utilization and portability of applications. A containerized application encapsulates all of its dependencies, libraries, binaries, and configuration files into containers. This allows an application to run in the cloud or on a virtual machine without needing to be refactored.
Docker has become synonymous with containers and is a suite of tools that can be used to create, run, and share containerized applications.
Kubernetes, or k8s, is a portable, open-source platform for managing containerized applications.
5- Cloud computing is a model for delivering IT services over the internet. Data engineers often use cloud-based services, like Amazon S3 and Google Cloud Storage, to store and process data.
6- Databases are collections of data that can be queried. Relational databases, such as MySQL, Oracle, and Microsoft SQL Server, store data in tables and have existed for over four decades. Now, there are many different types of databases including:
- Wide-column stores such as Cassandra and HBase
- Key-value stores such as DynamoDB and memcachedb
- Document databases such as MongoDB and Couchbase
- Graph databases such as Neo4j
7- Data accessibility is the ability of users to access data stored in a system.
8- Data compliance and privacy is the act of following laws and regulations related to data. Data privacy is the act of protecting data from unauthorized access.
9- Data governance is the process of managing and governing data within an organization. Data governance includes policies and procedures for managing data.
10- Data marts are subsets of data warehouses that contain only the data needed by a specific group or department.
11- Data integration platforms are tools that help organizations combine data from multiple sources. These typically include features for data cleaning and transformation.
12- Data infrastructure components can include virtual machines, cloud services, networking, storage, and software. These components are necessary for data systems to function.
13- Data pipelines encompass the process of extracting data from one or more sources, transforming the data into a format that can be used by applications further down the line, and loading the data into a target system. Data pipelines essentially automate the process of moving data from one system to another.
14- Data repositories or data stores are systems that are used to store data, as discussed earlier. Examples include relational databases, NoSQL databases, and traditional file systems.
15- Data sources are the systems or devices from which data is extracted. Examples of data sources include U.S. Census data, weather data, social media posts, IoT devices, and sensors.
16- Data warehouses are centralized systems that store all the data organizations collect. Data warehousing involves extracting data from multiple sources, transforming the data into a format that can be used for analysis, and loading the data into the warehouse.
17- Data lakes are repositories that store all the data organizations collect, in their rawest form. Data lakes are often used for storing data that has not been transformed or processed in any way.
18- ETL and ELT processes are used for moving data from one system to another.
- ETL (extract, transform, load) processes involve extracting data from one or more sources, transforming the data into a format that can be used by the target system, and loading the data into the target system.
- ELT (extract, load, transform) processes involve extracting data from one or more sources, loading the data into the target system, and then transforming the data into the desired format.
ETL processes are useful when data needs to be cleaned or reshaped before it can be used by the target system. ELT processes, on the other hand, are useful when the target system can accept raw data and transform it itself; because transformation is deferred, ELT pipelines can often load data faster.
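To make the ordering difference concrete, the sketch below uses plain Python functions as stand-ins for the extract, transform, and load stages; the data and function bodies are simplified placeholders, not a production pipeline.

```python
def extract():
    # Stand-in for pulling rows from an API, file, or source database
    return [{"amount": "10.5"}, {"amount": "bad"}, {"amount": "4.0"}]

def transform(rows):
    # Keep only rows that parse cleanly and cast amounts to floats
    cleaned = []
    for row in rows:
        try:
            cleaned.append({"amount": float(row["amount"])})
        except ValueError:
            continue
    return cleaned

def load(rows, target):
    target.extend(rows)  # stand-in for a bulk insert into the target system

# ETL: transform before loading, so only clean data lands in the target
warehouse_etl = []
load(transform(extract()), warehouse_etl)

# ELT: load raw data first, transform later inside the target (e.g. with SQL)
warehouse_elt = []
load(extract(), warehouse_elt)
transformed_later = transform(warehouse_elt)
```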
19- Data formats for storage include text files, CSV files, JSON files, and XML files. Data can also be stored in binary formats, such as Parquet and Avro.
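Here's a tiny standard-library sketch writing the same made-up records in the two text formats mentioned above; binary formats like Parquet and Avro require extra libraries, so they're omitted.

```python
import csv
import json

records = [
    {"id": 1, "event": "signup", "value": 9.99},
    {"id": 2, "event": "purchase", "value": 24.50},
]

# CSV: flat, tabular text format
with open("events.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "event", "value"])
    writer.writeheader()
    writer.writerows(records)

# JSON: nested, self-describing text format
with open("events.json", "w") as f:
    json.dump(records, f, indent=2)
```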
20- Data visualization is the process of creating visual representations of data. These can be used to examine data, find patterns, and make decisions. They are most often used to communicate data to non-technical audiences.
21- Data engineering dashboards are web-based applications that allow data engineers to monitor their data pipelines. These typically display the status of each pipeline, the number of errors in a run, and how long the run took.
22- SQL and NoSQL databases are two types of databases that are used to store data.
- SQL (structured query language) databases are relational databases, which means that data is stored in tables and can be queried using SQL.
- NoSQL (not only SQL) databases are non-relational databases, which means that data is stored in a format other than tables and can be queried using a variety of methods.
You would use SQL databases for structured data, such as data from a financial system, while NoSQL databases are best suited for unstructured data, such as data from social media. For semi-structured data, such as data from a weblog, you could use either SQL or NoSQL databases.
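To contrast the two query styles using only Python's standard library, here's a sketch that stores the same made-up record in SQLite as a relational table and as a JSON document held in a plain list standing in for a document store.

```python
import json
import sqlite3

# SQL: data lives in tables with a fixed schema and is queried with SQL
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
conn.execute("INSERT INTO users (name, city) VALUES (?, ?)", ("Ada", "London"))
row = conn.execute(
    "SELECT name, city FROM users WHERE city = ?", ("London",)
).fetchone()
print("SQL result:", row)

# NoSQL (document-style): the record is a schemaless JSON document,
# queried here with application code instead of SQL
documents = [json.loads('{"name": "Ada", "city": "London", "tags": ["analytics"]}')]
matches = [doc for doc in documents if doc["city"] == "London"]
print("Document result:", matches)
```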
Technical skills and tools
Now that we've covered some of the essential topics of data engineering, let's look at the tools and languages data engineers use to keep the data ecosystem up and running.
- Expert knowledge of OS: Unix, Linux, Windows, system utilities, and commands
- Knowledge of infrastructure components: Virtual machines, networking, application services, cloud-based services
- Expertise with databases and data warehouses: RDBMS (MySQL, PostgreSQL, IBM DB2, Oracle Database), NoSQL (Redis, MongoDB, Cassandra, Neo4j), and data warehouses (Oracle Exadata, Amazon Redshift, IBM DB2 Warehouse on Cloud)
- Knowledge of popular data pipeline tools: Apache Beam, Apache Airflow, Google Cloud Dataflow
- Languages: Python, R, SQL, Java, Scala
- Big data processing tools: Hadoop, Hive, Apache Spark, MapReduce, Kafka
- Data visualization tools: Tableau, QlikView, Power BI, Microsoft Excel
- Version control: Git, GitHub, Bitbucket
- Continuous integration and continuous delivery (CI/CD): Jenkins, Bamboo
- Monitoring and logging: ELK (Elasticsearch, Logstash, Kibana) stack, Splunk, AppDynamics
Wrapping up and next steps
A data engineer is responsible for the design, implementation, and maintenance of the systems that store, process, and analyze data. Data engineering is a relatively new field, and as such, there is no one-size-fits-all approach to it. The most important thing for a data engineer to do is to stay up to date on the latest trends and technologies so that they can apply them to the ever-growing data ecosystem.
Today we covered some of the fundamental concepts and skills that data engineers need to keep data pipelines flowing smoothly. As you continue to learn more about the data ecosystem and the role of data engineering within it, you'll find that there's a lot more to learn. But this should give you a good foundation on which to build your knowledge.
To get started learning these concepts and more, check out Educative's Introduction to Big Data and Hadoop.
Happy learning!
Continue learning about data on Educative
- The power of data: How data science can help you lead
- Get started with anomaly detection algorithms in 5 minutes
- Pandas cheat sheet: Top 35 commands and operations
Start a discussion
What other career paths do you hope to learn more about? Was this article helpful? Let us know in the comments below!