Understanding data engineering with DataCamp
Joan
Posted on August 9, 2023
Data Processing: converting raw data into meaningful information.
Value of data processing:
- Remove unwanted data
- Optimize memory, processing, and network costs
- Convert data from one type to another
- Organize data
- Fit data into a schema/structure
- Increase productivity
How data engineers process data:
- Data manipulation, cleaning and tidying tasks e.g. dealing with missing values
- Store data in a sanely structured database
- Create views on top of the database tables for easier access to the data
- Normalize the data
- Optimize the performance of the databases, e.g. indexing the data for faster retrieval.
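Several of the steps above (handling missing values, creating a view, adding an index) can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not a production setup: the employees table, its columns, the sample rows, and the paid_employees view are all hypothetical names chosen for the example.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Build a small in-memory database that illustrates the steps above."""
    conn = sqlite3.connect(":memory:")

    # Store data in a structured table (hypothetical schema).
    conn.execute(
        "CREATE TABLE employees ("
        "id INTEGER PRIMARY KEY, name TEXT NOT NULL, "
        "department TEXT, salary REAL)"
    )
    raw_rows = [
        (1, "Ada", "Engineering", 5200.0),
        (2, "Grace", None, 4800.0),          # missing department
        (3, "Linus", "Engineering", None),   # missing salary
    ]
    conn.executemany("INSERT INTO employees VALUES (?, ?, ?, ?)", raw_rows)

    # Cleaning: replace a missing department with an explicit placeholder.
    conn.execute(
        "UPDATE employees SET department = 'Unknown' WHERE department IS NULL"
    )

    # View: expose only the rows that have a known salary.
    conn.execute(
        "CREATE VIEW paid_employees AS "
        "SELECT id, name, department, salary FROM employees "
        "WHERE salary IS NOT NULL"
    )

    # Index: speed up lookups that filter on department.
    conn.execute(
        "CREATE INDEX idx_employees_department ON employees (department)"
    )
    return conn


conn = build_demo_db()
rows = conn.execute(
    "SELECT name, department FROM paid_employees ORDER BY id"
).fetchall()
print(rows)
```

Querying the view instead of the base table keeps the cleaning rule in one place, so every consumer sees the same filtered data.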
Tools used in data processing
Scheduling:
- Can apply to any task listed in data processing.
- Holds each piece together and organizes how the pieces work together.
- Runs tasks in a specific order and resolves all dependencies correctly.
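Running tasks in a specific order while resolving dependencies can be sketched with the standard library's graphlib module (Python 3.9+). The task names below (extract, clean, load, report) are hypothetical, and a real scheduler would do far more (retries, logging, triggers); this only shows the ordering idea.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
dependencies = {
    "extract": set(),
    "clean": {"extract"},
    "load": {"clean"},
    "report": {"load"},
}


def run_pipeline(deps: dict) -> list:
    """Visit tasks so that every task comes after all of its dependencies."""
    executed = []
    for task in TopologicalSorter(deps).static_order():
        # A real scheduler would execute the task here; we only record it.
        executed.append(task)
    return executed


order = run_pipeline(dependencies)
print(order)
```

If the dependencies contained a cycle, TopologicalSorter would raise graphlib.CycleError, which is a useful early check that a pipeline definition is consistent.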
Scheduling data:
Manually: e.g. a manual update of the employee data.
Automatically at a specific time: e.g. update the employee table daily at 6 AM.
Automatically when a specified condition is met: this is known as sensor scheduling.
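The two automatic triggers can be sketched as small predicate functions. This is only an illustration under assumptions: is_due, sensor_triggered, and the idea of a "pending files" condition are hypothetical and do not come from any particular scheduling tool.

```python
from datetime import datetime, time


def is_due(now: datetime, run_at: time, last_run_date) -> bool:
    """Time-based trigger: due once per day, at or after the scheduled time."""
    return now.time() >= run_at and last_run_date != now.date()


def sensor_triggered(pending_files: list) -> bool:
    """Sensor trigger (hypothetical condition): run when new files are waiting."""
    return len(pending_files) > 0


six_am = time(6, 0)
print(is_due(datetime(2023, 8, 9, 6, 0), six_am, None))   # scheduled time reached
print(is_due(datetime(2023, 8, 9, 5, 59), six_am, None))  # too early
print(sensor_triggered(["employees_2023-08-09.csv"]))     # condition met
```

A scheduler would evaluate predicates like these in a loop and start the task whenever one returns True.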
Data Ingestion:
Batches & Streams
Batch processing: groups records at intervals; often cheaper.
Streaming: sends individual records into the database right away, e.g. a new user signing in.
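The difference between the two ingestion styles can be sketched with generators. As a simplifying assumption, the batch function below groups records by count rather than by time interval, and the sign-in events are made-up sample data.

```python
from typing import Iterable, Iterator, List


def stream(records: Iterable[dict]) -> Iterator[dict]:
    """Streaming: hand over each record as soon as it arrives."""
    for record in records:
        yield record


def batch(records: Iterable[dict], size: int) -> Iterator[List[dict]]:
    """Batch: collect records into groups of `size` before handing them over."""
    current: List[dict] = []
    for record in records:
        current.append(record)
        if len(current) == size:
            yield current
            current = []
    if current:
        # Emit the final, possibly smaller, group.
        yield current


# Hypothetical sign-in events.
events = [{"user_id": i, "event": "sign_in"} for i in range(5)]

streamed = list(stream(events))
batched = list(batch(events, size=2))
print(len(streamed), [len(group) for group in batched])
```

Each batch can be written to the database in a single operation, which is one reason batch processing is often cheaper than writing record by record.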
Tools used in scheduling
Parallel computing/processing
It's the basis of modern data processing tools, necessary mainly because of memory and processing power constraints.
How it works:
Split tasks up into several smaller subtasks
Distribute these subtasks over several computers
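The split-and-distribute idea can be sketched on a single machine with concurrent.futures. This is an assumption-laden stand-in: worker threads here play the role of separate computers, and split and parallel_sum are hypothetical helper names. In CPython, threads do not speed up CPU-bound work, so a process pool or a cluster framework would be the closer match in practice.

```python
from concurrent.futures import ThreadPoolExecutor


def split(values: list, n_chunks: int) -> list:
    """Split one big task into at most n_chunks smaller subtasks."""
    size = -(-len(values) // n_chunks)  # ceiling division
    return [values[i:i + size] for i in range(0, len(values), size)]


def parallel_sum(values: list, n_workers: int = 4) -> int:
    """Sum each chunk on a separate worker, then combine the partial results."""
    if not values:
        return 0
    chunks = split(values, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(sum, chunks))
    return sum(partials)


total = parallel_sum(list(range(1, 101)))
print(total)
```

The final sum over the partial results is the combining step, and it is where the communication overhead of parallel computing shows up.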
Benefits and risks of parallel computing
pros
- Extra processing power
- Reduced memory footprint
cons
- Moving data incurs a cost
- Communication time
Cloud computing vs on-premises computing
Server on premises:
- Incurs costs for equipment
- Needs space
- Electrical and maintenance costs
- Needs enough processing power for peak moments
- Processing power goes unused at quieter times
Server on the cloud:
- Pay as you go
- No need for space
- Use the resources we need, when we need them
- The closer the server is to the user, the lower the latency
Cloud Computing for Data storage
pros
- Database reliability: data replication