Ingesting Data to Postgres
totalSophie
Posted on January 11, 2024
DE Zoomcamp study notes
To set up PostgreSQL in Docker, connect to it with pgcli, and execute SQL statements, you can follow these steps:
Step 1: Install Docker
Make sure you have Docker installed on your machine. You can download it from the official Docker website.
Step 2: Pull PostgreSQL Docker Image
Open a terminal and pull the official PostgreSQL Docker image:
```bash
docker pull postgres
```
Step 3: Run PostgreSQL Container
Run a PostgreSQL container, setting the superuser name, password, and database via environment variables:
```bash
docker run --name mypostgres \
  -e POSTGRES_USER="root" \
  -e POSTGRES_PASSWORD="password" \
  -e POSTGRES_DB="ny_taxi" \
  -v $(pwd)/ny_taxi_postgres_data:/var/lib/postgresql/data \
  -p 5432:5432 \
  -d postgres
```
This command starts a PostgreSQL container named 'mypostgres' with the password 'password', creates the ny_taxi database, and exposes port 5432 on the host.
- `-e` declares the environment variables (user, password, database name).
- `-v` declares the volume path, mounting a host directory as the container's data directory so the data persists across restarts.
- `-p` maps the host port to the container port.
- `-d` runs the container in detached (background) mode.
Step 4: Install pgcli
Install pgcli, a command-line interface for PostgreSQL, on your local machine:
```bash
pip install pgcli
```
Step 5: Connect to PostgreSQL using pgcli
Connect to the PostgreSQL database using pgcli:
```bash
pgcli -h localhost -p 5432 -U root -d ny_taxi -W
```
- `-h` is the host (localhost here).
- `-p` is the connection port.
- `-U` is the username.
- `-d` is the database name.
- `-W` prompts the user for the password.
Enter the password when prompted (use 'password' if you followed the previous steps).
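If you prefer to check the connection from Python rather than pgcli, here is a minimal sketch, assuming `sqlalchemy` and `psycopg2` are installed and you used the credentials from Step 3:
```python
from sqlalchemy import create_engine, text

# Same credentials as the docker run command above
engine = create_engine('postgresql://root:password@localhost:5432/ny_taxi')

with engine.connect() as conn:
    # SELECT 1 is a cheap way to confirm the server is reachable
    print(conn.execute(text('SELECT 1')).scalar())  # prints 1
```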
Step 6: Execute SQL Statements
Once connected, you can execute SQL statements directly in the pgcli interface. For example:
```sql
-- Create a new database
CREATE DATABASE mydatabase;

-- Switch to the new database
\c mydatabase

-- Create a table
CREATE TABLE mytable (
    id serial PRIMARY KEY,
    name VARCHAR(100),
    age INT
);

-- Insert some data
INSERT INTO mytable (name, age) VALUES ('John', 25), ('Jane', 30);

-- Query the data
SELECT * FROM mytable;
```
Feel free to modify the SQL statements according to your requirements.
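The same queries can also be run from Python; here is a small sketch using pandas, assuming the `mytable` table from the example above exists in `mydatabase`:
```python
import pandas as pd
from sqlalchemy import create_engine

# Connect to the database created in the SQL example above
engine = create_engine('postgresql://root:password@localhost:5432/mydatabase')

# Read the query results straight into a DataFrame
df = pd.read_sql('SELECT * FROM mytable', con=engine)
print(df)
```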
Step 7: Exit pgcli and Stop the PostgreSQL Container
To exit pgcli, type `\q`. After that, stop and remove the PostgreSQL container:
```bash
docker stop mypostgres
docker rm mypostgres
```
Data Ingestion from CSV to PostgreSQL using Pandas and SQLAlchemy
- Used Jupyter notebook to insert the data in chunks.
- Downloaded the NY taxi 2021 data
Step 1: Setting Up the Environment:
- Use Pandas to read the CSV file in chunks for more efficient processing.
- Define a PostgreSQL connection string using SQLAlchemy.
Step 2: Creating the Table Schema:
- Read the first chunk of data to create the initial table schema in the database.
- Use Pandas' `to_sql` method to replace or create the table in the PostgreSQL database.
Step 3: Iterative Data Insertion:
- Iterate through the remaining chunks of the CSV file.
- Convert the timestamp columns with Pandas' `to_datetime` so they are stored as proper datetime types.
- Append each chunk to the existing PostgreSQL table.
Final Code:
```python
from sqlalchemy import create_engine
from time import time
import pandas as pd

# Specify the database to use, based on the docker run command we had:
# postgresql://username:password@localhost:port/dbname
db_url = 'postgresql://root:password@localhost:5432/ny_taxi'
engine = create_engine(db_url)

# Chunk size for reading the CSV and inserting into the database
chunk_size = 100000

# Create an iterator for reading the CSV in chunks
csv_iter = pd.read_csv('2021_Yellow_Taxi_Trip_Data.csv', iterator=True, chunksize=chunk_size)

# Get the first chunk to create the table schema
first_chunk = next(csv_iter)

# Fix the timestamp types before creating the table, so the schema is correct
first_chunk['tpep_pickup_datetime'] = pd.to_datetime(first_chunk['tpep_pickup_datetime'])
first_chunk['tpep_dropoff_datetime'] = pd.to_datetime(first_chunk['tpep_dropoff_datetime'])

first_chunk.to_sql(name='yellow_taxi_data', con=engine, if_exists='replace', index=False)

# Loop through the remaining chunks and append to the table
for chunk in csv_iter:
    t_start = time()

    # Fix the timestamp type issue in each chunk as well
    chunk['tpep_pickup_datetime'] = pd.to_datetime(chunk['tpep_pickup_datetime'])
    chunk['tpep_dropoff_datetime'] = pd.to_datetime(chunk['tpep_dropoff_datetime'])

    # Append data to the existing table
    chunk.to_sql(name='yellow_taxi_data', con=engine, if_exists='append', index=False)

    # Print a message and benchmark the time
    t_end = time()
    print(f'Inserted another chunk... took {t_end - t_start:.3f} second(s)')
```
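As a side note, the per-chunk `to_datetime` calls can be avoided by letting pandas parse the timestamps at read time; a minimal variant of the reader setup, assuming the same column names:
```python
import pandas as pd

# parse_dates converts the two timestamp columns in every chunk automatically
csv_iter = pd.read_csv(
    '2021_Yellow_Taxi_Trip_Data.csv',
    iterator=True,
    chunksize=100000,
    parse_dates=['tpep_pickup_datetime', 'tpep_dropoff_datetime'],
)
```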
Extra, Extra!!!
Using argparse to Parse Command Line Arguments
This script uses the `argparse` standard library to parse command line arguments; it downloads a CSV file from a specified URL and ingests its data into a PostgreSQL database.
```python
from time import time
from sqlalchemy import create_engine
import pandas as pd
import argparse
import os


def main(params):
    user = params.user
    password = params.password
    host = params.host
    port = params.port
    db = params.db
    table_name = params.table_name
    url = params.url
    csv_name = 'output.csv'

    # Download the CSV, using os.system to execute a shell command from Python
    os.system(f"wget {url} -O {csv_name}")

    engine = create_engine(f'postgresql://{user}:{password}@{host}:{port}/{db}')

    df_iter = pd.read_csv(csv_name, iterator=True, chunksize=100000)

    df = next(df_iter)
    df.tpep_pickup_datetime = pd.to_datetime(df.tpep_pickup_datetime)
    df.tpep_dropoff_datetime = pd.to_datetime(df.tpep_dropoff_datetime)

    # Create the table with just the column names (no rows yet)
    df.head(n=0).to_sql(name=table_name, con=engine, if_exists="replace")

    # Add the first batch of rows
    df.to_sql(name=table_name, con=engine, if_exists="append")

    while True:
        t_start = time()
        try:
            df = next(df_iter)
        except StopIteration:
            # No more chunks left to read
            break
        df.tpep_pickup_datetime = pd.to_datetime(df.tpep_pickup_datetime)
        df.tpep_dropoff_datetime = pd.to_datetime(df.tpep_dropoff_datetime)
        df.to_sql(name=table_name, con=engine, if_exists="append")
        t_end = time()
        print('Inserted another chunk... took %.3f second(s)' % (t_end - t_start))


if __name__ == '__main__':
    parser = argparse.ArgumentParser(description="Ingest CSV data to Postgres")
    parser.add_argument('--user', help="user name for postgres")
    parser.add_argument('--password', help="password for postgres")
    parser.add_argument('--host', help="host for postgres")
    parser.add_argument('--port', help="port for postgres")
    parser.add_argument('--db', help="database name for postgres")
    parser.add_argument('--table_name', help="name of the table where we will write the results to")
    parser.add_argument('--url', help="url of the CSV")
    args = parser.parse_args()
    main(args)
```
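To sanity-check the parser without touching the database, `parse_args` accepts an explicit argument list; a standalone sketch with hypothetical values:
```python
import argparse

# Rebuild the same parser and feed it a test argv
parser = argparse.ArgumentParser(description="Ingest CSV data to Postgres")
for name in ('user', 'password', 'host', 'port', 'db', 'table_name', 'url'):
    parser.add_argument(f'--{name}')

args = parser.parse_args([
    '--user=root', '--password=password', '--host=localhost',
    '--port=5432', '--db=ny_taxi', '--table_name=yellow_taxi_data',
    '--url=http://localhost:8000/file.csv',  # hypothetical URL
])
print(args.user, args.table_name)  # -> root yellow_taxi_data
```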
Dockerizing the Ingestion Script
In the provided Dockerfile:
```dockerfile
FROM python:3.9.1

RUN apt-get update && apt-get install -y wget
RUN pip install pandas sqlalchemy psycopg2

WORKDIR /app
COPY ingest_data.py ingest_data.py

ENTRYPOINT ["python", "ingest_data.py"]
```
The `psycopg2` package is included to facilitate access to the PostgreSQL database from Python; it is the database adapter that SQLAlchemy uses under the hood.
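You can also use `psycopg2` directly, without SQLAlchemy. A minimal sketch, assuming the database and table from the earlier steps:
```python
import psycopg2

# Direct psycopg2 connection, bypassing SQLAlchemy (same credentials as before)
conn = psycopg2.connect(
    host='localhost', port=5432,
    user='root', password='password', dbname='ny_taxi',
)
with conn.cursor() as cur:
    cur.execute('SELECT COUNT(*) FROM yellow_taxi_data;')
    print(cur.fetchone()[0])  # number of ingested rows
conn.close()
```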
To build the Docker image, execute the following command:
```bash
docker build -t taxi_ingest:v001 .
```
Now run the image instead of the script, passing the network argument and changing the database host. The dataset URL must be reachable from inside the container; if the CSV is a local file, you can serve it over HTTP and access it through your machine's IP address by running this in the file's directory:
```bash
python3 -m http.server
```
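Note that the container cannot fetch the file via `localhost` (inside the container, that resolves to the container itself), so the URL needs your machine's LAN IP. One way to discover it, as a convenience sketch:
```python
import socket

# Opening a UDP socket toward a public address reveals which local
# interface (and thus which IP) the OS would route traffic through;
# connect() on a UDP socket does not actually send any packets.
s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.connect(('8.8.8.8', 80))
print(s.getsockname()[0])
s.close()
```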
```bash
# If your file is local, point the URL at your machine's IP
URL="http://192.x.x.x:8000/2021_Yellow_Taxi_Trip_Data.csv"

docker run -it \
  --network=pg-network \
  taxi_ingest:v001 \
    --user=root \
    --password=password \
    --host=pg-database \
    --port=5432 \
    --db=ny_taxi \
    --table_name=yellow_taxi_trips \
    --url="${URL}"
```
We're not done yet, though...
Connecting pgAdmin and Postgres
pgcli is handy for taking a quick look at the data, but a more convenient way to work with a Postgres database is pgAdmin, a web-based GUI tool.
To install pgAdmin in a Docker container, you can follow these steps:
- Pull the pgAdmin Docker Image: Use the following command to pull the official pgAdmin Docker image from Docker Hub.
```bash
docker pull dpage/pgadmin4
```
- Create a Docker Network: It's good practice to create a Docker network so the PostgreSQL container and the pgAdmin container can communicate (this is the `pg-network` used by the ingestion run above).
```bash
docker network create pg-network
```
- Run the PostgreSQL Container: Now modify the earlier Postgres run command to join the network.
```bash
docker run --name pg-database \
  --network pg-network \
  -e POSTGRES_USER="root" \
  -e POSTGRES_PASSWORD="password" \
  -e POSTGRES_DB="ny_taxi" \
  -v $(pwd)/ny_taxi_postgres_data:/var/lib/postgresql/data \
  -p 5432:5432 \
  -d postgres
```
Replace `password` with your desired PostgreSQL password.
- Run the pgAdmin Container: Now, run the pgAdmin container on the same network as the PostgreSQL container.
```bash
docker run --name pgadmin-container \
  --network pg-network \
  -e PGADMIN_DEFAULT_EMAIL=myemail@example.com \
  -e PGADMIN_DEFAULT_PASSWORD=mypassword \
  -p 5055:80 \
  -d dpage/pgadmin4
```
Replace `myemail@example.com` and `mypassword` with your desired pgAdmin login credentials.
Access pgAdmin:
Open your web browser and navigate to http://localhost:5055. Log in with the credentials you provided in the previous step.
Add PostgreSQL Server:
In pgAdmin, click on "Add New Server" and fill in the necessary details to connect to the PostgreSQL server running in the Docker container:
- Host name/address: `pg-database` (the name of your PostgreSQL container)
- Port: `5432`
- Username: `root`
- Password: `password` (or whatever you set when running the Postgres container)
Now, you should be able to manage your PostgreSQL server using pgAdmin in a Docker container. Adjust the commands and parameters according to your specific requirements and environment.
Alternatively, we can use Docker Compose.
Create a docker-compose.yml. With Compose, you don't need to specify a network: services defined in the same file are placed on a shared default network automatically.
```yaml
services:
  pgdatabase:
    image: postgres:latest
    environment:
      - POSTGRES_USER=root
      - POSTGRES_PASSWORD=password
      - POSTGRES_DB=ny_taxi
    volumes:
      - "./ny_taxi_postgres_data:/var/lib/postgresql/data:rw"
    ports:
      - "5432:5432"
    container_name: mypostgres
  pgadmin:
    image: dpage/pgadmin4
    environment:
      - PGADMIN_DEFAULT_EMAIL=myemail@example.com
      - PGADMIN_DEFAULT_PASSWORD=mypassword
    ports:
      - "5055:80"
    container_name: pgadmin
```
- To start Docker Compose: `docker-compose up`
- To run Docker Compose in the background: `docker-compose up -d`
- To view Docker Compose containers: `docker-compose ps`
- To stop Docker Compose: `docker-compose down` (this works whether or not you used the `-d` flag)
- To also remove the volumes when stopping: `docker-compose down -v`