Mastering Data Preparation for Your Dream Data Job: Using Job Postings Data
Isaac Oresanya
Posted on September 8, 2023
Data-related jobs are fascinating for many of us who love data. We want to know what skills are needed to land one, how much data analysts, engineers, and scientists earn, and what the day-to-day work in these roles actually looks like. There are many sources of information online, such as YouTube videos and blogs by experts in the field, but one of the best ways to learn about these jobs is to look at real data. That's what Luke Barousse did in one of his latest projects, where he scraped Google for data analyst job postings. In this article, I will explain how I used the Pandas library in Python to clean his dataset so that we can use it for further analysis and gain valuable insights about the job market.
Data Source and Collection
The dataset used in this analysis was obtained from Luke Barousse's comprehensive data scraping project.
To gain insight into the data collection process, I recommend watching Luke Barousse's informative video, in which he details his methodology and the challenges he faced along the way.
Data Preparation
The initial step in data preprocessing is to import the required dependencies, namely Pandas and NumPy. We will primarily use Pandas and also leverage NumPy's select function.
import pandas as pd
import numpy as np
Next, we read the data from a CSV file and store it in a variable named "job_data."
# Set pandas option to display all columns
pd.set_option("display.max_columns", None)
# Read the dataset into a DataFrame
job_data = pd.read_csv("gsearch_jobs.csv")
Below is a sample of the last five rows in the dataset:
| Unnamed: 0 | index | title | company_name | location | via | description | extensions | job_id | thumbnail | posted_at | schedule_type | work_from_home | salary | search_term | date_time | search_location | commute_time | salary_pay | salary_rate | salary_avg | salary_min | salary_max | salary_hourly | salary_yearly | salary_standardized | description_tokens |
|-------------:|--------:|:-----------------------------------------------------------------|:---------------|:---------------|:----------------------|:------------------------------------------|:----------------------------------------------------------------------------|:------------------------------------|:-------------------------------------------------------------------------------------------------|:-------------|:----------------|-----------------:|:-----------------------|:--------------|:---------------------------|:------------------|---------------:|:--------------|:--------------|-------------:|-------------:|-------------:|----------------:|----------------:|----------------------:|:--------------------------------------------------------------------|
| 27384 | 928 | Marketing Data & BI Analyst II | EDWARD JONES | Houstonia, MO | via My ArkLaMiss Jobs | lengthy passage of text (content removed) | ['23 hours ago', '76,798–130,764 a year', 'Full-time', 'Health insurance'] | eyJqb2JfdGl0bGUiOiJNYXJrZXRpbmcg... | nan | 23 hours ago | Full-time | nan | 76,798–130,764 a year | data analyst | 2022-11-04 03:40:23.706734 | United States | nan | 76798–130764 | a year | 103781 | 76798 | 130764 | nan | 103781 | 103781 | ['sql', 'r', 'python', 'power_bi', 'tableau', 'snowflake', 'excel'] |
| 27385 | 929 | Lead-Data Analyst | EDWARD JONES | Marshfield, MO | via My ArkLaMiss Jobs | lengthy passage of text (content removed) | ['23 hours ago', '106,916–182,047 a year', 'Full-time', 'Health insurance'] | eyJqb2JfdGl0bGUiOiJMZWFkLURhdGEg... | nan | 23 hours ago | Full-time | nan | 106,916–182,047 a year | data analyst | 2022-11-24 04:00:08.710801 | United States | nan | 106916–182047 | a year | 144482 | 106916 | 182047 | nan | 144482 | 144482 | [] |
| 27386 | 930 | Lead-Data Analyst | EDWARD JONES | High Point, MO | via My ArkLaMiss Jobs | lengthy passage of text (content removed) | ['23 hours ago', '106,916–182,047 a year', 'Full-time', 'Health insurance'] | eyJqb2JfdGl0bGUiOiJMZWFkLURhdGEg... | nan | 23 hours ago | Full-time | nan | 106,916–182,047 a year | data analyst | 2022-12-07 04:00:12.563831 | United States | nan | 106916–182047 | a year | 144482 | 106916 | 182047 | nan | 144482 | 144482 | [] |
| 27387 | 931 | Lead-Data Analyst | EDWARD JONES | Calhoun, MO | via My ArkLaMiss Jobs | lengthy passage of text (content removed) | ['23 hours ago', '106,916–182,047 a year', 'Full-time', 'Health insurance'] | eyJqb2JfdGl0bGUiOiJMZWFkLURhdGEg... | nan | 23 hours ago | Full-time | nan | 106,916–182,047 a year | data analyst | 2022-12-08 04:00:15.975728 | United States | nan | 106916–182047 | a year | 144482 | 106916 | 182047 | nan | 144482 | 144482 | [] |
| 27388 | 932 | Institutional Credit Management - Lending Data Analyst - Vice... | Citi | United States | via My ArkLaMiss Jobs | lengthy passage of text (content removed) | ['24 hours ago', '105,850–158,780 a year', 'Full-time'] | eyJqb2JfdGl0bGUiOiJJbnN0aXR1dGlv... | https://encrypted-tbn0.gstatic.com/images?q=tbn:ANd9GcTiN31j2tqAt1ZruC_654SrkSPAjWbpwkvjT7PG&s=0 | 24 hours ago | Full-time | nan | 105,850–158,780 a year | data analyst | 2023-01-25 04:00:26.521124 | United States | nan | 105850–158780 | a year | 132315 | 105850 | 158780 | nan | 132315 | 132315 | ['tableau', 'cognos'] |
Data Exploration
After loading the data, the first crucial step in data analysis is univariate analysis. During this phase, we seek to understand the fundamental structure of the data. This involves checking the dataset's shape, examining data types for each column, and studying categorical and numerical attributes. It helps data analysts uncover patterns, trends, and characteristics within a dataset and serves as the basis for more advanced analysis.
# Display the first five rows of the DataFrame
job_data.head()
# Display basic dataset information
print("Data Types:\n", job_data.dtypes)
print("Dataset Shape:", job_data.shape)
print("Column Names:", job_data.columns)
# Descriptive statistics for numerical columns
print("Descriptive Statistics:\n", job_data.describe())
# Check for missing values and display their sums
print("Missing Values:\n", job_data.isna().sum())
# Display value counts for the "salary_rate" column
print("Salary Rate Value Counts:\n", job_data["salary_rate"].value_counts())
# Display value counts for the "company_name" column
print("Company Name Value Counts:\n", job_data["company_name"].value_counts())
Data Cleaning
In this section, we enhance the data quality by addressing redundant columns and duplicated rows, making it more suitable for analysis.
The "work_from_home" column originally contains Boolean values. To make it more numeric, we convert 'True' to 1 (indicating employees can work remotely from home) and 'False' to 0 (representing Hybrid or On-site work).
# Replace True with 1 and NaN with 0 in the "work_from_home" column
job_data["work_from_home"] = job_data["work_from_home"].notnull().astype(int)
The "via" column indicates the platform where job postings were made, such as "via LinkedIn." To clean this data, we remove the unnecessary "via " from the values.
# Remove "via " from the "via" column to clean the data
job_data["via"] = job_data["via"].str.replace("via ", "")
To streamline the dataset, we identify columns for removal. The "Unnamed: 0" and "index" columns merely duplicate the row index, while "thumbnail", "posted_at", and "commute_time" add little analytical value. The salary information also appears in several overlapping forms: strings such as "25-45 an hour" or "65k-97k a year" are already split into the "salary_pay" and "salary_rate" columns and parsed into the numeric "salary_min", "salary_max", and "salary_avg" columns, so keeping "salary_pay" as well is redundant.
Additionally, the "salary_hourly" and "salary_yearly" columns are mutually exclusive, and their scales differ dramatically: "salary_hourly" values range from about 9 to 300, while "salary_yearly" values span roughly 30,000 to 233,500. To work with a single, comparable figure, we keep the "salary_standardized" column instead.
The "description" column contains long blocks of text that are beyond the scope of this analysis. As a lightweight alternative, we keep the "description_tokens" column, which stores the data-related skills mentioned in each posting as a list-like value inside the DataFrame cell, for example ['sql', 'python', 'tableau'].
# Define columns to drop
columns_to_drop = ["Unnamed: 0", "index", "description", "thumbnail", "posted_at", "commute_time", "salary_pay", "salary_hourly", "salary_yearly"]
# Drop the specified columns
job_data.drop(columns=columns_to_drop, inplace=True)
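An optional shape check confirms the nine columns are gone:
# The column count should have decreased by nine
print("Shape after dropping columns:", job_data.shape)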
In preparation for PostgreSQL data analysis, we designate the "job_id" column as the primary key and position it as the first column. Primary keys in databases are unique, so we identify and remove any duplicate values from the "job_id" column.
# Check for duplicated "job_id" values and print the sum
duplicates = job_data["job_id"].duplicated().sum()
print("Duplicate job_id Values:", duplicates)
# Drop duplicate rows based on the "job_id" column
job_data.drop_duplicates(subset="job_id", inplace=True)
# Reorder columns with "job_id" as the first column
column_names = job_data.columns.tolist()
column_names.remove("job_id")
column_names.insert(0, "job_id")
job_data = job_data[column_names]
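Two lightweight assertions (my own additions, not from the original notebook) verify that the DataFrame is now ready to act as a keyed table:
# "job_id" should now be unique and sit in the first position
assert job_data["job_id"].is_unique
assert job_data.columns[0] == "job_id"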
Data Formatting
To maintain data consistency during migration and analysis, we modify the "date_time" column. This column initially contains both date and time of job postings.
First, we convert the "date_time" column to a date-only format, discarding the time information.
# Convert the "date_time" column to datetime
job_data["date_time"] = pd.to_datetime(job_data["date_time"])
For compatibility with PostgreSQL, our chosen database system, we adjust the date format to YYYY-MM-DD. Simultaneously, we rename the column from "date_time" to simply "date" for clarity.
# Format the "date_time" column as "yyyy-mm-dd" and store it in the same column
job_data["date_time"] = job_data["date_time"].dt.strftime("%Y-%m-%d")
# Rename the "date_time" column to "date"
job_data.rename(columns={"date_time": "date"}, inplace=True)
This process ensures that our dataset minimizes potential issues during data migration.
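As a side note, strftime turns the column into plain strings. If you would rather keep real date objects, which PostgreSQL's DATE type also accepts, a small variation (illustrative only, not what this article's pipeline does) would be:
# Alternative: store Python date objects instead of "YYYY-MM-DD" strings
job_data["date"] = pd.to_datetime(job_data["date"]).dt.date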
Data Engineering
In this phase, we enhance our dataset by adding meaningful columns and reorganizing existing ones.
First, we address salary information. Multiple columns contain salary-related details, and many postings lack them altogether. To simplify analysis, we create a new column called "salary_info_status" based on the "salary" column, which is empty when a posting does not disclose pay (the "salary_pay" column was dropped earlier). The new column classifies postings into two categories: "Not Specified" for those without salary information and "Available" for those with a specified salary.
# Create a new column "salary_info_status" based on "salary_pay" column
job_data["salary_info_status"] = job_data["salary_pay"].isna().replace({True: "Not Specified", False: "Available"})
Next, we turn our attention to geographical data, particularly the "location" column. To prepare for future data visualization using tools like Tableau, we need to separate this column into "city" and "state." We achieve this using Pandas' insert function, creating these new columns at specific index positions. Then, we iterate through each row, splitting the "location" values based on commas and spaces (", "), and assigning them to the respective "city" and "state" columns. We handle exceptions gracefully, ensuring the split operation only applies to rows with values. Lastly, we drop the original "location" column.
# Initialize empty "city" and "state" columns
job_data.insert(3, "city", "")
job_data.insert(4, "state", "")
# Split the "Location" column into "city" and "state" columns where possible
for index, row in job_data.iterrows():
    location = row['location']
    if pd.notna(location):
        try:
            city, state = location.strip().split(', ')
        except ValueError:
            # Locations without a ", " separator (e.g. "United States") are left blank
            pass
        else:
            job_data.at[index, 'city'] = city
            job_data.at[index, 'state'] = state
# Drop the original "location" column
job_data.drop("location", axis=1, inplace=True)
However, there's an issue with one row where the "state" column contains an invalid input, "CA (+3 other states)." To maintain data integrity, we remove this row.
# Drop rows where state has more than 2 characters (invalid values)
job_data.drop(job_data[job_data.state.str.len() > 2].index, inplace=True)
To enhance the clarity of job titles, we categorize them into specific groups. We create a new column named "title_group," which depends on the "title" column. This grouping helps us distinguish job roles better, with titles falling into one of six categories: "Analyst/Engineer," "Analyst/Scientist," "Engineer/Scientist," "Analyst," "Engineer," and "Scientist." This step improves the precision of our analysis, and any titles not covered fall into an "Other" category.
# Initialize the 'title_group' column with 'Other'
job_data['title_group'] = 'Other'
# Define the conditions and corresponding values
conditions = [
(job_data['title'].str.contains(r'analy(?:z|s|t)(?:e|is|ic|t)s?', case=False) &
job_data['title'].str.contains(r'engineerg?', case=False)),
(job_data['title'].str.contains(r'analy(?:z|s|t)(?:e|is|ic|t)s?', case=False) &
job_data['title'].str.contains(r'scientists?|sciences?', case=False)),
(job_data['title'].str.contains(r'engineerg?', case=False) &
job_data['title'].str.contains(r'scientists?|sciences?', case=False)),
job_data['title'].str.contains(r'analy(?:z|s|t)(?:e|is|ic|t)s?', case=False),
job_data['title'].str.contains(r'engineerg?', case=False),
job_data['title'].str.contains(r'scientists?|sciences?', case=False)
]
# Define the corresponding values for each condition
values = [
'Analyst/Engineer',
'Analyst/Scientist',
'Engineer/Scientist',
'Analyst',
'Engineer',
'Scientist'
]
# Use numpy's select function to apply the conditions and assign values accordingly
job_data['title_group'] = np.select(conditions, values, default='Other')
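A quick look at the resulting distribution (exact counts depend on the dataset snapshot) shows how the titles fall into the groups:
# Inspect how job titles were distributed across the new groups
print(job_data["title_group"].value_counts())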
With these refinements, our dataset is now well-prepared for comprehensive analysis, ensuring data accuracy and usability.
Data Saving
After all the preprocessing steps, it's essential to save the refined dataset to ensure data integrity and have a foundation for future analysis. In this case, we choose to save the DataFrame as a CSV file:
# Save the DataFrame to a CSV file
job_data.to_csv("jobdata_pandas_processed.csv", index=False)
This saved file will serve as a valuable resource for further analysis, sharing insights, and preserving data integrity throughout your data analytics journey.
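Because the dataset is being prepared with PostgreSQL in mind, here is a minimal sketch of how the cleaned DataFrame could be loaded into a PostgreSQL table with SQLAlchemy. The connection string, database name, and table name are placeholders of my own, not details from the original project:
from sqlalchemy import create_engine
# Placeholder connection details; replace with your own credentials and database
engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/jobs_db")
# Write the cleaned data to a (hypothetical) "job_postings" table
job_data.to_sql("job_postings", engine, if_exists="replace", index=False)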
If you're interested in exploring the source code, it's available on my GitHub page, providing a closer look at the data analysis techniques employed.
Conclusion
To summarize, this article provides a step-by-step guide on how to prepare real-world data for analysis using Pandas in Python. It covers the full data preparation workflow, from loading the raw dataset to cleaning, formatting, engineering, and saving it. This resource is a valuable tool for data enthusiasts and professionals who want to learn how to extract valuable insights from messy, real-world datasets.
I hope you found this article informative. Stay tuned for my next article, where I will discuss how I further analyzed the data using SQL and Tableau. I am excited to share my findings with you!