Pandas - EDA Case Study - 7 Days of Pandas
Piyush Raj
Posted on December 27, 2022
Welcome to the seventh (and final) article in the "7 Days of Pandas" series where we cover the pandas
library in Python which is used for data manipulation.
In the first article of the series, we looked at how to read and write CSV files with Pandas. In this tutorial, we will look at some of the most common operations that we perform on a dataframe in Pandas.
In the second article, we looked at how to perform basic data manipulation.
In the third article, we looked at how to perform EDA (exploratory data analysis) with Pandas.
In the fourth article, we looked at how to handle missing values in a dataframe.
In the fifth article we looked at how to aggregate and group data in Pandas.
In the sixth article we looked at how to visualize the data in a pandas dataframe.
In this tutorial, we will look apply the methods learned so far in a case-study. We'll be working with a demo assignment on performing EDA from the open source mlcourse.ai project.
The Task
In this task you should use Pandas to answer a few questions about the Adult dataset.
Unique values of all features (for more information, please see the links above):
-
age
: continuous. -
workclass
: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked. -
fnlwgt
: continuous. -
education
: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool. -
education-num
: continuous. -
marital-status
: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse. -
occupation
: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces. -
relationship
: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried. -
race
: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black. -
sex
: Female, Male. -
capital-gain
: continuous. -
capital-loss
: continuous. -
hours-per-week
: continuous. -
native-country
: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands. -
salary
: >50K,<=50K
Let's now read the data as a dataframe.
import pandas as pd
# read data from csv file
df = pd.read_csv("adult.data.csv")
# display the first five rows
df.head()
age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K |
1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K |
2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K |
3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K |
4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K |
1. How many men and women (sex feature) are represented in this dataset?
# we need to get the value counts in the "sex" column
df["sex"].value_counts()
Male 21790
Female 10771
Name: sex, dtype: int64
2. What is the average age (age feature) of women?
# filter for women and then get their average age
df[df["sex"]=="Female"]["age"].mean()
36.85823043357163
3. What is the percentage of German citizens (native-country feature)?
# find the number of German citizens and divide that by the total population
len(df[df["native-country"]=="Germany"])/len(df) * 100
0.42074874850281013
Only 0.42%
4-5. What are the mean and standard deviation of age for those who earn more than 50K per year (salary feature) and those who earn less than 50K per year?
# group on salary and then calculate the mean and std for the age
df.groupby(by="salary")["age"].agg(['mean', 'std'])
mean | std | |
---|---|---|
salary | ||
<=50K | 36.783738 | 14.020088 |
>50K | 44.249841 | 10.519028 |
6. Is it true that people who earn more than 50K have at least high school education? (education – Bachelors, Prof-school, Assoc-acdm, Assoc-voc, Masters or Doctorate feature)
# filter the dataframe for >50k and see the distribution of education
df[df["salary"] == ">50K"]["education"].value_counts()
Bachelors 2221
HS-grad 1675
Some-college 1387
Masters 959
Prof-school 423
Assoc-voc 361
Doctorate 306
Assoc-acdm 265
10th 62
11th 60
7th-8th 40
12th 33
9th 27
5th-6th 16
1st-4th 6
Name: education, dtype: int64
No, we can see that there are individuals with less than high-school education in the >50K bucket.
7. Display age statistics for each race (race feature) and each gender (sex feature). Use groupby() and describe(). Find the maximum age of men of Amer-Indian-Eskimo race.
# for each race
df.groupby(by="race")["age"].describe()
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
race | ||||||||
Amer-Indian-Eskimo | 311.0 | 37.173633 | 12.447130 | 17.0 | 28.0 | 35.0 | 45.5 | 82.0 |
Asian-Pac-Islander | 1039.0 | 37.746872 | 12.825133 | 17.0 | 28.0 | 36.0 | 45.0 | 90.0 |
Black | 3124.0 | 37.767926 | 12.759290 | 17.0 | 28.0 | 36.0 | 46.0 | 90.0 |
Other | 271.0 | 33.457565 | 11.538865 | 17.0 | 25.0 | 31.0 | 41.0 | 77.0 |
White | 27816.0 | 38.769881 | 13.782306 | 17.0 | 28.0 | 37.0 | 48.0 | 90.0 |
# for each gender
df.groupby(by="sex")["age"].describe()
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
sex | ||||||||
Female | 10771.0 | 36.858230 | 14.013697 | 17.0 | 25.0 | 35.0 | 46.0 | 90.0 |
Male | 21790.0 | 39.433547 | 13.370630 | 17.0 | 29.0 | 38.0 | 48.0 | 90.0 |
# for each race and gender
df.groupby(by=["race", "sex"])["age"].describe()
count | mean | std | min | 25% | 50% | 75% | max | ||
---|---|---|---|---|---|---|---|---|---|
race | sex | ||||||||
Amer-Indian-Eskimo | Female | 119.0 | 37.117647 | 13.114991 | 17.0 | 27.0 | 36.0 | 46.00 | 80.0 |
Male | 192.0 | 37.208333 | 12.049563 | 17.0 | 28.0 | 35.0 | 45.00 | 82.0 | |
Asian-Pac-Islander | Female | 346.0 | 35.089595 | 12.300845 | 17.0 | 25.0 | 33.0 | 43.75 | 75.0 |
Male | 693.0 | 39.073593 | 12.883944 | 18.0 | 29.0 | 37.0 | 46.00 | 90.0 | |
Black | Female | 1555.0 | 37.854019 | 12.637197 | 17.0 | 28.0 | 37.0 | 46.00 | 90.0 |
Male | 1569.0 | 37.682600 | 12.882612 | 17.0 | 27.0 | 36.0 | 46.00 | 90.0 | |
Other | Female | 109.0 | 31.678899 | 11.631599 | 17.0 | 23.0 | 29.0 | 39.00 | 74.0 |
Male | 162.0 | 34.654321 | 11.355531 | 17.0 | 26.0 | 32.0 | 42.00 | 77.0 | |
White | Female | 8642.0 | 36.811618 | 14.329093 | 17.0 | 25.0 | 35.0 | 46.00 | 90.0 |
Male | 19174.0 | 39.652498 | 13.436029 | 17.0 | 29.0 | 38.0 | 49.00 | 90.0 |
8. Among whom is the proportion of those who earn a lot (>50K) greater: married or single men (marital-status feature)? Consider as married those who have a marital-status starting with Married (Married-civ-spouse, Married-spouse-absent or Married-AF-spouse), the rest are considered bachelors.
# add a new column "is-married"
df["is-married"] = df["marital-status"].str.startswith("Married")
# display the dataframe
df.head()
age | workclass | fnlwgt | education | education-num | marital-status | occupation | relationship | race | sex | capital-gain | capital-loss | hours-per-week | native-country | salary | is-married | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 39 | State-gov | 77516 | Bachelors | 13 | Never-married | Adm-clerical | Not-in-family | White | Male | 2174 | 0 | 40 | United-States | <=50K | False |
1 | 50 | Self-emp-not-inc | 83311 | Bachelors | 13 | Married-civ-spouse | Exec-managerial | Husband | White | Male | 0 | 0 | 13 | United-States | <=50K | True |
2 | 38 | Private | 215646 | HS-grad | 9 | Divorced | Handlers-cleaners | Not-in-family | White | Male | 0 | 0 | 40 | United-States | <=50K | False |
3 | 53 | Private | 234721 | 11th | 7 | Married-civ-spouse | Handlers-cleaners | Husband | Black | Male | 0 | 0 | 40 | United-States | <=50K | True |
4 | 28 | Private | 338409 | Bachelors | 13 | Married-civ-spouse | Prof-specialty | Wife | Black | Female | 0 | 0 | 40 | Cuba | <=50K | True |
df.groupby(by=["is-married"])["salary"].value_counts(normalize=True)
is-married salary
False <=50K 0.935546
>50K 0.064454
True <=50K 0.563080
>50K 0.436920
Name: salary, dtype: float64
We can see that amongst Married people, we have a higher proportion of people with salary >50K
9. What is the maximum number of hours a person works per week (hours-per-week feature)? How many people work such a number of hours, and what is the percentage of those who earn a lot (>50K) among them?
# max number of hourse a person works per week
df["hours-per-week"].max()
99
# how many people work the above maximum number of hourse
len(df[df["hours-per-week"] == df["hours-per-week"].max()])
85
# percentage of people in the above population that earn more than 50K
df[df["hours-per-week"] == df["hours-per-week"].max()]['salary'].value_counts(normalize=True)
<=50K 0.705882
>50K 0.294118
Name: salary, dtype: float64
Only 29%
10. Count the average time of work (hours-per-week) for those who earn a little and a lot (salary) for each country (native-country). What will these be for Japan?
# group the data on native country and salary and find the average work time for each group
with pd.option_context('display.max_rows', None):
print(df.groupby(by=["native-country", "salary"])["hours-per-week"].mean())
native-country salary
? <=50K 40.164760
>50K 45.547945
Cambodia <=50K 41.416667
>50K 40.000000
Canada <=50K 37.914634
>50K 45.641026
China <=50K 37.381818
>50K 38.900000
Columbia <=50K 38.684211
>50K 50.000000
Cuba <=50K 37.985714
>50K 42.440000
Dominican-Republic <=50K 42.338235
>50K 47.000000
Ecuador <=50K 38.041667
>50K 48.750000
El-Salvador <=50K 36.030928
>50K 45.000000
England <=50K 40.483333
>50K 44.533333
France <=50K 41.058824
>50K 50.750000
Germany <=50K 39.139785
>50K 44.977273
Greece <=50K 41.809524
>50K 50.625000
Guatemala <=50K 39.360656
>50K 36.666667
Haiti <=50K 36.325000
>50K 42.750000
Holand-Netherlands <=50K 40.000000
Honduras <=50K 34.333333
>50K 60.000000
Hong <=50K 39.142857
>50K 45.000000
Hungary <=50K 31.300000
>50K 50.000000
India <=50K 38.233333
>50K 46.475000
Iran <=50K 41.440000
>50K 47.500000
Ireland <=50K 40.947368
>50K 48.000000
Italy <=50K 39.625000
>50K 45.400000
Jamaica <=50K 38.239437
>50K 41.100000
Japan <=50K 41.000000
>50K 47.958333
Laos <=50K 40.375000
>50K 40.000000
Mexico <=50K 40.003279
>50K 46.575758
Nicaragua <=50K 36.093750
>50K 37.500000
Outlying-US(Guam-USVI-etc) <=50K 41.857143
Peru <=50K 35.068966
>50K 40.000000
Philippines <=50K 38.065693
>50K 43.032787
Poland <=50K 38.166667
>50K 39.000000
Portugal <=50K 41.939394
>50K 41.500000
Puerto-Rico <=50K 38.470588
>50K 39.416667
Scotland <=50K 39.444444
>50K 46.666667
South <=50K 40.156250
>50K 51.437500
Taiwan <=50K 33.774194
>50K 46.800000
Thailand <=50K 42.866667
>50K 58.333333
Trinadad&Tobago <=50K 37.058824
>50K 40.000000
United-States <=50K 38.799127
>50K 45.505369
Vietnam <=50K 37.193548
>50K 39.200000
Yugoslavia <=50K 41.600000
>50K 49.500000
Name: hours-per-week, dtype: float64
# for japan
df[df["native-country"]=="Japan"].groupby(by=["salary"])["hours-per-week"].mean()
salary
<=50K 41.000000
>50K 47.958333
Name: hours-per-week, dtype: float64
Posted on December 27, 2022
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.