Pandas - EDA Case Study - 7 Days of Pandas

Welcome to the seventh (and final) article in the "7 Days of Pandas" series where we cover the pandas library in Python which is used for data manipulation.

In the first article of the series, we looked at how to read and write CSV files with Pandas. In this tutorial, we will look at some of the most common operations that we perform on a dataframe in Pandas.

In the second article, we looked at how to perform basic data manipulation.

In the third article, we looked at how to perform EDA (exploratory data analysis) with Pandas.

In the fourth article, we looked at how to handle missing values in a dataframe.

In the fifth article we looked at how to aggregate and group data in Pandas.

In the sixth article we looked at how to visualize the data in a pandas dataframe.

In this tutorial, we will look apply the methods learned so far in a case-study. We'll be working with a demo assignment on performing EDA from the open source mlcourse.ai project.

The Task

In this task you should use Pandas to answer a few questions about the Adult dataset.

Unique values of all features (for more information, please see the links above):

age: continuous.
workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
fnlwgt: continuous.
education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
education-num: continuous.
marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
sex: Female, Male.
capital-gain: continuous.
capital-loss: continuous.
hours-per-week: continuous.
native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
salary: >50K,<=50K

Let's now read the data as a dataframe.

import pandas as pd

# read data from csv file
df = pd.read_csv("adult.data.csv")
# display the first five rows
df.head()

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	hours-per-week	native-country	salary
0	39	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	40	United-States	<=50K
1	50	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	13	United-States	<=50K
2	38	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	40	United-States	<=50K
3	53	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	40	United-States	<=50K
4	28	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	40	Cuba	<=50K

1. How many men and women (sex feature) are represented in this dataset?

# we need to get the value counts in the "sex" column
df["sex"].value_counts()

Male      21790
Female    10771
Name: sex, dtype: int64

2. What is the average age (age feature) of women?

# filter for women and then get their average age
df[df["sex"]=="Female"]["age"].mean()

36.85823043357163

3. What is the percentage of German citizens (native-country feature)?

# find the number of German citizens and divide that by the total population
len(df[df["native-country"]=="Germany"])/len(df) * 100

0.42074874850281013

Only 0.42%

4-5. What are the mean and standard deviation of age for those who earn more than 50K per year (salary feature) and those who earn less than 50K per year?

# group on salary and then calculate the mean and std for the age
df.groupby(by="salary")["age"].agg(['mean', 'std'])

	mean	std
salary
<=50K	36.783738	14.020088
>50K	44.249841	10.519028

6. Is it true that people who earn more than 50K have at least high school education? (education – Bachelors, Prof-school, Assoc-acdm, Assoc-voc, Masters or Doctorate feature)

# filter the dataframe for >50k and see the distribution of education
df[df["salary"] == ">50K"]["education"].value_counts()

Bachelors       2221
HS-grad         1675
Some-college    1387
Masters          959
Prof-school      423
Assoc-voc        361
Doctorate        306
Assoc-acdm       265
10th              62
11th              60
7th-8th           40
12th              33
9th               27
5th-6th           16
1st-4th            6
Name: education, dtype: int64

No, we can see that there are individuals with less than high-school education in the >50K bucket.

7. Display age statistics for each race (race feature) and each gender (sex feature). Use groupby() and describe(). Find the maximum age of men of Amer-Indian-Eskimo race.

# for each race
df.groupby(by="race")["age"].describe()

	count	mean	std	min	25%	50%	75%	max
race
Amer-Indian-Eskimo	311.0	37.173633	12.447130	17.0	28.0	35.0	45.5	82.0
Asian-Pac-Islander	1039.0	37.746872	12.825133	17.0	28.0	36.0	45.0	90.0
Black	3124.0	37.767926	12.759290	17.0	28.0	36.0	46.0	90.0
Other	271.0	33.457565	11.538865	17.0	25.0	31.0	41.0	77.0
White	27816.0	38.769881	13.782306	17.0	28.0	37.0	48.0	90.0

# for each gender
df.groupby(by="sex")["age"].describe()

	count	mean	std	min	25%	50%	75%	max
sex
Female	10771.0	36.858230	14.013697	17.0	25.0	35.0	46.0	90.0
Male	21790.0	39.433547	13.370630	17.0	29.0	38.0	48.0	90.0

# for each race and gender
df.groupby(by=["race", "sex"])["age"].describe()

		count	mean	std	min	25%	50%	75%	max
race	sex
Amer-Indian-Eskimo	Female	119.0	37.117647	13.114991	17.0	27.0	36.0	46.00	80.0
Amer-Indian-Eskimo	Male	192.0	37.208333	12.049563	17.0	28.0	35.0	45.00	82.0
Asian-Pac-Islander	Female	346.0	35.089595	12.300845	17.0	25.0	33.0	43.75	75.0
Asian-Pac-Islander	Male	693.0	39.073593	12.883944	18.0	29.0	37.0	46.00	90.0
Black	Female	1555.0	37.854019	12.637197	17.0	28.0	37.0	46.00	90.0
Black	Male	1569.0	37.682600	12.882612	17.0	27.0	36.0	46.00	90.0
Other	Female	109.0	31.678899	11.631599	17.0	23.0	29.0	39.00	74.0
Other	Male	162.0	34.654321	11.355531	17.0	26.0	32.0	42.00	77.0
White	Female	8642.0	36.811618	14.329093	17.0	25.0	35.0	46.00	90.0
White	Male	19174.0	39.652498	13.436029	17.0	29.0	38.0	49.00	90.0

8. Among whom is the proportion of those who earn a lot (>50K) greater: married or single men (marital-status feature)? Consider as married those who have a marital-status starting with Married (Married-civ-spouse, Married-spouse-absent or Married-AF-spouse), the rest are considered bachelors.

# add a new column "is-married"
df["is-married"] = df["marital-status"].str.startswith("Married")

# display the dataframe
df.head()

	age	workclass	fnlwgt	education	education-num	marital-status	occupation	relationship	race	sex	capital-gain	hours-per-week	native-country	salary	is-married
0	39	State-gov	77516	Bachelors	13	Never-married	Adm-clerical	Not-in-family	White	Male	2174	40	United-States	<=50K	False
1	50	Self-emp-not-inc	83311	Bachelors	13	Married-civ-spouse	Exec-managerial	Husband	White	Male	0	13	United-States	<=50K	True
2	38	Private	215646	HS-grad	9	Divorced	Handlers-cleaners	Not-in-family	White	Male	0	40	United-States	<=50K	False
3	53	Private	234721	11th	7	Married-civ-spouse	Handlers-cleaners	Husband	Black	Male	0	40	United-States	<=50K	True
4	28	Private	338409	Bachelors	13	Married-civ-spouse	Prof-specialty	Wife	Black	Female	0	40	Cuba	<=50K	True

df.groupby(by=["is-married"])["salary"].value_counts(normalize=True)

is-married  salary
False       <=50K     0.935546
            >50K      0.064454
True        <=50K     0.563080
            >50K      0.436920
Name: salary, dtype: float64

We can see that amongst Married people, we have a higher proportion of people with salary >50K

9. What is the maximum number of hours a person works per week (hours-per-week feature)? How many people work such a number of hours, and what is the percentage of those who earn a lot (>50K) among them?

# max number of hourse a person works per week
df["hours-per-week"].max()

# how many people work the above maximum number of hourse
len(df[df["hours-per-week"] == df["hours-per-week"].max()])

# percentage of people in the above population that earn more than 50K
df[df["hours-per-week"] == df["hours-per-week"].max()]['salary'].value_counts(normalize=True)

<=50K    0.705882
>50K     0.294118
Name: salary, dtype: float64

Only 29%

10. Count the average time of work (hours-per-week) for those who earn a little and a lot (salary) for each country (native-country). What will these be for Japan?

# group the data on native country and salary and find the average work time for each group
with pd.option_context('display.max_rows', None):
    print(df.groupby(by=["native-country", "salary"])["hours-per-week"].mean())

native-country              salary
?                           <=50K     40.164760
                            >50K      45.547945
Cambodia                    <=50K     41.416667
                            >50K      40.000000
Canada                      <=50K     37.914634
                            >50K      45.641026
China                       <=50K     37.381818
                            >50K      38.900000
Columbia                    <=50K     38.684211
                            >50K      50.000000
Cuba                        <=50K     37.985714
                            >50K      42.440000
Dominican-Republic          <=50K     42.338235
                            >50K      47.000000
Ecuador                     <=50K     38.041667
                            >50K      48.750000
El-Salvador                 <=50K     36.030928
                            >50K      45.000000
England                     <=50K     40.483333
                            >50K      44.533333
France                      <=50K     41.058824
                            >50K      50.750000
Germany                     <=50K     39.139785
                            >50K      44.977273
Greece                      <=50K     41.809524
                            >50K      50.625000
Guatemala                   <=50K     39.360656
                            >50K      36.666667
Haiti                       <=50K     36.325000
                            >50K      42.750000
Holand-Netherlands          <=50K     40.000000
Honduras                    <=50K     34.333333
                            >50K      60.000000
Hong                        <=50K     39.142857
                            >50K      45.000000
Hungary                     <=50K     31.300000
                            >50K      50.000000
India                       <=50K     38.233333
                            >50K      46.475000
Iran                        <=50K     41.440000
                            >50K      47.500000
Ireland                     <=50K     40.947368
                            >50K      48.000000
Italy                       <=50K     39.625000
                            >50K      45.400000
Jamaica                     <=50K     38.239437
                            >50K      41.100000
Japan                       <=50K     41.000000
                            >50K      47.958333
Laos                        <=50K     40.375000
                            >50K      40.000000
Mexico                      <=50K     40.003279
                            >50K      46.575758
Nicaragua                   <=50K     36.093750
                            >50K      37.500000
Outlying-US(Guam-USVI-etc)  <=50K     41.857143
Peru                        <=50K     35.068966
                            >50K      40.000000
Philippines                 <=50K     38.065693
                            >50K      43.032787
Poland                      <=50K     38.166667
                            >50K      39.000000
Portugal                    <=50K     41.939394
                            >50K      41.500000
Puerto-Rico                 <=50K     38.470588
                            >50K      39.416667
Scotland                    <=50K     39.444444
                            >50K      46.666667
South                       <=50K     40.156250
                            >50K      51.437500
Taiwan                      <=50K     33.774194
                            >50K      46.800000
Thailand                    <=50K     42.866667
                            >50K      58.333333
Trinadad&Tobago             <=50K     37.058824
                            >50K      40.000000
United-States               <=50K     38.799127
                            >50K      45.505369
Vietnam                     <=50K     37.193548
                            >50K      39.200000
Yugoslavia                  <=50K     41.600000
                            >50K      49.500000
Name: hours-per-week, dtype: float64

# for japan
df[df["native-country"]=="Japan"].groupby(by=["salary"])["hours-per-week"].mean()

salary
<=50K    41.000000
>50K     47.958333
Name: hours-per-week, dtype: float64

Blog

Pandas - EDA Case Study - 7 Days of Pandas

Piyush Raj

The Task

Join Our Newsletter. No Spam, Only the good stuff.

Related