Pandas - EDA Case Study - 7 Days of Pandas

piyushraj

Piyush Raj

Posted on December 27, 2022

Pandas - EDA Case Study - 7 Days of Pandas

Welcome to the seventh (and final) article in the "7 Days of Pandas" series where we cover the pandas library in Python which is used for data manipulation.

In the first article of the series, we looked at how to read and write CSV files with Pandas. In this tutorial, we will look at some of the most common operations that we perform on a dataframe in Pandas.

In the second article, we looked at how to perform basic data manipulation.

In the third article, we looked at how to perform EDA (exploratory data analysis) with Pandas.

In the fourth article, we looked at how to handle missing values in a dataframe.

In the fifth article we looked at how to aggregate and group data in Pandas.

In the sixth article we looked at how to visualize the data in a pandas dataframe.

In this tutorial, we will look apply the methods learned so far in a case-study. We'll be working with a demo assignment on performing EDA from the open source mlcourse.ai project.

The Task

In this task you should use Pandas to answer a few questions about the Adult dataset.

Unique values of all features (for more information, please see the links above):

  • age: continuous.
  • workclass: Private, Self-emp-not-inc, Self-emp-inc, Federal-gov, Local-gov, State-gov, Without-pay, Never-worked.
  • fnlwgt: continuous.
  • education: Bachelors, Some-college, 11th, HS-grad, Prof-school, Assoc-acdm, Assoc-voc, 9th, 7th-8th, 12th, Masters, 1st-4th, 10th, Doctorate, 5th-6th, Preschool.
  • education-num: continuous.
  • marital-status: Married-civ-spouse, Divorced, Never-married, Separated, Widowed, Married-spouse-absent, Married-AF-spouse.
  • occupation: Tech-support, Craft-repair, Other-service, Sales, Exec-managerial, Prof-specialty, Handlers-cleaners, Machine-op-inspct, Adm-clerical, Farming-fishing, Transport-moving, Priv-house-serv, Protective-serv, Armed-Forces.
  • relationship: Wife, Own-child, Husband, Not-in-family, Other-relative, Unmarried.
  • race: White, Asian-Pac-Islander, Amer-Indian-Eskimo, Other, Black.
  • sex: Female, Male.
  • capital-gain: continuous.
  • capital-loss: continuous.
  • hours-per-week: continuous.
  • native-country: United-States, Cambodia, England, Puerto-Rico, Canada, Germany, Outlying-US(Guam-USVI-etc), India, Japan, Greece, South, China, Cuba, Iran, Honduras, Philippines, Italy, Poland, Jamaica, Vietnam, Mexico, Portugal, Ireland, France, Dominican-Republic, Laos, Ecuador, Taiwan, Haiti, Columbia, Hungary, Guatemala, Nicaragua, Scotland, Thailand, Yugoslavia, El-Salvador, Trinadad&Tobago, Peru, Hong, Holand-Netherlands.
  • salary: >50K,<=50K

Let's now read the data as a dataframe.

import pandas as pd

# read data from csv file
df = pd.read_csv("adult.data.csv")
# display the first five rows
df.head()
Enter fullscreen mode Exit fullscreen mode
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary
0 39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States <=50K
1 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K
2 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K
3 53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K
4 28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba <=50K

1. How many men and women (sex feature) are represented in this dataset?

# we need to get the value counts in the "sex" column
df["sex"].value_counts()
Enter fullscreen mode Exit fullscreen mode
Male      21790
Female    10771
Name: sex, dtype: int64
Enter fullscreen mode Exit fullscreen mode

2. What is the average age (age feature) of women?

# filter for women and then get their average age
df[df["sex"]=="Female"]["age"].mean()
Enter fullscreen mode Exit fullscreen mode
36.85823043357163
Enter fullscreen mode Exit fullscreen mode

3. What is the percentage of German citizens (native-country feature)?

# find the number of German citizens and divide that by the total population
len(df[df["native-country"]=="Germany"])/len(df) * 100
Enter fullscreen mode Exit fullscreen mode
0.42074874850281013
Enter fullscreen mode Exit fullscreen mode

Only 0.42%

4-5. What are the mean and standard deviation of age for those who earn more than 50K per year (salary feature) and those who earn less than 50K per year?

# group on salary and then calculate the mean and std for the age
df.groupby(by="salary")["age"].agg(['mean', 'std'])
Enter fullscreen mode Exit fullscreen mode
mean std
salary
<=50K 36.783738 14.020088
>50K 44.249841 10.519028

6. Is it true that people who earn more than 50K have at least high school education? (education – Bachelors, Prof-school, Assoc-acdm, Assoc-voc, Masters or Doctorate feature)

# filter the dataframe for >50k and see the distribution of education
df[df["salary"] == ">50K"]["education"].value_counts()
Enter fullscreen mode Exit fullscreen mode
Bachelors       2221
HS-grad         1675
Some-college    1387
Masters          959
Prof-school      423
Assoc-voc        361
Doctorate        306
Assoc-acdm       265
10th              62
11th              60
7th-8th           40
12th              33
9th               27
5th-6th           16
1st-4th            6
Name: education, dtype: int64
Enter fullscreen mode Exit fullscreen mode

No, we can see that there are individuals with less than high-school education in the >50K bucket.

7. Display age statistics for each race (race feature) and each gender (sex feature). Use groupby() and describe(). Find the maximum age of men of Amer-Indian-Eskimo race.

# for each race
df.groupby(by="race")["age"].describe()
Enter fullscreen mode Exit fullscreen mode
count mean std min 25% 50% 75% max
race
Amer-Indian-Eskimo 311.0 37.173633 12.447130 17.0 28.0 35.0 45.5 82.0
Asian-Pac-Islander 1039.0 37.746872 12.825133 17.0 28.0 36.0 45.0 90.0
Black 3124.0 37.767926 12.759290 17.0 28.0 36.0 46.0 90.0
Other 271.0 33.457565 11.538865 17.0 25.0 31.0 41.0 77.0
White 27816.0 38.769881 13.782306 17.0 28.0 37.0 48.0 90.0
# for each gender
df.groupby(by="sex")["age"].describe()
Enter fullscreen mode Exit fullscreen mode
count mean std min 25% 50% 75% max
sex
Female 10771.0 36.858230 14.013697 17.0 25.0 35.0 46.0 90.0
Male 21790.0 39.433547 13.370630 17.0 29.0 38.0 48.0 90.0
# for each race and gender
df.groupby(by=["race", "sex"])["age"].describe()
Enter fullscreen mode Exit fullscreen mode
count mean std min 25% 50% 75% max
race sex
Amer-Indian-Eskimo Female 119.0 37.117647 13.114991 17.0 27.0 36.0 46.00 80.0
Male 192.0 37.208333 12.049563 17.0 28.0 35.0 45.00 82.0
Asian-Pac-Islander Female 346.0 35.089595 12.300845 17.0 25.0 33.0 43.75 75.0
Male 693.0 39.073593 12.883944 18.0 29.0 37.0 46.00 90.0
Black Female 1555.0 37.854019 12.637197 17.0 28.0 37.0 46.00 90.0
Male 1569.0 37.682600 12.882612 17.0 27.0 36.0 46.00 90.0
Other Female 109.0 31.678899 11.631599 17.0 23.0 29.0 39.00 74.0
Male 162.0 34.654321 11.355531 17.0 26.0 32.0 42.00 77.0
White Female 8642.0 36.811618 14.329093 17.0 25.0 35.0 46.00 90.0
Male 19174.0 39.652498 13.436029 17.0 29.0 38.0 49.00 90.0

8. Among whom is the proportion of those who earn a lot (>50K) greater: married or single men (marital-status feature)? Consider as married those who have a marital-status starting with Married (Married-civ-spouse, Married-spouse-absent or Married-AF-spouse), the rest are considered bachelors.

# add a new column "is-married"
df["is-married"] = df["marital-status"].str.startswith("Married")

# display the dataframe
df.head()
Enter fullscreen mode Exit fullscreen mode
age workclass fnlwgt education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country salary is-married
0 39 State-gov 77516 Bachelors 13 Never-married Adm-clerical Not-in-family White Male 2174 0 40 United-States <=50K False
1 50 Self-emp-not-inc 83311 Bachelors 13 Married-civ-spouse Exec-managerial Husband White Male 0 0 13 United-States <=50K True
2 38 Private 215646 HS-grad 9 Divorced Handlers-cleaners Not-in-family White Male 0 0 40 United-States <=50K False
3 53 Private 234721 11th 7 Married-civ-spouse Handlers-cleaners Husband Black Male 0 0 40 United-States <=50K True
4 28 Private 338409 Bachelors 13 Married-civ-spouse Prof-specialty Wife Black Female 0 0 40 Cuba <=50K True
df.groupby(by=["is-married"])["salary"].value_counts(normalize=True)
Enter fullscreen mode Exit fullscreen mode
is-married  salary
False       <=50K     0.935546
            >50K      0.064454
True        <=50K     0.563080
            >50K      0.436920
Name: salary, dtype: float64
Enter fullscreen mode Exit fullscreen mode

We can see that amongst Married people, we have a higher proportion of people with salary >50K

9. What is the maximum number of hours a person works per week (hours-per-week feature)? How many people work such a number of hours, and what is the percentage of those who earn a lot (>50K) among them?

# max number of hourse a person works per week
df["hours-per-week"].max()
Enter fullscreen mode Exit fullscreen mode
99
Enter fullscreen mode Exit fullscreen mode
# how many people work the above maximum number of hourse
len(df[df["hours-per-week"] == df["hours-per-week"].max()])
Enter fullscreen mode Exit fullscreen mode
85
Enter fullscreen mode Exit fullscreen mode
# percentage of people in the above population that earn more than 50K
df[df["hours-per-week"] == df["hours-per-week"].max()]['salary'].value_counts(normalize=True)
Enter fullscreen mode Exit fullscreen mode
<=50K    0.705882
>50K     0.294118
Name: salary, dtype: float64
Enter fullscreen mode Exit fullscreen mode

Only 29%

10. Count the average time of work (hours-per-week) for those who earn a little and a lot (salary) for each country (native-country). What will these be for Japan?

# group the data on native country and salary and find the average work time for each group
with pd.option_context('display.max_rows', None):
    print(df.groupby(by=["native-country", "salary"])["hours-per-week"].mean())
Enter fullscreen mode Exit fullscreen mode
native-country              salary
?                           <=50K     40.164760
                            >50K      45.547945
Cambodia                    <=50K     41.416667
                            >50K      40.000000
Canada                      <=50K     37.914634
                            >50K      45.641026
China                       <=50K     37.381818
                            >50K      38.900000
Columbia                    <=50K     38.684211
                            >50K      50.000000
Cuba                        <=50K     37.985714
                            >50K      42.440000
Dominican-Republic          <=50K     42.338235
                            >50K      47.000000
Ecuador                     <=50K     38.041667
                            >50K      48.750000
El-Salvador                 <=50K     36.030928
                            >50K      45.000000
England                     <=50K     40.483333
                            >50K      44.533333
France                      <=50K     41.058824
                            >50K      50.750000
Germany                     <=50K     39.139785
                            >50K      44.977273
Greece                      <=50K     41.809524
                            >50K      50.625000
Guatemala                   <=50K     39.360656
                            >50K      36.666667
Haiti                       <=50K     36.325000
                            >50K      42.750000
Holand-Netherlands          <=50K     40.000000
Honduras                    <=50K     34.333333
                            >50K      60.000000
Hong                        <=50K     39.142857
                            >50K      45.000000
Hungary                     <=50K     31.300000
                            >50K      50.000000
India                       <=50K     38.233333
                            >50K      46.475000
Iran                        <=50K     41.440000
                            >50K      47.500000
Ireland                     <=50K     40.947368
                            >50K      48.000000
Italy                       <=50K     39.625000
                            >50K      45.400000
Jamaica                     <=50K     38.239437
                            >50K      41.100000
Japan                       <=50K     41.000000
                            >50K      47.958333
Laos                        <=50K     40.375000
                            >50K      40.000000
Mexico                      <=50K     40.003279
                            >50K      46.575758
Nicaragua                   <=50K     36.093750
                            >50K      37.500000
Outlying-US(Guam-USVI-etc)  <=50K     41.857143
Peru                        <=50K     35.068966
                            >50K      40.000000
Philippines                 <=50K     38.065693
                            >50K      43.032787
Poland                      <=50K     38.166667
                            >50K      39.000000
Portugal                    <=50K     41.939394
                            >50K      41.500000
Puerto-Rico                 <=50K     38.470588
                            >50K      39.416667
Scotland                    <=50K     39.444444
                            >50K      46.666667
South                       <=50K     40.156250
                            >50K      51.437500
Taiwan                      <=50K     33.774194
                            >50K      46.800000
Thailand                    <=50K     42.866667
                            >50K      58.333333
Trinadad&Tobago             <=50K     37.058824
                            >50K      40.000000
United-States               <=50K     38.799127
                            >50K      45.505369
Vietnam                     <=50K     37.193548
                            >50K      39.200000
Yugoslavia                  <=50K     41.600000
                            >50K      49.500000
Name: hours-per-week, dtype: float64
Enter fullscreen mode Exit fullscreen mode
# for japan
df[df["native-country"]=="Japan"].groupby(by=["salary"])["hours-per-week"].mean()
Enter fullscreen mode Exit fullscreen mode
salary
<=50K    41.000000
>50K     47.958333
Name: hours-per-week, dtype: float64
Enter fullscreen mode Exit fullscreen mode
💖 💪 🙅 🚩
piyushraj
Piyush Raj

Posted on December 27, 2022

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related