Bernice Waweru
Posted on February 8, 2022
Pandas is used hand in hand with NumPy in data science. In this article, we will explore pandas and build a solid understanding of the library.
Introduction
Pandas is an open-source Python library used to read, write, manipulate and analyze data.
Pandas integrates well with data visualization libraries.
To use pandas, you have to import it using:
import pandas as pd
pd is the conventional alias for pandas.
Pandas has two main data structures: DataFrames and Series.
A Series is a one-dimensional labeled array. It can contain elements of different data types, in which case its dtype is object. A Series is similar to a list in Python, but each element has an index label and it is displayed as a column in a table.
A DataFrame is a two-dimensional, table-like structure with labeled axes (rows and columns). It can be thought of as a collection of Series that share the same index.
Creating a Series
You can create a series using pd.Series() and passing a list or dictionary as an argument.
1.Using a list
import pandas as pd
my_list = [1,4,5,8,9,3]
series_A = pd.Series(my_list)
print(series_A)
Output:
0 1
1 4
2 5
3 8
4 9
5 3
dtype: int64
2.Using a dictionary
The keys of the dictionary become the index labels and the values become the values of the series.
my_dict ={'one':'Jane','two':'Tom','three':'Kamau'}
series_dict = pd.Series(my_dict)
print(series_dict)
Output:
one Jane
two Tom
three Kamau
dtype: object
Creating DataFrames
1.Using a list or NumPy array. You can specify the column names and the row index.
patient_details = [[101,'Julia','Johns'],
[102,'Jessica','Watkins'],
[103,'Amanda','Elis']]
patient_dataframe = pd.DataFrame(patient_details,columns=['ID','FirstName','LastName'],index=[1,2,3])
print(patient_dataframe)
Output:
ID FirstName LastName
1 101 Julia Johns
2 102 Jessica Watkins
3 103 Amanda Elis
2.Using a dictionary
data = {'Name':['Kris', 'Kate', 'Gao', 'Anita'],
'Age':[27, 24, 22, 32],
'Major':['Statistics', 'Accounting', 'Economics', 'Telecoms']}
df = pd.DataFrame(data)
print(df)
Output:
Name Age Major
0 Kris 27 Statistics
1 Kate 24 Accounting
2 Gao 22 Economics
3 Anita 32 Telecoms
The dataframe consists of columns with different data types:
df.dtypes
Output:
Name object
Age int64
Major object
dtype: object
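The first option above mentions NumPy arrays, but only a list is shown. Here is a minimal sketch of the NumPy route. The array values and the column and index labels are made up for illustration and are not part of the original examples.

```python
import numpy as np
import pandas as pd

# Hypothetical 3x2 array of readings; the numbers are invented for this sketch.
readings = np.array([[36.6, 72],
                     [37.1, 80],
                     [36.9, 65]])

# columns and index are optional; here both are labeled explicitly.
readings_df = pd.DataFrame(readings,
                           columns=['Temperature', 'Pulse'],
                           index=[101, 102, 103])
print(readings_df)
print(readings_df.dtypes)
```

Because a NumPy array holds a single data type, both columns end up as float64 here, unlike the dictionary example where each column keeps its own type.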
Reading and writing CSV files
We often work with already existing data, which can be loaded into data frames. A lot of data is stored as CSV files, so it is important to understand how to read and write them.
The read_csv() function in pandas is used for reading CSV files stored locally or at a URL.
df2 = pd.read_csv("https://raw.githubusercontent.com/dphi-official/Datasets/master/Standard_Metropolitan_Areas_Data-data.csv")
df2
Output:
land_area percent_city percent_senior physicians hospital_beds graduates work_force income region crime_rate
0 1384 78.1 12.3 25627 69678 50.1 4083.9 72100 1 75.55
1 3719 43.9 9.4 13326 43292 53.9 3305.9 54542 2 56.03
2 3553 37.4 10.7 9724 33731 50.6 2066.3 33216 1 41.32
3 3916 29.9 8.8 6402 24167 52.2 1966.7 32906 2 67.38
4 2480 31.5 10.5 8502 16751 66.1 1514.5 26573 4 80.19
... ... ... ... ... ... ... ... ... ... ...
94 1511 38.7 10.7 348 1093 50.4 127.2 1452 4 70.66
95 1543 39.6 8.1 159 481 30.3 80.6 769 3 36.36
96 1011 37.8 10.5 264 964 70.7 93.2 1337 3 60.16
97 813 13.4 10.9 371 4355 58.0 97.0 1589 1 36.33
The to_csv() function writes a data frame to a CSV file.
Syntax : df.to_csv('filename.csv')
You can also specify index=False to write the CSV file without the index column.
df.to_csv('filename.csv', index=False)
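To see what index=False does, here is a small round-trip sketch: it writes a data frame to a CSV file and reads it back. The file name students.csv and the temporary directory are assumptions made for this example so that it leaves no files behind.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame({'Name': ['Kris', 'Kate'], 'Age': [27, 24]})

# Write into a temporary directory that is removed when the block ends.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'students.csv')  # hypothetical file name
    df.to_csv(path, index=False)                  # no index column in the file

    # Look at the raw text that was written.
    with open(path) as f:
        csv_text = f.read()

    # Read the file back into a new data frame.
    restored = pd.read_csv(path)

print(csv_text)
print(restored)
```

Without index=False, the file would contain an extra unnamed first column holding 0 and 1, and read_csv() would load it back as a column called "Unnamed: 0".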
Attributes and Methods
1.The shape attribute returns the shape of the dataframe as a tuple: (number of rows, number of columns).
Syntax : df.shape
2.dtypes: Returns the data type of each column.
Syntax : df.dtypes
3.axes: Returns a list with the row axis labels and column axis labels.
Syntax : df.axes
4.empty: Returns True if the DataFrame is entirely empty.
Syntax : df.empty
5.ndim : Returns the number of dimensions of the underlying data. This is 2 for a DataFrame and 1 for a Series.
Syntax : df.ndim
6.size: Returns the number of elements in the DataFrame, that is, the number of rows multiplied by the number of columns.
Syntax : df.size
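As a quick check, the attributes above can be tried on the small students data frame from earlier. This is only a sketch that rebuilds that data frame so the snippet runs on its own.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Kris', 'Kate', 'Gao', 'Anita'],
                   'Age': [27, 24, 22, 32],
                   'Major': ['Statistics', 'Accounting', 'Economics', 'Telecoms']})

print(df.shape)        # (4, 3): 4 rows and 3 columns
print(df.ndim)         # 2, a DataFrame has rows and columns
print(df['Age'].ndim)  # 1, a single column is a Series
print(df.size)         # 12, i.e. 4 rows * 3 columns
print(df.empty)        # False, the data frame has data
print(df.axes)         # [row labels, column labels]
```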
To display the first five observations from the dataframe use the head() method.
df2.head()
Output:
land_area percent_city percent_senior physicians hospital_beds graduates work_force income region crime_rate
0 1384 78.1 12.3 25627 69678 50.1 4083.9 72100 1 75.55
1 3719 43.9 9.4 13326 43292 53.9 3305.9 54542 2 56.03
2 3553 37.4 10.7 9724 33731 50.6 2066.3 33216 1 41.32
3 3916 29.9 8.8 6402 24167 52.2 1966.7 32906 2 67.38
4 2480 31.5 10.5 8502 16751 66.1 1514.5 26573 4 80.19
To display the last five observations from the dataframe use the tail() method.
You can also specify the number of rows to be displayed by passing an argument to the tail() and head() methods.
df2.tail()
Output:
land_area percent_city percent_senior physicians hospital_beds graduates work_force income region crime_rate
94 1511 38.7 10.7 348 1093 50.4 127.2 1452 4 70.66
95 1543 39.6 8.1 159 481 30.3 80.6 769 3 36.36
96 1011 37.8 10.5 264 964 70.7 93.2 1337 3 60.16
97 813 13.4 10.9 371 4355 58.0 97.0 1589 1 36.33
98 654 28.8 3.9 140 1296 55.1 66.9 1148 3 68.76
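The examples above use the default of five rows. Here is a short sketch of passing a number instead, using the small students data frame so the result is easy to check by eye.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Kris', 'Kate', 'Gao', 'Anita'],
                   'Age': [27, 24, 22, 32],
                   'Major': ['Statistics', 'Accounting', 'Economics', 'Telecoms']})

first_two = df.head(2)  # rows with index 0 and 1
last_one = df.tail(1)   # only the final row, index 3

print(first_two)
print(last_one)
```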
The info() method prints a concise summary of the dataframe: the column names, the number of non-null values and the data type of each column.
df2.info()
Output:
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 land_area 99 non-null int64
1 percent_city 99 non-null float64
2 percent_senior 99 non-null float64
3 physicians 99 non-null int64
4 hospital_beds 99 non-null int64
5 graduates 99 non-null float64
6 work_force 99 non-null float64
7 income 99 non-null int64
8 region 99 non-null int64
9 crime_rate 99 non-null float64
dtypes: float64(5), int64(5)
The describe() method returns summary statistics (count, mean, standard deviation, minimum, quartiles and maximum). By default it covers the numerical columns only.
To get a summary of the categorical columns, we use the include parameter.
Syntax : df.describe(include='object')
You can also use include='all' to get a summary of all the columns.
Syntax : df.describe(include='all')
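Here is a minimal sketch comparing the default call with include='all' on the students data frame. The variable names numeric_summary and full_summary are chosen for this example only.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Kris', 'Kate', 'Gao', 'Anita'],
                   'Age': [27, 24, 22, 32],
                   'Major': ['Statistics', 'Accounting', 'Economics', 'Telecoms']})

# Default: only the numerical column (Age) is summarized.
numeric_summary = df.describe()
print(numeric_summary)

# include='all': every column is summarized. Statistics that do not apply
# to a column (for example the mean of Name) are shown as NaN.
full_summary = df.describe(include='all')
print(full_summary)
```

For text columns the summary reports count, unique, top and freq instead of the mean and quartiles.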
The mean() function returns the mean of each numerical column.
df2.mean()
Output:
land_area 2615.727273
percent_city 42.518182
percent_senior 9.781818
physicians 1828.333333
hospital_beds 6345.868687
graduates 54.463636
work_force 449.366667
income 6762.505051
region 2.494949
crime_rate 55.643030
dtype: float64
The median() function returns the median of each numerical column.
df2.median()
Output:
land_area 1951.00
percent_city 39.50
percent_senior 9.70
physicians 774.00
hospital_beds 3472.00
graduates 54.00
work_force 257.20
income 3510.00
region 3.00
crime_rate 56.06
dtype: float64
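The value_counts() function counts how many times each unique value occurs. When called on a whole data frame it counts each unique row, so in the output below every row has a count of 1.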
df2.value_counts()
Output:
land_area percent_city percent_senior physicians hospital_beds graduates work_force income region crime_rate
47 41.9 11.9 745 3352 36.3 258.9 3915 1 51.70 1
2966 26.9 10.3 2053 6604 56.3 450.4 6966 1 56.55 1
2766 67.9 7.7 679 3873 56.3 224.0 2598 3 63.22 1
2737 45.0 10.5 602 1462 71.3 131.4 1980 4 63.44 1
2710 63.7 6.2 357 1277 72.8 110.9 1639 4 63.10 1
..
1490 33.1 11.9 827 3818 47.4 300.2 4144 1 30.59 1
1489 58.8 9.5 911 5720 56.5 175.1 2264 3 70.55 1
1465 30.3 6.8 598 6456 50.6 164.7 2201 3 70.66 1
1456 46.7 10.4 2484 8555 56.8 710.4 10104 2 44.64 1
27293 25.3 12.3 2018 6323 57.4 510.6 7399 4 76.03 1
Length: 99, dtype: int64
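value_counts() is usually more informative on a single column with repeated values. The sketch below uses a made-up series of majors, which is not part of the dataset above, to show the difference between counting occurrences and counting distinct values.

```python
import pandas as pd

# Hypothetical column with repeated entries.
majors = pd.Series(['Statistics', 'Economics', 'Statistics',
                    'Telecoms', 'Statistics'])

counts = majors.value_counts()  # occurrences of each value, most frequent first
print(counts)

print(majors.nunique())         # number of distinct values
```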
The unique() function returns all unique values in a column.
df.Age.unique()
Output:
array([27, 24, 22, 32])
To drop columns or rows use df.drop(), passing the labels to remove.
Use axis=0 to drop rows (this is the default).
Use axis=1 to drop columns.
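Here is a short sketch of both cases on the students data frame. The names without_row and without_col are chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Kris', 'Kate', 'Gao', 'Anita'],
                   'Age': [27, 24, 22, 32],
                   'Major': ['Statistics', 'Accounting', 'Economics', 'Telecoms']})

# axis=0: drop the row whose index label is 1 (Kate).
without_row = df.drop(1, axis=0)
print(without_row)

# axis=1: drop the column named 'Major'.
without_col = df.drop('Major', axis=1)
print(without_col)

# drop() returns a new data frame; the original is left unchanged.
print(df.shape)
```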
Indexing
The loc and iloc indexers are used for indexing. They are used with square brackets rather than parentheses.
loc is label-based selection and iloc is integer-position-based selection.
df.iloc[1]
Output:
Name Kate
Age 24
Major Accounting
Name: 1, dtype: object
You can specify both the row and the column position to access a specific element.
Negative positions also work with iloc and count from the end, as in Python lists.
df.iloc[1,2]
Output:
'Accounting'
You can also use slicing to access a range of items.
df.iloc[:3,1]
Output:
0 27
1 24
2 22
Name: Age, dtype: int64
Using loc you can select columns by name. Unlike iloc, a loc slice includes its end label, so 0:2 returns rows 0, 1 and 2.
df.loc[0:2,['Name','Age']]
Output:
Name Age
0 Kris 27
1 Kate 24
2 Gao 22
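Because df uses the default 0, 1, 2, ... index, labels and positions coincide, which can hide the difference between the two indexers. The sketch below assumes hypothetical letter labels 'a' to 'd' for the rows to make the difference visible, and also shows negative indexing.

```python
import pandas as pd

data = {'Name': ['Kris', 'Kate', 'Gao', 'Anita'],
        'Age': [27, 24, 22, 32],
        'Major': ['Statistics', 'Accounting', 'Economics', 'Telecoms']}

# Hypothetical letter labels instead of the default integer index.
labeled = pd.DataFrame(data, index=['a', 'b', 'c', 'd'])

by_position = labeled.iloc[0:2]   # positions 0 and 1; the end is excluded
by_label = labeled.loc['a':'c']   # labels 'a', 'b' and 'c'; the end is included
last_row = labeled.iloc[-1]       # negative position: the last row

print(by_position)
print(by_label)
print(last_row)
```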
Selecting and Assigning data
Attribute based selection
You can use dot selection to select a column using the following syntax: df.columnName
df.Name
Output:
0 Kris
1 Kate
2 Gao
3 Anita
Name: Name, dtype: object
You can also use the bracket based selection to select a column or multiple columns.
When selecting multiple columns we use double square brackets.
df[['Age','Major']]
Output:
Age Major
0 27 Statistics
1 24 Accounting
2 22 Economics
3 32 Telecoms
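Conditional Selection
To select the rows that satisfy a certain condition, we can use comparison operators such as ==. The comparison returns a boolean series with True for each row that meets the condition. Passing that series inside df[...] keeps only those rows.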
df.Age == 27
Output:
0 True
1 False
2 False
3 False
Name: Age, dtype: bool
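The boolean series on its own only marks the matching rows. Here is a minimal sketch of using it as a filter, and of combining two conditions with & (each condition needs its own parentheses). The thresholds are arbitrary values picked for this example.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Kris', 'Kate', 'Gao', 'Anita'],
                   'Age': [27, 24, 22, 32],
                   'Major': ['Statistics', 'Accounting', 'Economics', 'Telecoms']})

# Keep only the rows where the condition is True.
aged_27 = df[df.Age == 27]
print(aged_27)

older_than_23 = df[df.Age > 23]
print(older_than_23)

# Combine conditions with & (and) or | (or).
young_accountants = df[(df.Age < 25) & (df.Major == 'Accounting')]
print(young_accountants)
```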
Assigning Data
To assign data to a given column or row, select it and provide the new data. Assigning a single value sets every element in the selection to that value.
df.Name= 'Kamau'
df
Output:
Name Age Major
0 Kamau 27 Statistics
1 Kamau 24 Accounting
2 Kamau 22 Economics
3 Kamau 32 Telecoms
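Assigning a single value overwrites the whole column, as shown above. To change only some rows, one option is to combine loc with a condition. The sketch below starts from a fresh copy of the students data frame, and 'Engineering' is a made-up replacement value.

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Kris', 'Kate', 'Gao', 'Anita'],
                   'Age': [27, 24, 22, 32],
                   'Major': ['Statistics', 'Accounting', 'Economics', 'Telecoms']})

# Update Major only for the rows where Age is above 30.
df.loc[df.Age > 30, 'Major'] = 'Engineering'  # hypothetical new value

# Assign a whole column computed from an existing one.
df['Age'] = df['Age'] + 1

print(df)
```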