Introduction to Pandas
Ugonma Ononogbu
Posted on June 22, 2024
Introduction
If you're reading this article, you probably want to get an understanding of Pandas, so let's get to it.
Pandas - short for “Panel Data” - is a popular open-source Python library widely used for data manipulation and analysis. It has built-in functions to efficiently clean, transform, manipulate, visualize, and analyze data.
The Pandas library is an essential tool for data analysts, scientists, and engineers working with structured data in Python.
This article will show you how to install Pandas and the basic functions you need to know when working with the library.
Getting Started with Pandas
Let’s learn how to install the Python Pandas library.
How to install Pandas:
When working with the Pandas library, the first step is to make sure it is installed on your system using the pip command.
pip install pandas
Requirement already satisfied: pandas in c:\users\userpc\anaconda3\lib\site-packages (2.1.4)
Requirement already satisfied: numpy<2,>=1.23.2 in c:\users\userpc\anaconda3\lib\site-packages (from pandas) (1.24.3)
Requirement already satisfied: python-dateutil>=2.8.2 in c:\users\userpc\anaconda3\lib\site-packages (from pandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in c:\users\userpc\anaconda3\lib\site-packages (from pandas) (2022.7)
Requirement already satisfied: tzdata>=2022.1 in c:\users\userpc\anaconda3\lib\site-packages (from pandas) (2023.4)
Requirement already satisfied: six>=1.5 in c:\users\userpc\anaconda3\lib\site-packages (from python-dateutil>=2.8.2->pandas) (1.16.0)
Note: you may need to restart the kernel to use updated packages.
Importing Pandas
To begin working with Pandas, import the pandas package as follows:
import pandas as pd
This imports pandas from your Python environment (in this case, the Anaconda distribution where it was installed).
The most common and preferred alias for pandas is 'pd'. Using this shorter name keeps your code concise and easier to write whenever you need to call a pandas function.
Components of pandas
Pandas provides two primary data structures for manipulating data:
- Series
- DataFrame
A Series is essentially a single column, while a DataFrame is a two-dimensional structure, like a table, formed by combining multiple Series.
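As a quick illustration of that relationship (a minimal sketch with made-up values, using the pd alias imported above), two Series can be combined into one DataFrame:
# Two Series, each behaving like a single column
names = pd.Series(['Ada', 'Bayo', 'Chika'])
ages = pd.Series([25, 30, 22])
# Combining them side by side produces a two-dimensional DataFrame
people = pd.DataFrame({'Name': names, 'Age': ages})
people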
Creating Series:
A pandas Series is just like a column in an Excel spreadsheet.
Example:
#create a pandas Series
numbers = pd.Series([1, 2, 3, 4, 5])
# Displaying the Series
numbers
Output:
0 1
1 2
2 3
3 4
4 5
dtype: int64
Note: The lines that start with the hash symbol (#) are called comments. Comments are used in Python to explain what a piece of code does, so that another programmer reading your code can understand what it is about.
Here's another example:
# Creating a Pandas Series of colors
colors_series = pd.Series(['red', 'blue', 'green'])
# Displaying the Series
colors_series
Output:
0 red
1 blue
2 green
dtype: object
For more details on how to create Pandas Series, refer to this article on Creating Pandas Series.
Creating DataFrame:
A DataFrame is a two-dimensional data structure similar to a table in a spreadsheet. It has rows and columns, where each row represents a record or observation, and each column represents a variable or an attribute.
A DataFrame lets you easily organize, manipulate, and analyze a dataset.
One simple way of creating DataFrames is by using a dictionary.
Here's how:
# Creating a DataFrame with fruits and car types
data = {'Fruits': ['Apple', 'Banana', 'Orange'],
        'Car Types': ['SUV', 'Sedan', 'Truck']}
df = pd.DataFrame(data)
# Displaying the DataFrame
df
Output:
 | Fruits | Car Types |
---|---|---|
0 | Apple | SUV |
1 | Banana | Sedan |
2 | Orange | Truck |
Importing Datasets
Data can be imported into the working environment (in this case, a Jupyter notebook) from various formats, including CSV, Excel, HTML, JSON, SQL, and many more.
Comma-Separated Values (CSV) is the most common format. A CSV file is imported into the working environment using the pd.read_csv() function.
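Pandas provides similar reader functions for the other formats as well; pd.read_sql() works the same way but additionally needs a database connection. Here's a quick sketch (the file names below are just placeholders):
# Placeholder file names - replace them with your own files
df_excel = pd.read_excel("houses.xlsx")   # Excel workbook
df_json = pd.read_json("houses.json")     # JSON file
tables = pd.read_html("listings.html")    # returns a list of every table found on the page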
Now let's import our real dataset, a house price dataset from Kaggle:
df = pd.read_csv("C:/Users/USERPC/Downloads/house_price.csv")
df
Output:
 | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | sqft_above | sqft_basement | yr_built | yr_renovated | street | city | statezip | country |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2014-05-02 00:00:00 | 3.130000e+05 | 3.0 | 1.50 | 1340 | 7912 | 1.5 | 0 | 0 | 3 | 1340 | 0 | 1955 | 2005 | 18810 Densmore Ave N | Shoreline | WA 98133 | USA |
1 | 2014-05-02 00:00:00 | 2.384000e+06 | 5.0 | 2.50 | 3650 | 9050 | 2.0 | 0 | 4 | 5 | 3370 | 280 | 1921 | 0 | 709 W Blaine St | Seattle | WA 98119 | USA |
2 | 2014-05-02 00:00:00 | 3.420000e+05 | 3.0 | 2.00 | 1930 | 11947 | 1.0 | 0 | 0 | 4 | 1930 | 0 | 1966 | 0 | 26206-26214 143rd Ave SE | Kent | WA 98042 | USA |
3 | 2014-05-02 00:00:00 | 4.200000e+05 | 3.0 | 2.25 | 2000 | 8030 | 1.0 | 0 | 0 | 4 | 1000 | 1000 | 1963 | 0 | 857 170th Pl NE | Bellevue | WA 98008 | USA |
4 | 2014-05-02 00:00:00 | 5.500000e+05 | 4.0 | 2.50 | 1940 | 10500 | 1.0 | 0 | 0 | 4 | 1140 | 800 | 1976 | 1992 | 9105 170th Ave NE | Redmond | WA 98052 | USA |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4595 | 2014-07-09 00:00:00 | 3.081667e+05 | 3.0 | 1.75 | 1510 | 6360 | 1.0 | 0 | 0 | 4 | 1510 | 0 | 1954 | 1979 | 501 N 143rd St | Seattle | WA 98133 | USA |
4596 | 2014-07-09 00:00:00 | 5.343333e+05 | 3.0 | 2.50 | 1460 | 7573 | 2.0 | 0 | 0 | 3 | 1460 | 0 | 1983 | 2009 | 14855 SE 10th Pl | Bellevue | WA 98007 | USA |
4597 | 2014-07-09 00:00:00 | 4.169042e+05 | 3.0 | 2.50 | 3010 | 7014 | 2.0 | 0 | 0 | 3 | 3010 | 0 | 2009 | 0 | 759 Ilwaco Pl NE | Renton | WA 98059 | USA |
4598 | 2014-07-10 00:00:00 | 2.034000e+05 | 4.0 | 2.00 | 2090 | 6630 | 1.0 | 0 | 0 | 3 | 1070 | 1020 | 1974 | 0 | 5148 S Creston St | Seattle | WA 98178 | USA |
4599 | 2014-07-10 00:00:00 | 2.206000e+05 | 3.0 | 2.50 | 1490 | 8102 | 2.0 | 0 | 0 | 4 | 1490 | 0 | 1990 | 0 | 18717 SE 258th St | Covington | WA 98042 | USA |
4600 rows × 18 columns
Let's explore some pandas functions:
.head()
A dataset usually consists of many rows and columns, which makes it hard to view everything at once. In this case, the .head() function is useful: it returns the first 5 rows of the dataset by default.
df.head()
Output:
 | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | sqft_above | sqft_basement | yr_built | yr_renovated | street | city | statezip | country |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2014-05-02 00:00:00 | 313000.0 | 3.0 | 1.50 | 1340 | 7912 | 1.5 | 0 | 0 | 3 | 1340 | 0 | 1955 | 2005 | 18810 Densmore Ave N | Shoreline | WA 98133 | USA |
1 | 2014-05-02 00:00:00 | 2384000.0 | 5.0 | 2.50 | 3650 | 9050 | 2.0 | 0 | 4 | 5 | 3370 | 280 | 1921 | 0 | 709 W Blaine St | Seattle | WA 98119 | USA |
2 | 2014-05-02 00:00:00 | 342000.0 | 3.0 | 2.00 | 1930 | 11947 | 1.0 | 0 | 0 | 4 | 1930 | 0 | 1966 | 0 | 26206-26214 143rd Ave SE | Kent | WA 98042 | USA |
3 | 2014-05-02 00:00:00 | 420000.0 | 3.0 | 2.25 | 2000 | 8030 | 1.0 | 0 | 0 | 4 | 1000 | 1000 | 1963 | 0 | 857 170th Pl NE | Bellevue | WA 98008 | USA |
4 | 2014-05-02 00:00:00 | 550000.0 | 4.0 | 2.50 | 1940 | 10500 | 1.0 | 0 | 0 | 4 | 1140 | 800 | 1976 | 1992 | 9105 170th Ave NE | Redmond | WA 98052 | USA |
If you want to see the first 7 or 10 rows of the dataset instead, pass the number of rows you want into the parentheses, like this:
# to return the first 7 rows of the dataset
df.head(7)
# to return the first 10 rows of the dataset
# df.head(10)
Output:
 | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | sqft_above | sqft_basement | yr_built | yr_renovated | street | city | statezip | country |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2014-05-02 00:00:00 | 313000.0 | 3.0 | 1.50 | 1340 | 7912 | 1.5 | 0 | 0 | 3 | 1340 | 0 | 1955 | 2005 | 18810 Densmore Ave N | Shoreline | WA 98133 | USA |
1 | 2014-05-02 00:00:00 | 2384000.0 | 5.0 | 2.50 | 3650 | 9050 | 2.0 | 0 | 4 | 5 | 3370 | 280 | 1921 | 0 | 709 W Blaine St | Seattle | WA 98119 | USA |
2 | 2014-05-02 00:00:00 | 342000.0 | 3.0 | 2.00 | 1930 | 11947 | 1.0 | 0 | 0 | 4 | 1930 | 0 | 1966 | 0 | 26206-26214 143rd Ave SE | Kent | WA 98042 | USA |
3 | 2014-05-02 00:00:00 | 420000.0 | 3.0 | 2.25 | 2000 | 8030 | 1.0 | 0 | 0 | 4 | 1000 | 1000 | 1963 | 0 | 857 170th Pl NE | Bellevue | WA 98008 | USA |
4 | 2014-05-02 00:00:00 | 550000.0 | 4.0 | 2.50 | 1940 | 10500 | 1.0 | 0 | 0 | 4 | 1140 | 800 | 1976 | 1992 | 9105 170th Ave NE | Redmond | WA 98052 | USA |
5 | 2014-05-02 00:00:00 | 490000.0 | 2.0 | 1.00 | 880 | 6380 | 1.0 | 0 | 0 | 3 | 880 | 0 | 1938 | 1994 | 522 NE 88th St | Seattle | WA 98115 | USA |
6 | 2014-05-02 00:00:00 | 335000.0 | 2.0 | 2.00 | 1350 | 2560 | 1.0 | 0 | 0 | 3 | 1350 | 0 | 1976 | 0 | 2616 174th Ave NE | Redmond | WA 98052 | USA |
.tail()
The .tail() function returns the last 5 rows of the dataset by default. As with .head(), you can pass in the specific number of rows you want the function to return, for example .tail(8) or .tail(3), which return the last 8 rows and the last 3 rows respectively. Let's run these, shall we?
# to return the last 5 rows of the dataset(by default)
df.tail()
Output:
 | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | sqft_above | sqft_basement | yr_built | yr_renovated | street | city | statezip | country |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4595 | 2014-07-09 00:00:00 | 308166.666667 | 3.0 | 1.75 | 1510 | 6360 | 1.0 | 0 | 0 | 4 | 1510 | 0 | 1954 | 1979 | 501 N 143rd St | Seattle | WA 98133 | USA |
4596 | 2014-07-09 00:00:00 | 534333.333333 | 3.0 | 2.50 | 1460 | 7573 | 2.0 | 0 | 0 | 3 | 1460 | 0 | 1983 | 2009 | 14855 SE 10th Pl | Bellevue | WA 98007 | USA |
4597 | 2014-07-09 00:00:00 | 416904.166667 | 3.0 | 2.50 | 3010 | 7014 | 2.0 | 0 | 0 | 3 | 3010 | 0 | 2009 | 0 | 759 Ilwaco Pl NE | Renton | WA 98059 | USA |
4598 | 2014-07-10 00:00:00 | 203400.000000 | 4.0 | 2.00 | 2090 | 6630 | 1.0 | 0 | 0 | 3 | 1070 | 1020 | 1974 | 0 | 5148 S Creston St | Seattle | WA 98178 | USA |
4599 | 2014-07-10 00:00:00 | 220600.000000 | 3.0 | 2.50 | 1490 | 8102 | 2.0 | 0 | 0 | 4 | 1490 | 0 | 1990 | 0 | 18717 SE 258th St | Covington | WA 98042 | USA |
# to return the last 8 rows of the dataset
df.tail(8)
Output:
 | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | sqft_above | sqft_basement | yr_built | yr_renovated | street | city | statezip | country |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4592 | 2014-07-08 00:00:00 | 252980.000000 | 4.0 | 2.50 | 2530 | 8169 | 2.0 | 0 | 0 | 3 | 2530 | 0 | 1993 | 0 | 37654 18th Pl S | Federal Way | WA 98003 | USA |
4593 | 2014-07-08 00:00:00 | 289373.307692 | 3.0 | 2.50 | 2538 | 4600 | 2.0 | 0 | 0 | 3 | 2538 | 0 | 2013 | 1923 | 5703 Charlotte Ave SE | Auburn | WA 98092 | USA |
4594 | 2014-07-09 00:00:00 | 210614.285714 | 3.0 | 2.50 | 1610 | 7223 | 2.0 | 0 | 0 | 3 | 1610 | 0 | 1994 | 0 | 26306 127th Ave SE | Kent | WA 98030 | USA |
4595 | 2014-07-09 00:00:00 | 308166.666667 | 3.0 | 1.75 | 1510 | 6360 | 1.0 | 0 | 0 | 4 | 1510 | 0 | 1954 | 1979 | 501 N 143rd St | Seattle | WA 98133 | USA |
4596 | 2014-07-09 00:00:00 | 534333.333333 | 3.0 | 2.50 | 1460 | 7573 | 2.0 | 0 | 0 | 3 | 1460 | 0 | 1983 | 2009 | 14855 SE 10th Pl | Bellevue | WA 98007 | USA |
4597 | 2014-07-09 00:00:00 | 416904.166667 | 3.0 | 2.50 | 3010 | 7014 | 2.0 | 0 | 0 | 3 | 3010 | 0 | 2009 | 0 | 759 Ilwaco Pl NE | Renton | WA 98059 | USA |
4598 | 2014-07-10 00:00:00 | 203400.000000 | 4.0 | 2.00 | 2090 | 6630 | 1.0 | 0 | 0 | 3 | 1070 | 1020 | 1974 | 0 | 5148 S Creston St | Seattle | WA 98178 | USA |
4599 | 2014-07-10 00:00:00 | 220600.000000 | 3.0 | 2.50 | 1490 | 8102 | 2.0 | 0 | 0 | 4 | 1490 | 0 | 1990 | 0 | 18717 SE 258th St | Covington | WA 98042 | USA |
# to return the last 3 rows of the dataset
df.tail(3)
Output:
 | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | sqft_above | sqft_basement | yr_built | yr_renovated | street | city | statezip | country |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4597 | 2014-07-09 00:00:00 | 416904.166667 | 3.0 | 2.5 | 3010 | 7014 | 2.0 | 0 | 0 | 3 | 3010 | 0 | 2009 | 0 | 759 Ilwaco Pl NE | Renton | WA 98059 | USA |
4598 | 2014-07-10 00:00:00 | 203400.000000 | 4.0 | 2.0 | 2090 | 6630 | 1.0 | 0 | 0 | 3 | 1070 | 1020 | 1974 | 0 | 5148 S Creston St | Seattle | WA 98178 | USA |
4599 | 2014-07-10 00:00:00 | 220600.000000 | 3.0 | 2.5 | 1490 | 8102 | 2.0 | 0 | 0 | 4 | 1490 | 0 | 1990 | 0 | 18717 SE 258th St | Covington | WA 98042 | USA |
.shape
We use the .shape attribute to check how large the dataset is. It returns the number of rows and columns in the dataset.
df.shape
Output:
(4600, 18)
Here we can see that the dataset has 4600 rows and 18 columns.
Note: The index of the dataset starts at 0 by default, not 1. If the index started from 1, the index of the last row would be 4600; since it starts from 0, the index of the last row is 4599. You can scroll back to the .tail() output to confirm this.
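Since .shape is a (rows, columns) tuple, you can also unpack it into two variables; a small sketch:
# Unpack the shape tuple into separate row and column counts
rows, cols = df.shape
print(f"The dataset has {rows} rows and {cols} columns.")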
.info()
The .info() function displays basic information about the dataset, including each column's name, data type, and non-null count, as well as the number of rows and columns and the memory usage.
df.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4600 entries, 0 to 4599
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 date 4600 non-null object
1 price 4600 non-null float64
2 bedrooms 4600 non-null float64
3 bathrooms 4600 non-null float64
4 sqft_living 4600 non-null int64
5 sqft_lot 4600 non-null int64
6 floors 4600 non-null float64
7 waterfront 4600 non-null int64
8 view 4600 non-null int64
9 condition 4600 non-null int64
10 sqft_above 4600 non-null int64
11 sqft_basement 4600 non-null int64
12 yr_built 4600 non-null int64
13 yr_renovated 4600 non-null int64
14 street 4600 non-null object
15 city 4600 non-null object
16 statezip 4600 non-null object
17 country 4600 non-null object
dtypes: float64(4), int64(9), object(5)
memory usage: 647.0+ KB
.describe()
The .describe() function returns descriptive statistics for the numeric columns of the DataFrame.
df.describe()
Output:
 | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | sqft_above | sqft_basement | yr_built | yr_renovated |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 4.600000e+03 | 4600.000000 | 4600.000000 | 4600.000000 | 4.600000e+03 | 4600.000000 | 4600.000000 | 4600.000000 | 4600.000000 | 4600.000000 | 4600.000000 | 4600.000000 | 4600.000000 |
mean | 5.519630e+05 | 3.400870 | 2.160815 | 2139.346957 | 1.485252e+04 | 1.512065 | 0.007174 | 0.240652 | 3.451739 | 1827.265435 | 312.081522 | 1970.786304 | 808.608261 |
std | 5.638347e+05 | 0.908848 | 0.783781 | 963.206916 | 3.588444e+04 | 0.538288 | 0.084404 | 0.778405 | 0.677230 | 862.168977 | 464.137228 | 29.731848 | 979.414536 |
min | 0.000000e+00 | 0.000000 | 0.000000 | 370.000000 | 6.380000e+02 | 1.000000 | 0.000000 | 0.000000 | 1.000000 | 370.000000 | 0.000000 | 1900.000000 | 0.000000 |
25% | 3.228750e+05 | 3.000000 | 1.750000 | 1460.000000 | 5.000750e+03 | 1.000000 | 0.000000 | 0.000000 | 3.000000 | 1190.000000 | 0.000000 | 1951.000000 | 0.000000 |
50% | 4.609435e+05 | 3.000000 | 2.250000 | 1980.000000 | 7.683000e+03 | 1.500000 | 0.000000 | 0.000000 | 3.000000 | 1590.000000 | 0.000000 | 1976.000000 | 0.000000 |
75% | 6.549625e+05 | 4.000000 | 2.500000 | 2620.000000 | 1.100125e+04 | 2.000000 | 0.000000 | 0.000000 | 4.000000 | 2300.000000 | 610.000000 | 1997.000000 | 1999.000000 |
max | 2.659000e+07 | 9.000000 | 8.000000 | 13540.000000 | 1.074218e+06 | 3.500000 | 1.000000 | 4.000000 | 5.000000 | 9410.000000 | 4820.000000 | 2014.000000 | 2014.000000 |
Where:
• count - the total number of non-missing values
• mean - the average value of each column
• std - the standard deviation
• min - the minimum value
• 25% - the 25th percentile
• 50% - the 50th percentile (the median)
• 75% - the 75th percentile
• max - the maximum value
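By default, .describe() summarizes only the numeric columns. If you also want a summary of the text columns (such as city or country), you can pass include='object'; a quick sketch:
# Count, number of unique values, most frequent value, and its frequency
# for every object (text) column
df.describe(include='object')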
.isnull()
This is an important function used to check for null (missing) values in a dataset. It returns True wherever a value is missing and False otherwise.
# this function will find null values in the first 5 rows of the dataset
df.isnull().head()
Output:
 | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | sqft_above | sqft_basement | yr_built | yr_renovated | street | city | statezip | country |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
1 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
2 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
3 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
4 | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False | False |
To count the number of null or missing values in the dataset, we do this:
df.isnull().sum()
Output:
date 0
price 0
bedrooms 0
bathrooms 0
sqft_living 0
sqft_lot 0
floors 0
waterfront 0
view 0
condition 0
sqft_above 0
sqft_basement 0
yr_built 0
yr_renovated 0
street 0
city 0
statezip 0
country 0
dtype: int64
We can see that there are no null values, because the sum for every column is 0.
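This dataset has no missing values, but if it did, two common follow-up steps are dropping or filling them; a minimal sketch:
# Drop every row that contains at least one missing value
df_dropped = df.dropna()
# Or fill missing values in the 'price' column with the column's mean
df_filled = df.fillna({'price': df['price'].mean()})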
.columns
This attribute is used to view the column names.
df.columns
Output:
Index(['date', 'price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot',
'floors', 'waterfront', 'view', 'condition', 'sqft_above',
'sqft_basement', 'yr_built', 'yr_renovated', 'street', 'city',
'statezip', 'country'],
dtype='object')
Selecting a column
To select a specific column from the DataFrame, put the name of the column inside square brackets [].
# return the first five values of the 'date' column in the dataframe
df[["date"]].head()
Output:
 | date |
---|---|
0 | 2014-05-02 00:00:00 |
1 | 2014-05-02 00:00:00 |
2 | 2014-05-02 00:00:00 |
3 | 2014-05-02 00:00:00 |
4 | 2014-05-02 00:00:00 |
Note: When selecting a column, you can choose to have it returned as a DataFrame object or as a Series object. This is done by using either double brackets or single brackets, which is why double brackets were used in the example above.
To return a column as a Series object, pass the name of the column into single brackets, like this:
df['date'].head()
Output:
0 2014-05-02 00:00:00
1 2014-05-02 00:00:00
2 2014-05-02 00:00:00
3 2014-05-02 00:00:00
4 2014-05-02 00:00:00
Name: date, dtype: object
Did you notice the difference between the two outputs? The first output, with double square brackets, is a DataFrame object, while this example returned a Series.
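You can confirm the difference with Python's built-in type() function:
type(df[['date']])   # pandas.core.frame.DataFrame
type(df['date'])     # pandas.core.series.Series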
Grouping data
Data is grouped by one or more columns using the .groupby() function. Example:
# group the dataset by the number of bedrooms and calculate the average price
grouped = df.groupby('bedrooms')['price'].mean()
grouped
Output:
bedrooms
0.0 1.195324e+06
1.0 2.740763e+05
2.0 3.916219e+05
3.0 4.886130e+05
4.0 6.351194e+05
5.0 7.701860e+05
6.0 8.173628e+05
7.0 1.049429e+06
8.0 1.155000e+06
9.0 5.999990e+05
Name: price, dtype: float64
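groupby() can also group by more than one column and compute several statistics at once with .agg(); here's a small sketch (the chosen columns are just an example):
# Average and maximum price for each (bedrooms, floors) combination
summary = df.groupby(['bedrooms', 'floors'])['price'].agg(['mean', 'max'])
summary.head()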
Adding Rows and Columns
You can add new rows and columns to your dataset. However, when adding a new column, you want to choose one that provides additional insight or useful information for analysis.
Let's add a new column called Price_per_Square_Ft, the price per square foot of living space.
df['Price_per_Square_Ft'] = df['price'] / df['sqft_living']
df.head()
Output:
 | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | sqft_above | sqft_basement | yr_built | yr_renovated | street | city | statezip | country | Price_per_Square_Ft |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2014-05-02 00:00:00 | 313000.0 | 3.0 | 1.50 | 1340 | 7912 | 1.5 | 0 | 0 | 3 | 1340 | 0 | 1955 | 2005 | 18810 Densmore Ave N | Shoreline | WA 98133 | USA | 233.582090 |
1 | 2014-05-02 00:00:00 | 2384000.0 | 5.0 | 2.50 | 3650 | 9050 | 2.0 | 0 | 4 | 5 | 3370 | 280 | 1921 | 0 | 709 W Blaine St | Seattle | WA 98119 | USA | 653.150685 |
2 | 2014-05-02 00:00:00 | 342000.0 | 3.0 | 2.00 | 1930 | 11947 | 1.0 | 0 | 0 | 4 | 1930 | 0 | 1966 | 0 | 26206-26214 143rd Ave SE | Kent | WA 98042 | USA | 177.202073 |
3 | 2014-05-02 00:00:00 | 420000.0 | 3.0 | 2.25 | 2000 | 8030 | 1.0 | 0 | 0 | 4 | 1000 | 1000 | 1963 | 0 | 857 170th Pl NE | Bellevue | WA 98008 | USA | 210.000000 |
4 | 2014-05-02 00:00:00 | 550000.0 | 4.0 | 2.50 | 1940 | 10500 | 1.0 | 0 | 0 | 4 | 1140 | 800 | 1976 | 1992 | 9105 170th Ave NE | Redmond | WA 98052 | USA | 283.505155 |
You can see that a new column called 'Price_per_Square_Ft' has been added to the dataset. Let's select the column so that we can view it on its own.
# Selecting the Price_per_Square_Ft column from the dataset
df[['Price_per_Square_Ft']]
Output:
 | Price_per_Square_Ft |
---|---|
0 | 233.582090 |
1 | 653.150685 |
2 | 177.202073 |
3 | 210.000000 |
4 | 283.505155 |
... | ... |
4595 | 204.083885 |
4596 | 365.981735 |
4597 | 138.506368 |
4598 | 97.320574 |
4599 | 148.053691 |
4600 rows × 1 columns
Displaying the full DataFrame again confirms that it now has 19 columns instead of 18:
df
Output:
 | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | sqft_above | sqft_basement | yr_built | yr_renovated | street | city | statezip | country | Price_per_Square_Ft |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2014-05-02 00:00:00 | 3.130000e+05 | 3.0 | 1.50 | 1340 | 7912 | 1.5 | 0 | 0 | 3 | 1340 | 0 | 1955 | 2005 | 18810 Densmore Ave N | Shoreline | WA 98133 | USA | 233.582090 |
1 | 2014-05-02 00:00:00 | 2.384000e+06 | 5.0 | 2.50 | 3650 | 9050 | 2.0 | 0 | 4 | 5 | 3370 | 280 | 1921 | 0 | 709 W Blaine St | Seattle | WA 98119 | USA | 653.150685 |
2 | 2014-05-02 00:00:00 | 3.420000e+05 | 3.0 | 2.00 | 1930 | 11947 | 1.0 | 0 | 0 | 4 | 1930 | 0 | 1966 | 0 | 26206-26214 143rd Ave SE | Kent | WA 98042 | USA | 177.202073 |
3 | 2014-05-02 00:00:00 | 4.200000e+05 | 3.0 | 2.25 | 2000 | 8030 | 1.0 | 0 | 0 | 4 | 1000 | 1000 | 1963 | 0 | 857 170th Pl NE | Bellevue | WA 98008 | USA | 210.000000 |
4 | 2014-05-02 00:00:00 | 5.500000e+05 | 4.0 | 2.50 | 1940 | 10500 | 1.0 | 0 | 0 | 4 | 1140 | 800 | 1976 | 1992 | 9105 170th Ave NE | Redmond | WA 98052 | USA | 283.505155 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
4595 | 2014-07-09 00:00:00 | 3.081667e+05 | 3.0 | 1.75 | 1510 | 6360 | 1.0 | 0 | 0 | 4 | 1510 | 0 | 1954 | 1979 | 501 N 143rd St | Seattle | WA 98133 | USA | 204.083885 |
4596 | 2014-07-09 00:00:00 | 5.343333e+05 | 3.0 | 2.50 | 1460 | 7573 | 2.0 | 0 | 0 | 3 | 1460 | 0 | 1983 | 2009 | 14855 SE 10th Pl | Bellevue | WA 98007 | USA | 365.981735 |
4597 | 2014-07-09 00:00:00 | 4.169042e+05 | 3.0 | 2.50 | 3010 | 7014 | 2.0 | 0 | 0 | 3 | 3010 | 0 | 2009 | 0 | 759 Ilwaco Pl NE | Renton | WA 98059 | USA | 138.506368 |
4598 | 2014-07-10 00:00:00 | 2.034000e+05 | 4.0 | 2.00 | 2090 | 6630 | 1.0 | 0 | 0 | 3 | 1070 | 1020 | 1974 | 0 | 5148 S Creston St | Seattle | WA 98178 | USA | 97.320574 |
4599 | 2014-07-10 00:00:00 | 2.206000e+05 | 3.0 | 2.50 | 1490 | 8102 | 2.0 | 0 | 0 | 4 | 1490 | 0 | 1990 | 0 | 18717 SE 258th St | Covington | WA 98042 | USA | 148.053691 |
4600 rows × 19 columns
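The new column makes it easy to answer questions such as which listings are the most expensive per square foot; for example, a quick sketch using sort_values():
# Five listings with the highest price per square foot
df.sort_values('Price_per_Square_Ft', ascending=False).head()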
Now, to add a new row, do this:
# Create a dictionary with the new row data
new_row_data = {
'date': '2014-07-10 00:00:00',
'price': 350000.000000,
'bedrooms': 3.0,
'bathrooms': 2.5,
'sqft_living': 1800,
'sqft_lot': 8000,
'floors': 2.0,
'waterfront': 0,
'view': 0,
'condition': 3,
'sqft_above': 1800,
'sqft_basement': 0,
'yr_built': 1990,
'yr_renovated': 0,
'street': '1234 New St',
'city': 'Seattle',
'statezip': 'WA 98105',
'country': 'USA'
}
# Calculate the index for the new row
new_row_index = len(df)
# Assign the new row data to the DataFrame using .loc
df.loc[new_row_index] = new_row_data
# Display the updated DataFrame
df.tail()
Output:
 | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | sqft_above | sqft_basement | yr_built | yr_renovated | street | city | statezip | country | Price_per_Square_Ft |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
4596 | 2014-07-09 00:00:00 | 534333.333333 | 3.0 | 2.5 | 1460 | 7573 | 2.0 | 0 | 0 | 3 | 1460 | 0 | 1983 | 2009 | 14855 SE 10th Pl | Bellevue | WA 98007 | USA | 365.981735 |
4597 | 2014-07-09 00:00:00 | 416904.166667 | 3.0 | 2.5 | 3010 | 7014 | 2.0 | 0 | 0 | 3 | 3010 | 0 | 2009 | 0 | 759 Ilwaco Pl NE | Renton | WA 98059 | USA | 138.506368 |
4598 | 2014-07-10 00:00:00 | 203400.000000 | 4.0 | 2.0 | 2090 | 6630 | 1.0 | 0 | 0 | 3 | 1070 | 1020 | 1974 | 0 | 5148 S Creston St | Seattle | WA 98178 | USA | 97.320574 |
4599 | 2014-07-10 00:00:00 | 220600.000000 | 3.0 | 2.5 | 1490 | 8102 | 2.0 | 0 | 0 | 4 | 1490 | 0 | 1990 | 0 | 18717 SE 258th St | Covington | WA 98042 | USA | 148.053691 |
4600 | 2014-07-10 00:00:00 | 350000.000000 | 3.0 | 2.5 | 1800 | 8000 | 2.0 | 0 | 0 | 3 | 1800 | 0 | 1990 | 0 | 1234 New St | Seattle | WA 98105 | USA | NaN |
The first line of code creates a dictionary called new_row_data, where the keys are the column names of your DataFrame and the values are the new data you want to add.
The second line calculates the new row's index, since the new row will be added at the end of the DataFrame.
The third line uses the .loc indexer to assign new_row_data to the DataFrame at the calculated index.
Lastly, we display the DataFrame using df.tail(), which shows the last five rows. Notice that the dictionary did not include the Price_per_Square_Ft column, so pandas fills that cell of the new row with NaN.
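As an alternative to the .loc approach above, another common way to append rows, especially several at once, is pd.concat(), which joins DataFrames together; a minimal sketch reusing the same new_row_data dictionary:
# Wrap the dictionary in a one-row DataFrame, then concatenate it onto df
new_row_df = pd.DataFrame([new_row_data])
df = pd.concat([df, new_row_df], ignore_index=True)  # ignore_index renumbers the rows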
.iloc and .loc
The .iloc and .loc indexers are used in Pandas to access data in a DataFrame using different types of indexing.
.iloc is for integer position-based indexing, meaning you specify the position of rows and columns using integer indices, while .loc is for label-based indexing, meaning you specify the names (labels) of the rows and columns.
Here's how to use .iloc:
Accessing a specific row based on its integer position:
# Accessing the row at index 0
df.iloc[0]
Output:
date 2014-05-02 00:00:00
price 313000.0
bedrooms 3.0
bathrooms 1.5
sqft_living 1340
sqft_lot 7912
floors 1.5
waterfront 0
view 0
condition 3
sqft_above 1340
sqft_basement 0
yr_built 1955
yr_renovated 2005
street 18810 Densmore Ave N
city Shoreline
statezip WA 98133
country USA
Price_per_Square_Ft 233.58209
Name: 0, dtype: object
Accessing a specific column within a row:
# Access the 'price' column of the row at index 0
df.iloc[0, 1]
Output:
313000.0
Accessing multiple rows and columns using slices:
# Access rows at positions 0 to 2 and columns at positions 0 to 3 (the end of each slice is excluded)
df.iloc[0:3, 0:4]
Output:
 | date | price | bedrooms | bathrooms |
---|---|---|---|---|
0 | 2014-05-02 00:00:00 | 313000.0 | 3.0 | 1.5 |
1 | 2014-05-02 00:00:00 | 2384000.0 | 5.0 | 2.5 |
2 | 2014-05-02 00:00:00 | 342000.0 | 3.0 | 2.0 |
Changing a value in a specific cell:
# Change the price in the first row to 350000
df.iloc[0, 1] = 350000
df.iloc[0]
Output:
date 2014-05-02 00:00:00
price 350000.0
bedrooms 3.0
bathrooms 1.5
sqft_living 1340
sqft_lot 7912
floors 1.5
waterfront 0
view 0
condition 3
sqft_above 1340
sqft_basement 0
yr_built 1955
yr_renovated 2005
street 18810 Densmore Ave N
city Shoreline
statezip WA 98133
country USA
Price_per_Square_Ft 233.58209
Name: 0, dtype: object
Here's how to use .loc:
Accessing a specific row based on its label:
# Access the row with index 4500
df.loc[4500]
Output:
date 2014-06-17 00:00:00
price 540000.0
bedrooms 3.0
bathrooms 2.75
sqft_living 2750
sqft_lot 18029
floors 1.0
waterfront 0
view 2
condition 5
sqft_above 1810
sqft_basement 940
yr_built 1978
yr_renovated 0
street 4708 154th Pl SE
city Bellevue
statezip WA 98006
country USA
Price_per_Square_Ft 196.363636
Name: 4500, dtype: object
Accessing a specific column within a row using the name of the column:
# Access the 'price' column of the row with index 5
df.loc[5, 'price']
Output:
490000.0
Accessing multiple rows and columns using label slices or lists.
# Access rows with indices 0 to 3 and columns 'price' and 'bedrooms'
df.loc[0:3, ['price', 'bedrooms']]
Output:
 | price | bedrooms |
---|---|---|
0 | 350000.0 | 3.0 |
1 | 2384000.0 | 5.0 |
2 | 342000.0 | 3.0 |
3 | 420000.0 | 3.0 |
The key difference between .iloc and .loc is that .iloc uses integer positions to access rows and columns, while .loc uses labels (names).
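One more difference worth noting: slices given to .loc include the end label, while slices given to .iloc exclude the end position. A quick comparison:
df.iloc[0:3]   # rows at positions 0, 1 and 2 - the end position 3 is excluded
df.loc[0:3]    # rows with labels 0, 1, 2 and 3 - the end label 3 is included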
Dropping Rows and Columns
Here's how to drop a row from your dataset:
# Specify the index of the row you want to drop
row_to_drop = 4
# Drop the row and return a new DataFrame
df_new = df.drop(row_to_drop)
# Display the new dataset
df_new.head()
Output:
 | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | sqft_above | sqft_basement | yr_built | yr_renovated | street | city | statezip | country | Price_per_Square_Ft |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2014-05-02 00:00:00 | 350000.0 | 3.0 | 1.50 | 1340 | 7912 | 1.5 | 0 | 0 | 3 | 1340 | 0 | 1955 | 2005 | 18810 Densmore Ave N | Shoreline | WA 98133 | USA | 233.582090 |
1 | 2014-05-02 00:00:00 | 2384000.0 | 5.0 | 2.50 | 3650 | 9050 | 2.0 | 0 | 4 | 5 | 3370 | 280 | 1921 | 0 | 709 W Blaine St | Seattle | WA 98119 | USA | 653.150685 |
2 | 2014-05-02 00:00:00 | 342000.0 | 3.0 | 2.00 | 1930 | 11947 | 1.0 | 0 | 0 | 4 | 1930 | 0 | 1966 | 0 | 26206-26214 143rd Ave SE | Kent | WA 98042 | USA | 177.202073 |
3 | 2014-05-02 00:00:00 | 420000.0 | 3.0 | 2.25 | 2000 | 8030 | 1.0 | 0 | 0 | 4 | 1000 | 1000 | 1963 | 0 | 857 170th Pl NE | Bellevue | WA 98008 | USA | 210.000000 |
5 | 2014-05-02 00:00:00 | 490000.0 | 2.0 | 1.00 | 880 | 6380 | 1.0 | 0 | 0 | 3 | 880 | 0 | 1938 | 1994 | 522 NE 88th St | Seattle | WA 98115 | USA | 556.818182 |
You can see that row 4 is no longer in the dataset.
Here's how to drop a column from your dataset:
# Specify the name of the column you want to drop
column_to_drop = 'city'
# Drop the column and return a new Dataframe
df_new = df.drop(columns=[column_to_drop])
#Display the new DataFrame
df_new.head()
Output:
 | date | price | bedrooms | bathrooms | sqft_living | sqft_lot | floors | waterfront | view | condition | sqft_above | sqft_basement | yr_built | yr_renovated | street | statezip | country | Price_per_Square_Ft |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2014-05-02 00:00:00 | 350000.0 | 3.0 | 1.50 | 1340 | 7912 | 1.5 | 0 | 0 | 3 | 1340 | 0 | 1955 | 2005 | 18810 Densmore Ave N | WA 98133 | USA | 233.582090 |
1 | 2014-05-02 00:00:00 | 2384000.0 | 5.0 | 2.50 | 3650 | 9050 | 2.0 | 0 | 4 | 5 | 3370 | 280 | 1921 | 0 | 709 W Blaine St | WA 98119 | USA | 653.150685 |
2 | 2014-05-02 00:00:00 | 342000.0 | 3.0 | 2.00 | 1930 | 11947 | 1.0 | 0 | 0 | 4 | 1930 | 0 | 1966 | 0 | 26206-26214 143rd Ave SE | WA 98042 | USA | 177.202073 |
3 | 2014-05-02 00:00:00 | 420000.0 | 3.0 | 2.25 | 2000 | 8030 | 1.0 | 0 | 0 | 4 | 1000 | 1000 | 1963 | 0 | 857 170th Pl NE | WA 98008 | USA | 210.000000 |
4 | 2014-05-02 00:00:00 | 550000.0 | 4.0 | 2.50 | 1940 | 10500 | 1.0 | 0 | 0 | 4 | 1140 | 800 | 1976 | 1992 | 9105 170th Ave NE | WA 98052 | USA | 283.505155 |
Now, the 'city' column is no longer present in the dataset.
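You can also drop several columns at once by passing a list of column names; a small sketch (the columns chosen here are just an example):
# Drop more than one column in a single call
df_trimmed = df.drop(columns=['street', 'statezip', 'country'])
df_trimmed.head()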
Conclusion
Pandas is a crucial tool for data work in Python. We've demonstrated how Pandas can effectively help manipulate and analyze housing data. Whether you're a beginner or a seasoned professional, Pandas provides a variety of features to make your data analysis smoother.
For more information, check out the official Pandas documentation and online tutorials for advanced topics and features. Happy coding!