Introduction to Pandas

Introduction

If you're reading this article, you probably want to get an understanding of Pandas, so let's get to it.

Pandas - short for “Panel Data” - is a popular open-source programming language widely used for performing data manipulation and analysis. It has in-built functions to efficiently clean, transform, manipulate, visualize, and analyze data.

The Pandas library is an essential tool for Data analysts, Scientists, and Engineers working with structured data in Python.

This article will teach you basic functions you need to know when using Pandas library–its specific uses, and how to install Pandas.

Getting Started with Pandas

Let’s learn how to install Python Pandas Library.

How to install Pandas:

When working with Pandas Library, the first step is to ensure that it is installed in the system using the pip command.

pip install pandas

Requirement already satisfied: pandas in c:\users\userpc\anaconda3\lib\site-packages (2.1.4)Note: you may need to restart the kernel to use updated packages.

Requirement already satisfied: numpy<2,>=1.23.2 in c:\users\userpc\anaconda3\lib\site-packages (from pandas) (1.24.3)
Requirement already satisfied: python-dateutil>=2.8.2 in c:\users\userpc\anaconda3\lib\site-packages (from pandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in c:\users\userpc\anaconda3\lib\site-packages (from pandas) (2022.7)
Requirement already satisfied: tzdata>=2022.1 in c:\users\userpc\anaconda3\lib\site-packages (from pandas) (2023.4)
Requirement already satisfied: six>=1.5 in c:\users\userpc\anaconda3\lib\site-packages (from python-dateutil>=2.8.2->pandas) (1.16.0)

Importing Pandas

To begin working with Pandas, import pandas package as follows:

import pandas as pd

We are importing pandas from anaconda.
The most common and preferred short form for pandas is 'pd'. This shorter name is used because it makes the code shorter and easier to write whenever you need to use a pandas function.

Components of pandas

Pandas provides two data structures for manipulating data. They include:

Series
DataFrame

A Series is essentially a column while a DataFrame is like a multi-dimensional structure or table formed by combining multiple Series together.

Creating Series:

A pandas series is just like a column from an Excel Spreadsheet.

Example:

#create a pandas Series
numbers = pd.Series([1, 2, 3, 4, 5])

# Displaying the Series
numbers

Output:

0    1
1    2
2    3
3    4
4    5
dtype: int64

Note: The codes embedded inside the hashtag symbol(#) is called comments. Comments is used in python to explain what a code is about so that incase another programmer gets to go through your written code, the person can understand what the code is about.

Another example includes:

# Creating a Pandas Series of colors
colors_series = pd.Series(['red', 'blue', 'green'])

# Displaying the Series
colors_series

Output:

0      red
1     blue
2    green
dtype: object

For more reference on how create Pandas Series, refer to this article on Creating Pandas Series.

Creating DataFrame:

A DataFrame is a two-dimensional data structure similar to a table in a spreadsheet. It has rows and columns, where each row represents a record or observation, and each column represents a variable or an attribute.

A DataFrame lets you easily organize, manipulate, and analyze a dataset.

One simple way of creating DataFrames is by using a dictionary.

Here's how:


# Creating a DataFrame with fruits and car types
data = {'Fruits': ['Apple', 'Banana', 'Orange'],'Car Types': ['SUV', 'Sedan', 'Truck']}

df = pd.DataFrame(data)

# Displaying the DataFrame
df

Output:

	Fruits	Car Types
0	Apple	SUV
1	Banana	Sedan
2	Orange	Truck

Importing Datasets

There are various formats in which data can be imported into the working environment(in this case, the working environment is Jupyter notebook). These formats can be: CSV, excel, HTML, JSON, SQL, and many more.

Comma-Seperated Values(CSV) is the most common format. It is imported into the working environment by using the pd.read_csv() function.

Now let's import our real dataset from here at Kaggle

df = pd.read_csv("C:/Users/USERPC/Downloads/house_price.csv")
df

Output:

	date	price	bedrooms	bathrooms	sqft_living	sqft_lot	floors	waterfront	view	condition	sqft_above	sqft_basement	yr_built	yr_renovated	street	city	statezip	country
0	2014-05-02 00:00:00	3.130000e+05	3.0	1.50	1340	7912	1.5	0	0	3	1340	0	1955	2005	18810 Densmore Ave N	Shoreline	WA 98133	USA
1	2014-05-02 00:00:00	2.384000e+06	5.0	2.50	3650	9050	2.0	0	4	5	3370	280	1921	0	709 W Blaine St	Seattle	WA 98119	USA
2	2014-05-02 00:00:00	3.420000e+05	3.0	2.00	1930	11947	1.0	0	0	4	1930	0	1966	0	26206-26214 143rd Ave SE	Kent	WA 98042	USA
3	2014-05-02 00:00:00	4.200000e+05	3.0	2.25	2000	8030	1.0	0	0	4	1000	1000	1963	0	857 170th Pl NE	Bellevue	WA 98008	USA
4	2014-05-02 00:00:00	5.500000e+05	4.0	2.50	1940	10500	1.0	0	0	4	1140	800	1976	1992	9105 170th Ave NE	Redmond	WA 98052	USA
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
4595	2014-07-09 00:00:00	3.081667e+05	3.0	1.75	1510	6360	1.0	0	0	4	1510	0	1954	1979	501 N 143rd St	Seattle	WA 98133	USA
4596	2014-07-09 00:00:00	5.343333e+05	3.0	2.50	1460	7573	2.0	0	0	3	1460	0	1983	2009	14855 SE 10th Pl	Bellevue	WA 98007	USA
4597	2014-07-09 00:00:00	4.169042e+05	3.0	2.50	3010	7014	2.0	0	0	3	3010	0	2009	0	759 Ilwaco Pl NE	Renton	WA 98059	USA
4598	2014-07-10 00:00:00	2.034000e+05	4.0	2.00	2090	6630	1.0	0	0	3	1070	1020	1974	0	5148 S Creston St	Seattle	WA 98178	USA
4599	2014-07-10 00:00:00	2.206000e+05	3.0	2.50	1490	8102	2.0	0	0	4	1490	0	1990	0	18717 SE 258th St	Covington	WA 98042	USA

4600 rows × 18 columns

Let's explore some pandas functions:

.head()

A dataset consists of several rows and columns, which is hard to see all at once. So in this case, using the .head() function will return the first 5 rows of the dataset by default.

df.head()

Output:

	date	price	bedrooms	bathrooms	sqft_living	sqft_lot	floors	view	condition	sqft_above	sqft_basement	yr_built	yr_renovated	street	city	statezip	country
0	2014-05-02 00:00:00	313000.0	3.0	1.50	1340	7912	1.5	0	3	1340	0	1955	2005	18810 Densmore Ave N	Shoreline	WA 98133	USA
1	2014-05-02 00:00:00	2384000.0	5.0	2.50	3650	9050	2.0	4	5	3370	280	1921	0	709 W Blaine St	Seattle	WA 98119	USA
2	2014-05-02 00:00:00	342000.0	3.0	2.00	1930	11947	1.0	0	4	1930	0	1966	0	26206-26214 143rd Ave SE	Kent	WA 98042	USA
3	2014-05-02 00:00:00	420000.0	3.0	2.25	2000	8030	1.0	0	4	1000	1000	1963	0	857 170th Pl NE	Bellevue	WA 98008	USA
4	2014-05-02 00:00:00	550000.0	4.0	2.50	1940	10500	1.0	0	4	1140	800	1976	1992	9105 170th Ave NE	Redmond	WA 98052	USA

In case you want to see the first 7 or 10 rows of the dataset, you pass in the number of rows you want to see into the bracket like this:

# to return the first 7 rows of the dataset
df.head(7)

# to return the first 10 rows of the dataset
# df.head(10)

Output:

	date	price	bedrooms	bathrooms	sqft_living	sqft_lot	floors	view	condition	sqft_above	sqft_basement	yr_built	yr_renovated	street	city	statezip	country
0	2014-05-02 00:00:00	313000.0	3.0	1.50	1340	7912	1.5	0	3	1340	0	1955	2005	18810 Densmore Ave N	Shoreline	WA 98133	USA
1	2014-05-02 00:00:00	2384000.0	5.0	2.50	3650	9050	2.0	4	5	3370	280	1921	0	709 W Blaine St	Seattle	WA 98119	USA
2	2014-05-02 00:00:00	342000.0	3.0	2.00	1930	11947	1.0	0	4	1930	0	1966	0	26206-26214 143rd Ave SE	Kent	WA 98042	USA
3	2014-05-02 00:00:00	420000.0	3.0	2.25	2000	8030	1.0	0	4	1000	1000	1963	0	857 170th Pl NE	Bellevue	WA 98008	USA
4	2014-05-02 00:00:00	550000.0	4.0	2.50	1940	10500	1.0	0	4	1140	800	1976	1992	9105 170th Ave NE	Redmond	WA 98052	USA
5	2014-05-02 00:00:00	490000.0	2.0	1.00	880	6380	1.0	0	3	880	0	1938	1994	522 NE 88th St	Seattle	WA 98115	USA
6	2014-05-02 00:00:00	335000.0	2.0	2.00	1350	2560	1.0	0	3	1350	0	1976	0	2616 174th Ave NE	Redmond	WA 98052	USA

.tail()

The .tail() function returns the last 5 rows of the dataset by default. You can also pass in the specific number of rows that you want to the function to return as i earlier stated. In this case, you say .tail(8) or .tail(3). Calling these functions will return the last 8 rows of the dataset and the last 3 rows of the dataset. Let's run these codes, shall we?

# to return the last 5 rows of the dataset(by default)
df.tail()

Output:

	date	price	bedrooms	bathrooms	sqft_living	sqft_lot	floors	condition	sqft_above	sqft_basement	yr_built	yr_renovated	street	city	statezip	country
4595	2014-07-09 00:00:00	308166.666667	3.0	1.75	1510	6360	1.0	4	1510	0	1954	1979	501 N 143rd St	Seattle	WA 98133	USA
4596	2014-07-09 00:00:00	534333.333333	3.0	2.50	1460	7573	2.0	3	1460	0	1983	2009	14855 SE 10th Pl	Bellevue	WA 98007	USA
4597	2014-07-09 00:00:00	416904.166667	3.0	2.50	3010	7014	2.0	3	3010	0	2009	0	759 Ilwaco Pl NE	Renton	WA 98059	USA
4598	2014-07-10 00:00:00	203400.000000	4.0	2.00	2090	6630	1.0	3	1070	1020	1974	0	5148 S Creston St	Seattle	WA 98178	USA
4599	2014-07-10 00:00:00	220600.000000	3.0	2.50	1490	8102	2.0	4	1490	0	1990	0	18717 SE 258th St	Covington	WA 98042	USA

# to return the last 8 rows of the dataset
df.tail(8)

Output:

	date	price	bedrooms	bathrooms	sqft_living	sqft_lot	floors	condition	sqft_above	sqft_basement	yr_built	yr_renovated	street	city	statezip	country
4592	2014-07-08 00:00:00	252980.000000	4.0	2.50	2530	8169	2.0	3	2530	0	1993	0	37654 18th Pl S	Federal Way	WA 98003	USA
4593	2014-07-08 00:00:00	289373.307692	3.0	2.50	2538	4600	2.0	3	2538	0	2013	1923	5703 Charlotte Ave SE	Auburn	WA 98092	USA
4594	2014-07-09 00:00:00	210614.285714	3.0	2.50	1610	7223	2.0	3	1610	0	1994	0	26306 127th Ave SE	Kent	WA 98030	USA
4595	2014-07-09 00:00:00	308166.666667	3.0	1.75	1510	6360	1.0	4	1510	0	1954	1979	501 N 143rd St	Seattle	WA 98133	USA
4596	2014-07-09 00:00:00	534333.333333	3.0	2.50	1460	7573	2.0	3	1460	0	1983	2009	14855 SE 10th Pl	Bellevue	WA 98007	USA
4597	2014-07-09 00:00:00	416904.166667	3.0	2.50	3010	7014	2.0	3	3010	0	2009	0	759 Ilwaco Pl NE	Renton	WA 98059	USA
4598	2014-07-10 00:00:00	203400.000000	4.0	2.00	2090	6630	1.0	3	1070	1020	1974	0	5148 S Creston St	Seattle	WA 98178	USA
4599	2014-07-10 00:00:00	220600.000000	3.0	2.50	1490	8102	2.0	4	1490	0	1990	0	18717 SE 258th St	Covington	WA 98042	USA

# to return the last 3 rows of the dataset
df.tail(3)

Output:

	date	price	bedrooms	bathrooms	sqft_living	sqft_lot	floors	condition	sqft_above	sqft_basement	yr_built	street	city	statezip	country
4597	2014-07-09 00:00:00	416904.166667	3.0	2.5	3010	7014	2.0	3	3010	0	2009	759 Ilwaco Pl NE	Renton	WA 98059	USA
4598	2014-07-10 00:00:00	203400.000000	4.0	2.0	2090	6630	1.0	3	1070	1020	1974	5148 S Creston St	Seattle	WA 98178	USA
4599	2014-07-10 00:00:00	220600.000000	3.0	2.5	1490	8102	2.0	4	1490	0	1990	18717 SE 258th St	Covington	WA 98042	USA

.shape

We use the .shape attribute to check how large the dataset is. This function will return the number of rows and column in the dataset.

df.shape

Output:

(4600, 18)

Here we can see that the dataset has 4600 rows and 18 columns.

Note: The index of the dataset always starts with 0 by default and not 1. If the index starts from 1, the index number of the last row will be 4600. So since the index of the dataset starts from 0, the index of the last row will be 4599. You can go through the .tail() to check it.

.info()

The .info() function displays basic information about the dataset including the datatype, columns, non-null values, number of rows and columns and the memory usage.

df.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4600 entries, 0 to 4599
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   date           4600 non-null   object 
 1   price          4600 non-null   float64
 2   bedrooms       4600 non-null   float64
 3   bathrooms      4600 non-null   float64
 4   sqft_living    4600 non-null   int64  
 5   sqft_lot       4600 non-null   int64  
 6   floors         4600 non-null   float64
 7   waterfront     4600 non-null   int64  
 8   view           4600 non-null   int64  
 9   condition      4600 non-null   int64  
 10  sqft_above     4600 non-null   int64  
 11  sqft_basement  4600 non-null   int64  
 12  yr_built       4600 non-null   int64  
 13  yr_renovated   4600 non-null   int64  
 14  street         4600 non-null   object 
 15  city           4600 non-null   object 
 16  statezip       4600 non-null   object 
 17  country        4600 non-null   object 
dtypes: float64(4), int64(9), object(5)
memory usage: 647.0+ KB

.describe()

The .describe() function returns the descriptive statistics of the dataframe.

df.describe()

Output:

	price	bedrooms	bathrooms	sqft_living	sqft_lot	floors	waterfront	view	condition	sqft_above	sqft_basement	yr_built	yr_renovated
count	4.600000e+03	4600.000000	4600.000000	4600.000000	4.600000e+03	4600.000000	4600.000000	4600.000000	4600.000000	4600.000000	4600.000000	4600.000000	4600.000000
mean	5.519630e+05	3.400870	2.160815	2139.346957	1.485252e+04	1.512065	0.007174	0.240652	3.451739	1827.265435	312.081522	1970.786304	808.608261
std	5.638347e+05	0.908848	0.783781	963.206916	3.588444e+04	0.538288	0.084404	0.778405	0.677230	862.168977	464.137228	29.731848	979.414536
min	0.000000e+00	0.000000	0.000000	370.000000	6.380000e+02	1.000000	0.000000	0.000000	1.000000	370.000000	0.000000	1900.000000	0.000000
25%	3.228750e+05	3.000000	1.750000	1460.000000	5.000750e+03	1.000000	0.000000	0.000000	3.000000	1190.000000	0.000000	1951.000000	0.000000
50%	4.609435e+05	3.000000	2.250000	1980.000000	7.683000e+03	1.500000	0.000000	0.000000	3.000000	1590.000000	0.000000	1976.000000	0.000000
75%	6.549625e+05	4.000000	2.500000	2620.000000	1.100125e+04	2.000000	0.000000	0.000000	4.000000	2300.000000	610.000000	1997.000000	1999.000000
max	2.659000e+07	9.000000	8.000000	13540.000000	1.074218e+06	3.500000	1.000000	4.000000	5.000000	9410.000000	4820.000000	2014.000000	2014.000000

Where;

• count - is the total number of non-missing values

• mean - is the average of the dataframe

• std - is the standard deviation

• min - is the minimum value

• 25% - is the 25th percentile

• 50% - is the 50th percentile

• 75% - is the 75th percentile

• max - is the maximum value

.isnull()

This is an important function used to check for null values in a dataset.

# this function will find null values in the first 5 rows of the dataset
df.isnull().head()

Output:

	date	price	bedrooms	bathrooms	sqft_living	sqft_lot	floors	waterfront	view	condition	sqft_above	sqft_basement	yr_built	yr_renovated	street	city	statezip	country
0	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False
1	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False
2	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False
3	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False
4	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False	False

To count the number of null or missing values in the dataset, we do this:

df.isnull().sum()

Output:

date             0
price            0
bedrooms         0
bathrooms        0
sqft_living      0
sqft_lot         0
floors           0
waterfront       0
view             0
condition        0
sqft_above       0
sqft_basement    0
yr_built         0
yr_renovated     0
street           0
city             0
statezip         0
country          0
dtype: int64

We can see that there are no null values because their individual sum is 0.

.columns

This function is used to view the column names.

df.columns

Output:

Index(['date', 'price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot',
       'floors', 'waterfront', 'view', 'condition', 'sqft_above',
       'sqft_basement', 'yr_built', 'yr_renovated', 'street', 'city',
       'statezip', 'country'],
      dtype='object')

Selecting a column

To select a specfic column from the dataframe, insert the name of the column into square brackets [].

# return the first five values of the 'date' column in the dataframe
df[["date"]].head()

Output:

	date
0	2014-05-02 00:00:00
1	2014-05-02 00:00:00
2	2014-05-02 00:00:00
3	2014-05-02 00:00:00
4	2014-05-02 00:00:00

Note: To select a column, you can decide to call the name of the column to be returned as a dataframe object or a series object. This is usually done by either using double brackets or a single bracket. Hence the reason for using double brackets the example above.

To call a variable or the specific name of a column as a Series object, you pass the name of the column into a single bracket by doing this;

df['date'].head()

Output:

0    2014-05-02 00:00:00
1    2014-05-02 00:00:00
2    2014-05-02 00:00:00
3    2014-05-02 00:00:00
4    2014-05-02 00:00:00
Name: date, dtype: object

Did you notice the difference in the output of the two functions? The first output with double square brackets is a DataFrame object while this example gave an output as Series

Grouping data

Grouping data or columns is done by using the groupby function. Example:

# group the dataset by the number of bedrooms and calculate the average prize
grouped = df.groupby('bedrooms')['price'].mean()
grouped

Output:

bedrooms
0.0    1.195324e+06
1.0    2.740763e+05
2.0    3.916219e+05
3.0    4.886130e+05
4.0    6.351194e+05
5.0    7.701860e+05
6.0    8.173628e+05
7.0    1.049429e+06
8.0    1.155000e+06
9.0    5.999990e+05
Name: price, dtype: float64

Adding Rows and Columns

You can create new rows and columns to your dataset. However, when adding a new column to your dataset, you want to choose a column that can provide additional insights or useful information for analysis.
Let's add a new column called Price per Square Foot.

df['Price_per_Square_Ft'] = df['price'] / df['sqft_living']
df.head()

Output:

	date	price	bedrooms	bathrooms	sqft_living	sqft_lot	floors	view	condition	sqft_above	sqft_basement	yr_built	yr_renovated	street	city	statezip	country	Price_per_Square_Ft
0	2014-05-02 00:00:00	313000.0	3.0	1.50	1340	7912	1.5	0	3	1340	0	1955	2005	18810 Densmore Ave N	Shoreline	WA 98133	USA	233.582090
1	2014-05-02 00:00:00	2384000.0	5.0	2.50	3650	9050	2.0	4	5	3370	280	1921	0	709 W Blaine St	Seattle	WA 98119	USA	653.150685
2	2014-05-02 00:00:00	342000.0	3.0	2.00	1930	11947	1.0	0	4	1930	0	1966	0	26206-26214 143rd Ave SE	Kent	WA 98042	USA	177.202073
3	2014-05-02 00:00:00	420000.0	3.0	2.25	2000	8030	1.0	0	4	1000	1000	1963	0	857 170th Pl NE	Bellevue	WA 98008	USA	210.000000
4	2014-05-02 00:00:00	550000.0	4.0	2.50	1940	10500	1.0	0	4	1140	800	1976	1992	9105 170th Ave NE	Redmond	WA 98052	USA	283.505155

See that a new column called 'Price per Square Ft' has been added to the dataset. Let's select the column so that we can see it.

# Selecting the Price_per_Square_Ft column from the dataset
df[['Price_per_Square_Ft']]

Output:

	Price_per_Square_Ft
0	233.582090
1	653.150685
2	177.202073
3	210.000000
4	283.505155
...	...
4595	204.083885
4596	365.981735
4597	138.506368
4598	97.320574
4599	148.053691

4600 rows × 1 columns

df

Output:

	date	price	bedrooms	bathrooms	sqft_living	sqft_lot	floors	waterfront	view	condition	sqft_above	sqft_basement	yr_built	yr_renovated	street	city	statezip	country	Price_per_Square_Ft
0	2014-05-02 00:00:00	3.130000e+05	3.0	1.50	1340	7912	1.5	0	0	3	1340	0	1955	2005	18810 Densmore Ave N	Shoreline	WA 98133	USA	233.582090
1	2014-05-02 00:00:00	2.384000e+06	5.0	2.50	3650	9050	2.0	0	4	5	3370	280	1921	0	709 W Blaine St	Seattle	WA 98119	USA	653.150685
2	2014-05-02 00:00:00	3.420000e+05	3.0	2.00	1930	11947	1.0	0	0	4	1930	0	1966	0	26206-26214 143rd Ave SE	Kent	WA 98042	USA	177.202073
3	2014-05-02 00:00:00	4.200000e+05	3.0	2.25	2000	8030	1.0	0	0	4	1000	1000	1963	0	857 170th Pl NE	Bellevue	WA 98008	USA	210.000000
4	2014-05-02 00:00:00	5.500000e+05	4.0	2.50	1940	10500	1.0	0	0	4	1140	800	1976	1992	9105 170th Ave NE	Redmond	WA 98052	USA	283.505155
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
4595	2014-07-09 00:00:00	3.081667e+05	3.0	1.75	1510	6360	1.0	0	0	4	1510	0	1954	1979	501 N 143rd St	Seattle	WA 98133	USA	204.083885
4596	2014-07-09 00:00:00	5.343333e+05	3.0	2.50	1460	7573	2.0	0	0	3	1460	0	1983	2009	14855 SE 10th Pl	Bellevue	WA 98007	USA	365.981735
4597	2014-07-09 00:00:00	4.169042e+05	3.0	2.50	3010	7014	2.0	0	0	3	3010	0	2009	0	759 Ilwaco Pl NE	Renton	WA 98059	USA	138.506368
4598	2014-07-10 00:00:00	2.034000e+05	4.0	2.00	2090	6630	1.0	0	0	3	1070	1020	1974	0	5148 S Creston St	Seattle	WA 98178	USA	97.320574
4599	2014-07-10 00:00:00	2.206000e+05	3.0	2.50	1490	8102	2.0	0	0	4	1490	0	1990	0	18717 SE 258th St	Covington	WA 98042	USA	148.053691

4600 rows × 19 columns

Now to add a new row, do this;

# Create a dictionary with the new row data
new_row_data = {
    'date': '2014-07-10 00:00:00',
    'price': 350000.000000,
    'bedrooms': 3.0,
    'bathrooms': 2.5,
    'sqft_living': 1800,
    'sqft_lot': 8000,
    'floors': 2.0,
    'waterfront': 0,
    'view': 0,
    'condition': 3,
    'sqft_above': 1800,
    'sqft_basement': 0,
    'yr_built': 1990,
    'yr_renovated': 0,
    'street': '1234 New St',
    'city': 'Seattle',
    'statezip': 'WA 98105',
    'country': 'USA'
}

# Calculate the index for the new row
new_row_index = len(df)

# Assign the new row data to the DataFrame using .loc
df.loc[new_row_index] = new_row_data

# Display the updated DataFrame
df.tail()

Output:

	date	price	bedrooms	bathrooms	sqft_living	sqft_lot	floors	condition	sqft_above	sqft_basement	yr_built	yr_renovated	street	city	statezip	country	Price_per_Square_Ft
4596	2014-07-09 00:00:00	534333.333333	3.0	2.5	1460	7573	2.0	3	1460	0	1983	2009	14855 SE 10th Pl	Bellevue	WA 98007	USA	365.981735
4597	2014-07-09 00:00:00	416904.166667	3.0	2.5	3010	7014	2.0	3	3010	0	2009	0	759 Ilwaco Pl NE	Renton	WA 98059	USA	138.506368
4598	2014-07-10 00:00:00	203400.000000	4.0	2.0	2090	6630	1.0	3	1070	1020	1974	0	5148 S Creston St	Seattle	WA 98178	USA	97.320574
4599	2014-07-10 00:00:00	220600.000000	3.0	2.5	1490	8102	2.0	4	1490	0	1990	0	18717 SE 258th St	Covington	WA 98042	USA	148.053691
4600	2014-07-10 00:00:00	350000.000000	3.0	2.5	1800	8000	2.0	3	1800	0	1990	0	1234 New St	Seattle	WA 98105	USA	NaN

The first line of code is creating a dictionary called new_row_data where the keys are the column names of your DataFrame and the values are the new data you want to add.

The second line of the code calculates the new row's index as the new row will be added at the end of the DataFrame.

In the third line of code, the .loc function is used to assign the new_row_data to the DataFrame at the calculated index.

And then lastly, we display the DataFrame using the df.tail() function as this will display the last five rows of the Dataframe.

.iloc and .loc

The .iloc() and .loc() functions are used in Pandas to access data in a DataFrame using different types of indexing.

.iloc() is for integer-based indexing - meaning you can specify the position of rows and columns using integer indices, while .loc() is for label-based indexing - meaning you can specify the names(labels) of the rows and columns.

Here's how to use .iloc():

Accessing a specific row based on its integer index.

# Accessing the row at index 0
df.iloc[0]

Output:

date                    2014-05-02 00:00:00
price                              313000.0
bedrooms                                3.0
bathrooms                               1.5
sqft_living                            1340
sqft_lot                               7912
floors                                  1.5
waterfront                                0
view                                      0
condition                                 3
sqft_above                             1340
sqft_basement                             0
yr_built                               1955
yr_renovated                           2005
street                 18810 Densmore Ave N
city                              Shoreline
statezip                           WA 98133
country                                 USA
Price_per_Square_Ft               233.58209
Name: 0, dtype: object

Accessing a specific column within a row:

# Access the 'price' column of the row at index 0
df.iloc[0, 1]

Output:

313000.0

Accessing multiple rows and columns using slices:

# Access rows at index 0 to 2(inclusive) and columns at index 0 to 4
df.iloc[0:3, 0:4]

Output:

	date	price	bedrooms	bathrooms
0	2014-05-02 00:00:00	313000.0	3.0	1.5
1	2014-05-02 00:00:00	2384000.0	5.0	2.5
2	2014-05-02 00:00:00	342000.0	3.0	2.0

Changing a value in a specific cell:

# Change the price in the first row to 350000
df.iloc[0, 1] = 350000
df.iloc[0]

Output:

date                    2014-05-02 00:00:00
price                              350000.0
bedrooms                                3.0
bathrooms                               1.5
sqft_living                            1340
sqft_lot                               7912
floors                                  1.5
waterfront                                0
view                                      0
condition                                 3
sqft_above                             1340
sqft_basement                             0
yr_built                               1955
yr_renovated                           2005
street                 18810 Densmore Ave N
city                              Shoreline
statezip                           WA 98133
country                                 USA
Price_per_Square_Ft               233.58209
Name: 0, dtype: object

Here's how to use .loc():

Accessing a specified row based on its label:

# Access the row with index 4500
df.loc[4500]

Output:

date                   2014-06-17 00:00:00
price                             540000.0
bedrooms                               3.0
bathrooms                             2.75
sqft_living                           2750
sqft_lot                             18029
floors                                 1.0
waterfront                               0
view                                     2
condition                                5
sqft_above                            1810
sqft_basement                          940
yr_built                              1978
yr_renovated                             0
street                    4708 154th Pl SE
city                              Bellevue
statezip                          WA 98006
country                                USA
Price_per_Square_Ft             196.363636
Name: 4500, dtype: object

Accessing a specified column within a row using the name of the column.

# Access the 'price' column of the row with index 5
df.loc[5, 'price']

Output:

490000.0

Accessing multiple rows and columns using label slices or lists.

# Access rows with indices 0 to 3 and columns 'price' and 'bedrooms'
df.loc[0:3, ['price', 'bedrooms']]

Output:

	price	bedrooms
0	350000.0	3.0
1	2384000.0	5.0
2	342000.0	3.0
3	420000.0	3.0

The key difference between .iloc() and .loc() is that, .loc() uses integer indices to access rows and columns, while .loc() uses labels(names).

Dropping Rows and Columns

Here's how to drop a row from your dataset:

# Specify the index of the row you want to drop
row_to_drop = 4

# Drop the column and return a new DataFrame
df_new = df.drop(row_to_drop)

# Display the new dataset
df_new.head()

Output:

	date	price	bedrooms	bathrooms	sqft_living	sqft_lot	floors	view	condition	sqft_above	sqft_basement	yr_built	yr_renovated	street	city	statezip	country	Price_per_Square_Ft
0	2014-05-02 00:00:00	350000.0	3.0	1.50	1340	7912	1.5	0	3	1340	0	1955	2005	18810 Densmore Ave N	Shoreline	WA 98133	USA	233.582090
1	2014-05-02 00:00:00	2384000.0	5.0	2.50	3650	9050	2.0	4	5	3370	280	1921	0	709 W Blaine St	Seattle	WA 98119	USA	653.150685
2	2014-05-02 00:00:00	342000.0	3.0	2.00	1930	11947	1.0	0	4	1930	0	1966	0	26206-26214 143rd Ave SE	Kent	WA 98042	USA	177.202073
3	2014-05-02 00:00:00	420000.0	3.0	2.25	2000	8030	1.0	0	4	1000	1000	1963	0	857 170th Pl NE	Bellevue	WA 98008	USA	210.000000
5	2014-05-02 00:00:00	490000.0	2.0	1.00	880	6380	1.0	0	3	880	0	1938	1994	522 NE 88th St	Seattle	WA 98115	USA	556.818182

You can see that row 4 is no longer in the dataset.

Here's how to drop a column from your dataset:

# Specify the name of the column you want to drop
column_to_drop = 'city'

# Drop the column and return a new Dataframe
df_new = df.drop(columns=[column_to_drop])

#Display the new DataFrame
df_new.head()

Output:

	date	price	bedrooms	bathrooms	sqft_living	sqft_lot	floors	view	condition	sqft_above	sqft_basement	yr_built	yr_renovated	street	statezip	country	Price_per_Square_Ft
0	2014-05-02 00:00:00	350000.0	3.0	1.50	1340	7912	1.5	0	3	1340	0	1955	2005	18810 Densmore Ave N	WA 98133	USA	233.582090
1	2014-05-02 00:00:00	2384000.0	5.0	2.50	3650	9050	2.0	4	5	3370	280	1921	0	709 W Blaine St	WA 98119	USA	653.150685
2	2014-05-02 00:00:00	342000.0	3.0	2.00	1930	11947	1.0	0	4	1930	0	1966	0	26206-26214 143rd Ave SE	WA 98042	USA	177.202073
3	2014-05-02 00:00:00	420000.0	3.0	2.25	2000	8030	1.0	0	4	1000	1000	1963	0	857 170th Pl NE	WA 98008	USA	210.000000
4	2014-05-02 00:00:00	550000.0	4.0	2.50	1940	10500	1.0	0	4	1140	800	1976	1992	9105 170th Ave NE	WA 98052	USA	283.505155

Now, the 'city' column is no longer present in the dataset.

Conclusion

Pandas is a crucial tool for data work in Python. We've demonstrated how Pandas can effectively help manipulate and analyze housing data. Whether you're a beginner or a seasoned professional, Pandas provides a variety of features to make your data analysis smoother.

For more information, check out the official Pandas documentation and online tutorials for advanced topics and features. Happy coding!

Blog

Introduction to Pandas

Ugonma Ononogbu

Introduction

Getting Started with Pandas

How to install Pandas:

Importing Pandas

Components of pandas

Creating Series:

Creating DataFrame:

Importing Datasets

Let's explore some pandas functions:

.head()

.tail()

.shape

.info()

.describe()

.isnull()

.columns

Selecting a column

Grouping data

Adding Rows and Columns

.iloc and .loc

Here's how to use .iloc():

Here's how to use .loc():

Dropping Rows and Columns

Conclusion

Join Our Newsletter. No Spam, Only the good stuff.

Related