Introduction to Pandas

ugonma

Ugonma Ononogbu

Posted on June 22, 2024

Introduction to Pandas

Introduction

If you're reading this article, you probably want to get an understanding of Pandas, so let's get to it.

Pandas - short for “Panel Data” - is a popular open-source programming language widely used for performing data manipulation and analysis. It has in-built functions to efficiently clean, transform, manipulate, visualize, and analyze data.

The Pandas library is an essential tool for Data analysts, Scientists, and Engineers working with structured data in Python.

This article will teach you basic functions you need to know when using Pandas library–its specific uses, and how to install Pandas.

Getting Started with Pandas

Let’s learn how to install Python Pandas Library.

How to install Pandas:

When working with Pandas Library, the first step is to ensure that it is installed in the system using the pip command.

pip install pandas
Enter fullscreen mode Exit fullscreen mode
Requirement already satisfied: pandas in c:\users\userpc\anaconda3\lib\site-packages (2.1.4)Note: you may need to restart the kernel to use updated packages.

Requirement already satisfied: numpy<2,>=1.23.2 in c:\users\userpc\anaconda3\lib\site-packages (from pandas) (1.24.3)
Requirement already satisfied: python-dateutil>=2.8.2 in c:\users\userpc\anaconda3\lib\site-packages (from pandas) (2.8.2)
Requirement already satisfied: pytz>=2020.1 in c:\users\userpc\anaconda3\lib\site-packages (from pandas) (2022.7)
Requirement already satisfied: tzdata>=2022.1 in c:\users\userpc\anaconda3\lib\site-packages (from pandas) (2023.4)
Requirement already satisfied: six>=1.5 in c:\users\userpc\anaconda3\lib\site-packages (from python-dateutil>=2.8.2->pandas) (1.16.0)
Enter fullscreen mode Exit fullscreen mode

Importing Pandas

To begin working with Pandas, import pandas package as follows:

import pandas as pd 
Enter fullscreen mode Exit fullscreen mode

We are importing pandas from anaconda.
The most common and preferred short form for pandas is 'pd'. This shorter name is used because it makes the code shorter and easier to write whenever you need to use a pandas function.

Components of pandas

Pandas provides two data structures for manipulating data. They include:

  1. Series
  2. DataFrame

A Series is essentially a column while a DataFrame is like a multi-dimensional structure or table formed by combining multiple Series together.

Creating Series:

A pandas series is just like a column from an Excel Spreadsheet.

Example:

#create a pandas Series
numbers = pd.Series([1, 2, 3, 4, 5])

# Displaying the Series
numbers
Enter fullscreen mode Exit fullscreen mode

Output:

0    1
1    2
2    3
3    4
4    5
dtype: int64
Enter fullscreen mode Exit fullscreen mode

Note: The codes embedded inside the hashtag symbol(#) is called comments. Comments is used in python to explain what a code is about so that incase another programmer gets to go through your written code, the person can understand what the code is about.

Another example includes:

# Creating a Pandas Series of colors
colors_series = pd.Series(['red', 'blue', 'green'])

# Displaying the Series
colors_series
Enter fullscreen mode Exit fullscreen mode

Output:

0      red
1     blue
2    green
dtype: object
Enter fullscreen mode Exit fullscreen mode

For more reference on how create Pandas Series, refer to this article on Creating Pandas Series.

Creating DataFrame:

A DataFrame is a two-dimensional data structure similar to a table in a spreadsheet. It has rows and columns, where each row represents a record or observation, and each column represents a variable or an attribute.

A DataFrame lets you easily organize, manipulate, and analyze a dataset.

One simple way of creating DataFrames is by using a dictionary.

Here's how:


# Creating a DataFrame with fruits and car types
data = {'Fruits': ['Apple', 'Banana', 'Orange'],'Car Types': ['SUV', 'Sedan', 'Truck']}

df = pd.DataFrame(data)

# Displaying the DataFrame
df
Enter fullscreen mode Exit fullscreen mode

Output:

Fruits Car Types
0 Apple SUV
1 Banana Sedan
2 Orange Truck

Importing Datasets

There are various formats in which data can be imported into the working environment(in this case, the working environment is Jupyter notebook). These formats can be: CSV, excel, HTML, JSON, SQL, and many more.

Comma-Seperated Values(CSV) is the most common format. It is imported into the working environment by using the pd.read_csv() function.

Now let's import our real dataset from here at Kaggle

df = pd.read_csv("C:/Users/USERPC/Downloads/house_price.csv")
df
Enter fullscreen mode Exit fullscreen mode

Output:

date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition sqft_above sqft_basement yr_built yr_renovated street city statezip country
0 2014-05-02 00:00:00 3.130000e+05 3.0 1.50 1340 7912 1.5 0 0 3 1340 0 1955 2005 18810 Densmore Ave N Shoreline WA 98133 USA
1 2014-05-02 00:00:00 2.384000e+06 5.0 2.50 3650 9050 2.0 0 4 5 3370 280 1921 0 709 W Blaine St Seattle WA 98119 USA
2 2014-05-02 00:00:00 3.420000e+05 3.0 2.00 1930 11947 1.0 0 0 4 1930 0 1966 0 26206-26214 143rd Ave SE Kent WA 98042 USA
3 2014-05-02 00:00:00 4.200000e+05 3.0 2.25 2000 8030 1.0 0 0 4 1000 1000 1963 0 857 170th Pl NE Bellevue WA 98008 USA
4 2014-05-02 00:00:00 5.500000e+05 4.0 2.50 1940 10500 1.0 0 0 4 1140 800 1976 1992 9105 170th Ave NE Redmond WA 98052 USA
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4595 2014-07-09 00:00:00 3.081667e+05 3.0 1.75 1510 6360 1.0 0 0 4 1510 0 1954 1979 501 N 143rd St Seattle WA 98133 USA
4596 2014-07-09 00:00:00 5.343333e+05 3.0 2.50 1460 7573 2.0 0 0 3 1460 0 1983 2009 14855 SE 10th Pl Bellevue WA 98007 USA
4597 2014-07-09 00:00:00 4.169042e+05 3.0 2.50 3010 7014 2.0 0 0 3 3010 0 2009 0 759 Ilwaco Pl NE Renton WA 98059 USA
4598 2014-07-10 00:00:00 2.034000e+05 4.0 2.00 2090 6630 1.0 0 0 3 1070 1020 1974 0 5148 S Creston St Seattle WA 98178 USA
4599 2014-07-10 00:00:00 2.206000e+05 3.0 2.50 1490 8102 2.0 0 0 4 1490 0 1990 0 18717 SE 258th St Covington WA 98042 USA

4600 rows × 18 columns

Let's explore some pandas functions:

.head()

A dataset consists of several rows and columns, which is hard to see all at once. So in this case, using the .head() function will return the first 5 rows of the dataset by default.

df.head()
Enter fullscreen mode Exit fullscreen mode

Output:

date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition sqft_above sqft_basement yr_built yr_renovated street city statezip country
0 2014-05-02 00:00:00 313000.0 3.0 1.50 1340 7912 1.5 0 0 3 1340 0 1955 2005 18810 Densmore Ave N Shoreline WA 98133 USA
1 2014-05-02 00:00:00 2384000.0 5.0 2.50 3650 9050 2.0 0 4 5 3370 280 1921 0 709 W Blaine St Seattle WA 98119 USA
2 2014-05-02 00:00:00 342000.0 3.0 2.00 1930 11947 1.0 0 0 4 1930 0 1966 0 26206-26214 143rd Ave SE Kent WA 98042 USA
3 2014-05-02 00:00:00 420000.0 3.0 2.25 2000 8030 1.0 0 0 4 1000 1000 1963 0 857 170th Pl NE Bellevue WA 98008 USA
4 2014-05-02 00:00:00 550000.0 4.0 2.50 1940 10500 1.0 0 0 4 1140 800 1976 1992 9105 170th Ave NE Redmond WA 98052 USA

In case you want to see the first 7 or 10 rows of the dataset, you pass in the number of rows you want to see into the bracket like this:

# to return the first 7 rows of the dataset
df.head(7)

# to return the first 10 rows of the dataset
# df.head(10)
Enter fullscreen mode Exit fullscreen mode

Output:

date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition sqft_above sqft_basement yr_built yr_renovated street city statezip country
0 2014-05-02 00:00:00 313000.0 3.0 1.50 1340 7912 1.5 0 0 3 1340 0 1955 2005 18810 Densmore Ave N Shoreline WA 98133 USA
1 2014-05-02 00:00:00 2384000.0 5.0 2.50 3650 9050 2.0 0 4 5 3370 280 1921 0 709 W Blaine St Seattle WA 98119 USA
2 2014-05-02 00:00:00 342000.0 3.0 2.00 1930 11947 1.0 0 0 4 1930 0 1966 0 26206-26214 143rd Ave SE Kent WA 98042 USA
3 2014-05-02 00:00:00 420000.0 3.0 2.25 2000 8030 1.0 0 0 4 1000 1000 1963 0 857 170th Pl NE Bellevue WA 98008 USA
4 2014-05-02 00:00:00 550000.0 4.0 2.50 1940 10500 1.0 0 0 4 1140 800 1976 1992 9105 170th Ave NE Redmond WA 98052 USA
5 2014-05-02 00:00:00 490000.0 2.0 1.00 880 6380 1.0 0 0 3 880 0 1938 1994 522 NE 88th St Seattle WA 98115 USA
6 2014-05-02 00:00:00 335000.0 2.0 2.00 1350 2560 1.0 0 0 3 1350 0 1976 0 2616 174th Ave NE Redmond WA 98052 USA

.tail()

The .tail() function returns the last 5 rows of the dataset by default. You can also pass in the specific number of rows that you want to the function to return as i earlier stated. In this case, you say .tail(8) or .tail(3). Calling these functions will return the last 8 rows of the dataset and the last 3 rows of the dataset. Let's run these codes, shall we?

# to return the last 5 rows of the dataset(by default)
df.tail()
Enter fullscreen mode Exit fullscreen mode

Output:

date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition sqft_above sqft_basement yr_built yr_renovated street city statezip country
4595 2014-07-09 00:00:00 308166.666667 3.0 1.75 1510 6360 1.0 0 0 4 1510 0 1954 1979 501 N 143rd St Seattle WA 98133 USA
4596 2014-07-09 00:00:00 534333.333333 3.0 2.50 1460 7573 2.0 0 0 3 1460 0 1983 2009 14855 SE 10th Pl Bellevue WA 98007 USA
4597 2014-07-09 00:00:00 416904.166667 3.0 2.50 3010 7014 2.0 0 0 3 3010 0 2009 0 759 Ilwaco Pl NE Renton WA 98059 USA
4598 2014-07-10 00:00:00 203400.000000 4.0 2.00 2090 6630 1.0 0 0 3 1070 1020 1974 0 5148 S Creston St Seattle WA 98178 USA
4599 2014-07-10 00:00:00 220600.000000 3.0 2.50 1490 8102 2.0 0 0 4 1490 0 1990 0 18717 SE 258th St Covington WA 98042 USA
# to return the last 8 rows of the dataset
df.tail(8)
Enter fullscreen mode Exit fullscreen mode

Output:

date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition sqft_above sqft_basement yr_built yr_renovated street city statezip country
4592 2014-07-08 00:00:00 252980.000000 4.0 2.50 2530 8169 2.0 0 0 3 2530 0 1993 0 37654 18th Pl S Federal Way WA 98003 USA
4593 2014-07-08 00:00:00 289373.307692 3.0 2.50 2538 4600 2.0 0 0 3 2538 0 2013 1923 5703 Charlotte Ave SE Auburn WA 98092 USA
4594 2014-07-09 00:00:00 210614.285714 3.0 2.50 1610 7223 2.0 0 0 3 1610 0 1994 0 26306 127th Ave SE Kent WA 98030 USA
4595 2014-07-09 00:00:00 308166.666667 3.0 1.75 1510 6360 1.0 0 0 4 1510 0 1954 1979 501 N 143rd St Seattle WA 98133 USA
4596 2014-07-09 00:00:00 534333.333333 3.0 2.50 1460 7573 2.0 0 0 3 1460 0 1983 2009 14855 SE 10th Pl Bellevue WA 98007 USA
4597 2014-07-09 00:00:00 416904.166667 3.0 2.50 3010 7014 2.0 0 0 3 3010 0 2009 0 759 Ilwaco Pl NE Renton WA 98059 USA
4598 2014-07-10 00:00:00 203400.000000 4.0 2.00 2090 6630 1.0 0 0 3 1070 1020 1974 0 5148 S Creston St Seattle WA 98178 USA
4599 2014-07-10 00:00:00 220600.000000 3.0 2.50 1490 8102 2.0 0 0 4 1490 0 1990 0 18717 SE 258th St Covington WA 98042 USA
# to return the last 3 rows of the dataset
df.tail(3)
Enter fullscreen mode Exit fullscreen mode

Output:

date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition sqft_above sqft_basement yr_built yr_renovated street city statezip country
4597 2014-07-09 00:00:00 416904.166667 3.0 2.5 3010 7014 2.0 0 0 3 3010 0 2009 0 759 Ilwaco Pl NE Renton WA 98059 USA
4598 2014-07-10 00:00:00 203400.000000 4.0 2.0 2090 6630 1.0 0 0 3 1070 1020 1974 0 5148 S Creston St Seattle WA 98178 USA
4599 2014-07-10 00:00:00 220600.000000 3.0 2.5 1490 8102 2.0 0 0 4 1490 0 1990 0 18717 SE 258th St Covington WA 98042 USA

.shape

We use the .shape attribute to check how large the dataset is. This function will return the number of rows and column in the dataset.

df.shape
Enter fullscreen mode Exit fullscreen mode

Output:

(4600, 18)
Enter fullscreen mode Exit fullscreen mode

Here we can see that the dataset has 4600 rows and 18 columns.

Note: The index of the dataset always starts with 0 by default and not 1. If the index starts from 1, the index number of the last row will be 4600. So since the index of the dataset starts from 0, the index of the last row will be 4599. You can go through the .tail() to check it.

.info()

The .info() function displays basic information about the dataset including the datatype, columns, non-null values, number of rows and columns and the memory usage.

df.info()
Enter fullscreen mode Exit fullscreen mode

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4600 entries, 0 to 4599
Data columns (total 18 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   date           4600 non-null   object 
 1   price          4600 non-null   float64
 2   bedrooms       4600 non-null   float64
 3   bathrooms      4600 non-null   float64
 4   sqft_living    4600 non-null   int64  
 5   sqft_lot       4600 non-null   int64  
 6   floors         4600 non-null   float64
 7   waterfront     4600 non-null   int64  
 8   view           4600 non-null   int64  
 9   condition      4600 non-null   int64  
 10  sqft_above     4600 non-null   int64  
 11  sqft_basement  4600 non-null   int64  
 12  yr_built       4600 non-null   int64  
 13  yr_renovated   4600 non-null   int64  
 14  street         4600 non-null   object 
 15  city           4600 non-null   object 
 16  statezip       4600 non-null   object 
 17  country        4600 non-null   object 
dtypes: float64(4), int64(9), object(5)
memory usage: 647.0+ KB
Enter fullscreen mode Exit fullscreen mode

.describe()

The .describe() function returns the descriptive statistics of the dataframe.

df.describe()
Enter fullscreen mode Exit fullscreen mode

Output:

price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition sqft_above sqft_basement yr_built yr_renovated
count 4.600000e+03 4600.000000 4600.000000 4600.000000 4.600000e+03 4600.000000 4600.000000 4600.000000 4600.000000 4600.000000 4600.000000 4600.000000 4600.000000
mean 5.519630e+05 3.400870 2.160815 2139.346957 1.485252e+04 1.512065 0.007174 0.240652 3.451739 1827.265435 312.081522 1970.786304 808.608261
std 5.638347e+05 0.908848 0.783781 963.206916 3.588444e+04 0.538288 0.084404 0.778405 0.677230 862.168977 464.137228 29.731848 979.414536
min 0.000000e+00 0.000000 0.000000 370.000000 6.380000e+02 1.000000 0.000000 0.000000 1.000000 370.000000 0.000000 1900.000000 0.000000
25% 3.228750e+05 3.000000 1.750000 1460.000000 5.000750e+03 1.000000 0.000000 0.000000 3.000000 1190.000000 0.000000 1951.000000 0.000000
50% 4.609435e+05 3.000000 2.250000 1980.000000 7.683000e+03 1.500000 0.000000 0.000000 3.000000 1590.000000 0.000000 1976.000000 0.000000
75% 6.549625e+05 4.000000 2.500000 2620.000000 1.100125e+04 2.000000 0.000000 0.000000 4.000000 2300.000000 610.000000 1997.000000 1999.000000
max 2.659000e+07 9.000000 8.000000 13540.000000 1.074218e+06 3.500000 1.000000 4.000000 5.000000 9410.000000 4820.000000 2014.000000 2014.000000

Where;

• count - is the total number of non-missing values

• mean - is the average of the dataframe

• std - is the standard deviation

• min - is the minimum value

• 25% - is the 25th percentile

• 50% - is the 50th percentile

• 75% - is the 75th percentile

• max - is the maximum value

.isnull()

This is an important function used to check for null values in a dataset.

# this function will find null values in the first 5 rows of the dataset
df.isnull().head()
Enter fullscreen mode Exit fullscreen mode

Output:

date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition sqft_above sqft_basement yr_built yr_renovated street city statezip country
0 False False False False False False False False False False False False False False False False False False
1 False False False False False False False False False False False False False False False False False False
2 False False False False False False False False False False False False False False False False False False
3 False False False False False False False False False False False False False False False False False False
4 False False False False False False False False False False False False False False False False False False

To count the number of null or missing values in the dataset, we do this:

df.isnull().sum()
Enter fullscreen mode Exit fullscreen mode

Output:

date             0
price            0
bedrooms         0
bathrooms        0
sqft_living      0
sqft_lot         0
floors           0
waterfront       0
view             0
condition        0
sqft_above       0
sqft_basement    0
yr_built         0
yr_renovated     0
street           0
city             0
statezip         0
country          0
dtype: int64
Enter fullscreen mode Exit fullscreen mode

We can see that there are no null values because their individual sum is 0.

.columns

This function is used to view the column names.

df.columns
Enter fullscreen mode Exit fullscreen mode

Output:

Index(['date', 'price', 'bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot',
       'floors', 'waterfront', 'view', 'condition', 'sqft_above',
       'sqft_basement', 'yr_built', 'yr_renovated', 'street', 'city',
       'statezip', 'country'],
      dtype='object')
Enter fullscreen mode Exit fullscreen mode

Selecting a column

To select a specfic column from the dataframe, insert the name of the column into square brackets [].

# return the first five values of the 'date' column in the dataframe
df[["date"]].head()
Enter fullscreen mode Exit fullscreen mode

Output:

date
0 2014-05-02 00:00:00
1 2014-05-02 00:00:00
2 2014-05-02 00:00:00
3 2014-05-02 00:00:00
4 2014-05-02 00:00:00

Note: To select a column, you can decide to call the name of the column to be returned as a dataframe object or a series object. This is usually done by either using double brackets or a single bracket. Hence the reason for using double brackets the example above.

To call a variable or the specific name of a column as a Series object, you pass the name of the column into a single bracket by doing this;

df['date'].head()
Enter fullscreen mode Exit fullscreen mode

Output:

0    2014-05-02 00:00:00
1    2014-05-02 00:00:00
2    2014-05-02 00:00:00
3    2014-05-02 00:00:00
4    2014-05-02 00:00:00
Name: date, dtype: object
Enter fullscreen mode Exit fullscreen mode

Did you notice the difference in the output of the two functions? The first output with double square brackets is a DataFrame object while this example gave an output as Series

Grouping data

Grouping data or columns is done by using the groupby function. Example:

# group the dataset by the number of bedrooms and calculate the average prize
grouped = df.groupby('bedrooms')['price'].mean()
grouped
Enter fullscreen mode Exit fullscreen mode

Output:

bedrooms
0.0    1.195324e+06
1.0    2.740763e+05
2.0    3.916219e+05
3.0    4.886130e+05
4.0    6.351194e+05
5.0    7.701860e+05
6.0    8.173628e+05
7.0    1.049429e+06
8.0    1.155000e+06
9.0    5.999990e+05
Name: price, dtype: float64
Enter fullscreen mode Exit fullscreen mode

Adding Rows and Columns

You can create new rows and columns to your dataset. However, when adding a new column to your dataset, you want to choose a column that can provide additional insights or useful information for analysis.
Let's add a new column called Price per Square Foot.

df['Price_per_Square_Ft'] = df['price'] / df['sqft_living']
df.head()
Enter fullscreen mode Exit fullscreen mode

Output:

date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition sqft_above sqft_basement yr_built yr_renovated street city statezip country Price_per_Square_Ft
0 2014-05-02 00:00:00 313000.0 3.0 1.50 1340 7912 1.5 0 0 3 1340 0 1955 2005 18810 Densmore Ave N Shoreline WA 98133 USA 233.582090
1 2014-05-02 00:00:00 2384000.0 5.0 2.50 3650 9050 2.0 0 4 5 3370 280 1921 0 709 W Blaine St Seattle WA 98119 USA 653.150685
2 2014-05-02 00:00:00 342000.0 3.0 2.00 1930 11947 1.0 0 0 4 1930 0 1966 0 26206-26214 143rd Ave SE Kent WA 98042 USA 177.202073
3 2014-05-02 00:00:00 420000.0 3.0 2.25 2000 8030 1.0 0 0 4 1000 1000 1963 0 857 170th Pl NE Bellevue WA 98008 USA 210.000000
4 2014-05-02 00:00:00 550000.0 4.0 2.50 1940 10500 1.0 0 0 4 1140 800 1976 1992 9105 170th Ave NE Redmond WA 98052 USA 283.505155

See that a new column called 'Price per Square Ft' has been added to the dataset. Let's select the column so that we can see it.

# Selecting the Price_per_Square_Ft column from the dataset
df[['Price_per_Square_Ft']]
Enter fullscreen mode Exit fullscreen mode

Output:

Price_per_Square_Ft
0 233.582090
1 653.150685
2 177.202073
3 210.000000
4 283.505155
... ...
4595 204.083885
4596 365.981735
4597 138.506368
4598 97.320574
4599 148.053691

4600 rows × 1 columns

df
Enter fullscreen mode Exit fullscreen mode

Output:

date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition sqft_above sqft_basement yr_built yr_renovated street city statezip country Price_per_Square_Ft
0 2014-05-02 00:00:00 3.130000e+05 3.0 1.50 1340 7912 1.5 0 0 3 1340 0 1955 2005 18810 Densmore Ave N Shoreline WA 98133 USA 233.582090
1 2014-05-02 00:00:00 2.384000e+06 5.0 2.50 3650 9050 2.0 0 4 5 3370 280 1921 0 709 W Blaine St Seattle WA 98119 USA 653.150685
2 2014-05-02 00:00:00 3.420000e+05 3.0 2.00 1930 11947 1.0 0 0 4 1930 0 1966 0 26206-26214 143rd Ave SE Kent WA 98042 USA 177.202073
3 2014-05-02 00:00:00 4.200000e+05 3.0 2.25 2000 8030 1.0 0 0 4 1000 1000 1963 0 857 170th Pl NE Bellevue WA 98008 USA 210.000000
4 2014-05-02 00:00:00 5.500000e+05 4.0 2.50 1940 10500 1.0 0 0 4 1140 800 1976 1992 9105 170th Ave NE Redmond WA 98052 USA 283.505155
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
4595 2014-07-09 00:00:00 3.081667e+05 3.0 1.75 1510 6360 1.0 0 0 4 1510 0 1954 1979 501 N 143rd St Seattle WA 98133 USA 204.083885
4596 2014-07-09 00:00:00 5.343333e+05 3.0 2.50 1460 7573 2.0 0 0 3 1460 0 1983 2009 14855 SE 10th Pl Bellevue WA 98007 USA 365.981735
4597 2014-07-09 00:00:00 4.169042e+05 3.0 2.50 3010 7014 2.0 0 0 3 3010 0 2009 0 759 Ilwaco Pl NE Renton WA 98059 USA 138.506368
4598 2014-07-10 00:00:00 2.034000e+05 4.0 2.00 2090 6630 1.0 0 0 3 1070 1020 1974 0 5148 S Creston St Seattle WA 98178 USA 97.320574
4599 2014-07-10 00:00:00 2.206000e+05 3.0 2.50 1490 8102 2.0 0 0 4 1490 0 1990 0 18717 SE 258th St Covington WA 98042 USA 148.053691

4600 rows × 19 columns

Now to add a new row, do this;

# Create a dictionary with the new row data
new_row_data = {
    'date': '2014-07-10 00:00:00',
    'price': 350000.000000,
    'bedrooms': 3.0,
    'bathrooms': 2.5,
    'sqft_living': 1800,
    'sqft_lot': 8000,
    'floors': 2.0,
    'waterfront': 0,
    'view': 0,
    'condition': 3,
    'sqft_above': 1800,
    'sqft_basement': 0,
    'yr_built': 1990,
    'yr_renovated': 0,
    'street': '1234 New St',
    'city': 'Seattle',
    'statezip': 'WA 98105',
    'country': 'USA'
}

# Calculate the index for the new row
new_row_index = len(df)

# Assign the new row data to the DataFrame using .loc
df.loc[new_row_index] = new_row_data

# Display the updated DataFrame
df.tail()
Enter fullscreen mode Exit fullscreen mode

Output:

date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition sqft_above sqft_basement yr_built yr_renovated street city statezip country Price_per_Square_Ft
4596 2014-07-09 00:00:00 534333.333333 3.0 2.5 1460 7573 2.0 0 0 3 1460 0 1983 2009 14855 SE 10th Pl Bellevue WA 98007 USA 365.981735
4597 2014-07-09 00:00:00 416904.166667 3.0 2.5 3010 7014 2.0 0 0 3 3010 0 2009 0 759 Ilwaco Pl NE Renton WA 98059 USA 138.506368
4598 2014-07-10 00:00:00 203400.000000 4.0 2.0 2090 6630 1.0 0 0 3 1070 1020 1974 0 5148 S Creston St Seattle WA 98178 USA 97.320574
4599 2014-07-10 00:00:00 220600.000000 3.0 2.5 1490 8102 2.0 0 0 4 1490 0 1990 0 18717 SE 258th St Covington WA 98042 USA 148.053691
4600 2014-07-10 00:00:00 350000.000000 3.0 2.5 1800 8000 2.0 0 0 3 1800 0 1990 0 1234 New St Seattle WA 98105 USA NaN

The first line of code is creating a dictionary called new_row_data where the keys are the column names of your DataFrame and the values are the new data you want to add.

The second line of the code calculates the new row's index as the new row will be added at the end of the DataFrame.

In the third line of code, the .loc function is used to assign the new_row_data to the DataFrame at the calculated index.

And then lastly, we display the DataFrame using the df.tail() function as this will display the last five rows of the Dataframe.

.iloc and .loc

The .iloc() and .loc() functions are used in Pandas to access data in a DataFrame using different types of indexing.

.iloc() is for integer-based indexing - meaning you can specify the position of rows and columns using integer indices, while .loc() is for label-based indexing - meaning you can specify the names(labels) of the rows and columns.

Here's how to use .iloc():

Accessing a specific row based on its integer index.

# Accessing the row at index 0
df.iloc[0]
Enter fullscreen mode Exit fullscreen mode

Output:

date                    2014-05-02 00:00:00
price                              313000.0
bedrooms                                3.0
bathrooms                               1.5
sqft_living                            1340
sqft_lot                               7912
floors                                  1.5
waterfront                                0
view                                      0
condition                                 3
sqft_above                             1340
sqft_basement                             0
yr_built                               1955
yr_renovated                           2005
street                 18810 Densmore Ave N
city                              Shoreline
statezip                           WA 98133
country                                 USA
Price_per_Square_Ft               233.58209
Name: 0, dtype: object
Enter fullscreen mode Exit fullscreen mode

Accessing a specific column within a row:

# Access the 'price' column of the row at index 0
df.iloc[0, 1]
Enter fullscreen mode Exit fullscreen mode

Output:

313000.0
Enter fullscreen mode Exit fullscreen mode

Accessing multiple rows and columns using slices:

# Access rows at index 0 to 2(inclusive) and columns at index 0 to 4
df.iloc[0:3, 0:4]
Enter fullscreen mode Exit fullscreen mode

Output:

date price bedrooms bathrooms
0 2014-05-02 00:00:00 313000.0 3.0 1.5
1 2014-05-02 00:00:00 2384000.0 5.0 2.5
2 2014-05-02 00:00:00 342000.0 3.0 2.0

Changing a value in a specific cell:

# Change the price in the first row to 350000
df.iloc[0, 1] = 350000
df.iloc[0]
Enter fullscreen mode Exit fullscreen mode

Output:

date                    2014-05-02 00:00:00
price                              350000.0
bedrooms                                3.0
bathrooms                               1.5
sqft_living                            1340
sqft_lot                               7912
floors                                  1.5
waterfront                                0
view                                      0
condition                                 3
sqft_above                             1340
sqft_basement                             0
yr_built                               1955
yr_renovated                           2005
street                 18810 Densmore Ave N
city                              Shoreline
statezip                           WA 98133
country                                 USA
Price_per_Square_Ft               233.58209
Name: 0, dtype: object
Enter fullscreen mode Exit fullscreen mode

Here's how to use .loc():

Accessing a specified row based on its label:

# Access the row with index 4500
df.loc[4500]
Enter fullscreen mode Exit fullscreen mode

Output:

date                   2014-06-17 00:00:00
price                             540000.0
bedrooms                               3.0
bathrooms                             2.75
sqft_living                           2750
sqft_lot                             18029
floors                                 1.0
waterfront                               0
view                                     2
condition                                5
sqft_above                            1810
sqft_basement                          940
yr_built                              1978
yr_renovated                             0
street                    4708 154th Pl SE
city                              Bellevue
statezip                          WA 98006
country                                USA
Price_per_Square_Ft             196.363636
Name: 4500, dtype: object
Enter fullscreen mode Exit fullscreen mode

Accessing a specified column within a row using the name of the column.

# Access the 'price' column of the row with index 5
df.loc[5, 'price']
Enter fullscreen mode Exit fullscreen mode

Output:

490000.0
Enter fullscreen mode Exit fullscreen mode

Accessing multiple rows and columns using label slices or lists.

# Access rows with indices 0 to 3 and columns 'price' and 'bedrooms'
df.loc[0:3, ['price', 'bedrooms']]
Enter fullscreen mode Exit fullscreen mode

Output:

price bedrooms
0 350000.0 3.0
1 2384000.0 5.0
2 342000.0 3.0
3 420000.0 3.0

The key difference between .iloc() and .loc() is that, .loc() uses integer indices to access rows and columns, while .loc() uses labels(names).

Dropping Rows and Columns

Here's how to drop a row from your dataset:

# Specify the index of the row you want to drop
row_to_drop = 4

# Drop the column and return a new DataFrame
df_new = df.drop(row_to_drop)

# Display the new dataset
df_new.head()
Enter fullscreen mode Exit fullscreen mode

Output:

date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition sqft_above sqft_basement yr_built yr_renovated street city statezip country Price_per_Square_Ft
0 2014-05-02 00:00:00 350000.0 3.0 1.50 1340 7912 1.5 0 0 3 1340 0 1955 2005 18810 Densmore Ave N Shoreline WA 98133 USA 233.582090
1 2014-05-02 00:00:00 2384000.0 5.0 2.50 3650 9050 2.0 0 4 5 3370 280 1921 0 709 W Blaine St Seattle WA 98119 USA 653.150685
2 2014-05-02 00:00:00 342000.0 3.0 2.00 1930 11947 1.0 0 0 4 1930 0 1966 0 26206-26214 143rd Ave SE Kent WA 98042 USA 177.202073
3 2014-05-02 00:00:00 420000.0 3.0 2.25 2000 8030 1.0 0 0 4 1000 1000 1963 0 857 170th Pl NE Bellevue WA 98008 USA 210.000000
5 2014-05-02 00:00:00 490000.0 2.0 1.00 880 6380 1.0 0 0 3 880 0 1938 1994 522 NE 88th St Seattle WA 98115 USA 556.818182

You can see that row 4 is no longer in the dataset.

Here's how to drop a column from your dataset:

# Specify the name of the column you want to drop
column_to_drop = 'city'

# Drop the column and return a new Dataframe
df_new = df.drop(columns=[column_to_drop])

#Display the new DataFrame
df_new.head()
Enter fullscreen mode Exit fullscreen mode

Output:

date price bedrooms bathrooms sqft_living sqft_lot floors waterfront view condition sqft_above sqft_basement yr_built yr_renovated street statezip country Price_per_Square_Ft
0 2014-05-02 00:00:00 350000.0 3.0 1.50 1340 7912 1.5 0 0 3 1340 0 1955 2005 18810 Densmore Ave N WA 98133 USA 233.582090
1 2014-05-02 00:00:00 2384000.0 5.0 2.50 3650 9050 2.0 0 4 5 3370 280 1921 0 709 W Blaine St WA 98119 USA 653.150685
2 2014-05-02 00:00:00 342000.0 3.0 2.00 1930 11947 1.0 0 0 4 1930 0 1966 0 26206-26214 143rd Ave SE WA 98042 USA 177.202073
3 2014-05-02 00:00:00 420000.0 3.0 2.25 2000 8030 1.0 0 0 4 1000 1000 1963 0 857 170th Pl NE WA 98008 USA 210.000000
4 2014-05-02 00:00:00 550000.0 4.0 2.50 1940 10500 1.0 0 0 4 1140 800 1976 1992 9105 170th Ave NE WA 98052 USA 283.505155

Now, the 'city' column is no longer present in the dataset.

Conclusion

Pandas is a crucial tool for data work in Python. We've demonstrated how Pandas can effectively help manipulate and analyze housing data. Whether you're a beginner or a seasoned professional, Pandas provides a variety of features to make your data analysis smoother.

For more information, check out the official Pandas documentation and online tutorials for advanced topics and features. Happy coding!

💖 💪 🙅 🚩
ugonma
Ugonma Ononogbu

Posted on June 22, 2024

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related

Introduction to Pandas
python Introduction to Pandas

June 22, 2024

What's New in Python 3.10!
python What's New in Python 3.10!

January 4, 2021