Python Pandas for Beginners – A Complete Guide (Part 2)
Python Geeks
Posted on November 2, 2019
In this article, we continue learning Python Pandas. We will learn how to read a DataFrame from a file and cover the most important Pandas operations for beginners.
In the previous post of the series, we covered the basic concepts of Pandas, such as what Pandas is, Series, and DataFrame. If you don't remember, click here to get back to Part 1.
How to read and write a DataFrame
It is quite simple to read data with the support of the Pandas library. We can load data from various sources such as CSV, JSON, or Excel files. For learning purposes, we will try loading data from each of these sources.
JSON file
Assume that we have a JSON file containing data related to product inventory, as below.
{
    "CAN": {
        "headphone": 1,
        "laptop": 5
    },
    "SGN": {
        "headphone": 3,
        "laptop": 8
    },
    "SIN": {
        "headphone": 0,
        "laptop": 5
    }
}
Then we can read the JSON file easily with the read_json function of Pandas.
import pandas as pd
data = pd.read_json("data.json")
print(data)
# Result:
# CAN SGN SIN
# headphone 1 3 0
# laptop 5 8 5
#
Writing data to a JSON file:
data.to_json("exported_data.json")
CSV file
We use the same demo data as in the previous example, but this time in CSV format. The data in the file looks like this:
,CAN,SGN,SIN
headphone,1,3,0
laptop,5,8,5
Loading the data into a DataFrame using read_csv:
import pandas as pd
data = pd.read_csv("data.csv")
print(data)
# Result:
# Unnamed: 0 CAN SGN SIN
# 0 headphone 1 3 0
# 1 laptop 5 8 5
#
In the above code, read_csv generates a default integer index. However, we want the first column (headphone, laptop) to be the index. So we pass the index_col parameter to read_csv to tell it which column to use as the index.
import pandas as pd
data = pd.read_csv("data.csv", index_col=0)
print(data)
# Result:
# CAN SGN SIN
# headphone 1 3 0
# laptop 5 8 5
#
Writing a DataFrame to a CSV file:
data.to_csv("exported_data.csv")
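One pitfall worth noting: by default, to_csv also writes the index as an unnamed first column, which is exactly why the Unnamed: 0 column appeared earlier when we read without index_col. A small round-trip sketch (the file names here are just for illustration):

```python
import pandas as pd

data = pd.DataFrame({"CAN": [1, 5], "SGN": [3, 8], "SIN": [0, 5]},
                    index=["headphone", "laptop"])

# Write with the index, then restore it with index_col=0 on read.
data.to_csv("exported_data.csv")
restored = pd.read_csv("exported_data.csv", index_col=0)
print(restored.index.tolist())  # ['headphone', 'laptop']

# Alternatively, pass index=False to skip writing the index entirely.
data.to_csv("exported_no_index.csv", index=False)
```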
Excel file
Reading data from an Excel file using the read_excel function:
import pandas as pd
data = pd.read_excel("data.xlsx", index_col=0)
print(data)
# Result:
# CAN SGN SIN
# headphone 1 3 0
# laptop 5 8 5
#
Writing a DataFrame to an Excel file:
data.to_excel('exported_data.xlsx', sheet_name='Sheet1')
Important Pandas Operators
Before diving into the operations, we will create a DataFrame to use in the examples throughout this section.
import pandas as pd
import numpy as np
dates = pd.date_range('20190101', periods=50)
data = pd.DataFrame(np.random.randn(50, 4), index=dates, columns=list('ABCD'))
Viewing your data
Now we have a DataFrame with 50 rows, which is too big to print. In a real-life situation, the number of rows will be many times larger. Therefore, the first thing we should know is how to print out just a few rows as a visual reference.
We will use .head() to display the first 5 rows of the data:
data.head()
# Result:
# A B C D
# 2019-01-01 -1.005372 0.142613 -0.181516 1.036709
# 2019-01-02 0.790087 0.294033 -0.602744 1.035578
# 2019-01-03 -1.703856 0.126258 1.080593 -0.421066
# 2019-01-04 -0.558818 -2.923537 -1.721127 -0.275644
# 2019-01-05 -1.408334 -0.860980 0.052589 1.104063
What if we would like to print the first 10 rows instead of 5? In this case, we can pass a number to .head() to define how many rows to print. For example:
data.head(10)
# Result:
# A B C D
# 2019-01-01 -1.005372 0.142613 -0.181516 1.036709
# 2019-01-02 0.790087 0.294033 -0.602744 1.035578
# 2019-01-03 -1.703856 0.126258 1.080593 -0.421066
# 2019-01-04 -0.558818 -2.923537 -1.721127 -0.275644
# 2019-01-05 -1.408334 -0.860980 0.052589 1.104063
# 2019-01-06 0.418509 0.607834 0.017931 0.748909
# 2019-01-07 0.336740 -0.406930 1.420234 -1.702862
# 2019-01-08 0.739592 1.051292 -0.757623 -1.156324
# 2019-01-09 -0.225760 0.416810 0.128996 -1.450296
# 2019-01-10 0.527811 1.771893 -0.551995 1.101953
We can use .tail() to get the last 5 rows of the data. As with .head(), we can pass a number to determine how many rows are printed.
data.tail(7)
# Result:
# A B C D
# 2019-02-13 0.834528 2.912336 -0.957908 0.758701
# 2019-02-14 -0.866577 -0.886605 -0.339376 1.296223
# 2019-02-15 0.324452 -1.030220 0.854473 1.471936
# 2019-02-16 2.657040 -1.169546 -1.746896 -0.745877
# 2019-02-17 1.494073 -0.709933 -0.086347 -0.512125
# 2019-02-18 -1.455421 -0.370378 1.475331 -0.867604
# 2019-02-19 -1.542814 0.355690 -0.705522 0.069457
Showing data info
.info() shows summary information about your data:
data.info()
# Result
# <class 'pandas.core.frame.DataFrame'>
# DatetimeIndex: 50 entries, 2019-01-01 to 2019-02-19
# Freq: D
# Data columns (total 4 columns):
# A 50 non-null float64
# B 50 non-null float64
# C 50 non-null float64
# D 50 non-null float64
# dtypes: float64(4)
# memory usage: 2.0 KB
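Alongside .info(), two attributes answer the most common quick questions about a DataFrame: .shape gives the dimensions as a (rows, columns) tuple, and .dtypes lists the type of each column. A quick sketch using the same random data:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20190101', periods=50)
data = pd.DataFrame(np.random.randn(50, 4), index=dates, columns=list('ABCD'))

print(data.shape)             # (50, 4)
print(str(data.dtypes['A']))  # float64
```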
DataFrame selection, addition, and deletion
So far, we have learned about the Pandas data structures and how to create, load, and write a DataFrame. In this section, we will learn the methods of selecting, adding, and deleting data that you'll need to use constantly.
By Column
The simplest way to select a column of data is by using brackets. For example:
data['A'].head()
# Result
# 2019-01-01 -1.005372
# 2019-01-02 0.790087
# 2019-01-03 -1.703856
# 2019-01-04 -0.558818
# 2019-01-05 -1.408334
# Freq: D, Name: A, dtype: float64
We can also select multiple columns by passing a list inside the brackets.
data[['A', 'D']].head()
# Result
# A D
# 2019-01-01 -1.005372 1.036709
# 2019-01-02 0.790087 1.035578
# 2019-01-03 -1.703856 -0.421066
# 2019-01-04 -0.558818 -0.275644
# 2019-01-05 -1.408334 1.104063
By Row
Another choice for indexing data is using .loc and .iloc:
- .loc: location by label (index name)
- .iloc: location by integer position
If we pass only one parameter to .loc or .iloc, it is interpreted as a row selector by default. Now let's get the row labeled "2019-02-19".
data.loc['2019-02-19']
# Result
# A -1.542814
# B 0.355690
# C -0.705522
# D 0.069457
# Name: 2019-02-19 00:00:00, dtype: float64
We can also select specific columns in the result:
data.loc['2019-02-19', ['A', 'B']]
# Result
# A -1.542814
# B 0.355690
# Name: 2019-02-19 00:00:00, dtype: float64
Because the row labels are timestamps, we can also select rows by time range.
data.loc['2019-02-02':'2019-02-05']
# Result
# A B C D
# 2019-02-02 1.343458 -0.940805 -0.671073 -0.102026
# 2019-02-03 0.815890 1.190464 -0.479341 -0.105754
# 2019-02-04 -0.256803 0.595940 0.279049 2.010431
# 2019-02-05 0.311328 -0.111790 -0.515169 -0.014552
Similar to .loc, we can use .iloc to select one or more specific positions. Below is an example of selecting multiple rows by position:
data.iloc[[1,3],:]
# Result
# A B C D
# 2019-01-02 0.790087 0.294033 -0.602744 1.035578
# 2019-01-04 -0.558818 -2.923537 -1.721127 -0.275644
Or selecting a slice of index:
data.iloc[10:13]
# Result
# A B C D
# 2019-01-11 0.530191 -1.301325 -0.050186 0.267728
# 2019-01-12 -0.652778 -0.611594 0.291840 -0.935883
# 2019-01-13 1.602400 -0.137409 1.002766 0.820419
Be aware that although we ask for rows [10:13], what we get is the data from row 10 to row 12; row 13 is not in the result. Slicing with .iloc follows the same rule as slicing Python lists: the element at the stop index is not included.
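Note that .loc behaves differently here: label slices include both endpoints. A quick comparison on the same kind of DataFrame makes the contrast clear:

```python
import numpy as np
import pandas as pd

dates = pd.date_range('20190101', periods=50)
data = pd.DataFrame(np.random.randn(50, 4), index=dates, columns=list('ABCD'))

# .iloc is half-open like list slicing: rows 10, 11, 12 -> 3 rows
print(len(data.iloc[10:13]))                     # 3

# .loc slices by label and includes both endpoints -> 4 rows
print(len(data.loc['2019-01-11':'2019-01-14']))  # 4
```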
Selection by callable
.loc, .iloc, and [] indexing can also accept a callable. The callable must be a function of one argument. As a best practice, Python developers usually use a lambda. You can refer to our article for more detail about the lambda function (Python Lambda Function).
data.loc[lambda row: row.D > 0]
# Result:
# A B C D
# 2019-01-01 -0.862970 1.312603 -0.427154 1.222134
# 2019-01-03 0.173703 -0.258924 -0.014464 0.320602
# 2019-01-04 0.252102 -1.914325 0.649628 0.885115
# 2019-01-06 -0.461437 0.063366 0.569461 0.427437
# 2019-01-08 -0.956210 0.067605 -2.273661 0.833149
# 2019-01-09 0.340620 -1.976085 -1.144538 0.425282
# 2019-01-11 0.530191 -1.301325 -0.050186 0.267728
# 2019-01-13 1.602400 -0.137409 1.002766 0.820419
# 2019-01-15 -1.083804 1.790858 -0.301093 0.674832
# 2019-01-16 -0.143779 2.104730 -1.228123 1.568482
# 2019-01-22 -0.821942 1.381137 -1.812166 0.785756
# 2019-01-23 -1.404428 0.563167 2.138703 2.387186
# 2019-01-28 0.271000 -0.633658 -0.839952 0.816997
# 2019-01-29 -0.259307 1.638976 -0.648043 0.715521
# 2019-02-01 0.109348 -0.624031 -0.283261 0.138264
# 2019-02-03 -1.319024 -0.131162 2.011317 1.312116
# 2019-02-04 0.910435 -0.636221 -0.857388 0.456990
Handling duplicated data
Handling duplicated data is a common task when working with a large data set. We could iterate over each row of data and check whether it duplicates another, but that would be very slow and complicated.
Fortunately, the DataFrame class provides functions to handle duplicated data easily. We can find duplicates across all columns or across a subset of them. Now let's see how they work.
- duplicated: returns a Boolean Series with True marking each duplicated row.
- drop_duplicates: removes the duplicate rows.
By default, the first observed row is considered unique. However, each method has a keep parameter to specify which occurrence is kept.
- keep='first': the default value; mark duplicates except for the first occurrence.
- keep='last': mark duplicates except for the last occurrence.
- keep=False: mark all occurrences as duplicates.
Before running the examples, let's create some data:
import pandas as pd
employees = [('Mark', 34, 'Toronto'),
             ('Kana', 30, 'Delhi'),
             ('Tam', 26, 'Ha Noi'),
             ('Kana', 30, 'Delhi'),
             ('Kana', 30, 'Delhi'),
             ('Kana', 30, 'Delhi'),
             ('Hashima', 40, 'London'),
             ('Rook', 30, 'Delhi')]
data = pd.DataFrame(employees, columns=['Name', 'Age', 'City'])
print(data)
# Name Age City
# 0 Mark 34 Toronto
# 1 Kana 30 Delhi
# 2 Tam 26 Ha Noi
# 3 Kana 30 Delhi
# 4 Kana 30 Delhi
# 5 Kana 30 Delhi
# 6 Hashima 40 London
# 7 Rook 30 Delhi
By Row
Finding the duplicated rows:
# The same with data.duplicated(keep='first')
data.duplicated()
# Result
# 0 False
# 1 False
# 2 False
# 3 True
# 4 True
# 5 True
# 6 False
# 7 False
# dtype: bool
The result is a Boolean Series marking whether each row is a duplicate. If we use it to select from the data, we get the duplicated rows themselves:
data[data.duplicated()]
# Result
# Name Age City
# 3 Kana 30 Delhi
# 4 Kana 30 Delhi
# 5 Kana 30 Delhi
Dropping the duplicated rows, keeping the first occurrence by default:
data.drop_duplicates()
# Result
# Name Age City
# 0 Mark 34 Toronto
# 1 Kana 30 Delhi
# 2 Tam 26 Ha Noi
# 6 Hashima 40 London
# 7 Rook 30 Delhi
Now let's see the difference with keep='last':
data.drop_duplicates(keep='last')
# Result
# Name Age City
# 0 Mark 34 Toronto
# 2 Tam 26 Ha Noi
# 5 Kana 30 Delhi
# 6 Hashima 40 London
# 7 Rook 30 Delhi
And now with keep=False:
data.drop_duplicates(keep=False)
# Result
# Name Age City
# 0 Mark 34 Toronto
# 2 Tam 26 Ha Noi
# 6 Hashima 40 London
# 7 Rook 30 Delhi
By Column
We can also pass specific columns to determine whether rows are duplicated:
data[data.duplicated(['Age', 'City'])]
# Result
# Name Age City
# 3 Kana 30 Delhi
# 4 Kana 30 Delhi
# 5 Kana 30 Delhi
# 7 Rook 30 Delhi
Other examples of drop_duplicates with selected columns are available on our GitHub. Check it for more detail if needed.
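As a quick taste, drop_duplicates accepts the same column list through its subset parameter, keeping the first occurrence of each Age/City combination:

```python
import pandas as pd

employees = [('Mark', 34, 'Toronto'),
             ('Kana', 30, 'Delhi'),
             ('Tam', 26, 'Ha Noi'),
             ('Kana', 30, 'Delhi'),
             ('Kana', 30, 'Delhi'),
             ('Kana', 30, 'Delhi'),
             ('Hashima', 40, 'London'),
             ('Rook', 30, 'Delhi')]
data = pd.DataFrame(employees, columns=['Name', 'Age', 'City'])

# Rows 3, 4, 5 and 7 share Age/City with row 1, so they are dropped.
deduped = data.drop_duplicates(subset=['Age', 'City'])
print(deduped)
```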
Applying Function
Now, for example, we would like to get the employees who are over 30 years old. Don't iterate over each employee and check if the employee is over 30 years old or not. Instead, take advantage of the Pandas library. It is apply()
function.
data[data['Age'].apply(lambda x: x > 30)]
# Result
# Name Age City
# 0 Mark 34 Toronto
# 6 Hashima 40 London
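For a simple comparison like this, the same filter can also be written as a vectorized boolean mask, which avoids calling a Python function per element and is typically faster than apply:

```python
import pandas as pd

data = pd.DataFrame({'Name': ['Mark', 'Kana', 'Tam', 'Hashima'],
                     'Age': [34, 30, 26, 40],
                     'City': ['Toronto', 'Delhi', 'Ha Noi', 'London']})

# Equivalent to data[data['Age'].apply(lambda x: x > 30)]
over_30 = data[data['Age'] > 30]
print(over_30['Name'].tolist())  # ['Mark', 'Hashima']
```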
Summary
Exploring, cleaning, transforming, and visualizing data with Pandas in Python is an essential skill in data science. This series covered only the basics of Pandas for beginners. Find full examples from this series in our GitHub project.
If you found this series useful, please share it with others, leave a comment, and subscribe to get notified of future posts. We really appreciate your taking the time to read the post and share it with other Python Geeks.