Introduction to Pandas: Data Reading

codeitmichael

Michael_Maranan

Posted on October 4, 2023

Introduction to Pandas: Data Reading

In this blog, we'll explore the process of reading and working with data using Pandas. No prior experience is necessary – we'll start from the basics and guide you through every step with simple explanations and practical examples.

By the end of this journey, you'll have the skills to effortlessly import, explore, and make sense of data from various sources. So, let's dive into it!

Topics

  • A. Importing and Reading Data Files
    • A.1. Reading Data
    • A.2. Naming Columns
    • A.3. Handling Missing Values
    • A.4. Using Different Separators
  • B. Exploring the Data
  • C. Saving the Data

A. Importing and Reading Data Files

Before we start, let's import our Pandas module first.

import pandas as pd
Enter fullscreen mode Exit fullscreen mode

A.1. Reading Data

Reading data is a fundamental step in data analysis, and Pandas simplifies this process by offering functions for importing data from various sources, including CSV, Excel, and databases.

For this tutorial, I'm going to use a CSV file containing the following data.

John Doe,28,Male,45000,johndoe@example.com
Jane Smith,35,Female,60000,janesmith@example.com
Mike Johnson,22,Male,,mikejohnson@example.com
Emily Davis,,Female,55000,emilydavis@example.com
Chris Brown,40,,75000,chrisbrown@example.com
Anna Lee,27,Female,62000,annalee@example.com
David Clark,32,Male,68000,davidclark@example.com
Sophia Kim,29,Female,51000,sophiakim@example.com
Kevin Wilson,38,Male,,kevinwilson@example.com
Linda Johnson,45,Female,72000,lindajohnson@example.com
Robert Smith,33,Male,58000,robertsmith@example.com
Mary Wilson,,Female,62000,marywilson@example.com
William Davis,28,Male,48000,williamdavis@example.com
Jennifer Lee,29,Female,53000,jenniferlee@example.com
Michael Brown,36,Male,70000,michaelbrown@example.com
Patricia Miller,31,Female,56000,patriciamiller@example.com
James Taylor,42,Male,80000,jamestaylor@example.com
Karen Anderson,34,Female,,karenanderson@example.com
Joseph Martinez,26,Male,49000,josephmartinez@example.com
Enter fullscreen mode Exit fullscreen mode

As what I mentioned earlier, Pandas has built-in functions for reading different data files. Some of these are read_csv, read_xlsx, read_html, read_json, read_sql, etc.

# filename
csv_url = "data.csv"

# reading CSV files
df = pd.read_csv(csv_url)

df
Enter fullscreen mode Exit fullscreen mode

Output:
whole data table

By calling the df variable, we can see the data from our source file.

A.2. Naming Columns

As you notice if we call our data table, it sets the first row of text as its column-name/header by default. But in our CSV file, we don't have a header for each column so it set the first row of data as the header. To prevent it from doing so, we can use the header argument and set it to None.

df = pd.read_csv(csv_url, header=None)
df.head()
Enter fullscreen mode Exit fullscreen mode

Output:
prevent the data table from doing first row header

By setting our header to None, we avoided setting the first row as header and set our header to numbers instead. To give each column a proper header name, we can use the names argument.

df = pd.read_csv(csv_url, header=None, names=["Name", "Age", "Sex", "Income", "Email"])
df.head()
Enter fullscreen mode Exit fullscreen mode

Output:
Naming column headers

A.3. Handling Missing Values

If we look back at our table, especially in the "Income" column, we will notice some of the missing data have different symbols.

df["Income"]
Enter fullscreen mode Exit fullscreen mode

Output:
Income column

You can convert these symbols into NaN by mentioning it in the na_values argument.

df = pd.read_csv(
    csv_url, 
    header=None, 
    names=["Name", "Age", "Sex", "Income", "Email"], 
    na_values=["?", "_", "--"] # replaces the following character with NaN
)
df["Income"]
Enter fullscreen mode Exit fullscreen mode

Output:
Convert character value to NaN

A.4. Using Different Separators

If you ever run into a CSV with a different separator (let's say your data file got - instead of ,), you can use the delimiter or sep argument.

df = pd.read_csv(
    csv_url, 
    header=None, 
    names=["Name", "Age", "Sex", "Income", "Email"], 
    na_values=["?", "_", "--"] # replaces the following character with NaN
    sep="-" # changes separator
)
Enter fullscreen mode Exit fullscreen mode

sep and delimiter works the same.

df = pd.read_csv(
    csv_url, 
    header=None, 
    names=["Name", "Age", "Sex", "Income", "Email"], 
    na_values=["?", "_", "--"] # replaces the following character with NaN
    delimiter="-" # changes separator
)
Enter fullscreen mode Exit fullscreen mode

You can often use delimiter or sep interchangeably for simple cases where a single character separates values. However, if you need to handle more complex separation patterns or use regular expressions, then sep offers more flexibility.

Here's an example of sep using a regular expression to separate by either a comma or a semicolon:

df = pd.read_csv('data.csv', sep='[,;]')
Enter fullscreen mode Exit fullscreen mode

B. Exploring the Data

Pandas also have built-in methods for viewing the basic information about our data.

df.info()
Enter fullscreen mode Exit fullscreen mode

Output:
Basic Data table information

The info(), from the name itself, returns the basic information about the data table, data such as Column, Dtype (data type), etc.

If you want to access the first or last few rows of the table, you can use head() and tail().

# head()
df.head()
Enter fullscreen mode Exit fullscreen mode

Output:
using head()

# tail()
df.tail()
Enter fullscreen mode Exit fullscreen mode

Output:
using tail()

The head() returns the first rows of the data table while the tail() returns the last rows. They return 5 rows by default.

You can pass a number on both functions to specify how many rows you want to be returned.

df.head(3)
Enter fullscreen mode Exit fullscreen mode

Output:
head() with number argument

df.tail(7)
Enter fullscreen mode Exit fullscreen mode

Output:
tail() with number argument

Here are some more functions that will return basic information:

df.describe() - Generate summary statistics of numeric columns.

df.describe()
Enter fullscreen mode Exit fullscreen mode

Output:
describe() output

df.shape - Get the dimensions of the DataFrame (rows, columns).

df.shape
Enter fullscreen mode Exit fullscreen mode

Output:
shape() otuput

df['column_name'] - Access a specific column.

df["Email"]
Enter fullscreen mode Exit fullscreen mode

accessing a column

C. Saving Data

After processing your data, you can save it back to a file or a database using Pandas. For example, to save a DataFrame to a CSV file:

df.to_csv('new_data.csv', index=False)
Enter fullscreen mode Exit fullscreen mode

The index=False argument prevents Pandas from writing the row indices to the CSV file.

Pandas provides a rich set of functions for data manipulation and analysis, making it a powerful tool for working with tabular data in Python.

In conclusion, data reading with Pandas is an essential and powerful skill for data analysts and scientists. This Python library simplifies the process of importing data from various sources, such as CSV files, Excel spreadsheets, and databases, and provides a user-friendly interface for data exploration. By mastering Pandas' data reading capabilities, professionals can efficiently load, manipulate, and analyze datasets, making it a cornerstone of practical data analysis workflows.

💖 💪 🙅 🚩
codeitmichael
Michael_Maranan

Posted on October 4, 2023

Join Our Newsletter. No Spam, Only the good stuff.

Sign up to receive the latest update from our blog.

Related

Introduction to Pandas: Data Reading
datascience Introduction to Pandas: Data Reading

October 4, 2023