Email Analysis with Python 3 (Part II)

yomaokobiah

Ogheneyoma Okobiah

Posted on June 29, 2021

Welcome back! This is the sequel to Part I, where we covered making changes to our Gmail account, getting the subject and sender of each email, and visualising some of the email data.
This part focuses on getting the body of the emails.

If you read the first part and tried it out, I hope it went without a hitch. If you did not but are interested in learning how to get the body of an email, you can follow along from here. Let's jump right in.

Prerequisites

  • Python 3
  • Pandas
  • A Gmail account

Getting The Data

The account setup was covered in Part I, as it involves making changes to your Gmail account so that imaplib can work with it.

Step 1: Importing the required libraries to get the email data

Here we import the libraries we need: imaplib, email, getpass and pandas. If you do not have pandas, you can install it with pip install pandas.

import imaplib
import email
import email.header  # imported explicitly because decode_header() is used in Step 5
import getpass
import pandas as pd

Step 2: Gaining access to the email server

Here we log into the email server with our credentials.

username = input("Enter the email address: ")
password = getpass.getpass("Enter password: ")
mail = imaplib.IMAP4_SSL('imap.gmail.com')
mail.login(username, password)

Step 3: Specifying the mailbox to get data from.

Here we print the list of available mailboxes and select one of them.

print(mail.list())
mail.select("inbox")
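The output of mail.list() is a status plus a list of raw byte strings, which can be hard to read. As a minimal sketch, the helper below (mailbox_names is a hypothetical name, not part of imaplib) pulls out just the mailbox names. The sample lines are made up to resemble a Gmail response, so your own output will differ.

```python
import re

# Matches one line of an IMAP LIST response: (flags) "delimiter" name
LIST_LINE = re.compile(r'\((?P<flags>.*?)\) "(?P<delimiter>.*?)" (?P<name>.*)')


def mailbox_names(list_lines):
    """Return the mailbox names found in the lines returned by mail.list()."""
    names = []
    for line in list_lines:
        match = LIST_LINE.match(line.decode("utf-8"))
        if match:
            # Names may or may not be wrapped in double quotes
            names.append(match.group("name").strip('"'))
    return names


# Hypothetical sample shaped like a Gmail response; real output will differ.
sample = [
    b'(\\HasNoChildren) "/" "INBOX"',
    b'(\\HasChildren \\Noselect) "/" "[Gmail]"',
    b'(\\All \\HasNoChildren) "/" "[Gmail]/All Mail"',
]
names = mailbox_names(sample)
print(names)
```

With a live connection you would pass the second item of the tuple returned by mail.list(). Note that a mailbox name containing a space generally needs to be wrapped in double quotes when it is passed to mail.select().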

Step 4: Searching and Fetching the data

The block of code below searches the selected mailbox with the given criteria, fetches the matching emails and stores them in the variable messages.
Here I am searching for emails from freeCodeCamp.

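# Search the selected mailbox; numbers[0] is a space-separated byte string of message numbers
result, numbers = mail.search(None, '(FROM "quincy@freecodecamp.org")')
uids = numbers[0].split()
# Decode each number from bytes to str (avoid naming the loop variable `id`, which shadows a built-in)
uids = [uid.decode("utf-8") for uid in uids]
# Fetch the full content (RFC822) of every matching message in one call
result, messages = mail.fetch(','.join(uids), '(RFC822)')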
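The search step above assumes that at least one email matches. If nothing matches, the joined string is empty and the fetch call is likely to fail. The sketch below shows one way to guard against that and to fetch in smaller batches. The helper names sequence_numbers and in_batches are hypothetical, and the batch size is an arbitrary choice.

```python
def sequence_numbers(search_data):
    """Turn the data returned by mail.search() into a list of strings."""
    # search_data usually looks like [b'1 2 3']; with no match the first
    # item is typically empty (b'') or None, so treat both as "no results"
    if not search_data or not search_data[0]:
        return []
    return search_data[0].decode("utf-8").split()


def in_batches(items, size):
    """Yield comma-joined message sets containing at most `size` items."""
    for start in range(0, len(items), size):
        yield ",".join(items[start:start + size])


ids = sequence_numbers([b"1 2 3"])
print(ids)
print(list(in_batches(ids, 2)))

# With a live connection (not run here) the helpers could be used like this:
# result, numbers = mail.search(None, '(FROM "quincy@freecodecamp.org")')
# for message_set in in_batches(sequence_numbers(numbers), 100):
#     result, messages = mail.fetch(message_set, '(RFC822)')
```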

Step 5: Preparing the data to be exported

The block of code below loops through the fetched emails and, for each one, gets the date it was received, who sent it, the subject and the body.

  • We use the walk() method in the email library to get the parts and subparts of the message.
  • We use the get_content_type() method to get the content type of each part as maintype/subtype (for example text/plain).
  • We use the get_payload() method to get the content of the part as a string or message instance.

body_list = []
date_list = []
from_list = []
subject_list = []
for _, message in messages[::2]:
    email_message = email.message_from_bytes(message)

    # decode_header() returns (value, charset) pairs; keep the first value
    email_subject = email.header.decode_header(email_message['Subject'])[0][0]
    if isinstance(email_subject, bytes):
        email_subject = email_subject.decode(errors="ignore")

    # Use the first text/plain part as the body, or "" if there is none
    body = ""
    for part in email_message.walk():
        if part.get_content_type() == "text/plain":
            body = part.get_payload(decode=True).decode("utf-8", errors="ignore")
            break

    # Append exactly one value per email so the four lists stay the same length
    body_list.append(body)
    subject_list.append(email_subject)
    date_list.append(email_message.get('date'))
    fromlist = email_message.get('From')
    fromlist = fromlist.split("<")[0].replace('"', '')
    from_list.append(fromlist)
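To try the walk(), get_content_type() and get_payload() methods without a mail server, the sketch below parses a small hand-written multipart message. The helper names plain_text_body and readable_subject are hypothetical, and the raw message is made up for illustration. It uses make_header() so that a subject split into several encoded chunks is joined back together, which is an alternative to reading only the first chunk.

```python
import email
from email.header import decode_header, make_header


def plain_text_body(email_message):
    """Return the first text/plain part of a message, or "" if there is none."""
    for part in email_message.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True)
            charset = part.get_content_charset() or "utf-8"
            return payload.decode(charset, errors="replace")
    return ""


def readable_subject(email_message):
    """Decode an encoded Subject header (e.g. =?utf-8?b?...?=) into a str."""
    raw_subject = email_message.get("Subject", "")
    return str(make_header(decode_header(raw_subject)))


# A made-up message with a plain-text part and an HTML part
raw = (
    b"Subject: =?utf-8?b?SGVsbG8gV29ybGQ=?=\r\n"
    b"From: \"Jane Doe\" <jane@example.com>\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: multipart/alternative; boundary=\"XYZ\"\r\n"
    b"\r\n"
    b"--XYZ\r\n"
    b"Content-Type: text/plain; charset=\"utf-8\"\r\n"
    b"\r\n"
    b"Plain text version\r\n"
    b"--XYZ\r\n"
    b"Content-Type: text/html; charset=\"utf-8\"\r\n"
    b"\r\n"
    b"<p>HTML version</p>\r\n"
    b"--XYZ--\r\n"
)

parsed = email.message_from_bytes(raw)
subject = readable_subject(parsed)
body = plain_text_body(parsed).strip()
print(subject)
print(body)
```

Reading the charset from the part itself, instead of assuming UTF-8, is a design choice that may help with emails sent in other encodings.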

Here we convert the objects in date_list to datetime objects using the to_datetime() method. Because each timestamp has its UTC offset attached (for example +00:00), we slice the offset off.
The retrieved information is then converted to a pandas DataFrame and exported to a CSV file.

date_list = pd.to_datetime(date_list)
date_list = [item.isoformat(' ')[:-6] for item in date_list]
data = pd.DataFrame(data={'Date': date_list, 'Sender': from_list, 'Subject': subject_list, 'Body': body_list})
data.to_csv('emails.csv', index=False)
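As an alternative to slicing the last six characters, the standard library can parse an email Date header directly. The sketch below is one possible approach, not the method used above. naive_timestamp is a hypothetical helper name and the date string is a made-up example.

```python
from email.utils import parsedate_to_datetime


def naive_timestamp(date_header):
    """Parse an email Date header and return 'YYYY-MM-DD HH:MM:SS' without the offset."""
    parsed = parsedate_to_datetime(date_header)
    # Drop the timezone information but keep the sender's local clock time
    return parsed.replace(tzinfo=None).isoformat(" ")


example = naive_timestamp("Tue, 29 Jun 2021 09:41:50 +0100")
print(example)
```

Like the slicing approach, this keeps the local time in the header and simply discards the offset. It does not convert the time to UTC.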

Data Cleaning

Now we are going to view the data and clean it where necessary to make it readable. First we read in the CSV file and view it:

data = pd.read_csv("emails.csv")
data.head()

The output can be seen below. Looking through it, there are escape characters in the Body column:
[Screenshot: the first rows of the DataFrame, with escape characters visible in the Body column]

The function below removes the escape characters in the text, in this case "\r\n" as seen in the screenshot above. This makes the text more readable.

def clean_data(data, column, i):
    text = data.loc[i, column]
    # Empty cells are read back from the CSV as NaN, so only clean strings
    if isinstance(text, str):
        # Replace every "\r\n" line break with a single space
        data.loc[i, column] = " ".join(text.split("\r\n"))
    return data

The code below uses the function above to clean every email body in the file.

for n in range(len(data)):
    new_data = clean_data(data, column="Body", i=n)

The output can be seen below:
[Screenshot: the DataFrame after cleaning the Body column]

You are advised to write your own function to suit the escape characters you find in the Subject or Body of the emails you retrieved.
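As a starting point, here is a minimal sketch of a more general cleaner. It assumes you want every run of whitespace (line breaks, tabs, repeated spaces) collapsed into a single space, which may not suit every email. clean_text is a hypothetical name and the sample string is made up.

```python
import re


def clean_text(text):
    """Collapse line breaks, tabs and repeated spaces into single spaces."""
    if not isinstance(text, str):
        return text  # leave missing values (e.g. NaN) untouched
    return re.sub(r"\s+", " ", text).strip()


cleaned = clean_text("Hello,\r\n\r\nThis week's links:\r\n\t1. Python\r\n")
print(cleaned)

# Applied to a whole column (not run here):
# data["Body"] = data["Body"].apply(clean_text)
```

Because the function works on a single string, it can be applied to the Subject column in the same way.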

Conclusion

I encountered errors while writing this. The most recurring one was being unable to sign in, even after following the instructions on the Google help page. This happened because I had more than one Gmail account signed in and I was not using my default account. In case you encounter the same problem, the solution is outlined below:

You can find the full code on GitHub below:

yomaokobiah / email_analysis

Email data analysis

Thank you for getting to the end of this article. I hope you had fun trying it out.
