Email data analysis
Posted on June 29, 2021
Welcome back! This is a sequel to Part I where we covered making changes to our Gmail account, getting the subject of the email and its sender and visualising some of the email data.
The emphasis of this part is on getting the body of the emails.
If you read the first part and tried it out, I hope it was without hitches. If you did not but you are interested in learning how to get the body of the email, you can follow from here. Let's jump right in.
Setting up the Gmail account was covered in Part I, as it involves changing your Gmail account settings so that imaplib can work with it.
Here we import the libraries we need, which are imaplib, email, getpass and pandas. You may want to install pandas using pip install pandas if you do not have it.
import imaplib
import email
import getpass
import pandas as pd
Here we log into the email server with our credentials.
username = input("Enter the email address: ")
password = getpass.getpass("Enter password: ")
mail = imaplib.IMAP4_SSL('imap.gmail.com')
mail.login(username, password)
Here we print out the mail list to see the available mailboxes and we select one.
print(mail.list())
mail.select("inbox")
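The output of mail.list() is a status plus a list of raw byte strings, which can be hard to read. Below is a minimal sketch of one way to pull out just the mailbox names. The helper name mailbox_names and the sample lines are made up for illustration; the sample only imitates the shape of a Gmail response, and names containing non-ASCII characters would need extra decoding that is not handled here.

```python
import re

# One LIST response line looks like: (flags) "delimiter" mailbox-name
LIST_LINE = re.compile(rb'\((?P<flags>[^)]*)\) "(?P<delim>[^"]*)" (?P<name>.*)')


def mailbox_names(list_data):
    """Return the mailbox names found in the data part of mail.list()."""
    names = []
    for line in list_data:
        match = LIST_LINE.match(line)
        if match:
            # Drop the surrounding quotes that the server adds around the name
            names.append(match.group("name").decode("utf-8").strip('"'))
    return names


# Hypothetical sample shaped like a Gmail LIST response (not real output)
sample = [
    b'(\\HasNoChildren) "/" "INBOX"',
    b'(\\HasChildren \\Noselect) "/" "[Gmail]"',
    b'(\\HasNoChildren \\Sent) "/" "[Gmail]/Sent Mail"',
]
print(mailbox_names(sample))
```

With a live connection this would be called as mailbox_names(mail.list()[1]), and any of the returned names can be passed to mail.select().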
The block of code below searches the selected mailbox with the given criteria, fetches the matching emails and stores them in the variable messages.
Here I am searching for emails from FreeCodeCamp.
result, numbers = mail.search(None, '(FROM "quincy@freecodecamp.org")')
uids = numbers[0].split()
uids = [id.decode("utf-8") for id in uids ]
result, messages = mail.fetch(','.join(uids) ,'(RFC822)')
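If the search matches nothing, the list of ids is empty and the fetch call has nothing valid to ask for. Here is a small sketch of a guard for that case. The helper name to_fetch_set is hypothetical, and the byte strings stand in for what mail.search() returns as its second value. Note that mail.search() returns message sequence numbers rather than permanent UIDs.

```python
def to_fetch_set(search_data):
    """Turn the data part of mail.search() into a comma-separated id string.

    Returns None when the search matched no messages.
    """
    if not search_data or not search_data[0]:
        return None
    ids = search_data[0].split()
    return ",".join(num.decode("utf-8") for num in ids)


# Stand-in values for the second item returned by mail.search()
print(to_fetch_set([b"1 5 9"]))
print(to_fetch_set([b""]))
```

In the flow above, the fetch would then only run when to_fetch_set(numbers) is not None.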
The block of code below loops through the fetched emails and gets the date each one was received, who sent it, its subject and its body. It uses:
the walk() method in the email library to get the parts and subparts of the message,
the get_content_type() method to get the maintype/subtype of each part, so we can pick out the email body,
and get_payload() to get a string or message instance of the part.
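body_list = []
date_list = []
from_list = []
subject_list = []
for _, message in messages[::2]:
    email_message = email.message_from_bytes(message)
    # decode_header returns (value, charset) pairs; we take the first pair
    email_subject = email.header.decode_header(email_message['Subject'])[0]
    # Keep the first text/plain part so every email adds exactly one body,
    # which keeps all four lists the same length for the DataFrame
    body = ""
    for part in email_message.walk():
        if part.get_content_type() == "text/plain":
            body = part.get_payload(decode=True)
            body = body.decode("utf-8", errors="ignore")
            break
    body_list.append(body)
    # The subject value is bytes when it was encoded, otherwise a str
    if isinstance(email_subject[0], bytes):
        decoded = email_subject[0].decode(errors="ignore")
        subject_list.append(decoded)
    else:
        subject_list.append(email_subject[0])
    date_list.append(email_message.get('date'))
    # Keep only the display name that comes before the <address> part
    fromlist = email_message.get('From')
    fromlist = fromlist.split("<")[0].replace('"', '')
    from_list.append(fromlist)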
Here we convert the entries in date_list to datetime objects using the to_datetime() function. Because each timestamp has its UTC offset attached, we slice the offset off.
The retrieved information is then converted to a pandas DataFrame and exported to a CSV file.
date_list = pd.to_datetime(date_list)
date_list = [item.isoformat(' ')[:-6]for item in date_list]
data = pd.DataFrame(data={'Date':date_list,'Sender':from_list,'Subject':subject_list, 'Body':body_list})
data.to_csv('emails.csv',index=False)
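As an alternative to slicing the last six characters, which assumes every timestamp ends in an offset such as +00:00, here is a sketch that uses the standard library instead of pandas. The helper name naive_timestamp and the sample header value are made up for illustration. Like the slicing approach, it drops the offset and keeps the wall-clock time as written in the header.

```python
from email.utils import parsedate_to_datetime


def naive_timestamp(date_header):
    """Parse an email Date header and return 'YYYY-MM-DD HH:MM:SS' without offset."""
    parsed = parsedate_to_datetime(date_header)
    # Dropping tzinfo removes the offset but leaves the clock time unchanged
    return parsed.replace(tzinfo=None).isoformat(" ")


# Made-up Date header value
print(naive_timestamp("Tue, 29 Jun 2021 10:15:00 +0000"))
```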
Now we are going to view the data and clean it where necessary to make it readable. First we read in the csv file and view it:
data = pd.read_csv("emails.csv")
data.head()
The output can be seen below. Going through it, there are escape characters in the Body column:
The function below removes the escape characters in the text, in this case "\r\n" as seen in the screenshot above. This makes the text more readable.
def clean_data(data, column, i):
    text = data.loc[i, column]
    # Empty cells are read back from the CSV as NaN, so only clean real strings
    if isinstance(text, str):
        # Split on the "\r\n" line breaks and rejoin the pieces with spaces
        data.loc[i, column] = " ".join(text.split("\r\n"))
    return data
The code below uses the function above to clean every email body in the file.
for n in range(len(data)):
    new_data = clean_data(data, column = "Body", i = n)
You may need to write a new function to suit the escape characters you find in the Subject or Body of the emails you retrieved.
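As a starting point, here is a minimal sketch of a more general cleaner that collapses any run of whitespace (including "\r\n", "\n" and tabs) into a single space. The function name clean_text is hypothetical, and whether collapsing all whitespace is right for your emails is an assumption you should check against your own data.

```python
import re


def clean_text(text):
    """Collapse every run of whitespace into one space and trim the ends."""
    if not isinstance(text, str):
        # Missing values (for example NaN from an empty CSV cell) become ""
        return ""
    return re.sub(r"\s+", " ", text).strip()


print(clean_text("Hi there,\r\n\r\nThanks for reading.\r\n"))
```

With the DataFrame from earlier, this could be applied to a whole column in one step, for example data["Body"] = data["Body"].apply(clean_text).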
I encountered errors while writing this. The most frequent one was being unable to sign in even after following the instructions on the Google help page. This happened because I had more than one Gmail account signed in and was not using my default account. In case you encounter the same, the solution is outlined below:
You can find the full code on GitHub below:
Thank you for getting to the end of this article. I hope you had fun trying it out.