Email data analysis
Posted on June 19, 2021
There is a lot of data out there, most of it unstructured. Emails are a rich source of communication data, so there is a great deal we can learn from them.
By the end of this tutorial, you will be able to extract email data and explore it for insights.
There are several ways to achieve the aim of this article; below is how I did it.
I used a Gmail account here. For the imaplib script to work, I made two changes to my account: I enabled IMAP and turned on access for less secure apps.
If you still can't log in after doing the above, kindly refer to the official Google help pages.
imaplib is a Python library that implements an Internet Message Access Protocol (IMAP) client.
email is a Python library that parses, handles and generates email messages.
getpass is a Python library with utilities for reading a password or the current username.
pandas is a Python library for data manipulation and analysis.
import imaplib
import email
import email.header  # decode_header, used further down, lives in this submodule
import getpass
import pandas as pd
username = input("Enter the email address: ")
password = getpass.getpass("Enter password: ")
mail = imaplib.IMAP4_SSL('imap.gmail.com')
mail.login(username, password)
mail.list() is a method that returns the list of mailboxes (inbox, drafts and so on) in the email account.
mail.select() is a method that takes the name of the mailbox you want to get data from.
print(mail.list())
mail.select("inbox")
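The raw output of mail.list() can be hard to read, because each mailbox comes back as a bytes line holding flags, a delimiter and a name. Here is a minimal sketch of pulling out just the names. The sample lines are made up to resemble a typical response, and `mailbox_names` and `LIST_LINE` are hypothetical names, not part of imaplib.

```python
import re

# Made-up sample resembling what mail.list() might return:
# a status string plus one raw bytes line per mailbox.
status, raw_mailboxes = "OK", [
    b'(\\HasNoChildren) "/" "INBOX"',
    b'(\\HasChildren \\Noselect) "/" "[Gmail]"',
    b'(\\Drafts \\HasNoChildren) "/" "[Gmail]/Drafts"',
]

# Assumed line layout: (flags) "delimiter" "name"
LIST_LINE = re.compile(r'\((?P<flags>.*?)\) "(?P<delim>[^"]*)" "?(?P<name>[^"]*)"?$')

def mailbox_names(lines):
    """Hypothetical helper: extract the mailbox name from each line."""
    names = []
    for line in lines:
        match = LIST_LINE.match(line.decode("utf-8"))
        if match:
            names.append(match.group("name"))
    return names

print(mailbox_names(raw_mailboxes))
```

Any of the extracted names could then be passed to mail.select().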
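# Search the selected mailbox; the UIDs come back as one space-separated bytes string
result, numbers = mail.uid('search', None, "ALL")
uids = numbers[0].split()
uids = [uid.decode("utf-8") for uid in uids]
# Keep the 100 most recent UIDs, newest first
uids = uids[-1:-101:-1]
# Fetch only the Subject, From and Date headers of those messages
result, messages = mail.uid('fetch', ','.join(uids), '(BODY[HEADER.FIELDS (SUBJECT FROM DATE)])')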
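The slice `[-1:-101:-1]` used on the UIDs can look cryptic, so here is a small sketch of what it does on its own. The UIDs are made up (the strings "1" to "250"), and `latest` and `fetch_set` are illustrative names.

```python
# Made-up UIDs standing in for the decoded search result (oldest first).
uids = [str(n) for n in range(1, 251)]  # "1" ... "250"

# Start at the last element and step backwards, stopping before index -101,
# so at most 100 items come back, newest first.
latest = uids[-1:-101:-1]

print(len(latest), latest[0], latest[-1])  # 100 250 151

# The fetch command takes the UIDs as one comma-separated string.
fetch_set = ",".join(latest)

# A mailbox with fewer than 100 messages simply yields all of them, reversed.
print(["1", "2", "3"][-1:-101:-1])  # ['3', '2', '1']
```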
date_list = []
from_list = []
subject_text = []
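for i, message in messages[::2]:
    msg = email.message_from_bytes(message)
    # decode_header returns (value, charset) pairs; only the first part is used here
    decode = email.header.decode_header(msg['Subject'])[0]
    if isinstance(decode[0], bytes):
        # Use the charset named in the header, falling back to UTF-8
        decoded = decode[0].decode(decode[1] or 'utf-8', errors='replace')
        subject_text.append(decoded)
    else:
        subject_text.append(decode[0])
    date_list.append(msg.get('date'))
    fromlist = msg.get('From')
    # Keep only the display name that precedes the <address> part
    fromlist = fromlist.split("<")[0].replace('"', '')
    from_list.append(fromlist)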
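The loop decodes the subject and trims the sender by hand. As an alternative sketch, the standard library's `email.header.make_header` and `email.utils.parseaddr` can do the same jobs and also cope with missing headers. `decode_subject` and `sender_name` are hypothetical helper names, and the sample values are made up.

```python
from email.header import decode_header, make_header
from email.utils import parseaddr

def decode_subject(raw_subject):
    """Hypothetical helper: turn a raw (possibly encoded) Subject into text."""
    if raw_subject is None:  # some messages have no Subject header
        return ""
    return str(make_header(decode_header(raw_subject)))

def sender_name(raw_from):
    """Hypothetical helper: the display name if present, else the address."""
    name, address = parseaddr(raw_from or "")
    return name or address

print(decode_subject("=?utf-8?b?SGVsbG8gV29ybGQ=?="))  # Hello World
print(sender_name('"Jane Doe" <jane@example.com>'))    # Jane Doe
print(sender_name("jane@example.com"))                 # jane@example.com
```

Unlike taking only the first decoded part, `make_header` joins every part of a multi-part subject.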
date_list = pd.to_datetime(date_list)
date_list1 = []
for item in date_list:
    # Drop the trailing UTC offset (e.g. "+01:00") from the ISO string
    date_list1.append(item.isoformat(' ')[:-6])
print(len(subject_text))
print(len(from_list))
print(len(date_list1))
df = pd.DataFrame(data={'Date':date_list1, 'Sender':from_list, 'Subject':subject_text})
print(df.head())
df.to_csv('inbox_email.csv',index=False)
Now that we have the email data in CSV format, we can read it with pandas and visualise it.
There are several Python data visualisation libraries, but here I used WordCloud, Matplotlib and Seaborn. I wanted to see an infographic of the most used words in the subjects of my emails, and here is how I did it.
I used the describe method to get summary statistics, unique values and so on, to get insight into what's in the data.
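As a rough sketch of that step, the snippet below reads CSV data into a DataFrame called `emails` and calls describe on it. To stay self-contained it reads a few made-up rows from a string instead of inbox_email.csv; the column names match the CSV written earlier, but the values are invented.

```python
import io
import pandas as pd

# Made-up rows standing in for inbox_email.csv (same three columns).
SAMPLE_CSV = """Date,Sender,Subject
2021-06-19 08:15:00,Alice,Weekly report
2021-06-19 09:30:00,Bob,Invoice for June
2021-06-18 17:45:00,Alice,Weekly report
"""

emails = pd.read_csv(io.StringIO(SAMPLE_CSV))

# All three columns hold text, so describe() reports count, unique, top and freq.
summary = emails.describe()
print(summary)
```

With the real file, `pd.read_csv('inbox_email.csv')` would take the place of the in-memory sample.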
I created two variables: Time and SinceMid. Time is the time of day taken from the Date column, and SinceMid is the number of hours after midnight.
(Note: The time can be removed from the date column completely)
from datetime import datetime
emails = pd.read_csv('inbox_email.csv')  # load the CSV saved earlier
FMT = '%H:%M:%S'
# The Date column looks like "2021-06-19 14:30:00"
emails['Time'] = emails['Date'].apply(lambda x: datetime.strptime(x, '%Y-%m-%d %H:%M:%S').strftime(FMT))
emails['SinceMid'] = emails['Time'].apply(lambda x: (datetime.strptime(x, FMT) - datetime.strptime("00:00:00", FMT)).seconds) / 60 / 60
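To make the SinceMid arithmetic concrete, here is a standalone sketch of the same idea for a single timestamp. `hours_since_midnight` is a hypothetical helper name, and the timestamp format is assumed to match the Date column in the CSV.

```python
from datetime import datetime

def hours_since_midnight(timestamp, fmt="%Y-%m-%d %H:%M:%S"):
    """Hypothetical helper: hours elapsed since 00:00:00 on the same day."""
    parsed = datetime.strptime(timestamp, fmt)
    return parsed.hour + parsed.minute / 60 + parsed.second / 3600

print(hours_since_midnight("2021-06-19 14:30:00"))  # 14.5
```

An email received at 14:30 therefore lands at 14.5 on the histogram's x-axis.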
I created a word cloud image of the most used words in the subjects of my mails. In this example I did not supply any stopwords of my own; stopwords are usually filtered out because most of the time they are not informative.
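To illustrate what filtering stopwords means, here is a plain-Python sketch that counts subject words while skipping a tiny stopword set. The stopword set and the subject lines are made up for illustration, and `word_counts` is a hypothetical helper; real stopword lists are much longer.

```python
from collections import Counter

# A tiny, hand-picked stopword set for illustration only.
STOPWORDS = {"the", "a", "an", "for", "your", "to", "of", "and", "is", "in"}

# Made-up subject lines; None stands in for a missing subject.
subjects = ["Your invoice for June", "The weekly report", "Invoice reminder", None]

def word_counts(subject_lines, stopwords=STOPWORDS):
    """Hypothetical helper: count words across subjects, skipping stopwords."""
    counts = Counter()
    for line in subject_lines:
        if not isinstance(line, str):  # skip missing subjects
            continue
        for word in line.lower().split():
            if word not in stopwords:
                counts[word] += 1
    return counts

counts = word_counts(subjects)
print(counts.most_common(1))  # [('invoice', 2)]
```

Words with higher counts are the ones a word cloud would draw larger.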
from wordcloud import WordCloud
import matplotlib.pyplot as plt
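# Join all subjects into one string
text = ""
for item in emails["Subject"]:
    if isinstance(item, str):
        text += " " + item
# str.replace returns a new string, so the result must be assigned back
text = text.replace("'", "")
text = text.replace(",", "")
text = text.replace('"', '')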
# Create the wordcloud object
wordcloud = WordCloud(width=800, height=800, background_color="white")
# Generate the word cloud from the text, then display the image
wordcloud.generate(text)
plt.figure(figsize=(8,8))
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.margins(x=0, y=0)
plt.title("Most Used Subject Words", fontsize=20,ha="center", pad=20)
plt.show()
Here's the output:
I created a histogram of the hours after midnight using seaborn.
import seaborn as sns
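# distplot is deprecated in recent seaborn releases; histplot with kde=True is the closest replacement
sns.histplot(emails["SinceMid"], bins=20, kde=True)
plt.title("Hours since midnight")
plt.show()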
Here is the histogram:
You can check out the Python gallery for more possible visualisations.
I had fun writing this, and I hope you had fun reading it. It goes without saying that I encountered errors while doing this (some of which I had never seen before). When you get an error message, a good starting point is to use print statements to see what is going on, and then to google the error message.
Part II will also be published on this blog; it will focus on getting the body of the mail rather than the subject, as this one did.
The full code can be found below:
Thank you for reading up to this point.
Disclaimer: I encourage you to experiment beyond what's written here. If you encounter bugs and feel like getting me involved (after Googling), send me a DM on Twitter; I'd be happy to learn something new. Thank you in anticipation.