Deduplicate Salesforce Contact Email Addresses with Python
Katie
Posted on November 23, 2018
Salesforce #AwesomeAdmin Matthew (@matthewsalibi) asked how he could use Python to deduplicate Salesforce NPSP email addresses across several fields. Today I’ll provide you a script you can copy, paste, and try on your own data. And for the geeks, I’ll break it down and explain how every line works.
Data & Expectations
Input Data
As a recap, Matthew proposed that we start with a CSV-formatted export of his “Contacts” table, “contacts.csv,” that looked something like this (I consolidated his examples and added a bit of complexity):
SalesforceID | Human-Readable ID | FirstName | LastName | PreferredEmail | PersonalEmail | WorkEmail | AlternateEmail |
---|---|---|---|---|---|---|---|
003…01 | BBBB | Chidi | Anagonye | Work | ca@gp.com | ca@hotmail.com | ca@hotmail.com |
003…02 | DDDD | Eleanor | Shellstrop | Personal | es@gp.com | | |
003…03 | EEEE | Jason | Mendoza | Work | jm@gp.com | jm@gp.com | jm@gp.com |
003…04 | FFFF | Eleanor | Shellstrop | Alternate | es@hotmail.com | es@gp.com | |
003…05 | GGGG | Janet | Della-Denunzio | Alternate | jd@gp.com | jm@gp.com | |
003…06 | HHHH | Tahani | Al-Jamil | Alternate | ta@gp.com | ta@yahoo.com | ta@gp.com |
003…07 | IIII | Chidi | Anagonye | Work | ca@gp.com | | |
003…08 | JJJJ | Mindy | St. Claire | Personal | ms@mp.com | | |
003…09 | KKKK | Kamilah | Al-Jamil | Personal | ka@kafoundation.org | | |
003…10 | AAAA | Eleanor | Shellstrop | Alternate | es@gp.com | | |
003…11 | CCCC | Ellie | Shellstrop | Work | es@gp.com | | |
003…12 | LLLL | Mindy | St. Claire | Personal | ms@yahoo.com | ms@hotmail.com | ms@hotmail.com |
003…13 | MMMM | Mindy | St. Claire | Work | ms@z.com | ms@z.com | ms@z.com |
003…14 | NNNN | Tahani | Al-Jamil | Alternate | ta@gp.com | | |
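If you want to poke at data like this before running the full scripts, loading it takes one pandas call. A minimal sketch (the inline CSV text here is a stand-in for the real `contacts.csv` file path):

```python
import io
import pandas

# Stand-in for contacts.csv; in practice you'd pass a file path to read_csv.
csvText = '''"SalesforceID","Human-Readable ID","FirstName","LastName","PreferredEmail","PersonalEmail","WorkEmail","AlternateEmail"
"003…01","BBBB","Chidi","Anagonye","Work","ca@gp.com","ca@hotmail.com","ca@hotmail.com"
'''

# dtype=object keeps every column as text, so Salesforce IDs and email
# addresses aren't mangled by pandas's automatic type inference.
df = pandas.read_csv(io.StringIO(csvText), dtype=object)
print(df['WorkEmail'][0])  # ca@hotmail.com
```

That `dtype=object` flag is the same one the scripts below use, and for the same reason.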
Output File #1
Matthew wanted to see an “output1.csv” that would identify duplicates within rows and provide a “cleanup” file, ready for Data Loader, that looks like this:
SalesforceID | Human-Readable ID | FirstName | LastName | PreferredEmail | PersonalEmail | WorkEmail | AlternateEmail |
---|---|---|---|---|---|---|---|
003…01 | BBBB | Chidi | Anagonye | Work | ca@gp.com | ca@hotmail.com | |
003…03 | EEEE | Jason | Mendoza | Work | | jm@gp.com | |
003…06 | HHHH | Tahani | Al-Jamil | Alternate | | ta@yahoo.com | ta@gp.com |
003…13 | MMMM | Mindy | St. Claire | Work | | ms@z.com | |
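In plain English, the row-level rule behind this file is: find the address in the field the PreferredEmail picklist points at, then blank out any *other* email field holding that same address. Here's a stripped-down, dict-based sketch of that rule (not the pandas implementation the actual script uses):

```python
pickToField = {'Personal': 'PersonalEmail', 'Work': 'WorkEmail', 'Alternate': 'AlternateEmail'}
emFieldNames = ['PersonalEmail', 'WorkEmail', 'AlternateEmail']

def clear_redundant_emails(row):
    """Blank any non-preferred email field that repeats the preferred address.

    Returns (row, changed_anything)."""
    prefField = pickToField.get(row.get('PreferredEmail'))
    prefAddr = row.get(prefField) if prefField else None
    changed = False
    if prefAddr:
        for fieldName in emFieldNames:
            if fieldName != prefField and row.get(fieldName) == prefAddr:
                row[fieldName] = None
                changed = True
    return row, changed

# Chidi's row from the sample data: AlternateEmail repeats the preferred Work address.
chidi = {'PreferredEmail': 'Work', 'PersonalEmail': 'ca@gp.com',
         'WorkEmail': 'ca@hotmail.com', 'AlternateEmail': 'ca@hotmail.com'}
row, changed = clear_redundant_emails(chidi)
print(changed, row['AlternateEmail'])  # True None
```

Only rows where `changed` came back True would land in “output1.csv.”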
Output File #2A
To identify entire rows suspected of being redundant Contacts, Matthew proposed a “ready for DemandTools” file, “output2a.csv,” that would look like the data below.
This data is compatible with the “Select Master / Non-Master records from file” functionality of DemandTools’s “Single Table Dedupe” module.
Each row indicates a “same email address, same first name” pairing that was found in multiple Contact records (where the “email address” could have appeared in any of the “Email” fields that come with the NPSP).
Each column is the Salesforce ID of a Contact in which the email address was found.
Master | Duplicate_1 | Duplicate_2 |
---|---|---|
003…01 | 003…07 | |
003…10 | 003…02 | 003…04 |
003…06 | 003…14 | |
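Conceptually, Output2A comes from indexing every address found in any of the three email fields, then grouping contacts that share both an address and a first name; because the data is sorted first, the contact with the lowest Human-Readable ID lands in the Master column. Here's a hypothetical plain-Python sketch of that grouping (a simplification of the pandas `stack`/`groupby` approach the script uses):

```python
from collections import defaultdict

contacts = [  # (Human-Readable ID, SalesforceID, FirstName, list of emails)
    ('AAAA', '003…10', 'Eleanor', ['es@gp.com']),
    ('DDDD', '003…02', 'Eleanor', ['es@gp.com']),
    ('FFFF', '003…04', 'Eleanor', ['es@hotmail.com', 'es@gp.com']),
]

# Map each (email, first name) pair to the contacts it appears in; sorting
# first means the lowest Human-Readable ID lands first and becomes the Master.
groups = defaultdict(list)
for hrid, sfid, firstName, emails in sorted(contacts):
    for emAddr in set(emails):  # de-dupe within the row so one contact counts once
        groups[(emAddr, firstName)].append(sfid)

# Only pairs found in more than one Contact are duplicates worth reporting.
dupeSets = {k: v for k, v in groups.items() if len(v) > 1}
print(dupeSets[('es@gp.com', 'Eleanor')])  # ['003…10', '003…02', '003…04']
```

That one list is exactly the Master / Duplicate_1 / Duplicate_2 row for es@gp.com above.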
Output File #2B
For sets of Contacts that shared an email address but not a first name, Matthew wanted to see a bit more detail in the output file “output2b.csv,” to facilitate human review. He suggested it look like the output below.
emAddr | Master | Master_FN | Master_LN | Duplicate_1 | Duplicate_1_FN | Duplicate_1_LN |
---|---|---|---|---|---|---|
es@gp.com | 003…10 | Eleanor | Shellstrop | 003…11 | Ellie | Shellstrop |
jm@gp.com | 003…03 | Jason | Mendoza | 003…05 | Janet | Della-Denunzio |
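The split between 2A and 2B boils down to this: within a set of Contacts sharing one email address, same-first-name matches are safe to hand to DemandTools, while different-first-name matches get the extra name columns so a human can eyeball them. A sketch of that partition (a hypothetical helper, not the script's actual code, which interleaves both outputs in one loop):

```python
def partition_shares(contactsSharingEmail):
    """Split contacts that share one email address into auto-merge candidates
    (same first name as the Master) and human-review pairs (different first name).

    contactsSharingEmail: list of (SalesforceID, FirstName) tuples, already
    sorted by Human-Readable ID so the first entry is the Master."""
    autoMerge, review = [], []
    masterId, masterFN = contactsSharingEmail[0]
    for sfid, firstName in contactsSharingEmail[1:]:
        if firstName == masterFN:
            autoMerge.append(sfid)            # destined for output2a
        else:
            review.append((sfid, firstName))  # destined for output2b
    return autoMerge, review

# es@gp.com appears under Eleanor (3x) and Ellie (1x) in the sample data.
auto, review = partition_shares([('003…10', 'Eleanor'), ('003…11', 'Ellie'),
                                 ('003…02', 'Eleanor'), ('003…04', 'Eleanor')])
print(auto)    # ['003…02', '003…04']
print(review)  # [('003…11', 'Ellie')]
```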
Running The Code Yourself
Unfortunately, I haven’t yet made a blog post about running Python code on your own computer, so if you’re totally new to Python, you’re a bit stuck – but show this code to a colleague who does know how!
Or, if you’re adventurous, download an “Anaconda suite” installer for your computer’s operating system and give it a try.
- Don’t bother installing VSCode when it asks if you want to.
- Once it’s done, bring up the “Anaconda Navigator” and open “Spyder.”
- Type `print('hello')` in the code-editing panel at the left, hit “run” near the top (if a box pops up, hit “cancel” and do it again), and then check whether `hello` is printed in an output console at the right-hand side of your screen (probably bottom right).
- If so, congrats – you just wrote a program! Backspace out `print('hello')` and copy/paste my code in instead, then start tweaking it.
And stay tuned for my “office hours” and “hands-on trainings,” where I teach total non-programmers to run code like this.
The Code
There’s no shame in simply copying this code, pasting it into your Python “IDE,” and making a few tweaks!
You will, of course, have to do a bit of “Find & Replace” to fix up the column-name references (which, by the way, are case-sensitive and must match the capitalization in your input CSV file) if your columns aren’t spelled precisely “PersonalEmail,” “WorkEmail,” “Human-Readable ID,” etc.
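Alternatively, instead of Find & Replace, you could rename the columns in the DataFrame itself right after loading it. A sketch, assuming your export used, say, “Email_Personal” as a header (a hypothetical column name for illustration):

```python
import pandas

# Hypothetical export whose headers don't match what the scripts expect.
df = pandas.DataFrame({'Email_Personal': ['es@gp.com'], 'Email_Work': [None]})

# rename only touches the keys it finds; any unmatched keys are silently ignored.
df = df.rename(columns={'Email_Personal': 'PersonalEmail', 'Email_Work': 'WorkEmail'})
print(list(df.columns))  # ['PersonalEmail', 'WorkEmail']
```

One `rename` line near the top of the script beats editing a dozen scattered references.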
Code to produce “Output1” only
```python
import time
start_time = time.time()

import pandas
pandas.set_option('expand_frame_repr', False)

emFieldNames = ['PersonalEmail','WorkEmail','AlternateEmail'] # Edit this before running
pickToField = {'Personal':'PersonalEmail','Work':'WorkEmail','Alternate':'AlternateEmail'} # Edit this before running
inputFilePath = r'c:\example\contacts.csv' # Edit this before running
output1FilePath = r'c:\example\output1.csv' # Edit this before running
output1NoneFoundFilePath = r'c:\example\nonefound1.txt' # Edit this before running

df = pandas.read_csv(inputFilePath, dtype=object)

df['ChangedAnything'] = False
# True wherever the PreferredEmail picklist points at a column that actually exists
validPrefEmailIndicatorTFSeries = df['PreferredEmail'].map(pickToField).isin(df.columns)
# Copy each row's preferred address into a helper column. (Note: the row-label and
# column-label lists passed to lookup() must be the same length, so both are
# filtered by the mask. DataFrame.lookup was removed in pandas 2.0; this script
# targets the pandas of its day.)
df.loc[validPrefEmailIndicatorTFSeries,'PrefEmAddr'] = df.lookup(df[validPrefEmailIndicatorTFSeries].index, df[validPrefEmailIndicatorTFSeries]['PreferredEmail'].map(pickToField))

for fieldName in emFieldNames:
    # Clear any non-preferred field that repeats the preferred address
    needsClearingTFSer = (df['PreferredEmail'].map(pickToField) != fieldName) & (df[fieldName] == df['PrefEmAddr'])
    df.loc[needsClearingTFSer,fieldName] = None
    df.loc[needsClearingTFSer,'ChangedAnything'] = True

output1df = df[df['ChangedAnything']]
output1df = output1df.drop(columns=['PrefEmAddr','ChangedAnything'])

if len(output1df) > 0:
    output1df.to_csv(output1FilePath, index=False, quoting=1)
else:
    with open(output1NoneFoundFilePath, 'w') as f:
        f.writelines('No records found')

print('Done running script. Took ' + str(round(time.time()-start_time,2)) + ' seconds.')
```
Code to produce “Output2A” and “Output2B” only
```python
import time
start_time = time.time()

from collections import OrderedDict
import pandas
pandas.set_option('expand_frame_repr', False)

emFieldNames = ['PersonalEmail','WorkEmail','AlternateEmail'] # Edit this before running
pickToField = {'Personal':'PersonalEmail','Work':'WorkEmail','Alternate':'AlternateEmail'} # Edit this before running
inputFilePath = r'c:\example\contacts.csv' # Edit this before running
output2aFilePath = r'c:\example\output2a.csv' # Edit this before running
output2aNoneFoundFilePath = r'c:\example\nonefound2a.txt' # Edit this before running
output2bFilePath = r'c:\example\output2b.csv' # Edit this before running
output2bNoneFoundFilePath = r'c:\example\nonefound2b.txt' # Edit this before running

df = pandas.read_csv(inputFilePath, dtype=object)

df = df.sort_values('Human-Readable ID')
df = df.reset_index(drop=True)
df = df.set_index(['Human-Readable ID','SalesforceID','FirstName','LastName','PreferredEmail'])
# Stack the three email columns into one long Series, then group identical addresses
allEmailsSeries = df[emFieldNames].stack()
groupedEmailSeriesGroupItems = allEmailsSeries.groupby(allEmailsSeries).groups.items()
emsAndTheirCellsFound = {emaddr:([x for x in indices]) for emaddr,indices in groupedEmailSeriesGroupItems if len(indices) > 1}

output2aList = []
output2bList = []
for emAddr, foundCells in emsAndTheirCellsFound.items():
    seenFirstNamesAndTheirContacts = {}
    numCellsFound = len(foundCells)
    output2bFact = OrderedDict()
    if numCellsFound > 1:
        for cellLoopCounter, cell in enumerate(foundCells):
            if cell[2] not in seenFirstNamesAndTheirContacts:
                seenFirstNamesAndTheirContacts[cell[2]] = [cell[:-1]]
                if cellLoopCounter == 0:
                    output2bFact['emAddr'] = emAddr
                    output2bFact['Master'] = cell[1]
                    output2bFact['Master_FN'] = cell[2]
                    output2bFact['Master_LN'] = cell[3]
                else:
                    uniqueFirstNamesSeenCountMinus1Str = str(len(seenFirstNamesAndTheirContacts) - 1)
                    output2bFact['Duplicate_'+uniqueFirstNamesSeenCountMinus1Str] = cell[1]
                    output2bFact['Duplicate_'+uniqueFirstNamesSeenCountMinus1Str+'_FN'] = cell[2]
                    output2bFact['Duplicate_'+uniqueFirstNamesSeenCountMinus1Str+'_LN'] = cell[3]
            elif cell[:-1] not in seenFirstNamesAndTheirContacts[cell[2]]:
                seenFirstNamesAndTheirContacts[cell[2]].append(cell[:-1])
            if cellLoopCounter == numCellsFound-1:
                # Done with this email address: emit its 2B row (if any) and 2A rows
                if 'Duplicate_1' in output2bFact:
                    output2bList.append(output2bFact)
                for firstName, contactsForFN in seenFirstNamesAndTheirContacts.items():
                    output2aFact = OrderedDict()
                    if len(contactsForFN) > 1:
                        for contactLoopCounter, contact in enumerate(contactsForFN):
                            if contactLoopCounter == 0:
                                output2aFact['Master'] = contact[1]
                            else:
                                output2aFact['Duplicate_'+str(contactLoopCounter)] = contact[1]
                        if 'Duplicate_1' in output2aFact:
                            output2aList.append(output2aFact)

if len(output2aList) > 0:
    output2adf = pandas.DataFrame(output2aList)
    output2adf.to_csv(output2aFilePath, index=False, quoting=1)
else:
    with open(output2aNoneFoundFilePath, 'w') as f:
        f.writelines('No records found')

if len(output2bList) > 0:
    output2bdf = pandas.DataFrame(output2bList)
    output2bdf.to_csv(output2bFilePath, index=False, quoting=1)
else:
    with open(output2bNoneFoundFilePath, 'w') as f:
        f.writelines('No records found')

print('Done running script. Took ' + str(round(time.time()-start_time,2)) + ' seconds.')
```
Code to produce all 3 output files at once
```python
import time
start_time = time.time()

from collections import OrderedDict
import pandas
pandas.set_option('expand_frame_repr', False)

emFieldNames = ['PersonalEmail','WorkEmail','AlternateEmail'] # Edit this before running
pickToField = {'Personal':'PersonalEmail','Work':'WorkEmail','Alternate':'AlternateEmail'} # Edit this before running
inputFilePath = r'c:\example\contacts.csv' # Edit this before running
output1FilePath = r'c:\example\output1.csv' # Edit this before running
output1NoneFoundFilePath = r'c:\example\nonefound1.txt' # Edit this before running
output2aFilePath = r'c:\example\output2a.csv' # Edit this before running
output2aNoneFoundFilePath = r'c:\example\nonefound2a.txt' # Edit this before running
output2bFilePath = r'c:\example\output2b.csv' # Edit this before running
output2bNoneFoundFilePath = r'c:\example\nonefound2b.txt' # Edit this before running

df = pandas.read_csv(inputFilePath, dtype=object)

# ### Output1 portion starts here ###
df['ChangedAnything'] = False
validPrefEmailIndicatorTFSeries = df['PreferredEmail'].map(pickToField).isin(df.columns)
# Row-label and column-label lists passed to lookup() must be the same length,
# so both are filtered by the mask. (DataFrame.lookup was removed in pandas 2.0;
# this script targets the pandas of its day.)
df.loc[validPrefEmailIndicatorTFSeries,'PrefEmAddr'] = df.lookup(df[validPrefEmailIndicatorTFSeries].index, df[validPrefEmailIndicatorTFSeries]['PreferredEmail'].map(pickToField))

for fieldName in emFieldNames:
    needsClearingTFSer = (df['PreferredEmail'].map(pickToField) != fieldName) & (df[fieldName] == df['PrefEmAddr'])
    df.loc[needsClearingTFSer,fieldName] = None
    df.loc[needsClearingTFSer,'ChangedAnything'] = True

output1df = df[df['ChangedAnything']]
output1df = output1df.drop(columns=['PrefEmAddr','ChangedAnything'])

if len(output1df) > 0:
    output1df.to_csv(output1FilePath, index=False, quoting=1)
else:
    with open(output1NoneFoundFilePath, 'w') as f:
        f.writelines('No records found')
# ### Output1 portion ends here ###

# ### Output2 portion starts here ###
df = df.sort_values('Human-Readable ID')
df = df.reset_index(drop=True)
df = df.set_index(['Human-Readable ID','SalesforceID','FirstName','LastName','PreferredEmail'])
allEmailsSeries = df[emFieldNames].stack()
groupedEmailSeriesGroupItems = allEmailsSeries.groupby(allEmailsSeries).groups.items()
emsAndTheirCellsFound = {emaddr:([x for x in indices]) for emaddr,indices in groupedEmailSeriesGroupItems if len(indices) > 1}

output2aList = []
output2bList = []
for emAddr, foundCells in emsAndTheirCellsFound.items():
    seenFirstNamesAndTheirContacts = {}
    numCellsFound = len(foundCells)
    output2bFact = OrderedDict()
    if numCellsFound > 1:
        for cellLoopCounter, cell in enumerate(foundCells):
            if cell[2] not in seenFirstNamesAndTheirContacts:
                seenFirstNamesAndTheirContacts[cell[2]] = [cell[:-1]]
                if cellLoopCounter == 0:
                    output2bFact['emAddr'] = emAddr
                    output2bFact['Master'] = cell[1]
                    output2bFact['Master_FN'] = cell[2]
                    output2bFact['Master_LN'] = cell[3]
                else:
                    uniqueFirstNamesSeenCountMinus1Str = str(len(seenFirstNamesAndTheirContacts) - 1)
                    output2bFact['Duplicate_'+uniqueFirstNamesSeenCountMinus1Str] = cell[1]
                    output2bFact['Duplicate_'+uniqueFirstNamesSeenCountMinus1Str+'_FN'] = cell[2]
                    output2bFact['Duplicate_'+uniqueFirstNamesSeenCountMinus1Str+'_LN'] = cell[3]
            elif cell[:-1] not in seenFirstNamesAndTheirContacts[cell[2]]:
                seenFirstNamesAndTheirContacts[cell[2]].append(cell[:-1])
            if cellLoopCounter == numCellsFound-1:
                if 'Duplicate_1' in output2bFact:
                    output2bList.append(output2bFact)
                for firstName, contactsForFN in seenFirstNamesAndTheirContacts.items():
                    output2aFact = OrderedDict()
                    if len(contactsForFN) > 1:
                        for contactLoopCounter, contact in enumerate(contactsForFN):
                            if contactLoopCounter == 0:
                                output2aFact['Master'] = contact[1]
                            else:
                                output2aFact['Duplicate_'+str(contactLoopCounter)] = contact[1]
                        if 'Duplicate_1' in output2aFact:
                            output2aList.append(output2aFact)

if len(output2aList) > 0:
    output2adf = pandas.DataFrame(output2aList)
    output2adf.to_csv(output2aFilePath, index=False, quoting=1)
else:
    with open(output2aNoneFoundFilePath, 'w') as f:
        f.writelines('No records found')

if len(output2bList) > 0:
    output2bdf = pandas.DataFrame(output2bList)
    output2bdf.to_csv(output2bFilePath, index=False, quoting=1)
else:
    with open(output2bNoneFoundFilePath, 'w') as f:
        f.writelines('No records found')
# ### Output2 portion ends here ###

print('Done running script. Took ' + str(round(time.time()-start_time,2)) + ' seconds.')
```
Stop reading here if you don’t want to learn Python
You don’t have to thoroughly understand the code above to make use of it.
Just be sure that your “output” filenames are different from your “input” filename, and you can’t do too much damage.
Worst-case scenario, you’ll freeze your computer running the program against too large a file – but several million rows is nothing (a minute or a few), so get up and stretch. It’s good for you!
For the curious, the rest of this post is total geekery peeking under the hood.
Code Explanations
I'm struggling to get all the formatting to show up correctly on DEV, so check out my breakdown at the original blog post and come back here with questions, comments, or corrections. I look forward to hearing from you!