Deduplicate Salesforce Contact Email Addresses with Python

Salesforce #AwesomeAdmin Matthew (@matthewsalibi) asked how he could use Python to deduplicate Salesforce NPSP email addresses across several fields. Today I’ll provide you a script you can copy, paste, and try on your own data. And for the geeks, I’ll break it down and explain how every line works.

Data & Expectations

Input Data

As a recap, Matthew proposed that we start with a CSV-formatted export of his “Contacts” table, “ contacts.csv ,” that looked something like this (I consolidated his examples and added a bit of complexity):

SalesforceID	Human-Readable ID	FirstName	LastName	PreferredEmail	PersonalEmail	WorkEmail	AlternateEmail
003…01	BBBB	Chidi	Anagonye	Work	ca@gp.com	ca@hotmail.com	ca@hotmail.com
003…02	DDDD	Eleanor	Shellstrop	Personal	es@gp.com
003…03	EEEE	Jason	Mendoza	Work	jm@gp.com	jm@gp.com	jm@gp.com
003…04	FFFF	Eleanor	Shellstrop	Alternate	es@hotmail.com		es@gp.com
003…05	GGGG	Janet	Della-Denunzio	Alternate	jd@gp.com		jm@gp.com
003…06	HHHH	Tahani	Al-Jamil	Alternate	ta@gp.com	ta@yahoo.com	ta@gp.com
003…07	IIII	Chidi	Anagonye	Work		ca@gp.com
003…08	JJJJ	Mindy	St. Claire	Personal	ms@mp.com
003…09	KKKK	Kamilah	Al-Jamil	Personal	ka@kafoundation.org
003…10	AAAA	Eleanor	Shellstrop	Alternate			es@gp.com
003…11	CCCC	Ellie	Shellstrop	Work		es@gp.com
003…12	LLLL	Mindy	St. Claire	Personal	ms@yahoo.com	ms@hotmail.com	ms@hotmail.com
003…13	MMMM	Mindy	St. Claire	Work	ms@z.com	ms@z.com	ms@z.com
003…14	NNNN	Tahani	Al-Jamil	Alternate			ta@gp.com

Output File #1

Matthew wanted to see an “ output1.csv ” that would identify duplicates within rows and provide a “cleanup” file, ready for Data Loader, that looks like this:

SalesforceID	Human-Readable ID	FirstName	LastName	PreferredEmail	PersonalEmail	WorkEmail	AlternateEmail
003…01	BBBB	Chidi	Anagonye	Work	ca@gp.com	ca@hotmail.com
003…03	EEEE	Jason	Mendoza	Work		jm@gp.com
003…06	HHHH	Tahani	Al-Jamil	Alternate		ta@yahoo.com	ta@gp.com
003…13	MMMM	Mindy	St. Claire	Work		ms@z.com

Output File #2A

To identify entire rows suspected of being redundant Contacts, Matthew proposed a “ready for DemandTools” file “ output2a.csv ” that would look like the data below.

This data is compatible with the “Select Master / Non-Master records from file” functionality of DemandTools’s “Single Table Dedupe” module.

Each row indicates a “same email address, same first name” pairing that was found in multiple Contact records (where the “email address” could have appeared in any of the “Email” fields that come with the NPSP).

Each column is the Salesforce ID of a Contact in which the email address was found.

Master	Duplicate_1	Duplicate_2
003…01	003…07
003…10	003…02	003…04
003…06	003…14

Output File #2B

For sets of Contacts that shared an email address but not a first name, Matthew wanted to see a bit more detail in the output file “ output2b.csv ”, to facilitate human review. He suggested it look like the output below.

emAddr	Master	Master_FN	Master_LN	Duplicate_1	Duplicate_1_FN	Duplicate_1_LN
es@gp.com	003…10	Eleanor	Shellstrop	003…11	Ellie	Shellstrop
jm@gp.com	003…03	Jason	Mendoza	003…05	Janet	Della-Denunzio

Running The Code Yourself

Unfortunately, I haven’t yet made a blog post about running Python code on your own computer, so if you’re totally new to Python, you’re a bit stuck – but show this code to a colleague who does know how!

Or, if you’re adventurous , download an “Anaconda suite” installer for your computer’s operating system and give it a try.

Don’t bother installing VSCode when it asks if you want to.
Once it’s done, bring up the “Anaconda Navigator” and open “Spyder.”
- Type print('hello') in the code-editing panel at the left, hit “run” near the top (if a box pops up, hit “cancel” and do it again), and then check to see if hello is printed at the right-hand side of your screen in an output console (probably bottom right).
- If so, congrats – you just wrote a program! Backspace out print(hello) and copy/paste my code in instead, then start tweaking it.

And stay tuned for my “office hours” and “hands-on trainings,” where I teach total non-programmers to run code like this.

The Code

There’s no shame in simply copying this code, pasting it into your Python “IDE,” and making a few tweaks!

You will, of course, have to do a bit of “Find & Replace” to fix up column-name references (which, by the way, are case-sensitive to match their capitalization in your input CSV file ), if your columns aren’t precisely spelled “PersonalEmail,” “WorkEmail,” “Human-Readable ID,” etc.

Code to produce “Output1” only

import time
start_time = time.time()
import pandas
pandas.set_option('expand_frame_repr', False)

emFieldNames = ['PersonalEmail','WorkEmail','AlternateEmail'] # Edit this before running
pickToField = {'Personal':'PersonalEmail','Work':'WorkEmail','Alternate':'AlternateEmail'} # Edit this before running
inputFilePath = r'c:\example\contacts.csv' # Edit this before running
output1FilePath = r'c:\example\output1.csv' # Edit this before running
output1NoneFoundFilePath = r'c:\example\nonefound1.txt' # Edit this before running

df = pandas.read_csv(inputFilePath, dtype=object)

df['ChangedAnything'] = False

validPrefEmailIndicatorTFSeries = df['PreferredEmail'].map(pickToField).isin(df.columns)
df.loc[validPrefEmailIndicatorTFSeries,'PrefEmAddr'] = df.lookup(validPrefEmailIndicatorTFSeries.index, df[validPrefEmailIndicatorTFSeries]['PreferredEmail'].map(pickToField))

for fieldName in emFieldNames:
    needsClearingTFSer = (df['PreferredEmail'].map(pickToField) != fieldName) & (df[fieldName] == df['PrefEmAddr'])
    df.loc[needsClearingTFSer,fieldName] = None
    df.loc[needsClearingTFSer,'ChangedAnything'] = True

output1df = df[df['ChangedAnything']]

output1df = output1df.drop(columns=['PrefEmAddr','ChangedAnything'])

if len(output1df) > 0:
    output1df.to_csv(output1FilePath, index=False, quoting=1)
else:
    with open(output1NoneFoundFilePath, 'w') as f:
        f.writelines('No records found')

print('Done running script. Took ' + str(round(time.time()-start_time,2)) + ' seconds.')

Code to produce “Output2A” and “Output2B” only

import time
start_time = time.time()
from collections import OrderedDict
import pandas
pandas.set_option('expand_frame_repr', False)

emFieldNames = ['PersonalEmail','WorkEmail','AlternateEmail'] # Edit this before running
pickToField = {'Personal':'PersonalEmail','Work':'WorkEmail','Alternate':'AlternateEmail'} # Edit this before running
inputFilePath = r'c:\example\contacts.csv' # Edit this before running
output2aFilePath = r'c:\example\output2a.csv' # Edit this before running
output2aNoneFoundFilePath = r'c:\example\nonefound2a.txt' # Edit this before running
output2bFilePath = r'c:\example\output2b.csv' # Edit this before running
output2bNoneFoundFilePath = r'c:\example\nonefound2b.txt' # Edit this before running

df = pandas.read_csv(inputFilePath, dtype=object)

df = df.sort_values('Human-Readable ID')
df = df.reset_index(drop=True)
df = df.set_index(['Human-Readable ID','SalesforceID','FirstName','LastName','PreferredEmail'])

allEmailsSeries = df[emFieldNames].stack()

groupedEmailSeriesGroupItems = allEmailsSeries.groupby(allEmailsSeries).groups.items()

emsAndTheirCellsFound = {emaddr:([x for x in indices]) for emaddr,indices in groupedEmailSeriesGroupItems if len(indices) > 1}

output2aList = []
output2bList = []
for emAddr, foundCells in emsAndTheirCellsFound.items():
    seenFirstNamesAndTheirContacts = {}
    numCellsFound = len(foundCells)
    output2bFact = OrderedDict()
    if numCellsFound > 1:
        for cellLoopCounter, cell in enumerate(foundCells):
            if cell[2] not in seenFirstNamesAndTheirContacts:
                seenFirstNamesAndTheirContacts[cell[2]] = [cell[:-1]]
                if cellLoopCounter == 0:
                    output2bFact['emAddr'] = emAddr
                    output2bFact['Master'] = cell[1]
                    output2bFact['Master_FN'] = cell[2]
                    output2bFact['Master_LN'] = cell[3]
                else:
                    uniqueFirstNamesSeenCountMinus1Str = str(len(seenFirstNamesAndTheirContacts) - 1)
                    output2bFact['Duplicate_'+str(uniqueFirstNamesSeenCountMinus1Str)] = cell[1]
                    output2bFact['Duplicate_'+str(uniqueFirstNamesSeenCountMinus1Str)+'_FN'] = cell[2]
                    output2bFact['Duplicate_'+str(uniqueFirstNamesSeenCountMinus1Str)+'_LN'] = cell[3]
            elif cell[:-1] not in seenFirstNamesAndTheirContacts[cell[2]]:
                seenFirstNamesAndTheirContacts[cell[2]].append(cell[:-1])
            if cellLoopCounter == numCellsFound-1:
                if 'Duplicate_1' in output2bFact:
                    output2bList.append(output2bFact)
                for firstName, contactsForFN in seenFirstNamesAndTheirContacts.items():
                    output2aFact = OrderedDict()
                    if len(contactsForFN) > 1:
                        for contactLoopCounter, contact in enumerate(contactsForFN):
                            if contactLoopCounter == 0:
                                output2aFact['Master'] = contact[1]
                            else:
                                output2aFact['Duplicate_'+str(contactLoopCounter)]=contact[1]
                    if 'Duplicate_1' in output2aFact:
                        output2aList.append(output2aFact)

if len(output2aList) > 0:
    output2adf = pandas.DataFrame(output2aList)
    output2adf.to_csv(output2aFilePath, index=False, quoting=1)
else:
    with open(output2aNoneFoundFilePath, 'w') as f:
        f.writelines('No records found')

if len(output2bList) > 0:
    output2bdf = pandas.DataFrame(output2bList)
    output2bdf.to_csv(output2bFilePath, index=False, quoting=1)
else:
    with open(output2bNoneFoundFilePath, 'w') as f:
        f.writelines('No records found')

print('Done running script. Took ' + str(round(time.time()-start_time,2)) + ' seconds.')

Code to produce all 3 output files at once

import time
start_time = time.time()
from collections import OrderedDict
import pandas
pandas.set_option('expand_frame_repr', False)

emFieldNames = ['PersonalEmail','WorkEmail','AlternateEmail'] # Edit this before running
pickToField = {'Personal':'PersonalEmail','Work':'WorkEmail','Alternate':'AlternateEmail'} # Edit this before running
inputFilePath = r'c:\example\contacts.csv' # Edit this before running
output1FilePath = r'c:\example\output1.csv' # Edit this before running
output1NoneFoundFilePath = r'c:\example\nonefound1.txt' # Edit this before running
output2aFilePath = r'c:\example\output2a.csv' # Edit this before running
output2aNoneFoundFilePath = r'c:\example\nonefound2a.txt' # Edit this before running
output2bFilePath = r'c:\example\output2b.csv' # Edit this before running
output2bNoneFoundFilePath = r'c:\example\nonefound2b.txt' # Edit this before running

df = pandas.read_csv(inputFilePath, dtype=object)

# ### Output1 portion starts here ###
df['ChangedAnything'] = False

validPrefEmailIndicatorTFSeries = df['PreferredEmail'].map(pickToField).isin(df.columns)
df.loc[validPrefEmailIndicatorTFSeries,'PrefEmAddr'] = df.lookup(validPrefEmailIndicatorTFSeries.index, df[validPrefEmailIndicatorTFSeries]['PreferredEmail'].map(pickToField))

for fieldName in emFieldNames:
    needsClearingTFSer = (df['PreferredEmail'].map(pickToField) != fieldName) & (df[fieldName] == df['PrefEmAddr'])
    df.loc[needsClearingTFSer,fieldName] = None
    df.loc[needsClearingTFSer,'ChangedAnything'] = True

output1df = df[df['ChangedAnything']]

output1df = output1df.drop(columns=['PrefEmAddr','ChangedAnything'])

if len(output1df) > 0:
    output1df.to_csv(output1FilePath, index=False, quoting=1)
else:
    with open(output1NoneFoundFilePath, 'w') as f:
        f.writelines('No records found')
# ### Output1 portion ends here ###

# ### Output2 portion starts here ###
df = df.sort_values('Human-Readable ID')
df = df.reset_index(drop=True)
df = df.set_index(['Human-Readable ID','SalesforceID','FirstName','LastName','PreferredEmail'])

allEmailsSeries = df[emFieldNames].stack()

groupedEmailSeriesGroupItems = allEmailsSeries.groupby(allEmailsSeries).groups.items()

emsAndTheirCellsFound = {emaddr:([x for x in indices]) for emaddr,indices in groupedEmailSeriesGroupItems if len(indices) > 1}

output2aList = []
output2bList = []
for emAddr, foundCells in emsAndTheirCellsFound.items():
    seenFirstNamesAndTheirContacts = {}
    numCellsFound = len(foundCells)
    output2bFact = OrderedDict()
    if numCellsFound > 1:
        for cellLoopCounter, cell in enumerate(foundCells):
            if cell[2] not in seenFirstNamesAndTheirContacts:
                seenFirstNamesAndTheirContacts[cell[2]] = [cell[:-1]]
                if cellLoopCounter == 0:
                    output2bFact['emAddr'] = emAddr
                    output2bFact['Master'] = cell[1]
                    output2bFact['Master_FN'] = cell[2]
                    output2bFact['Master_LN'] = cell[3]
                else:
                    uniqueFirstNamesSeenCountMinus1Str = str(len(seenFirstNamesAndTheirContacts) - 1)
                    output2bFact['Duplicate_'+str(uniqueFirstNamesSeenCountMinus1Str)] = cell[1]
                    output2bFact['Duplicate_'+str(uniqueFirstNamesSeenCountMinus1Str)+'_FN'] = cell[2]
                    output2bFact['Duplicate_'+str(uniqueFirstNamesSeenCountMinus1Str)+'_LN'] = cell[3]
            elif cell[:-1] not in seenFirstNamesAndTheirContacts[cell[2]]:
                seenFirstNamesAndTheirContacts[cell[2]].append(cell[:-1])
            if cellLoopCounter == numCellsFound-1:
                if 'Duplicate_1' in output2bFact:
                    output2bList.append(output2bFact)
                for firstName, contactsForFN in seenFirstNamesAndTheirContacts.items():
                    output2aFact = OrderedDict()
                    if len(contactsForFN) > 1:
                        for contactLoopCounter, contact in enumerate(contactsForFN):
                            if contactLoopCounter == 0:
                                output2aFact['Master'] = contact[1]
                            else:
                                output2aFact['Duplicate_'+str(contactLoopCounter)]=contact[1]
                    if 'Duplicate_1' in output2aFact:
                        output2aList.append(output2aFact)

if len(output2aList) > 0:
    output2adf = pandas.DataFrame(output2aList)
    output2adf.to_csv(output2aFilePath, index=False, quoting=1)
else:
    with open(output2aNoneFoundFilePath, 'w') as f:
        f.writelines('No records found')

if len(output2bList) > 0:
    output2bdf = pandas.DataFrame(output2bList)
    output2bdf.to_csv(output2bFilePath, index=False, quoting=1)
else:
    with open(output2bNoneFoundFilePath, 'w') as f:
        f.writelines('No records found')
# ### Output2 portion ends here ###

print('Done running script. Took ' + str(round(time.time()-start_time,2)) + ' seconds.')

Stop reading here if you don’t want to learn Python

You don’t have to thoroughly understand the code above to make use of it.

Just be sure that your “output” files are different filenames than your “input” file and you can’t do too much damage.

Worst-case scenario, you’ll freeze your computer running the program against too large of a file – but several million rows is nothing (a minute or few), so get up and stretch. It’s good for you!

For the curious, the rest of this post is total geekery peeking under the hood.

Code Explanations

I'm struggling to get all the formatting to show up correctly on DEV, so check out my breakdown at the original blog post and come back here with questions, comments, or corrections. I look forward to hearing from you!

Blog