C# Dealing with duplicates
Karen Payne
Posted on July 21, 2024
Introduction
Learn how to contend with duplicate data using ISet and HashSet. Code samples range from working with simple arrays and collections using mocked data to reading an Excel worksheet.
By following along with the provided code, a developer can then consider using what is shown in various operations, like reading incoming data and then adding it to an internal data source such as a database.
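Note that these methods are generally a better fit than Distinct, because a set enforces uniqueness as items are added rather than filtering duplicates in a separate pass.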
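To make the comparison with Distinct concrete, here is a minimal sketch using plain integers. The variable names are only for illustration and are not part of the article's source code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

int[] existing = [1, 2, 3];
int[] incoming = [3, 4, 5];

// Distinct: combine both sequences, then filter duplicates in a separate pass.
List<int> viaDistinct = existing.Concat(incoming).Distinct().ToList();

// HashSet: uniqueness is enforced as each item is added.
HashSet<int> viaSet = new(existing);
foreach (int item in incoming)
{
    viaSet.Add(item); // returns false for 3, which is already present
}

Console.WriteLine(viaDistinct.Count); // 5
Console.WriteLine(viaSet.Count);      // 5
```

Both approaches end with the same five values; the difference is that the set keeps rejecting duplicates for any data added later, without another Distinct call.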
Definitions
ISet is an interface located in the System.Collections.Generic namespace, designed to represent a collection of unique elements, ensuring no duplicates are stored. The primary aim of ISet is to facilitate the management of collections where the uniqueness of each element is paramount, providing efficient methods for set operations like union, intersection, and difference.
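As a quick illustration of the three set operations mentioned above, the following sketch applies UnionWith, IntersectWith and ExceptWith to small integer sets. The variable names are assumptions made for this example.

```csharp
using System;
using System.Collections.Generic;

int[] other = [3, 4, 5];

ISet<int> union = new HashSet<int> { 1, 2, 3 };
union.UnionWith(other);            // keeps everything from both: 1, 2, 3, 4, 5

ISet<int> intersection = new HashSet<int> { 1, 2, 3 };
intersection.IntersectWith(other); // keeps only shared items: 3

ISet<int> difference = new HashSet<int> { 1, 2, 3 };
difference.ExceptWith(other);      // removes shared items, leaving 1, 2

Console.WriteLine(union.SetEquals(new[] { 1, 2, 3, 4, 5 })); // True
Console.WriteLine(intersection.SetEquals(new[] { 3 }));       // True
Console.WriteLine(difference.SetEquals(new[] { 1, 2 }));      // True
```

Each method modifies the set it is called on, so a fresh set is created for every operation here.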
A HashSet is a collection of unique elements that uses a hash table for storage, which makes lookups faster than in list-based collections. Adding and removing elements is, on average, a constant-time operation. However, a HashSet does not guarantee insertion order, and elements cannot be accessed by index.
Frozen Collections (available since .NET 8 and used in several of the code samples) are optimized for collections that are read frequently and whose keys and values do not need to change after creation. These collections are a bit slower to create, but read operations are faster.
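A minimal sketch of the freeze-then-read pattern follows, assuming .NET 8 or later. The country names are made-up sample data.

```csharp
using System;
using System.Collections.Frozen;
using System.Collections.Generic;

HashSet<string> countries = new() { "Canada", "Mexico", "USA" };

// Pay the construction cost once, after the data is final.
FrozenSet<string> lookup = countries.ToFrozenSet();

// Later changes to the source set do not affect the frozen copy.
countries.Add("Brazil");

Console.WriteLine(lookup.Count);              // 3
Console.WriteLine(lookup.Contains("Mexico")); // True
Console.WriteLine(lookup.Contains("Brazil")); // False
```

Because FrozenSet has no public Add method, the usual approach is to finish all changes on a mutable set first and freeze it last.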
Examples
All code samples use mocked data with a small data set for ease of following along, except Example 5, which reads an Excel file using the NuGet package ExcelMapper. See also C# Excel read/write on the cheap for more on ExcelMapper.
Two models are used; both implement INotifyPropertyChanged, which is not needed for ensuring no duplication of data.
Example 1
When adding items to an ISet, many developers first check whether the item is already contained in the set, as shown below.
ISet<int> set = new HashSet<int> { 1, 2, 3 };
int[] array = [3, 4, 5];
foreach (var item in array)
{
// ReSharper disable once CanSimplifySetAddingWithSingleCall
if (!set.Contains(item))
{
set.Add(item);
}
}
But there is no need to check whether an item exists; the Add method does not add an item that is already in the set and returns false instead.
Revised code
ISet<int> set = new HashSet<int> { 1, 2, 3 };
int[] array = [3, 4, 5];
foreach (var item in array)
{
set.Add(item);
}
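Since Add returns a bool, its result can also be used to keep track of what was rejected. The following is a small sketch of that idea; the rejected list is an assumption added for illustration.

```csharp
using System;
using System.Collections.Generic;

ISet<int> set = new HashSet<int> { 1, 2, 3 };
int[] incoming = [3, 4, 5];
List<int> rejected = [];

foreach (int item in incoming)
{
    // Add returns false when the item is already in the set.
    if (!set.Add(item))
    {
        rejected.Add(item);
    }
}

Console.WriteLine(set.Count);                   // 5
Console.WriteLine(string.Join(", ", rejected)); // 3
```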
Example 2
Moving on to a more realistic scenario.
We have a model where not all properties are needed to determine duplicates, e.g. the primary key should not be included, only FirstName, LastName and BirthDate. The best course is to implement IEquatable<Person>, where the Equals method defines which properties determine whether two items are duplicates.
public class Person : INotifyPropertyChanged, IEquatable<Person>
{
private int _id;
private string _firstName;
private string _lastName;
private DateOnly _birthDate;
public int Id
{
get => _id;
set
{
if (value == _id) return;
_id = value;
OnPropertyChanged(nameof(Id));
}
}
public string FirstName
{
get => _firstName;
set
{
if (value == _firstName) return;
_firstName = value;
OnPropertyChanged(nameof(FirstName));
}
}
public string LastName
{
get => _lastName;
set
{
if (value == _lastName) return;
_lastName = value;
OnPropertyChanged(nameof(LastName));
}
}
public DateOnly BirthDate
{
get => _birthDate;
set
{
if (value.Equals(_birthDate)) return;
_birthDate = value;
OnPropertyChanged(nameof(BirthDate));
}
}
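// Two people are duplicates when name and birth date match; Id is ignored.
public bool Equals(Person? compareTo)
=> compareTo is not null &&
FirstName == compareTo.FirstName &&
LastName == compareTo.LastName &&
BirthDate == compareTo.BirthDate;
// Keep object.Equals consistent with IEquatable<Person>.Equals.
public override bool Equals(object? obj) => Equals(obj as Person);
// Hash the same properties that Equals compares.
public override int GetHashCode()
=> HashCode.Combine(FirstName, LastName, BirthDate);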
public event PropertyChangedEventHandler? PropertyChanged;
protected virtual void OnPropertyChanged(string propertyName)
{
PropertyChanged?.Invoke(this, new PropertyChangedEventArgs(propertyName));
}
public override string ToString() => $"{FirstName,-12}{LastName}";
}
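To see how Equals and GetHashCode drive a HashSet, here is a minimal sketch. PersonSketch is a hypothetical, trimmed-down stand-in for the Person model (no INotifyPropertyChanged), used only so the example is self-contained.

```csharp
using System;
using System.Collections.Generic;

var first = new PersonSketch
{
    Id = 1, FirstName = "Karen", LastName = "Payne", BirthDate = new DateOnly(1956, 9, 24)
};
var second = new PersonSketch
{
    Id = 4, FirstName = "Karen", LastName = "Payne", BirthDate = new DateOnly(1956, 9, 24)
};

var set = new HashSet<PersonSketch> { first };
bool added = set.Add(second);

Console.WriteLine(first.Equals(second)); // True: Id is not part of the comparison
Console.WriteLine(added);                // False
Console.WriteLine(set.Count);            // 1

// Hypothetical model for this sketch only.
public sealed class PersonSketch : IEquatable<PersonSketch>
{
    public int Id { get; init; }
    public string FirstName { get; init; } = "";
    public string LastName { get; init; } = "";
    public DateOnly BirthDate { get; init; }

    public bool Equals(PersonSketch? other) =>
        other is not null &&
        FirstName == other.FirstName &&
        LastName == other.LastName &&
        BirthDate == other.BirthDate;

    public override bool Equals(object? obj) => Equals(obj as PersonSketch);

    // Must hash the same properties that Equals compares.
    public override int GetHashCode() => HashCode.Combine(FirstName, LastName, BirthDate);
}
```

The set first uses GetHashCode to find candidate entries and then Equals to confirm a match, so both members need to agree on the same properties.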
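Use the following set to represent existing data (no need for a large set of data). Note that the duplicate Karen Payne entry is dropped when the HashSet is constructed, leaving two items.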
private static ISet<Person> PeopleData()
{
ISet<Person> peopleSet = new HashSet<Person>([
new() { Id = 1, FirstName = "Karen", LastName = "Payne",
BirthDate = new DateOnly(1956,9,24)},
new() { Id = 2, FirstName = "Sam", LastName = "Smith",
BirthDate = new DateOnly(1976,3,4) },
new() { Id = 1, FirstName = "Karen", LastName = "Payne",
BirthDate = new DateOnly(1956,9,24) }
]);
return peopleSet;
}
For simplicity, add two new items using mocked data that in a real application might come from a file, database or web service.
As shown in the int example, only Frank Adams is added here; Karen Payne is rejected as per the IEquatable<Person> definition.
private static FrozenSet<Person> PeopleAdd()
{
ShowExecutingMethodName();
var peopleSet = PeopleData();
peopleSet.Add(new() { Id = 3, FirstName = "Frank", LastName = "Adams",
BirthDate = new DateOnly(1966, 3, 4) });
peopleSet.Add(new() { Id = 4, FirstName = "Karen", LastName = "Payne",
BirthDate = new DateOnly(1956, 9, 24) });
return peopleSet.ToFrozenSet();
}
Note
ToFrozenSet can be expensive to create but efficient for read operations.
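Result
The returned set contains Karen Payne, Sam Smith and Frank Adams; the second Karen Payne is not added.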
Example 3
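In this example we will introduce UnionWith, which adds to the current set every element of the specified collection that is not already present.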
Base data.
private static ISet<Person> PeopleData()
{
ISet<Person> peopleSet = new HashSet<Person>([
new() { Id = 1, FirstName = "Karen", LastName = "Payne",
BirthDate = new DateOnly(1956,9,24)},
new() { Id = 2, FirstName = "Sam", LastName = "Smith",
BirthDate = new DateOnly(1976,3,4) },
new() { Id = 1, FirstName = "Karen", LastName = "Payne",
BirthDate = new DateOnly(1956,9,24) }
]);
return peopleSet;
}
Add items with UnionWith.
var peopleSet = PeopleData();
peopleSet.UnionWith([
new() { Id = 1, FirstName = "Karen", LastName = "Payne",
BirthDate = new DateOnly(1956,9,24)},
new() { Id = 2, FirstName = "Sam", LastName = "Smith",
BirthDate = new DateOnly(1976,3,4) },
new() { Id = 3, FirstName = "Frank", LastName = "Adams",
BirthDate = new DateOnly(1966,3,4) },
new() { Id = 1, FirstName = "Karen", LastName = "Payne",
BirthDate = new DateOnly(1956,9,24) }
]);
Result
The set contains Karen Payne, Sam Smith and Frank Adams; only Frank Adams is new.
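When importing, it can be useful to report how many items UnionWith actually added. The sketch below mirrors the data above but uses value tuples instead of the Person model so it runs on its own; that substitution is an assumption made for brevity.

```csharp
using System;
using System.Collections.Generic;

HashSet<(string FirstName, string LastName, DateOnly BirthDate)> people =
[
    ("Karen", "Payne", new DateOnly(1956, 9, 24)),
    ("Sam", "Smith", new DateOnly(1976, 3, 4))
];

(string FirstName, string LastName, DateOnly BirthDate)[] incoming =
[
    ("Karen", "Payne", new DateOnly(1956, 9, 24)),
    ("Sam", "Smith", new DateOnly(1976, 3, 4)),
    ("Frank", "Adams", new DateOnly(1966, 3, 4)),
    ("Karen", "Payne", new DateOnly(1956, 9, 24))
];

int before = people.Count;
people.UnionWith(incoming); // duplicates within incoming and against people are ignored
int added = people.Count - before;

Console.WriteLine($"Incoming: {incoming.Length}, added: {added}, total: {people.Count}");
// Incoming: 4, added: 1, total: 3
```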
Example 4
In this example we will introduce ExceptWith which removes all elements in the specified collection from the current set. This method is an O(n) operation, where n is the number of elements in the other parameter.
Base data
private static ISet<Person> PeopleData()
{
ISet<Person> peopleSet = new HashSet<Person>([
new() { Id = 1, FirstName = "Karen", LastName = "Payne",
BirthDate = new DateOnly(1956,9,24)},
new() { Id = 2, FirstName = "Sam", LastName = "Smith",
BirthDate = new DateOnly(1976,3,4) },
new() { Id = 1, FirstName = "Karen", LastName = "Payne",
BirthDate = new DateOnly(1956,9,24) }
]);
return peopleSet;
}
Using ExceptWith
private static FrozenSet<Person> PeopleExceptWith()
{
ShowExecutingMethodName();
var peopleSet = PeopleData();
peopleSet.ExceptWith([
new() { Id = 2, FirstName = "Sam", LastName = "Smith",
BirthDate = new DateOnly(1976,3,4) },
new() { Id = 3, FirstName = "Frank", LastName = "Adams",
BirthDate = new DateOnly(1966,3,4) },
]);
return peopleSet.ToFrozenSet();
}
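Result
Only Karen Payne remains; Sam Smith is removed, and Frank Adams is ignored because he was never in the set.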
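ExceptWith changes the set it is called on. If the original data must stay intact, one option is to copy the set first, as in this sketch. Plain strings are used here instead of the Person model, purely to keep the example short.

```csharp
using System;
using System.Collections.Generic;

HashSet<string> current = new() { "Karen Payne", "Sam Smith" };
string[] toRemove = ["Sam Smith", "Frank Adams"];

// Copy first so the original set is left untouched.
HashSet<string> remaining = new(current);
remaining.ExceptWith(toRemove); // "Frank Adams" is not in the set, so it is ignored

Console.WriteLine(current.Count);                // 2
Console.WriteLine(string.Join(", ", remaining)); // Karen Payne
```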
Example 5
In this example, data is read from an Excel worksheet.
Model used by ExcelMapper to read the worksheet.
In the Equals method, the properties used to define the comparison are Company (string), Country (string) and JoinDate (DateOnly).
public partial class Customers : INotifyPropertyChanged, IEquatable<Customers>
{
public int Id { get; set; }
public string Company { get; set; }
public string ContactType { get; set; }
public string ContactName { get; set; }
public string Country { get; set; }
public DateOnly JoinDate { get; set; }
public event PropertyChangedEventHandler? PropertyChanged;
protected virtual void OnPropertyChanged(string propertyName)
{
PropertyChanged?.Invoke(this, new PropertyChangedEventArgs(propertyName));
}
public bool Equals(Customers other)
{
if (ReferenceEquals(null, other)) return false;
if (ReferenceEquals(this, other)) return true;
return Company == other.Company && Country == other.Country && JoinDate.Equals(other.JoinDate);
}
public override bool Equals(object obj)
{
if (ReferenceEquals(null, obj)) return false;
if (ReferenceEquals(this, obj)) return true;
if (obj.GetType() != this.GetType()) return false;
return Equals((Customers)obj);
}
public override int GetHashCode()
{
unchecked
{
var hashCode = (Company != null ? Company.GetHashCode() : 0);
hashCode = (hashCode * 397) ^ (Country != null ? Country.GetHashCode() : 0);
hashCode = (hashCode * 397) ^ JoinDate.GetHashCode();
return hashCode;
}
}
}
When reading the sheet, the first row, which defines the column names, is not read as data.
- Create an instance of ExcelMapper.
- Read the worksheet using ExcelMapper to a list.
- Feed the above list to a HashSet which is then used as an ISet.
- Validate that no duplicates were added.
private static async Task ReadFromExcel()
{
ShowExecutingMethodName1();
const string excelFile = "ExcelFiles\\Customers.xlsx";
ExcelMapper excel = new();
var customers = (await excel.FetchAsync<Customers>(excelFile, nameof(Customers)))
.ToList();
AnsiConsole.MarkupLine($"[cyan]Read {customers.Count}[/]");
/*
* There are two duplicates so the next count is two less
*/
ISet<Customers> customersSet = new HashSet<Customers>(customers);
AnsiConsole.MarkupLine($"[cyan]Afterwards {customersSet.Count}[/]");
List<Customers> customersList = [.. customersSet];
}
Result
Read 92
Afterwards 90
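The counts show how many rows were dropped but not which ones. One way to find them, sketched here with hypothetical in-memory rows instead of the real worksheet, is to group by the same values that define equality.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical rows standing in for what was read from the worksheet.
List<(string Company, string Country, DateOnly JoinDate)> rows =
[
    ("Acme", "USA", new DateOnly(2020, 1, 15)),
    ("Globex", "Canada", new DateOnly(2021, 6, 1)),
    ("Acme", "USA", new DateOnly(2020, 1, 15)),
    ("Acme", "Mexico", new DateOnly(2020, 1, 15))
];

var unique = new HashSet<(string Company, string Country, DateOnly JoinDate)>(rows);

// Group by the whole tuple to see which keys occurred more than once.
var duplicates = rows
    .GroupBy(row => row)
    .Where(group => group.Count() > 1)
    .Select(group => (Key: group.Key, Occurrences: group.Count()))
    .ToList();

Console.WriteLine($"Read {rows.Count}, unique {unique.Count}"); // Read 4, unique 3
foreach (var (key, occurrences) in duplicates)
{
    Console.WriteLine($"{key.Company} ({key.Country}) appears {occurrences} times");
}
// Acme (USA) appears 2 times
```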
Example 6
Using the Person model, sort by last name and use an extension method to provide AddRange for a SortedSet.
Extension method for AddRange. Note that GitHub Copilot was used to document the AddRange method.
public static class Extensions
{
/// <summary>
/// Adds a range of items to the <see cref="SortedSet{T}"/>.
/// </summary>
/// <typeparam name="T">The type of items in the <see cref="SortedSet{T}"/>.</typeparam>
/// <param name="source">The <see cref="SortedSet{T}"/> to add the items to.</param>
/// <param name="items">The collection of items to add.</param>
/// <returns><c>true</c> if all items were successfully added; otherwise, <c>false</c>.</returns>
public static bool AddRange<T>(this SortedSet<T> source, IEnumerable<T> items)
{
bool allAdded = true;
foreach (var item in items)
{
allAdded = allAdded & source.Add(item);
}
return allAdded;
}
}
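Class to sort people by the LastName property. Note that a SortedSet uses its comparer, not IEquatable<Person>, to decide uniqueness, so with this comparer two people who share a last name are treated as duplicates.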
public class PersonComparer : IComparer<Person>
{
public int Compare(Person left, Person right)
=> string.Compare(left.LastName, right.LastName, StringComparison.Ordinal);
}
Mocked List<Person> with one duplicate by FirstName, LastName and BirthDate
private static List<Person> PeopleDataList()
{
List<Person> peopleList =
[
new() { Id = 1, FirstName = "Mike", LastName = "Williams",
BirthDate = new DateOnly(1956,9,24)},
new()
{
Id = 1, FirstName = "Karen", LastName = "Payne",
BirthDate = new DateOnly(1956, 9, 24)
},
new()
{
Id = 2, FirstName = "Sam", LastName = "Smith",
BirthDate = new DateOnly(1976, 3, 4)
},
new()
{
Id = 1, FirstName = "Karen", LastName = "Payne",
BirthDate = new DateOnly(1956, 9, 24)
}
];
return peopleList.OrderBy(x => x.LastName).ToList();
}
Code to add the list above to the SortedSet and display the results.
private static void PersonSortedByLastNameExample()
{
ShowExecutingMethodName1();
var list = PeopleDataList();
var people = new SortedSet<Person>(new PersonComparer());
people.AddRange(list);
people.Dump(tableConfig: _tableConfig);
}
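Because a SortedSet treats items as duplicates whenever its comparer returns 0, a comparer on last name alone rejects different people who share a last name. The sketch below shows that effect and one possible remedy, a tie-breaker on first name. It uses value tuples and Comparer<T>.Create rather than the article's Person and PersonComparer classes, which is an assumption made to keep it self-contained.

```csharp
using System;
using System.Collections.Generic;

// Compares on last name only, similar in spirit to PersonComparer.
var byLastName = Comparer<(string FirstName, string LastName)>.Create(
    (left, right) => string.Compare(left.LastName, right.LastName, StringComparison.Ordinal));

var lastNameOnly = new SortedSet<(string FirstName, string LastName)>(byLastName);
bool karenAdded = lastNameOnly.Add(("Karen", "Payne"));
bool anneAdded = lastNameOnly.Add(("Anne", "Payne")); // compares as 0, so it is rejected

// Adds first name as a tie-breaker so only true repeats are rejected.
var byLastThenFirst = Comparer<(string FirstName, string LastName)>.Create((left, right) =>
{
    int result = string.Compare(left.LastName, right.LastName, StringComparison.Ordinal);
    return result != 0
        ? result
        : string.Compare(left.FirstName, right.FirstName, StringComparison.Ordinal);
});

var withTieBreaker = new SortedSet<(string FirstName, string LastName)>(byLastThenFirst)
{
    ("Karen", "Payne"),
    ("Anne", "Payne"),
    ("Karen", "Payne")
};

Console.WriteLine(karenAdded);           // True
Console.WriteLine(anneAdded);            // False
Console.WriteLine(withTieBreaker.Count); // 2
```

Which properties belong in the comparer depends on what should count as a duplicate for the data at hand.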
Summary
With the provided code samples, preventing duplication when adding to an existing collection becomes easy. Other operations, such as removing items, are shown as well.
A word of advice: if an operation is for a large dataset that will eventually be pushed to a database, consider using functionality in the database, e.g. SQL Server MERGE or creating a unique index.