C# Dealing with duplicates

karenpayneoregon

Karen Payne

Posted on July 21, 2024

Introduction

Learn how to contend with duplicate data using ISet and HashSet. Code samples range from working with simple arrays and collections using mocked data to reading an Excel worksheet.

By following along with the provided code, a developer can then consider using what is shown for various operations, like reading incoming data and then adding it to an internal data source such as a database.

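Note that these methods are in general a better fit than Distinct when data keeps arriving: Distinct filters a sequence once, while a set rejects duplicates on every Add.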
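To make the contrast with Distinct concrete, here is a minimal sketch using plain integers. The variable names are illustrative only and do not come from the article's source code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

List<int> incoming = [1, 2, 2, 3];

// Distinct removes duplicates once, but the resulting list
// does nothing to stop a duplicate from being added later.
List<int> distinctList = incoming.Distinct().ToList();
distinctList.Add(3);

// A HashSet enforces uniqueness on every Add.
HashSet<int> set = [.. incoming];
bool added = set.Add(3);

Console.WriteLine(distinctList.Count); // 4
Console.WriteLine(set.Count);          // 3
Console.WriteLine(added);              // False
```

Distinct is still a reasonable choice for a one-off query; the set is the better fit when uniqueness has to hold for the lifetime of the collection.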

Definitions

ISet is an interface located in the System.Collections.Generic namespace, designed to represent a collection of unique elements, ensuring no duplicates are stored. The primary aim of ISet is to facilitate the management of collections where the uniqueness of each element is paramount, providing efficient methods for set operations like union, intersection, and difference.
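The three set operations mentioned above can be sketched with the ISet methods UnionWith, IntersectWith and ExceptWith. This is a small illustration with integers, not code from the article's repository. Each operation modifies the set it is called on, so a copy of the left side is made first.

```csharp
using System;
using System.Collections.Generic;

ISet<int> left = new HashSet<int> { 1, 2, 3 };
int[] right = [3, 4, 5];

// Union: every element found in either collection.
ISet<int> union = new HashSet<int>(left);
union.UnionWith(right);

// Intersection: only the elements found in both.
ISet<int> intersection = new HashSet<int>(left);
intersection.IntersectWith(right);

// Difference: elements of left that are not in right.
ISet<int> difference = new HashSet<int>(left);
difference.ExceptWith(right);

Console.WriteLine(union.Count);        // 5
Console.WriteLine(intersection.Count); // 1
Console.WriteLine(difference.Count);   // 2
```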

A HashSet is a collection of unique elements that uses a hash table for storage, which makes looking up an element faster than scanning a list or an array. Adding and removing elements also takes constant time on average. However, it does not guarantee insertion order, and elements cannot be accessed by index.

Frozen collections (which are used in several of the code samples and live in the System.Collections.Frozen namespace, available in .NET 8 and later) are optimized for situations where a collection will be accessed frequently and its keys and values do not need to change after creation. These collections are a bit slower to create, but read operations are faster.
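As a rough sketch of the frozen-collection idea, the snippet below builds a regular HashSet, then freezes it once for repeated lookups. The country names are made-up sample values.

```csharp
using System;
using System.Collections.Frozen;
using System.Collections.Generic;

// The duplicate "Germany" is ignored when the HashSet is populated.
HashSet<string> countries = ["Germany", "France", "Germany"];

// Pay the construction cost once, then read as often as needed.
FrozenSet<string> frozen = countries.ToFrozenSet();

Console.WriteLine(frozen.Count);              // 2
Console.WriteLine(frozen.Contains("France")); // True
Console.WriteLine(frozen.Contains("Spain"));  // False
```

Freezing pays off when the set is built once and queried many times; for data that keeps changing, a plain HashSet is the simpler choice.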

Examples

All code samples use mocked data with a small data set for ease of following along, except Example 5, which reads an Excel file using the NuGet package ExcelMapper. See also C# Excel read/write on the cheap for more on ExcelMapper.

Two models are used; both implement INotifyPropertyChanged, which is not needed for ensuring that no data is duplicated.

Source code

Example 1

When adding new items to an ISet, many developers first check whether the item is already contained in the set, as shown below.

ISet<int> set = new HashSet<int> { 1, 2, 3 };

int[] array = [3, 4, 5];

foreach (var item in array)
{
    // ReSharper disable once CanSimplifySetAddingWithSingleCall
    if (!set.Contains(item))
    {
        set.Add(item);
    }
}

But there is no need to check whether an item exists: the Add method does not add an item that is already in the set, and returns false in that case.

Revised code

ISet<int> set = new HashSet<int> { 1, 2, 3 };

int[] array = [3, 4, 5];

foreach (var item in array)
{
    set.Add(item);
}
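If the rejected items matter, for example to log what was skipped during an import, the boolean returned by Add can be used directly. The sketch below is one possible way to do that; the rejected list is a hypothetical addition, not part of the original sample.

```csharp
using System;
using System.Collections.Generic;

ISet<int> set = new HashSet<int> { 1, 2, 3 };
int[] incoming = [3, 4, 5];

List<int> rejected = [];

foreach (var item in incoming)
{
    // Add returns false when the item is already present.
    if (!set.Add(item))
    {
        rejected.Add(item);
    }
}

Console.WriteLine(string.Join(",", rejected)); // 3
Console.WriteLine(set.Count);                  // 5
```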

Example 2

Moving on to a more realistic scenario.

We have a model where not all properties are needed to determine duplicates, e.g. the primary key should not be included, only FirstName, LastName and BirthDate. The best course is to implement IEquatable<Person>, where Equals and GetHashCode use only the properties that determine duplication.

public class Person : INotifyPropertyChanged, IEquatable<Person>
{
    private int _id;
    private string _firstName;
    private string _lastName;
    private DateOnly _birthDate;

    public int Id
    {
        get => _id;
        set
        {
            if (value == _id) return;
            _id = value;
            OnPropertyChanged(nameof(Id));
        }
    }

    public string FirstName
    {
        get => _firstName;
        set
        {
            if (value == _firstName) return;
            _firstName = value;
            OnPropertyChanged(nameof(FirstName));
        }
    }

    public string LastName
    {
        get => _lastName;
        set
        {
            if (value == _lastName) return;
            _lastName = value;
            OnPropertyChanged(nameof(LastName));
        }
    }

    public DateOnly BirthDate
    {
        get => _birthDate;
        set
        {
            if (value.Equals(_birthDate)) return;
            _birthDate = value;
            OnPropertyChanged(nameof(BirthDate));
        }
    }

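    // Null-safe: a null argument is never equal to this instance.
    public bool Equals(Person? compareTo)
        => compareTo is not null &&
           FirstName == compareTo.FirstName &&
           LastName == compareTo.LastName &&
           BirthDate == compareTo.BirthDate;

    // Keep object.Equals consistent with IEquatable<Person> and GetHashCode.
    public override bool Equals(object? obj) => Equals(obj as Person);

    public override int GetHashCode() 
        => HashCode.Combine(FirstName, LastName, BirthDate);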

    public event PropertyChangedEventHandler? PropertyChanged;

    protected virtual void OnPropertyChanged(string propertyName)
    {
        PropertyChanged?.Invoke(this, new PropertyChangedEventArgs(propertyName));
    }
    public override string ToString() => $"{FirstName,-12}{LastName}";
}

Use the following set to represent existing data (there is no need for a large data set). Note that the second Karen Payne entry is already dropped when the HashSet is constructed.

private static ISet<Person> PeopleData()
{

    ISet<Person> peopleSet = new HashSet<Person>([
        new() { Id = 1, FirstName = "Karen", LastName = "Payne",
            BirthDate = new DateOnly(1956,9,24)},
        new() { Id = 2, FirstName = "Sam", LastName = "Smith",
            BirthDate = new DateOnly(1976,3,4) },
        new() { Id = 1, FirstName = "Karen", LastName = "Payne", 
            BirthDate = new DateOnly(1956,9,24) }
    ]);

    return peopleSet;
}

And for simplicity, add two new items using mocked data that in a real application might be an import from a file, database or web service.

As shown in the example for int, only Frank Adams is added here; Karen Payne is rejected per the IEquatable<Person> definition, even though her Id differs.

private static FrozenSet<Person> PeopleAdd()
{
    ShowExecutingMethodName();

    var peopleSet = PeopleData();

    peopleSet.Add(new() { Id = 3, FirstName = "Frank", LastName = "Adams", 
        BirthDate = new DateOnly(1966, 3, 4) });
    peopleSet.Add(new() { Id = 4, FirstName = "Karen", LastName = "Payne",
        BirthDate = new DateOnly(1956, 9, 24) });

    return peopleSet.ToFrozenSet();
}

Note
Creating a FrozenSet with ToFrozenSet can be expensive, but the result is efficient for read operations.

Result

Shows result for adding

Example 3

In this example we will introduce UnionWith.

Base data.

private static ISet<Person> PeopleData()
{

    ISet<Person> peopleSet = new HashSet<Person>([
        new() { Id = 1, FirstName = "Karen", LastName = "Payne",
            BirthDate = new DateOnly(1956,9,24)},
        new() { Id = 2, FirstName = "Sam", LastName = "Smith",
            BirthDate = new DateOnly(1976,3,4) },
        new() { Id = 1, FirstName = "Karen", LastName = "Payne", 
            BirthDate = new DateOnly(1956,9,24) }
    ]);

    return peopleSet;
}

Add items with UnionWith.

var peopleSet = PeopleData();

peopleSet.UnionWith([
    new() { Id = 1, FirstName = "Karen", LastName = "Payne", 
        BirthDate = new DateOnly(1956,9,24)},
    new() { Id = 2, FirstName = "Sam", LastName = "Smith", 
        BirthDate = new DateOnly(1976,3,4) },
    new() { Id = 3, FirstName = "Frank", LastName = "Adams", 
        BirthDate = new DateOnly(1966,3,4) },
    new() { Id = 1, FirstName = "Karen", LastName = "Payne", 
        BirthDate = new DateOnly(1956,9,24) }
]);

Result

Shows result from UnionWith

Example 4

In this example we will introduce ExceptWith which removes all elements in the specified collection from the current set. This method is an O(n) operation, where n is the number of elements in the other parameter.

Base data

private static ISet<Person> PeopleData()
{

    ISet<Person> peopleSet = new HashSet<Person>([
        new() { Id = 1, FirstName = "Karen", LastName = "Payne",
            BirthDate = new DateOnly(1956,9,24)},
        new() { Id = 2, FirstName = "Sam", LastName = "Smith",
            BirthDate = new DateOnly(1976,3,4) },
        new() { Id = 1, FirstName = "Karen", LastName = "Payne", 
            BirthDate = new DateOnly(1956,9,24) }
    ]);

    return peopleSet;
}

Using ExceptWith

private static FrozenSet<Person> PeopleExceptWith()
{
    ShowExecutingMethodName();

    var peopleSet = PeopleData();

    peopleSet.ExceptWith([
        new() { Id = 2, FirstName = "Sam", LastName = "Smith", 
            BirthDate = new DateOnly(1976,3,4) },
        new() { Id = 3, FirstName = "Frank", LastName = "Adams", 
            BirthDate = new DateOnly(1966,3,4) },
    ]);


    return peopleSet.ToFrozenSet();
}

Result

Shows results for using ExceptWith

Example 5

In this example, data is read from Excel as shown below.

Excel sheet with two duplicate rows

Model for reading the above worksheet using ExcelMapper.

In the Equals method, the properties used to define the comparison are Company (string), Country (string) and JoinDate (DateOnly).

public partial class Customers : INotifyPropertyChanged, IEquatable<Customers>
{
    public int Id { get; set; }

    public string Company { get; set; }

    public string ContactType { get; set; }

    public string ContactName { get; set; }

    public string Country { get; set; }

    public DateOnly JoinDate { get; set; }

    public event PropertyChangedEventHandler? PropertyChanged;

    protected virtual void OnPropertyChanged(string propertyName)
    {
        PropertyChanged?.Invoke(this, new PropertyChangedEventArgs(propertyName));
    }

    public bool Equals(Customers other)
    {
        if (ReferenceEquals(null, other)) return false;
        if (ReferenceEquals(this, other)) return true;
        return Company == other.Company && Country == other.Country && JoinDate.Equals(other.JoinDate);
    }

    public override bool Equals(object obj)
    {
        if (ReferenceEquals(null, obj)) return false;
        if (ReferenceEquals(this, obj)) return true;
        if (obj.GetType() != this.GetType()) return false;
        return Equals((Customers)obj);
    }

    public override int GetHashCode()
    {
        unchecked
        {
            var hashCode = (Company != null ? Company.GetHashCode() : 0);
            hashCode = (hashCode * 397) ^ (Country != null ? Country.GetHashCode() : 0);
            hashCode = (hashCode * 397) ^ JoinDate.GetHashCode();
            return hashCode;
        }
    }
}

When reading the sheet, the first row, which defines the column names, is not read as data.

  • Create an instance of ExcelMapper.
  • Read the worksheet using ExcelMapper to a list.
  • Feed the above list to a HashSet, which is then used as an ISet.
  • Validate that no duplicates were added.

private static async Task ReadFromExcel()
{

    ShowExecutingMethodName1();

    const string excelFile = "ExcelFiles\\Customers.xlsx";
    ExcelMapper excel = new();

    var customers = (await excel.FetchAsync<Customers>(excelFile, nameof(Customers)))
        .ToList();

    AnsiConsole.MarkupLine($"[cyan]Read {customers.Count}[/]");

    /*
     * There are two duplicates so the next count is two less
     */
    ISet<Customers> customersSet = new HashSet<Customers>(customers);
    AnsiConsole.MarkupLine($"[cyan]Afterwards {customersSet.Count}[/]");

    List<Customers> customersList = [.. customersSet];

}

Result
Read 92
Afterwards 90

Example 6

Using the Person model, sort by last name and use an extension method to provide AddRange for a SortedSet.

Extension method for AddRange. Note that GitHub Copilot was used to document the AddRange method.

public static class Extensions
{

    /// <summary>
    /// Adds a range of items to the <see cref="SortedSet{T}"/>.
    /// </summary>
    /// <typeparam name="T">The type of items in the <see cref="SortedSet{T}"/>.</typeparam>
    /// <param name="source">The <see cref="SortedSet{T}"/> to add the items to.</param>
    /// <param name="items">The collection of items to add.</param>
    /// <returns><c>true</c> if all items were successfully added; otherwise, <c>false</c>.</returns>
    public static bool AddRange<T>(this SortedSet<T> source, IEnumerable<T> items)
    {
        bool allAdded = true;
        foreach (var item in items)
        {
            allAdded &= source.Add(item);
        }
        return allAdded;
    }
}

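Class to sort people by the LastName property. A SortedSet treats items as duplicates when its comparer returns 0, so ties on LastName are broken by FirstName and BirthDate to stay in line with Person.Equals.

public class PersonComparer : IComparer<Person>
{
    // A result of 0 means "duplicate" to SortedSet, so compare all three properties.
    public int Compare(Person left, Person right)
    {
        var result = string.Compare(left.LastName, right.LastName, StringComparison.Ordinal);
        if (result == 0) result = string.Compare(left.FirstName, right.FirstName, StringComparison.Ordinal);
        return result != 0 ? result : left.BirthDate.CompareTo(right.BirthDate);
    }
}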
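One detail worth keeping in mind with SortedSet: uniqueness is decided by the comparer, not by IEquatable or GetHashCode. The sketch below illustrates this with hypothetical (Last, First) tuples standing in for Person; the names and comparers are assumptions made for the illustration.

```csharp
using System;
using System.Collections.Generic;

// Comparer that looks at the last name only.
var byLastNameOnly = new SortedSet<(string Last, string First)>(
    Comparer<(string Last, string First)>.Create(
        (x, y) => string.Compare(x.Last, y.Last, StringComparison.Ordinal)));

byLastNameOnly.Add(("Payne", "Karen"));
// Rejected: the comparer returns 0, so the set sees a duplicate.
bool jimAdded = byLastNameOnly.Add(("Payne", "Jim"));

// Comparer that breaks ties on the first name.
var byFullName = new SortedSet<(string Last, string First)>(
    Comparer<(string Last, string First)>.Create((x, y) =>
    {
        int result = string.Compare(x.Last, y.Last, StringComparison.Ordinal);
        return result != 0
            ? result
            : string.Compare(x.First, y.First, StringComparison.Ordinal);
    }));

byFullName.Add(("Payne", "Karen"));
byFullName.Add(("Payne", "Jim"));

Console.WriteLine(jimAdded);             // False
Console.WriteLine(byLastNameOnly.Count); // 1
Console.WriteLine(byFullName.Count);     // 2
```

A comparer that inspects a single property is fine when that property is unique in the data; otherwise distinct items that share it are silently dropped.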

Mocked List<Person> with one duplicate by FirstName, LastName and BirthDate

private static List<Person> PeopleDataList()
{
    List<Person> peopleList =
    [
        new() { Id = 1, FirstName = "Mike", LastName = "Williams",
            BirthDate = new DateOnly(1956,9,24)},
        new()
        {
            Id = 1, FirstName = "Karen", LastName = "Payne",
            BirthDate = new DateOnly(1956, 9, 24)
        },

        new()
        {
            Id = 2, FirstName = "Sam", LastName = "Smith",
            BirthDate = new DateOnly(1976, 3, 4)
        },

        new()
        {
            Id = 1, FirstName = "Karen", LastName = "Payne",
            BirthDate = new DateOnly(1956, 9, 24)
        }
    ];

    return peopleList.OrderBy(x => x.LastName).ToList();
}

Code to add the list above to the SortedSet and display the results.

private static void PersonSortedByLastNameExample()
{
    ShowExecutingMethodName1();

    var list = PeopleDataList();

    var people = new SortedSet<Person>(new PersonComparer());

    people.AddRange(list);

    people.Dump(tableConfig: _tableConfig);

}

Results for above code

Summary

With the provided code samples, preventing duplicates when adding to a predefined list becomes easy. Other examples are provided as well, such as removing items.

A word of advice: if an operation involves a large dataset that will eventually be pushed to a database, consider using functionality in the database, e.g. SQL Server MERGE or a unique index.
