Batch processing with Directory.EnumerateFiles
Bohdan Stupak
Posted on April 28, 2021
When you need to retrieve files from a directory, Directory.GetFiles is a simple answer that is sufficient for most scenarios. However, when you deal with a large amount of data, you might need more advanced techniques.
Example
Let's assume you have a big data solution and need to process a directory that contains 200,000 files. For each file, you extract some basic info:
public record FileProcessingDto
{
    public string FullPath { get; set; }
    public long Size { get; set; }
    public string FileNameWithoutExtension { get; set; }
    public string Hash { get; internal set; }
}
Note how we conveniently use novel C# 9 record types for our DTO here.
After that, we send the extracted info for further processing. Let's emulate it with the following snippet:
public class FileProcessingService
{
    public Task Process(IReadOnlyCollection<FileProcessingDto> files, CancellationToken cancellationToken = default)
    {
        // A foreach is used here because a bare Select is lazy:
        // without enumeration its body would never execute.
        foreach (var p in files)
        {
            Console.WriteLine($"Processing {p.FileNameWithoutExtension} located at {p.FullPath} of size {p.Size} bytes");
        }

        // Emulate a single I/O round trip per call.
        return Task.Delay(TimeSpan.FromMilliseconds(20), cancellationToken);
    }
}
Now the final piece is extracting the info and calling the service:
using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;
using System.Threading.Tasks;

public class Worker
{
    public const string Path = @"path to 200k files";
    private readonly FileProcessingService _processingService;

    public Worker()
    {
        _processingService = new FileProcessingService();
    }

    // Computes the MD5 hash of the file contents as a lowercase hex string.
    private string CalculateHash(string file)
    {
        using var md5Instance = MD5.Create();
        using var stream = File.OpenRead(file);
        var hashResult = md5Instance.ComputeHash(stream);
        return BitConverter.ToString(hashResult)
            .Replace("-", "", StringComparison.OrdinalIgnoreCase)
            .ToLowerInvariant();
    }

    private FileProcessingDto MapToDto(string file)
    {
        var fileInfo = new FileInfo(file);
        return new FileProcessingDto()
        {
            FullPath = file,
            Size = fileInfo.Length,
            // FileInfo.Name keeps the extension, so strip it explicitly.
            // System.IO.Path is fully qualified because the Path constant above hides it.
            FileNameWithoutExtension = System.IO.Path.GetFileNameWithoutExtension(file),
            Hash = CalculateHash(file)
        };
    }

    public Task DoWork()
    {
        // GetFiles returns every path at once, and ToList keeps every DTO in memory.
        var files = Directory.GetFiles(Path)
            .Select(p => MapToDto(p))
            .ToList();
        return _processingService.Process(files);
    }
}
Note that here we act in a naive fashion and extract all files via Directory.GetFiles(Path) in one take. However, once you run this code via
await new Worker().DoWork()
you'll notice that the results are far from satisfying and that the application consumes memory extensively.
Directory.EnumerateFiles to the rescue
The thing with Directory.EnumerateFiles is that it returns IEnumerable<string> rather than the fully populated string[] that Directory.GetFiles returns, thus allowing us to fetch collection items one by one. This in turn spares us the excessive memory use that comes with loading huge amounts of data at once.
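To make the difference concrete, here is a minimal sketch that takes the first few paths from a directory in both ways. The class and method names (EnumerationDemo, FirstEager, FirstLazy) are made up for illustration and are not part of the original solution.

```csharp
using System.IO;
using System.Linq;

public static class EnumerationDemo
{
    // Eager: GetFiles builds a string[] holding every path in the directory
    // before Take gets to see the first item.
    public static string[] FirstEager(string directory, int take) =>
        Directory.GetFiles(directory).Take(take).ToArray();

    // Lazy: EnumerateFiles produces paths as the sequence is iterated,
    // so Take stops the enumeration after `take` items.
    public static string[] FirstLazy(string directory, int take) =>
        Directory.EnumerateFiles(directory).Take(take).ToArray();
}
```

Both methods return the same number of paths; the difference is that the eager variant has to materialize the whole directory listing first.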
Still, as you may have noticed, FileProcessingService.Process has a delay coded into it (we emulate some I/O operation with a simple delay). In a real-world scenario, this might be a call to an external HTTP endpoint or work with storage. This brings us to the conclusion that calling FileProcessingService.Process 200,000 times, once per file, would be inefficient: at 20 ms per call, the delays alone add up to about 4,000 seconds, which is more than an hour. That's why we're going to load reasonably sized batches of data into memory instead.
The reworked code looks as follows:
public class WorkerImproved
{
    //omitted for brevity

    public async Task DoWork()
    {
        const int batchSize = 10000;
        var files = Directory.EnumerateFiles(Path);
        var filesToProcess = new List<FileProcessingDto>(batchSize);

        foreach (var file in files)
        {
            filesToProcess.Add(MapToDto(file));

            // The batch is full: process it and reuse the list for the next one.
            if (filesToProcess.Count == batchSize)
            {
                await _processingService.Process(filesToProcess);
                filesToProcess.Clear();
            }
        }

        // Flush the last, possibly incomplete, batch.
        if (filesToProcess.Count > 0)
        {
            await _processingService.Process(filesToProcess);
        }
    }
}
Here we enumerate the collection with foreach, and once we reach the batch size we process the batch and flush the collection. The only interesting point is that we have to call the service one last time after exiting the loop in order to flush the remaining items.
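The same pattern can also be written without manual bookkeeping. The sketch below assumes .NET 6 or later, where Enumerable.Chunk is available; the helper name ProcessInBatches and its handler parameter are hypothetical and only serve the illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

public static class BatchingSketch
{
    // Splits a lazy source into arrays of at most batchSize items and awaits
    // the handler once per batch. The final batch may be smaller.
    // Returns the number of batches that were handled.
    public static async Task<int> ProcessInBatches<T>(
        IEnumerable<T> source,
        int batchSize,
        Func<IReadOnlyCollection<T>, Task> handler)
    {
        var batches = 0;

        // Enumerable.Chunk (.NET 6+) yields each array as soon as it is full,
        // so the source is never materialized as a whole.
        foreach (T[] batch in source.Chunk(batchSize))
        {
            await handler(batch);
            batches++;
        }

        return batches;
    }
}
```

With such a helper, DoWork could pass Directory.EnumerateFiles(Path).Select(MapToDto) as the source and batch => _processingService.Process(batch) as the handler. Unlike the loop above, each batch is a freshly allocated array, which trades a little extra allocation for simpler code.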
Evaluation
The results produced by BenchmarkDotNet are pretty convincing.
A few words on batch processing
In this article we took a glance at a common pattern in software engineering. Reasonably sized batches help us beat both the I/O penalty of working item by item and the excessive memory consumption of loading all items into memory at once.
As a rule, you should strive to use batch APIs when doing I/O operations for multiple items. And once the number of items becomes high, you should think about splitting them into batches.
A few words on return types
Quite often when dealing with codebases I see code similar to the following:
public IEnumerable<int> Numbers => new List<int> { 1, 2, 3 };
I would argue that this code violates Postel's principle ("be conservative in what you send, be liberal in what you accept"). As a consequence, as a consumer of the property I can't figure out whether I can enumerate the items one by one or whether they are all loaded into memory at once.
This is why I suggest being more specific about the return type, i.e.
public IList<int> Numbers => new List<int> { 1, 2, 3 };
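As an illustration of why the declared type matters to the consumer, consider the following sketch. The NumbersSource class and its members are hypothetical; the point is only that an IEnumerable<int> may hide a deferred iterator, while IList<int> promises an in-memory collection.

```csharp
using System.Collections.Generic;

public class NumbersSource
{
    // Counts how many times the lazy sequence has been started.
    public int Evaluations { get; private set; }

    // Declared as IEnumerable<int>: the caller cannot tell that this is a
    // deferred iterator whose body runs again on every enumeration.
    public IEnumerable<int> Lazy()
    {
        Evaluations++;
        yield return 1;
        yield return 2;
        yield return 3;
    }

    // Declared as IList<int>: the type itself tells the caller that the items
    // are already in memory, indexable, and have a known Count.
    public IList<int> Materialized { get; } = new List<int> { 1, 2, 3 };
}
```

Enumerating the result of Lazy() twice runs the iterator body twice, which could mean repeating expensive work if the body did I/O. A consumer who only sees IEnumerable<int> has to guess whether that is the case.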
Conclusion
Batching is a nice technique that allows you to handle large amounts of data gracefully. Directory.EnumerateFiles is the API that lets you organize batch processing for a directory with a large number of files.