Garnet. Does Microsoft's cache -store also store bugs?
Anastasiia Vorobeva
Posted on May 22, 2024
Microsoft has introduced Garnet, an open-source, cross-platform, fast cache-store project written in C#. Let's equip ourselves with a static analyzer to see what bugs and oddities hide in the project source code.
Garnet is a remote cache-store from Microsoft Research. The project offers the following key advantages:
- it adopts the popular RESP protocol;
- it provides better server throughput and scalability with many client sessions;
- it shows low client latency on virtual machines with accelerated networking support;
- due to the use of the .NET technology, Garnet is a cross-platform, extensible, and modern solution.
According to the developers, the project emerged from the FASTER project, which was completed in 2018. Microsoft claims that they're already using Garnet in their infrastructure.
The project is open source, so we can examine the source code up and down to find out what bugs and weird stuff are in there.
For our check, let's take the latest Garnet release (the 1.0.6 version at the time of writing the article) and analyze the source code using the PVS-Studio static analyzer 7.30.
The check results gave me the impression that the project is new (which is fine) and seems to have been done in a hurry. What's sparked my conclusions? The analysis results contain several warnings of the following types:
- unused constructor parameters;
- a method parameter is overwritten before it's used;
- a check for null where the value surely can't be null;
- expressions taken in parentheses several times;
- methods that have different names but the same implementation;
- methods that always return the same value.
We won't go into the analyzer warnings issued for this code, because it's rather boring for the article. The code is most likely just poorly written and contains no real errors. Anyway, it's still worth checking and possibly refactoring, but that's beyond the scope of the article. Of course, developers can check their project themselves and find all such fragments. To do so, they may simply request a trial key.
I'd also like to note that, based on the available information, the project is very complex. The developers had to use a number of tricks to achieve such a performance level:
- a lot of code is written using unsafe;
- Garnet uses its own memory manager to bypass GC.
All in all, the code is a far cry from the usual style of writing C# projects. I feel like it would've been better to use a different language instead of applying a bunch of hacks. Moreover, even though the project is open source, the barrier to entry is so high that the community is unlikely to develop it.
Now that we've talked about the project, let's move on to looking at the errors.
Errors and oddities
NullReferenceException
A potential NullReferenceException is one of the most common issues identified by static analysis. Garnet is no exception. It contains code that is susceptible to such an "affliction". We won't analyze all the warnings, as there are dozens of them, but rather pick the most obvious or interesting ones.
Fragment 1
public (....) GetCustomTransactionProcedure(....)
{
if (sessionTransactionProcMap[id].Item1 == null)
{
var entry = customCommandManager.transactionProcMap[id];
sessionTransactionProcMap[id].Item1 = entry.proc != null ? entry.proc()
: null;
sessionTransactionProcMap[id].Item2 = entry.NumParams;
sessionTransactionProcMap[id].Item1.txnManager = txnManager;
sessionTransactionProcMap[id].Item1.scratchBufferManager =
scratchBufferManager;
}
return sessionTransactionProcMap[id];
}
The PVS-Studio warning:
V3080 Possible null dereference. Consider inspecting 'sessionTransactionProcMap[id].Item1'. CustomCommandManagerSession.cs 28
Depending on the ternary operator condition, a null reference is assigned in the sessionTransactionProcMap[id].Item1 tuple field. In addition, literally the next line later, the field is dereferenced without any check.
Fragment 2
public void Log<TState>(....)
{
var msg = string.Format("[{0:D3}.{1}] ({2}) <{3}> {4}", ....);
lock (this.lockObj)
{
streamWriter?.WriteLine(msg); // <=
//var now = DateTime.UtcNow;
//if(now - lastFlush > flushInterval)
{
//lastFlush = now;
streamWriter.Flush(); // <=
}
}
}
The PVS-Studio warning:
V3125 The 'streamWriter' object was used after it was verified against null. Check lines: 95, 89. FileLoggerProvider.cs 95
The analyzer has detected a case when an object is used after checking for null. If we remove the commented lines, the code looks like this:
....
streamWriter?.WriteLine(msg);
streamWriter.Flush();
....
It looks weird. I think this is an impossible case, and the '?.' operator is redundant. So, there's no error here. However, the analyzer still got it right. It accurately warns us of an anomaly in the code. It's either necessary to write '?.' everywhere (if there's a possible scenario where the reference is null) or just write '.' to make the code simpler and more consistent. Also, other programmers wouldn't have to think over such code when modifying or refactoring it.
Fragment 3
public void Run()
{
PrintClusterConfig();
Console.WriteLine($"Running benchmark using {opts.Client} client type");
InitClients(clusterConfig?.Nodes.ToArray()); // <=
Thread[] workers = InitializeThreadWorkers();
}
The PVS-Studio warning:
V3105 The result of null-conditional operator is dereferenced inside the 'InitClients' method. NullReferenceException is possible. Inspect the first argument 'clusterConfig?.Nodes.ToArray()'. ShardedRespOnlineBench.cs 414
The clusterConfig?.Nodes.ToArray() expression is used as an argument of the InitClients method. As a result, null may be passed to the method. Let's look at the code of the InitClients method:
private void InitClients(ClusterNode[] nodes)
{
switch (opts.Client)
{
case ClientType.GarnetClient:
gclient = new GarnetClient[nodes.Length]; // <=
....
....
}
}
As we can see, in the first case of the switch operator, the parameter is dereferenced without checking for null.
Fragment 4
Usually, we don't include errors found in the code of tests in our articles, but this time I decided to pick some of them. The thing is that, according to the comments, the code triggers the built-in stress test:
public EmbeddedPerformanceTest(...., Options opts, ....)
{
this.server = server;
this.opts = opts;
logger = loggerFactory.CreateLogger("EmbeddedBench");
opPercent = opts.OpPercent?.ToArray(); // <=
opWorkload = opts.OpWorkload?.ToArray(); // <=
if (opPercent.Length != opWorkload.Length) // <=
throw new Exception($"....");
....
}
The PVS-Studio warnings:
- V3105 The 'opPercent' variable was used after it was assigned through null-conditional operator. NullReferenceException is possible. EmbeddedPerformanceTest.cs 58
- V3105 The 'opWorkload' variable was used after it was assigned through null-conditional operator. NullReferenceException is possible. EmbeddedPerformanceTest.cs 58
The value obtained with the '?.' operator is assigned to the opPercent and opWorkload fields again. Then, in the next line, these fields are used without checking for null.
Fragment 5
public IEnumerable<long> GetPendingRequests()
{
foreach (var kvp in ctx.prevCtx?.ioPendingRequests)
yield return kvp.Value.serialNum;
foreach (var kvp in ctx.ioPendingRequests)
yield return kvp.Value.serialNum;
}
The PVS-Studio warning:
V3153 Enumerating the result of null-conditional access operator can lead to NullReferenceException. Consider inspecting: ctx.prevCtx?.ioPendingRequests. ClientSession.cs 748
Developers used the '?.' operator in this code fragment. Apparently, the programmer implied that prevCtx could have the null value. However, the thing is that foreach doesn't work with null. So, the exception still occurs, but when the GetEnumerator method is called.
As we've written many times in other articles, this isn't obvious to many developers. We have a whole article on this topic: "The ?. operator in foreach will not protect from NullReferenceException".
Such errors aren't critical in C# and are quickly detected if the reference becomes null (unlike in C++). Perhaps a null reference is rare or impossible in such code, so this code contains such errors for a long time and in large numbers. However, it causes problems over time. A program crash due to a NullReferenceException can turn into a frustrating and time-consuming search for the issue. This is especially true if the error occurs only on the client's side, and it's very difficult or even impossible to reproduce the error.
Ternary operator issues
I don't know why, but a lot of erroneous code in this project involves the ternary operator. Another error related to it was in the NullReferenceException section. Let's look at some examples.
Fragment 1
internal static void DoInternalLock<....>(....)
{
OperationStatus status;
if (cancellationToken.IsCancellationRequested)
status = OperationStatus.CANCELED;
else
{
status = DoInternalLock(clientSession, key);
if (status == OperationStatus.SUCCESS)
continue; // Success; continue to the next key.
}
var unlockIdx = keyIdx - (status == OperationStatus.SUCCESS ? 0 : 1);
}
The PVS-Studio warning:
V3022 Expression 'status == OperationStatus.SUCCESS' is always false. LockableContext.cs 133
In such a case, the analyzer shows that the condition of the ternary operator is always false. The status local variable is assigned a value within if. Let's look at two possible scenarios:
- In the then branch, OperationStatus.CANCELED *is assigned to status*;
- In the else branch, the return value of the DoInternalLock method is assigned to* status*. If after that status equals to OperationStatus.SUCCESS, then continue is executed.
So, at the moment of the ternary operator execution, status is never equal to OperationStatus.SUCCESS, and the result of the ternary operator is always 0.
Fragment 2
public class FileLoggerOutput : IDisposable
{
private StreamWriter streamWriter;
private readonly object lockObj = new object();
private readonly TimeSpan flushInterval =
Debugger.IsAttached ?
TimeSpan.FromMilliseconds(10) :
TimeSpan.FromMilliseconds(10);
private DateTime lastFlush = DateTime.UtcNow;
....
}
The PVS-Studio warning:
V3012 The '?:' operator, regardless of its conditional expression, always returns one and the same value: TimeSpan.FromMilliseconds(10). FileLoggerProvider.cs 43
Developers use the ternary operator to assign different values to the flushInterval field depending on whether the debugger is connected to the process or not. Because of the error, the value is the same regardless of the condition.
You can find the following entry in one of the class constructors:
public FileLoggerOutput(string filename, int flushInterval = default)
{
this.flushInterval = flushInterval == default ?
Debugger.IsAttached ? TimeSpan.FromMilliseconds(10)
: TimeSpan.FromSeconds(1)
: TimeSpan.FromMilliseconds(flushInterval);
....
}
Most likely, this is the proper use of the ternary operator.
Fragment 3 (fixed)
Initially, I checked Garnet v1.0.5. In the new version (1.0.6), one of the errors has disappeared:
private async Task<bool> SendFailoverAuthReq(long requestedEpoch)
{
int firstClientIndex = option == FailoverOption.FORCE ? 1 : 0;
....
int majority = (clients.Length - firstClientIndex / 2) + 1;
....
}
The PVS-Studio warning:
V3157 The 'firstClientIndex / 2' expression evaluates to 0 because absolute value of the left operand is less than the right operand. ReplicaFailoverSession.cs 85
In this case, the analyzer understands that the firstClientIndex local variable is assigned 1 or 0. Both numbers are less than 2, so the result of the division is always 0.
Now this PVS-Studio warning is gone, as well as the method :) The code fragment still managed to get into one of the releases. Yes, the developers fixed it, but only after the users got it. However, if the developers had used static code analysis at an earlier stage, they could've avoided this error in the release. This shows how important it is to use such tools at the development stage to catch potential issues before they end up in the release version of the program.
Other errors
Fragment 1
public ClusterConfig InitializeLocalWorker(....)
{
Worker[] newWorkers = new Worker[workers.Length];
Array.Copy(workers, newWorkers, workers.Length);
newWorkers[1].address = address;
newWorkers[1].port = port;
newWorkers[1].nodeid = nodeId;
newWorkers[1].configEpoch = configEpoch;
newWorkers[1].lastVotedConfigEpoch = currentConfigEpoch; // <=
newWorkers[1].lastVotedConfigEpoch = lastVotedConfigEpoch; // <=
newWorkers[1].role = role;
newWorkers[1].replicaOfNodeId = replicaOfNodeId;
newWorkers[1].replicationOffset = 0;
newWorkers[1].hostname = hostname;
return new ClusterConfig(slotMap, newWorkers);
}
The PVS-Studio warning:
V3127 Two similar code fragments were found. Perhaps, this is a typo and 'currentConfigEpoch' variable should be used instead of 'lastVotedConfigEpoch' ClusterConfig.cs 116
The copy-paste error is a frequent guest in our articles. Static analyzers are good at finding typos and copy-paste errors. In this example, the developer writes the value to lastVotedConfigEpoch twice. Most likely, the CurrentConfigEpoch property should've been used here instead.
Fragment 2
public long GetConfigEpochFromSlot(int slot)
{
if (slotMap[slot].workerId < 0)
return 0;
return workers[slotMap[slot].workerId].configEpoch;
}
The PVS-Studio warning:
V3022 Expression 'slotMap[slot].workerId < 0' is always false. Unsigned type value is always >= 0. ClusterConfig.cs 460
Here, the analyzer shows that the if statement condition is always false. The thing is that the workerId property is of the ushort type. The possible range is 0 to 65535. This means that workerId can't be less than 0. The check doesn't make any sense. It should either be deleted or look different, for example:
if (slotMap[slot].workerId == 0)
return 0;
Conclusion
The project gets frequent updates, and of course, bugs will be fixed eventually. Even one of the examples shown in the article has already been fixed. However, as I've written before, the community is skeptical of such a coding style in C#. It seems that C# doesn't fit for writing most of the project code using unsafe and custom memory manager. I can only wish the project success in its development and will contribute to this by reporting all bugs found to the developers via issues on GitHub.
Posted on May 22, 2024
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.