Bruce Du
Posted on May 8, 2024
0 Background
Just in the last year, the department platform project I was responsible for released a new version. This version also launched a new function, which is similar to a scheduled task. Everything was ok on the first day, but on the second day there were very few tasks that were not executed as expected (paused tasks continued to be executed, but stared tasks were not executed).
The first reaction of myself and another colleagues was that there were some issues with the logic of scheduled task execution. But after we spent a lot of time debugging and testing, we found that the root of the problem was not the functional logic, but a piece of underlying public code that had been online for a year and had not been touched. The core of this code is the protagonist of this article gob
, and the root cause of the problem is a feature of the go
language: zero value.
I will use a more simplified example to describe this BUG later.
1 gob
and the Zero Value
Let’s briefly introduce gob
and zero value first.
1.1 Zero Value
The zero value is a feature of the Go language. Simply put, the Go language will provide a default value for some variables which have not been assigned a value yet.
For example, the following code:
package main
import (
"fmt"
)
type task struct {
name string
state int
}
func main() {
t := task{}
var list []byte
var f float32
var s string
var m map[string]int
fmt.Println(list, f, s, m)
fmt.Printf("%+v", t)
}
/* Output:
[] 0 map[]
{name: state:0}
*/
As you see, all variables have been assigned a default value, those default values are the zero value.
The zero value does bring convenience to developers in many cases, but there are also many people who don't like it who think that the existence of the zero value makes the code grammatically lax and brings some uncertainty. For example, the problem I will describe in detail later.
1.2 gob
gob
is a standard library in encoding/gob
. Gob is actually the abbreviation of go binary, so from its name we can also guess that gob
should be related to binary.
In fact, gob
is a unique format of Golang for serializing and deserializing program data in binary form, similar to pickle
in Python. Its most common usage is to serialize an object (structure) and store it in a disk file, then read the file and deserialize it when needed to achieve object persistence.
I won’t give examples, and this article is not a special topic on the use of gob
. This is its Official Document. Readers who are not familiar with gob
usage can take a look at the Example
part of the document, or just look at the codes I use to describe the problem later.
2 Problem
2.1 Requirement
At the beginning of this article, I briefly described the origin of the problem. Here I use a simpler model to expand the description.
First we define a structure named task
:
type task struct {
// Just as json library, the fields must be public could be serialized
Name string
State int // Task state, it's only 2 states for simplistic, 1: running, 0: stopped
}
Around this structure, we will enter several tasks data, each task is a task
object. But for some reasons, we must use gob
to persist these task information to the local disk instead of using a database like MySQL.
Next, we have such a requirement:
Iterate and deserialize the locally stored gob
files, then count how many tasks are running, and how many of them are stopped.
2.2 Code
Based on the above requirements and background, the code is as follows:
- defines.go
package main
// Directory where .gob files saved
const DIR = "./tasks"
type task struct {
// Just as json library, the fields must be public could be serialized
ID int
Name string
State int // Task state, it's only 2 states for simplistic, 1: running, 0: stopped
}
// Test tasks which will be serialized and saved into .gob files.
// We define 2 running tasks and 3 stopped tasks here.
var tasks = []task{
{0, "StoppedTask1", 0},
{1, "RunningTask1", 1},
{2, "RunningTask2", 1},
{3, "StoppedTask2", 0},
{4, "StoppedTask3", 0},
}
- serializer.go
package main
import (
"bytes"
"encoding/gob"
"fmt"
"os"
"path/filepath"
)
// serialize serializes person objects and save to files,
// the files are ./persons/${t.Name}.gob
func serialize(t task) {
filename := filepath.Join(DIR, fmt.Sprintf("%s.gob", t.Name))
buffer := new(bytes.Buffer)
encoder := gob.NewEncoder(buffer)
_ = encoder.Encode(t)
_ = os.WriteFile(filename, buffer.Bytes(), 0o644)
}
// deserialize deserializes .gob files to task objects
func deserialize(path string, t *task) {
raw, _ := os.ReadFile(path)
buffer := bytes.NewBuffer(raw)
decoder := gob.NewDecoder(buffer)
_ = decoder.Decode(t)
}
- main.go
package main
import (
"fmt"
"path/filepath"
)
func main() {
storeTasks()
countStates()
}
func storeTasks() {
for _, p := range tasks {
serialize(p)
}
}
func countStates() {
counter := make(map[int]int)
// Use a temporary pointer as a carrier of objects in the file
// to save the cost of creating new objects.
tmpT := &task{}
for _, t := range tasks {
filename := filepath.Join(DIR, fmt.Sprintf("%s.gob", t.Name))
// Deserialize task into tmpT
deserialize(filename, tmpT)
// Count the state
counter[tmpT.State]++
}
fmt.Printf("Stopped: %+v, Running: %+v\n", counter[0], counter[1])
}
After running code, I got the result below:
Stopped: 1, Running: 4
What? 1 stopped task and 4 running tasks? Bug occurred. The result is different with our test data: 3 stopped tasks and 2 running tasks. So, what went wrong?
2.3 Debug
Let’s add a print statement line in for loop of main.go/countStates()
function to see what we could get:
func countStates() {
// ...
for _, t := range tasks {
// ...
deserialize(filename, tmpT)
// Add this line
fmt.Printf("%+v\n", *tmpT)
// ...
}
// ...
}
Output:
{ID:0 Name:StoppedTask1 State:0}
{ID:1 Name:RunningTask1 State:1}
{ID:2 Name:RunningTask2 State:1}
{ID:3 Name:StoppedTask2 State:1}
{ID:4 Name:StoppedTask3 State:1}
Oh man, StoppedTask2 and StoppedTask3 are both running! But the amazing thing is that except for State, all other data are valid! Seeing this result, if you often deal with configuration files such as JSON and Yml just like me, you may take it for granted that the above gob files are read correctly, and there should be a storage problem.
But gob files are binary files, and it is difficult for us to verify them with eyes like JSON files. Even if you use tools like xxd
or something, you can only get such ambiguous output:
> xxd .\tasks\RunningTask1.gob
00000000: 2b7f 0301 0104 7461 736b 01ff 8000 0103 +.....task......
00000010: 0102 4944 0104 0001 044e 616d 6501 0c00 ..ID.....Name...
00000020: 0105 5374 6174 6501 0400 0000 15ff 8001 ..State.........
00000030: 0201 0c52 756e 6e69 6e67 5461 736b 3101 ...RunningTask1.
00000040: 0200 ..
> xxd .\tasks\StoppedTask2.gob
00000000: 2b7f 0301 0104 7461 736b 01ff 8000 0103 +.....task......
00000010: 0102 4944 0104 0001 044e 616d 6501 0c00 ..ID.....Name...
00000020: 0105 5374 6174 6501 0400 0000 13ff 8001 ..State.........
00000030: 0601 0c53 746f 7070 6564 5461 736b 3200 ...StoppedTask2.
Maybe we can try to forcely parse these binary files to compare the differences between them; or deserialize two identical objects except State into gob files, and then compare them. If you are interested, you can try it. At that time, due to time constraints and other reasons, we did not try this approach. We continue to modify the data to test.
2.4 Find patterns
Observing the results of the previous error, we found that the error data were all StoppedTask. Programmer's intuition tells me that this may not be a coincidence. So I tried to change the order of the data, completely separate RunningTasks and StoppedTasks, and then tested:
// First group, stopped first.
var tasks = []task{
{0, "StoppedTask1", 0},
{3, "StoppedTask2", 0},
{4, "StoppedTask3", 0},
{1, "RunningTask1", 1},
{2, "RunningTask2", 1},
}
// Output
{ID:0 Name:StoppedTask1 State:0}
{ID:3 Name:StoppedTask2 State:0}
{ID:4 Name:StoppedTask3 State:0}
{ID:1 Name:RunningTask1 State:1}
{ID:2 Name:RunningTask2 State:1}
Stopped: 3, Running: 2
// Second group, running first.
var tasks = []task{
{1, "RunningTask1", 1},
{2, "RunningTask2", 1},
{0, "StoppedTask1", 0},
{3, "StoppedTask2", 0},
{4, "StoppedTask3", 0},
}
// Output
{ID:1 Name:RunningTask1 State:1}
{ID:2 Name:RunningTask2 State:1}
{ID:2 Name:StoppedTask1 State:1}
{ID:3 Name:StoppedTask2 State:1}
{ID:4 Name:StoppedTask3 State:1}
Stopped: 0, Running: 5
The weird phenomenon appears, when StoppedTask first, the result is all right; when RunningTask first, RunningTasks are correct, StoppedTasks are all wrong, and even the ID of StoppedTask1, which is originally 0, has become 2 here!
After repeated testing and observation of the result set, we had come to the conclusion that all the RunningTasks data is fine, and all the problems are with the StoppedTasks data!
To further formalize this conclusion: If the preceding data is a non-zero number, and the following data is a zero, then the following zero will be overwritten by the non-zero in front of it.
3 Explanation
Reviewing the program code again, I noticed this sentence:
// Use a temporary pointer as a carrier of objects in the file
// to save the cost of creating new objects.
tmpT := &task{}
In order to save the extra overhead of creating a new object, I used the same variable to recursively load the data in the file.
Combined with the bug pattern we found earlier, the answer seems to be close at hand: the so-called 0's are being overwritten by non-zero's in front of them, most likely because the same object is being used to load the file, which results in residual data in front of it.
The validation is also simple, just put that public object into the following for
loop so that each loop recreates a new object for loading the file data to cut off the effects of the previous data.
Let's modify the code:
for _, t := range tasks {
tmpT := &task{}
// ...skip
}
// Output
{ID:1 Name:RunningTask1 State:1}
{ID:2 Name:RunningTask2 State:1}
{ID:0 Name:StoppedTask1 State:0}
{ID:3 Name:StoppedTask2 State:0}
{ID:4 Name:StoppedTask3 State:0}
Stopped: 3, Running: 2
Yes! It’s right!
The result is indeed as we thought, it is the reason of data residuals. But here's another question: Why is it that the old method reads everything correctly when there are 0s and then non-0s (Stopped first then Running tasks)? And why are all other numbers (e.g. non-zero IDs) not affected, except for the 0's?
All questions now seem to point to the special number 0!
It was only at this point that the zero-value property finally caught up with us. So I rushed to read the official documentation of the gob
library and found this sentence:
If a field has the zero value for its type (except for arrays; see above), it is omitted from the transmission.
The words before and after this sentence are talking about struct
, so field
here also refers to a field in a structure, in line with the example in our text.
Based on the conclusions we got earlier, and the official documentation, we can now finally come to a complete conclusion:
The gob
library ignores zero values outside of arrays when manipulating data. Whereas our code starts out using a public object to load the file data, since zero values are not transmitted, the fields in the original data that are zero-valued are not read, and we actually see the data from the last non-zero-valued object.
For the example in this article, the solution is also very simple, is what I did above, do not use a public object to load data.
4 Recall
In the project bug I described at the beginning of this article, I also used 0 and 1 to indicate the state of a timed task (stopped, running). As a result, different tasks were interfering with each other due to the zero value issue, which caused the task to execute abnormally, while the other fields that did not involve the zero value were fine. Although this is an online production environment, fortunately, the problem was found early and handled in time, and did not cause any production accidents. However, the whole process and the final answer were deeply imprinted in my mind.
Later I briefly discussed with my colleague why gob
chose to ignore the zero value. From my perspective, it was probably to save space. The code we wrote at the beginning also created a public object to save space, and as a result, the two space-saving logics eventually collided to create a hidden bug.
Posted on May 8, 2024
Join Our Newsletter. No Spam, Only the good stuff.
Sign up to receive the latest update from our blog.