Logs monitoring with Loki, Node.js and Fastify.js

Hello 👋

Over the past few months, I've been spending a lot of time creating dashboards on Grafana using Loki for MyUnisoft (the company I work for).

So I decided to write you an article to explain my adventure from the point of view of a back-end Node.js developer.

👀 Why grafana and loki?

For context, the tool was implemented by a DevOps employee who left the company two years ago. So I didn't really choose the tool myself. I had no previous experience with either of these tools, so I had to experiment and discover as I went along.

Grafana is pretty well-known and mature, so I wasn't worried. On the other side, I quickly liked Loki to manage and search my logs. So I thought, "I've got to learn to make dashboards out of this".

📃 Writing good logs

💬 Several examples of logs are multiline to avoid scrolling and simplify reading.

When you write logs, you must constantly think about the information they will provide and make sure they can be analyzed without too much difficulty 😵. It always takes several iterations to get a good result.

In my experience, you can find two kinds of logs:

The debug logs you know are useless to your dashboards. They'll come in handy in case of bugs or when you're looking for more context about a given request.
All other logs that can provide exploitable data for dashboards (and for a human reader)

Here an example of a debug log we print when our services start (quite useful to customize the --max-old-space-size flag).

const availableHeapSize = prettyBytes(
  v8.getHeapStatistics().total_available_size
);
server.log.info(`Total available heap size: ${availableHeapSize}`);

Another common example is logging a JSON payload (you could exploit the data, of course, but that's often not the end goal).

this.entry.once("payloadReady", (payload) => {
  this.log("payload before publishing:");
  this.log(JSON.stringify(payload, null, 2));
});

😱 Be sure to redact the data (you may divulge confidential information without realizing it).

Most of the time, you won't get anything out of these logs, and they don't require any particular effort. In this article, I'm going to concentrate on the logs that we can exploit for our dashboards.

🐤 Practical example

Here's an example of bad logs we initially set up for a document upload middleware/plugin

Successfully uploaded '...'
//
Failed to upload '...'

There are several problems:

No scope (we can't easily search for all the logs related to our middleware).
Can easily be mixed with other logs (from other middlewares or services).
It can be quite a challenge to parse the state (failed or successful).
Lack information about the request or the user.

A better way of writing it:

[uploader|req-x5d4] document 'name.jpg' uploaded (state: ok)

You can easily extend the log with other information such as extension, size, runtime, etc. (without breaking the Loki pattern or regexp).

(state: ok|ext: .jpg|size: 52.5 kB|upload-time: 0.503 ms)

⚠️ Always use the same unit when logging size (bytes) or time (milliseconds). Setting the right 'unit' on Grafana will do the work of cleaning the value for you.

📜 Format clarification

A good log has to be structured to allow good searches with LogQL. Here's a bad example where it will be really difficult to extract information.

hello-world.jpg 52.5 kB|0.503 ms

I would recommend adding a label before each value (it's also easier for a human to read). Having start, end, and separator characters can also be a great help.

doc 'hello-world.jpg' [size: 52.5 kB|exec: 0.503 ms]

Here the required RegExp to parse all labels with Loki

doc '(?P<doc>\S+)' \[size: (?P<size>\S+) kB|exec: (?P<exec>\S+) ms\]

Loki 2.0 also allows you to retrieve labels with a simplified syntax called pattern.

pattern `doc '<doc>' [size: <size> kB|exec: <exec> ms]`

🔎 Implementation in the framework/code

⚠️ code examples are simplified/truncated

The Fastify framework includes the Pino logger by default (a really great logger with lots of cool features that doesn't compromise on performance). The framework itself allows a lot of really cool stuff, like controlling the level of logs at runtime.

On my team, we have chosen to customize the default request and response logs to include additional information. To achieve this, you need:

setup fastify constructor option disableRequestLogging to true
exploit two hooks (onRequest and onResponse).

server.decorateRequest("standardLog", null);
server.addHook("onRequest", async(request: FastifyRequest) => {
  request.log.info(
    `(${request.id}) receiving request ...`
  );
  request.standardLog = standardLog.bind(request);
});

server.addHook("onResponse", async(request: FastifyRequest) => {
  request.log.info(
    request.standardLog(
      `response returned "${request.method} ${request.raw.url}"
    )
  );
});

The standardLog decorator allows us to display information about the request and the token in use (for authenticated endpoints).

We have to handle several types of tokens (it all depends on who is consuming the API). A classical user or a partner through our partner API).

Each of them includes:

The postgreSQL schema of the customer
The user's or third-party's ID
The accounting folder (which can be null)

function standardLog(
  this: FastifyRequest,
  msg: string
): string {
  if (
    this.server.hasRequestDecorator("tokenInfo") &&
    this.tokenInfo !== null) {
    const { tokenInfo } = this;
    let tokenInfoLog = `${tokenInfo.type}`;

    switch (tokenInfo.type) {
      case "api":
        tokenInfoLog += `|s:${...}|t:${...}|acf:${...}`;
        break;
      case "user":
        tokenInfoLog += `|s:${...}|p:${...}|acf:${...}`;
        break;
      default:
        tokenInfoLog = "";
        break;
    }

    return `(${this.id}|${tokenInfoLog}) ${msg}`;
  }

  return `(${this.id}|none) ${msg}`;
}

Here's what it looks like in the logs

(req-1rti) receiving request "POST /ged/base-docs/docs"
(req-1rti|user|s:538|p:1|acf:23) response returned
"POST /ged/base-docs/docs", statusCode: 201 (460.106ms)

These informations will be critical to populating our dashboards with information about customers and the scope of actions.

🌀 Logs ingestion

My team currently uses promtail with a YML configuration to retrieve service logs.

server:
  http_listen_port: 9080
positions:
  filename: /var/lib/promtail/positions.yml
clients:
  - url: https://xxx.fr/loki/api/v1/push

scrape_configs:
  - job_name: svc_api_dev
    static_configs:
    - labels:
        __path__: /home/xxx/logs/service-dev-*.log
        app: api
        env: dev
        host: xxx.fr
        job: svc_api_dev
      targets:
      - localhost

I'm not going to spend too much time on this chapter, as you can find a lot of tutorials on the Internet on how to set up and configure promtail.

💡 If you don't want to depend on promtail for sending the logs to Loki, you can use the fastify plugin pino-loki

📊 OCR Dashboard

MyUnisoft is a French accounting editor, so Optical Character Recognition is an important feature of our software. Monitoring is essential to adapting to the needs of our customers and responding promptly to incidents.

👀😍 We won the OCR silver medal 2023 awarded by Le Monde du Chiffre.

Here is an example of a dashboard we have built just by using the following log:

(req-2987|user|s:24|p:5710|acf:7398) OCR xxx.jpg
[type: invoice|ext: .jpg|size: 2925.73 kB]

Group by label

We can construct the above graph with the following logQL. Notice the sum by extension.

sum(
  count_over_time(
    {app="ocr",env="$env"}
      |= "OCR"
      |= "|ext:"
      | regexp `\|ext: \.(?P<extension>\S+)\|`
  [$__range])
) by (extension)

RegExp are built using Golang syntax (I personnaly use regex101.com to test them). It is useful here to extract/detect the extension label.

We can use the syntax $env to inject a dashboard variable (here I can view my dashboard in production, staging or dev just by moving a value in a select).

I also use the built-in $__range variable, which loads data over the currently selected range in Grafana.

Unwrap

It took me a long time to really understand unwrap. There is virtually no clear documentation or explanation on the Internet!

Here it will use the size label and extract all values to compute them with min_over_time and min functions. This is useful for extracting numerical values (counter, execution time, etc.).

min(
  min_over_time(
    {app="ocr",env="$env"}
      |= "OCR"
      | regexp `\|size: (?P<size>[.0-9]+) kB\]`
      | unwrap size
      | __error__ = ""
  [$__range])
) by (app)

💬 | __error__ = "" avoid crashing Loki with unexpected Error (could happen if unwrap of size fail for any reasons).

Then, on the right side of the stat graph, we select the corresponding unit.

💡 Tips and tricks

display name

For a long time, I had some really bad raw labels in my charts (with a format similar to JSON).

You can customize it by editing the display name option. You can use variables in this field to retrieve label values directly.

no data

Sometimes the graphs will display "no data" because no logs have been detected in the selected range. This can be problematic in certain graphics (such as gauges).

But maybe you'd rather have zero. No problem, just use the No value option.

filter using a regexp and variable

On some dashboards, you will probably want to filter dynamically according to several criteria. One way of doing this is to use a regexp and a dashboard variable.

This is what I do in one of my dashboards to filter the results for one or more partners.

count(
  sum(
    count_over_time(
      {app="api",env="production"}
        |= "] CALL"
        |= "$endpoint"
        |~ `\]\[$thirdparty\]`
        | regexp `\((?P<schemaId>[0-9]+):(?P<folderId>[0-9]+)\)`
    [$__range])
  ) by (schemaId, folderId)
)

Here's the line that does the job: you need to use backticks syntax to be able to inject a variable in it.

|~ `\]\[$thirdparty\]`

🚀 Data is the key to success

As developers, we don't realize enough how monitoring and the data it generates can be powerful sources of improvement. At MyUnisoft, we work with over a hundred partners.

Understanding how they use our APIs and the various anomalies they experience has become essential for us to continue to grow exponentially.

This allows us to continually improve to provide a better experience for both our accounting customers and our developer partners who maintain the integrations.

Our ability to materialize and exploit this data has become an essential part of my team's expertise. It's really exciting to see what can be achieved with simple logs.

👋 Conclusion

I'm so glad I finally managed to write this article (and I hope it brought you some value). Many thanks to my team and especially to Cédric, without whom I would not have realized all that.

I still struggle with a lot of Grafana features, such as transformations or alerts, but I will continue to dig and improve.

🙏 Thanks for reading me 🙏

Blog