Observability with Grafana Cloud and OpenTelemetry in .NET microservices

People say that the application development lifecycle consists of three steps:

make it work
make it right ← we’re here
make it fast

Suppose you’re developing a microservice. It might communicate over REST, gRPC, Kafka, or something else. You’ve completed all functional and non-functional requirements, including authentication, authorization, and validation; your application is secure, scalable, and solves a business problem.

In this article, you’ll find out how to make your app production-ready whether it is cloud native and hosted in Kubernetes or more traditional without using containers.

We will cover popular tools and frameworks aiming to solve common needs for every application without reinventing the wheel:

Grafana Cloud (Prometheus, Grafana, Loki, Tempo),
OpenTelemetry,
Serilog

If you haven’t heard of them, no worries: the article is suitable for any experience level. You can find a fully working demo project on GitHub.

Observability/Monitoring

Observability means collecting data that explains the state of your application. In production environments, it is critical to know how your application behaves. An Nginx blog post boils it down to three simple questions:

Metrics – “Is there a problem?”
Traces – “Where is the problem?”
Logs – “What is the problem?”

Plenty of good tools can monitor your whole system and answer all the questions above. The problem we face nowadays isn’t finding a tool that answers them, but choosing among too many.

This article can help you make the right decision and save time and money for your business. The described approach works best for start-ups and small companies.

From a high-level perspective, there are 2 popular groups:

SaaS: Dynatrace, Logz.io, Datadog, Honeycomb, New Relic
Open source, self-hosted: Prometheus + Grafana and Elasticsearch + Kibana.

Today we’re talking about Grafana Cloud (Prometheus for metrics, Loki for logs, Tempo for traces), a SaaS product. Its free plan includes:

10,000 series for Prometheus or Graphite metrics
50 GB of logs
50 GB of traces

It's very generous compared to competitors!

If you exceed these limits, you are free to stay on SaaS or switch to a self-hosted open-source distribution.

If you decide to give it a try, sign up for Grafana Cloud before moving to the next step.

Before we start, these two portal shortcuts will be helpful:

https://grafana.com/orgs/{YouOrganizationName} - Account Management
https://{YouOrganizationName}.grafana.net - Grafana UI

Grafana Agent

Grafana Agent is responsible for delivering metrics and traces from your application to the cloud. We use Grafana Agent for **metrics** and **traces** only. The approach for **logging** will be different.

The data flow is as follows: the application exposes metrics and sends traces to Grafana Agent, which forwards both to Grafana Cloud.

Open https://{YouOrganizationName}.grafana.net, go to the Integrations tab, choose Grafana Agent, and then follow the instructions.

Grafana Agent runs on Windows, macOS, Debian, and Red Hat.

After installation, we need to configure the agent:

If you’re using Windows with the default installation, go to C:\Program Files\Grafana Agent and edit agent-config.yaml. For other cases, check the documentation.


 yaml
metrics:
  configs:
  - name: integrations
    remote_write:
    - basic_auth:
        password: {replaceit}
        username: {replaceit}
      url: {replaceit}
    scrape_configs:
      - job_name: dogs-service
        scrape_interval: 30s
        metrics_path: /metrics/prometheus
        static_configs:
          - targets: ['localhost:5000']
      - job_name: dogs-service-healthchecks
        scrape_interval: 30s
        metrics_path: /health/prometheus
        static_configs:
          - targets: ['localhost:5000']
  global:
    scrape_interval: 60s
  wal_directory: /tmp/grafana-agent-wal

traces:
  configs:
  - name: default
    remote_write:
      - endpoint: {replaceit}
        basic_auth:
          username: {replaceit}
          password: {replaceit}
    receivers:
      otlp:
        protocols:
          grpc:

To get the username, password and URLs, go to https://grafana.com/orgs/{YouOrganizationName} and hit ‘Details’ in Tempo and Prometheus.

Save the file and restart the Grafana Agent service.

With this, Grafana Agent configuration is complete. Now, let’s send some data to Grafana Cloud.

Traces

To understand traces, it’s easiest to look at a picture.

In short, a trace lets you follow one activity through the services of a distributed system.

To demonstrate this, our simple demo project uses three components:

.NET 6 Web API
SQLite database
External Dogs API

To send traces to the monitoring system, we need an instrumentation framework. OpenTelemetry is the standardized and recommended way to implement tracing in applications nowadays. It is supported by most popular observability tools, so the integration will be seamless.

OpenTelemetry is a collection of tools, APIs, and SDKs. Use it to instrument, generate, collect, and export telemetry data (metrics, logs, and traces) to help you analyze your software’s performance and behavior.

We're going to use the OpenTelemetry .NET SDK. Add the following NuGet dependencies to the project:


 xml
<PackageReference Include="OpenTelemetry.Exporter.OpenTelemetryProtocol" Version="1.4.0-alpha.2" />
<PackageReference Include="OpenTelemetry.Exporter.Prometheus.AspNetCore" Version="1.4.0-alpha.2" />
<PackageReference Include="OpenTelemetry.Extensions.Hosting" Version="1.0.0-rc9.6" />
<PackageReference Include="OpenTelemetry.Instrumentation.AspNetCore" Version="1.0.0-rc9.6" />
<PackageReference Include="OpenTelemetry.Instrumentation.EntityFrameworkCore" Version="1.0.0-beta.3" />
<PackageReference Include="OpenTelemetry.Instrumentation.EventCounters" Version="0.1.0-alpha.1" />
<PackageReference Include="OpenTelemetry.Instrumentation.Http" Version="1.0.0-rc9.6" />
<PackageReference Include="OpenTelemetry.Instrumentation.SqlClient" Version="1.0.0-rc9.6" />

Then, in Program.cs configure:


 csharp
using OpenTelemetry.Exporter;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

var builder = WebApplication.CreateBuilder(args);
builder.Services.AddOpenTelemetryTracing(options =>
{
    // Describe the service that emits the spans.
    options.ConfigureResource(resourceBuilder =>
    {
        resourceBuilder.AddService(
            serviceName: builder.Environment.ApplicationName,
            serviceNamespace: builder.Environment.EnvironmentName,
            serviceVersion: builder.Configuration["OpenTelemetry:ApplicationVersion"],
            autoGenerateServiceInstanceId: false,
            serviceInstanceId: Environment.MachineName);
    })
    // Outgoing HTTP calls (e.g. to the external Dogs API).
    .AddHttpClientInstrumentation(instrumentationOptions =>
    {
        instrumentationOptions.RecordException = true;
    })
    // Incoming requests handled by ASP.NET Core.
    .AddAspNetCoreInstrumentation(instrumentationOptions =>
    {
        instrumentationOptions.RecordException = true;
    })
    // Database calls; SetDbStatementForText adds the SQL text to the span.
    .AddSqlClientInstrumentation(instrumentationOptions =>
    {
        instrumentationOptions.RecordException = true;
        instrumentationOptions.SetDbStatementForText = true;
    })
    .AddEntityFrameworkCoreInstrumentation(instrumentationOptions =>
    {
        instrumentationOptions.SetDbStatementForText = true;
    })
    // Send spans to Grafana Agent over OTLP/gRPC.
    .AddOtlpExporter(opt =>
    {
        opt.Protocol = OtlpExportProtocol.Grpc;
        opt.Endpoint = new Uri(builder.Configuration["OpenTelemetry:Exporter:Otlp:Endpoint"]);
    });
});

‘OpenTelemetry:Exporter:Otlp:Endpoint’ comes from appsettings.json:


 json

  "OpenTelemetry": {
    "ApplicationVersion": "1.0.0", 
    "Exporter": {
      "Otlp": {
        "Endpoint": "http://localhost:4317"
      }
    }
  }

where http://localhost:4317 is an endpoint of the Grafana Agent we installed in the previous step.
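Because `new Uri(...)` throws when the setting is missing, it can help to resolve the endpoint through a small helper that falls back to the local agent address. The following is only a sketch: `OtlpEndpointResolver` is a hypothetical class that is not part of the demo project, and the default value simply mirrors the address used in this article.

```csharp
using System;

// Hypothetical helper (not part of the demo project): resolves the OTLP endpoint
// from a configuration value, falling back to the local Grafana Agent address.
public static class OtlpEndpointResolver
{
    // Default OTLP/gRPC address used throughout this article.
    public const string DefaultEndpoint = "http://localhost:4317";

    public static Uri Resolve(string? configuredValue)
    {
        // Missing or empty setting: fall back instead of letting `new Uri(null)` throw.
        if (string.IsNullOrWhiteSpace(configuredValue))
        {
            return new Uri(DefaultEndpoint);
        }

        // Reject relative, malformed, or non-http(s) values early with a readable message.
        if (!Uri.TryCreate(configuredValue, UriKind.Absolute, out var endpoint) ||
            (endpoint.Scheme != Uri.UriSchemeHttp && endpoint.Scheme != Uri.UriSchemeHttps))
        {
            throw new InvalidOperationException(
                $"'{configuredValue}' is not a valid absolute http(s) URI for the OTLP endpoint.");
        }

        return endpoint;
    }
}
```

In Program.cs it could be used as `opt.Endpoint = OtlpEndpointResolver.Resolve(builder.Configuration["OpenTelemetry:Exporter:Otlp:Endpoint"]);`.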

Using OTLP, our application sends traces to Grafana Agent, which takes care of the rest. In our case, the agent forwards them to Grafana Cloud. If required, you can easily switch from Grafana Cloud to a self-hosted Tempo just by reconfiguring the agent. There's no need to modify the source code.

That’s it.

Let’s run the app and hit the test endpoint.



GET {{host}}/api/v1/dogs/new

The parent span belongs to our API request, and it has two child spans: one for calling the external API and one for saving data to the database. Each span shows its duration, operation result, and metadata, and they are all correlated. That is basically everything we need to trace and debug an activity from top to bottom.
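The spans above are created by the instrumentation libraries, but the same parent/child relationship can be produced by hand. In .NET a span is a `System.Diagnostics.Activity`, so the sketch below uses only the base class library. The source name `Demo.Dogs` and the span names are made up for illustration, and the `ActivityListener` stands in for the OpenTelemetry SDK.

```csharp
using System;
using System.Diagnostics;

// Hypothetical names: "Demo.Dogs" and the span names are illustrative only.
public static class SpanDemo
{
    private static readonly ActivitySource Source = new("Demo.Dogs");

    // Starts a parent span with one child span and reports how they relate.
    public static (bool SameTrace, bool ChildPointsToParent) Run()
    {
        // Without a listener StartActivity returns null; in a real app the
        // OpenTelemetry SDK plays this role once the source is registered.
        using var listener = new ActivityListener
        {
            ShouldListenTo = source => source.Name == "Demo.Dogs",
            Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
                ActivitySamplingResult.AllDataAndRecorded
        };
        ActivitySource.AddActivityListener(listener);

        using var parent = Source.StartActivity("GET /api/v1/dogs/new", ActivityKind.Server);
        parent?.SetTag("dogs.requested", 1);

        // Started while `parent` is Activity.Current, so it becomes a child span.
        using var child = Source.StartActivity("save-dog", ActivityKind.Internal);
        child?.SetTag("db.system", "sqlite");

        if (parent is null || child is null)
        {
            return (false, false);
        }

        return (parent.TraceId == child.TraceId, child.ParentSpanId == parent.SpanId);
    }
}
```

For such custom spans to reach Tempo, the source name would also have to be registered on the tracing builder with `.AddSource("Demo.Dogs")`.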

Let’s hit another endpoint to see what we get in case of error:



GET {{host}}/api/v1/fail500

Let’s go back to the search panel and find our request:

It looks perfect: we have all the information needed to trace the issue back. The external API returned HTTP 404, Refit threw an exception, and our API returned HTTP 500 to the client.

Metrics

Metrics are aggregated real-time data that measure your application's performance.

Examples include the latency of your API endpoints, the number of HTTP 5XX errors, and the free space on the hard drive.

There are many frameworks for collecting metrics in a .NET service. All of them work just fine, but we’re going to use the OpenTelemetry SDK again and expose a Prometheus endpoint. Grafana Agent will scrape it and send the data to Grafana Cloud, similarly to traces.


 csharp
var builder = WebApplication.CreateBuilder(args);
builder.Services.AddOpenTelemetryMetrics(options =>
{
    options.ConfigureResource(resourceBuilder =>
    {
        resourceBuilder.AddService(
            builder.Environment.ApplicationName,
            builder.Environment.EnvironmentName,
            builder.Configuration["OpenTelemetry:ApplicationVersion"],
            false,
            Environment.MachineName);
        resourceBuilder.AddTelemetrySdk();
    })
    .AddHttpClientInstrumentation()
    .AddAspNetCoreInstrumentation()
    .AddEventCounterMetrics()
    .AddPrometheusExporter();
});


 csharp
var app = builder.Build();
app.UseHealthChecksPrometheusExporter("/health/prometheus", options =>
{
    options.ResultStatusCodes[HealthStatus.Unhealthy] = 200;
});
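Note that `UseHealthChecksPrometheusExporter` serves the health-check endpoint, while the `dogs-service` job in the agent config scrapes `/metrics/prometheus`. The metrics exporter therefore also needs a scrape endpoint on that path. The following is a minimal sketch of how this might be wired, assuming the `ScrapeEndpointPath` option and the `UseOpenTelemetryPrometheusScrapingEndpoint` extension from the OpenTelemetry.Exporter.Prometheus.AspNetCore package. The demo project may configure it differently.

```csharp
using OpenTelemetry.Metrics;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddOpenTelemetryMetrics(options =>
{
    options
        .AddAspNetCoreInstrumentation()
        .AddPrometheusExporter(exporterOptions =>
        {
            // Assumed to match metrics_path of the `dogs-service` scrape job.
            exporterOptions.ScrapeEndpointPath = "/metrics/prometheus";
        });
});

var app = builder.Build();

// Adds the middleware that serves the collected metrics in Prometheus text format.
app.UseOpenTelemetryPrometheusScrapingEndpoint();

app.Run();
```

Once the app is running, opening `/metrics/prometheus` in a browser should show the raw metrics that Grafana Agent scrapes.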

That’s it, very simple. Go to Grafana Cloud, change the data source to Prometheus, and try to visualize some metrics of your choice, e.g. queries per second:

You might already be familiar with Prometheus and Grafana. They are widely used by DevOps engineers to monitor VMs, networks, databases, and almost any other metric you can imagine.
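Besides the built-in instrumentation, you can publish your own business metrics through `System.Diagnostics.Metrics`. The sketch below is not taken from the demo project: the meter name `Demo.Dogs` and the counter `dogs_fetched` are hypothetical, and the `MeterListener` is only there to show in-process what the OpenTelemetry SDK would collect.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Hypothetical meter and instrument names, for illustration only.
public static class DogMetrics
{
    public const string MeterName = "Demo.Dogs";

    private static readonly Meter AppMeter = new(MeterName, "1.0.0");

    // Monotonic counter: how many dogs were fetched from the external API.
    private static readonly Counter<long> DogsFetched =
        AppMeter.CreateCounter<long>("dogs_fetched", description: "Dogs fetched from the external API");

    public static void RecordFetch(string breed) =>
        DogsFetched.Add(1, new KeyValuePair<string, object?>("breed", breed));
}

public static class DogMetricsDemo
{
    // Sums the measurements in-process with a MeterListener, which is
    // roughly the role the OpenTelemetry SDK takes over in a real service.
    public static long CountFetches(params string[] breeds)
    {
        long total = 0;
        using var listener = new MeterListener();
        listener.InstrumentPublished = (instrument, meterListener) =>
        {
            if (instrument.Meter.Name == DogMetrics.MeterName)
            {
                meterListener.EnableMeasurementEvents(instrument);
            }
        };
        listener.SetMeasurementEventCallback<long>(
            (instrument, measurement, tags, state) => total += measurement);
        listener.Start();

        foreach (var breed in breeds)
        {
            DogMetrics.RecordFetch(breed);
        }

        return total;
    }
}
```

To export such a counter alongside the built-in metrics, the meter would need to be registered on the metrics builder with `.AddMeter(DogMetrics.MeterName)`.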

Healthchecks

Health checks are a special case of application metrics used to automate operations, e.g. restarting your app automatically when it runs out of memory, or routing traffic to a new instance in your cluster once it becomes available.

Microsoft provides extensive documentation that explains health checks in depth.

To keep the article short and simple, I won't go into more detail on this subject. Let's simply try it out.

The first step is installing the required packages:


 xml
    <PackageReference Include="AspNetCore.HealthChecks.Network" Version="6.0.4" />
    <PackageReference Include="AspNetCore.HealthChecks.Prometheus.Metrics" Version="6.0.2" />
    <PackageReference Include="AspNetCore.HealthChecks.Publisher.Prometheus" Version="6.0.2" />
    <PackageReference Include="AspNetCore.HealthChecks.System" Version="6.0.5" />
    <PackageReference Include="Microsoft.Diagnostics.NETCore.Client" Version="0.2.328102" />
    <PackageReference Include="Microsoft.Extensions.Diagnostics.HealthChecks.EntityFrameworkCore" Version="6.0.8" />

Configure Program.cs:


 csharp
var builder = WebApplication.CreateBuilder(args);
builder.Services.AddHealthChecks()
    .AddDiskStorageHealthCheck(_ => { }, tags: new[] { "live", "ready" })
    .AddPingHealthCheck(_ => { }, tags: new[] { "live", "ready" })
    .AddPrivateMemoryHealthCheck(512 * 1024 * 1024, tags: new[] { "live", "ready" })
    .AddDnsResolveHealthCheck(_ => { }, tags: new[] { "live", "ready" })
    .AddDbContextCheck<DogsDbContext>(tags: new[] { "ready" });


 csharp
var app = builder.Build();
app.MapHealthChecks("health", new HealthCheckOptions
{
    Predicate = check => check.Tags.Contains("ready"),
    ResponseWriter = HealthChecksLogWriters.WriteResponseAsync
});

app.MapHealthChecks("health/ready", new HealthCheckOptions
{
    Predicate = check => check.Tags.Contains("ready"),
    ResponseWriter = HealthChecksLogWriters.WriteResponseAsync
});

app.MapHealthChecks("health/live", new HealthCheckOptions
{
    Predicate = check => check.Tags.Contains("live"),
    ResponseWriter = HealthChecksLogWriters.WriteResponseAsync
});

app.UseHealthChecksPrometheusExporter("/health/prometheus", options =>
{
    options.ResultStatusCodes[HealthStatus.Unhealthy] = 200;
});

In this example, we collect some crucial health check metrics and expose them in three different ways:

For Kubernetes probes
For the Prometheus collector
For people

Kubernetes Probes

You can skip it if you’re not using Kubernetes.

Kubernetes pings our application. Depending on the result, it can decide that the app is up and running, restart it, or wait a bit longer until the app loads all required dependencies. Only then will the app be ready to handle incoming traffic.

You can find detailed documentation in the Kubernetes docs on probes.

We’re exposing 2 endpoints:

/health/live - for startupProbe and livenessProbe. If the app signals that it is not live (some of the health checks failed), Kubernetes has to restart the service.
/health/ready - for readinessProbe. When the app is ready, Kubernetes will route traffic to this instance.

Prometheus healthcheck

This is an additional endpoint that exports health checks in Prometheus format. Similar to app metrics, we can use them in Grafana:

Human readable format

The last endpoint exists just to simplify testing and operations. Simply open it in your browser:
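The `HealthChecksLogWriters.WriteResponseAsync` helper used in the endpoint mappings belongs to the demo project and is not shown in this article. Below is a minimal sketch of what such a response writer might look like. `HealthJsonWriter` and the JSON layout are assumptions, not the demo's actual implementation.

```csharp
using System;
using System.Linq;
using System.Text.Json;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Diagnostics.HealthChecks;

// Sketch of a response writer with the same shape as the demo's
// HealthChecksLogWriters.WriteResponseAsync; the JSON layout is an assumption.
public static class HealthJsonWriter
{
    // Pure function so the payload can be unit-tested without an HTTP context.
    public static string BuildPayload(HealthReport report) =>
        JsonSerializer.Serialize(new
        {
            status = report.Status.ToString(),
            totalDurationMs = report.TotalDuration.TotalMilliseconds,
            checks = report.Entries.Select(entry => new
            {
                name = entry.Key,
                status = entry.Value.Status.ToString(),
                description = entry.Value.Description,
                error = entry.Value.Exception?.Message
            })
        });

    // Matches the HealthCheckOptions.ResponseWriter delegate:
    // Func<HttpContext, HealthReport, Task>.
    public static Task WriteResponseAsync(HttpContext context, HealthReport report)
    {
        context.Response.ContentType = "application/json; charset=utf-8";
        return context.Response.WriteAsync(BuildPayload(report));
    }
}
```

It would be plugged in the same way as the demo's helper: `ResponseWriter = HealthJsonWriter.WriteResponseAsync`.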

Logs

Logging for development and production environments will be different.

In production, the best practice is to write logs to stdout and then use a log collector (e.g. Promtail) to deliver them to storage. For development purposes, it is much easier to have a sink that writes logs directly to Grafana Cloud.

I believe that in 2022, OpenTelemetry logs are not ready for production use. They don't give you any advantages, and the tooling is quite poor compared to well-known tools.

So I still recommend using Serilog for logging. It has an OpenTelemetry sink, so you can switch to OTLP whenever needed without modifying your source code.

As usual, we start by installing packages:


 xml
<PackageReference Include="Serilog.AspNetCore" Version="6.0.1" />
<PackageReference Include="Serilog.Enrichers.Demystifier" Version="1.0.2" />
<PackageReference Include="Serilog.Enrichers.Span" Version="2.3.0" />
<PackageReference Include="Serilog.Exceptions" Version="8.4.0" />
<PackageReference Include="Serilog.Exceptions.EntityFrameworkCore" Version="8.4.0" />
<PackageReference Include="Serilog.Exceptions.Refit" Version="8.4.0" />
<PackageReference Include="Serilog.Formatting.Compact" Version="1.1.0" />
<PackageReference Include="Serilog.Sinks.Grafana.Loki" Version="8.0.0" />

Program.cs:


 csharp
builder.Host.UseSerilog((_, configuration) => configuration
    .ReadFrom.Configuration(builder.Configuration)
    .Enrich.WithSpan()
    .Enrich.WithExceptionDetails(new DestructuringOptionsBuilder()
        .WithDefaultDestructurers()
        .WithDestructurers(new IExceptionDestructurer[]
        {
            new DbUpdateExceptionDestructurer(),
            new ApiExceptionDestructurer()
        }))
    .Enrich.WithDemystifiedStackTraces());

builder.Services.AddHttpLogging(logging =>
{
    logging.LoggingFields = HttpLoggingFields.All;
});


 csharp
var app = builder.Build();
app.UseHttpLogging();

appsettings.json:


 json
"Serilog": {
    "Using": [
      "Serilog.Sinks.Grafana.Loki"
    ],
    "MinimumLevel": {
      "Default": "Debug"
    },
    "WriteTo": [
      {
        "Name": "Console",
        "Args": {
          "formatter": "Serilog.Formatting.Compact.CompactJsonFormatter, Serilog.Formatting.Compact"
        }
      },
      {
        "Name": "GrafanaLoki",
        "Args": {
          "uri": "https://logs-prod3.grafana.net",
          "credentials": {
            "login": "",
            "password": ""
          },
          "labels": [
            {
              "key": "service",
              "value": "demo-services-dogs"
            }
          ],
          "propertiesAsLabels": [
            "app"
          ]
        }
      }
    ]
  }

Don’t forget to put your Grafana credentials in your secrets.

In our demo project we’re going to log HTTP requests and responses. Microsoft provides two middleware components to log HTTP messages (body, headers, etc.):

http-logging <- we’re using this one
w3c-logger

Serilog is one of the most popular logging frameworks for .NET and has tons of extensions, for example:

WithSpan - adds information from OpenTelemetry traces.
WithExceptionDetails - logs exceptions in a convenient, human-readable format.

Okay, after running our service, go to Grafana Cloud -> Explore -> Loki Logs data source.

Here are our logs. Thanks to the deep integration between Loki and Tempo, Grafana lets us quickly jump from logs to the corresponding traces.
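The link between a log line and a trace is simply the trace id of the span that was active when the line was written. The sketch below shows where those values come from using only `System.Diagnostics`. `TraceCorrelation` is a hypothetical helper, and the output format is illustrative rather than what the Serilog span enricher actually emits.

```csharp
using System;
using System.Diagnostics;

// Illustrative only: shows which values tie a log line to a trace.
public static class TraceCorrelation
{
    // Appends the ids of the currently active span, if there is one.
    public static string Format(string message)
    {
        var activity = Activity.Current;
        if (activity is null)
        {
            // No active span: nothing to correlate with.
            return message;
        }

        var traceId = activity.TraceId.ToHexString();
        var spanId = activity.SpanId.ToHexString();
        return $"{message} TraceId={traceId} SpanId={spanId}";
    }
}
```

With such an id stored on every log event, a log viewer can look up the matching trace by that value.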

Conclusion

Source code is on GitHub.

In the article we explored Grafana Cloud and tried 3 observability tools:

Prometheus for metrics
Tempo for traces
Loki for logs

For every data store we use only the Grafana UI, which is super convenient for analytics and troubleshooting.

We also got our hands dirty with OpenTelemetry, which aims to standardize observability tools and protocols and make maintaining distributed applications much easier.

In the next article we’ll cover more topics that need to be addressed for production-ready apps, such as:

Error handling;
Retries, Jitter, Circuit Breaker patterns.
