Honeycomb, Python, and I: an OpenTelemetry Horror Story (With a Happy Ending)

Background

I'm currently on day 1.5 of my Staycation where I'm trying to catch up on my MBA coursework, but it can be somewhat painful and I needed a break. I decided to take up the challenge of finally integrating distributed tracing into my web apps.

It's no surprise that my apps are mostly written using Sanic as I'm pretty involved with the project. I've been wanting to start testing honeycomb out as well, so it seemed the perfect opportunity to try out.

OpenTelemetry & Python: The Nightmare Begins

First thing I want to say is that the OpenTelemetry project is both admirable for its goals and awesome in its scope. There's a lot going for it.

Second thing I want to say is that the documentation sucks. And I'm not the only person that thinks so. But it's okay enough to kind of work out the kinks once you've been spinning for a bit.

The third thing I want to say is that there need to be better examples for both manual instrumentation and automatic instrumentation NOT using the Flask example. I love Flask, and it's definitely one of the most popular Python web frameworks, but if you search for "Python OpenTelemetry" that's pretty much all you get.

For reference they all pretty much include the following instructions:

In your shell

virtualenv venv
source ./venv/bin/activate
pip install opentelemetry-api
pip install opentelemetry-sdk
pip install opentelemetry-distro
pip install opentelemetry-instrumentation-flask
pip install flask
pip install requests

app.py:

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor, ConsoleSpanExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from flask import Flask

app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)

@app.route("/")
def hello_world():    # You can still use the OpenTelemetry API as usual to create custom spans    # within your trace    
    with tracer.start_as_current_span("do_work"):
        time.sleep(0.1) 
     return "Hello, World!"

This is great... but I'm not using Flask and Sanic does not yet have automatic instrumentation.

Honeycomb Tasting

I've been a fan of the idea of honeycomb for some time, and have interacted from time to time with the honeycomb team.

I had signed up for a honeycomb account some time back, and thankfully they understand the concept of try-before-you-buy with a free tier that I never got to take advantage of. After logging in and setting up a new API key, I went to work attempting manual instrumentation of my sanic app.

So first, let's just try exporting a span with a test script:

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import SimpleSpanProcessor
from opentelemetry.sdk.resources import Resource

OTEL_EXPORTER_OTLP_ENDPOINT="https://api.honeycomb.io"
OTEL_EXPORTER_OTLP_HEADERS={'x-honeycomb-team': 'no-i-wont-tell-you'}
OTEL_SERVICE_NAME="test-service-for-hcotel"

resource = Resource({'name': 'test'})
provider = TracerProvider(resource=resource)
processor = SimpleSpanProcessor(OTLPSpanExporter(endpoint=OTEL_EXPORTER_OTLP_ENDPOINT, headers=OTEL_EXPORTER_OTLP_HEADERS))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("test-span"):
    print("test service")

Failure, Failure, Failure

My workflow often varies depending on location, time of day, phase of the moon, etc. My primary devices for development are a Lenovo X1 Nano running Fedora 35, a Dell XPS 8930 Desktop with some significant upgrades, an M1 Mac Mini, and a 14" MacBook Pro.

When developing in Python, on the linux systems, things Just Work™ but M1 macs still have some quirks. During this test, I was working on the MacBook Pro

GRPC Problems

On M1 macs, for some time now the grpcio wheel has been broken. There are a few different workarounds, but in some cases you won't even know you need them until you try testing out something that relies on it - say, the otel grpc trace exporter, and then you get a traceback like this:

Traceback (most recent call last):
  File "/Users/ssadowski/projects/otel-honeycomb-test/test.py", line 4, in <module>
    from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
  File "/Users/ssadowski/projects/otel-honeycomb-test/__pypackages__/3.10/lib/opentelemetry/exporter/otlp/proto/grpc/trace_exporter/__init__.py", line 20, in <module>
    from grpc import ChannelCredentials, Compression
  File "/Users/ssadowski/projects/otel-honeycomb-test/__pypackages__/3.10/lib/grpc/__init__.py", line 22, in <module>
    from grpc import _compression
  File "/Users/ssadowski/projects/otel-honeycomb-test/__pypackages__/3.10/lib/grpc/_compression.py", line 15, in <module>
    from grpc._cython import cygrpc
ImportError: dlopen(/Users/ssadowski/projects/otel-honeycomb-test/__pypackages__/3.10/lib/grpc/_cython/cygrpc.cpython-310-darwin.so, 0x0002): tried: '/Users/ssadowski/projects/otel-honeycomb-test/__pypackages__/3.10/lib/grpc/_cython/cygrpc.cpython-310-darwin.so' (mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64e'))

The solution for me was to do the following as part of my installation... sort of:

$ export PIP_NO_BINARY=grpcio

This gets me a bit farther, as now we get this weird error:

test service
Failed to export span batch, error code: StatusCode.UNAUTHENTICATED

Honeycomb's Bad Taste

There are a few things missing in honeycomb's documentation for using OpenTelemetry and unfortunately, that's not obvious. When reading through the docs, there are some significant misses that make honeycomb unusable as a collector, and what's worse if you're using the grpc exporter there are two major hurdles to surpass.

I spent a huge amount of time - maybe 6 hours - digging into the problem. I was flipping back and forth between honeycomb's docs, the open telemetry docs, and the open telemetry python repo digging through issues.

One of the big problems is that the python exporters do this bit of stupid in the grpc exporter on lines 152-153. Also it's clear the author knew the issue, because they specifically disable the pylint warning!

            except Exception as error:  # pylint: disable=broad-except
                logger.exception(error)

Yes, we just catch the exception and don't really provide any detail, which yes, in our case just told us the error listed above.

I kept digging, however, and found that honeycomb has experimental http support for open telemetry... I decided to give it a whirl and see if I could at least get a better error. Answer? Yes, and also answer too problem #1 with the honeycomb docs: the API endpoint was wrong. After switching to the http exporter, the error changed:

test service
Failed to export batch code: 400, reason: {"error":"invalid OTLP endpoint - should you be sending to /v1/traces?, see API docs https://docs.honeycomb.io/"}

Well gosh, probably I should be sending my traces there. So I updated my endpoint and switched back to grpc. Bad move.

test service
Failed to export span batch, error code: StatusCode.UNAUTHENTICATED

Well crap. Back to http? But first - is the http exporter really that much better?

     _logger.error(
                    "Failed to export batch code: %s, reason: %s",
                    resp.status_code,
                    resp.text,
                )

So yes, we'll get significantly more detail fromt he http exporter. But what did we get?

test service
Failed to export batch code: 401, reason: {"message":"missing 'x-honeycomb-dataset' header"}

Now wait a minute, that's not even a header the doc says to use... so is that problem 2? If I add the header, what do I get from the http exporter this time?

test service

Success? Yes! I was able to validate that the span was present, and that's what I needed.

Making the changes to the grpc exporter got the same results, and so clearly that was the issue.

In the end, I was able to start exporting spans to honeycomb and could take my learning with me.

Gotta Go Fast

So with no automatic instrumentation, how can we get things going quickly?

Off To The Races

Like many python frameworks, Sanic has a method to hook into the request/response cycle. In Sanic, also like many python frameworks, we call this middleware and we can do some execution before and after. We also have some convenience decorators, which I generally don't use outside of doing some POCs. Because this is a POC, let's set up our hooks!


@app.on_request
async def otel_req_middleware(request: request.Request) -> None:
    route_name = request.route.name if request.route else 'Error' # set a default route name
                                                                  # necessary because route isn't
                                                                  # set on some errors like NotFound
    request.ctx.span = tracer.start_span(name=route_name)         # create our span
    request.ctx.span.add_event("route", {"path": request.path})   # add our path as an event


@app.on_response
async def otel_rsp_middleware(request: request.Request, 
        response: response.HTTPResponse) -> None:
    try:
        if request.ctx.span:
            request.ctx.span.end()  # close the span, send to collector
    except Exception as e:
        print(e)                    # TODO: convert to logger, catch specific exceptions


@app.exception(exceptions.SanicException)
async def otel_exc_middleware(request: request.Request,
            exception: exceptions.SanicException) -> response.HTTPResponse:
    request.ctx.span.set_status(Status(StatusCode.ERROR))  # make sure to flag span as containing an error
    request.ctx.span.add_event("exception",
        {"data": traceback.extract_stack().format()})      # Add an error event
    return response.json({"error": str(exception), "status_code": exception.status_code},
        status=exception.status_code)                      # return our error

What's Happening Here?

The request middleware executes before the main portion of the request, so in the

@app.on_request

function, I initialize the span and add an event that tells me either that it's a route or an error. The route will always be present unless a NotFound exception has bubbled up from the router because the route hasn't been defined.

In the

@app.on_response

function, I close the span and in this situation, the span is exported to the collector.

The exception function sets the span to contain an error, and add an event to contain the trace for the exception.

Wrapping It All Up

At this point, spans are being exported to honeycomb for both successful requests and errors, so the Otel POC is largely complete. The big question is do I carry this further and add automatic instrumentation to Sanic, or do I complete the larger POC for the service first?

For Sanic

The answer is that we're trying to help Sanic grow up, and it would be really helpful if we natively supported Open Telemetry either as an add-on package or as part of Sanic Extensions. Either way, it's on my to-do list to get this done, but for now I have a pattern to work with Honeycomb in a way that will start to get me the observability metrics I need.

Outstanding Issues

Update Open Telemetry documentation, possibly through PR
Update GRPC exporter, possibly through PR, to return better errors
Open issue with Honeycomb for them to update their python documentation

Takeaways

Like so many efforts, this implementation was born out of necessity. I was somewhat aware of the issues with Open Telemetry and Python because of my long time friend Michael Stella had already fought some of the same battles.

I am a bit disappointed with the issues I had to uncover on my own with Honeycomb, but I'm not deterred by them. I'm excited to make it to the next steps for my own efforts, and I hope that this effort will help others, even outside the Sanic community, integrate observability and distributed tracing with their python apps and services.

With all of that being said - this was a great challenge and I learned lots. I'm very excited about the future of Open Telemetry and establishing a pattern for observability for my own services. I think that if my experience can help anyone shave down time it takes to get up and running with Otel on python, it was totally worth it.

Blog