Fast piano transcription on AWS - Part 1


Meng Lin

Posted on October 9, 2023


"The best way to learn jazz is to listen to it.”— Oscar Peterson

Sometimes I marvel at beautiful music, wondering how to reproduce it. This process is called “Playing by Ear” or “Transcription”. Just as a child learns to speak by listening first, music can be learnt the same way.

But there’s a problem…

Transcribing music is time-consuming. It takes a professional musician roughly 4–60 minutes to transcribe 1 minute of music. But beginning musicians, who benefit most from transcriptions, may not have the skills to transcribe.

We need a computational transcription method

To illustrate the transcription process, I will use Bach’s Passacaglia in C minor (BWV 582) as an example:

Modified image from David Abbot’s Understanding Sound

With a spectrogram, we can identify the energy of each frequency through time. But the actual notes played are shown in green; somehow we need to identify the fundamental frequency of each note played.
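For intuition, here is how such a spectrogram could be computed. Note that librosa and the file name are my own choices for illustration, not necessarily what the transcription module uses:

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# Load audio and compute a log-magnitude spectrogram
audio, sr = librosa.load('passacaglia.mp3', sr=16000, mono=True)
spec = np.abs(librosa.stft(audio, n_fft=2048, hop_length=512))
spec_db = librosa.amplitude_to_db(spec, ref=np.max)

# Plot the energy of each frequency bin over time
librosa.display.specshow(spec_db, sr=sr, hop_length=512,
                         x_axis='time', y_axis='hz')
plt.colorbar(format='%+2.0f dB')
plt.show()
```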

Modified image from David Abbot’s Understanding Sound


AI to the rescue!

Luckily, some employees from ByteDance (TikTok’s parent company) were working on a piano transcription algorithm. The result is a Python module for inference that converts audio files to MIDI: https://github.com/qiuqiangkong/piano_transcription_inference
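Basic usage, adapted from the repository’s README (the audio file name is a placeholder):

```python
from piano_transcription_inference import PianoTranscription, sample_rate, load_audio

# Load audio at the module's expected sample rate (16 kHz)
audio, _ = load_audio('passacaglia.mp3', sr=sample_rate, mono=True)

# 'cuda' for GPU inference, 'cpu' otherwise
transcriptor = PianoTranscription(device='cpu')

# Transcribe and write the result to a MIDI file
transcribed_dict = transcriptor.transcribe(audio, 'passacaglia.mid')
```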

How does this module work?

  1. Preprocess:
    – Downsample the original audio.
    – Split the audio into segments (see the sketch after this list).

  2. Inference:
    – Generate a spectrogram from each audio segment.
    – Run CNN inference on the spectrogram.

  3. Postprocess:
    – Stitch the rough MIDI output together.
    – Perform regression to find the most likely MIDI events.
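Here is a minimal sketch of the segmentation part of preprocessing, assuming 16 kHz audio, fixed-length segments, and no overlap (the module’s actual segment length and overlap handling may differ):

```python
import numpy as np

def split_into_segments(audio: np.ndarray, sr: int = 16000,
                        segment_seconds: float = 10.0) -> list[np.ndarray]:
    """Split a mono audio array into fixed-length segments,
    zero-padding the final segment to full length."""
    segment_len = int(sr * segment_seconds)
    segments = []
    for start in range(0, len(audio), segment_len):
        segment = audio[start:start + segment_len]
        if len(segment) < segment_len:
            segment = np.pad(segment, (0, segment_len - len(segment)))
        segments.append(segment)
    return segments
```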

Core backend architecture

An ideal piano transcription service would be blazingly fast, so speed is the main focus of the architecture exploration below.

Local server on M1 Pro — v0

A few observations when running locally:
– Enabling torch multicore processing speeds up inference (see the snippet below).
– Up to 3 cores are used; beyond that, there is no improvement in speed or core utilisation.
– The transcription rate is 0.5 seconds per 1 second of audio.
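For reference, this is one way to control PyTorch’s CPU parallelism; torch.set_num_threads caps the intra-op thread pool:

```python
import torch

# Cap PyTorch's intra-op thread pool; on the M1 Pro,
# transcription speed plateaued at 3 threads
torch.set_num_threads(3)
print(torch.get_num_threads())  # verify the setting
```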

Running locally gives some idea of how performance could be improved before moving to the cloud. I find that iterating in the cloud takes much longer than local development.

Naive server — v1

Naive server architecture

Although this works, there are many limitations:

  1. API Gateway limits payloads to 6MB — equivalent to roughly a 6-minute mp3 file.
  2. Lambda’s computation speed is slow — 3 seconds per 1 second of audio.
  3. API Gateway enforces a 30-second timeout — so only about 10 seconds of audio can be transcribed.
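To make the flow concrete, here is a minimal sketch of what the v1 Lambda handler could look like, assuming the audio arrives base64-encoded through API Gateway (field names and paths are illustrative):

```python
import base64

from piano_transcription_inference import PianoTranscription, sample_rate, load_audio

# Load the model once per container so warm invocations skip this step
transcriptor = PianoTranscription(device='cpu')

def handler(event, context):
    # API Gateway delivers binary request bodies as base64 strings
    with open('/tmp/input.mp3', 'wb') as f:
        f.write(base64.b64decode(event['body']))

    audio, _ = load_audio('/tmp/input.mp3', sr=sample_rate, mono=True)
    transcriptor.transcribe(audio, '/tmp/output.mid')

    # Return the MIDI file, base64-encoded, back through API Gateway
    with open('/tmp/output.mid', 'rb') as f:
        midi_b64 = base64.b64encode(f.read()).decode()
    return {'statusCode': 200, 'isBase64Encoded': True, 'body': midi_b64}
```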

Monolithic server — v2

Monolithic server

Presigned POST URLs allow clients to upload files of up to 5GB directly to S3. Lambda can then download the file (to the /tmp folder only) and upload results back to S3 using the AWS SDK.
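Generating a presigned POST with boto3 looks roughly like this (the bucket, key, and size limit are placeholders):

```python
import boto3

s3 = boto3.client('s3')

# The client uploads directly to S3 with the returned URL and fields,
# bypassing the API Gateway payload limit entirely
response = s3.generate_presigned_post(
    Bucket='piano-transcription-uploads',  # placeholder bucket
    Key='uploads/recording.mp3',           # placeholder key
    Conditions=[['content-length-range', 0, 5 * 1024 ** 3]],  # up to 5GB
    ExpiresIn=3600,                        # URL valid for 1 hour
)
# response == {'url': ..., 'fields': {...}} for use in an HTTP POST form
```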

Lambda has a 900-second timeout, so up to 5 minutes of audio can be transcribed at this rate.

Step Functions server — v3

Step Functions server

The Step Functions orchestrator can run inference Lambdas in parallel (sketched below) — this prevents Lambda timeouts, so you can transcribe over 1 hour of audio:
– Inference takes 12.5 seconds per 5-second segment.
– Overall, this achieves a transcription rate of 0.25 seconds per 1 second of audio.
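Here is a rough sketch of the fan-out in Amazon States Language, written as a Python dict; all names, ARNs, and the concurrency limit are illustrative, not the actual deployment:

```python
import json

import boto3

# A Map state runs one inference Lambda per audio segment in parallel,
# then a final Lambda stitches the MIDI segments together
state_machine = {
    "StartAt": "TranscribeSegments",
    "States": {
        "TranscribeSegments": {
            "Type": "Map",
            "ItemsPath": "$.segments",
            "MaxConcurrency": 40,  # illustrative fan-out limit
            "Iterator": {
                "StartAt": "InferSegment",
                "States": {
                    "InferSegment": {
                        "Type": "Task",
                        "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:infer-segment",
                        "End": True,
                    }
                },
            },
            "Next": "StitchMidi",
        },
        "StitchMidi": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:REGION:ACCOUNT:function:stitch-midi",
            "End": True,
        },
    },
}

sfn = boto3.client('stepfunctions')
sfn.create_state_machine(
    name='piano-transcription',
    definition=json.dumps(state_machine),
    roleArn='arn:aws:iam::ACCOUNT:role/step-function-role',  # placeholder role
)
```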


Credits: Spectrogram images modified from David Abbot’s Understanding Sound.
