Creating Visuals for Music Using Speech Recognition, JavaScript and ffmpeg: Version 0
VBAK
Posted on July 15, 2019
Hello! This is my first blog post on dev.to
I make music and I code.
## The Problem
Putting out music and garnering attention for it requires me to wear multiple hats for a variety of tasks: branding, social media marketing, beat production, songwriting, mastering audio, shooting and editing videos, designing graphics, the list goes on...
In order to create social media audiovisual content for my music, I generally follow this process:
- 1) Make a beat in Garageband
- 2) Write lyrics
- 3) Practice the song
- 4) Set up my DSLR camera
- 5) Set up my microphone
- 6) Video myself recording the song
- 7) Import the video into Adobe Premiere
- 8) Import the song audio into Adobe Premiere
- 9) Align the audio with the video
- 10) Add and align lyrics (text graphics) with the audio
- 11) Add some effects to the video (I like this 80s look)
- 12) Render the video (45 minutes to an hour)
- 13) Export to `.mp4` (another 30-40 minutes)
- 14) Upload to YouTube (another 30-40 minutes)
- 15) Upload to IGTV (another 30-40 minutes)
I want to increase the time I spend on steps 1 through 3 and decrease the time I spend on steps 4 through 15.
## Inspiration
Last Sunday (07/07/2019) I was refactoring some of my code on a project from jQuery to Web APIs. One thing led to the next, as tends to happen the longer I'm on MDN, and I came across the WebRTC (Web Real-Time Communication) standard and the YouTube LiveStream API documentation. This led me to Googling info about audio and video codecs, which finally led me to `ffmpeg`, an open-source tool used for audio and video processing. Sweet--I could start something from there.
I had used this software sparingly in the past, so I spent a few days experimenting with a few different image-to-video conversions in order to learn the basics. Here I've used `ffmpeg` to convert a sort-of timelapse of the BART (Bay Area Rapid Transit) train that passes nearby, using 338 images taken throughout the day:
This inspired and led me to the project I'm working on now.
## The Project
I've called this project `animatemusic`, and it lives at this GitHub repository. My goal is to create a toolchain to expedite the creation of visuals for my songs.
## The Tech
- Node.js
- DOM Web API
- JSZip
- FileSaver
- ffmpeg
## How it Works Thus Far
The process is a bit choppy right now since I'm running the various responsibilities in series in a semi-manual fashion:
- 1) Export my vocals from Garageband to a single `.wav` file
- 2) Type the song lyrics into a `.txt` file
- 3) Feed the song vocals and lyrics to a locally run CLI of gentle and receive a `JSON` file with the forced-alignment results
- 4) Install and run my `animatemusic` repo locally
- 5) Upload the `JSON` file (along with some other parameters) and receive a `.zip` folder with individual video frame `.png` files
- 6) Use `ffmpeg` to stitch the images into a (lyric) video file
- 7) Use `ffmpeg` to combine the song audio and the lyric video
#### Setting Up gentle
gentle is a forced-alignment tool that relies on kaldi, a speech recognition toolkit. Forced-alignment involves matching a text transcript with the corresponding speech audio file.
The installation process for gentle was rocky, so the following tips and resources may be useful to you, should you choose to install it:
- "Error finding kaldi files"
- I added
branch: "master"
to the gentle.gitmodules
file in order to capture some of the latest updates in kaldi which resolved some installation issues - Install gentle in a python virtual environment since they expect you to use
python@2.7.x
and the correspondingpip
version - In gentle's
install_deps.sh
bash script, comment out any of thebrew install
software names that you already have installed since anybrew
warnings will prevent the bash script from proceeding to the next step, which is the criticalsetup.py
process
#### Generating the Forced-Alignment Results
Once you have gentle running, give yourself a pat on the back and then run the following in your terminal, now outside of the virtual environment which used `python@2.7.x`:
`python3 align.py path/to/audio path/to/transcript -o path/to/output`
The resulting file is in `JSON` format with the following structure:
```
{
  "transcript": string,
  "words": [
    {
      "alignedWord": string,
      "case": string,
      "end": number,
      "endOffset": number,
      "phones": [
        {
          "duration": number,
          "phone": string
        }
      ],
      "start": number,
      "startOffset": number,
      "word": string
    }
  ]
}
```
- `transcript`
  - holds the full text of your transcript in a single string
- `words`
  - holds word Objects in an array
- `alignedWord`
  - is the word string that gentle recognized from the audio
- `case`
  - is a success string with either "success" or "not-in-audio" values
- `end`
  - is the time in seconds of when the word ends in the audio
- `endOffset`
  - I'm not sure...TBD (comment if you know)
- `start`
  - is the time in seconds of when the word starts in the audio
- `startOffset`
  - I'm not sure...TBD (comment if you know)
- `word`
  - is the word in the transcript to which it forced-aligned the word in the audio file
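As a quick sanity check on this output, here's a small Node.js sketch (not part of the project, and `lyrics.json` is just a placeholder path) that reads the file and prints each successfully aligned word with its duration:

```js
// Read the gentle output and log each aligned word's duration.
const fs = require('fs');

const { words } = JSON.parse(fs.readFileSync('./lyrics.json', 'utf8'));

words
  .filter((w) => w.case === 'success') // skip words gentle couldn't align
  .forEach((w) => {
    const duration = w.end - w.start; // seconds
    console.log(`${w.word}: starts at ${w.start}s, lasts ${duration.toFixed(2)}s`);
  });
```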
#### Converting Forced-Alignment Results to Video Frames
If I can create an image for each video frame, I can render all of those image frames into a video using `ffmpeg`.
Right now, I have a single script block in my `index.html` which performs all of the logic around this process. Here's the minimal interface I've created thus far:
Here are the inputs to my script:
- "video frame rate" and "full song length"
- determine the total number of frames in the (eventual) video. Default values: 30 fps (frames per second) and 60 seconds, resulting in 1800 frames.
- "words per frame" determine how many words will be displayed together on the
canvas
at any given time- right now my script is not optimal--if your cadence is fast, the time between words is short and this causes rounding errors and the script fails. This motivated the addition of this input.
- "video width" and "video height"
- set the size for the
canvas
element
- set the size for the
- "lyrics"
- is the
JSON
output from gentle
- is the
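Here's the sketch mentioned above: how those inputs roughly translate into frame counts. The helper name is mine, not from the repo, and the rounding step is exactly where the sync issues mentioned later can creep in.

```js
// Default input values from the interface described above.
const frameRate = 30;  // "video frame rate", in fps
const songLength = 60; // "full song length", in seconds

const totalFrames = frameRate * songLength; // 1800 frames

// For a set of words spanning `start` to `end` (seconds), the number of
// frames to render is the duration times the frame rate, rounded to an integer.
function framesForWordSet(start, end) {
  return Math.round((end - start) * frameRate);
}

console.log(totalFrames);                // 1800
console.log(framesForWordSet(1.0, 1.4)); // 12
```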
The following scripts must be loaded first:
- `jszip.min.js`
  - The wonderful JSZip client-side library which generates a zip file
- `FileSaver.js`
  - The wonderful FileSaver client-side library which, among other functionality, exposes the `saveAs` variable to trigger a browser download of a file
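To show how these two libraries fit together, here's a minimal sketch of the zip-and-download flow, assuming both scripts are already loaded so the `JSZip` and `saveAs` globals exist (the blank canvas and filenames are placeholders):

```js
// Stand-in for the real lyric canvas; any canvas element works for the demo.
const canvas = document.createElement('canvas');
const pngBase64 = canvas.toDataURL('image/png').split(',')[1]; // strip the dataURL prefix

const zip = new JSZip();
zip.file('0.png', pngBase64, { base64: true }); // frames are named 0.png, 1.png, ...

// Generate the archive and hand it to FileSaver to trigger the browser download.
zip.generateAsync({ type: 'blob' }).then((blob) => {
  saveAs(blob, 'frames.zip');
});
```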
The script I've written can be seen in the repo's `index.html`. It's still a work in progress, so please provide feedback. Here's how it works:
- Upon uploading the transcript, the event handler `handleFiles` is called. `handleFiles`:
  - Parses the file into a regular JS object (sketched below)
  - Renders either a blank image (no lyrics being sung for that frame) or an image with the lyrics text (for frames where lyrics are being sung) onto the `canvas` element
  - Saves the `canvas` element first as a `dataURL` and then as a `.png` file object to the folder object which will eventually be zipped
  - Initiates the download of the zipped folder upon completion of all image renders
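Here's a rough sketch of just that upload-and-parse step (the `#lyrics` file input is hypothetical, and the real handler goes on to render and zip the frames):

```js
// Hypothetical <input type="file" id="lyrics"> for uploading the gentle JSON.
const lyricsInput = document.querySelector('#lyrics');

function handleFiles(event) {
  const [file] = event.target.files;
  const reader = new FileReader();
  reader.onload = () => {
    const transcript = JSON.parse(reader.result); // plain JS object with a `words` array
    console.log(`${transcript.words.length} aligned words loaded`);
    // ...render each frame to the canvas and add the .png to the zip folder here
  };
  reader.readAsText(file);
}

lyricsInput.addEventListener('change', handleFiles);
```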
A few helper functions to break up the responsibilities (a sketch of the first two follows this list):
- `prepareWordData`
  - takes the `words` `Array` from the transcript
  - extracts `wordsPerFrame` words at a time (default of 3 words)
  - creates an `Array` of new reduced versions of the original word Objects, using the first and last word's `start` and `end` values, respectively, for every set of words:

```
{
  alignedWord: string,
  case: "success",
  end: number, // the last word's `end` property
  start: number // the first word's `start` property
}
```
- `getWordDuration`
  - takes a word object and returns the difference (in seconds) between the `start` and `end` values.
  - this "duration" is used to determine how many frames need to be rendered for each set of words
- `renderWordFrames`
  - takes the word (empty string if no lyrics are spoken during those frames) and duration of the word
  - creates a new 2D `context` object
  - fills it with the words' text
  - gets the `dataURL` using the `.toDataURL()` method on the `canvas` element
  - saves it to the folder-object-to-be-zipped with filenames starting with `0.png`
  - This filename convention was chosen since it's the default filename sequence that `ffmpeg` expects
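Here's a simplified sketch of what the first two helpers boil down to; the actual code in `index.html` differs in the details:

```js
// Simplified versions of the first two helpers (the real ones handle more edge cases).
function prepareWordData(words, wordsPerFrame = 3) {
  const sets = [];
  for (let i = 0; i < words.length; i += wordsPerFrame) {
    const group = words.slice(i, i + wordsPerFrame);
    sets.push({
      alignedWord: group.map((w) => w.word).join(' '), // the set's display text
      case: 'success',
      start: group[0].start,            // first word's start
      end: group[group.length - 1].end, // last word's end
    });
  }
  return sets;
}

function getWordDuration(wordSet) {
  // Difference in seconds; used to decide how many frames to render for this set.
  return wordSet.end - wordSet.start;
}
```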
#### Generating the Video From Rendered Frames
Now that I have an image file for each frame of the video, I can use `ffmpeg` to stitch them together. I have found the following parameters to be successful:
`ffmpeg -framerate 30 -i "%d.png" -s:v 640x480 -c:v libx264 -profile:v high -crf 20 -pix_fmt yuv420p path/to/output.mp4`
- `-framerate 30` sets the video frame rate to 30 frames per second
- `-i "%d.png"` matches the sequential filenames
- `-s:v` sets the size of the video frame (corresponding to the `canvas` element size, in this example, 640x480)
- `-c:v` specifies the video codec (I've used `libx264` which is recommended by YouTube and Instagram)
- `-profile:v` sets the quality of the video to `high` (haven't fully understood how it works yet)
- `-crf` is the "Constant Rate Factor", which I haven't fully understood, but it ranges from 0 (lossless) to 51 (lowest quality)
- `-pix_fmt` sets the pixel format used, in this case, `yuv420p`, which sets the ratio of pixels for luminance Y (or brightness), chrominance blue U and chrominance red V. I'm pretty rough on these concepts so please correct or enlighten if you are more experienced.
This command generates a video at the output path, stitching the images together at the given framerate.
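Eventually I'd like to trigger this step from Node.js instead of running it by hand; something like this `child_process` sketch could work (paths are placeholders, and this isn't in the repo yet):

```js
// Run the same ffmpeg command from Node.js instead of typing it in the terminal.
const { execFile } = require('child_process');

const args = [
  '-framerate', '30',
  '-i', '%d.png',
  '-s:v', '640x480',
  '-c:v', 'libx264',
  '-profile:v', 'high',
  '-crf', '20',
  '-pix_fmt', 'yuv420p',
  'output.mp4',
];

execFile('ffmpeg', args, (error) => {
  if (error) {
    console.error('ffmpeg failed:', error);
    return;
  }
  console.log('lyric video written to output.mp4');
});
```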
#### Adding the Song Audio
Now that I have the video for the lyrics, I can add the song audio (full song not just the vocals) using:
`ffmpeg -i path/to/video -i path/to/audio -vcodec libx264 -acodec libmp3lame path/to/output.mp4`
The first two input flags identify the video and audio files which will be streamed together using the video codec and audio codec specified.
#### The Result
Here's what I end up with!
It's pretty rough but the adrenaline rush was real when I saw it the first time.
## Next Steps
I consider this a successful Proof-Of-Concept. Here are my next steps:
- Over time, the lyrics fall out of sync with the audio, and this is most likely due to the fact that I rely on rounding the number of frames at 3 different places in the script
- The manner in which the three words align with the vocals is suboptimal. I may consider increasing the number of words shown per set of frames
- It's dull! The project is called `animatemusic` and this video is lacking interesting animations. If you recall, the word objects contain an array of phonemes used to pronounce the word. Mixing this with [anime.js, particularly their morphing animation](https://animejs.com/documentation/#morphing) will lead to some interesting lip sync animation attempts down the road
- The process is fragmented. Generating the forced-alignment output, generating the video frame images and generating the final output video currently takes place in three separate manual steps. I would like to eventually integrate these different services
- Integrations. The eventual goal is to connect this process with my YouTube and Instagram accounts so that I can upload to them upon completion using their APIs
- Refactoring. There's a lot of improvements needed in my script and I now feel confident enough to dive in and build this project out properly with tests
## Feedback
If you can help me improve my code, blog post, or my understanding of the context and concepts around anything you read above, please leave a comment below.
## Follow Me
[YouTube](https://www.youtube.com/channel/UCqvA0CVB3QR3TLpGVzkD35g)
[Instagram](https://www.instagram.com/vbaknation)
Thanks for reading!