Diary of youtube-dl internals, part 1

zenulabidin

Ali Sherief

Posted on November 2, 2020

Once upon a time, there was a project called youtube-dl on GitHub. It could download videos from YouTube and dozens of other sites. Many people used it, and programs were built around it.

The YouTube downloader in particular occasionally stopped working as Google changed the site to inhibit downloading from it, and the maintainers would commit fixes for these breakages very quickly. It was a cat-and-mouse game.

As you probably heard, youtube-dl’s repository was DMCAed by the RIAA and taken offline. It was because copyrighted music videos were being downloaded in the unit tests. However, the code still existed in other places and has since been mirrored dozens of times.

Among the more prominent of these repositories are blackjack4494/yt-dlc (endorsed by the r/youtubedl subreddit) and l1ving/youtube-dl (maintenance repo for bugfixes). I have volunteered to continue the Python side of things (maybe it will be my Hacktoberfest too 😉) and have also elected to keep a diary of things I studied from the codebase. It will be useful because the issues and PRs were removed along with the original repo, so finding old bug reports has become more difficult.

This series will be a work in progress and other unrelated articles will be published in between. My hope in writing this series is that it empowers other developers to contribute as well, because you can only contribute to a project after you understand it.

A few other notes: first, I will mainly cover the YouTube downloader, because it is the one most in danger of being broken by an update from Google. Second, the source code lives in the youtube_dl folder, so file paths that are not absolute should be read as relative to that folder.

So, let’s begin.

The Youtube downloader: extractor/youtube.py

This is the core of the YouTube downloader. It provides the classes that extract the video from the page and create a YouTube session for logging in programmatically, and it defines all the video formats.

youtube-dl handles a number of different kinds of YouTube video URLs, so it needs to identify from its input which ones are valid YouTube URLs and which ones aren’t. That is handled in the YoutubeIE class with a regular expression. As you might know, Python has a rich regular expression syntax. Regexes are conventionally written as raw strings, r"", so that backslashes reach the regex engine untouched; the r prefix does not by itself make a string a regex. A raw string can span multiple lines of code using triple quotes: r"""...""".

The YoutubeIE member that holds the video-matching regular expression is called _VALID_URL. It is too long to paste here, so let’s go over some of the key groups in it. In case you didn’t know, a matching group (more commonly called a capturing group) is a part of a regex whose matched text can be retrieved later.

Lines 377-378

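Our first expression is not a group at all: it is (?x), the inline form of the verbose flag, which lets the regular expression be written across several lines. Triple-quoted strings alone are not enough, because the newlines and indentation would be treated as literal characters that the input has to contain, so no URL would ever match. With (?x), whitespace in the pattern is ignored (unless it is escaped or inside a character class) and everything from a # to the end of the line is treated as a comment.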
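To see the effect of (?x) in isolation, here is a minimal sketch. The pattern and the example.com host are made up for illustration and are not taken from _VALID_URL; the only thing that differs between the two patterns is the presence of the flag.

```python
import re

# A toy pattern (NOT the real _VALID_URL) written with the (?x) verbose flag.
# The host example.com is a made-up placeholder.
VERBOSE = r"""(?x)^
    (?:https?://)?      # optional scheme
    example\.com/       # hypothetical host
    ([0-9]+)            # numeric ID, captured as group 1
    $"""

# The same text without (?x): newlines, spaces and "comments" become literal.
LITERAL = VERBOSE.replace("(?x)", "", 1)

verbose_match = re.match(VERBOSE, "https://example.com/42")
literal_match = re.match(LITERAL, "https://example.com/42")

print(verbose_match.group(1))  # 42
print(literal_match)           # None
```

The second pattern still compiles; it simply can never match a normal URL, because it now expects the input to contain the line breaks and comment text.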

Next up we have ^ followed by a ( on a second line. The ^ anchors the match to the beginning of the input, while the left parenthesis opens one giant matching group.

Regular expressions that don’t start with ^ and end with $ are more difficult to debug, because a search can then succeed on any substring of the input rather than only on the full input.
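A small sketch of what the anchors change, using a made-up three-digit pattern rather than anything from youtube-dl:

```python
import re

UNANCHORED = r"[0-9]{3}"    # may match anywhere inside the input
ANCHORED = r"^[0-9]{3}$"    # must account for the whole input

text = "abc123def"

unanchored_hit = re.search(UNANCHORED, text)
anchored_hit = re.search(ANCHORED, text)
anchored_full = re.search(ANCHORED, "123")

print(unanchored_hit.group(0))  # 123
print(anchored_hit)             # None
print(anchored_full.group(0))   # 123
```

With the anchors in place, a match tells you the entire input has the expected shape, which is what you want when validating a URL.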

Line 379

(?:https?://|//): Is responsible for matching http://, https://, and // at the beginning of the URL. The ? after the s matches it 0 or 1 times. The (?:) construct is a non-matching group, usually called a non-capturing group: it groups part of the pattern without creating a capturing group.

Non-capturing groups are not look-ahead operators: they consume input exactly like an ordinary group, and the only difference is that the text they match is not saved. (Look-aheads are written (?=...) and (?!...), and we will meet one at line 416.) Matching simply continues from wherever the previous part of the pattern stopped. Whatever follows the video ID, such as extra query parameters, is consumed at line 440 by a catch-all that is not captured either, so only the desirable parts of the URL end up in matching groups.

Let’s show a few examples:

r"^(?:abc)([0-9]{3}).*$"
# Captures a three-digit number that follows "abc" at the
# beginning of the input.
r"^(?:the (?:quick|slick)).*$"
# Matches input that begins with "the quick" or "the slick".
# Nothing may come between the two words. Does not make a
# match group.
r"^(?:the )(?:quick ).*$"
# Matches input that begins with "the " followed immediately
# by "quick ". Does not make a match group.
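The three patterns above can be run directly to check what they capture. This is only a sketch; the input strings are invented for the demonstration.

```python
import re

# Non-capturing prefix: only the three digits are saved.
only_digits = re.match(r"^(?:abc)([0-9]{3}).*$", "abc123xyz")
# Same pattern with a capturing prefix, for comparison.
prefix_and_digits = re.match(r"^(abc)([0-9]{3}).*$", "abc123xyz")

# Matches, but saves nothing, because every group is non-capturing.
slick = re.match(r"^(?:the (?:quick|slick)).*$", "the slick fox")

# Both parts are required, in order, at the start of the input.
both_parts = re.match(r"^(?:the )(?:quick ).*$", "the quick fox")
one_part = re.match(r"^(?:the )(?:quick ).*$", "quick fox")

print(only_digits.groups())        # ('123',)
print(prefix_and_digits.groups())  # ('abc', '123')
print(slick.groups())              # ()
print(both_parts is not None)      # True
print(one_part)                    # None
```

Note that the non-capturing groups still consume "abc" and "the quick "; they just leave no entry in groups().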

Line 380

(?:(?:(?:(?:\w+\.)?[yY][oO][uU][tT][uU][bB][eE](?:-nocookie|kids)?\.com/|: Let's dissect this starting from the innermost non-capturing group.

  • (?:\w+\.)? matches one or more letters, digits and underscores (but not dashes), followed by a single dot. Its purpose is to match the subdomain part before youtube.com. So it matches things like abc.youtube.com, 9734fde_.youtube.com, and even youtube.com with no subdomain: because of the ? at the very end, the subdomain is not required to appear at all.

  • [yY][oO][uU][tT][uU][bB][eE] matches any combination of upper and lowercase characters forming the word "youtube"...

  • (?:-nocookie|kids)?\.com/ matches -nocookie.com/, kids.com/ or plain .com/ after the "youtube" regex above...

  • which makes the full regex match an optional subdomain, then youtube (any variation in case), then optionally -nocookie or kids, then .com/.
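Here is a quick sketch that exercises just this hostname fragment on its own. The fragment is copied from line 380 without the surrounding groups; the candidate hostnames are invented test inputs.

```python
import re

# Hostname fragment from line 380, isolated for testing.
HOST = re.compile(
    r"(?:\w+\.)?[yY][oO][uU][tT][uU][bB][eE](?:-nocookie|kids)?\.com/"
)

candidates = [
    "youtube.com/",
    "www.youtube.com/",
    "YouTube.com/",
    "www.youtube-nocookie.com/",
    "youtubekids.com/",
    "my-site.youtube.com/",   # dash in the subdomain: \w does not cover it
    "youtube.org/",           # wrong top-level domain
]

# fullmatch requires the fragment to account for the whole candidate.
results = {c: HOST.fullmatch(c) is not None for c in candidates}

for candidate, ok in results.items():
    print(f"{candidate:28} {ok}")
```

The first five candidates are accepted and the last two are rejected, which lines up with the bullet points above.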

Lines 381-414

Regexes for the various wrappers around youtube. The | character at the end of each of them stands for “or”, as in match only one of these hostnames.

Line 415

(?:.*?\#/)?: If the URL contains an anchor of the form #/, match it and everything before it. The # is escaped because in (?x) mode a bare # would start a comment. Usually there is no such anchor, but just in case.

At this point, for this input:

https://www.youtube.com/watch?v=PA4gbtKWNAI&feature=youtu.be

We matched https://www.youtube.com/.

Line 416

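(?:(?:v|embed|e)/(?!videoseries)): Matches v/, embed/ or e/, but not if videoseries comes right after it, i.e. things like v/videoseries don't match. The (?!...) construct is a negative look-ahead: it checks what follows without consuming any input. This form of YouTube video URL is not common.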
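A short sketch of this fragment in isolation, to show that the (?!videoseries) part only inspects what comes next and does not become part of the match. The path strings are invented inputs.

```python
import re

# Fragment from line 416, tested against the path part of a URL only.
PREFIX = re.compile(r"(?:(?:v|embed|e)/(?!videoseries))")

plain = PREFIX.match("embed/PA4gbtKWNAI")
series_embed = PREFIX.match("embed/videoseries?list=abc")
series_v = PREFIX.match("v/videoseries")

print(plain.group(0))   # embed/
print(plain.end())      # 6  (the ID itself was not consumed)
print(series_embed)     # None
print(series_v)         # None
```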

Lines 418-423

Matches everything from watch up to and including v=. More specifically, line 419:

(?:(?:watch|movie)(?:_popup)?(?:\.php)?/?)?

matches any of the following: watch, watch_popup, watch.php, watch_popup.php, movie, movie_popup, movie.php and movie_popup.php, each optionally followed by a /. The whole group is optional, so it might match nothing, as in the case of youtube.com/?v=12345123451.

  • Line 420 (?:\?|\#!?) then matches ? or # or #!.

  • Line 421 (?:.*?[&;])?? matches all query parameters before the v=, so if for example you had watch?abc=123&v=12345123451, it would match abc=123&. As you can see in the regex, it also accepts a semicolon as the separator, in case youtube-dl is fed a URL like watch?abc=123;v=12345123451.

The extra question mark in *? and ?? indicates lazy matching: matching only as many characters as needed to fulfill the regex requirements. This regex needs it because .* matches any characters, so with greedy matching (the default) it would run past the first v= whenever a later &v= or ;v= exists in the URL.

  • Finally, line 422 matches v= itself.
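Lines 419 to 422 can be glued together and tried out on their own. The sketch below does that, and also compares the lazy form against a greedy variant that I wrote only for contrast (GREEDY is not in youtube-dl). The inputs are invented.

```python
import re

# Lines 419-422 joined into one pattern (no (?x) needed on a single line).
V_PARAM = re.compile(
    r"(?:(?:watch|movie)(?:_popup)?(?:\.php)?/?)?(?:\?|\#!?)(?:.*?[&;])??v="
)

first = V_PARAM.match("watch?v=PA4gbtKWNAI").group(0)
with_param = V_PARAM.match("watch?abc=123&v=12345123451").group(0)
semicolon = V_PARAM.match("watch?abc=123;v=12345123451").group(0)
bare_query = V_PARAM.match("?v=12345123451").group(0)

print(first)       # watch?v=
print(with_param)  # watch?abc=123&v=
print(semicolon)   # watch?abc=123;v=
print(bare_query)  # ?v=

# Lazy versus greedy on an input that has two v= parameters.
LAZY = r"\?(?:.*?[&;])??v="
GREEDY = r"\?(?:.*[&;])?v="      # hypothetical variant, for contrast only
two_ids = "?v=first&v=second"

lazy_hit = re.match(LAZY, two_ids).group(0)
greedy_hit = re.match(GREEDY, two_ids).group(0)

print(lazy_hit)    # ?v=
print(greedy_hit)  # ?v=first&v=
```

The lazy form stops at the first v= it can use, so the ID that follows is the first one in the URL, whereas the greedy variant skips ahead to the last one.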

At this point, for this input:

https://www.youtube.com/watch?v=PA4gbtKWNAI&feature=youtu.be

We matched https://www.youtube.com/watch?v=.

Line 426

youtu.be/ is matched here. Nothing else interesting above or below these lines.

Line 433

([0-9A-Za-z_-]{11}): Finally, this is where the video ID is captured. It is 11 characters long, and contains alphanumeric characters, underscores and dashes. This is the matching group we actually care about, the second one after the giant group opened at line 378: we want to use this ID outside of the regex.

Note that everything up to this line is optional; you can pass the ID by itself as input to the YoutubeIE class and it instantly recognizes it as a Youtube ID. This might not necessarily work from the youtube-dl command-line though, because it has no way of telling which site the ID is from.
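As a rough illustration of the ID group by itself, the sketch below checks a few strings against the same character class and length. The helper name looks_like_video_id is hypothetical and does not exist in youtube-dl.

```python
import re

# The ID group from line 433, on its own.
VIDEO_ID = re.compile(r"([0-9A-Za-z_-]{11})")


def looks_like_video_id(text):
    """Hypothetical helper: True if text is exactly 11 allowed characters."""
    return VIDEO_ID.fullmatch(text) is not None


good = looks_like_video_id("PA4gbtKWNAI")
too_short = looks_like_video_id("too_short")      # 9 characters
bad_chars = looks_like_video_id("has space!!")    # 11 characters, wrong kind

print(good, too_short, bad_chars)  # True False False
```

This only checks the shape of the string. It says nothing about whether a video with that ID actually exists.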

Line 440

(?(1).+)? Finally, everything after the video ID is matched and discarded. The (?(...)...) construct is a conditional, like in programming languages: it only applies the pattern after the parenthesized condition if the condition is true. The condition (1) is a reference to group 1, the giant group opened at line 378, and it is true when that group took part in the match, in other words when the input was a URL rather than a bare ID. So this matches everything after the ID, but only if a URL prefix came before it. Usually there's nothing like this at the end of URLs, but in the above example we have &feature=youtu.be, which needs to be matched with some sort of catch-all: because of the $ on the last line denoting "end of input", the entire _VALID_URL regex won't match at all if that tail is not consumed.
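To watch the conditional at work, here is a heavily simplified stand-in for _VALID_URL. It keeps only the overall shape (optional prefix as group 1, ID as group 2, conditional tail, end anchor); the prefix I wrote covers just one URL form and is an assumption for the sake of the demo, not the real pattern.

```python
import re

# Simplified stand-in, NOT the real _VALID_URL.
SIMPLIFIED = re.compile(r"""(?x)^
    (                               # group 1: optional URL prefix
        (?:https?://)?(?:www\.)?youtube\.com/watch\?v=
    )?
    ([0-9A-Za-z_-]{11})             # group 2: the video ID
    (?(1).+)?                       # a tail is allowed only if group 1 matched
    $""")

full_url = SIMPLIFIED.match(
    "https://www.youtube.com/watch?v=PA4gbtKWNAI&feature=youtu.be"
)
bare_id = SIMPLIFIED.match("PA4gbtKWNAI")
bare_id_with_tail = SIMPLIFIED.match("PA4gbtKWNAI&feature=youtu.be")

print(full_url.group(2))   # PA4gbtKWNAI
print(bare_id.group(1))    # None  (no prefix, so group 1 did not take part)
print(bare_id.group(2))    # PA4gbtKWNAI
print(bare_id_with_tail)   # None  (tail not allowed without a prefix)
```

The last case is the interesting one: without a URL prefix the conditional matches nothing, the $ then sees leftover text, and the whole match fails.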

There is another member of YoutubeIE called _NEXT_URL_RE. Its job is to catch the encoded YouTube video URL that is put at the end of a second URL which handles things like age verification. Its regex is [\?&]next_url=([^&]+). It matches a ? (in case next_url is the first query parameter) or & (in case next_url is not the first), then next_url= itself, followed by a matching group of everything following it up to and not including the next &, possibly reaching the end of the input.

The next_url parameter is used on the youtube.com domain to redirect you to various other pages; it isn't just for videos.

The member _PLAYER_INFO_RE is used to fetch the ID and extension from embedded YouTube videos, such as https://www.youtube.com/s/player/5e4e8d5d/player_ias.vflset/en_US/base.js. I couldn't exactly figure out what this is, but it looks like the script for embedding a particular video (I could not get its video ID) in a website. Its two regexes are:
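Here is a sketch of the next_url regex applied to a redirect-style URL. The URL below is invented for the example, and decoding the captured value with urllib.parse.unquote is my assumption about what one would do with it next, not a quote from the youtube-dl code.

```python
import re
from urllib.parse import unquote

NEXT_URL_RE = r'[\?&]next_url=([^&]+)'

# Hypothetical redirect URL carrying a percent-encoded target in next_url.
redirect = (
    "https://www.youtube.com/verify_age"
    "?next_url=%2Fwatch%3Fv%3DPA4gbtKWNAI&hl=en"
)

found = re.search(NEXT_URL_RE, redirect)
encoded = found.group(1)     # stops before the next '&'
decoded = unquote(encoded)   # assumption: decode to get the real target

print(encoded)  # %2Fwatch%3Fv%3DPA4gbtKWNAI
print(decoded)  # /watch?v=PA4gbtKWNAI

# A URL without the parameter simply produces no match.
missing = re.search(NEXT_URL_RE, "https://www.youtube.com/watch?v=PA4gbtKWNAI")
print(missing)  # None
```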

r'/(?P<id>[a-zA-Z0-9_-]{8,})/player_ias\.vflset(?:/[a-zA-Z]{2,3}_[a-zA-Z]{2,3})?/base\.(?P<ext>[a-z]+)$'
# and
r'\b(?P<id>vfl[a-zA-Z0-9_-]+)\b.*?\.(?P<ext>[a-z]+)$'
# vfl<one or more ID characters><some other text>.<extension>

As for the regular expressions themselves, it's mostly syntax we've seen above but I want to highlight two operators used here:

  • The \b operator is a zero-width word boundary: it matches at a position where the character on one side (and only one side) is alphanumeric or an underscore. In this case, it means no alphanumeric or underscore may appear immediately to the left of "vfl", and the second \b requires the player ID to end at a similar boundary.

  • (?P<NAME>...) is the syntax for a named matching group. It originated in Python, although some other regex engines have since adopted it. It saves the text matched by the pattern after ?P<NAME> under the name NAME, which is stored in the match object and read with match.group('NAME'); it does not create a Python variable.
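Both regexes can be tried against sample URLs to see the named groups and the \b boundary in action. The first URL is the one quoted earlier; the second URL and the helper name player_info are made up for this sketch and are not claimed to be real.

```python
import re

PLAYER_INFO_RE = (
    r'/(?P<id>[a-zA-Z0-9_-]{8,})/player_ias\.vflset'
    r'(?:/[a-zA-Z]{2,3}_[a-zA-Z]{2,3})?/base\.(?P<ext>[a-z]+)$',
    r'\b(?P<id>vfl[a-zA-Z0-9_-]+)\b.*?\.(?P<ext>[a-z]+)$',
)


def player_info(url):
    """Hypothetical helper: try each pattern, return (id, ext) or None."""
    for pattern in PLAYER_INFO_RE:
        found = re.search(pattern, url)
        if found:
            # Named groups are read by name from the match object.
            return found.group('id'), found.group('ext')
    return None


new_style = player_info(
    "https://www.youtube.com/s/player/5e4e8d5d/player_ias.vflset/en_US/base.js"
)
# Invented URL shaped to fit the second pattern.
old_style = player_info("https://example.invalid/player-vflXGBaUN/en_US/base.js")
# "xvfl...": no word boundary before "vfl", so \b rejects it.
no_boundary = player_info("xvflABC.js")

print(new_style)    # ('5e4e8d5d', 'js')
print(old_style)    # ('vflXGBaUN', 'js')
print(no_boundary)  # None
```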

Closing words

As I study the youtube-dl codebase further I will continue posting here about my discoveries.

If you see any errors in this post, please let me know so I can correct them.
