String Manipulation of URLs is an Anti-Pattern.
Mike Stemle
Posted on August 31, 2021
Quick note before we get started: this piece is Node-centric in its examples, but this anti-pattern is polyglottal. As with most anti-patterns, this isn't about the syntax, it is about the approach.
What's a URL, really?
A URL is a useful thing. It tells both humans and machines where to find resources on the internet. There's a lot of information packed into a URL, from protocol designations to document anchors, and when we treat it like a string we're steering into danger.
A URL is a packed value. It contains an awful lot of data:
- Protocol scheme
- Host name
- Port number
- Path
- File name
- Search parameters (a.k.a. query string parameters)
- Anchor (which can also be used for parameters)
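To make the "packed value" idea concrete, here is a minimal sketch that unpacks a URL with the WHATWG `URL` class that Node exposes as a global. The address itself is hypothetical, chosen only so that every component in the list above is present. The API has no dedicated file-name field, so the sketch assumes the file name is the last segment of the already-parsed path.

```javascript
// Hypothetical URL, chosen only to exercise every component listed above.
const example = new URL(
  "https://example.com:8443/docs/guide.html?lang=en&page=2#install"
);

const parts = {
  protocol: example.protocol, // "https:"
  hostname: example.hostname, // "example.com"
  port: example.port, // "8443"
  pathname: example.pathname, // "/docs/guide.html"
  // Assumption: treat the last path segment as the file name.
  fileName: example.pathname.split("/").pop(), // "guide.html"
  search: Object.fromEntries(example.searchParams), // { lang: "en", page: "2" }
  hash: example.hash, // "#install"
};

console.log(parts);
```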
The Problem with String Manipulation
Based on your specific needs, a URL may contain several reserved characters. Some of these characters include `?`, `#`, `=`, `&`, `%`, `:`, `,` and `/`. This is not an exhaustive list. Having these characters in the wrong place within your URL can cause misunderstanding.
A good implementation should be flexible enough to deal with any reasonable inputs, and capable of failing predictably when inputs are not reasonable. Packed values, like a URL, need to be treated like packed values, and not handled using string manipulation.
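A minimal sketch of how this goes wrong, assuming two hypothetical services: `share.example`, which accepts a `url` parameter, and `forum.example`, whose search URL carries its own `q` and `restrict_sr` parameters. The inner URL is pasted into the outer one with string concatenation, and the result is then parsed the way a receiving server might parse it.

```javascript
// Hypothetical inner URL that carries its own query string.
const target = "https://forum.example/search?q=urls&restrict_sr=1";

// Anti-pattern: paste the inner URL into the outer one as plain text.
const naive = "https://share.example/submit?url=" + target;

// Parse the result the way the receiving side would.
const seen = new URL(naive).searchParams;

console.log(seen.get("url")); // https://forum.example/search?q=urls
console.log(seen.get("restrict_sr")); // 1
```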
You can see here how the `q` is seen as part of the URL, but `restrict_sr` is interpreted as another URL parameter parallel to `url`. While it may be tempting to simply use a function to URL-encode this, I would like to encourage you to reconsider. These URL encoding methods aren't great for all of the possible characters that you'd want to put in there, and they're likely to make a bunch of assumptions that aren't going to be true.
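As an illustration, suppose the function reached for is `encodeURI` (an assumption; the hosts are hypothetical as well). `encodeURI` is meant for whole URLs, so it deliberately leaves reserved characters such as `?`, `&` and `=` alone, and the nested URL comes out unchanged.

```javascript
// Hypothetical inner URL that carries its own query string.
const target = "https://forum.example/search?q=urls&restrict_sr=1";

// encodeURI leaves ?, & and = untouched, so nothing changes here.
const encodedTarget = encodeURI(target);
console.log(encodedTarget === target); // true

// The outer URL is therefore just as ambiguous as before.
const attempt = "https://share.example/submit?url=" + encodedTarget;
const seen = new URL(attempt).searchParams;

console.log(seen.get("url")); // https://forum.example/search?q=urls
console.log(seen.get("restrict_sr")); // 1
```

`encodeURIComponent` does escape these characters, but it only helps if it is applied to each value, exactly once, at every place a URL is assembled.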
A Better Approach
Here you can see that encoding the URL didn't solve the problem. Let's try a different approach: let's use the URL API.
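A minimal sketch of that approach, again assuming the hypothetical `share.example` and `forum.example` hosts. The outer URL is built as a `URL` object and the inner URL is handed to `searchParams.set`, which takes care of the escaping.

```javascript
// Hypothetical inner URL that carries its own query string.
const target = "https://forum.example/search?q=urls&restrict_sr=1";

// Build the outer URL as a structured value, not as a string.
const outer = new URL("https://share.example/submit");
outer.searchParams.set("url", target);

console.log(outer.href);
// https://share.example/submit?url=https%3A%2F%2Fforum.example%2Fsearch%3Fq%3Durls%26restrict_sr%3D1

// Round trip: the receiving side gets the inner URL back intact.
const received = new URL(outer.href).searchParams;
console.log(received.get("url") === target); // true
console.log(received.has("restrict_sr")); // false
```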
By using the URL API here, you can see that the URL which is being used as a parameter is safely tucked away, and you don't have to worry about it being confused.
Why does this matter?
String manipulation of URLs causes two primary problems: bugs and URL injection vulnerabilities.
Poorly-encoded URLs make it difficult for web servers and applications to understand the parameters coming to them. If they cannot reliably understand their inputs, there may be unexpected or unwanted behavior.
URLs which are constructed using predictable string manipulation also pose a very real risk of URL injection. URL injection can lead to SQL injection, NoSQL injection, cross-site scripting (XSS), and a whole host of other security holes.
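A minimal sketch of parameter injection, assuming a hypothetical `api.example` endpoint and two illustrative helpers, `unsafeLookupUrl` and `safeLookupUrl`. The hostile input smuggles in a second parameter when the URL is concatenated, and stays an ordinary value when the URL API is used.

```javascript
// Hypothetical helper: builds the URL with string concatenation.
function unsafeLookupUrl(name) {
  return "https://api.example/users?name=" + name;
}

// Hypothetical helper: builds the same URL through the URL API.
function safeLookupUrl(name) {
  const url = new URL("https://api.example/users");
  url.searchParams.set("name", name);
  return url;
}

// Attacker-controlled input that tries to add its own parameter.
const hostile = "alice&role=admin";

const unsafeParams = new URL(unsafeLookupUrl(hostile)).searchParams;
console.log(unsafeParams.get("name")); // alice
console.log(unsafeParams.get("role")); // admin

const safeParams = safeLookupUrl(hostile).searchParams;
console.log(safeParams.get("name")); // alice&role=admin
console.log(safeParams.get("role")); // null
```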
Conclusion
A URL isn't a string. Much like the packed bit fields of yore, it is a packed value. Don't treat it like a string; treat it like a first-class object or structure. And never write your own URL parser: nearly every language has a good URL library that you can use.