Software Engineer Interviews - #2 Webhook Log Parser
Lucas Queiroz
Posted on November 15, 2024
Disclaimer
I initially posted this article on Substack, but as it doesn't have a proper code formatting tool yet, I decided to move it here.
Introduction
This is the second post of the Software Engineer Interviews series. In the first post, I shared how I plan to run the series, along with some general tips and a study guide for interviews.
In this second post, I will start with a challenge that I did a few years ago and now consider simple. I do not remember exactly how I completed it at the time, but being a very junior developer, I wanted to "show knowledge" and solved it in multiple languages. I wouldn't do that now; I would focus on the quality of the code instead.
The Challenge
The challenge consists of parsing an HTTP log file, and retrieving the most requested URLs and statuses. It will be a CLI application, and the output is a list with the top 10 URLs and the top 10 response statuses, followed by the number of responses. Example:
Top 10 URLs:
- https://google.com - 100
Top 10 statuses:
- 200 - 23
The log structure is the following:
- One log per line
- Each line contains multiple attributes, in the format key=value, each pair separated by a space
- Keys:
  - level
  - response_body
  - request_to
  - response_headers
  - response_status
You can download the log file here.
Solving the Challenge
The first step here is to parse the file, either by reading every line or by going through a stream. In Python, which is what I will be using to solve this challenge, you can iterate over the file's lines without loading them all into memory, which keeps memory usage low even for large log files.
Let's assume the log is saved in the same folder as your Python code, and the log’s file name is log.txt. To parse it in a stream, without saving all the lines in memory, we can do the following:
# solution.py
with open("log.txt", "r") as log:
    for line in log:
        print(line)
You can run this in your terminal with:
python3 solution.py
If you run this code, you will see a blank line after each printed line. This happens because each line still ends with its newline character, and print adds another one. We can fix this by calling .strip():
# solution.py
with open("log.txt", "r") as log:
    for line in log:
        print(line.strip())
Now, to the actual parsing of the line data and counting instances. There are multiple ways of doing this. The one I find easier, and which I believe performs better, is to use regex to parse the line and Python's Counter to count the elements. Other ways to do this would be:
- Splitting the line
  - You can split the line by spaces, then use if checks to see whether the key is one you need (request_to, response_status), and then replace the key name and equal sign with an empty string to get the value.
- Counting with a dictionary
  - You can have a dictionary where the key is the URL or the status code, and the value is the number of times it appeared.
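As a rough sketch of the splitting alternative described above (not the approach used in the rest of this post), the function below splits on spaces and keeps only the two keys we care about. The function name parse_with_split and the sample line are made up for illustration, and the sketch assumes that values never contain spaces, which may not hold for keys such as response_body.

```python
from typing import Optional, Tuple


def parse_with_split(line: str) -> Tuple[Optional[str], Optional[str]]:
    """Return (request_to, response_status), or None for a missing key."""
    request_to = None
    response_status = None
    for pair in line.strip().split(" "):
        if pair.startswith("request_to="):
            # Drop the key and the equal sign, then the surrounding quotes.
            request_to = pair.replace("request_to=", "", 1).strip('"')
        elif pair.startswith("response_status="):
            response_status = pair.replace("response_status=", "", 1).strip('"')
    return request_to, response_status


# Hypothetical line, not taken from the real log file.
sample = 'level=info request_to="https://example.com" response_status="200"'
print(parse_with_split(sample))
```

If a value did contain a space, it would be cut into several pieces by split, which is one reason to prefer the regex approach that follows.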
Let’s start by parsing the line with regex, to obtain the data we need from the line (the URL and the response status).
You need to import Python's re module and create a regex pattern. The pattern groups the values we need, so we can refer back to them later. Since the lines are formatted so that each value starts and ends with a double quote ("), we can use a simple regex to group them:
.*request_to="(.*)".*response_status="(.*)".*
Explanation:
- .* will match any character, any number of times
- request_to=" will match itself
- (.*) will match any character after the previous token, any number of times, and group it
- " will match a quote, ending the value
- .* again to ignore anything between the tokens
- response_status=" will match itself
- (.*) will again match anything inside the quotes for the value
- " will match a quote, ending the value
- .* will match the rest of the line, even if there is nothing after the response status
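One thing worth knowing about this pattern is that (.*) is greedy: it grabs as much as it can and only gives characters back when the rest of the pattern fails. The sketch below compares the pattern above with a variant that uses [^"]* so a group cannot run past its closing quote. The sample line is invented, and it deliberately places another quoted value between the two keys; the real log file may never do that, in which case both patterns would return the same groups.

```python
import re

# The pattern from this post, with greedy groups.
greedy = re.compile(r'.*request_to="(.*)".*response_status="(.*)".*')

# A variant (an assumption, not part of the original solution):
# [^"]* matches anything except a quote, so each group stops at its own value.
bounded = re.compile(r'request_to="([^"]*)".*response_status="([^"]*)"')

# Hypothetical line with a quoted value between the two keys.
sample = (
    'level=info request_to="https://example.com" '
    'response_headers="a=b" response_status="200"'
)

print(greedy.search(sample).groups())
print(bounded.search(sample).groups())
```

On this made-up line, the greedy pattern's first group also swallows the response_headers pair, while the bounded one returns only the URL.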
Let's update the code to match the line and get the groups:
# solution.py
import re

regex = re.compile(r".*request_to=\"(.*)\".*response_status=\"(.*)\".*")


def process_line(line: str) -> None:
    match = regex.search(line)
    if match:
        request_to = match.group(1)
        response_status = match.group(2)


with open("log.txt", "r") as log:
    for line in log:
        process_line(line)
As you can see, there were a lot of changes to the code. Let's break them down:
- import re
  - This imports Python's re module so we can use the regex functionality
- regex = re.compile(r".*request_to=\"(.*)\".*response_status=\"(.*)\".*")
  - This creates a variable called regex, which holds the compiled version of the regex we built before
- def process_line
  - This is a new function I added so we can separate the concern of processing a line from the concern of reading the file
- match = regex.search(line)
  - This calls the search function on the regex we compiled, and tells it to search the line we received in the function's parameter
- if match:
  - This checks whether the match was successful. If the line doesn't match our regex, search returns None and the code inside the condition is skipped
- The next two lines get the values from the regex groups we built before and assign them to variables with appropriate names
Now, we need to count the request_to and response_status values, and for this we can use Counter.
Counter's update method expects an iterable of items to count, and a string is itself an iterable of characters. If we pass the string directly, it will count the characters in the string, and not the entire string, so we need to wrap the value in a list before passing it to update. This is the updated code with the Counter part:
# solution.py
import re
from collections import Counter

regex = re.compile(r".*request_to=\"(.*)\".*response_status=\"(.*)\".*")
request_to_counter = Counter()
response_status_counter = Counter()


def process_line(line: str) -> None:
    match = regex.search(line)
    if match:
        request_to = match.group(1)
        response_status = match.group(2)
        request_to_counter.update([request_to])
        response_status_counter.update([response_status])


with open("log.txt", "r") as log:
    for line in log:
        process_line(line)
Again, here are the main changes:
- from collections import Counter
  - This is importing the Counter class into our code
- request_to_counter = Counter()
  - This is creating a new instance of the Counter and assigning it to the request_to_counter variable
- response_status_counter = Counter()
  - Same as above
- request_to_counter.update([request_to])
  - This is telling the request_to_counter to add one more item, and wrapping the value in a list, as explained above
- response_status_counter.update([response_status])
  - Same as above
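To make the reason for the list wrapping concrete, here is a small standalone sketch comparing the two calls. The variable names by_chars and by_value are only for illustration, and the last line shows an equivalent way to add one to a count that some people find easier to read.

```python
from collections import Counter

by_chars = Counter()
by_chars.update("200")      # iterates over the string: '2', '0', '0'

by_value = Counter()
by_value.update(["200"])    # a list with one item: the whole string
by_value["200"] += 1        # same effect as another update(["200"])

print(by_chars)
print(by_value)
```

The first counter ends up with per-character counts, while the second holds a single key, "200", counted twice.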
If we print the counters at the end of the script, this is the output we receive:
Counter({'https://eagerhaystack.com': 750, 'https://surrealostrich.com.br': 734, 'https://grimpottery.net.br': 732, 'https://abandonedpluto.com': 731, 'https://easterncobra.com.br': 730, 'https://solidstreet.net': 725, 'https://notoriouslonesome.com': 724, 'https://solidwindshield.net.br': 713, 'https://intensecloud.us': 712, 'https://grotesquemoon.de': 706, 'https://severeleather.com': 693, 'https://endlessiron.com.br': 688, 'https://woodenoyster.com.br': 685, 'https://steepBoomerang.me': 677})
Counter({'404': 1474, '503': 1451, '400': 1440, '500': 1428, '200': 1417, '201': 1402, '204': 1388})
This means we already have the counting part done! We just need to get the top 10 from the counter, and format the output.
To get the top 10 from the counter, we can use the method most_common(n), which returns a list of tuples ordered from the most common to the least common, each tuple containing the key and its count. We can then use this to format the output:
# solution.py
import re
from collections import Counter

regex = re.compile(r".*request_to=\"(.*)\".*response_status=\"(.*)\".*")
request_to_counter = Counter()
response_status_counter = Counter()


def process_line(line: str) -> None:
    match = regex.search(line)
    if match:
        request_to = match.group(1)
        response_status = match.group(2)
        request_to_counter.update([request_to])
        response_status_counter.update([response_status])


with open("log.txt", "r") as log:
    for line in log:
        process_line(line)

top_ten_urls = request_to_counter.most_common(10)
top_ten_status_codes = response_status_counter.most_common(10)

print("Top 10 URLs:")
for url, count in top_ten_urls:
    print(f"{url} - {count}")

print("")

print("Top 10 status codes:")
for status_code, count in top_ten_status_codes:
    print(f"{status_code} - {count}")
Once again, there are a lot of changes in the code. Here they are:
- top_ten_urls = request_to_counter.most_common(10)
  - This is using the method I mentioned previously to retrieve the top 10 URLs
- top_ten_status_codes = response_status_counter.most_common(10)
  - Same as above
- for url, count in top_ten_urls:
  - This is looping through each element in the top 10 URLs. Since each element is a tuple with two elements, we can use the syntax first_var, second_var to unpack both values at the same time into nicely named variables
- for status_code, count in top_ten_status_codes:
  - Same as above
The other lines are printing out strings.
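Here is a small standalone sketch of how most_common behaves, using made-up counts rather than the real log data. It also shows that asking for more entries than the counter holds simply returns everything, which is why a "top 10" list can come back with fewer than 10 rows.

```python
from collections import Counter

# Made-up counts, for illustration only.
status_counter = Counter({"404": 3, "200": 5, "500": 1})

# The two most common entries, highest count first.
top_two = status_counter.most_common(2)
for status_code, count in top_two:
    print(f"{status_code} - {count}")

# Asking for 10 when only 3 keys exist returns all 3.
everything = status_counter.most_common(10)
print(len(everything))
```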
Basically, our logic is done. This is the output from running this file:
Top 10 URLs:
https://eagerhaystack.com - 750
https://surrealostrich.com.br - 734
https://grimpottery.net.br - 732
https://abandonedpluto.com - 731
https://easterncobra.com.br - 730
https://solidstreet.net - 725
https://notoriouslonesome.com - 724
https://solidwindshield.net.br - 713
https://intensecloud.us - 712
https://grotesquemoon.de - 706
Top 10 status codes:
404 - 1474
503 - 1451
400 - 1440
500 - 1428
200 - 1417
201 - 1402
204 - 1388
We can still improve the code, though, by moving the logic that retrieves each top 10 into its own function and adding a main entry point to the file.
# solution.py
import re
from collections import Counter

regex = re.compile(r".*request_to=\"(.*)\".*response_status=\"(.*)\".*")
request_to_counter = Counter()
response_status_counter = Counter()


def process_line(line: str) -> None:
    match = regex.search(line)
    if match:
        request_to = match.group(1)
        response_status = match.group(2)
        request_to_counter.update([request_to])
        response_status_counter.update([response_status])


def process_file() -> None:
    with open("log.txt", "r") as log:
        for line in log:
            process_line(line)


def print_top_ten_urls() -> None:
    top_ten_urls = request_to_counter.most_common(10)
    print("Top 10 URLs:")
    for url, count in top_ten_urls:
        print(f"{url} - {count}")


def print_top_ten_status_codes() -> None:
    top_ten_status_codes = response_status_counter.most_common(10)
    print("Top 10 status codes:")
    for status_code, count in top_ten_status_codes:
        print(f"{status_code} - {count}")


if __name__ == "__main__":
    process_file()
    print_top_ten_urls()
    print("")
    print_top_ten_status_codes()
This is basically just separating the code into functions, to improve readability and separate concerns. The last part, if __name__ == "__main__", checks whether the file is being run directly (for example, with python3 solution.py). If the file is imported somewhere with import solution, the code inside that block will not run.
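One optional variation, which is not part of the solution above: the functions there share two module-level counters. A possible alternative is a function that takes any iterable of lines and returns its own counters, which makes it easy to try on a few lines without touching a file. The name count_fields and the sample lines below are assumptions made for this sketch.

```python
import re
from collections import Counter
from typing import Iterable, Tuple

# Same pattern as in the solution above.
regex = re.compile(r".*request_to=\"(.*)\".*response_status=\"(.*)\".*")


def count_fields(lines: Iterable[str]) -> Tuple[Counter, Counter]:
    """Count URLs and status codes, returning (urls, statuses)."""
    urls: Counter = Counter()
    statuses: Counter = Counter()
    for line in lines:
        match = regex.search(line)
        if match:
            urls[match.group(1)] += 1
            statuses[match.group(2)] += 1
    return urls, statuses


# Hypothetical lines; an open file object could be passed in the same way.
sample_lines = [
    'level=info request_to="https://example.com" response_status="200"',
    'level=info request_to="https://example.org" response_status="404"',
    'level=info request_to="https://example.com" response_status="200"',
    "this line does not match and is skipped",
]

urls, statuses = count_fields(sample_lines)
print(urls.most_common(10))
print(statuses.most_common(10))
```

Because a file object is also an iterable of lines, the same function would work with the result of open("log.txt", "r") in place of sample_lines.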
What to Study
A good study guide for this challenge would be:
- Regex
- Python Counter
Other items you may want to study, which can help with a similar challenge and could be faster to code in a live coding environment, would be:
- Python dicts
- Sorting dicts with Python
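For the last two items, a minimal sketch of counting with a plain dict and then sorting it could look like the following. The helper names count_with_dict and top_n, and the sample data, are invented for this example.

```python
from typing import Dict, Iterable, List, Tuple


def count_with_dict(values: Iterable[str]) -> Dict[str, int]:
    """Count how many times each value appears."""
    counts: Dict[str, int] = {}
    for value in values:
        # get() returns 0 the first time a value is seen.
        counts[value] = counts.get(value, 0) + 1
    return counts


def top_n(counts: Dict[str, int], n: int = 10) -> List[Tuple[str, int]]:
    """Return the n (key, count) pairs with the highest counts."""
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)[:n]


# Made-up status codes, for illustration only.
statuses = ["200", "404", "200", "500", "404", "200"]
counts = count_with_dict(statuses)
print(top_n(counts, 2))
```

This does by hand what Counter and most_common do for us in the solution above.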
I hope this helps you prepare for your next challenge!