Disclaimer

I initially posted this article on Substack, but as it doesn't have a proper code formatting tool yet, I decided to move it here.

Introduction

This is the second post of the Software Engineer Interview series. In the first post, I shared how I plan to run the series, along with some general tips and a study guide for interviews.

In this second post, I will start with a challenge that I did a few years ago and now consider simple. I do not remember exactly how I completed it at the time, but as a very junior developer I wanted to “show knowledge” and solved it in multiple languages. I wouldn’t do that now; I would focus on the quality of the code instead.

The Challenge

The challenge consists of parsing an HTTP log file and retrieving the most requested URLs and the most frequent response statuses. It will be a CLI application, and the output is a list of the top 10 URLs and the top 10 response statuses, each followed by the number of times it appeared. Example:

Top 10 URLs:
- https://google.com - 100

Top 10 statuses:
- 200 - 23

The log structure is the following:

One log per line
Each line contains multiple attributes, in the format key=value, each pair separated by a space
Keys:
- level
- response_body
- request_to
- response_headers
- response_status

You can download the log file here.

Solving the Challenge

The first step here is to read the file, either by loading every line or by going through a stream. In Python, which is what I will be using to solve this challenge, you can iterate over the file's lines without holding them all in memory, which keeps memory usage low even for large files.

Let's assume the log is saved in the same folder as your Python code, and the log’s file name is log.txt. To parse it in a stream, without saving all the lines in memory, we can do the following:

# solution.py

with open("log.txt", "r") as log:
    for line in log:
        print(line)

You can run this in your terminal with:

python3 solution.py

If you run this code, you will see a blank line after each printed line. Each line read from the file still ends with a newline character, and print adds another one. We can fix this by calling .strip().

# solution.py

with open("log.txt", "r") as log:
    for line in log:
        print(line.strip())

Now, to the actual parsing of the line data and counting instances, there are multiple ways of doing this. The one I find easiest, and which performs well, is to use regex to parse the line and Python’s Counter to count the elements. Other ways to do this would be:

Splitting the line
- You can split the line by spaces, and then use if checks to see whether the key is one you need (request_to, response_status), then replace the key name and equal sign with an empty string to get the value.
Counting with a dictionary
- You can have a dictionary where the key is the URL or the status code, and the value is the number of times it appeared.
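
As a rough illustration of these two alternatives, here is a minimal sketch. The sample lines and the helper names extract_value and count_values are made up for this example, and it assumes that values never contain spaces, which may not hold for fields such as response_body in the real log.

```python
from typing import Dict, List, Optional

# Hypothetical sample lines; the real log may order or quote its fields differently.
SAMPLE_LINES = [
    'level=info request_to="https://example.com" response_status="200"',
    'level=info request_to="https://example.com" response_status="404"',
    'level=error request_to="https://example.org" response_status="200"',
]


def extract_value(line: str, key: str) -> Optional[str]:
    """Return the value for `key` in a key=value line, or None if absent."""
    prefix = key + "="
    for token in line.split(" "):
        if token.startswith(prefix):
            # Drop the key and equal sign, then the surrounding quotes.
            return token[len(prefix):].strip('"')
    return None


def count_values(lines: List[str], key: str) -> Dict[str, int]:
    """Count how many times each value of `key` appears, using a plain dict."""
    counts: Dict[str, int] = {}
    for line in lines:
        value = extract_value(line.strip(), key)
        if value is not None:
            # dict.get supplies 0 the first time a value is seen.
            counts[value] = counts.get(value, 0) + 1
    return counts


url_counts = count_values(SAMPLE_LINES, "request_to")
status_counts = count_values(SAMPLE_LINES, "response_status")
print(url_counts)
print(status_counts)
```

Splitting on spaces is quick to write, but it breaks as soon as a quoted value contains a space, which is one reason to prefer a regex for this kind of line.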

Let’s start by parsing the line with regex, to obtain the data we need from the line (the URL and the response status).

You need to import Python's re module and create a regex pattern. The pattern groups the values we need, so we can refer back to them later. Since the lines are formatted so that each value starts and ends with a double quote ("), we can use a simple regex to group them:

.*request_to="(.*)".*response_status="(.*)".*

Explanation:

.* will match any character, any number of times
request_to=" will match itself
(.*) will match any character after the previous token, any number of times, and group it
" will match a quote, ending the value
.* again to ignore anything between the tokens
response_status=" will match itself
(.*) will again match anything inside the quotes for the value
" will match a quote, ending the value
.* will match whatever remains up to the end of the line, even if nothing follows the response status
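
One thing worth knowing about this pattern is that (.*) is greedy, so it grabs as much text as it can while still letting the rest of the pattern match. The sketch below uses a made-up line in which another quoted field sits between request_to and response_status; the field order and quoting here are guesses, so this may or may not apply to the real log. The second pattern, using [^"]*, is an alternative I am adding for comparison, not the one used in this post.

```python
import re

# Hypothetical line: field order and quoting are guesses, not taken from the real log.
line = 'level=info request_to="https://example.com" response_headers="a=b" response_status="200"'

# The pattern from this post: (.*) is greedy and may run past the closing quote.
greedy = re.compile(r'.*request_to="(.*)".*response_status="(.*)".*')

# Alternative sketch: [^"]* stops at the first closing quote.
strict = re.compile(r'request_to="([^"]*)".*response_status="([^"]*)"')

greedy_match = greedy.search(line)
strict_match = strict.search(line)

print(greedy_match.group(1))
print(strict_match.group(1), strict_match.group(2))
```

On this sample line the greedy group also captures the response_headers field, while the stricter pattern returns only the URL. If the URLs you extract ever look too long, this is the first thing to check.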

Let's update the code to match the line and get the groups:

# solution.py

import re


regex = re.compile(r".*request_to=\"(.*)\".*response_status=\"(.*)\".*")


def process_line(line: str) -> None:
    match = regex.search(line)
    if match:
        request_to = match.group(1)
        response_status = match.group(2)


with open("log.txt", "r") as log:
    for line in log:
        process_line(line)

As you can see, there were a lot of changes to the code. Let’s break them down:

import re
- This imports Python's re module so we can use the regex functionality
regex = re.compile(r".*request_to=\"(.*)\".*response_status=\"(.*)\".*")
- This creates a variable called regex, which holds the compiled version of the pattern we built before
def process_line
- This is a new function I added so we can separate the concern of processing a line from the concern of reading lines
match = regex.search(line)
- This will call the search method on the regex we compiled, and tell it to search the line we received as the function’s parameter
if match:
- This is checking if the match was successful. If the line doesn’t match our regex, it will not enter the condition
The next two lines will get the values from the regex groups we built before and assign them to variables with appropriate names

Now, we need to count the request_to and response_status, and for this we can use the Counter.

Counter's update method expects an iterable and counts each item it yields. Since a string is itself an iterable of characters, passing the string directly would count its individual characters rather than the whole string, so we need to wrap the value in a list before passing it to update. This is the updated code with the Counter part:
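
To make the difference concrete, here is a small sketch with a made-up status value. The variable names by_character and by_value are only for illustration.

```python
from collections import Counter

by_character = Counter()
by_character.update("200")   # iterates over the string: '2', '0', '0'

by_value = Counter()
by_value.update(["200"])     # iterates over the list: one item, '200'
by_value["200"] += 1         # another way to add a single occurrence

print(by_character)
print(by_value)
```

The first counter ends up with counts per character, while the second counts the whole string. Incrementing with += 1 works because a Counter returns 0 for keys it has not seen yet.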

# solution.py

import re
from collections import Counter


regex = re.compile(r".*request_to=\"(.*)\".*response_status=\"(.*)\".*")
request_to_counter = Counter()
response_status_counter = Counter()


def process_line(line: str) -> None:
    match = regex.search(line)
    if match:
        request_to = match.group(1)
        response_status = match.group(2)
        request_to_counter.update([request_to])
        response_status_counter.update([response_status])


with open("log.txt", "r") as log:
    for line in log:
        process_line(line)

Again, here are the main changes:

from collections import Counter
- This is importing the Counter class into our code
request_to_counter = Counter()
- This is creating a new instance of the Counter and assigning it to the request_to_counter variable
response_status_counter = Counter()
- Same as above
request_to_counter.update([request_to])
- This is telling the request_to_counter to add one more item, and wrapping the value in a list, as explained above
response_status_counter.update([response_status])
- Same as above

If we print out the counters at the end of the script, this is the output:

Counter({'https://eagerhaystack.com': 750, 'https://surrealostrich.com.br': 734, 'https://grimpottery.net.br': 732, 'https://abandonedpluto.com': 731, 'https://easterncobra.com.br': 730, 'https://solidstreet.net': 725, 'https://notoriouslonesome.com': 724, 'https://solidwindshield.net.br': 713, 'https://intensecloud.us': 712, 'https://grotesquemoon.de': 706, 'https://severeleather.com': 693, 'https://endlessiron.com.br': 688, 'https://woodenoyster.com.br': 685, 'https://steepBoomerang.me': 677})

Counter({'404': 1474, '503': 1451, '400': 1440, '500': 1428, '200': 1417, '201': 1402, '204': 1388})

This means we already have the counting part done! We just need to get the top 10 from the counter, and format the output.

To get the top 10 from the counter, we can use the method most_common(n), which returns a list of (key, count) tuples ordered from the most common to the least. We can then use this to format the output:
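
Before plugging it into the solution, here is a short sketch of how most_common behaves on a small Counter. The counts are made up for the example.

```python
from collections import Counter

# Made-up counts, just to show the shape of the result.
status_counter = Counter({"404": 3, "200": 5, "500": 1})

top_two = status_counter.most_common(2)
everything = status_counter.most_common(10)  # asks for more entries than exist

print(top_two)
print(everything)
```

Asking for more entries than the counter holds simply returns all of them, so a "top 10" can legitimately contain fewer than 10 rows when there are fewer distinct values.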

# solution.py

import re
from collections import Counter


regex = re.compile(r".*request_to=\"(.*)\".*response_status=\"(.*)\".*")
request_to_counter = Counter()
response_status_counter = Counter()


def process_line(line: str) -> None:
    match = regex.search(line)
    if match:
        request_to = match.group(1)
        response_status = match.group(2)
        request_to_counter.update([request_to])
        response_status_counter.update([response_status])


with open("log.txt", "r") as log:
    for line in log:
        process_line(line)

top_ten_urls = request_to_counter.most_common(10)
top_ten_status_codes = response_status_counter.most_common(10)

print("Top 10 URLs:")
for url, count in top_ten_urls:
    print(f"{url} - {count}")


print("")
print("Top 10 status codes:")
for status_code, count in top_ten_status_codes:
    print(f"{status_code} - {count}")

Once again, there are a lot of changes in the code. Here they are:

top_ten_urls = request_to_counter.most_common(10)
- This is using the method I mentioned previously to retrieve the top 10 URLs
top_ten_status_codes = response_status_counter.most_common(10)
- Same as above
for url, count in top_ten_urls:
- This is looping through each element in the top 10 URLs. Since each element is a tuple with two elements, we can use the syntax first_var, second_var to unpack both values at once into nicely named variables
for status_code, count in top_ten_status_codes:
- Same as above

The other lines are printing out strings.

Basically, our logic is done. This is the output from running this file:

Top 10 URLs:
https://eagerhaystack.com - 750
https://surrealostrich.com.br - 734
https://grimpottery.net.br - 732
https://abandonedpluto.com - 731
https://easterncobra.com.br - 730
https://solidstreet.net - 725
https://notoriouslonesome.com - 724
https://solidwindshield.net.br - 713
https://intensecloud.us - 712
https://grotesquemoon.de - 706

Top 10 status codes:
404 - 1474
503 - 1451
400 - 1440
500 - 1428
200 - 1417
201 - 1402
204 - 1388

We can still improve the code though, by separating the logic to retrieve the top 10 into different functions, and adding a main block to the file.

# solution.py

import re
from collections import Counter


regex = re.compile(r".*request_to=\"(.*)\".*response_status=\"(.*)\".*")
request_to_counter = Counter()
response_status_counter = Counter()


def process_line(line: str) -> None:
    match = regex.search(line)
    if match:
        request_to = match.group(1)
        response_status = match.group(2)
        request_to_counter.update([request_to])
        response_status_counter.update([response_status])


def process_file() -> None:
    with open("log.txt", "r") as log:
        for line in log:
            process_line(line)


def print_top_ten_urls() -> None:
    top_ten_urls = request_to_counter.most_common(10)

    print("Top 10 URLs:")
    for url, count in top_ten_urls:
        print(f"{url} - {count}")


def print_top_ten_status_codes() -> None:
    top_ten_status_codes = response_status_counter.most_common(10)

    print("Top 10 status codes:")
    for status_code, count in top_ten_status_codes:
        print(f"{status_code} - {count}")


if __name__ == "__main__":
    process_file()
    print_top_ten_urls()
    print("")
    print_top_ten_status_codes()

This is basically just separating the code into functions, to improve readability and separate concerns. The last part, if __name__ == "__main__", checks whether the file is being run directly, for example with python3 solution.py. If the file is instead imported somewhere with import solution, the block is skipped, so the functions can be reused without parsing the log as a side effect.
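
Since the challenge asks for a CLI application, one possible next step is to read the log path from the command line inside that guarded block. This is only a sketch: the helper resolve_log_path and the optional argument are assumptions added for illustration, and they are not part of the solution above.

```python
import sys
from typing import List


def resolve_log_path(argv: List[str]) -> str:
    """Return the log path given on the command line, defaulting to log.txt."""
    # argv[0] is the script name, so the first real argument is argv[1].
    return argv[1] if len(argv) > 1 else "log.txt"


if __name__ == "__main__":
    # Runs for `python3 solution.py [path]`, but not for `import solution`.
    print(f"Would parse: {resolve_log_path(sys.argv)}")
```

Keeping the argument handling in a small function makes it easy to test without actually running the script from a terminal.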

What to Study

A good study guide for this challenge would be:

Regex
Python Counter

Other items you may want to study, which can help with a similar challenge and could be faster to code in a live coding environment, would be:

Python dicts
Sorting dicts with Python
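
As a starting point for the last item, here is a minimal sketch of getting the top entries from a plain dict without Counter. The three counts are copied from the status code output earlier in this post; the variable names are only for illustration.

```python
# A few counts taken from the status code output shown earlier.
counts = {"200": 1417, "404": 1474, "503": 1451}

# Sort the (key, count) pairs by count, highest first, and keep the first two.
top_two = sorted(counts.items(), key=lambda item: item[1], reverse=True)[:2]

for status_code, count in top_two:
    print(f"{status_code} - {count}")
```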

I hope this helps you prepare for your next challenge!

Software Engineer Interviews - #2 Webhook Log Parser

Lucas Queiroz