Jordan Scrapes SteamDB

aarmora

Jordan Hansen

Posted on October 17, 2019


Demo code here

A request

except I did agree to it

This request comes straight at you from u/Jimmyxavi. It looks like he’s working on a project for university and wanted to get the Steam file size for all early access games.

So, here we goooooooooo….

Puppeteer was my weapon of choice for this scrape. I’ve written about it a few times and it’s still one of my favorite tools. I probably could have done the scrape a bit quicker with Axios, but any time I’m going to hit a website thousands of times, I kind of like the built-in slowdown that Puppeteer gives me. It also makes it easy to do some of the interactions that were helpful here, like changing a dropdown.

The gatekeeper called Algolia

I dug around on steamdb.info to see if I could navigate directly to any pages. At first I tried the instant search beta, which is a really cool tool but kind of crappy for web scraping. It uses something called Algolia, which is similar to Elasticsearch and makes for very powerful, fast searching.

I just so happened to have discussed Algolia with my good friend Matt (see his cool packaging company Citadel Packaging) two weeks ago. I was looking for some tools to improve the search over at Cobalt Intelligence (great business leads there!) and Algolia was one of the things that he suggested.

Algolia is built for quick searching but limits the total results for any query to 1,000. Even if I don’t pass a query at all, I can’t get more than 1,000 results, even though the total amount is closer to 5,000. I tinkered around with it a little bit but decided to go with their other search option.

Enter their old search

Here is the first helpful link: https://steamdb.info/search/?a=app&q=&type=1&category=666

I’m guessing type 1 is “Game” and category 666 is “Early Access”. As you can see, this page offers 4,249 games, but by default it only shows 25 results. This is where Puppeteer shines. With a command as simple as await page.select('#table-sortable_length select', '-1'); I can set the dropdown to whatever value I want. In this case, -1 is ‘All’.
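If you want to play with the filters, the search link can be put together from its query parameters. This is only a sketch: `buildSearchUrl` is a hypothetical helper that is not in the demo code, and what `type` and `category` mean is still just my guess.

```typescript
// Hypothetical helper: builds the old-search URL from its query parameters.
// The parameter names (a, q, type, category) are copied from the link in the post;
// what the numeric values mean is a guess.
function buildSearchUrl(type: number, category: number, query = ''): string {
    const url = new URL('https://steamdb.info/search/');
    url.searchParams.set('a', 'app');
    url.searchParams.set('q', query);
    url.searchParams.set('type', String(type));
    url.searchParams.set('category', String(category));
    return url.toString();
}

// Prints https://steamdb.info/search/?a=app&q=&type=1&category=666
console.log(buildSearchUrl(1, 666));
```

Passing the result to `page.goto` should land on the same listing as the link.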

From here, I loop through each row and get the URL and the name for each app. I push them into an array that I later loop through to open each stored page.

    // `apps` holds the table rows selected earlier; getPropertyBySelector is a
    // helper defined elsewhere in the demo code.
    const appsInfo: any[] = [];
    for (const app of apps) {
        const url = await getPropertyBySelector(app, 'a', 'href');
        const name = await getPropertyBySelector(app, 'td:nth-of-type(3)', 'innerHTML');

        appsInfo.push({ url, name });
    }

The next helpful link is https://steamdb.info/app/570/depots/ which is the depots page for one of the best games ever invented, Dota 2. This is where the size information lives. As you can see, it lists a bunch of builds and the size of each.


import { Page } from 'puppeteer';

// getPropertyBySelector is a helper defined elsewhere in the demo code.
export async function handleDepots(app: any, page: Page) {
    // Assumes app.url ends with a trailing slash.
    await page.goto(`${app.url}depots/`);

    const table = await page.$('#depots table:first-of-type tbody');

    // Bail out if the page has no depots table.
    if (!table) {
        return;
    }
    const depots = await table.$$('tr');

    console.log('depots length', depots.length);

    for (let i = 0; i < depots.length; i++) {
        // Displayed size (cell text) and raw size (data-sort attribute) come from the same cell.
        const depotSize = await depots[i].$eval('[data-sort]', elem => elem.textContent);
        const actualDepotSize = await depots[i].$eval('[data-sort]', elem => elem.getAttribute('data-sort'));
        const depotName = await getPropertyBySelector(depots[i], 'td:nth-of-type(2)', 'innerHTML');

        const macRow = await depots[i].$('.icon-macos');

        // Skip macOS depots. Keys use the row index, so numbering can have gaps.
        if (!macRow) {
            app[`depot${i + 1}Size`] = depotSize;
            app[`depot${i + 1}ActualSize`] = actualDepotSize;
            app[`depot${i + 1}Name`] = depotName;
        }
    }
}

This function is to handle the depot page. It navigates to that page and then finds the depots table with const table = await page.$('#depots table:first-of-type tbody');. Then it loops through the rows and gets the size of the specific depot and the depot name.
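One thing worth noting is that the template string passed to `page.goto` only works if the scraped app URL ends with a slash. A small guard could look like the sketch below. `depotsUrl` is a hypothetical helper that is not part of the demo code.

```typescript
// Hypothetical helper: appends "depots/" to an app URL whether or not
// the app URL already ends with a slash.
function depotsUrl(appUrl: string): string {
    const base = appUrl.endsWith('/') ? appUrl : `${appUrl}/`;
    return `${base}depots/`;
}

// Both calls print https://steamdb.info/app/570/depots/
console.log(depotsUrl('https://steamdb.info/app/570/'));
console.log(depotsUrl('https://steamdb.info/app/570'));
```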

I had a bit of a tricky part with this because the actual depot size is stored in a data-sort attribute, which is slightly different from the displayed value. I would guess the data-sort attribute is the correct one because that is what the column is sorted by. It was also kind of tricky to pull from the attribute, and I ended up having to use const actualDepotSize = await depots[i].$eval('[data-sort]', elem => elem.getAttribute('data-sort')); instead of my normal helper function.
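Here is a rough sketch of why the two values could differ. It assumes the data-sort value is a raw byte count and the displayed value is a rounded, human-readable version of it. I have not confirmed that, and the unit labels and two-decimal rounding are only illustrative. `parseRawSize` and `formatBytes` are hypothetical names.

```typescript
// Parses an attribute string such as "1572864" into a number.
// Returns null when the attribute is missing, empty, or not numeric.
function parseRawSize(attr: string | null): number | null {
    if (attr === null || attr.trim() === '') {
        return null;
    }
    const parsed = Number(attr);
    return Number.isFinite(parsed) ? parsed : null;
}

// Assumption: the raw value is a byte count. Units and rounding are illustrative,
// not necessarily what SteamDB displays.
function formatBytes(bytes: number): string {
    const units = ['B', 'KiB', 'MiB', 'GiB', 'TiB'];
    let value = bytes;
    let unit = 0;
    while (value >= 1024 && unit < units.length - 1) {
        value /= 1024;
        unit++;
    }
    return `${value.toFixed(2)} ${units[unit]}`;
}

// Two different raw sizes collapse to the same rounded text.
console.log(formatBytes(1572864)); // 1.50 MiB
console.log(formatBytes(1572900)); // 1.50 MiB
```

If that assumption holds, the raw attribute is the better value to keep for any math, and the displayed text is the nicer one to read.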

The end

And there we have it. After it all completes (and it took close to 70 minutes!), it outputs to a CSV file.


    // Convert the collected rows to CSV and write them to disk.
    const csv = json2csv.parse(appsInfo);

    fs.writeFile('steamApps.csv', csv, (err) => {
        if (err) {
            console.log('err while saving file', err);
        }
    });
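
Because the Mac depots get skipped and apps have different numbers of depots, the rows do not all share the same depot columns. The sketch below is not how json2csv works internally. It is a dependency-free illustration of what a CSV conversion has to deal with for rows like these: collecting the union of column names and quoting cells that contain commas. All of the names in it are hypothetical.

```typescript
type Row = Record<string, string | number | null | undefined>;

// Collects every column name that appears in any row, in first-seen order.
function collectColumns(rows: Row[]): string[] {
    const seen = new Set<string>();
    for (const row of rows) {
        for (const key of Object.keys(row)) {
            seen.add(key);
        }
    }
    return Array.from(seen);
}

// Quotes a cell when it contains a comma, a double quote, or a line break.
function escapeCell(value: string | number | null | undefined): string {
    const text = value === null || value === undefined ? '' : String(value);
    return /[",\r\n]/.test(text) ? `"${text.replace(/"/g, '""')}"` : text;
}

// Builds a header line plus one line per row; missing values become empty cells.
function toCsv(rows: Row[]): string {
    const columns = collectColumns(rows);
    const lines = [columns.map(escapeCell).join(',')];
    for (const row of rows) {
        lines.push(columns.map((col) => escapeCell(row[col])).join(','));
    }
    return lines.join('\n');
}

console.log(toCsv([
    { name: 'A, the game', url: 'u1', depot1Size: '1 GiB' },
    { name: 'B', url: 'u2', depot2Size: '2 GiB' },
]));
```

Apps that lack a given depot column simply get an empty cell in that position.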


Demo code here

Looking for business leads?

Using the techniques talked about here at javascriptwebscrapingguy.com, we’ve been able to launch a way to access awesome business leads. Learn more at Cobalt Intelligence!

The post Jordan Scrapes SteamDB appeared first on JavaScript Web Scraping Guy.
