Write your first Web Scraper with Dart

creativ_bracket

Jermaine

Posted on December 17, 2018

In this tutorial, we will write a web scraper to query the Hacker News homepage for a list of the latest articles with their URLs. We will produce a JSON string containing our results based on the scraped data. We will also cover this with some unit tests!

This is a great tutorial to follow along with, especially if you've been meaning to sink your teeth into the Dart language and tooling.

→ Get the source code


What is a Web Scraper?

Web scrapers are scripts that extract data from websites. This usually happens by performing a GET request to the web page and then parsing the HTML response to retrieve the desired content.


1. Generate a console project

Create a directory for your project:

$ mkdir hacker_news_scraper && cd hacker_news_scraper

Use the stagehand package to generate a console application:

$ pub global activate stagehand # If you don't have it installed
$ stagehand console-full

Add the http and html dependencies in the pubspec.yaml file:

dependencies:
  html: ^0.13.3+3
  http: ^0.12.0

The http package provides a Future-based API for making requests. The html package contains helpers to parse HTML5 strings using a DOM-inspired API. It's a port of html5lib from Python.

And install the added dependencies:

$ pub get

Following these instructions correctly should give you the file/folder structure below:

[Screenshot: project structure generated with Stagehand]

2. Implement the script

Empty the contents of lib/hacker_news_scraper.dart, for we shall start from scratch☝️

Import our installed dependencies:

import 'dart:convert'; // Contains the JSON encoder

import 'package:http/http.dart'; // Contains a client for making API calls
import 'package:html/parser.dart'; // Contains HTML parsers to generate a Document object
import 'package:html/dom.dart'; // Contains DOM related classes for extracting data from elements

Create a function after our imports to contain our logic:

initiate() async {}

The http package contains a Client class for making HTTP calls. Create an instance and perform a GET request to the Hacker News homepage:

Future initiate() async {
  var client = Client();
  Response response = await client.get('https://news.ycombinator.com');

  print(response.body);
}

To test this out, go to bin/main.dart and invoke the initiate method:

import 'package:hacker_news_scraper/hacker_news_scraper.dart' as hacker_news_scraper;

void main(List<String> arguments) async {
  print(await hacker_news_scraper.initiate());
}

Run this file:

$ dart bin/main.dart

Below is an extract of the response:

<html op="news"><head><meta name="referrer" content="origin"><meta name="viewport" content="width=device-width, initial-scale=1.0"> 
...
...
<table border="0" cellpadding="0" cellspacing="0" class="itemlist">
              <tr class='athing' id='18678314'>
      <td align="right" valign="top" class="title"><span class="rank">1.</span></td>      <td valign="top" class="votelinks"><center><a id='up_18678314' href='vote?id=18678314&amp;how=up&amp;goto=news'><div class='votearrow' title='upvote'></div></a></center></td><td class="title"><a href="http://vmls-book.stanford.edu/" class="storylink">Introduction to Applied Linear Algebra: Vectors, Matrices, and Least Squares</a><span class="sitebit comhead"> (<a href="from?site=stanford.edu"><span class="sitestr">stanford.edu</span></a>)</span></td></tr><tr><td colspan="2"></td><td class="subtext">
        <span class="score" id="score_18678314">381 points</span> by <a href="user?id=yarapavan" class="hnuser">yarapavan</a> <span class="age"><a href="item?id=18678314">8 hours ago</a></span> <span id="unv_18678314"></span> | <a href="hide?id=18678314&amp;goto=news">hide</a> | <a href="item?id=18678314">37&nbsp;comments</a>              </td></tr>
...
...

In order to know what to look for, we need to know how to select the links on the page:

[Screenshot: inspecting the Hacker News markup in Firefox]

It appears that each link is in a table cell and has the class "storylink". This means that we can use this CSS selector to select them: td.title > a.storylink

In lib/hacker_news_scraper.dart, rather than printing the response body in the initiate function, let's parse the body and select our elements using the helpers from the html package.

Future initiate() async {
  var client = Client();
  Response response = await client.get('https://news.ycombinator.com');

  // Use html parser and query selector
  var document = parse(response.body);
  List<Element> links = document.querySelectorAll('td.title > a.storylink');
}

At this point we now have a list of Elements where each element is an a.storylink item. The Element type provides an API similar to the DOM.
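
To get a feel for that API, here is a minimal sketch that parses a small, made-up HTML fragment (not the real Hacker News markup) and reads a few properties from the first matched element. The helper name describeFirstStoryLink is hypothetical, and the snippet assumes the html package from pubspec.yaml is installed.

```dart
import 'package:html/parser.dart';

// Hypothetical helper: summarises the first story link found in [html].
Map<String, dynamic> describeFirstStoryLink(String html) {
  var document = parse(html);

  // querySelector returns the first match, or null if nothing matches.
  var link = document.querySelector('td.title > a.storylink');
  if (link == null) return {};

  return {
    'text': link.text, // Text content of the anchor
    'href': link.attributes['href'], // Value of the href attribute
    'hasStorylinkClass': link.classes.contains('storylink'),
    'parentTag': link.parent?.localName, // Tag name of the parent element
  };
}

void main() {
  // A made-up fragment that mimics the structure of a Hacker News row.
  print(describeFirstStoryLink('''
    <table><tr>
      <td class="title">
        <a class="storylink" href="https://example.com/post">An example story</a>
      </td>
    </tr></table>
  '''));
}
```

The same properties are available on every Element returned by querySelectorAll, which is what the loop in the next step relies on.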

With a for-in loop we can traverse the collection:

List<Map<String, dynamic>> linkMap = [];

for (var link in links) {
  linkMap.add({
    'title': link.text,
    'href': link.attributes['href'],
  });
}

And return a JSON-encoded output:

import 'dart:convert'; // Do this at the top of the file

Future initiate() async {
  ...
  ...
  return json.encode(linkMap);
}
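
As a quick illustration of what json.encode produces, the sketch below encodes a hand-written list shaped like linkMap and decodes it back again. The entries are placeholders rather than scraped data, and encodeLinks is a hypothetical helper that only wraps the json.encode call.

```dart
import 'dart:convert';

// Hypothetical helper: encodes a list of link maps the same way initiate() does.
String encodeLinks(List<Map<String, dynamic>> links) => json.encode(links);

void main() {
  var encoded = encodeLinks([
    {'title': 'An example story', 'href': 'https://example.com/post'},
  ]);

  // The encoder emits compact JSON without extra whitespace.
  print(encoded);
  // → [{"title":"An example story","href":"https://example.com/post"}]

  // json.decode turns the string back into Dart lists and maps.
  var decoded = json.decode(encoded);
  print(decoded[0]['title']); // → An example story
}
```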

Here's the full script so far:

// lib/hacker_news_scraper.dart
import 'dart:convert';

import 'package:http/http.dart';
import 'package:html/parser.dart';
import 'package:html/dom.dart';

Future initiate() async {
  // Make API call to Hackernews homepage
  var client = Client();
  Response response = await client.get('https://news.ycombinator.com');

  // Use html parser
  var document = parse(response.body);
  List<Element> links = document.querySelectorAll('td.title > a.storylink');
  List<Map<String, dynamic>> linkMap = [];

  for (var link in links) {
    linkMap.add({
      'title': link.text,
      'href': link.attributes['href'],
    });
  }

  return json.encode(linkMap);
}

Running this should return a JSON output similar to below:

[
  {
    "title":"Introduction to Applied Linear Algebra: Vectors, Matrices, and Least Squares",
    "href":"http://vmls-book.stanford.edu/"
  },
  {
    "title":"Write Your Own Virtual Machine",
    "href":"https://justinmeiners.github.io/lc3-vm/"
  },
  {
    "title":"Verizon signals its Yahoo and AOL divisions are almost worthless",
    "href":"https://www.nbcnews.com/tech/tech-news/verizon-signals-its-yahoo-aol-divisions-are-almost-worthless-n946846"
  },
  ...
  ...
]

3. Write the unit tests

Our tests will go in test/hacker_news_scraper_test.dart. Replace its contents with the below:

import 'dart:convert';

import 'package:test/test.dart';
import 'package:http/http.dart';
import 'package:http/testing.dart';
import 'package:hacker_news_scraper/hacker_news_scraper.dart' as hacker_news_scraper;

void main() {
  // Our tests will go here
}

This is what our first test looks like:

void main() {
  test('calling initiate() returns a list of storylinks', () async {
    var response = await hacker_news_scraper.initiate();
    expect(response, equals('/* JSON string to match against */'));
  });
}

We need to refactor our solution slightly for our tests. As written, the tests would be flaky because they make actual calls to the Hacker News website.

If Hacker News isn't available, we do not have an internet connection, or the story listings change (and they will), our tests will fail.

Let's refactor our initiate() function to expect a client parameter and remove the var client = Client(); declaration:

// lib/hacker_news_scraper.dart
Future initiate(BaseClient client) async {
  // var client = Client(); // <- Remove this line
  ...
}

The HTTP clients in the http package extend a common BaseClient type. The same package also provides another BaseClient subclass called MockClient for mocking HTTP calls, which is exactly what we need for our unit tests!

Return to bin/main.dart and ensure the Client is passed in:

import 'package:http/http.dart'; // Import the package first!
import 'package:hacker_news_scraper/hacker_news_scraper.dart' as hacker_news_scraper;

void main(List<String> arguments) async {
  print(await hacker_news_scraper.initiate(Client()));
}

Ok, back to our unit tests.

This is the first test that uses our MockClient:

void main() {
  MockClient client = null;

  test('calling initiate(client) returns a list of storylinks', () async {
    // Arrange
    client = MockClient((req) => Future(() => Response('''
      <body>
        <table><tbody><tr>
        <td class="title">
          <a class="storylink" href="https://dartlang.org">Get started with Dart</a>
        </td>
        </tr></tbody></table>
      </body>
    ''', 200)));

    // Act
    var response = await hacker_news_scraper.initiate(client);

    // Assert
    expect(
        response,
        equals(json.encode([
          {
            'title': 'Get started with Dart',
            'href': 'https://dartlang.org',
          }
        ])));
  });
}

The MockClient constructor takes a closure as its first parameter. This closure receives a request object, which we can inspect if needed. The closure is expected to return a Future, which is what we're doing here: we return a Response containing an HTML string, and that is what our await client.get(...) call receives.

The Response constructor takes a second parameter, an integer representing the response status code. In this case it's a 200 OK.
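
Because the closure receives the outgoing request, a mock can also answer differently depending on what was asked for. The following is a minimal sketch of that idea, not part of the tutorial's project: buildMockClient, the /missing path and the example.com URLs are all made up for illustration. Uri.parse is used so the calls also work with newer releases of the http package, which expect a Uri rather than a string.

```dart
import 'package:http/http.dart';
import 'package:http/testing.dart';

// Hypothetical factory: a MockClient that answers differently per request path.
MockClient buildMockClient() {
  return MockClient((request) async {
    // The handler receives the outgoing Request, so we can inspect its URL.
    if (request.url.path == '/missing') {
      return Response('Not found', 404);
    }
    return Response('<html><body>Hello</body></html>', 200);
  });
}

Future<void> main() async {
  var client = buildMockClient();

  // No network traffic happens here; the handler above produces the responses.
  var ok = await client.get(Uri.parse('https://example.com/'));
  print('${ok.statusCode} ${ok.body}');

  var missing = await client.get(Uri.parse('https://example.com/missing'));
  print('${missing.statusCode} ${missing.body}');
}
```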

We then proceed to make our initiate() call, passing in our MockClient. This means that our test is now predictable and we can confidently perform assertions on the response.

The expect and equals top-level functions come as part of the test package by the Dart team. We installed this earlier on and it is listed under dev_dependencies: in our pubspec.yaml file.

We are using the json.encode() method because it's an encoded JSON string we expect from the operation.
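
One thing to keep in mind when comparing encoded strings: json.encode writes keys in the order they were inserted into the map, so two maps with the same entries in a different order produce different strings. The sketch below shows this, along with one possible alternative of decoding the strings and checking individual fields. The sameLink helper is hypothetical and only exists for this illustration.

```dart
import 'dart:convert';

// Hypothetical helper: compares two encoded link objects field by field.
bool sameLink(String first, String second) {
  var a = json.decode(first);
  var b = json.decode(second);
  return a['title'] == b['title'] && a['href'] == b['href'];
}

void main() {
  var first = json.encode(
      {'title': 'Get started with Dart', 'href': 'https://dartlang.org'});
  var second = json.encode(
      {'href': 'https://dartlang.org', 'title': 'Get started with Dart'});

  // Same entries, different key order, so the encoded strings differ.
  print(first == second); // → false

  // Decoding first makes the comparison independent of key order.
  print(sameLink(first, second)); // → true
}
```

In our test the keys are written in the same order as in initiate(), so comparing the encoded strings directly works fine.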

We can run this test by doing:

$ pub run test

Here's the second test to address a failure scenario:

void main() {
  ...
  ...
  test('calling initiate(client) should silently fail', () async {
    // Arrange
    client = MockClient((req) => Future(() => Response('Failed', 400)));

    // Act
    var response = await hacker_news_scraper.initiate(client);

    // Assert
    expect(response, equals('Failed'));
  });
}

Run pub run test again. This will fail.

Let's make this pass. In our initiate() method, let's add this condition below our GET call:

if (response.statusCode != 200) return response.body;

Run the test again. All should pass!
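
For reference, here is a sketch of how the function might look with the client parameter and the status check in place. It is assembled from the snippets above rather than copied from the project's source, and the main function at the bottom is only an offline demonstration using a MockClient with made-up markup. Uri.parse is an addition so the call also works with newer releases of the http package, which expect a Uri.

```dart
import 'dart:convert';

import 'package:http/http.dart';
import 'package:http/testing.dart';
import 'package:html/parser.dart';

Future<String> initiate(BaseClient client) async {
  var response = await client.get(Uri.parse('https://news.ycombinator.com'));

  // Bail out early with the raw body when the request was not successful.
  if (response.statusCode != 200) return response.body;

  // Parse the markup and collect every story link.
  var document = parse(response.body);
  var links = document.querySelectorAll('td.title > a.storylink');

  var linkMap = <Map<String, dynamic>>[];
  for (var link in links) {
    linkMap.add({
      'title': link.text,
      'href': link.attributes['href'],
    });
  }

  return json.encode(linkMap);
}

Future<void> main() async {
  // Offline demonstration: the MockClient returns made-up markup.
  var client = MockClient((request) async => Response(
      '<table><tr><td class="title">'
      '<a class="storylink" href="https://example.com">Example</a>'
      '</td></tr></table>',
      200));

  print(await initiate(client));
}
```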

[Screenshot: all tests pass]

Conclusion

To sum things up, we have built a scraping tool to pull in the latest feed from the Hacker News website using the http and html packages provided by the Dart team. We then covered our backs by writing some unit tests.

In reality, though, it may serve you better to use the Hacker News API for this 😄. That being said, you will still need this approach for websites that do not have an official API for traversing their content.

I hope this has been insightful, especially in the area of writing tests in Dart.

→ Get the source code

I also run a YouTube channel teaching subscribers to develop fullstack applications with Dart. Become a subscriber to receive updates when new videos are released.

And lastly, I am almost finished producing a free Dart course on Egghead.io. This is due for release in the New Year 🎉, so keep an eye out for that 👁️

Like, share and follow me 😍 for more content on Dart.

Further reading

  1. http: A composable, Future-based library for making HTTP requests
  2. html: HTML5 parser in Dart
  3. Free Dart screencasts on Egghead.io