Parsing HTML in Dart with unit tests

Today I wanted to build a simple web crawler in Dart with unit tests. The road turned out to be bumpier than I had expected. I'll write down what I learned.

Making a get request

A Google search leads to this Dart cookbook, and shows that it's as simple as adding the http package, then:

import 'package:http/http.dart' as http;
...
http.get(Uri.parse('https://...'))

And indeed, it worked fine.
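For reference, here is a minimal runnable sketch of that request with a status check added. The URL is a placeholder and the isSuccess helper is a hypothetical name of my own, not part of the http package:

```dart
import 'package:http/http.dart' as http;

/// Hypothetical helper: true for any 2xx status code.
bool isSuccess(int statusCode) => statusCode >= 200 && statusCode < 300;

Future<void> main() async {
  // Placeholder URL (an assumption); substitute the page you want to crawl.
  final uri = Uri.parse('https://example.com/');
  final response = await http.get(uri);

  if (isSuccess(response.statusCode)) {
    print('Fetched ${response.body.length} characters');
  } else {
    print('Request failed with status ${response.statusCode}');
  }
}
```

Note that http.get returns a Future, so the call has to be awaited (or chained with then) before the body can be read.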

Parsing the HTML

Now it gets slightly trickier. It's actually really simple, but for some reason, I started off in the wrong direction: https://api.dart.dev/stable/2.15.1/dart-html/DomParser/parseFromString.html. The doc is surprisingly sparse. It doesn't even document what the type argument should be. On MDN, I found some more documentation (see here), so I tried:

import 'dart:html';
...
DomParser().parseFromString(response.body, 'text/html');

But it failed with this error:

: Error: Not found: 'dart:html'
lib/my_crawler.dart:1
import 'dart:html';
       ^
: Error: Method not found: 'DomParser'.
lib/my_crawler.dart:20
    DomParser().parseFromString(doc.body, 'text/html');
    ^^^^^^^^^

The reason is that dart:html is only available in the browser, i.e. only when you're building Dart web apps.

To parse HTML on other platforms, one should use the html package.

import 'package:html/parser.dart' show parse;
import 'package:http/http.dart' as http;
...
final response = await http.get(uri);
final document = parse(response.body);
final links = document.querySelectorAll('a');
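// Skip anchors that have no href attribute instead of failing on a null cast.
final uris = links
    .map((link) => link.attributes['href'])
    .whereType<String>()
    .map((href) => Uri.parse(href));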
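One thing to watch for with the snippet above: on real pages, href values are often relative, and some anchors have no href at all. Below is a small sketch of how both cases could be handled. The extractLinks function name, the inline HTML, and the example.com base URL are my own assumptions for illustration:

```dart
import 'package:html/parser.dart' show parse;

/// Returns every link found in [html], resolved against [base].
/// Anchors without an href attribute are skipped.
List<Uri> extractLinks(String html, Uri base) {
  final document = parse(html);
  return document
      .querySelectorAll('a')
      .map((link) => link.attributes['href']) // String? for each anchor
      .whereType<String>() // drops the null entries
      .map((href) => base.resolve(href)) // relative -> absolute
      .toList();
}

void main() {
  // Made-up page content, only for illustration.
  const html = '''
<a href="/gpu.php?id=1">relative link</a>
<a name="top">anchor without href</a>
<a href="https://example.com/other.html">absolute link</a>
''';
  final base = Uri.parse('https://example.com/list.html');
  extractLinks(html, base).forEach(print);
}
```

Uri.resolve is part of dart:core, so this adds no dependency beyond the html package.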

Perfect! Now let's write unit tests so we can start iterating faster.

Unit tests

Preparing the test data

Unit tests will help me iterate faster, because I won't have to download the same HTML page every time I run the crawler. Instead, I'll download the page once, and the test will read that file locally. Here's how to download the page:

% curl https://www.videocardbenchmark.net/common_gpus.html > common_gpus.html

Note: when the page isn't found, the > filename trick did not work for me. It simply created an empty file instead of the 404 error page. To capture the 404 error page, I had to use the --output filename parameter. Finally, for URLs that are slightly more complicated (something?a=123&b=456), you have to put the URL in quotes, otherwise the shell interprets characters such as & and ? itself:

% curl --output GeForce_RTX_3080_Ti.html "https://www.videocardbenchmark.net/gpu.php?gpu=GeForce+RTX+3080+Ti&id=4409"

Mocking http

Change the function signature to pass mocks

Next we need to use a mock http Client. Instead of calling the top-level http.get function, the crawler should receive a specific client as a parameter. So I changed the crawl function's signature from:

static Future<Map<String, Benchmark>> crawlPage(Uri uri)

to:

static Future<Map<String, Benchmark>> crawlPage(Uri uri, Client client)
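Inside the function, the only change is to call client.get instead of http.get. Here is a rough sketch of that idea. The PageCrawler class is hypothetical, and it returns a simple link count rather than my real Map<String, Benchmark>, since the Benchmark type isn't shown here:

```dart
import 'package:html/parser.dart' show parse;
import 'package:http/http.dart';

/// Hypothetical crawler, used only to illustrate passing the client in.
class PageCrawler {
  /// Fetches [uri] through the injected [client] and returns the number
  /// of <a> elements found on the page.
  static Future<int> crawlPage(Uri uri, Client client) async {
    final response = await client.get(uri);
    if (response.statusCode != 200) {
      throw Exception('GET $uri failed with status ${response.statusCode}');
    }
    return parse(response.body).querySelectorAll('a').length;
  }
}

Future<void> main() async {
  final client = Client();
  try {
    // Placeholder URL (an assumption).
    final count =
        await PageCrawler.crawlPage(Uri.parse('https://example.com/'), client);
    print('Found $count links');
  } finally {
    client.close(); // release the underlying connection when done
  }
}
```

Because the function never creates its own client, a test can hand it any object that implements Client.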

In my application, I have to pass an actual client:

MyCrawler.crawlPage(Uri.parse('https://...'), Client());

So that in my test, I can pass a mock:

import 'dart:io';

import 'package:gpu_benchmarks/videocardbenchmark_crawler.dart';
import 'package:http/http.dart';
import 'package:mockito/mockito.dart';
import 'package:test/test.dart';

main() {
  test('Crawls a page', () async {
    const url = 'https://www.videocardbenchmark.net/common_gpus.html';
    final client = MockClient();
    final uri = Uri.parse(url);
    ...
    final r = await VideoCardBenchmarkCrawler.crawlPage(uri, client);
    expect(r.length, equals(100));
  });
}

class MockClient extends Mock implements Client {}

Mocking an http response

Finally, I want to make my MockClient return the file I previously downloaded. According to the Mockito doc, it's as simple as:

final mockResponse = Response(File('./test/common_gpus.html').readAsStringSync(), 200);
when(client.get(any)).thenReturn(Future.value(mockResponse));

But I get this error:

It turns out this is a new type of error that appeared with the introduction of null safety in Dart. Since any evaluates to null, and Client.get expects a non-nullable Uri, the code fails to compile.

I tried to get around the error by not using any, and matching with the exact value instead:

when(client.get(uri)).thenReturn(...);

It compiles but at runtime, it throws:

type 'Null' is not a subtype of type 'Future<Response>' at MockClient.get

It seems like the mock client.get returns null, which the runtime won't accept since it expects a Future<Response>. So at this point it's better to follow the doc than try the old ways randomly.

Mockito now has a full doc about how to handle null safety. See here. Interestingly, it uses an example about mocking a client response as well:

Matching with `any` won't be as convenient to use as before.

The doc explains that the new way to mock my Client is to add an annotation that tells Mockito to generate the Mock using the build_runner package.

import 'dart:io';

import 'package:http/http.dart';
import 'package:mockito/annotations.dart';
import 'package:mockito/mockito.dart';
import 'package:test/test.dart';

@GenerateMocks([Client])
main() {
  test('Crawls a page', () async {
    ...

The doc shows the command to run build_runner once everything is set up, but it shows the outdated command pub run build_runner build. The new one is:

% dart run build_runner build

Running build_runner generates the MockClient class in crawler_test.mocks.dart, and I can now import that class in my test in place of the hand-written one. The generated get override accepts a nullable Uri? parameter, which means you can use any without a problem.

Finally, I had one last runtime error:

Invalid argument(s): `thenReturn` should not be used to return a Future. Instead, use `thenAnswer((_) => future)`.

That's because thenReturn is not meant for futures; for those, we're supposed to use thenAnswer. Our code becomes:

final mockResponse = Response(File('./test/common_gpus.html').readAsStringSync(), 200);
when(client.get(any)).thenAnswer((_) => Future.value(mockResponse));

And now the test finally runs successfully.
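As an aside, the http package ships its own MockClient in package:http/testing.dart, which needs no code generation. It is a different class from the Mockito-generated MockClient above, despite sharing the name, and this is a swapped-in alternative rather than the approach I used. A minimal sketch, where the cannedClient helper and the inline HTML body are assumptions for illustration:

```dart
import 'package:http/http.dart';
import 'package:http/testing.dart';
import 'package:test/test.dart';

/// Hypothetical helper: builds a client that answers every request
/// with [body] and a 200 status, without touching the network.
Client cannedClient(String body) =>
    MockClient((request) async => Response(body, 200));

void main() {
  test('returns the canned page', () async {
    // In a real test the body could come from the downloaded file, e.g.
    // File('./test/common_gpus.html').readAsStringSync().
    final client = cannedClient('<html><body><a href="/a">A</a></body></html>');

    final response = await client
        .get(Uri.parse('https://www.videocardbenchmark.net/common_gpus.html'));

    expect(response.statusCode, equals(200));
    expect(response.body, contains('href="/a"'));
  });
}
```

The handler receives the full Request, so it could also inspect request.url and return different responses per page.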