WhatWG - HTML5 tokenizer - official test cases?

Short Version

Are there test vectors/test cases for a conforming HTML Tokenizer? (https://html.spec.whatwg.org/multipage/parsing.html#tokenization)

An example would be a sample of HTML:

<!doctype html>\r\n<html>\r\n<head></head>\r\n<body></body>\r\n</html>

And you are given the expected tokens:

  • doctype("html")
  • character(LF)
  • startTag("html")
  • character(LF)
  • startTag("head")
  • endTag("head")
  • character(LF)
  • startTag("body")
  • endTag("body")
  • character(LF)
  • endTag("html")
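To make concrete what I mean by a "test vector", here is a minimal sketch in Python. Everything in it is hypothetical and for illustration only: the `TEST_VECTOR` layout, the `(type, value)` token tuples, and `toy_tokenize`, which is a deliberately incomplete stand-in (no attributes, comments, or character references) and not a conforming tokenizer. The only behaviour taken from the spec is the newline normalization that turns CR LF into LF, which is why the expected tokens above contain LF rather than CR LF.

```python
import re

# Hypothetical test-vector shape: the input text plus the expected tokens.
TEST_VECTOR = {
    "input": "<!doctype html>\r\n<html>\r\n<head></head>\r\n<body></body>\r\n</html>",
    "expected": [
        ("doctype", "html"),
        ("character", "\n"),
        ("startTag", "html"),
        ("character", "\n"),
        ("startTag", "head"),
        ("endTag", "head"),
        ("character", "\n"),
        ("startTag", "body"),
        ("endTag", "body"),
        ("character", "\n"),
        ("endTag", "html"),
    ],
}

# Toy grammar: doctype, bare end tag, bare start tag, or any single character.
_TOKEN_RE = re.compile(
    r"<!doctype\s+(?P<doctype>[a-z]+)>"
    r"|</(?P<end>[a-z][a-z0-9]*)>"
    r"|<(?P<start>[a-z][a-z0-9]*)>"
    r"|(?P<char>.)",
    re.IGNORECASE | re.DOTALL,
)


def normalize_newlines(text):
    """Input-stream preprocessing: CR LF pairs and lone CRs both become LF."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def toy_tokenize(text):
    """Stand-in tokenizer for the sketch; swap in the implementation under test."""
    tokens = []
    for match in _TOKEN_RE.finditer(normalize_newlines(text)):
        if match.group("doctype"):
            tokens.append(("doctype", match.group("doctype").lower()))
        elif match.group("end"):
            tokens.append(("endTag", match.group("end").lower()))
        elif match.group("start"):
            tokens.append(("startTag", match.group("start").lower()))
        else:
            tokens.append(("character", match.group("char")))
    return tokens


def run_vector(vector, tokenize):
    """Run one vector through a tokenizer; return (passed, actual_tokens)."""
    actual = tokenize(vector["input"])
    return actual == vector["expected"], actual


passed, actual = run_vector(TEST_VECTOR, toy_tokenize)
print("PASS" if passed else "FAIL")
```

What I am looking for is an official collection of such input/expected pairs, so that `toy_tokenize` could be replaced with a real tokenizer and checked against it.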

Long Version

There is a website (WebPlatformTests.org) dedicated to creating tests so that implementations can check their conformance:

The web-platform-tests project is a cross-browser test suite for the Web-platform stack. Writing tests in a way that allows them to be run in all browsers gives browser projects confidence that they are shipping software which is compatible with other implementations, and that later implementations will be compatible with their implementations.

HTML5 Tokenizer test class

In their GitHub repository (https://github.com/web-platform-tests/wpt/tree/master/html), they have a test class for the HTML tokenizer:

https://github.com/web-platform-tests/wpt/blob/7b0ebaccc62b566a1965396e5be7bb2bc06f841f/tools/third_party/html5lib/html5lib/tests/tokenizer.py

class TokenizerTestParser(object):
    def __init__(self, initialState, lastStartTag=None):
        self.tokenizer = HTMLTokenizer
        self._state = initialState
        self._lastStartTag = lastStartTag

    def parse(self, stream, encoding=None, innerHTML=False):
        # pylint:disable=unused-argument
        tokenizer = self.tokenizer(stream, encoding)
        self.outputTokens = []

I can see how it tokenizes some HTML and tests the returned list of tokens against some reference, but I can't find where it gets the test vectors from.

HTML Parsing Test Folder

The Web Platform Tests home also documents how to navigate the repository to find the tests you want:

HTML

This directory contains tests for HTML.

Sub-directory names should be based on the URL of the corresponding part of the multipage-version specification. For example, the URL of "8.3 Base64 utility methods" is https://html.spec.whatwg.org/multipage/webappapis.html#atob. So the directory in WPT is webappapis/atob/.
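As I read that convention, the mapping from a spec URL to a WPT sub-directory can be sketched as below. The function name `wpt_directory_for` is my own, and the rule (page name without `.html`, then the fragment) is only my interpretation of the quoted README text, not something taken from WPT tooling.

```python
import posixpath
from urllib.parse import urlsplit


def wpt_directory_for(spec_url):
    """Guess the WPT sub-directory for a multipage-spec URL (my reading of the README)."""
    parts = urlsplit(spec_url)
    page = posixpath.basename(parts.path)      # e.g. "webappapis.html"
    stem, _ext = posixpath.splitext(page)      # e.g. "webappapis"
    if not parts.fragment:
        # No fragment: only the page-level directory can be derived.
        return stem + "/"
    return "{}/{}/".format(stem, parts.fragment)


# The README's own example:
print(wpt_directory_for("https://html.spec.whatwg.org/multipage/webappapis.html#atob"))
# prints "webappapis/atob/"
```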

In my case I am looking at the spec:

That should mean I need a directory in WPT named "parsing/parsing", except there is no parsing folder.

WhatWG specification

The HTML 5 specification has a link to "Tests", but that goes to what I already mentioned above: Web Platform Tests.

Non-standard test cases

In the absence of any formal test vectors, I did find someone who wrote an (intentionally) non-conforming HTML tokenizer (https://github.com/tildeio/simple-html-tokenizer/blob/master/tests/tokenizer-tests.ts).

He has a nice collection of about 40 test cases, but roughly a third of them are wrong: they simply violate the HTML5 spec.

Given that Web Platform Tests specifically has a tokenizer test class, it seems to me that it must have the tokenizer test cases somewhere.

But where are they?



Sources

This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.

Source: Stack Overflow

Solution Source