WhatWG - HTML5 tokenizer - official test cases?
Short Version
Are there test vectors/test cases for a conforming HTML Tokenizer? (https://html.spec.whatwg.org/multipage/parsing.html#tokenization)
An example would be a sample of HTML:
<!doctype html>\r\n<html>\r\n<head></head>\r\n<body></body>\r\n</html>
And you are given the expected tokens:
- doctype("html")
- character(LF)
- startTag("html")
- character(LF)
- startTag("head")
- endTag("head")
- character(LF)
- startTag("body")
- endTag("body")
- character(LF)
- endTag("html")
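To make the idea concrete, here is a minimal sketch of what checking one such test vector could look like in Python. Everything in it is an assumption made for illustration: the tuple format and the names EXPECTED, StandInTokenizer, tokenize and coalesce are made up, and Python's html.parser is only a stand-in for the tokenizer under test (it is not a conforming WHATWG tokenizer).

```python
from html.parser import HTMLParser

SAMPLE = "<!doctype html>\r\n<html>\r\n<head></head>\r\n<body></body>\r\n</html>"

# Expected tokens for SAMPLE, as (kind, value) tuples (hypothetical format).
EXPECTED = [
    ("doctype", "html"),
    ("character", "\n"),
    ("startTag", "html"),
    ("character", "\n"),
    ("startTag", "head"),
    ("endTag", "head"),
    ("character", "\n"),
    ("startTag", "body"),
    ("endTag", "body"),
    ("character", "\n"),
    ("endTag", "html"),
]


class StandInTokenizer(HTMLParser):
    """Records parser callbacks as tokens; NOT a spec-conforming tokenizer."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.tokens = []

    def handle_decl(self, decl):
        # decl is the text between "<!" and ">", e.g. "doctype html".
        parts = decl.split()
        self.tokens.append(("doctype", parts[1].lower() if len(parts) > 1 else ""))

    def handle_starttag(self, tag, attrs):
        self.tokens.append(("startTag", tag))

    def handle_endtag(self, tag):
        self.tokens.append(("endTag", tag))

    def handle_data(self, data):
        self.tokens.append(("character", data))


def tokenize(html):
    # The expected tokens contain LF only, so normalize CR LF and lone CR first.
    html = html.replace("\r\n", "\n").replace("\r", "\n")
    tokenizer = StandInTokenizer()
    tokenizer.feed(html)
    tokenizer.close()
    return tokenizer.tokens


def coalesce(tokens):
    # Merge adjacent character tokens so chunking differences do not matter.
    merged = []
    for kind, value in tokens:
        if kind == "character" and merged and merged[-1][0] == "character":
            merged[-1] = ("character", merged[-1][1] + value)
        else:
            merged.append((kind, value))
    return merged


print(coalesce(tokenize(SAMPLE)) == EXPECTED)
```

A real test suite would swap StandInTokenizer for the tokenizer being tested and load many (input, expected tokens) pairs instead of one hard-coded sample.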
Long Version
There is a web-site (WebPlatformTests.org) dedicated to creating tests so that implementations can test their conformance:
The web-platform-tests project is a cross-browser test suite for the Web-platform stack. Writing tests in a way that allows them to be run in all browsers gives browser projects confidence that they are shipping software which is compatible with other implementations, and that later implementations will be compatible with their implementations.
HTML5 Tokenizer test class
In their GitHub repository (https://github.com/web-platform-tests/wpt/tree/master/html), they have an HTML Tokenizer test unit:
    class TokenizerTestParser(object):
        def __init__(self, initialState, lastStartTag=None):
            self.tokenizer = HTMLTokenizer
            self._state = initialState
            self._lastStartTag = lastStartTag

        def parse(self, stream, encoding=None, innerHTML=False):
            # pylint:disable=unused-argument
            tokenizer = self.tokenizer(stream, encoding)
            self.outputTokens = []
I can see how it tokenizes some HTML and tests the returned list of tokens against some reference, but I can't find where it gets the test vectors from.
HTML Parsing Test Folder
The Web Platform Tests home also documents how to navigate the repository to find the tests you want:
HTML
This directory contains tests for HTML.
Sub-directory names should be based on the URL of the corresponding part of the multipage-version specification. For example, the URL of "8.3 Base64 utility methods" is https://html.spec.whatwg.org/multipage/webappapis.html#atob. So the directory in WPT is webappapis/atob/.
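The naming rule quoted above can be sketched as a small Python helper. The function name wpt_dir_for is hypothetical, and the helper only applies the stated convention (page name plus fragment); it does not check whether the resulting directory actually exists in the repository.

```python
import posixpath
from urllib.parse import urlsplit


def wpt_dir_for(spec_url):
    """Derive the conventional WPT sub-directory from a multipage spec URL."""
    parts = urlsplit(spec_url)
    page = posixpath.basename(parts.path)      # e.g. "webappapis.html"
    stem, _ext = posixpath.splitext(page)      # e.g. "webappapis"
    return f"{stem}/{parts.fragment}/"


print(wpt_dir_for("https://html.spec.whatwg.org/multipage/webappapis.html#atob"))
print(wpt_dir_for("https://html.spec.whatwg.org/multipage/parsing.html#tokenization"))
```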
In my case I am looking at the spec:
That should mean I need a directory in WPT named "parsing/parsing", except there is no parsing folder.
WhatWG specification
The HTML 5 specification has a link to "Tests", but that goes to what I already mentioned above: Web Platform Tests.
Non-standard test cases
In the absence of any formal test vectors, I did find a guy who wrote an (intentionally) non-conforming HTML tokenizer (https://github.com/tildeio/simple-html-tokenizer/blob/master/tests/tokenizer-tests.ts).
He has a nice collection of about 40 test cases, but about a third of them are wrong: they simply violate the HTML5 spec.
Given that Web Platform Tests specifically has a Tokenizer test class, it seems to me that it must have Tokenizer test vectors somewhere.
But where are they?
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow


