Warning: Work in progress! Leave feedback on Zulip or GitHub if you'd like this doc to be updated.

HTM8 - An Easy Subset of HTML5, With Some Errors

HTM8 is a data language that is part of J8 Notation.

Currently, all of the Oils docs are parsed and processed with it.

We would like to "lift it up" into an API for YSH users.

Table of Contents
Structure of an HTM8 Doc
Tags - Open, Close, Self-Closing
Attributes - Quotes optional
Text - Regular or CDATA
Escaped Chars - named, decimal, hex
Comments - HTML or XML
Declarations - HTML or XML
Special Rules For Specific HTML Tags
<script> and <style> are Leaf Tags with Special Lexing
16 VOID Tags Don't Need Close Tags (Special Parsing)
Errors
Notes on Leniency
What are some examples of syntax errors?
Under the Hood - Regular Languages, Algebraic Data Types
2 Layers of Lexing
4 Regular Expressions
XML Parsing Mode
Algorithms
What Do You Use This for?
List of Algorithms
Emitting HTM8 as HTML5
Converting to XML
Related
Brainstorming / TODO
Foreign XML with <svg> and <math> ?

Structure of an HTM8 Doc

Tags - Open, Close, Self-Closing

  1. Open <a>
  2. Close </a>
  3. StartEnd <img/>

HTML5 doesn't have the notion of self-closing tags. Instead, it silently ignores the trailing /.

We are bringing it back for humans, because we think it's too hard for people to remember the 16 void elements.

And a lack of balanced tags causes visual bugs that are hard to debug. It would be better to get an error earlier.
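
As an illustration of getting an error earlier, here is a minimal Python sketch of a balance check over the three tag kinds. The names TAG_RE and check_balance, and the three-element void_tags default, are assumptions made for this example and are not the Oils implementation. The simplified pattern also ignores comments and does not handle a raw < inside a quoted attribute value.

```python
import re

# Hypothetical, simplified tag pattern (an assumption, not the real HTM8 lexer).
# Group 1: '/' for a Close tag, group 2: tag name, group 3: '/' for StartEnd.
TAG_RE = re.compile(r'<(/?)([a-zA-Z][a-zA-Z0-9-]*)[^<>]*?(/?)>')


def check_balance(doc, void_tags=frozenset(['br', 'hr', 'img'])):
    """Return None if the tags balance, otherwise an error message."""
    stack = []
    for m in TAG_RE.finditer(doc):
        is_close, name, is_self_closing = m.groups()
        name = name.lower()
        if is_close:
            # A Close tag must match the most recent Open tag.
            if not stack or stack[-1] != name:
                return 'Unexpected </%s> at offset %d' % (name, m.start())
            stack.pop()
        elif is_self_closing or name in void_tags:
            # StartEnd tags and void tags need no Close tag.
            continue
        else:
            stack.append(name)
    if stack:
        return 'Unclosed <%s>' % stack[-1]
    return None
```

With this sketch, check_balance('<p><img/></p>') returns None, while check_balance('<p><b></p>') reports the unexpected </p> instead of letting the mistake surface later as a rendering bug.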

Attributes - Quotes optional

5 closely related syntaxes:

  1. Missing <a missing>
  2. Empty <a empty=>
  3. Unquoted <a href=foo>
  4. Double Quoted <a href="foo">
  5. Single Quoted <a href='foo'>

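Note: <a href=/> is disallowed because it's ambiguous: it could be an unquoted value / or an empty value on a self-closing tag. Use <a href="/"> or <a href=/ > for the value /, or <a href= /> for an empty value on a self-closing tag.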
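
The five attribute syntaxes can be told apart with a single regular expression. The following Python sketch is only an illustration: ATTR_RE, classify_attr, and the exact character classes are assumptions, not the patterns Oils uses. It takes the source text of one attribute that has already been split out of its tag, so it does not address the <a href=/> ambiguity.

```python
import re

# Hypothetical pattern for a single attribute (an assumption for illustration).
ATTR_RE = re.compile(r'''
    (?P<name> [a-zA-Z][a-zA-Z0-9_-]* )
    (?: (?P<eq> = )
        (?: " (?P<dq> [^"]* ) "
          | ' (?P<sq> [^']* ) '
          | (?P<unq> [^\s"'=<>]+ )
        )?
    )?
    $
''', re.VERBOSE)


def classify_attr(text):
    """Return (name, kind, value) for one attribute's source text."""
    m = ATTR_RE.match(text)
    if m is None:
        raise ValueError('Invalid attribute: %r' % text)
    name = m.group('name')
    if m.group('eq') is None:
        return (name, 'Missing', None)        # <a missing>
    if m.group('dq') is not None:
        return (name, 'Double Quoted', m.group('dq'))
    if m.group('sq') is not None:
        return (name, 'Single Quoted', m.group('sq'))
    if m.group('unq') is not None:
        return (name, 'Unquoted', m.group('unq'))
    return (name, 'Empty', '')                # <a empty=>
```

For example, classify_attr('href=foo') yields ('href', 'Unquoted', 'foo') and classify_attr('empty=') yields ('empty', 'Empty', '').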

Text - Regular or CDATA

Regular Text

We are lenient and allow a raw > between tags:

<p> foo > bar </p>

and raw < inside tags:

<span foo="<" > foo </span>
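
One way to get this leniency is to let text be "anything except <", and to let quoted attribute values contain any character except the closing quote. The Python sketch below shows the idea; TOKEN_RE and lex are hypothetical names, and rejecting a raw < in text is an assumption of this example rather than a documented HTM8 rule.

```python
import re

# Hypothetical top-level pattern (an assumption, not the real HTM8 lexer).
TOKEN_RE = re.compile(r'''
    (?P<tag> < /? [a-zA-Z] (?: "[^"]*" | '[^']*' | [^<>"'] )* > )
  | (?P<text> [^<]+ )
''', re.VERBOSE)


def lex(doc):
    """Split doc into ('tag', ...) and ('text', ...) tokens."""
    pos = 0
    tokens = []
    while pos < len(doc):
        m = TOKEN_RE.match(doc, pos)
        if m is None:
            # Neither a tag nor text starts here, so this is a stray <.
            raise ValueError('Unexpected < at offset %d' % pos)
        tokens.append((m.lastgroup, m.group(0)))
        pos = m.end()
    return tokens
```

Under these assumptions, lex('<p> foo > bar </p>') produces a tag, the text ' foo > bar ', and a tag, and lex('<span foo="<" > foo </span>') treats the quoted < as part of the open tag.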

CDATA

Like HTML5, we support explicit <![CDATA[ sections, even though CDATA is implicit in the <script> and <style> tags.

Escaped Chars - named, decimal, hex

  1. &amp; - named
  2. &#999; - decimal
  3. &#xff; - hex
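
The three forms can be decoded with one regular expression and a callback. This is a minimal Python sketch: CHAR_REF_RE, NAMED, and decode_char_refs are names invented for the example, and the tiny NAMED table is an assumption that stands in for whatever set of named characters HTM8 actually accepts.

```python
import re

# One alternative per form: hex, decimal, named.
CHAR_REF_RE = re.compile(
    r'&(?:#[xX]([0-9a-fA-F]+)|#([0-9]+)|([a-zA-Z][a-zA-Z0-9]*));')

# Hypothetical, deliberately small table of named characters.
NAMED = {'amp': '&', 'lt': '<', 'gt': '>', 'quot': '"', 'apos': "'"}


def decode_char_refs(s):
    """Replace named, decimal, and hex character references in s."""
    def repl(m):
        hex_digits, dec_digits, name = m.groups()
        if hex_digits is not None:
            return chr(int(hex_digits, 16))
        if dec_digits is not None:
            return chr(int(dec_digits))
        # Unknown names are left untouched in this sketch.
        return NAMED.get(name, m.group(0))
    return CHAR_REF_RE.sub(repl, s)
```

Here decode_char_refs('&amp;') gives '&', and both '&#999;' and '&#xff;' become the single code points 999 and 0xff. A stricter version would also reject code points outside the Unicode range, which chr() reports as a ValueError.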

Comments - HTML or XML

  1. <!-- -->
  2. <? ?> (XML processing instruction)

Declarations - HTML or XML

Special Rules For Specific HTML Tags

<script> and <style> are Leaf Tags with Special Lexing

Note: we still have CDATA for compatibility.

16 VOID Tags Don't Need Close Tags (Special Parsing)

Errors

Notes on Leniency

Angle brackets:

This makes lexing the top-level structure easier.

What are some examples of syntax errors?

HTML notes:

There are 5 kinds of tags:

and we have

Under the Hood - Regular Languages, Algebraic Data Types

That is, we use exhaustive reasoning.

It's meant to be easy to implement.

2 Layers of Lexing

  1. TagLexer
  2. AttrLexer

4 Regular Expressions

Using re2c as the "choice" primitive.

  1. Lexer
  2. NAME lexer
  3. Begin VALUE lexer
  4. Quoted value lexer - for decoding <a href="&amp;">

XML Parsing Mode

Conflicts between HTML5 and XML:

Algorithms

What Do You Use This for?

TODO:

List of Algorithms

Emitting HTM8 as HTML5

Just emit it! This always works, by design.

Converting to XML

Related

Brainstorming / TODO

Foreign XML with <svg> and <math> ?

<svg> and <math> are foreign XML content.

We might want to support this.

This is one way:

<object data="math.xml" type="application/mathml+xml"></object>
<object data="drawing.xml" type="image/svg+xml"></object>

Then we don't need special parsing?

Generated on Mon, 10 Feb 2025 03:32:50 +0000