---
in_progress: yes
default_highlighter: oils-sh
---

HTM8 - An Easy Subset of HTML5, With Some Errors
================================================

HTM8 is a data language, which is part of J8 Notation:

- It's a subset of HTML5, so there are **Syntax Errors**
  - It's "for humans"
  - `<li><li>` example
- It's Easy
  - Easy to Implement - ~700 lines of regular languages and Python
  - And thus Easy to Remember, for users
  - Runs Efficiently - you don't have to materialize a big DOM tree, which
    causes many allocations
- Convertible to XML?
  - without allocations, with a `sed`-like transformation!
  - low level lexing and matching
- Ambitious
  - zero-alloc whitelist-based HTML filter for user content
  - zero-alloc browser and CSS-style content queries

Currently, all of the Oils docs are parsed and processed with it.

We would like to "lift it up" into an API for YSH users.

<!--

TODO: 99.9% of HTML documents from CommonCrawl should be convertible to XML,
and then validated by an XML parser

- lxml - this is supposed to be high quality

- Python stdlib uses expat - https://libexpat.github.io/

- Gah it's this huge thing, 8K lines: https://github.com/libexpat/libexpat/blob/master/expat/lib/xmlparse.c
  - do they have the billion laughs bug?

-->

<div id="toc">
</div>

## Structure of an HTM8 Doc

### Tags - Open, Close, Self-Closing

1. Open `<a>`
1. Close `</a>`
1. StartEnd (self-closing) `<img/>`

HTML5 doesn't have the notion of self-closing tags. Instead, it silently ignores
the trailing `/`.

We are bringing it back for humans, because we think it's too hard for people to
remember the 16 void elements.

And a lack of balanced tags causes visual bugs that are hard to debug. It would
be better to get an error **earlier**.
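
The three tag kinds, and the "error earlier" idea, can be illustrated with a
small Python sketch. This is not the HTM8 lexer: `TAG_RE`, `classify_tags`, and
`check_balanced` are hypothetical names, and the pattern is deliberately
simplified (it ignores comments, CDATA, and attribute values that contain `<`
or `>`).

```python
import re

# Hypothetical, simplified tag pattern (an assumption, not the real HTM8 lexer).
# Groups: leading '/', tag name, raw attribute text, trailing '/'.
TAG_RE = re.compile(r'<(/?)([a-zA-Z][a-zA-Z0-9-]*)([^<>]*?)(/?)>')


def classify_tags(html):
    """Yield (kind, tag_name) pairs, where kind is Open, Close, or StartEnd."""
    for m in TAG_RE.finditer(html):
        closing, name, _attrs, self_closing = m.groups()
        if closing:
            yield 'Close', name
        elif self_closing:
            yield 'StartEnd', name
        else:
            yield 'Open', name


def check_balanced(html):
    """Raise ValueError at the first mismatched or unclosed tag."""
    stack = []
    for kind, name in classify_tags(html):
        if kind == 'Open':
            stack.append(name)
        elif kind == 'Close':
            if not stack or stack.pop() != name:
                raise ValueError('Unexpected </%s>' % name)
        # A StartEnd tag opens and closes at once, so the stack is unchanged.
    if stack:
        raise ValueError('Unclosed <%s>' % stack[-1])
```

With this sketch, `check_balanced('<p><img/></p>')` returns without error,
while `check_balanced('<ul><li>one</ul>')` raises `ValueError` instead of
silently producing a broken layout.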

### Attributes - Quotes optional

There are 5 closely related syntaxes:

1. Missing `<a missing>`
1. Empty `<a empty=>`
1. Unquoted `<a href=foo>`
1. Double Quoted `<a href="foo">`
1. Single Quoted `<a href='foo'>`

Note: `<a href=/>` is disallowed because it's ambiguous: the `/` could be the
attribute value or the self-closing marker. Use `<a href="/">` or
`<a href=/ >` or `<a href= />`.
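
As a rough illustration, the five attribute syntaxes can be told apart with one
regular expression. This is a sketch under assumptions: `ATTR_RE` and
`attr_kind` are hypothetical names, the character classes are a guess rather
than the HTM8 definition, and the function looks at a single attribute in
isolation, so it does not model the `<a href=/>` ambiguity at the tag level.

```python
import re

# Hypothetical pattern for ONE attribute, such as 'href="foo"' (an assumption,
# not the real HTM8 lexer).
ATTR_RE = re.compile(r'''
    ([a-zA-Z][a-zA-Z0-9_-]*)      # attribute name
    (?:
        \s*=\s*
        (?:
            "([^"]*)"             # double quoted
          | '([^']*)'             # single quoted
          | ([^\s"'=<>`]+)        # unquoted
          | ()                    # empty:  name=
        )
    )?                            # missing: no equals sign at all
''', re.VERBOSE)


def attr_kind(attr_text):
    """Return (kind, name, value) for a single attribute."""
    m = ATTR_RE.fullmatch(attr_text)
    if m is None:
        raise ValueError('Invalid attribute: %r' % attr_text)
    name, dq, sq, unq, empty = m.groups()
    # Compare against None, because a quoted value may be the empty string.
    if dq is not None:
        return 'DoubleQuoted', name, dq
    if sq is not None:
        return 'SingleQuoted', name, sq
    if unq is not None:
        return 'Unquoted', name, unq
    if empty is not None:
        return 'Empty', name, ''
    return 'Missing', name, None
```

For example, `attr_kind('empty=')` yields the `Empty` kind with an empty string
value, while `attr_kind('missing')` yields `Missing` with `None`, which keeps
the two cases distinguishable for callers.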

### Text - Regular or CDATA

#### Regular Text

- Any UTF-8 text.
- Generally, `& < > " '` should be escaped as `&amp; &lt; &gt; &quot; &apos;`.

But we are lenient and allow raw `>` between tags:

    <p> foo > bar </p>

and raw `<` inside tags:

    <span foo="<" > foo </span>
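
The general escaping rule can be sketched in a few lines of Python.
`escape_text` is a hypothetical helper name, not part of any HTM8 API. It is
written by hand because the standard library's `html.escape()` emits `&#x27;`
for a single quote, rather than the `&apos;` form listed above.

```python
def escape_text(s):
    """Escape the five special characters with named entities."""
    # '&' must be replaced first, so that the '&' introduced by the other
    # replacements is not escaped a second time.
    return (s.replace('&', '&amp;')
             .replace('<', '&lt;')
             .replace('>', '&gt;')
             .replace('"', '&quot;')
             .replace("'", '&apos;'))
```

A producer that always calls a function like this never relies on the lenient
cases, so its output stays valid under the stricter rule too.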

#### CDATA

Like HTML5, we support explicit `<
- [table-object-doc.html](table-object-doc.html)


## Brainstorming / TODO

### Foreign XML with `<svg>` and `<math>` ?

`<svg>` and `<math>` are foreign XML content.

We might want to support this.

- So I can just switch to XML mode in that case
- TODO: we need a test corpus for this!
  - maybe look for Wikipedia content
- can we also just disallow these? Can you make these into external XML files?

This is one way:

    <object data="math.xml" type="application/mathml+xml"></object>
    <object data="drawing.xml" type="image/svg+xml"></object>

Then we don't need special parsing?