1 | ---
|
2 | in_progress: yes
|
3 | default_highlighter: oils-sh
|
4 | ---
|
5 |
|
6 | HTM8 - An Easy Subset of HTML5, With Some Errors
|
7 | =================================
|
8 |
|
9 | HTM8 is a data language, which is part of J8 Notation:
|
10 |
|
11 | - It's a subset of HTML5, so there are **Syntax Errors**
|
12 |   - It's "for humans"
|
13 |   - `<li><li>` example
|
14 | - It's Easy
|
15 |   - Easy to Implement - ~700 lines of regular languages and Python
|
16 |   - And thus Easy to Remember, for users
|
17 |   - Runs Efficiently - you don't have to materialize a big DOM tree, which
|
18 |     causes many allocations
|
19 | - Convertible to XML?
|
20 |   - without allocations, with a `sed`-like transformation!
|
21 |   - low level lexing and matching
|
22 | - Ambitious
|
23 |   - zero-alloc whitelist-based HTML filter for user content
|
24 |   - zero-alloc browser and CSS-style content queries
|
25 |
|
26 | Currently, all of the Oils docs are parsed and processed with it.
|
27 |
|
28 | We would like to "lift it up" into an API for YSH users.
|
29 |
|
30 | <!--
|
31 |
|
32 | TODO: 99.9% of HTML documents from CommonCrawl should be convertible to XML,
|
33 | and then validated by an XML parser
|
34 |
|
35 | - lxml - this is supposed to be high quality
|
36 |
|
37 | - Python stdlib uses expat - https://libexpat.github.io/
|
38 |
|
39 | - Gah it's this huge thing, 8K lines: https://github.com/libexpat/libexpat/blob/master/expat/lib/xmlparse.c
|
40 | - do they have the billion laughs bug?
|
41 |
|
42 | -->
|
43 |
|
44 | <div id="toc">
|
45 | </div>
|
46 |
|
47 | ## Structure of an HTM8 Doc
|
48 |
|
49 | ### Tags - Open, Close, Self-Closing
|
50 |
|
51 | 1. Open `<a>`
|
52 | 1. Close `</a>`
|
53 | 1. StartEnd `<img/>`
|
54 |
|
55 | HTML5 doesn't have the notion of self-closing tags. Instead, it silently ignores
|
56 | the trailing `/`.
|
57 |
|
58 | We are bringing it back for humans, because we think it's too hard for people to
|
59 | remember the 16 void elements.
|
60 |
|
61 | And a lack of balanced tags causes visual bugs that are hard to debug. It would
|
62 | be better to get an error **earlier**.
|
63 |
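As a rough illustration of the three tag kinds and the "error earlier" idea, here is a minimal Python sketch. It is not the actual HTM8 lexer: `TAG_RE`, `classify_tags`, and `check_balanced` are hypothetical names, the pattern ignores attributes that contain `<` or `>`, and it assumes that every Open tag must be matched by a Close tag.

```python
import re

# Hypothetical, simplified tag pattern (an assumption, not the HTM8 lexer).
# It does not handle attribute values that contain '<' or '>'.
TAG_RE = re.compile(r'<(/?)([a-zA-Z][a-zA-Z0-9-]*)[^<>]*?(/?)>')


def classify_tags(doc):
    """Yield (kind, name) pairs, where kind is 'Open', 'Close' or 'StartEnd'."""
    for m in TAG_RE.finditer(doc):
        closing, name, self_closing = m.groups()
        if closing:
            yield ('Close', name)
        elif self_closing:
            yield ('StartEnd', name)
        else:
            yield ('Open', name)


def check_balanced(doc):
    """Raise ValueError at the first unbalanced tag; StartEnd tags need no Close."""
    stack = []
    for kind, name in classify_tags(doc):
        if kind == 'Open':
            stack.append(name)
        elif kind == 'Close':
            if not stack:
                raise ValueError('Unexpected </%s>' % name)
            if stack[-1] != name:
                raise ValueError('Expected </%s>, got </%s>' % (stack[-1], name))
            stack.pop()
    if stack:
        raise ValueError('Unclosed <%s>' % stack[-1])
```

Under these assumptions, `check_balanced('<p><img/></p>')` passes, while `check_balanced('<p><img></p>')` raises `ValueError`, because `<img>` without the trailing `/` is treated as an Open tag that is never closed.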
|
64 | ### Attributes - Quotes optional
|
65 |
|
66 | 5 closely related syntaxes:
|
67 |
|
68 | 1. Missing `<a missing>`
|
69 | 1. Empty `<a empty=>`
|
70 | 1. Unquoted `<a href=foo>`
|
71 | 1. Double Quoted `<a href="foo">`
|
72 | 1. Single Quoted `<a href='foo'>`
|
73 |
|
74 | Note: `<a href=/>` is disallowed because it's ambiguous. Use `<a href="/">` or
|
75 | `<a href=/ >` or `<a href= />`.
|
76 |
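The five attribute syntaxes can be told apart with a single regular expression. The sketch below is an illustration only: `ATTR_RE` and `classify_attr` are hypothetical names, the character classes for names and unquoted values are assumptions, and the function looks at one attribute in isolation, so it does not model the tag-level ambiguity of `<a href=/>`.

```python
import re

# Hypothetical pattern for a single attribute (an assumption, not the HTM8 lexer).
ATTR_RE = re.compile(r'''
    (?P<name> [a-zA-Z][a-zA-Z0-9_-]* )
    (?:
      (?P<eq> = )
      (?:
          " (?P<dq> [^"]* ) "          # double quoted value
        | ' (?P<sq> [^']* ) '          # single quoted value
        | (?P<unq> [^\s"'=<>`]+ )      # unquoted value
      )?
    )?
''', re.VERBOSE)


def classify_attr(text):
    """Return (name, kind, value) for one attribute such as 'href=foo'."""
    m = ATTR_RE.fullmatch(text)
    if m is None:
        raise ValueError('Invalid attribute: %r' % text)
    name = m.group('name')
    if m.group('eq') is None:
        return (name, 'Missing', None)
    if m.group('dq') is not None:
        return (name, 'DoubleQuoted', m.group('dq'))
    if m.group('sq') is not None:
        return (name, 'SingleQuoted', m.group('sq'))
    if m.group('unq') is not None:
        return (name, 'Unquoted', m.group('unq'))
    return (name, 'Empty', '')
```

For example, `classify_attr('empty=')` returns `('empty', 'Empty', '')` and `classify_attr('href=foo')` returns `('href', 'Unquoted', 'foo')`.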
|
77 | ### Text - Regular or CDATA
|
78 |
|
79 | #### Regular Text
|
80 |
|
81 | - Any UTF-8 text.
|
82 | - Generally, `& < > " '` should be escaped as `&amp; &lt; &gt; &quot; &apos;`.
|
83 |
|
84 | But we are lenient and allow raw `>` between tags:
|
85 |
|
86 |     <p> foo > bar </p>
|
87 |
|
88 | and raw `<` inside tags:
|
89 |
|
90 |     <span foo="<" > foo </span>
|
91 |
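The strict form of the escaping rule for regular text can be sketched as a small Python function. `escape_text` is a hypothetical helper, not part of any existing API. It escapes all five characters, even though the lenient cases shown above would be accepted unescaped.

```python
def escape_text(s):
    """Escape the five special characters as named entities."""
    # '&' must be replaced first, so the entities produced by the later
    # replacements are not escaped a second time.
    return (s.replace('&', '&amp;')
             .replace('<', '&lt;')
             .replace('>', '&gt;')
             .replace('"', '&quot;')
             .replace("'", '&apos;'))
```

For example, `escape_text('a & b < c')` returns `'a &amp; b &lt; c'`.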
|
92 | #### CDATA
|
93 |
|
94 | Like HTML5, we support explicit `<
|
279 | - [table-object-doc.html](table-object-doc.html)
|
280 |
|
281 |
|
282 | ## Brainstorming / TODO
|
283 |
|
284 | ### Foreign XML with `<svg>` and `<math>` ?
|
285 |
|
286 | `<svg>` and `<math>` are foreign XML content.
|
287 |
|
288 | We might want to support this.
|
289 |
|
290 | - So I can just switch to XML mode in that case
|
291 | - TODO: we need a test corpus for this!
|
292 |   - maybe look for Wikipedia content
|
293 | - can we also just disallow these? Can you make these into external XML files?
|
294 |
|
295 | This is one way:
|
296 |
|
297 |     <object data="math.xml" type="application/mathml+xml"></object>
|
298 |     <object data="drawing.xml" type="image/svg+xml"></object>
|
299 |
|
300 | Then we don't need special parsing?
|