1 | ---
2 | in_progress: yes
3 | default_highlighter: oils-sh
4 | ---
5 |
6 | HTM8 - An Easy Subset of HTML5, With Errors
7 | =================================
8 |
9 | - Syntax Errors: It's a Subset
10 | - Easy
11 | - Easy to Remember
12 | - Easy to Implement
13 | - Runs Efficiently - you don't have to materialize a big DOM tree, which
14 | causes many allocations
15 | - Convertible to XML?
16 | - without allocations, with a `sed`-like transformation!
17 | - low level lexing and matching
18 |
19 | <!--
20 |
21 | TODO: 99.9% of HTML documents from CommonCrawl should be convertible to XML,
22 | and then validated by an XML parser
23 |
24 | - lxml - this is supposed to be high quality
25 |
26 | - Python stdlib uses expat - https://libexpat.github.io/
27 |
28 | - Gah it's this huge thing, 8K lines: https://github.com/libexpat/libexpat/blob/master/expat/lib/xmlparse.c
29 | - do they have the billion laughs bug?
30 |
31 | -->
32 |
33 | <div id="toc">
34 | </div>
35 |
36 | ## Basic Structure
37 |
38 | ### Text Content
39 |
40 | Anything except `&` and `<`.
41 |
42 | These must be `&amp;` and `&lt;`.
43 |
44 | `>` is allowed, or you can escape it with `&gt;`.
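
A minimal sketch of the text rule above, in Python. The helper name `escape_text` is hypothetical and not part of any HTM8 API. It escapes `&` and `<`, and leaves `>` alone, since that is allowed in text.

```python
def escape_text(s: str) -> str:
    """Escape a string so it can be used as HTM8 text content.

    Hypothetical helper for illustration only, not an HTM8 API.
    """
    # Escape '&' first, so the '&' in '&lt;' is not escaped a second time.
    s = s.replace('&', '&amp;')
    s = s.replace('<', '&lt;')
    # '>' is allowed in text, so it is left as-is.
    return s


print(escape_text('a < b && c > d'))
```

The order of the two replacements matters: escaping `<` first would produce `&lt;`, whose `&` would then be escaped again.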
45 |
46 | ### 3 Kinds of Character Code
47 |
48 | 1. `&amp;` - named
49 | 1. `&#999;` - decimal
50 | 1. `&#xff;` - hex
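
The three kinds of character code can be told apart with one regular expression. This is a sketch under assumptions: the pattern `CHAR_REF` and the function `classify` are hypothetical names, and the exact character classes (for example, which characters a name may contain) are guesses that the real lexer may not share.

```python
import re

# Hypothetical pattern for illustration; the real HTM8 lexer may differ.
CHAR_REF = re.compile(r'''
    & (?:
        (?P<named> [a-zA-Z][a-zA-Z0-9]* )   # named, e.g. amp
      | \# (?P<decimal> [0-9]+ )            # decimal, e.g. 999
      | \#x (?P<hex> [0-9a-fA-F]+ )         # hex, e.g. ff
    ) ;
''', re.VERBOSE)


def classify(ref: str) -> str:
    """Return 'named', 'decimal', 'hex', or 'invalid' for a character code."""
    m = CHAR_REF.fullmatch(ref)
    if m is None:
        return 'invalid'
    # Exactly one named group takes part in a successful match.
    return m.lastgroup


for ref in ['&amp;', '&#999;', '&#xff;', '&;']:
    print(ref, classify(ref))
```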
51 |
52 | ### 3 Kinds of Tag
53 |
54 | 1. Start
55 | 1. End
56 | 1. StartEnd
57 |
58 | ### 2 Kinds of Attribute
59 |
60 | 1. Unquoted
61 | 1. Quoted
62 |
63 | ### 2 Kinds of Comment
64 |
65 | 1. `<!-- -->`
66 | 1. `<? ?>` (XML processing instruction)
67 |
68 |
69 | ## Special Rules, From HTML
70 |
71 | ### 2 Tags Cause Special Lexing
72 |
73 | - `<script> <style>`
74 |
75 | Note: we still have CDATA for compatibility.
76 |
77 |
78 | ### 16 VOID Tags Change Parsing
79 |
80 | - `<source> ...`
81 |
82 | ### Bonus: XML Mode
83 |
84 | - Get rid of the 2 special lexing tags, and 16 VOID tags
85 |
86 | Then you can query HTML
87 |
88 |
89 | ## Under the Hood
90 |
91 | ### 3 Layers of Lexing
92 |
93 | 1. Tag
94 | 1. Attributes within a Tag
95 | 1. Quoted Value for Attributes
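
The three layers can be sketched as three small regular expressions, each applied to the span matched by the layer above it. This is an illustration under assumptions, not the actual lexer: the names `TAG`, `ATTR`, `VALUE_PART`, and `lex` are hypothetical, the character classes are simplified, and the special lexing for `<script>` and `<style>` is ignored.

```python
import re

# Layer 1: tags within the document. Assumes no literal '>' inside a tag.
TAG = re.compile(r'<(/?)([a-zA-Z][a-zA-Z0-9]*)([^>]*)>')

# Layer 2: attributes within a tag, with an optional quoted or unquoted value.
ATTR = re.compile(
    r'''([a-zA-Z][a-zA-Z0-9-]*)(?:=("[^"]*"|'[^']*'|[^\s"'>/]+))?''')

# Layer 3: parts of a value: a character code, or a run of other text.
VALUE_PART = re.compile(r'&[a-zA-Z0-9#]+;|[^&]+')


def lex(doc):
    """Yield (tag_name, attr_name, value_parts) for each start tag attribute."""
    for tag in TAG.finditer(doc):
        if tag.group(1):  # skip end tags like </a>
            continue
        for attr in ATTR.finditer(tag.group(3)):
            raw = attr.group(2) or ''
            if raw[:1] in ('"', "'"):
                raw = raw[1:-1]  # strip the surrounding quotes
            yield tag.group(2), attr.group(1), VALUE_PART.findall(raw)


for item in lex('<a href="x?a=1&amp;b=2" class=left>hi</a>'):
    print(item)
```

Each layer only looks at the bytes the previous layer handed it, so none of them has to build a tree.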
96 |
97 | ## What Do You Use This for?
98 |
99 | - Stripping comments
100 | - Adding TOC
101 | - Syntax highlighting code
102 | - Adding link shortcuts
103 | - ul-table
104 |
105 | TODO:
106 |
107 | - DOM API on top of it
108 | - node.elementsByTag('p')
109 | - node.elementsByClassName('left')
110 | - node.elementByID('foo')
111 | - innerHTML() outerHTML()
112 | - tag attrs
113 | - low level:
114 | - outerLeft, outerRight, innerLeft, innerRight
115 | - CSS Selectors - `querySelectorAll()`
116 | - sed-like model
117 |
118 | ## Algorithms
119 |
120 | ### Emitting HTM8 as HTML5
121 |
122 | Just emit it! This always works, by design.
123 |
124 | ### Parsing XML
125 |
126 | - Set `NO_SPECIAL_TAGS`
127 |
128 |
129 | Conflicts between HTML5 and XML:
130 |
131 | - In XML, `<source>` is like any tag, and must be closed
132 | - In HTML, `<source>` is a VOID tag, and must NOT be closed
133 |
134 | - In XML, `<script>` and `<style>` don't have special treatment
135 | - In HTML, they do
136 |
137 | - The header is different - `<!DOCTYPE html>` vs. `<?xml version= ... ?>`
138 |
139 | - HTML: `<a empty= missing>` is two attributes
140 | - right now we don't handle `<a empty = "missing">` as a single attribute
141 | - that is valid XML, so should we handle it?
142 |
143 | ### Converting to XML?
144 |
145 | - Add quotes to unquoted attributes
146 | - single and double quotes stay the same?
147 | - Quote special chars
148 | - & BadAmpersand -> `&amp;`
149 | - < BadLessThan -> `&lt;`
150 | - > BadGreaterThan -> `&gt;`
151 | - `<script>` and `<style>`
152 | - either add `<
178 | - [table-object-doc.html](table-object-doc.html)
179 |
180 | ## FAQ
181 |
182 | ### What Doesn't This Cover?
183 |
184 | - HTM8 tags must be balanced to convert them to XML
185 |
186 | - NUL bytes aren't allowed - currently due to re2c sentinel
187 | - Although I think we could have a preprocessing pass convert them to the
188 | Unicode replacement char? I think that HTML might mandate that
189 | - Encodings other than UTF-8. HTM8 is always UTF-8.
190 | - Unicode Tag names and attribute names.
191 | - This is allowed in HTML5 and XML.
192 | - We leave those out for simpler lexing. Text and attribute values may be unicode.
193 |
194 | - `<a href=">">` - no literal `>` inside quotes
195 | - HTML5 handles it, but we want to easily scan the "top level" structure of the doc
196 | - And it doesn't appear to be common in our testdata
197 | - TODO: we will handle `<a href="&">`
198 |
199 | There are 5 kinds of tags:
200 |
201 | - Normal HTML tags
202 | - RCDATA for `<title> <textarea>`
203 | - RAWTEXT `<style> <xmp> <iframe>` ?
204 |
205 | and we have
206 |
207 | - CDATA `<script>`
208 | - TODO: we need a test case for `</script>` in a string literal?
209 | - Foreign `<math> <svg>` - XML rules
210 |
211 | ## TODO
212 |
213 | - `<svg>` and `<math>` are foreign XML content? Doh
214 | - So I can just switch to XML mode in that case
215 | - TODO: we need a test corpus for this!
216 | - maybe look for wikipedia content
217 | - can we also just disallow these? Can you make these into external XML files?
218 |
219 | This is one way:
220 |
221 | <object data="math.xml" type="application/mathml+xml"></object>
222 | <object data="drawing.xml" type="image/svg+xml"></object>
223 |
224 | Then we don't need special parsing?
225 |