---
in_progress: yes
default_highlighter: oils-sh
---

HTM8 - An Easy Subset of HTML5, With Errors
=================================

- Syntax Errors: It's a Subset
- Easy
  - Easy to Remember
  - Easy to Implement
  - Runs Efficiently - you don't have to materialize a big DOM tree, which
    causes many allocations
- Convertible to XML?
  - without allocations, with a `sed`-like transformation!
  - low level lexing and matching

<!--

TODO: 99.9% of HTML documents from CommonCrawl should be convertible to XML,
and then validated by an XML parser

- lxml - this is supposed to be high quality

- Python stdlib uses expat - https://libexpat.github.io/

- Gah it's this huge thing, 8K lines: https://github.com/libexpat/libexpat/blob/master/expat/lib/xmlparse.c
  - do they have the billion laughs bug?

-->

<div id="toc">
</div>

## Basic Structure

### Text Content

Anything except `&` and `<`.

These must be `&amp;` and `&lt;`.

`>` is allowed, or you can escape it with `&gt;`.

### 3 Kinds of Character Code

1. `&amp;` - named
1. `&#999;` - decimal
1. `&#xff;` - hex
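As a rough illustration, the three kinds of character code can be told apart with one regular expression. This is a minimal Python sketch, not the actual HTM8 lexer: the group names `named`, `decimal`, and `hex` and the function `classify_char_codes` are made up for this example, and the rule for what counts as a name is an assumption.

```python
import re

# One alternative per kind of character code.  The group names are
# hypothetical labels chosen for this sketch.
CHAR_CODE_RE = re.compile(r'''
    &
    (?:
        (?P<named>[a-zA-Z][a-zA-Z0-9]*)     # like &amp;
      | \#(?P<decimal>[0-9]+)               # like &#999;
      | \#[xX](?P<hex>[0-9a-fA-F]+)         # like &#xff;
    )
    ;
''', re.VERBOSE)


def classify_char_codes(text):
    """Return (kind, matched_text) pairs for each character code in text."""
    return [(m.lastgroup, m.group(0)) for m in CHAR_CODE_RE.finditer(text)]


print(classify_char_codes('AT&amp;T &#999; &#xff;'))
# [('named', '&amp;'), ('decimal', '&#999;'), ('hex', '&#xff;')]
```

A bare `&` that isn't followed by a well-formed code simply doesn't match here.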

### 3 Kinds of Tag

1. Start
1. End
1. StartEnd
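To make the three kinds concrete, here is a small Python sketch that labels each tag in a string as Start, End, or StartEnd. It assumes that StartEnd means a tag with a trailing slash such as `<br/>`, and that tag names are ASCII letters, digits, and hyphens; `TAG_RE` and `classify_tags` are hypothetical names, not part of any HTM8 API.

```python
import re

TAG_RE = re.compile(r'''
    <
    (?P<end>/)?                        # leading slash: End tag
    (?P<name>[a-zA-Z][a-zA-Z0-9-]*)
    (?P<rest>[^>]*?)                   # attribute text, not examined here
    (?P<startend>/)?                   # trailing slash: StartEnd tag
    >
''', re.VERBOSE)


def classify_tags(html):
    """Return (kind, tag_name) pairs for each tag found in html."""
    result = []
    for m in TAG_RE.finditer(html):
        if m.group('end'):
            kind = 'End'
        elif m.group('startend'):
            kind = 'StartEnd'
        else:
            kind = 'Start'
        result.append((kind, m.group('name')))
    return result


print(classify_tags('<p class="x">hi<br/></p>'))
# [('Start', 'p'), ('StartEnd', 'br'), ('End', 'p')]
```

Comments and `<? ?>` are skipped because `!` and `?` can't begin a tag name in this pattern.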

### 2 Kinds of Attribute

1. Unquoted
1. Quoted
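The two kinds of attribute could be lexed along these lines. This Python sketch is an approximation under stated assumptions: `ATTR_RE` and `lex_attrs` are hypothetical names, the set of characters excluded from unquoted values follows HTML5, quoted values may not contain a literal `>`, and attributes without a value are not reported at all.

```python
import re

ATTR_RE = re.compile(r'''
    \s+
    (?P<name>[a-zA-Z][a-zA-Z0-9_:-]*)
    =
    (?:
        "(?P<double>[^">]*)"             # Quoted, double quotes
      | '(?P<single>[^'>]*)'             # Quoted, single quotes
      | (?P<unquoted>[^\s"'=<>`]+)       # Unquoted
    )
''', re.VERBOSE)


def lex_attrs(attr_text):
    """Return (name, kind, value) triples.

    attr_text is the part of a start tag that follows the tag name.
    """
    result = []
    for m in ATTR_RE.finditer(attr_text):
        if m.group('unquoted') is not None:
            result.append((m.group('name'), 'Unquoted', m.group('unquoted')))
        elif m.group('double') is not None:
            result.append((m.group('name'), 'Quoted', m.group('double')))
        else:
            result.append((m.group('name'), 'Quoted', m.group('single')))
    return result


print(lex_attrs(' href="x.html" class=left'))
# [('href', 'Quoted', 'x.html'), ('class', 'Unquoted', 'left')]
```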

### 2 Kinds of Comment

1. `<!-- -->`
1. `<? ?>` (XML processing instruction)

## Special Rules, From HTML

### 2 Tags Cause Special Lexing

- `<script> <style>`

Note: we still have CDATA for compatibility.
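One plausible way to picture the special lexing: after a `<script>` or `<style>` start tag, the lexer stops looking for markup and skips ahead to the matching end tag. The Python sketch below assumes the end tag is matched case-insensitively, which this document does not state; `raw_text_span` is a hypothetical helper, not an HTM8 function.

```python
import re


def raw_text_span(html, tag_name, content_start):
    """Return (start, end) offsets of the raw text inside <script> or <style>.

    content_start is the offset just past the start tag's closing '>'.
    Returns None if no matching end tag is found.
    """
    end_tag = re.compile('</' + re.escape(tag_name) + r'\s*>', re.IGNORECASE)
    m = end_tag.search(html, content_start)
    if m is None:
        return None
    return (content_start, m.start())


doc = '<script>if (a < b) { s = "<p>"; }</script><p>hi</p>'
start, end = raw_text_span(doc, 'script', len('<script>'))
print(doc[start:end])
# if (a < b) { s = "<p>"; }
```

The `<` and `<p>` inside the script body are never treated as tags, because nothing between the two offsets is lexed.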

### 16 VOID Tags Change Parsing

- `<source> ...`

### Bonus: XML Mode

- Get rid of the 2 special lexing tags, and 16 VOID tags

Then you can query HTML
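The effect of VOID tags on parsing, and of turning them off in XML mode, can be sketched as a stack-based balance check. This Python example is an assumption-laden illustration: `VOID_TAGS` holds only a few tags that are void in HTML5, since this document doesn't list all 16, and `is_balanced` with its `xml_mode` flag is a hypothetical function. Comments and `<script>` contents are not handled.

```python
import re

# Illustrative subset only; the full list of 16 VOID tags isn't given here.
VOID_TAGS = frozenset(['br', 'hr', 'img', 'input', 'link', 'meta', 'source'])

# Groups: optional leading slash, tag name, optional trailing slash.
TAG_RE = re.compile(r'<(/?)([a-zA-Z][a-zA-Z0-9-]*)[^>]*?(/?)>')


def is_balanced(html, xml_mode=False):
    """Check that every non-void start tag has a matching end tag."""
    stack = []
    for m in TAG_RE.finditer(html):
        closing, name, self_closing = m.groups()
        if closing:
            if not stack or stack.pop() != name:
                return False
        elif self_closing:
            continue            # StartEnd tag: nothing to match later
        elif not xml_mode and name in VOID_TAGS:
            continue            # VOID tag: must NOT be closed in HTML
        else:
            stack.append(name)
    return not stack


print(is_balanced('<p>a<br>b</p>'))                  # True
print(is_balanced('<p>a<br>b</p>', xml_mode=True))   # False
```

In XML mode `<br>` is treated like any other tag, so the same input is reported as unbalanced.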

## Under the Hood

### 3 Layers of Lexing

1. Tag
1. Attributes within a Tag
1. Quoted Value for Attributes

## What Do You Use This for?

- Stripping comments
- Adding TOC
- Syntax highlighting code
- Adding link shortcuts
- ul-table
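The first use case, stripping comments, shows the general shape of these tools: find spans in the input and copy everything else through, without building a tree. Below is a minimal Python sketch under simplifying assumptions. It handles only `<!-- -->` comments and does not know that `<!--` inside `<script>` is not a comment; `strip_comments` is a hypothetical name.

```python
import re

COMMENT_RE = re.compile(r'<!--.*?-->', re.DOTALL)


def strip_comments(html):
    """Copy html to the output, skipping the span of each comment."""
    out = []
    pos = 0
    for m in COMMENT_RE.finditer(html):
        out.append(html[pos:m.start()])   # text before the comment
        pos = m.end()                     # resume after the comment
    out.append(html[pos:])
    return ''.join(out)


print(strip_comments('<p>a<!-- note -->b</p>'))
# <p>ab</p>
```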

TODO:

- DOM API on top of it
  - node.elementsByTag('p')
  - node.elementsByClassName('left')
  - node.elementByID('foo')
  - innerHTML() outerHTML()
  - tag attrs
  - low level:
    - outerLeft, outerRight, innerLeft, innerRight
- CSS Selectors - `querySelectorAll()`
- sed-like model
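One possible reading of the four low-level names above is that they are byte offsets into the document: the outer pair brackets the whole element, and the inner pair brackets its contents. The Python sketch below follows that reading, which is an assumption, as are the function name `element_spans` and its return order. It finds the first element with a given tag name and does not treat `<tag/>` specially.

```python
import re


def element_spans(html, tag_name):
    """Return (outer_left, inner_left, inner_right, outer_right) offsets
    for the first balanced <tag_name> element, or None if there is none.
    """
    tag_re = re.compile(
        r'<(/?)' + re.escape(tag_name) + r'(?![a-zA-Z0-9-])[^>]*>')
    depth = 0
    outer_left = inner_left = None
    for m in tag_re.finditer(html):
        if m.group(1):                  # end tag
            if depth == 0:
                continue                # stray end tag before any start
            depth -= 1
            if depth == 0:
                return (outer_left, inner_left, m.start(), m.end())
        else:                           # start tag
            if depth == 0:
                outer_left, inner_left = m.start(), m.end()
            depth += 1
    return None


html = '<body><div id="x">a<div>b</div>c</div></body>'
outer_left, inner_left, inner_right, outer_right = element_spans(html, 'div')
print(html[inner_left:inner_right])   # a<div>b</div>c
print(html[outer_left:outer_right])   # <div id="x">a<div>b</div>c</div>
```

With offsets like these, `innerHTML()` and `outerHTML()` reduce to slicing the original string, so no tree needs to be allocated.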

## Algorithms

### Emitting HTM8 as HTML5

Just emit it! This always works, by design.

### Parsing XML

- Set `NO_SPECIAL_TAGS`

Conflicts between HTML5 and XML:

- In XML, `<source>` is like any tag, and must be closed
- In HTML, `<source>` is a VOID tag, and must NOT be closed

- In XML, `<script>` and `<style>` don't have special treatment
- In HTML, they do

- The header is different - `<!DOCTYPE html>` vs. `<?xml version= ... ?>`

- HTML: `<a empty= missing>` is two attributes
- right now we don't handle `<a empty = "missing">` as a single attribute
  - that is valid XML, so should we handle it?

### Converting to XML?

- Add quotes to unquoted attributes
  - single and double quotes stay the same?
- Quote special chars
  - & BadAmpersand -> `&amp;`
  - < BadLessThan -> `&lt;`
  - > BadGreaterThan -> `&gt;`
- `<script>` and `<style>`
  - either add `<![CDATA[`
  - or simply escape their values with `&amp; &lt;`
- what to do about case-insensitive tags?
  - maybe you can just normalize them
  - because we do strict matching
- Maybe validate any other declarations, like `<!DOCTYPE foo>`
- Add XML header `<?xml version=>`, remove `<!DOCTYPE html>`
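The first step in the list, adding quotes to unquoted attributes, might look like the Python sketch below. It is only an illustration: `ATTR_RE` and `quote_attrs` are hypothetical names, already-quoted values are passed through unchanged (one possible answer to the open question above), and the value itself is not escaped.

```python
import re

# Match each name=value attribute in order, so that text inside a quoted
# value is consumed as a whole and never mistaken for another attribute.
ATTR_RE = re.compile(r'''
    (?P<name>[a-zA-Z][a-zA-Z0-9_:-]*)
    =
    (?:
        (?P<quoted>"[^">]*"|'[^'>]*')
      | (?P<unquoted>[^\s"'=<>`]+)
    )
''', re.VERBOSE)


def quote_attrs(start_tag):
    """Return start_tag with double quotes added around unquoted values."""
    def fix(m):
        if m.group('unquoted') is not None:
            return '%s="%s"' % (m.group('name'), m.group('unquoted'))
        return m.group(0)   # already quoted: leave as is
    return ATTR_RE.sub(fix, start_tag)


print(quote_attrs('<a title="a b=c" class=left>'))
# <a title="a b=c" class="left">
```

This is a string-to-string rewrite of a single start tag, in keeping with the `sed`-like model.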

## Leniency

Angle brackets:

- `<a foo="<">` is allowed, but `<a foo=">">` is disallowed
- `<p> 4>3 </p>` is allowed, but `<p> 4<3 </p>` is disallowed

This makes lexing the top-level structure easier.

- unescaped `&` is allowed, unlike XML
  - it's very common in `<a href="?foo=42&bar=99">`
  - It's lexed as BadAmpersand, in case you want to fix it for XML. Although
    we don't do that for < and > consistently.
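To show what fixing a BadAmpersand for XML could involve, here is a short Python sketch. It assumes a BadAmpersand is any `&` that does not begin a well-formed named, decimal, or hex character code; the exact rule HTM8 uses may differ, and `escape_bad_ampersands` is a hypothetical name.

```python
import re

# An & that is NOT followed by a well-formed character code and ';'.
BAD_AMPERSAND_RE = re.compile(
    r'&(?!(?:[a-zA-Z][a-zA-Z0-9]*|#[0-9]+|#[xX][0-9a-fA-F]+);)')


def escape_bad_ampersands(text):
    """Replace each stray & with &amp;, leaving character codes alone."""
    return BAD_AMPERSAND_RE.sub('&amp;', text)


print(escape_bad_ampersands('<a href="?foo=42&bar=99">'))
# <a href="?foo=42&amp;bar=99">
```

Existing codes such as `&amp;` and `&#999;` are left untouched, so running the function twice gives the same result as running it once.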

## Related

- [ysh-doc-processing.html](ysh-doc-processing.html)
- [table-object-doc.html](table-object-doc.html)

## FAQ

### What Doesn't This Cover?

- HTM8 tags must be balanced to convert them to XML

- NUL bytes aren't allowed - currently due to re2c sentinel
  - Although I think we could have a preprocessing pass to convert it to the
    Unicode replacement char? I think that HTML might mandate that
- Encodings other than UTF-8. HTM8 is always UTF-8.
- Unicode Tag names and attribute names.
  - This is allowed in HTML5 and XML.
  - We leave those out for simpler lexing. Text and attribute values may be Unicode.

- `<a href=">">` - no literal `>` inside quotes
  - HTML5 handles it, but we want to easily scan the "top level" structure of the doc
  - And it doesn't appear to be common in our testdata
  - TODO: we will handle `<a href="&">`

There are 5 kinds of tags:

- Normal HTML tags
- RCDATA for `<title> <textarea>`
- RAWTEXT `<style> <xmp> <iframe>` ?

and we have

- CDATA `<script>`
  - TODO: we need a test case for `</script>` in a string literal?
- Foreign `<math> <svg>` - XML rules

## TODO

- `<svg>` and `<math>` are foreign XML content? Doh
  - So I can just switch to XML mode in that case
  - TODO: we need a test corpus for this!
    - maybe look for Wikipedia content
- can we also just disallow these? Can you make these into external XML files?

This is one way:

    <object data="math.xml" type="application/mathml+xml"></object>
    <object data="drawing.xml" type="image/svg+xml"></object>

Then we don't need special parsing?