---
in_progress: yes
default_highlighter: oils-sh
---

HTM8 - An Easy Subset of HTML5, With Errors
=================================

- Syntax Errors: It's a Subset
- Easy
  - Easy to Remember
  - Easy to Implement
  - Runs Efficiently - you don't have to materialize a big DOM tree, which
    causes many allocations
- Convertible to XML?
  - without allocations, with a `sed`-like transformation!
  - low level lexing and matching

<!--

TODO: 99.9% of HTML documents from CommonCrawl should be convertible to XML,
and then validated by an XML parser

- lxml - this is supposed to be high quality

- Python stdlib uses expat - https://libexpat.github.io/

- Gah it's this huge thing, 8K lines: https://github.com/libexpat/libexpat/blob/master/expat/lib/xmlparse.c
  - do they have the billion laughs bug?

-->

<div id="toc">
</div>

## Basic Structure

### Text Content

Anything except `&` and `<`.

These must be `&amp;` and `&lt;`.

`>` is allowed, or you can escape it with `&gt;`.

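A minimal sketch of that rule in Python, for producing valid text content from a plain string. The function name `escape_text` is made up for illustration; it is not part of any HTM8 API.

```python
def escape_text(s: str) -> str:
    """Escape a plain string so it can be emitted as HTM8 text content."""
    # Replace & first, so the & in the new "&lt;" isn't escaped a second time.
    # > is left alone, because it's allowed in text.
    return s.replace("&", "&amp;").replace("<", "&lt;")


print(escape_text("a < b & c > d"))
```

The order of the two replacements matters: escaping `<` first would turn `<` into `&lt;` and then into `&amp;lt;`.
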
### 3 Kinds of Character Code

1. `&amp;` - named
1. `&#999;` - decimal
1. `&#xff;` - hex

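The three kinds can be told apart with one regular expression. This is only a sketch: the group names, the helper names, and the exact character classes (ASCII letters and digits for names) are assumptions, not the real HTM8 lexer.

```python
import re

# One alternative per kind of character code.
CHAR_CODE = re.compile(
    r"&(?:"
    r"(?P<named>[a-zA-Z][a-zA-Z0-9]*)"   # &amp;
    r"|#(?P<decimal>[0-9]+)"             # &#999;
    r"|#[xX](?P<hex>[0-9a-fA-F]+)"       # &#xff;
    r");"
)


def char_codes(s):
    """Yield (kind, value) for each character code found in s."""
    for m in CHAR_CODE.finditer(s):
        kind = m.lastgroup  # name of the alternative that matched
        yield kind, m.group(kind)


def to_char(kind, value):
    """Decode the two numeric kinds; named codes need a lookup table."""
    if kind == "decimal":
        return chr(int(value))
    if kind == "hex":
        return chr(int(value, 16))
    return None


print(list(char_codes("a &amp; b &#999; c &#xff; d")))
```
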
### 3 Kinds of Tag

1. Start - `<p>`
1. End - `</p>`
1. StartEnd - `<br/>`

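As a rough sketch, the kind of a single tag can be decided from its first and last characters. The regex and the function name `tag_kind` are illustrative; in particular, the tag name pattern (ASCII letters, digits, hyphen) is an assumption.

```python
import re

# <, optional /, tag name, everything up to the first >
TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9-]*)([^>]*)>")


def tag_kind(tag):
    """Return 'Start', 'End', 'StartEnd', or None if tag isn't a single tag."""
    m = TAG.fullmatch(tag)
    if m is None:
        return None
    if m.group(1):                 # leading slash
        return "End"
    if m.group(3).endswith("/"):   # trailing slash
        return "StartEnd"
    return "Start"


for t in ['<p class="x">', "</p>", "<br/>", "<!-- c -->"]:
    print(t, tag_kind(t))
```
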
### 2 Kinds of Attribute

1. Unquoted - `<a href=foo>`
1. Quoted - `<a href="foo">`

### 2 Kinds of Comment

1. `<!-- -->`
1. `<? ?>` (XML processing instruction)

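Stripping both kinds of comment can be sketched with a single non-greedy regex. This is a simplification: it assumes the document has no `<script>` or `<style>` content, where `<!--` may be literal text rather than a comment.

```python
import re

# Both comment forms. DOTALL lets a comment span several lines.
COMMENT = re.compile(r"<!--.*?-->|<\?.*?\?>", re.DOTALL)


def strip_comments(doc: str) -> str:
    """Remove <!-- --> and <? ?> spans from doc."""
    return COMMENT.sub("", doc)


print(strip_comments("<p>a<!-- x\ny -->b</p><?php z ?>"))
```
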
## Special Rules, From HTML

### 2 Tags Cause Special Lexing

- `<script> <style>`

Note: we still have CDATA for compatibility.

### 16 VOID Tags Change Parsing

- `<source> ...`

### Bonus: XML Mode

- Get rid of the 2 special lexing tags, and 16 VOID tags

Then you can query HTML

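To see how VOID tags change parsing, here is a sketch of a tag balance check. The `VOID` set is a small illustrative subset, NOT the full list of 16, and `is_balanced` is a made-up name. Passing an empty set treats every tag alike, which is roughly what XML mode means for VOID tags. Comments, `<script>` and `<style>` are ignored here.

```python
import re

# optional leading slash, tag name, optional trailing slash
TAG = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9-]*)[^>]*?(/?)>")

# Illustrative subset only, not the 16 VOID tags
VOID = frozenset(["source", "br", "img"])


def is_balanced(doc, void=VOID):
    """Check that every Start tag has a matching End tag, with strict name matching."""
    stack = []
    for m in TAG.finditer(doc):
        slash, name, self_close = m.groups()
        if slash:                           # End tag: must match the last Start tag
            if not stack or stack.pop() != name:
                return False
        elif self_close or name in void:    # StartEnd or VOID: nothing to close
            continue
        else:                               # Start tag
            stack.append(name)
    return not stack


doc = '<video><source src="a.mp4"></video>'
print(is_balanced(doc))                    # HTML rules
print(is_balanced(doc, void=frozenset()))  # no VOID tags, as in XML
```

The same document is balanced under the HTML rules and unbalanced without VOID tags, because `<source>` is then an ordinary Start tag that is never closed.
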
## Under the Hood

### 3 Layers of Lexing

1. Tag
1. Attributes within a Tag
1. Quoted Value for Attributes

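The layering can be sketched as three nested regex passes, each one run only on the span found by the layer above. This is not the real lexer: the regexes are simplified, and the names `lex`, `lex_value`, `CharCode` and `RawText` are invented for this example. Only `BadAmpersand` is a name taken from this doc.

```python
import re

# Layer 1: start tags in the document
TAG = re.compile(r"<([a-zA-Z][a-zA-Z0-9-]*)([^>]*)>")

# Layer 2: attributes within a tag, with an optional quoted or unquoted value
ATTR = re.compile(
    r"""([a-zA-Z][a-zA-Z0-9-]*)(?:=(?:"([^"]*)"|'([^']*)'|([^\s"'>]+)))?"""
)

# Layer 3: pieces of a quoted value: character code, stray &, or other text
VALUE_PART = re.compile(
    r"&(?:[a-zA-Z][a-zA-Z0-9]*|#[0-9]+|#[xX][0-9a-fA-F]+);|&|[^&]+"
)


def lex_value(value):
    """Layer 3: split a quoted value into (kind, text) pairs."""
    parts = []
    for m in VALUE_PART.finditer(value):
        s = m.group(0)
        if s == "&":
            kind = "BadAmpersand"
        elif s.startswith("&"):
            kind = "CharCode"
        else:
            kind = "RawText"
        parts.append((kind, s))
    return parts


def lex(doc):
    """Yield (tag name, attribute name, value) for each attribute in doc."""
    for tag in TAG.finditer(doc):                      # layer 1
        for attr in ATTR.finditer(tag.group(2)):       # layer 2
            name, dq, sq, unquoted = attr.groups()
            quoted = dq if dq is not None else sq
            if quoted is not None:
                yield tag.group(1), name, lex_value(quoted)   # layer 3
            else:
                yield tag.group(1), name, unquoted     # unquoted, or None if no value


for item in lex('<a href="?foo=42&bar=99" id=x>'):
    print(item)
```

Each layer only has to deal with a small grammar, and the lower layers never run on text outside a tag.
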
## What Do You Use This For?

- Stripping comments
- Adding a TOC
- Syntax highlighting code
- Adding link shortcuts
- ul-table

TODO:

- DOM API on top of it
  - `node.elementsByTag('p')`
  - `node.elementsByClassName('left')`
  - `node.elementByID('foo')`
  - `innerHTML()` `outerHTML()`
  - tag attrs
  - low level:
    - outerLeft, outerRight, innerLeft, innerRight
- CSS Selectors - `querySelectorAll()`
- sed-like model

## Algorithms

### Emitting HTM8 as HTML5

Just emit it! This always works, by design.

### Parsing XML

- Set `NO_SPECIAL_TAGS`

Conflicts between HTML5 and XML:

- In XML, `<source>` is like any tag, and must be closed
- In HTML, `<source>` is a VOID tag, and must NOT be closed

- In XML, `<script>` and `<style>` don't have special treatment
- In HTML, they do

- The header is different - `<!DOCTYPE html>` vs. `<?xml version= ... ?>`

- HTML: `<a empty= missing>` is two attributes
- right now we don't handle `<a empty = "missing">` as a single attribute
  - that is valid XML, so should we handle it?

### Converting to XML?

- Add quotes to unquoted attributes
  - single and double quotes stay the same?
- Escape special chars
  - `&` BadAmpersand -> `&amp;`
  - `<` BadLessThan -> `&lt;`
  - `>` BadGreaterThan -> `&gt;`
- `<script>` and `<style>`
  - either add `<![CDATA[`
  - or simply escape their values with `&amp; &lt;`
- what to do about case-insensitive tags?
  - maybe you can just normalize them
  - because we do strict matching
- Maybe validate any other declarations, like `<!DOCTYPE foo>`
- Add XML header `<?xml version=>`, remove `<!DOCTYPE html>`

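The first two steps can be sketched as a `sed`-like pass over the string, with no tree in between. This is a hypothetical sketch, not the planned converter: `to_xml` and the regexes are made up, and it assumes the input has no comments, declarations, `<script>` or `<style>`. Python's `xml.etree.ElementTree` (which uses expat) checks the result.

```python
import re
import xml.etree.ElementTree as ET

TAG = re.compile(r"</?[a-zA-Z][^>]*>")

# name=value, where the value is quoted (group 2) or unquoted (group 3)
ATTR = re.compile(r"""([a-zA-Z][a-zA-Z0-9-]*)=(?:("[^"]*"|'[^']*')|([^\s"'>]+))""")

# An & that does not start a named, decimal, or hex character code
BAD_AMP = re.compile(r"&(?!(?:[a-zA-Z][a-zA-Z0-9]*|#[0-9]+|#[xX][0-9a-fA-F]+);)")


def fix_attr(m):
    """Quote an unquoted value, then escape stray & and < inside it."""
    name, quoted, unquoted = m.groups()
    if quoted is None:
        quoted = '"' + unquoted + '"'   # unquoted values can't contain "
    return name + "=" + BAD_AMP.sub("&amp;", quoted).replace("<", "&lt;")


def to_xml(doc):
    """Copy doc to the output, fixing attributes in tags and stray & in text."""
    out = []
    pos = 0
    for m in TAG.finditer(doc):
        out.append(BAD_AMP.sub("&amp;", doc[pos:m.start()]))  # text before the tag
        out.append(ATTR.sub(fix_attr, m.group(0)))             # the tag itself
        pos = m.end()
    out.append(BAD_AMP.sub("&amp;", doc[pos:]))
    return "".join(out)


xml = to_xml('<a href=?foo=42&bar=99 title="x<y">x & y &amp; z</a>')
print(xml)
print(ET.fromstring(xml).get("href"))
```

Existing single and double quotes are kept as they are, and character codes that are already valid, like `&amp;`, are not escaped again.
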
## Leniency

Angle brackets:

- `<a foo="<">` is allowed, but `<a foo=">">` is disallowed
- `<p> 4>3 </p>` is allowed, but `<p> 4<3 </p>` is disallowed

This makes lexing the top-level structure easier: a tag always ends at the
first `>`, and text always ends at the first `<`.

- unescaped `&` is allowed, unlike XML
  - it's very common in `<a href="?foo=42&bar=99">`
  - It's lexed as BadAmpersand, in case you want to fix it for XML. Although
    we don't do that for `<` and `>` consistently.

## Related

- [ysh-doc-processing.html](ysh-doc-processing.html)
- [table-object-doc.html](table-object-doc.html)

## FAQ

### What Doesn't This Cover?

- HTM8 tags must be balanced to convert them to XML

- NUL bytes aren't allowed - currently due to re2c sentinel
  - Although I think we could have a preprocessing pass convert them to the
    Unicode replacement char? I think that HTML might mandate that
- Encodings other than UTF-8. HTM8 is always UTF-8.
- Unicode tag names and attribute names.
  - This is allowed in HTML5 and XML.
  - We leave those out for simpler lexing. Text and attribute values may be Unicode.

- `<a href=">">` - no literal `>` inside quotes
  - HTML5 handles it, but we want to easily scan the "top level" structure of the doc
  - And it doesn't appear to be common in our testdata
  - TODO: we will handle `<a href="&">`

There are 5 kinds of tags:

- Normal HTML tags
- RCDATA for `<title> <textarea>`
- RAWTEXT `<style> <xmp> <iframe>` ?

and we have

- CDATA `<script>`
  - TODO: we need a test case for `</script>` in a string literal?
- Foreign `<math> <svg>` - XML rules

## TODO

- `<svg>` and `<math>` are foreign XML content? Doh
  - So I can just switch to XML mode in that case
  - TODO: we need a test corpus for this!
  - maybe look for Wikipedia content
- can we also just disallow these? Can you make these into external XML files?

This is one way:

    <object data="math.xml" type="application/mathml+xml"></object>
    <object data="drawing.xml" type="image/svg+xml"></object>

Then we don't need special parsing?
