---
in_progress: yes
default_highlighter: oils-sh
---

HTM8 - An Easy Subset of HTML5, With Errors
=================================

- Syntax Errors: It's a Subset
- Easy
  - Easy to Remember
  - Easy to Implement
- Runs Efficiently - you don't have to materialize a big DOM tree, which
  causes many allocations
- Convertible to XML?
  - without allocations, with a `sed`-like transformation!
  - low level lexing and matching

<!--

TODO: 99.9% of HTML documents from CommonCrawl should be convertible to XML,
and then validated by an XML parser

- lxml - this is supposed to be high quality

- Python stdlib uses expat - https://libexpat.github.io/

- Gah it's this huge thing, 8K lines: https://github.com/libexpat/libexpat/blob/master/expat/lib/xmlparse.c
- do they have the billion laughs bug?

-->

<div id="toc">
</div>

## Basic Structure

### Text Content

Anything except `&` and `<`.

These must be escaped as `&amp;` and `&lt;`.

`>` is allowed, or you can escape it with `&gt;`.

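A minimal sketch of the text rule above, in Python. The helper name `escape_text` and its `escape_gt` flag are hypothetical, not part of HTM8.

```python
def escape_text(s: str, escape_gt: bool = False) -> str:
    """Escape a string so it can be used as HTM8 text content."""
    # Escape & first, so the & introduced by &lt; isn't escaped again.
    s = s.replace('&', '&amp;').replace('<', '&lt;')
    if escape_gt:
        # Optional: a literal > is allowed in text, so this is a choice.
        s = s.replace('>', '&gt;')
    return s
```

For example, `escape_text('4 < 5 & 6 > 3')` leaves the `>` alone and escapes the other two characters.
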
### 3 Kinds of Character Code

1. `&amp;` - named
1. `&#999;` - decimal
1. `&#xff;` - hex

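The three forms can be told apart with one regex. This is a sketch under assumptions: the pattern (ASCII names, lowercase `x` only) and the function name `classify_char_code` are mine, not HTM8's actual lexer rules.

```python
import re

# Assumed shapes: &name;  &#digits;  &#xhexdigits;
CHAR_CODE = re.compile(
    r'&(?:(?P<named>[a-zA-Z][a-zA-Z0-9]*)'
    r'|#(?P<decimal>[0-9]+)'
    r'|#x(?P<hex>[0-9a-fA-F]+));'
)

def classify_char_code(s):
    """Return (kind, value), or None if s is not a character code.

    For numeric codes the value is the decoded character; for named
    codes it is just the name, since decoding needs an entity table.
    """
    m = CHAR_CODE.fullmatch(s)
    if m is None:
        return None
    if m.group('named') is not None:
        return ('named', m.group('named'))
    if m.group('decimal') is not None:
        return ('decimal', chr(int(m.group('decimal'))))
    return ('hex', chr(int(m.group('hex'), 16)))
```
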
### 3 Kinds of Tag

1. Start
1. End
1. StartEnd

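A rough sketch of telling the three kinds apart. It assumes ASCII tag names and no literal `>` inside the tag. The regex and the name `tag_kind` are illustrative only.

```python
import re

# Assumption: ASCII tag names, and no literal > inside the tag.
TAG = re.compile(
    r'<(?P<slash>/?)(?P<name>[a-zA-Z][a-zA-Z0-9-]*)(?P<rest>[^>]*)>')

def tag_kind(s):
    """Classify a single tag as 'Start', 'End' or 'StartEnd'."""
    m = TAG.fullmatch(s)
    if m is None:
        return None
    if m.group('slash'):
        return 'End'
    # A trailing / before > makes it a self-closing StartEnd tag.
    if m.group('rest').rstrip().endswith('/'):
        return 'StartEnd'
    return 'Start'
```
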
### 2 Kinds of Attribute

1. Unquoted
1. Quoted

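Here is a sketch of lexing the attributes inside a tag. The character classes are assumptions, and the `NoValue` label for a bare name like `checked` is my own addition, not one of the two kinds above. There's deliberately no whitespace allowed around `=`.

```python
import re

# name, then optionally ="...", ='...' or =unquoted (no spaces around =)
ATTR = re.compile(
    r'''([a-zA-Z][a-zA-Z0-9_-]*)'''
    r'''(?:=(?:"([^"]*)"|'([^']*)'|([^\s"'=<>]+)))?'''
)

def parse_attrs(attr_text):
    """Return a list of (name, value, kind) for the text inside a tag."""
    out = []
    for m in ATTR.finditer(attr_text):
        name, dq, sq, unquoted = m.groups()
        if dq is not None:
            out.append((name, dq, 'Quoted'))
        elif sq is not None:
            out.append((name, sq, 'Quoted'))
        elif unquoted is not None:
            out.append((name, unquoted, 'Unquoted'))
        else:
            out.append((name, None, 'NoValue'))
    return out
```
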
### 2 Kinds of Comment

1. `<!-- -->`
1. `<? ?>` (XML processing instruction)


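Both kinds can be matched with a non-greedy regex. This sketch strips them. It's a simplification: it doesn't know about `<script>` or `<style>`, so comment-like text inside those would be removed too. The name `strip_comments` is made up.

```python
import re

# Non-greedy, so each comment ends at the first --> or ?>
COMMENT = re.compile(r'<!--.*?-->|<\?.*?\?>', re.DOTALL)

def strip_comments(s):
    """Remove <!-- --> comments and <? ?> processing instructions."""
    return COMMENT.sub('', s)
```
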
## Special Rules, From HTML

### 2 Tags Cause Special Lexing

- `<script> <style>`

Note: we still have CDATA for compatibility.


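A sketch of what "special lexing" means: after `<script>` or `<style>`, everything up to the matching close tag is raw, so a `<` in there isn't markup. Assumptions: lowercase tag names only, and the close tag is found with a plain string search. `split_raw` is a hypothetical name.

```python
import re

# Assumption: lowercase names only, matched strictly.
SPECIAL_OPEN = re.compile(r'<(script|style)\b[^>]*>')

def split_raw(s):
    """Split s into ('markup', text) and ('raw', text) pieces."""
    out = []
    pos = 0
    while True:
        m = SPECIAL_OPEN.search(s, pos)
        if m is None:
            out.append(('markup', s[pos:]))
            return out
        close_tag = '</%s>' % m.group(1)
        end = s.find(close_tag, m.end())
        if end == -1:
            raise ValueError('missing %s' % close_tag)
        out.append(('markup', s[pos:m.end()]))  # includes the open tag
        out.append(('raw', s[m.end():end]))     # not lexed as markup
        pos = end                               # close tag is markup again
```

Note this has the problem mentioned elsewhere in this doc: a `</script>` inside a JavaScript string literal ends the raw section early.
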
### 16 VOID Tags Change Parsing

- `<source> ...`

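To show how VOID tags change parsing, here's a sketch of a balance check. `VOID_SUBSET` is NOT the list of 16: only `source` is named above, and the other three are standard HTML void elements I added for illustration. The function name and the `void_tags` parameter are hypothetical. Comments and `<script>` contents are ignored to keep it short.

```python
import re

# Illustrative subset only, not HTM8's list of 16.
VOID_SUBSET = frozenset(['source', 'br', 'hr', 'img'])

TAG = re.compile(r'<(/?)([a-zA-Z][a-zA-Z0-9-]*)[^>]*?(/?)>')

def is_balanced(s, void_tags=VOID_SUBSET):
    """Check that every non-void start tag has a matching end tag."""
    stack = []
    for m in TAG.finditer(s):
        closing, name, self_closing = m.groups()
        if closing:
            if not stack or stack.pop() != name:
                return False
        elif self_closing or name in void_tags:
            continue  # nothing to close
        else:
            stack.append(name)
    return not stack
```

Passing `void_tags=frozenset()` behaves like XML, where `<source>` is an ordinary tag that must be closed.
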
### Bonus: XML Mode

- Get rid of the 2 special lexing tags, and 16 VOID tags

Then you can query HTML


## Under the Hood

### 3 Layers of Lexing

1. Tag
1. Attributes within a Tag
1. Quoted Value for Attributes

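A sketch of how the three layers nest, under simplifying assumptions: start tags only, double-quoted attributes only, and Python's `html.unescape()` standing in for the innermost layer (it knows the full HTML5 entity table, which is more than HTM8 may accept). `lex_layers` is a made-up name.

```python
import html
import re

# Layer 1: a start tag. Layer 2: attributes within it (double-quoted only).
TAG = re.compile(r'<([a-zA-Z][a-zA-Z0-9-]*)([^>]*)>')
ATTR = re.compile(r'([a-zA-Z][a-zA-Z0-9_-]*)="([^"]*)"')

def lex_layers(s):
    """Yield (tag_name, attr_name, decoded_value) triples."""
    for tag in TAG.finditer(s):
        tag_name, attr_text = tag.groups()
        for attr in ATTR.finditer(attr_text):
            attr_name, raw_value = attr.groups()
            # Layer 3: character codes inside the quoted value.
            yield tag_name, attr_name, html.unescape(raw_value)
```

Each layer only looks at the substring the outer layer handed it, so no tree has to be built.
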
## What Do You Use This for?

- Stripping comments
- Adding TOC
- Syntax highlighting code
- Adding links shortcuts
- ul-table

TODO:

- DOM API on top of it
  - node.elementsByTag('p')
  - node.elementsByClassName('left')
  - node.elementByID('foo')
  - innerHTML() outerHTML()
  - tag attrs
  - low level:
    - outerLeft, outerRight, innerLeft, innerRight
- CSS Selectors - `querySelectorAll()`
- sed-like model

## Algorithms

### Emitting HTM8 as HTML5

Just emit it! This always works, by design.

### Parsing XML

- Set `NO_SPECIAL_TAGS`


Conflicts between HTML5 and XML:

- In XML, `<source>` is like any tag, and must be closed
- In HTML, `<source>` is a VOID tag, and must NOT be closed

- In XML, `<script>` and `<style>` don't have special treatment
- In HTML, they do

- The header is different - `<!DOCTYPE html>` vs. `<?xml version= ... ?>`

- HTML: `<a empty= missing>` is two attributes
- right now we don't handle `<a empty = "missing">` as a single attribute
  - that is valid XML, so should we handle it?

### Converting to XML?

- Add quotes to unquoted attributes
  - single and double quotes stay the same?
- Quote special chars
  - `&` BadAmpersand -> `&amp;`
  - `<` BadLessThan -> `&lt;`
  - `>` BadGreaterThan -> `&gt;`
- `<script>` and `<style>`
  - either add `<![CDATA[`
  - or simply escape their values with `&amp; &lt;`
- what to do about case-insensitive tags?
  - maybe you can just normalize them
  - because we do strict matching
- Maybe validate any other declarations, like `<!DOCTYPE foo>`
- Add XML header `<?xml version=>`, remove `<!DOCTYPE html>`

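One of the steps above, the BadAmpersand fix, can be sketched as a `sed`-like substitution. The regex is an assumption about what counts as a well-formed character code, and `fix_bad_ampersands` is a hypothetical name.

```python
import re

# An & that does NOT start a well-formed character code.
BAD_AMP = re.compile(
    r'&(?!(?:[a-zA-Z][a-zA-Z0-9]*|#[0-9]+|#x[0-9a-fA-F]+);)')

def fix_bad_ampersands(s):
    """Escape stray & so the result is closer to valid XML."""
    return BAD_AMP.sub('&amp;', s)
```

Caveat: XML only predefines `amp lt gt apos quot`, so a named code like `&nbsp;` passes through here but would still need handling for a strict XML parser.
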
## Leniency

Angle brackets:

- `<a foo="<">` is allowed, but `<a foo=">">` is disallowed
- `<p> 4>3 </p>` is allowed, but `<p> 4<3 </p>` is disallowed

This makes lexing the top-level structure easier.

- unescaped `&` is allowed, unlike XML
  - it's very common in `<a href="?foo=42&bar=99">`
  - It's lexed as BadAmpersand, in case you want to fix it for XML. Although
    we don't do that for < and > consistently.


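To see why these rules make the top level easy to scan, here's a sketch with a deliberately naive regex: a tag is `<` up to the next `>`, and text is anything up to the next `<`. This is not HTM8's real lexer, and the name `top_level` is made up.

```python
import re

# Naive top-level scan: tags end at the first >, text ends at the first <.
TOP_LEVEL = re.compile(r'<[^>]+>|[^<]+')

def top_level(s):
    """Split a document into tag-like and text-like chunks."""
    return TOP_LEVEL.findall(s)
```

The allowed cases split correctly, and the disallowed ones are exactly those that confuse the scan:

```python
top_level('<p> 4>3 </p>')      # ['<p>', ' 4>3 ', '</p>']
top_level('<a foo="<">x</a>')  # ['<a foo="<">', 'x', '</a>']
top_level('<a foo=">">x</a>')  # ['<a foo=">', '">x', '</a>']  tag cut short
top_level('<p> 4<3 </p>')      # ['<p>', ' 4', '<3 </p>']  bogus tag
```
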
## Related

- [ysh-doc-processing.html](ysh-doc-processing.html)
- [table-object-doc.html](table-object-doc.html)

## FAQ

### What Doesn't This Cover?

- HTM8 tags must be balanced to convert them to XML

- NUL bytes aren't allowed - currently due to re2c sentinel
  - Although I think we could add a preprocessing pass to convert them to the
    Unicode replacement char? I think that HTML might mandate that
- Encodings other than UTF-8. HTM8 is always UTF-8.
- Unicode tag names and attribute names.
  - These are allowed in HTML5 and XML.
  - We leave those out for simpler lexing. Text and attribute values may be Unicode.

- `<a href=">">` - no literal `>` inside quotes
  - HTML5 handles it, but we want to easily scan the "top level" structure of the doc
  - And it doesn't appear to be common in our testdata
  - TODO: we will handle `<a href="&">`

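The NUL preprocessing pass floated above could look like this. It's only a sketch of the idea, not something HTM8 does today, and `replace_nul` is a hypothetical name. It works on bytes, since the input is always UTF-8.

```python
# U+FFFD REPLACEMENT CHARACTER, encoded as UTF-8
REPLACEMENT = '\ufffd'.encode('utf-8')

def replace_nul(data: bytes) -> bytes:
    """Replace each NUL byte so the lexer's sentinel stays unambiguous."""
    return data.replace(b'\x00', REPLACEMENT)
```
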
There are 5 kinds of tags:

- Normal HTML tags
- RCDATA for `<title> <textarea>`
- RAWTEXT `<style> <xmp> <iframe>` ?

and we have

- CDATA `<script>`
  - TODO: we need a test case for `</script>` in a string literal?
- Foreign `<math> <svg>` - XML rules

## TODO

- `<svg>` and `<math>` are foreign XML content? Doh
  - So I can just switch to XML mode in that case
  - TODO: we need a test corpus for this!
    - maybe look for Wikipedia content
- can we also just disallow these? Can you make these into external XML files?

This is one way:

    <object data="math.xml" type="application/mathml+xml"></object>
    <object data="drawing.xml" type="image/svg+xml"></object>

Then we don't need special parsing?