doc/htm8.md

---
in_progress: yes
default_highlighter: oils-sh
---

HTM8 - An Easy Subset of HTML5, With Some Errors
=================================

HTM8 is a data language, which is part of J8 Notation:

- It's a subset of HTML5, so there are Syntax Errors
  - It's "for humans"
  - `<li><li>` example
- It's Easy
  - Easy to Implement - ~700 lines of regular languages and Python
  - And thus Easy to Remember, for users
  - Runs Efficiently - you don't have to materialize a big DOM tree, which
    causes many allocations
- Convertible to XML?
  - without allocations, with a `sed`-like transformation!
  - low level lexing and matching
- Ambitious
  - zero-alloc whitelist-based HTML filter for user content
  - zero-alloc browser and CSS-style content queries

Currently, all of the Oils docs are parsed and processed with it.

We would like to "lift it up" into an API for YSH users.

<!--

TODO: 99.9% of HTML documents from CommonCrawl should be convertible to XML,
and then validated by an XML parser

- lxml - this is supposed to be high quality

- Python stdlib uses expat - https://libexpat.github.io/

  - Gah it's this huge thing, 8K lines: https://github.com/libexpat/libexpat/blob/master/expat/lib/xmlparse.c
  - do they have the billion laughs bug?

-->

<div id="toc">
</div>

## Structure of an HTM8 Doc

### Tags - Open, Close, Self-Closing

1. Open `<a>`
1. Close `</a>`
1. StartEnd `<img/>`

HTML5 doesn't have the notion of self-closing tags. Instead, it silently ignores
the trailing `/`.

We are bringing it back for humans, because we think it's too hard for people to
remember the 16 void elements.

And a lack of balanced tags causes visual bugs that are hard to debug. It would
be better to get an error earlier.
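
A minimal sketch of how the three tag kinds could be told apart with one regular expression. `TAG_RE` and `classify_tag` are hypothetical names for illustration, not the Oils implementation, and the pattern assumes ASCII tag names and no literal `>` inside a tag:

```python
import re

# Hypothetical pattern: "<", optional "/", ASCII name, attribute text with
# no ">", optional trailing "/", then ">".
TAG_RE = re.compile(r'<(/?)([a-zA-Z][a-zA-Z0-9-]*)([^>]*?)(/?)>')

def classify_tag(tag):
    """Return 'Open', 'Close' or 'StartEnd' for a single tag string."""
    m = TAG_RE.fullmatch(tag)
    if m is None:
        raise ValueError('not a tag: %r' % tag)
    is_close, _name, _attrs, is_self_closing = m.groups()
    if is_close and is_self_closing:
        raise ValueError('tag is both Close and StartEnd: %r' % tag)
    if is_close:
        return 'Close'
    if is_self_closing:
        return 'StartEnd'
    return 'Open'
```

In this sketch a malformed tag such as `</a/>` raises `ValueError` right away, which is the "error earlier" behavior argued for above.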

### Attributes - Quotes optional

5 closely related Syntaxes

1. Missing `<a missing>`
1. Empty `<a empty=>`
1. Unquoted `<a href=foo>`
1. Double Quoted `<a href="foo">`
1. Single Quoted `<a href='foo'>`

Note: `<a href=/>` is disallowed because it's ambiguous. Use `<a href="/">` or
`<a href=/ >` or `<a href= />`.
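
The five syntaxes can be distinguished by one pattern with optional parts. This is only a sketch under stated assumptions: `ATTR_RE` and `parse_attrs` are hypothetical names, attribute names are ASCII, and no whitespace is allowed around `=`. The input is the text between the tag name and the closing `>`:

```python
import re
from typing import List, Optional, Tuple

# Hypothetical pattern, not the Oils lexer.
ATTR_RE = re.compile(
    r'\s+([a-zA-Z][a-zA-Z0-9_-]*)'   # whitespace, then the attribute name
    r'(?:(=)'                        # optional "="
    r'(?:"([^"]*)"'                  # Double Quoted value
    r"|'([^']*)'"                    # Single Quoted value
    r'|([^\s"\'=<>]+)'               # Unquoted value
    r')?)?'                          # nothing after "=" means Empty
)

def parse_attrs(s: str) -> List[Tuple[str, str, Optional[str]]]:
    """Return (name, kind, value) for each attribute in s."""
    attrs = []
    pos = 0
    while True:
        m = ATTR_RE.match(s, pos)
        if m is None:
            break
        name, eq, dq, sq, unquoted = m.groups()
        if eq is None:
            attrs.append((name, 'Missing', None))
        elif dq is not None:
            attrs.append((name, 'DoubleQuoted', dq))
        elif sq is not None:
            attrs.append((name, 'SingleQuoted', sq))
        elif unquoted is not None:
            attrs.append((name, 'Unquoted', unquoted))
        else:
            attrs.append((name, 'Empty', ''))
        pos = m.end()
    if s[pos:].strip():
        raise ValueError('bad attribute syntax at offset %d' % pos)
    return attrs
```

With these assumptions, `parse_attrs(' empty= missing')` yields two attributes, an Empty one followed by a Missing one.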

### Text - Regular or CDATA

#### Regular Text

- Any UTF-8 text.
- Generally, `& < > " '` should be escaped as `&amp; &lt; &gt; &quot; &apos;`.

But we are lenient and allow raw `>` between tags:

    <p> foo > bar </p>

and raw `<` inside tags:

    <span foo="<" > foo </span>

#### CDATA

Like HTML5, we support explicit `<![CDATA[`, even though it's implicit in the
tags.

### Escaped Chars - named, decimal, hex

1. `&amp;` - named
1. `&#999;` - decimal
1. `&#xff;` - hex
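
One way to decode the named, decimal, and hex forms, sketched with the `html5` table from Python's standard `html.entities` module. `CHAR_REF_RE` and `decode_char_refs` are hypothetical names, and leaving unknown names untouched is an assumption of this sketch, not documented HTM8 behavior:

```python
import re
from html.entities import html5  # maps names like 'amp;' to their text

# One alternative per form: hex, decimal, named.
CHAR_REF_RE = re.compile(
    r'&(?:#[xX]([0-9a-fA-F]+)|#([0-9]+)|([a-zA-Z][a-zA-Z0-9]*));')

def decode_char_refs(s):
    """Replace character references in s with the characters they name."""
    def repl(m):
        hex_digits, dec_digits, name = m.groups()
        if hex_digits is not None:
            return chr(int(hex_digits, 16))
        if dec_digits is not None:
            return chr(int(dec_digits))
        # Unknown names are left as-is (an assumption of this sketch).
        return html5.get(name + ';', m.group(0))
    return CHAR_REF_RE.sub(repl, s)
```

A code point outside the Unicode range makes `chr()` raise `ValueError`; a real decoder would have to decide how to report that.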


### Comments - HTML or XML

1. `<!-- -->`
1. `<? ?>` (XML processing instruction)

### Declarations - HTML or XML

- `<!DOCTYPE html>` from HTML5
- `<?xml version= ... ?>` from XML - this is a comment / processing instruction

## Special Rules For Specific HTML Tags

### `<script>` and `<style>` are Leaf Tags with Special Lexing

- `<script> <style>`

Note: we still have CDATA for compatibility.
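
Because `<script>` and `<style>` are leaf tags, their bodies can be skipped without tokenizing them. A rough sketch, assuming the close tag has to match the open tag's name exactly; `find_leaf_end` is a hypothetical helper, not the Oils implementation:

```python
def find_leaf_end(doc, start, tag_name):
    """Return the offset of the close tag for a leaf tag body.

    start is the offset just past the open tag.  The body is treated as
    raw text, so "<", ">" and "&" inside it are not interpreted.
    """
    close_tag = '</%s>' % tag_name
    end = doc.find(close_tag, start)
    if end == -1:
        raise ValueError('missing %s' % close_tag)
    return end

doc = '<script>if (a < b && c > d) {}</script><p>'
start = len('<script>')
end = find_leaf_end(doc, start, 'script')
body = doc[start:end]
```

This stops at the first `</script>`, so a `</script>` inside a JavaScript string literal would end the body early.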

### 16 VOID Tags Don't Need Close Tags (Special Parsing)

- `<source> ...`


## Errors

### Notes on Leniency

Angle brackets:

- `<a foo="<">` is allowed, but `<a foo=">">` is disallowed
- `<p> 4>3 </p>` is allowed, but `<p> 4<3 </p>` is disallowed

This makes lexing the top-level structure easier.

- unescaped `&` is allowed, unlike XML
  - it's very common in `<a href="?foo=42&bar=99">`
  - It's lexed as BadAmpersand, in case you want to fix it for XML. Although
    we don't do that for < and > consistently.
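
To illustrate these rules for text between tags, here is a small sketch of a text lexer. `BadAmpersand` and `BadGreaterThan` follow the token names used in this doc; `CharRef`, `RawText`, `TEXT_TOKEN_RE` and `lex_text` are hypothetical names chosen for the example:

```python
import re

# Alternatives are tried left to right, so a well-formed character
# reference wins over a bare "&".
TEXT_TOKEN_RE = re.compile(
    r'(?P<CharRef>&(?:#[xX][0-9a-fA-F]+|#[0-9]+|[a-zA-Z][a-zA-Z0-9]*);)'
    r'|(?P<BadAmpersand>&)'
    r'|(?P<BadGreaterThan>>)'
    r'|(?P<RawText>[^&<>]+)'
)

def lex_text(s):
    """Split text between tags into (token_name, value) pairs."""
    tokens = []
    pos = 0
    while pos < len(s):
        m = TEXT_TOKEN_RE.match(s, pos)
        if m is None:
            # Only a raw "<" gets here: it is not allowed between tags.
            raise ValueError('unexpected %r at offset %d' % (s[pos], pos))
        tokens.append((m.lastgroup, m.group(0)))
        pos = m.end()
    return tokens
```

Here `lex_text(' 4>3 ')` succeeds and marks the `>` as `BadGreaterThan`, while `lex_text(' 4<3 ')` raises `ValueError`.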

### What are some examples of syntax errors?

- HTM8 tags must be balanced to convert them to XML

- `<script></SCRIPT>` isn't matched
  - the begin and end tags must match exactly, like `<SCRipt></SCRipt>`
  - likewise for `<style>`

- NUL bytes aren't allowed - currently due to re2c sentinel. Two options:
  1. Make it a syntax error - like JSON8
  1. we could have a reprocessing pass to convert it to the Unicode replacement
     char? I think that HTML might mandate that
- Encodings other than UTF-8. HTM8 is always UTF-8.
- Unicode Tag names and attribute names.
  - This is allowed in HTML5 and XML.
  - We leave those out for simpler lexing. Text and attribute values may be unicode.

- `<a href=">">` - no literal `>` inside quotes
  - HTML5 handles it, but we want to easily scan the "top level" structure of the doc
  - And it doesn't appear to be common in our testdata
  - TODO: we will handle `<a href="&">`
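
The balance rule can be checked with a stack of open tag names. This is a simplified sketch: `check_balanced` and its `void_tags` parameter are hypothetical, names are compared exactly (case-sensitively), and comments, CDATA, and the bodies of `<script>` and `<style>` are not skipped as a real implementation would have to do:

```python
import re

# Hypothetical pattern: no literal ">" inside a tag, ASCII tag names.
TAG_RE = re.compile(r'<(/?)([a-zA-Z][a-zA-Z0-9-]*)[^>]*?(/?)>')

def check_balanced(doc, void_tags=frozenset()):
    """Raise ValueError if open and close tags in doc don't pair up."""
    stack = []
    for m in TAG_RE.finditer(doc):
        is_close, name, is_self_closing = m.groups()
        if is_close:
            if not stack or stack[-1] != name:
                raise ValueError(
                    'unexpected </%s> at offset %d' % (name, m.start()))
            stack.pop()
        elif is_self_closing or name in void_tags:
            continue  # StartEnd and void tags never need a close tag
        else:
            stack.append(name)
    if stack:
        raise ValueError('unclosed <%s>' % stack[-1])
```

For example, `check_balanced('<ul><li>a<li>b</ul>')` fails in this sketch, because the `<li>` tags are never closed.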

HTML notes:

There are 5 kinds of tags:

- Normal HTML tags
- RCDATA for `<title> <textarea>`
- RAWTEXT `<style> <xmp> <iframe>` ?

and we have

- CDATA `<script>`
  - TODO: we need a test case for `</script>` in a string literal?
- Foreign `<math> <svg>` - XML rules

## Under the Hood - Regular Languages, Algebraic Data Types

That is, we use exhaustive reasoning.

It's meant to be easy to implement.

### 2 Layers of Lexing

1. TagLexer
1. AttrLexer

### 4 Regular Expressions

Using re2c as the "choice" primitive.

1. Lexer
1. NAME lexer
1. Begin VALUE lexer
1. Quoted value lexer - for decoding `<a href="&amp;">`

## XML Parsing Mode

- Set `NO_SPECIAL_TAGS` - get rid of special cases for `<script>` and `<style>`

Conflicts between HTML5 and XML:

- In XML, `<source>` is like any tag, and must be closed
- In HTML, `<source>` is a VOID tag, and must NOT be closed

- In XML, `<script>` and `<style>` don't have special treatment
- In HTML, they do

- The header is different - `<!DOCTYPE html>` vs. `<?xml version= ... ?>`

- HTML: `<a empty= missing>` is two attributes
- right now we don't handle `<a empty = "missing">` as a single attribute
  - that is valid XML, so should we handle it?

## Algorithms

### What Do You Use This for?

- Stripping comments
- Adding TOC
- Syntax highlighting code
- Adding link shortcuts
- ul-table

TODO:

- DOM API on top of it
  - node.elementsByTag('p')
  - node.elementsByClassName('left')
  - node.elementByID('foo')
  - innerHTML() outerHTML()
  - tag attrs
  - low level:
    - outerLeft, outerRight, innerLeft, innerRight
- CSS Selectors - `querySelectorAll()`
- sed-like model


### List of Algorithms

- Lexing/Parsing
  - lex just the top level
  - lex both levels
  - match tags - this is the level for value.Htm8Frag?
- sed-like
  - convert to XML!
  - sed-like replacement of DOM Tree or element - e.g. Oils TOC
- Structured
  - convert to DOMTree
  - lazy selection by tag, or attr (id= and class=)
  - lazy selection by CSS selector expression
  - untrusted HTML filter, e.g. like StackOverflow / Reddit
    - this is Safe HTM8
    - should have a zero alloc way to support this, with good errors?
      - I think most of them silently strip data

### Emitting HTM8 as HTML5

Just emit it! This always works, by design.

### Converting to XML

- Add quotes to unquoted attributes
  - single and double quotes stay the same?
- Quote special chars - in text, and inside single- and double-quoted attr values
  - & BadAmpersand -> `&amp;`
  - < BadLessThan -> `&lt;`
  - > BadGreaterThan -> `&gt;`
- `<script>` and `<style>`
  - either add `<![CDATA[`
  - or simply escape their values with `&amp; &lt;`
- what to do about case-insensitive tags?
  - maybe you can just normalize them
  - because we do strict matching
- Maybe validate any other declarations, like `<!DOCTYPE foo>`
- Add XML header `<?xml version=>`, remove `<!DOCTYPE html>`
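
The first two steps can be sketched as plain string rewrites. `escape_for_xml` and `quote_unquoted_attr` are hypothetical helpers for illustration; the sketch keeps existing character references untouched and escapes only the bare `&`, `<` and `>` characters:

```python
import re

CHAR_REF = r'&(?:#[xX][0-9a-fA-F]+|#[0-9]+|[a-zA-Z][a-zA-Z0-9]*);'

# Group 1: an existing character reference.  Group 2: a bare special char.
SPECIAL_RE = re.compile('(%s)|([&<>])' % CHAR_REF)

ESCAPES = {'&': '&amp;', '<': '&lt;', '>': '&gt;'}

def escape_for_xml(s):
    """Escape bare & < > in text or in a quoted attribute value."""
    def repl(m):
        if m.group(1) is not None:
            return m.group(1)  # already escaped, keep as-is
        return ESCAPES[m.group(2)]
    return SPECIAL_RE.sub(repl, s)

def quote_unquoted_attr(name, value):
    """Turn href=foo into href="foo" (unquoted values contain no quotes)."""
    return '%s="%s"' % (name, escape_for_xml(value))
```

Named references other than the five predefined in XML (`amp`, `lt`, `gt`, `quot`, `apos`) pass through unchanged here, so an XML parser without a DTD could still reject the output.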

## Related

- [ysh-doc-processing.html](ysh-doc-processing.html)
- [table-object-doc.html](table-object-doc.html)


## Brainstorming / TODO

### Foreign XML with `<svg>` and `<math>` ?

`<svg>` and `<math>` are foreign XML content.

We might want to support this.

- So I can just switch to XML mode in that case
- TODO: we need a test corpus for this!
  - maybe look for wikipedia content
- can we also just disallow these? Can you make these into external XML files?

This is one way:

    <object data="math.xml" type="application/mathml+xml"></object>
    <object data="drawing.xml" type="image/svg+xml"></object>

Then we don't need special parsing?