doc/htm8.md


1	---
2	in_progress: yes
3	default_highlighter: oils-sh
4	---
5
6	HTM8 - An Easy Subset of HTML5, With Some Errors
7	=================================
8
9	HTM8 is a data language, which is part of J8 Notation:
10
11	- It's a subset of HTML5, so there are Syntax Errors
12	  - It's "for humans"
13	  - `<li><li>` example
14	- It's Easy
15	  - Easy to Implement - ~700 lines of regular languages and Python
16	  - And thus Easy to Remember, for users
17	  - Runs Efficiently - you don't have to materialize a big DOM tree, which
18	    causes many allocations
19	- Convertible to XML?
20	  - without allocations, with a `sed`-like transformation!
21	  - low level lexing and matching
22	- Ambitious
23	  - zero-alloc whitelist-based HTML filter for user content
24	  - zero-alloc browser and CSS-style content queries
25
26	Currently, all of the Oils docs are parsed and processed with it.
27
28	We would like to "lift it up" into an API for YSH users.
29
30	<!--
31
32	TODO: 99.9% of HTML documents from CommonCrawl should be convertible to XML,
33	and then validated by an XML parser
34
35	- lxml - this is supposed to be high quality
36
37	- Python stdlib uses expat - https://libexpat.github.io/
38
39	- Gah it's this huge thing, 8K lines: https://github.com/libexpat/libexpat/blob/master/expat/lib/xmlparse.c
40	- do they have the billion laughs bug?
41
42	-->
43
44	<div id="toc">
45	</div>
46
47	## Structure of an HTM8 Doc
48
49	### Tags - Open, Close, Self-Closing
50
51	1. Open `<a>`
52	1. Close `</a>`
53	1. StartEnd `<img/>`
54
55	HTML5 doesn't have the notion of self-closing tags. Instead, it silently ignores
56	the trailing `/`.
57
58	We are bringing it back for humans, because we think it's too hard for people to
59	remember the 16 void elements.
60
61	And a lack of balanced tags causes visual bugs that are hard to debug. It would
62	be better to get an error earlier.
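The three tag forms and the "error earlier" idea can be sketched in Python. This is a minimal illustration, not the Oils implementation: `TAG_RE`, `classify_tags`, and `check_balanced` are hypothetical names, the pattern assumes ASCII tag names and no `>` inside a tag, and void tags such as `<br>` are deliberately not special-cased here.

```python
import re

# Hypothetical pattern: optional leading '/', ASCII name, attributes, optional trailing '/'
TAG_RE = re.compile(r'<(/?)([a-zA-Z][a-zA-Z0-9-]*)([^>]*?)(/?)>')


def classify_tags(html):
    """Yield (kind, name) pairs, where kind is 'Open', 'Close', or 'StartEnd'."""
    for m in TAG_RE.finditer(html):
        closing, name, _attrs, self_closing = m.groups()
        if closing:
            yield 'Close', name
        elif self_closing:
            yield 'StartEnd', name
        else:
            yield 'Open', name


def check_balanced(html):
    """Raise ValueError on the first unbalanced tag, instead of rendering it."""
    stack = []
    for kind, name in classify_tags(html):
        if kind == 'Open':
            stack.append(name)
        elif kind == 'Close':
            if not stack or stack.pop() != name:
                raise ValueError('Unbalanced close tag </%s>' % name)
        # StartEnd tags like <img/> open and close themselves
    if stack:
        raise ValueError('Unclosed tag <%s>' % stack[-1])
```

With this sketch, `check_balanced('<p><img/></p>')` passes silently, while `check_balanced('<li><li>')` raises a `ValueError`.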
63
64	### Attributes - Quotes optional
65
66	5 closely related Syntaxes
67
68	1. Missing `<a missing>`
69	1. Empty `<a empty=>`
70	1. Unquoted `<a href=foo>`
71	1. Double Quoted `<a href="foo">`
72	1. Single Quoted `<a href='foo'>`
73
74	Note: `<a href=/>` is disallowed because it's ambiguous. Use `<a href="/">` or
75	`<a href=/ >` or `<a href= />`.
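As a rough sketch, the five attribute syntaxes can be told apart with one regular expression. The names `ATTR_RE` and `classify_attrs` are made up for this example, and the exact character classes are assumptions rather than the rules the real AttrLexer uses. The function takes the text between the tag name and the closing `>`.

```python
import re

ATTR_RE = re.compile(r'''
    (?P<name>[a-zA-Z][a-zA-Z0-9_:-]*)     # ASCII attribute name
    (?:
        (?P<eq>=)                         # '=' with no surrounding whitespace
        (?:
            "(?P<dq>[^"]*)"               # Double Quoted
          | '(?P<sq>[^']*)'               # Single Quoted
          | (?P<unq>[^\s"'=<>`]+)         # Unquoted
        )?                                # nothing after '=' means Empty
    )?                                    # no '=' at all means Missing
''', re.VERBOSE)


def classify_attrs(attr_text):
    """Return a list of (name, kind, value) for the attribute part of a tag."""
    result = []
    for m in ATTR_RE.finditer(attr_text):
        name = m.group('name')
        if m.group('eq') is None:
            result.append((name, 'Missing', None))
        elif m.group('dq') is not None:
            result.append((name, 'DoubleQuoted', m.group('dq')))
        elif m.group('sq') is not None:
            result.append((name, 'SingleQuoted', m.group('sq')))
        elif m.group('unq') is not None:
            result.append((name, 'Unquoted', m.group('unq')))
        else:
            result.append((name, 'Empty', ''))
    return result
```

Because whitespace is not allowed around `=` in this sketch, `empty= missing` is read as two attributes, an Empty one followed by a Missing one.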
76
77	### Text - Regular or CDATA
78
79	#### Regular Text
80
81	- Any UTF-8 text.
82	- Generally, `& < > " '` should be escaped as `&amp; &lt; &gt; &quot; &apos;`.
83
84	But we are lenient and allow raw `>` between tags:
85
86	    <p> foo > bar </p>
87
88	and raw `<` inside tags:
89
90	    <span foo="<" > foo </span>
91
92	#### CDATA
93
94	Like HTML5, we support explicit `<![CDATA[`, even though it's implicit in the
95	tags.
96
97	### Escaped Chars - named, decimal, hex
98
99	1. `&amp;` - named
100	1. `&#999;` - decimal
101	1. `&#xff;` - hex
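A small Python sketch of decoding the named, decimal, and hex forms is below. `CHAR_REF_RE`, `NAMED`, and `decode_char_refs` are hypothetical names, and the `NAMED` table is a tiny illustrative subset, not the full list of entities.

```python
import re

# One alternative per form: named, decimal, hex
CHAR_REF_RE = re.compile(
    r'&(?:([a-zA-Z][a-zA-Z0-9]*)|#([0-9]+)|#[xX]([0-9a-fA-F]+));')

# Illustrative subset only; the real set of named entities is much larger
NAMED = {'amp': '&', 'lt': '<', 'gt': '>', 'quot': '"', 'apos': "'"}


def decode_char_refs(s):
    """Replace character references in s with the characters they denote."""
    def replace(m):
        named, dec, hex_ = m.groups()
        if named is not None:
            # Unknown names are left untouched rather than guessed at
            return NAMED.get(named, m.group(0))
        if dec is not None:
            return chr(int(dec))
        return chr(int(hex_, 16))

    return CHAR_REF_RE.sub(replace, s)
```

A bare `&` that is not followed by a valid reference is passed through unchanged, which matches the lenient treatment of unescaped ampersands.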
102
103
104	### Comments - HTML or XML
105
106	1. `<!-- -->`
107	1. `<? ?>` (XML processing instruction)
108
109	### Declarations - HTML or XML
110
111	- `<!DOCTYPE html>` from HTML5
112	- `<?xml version= ... ?>` from XML - this is a comment / processing instruction
113
114	## Special Rules For Specific HTML Tags
115
116	### `<script>` and `<style>` are Leaf Tags with Special Lexing
117
118	- `<script> <style>`
119
120	Note: we still have CDATA for compatibility.
121
122	### 16 VOID Tags Don't Need Close Tags (Special Parsing)
123
124	- `<source> ...`
125
126
127	## Errors
128
129	### Notes on Leniency
130
131	Angle brackets:
132
133	- `<a foo="<">` is allowed, but `<a foo=">">` is disallowed
134	- `<p> 4>3 </p>` is allowed, but `<p> 4<3 </p>` is disallowed
135
136	This makes lexing the top-level structure easier.
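To see why, here is a deliberately simplified top-level lexer in Python. It assumes only two token kinds, that a tag runs from `<` to the first `>`, and that text never contains `<`. The names `TOP_LEVEL_RE` and `lex_top_level` are invented for this sketch; the real lexer also handles comments, declarations, and CDATA.

```python
import re

# Assumption: a tag ends at the first '>', and raw text contains no '<'
TOP_LEVEL_RE = re.compile(r'(?P<Tag><[^>]*>)|(?P<Text>[^<]+)')


def lex_top_level(doc):
    """Split doc into (kind, text) tokens without looking inside tags."""
    pos = 0
    tokens = []
    while pos < len(doc):
        m = TOP_LEVEL_RE.match(doc, pos)
        if m is None:
            raise ValueError('Syntax error at position %d' % pos)
        tokens.append((m.lastgroup, m.group(0)))
        pos = m.end()
    return tokens
```

Under these assumptions, `<p> 4>3 </p>` and `<a foo="<">` lex as expected, but `<a foo=">">` is cut off at the quoted `>`, so the first token would be `<a foo=">`. Disallowing that input is what lets the top level be scanned without tracking quotes.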
137
138	- unescaped `&` is allowed, unlike XML
139	  - it's very common in `<a href="?foo=42&bar=99">`
140	  - It's lexed as BadAmpersand, in case you want to fix it for XML. Although
141	    we don't do that for < and > consistently.
142
143	### What are some examples of syntax errors?
144
145	- HTM8 tags must be balanced to convert them to XML
146
147	- NUL bytes aren't allowed - currently due to re2c sentinel. Two options:
148	1. Make it a syntax error - like JSON8
149	1. we could have a reprocessing pass to convert it to the Unicode replacement
150	char? I think that HTML might mandate that
151	- Encodings other than UTF-8. HTM8 is always UTF-8.
152	- Unicode Tag names and attribute names.
153	- This is allowed in HTML5 and XML.
154	- We leave those out for simpler lexing. Text and attribute values may be unicode.
155
156	- `<a href=">">` - no literal `>` inside quotes
157	- HTML5 handles it, but we want to easily scan the "top level" structure of the doc
158	- And it doesn't appear to be common in our testdata
159	- TODO: we will handle `<a href="&amp;">`
160
161	HTML notes:
162
163	There are 5 kinds of tags:
164
165	- Normal HTML tags
166	- RCDATA for `<title> <textarea>`
167	- RAWTEXT `<style> <xmp> <iframe>` ?
168
169	and we have
170
171	- CDATA `<script>`
172	- TODO: we need a test case for `</script>` in a string literal?
173	- Foreign `<math> <svg>` - XML rules
174
175	## Under the Hood - Regular Languages, Algebraic Data Types
176
177	That is, we use exhaustive reasoning.
178
179	It's meant to be easy to implement.
180
181	### 2 Layers of Lexing
182
183	1. TagLexer
184	1. AttrLexer
185
186	### 4 Regular Expressions
187
188	Using re2c as the "choice" primitive.
189
190	1. Lexer
191	1. NAME lexer
192	1. Begin VALUE lexer
193	1. Quoted value lexer - for decoding `<a href="&amp;">`
194
195	## XML Parsing Mode
196
197	- Set `NO_SPECIAL_TAGS` - get rid of special cases for `<script>` and `<style>`
198
199	Conflicts between HTML5 and XML:
200
201	- In XML, `<source>` is like any tag, and must be closed
202	- In HTML, `<source>` is a VOID tag, and must NOT be closed
203
204	- In XML, `<script>` and `<style>` don't have special treatment
205	- In HTML, they do
206
207	- The header is different - `<!DOCTYPE html>` vs. `<?xml version= ... ?>`
208
209	- HTML: `<a empty= missing>` is two attributes
210	- right now we don't handle `<a empty = "missing">` as a single attribute
211	- that is valid XML, so should we handle it?
212
213	## Algorithms
214
215	### What Do You Use This for?
216
217	- Stripping comments
218	- Adding TOC
219	- Syntax highlighting code
220	- Adding links shortcuts
221	- ul-table
222
223	TODO:
224
225	- DOM API on top of it
226	- node.elementsByTag('p')
227	- node.elementsByClassName('left')
228	- node.elementByID('foo')
229	- innerHTML() outerHTML()
230	- tag attrs
231	- low level:
232	- outerLeft, outerRight, innerLeft, innerRight
233	- CSS Selectors - `querySelectorAll()`
234	- sed-like model
235
236
237	### List of Algorithms
238
239	- Lexing/Parsing
240	- lex just the top level
241	- lex both levels
242	- match tags - this is the level for value.Htm8Frag?
243	- sed-like
244	- convert to XML!
245	- sed-like replacement of DOM Tree or element - e.g. Oils TOC
246	- Structured
247	- convert to DOMTree
248	- lazy selection by tag, or attr (id= and class=)
249	- lazy selection by CSS selector expression
250	- untrusted HTML filter, e.g. like StackOverflow / Reddit
251	- this is Safe HTM8
252	- should have a zero alloc way to support this, with good errors?
253	- I think most of them silently strip data
254
255	### Emitting HTM8 as HTML5
256
257	Just emit it! This always works, by design.
258
259	### Converting to XML
260
261	- Add quotes to unquoted attributes
262	  - single and double quotes stay the same?
263	- Quote special chars - in text, and inside single- and double-quoted attr values
264	  - & BadAmpersand -> `&amp;`
265	  - < BadLessThan -> `&lt;`
266	  - > BadGreaterThan -> `&gt;`
267	- `<script>` and `<style>`
268	  - either add `<![CDATA[`
269	  - or simply escape their values with `&amp; &lt;`
270	- what to do about case-insensitive tags?
271	  - maybe you can just normalize them
272	  - because we do strict matching
273	- Maybe validate any other declarations, like `<!DOCTYPE foo>`
274	- Add XML header `<?xml version=>`, remove `<!DOCTYPE html>`
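The first two steps in the list above might look like this in Python. This is a sketch under assumptions, not the planned converter: `BAD_AMP_RE`, `escape_text_for_xml`, and `quote_attr` are hypothetical names, and the code works on already-extracted strings rather than doing the zero-allocation `sed`-like pass.

```python
import re

# A '&' that does not begin a named, decimal, or hex character reference
BAD_AMP_RE = re.compile(
    r'&(?!(?:[a-zA-Z][a-zA-Z0-9]*|#[0-9]+|#[xX][0-9a-fA-F]+);)')


def escape_text_for_xml(text):
    """Escape stray & < > while leaving existing character references alone."""
    # Ampersands first, so the '&' in the new '&lt;' and '&gt;' is not re-escaped
    text = BAD_AMP_RE.sub('&amp;', text)
    return text.replace('<', '&lt;').replace('>', '&gt;')


def quote_attr(name, value):
    """Emit name="value" with a double-quoted, XML-safe value."""
    escaped = escape_text_for_xml(value).replace('"', '&quot;')
    return '%s="%s"' % (name, escaped)
```

For example, `quote_attr('href', '?foo=42&bar=99')` gives `href="?foo=42&amp;bar=99"`, which is the common unescaped-ampersand case mentioned earlier.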
275
276	## Related
277
278	- [ysh-doc-processing.html](ysh-doc-processing.html)
279	- [table-object-doc.html](table-object-doc.html)
280
281
282	## Brainstorming / TODO
283
284	### Foreign XML with `<svg>` and `<math>` ?
285
286	`<svg>` and `<math>` are foreign XML content.
287
288	We might want to support this.
289
290	- So I can just switch to XML mode in that case
291	- TODO: we need a test corpus for this!
292	- maybe look for wikipedia content
293	- can we also just disallow these? Can you make these into external XML files?
294
295	This is one way:
296
297	    <object data="math.xml" type="application/mathml+xml"></object>
298	    <object data="drawing.xml" type="image/svg+xml"></object>
299
300	Then we don't need special parsing?