---
in_progress: yes
default_highlighter: oils-sh
---

HTM8 - An Easy Subset of HTML5, With Some Errors
=================================

HTM8 is a data language, which is part of J8 Notation:

- It's a subset of HTML5, so there are **Syntax Errors**
  - It's "for humans"
  - `<li><li>` example
- It's Easy
  - Easy to Implement - ~700 lines of regular languages and Python
  - And thus Easy to Remember, for users
  - Runs Efficiently - you don't have to materialize a big DOM tree, which
    causes many allocations
- Convertible to XML?
  - without allocations, with a `sed`-like transformation!
  - low level lexing and matching
- Ambitious
  - zero-alloc whitelist-based HTML filter for user content
  - zero-alloc browser and CSS-style content queries

Currently, all of the Oils docs are parsed and processed with it.

We would like to "lift it up" into an API for YSH users.

<!--

TODO: 99.9% of HTML documents from CommonCrawl should be convertible to XML,
and then validated by an XML parser

- lxml - this is supposed to be high quality

- Python stdlib uses expat - https://libexpat.github.io/

- Gah it's this huge thing, 8K lines: https://github.com/libexpat/libexpat/blob/master/expat/lib/xmlparse.c
  - do they have the billion laughs bug?

-->

<div id="toc">
</div>

## Structure of an HTM8 Doc

### Tags - Open, Close, Self-Closing

1. Open `<a>`
1. Close `</a>`
1. StartEnd `<img/>`

HTML5 doesn't have the notion of self-closing tags. Instead, it silently ignores
the trailing `/`.

We are bringing it back for humans, because we think it's too hard for people to
remember the 16 void elements.

And a lack of balanced tags causes visual bugs that are hard to debug. It would
be better to get an error **earlier**.
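
The three tag kinds and the balance requirement can be sketched in a few lines of Python. This is only a sketch under stated assumptions, not the Oils implementation: `TAG_RE`, `VOID_SUBSET`, `classify_tags()` and `check_balanced()` are hypothetical names, the void list is an illustrative subset rather than the full 16, and comments, CDATA, `<script>` and `<style>` are ignored.

```python
import re

# Hypothetical pattern for one tag: optional leading '/', an ASCII name,
# attribute text without angle brackets, optional trailing '/'.
TAG_RE = re.compile(r'<(/?)([a-zA-Z][a-zA-Z0-9:_-]*)([^<>]*?)(/?)>')

# Illustrative subset of the void elements, NOT the full list of 16.
VOID_SUBSET = frozenset(['br', 'hr', 'img', 'input', 'link', 'meta', 'source'])


def classify_tags(doc):
    """Yield (kind, name) pairs; kind is 'Open', 'Close' or 'StartEnd'."""
    for m in TAG_RE.finditer(doc):
        close_slash, name, _attrs, self_slash = m.groups()
        if close_slash:
            yield 'Close', name
        elif self_slash:
            yield 'StartEnd', name
        else:
            yield 'Open', name


def check_balanced(doc):
    """Raise ValueError at the first unbalanced tag."""
    stack = []
    for kind, name in classify_tags(doc):
        if kind == 'Open' and name not in VOID_SUBSET:
            stack.append(name)  # void tags and StartEnd tags need no close
        elif kind == 'Close':
            if not stack or stack[-1] != name:
                raise ValueError('unexpected </%s>' % name)
            stack.pop()
    if stack:
        raise ValueError('unclosed <%s>' % stack[-1])
```

With this sketch, `check_balanced('<ul><li></ul>')` raises `ValueError` right away, instead of letting the mismatch show up later as a layout bug.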

### Attributes - Quotes optional

There are 5 closely related syntaxes:

1. Missing `<a missing>`
1. Empty `<a empty=>`
1. Unquoted `<a href=foo>`
1. Double Quoted `<a href="foo">`
1. Single Quoted `<a href='foo'>`

Note: `<a href=/>` is disallowed because it's ambiguous. Use `<a href="/">` or
`<a href=/ >` or `<a href= />`.
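
A rough sketch of how the five forms can be told apart with a single regular expression. `ATTR_RE` and `parse_attrs()` are hypothetical names, the set of characters allowed in an unquoted value is an assumption borrowed from HTML5, and the ambiguous `<a href=/>` case is not handled here.

```python
import re

# Assumed grammar for one attribute inside a tag. Names are ASCII only.
ATTR_RE = re.compile(r'''
    \s+
    (?P<name> [a-zA-Z][a-zA-Z0-9:_-]* )
    (?:
        (?P<eq> = )
        (?: " (?P<dq> [^"]* ) "
          | ' (?P<sq> [^']* ) '
          | (?P<unq> [^\s"'=<>`]+ )
        )?
    )?
''', re.VERBOSE)


def parse_attrs(attr_text):
    """Return (name, kind, value) triples for the text after the tag name."""
    out = []
    for m in ATTR_RE.finditer(attr_text):
        name = m.group('name')
        if m.group('eq') is None:
            out.append((name, 'Missing', None))
        elif m.group('dq') is not None:
            out.append((name, 'DoubleQuoted', m.group('dq')))
        elif m.group('sq') is not None:
            out.append((name, 'SingleQuoted', m.group('sq')))
        elif m.group('unq') is not None:
            out.append((name, 'Unquoted', m.group('unq')))
        else:
            out.append((name, 'Empty', ''))  # '=' with nothing after it
    return out
```

For example, `parse_attrs(' empty= missing')` yields an Empty attribute followed by a Missing one, i.e. two attributes.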

### Text - Regular or CDATA

#### Regular Text

- Any UTF-8 text.
- Generally, `& < > " '` should be escaped as `&amp; &lt; &gt; &quot; &apos;`.

But we are lenient and allow raw `>` between tags:

    <p> foo > bar </p>

and raw `<` inside tags:

    <span foo="<" > foo </span>

#### CDATA

Like HTML5, we support explicit `<![CDATA[`, even though it's implicit in the
tags.

### Escaped Chars - named, decimal, hex

1. `&amp;` - named
1. `&#999;` - decimal
1. `&#xff;` - hex
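
Decoding the three forms might look like the sketch below. `CHAR_REF_RE` and `decode_char_refs()` are hypothetical names. The named table is Python's `html.entities.name2codepoint` (HTML 4 names, so `&apos;` is not in it), which is an assumption and not necessarily the set of names HTM8 accepts.

```python
import re
from html.entities import name2codepoint

# One alternative per form: hex, decimal, named.
CHAR_REF_RE = re.compile(
    r'&(?:#x([0-9a-fA-F]+)|#([0-9]+)|([a-zA-Z][a-zA-Z0-9]*));')


def decode_char_refs(s):
    """Replace character references with the characters they denote."""
    def repl(m):
        hex_digits, dec_digits, name = m.groups()
        if hex_digits is not None:
            return chr(int(hex_digits, 16))
        if dec_digits is not None:
            return chr(int(dec_digits))
        if name in name2codepoint:
            return chr(name2codepoint[name])
        return m.group(0)  # unknown name: leave the text untouched

    return CHAR_REF_RE.sub(repl, s)
```

Note that `chr()` raises `ValueError` for code points above U+10FFFF. A real decoder would have to decide whether that is a syntax error.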

### Comments - HTML or XML

1. `<!-- -->`
1. `<? ?>` (XML processing instruction)

### Declarations - HTML or XML

- `<!DOCTYPE html>` from HTML5
- `<?xml version= ... ?>` from XML - this is a comment / processing instruction

## Special Rules For Specific HTML Tags

### `<script>` and `<style>` are Leaf Tags with Special Lexing

- `<script> <style>`

Note: we still have CDATA for compatibility.
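
One way to picture the special lexing: once the open tag has been seen, everything up to the matching close tag is raw text, so `<` and `&` inside it are not markup. A minimal sketch, where `read_leaf_body()` is a hypothetical helper and the close tag is assumed to have no attributes or inner whitespace:

```python
def read_leaf_body(doc, tag_name, start):
    """Return (raw_body, end_pos) for a leaf tag whose body begins at start.

    tag_name is spelled exactly as in the open tag, and the close tag
    must use the same spelling.
    """
    close_tag = '</%s>' % tag_name
    i = doc.find(close_tag, start)
    if i == -1:
        raise ValueError('missing %s' % close_tag)
    return doc[start:i], i + len(close_tag)
```

Because the search is for the exact spelling, a document like `<script></SCRIPT>` fails with `ValueError` in this sketch.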

### 16 VOID Tags Don't Need Close Tags (Special Parsing)

- `<source> ...`


## Errors

### Notes on Leniency

Angle brackets:

- `<a foo="<">` is allowed, but `<a foo=">">` is disallowed
- `<p> 4>3 </p>` is allowed, but `<p> 4<3 </p>` is disallowed

This makes lexing the top-level structure easier.

- unescaped `&` is allowed, unlike XML
  - it's very common in `<a href="?foo=42&bar=99">`
  - It's lexed as BadAmpersand, in case you want to fix it for XML. Although
    we don't do that for `<` and `>` consistently.
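
To see why the angle bracket rules make top-level lexing easier: if text never contains a raw `<` and tags never contain a raw `>`, then the document can be split with two substring searches, without looking inside quotes. A sketch, where `split_top_level()` is a hypothetical name and comments, CDATA and leaf tags are ignored:

```python
def split_top_level(doc):
    """Yield ('Tag', s) and ('Text', s) spans, in document order."""
    pos = 0
    n = len(doc)
    while pos < n:
        lt = doc.find('<', pos)
        if lt == -1:
            yield 'Text', doc[pos:]  # trailing text, no more tags
            return
        if lt > pos:
            yield 'Text', doc[pos:lt]
        gt = doc.find('>', lt)  # the first '>' always ends the tag
        if gt == -1:
            raise ValueError('unclosed tag at offset %d' % lt)
        yield 'Tag', doc[lt:gt + 1]
        pos = gt + 1
```

The disallowed form `<a foo=">">` would be cut off at the first `>` by this scan, which is exactly the case the rule excludes.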

### What are some examples of syntax errors?

- HTM8 tags must be balanced to convert them to XML

- `<script></SCRIPT>` isn't matched
  - the begin and end tags must match exactly, like `<SCRipt></SCRipt>`
  - likewise for `<style>`

- NUL bytes aren't allowed - currently due to re2c sentinel. Two options:
  1. Make it a syntax error - like JSON8
  1. we could have a reprocessing pass to convert it to the Unicode replacement
     char? I think that HTML might mandate that
- Encodings other than UTF-8. HTM8 is always UTF-8.
- Unicode Tag names and attribute names.
  - This is allowed in HTML5 and XML.
  - We leave those out for simpler lexing. Text and attribute values may be unicode.

- `<a href=">">` - no literal `>` inside quotes
  - HTML5 handles it, but we want to easily scan the "top level" structure of the doc
  - And it doesn't appear to be common in our testdata
  - TODO: we will handle `<a href="&">`

HTML notes:

There are 5 kinds of tags:

- Normal HTML tags
- RCDATA for `<title> <textarea>`
- RAWTEXT `<style> <xmp> <iframe>` ?

and we have

- CDATA `<script>`
  - TODO: we need a test case for `</script>` in a string literal?
- Foreign `<math> <svg>` - XML rules

## Under the Hood - Regular Languages, Algebraic Data Types

That is, we use exhaustive reasoning.

It's meant to be easy to implement.

### 2 Layers of Lexing

1. TagLexer
1. AttrLexer

### 4 Regular Expressions

Using re2c as the "choice" primitive.

1. Lexer
1. NAME lexer
1. Begin VALUE lexer
1. Quoted value lexer - for decoding `<a href="&amp;">`

## XML Parsing Mode

- Set `NO_SPECIAL_TAGS` - get rid of special cases for `<script>` and `<style>`

Conflicts between HTML5 and XML:

- In XML, `<source>` is like any tag, and must be closed
- In HTML, `<source>` is a VOID tag, and must NOT be closed

- In XML, `<script>` and `<style>` don't have special treatment
- In HTML, they do

- The header is different - `<!DOCTYPE html>` vs. `<?xml version= ... ?>`

- HTML: `<a empty= missing>` is two attributes
- right now we don't handle `<a empty = "missing">` as a single attribute
  - that is valid XML, so should we handle it?

## Algorithms

### What Do You Use This for?

- Stripping comments
- Adding TOC
- Syntax highlighting code
- Adding link shortcuts
- ul-table
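
As a taste of the first item, stripping comments needs no tree at all. The sketch below uses a plain Python regex rather than the HTM8 lexer, so it is an approximation: `strip_comments()` is a hypothetical name, and a `<!--` that appears inside `<script>` or CDATA would be stripped too.

```python
import re

# Non-greedy, so each comment ends at the first '-->' after it starts.
COMMENT_RE = re.compile(r'<!--.*?-->', re.DOTALL)


def strip_comments(doc):
    """Return doc with HTML comments removed."""
    return COMMENT_RE.sub('', doc)
```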

TODO:

- DOM API on top of it
  - node.elementsByTag('p')
  - node.elementsByClassName('left')
  - node.elementByID('foo')
  - innerHTML() outerHTML()
  - tag attrs
  - low level:
    - outerLeft, outerRight, innerLeft, innerRight
- CSS Selectors - `querySelectorAll()`
- sed-like model


### List of Algorithms

- Lexing/Parsing
  - lex just the top level
  - lex both levels
  - match tags - this is the level for value.Htm8Frag?
- sed-like
  - convert to XML!
  - sed-like replacement of DOM Tree or element - e.g. Oils TOC
- Structured
  - convert to DOMTree
  - lazy selection by tag, or attr (id= and class=)
  - lazy selection by CSS selector expression
  - untrusted HTML filter, e.g. like StackOverflow / Reddit
    - this is Safe HTM8
    - should have a zero alloc way to support this, with good errors?
      - I think most of them silently strip data

### Emitting HTM8 as HTML5

Just emit it! This always works, by design.

### Converting to XML

- Add quotes to unquoted attributes
  - single and double quotes stay the same?
- Quote special chars - in text, and inside single- and double-quoted attr values
  - & BadAmpersand -> `&amp;`
  - < BadLessThan -> `&lt;`
  - > BadGreaterThan -> `&gt;`
- `<script>` and `<style>`
  - either add `<![CDATA[`
  - or simply escape their values with `&amp; &lt;`
- what to do about case-insensitive tags?
  - maybe you can just normalize them
  - because we do strict matching
- Maybe validate any other declarations, like `<!DOCTYPE foo>`
- Add XML header `<?xml version=>`, remove `<!DOCTYPE html>`
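
The first two bullets can be sketched as a `sed`-like substitution. `SPECIAL_RE`, `escape_for_xml()` and `quote_attr()` are hypothetical names, and this is not the Oils converter. One more assumption: well-formed character references pass through unchanged, even though XML without a DTD only predefines `&amp; &lt; &gt; &quot; &apos;`.

```python
import re

# Either a well-formed character reference, or one bare special char.
SPECIAL_RE = re.compile(
    r'&(?:#x[0-9a-fA-F]+|#[0-9]+|[a-zA-Z][a-zA-Z0-9]*);|[&<>]')

_ESCAPES = {'&': '&amp;', '<': '&lt;', '>': '&gt;'}


def escape_for_xml(text):
    """Escape bare & < > and leave existing character references alone."""
    def repl(m):
        tok = m.group(0)
        return _ESCAPES.get(tok, tok)  # char refs are not in the table

    return SPECIAL_RE.sub(repl, text)


def quote_attr(name, value):
    """Emit name="value" for an unquoted attribute value."""
    return '%s="%s"' % (name, escape_for_xml(value).replace('"', '&quot;'))
```

For example, `quote_attr('href', '?foo=42&bar=99')` gives `href="?foo=42&amp;bar=99"`.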

## Related

- [ysh-doc-processing.html](ysh-doc-processing.html)
- [table-object-doc.html](table-object-doc.html)


## Brainstorming / TODO

### Foreign XML with `<svg>` and `<math>` ?

`<svg>` and `<math>` are foreign XML content.

We might want to support this.

- So I can just switch to XML mode in that case
- TODO: we need a test corpus for this!
- maybe look for Wikipedia content
- can we also just disallow these? Can you make these into external XML files?

This is one way:

    <object data="math.xml" type="application/mathml+xml"></object>
    <object data="drawing.xml" type="image/svg+xml"></object>

Then we don't need special parsing?