---
in_progress: yes
default_highlighter: oils-sh
---

HTM8 - Efficient HTML with Errors
=================================

- Syntax Errors: It's a Subset
- Efficient
  - Easy to Remember
  - Easy to Implement
  - Runs Efficiently - you don't have to materialize a big DOM tree, which
    causes many allocations

<div id="toc">
</div>

## Basic Structure

### Text Content

Text content is anything except `&` and `<`.

Those two characters must be written as `&amp;` and `&lt;`.

`>` is allowed literally, or you can escape it as `&gt;`.
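
As a minimal sketch of the escaping rule above, a Python helper might look like this. The name `escape_text` is hypothetical and is not taken from the Oils codebase.

```python
def escape_text(s):
    """Escape a raw string so it is valid HTM8 text content.

    '&' is replaced first; otherwise the '&' introduced by '&lt;'
    would itself be escaped.  '>' is left as-is, since it is allowed.
    """
    return s.replace('&', '&amp;').replace('<', '&lt;')


print(escape_text('if (a < b && b > c)'))
# prints: if (a &lt; b &amp;&amp; b > c)
```

Note that Python's own `html.escape()` also escapes `>`, which is harmless but not required here.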

### 3 Kinds of Character Code

1. `&amp;` - named
1. `&#999;` - decimal
1. `&#xff;` - hex
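
One way to recognize the three kinds is a single regular expression with one named group per kind. This is only a sketch: the pattern and the function `classify_char_code` are illustrative assumptions, not the lexer used by Oils, and the set of accepted names is not checked.

```python
import re

# Hypothetical pattern: one named group per kind of character code.
CHAR_CODE = re.compile(r'''
    & (?:
        (?P<named>   [a-zA-Z][a-zA-Z0-9]* )   # &amp;
      | \#  (?P<decimal> [0-9]+ )             # &#999;
      | \#x (?P<hex>     [0-9a-fA-F]+ )       # &#xff;
    ) ;
''', re.VERBOSE)


def classify_char_code(s):
    """Return 'named', 'decimal' or 'hex', or None if s is not a char code."""
    m = CHAR_CODE.fullmatch(s)
    if m is None:
        return None
    return m.lastgroup  # name of the one group that matched


for s in ['&amp;', '&#999;', '&#xff;', '&;']:
    print(s, classify_char_code(s))
```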

### 3 Kinds of Tag

1. Start
1. End
1. StartEnd
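
The three kinds can be told apart by where the `/` appears. The sketch below assumes ASCII tag names and relies on the restriction, described in the FAQ, that `>` never appears inside a tag. The pattern and `classify_tag` are hypothetical names for illustration.

```python
import re

# Hypothetical pattern: optional leading slash, ASCII name, then
# everything up to the first '>' (attributes are not examined here).
TAG = re.compile(r'<(?P<slash>/?)(?P<name>[a-zA-Z][a-zA-Z0-9-]*)(?P<rest>[^>]*)>')


def classify_tag(s):
    """Return 'Start', 'End' or 'StartEnd', or None if s is not a tag."""
    m = TAG.fullmatch(s)
    if m is None:
        return None
    if m.group('slash'):
        return 'End'          # </p>
    if m.group('rest').endswith('/'):
        return 'StartEnd'     # <br/>
    return 'Start'            # <p>


for s in ['<p>', '</p>', '<br/>', '<img src="a.png" />']:
    print(s, classify_tag(s))
```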

### 2 Kinds of Attribute

1. Unquoted
1. Quoted
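
A sketch of telling the two kinds apart, given the text that follows a tag name. It assumes ASCII attribute names and accepts both double and single quotes for the Quoted kind, which is how HTML behaves but is an assumption as far as HTM8 goes. Attributes with no value are outside this sketch, and `parse_attrs` is a hypothetical name.

```python
import re

ATTR = re.compile(r'''
    \s+ (?P<name> [a-zA-Z][a-zA-Z0-9_-]* )   # attribute name (ASCII only)
    \s* = \s*
    (?:
        " (?P<dq> [^"]* ) "                  # double-quoted value
      | ' (?P<sq> [^']* ) '                  # single-quoted value
      | (?P<unquoted> [^\s"'=<>]+ )          # unquoted value
    )
''', re.VERBOSE)


def parse_attrs(attr_text):
    """Return (name, kind, value) triples for the attributes in attr_text."""
    result = []
    for m in ATTR.finditer(attr_text):
        if m.group('unquoted') is not None:
            result.append((m.group('name'), 'Unquoted', m.group('unquoted')))
        else:
            dq = m.group('dq')
            value = dq if dq is not None else m.group('sq')
            result.append((m.group('name'), 'Quoted', value))
    return result


print(parse_attrs(' class=left href="/a b"'))
```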

### 2 Kinds of Comment

1. `<!-- -->`
1. `<? ?>` (XML processing instruction)
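
Both kinds have a fixed opener and closer, so a non-greedy match is enough to find them. The sketch below strips them from a document. It is a simplification with a hypothetical `strip_comments` helper: it does not skip over `<script>` or `<style>` content, where these sequences are not comments.

```python
import re

# '.' must also match newlines, since comments can span lines.
COMMENT = re.compile(r'<!--.*?-->|<\?.*?\?>', re.DOTALL)


def strip_comments(doc):
    """Remove <!-- --> comments and <? ?> processing instructions."""
    return COMMENT.sub('', doc)


print(strip_comments('<p>a<!-- note -->b</p><?pi x?>'))
# prints: <p>ab</p>
```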


## Special Rules, From HTML

### 2 Tags Cause Special Lexing

- `<script> <style>`

Note: we still have CDATA for compatibility.
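
In HTML, the content of these two tags is not lexed as markup: `<` and `&` are ordinary characters until the matching end tag. A lexer can therefore jump straight to the closer. The helper below is a sketch of that idea; `find_raw_end` is a hypothetical name, and matching the end tag case-insensitively is an assumption borrowed from HTML.

```python
import re


def find_raw_end(doc, start, tag_name):
    """Return the index of the end tag for tag_name at or after start, or -1."""
    closer = re.compile(r'</' + re.escape(tag_name) + r'\s*>', re.IGNORECASE)
    m = closer.search(doc, start)
    return m.start() if m else -1


doc = '<script>if (a < b && c) { f() }</script><p>hi</p>'
start = len('<script>')
end = find_raw_end(doc, start, 'script')
print(doc[start:end])
# prints: if (a < b && c) { f() }
```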


### 16 VOID Tags Change Parsing

- `<source> ...`
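
VOID tags matter to anything that matches start tags with end tags, because a VOID start tag never gets an end tag. The sketch below shows the effect on a simple balance check. The set of VOID tags is passed in by the caller rather than hard-coded, since the full list is not reproduced here. The `check_balanced` function is hypothetical and ignores comments and `<script>` / `<style>` content.

```python
import re

# Groups: optional leading slash, tag name, optional trailing slash.
TAG = re.compile(r'<(/?)([a-zA-Z][a-zA-Z0-9-]*)[^>]*?(/?)>')


def check_balanced(doc, void_tags):
    """Return True if every non-VOID start tag has a matching end tag."""
    stack = []
    for m in TAG.finditer(doc):
        slash, name, self_close = m.group(1), m.group(2).lower(), m.group(3)
        if slash:                        # End tag: must match top of stack
            if not stack or stack[-1] != name:
                return False
            stack.pop()
        elif self_close or name in void_tags:
            continue                     # StartEnd or VOID: nothing to close
        else:
            stack.append(name)           # Start tag: expect an end tag later
    return not stack


doc = '<video><source src="a.mp4"></video>'
print(check_balanced(doc, {'source'}))   # True
print(check_balanced(doc, set()))        # False: <source> is never closed
```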

### Bonus: XML Mode

- Get rid of the 2 special lexing tags, and 16 VOID tags

Then you can query HTML


## Under the Hood

### 3 Layers of Lexing

1. Tag
1. Attributes within a Tag
1. Quoted Value for Attributes
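
The layering can be sketched as three nested loops, each running its own small regex over the span found by the layer above. This is one reading of the list, not the actual Oils lexer: the patterns are simplified (start tags and double-quoted values only), and treating character codes as the tokens of the third layer is an assumption.

```python
import re

TAG = re.compile(r'<([a-zA-Z][a-zA-Z0-9-]*)([^>]*)>')         # layer 1
ATTR = re.compile(r'\s+([a-zA-Z][a-zA-Z0-9_-]*)="([^"]*)"')   # layer 2
VALUE_PART = re.compile(r'(&[a-zA-Z0-9#]+;)|([^&]+)')         # layer 3


def lex(doc):
    """Return a flat list of (kind, text) events from all three layers."""
    events = []
    for tag in TAG.finditer(doc):
        events.append(('Tag', tag.group(1)))
        # Layer 2 only sees the text inside this tag.
        for attr in ATTR.finditer(tag.group(2)):
            events.append(('Attr', attr.group(1)))
            # Layer 3 only sees the text inside this quoted value.
            for part in VALUE_PART.finditer(attr.group(2)):
                kind = 'CharCode' if part.group(1) else 'Literal'
                events.append((kind, part.group(0)))
    return events


for event in lex('<a href="?x=1&amp;y=2">'):
    print(event)
```

Because each layer works on a slice already delimited by the outer layer, no layer needs to know the grammar of the others.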

## What Do You Use This for?

- Stripping comments
- Adding TOC
- Syntax highlighting code
- Adding link shortcuts
- ul-table

TODO:

- DOM API on top of it
  - node.elementsByTag('p')
  - node.elementsByClassName('left')
  - node.elementByID('foo')
  - innerHTML() outerHTML()
  - tag attrs
  - low level:
    - outerLeft, outerRight, innerLeft, innerRight
- CSS Selectors - `querySelectorAll()`
- sed-like model

## Algorithms

### Emitting HTM8 as HTML5

Just emit it! This always works, by design.

### Parsing XML

- Set `NO_SPECIAL_TAGS`

### Converting to XML?

- Always quote all attributes
- Always quote `>` - are we allowing this in HX8?
- Do something with `<script>` and `<style>`
  - I guess turn them into normal tags, with escaping?
  - Or maybe just disallow them?
- Maybe validate any other declarations, like `<!DOCTYPE foo>`
- Add XML header `<?xml version=>`, remove `<!DOCTYPE html>`
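
The first two bullets could be sketched as a single helper that emits an attribute in a form XML accepts. The function `xml_attr` is hypothetical. It assumes the value has already been decoded (no character codes left in it) and always uses double quotes.

```python
def xml_attr(name, value):
    """Emit name="value" with the value quoted and escaped for XML."""
    escaped = (value.replace('&', '&amp;')   # must come first
                    .replace('<', '&lt;')
                    .replace('>', '&gt;')
                    .replace('"', '&quot;'))
    return '%s="%s"' % (name, escaped)


print(xml_attr('class', 'left'))
# prints: class="left"
print(xml_attr('href', '?a>b&c'))
# prints: href="?a&gt;b&amp;c"
```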

## Related

- [ysh-doc-processing.html](ysh-doc-processing.html)
- [table-object-doc.html](table-object-doc.html)

## FAQ

### What Doesn't This Cover?

- Encodings other than UTF-8. HTM8 is always UTF-8.
- Unicode tag names and attribute names.
  - This is allowed in HTML5 and XML.
  - We leave those out for simpler lexing. Text and attribute values may be Unicode.

- `<a href=">">` - no literal `>` inside quotes
  - HTML5 handles it, but we want to easily scan the "top level" structure of the doc
  - And it doesn't appear to be common in our testdata
  - TODO: we will handle `<a href="&">`

There are 5 kinds of tags:

- Normal HTML tags
- RCDATA for `<title> <textarea>`
- RAWTEXT `<style> <xmp> <iframe>` ?

and we have

- CDATA `<script>`
  - TODO: we need a test case for `</script>` in a string literal?
- Foreign `<math> <svg>` - XML rules

## TODO

- `<svg>` and `<math>` are foreign XML content? Doh
  - So I can just switch to XML mode in that case
  - TODO: we need a test corpus for this!
  - maybe look for Wikipedia content
- can we also just disallow these? Can you make these into external XML files?

This is one way:

    <object data="math.xml" type="application/mathml+xml"></object>
    <object data="drawing.xml" type="image/svg+xml"></object>

Then we don't need special parsing?