---
in_progress: yes
default_highlighter: oils-sh
---

HTM8 - Efficient HTML with Errors
=================================

- Syntax Errors: It's a Subset
- Efficient
  - Easy to Remember
  - Easy to Implement
  - Runs Efficiently - you don't have to materialize a big DOM tree, which
    causes many allocations

<div id="toc">
</div>

## Basic Structure

### Text Content

Text content is anything except `&` and `<`.

These must be escaped as `&amp;` and `&lt;`.

`>` is allowed as is, or you can escape it as `&gt;`.

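The escaping rule above can be sketched in a few lines of Python. This is only an illustration: the name `escape_text` is hypothetical and is not part of any HTM8 API.

```python
def escape_text(s):
    """Escape a string so it can be used as text content.

    Only & and < are replaced; > is left alone, since it is allowed in text.
    """
    # Replace & first, so the & inside the generated &lt; is not escaped again.
    return s.replace('&', '&amp;').replace('<', '&lt;')
```

For example, `escape_text('a < b && c > d')` gives `a &lt; b &amp;&amp; c > d`.
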
### 3 Kinds of Character Code

1. `&amp;` - named
1. `&#999;` - decimal
1. `&#xff;` - hex

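One way to tell the three kinds apart is a single regex with one named group per kind. This is a sketch in Python, not the actual HTM8 lexer. The exact character sets (for example, which letters a name may contain) are assumptions, and `classify_char_ref` is a hypothetical name.

```python
import re

# One alternative per kind of character code.
# In VERBOSE mode, '#' must be escaped so it isn't read as a comment.
CHAR_REF = re.compile(r'''
  & (?:
      (?P<named>   [a-zA-Z][a-zA-Z0-9]* )
    | \#  (?P<decimal> [0-9]+ )
    | \#x (?P<hex>     [0-9a-fA-F]+ )
  ) ;
''', re.VERBOSE)


def classify_char_ref(s):
    """Return 'named', 'decimal', 'hex', or None if s is not a character code."""
    m = CHAR_REF.fullmatch(s)
    if m is None:
        return None
    # Only one of the named groups can match, so lastgroup is its name.
    return m.lastgroup
```
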
### 3 Kinds of Tag

1. Start
1. End
1. StartEnd

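The three kinds differ only in where the `/` appears, which a small Python sketch can show. The regex below is a simplification and not the real lexer: it assumes ASCII tag names and no `>` inside attribute values. `classify_tag` is a hypothetical name.

```python
import re

TAG = re.compile(r'''
  < (?P<slash> / )?                    # leading slash: End tag
  (?P<name> [a-zA-Z][a-zA-Z0-9\-]* )
  (?P<attrs> [^>]*? )                  # attributes, not examined here
  (?P<self_close> / )?                 # trailing slash: StartEnd tag
  >
''', re.VERBOSE)


def classify_tag(s):
    """Return 'Start', 'End', 'StartEnd', or None if s is not a single tag."""
    m = TAG.fullmatch(s)
    if m is None:
        return None
    if m.group('slash'):
        return 'End'
    if m.group('self_close'):
        return 'StartEnd'
    return 'Start'
```
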
### 2 Kinds of Attribute

1. Unquoted
1. Quoted

### 2 Kinds of Comment

1. `<!-- -->`
1. `<? ?>` (XML processing instruction)


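Since both kinds of comment have fixed delimiters, a naive comment stripper fits in one regex. The Python sketch below is an assumption-laden illustration: it does not know about `<script>` or `<style>`, so it would also strip comment-like text inside them. `strip_comments` is a hypothetical name.

```python
import re

# Non-greedy matches, so each comment ends at its first closing delimiter.
# DOTALL lets a comment span multiple lines.
COMMENT = re.compile(r'<!--.*?-->|<\?.*?\?>', re.DOTALL)


def strip_comments(html):
    """Remove <!-- --> and <? ?> spans from a string."""
    return COMMENT.sub('', html)
```

For example, `strip_comments('a<!-- x -->b<?xml version="1.0"?>c')` gives `abc`.
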
## Special Rules, From HTML

### 2 Tags Cause Special Lexing

- `<script> <style>`

Note: we still have CDATA for compatibility.


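To illustrate what "special lexing" means here: after a `<script>` or `<style>` start tag, the content is taken verbatim up to the matching end tag, so `<` and `&` inside it are not markup. The Python sketch below is a simplified model, not the actual HTM8 lexer. `split_special` is a hypothetical name, and the rule that the raw span ends at the first closing tag is an assumption.

```python
import re

# Assumption: a special start tag is <script ...> or <style ...>, in any case.
SPECIAL_START = re.compile(r'<(script|style)(?=[\s>])[^>]*>', re.IGNORECASE)
CLOSE = {
    'script': re.compile(r'</script>', re.IGNORECASE),
    'style': re.compile(r'</style>', re.IGNORECASE),
}


def split_special(html):
    """Yield (kind, text) pairs, where kind is 'markup' or 'raw'."""
    pos = 0
    while True:
        m = SPECIAL_START.search(html, pos)
        if m is None:
            if pos < len(html):
                yield ('markup', html[pos:])
            return
        # Everything up to and including the start tag is ordinary markup.
        yield ('markup', html[pos:m.end()])
        name = m.group(1).lower()
        end_m = CLOSE[name].search(html, m.end())
        if end_m is None:
            raise ValueError('unclosed <%s>' % name)
        # The raw span is not lexed for tags or character codes.
        yield ('raw', html[m.end():end_m.start()])
        pos = end_m.start()
```

Note that this sketch ends the raw span at the first `</script>`, even if it sits inside a JavaScript string literal.
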
### 16 VOID Tags Change Parsing

- `<source> ...`

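Here is a rough Python sketch of how VOID tags change parsing: a balance checker pushes normal start tags on a stack, but never pushes a VOID tag, because it has no end tag. The set below is only a handful of well-known HTML void elements, not the full list of 16. `check_balanced` and `VOID_SUBSET` are hypothetical names.

```python
import re

# A few well-known HTML void elements. This is NOT the full list of 16.
VOID_SUBSET = frozenset(['br', 'hr', 'img', 'input', 'link', 'meta', 'source'])

# Groups: leading slash, tag name, trailing slash.
# Simplification: assumes no '>' inside attribute values.
TAG = re.compile(r'<(/?)([a-zA-Z][a-zA-Z0-9]*)[^>]*?(/?)>')


def check_balanced(html, void_tags=VOID_SUBSET):
    """Return True if every non-void start tag has a matching end tag."""
    stack = []
    for m in TAG.finditer(html):
        slash, name, self_close = m.groups()
        name = name.lower()
        if slash:
            if not stack or stack.pop() != name:
                return False
        elif self_close or name in void_tags:
            continue  # StartEnd and VOID tags never need an end tag
        else:
            stack.append(name)
    return not stack
```

With the default set, `check_balanced('<p>a<br>b</p>')` is true. Passing an empty `void_tags` set makes the same input unbalanced, which is roughly what dropping the VOID rule means.
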
### Bonus: XML Mode

- Get rid of the 2 special lexing tags, and 16 VOID tags

Then you can query HTML


## Under the Hood

### 3 Layers of Lexing

1. Tag
1. Attributes within a Tag
1. Quoted Value for Attributes

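The three layers can be pictured as three regexes, each applied inside the span found by the previous one. This is a minimal Python sketch under several assumptions: ASCII names, double-quoted values only, no `>` inside a tag, and a tiny table of named character codes. None of these names come from the real implementation.

```python
import re

# Layer 1: a start tag, with its attribute text captured as one span.
TAG = re.compile(r'<([a-zA-Z][a-zA-Z0-9]*)([^>]*)>')

# Layer 2: attributes within that span; the value is quoted or unquoted.
ATTR = re.compile(
    r'([a-zA-Z][a-zA-Z0-9\-]*)(?:\s*=\s*(?:"([^"]*)"|([^\s"=<>]+)))?')

# Layer 3: character codes within a quoted value.
CHAR_REF = re.compile(r'&(?:#x([0-9a-fA-F]+)|#([0-9]+)|([a-zA-Z]+));')

# Small subset of named character codes, for illustration only.
NAMED = {'amp': '&', 'lt': '<', 'gt': '>', 'quot': '"'}


def decode_value(raw):
    """Decode character codes in a quoted attribute value."""
    def repl(m):
        hex_digits, dec_digits, name = m.groups()
        if hex_digits:
            return chr(int(hex_digits, 16))
        if dec_digits:
            return chr(int(dec_digits))
        return NAMED.get(name, m.group(0))  # unknown names are left as is
    return CHAR_REF.sub(repl, raw)


def parse_start_tag(s):
    """Return (tag_name, attrs_dict) for a single start tag."""
    m = TAG.fullmatch(s)
    if m is None:
        raise ValueError('not a start tag: %r' % s)
    name, attr_str = m.groups()
    attrs = {}
    for a in ATTR.finditer(attr_str):
        key, quoted, unquoted = a.groups()
        if quoted is not None:
            attrs[key] = decode_value(quoted)
        elif unquoted is not None:
            attrs[key] = unquoted
        else:
            attrs[key] = ''  # attribute with no value
    return name, attrs
```
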
## What Do You Use This for?

- Stripping comments
- Adding TOC
- Syntax highlighting code
- Adding link shortcuts
- ul-table

TODO:

- DOM API on top of it
  - node.elementsByTag('p')
  - node.elementsByClassName('left')
  - node.elementByID('foo')
  - innerHTML() outerHTML()
  - tag attrs
  - low level:
    - outerLeft, outerRight, innerLeft, innerRight
- CSS Selectors - `querySelectorAll()`
- sed-like model

## Algorithms

### Emitting HTM8 as HTML5

Just emit it! This always works, by design.

### Parsing XML

- Set `NO_SPECIAL_TAGS`

### Converting to XML?

- Always quote all attributes
- Always quote `>` - are we allowing this in HX8?
- Do something with `<script>` and `<style>`
  - I guess turn them into normal tags, with escaping?
  - Or maybe just disallow them?
- Maybe validate any other declarations, like `<!DOCTYPE foo>`
- Add XML header `<?xml version=>`, remove `<!DOCTYPE html>`

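The first bullet, quoting all attributes, might look like the Python sketch below. It covers only that one step and works on the attribute text of a single tag. `quote_attrs` is a hypothetical name, and the sketch leaves attributes that have no value untouched, even though XML doesn't allow them.

```python
import re

ATTR = re.compile(r'''
  (?P<name> [a-zA-Z][a-zA-Z0-9\-]* )
  \s* = \s*
  (?: (?P<quoted> "[^"]*" ) | (?P<unquoted> [^\s"=<>]+ ) )
''', re.VERBOSE)


def quote_attrs(attr_str):
    """Rewrite name=value as name="value"; quoted attributes are unchanged."""
    def repl(m):
        if m.group('quoted') is not None:
            return m.group(0)
        return '%s="%s"' % (m.group('name'), m.group('unquoted'))
    return ATTR.sub(repl, attr_str)
```

For example, `quote_attrs(' class=left id="x"')` gives ` class="left" id="x"`.
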
## Related

- [ysh-doc-processing.html](ysh-doc-processing.html)
- [table-object-doc.html](table-object-doc.html)

## FAQ

### What Doesn't This Cover?

- single-quoted attributes?
  - We should probably add those, it shouldn't be hard?

- Encodings other than UTF-8. HTM8 is always UTF-8.
- Unicode tag names and attribute names.
  - This is allowed in HTML5 and XML.
  - We leave those out for simpler lexing. Text and attribute values may be Unicode.

There are 5 kinds of tags:

- Normal HTML tags
- RCDATA for `<title> <textarea>`
- RAWTEXT `<style> <xmp> <iframe>` ?

and we have

- CDATA `<script>`
  - TODO: we need a test case for `</script>` in a string literal?
- Foreign `<math> <svg>` - XML rules

## TODO

- `<svg>` and `<math>` are foreign XML content? Doh
  - So I can just switch to XML mode in that case
  - TODO: we need a test corpus for this!
    - maybe look for Wikipedia content
- can we also just disallow these? Can you make these into external XML files?

This is one way:

    <object data="math.xml" type="application/mathml+xml"></object>
    <object data="drawing.xml" type="image/svg+xml"></object>

Then we don't need special parsing?
