---
in_progress: yes
default_highlighter: oils-sh
---

HTM8 - Efficient HTML with Errors
=================================

- Syntax Errors: It's a Subset
- Efficient
  - Easy to Remember
  - Easy to Implement
  - Runs Efficiently - you don't have to materialize a big DOM tree, which
    causes many allocations

<div id="toc">
</div>

## Basic Structure

### Text Content

Anything except `&` and `<`.

These must be written as `&amp;` and `&lt;`.

`>` is allowed, or you can escape it with `&gt;`.

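As a rough illustration of the rule above, here is a minimal Python sketch that turns an arbitrary string into valid text content. The name `escape_text` is hypothetical, not part of any HTM8 API.

```python
def escape_text(s: str) -> str:
    """Escape a string so it can be used as HTM8 text content.

    Hypothetical helper for illustration only.
    """
    # Replace & first, so the & inside the generated &lt; is not
    # escaped a second time.
    return s.replace('&', '&amp;').replace('<', '&lt;')


print(escape_text('a < b && c > d'))
```

Note that `>` is left alone, since it is allowed in text.
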
### 3 Kinds of Character Code

1. `&amp;` - named
1. `&#999;` - decimal
1. `&#xff;` - hex

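The three kinds can be told apart with a single regular expression. This is a sketch, not the HTM8 lexer: the exact character sets (ASCII letters and digits for names, lowercase `x` for hex) are assumptions, and `classify_char_code` is a hypothetical name.

```python
import re

# Assumed syntax: &name;  &#digits;  &#xhexdigits;
CHAR_CODE_RE = re.compile(r'''
    &
    (?:
        (?P<named>   [a-zA-Z][a-zA-Z0-9]* )
      | \#  (?P<decimal> [0-9]+ )
      | \#x (?P<hex>     [0-9a-fA-F]+ )
    )
    ;
''', re.VERBOSE)


def classify_char_code(s):
    """Return 'named', 'decimal', 'hex', or None if s is not a character code."""
    m = CHAR_CODE_RE.fullmatch(s)
    if m is None:
        return None
    # Exactly one named group takes part in a match, so lastgroup is its name.
    return m.lastgroup


for code in ['&amp;', '&#999;', '&#xff;', '&amp']:
    print(code, classify_char_code(code))
```
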
### 3 Kinds of Tag

1. Start - `<p>`
1. End - `</p>`
1. StartEnd - `<br/>`

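A small sketch of how a single tag could be classified into these three kinds. It relies on the restriction, stated in the FAQ, that a literal `>` does not appear inside a tag. The tag-name character set and the name `classify_tag` are assumptions for illustration.

```python
import re

# Assumption: tag names are ASCII letters, digits and '-', starting with a
# letter.  Since '>' may not appear inside a tag, [^>]* finds the end.
TAG_RE = re.compile(r'''
    <
    (?P<slash> /? )
    (?P<name>  [a-zA-Z][a-zA-Z0-9-]* )
    (?P<rest>  [^>]* )
    >
''', re.VERBOSE)


def classify_tag(s):
    """Return (kind, name), or None if s is not a single tag."""
    m = TAG_RE.fullmatch(s)
    if m is None:
        return None
    if m.group('slash'):
        return ('End', m.group('name'))
    if m.group('rest').endswith('/'):
        return ('StartEnd', m.group('name'))
    return ('Start', m.group('name'))


for tag in ['<p>', '</p>', '<br/>', '<img src=x />']:
    print(tag, classify_tag(tag))
```
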
### 2 Kinds of Attribute

1. Unquoted - `<a href=foo>`
1. Quoted - `<a href="foo">`

### 2 Kinds of Comment

1. `<!-- -->`
1. `<? ?>` (XML processing instruction)

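Both kinds of comment have simple delimiters, so a naive comment stripper fits in a few lines. This is only a sketch: it does not know about `<script>` and `<style>`, so a `<!--` inside those tags would be stripped too. The name `strip_comments` is hypothetical.

```python
import re

# Non-greedy match up to the first closing delimiter; DOTALL lets a
# comment span multiple lines.
COMMENT_RE = re.compile(r'<!--.*?-->|<\?.*?\?>', re.DOTALL)


def strip_comments(doc):
    """Remove <!-- --> comments and <? ?> processing instructions."""
    return COMMENT_RE.sub('', doc)


print(strip_comments('<p>a<!-- x\ny -->b<?xml version="1.0"?>c</p>'))
```
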

## Special Rules, From HTML

### 2 Tags Cause Special Lexing

- `<script> <style>`

Note: we still have CDATA for compatibility.

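One way to picture the special lexing: after a `<script>` or `<style>` start tag, the lexer skips ahead to the matching end tag instead of looking for tags and character codes. The sketch below assumes the end tag is matched case-insensitively, as in HTML5. That assumption and the name `find_raw_text_end` are illustrative, not taken from the HTM8 implementation.

```python
import re


def find_raw_text_end(doc, start, tag_name):
    """Return the index where the end tag for tag_name begins, or -1.

    Everything between `start` and the returned index is raw text:
    '<' and '&' inside it are not treated as markup.
    """
    pat = re.compile('</' + re.escape(tag_name) + r'\s*>', re.IGNORECASE)
    m = pat.search(doc, start)
    return m.start() if m else -1


doc = '<script>if (a < b && c) { x = "<p>"; }</script><p>hi</p>'
start = len('<script>')
end = find_raw_text_end(doc, start, 'script')
print(doc[start:end])
```
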

### 16 VOID Tags Change Parsing

- `<source> ...`

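To see why VOID tags change parsing, consider a simple check that start and end tags balance. A VOID tag like `<source>` has no end tag, so it must not be pushed on the stack. In this sketch, `VOID_TAGS` is the list of void elements from the current HTML standard (13 names); the 16 tags counted in this doc are a larger set, so treat the list as an assumption. The check also ignores comments and `<script>` / `<style>` content.

```python
import re

# Assumption: void elements from the current HTML standard, not
# necessarily the 16 tags HTM8 uses.
VOID_TAGS = frozenset([
    'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
    'link', 'meta', 'source', 'track', 'wbr',
])

# Groups: optional leading slash, tag name, optional trailing slash.
TAG_RE = re.compile(r'<(/?)([a-zA-Z][a-zA-Z0-9-]*)[^>]*?(/?)>')


def check_balanced(doc, void_tags=VOID_TAGS):
    """Return True if every non-void start tag has a matching end tag."""
    stack = []
    for m in TAG_RE.finditer(doc):
        slash, name, self_closing = m.groups()
        if slash:  # End tag
            if not stack or stack.pop() != name:
                return False
        elif self_closing or name in void_tags:
            continue  # StartEnd tag or VOID tag: nothing to close
        else:
            stack.append(name)
    return not stack


html_doc = '<p><source src="a.mp4"></p>'
print(check_balanced(html_doc))                         # HTML rules
print(check_balanced(html_doc, void_tags=frozenset()))  # no VOID tags
```

Passing an empty `void_tags` set treats `<source>` like any other tag, which is how XML sees it: the same input is then unbalanced unless it is written as `<source ... />`.
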
### Bonus: XML Mode

- Get rid of the 2 special lexing tags, and 16 VOID tags

Then you can query HTML


## Under the Hood

### 3 Layers of Lexing

1. Tag
1. Attributes within a Tag
1. Quoted Value for Attributes

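The three layers can be sketched as three small steps, each working on the output of the previous one. This is not the HTM8 lexer: the regexes only handle double-quoted and unquoted values, `lex_start_tag` is a hypothetical name, and Python's `html.unescape` (which follows HTML5 rules) stands in for decoding character codes in quoted values.

```python
import html
import re

# Layer 1: the tag as a whole.  No literal '>' is allowed inside a tag.
TAG_RE = re.compile(r'<([a-zA-Z][a-zA-Z0-9-]*)([^>]*)>')

# Layer 2: attributes within the tag.
ATTR_RE = re.compile(r'''
    ([a-zA-Z][a-zA-Z0-9-]*)      # attribute name
    (?:
        =
        (?: "([^"]*)"            # quoted value (double quotes only)
          | ([^\s"]+)            # unquoted value
        )
    )?
''', re.VERBOSE)


def lex_start_tag(s):
    """Return (tag_name, attrs) for a start tag like <a href="x">."""
    m = TAG_RE.fullmatch(s)
    if m is None:
        raise ValueError('not a start tag: %r' % s)
    name, attr_text = m.groups()

    attrs = {}
    for a in ATTR_RE.finditer(attr_text):
        attr_name, quoted, unquoted = a.groups()
        if quoted is not None:
            # Layer 3: decode character codes inside the quoted value.
            attrs[attr_name] = html.unescape(quoted)
        elif unquoted is not None:
            attrs[attr_name] = unquoted
        else:
            attrs[attr_name] = ''  # attribute with no value
    return name, attrs


print(lex_start_tag('<a href="x?a=1&amp;b=2" class=left hidden>'))
```

Because each layer only looks at a substring found by the layer above, a caller that only needs tag boundaries never pays for attribute or value lexing.
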
## What Do You Use This for?

- Stripping comments
- Adding TOC
- Syntax highlighting code
- Adding link shortcuts
- ul-table

TODO:

- DOM API on top of it
  - node.elementsByTag('p')
  - node.elementsByClassName('left')
  - node.elementByID('foo')
  - innerHTML() outerHTML()
  - tag attrs
  - low level:
    - outerLeft, outerRight, innerLeft, innerRight
- CSS Selectors - `querySelectorAll()`
- sed-like model

## Algorithms

### Emitting HTM8 as HTML5

Just emit it! This always works, by design.

### Parsing XML

- Set `NO_SPECIAL_TAGS`


Conflicts between HTML5 and XML:

- In XML, `<source>` is like any tag, and must be closed
- In HTML, `<source>` is a VOID tag, and must NOT be closed

- In XML, `<script>` and `<style>` don't have special treatment
- In HTML, they do

- The header is different - `<!DOCTYPE html>` vs. `<?xml version= ... ?>`

- HTML: `<a empty= missing>` is two attributes
- Right now we don't handle `<a empty = "missing">` as a single attribute
  - That is valid XML, so should we handle it?

### Converting to XML?

- Always quote all attributes
- Always escape `>` - are we allowing this in HX8?
- Do something with `<script>` and `<style>`
  - I guess turn them into normal tags, with escaping?
  - Or maybe just disallow them?
- Maybe validate any other declarations, like `<!DOCTYPE foo>`
- Add XML header `<?xml version=>`, remove `<!DOCTYPE html>`

## Related

- [ysh-doc-processing.html](ysh-doc-processing.html)
- [table-object-doc.html](table-object-doc.html)

## FAQ

### What Doesn't This Cover?

- Encodings other than UTF-8. HTM8 is always UTF-8.
- Unicode tag names and attribute names.
  - This is allowed in HTML5 and XML.
  - We leave those out for simpler lexing. Text and attribute values may be Unicode.

- `<a href=">">` - no literal `>` inside quotes
  - HTML5 handles it, but we want to easily scan the "top level" structure of the doc
  - And it doesn't appear to be common in our testdata
  - TODO: we will handle `<a href="&">`

There are 5 kinds of tags:

- Normal HTML tags
- RCDATA for `<title> <textarea>`
- RAWTEXT `<style> <xmp> <iframe>` ?

and we have

- CDATA `<script>`
  - TODO: we need a test case for `</script>` in a string literal?
- Foreign `<math> <svg>` - XML rules

## TODO

- `<svg>` and `<math>` are foreign XML content? Doh
  - So I can just switch to XML mode in that case
  - TODO: we need a test corpus for this!
  - maybe look for Wikipedia content
- Can we also just disallow these? Can you make these into external XML files?

This is one way:

    <object data="math.xml" type="application/mathml+xml"></object>
    <object data="drawing.xml" type="image/svg+xml"></object>

Then we don't need special parsing?