---
in_progress: yes
default_highlighter: oils-sh
---

HTM8 - Efficient HTML with Errors
=================================

- Syntax Errors: It's a Subset
- Efficient
- Easy to Remember
- Easy to Implement
- Runs Efficiently - you don't have to materialize a big DOM tree, which
  causes many allocations

<div id="toc">
</div>

## Basic Structure

### Text Content

Anything except `&` and `<`.

These must be escaped as `&amp;` and `&lt;`.

`>` is allowed, or you can escape it with `&gt;`.
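
The escaping rule above can be sketched in a few lines of Python. This is a minimal illustration rather than code from the Oils implementation, and the function name `escape_text` is hypothetical.

```python
def escape_text(s):
    """Escape a string so it can be used as text content.

    Hypothetical helper: only '&' and '<' need escaping in text.
    """
    # Replace '&' first, so the '&' in '&lt;' is not escaped a second time.
    s = s.replace('&', '&amp;')
    s = s.replace('<', '&lt;')
    # '>' is allowed as-is in text, so it is left alone.
    return s
```

For example, `escape_text('a < b && c > d')` leaves the `>` untouched and escapes the other three special characters.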

### 3 Kinds of Character Code

1. `&amp;` - named
1. `&#999;` - decimal
1. `&#xff;` - hex
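
As a rough sketch, the three kinds of character code can be recognized with one regular expression and decoded as follows. The `NAMED` table is deliberately tiny and is an assumption for illustration (HTML defines many more names), and `decode_char_refs` is a hypothetical name.

```python
import re

# Assumption: a tiny table of named codes, for illustration only.
NAMED = {'amp': '&', 'lt': '<', 'gt': '>', 'quot': '"', 'apos': "'"}

CHAR_REF = re.compile(r'''
    & (?:
        \# [xX] (?P<hex> [0-9a-fA-F]+ )     # hex,     like &#xff;
      | \#      (?P<dec> [0-9]+ )           # decimal, like &#999;
      |         (?P<name> [a-zA-Z][a-zA-Z0-9]* )   # named, like &amp;
    ) ;
''', re.VERBOSE)


def decode_char_refs(s):
    """Replace character codes in s with the characters they denote."""

    def repl(m):
        if m.group('hex') is not None:
            return chr(int(m.group('hex'), 16))
        if m.group('dec') is not None:
            return chr(int(m.group('dec')))
        # Names missing from the table are left exactly as written.
        return NAMED.get(m.group('name'), m.group(0))

    return CHAR_REF.sub(repl, s)
```

This sketch does not validate code points, so an out-of-range number such as `&#99999999;` raises `ValueError` from `chr()`.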

### 3 Kinds of Tag

1. Start
1. End
1. StartEnd
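
To make the three kinds concrete, here is a small Python sketch that classifies a single tag string. It assumes that Start looks like `<p>`, End like `</p>`, and StartEnd like `<br/>`; the pattern and the name `tag_kind` are illustrative assumptions, not the actual HTM8 lexer.

```python
import re

# Rough shape of a tag: '<', optional '/', a name, attribute text that
# contains no '>', optional trailing '/', then '>'.
TAG = re.compile(
    r'<(?P<close>/?)(?P<name>[a-zA-Z][a-zA-Z0-9-]*)'
    r'(?P<rest>[^>]*?)(?P<self>/?)>')


def tag_kind(tag):
    """Return 'Start', 'End', or 'StartEnd', or None if tag doesn't match."""
    m = TAG.fullmatch(tag)
    if m is None:
        return None
    if m.group('close'):
        return 'End'
    if m.group('self'):
        return 'StartEnd'
    return 'Start'
```

One simplification to be aware of: an unquoted attribute value that ends in `/`, as in `<a href=foo/>`, is read here as a StartEnd tag.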

### 2 Kinds of Attribute

1. Unquoted
1. Quoted

### 2 Kinds of Comment

1. `<!-- -->`
1. `<? ?>` (XML processing instruction)
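
Because both kinds of comment have fixed delimiters, removing them can be sketched with a single pattern. This is a simplified illustration with a hypothetical function name; it does not account for `<script>` or `<style>` bodies, where the same character sequences could appear as ordinary content.

```python
import re

# Non-greedy matches, so each comment ends at its first closing delimiter.
# re.DOTALL lets a comment span multiple lines.
COMMENT = re.compile(r'<!--.*?-->|<\?.*?\?>', re.DOTALL)


def strip_comments(doc):
    """Remove <!-- --> comments and <? ?> processing instructions."""
    return COMMENT.sub('', doc)
```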


## Special Rules, From HTML

### 2 Tags Cause Special Lexing

- `<script> <style>`

Note: we still have CDATA for compatibility.


### 16 VOID Tags Change Parsing

- `<source> ...`
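
The effect of VOID tags on parsing can be illustrated with a tag-balance check: a VOID start tag is never pushed on the stack, because it has no matching end tag. The 16 names below are an assumption (the HTML5 void elements, including obsolete ones such as `command` and `keygen`); this document only names `<source>`. The function name and the `(kind, name)` input format are also hypothetical.

```python
# Assumption: 16 void element names from HTML5, including obsolete ones.
VOID_TAGS = frozenset([
    'area', 'base', 'br', 'col', 'command', 'embed', 'hr', 'img',
    'input', 'keygen', 'link', 'meta', 'param', 'source', 'track', 'wbr',
])


def is_balanced(tags):
    """Check that tags are properly nested.

    tags is an iterable of (kind, name) pairs, where kind is 'Start',
    'End', or 'StartEnd'.
    """
    stack = []
    for kind, name in tags:
        if kind == 'StartEnd':
            continue  # opens and closes itself
        if kind == 'Start':
            if name not in VOID_TAGS:
                stack.append(name)  # VOID tags never need an end tag
            continue
        # kind == 'End': it must close the most recently opened tag
        if not stack or stack[-1] != name:
            return False
        stack.pop()
    return not stack
```

With this rule, `<p><source></p>` is balanced, while `<p><b></p>` is not.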

### Bonus: XML Mode

- Get rid of the 2 special lexing tags, and 16 VOID tags

Then you can query HTML


## Under the Hood

### 3 Layers of Lexing

1. Tag
1. Attributes within a Tag
1. Quoted Value for Attributes
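
The layering can be sketched as three nested regular-expression passes, each one run only on the text matched by the layer above it. The patterns and names here are simplified assumptions for illustration and are not the actual HTM8 lexer. The tag pattern relies on the rule that a literal `>` does not appear inside quotes.

```python
import re

# Layer 1: a start tag. Returns the name and the raw attribute text.
TAG = re.compile(r'<([a-zA-Z][a-zA-Z0-9-]*)([^>]*)>')

# Layer 2: one attribute inside the tag: double-quoted, single-quoted,
# or unquoted.
ATTR = re.compile(
    r'''([a-zA-Z][a-zA-Z0-9_-]*)\s*=\s*(?:"([^">]*)"|'([^'>]*)'|([^\s"'<>=`]+))''')

# Layer 3: pieces of a quoted value: a character code, a run of plain
# text, or a stray '&'.
VALUE_PART = re.compile(r'&[a-zA-Z0-9#]+;|[^&]+|&')


def lex_start_tags(doc):
    """Yield (tag_name, [(attr_name, value_parts), ...]) per start tag."""
    for tag in TAG.finditer(doc):                  # layer 1
        attrs = []
        for a in ATTR.finditer(tag.group(2)):      # layer 2
            name, dq, sq, unquoted = a.groups()
            if unquoted is not None:
                parts = [unquoted]
            else:
                quoted = dq if dq is not None else sq
                parts = VALUE_PART.findall(quoted)  # layer 3
            attrs.append((name, parts))
        yield tag.group(1), attrs
```

Each layer only looks at a small slice of the input, so no tree has to be built to answer questions like "what are the attributes of this tag?".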

## What Do You Use This for?

- Stripping comments
- Adding TOC
- Syntax highlighting code
- Adding link shortcuts
- ul-table

TODO:

- DOM API on top of it
  - node.elementsByTag('p')
  - node.elementsByClassName('left')
  - node.elementByID('foo')
  - innerHTML() outerHTML()
  - tag attrs
  - low level:
    - outerLeft, outerRight, innerLeft, innerRight
- CSS Selectors - `querySelectorAll()`
- sed-like model

## Algorithms

### Emitting HTM8 as HTML5

Just emit it! This always works, by design.

### Parsing XML

- Set `NO_SPECIAL_TAGS`

### Converting to XML?

- Always quote all attributes
- Always quote `>` - are we allowing this in HX8?
- Do something with `<script>` and `<style>`
  - I guess turn them into normal tags, with escaping?
  - Or maybe just disallow them?
- Maybe validate any other declarations, like `<!DOCTYPE foo>`
- Add XML header `<?xml version=>`, remove `<!DOCTYPE html>`
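
The first item in the list above, quoting all attributes, could look roughly like this in Python. It is only a sketch of that one step, with a hypothetical function name and a simplified attribute pattern. It leaves already-quoted attributes untouched, and it does not handle attributes without a value or unquoted values that end in `/`.

```python
import re

# Either an already-quoted value (group 2) or an unquoted value (group 3).
ATTR = re.compile(
    r'''([a-zA-Z][a-zA-Z0-9_-]*)=(?:("[^"]*"|'[^']*')|([^\s"'<>=`]+))''')


def quote_attrs(tag):
    """Rewrite unquoted attribute values in a tag as double-quoted."""

    def repl(m):
        name, quoted, unquoted = m.groups()
        if quoted is not None:
            return m.group(0)  # already quoted: keep as written
        # Unquoted values cannot contain '"', so no escaping is needed.
        return '%s="%s"' % (name, unquoted)

    return ATTR.sub(repl, tag)
```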

## Related

- [ysh-doc-processing.html](ysh-doc-processing.html)
- [table-object-doc.html](table-object-doc.html)

## FAQ

### What Doesn't This Cover?

- Encodings other than UTF-8. HTM8 is always UTF-8.
- Unicode Tag names and attribute names.
  - This is allowed in HTML5 and XML.
  - We leave those out for simpler lexing. Text and attribute values may be unicode.

- `<a href=">">` - no literal `>` inside quotes
  - HTML5 handles it, but we want to easily scan the "top level" structure of the doc
  - And it doesn't appear to be common in our testdata
  - TODO: we will handle `<a href="&">`

There are 5 kinds of tags:

- Normal HTML tags
- RCDATA for `<title> <textarea>`
- RAWTEXT `<style> <xmp> <iframe>` ?

and we have

- CDATA `<script>`
  - TODO: we need a test case for `</script>` in a string literal?
- Foreign `<math> <svg>` - XML rules

## TODO

- `<svg>` and `<math>` are foreign XML content? Doh
  - So I can just switch to XML mode in that case
  - TODO: we need a test corpus for this!
    - maybe look for Wikipedia content
  - Can we also just disallow these? Can you make these into external XML files?

This is one way:

    <object data="math.xml" type="application/mathml+xml"></object>
    <object data="drawing.xml" type="image/svg+xml"></object>

Then we don't need special parsing?