---
in_progress: yes
default_highlighter: oils-sh
---

HTM8 - Efficient HTML with Errors
=================================

- Syntax Errors: It's a Subset
- Efficient
- Easy to Remember
- Easy to Implement
- Runs Efficiently - you don't have to materialize a big DOM tree, which
  causes many allocations

<div id="toc">
</div>

## Basic Structure

### Text Content

Anything except `&` and `<`.

These must be written as `&amp;` and `&lt;`.

`>` is allowed, or you can escape it with `&gt;`.

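A minimal sketch of the text escaping rule above, in Python. The name `escape_text` and the `escape_gt` flag are made up for illustration; they are not an HTM8 API.

```python
def escape_text(s, escape_gt=False):
    """Escape a string so it can be used as HTM8 text content."""
    # '&' must be replaced first, or the '&' in '&lt;' would be escaped again.
    s = s.replace('&', '&amp;')
    s = s.replace('<', '&lt;')
    if escape_gt:
        # Optional: a bare '>' is allowed in text.
        s = s.replace('>', '&gt;')
    return s
```

With the default, `>` passes through unchanged; pass `escape_gt=True` if you want `&gt;` as well.
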
### 3 Kinds of Character Code

1. `&amp;` - named
1. `&#999;` - decimal
1. `&#xff;` - hex

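One way to tell the three kinds apart is a single regex with one alternative per kind. This is a sketch, not the actual HTM8 lexer: the names `CHAR_REF_RE`, `classify` and `decode_numeric` are hypothetical, and named codes are only recognized here, not looked up in a table.

```python
import re

# One named group per kind of character code.
CHAR_REF_RE = re.compile(r'''
    & (?: (?P<named> [a-zA-Z][a-zA-Z0-9]* )
        | \# (?P<decimal> [0-9]+ )
        | \# [xX] (?P<hex> [0-9a-fA-F]+ )
      ) ;
''', re.VERBOSE)


def classify(ref):
    """Return 'named', 'decimal' or 'hex', or None if ref is not a character code."""
    m = CHAR_REF_RE.fullmatch(ref)
    return m.lastgroup if m else None


def decode_numeric(ref):
    """Decode a decimal or hex character code to a one-character string."""
    m = CHAR_REF_RE.fullmatch(ref)
    if m is None:
        raise ValueError('not a character code: %r' % ref)
    if m.group('decimal'):
        return chr(int(m.group('decimal')))
    if m.group('hex'):
        return chr(int(m.group('hex'), 16))
    raise ValueError('named codes need a lookup table: %r' % ref)
```
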
### 3 Kinds of Tag

1. Start
1. End
1. StartEnd

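The three kinds can be distinguished by where the slash sits. A rough Python sketch, assuming ASCII tag names and double-quoted attribute values; `TAG_RE` and `classify_tags` are illustrative names, and this regex does not know about comments or `<script>` / `<style>` contents.

```python
import re

TAG_RE = re.compile(r'''
    < (?P<end> / )?                     # leading slash marks an End tag
    (?P<name> [a-zA-Z][a-zA-Z0-9-]* )   # ASCII-only tag name
    (?: [^>"] | "[^"]*" )*?             # attributes; a quoted value may contain >
    (?P<selfclose> / )? >               # trailing slash marks a StartEnd tag
''', re.VERBOSE)


def classify_tags(doc):
    """Return a list of (kind, tag_name) pairs, in document order."""
    result = []
    for m in TAG_RE.finditer(doc):
        if m.group('end'):
            kind = 'End'
        elif m.group('selfclose'):
            kind = 'StartEnd'
        else:
            kind = 'Start'
        result.append((kind, m.group('name')))
    return result
```

For example, `classify_tags('<p>hi<br/></p>')` yields a Start `p`, a StartEnd `br`, and an End `p`.
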
### 2 Kinds of Attribute

1. Unquoted
1. Quoted

### 2 Kinds of Comment

1. `<!-- -->`
1. `<? ?>` (XML processing instruction)

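Both kinds of comment have fixed delimiters, so stripping them can be sketched with one non-greedy regex. This is a simplification: `strip_comments` is a hypothetical helper, and it does not skip `<script>` or `<style>` contents, where `<!--` may be ordinary text.

```python
import re

# Non-greedy, so each comment ends at its first closing delimiter.
COMMENT_RE = re.compile(r'<!--.*?-->|<\?.*?\?>', re.DOTALL)


def strip_comments(doc):
    """Remove <!-- --> comments and <? ?> processing instructions."""
    return COMMENT_RE.sub('', doc)
```
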
## Special Rules, From HTML

### 2 Tags Cause Special Lexing

- `<script> <style>`

Note: we still have CDATA for compatibility.

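"Special lexing" here means the contents of these two tags are not lexed as tags or character codes; the lexer just scans for the matching end tag. A sketch under that assumption (`raw_text_spans` is an illustrative name, and treating a missing end tag as an error is also an assumption):

```python
import re

SPECIAL_START_RE = re.compile(r'<(script|style)\b[^>]*>', re.IGNORECASE)


def raw_text_spans(doc):
    """Return (tag_name, raw_contents) for each <script> or <style> element."""
    spans = []
    pos = 0
    while True:
        start = SPECIAL_START_RE.search(doc, pos)
        if start is None:
            break
        name = start.group(1).lower()
        # Everything up to the first matching end tag is raw text.
        end_re = re.compile('</' + name + r'\s*>', re.IGNORECASE)
        end = end_re.search(doc, start.end())
        if end is None:
            raise ValueError('unclosed <%s>' % name)
        spans.append((name, doc[start.end():end.start()]))
        pos = end.end()
    return spans
```

Note that this sketch stops at the first `</script>`, even if it appears inside a JavaScript string literal.
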
### 16 VOID Tags Change Parsing

- `<source> ...`

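VOID tags change parsing because they never get an end tag, so a parser must not wait for one. A sketch of a tag balance check that skips them. The set below is a small illustrative subset of well-known HTML void elements, not the list of 16 counted above, and `check_balanced` is a hypothetical helper.

```python
import re

# Illustrative subset only (assumption), not the full list of 16.
VOID_TAGS = {'br', 'hr', 'img', 'source'}

# Simplified tag regex: does not handle '>' inside quoted attribute values.
TAG_RE = re.compile(r'<(/)?([a-zA-Z][a-zA-Z0-9-]*)[^>]*?(/)?>')


def check_balanced(doc):
    """Return True if tags are balanced; raise ValueError otherwise."""
    stack = []
    for m in TAG_RE.finditer(doc):
        is_end, name, self_closed = m.group(1), m.group(2).lower(), m.group(3)
        if is_end:
            if not stack or stack[-1] != name:
                raise ValueError('unexpected </%s>' % name)
            stack.pop()
        elif self_closed or name in VOID_TAGS:
            continue  # StartEnd and VOID tags never open an element
        else:
            stack.append(name)
    if stack:
        raise ValueError('unclosed <%s>' % stack[-1])
    return True
```
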
### Bonus: XML Mode

- Get rid of the 2 special lexing tags, and 16 VOID tags

Then you can query HTML

## Under the Hood

### 3 Layers of Lexing

1. Tag
1. Attributes within a Tag
1. Quoted Value for Attributes

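A sketch of layers 2 and 3, assuming layer 1 has already found a start tag. `lex_attrs` is a hypothetical helper, and the stdlib `html.unescape` stands in for the quoted value layer; it is more permissive than the character codes described above.

```python
import html
import re

# Layer 2: name=value pairs inside a tag, with a quoted or unquoted value.
ATTR_RE = re.compile(
    r'''([a-zA-Z][a-zA-Z0-9:_-]*)\s*=\s*(?:"([^"]*)"|([^\s"'=<>`]+))''')


def lex_attrs(tag):
    """Return a list of (name, value) pairs from a start tag string."""
    attrs = []
    for m in ATTR_RE.finditer(tag):
        name, quoted, unquoted = m.groups()
        if quoted is not None:
            # Layer 3: decode character codes inside the quoted value.
            value = html.unescape(quoted)
        else:
            value = unquoted
        attrs.append((name, value))
    return attrs
```
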
## What Do You Use This for?

- Stripping comments
- Adding TOC
- Syntax highlighting code
- Adding link shortcuts
- ul-table

TODO:

- DOM API on top of it
  - node.elementsByTag('p')
  - node.elementsByClassName('left')
  - node.elementByID('foo')
  - innerHTML() outerHTML()
  - tag attrs
  - low level:
    - outerLeft, outerRight, innerLeft, innerRight
- CSS Selectors - `querySelectorAll()`
- sed-like model

## Algorithms

### Emitting HTM8 as HTML5

Just emit it!  This always works, by design.

### Parsing XML

- Set `NO_SPECIAL_TAGS`

### Converting to XML?

- Always quote all attributes
- Always quote `>` - are we allowing this in HX8?
- Do something with `<script>` and `<style>`
  - I guess turn them into normal tags, with escaping?
  - Or maybe just disallow them?
- Maybe validate any other declarations, like `<!DOCTYPE foo>`
- Add XML header `<?xml version=>`, remove `<!DOCTYPE html>`

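The first bullet, quoting all attributes, could look roughly like this. It's only a sketch: `quote_attrs` is a hypothetical helper that works on one start tag, leaves already-quoted values alone, and does not re-escape the value.

```python
import re

# Matches name="quoted" or name=unquoted; quoted values are matched so that
# text inside them is skipped rather than rewritten.
ATTR_RE = re.compile(
    r'''([a-zA-Z][a-zA-Z0-9:_-]*)=(?:("[^"]*")|([^\s"'=<>`]+))''')


def quote_attrs(start_tag):
    """Wrap every unquoted attribute value in double quotes."""
    def repl(m):
        name, quoted, unquoted = m.groups()
        if quoted is not None:
            return m.group(0)  # already quoted, keep as is
        return '%s="%s"' % (name, unquoted)
    return ATTR_RE.sub(repl, start_tag)
```
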
## Related

- [ysh-doc-processing.html](ysh-doc-processing.html)
- [table-object-doc.html](table-object-doc.html)

## FAQ

### What Doesn't This Cover?

- single-quoted attributes?
  - We should probably add those, it shouldn't be hard?

- Encodings other than UTF-8.  HTM8 is always UTF-8.
- Unicode Tag names and attribute names.
  - This is allowed in HTML5 and XML.
  - We leave those out for simpler lexing.  Text and attribute values may be Unicode.

There are 5 kinds of tags:

- Normal HTML tags
- RCDATA for `<title> <textarea>`
- RAWTEXT `<style> <xmp> <iframe>` ?

and we have

- CDATA `<script>`
  - TODO: we need a test case for `</script>` in a string literal?
- Foreign `<math> <svg>` - XML rules

## TODO

- `<svg>` and `<math>` are foreign XML content?  Doh
  - So I can just switch to XML mode in that case
  - TODO: we need a test corpus for this!
  - maybe look for Wikipedia content
- can we also just disallow these?  Can you make these into external XML files?

This is one way:

    <object data="math.xml" type="application/mathml+xml"></object>
    <object data="drawing.xml" type="image/svg+xml"></object>

Then we don't need special parsing?