1 | ---
|
2 | in_progress: yes
|
3 | default_highlighter: oils-sh
|
4 | ---
|
5 |
|
6 | HTM8 - An Easy Subset of HTML5, With Some Errors
|
7 | =================================
|
8 |
|
9 | HTM8 is a data language, which is part of J8 Notation:
|
10 |
|
11 | - It's a subset of HTML5, so there are **Syntax Errors**
|
12 | - It's "for humans"
|
13 | - `<li><li>` example
|
14 | - It's Easy
|
15 | - Easy to Implement - ~700 lines of regular languages and Python
|
16 | - And thus Easy to Remember, for users
|
17 | - Runs Efficiently - you don't have to materialize a big DOM tree, which
|
18 | causes many allocations
|
19 | - Convertible to XML?
|
20 | - without allocations, with a `sed`-like transformation!
|
21 | - low level lexing and matching
|
22 | - Ambitious
|
23 | - zero-alloc whitelist-based HTML filter for user content
|
24 | - zero-alloc browser and CSS-style content queries
|
25 |
|
26 | Currently, all of the Oils docs are parsed and processed with it.
|
27 |
|
28 | We would like to "lift it up" into an API for YSH users.
|
29 |
|
30 | <!--
|
31 |
|
32 | TODO: 99.9% of HTML documents from CommonCrawl should be convertible to XML,
|
33 | and then validated by an XML parser
|
34 |
|
35 | - lxml - this is supposed to be high quality
|
36 |
|
37 | - Python stdlib uses expat - https://libexpat.github.io/
|
38 |
|
39 | - Gah it's this huge thing, 8K lines: https://github.com/libexpat/libexpat/blob/master/expat/lib/xmlparse.c
|
40 | - do they have the billion laughs bug?
|
41 |
|
42 | -->
|
43 |
|
44 | <div id="toc">
|
45 | </div>
|
46 |
|
47 | ## Structure of an HTM8 Doc
|
48 |
|
49 | ### Tags - Open, Close, Self-Closing
|
50 |
|
51 | 1. Open `<a>`
|
52 | 1. Close `</a>`
|
53 | 1. StartEnd `<img/>`
|
54 |
|
55 | HTML5 doesn't have the notion of self-closing tags. Instead, it silently ignores
|
56 | the trailing `/`.
|
57 |
|
58 | We are bringing it back for humans, because we think it's too hard for people to
|
59 | remember the 16 void elements.
|
60 |
|
61 | And unbalanced tags cause visual bugs that are hard to debug. It would
|
62 | be better to get an error **earlier**.
|
63 |
|
64 | ### Attributes - Quotes optional
|
65 |
|
66 | 5 closely related Syntaxes
|
67 |
|
68 | 1. Missing `<a missing>`
|
69 | 1. Empty `<a empty=>`
|
70 | 1. Unquoted `<a href=foo>`
|
71 | 1. Double Quoted `<a href="foo">`
|
72 | 1. Single Quoted `<a href='foo'>`
|
73 |
|
74 | Note: `<a href=/>` is disallowed because it's ambiguous. Use `<a href="/">` or
|
75 | `<a href=/ >` or `<a href= />`.
|
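The five attribute syntaxes above can be told apart with a single regular expression. The following is a minimal sketch, not the actual HTM8 `AttrLexer`: the names `ATTR_RE` and `classify_attrs` are hypothetical, and the character classes for names and unquoted values are assumptions.

```python
import re

# Hypothetical pattern for ONE attribute. The '=' must directly follow the
# name, and the value must directly follow the '=', so `empty= missing`
# is two attributes.
ATTR_RE = re.compile(r'''
    (?P<name> [a-zA-Z][a-zA-Z0-9_\-]* )
    (?:
        (?P<eq> = )
        (?:
            " (?P<dq> [^"]* ) "
          | ' (?P<sq> [^']* ) '
          | (?P<unq> [^\s"'=<>`]+ )
        )?
    )?
''', re.VERBOSE)


def classify_attrs(attrs_text):
    """Return (name, kind, value) for each attribute in the text of a tag."""
    result = []
    for m in ATTR_RE.finditer(attrs_text):
        name = m.group('name')
        if m.group('dq') is not None:
            result.append((name, 'DoubleQuoted', m.group('dq')))
        elif m.group('sq') is not None:
            result.append((name, 'SingleQuoted', m.group('sq')))
        elif m.group('unq') is not None:
            result.append((name, 'Unquoted', m.group('unq')))
        elif m.group('eq') is not None:
            result.append((name, 'Empty', ''))
        else:
            result.append((name, 'Missing', None))
    return result


attrs = classify_attrs('missing empty= href=foo title="a b" alt=\'c\'')
```

This sketch silently skips text it does not recognize and does not detect the ambiguous `<a href=/>` case. A real lexer would report both as errors.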
76 |
|
77 | ### Text - Regular or CDATA
|
78 |
|
79 | #### Regular Text
|
80 |
|
81 | - Any UTF-8 text.
|
82 | - Generally, `& < > " '` should be escaped as `&amp; &lt; &gt; &quot; &apos;`.
|
83 |
|
84 | But we are lenient and allow raw `>` between tags:
|
85 |
|
86 |     <p> foo > bar </p>
|
87 |
|
88 | and raw `<` inside tags:
|
89 |
|
90 |     <span foo="<" > foo </span>
|
91 |
|
92 | #### CDATA
|
93 |
|
94 | Like HTML5, we support explicit `<![CDATA[`, even though it's implicit in the
|
95 | tags.
|
96 |
|
97 | ### Escaped Chars - named, decimal, hex
|
98 |
|
99 | 1. `&amp;` - named
|
100 | 1. `&#999;` - decimal
|
101 | 1. `&#xff;` - hex
|
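As an illustration of the three forms (named, decimal, hex), here is a small decoding sketch. It is not the HTM8 implementation: `decode_char_refs` is a hypothetical helper, and the `NAMED` table holds only the five names used in this doc, while HTML5 defines many more.

```python
import re

# Assumption: only these five named references are known.
NAMED = {'amp': '&', 'lt': '<', 'gt': '>', 'quot': '"', 'apos': "'"}

# One alternative per form: hex, decimal, named. All must end with ';'.
CHAR_REF = re.compile(r'&(?:#x([0-9a-fA-F]+)|#([0-9]+)|([a-zA-Z]+));')


def decode_char_refs(s):
    """Replace character references in s with the characters they denote."""
    def repl(m):
        hex_digits, dec_digits, name = m.groups()
        if hex_digits is not None:
            return chr(int(hex_digits, 16))  # raises ValueError if out of range
        if dec_digits is not None:
            return chr(int(dec_digits))      # raises ValueError if out of range
        # Unknown names are left exactly as written
        return NAMED.get(name, m.group(0))
    return CHAR_REF.sub(repl, s)


decoded = decode_char_refs('&amp; &#999; &#xff;')
```

A raw `&` that does not start a reference is passed through unchanged, which matches the lenient treatment of unescaped ampersands described in this doc.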
102 |
|
103 |
|
104 | ### Comments - HTML or XML
|
105 |
|
106 | 1. `<!-- -->`
|
107 | 1. `<? ?>` (XML processing instruction)
|
108 |
|
109 | ### Declarations - HTML or XML
|
110 |
|
111 | - `<!DOCTYPE html>` from HTML5
|
112 | - `<?xml version= ... ?>` from XML - this is a comment / processing instruction
|
113 |
|
114 | ## Special Rules For Specific HTML Tags
|
115 |
|
116 | ### `<script>` and `<style>` are Leaf Tags with Special Lexing
|
117 |
|
118 | - `<script> <style>`
|
119 |
|
120 | Note: we still have CDATA for compatibility.
|
121 |
|
122 | ### 16 VOID Tags Don't Need Close Tags (Special Parsing)
|
123 |
|
124 | - `<source> ...`
|
125 |
|
126 |
|
127 | ## Errors
|
128 |
|
129 | ### Notes on Leniency
|
130 |
|
131 | Angle brackets:
|
132 |
|
133 | - `<a foo="<">` is allowed, but `<a foo=">">` is disallowed
|
134 | - `<p> 4>3 </p>` is allowed, but `<p> 4<3 </p>` is disallowed
|
135 |
|
136 | This makes lexing the top-level structure easier.
|
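To see why these two restrictions help, consider the sketch below. A tag is whatever lies between `<` and the next `>`, and text is any run without `<`, so the top level can be split without looking inside attribute values. This is a simplified illustration and not the real re2c lexer: `TOP_LEVEL` and `lex_top_level` are hypothetical names, and comments, declarations, CDATA, and `<script>` are not handled.

```python
import re

# Assumption: a tag starts with '<', an optional '/', and an ASCII letter,
# then runs to the first '>'. Text is any run of characters without '<'.
TOP_LEVEL = re.compile(r'(?P<tag></?[a-zA-Z][^>]*>)|(?P<text>[^<]+)')


def lex_top_level(doc):
    """Split doc into ('tag', s) and ('text', s) tokens, or raise ValueError."""
    tokens = []
    pos = 0
    while pos < len(doc):
        m = TOP_LEVEL.match(doc, pos)
        if m is None:
            raise ValueError('syntax error at position %d' % pos)
        tokens.append((m.lastgroup, m.group(0)))
        pos = m.end()
    return tokens


tokens_text = lex_top_level('<p> 4>3 </p>')
tokens_attr = lex_top_level('<a foo="<">x</a>')
```

With this pattern, `<p> 4<3 </p>` raises an error at the raw `<`. For `<a foo=">">`, the tag token ends at the first `>`, which leaves an unterminated quote for a second-level attribute lexer to reject.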
137 |
|
138 | - unescaped `&` is allowed, unlike XML
|
139 | - it's very common in `<a href="?foo=42&bar=99">`
|
140 | - It's lexed as BadAmpersand, in case you want to fix it for XML. Although
|
141 | we don't do that for < and > consistently.
|
142 |
|
143 | ### What are some examples of syntax errors?
|
144 |
|
145 | - HTM8 tags must be balanced to convert them to XML
|
146 |
|
147 | - `<script></SCRIPT>` isn't matched
|
148 | - the begin and end tags must match exactly, like `<SCRipt></SCRipt>`
|
149 | - likewise for `<style>`
|
150 |
|
151 | - NUL bytes aren't allowed - currently due to re2c sentinel. Two options:
|
152 | 1. Make it a syntax error - like JSON8
|
153 | 1. we could have a reprocessing pass to convert it to the Unicode replacement
|
154 | char? I think that HTML might mandate that
|
155 | - Encodings other than UTF-8. HTM8 is always UTF-8.
|
156 | - Unicode Tag names and attribute names.
|
157 | - This is allowed in HTML5 and XML.
|
158 | - We leave those out for simpler lexing. Text and attribute values may be unicode.
|
159 |
|
160 | - `<a href=">">` - no literal `>` inside quotes
|
161 | - HTML5 handles it, but we want to easily scan the "top level" structure of the doc
|
162 | - And it doesn't appear to be common in our testdata
|
163 | - TODO: we will handle `<a href="&amp;">`
|
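The balance rule above can be checked with a stack. This is a minimal sketch under several assumptions: `check_balanced` and `TAG` are hypothetical names, `VOID_TAGS` is only an illustrative subset of the void tags, names are compared exactly (case-sensitively) for every tag, and comments and the contents of `<script>` and `<style>` are not skipped.

```python
import re

# Groups: optional leading '/', tag name, optional trailing '/' before '>'
TAG = re.compile(r'<(/?)([a-zA-Z][a-zA-Z0-9\-]*)[^>]*?(/?)>')

# Illustrative subset only; the full list of void tags is not reproduced here.
VOID_TAGS = frozenset(['br', 'img', 'source'])


def check_balanced(doc, void_tags=VOID_TAGS):
    """Raise ValueError if open and close tags in doc do not match exactly."""
    stack = []
    for m in TAG.finditer(doc):
        closing, name, self_closing = m.groups()
        if closing:
            # Exact comparison: <script> is not closed by </SCRIPT>
            if not stack or stack[-1] != name:
                raise ValueError('unexpected </%s>' % name)
            stack.pop()
        elif self_closing or name in void_tags:
            continue  # <img/> and void tags never go on the stack
        else:
            stack.append(name)
    if stack:
        raise ValueError('unclosed <%s>' % stack[-1])


check_balanced('<p><img src=x><b>hi</b><br/></p>')  # balanced, no error
```

Because the stack holds only tag names, the check reports the first mismatch without building a DOM tree.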
164 |
|
165 | HTML notes:
|
166 |
|
167 | There are 5 kinds of tags:
|
168 |
|
169 | - Normal HTML tags
|
170 | - RCDATA for `<title> <textarea>`
|
171 | - RAWTEXT `<style> <xmp> <iframe>` ?
|
172 |
|
173 | and we have
|
174 |
|
175 | - CDATA `<script>`
|
176 | - TODO: we need a test case for `</script>` in a string literal?
|
177 | - Foreign `<math> <svg>` - XML rules
|
178 |
|
179 | ## Under the Hood - Regular Languages, Algebraic Data Types
|
180 |
|
181 | That is, we use exhaustive reasoning.
|
182 |
|
183 | It's meant to be easy to implement.
|
184 |
|
185 | ### 2 Layers of Lexing
|
186 |
|
187 | 1. TagLexer
|
188 | 1. AttrLexer
|
189 |
|
190 | ### 4 Regular Expressions
|
191 |
|
192 | Using re2c as the "choice" primitive.
|
193 |
|
194 | 1. Lexer
|
195 | 1. NAME lexer
|
196 | 1. Begin VALUE lexer
|
197 | 1. Quoted value lexer - for decoding `<a href="&amp;">`
|
198 |
|
199 | ## XML Parsing Mode
|
200 |
|
201 | - Set `NO_SPECIAL_TAGS` - get rid of special cases for `<script>` and `<style>`
|
202 |
|
203 | Conflicts between HTML5 and XML:
|
204 |
|
205 | - In XML, `<source>` is like any tag, and must be closed
|
206 | - In HTML, `<source>` is a VOID tag, and must NOT be closed
|
207 |
|
208 | - In XML, `<script>` and `<style>` don't have special treatment
|
209 | - In HTML, they do
|
210 |
|
211 | - The header is different - `<!DOCTYPE html>` vs. `<?xml version= ... ?>`
|
212 |
|
213 | - HTML: `<a empty= missing>` is two attributes
|
214 | - right now we don't handle `<a empty = "missing">` as a single attribute
|
215 | - that is valid XML, so should we handle it?
|
216 |
|
217 | ## Algorithms
|
218 |
|
219 | ### What Do You Use This for?
|
220 |
|
221 | - Stripping comments
|
222 | - Adding TOC
|
223 | - Syntax highlighting code
|
224 | - Adding link shortcuts
|
225 | - ul-table
|
226 |
|
227 | TODO:
|
228 |
|
229 | - DOM API on top of it
|
230 | - node.elementsByTag('p')
|
231 | - node.elementsByClassName('left')
|
232 | - node.elementByID('foo')
|
233 | - innerHTML() outerHTML()
|
234 | - tag attrs
|
235 | - low level:
|
236 | - outerLeft, outerRight, innerLeft, innerRight
|
237 | - CSS Selectors - `querySelectorAll()`
|
238 | - sed-like model
|
239 |
|
240 |
|
241 | ### List of Algorithms
|
242 |
|
243 | - Lexing/Parsing
|
244 | - lex just the top level
|
245 | - lex both levels
|
246 | - match tags - this is the level for value.Htm8Frag?
|
247 | - sed-like
|
248 | - convert to XML!
|
249 | - sed-like replacement of DOM Tree or element - e.g. Oils TOC
|
250 | - Structured
|
251 | - convert to DOMTree
|
252 | - lazy selection by tag, or attr (id= and class=)
|
253 | - lazy selection by CSS selector expression
|
254 | - untrusted HTML filter, e.g. like StackOverflow / Reddit
|
255 | - this is Safe HTM8
|
256 | - should have a zero alloc way to support this, with good errors?
|
257 | - I think most of them silently strip data
|
258 |
|
259 | ### Emitting HTM8 as HTML5
|
260 |
|
261 | Just emit it! This always works, by design.
|
262 |
|
263 | ### Converting to XML
|
264 |
|
265 | - Add quotes to unquoted attributes
|
266 | - single and double quotes stay the same?
|
267 | - Quote special chars - in text, and inside single- and double-quoted attr values
|
268 | - & BadAmpersand -> `&amp;`
|
269 | - < BadLessThan -> `&lt;`
|
270 | - > BadGreaterThan -> `&gt;`
|
271 | - `<script>` and `<style>`
|
272 | - either add `<![CDATA[`
|
273 | - or simply escape their values with `&amp; &lt;`
|
274 | - what to do about case-insensitive tags?
|
275 | - maybe you can just normalize them
|
276 | - because we do strict matching
|
277 | - Maybe validate any other declarations, like `<!DOCTYPE foo>`
|
278 | - Add XML header `<?xml version=>`, remove `<!DOCTYPE html>`
|
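The "quote special chars" step might look like the sketch below for a piece of text or a quoted attribute value. It is an assumption-laden illustration, not the planned converter: `escape_for_xml` and `BAD_AMP` are hypothetical names, and the code works on an already-extracted string instead of on lexer tokens.

```python
import re

# A "bad" ampersand is one that does NOT begin a named, decimal, or hex
# character reference ending in ';'.
BAD_AMP = re.compile(r'&(?!(?:[a-zA-Z]+|#[0-9]+|#x[0-9a-fA-F]+);)')


def escape_for_xml(text):
    """Escape raw & < > in text, leaving existing character references alone."""
    text = BAD_AMP.sub('&amp;', text)
    return text.replace('<', '&lt;').replace('>', '&gt;')


fixed = escape_for_xml('?foo=42&bar=99')
```

Note that XML predefines only five named entities, so a reference like `&nbsp;` passes through this function unchanged but would still need handling before a strict XML parser accepts the output.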
279 |
|
280 | ## Related
|
281 |
|
282 | - [ysh-doc-processing.html](ysh-doc-processing.html)
|
283 | - [table-object-doc.html](table-object-doc.html)
|
284 |
|
285 |
|
286 | ## Brainstorming / TODO
|
287 |
|
288 | ### Foreign XML with `<svg>` and `<math>` ?
|
289 |
|
290 | `<svg>` and `<math>` are foreign XML content.
|
291 |
|
292 | We might want to support this.
|
293 |
|
294 | - So I can just switch to XML mode in that case
|
295 | - TODO: we need a test corpus for this!
|
296 | - maybe look for wikipedia content
|
297 | - can we also just disallow these? Can you make these into external XML files?
|
298 |
|
299 | This is one way:
|
300 |
|
301 |     <object data="math.xml" type="application/mathml+xml"></object>
|
302 |     <object data="drawing.xml" type="image/svg+xml"></object>
|
303 |
|
304 | Then we don't need special parsing?
|