1 | Lazy, Lossless Lexing Libraries
|
2 | ===============================
|
3 |
|
4 | Right now we're using this Python code to process HTML in Oil docs. Logically,
|
5 | the doc pipeline looks like:
|
6 |
|
7 | (Hand-written Markdown with embedded HTML) <br/>
|
8 | → CommonMark renderer → <br/>
|
9 | (HTML) <br/>
|
10 | → Filter with lazy lexing → <br/>
|
11 | (HTML) <br/>
|
12 | → Filter with lazy lexing → <br/>
|
13 | (HTML) <br/>
|
14 | <br/>
|
15 | *... repeat N times ...* <br/>
|
16 | <br/>
|
17 | (Final HTML) <br/>
|
18 |
|
19 | Eventually it would be nice to expand this API design to more formats and make
|
20 | it available to YSH users.
|
21 |
|
22 | <div id="toc">
|
23 | </div>
|
24 |
|
25 | ## Why?
|
26 |
|
27 | - To report good parse errors with location info
|
28 | - To use like `sed` on HTML
|
29 | - To minimize allocations, i.e. don't construct a DOM, and don't even construct
|
30 | substrings
|
31 | - So we don't "lose the stack", unlike callback-based parsing
|
32 | - We get an iterator of events/spans
|
33 | - A layer to build a "lossless syntax tree" on top of.
|
34 |
|
35 | ## Formats
|
36 |
|
37 | ### HTML
|
38 |
|
39 | HTML has two levels:
|
40 |
|
41 | 1. The `<>` structure, i.e. tags, the DOCTYPE declaration, comments, and processing
|
42 | instructions
|
43 | 2. The `name="value"` attributes inside start tags (and self-closing tags)
|
44 |
|
45 | Note there could be third layer:
|
46 |
|
47 | 3. `"foo=42&bar=99`
|
48 |
|
49 | ### TSV8
|
50 |
|
51 | - This format is **designed** to be read line-by-line (unlike CSV).
|
52 | - You can skip to any column, and optionally decode the field into an Bool,
|
53 | Int, Float, or Str.
|
54 |
|
55 | ### JSON8
|
56 |
|
57 | - We could reuse the lexer from `data_lang/j8.py`, without parsing
|
58 |
|
59 | ## Links
|
60 |
|
61 | - [pulldown-cmark][]. This is called a "pull parser" in reference to the *XML
|
62 | Pull Parsing* API at <http://xmlpull.org>. However I would call this a
|
63 | *lexer* and not a *parser*.
|
64 | - Hm I think this indicates a need for **lossless** lexers as well?
|
65 | https://github.com/Byron/pulldown-cmark-to-cmark/blob/master/src/fmt.rs
|
66 | - It appears to be used in mdbook
|
67 | - [simdjson][]: This pasrer is designed for extreme speed, and the first stage
|
68 | of it "lazy" and uses integer indices. (We're only trying to avoid
|
69 | allocation; we're not avoiding branches or using SIMD! We also aim to
|
70 | transform the underlying data, not just parse it.)
|
71 |
|
72 | [simdjson]: https://branchfree.org/2019/02/25/paper-parsing-gigabytes-of-json-per-second/
|
73 |
|
74 | [pulldown-cmark]: https://github.com/raphlinus/pulldown-cmark
|
75 |
|
76 | ## Design Notes
|
77 |
|
78 | ### Lessons/Claims
|
79 |
|
80 | - You can parse HTML quickly and correctly with regexes! It has a simple
|
81 | syntax that's almost designed for this.
|
82 | - Key point: We're not parsing them **only** with regexes.
|
83 | - The parser is correct in the sense that its behavior on every input is
|
84 | fully-defined. There are no buffer overflows on edge cases -- that's the
|
85 | magic of regexes and the corresponding state machines. However it doesn't
|
86 | recognize every weird document on the web. It recognizes something close
|
87 | to "well-formed XML" (but it's not XHTML.)
|
88 | - Parsing with spans / integer positions is efficient, **composable**, and
|
89 | leads to better **syntax errors**.
|
90 | - Example: spec test format parse errors aren't good. Info is lost.
|
91 | Or ASDL parser? I guess it has some line info.
|
92 | - The API is easier to use than SAX because you're not losing the stack.
|
93 | (Previous transformations used Python's HTMLParser, which is a SAX API.)
|
94 | - It's more efficient than a DOM API. DOM allocates a lot and loses
|
95 | whitespace. (Speed matters when rebuilding the whole site. Each page has
|
96 | multiple stages. It makes the code cleaner to do multiple quick passes, like
|
97 | a compiler.)
|
98 | - In many instances, you can MODIFY the HTML doc correctly without
|
99 | deserializing something. For example, adding `<b>` tags around a line
|
100 | doesn't require unquoting and quoting `>` within them.
|
101 | - Preserving whitespace is useful because I want to use 'diff' to test
|
102 | correctness against the old pipeline.
|
103 | - Python's `pat.match(s, start_pos, end_pos)` is very useful for efficient
|
104 | lexing.
|
105 | - TODO: Convert to re2c to see how it does. Need YYLIMIT.
|
106 | - TODO: Issue of non-greedy matches.
|
107 | - TODO: Issue of unquoting and quoting (escaping).
|
108 | - The triple backtick extension to Markdown (part of CommonMark) is useful.
|
109 | - Although it should generate arbitrary `<div name="val">` instead. This
|
110 | allow natural plugins. You can write `<div>` in Markdown, but it's
|
111 | annoying to manually escape `<` to `>`, e.g. in graphviz or TeX.
|
112 | HTML is analogous to shell. A web site is a directory tree of text files!
|
113 | - It's better than the Snip `-->` syntax, which didn't play well with syntax
|
114 | highlighting.
|
115 | - Composable grammars: Is this the basis for Pulp?
|
116 |
|
117 | ### Open Problems
|
118 |
|
119 | - Python generators (`yield`) make the code more natural, but that's not
|
120 | possible in C or C++. (C++20 has coroutines though.)
|
121 | - Could write a compiler? Would be an excuse to clean up the OPy or mycpp
|
122 | ASTs.
|
123 | - Input handling in C/shell:
|
124 | - Unlike Python's regex engine, libc `regexec()` doesnt have an `end_pos`,
|
125 | requires `NUL` termination.
|
126 | - We're also using `re2c` this way. Can se use `YYLIMIT`?
|
127 | - Simple solution: copy the subrange, or temporarily mutate the buffer (would
|
128 | cause copy-on-write)
|
129 | - Is there a zero-copytehre a
|
130 |
|
131 | ### Comparisons
|
132 |
|
133 | - mdbook (in Rust, using [pulldown-cmark][]). Has Rust plugins.
|
134 | - pandoc. Supports many formats, not just HTML. Has plugins that take an AST
|
135 | in JSON. The AST represents pandoc-flavored Markdown. pandoc differs from
|
136 | Markdown in that it discourages HTML extensions.
|
137 | - ReStructuredText. Sphinx has plugins.
|
138 | - https://www.sphinx-doc.org/en/master/usage/extensions/index.html
|
139 | - XML Starlet (dormant, understands XPath and CSS Starlet)
|
140 | - R's Sweave?
|
141 | - bookdown?
|
142 | - Jupyter notebooks have code and data. Do they have plugins?
|
143 | - Are there tools in node.js?
|
144 | - https://stackoverflow.com/questions/6657216/why-doesnt-node-js-have-a-native-dom
|
145 | - jsdom?
|
146 |
|
147 | ## To Build On Top
|
148 |
|
149 | - CSS selectors
|
150 | - DOM
|