OILS / doc / lazylex.md View on Github | oils.pub

150 lines, 121 significant
1Lazy, Lossless Lexing Libraries
2===============================
3
4Right now we're using this Python code to process HTML in Oil docs. Logically,
5the doc pipeline looks like:
6
7(Hand-written Markdown with embedded HTML) <br/>
8&rarr; CommonMark renderer &rarr; <br/>
9(HTML) <br/>
10&rarr; Filter with lazy lexing &rarr; <br/>
11(HTML) <br/>
12&rarr; Filter with lazy lexing &rarr; <br/>
13(HTML) <br/>
14<br/>
15*... repeat N times ...* <br/>
16<br/>
17(Final HTML) <br/>
18
19Eventually it would be nice to expand this API design to more formats and make
20it available to YSH users.
21
22<div id="toc">
23</div>
24
25## Why?
26
27- To report good parse errors with location info
28- To use like `sed` on HTML
29- To minimize allocations, i.e. don't construct a DOM, and don't even construct
30 substrings
31- So we don't "lose the stack", unlike callback-based parsing
32 - We get an iterator of events/spans
33- A layer to build a "lossless syntax tree" on top of.
34
35## Formats
36
37### HTML
38
39HTML has two levels:
40
411. The `<>` structure, i.e. tags, the DOCTYPE declaration, comments, and processing
42 instructions
432. The `name="value"` attributes inside start tags (and self-closing tags)
44
45Note there could be third layer:
46
473. `"foo=42&amp;bar=99`
48
49### TSV8
50
51- This format is **designed** to be read line-by-line (unlike CSV).
52- You can skip to any column, and optionally decode the field into an Bool,
53 Int, Float, or Str.
54
55### JSON8
56
57- We could reuse the lexer from `data_lang/j8.py`, without parsing
58
59## Links
60
61- [pulldown-cmark][]. This is called a "pull parser" in reference to the *XML
62 Pull Parsing* API at <http://xmlpull.org>. However I would call this a
63 *lexer* and not a *parser*.
64 - Hm I think this indicates a need for **lossless** lexers as well?
65 https://github.com/Byron/pulldown-cmark-to-cmark/blob/master/src/fmt.rs
66 - It appears to be used in mdbook
67- [simdjson][]: This pasrer is designed for extreme speed, and the first stage
68 of it "lazy" and uses integer indices. (We're only trying to avoid
69 allocation; we're not avoiding branches or using SIMD! We also aim to
70 transform the underlying data, not just parse it.)
71
72[simdjson]: https://branchfree.org/2019/02/25/paper-parsing-gigabytes-of-json-per-second/
73
74[pulldown-cmark]: https://github.com/raphlinus/pulldown-cmark
75
76## Design Notes
77
78### Lessons/Claims
79
80- You can parse HTML quickly and correctly with regexes! It has a simple
81 syntax that's almost designed for this.
82 - Key point: We're not parsing them **only** with regexes.
83 - The parser is correct in the sense that its behavior on every input is
84 fully-defined. There are no buffer overflows on edge cases -- that's the
85 magic of regexes and the corresponding state machines. However it doesn't
86 recognize every weird document on the web. It recognizes something close
87 to "well-formed XML" (but it's not XHTML.)
88- Parsing with spans / integer positions is efficient, **composable**, and
89 leads to better **syntax errors**.
90 - Example: spec test format parse errors aren't good. Info is lost.
91 Or ASDL parser? I guess it has some line info.
92- The API is easier to use than SAX because you're not losing the stack.
93 (Previous transformations used Python's HTMLParser, which is a SAX API.)
94- It's more efficient than a DOM API. DOM allocates a lot and loses
95 whitespace. (Speed matters when rebuilding the whole site. Each page has
96 multiple stages. It makes the code cleaner to do multiple quick passes, like
97 a compiler.)
98 - In many instances, you can MODIFY the HTML doc correctly without
99 deserializing something. For example, adding `<b>` tags around a line
100 doesn't require unquoting and quoting `&gt;` within them.
101- Preserving whitespace is useful because I want to use 'diff' to test
102 correctness against the old pipeline.
103- Python's `pat.match(s, start_pos, end_pos)` is very useful for efficient
104 lexing.
105 - TODO: Convert to re2c to see how it does. Need YYLIMIT.
106- TODO: Issue of non-greedy matches.
107- TODO: Issue of unquoting and quoting (escaping).
108- The triple backtick extension to Markdown (part of CommonMark) is useful.
109 - Although it should generate arbitrary `<div name="val">` instead. This
110 allow natural plugins. You can write `<div>` in Markdown, but it's
111 annoying to manually escape `<` to `&gt;`, e.g. in graphviz or TeX.
112 HTML is analogous to shell. A web site is a directory tree of text files!
113 - It's better than the Snip `-->` syntax, which didn't play well with syntax
114 highlighting.
115- Composable grammars: Is this the basis for Pulp?
116
117### Open Problems
118
119- Python generators (`yield`) make the code more natural, but that's not
120 possible in C or C++. (C++20 has coroutines though.)
121 - Could write a compiler? Would be an excuse to clean up the OPy or mycpp
122 ASTs.
123- Input handling in C/shell:
124 - Unlike Python's regex engine, libc `regexec()` doesnt have an `end_pos`,
125 requires `NUL` termination.
126 - We're also using `re2c` this way. Can se use `YYLIMIT`?
127 - Simple solution: copy the subrange, or temporarily mutate the buffer (would
128 cause copy-on-write)
129 - Is there a zero-copytehre a
130
131### Comparisons
132
133- mdbook (in Rust, using [pulldown-cmark][]). Has Rust plugins.
134- pandoc. Supports many formats, not just HTML. Has plugins that take an AST
135 in JSON. The AST represents pandoc-flavored Markdown. pandoc differs from
136 Markdown in that it discourages HTML extensions.
137- ReStructuredText. Sphinx has plugins.
138 - https://www.sphinx-doc.org/en/master/usage/extensions/index.html
139- XML Starlet (dormant, understands XPath and CSS Starlet)
140- R's Sweave?
141 - bookdown?
142- Jupyter notebooks have code and data. Do they have plugins?
143- Are there tools in node.js?
144 - https://stackoverflow.com/questions/6657216/why-doesnt-node-js-have-a-native-dom
145 - jsdom?
146
147## To Build On Top
148
149- CSS selectors
150- DOM