Lazy, Lossless Lexing Libraries
===============================

Right now we're using this Python code to process HTML in Oil docs. Logically,
the doc pipeline looks like:

(Hand-written Markdown with embedded HTML) <br/>
&rarr; CommonMark renderer &rarr; <br/>
(HTML) <br/>
&rarr; Filter with lazy lexing &rarr; <br/>
(HTML) <br/>
&rarr; Filter with lazy lexing &rarr; <br/>
(HTML) <br/>
<br/>
*... repeat N times ...* <br/>
<br/>
(Final HTML) <br/>

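To make the shape of this concrete, here's a minimal sketch of the pass
structure, assuming each filter is a function from HTML string to HTML string.
The pass names are hypothetical stand-ins, not the real Oil doc filters:

```python
# A minimal sketch of the multi-pass pipeline.  Both passes below are
# hypothetical stand-ins, not the real Oil doc filters.

def add_target_blank(html):
    # Naive rewrite: open absolute links in a new tab.
    return html.replace('<a href="http', '<a target="_blank" href="http')

def expand_toc(html):
    return html  # ... would fill in the <div id="toc"> placeholder

def run_pipeline(html, passes):
    for f in passes:  # ... repeat N times ...
        html = f(html)
    return html

# CommonMark rendering happens upstream; we start from its HTML output.
html = '<p>See <a href="https://oils.pub">the site</a></p>'
print(run_pipeline(html, [add_target_blank, expand_toc]))
```
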
Eventually it would be nice to expand this API design to more formats and make
it available to YSH users.

<div id="toc">
</div>

## Why?

- To report good parse errors with location info
- To use like `sed` on HTML
- To minimize allocations, i.e. don't construct a DOM, and don't even construct
  substrings
- So we don't "lose the stack", unlike callback-based parsing
  - We get an iterator of events/spans (sketched below)
- A layer to build a "lossless syntax tree" on top of.
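
For illustration, here's a toy version of the events/spans idea. The names and
patterns are made up, not the actual lexer in this directory; the point is that
the lexer yields `(token_type, end_pos)` pairs, so the caller keeps its own
stack and no substrings are allocated:

```python
import re

TAG = re.compile(r'<[^>]+>')  # toy pattern; real HTML tags need more cases

def lex(s, pos=0):
    """Yield (tok_type, end_pos) spans that cover all of s."""
    n = len(s)
    while pos < n:
        m = TAG.match(s, pos)  # match at pos without copying a substring
        if m:
            yield 'tag', m.end()
            pos = m.end()
        else:
            # Everything up to the next '<' is text.  Searching from pos+1
            # guarantees progress even on a stray, unmatched '<'.
            end = s.find('<', pos + 1)
            if end == -1:
                end = n
            yield 'text', end
            pos = end

# Lossless: the spans reconstruct the input exactly.
doc = 'hi <b>world</b>'
left = 0
parts = []
for tok_type, right in lex(doc):
    parts.append(doc[left:right])
    left = right
assert ''.join(parts) == doc
```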

## Formats

### HTML

HTML has two levels:

1. The `<>` structure, i.e. tags, the DOCTYPE declaration, comments, and
   processing instructions
2. The `name="value"` attributes inside start tags (and self-closing tags)

Note there could be a third layer:

3. Escaped data inside attribute values, e.g. `"foo=42&amp;bar=99"`
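
Here's a sketch of the second level, again with made-up names: given a start
tag's span, a second pattern lexes `name="value"` pairs in place, using the
`pos` and `endpos` arguments of `Pattern.match()` to stay inside the tag
without slicing it out of the document:

```python
import re

# Toy pattern for one name="value" pair.  Real attribute syntax has more
# forms: unquoted and single-quoted values, bare names, etc.
ATTR = re.compile(r'\s*(\w+)="([^"]*)"')

def attrs(s, start, end):
    """Yield (name_span, value_span) pairs inside the tag s[start:end]."""
    pos = start
    while True:
        m = ATTR.match(s, pos, end)  # anchored match within [pos, end)
        if not m:
            return
        yield m.span(1), m.span(2)
        pos = m.end()

doc = '<span class="foo" id="x">'
# Lex only the attribute region, between the tag name and the '>'.
for (ns, ne), (vs, ve) in attrs(doc, len('<span'), len(doc) - 1):
    print(doc[ns:ne], '=', doc[vs:ve])
# class = foo
# id = x
```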

### TSV8

- This format is **designed** to be read line-by-line (unlike CSV).
- You can skip to any column, and optionally decode the field into a Bool,
  Int, Float, or Str.
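
As a sketch of the second point, here's column skipping on a plain
tab-separated line (simplified; not the full TSV8 spec). Advancing with
`str.find()` means only the one decoded field is ever materialized:

```python
def field_span(line, col):
    """Return (start, end) of the col-th tab-separated field."""
    start = 0
    for _ in range(col):
        tab = line.find('\t', start)
        if tab == -1:
            raise IndexError('not enough columns')
        start = tab + 1
    end = line.find('\t', start)
    if end == -1:
        end = len(line)
    return start, end

line = 'alice\t42\t3.14'
start, end = field_span(line, 1)
age = int(line[start:end])  # decode only the field we asked for
assert age == 42
```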

### JSON8

- We could reuse the lexer from `data_lang/j8.py`, without parsing

## Links

- [pulldown-cmark][]. This is called a "pull parser" in reference to the *XML
  Pull Parsing* API at <http://xmlpull.org>. However, I would call this a
  *lexer* and not a *parser*.
  - Hm, I think this indicates a need for **lossless** lexers as well?
    <https://github.com/Byron/pulldown-cmark-to-cmark/blob/master/src/fmt.rs>
  - It appears to be used in mdbook
- [simdjson][]: This parser is designed for extreme speed, and its first stage
  is "lazy" and uses integer indices. (We're only trying to avoid allocation;
  we're not avoiding branches or using SIMD! We also aim to transform the
  underlying data, not just parse it.)

[simdjson]: https://branchfree.org/2019/02/25/paper-parsing-gigabytes-of-json-per-second/

[pulldown-cmark]: https://github.com/raphlinus/pulldown-cmark

## Design Notes

### Lessons/Claims

- You can parse HTML quickly and correctly with regexes! It has a simple
  syntax that's almost designed for this.
  - Key point: We're not parsing it **only** with regexes.
  - The parser is correct in the sense that its behavior on every input is
    fully defined. There are no buffer overflows on edge cases -- that's the
    magic of regexes and the corresponding state machines. However, it doesn't
    recognize every weird document on the web. It recognizes something close
    to "well-formed XML" (but it's not XHTML).
- Parsing with spans / integer positions is efficient, **composable**, and
  leads to better **syntax errors**.
  - Example: the spec test format's parse errors aren't good; info is lost.
    Or the ASDL parser? I guess it has some line info.
- The API is easier to use than SAX because you're not losing the stack.
  (Previous transformations used Python's HTMLParser, which is a SAX API.)
- It's more efficient than a DOM API. DOM allocates a lot and loses
  whitespace. (Speed matters when rebuilding the whole site. Each page has
  multiple stages. It makes the code cleaner to do multiple quick passes, like
  a compiler.)
  - In many instances, you can MODIFY the HTML doc correctly without
    deserializing anything. For example, adding `<b>` tags around a line
    doesn't require unquoting and quoting `&gt;` within them.
- Preserving whitespace is useful because I want to use 'diff' to test
  correctness against the old pipeline.
- Python's `pat.match(s, start_pos, end_pos)` is very useful for efficient
  lexing (sketched after this list).
  - TODO: Convert to re2c to see how it does. Need YYLIMIT.
- TODO: Issue of non-greedy matches.
- TODO: Issue of unquoting and quoting (escaping).
- The triple backtick extension to Markdown (part of CommonMark) is useful.
  - Although it should generate arbitrary `<div name="val">` instead. This
    allows natural plugins. You can write `<div>` in Markdown, but it's
    annoying to manually escape `<` to `&lt;`, e.g. in graphviz or TeX.
    HTML is analogous to shell. A web site is a directory tree of text files!
  - It's better than the Snip `-->` syntax, which didn't play well with syntax
    highlighting.
- Composable grammars: Is this the basis for Pulp?
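
Here is the `pat.match()` idiom from the list above in isolation. The `pos`
argument (and optionally `endpos`) anchors the match *inside* the larger
string, so lexing never copies a substring:

```python
import re

NAME = re.compile(r'[a-zA-Z][a-zA-Z0-9-]*')

s = '<span class=foo>'
pos = 1                  # just past '<'
m = NAME.match(s, pos)   # matches at pos; no s[pos:] copy is made
assert m.group(0) == 'span'
pos = m.end()            # the next token starts where this one ended
assert s[pos] == ' '
```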

### Open Problems

- Python generators (`yield`) make the code more natural, but that's not
  possible in C or C++. (C++20 has coroutines, though.)
  - Could write a compiler? Would be an excuse to clean up the OPy or mycpp
    ASTs.
- Input handling in C/shell:
  - Unlike Python's regex engine, libc `regexec()` doesn't have an `end_pos`;
    it requires `NUL` termination.
  - We're also using `re2c` this way. Can we use `YYLIMIT`?
  - Simple solution: copy the subrange, or temporarily mutate the buffer
    (which would cause copy-on-write)
  - Is there a zero-copy solution?

### Comparisons

- mdbook (in Rust, using [pulldown-cmark][]). Has Rust plugins.
- pandoc. Supports many formats, not just HTML. Has plugins that take an AST
  in JSON. The AST represents pandoc-flavored Markdown. pandoc differs from
  Markdown in that it discourages HTML extensions.
- reStructuredText. Sphinx has plugins.
  - <https://www.sphinx-doc.org/en/master/usage/extensions/index.html>
- XML Starlet (dormant; understands XPath and CSS selectors)
- R's Sweave?
  - bookdown?
- Jupyter notebooks have code and data. Do they have plugins?
- Are there tools in node.js?
  - <https://stackoverflow.com/questions/6657216/why-doesnt-node-js-have-a-native-dom>
  - jsdom?

## To Build On Top

- CSS selectors
- DOM