doc/pretty-printing.md

OILS / doc / pretty-printing.md View on Github | oilshell.org

607 lines, 412 significant

1	---
2	default_highlighter: oils-sh
3	---
4
5	Up to 4 Pretty Printers?
6	========================
7
8	(March 2024)
9
10	Oils parses a lot of text, and it's becoming apparent than we need a
11	print a lot too! The text should be nicely formatted because a shell is a
12	user interface.
13
14	This doc describes 4 possible pretty printers in Oils. Traditional shells
15	don't have appear to have any pretty printing.
16
17	[OSH]: $xref
18	[YSH]: $xref
19
20	<div id="toc">
21	</div>
22
23	## Screenshots
24
25	Let's be concrete first, because there's a lot of brainstorming below.
26
27	[YSH]($xref) prints its JSON-like data structures with the `=` keyword, which
28	takes a Python-like expression on the right:
29
30	![ysh issues.json](https://app.oilshell.org/picdir/resize?name=14qke97__ysh-issues.png&max-width=600)
31
32	Right now, it looks bad on big data structures. It should look something like
33	`nodejs` or `jq`:
34
35	![node.js issues.json](https://app.oilshell.org/picdir/resize?name=13b35jj__nodejs-issues.png&max-width=600)
36
37	![jq issues.json](https://app.oilshell.org/picdir/resize?name=11wsgpm__jq-issues.png&max-width=600)
38
39	We may want to omit the quotes, like `nodejs`. (This syntax isn't meant to be
40	parsed. [JSON8]($xref) may have unquoted dict keys, although it's not
41	essential).
42
43	## Background - `go fmt` style versus PPL style
44
45	To back up a bit, I'm writing this doc organize my thoughts, and to explain
46	problems and requirements to contributors.
47
48	There are at least two schools of thought on pretty printers, which
49	this lobste.rs thread has a good discussion of:
50
51	- Why is Prettier Rock Solid? <https://lobste.rs/s/aevptj/why_is_prettier_rock_solid>
52	- HN comments: <https://news.ycombinator.com/item?id=39437424>. Aside: a top
53	comment shows why we don't want to take responsibility for all formatting
54	decisions in our OSH-YSH printer! Users are opinionated. The JavaScript
55	formatter Prettier is criticized for a bug, and for being slow.
56
57	More comments on a blog post by Justin Pombrio (which I circulated):
58
59	- A Twist on Wadler's Printer
60	<https://lobste.rs/s/1r0aak/twist_on_wadler_s_printer>
61
62	Let's call the two styles the "`go fmt` style" and the "PPL style" (functional
63	pretty printing language).
64
65	I was probably "biased" toward `go fmt`, because the two formatters we actually
66	use in Oils are influenced by it:
67
68	1. `clang-format` for our C++ code.
69	- This is the best formatter I've used. It's fast, and e.g. does a good job
70	on C macros.
71	1. `yapf` for our Python code.
72	- It's intentionally a "clone" for `clang-format`. (It's slow, mostly due
73	to being written in Python, and creating lots of tiny objects!)
74
75	(Details: they have line wrapping algorithms, while `go fmt` doesn't. I'm not
76	calling them "graph search" printers, because I think of line wrapping as a
77	subproblem of pretty printing.)
78
79	<!--
80
81	Related: a bunch of Zulip threads where I was learning about pretty printing:
82
83	- TODO: Link to threads
84
85	(I've written many parsers before, but only one ad hoc pretty printer, which I
86	want to get rid of. Discussed below.)
87
88	-->
89
90	### Why PPL style?
91
92	However, Justin's post helped me understand Wadler's printer, a popular example
93	of the PPL style. This style might have some advantages for Oils:
94
95	- There's no user-provided layout for data structures — either
96	[YSH]($xref) data or [Zephyr ASDL][ASDL] data. So we need to synthesize a
97	layout from scratch.
98	- We have multiple languages to format, and the PPL style separates policy
99	(language rules) and mechanism (line wrapping). So we should try a more
100	principled architecture, hopefully without sacrificing quality.
101	- The two styles may not be as distinct as they seem at first. They may be
102	complementary.
103	- We can probably use a PPL for the expression subproblem of the OSH-YSH
104	shell formatter. The rest of the formatter will have rules that don't
105	have to do with line wrapping (aligning EOL comments like `go fmt` does,
106	etc.)
107	- I think a non-wrapping pretty printer — an "indenter" — can
108	use something similar to the PPL IRs. Notes below.
109	- PPLs could be slower (asymptotically and in practice) than the custom
110	algorithms, but I think that can be solved with a simple rule in practice:
111	1. Compute the total size of the data structure/doc up front.
112	1. If it's small, spend extra time to make it look pretty, by using an
113	expressive and slow PPL. We can be quadratic or worse, perhaps. We
114	might want `node.js`-like columnar layout.
115	1. If it's big, use a simple and fast PPL subset.
116
117	(Does that last idea work?)
118
119	[ASDL]: $xref:zephyr-asdl
120
121
122	### Warning
123
124	Sometimes I pile on too many requirements, which I mentioned in the latest
125	release announcement:
126
127	- [Oils 0.20.0 > Why I'm Working on JSON](https://www.oilshell.org/blog/2024/02/release-0.20.0.html#zulip-why-am-i-working-on-json)
128
129	We should do the simplest things first, and I think the PPL approach will allow
130	that.
131
132	BTW there are many threads on `#data-languages` on [Zulip]($xref) where I'm
133	brainstorming and learning about pretty printing.
134
135	## Four Printers - What and Why?
136
137	Here's a sketch of what I think we need. It goes from concrete →
138	experimental and research-y.
139
140	### Print YSH data types in a JSON-like format
141
142	What is it? This is the `=` keyword shown in the screenshots above. (BTW,
143	Lua uses the same syntax `=` to evaluate expressions.)
144
145	Motivation: We should look as good as `node.js` or `jq`.
146
147	---
148
149	Currently we use our JSON printer with the options `SHOW_NON_DATA \|
150	SHOW_CYCLES`.
151
152	- `SHOW_NOW_DATA` prints non-data objects, like `<Func 0x123>`. This syntax
153	can't be parsed.
154	- `SHOW_CYCLES` prints cycles with `-->`, instead of raising an error, like
155	JSON does.
156
157	Example:
158
159	ysh$ var d = {}
160	ysh$ setvar d.eggex = /dot+/ # Eggex object
161
162	ysh$ = d
163	(Dict 0x7feb87cb4050) {"eggex":<Eggex 0x7feb87dbfd00>}
164
165	ysh$ setvar d.cycle = d
166
167	ysh$ = d
168	(Dict 0x7feb87cb4050) {"eggex":<Eggex 0x7feb87dbfd00>,"cycle":{ --> 0x7feb87cb4050 }}
169
170	We should replace this with a new pretty printer that wraps lines, and has
171	<span style="color: darkcyan; font-weight: bold">COLOR</span>.
172
173	### Replace our ad hoc Zephyr ASDL pretty printer
174
175	What is it? [Zephyr ASDL][ASDL] is the statically typed schema language we
176	use to implement Oils. It's "one level down" from the shell.
177
178	We used it to define the syntax of shell with algebraic data types. We
179	create a [lossless syntax tree]($xref:LST), which is also an IR for shell.
180
181	Motivation: We already wrote an ad hoc pretty printer, and it should be
182	replaced! It tries to fit records on a single line, and if that fails, it uses
183	multiple lines. I think it's slow.
184
185	If we already have a YSH data printer, this printer should "obviously" be
186	unified with it. We should have a nice separation of policy and mechanism.
187
188	---
189
190	Use `osh -n myscript.sh` to see what it does:
191
192	![osh -n](https://app.oilshell.org/picdir/resize?name=1lwb0bf__osh-n.png&max-width=600)
193
194	Notes:
195
196	- The algorithm is in [asdl/format.py]($oils-src), and the "homogeneous IR" is
197	in [asdl/hnode.asdl]($oils-src).
198	- Each generated class definition has a `PrettyTree()` method, which converts
199	the typed `self` or `this` to the homogeneous `hnode` tree.
200	- `AbbreviatedTree()` is a bit like the modular printers discussed in the
201	`lobste.rs` thread. It makes certain common data structures more
202	readable, with type-specific logic. It's in Python only — can that
203	logic also be in C++?
204	- The syntax tree is actually a graph, and I recently added logic to
205	omit duplicate nodes. This is unlike the JSON printer, which prints
206	duplicate nodes, as Python and node.js do.
207	- The slowness hasn't really mattered, because this format isn't exposed to
208	users. It's only for debugging Oils itself. But it's useful, and we may
209	want to expose it.
210	- Also used in `pp asdl (obj)`, another debugging feature.
211	- TODO: Add a simple benchmark? The new printer will probably be faster than
212	the old one.
213	- `osh -n benchmarks/testdata/configure-coreutils` tests a huge shell file
214
215	<!--
216	- current state of `osh -n`
217	- timing of it -- I think this may take awhile. It was never optimized. It
218	produces MANY allocations.
219	- to be honest -- allocations and GC will probably dominate in Oils.
220	-->
221
222	### OSH-YSH Code Formatter
223
224	What is it? A formatter for shell code. I think it's more essential to
225	have a [YSH]($xref) formatter, but an [OSH]($xref) formatter is also possible.
226	They both use the same [lossless syntax tree]($xref:LST).
227
228	Motivation - Formatters make a new language easier to use, and there are
229	many users who don't know shell well.
230
231	For example, I don't know TypeScript well, and I had a good experience with
232	`deno fmt`. It reduced the mental load of adopting a new tool.
233
234	---
235
236	Justin had a nice idea on on `lobste.rs` - we should create coarse tree
237	with these elements:
238
239	- `{ }` affect indentation in [YSH]($xref)
240	- In [OSH]($xref), we should also understand `then elif else fi`, `do done`,
241	etc.
242	- `( )` in [YSH]($xref) changes the lexer from [command mode to expression
243	mode](command-vs-expression-mode.html)
244	- Newlines can't appear in atomic `text()` / chunks
245	- Comments need to be preserved at the end of lines
246	- They may also be aligned on consecutive lines (with heuristics)
247	- Keywords like `while for if` begin blocks of code
248
249	Why? We don't don't want to take responsibility for every formatting decision!
250
251	I actually think the command mode / statement formatter should be
252	non-wrapping, while expressions can wrap. Commands will likely be more
253	common than expressions in most YSH code.
254
255	---
256
257	The formatter will be invoked with `ysh --tool format myfile.ysh`.
258
259	It can also be used with the output of `osh --tool ysh-ify`, which roughly
260	translates [OSH]($xref) to [YSH]($xref) (doesn't preserve semantics). This may
261	help generate test data, since there's plenty of shell code in the wild,
262	but not much [YSH][] code.
263
264	### Experimental: Export the Oils "Syntax Graph" to Users with "NIL8"
265
266	What is it? A more human AND machine-readable format for the syntax tree,
267	which is actually a graph.
268
269	Motivation: The pretty-printed AST could be a parseable format. Allow
270	users to reuse all the hard work we did [parsing
271	shell]($blog-tag:parsing-shell)!
272
273	---
274
275	Printing raw [ASDL][] data is useful, but it could be more readable with custom
276	logic to print the natural layers of the graph. There are 4 logical layers
277	in [frontend/syntax.asdl]($oils-src):
278
279	1. `source_t` describes whether shell code comes from `foo.sh` or `ysh -c 'echo
280	mycode'`
281	2. `SourceLine` represents physical lines of code
282	3. `Token` refers to portions of lines
283	4. The syntax tree of `command_t word_t word_part_t expr_t`. The leaves are
284	tokens.
285
286	And instead of a pretty format meant for humans, we may want to print an
287	s-expression-like format I'm calling "NIL8".
288
289	NIL8 can be parsed. You may want to write tree-shaking code to deploy
290	[YSH][] into containers.
291
292	More on NIL8 below. Again, it's experimental.
293
294	## Implementation Notes
295
296	### Do the printers depend on each other?
297
298	- The [YSH][] printer (1) naturally comes before the [ASDL][] printer (2).
299	- The code formatter (3) is concrete and useful to end users.
300	- The NIL8 printer (4) comes after the [ASDL][] printer (2), but it's experimental.
301	- It depends on a bunch of other work in Oils/YSH.
302
303	Risks:
304
305	- No real risks for (1) and (2)
306	- They're "engineering" — Justin's blog post is very close! It could
307	be ported almost literally to typed Python. It will translate
308	automatically to C++. (And it would be interesting to compare our
309	garbage-collected C++ with Rust's `Rc<T>`)
310	- [ASDL][] involves code generation in both Python and C++. We have a custom
311	build system for this (using Ninja for C++).
312	- The OSH/YSH formatter has non-trivial decisions
313	- End-of-line comments. (Shell doesn't have block comments, which simplifies
314	things.)
315	- Multi-line strings in YSH have a special rule -- the indentation of the
316	closing `'''` is significant
317	- Shell here docs may be tricky
318	- This formatter probably requires the most "elbow grease". This is why I
319	said that the statement formatter should initially be non-wrapping —
320	it reduces the scope of the problem.
321
322	### Code Skeleton
323
324	I added some stubs in the code:
325
326	- [data_lang/pretty.asdl]($oils-src) - How we would express the IR
327	- [data_lang/pretty.py]($oils-src) - YSH conversion.
328	- [data_lang/pretty-benchmark.sh]($oils-src) - Our naive ASDL pretty printer is
329	slow. It can take more than 3 seconds on a big file, vs. ~100ms to parse it.
330	(It does print over 100 MB of text though.)
331
332	To generate Python code from the ASDL schema, run `build/py.sh all`.
333	Otherwise, Oils is a plain Python 2 program, with a few C extensions.
334
335	C++ translation is a separate step, and it's now pretty polished.
336
337	For new contributors:
338
339	- [Contributing]($wiki) on the wiki
340	- [Where Contributors Have Problems]($wiki)
341
342	There is also a stub for the formatter:
343
344	- [tools/fmt.py]($oils-src) - Stub file for the formatter.
345	- Code copied from [tools/ysh_ify.py]($oils-src).
346
347	## Design Questions
348
349	This section has some less concrete thoughts.
350
351	### Non-Wrapping Printers aka "Indenters" - same PPL IR?
352
353	I think the PPL IRs are also useful if you're not line wrapping? You can just
354	fix indentation.
355
356	### Columnar Layouts (spending more time)
357
358	`nodejs` does this in its `console.log()`.
359
360	![python node.js](https://app.oilshell.org/picdir/resize?name=uscdu6__python-nodejs.png&max-width=600)
361
362	Future work? We should get the basic pretty printer working first.
363
364	### Two Levels of Coarse Parsing for YSH? (not at first)
365
366	Making the coarse tree has some similarity to syntax highlighting. I wrote a
367	simple syntax highlighter for 5+ languages called `micro-syntax`, and it should
368	support [YSH]($xref) too.
369
370	Sketch:
371
372	1. First make the really coarse tree, something like: `Comment \| Code \|
373	StringLiteral`
374	2. Then make a less coarse tree for pretty printing:
375	- Lex code into `{} ()`
376	- Categorize comments into `EndLineComment` \| `BeginLineComment`
377
378	Then do a trivial linear pass to fix up indentation. The `{ }` or `do done`
379	tokens determine indentation.
380
381	---
382
383	Though honestly it's probably better to just reuse our elaborate parser at
384	first. I like to "compress" different algorithms together, but it may not be
385	worth it here.
386
387	### NIL8 - Uses cases for both Code and Data
388
389	What is "NIL8"? We don't know if it's a good idea yet, but it may be part of
390	[J8 Notation](j8-notation.html).
391
392	Think:
393
394	- A mash-up of [JSON][] and S-expressions
395	- NIL8 Isn't Lisp
396	- Narrow Intermediate Language
397	- WebAssembly text format
398	- An IR for an imperative language, with a Lisp-y syntax.
399	- An exterior S-expression format
400	- Blog: [Oils is Exterior-First](https://www.oilshell.org/blog/2023/06/ysh-design.html)
401	- I posted POSE (portable s-expressions) on lobste.rs for this reason:
402	<https://lobste.rs/s/lwf4jv/pose_portable_s_expressions_pose_spec> (no
403	comments)
404
405	[JSON]: $xref
406
407	If NIL8 can represent both the [lossless syntax tree]($xref:LST) and a new IR
408	for a [mycpp]($xref) rewrite ("yaks"), that's a good test of the design.
409
410	Note that the AST is a statically typed data structure, which means we may
411	also want to export the [ASDL][] schema as NIL8!
412
413	Links:
414
415	- [#data-languages > NIL8 Infix Rule](https://oilshell.zulipchat.com/#narrow/stream/403584-data-languages/topic/NIL8.20Infix.20Rule)
416	- [Commit describing infix rule](https://github.com/oilshell/oil/commit/b9cecf88838d4c89ce1dbd8f4bcdd8e92e10d442)
417
418	At a high level, we're trying to nudge users toward a small set of syntaxes
419	for shell-like programming, rather than inventing ad hoc syntax every time.
420	String literals are a pain point: people often implement them badly, or not at
421	all.
422
423	## Conclusion
424
425	I think we should have shared infrastructure for 3 printers:
426
427	1. [YSH]($xref) data structures
428	1. [ASDL][] data structures
429	1. [OSH]($xref) and [YSH]($xref) code
430
431	And then there's this idea of "replacing" or rationalizing the [ASDL][] syntax
432	tree with "NIL8":
433
434	- It can be parsed, not just printed.
435	- To parse, you can reuse an existing [JSON]($xref) string lexer. IMO, this
436	is the fiddliest part of parsing.
437	- It can export a graph shape, in natural "layers".
438
439	## Related
440
441	### Docs
442
443	- [Parser Architecture](parser-architecture.html) - describes issues like the
444	"lossless invariant", which is affected by re-parsing.
445	- I recently updated it, and tested the invariant with
446	[test/lossless.sh]($oils-src).
447	- The repo [README]($oils-doc:oils-repo/README.html) has an overview of the
448	code.
449
450	### Fun Computer Science Problems in Oils
451
452	Pretty printing is adjacent to other fun problems in Oils, like GC performance,
453	"boxless" optimization, etc.
454
455	- [#help-wanted > Fun Computer Science Problems](https://oilshell.zulipchat.com/#narrow/stream/417617-help-wanted/topic/Fun.20Computer.20Science.20Problems)
456
457	Things to think about:
458
459	(1) Unified Code Representation for Oils
460
461	We want to pack all these tools into Oils:
462
463	- The interpreter, for execution
464	- prints precise errors, ignores comment tokens
465	- `ysh-ify` - a VERY rough translation of [OSH]($xref) to [YSH]($xref)
466	- doesn't respect semantics, because you really need static types for that
467	- uses "span ID"
468	- Pretty Printing (this doc)
469	- will also use "span ID" to detect comment positions, I think
470	- Syntax highlighting
471	- Interactive highlighting will help users learn the language
472	- It's recursive, because sublanguages are mutually recursive: string
473	literals, commands, expressions
474
475	Related: Retrospective on the Go `ast` package.
476
477	(2) Yaks - mycpp from the bottom up
478
479	NIL8 leads into "Yaks", which is an IR for garbage collected C++.
480
481	- Yaks is of course a big yak shave, but it's concrete because we recently
482	completed the translation with [mycpp]($xref). (mycpp is a crappy
483	program which produces good results!)
484
485	(3) Pretty printing will cause many small allocations.
486
487	I think that naive implementations should be fast enough. If not, any slowness
488	may be due to allocation, not necessarily the pretty printing algorithm itself.
489
490	Some solutions:
491
492	- We want to move towards a tagged pointer runtime, an ambitious
493	transformation of the entire interpreter.
494	- But we'll do it in steps. First step: Small String Optimization.
495	- Yaks goes along with a tagged pointer runtime. Yaks also has a principled
496	IR, which may be represented with many small objects.
497	- GC rooting optimization should give a speed boost
498
499	## Appendix: Older Notes
500
501	This is about ref counting for printing graph-shaped data.
502
503	I no longer think this is as important. I think we should manually
504	construct 4 layers of the graph, as described in the section on formatter (4).
505
506	### Dynamically Typed YSH Data (`value_t`)
507
508	Similar to JSON / JSON8 printing, except we
509
510	1. count references, and then print `...` instead of repeating
511	1. line wrap
512	1. assign colors
513	- for atoms, and possibly for balanced parens, to make it more readable
514
515	#### Step 1: Count References
516
517	This is a global pass that computes a Dict[int, int]
518
519	object ID -> number of times referenced in the graph
520
521	The graph is specified by single root node, e.g. the argument to
522
523	pp value (obj)
524
525	Pass this dict into the second step.
526
527	#### Step 2: Convert To Homogeneous Representation
528
529	value.List -> hnode.Compound with []
530	value.Dict -> hnode.Compound with {}
531
532	null, true, false -> Atom
533	Cycle detected -> Atom, with { --> 4beef2 }
534	[ --> 4beef2 ]
535
536	Repetition:
537
538	{ key: { ... 4beef2 }, key2: { ... 4beef2 }
539
540	Or maybe omit the type, since strings don't have that:
541
542	{ key: ... 4beef2, key2: ... 4beef2 }
543
544	#### hnode Schema
545
546	The schema looks like this now?
547
548	hnode =
549	Atom(str s, color color) - # External objects can use this?
550	\| Compound(hnode* items)
551
552	The length of 'str s' is the input to line wrapping.
553
554	#### Step 3: Figure out what's on each line
555
556	TODO: what's the heuristic here? Is it global?
557
558	Dynamic programming?
559
560	do we insert hnode.Newline() or something?
561
562	### Statically Typed ASDL Data
563
564	Reduce it to the case above.
565
566	#### Step 1 - Ref Counting / Cycle Detection?
567
568	We do this all at once?
569
570	Because to convert to value.Record, you have to do cycle detection anyway.
571
572	And that's similar to ref counting.
573
574	#### Step 2 - ASDL records -> value.Record
575
576	value =
577	...
578	\| Record(str type_name, Dict[str, value_t] fields)
579
580	The special "-" key can be used for JSON:
581
582	{"-": "command.Simple, "name": "hi"}
583
584	Though this loses some information, and it doesn't solve the problem with
585	shared references. We would need Packle for that.
586
587	#### Step 2a: Optional Abbreviation?
588
589	Is this separate? Or part of step 2.
590
591	We need something between value.Record and hnode.Compound
592	to do ABBREVIATION:
593
594	- Abbreviate type name, or omit it
595	- Omit some field names (requires schema to record it)
596	- Change () to <>
597
598	Also need nodes for
599
600	- `...` means already printed
601	- `---` means CANNOT print, because it's a cycle
602	- `@1f23` - ID if already printed, or in a cycle
603
604	#### Steps 3 and 4 - Homogeneous Representation, Line Wrapping
605
606	Identical to the dynamically typed case above.
607