---
in_progress: yes
default_highlighter: oils-sh
---

Streams, Tables and Processes - awk, R, xargs
=============================================

*(July 2024)*

This is a long, "unified/orthogonal" design for:

- Streams: [awk]($xref) delimited lines, regexes
- Tables: like data frames with R's dplyr or Pandas, but with the "exterior"
  TSV8 format
- Processes: xargs -P in parallel

There's also a relation to:

- Trees: `jq`, which will be covered elsewhere.

It's a layered design. That means we need some underlying mechanisms:

- `eval` and positional args `$1 $2 $3`
- `ctx` builtin
- Data languages: TSV8
- Process pool / event loop primitive

It will link to:

- Oils blog posts (old)
- Zulip threads (recent)
- Other related projects (many of them)

<div id="toc">
</div>

## Intro With Code Snippets

Let's introduce this with a text file:

    $ seq 4 | xargs -n 2 | tee test.txt
    1 2
    3 4

xargs does splitting:

    $ echo 'alice bob' | xargs -n 1 -- echo hi | tee test2.txt
    hi alice
    hi bob

Oils:

    # should we use $_ for _word _line _row? $[_.age] instead of $[_row.age]
    $ echo 'alice bob' | each-word { echo "hi $_" } | tee test2.txt
    hi alice
    hi bob

Normally this should be balanced

### Streams - awk

Now let's use awk:

    $ cat test.txt | awk '{ print $2 " " $1 }'
    2 1
    4 3

In YSH:

    $ cat test.txt | chop '$2 $1'
    2 1
    4 3

It's shorter! `chop` is an alias for `split-by (space=true, template='$2 $1')`

With a template, for static parsing:

    $ cat test.txt | chop (^"$2 $1")
    2 1
    4 3

With a block:

    $ cat test.txt | chop { mkdir -v -p $2/$1 }
    mkdir: created directory '2/1'
    mkdir: created directory '4/3'

With no argument, it prints a table:

    $ cat test.txt | chop
    #.tsv8 $1 $2
    1 2
    3 4

    $ cat test.txt | chop (names = :|a b|)
    #.tsv8 a b
    1 2
    3 4

Longer examples with split-by:

    $ cat test.txt | split-by (space=true, template='$2 $1')
    $ cat test.txt | split-by (space=true, template=^"$2 $1")
    $ cat test.txt | split-by (space=true) { mkdir -v -p $2/$1 }
    $ cat test.txt | split-by (space=true)
    $ cat test.txt | split-by (space=true, names= :|a b|)
    $ cat test.txt | split-by (space=true, names= :|a b|) {
        mkdir -v -p $a/$b
      }

With must-match:

    $ var p = /<capture d+> s+ <capture d+>/
    $ cat test.txt | must-match (p, template='$2 $1')
    $ cat test.txt | must-match (p, template=^"$2 $1")
    $ cat test.txt | must-match (p) { mkdir -v -p $2/$1 }
    $ cat test.txt | must-match (p)

With names:

    $ var p = /<capture d+ as a> s+ <capture d+ as b>/
    $ cat test.txt | must-match (p, template='$b $a')
    $ cat test.txt | must-match (p)
    #.tsv8 a b
    1 2
    3 4

    $ cat test.txt | must-match (p) {
        mkdir -v -p $a/$b
      }

Doing it in parallel:

    $ cat test.txt | must-match --max-jobs 4 (p) {
        mkdir -v -p $a/$b
      }
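
Since `must-match --max-jobs` is still a proposal, here is a rough Python sketch of the intended behavior, under some assumptions: the regex stands in for the Eggex pattern `p`, a line that fails to match is treated as an error, and a thread pool stands in for the process pool that `xargs -P` would use. The names `must_match_parallel` and `show_dir` are made up for illustration.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the Eggex pattern p: two numeric fields named a and b
PATTERN = re.compile(r'(?P<a>\d+)\s+(?P<b>\d+)')

def must_match_parallel(lines, action, max_jobs=4):
    """Match every line, then run action(row) on at most max_jobs workers.

    Assumption: a non-matching line raises ValueError ("must" match).
    Results come back in input order.
    """
    rows = []
    for line in lines:
        m = PATTERN.fullmatch(line.strip())
        if m is None:
            raise ValueError('line does not match: %r' % line)
        rows.append(m.groupdict())

    with ThreadPoolExecutor(max_workers=max_jobs) as pool:
        return list(pool.map(action, rows))

def show_dir(row):
    # Side-effect-free stand-in for `mkdir -v -p $a/$b`
    return '%s/%s' % (row['a'], row['b'])

print(must_match_parallel(['1 2', '3 4'], show_dir))  # ['1/2', '3/4']
```

The point is only that the same row-producing step can feed either a serial loop or a bounded pool of workers.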

### Tables - Data frames with dplyr (R)

    $ cat table.txt
    size path
    3 foo.txt
    20 bar.jpg

    $ R
    > t=read.table('table.txt', header=T)
    > t
      size    path
    1    3 foo.txt
    2   20 bar.jpg

### Processes - xargs

We already saw this, because we "compressed" awk and xargs together.

What's not in the streams / awk example above:

- `BEGIN END` - that can be separate
- `when [$1 ~ /d+/] { }`

## Background / References

- Shell, Awk, and Make Should be Combined (2016)
  - this is the Awk part!

- What is a Data Frame? (2018)

- Sketches of YSH Features (June 2023) - can we express things in YSH?
  - Zulip: Oils Layering / Self-hosting

- Language Compositionality Test: J8 Lines
  - This whole thing is a compositionality test

- read --split
  - more feedback from Aidan and Samuel

- jq in jq thread

Old wiki pages:

- <https://github.com/oilshell/oil/wiki/Structured-Data-in-Oil>
  - uxy - closest I think - <https://github.com/sustrik/uxy>
    - relies on to-json and jq for querying
  - miller - I don't like their language - <https://github.com/johnkerl/miller>
  - jc - <https://github.com/kellyjonbrazil/jc>
- nushell
- extremely old thing -

We're doing **all of these**.

## Concrete Use Cases

- benchmarks/* with dplyr
- wedge report
- oilshell.org analytics job uses dplyr and ggplot2

## Intro

### How much code is it?

- I think this is ~1000 lines of Python and ~1000 lines of YSH (not including tests)
  - It should be small

### Thanks

- Samuel - two big hints
  - do it in YSH
  - `table` with the `ctx` builtin
- Aidan
  - `read --split` feedback


## Tools

- awk
  - streams of records - row-wise
- R
  - column-wise operations on tables
- `find . -printf '%s %P\n'` - size and path
  - generate text that looks like a table
- xargs
  - operate on tabular text -- it has a bespoke splitting algorithm
  - Opinionated guide to xargs
  - table in, table out
- jq - "awk for JSON"


## Concepts

- TSV8
  - aligned format SSV8
  - columns have types, and attributes
- Lines
  - raw lines like shell
  - J8 lines (which can represent any filename, any unicode or byte string)
- Tables - can be thought of as:
  - Streams of Rows - shape `[{bytes: 123, path: "foo"}, {}, ...]`
    - this is actually <https://jsonlines.org> , and it fits well with `jq`
  - Columns - shape `{bytes: [], path: []}`
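
To make the two table shapes concrete, here is a small Python sketch that converts between them. The function names are hypothetical, and it assumes every row has the same keys.

```python
def rows_to_columns(rows, names):
    """Row-wise [{bytes: ..., path: ...}, ...] -> column-wise {bytes: [...], path: [...]}."""
    return {name: [row[name] for row in rows] for name in names}

def columns_to_rows(cols):
    """Column-wise dict of equal-length lists -> list of row dicts."""
    names = list(cols)
    return [dict(zip(names, values)) for values in zip(*cols.values())]

rows = [
    {'bytes': 3, 'path': 'foo.txt'},
    {'bytes': 20, 'path': 'bar.jpg'},
]
cols = rows_to_columns(rows, ['bytes', 'path'])
print(cols)                   # {'bytes': [3, 20], 'path': ['foo.txt', 'bar.jpg']}
print(columns_to_rows(cols))  # the original list of rows
```

The row shape suits streaming tools like awk and `jq`, while the column shape suits R-style column-wise operations.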

## Underlying Mechanisms in Oils / Primitives

- blocks `value.Block` - `^()` and `{ }`
- expressions `value.Expr` - `^[]` and 'compute [] where []'

- eval (b, vars={}, positional=[])

- Buffered for loop
  - YSH is now roughly as fast as Awk!
  - `for x in (io.stdin)`

- "magic awk loop"

      with chop {
        for <README.md *.py> {
          echo _line_num _line _filename $1 $2
        }
      }

- positional args $1 $2 $3
  - currently mean "argv stack"
  - or "the captures"
  - this can probably be generalized

- `ctx` builtin
- `value.Place`

TODO:

- split() like Python, not like shell IFS algorithm

- string formatting ${bytes %.2f}
  - ${bytes %.2f M} Megabytes
  - ${bytes %.2f Mi} Mebibytes

  - ${timestamp +'%Y-%m-%d'} and strftime

  - this is for

  - floating point %e %f %g and printf and strftime
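
The TODO items above can be sketched in Python, which already has the pieces. This assumes `M` means 10^6 bytes and `Mi` means 2^20 bytes; `format_bytes` and `format_timestamp` are hypothetical helper names, not proposed YSH APIs.

```python
import time

# Assumed unit table: M is decimal (10**6), Mi is binary (2**20)
UNITS = {'': 1, 'M': 10**6, 'Mi': 2**20}

def format_bytes(n, spec='%.2f', unit=''):
    """Roughly what ${bytes %.2f Mi} might mean: scale, then printf-style format."""
    return spec % (n / UNITS[unit])

def format_timestamp(seconds, spec='%Y-%m-%d'):
    """Roughly what ${timestamp +'%Y-%m-%d'} might mean, in UTC."""
    return time.strftime(spec, time.gmtime(seconds))

print(format_bytes(1048576, unit='M'))   # 1.05
print(format_bytes(1048576, unit='Mi'))  # 1.00
print(format_timestamp(0))               # 1970-01-01

# Python-style split: no argument collapses runs of whitespace,
# while an explicit separator keeps empty fields.
print('a  b'.split())     # ['a', 'b']
print('a,,b'.split(','))  # ['a', '', 'b']
```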

### Process Pool or Event Loop Primitive?

- if you want to display progress, then you might need an event loop
- test framework might display progress

## Matrices - Orthogonal design in these dimensions

- input: lines vs. rows
- output: string (Str, Template) vs. row vs. block execution (also a row)
- execution: serial vs. parallel
- representation: interior vs. exterior !!!
  - Dicts and Lists are interior, but TSV8 is exterior
  - and we have row-wise format, and column-wise format -- this always bugged me
- exterior: human vs. machine readable
  - TSV8 is both human and machine-readable
  - "aligned" #.ssv8 format is also
  - they are one format named TSV8, with different file extensions. This is
    because it doesn't make too much sense to implement SSV8 without TSV8. The
    latter becomes trivial. So we call the whole thing TSV8.

This means we consider all these conversions

- Line -> Line
- Line -> Row
- Row -> Line
- Row -> Row

## Concrete Decisions - Matrix cut off

The design might seem very general, but we did make some hard choices.

- push vs. pull
  - everything is "push" style I think
- buffered vs. unbuffered, everything

- List vs iterators
  - everything is either iterable pipelines, or a List


[OSH]: $xref
[YSH]: $xref


## String World

**THESE ARE ALL THE SAME ALGORITHM**. They just have different names.

- each-line
- each-row
- split-by (/d+/, cols=:|a b c|)
  - chop
- if-match
- must-match
  - todo

should we also have: if-split-by ? In case there aren't enough columns?

They all take:

- string arg ' '
- template arg (^"") - `value.Expr`
- block arg

for the block arg, this applies:

    -j 4
    --max-jobs 4

    --max-jobs $(cached-nproc)
    --max-jobs $[_nproc - 1]
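
One way to see why these commands are "the same algorithm" is a Python sketch in which the only moving parts are the parser and the output mode. Everything here is hypothetical: `each_record`, `split_by_space`, `matcher` and `render` are illustration names, and `strict` is an assumed way to distinguish must-match from if-match.

```python
import re

def split_by_space(line):
    """Parser behind `chop`: whitespace-separated fields, None for a blank line."""
    fields = line.split()
    return fields or None

def matcher(pattern):
    """Parser behind if-match / must-match: capture groups, None on no match."""
    regex = re.compile(pattern)
    def parse(line):
        m = regex.fullmatch(line)
        return list(m.groups()) if m else None
    return parse

def render(template, fields):
    """Expand $1 $2 ... in a string template (1-based, like awk)."""
    return re.sub(r'\$(\d+)', lambda m: fields[int(m.group(1)) - 1], template)

def each_record(lines, parse, template=None, block=None, strict=False):
    """The shared loop: parse each line, then run a block, fill a template, or emit a row."""
    results = []
    for line in lines:
        fields = parse(line.rstrip('\n'))
        if fields is None:
            if strict:  # must-match: a bad line is an error
                raise ValueError('no match: %r' % line)
            continue    # if-match: skip the line
        if block is not None:
            results.append(block(*fields))
        elif template is not None:
            results.append(render(template, fields))
        else:
            results.append(fields)
    return results

lines = ['1 2', '3 4']
print(each_record(lines, split_by_space, template='$2 $1'))  # ['2 1', '4 3']
print(each_record(lines, matcher(r'(\d+)\s+(\d+)'),
                  block=lambda a, b: a + '/' + b))           # ['1/2', '3/4']
```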

### Awk Issues

So we have this

    echo begin
    var d = {}
    cat -- @files | split-by (ifs=IFS) {
      echo $2 $1
      call d->accum($1, $2)
    }
    echo end

But then how do we have conditionals:

    Filter foo { # does this define a proc? Or a data structure

      split-by (ifs=IFS) # is this possible? We register the proc itself?

      config split-by (ifs=IFS) # register it

      BEGIN {
        var d = {}
      }
      END {
        echo $[d.sum]
      }

      when [$1 ~ /d+/] {
        setvar d.sum += $1
      }

    }
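
For comparison, the BEGIN / `when` / END structure can be sketched in Python as a data structure of (condition, action) rules applied to each record. This is only one possible reading of the open question above, and `run_filter`, `is_number` and `add_first` are hypothetical names.

```python
import re

def run_filter(lines, begin, rules, end):
    """Awk-style driver: begin() once, each matching rule per record, end() once."""
    state = begin()
    for line in lines:
        fields = line.split()
        for condition, action in rules:
            if condition(fields):
                action(state, fields)
    return end(state)

def is_number(fields):
    # Corresponds to `when [$1 ~ /d+/]`
    return bool(fields) and re.fullmatch(r'\d+', fields[0]) is not None

def add_first(state, fields):
    # Corresponds to `setvar d.sum += $1`
    state['sum'] += int(fields[0])

total = run_filter(
    ['1 2', 'x y', '3 4'],
    begin=lambda: {'sum': 0},
    rules=[(is_number, add_first)],
    end=lambda state: state['sum'],
)
print(total)  # 4
```

Treating the rules as data is what would let a `Filter` be registered and then run over any stream.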

## Table World

### `table` to construct

Actions:

    table cat
    table align / table tabify
    table header (cols)
    table slice (1, -1) or (-1, -2) etc.

Subcommands

    cols
    types
    attr units

Partial Parsing / Lazy Parsing - TSV8 is designed for this

    # we only decode the columns that are necessary
    cat myfile.tsv8 | table --by-col (&out, cols = :|bytes path|)
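
A minimal Python sketch of the lazy idea, assuming a simplified format that is NOT the TSV8 spec: a plain tab-separated file whose first line holds column names. Cells are split cheaply for every row, but the per-column decoder only runs for the requested columns. `read_columns` and its `decoders` parameter are hypothetical.

```python
def read_columns(lines, wanted, decoders):
    """Return {name: [values]} for the wanted columns only.

    Assumption: the first line is a tab-separated header of column names.
    Unwanted cells are split but never decoded.
    """
    it = iter(lines)
    header = next(it).rstrip('\n').split('\t')
    indices = {name: header.index(name) for name in wanted}
    cols = {name: [] for name in wanted}
    for line in it:
        cells = line.rstrip('\n').split('\t')
        for name, i in indices.items():
            decode = decoders.get(name, str)
            cols[name].append(decode(cells[i]))
    return cols

data = [
    'bytes\tpath\tmtime\n',
    '3\tfoo.txt\t100\n',
    '20\tbar.jpg\t200\n',
]
print(read_columns(data, ['bytes', 'path'], {'bytes': int}))
# {'bytes': [3, 20], 'path': ['foo.txt', 'bar.jpg']}
```

The result is column-wise, which matches the `--by-col` flag in the snippet above.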

## Will writing it in YSH be slow?

- We concentrate on semantics first
- We can rewrite in Python
- Better: users can use **exterior** tools with the same interface
  - in some cases
  - they can write an efficient `sort-tsv8` or `join-tsv8` with novel algorithms
- Most data will be small at first


## Applications

- Shell is shared nothing
- Scaling to infinity on the biggest clouds


## Extra: Tree World?

This is sort of "expanding the scope" of the project, when we want to reduce scope.

But YSH has both tree-shaped JSON, and table-shaped TSV8, and jq is a nice **bridge** between them.

Streams of Trees (jq)

    empty
    this
    this[]
    =>
    select()
    a & b # more than one


## Pie in the Sky

Types of Data Languages:

- flat strings
- JSON8 - tree
- TSV8 - table
- NIL8 - Lisp Tree
- HTML/XML - doc tree -- attributed text (similar to Emacs data model)
  - 8ml

Types of query languages:

- regex
- jq / jshape
- tsv8


## Appendix

### Notes on Naming

Considering columns and then rows:

- SQL is "select ... where"
- dplyr is "select ... filter"
- YSH is "pick ... where"
  - select is a legacy shell keyword, and pick is shorter
  - or it could be elect in OSH, elect/select in YSH
  - OSH wouldn't support mutate [average = bytes/total] anyway

dplyr:

- summarise vs. summarize vs. summary