---
in_progress: yes
---

Solutions to the Framing Problem
================================

How do you write multiple **records** to a pipe, and how do you read them?

You need a way of delimiting them. Let's call this the "framing problem",
a term borrowed from network engineering.

This doc categorizes different formats, and shows how you handle them in YSH.

YSH is meant for writing correct shell programs.

<div id="toc">
</div>

## A Length Prefix

[Netstrings][netstring] are a simple format defined by Daniel J. Bernstein.

    3:foo,  # ASCII length, colon, byte string, comma

[netstring]: https://en.wikipedia.org/wiki/Netstring

This format is easy to implement, and efficient to read and write.

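As a rough illustration of how little code a length prefix needs, here is a
minimal bash sketch. It is not the planned YSH builtins: the function names
`netstr_encode` and `netstr_decode` are made up for this example, and it
assumes bash 4 or later for `read -N`. Because bash variables can't hold `NUL`
bytes, it only round-trips payloads that don't contain `\0`.

```shell
# Hypothetical helpers for illustration; not part of Oils or any library.

# Write one netstring to stdout: ASCII length, colon, bytes, comma.
netstr_encode() {
  local LC_ALL=C   # make ${#s} count bytes rather than characters
  local s=$1
  printf '%d:%s,' "${#s}" "$s"
}

# Read one netstring from stdin into REPLY.
# Returns non-zero at EOF or on malformed input.
netstr_decode() {
  local LC_ALL=C   # make read -N count bytes rather than characters
  local len comma
  REPLY=''
  IFS= read -r -d ':' len || return 1            # digits up to the colon
  [[ $len =~ ^(0|[1-9][0-9]*)$ ]] || return 1    # reject anything but a length
  if [[ $len != 0 ]]; then
    IFS= read -r -N "$len" REPLY || return 1     # exactly $len bytes, newlines included
  fi
  IFS= read -r -N 1 comma || return 1
  [[ $comma == ',' ]]
}

# The second record contains a newline, yet it is read back as one record.
printf '3:foo,7:bar\nbaz,' | {
  netstr_decode && printf '[%s]\n' "$REPLY"
  netstr_decode && printf '[%s]\n' "$REPLY"
}
```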
But the encoded output may contain binary data, which isn't readable by a human
using a terminal (or GUI). This is significant!

---

TODO: Implement `read --netstr` and `write --netstr`

<!--
Like [J8 Notation][], this format is "8-bit clean", but:

- A netstring encoder is easier to write than a QSN encoder. This may be
  useful if you don't have a library handy.
- It's more efficient to decode, in theory.
-->

## Solutions Using a Delimiter

Now let's look at traditional Unix solutions, and their **problems**.

### Fixed Delimiter: Newline or `NUL` byte

In traditional Unix, newlines delimit "records". Here's how you read them in
shell:

    while IFS='' read -r; do  # confusing idiom!
      echo "line=$REPLY"
      break  # remaining bytes are still in the pipe
    done

YSH has a simpler idiom:

    while read --raw-line {  # unbuffered
      echo line=$_reply
      break  # remaining bytes are still in the pipe
    }

Or you can read all lines:

    for line in (io.stdin) {  # buffered
      echo line=$line
      break  # remaining bytes may be lost in a buffer
    }

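The comments about remaining bytes can be checked with a small bash
experiment. This sketch relies on bash's `read` consuming nothing past the
first newline when its input is a pipe, so `cat` still sees the rest. The
helper name `first_and_rest` is only for illustration.

```shell
# first_and_rest is a hypothetical helper: it reads one line with `read`,
# then lets `cat` show what is still available on stdin.
first_and_rest() {
  local first
  IFS= read -r first
  printf 'first=%s\n' "$first"
  cat
}

printf 'one\ntwo\nthree\n' | first_and_rest
```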
**However**, in Unix, all of these strings may have newlines:

- filenames
- items in `argv`
- values in `environ`

---

But these C-style strings can't contain the `NUL` byte, aka `\0`. So GNU tools
have evolved support for another format:

    find . -print0  # write data
    xargs -0        # read data; also --null
    grep -z         # read data; also --null-data
    sort -z         # read data; also --zero-terminated
                    # (Why are all the names different?)

In Oils, we added a `-0` flag to `read` that understands this:

    $ find . -print0 | { read -0 x; echo $x; read -0 x; echo $x; }
    foo  # could contain newlines!
    bar

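For comparison, plain bash can read the same format with `read -d ''`, which
treats `NUL` as the delimiter. The sketch below uses a hypothetical helper
name, `read_nul_records`, and feeds it from `printf` rather than `find` so the
input is predictable.

```shell
# read_nul_records is a hypothetical helper: it fills the global array
# `records` with the NUL-terminated records found on stdin.
read_nul_records() {
  records=()
  local rec
  while IFS= read -r -d '' rec; do
    records+=("$rec")
  done
}

# Two records; the first one contains a newline.
read_nul_records < <(printf 'foo\nbar\0baz\0')
echo "${#records[@]}"
```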
### Chosen Delimiter: Here docs and multipart MIME

Shell has here docs that look like this:

    cat <<EOF
    the string EOF
    can't start a line
    EOF

So you **choose** the delimiter, with the "word" you write after `<<`.

---

Similarly, when your browser POSTs a form, it uses [MIME multipart message
format](https://en.wikipedia.org/wiki/MIME#Multipart_messages):

    MIME-Version: 1.0
    Content-Type: multipart/mixed; boundary=frontier

    This is a message with multiple parts in MIME format.
    --frontier
    Content-Type: text/plain

    This is the body of the message.
    --frontier

So again, you **choose** a delimiter with `boundary=frontier`, and then you
must recognize it later in the message.

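A chosen delimiter only works if it never appears in the content, so a writer
has to check for collisions. The following bash sketch shows one way to do
that. `choose_boundary` is a hypothetical helper, and appending a counter is
just one possible strategy, not what any particular browser or mail client
does.

```shell
# choose_boundary PAYLOAD [BASE]
# Hypothetical helper: print a boundary word such that "--WORD" does not
# occur anywhere in PAYLOAD. Tries BASE first, then BASE1, BASE2, ...
choose_boundary() {
  local payload=$1 base=${2:-frontier}
  local candidate=$base n=0
  while [[ $payload == *"--$candidate"* ]]; do
    n=$((n + 1))
    candidate=$base$n
  done
  printf '%s\n' "$candidate"
}

choose_boundary 'plain text'
choose_boundary $'a line\n--frontier\nanother line'
```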
## C-Style `\` escaping allows arbitrary bytes

[JSON][] can express strings with newlines:

    "line 1 \n line 2"

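The practical effect is that an escaped string always fits on one physical
line, so a newline can still delimit records. As a rough illustration in bash,
`printf '%b'` decodes C-style escapes like `\n`. It is a stand-in here, not a
JSON parser, and the variable names are only for this example.

```shell
encoded='line 1 \n line 2'          # backslash and n: one physical line
decoded=$(printf '%b' "$encoded")   # a real newline: two physical lines

printf '%s\n' "$encoded" | wc -l
printf '%s\n' "$decoded" | wc -l
```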
It can also express the zero code point, which isn't the same as the `NUL` byte:

    "zero code point \u0000"

[J8 Notation][] is an extension of JSON that fixes this:

    "NUL byte \y00"

(We use `\y00` rather than `\x00`, because Python and JavaScript both confuse
`\x00` with `U+0000`. The zero code point may be encoded as 2 or 4 `NUL`
bytes, for example in UTF-16 or UTF-32.)

[J8 Notation]: j8-notation.html
[JSON]: $xref

### Escaping-Based Records

TSV files are based on delimiters, but they aren't very readable in a terminal.

TODO

So TSV8 offers an "aligned" format:

    #.ssv8 flag      desc                 type
    type   Str       Str                  Str
           --verbose "do it \t verbosely" bool
           --count   "count only"         int

So this format combines two strategies:

- Delimiter-based for the **rows** / lines
- Escaping-based for the **cells**

## Conclusion

Traditional shells mostly support newline-based records. YSH supports:

1. Length-prefixed records
1. Delimiter-based records
   - fixed delimiter: newline or `NUL`
   - chosen delimiter: TODO? with regex capture?
1. Escaping-based records with [JSON][] and the [J8 Notation][] extension.
   - But we avoid formats that are purely based on escaping.