---
default_highlighter: oils-sh
in_progress: yes
---

Notes on Unicode in Shell
=========================

<div id="toc">
</div>

## Philosophy

Oils is UTF-8 centric, unlike `bash` and other shells.

That is, its Unicode support is like Go, Rust, Julia, and Swift, and not like
Python or JavaScript. The former languages internally represent strings as
UTF-8, while the latter use arrays of code points or UTF-16 code units.

## A Mental Model

### Program Encoding - OSH vs. YSH

- The source files of OSH programs may have arbitrary bytes, for backward
  compatibility.
- The source files of YSH programs should be encoded in UTF-8 (or its
  ASCII subset). TODO: Enforce this with `shopt --set utf8_source`

Unicode characters can be encoded directly in the source:

<pre>
echo '&#x03bc;'
</pre>

or denoted in ASCII with C-escaped strings:

    echo $'\u03bc'  # bash style

    echo u'\u{3bc}'  # YSH style

(Such strings are preferred over `echo -e` because they're statically parsed.)

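To see which bytes these notations stand for, here is a minimal sketch that spells U+03BC as raw UTF-8 bytes and dumps them in hex. It assumes a POSIX `printf` and `od`. The variable name `mu_hex` is only for illustration, and octal escapes are used so that the result does not depend on the locale.

```shell
# U+03BC (Greek small letter mu) is the two bytes 0xCE 0xBC in UTF-8,
# written here as the octal escapes \316 \274.
mu_hex=$(printf '\316\274' | od -An -tx1 | tr -d ' \n')
echo "$mu_hex"   # cebc
```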
### Data Encoding

Strings in OSH are arbitrary sequences of **bytes**, which may or may not be
valid UTF-8. Details:

- When passed to external programs, strings are truncated at the first `NUL`
  (`'\0'`) byte. This is a consequence of how Unix and C work.
- Some operations like length `${#s}` and slicing `${s:1:3}` require the string
  to be **valid UTF-8**. Decoding errors are fatal if
  `shopt -s strict_word_eval` is on.

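The difference between a byte count and a code point count can be sketched with standard tools. This is an illustration and not how OSH implements `${#s}`. The helper names `count_bytes` and `count_code_points` are hypothetical, and the code point count is only meaningful if the input is already valid UTF-8.

```shell
# Hypothetical helpers: each reads stdin and prints a count.

# Number of bytes.
count_bytes() {
  # $(( ... )) strips the padding that some wc implementations add.
  echo $(( $(wc -c) ))
}

# Number of code points, ASSUMING valid UTF-8 input:
# delete continuation bytes (0x80-0xBF, octal 200-277) and count the rest.
count_code_points() {
  echo $(( $(LC_ALL=C tr -d '\200-\277' | wc -c) ))
}

s=$(printf 'a\316\274b')   # "a", U+03BC, "b"
printf '%s' "$s" | count_bytes         # 4
printf '%s' "$s" | count_code_points   # 3
```

This sketch does not detect invalid input. As noted above, OSH treats decoding errors as fatal when `strict_word_eval` is on.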
## List of Features That Respect Unicode

### OSH / bash

These operations are implemented in Python.

In `osh/string_ops.py`:

- `${#s}` -- length in code points (buggy in bash)
  - Note: YSH `len(s)` returns a number of bytes, not code points.
- `${s:1:2}` -- index and length are a number of code points
- `${x#glob?}` and `${x##glob?}` (see below)

In `builtin/`:

- `printf '%d' \'c` where `c` is an arbitrary character. This is an obscure
  syntax for `ord()`, i.e. getting an integer from an encoded character.

More:

- `$IFS` word splitting. Affects `shSplit()` builtin
  - Doesn't respect unicode in dash, ash, mksh. But it does in bash, yash, and
    zsh with `setopt SH_WORD_SPLIT`.
  - TODO: Oils should probably respect it
- `${foo,}` and `${foo^}` for lowercase / uppercase
  - TODO: doesn't respect unicode
- `[[ a < b ]]` and `[ a '<' b ]` for sorting
  - these can use libc `strcoll()`?

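The `printf '%d' \'c` form listed above can be tried with an ASCII character, where the result is the same in any locale. This is a minimal sketch and the variable name `code` is only for illustration. It quotes the argument as `"'A"`, which passes the same two characters as `\'A`. For non-ASCII characters the result depends on the shell and the locale, so none is shown here.

```shell
# A leading quote in the argument asks printf for the numeric value
# of the character that follows it.
code=$(printf '%d' "'A")
echo "$code"   # 65
```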
#### Globs

Globs have character classes like `[^a]` and the single-character wildcard `?`.

This pattern results in a `glob()` call:

    echo my?glob

These patterns result in `fnmatch()` calls:

    case $x in ?) echo 'one char' ;; esac

    [[ $x == ? ]]

    ${s#?}  # remove one character prefix, quadratic loop for globs

This uses our glob to ERE translator for *position* info:

    echo ${s/?/x}

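What `?` counts as one "character" depends on the locale the matcher sees. The sketch below forces the C locale inside a subshell and checks how many `?` wildcards a two-byte UTF-8 character needs. It assumes bash, which re-reads `LC_ALL` when it is assigned. Other shells, including OSH, may behave differently. The helper name `wildcards_in_c_locale` is hypothetical.

```shell
s=$(printf '\316\274')   # U+03BC as two UTF-8 bytes

# Hypothetical helper: reports how many `?` wildcards match its argument
# when the shell is in the C locale.  The function body is a subshell,
# so the locale change does not leak into the caller.
wildcards_in_c_locale() (
  LC_ALL=C
  case $1 in
    ?)  echo 1 ;;
    ??) echo 2 ;;
    *)  echo other ;;
  esac
)

wildcards_in_c_locale "$s"   # expected under bash: 2, each byte is a "character"
wildcards_in_c_locale a      # 1
```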
#### Regexes (ERE)

Regexes have character classes `[^a]` and `.`:

    pat='.'  # single "character"
    [[ $x =~ $pat ]]

#### Locale-aware operations

- The prompt string can show the time, which is formatted in a locale-specific
  way.
- In bash, `printf` can also format the time.

Other:

- The prompt width is calculated with `wcswidth()`, which doesn't just count
  code points. It calculates the **display width** of characters, which is
  different in general.

### YSH

- Eggex matching depends on ERE semantics.
  - `mystr ~ / [ \xff ] /`
  - `case (x) { / dot / }`
- `Str.{trim,trimLeft,trimRight}` respect unicode space, like JavaScript does
- TODO: `Str.{upper,lower}` also need unicode case folding
- TODO: `s.split()` doesn't have a default "split by space", which should
  probably respect unicode space, like `trim()` does
- TODO: `for offset, rune in (runes(mystr))` decodes UTF-8, like Go

Not unicode aware:

- `strcmp()` does byte-wise and UTF-8 wise comparisons?

### Data Languages

- Decoding JSON/J8 validates UTF-8
- Encoding JSON/J8 decodes and validates UTF-8
  - So we can distinguish valid UTF-8 and invalid bytes like `\yff`

## Implementation Notes

Unlike bash and CPython, Oils doesn't call `setlocale()`. (Although GNU
readline may call it.)

It's expected that your locale will respect UTF-8. This is true on most
distros. If not, then some string operations will support UTF-8 and some
won't.

For example:

- String length like `${#s}` is implemented in Oils code, not libc, so it will
  always respect UTF-8.
- `[[ s =~ $pat ]]` is implemented with libc, so it is affected by the locale
  settings. Same with Oils `(x ~ pat)`.

TODO: Oils should support `LANG=C` for some operations, but not `LANG=X` for
other `X`.

### List of Low-Level UTF-8 Operations

libc:

- `glob()` and `fnmatch()`
- `regexec()`
- `strcoll()` respects `LC_COLLATE`, which bash probably does

Our own:

- Decode next rune from a position, or previous rune
  - `trimLeft()` and `${s#prefix}` need this
- Decode UTF-8
  - J8 encoding and decoding need this
  - `for r in (runes(x))` needs this
  - respecting surrogate half
    - JSON needs this
- Encode integer rune to UTF-8 sequence
  - J8 needs this, for `\u{3bc}` (currently in `data_lang/j8.py Utf8Encode()`)

Not sure:

- Case folding
  - both OSH and YSH have uppercase and lowercase

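The "encode integer rune to UTF-8 sequence" operation listed above can be sketched in POSIX shell arithmetic. This illustrates the standard UTF-8 bit layout and is not the code in `data_lang/j8.py`. The helper names `emit_byte` and `utf8_encode` are hypothetical. Rejecting surrogates and out-of-range values is a choice made for this sketch, and `cp` is a global variable because POSIX shell has no local variables.

```shell
# Hypothetical helper: write one byte, given its decimal value.
emit_byte() {
  # Build an octal escape such as \316 and let printf interpret it.
  printf "\\$(printf '%03o' "$1")"
}

# Hypothetical helper: write the UTF-8 encoding of a code point to stdout.
# The argument may be decimal or 0x-prefixed hex.
# Returns 1 for surrogates (0xD800-0xDFFF) and values above 0x10FFFF.
utf8_encode() {
  cp=$(( $1 ))
  if [ "$cp" -lt 0 ] || [ "$cp" -gt 1114111 ]; then
    return 1
  elif [ "$cp" -ge 55296 ] && [ "$cp" -le 57343 ]; then
    return 1
  elif [ "$cp" -lt 128 ]; then          # 1 byte:  0xxxxxxx
    emit_byte "$cp"
  elif [ "$cp" -lt 2048 ]; then         # 2 bytes: 110xxxxx 10xxxxxx
    emit_byte $(( 192 | (cp >> 6) ))
    emit_byte $(( 128 | (cp & 63) ))
  elif [ "$cp" -lt 65536 ]; then        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    emit_byte $(( 224 | (cp >> 12) ))
    emit_byte $(( 128 | ((cp >> 6) & 63) ))
    emit_byte $(( 128 | (cp & 63) ))
  else                                  # 4 bytes: 11110xxx and three 10xxxxxx
    emit_byte $(( 240 | (cp >> 18) ))
    emit_byte $(( 128 | ((cp >> 12) & 63) ))
    emit_byte $(( 128 | ((cp >> 6) & 63) ))
    emit_byte $(( 128 | (cp & 63) ))
  fi
}

utf8_encode 0x3bc | od -An -tx1   # the two bytes ce bc
```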
## Tips

- The GNU `iconv` program converts text from one encoding to another.

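As a small sketch of the tip above, the following converts U+03BC from UTF-8 to UTF-16BE and shows the resulting bytes. It assumes an `iconv` that accepts the encoding names `UTF-8` and `UTF-16BE`, and the variable name `utf16_hex` is only for illustration.

```shell
# U+03BC is ce bc in UTF-8 and 03 bc in big-endian UTF-16.
utf16_hex=$(printf '\316\274' | iconv -f UTF-8 -t UTF-16BE | od -An -tx1 | tr -d ' \n')
echo "$utf16_hex"   # 03bc
```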
<!--
## Spec Tests

June 2024 notes:

- `spec/var-op-patsub` has failing cases, e.g. `LC_ALL=C`
  - ${s//?/a}
- glob() and fnmatch() seem to be OK? As long as locale is UTF-8.

-->

<!--

What libraries are we using?

TODO: Make sure these are UTF-8 mode, regardless of LANG global variables?

Or maybe we punt on that, and say Oils is only valid in UTF-8 mode? Need to
investigate the API more.

- fnmatch()
- glob()
- regcomp/regexec()

- Are we using any re2c unicode? For JSON?
- upper() and lower()? isupper() is lower()
  - Need to sort these out

-->