---
default_highlighter: oils-sh
in_progress: yes
---

Notes on Unicode in Shell
=========================

<div id="toc">
</div>

## Philosophy

Oils is UTF-8 centric, unlike `bash` and other shells.

That is, its Unicode support is like Go, Rust, Julia, and Swift, and not like
Python or JavaScript. The former languages internally represent strings as
UTF-8, while the latter use arrays of code points or UTF-16 code units.

## A Mental Model

### Program Encoding - OSH vs. YSH

- The source files of OSH programs may have arbitrary bytes, for backward
  compatibility.
- The source files of YSH programs should be encoded in UTF-8 (or its ASCII
  subset). TODO: Enforce this with `shopt --set utf8_source`

Unicode characters can be encoded directly in the source:

<pre>
echo 'μ'
</pre>

or denoted in ASCII with C-escaped strings:

    echo $'\u03bc'  # bash style

    echo u'\u{3bc}'  # YSH style

(Such strings are preferred over `echo -e` because they're statically parsed.)
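
Whichever spelling the source uses, the resulting string is the same two bytes:
U+03BC is `ce bc` in UTF-8. The sketch below shows this with POSIX `printf`
octal escapes and `od`, so it doesn't depend on the shell's `\u` support or on
the locale. The function name `mu_hex` is made up for illustration.

```shell
# mu_hex is a hypothetical helper, not part of Oils.
# '\316\274' is the octal spelling of the bytes 0xCE 0xBC,
# which is the UTF-8 encoding of U+03BC.
mu_hex() {
  printf '\316\274' | od -An -tx1 | tr -d ' \n'
}

mu_hex; echo   # prints: cebc
```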

### Data Encoding

Strings in OSH are arbitrary sequences of **bytes**, which may or may not be
valid UTF-8. Details:

- When passed to external programs, strings are truncated at the first `NUL`
  (`'\0'`) byte. This is a consequence of how Unix and C work.
- Some operations like length `${#s}` and slicing `${s:1:3}` require the string
  to be **valid UTF-8**. Decoding errors are fatal if
  `shopt -s strict_word_eval` is on.
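
The difference between counting bytes and counting code points can be sketched
without relying on the shell's locale. The helper names `byte_len` and
`codepoint_len` are hypothetical, and the code point count assumes the input is
already valid UTF-8: it drops continuation bytes (`0x80` to `0xBF`) and counts
what is left.

```shell
# Hypothetical helpers for illustration; not part of OSH or YSH.

# Number of bytes in $1 (wc -c counts bytes regardless of locale).
byte_len() {
  printf '%s' "$1" | wc -c | tr -d ' '
}

# Number of code points in $1, ASSUMING $1 is valid UTF-8.
# Every code point has exactly one non-continuation byte, so delete the
# continuation bytes 0x80-0xBF (octal 200-277) and count the rest.
codepoint_len() {
  printf '%s' "$1" | LC_ALL=C tr -d '\200-\277' | wc -c | tr -d ' '
}

s=$(printf 'a\316\274b')   # the 4 bytes of "aμb"
byte_len "$s"              # prints: 4
codepoint_len "$s"         # prints: 3
```

This sketch does no validation, so on invalid input the second number is
meaningless.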

## List of Features That Respect Unicode

### OSH / bash

These operations are implemented in Python.

In `osh/string_ops.py`:

- `${#s}` -- length in code points (buggy in bash)
  - Note: YSH `len(s)` returns a number of bytes, not code points.
- `${s:1:2}` -- index and length are a number of code points
- `${x#glob?}` and `${x##glob?}` (see below)

In `builtin/`:

- `printf '%d' \'c` where `c` is an arbitrary character. This is an obscure
  syntax for `ord()`, i.e. getting an integer from an encoded character.
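
A minimal sketch of that `printf` idiom, wrapped in a function. The name `ord`
is only a label for this example. The input is ASCII on purpose: for non-ASCII
characters the result depends on the shell and the locale, so no output is
claimed for those.

```shell
# ord is a hypothetical wrapper name. In POSIX printf, a numeric argument
# that starts with ' or " means "the numeric value of the next character".
ord() {
  printf '%d\n' "'$1"
}

ord A   # prints: 65
```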

More:

- `$IFS` word splitting. Affects `shSplit()` builtin
  - Doesn't respect unicode in dash, ash, mksh. But it does in bash, yash, and
    zsh with `setopt SH_WORD_SPLIT`.
  - TODO: Oils should probably respect it
- `${foo,}` and `${foo^}` for lowercase / uppercase
  - TODO: doesn't respect unicode
- `[[ a < b ]]` and `[ a '<' b ]` for sorting
  - these can use libc `strcoll()`?

#### Globs

Globs have character classes like `[^a]` and the single-character wildcard `?`.

This pattern results in a `glob()` call:

    echo my?glob

These patterns result in `fnmatch()` calls:

    case $x in ?) echo 'one char' ;; esac

    [[ $x == ? ]]

    ${s#?}  # remove one character prefix, quadratic loop for globs

This uses our glob to ERE translator for *position* info:

    echo ${s/?/x}
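
The `fnmatch()`-style cases can be tried out with a small sketch. The helper
name `is_one_char` is hypothetical. The inputs are ASCII, where one byte is one
character in every locale; whether a multi-byte character like `μ` matches `?`
depends on the shell and, for libc-backed matching, on the locale.

```shell
# Hypothetical helper: succeeds if $1 is exactly one "character"
# according to the shell's pattern matching.
is_one_char() {
  case $1 in
    ?) return 0 ;;
    *) return 1 ;;
  esac
}

is_one_char x  && echo 'one char'
is_one_char xy || echo 'not one char'

s=abc
echo "${s#?}"   # removes one leading character, prints: bc
```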

#### Regexes (ERE)

Regexes have character classes like `[^a]` and the single-character wildcard
`.`:

    pat='.'  # single "character"
    [[ $x =~ $pat ]]

#### Locale-aware operations

- The prompt string can include the time, whose format is locale-specific.
- In bash, `printf` can also format the time, with `%(...)T`.

Other:

- The prompt width is calculated with `wcswidth()`, which doesn't just count
  code points. It calculates the **display width** of characters, which is
  different in general.

### YSH

- Eggex matching depends on ERE semantics.
  - `mystr ~ / [ \xff ] /`
  - `case (x) { / dot / }`
- `Str.{trim,trimLeft,trimRight}` respect unicode space, like JavaScript does
- TODO: `Str.{upper,lower}` also need unicode case folding
- TODO: `s.split()` doesn't have a default "split by space", which should
  probably respect unicode space, like `trim()` does
- TODO: `for offset, rune in (runes(mystr))` decodes UTF-8, like Go

Not unicode aware:

- `strcmp()` compares bytes. For valid UTF-8 this gives the same order as
  comparing code points, but it isn't locale-aware collation.
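
For valid UTF-8, comparing bytes gives the same order as comparing code points.
One way to see this from the shell is `sort` under `LC_ALL=C`, which compares
bytes. It is used here as a stand-in for a byte-wise `strcmp()`, not as a
description of how YSH compares strings, and `sorted_hex` is a made-up name.

```shell
# Hypothetical demo. The input lines are "μ" (U+03BC), "z" (U+007A) and
# "é" (U+00E9), written as octal byte escapes so the script is pure ASCII.
sorted_hex() {
  printf '\316\274\nz\n\303\251\n' | LC_ALL=C sort | od -An -tx1 | tr -d ' \n'
}

# The output lists z, é, μ (each followed by a newline, 0a):
# byte order and code point order agree.
sorted_hex; echo   # prints: 7a0ac3a90acebc0a
```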

### Data Languages

- Decoding JSON/J8 validates UTF-8
- Encoding JSON/J8 decodes and validates UTF-8
  - So we can distinguish valid UTF-8 and invalid bytes like `\yff`

## Implementation Notes

Unlike bash and CPython, Oils doesn't call `setlocale()`. (Although GNU
readline may call it.)

It's expected that your locale will respect UTF-8. This is true on most
distros. If not, then some string operations will support UTF-8 and some
won't.

For example:

- String length like `${#s}` is implemented in Oils code, not libc, so it will
  always respect UTF-8.
- `[[ s =~ $pat ]]` is implemented with libc, so it is affected by the locale
  settings. Same with Oils `(x ~ pat)`.
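
Since the libc-backed operations depend on the locale, a script may want to
check for a UTF-8 locale up front. A sketch, assuming the POSIX `locale charmap`
command is available; `charmap_is_utf8` is a hypothetical helper, not an Oils
builtin.

```shell
# Hypothetical helper: succeeds if the given charmap name denotes UTF-8.
# Spellings vary between systems, so accept a few common ones.
charmap_is_utf8() {
  case $1 in
    UTF-8|utf-8|UTF8|utf8) return 0 ;;
    *) return 1 ;;
  esac
}

# `locale charmap` prints the character encoding of the current locale.
if charmap_is_utf8 "$(locale charmap 2>/dev/null)"; then
  echo 'locale is UTF-8'
else
  echo 'locale is not UTF-8: libc-backed operations may work on bytes'
fi
```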

TODO: Oils should support `LANG=C` for some operations, but not `LANG=X` for
other `X`.

### List of Low-Level UTF-8 Operations

libc:

- `glob()` and `fnmatch()`
- `regexec()`
- `strcoll()` respects `LC_COLLATE`, and bash probably does too

Our own:

- Decode next rune from a position, or previous rune
  - `trimLeft()` and `${s#prefix}` need this
- Decode UTF-8
  - J8 encoding and decoding need this
  - `for r in (runes(x))` needs this
  - respecting surrogate half
    - JSON needs this
- Encode integer rune to UTF-8 sequence
  - J8 needs this, for `\u{3bc}` (currently in `data_lang/j8.py Utf8Encode()`)
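
To make "encode integer rune to UTF-8 sequence" concrete, here is a sketch of
the standard UTF-8 bit layout in POSIX shell arithmetic. It is not the code in
`data_lang/j8.py`; the name `utf8_hex` is hypothetical, and the bytes are
printed as hex so the output is plain ASCII.

```shell
# utf8_hex is a hypothetical name; this is NOT the code in data_lang/j8.py.
# Prints the UTF-8 bytes of a code point as hex.
# Sketch only: it does not reject surrogates (U+D800-U+DFFF) or values
# above U+10FFFF, which a real encoder has to handle.
utf8_hex() {
  cp=$(( $1 ))   # accepts decimal or 0x... hex; note that cp is a global
  if [ "$cp" -lt 128 ]; then        # 1 byte:  0xxxxxxx
    printf '%02x\n' "$cp"
  elif [ "$cp" -lt 2048 ]; then     # 2 bytes: 110xxxxx 10xxxxxx
    printf '%02x %02x\n' \
      $(( 0xC0 | (cp >> 6) )) $(( 0x80 | (cp & 0x3F) ))
  elif [ "$cp" -lt 65536 ]; then    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    printf '%02x %02x %02x\n' \
      $(( 0xE0 | (cp >> 12) )) $(( 0x80 | ((cp >> 6) & 0x3F) )) \
      $(( 0x80 | (cp & 0x3F) ))
  else                              # 4 bytes: 11110xxx and three 10xxxxxx
    printf '%02x %02x %02x %02x\n' \
      $(( 0xF0 | (cp >> 18) )) $(( 0x80 | ((cp >> 12) & 0x3F) )) \
      $(( 0x80 | ((cp >> 6) & 0x3F) )) $(( 0x80 | (cp & 0x3F) ))
  fi
}

utf8_hex 0x3bc    # prints: ce bc
utf8_hex 0x20ac   # prints: e2 82 ac
```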

Not sure:

- Case folding
  - both OSH and YSH have uppercase and lowercase

## Tips

- The GNU `iconv` program converts text from one encoding to another.
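
For example, `iconv` can convert Latin-1 data to UTF-8, and converting from
UTF-8 to UTF-8 is a common way to check that data is valid. A sketch, assuming
an `iconv` that accepts the `-f` and `-t` options and the encoding names
`ISO-8859-1` and `UTF-8`; the helper names are hypothetical.

```shell
# Hypothetical helpers built on iconv; not part of Oils.

# Convert Latin-1 (ISO-8859-1) bytes on stdin to UTF-8 on stdout.
latin1_to_utf8() {
  iconv -f ISO-8859-1 -t UTF-8
}

# Succeeds if stdin is valid UTF-8: iconv exits non-zero when it
# hits an invalid or incomplete sequence.
is_valid_utf8() {
  iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# 0xE9 (octal 351) is "é" in Latin-1; in UTF-8 it becomes the bytes c3 a9.
printf '\351' | latin1_to_utf8 | od -An -tx1
printf '\351' | is_valid_utf8 || echo 'invalid UTF-8'
```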

<!--
## Spec Tests

June 2024 notes:

- `spec/var-op-patsub` has failing cases, e.g. `LC_ALL=C`
  - ${s//?/a}
- glob() and fnmatch() seem to be OK? As long as locale is UTF-8.

-->

<!--

What libraries are we using?

TODO: Make sure these are UTF-8 mode, regardless of LANG global variables?

Or maybe we punt on that, and say Oils is only valid in UTF-8 mode? Need to
investigate the API more.

- fnmatch()
- glob()
- regcomp/regexec()

- Are we using any re2c unicode? For JSON?
- upper() and lower()? isupper() is lower()
  - Need to sort these out

-->