---
default_highlighter: oils-sh
---

Unicode in Oils
===============

Roughly speaking, you can divide programming languages into 3 categories with
respect to Unicode strings:

1. **UTF-8** - Go, Rust, Julia, ..., Oils
1. **UTF-16** - Java, JavaScript, ...
1. **UTF-32** aka Unicode code points - Python 2 and 3, C and C++, ...

So Oils is in the **first** category: it's UTF-8 centric.

Let's see what this means, both in terms of your mental model when writing OSH
and YSH, and in terms of the Oils implementation.

<div id="toc">
</div>

## Example: The Length of a String

The Oils runtime has a single `Str` [data type](types.html), which is used by
both OSH and YSH.

A `Str` is an array of bytes, which **may or may not be** UTF-8 encoded. For
example:

    s=$'\u03bc'       # 1 code point, which is UTF-8 encoded as 2 bytes

    echo ${#s}        # => 1 code point (regardless of locale, right now)

    echo $[len(s)]    # => 2 bytes

That is, the YSH feature `len(mystr)` returns the length in **bytes**. But the
shell feature `${#s}` *decodes* the string as UTF-8, and returns the length in
**code points**.

Again, this string storage model is like Go and Julia, but different than
JavaScript (UTF-16) and Python (code points).
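
To make the byte versus code point distinction concrete outside of Oils, here
is a minimal sketch in portable shell. The function names `byte_len` and
`codepoint_len` are hypothetical, and `codepoint_len` assumes that its input is
valid UTF-8: it counts the bytes that are not continuation bytes (`10xxxxxx`).

```shell
# Hypothetical helpers, not part of Oils.

# Length in bytes, independent of the locale.
byte_len() {
  printf '%s' "$1" | wc -c | tr -d ' '
}

# Length in code points, ASSUMING valid UTF-8: every code point has exactly
# one byte that is not a continuation byte (0x80-0xBF), so delete the
# continuation bytes and count what remains.
codepoint_len() {
  printf '%s' "$1" | LC_ALL=C tr -d '\200-\277' | wc -c | tr -d ' '
}

mu=$(printf '\316\274')   # U+03BC, spelled as its 2 UTF-8 bytes

byte_len "$mu"        # prints 2
codepoint_len "$mu"   # prints 1
```

The string is built from octal escapes so that the result does not depend on
how a particular shell or locale handles `\u` escapes.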

### Note on bash

`bash` does support multiple lengths, but in a way that depends on global
variables:

    s=$'\u03bc'  # one code point

    echo ${#s}   # => 1 when, say, LANG=C.UTF-8

    LC_ALL=C     # libc setlocale() called under the hood
    echo ${#s}   # => 2 bytes, now that LC_ALL=C

So bash doesn't seem to fall cleanly into one of the 3 categories above.

It would be interesting to test bash with non-UTF-8 libc locales like Shift JIS
(Japanese), but they are rare. In practice, the locale is almost always C or
UTF-8, so bash and Oils are similar.

But Oils is more strict about UTF-8, and YSH discourages global variables like
`LC_ALL`.

(TODO: For compatibility, OSH should call `setlocale()` when assigning
`LC_ALL=C`.)

<!--
- Python: like bash, strings are logically an array of code points.
- JavaScript: a string is an array of 16-bit code units (UTF-16).

So, unlike those 3 languages, Oils is UTF-8 centric.
-->

## Code Strings and Data Strings

### OSH vs. YSH

For backward compatibility, OSH source files may have **arbitrary bytes**. For
example, `echo [the literal byte 0xFF]` is a valid source file.

In contrast, YSH source files must be encoded in UTF-8, including its ASCII
subset. (TODO: Enforce this with `shopt --set utf8_source`)

If you write C-escaped strings, then your source file can be ASCII:

    echo $'\u03bc'   # bash style

    echo u'\u{3bc}'  # YSH style

If you write UTF-8 characters, then your source is UTF-8:

<pre>
echo '&#x03bc;'
</pre>

### Data Encoding

As mentioned, strings in OSH and YSH are arbitrary sequences of **bytes**,
which may or may not be valid UTF-8.

Some operations like length `${#s}` and slicing `${s:1:3}` require the string
to be **valid UTF-8**. Decoding errors are fatal if `shopt -s
strict_word_eval` is on.
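
Outside of Oils, one way to check whether a string is valid UTF-8 before
handing it to such operations is to round-trip it through `iconv`. This is a
sketch, not how Oils validates strings. The helper name `is_valid_utf8` is
hypothetical, and it assumes an `iconv` that exits with a non-zero status on
malformed input, as GNU `iconv` does.

```shell
# Hypothetical helper, not part of Oils.
# Succeeds if "$1" decodes as UTF-8. Assumes iconv fails on malformed input.
is_valid_utf8() {
  printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

good=$(printf '\316\274')   # U+03BC as UTF-8
bad=$(printf 'a\377b')      # the byte 0xFF never appears in UTF-8

if is_valid_utf8 "$good"; then echo 'good: valid'; fi
if ! is_valid_utf8 "$bad"; then echo 'bad: invalid'; fi
```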

### Passing Data to libc / the Kernel

When passed to external programs, strings are truncated at the first `NUL`
(`'\0'`) byte. This is a consequence of how Unix and C work.

## Your System Locale Should Be UTF-8

At startup, Oils calls the `libc` function `setlocale()`, which initializes the
global variables from environment variables like `LC_CTYPE` and `LC_COLLATE`.
(For details, see [osh-locale][] and [ysh-locale][].)

[osh-locale]: ref/chap-special-var.html#osh-locale
[ysh-locale]: ref/chap-special-var.html#ysh-locale

These global variables determine how `libc` string operations like `tolower()`,
`glob()`, and `regexec()` behave.

For example:

- In `glob()` syntax, does `?` match a byte or a code point?
- In `regcomp()` syntax, does `.` match a byte or a code point?
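
The first question can be observed from any POSIX shell. This sketch starts a
child `sh` with `LC_ALL=C` in its environment, and asks how many `?` wildcards
it takes to match the 2-byte encoding of U+03BC. The helper name
`qmarks_in_c_locale` is hypothetical. In the C locale, each `?` matches one
byte. In a UTF-8 locale, a shell that matches by code point would take the
single `?` branch instead.

```shell
# Hypothetical helper: classify "$1" by how many '?' wildcards match it
# when the matching shell runs in the C locale.
qmarks_in_c_locale() {
  LC_ALL=C sh -c '
    case "$1" in
      ?)  echo 1 ;;
      ??) echo 2 ;;
      *)  echo other ;;
    esac
  ' sh "$1"
}

mu=$(printf '\316\274')    # U+03BC as 2 UTF-8 bytes
qmarks_in_c_locale "$mu"   # prints 2: each ? matched one byte
```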

Oils only supports UTF-8 locales. If the locale is not UTF-8, Oils prints a
warning to `stderr` at startup. You can silence it with `OILS_LOCALE_OK=1`.

(Note: GNU readline also calls `setlocale()`, but Oils may or may not link
against GNU readline.)
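
A script can make a similar check for itself. This is a sketch, and not the
check that Oils performs at startup: it asks the POSIX `locale` utility for the
current character map. The helper name `is_utf8_locale` is hypothetical, and
the exact spelling `UTF-8` is an assumption that holds for glibc.

```shell
# Hypothetical helper, not part of Oils.
# Succeeds if the current locale's character encoding is reported as UTF-8.
is_utf8_locale() {
  [ "$(locale charmap 2>/dev/null)" = 'UTF-8' ]
}

if is_utf8_locale; then
  echo 'locale is UTF-8'
else
  echo 'warning: locale is not UTF-8' >&2
fi
```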

### Note: Some string operations use libc, and some don't

For example:

- String length like `${#s}` is implemented in Oils code, not `libc`. It
  currently assumes UTF-8.
  - The YSH `trim()` method is also implemented in Oils, not `libc`. It
    decodes UTF-8 to detect Unicode spaces.
- On the other hand, `[[ $s =~ $pat ]]` is implemented with `libc`, so it's
  affected by the locale settings.
  - This is also true of `(s ~ pat)` in YSH.

## Tips

- The GNU `iconv` program converts text from one encoding to another.
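
For example, this sketch converts a Latin-1 byte to UTF-8, so that it can be
handled by operations that expect UTF-8. The `-f` and `-t` options name the
source and target encodings. The encoding name `ISO-8859-1` is assumed to be
available on your system (`iconv -l` lists the supported names).

```shell
# 0xE9 is "é" in ISO-8859-1 (Latin-1). On its own it is not valid UTF-8.
latin1=$(printf '\351')

# Convert to UTF-8: U+00E9 becomes the 2 bytes 0xC3 0xA9.
utf8=$(printf '%s' "$latin1" | iconv -f ISO-8859-1 -t UTF-8)

printf '%s' "$utf8" | od -An -tx1   # shows the bytes c3 a9
```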

## Summary

Oils is more UTF-8 centric than bash:

- Your system locale should be UTF-8
- Some OSH string operations **assume** UTF-8, because they are implemented
  inside Oils. They don't use `libc` string functions that potentially support
  multiple locales.

<!--
(TODO: Oils should support `LANG=C LC_ALL=C` in more cases, like for string
length.)
-->

## Appendix: Language Operations That Involve Unicode

Here are some details.

### OSH / bash

These operations are implemented in Python.

In `osh/string_ops.py`:

- `${#s}` - length in code points
  - OSH gives proper decoding errors; bash returns nonsense
- `${s:1:2}` - index and length are in code points
  - Again, OSH may give decoding errors
- `${x#glob?}` and `${x##glob?}` - see section on glob below

In `builtin/`:

- `printf '%d' \'c` where `c` is an arbitrary character. This is an obscure
  syntax for `ord()`, i.e. getting an integer from an encoded character.

#### Operations That Use Glob Syntax

The libc functions `glob()` and `fnmatch()` accept a pattern, which may have
the `?` wildcard. It stands for a single **code point** (in UTF-8 locales),
not a byte.

Word evaluation uses a `glob()` call:

    echo ?.c  # which files match?

These language constructs result in `fnmatch()` calls:

    ${s#?}  # remove a one-character prefix, quadratic loop for globs

    case $x in ?) echo 'one char' ;; esac

    [[ $x == ? ]]

#### Operations That Involve Regexes (ERE)

Regexes have the wildcard `.`. Like `?` in globs, it stands for a **code
point**. They also have negated classes like `[^a]`, which likewise match a
single code point.

    pat='.'  # single code point
    [[ $x =~ $pat ]]

This construct uses our **glob to ERE translator** for position info:

    echo ${s/?/x}

#### More Locale-aware operations

- `$IFS` word splitting, which also affects the `shSplit()` builtin
  - Doesn't respect unicode in dash, ash, mksh. But it does in bash, yash, and
    zsh with `setopt SH_WORD_SPLIT`. (TODO: Oils could support Unicode in
    `$IFS`.)
- `${foo,}` and `${foo^}` for lowercase / uppercase
  - TODO: For bash compatibility, use `libc` functions?
- `[[ a < b ]]` and `[ a '<' b ]` for sorting
  - TODO: For bash compatibility, use libc `strcoll()`?
- The `$PS1` prompt language has various time `%` codes, which are
  locale-specific.
- In bash, `printf` also makes libc time calls with `%()T`.

Other:

- The prompt width is calculated with `wcswidth()`, which doesn't just count
  code points. It calculates the **display width** of characters, which is
  different in general.

### YSH

- Eggex matching depends on ERE semantics.
  - `mystr ~ / [ \y01 ] /`
  - `case (x) { / dot / }`
- [String methods](ref/chap-type-method.html)
  - `Str.{trim,trimStart,trimEnd}` respect unicode space, like JavaScript does
  - TODO: `Str.{upper,lower}` also need unicode case folding
    - are they different than the bash operations?
  - TODO: `s.split()` doesn't have a default "split by space", which should
    probably respect unicode space, like `trim()` does
- [Builtin functions](ref/chap-builtin-func.html)
  - TODO: `for offset, rune in (runes(mystr))` should decode UTF-8, like Go
  - `strcmp()` should do byte-wise and UTF-8 wise comparisons?
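
The `runes()` item above is a TODO, so the following is only a sketch of what
"decode UTF-8 into (offset, rune) pairs" means, written in portable shell
rather than YSH. The function name `utf8_runes` is hypothetical. It checks only
the structure of each sequence (a lead byte followed by the right number of
continuation bytes). It does not reject every overlong encoding, and it does
not reject surrogates.

```shell
# Hypothetical helper, not part of Oils.
# Print "byte_offset code_point" pairs, one per line, for the string "$1".
# Returns 1 on a malformed or truncated sequence.
# Uses global variables, since POSIX sh has no `local`.
utf8_runes() {
  offset=0 start=0 need=0 cp=0
  # od prints each byte as an unsigned decimal number
  for b in $(printf '%s' "$1" | od -An -v -tu1); do
    if [ "$need" -gt 0 ]; then
      # Expect a continuation byte 10xxxxxx (128..191)
      if [ "$b" -lt 128 ] || [ "$b" -gt 191 ]; then return 1; fi
      cp=$(( (cp << 6) | (b & 63) ))
      need=$(( need - 1 ))
      if [ "$need" -eq 0 ]; then echo "$start $cp"; fi
    elif [ "$b" -lt 128 ]; then
      echo "$offset $b"                              # ASCII
    elif [ "$b" -ge 194 ] && [ "$b" -le 223 ]; then
      start=$offset cp=$(( b & 31 )) need=1          # 2-byte lead
    elif [ "$b" -ge 224 ] && [ "$b" -le 239 ]; then
      start=$offset cp=$(( b & 15 )) need=2          # 3-byte lead
    elif [ "$b" -ge 240 ] && [ "$b" -le 244 ]; then
      start=$offset cp=$(( b & 7 )) need=3           # 4-byte lead
    else
      return 1                                       # not a valid lead byte
    fi
    offset=$(( offset + 1 ))
  done
  [ "$need" -eq 0 ]   # fail if the string ended mid-sequence
}

utf8_runes "$(printf 'a\316\274b')"
# prints:
#   0 97
#   1 956
#   3 98
```

As in Go's `for i, r := range s`, the offsets are byte positions, so they skip
from 1 to 3 across the 2-byte encoding of U+03BC (decimal 956).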

### Data Languages

- Decoding JSON/J8 validates UTF-8
- Encoding JSON/J8 decodes and validates UTF-8
  - So we can distinguish valid UTF-8 and invalid bytes like `\yff`

## More Notes

### List of Low-Level UTF-8 Operations

libc:

- `glob()` and `fnmatch()`
- `regexec()`
- `strcoll()` respects `LC_COLLATE`, which bash probably does
- `tolower() toupper()` - will we use these?

In Python:

- Decode next rune from a position, or previous rune
  - `trimLeft()` and `${s#prefix}` need this
- Decode UTF-8
  - J8 encoding and decoding need this
  - `for r in (runes(x))` needs this
  - respecting surrogate half
    - JSON needs this
- Encode integer rune to UTF-8 sequence
  - J8 needs this, for `\u{3bc}` (currently in `data_lang/j8.py Utf8Encode()`)
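
To illustrate the last operation, here is a sketch of encoding an integer rune
as a UTF-8 byte sequence in portable shell arithmetic. It is not the Oils
implementation in `data_lang/j8.py`, and the function name `utf8_encode` is
hypothetical. It rejects values above U+10FFFF and surrogate halves.

```shell
# Hypothetical helper, not the Oils implementation.
# Print the UTF-8 encoding of one code point, e.g. `utf8_encode 0x3bc`.
# Uses the global variables cp and b, since POSIX sh has no `local`.
utf8_encode() {
  cp=$(( $1 ))   # accepts decimal or 0x-prefixed hex
  if [ "$cp" -lt 0 ] || [ "$cp" -gt 1114111 ]; then
    return 1     # outside U+0000..U+10FFFF
  elif [ "$cp" -ge 55296 ] && [ "$cp" -le 57343 ]; then
    return 1     # surrogate halves U+D800..U+DFFF are not encodable
  elif [ "$cp" -lt 128 ]; then
    set -- "$cp"
  elif [ "$cp" -lt 2048 ]; then
    set -- $(( 192 | (cp >> 6) )) $(( 128 | (cp & 63) ))
  elif [ "$cp" -lt 65536 ]; then
    set -- $(( 224 | (cp >> 12) )) $(( 128 | ((cp >> 6) & 63) )) \
           $(( 128 | (cp & 63) ))
  else
    set -- $(( 240 | (cp >> 18) )) $(( 128 | ((cp >> 12) & 63) )) \
           $(( 128 | ((cp >> 6) & 63) )) $(( 128 | (cp & 63) ))
  fi
  for b in "$@"; do
    # Emit one byte by building an octal escape such as \316
    printf "\\$(printf '%03o' "$b")"
  done
}

utf8_encode 0x3bc | od -An -tx1   # shows the bytes ce bc
```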

Not sure:

- Case folding
  - both OSH and YSH have uppercase and lowercase

### setlocale() calls made by bash, Python, ...

bash:

    $ ltrace -e setlocale bash -c 'echo'
    bash->setlocale(LC_ALL, "") = "en_US.UTF-8"
    ...
    bash->setlocale(LC_CTYPE, "") = "en_US.UTF-8"
    bash->setlocale(LC_COLLATE, "") = "en_US.UTF-8"
    bash->setlocale(LC_MESSAGES, "") = "en_US.UTF-8"
    bash->setlocale(LC_NUMERIC, "") = "en_US.UTF-8"
    bash->setlocale(LC_TIME, "") = "en_US.UTF-8"
    ...

Notes:

- both bash and GNU readline call `setlocale()`.
- I think `LC_ALL` is sufficient?
- I think `LC_COLLATE` affects `glob()` order, which makes bash scripts
  non-deterministic.
  - We ran into this with `spec/task-runner.sh gen-task-file`, which does a
    glob of `*/*.test.sh`. James Chen-Smith ran it with the equivalent of
    LANG=C, which scrambled the order.
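
The effect of collation on ordering is easy to see with `sort`, which also
orders its input according to the locale. A minimal sketch: pinning `LC_ALL=C`
for a single command gives byte-wise order, no matter what the caller's locale
is. The helper name `sort_bytewise` and the file names are hypothetical.

```shell
# Hypothetical helper: sort stdin in byte order, regardless of the caller's
# LC_COLLATE / LANG settings.
sort_bytewise() {
  LC_ALL=C sort
}

printf '%s\n' b.test.sh A.test.sh a.test.sh | sort_bytewise
# Byte order puts uppercase first: A.test.sh, a.test.sh, b.test.sh
```

Scripts that need a reproducible order can apply the same idea to a list of
file names, instead of relying on the order in which a glob expands.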

Python 2 and 3 mostly agree:

    $ ltrace -e setlocale python3 -c 'print()'
    python3->setlocale(LC_CTYPE, nil) = "C"
    python3->setlocale(LC_CTYPE, "") = "en_US.UTF-8"

Python only calls `setlocale()` for `LC_CTYPE`, not `LC_ALL`.

<!--
## Spec Tests

June 2024 notes:

- `spec/var-op-patsub` has failing cases, e.g. `LC_ALL=C`
  - ${s//?/a}
- glob() and fnmatch() seem to be OK? As long as locale is UTF-8.

-->

<!--

What libraries are we using?

TODO: Make sure these are UTF-8 mode, regardless of LANG global variables?

Or maybe we punt on that, and say Oils is only valid in UTF-8 mode? Need to
investigate the API more.

- fnmatch()
- glob()
- regcomp/regexec()

- Are we using any re2c unicode? For JSON?
- upper() and lower()? isupper() is lower()
  - Need to sort these out

-->