doc/unicode.md


---
default_highlighter: oils-sh
---

Unicode in Oils
===============

Roughly speaking, you can divide programming languages into 3 categories with
respect to Unicode strings:

1. UTF-8 - Go, Rust, Julia, ..., Oils
1. UTF-16 - Java, JavaScript, ...
1. UTF-32 aka Unicode code points - Python 2 and 3, C and C++, ...

So Oils is in the first category: it's UTF-8 centric.

Let's see what this means, both in terms of your mental model when writing OSH
and YSH, and in terms of the Oils implementation.

<div id="toc">
</div>

## Example: The Length of a String

The Oils runtime has a single `Str` [data type](types.html), which is used by
both OSH and YSH.

A `Str` is an array of bytes, which may or may not be UTF-8 encoded. For
example:

    s=$'\u03bc'     # 1 code point, which is UTF-8 encoded as 2 bytes

    echo ${#s}      # => 1 code point (regardless of locale, right now)

    echo $[len(s)]  # => 2 bytes

That is, the YSH feature `len(mystr)` returns the length in bytes. But the
shell feature `${#s}` decodes the string as UTF-8, and returns the length in
code points.

Again, this string storage model is like Go and Julia, but different than
JavaScript (UTF-16) and Python (code points).
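
The same two lengths can also be computed with standard tools, without relying
on any shell's `${#s}` semantics. The following is a minimal sketch in portable
shell. The helper names `byte_len` and `code_point_len` are hypothetical (they
are not Oils features), and the code point count assumes that the input is
valid UTF-8.

```shell
# Hypothetical helpers, not part of Oils.
mu=$(printf '\316\274')    # the bytes CE BC, which are U+03BC in UTF-8

# Length in bytes: wc -c counts bytes.
byte_len() {
  printf '%s' "$1" | wc -c | tr -d ' '
}

# Length in code points, ASSUMING valid UTF-8: every code point has exactly
# one byte that is not a continuation byte (0x80-0xBF, octal 200-277), so
# delete the continuation bytes and count what is left.
code_point_len() {
  printf '%s' "$1" | LC_ALL=C tr -d '\200-\277' | wc -c | tr -d ' '
}

byte_len "$mu"          # prints 2
code_point_len "$mu"    # prints 1
```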

### Note on bash

`bash` does support multiple lengths, but in a way that depends on global
variables:

    s=$'\u03bc'  # one code point

    echo ${#s}   # => 1, when, say, LANG=C.UTF-8

    LC_ALL=C     # libc setlocale() called under the hood
    echo ${#s}   # => 2 bytes, now that LC_ALL=C

So bash doesn't seem to fall cleanly into one of the 3 categories above.

It would be interesting to test bash with non-UTF-8 libc locales like Shift JIS
(Japanese), but they are rare. In practice, the locale is almost always C or
UTF-8, so bash and Oils are similar.

But Oils is more strict about UTF-8, and YSH discourages global variables like
`LC_ALL`.

(TODO: For compatibility, OSH should call `setlocale()` when assigning
`LC_ALL=C`.)
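
Because `LC_ALL` is global, one way to limit its effect in bash is to change it
only inside a subshell. This is a minimal sketch that assumes bash behaves as
described above. `byte_length` is a hypothetical helper, and per the TODO above
the same trick does not yet carry over to OSH.

```shell
# Hypothetical helper.  The ( ) function body runs in a subshell, so the
# LC_ALL assignment does not leak into the caller.
byte_length() (
  LC_ALL=C
  printf '%s\n' "${#1}"
)

mu=$(printf '\316\274')   # U+03BC as the UTF-8 bytes CE BC
byte_length "$mu"         # prints 2 in bash
```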

<!--
- Python: like bash, strings are logically an array of code points.
- JavaScript: a string is an array of 16-bit code units (UTF-16).

So, unlike those 3 languages, Oils is UTF-8 centric.
-->

## Code Strings and Data Strings

### OSH vs. YSH

For backward compatibility, OSH source files may have arbitrary bytes. For
example, `echo [the literal byte 0xFF]` is a valid source file.

In contrast, YSH source files must be encoded in UTF-8, including its ASCII
subset. (TODO: Enforce this with `shopt --set utf8_source`)

If you write C-escaped strings, then your source file can be ASCII:

    echo $'\u03bc'   # bash style

    echo u'\u{3bc}'  # YSH style

If you write UTF-8 characters, then your source is UTF-8:

<pre>
echo 'μ'
</pre>
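
To see which bytes a string actually contains, it helps to dump them in hex.
The sketch below uses only `printf`, `od` and `tr`. `hex_bytes` is a
hypothetical helper name, and the input is built from octal escapes so that the
snippet itself stays ASCII.

```shell
# Hypothetical helper: print the bytes of $1 as lowercase hex, no separators.
hex_bytes() {
  printf '%s' "$1" | od -An -v -tx1 | tr -d ' \n'
  echo
}

hex_bytes "$(printf '\316\274')"   # prints cebc, the UTF-8 encoding of U+03BC
hex_bytes 'abc'                    # prints 616263
```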

### Data Encoding

As mentioned, strings in OSH and YSH are arbitrary sequences of bytes,
which may or may not be valid UTF-8.

Some operations like length `${#s}` and slicing `${s:1:3}` require the string
to be valid UTF-8. Decoding errors are fatal if `shopt -s
strict_word_eval` is on.
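
Data from an untrusted source can be checked before it reaches these
operations. The following sketch assumes that an `iconv` command is installed.
`is_valid_utf8` is a hypothetical helper, not an Oils feature. Converting from
UTF-8 to UTF-8 changes nothing, but `iconv` exits with a non-zero status when
the input can't be decoded.

```shell
# Hypothetical helper: succeeds if $1 decodes as UTF-8 (requires iconv).
is_valid_utf8() {
  printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

good=$(printf '\316\274')   # U+03BC
bad=$(printf 'a\377')       # the byte 0xFF never appears in valid UTF-8

if is_valid_utf8 "$good"; then echo 'good: valid'; fi
if ! is_valid_utf8 "$bad"; then echo 'bad: invalid'; fi
```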

### Passing Data to libc / the Kernel

When passed to external programs, strings are truncated at the first `NUL`
(`'\0'`) byte. This is a consequence of how Unix and C work.
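
In practice, data that may contain `NUL` bytes has to travel through a pipe or
a file rather than through an argument. A small sketch with standard tools; the
record contents and the helper name `nul_to_lines` are made up for
illustration.

```shell
# Hypothetical helper: turn NUL-terminated records on stdin into lines.
nul_to_lines() {
  tr '\0' '\n'
}

# Two NUL-terminated records written to a pipe: all 8 bytes arrive.
printf 'one\0two\0' | wc -c

# One record per line: "one", then "two".
printf 'one\0two\0' | nul_to_lines
```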

## Your System Locale Should Be UTF-8

At startup, Oils calls the `libc` function `setlocale()`, which initializes the
global variables from environment variables like `LC_CTYPE` and `LC_COLLATE`.
(For details, see [osh-locale][] and [ysh-locale][].)

[osh-locale]: ref/chap-special-var.html#osh-locale
[ysh-locale]: ref/chap-special-var.html#ysh-locale

These global variables determine how `libc` string operations like `tolower()`,
`glob()`, and `regexec()` behave.

For example:

- In `glob()` syntax, does `?` match a byte or a code point?
- In `regcomp()` syntax, does `.` match a byte or a code point?

Oils only supports UTF-8 locales. If the locale is not UTF-8, Oils prints a
warning to `stderr` at startup. You can silence it with `OILS_LOCALE_OK=1`.

(Note: GNU readline also calls `setlocale()`, but Oils may or may not link
against GNU readline.)
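
A script can make a similar check for itself before relying on
locale-dependent behavior. This is only a sketch: it asks the POSIX `locale`
utility for the current character map, which is not necessarily how Oils
performs its own check, and the helper names are hypothetical.

```shell
# Hypothetical helper: true if the given charmap name denotes UTF-8.
charmap_is_utf8() {
  case $1 in
    UTF-8 | utf-8 | UTF8 | utf8) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: true if `locale charmap` reports UTF-8.
locale_is_utf8() {
  charmap_is_utf8 "$(locale charmap 2>/dev/null)"
}

if ! locale_is_utf8; then
  echo 'warning: the locale is not UTF-8' >&2
fi
```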

### Note: Some string operations use libc, and some don't

For example:

- String length like `${#s}` is implemented in Oils code, not `libc`. It
  currently assumes UTF-8.
- The YSH `trim()` method is also implemented in Oils, not `libc`. It
  decodes UTF-8 to detect Unicode spaces.
- On the other hand, `[[ $s =~ $pat ]]` is implemented with `libc`, so it's
  affected by the locale settings.
  - This is also true of `(s ~ pat)` in YSH.

## Tips

- The GNU `iconv` program converts text from one encoding to another.
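
For example, text in a legacy single-byte encoding can be converted to UTF-8
before a script processes it. A minimal sketch, assuming `iconv` is installed
and accepts the encoding name `ISO-8859-1`. The helper name `latin1_to_utf8`
and the input byte are made up for illustration.

```shell
# Hypothetical helper: convert ISO-8859-1 text on stdin to UTF-8 on stdout.
latin1_to_utf8() {
  iconv -f ISO-8859-1 -t UTF-8
}

# 0xB5 (octal 265) is the micro sign U+00B5 in ISO-8859-1.
# Its UTF-8 encoding is the two bytes C2 B5.
printf '\265' | latin1_to_utf8 | od -An -v -tx1
```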

## Summary

Oils is more UTF-8 centric than bash:

- Your system locale should be UTF-8
- Some OSH string operations assume UTF-8, because they are implemented
  inside Oils. They don't use `libc` string functions that potentially support
  multiple locales.

<!--
(TODO: Oils should support `LANG=C LC_ALL=C` in more cases, like for string
length.)
-->

## Appendix: Language Operations That Involve Unicode

Here are some details.

### OSH / bash

These operations are implemented in Python.

In `osh/string_ops.py`:

- `${#s}` - length in code points
  - OSH gives proper decoding errors; bash returns nonsense
- `${s:1:2}` - index and length are in code points
  - Again, OSH may give decoding errors
- `${x#glob?}` and `${x##glob?}` - see section on glob below

In `builtin/`:

- `printf '%d' \'c` where `c` is an arbitrary character. This is an obscure
  syntax for `ord()`, i.e. getting an integer from an encoded character.

#### Operations That Use Glob Syntax

The libc functions `glob()` and `fnmatch()` accept a pattern, which may have
the `?` wildcard. It stands for a single code point (in UTF-8 locales),
not a byte.

Word evaluation uses a `glob()` call:

    echo ?.c  # which files match?

These language constructs result in `fnmatch()` calls:

    ${s#?}  # remove one character prefix, quadratic loop for globs

    case $x in ?) echo 'one char' ;; esac

    [[ $x == ? ]]
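
Here is a short sketch of these constructs on ASCII data, where bytes and code
points coincide, so the result does not depend on the locale. `is_one_char` is
a hypothetical helper name.

```shell
s=abc
echo "${s#?}"    # prints bc: the shortest match of ? is removed from the front
echo "${s%?}"    # prints ab: the shortest match of ? is removed from the end

# Hypothetical helper: succeeds only if $1 is exactly one character according
# to the shell's pattern matching.  For non-ASCII input, whether a "character"
# is a byte or a code point depends on the shell and the locale.
is_one_char() {
  case $1 in
    ?) return 0 ;;
    *) return 1 ;;
  esac
}

is_one_char x  && echo 'x: one char'
is_one_char xy || echo 'xy: not one char'
```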

#### Operations That Involve Regexes (ERE)

Regexes have the wildcard `.`. Like `?` in globs, it stands for a **code
point**. They also have `[^a]`, which likewise matches a single code point.

    pat='.'  # single code point
    [[ $x =~ $pat ]]

This construct uses our glob to ERE translator for position info:

    echo ${s/?/x}

#### More Locale-aware operations

- `$IFS` word splitting, which also affects the `shSplit()` builtin
  - Doesn't respect unicode in dash, ash, mksh. But it does in bash, yash, and
    zsh with `setopt SH_WORD_SPLIT`. (TODO: Oils could support Unicode in
    `$IFS`.)
- `${foo,}` and `${foo^}` for lowercase / uppercase
  - TODO: For bash compatibility, use `libc` functions?
- `[[ a < b ]]` and `[ a '<' b ]` for sorting
  - TODO: For bash compatibility, use libc `strcoll()`?
- The `$PS1` prompt language has various time `%` codes, which are
  locale-specific.
- In bash, `printf` also makes libc time calls with `%()T`.

Other:

- The prompt width is calculated with `wcswidth()`, which doesn't just count
  code points. It calculates the display width of characters, which is
  different in general.

### YSH

- Eggex matching depends on ERE semantics.
  - `mystr ~ / [ \y01 ] /`
  - `case (x) { / dot / }`
- [String methods](ref/chap-type-method.html)
  - `Str.{trim,trimStart,trimEnd}` respect unicode space, like JavaScript does
  - TODO: `Str.{upper,lower}` also need unicode case folding
    - are they different than the bash operations?
  - TODO: `s.split()` doesn't have a default "split by space", which should
    probably respect unicode space, like `trim()` does
- [Builtin functions](ref/chap-builtin-func.html)
  - TODO: `for offset, rune in (runes(mystr))` should decode UTF-8, like Go
  - `strcmp()` should do byte-wise and UTF-8 wise comparisons?

### Data Languages

- Decoding JSON/J8 validates UTF-8
- Encoding JSON/J8 decodes and validates UTF-8
  - So we can distinguish valid UTF-8 and invalid bytes like `\yff`

## More Notes

### List of Low-Level UTF-8 Operations

libc:

- `glob()` and `fnmatch()`
- `regexec()`
- `strcoll()` respects `LC_COLLATE`, which bash probably does
- `tolower() toupper()` - will we use these?

In Python:

- Decode next rune from a position, or previous rune
  - `trimLeft()` and `${s#prefix}` need this
- Decode UTF-8
  - J8 encoding and decoding need this
  - `for r in (runes(x))` needs this
  - respecting surrogate half
    - JSON needs this
- Encode integer rune to UTF-8 sequence
  - J8 needs this, for `\u{3bc}` (currently in `data_lang/j8.py Utf8Encode()`)

Not sure:

- Case folding
  - both OSH and YSH have uppercase and lowercase
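
As an illustration of the "encode integer rune to UTF-8 sequence" operation
listed above, here is a sketch in portable shell arithmetic. It is not the
`Utf8Encode()` function from `data_lang/j8.py`. The name `utf8_encode_hex` and
the choice to print hex digits are assumptions made for this example. Surrogate
halves and values above U+10FFFF are rejected, since they have no valid UTF-8
encoding.

```shell
# Hypothetical helper: print the UTF-8 encoding of code point $1 as hex digits.
# The ( ) body runs in a subshell, so the variable cp stays local.
utf8_encode_hex() (
  cp=$(( $1 ))    # accepts decimal or 0x-prefixed hex
  if [ "$cp" -lt 0 ] || [ "$cp" -gt 1114111 ]; then       # above U+10FFFF
    return 1
  elif [ "$cp" -ge 55296 ] && [ "$cp" -le 57343 ]; then   # U+D800..U+DFFF
    return 1
  elif [ "$cp" -lt 128 ]; then      # 1 byte:  0xxxxxxx
    printf '%02x\n' "$cp"
  elif [ "$cp" -lt 2048 ]; then     # 2 bytes: 110xxxxx 10xxxxxx
    printf '%02x%02x\n' \
      $(( 0xC0 | (cp >> 6) )) \
      $(( 0x80 | (cp & 0x3F) ))
  elif [ "$cp" -lt 65536 ]; then    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    printf '%02x%02x%02x\n' \
      $(( 0xE0 | (cp >> 12) )) \
      $(( 0x80 | ((cp >> 6) & 0x3F) )) \
      $(( 0x80 | (cp & 0x3F) ))
  else                              # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    printf '%02x%02x%02x%02x\n' \
      $(( 0xF0 | (cp >> 18) )) \
      $(( 0x80 | ((cp >> 12) & 0x3F) )) \
      $(( 0x80 | ((cp >> 6) & 0x3F) )) \
      $(( 0x80 | (cp & 0x3F) ))
  fi
)

utf8_encode_hex 0x3bc      # prints cebc
utf8_encode_hex 0x20ac     # prints e282ac
utf8_encode_hex 0x1f600    # prints f09f9880
```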

### setlocale() calls made by bash, Python, ...

bash:

    $ ltrace -e setlocale bash -c 'echo'
    bash->setlocale(LC_ALL, "") = "en_US.UTF-8"
    ...
    bash->setlocale(LC_CTYPE, "") = "en_US.UTF-8"
    bash->setlocale(LC_COLLATE, "") = "en_US.UTF-8"
    bash->setlocale(LC_MESSAGES, "") = "en_US.UTF-8"
    bash->setlocale(LC_NUMERIC, "") = "en_US.UTF-8"
    bash->setlocale(LC_TIME, "") = "en_US.UTF-8"
    ...

Notes:

- Both bash and GNU readline call `setlocale()`.
- I think `LC_ALL` is sufficient?
- I think `LC_COLLATE` affects `glob()` order, which makes bash scripts
  non-deterministic.
  - We ran into this with `spec/task-runner.sh gen-task-file`, which does a
    glob of `/.test.sh`. James Chen-Smith ran it with the equivalent of
    LANG=C, which scrambled the order.
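
One way to make such an ordering reproducible is to sort explicitly with a
pinned locale, instead of relying on the order that `glob()` returns. A sketch,
where `sort_bytewise` is a hypothetical helper and the file names are made up:

```shell
# Hypothetical helper: sort lines by byte value, independent of the caller's
# LC_COLLATE setting.
sort_bytewise() {
  LC_ALL=C sort
}

printf '%s\n' b.test.sh A.test.sh a.test.sh B.test.sh | sort_bytewise
# prints A.test.sh, B.test.sh, a.test.sh, b.test.sh, one per line
```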

Python 2 and 3 mostly agree:

    $ ltrace -e setlocale python3 -c 'print()'
    python3->setlocale(LC_CTYPE, nil) = "C"
    python3->setlocale(LC_CTYPE, "") = "en_US.UTF-8"

It only calls it for `LC_CTYPE`, not `LC_ALL`.

<!--
## Spec Tests

June 2024 notes:

- `spec/var-op-patsub` has failing cases, e.g. `LC_ALL=C`
  - ${s//?/a}
- glob() and fnmatch() seem to be OK? As long as locale is UTF-8.

-->

<!--

What libraries are we using?

TODO: Make sure these are UTF-8 mode, regardless of LANG global variables?

Or maybe we punt on that, and say Oils is only valid in UTF-8 mode? Need to
investigate the API more.

- fnmatch()
- glob()
- regcomp/regexec()

- Are we using any re2c unicode? For JSON?
- upper() and lower()? isupper() islower()
  - Need to sort these out

-->