doc/unicode.md


---
default_highlighter: oils-sh
in_progress: yes
---

Notes on Unicode in Shell
=========================

<div id="toc">
</div>

## Philosophy

Oils is UTF-8 centric, unlike `bash` and other shells.

That is, its Unicode support is like Go, Rust, Julia, and Swift, and not like
Python or JavaScript. The former languages internally represent strings as
UTF-8, while the latter use arrays of code points or UTF-16 code units,
respectively.

## A Mental Model

### Program Encoding - OSH vs. YSH

- The source files of OSH programs may have arbitrary bytes, for backward
  compatibility.
- The source files of YSH programs should be encoded in UTF-8 (or its ASCII
  subset). TODO: Enforce this with `shopt --set utf8_source`

Unicode characters can be encoded directly in the source:

<pre>
echo 'μ'
</pre>

or denoted in ASCII with C-escaped strings:

    echo $'\u03bc'  # bash style

    echo u'\u{3bc}'  # YSH style

(Such strings are preferred over `echo -e` because they're statically parsed.)
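
As a rough illustration of what these escapes denote, the sketch below spells
out the UTF-8 bytes of `μ` (U+03BC) with byte escapes and dumps them with `od`.
The variable names are made up for this example, and it assumes a shell that
supports `$'...'` strings, such as bash or OSH.

```shell
# Assumption: bash or OSH, which both support $'...' strings.
mu=$'\xce\xbc'   # the two UTF-8 bytes that encode U+03BC

# Dump the bytes as hex and squeeze out od's whitespace.
hex=$(printf '%s' "$mu" | od -An -tx1 | tr -d ' \n')
echo "$hex"   # cebc
```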

### Data Encoding

Strings in OSH are arbitrary sequences of bytes, which may or may not be
valid UTF-8. Details:

- When passed to external programs, strings are truncated at the first `NUL`
  (`'\0'`) byte. This is a consequence of how Unix and C work.
- Some operations like length `${#s}` and slicing `${s:1:3}` require the string
  to be valid UTF-8. Decoding errors are fatal if
  `shopt -s strict_word_eval` is on.

## List of Features That Respect Unicode

### OSH / bash

These operations are implemented in Python.

In `osh/string_ops.py`:

- `${#s}` -- length in code points (buggy in bash)
  - Note: YSH `len(s)` returns a number of bytes, not code points.
- `${s:1:2}` -- index and length are a number of code points
- `${x#glob?}` and `${x##glob?}` (see below)
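
To make the difference between bytes and code points concrete, here is a
minimal sketch that counts both for a short string. `count_code_points` is a
hypothetical helper written for this example, not how Oils implements `${#s}`.
It assumes its input is valid UTF-8 and counts every byte that is not a
continuation byte.

```shell
# Hypothetical helper, not part of Oils: count code points in a string that is
# assumed to be valid UTF-8, by skipping continuation bytes (0x80..0xBF).
count_code_points() {
  local n=0 b
  # od prints each byte as an unsigned decimal number.
  for b in $(printf '%s' "$1" | od -An -v -tu1); do
    if (( b < 128 || b > 191 )); then
      n=$(( n + 1 ))
    fi
  done
  echo "$n"
}

s=$'a\xce\xbcb'                                # a, μ, b
num_bytes=$(( $(printf '%s' "$s" | wc -c) ))   # 4
num_points=$(count_code_points "$s")           # 3
echo "bytes=$num_bytes code_points=$num_points"
```

Working on raw bytes keeps the result independent of the locale.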

In `builtin/`:

- `printf '%d' \'c` where `c` is an arbitrary character. This is an obscure
  syntax for `ord()`, i.e. getting an integer from an encoded character.
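
A small example of this syntax follows. Only the ASCII case has a stated
result. For a non-ASCII character the printed value may depend on the shell
and the locale, so the second call is shown without an expected value.

```shell
# The leading quote makes printf treat the argument as a character and print
# its numeric value.
code=$(printf '%d' \'a)
echo "$code"   # 97

# Non-ASCII: the result may vary by shell and locale, so none is claimed here.
printf '%d\n' \'μ
```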

More:

- `$IFS` word splitting. Affects `shSplit()` builtin
  - Doesn't respect unicode in dash, ash, mksh. But it does in bash, yash, and
    zsh with `setopt SH_WORD_SPLIT`.
  - TODO: Oils should probably respect it
- `${foo,}` and `${foo^}` for lowercase / uppercase
  - TODO: doesn't respect unicode
- `[[ a < b ]]` and `[ a '<' b ]` for sorting
  - these can use libc `strcoll()`?

#### Globs

Globs have character classes like `[^a]` and the single-character wildcard `?`.

This pattern results in a `glob()` call:

    echo my?glob

These patterns result in `fnmatch()` calls:

    case $x in ?) echo 'one char' ;; esac

    [[ $x == ? ]]

    ${s#?}  # remove one character prefix, quadratic loop for globs

This uses our glob to ERE translator for position info:

    echo ${s/?/x}
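
The constructs above can be tried with plain ASCII input, where every shell
agrees on what one character is. This is only a sketch with made-up variable
names. For non-ASCII input such as `μ`, whether `?` consumes a whole code
point or a single byte depends on the shell and, for libc-backed matching, on
the locale.

```shell
s='abc'

# fnmatch-style matching: ? stands for exactly one character.
case 'a' in
  ?) one_char=yes ;;
  *) one_char=no ;;
esac

prefix_removed=${s#?}    # bc
first_replaced=${s/?/x}  # xbc

echo "$one_char $prefix_removed $first_replaced"   # yes bc xbc
```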

#### Regexes (ERE)

Regexes have character classes like `[^a]` and the single-character wildcard `.`:

    pat='.'  # single "character"
    [[ $x =~ $pat ]]
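
An unanchored `.` matches any non-empty string, so a test for "exactly one
character" needs anchors. The sketch below uses ASCII input only. With
non-ASCII input the outcome depends on how libc interprets the bytes under the
current locale.

```shell
pat='^.$'   # exactly one "character", anchored at both ends

if [[ a =~ $pat ]]; then one=match; else one=no-match; fi
if [[ ab =~ $pat ]]; then two=match; else two=no-match; fi

echo "$one $two"   # match no-match
```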

#### Locale-aware operations

- Prompt string has time, which is locale-specific.
- In bash, `printf` also has time.

Other:

- The prompt width is calculated with `wcswidth()`, which doesn't just count
  code points. It calculates the display width of characters, which is
  different in general.

### YSH

- Eggex matching depends on ERE semantics.
  - `mystr ~ / [ \xff ] /`
  - `case (x) { / dot / }`
- `Str.{trim,trimLeft,trimRight}` respect unicode space, like JavaScript does
- TODO: `Str.{upper,lower}` also need unicode case folding
- TODO: `s.split()` doesn't have a default "split by space", which should
  probably respect unicode space, like `trim()` does
- TODO: `for offset, rune in (runes(mystr))` decodes UTF-8, like Go

Not unicode aware:

- `strcmp()` does byte-wise and UTF-8 wise comparisons?

### Data Languages

- Decoding JSON/J8 validates UTF-8
- Encoding JSON/J8 decodes and validates UTF-8
  - So we can distinguish valid UTF-8 and invalid bytes like `\yff`

## Implementation Notes

Unlike bash and CPython, Oils doesn't call `setlocale()`. (Although GNU
readline may call it.)

It's expected that your locale will respect UTF-8. This is true on most
distros. If not, then some string operations will support UTF-8 and some
won't.

For example:

- String length like `${#s}` is implemented in Oils code, not libc, so it will
  always respect UTF-8.
- `[[ s =~ $pat ]]` is implemented with libc, so it is affected by the locale
  settings. Same with Oils `(x ~ pat)`.
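
For contrast, the sketch below shows how the locale changes the answer in
bash, where `${#s}` follows the current locale. It assumes it is run under
bash. Under OSH, going by the description above, the length should not depend
on the locale.

```shell
s=$'\xce\xbc'   # μ encoded as two UTF-8 bytes

# Assumption: this runs under bash, where ${#s} follows the current locale.
# In the C locale every byte counts as one character.
len_in_c_locale=$(LC_ALL=C; echo "${#s}")
echo "$len_in_c_locale"   # 2 under bash
```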

TODO: Oils should support `LANG=C` for some operations, but not `LANG=X` for
other `X`.

### List of Low-Level UTF-8 Operations

libc:

- `glob()` and `fnmatch()`
- `regexec()`
- `strcoll()` respects `LC_COLLATE`, which bash probably respects too

Our own:

- Decode next rune from a position, or previous rune
  - `trimLeft()` and `${s#prefix}` need this
- Decode UTF-8
  - J8 encoding and decoding need this
  - `for r in (runes(x))` needs this
  - respecting surrogate half
    - JSON needs this
- Encode integer rune to UTF-8 sequence
  - J8 needs this, for `\u{3bc}` (currently in `data_lang/j8.py Utf8Encode()`)
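
The last operation, encoding an integer rune as a UTF-8 sequence, can be
sketched in a few lines of shell arithmetic. `utf8_encode_hex` is a
hypothetical function written for this example. It is not the code in
`data_lang/j8.py`, and it prints hex bytes instead of emitting raw bytes so
that the output is easy to inspect.

```shell
# Hypothetical helper: print the UTF-8 encoding of code point $1 as hex bytes.
# Returns 1 for surrogate halves and for values above U+10FFFF.
utf8_encode_hex() {
  local cp=$(( $1 ))
  if (( cp < 0 || cp > 0x10FFFF )); then
    return 1
  elif (( cp >= 0xD800 && cp <= 0xDFFF )); then
    return 1   # surrogate halves are not valid code points to encode
  elif (( cp < 0x80 )); then
    printf '%02x\n' "$cp"
  elif (( cp < 0x800 )); then
    printf '%02x %02x\n' \
      $(( 0xC0 | (cp >> 6) )) $(( 0x80 | (cp & 0x3F) ))
  elif (( cp < 0x10000 )); then
    printf '%02x %02x %02x\n' \
      $(( 0xE0 | (cp >> 12) )) $(( 0x80 | ((cp >> 6) & 0x3F) )) \
      $(( 0x80 | (cp & 0x3F) ))
  else
    printf '%02x %02x %02x %02x\n' \
      $(( 0xF0 | (cp >> 18) )) $(( 0x80 | ((cp >> 12) & 0x3F) )) \
      $(( 0x80 | ((cp >> 6) & 0x3F) )) $(( 0x80 | (cp & 0x3F) ))
  fi
}

utf8_encode_hex 0x3bc     # ce bc
utf8_encode_hex 0x20ac    # e2 82 ac
utf8_encode_hex 0x1f600   # f0 9f 98 80
```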

Not sure:

- Case folding
  - both OSH and YSH have uppercase and lowercase

## Tips

- The GNU `iconv` program converts text from one encoding to another.
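
For instance, the following sketch converts a single ISO-8859-1 byte to UTF-8
and prints the result as hex. It assumes an `iconv` that accepts the encoding
names `ISO-8859-1` and `UTF-8`. The variable names are made up for this
example.

```shell
# ISO-8859-1 encodes the micro sign U+00B5 as the single byte 0xB5.
latin1=$'\xb5'

# Convert to UTF-8 and dump the resulting bytes as hex.
utf8_hex=$(printf '%s' "$latin1" | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1 | tr -d ' \n')
echo "$utf8_hex"   # c2b5
```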

<!--
## Spec Tests

June 2024 notes:

- `spec/var-op-patsub` has failing cases, e.g. `LC_ALL=C`
  - ${s//?/a}
- glob() and fnmatch() seem to be OK? As long as locale is UTF-8.

-->

<!--

What libraries are we using?

TODO: Make sure these are UTF-8 mode, regardless of LANG global variables?

Or maybe we punt on that, and say Oils is only valid in UTF-8 mode? Need to
investigate the API more.

- fnmatch()
- glob()
- regcomp/regexec()

- Are we using any re2c unicode? For JSON?
- upper() and lower()? isupper() islower()
  - Need to sort these out

-->