| 1 | ---
|
| 2 | default_highlighter: oils-sh
|
| 3 | ---
|
| 4 |
|
| 5 | Unicode in Oils
|
| 6 | ===============
|
| 7 |
|
| 8 | Roughly speaking, you can divide programming languages into 3 categories with
|
| 9 | respect to Unicode strings:
|
| 10 |
|
| 11 | 1. **UTF-8** - Go, Rust, Julia, ..., Oils
|
| 12 | 1. **UTF-16** - Java, JavaScript, ...
|
| 13 | 1. **UTF-32** aka Unicode code points - Python 2 and 3, C and C++, ...
|
| 14 |
|
| 15 | So Oils is in the **first** category: it's UTF-8 centric.
|
| 16 |
|
| 17 | Let's see what this means, both in terms of your mental model when writing OSH
|
| 18 | and YSH, and in terms of the Oils implementation.
|
| 19 |
|
| 20 | <div id="toc">
|
| 21 | </div>
|
| 22 |
|
| 23 | ## Example: The Length of a String
|
| 24 |
|
| 25 | The Oils runtime has a single `Str` [data type](types.html), which is used by
|
| 26 | both OSH and YSH.
|
| 27 |
|
| 28 | A `Str` is an array of bytes, which **may or may not be** UTF-8 encoded. For
|
| 29 | example:
|
| 30 |
|
| 31 | s=$'\u03bc' # 1 code point, which is UTF-8 encoded as 2 bytes
|
| 32 |
|
| 33 | echo ${#s} # => 1 code point (regardless of locale, right now)
|
| 34 |
|
| 35 | echo $[len(s)] # => 2 bytes
|
| 36 |
|
| 37 | That is, the YSH feature `len(mystr)` returns the length in **bytes**. But the
|
| 38 | shell feature `${#s}` *decodes* the string as UTF-8, and returns the length in
|
| 39 | **code points**.
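The two lengths can also be checked without the shell's own length operators. The following is a minimal sketch, assuming a POSIX shell with the standard `od`, `tr`, `awk`, and `wc` utilities; the variable names are illustrative. It builds the string from raw bytes and counts code points as the bytes that are not UTF-8 continuation bytes (`0x80` to `0xBF`), which is only meaningful when the input is valid UTF-8.

```shell
# U+03BC written as its two UTF-8 bytes (octal escapes), so no locale is needed
s=$(printf '\316\274')

# wc -c counts bytes and never decodes
nbytes=$(printf '%s' "$s" | wc -c)

# Dump each byte as a decimal number, one per line, then count the bytes
# that start a code point (anything outside the range 128..191)
npoints=$(printf '%s' "$s" | od -An -v -tu1 | tr -s ' ' '\n' |
  awk '$1 != "" && ($1 < 128 || $1 > 191) { n++ } END { print n + 0 }')

echo "bytes=$((nbytes)) code_points=$npoints"   # bytes=2 code_points=1
```

This mirrors the distinction between `len(s)` and `${#s}`, but it does no validation, so it is not a substitute for a real decoder.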
|
| 40 |
|
| 41 | Again, this string storage model is like Go and Julia, but different from
|
| 42 | JavaScript (UTF-16) and Python (code points).
|
| 43 |
|
| 44 | ### Note on bash
|
| 45 |
|
| 46 | `bash` does support multiple lengths, but in a way that depends on global
|
| 47 | variables:
|
| 48 |
|
| 49 | s=$'\u03bc' # one code point
|
| 50 |
|
| 51 | echo ${#s} # => 1 when, say, LANG=C.UTF-8
|
| 52 |
|
| 53 | LC_ALL=C # libc setlocale() called under the hood
|
| 54 | echo ${#s} # => 2 bytes, now that LC_ALL=C
|
| 55 |
|
| 56 | So bash doesn't seem to fall cleanly into one of the 3 categories above.
|
| 57 |
|
| 58 | It would be interesting to test bash with non-UTF-8 libc locales like Shift JIS
|
| 59 | (Japanese), but they are rare. In practice, the locale is almost always C or
|
| 60 | UTF-8, so bash and Oils are similar.
|
| 61 |
|
| 62 | But Oils is more strict about UTF-8, and YSH discourages global variables like
|
| 63 | `LC_ALL`.
|
| 64 |
|
| 65 | (TODO: For compatibility, OSH should call `setlocale()` when assigning
|
| 66 | `LC_ALL=C`.)
|
| 67 |
|
| 68 | <!--
|
| 69 | - Python: like bash, strings are logically an array of code points.
|
| 70 | - JavaScript: a string is an array of 16-bit code units (UTF-16).
|
| 71 |
|
| 72 | So, unlike those 3 languages, Oils is UTF-8 centric.
|
| 73 | -->
|
| 74 |
|
| 75 | ## Code Strings and Data Strings
|
| 76 |
|
| 77 | ### OSH vs. YSH
|
| 78 |
|
| 79 | For backward compatibility, OSH source files may have **arbitrary bytes**. For
|
| 80 | example, `echo [the literal byte 0xFF]` is a valid source file.
|
| 81 |
|
| 82 | In contrast, YSH source files must be encoded in UTF-8, including its ASCII
|
| 83 | subset. (TODO: Enforce this with `shopt --set utf8_source`)
|
| 84 |
|
| 85 | If you write C-escaped strings, then your source file can be ASCII:
|
| 86 |
|
| 87 | echo $'\u03bc' # bash style
|
| 88 |
|
| 89 | echo u'\u{3bc}' # YSH style
|
| 90 |
|
| 91 | If you write UTF-8 characters, then your source is UTF-8:
|
| 92 |
|
| 93 | <pre>
|
| 94 | echo 'μ'
|
| 95 | </pre>
|
| 96 |
|
| 97 | ### Data Encoding
|
| 98 |
|
| 99 | As mentioned, strings in OSH and YSH are arbitrary sequences of **bytes**,
|
| 100 | which may or may not be valid UTF-8.
|
| 101 |
|
| 102 | Some operations like length `${#s}` and slicing `${s:1:3}` require the string
|
| 103 | to be **valid UTF-8**. Decoding errors are fatal if `shopt -s
|
| 104 | strict_word_eval` is on.
|
| 105 |
|
| 106 | ### Passing Data to libc / the Kernel
|
| 107 |
|
| 108 | When passed to external programs, strings are truncated at the first `NUL`
|
| 109 | (`'\0'`) byte. This is a consequence of how Unix and C work.
|
| 110 |
|
| 111 | ## Your System Locale Should Be UTF-8
|
| 112 |
|
| 113 | At startup, Oils calls the `libc` function `setlocale()`, which initializes the
|
| 114 | global variables from environment variables like `LC_CTYPE` and `LC_COLLATE`.
|
| 115 | (For details, see [osh-locale][] and [ysh-locale][].)
|
| 116 |
|
| 117 | [osh-locale]: ref/chap-special-var.html#osh-locale
|
| 118 | [ysh-locale]: ref/chap-special-var.html#ysh-locale
|
| 119 |
|
| 120 | These global variables determine how `libc` string operations like `tolower()`,
|
| 121 | `glob()`, and `regexec()` behave.
|
| 122 |
|
| 123 | For example:
|
| 124 |
|
| 125 | - In `glob()` syntax, does `?` match a byte or a code point?
|
| 126 | - In `regcomp()` syntax, does `.` match a byte or a code point?
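To make the first question concrete, here is a small sketch, assuming bash (or another POSIX shell) whose pattern matching becomes byte-oriented once `LC_ALL=C` is assigned. It describes that behavior only; OSH may answer differently, since it does not currently react to `LC_ALL` assignments in the same way. The variable name `units` is illustrative.

```shell
s=$(printf '\316\274')   # U+03BC as its 2 UTF-8 bytes

# Run the match in a subshell so the locale change does not leak out.
units=$(
  LC_ALL=C               # byte-oriented locale: ? should match one byte
  case $s in
    (?)  echo 1 ;;
    (??) echo 2 ;;
    (*)  echo other ;;
  esac
)
echo "in the C locale, the string is $units pattern unit(s) long"
```

In bash this reports 2, because `?` consumes one byte at a time in the C locale. In a UTF-8 locale the same string would be expected to match a single `?`.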
|
| 127 |
|
| 128 | Oils only supports UTF-8 locales. If the locale is not UTF-8, Oils prints a
|
| 129 | warning to `stderr` at startup. You can silence it with `OILS_LOCALE_OK=1`.
|
| 130 |
|
| 131 | (Note: GNU readline also calls `setlocale()`, but Oils may or may not link
|
| 132 | against GNU readline.)
|
| 133 |
|
| 134 | ### Note: Some string operations use libc, and some don't
|
| 135 |
|
| 136 | For example:
|
| 137 |
|
| 138 | - String length like `${#s}` is implemented in Oils code, not `libc`. It
|
| 139 | currently assumes UTF-8.
|
| 140 | - The YSH `trim()` method is also implemented in Oils, not `libc`. It
|
| 141 | decodes UTF-8 to detect Unicode spaces.
|
| 142 | - On the other hand, `[[ $s =~ $pat ]]` is implemented with `libc`, so it's
|
| 143 | affected by the locale settings.
|
| 144 | - This is also true of `(s ~ pat)` in YSH.
|
| 145 |
|
| 146 | ## Tips
|
| 147 |
|
| 148 | - The GNU `iconv` program converts text from one encoding to another.
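As a hedged illustration of that tip, the sketch below assumes an `iconv` program that accepts the `-f` and `-t` options and knows the `ISO-8859-1` and `UTF-8` encoding names, plus the standard `od` and `tr` utilities. It converts one Latin-1 byte to UTF-8, then uses a UTF-8 to UTF-8 pass as a rough validity check.

```shell
# 0xE9 is "é" in ISO-8859-1; convert it to UTF-8 and show the bytes in hex.
utf8_hex=$(printf '\351' | iconv -f ISO-8859-1 -t UTF-8 | od -An -tx1 | tr -d ' \n')
echo "$utf8_hex"    # c3a9

# A lone 0xFF byte is never valid UTF-8, so a UTF-8 to UTF-8 pass should fail.
if printf '\377' | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1; then
  valid=yes
else
  valid=no
fi
echo "valid=$valid"
```

The exit status of the pipeline is the exit status of `iconv`, which is what the `if` tests.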
|
| 149 |
|
| 150 | ## Summary
|
| 151 |
|
| 152 | Oils is more UTF-8 centric than bash:
|
| 153 |
|
| 154 | - Your system locale should be UTF-8
|
| 155 | - Some OSH string operations **assume** UTF-8, because they are implemented
|
| 156 | inside Oils. They don't use `libc` string functions that potentially support
|
| 157 | multiple locales.
|
| 158 |
|
| 159 | <!--
|
| 160 | (TODO: Oils should support `LANG=C LC_ALL=C` in more cases, like for string
|
| 161 | length.)
|
| 162 | -->
|
| 163 |
|
| 164 | ## Appendix: Language Operations That Involve Unicode
|
| 165 |
|
| 166 | Here are some details.
|
| 167 |
|
| 168 | ### OSH / bash
|
| 169 |
|
| 170 | These operations are implemented in Python.
|
| 171 |
|
| 172 | In `osh/string_ops.py`:
|
| 173 |
|
| 174 | - `${#s}` - length in code points
|
| 175 | - OSH gives proper decoding errors; bash returns nonsense
|
| 176 | - `${s:1:2}` - index and length are in code points
|
| 177 | - Again, OSH may give decoding errors
|
| 178 | - `${x#glob?}` and `${x##glob?}` - see section on glob below
|
| 179 |
|
| 180 | In `builtin/`:
|
| 181 |
|
| 182 | - `printf '%d' \'c` where `c` is an arbitrary character. This is an obscure
|
| 183 | syntax for `ord()`, i.e. getting an integer from an encoded character.
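A minimal sketch of that `printf` syntax, using an ASCII character so the result does not depend on the locale:

```shell
# A leading quote in the argument asks printf for the character's numeric value.
code=$(printf '%d' "'A")
echo "$code"    # 65
```

For non-ASCII characters, the result depends on how the shell decodes the argument, so it may vary with the shell and the locale.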
|
| 184 |
|
| 185 | #### Operations That Use Glob Syntax
|
| 186 |
|
| 187 | The libc functions `glob()` and `fnmatch()` accept a pattern, which may have
|
| 188 | the `?` wildcard. It stands for a single **code point** (in UTF-8 locales),
|
| 189 | not a byte.
|
| 190 |
|
| 191 | Word evaluation uses a `glob()` call:
|
| 192 |
|
| 193 | echo ?.c # which files match?
|
| 194 |
|
| 195 | These language constructs result in `fnmatch()` calls:
|
| 196 |
|
| 197 | ${s#?} # remove one character prefix, quadratic loop for globs
|
| 198 |
|
| 199 | case $x in ?) echo 'one char' ;; esac
|
| 200 |
|
| 201 | [[ $x == ? ]]
|
| 202 |
|
| 203 | #### Operations That Involve Regexes (ERE)
|
| 204 |
|
| 205 | Regexes have the wildcard `.`. Like `?` in globs, it stands for a **code
|
| 206 | point**. They also have `[^a]`, which stands for a code point.
|
| 207 |
|
| 208 | pat='.' # single code point
|
| 209 | [[ $x =~ $pat ]]
|
| 210 |
|
| 211 | This construct uses our **glob to ERE translator** for position info:
|
| 212 |
|
| 213 | echo ${s/?/x}
|
| 214 |
|
| 215 | #### More Locale-aware operations
|
| 216 |
|
| 217 | - `$IFS` word splitting, which also affects the `shSplit()` builtin
|
| 218 | - Doesn't respect unicode in dash, ash, mksh. But it does in bash, yash, and
|
| 219 | zsh with `setopt SH_WORD_SPLIT`. (TODO: Oils could support Unicode in
|
| 220 | `$IFS`.)
|
| 221 | - `${foo,}` and `${foo^}` for lowercase / uppercase
|
| 222 | - TODO: For bash compatibility, use `libc` functions?
|
| 223 | - `[[ a < b ]]` and `[ a '<' b ]` for sorting
|
| 224 | - TODO: For bash compatibility, use libc `strcoll()`?
|
| 225 | - The `$PS1` prompt language has various time `%` codes, which are
|
| 226 | locale-specific.
|
| 227 | - In bash, `printf` also makes libc time calls with `%()T`.
|
| 228 |
|
| 229 | Other:
|
| 230 |
|
| 231 | - The prompt width is calculated with `wcswidth()`, which doesn't just count
|
| 232 | code points. It calculates the **display width** of characters, which is
|
| 233 | different in general.
|
| 234 |
|
| 235 | ### YSH
|
| 236 |
|
| 237 | - Eggex matching depends on ERE semantics.
|
| 238 | - `mystr ~ / [ \y01 ] /`
|
| 239 | - `case (x) { / dot / }`
|
| 240 | - [String methods](ref/chap-type-method.html)
|
| 241 | - `Str.{trim,trimStart,trimEnd}` respect unicode space, like JavaScript does
|
| 242 | - TODO: `Str.{upper,lower}` also need unicode case folding
|
| 243 | - are they different than the bash operations?
|
| 244 | - TODO: `s.split()` doesn't have a default "split by space", which should
|
| 245 | probably respect unicode space, like `trim()` does
|
| 246 | - [Builtin functions](ref/chap-builtin-func.html)
|
| 247 | - TODO: `for offset, rune in (runes(mystr))` should decode UTF-8, like Go
|
| 248 | - `strcmp()` should do byte-wise and UTF-8 wise comparisons?
|
| 249 |
|
| 250 | ### Data Languages
|
| 251 |
|
| 252 | - Decoding JSON/J8 validates UTF-8
|
| 253 | - Encoding JSON/J8 decodes and validates UTF-8
|
| 254 | - So we can distinguish valid UTF-8 and invalid bytes like `\yff`
|
| 255 |
|
| 256 | ## More Notes
|
| 257 |
|
| 258 | ### List of Low-Level UTF-8 Operations
|
| 259 |
|
| 260 | libc:
|
| 261 |
|
| 262 | - `glob()` and `fnmatch()`
|
| 263 | - `regexec()`
|
| 264 | - `strcoll()` respects `LC_COLLATE`, and bash probably uses it
|
| 265 | - `tolower() toupper()` - will we use these?
|
| 266 |
|
| 267 | In Python:
|
| 268 |
|
| 269 | - Decode next rune from a position, or previous rune
|
| 270 | - `trimStart()` and `${s#prefix}` need this
|
| 271 | - Decode UTF-8
|
| 272 | - J8 encoding and decoding need this
|
| 273 | - `for r in (runes(x))` needs this
|
| 274 | - respecting surrogate half
|
| 275 | - JSON needs this
|
| 276 | - Encode integer rune to UTF-8 sequence
|
| 277 | - J8 needs this, for `\u{3bc}` (currently in `data_lang/j8.py Utf8Encode()`)
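The last operation in the list, encoding an integer rune as a UTF-8 byte sequence, can be sketched in portable shell arithmetic. This is an illustration of the standard UTF-8 bit layout, not the code in `data_lang/j8.py`; the function name `utf8_encode` is hypothetical, and the sketch does not reject surrogates or values above `0x10FFFF`.

```shell
# Hypothetical helper: write the UTF-8 bytes for code point $1 (decimal or 0x hex).
utf8_encode() {
  cp=$(($1))
  if [ "$cp" -lt 128 ]; then            # 1 byte:  0xxxxxxx
    set -- "$cp"
  elif [ "$cp" -lt 2048 ]; then         # 2 bytes: 110xxxxx 10xxxxxx
    set -- $((0xC0 | (cp >> 6))) $((0x80 | (cp & 0x3F)))
  elif [ "$cp" -lt 65536 ]; then        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    set -- $((0xE0 | (cp >> 12))) $((0x80 | ((cp >> 6) & 0x3F))) \
           $((0x80 | (cp & 0x3F)))
  else                                  # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    set -- $((0xF0 | (cp >> 18))) $((0x80 | ((cp >> 12) & 0x3F))) \
           $((0x80 | ((cp >> 6) & 0x3F))) $((0x80 | (cp & 0x3F)))
  fi
  # Emit each byte through a 3-digit octal escape in the printf format string.
  for b do
    printf "\\$(printf '%03o' "$b")"
  done
}

utf8_encode 0x3bc | od -An -tx1    # the bytes ce bc
```

Piping through `od` keeps the demonstration independent of the terminal's encoding.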
|
| 278 |
|
| 279 | Not sure:
|
| 280 |
|
| 281 | - Case folding
|
| 282 | - both OSH and YSH have uppercase and lowercase
|
| 283 |
|
| 284 | ### setlocale() calls made by bash, Python, ...
|
| 285 |
|
| 286 | bash:
|
| 287 |
|
| 288 | $ ltrace -e setlocale bash -c 'echo'
|
| 289 | bash->setlocale(LC_ALL, "") = "en_US.UTF-8"
|
| 290 | ...
|
| 291 | bash->setlocale(LC_CTYPE, "") = "en_US.UTF-8"
|
| 292 | bash->setlocale(LC_COLLATE, "") = "en_US.UTF-8"
|
| 293 | bash->setlocale(LC_MESSAGES, "") = "en_US.UTF-8"
|
| 294 | bash->setlocale(LC_NUMERIC, "") = "en_US.UTF-8"
|
| 295 | bash->setlocale(LC_TIME, "") = "en_US.UTF-8"
|
| 296 | ...
|
| 297 |
|
| 298 | Notes:
|
| 299 |
|
| 300 | - both bash and GNU readline call `setlocale()`.
|
| 301 | - I think `LC_ALL` is sufficient?
|
| 302 | - I think `LC_COLLATE` affects `glob()` order, which makes bash scripts
|
| 303 | non-deterministic.
|
| 304 | - We ran into this with `spec/task-runner.sh gen-task-file`, which does a
|
| 305 | glob of `*/*.test.sh`. James Chen-Smith ran it with the equivalent of
|
| 306 | LANG=C, which scrambled the order.
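One way to take the locale out of the picture is to pin `LC_ALL=C` for the command that performs the glob. The sketch below is an assumption-laden illustration rather than what `spec/task-runner.sh` does: it uses `mktemp -d` for a scratch directory and two made-up file names, and relies on the child `sh` sorting glob results by byte value in the C locale.

```shell
dir=$(mktemp -d)                       # scratch directory for the demo
touch "$dir/B.test.sh" "$dir/a.test.sh"

# With LC_ALL=C in its environment, the child shell sorts by byte value,
# so uppercase "B" (0x42) comes before lowercase "a" (0x61).
sorted=$(cd "$dir" && LC_ALL=C sh -c 'echo *.test.sh')
echo "$sorted"                         # B.test.sh a.test.sh

rm -rf "$dir"
```

Under a locale such as `en_US.UTF-8`, the same glob would typically list `a.test.sh` first, which is the kind of difference described above.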
|
| 307 |
|
| 308 | Python 2 and 3 mostly agree:
|
| 309 |
|
| 310 | $ ltrace -e setlocale python3 -c 'print()'
|
| 311 | python3->setlocale(LC_CTYPE, nil) = "C"
|
| 312 | python3->setlocale(LC_CTYPE, "") = "en_US.UTF-8"
|
| 313 |
|
| 314 | Python only calls `setlocale()` for `LC_CTYPE`, not `LC_ALL`.
|
| 315 |
|
| 316 | <!--
|
| 317 | ## Spec Tests
|
| 318 |
|
| 319 | June 2024 notes:
|
| 320 |
|
| 321 | - `spec/var-op-patsub` has failing cases, e.g. `LC_ALL=C`
|
| 322 | - ${s//?/a}
|
| 323 | - glob() and fnmatch() seem to be OK? As long as locale is UTF-8.
|
| 324 |
|
| 325 | -->
|
| 326 |
|
| 327 | <!--
|
| 328 |
|
| 329 | What libraries are we using?
|
| 330 |
|
| 331 | TODO: Make sure these are UTF-8 mode, regardless of LANG global variables?
|
| 332 |
|
| 333 | Or maybe we punt on that, and say Oils is only valid in UTF-8 mode? Need to
|
| 334 | investigate the API more.
|
| 335 |
|
| 336 | - fnmatch()
|
| 337 | - glob()
|
| 338 | - regcomp/regexec()
|
| 339 |
|
| 340 | - Are we using any re2c unicode? For JSON?
|
| 341 | - upper() and lower()? isupper() is lower()
|
| 342 | - Need to sort these out
|
| 343 |
|
| 344 | -->
|