oils.pub — docs for version 0.37.0
Roughly speaking, you can divide programming languages into 3 categories with respect to Unicode strings:

- UTF-8 centric: a string is an array of bytes, usually UTF-8 encoded (e.g. Go, Julia)
- UTF-16: a string is an array of 16-bit code units (e.g. JavaScript)
- Code points: a string is a sequence of abstract code points (e.g. Python)

So Oils is in the first category: it's UTF-8 centric.
Let's see what this means — in terms of your mental model when writing OSH and YSH, and in terms of the Oils implementation.
The Oils runtime has a single Str data type, which is used by
both OSH and YSH.
A Str is an array of bytes, which may or may not be UTF-8 encoded. For
example:
s=$'\u03bc' # 1 code point, which is UTF-8 encoded as 2 bytes
echo ${#s} # => 1 code point (regardless of locale, right now)
echo $[len(s)] # => 2 bytes
That is, the YSH feature len(mystr) returns the length in bytes. But the
shell feature ${#s} decodes the string as UTF-8, and returns the length in
code points.
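The byte count can be cross-checked without Oils at all; this sketch assumes only printf and wc:

```shell
# U+03BC is encoded as the bytes 0xCE 0xBC in UTF-8; count them with wc.
# Octal escapes keep the snippet itself plain ASCII.
printf '\316\274' | wc -c   # => 2
```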
Again, this string storage model is like Go and Julia, but different than JavaScript (UTF-16) and Python (code points).
bash does support multiple lengths, but in a way that depends on global
variables:
s=$'\u03bc' # one code point
echo ${#s} # => 1, when say LANG=C.UTF-8
LC_ALL=C # libc setlocale() called under the hood
echo ${#s} # => 2 bytes, now that LC_ALL=C
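The experiment above can be made self-contained by pinning the locale on a subshell. Only the C-locale result is shown with an expected value here, since the C locale always exists, while specific UTF-8 locales may not be installed:

```shell
# U+03BC built from raw bytes, so the snippet itself stays plain ASCII
LC_ALL=C bash -c 's=$(printf "\316\274"); echo ${#s}'   # => 2 (bytes)

# With a UTF-8 locale installed (e.g. C.UTF-8), the same command prints 1
```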
So bash doesn't seem to fall cleanly in one of the 3 categories above.
It would be interesting to test bash with non-UTF-8 libc locales like Shift JIS (Japanese), but they are rare. In practice, the locale is almost always C or UTF-8, so bash and Oils are similar.
But Oils is more strict about UTF-8, and YSH discourages global variables like
LC_ALL.
(TODO: For compatibility, OSH should call setlocale() when assigning
LC_ALL=C.)
For backward compatibility, OSH source files may have arbitrary bytes. For
example, echo [the literal byte 0xFF] is a valid source file.
In contrast, YSH source files must be encoded in UTF-8, including its ASCII
subset. (TODO: Enforce this with shopt --set utf8_source)
If you write C-escaped strings, then your source file can be ASCII:
echo $'\u03bc' # bash style
echo u'\u{3bc}' # YSH style
If you write UTF-8 characters, then your source is UTF-8:
echo 'μ'
As mentioned, strings in OSH and YSH are arbitrary sequences of bytes, which may or may not be valid UTF-8.
Some operations like length ${#s} and slicing ${s:1:3} require the string
to be valid UTF-8. Decoding errors are fatal if shopt -s strict_word_eval is on.
When passed to external programs, strings are truncated at the first NUL
('\0') byte. This is a consequence of how Unix and C work.
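bash illustrates the same constraint even more strictly: its strings are NUL-terminated C strings internally, so a \0 in $'...' truncates at assignment time, even though the byte itself travels through a pipe just fine. A small sketch:

```shell
s=$'a\0b'
echo ${#s}              # => 1; the string was truncated at the NUL
printf 'a\0b' | wc -c   # => 3; the pipe carries all 3 bytes, NUL included
```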
At startup, Oils calls the libc function setlocale(), which initializes the
global variables from environment variables like LC_CTYPE and LC_COLLATE.
(For details, see osh-locale and ysh-locale.)
These global variables determine how libc string operations like tolower(),
glob(), and regexec() behave.
For example:
- glob() syntax: does ? match a byte or a code point?
- regcomp() syntax: does . match a byte or a code point?

Oils only supports UTF-8 locales. If the locale is not UTF-8, Oils prints a
warning to stderr at startup. You can silence it with OILS_LOCALE_OK=1.
(Note: GNU readline also calls setlocale(), but Oils may or may not link
against GNU readline.)
For example:
- ${#s} is implemented in Oils code, not libc. It currently assumes UTF-8.
- The trim() method is also implemented in Oils, not libc. It decodes UTF-8
  to detect Unicode spaces.
- [[ s =~ $pat ]] is implemented with libc, so it's affected by the locale
  settings. So is (s ~ pat) in YSH.
- The iconv program converts text from one encoding to another.

Oils is more UTF-8 centric than bash, which uses libc string functions that
potentially support multiple locales. Here are some details.
These operations are implemented in Python.
In osh/string_ops.py:
- ${#s} - length in code points
- ${s:1:2} - index and length are in code points
- ${x#glob?} and ${x##glob?} - see the section on globs below

In builtin/:

- printf '%d' \'c, where c is an arbitrary character. This is an obscure
  syntax for ord(), i.e. getting an integer from an encoded character.

The libc functions glob() and fnmatch() accept a pattern, which may have
the ? wildcard. It stands for a single code point (in UTF-8 locales),
not a byte.
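The printf '%d' \'c form mentioned above is easy to sanity-check from bash; with ASCII input, the answer doesn't depend on the locale:

```shell
printf '%d\n' \'A   # => 65, the code of 'A'
printf '%d\n' \'0   # => 48, the character '0', not the number 0
```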
Word evaluation uses a glob() call:
echo ?.c # which files match?
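A throwaway directory makes the ? semantics concrete; this sketch assumes only mktemp and touch:

```shell
dir=$(mktemp -d)
cd "$dir"
touch a.c bb.c
echo ?.c    # => a.c; ? matches exactly one character, so bb.c is excluded
```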
These language constructs result in fnmatch() calls:
${s#?} # remove a one-character prefix, quadratic loop for globs
case $x in ?) echo 'one char' ;; esac
[[ $x == ? ]]
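Whether ? consumes a byte or a code point is observable with a 2-byte character. This sketch pins LC_ALL=C, because the C locale always exists (specific UTF-8 locales may not be installed) and makes the byte-oriented behavior deterministic:

```shell
mu=$(printf '\316\274')   # U+03BC as 2 raw UTF-8 bytes

# In the C locale, ? matches a single byte, so the 2-byte char needs ??
LC_ALL=C bash -c 'case $1 in
  ?)  echo "matched one byte" ;;
  ??) echo "matched two bytes" ;;   # this branch is taken
esac' -- "$mu"
```

Under a UTF-8 locale, the single ? branch would match instead.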
Regexes have the . wildcard. Like ? in globs, it stands for a code
point. They also have [^a], which likewise matches a code point.
pat='.' # single code point
[[ $x =~ $pat ]]
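The same experiment works for ERE's . wildcard via [[ =~ ]], again pinned to the C locale for a deterministic answer:

```shell
mu=$(printf '\316\274')   # U+03BC as 2 raw UTF-8 bytes

# In the C locale, . matches one byte, so matching the char takes two dots
LC_ALL=C bash -c '[[ $1 =~ ^.$ ]]'  -- "$mu" || echo 'one . did not match'
LC_ALL=C bash -c '[[ $1 =~ ^..$ ]]' -- "$mu" && echo 'two dots matched'
```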
This construct uses our glob-to-ERE translator for position info:
echo ${s/?/x}
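With ASCII input, every locale agrees on what this does, so the behavior is easy to observe:

```shell
s='abc'
echo ${s/?/x}   # => xbc; ? matches exactly one character, replaced once
```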
- $IFS word splitting, which also affects the shSplit() builtin (compare
  zsh's setopt SH_WORD_SPLIT). (TODO: Oils could support Unicode in $IFS.)
- ${foo,} and ${foo^} for lowercase / uppercase - libc functions?
- [[ a < b ]] and [ a '<' b ] for sorting - strcoll()?
- The $PS1 prompt language has various time % codes, which are
  locale-specific.
- printf also has libc time calls, with %()T.

Other:
- wcswidth(), which doesn't just count code points. It calculates the
  display width of characters, which is different in general.

In YSH:

- mystr ~ / [ \y01 ] /
- case (x) { / dot / }
- Str.{trim,trimStart,trimEnd} respect Unicode space, like JavaScript does
- Str.{upper,lower} also need Unicode case folding
- s.split() doesn't have a default "split by space", which should probably
  respect Unicode space, like trim() does
- for offset, rune in (runes(mystr)) should decode UTF-8, like Go
- strcmp() should do byte-wise and UTF-8-wise comparisons?
- \yff

libc:

- glob() and fnmatch()
- regexec()
- strcoll() respects LC_COLLATE, which bash probably does
- tolower() toupper() - will we use these?

In Python:

- trimLeft() and ${s#prefix} need this
- for r in (runes(x)) needs this
- \u{3bc} (currently in data_lang/j8.py Utf8Encode())

Not sure:
bash:
$ ltrace -e setlocale bash -c 'echo'
bash->setlocale(LC_ALL, "") = "en_US.UTF-8"
...
bash->setlocale(LC_CTYPE, "") = "en_US.UTF-8"
bash->setlocale(LC_COLLATE, "") = "en_US.UTF-8"
bash->setlocale(LC_MESSAGES, "") = "en_US.UTF-8"
bash->setlocale(LC_NUMERIC, "") = "en_US.UTF-8"
bash->setlocale(LC_TIME, "") = "en_US.UTF-8"
...
Notes:
- bash calls setlocale() for each category. Is LC_ALL sufficient?
- LC_COLLATE affects glob() order, which makes bash scripts
  non-deterministic. Example: spec/task-runner.sh gen-task-file, which does
  a glob of */*.test.sh. James Chen-Smith ran it with the equivalent of
  LANG=C, which scrambled the order.

Python 2 and 3 mostly agree:
$ ltrace -e setlocale python3 -c 'print()'
python3->setlocale(LC_CTYPE, nil) = "C"
python3->setlocale(LC_CTYPE, "") = "en_US.UTF-8"
It only calls it for LC_CTYPE, not LC_ALL.