Warning: Work in progress! Leave feedback on Zulip or Github if you'd like this doc to be updated.
Oils is UTF-8 centric, unlike bash and other shells.
That is, its Unicode support is like Go, Rust, Julia, and Swift, and not like Python or JavaScript. The former languages internally represent strings as UTF-8, while the latter use arrays of code points or UTF-16 code units.
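To make the distinction concrete, here is a small Python sketch that measures the same text in the three units mentioned above. Python is used only for illustration, and the sample string is an assumption chosen to include a character outside the Basic Multilingual Plane.

```python
# Sample text (assumed for illustration): GREEK SMALL LETTER MU followed by
# an emoji that lies outside the Basic Multilingual Plane.
s = "\u03bc\U0001F600"

utf8_bytes = len(s.encode("utf-8"))            # UTF-8 code units (bytes)
code_points = len(s)                           # Python 3 counts code points
utf16_units = len(s.encode("utf-16-le")) // 2  # UTF-16 code units

print(utf8_bytes, code_points, utf16_units)  # 6 2 3
```

The three numbers differ for the same text, which is why it matters which unit a language (or shell) uses for length and indexing.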
shopt --set utf8_source
Unicode characters can be encoded directly in the source:
echo 'μ'
or denoted in ASCII with C-escaped strings:
echo $'\u03bc' # bash style
echo u'\u{3bc}' # YSH style
(Such strings are preferred over echo -e because they're statically parsed.)
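Both escapes above name the code point U+03BC, which ends up as two bytes in a UTF-8 string. The following Python sketch shows how a code point maps to UTF-8 bytes. The helper name `utf8_encode` is hypothetical, and this is not the code Oils uses.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point as UTF-8 (hypothetical helper)."""
    if code_point < 0 or code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value: %#x" % code_point)
    if code_point < 0x80:        # 1 byte: 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6), 0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (code_point >> 12),
            0x80 | ((code_point >> 6) & 0x3F),
            0x80 | (code_point & 0x3F),
        ])
    return bytes([               # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        0xF0 | (code_point >> 18),
        0x80 | ((code_point >> 12) & 0x3F),
        0x80 | ((code_point >> 6) & 0x3F),
        0x80 | (code_point & 0x3F),
    ])


mu = utf8_encode(0x3BC)
print(mu)  # b'\xce\xbc'
```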
Strings in OSH are arbitrary sequences of bytes, which may or may not be valid UTF-8. Details:
NUL ('\0') byte. This is a consequence of how Unix and C work.
${#s} and slicing ${s:1:3} require the string to be valid UTF-8. Decoding errors are fatal if shopt -s strict_word_eval is on.
These operations are implemented in Python.
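As a rough illustration of length and slicing over byte strings that must be valid UTF-8, consider the sketch below. The function names are hypothetical and this is not the code in Oils. It models only the strict case, where a decoding error is fatal; what happens without strict_word_eval is not modeled here.

```python
def code_point_length(data: bytes) -> int:
    """Like ${#s}: number of code points; raises on invalid UTF-8."""
    return len(data.decode("utf-8"))


def code_point_slice(data: bytes, begin: int, length: int) -> bytes:
    """Like ${s:begin:length}: both arguments count code points, not bytes."""
    chars = data.decode("utf-8")
    return chars[begin:begin + length].encode("utf-8")


s = "a\u03bcb\u03bcc".encode("utf-8")  # 5 code points stored in 7 bytes

print(len(s), code_point_length(s))  # 7 5
print(code_point_slice(s, 1, 3))     # b'\xce\xbcb\xce\xbc'

try:
    code_point_length(b"\xff")       # 0xFF never appears in valid UTF-8
except UnicodeDecodeError:
    print("invalid UTF-8")
```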
In osh/string_ops.py:
${#s} -- length in code points (buggy in bash)
len(s) returns a number of bytes, not code points.
${s:1:2} -- index and length are a number of code points
${x#glob?} and ${x##glob?} (see below)
In builtin/:
printf '%d' \'c where c is an arbitrary character. This is an obscure syntax for ord(), i.e. getting an integer from an encoded character.
More:
$IFS word splitting. Affects shSplit() builtin
setopt SH_WORD_SPLIT.
${foo,} and ${foo^} for lowercase / uppercase
[[ a < b ]] and [ a '<' b ] for sorting
strcoll()?
Globs have character classes [^a] and ?.
This pattern results in a glob() call:
echo my?glob
These patterns result in fnmatch() calls:
case $x in ?) echo 'one char' ;; esac
[[ $x == ? ]]
${s#?} # remove one character prefix, quadratic loop for globs
This uses our glob to ERE translator for position info:
echo ${s/?/x}
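The reason ? needs to know about UTF-8 can be shown with Python's fnmatch module. This is only an analogy: it is Python's own matcher, not the libc fnmatch() mentioned above. Matching text treats ? as one code point, while matching raw bytes treats it as one byte.

```python
import fnmatch

mu_text = "\u03bc"                  # one code point
mu_bytes = mu_text.encode("utf-8")  # two bytes: CE BC

# On text, ? matches one character.
print(fnmatch.fnmatchcase(mu_text, "?"))     # True

# On raw bytes, ? matches one byte, so a two-byte character needs ??.
print(fnmatch.fnmatchcase(mu_bytes, b"?"))   # False
print(fnmatch.fnmatchcase(mu_bytes, b"??"))  # True
```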
Regexes have character classes [^a] and .:
pat='.' # single "character"
[[ $x =~ $pat ]]
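The same character-versus-byte question applies to regexes. The sketch below uses Python's re module, not the libc regex engine that the shell calls, so it only illustrates the two possible meanings of a single "character".

```python
import re

mu = "\u03bc"

# Text pattern: . and [^a] each consume one code point.
dot_on_text = re.fullmatch(".", mu) is not None
negated_class_on_text = re.fullmatch("[^a]", mu) is not None

# Bytes pattern: . consumes one byte, so it cannot cover both bytes of mu.
dot_on_bytes = re.fullmatch(b".", mu.encode("utf-8")) is not None

print(dot_on_text, negated_class_on_text, dot_on_bytes)  # True True False
```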
printf also has time.
Other:
wcswidth(), which doesn't just count code points. It calculates the display width of characters, which is different in general.
mystr ~ / [ \xff ] /
case (x) { / dot / }
Str.{trim,trimLeft,trimRight} respect unicode space, like JavaScript does
Str.{upper,lower} also need unicode case folding
s.split() doesn't have a default "split by space", which should probably respect unicode space, like trim() does
for offset, rune in (runes(mystr)) decodes UTF-8, like Go
Not unicode aware:
strcmp() does byte-wise and UTF-8 wise comparisons?
\yff
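The runes() loop mentioned above can be pictured with a short Python sketch. It assumes that the offset is a byte offset, as in Go's range loop over a string. The generator below is a hypothetical stand-in, not the YSH implementation.

```python
def runes(data: bytes):
    """Yield (byte_offset, code_point) pairs from UTF-8 bytes.

    Hypothetical helper; assumes byte offsets, as Go's range loop yields.
    """
    offset = 0
    for ch in data.decode("utf-8"):
        yield offset, ord(ch)
        offset += len(ch.encode("utf-8"))  # advance by this character's width


pairs = list(runes("a\u03bcb".encode("utf-8")))
print(pairs)  # [(0, 97), (1, 956), (3, 98)]
```

The offsets skip from 1 to 3 because the middle character occupies two bytes.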
Unlike bash and CPython, Oils doesn't call setlocale(). (Although GNU readline may call it.)
It's expected that your locale will respect UTF-8. This is true on most distros. If not, then some string operations will support UTF-8 and some won't.
For example:
${#s} is implemented in Oils code, not libc, so it will always respect UTF-8.
[[ s =~ $pat ]] is implemented with libc, so it is affected by the locale settings. Same with Oils (x ~ pat).
TODO: Oils should support LANG=C for some operations, but not LANG=X for other X.
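To see why an operation written without libc can ignore the locale, here is a minimal sketch of counting code points with byte arithmetic alone. It is an illustration under the assumption that the input is already valid UTF-8, and it is not the Oils implementation; the function name is hypothetical.

```python
def count_code_points(data: bytes) -> int:
    """Count code points in valid UTF-8 by skipping continuation bytes.

    Continuation bytes have the bit pattern 10xxxxxx; every other byte
    starts a new code point. No validation is performed.
    """
    return sum(1 for b in data if (b & 0xC0) != 0x80)


sample = "na\u00efve \u03bc".encode("utf-8")  # 7 code points in 9 bytes
print(len(sample), count_code_points(sample))  # 9 7
```

Nothing here consults LANG or LC_ALL, so the result is the same under any locale.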
libc:
glob() and fnmatch()
regexec()
strcoll() respects LC_COLLATE, which bash probably does
Our own:
trimLeft() and ${s#prefix} need this
for r in (runes(x)) needs this
\u{3bc} (currently in data_lang/j8.py Utf8Encode())
Not sure:
The iconv program converts text from one encoding to another.
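As an illustration of what such a conversion involves, the Python sketch below decodes bytes from one encoding and re-encodes them in another, roughly what iconv -f ISO-8859-1 -t UTF-8 does. The helper name and the sample bytes are assumptions.

```python
def convert(data: bytes, from_encoding: str, to_encoding: str) -> bytes:
    """Re-encode bytes from one character encoding to another."""
    return data.decode(from_encoding).encode(to_encoding)


latin1 = b"caf\xe9"  # "café" in ISO-8859-1: one byte per character
utf8 = convert(latin1, "iso-8859-1", "utf-8")
print(utf8)  # b'caf\xc3\xa9'
```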