Unicode in Oils

Roughly speaking, you can divide programming languages into 3 categories with respect to Unicode strings:

  1. UTF-8 - Go, Rust, Julia, ..., Oils
  2. UTF-16 - Java, JavaScript, ...
  3. UTF-32 aka Unicode code points - Python 2 and 3, C and C++ (wchar_t), ...

So Oils is in the first category: it's UTF-8 centric.

Let's see what this means, both for your mental model when writing OSH and YSH, and for the Oils implementation.

Table of Contents
Example: The Length of a String
Note on bash
Code Strings and Data Strings
OSH vs. YSH
Data Encoding
Passing Data to libc / the Kernel
Your System Locale Should Be UTF-8
Note: Some string operations use libc, and some don't
Tips
Summary
Appendix: Language Operations That Involve Unicode
OSH / bash
YSH
Data Languages
More Notes
List of Low-Level UTF-8 Operations
setlocale() calls made by bash, Python, ...

Example: The Length of a String

The Oils runtime has a single Str data type, which is used by both OSH and YSH.

A Str is an array of bytes, which may or may not be UTF-8 encoded. For example:

s=$'\u03bc'      # 1 code point, which is UTF-8 encoded as 2 bytes

echo ${#s}       # => 1 code point (regardless of locale, right now)

echo $[len(s)]   # => 2 bytes

That is, the YSH expression len(s) returns the length in bytes. But the shell expression ${#s} decodes the string as UTF-8 and returns the length in code points.

Again, this string storage model is like Go and Julia, but different from JavaScript (UTF-16) and Python (code points).
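The same distinction can be illustrated in Python, which separates text (code points) from bytes. This is an analogy, not Oils code:

```python
s = '\u03bc'             # the character μ: one code point
b = s.encode('utf-8')    # its UTF-8 encoding: two bytes

print(len(s))   # 1, analogous to ${#s}
print(len(b))   # 2, analogous to YSH len(s)
```
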

Note on bash

bash does support multiple lengths, but in a way that depends on global variables:

s=$'\u03bc'  # one code point

echo ${#s}   # => 1, with e.g. LANG=C.UTF-8

LC_ALL=C     # libc setlocale() called under the hood
echo ${#s}   # => 2 bytes, now that LC_ALL=C

So bash doesn't seem to fall cleanly in one of the 3 categories above.

It would be interesting to test bash with non-UTF-8 libc locales like Shift JIS (Japanese), but they are rare. In practice, the locale is almost always C or UTF-8, so bash and Oils are similar.

But Oils is more strict about UTF-8, and YSH discourages global variables like LC_ALL.

(TODO: For compatibility, OSH should call setlocale() when assigning LC_ALL=C.)

Code Strings and Data Strings

OSH vs. YSH

For backward compatibility, OSH source files may have arbitrary bytes. For example, echo [the literal byte 0xFF] is a valid source file.

In contrast, YSH source files must be encoded in UTF-8 (or its ASCII subset). (TODO: Enforce this with shopt --set utf8_source)

If you write C-escaped strings, then your source file can be ASCII:

echo $'\u03bc'   # bash style

echo u'\u{3bc}'  # YSH style

If you write UTF-8 characters, then your source is UTF-8:

echo 'μ'
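In Python terms (an analogy, not Oils code), the two spellings denote the same string; the escaped form keeps the source file ASCII, while the literal form makes the source itself UTF-8:

```python
escaped = '\u03bc'   # escaped form: the source line is pure ASCII
literal = 'μ'        # literal form: the source line itself is UTF-8

assert escaped == literal
assert escaped.encode('utf-8') == b'\xce\xbc'   # 2 bytes on disk
```
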

Data Encoding

As mentioned, strings in OSH and YSH are arbitrary sequences of bytes, which may or may not be valid UTF-8.

Some operations like length ${#s} and slicing ${s:1:3} require the string to be valid UTF-8. Decoding errors are fatal if shopt -s strict_word_eval is on.
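Python's strict UTF-8 decoder gives a feel for what such a fatal decoding error looks like (an analogy, not the Oils implementation):

```python
valid = b'\xce\xbc'   # UTF-8 encoding of μ
invalid = b'\xff'     # 0xFF can never appear in valid UTF-8

assert valid.decode('utf-8') == 'μ'

try:
    invalid.decode('utf-8')
except UnicodeDecodeError as e:
    # Analogous to a fatal error under shopt -s strict_word_eval
    print('decode error:', e)
```
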

Passing Data to libc / the Kernel

When passed to external programs, strings are truncated at the first NUL ('\0') byte. This is a consequence of how Unix and C work.
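The ctypes sketch below (Unix-specific and purely illustrative) shows the C view of a string with an embedded NUL; libc's strlen() stops at the first NUL byte:

```python
import ctypes

libc = ctypes.CDLL(None)   # load the C library (works on Unix)
libc.strlen.restype = ctypes.c_size_t
libc.strlen.argtypes = [ctypes.c_char_p]

# The bytes after the NUL are invisible to C string functions.
print(libc.strlen(b'ab\0cd'))   # 2
```
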

Your System Locale Should Be UTF-8

At startup, Oils calls the libc function setlocale(), which initializes the global variables from environment variables like LC_CTYPE and LC_COLLATE. (For details, see osh-locale and ysh-locale.)

These global variables determine how libc string operations like tolower(), glob(), and regexec() behave.

For example, in a UTF-8 locale the glob wildcard ? matches one code point, while in the C locale it matches one byte.

Oils only supports UTF-8 locales. If the locale is not UTF-8, Oils prints a warning to stderr at startup. You can silence it with OILS_LOCALE_OK=1.

(Note: GNU readline also calls setlocale(), but Oils may or may not link against GNU readline.)
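A hypothetical Python sketch of such a startup check, using the libc codeset of the current LC_CTYPE locale (the function name locale_is_utf8 is invented for illustration):

```python
import codecs
import locale

def locale_is_utf8():
    # Initialize LC_CTYPE from the environment (LC_ALL, LC_CTYPE, LANG),
    # then ask libc which character encoding that locale uses.
    locale.setlocale(locale.LC_CTYPE, '')
    codeset = locale.nl_langinfo(locale.CODESET)   # e.g. 'UTF-8' or 'ANSI_X3.4-1968'
    return codecs.lookup(codeset).name == 'utf-8'

print(locale_is_utf8())
```
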

Note: Some string operations use libc, and some don't

For example, glob() and regexec() matching is done by libc and depends on the locale, while operations like ${#s} and slicing are implemented in Oils itself and always assume UTF-8.

Tips

Summary

Oils is more UTF-8 centric than bash: it assumes UTF-8 rather than consulting global locale variables like LC_ALL.

Appendix: Language Operations That Involve Unicode

Here are some details.

OSH / bash

These operations are implemented in Python.

In osh/string_ops.py:

In builtin/:

Operations That Use Glob Syntax

The libc functions glob() and fnmatch() accept a pattern, which may have the ? wildcard. It stands for a single code point (in UTF-8 locales), not a byte.

Word evaluation uses a glob() call:

echo ?.c  # which files match?

These language constructs result in fnmatch() calls:

${s#?}  # remove one character prefix, quadratic loop for globs

case $x in ?) echo 'one char' ;; esac

[[ $x == ? ]]
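Python's fnmatch module exhibits the same character-versus-byte distinction: with text patterns, ? consumes one code point; with bytes patterns, it consumes one byte. This is an analogy, not how Oils calls libc:

```python
import fnmatch

# On text, ? matches the single code point μ ...
assert fnmatch.fnmatchcase('μ.c', '?.c')

# ... but on bytes, ? matches a single byte, and μ is two bytes.
data = 'μ.c'.encode('utf-8')
assert not fnmatch.fnmatchcase(data, b'?.c')
assert fnmatch.fnmatchcase(data, b'??.c')
```
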

Operations That Involve Regexes (ERE)

Regexes have the wildcard . (dot). Like ? in globs, it stands for a code point. The bracket expression [^a] likewise matches a code point.

pat='.'  # single code point
[[ $x =~ $pat ]]
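The same distinction shows up in Python's re module (an analogy to libc regexec(), not Oils code):

```python
import re

mu = 'μ'.encode('utf-8')   # two bytes: b'\xce\xbc'

# On text, . matches one code point ...
assert re.fullmatch('.', 'μ')

# ... on bytes, . matches one byte, so μ needs two dots.
assert re.fullmatch(b'.', mu) is None
assert re.fullmatch(b'..', mu)
```
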

This construct uses our glob-to-ERE translator for position info:

echo ${s/?/x}
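A minimal, hypothetical sketch of such a glob-to-ERE translation (the real Oils translator also records position info for the replacement, which this sketch omits):

```python
def glob_to_ere(pat):
    # Translate only ? and *; escape ERE metacharacters; everything
    # else passes through unchanged.
    out = []
    for ch in pat:
        if ch == '?':
            out.append('.')      # one code point, like glob ?
        elif ch == '*':
            out.append('.*')
        elif ch in '\\^$.[]|()+{}':
            out.append('\\' + ch)
        else:
            out.append(ch)
    return ''.join(out)

print(glob_to_ere('?.c'))   # .\.c
```
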

More Locale-Aware Operations

Other:

YSH

Data Languages

More Notes

List of Low-Level UTF-8 Operations

libc:

In Python:

Not sure:
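One such low-level operation, counting code points without fully decoding, can be sketched as follows (illustrative; not the Oils implementation):

```python
def count_code_points(b):
    # In UTF-8, continuation bytes have the form 0b10xxxxxx; every other
    # byte starts a new code point. Assumes the input is valid UTF-8.
    return sum(1 for byte in b if byte & 0xC0 != 0x80)

print(count_code_points('μ'.encode('utf-8')))   # 1
print(count_code_points(b'abc'))                # 3
```
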

setlocale() calls made by bash, Python, ...

bash:

$ ltrace -e setlocale bash -c 'echo'
bash->setlocale(LC_ALL, "")                    = "en_US.UTF-8"
...
bash->setlocale(LC_CTYPE, "")                  = "en_US.UTF-8"
bash->setlocale(LC_COLLATE, "")                = "en_US.UTF-8"
bash->setlocale(LC_MESSAGES, "")               = "en_US.UTF-8"
bash->setlocale(LC_NUMERIC, "")                = "en_US.UTF-8"
bash->setlocale(LC_TIME, "")                   = "en_US.UTF-8"
...

Notes:

Python 2 and 3 mostly agree:

$ ltrace -e setlocale python3 -c 'print()'
python3->setlocale(LC_CTYPE, nil)              = "C"
python3->setlocale(LC_CTYPE, "")               = "en_US.UTF-8"

Python calls setlocale() only for LC_CTYPE, not LC_ALL.

Generated on Sun, 30 Nov 2025 19:53:30 +0000