YSH has a new syntax for patterns, which appears between the / /
delimiters:
if (mystr ~ /d+ '.' d+/) {
echo 'mystr looks like a number N.M'
}
These patterns are intended to be familiar, but they differ from POSIX or Perl expressions in important ways. So we call them eggexes rather than regexes!
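For example, ^ no longer means three totally different things. See the critique at the end of this doc.
Eggexes are designed to be translated to other regex syntaxes, such as the POSIX ERE syntax used by common Unix tools:
- egrep (grep -E)
- awk
- sed --regexp-extended
Constructs that can backtrack are prefixed with !!, so you can visually audit them for catastrophic backtracking. (Russ Cox, author of the RE2 engine, has written extensively on this issue.)

Here's a longer example: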
# Define a subpattern. 'digit' and 'd' are the same.
$ var D = / digit{1,3} /
# Use the subpattern
$ var ip_pat = / D '.' D '.' D '.' D /
# This eggex compiles to an ERE
$ echo $ip_pat
[[:digit:]]{1,3}\.[[:digit:]]{1,3}\.[[:digit:]]{1,3}\.[[:digit:]]{1,3}
This means you can use it in a very simple way:
$ egrep $ip_pat foo.txt
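For contrast, here is a minimal Python sketch of the same reuse done with plain strings, which is the style that eggex subpatterns are meant to replace. It assumes Python's re module, where \d with re.ASCII stands in for [[:digit:]] (Python's re has no POSIX class syntax). The names D, IP_PAT, and find_ip are illustrative and not part of YSH.

```python
import re

# Subpattern: one to three digits.
D = r"\d{1,3}"

# Compose by joining strings; the literal dot must be escaped by hand.
IP_PAT = r"\.".join([D] * 4)


def find_ip(line):
    """Return the first IP-like substring in line, or None."""
    # re.ASCII keeps \d limited to 0-9.
    m = re.search(IP_PAT, line, flags=re.ASCII)
    return m.group(0) if m else None


print(IP_PAT)
print(find_ip("host 10.0.0.1 is up"))
```

Because the pieces are strings, the author is responsible for escaping the dot. In the eggex version the quoted '.' is escaped by the translator.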
TODO: You should also be able to inline patterns like this:
egrep $/d+/ foo.txt
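Eggex translation is deliberately "dumb" and one-to-one. It doesn't rely on how each regex engine handles Unicode, NUL bytes, etc.

Eggexes have a consistent syntax:
- Single characters are unquoted and lowercase: dot, space, or s
- Sequences of multiple characters look like 'lit', $var, etc.
- Constructs that match zero characters look like %start, %word_end, etc.
- Entire subpatterns are in uppercase, like HexDigit. Important: these are spliced as syntax trees, not strings, so you don't need to think about quoting.

For example, it's easy to see that these patterns all match three characters: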
/ d d d /
/ digit digit digit /
/ dot dot dot /
/ word space word /
/ 'ab' space /
/ 'abc' /
And that these patterns match two:
/ %start w w /
/ %start 'if' /
/ d d %end /
And that you have to look up the definition of HexDigit to know how many characters this matches:
/ %start HexDigit %end /
Constructs like . ^ $ \< \> are deprecated because they break these rules.

. Is Now dot
But . is still accepted. It usually matches any character except a newline, although this changes based on flags (e.g. dotall, unicode).

Character classes: word, w, alnum
We accept both Perl and POSIX classes.
- Perl: d or digit, s or space, w or word
- POSIX: alpha, alnum, ...

Zero-width assertions look like %this
- %start is ^
- %end is $
- %input_start is \A
- %input_end is \z
- %last_line_end is \Z
- %word_start is \<
- %word_end is \>

Literals are quoted: 'hello *world*' becomes a regex-escaped string.
Note: instead of using double-quoted strings like "xyz $var", you can splice a string into an eggex:
/ 'xyz ' @var /

- x y matches x and y in sequence
- x | y matches x or y
You can also write a more Pythonic alternative: x or y.

Repetition is just like POSIX ERE or Perl:
- x?, x+, x*
- x{3}, x{1,3}
We've reserved syntactic space for PCRE and Python variants:
- lazy (non-greedy): x{L +}, x{L 3,4}
- possessive: x{P +}, x{P 3,4}

You can negate named char classes:
/ !digit /
and char class literals:
/ ![ a-z A-Z ] /
Sometimes you can do both:
/ ![ !digit ] / # translates to /[^\D]/ in PCRE
# error in ERE because it can't be expressed
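The PCRE translation mentioned above, [^\D], is a double negative that selects the same characters as \d. A small Python check illustrates this, assuming Python's re treats the construct the same way PCRE does. The helper matched_chars is hypothetical.

```python
import re


def matched_chars(pattern, text):
    """Return the characters of text that the one-character pattern matches."""
    return [c for c in text if re.fullmatch(pattern, c)]


sample = "a1 _9\n"

# A negated set containing a negated class: "not a non-digit".
double_negative = matched_chars(r"[^\D]", sample)

# The plain digit class.
plain = matched_chars(r"\d", sample)

print(double_negative, plain)
```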
You can also negate "regex modifiers" / compilation flags:
/ word ; ignorecase / # flag on
/ word ; !ignorecase / # flag off
/ word ; !i / # abbreviated
In contrast, regexes have many confusing syntaxes for negation:
[^abc] vs. [abc]
[^[:digit:]] vs. [[:digit:]]
\D vs. \d
/\w/-i vs /\w/i

Splice other patterns with @var_name or UpperCaseVarName
This allows you to reuse patterns. Using uppercase variables:
var D = / digit{3} /
var ip_addr = / D '.' D '.' D '.' D /
Using normal variables:
var part = / digit{3} /
var ip_addr = / @part '.' @part '.' @part '.' @part /
This is similar to how lex and re2c work.

Group with ()
Parentheses are used for precedence:
('foo' | 'bar')+
See note below: When translating to POSIX ERE, grouping becomes a capturing group. POSIX ERE has no non-capturing groups.

Capture with <capture ...>
Here's a positional capture:
<capture d+> # Becomes _group(1)
Add a variable after as for named capture:
<capture d+ as month> # Becomes _group('month')
You can also add type conversion functions:
<capture d+ : int> # _group(1) returns an Int, not Str
<capture d+ as month: int> # _group('month') returns an Int, not Str
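For readers who know Python, a rough analog of named captures with type conversion is sketched below. It assumes Python's re module, which has named groups but no built-in conversion, so int() is applied by hand. PAT and parse_date are illustrative names.

```python
import re

# (?P<name>...) is Python's named-group syntax.
PAT = re.compile(r"(?P<month>\d+)-(?P<day>\d+)")


def parse_date(s):
    """Return (month, day) as ints for the first N-M pair in s, or None."""
    m = PAT.search(s)
    if m is None:
        return None
    # Groups come back as strings; convert explicitly.
    return int(m.group("month")), int(m.group("day"))


print(parse_date("on 04-01, 10-31"))
```

In the eggex form, the conversion function is attached to the capture itself, so the pattern and its result types are declared in one place.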

Character class literals use []
Example:
[ a-f 'A'-'F' \xFF \u{03bc} \n \\ \' \" \0 ]
Terms:
- Ranges: a-f or 'A' - 'F'
- Character literals: \n, \x01, \u{3bc}, etc.
- Sets specified as strings: 'abc'
Only letters, numbers, and the underscore may be unquoted:
/['a'-'f' 'A'-'F' '0'-'9']/
/[a-f A-F 0-9]/ # Equivalent to the above
/['!' - ')']/ # Correct range
/[!-)]/ # Syntax Error
Ranges must be separated by spaces:
No:
/[a-fA-F0-9]/
Yes:
/[a-f A-F 0-9]/

Backtracking constructs use !! (Discouraged)
If you want to translate to PCRE, you can use these.
!!REF 1
!!REF name
!!AHEAD( d+ )
!!NOT_AHEAD( d+ )
!!BEHIND( d+ )
!!NOT_BEHIND( d+ )
!!ATOMIC( d+ )
Since they all begin with !!, you can visually audit your code for potential performance problems.

Flags and translation preferences (;)
Flags or "regex modifiers" appear after a semicolon:
/ digit+ ; i / # ignore case
A translation preference is specified after a second semicolon:
/ digit+ ; ; ERE / # translates to [[:digit:]]+
/ digit+ ; ; python / # could translate to \d+
Flags and translation preferences together:
/ digit+ ; ignorecase ; python / # could translate to (?i)\d+
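The (?i) prefix in the possible Python translation above is an inline flag. The sketch below, which assumes Python's re module, shows that it behaves like passing re.IGNORECASE separately. The helper name matches is illustrative.

```python
import re


def matches(pattern, text, flags=0):
    """True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text, flags) is not None


# No flag: the match is case-sensitive.
print(matches(r"foo", "FOO"))

# Flag passed as an argument.
print(matches(r"foo", "FOO", re.IGNORECASE))

# Flag embedded in the pattern, as in the translation shown above.
print(matches(r"(?i)foo", "FOO"))
```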
In Oils, the following flags are currently supported:
reg_icase / i (Ignore Case)
Use this flag to ignore case when matching. For example, /'foo'; i/ matches 'FOO', but /'foo'/ doesn't.
reg_newline (Multiline)
With this flag, %end will match before a newline and %start will match after a newline.
= u'abc123\n' ~ / digit %end ; reg_newline / # true
= u'abc\n123' ~ / %start digit ; reg_newline / # true
Without the flag, %start and %end only match at the start or end of the string, respectively.
= u'abc123\n' ~ / digit %end / # false
= u'abc\n123' ~ / %start digit / # false
With this flag, newlines are also excluded from dot and ![abc] patterns.
= u'\n' ~ / . ; reg_newline / # false
= u'\n' ~ / !digit ; reg_newline / # false
Without this flag, the newline \n is treated as an ordinary character.
= u'\n' ~ / . / # true
= u'\n' ~ / !digit / # true
You can spread regexes over multiple lines and add comments:
var x = ///
digit{4} # year e.g. 2001
'-'
digit{2} # month e.g. 06
'-'
digit{2} # day e.g. 31
///
(Not yet implemented in YSH.)
See the YSH regex API for details.
In summary, YSH has Perl-like conveniences with a ~ operator:
var s = 'on 04-01, 10-31'
var pat = /<capture d+ as month> '-' <capture d+ as day>/
if (s ~ pat) { # search for the pattern
echo $[_group('month')] # => 04
}
It also has an explicit and powerful Python-like API with the search() and leftMatch() methods on strings.
var m = s => search(pat, pos=8) # start searching at a position
if (m) {
echo $[m => group('month')] # => 10
}
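Since the API is described as Python-like, the corresponding Python calls may be a useful reference point. This is a sketch using Python's re module, not YSH. Treating Pattern.match() as the counterpart of leftMatch() is an assumption based on the method name.

```python
import re

s = "on 04-01, 10-31"
pat = re.compile(r"(?P<month>\d+)-(?P<day>\d+)")

# Search from the beginning: finds the first date.
first = pat.search(s)

# Search starting at position 8: skips the first date.
later = pat.search(s, 8)

# Anchored match: succeeds only if the pattern matches exactly at the position.
anchored_at_0 = pat.match(s)
anchored_at_3 = pat.match(s, 3)

print(first.group("month"), later.group("month"))
```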
No:
/ $'foo\tbar' / # Match 7 characters including a tab, but it's hard to read
/ r'foo\tbar' / # The string must contain 8 chars including '\' and 't'
Yes:
# Instead, take advantage of char literals and implicit regex concatenation
/ 'foo' \t 'bar' /
/ 'foo' \\ 'tbar' /
Repetitions like * + ? apply only to the last character, so literal strings need extra grouping:
No:
'foo'+
Yes:
<capture 'foo'>+
Also OK:
('foo')+ # this is a CAPTURING group in ERE
This is necessary because ERE doesn't have non-capturing groups like Perl's (?:...), and Eggex only does "dumb" translations. It doesn't silently insert constructs that change the meaning of the pattern.
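The effect of the missing group can be checked in any regex engine. Here is a minimal Python sketch, assuming Python's re module and a hypothetical helper named full, that shows why a "dumb" translation of 'foo'+ to foo+ would not repeat the whole string.

```python
import re


def full(pattern, text):
    """True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


# Without grouping, + applies only to the final "o".
print(full(r"foo+", "fooo"))
print(full(r"foo+", "foofoo"))

# With a (capturing) group, + applies to the whole literal.
print(full(r"(foo)+", "foofoo"))
```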
ERE can't represent this set of 1 character reliably:
/ [ \u{0100} ] / # This char is 2 bytes encoded in UTF-8
These sets are accepted:
/ [ \u{1} \u{2} ] / # set of 2 chars
/ [ \x01 \x02 ] / # set of 2 bytes
They happen to be identical when translated to ERE, but may not be when translated to PCRE.
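The byte counts behind these cases can be checked directly. The Python sketch below uses a hypothetical helper, utf8_len, to show that U+0100 encodes to two bytes in UTF-8 while U+0001 and U+0002 encode to one byte each. A byte-oriented bracket expression would see the first as two separate members, which is one way to read the "not reliable" remark above.

```python
def utf8_len(ch):
    """Number of bytes in the UTF-8 encoding of a one-character string."""
    return len(ch.encode("utf-8"))


# U+0100 needs two bytes.
print(utf8_len("\u0100"), "\u0100".encode("utf-8"))

# U+0001 and U+0002 are single bytes, identical to \x01 and \x02.
print(utf8_len("\u0001"), utf8_len("\u0002"))
```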
This is a sequence of characters:
/ $'\xfe\xff' /
This is a set of characters that is illegal:
/ [ $'\xfe\xff' ] / # set or sequence? It's confusing
This is a better way to write it:
/ [ \xfe \xff ] / # set of 2 chars

Special characters in character class literals: ^ - ] \
The literal characters ^ - ] \ are problematic because they can be confused with operators.
- ^ means negation
- - means range
- ] closes the character class
- \ is usually literal, but GNU gawk has an extension to make it an escaping operator
The Eggex-to-ERE translator is smart enough to handle cases like this:
var pat = / ['^' 'x'] /
# translated to [x^], not [^x] for correctness
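The difference between the two spellings is easy to confirm. This Python sketch assumes that Python's re follows the same bracket rule as ERE for a leading ^, and the helper full is illustrative.

```python
import re


def full(pattern, text):
    """True if the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None


# [x^]: a set containing the two literal characters x and ^.
print(full(r"[x^]", "x"), full(r"[x^]", "^"), full(r"[x^]", "a"))

# [^x]: a negated set, i.e. any single character except x.
print(full(r"[^x]", "x"), full(r"[^x]", "a"))
```

Moving ^ away from the first position is what keeps it literal, which is why the translator reorders the set members.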
However, cases like this are a fatal runtime error:
var pat1 = / ['a'-'^'] /
var pat2 = / ['a'-'-'] /
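
Critique: regexes are hard to read because the same symbol can mean many things.
^ could mean:
- Start of the string or line
- Negated character class like [^abc]
- Literal character ^ like [abc^]
\ is used in:
- Character classes like \w or \d
- Zero-width assertions like \b
- Escaped characters like \n
- Quoted characters like \+
? could mean:
- Optional: a?
- Lazy match: a+?
- Other kinds of grouping: (?P<named>\d+), (?:noncapturing)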
With egg expressions, each construct has a distinct syntax.
Bash:
if [[ $x =~ [[:digit:]]+ ]]; then  # the pattern is unquoted; a quoted pattern is matched literally
  echo 'x looks like a number'
fi
Compare with YSH:
if (x ~ /digit+/) {
echo 'x looks like a number'
}
Perl:
$x =~ /\d+/
YSH:
x ~ /d+/
The Perl expression has three more punctuation characters:
- x needs no $ sigil
- The match operator is ~, not =~
- The character class is d, with no backslash. If that's too short, you can also write digit.

The eggex syntax can be incorporated into other tools and shells. It's designed to be separate from YSH -- hence the separate name.
Notes:
- Escapes are written outside of quoted strings, as in / 'foo' \t 'sq' \' bar \n /, i.e. implicit concatenation of strings and characters, described above.

Eggexes aren't backward compatible in general, but they retain some legacy operators like ^ . $ to ease the transition. These expressions are valid eggexes and valid POSIX EREs:
.*
^[0-9]+$
^.{1,3}|[0-9][0-9]?$
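These three expressions also happen to be valid in Python's re, which makes their behavior easy to try out. The sketch below uses Python rather than a POSIX engine, so it only illustrates the shared syntax. LEGACY and first_match are illustrative names. In the third pattern, alternation has the lowest precedence, so ^ applies only to the left branch and $ only to the right.

```python
import re

LEGACY = [r".*", r"^[0-9]+$", r"^.{1,3}|[0-9][0-9]?$"]


def first_match(pattern, text):
    """Return the text of the first match, or None."""
    m = re.search(pattern, text)
    return m.group(0) if m else None


# Whole-string digits.
print(first_match(LEGACY[1], "2024"))

# Left branch: up to three characters at the start.
print(first_match(LEGACY[2], "hello"))

# Right branch: one or two digits at the end (the left branch fails on a leading newline).
print(first_match(LEGACY[2], "\n\n42"))
```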
If "eggex" sounds too much like "regex" to you, simply say "egg expression". It won't be confused with "regular expression" or "regex".
All three languages (eggex, Perl 6 regexes, and Rosie) support pattern composition and have quoted literals. And they have the goal of improving upon Perl 5 regex syntax, which has made its way into every major programming language (Python, Java, C++, etc.).
The main difference is that Eggexes are meant to be used with existing regex engines. For example, you translate them to a POSIX ERE, which is executed by egrep or awk. Or you translate them to a Perl-like syntax and use them in Python, JavaScript, Java, or C++ programs.
Perl 6 and Rosie have their own engines that are more powerful than PCRE, Python, etc. That means they cannot be used this way.
As for PEGs, the short answer is that they can be complementary: PEGs are closer to parsing, while eggex and regular languages are closer to lexing.
The PEG model is more resource intensive, but it can recognize more languages, and it can recognize recursive structure (trees).

Why Don't dot, %start, and %end Have More Precise Names?
Because the meanings of . ^ and $ are usually affected by regex engine flags, like dotall, multiline, and unicode.
As a result, the names mean nothing more than "however your regex engine interprets . ^ and $".
As mentioned in the "Philosophy" section above, eggex only does a superficial, one-to-one translation. It doesn't understand the details of which characters will be matched under which engine.
Eggexes are implemented in YSH, but not yet set in stone.
Please try them, as described in this post and the README, and send us feedback!
You can create a new post on /r/oilshell or a new message on #oil-discuss on https://oilshell.zulipchat.com/ (log in with GitHub, etc.)