---
default_highlighter: oils-sh
---

Egg Expressions (YSH Regexes)
=============================

YSH has a new syntax for patterns, which appears between the `/ /` delimiters:

    if (mystr ~ /d+ '.' d+/) {
      echo 'mystr looks like a number N.M'
    }

These patterns are intended to be familiar, but they differ from POSIX or Perl
expressions in important ways. So we call them *eggexes* rather than
*regexes*!

<!-- cmark.py expands this -->
<div id="toc">
</div>

## Why Invent a New Language?

- Eggexes let you name **subpatterns** and compose them, which makes them more
  readable and testable.
- Their **syntax** is vastly simpler because literal characters are **quoted**,
  and operators are not. For example, `^` no longer means three totally
  different things. See the critique at the end of this doc.
- bash and awk use the limited and verbose POSIX ERE syntax, while eggexes are
  more expressive and (in some cases) Perl-like.
- They're designed to be **translated to any regex dialect**. Right now, the
  YSH shell translates them to ERE so you can use them with common Unix tools:
  - `egrep` (`grep -E`)
  - `awk`
  - GNU `sed --regexp-extended`
  - PCRE syntax is the second most important target.
- They're **statically parsed** in YSH, so:
  - You can get **syntax errors** at parse time. In contrast, if you embed a
    regex in a string, you don't get syntax errors until runtime.
  - The eggex is part of the [lossless syntax tree][], which means you can do
    linting, formatting, and refactoring on eggexes, just like any other type
    of code.
- Eggexes support **regular languages** in the mathematical sense, whereas
  regexes are **confused** about the issue. All nonregular eggex extensions
  are prefixed with `!!`, so you can visually audit them for [catastrophic
  backtracking][backtracking]. (Russ Cox, author of the RE2 engine, [has
  written extensively](https://swtch.com/~rsc/regexp/) on this issue.)
- Eggexes are more fun than regexes!

[backtracking]: https://blog.codinghorror.com/regex-performance/

[lossless syntax tree]: http://www.oilshell.org/blog/2017/02/11.html

### Example of Pattern Reuse

Here's a longer example:

    # Define a subpattern. 'digit' and 'd' are the same.
    $ var D = / digit{1,3} /

    # Use the subpattern
    $ var ip_pat = / D '.' D '.' D '.' D /

    # This eggex compiles to an ERE
    $ echo $ip_pat
    [[:digit:]]{1,3}\.[[:digit:]]{1,3}\.[[:digit:]]{1,3}\.[[:digit:]]{1,3}

This means you can use it in a very simple way:

    $ egrep $ip_pat foo.txt

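As a rough analogy in Python (not how Oils implements it), the composition
above can be sketched with plain strings. The names `D` and `ip_pat` mirror the
variables above, and `[0-9]` is an assumed stand-in for `[[:digit:]]`, which
Python's `re` module does not support.

```python
import re

# Assumption: [0-9] stands in for the POSIX class [[:digit:]].
D = r"[0-9]{1,3}"

# Join four copies of the subpattern with an escaped dot.
ip_pat = r"\.".join([D] * 4)

match = re.search(ip_pat, "host 192.168.0.1 is up")
print(ip_pat)
print(match.group(0))
```

Unlike an eggex, this is string concatenation, so it is only safe here because
`D` contains no top-level alternation.
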
TODO: You should also be able to inline patterns like this:

    egrep $/d+/ foo.txt

### Design Philosophy

- Eggexes can express a **superset** of POSIX and Perl syntax.
- The language is designed for "dumb", one-to-one, **syntactic** translations.
  That is, translation doesn't rely on understanding the **semantics** of
  regexes. This is because regex implementations have many corner cases and
  incompatibilities, with regard to Unicode, `NUL` bytes, etc.

### The Expression Language Is Consistent

Eggexes have a consistent syntax:

- Single characters are unadorned, in lowercase: `dot`, `space`, or `s`
- A sequence of multiple characters looks like `'lit'`, `$var`, etc.
- Constructs that match **zero** characters look like `%start`, `%word_end`, etc.
- Entire subpatterns (which may contain alternation, repetition, etc.) are in
  uppercase like `HexDigit`. Important: these are **spliced** as syntax trees,
  not strings, so you **don't** need to think about quoting.

For example, it's easy to see that these patterns all match **three** characters:

    / d d d /
    / digit digit digit /
    / dot dot dot /
    / word space word /
    / 'ab' space /
    / 'abc' /

And that these patterns match **two**:

    / %start w w /
    / %start 'if' /
    / d d %end /

And that you have to look up the definition of `HexDigit` to know how many
characters this matches:

    / %start HexDigit %end /

Constructs like `. ^ $ \< \>` are deprecated because they break these rules.

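The point about splicing subpatterns as syntax trees rather than strings can be
illustrated with Python's `re` module. This is only a sketch of the hazard that
eggex avoids: `sub`, `naive`, and `grouped` are hypothetical names, and the
non-capturing group imitates what tree-based splicing gives you for free.

```python
import re

sub = "a|b"  # hypothetical subpattern containing alternation

# Naive string splicing: each anchor binds to only one branch.
naive = "^" + sub + "$"

# Tree-like splicing: the subpattern stays a single unit.
grouped = "^(?:" + sub + ")$"

naive_hit = re.search(naive, "ax") is not None
grouped_hit = re.search(grouped, "ax") is not None
print(naive_hit, grouped_hit)
```

The naive version accepts `ax` because it parses as `^a` or `b$`, which is not
what the author of the subpattern meant.
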
## Expression Primitives

### `.` Is Now `dot`

The `dot` primitive usually matches any character, although its exact meaning
depends on the underlying regex library.

- YSH uses `libc`, which accepts POSIX ERE syntax. So `dot` aka `.` matches
  any character, unless the `reg_newline` flag is set.
- If Eggex were compiled to Python, `dot` aka `.` would match any character
  *except* a newline, unless the `re.DOTALL` flag is set.

Note: Eggex accepts `.` as a synonym for `dot`, even though `dot` is preferred.

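The Python behavior described above can be checked directly with the standard
`re` module. This sketch says nothing about `libc`; it only shows why the
meaning of `dot` depends on the engine and its flags.

```python
import re

newline = "\n"

# Default: '.' does not match a newline in Python.
default_match = re.fullmatch(r".", newline)

# With re.DOTALL, '.' matches any character, including a newline.
dotall_match = re.fullmatch(r".", newline, re.DOTALL)

print(default_match is None, dotall_match is not None)
```
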
### Classes Are Unadorned: `word`, `w`, `alnum`

We accept both Perl and POSIX classes.

- Perl:
  - `d` or `digit`
  - `s` or `space`
  - `w` or `word`
- POSIX
  - `alpha`, `alnum`, ...

### Zero-width Assertions Look Like `%this`

- POSIX
  - `%start` is `^`
  - `%end` is `$`
- PCRE:
  - `%input_start` is `\A`
  - `%input_end` is `\z`
  - `%last_line_end` is `\Z`
- GNU ERE extensions:
  - `%word_start` is `\<`
  - `%word_end` is `\>`

### Single-Quoted Strings

- `'hello *world*'` becomes a regex-escaped string

Note: instead of using double-quoted strings like `"xyz $var"`, you can splice
a string into an eggex:

    / 'xyz ' @var /

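To see what "regex-escaped" means in practice, here is a sketch using Python's
`re.escape()` as an analogy. The exact escaping that Oils emits for ERE may
differ; the point is only that operators inside a quoted literal lose their
special meaning.

```python
import re

text = "hello *world*"

# Used directly as a pattern, '*' acts as a repetition operator.
unescaped = re.fullmatch(text, text)

# re.escape() quotes the operators so the text matches itself.
escaped = re.fullmatch(re.escape(text), text)

print(unescaped is None, escaped is not None)
```
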
## Compound Expressions

### Sequence and Alternation Are Unchanged

- `x y` matches `x` and `y` in sequence
- `x | y` matches `x` or `y`

You can also write a more Pythonic alternative: `x or y`.

### Repetition Is Unchanged In Common Cases, and Better in Rare Cases

Repetition is just like POSIX ERE or Perl:

- `x?`, `x+`, `x*`
- `x{3}`, `x{1,3}`

We've reserved syntactic space for PCRE and Python variants:

- lazy/non-greedy: `x{L +}`, `x{L 3,4}`
- possessive: `x{P +}`, `x{P 3,4}`

(Oils doesn't have these features, because Oils translates Eggex to POSIX ERE
syntax.)

### Negation Consistently Uses !

You can negate named char classes:

    / !digit /

and char class literals:

    / ![ a-z A-Z ] /

Sometimes you can do both:

    / ![ !digit ] /  # translates to /[^\D]/ in PCRE
                     # error in ERE because it can't be expressed

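The PCRE-style translation `[^\D]` mentioned above is also accepted by Python's
`re` module, so the double negation can be checked there. This is a sketch in
Python rather than PCRE itself, and the sample characters are arbitrary.

```python
import re

samples = ["5", "a", " ", "_"]

# [^\D] means "not a non-digit", which is the same set as \d.
double_negated = [bool(re.fullmatch(r"[^\D]", ch)) for ch in samples]
plain_digit = [bool(re.fullmatch(r"\d", ch)) for ch in samples]

print(double_negated == plain_digit)
```
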
You can also negate "regex modifiers" / compilation flags:

    / word ; ignorecase /   # flag on
    / word ; !ignorecase /  # flag off
    / word ; !i /           # abbreviated

In contrast, regexes have many confusing syntaxes for negation:

    [^abc] vs. [abc]
    [^[:digit:]] vs. [[:digit:]]

    \D vs. \d

    /\w/-i vs /\w/i

### Splice Other Patterns `@var_name` or `UpperCaseVarName`

This allows you to reuse patterns. Using uppercase variables:

    var D = / digit{3} /

    var ip_addr = / D '.' D '.' D '.' D /

Using normal variables:

    var part = / digit{3} /

    var ip_addr = / @part '.' @part '.' @part '.' @part /

This is similar to how `lex` and `re2c` work.

### Group With `()`

Parentheses are used for precedence:

    ('foo' | 'bar')+

See note below: When translating to POSIX ERE, grouping becomes a capturing
group. POSIX ERE has no non-capturing groups.

### Capture with `<capture ...>`

Here's a positional capture:

    <capture d+>           # Becomes _group(1)

Add a variable after `as` for named capture:

    <capture d+ as month>  # Becomes _group('month')

You can also add type conversion functions:

    <capture d+ : int>          # _group(1) returns an Int, not Str
    <capture d+ as month: int>  # _group('month') returns an Int, not Str

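For readers coming from Python, named capture with a conversion function is
roughly comparable to the following sketch. The explicit `int()` call is an
assumption standing in for `: int`; Python's `re` has no built-in conversion,
and the input string is made up for illustration.

```python
import re

m = re.search(r"(?P<month>\d+)-(?P<day>\d+)", "date: 06-31")

# Without conversion, a group is always a string.
month_str = m.group("month")

# Assumption: an explicit int() call plays the role of ': int'.
month_int = int(m.group("month"))

print(repr(month_str), month_int)
```
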
### Character Class Literals Use `[]`

Example:

    [ a b c ]            # individual characters / code points
    [ '?' '*' '+' ]      # quoted characters
    [ 'a' 'bc' '?*+' ]   # reduce the number of quotes
    [ a-f 'A'-'F' x y ]  # ranges of code points
    [ \n \\ \' \" ]      # backslash escapes
    [ \yFF \u{03bc} ]    # a byte and a code point

Only letters, numbers, and the underscore may be unquoted:

    /['a'-'f' 'A'-'F' '0'-'9']/
    /[a-f A-F 0-9]/        # Equivalent to the above

    /['!' - ')']/          # Correct range
    /[!-)]/                # Syntax Error

Ranges must be separated by spaces:

No:

    /[a-fA-F0-9]/

Yes:

    /[a-f A-F 0-9]/

### Backtracking Constructs Use `!!` (Discouraged)

If you want to translate to PCRE, you can use these.

    !!REF 1
    !!REF name

    !!AHEAD( d+ )
    !!NOT_AHEAD( d+ )
    !!BEHIND( d+ )
    !!NOT_BEHIND( d+ )

    !!ATOMIC( d+ )

Since they all begin with `!!`, you can visually audit your code for potential
performance problems.

## Outside the Expression language

### Flags and Translation Preferences (`;`)

Flags or "regex modifiers" appear after a semicolon:

    / digit+ ; i /  # ignore case

A translation preference is specified after a second semicolon:

    / digit+ ; ; ERE /     # translates to [[:digit:]]+
    / digit+ ; ; python /  # could translate to \d+

Flags and translation preferences together:

    / digit+ ; ignorecase ; python /  # could translate to (?i)\d+

In Oils, the following flags are currently supported:

#### `reg_icase` / `i` (Ignore Case)

Use this flag to ignore case when matching. For example, `/'foo'; i/` matches
'FOO', but `/'foo'/` doesn't.

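The `(?i)` form mentioned above as a possible Python translation behaves the
same way in Python's `re` module. This sketch uses the literal `foo` from the
example and does not involve Oils itself.

```python
import re

# Case-sensitive by default, like /'foo'/.
sensitive = re.search(r"foo", "FOO")

# The inline (?i) flag ignores case, like /'foo'; i/.
insensitive = re.search(r"(?i)foo", "FOO")

print(sensitive is None, insensitive is not None)
```
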
#### `reg_newline` (Multiline)

With this flag, `%end` will match before a newline and `%start` will match
after a newline.

    = u'abc123\n' ~ / digit %end ; reg_newline /    # true
    = u'abc\n123' ~ / %start digit ; reg_newline /  # true

Without the flag, `%start` and `%end` only match at the start or end of the
string, respectively.

    = u'abc123\n' ~ / digit %end /    # false
    = u'abc\n123' ~ / %start digit /  # false

With the flag, `dot` and negated classes like `![abc]` also don't match
newlines.

    = u'\n' ~ / . ; reg_newline /       # false
    = u'\n' ~ / !digit ; reg_newline /  # false

Without this flag, the newline `\n` is treated as an ordinary character.

    = u'\n' ~ / . /       # true
    = u'\n' ~ / !digit /  # true

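Python's `re.MULTILINE` is a partial analogue of `reg_newline`, and the
`%start` case can be sketched with it. Treat this as an illustration only:
`re.MULTILINE` does not change how `.` or negated classes treat newlines, and
Python's `$` matches before a trailing newline even without the flag, so the
`%end` examples above do not carry over one-to-one.

```python
import re

text = "abc\n123"

# Without re.MULTILINE, '^' matches only at the start of the string.
plain = re.search(r"^\d", text)

# With re.MULTILINE, '^' also matches just after each newline.
multiline = re.search(r"^\d", text, re.MULTILINE)

print(plain is None, multiline.group(0))
```
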
### Multiline Syntax

You can spread regexes over multiple lines and add comments:

    var x = ///
      digit{4}  # year e.g. 2001
      '-'
      digit{2}  # month e.g. 06
      '-'
      digit{2}  # day e.g. 31
    ///

(Not yet implemented in YSH.)

### The YSH API

See the [YSH regex API](ysh-regex-api.html) for details.

In summary, YSH has Perl-like conveniences with a `~` operator:

    var s = 'on 04-01, 10-31'
    var pat = /<capture d+ as month> '-' <capture d+ as day>/

    if (s ~ pat) {  # search for the pattern
      echo $[_group('month')]  # => 04
    }

It also has an explicit and powerful Python-like API with the `search()` and
`leftMatch()` methods on strings.

    var m = s => search(pat, pos=8)  # start searching at a position
    if (m) {
      echo $[m => group('month')]  # => 10
    }

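Since the API is described as Python-like, the same two searches can be
sketched with Python's `re` module for comparison. The regex below is a
hand-written Python equivalent of `pat`, not output produced by Oils.

```python
import re

s = "on 04-01, 10-31"
pat = re.compile(r"(?P<month>\d+)-(?P<day>\d+)")

first = pat.search(s)     # search from the beginning
later = pat.search(s, 8)  # start searching at position 8

print(first.group("month"), later.group("month"))
```
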
### Language Reference

- See the bottom of the [YSH Expression Grammar]($oils-src:ysh/grammar.pgen2)
  for the concrete syntax.
- See the bottom of [frontend/syntax.asdl]($oils-src:frontend/syntax.asdl) for
  the abstract syntax.

## Usage Notes

### Use character literals rather than C-Escaped strings

No:

    / $'foo\tbar' /  # Match 7 characters including a tab, but it's hard to read
    / r'foo\tbar' /  # The string must contain 8 chars including '\' and 't'

Yes:

    # Instead, take advantage of char literals and implicit regex concatenation
    / 'foo' \t 'bar' /
    / 'foo' \\ 'tbar' /

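The 7-versus-8 character distinction is the same one Python makes between
ordinary and raw string literals, which makes for a quick sanity check. This is
an analogy only; YSH's `$'...'` and `r'...'` literals are not Python syntax.

```python
tab_string = "foo\tbar"   # escape processed: one tab character
raw_string = r"foo\tbar"  # raw: a backslash followed by 't'

print(len(tab_string), len(raw_string))
```
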
## POSIX ERE Limitations

In theory, Eggex can be translated to many different syntaxes, like Perl or
Python.

In practice, Oils translates Eggex to POSIX Extended Regular Expressions (ERE)
syntax. This syntax is understood by `libc`, e.g. GNU libc or musl libc.

But not all of Eggex can be translated to ERE syntax. And the translation is
straightforward and "dumb", rather than smart.

Here are some limitations.

### Repetition of Strings Requires Grouping

Repetitions like `* + ?` apply only to the last character, so literal strings
need extra grouping:

No:

    'foo'+

Yes:

    <capture 'foo'>+

Also OK:

    ('foo')+  # but note this is a CAPTURING group in ERE

This is necessary because ERE doesn't have non-capturing groups like Perl's
`(?:...)`, and Eggex only does "dumb" translations. It doesn't silently insert
constructs that change the meaning of the pattern.

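The difference between repeating the last character and repeating the whole
string can be seen with equivalent patterns in Python's `re` module. This is a
sketch: Python does have `(?:...)`, so the last line shows the non-capturing
option that ERE lacks.

```python
import re

# 'foo+' repeats only the final 'o'.
last_char_only = re.fullmatch(r"foo+", "foofoo")

# '(foo)+' repeats the whole string, but adds a capturing group.
grouped = re.fullmatch(r"(foo)+", "foofoo")

# Perl-style non-capturing group, which is unavailable in POSIX ERE.
non_capturing = re.compile(r"(?:foo)+")

print(last_char_only is None, grouped.group(1), non_capturing.groups)
```
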
### Bytes and Code Points are Limited in Range

See [re-chars](ref/chap-expr-lang.html#re-chars) in the Oils Reference.

### Char class literals: `^ - ] \`

The literal characters `^ - ] \` are problematic because they can be confused
with operators.

- `^` means negation
- `-` means range
- `]` closes the character class
- `\` is usually literal, but GNU gawk has an extension to make it an escaping
  operator

The Eggex-to-ERE translator is smart enough to handle cases like this:

    var pat = / ['^' 'x'] /
    # translated to [x^], not [^x] for correctness

However, cases like this are a fatal runtime error:

    var pat1 = / ['a'-'^'] /
    var pat2 = / ['a'-'-'] /

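The reason the translator reorders `['^' 'x']` into `[x^]` can be demonstrated
with any engine that uses bracket expressions. The sketch below uses Python's
`re` module, which treats a leading `^` the same way as ERE in this respect.

```python
import re

# '^' in first position negates the class.
negated = re.fullmatch(r"[^x]", "x")

# '^' in any other position is a literal character.
literal_caret = re.fullmatch(r"[x^]", "^")
literal_x = re.fullmatch(r"[x^]", "x")

print(negated is None, literal_caret is not None, literal_x is not None)
```
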
## Critiques

### Regexes Are Hard To Read

... because the **same symbol can mean many things**.

`^` could mean:

- Start of the string/line
- Negated character class like `[^abc]`
- Literal character `^` like `[abc^]`

`\` is used in:

- Character classes like `\w` or `\d`
- Zero-width assertions like `\b`
- Escaped characters like `\n`
- Quoted characters like `\+`

`?` could mean:

- optional: `a?`
- lazy match: `a+?`
- some other kind of grouping:
  - `(?P<named>\d+)`
  - `(?:noncapturing)`

With egg expressions, each construct has a **distinct syntax**.

### YSH is Shorter Than Bash

Bash:

    if [[ $x =~ [[:digit:]]+ ]]; then
      echo 'x looks like a number'
    fi

Compare with YSH:

    if (x ~ /digit+/) {
      echo 'x looks like a number'
    }

### ... and Perl

Perl:

    $x =~ /\d+/

YSH:

    x ~ /d+/

The Perl expression has three more punctuation characters:

- YSH doesn't require sigils in expression mode
- The match operator is `~`, not `=~`
- Named character classes are unadorned like `d`. If that's too short, you can
  also write `digit`.

## Design Notes

### Eggexes In Other Languages

The eggex syntax can be incorporated into other tools and shells. It's
designed to be separate from YSH, hence the separate name.

Notes:

- Single quoted string literals should **disallow** internal backslashes, and
  treat all other characters literally. Instead, users can write `/ 'foo' \t
  'sq' \' bar \n /`, i.e. implicit concatenation of strings and characters,
  described above.
- To make eggexes portable between languages, don't use the host language's
  syntax for string literals (at least for single-quoted strings).

### Backward Compatibility

Eggexes aren't backward compatible in general, but they retain some legacy
operators like `^ . $` to ease the transition. These expressions are valid
eggexes **and** valid POSIX EREs:

    .*
    ^[0-9]+$
    ^.{1,3}|[0-9][0-9]?$

## FAQ

### The Name Sounds Funny.

If "eggex" sounds too much like "regex" to you, simply say "egg expression".
It won't be confused with "regular expression" or "regex".

### How Do Eggexes Compare with [Raku Regexes][raku-regex] and the [Rosie Pattern Language][rosie]?

All three languages support pattern composition and have quoted literals. And
they have the goal of improving upon Perl 5 regex syntax, which has made its
way into every major programming language (Python, Java, C++, etc.).

The main difference is that Eggexes are meant to be used with **existing**
regex engines. For example, you translate them to a POSIX ERE, which is
executed by `egrep` or `awk`. Or you translate them to a Perl-like syntax and
use them in Python, JavaScript, Java, or C++ programs.

Raku (formerly Perl 6) and Rosie have their **own engines** that are more
powerful than PCRE, Python, etc. That means they **cannot** be used this way.

[rosie]: https://rosie-lang.org/

[raku-regex]: https://docs.raku.org/language/regexes

### What About Eggex versus Parsing Expression Grammars? (PEGs)

The short answer is that they can be complementary: PEGs are closer to
**parsing**, while eggex and [regular languages]($xref:regular-language) are
closer to **lexing**. Related:

- [When Are Lexer Modes Useful?](https://www.oilshell.org/blog/2017/12/17.html)
- [Why Lexing and Parsing Should Be
  Separate](https://github.com/oils-for-unix/oils/wiki/Why-Lexing-and-Parsing-Should-Be-Separate) (wiki)

The PEG model is more resource intensive, but it can recognize more languages,
and it can recognize recursive structure (trees).

### Why Don't `dot`, `%start`, and `%end` Have More Precise Names?

Because the meanings of `.` `^` and `$` are usually affected by regex engine
flags, like `dotall`, `multiline`, and `unicode`.

As a result, the names mean nothing more than "however your regex engine
interprets `.` `^` and `$`".

As mentioned in the "Design Philosophy" section above, eggex only does a
superficial, one-to-one translation. It doesn't understand the details of which
characters will be matched under which engine.

### Where Do I Send Feedback?

Eggexes are implemented in YSH, but not yet set in stone.

Please try them, as described in [this
post](http://www.oilshell.org/blog/2019/08/22.html) and the
[README]($oils-src:README.md), and send us feedback!

You can create a new post on [/r/oilshell](https://www.reddit.com/r/oilshell/)
or a new message on `#oil-discuss` on <https://oilshell.zulipchat.com/> (log in
with GitHub, etc.)