---
default_highlighter: oils-sh
---

Egg Expressions (YSH Regexes)
=============================

YSH has a new syntax for patterns, which appears between the `/ /` delimiters:

    if (mystr ~ /d+ '.' d+/) {
      echo 'mystr looks like a number N.M'
    }

These patterns are intended to be familiar, but they differ from POSIX or Perl
expressions in important ways. So we call them eggexes rather than
regexes!

<!-- cmark.py expands this -->
<div id="toc">
</div>

## Why Invent a New Language?

- Eggexes let you name subpatterns and compose them, which makes them more
  readable and testable.
- Their syntax is vastly simpler because literal characters are quoted,
  and operators are not. For example, `^` no longer means three totally
  different things. See the critique at the end of this doc.
- bash and awk use the limited and verbose POSIX ERE syntax, while eggexes are
  more expressive and (in some cases) Perl-like.
- They're designed to be translated to any regex dialect. Right now, the
  YSH shell translates them to ERE so you can use them with common Unix tools:
  - `egrep` (`grep -E`)
  - `awk`
  - GNU `sed --regexp-extended`
  - PCRE syntax is the second most important target.
- They're statically parsed in YSH, so:
  - You can get syntax errors at parse time. In contrast, if you embed a
    regex in a string, you don't get syntax errors until runtime.
  - The eggex is part of the [lossless syntax tree][], which means you can do
    linting, formatting, and refactoring on eggexes, just like any other type
    of code.
- Eggexes support regular languages in the mathematical sense, whereas
  regexes are confused about the issue. All nonregular eggex extensions
  are prefixed with `!!`, so you can visually audit them for [catastrophic
  backtracking][backtracking]. (Russ Cox, author of the RE2 engine, [has
  written extensively](https://swtch.com/~rsc/regexp/) on this issue.)
- Eggexes are more fun than regexes!

[backtracking]: https://blog.codinghorror.com/regex-performance/

[lossless syntax tree]: http://www.oilshell.org/blog/2017/02/11.html

### Example of Pattern Reuse

Here's a longer example:

    # Define a subpattern. 'digit' and 'D' are the same.
    $ var D = / digit{1,3} /

    # Use the subpattern
    $ var ip_pat = / D '.' D '.' D '.' D /

    # This eggex compiles to an ERE
    $ echo $ip_pat
    [[:digit:]]{1,3}\.[[:digit:]]{1,3}\.[[:digit:]]{1,3}\.[[:digit:]]{1,3}

This means you can use it in a very simple way:

    $ egrep $ip_pat foo.txt

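As a rough sketch of what that last command does, the translated ERE can be pasted into plain POSIX shell and handed to `grep -E`. The sample input lines and the `matches` variable are made up for illustration; only the pattern comes from the output above.

```shell
# The ERE printed by `echo $ip_pat` above, pasted in by hand.
ip_pat='[[:digit:]]{1,3}\.[[:digit:]]{1,3}\.[[:digit:]]{1,3}\.[[:digit:]]{1,3}'

# Hypothetical sample input: one line with a dotted quad, one without.
matches=$(printf '%s\n' 'host 10.0.0.1 up' 'no address here' | grep -E "$ip_pat")

# Only the line containing something shaped like N.N.N.N survives.
echo "$matches"
```

Because the eggex becomes an ordinary string, any tool that accepts ERE syntax can consume it without knowing about YSH.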
TODO: You should also be able to inline patterns like this:

    egrep $/d+/ foo.txt

### Design Philosophy

- Eggexes can express a superset of POSIX and Perl syntax.
- The language is designed for "dumb", one-to-one, syntactic translations.
  That is, translation doesn't rely on understanding the semantics of
  regexes. This is because regex implementations have many corner cases and
  incompatibilities, with regard to Unicode, `NUL` bytes, etc.

### The Expression Language Is Consistent

Eggexes have a consistent syntax:

- Single characters are unadorned, in lowercase: `dot`, `space`, or `s`
- A sequence of multiple characters looks like `'lit'`, `$var`, etc.
- Constructs that match zero characters look like `%start`, `%word_end`, etc.
- Entire subpatterns (which may contain alternation, repetition, etc.) are in
  uppercase like `HexDigit`. Important: these are spliced as syntax trees,
  not strings, so you don't need to think about quoting.

For example, it's easy to see that these patterns all match three characters:

    / d d d /
    / digit digit digit /
    / dot dot dot /
    / word space word /
    / 'ab' space /
    / 'abc' /

And that these patterns match two:

    / %start w w /
    / %start 'if' /
    / d d %end /

And that you have to look up the definition of `HexDigit` to know how many
characters this matches:

    / %start HexDigit %end /

Constructs like `. ^ $ \< \>` are deprecated because they break these rules.

## Expression Primitives

### `.` Is Now `dot`

The `dot` primitive usually matches any character, although its exact meaning
depends on the underlying regex library.

- YSH uses `libc`, which accepts POSIX ERE syntax. So `dot` aka `.` matches
  any character, unless the `reg_newline` flag is true.
- If Eggex were compiled to Python, `dot` aka `.` would match any character
  except a newline, unless the `re.DOTALL` flag is set.

Note: Eggex accepts `.` as a synonym for `dot`, even though `dot` is preferred.

### Classes Are Unadorned: `word`, `w`, `alnum`

We accept both Perl and POSIX classes.

- Perl:
  - `d` or `digit`
  - `s` or `space`
  - `w` or `word`
- POSIX
  - `alpha`, `alnum`, ...

### Zero-width Assertions Look Like `%this`

- POSIX
  - `%start` is `^`
  - `%end` is `$`
- PCRE:
  - `%input_start` is `\A`
  - `%input_end` is `\z`
  - `%last_line_end` is `\Z`
- GNU ERE extensions:
  - `%word_start` is `\<`
  - `%word_end` is `\>`

### Single-Quoted Strings

- `'hello world'` becomes a regex-escaped string

Note: instead of using double-quoted strings like `"xyz $var"`, you can splice
a string into an eggex:

    / 'xyz ' @var /

## Compound Expressions

### Sequence and Alternation Are Unchanged

- `x y` matches `x` and `y` in sequence
- `x | y` matches `x` or `y`

You can also write a more Pythonic alternative: `x or y`.

### Repetition Is Unchanged In Common Cases, and Better in Rare Cases

Repetition is just like POSIX ERE or Perl:

- `x?`, `x+`, `x*`
- `x{3}`, `x{1,3}`

We've reserved syntactic space for PCRE and Python variants:

- lazy/non-greedy: `x{L +}`, `x{L 3,4}`
- possessive: `x{P +}`, `x{P 3,4}`

(Oils doesn't have these features, because Oils translates Eggex to POSIX ERE
syntax.)

### Negation Consistently Uses !

You can negate named char classes:

    / !digit /

and char class literals:

    / ![ a-z A-Z ] /

Sometimes you can do both:

    / ![ !digit ] /  # translates to /[^\D]/ in PCRE
                     # error in ERE because it can't be expressed

You can also negate "regex modifiers" / compilation flags:

    / word ; ignorecase /   # flag on
    / word ; !ignorecase /  # flag off
    / word ; !i /           # abbreviated

In contrast, regexes have many confusing syntaxes for negation:

    [^abc] vs. [abc]
    [^[:digit:]] vs. [[:digit:]]

    \D vs. \d

    /\w/-i vs /\w/i

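To make the POSIX side of that comparison concrete, here is a small sketch using `grep -E`. It assumes that `/ !digit /` corresponds to the ERE bracket expression `[^[:digit:]]`; the input lines and variable names are invented for the example.

```shell
# Assumption: / !digit / corresponds to the ERE bracket expression [^[:digit:]].
non_digit='[^[:digit:]]'

# Keep only the lines that contain at least one non-digit character.
has_non_digit=$(printf '%s\n' '123' '12a' | grep -E "$non_digit")

echo "$has_non_digit"
```

Note where the `^` has to go in ERE: inside the outer brackets but outside the `[:digit:]` class name, which is exactly the kind of detail the `!` prefix hides.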
### Splice Other Patterns `@var_name` or `UpperCaseVarName`

This allows you to reuse patterns. Using uppercase variables:

    var D = / digit{3} /

    var ip_addr = / D '.' D '.' D '.' D /

Using normal variables:

    var part = / digit{3} /

    var ip_addr = / @part '.' @part '.' @part '.' @part /

This is similar to how `lex` and `re2c` work.

### Group With `()`

Parentheses are used for precedence:

    ('foo' | 'bar')+

See note below: When translating to POSIX ERE, grouping becomes a capturing
group. POSIX ERE has no non-capturing groups.

### Capture with `<capture ...>`

Here's a positional capture:

    <capture d+>                # Becomes _group(1)

Add a variable after `as` for named capture:

    <capture d+ as month>       # Becomes _group('month')

You can also add type conversion functions:

    <capture d+ : int>          # _group(1) returns an Int, not Str
    <capture d+ as month: int>  # _group('month') returns an Int, not Str

### Character Class Literals Use `[]`

Example:

    [ a b c ]            # individual characters / code points
    [ '?' '*' '+' ]      # quoted characters
    [ 'a' 'bc' '?*+' ]   # reduce the number of quotes
    [ a-f 'A'-'F' x y ]  # ranges of code points
    [ \n \\ \' \" ]      # backslash escapes
    [ \yFF \u{03bc} ]    # a byte and a code point

Only letters, numbers, and the underscore may be unquoted:

    /['a'-'f' 'A'-'F' '0'-'9']/
    /[a-f A-F 0-9]/  # Equivalent to the above

    /['!' - ')']/    # Correct range
    /[!-)]/          # Syntax Error

Ranges must be separated by spaces:

No:

    /[a-fA-F0-9]/

Yes:

    /[a-f A-F 0-9]/

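For comparison, the ERE spelling has no spaces between ranges. The sketch below assumes that `/ [a-f A-F 0-9]+ /` would be emitted as `[a-fA-F0-9]+`; that output, the sample lines, and the variable names are assumptions for illustration rather than something taken from the translator.

```shell
# Assumed ERE spelling of / [a-f A-F 0-9]+ /: the spaces between ranges are dropped.
hex_pat='[a-fA-F0-9]+'

# LC_ALL=C keeps the ranges byte-based; -x requires the whole line to match.
hex_lines=$(printf '%s\n' 'c0ffee' 'xyz' 'DEAD' | LC_ALL=C grep -E -x "$hex_pat")

echo "$hex_lines"
```

The eggex form trades a few extra spaces for ranges that are easy to tell apart at a glance.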
### Backtracking Constructs Use `!!` (Discouraged)

If you want to translate to PCRE, you can use these.

    !!REF 1
    !!REF name

    !!AHEAD( d+ )
    !!NOT_AHEAD( d+ )
    !!BEHIND( d+ )
    !!NOT_BEHIND( d+ )

    !!ATOMIC( d+ )

Since they all begin with `!!`, you can visually audit your code for potential
performance problems.

## Outside the Expression Language

### Flags and Translation Preferences (`;`)

Flags or "regex modifiers" appear after a semicolon:

    / digit+ ; i /  # ignore case

A translation preference is specified after a second semicolon:

    / digit+ ; ; ERE /     # translates to [[:digit:]]+
    / digit+ ; ; python /  # could translate to \d+

Flags and translation preferences together:

    / digit+ ; ignorecase ; python /  # could translate to (?i)\d+

In Oils, the following flags are currently supported:

#### `reg_icase` / `i` (Ignore Case)

Use this flag to ignore case when matching. For example, `/'foo'; i/` matches
'FOO', but `/'foo'/` doesn't.

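The same behavior can be observed from the command line. This is only an analogy: `grep -i` is a separate tool that also matches without regard to case, not the mechanism YSH uses, and the input line and variable names are made up.

```shell
# Hypothetical input line. grep -c prints the number of matching lines;
# `|| true` keeps a zero count from being treated as a failure.
with_flag=$(printf 'FOO\n' | grep -E -c -i 'foo' || true)
without_flag=$(printf 'FOO\n' | grep -E -c 'foo' || true)

echo "with -i: $with_flag, without: $without_flag"
```

The pattern text is identical in both runs; only the flag changes the result, which is why eggex keeps flags after the `;` instead of inside the pattern.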
#### `reg_newline` (Multiline)

With this flag, `%end` will match before a newline and `%start` will match
after a newline.

    = u'abc123\n' ~ / digit %end ; reg_newline /    # true
    = u'abc\n123' ~ / %start digit ; reg_newline /  # true

Without the flag, `%start` and `%end` only match from the start or end of the
string, respectively.

    = u'abc123\n' ~ / digit %end /    # false
    = u'abc\n123' ~ / %start digit /  # false

With this flag, `dot` and negated classes like `![abc]` also don't match a
newline.

    = u'\n' ~ / . ; reg_newline /       # false
    = u'\n' ~ / !digit ; reg_newline /  # false

Without this flag, the newline `\n` is treated as an ordinary character.

    = u'\n' ~ / . /       # true
    = u'\n' ~ / !digit /  # true

### Multiline Syntax

You can spread regexes over multiple lines and add comments:

    var x = ///
      digit{4}  # year e.g. 2001
      '-'
      digit{2}  # month e.g. 06
      '-'
      digit{2}  # day e.g. 31
    ///

(Not yet implemented in YSH.)

### The YSH API

See the [YSH regex API](ysh-regex-api.html) for details.

In summary, YSH has Perl-like conveniences with a `~` operator:

    var s = 'on 04-01, 10-31'
    var pat = /<capture d+ as month> '-' <capture d+ as day>/

    if (s ~ pat) {             # search for the pattern
      echo $[_group('month')]  # => 04
    }

It also has an explicit and powerful Python-like API with the `search()` and
`leftMatch()` methods on strings.

    var m = s => search(pat, pos=8)  # start searching at a position
    if (m) {
      echo $[m => group('month')]    # => 10
    }

### Language Reference

- See bottom of the [YSH Expression Grammar]($oils-src:ysh/grammar.pgen2) for
  the concrete syntax.
- See the bottom of [frontend/syntax.asdl]($oils-src:frontend/syntax.asdl) for
  the abstract syntax.

## Usage Notes

### Use character literals rather than C-Escaped strings

No:

    / $'foo\tbar' /  # Match 7 characters including a tab, but it's hard to read
    / r'foo\tbar' /  # The string must contain 8 chars including '\' and 't'

Yes:

    # Instead, take advantage of char literals and implicit regex concatenation
    / 'foo' \t 'bar' /
    / 'foo' \\ 'tbar' /

## POSIX ERE Limitations

In theory, Eggex can be translated to many different syntaxes, like Perl or
Python.

In practice, Oils translates Eggex to POSIX Extended Regular Expressions (ERE)
syntax. This syntax is understood by `libc`, e.g. GNU libc or musl libc.

But not all of Eggex can be translated to ERE syntax. And the translation is
straightforward and "dumb", rather than smart.

Here are some limitations.

### Repetition of Strings Requires Grouping

Repetitions like `* + ?` apply only to the last character, so literal strings
need extra grouping:

No:

    'foo'+

Yes:

    <capture 'foo'>+

Also OK:

    ('foo')+  # but note this is a CAPTURING group in ERE

This is necessary because ERE doesn't have non-capturing groups like Perl's
`(?:...)`, and Eggex only does "dumb" translations. It doesn't silently insert
constructs that change the meaning of the pattern.

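The difference is easy to see at the ERE level. The sketch below runs the two spellings, `foo+` and `(foo)+`, through `grep -E -x` (whole-line match); the input lines and variable names are invented for the example.

```shell
# Hypothetical input: 'foo' repeated twice, and 'foo' with one extra 'o'.
input=$(printf '%s\n' 'foofoo' 'fooo')

# Without a group, + binds only to the final 'o'.
last_char=$(printf '%s\n' "$input" | grep -E -x 'foo+')

# With a group, + repeats the whole string 'foo'.
whole_string=$(printf '%s\n' "$input" | grep -E -x '(foo)+')

echo "foo+ matched: $last_char"
echo "(foo)+ matched: $whole_string"
```

The two patterns accept different inputs, so a translator that quietly added the parentheses would change both the matches and the numbering of capture groups.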
### Bytes and Code Points are Limited in Range

See [re-chars](ref/chap-expr-lang.html#re-chars) in the Oils Reference.

### Char class literals: `^ - ] \`

The literal characters `^ - ] \` are problematic because they can be confused
with operators.

- `^` means negation
- `-` means range
- `]` closes the character class
- `\` is usually literal, but GNU gawk has an extension to make it an escaping
  operator

The Eggex-to-ERE translator is smart enough to handle cases like this:

    var pat = / ['^' 'x'] /
    # translated to [x^], not [^x] for correctness

However, cases like this are a fatal runtime error:

    var pat1 = / ['a'-'^'] /
    var pat2 = / ['a'-'-'] /

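To see why the order inside the brackets matters, the two ERE spellings can be compared directly with `grep -E -x`. The three single-character input lines and the variable names are made up for this sketch.

```shell
# Hypothetical input: one character per line.
chars=$(printf '%s\n' '^' 'x' 'y')

# [x^] is a set containing a literal 'x' and a literal '^'.
literal_caret=$(printf '%s\n' "$chars" | grep -E -x '[x^]')

# [^x] is a negated set: any single character except 'x'.
negated=$(printf '%s\n' "$chars" | grep -E -x '[^x]')

echo "[x^] matched:"; echo "$literal_caret"
echo "[^x] matched:"; echo "$negated"
```

Moving the `^` away from the first position is what keeps it literal, which is the reordering described above.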
## Critiques

### Regexes Are Hard To Read

... because the same symbol can mean many things.

`^` could mean:

- Start of the string/line
- Negated character class like `[^abc]`
- Literal character `^` like `[abc^]`

`\` is used in:

- Character classes like `\w` or `\d`
- Zero-width assertions like `\b`
- Escaped characters like `\n`
- Quoted characters like `\+`

`?` could mean:

- optional: `a?`
- lazy match: `a+?`
- some other kind of grouping:
  - `(?P<named>\d+)`
  - `(?:noncapturing)`

With egg expressions, each construct has a distinct syntax.

### YSH is Shorter Than Bash

Bash:

    if [[ $x =~ [[:digit:]]+ ]]; then
      echo 'x looks like a number'
    fi

Compare with YSH:

    if (x ~ /digit+/) {
      echo 'x looks like a number'
    }

### ... and Perl

Perl:

    $x =~ /\d+/

YSH:

    x ~ /d+/

The Perl expression has three more punctuation characters:

- YSH doesn't require sigils in expression mode
- The match operator is `~`, not `=~`
- Named character classes are unadorned like `d`. If that's too short, you can
  also write `digit`.

## Design Notes

### Eggexes In Other Languages

The eggex syntax can be incorporated into other tools and shells. It's
designed to be separate from YSH -- hence the separate name.

Notes:

- Single quoted string literals should disallow internal backslashes, and
  treat all other characters literally. Instead, users can write `/ 'foo' \t
  'sq' \' bar \n /`, i.e. implicit concatenation of strings and
  characters, described above.
- To make eggexes portable between languages, don't use the host language's
  syntax for string literals (at least for single-quoted strings).

### Backward Compatibility

Eggexes aren't backward compatible in general, but they retain some legacy
operators like `^ . $` to ease the transition. These expressions are valid
eggexes and valid POSIX EREs:

    .*
    ^[0-9]+$
    ^.{1,3}|[0-9][0-9]?$

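Since these are already valid EREs, they can be handed to `grep -E` unchanged. A minimal sketch with the second pattern, using invented input lines and a made-up variable name:

```shell
# Hypothetical input: a number, a string with a letter in it, and an empty line.
all_digits=$(printf '%s\n' '42' '4x2' '' | grep -E '^[0-9]+$')

# Only the line made up entirely of digits is printed.
echo "$all_digits"
```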
## FAQ

### The Name Sounds Funny.

If "eggex" sounds too much like "regex" to you, simply say "egg expression".
It won't be confused with "regular expression" or "regex".

### How Do Eggexes Compare with [Raku Regexes][raku-regex] and the [Rosie Pattern Language][rosie]?

All three languages support pattern composition and have quoted literals. And
they have the goal of improving upon Perl 5 regex syntax, which has made its
way into every major programming language (Python, Java, C++, etc.).

The main difference is that Eggexes are meant to be used with existing
regex engines. For example, you translate them to a POSIX ERE, which is
executed by `egrep` or `awk`. Or you translate them to a Perl-like syntax and
use them in Python, JavaScript, Java, or C++ programs.

Raku (formerly Perl 6) and Rosie have their own engines that are more powerful
than PCRE, Python, etc. That means they cannot be used this way.

[rosie]: https://rosie-lang.org/

[raku-regex]: https://docs.raku.org/language/regexes

### What About Eggex versus Parsing Expression Grammars? (PEGs)

The short answer is that they can be complementary: PEGs are closer to
parsing, while eggex and [regular languages]($xref:regular-language) are
closer to lexing. Related:

- [When Are Lexer Modes Useful?](https://www.oilshell.org/blog/2017/12/17.html)
- [Why Lexing and Parsing Should Be
  Separate](https://github.com/oils-for-unix/oils/wiki/Why-Lexing-and-Parsing-Should-Be-Separate) (wiki)

The PEG model is more resource intensive, but it can recognize more languages,
and it can recognize recursive structure (trees).

### Why Don't `dot`, `%start`, and `%end` Have More Precise Names?

Because the meanings of `.` `^` and `$` are usually affected by regex engine
flags, like `dotall`, `multiline`, and `unicode`.

As a result, the names mean nothing more than "however your regex engine
interprets `.` `^` and `$`".

As mentioned in the "Design Philosophy" section above, eggex only does a
superficial, one-to-one translation. It doesn't understand the details of
which characters will be matched under which engine.

### Where Do I Send Feedback?

Eggexes are implemented in YSH, but not yet set in stone.

Please try them, as described in [this
post](http://www.oilshell.org/blog/2019/08/22.html) and the
[README]($oils-src:README.md), and send us feedback!

You can create a new post on [/r/oilshell](https://www.reddit.com/r/oilshell/)
or a new message on `#oil-discuss` on <https://oilshell.zulipchat.com/> (log in
with GitHub, etc.)