| 1 | ---
|
| 2 | default_highlighter: oils-sh
|
| 3 | ---
|
| 4 |
|
| 5 | Egg Expressions (YSH Regexes)
|
| 6 | =============================
|
| 7 |
|
| 8 | YSH has a new syntax for patterns, which appears between the `/ /` delimiters:
|
| 9 |
|
| 10 | if (mystr ~ /d+ '.' d+/) {
|
| 11 | echo 'mystr looks like a number N.M'
|
| 12 | }
|
| 13 |
|
| 14 | These patterns are intended to be familiar, but they differ from POSIX or Perl
|
| 15 | expressions in important ways. So we call them *eggexes* rather than
|
| 16 | *regexes*!
|
| 17 |
|
| 18 | <!-- cmark.py expands this -->
|
| 19 | <div id="toc">
|
| 20 | </div>
|
| 21 |
|
| 22 | ## Why Invent a New Language?
|
| 23 |
|
| 24 | - Eggexes let you name **subpatterns** and compose them, which makes them more
|
| 25 | readable and testable.
|
| 26 | - Their **syntax** is vastly simpler because literal characters are **quoted**,
|
| 27 | and operators are not. For example, `^` no longer means three totally
|
| 28 | different things. See the critique at the end of this doc.
|
| 29 | - bash and awk use the limited and verbose POSIX ERE syntax, while eggexes are
|
| 30 | more expressive and (in some cases) Perl-like.
|
| 31 | - They're designed to be **translated to any regex dialect**. Right now, the
|
| 32 | YSH shell translates them to ERE so you can use them with common Unix tools:
|
| 33 | - `egrep` (`grep -E`)
|
| 34 | - `awk`
|
| 35 | - GNU `sed --regexp-extended`
|
| 36 | - PCRE syntax is the second most important target.
|
| 37 | - They're **statically parsed** in YSH, so:
|
| 38 | - You can get **syntax errors** at parse time. In contrast, if you embed a
|
| 39 | regex in a string, you don't get syntax errors until runtime.
|
| 40 | - The eggex is part of the [lossless syntax tree][], which means you can do
|
| 41 | linting, formatting, and refactoring on eggexes, just like any other type
|
| 42 | of code.
|
| 43 | - Eggexes support **regular languages** in the mathematical sense, whereas
|
| 44 | regexes are **confused** about the issue. All nonregular eggex extensions
|
| 45 | are prefixed with `!!`, so you can visually audit them for [catastrophic
|
| 46 | backtracking][backtracking]. (Russ Cox, author of the RE2 engine, [has
|
| 47 | written extensively](https://swtch.com/~rsc/regexp/) on this issue.)
|
| 48 | - Eggexes are more fun than regexes!
|
| 49 |
|
| 50 | [backtracking]: https://blog.codinghorror.com/regex-performance/
|
| 51 |
|
| 52 | [lossless syntax tree]: http://www.oilshell.org/blog/2017/02/11.html
|
| 53 |
|
| 54 | ### Example of Pattern Reuse
|
| 55 |
|
| 56 | Here's a longer example:
|
| 57 |
|
| 58 | # Define a subpattern. 'digit' and 'd' are the same.
|
| 59 | $ var D = / digit{1,3} /
|
| 60 |
|
| 61 | # Use the subpattern
|
| 62 | $ var ip_pat = / D '.' D '.' D '.' D /
|
| 63 |
|
| 64 | # This eggex compiles to an ERE
|
| 65 | $ echo $ip_pat
|
| 66 | [[:digit:]]{1,3}\.[[:digit:]]{1,3}\.[[:digit:]]{1,3}\.[[:digit:]]{1,3}
|
| 67 |
|
| 68 | This means you can use it in a very simple way:
|
| 69 |
|
| 70 | $ egrep $ip_pat foo.txt
|
| 71 |
|
| 72 | TODO: You should also be able to inline patterns like this:
|
| 73 |
|
| 74 | egrep $/d+/ foo.txt
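For comparison, here is roughly what the same reuse looks like without eggexes, in plain POSIX shell. The subpattern is just a string, and composition is string concatenation, so the literal dot has to be escaped by hand. This is only a sketch: the variable names mirror the example above, and `match_ip` is a hypothetical helper, not part of YSH.

```shell
# ERE equivalent of / digit{1,3} /, matching what `echo $ip_pat` printed above.
D='[[:digit:]]{1,3}'

# Composition is string concatenation; the literal dot is escaped manually.
ip_pat="$D\.$D\.$D\.$D"

# match_ip (hypothetical): succeeds if its argument contains a match.
match_ip() {
  printf '%s\n' "$1" | grep -E -q -- "$ip_pat"
}

if match_ip 'host 192.168.0.1 up'; then
  printf '%s\n' 'found an address'
fi
```

With eggexes, the quoting of `'.'` and the splicing of `D` are handled by the parser rather than by string escaping rules.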
|
| 75 |
|
| 76 | ### Design Philosophy
|
| 77 |
|
| 78 | - Eggexes can express a **superset** of POSIX and Perl syntax.
|
| 79 | - The language is designed for "dumb", one-to-one, **syntactic** translations.
|
| 80 | That is, translation doesn't rely on understanding the **semantics** of
|
| 81 | regexes. This is because regex implementations have many corner cases and
|
| 82 | incompatibilities, with regard to Unicode, `NUL` bytes, etc.
|
| 83 |
|
| 84 | ### The Expression Language Is Consistent
|
| 85 |
|
| 86 | Eggexes have a consistent syntax:
|
| 87 |
|
| 88 | - Single characters are unadorned, in lowercase: `dot`, `space`, or `s`
|
| 89 | - A sequence of multiple characters looks like `'lit'`, `$var`, etc.
|
| 90 | - Constructs that match **zero** characters look like `%start`, `%word_end`, etc.
|
| 91 | - Entire subpatterns (which may contain alternation, repetition, etc.) are in
|
| 92 | uppercase like `HexDigit`. Important: these are **spliced** as syntax trees,
|
| 93 | not strings, so you **don't** need to think about quoting.
|
| 94 |
|
| 95 | For example, it's easy to see that these patterns all match **three** characters:
|
| 96 |
|
| 97 | / d d d /
|
| 98 | / digit digit digit /
|
| 99 | / dot dot dot /
|
| 100 | / word space word /
|
| 101 | / 'ab' space /
|
| 102 | / 'abc' /
|
| 103 |
|
| 104 | And that these patterns match **two**:
|
| 105 |
|
| 106 | / %start w w /
|
| 107 | / %start 'if' /
|
| 108 | / d d %end /
|
| 109 |
|
| 110 | And that you have to look up the definition of `HexDigit` to know how many
|
| 111 | characters this matches:
|
| 112 |
|
| 113 | / %start HexDigit %end /
|
| 114 |
|
| 115 | Constructs like `. ^ $ \< \>` are deprecated because they break these rules.
|
| 116 |
|
| 117 | ## Expression Primitives
|
| 118 |
|
| 119 | ### `.` Is Now `dot`
|
| 120 |
|
| 121 | The `dot` primitive usually matches any character, although its exact meaning
|
| 122 | depends on the underlying regex library.
|
| 123 |
|
| 124 | - YSH uses `libc`, which accepts POSIX ERE syntax. So `dot` aka `.` matches
|
| 125 | any character, unless the `reg_newline` flag is true.
|
| 126 | - If Eggex were compiled to Python, `dot` aka `.` would match any character
|
| 127 | *except* a newline, unless the `re.DOTALL` flag is true.
|
| 128 |
|
| 129 | Note: Eggex accepts `.` as a synonym for `dot`, even though `dot` is preferred.
|
| 130 |
|
| 131 | ### Classes Are Unadorned: `word`, `w`, `alnum`
|
| 132 |
|
| 133 | We accept both Perl and POSIX classes.
|
| 134 |
|
| 135 | - Perl:
|
| 136 | - `d` or `digit`
|
| 137 | - `s` or `space`
|
| 138 | - `w` or `word`
|
| 139 | - POSIX
|
| 140 | - `alpha`, `alnum`, ...
|
| 141 |
|
| 142 | ### Zero-width Assertions Look Like `%this`
|
| 143 |
|
| 144 | - POSIX
|
| 145 | - `%start` is `^`
|
| 146 | - `%end` is `$`
|
| 147 | - PCRE:
|
| 148 | - `%input_start` is `\A`
|
| 149 | - `%input_end` is `\z`
|
| 150 | - `%last_line_end` is `\Z`
|
| 151 | - GNU ERE extensions:
|
| 152 | - `%word_start` is `\<`
|
| 153 | - `%word_end` is `\>`
|
| 154 |
|
| 155 | ### Single-Quoted Strings
|
| 156 |
|
| 157 | - `'hello *world*'` becomes a regex-escaped string
|
| 158 |
|
| 159 | Note: instead of using double-quoted strings like `"xyz $var"`, you can splice
|
| 160 | a string into an eggex:
|
| 161 |
|
| 162 | / 'xyz ' @var /
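To make "regex-escaped" concrete, the sketch below escapes ERE metacharacters in a literal string with `sed`. It is an illustration only: `ere_escape` is a hypothetical helper, and the exact escaping that YSH emits may differ.

```shell
# ere_escape (hypothetical, not part of YSH): prefix each ERE metacharacter
# in the argument with a backslash so the string matches itself literally.
ere_escape() {
  printf '%s\n' "$1" | sed 's/[][\.*+?(){}|^$]/\\&/g'
}

escaped=$(ere_escape 'hello *world*')
printf '%s\n' "$escaped"    # prints: hello \*world\*

# The escaped string can then be used as an ERE that matches the original text.
printf '%s\n' 'hello *world*' | grep -E -x -- "$escaped"
```

In an eggex, the single quotes do this work for you, so the pattern stays readable.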
|
| 163 |
|
| 164 | ## Compound Expressions
|
| 165 |
|
| 166 | ### Sequence and Alternation Are Unchanged
|
| 167 |
|
| 168 | - `x y` matches `x` and `y` in sequence
|
| 169 | - `x | y` matches `x` or `y`
|
| 170 |
|
| 171 | You can also write a more Pythonic alternative: `x or y`.
|
| 172 |
|
| 173 | ### Repetition Is Unchanged In Common Cases, and Better in Rare Cases
|
| 174 |
|
| 175 | Repetition is just like POSIX ERE or Perl:
|
| 176 |
|
| 177 | - `x?`, `x+`, `x*`
|
| 178 | - `x{3}`, `x{1,3}`
|
| 179 |
|
| 180 | We've reserved syntactic space for PCRE and Python variants:
|
| 181 |
|
| 182 | - lazy/non-greedy: `x{L +}`, `x{L 3,4}`
|
| 183 | - possessive: `x{P +}`, `x{P 3,4}`
|
| 184 |
|
| 185 | (Oils doesn't have these features, because Oils translates Eggex to POSIX ERE
|
| 186 | syntax.)
|
| 187 |
|
| 188 | ### Negation Consistently Uses !
|
| 189 |
|
| 190 | You can negate named char classes:
|
| 191 |
|
| 192 | / !digit /
|
| 193 |
|
| 194 | and char class literals:
|
| 195 |
|
| 196 | / ![ a-z A-Z ] /
|
| 197 |
|
| 198 | Sometimes you can do both:
|
| 199 |
|
| 200 | / ![ !digit ] / # translates to /[^\D]/ in PCRE
|
| 201 | # error in ERE because it can't be expressed
|
| 202 |
|
| 203 |
|
| 204 | You can also negate "regex modifiers" / compilation flags:
|
| 205 |
|
| 206 | / word ; ignorecase / # flag on
|
| 207 | / word ; !ignorecase / # flag off
|
| 208 | / word ; !i / # abbreviated
|
| 209 |
|
| 210 | In contrast, regexes have many confusing syntaxes for negation:
|
| 211 |
|
| 212 | [^abc] vs. [abc]
|
| 213 | [[^:digit:]] vs. [[:digit:]]
|
| 214 |
|
| 215 | \D vs. \d
|
| 216 |
|
| 217 | /\w/-i vs /\w/i
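As a rough illustration of what negated classes mean once translated, the sketch below uses `grep -E` with negated bracket expressions. The ERE spellings are assumptions about a plausible translation, not output copied from YSH, and `contains` is a hypothetical helper.

```shell
# Assumed ERE spellings of / !digit / and / ![ a-z A-Z ] /.
not_digit='[^[:digit:]]'
not_letter='[^a-zA-Z]'

# contains (hypothetical): prints yes if the string in $2 contains a match
# for the ERE in $1, and no otherwise.
contains() {
  if printf '%s\n' "$2" | grep -E -q -- "$1"; then echo yes; else echo no; fi
}

contains "$not_digit"  '2024'      # no:  every character is a digit
contains "$not_digit"  '2024-06'   # yes: '-' is not a digit
contains "$not_letter" 'abc'       # no:  every character is a letter
```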
|
| 218 |
|
| 219 | ### Splice Other Patterns `@var_name` or `UpperCaseVarName`
|
| 220 |
|
| 221 | This allows you to reuse patterns. Using uppercase variables:
|
| 222 |
|
| 223 | var D = / digit{3} /
|
| 224 |
|
| 225 | var ip_addr = / D '.' D '.' D '.' D /
|
| 226 |
|
| 227 | Using normal variables:
|
| 228 |
|
| 229 | var part = / digit{3} /
|
| 230 |
|
| 231 | var ip_addr = / @part '.' @part '.' @part '.' @part /
|
| 232 |
|
| 233 | This is similar to how `lex` and `re2c` work.
|
| 234 |
|
| 235 | ### Group With `()`
|
| 236 |
|
| 237 | Parentheses are used for precedence:
|
| 238 |
|
| 239 | ('foo' | 'bar')+
|
| 240 |
|
| 241 | See note below: When translating to POSIX ERE, grouping becomes a capturing
|
| 242 | group. POSIX ERE has no non-capturing groups.
|
| 243 |
|
| 244 |
|
| 245 | ### Capture with `<capture ...>`
|
| 246 |
|
| 247 | Here's a positional capture:
|
| 248 |
|
| 249 | <capture d+> # Becomes _group(1)
|
| 250 |
|
| 251 | Add a variable after `as` for named capture:
|
| 252 |
|
| 253 | <capture d+ as month> # Becomes _group('month')
|
| 254 |
|
| 255 | You can also add type conversion functions:
|
| 256 |
|
| 257 | <capture d+ : int> # _group(1) returns an Int, not Str
|
| 258 | <capture d+ as month: int> # _group('month') returns an Int, not Str
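In ERE, captures are positional: groups are numbered by their opening parenthesis. The sketch below shows that numbering with `sed -E`, assuming a `sed` that accepts `-E` (GNU and BSD both do). The pattern is an assumed ERE spelling of `/ <capture d+> '-' <capture d+> /`, and `month_of` is a hypothetical helper.

```shell
# Assumed ERE for / <capture d+> '-' <capture d+> /: two positional groups.
pat='([[:digit:]]+)-([[:digit:]]+)'

# month_of (hypothetical): prints group 1, assuming the match starts at the
# first digit on the line. Prints nothing if the line doesn't match.
month_of() {
  printf '%s\n' "$1" | sed -E -n "s/^[^[:digit:]]*$pat.*\$/\\1/p"
}

month_of 'on 04-01, 10-31'   # prints: 04
```

Note that the result is the string `04`; a conversion like `: int` is something the eggex layer adds on top of the regex engine.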
|
| 259 |
|
| 260 | ### Character Class Literals Use `[]`
|
| 261 |
|
| 262 | Example:
|
| 263 |
|
| 264 | [ a b c ] # individual characters / code points
|
| 265 | [ '?' '*' '+' ] # quoted characters
|
| 266 | [ 'a' 'bc' '?*+' ] # reduce the number of quotes
|
| 267 | [ a-f 'A'-'F' x y ] # ranges of code points
|
| 268 | [ \n \\ \' \" ] # backslash escapes
|
| 269 | [ \yFF \u{03bc} ] # a byte and a code point
|
| 270 |
|
| 271 | Only letters, numbers, and the underscore may be unquoted:
|
| 272 |
|
| 273 | /['a'-'f' 'A'-'F' '0'-'9']/
|
| 274 | /[a-f A-F 0-9]/ # Equivalent to the above
|
| 275 |
|
| 276 | /['!' - ')']/ # Correct range
|
| 277 | /[!-)]/ # Syntax Error
|
| 278 |
|
| 279 | Ranges must be separated by spaces:
|
| 280 |
|
| 281 | No:
|
| 282 |
|
| 283 | /[a-fA-F0-9]/
|
| 284 |
|
| 285 | Yes:
|
| 286 |
|
| 287 | /[a-f A-F 0-9]/
|
| 288 |
|
| 289 | ### Backtracking Constructs Use `!!` (Discouraged)
|
| 290 |
|
| 291 | If you want to translate to PCRE, you can use these.
|
| 292 |
|
| 293 | !!REF 1
|
| 294 | !!REF name
|
| 295 |
|
| 296 | !!AHEAD( d+ )
|
| 297 | !!NOT_AHEAD( d+ )
|
| 298 | !!BEHIND( d+ )
|
| 299 | !!NOT_BEHIND( d+ )
|
| 300 |
|
| 301 | !!ATOMIC( d+ )
|
| 302 |
|
| 303 | Since they all begin with `!!`, you can visually audit your code for potential
|
| 304 | performance problems.
|
| 305 |
|
| 306 | ## Outside the Expression language
|
| 307 |
|
| 308 | ### Flags and Translation Preferences (`;`)
|
| 309 |
|
| 310 | Flags or "regex modifiers" appear after a semicolon:
|
| 311 |
|
| 312 | / digit+ ; i / # ignore case
|
| 313 |
|
| 314 | A translation preference is specified after a second semicolon:
|
| 315 |
|
| 316 | / digit+ ; ; ERE / # translates to [[:digit:]]+
|
| 317 | / digit+ ; ; python / # could translate to \d+
|
| 318 |
|
| 319 | Flags and translation preferences together:
|
| 320 |
|
| 321 | / digit+ ; ignorecase ; python / # could translate to (?i)\d+
|
| 322 |
|
| 323 | In Oils, the following flags are currently supported:
|
| 324 |
|
| 325 | #### `reg_icase` / `i` (Ignore Case)
|
| 326 |
|
| 327 | Use this flag to ignore case when matching. For example, `/'foo'; i/` matches
|
| 328 | 'FOO', but `/'foo'/` doesn't.
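The flag name mirrors `REG_ICASE` from POSIX `regcomp()`. As a loose analogy only, `grep -i` applies the same kind of case-insensitive matching to an ERE; `icase_demo` below is a hypothetical helper, not part of YSH.

```shell
# icase_demo (hypothetical): match the ERE 'foo' against $1, first
# case-sensitively and then with grep's -i option.
icase_demo() {
  if printf '%s\n' "$1" | grep -E -q -- 'foo'; then
    echo 'without -i: match'
  else
    echo 'without -i: no match'
  fi
  if printf '%s\n' "$1" | grep -E -q -i -- 'foo'; then
    echo 'with -i: match'
  else
    echo 'with -i: no match'
  fi
}

icase_demo 'FOO'
# without -i: no match
# with -i: match
```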
|
| 329 |
|
| 330 | #### `reg_newline` (Multiline)
|
| 331 |
|
| 332 | With this flag, `%end` will match before a newline and `%start` will match
|
| 333 | after a newline.
|
| 334 |
|
| 335 | = u'abc123\n' ~ / digit %end ; reg_newline / # true
|
| 336 | = u'abc\n123' ~ / %start digit ; reg_newline / # true
|
| 337 |
|
| 338 | Without the flag, `%start` and `%end` only match from the start or end of the
|
| 339 | string, respectively.
|
| 340 |
|
| 341 | = u'abc123\n' ~ / digit %end / # false
|
| 342 | = u'abc\n123' ~ / %start digit / # false
|
| 343 |
|
| 344 | With this flag, `dot` and negated patterns like `![abc]` also don't match newlines.
|
| 345 |
|
| 346 | = u'\n' ~ / . ; reg_newline / # false
|
| 347 | = u'\n' ~ / !digit ; reg_newline / # false
|
| 348 |
|
| 349 | Without this flag, the newline `\n` is treated as an ordinary character.
|
| 350 |
|
| 351 | = u'\n' ~ / . / # true
|
| 352 | = u'\n' ~ / !digit / # true
|
| 353 |
|
| 354 | ### Multiline Syntax
|
| 355 |
|
| 356 | You can spread regexes over multiple lines and add comments:
|
| 357 |
|
| 358 | var x = ///
|
| 359 | digit{4} # year e.g. 2001
|
| 360 | '-'
|
| 361 | digit{2} # month e.g. 06
|
| 362 | '-'
|
| 363 | digit{2} # day e.g. 31
|
| 364 | ///
|
| 365 |
|
| 366 | (Not yet implemented in YSH.)
|
| 367 |
|
| 368 | ### The YSH API
|
| 369 |
|
| 370 | See the [YSH regex API](ysh-regex-api.html) for details.
|
| 371 |
|
| 372 | In summary, YSH has Perl-like conveniences with an `~` operator:
|
| 373 |
|
| 374 | var s = 'on 04-01, 10-31'
|
| 375 | var pat = /<capture d+ as month> '-' <capture d+ as day>/
|
| 376 |
|
| 377 | if (s ~ pat) { # search for the pattern
|
| 378 | echo $[_group('month')] # => 04
|
| 379 | }
|
| 380 |
|
| 381 | It also has an explicit and powerful Python-like API with the `search()` and
|
| 382 | `leftMatch()` methods on strings.
|
| 383 |
|
| 384 | var m = s => search(pat, pos=8) # start searching at a position
|
| 385 | if (m) {
|
| 386 | echo $[m => group('month')] # => 10
|
| 387 | }
|
| 388 |
|
| 389 | ### Language Reference
|
| 390 |
|
| 391 | - See the bottom of the [YSH Expression Grammar]($oils-src:ysh/grammar.pgen2) for
|
| 392 | the concrete syntax.
|
| 393 | - See the bottom of [frontend/syntax.asdl]($oils-src:frontend/syntax.asdl) for
|
| 394 | the abstract syntax.
|
| 395 |
|
| 396 | ## Usage Notes
|
| 397 |
|
| 398 | ### Use character literals rather than C-Escaped strings
|
| 399 |
|
| 400 | No:
|
| 401 |
|
| 402 | / $'foo\tbar' / # Match 7 characters including a tab, but it's hard to read
|
| 403 | / r'foo\tbar' / # The string must contain 8 chars including '\' and 't'
|
| 404 |
|
| 405 | Yes:
|
| 406 |
|
| 407 | # Instead, take advantage of char literals and implicit regex concatenation
|
| 408 | / 'foo' \t 'bar' /
|
| 409 | / 'foo' \\ 'tbar' /
|
| 410 |
|
| 411 |
|
| 412 | ## POSIX ERE Limitations
|
| 413 |
|
| 414 | In theory, Eggex can be translated to many different syntaxes, like Perl or
|
| 415 | Python.
|
| 416 |
|
| 417 | In practice, Oils translates Eggex to POSIX Extended Regular Expressions (ERE)
|
| 418 | syntax. This syntax is understood by `libc`, e.g. GNU libc or musl libc.
|
| 419 |
|
| 420 | But not all of Eggex can be translated to ERE syntax. And the translation is
|
| 421 | straightforward and "dumb", rather than smart.
|
| 422 |
|
| 423 | Here are some limitations.
|
| 424 |
|
| 425 | ### Repetition of Strings Requires Grouping
|
| 426 |
|
| 427 | Repetitions like `* + ?` apply only to the last character, so literal strings
|
| 428 | need extra grouping:
|
| 429 |
|
| 430 |
|
| 431 | No:
|
| 432 |
|
| 433 | 'foo'+
|
| 434 |
|
| 435 | Yes:
|
| 436 |
|
| 437 | <capture 'foo'>+
|
| 438 |
|
| 439 | Also OK:
|
| 440 |
|
| 441 | ('foo')+ # but note this is a CAPTURING group in ERE
|
| 442 |
|
| 443 | This is necessary because ERE doesn't have non-capturing groups like Perl's
|
| 444 | `(?:...)`, and Eggex only does "dumb" translations. It doesn't silently insert
|
| 445 | constructs that change the meaning of the pattern.
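The difference is easy to observe at the ERE level with `grep -E`. In this sketch, `full_match` is a hypothetical helper that uses `grep -x` to require the whole line to match.

```shell
# full_match (hypothetical): prints yes if the whole string in $2 matches
# the ERE in $1, and no otherwise. grep -x anchors the match to the full line.
full_match() {
  if printf '%s\n' "$2" | grep -E -q -x -- "$1"; then echo yes; else echo no; fi
}

full_match 'foo+'   'foooo'    # yes: + repeats only the final 'o'
full_match 'foo+'   'foofoo'   # no:  'foo' as a whole is not repeated
full_match '(foo)+' 'foofoo'   # yes: the group is repeated as a unit
```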
|
| 446 |
|
| 447 | ### Bytes and Code Points are Limited in Range
|
| 448 |
|
| 449 | See [re-chars](ref/chap-expr-lang.html#re-chars) in the Oils Reference.
|
| 450 |
|
| 451 | ### Char class literals: `^ - ] \`
|
| 452 |
|
| 453 | The literal characters `^ - ] \` are problematic because they can be confused
|
| 454 | with operators.
|
| 455 |
|
| 456 | - `^` means negation
|
| 457 | - `-` means range
|
| 458 | - `]` closes the character class
|
| 459 | - `\` is usually literal, but GNU gawk has an extension to make it an escaping
|
| 460 | operator
|
| 461 |
|
| 462 | The Eggex-to-ERE translator is smart enough to handle cases like this:
|
| 463 |
|
| 464 | var pat = / ['^' 'x'] /
|
| 465 | # translated to [x^], not [^x] for correctness
|
| 466 |
|
| 467 | However, cases like this are a fatal runtime error:
|
| 468 |
|
| 469 | var pat1 = / ['a'-'^'] /
|
| 470 | var pat2 = / ['a'-'-'] /
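To see why the position of `^` matters in the translated ERE, compare `[x^]` and `[^x]` with `grep -E`. This is a sketch at the ERE level only; `full_match` is a hypothetical helper.

```shell
# full_match (hypothetical): prints yes if the whole string in $2 matches
# the ERE in $1, and no otherwise.
full_match() {
  if printf '%s\n' "$2" | grep -E -q -x -- "$1"; then echo yes; else echo no; fi
}

full_match '[x^]' '^'   # yes: ^ is a literal member when it isn't first
full_match '[x^]' 'y'   # no
full_match '[^x]' 'y'   # yes: a leading ^ negates the class
full_match '[^x]' 'x'   # no
```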
|
| 471 |
|
| 472 | ## Critiques
|
| 473 |
|
| 474 | ### Regexes Are Hard To Read
|
| 475 |
|
| 476 | ... because the **same symbol can mean many things**.
|
| 477 |
|
| 478 | `^` could mean:
|
| 479 |
|
| 480 | - Start of the string/line
|
| 481 | - Negated character class like `[^abc]`
|
| 482 | - Literal character `^` like `[abc^]`
|
| 483 |
|
| 484 | `\` is used in:
|
| 485 |
|
| 486 | - Character classes like `\w` or `\d`
|
| 487 | - Zero-width assertions like `\b`
|
| 488 | - Escaped characters like `\n`
|
| 489 | - Quoted characters like `\+`
|
| 490 |
|
| 491 | `?` could mean:
|
| 492 |
|
| 493 | - optional: `a?`
|
| 494 | - lazy match: `a+?`
|
| 495 | - some other kind of grouping:
|
| 496 | - `(?P<named>\d+)`
|
| 497 | - `(?:noncapturing)`
|
| 498 |
|
| 499 | With egg expressions, each construct has a **distinct syntax**.
|
| 500 |
|
| 501 | ### YSH is Shorter Than Bash
|
| 502 |
|
| 503 | Bash:
|
| 504 |
|
| 505 | if [[ $x =~ [[:digit:]]+ ]]; then
|
| 506 | echo 'x looks like a number'
|
| 507 | fi
|
| 508 |
|
| 509 | Compare with YSH:
|
| 510 |
|
| 511 | if (x ~ /digit+/) {
|
| 512 | echo 'x looks like a number'
|
| 513 | }
|
| 514 |
|
| 515 | ### ... and Perl
|
| 516 |
|
| 517 | Perl:
|
| 518 |
|
| 519 | $x =~ /\d+/
|
| 520 |
|
| 521 | YSH:
|
| 522 |
|
| 523 | x ~ /d+/
|
| 524 |
|
| 525 |
|
| 526 | The Perl expression has three more punctuation characters:
|
| 527 |
|
| 528 | - YSH doesn't require sigils in expression mode
|
| 529 | - The match operator is `~`, not `=~`
|
| 530 | - Named character classes are unadorned like `d`. If that's too short, you can
|
| 531 | also write `digit`.
|
| 532 |
|
| 533 | ## Design Notes
|
| 534 |
|
| 535 | ### Eggexes In Other Languages
|
| 536 |
|
| 537 | The eggex syntax can be incorporated into other tools and shells. It's
|
| 538 | designed to be separate from YSH -- hence the separate name.
|
| 539 |
|
| 540 | Notes:
|
| 541 |
|
| 542 | - Single quoted string literals should **disallow** internal backslashes, and
|
| 543 | treat all other characters literally. Instead, users can write `/ 'foo' \t
|
| 544 | 'sq' \' bar \n /`, i.e. implicit concatenation of strings and
|
| 545 | characters, described above.
|
| 546 | - To make eggexes portable between languages, don't use the host language's
|
| 547 | syntax for string literals (at least for single-quoted strings).
|
| 548 |
|
| 549 | ### Backward Compatibility
|
| 550 |
|
| 551 | Eggexes aren't backward compatible in general, but they retain some legacy
|
| 552 | operators like `^ . $` to ease the transition. These expressions are valid
|
| 553 | eggexes **and** valid POSIX EREs:
|
| 554 |
|
| 555 | .*
|
| 556 | ^[0-9]+$
|
| 557 | ^.{1,3}|[0-9][0-9]?$
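One way to check the ERE half of that claim is to hand the same strings to `grep -E`, which exits with status 2 when a pattern fails to compile. This is a sketch only, and `ere_is_valid` is a hypothetical helper.

```shell
# ere_is_valid (hypothetical): succeeds if grep -E accepts the pattern in $1.
# grep exits 0 (match) or 1 (no match) for a valid pattern, and 2 or more on error.
ere_is_valid() {
  status=0
  printf '\n' | grep -E -q -- "$1" 2>/dev/null || status=$?
  [ "$status" -lt 2 ]
}

for pat in '.*' '^[0-9]+$' '^.{1,3}|[0-9][0-9]?$'; do
  if ere_is_valid "$pat"; then
    printf 'ok: %s\n' "$pat"
  fi
done
```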
|
| 558 |
|
| 559 | ## FAQ
|
| 560 |
|
| 561 | ### The Name Sounds Funny.
|
| 562 |
|
| 563 | If "eggex" sounds too much like "regex" to you, simply say "egg expression".
|
| 564 | It won't be confused with "regular expression" or "regex".
|
| 565 |
|
| 566 | ### How Do Eggexes Compare with [Raku Regexes][raku-regex] and the [Rosie Pattern Language][rosie]?
|
| 567 |
|
| 568 | All three languages support pattern composition and have quoted literals. And
|
| 569 | they have the goal of improving upon Perl 5 regex syntax, which has made its
|
| 570 | way into every major programming language (Python, Java, C++, etc.)
|
| 571 |
|
| 572 | The main difference is that Eggexes are meant to be used with **existing**
|
| 573 | regex engines. For example, you translate them to a POSIX ERE, which is
|
| 574 | executed by `egrep` or `awk`. Or you translate them to a Perl-like syntax and
|
| 575 | use them in Python, JavaScript, Java, or C++ programs.
|
| 576 |
|
| 577 | Raku and Rosie have their **own engines** that are more powerful than PCRE,
|
| 578 | Python, etc. That means they **cannot** be used this way.
|
| 579 |
|
| 580 | [rosie]: https://rosie-lang.org/
|
| 581 |
|
| 582 | [raku-regex]: https://docs.raku.org/language/regexes
|
| 583 |
|
| 584 | ### What About Eggex versus Parsing Expression Grammars? (PEGs)
|
| 585 |
|
| 586 | The short answer is that they can be complementary: PEGs are closer to
|
| 587 | **parsing**, while eggex and [regular languages]($xref:regular-language) are
|
| 588 | closer to **lexing**. Related:
|
| 589 |
|
| 590 | - [When Are Lexer Modes Useful?](https://www.oilshell.org/blog/2017/12/17.html)
|
| 591 | - [Why Lexing and Parsing Should Be
|
| 592 | Separate](https://github.com/oils-for-unix/oils/wiki/Why-Lexing-and-Parsing-Should-Be-Separate) (wiki)
|
| 593 |
|
| 594 | The PEG model is more resource intensive, but it can recognize more languages,
|
| 595 | and it can recognize recursive structure (trees).
|
| 596 |
|
| 597 | ### Why Don't `dot`, `%start`, and `%end` Have More Precise Names?
|
| 598 |
|
| 599 | Because the meanings of `.` `^` and `$` are usually affected by regex engine
|
| 600 | flags, like `dotall`, `multiline`, and `unicode`.
|
| 601 |
|
| 602 | As a result, the names mean nothing more than "however your regex engine
|
| 603 | interprets `.` `^` and `$`".
|
| 604 |
|
| 605 | As mentioned in the "Design Philosophy" section above, eggex only does a superficial,
|
| 606 | one-to-one translation. It doesn't understand the details of which characters
|
| 607 | will be matched under which engine.
|
| 608 |
|
| 609 | ### Where Do I Send Feedback?
|
| 610 |
|
| 611 | Eggexes are implemented in YSH, but not yet set in stone.
|
| 612 |
|
| 613 | Please try them, as described in [this
|
| 614 | post](http://www.oilshell.org/blog/2019/08/22.html) and the
|
| 615 | [README]($oils-src:README.md), and send us feedback!
|
| 616 |
|
| 617 | You can create a new post on [/r/oilshell](https://www.reddit.com/r/oilshell/)
|
| 618 | or a new message on `#oil-discuss` on <https://oilshell.zulipchat.com/> (log in
|
| 619 | with GitHub, etc.)
|