mycpp
=====
3
4This is a Python-to-C++ translator based on MyPy. It only
5handles the small subset of Python that we use in Oils.
6
7It's inspired by both mypyc and Shed Skin. These posts give background:
8
9- [Brief Descriptions of a Python to C++ Translator](https://www.oilshell.org/blog/2022/05/mycpp.html)
10- [Oil Is Being Implemented "Middle Out"](https://www.oilshell.org/blog/2022/03/middle-out.html)
11
12As of March 2024, the translation to C++ is **done**. So it's no longer
13experimental!
14
15---
16
17`mycpp` started as a **hack**, but it worked because its output is fairly
18strongly-typed C++.
19
20That is, the C++ type system catches many errors! But it doesn't catch all of
21them, so we've gradually made `mycpp` more strict.
22
23As of December 2024, `mycpp` is a pretty clean program, although there are
24still many heuristics. This doc explains the heuristics.
25
26(I'd like to gradually rewrite mycpp as a more principled "yaks" language,
27although this isn't a high priority.)
28
29---
30
31Source for this doc: [mycpp/README.md]($oils-src). The code is all in
32[mycpp/]($oils-src).
33
34
35<div id="toc">
36</div>
37
38## Instructions
39
40### Translating and Compiling `oils-cpp`
41
42Running `mycpp` is best done on a Debian / Ubuntu-ish machine. Follow the
43instructions at <https://github.com/oilshell/oil/wiki/Contributing> to create
44the "dev build" first, which is DISTINCT from the C++ build. Make sure you can
45run:
46
47 oil$ build/py.sh all
48
49This will give you a working shell:
50
51 oil$ bin/osh -c 'echo hi' # running interpreted Python
52 hi
53
54To run mycpp, we will build Python 3.10, clone MyPy, and install MyPy's
55dependencies. First install packages:
56
57 # We need libssl-dev, libffi-dev, zlib1g-dev to bootstrap Python
58 oil$ build/deps.sh install-ubuntu-packages
59
60You'll also need a C++17 compiler for code generated by Souffle datalog, used
61by mycpp, although Oils itself only requires C++11.
62
63Then fetch data, like the Python 3.10 tarball and MyPy repo:
64
65 oil$ build/deps.sh fetch
66
67Then build from source:
68
69 oil$ build/deps.sh install-wedges
70
71To build oil-native, use:
72
73 oil$ ./NINJA-config.sh
74 oil$ ninja # translate and compile, may take 30 seconds
75
76 oil$ _bin/cxx-asan/osh -c 'echo hi' # running compiled C++ !
77 hi
78
79To run the tests and benchmarks:
80
81 oil$ mycpp/TEST.sh test-translator
82 ... 200+ tasks run ...
83
84If you have problems, post a message on `#oil-dev` at
85`https://oilshell.zulipchat.com`. Not many people have contributed to `mycpp`,
86so I can use your feedback!
87
88Related:
89
- [Oil Native Quick Start](https://github.com/oilshell/oil/wiki/Oil-Native-Quick-Start) on the wiki.
- [Oil Dev Cheat Sheet](https://github.com/oilshell/oil/wiki/Oil-Dev-Cheat-Sheet)
94
95## Notes on the Algorithm / Architecture
96
97Though there is still some global state (in `visitor.py`, inherited by all the
98passes), we are trying to make the dependencies explicit (in `translate.py`).
99
100There are five passes over the MyPy AST.
101
102(1) Const (`const_pass.py`) (analyzes and writes constants)
103 - Collect string constants (e.g. turn the constant in `myfunc("foo")` into
104 top-level `GLOBAL_STR(str1, "foo")`).
105
106 class Foo;
107 class Bar;
108
109 - Collect classes and their method names
110 - Collect classes and their namespace names
111
112(2) `conversion_pass.py` (analyzes)
113 - compute virtual functions, locals, class members, yield, etc.
114 - this also computes forward_decls, and we write it in translate.py
115(3) `control_flow_pass.py` (analyzes)
116 - fully qualified function name -> control flow graph
117 - maybe run Souffle
118
119(4) Decl (`cppgen_pass.Decl`) (writes)
120 - emit C++ declarations like:
121
122 class Foo {
123 void method();
124 };
125 class Bar {
126 void method();
127 };
128
129(5) Impl (`cppgen_pass.Impl`) (writes)
130
131Note: I really wish we were not using visitors, but that's inherited from MyPy.
132
133## mycpp Idioms / "Creative Hacks"
134
135Oils is written in typed Python 2. It will run under a stock Python 2
136interpreter, and it will typecheck with stock MyPy.
137
138However, there are a few language features that don't map cleanly from typed
139Python to C++:
140
141- switch statements (unfortunately we don't have the Python 3 match statement)
- C++ destructors - the RAII pattern
143- casting - MyPy has one kind of cast; C++ has `static_cast` and
144 `reinterpret_cast`. (We don't use C-style casting.)
145
146So this describes the idioms we use. There are some hacks in
147[mycpp/cppgen_pass.py]($oils-src) to handle these cases, and also Python
148runtime equivalents in `mycpp/mylib.py`.
149
150### `with {,tag,str_}switch` &rarr; Switch statement
151
152We have three constructs that translate to a C++ switch statement. They use a
153Python context manager `with Xswitch(obj) ...` as a little hack.
154
155Here are examples like the ones in [mycpp/examples/test_switch.py]($oils-src).
156(`ninja mycpp-logs-equal` translates, compiles, and tests all the examples.)
157
158Simple switch:
159
    myint = 99
    with switch(myint) as case:
        if case(42, 43):
            print('forties')
        else:
            print('other')
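
At runtime under a Python interpreter, `switch` only needs to be a context manager whose `as` target is callable. Here is a minimal sketch of that shape. The real helper lives in `mycpp/mylib.py` and may differ, and `classify()` is a hypothetical function added for illustration:

```python
class switch(object):
    """Sketch of a runtime helper; the translator treats the `with` block specially."""

    def __init__(self, value):
        self.value = value

    def __enter__(self):
        # The object itself becomes `case` in `with switch(x) as case:`
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        return False  # never swallow exceptions

    def __call__(self, *cases):
        # `case(42, 43)` is true if the switched value matches any argument
        return self.value in cases


def classify(myint):
    # Hypothetical function, for illustration only
    with switch(myint) as case:
        if case(42, 43):
            return 'forties'
        else:
            return 'other'
```

Because `case(...)` is an ordinary call that returns a `bool`, the `if` / `elif` / `else` chain runs as normal Python, while the translator can recognize the same shape and emit a C++ `switch`.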
166
167Switch on **object type**, which goes well with ASDL sum types:
168
    val = value.Str('foo')  # type: value_t
    with tagswitch(val) as case:
        if case(value_e.Str, value_e.Int):
            print('string or int')
        else:
            print('other')
175
176We usually need to apply the `UP_val` pattern here, described in the next
177section.
178
179Switch on **string**, which generates a fast **two-level dispatch** -- first on
180length, and then with `str_equals_c()`:
181
    s = 'foo'
    with str_switch(s) as case:
        if case("foo"):
            print('FOO')
        else:
            print('other')
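
The two-level dispatch can be pictured in plain Python as follows. This is only a sketch of the idea: `build_table()` and `dispatch()` are hypothetical names, and the generated C++ dispatches on the length and then calls `str_equals_c()`, rather than using a dict.

```python
def build_table(keys):
    """Group candidate strings by length, so lookups compare few strings."""
    table = {}
    for key in keys:
        table.setdefault(len(key), []).append(key)
    return table


def dispatch(table, s):
    """Return the matching key, or None for the `else` branch."""
    # Level 1: the length rules out most candidates without reading bytes
    candidates = table.get(len(s), [])
    # Level 2: compare against the few candidates of that length
    for key in candidates:
        if s == key:
            return key
    return None


TABLE = build_table(['foo', 'bar', 'spam'])
```

A string whose length matches no case never reaches the byte comparison at all, which is where the speedup comes from.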
188
189### `val` &rarr; `UP_val` &rarr; `val` Downcasting pattern
190
191Summary: variable names like `UP_*` are **special** in our Python code.
192
193Consider the downcasts marked BAD:
194
    val = value.Str('foo')  # type: value_t

    with tagswitch(val) as case:
        if case(value_e.Str):
            val = cast(value.Str, val)  # BAD: conflicts with first declaration
            print('s = %s' % val.s)

        elif case(value_e.Int):
            val = cast(value.Int, val)  # BAD: conflicts with both
            print('i = %d' % val.i)

        else:
            print('other')
208
209MyPy allows this, but it translates to invalid C++ code. C++ can't have a
210variable named `val`, with 2 related types `value_t` and `value::Str`.
211
212So we use this idiom instead, which takes advantage of **local vars in case
213blocks** in C++:
214
    val = value.Str('foo')  # type: value_t

    UP_val = val  # temporary variable that will be casted

    with tagswitch(val) as case:
        if case(value_e.Str):
            val = cast(value.Str, UP_val)  # this works
            print('s = %s' % val.s)

        elif case(value_e.Int):
            val = cast(value.Int, UP_val)  # also works
            print('i = %d' % val.i)

        else:
            print('other')
230
231This translates to something like:
232
    value_t* val = Alloc<value::Str>(str42);
    value_t* UP_val = val;

    switch (val->tag()) {
      case value_e::Str: {
        // DIFFERENT local var
        value::Str* val = static_cast<value::Str*>(UP_val);
        print(StrFormat(str43, val->s));
      }
        break;
      case value_e::Int: {
        // ANOTHER DIFFERENT local var
        value::Int* val = static_cast<value::Int*>(UP_val);
        print(StrFormat(str44, val->i));
      }
        break;
      default:
        print(str45);
    }
252
253This works because there's no problem having **different** variables with the
254same name within each `case { }` block.
255
256Again, the names `UP_*` are **special**. If the name doesn't start with `UP_`,
257the inner blocks will look like:
258
    case value_e::Str: {
      val = static_cast<value::Str*>(val);  // BAD: val reused
      print(StrFormat(str43, val->s));
    }
263
And they will fail to compile. It's not valid C++ because `val` still has the
superclass type `value_t*`, which doesn't have a field `s`. Only the subclass
`value::Str` has it.
267
268(Note that Python has a single flat scope per function, while C++ has nested
269scopes.)
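
Under a plain Python interpreter, `cast()` is a no-op and `tagswitch` just compares tags, so the idiom runs unchanged. Below is a self-contained sketch. The `value_e` constants, the `Str` / `Int` classes, and `tagswitch` are simplified stand-ins for the ASDL-generated code and `mylib`, not the real definitions, and `describe()` is hypothetical:

```python
from typing import cast


class value_e(object):
    # Stand-in for an ASDL-generated tag enum
    Str = 1
    Int = 2


class value_t(object):
    def tag(self):
        raise NotImplementedError()


class Str(value_t):
    def __init__(self, s):
        self.s = s

    def tag(self):
        return value_e.Str


class Int(value_t):
    def __init__(self, i):
        self.i = i

    def tag(self):
        return value_e.Int


class tagswitch(object):
    def __init__(self, obj):
        self.tag = obj.tag()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        return False

    def __call__(self, *tags):
        return self.tag in tags


def describe(val):
    # type: (value_t) -> str
    UP_val = val  # keeps the wide type; each branch narrows from it
    with tagswitch(val) as case:
        if case(value_e.Str):
            val = cast(Str, UP_val)
            return 's = %s' % val.s
        elif case(value_e.Int):
            val = cast(Int, UP_val)
            return 'i = %d' % val.i
        else:
            return 'other'
```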
270
271### Python context manager &rarr; C++ constructor and destructor (RAII)
272
273This Python code:
274
    with ctx_Foo(42):
        f()
277
278translates to this C++ code:
279
    {
      ctx_Foo tmp(42);
      f();

      // destructor ~ctx_Foo implicitly called
    }
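
On the Python side, a `ctx_` class is an ordinary context manager: `__init__` plays the role of the C++ constructor and `__exit__` plays the role of the destructor. Here is a minimal sketch, with a hypothetical `ctx_Indent` class and `State` object that aren't part of Oils:

```python
class State(object):
    def __init__(self):
        self.indent = 0


class ctx_Indent(object):
    """Hypothetical example: temporarily increase indentation."""

    def __init__(self, state, n):
        # Corresponds to the C++ constructor: make the change
        self.state = state
        self.n = n
        state.indent += n

    def __enter__(self):
        pass

    def __exit__(self, exc_type, exc_value, traceback):
        # Corresponds to the C++ destructor: always undo the change
        self.state.indent -= self.n
        return False


def run(state):
    seen = []
    with ctx_Indent(state, 2):
        seen.append(state.indent)
        with ctx_Indent(state, 4):
            seen.append(state.indent)
    seen.append(state.indent)
    return seen
```

The cleanup runs when the block exits, whether normally or via an exception, which is the same guarantee the C++ destructor gives at the closing brace.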
286
287## MyPy "Shimming" Technique
288
289We have an interesting way of "writing Python and C++ at the same time":
290
2911. First, all Python code must pass the MyPy type checker, and run with a stock
292 Python 2 interpreter.
293 - This is the source of truth &mdash; the source of our semantics.
1. We translate most `.py` files to C++, **except** some files, in particular
   [mycpp/mylib.py]($oils-src) and files starting with `py` like
   `core/{pyos,pyutil}.py`.
2971. In C++, we can substitute custom implementations with the properties we
298 want, like `Dict<K, V>` being ordered, `BigInt` being distinct from C `int`,
299 `BufWriter` being efficient, etc.
300
301The MyPy type system is very powerful! It lets us do all this.
302
303### NewDict() for ordered dicts
304
305Dicts in Python 2 aren't ordered, but we make them ordered at **runtime** by
306using `mylib.NewDict()`, which returns `collections_.OrderedDict`.
307
The **static type** is still `Dict[K, V]`, but we change the "spec" to be an
ordered dict.
310
311In C++, `Dict<K, V>` is implemented as an ordered dict. (Note: we don't
312implement preserving order on deletion, which seems OK.)
313
314- TODO: `iteritems()` could go away
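
A minimal sketch of the Python side, assuming the standard library's `collections.OrderedDict`. The text above refers to a `collections_` module, which is not reproduced here, so the body of `NewDict()` below is an assumption:

```python
from collections import OrderedDict


def NewDict():
    """Sketch: statically a Dict[K, V], dynamically insertion-ordered."""
    return OrderedDict()


d = NewDict()
d['zebra'] = 1
d['apple'] = 2
d['mango'] = 3

keys = list(d.keys())  # insertion order, not sorted or hash order
```

Since `OrderedDict` is a subclass of `dict`, code annotated with `Dict[K, V]` still typechecks and runs; only the iteration order changes.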
315
316### StackArray[T]
317
318TODO: describe this when it works.
319
320### BigInt
321
- In Python, it's simply defined as a class with an integer, in
  [mylib/mops.py]($oils-src).
324- In C++, it's currently `typedef int64_t BigInt`, but we want to make it a big
325 integer.
326
327### ByteAt(), ByteEquals(), ...
328
Hand optimizations that reduce the number of 1-byte strings created. They are
used in the IFS algorithm, `LooksLikeGlob()`, and `GlobUnescape()`.
331
332### File / LineReader / BufWriter
333
334TODO: describe how this works.
335
336Can it be more type safe? I think we can cast `File` to both `LineReader` and
337`BufWriter`.
338
339Or can we invert the relationship, so `File` derives from **both** LineReader
340and BufWriter?
341
342### Fast JSON - avoid intermediate allocations
343
- `pyj8.WriteString()` is shimmed so we don't create encoded J8 string objects,
  only to throw them away and write to `mylib.BufWriter`. Instead, we append
  the encoded string **directly** to the `BufWriter`.
347- Likewise, we have `BufWriter::write_spaces` to avoid temporary allocations
348 when writing indents.
349 - This could be generalized to `BufWriter::write_repeated(' ', 42)`.
350- We may also want `BufWriter::write_slice()`
351
352## Limitations Requiring Source Rewrites
353
mycpp itself may limit expressiveness, or the C++ language may not be able to
express what we want.
356
357- C++ doesn't have `try / except / else`, or `finally`
358 - Use the `with ctx_Foo` pattern instead.
359- `if mylist` tests if the pointer is non-NULL; use `if len(mylist)` for
360 non-empty test
361- Functions can have at most one keyword / optional argument.
362 - We generate two methods: `f(x)` which calls `f(x, y)` with the default
363 value of `y`
364 - If there are two or more optional arguments:
365 - For classes, you can use the "builder pattern", i.e. add an
366 `Init_MyMember()` method
367 - If the arguments are booleans, translate it to a single bitfield argument
368- C++ has nested scope and Python has flat function scope. This can cause name
369 collisions.
370 - Could enforce this if it becomes a problem
371
372Also see `mycpp/examples/invalid_*` for Python code that fails to translate.
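
For example, a class that wants several optional settings can keep a single optional constructor argument, pack boolean options into one integer, and expose the rest through `Init_*()` methods. The names below (`Printer`, `Init_Width`, `FLAG_*`) are hypothetical and only illustrate the rewrite:

```python
FLAG_COLOR = 1 << 0
FLAG_VERBOSE = 1 << 1


class Printer(object):
    def __init__(self, flags=0):
        # Only ONE optional argument: the bitfield replaces separate
        # `color=False, verbose=False` keyword arguments.
        self.flags = flags
        self.width = 80  # default; changed via the builder method

    def Init_Width(self, width):
        # "Builder pattern": a second optional setting gets its own method
        self.width = width

    def Describe(self):
        parts = []
        if self.flags & FLAG_COLOR:
            parts.append('color')
        if self.flags & FLAG_VERBOSE:
            parts.append('verbose')
        # `if len(parts)` rather than `if parts`, per the list above
        if len(parts):
            return '%s width=%d' % (','.join(parts), self.width)
        return 'plain width=%d' % self.width


p = Printer(FLAG_COLOR | FLAG_VERBOSE)
p.Init_Width(100)
```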
373
374## WARNING: Assumptions Not Checked
375
376### Global Constants Can't Be Mutated
377
378We translate top level constants to statically initialized C data structures
379(zero startup cost):
380
381 gStr = 'foo'
382 gList = [1, 2] # type: List[int]
383 gDict = {'bar': 42} # type: Dict[str, int]
384
385Even though `List` and `Dict` are mutable in general, you should **NOT** mutate
386these global instances! The C++ code will break at runtime.
387
388### Gotcha about Returning Variants (Subclasses) of a Type
389
390MyPy will accept this code:
391
```
if cond:
    sig = proc_sig.Open  # type: proc_sig_t
    # bad because mycpp HOISTS this
else:
    sig = proc_sig.Closed.CreateNull()
    sig.words = words  # assignment fails
return sig
```
401
402It will translate to C++, but fail to compile. Instead, rewrite it like this:
403
```
sig = None  # type: proc_sig_t
if cond:
    sig = proc_sig.Open
else:
    closed = proc_sig.Closed.CreateNull()
    closed.words = words  # OK: closed has the subclass type
    sig = closed
return sig
```
415
416### Exceptions Can't Leave Destructors / Python `__exit__`
417
418Context managers like `with ctx_Foo():` translate to C++ constructors and
419destructors.
420
In C++, an exception must not escape a destructor. Destructors are implicitly
`noexcept` since C++11, so an escaping exception calls `std::terminate()` at
runtime.
422
423You can throw and CATCH an exception WITHIN a destructor, but you can't let it
424propagate outside.
425
426This means you must be careful when coding the `__exit__` method. For example,
427in `vm::ctx_Redirect`, we had this bug due to `IOError` being thrown and not
428caught when restoring/popping redirects.
429
430To fix the bug, we rewrote the code to use an out param
431`List[IOError_OSError]`.
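
The out-param approach can be sketched as follows. `ctx_Restore`, `_Pop()`, and `Run()` are hypothetical names, not the real `vm::ctx_Redirect` code. The point is only that `__exit__` records the failure instead of raising, and the caller checks the list after the block.

```python
class ctx_Restore(object):
    """Hypothetical context manager whose cleanup may fail."""

    def __init__(self, should_fail, err_out):
        # err_out is an "out param": a list the caller owns
        self.should_fail = should_fail
        self.err_out = err_out

    def __enter__(self):
        pass

    def _Pop(self):
        # Stand-in for cleanup that can raise, like restoring redirects
        if self.should_fail:
            raise IOError('restore failed')

    def __exit__(self, exc_type, exc_value, traceback):
        try:
            self._Pop()
        except IOError as e:
            # Catch WITHIN __exit__ (the future C++ destructor) and
            # report through the out param instead of propagating
            self.err_out.append(e)
        return False


def Run(should_fail):
    err_out = []
    with ctx_Restore(should_fail, err_out):
        pass
    if len(err_out):
        return 'error: %s' % err_out[0]
    return 'ok'
```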
432
433Related:
434
435- <https://akrzemi1.wordpress.com/2011/09/21/destructors-that-throw/>
436
437## Translation Errors
438
### Hoisting of C++ Variables May Mask Undefined Vars in Python
440
441I ran into this bug in `osh/word_eval.py` in March 2025:
442
    if cond():
        a = ''

    if n < 0:
        # UnboundLocalError: local variable 'a' referenced before assignment
        raise error.FailGlob('Pattern %r matched no files' % a,
                             loc.Missing)
450
451So the variable is not defined in Python &mdash; *dynamically*. But in C++,
452the variable `a` is "hoisted" to the top and declared, which masks the bug.
453
454This is also not a MyPy error! Usually one of MyPy or mycpp will catch
455undefined variables.
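
The Python side of this bug is easy to reproduce. In the sketch below, `Demo()` and `Check()` are hypothetical functions. The name `a` is only bound on one path, so the other path fails at runtime, even though both MyPy and a C++ compiler that sees `a` declared at the top of the function would accept it.

```python
def Demo(cond, n):
    if cond:
        a = ''

    if n < 0:
        # If cond was false, `a` was never bound on this path
        return 'Pattern %r matched no files' % a
    return 'ok'


def Check():
    try:
        Demo(False, -1)
    except UnboundLocalError:
        return 'UnboundLocalError'
    return 'no error'
```

Binding the variable unconditionally before the first `if`, for example `a = ''` at the top of the function, makes the Python behavior match the hoisted C++ declaration.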
456
457## More Translation Notes
458
459### Special case: `pnode::PNode*` are not GC objects
460
461Instead, they use the arena `ctx_PNodeAllocator`.
462
There is a special case in mycpp for this, and a regression test in
`build/native.sh`.
465
466### Hacky Heuristics
467
468- `callable(arg)` to either:
469 - function call `f(arg)`
470 - instantiation `Alloc<T>(arg)`
471- `name.attr` to either:
472 - `obj->member`
473 - `module::Func`
- `cast(MyType, obj)` to either:
475 - `static_cast<MyType*>(obj)`
476 - `reinterpret_cast<MyType*>(obj)`
477
478### Hacky Hard-Coded Names
479
480These are signs of coupling between mycpp and Oils, which ideally shouldn't
481exist.
482
483- `mycpp_main.py`
484 - `ModulesToCompile()` -- some files have to be ordered first, like the ASDL
485 runtime.
486 - TODO: Pea can respect parameter order? So we do that outside the project?
487 - Another ordering constraint comes from **inheritance**. The forward
488 declaration is NOT sufficient in that case.
489- `cppgen_pass.py`
490 - `_GetCastKind()` has some hard-coded names
491 - `AsdlType::Create()` is special cased to `::`, not `->`
492 - Default arguments e.g. `scope_e::Local` need a repeated `using`.
493
494Issue on mycpp improvements: <https://github.com/oilshell/oil/issues/568>
495
496### Major Features
497
498- Python `int` and `bool` &rarr; C++ `int` and `bool`
499 - `None` &rarr; `nullptr`
500- Statically Typed Python Collections
501 - `str` &rarr; `Str*`
502 - `List[T]` &rarr; `List<T>*`
503 - `Dict[K, V]` &rarr; `Dict<K, V>*`
504 - tuples &rarr; `Tuple2<A, B>`, `Tuple3<A, B, C>`, etc.
505- Collection literals turn into initializer lists
506 - And there is a C++ type inference issue which requires an explicit
507 `std::initializer_list<int>{1, 2, 3}`, not just `{1, 2, 3}`
- `for` loops, i.e. Python's polymorphic iteration &rarr; `StrIter`,
  `ListIter<T>`, `DictIter<K, V>`
510 - `xrange()`
511 - `enumerate()`
512 - `reversed(mylist)` &rarr; `ReverseListIter`
  - `d.iteritems()` is rewritten as `mylib.iteritems()` &rarr; `DictIter`
514 - TODO: can we be smarter about this?
515- Python's `in` operator:
516 - `s in mystr` &rarr; `str_contains(mystr, s)`
517 - `x in mylist` &rarr; `list_contains(mylist, x)`
518- Classes and inheritance
519 - `__init__` method becomes a constructor. Note: initializer lists aren't
520 used.
521 - Detect `virtual` methods
522 - TODO: could we detect `abstract` methods? (`NotImplementedError`)
523- Python generators `Iterator[T]` &rarr; eager `List<T>` accumulators
524- Python Exceptions &rarr; C++ exceptions
525- Python Modules &rarr; C++ namespace (we assume a 2-level hierarchy)
  - TODO: mycpp needs real modules, because our `oils_for_unix.mycpp.cc`
    translation unit is getting big.
528 - And `cpp/preamble.h` is a hack to work around the lack of modules.
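
The generator translation listed above can be pictured in Python terms. This is a sketch of the idea, not the translator's output: `Evens()` is a hypothetical generator, and `EvensEager()` shows the shape of an eager version that appends to an accumulator list where the original yields.

```python
def Evens(n):
    # Lazy Python generator: Iterator[int]
    for i in range(n):
        if i % 2 == 0:
            yield i


def EvensEager(n, acc):
    # Eager equivalent: each `yield x` becomes `acc.append(x)`
    for i in range(n):
        if i % 2 == 0:
            acc.append(i)


acc = []
EvensEager(7, acc)
```

One consequence is that the eager form does all its work up front, so a generator that a caller abandons early costs more after translation.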
529
530### Minor Translations
531
532- `s1 == s2` &rarr; `str_equals(s1, s2)`
533- `'x' * 3` &rarr; `str_repeat(globalStr, 3)`
534- `[None] * 3` &rarr; `list_repeat(nullptr, 3)`
535- Omitted:
536 - If the LHS of an assignment is `_`, then the statement is omitted
537 - This is for `_ = log`, which shuts up Python lint warnings for 'unused
538 import'
539 - Code under `if __name__ == '__main__'`
540
541### Optimizations
542
- Returning Tuples by value. To reduce GC pressure, we return
  `Tuple2<A, B>` instead of `Tuple2<A, B>*`, and likewise for `Tuple3` and `Tuple4`.
545
546### Rooting Policy
547
The translated code roots local variables in every function:
549
550 StackRoots _r({&var1, &var2});
551
552We have two kinds of hand-written code:
553
5541. Methods like `Str::strip()` in `mycpp/`
5552. OS bindings like `stat()` in `cpp/`
556
557Neither of them needs any rooting! This is because we use **manual collection
558points** in the interpreter, and these functions don't call any functions that
559can collect. They are "leaves" in the call tree.
560
561## The mycpp Runtime
562
563The mycpp translator targets a runtime that's written from scratch. It
564implements garbage-collected data structures like:
565
566- Typed records
567 - Python classes
568 - ASDL product and sum types
569- `Str` (immutable, as in Python)
570- `List<T>`
571- `Dict<K, V>`
572- `Tuple2<A, B>`, `Tuple3<A, B, C>`, ...
573
574It also has functions based on CPython's:
575
576- `mycpp/gc_builtins.{h,cc}` corresponds roughly to Python's `__builtin__`
577 module, e.g. `int()` and `str()`
- `mycpp/gc_mylib.{h,cc}` corresponds to `mylib.py`
579 - `mylib.BufWriter` is a bit like `cStringIO.StringIO`
580
581### Differences from CPython
582
- Integers are either C `int` or `mylib.BigInt`, not Python's arbitrary-size
  integers
585- `NUL` bytes are allowed in arguments to syscalls like `open()`, unlike in
586 CPython
587- `s.strip()` is defined in terms of ASCII whitespace, which does not include
588 say `\v`.
589 - This is done to be consistent with JSON and J8 Notation.
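
The `strip()` difference is visible from Python. The sketch below assumes the ASCII set is the four JSON whitespace characters (space, tab, newline, carriage return); the exact set used by the runtime isn't stated here, and `AsciiStrip()` is a hypothetical helper.

```python
JSON_WHITESPACE = ' \t\n\r'  # assumption: the set used by JSON


def AsciiStrip(s):
    # Strip only the assumed ASCII set, leaving e.g. \v alone
    return s.strip(JSON_WHITESPACE)


cpython_result = ' \vfoo\v '.strip()
ascii_result = AsciiStrip(' \vfoo\v ')
```

CPython's default `strip()` removes the vertical tabs as well as the spaces, while the restricted version removes only the spaces and keeps `\v` on both ends.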
590
591## C++ Notes
592
593### Gotchas
594
595- C++ classes can have 2 member variables of the same name! From the base
596 class and derived class.
- Failing to declare methods `virtual` can result in the wrong one being called
  at runtime
599
600### Minor Features Used
601
602In addition to classes, templates, exceptions, etc. mentioned above, we use:
603
604- `static_cast` and `reinterpret_cast`
605- `enum class` for ASDL
606- Function overloading
607 - For equality and hashing?
608- `offsetof` for introspection of field positions for garbage collection
609- `std::initializer_list` for `StackRoots()`
610 - Should we get rid of this?
611
612### Not Used
613
614- I/O Streams, RTTI, etc.
615- `const`
616- Smart pointers
617