mycpp
=====

This is a Python-to-C++ translator based on MyPy. It only
handles the small subset of Python that we use in Oils.

It's inspired by both mypyc and Shed Skin. These posts give background:

- [Brief Descriptions of a Python to C++ Translator](https://www.oilshell.org/blog/2022/05/mycpp.html)
- [Oil Is Being Implemented "Middle Out"](https://www.oilshell.org/blog/2022/03/middle-out.html)

As of March 2024, the translation to C++ is **done**. So it's no longer
experimental!

However, it's still pretty **hacky**. This doc exists mainly to explain the
hacks. (We may want to rewrite mycpp as "yaks", although it's low priority
right now.)

---

Source for this doc: [mycpp/README.md]($oils-src). The code is all in
[mycpp/]($oils-src).


<div id="toc">
</div>

## Instructions

### Translating and Compiling `oils-cpp`

Running `mycpp` is best done on a Debian / Ubuntu-ish machine. Follow the
instructions at <https://github.com/oilshell/oil/wiki/Contributing> to create
the "dev build" first, which is DISTINCT from the C++ build. Make sure you can
run:

    oil$ build/py.sh all

This will give you a working shell:

    oil$ bin/osh -c 'echo hi'  # running interpreted Python
    hi

To run mycpp, we will build Python 3.10, clone MyPy, and install MyPy's
dependencies. First install packages:

    # We need libssl-dev, libffi-dev, zlib1g-dev to bootstrap Python
    oil$ build/deps.sh install-ubuntu-packages

You'll also need a C++17 compiler for code generated by Souffle datalog, used
by mycpp, although Oils itself only requires C++11.

Then fetch data, like the Python 3.10 tarball and MyPy repo:

    oil$ build/deps.sh fetch

Then build from source:

    oil$ build/deps.sh install-wedges

To build oil-native, use:

    oil$ ./NINJA-config.sh
    oil$ ninja  # translate and compile, may take 30 seconds

    oil$ _bin/cxx-asan/osh -c 'echo hi'  # running compiled C++ !
    hi

To run the tests and benchmarks:

    oil$ mycpp/TEST.sh test-translator
    ... 200+ tasks run ...

If you have problems, post a message on `#oil-dev` at
`https://oilshell.zulipchat.com`. Not many people have contributed to `mycpp`,
so I can use your feedback!

Related:

- [Oil Native Quick
  Start](https://github.com/oilshell/oil/wiki/Oil-Native-Quick-Start) on the
  wiki.
- [Oil Dev Cheat Sheet](https://github.com/oilshell/oil/wiki/Oil-Native-Quick-Start)

## Notes on the Algorithm / Architecture

There are four passes over the MyPy AST.

(1) `const_pass.py`: Collect string constants

Turn the constant in `myfunc("foo")` into top-level `GLOBAL_STR(str1, "foo")`.

(2) Three passes in `cppgen_pass.py`.

(a) Forward Declaration Pass.

    class Foo;
    class Bar;

This pass also determines which methods should be declared `virtual` in their
declarations. The `virtual` keyword is written in the next pass.

(b) Declaration Pass.

    class Foo {
      void method();
    };
    class Bar {
      void method();
    };

More work in this pass:

- Collect member variables and write them at the end of the definition
- Collect locals for "hoisting". Written in the next pass.

(c) Definition Pass.

    void Foo::method() {
      ...
    }

    void Bar::method() {
      ...
    }

Note: I really wish we were not using visitors, but that's inherited from MyPy.

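To make the pass structure concrete, here is a toy sketch in Python. It is not
mycpp's code: `ClassInfo`, `emit_cpp()`, and the three helper functions are
hypothetical names, and the real passes walk the MyPy AST with visitors rather
than a list of records. It only shows why running three passes over the same
input yields forward declarations, then declarations, then definitions.

```python
from collections import namedtuple

# Hypothetical stand-in for what a real pass would collect from the MyPy AST.
ClassInfo = namedtuple('ClassInfo', ['name', 'methods'])


def forward_decl_pass(classes):
    # Pass (a): one forward declaration per class
    return ['class %s;' % c.name for c in classes]


def decl_pass(classes):
    # Pass (b): class bodies with method prototypes only
    lines = []
    for c in classes:
        lines.append('class %s {' % c.name)
        for m in c.methods:
            lines.append('  void %s();' % m)
        lines.append('};')
    return lines


def def_pass(classes):
    # Pass (c): out-of-line method definitions
    lines = []
    for c in classes:
        for m in c.methods:
            lines.append('void %s::%s() {' % (c.name, m))
            lines.append('  // ...')
            lines.append('}')
    return lines


def emit_cpp(classes):
    # Run the passes in order over the SAME input and concatenate the output
    out = []
    for one_pass in (forward_decl_pass, decl_pass, def_pass):
        out.extend(one_pass(classes))
    return '\n'.join(out)


classes = [ClassInfo('Foo', ['method']), ClassInfo('Bar', ['method'])]
cpp = emit_cpp(classes)
print(cpp)
```

Because every class is forward-declared before any class body is written, a
method prototype in `Foo` can mention `Bar*` regardless of source order.
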
## mycpp Idioms / "Creative Hacks"

Oils is written in typed Python 2. It will run under a stock Python 2
interpreter, and it will typecheck with stock MyPy.

However, there are a few language features that don't map cleanly from typed
Python to C++:

- switch statements (unfortunately we don't have the Python 3 match statement)
- C++ destructors - the RAII pattern
- casting - MyPy has one kind of cast; C++ has `static_cast` and
  `reinterpret_cast`. (We don't use C-style casting.)

So this describes the idioms we use. There are some hacks in
[mycpp/cppgen_pass.py]($oils-src) to handle these cases, and also Python
runtime equivalents in `mycpp/mylib.py`.

### `with {,tag,str_}switch` &rarr; Switch statement

We have three constructs that translate to a C++ switch statement. They use a
Python context manager `with Xswitch(obj) ...` as a little hack.

Here are examples like the ones in [mycpp/examples/test_switch.py]($oils-src).
(`ninja mycpp-logs-equal` translates, compiles, and tests all the examples.)

Simple switch:

    myint = 99
    with switch(myint) as case:
      if case(42, 43):
        print('forties')
      else:
        print('other')

Switch on **object type**, which goes well with ASDL sum types:

    val = value.Str('foo')  # type: value_t
    with tagswitch(val) as case:
      if case(value_e.Str, value_e.Int):
        print('string or int')
      else:
        print('other')

We usually need to apply the `UP_val` pattern here, described in the next
section.

Switch on **string**, which generates a fast **two-level dispatch** -- first on
length, and then with `str_equals_c()`:

    s = 'foo'
    with str_switch(s) as case:
      if case("foo"):
        print('FOO')
      else:
        print('other')

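On the Python side, these constructs only need to behave like a membership
test. The sketch below is written from the description above, not copied from
`mycpp/mylib.py`, so treat the class bodies as an assumption about the general
shape. `FakeStr` and `classify()` are hypothetical names used for the demo.

```python
class switch(object):
    """Runtime stand-in for a C++ switch on an integer."""

    def __init__(self, value):
        self.value = value

    def __enter__(self):
        return self  # bound to 'case' in 'with switch(x) as case'

    def __exit__(self, exc_type, exc_value, traceback):
        return False  # don't swallow exceptions

    def __call__(self, *cases):
        return self.value in cases


class tagswitch(switch):
    """Same idea, but compares the object's tag() instead of the object."""

    def __init__(self, node):
        switch.__init__(self, node.tag())


class str_switch(switch):
    """In Python, a string switch is an ordinary membership test."""


class FakeStr(object):
    """Hypothetical stand-in for an ASDL variant; real ones are generated."""

    def tag(self):
        return 1


def classify(myint):
    with switch(myint) as case:
        if case(42, 43):
            return 'forties'
        else:
            return 'other'


def classify_tag(val):
    with tagswitch(val) as case:
        if case(1, 2):
            return 'string or int'
        else:
            return 'other'


def classify_str(s):
    with str_switch(s) as case:
        if case('foo'):
            return 'FOO'
        else:
            return 'other'


print(classify(42), classify(99), classify_tag(FakeStr()), classify_str('foo'))
```

The two-level dispatch on string length exists only in the generated C++. The
Python side stays simple because it is the reference for semantics, not speed.
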
### `val` &rarr; `UP_val` &rarr; `val` Downcasting pattern

Summary: variable names like `UP_*` are **special** in our Python code.

Consider the downcasts marked BAD:

    val = value.Str('foo')  # type: value_t

    with tagswitch(val) as case:
      if case(value_e.Str):
        val = cast(value.Str, val)  # BAD: conflicts with first declaration
        print('s = %s' % val.s)

      elif case(value_e.Int):
        val = cast(value.Int, val)  # BAD: conflicts with both
        print('i = %d' % val.i)

      else:
        print('other')

MyPy allows this, but it translates to invalid C++ code. C++ can't have a
variable named `val` with 2 related types, `value_t` and `value::Str`.

So we use this idiom instead, which takes advantage of **local vars in case
blocks** in C++:

    val = value.Str('foo')  # type: value_t

    UP_val = val  # temporary variable that will be cast

    with tagswitch(val) as case:
      if case(value_e.Str):
        val = cast(value.Str, UP_val)  # this works
        print('s = %s' % val.s)

      elif case(value_e.Int):
        val = cast(value.Int, UP_val)  # also works
        print('i = %d' % val.i)

      else:
        print('other')

This translates to something like:

    value_t* val = Alloc<value::Str>(str42);
    value_t* UP_val = val;

    switch (val->tag()) {
      case value_e::Str: {
        // DIFFERENT local var
        value::Str* val = static_cast<value::Str*>(UP_val);
        print(StrFormat(str43, val->s));
      }
        break;
      case value_e::Int: {
        // ANOTHER DIFFERENT local var
        value::Int* val = static_cast<value::Int*>(UP_val);
        print(StrFormat(str44, val->i));
      }
        break;
      default:
        print(str45);
    }

This works because there's no problem having **different** variables with the
same name within each `case { }` block.

Again, the names `UP_*` are **special**. If the name doesn't start with `UP_`,
the inner blocks will look like:

      case value_e::Str: {
        val = static_cast<value::Str*>(val);  // BAD: val reused
        print(StrFormat(str43, val->s));
      }

And they will fail to compile. It's not valid C++ because `val` still has the
superclass type `value_t*`, which doesn't have a field `s`. Only the subclass
`value::Str` has it.

(Note that Python has a single flat scope per function, while C++ has nested
scopes.)

### Python context manager &rarr; C++ constructor and destructor (RAII)

This Python code:

    with ctx_Foo(42):
      f()

translates to this C++ code:

    {
      ctx_Foo tmp(42);
      f();

      // destructor ~ctx_Foo implicitly called
    }

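For the mapping to line up, the Python class has to do its setup where the C++
constructor would and its cleanup where the destructor would. The sketch below
assumes that convention. `ctx_Indent`, `stack`, and `f()` are hypothetical
names for illustration, not classes from the Oils source.

```python
class ctx_Indent(object):
    """Hypothetical context manager: push state on entry, pop on exit."""

    def __init__(self, stack, n):
        # Corresponds to the C++ constructor
        self.stack = stack
        stack.append(n)

    def __enter__(self):
        # Nothing to do; no 'as' target is used
        pass

    def __exit__(self, type, value, traceback):
        # Corresponds to the C++ destructor: undo what __init__ did
        self.stack.pop()


def f(stack, log):
    log.append(sum(stack))


stack = [0]
log = []
with ctx_Indent(stack, 2):
    f(stack, log)
    with ctx_Indent(stack, 4):
        f(stack, log)
f(stack, log)

print(log)
```

The nested `with` blocks unwind in reverse order, which matches how C++
destroys locals when leaving nested `{ }` scopes.
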
## MyPy "Shimming" Technique

We have an interesting way of "writing Python and C++ at the same time":

1. First, all Python code must pass the MyPy type checker, and run with a stock
   Python 2 interpreter.
   - This is the source of truth &mdash; the source of our semantics.
1. We translate most `.py` files to C++, **except** some files, in particular
   [mycpp/mylib.py]($oils-src) and files starting with `py` like
   `core/{pyos,pyutil}.py`.
1. In C++, we can substitute custom implementations with the properties we
   want, like `Dict<K, V>` being ordered, `BigInt` being distinct from C `int`,
   `BufWriter` being efficient, etc.

The MyPy type system is very powerful! It lets us do all this.

### NewDict() for ordered dicts

Dicts in Python 2 aren't ordered, but we make them ordered at **runtime** by
using `mylib.NewDict()`, which returns `collections_.OrderedDict`.

The **static type** is still `Dict[K, V]`, but we change the "spec" to be an
ordered dict.

In C++, `Dict<K, V>` is implemented as an ordered dict. (Note: we don't
implement preserving order on deletion, which seems OK.)

- TODO: `iteritems()` could go away

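A minimal sketch of the shim, under two assumptions: the standard library's
`collections.OrderedDict` stands in for `collections_.OrderedDict`, and the
return type is fixed to `Dict[str, int]` for the demo. The real signature in
`mylib.py` may differ.

```python
import collections
from typing import Dict


def NewDict():
    # type: () -> Dict[str, int]
    """Return an insertion-ordered dict whose STATIC type is plain Dict."""
    # Assumption: stdlib OrderedDict stands in for collections_.OrderedDict
    return collections.OrderedDict()


d = NewDict()
d['z'] = 1
d['a'] = 2
d['m'] = 3

keys = list(d.keys())
print(keys)
```

Callers only ever see `Dict[K, V]` in annotations, so the same source
typechecks unchanged while both runtimes iterate in insertion order.
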
### StackArray[T]

TODO: describe this when it works.

### BigInt

- In Python, it's simply defined as a class with an integer, in
  [mylib/mops.py]($oils-src).
- In C++, it's currently `typedef int64_t BigInt`, but we want to make it a big
  integer.

### ByteAt(), ByteEquals(), ...

A hand optimization to reduce the number of 1-byte strings. It's used for the
IFS algorithm, `LooksLikeGlob()`, and `GlobUnescape()`.

### File / LineReader / BufWriter

TODO: describe how this works.

Can it be more type safe? I think we can cast `File` to both `LineReader` and
`BufWriter`.

Or can we invert the relationship, so `File` derives from **both** LineReader
and BufWriter?

### Fast JSON - avoid intermediate allocations

- `pyj8.WriteString()` is shimmed so we don't create encoded J8 string objects,
  only to throw them away and write to `mylib.BufWriter`. Instead, we append
  the encoded string **directly** to the `BufWriter`.
- Likewise, we have `BufWriter::write_spaces` to avoid temporary allocations
  when writing indents.
  - This could be generalized to `BufWriter::write_repeated(' ', 42)`.
- We may also want `BufWriter::write_slice()`

## Limitations Requiring Source Rewrites

mycpp itself may cause limitations on expressiveness, or the C++ language may
not be able to express what we want.

- C++ doesn't have `try / except / else`, or `finally`
  - Use the `with ctx_Foo` pattern instead.
- `if mylist` tests if the pointer is non-NULL; use `if len(mylist)` for
  non-empty test
- Functions can have at most one keyword / optional argument.
  - We generate two methods: `f(x)` which calls `f(x, y)` with the default
    value of `y`
  - If there are two or more optional arguments:
    - For classes, you can use the "builder pattern", i.e. add an
      `Init_MyMember()` method
    - If the arguments are booleans, translate it to a single bitfield argument
- C++ has nested scope and Python has flat function scope. This can cause name
  collisions.
  - Could enforce this if it becomes a problem

Also see `mycpp/examples/invalid_*` for Python code that fails to translate.

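Two of the rewrites above can be sketched in a few lines. The names
`FLAG_VERBOSE`, `FLAG_DRY_RUN`, `run()`, and `describe()` are hypothetical;
they only illustrate folding several optional booleans into one bitfield
argument, and spelling out the non-empty test with `len()`.

```python
from typing import List

# Hypothetical flag names; each boolean gets one bit.
FLAG_VERBOSE = 1 << 0
FLAG_DRY_RUN = 1 << 1


def run(cmd, flags=0):
    # type: (str, int) -> str
    """One optional argument (an int bitfield) instead of two optional bools."""
    parts = [cmd]
    if flags & FLAG_VERBOSE:
        parts.append('verbose')
    if flags & FLAG_DRY_RUN:
        parts.append('dry-run')
    return ' '.join(parts)


def describe(mylist):
    # type: (List[int]) -> str
    # 'if mylist' would only test for a non-NULL pointer after translation,
    # so the emptiness test is spelled out with len().
    if len(mylist):
        return 'non-empty'
    else:
        return 'empty'


print(run('ls'))
print(run('ls', FLAG_VERBOSE | FLAG_DRY_RUN))
print(describe([]))
```

With a single optional `flags` parameter, the function stays within the
one-optional-argument limit while still letting callers combine options.
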
## WARNING: Assumptions Not Checked

### Global Constants Can't Be Mutated

We translate top level constants to statically initialized C data structures
(zero startup cost):

    gStr = 'foo'
    gList = [1, 2]  # type: List[int]
    gDict = {'bar': 42}  # type: Dict[str, int]

Even though `List` and `Dict` are mutable in general, you should **NOT** mutate
these global instances! The C++ code will break at runtime.

### Gotcha about Returning Variants (Subclasses) of a Type

MyPy will accept this code:

```
if cond:
  sig = proc_sig.Open  # type: proc_sig_t
  # bad because mycpp HOISTS this
else:
  sig = proc_sig.Closed.CreateNull()
  sig.words = words  # assignment fails
return sig
```

It will translate to C++, but fail to compile. Instead, rewrite it like this:

```
sig = None  # type: proc_sig_t
if cond:
  sig = proc_sig.Open
else:
  closed = proc_sig.Closed.CreateNull()
  closed.words = words
  sig = closed
return sig
```

Here `sig` is declared once with the base type, and the field assignment goes
through `closed`, a separate local that has the subclass type.

### Exceptions Can't Leave Destructors / Python `__exit__`

Context managers like `with ctx_Foo():` translate to C++ constructors and
destructors.

In C++, an exception can't "leave" a destructor. Destructors are implicitly
`noexcept` since C++11, so an escaping exception calls `std::terminate()` at
runtime.

You can throw and CATCH an exception WITHIN a destructor, but you can't let it
propagate outside.

This means you must be careful when coding the `__exit__` method. For example,
in `vm::ctx_Redirect`, we had this bug due to `IOError` being thrown and not
caught when restoring/popping redirects.

To fix the bug, we rewrote the code to use an out param
`List[IOError_OSError]`.

Related:

- <https://akrzemi1.wordpress.com/2011/09/21/destructors-that-throw/>

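The out-param rewrite can be sketched as follows. This is not the
`ctx_Redirect` code: `ctx_Restore`, `failing_restore()`, and `err_out` are
hypothetical names, and the sketch only shows the shape of catching inside
`__exit__` and reporting through a list the caller owns.

```python
class ctx_Restore(object):
    """Hypothetical context manager whose cleanup step can fail."""

    def __init__(self, restore_func, err_out):
        self.restore_func = restore_func
        self.err_out = err_out  # out param: errors are recorded, not raised

    def __enter__(self):
        pass

    def __exit__(self, type, value, traceback):
        try:
            self.restore_func()
        except (IOError, OSError) as e:
            # Catch WITHIN __exit__ so nothing propagates out of what becomes
            # a C++ destructor.
            self.err_out.append(e)


def failing_restore():
    raise IOError('simulated failure while restoring')


err_out = []
with ctx_Restore(failing_restore, err_out):
    pass

# The caller inspects the out param after the block
status = 1 if len(err_out) else 0
print(status)
```

The error still reaches the caller, but as data checked after the block
rather than as an exception crossing the destructor boundary.
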
## More Translation Notes

### Hacky Heuristics

- `callable(arg)` to either:
  - function call `f(arg)`
  - instantiation `Alloc<T>(arg)`
- `name.attr` to either:
  - `obj->member`
  - `module::Func`
- `cast(MyType, obj)` to either
  - `static_cast<MyType*>(obj)`
  - `reinterpret_cast<MyType*>(obj)`

### Hacky Hard-Coded Names

These are signs of coupling between mycpp and Oils, which ideally shouldn't
exist.

- `mycpp_main.py`
  - `ModulesToCompile()` -- some files have to be ordered first, like the ASDL
    runtime.
    - TODO: Pea can respect parameter order? So we do that outside the project?
    - Another ordering constraint comes from **inheritance**. The forward
      declaration is NOT sufficient in that case.
- `cppgen_pass.py`
  - `_GetCastKind()` has some hard-coded names
  - `AsdlType::Create()` is special cased to `::`, not `->`
  - Default arguments e.g. `scope_e::Local` need a repeated `using`.

Issue on mycpp improvements: <https://github.com/oilshell/oil/issues/568>

### Major Features

- Python `int` and `bool` &rarr; C++ `int` and `bool`
  - `None` &rarr; `nullptr`
- Statically Typed Python Collections
  - `str` &rarr; `Str*`
  - `List[T]` &rarr; `List<T>*`
  - `Dict[K, V]` &rarr; `Dict<K, V>*`
  - tuples &rarr; `Tuple2<A, B>`, `Tuple3<A, B, C>`, etc.
- Collection literals turn into initializer lists
  - And there is a C++ type inference issue which requires an explicit
    `std::initializer_list<int>{1, 2, 3}`, not just `{1, 2, 3}`
- `for` loops, i.e. Python's polymorphic iteration &rarr; `StrIter`,
  `ListIter<T>`, `DictIter<K, V>`
  - `xrange()`
  - `enumerate()`
  - `reversed(mylist)` &rarr; `ReverseListIter`
  - `d.iteritems()` is rewritten as `mylib.iteritems()` &rarr; `DictIter`
    - TODO: can we be smarter about this?
- Python's `in` operator:
  - `s in mystr` &rarr; `str_contains(mystr, s)`
  - `x in mylist` &rarr; `list_contains(mylist, x)`
- Classes and inheritance
  - `__init__` method becomes a constructor. Note: initializer lists aren't
    used.
  - Detect `virtual` methods
  - TODO: could we detect `abstract` methods? (`NotImplementedError`)
- Python generators `Iterator[T]` &rarr; eager `List<T>` accumulators
- Python Exceptions &rarr; C++ exceptions
- Python Modules &rarr; C++ namespace (we assume a 2-level hierarchy)
  - TODO: mycpp needs real modules, because our `oils_for_unix.mycpp.cc`
    translation unit is getting big.
  - And `cpp/preamble.h` is a hack to work around the lack of modules.

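The generator item above is the least obvious mapping, so here is a sketch of
what "eager accumulator" means, written in Python on both sides. The function
names are hypothetical, `range()` is used instead of `xrange()` so the snippet
runs on Python 3, and the exact shape mycpp generates is not shown in this
doc.

```python
def evens_lazy(n):
    # What the Python source says: values are produced on demand
    for i in range(n):
        if i % 2 == 0:
            yield i


def evens_eager(n):
    # Roughly the shape after translation: fill a list, then return it
    acc = []
    for i in range(n):
        if i % 2 == 0:
            acc.append(i)
    return acc


lazy_result = list(evens_lazy(7))
eager_result = evens_eager(7)
print(lazy_result, eager_result)
```

The two agree when the consumer iterates to the end. Code that relies on
stopping early, or on interleaving side effects with the producer, would
behave differently under eager evaluation.
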
### Minor Translations

- `s1 == s2` &rarr; `str_equals(s1, s2)`
- `'x' * 3` &rarr; `str_repeat(globalStr, 3)`
- `[None] * 3` &rarr; `list_repeat(nullptr, 3)`
- Omitted:
  - If the LHS of an assignment is `_`, then the statement is omitted
    - This is for `_ = log`, which shuts up Python lint warnings for 'unused
      import'
  - Code under `if __name__ == '__main__'`

### Optimizations

- Returning Tuples by value. To reduce GC pressure, we return
  `Tuple2<A, B>` instead of `Tuple2<A, B>*`, and likewise for `Tuple3` and `Tuple4`.

### Rooting Policy

The translated code roots local variables in every function:

    StackRoots _r({&var1, &var2});

We have two kinds of hand-written code:

1. Methods like `Str::strip()` in `mycpp/`
2. OS bindings like `stat()` in `cpp/`

Neither of them needs any rooting! This is because we use **manual collection
points** in the interpreter, and these functions don't call any functions that
can collect. They are "leaves" in the call tree.

## The mycpp Runtime

The mycpp translator targets a runtime that's written from scratch. It
implements garbage-collected data structures like:

- Typed records
  - Python classes
  - ASDL product and sum types
- `Str` (immutable, as in Python)
- `List<T>`
- `Dict<K, V>`
- `Tuple2<A, B>`, `Tuple3<A, B, C>`, ...

It also has functions based on CPython's:

- `mycpp/gc_builtins.{h,cc}` corresponds roughly to Python's `__builtin__`
  module, e.g. `int()` and `str()`
- `mycpp/gc_mylib.{h,cc}` corresponds to `mylib.py`
  - `mylib.BufWriter` is a bit like `cStringIO.StringIO`

### Differences from CPython

- Integers are either C `int` or `mylib.BigInt`, not Python's arbitrary size
  integers
- `NUL` bytes are allowed in arguments to syscalls like `open()`, unlike in
  CPython
- `s.strip()` is defined in terms of ASCII whitespace, which does not include
  say `\v`.
  - This is done to be consistent with JSON and J8 Notation.

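The `strip()` difference is easy to see from a stock interpreter. In the
sketch below, `ASCII_WHITESPACE` and `strip_ascii()` are hypothetical, and the
character set is an assumption taken from JSON's whitespace definition (space,
tab, newline, carriage return). The exact set the runtime uses is not listed
in this doc.

```python
# Assumption: the JSON whitespace set (space, tab, newline, carriage return)
ASCII_WHITESPACE = ' \t\r\n'


def strip_ascii(s):
    # Strip only the characters listed above, leaving e.g. '\v' in place
    return s.strip(ASCII_WHITESPACE)


s = '\v foo \n'

cpython_result = s.strip()       # CPython treats '\v' as whitespace
sketch_result = strip_ascii(s)   # '\v' blocks stripping on the left side

print(repr(cpython_result), repr(sketch_result))
```

Input containing `\v` or `\f` at the edges is where the interpreted and
compiled builds could disagree, which makes it a useful test case.
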
## C++ Notes

### Gotchas

- C++ classes can have 2 member variables of the same name! From the base
  class and derived class.
- Failing to declare methods `virtual` can result in the wrong one being called
  at runtime

### Minor Features Used

In addition to classes, templates, exceptions, etc. mentioned above, we use:

- `static_cast` and `reinterpret_cast`
- `enum class` for ASDL
- Function overloading
  - For equality and hashing?
- `offsetof` for introspection of field positions for garbage collection
- `std::initializer_list` for `StackRoots()`
  - Should we get rid of this?

### Not Used

- I/O Streams, RTTI, etc.
- `const`
- Smart pointers
