OILS / mycpp / README.md View on Github | oils.pub

597 lines, 426 significant
1mycpp
2=====
3
4This is a Python-to-C++ translator based on MyPy. It only
5handles the small subset of Python that we use in Oils.
6
7It's inspired by both mypyc and Shed Skin. These posts give background:
8
9- [Brief Descriptions of a Python to C++ Translator](https://www.oilshell.org/blog/2022/05/mycpp.html)
10- [Oil Is Being Implemented "Middle Out"](https://www.oilshell.org/blog/2022/03/middle-out.html)
11
12As of March 2024, the translation to C++ is **done**. So it's no longer
13experimental!
14
15---
16
17`mycpp` started as a **hack**, but it worked because its output is fairly
18strongly-typed C++.
19
20That is, the C++ type system catches many errors! But it doesn't catch all of
21them, so we've gradually made `mycpp` more strict.
22
23As of December 2024, `mycpp` is a pretty clean program, although there are
24still many heuristics. This doc explains the heuristics.
25
26(I'd like to gradually rewrite mycpp as a more principled "yaks" language,
27although this isn't a high priority.)
28
29---
30
31Source for this doc: [mycpp/README.md]($oils-src). The code is all in
32[mycpp/]($oils-src).
33
34
35<div id="toc">
36</div>
37
38## Instructions
39
40### Translating and Compiling `oils-cpp`
41
42Running `mycpp` is best done on a Debian / Ubuntu-ish machine. Follow the
43instructions at <https://github.com/oilshell/oil/wiki/Contributing> to create
44the "dev build" first, which is DISTINCT from the C++ build. Make sure you can
45run:
46
47 oil$ build/py.sh all
48
49This will give you a working shell:
50
51 oil$ bin/osh -c 'echo hi' # running interpreted Python
52 hi
53
54To run mycpp, we will build Python 3.10, clone MyPy, and install MyPy's
55dependencies. First install packages:
56
57 # We need libssl-dev, libffi-dev, zlib1g-dev to bootstrap Python
58 oil$ build/deps.sh install-ubuntu-packages
59
60You'll also need a C++17 compiler for code generated by Souffle datalog, used
61by mycpp, although Oils itself only requires C++11.
62
63Then fetch data, like the Python 3.10 tarball and MyPy repo:
64
65 oil$ build/deps.sh fetch
66
67Then build from source:
68
69 oil$ build/deps.sh install-wedges
70
71To build oil-native, use:
72
73 oil$ ./NINJA-config.sh
74 oil$ ninja # translate and compile, may take 30 seconds
75
76 oil$ _bin/cxx-asan/osh -c 'echo hi' # running compiled C++ !
77 hi
78
79To run the tests and benchmarks:
80
81 oil$ mycpp/TEST.sh test-translator
82 ... 200+ tasks run ...
83
84If you have problems, post a message on `#oil-dev` at
85`https://oilshell.zulipchat.com`. Not many people have contributed to `mycpp`,
86so I can use your feedback!
87
88Related:
89
90- [Oil Native Quick
91Start](https://github.com/oilshell/oil/wiki/Oil-Native-Quick-Start) on the
92wiki.
93- [Oil Dev Cheat Sheet](https://github.com/oilshell/oil/wiki/Oil-Native-Quick-Start)
94
95## Notes on the Algorithm / Architecture
96
97There are four passes over the MyPy AST.
98
99(1) `const_pass.py`: Collect string constants
100
101Turn turn the constant in `myfunc("foo")` into top-level `GLOBAL_STR(str1,
102"foo")`.
103
104(2) Three passes in `cppgen_pass.py`.
105
106(a) Forward Declaration Pass.
107
108 class Foo;
109 class Bar;
110
111This pass also determines which methods should be declared `virtual` in their
112declarations. The `virtual` keyword is written in the next pass.
113
114(b) Declaration Pass.
115
116 class Foo {
117 void method();
118 };
119 class Bar {
120 void method();
121 };
122
123More work in this pass:
124
125- Collect member variables and write them at the end of the definition
126- Collect locals for "hoisting". Written in the next pass.
127
128(c) Definition Pass.
129
130 void Foo:method() {
131 ...
132 }
133
134 void Bar:method() {
135 ...
136 }
137
138Note: I really wish we were not using visitors, but that's inherited from MyPy.
139
140## mycpp Idioms / "Creative Hacks"
141
142Oils is written in typed Python 2. It will run under a stock Python 2
143interpreter, and it will typecheck with stock MyPy.
144
145However, there are a few language features that don't map cleanly from typed
146Python to C++:
147
148- switch statements (unfortunately we don't have the Python 3 match statement)
149- C++ destructors - the RAII ptatern
150- casting - MyPy has one kind of cast; C++ has `static_cast` and
151 `reinterpret_cast`. (We don't use C-style casting.)
152
153So this describes the idioms we use. There are some hacks in
154[mycpp/cppgen_pass.py]($oils-src) to handle these cases, and also Python
155runtime equivalents in `mycpp/mylib.py`.
156
157### `with {,tag,str_}switch` &rarr; Switch statement
158
159We have three constructs that translate to a C++ switch statement. They use a
160Python context manager `with Xswitch(obj) ...` as a little hack.
161
162Here are examples like the ones in [mycpp/examples/test_switch.py]($oils-src).
163(`ninja mycpp-logs-equal` translates, compiles, and tests all the examples.)
164
165Simple switch:
166
167 myint = 99
168 with switch(myint) as case:
169 if case(42, 43):
170 print('forties')
171 else:
172 print('other')
173
174Switch on **object type**, which goes well with ASDL sum types:
175
176 val = value.Str('foo) # type: value_t
177 with tagswitch(val) as case:
178 if case(value_e.Str, value_e.Int):
179 print('string or int')
180 else:
181 print('other')
182
183We usually need to apply the `UP_val` pattern here, described in the next
184section.
185
186Switch on **string**, which generates a fast **two-level dispatch** -- first on
187length, and then with `str_equals_c()`:
188
189 s = 'foo'
190 with str_switch(s) as case:
191 if case("foo")
192 print('FOO')
193 else:
194 print('other')
195
196### `val` &rarr; `UP_val` &rarr; `val` Downcasting pattern
197
198Summary: variable names like `UP_*` are **special** in our Python code.
199
200Consider the downcasts marked BAD:
201
202 val = value.Str('foo) # type: value_t
203
204 with tagswitch(obj) as case:
205 if case(value_e.Str):
206 val = cast(value.Str, val) # BAD: conflicts with first declaration
207 print('s = %s' % val.s)
208
209 elif case(value_e.Int):
210 val = cast(value.Int, val) # BAD: conflicts with both
211 print('i = %d' % val.i)
212
213 else:
214 print('other')
215
216MyPy allows this, but it translates to invalid C++ code. C++ can't have a
217variable named `val`, with 2 related types `value_t` and `value::Str`.
218
219So we use this idiom instead, which takes advantage of **local vars in case
220blocks** in C++:
221
222 val = value.Str('foo') # type: value_t
223
224 UP_val = val # temporary variable that will be casted
225
226 with tagswitch(val) as case:
227 if case(value_e.Str):
228 val = cast(value.Str, UP_val) # this works
229 print('s = %s' % val.s)
230
231 elif case(value_e.Int):
232 val = cast(value.Int, UP_val) # also works
233 print('i = %d' % val.i)
234
235 else:
236 print('other')
237
238This translates to something like:
239
240 value_t* val = Alloc<value::Str>(str42);
241 value_t* UP_val = val;
242
243 switch (val->tag()) {
244 case value_e::Str: {
245 // DIFFERENT local var
246 value::Str* val = static_cast<value::Str>(UP_val);
247 print(StrFormat(str43, val->s))
248 }
249 break;
250 case value_e::Int: {
251 // ANOTHER DIFFERENT local var
252 value::Int* val = static_cast<value::Int>(UP_val);
253 print(StrFormat(str44, val->i))
254 }
255 break;
256 default:
257 print(str45);
258 }
259
260This works because there's no problem having **different** variables with the
261same name within each `case { }` block.
262
263Again, the names `UP_*` are **special**. If the name doesn't start with `UP_`,
264the inner blocks will look like:
265
266 case value_e::Str: {
267 val = static_cast<value::Str>(val); // BAD: val reused
268 print(StrFormat(str43, val->s))
269 }
270
271And they will fail to compile. It's not valid C++ because the superclass
272`value_t` doesn't have a field `val->s`. Only the subclass `value::Str` has
273it.
274
275(Note that Python has a single flat scope per function, while C++ has nested
276scopes.)
277
278### Python context manager &rarr; C++ constructor and destructor (RAII)
279
280This Python code:
281
282 with ctx_Foo(42):
283 f()
284
285translates to this C++ code:
286
287 {
288 ctx_Foo tmp(42);
289 f()
290
291 // destructor ~ctx_Foo implicitly called
292 }
293
294## MyPy "Shimming" Technique
295
296We have an interesting way of "writing Python and C++ at the same time":
297
2981. First, all Python code must pass the MyPy type checker, and run with a stock
299 Python 2 interpreter.
300 - This is the source of truth &mdash; the source of our semantics.
3011. We translate most `.py` files to C++, **except** some files, in particular
302 [mycpp/mylib.py]($oils-src) and files starting with `py` like
303 `core/{pyos.pyutil}.py`.
3041. In C++, we can substitute custom implementations with the properties we
305 want, like `Dict<K, V>` being ordered, `BigInt` being distinct from C `int`,
306 `BufWriter` being efficient, etc.
307
308The MyPy type system is very powerful! It lets us do all this.
309
310### NewDict() for ordered dicts
311
312Dicts in Python 2 aren't ordered, but we make them ordered at **runtime** by
313using `mylib.NewDict()`, which returns `collections_.OrderedDict`.
314
315The **static type** is still `Dict[K, V]`, but change the "spec" to be an
316ordered dict.
317
318In C++, `Dict<K, V>` is implemented as an ordered dict. (Note: we don't
319implement preserving order on deletion, which seems OK.)
320
321- TODO: `iteritems()` could go away
322
323### StackArray[T]
324
325TODO: describe this when it works.
326
327### BigInt
328
329- In Python, it's simply defined a a class with an integer, in
330 [mylib/mops.py]($oils-src).
331- In C++, it's currently `typedef int64_t BigInt`, but we want to make it a big
332 integer.
333
334### ByteAt(), ByteEquals(), ...
335
336Hand optimization to reduce 1-byte strings. For IFS algorithm,
337`LooksLikeGlob()`, `GlobUnescape()`.
338
339### File / LineReader / BufWriter
340
341TODO: describe how this works.
342
343Can it be more type safe? I think we can cast `File` to both `LineReader` and
344`BufWriter`.
345
346Or can we invert the relationship, so `File` derives from **both** LineReader
347and BufWriter?
348
349### Fast JSON - avoid intermediate allocations
350
351- `pyj8.WriteString()` is shimmed so we don't create encoded J8 string objects,
352 only to throw them away and write to `mylib.BufWriter`. Instead, we append
353 an encoded strings **directly** to the `BufWriter`.
354- Likewise, we have `BufWriter::write_spaces` to avoid temporary allocations
355 when writing indents.
356 - This could be generalized to `BufWriter::write_repeated(' ', 42)`.
357- We may also want `BufWriter::write_slice()`
358
359## Limitations Requiring Source Rewrites
360
361mycpp itself may cause limitations on expressiveness, or the C++ language may
362be able express what we want.
363
364- C++ doesn't have `try / except / else`, or `finally`
365 - Use the `with ctx_Foo` pattern instead.
366- `if mylist` tests if the pointer is non-NULL; use `if len(mylist)` for
367 non-empty test
368- Functions can have at most one keyword / optional argument.
369 - We generate two methods: `f(x)` which calls `f(x, y)` with the default
370 value of `y`
371 - If there are two or more optional arguments:
372 - For classes, you can use the "builder pattern", i.e. add an
373 `Init_MyMember()` method
374 - If the arguments are booleans, translate it to a single bitfield argument
375- C++ has nested scope and Python has flat function scope. This can cause name
376 collisions.
377 - Could enforce this if it becomes a problem
378
379Also see `mycpp/examples/invalid_*` for Python code that fails to translate.
380
381## WARNING: Assumptions Not Checked
382
383### Global Constants Can't Be Mutated
384
385We translate top level constants to statically initialized C data structures
386(zero startup cost):
387
388 gStr = 'foo'
389 gList = [1, 2] # type: List[int]
390 gDict = {'bar': 42} # type: Dict[str, int]
391
392Even though `List` and `Dict` are mutable in general, you should **NOT** mutate
393these global instances! The C++ code will break at runtime.
394
395### Gotcha about Returning Variants (Subclasses) of a Type
396
397MyPy will accept this code:
398
399```
400if cond:
401 sig = proc_sig.Open # type: proc_sig_t
402 # bad because mycpp HOISTS this
403else:
404 sig = proc_sig.Closed.CreateNull()
405 sig.words = words # assignment fails
406return sig
407```
408
409It will translate to C++, but fail to compile. Instead, rewrite it like this:
410
411```
412sig = None # type: proc_sig_t
413if cond:
414 sig = proc_sig.Open # type: proc_sig_t
415 # bad because mycpp HOISTS this
416else:
417 closed = proc_sig.Closed.CreateNull()
418 closed.words = words # assignment fails
419 sig = closed
420return sig
421```
422
423### Exceptions Can't Leave Destructors / Python `__exit__`
424
425Context managers like `with ctx_Foo():` translate to C++ constructors and
426destructors.
427
428In C++, a destructor can't "leave" an exception. It results in a runtime error.
429
430You can throw and CATCH an exception WITHIN a destructor, but you can't let it
431propagate outside.
432
433This means you must be careful when coding the `__exit__` method. For example,
434in `vm::ctx_Redirect`, we had this bug due to `IOError` being thrown and not
435caught when restoring/popping redirects.
436
437To fix the bug, we rewrote the code to use an out param
438`List[IOError_OSError]`.
439
440Related:
441
442- <https://akrzemi1.wordpress.com/2011/09/21/destructors-that-throw/>
443
444## More Translation Notes
445
446### Hacky Heuristics
447
448- `callable(arg)` to either:
449 - function call `f(arg)`
450 - instantiation `Alloc<T>(arg)`
451- `name.attr` to either:
452 - `obj->member`
453 - `module::Func`
454- `cast(MyType, obj)` to either
455 - `static_cast<MyType*>(obj)`
456 - `reinterpret_cast<MyType*>(obj)`
457
458### Hacky Hard-Coded Names
459
460These are signs of coupling between mycpp and Oils, which ideally shouldn't
461exist.
462
463- `mycpp_main.py`
464 - `ModulesToCompile()` -- some files have to be ordered first, like the ASDL
465 runtime.
466 - TODO: Pea can respect parameter order? So we do that outside the project?
467 - Another ordering constraint comes from **inheritance**. The forward
468 declaration is NOT sufficient in that case.
469- `cppgen_pass.py`
470 - `_GetCastKind()` has some hard-coded names
471 - `AsdlType::Create()` is special cased to `::`, not `->`
472 - Default arguments e.g. `scope_e::Local` need a repeated `using`.
473
474Issue on mycpp improvements: <https://github.com/oilshell/oil/issues/568>
475
476### Major Features
477
478- Python `int` and `bool` &rarr; C++ `int` and `bool`
479 - `None` &rarr; `nullptr`
480- Statically Typed Python Collections
481 - `str` &rarr; `Str*`
482 - `List[T]` &rarr; `List<T>*`
483 - `Dict[K, V]` &rarr; `Dict<K, V>*`
484 - tuples &rarr; `Tuple2<A, B>`, `Tuple3<A, B, C>`, etc.
485- Collection literals turn into initializer lists
486 - And there is a C++ type inference issue which requires an explicit
487 `std::initializer_list<int>{1, 2, 3}`, not just `{1, 2, 3}`
488- `for` loops, i.e. Python's polymorphic iteration &rarr; `StrIter`,
489 `ListIter<T>`, `DictIter<K, V`
490 - `xrange()`
491 - `enumerate()`
492 - `reversed(mylist)` &rarr; `ReverseListIter`
493 - `d.iteritems()` is rewritten `mylib.iteritems()` &rarr; `DictIter`
494 - TODO: can we be smarter about this?
495- Python's `in` operator:
496 - `s in mystr` &rarr; `str_contains(mystr, s)`
497 - `x in mylist` &rarr; `list_contains(mylist, x)`
498- Classes and inheritance
499 - `__init__` method becomes a constructor. Note: initializer lists aren't
500 used.
501 - Detect `virtual` methods
502 - TODO: could we detect `abstract` methods? (`NotImplementedError`)
503- Python generators `Iterator[T]` &rarr; eager `List<T>` accumulators
504- Python Exceptions &rarr; C++ exceptions
505- Python Modules &rarr; C++ namespace (we assume a 2-level hierarchy)
506 - TODO: mycpp need real modules, because our `oils_for_unix.mycpp.cc`
507 translation unit is getting big.
508 - And `cpp/preamble.h` is a hack to work around the lack of modules.
509
510### Minor Translations
511
512- `s1 == s2` &rarr; `str_equals(s1, s2)`
513- `'x' * 3` &rarr; `str_repeat(globalStr, 3)`
514- `[None] * 3` &rarr; `list_repeat(nullptr, 3)`
515- Omitted:
516 - If the LHS of an assignment is `_`, then the statement is omitted
517 - This is for `_ = log`, which shuts up Python lint warnings for 'unused
518 import'
519 - Code under `if __name__ == '__main__'`
520
521### Optimizations
522
523- Returning Tuples by value. To reduce GC pressure, we we return
524 `Tuple2<A, B>` instead of `Tuple2<A, B>*`, and likewise for `Tuple3` and `Tuple4`.
525
526### Rooting Policy
527
528The translated code roots local variables in every function
529
530 StackRoots _r({&var1, &var2});
531
532We have two kinds of hand-written code:
533
5341. Methods like `Str::strip()` in `mycpp/`
5352. OS bindings like `stat()` in `cpp/`
536
537Neither of them needs any rooting! This is because we use **manual collection
538points** in the interpreter, and these functions don't call any functions that
539can collect. They are "leaves" in the call tree.
540
541## The mycpp Runtime
542
543The mycpp translator targets a runtime that's written from scratch. It
544implements garbage-collected data structures like:
545
546- Typed records
547 - Python classes
548 - ASDL product and sum types
549- `Str` (immutable, as in Python)
550- `List<T>`
551- `Dict<K, V>`
552- `Tuple2<A, B>`, `Tuple3<A, B, C>`, ...
553
554It also has functions based on CPython's:
555
556- `mycpp/gc_builtins.{h,cc}` corresponds roughly to Python's `__builtin__`
557 module, e.g. `int()` and `str()`
558- `mycpp/gc_mylib.{h,cc}` corresponds `mylib.py`
559 - `mylib.BufWriter` is a bit like `cStringIO.StringIO`
560
561### Differences from CPython
562
563- Integers either C `int` or `mylib.BigInt`, not Python's arbitrary size
564 integers
565- `NUL` bytes are allowed in arguments to syscalls like `open()`, unlike in
566 CPython
567- `s.strip()` is defined in terms of ASCII whitespace, which does not include
568 say `\v`.
569 - This is done to be consistent with JSON and J8 Notation.
570
571## C++ Notes
572
573### Gotchas
574
575- C++ classes can have 2 member variables of the same name! From the base
576 class and derived class.
577- Failing to declare methods `virtual` can involve the wrong one being called
578 at runtime
579
580### Minor Features Used
581
582In addition to classes, templates, exceptions, etc. mentioned above, we use:
583
584- `static_cast` and `reinterpret_cast`
585- `enum class` for ASDL
586- Function overloading
587 - For equality and hashing?
588- `offsetof` for introspection of field positions for garbage collection
589- `std::initializer_list` for `StackRoots()`
590 - Should we get rid of this?
591
592### Not Used
593
594- I/O Streams, RTTI, etc.
595- `const`
596- Smart pointers
597