1 | mycpp
|
2 | =====
|
3 |
|
4 | This is a Python-to-C++ translator based on MyPy. It only
|
5 | handles the small subset of Python that we use in Oils.
|
6 |
|
7 | It's inspired by both mypyc and Shed Skin. These posts give background:
|
8 |
|
9 | - [Brief Descriptions of a Python to C++ Translator](https://www.oilshell.org/blog/2022/05/mycpp.html)
|
10 | - [Oil Is Being Implemented "Middle Out"](https://www.oilshell.org/blog/2022/03/middle-out.html)
|
11 |
|
12 | As of March 2024, the translation to C++ is **done**. So it's no longer
|
13 | experimental!
|
14 |
|
15 | However, it's still pretty **hacky**. This doc exists mainly to explain the
|
16 | hacks. (We may want to rewrite mycpp as "yaks", although it's low priority
|
17 | right now.)
|
18 |
|
19 | ---
|
20 |
|
21 | Source for this doc: [mycpp/README.md]($oils-src). The code is all in
|
22 | [mycpp/]($oils-src).
|
23 |
|
24 |
|
25 | <div id="toc">
|
26 | </div>
|
27 |
|
28 | ## Instructions
|
29 |
|
30 | ### Translating and Compiling `oils-cpp`
|
31 |
|
32 | Running `mycpp` is best done on a Debian / Ubuntu-ish machine. Follow the
|
33 | instructions at <https://github.com/oilshell/oil/wiki/Contributing> to create
|
34 | the "dev build" first, which is DISTINCT from the C++ build. Make sure you can
|
35 | run:
|
36 |
|
37 | oil$ build/py.sh all
|
38 |
|
39 | This will give you a working shell:
|
40 |
|
41 | oil$ bin/osh -c 'echo hi' # running interpreted Python
|
42 | hi
|
43 |
|
44 | To run mycpp, we will build Python 3.10, clone MyPy, and install MyPy's
|
45 | dependencies. First install packages:
|
46 |
|
47 | # We need libssl-dev, libffi-dev, zlib1g-dev to bootstrap Python
|
48 | oil$ build/deps.sh install-ubuntu-packages
|
49 |
|
50 | You'll also need a C++17 compiler for code generated by Souffle datalog, used
|
51 | by mycpp, although Oils itself only requires C++11.
|
52 |
|
53 | Then fetch data, like the Python 3.10 tarball and MyPy repo:
|
54 |
|
55 | oil$ build/deps.sh fetch
|
56 |
|
57 | Then build from source:
|
58 |
|
59 | oil$ build/deps.sh install-wedges
|
60 |
|
61 | To build oil-native, use:
|
62 |
|
63 | oil$ ./NINJA-config.sh
|
64 | oil$ ninja # translate and compile, may take 30 seconds
|
65 |
|
66 | oil$ _bin/cxx-asan/osh -c 'echo hi' # running compiled C++ !
|
67 | hi
|
68 |
|
69 | To run the tests and benchmarks:
|
70 |
|
71 | oil$ mycpp/TEST.sh test-translator
|
72 | ... 200+ tasks run ...
|
73 |
|
74 | If you have problems, post a message on `#oil-dev` at
|
75 | `https://oilshell.zulipchat.com`. Not many people have contributed to `mycpp`,
|
76 | so I can use your feedback!
|
77 |
|
78 | Related:
|
79 |
|
80 | - [Oil Native Quick
|
81 | Start](https://github.com/oilshell/oil/wiki/Oil-Native-Quick-Start) on the
|
82 | wiki.
|
83 | - [Oil Dev Cheat Sheet](https://github.com/oilshell/oil/wiki/Oil-Dev-Cheat-Sheet)
|
84 |
|
85 | ## Notes on the Algorithm / Architecture
|
86 |
|
87 | There are four passes over the MyPy AST.
|
88 |
|
89 | (1) `const_pass.py`: Collect string constants
|
90 |
|
91 | Turn the constant in `myfunc("foo")` into top-level `GLOBAL_STR(str1,
|
92 | "foo")`.
|
93 |
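To make the idea of the constant pass concrete, here is a minimal sketch. It is not the real `const_pass.py`: it walks Python's standard `ast` tree rather than the MyPy AST, and the helper names `collect_string_constants()` and `emit_globals()`, as well as the `str0`, `str1` numbering, are assumptions made for illustration.

```python
import ast
import json


def collect_string_constants(source):
    """Map each distinct string literal in `source` to a generated name."""
    names = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Constant) and isinstance(node.value, str):
            if node.value not in names:
                # Hypothetical naming scheme: str0, str1, ...
                names[node.value] = 'str%d' % len(names)
    return names


def emit_globals(names):
    """Emit one GLOBAL_STR() line per collected constant."""
    # json.dumps() only approximates C string escaping
    return ['GLOBAL_STR(%s, %s);' % (name, json.dumps(s))
            for s, name in names.items()]


names = collect_string_constants('myfunc("foo")\nother("foo", "bar")')
for line in emit_globals(names):
    print(line)
```

Repeated literals such as `"foo"` share one global in this sketch, so call sites can refer to the constant by name.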
|
94 | (2) Three passes in `cppgen_pass.py`.
|
95 |
|
96 | (a) Forward Declaration Pass.
|
97 |
|
98 | class Foo;
|
99 | class Bar;
|
100 |
|
101 | This pass also determines which methods should be declared `virtual` in their
|
102 | declarations. The `virtual` keyword is written in the next pass.
|
103 |
|
104 | (b) Declaration Pass.
|
105 |
|
106 | class Foo {
|
107 | void method();
|
108 | };
|
109 | class Bar {
|
110 | void method();
|
111 | };
|
112 |
|
113 | More work in this pass:
|
114 |
|
115 | - Collect member variables and write them at the end of the definition
|
116 | - Collect locals for "hoisting". Written in the next pass.
|
117 |
|
118 | (c) Definition Pass.
|
119 |
|
120 | void Foo::method() {
|
121 | ...
|
122 | }
|
123 |
|
124 | void Bar::method() {
|
125 | ...
|
126 | }
|
127 |
|
128 | Note: I really wish we were not using visitors, but that's inherited from MyPy.
|
129 |
|
130 | ## mycpp Idioms / "Creative Hacks"
|
131 |
|
132 | Oils is written in typed Python 2. It will run under a stock Python 2
|
133 | interpreter, and it will typecheck with stock MyPy.
|
134 |
|
135 | However, there are a few language features that don't map cleanly from typed
|
136 | Python to C++:
|
137 |
|
138 | - switch statements (unfortunately we don't have the Python 3 match statement)
|
139 | - C++ destructors - the RAII pattern
|
140 | - casting - MyPy has one kind of cast; C++ has `static_cast` and
|
141 | `reinterpret_cast`. (We don't use C-style casting.)
|
142 |
|
143 | So this describes the idioms we use. There are some hacks in
|
144 | [mycpp/cppgen_pass.py]($oils-src) to handle these cases, and also Python
|
145 | runtime equivalents in `mycpp/mylib.py`.
|
146 |
|
147 | ### `with {,tag,str_}switch` → Switch statement
|
148 |
|
149 | We have three constructs that translate to a C++ switch statement. They use a
|
150 | Python context manager `with Xswitch(obj) ...` as a little hack.
|
151 |
|
152 | Here are examples like the ones in [mycpp/examples/test_switch.py]($oils-src).
|
153 | (`ninja mycpp-logs-equal` translates, compiles, and tests all the examples.)
|
154 |
|
155 | Simple switch:
|
156 |
|
157 | myint = 99
|
158 | with switch(myint) as case:
|
159 | if case(42, 43):
|
160 | print('forties')
|
161 | else:
|
162 | print('other')
|
163 |
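For the code above to run under a stock Python interpreter, `switch` needs a Python runtime equivalent. The following is a minimal sketch of what such a context manager could look like; the real definition lives in `mycpp/mylib.py` and may differ, and `describe()` is a hypothetical caller.

```python
class switch(object):
    """Sketch of a context manager that imitates a switch statement."""

    def __init__(self, value):
        self.value = value

    def __enter__(self):
        # The object bound by `as case` is the switch itself
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        return False  # never swallow exceptions

    def __call__(self, *cases):
        # case(42, 43) is true if the switched value matches any argument
        return self.value in cases


def describe(myint):
    with switch(myint) as case:
        if case(42, 43):
            return 'forties'
        else:
            return 'other'


print(describe(42))
print(describe(99))
```

In Python the `with` block does nothing special. It exists so the translator can recognize the shape and emit a C++ `switch`.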
|
164 | Switch on **object type**, which goes well with ASDL sum types:
|
165 |
|
166 | val = value.Str('foo') # type: value_t
|
167 | with tagswitch(val) as case:
|
168 | if case(value_e.Str, value_e.Int):
|
169 | print('string or int')
|
170 | else:
|
171 | print('other')
|
172 |
|
173 | We usually need to apply the `UP_val` pattern here, described in the next
|
174 | section.
|
175 |
|
176 | Switch on **string**, which generates a fast **two-level dispatch** -- first on
|
177 | length, and then with `str_equals_c()`:
|
178 |
|
179 | s = 'foo'
|
180 | with str_switch(s) as case:
|
181 | if case("foo"):
|
182 | print('FOO')
|
183 | else:
|
184 | print('other')
|
185 |
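The two-level dispatch can be illustrated in plain Python. This sketch only models the control flow (length first, then equality); the generated C++ uses a `switch` on the length and `str_equals_c()`, and the helper names `build_dispatch()` and `match_case()` are hypothetical.

```python
def build_dispatch(cases):
    """Group the case strings by length (the first dispatch level)."""
    by_len = {}
    for s in cases:
        by_len.setdefault(len(s), []).append(s)
    return by_len


def match_case(by_len, s):
    """Return the matching case string, or None for the default branch."""
    # Level 1: only candidates of the same length are considered
    for candidate in by_len.get(len(s), []):
        # Level 2: full comparison, only against same-length strings
        if candidate == s:
            return candidate
    return None


table = build_dispatch(['foo', 'bar', 'spam'])
print(match_case(table, 'bar'))
print(match_case(table, 'eggs!'))
```

A string whose length matches no case is rejected without any byte comparison, which is where the speedup comes from.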
|
186 | ### `val` → `UP_val` → `val` Downcasting pattern
|
187 |
|
188 | Summary: variable names like `UP_*` are **special** in our Python code.
|
189 |
|
190 | Consider the downcasts marked BAD:
|
191 |
|
192 | val = value.Str('foo') # type: value_t
|
193 |
|
194 | with tagswitch(val) as case:
|
195 | if case(value_e.Str):
|
196 | val = cast(value.Str, val) # BAD: conflicts with first declaration
|
197 | print('s = %s' % val.s)
|
198 |
|
199 | elif case(value_e.Int):
|
200 | val = cast(value.Int, val) # BAD: conflicts with both
|
201 | print('i = %d' % val.i)
|
202 |
|
203 | else:
|
204 | print('other')
|
205 |
|
206 | MyPy allows this, but it translates to invalid C++ code. C++ can't have a
|
207 | variable named `val`, with 2 related types `value_t` and `value::Str`.
|
208 |
|
209 | So we use this idiom instead, which takes advantage of **local vars in case
|
210 | blocks** in C++:
|
211 |
|
212 | val = value.Str('foo') # type: value_t
|
213 |
|
214 | UP_val = val # temporary variable that will be cast
|
215 |
|
216 | with tagswitch(val) as case:
|
217 | if case(value_e.Str):
|
218 | val = cast(value.Str, UP_val) # this works
|
219 | print('s = %s' % val.s)
|
220 |
|
221 | elif case(value_e.Int):
|
222 | val = cast(value.Int, UP_val) # also works
|
223 | print('i = %d' % val.i)
|
224 |
|
225 | else:
|
226 | print('other')
|
227 |
|
228 | This translates to something like:
|
229 |
|
230 | value_t* val = Alloc<value::Str>(str42);
|
231 | value_t* UP_val = val;
|
232 |
|
233 | switch (val->tag()) {
|
234 | case value_e::Str: {
|
235 | // DIFFERENT local var
|
236 | value::Str* val = static_cast<value::Str*>(UP_val);
|
237 | print(StrFormat(str43, val->s));
|
238 | }
|
239 | break;
|
240 | case value_e::Int: {
|
241 | // ANOTHER DIFFERENT local var
|
242 | value::Int* val = static_cast<value::Int*>(UP_val);
|
243 | print(StrFormat(str44, val->i));
|
244 | }
|
245 | break;
|
246 | default:
|
247 | print(str45);
|
248 | }
|
249 |
|
250 | This works because there's no problem having **different** variables with the
|
251 | same name within each `case { }` block.
|
252 |
|
253 | Again, the names `UP_*` are **special**. If the name doesn't start with `UP_`,
|
254 | the inner blocks will look like:
|
255 |
|
256 | case value_e::Str: {
|
257 | val = static_cast<value::Str*>(val); // BAD: val reused
|
258 | print(StrFormat(str43, val->s));
|
259 | }
|
260 |
|
261 | And they will fail to compile. It's not valid C++ because the superclass
|
262 | `value_t` doesn't have a field `val->s`. Only the subclass `value::Str` has
|
263 | it.
|
264 |
|
265 | (Note that Python has a single flat scope per function, while C++ has nested
|
266 | scopes.)
|
267 |
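As a small illustration of the flat-scope point, the Python function below assigns `label` inside the branches and reads it afterwards. `Describe()` is a hypothetical example, not code from Oils.

```python
def Describe(flag):
    if flag:
        label = 'yes'  # first assignment happens inside a block
    else:
        label = 'no'
    # Still visible here: Python has one flat scope per function.
    # In C++, a variable declared inside the braces of the `if` would be
    # out of scope at this point, which is why locals get hoisted.
    return label


print(Describe(True))
```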
|
268 | ### Python context manager → C++ constructor and destructor (RAII)
|
269 |
|
270 | This Python code:
|
271 |
|
272 | with ctx_Foo(42):
|
273 | f()
|
274 |
|
275 | translates to this C++ code:
|
276 |
|
277 | {
|
278 | ctx_Foo tmp(42);
|
279 | f();
|
280 |
|
281 | // destructor ~ctx_Foo implicitly called
|
282 | }
|
283 |
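On the Python side, a class like `ctx_Foo` can be sketched as follows. This is an assumed shape for illustration: the setup work is done in `__init__` (the C++ constructor) and the cleanup in `__exit__` (the C++ destructor). The `events` list exists only to make the ordering visible.

```python
events = []


class ctx_Foo(object):
    def __init__(self, n):
        # Corresponds to the C++ constructor
        events.append('enter %d' % n)

    def __enter__(self):
        pass

    def __exit__(self, exc_type, exc_value, traceback):
        # Corresponds to the C++ destructor
        events.append('exit')


def f():
    events.append('f')


with ctx_Foo(42):
    f()

print(events)
```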
|
284 | ## MyPy "Shimming" Technique
|
285 |
|
286 | We have an interesting way of "writing Python and C++ at the same time":
|
287 |
|
288 | 1. First, all Python code must pass the MyPy type checker, and run with a stock
|
289 | Python 2 interpreter.
|
290 | - This is the source of truth — the source of our semantics.
|
291 | 1. We translate most `.py` files to C++, **except** some files, in particular
|
292 | [mycpp/mylib.py]($oils-src) and files starting with `py` like
|
293 | `core/{pyos,pyutil}.py`.
|
294 | 1. In C++, we can substitute custom implementations with the properties we
|
295 | want, like `Dict<K, V>` being ordered, `BigInt` being distinct from C `int`,
|
296 | `BufWriter` being efficient, etc.
|
297 |
|
298 | The MyPy type system is very powerful! It lets us do all this.
|
299 |
|
300 | ### NewDict() for ordered dicts
|
301 |
|
302 | Dicts in Python 2 aren't ordered, but we make them ordered at **runtime** by
|
303 | using `mylib.NewDict()`, which returns `collections_.OrderedDict`.
|
304 |
|
305 | The **static type** is still `Dict[K, V]`, but we change the "spec" to be an
|
306 | ordered dict.
|
307 |
|
308 | In C++, `Dict<K, V>` is implemented as an ordered dict. (Note: we don't
|
309 | implement preserving order on deletion, which seems OK.)
|
310 |
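A minimal sketch of the Python side of this shim, assuming the standard library's `collections.OrderedDict` as a stand-in for the `collections_.OrderedDict` mentioned above:

```python
from collections import OrderedDict


def NewDict():
    """Return a dict that preserves insertion order at runtime.

    Callers still annotate the result as Dict[K, V].
    """
    return OrderedDict()


d = NewDict()
d['z'] = 1
d['a'] = 2
d['m'] = 3
print(list(d.keys()))
```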
|
311 | - TODO: `iteritems()` could go away
|
312 |
|
313 | ### StackArray[T]
|
314 |
|
315 | TODO: describe this when it works.
|
316 |
|
317 | ### BigInt
|
318 |
|
319 | - In Python, it's simply defined as a class with an integer, in
|
320 | [mylib/mops.py]($oils-src).
|
321 | - In C++, it's currently `typedef int64_t BigInt`, but we want to make it a big
|
322 | integer.
|
323 |
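The Python side can be pictured with the sketch below. The function names `Add()` and `ToStr()` and their signatures are assumptions for illustration, not a description of the actual module.

```python
class BigInt(object):
    """Wrapper that gives big integers a static type distinct from int."""

    def __init__(self, i):
        self.i = i  # a Python integer, which is arbitrary precision


def Add(a, b):
    """Hypothetical helper: arithmetic goes through functions, not `+`."""
    return BigInt(a.i + b.i)


def ToStr(b):
    """Hypothetical helper: convert to a decimal string."""
    return str(b.i)


print(ToStr(Add(BigInt(2 ** 40), BigInt(1))))
```

Because `BigInt` is a separate class, MyPy rejects code that passes it where an `int` is expected, which keeps the two kinds of integer apart in the C++ translation too.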
|
324 | ### ByteAt(), ByteEquals(), ...
|
325 |
|
326 | Hand optimizations that avoid allocating 1-byte strings. They are used in the
|
327 | IFS algorithm, `LooksLikeGlob()`, and `GlobUnescape()`.
|
328 |
|
329 | ### File / LineReader / BufWriter
|
330 |
|
331 | TODO: describe how this works.
|
332 |
|
333 | Can it be more type safe? I think we can cast `File` to both `LineReader` and
|
334 | `BufWriter`.
|
335 |
|
336 | Or can we invert the relationship, so `File` derives from **both** LineReader
|
337 | and BufWriter?
|
338 |
|
339 | ### Fast JSON - avoid intermediate allocations
|
340 |
|
341 | - `pyj8.WriteString()` is shimmed so we don't create encoded J8 string objects,
|
342 | only to throw them away and write to `mylib.BufWriter`. Instead, we append
|
343 | the encoded string **directly** to the `BufWriter`.
|
344 | - Likewise, we have `BufWriter::write_spaces` to avoid temporary allocations
|
345 | when writing indents.
|
346 | - This could be generalized to `BufWriter::write_repeated(' ', 42)`.
|
347 | - We may also want `BufWriter::write_slice()`
|
348 |
|
349 | ## Limitations Requiring Source Rewrites
|
350 |
|
351 | mycpp itself may limit expressiveness, or the C++ language may not
|
352 | be able to express what we want.
|
353 |
|
354 | - C++ doesn't have `try / except / else`, or `finally`
|
355 | - Use the `with ctx_Foo` pattern instead.
|
356 | - `if mylist` tests if the pointer is non-NULL; use `if len(mylist)` for
|
357 | non-empty test
|
358 | - Functions can have at most one keyword / optional argument.
|
359 | - We generate two methods: `f(x)` which calls `f(x, y)` with the default
|
360 | value of `y`
|
361 | - If there are two or more optional arguments:
|
362 | - For classes, you can use the "builder pattern", i.e. add an
|
363 | `Init_MyMember()` method
|
364 | - If the arguments are booleans, translate it to a single bitfield argument
|
365 | - C++ has nested scope and Python has flat function scope. This can cause name
|
366 | collisions.
|
367 | - Could enforce this if it becomes a problem
|
368 |
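Two of the rewrites above can be sketched in Python. The names `OPT_VERBOSE`, `OPT_DRY_RUN`, `Run()`, and `FirstOrEmpty()` are hypothetical; the sketch only shows the shape of the source code after rewriting.

```python
# Two optional booleans folded into a single optional bitfield argument
OPT_VERBOSE = 1 << 0
OPT_DRY_RUN = 1 << 1


def Run(cmd, flags=0):
    """Has exactly one optional argument, so it stays within the limit."""
    parts = [cmd]
    if flags & OPT_VERBOSE:
        parts.append('verbose')
    if flags & OPT_DRY_RUN:
        parts.append('dry-run')
    return ' '.join(parts)


def FirstOrEmpty(mylist):
    # Explicit length test: `if mylist` would only test the pointer in C++
    if len(mylist):
        return mylist[0]
    return ''


print(Run('ls', OPT_VERBOSE | OPT_DRY_RUN))
print(FirstOrEmpty([]))
```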
|
369 | Also see `mycpp/examples/invalid_*` for Python code that fails to translate.
|
370 |
|
371 | ## WARNING: Assumptions Not Checked
|
372 |
|
373 | ### Global Constants Can't Be Mutated
|
374 |
|
375 | We translate top level constants to statically initialized C data structures
|
376 | (zero startup cost):
|
377 |
|
378 | gStr = 'foo'
|
379 | gList = [1, 2] # type: List[int]
|
380 | gDict = {'bar': 42} # type: Dict[str, int]
|
381 |
|
382 | Even though `List` and `Dict` are mutable in general, you should **NOT** mutate
|
383 | these global instances! The C++ code will break at runtime.
|
384 |
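One way to stay within this rule is to copy a global before changing it. The sketch below assumes that copying with `list()` is available in translated code; `WithExtra()` is a hypothetical function.

```python
gList = [1, 2]  # global constant: never mutate this instance


def WithExtra(x):
    result = list(gList)  # copy first (assumed to be supported)
    result.append(x)      # mutate the copy, not the global
    return result


print(WithExtra(3))
print(gList)
```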
|
385 | ### Gotcha about Returning Variants (Subclasses) of a Type
|
386 |
|
387 | MyPy will accept this code:
|
388 |
|
389 | ```
|
390 | if cond:
|
391 | sig = proc_sig.Open # type: proc_sig_t
|
392 | # bad because mycpp HOISTS this
|
393 | else:
|
394 | sig = proc_sig.Closed.CreateNull()
|
395 | sig.words = words # assignment fails
|
396 | return sig
|
397 | ```
|
398 |
|
399 | It will translate to C++, but fail to compile. Instead, rewrite it like this:
|
400 |
|
401 | ```
|
402 | sig = None # type: proc_sig_t
|
403 | if cond:
|
404 | sig = proc_sig.Open
|
405 | # OK: `sig` was declared once above, with the base type
|
406 | else:
|
407 | closed = proc_sig.Closed.CreateNull()
|
408 | closed.words = words # works: `closed` has the subclass type
|
409 | sig = closed
|
410 | return sig
|
411 | ```
|
412 |
|
413 | ### Exceptions Can't Leave Destructors / Python `__exit__`
|
414 |
|
415 | Context managers like `with ctx_Foo():` translate to C++ constructors and
|
416 | destructors.
|
417 |
|
418 | In C++, a destructor can't "leave" an exception. It results in a runtime error.
|
419 |
|
420 | You can throw and CATCH an exception WITHIN a destructor, but you can't let it
|
421 | propagate outside.
|
422 |
|
423 | This means you must be careful when coding the `__exit__` method. For example,
|
424 | in `vm::ctx_Redirect`, we had this bug due to `IOError` being thrown and not
|
425 | caught when restoring/popping redirects.
|
426 |
|
427 | To fix the bug, we rewrote the code to use an out param
|
428 | `List[IOError_OSError]`.
|
429 |
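The out param idea can be sketched as follows. This is not the actual `vm::ctx_Redirect` code: `ctx_Demo`, `_PopRedirects()`, and `Run()` are hypothetical names, and the failing cleanup is simulated.

```python
def _PopRedirects():
    # Hypothetical cleanup step that fails
    raise IOError('cannot restore fd')


class ctx_Demo(object):
    def __init__(self, err_out):
        self.err_out = err_out  # out param: the caller inspects it later

    def __enter__(self):
        pass

    def __exit__(self, exc_type, exc_value, traceback):
        try:
            _PopRedirects()
        except IOError as e:
            # Caught WITHIN __exit__, so nothing propagates out of what
            # becomes the C++ destructor
            self.err_out.append(e)


def Run():
    err_out = []
    with ctx_Demo(err_out):
        pass
    # The caller decides how to report the errors after the block
    return [str(e) for e in err_out]


print(Run())
```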
|
430 | Related:
|
431 |
|
432 | - <https://akrzemi1.wordpress.com/2011/09/21/destructors-that-throw/>
|
433 |
|
434 | ## More Translation Notes
|
435 |
|
436 | ### Hacky Heuristics
|
437 |
|
438 | - `callable(arg)` to either:
|
439 | - function call `f(arg)`
|
440 | - instantiation `Alloc<T>(arg)`
|
441 | - `name.attr` to either:
|
442 | - `obj->member`
|
443 | - `module::Func`
|
444 | - `cast(MyType, obj)` to either
|
445 | - `static_cast<MyType*>(obj)`
|
446 | - `reinterpret_cast<MyType*>(obj)`
|
447 |
|
448 | ### Hacky Hard-Coded Names
|
449 |
|
450 | These are signs of coupling between mycpp and Oils, which ideally shouldn't
|
451 | exist.
|
452 |
|
453 | - `mycpp_main.py`
|
454 | - `ModulesToCompile()` -- some files have to be ordered first, like the ASDL
|
455 | runtime.
|
456 | - TODO: Pea can respect parameter order? So we do that outside the project?
|
457 | - Another ordering constraint comes from **inheritance**. The forward
|
458 | declaration is NOT sufficient in that case.
|
459 | - `cppgen_pass.py`
|
460 | - `_GetCastKind()` has some hard-coded names
|
461 | - `AsdlType::Create()` is special cased to `::`, not `->`
|
462 | - Default arguments e.g. `scope_e::Local` need a repeated `using`.
|
463 |
|
464 | Issue on mycpp improvements: <https://github.com/oilshell/oil/issues/568>
|
465 |
|
466 | ### Major Features
|
467 |
|
468 | - Python `int` and `bool` → C++ `int` and `bool`
|
469 | - `None` → `nullptr`
|
470 | - Statically Typed Python Collections
|
471 | - `str` → `Str*`
|
472 | - `List[T]` → `List<T>*`
|
473 | - `Dict[K, V]` → `Dict<K, V>*`
|
474 | - tuples → `Tuple2<A, B>`, `Tuple3<A, B, C>`, etc.
|
475 | - Collection literals turn into initializer lists
|
476 | - And there is a C++ type inference issue which requires an explicit
|
477 | `std::initializer_list<int>{1, 2, 3}`, not just `{1, 2, 3}`
|
478 | - Python's polymorphic iteration → `StrIter`, `ListIter<T>`, `DictIter<K,
|
479 | V>`
|
480 | - `d.iteritems()` is rewritten as `mylib.iteritems()` → `DictIter`
|
481 | - TODO: can we be smarter about this?
|
482 | - `reversed(mylist)` → `ReverseListIter`
|
483 | - Python's `in` operator:
|
484 | - `s in mystr` → `str_contains(mystr, s)`
|
485 | - `x in mylist` → `list_contains(mylist, x)`
|
486 | - Classes and inheritance
|
487 | - `__init__` method becomes a constructor. Note: initializer lists aren't
|
488 | used.
|
489 | - Detect `virtual` methods
|
490 | - TODO: could we detect `abstract` methods? (`NotImplementedError`)
|
491 | - Python generators `Iterator[T]` → eager `List<T>` accumulators
|
492 | - Python Exceptions → C++ exceptions
|
493 | - Python Modules → C++ namespace (we assume a 2-level hierarchy)
|
494 | - TODO: mycpp needs real modules, because our `oils_for_unix.mycpp.cc`
|
495 | translation unit is getting big.
|
496 | - And `cpp/preamble.h` is a hack to work around the lack of modules.
|
497 |
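The generator item in the list above can be pictured in Python terms. The sketch shows a generator next to an eager equivalent that appends to a list; how mycpp actually threads the accumulator through the generated C++ is not described here, so the `out` parameter is an assumption.

```python
def Evens(n):
    """Generator form, as written in the Python source."""
    for i in range(n):
        if i % 2 == 0:
            yield i


def Evens_eager(n, out):
    """Eager form: every `yield` becomes an append to an accumulator."""
    for i in range(n):
        if i % 2 == 0:
            out.append(i)


acc = []
Evens_eager(6, acc)
print(acc)
```

The eager form computes every element up front, so it only matches the generator when the caller consumes the whole sequence.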
|
498 | ### Minor Translations
|
499 |
|
500 | - `s1 == s2` → `str_equals(s1, s2)`
|
501 | - `'x' * 3` → `str_repeat(globalStr, 3)`
|
502 | - `[None] * 3` → `list_repeat(nullptr, 3)`
|
503 | - Omitted:
|
504 | - If the LHS of an assignment is `_`, then the statement is omitted
|
505 | - This is for `_ = log`, which shuts up Python lint warnings for 'unused
|
506 | import'
|
507 | - Code under `if __name__ == '__main__'`
|
508 |
|
509 | ### Optimizations
|
510 |
|
511 | - Returning Tuples by value. To reduce GC pressure, we return
|
512 | `Tuple2<A, B>` instead of `Tuple2<A, B>*`, and likewise for `Tuple3` and `Tuple4`.
|
513 |
|
514 | ### Rooting Policy
|
515 |
|
516 | The translated code roots local variables in every function:
|
517 |
|
518 | StackRoots _r({&var1, &var2});
|
519 |
|
520 | We have two kinds of hand-written code:
|
521 |
|
522 | 1. Methods like `Str::strip()` in `mycpp/`
|
523 | 2. OS bindings like `stat()` in `cpp/`
|
524 |
|
525 | Neither of them needs any rooting! This is because we use **manual collection
|
526 | points** in the interpreter, and these functions don't call any functions that
|
527 | can collect. They are "leaves" in the call tree.
|
528 |
|
529 | ## The mycpp Runtime
|
530 |
|
531 | The mycpp translator targets a runtime that's written from scratch. It
|
532 | implements garbage-collected data structures like:
|
533 |
|
534 | - Typed records
|
535 | - Python classes
|
536 | - ASDL product and sum types
|
537 | - `Str` (immutable, as in Python)
|
538 | - `List<T>`
|
539 | - `Dict<K, V>`
|
540 | - `Tuple2<A, B>`, `Tuple3<A, B, C>`, ...
|
541 |
|
542 | It also has functions based on CPython's:
|
543 |
|
544 | - `mycpp/gc_builtins.{h,cc}` corresponds roughly to Python's `__builtin__`
|
545 | module, e.g. `int()` and `str()`
|
546 | - `mycpp/gc_mylib.{h,cc}` corresponds to `mylib.py`
|
547 | - `mylib.BufWriter` is a bit like `cStringIO.StringIO`
|
548 |
|
549 | ### Differences from CPython
|
550 |
|
551 | - Integers are either C `int` or `mylib.BigInt`, not Python's arbitrary size
|
552 | integers
|
553 | - `NUL` bytes are allowed in arguments to syscalls like `open()`, unlike in
|
554 | CPython
|
555 | - `s.strip()` is defined in terms of ASCII whitespace, which does not include
|
556 | say `\v`.
|
557 | - This is done to be consistent with JSON and J8 Notation.
|
558 |
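The `strip()` difference can be shown with a short sketch. It assumes the whitespace set is the four characters that JSON treats as whitespace; the exact set used by the runtime is not stated above, and `ascii_strip()` is a hypothetical name.

```python
# Assumption: space, tab, carriage return, newline (JSON's whitespace)
ASCII_WHITESPACE = ' \t\r\n'


def ascii_strip(s):
    """Strip only the assumed whitespace set from both ends."""
    return s.strip(ASCII_WHITESPACE)


s = ' \tfoo\v\n'
print(repr(ascii_strip(s)))  # the \v survives
print(repr(s.strip()))       # CPython's default also removes \v
```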
|
559 | ## C++ Notes
|
560 |
|
561 | ### Gotchas
|
562 |
|
563 | - C++ classes can have 2 member variables of the same name! From the base
|
564 | class and derived class.
|
565 | - Failing to declare methods `virtual` can cause the wrong one to be called
|
566 | at runtime
|
567 |
|
568 | ### Minor Features Used
|
569 |
|
570 | In addition to classes, templates, exceptions, etc. mentioned above, we use:
|
571 |
|
572 | - `static_cast` and `reinterpret_cast`
|
573 | - `enum class` for ASDL
|
574 | - Function overloading
|
575 | - For equality and hashing?
|
576 | - `offsetof` for introspection of field positions for garbage collection
|
577 | - `std::initializer_list` for `StackRoots()`
|
578 | - Should we get rid of this?
|
579 |
|
580 | ### Not Used
|
581 |
|
582 | - I/O Streams, RTTI, etc.
|
583 | - `const`
|
584 | - Smart pointers
|
585 |
|