How it works#

The compilation step starts with tracing the Python code to generate an acyclic directed graph (DAG) of variables and operations. The code can contain functions, closures, branching, and so on, but conditional branching is only allowed when the condition is known at tracing time (a cp.iif function exists to work around this). In the next step, this DAG is optimized and linearized into a sequence of operations. Each operation is mapped to a precompiled stencil or a combination of several stencils. A stencil is a piece of machine code with placeholders for memory addresses pointing to other code or data. The compiler generates patch instructions that fill these placeholders with the correct memory addresses.

After compilation, the binary code built from the stencils, the constant data, and the patch instructions is handed to the runner for execution. The runner allocates memory for code and data, copies both into place, applies the patch instructions, and finally executes the code.

The C code for a very simple stencil can look like this:

add_float_float(float arg1, float arg2) {
    result_float_float(arg1 + arg2, arg2);
}

The call to the dummy function result_float_float ensures that the compiler keeps the result and the second operand in registers for later use. The dummy function acts as a placeholder for the next stencil. Copapy uses two virtual registers, which map on most relevant architectures to actual hardware registers. Data that cannot be kept in a register is stored in statically allocated heap memory. Stack memory may be used inside some stencils, but its usage is essentially fixed and independent of the Copapy program, so total memory requirements are known at compile time.

The machine code for the function above, compiled for x86_64, looks like this:

0000000000000000 <add_float_float>:
   0:    f3 0f 58 c1              addss  %xmm1,%xmm0
   4:    e9 00 00 00 00           jmp    9 <.LC1+0x1>
            5: R_X86_64_PLT32    result_float_float-0x4

Based on the relocation entry for the jmp to the symbol result_float_float, the jmp instruction is stripped when it is the last instruction in a stencil. Thus, a Copapy addition operation results in a single instruction. For stencils containing multiple branch exits, only the final jmp is removed; the others are patched to jump to the next stencil.

For more complex operations - where inlining is less useful - stencils call a non-stencil function, such as in this example:

0000000000000000 <sin_float>:
   0:    48 83 ec 08              sub    $0x8,%rsp
   4:    e8 00 00 00 00           call   9 <sin_float+0x9>
            5: R_X86_64_PLT32    sinf-0x4
   9:    48 83 c4 08              add    $0x8,%rsp
   d:    e9 00 00 00 00           jmp    12 <.LC0+0x2>
            e: R_X86_64_PLT32    result_float-0x4

Unlike stencils, non-stencil functions like sinf are not stripped and do not need to be tail-call-optimizable. These functions can be provided as C code and compiled together with the stencils or can be object files like in the case of sinf compiled from C and assembly code and merged into the stencil object files. Math functions like sinf are currently provided by the MUSL C library, with architecture-specific optimizations.

Non-stencil functions and constants are stored together with the stencils in an ELF object file for each supported CPU architecture. The required non-stencil functions and constants are bundled during compilation. The compiler includes only the data and code required for a specific Copapy program.

The Copapy compilation process is independent of the actual instruction set. It relies purely on relocation entries and symbol metadata from the ELF file generated by the C compiler.

A full listing of all stencils with machine code for all architectures from latest build is here available: Stencil overview. The compiler output for a full example program from latest compiler build is here available: Example program.