Copapy#
Copapy is a Python framework for deterministic, low-latency realtime computation with automatic differentiation support, targeting hardware applications - for example in robotics, aerospace, embedded systems, and control systems in general.
GPU frameworks like PyTorch, JAX, and TensorFlow jump-started development in the field of AI. With the right balance of flexibility and performance, they allow fast iteration on new ideas while remaining performant enough to test them, or even use them, in production.
This is exactly what Copapy aims for - but in the field of embedded realtime computation. While making use of Python's ergonomics, tooling, and general ecosystem, Copapy seamlessly runs optimized machine code. The copy-and-patch compiler is highly portable and allows effortless, fast deployment without any dependencies beyond Python. Copapy is designed to feel like writing Python scripts, with a shallow learning curve, but under the hood it produces high-performance, statically typed, and memory-safe code with a minimized set of possible runtime errors[1]. To maximize productivity, the framework provides detailed type hints to catch most errors even before compilation.
Embedded systems come with a variety of CPU architectures. The copy-and-patch compiler already supports the most common ones[3], and porting it to new architectures is straightforward if a C compiler for the target architecture is available[2]. The generated code depends only on the CPU architecture: the binaries neither perform system calls nor rely on external libraries such as libc. This makes Copapy both highly deterministic and easy to deploy on different realtime operating systems (RTOS) or on bare metal.
The main features can be summarized as follows:
Fast to write & easy to read
Memory and type safety with a minimal set of runtime errors
Deterministic execution
Automatic differentiation for efficient realtime optimization (reverse-mode; see the sketch after this list)
Optimized machine code for x86_64, AArch64 and ARMv7
Highly portable to new architectures
Small Python package with minimal dependencies and no cross-compile toolchain required
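To illustrate the reverse-mode automatic differentiation listed above, here is a minimal, self-contained sketch of the technique in plain Python. The Var class and backward function are illustrative only and not part of the Copapy API:

```python
# Minimal reverse-mode AD sketch (illustration only, not the Copapy API).
# Each node records its parents together with the local partial derivative;
# a reverse sweep in topological order accumulates gradients.
class Var:
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs of (parent, d(self)/d(parent))
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))

def backward(output):
    # Topologically order the graph, then propagate gradients in reverse.
    order, seen = [], set()
    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)
    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += local * node.grad

x, y = Var(2.0), Var(3.0)
z = x * y + x            # z = x*y + x
backward(z)
print(x.grad, y.grad)    # dz/dx = y + 1 = 4.0, dz/dy = x = 2.0
```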
Execution of the compiled code is managed by a runner application. The runner is implemented in C and handles I/O and communication with the Copapy framework. The overall design keeps the runner as simple as possible to ease portability, since this part must be adapted to the individual hardware and application. Because patching of memory addresses is done by the runner, Copapy unifies the various architecture-specific relocation types into an architecture-independent format before sending the patch instructions to the runner, which keeps the runner implementation minimal.
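As a rough illustration of such a unified format, a patch instruction could look like the following. All names and the field layout here are assumptions for this sketch, not Copapy's actual patch format:

```python
from dataclasses import dataclass
from enum import Enum

class PatchKind(Enum):
    ABSOLUTE = 0  # write the absolute address of the target
    RELATIVE = 1  # write the offset of the target relative to the patch site

@dataclass
class PatchInstruction:
    offset: int      # byte offset of the patch site within the code blob
    kind: PatchKind  # unified, architecture-independent relocation kind
    target: int      # byte offset of the referenced symbol within the blob
    size: int        # number of bytes to patch (e.g. 4 or 8)

def apply_patch(code: bytearray, base: int, p: PatchInstruction) -> None:
    """Resolve one patch instruction against the blob's load address."""
    if p.kind is PatchKind.ABSOLUTE:
        value = base + p.target
        signed = False
    else:
        value = p.target - p.offset  # PC-relative displacement
        signed = True
    code[p.offset:p.offset + p.size] = value.to_bytes(
        p.size, "little", signed=signed)
```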
The design targets either an architecture with a realtime-patched Linux kernel - where the runner uses the same CPU and memory as Linux but executes in a realtime thread - or a setup where even higher determinism is required. In such cases, the runner can be executed on a separate crossover MCU running on bare metal or an RTOS.
The Copapy framework also includes a runner as a Python module built from the same C code. This allows frictionless testing of code and may also be valuable for using Copapy in conventional application development.
Current state#
While hardware I/O is obviously a core aspect of the project, it is not yet available. Therefore, this package is currently a proof of concept with limited direct use. However, the computation engine is fully functional and available for testing and experimentation simply by installing the package. The project is now close to being ready for integration into its first demonstration hardware platform.
Currently in development:
Array stencils for handling very large arrays and generating SIMD-optimized code - e.g., for machine vision and neural network applications
Support for Thumb instructions, required by ARMv*-M (Cortex-M) MCU targets
Constant regrouping for further symbolic optimization of the computation graph (sketched below)
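The idea behind constant regrouping can be illustrated in a few lines. The tuple-based expression form and helper names below are assumptions for this sketch, not Copapy's internal representation:

```python
# A minimal sketch of constant regrouping on a sum expression, assuming a
# tiny tuple-based expression form ('add', left, right) (not Copapy's IR):
def flatten_add(expr):
    """Collect all terms of a nested addition into a flat list."""
    if isinstance(expr, tuple) and expr[0] == "add":
        return flatten_add(expr[1]) + flatten_add(expr[2])
    return [expr]

def regroup_constants(expr):
    """Fold all constant terms of a sum into a single constant."""
    terms = flatten_add(expr)
    const = sum(t for t in terms if isinstance(t, (int, float)))
    symbolic = [t for t in terms if not isinstance(t, (int, float))]
    result = symbolic[0] if symbolic else 0
    for t in symbolic[1:]:
        result = ("add", result, t)
    return ("add", result, const) if const else result

# ((x + 2) + (y + 3))  ->  ((x + y) + 5)
print(regroup_constants(("add", ("add", "x", 2), ("add", "y", 3))))
```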
Even though SIMD optimization is still missing, the benchmarks show promising numbers. The following chart plots the results in comparison to NumPy 2.3.5:
For the benchmark (tests/benchmark.py), the timing of 30000 iterations of calculating the term sum((v1 + i) @ v2 for i in range(10)) was measured on a Ryzen 5 3400G. The vectors v1 and v2 both have length v_size, which was varied from 10 to 600 as shown in the chart. For the NumPy case, the "i in range(10)" loop was vectorized as np.sum((v1 + i) @ v2), with i being an NDArray of shape [10, 1]. The number of calculated scalar operations is the same for both contenders. Copapy obviously profits from lower overhead: only a single function is called from Python per iteration, whereas the NumPy variant requires 3 calls. Interestingly, the chart gives no indication that, with increasing v_size, the NumPy calling overhead is compensated by faster SIMD instructions.
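For reference, the following self-contained sketch (with made-up data, not the actual tests/benchmark.py) shows that the two formulations compute the same value:

```python
import numpy as np

v_size = 100
rng = np.random.default_rng(0)
v1 = rng.random(v_size)
v2 = rng.random(v_size)

# Loop formulation, as written for Copapy (one compiled call per iteration):
loop_result = sum((v1 + i) @ v2 for i in range(10))

# Vectorized NumPy formulation: i becomes an array of shape (10, 1), so
# (v1 + i) broadcasts to (10, v_size) and (v1 + i) @ v2 has shape (10,).
i = np.arange(10).reshape(10, 1)
vec_result = np.sum((v1 + i) @ v2)

assert np.isclose(loop_result, vec_result)
```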
Furthermore, for many applications Copapy will reduce the actual number of operations significantly compared to a NumPy implementation, by precomputing constant values known at compile time and by exploiting sparsity. Multiplying by zero (e.g., in a diagonal matrix) eliminates a whole branch of the computation graph, and operations without effect, like multiplications by 1 or additions of zero, are eliminated at compile time.
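As a plain-NumPy illustration of the sparsity argument (not Copapy's API):

```python
import numpy as np

# Dense 3x3 matrix-vector product: 9 multiplications and 6 additions,
#   y[r] = sum(D[r, c] * x[c] for c in range(3))
# If D is a diagonal matrix known at compile time, every term with
# D[r, c] == 0 drops out of the computation graph, and multiplying by
# D[r, r] is all that remains: 3 multiplications, no additions.
D = np.diag([2.0, 3.0, 4.0])
x = np.array([1.0, 2.0, 3.0])

dense = D @ x             # computes all 9 products at runtime
folded = np.diag(D) * x   # what remains after eliminating zero branches

assert np.allclose(dense, folded)
```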
Install#
To install Copapy, you can use pip. Precompiled wheels are available for Linux (x86_64, AArch64, ARMv7), Windows (x86_64) and macOS (x86_64, AArch64):
```
pip install copapy
```
License#
This project is licensed under the MIT license - see the LICENSE file for details.