I have not looked at this one. I did run a small investigation, using some arbitrary test code, on a few other options you might find interesting. Do not hold me to these numbers, since they come from one specific piece of test code, but the results were interesting and not what one would expect (or at least not what I expected). Speedups are relative to CPython:
- CPython 1.0X.
- Nuitka 1.09X.
- Cython(simple) 1.63X. No tweaking.
- Numpy 5.66X.
- PyPy 9.21X.
- Cython(static) 24.0X. This is hand modified to declare static types.
- Numba 30.6X. No numpy, no parallel.
- CFFI(4 way) 63.8X. This is C Foreign Function Interface. This is with 4 parallel threads.
- Cython(parallel) 74.8X. This is with OpenMP enabled and prange.
- ctypes(16 way) 106.8X. This is with 16 parallel threads.
- CFFI(16 way) 107.7X. This is C Foreign Function Interface. This is with 16 parallel threads.
- C with Hand Wrapping 112.0X. 16 way.
- Numpy+Numba 121.0X
The hardware was an 8-core machine with 2 threads per core, so 16 hyper-threads.
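For anyone who wants to produce a table like the one above for their own code, here is a minimal sketch of how such "X" ratios can be computed with the standard library. It is not the test code behind my numbers: `loop_sum`, `builtin_sum`, `best_time` and `speedups` are made-up names for illustration, and the two variants are just stand-ins for whatever implementations you want to compare.

```python
import timeit

def loop_sum(n):
    """Baseline: plain Python loop summing squares."""
    total = 0
    for i in range(n):
        total += i * i
    return total

def builtin_sum(n):
    """Variant: same result using the built-in sum()."""
    return sum(i * i for i in range(n))

def best_time(func, *args, repeat=5, number=3):
    """Best-of-`repeat` time per call, in seconds."""
    runs = timeit.repeat(lambda: func(*args), repeat=repeat, number=number)
    return min(runs) / number

def speedups(variants, baseline, *args):
    """Map each variant name to baseline_time / variant_time."""
    times = {name: best_time(f, *args) for name, f in variants.items()}
    base = times[baseline]
    return {name: base / t for name, t in times.items()}

variants = {"loop": loop_sum, "builtin sum": builtin_sum}
table = speedups(variants, "loop", 50_000)

# Print slowest first, in the same style as the list above.
for name, ratio in sorted(table.items(), key=lambda kv: kv[1]):
    print(f"- {name} {ratio:.2f}X")
```

Taking the minimum over several repeats is a deliberate choice: it cuts down on noise from other processes, but the ratios will still move around between machines and runs.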
The deal is that to get the best speed you need to both parallelize and vectorize the code for your target hardware. Hand-crafted C with good compiler flags can do that, and so can Numba. It will be interesting to see what you find, but I think the GIL is often overemphasized. By the way, I think PyPy still has a GIL.
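To show roughly what I mean by "N way", here is a sketch of the pattern: cut the data into N chunks and hand each chunk to a thread that spends its time in native code. This is not my benchmark code. As a stand-in for a C kernel called through ctypes or CFFI, it uses `hashlib.sha256`, which, as far as I know from the hashlib docs, releases the GIL while hashing large buffers. `split`, `digest`, `run_serial` and `run_threaded` are hypothetical names.

```python
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor

def split(data, ways):
    """Cut data into at most `ways` contiguous chunks of nearly equal size."""
    step = max(1, -(-len(data) // ways))  # ceiling division, never zero
    return [data[i:i + step] for i in range(0, len(data), step)]

def digest(chunk):
    """Stand-in for a native kernel: the hashing runs in C."""
    h = hashlib.sha256()
    h.update(chunk)
    return h.hexdigest()

def run_serial(chunks):
    return [digest(c) for c in chunks]

def run_threaded(chunks, ways):
    # Threads only help here because the work happens outside the interpreter.
    with ThreadPoolExecutor(max_workers=ways) as pool:
        return list(pool.map(digest, chunks))

data = bytes(range(256)) * 4_000      # about 1 MB of sample input
ways = os.cpu_count() or 4            # e.g. 16 on the machine above
chunks = split(data, ways)

# Both paths must agree; only the wall-clock time should differ.
assert run_threaded(chunks, ways) == run_serial(chunks)
```

With a pure-Python `digest` the threaded version would gain nothing, which is the point: once the hot loop lives in C, the GIL matters much less.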
What I did not expect was that Cython was just more work than it was worth; it was easier to write C code. The default Numpy on my Linux distribution was terrible too. People rave about PyPy, and sure, it is way better than CPython, but it is still slow in the big scheme of things. Nuitka, well, it turns out to be more of a packaging tool than a speed tool.