Unless you are doing machine learning or using numpy, I do not recommend using Python for anything performance sensitive. The problem is not just the GIL. Because multithreading is not common in Python, it's really hard to know whether some external library is thread-safe. Python also supports async, but a lot of libraries have no asyncio compatibility, so you end up mixing threads with asyncio, which leads to a big mess.
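To be fair, the stdlib does give you a standard bridge for the threads-plus-asyncio mix: `asyncio.to_thread` (Python 3.9+) pushes a blocking library call onto a worker thread so the event loop stays responsive. A minimal sketch, with `blocking_call` standing in for any synchronous library function:

```python
import asyncio
import time

def blocking_call(x):
    # Stand-in for a library call with no asyncio support
    # (e.g. a synchronous database driver).
    time.sleep(0.1)
    return x * 2

async def main():
    # asyncio.to_thread runs each blocking call in a worker thread,
    # so the three calls overlap instead of blocking the event loop.
    return await asyncio.gather(
        *(asyncio.to_thread(blocking_call, i) for i in range(3))
    )

if __name__ == "__main__":
    print(asyncio.run(main()))  # [0, 2, 4]
```

It works, but you're still juggling two concurrency models in one program, which is the mess being complained about.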
> I do not recommend anyone use python for anything performance sensitive
My default philosophy is to use python _until_ you find something that is performance sensitive, and then make a C/C++ extension for the slow bits. Pybind works great for a hybrid Python/C++ codebase (https://pybind11.readthedocs.io/en/stable/).
Then you can develop and prototype much quicker w/ Python but re-write the slow parts in C++.
Definitely more of a judgement call when threading and function call overhead enter the equation, but I've found this hybrid "99% of the time Python, 1% C++ when needed" setup works great. And it's typically easier for me to eventually port mature code to C++/Go/etc once I've fleshed it out in Python and hit all the design snags.
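Step zero of that workflow is finding the 1% worth porting. A stdlib sketch with `cProfile` (function names here are hypothetical, just to show the shape):

```python
import cProfile
import io
import pstats

def slow_inner(n):
    # Hypothetical hot spot: the kind of tight loop you would
    # later move to a C++ extension.
    return sum(i * i for i in range(n))

def pipeline():
    # Hypothetical driver: mostly cheap control logic plus one hot call.
    total = 0
    for _ in range(50):
        total += slow_inner(10_000)
    return total

profiler = cProfile.Profile()
result = profiler.runcall(pipeline)

stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())  # slow_inner should dominate cumulative time
```

If the profile shows one function eating the runtime, that's your pybind candidate; if the time is smeared evenly across everything, a rewrite of just the "slow bits" won't save you.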
If you've never used Pybind before these pybind tests[1] and this repo[2] have good examples you can crib to get started (in addition to the docs). Once you handle passing/returning/creating the main data types (list, tuple, dict, set, numpy array) the first time, then it's mostly smooth sailing.
Pybind offers a lot of functionality, but the core "good parts" I've found useful are (a) using a numpy array in Python and passing it to a C++ method to work on, (b) passing your Python data structure to pybind and doing the work on it in C++ (some copy overhead), and (c) making a class/struct in C++ and exposing it to Python (no copying overhead, and you can create nice cache-aware structs, etc.).
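A real pybind demo needs a C++ toolchain, but the same pattern (a) describes, handing a contiguous array to native code and letting it do the work, can be sketched with nothing but the stdlib's `ctypes` and libc's `qsort`. This is not pybind, just the cheapest possible illustration of the Python-to-native boundary; it assumes a POSIX platform where `CDLL(None)` exposes libc symbols:

```python
import ctypes

# Load the process's own symbol table; on Linux/macOS this exposes
# libc functions such as qsort (this trick won't work on Windows).
libc = ctypes.CDLL(None)

# Signature of the C comparator: int (*)(const void *, const void *)
CmpFunc = ctypes.CFUNCTYPE(
    ctypes.c_int, ctypes.POINTER(ctypes.c_int), ctypes.POINTER(ctypes.c_int)
)

def py_cmp(a, b):
    # Python callback invoked from C once per comparison.
    return a[0] - b[0]

# A contiguous C array of ints -- no boxing, no per-element objects.
values = (ctypes.c_int * 5)(5, 1, 7, 33, 99)
libc.qsort(values, len(values), ctypes.sizeof(ctypes.c_int), CmpFunc(py_cmp))
print(list(values))  # [1, 5, 7, 33, 99]
```

Pybind gives you the same boundary with far nicer ergonomics (automatic conversions for list/tuple/dict/set, numpy buffer protocol support), which is why it's worth the build setup for anything non-trivial.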
You're still kinda stuck with the concurrency of the Python code itself, though. It sure would be nice to be able to just throw cores at problems for a while.
Sure, and if you can't get stuff in and out of Python objects with concurrency, it doesn't help you much a lot of the time. Plus, again, computing is cheap: it'd be nice to use all my cores before I spend a lot of effort optimizing and rewriting things in native code.
If your data is fragmented across a bunch of small containers/classes, passing it around will be expensive whichever method you use (either passing it to C++, or just in terms of cache efficiency).
If you just pass an array of data back and forth it's cheap.
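The difference is visible with nothing but the stdlib: a list of boxed ints is a pile of separate heap objects, while `array.array` is one contiguous buffer that a `memoryview` can slice without copying (the same zero-copy idea that makes handing a numpy array across the C++ boundary cheap). A small sketch:

```python
import array
import sys

n = 100_000

# Fragmented: a list of boxed ints -- each element is a separate heap
# object, so iteration chases pointers and copies are expensive.
boxed = list(range(n))

# Flat: array.array("q") stores raw 8-byte machine ints contiguously.
flat = array.array("q", range(n))

list_bytes = sys.getsizeof(boxed) + sum(sys.getsizeof(x) for x in boxed)
flat_bytes = sys.getsizeof(flat)
print(f"boxed list: ~{list_bytes} bytes, flat array: ~{flat_bytes} bytes")

# A memoryview slices the underlying buffer without copying anything.
view = memoryview(flat)[10:20]
print(view.tolist())  # [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
```

The boxed version costs several times the memory and none of it is contiguous, which is exactly the "fragmented across a bunch of small containers" case above.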
> If you just pass an array of data back and forth it's cheap.
Yes, and numpy is great, and all. Python works great as glue to marshal things to and from native code and do inexpensive (but possibly complicated) bits of control logic.
But if I'm trying to deal with large numbers of client requests, say... the lack of concurrency in python itself really hurts. Sure, I can punt almost everything to native code, but what's the point in having Python at all, then?
Not all problems have state that can be shared well across multiprocessing or completely externalized to large lumps that travel to native code in a few calls -- I'd actually say these are more the exception than the rule.
> Note that having a solution setup where the end result is "a ton of small, individual API calls" could possibly indicate a bad system architecture.
Or just a lot of clients with a fair bit of shared state which is best kept resident, which is a pretty common use case.
It's a bummer to write Python code that works well, then max out at 130% CPU load as your usage grows, with no obvious path to scale upward despite having 32 hardware threads available. At that point you can rewrite some of the more expensive parts in native code to squeeze out a little more performance, or add indirection to store the data somewhere else so multiprocessing works.
Other languages with more fine-grained locking scale 3-4x higher with minimal thought, and much, much higher with a bit of thought about how to handle locking and the data model.
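For completeness, the standard way to throw cores at CPU-bound Python today is process-based parallelism, where each worker has its own interpreter and its own GIL. A minimal sketch with `concurrent.futures`:

```python
import os
from concurrent.futures import ProcessPoolExecutor

def cpu_bound(n):
    # Pure-Python CPU-bound work: threads would serialize on the GIL,
    # but each worker process here has its own interpreter and GIL.
    return sum(i * i for i in range(n))

if __name__ == "__main__":
    inputs = [100_000] * 8
    with ProcessPoolExecutor(max_workers=os.cpu_count()) as pool:
        results = list(pool.map(cpu_bound, inputs))
    print(len(results), "tasks done")
```

The catch is the one already raised above: arguments and results are pickled across the process boundary, so shared mutable state, or data fragmented across many small objects, makes this painful fast.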
> At that point you'd look to Go or another language
Well, yah... this is us complaining about Python's concurrency problems.
I think the question is how much more it costs to move the code from Python to C/C++/Rust/whatever. That's a human problem, until ChatGPT can solve it for you.
And if you are using, for example, Numpy, you aren't using Python itself for anything performance sensitive, because Numpy is almost certainly calling the system's tuned BLAS implementation, which should handle the parallelism, I guess. If anything, I'd expect parallel Python calls to Numpy to result in oversubscription…
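The usual fix for that oversubscription is to cap the BLAS library's own thread pool before parallelizing at the Python level. Which environment variable matters depends on the BLAS backend numpy was built against, and it must be set before numpy (and hence the BLAS library) is first imported; the sketch below sets the common ones and deliberately stops short of the numpy import itself:

```python
import os

# Cap native BLAS threading *before* numpy is imported; once the BLAS
# library initializes, changing these has no effect. Which one matters
# depends on the backend numpy was built against:
for var in (
    "OMP_NUM_THREADS",       # OpenMP-based BLAS builds (and much else)
    "MKL_NUM_THREADS",       # Intel MKL
    "OPENBLAS_NUM_THREADS",  # OpenBLAS
):
    os.environ[var] = "1"

# With BLAS pinned to one thread per process, you can parallelize at
# the Python level (e.g. one process per core) without oversubscribing.
# import numpy as np   # <- must come after the env vars are set
print(os.environ["MKL_NUM_THREADS"])  # 1
```

If you need to change limits at runtime instead, the third-party `threadpoolctl` package can adjust an already-loaded BLAS, but the env-var approach is the dependency-free version.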
I don't think numpy functions are necessarily multithreaded, and many are probably inherently sequential by nature, so there are definitely cases where multiprocessing can speed up the overall program.
Someone once said that Python + numpy is probably going to be faster than writing the same thing in plain C++, since numpy uses highly tuned libraries underneath.
I don't know for certain this is the case, but I'd like to see some benchmarks about it.
You would almost never use raw C++ when working with linear algebra stuff. You use a library like Eigen that interfaces with BLAS, LAPACK, etc., so you definitely get all the advantages of those highly tuned libraries, plus the speed of C++ and potential flexibility of not having to make multiple array copies and so on.
They aren’t necessarily threaded, but if you care about Numpy performance on an Intel chip, at least, you are already using MKL as Numpy’s BLAS, and MKL’s gemm is threaded.
Multiplying a large enough matrix in Python using MKL for Numpy, I can watch the cpu usage go to 400% in top. You may need to run it in a loop or make the matrices quite large, a surprisingly large amount of computation has to happen before it’ll show up in top.