Generally the workflow framework ought to be doing very little compared to the jobs it orchestrates and executes. In the common use cases, Airflow is often little more than a (distributed) executor/scheduler for bash commands.
Wouldn't the value of optimization at the framework level generally be... very small?
Not always. When the workflow tool is a central scheduling point for thousands of tasks fired off to other compute nodes (via some distribution mechanism), you can start to experience pain.
There is also the problem that Python does not do thread-based concurrency, only process-based concurrency, where the processes need to communicate with each other in a less robust way, e.g. via HTTP.
Because of this, in our work (scheduling machine learning workflows on HPC clusters), we started seeing lost HTTP connections between workers with Luigi when going over 64 workers, even though these workers were doing nothing other than keeping track of a job running on another compute node.
This led us to use Go instead, with SciPipe, and it has none of these problems. Go being compiled also means we're catching a lot more stupid typos and such at compile time rather than at run time (like 7 days into a 7-day HPC job), which also makes our life easier.
Python is definitely quite a lot easier to write and read, but the robustness of compiled Go code is hard to beat, and I don't wish to go back to a Python-based tool.
> There is also the problem that Python does not do thread-based concurrency, only process-based concurrency, where the processes need to communicate with each other in a less robust way, e.g. via HTTP.
This is only half true, and importantly, in a workflow tool, Python is perfectly capable of being multithreaded while waiting on IO (which releases the GIL), and that should be quite sufficient.
There are problems with Python, like its memory usage, and with Airflow specifically, like its at-times-flaky scheduler (maybe that's fixed in newer versions), but multithreading shouldn't be one of them in this particular context.
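A minimal sketch of the GIL point above: CPython releases the GIL during blocking IO, so a thread pool can overlap many waiting "workers" just fine. Here a `time.sleep` stands in for the real network or filesystem wait (the function name and timings are hypothetical, for illustration only):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def wait_on_external_job(job_id):
    # Hypothetical stand-in for polling a remote compute node;
    # blocking calls like this release the GIL while waiting.
    time.sleep(0.2)
    return job_id

start = time.monotonic()
with ThreadPoolExecutor(max_workers=64) as pool:
    results = list(pool.map(wait_on_external_job, range(64)))
elapsed = time.monotonic() - start

# 64 tasks of 0.2 s each complete in roughly 0.2 s wall time,
# not ~13 s, because the threads overlap their IO waits.
print(len(results), round(elapsed, 1))
```

The threads here do no real CPU work, which is exactly the situation of a scheduler tracking jobs running elsewhere; the GIL only becomes a bottleneck when the threads themselves are CPU-bound.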
The scheduler has seen a lot of improvements in the past couple of releases, and release 1.10 should be coming out soon. (Full disclosure: I work on Airflow.)
Great points. Scaling indefinitely will inevitably expose inefficiencies, even tiny ones. And even tiny inefficiencies will place a hard limit on your ability to scale.
Performance issues can be overcome with Cython or PyPy.
"Python is slow."
What do you mean by "slow", and what have you done to make Python code run faster? I find that for most things Python is fast enough, or performance can be improved to meet most compute and data demands.
In the future we will all be using Javascript. Language performance debates are moot.
Don't you need some kind of scripting capability, though, to create custom pipelines without having to recompile the whole app each time you add a new one?
Just use airflow.
Things I want in an ETL:
[x] works at scale.
[x] simple to use.
[x] not written in python (e.g. in go or rust)
[x] easy to scale (e.g. in docker)
[ ] this.