...but why?

Just use airflow.

Things I want in an ETL:

[x] works at scale.

[x] simple to use.

[x] not written in python (e.g. in go or rust)

[x] easy to scale (e.g. in docker)

[ ] this.



https://github.com/thbar/kiba This is still my workhorse and has never let me down. I do a ton of ETL.

https://github.com/thbar/kiba-ex This looks interesting though.


Kiba author here - your comment made my day, so thanks!


:D Thank you!!!


There are a few pipeline tools/frameworks written in Go: http://gopherdata.io/post/more_go_based_workflow_tools_in_bi...

(Full disclosure: I'm developing SciPipe)


[ ] has a way to test data connections before trying to use them

[ ] has a clean way to do releases

[ ] can handle more complex network setups (like found in docker envs)


Python is more of a glue in this case. Most operations are basically unix pipes pushing data in and out of Postgres.
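
A minimal sketch of this "Python as glue" pattern: the heavy lifting happens in unix processes, and Python just wires the pipes together. In a real ETL the last stage might be something like `psql -c "\copy ..."`; here a generic `sort | uniq -c` chain stands in so the sketch is runnable (the function name and shape are illustrative, not from any particular tool).

```python
import subprocess

def run_pipeline(data, stages):
    """Feed `data` (bytes) through a chain of shell commands; return the final stdout."""
    procs = []
    for i, argv in enumerate(stages):
        p = subprocess.Popen(
            argv,
            stdin=subprocess.PIPE if i == 0 else procs[-1].stdout,
            stdout=subprocess.PIPE,
        )
        procs.append(p)
    procs[0].stdin.write(data)
    procs[0].stdin.close()
    out = procs[-1].stdout.read()
    for p in procs:
        p.wait()
    return out

result = run_pipeline(b"b\na\nb\n", [["sort"], ["uniq", "-c"]])
```

Python only stages the data and checks exit codes; throughput is whatever the underlying tools can do.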


Go is nice, Rust is interesting, and Python is fine.


python is slow and single-threaded, which is a tangible problem for an ETL, and for airflow specifically.

...airflow is good enough for other reasons, but that doesn’t mean ‘not python’ gets off the wish list.


Generally the workflow framework ought to be doing very little compared to the jobs it orchestrates and executes. In common use cases, airflow is often little more than a (distributed) executor/scheduler for bash commands.

Wouldn't the value of optimization at the framework level generally be... very little?


Not always. When the workflow tool is a central scheduling point for thousands of tasks fired off to other compute nodes (via some distribution mechanism), you can start to experience pain.

There is also a problem with python in that it does not do thread-based concurrency, but only process-based concurrency, where the processes need to communicate with each other in a less robust way, e.g. via HTTP.

Because of this, in our work (scheduling machine learning workflows on HPC clusters), we started to see lost HTTP connections between workers with Luigi when going above 64 workers, even though these workers were doing nothing other than keeping track of a job running on another compute node.

This led us to use Go instead, with SciPipe, and it has none of these problems. Go being compiled also means we're catching a lot more stupid typos and such at compile time rather than at run time (like 7 days into a 7-day HPC job), which also makes our life easier.

Python is definitely quite a lot easier to write and read, but the robustness of compiled Go code is hard to beat, and I don't wish to go back to a python-based tool.


> There is also a problem with python in that it does not do thread-based concurrency, but only process-based concurrency, where the processes need to communicate with each other in a less robust way, e.g. via HTTP.

this is half true, and importantly, in a workflow tool the fact that python is perfectly capable of being multithreaded while waiting on IO (which releases the GIL) should be quite sufficient.

there are problems with python, like its memory usage, and with airflow specifically, like its at-times-flaky scheduler (maybe that's fixed in newer versions), but multithreading shouldn't be one of them in this particular context.
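
The GIL point is easy to demonstrate. In this sketch, `time.sleep` stands in for a blocking network or database call; because blocking calls release the GIL, eight threads each "waiting on IO" for 0.2s finish in roughly 0.2s of wall time, not 1.6s:

```python
import threading
import time

def wait_on_io(results, i):
    # time.sleep stands in for a blocking network/db call; while a thread
    # blocks here, CPython releases the GIL and the other threads proceed.
    time.sleep(0.2)
    results[i] = i

results = {}
threads = [threading.Thread(target=wait_on_io, args=(results, i)) for i in range(8)]
start = time.monotonic()
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.monotonic() - start
# the eight 0.2s waits overlap, so elapsed is close to 0.2s
```

This is exactly the workload of a workflow tool babysitting remote jobs: threads spend almost all their time blocked on IO, where the GIL doesn't bite.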


The scheduler has seen a lot of improvements in the past couple of releases, and release 1.10 should be coming out soon. (full disclosure: I work on Airflow)


Great points. Scaling indefinitely will inevitably expose inefficiencies, even tiny ones. And even tiny inefficiencies will place a hard limit on your ability to scale.


python in airflow is just used to define your graph and trigger your jobs. I think it's a wonderful language to do this.

You can (and should) program your batch jobs in languages other than python if you have large amounts of data, of course.


Performance issues can be overcome with Cython or PyPy.

"Python is slow."

What do you mean by "slow" and what have you done to make Python code run faster? I find for most things Python is fast enough or performance can be improved to meet most compute and data demands.

In the future we will all be using Javascript. Language performance debates are moot.
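
One rough illustration of the "fast enough or can be improved" claim above, before even reaching for Cython or PyPy: push hot loops into C-implemented builtins. The numbers are machine-dependent, but the builtin reliably beats the interpreted loop (this micro-benchmark is my own sketch, not from the thread):

```python
import timeit

def py_sum(n):
    # a pure-Python loop: every iteration goes through the interpreter
    total = 0
    for i in range(n):
        total += i
    return total

n = 100_000
# both compute the same triangular number
assert py_sum(n) == sum(range(n)) == n * (n - 1) // 2

loop_time = timeit.timeit(lambda: py_sum(n), number=20)
builtin_time = timeit.timeit(lambda: sum(range(n)), number=20)
# sum(range(n)) runs the loop in C, typically several times faster
```

Cython and PyPy extend the same idea: move the per-iteration work out of the bytecode interpreter.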


some other languages (e.g. java) don't need much nontrivial effort from you to run fast.


Are you not using celery with airflow?


You hit my wishlist almost exactly. A strongly typed, non-Java ETL framework would be my dream.

Plus, "works at scale" and "easy to scale" almost become non-issues, because we wouldn't be choked by the slowness and bloat of python.



> ...but why?

You ask this generally, but then you give a bunch of specific reasons why you don't want to use it.


yes, but it doesn't do any of the things on the list.


> [x] not written in python (e.g. in go or rust)

Don't you need some kind of scripting capability, though, to create custom pipelines without having to recompile the whole app each time you add a new one?



