Agreed, not completely on DuckDB, but we used it for consolidating billing data from 10+ ERP systems and it works, so I see his point. Just to add to his points:
- Integrations are still one of the hardest things in enterprise IT. Snowflake/Databricks/etc. in fact add to the number of systems to integrate; most of the time they make this problem worse.
- Governance in a self-service data ecosystem gets complicated fast, especially if you need to stay compliant with data privacy regulations like the GDPR. And amazingly, neither Snowflake nor Databricks solves this. In fact, they make it worse by pulling budget away from governance initiatives.
Connectivity was the hardest part; we had to write Python connectors for a variety of ERP systems. We have a single-server setup where we run DuckDB and Python scripts. We monitor and orchestrate with Prefect, but you could use any of the many Python orchestration tools. We load the finished data marts into MS SQL Server, and users connect to it via Power BI or Excel.
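For anyone curious, the shape of such a single-server pipeline can be sketched like this. Everything below is illustrative, not their actual code: the function bodies are placeholders standing in for the real per-ERP connectors, the DuckDB transformations, and the MSSQL bulk load.

```python
# Hypothetical sketch of the extract -> transform -> load shape described
# above. Names and bodies are invented for illustration only.

def extract_erp(system: str) -> list[dict]:
    # Placeholder for a per-ERP connector (REST API, ODBC, file drop, ...).
    return [{"system": system, "amount": 100.0}]

def transform(rows: list[dict]) -> list[dict]:
    # Placeholder for the DuckDB-based transformation into mart tables.
    return [r for r in rows if r["amount"] > 0]

def load_to_mssql(rows: list[dict]) -> int:
    # Placeholder for a bulk insert into SQL Server.
    return len(rows)

def run_pipeline(systems: list[str]) -> int:
    # An orchestrator like Prefect would wrap each step with retries,
    # scheduling, and logging; a plain loop shows the data flow.
    loaded = 0
    for system in systems:
        loaded += load_to_mssql(transform(extract_erp(system)))
    return loaded
```

The orchestrator mostly buys you retries, scheduling, and observability around these steps; the data flow itself stays this simple.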
I'm sure DuckDB was used to transform/prepare the data from the ERP extract format into an MSSQL ingestion format.
There are plenty of reasons why you would use DuckDB for this, especially if you're preparing the data for analytical/OLAP use cases.
Perhaps a more relevant question is why they didn't use Data Factory or some other ETL tool/service. DuckDB is rising to the occasion for these kinds of use cases, though.
Let me give you some bullets on Data Factory (DF), because it's a question I get a lot.
- DF is quite hard to operationalize and its logging is not great; Python stack traces are easier to debug, and Python logging can be as detailed as needed.
- DF lacks connectors for data ingestion. This is easy in Python, where a custom connector takes a week or two to develop on average.
- DF is not pipelines-as-code, and it is becoming really hard to manage governance and change management on UI-based ETL tools.
- It is hard to enforce best practices in DF. We are finding it easy to enforce standard ways of writing and managing SQL models and metadata with a combination of dbt and a dbt-ready data catalog.