ryan_green's comments

Exactly!


Apologies for not getting into more detail; I wanted to start by covering things at a high level. There are a few key concepts that might be helpful:

* data state - the contents of both your data and metadata at a given point in time. If your data doesn't fit into a single database, this can be difficult to manage. We use this technology to help us: https://lakefs.io/

* logical state - everything you use to process the data in your pipeline (i.e. code, config, info for connecting to external services, etc.). This can all reside in git.

We found the key was associating our logical state (git branch) with our data state (lakeFS branch). We make this association during our branch deployment process.
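To make the association concrete, here's a minimal sketch of deriving a matching lakeFS branch name from a git branch name during deployment. The helper and its sanitization rule are hypothetical, not our actual deployment code:

```python
import re

def lakefs_branch_for(git_branch: str) -> str:
    """Derive a lakeFS branch name from a git branch name.

    Branch names with characters like '/' (e.g. 'feature/foo') get
    sanitized to a safe character set. (Hypothetical rule, for
    illustration only.)
    """
    return re.sub(r"[^A-Za-z0-9_-]", "-", git_branch)

# During branch deployment, using the same name on both sides ties the
# logical state (git) to the data state (lakeFS), e.g.:
#   git checkout feature/new-ingest
#   lakectl branch create lakefs://repo/feature-new-ingest \
#       --source lakefs://repo/main
print(lakefs_branch_for("feature/new-ingest"))  # feature-new-ingest
```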

Let me know if this helps at all. I was planning to write a follow up post about what we learned about managing the logical state of a data pipeline. If you have suggestions for a different topic to dive into, I'd love to hear about it.


Sort of. The main issues we've found with each developer having their own DB on a complex data pipeline are:

1) if that DB contains petabytes of data, creating one for each developer is non-trivial from a time and cost perspective

2) the developer needs to develop and test multiple changes they want to isolate (think git branches)

3) the data state the developer is operating on gradually becomes stale, so results deviate from prod
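A toy copy-on-write sketch shows why branching sidesteps issue 1: a new branch shares the parent's data until a write happens, so nothing (let alone petabytes) gets copied. This is just the idea, not lakeFS internals:

```python
class Branch:
    """Toy copy-on-write branch over a key-value 'data state'."""

    def __init__(self, parent=None):
        self.parent = parent   # shared with the parent, never copied
        self.overlay = {}      # holds only this branch's own writes

    def get(self, key):
        if key in self.overlay:
            return self.overlay[key]
        if self.parent is not None:
            return self.parent.get(key)  # fall through to shared data
        raise KeyError(key)

    def put(self, key, value):
        self.overlay[key] = value  # write lands only on this branch

prod = Branch()
prod.put("table/users", "v1")

dev = Branch(parent=prod)      # O(1) "branch create"; no data copied
dev.put("table/users", "v2")   # isolated change, invisible to prod

print(prod.get("table/users"))  # v1
print(dev.get("table/users"))   # v2
```

Each developer (or each change, per issue 2) gets a cheap isolated branch, and re-branching from prod addresses the staleness in issue 3.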

Anyway, hope that perspective is helpful.


saltcured, I find these comments super insightful!

> Yeah, there's a lot of hidden magic/assumptions in having a "writable snapshot of a specific version" of production data.

That's absolutely a huge assumption. This technology has been a game changer for us: https://lakefs.io/

> It becomes a headache when there is too much contention to use these sandboxes, or too much manual effort to reset them to a desired testing state.

This is exactly the situation we were encountering.


NBJack, your point about the difficulty of managing external dependencies is well taken. That said, our data pipeline uses cloud storage and multiple external services, and the scenarios you're describing haven't materialized so far. We have found that we need to take extreme care in managing the logical state of the data pipeline (e.g. ensuring that we use explicit versions of external services). And we can certainly end up in trouble if an external service provider violates their API contract, so I don't think this is a replacement for a strong data testing regimen, which is what would hopefully help us if that occurred. I also think you can encounter these same issues if you go the dev/stage/prod route. Curious to get your thoughts.
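One lightweight way to keep external service versions explicit is to pin them in config that lives in the repo, so the logical state captures them. A hedged sketch; the service names, versions, and URL scheme are all made up for illustration:

```python
# Pinned versions of external services, checked into git alongside the
# pipeline code so they are part of the logical state. (All names and
# versions here are hypothetical.)
PINNED_SERVICES = {
    "geocoding-api": "2024-01-15",
    "enrichment-service": "3.2.1",
}

def service_url(name: str) -> str:
    """Build a version-pinned endpoint; fail loudly on unpinned services."""
    if name not in PINNED_SERVICES:
        raise ValueError(f"service {name!r} has no pinned version")
    return f"https://{name}.example.com/{PINNED_SERVICES[name]}"

print(service_url("geocoding-api"))
# https://geocoding-api.example.com/2024-01-15
```

Failing loudly on an unpinned service keeps an implicit "latest" dependency from sneaking into the pipeline; it doesn't protect against a provider breaking their API contract, which is where the data testing regimen comes in.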


Believe me, I'm not claiming that the dev/stage/prod pattern is any kind of cure-all. It has its own problems, which are probably too numerous for a late-night post.

From what you've described, you're doing the right thing for you and your team. Keep it simple as long as you possibly can. I can only advise you to just keep the goal of balancing the time needed to maintain your approach vs. the return you get from it.

The key advantage of the dev/stage/prod approach only appears at sufficient scale and with proper discipline among teams, each maintaining their own version of their product at the dev and stage points. This has plenty of headaches, but you at least get a chance to exercise your work in something as close to production as possible without actually being there. It tends to work 'best' when you only start holding other teams accountable at the stage point.

Cloud dependencies are where I've seen things get the weirdest and most volatile. There are all kinds of limitations that can crop up even if you try to maintain the highest level of separation and discipline.

For example, did you know that AWS limits a single account to no more than 5 Elastic IP addresses, and that there's an upper limit to how many Elastic Network Interfaces can be held in a region? [1] It sounds stupid, but I've actually seen these limits hit even after politely asking AWS to make them as large as possible; keeping developers empowered to deploy their own, compartmentalized version of the product became a real pain.

[1] https://docs.aws.amazon.com/vpc/latest/userguide/amazon-vpc-...


Totally agree. And creating ephemeral environments for data pipelines is quite a bit more challenging than for systems with a less complicated data state. Nonetheless, this has already paid off for us many times over.

