The article is a nice summary. I think the author missed a key argument for his multi-model case, though: AWS costs. Short queries via Spark/Hadoop will cost more on AWS than a focused graph model on a dedicated graph DB / multi-model DB.
Thanks for the article. Seems like a good approach to combine the strengths of both graph & Hadoop. Wonder which other use cases, in addition to the described ones, could be suitable here.
We get good visibility into what folks do in practice based on their use of Graphistry: we're a DB-agnostic, scalable visual graph analytics environment, so we've been seeing (& assisting) what analysts do standalone, what developers build, and what data scientists do from notebooks.
1. Most graphs are small (< 100M nodes & edges, probably even < 1M). So analysts just load a CSV directly into us, or dump it into pandas and work from there. Most Graphistry users do this. It became so common that we baked a "hypergraph" transform into our library that shortcuts the data wrangling of turning SQL/CSV records into a node table + edge table.
2. Sometimes the data is too big, or they want to use a query language they're more comfortable with than pandas. We'll see a bunch of SQL (incl. Spark), Splunk, Elastic, etc. when approach #1 isn't enough -- no need for Neo4j/Titan/GraphX for that problem. If they end up doing this a lot, Neo4j becomes a sensible choice because of the ergonomics of the Cypher query language.
3. Sometimes, graph queries or analytics _are_ technically critical. We'll mostly see analytics via NetworkX or maybe iGraph, such as for slightly better community detection, or something smarter than degrees for node sizes. Sometimes we'll see graph query languages, probably Neo4j because (I'm guessing) the database is packaged accessibly. For ergonomic reasons, I've been expecting the OpenCypher-on-Spark efforts to eventually supplant GraphX for the exploratory case, and that we'll start seeing more Janus as it gains steam.
4. Even more occasionally, people are building true graph algorithms that cannot be sufficiently approximated with their existing tools. E.g., we're seeing a bunch more in the knowledge graph space (ex: finance), and in security/fraud, where the bigger enterprises need the same for correlation work. This gets into powering latency-sensitive ML / detection algorithms, fast analyst experiences, etc. However, stuff like regular SQL & Splunk & Spark still gets _most_ teams most of the way there with great scale-out etc., so there's a bit of a problem/time/budget/expertise tradeoff going on.
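For the curious, the records -> node table + edge table shortcut mentioned in point 1 can be sketched in a few lines of pandas. This is an illustrative toy version (hypothetical function name and columns, not Graphistry's actual implementation): each event row becomes a hypernode, and each entity value in the chosen columns becomes a node linked to its event.

```python
import pandas as pd

def hypergraph_like(events: pd.DataFrame, entity_cols: list):
    """Toy hypergraph-style transform: flat records -> node + edge tables."""
    events = events.reset_index().rename(columns={"index": "event_id"})
    # One edge per (event, entity-value) pair, via a melt to long form
    edges = events.melt(
        id_vars="event_id", value_vars=entity_cols,
        var_name="entity_type", value_name="entity",
    ).dropna(subset=["entity"])
    # Node table: one row per distinct entity value
    nodes = edges[["entity", "entity_type"]].drop_duplicates()
    return nodes, edges

logs = pd.DataFrame({
    "src_ip": ["10.0.0.1", "10.0.0.2"],
    "dst_ip": ["10.0.0.9", "10.0.0.9"],
    "user":   ["alice", "bob"],
})
nodes, edges = hypergraph_like(logs, ["src_ip", "dst_ip", "user"])
print(len(nodes), len(edges))  # 5 distinct entities, 6 event-entity edges
```

The point is that the wrangling is mechanical, which is exactly why baking it into a library call makes sense.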
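For point 3's "something smarter than degrees for node sizes", a centrality score is the usual move. A rough NetworkX sketch (toy edge list assumed; the scaling constants are arbitrary):

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Toy edge list standing in for an extracted graph
G = nx.Graph([("a", "b"), ("b", "c"), ("c", "a"), ("c", "d"), ("d", "e")])

# PageRank as a node-size metric: weights a node by the importance
# of its neighbors, not just their raw count (degree)
pr = nx.pagerank(G)
sizes = {n: 10 + 200 * score for n, score in pr.items()}

# Simple community detection, e.g. for coloring nodes
communities = greedy_modularity_communities(G)
```

Swapping in betweenness or eigenvector centrality is a one-line change, which is why analysts reach for NetworkX before standing up a graph DB.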
We've been happy to support all these kinds of projects at Graphistry -- and are often part of the entry into them -- so we're always happy to chat about it. Likewise, I'm not listing work by good teams like those at DataStax Graph, Blazegraph, and Amazon Neptune -- we see them, they're just used more in specific enterprise/federal scenarios.
Author of the posted article here: thanks for the additional pointers. It seems that Graphistry excels at visualization. Essentially, your offering confirms the main story of the article: make more out of your (graph) data by extracting it from Hadoop into a different tool.
And obviously, one should use the right tool for the purpose. I think Graphistry is a good choice for graph visualization, graph databases like ArangoDB or Neo4j will be good at ad hoc traversals, and multi-model databases like ArangoDB or OrientDB will be good at a wide range of ad hoc queries. Anyway, thanks again for the pointers.
Yep. Maybe the observation is (1) data has gravity -- it was originally in another, non-graph-specific DB -- and (2) the graph-structured part is normally small. So we indeed see a lot of extraction into easier-to-use systems.
The nuance being... with stuff like data science notebooks and pandas, the people skilled enough to do the extraction are also skilled enough that it's easier to just use pandas. The exception is repeat work, or when it's for regular analysts. Friendly query languages like Neo4j's Cypher help there. Not sure what Arango supports... Gremlin? Proprietary?
Graphistry's environment is agnostic, and _not_ a database, so it'd be wrong of me to advocate that teams drop their system of record and just use us ;-) We ended up building a visual "playbook" investigation environment to help teams streamline these scenarios. They run visual playbooks against their legacy DB (Splunk, Elastic, SQL, ...) for faux-graph queries, or against their new graph DB for deeper ones (e.g., path queries). So we're more of a "system of record + superpowers" for your investigations, kind of like a smarter version of what Tableau/Looker do for SQL.
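To make the "faux-graph queries against a legacy DB" idea concrete: a fixed-depth traversal is just a self-join on an edge table, which is why plain SQL gets teams surprisingly far. An illustrative sqlite sketch (real deployments would hit Splunk/Elastic/warehouse SQL instead):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE edges (src TEXT, dst TEXT)")
con.executemany(
    "INSERT INTO edges VALUES (?, ?)",
    [("a", "b"), ("b", "c"), ("b", "d"), ("c", "e")],
)

# "Faux-graph" 2-hop neighborhood of 'a': one self-join per hop.
# Fine at fixed depth; variable-length paths are where a real graph
# query language (Cypher, Gremlin) starts to pay off.
two_hop = con.execute("""
    SELECT DISTINCT e2.dst
    FROM edges e1 JOIN edges e2 ON e1.dst = e2.src
    WHERE e1.src = 'a'
""").fetchall()
print(sorted(r[0] for r in two_hop))  # ['c', 'd']
```

Each extra hop is another join, so the SQL gets uglier while the Cypher equivalent stays one pattern -- that's the ergonomics tradeoff the thread keeps circling.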