Hacker News

I've been using Hadoop for 4 years now (author of hadoopy.com), so I'll chime in. To make sure we're on the same page, these are the use cases Hadoop/MapReduce (and, to a close approximation, the ecosystem around them) were designed for: 1.) saving developer time at the expense of inefficiency (compared to custom systems), 2.) really huge data (several petabytes), 3.) unstructured data (e.g., webpages), 4.) fault tolerance, 5.) shared cluster resources, and 6.) horizontal scalability. Basically, people already had all that and wanted easier queries, so the ecosystem has been pulled in that direction for the second generation: 1.) Pig/Hive and 2.) Impala and others.

Of the 6 design considerations I listed, none are really addressed here. If you outgrow a single GPU, you face a huge performance penalty as you grow (that's vertical scaling). And if you want to write your own operations (very common), this would be impractical.

It's a nice idea, but it would be better to compare against systems like MemSQL, which were designed from first principles for fast SQL processing. I'd recommend just dropping the Hadoop/HBase comparisons and comparing within the same class; Hive is embarrassingly slow even in its own class (compare it to Google's Dremel/F1 or Apache Impala).



Comparing against Impala doesn't change anything; Impala is still well behind, even on a cluster with just 600 million records: https://docs.google.com/spreadsheet/ccc?key=0AgQ09vI0R_wIdEV...

Your other considerations are still valid, though. That said, the point was to show how inefficient Hadoop/MapReduce is at relational operations.
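To make that inefficiency concrete, here's a minimal sketch (plain Python, not an actual Hadoop job; the table names and data are made up for illustration) of a reduce-side equi-join, the standard way to express a relational join in MapReduce. Every row of both tables has to be re-keyed, shuffled, and grouped, which on a real cluster means a full pass over both tables plus a network shuffle, work a relational engine can often avoid with indexes or a hash join.

```python
from collections import defaultdict

users = [(1, "alice"), (2, "bob")]               # (user_id, name)
orders = [(1, "book"), (1, "pen"), (2, "mug")]   # (user_id, item)

# Map phase: tag each record with its source table and emit it under the join key.
mapped = [(uid, ("U", name)) for uid, name in users] + \
         [(uid, ("O", item)) for uid, item in orders]

# Shuffle phase: group every record by key (the expensive network step on a cluster).
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce phase: cross the two sides within each key group.
joined = []
for uid, values in sorted(groups.items()):
    names = [v for tag, v in values if tag == "U"]
    items = [v for tag, v in values if tag == "O"]
    joined.extend((uid, n, i) for n in names for i in items)

print(joined)  # [(1, 'alice', 'book'), (1, 'alice', 'pen'), (2, 'bob', 'mug')]
```

Note that the shuffle moves every byte of both tables regardless of selectivity, which is why systems built for SQL (Dremel, Impala, MemSQL) look so much faster on relational workloads.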




