Processing 6 billion records in 4 seconds on a home PC (github.com/antonmks)
96 points by rck on Sept 30, 2013 | 25 comments


Hm. I'm looking at this, and "6 billion in 4 seconds" seems misleading - the test it appears to refer to is "Query 6", which (a) only examines records from 1994, (b) on a table with entries sorted by timestamp, such that (c) only the portions of the data that fall in the correct time range are actually sent for processing.

In other words, it's not really looking at a full 6 billion records for that query. More representative would be the next query discussed, "Query 1", which takes 72 seconds to look over a much more significant portion of those 6 billion records.

It's still a pretty impressive set of numbers (as one would expect from GPU SIMD processing), but it irks me when short descriptions bend the facts to sound more significant. (To say nothing of the disk time subtraction.)
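The pruning effect described in (a)-(c) above is easy to sketch: with a timestamp-sorted table and per-block min/max metadata (zone maps), most blocks can be skipped without ever being read. The block size and toy data below are illustrative, not from the repo.

```python
# Sketch: zone-map pruning over a timestamp-sorted column.
# Block size and toy "timestamps" are illustrative assumptions.

def build_zone_maps(sorted_keys, block_size):
    """Record (start_offset, min, max) per block of a sorted column."""
    maps = []
    for i in range(0, len(sorted_keys), block_size):
        block = sorted_keys[i:i + block_size]
        maps.append((i, block[0], block[-1]))  # sorted, so ends are min/max
    return maps

def blocks_to_scan(zone_maps, lo, hi):
    """Only blocks whose [min, max] range overlaps [lo, hi] need reading."""
    return [start for start, bmin, bmax in zone_maps
            if bmax >= lo and bmin <= hi]

years = sorted(1992 + (i % 7) for i in range(7000))  # toy year column
zm = build_zone_maps(years, block_size=1000)
survivors = blocks_to_scan(zm, 1994, 1994)
print(f"scanning {len(survivors)} of {len(zm)} blocks")  # → scanning 1 of 7 blocks
```

A query restricted to 1994 touches only the one block that can contain 1994, which is why the "6 billion records" figure overstates how much data the query actually processes.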


If it makes you feel better (worse?), even the big guys do this; I recall reading SAP HANA marketing materials that did a similar thing to the OP[0]. With enough precomputation and knowing which data to skip, you too can achieve better-than-linear performance; though at least they aren't suggesting they churn through the entire dataset in a few seconds.

[0] http://www54.sap.com/content/dam/site/sapcom/global/usa/en_u...


"time is counted as total processing time minus disk time."

Can anyone explain why that's a valid benchmark for him to use? Certainly the Hadoop version had significant disk access time?


Much of the point of using something like Hadoop is to increase raw disk IO by parallelizing it over many machines. So no, it's not a terribly valid benchmark.


I'm guessing: Total time will vary dramatically depending on whether the database is on HD, SSD, or in memory, so he is separating out that component. Of course, he might be optimizing something that doesn't help much if it takes forever to read the data from disk.


Not just the disk, but also transferring it over to the GPU is often a bottleneck.

If there is back and forth talk between the GPU and CPU, things will slow down considerably.
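The transfer bottleneck is easy to put numbers on. A rough back-of-envelope, assuming one 4-byte compressed column for 6 billion rows and ballpark 2013 throughput figures (my assumptions, not measurements from the project):

```python
# Back-of-envelope: moving a column to the GPU over PCIe vs. reading it
# from GDDR5 once it's there. All figures are rough assumptions.

records = 6_000_000_000
bytes_per_record = 4            # e.g. one compressed 32-bit column
pcie_bps = 12e9                 # practical PCIe 3.0 x16 throughput, ~12 GB/s
gddr5_bps = 288e9               # GTX Titan memory bandwidth, ~288 GB/s

size = records * bytes_per_record
print(f"PCIe transfer: {size / pcie_bps:.1f} s")   # ~2 s per column
print(f"GDDR5 read:    {size / gddr5_bps:.2f} s")  # ~0.08 s per pass
```

Even one column's PCIe transfer is on the order of the quoted 4-second query time, which is why keeping data resident on the GPU (and avoiding CPU round trips) matters so much.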


Arguably if data fits in memory but particular software (Hadoop) doesn't keep the data in memory then that particular software can be considered unoptimized. But he should really be comparing against Shark anyway.


>NVidia Titan GPU : it is a relatively cheap, massively parallel GPU

Relatively cheap compared to having to build clusters, perhaps, but $1000 isn't cheap for a desktop computing GPU.

A mid-high tier card (GeForce GTX 770) is closer to $400. A mid-range gaming card (GTX 760) is closer to $260.

Those finding the topic link interesting may also be interested in this CUDA radix sorting article[0] from 2010, as it featured "one billion 32-bit keys sorted per second."

[0] https://code.google.com/p/back40computing/wiki/RadixSorting
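The linked work sorts by radix; for reference, a minimal serial LSD radix sort over unsigned 32-bit keys, one byte per pass, looks like the sketch below. The GPU version parallelizes these counting/scatter passes; this is only the algorithmic skeleton.

```python
# Minimal LSD radix sort for unsigned 32-bit keys, one 8-bit digit per pass.
# Serial sketch only; the cited article runs these passes in parallel on a GPU.

def radix_sort_u32(keys):
    for shift in (0, 8, 16, 24):              # four byte-sized digit passes
        buckets = [[] for _ in range(256)]
        for k in keys:
            buckets[(k >> shift) & 0xFF].append(k)
        keys = [k for b in buckets for k in b]  # stable concatenation
    return keys

data = [0xDEADBEEF, 42, 7, 0xFFFFFFFF, 42, 0]
print(radix_sort_u32(data))  # same result as sorted(data)
```

Because each pass is a stable bucket scatter, four passes fully order 32-bit keys; the per-key work is constant, which is what makes throughput figures like "a billion keys per second" plausible on wide hardware.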


He's probably talking about the professional line of graphics cards (e.g. Tesla), where $1k is considered quite cheap. The Titan is the first consumer-oriented card by NVidia that can be used as a pro-grade card (no throttling on compute tasks), so maybe that's what the OP was talking about.

Of course for gaming the Titan is not considered cheap at all.


I'm inclined to believe that this sort of thing is memory-limited. It's probably less important that a Titan-class GPU specifically be used; what matters more is that the records are stored in super-wide GDDR5 RAM.

The problem with GTX770 / GTX760 is that their double-precision performance is gimped by NVidia. AMD cards are better at that price range for general purpose compute (which is why AMD cards are almost always used for BTC mining: http://www.extremetech.com/computing/153467-amd-destroys-nvi...)

Anyway, sorting and searching are all memory-constrained problems. GPUs have significantly faster RAM with significantly more bandwidth than CPUs. Hopefully DDR4 fixes the problem... but that isn't going to come for another year or two.
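The bandwidth gap above can be made concrete. A rough lower bound for one full scan of the 6-billion-row column, assuming the workload is purely memory-bandwidth-bound (the bandwidth figures are ballpark 2013 numbers, my assumptions):

```python
# Rough lower bound for a single full pass over the column, assuming it is
# memory-bandwidth-bound. Bandwidth figures are ballpark 2013 assumptions.

col_bytes = 6_000_000_000 * 4      # 6B rows, 4 bytes each
systems = [("dual-channel DDR3 (CPU)", 25e9),
           ("GTX Titan GDDR5 (GPU)", 288e9)]
for name, bps in systems:
    print(f"{name}: {col_bytes / bps:.2f} s minimum per pass")
```

At roughly an order of magnitude more bandwidth, the GPU's floor for a scan or sort pass is correspondingly lower, regardless of compute throughput.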


"* - time is counted as total processing time minus disk time. "

So, in other words, I subtracted time that both systems actually have to spend, for no good reason.

And to make the results look even better, I only subtracted it from my database, instead of running the tests myself and subtracting it from both.


When you're in love, disk access doesn't count. ;)


The disk time was subtracted because it is an in-memory system, and when your compressed datasets fit into memory, your disk subsystem will hopefully be irrelevant.


So in other words, it makes the system look better.

Look, disk time counts, whether it's hadoop loading it into the memory of a given machine, or you reading it from disk and transferring it into a GPU piece by piece.

Your "hopefully it will be irrelevant" is, well, crazy. I work for an employer with plenty of in-memory systems (very large ones in fact), and it certainly doesn't discount disk time. In fact, it matters a lot!


Disk time matters only the first time you read your data. The following queries won't have to read from the disk; the compressed data may sit in memory for days and the queries won't touch the disk. Now, if your compressed data doesn't fit into memory, then disk speeds of course matter a lot; on this I agree with you.
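The warm-cache behavior being described is just load-once, serve-from-memory. A minimal sketch (the column name and loader are illustrative, not from the project):

```python
# Sketch of the warm-cache claim: disk is touched once, later queries hit
# the in-memory copy. Column name and loader are illustrative assumptions.

_cache = {}

def load_column(name, loader):
    if name not in _cache:             # first query pays the disk cost
        _cache[name] = loader(name)
    return _cache[name]                # later queries are memory-only

disk_reads = []
def fake_disk_read(name):
    disk_reads.append(name)            # stand-in for a slow disk read
    return list(range(10))

load_column("l_quantity", fake_disk_read)
load_column("l_quantity", fake_disk_read)
print(f"disk reads: {len(disk_reads)}")  # → disk reads: 1
```

Whether that first load should be excluded from a benchmark is exactly the dispute in this subthread.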


You are making a lot of assumptions about working set sizes, etc. In any case, even if only once, it is still a cost you are paying, and a cost hadoop is paying, and it is completely wrong to simply subtract it out when comparing performance.


There's nothing about the Hadoop model that precludes the use of GPUs instead of CPUs. Hadoop solves the problem of storing massive quantities of data and processing it using a large number of machines. There's no reason the processing can't be done using GPUs.


A Hadoop cluster of GPU + SSD'd-up machines is not going to be cheap... (but it would be fun)


First, Hadoop has two parts: HDFS and MapReduce. This so-called benchmark compares only the computation part of it. People who say Hadoop is slow never really understood what Hadoop is. MapReduce is meant for processing big data in a batch-oriented way, not for real-time analytics. However, there are many technologies that work on top of Hadoop to provide real-time analytics capabilities, like HBase and Impala. Column-oriented storage is available in Hadoop too (Parquet). Also, with Hadoop, the real power comes from the availability of UDFs and streaming. Please don't do any stupid benchmark like this without knowing what you are comparing against.


I've been using Hadoop for 4 years now (author of hadoopy.com), so I'll chime in. I'll state the use cases that Hadoop/MapReduce (and, to a close approximation, the ecosystem around them) were developed for so that we're on the same page: 1.) save developer time at the expense of inefficiency (compared to custom systems), 2.) really huge data (several petabytes), 3.) unstructured data (e.g., webpages), 4.) fault tolerance, 5.) shared cluster resources, and 6.) horizontal scalability. Basically, people already had that and wanted easier queries, so the ecosystem has been pulled that way for a second generation: 1.) Pig/Hive and 2.) Impala and others.

Of the 6 design considerations I listed, none of them are really addressed here. If you outgrow a single GPU, you face a huge performance penalty growing further (that's vertical scaling). If you want to write your own operations (very common), this would be impractical.

It's a nice idea, but it'd be better to compare against things like MemSQL and the like, which have been designed from first principles for fast SQL processing. I'd recommend just dropping any Hadoop/HBase comparisons and comparing within the same class; Hive is embarrassingly slow even in the class it's in (compare it to Google's Dremel/F1 or Cloudera Impala).


Comparing against Impala doesn't change anything; Impala is still well behind, even on a cluster with just 600 million records: https://docs.google.com/spreadsheet/ccc?key=0AgQ09vI0R_wIdEV...

Your other considerations are still valid though. Although the point was to show the inefficiency of Hadoop/MapReduce when it comes to relational operations.


The two most interesting things about this article to me were unstated.

1. The TPC-H benchmark is measured in price-for-performance ($/QphH, or dollars per queries-per-hour). At 4 seconds for Q6, he's getting ~900 queries per hour. The cost of his rig is probably ~$2k, so he's around $2 per QphH. The top TPC-H scores are around $.10, but <$10 is pretty good for a first go.

2. The standard knock against GPU processing is the time it takes to load GPU memory. GPU processing may be blazing once data is in memory. But there was an MIT paper last year claiming you couldn't load the GPU fast enough to keep up. Evidently, he's keeping up.
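The price-performance arithmetic in point 1 checks out directly (the ~$2k rig cost is the commenter's guess, not a published figure):

```python
# Price-performance check for point 1 above. The ~$2k rig cost is a rough
# guess from the comment, not a published TPC-H figure.

seconds_per_query = 4
qph = 3600 // seconds_per_query        # 900 queries per hour
rig_cost = 2000                        # dollars, estimated
print(f"{qph} QphH, ${rig_cost / qph:.2f} per QphH")  # → 900 QphH, $2.22 per QphH
```

(This treats one query type as the whole workload; the official QphH metric is a geometric mean over the full TPC-H query set, so this is only a ballpark.)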

With regard to comparing his performance to hadoop/hive - yeah it's apples and oranges, but he's in good company. Hadapt, Hortonworks Stinger, Cloudera Impala, Spark/Shark and others all rate themselves on how many times faster they are than Hive.

And frankly, I don't buy the whole "the point of MR is for huge, horizontally scaling networks" argument. If you factor out Yahoo!, Facebook, Amazon, LinkedIn and a few others, the largest remaining Hadoop clusters are all WELL south of 1000 nodes. And most run on homogeneous high-end hardware.


So, I found this from back in 2011: http://www.tomshardware.com/news/ibm-patent-gpu-accelerated-... However, I couldn't find any commercial or even (active) open source projects on this topic. It seems like something that would be valuable to businesses working with big data, so what's the holdup? Has nobody reached this scale yet? Is it still too expensive? I don't get it. Maybe I'm overthinking it.



I know Folding@home was taking big advantage of CUDA-enabled NVidia GPUs.



