Using GPUs to Speed Through the 1.2B Record Taxi Dataset (mapd.com)
203 points by jtsymonds on Oct 15, 2016 | 47 comments



One of the big data points missing from this article is the price. Unless you need features specific to the high-end cards, such as unlocked 64-bit or 16-bit performance or antialiased lines, the consumer cards have much higher performance per dollar [1]. It would be really interesting if they compared their 8 K80s ~(8 * ($4K for 8TFLOPS)) against a set of GTX 1080s ~($650 for 8TFLOPS).

[1] https://www.youtube.com/watch?v=LC_sx6A5Wko & http://www.videocardbenchmark.net/gpu.php?gpu=Tesla+C2050
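
As a rough sketch of that comparison, using only the approximate prices and peak TFLOPS quoted above (these are ballpark figures, not measured prices or benchmark results, and the variable names are purely illustrative):

```python
# Approximate per-card figures quoted in the comment above (assumptions, not quotes from a price list).
K80_PRICE_USD = 4000       # "~$4K" per K80
K80_TFLOPS = 8             # "8TFLOPS" per K80
GTX1080_PRICE_USD = 650    # "~$650" per GTX 1080
GTX1080_TFLOPS = 8         # "8TFLOPS" per GTX 1080


def usd_per_tflops(price_usd: float, tflops: float) -> float:
    """Dollars paid per TFLOPS of quoted peak throughput."""
    return price_usd / tflops


k80_cost = usd_per_tflops(K80_PRICE_USD, K80_TFLOPS)
gtx_cost = usd_per_tflops(GTX1080_PRICE_USD, GTX1080_TFLOPS)
advantage = k80_cost / gtx_cost  # how many times cheaper per TFLOPS the consumer card is

print(f"K80: ${k80_cost:.2f}/TFLOPS, GTX 1080: ${gtx_cost:.2f}/TFLOPS, ratio {advantage:.2f}x")
```

On those numbers the consumer card is about 6x cheaper per TFLOPS. This says nothing about memory size, bandwidth, or reliability, which may matter more for this workload.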


Hi SXP, thanks for your comment. You might want to check out Mark Litwintschik's posts (independent blogger who has benchmarked this dataset across many different databases) for performance on GeForce GTX TITAN X's. 4 x GeForce GTX TITAN X: http://tech.marksblogg.com/billion-nyc-taxi-rides-nvidia-tit.... 8 x K80s: http://tech.marksblogg.com/billion-nyc-taxi-rides-nvidia-tes.... He has additional posts on MapD on Pascal Titan X's and AWS as well. In full disclosure I work at MapD...


Nice demo, but it would be even nicer if you could get the blooper fixed: "Mapbox's Openstreetmap"

MapBox is an active and respected participant in the OpenStreetMap project and uses our data in some of its products, but that is it.


Blooper Fixed :)


Thanks for the info. The tl;dr is "It's fantastic to see that I've been able to use a machine that costs 1/10th of the one used in the 8 x Tesla K80s benchmark but still have queries running within 33% of the previous performances witnessed."

However, I'm suspicious of the numbers in those articles since the author lists only 4 data points in each trial and doesn't mention the stdev in his measurements. One of his measurements was .964 vs .891 so it looks like the Titan Xs were 90% as fast as K80s if the numbers can be trusted.
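
For what it's worth, the ratio of those two timings comes out nearer 92% than 90%. A quick check, assuming both numbers are query times in seconds and that the slower one (.964) is the Titan X run, which is an assumption here:

```python
# The two timings quoted above. Which machine produced which is an assumption:
# the slower time is taken to be the Titan X run.
titan_x_seconds = 0.964
k80_seconds = 0.891

# Relative speed = reference time / candidate time (a smaller time means faster).
relative_speed = k80_seconds / titan_x_seconds

print(f"Titan X at {relative_speed:.1%} of K80 speed on this query")
```

A single pair of timings without a spread still can't tell you whether that difference is noise.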


Both those links got shortened for some reason, into invalid ones.


Yep, sorry about that:

Here is the Titan X link http://bit.ly/2e6C3Gg

Here is the K80 link: http://bit.ly/2eiIwvp


It's the memory size here that's likely important: 8 GTX 1080s have a third of the total global memory of 8 K80s. And the aggregate memory bandwidth is way more important than flops, since this workload is almost certainly not bound by arithmetic throughput.
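
The "a third" figure can be checked with a little arithmetic. This sketch assumes 24GB per K80 board (the 8 x 24GB = 192GB figure quoted elsewhere in this thread) and 8GB per GTX 1080 (the card's published spec, not something stated in this thread):

```python
GPUS = 8
K80_MEMORY_GB = 24      # per K80 board; matches the 8 x 24GB = 192GB quoted in the thread
GTX1080_MEMORY_GB = 8   # per GTX 1080; from the card's published spec (assumption here)

k80_total = GPUS * K80_MEMORY_GB
gtx_total = GPUS * GTX1080_MEMORY_GB
fraction = gtx_total / k80_total

print(f"{gtx_total}GB vs {k80_total}GB: {fraction:.2f} of the K80 total")
```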


One more thing to note is that gamer cards aren't meant to be abused like workstation cards are. You can leave a K80 running at full blast for a week doing stuff, but the same workload will significantly reduce the lifespan of a high-end gamer video card. They're meant for gaming sessions that last a few hours with some outliers going for maybe a full day, but not much more than that.

If you really have a tight budget and need to use gamer cards as workstation cards (e.g. a two-person startup that needs to crunch things on 4 GPUs), find yourself some aftermarket coolers, preferably liquid cooling.


Almost irrelevant, because MapD software charges dwarf the price of the hardware.

https://aws.amazon.com/marketplace/pp/B01M0ZY2OV?qid=1475606...


There's probably a good reason they are using server hardware. But sure, just like you could slap consumer CPUs into a server for a cheaper unit cost, you could use consumer GPUs.


For Graphistry's GPU platform, we suggest our users go with server-grade GPUs because they get (a) more memory and (b) great multitenancy. So using MapD as a personal system is an expensive use of resources, but when a system is architected and billed as an elastic, multitenant system, total cost of ownership for a team is less. Not all platforms are built for this (and I don't know enough about MapD vs. other GPU databases), but that's the engineering view.

And mini-disclaimer: Graphistry is a related platform focused on scaling & automating investigations. Part of that is a GPU compute stack that we started building around the same time as MapD, though we're not in the database business. E.g., our customers will generally use us to look across multiple other systems that already feature high-availability, long-term storage, and scale-out querying for TB+ storage. As some examples: SQL engines, Spark, Splunk, Datastax, and various graph databases.


Why are anti-aliased lines specifically a high-end feature? I thought that anti-aliasing was done by over-sampling and then down-sampling, so all drawing primitives would work with it uniformly.


The question, though, is what are you oversampling? Depending on how the line gets rasterized, supersampling (or multisampling) may or may not help you at all.


High quality anti-aliased lines aren't a bottleneck in video games, but they're very important in CAD applications, so enterprise customers paid a large premium for Quadro cards with CAD-specific functionality. The specific list of premium vs consumer features has varied over time, but previous generations of consumer cards could have their professional features unlocked via modded drivers.


That's one way of doing anti-aliasing (MSAA); there are many others.


> we use the GPU to render the image, compress it to a .png (about 100KB) and send it to the browser as a tile. This allows for lightning fast rendering and the perception by the user that all of this data is actually in their browser.

With the enormous caveat that you need low latency to their server to get this illusion of client-side rendering. Considering that they only have this cluster of K80s close to them geographically, and not a number of clusters spread out globally, this isn't a usable example in much of the world.

Now, I don't expect them to roll out K80 clusters world-wide just for the sake of a demo, but it's still pretty important.


I'm in Eastern Europe and it loads up in like half a second. Much faster than I'd expect the browser to process queries on a 1.2bln dataset and without taking up untold gigabytes of memory.


GPUs sound cool, but here's what I did with the same dataset, one using flat files and another using Cassandra:

https://www.michaelfogleman.com/static/yellow/

https://www.michaelfogleman.com/static/density/


This is 77M rows, not the full 1.2B dataset shown in the MapD demo (with 60 variables). It also looks like the map is pre-rendered as opposed to being dynamically rendered with filters applied.

Pretty cool but a different animal.


The link to the demo is broken in the blog post. It's actually: https://www.mapd.com/demos/taxis/


They left their source map open. Interesting tech choices:

- React / Redux / mapbox-gl

I always look at the data table implementation to see how far people go. And here they made their own implementation based on d3.

Here are the sources for those curious: https://github.com/d8d/mapd-sources/tree/master/out


d3/dc.js/crossfilter, to be precise. Having worked on something similar, I found dc.js to be redundant if you already use Redux. It's much cleaner to use a lighter-weight charting library with crossfilter.


Thanks for the tip, I found this table/d3 implementation but it looks like they use it only for the grouping table: https://github.com/d8d/mapd-sources/blob/master/out/home/jen...


Is this the data set? http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtm...

Why wouldn't they be hosting this in compressed form? A quick shot through pigz has it down to < 50% original size.
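
For anyone who wants to measure this themselves, here is a minimal sketch of computing a compression ratio with Python's standard gzip module (pigz writes the same gzip format, just in parallel). The rows below are synthetic stand-ins invented for illustration, not the real trip records, so the ratio it prints says nothing about the actual files:

```python
import gzip


def compression_ratio(raw: bytes) -> float:
    """Compressed size divided by original size (lower means better compression)."""
    return len(gzip.compress(raw)) / len(raw)


# Synthetic, made-up CSV rows standing in for trip records (hypothetical columns:
# pickup time, passenger count, distance). Not the real dataset.
rows = [
    f"2015-01-{1 + i % 28:02d} 08:{i % 60:02d}:00,{1 + i % 4},{(i % 300) / 10:.1f}\n"
    for i in range(20_000)
]
sample = "".join(rows).encode("ascii")

ratio = compression_ratio(sample)
print(f"compressed to {ratio:.0%} of original size")
```

To test the < 50% claim, replace `sample` with the bytes of one of the downloaded CSV files.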


Twrrim,

Slightly different. We have appended all of the data from Factual as well. This includes the location of every business in NYC.


Why don't you publish the real data then?


Because it's their business?


What are the 'commuter confidential' tricks around bridges? I know some bridges are tolled...


Look at the coloring around the rides near bridges. People take the subway down to the closest point and then take a cab home. The hybrid trip is both pocketbook friendly and probably faster.


Interesting data point for the scale-up vs scale-out debate.

https://www.mapd.com/assets/static/images/barchart.png


That's an interesting slide, but without knowledge of the size of the dataset it could be misleading (especially considering communication costs between nodes in a cluster).


Hi infinite8s, to get additional information on how that chart was made, you can go to https://www.mapd.com/product/, scroll down to the bar chart, and click “See Details” under the chart. It shows the machines used, the queries, and the source data set and its size. Note that the machine configurations used to generate the chart were normalized for equivalent cost on AWS, i.e. the chart is hardware-dollar normalized.


Source: https://www.mapd.com/product/

It tops off at 192GB (8 x 24GB). Presumably a couple more generations (more slots and more GPU memory) will put it in the 1TB range.


This is fucking awesome.


This blog post has since been deleted. :(


Failed to Load Dashboard TypeError: this.painter is undefined

HN effect?


Very impressive technology, but is there an open source version? Even a limited one? That one can try on something more modest than 100 grand's worth of pro GPUs?


There is not an open source version as yet, but you can spin up these instances on an hourly basis on AWS https://aws.amazon.com/marketplace/pp/B01M0ZY2OV?qid=1475606... and on IBM Softlayer.


Thanks, but at 5 bucks an hour for an entry-level instance (single 12GB GPU) I'm looking at 120 bucks a day if I don't want to constantly re-upload my dataset into MapD (a very slow operation, judging by Mark Litwintschik's posts linked by you). That's a very, very high price for such a modest hardware configuration, not to mention the more credible one, which goes for an eye-watering 30 bucks an hour, i.e. not much change from a grand a day. Not for us startup folk, clearly.
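
The arithmetic behind those daily figures, assuming the instance runs around the clock at the hourly rates quoted above (rough figures from this comment, not a price list):

```python
HOURS_PER_DAY = 24


def daily_cost(hourly_usd: float) -> float:
    """Cost of keeping an instance running for a full day at a flat hourly rate."""
    return hourly_usd * HOURS_PER_DAY


entry_level = daily_cost(5)    # single 12GB GPU instance, ~$5/hour as quoted
larger = daily_cost(30)        # the larger configuration, ~$30/hour as quoted

print(f"entry level: ${entry_level:.0f}/day, larger: ${larger:.0f}/day")
```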

I have to say, your pricing, for such a new entrant that has yet to build market share, seems bound to attract very stiff newcomer competition. "Interesting" business model.


You could try BlazingDB if what you are interested in is the GPU-powered SQL component. There is a free community edition available here: https://docs.blazingdb.com/docs/quickstart-guide-to-blazingd...

You can install this on AWS or on your own infrastructure (I run this on my laptop for example).


MapD has a persistent store and normally customers would keep that on an EBS volume, so they don't have to reload their data every time they spin up an AWS instance.


very interesting. Do you have a link to documentation? I'd like to take a look before I try it out.


fair enough but software cost 3-4x the (already high) hourly hardware cost seems excessive.


If you find this too pricey, then build a cheaper or free competitor.


How long before we stop bollocking about and just call them all PUs?


GPU: Generic Processing Unit



