MapReduce for the Masses: Zero to Hadoop in 5 minutes with Common Crawl (commoncrawl.org)
109 points by Aloisius on Dec 16, 2011 | 26 comments



Outside of Google, Facebook, and the other top-10 internet businesses, can someone share a realistic example of employing M/R in a typical corporate environment?

I'd be really interested in moving forward with the tech, but for the time being I have to stay where I am - in my primitive cubicle cell.


Most of the Fortune 500 have enough data that, in theory, there's something to throw M/R at, though whether they'd gain from any particular project is anyone's guess.

e.g. At a national bank, determine whether distance to the nearest branch or ATM correlates with deposit frequency or average customer relationship value. Your inputs are a) 10 billion timestamped transactions, b) 50 million accounts, c) 200 million addresses and the dates on which they began and ended service, and d) a list of 25,000 branch/ATM locations and the dates they entered service.

This is fairly straightforward to describe as a map/reduce job. You could do it on one machine with a few nested loops and some elbow grease, too, but the mucky mucks might want an answer this quarter.
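
To make that concrete, here's a toy sketch of the decomposition in Python, run locally; the pre-joined (distance_km, deposits) records and the 5 km bucketing are illustrative assumptions, not anything the bank actually hands you:

    from itertools import groupby
    from operator import itemgetter

    # Toy, made-up records: (distance_km_to_nearest_branch, deposits_per_quarter)
    records = [(0.4, 12), (2.1, 9), (7.3, 4), (11.0, 2), (1.8, 15)]

    # Map: key each account by a 5 km distance bucket
    mapped = [(int(d // 5) * 5, deposits) for d, deposits in records]

    # Shuffle/sort: group by key (Hadoop does this between the two phases)
    mapped.sort(key=itemgetter(0))

    # Reduce: average deposit frequency per distance bucket
    for bucket, group in groupby(mapped, key=itemgetter(0)):
        counts = [deposits for _, deposits in group]
        print('%d-%d km: %.1f deposits/quarter'
              % (bucket, bucket + 5, sum(counts) / float(len(counts))))

On a cluster the map emits the (bucket, deposits) pairs, the framework does the grouping, and the reduce computes the averages - the same three steps, just at scale.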

I know the feeling, though: I keep wanting to try it, but haven't been able to find a good excuse in my own business yet.


I doubt such companies would put their data "in the cloud".

And if they don't, the cost of supporting this "straightforward" Hadoop infrastructure, in terms of hardware, engineering, and support, is so massive that the little elbow grease for a simple I-know-what-it-does solution may well be worth it.

In other words, I share alexro's concerns. If you're buying into the M/R hype and processing your blog logs in the cloud, that's one thing. But legitimate business use-cases are probably not as common as people may expect/hope.


Hadoop is not a cloud solution - most organizations deploy it on their own infrastructure. Some folks (e.g. Amazon AWS) offer a "hosted" version.


Sure, but that's not what this article is about ("5 min setup, for the masses").


> And if they don't, the cost of supporting this "straightforward" Hadoop infrastructure, in terms of hardware, engineering, and support, is so massive that the little elbow grease for a simple I-know-what-it-does solution may well be worth it.

That's not true. A small cluster plus support from a company like Cloudera costs much less than a new Oracle install.


The mucky mucks can have their answer today with a random subset of that data: 10,000 accounts and the corresponding addresses and transactions. We'll need map-reduce tomorrow if they want precision in the last decimal point.


You might be able to do it with a random subset, but your accuracy would not be as good. For companies with this much data, a 0.1% increase (or decrease) can cost millions. If you use a random subset, you won't be able to accurately detect that tiny variation.


Read http://infolab.stanford.edu/~ullman/mmds.html a few times and you'll know more about manipulating large data sets than most people.

If you're just looking for a quick example: imagine you want to parse 6,000 webserver logs (each 100+ GB) to determine the most popular pages, referrers, and user agents. The map phase extracts the fields you want and the reduce phase counts the occurrences of each unique value.
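
A minimal Hadoop Streaming-style sketch of that job in Python, assuming the standard combined log format; the 'page|'/'ref|'/'agent|' key prefixes are just an illustrative convention, and you'd split this into mapper.py and reducer.py when actually submitting it:

    #!/usr/bin/env python
    # Sketch of a Hadoop Streaming job over combined-format access logs:
    # mapper emits one counted key per field of interest, reducer sums counts.
    import sys

    def mapper():
        for line in sys.stdin:
            parts = line.split('"')
            if len(parts) < 6:
                continue                        # skip malformed lines
            request, referrer, agent = parts[1], parts[3], parts[5]
            page = request.split(' ')[1] if ' ' in request else request
            for key in ('page|' + page, 'ref|' + referrer, 'agent|' + agent):
                print('%s\t1' % key)

    def reducer():
        # Streaming delivers mapper output sorted by key, so we can sum runs.
        current, count = None, 0
        for line in sys.stdin:
            line = line.rstrip('\n')
            if not line:
                continue
            key, n = line.rsplit('\t', 1)
            if current is not None and key != current:
                print('%s\t%d' % (current, count))
                count = 0
            current = key
            count += int(n)
        if current is not None:
            print('%s\t%d' % (current, count))

    if __name__ == '__main__':
        mapper() if sys.argv[1:] == ['map'] else reducer()

Sort the reducer output within each prefix and take the top entries to get the most popular pages, referrers, and user agents.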

"mapreduce" is about processing at scale. You can always do the same thing on a smaller set of data locally (and usually with just sed, awk, sort, and uniq).


Many people mistakenly assume that MapReduce is only useful when dealing with Google or Facebook scale datasets. Nowadays, MapReduce is simply a convenient data access pattern, particularly when your data is spread across multiple machines. For example, Riak employs the MapReduce paradigm as its "primary querying and data-processing system." [0]

Clearly, Riak does not expect each and every query to run for hours and involve TBs of data. MapReduce is just a straightforward way to process distributed data.

[0] http://wiki.basho.com/MapReduce.html


I think any corporate data warehouse is an area where Hadoop+Pig/Hive could add a huge amount of value. Here's why. Conventional data warehouses are built by running ETL transforms against existing sources, carefully selecting which attributes you want to transform in. These transforms are tricky to set up and even trickier to change if you decide you need a new piece of data (which happens every day in big BI implementations). And if the source system isn't permanently preserving the data (e.g. log files), then pulling in data retrospectively often isn't possible.

With an HDFS cluster you can cost-effectively dump undifferentiated data from existing sources into a "data lake" without worrying about complex and highly selective transforms, and then use tools like Pig and Hive to do ad-hoc interrogations of the data.

Most data warehouse implementations fail to a large extent because of the ETL problem - Hadoop could help solve that in a big way.

Further reading (no connection to me) http://www.pentaho.com/hadoop/


The NY Times, back in 2007, shared how they generated PDFs using Map/Reduce (on Amazon infrastructure). [1]

Hadoop has a long, long list of companies using it [2]. Cloudera has a similar list [3].

[1] http://open.blogs.nytimes.com/2007/11/01/self-service-prorat...

[2] http://wiki.apache.org/hadoop/PoweredBy

[3] http://www.cloudera.com/customers/


If you have tons of data (logs, for example), on the order of tens of terabytes or more, and want the benefit of SQL in addition to MR to query and crunch that data offline, you can use something like Hive, which runs on top of Hadoop, gives you all the power and familiarity of SQL, and lets you "drop down" to MR if you need the extra power.


Splunk is killing it in the "Map/Reduce for enterprise" market. See http://www.splunk.com/industries


Map/reduce can be used for garden-variety stuff. For instance, my blog index is the result of a map/reduce job over recent posts (in this case, those tagged html5): http://bit.ly/s7NHYM


This is a good article... I'd be interested in seeing this done using AppScale (http://code.google.com/p/appscale/wiki/MapReduce_API_Documen...) and Eucalyptus.


That's an interesting idea, and I dig using a fully open stack; we'll consider adding it into our next howto!


Does anyone know of an application of M/R to non-text data like image data or time series data? I'm trying to think about how to process a huge set of 3D atmospheric data where we are looking for geographic areas that have certain favorable time series statistics. We have the data stored in time series order for each pixel (where a pixel is a 4 km x 4 km area on Earth), and we compute stats for random pixels and try to find optimal combinations of N pixels/locations (where N is a runtime setting).


Don't know of an application specifically like that, but I'd love to hear more - what kind of dimensionality does the per-pixel data have? (Is it just RGB * N time series values?) There's nothing very text-specific about Hadoop, just a lot of text-oriented examples.

(e-mail is on my profile)


MapReduce is applicable wherever you can partition the data and process each part independently of others.

I used Hadoop/HBase for EEG time series analysis, looking for certain oscillation patterns (basically classic time-series classification), and it was an embarrassingly parallel problem:

Map:

1. Partition the data into fixed segments (either temporal, say 1-hour chunks, or location-based, say 10x10 blocks of pixels). Alternatively you can use a 'sliding window' and extract features as you go. In some cases you can use symbolic representation/piecewise approximation to reduce dimensionality, as in iSAX: http://www.cs.ucr.edu/~eamonn/iSAX/iSAX.html , "sketches" as described here: http://www.amazon.com/High-Performance-Discovery-Time-Techni... , or other time-series segmentation techniques: http://scholar.google.com/scholar?q=time+series+segmentation

2. Extract features for each segment (either linear statistics/moments or non-linear signatures: http://www.nbb.cornell.edu/neurobio/land/PROJECTS/Complexity... ). The most difficult part here has nothing to do with MapReduce but with deciding which features carry the most information. I found the ID3 criterion helpful: http://en.wikipedia.org/wiki/ID3_algorithm ; also see http://www.quora.com/Time-Series/What-are-some-time-series-c... and http://scholar.google.com/scholar?hl=en&as_sdt=0,33&...

Reduce:

3. Aggregate the results into a hash table where the keys are segment signatures/features/fingerprints and the values are arrays of pointers to the corresponding segments. (Depending on its size, this table can either sit on a single machine or be distributed across multiple HDFS nodes.)

Essentially you do time-series clustering at the Reduce stage with each 'basket' in a hash-table containing a group of similar segments. It can be used as an index for similarity or range searches (for fast in-memory retrieval you can use HBase which sits on top of HDFS). You can also have multiple indices for different feature sets.
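
To make the shape of that concrete, here's a toy sketch; the fixed segment length, the moment-based features, and the coarse rounding used as the signature are placeholders for illustration, not the features I actually used:

    import math
    from collections import defaultdict

    SEGMENT_LEN = 256   # samples per segment - a placeholder choice

    def map_segment(channel, seg_index, samples):
        # Feature extraction: mean and standard deviation, coarsely rounded
        # so that similar segments collide on the same signature key.
        n = float(len(samples))
        mean = sum(samples) / n
        std = math.sqrt(sum((x - mean) ** 2 for x in samples) / n)
        signature = (round(mean, 1), round(std, 1))
        yield signature, (channel, seg_index)   # key, segment pointer

    def reduce_signatures(mapped_pairs):
        # Group segment pointers by signature (Hadoop shuffles by key here).
        index = defaultdict(list)
        for signature, pointer in mapped_pairs:
            index[signature].append(pointer)
        return index

    if __name__ == '__main__':
        toy = [('Fp1', 0, [0.1] * SEGMENT_LEN), ('Fp1', 1, [0.11] * SEGMENT_LEN)]
        pairs = [kv for seg in toy for kv in map_segment(*seg)]
        print(reduce_signatures(pairs))

Under Hadoop the map output is shuffled by signature and each reducer builds its slice of the table; locally the defaultdict plays that role.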

-----

The hard part is problem decomposition, i.e. dividing the work into independent units - replacing one big nested loop/sigma over the entire dataset with smaller loops that can run in parallel on parts of the dataset. Once you've done that, MapReduce is just a natural way to execute the job and aggregate the results.


It would be nice if there were an estimate for how much it costs to run the sample code.

EDIT: Apparently, I can't read. :(


Hey Curt - most of my own runs, using the default of 2 small VMs, resulted in 3 normalized hours of usage, which equated to around 25 cents per run.


That's for the crawl sample, not the entire 4TB index, right?

How much data was that?


That was just for the crawl sample, yes, and was approximately 100M of data, though you can specify as much as you'd prefer.

The cool thing about running this job inside Elastic MapReduce right now is that you can get at the S3 data for free; access from outside costs something, but both are pretty reasonable. Right now, you can analyze the entire dataset for around $150, and if you build a good enough algorithm you'll be able to get a lot of good information back.

We're working to index this information so you can process it even more inexpensively, so stay tuned for more updates!


How is the $150 broken down?


The blog post says it costs something like 25 cents.



