Google BigQuery brings Big Data analytics to all businesses

haberman · on May 1, 2012

Hi everyone, I work on the BigQuery team. From a technical perspective, think of this as an append-only cloud SQL service built on Dremel (http://research.google.com/pubs/pub36632.html). You upload huge data sets and we run SQL queries over them in seconds, even for billions of rows, without you having to specify or build any indexes (we actually scan the billions of rows, though only the columns we need).
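To make the "scan every row, but only the columns we need" idea concrete, here is a toy sketch in Python. It is an illustration only, not how Dremel is implemented: the `scan` helper, the `edits` table, and its column names are all made up for this example.

```python
# Toy column store: each column lives in its own list, so a query
# only reads the columns it names. Illustration only, not Dremel.
from typing import Callable, Dict, List

Table = Dict[str, List]


def scan(table: Table, select: str, where_col: str,
         predicate: Callable[[object], bool]) -> List:
    """Full scan with no index: touch only the two columns involved."""
    values = table[select]      # column being returned
    filters = table[where_col]  # column being filtered on
    return [v for v, f in zip(values, filters) if predicate(f)]


# Hypothetical table; "num_characters" is never read by the query below.
edits: Table = {
    "title": ["Python", "Go", "Python", "Rust"],
    "contributor": ["alice", "bob", "carol", "alice"],
    "num_characters": [1200, 300, 4500, 800],
}

# Rough equivalent of: SELECT contributor FROM edits WHERE title = 'Python'
result = scan(edits, "contributor", "title", lambda t: t == "Python")
print(result)
```

Every row of the two referenced columns is visited, but the third column is never loaded, which is the property that makes index-free scans affordable at scale.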

All interaction with the system happens through REST interfaces. Even our own UI uses only our publicly-available REST APIs.
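As a rough sketch of what "everything goes through REST" means in practice, the snippet below builds (but does not send) an HTTP request for a query. The URL path and JSON body are assumptions modeled on the v2 `jobs.query` shape and may not match the API exactly as it stood when this was posted; the project ID, table name, and token are placeholders.

```python
import json
import urllib.request


def build_query_request(project_id: str, sql: str, token: str) -> urllib.request.Request:
    """Build a POST request carrying a SQL query as JSON. Nothing is sent."""
    # Assumed endpoint shape; check the current API reference before relying on it.
    url = f"https://www.googleapis.com/bigquery/v2/projects/{project_id}/queries"
    body = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",  # placeholder OAuth token
        },
    )


# Placeholder project, table, and token, for illustration only.
req = build_query_request(
    "my-project",
    "SELECT COUNT(*) FROM my_dataset.my_table",
    "PLACEHOLDER_TOKEN",
)
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would actually send it, which requires real credentials and a real project.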

There's a certain amount of free quota available. If you sign up you can try queries against public data sets like Wikipedia edits. Also it looks like the GitHub guys have been experimenting with analyzing GitHub data with BigQuery: https://github.com/blog/1112-data-at-github

I've just joined the team recently, but I really believe in what we're doing. I'd be happy to answer any questions I can.

DenisM · on May 1, 2012

So, what are the publicly available data sets? I see there is wikipedia in one of your screenshots, but short of that I couldn't find a list. I think if I saw something enticing I would sign up just to play with it.

haberman · on May 1, 2012

You can find the list of public data sets and descriptions of them here: https://developers.google.com/bigquery/docs/sample-tables

DenisM · on May 2, 2012

Thanks.

It's a bit thin, so I suggest you guys pump a lot of public datasets into it, and then do a series of blog posts about "look what you can discover from these public datasets with our awesome Q engine in a matter of seconds".

mattmiller · on May 2, 2012

I would love to use it but my company would veto this product based on security concerns. At a minimum we would require a VPN connection to the cloud and the ability to limit (or cut off entirely) access to the web interface. You guys could have a huge product if these concerns are addressed, but based on Google's history I do not think they will be.

Also, does anyone know how this performs compared to Hive?

mwhooker · on May 2, 2012

Just started playing around with it. We've been using Hive on EMR with tables stored in S3 (JSON-formatted). Using a single m1.large to run queries over an hour of data was taking 10-15 minutes. BigQuery returns the same query in seconds. For example, extracting referrer domains on BigQuery:

  > Query complete (7.6s elapsed, 583 MB processed)

Granted, that's with an under-provisioned EMR "cluster", so I don't want to assign too much meaning to the results, but they are promising.

I'll run some more comparisons on a larger cluster and update later.
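For readers unfamiliar with the workload, "extracting referrer domains" means roughly the following. This is a minimal Python sketch over made-up log rows, not the query that was actually run above; the `referrer` field name and the `www.` stripping rule are assumptions.

```python
from collections import Counter
from urllib.parse import urlparse


def referrer_domain(url: str) -> str:
    """Return the host part of a referrer URL, without a leading 'www.'."""
    host = urlparse(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


# Hypothetical JSON-style log rows; real logs would be far larger.
log = [
    {"referrer": "http://www.google.com/search?q=bigquery"},
    {"referrer": "https://news.ycombinator.com/item?id=1"},
    {"referrer": "http://google.com/"},
    {"referrer": ""},  # direct visit, no referrer
]

# Count hits per referring domain, skipping rows with no referrer.
counts = Counter(referrer_domain(r["referrer"]) for r in log if r["referrer"])
print(counts.most_common())
```

The same shape (parse a field, group, count) is what a SQL engine would express as a `GROUP BY` over an extracted host.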

oliverkofoed · on May 1, 2012

With all the 'spring cleaning' going on recently at google, my main concern would be the likelihood of this service staying available permanently.

pgrote · on May 1, 2012

As a pay service focused on businesses, I think they would keep it going.

I cannot get to it right now. :)

"Error: Server Error The server encountered an error and could not complete your request. If the problem persists, please report your problem and mention this error message and the query that caused it."

batista · on May 1, 2012

>As a pay service focused on businesses, I think they would keep it going.

Unless not enough businesses end up paying for it, in which case, if yours does use it and they cancel it, you're screwed. Or Google decides that, while it makes a nice revenue, they'd rather kill it to concentrate on something else...

That's the problem with putting out tons of products (including highly touted stuff like Wave) and then killing them: nobody trusts you to maintain a product their business will depend on anymore... Contrast that with Amazon AWS.

rabidsnail · on May 2, 2012

Who wants to upload the CommonCrawl corpus as a public dataset? :P

JPKab · on May 1, 2012

This unquestionably lowers the barrier to entry for crunching large data sets. I'm looking forward to messing around with it. Are there any other alternatives to this service? Something like a PigAsAService or HiveAsAService offering?

pappnase12 · on May 1, 2012

Well, we are working on a project that provides Hive (and Hadoop Streaming) as a service. It's http://www.hadoopondemand.com and uses Amazon EC2. We have just started our private beta and you are very welcome to join. There is also Amazon's offering, EMR (http://aws.amazon.com/elasticmapreduce/), which provides an interface to Hive and Pig as well.

EDIT: link to amazon's offering

JPKab · on May 3, 2012

Thanks. I look forward to checking out your project.