We're trialling BigQuery at the moment, pushed about 130TB into it last week (~300 billion rows) and have been blown away by its performance.
It's a bit of a shame they've never released a paper on how it works; nothing since Dremel, anyway. And Colossus isn't public either.
I can already see this built-in ML stuff being useful for trialling models, especially as it's available right in the SQL.
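For anyone curious what that looks like in practice, here's a minimal sketch of the BigQuery ML syntax (the dataset, table, and column names are all made up):

    -- Train a logistic regression model entirely in SQL
    -- (mydataset, churn_model, and every column here are hypothetical)
    CREATE MODEL `mydataset.churn_model`
    OPTIONS (model_type = 'logistic_reg') AS
    SELECT
      country,
      plan,
      sessions_last_30d,
      churned AS label          -- BQML expects the target column to be named "label"
    FROM `mydataset.users`;

    -- Score new rows with the trained model
    SELECT *
    FROM ML.PREDICT(MODEL `mydataset.churn_model`,
                    (SELECT country, plan, sessions_last_30d
                     FROM `mydataset.new_users`));

The appeal is that training and prediction both run as ordinary queries, so there's nothing to export or host.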
Did you all work together with the Firebase team on the recent schema change? We're scrambling to mitigate the ~2x cost increase it caused for us, since it blew our analytics budget out of the water.
It was a change that Firebase implemented, which went into effect about a month ago. Check this out and look at the "Old Export Schema" section at the bottom. The export schema went from multiple events per record to one event per record, with the user properties repeated on each record. That bloated the exports from Firebase and increased costs. https://support.google.com/firebase/answer/7029846?hl=en
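To make the change concrete, here's roughly what a query against the new one-event-per-row export looks like (the table name and the 'plan' property are illustrative). Because user properties are now repeated on every event row, any query that touches them rescans those bytes for every single event:

    -- New Firebase export: one event per row, user_properties repeated on each row
    SELECT
      event_name,
      (SELECT value.string_value
       FROM UNNEST(user_properties)
       WHERE key = 'plan') AS plan,               -- 'plan' is a made-up user property
      COUNT(*) AS events
    FROM `analytics_123456789.events_20180601`    -- illustrative table name
    GROUP BY event_name, plan;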
Has anything changed on BigQuery since then that would warrant rerunning those benchmarks? If Amazon's largest cluster outperforms BigQuery, then is your decision about which service to use just a cost calculation: a dedicated cluster per month for Redshift vs. projected bytes scanned per month on BigQuery?
I've worked with both, and BigQuery is so much easier to work with. Redshift does give you more control (to a limited extent) but it takes a lot more work and specialized knowledge to perform simple tasks well.
In the article you linked, they gloss over the part that takes by far the longest when working with Redshift: setting up "compression, distribution keys on large tables, and sort keys on commonly filtered columns". With BigQuery, you don't really worry about that crap, or about reloading your terabytes of data if you made a mistake somewhere in the schema. You also don't worry about vacuuming, or running out of space, or taking down the database with excessive CPU usage. Do you want to have a team of people at your company whose job is just to keep Redshift running smoothly? Or do you want another team of analysts or engineers?
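For anyone who hasn't lived it, here's roughly what that setup looks like on the Redshift side (the table, columns, and choices are all made up). You make these calls per table, and if you get the keys wrong you're usually looking at a reload:

    -- Illustrative Redshift DDL: you pick compression, distribution, and sort order yourself
    CREATE TABLE events (
        event_id    BIGINT       ENCODE zstd,
        user_id     BIGINT       ENCODE zstd,
        event_time  TIMESTAMP    ENCODE zstd,
        event_name  VARCHAR(64)  ENCODE zstd
    )
    DISTSTYLE KEY
    DISTKEY (user_id)            -- co-locate rows that join on user_id
    SORTKEY (event_time);        -- speed up range filters on time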
I could see some companies being forced to use Redshift, especially if they're using S3 a lot, and I could see people saving money in some specialized use cases, but for most people BigQuery is almost certainly going to be cheaper and faster in almost every way.
> I could see people saving money in some specialized use cases, but for most people BigQuery is almost certainly going to be cheaper and faster in almost every way
That's exactly what the TPC benchmarks are designed to show: how different appliances perform under the same diverse set of generic workloads. As of two years ago, they showed Redshift to be faster.
> With BigQuery, you don't really worry about that crap
How much of a benefit this is depends on your organization's level of expertise. If you grok sorting and distribution, then you can leverage them to increase performance, but it's not a prerequisite.
For example, in Redshift, when you bulk load data into a table and you didn't specify a compression scheme in the table definition, Redshift will analyze the data, find a scheme that works best, and automatically apply it to the table for you. BigQuery almost certainly does something similar. The difference is, with BigQuery, you're not invited to take part in that discussion, and you're charged as if the data were uncompressed. Psychologically, this is a huge relief if you don't (want to) know how compression works, but rest assured, you're paying for it somehow.
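Concretely, the behavior I'm describing looks something like this (bucket, role, and table names are illustrative): on the first load into an empty table, COPY samples the data and applies encodings for you, and you can also ask for recommendations explicitly:

    -- First COPY into an empty table samples the data and applies column encodings
    COPY events
    FROM 's3://my-bucket/events/'                              -- illustrative path
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy'    -- illustrative role
    FORMAT AS CSV
    COMPUPDATE ON;

    -- Or ask Redshift what encodings it would pick for existing data
    ANALYZE COMPRESSION events;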
To draw a tired analogy, vehicles with automatic transmission still have to shift gears. If you’re driving to the grocery store, not having to worry about that is a win. But if you’re racing stock cars, you’re definitely going to want a stick.
Compute resources tend to be significantly more expensive than storage. Our approach is three-fold:
- We finance the compute resources required for you to ingest data into BigQuery. With Redshift, you pay for ingest directly via compute cluster consumption (again, more expensive than storage). This also adds complexity due to on-cluster resource contention between ingest and queries.
- Separation of storage and compute gives you lots of options. With BigQuery, you don't need to attach relatively expensive compute just for the luxury of getting more storage. Spectrum helps somewhat, but ultimately with Redshift you don't even get to pay for storage alone - you pay for compute/storage combos.
- BigQuery's Long-Term Storage is not an archival storage tier - it's a discount on storage, with identical performance and durability characteristics, at only $0.01 per GB per month.
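To put that rate in perspective: the ~130TB mentioned at the top of this thread would come to roughly 133,000 GB x $0.01 ≈ $1,330/month once it ages into long-term storage, with no compute attached to it.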
This is likely a result of the origins of the two technologies. BigQuery is Dremel, written and operated by Google since 2006. Redshift started from purchased source code for an on-premises fork of Postgres (ParAccel).
Or listen to Nick Caldwell, Reddit's VP of Engineering, on moving away from AWS to BigQuery:
- "2017, which effectively brought us to the present system, we began forking all of our event data into BigQuery, after considering a lot of different alternatives" https://youtu.be/tKISLQ87GO8?t=426
Great question. At the time, we had a hard time replicating the results of that blog post.
Here's some food for thought.
- We recently announced that we've improved our TPC-DS 10T geomean by 5x over the past 18 months [0]. The nice thing is that users never had to set maintenance windows or "upgrade their clusters" to get these performance improvements.
- Here's a session by our eng lead and the CTO of Looker discussing various performance improvements, including querying a PB of data in ~5 seconds and paying a fraction of a penny for it. [1]
- Benchmarks are important, but they miss three key considerations.
------ Effort - how much complexity is involved in achieving maximum performance. As someone else stated in this thread, we feel that we differentiate here.
------ Maintenance - when data changes, new data comes in, you run DML, what happens to this pristine benchmark of yours? What is required to maintain this performance? Again, we feel that our offering is compelling (we don't ever have messy storage state that requires a vacuum, our compute is entirely stateless so re-sorting/re-distributing data is not needed, and BigQuery ingest never affects query capacity, even at PB/day ingest scale).
------ Real-world usage - a serially-executed set of queries poorly represents what happens in reality - high volatility, high concurrency workloads. Again, BigQuery has stateless compute, separation of compute and storage, and separation of compute and state.
- Customers - in addition to what's been stated already, PTAL at Kabam [2] and NYT/Yahoo/BlueApron [3] for their motivations for migrating from Redshift to BigQuery.
That said, lots of folks really like running on Redshift, and they've extended their platform with Spectrum and other bits. I'd invite you to find out what works best for you. Competition is good for the end user!
If you haven't read it already, I strongly suggest reading the original Dremel paper [0]. It's no doubt somewhat out of date, but I believe BQ is still based on Dremel.
tl;dr for underlying storage model: distributed column store which pushes computation down a tree to leaf nodes to parallelize disk I/O. Parent nodes aggregate computations before returning to the client.
We (Sojern) have been using BigQuery since late 2015, when we moved from Hive. It has been an amazing experience in terms of cost savings (managed infra, no more operational cost of running a Hadoop cluster) and performance. We use it across the org for analytics, ingestion, reporting, ML, and many more uses.
There is information about Colossus: the original GFS paper, McKusick's interview with Sean Quinlan on the post-GFS improvements, as well as a talk by Denis Serenyi on more of the gory details, e.g. rebalancing.
There was also a public talk on D, the GFS chunkserver replacement, but there don't seem to be recordings or slides from that.
For this trial we are transferring some data from AWS, which is already partitioned in S3 by quite a few dimensions, including DAY.
As BQ only supports DAY partitioning, I had to create multiple tables to represent the other dimensions, but it seems to work OK, if a little clunky when you have to use wildcards/_TABLE_SUFFIX.
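In case it helps anyone, here's a sketch of how we query across those tables (the naming scheme is made up to illustrate the idea):

    -- One table per extra dimension value, queried together via a wildcard
    SELECT campaign_id, SUM(clicks) AS clicks
    FROM `myproject.mydataset.events_us_*`        -- illustrative naming scheme
    WHERE _TABLE_SUFFIX BETWEEN '20180601' AND '20180630'
    GROUP BY campaign_id;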
And with the just-announced clustering, you now get even better cost and performance.
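A minimal sketch of what that looks like (table and column names are illustrative); cluster on the columns you filter or aggregate by most often:

    -- Day-partitioned, clustered copy of an existing table
    CREATE TABLE `mydataset.events_clustered`
    PARTITION BY DATE(event_time)
    CLUSTER BY country, campaign_id AS
    SELECT * FROM `mydataset.events`;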
Disclosure: I'm https://twitter.com/felipehoffa and I work for Google Cloud. And I'm really excited to reprocess all public tables into clustered ones.
Cost (or rather bytes scanned) is sent back from the API on any queries or dry run/validation calls. There's a common Chrome extension to put it in the legacy UI, and I believe the beta UI has it baked in (as part of validation).
This is very exciting for us, even at this nascent, limited stage.
Compared to many players in different verticals, our data is small. But in our vertical of asthma care, we probably have one of the largest (possibly the largest) asthma datasets.
We've been looking at different ways of plumbing the data out to automate and run some rudimentary analysis on it, since we found BigQuery a bit limiting. Now, seeing this announcement, it could be a great start for us.
I hope this is just a start and that people like us can send the Google Cloud team a wish list as we come across various needs. Good job and thanks to everyone behind this release.
Independent of DL, I'm curious why these two regression cases haven't been made available in a SQL-like interface until now. Kudos to Google for putting this in the hands of folks who otherwise just use SQL.
I wonder how they're doing logistic regression. It doesn't have a closed-form solution and is normally solved with an iterative, multi-pass optimization. edit: Read more closely. They're using (non-stochastic) batch gradient descent.
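For reference, the batch version is just the textbook update applied to the whole table on every iteration (standard formulation, not necessarily exactly what BigQuery runs internally):

    \hat{y} = \sigma(Xw), \qquad
    \nabla L(w) = \frac{1}{n} X^\top \big(\sigma(Xw) - y\big), \qquad
    w \leftarrow w - \eta \, \nabla L(w)

Each step needs a full pass over the data, which maps naturally onto one full scan per iteration.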
Good, I hope this trend of shoving ML into SQL (instead of the other way around) continues. I always thought it was silly that every "data wrangling" system like Pandas and R needed to (poorly) re-invent SQL.
Unless you present clear arguments, I'd refrain from saying that Pandas is "poorly re-inventing SQL".
Pandas is now the standard for data analysis (as long as things fit into memory). It's much, much easier to debug than a SQL query. You can write operations as a succession of small logical steps (instead of one huge query that is hard to debug).
It's raw Python, so you can do something like:
df.groupby('movie_id').agg(dict(ratings='median', price=lambda p: np.percentile(p, 95))).plot.bar()
Yeah, also in Pandas you can do stuff that would otherwise require writing a custom reducer or UDAF, in which case you aren't really using SQL anyway.
I just use SQL to grab (and, if necessary, aggregate) the data, and then do everything else in Pandas, using custom Python reducers to run trained models. We're migrating to GCP now, though, so soon that won't be necessary.
I appreciate how convenient it is to have statistical analysis tools available directly in BQ SQL, but are linear and logistic regression really considered "machine learning"?
Actually, until very recently most online ads CTR-prediction machine learning systems were still logistic-regression based. It's simple but very effective.
Just because it's not a neural net doesn't make it not ML. Also, can echo the other folks here: linear models and basic logistic regression are still competitive.
I wholeheartedly agree with you. I'm suggesting that it might be time to have the counter conversation: just because something does numerical statistical analysis does not make it ML. I've been around the field for a couple of decades and understand that linear regression is covered in the first pages of chapter one. If that's the standard, then Excel can claim to have had machine learning capabilities for years.
Has anyone here tried Druid? I hear Druid's performance is much better in terms of response time, especially for arbitrary queries over a large set of dimensions. Has anyone done an in-depth comparison between BQ and Druid?