Why is Snowflake so Valuable? (freshpaint.io)
247 points by malisper on Sept 30, 2020 | 163 comments



I've used Snowflake a fair amount. It's a decent product, probably on par with Redshift / BigQuery. Obviously there's a lot of hype and free money floating around, but my take on why they are popular is that they are basically a replacement for large Hadoop installations that have become untenable to manage over the past decade. If a company is already using Redshift or BigQuery, I'm not sure why they would switch.

I would be apprehensive about investing in Snowflake long term, purely because their product is highly susceptible to being made obsolete in the next 5-10 years.


I was at a company that switched from Redshift to Snowflake. It was a night and day difference. Faster (orders of magnitude!), cheaper, and significantly easier to work with (since everyone had their own personal view of the data to mutate/work with).

As far as I can tell, it is a unique product in the database space. Extremely well executed ideas and design.


Snowflake seems like a unique product and I can only imagine the complex math they're doing under the hood to achieve these incredible query times. memsql is the only real competitor I know of. Redshift is a lot less user friendly (constant need to run vacuum queries). Parquet lakes / Delta lakes don't have anything close to the performance.

Predicate pushdown filtering enabled by the Snowflake Spark connector seems really promising. Lots of companies are currently running big data analyses on Parquet files in S3. Snowflake has the opportunity to grab a huge slice of the big data market.
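
(To make pushdown concrete, here's a rough sketch of the effect with a hypothetical events table. Without pushdown, the connector ships the whole table and Spark does the filtering; with pushdown, the Spark filter is folded into the SQL that Snowflake executes, so far less data crosses the wire:)

    -- What effectively runs without pushdown:
    -- ship everything, let Spark filter
    SELECT * FROM events;

    -- What effectively runs with pushdown:
    -- Snowflake prunes the data before transfer
    SELECT * FROM events
    WHERE event_date >= '2020-09-01' AND country = 'US';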


What kind of math is involved in building a faster database? Genuinely curious. I would guess maybe linear algebra, indirectly.


Not at all. I'd highly recommend CMU's 15-445/645 Intro to Database Systems course (sponsored by Snowflake lol) because they put all their lectures online on YouTube [1]! Here's what's involved in making fast databases, from the syllabus [2]:

This course is on the design and implementation of database management systems. Topics include data models (relational, document, key/value), storage models (n-ary, decomposition), query languages (SQL, stored procedures), storage architectures (heaps, log-structured), indexing (order preserving trees, hash tables), transaction processing (ACID, concurrency control), recovery (logging, checkpoints), query processing (joins, sorting, aggregation, optimization), and parallel architectures (multi-core, distributed). Case studies on open-source and commercial database systems are used to illustrate these techniques and trade-offs. The course is appropriate for students that are prepared to flex their strong systems programming skills.

[1] https://www.youtube.com/playlist?list=PLSE8ODhjZXjbohkNBWQs_...

[2] https://15445.courses.cs.cmu.edu/fall2020/syllabus.html


Oof... CMU courses directly sponsored by Snowflake. Gross.


Please elaborate? I can see a lot of ways a sponsored course could go badly, but I can't immediately see which ones apply here.


I'm not qualified to evaluate this particular course. But any time there is a corporate sponsor of a course, it provides, at a minimum, strong incentives for the professor not to harm that sponsor. If there's a methodology that the professor would like to teach, but that sidesteps, or calls into question, the sponsor's main offering, then that content is in jeopardy. The corruption will always take root given enough time, which is why editorial and advertising, or academic content and corporate sponsors, etc., should always be at arm's length. Snowflake should give money to CMU to fund "database-related research and teaching" and the university should decide what to do with it. There's still a possibility of improper influence, but it's harder to achieve. This is particularly bad because it's CMU and not the University of Phoenix... CMU is in the highest echelon of computer science universities, so it's sad to see it so debased.

What if Kodak sponsored an imaging class in 1990... what do you think they would have said about film vs. digital photography?


A lot of ML classes at CMU (and probably other prestigious campuses) are sponsored by AWS or GCP through cloud credit donations, including the popular Cloud Computing class. Is that any different?


Not really. Cloud computing has a lot of benefits, but a lot of risks and drawbacks. Who is sponsoring a class to teach about those? About keeping users’ data private by building your own infrastructure? CMU is actively tilting their students, who are the top CS students in the world, towards cloud computing, based on the choices of these sponsors.


Sounds kind of conspiratorial.

I think any increase in educational content is good, even if ‘bad actors’ are funding it.


Bad actors funding it always leads to bad actors writing it. Then it's hard to argue that an increase in its quantity is good.


>I can only imagine the complex math they're doing under the hood to achieve these incredible query times

Maybe it's cynical/paranoid, but in this age of Theranos I must ask: is it possible their algorithm excels at showing you a reasonable-looking number, rather than an accurate one?


It's SQL, if they were giving wrong answers people would notice.


It's not too terribly difficult to load test Snowflake to get a sense of scaling. Jmeter does the job well. Heck I can pass you along some sample projects I've done against them if you really wanted.


Yeah, Redshift is not at all comparable to Snowflake. BigQuery is much closer; it's ahead in some areas and in the last year has closed some of the gaps where it wasn't. BigQuery's biggest problem is that it's tied to GCP, which is a distant 3rd in cloud market share. They have BigQuery Omni coming, which is multi-cloud, but it'll probably be a while before it's comparable to BigQuery on GCP.


The other problem with BigQuery is that you can very easily write a query that's going to cost you a lot of money to run - with Snowflake you can let it run for an hour or so, and then realise it was a bad idea and you're only out a few credits, a handful of dollars.

The killer feature for me was the query profiler - you can see WHY a query is taking a long time and optimise it - BigQuery just felt like Google were brute forcing the performance, and then charging you accordingly.

When the project I was on switched, the micro-clusters (and the ability to recluster a table) as well as the MERGE semantics beat BigQuery hands down - although those features may be out of beta now (but I've moved on to a new gig).


That's also a problem that it'd be fairly straightforward for Google to solve by automatically spinning up smaller, entirely separate serving clusters for customers who are worried about such a blowout (for a fee, obvs). It's just the serving tree (+ whatever in-memory storage service they use to do distributed joins nowadays), no need to duplicate the rest of the service. The caveat is, a smaller cluster will favor query optimizations specific to that smaller cluster. Some of those "small cluster" optimizations could hurt query performance when deployed against BQ proper with its tens of thousands of workers.

Also, BQ does explain the query plan to some extent: https://cloud.google.com/bigquery/query-plan-explanation. Not quite at the level of a "regular" SQL DB, but it does give you some info to work with when optimizing queries. If you haven't used it in a while I'd give it another try.


I believe this is exactly what slot reservations in BigQuery achieve. Instead of paying on-demand pricing that is determined by data read, you purchase a fixed number of “slots” that are shared by queries running within that particular project.


Ah OK, after reading their docs I see they've changed what "slots" used to mean in Dremel (internal version of BQ). It used to be that slots _guaranteed_ capacity, but did not limit it. Meaning that you could rely on having a certain number of workers in the cluster when you issue a query, but if Dremel had more it'd give you all it's got. Obviously this is not viable when people have to pay per terabyte read, because a ton can be read.

What they have now strikes me as an even better solution to the problem of bankrupting someone with a query IMO. Not sure how pricing compares to redshift et al, but pricing is the easiest thing for Google to change.


Slots don't control how much data you consume, your query does.

If you need to read a terabyte of data to answer your query then more slots only gets it done faster.


BQ Slots lets you do essentially that (pre-commit to a particular cluster size)


I was hitting some rough edges / complexity with BigQuery's MERGE recently, but wasn't able to ascertain any significant difference with Snowflake by scanning their docs briefly -- what aspects of the MERGE semantics are better in Snowflake in your opinion?

Wondering if this is a somewhat new feature in BQ since you used it, or if there's still a feature gap here (e.g. see https://cloud.google.com/blog/products/gcp/performing-large-...).


BQ has per-project and per-user cost controls. Normally when running new large queries one would run them under a special user with a limit on costs.


I think the obsolescence issue is complicated.

I recently saw a criticism of Palantir which went: "The company has largely succeeded, they say, not because of its technological wizardry but because its interface is slicker and more user friendly than the alternatives created by defense contractors."

A lot of the most successful tech firms started post-dot-com are decent interfaces to not-particularly-revolutionary databases. In high-end consulting and investment banking, appearances are hugely important. You can't have trash decks. It's unsurprising to me that the same is true in defense and intelligence. You can get a roof over your head and breakfast at a trashy motel or the Ritz. Everybody knows the Ritz can command a much higher price because "its interface is slicker and more user friendly than the alternatives."

I think the same thing is true here.


The ritz has far better beds, cleaner & safer rooms, better food and is far more likely to deliver that consistently. It's not just the appearance.


A closer reading will reveal that I'm not talking about superficial appearances, but the interface. That's an important distinction.

When I start talking about the Ritz and high-end consultants, I'm discussing the interface, which of course includes the "far better beds, cleaner & safer rooms, better food..." and consistency you're trying to contrast with appearance. I would agree that those things are more than superficial and are extremely important to the experience of the user, because that's exactly the point I'm making.

The beds and concierge are nicer at the Ritz, and the interface (note: not appearance) and support are better at Palantir (or, as we're discussing here, at Snowflake).


Maybe your Ritz experiences have been different than mine, but IMHO all hotel rooms are concrete boxes with a facsimile of home stuffed inside them, copied and pasted as many times as local demand will merit.

Hotel restaurants are the same principle, except replace furnishing with food.


Stay at an aging Courtyard Marriott. Some boxes are nicer than others.


I've stayed at everything from a Motel 6, to Courtyards / Residence Inns / Sheratons between NYC and San Diego, to Four Seasons / Ritz Carltons.

I stand by my claim. The relative differentiation in niceness is swamped by their mass produced boxness.

Ironically, my favorite road chain tends to be Aloft. At least they're upfront about their capsule-esque nature, in a sort of ironic/not-ironic way?

Least favorite: Embassy Suites. shudders It's like every Disney vacationing family's fantasy about what a hotel should be... packed with every Disney vacationing family. Omelette?


The point of hotel chains, and chains in general, is the consistency of the mass-produced experience. I can walk into a DoubleTree hotel anywhere in the world and get the same welcome cookie. It's a positive, not a negative; people often enjoy knowing what they're going to get. If you prefer a more unique experience, which is perfectly understandable, then simply avoid chains perhaps?


That's my point, but extended: I feel like walking into any hotel chain (including different product tiers and luxury brands) gives effectively the same experience.

Don't get me wrong, there's a benefit to consistency of product (especially when you travel Su-F for consulting).

But that benefit, parent company consolidation, and economies of scale drive a net result of overwhelming homogeneity.


Totally get your point of view, and I share it in vacation contexts. As the hotel chains have consolidated, they slice pennies everywhere.

When I'm traveling for business or putting my head on a pillow on a road trip, consistency makes my life easier and less stressful. I'm a gorilla-sized person :), I would rather stay at a higher-end hotel that provides an actual bath sheet than a Marriott whatever where I have to call for 6 towels. Surprises aren't delightful at 10PM when you've been on the road for 15 hours.


I got eaten up by gnats (they claim not bed bugs) over a week at a particularly nice hotel. On the plus side, nothing came home with me, the bites healed, and they gave me enough "points" as compensation to cover a luxury hotel in Barcelona for 2 weeks. So... Future Self can look back on the experience with a smile.


Nothing ever gets obsolete once it gains a large foothold in the enterprise space. There's a reason why Oracle and IBM are worth what they are today.


> Nothing ever gets obsolete once it gains a large foothold in the enterprise space.

Lotus? Delphi?


Both still in very heavy use. In 2014, anyway, every single IBM employee had to keep a Lotus Notes window open. It was hellish.

Dunno if that's changed since the Red Hat acquisition.


Used Lotus Notes as recently as 2010; I am pretty sure it's going strong in my megacorp former employer.


Lotus is all over in government and insurance. As a mail client it is mostly dead, but the apps live on.


There is a reason, but it ain't because of their cloud databases...


Novell, Word Perfect


WordPerfect was used in certain industries (legal especially, I think) long after it started dying everywhere else. I don't think it's an exception to this rule.


Yes but it’s dead now.


There was a post maybe two weeks ago from Tavis Ormandy (a tweet) that made the HN front page, about how he uses WordPerfect:

Tavis Ormandy (@taviso) Tweeted: @mkolsek Funny you should mention that, I was recently curious if there are any console word processors. I discovered there's a community who still use WordPerfect 5.1 for DOS. They kinda sold me on it, got it working in DOSEMU. https://t.co/t6j0c1G3w1


WordPerfect still has some users.

Last year we recruited an attorney from a firm that still uses WordPerfect for all their documents.


My school district still runs on ZENWorks.


At the end of the day, all the data warehouses run on SQL, with a bit of customization around ingestion and export. Most of them are backed by object storage (S3/GCS) and those integrations look very similar.

I wouldn't be that worried about lock-in or being made obsolete. Business logic is going to be pretty easy to port between Redshift, BigQuery, Snowflake, or whatever comes next.


> going to be pretty easy to port between Redshift, BigQuery, Snowflake, or whatever comes next.

This isn't even remotely true. Each has unique SQL syntax, and once you have a few hundred or thousand queries written using vendor-specific SQL (be it date functions or JSON), it is non-trivial to migrate.
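
A quick illustration (hypothetical table and columns; the functions are the point): computing a time difference and extracting a JSON field look completely different in the two dialects.

    -- BigQuery standard SQL (payload is a JSON string)
    SELECT TIMESTAMP_DIFF(shipped_at, ordered_at, HOUR) AS hours_to_ship,
           JSON_EXTRACT_SCALAR(payload, '$.user.id') AS user_id
    FROM orders;

    -- Snowflake (payload is a VARIANT column)
    SELECT DATEDIFF('hour', ordered_at, shipped_at) AS hours_to_ship,
           payload:user.id::string AS user_id
    FROM orders;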


> Most of them are backed by object storage (S3/GCS)

Redshift is backed by worker instances that have their own stores in what's basically an EC2 instance. It's definitely not backed by S3 like Athena.

Bigquery and GCS are both built on top of Colossus, but they have different layers in between them.


With the newer Redshift RA3 instances you use S3-backed storage with local SSD caching:

https://aws.amazon.com/redshift/features/ra3/


Same applies to Teradata Vantage on cloud.


Sorry, probably should have been more precise. Meant to say: most users are going to interact with the warehouses via object storage for import and export of data.

Since the object store APIs are almost identical across platforms, it doesn't matter that much which warehouse you actually use for production work. It's something that does massive SQL, imports data from S3, and exports data to S3.


> most users are going to interact with the warehouses via object storage for import and export of data.

No, most are going to be using SQL IDEs to query and export data.


> I would be apprehensive in investing in Snowflake long term purely because their product is highly susceptible to being obsoleted in the next 5-10 years.

This can be said about most products and companies. What keeps them alive is how robustly they capture (and hold on to) the market, reduce costs through economies of scale, and innovate. This specific market is also very rapidly growing.


I would think it wouldn't be the same product in 5-10 years.


Lots of companies have built on top of Snowflake.


People are excited about Snowflake because it can completely disrupt the traditional data-warehouse market.

The legacy players like Teradata and Exadata (from Oracle) really don't scale. Teradata has ~2B in revenue, Exadata is probably in the same range. That's all up for grabs but that's only scratching the surface.

Historically, only transactional data was dumped into the warehouse. Snowflake is selling storage at S3 prices (plus you get compression, so it often ends up cheaper) while they make money off compute/queries. If they can provide all the right query abstractions (SQL, full-text search), in theory all data can be thrown into Snowflake. Yes, tech-savvy Bay Area companies can set up their own stack using Presto etc., but the rest of the world is not like that.


[Teradata employee here.]

> The legacy players like Teradata and Exadata (from Oracle) really don't scale.

I get why Teradata gets labelled "legacy", but one of Teradata's main differentiators is scale. Teradata engineers have been tackling incredibly interesting scale problems (on many dimensions of "scale") for 40 years. Teradata has many customers who routinely manage and perform analytics on many petabytes of data.

> Historically, only transactional data was dumped into the warehouse.

That was once true, because initially that was all the data that companies had. However, companies have long since used data warehouses for all kinds of data — sensor data, text, behavioral data, product info/BOMs, vendor info, contract info, etc. — whatever's necessary to run the business.

> Snowflake is selling storage at S3 price…

This is important, but not unique. For example, Teradata's current product has native support for S3 and S3-compatible object stores, and you can query them just like any other database table, join that data with data in high-performance native storage, etc.


Sorry, I didn't clarify well. I am sure it scales technically well but not on cost.

My experience of TD is > 10 yrs ago, and back then the multi-node version was substantially more expensive than the single-node version. Also, storage and compute were coupled, which meant I had to pay for nodes even if 99% of my data was cold. That's a problem with Redshift too, but not for Snowflake.

De-coupling storage and compute was a brilliant move by Snowflake. BigQuery completely abstracts compute - you don't provision compute and only pay for data scanned. However, it gives you a sense of insecurity around cost - a single bad cron job running a query every second can blow up your cost (real-life experience). Snowflake provides the best cost/performance tradeoff I have seen.


> I am sure it scales technically well but not on cost.

The honest answer is "it depends". Because Teradata is a different beast, per-query pricing can be significantly cheaper than Snowflake with high-volume workloads. It's worth trying both to evaluate cost and performance.

> Also, storage and compute was coupled which meant I had to pay for nodes even if 99% of my data was cold.

Yes, it used to be that everything had to go into Teradata's high-performance filesystem. These days, Vantage's native object storage support means that you can keep that cold data in S3.


>>However, it gives you a sense of insecurity around cost - A single bad cron job running a query every sec can blow up your cost (real-life experience).

Only if you're using on-demand. Instead, reserve some slots and pay a flat rate. The minimum quantity is very low, and the minimum time is 1 min.


> Teradata's current product has native support for S3 and S3-compatible object stores too, and you can query them just like any other database table, join that data with data in high-performance native storage, etc.

Storage costs for S3 (or any cloud-provider object storage) are only one dimension of the price. The other is interaction costs which can get prohibitively expensive, for example if you accidentally forget to provide a partition key in your query predicate. Snowflake absorbs this cost if you use internal storage (or just copy into tables).


Snowflake doesn't absorb the cost because there is no cost.

The benefit of native tables for all columnar databases is that it provides an optimized format with metadata for each column, which is then used to eliminate most of the data retrieval during query time. The more selective your query, the faster the results.


> Yes, tech-savvy Bay Area companies can set up their own stack using Presto etc., but the rest of the world is not like that.

My last company was an early adopter of Snowflake. We tried Presto first, circa 2016, and Presto was sloooow, much slower than the Vertica we were using at the time. Snowflake, on the other hand, was able to perform on the same order of latency as Vertica, which was pretty crazy to us.


That's interesting. I thought Vertica's pitch was real-time analytics, for which traditional disk-based data warehouses are too slow.


Vertica is a disk-based analytics database. It was very fast, but also very expensive. And hardware failures could be particularly difficult to recover from.


Vertica was very powerful for us, but the separation of compute from storage was a critical feature of SF that motivated us to switch. We evaluated Eon mode but SF was too easy. On the performance front: we were running nightly batch processing on a 22 node Vertica cluster that was taking 6 hours per night on highly optimized projections. We threw the same query at an SF 8XL cluster and it finished in about 30 minutes. The biggest difference, however, is cost: we are spending more on SF than we otherwise would have on Vertica.


SF prices may have risen. We had a 48 node cluster in AWS, which wasn't cheap obviously, and our license was also seven figures a year. And we got a good deal from Snowflake because we were an early large customer. One big advantage for us was that we could auto-scale Snowflake based on our load to save more money.


So why did you switch from Vertica (or did you)?


Vertica was too expensive, their licensing fees were terrible at our scale. Operations were also awful, if we had two nodes go down we were always in trouble. We built an EBS solution that made it a little better, but it still wasn’t tenable long term.


Good info -- thanks.

Their Eon mode product is very similar to Snowflake, with S3 storage and semi-dynamic compute nodes, but they may not be as slick at marketing it or providing a UI.


Yes Eon is similar BUT it is not nearly as turnkey. Agree it is not well marketed.


My previous company still uses Vertica (on-prem) but it still wasn't as fast as it should be. Trying to hire anyone for an operations team was an exercise in futility, since it's rather niche tech. Maybe it would've been better in the cloud, but here large companies are cautious about vendor lock-in and, after 2014, the potential impact of sanctions.


>>>Yes, tech-savvy Bay Area companies can set up their own stack using [insert open source tool here] etc., but the rest of the world is not like that.

It sounds like a ton of these cloud infra companies have this product strategy (datadog, snowflake, elastic, hashicorp, etc)


When you look at cloud infra companies like that, their competitive advantage is in quickly being able to ingest data and make it accessible, so an off-the-shelf solution likely doesn't exist for their particular use-case. Also, since that operation is your competitive advantage, you should look to in-source it rather than reach for a COTS solution.


Ironic that products named Exadata and Teradata (guess we skipped Petadata) don't scale.


Smaller companies with Presto won't get the same performance benefit.

Snowflake & BigQuery get the ability to have multiple customers on a large cluster.

It’d be cost prohibitive for a single smaller customer to have all that compute sitting idle for a few queries per minute.

Storage also benefits, as Snowflake/BQ can shard your data across a much larger array of disks, giving you better IO.

Think: is it faster to drive a car 100 ft starting at 0 mph and flooring it, or to drive 100 ft in a car that starts off doing 120 mph?


Almost all the big companies I worked for had a "database gang" -- a database group which, in the name of centralization, forced you to bow to them to get anything. New DB? bow to them. More nodes? bow to them. Reboot? bow to them. The internal budget "prices" would be off the charts unbelievable.

It makes sense to centralize, but only at a certain cost. Beyond that cost, it is better to just de-centralize because not every project can spend 4-5 months of meetings to spin up a DB.

The cloud changed this because it became an OpEx discussion and something you could spin up on your own. For non-production workloads, this becomes especially obvious.


The database gang at my company offers most things through self-service tooling and a sub-24-hour Slack queue for most others. Meanwhile the "spend the company's money" gang is way off in some ivory tower, and I'm pretty sure a purchase order would take 4-5 months of meetings (heavily involving the database gang). My director has a budget for headcount and a budget for travel & entertainment, which is divided among the Sr. Managers and Managers, but I think we'd have to reach the CTO or CTO + 1 to get the authority to spend money on products and services.

Baffled both by how bad your internal infrastructure is and by how easy it is for you to buy stuff.


This still in no way answers why Snowflake is so valuable, though. I completely understand your argument, and I agree with it; I just don't think the article's arguments are anything other than ex post facto rationalization. When they mentioned NPS I almost snorted my coffee; that metric can be gamed any way you want.


>> Almost all the big companies I worked for had a "database gang" -- a database group which, in the name of centralization, forced you to bow to them to get anything. New DB? bow to them. More nodes? bow to them. Reboot? bow to them. The internal budget "prices" would be off the charts unbelievable.

> This still in no way answers why Snowflake is so valuable, though.

It explains it to a T! You have something you want, but internal company politics and territoriality keep you from getting it the way you want. An outside provider lets everyone get it for a bit of cash. It's basically the same play as Salesforce. It's not some kind of technical moonshot. It has to do with a modicum of technology delivered by a 3rd party who can avoid all of the internal friction.

The next founder who can think of this kind of play, then execute on it, will be the next Salesforce/Snowflake, and will probably have the ear of the same investors!


What I meant was why Snowflake specifically is so valuable. BigQuery, Redshift or any other cloud db would fill this gap as well. Why Snowflake?


Snowflake fanboy here who can't really answer your question about why it's so valuable. Not sure I can rationalize the current value. Not sure I think it should be valued this much.

But I can probably answer why Snowflake instead of Redshift (sorry, not too familiar with BigQuery)...

First of all, it's cloud-provider agnostic, so you can set up Snowflake on any or all of the 3 major cloud providers, as well as set up replication between them directly or indirectly through their data exchange. Probably the most powerful feature is the way that Snowflake has the ability to scale (up or down) compute (vertically and horizontally) and storage independently of each other. Furthermore, you have the ability to scale compute down to nothing, and spin up "instantly" when the demand arrives. On top of all of this there is an incredible selection of functionality that I could go on and on about.
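
(A minimal sketch of the scale-to-zero part, with an invented warehouse name:)

    CREATE WAREHOUSE reporting_wh WITH
      WAREHOUSE_SIZE = 'SMALL'
      AUTO_SUSPEND = 60          -- suspend after 60s idle; compute billing stops
      AUTO_RESUME = TRUE         -- the next query wakes it back up
      INITIALLY_SUSPENDED = TRUE;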


Honestly, if you don’t mind, please do. I believe the decoupled elastic compute / storage advantage has been well described; what are the more granular or technical things you like?

Edit: seems you’ve already answered this :) https://news.ycombinator.com/item?id=24265856


> What I meant was why Snowflake specifically

Marketing. Vast amounts of marketing.


Not just that. Being the neutral 3rd party who can overcome intra-company roadblocks has real value. I'm not sure Redshift, etc, could be as well focused on being that.


It's a chance for investors to get in on the next big thing and invest specifically in data warehousing. No one can put his money directly into BigQuery or Redshift.

Edit: why are so many people downvoting this? Is there some other reason for Snowflake's valuation (aside from tech bubble playing a role)?


I find Snowflake much easier to use than BigQuery or Redshift. It is also cloud-service-provider agnostic. So your only hook at that point is ingest + any Snowflake-specific SQL (and obviously security, migration, etc.). So for retention, they compete on UX rather than walls.

W/r/t value, the idea is that a disproportionate share of the egress from Oracle, Teradata, etc. will end up at Snowflake, hence the huge TAM, SAM, and SOM.


> it is better to just de-centralize

Until you want to join, and then all of a sudden it's not your problem. Then you end up with a "gang" of cowboy analysts running ad-hoc data dumps against operational datastores, affecting production uptime and stability, only so that they can do a lookup between the multiple (source_table_column_count * source_table_row_count) sheets in their uber Excel document.

I'm all for decomposing the monolith as long as you have a plan for recomposing when it's necessary.


Yep, that's why I said centralizing makes sense -- up until some price point. Beyond that point, you might as well just spend the money on re-composing when you need to.


It's not just the technical costs of recomposing to achieve a join; it's also that diversity of this kind makes long-term maintenance a nightmare, and moving people/functionality/whatever between apps almost impossible.

If everybody cowboys their own storage, you're risking building up a legacy cruft that's very hard to work your way out of later.

The costs of recomposing later can be prohibitive, if a few particularly poor choices were made early on, and the hidden costs of inflexibility can bite too.

There's nothing wrong with outsourcing storage; that's not the issue - the issue is the culture in which it's easier to just not talk to the rest of the company (assuming the company is small enough to have any kind of cohesion in the first place). If it's too expensive to talk (or even pick from a few common defaults) beforehand, how are you ever going to interop later on? You're getting the downsides of a large organization without the upsides.


Fair point! Sorry I missed that.


> The internal budget "prices" would be off the charts unbelievable.

I find that this occurs when an infrastructure team considers itself a "platform". The only supplier of an asset that everyone else demands can set the "price" of the asset as high as they want.


This isn't actually how monopolies work. Theoretically, monopolies cannot just charge what they want for a product, unless demand is perfectly inelastic. i.e. At some point, people (consumers) will stop paying or move to a substitute.


I've seen multiple corporate re-orgs (Fortune 100) and you are right -- internal monopolies cannot just charge what they want. Eventually, internal customers get fed up and "revolt" -- then you have a corporate re-org where IT infra/services get distributed across business lines or lower.

Eventually that blows up too -- re-composing data at the parent level (e.g., for quarterly financial reporting) becomes too exhausting and the company decides to revert to centralized services.


Can someone summarize Snowflake's unique technical value? I'm quite familiar with both Redshift (I would summarize it as Postgres adapted to sharded, columnar OLAP functioning) and BigQuery (there is a famous paper explaining the architecture). Also with more traditional databases such as MySQL, PostgreSQL, SQL Server, and columnar OLAP databases like Vertica. I explored the website a little bit, but couldn't construe a clear statement of the technical architectural value. Some of the comments here are valuable, but I'm missing a clearer "big picture" overview. Thanks!


(I'm the author of the post.)

I've worked with a large Postgres cluster before (~1PB of data) and have been experimenting with Snowflake recently. I would say there are two clear technical advantages of Snowflake over Redshift. The first is that there's no maintenance when using Snowflake. You just sign up for a Snowflake account, upload a CSV, and you can start querying the data. This is in contrast to Redshift, where you have to manually provision a cluster, resize it as you add more data, etc.

The second is their pricing. Storing data in Snowflake costs the same as it would cost to store in S3. The tradeoff is you also have to pay based on how long your queries take. Depending on your workload this can result in a massive cost savings. If you access only small amounts of your data infrequently, it's like you're storing the data in S3 and you only have to pay a bit more when accessing the data. This is in contrast to Redshift where you have to pay for the full cost of the cluster regardless of whether you are actually querying the data or not.

Snowflake also has a ton of quality of life improvements compared to Redshift. One really nice thing is you can change the amount of compute used for any individual query. For example, if you have one specific slow query, you can allocate 4x the compute for that one query, pay 4x as much while the query is running, and get the query to run 4x faster (ultimately costing you the same amount as if you used 1x the compute).
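
(Concretely, that's just a couple of statements around the query; the warehouse name here is invented:)

    ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'XLARGE';
    -- ... run the one slow query ...
    ALTER WAREHOUSE analytics_wh SET WAREHOUSE_SIZE = 'MEDIUM';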

One neat thing is there's ultimately only one "Snowflake instance" in each region. Everyone's tables are in the same instance, but you can only access the tables you have permission to access. This allows you to easily share data between different Snowflake accounts. You can store the data in one account and query it from another.
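
(A sketch of that sharing flow, with invented names; the consuming account then creates a database from the share:)

    CREATE SHARE sales_share;
    GRANT USAGE ON DATABASE sales_db TO SHARE sales_share;
    GRANT USAGE ON SCHEMA sales_db.public TO SHARE sales_share;
    GRANT SELECT ON TABLE sales_db.public.orders TO SHARE sales_share;
    ALTER SHARE sales_share ADD ACCOUNTS = partner_account;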

So the core value proposition is really strong and it also has a bunch of extra features that are all pretty useful at the end of the day.

This post focused on Snowflake solely from a business point of view. I'm considering writing another one that focuses on it from a technical point of view.


Thanks for the details, very useful. Please write that post, I'm sure it will make it to the front page here :)


Thanks for the great clarity between the 2 tools. Do you have any thoughts on Dremio/AtScale and whether they complement or replace Snowflake?


Snowflake owes much of its performance benefits to "micro-partitions" [1]. BigQuery is a worthwhile comparison. MySQL, PostgreSQL, SQL Server, and Vertica are not close equivalents.

[1] https://www.infoq.com/presentations/snowflake-automatic-clus...
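
If you want to poke at micro-partitions yourself, clustering is the user-facing handle (table and column invented here):

    -- Tell Snowflake how to organize micro-partitions for pruning
    ALTER TABLE events CLUSTER BY (event_date);

    -- Inspect how prunable the table currently is
    SELECT SYSTEM$CLUSTERING_INFORMATION('events', '(event_date)');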


That's an interesting look at their internals; I wasn't aware of their dynamic sorting feature.

At read time, though, Snowflake's zone map is the same as Redshift's and Vertica's; you'll see similar pruning for many queries.

Redshift however doesn't prune during joins, which is a huge deficiency.

Snowflake looks more flexible about getting the data into its final ordering.


FWIW, if you're looking at PostgreSQL-at-humungo scale, there are a few options around. TimescaleDB, Citus (which Azure now offers as a service called Hyperscale DB) and there are others I always feel terrible for overlooking.

I work for VMware and get along well with folks who work on Greenplum. It's still doing massive workloads with massive amounts of data for lots of customers, has the ability to operate over blob stores with predicate pushdowns and recently merged up to parity with the PostgreSQL 12 upstream[0]. It's the fruit of a six year effort to return to the upstream from a heavily modified fork of 8.3. A truly monumental effort.

[0] https://github.com/greenplum-db/gpdb/commit/19cd1cf4b68faff2...


Snowflake’s value is that they provide the same technical products as amazon/google/etc, but are not amazon/google/etc. Some shops like buying into the google ecosystem, some are afraid of vendor lock in.

Probably other things, but many companies exist just to be alternatives to faang. If you’re good enough, you surpass that intention.


I saw them a year or two ago positioning themselves in contrast to Redshift and BigQuery. I thought "these guys are building something for Microsoft to acquire" (my thought was, something with a more modern OLAP architecture than SQL Server, which they could offer via Azure). Naive me, they were so much more business savvy than that...


The snowflake team came from Microsoft.


SQL Server had too much political pull within the org for such a deal to succeed.


The thing to keep in mind is that leadership in data management lasts about 5 years, definitely less than 10. An (incomplete but representative) timeline:

- late 90s: Early DWs like Redbrick

- early 2000s: Oracle, Teradata

- late 2000s: Shared-nothing Data Warehouses (Vertica, Aster Data, Greenplum) - bought up by Teradata, EMC, HP

- early 2010s: Hadoop and Hive

- late 2010s: Redshift and cloud DBs

- early 2020s: Snowflake

- late 2020s: probably something else...

All these technologies felt they were here to stay at the time, but they didn't. Will Snowflake be the exception? Maybe, but the odds are not nearly as great as their valuation implies.


I guess this quote from the article sums it up: "There was so much hype, my mom, who doesn't even know what Snowflake is, decided to invest in Snowflake."


"Taxi drivers told you what to buy. The shoeshine boy could give you a summary of the day's financial news as he worked with rag and polish. An old beggar who regularly patrolled the street in front of my office now gave me tips and, I suppose, spent the money I and others gave him in the market. My cook had a brokerage account and followed the ticker closely. Her paper profits were quickly blown away in the gale of 1929."


This article explains why Snowflake is a great company. And there's no doubt about that.

But it does not explain why it is valued at $60B. Or whether that value makes sense.

To put a price on a company, I need a projection of the income it will generate for shareholders over the next 5-10 years.

The fact that they are a great company does not guarantee they will generate income for shareholders that justifies a $60B valuation.

This is how a bubble starts. Overvalue a great company, then overvalue fair companies just not to miss out and "in comparison" with the great companies, and in the end, overvalue nothing (if Tesla is great, then Nikola was the nothing).


That assumes the stock market is logical and rational, but you know it isn't. Stock prices are mostly independent from the underlying company's actions and performance, and instead vary depending on investors. The bigger investors have so much money and influence (e.g. operating multiple large 'financial news' / stock news websites) they can sway the prices wherever they want.


So maybe the article should've been

Why is Snowflake so Valuable?

Stock market is not logical or rational.

Thanks for reading.


Because interest rates are super low and there’s a lot of uninvested capital sloshing around.


Because we're in a tech bubble. These valuations are absurd.


That, plus low interest rates and money looking for a place to go, plus the Warren Buffett multiplier.

There are only ~3,600 companies on the US markets [1], half of what was there in the 90s. There aren't many places to put lots of money.

The rise of private equity (PE) has really taken a lot of the growth out of the public markets, but since there are so few companies, and rates are so low, people are looking for returns. A lot of it is also pump-and-dump schemes loading up stocks, and then short-and-distort. The problem is that growth is tapped out on public markets now; PE drains it before it gets there, so more volatility games are being played. Couple that with less purchasing power in the lower/middle, which adds to the games, since investment in new consumer-focused companies doesn't work as well when purchasing power is drained. M2V has hit a precipice [2].

[1] https://www.wsj.com/articles/where-have-all-the-public-compa...

[2] https://fred.stlouisfed.org/series/M2V


With all the money the government is handing out, it might be partly inflation too.


This is some ridiculous overvaluation. I've used their product, and while it worked all right, I definitely noticed parts of their website and docs being half-assed by not very experienced web developers. Which didn't really conjure an impression of quality to me. More like, "let's do things in a hurry to get the money".

I don't know how you, as an investor, can rationalize the price by any reason other than speculative increase in share price. Which is again driven by some fundamentals but mostly by hype. If they fail to gain almost a 100% increase in revenue for the next year, that $60B is looking way too high.

Compare this with something like Splunk. They make a high-quality product and have over $2B in revenue (with a nice growth curve). Market cap: $30B. So is Snowflake as of now worth two Splunks? I don't think so. I could see it maybe being half of that, $15B, if I squint real hard.


If I want full ownership of my data, meaning I don't want it managed/hosted by Snowflake, doesn't that mean that Snowflake will reference my S3 bucket as an external table? How exactly am I going to be saving money? Snowflake will have to index my S3 bucket, most likely doing a poor job of it, all the while charging me for S3 requests just to create an index that I'm already maintaining.

My company has strict contractual obligations regarding data de-identification that I have written code to securely index. We use S3 to store de-identified data and the index to query data we have in S3, meaning that we run no risk of our customer data being identified even if attackers acquired access to both the index & S3 bucket. Snowflake may not be trying to handle our use case, but the corporate account executive spamming my inbox was dead certain that it was our panacea. I'm failing to see where they fit in, and I'm glad that my manager listens to me.

I also don't see the arguments that Snowflake is beneficial because it is cloud-agnostic. I can count the number of cloud-agnostic employers I've had over my career on 0 hands. If I can't set up Lambda triggers on puts to my S3 bucket, then Snowflake is a hard no from me.


Nerds are surprisingly susceptible to hard-sell tactics.


I think part of it is that many know that most companies underperform the market. So I imagine it's not hard to see someone justifying (correctly or incorrectly) that it's worth paying more for a company you think is more likely to be one of those outliers. And since there is a limited supply of these companies, they can shoot up in value fast.

I never used Snowflake, so it's hard for me to have a solid opinion of this company. I remember when Facebook IPO'ed and people were like "what? Worth 100 billion? OVERVALUED," and they were wrong in every way. So who knows? Though my gut doesn't tell me this company is the next Facebook. Coming out of the gate with a $70B market cap feels like all the growth is already priced in.

With Facebook, at their IPO, I felt their mobile revenue alone in 5 years' time would be worth their valuation. But I had weak hands. I think I bought them at 28 and sold at 22. I should have had more conviction, because I truly did believe they were worth a lot more.


I don't see it mentioned in the article but isn't one of the main selling points of Snowflake their data exchange? Companies upload their data to Snowflake in the hopes of someday monetizing it? If that's the case then I think it's just a matter of time until regulators become interested too.


The article talks about 'good things' but doesn't put them in context of valuation.

$60B is still too much.

It's odd that Buffett is in, it's a weird signal, because this is a weird era for markets: all other things being equal, we are looking at .com-ish situations here and the timing would be ideal for a true crash.

That the world economy is shrinking by 10% and governments and major industries are going insolvent should be scary.

Perhaps investors think they are preparing for the 'covid future' but this may be a weird kind of inflation whereby everything else (including cash) is crap so they are piling into winners.

There is an emotion to a lot of stocks these days that is probably making every analysts job a nightmare - if the CEO or company is popular, it really messes with valuation.


One thing to keep in mind is the price at which Buffett invested. I vaguely remember that he invested at about $70 per share (I may be wrong, just going by memory here). This means there is a lot of buffer room for correction at the current price of around $250.


I am guessing all of the Berkshire companies either are integrating with Snowflake or already have.


> It's odd that Buffett is in

More like Todd Combs and Ted Weschler...


Snowflake seems to be part of a segment of companies that are defying economic reality during covid: $TSLA, $ZM, $SNOW. These companies have huge P/S ratios and a higher volume of options activity relative to their underlying stock. A rational person may look at metrics, but we're not in a rational market. The strange behavior of the market may explain these valuations more than anything else. And maybe 3-digit P/S ratios are the new normal.

[1] https://www.ft.com/content/b330e091-2a59-4527-b958-9213731a5...


> For every $1 of revenue Snowflake received from their customers a year ago, that same pool of customers are now paying $1.58. That means Snowflake could acquire no new customers, and they would still be doubling revenue every 18 months.

No. That would require customers to grow their spend 58% every year. Might happen early on, especially as projects like this are usually staged in phases so the initial go-live usage is lower than when you get to full production. But projecting long term revenue growth on the basis of average annual 58% increases in per-customer revenue is simply ridiculous.
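
For what it's worth, the arithmetic in the quoted claim does check out; it's just compounding:

    1.58^t = 2  =>  t = ln(2) / ln(1.58) ≈ 0.693 / 0.457 ≈ 1.5 years ≈ 18 months

The dispute is whether 58% per-customer growth can persist, not the math.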


Agree. Generally it kind of stabilizes at lower than that. For example, for Cloudflare [1] it is around 110%, while this [2] points to 109% being kind of standard.

1. https://www.sec.gov/Archives/edgar/data/1477333/000119312519... ( see dollar based retention rate)

2. https://medium.com/@sammyabdullah/109-net-dollar-retention-i...


OK, it's a great product, but the valuation still does not make sense!


My experience with investing tells me that you will never get a good investment at a fair value, you always have to pay a premium for it. There is a pretty consistent pattern in my stock portfolio. The stocks that I overpaid for ended up being great investments while the stocks that I thought I was getting a deal on ended up being a disaster of an investment.


So, at the moment Snowflake beats all the cloud providers at usability, but what stops them from improving their offerings using tech that Snowflake can't match because it doesn't own the hardware? Like, as far as I understand, AWS Aurora could only be offered by Amazon because it uses interfaces available only internally. Also, it seems that the DWH/analytics landscape changes very rapidly; isn't being "long Snowflake" actually "short progress"?


People want to own the next Salesforce. Snowflake could expand into ERP/CRM/visualizations. There is over $50B of revenue across the world in those areas.


Those lines of business are very far away from Snowflake.


They were from Oracle too.


If you were a foreigner and had to invest your savings somewhere (because banks and govs are forcing negative rates), where would you invest?


I would invest all of it into Snowflake stock, of course.

The shoe shine boy told me about it....


AMT / DPZ 50/50 split


5G and pizza? Interesting combo.


I did an analysis of Snowflake's Facebook ads if you want to see how they message there and the kind of content they promote : https://www.rightpercent.com/b2b-guides/snowflake-just-went-...


I don't know anything about Snowflake, but in general they are fortunate to go public at this time. The tech market is in a bubble and investors are frothing at the mouth for tech stocks. Pessimism started to creep in for some of the already sky-high tech stocks. Snowflake IPO'd right at that time and people just flocked to it.


> Net promoter score (NPS) is a way of measuring customer satisfaction.

How easy/hard is it to fake an NPS score? Is this somehow regulated? Can the company only provide its most satisfied customers (which it knows beforehand) and only have them participate to get a good NPS?


Likely this can be easily gamed. But in the context of Snowflake's value, NPS manifests in Net Retention, which is likely to be more difficult to fudge:

> For every $1 of revenue Snowflake received from their customers a year ago, that same pool of customers are now paying $1.58.

Net Retention is more important, but in this case it also gives credence to the NPS number.


I also have some questions/concerns over the NPS. Online surveys, which is effectively the instrument of NPS, typically yield statistically incorrect data due to, often, some flavor of self-selection bias. If you think of an online survey as an experiment, they rarely allow enough control over the sample to mean much. However, that's not to say it's impossible to properly conduct an NPS, just that it's probably very easy to get wrong which may paint a false picture.


NPS originally came from car manufacturers and industries that produce easily comparable products, i.e. how happy are you with the car? Would you recommend BMW to friends?

I don’t think it’s that great for software, but it’s very trendy. It can be gamed, like anything else, usually by sending NPS surveys to decision makers who aren’t usually the actual users.

A slightly better metric is Customer Effort Score (CES), which shows how easy it is to do business with a company.


NPS can be gamed any way you want. If you want to use it as a real metric to improve your product, it's a great metric. But if you want to use it to convince shareholders your company is very valuable, it's extremely easy to do so by exploiting certain biases.


In the article they list AT&T as having a positive 20 NPS score, which is pretty hard to believe.


In order to grow into this valuation, snowflake will need to become the biggest data warehouse company ever. They have a great product and that will probably happen! But you need to be highly confident of the best case scenario happening to buy at the current price.


Yeah, the article seems to be contradictory. First it says “even my Mum, who knows nothing about Snowflake, bought shares.” Then it gives complex technical analysis to explain why the share price is high. Um, I think the first explanation works better....

In general the article persuaded me that Snowflake is a great company. But without some maths relating all these numbers to an estimate of future revenue, there’s no way to tell if it’s a great stock — at the current price. It’s the same problem that Tesla bulls have.


Am I wrong if I say that they are so valuable (in monetary terms) just because they are the only (or one of the few) non-free database management systems (or whatever they are)?


Not really, they’re largely targeting the same kind of use cases as redshift and bigquery.


I ask people if they've ever heard of Redshift and BigQuery, and most, even many in IT, haven't.

But everyone has heard of Snowflake.


Sure, I’m not arguing that they’re equally good or equally good at marketing. But they’re far from the only cloud OLAP database.


As someone who spends a lot of time in this space: their only "killer app" is automated workload / data distribution management. Which is cool, and hard to get right, but clearly something the cloud vendors and other data players have taken steps towards / offer more or less the same outcomes.

And in contrast, their Silicon Valley roots mean a lot of their tooling/UX/data capabilities are ... undercooked. Their Web IDE feels like a throwback to 2003 Hadoop, their ETL capabilities are a joke, they don't support joins in views ...

And they've also squandered some opportunities to actually offer a differentiating "all in one" data processing experience for ad hoc/exploratory, BI/aggregated, and Big Data/AI/ML model crunching. For example, here's their garbage blog post on Spark SQL - https://www.snowflake.com/blog/snowflake-spark-part-2-pushin...

tl;dr when someone writes a Spark job that includes a filter against data in Snowflake, it's more efficient to let Snowflake filter the data before shipping it off to the (much more performant) Spark engine to do the actual analytical pieces of the query plan, instead of just shipping all the data over and letting Spark do the filtering.

Like ... wow, predicate pushdown is your answer?

Contrast with Azure Synapse providing Spark and SQL Server compute in the same environment; Databricks adding Delta Lake capabilities to be more schema-on-write friendly; Dremio building AI into their caching, and Starburst into their workload management ...

Anyway, I don't see any secret sauce, which means it's still just traditional enterprise sales cycles...


> they don't support joins in views

Perhaps you mean they don't support joins in Materialized Views (uppercase M)? We use Snowflake views with joins all over the place. Furthermore, if views don't cut it for you, you can always use joins in UDTFs. Or if you really need joins in a materialized view (lowercase M), you can use change streams in combination with joins to maintain your own materialized view (table).
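
(A sketch of that stream-based approach, insert-only and with invented names, as a starting point rather than production code:)

    CREATE OR REPLACE STREAM orders_delta ON TABLE orders;

    -- Run periodically: fold new rows into the hand-rolled "materialized view"
    MERGE INTO order_totals t
    USING (
        SELECT customer_id, SUM(amount) AS amount
        FROM orders_delta
        WHERE METADATA$ACTION = 'INSERT'   -- this sketch ignores deletes/updates
        GROUP BY customer_id
    ) s
    ON t.customer_id = s.customer_id
    WHEN MATCHED THEN UPDATE SET t.total = t.total + s.amount
    WHEN NOT MATCHED THEN INSERT (customer_id, total) VALUES (s.customer_id, s.amount);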

> instead of just shipping all the data over and letting Spark do the filtering

Forgive my ignorance, but in what capacity would this be less efficient? Doesn't it make more sense to reduce your result set before shipping it off to external compute?


To the first point, yes, in materialized views, thanks for correcting me. I perceive it as a limitation of their micropartitioning working against the actual goals of such a system, i.e. pushing even more work on data engineers to prep data.

And we're in agreement on the second part: my point is not that their solution isn't more efficient - it is - but that they're treating predicate pushdown as some kind of deep synergy between Spark and Snowflake.

That is, it's more the marketing aspect - clueless execs see "Snowflake and Spark work great together" and a box is checked.


Are you familiar with Alteryx and if so, do you see Snowflake eventually replacing them?


Alteryx will continue to exist as long as people use spreadsheets.


How does Snowflake compare to Databricks?


1) What is Snowflake?

2) What is their stack?


Buffett effect. It's insane.


[flagged]


There are many things that show up on HN that have names I don't recognize. When this happens, I'm excited! I get to learn about something I didn't know about before.

A simple Google search for "Snowflake" will immediately answer your question - both by the company being the top result, and by Google conveniently including a card with an overview of the company.

There's also plenty of stuff that shows up on HN that I don't care about because it's not relevant to me, but that doesn't mean it isn't relevant to the rest of the community.


If you don't know and don't care, you could always not comment...


The point of the comment is that it would take like 3 words to identify what the thing is, so that you could save everybody who doesn't know what it is from having to do a bunch of research to figure out whether it's interesting to them.

It's sloppy writing with no respect for the audience's time.



