Colossus for Rapid Storage (cloud.google.com)
243 points by alobrah 3 days ago | 115 comments





(This was posted last night with https://cloud.google.com/blog/products/compute/whats-new-wit... above. We've changed the URL to the product-specific article.)


Very cool! This makes Google the only major cloud that has low-latency single-zone object storage, standard regional object storage, and transparently-replicated dual-region object storage - all with the same API.

For infra systems, this is great: code against the GCS API, and let the user choose the cost/latency/durability tradeoffs that make sense for their use case.


> This makes Google the only major cloud that has low-latency single-zone object storage, standard regional object storage,

Absurd claim. S3 Express launched last year.


Sure, but AFAIK S3’s multi-region capabilities are quite far behind GCS’s.

S3 offers some multi-region replication facilities, but as far as I’ve seen they all come at the cost of inconsistent reads - which greatly complicates application code. GCS dual-region buckets offer strongly consistent metadata reads across multiple regions, transparently fetch data from the source region where necessary, and offer clear SLAs for replication. I don’t think the S3 offerings are comparable. But maybe I’m wrong - I’d love more competition here!

https://cloud.google.com/blog/products/storage-data-transfer...


> Sure, but AFAIK S3’s multi-region capabilities are quite far behind GCS’s.

Entirely different claim.


I claimed that Google is the only major cloud provider with all three of:

- single-zone object storage buckets

- regional object storage buckets

- transparently replicated, dual region object storage buckets

I agree that AWS has two of the three. AFAIK AWS does not have multi-region buckets - the closest they have is canned replication between single-region buckets.


Not quite the same, but S3 does have https://aws.amazon.com/s3/features/multi-region-access-point..., which lets you treat multiple buckets in different regions as a single bucket (mostly). But you still need to set up canned replication.
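
Roughly, once the access point and replication rules exist, the client side is just normal S3 calls with the MRAP ARN in place of the bucket name. A minimal sketch (the account ID and MRAP alias are placeholders, and SigV4A signing via the AWS CRT - pip install "boto3[crt]" - may be required):

  import boto3

  s3 = boto3.client("s3")

  # The Multi-Region Access Point ARN goes wherever a bucket name would.
  # The account ID and alias below are made up for illustration.
  mrap_arn = "arn:aws:s3::123456789012:accesspoint/mfzwi23gnjvgw.mrap"

  resp = s3.get_object(Bucket=mrap_arn, Key="path/to/object")
  data = resp["Body"].read()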

S3 doesn't have "transparently-replicated dual-region object storage", which was part of the claim.

S3 does have replication, but it is far from transparent and fraught with gotchas.

And it certainly doesn't have all of that with a single API.


Isn't S3 Express not the same API? You have to use a "directory bucket" which isn't an object store anymore, as it has actual directories.

To be honest I'm not actually sure how different the API is. I've never used it. I just frequently trip over the existence of parallel APIs for directory buckets (when I'm doing something niche, mostly; I think GetObject/PutObject are the same.)



The cross-region replication I’ve seen for S3 (including the link you’ve provided) is fundamentally different from a dual-region GCS bucket. AWS is providing a way to automatically copy objects between distinct buckets, while GCS is providing a single bucket that spans multiple regions.

It’s much, much easier to code against a dual-region GCS bucket because the bucket namespace and object metadata are strongly consistent across regions.
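
To illustrate: from the client's point of view, a dual-region bucket is just a bucket. A minimal sketch with the google-cloud-storage Python client (the bucket and object names here are made up):

  from google.cloud import storage

  # Reading from a dual-region bucket looks identical to reading from a
  # regional one; the replication is handled by GCS, not by client code.
  client = storage.Client()
  bucket = client.bucket("my-dual-region-bucket")   # placeholder name
  blob = bucket.blob("datasets/part-0001.parquet")  # placeholder object
  data = blob.download_as_bytes()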


The semantics they are offering are very different from S3. In Colossus a writer can make a durable 1-byte append and other observers are able to reason about the commit point. S3 does not offer this property.

Sure, but that's not what the parent said.

FYI, this was unveiled at the 2025 Google Cloud Next conference, and they're apparently also releasing a gRPC client for Rapid Storage, which appears to be a very thin wrapper over Colossus itself, since this is just zonal storage.

Struggling to find a definition, but seemingly zonal just means there's a massive instance per cluster.

Did find some interesting recent (March 28th, 2025) reads though!

Colossus under the hood: How we deliver SSD performance at HDD prices https://cloud.google.com/blog/products/storage-data-transfer...

I kind of thought you meant ZNS / https://zonedstorage.io/ at first, or its more recent, better, awesomer counterpart Host Directed Placement (HDP). I wish someone would please please advertise support for HDP; it sounds like such a free win, tackling so many write amplification issues for so little extra complexity: just say which stream you want to write to, and writes to that stream will go onto the same superblock. Duh, simple, great.


Delivering "HDD prices" is a bold claim there.

They charge $20/TB/month for basic cloud storage. You can build storage servers for $20/TB flat. If you add 10% for local parity, 15% free space, 5% in spare drives, and $2000/rack/month overhead, then triple everything for redundancy purposes, then over a 3-year period the price of using your own hard drives is about $115/TB while Google's price is $720. Over 5 years it's $145 versus $1200. And that's before they charge you massive bandwidth fees.
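
A back-of-the-envelope version of that math (my own sketch; the ~6,000 TB of raw capacity per rack is an assumption I added to reproduce the figures above):

  # Rough reproduction of the self-hosted vs. GCS math above.
  build_per_tb = 20 * 1.10 * 1.15 * 1.05  # $20/TB + parity, free space, spares
  build_per_tb *= 3                       # triple everything for redundancy
  rack_tb = 6000                          # assumed raw TB per rack
  overhead_per_tb_month = (2000 / rack_tb) * 3  # $2000/rack/month, tripled

  for months in (36, 60):
      self_hosted = build_per_tb + overhead_per_tb_month * months
      gcs = 20 * months
      print(f"{months} months: self-hosted ~${self_hosted:.0f}/TB vs GCS ${gcs}/TB")
  # -> roughly $116/TB vs $720 at 3 years, $140/TB vs $1200 at 5 years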


I like your comparison with self-built storage, but comparing $20/TB/month with other CLOUD offerings, we see:

* Hetzner Storage Box starts from $4/month for 1TB, and goes down to $2.40/TB/month if you rent a 10TB box.

* Mega starts from €10/month for 2TB, and goes down to €2/TB/month if you get a 16TB plan.

* Backblaze costs (starts from?) $6/TB/month.

I was looking for cheap cloud storage recently, so I have a list of these numbers :)

Moreover, these are not even the cheapest ones. The cheapest one I found (called Uloz, but I haven't tested them yet) has prices starting from $6.50 for 5TB, going down to $0.64/TB/month for plans of 25TB and up.

Also, looking at LowEndBox you can find a VPS in Canada with 2TB of storage for $5/month and run whatever you want there.

How does all that compare to $20/TB/month?!

Please feel free to correct me if I'm comparing apples to oranges, though. But I can't believe all of these offers are scams or so-called "promotional" offers that cost the companies more than you pay for them.


Thank you. So Backblaze for $6/TB a month. I could have a TB of data backed up safely against file corruption? I wonder how I missed that.

Now you could use it with a Synology NAS, and it's a lot cheaper than doing RAID 5 or ZFS/Btrfs with multi-drive redundancy.

I wonder if there are any NAS that do that automatically? Any drawbacks? Also wondering if the price could go down to $5/TB in a few years' time.


The price of Backblaze WAS $5 a few years ago and they increased it to $6 (and added some free bandwidth).

I'm still annoyed they increased the price for B2. Maybe "free" bandwidth gets people to use it more? But as far as their costs go, between the time they launched at $5 and the time they upped it to $6, hard drives (and servers full of hard drives) cost half as much per TB, with 1/4 as many servers needed for the same number of TB.

I get the impression that business has always been about being the best schmoozer more than about having the best product.

BTW at Hetzner you can rent servers with very large (hundreds of TB) non-redundant storage for an effective price of about $1.50/TB/month. If you want to build a cloud storage product, that seems like a good starting point - of course, once you take into account redundancy, spare capacity, and paying yourself, the prices you charge to your customers will end up closer to the price of Backblaze at a minimum.


>I get the impression that business has always been about being the best schmoozer more than about having the best product

and thus market efficiency feels like a myth. This feels most true when it comes to cloud services: they're way overpriced in multiple common cases at the big providers.


Yes, this is pretty much what Hetzner must have built with their object storage - and they get to 5 EUR/month, so really close to Backblaze pricing.

Of what you mentioned, only backblaze is similar (object storage with S3-like API), all others are apples to oranges.

You don't need very many terabytes to cover the labor cost of installing and maintaining an S3-compatible server program.

You need a very big cluster for it to be worth it though for non-backup use-cases when using HDDs.

You forgot paying yourself to set that up.

That's covered by the build and overhead numbers. But if you want more on the build side, an extra $10k of labor per rack of 9 servers only increases the cost per TB by about $4.

You're paying the same for "cloud engineers".

Also, don't forget the hidden cost/risk of giving a third party full access to your data.


Clicking yourself a bucket takes 5 minutes.

Building a server, keeping it secure and up to date, and fixing hardware issues takes real time.


Not to mention that I can:

- Create a bucket and store 1MB in it without any overhead

- Create 50 buckets with strong perimeters around them such that someone deleting the entire account doesn’t bring down the other 49

- Create a bucket and fill it with terabytes of data within seconds, without needing to wait for hardware to be racked and stacked

- Create a bucket, fill it with 2TB of data, and delete it tomorrow

Cloud is more than bare metal, but plenty of folks discount the cost benefits of elasticity.


I suspect the problem is that we're engineers in domains that have very different needs.

For example, I agree that elasticity is great. But at the same time, to me, it sounds like bad engineering. Why do you need to store terabytes of data and then delete it - couldn't it be processed continuously, streamed, compressed, restricted to changes only, and so on? A lot of engineering today is incredibly wasteful. Maybe your data source doesn't care and just provides you with terabyte CSV files, and you have no choice, but for engineers who care about efficiency, it reeks.

It might make a lot of sense in a highly corporate context where everything is hard, nobody cares, and the cost of inefficiency is just passed on to the customer (i.e. often governments and taxpayers). But the real problem here is that customers aren't demanding more efficiency.


The audit requirement alone gives you a lot of reasons to keep data, even if it gets downsampled one way or the other.

And plenty of use cases have natural growth. I do not throw away my pictures, for example.

Data also grows with users. More users, more 'live' data.

We have such a huge advantage with digital; we need to stop thinking it's wasteful. Everything we do digitally (pictures, finance data, etc.) is so much more energy and space efficient than what we had 20 years ago. We should not delete data just because we feel it's wasteful.


Leverage erasure coding for durability and avoid both the tripling and the local parity. You'll get better durability than 3x replication while taking up significantly less than 2x the space. Backblaze open sourced their library and talk about it here: https://www.backblaze.com/blog/Reed-Solomon. They use a 17-data/3-parity (20 shards total) layout that gets them 3-drive failure resistance for just a 1.17x stretch (i.e. a 100MB file gets that resilient while taking up only about 117MB of space).
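
A quick sanity check on those numbers (my own sketch, using the 17-data/3-parity split described in that post):

  # Storage stretch and fault tolerance for a Reed-Solomon (k data + m parity) layout.
  def rs_overhead(k: int, m: int) -> tuple[float, int]:
      """Return (stretch factor, number of shard/drive failures tolerated)."""
      return (k + m) / k, m

  stretch, failures = rs_overhead(k=17, m=3)
  print(f"17+3 shards: {stretch:.2f}x stretch, survives {failures} drive failures")
  # -> ~1.18x stretch (a 100MB file occupies ~118MB), vs 3x for plain triplication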

"Zonal" relates to the concept of "availability zones" which are the next-smallest unit below a (physical) "region."

Most instances of a cloud ___ created in a region are allocated and exist at the zonal level (i.e. a specific zone of a region).

A physical "region" usually consists of three or more availability zones, and each zone is physically separated from other zones, limiting the potential for foreseeable disaster events from affecting multiple zones simultaneously. Zones are close enough networking-wise to have high throughput and low latency interconnection, but not as fast as same-rack, same-cluster communications.

Systems requiring high availability (or replication) generally attain this by placing instances (or replicas) in multiple availability zones; systems with even higher availability requirements may use multi-region replication, which comes at greater cost.


It’s GCP’s answer to AWS S3 Express One Zone: https://aws.amazon.com/s3/storage-classes/express-one-zone/

In Google Cloud parlance, "regional" usually means "transparently master-master replicated across the availability zones within a region", while "zonal" means "not replicated, it just is where it is."

Slight nit: "zonal" doesn't necessarily mean "not replicated", it means that the replicas could all be within the same zone. That means they can share more points of failure. (I don't know if there's an official definition of zonal.)

NB: I am on the rapid storage team.


> Struggling to find a definition, but seemingly zonal just means there's a massive instance per cluster.

There are a number of zones in a region. Region usually means city. Zone can mean data center. Rarely, it just means some sort of isolation (separate power / network).


What on this page gives you that impression? Do I have to watch the 2-hour video to learn this?

Of course not. Gemini can summarize it for you.

I mean, sure, it can easily provide quick text summaries of this sort of thing, but I only consume ML summaries in the forms of podcast discussions between two simulated pundits, as God intended.

This could actually speed up some of my scientific computing (in some cases, data localization/delocalization is an important part of overall instance run-time). I will be interested to try it.

Had to go back to the classic microservices video as I was pretty sure they used Colossus but it was actually Galactus & Omega Star.

This is what OP is referring to in case you haven’t been enlightened https://youtu.be/y8OnoxKotPQ?si=JAK5iPMcG1yoAhiT

Glad to see the zonal object store take off. Such massive bandwidth will redefine data analytics, where 99% of all queries are able to run on a single node faster than what distributed compute can offer.

This link makes so much more sense than the previous link did.

SSDs with high random I/O speeds are a significant contributor to the advantage. I think the 20M writes per second are likely distributed over a network of drives to make that kind of speed possible.


Similar to S3 Express One Zone.

Is S3 Express One Zone performance greatly improved over standard S3, like GCP Rapid Storage? My understanding is that S3 Express One Zone is just more cost effective.

> 20x faster random-read data loading than a Cloud Storage regional bucket.


Update: just read this article[1], which clarifies S3 Express One Zone. Yes, performance is greatly improved, but storage actually costs 8x more than a standard S3 bucket. The name "S3 Express One Zone" is terrible and a bit misleading about the pricing changes.

[1] https://www.warpstream.com/blog/s3-express-is-all-you-need


I understand your belief that One Zone implies less expensive, but I’m staunchly in favor of them having it in the name so people know that their data is in a single AZ. The storage class succinctly summarizes faster with lower availability.

Fair. How about instead of S3 Express they call it S3 Max (One Zone)? It doesn’t take a rocket scientist to come up with good product names, just copy Apple. Though I suppose this is what happens when engineers are left to do the marketing. :-)

If Apple's so great at naming things, tell me (without looking) which is bigger/better/faster for their CPUs: Max or Ultra?

ha, ha, fair. Ultra. To be fair, I own a MacBook Pro M1 Max and Mac Mini M4 Pro and follow Apple products closely.

Yep, I love Apple, follow them closely, own a Mac Studio with an M3 Ultra and a MacBook Pro with an M4 Max, and it's still confusing. :)

I mean, surely a Mac Studio with an M4 Max must be the best, right? It's an entire CPU generation ahead and it's maximum! Of course, it's not... the M3 Ultra is the best.

Naming things is hard.


Yes, it’s horribly more expensive… I think you are thinking of S3 One Zone-Infrequent Access.

AWS just reduced prices on S3 Express One Zone today.

AWS claims 10x lower latency but I haven't personally checked.

I want Chubby as a service so I can throw etcd and ZooKeeper in the trash.

For some reason, text highlight didn't work, so here's the text-highlighted link: https://cloud.google.com/blog/products/compute/whats-new-wit...

That link doesn't work for me, so here's the relevant bit:

Rapid Storage: A new Cloud Storage zonal bucket that enables you to colocate your primary storage with your TPUs or GPUs for optimal utilization. It provides up to 20x faster random-read data loading than a Cloud Storage regional bucket.

(Normally we wouldn't allow a post like this which cherry-picks one bit of a larger article, but judging by the community response it's clear that you've put your finger on something important, so thanks! We're always game to suspend the rules when doing so is interesting.)


There's now another blog post about Rapid storage specifically: https://cloud.google.com/blog/products/storage-data-transfer... . (That wasn't up yet when the original post was made.)

Ah excellent—that's what we were waiting for. I've changed the URL to that from https://cloud.google.com/blog/products/compute/whats-new-wit... above. Thanks!

Apologies! First time making a post on Hacker News, and I thought this was really exciting news. FWIW, I talked to the presenter after this was revealed during the NEXT conference today, and he seemed to imply that zonal storage is quite close to what Google has internally with Colossus.

Oh no, don't apologize - this was a case where you did exactly the right thing and I'm glad you posted!

(I was just adding some explanation for more seasoned users who might wonder why we were treating this a bit differently.)

Also, welcome to posting on HN and we hope you'll continue!


The gods strip off interesting bits of URLs when you submit it

if you saw that code you wouldn't deify it

It took me 4-5 attempts to not read:

> If you saw that code, you wouldn't _defy_ it


Moloch was also a god!

Is this related at all to the private, invite-only Anywhere Cache? (Or maybe it's GA now?)

https://cloud.google.com/storage/docs/anywhere-cache


Anywhere Cache and Rapid Storage share some infrastructure inside of GCS and both are good solutions for improving GCS performance, but Anywhere Cache is an SSD cache in front of the normal buckets while Rapid Storage is a new type of bucket.

(I work on Google storage)


Can you expand a bit on when it would make sense to use one versus the other?

Anywhere Cache shines in front of a multi-regional bucket. Once the data is cached, there are no egress charges and there's much better latency. This is great for someone who looks for spot compute capacity to run computations anywhere in the multi-region. It will also improve performance in front of regional buckets, but as a cache, you'll see the difference between hits and misses.

Rapid Storage will have all of your data local and fast, including writes. It also adds the ability to have fast durable appends, which is something you can't get from the standard buckets.


Is it like PureStorage?

There's a detailed blog post about Rapid Storage now available, see https://news.ycombinator.com/item?id=43645309

(I work on Google storage)


Thanks! I've changed the URL of the current thread and re-upped this one. More at https://news.ycombinator.com/item?id=43646209.

Super interesting! Rapid Storage especially, very useful, but that first line:

"Today's innovation isn't born in a lab or at a drafting board; it's built on the bedrock of AI infrastructure. "

Uhh... no. Even as an AI developer I can tell that is some AI comms person tripping over themselves.


Everyone needs to learn to use a single, unique, unambiguous URL for new product announcements like this.

Google aren't the only company that consistently messes this up, but given how they built a $1.95 trillion company on top of crawling URLs on the web, they really should have an internal culture that values giving things unique URLs!

[I had to learn this lesson myself: I used to blog "weeknotes" every week or two where I'd bundle all of my project announcements together and it sucked not being able to link to them as individual posts]


Google's not really at fault here: the OP submitted a link to an article called "Introducing Ironwood TPUs and new innovations in AI Hypercomputer" that happens to mention Rapid Storage way down the page.

In this case, marketing seems to have moved faster than documentation, though, since I can't find any mention of this in the main GCS docs. https://cloud.google.com/search?hl=en&q=rapid%20storage


It's down here too: https://cloud.google.com/products/storage

No link and no details though.


They revealed it's in private preview atm ;)

Hats off to whoever convinced management that selling Colossus via cloud was Artificial Intelligence. Bravo.

They're not saying that it's AI. They're saying it's for customers who do AI. Training means lots and lots of reads from a big data store, and if you're reading from, like, big Parquet files, that probably means lots of random reads. This is for that. Speedier data access, presumably at the cost of durability and availability, which is probably a great trade-off for people doing ML training jobs.

calling everything 'for AI' is the new standard

>if you're reading from, like, big Parquet files, that probably means lots of random reads

and it also usually means that you shouldn't use S3 in the first place for workloads like this, because it is usually very inefficient compared to a distributed FS. Unless you have some prefetch/cache layer, you will get both bad timings and higher costs.


But a distributed FS is far more expensive than cloud blob storage would be, and I can't imagine most workloads would need the features of a POSIX filesystem.

I don't fault them for this at all. AI isn't possible without the full infra stack, which clearly includes storage (and compute, and networking, and data pipelining, and and and...). There's an entire ecosystem of ISVs that only do one of these things, very well: Pure Storage, for example, or Lambda or CoreWeave, or Confluent (Kafka + Flink with LLM integration). While it might be more precisely accurate to say "AI-enabling" tech, I'll give them a pass.

I think the joke here is that somehow management refused to sell Colossus (which is such an obvious nice product just like BigQuery) before and it takes "AI" to convince them.

> which is such an obvious nice product just like BigQuery

I always assumed (from outside Google) that the problem was that Colossus had to make a "no malicious actors" assumption in its design in order to make the performance/scaling guarantees it does; and that therefore just exposing it directly to the public would make it possible for someone to DoS-attack the Colossus cluster.

My logic was that there's actually nothing forcing [the public GCP service of] BigTable to require that a full copy of the dataset be kept hot across the nodes, with pre-reserved storage space — rather than mostly decoupling origin storage from compute† — unless it was to prevent some DoS vector.

As for exactly what that DoS vector is... maybe GC/compaction policy-engine logic? (AFAICT, Colossus has pluggable "send compute to data" GC, which internal-BigTable and GCS both use. But external-BigTable forces the GC to be offloaded to the client [i.e. to the BigTable compute nodes the user has allocated] so that the user can't just load down the system with so many complex GC policies that the DC-scale Colossus cluster itself starts to fall behind its GC time budget.)

---

† Where by "decouple storage from compute", I mean:

• Each compute node gets a fixed-sized DAS diskset, like GCE local NVMe SSDs;

• each disk in that diskset gets partitioned up at some fixed ratio, into two virtual disksets;

• one virtual diskset gets RAID6'ed or ZFS'ed together, and is used as storage for non-Colossus-synced tablet-LDB nursery level SSTs;

• the other virtual diskset gets RAID0'ed or LVM-JBOD-ed together and is used as a bounded-size LFU read-through cache of the Colossus-synced tablets — just like BigQuery compute nodes presumably have.

(AFAIK the LDB nursery levels already get force-compacted into "full" [128MiB] Colossus-synced tablets after some quite-short finality interval, so it's not like this increases data loss likelihood by much. And BigTable doesn't guarantee durability for non-replicated keys anyway.)


> a "no malicious actors" assumption in its design in order to make the performance/scaling guarantees it does

I haven't thought deeply about it, but could this be solved with more nuanced billing design?


That’s an actually impressive level of spin. Flashback to when shipping companies were slapping blockchain on international container shipments.

Flashback to the cabbage on the blockchain, or when they wanted to tag each wild-caught fish as well for traceability!

You gotta feed the GPUs and TPUs with enough data to avoid them sitting idle, which starts to become incredibly challenging with the latest gen GPU/TPU chips.

Maybe they can start selling Capacitor as a file format for storing LLM metadata or something.

I would pay serious money if they sold CFS as a service but on AWS.

Hi, I'm looking for a job. Are you willing to pay me serious money to set up CFS as a service on your AWS?

Obviously not, since you could not deliver it. It seems that you maybe don't realize what CFS is in this context, and are thinking of something else that you could just "set up"?

What jeffbee is talking about is Google's proprietary Colossus File System, and all its transitive dependencies.


I meant it sarcastically, but for "serious money" you can have any software system you can dream of. You have to dream of it, though - that's one of the hard parts.

It looks like every other clustered file system. What's special about Google's Colossus?


There are some semantic differences compared to POSIX filesystems. A couple big ones:

  - You can only append to an object, and each object can only have one writer at a time. This is useful for distributed systems - you could have one process adding records to the end of a log, and readers pulling new records from the end.
  - It's also possible to "finalize" an object, meaning that it can't be appended to any more.
(I work on Rapid storage.)
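
To make the append/finalize model concrete, here's the kind of single-writer log you could build on top of it. This is a hypothetical sketch: the bucket interface and its create_appendable_object/append/finalize methods are invented names for illustration, not the actual Rapid Storage client API.

  import json

  class LogWriter:
      """Hypothetical single-writer, append-only log on an appendable object.

      The bucket interface used here (create_appendable_object, append,
      finalize) is invented for illustration and is not the real API.
      """

      def __init__(self, bucket, name: str):
          self.obj = bucket.create_appendable_object(name)  # hypothetical call

      def append_record(self, record: dict) -> int:
          line = (json.dumps(record) + "\n").encode()
          # A durable append returns the committed offset, so readers can
          # safely consume everything below that point.
          return self.obj.append(line)                      # hypothetical call

      def close(self) -> None:
          # Finalizing makes the object immutable; no further appends allowed.
          self.obj.finalize()                               # hypothetical call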

Why would you wish for a system with constraints like that, which other systems don't have?

Other systems don't offer the performance that Colossus offers, is why. POSIX has all kinds of silly features that aren't really necessary for every use case. Throwing away things like atomic writes by multiple writers allows the whole system to just go faster.

It sounds like you have to find a design that meets your performance target and usage patterns - just like anything else. It also sounds like Google's CFS is a grass-is-greener situation: you heard Google had something that solved the problem you have, so you want it. But the reason it sounds good, compared to the other designs, is that you haven't had to actually use it and run into its quirks yet.

Everything is AI these days. Does it still need convincing?

Ha, right after I read your comment, I looked at the bottom of this Hacker News page and saw their "Join us for AI Startup School" ad.

Reading the press release about the "Hypercomputer" and I can't tell what part of this is real and what part is marketing.

They say it comes in two configurations, 256 chips or 9,216 chips. They also say that the maximal configuration of 9,216 chips delivers 24x the compute power of the world's largest supercomputer (which they say is called El Capitan). They say that this comes to 42.6 exaFLOPs.

This implies that the 9,216 chip configuration doesn't actually exist in any form in reality, or else it would now be the world's largest supercomputer (by flops) by a huge margin.

Am I massively misunderstanding what the claims being made are about the TPU and the 42.6 exaFLOPs? I feel like this would be much bigger news if this was fully legit.

Edit: The flops being benchmarked are not the same as regular supercomputer flops.


Supercomputers are measured based on 64-bit floating point (FP64) operations. Here they (inaptly) compared their 8-bit floating point (FP8) operations, which are only useful for AI workloads, against El Capitan's FP64 number - roughly 1.7 exaFLOPs - which is where the 24x comes from.

Gotcha. That makes a lot more sense. I was led to believe by the wording of the comparison that they were the same operations. Appreciate the explanation.

Why is it inapt?

If all you care about is an 8-bit AI workload (there's definitely a market for that), it's nice to have 24x the speed.


It's an apples to oranges comparison.

It's apples to apples if you care about 8-bit (a lot of people do these days).

AFAIK, there wasn't a faster 8-bit supercomputer to compare to - which is why they made the comparison.


Also, the set of supported/accelerated operations in the fastest path is different depending on whether you use 8-, 16-, or 32-bit floats, hence the common use of "TOPS" as a benchmark number recently.

Terrifyingly complicated and buzzword packed. I really don't know what to make of any of this or what it does, and I work with AI applications in my day job.

I'm guessing the $300 of Google Cloud credit offered in this webpage wouldn't go very far using any of this stuff?


You can try out everything for $300 easily. The most expensive thing you can do is get a server with 8 H200s and spend $90 an hour.

Like with any other new Google product, better wait a few years to see if it sticks before investing in its usage. In most cases, you'd be better off searching for an alternative from the start.

[flagged]


Please don't do this here.

If you want object storage faster than S3 Express One Zone or GCP Rapid Storage, without the zonal limitation, check out ACS: https://acceleratedcloudstorage.com

You can bring data in and out of the GPU quickly and improve utilization.





