What if mass storage were free? – George Copeland (1980) [pdf] (acm.org)
45 points by thunderbong on Dec 1, 2023 | 41 comments



Prices have never been cheaper, and yet deletion strategies remain important. The flaw in the "storage is basically free" assumption is twofold: data creation grows faster than prices drop, and "garbage" data can have performance implications. Cloud storage providers love it if you never delete data, because they charge you more than it costs them; internally, though, they need to delete data you've asked them to delete carefully and quickly, because once you're no longer billed for it, it's pure cost.


I don't disagree, but this doesn't address that "if it were free" design point.

Obviously retrieval isn't free, and we only have so much write bandwidth. But a GC might be able to use this in pretty neat ways. You could page out allocations that you think aren't going to be used, or will rarely be used; if you write it to storage, you don't have to be sure. That kind of gives you infinite extra memory (the GC dream), so you could do things like project the same data structure in multiple ways depending on how it is being accessed (AoS, SoA, projected subsets, columnar, etc.).
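
As a toy sketch of that last point (the names are invented, nothing here is from the paper), the same records can be laid out per-record or per-field depending on access pattern:

    // Toy sketch: the same million points held two ways.
    public class Layouts {
        // Array-of-structs: natural when you touch all fields of one record.
        static class Point { double x, y; }
        static Point[] aos = new Point[1_000_000];

        // Struct-of-arrays (columnar): natural when you scan one field
        // across all records, e.g. summing every x without loading y.
        static double[] xs = new double[1_000_000];
        static double[] ys = new double[1_000_000];

        public static void main(String[] args) {
            double sum = 0;
            for (double x : xs) sum += x;   // the columnar scan touches only xs
            System.out.println(sum);
        }
    }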

We should be thinking about how to have computers decide that deletion strategy for us.


We had full daily backups of our application stretching back to 2017, totaling six terabytes of cloud storage.

It got to the point where it cost $200/mo just so we could theoretically restore the application as it was on June 5, 2018.

I think deletion strategies will always be relevant from a “staying organized” standpoint, as well as cost.


Even if you specifically want to keep it in cloud storage, a cold storage tier would cost around $6/mo.
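
Rough numbers behind that figure (assuming a deep-archive tier priced at roughly $0.001 per GB-month, which is the right ballpark for the cheapest tiers):

    6 TB ≈ 6,000 GB × ~$0.001/GB-month ≈ $6/month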

This is not an example where you need to delete. If the value is zero, go ahead, but it's sustainable to keep that data around.


You can buy 20 TB drives like popcorn today.


If it's important, you don't want to just write it to a hard drive and chuck it in a closet.


True. Get two drives. One you chuck in a closet, the other you store off site.


Basically he's describing immutable storage and what we now call an append-only log DB backend.

Quite some foresight, at a time when microcomputers persisted data on audio tapes and Sinclair launched a computer with a custom chassis, keyboard, PCB, and 3.5 MHz Z80 CPU, yet chose to include only 1 kB of RAM to keep costs low.
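
For what it's worth, here's a minimal sketch of the append-only idea in Java (the class, file format, and key/value scheme are made up for illustration, not taken from the paper): writes only ever append, and the current value of a key is simply its latest entry in the log.

    import java.io.IOException;
    import java.nio.file.*;
    import java.util.List;

    // Minimal append-only log: every write adds a record, nothing is overwritten.
    public class AppendLog {
        private final Path file;

        AppendLog(Path file) { this.file = file; }

        void put(String key, String value) throws IOException {
            Files.writeString(file, key + "=" + value + System.lineSeparator(),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }

        // The current value of a key is its last occurrence in the log;
        // a "delete" would just be one more record superseding the older ones.
        String get(String key) throws IOException {
            List<String> lines = Files.readAllLines(file);
            for (int i = lines.size() - 1; i >= 0; i--) {
                if (lines.get(i).startsWith(key + "=")) {
                    return lines.get(i).substring(key.length() + 1);
                }
            }
            return null;
        }
    }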


Dijkstra once wrote, back when the discipline of CS was itself much younger, something to the effect of "how are we supposed to teach our students things that will last their lifetimes?"

(ie if today's kids are ~20, what could we teach that will still be relevant for computing in ~2070?)


The fundamentals and concepts haven't changed much at all, and probably won't for a very, very long time. If you have a good handle on those, everything else is relatively easy to pick up -- even the really new stuff.

What concerns me about new CS grads is that they're not only lacking a lot of the fundamentals, they sometimes even argue that learning them isn't useful.


`curl http://fundamentals.io | sudo bash -`. checkmate, old man.

edit: forgot `curl -k`. like anyone has time to deal with those cert errors.


Discrete mathematics and calculus.

Also, likely Java. I bet there will still be Java code running in 2070.


I use discrete math quite often, but rarely calculus—at least nothing more complicated than knowing what integrals and derivatives are (not how to actually calculate them). I mainly work at the application level, though: understanding business processes and other "soft" skills are much more relevant than advanced math.

I fully expect some companies to still be using Java 8 in fifty years.


Having internalized what these things are and how they work in general is not to be discounted. The usefulness of education is often not in the details of the rules but in this internalizing: you now know about a concept, and it will come to mind, without you even noticing, when something relevant passes in front of you. Even if only in business: an interest rate or commission is not a small number each time; it adds up. Growth rate is properly "compounded growth rate". Etc.
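
A quick worked example of the "adds up" point (numbers picked for illustration):

    a 2% fee applied twelve times compounds to 1.02^12 - 1 ≈ 26.8%, not 2% × 12 = 24%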


That’s a good point. Taking calculus and other courses definitely helped to provide a good foundation. But, for anyone struggling in those courses and wondering if they can make it in this industry: a barely-passing grade in Calculus II won’t be a career-ender. :-)


Eugenia Cheng has a funny quip about how the US is the only place that studies calculus as hard as we do. It sounds like she would like it to be a footnote, and not the main course.


I bet there will still be COBOL code running in 2070.


Haven't most of these codebases moved to C#/Java over the past 20 years? I feel like Cobol is truly a thing of the past, even for your average old-school bank/insurance behemoth, but then I might live in a bubble.


Some has, but there's still a very large and active COBOL installed base, and there's still active COBOL development taking place.

In fact, COBOL devs tend to be better paid these days, because they're critical but there are fewer of them.

The deal is that companies who rely on such software have a solid, time-proven, solution. Switching that out just to change to a different language would be irresponsibly risky.


Does Java (or its programmers) know how to represent decimal numbers and fractions at the machine level?

COBOL is used in banking because it has natively supported decimal arithmetic since the 70s or some crap, and no other language really bothers to be a COBOL replacement.

Banking / insurance / etc. are on the dollar/penny system. They need 0.01 to be exactly 0.01, and not 0.0099999997 or whatever double precision rounds it to.

And remember, there are fractions of a penny. Ex: $15.097 could be a real price that needs to be exactly calculated.

-------

If this crap hasn't been figured out in the last 20 years, why would Java or C# programmers try to solve it in the next 20 years?

It's more likely that the old COBOL code will just keep running along than that it gets ported to a language that doesn't even meet your legal requirements.


In Java, BigDecimal (https://docs.oracle.com/en/java/javase/21/docs/api/java.base...) is the standard. It's used widely in banks around the world. In Python, you have Decimal. In C#, Decimal also works great.

It's not that COBOL has a particular edge over "modern" languages; it just has the legacy behind it.
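
A minimal Java sketch of the difference being discussed (the class name and prices are made up for illustration):

    import java.math.BigDecimal;
    import java.math.RoundingMode;

    public class PennyMath {
        public static void main(String[] args) {
            // Binary floating point cannot represent most decimal fractions exactly:
            System.out.println(0.1 + 0.2);              // 0.30000000000000004

            // BigDecimal keeps an exact decimal representation,
            // including sub-penny prices like $15.097:
            BigDecimal price = new BigDecimal("15.097");
            BigDecimal total = price.multiply(new BigDecimal("3"));
            System.out.println(total);                  // 45.291, exact

            // Round to whole pennies only when required, with an explicit rule:
            System.out.println(total.setScale(2, RoundingMode.HALF_EVEN)); // 45.29
        }
    }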


> "If this crap hasn't been figured out in the last 20 years, why would Java or C# programmers try to solve it in the next 20 years?"

C# has had System.Decimal since .NET Framework 1.0, over 20 years ago: https://learn.microsoft.com/en-us/dotnet/api/system.decimal?... - "Decimal value type is appropriate for financial calculations that require large numbers of significant integral and fractional digits and no round-off errors."


Python's decimal library does this pretty well. https://docs.python.org/3/library/decimal.html


There probably will be Java code running in 2070. As well as Python, C, C++, COBOL, Fortran, etc.


Concepts.

What changes over time is syntax, but most of the concepts remain.

Source: 30+ year SysAdmin.


Including mass storage not being free :(


If machines are still, at their heart, Turing-style tape machines that follow instructions, as in assembly.

Or perhaps we'll all be encoding behaviours as activations of vectors through English-language prompting.


Is there even any online resource that defines these "fundamentals" on a widely-agreed-upon basis, and focuses on only said fundamentals as a purpose-built, highly specific resource?

If so, it’s only a Google search away for these young’uns.

Or, as a mangled quote attributed to Einstein goes, “Never memorize what you can look up in books.”


"fundamentals of computer science" is a pretty disappointing G search, at least in my bubble.

(I currently believe "it's all quantales, what's the problem?" is a defensible proposition, but suspect that this viewpoint may be reminiscent of Mathematics Made Difficult)


Dijkstra's deep-seated pessimism about the state of computing and software quality seems evergreen, two decades after his death.


grep


One thing the "no deletion" argument misses is that sometimes you have to delete data for policy reasons. At least two cases are important:

* Users ask you to delete their data. If you don't, and they find out you didn't, you have a problem.

* Legal action may require you to delete data. (E.g. you may find that someone uploaded child pornography to your system.)

This is actually a huge problem for companies like Google (where I work). When you have enormous volumes of highly reliable and durable (i.e. replicated) storage, it's actually really hard to make sure you can delete all copies of specific data reliably and quickly.


While acknowledging you're only addressing the Copeland paper (and not Endatabas, where the OP found it), here's the Endatabas solution to this problem:

https://www.youtube.com/watch?v=oDHGjUMqPvI&t=129s

Apologies for the hijack. :)


> here's the Endatabas solution to this problem:

Where?

The narrator says "Endb supports ERASE. Mustard is gone." and then moves on to another topic entirely.

This is right after they said the data was immutable.

What does ERASE actually do? Does it wipe the old bytes? Does it add a tombstone that could be bypassed?


I get the impression this was discovered in the Endatabas bibliography, since the same user just posted a link to the quickstart.

https://www.endatabas.com/bibliography.html

...Copeland's paper is a fun and inspirational read. If you enjoy that, you'll probably enjoy other papers from this list.


Did you use it, or are you involved in the project?

I wonder how it compares with Postgres temporal tables, or with just adding an `entity_history` table somewhere. Or is the timeline data more intrinsic to the DB design in this one?


I'm involved in the project.

The temporal columns are intrinsic to Endb, but they are completely optional. By default, Endb queries run as-of-now, which then return the same results one would expect from a regular Postgres database.

Postgres temporal tables can't make Postgres natively aware of time, so temporal queries tend to be awkward, even if you want the default as-of-now result.

There are temporally-aware databases (SAP HANA, MySQL, SQL Server), but they all treat time as an additional concept layered on top of SQL-92 via SQL:2011. It's difficult for a mutable database to assume immutability or a timeline without becoming another product.

`entity_history` and similar audit tables aren't comparable at all, since they don't even involve the same entity/table, which means all querying of history is manual. Indexing of audit tables is at least a bit easier than the SQL:2011 temporal solutions mentioned above, though.

In all these cases, schema is still an issue that needs to be resolved somehow, since (again) incumbent relational databases assume a Schema First approach to table layout. Endb is Schema Last and allows strongly-typed nested data by default.

The Endb demo is pretty recent, and explains all of this in more detail, with examples:

https://www.youtube.com/watch?v=oDHGjUMqPvI


InterBase[1] was a popular database at one point; the appealing feature to me was that it didn't overwrite data, but kept versions of it, with the pointer to the current version being the only value actually overwritten on disk.

Such a system could be highly useful these days, in this era of almost infinite storage.

[1] https://en.wikipedia.org/wiki/InterBase
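
A toy sketch of that versioning idea (invented names, and nothing to do with InterBase's actual on-disk format): updates append a new version, and only the pointer to the current one moves.

    import java.util.ArrayList;
    import java.util.List;

    // Each logical row keeps every version ever written; only `current` moves.
    public class VersionedRow {
        private final List<String> versions = new ArrayList<>();
        private int current = -1;

        void update(String newValue) {
            versions.add(newValue);        // old versions are never overwritten
            current = versions.size() - 1; // the only "in place" change
        }

        String read()             { return current < 0 ? null : versions.get(current); }
        String readVersion(int v) { return versions.get(v); } // time travel
    }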


Depends on whether that storage will last for millennia (and still be readable with simple technology) or for a decade.

I worry about what percentage of valued data (digitizations of valued objects, followed by disposal of the originals) will remain in 50 years.

I think of the Eloi libraries (whether Pal's disaster or those of H.G. or Simon Wells). We can find only faint echoes of the most profound 'Ancient Greek' texts.


Can someone explain how an organization like the NSA manages to keep records on every digital transaction (supposedly)? This seems like an impossible physics problem, because the amount of data seems unfathomable even for an organization like the US government. Pre-Snowden you could say that thinking was a conspiracy theory. Now we know that not only is it not a conspiracy, it's probably much worse.

How about YouTube? The amount of video uploaded daily keeps increasing, yet they manage to store everything, essentially on demand, going back to the first YouTube video. At the end of the day the data must be fetched from a hard drive somewhere... right? Are they buying thousands of HDDs daily?


Regarding YouTube, I would assume yes. At this scale, each day you remove many broken servers and add many more new ones.
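
Back-of-envelope, using the commonly cited figure of roughly 500 hours of video uploaded per minute, and assuming ~1 GB per hour for a single encoded copy (both rough assumptions):

    500 h/min × 1,440 min/day ≈ 720,000 hours/day
    720,000 h × ~1 GB/h ≈ 720 TB/day for one copy
    × several resolutions and replicas → a few PB/day
    ÷ ~20 TB per drive → hundreds of new drives a day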



