Hacker News | benrutter's comments

My grandma used to make stinging nettle soup a lot (I think a Swedish thing) - it tasted great, mostly like spinach and vegetable stock! Always feel nostalgic whenever I see it anywhere.

> P2P terminal game about spacepirates playing basketball across the galaxy

Absolutely fantastic description right there.


There are a lot of good takes here about the link between happiness and intelligence.

I think it's also worth saying that both happiness and intelligence are very loose concepts, and few people should be convinced we can measure them well.

So I guess my rhetorical question is, if smart people were happier, would we even know?


I love the idea of hypothesis! I haven't found a lot of use cases for it yet - I think the quick start example helps explain why.

Essentially, you're testing that "my_sort" returns the same as python's standard "sort". Of course, this means you need a second function that acts the same as the function you wrote. In real life, if you had that you probably wouldn't have written the function "my_sort" at all.

Obviously it's just a tiny example, but in general I often find writing a test that will cover every hypothetical case isn't realistic.

Anyone here use this testing in the wild? Where's it most useful? Do you have the issue I described? Is there an easy way to overcome it?


What you're talking about is using an oracle (a different implementation of what you're testing). It's an option for property (or exhaustive) testing, but is by no means a requirement. Even for a sort function there are plenty of properties you can check without needing an oracle, e.g.

- that the output sequence is the same length as the input

- that the output sequence is sorted

- that the population counts are the same in the input and the output

> In real life, if you had that you probably wouldn't have written the function "my_sort" at all.

Having a generalised sort doesn't mean you can't write a more specialised one which better fits your data set e.g. you might be in a situation where a radix sort is more appropriate.


> Having a generalised sort doesn't mean you can't write a more specialised one which better fits your data set e.g. you might be in a situation where a radix sort is more appropriate.

The opposite might also be true. Suppose you already have a specialised implementation, and you want to write a new generalised one. You can still test them against each other.

E.g. suppose you are writing a library that supports sorting crazy huge datasets that don't fit into memory. You can still check that it gives the same answers as the built-in sorting algorithm from the standard library on inputs that are small enough to fit into memory.
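A hedged sketch of that idea, where external_sort is a hypothetical stand-in for the out-of-core implementation:

    from hypothesis import given, strategies as st

    @given(st.lists(st.integers(), max_size=1000))
    def test_external_sort_matches_builtin(xs):
        # on inputs small enough to fit in memory, the out-of-core sort
        # must agree with the standard library's sorted()
        assert list(external_sort(xs)) == sorted(xs)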


I've only used PBT a few times, but when it fits it's been extremely useful. A concrete example from my practice of what has been pointed out in this thread - that you want to test properties of your function's output rather than the output itself: I was implementing a fairly involved algorithm (the Zhang-Shasha edit distance algorithm for ordered trees), and PBT was extremely useful in weeding out bugs. What I did was write a function that generated random tree structures in the form I needed for my code, then tested the four properties that all distance functions should have:

1. d(x, x) = 0 for all x

2. d(x, y) >= 0 for all x, y

3. d(x, y) = d(y, x) for all x, y

4. d(x, z) <= d(x, y) + d(y, z) for all x, y, z (the triangle inequality)

Especially the triangle inequality check weeded out some tricky corner cases I probably wouldn't have figured out on my own. Some will object that you're not guaranteed to find bugs with this kind of random-generation strategy, but if you blast through a few thousand cases every time you run your test suite, and the odd overnight run testing a few million, you quickly get fairly confident that the properties you test actually hold. Of course any counterexamples the PBT finds should get lifted to regression tests in addition, to make sure they're always caught if they crop up again. And as with any testing approach, there are no guarantees, but it adds a nice layer of defense in depth IMO.
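Roughly, the test could look like the sketch below, where random_trees() stands in for the custom composite strategy and tree_distance for the Zhang-Shasha implementation (both are hypothetical names):

    from hypothesis import given

    # random_trees() and tree_distance are hypothetical stand-ins for the
    # custom composite strategy and the edit distance implementation
    @given(random_trees(), random_trees(), random_trees())
    def test_distance_is_a_metric(x, y, z):
        assert tree_distance(x, x) == 0                    # identity
        assert tree_distance(x, y) >= 0                    # non-negativity
        assert tree_distance(x, y) == tree_distance(y, x)  # symmetry
        # triangle inequality
        assert tree_distance(x, z) <= tree_distance(x, y) + tree_distance(y, z)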


> Anyone here use this testing in the wild? Where's it most useful? Do you have the issue I described? Is there an easy way to overcome it?

One example would be when you have a naive implementation of some algorithm and you want to introduce a second one - optimized, but with a much more complex implementation. The naive one then acts as a model for comparison.

Another case that comes to mind is when you have rather simple properties to test (like: does it finish without a crash within a given time? does the output stay within some bounds?) and want to easily run over a sensible set of varying inputs.


I use hypothesis to test that the happy path is really happy. I use it to define the set of expected inputs for a function, and it looks for the nastiest possible sets of inputs. It's kind of the opposite of how I write ordinary tests, where I pick challenging but representative examples.

Often I'm just checking some basic property, like "function f is the inverse of function g" for round-tripping data through a different representation. Sometimes it's just "function f does not divide by zero".

Hypothesis seems to have a library of corner cases that it checks first. If you don't say you can't handle floats above 2^53, it will try 2^53, for instance. (That's the threshold where doubles stop being able to represent every integer.) Often the result of running hypothesis is not a code change but a more precise description of what inputs are valid.
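A tiny sketch of the round-trip style, with encode and decode as hypothetical stand-ins for the two representations:

    from hypothesis import given, strategies as st

    @given(st.text())
    def test_round_trip(s):
        # encode/decode are hypothetical; the property is just g(f(x)) == x
        assert decode(encode(s)) == s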

> Essentially, you're testing that "my_sort" returns the same as python's standard "sort". Of course, this means you need a second function that acts the same as the function you wrote. In real life, if you had that you probably wouldn't have written the function "my_sort" at all.

You don't need that; it's just a convenient shortcut for when you do have such a function (but, I would argue, a not-great example for precisely that reason).

All you actually need is the ability to verify the property (or set of properties) that you are testing in the output. You could replace the test in that sample with something like this, which I think is a better illustration of what is going on (this is somewhat simplified: rather than just the length test, you really want to check that it has the same elements):

  @given(st.lists(st.integers() | st.floats(allow_nan=False)))  # NaN would break the checks below
  def test_sort_correct(lst):
    result = my_sort(lst)
    
    # result has the same unique items as original list
    assert set(result) == set(lst)

    # for each item, result has the same number of copies as original
    assert all(
      result.count(item) == lst.count(item)
      for item in set(result)
    ) 

    # result is sorted
    assert all(result[n] <= result[n+1] for n in range(len(result)-1))

I've used hypothesis in practice for testing a slugify function. When someone signs up for our product, our software produces a subdomain based on the organisation name they input. People put all sorts of weird things in that field, and we want to be pretty broad in what we'll accept.

The hypothesis test for this is near trivial. Pass some text to the slugify function, and check that it either a) throws an invalid input error (e.g. input too short, etc.) or b) returns a string that's a valid domain name.
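A rough sketch of that test - slugify and InvalidOrgName are hypothetical stand-ins for the real code, and the regex only approximates a valid DNS label:

    import re
    from hypothesis import given, strategies as st

    # rough approximation of a valid DNS label: 1-63 chars, lowercase
    # alphanumerics and hyphens, no leading/trailing hyphen
    VALID_LABEL = re.compile(r"^[a-z0-9]([a-z0-9-]{0,61}[a-z0-9])?$")

    @given(st.text())
    def test_slugify_gives_valid_subdomain_or_rejects(name):
        try:
            slug = slugify(name)        # hypothetical function under test
        except InvalidOrgName:          # hypothetical "invalid input" error
            return                      # rejecting the input is an acceptable outcome
        assert VALID_LABEL.match(slug)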

Doing this found me so many odd little edge cases around length truncation and idna encoding. The one bug I would've never found myself is the fact that lowercasing "ß" turns it into "ss" which broke truncation, resulting in an invalid domain only if it includes ß and is exactly 64 characters long.

hypothesis is a pretty niche tool, but boy when it's the right tool it's the right tool.


Indeed. For a sorting algorithm it would be more insightful to test the actual property of something that is sorted: every consecutive element is larger than or equal to the previous one (or the other way around). You don't need a sorting function to test the "sorted" property.

Imho it's just not a great example. A better example would be testing the invariants of the function. "For any list of numbers, my_sort does not throw an exception" is trivial to test. "For any list of numbers, in the list returned by my_sort the item at index n is no larger than the item at index n+1" would be another. Those two probably capture the full specification of my_sort, and you don't need a sort function to test either. In a real-world example you would more likely be testing just some subset of the specification - the things that are easy to assert.

Always returning the empty list meets your spec.

Good point. I suppose we should add "number of input elements equals number of output elements" and "every input element is present in the output". Translated into a straightforward test that still allows my_sort([1,1,2]) to return [1,2,2] - but we have to draw the line somewhere.

Just use Counter, and if the objects aren't hashable, use the count of IDs. Grab this before calling the function, in case the function is destructive. Check it against the output.

Add in checking each item is less than or equal to its successor and you have the fundamental sort properties. You might have more, like stability.
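Something like this sketch, with my_sort as the hypothetical function under test:

    from collections import Counter
    from hypothesis import given, strategies as st

    @given(st.lists(st.integers()))
    def test_sort_fundamentals(xs):
        # grab the counts before the call, in case the function is destructive;
        # for unhashable items you could count id(x) instead
        before = Counter(xs)
        ys = my_sort(xs)
        assert Counter(ys) == before                    # same items, same multiplicities
        assert all(a <= b for a, b in zip(ys, ys[1:]))  # each item <= its successor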


> we have to draw the line somewhere

Do we? You can take the population counts of the two lists and check that those are equal.


On more complex logic you can often reason about many invariants. I don't think using sorting and then comparing to the existing sort is a good example. Precisely for the reason you mention.

A sort itself is a good example though. You can do much better than just comparing to an existing sort implementation, because you can easily check whether an array is sorted with a linear scan. If your sort is stable you can do an additional check: for all pairs of adjacent elements that compare as equal, their order should be the same as in the input array.
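A sketch of that stability check, assuming a hypothetical my_stable_sort that takes a key function (the narrow key range forces plenty of ties):

    from hypothesis import given, strategies as st

    @given(st.lists(st.integers(min_value=0, max_value=3)))
    def test_sort_is_stable(keys):
        # tag each key with its original position so equal keys stay distinguishable
        tagged = list(enumerate(keys))
        ys = my_stable_sort(tagged, key=lambda t: t[1])  # hypothetical stable sort
        for (i, ka), (j, kb) in zip(ys, ys[1:]):
            assert ka <= kb        # output is sorted by key
            if ka == kb:
                assert i < j       # equal keys keep their input order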


Here's a property check I wrote yesterday, which found a couple of bugs in a large, decade-old codebase.

I'd just changed a data structure with three components, and made sure the test suite was still passing. I happened to notice a function in the same file, for parsing strings into that data structure, which had a docstring saying that it ignores whitespace at the start/end of the string, and in-between the components. It had tests for the happy-path, like "foo123x" -> ("foo", 123, 'x'), as well as checking that optional components could be left out, and it even checked some failure cases. Yet none of the tests used any whitespace.

I thought it would be good to test that, given that somebody had gone to the effort of documenting it. Yet I didn't want to write a bunch of combinations like " foo123x", "foo 123x", "foo123 x", " foo 123x", "foo 123 x", and so on. Instead, I wrote a property which adds some amount of whitespace (possibly none) to each of those places, and asserts that it gets the same result as with no whitespace (regardless of whether it's a successful parse or not). I wasn't using Python, but it was something like this (translated here into a Hypothesis-style sketch):

    from hypothesis import given, strategies as st  # assumed, for the sketch

    @given(st.booleans(), st.booleans(), st.booleans(),
           st.integers(min_value=0, max_value=3), st.integers(min_value=0, max_value=3),
           st.integers(min_value=0, max_value=3), st.integers(min_value=0, max_value=3))
    def whitespace_is_ignored(b1: bool, b2: bool, b3: bool, s1: int, s2: int, s3: int, s4: int):
      v1 = "foo" if b1 else ""
      v2 = "123" if b2 else ""
      v3 = "x" if b3 else ""

      spaces = lambda n: " " * n
      spaced = "".join([spaces(s1), v1, spaces(s2), v2, spaces(s3), v3, spaces(s4)])
      assert parser(v1 + v2 + v3) == parser(spaced)
The property-checker immediately found that "foo123x " (with two spaces at the end) will fail to parse. When I fixed that, it found that spaces after the first component will end up in the result, like "foo 123x" -> ("foo ", 123, 'x').

Of course, we could make this property more general (e.g. by taking the components as inputs, instead of hard-coding those particular values); but this was really quick to write, and managed to find multiple issues!

If I had written a bunch of explicit examples instead, then it's pretty likely I would have found the "foo 123x" issue; but I don't think I would have bothered writing combinations with multiple consecutive spaces, and hence would not have found the "foo123x " issue.


I found making composite strategies the most helpful part. I had a lot of fiddly details to get right when validating some DSP algorithms and the intermediate data structures they required, and for that, property testing with some kind of composite strategy for generating valid inputs worked very well.

I definitely had the same issue when I started - I think you've hit on the two main mental blocks many people run into.

> this means you need a second function that acts the same as the function you wrote.

Actually no, you need to check that outcomes match certain properties. The example there works for a reimplementation (for all inputs, the output of the existing function and new one must be identical) but you can break down a sorting function into other properties.

Here are some for a sorting function that sorts list X into list Y.

Lists X and Y should be the same length.

All of the elements in list X should be in list Y.

Y[n] <= Y[n+m] for all m >= 0 with n+m < len(X)

If you support a reverse parameter, iterating backwards over the sorted list should match the reverse sorted list.
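That last one in a sketch, assuming my_sort exposes a reverse flag like the built-in sorted():

    from hypothesis import given, strategies as st

    @given(st.lists(st.integers()))
    def test_reverse_matches_reversed_ascending(xs):
        # descending sort should equal the ascending sort read backwards
        assert my_sort(xs, reverse=True) == list(reversed(my_sort(xs)))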

Sorting functions are a good simple case but we don't write them a lot and so it can be hard to take that idea and implement it elsewhere.

Here's another example:

In the app, focus ELEMENT A. Press tab X times. Press shift-tab X times. Focus should be on ELEMENT A.

In an app with two or more elements, focus ELEMENT A. Press tab. The focus should have changed.

In an app with N elements. Focus ELEMENT A. Press tab N times. The focus should be on ELEMENT A.

Those are simple properties that should be true and are easy to test. Maybe you'd have some specific set of tests for certain cases and numbers, but then there's a good chance you miss that focus gets trapped in some part of your UI.
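A minimal sketch of the first of those, where App, focus, press and focused are hypothetical stand-ins for however you drive your UI:

    from hypothesis import given, strategies as st

    @given(st.integers(min_value=0, max_value=20))
    def test_tab_then_shift_tab_returns_focus(x):
        app = App()                  # hypothetical UI driver
        app.focus("ELEMENT_A")
        for _ in range(x):
            app.press("tab")
        for _ in range(x):
            app.press("shift-tab")
        assert app.focused() == "ELEMENT_A"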

I had a library for making UIs for TVs that could be shaped in any way (so a carousel with 2 vertical elements then one larger one, then two again for example). I had a test that for a UI, if you press a direction and the focus changes, then pressing the other direction you should go back to where you were. Extending that, I had that same test run but combined with "for any sequence of UI construction API calls...". This single test found a lot of bugs in edge cases. But it was a very simple statement about using the navigation.

I had a similar test - no matter what API calls are made to change the UI, either there are no elements on screen, or one and only one has focus. This was defined in the spec, but the spec also said that if there was focus, removing everything and then adding items back in meant nothing should have focus. These parts of the spec were independently tested with unit tests, and nobody spotted the contradiction until we added a single property test that checked the first part automatically.

This I find exceptionally powerful when you have APIs. For any way people might use them, some basic things hold true.

I've had cases where we were identifying parts of text, with properties like "given a random string, adding ' thethingtofind ' results in getting at least a match for ' thethingtofind ', with start and end points, and when we extract that part of the string we get ' thethingtofind '". Sounds basic, right? Almost pointless to add when you've seen the output work just fine and you have explicit test cases for this? It failed. At one step we lowercased things for normalisation and took the positions from this lowercased string. If there was a Turkish İ in the string, when lowercased it became two characters:

    >>> len("İ") == len("İ".lower())
    False
So anything where this appeared before the match string would throw off the positions.

Have a look at your tests, I imagine you have some tests like

"Doing X results in Y" and "Doing X twice results in Y happening once". This is two tests expressing "Doing X n times where n>=1, Y happens". This is much more explicitly describing the actual property you want to test as well.

You probably have some tests where you parameterise a set of inputs but actually check the same output - those could be randomly generated inputs testing a wider variety of cases.

I'd also suggest that if you have a library or application and you cannot give a general statement that always holds true, it's probably quite hard to use.

Worst case, point hypothesis at an HTTP API and have it auto-generate some tests that say "no matter what I send to this, it doesn't return a 500".
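A rough version of that worst case, using requests against a made-up local endpoint:

    import requests
    from hypothesis import given, strategies as st

    @given(st.text())
    def test_api_never_returns_500(payload):
        # http://localhost:8000/search is purely illustrative
        resp = requests.get("http://localhost:8000/search", params={"q": payload})
        assert resp.status_code != 500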

In summary (and no this isn't LLM generated):

1. Don't test every case.

2. Don't check the exact output, check something about the output.

3. Ask where you have general things that are true. How would you explain something to a user without a very specific setup (you can't do X more than once, you always get back Y, it's never more than you put in Z...)

4. See if you have numbers in tests where the specific number isn't relevant.

5. See if you have inputs in your tests where the input is a developer putting random things in (sets of short passwords, whatever)


I don't do this testing using Hypothesis, but I do a great deal of property-based testing of Common Lisp implementations, where it has been invaluable.

One large testing loop is testing CL's compile function. This involves creating random valid lambda expressions (unnamed functions), compiling them, and running them on random input. The properties tested are things like "the compiler doesn't crash, the generated code doesn't crash, different versions of the code with different valid declarations compute the same thing".

More specific properties are tested for specific CL functions. For example, CL has a function subtypep which, given two types, says either the first value is or is not a subtype of the second, or that it couldn't figure it out. One can test this on randomly generated types, evaluating properties like (subtypep T1 T2) should be consistent with (subtypep `(not ,T2) `(not ,T1)).

The point of PBT isn't that you write tests to cover every possible case; it's that PBT will cook up tests for all sorts of cases you will never come up with yourself. I have repeatedly found it kicking out the wackiest test inputs that happen to trigger bugs from strange interactions.


Tangent, but software bloat always leaves me wondering what hardware/software could be if we had different engineering goals.

If we worked hard to keep OS requirements to a minimum - could we be looking at unimaginably improved battery life? Hyper-reliable technology that lasts many years? Significantly more affordable hardware?

We know that software bloat wastes RAM and CPU, but we can't know what alternatives we could have if we hadn't spent our metaphorical budget on bloat already.


"If we worked hard to keep OS requirements to a minimum- could we be looking at unimaginably improved battery life? Hyper reliable technology that lasts many years? Significantly more affordable hardware?"

A volunteer-supported UNIX-like OS, e.g. NetBSD, represents the closest thing to this ideal for me

I am able to use an "old" operating system with new hardware. No forced "upgrades" or remotely-installed "updates". I decide when I want to upgrade software. No new software is pre-installed

This allows me to observe and enjoy the speed gains from upgrading hardware in a way I cannot with a corporate operating system. The latter will usurp new hardware resources in large part for its own commercial purposes. It has business goals that may conflict with the non-commercial interests of the computer owner

It would be nice if software did not always grow in size. It happens to even the simplest of programs. Look at the growth of NetBSD's init over time for example

Why not shrink programs instead of growing them

Programmers who remove code may be the "heroes", as McIlroy once suggested ("The hero is the negative coder")


Screens and radios do a lot to limit battery life on most modern devices, even if the energy used to run the OS and user software were free.

If, with a reasonable battery, standby mode can only last a few weeks and active use is at best a few days, then you might as well add a fairly beefy CPU - and with a beefy CPU, OS optimizations only go so far. This is why eInk devices can end up with such a noticeably longer lifespan: they have a reason to put in a weak CPU and do some optimization, because the possibility of a long lifespan is a huge potential selling point.


Putting aside my political concerns about Elon Musk starting an encyclopedia, I thought I'd look up the indie rock band Beat Happening as a test. My first thoughts were that it was very impressive. There's a confused, garbled image link (that appears as text) but otherwise it looks convincing:

> Beat Happening was an American indie pop band formed in Olympia, Washington, in 1982 by vocalist and guitarist Calvin Johnson, guitarist Heather Lewis, and drummer Bret Lunsford.

Then I thought, hey, let's see what Wikipedia says. Here it is:

> Beat Happening was an American indie pop band formed in Olympia, Washington in 1982. Calvin Johnson, Heather Lewis, and Bret Lunsford were the band's continual members.

I came away less impressed - Grokipedia's opening paragraph reads to me like a very minor rephrase. I assume Wikipedia is in the training data here and being spat back out?

In fact going back to the garbled image link:

> ![Beat Happening in 1988. Clockwise from top: Calvin Johnson, Bret Lunsford, Heather Lewis.](./assets/Beat_Happening_(1988_Rough_Trade_publicity_photo)

Even this seems to be a reference to the Wikipedia image at the top of the article, titled:

> File:Beat Happening (1988 Rough Trade publicity photo).jpg

What a colossal waste of resources. Is anyone under the impression that this kind of "rephrased and less accurate Wikipedia" is going to be helpful?


The point is rewriting the articles that matter to Musk. Nobody cares about the rest of the articles, they are there just to justify Grokipedia being an encyclopedia at all.

> That said, I have grave concerns that Gates is simply wrong, that we cannot invent our way out of both climate change and the suffering of the global poor.

I feel positive about whether or not we can - renewable and battery technology is incredible at the moment, and we could shift quickly to a much less devastating track if the political will was there.

That said, without the political story, technology itself can't change things. We've seen this in the fact that renewable technology advances have also come alongside technology advances in the fossil fuel industry, like fracking.

Around a third of CO2 emissions are down to agriculture, which is mostly meat (and of that, mostly beef). There's no tech fix required here, since humans can survive easily without beef, but something would need to change economically or otherwise so that the CO2 impact of food factors into our collective diets.


Climate change is so confusing to message about in this sense, because the future is not decided.

We're on track to warm the planet far past the point of sustaining something you and I would recognise as human society. That said, we're also changing course, which gives us the opportunity to mitigate things.

The meta part is, if everyone thinks we'll change course, that affects whether we do. There's no straight prediction that can be made.


I don't know of one - I've thought about this before, but with Python and fsspec. Having a Google Drive style interface that can run on local files, or any filesystem of your choice (SSH, S3, etc.) would be really great.

Wait, Google and Microsoft rely on AWS? That seems unlikely? (does it? I wouldn't really know to be honest)


In terms of user reports: Some users don't know what the hell is going on. This is a constant.

For instance: When there's a widespread Verizon cellular outage, sites like downdetector will show a spike in Verizon reports.

But such sites will also show a spike in AT&T and T-Mobile reports. Even though those latter networks are completely unaffected by Verizon's back-end issues, the graphs of user reports are consistently shaped the same for all 3 carriers.

This is just because some of the users doing the reporting have no clue.

So when the observation is "AWS is in outage and people are reporting issues at Google, and Microsoft," then the last two are often just factors of people being people and reporting the wrong thing.

(You're hanging out on HN, so there's a very good chance that you know precisely what cell carrier you're using and can also discern the difference betwixt an Amazon, a Google, and a Microsoft. But lots of other people are not particularly adept at making these distinctions. It's normal and expected for some of them to be this way at all times.)


That's true. And a big part of the reason is the user's browser. They use Microsoft Edge or Google Chrome, can't open a page, and there are weird error messages? Oh, that's probably a Google issue…


More likely the outage reports for Google and Microsoft are based around systems which also rely on AWS.


It’s very likely they’ve bought companies that were built on AWS and haven’t migrated to use their homegrown cloud platforms.


They might be using third party services that rely on AWS.

