Hacker News | sharkdp's comments

> “Best publicly available” != “great”

Of course. But it is free and open source. And everyone is invited to make it better.


> Robust statistics with p-values (not just min/max, compensation for multiple hypotheses, no Gaussian assumptions)

This is not included in the core of hyperfine, but we do have scripts to compute "advanced" statistics, and to perform t-tests here: https://github.com/sharkdp/hyperfine/tree/master/scripts

Please feel free to comment here if you think it should be included in hyperfine itself: https://github.com/sharkdp/hyperfine/issues/523

> Automatic isolation to the greatest extent possible (given appropriate permissions)

This sounds interesting. Please feel free to open a ticket if you have any ideas.

> Interleaved execution, in case something external changes mid-way.

Please see the discussion here: https://github.com/sharkdp/hyperfine/issues/21

> It just… runs things N times and then does a naïve average/min/max?

While there is nothing wrong with computing average/min/max, this is not all hyperfine does. We also compute modified Z-scores to detect outliers. We use that to issue warnings, if we think the mean value is influenced by them. We also warn if the first run of a command took significantly longer than the rest of the runs and suggest counter-measures.
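
As an illustration of the outlier detection described here, a small sketch using the common Iglewicz-Hoaglin modified Z-score formulation (0.6745 × deviation from the median / MAD, flagging |score| > 3.5). hyperfine's exact constants and thresholds may differ; the timings are made up:

```python
import statistics

def modified_z_scores(times):
    # Modified Z-score: robust to outliers because it uses the median
    # and the median absolute deviation (MAD) instead of mean/stddev.
    med = statistics.median(times)
    mad = statistics.median(abs(t - med) for t in times)
    return [0.6745 * (t - med) / mad for t in times]

runs = [10.1, 10.3, 9.9, 10.2, 10.0, 25.0]  # last run: cold-cache outlier
outliers = [t for t, s in zip(runs, modified_z_scores(runs)) if abs(s) > 3.5]
```

Here the first run's score is far above 3.5, which is the kind of signal that triggers the "consider using warmup runs" warning.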

Depending on the benchmark I do, I tend to look at either the `min` or the `mean`. If I need something more fine-grained, I export the results and use the scripts referenced above.

> At that rate, one could just as well use a shell script and eyeball the results.

Statistical analysis (which you can consider to be basic) is just one reason why I wrote hyperfine. The other reason is that I wanted to make benchmarking easy to use. I use warmup runs, preparation commands and parametrized benchmarks all the time. I also frequently use the Markdown export or the JSON export to generate graphs or histograms. This is my personal experience. If you are not interested in all of these features, you can obviously "just as well use a shell script".
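
For the export workflow mentioned above, a hedged sketch of consuming hyperfine's JSON export (written with `--export-json results.json`): each entry under "results" carries the command line plus summary statistics such as "mean". The commands and numbers below are made up for illustration:

```python
import json

# Stand-in for the contents of results.json produced by hyperfine.
raw = """{
  "results": [
    {"command": "fd . /tmp",   "mean": 0.0412},
    {"command": "find /tmp",   "mean": 0.3120}
  ]
}"""

# Pick the command with the smallest mean runtime.
fastest = min(json.loads(raw)["results"], key=lambda r: r["mean"])
```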


> This is not included in the core of hyperfine, but we do have scripts to compute "advanced" statistics, and to perform t-tests here: https://github.com/sharkdp/hyperfine/tree/master/scripts

t-tests run afoul of the “no Gaussian assumptions” requirement, though. Distributions arising from benchmarking frequently have various forms of skew, which mess up t-tests and give artificially narrow confidence intervals.
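
One nonparametric alternative, sketched here with made-up, right-skewed timings: a percentile-bootstrap confidence interval for the mean makes no Gaussian assumption, so skew does not produce the artificially narrow bounds a t-interval would:

```python
import random
import statistics

def bootstrap_ci(samples, iterations=2000, alpha=0.05, seed=42):
    # Percentile bootstrap: resample with replacement, take the mean of
    # each resample, and read the CI off the sorted bootstrap means.
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(samples, k=len(samples)))
        for _ in range(iterations)
    )
    lo = means[int(alpha / 2 * iterations)]
    hi = means[int((1 - alpha / 2) * iterations) - 1]
    return lo, hi

skewed = [10, 10, 11, 11, 12, 12, 13, 30, 45]  # right-skewed timings
lo, hi = bootstrap_ci(skewed)
```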

(I'll gladly give you credit for your outlier detection, though!)

>> Automatic isolation to the greatest extent possible (given appropriate permissions)

> This sounds interesting. Please feel free to open a ticket if you have any ideas.

Off the top of my head, some option that would:

* Bind to isolated CPUs, if booted with isolcpus=
* Bind to a consistent set of cores/hyperthreads (the scheduler frequently sabotages benchmarking, especially if your cores have very different maximum frequencies)
* Warn if thermal throttling is detected during the run
* Warn if an inappropriate CPU governor is enabled
* Lock the program into RAM (probably hard to do without some sort of help from the program)
* Enable realtime priority if available (e.g., if isolcpus= is not enabled, or you're not on Linux)
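
The core-binding idea above can be sketched in a few lines, assuming Linux (`os.sched_setaffinity` is Linux-only): a benchmarking tool could pin itself, and thus its children, which inherit the mask, to a single fixed core before timing anything:

```python
import os

allowed = os.sched_getaffinity(0)  # CPUs this process may currently run on
target = {min(allowed)}            # pick one fixed core
os.sched_setaffinity(0, target)    # pin; child processes inherit this mask
```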

Of course, sometimes you would _want_ to benchmark some of these effects, and that's fine. But most people probably won't, and won't know that they exist. I may easily have forgotten some.

On the flip side (making things more random as opposed to less), something that randomizes the initial stack pointer would be nice, as I've sometimes seen this go really, really wrong (renaming a binary from foo to foo_new made it run >1% slower!).


> On the flip side (making things more random as opposed to less), something that randomizes the initial stack pointer would be nice, as I've sometimes seen this go really, really wrong (renaming a binary from foo to foo_new made it run >1% slower!).

This is something we do already. We set a `HYPERFINE_RANDOMIZED_ENVIRONMENT_OFFSET` environment variable with a random-length value: https://github.com/sharkdp/hyperfine/blob/87d77c861f1b6c761a...
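
A minimal sketch of that mechanism: give an environment variable a value of random length before launching the child, so the size of the environment block (and hence the child's initial stack offset) varies between runs. The variable name is taken from the linked hyperfine source; everything else is illustrative:

```python
import os
import random

# Random-length payload; the content doesn't matter, only its length,
# which shifts where the child's stack ends up.
offset = random.randint(1, 4096)
os.environ["HYPERFINE_RANDOMIZED_ENVIRONMENT_OFFSET"] = "X" * offset
```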


> The issue is it runs a kajillion tests to try and be “statistical”.

If you see any reason for putting “statistical” in quotes, please let us know. hyperfine does not run a lot of tests, but it does try to find outliers in your measurements. This is really valuable in some cases. For example: we can detect when the first run of your program takes much longer than the rest of the runs. We can then show you a warning to let you know that you probably want to either use some warmup runs, or a "--prepare" command to clean (OS) caches if you want a cold-cache benchmark.

> But there’s no good way to say “just run it for 5 seconds and give me the best answer you can”.

What is the "best answer you can"?

> It’s very much designed for nanosecond to low microsecond benchmarks.

Absolutely not. With hyperfine, you cannot measure execution times in the "low microsecond" range, let alone the nanosecond range. See also my other comment.


That doesn't make a lot of sense. It's more like the opposite of what you are saying. The precision of hyperfine is typically in the single-digit millisecond range. Maybe just below 1 ms if you take special care to run the benchmark on a quiet system. Everything below that (microsecond or nanosecond range) is something that you need to address with other forms of benchmarking.

But for everything in the right range (milliseconds, seconds, minutes or above), hyperfine is well suited.


No it’s not.

Back in the day my goal for Advent of Code was to run all solutions in under 1 second total. Hyperfine would take like 30 minutes to benchmark a 1 second runtime.

It was hyper frustrating. I could not find a good way to get Hyperfine to do what I wanted.


If that's the case, I would consider it a bug. Please feel free to report it. In general, hyperfine should not take longer than ~3 seconds, unless a single run of the command itself takes > 300 ms. In the latter case, we do a minimum of 10 runs by default. So if your program takes 3 min for a single iteration, it would indeed take 30 min by default. But this can be controlled using the `-m`/`--min-runs` option. You can also specify the exact number of runs using `-r`/`--runs`, if you prefer that.

> I could not find a good way to get Hyperfine to do what I wanted

This is all documented here: https://github.com/sharkdp/hyperfine/tree/master?tab=readme-... under "Basic benchmarks". The options to control the number of runs are also listed in `hyperfine --help` and in the man page. Please let us know if you think we can improve the documentation / discoverability of those options.


I've been using it for about four or five years, and never experienced this behavior.

Current defaults: "By default, it will perform at least 10 benchmarking runs and measure for at least 3 seconds." If your program takes 1s to run, it should take 10 seconds to benchmark.

Is it possible that your program was waiting for input that never came? One "gotcha" is that it expects each argument to be a full program, so if you ran `hyperfine ./a.out input.txt`, it will first bench a.out with no args, then try to bench input.txt (which will fail). If a.out reads from stdin when no argument is given, then it would hang forever, and I can see why you'd give up after a half hour.


> Is it possible that your program was waiting for input that never came?

We do close stdin to prevent this. So you can benchmark `cat`, for example, and it works just fine.
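
A sketch of that behavior: launch the child with stdin redirected to /dev/null, so a program like `cat` that would otherwise block waiting for input exits immediately:

```python
import subprocess

# With stdin closed (redirected to /dev/null), `cat` has nothing to
# read, produces no output, and exits successfully right away.
result = subprocess.run(["cat"], stdin=subprocess.DEVNULL,
                        capture_output=True, timeout=5)
```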


Oh, my bad! Thank you for the correction, and for all your work making hyperfine.


Yes. If you don't make use of shell builtins/syntax, you can use hyperfine's `--shell=none`/`-N` option to disable the intermediate shell.


You still need to quote the command though. `hyperfine -N ls "$dir"` won't work; you need `hyperfine -N "ls ${dir@Q}"` or something. It'd be better if you could specify commands like with `find -exec`.


Oh that sucks, I really hate when programs impose useless shell parsing instead of letting the user give an argument vector natively.


I don't think it's useless. You can use hyperfine to run multiple benchmarks at the same time, to get a comparison between multiple tools. So if you want it to work without quotes, you need to (1) come up with a way to separate commands and (2) come up with a way to distinguish hyperfine arguments from command arguments. It's doable, but it's also not a great UX if you have to write something like

    hyperfine -N -- ls "$dir" \; my_ls "$dir"


> not a great UX

Looks fine to me. Obviously it's too late to undo that mistake, but a new flag to enable new behavior wouldn't hurt anyone.


Caching is something that you almost always have to be aware of when benchmarking command line applications, even if the application itself has no caching behavior. Please see https://github.com/sharkdp/hyperfine?tab=readme-ov-file#warm... on how to run either warm-cache benchmarks or cold-cache benchmarks.


I'm fully aware, but it's not a problem that warmup runs fix. An executable freshly compiled will always benchmark differently than one that has "cooled off" on macOS, regardless of warmup runs.

I've tried to understand what the issue is (played with re-signing executables, etc.) but it's literally something about the inode of the executable itself. Most likely part of the macOS security system.


Interesting. I've encountered this obviously on first run (because of the security checking it does on novel executables) but didn't realize this expired. Probably because I usually attribute it to a recompilation. Thanks.


> 365·243 ought to be 365·2425 exactly:

Yes. This is also how it is defined: https://github.com/sharkdp/numbat/blob/ba9e97b1fbf6353d24695...

The calculation above is showing a rounded result (6 significant digits by default).
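
That rounding can be checked in a few lines, assuming half-up rounding: the exact definition 365.2425 displayed at 6 significant digits reads 365.243.

```python
from decimal import Decimal, ROUND_HALF_UP

# Exact decimal value from the definition, rounded to 6 significant
# digits (3 decimal places here, since the integer part has 3 digits).
exact = Decimal("365.2425")
shown = exact.quantize(Decimal("0.001"), rounding=ROUND_HALF_UP)
```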


That's what I figured! but thought the derivation would be fun to share with people reading the comments.


Yes. If you want to know more, you can read about it here: https://github.com/sharkdp/numbat/blob/master/assets/reasons...


So 'year' refers to the Gregorian year and is equal to 365.2425 days [1]. We also have 'julian_year', which is equal to '365.25 days'.

We also have 'sidereal_day' equal to '23.9345 hours', and if you believe it is useful, we can also add 'sidereal_years'.

[1] https://numbat.dev/doc/list-units.html


Do you have a calendar_year and a calendar_leap_year?


Ok, so I did some reading and I like what I see. It's important, however, to properly disambiguate between the two kinds of time units.

Chronological time units and calendrical time units are fundamentally different concepts that overlap a lot in day-to-day life but can be very different when you need to ensure technical accuracy.

- Planck time, Stoney time, Second: Unambiguously valid for both chronological and calendrical usage. Since we define everything in terms of the second anyway, it's basically the centre of the Venn diagram.

- Sidereal day: It isn't a fixed value over longer periods of time, getting longer at a rate on the order of 1.7 milliseconds per century [1]. So a conversion of a short period like 7 sidereal days into seconds is going to be off by something like 3.26×10^-7 seconds, which might be ok, particularly if you also track the precision of values to avoid introducing false precision in the output: you can then truncate the precision to above the error margin for a calculation like this one and treat it unambiguously as valid for both calendrical and chronological purposes.

- It's also worth noting, since you mentioned it, the slight difference between the tropical year (the time for the Sun to return to the same position in the cycle of seasons) and the sidereal year (the time for the Sun to fully traverse the ecliptic relative to the fixed stars); the sidereal year is longer due to the precession of the equinoxes.

- Minute, Hour: These can vary in length by up to a second if a leap second is accounted for. So while conventionally fine as fixed multiples for chronological calculations, they don't have precise chronological values when used in calendar calculations. The exact number of minutes between now and 2030 is fixed, but the number of seconds in those minutes is not.

- Day: In addition to leap seconds, the length of a calendar day also has to deal with the ambiguity of daylight saving time; this is where the significant differences between calendrical and chronological calculations really start to kick in.

- Week, Fortnight: All the problems of days, but magnified by seven and fourteen respectively. Also, there's the concept of standard business weeks and ISO week calendars, where some years wind up with more weeks than others due to the ISO week-numbering rules.

- Month: Obvious problem... "which month?" There are quite a few fewer seconds in February than in October.

- Julian year, Gregorian year: These are conventionally defined by how many days they have and the leap-day rules, then approximated to an average seconds value. So you can "pave over" the problem here, and a lot of people might not be as surprised as if you averaged the length of a day or a month.

- Decade, Century, Millennium: All affected by leap-day rules, and over a given length of time you see the introduction of an unknown but somewhat predictable number of leap seconds. So while you can average it down yet again, the problems of anything bigger than a day have compounded: over a millennium you're dealing with approximately 0.017 seconds of change in the rotation of the Earth.

Doing this right is basically incompatible with doing it the easy way. I'd at least re-label the averaged time units to make the use of average approximations more obvious; ideally, I'd split the time types into calendrical and chronological, and use more sophisticated (and, I'll be the first to admit, annoying to implement) calendar math for the chronological calculations.

[1] - Dennis D. McCarthy; Kenneth P. Seidelmann (18 September 2009). Time: From Earth Rotation to Atomic Physics. John Wiley & Sons. p. 232. ISBN 978-3-527-62795-0.


The entire type system needs to be parameterized by the inertial frame of reference, too.


Some people may think you’re tossing out a sarcastic joke here… but unambiguously fuck yes … because doing this kind of preemptive typing, the forward thinking to “frame of reference” is basically the next step after overhauling everything to disambiguate between calendrical and chronological timekeeping and units…

Because fundamentally you can't correct for the reference frame if you can't work out whether you're dealing with chronological or calendrical units. Calendrical units are in a weird liminal space outside of the Earth reference frame. We measure the history of most deep space missions by Earth-reference-frame mission elapsed time, and do so by keeping a clock on Earth and silently keeping records of the vehicle clock. But on Mars we have a per-mission Sol count that brings Mars time into the mix, and I know for a fact a lot of people neglect the barycentric gravity gradient difference between Earth and Mars, because for literally 99.9% of things it doesn't matter. But if you measure a transit of an astronomical body from instruments on Mars and don't deal with the relative reference frames, your fractions of an arc second are basically pointless false precision.


It would be fun to play an interstellar mashup of Sid Meier’s Civilization and Kerbal Space Program.


Isn't that Sid Meier's Alpha Centauri?


Alpha Centauri is simply a weird earth. I think that a game around long-lived creatures that colonize a galaxy and have to work with relativistic effects could be different and fun. At some level it builds on all these different ways of looking at time. Breathe some life via simulation into this tongue-in-cheek interstellar economics: https://www.princeton.edu/~pkrugman/interstellar.pdf


They will not have to work with relativistic effects. Nobody is going to fly faster than 0.1c; it is prohibitively expensive energy-wise, and there is no point in doing that. Just accelerate to 0.01c and arrive at the neighbouring system in 400 years. 400 years is a blink of an eye, anyway.


Frink is not open source, unfortunately.


>> Frink is not open source, unfortunately.

True: https://frinklang.org/faq.html#OpenSource

Thanks for sharing Numbat with us.

It looks great!

