Neatly omits the massively expensive calendar conversion step for turning the timestamp into a useful format for fast arithmetic. Parsing integers from strings is not why parsing timestamps is slow, any more than counting the length of strings is why XML parsing might be slow.
For time-sequential data, one easy speedup hack is caching the UNIX epoch for the current day start and adding the HH:MM:SS manually. No measurements to offer, but I'd suggest approaches like this would cleanly beat simdification of a trivial part of a complex task.
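A minimal sketch of that hack in C, assuming fixed-format "YYYY-MM-DDTHH:MM:SSZ" input; the struct and function names here are hypothetical, and timegm() is a common but non-standard extension:

    #include <string.h>
    #include <time.h>

    /* Cache the day prefix and the UNIX epoch at that day's 00:00:00.
       Zero-initialize the cache before first use. */
    typedef struct {
        char   day_prefix[10];   /* "YYYY-MM-DD" */
        time_t day_epoch;
    } day_cache;

    time_t parse_ts_cached(day_cache *c, const char *ts) {
        if (memcmp(ts, c->day_prefix, 10) != 0) {
            /* Slow path: new day, recompute the day-start epoch once. */
            struct tm tm = {0};
            tm.tm_year = (ts[0]-'0')*1000 + (ts[1]-'0')*100
                       + (ts[2]-'0')*10 + (ts[3]-'0') - 1900;
            tm.tm_mon  = (ts[5]-'0')*10 + (ts[6]-'0') - 1;
            tm.tm_mday = (ts[8]-'0')*10 + (ts[9]-'0');
            c->day_epoch = timegm(&tm);
            memcpy(c->day_prefix, ts, 10);
        }
        /* Fast path: only HH:MM:SS arithmetic per timestamp. */
        int h = (ts[11]-'0')*10 + (ts[12]-'0');
        int m = (ts[14]-'0')*10 + (ts[15]-'0');
        int s = (ts[17]-'0')*10 + (ts[18]-'0');
        return c->day_epoch + h*3600 + m*60 + s;
    }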
Recently I came across another variant of the algorithm that is even smaller and simpler. I've been meaning to test this variant and add it to my article (assuming it is correct), but I haven't had the chance: https://dotat.at/@/2008-09-10-counting-the-days.html. (EDIT: I just tried it out. It works, and while the source code is smaller than my other variants, epoch_days_fast() from my article is still smaller and faster in terms of machine code: https://godbolt.org/z/Kjnafzqcx vs https://godbolt.org/z/e7nchjKGW)
That's amazing! A long time ago I wrote a conversion function that took longer than 10ns but had the input normalization feature you skipped in your blog post. (Say, January 32nd will be automatically transformed to February 1st, and specifying the thirteenth month of a year will become January of the following year.) Those are simple to implement and I doubt they cost much.
To normalize months outside of the 1–12 range:
    --month;                          // work with a 0-based month
    if (month >= 12 || month < 0) {
        year += month / 12;           // whole years of overflow/underflow
        month %= 12;
        if (month < 0) {              // C's division truncates toward zero
            --year;
            month += 12;
        }
    }
    ++month;                          // back to 1-based
Fixing incorrect days is even easier; most algorithms do not need any changes.
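For instance, with a civil-date-to-epoch-days routine along the lines of Howard Hinnant's well-known algorithm (a sketch, not code from the article), an overflowing day needs no special handling because the day only enters the formula as a linear offset:

    /* Days since 1970-01-01, proleptic Gregorian calendar. Because the day
       contributes only "+ d - 1", days_from_civil(2023, 1, 32) equals
       days_from_civil(2023, 2, 1) with no extra normalization code. */
    int days_from_civil(int y, unsigned m, unsigned d) {
        y -= m <= 2;
        const int era = (y >= 0 ? y : y - 399) / 400;
        const unsigned yoe = (unsigned)(y - era * 400);               /* [0, 399] */
        const unsigned doy = (153 * (m > 2 ? m - 3 : m + 9) + 2) / 5 + d - 1;
        const unsigned doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
        return era * 146097 + (int)doe - 719468;
    }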
Amazing work :) Do you know if glibc gmtime_r() might be impacted by the active timezone? My "massively expensive" experience above came from gmtime_r() on Amazon Linux running in Lambda, but I can't remember the exact scenario involved (or e.g. whether TZ was set). It certainly seems unlikely to have been worth the effort I put into it if the costs were similar to those observed in your post.
Thanks! I didn't experiment with gmtime_r() since it goes in the other direction (UnixTime -> civil time). I wouldn't expect timegm() or gmtime_r() to depend on the current timezone, since both are hard-coded to use UTC. I never did figure out why the macOS version performs mutex locks on global data.
Indeed! It was a while ago, but I'm quite proud of the fast timestamp functions I made for Java after the standard library's timestamp performance turned out to be critical to some data pipelines I was working on.
The early versions of the code did cache the last parsed date, etc., but as the parser got quicker this yielded diminishing returns.
The fastest way I found to handle the calendar month numbers and so on was to use a precomputed table.
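In that spirit (a hedged C sketch, not the commenter's Java), the month part of a civil-date-to-epoch-days conversion becomes a single table lookup:

    /* Cumulative days before each month, common and leap years. */
    static const int days_before_month[2][12] = {
        {0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334},
        {0, 31, 60, 91, 121, 152, 182, 213, 244, 274, 305, 335},
    };

    static int is_leap(int y) {
        return (y % 4 == 0 && y % 100 != 0) || y % 400 == 0;
    }

    /* Days since 1970-01-01 for a validated Gregorian date: closed form
       for the year part, table lookup for the month part. */
    long long epoch_days(int y, int m, int d) {
        long long days = 365LL * (y - 1970)
                       + ((y - 1) / 4 - (y - 1) / 100 + (y - 1) / 400) - 477;
        return days + days_before_month[is_leap(y)][m - 1] + (d - 1);
    }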
Wouldn't it be possible to contribute the faster algorithm to the JDK (e.g. the Instant class)? Maybe you'd need some iterations, but do you see any downsides?
> omits the massively expensive calendar conversion step
Is that not included in what's 6x faster? I agree comparing against strptime is a bit of a gimme and a scalar fixed-format parser would be fairer, but I'm pretty sure it will beat that too.
Edit to reply since rate-limited:
> No mention of calendar in either place. Did you see something in the article or code, or are you just asking us to check?
The calendar conversion (or rather, its absence so far) is mentioned in the blog post - "It does not get all the parsing done... We can just use standard C code for the result." - and it is very obviously present in the GitHub code.
> one easy speedup hack is caching the UNIX epoch for the current day start and adding the HH:MM:SS manually
Side note, this is a really great example of why you'd want an operation such as parsing to be state-mutating when you'd normally expect it not to, i.e. why it makes sense to support passing a Parser object rather than just a parse() function, even for operations where you don't think it's necessary at first glance.
Side note but I always found this phrasing to be incredibly confusing (hence the need for the "i.e."...); it's the exact opposite of what I would think it means to "ignore" leap seconds. Ignoring something means you don't change your behavior when that something happens. But when leap seconds happen, the epoch timestamp halts its counting. That's not ignoring leap seconds -- that's actively countering the leap seconds!
yeah - "ignoring" to me means "is ignorant of" means "one second passing means increasing the time by one second". You do by far the most obvious thing when time passes, and you ignore the complexities of reality to continue being maximally obvious.
Doing anything else means handling them by adding custom behavior.
Halting the advancement that would've ordinarily happened is not "ignoring it", it's doing something to actively counter it. That "something" ends up canceling the previous trajectory to evaluate to "nothing", but that doesn't happen through ignorance - it required a countering action.
Example: You're driving your car when pedestrians appear in front of you. What does "ignoring" them constitute? Stopping the car, or running into them at the previous speed?
I'd say that the people counting leap seconds are the ones "doing something".
At the point they needed to be inserted, Unix systems carry on counting seconds as units of time as they pass, "ignoring" the need to add 1 for the purposes of accounting.
If a system considers 2016-12-31 23:50 UTC and 2017-01-01 00:10 UTC to be 20 minutes apart because it is ignorant of leap seconds - is that ignorance "active"?
Yes, that means that :60 should parse into the same value as :59. My reading of the code is that now it might end up in the `return false` branch instead, although I'm not exactly sure how the limits thing is used here.
If you're using epoch timestamps to represent time you don't have leap second bugs. Each Unix timestamp corresponds to one unique point in time which also has some representation in civil time.
If you're subtracting Unix timestamps to get durations you may have bugs, but those bugs might align better to your user's expectations anyway.
> Each Unix timestamp corresponds to one unique point in time
That depends. 30th June 2015 23:59:59 has the timestamp 1435708799. 1st July 2015 00:00:00 has the timestamp 1435708800. But what about the leap second that happened between those two seconds? I think most people would claim it also has timestamp 1435708799, but then "Each Unix timestamp corresponds to one unique point in time" isn't true, because 1435708799 now corresponds to two different points in time (30th June 2015 23:59:59 and 30th June 2015 23:59:60). Saying there is a unique mapping from Unix timestamp to time would only be true if you either say that 30th June 2015 23:59:60 has no Unix timestamp (which isn't a valid return value of any time API I've seen) or if you smear time.
Well, all time stamps are rounded or floored to the nearest second (or millisecond, or micro, or...) because you only have a finite amount of space to store them — you always have infinitely many time stamps mapping to each Unix second.
This means you can just adjust the mapping around the leap second. A naive implementation might have those two real-life seconds tick twice as fast to take only one Unix second. Or you could have the whole day’s seconds tick 86401/86400 times as fast, etc. In terms of the mapping from continuous time to discrete time, this is just as fine as the non-leap-second case.
No, your misdesigned ntpd resets your clock to 1435708799 at 2015-06-30T23:59:60Z. I have yet to see a convincing argument that smearing and freezing are not compatible with POSIX's definitions of time.
Sure, leap smearing could be considered compatible with POSIX in the sense that POSIX, AFAIK, does not require any level of accuracy from clocks, so even something silly like xkcd 2266 could be considered POSIX compatible: https://xkcd.com/2266/
No, bugs can come up without any subtraction. If an event was recorded as happening at 23:59:60.2, you need it to be sequenced strictly after 23:59:59.9. Winding back the clock by 1 second makes it so you end up with the later event as having occurred at 23:59:59.2, violating causality. Even without sub-second precision the situation is still buggy, just not as bad - you end up with two simultaneous events when there weren't any in the original representation, which is itself quite surprising when the two look like they have the same precision.
"Lose information" is not a binary thing, nor is it equivalent to introducing wrong information. You don't expect (short)INT_MAX to suddenly give you 1337 just because it "loses information". You still want to have (and do have) some useful properties & information to be preserved even after certain information loss, otherwise the implementation is buggy despite the information loss.
In this case, without leap seconds, you'd have still lost the precision, but you wouldn't have actually violated causality. Winding back during a leap second makes it so that a timestamp that's > 1 second later in real time still ends up being ordered as strictly before the earlier time. That's a bug, regardless of how you slice it.
Now this can be partially avoided by saturating the subtraction at the boundary rather than winding it back, and I'm not sure how many implementations do this. That seems better, but it's still surprising (and thus bug-prone) given the input and output representation otherwise appear to have sufficient precision.
I would say that Unix time doesn't violate causality, because it only does whole seconds and those whole seconds monotonically increase. You need to take extra care when bolting on fractional seconds.
If converting things from one system to another means two very close events get the same timestamp? Eh, that can happen in many ways with many systems. That's not bugginess.
Yea, my eye twitched when I read that and I said to myself “so, who’s gonna tell him?” Glad to see this is the top comment. This is why we should not roll our own anything.
> This is why we should not roll our own anything.
This seems like a major overcorrection. It's equivalent to saying "you can't know anything" or "no one is qualified to do anything".
You really have to consider the downsides. In cryptography, "don't roll your own crypto" is the maxim because the downside is pretty bad. Make sure you're an expert before writing crypto.
What's the downside to mis-parsing leap seconds? Maybe a 500 on a super rare request? Then they patch it and it's fine? Most computers are off by more than 1 second anyway.
I think this mindset comes from developers looking into some field and realizing it's super deep ("Did you know there are missing dates in the gregorian calendar if you go back hundreds of years? Your time library is incorrect if it doesn't handle these cases!"). And sure, most things aren't completely correct, but you have to consider what the downside is to being slightly wrong. Maybe it'll be wrong in cases that will never actually happen in practice.
> This is why we should not roll our own anything.
How is anything going to be made then ?
Also, the author isn't just anybody and has created stuff that you've most likely interacted with. If almost nobody should "roll his own", he's one of those who should.
At one consulting gig I had at an airline, I noted that timestamps were everywhere and that among the most frequent operations in the system was comparison with the current time. The legacy systems spit out date/time in at least 3 different formats, and almost never included the timezone. The TZ had to be determined by looking up the airport's TZ.
When I got there, time comparisons were a mess: parsing in multiple places, some comparisons not taking the TZ into account, some add and subtract operations incorrectly coded, and other issues.
I tried to make the case that parsing and comparing timestamps was a core operation, and that the system should have all that functionality polished and packaged up into its own module so programmers could focus on the rest of the system, but I was unsuccessful. We were writing in Go, so the language libraries did about 80% of the work, and the less experienced programmers thought it was sufficient.
When I left they were struggling with issues around determining passenger age at the time of flight and getting off-by-one errors around certain cutoff ages.
That snippet would not reject bad chars; are non-digits rejected somewhere else?
A simple scalar loop that multiplies an accumulator by 10, atoi() style, would be faster.
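Something along these lines (a hedged sketch; field widths and overall bounds checking are left to the caller), which also rejects the bad characters mentioned above:

    /* Parse a fixed-width run of ASCII digits with a multiply-by-10
       accumulator; returns 0 if any byte is not a digit. */
    static int parse_digits(const char *p, int len, int *out) {
        int v = 0;
        for (int i = 0; i < len; ++i) {
            unsigned d = (unsigned)(p[i] - '0');
            if (d > 9)
                return 0;   /* non-digit: reject instead of laundering it */
            v = v * 10 + (int)d;
        }
        *out = v;
        return 1;
    }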
The code in the post doesn't seem to finish the job. For a timestamp like "2023-07-02T15:46:00Z" or the other formats you mentioned, the timestamp fits in a single 32-byte register, so you could write code that deals with the whole thing as it is. If you have a mix of timestamps with different numbers of decimals or with different time zones, dealing with those will take a bit of work compared to the ideal case where you had all "Z" or all "+" or all "-" for the time zones and all the same number of decimals for the fractional seconds.
Shouldn't be too hard, assuming you don't need any validation. There are, AFAIK, SIMD instructions to move bytes around (vpermb, maybe?), so that should take care of getting rid of the separators.
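A hedged SSSE3 sketch of that idea using pshufb (rather than vpermb, which needs AVX-512VBMI): gather the eight date digits of a fixed-format "2023-07-02T15:46:00Z" input into contiguous bytes, with no validation. The time digits would need a second shuffle or a wider load, since byte 17 onward falls outside the first 16-byte register.

    #include <string.h>
    #include <tmmintrin.h>   /* SSSE3: _mm_shuffle_epi8 */

    void compact_date_digits(const char *ts, char out[8]) {
        __m128i v = _mm_loadu_si128((const __m128i *)ts);   /* first 16 bytes */
        /* Result byte i takes source byte mask[i]; 0x80 zeroes the lane. */
        const __m128i mask = _mm_setr_epi8(
            0, 1, 2, 3, 5, 6, 8, 9,
            (char)0x80, (char)0x80, (char)0x80, (char)0x80,
            (char)0x80, (char)0x80, (char)0x80, (char)0x80);
        __m128i digits = _mm_shuffle_epi8(v, mask);          /* "20230702"... */
        memcpy(out, &digits, 8);
    }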
What is actually the task here? Parsing is clear and it looks like one has to sum all parsed timestamps, but what does the additional comment about the shortest time mean?
It means that your solution is scored on how fast it runs.
Frequently we end up with solutions that use tricks that would not be applicable to the n=1 problem in TFA ("parse 1 timestamp") when we try to get the fastest solutions for "find the sum of all these timestamps."
Kind of a weird leaderboard if you can't even check whether answers satisfy correctness. I've seen plenty of leaderboards where the top answers would fail cases that weren't covered by the leaderboard's testing.
I think it's not such a big deal because the input generators are usually pretty good and because most solutions that give up some correctness for performance can pretty cheaply detect that they done goofed and fall back to a slow path where they start over and do it some other way. This ends up being useful because scoring is the median runtime of 3 successful runs (so if you fail one of the 3 then your submission fails completely). It also means that probabilistic solutions that would be correct in non-adversarial environments (like using a bloom filter or N bloom filters tuned to give an exact result given the input constraints with probability >99.9%) are admissible.
This particular solution does satisfy correctness, although I can't share the source code (as the competition is ongoing). Feel free to provide your input data and I'll run it to compare with your expected result.
This is really cool. I recently wanted to implement an ARGB->RGBA conversion function with SIMD in Rust, but as far as I can tell SIMD is STILL not in stable Rust :(
The CPU-specific intrinsics stabilized a little while ago for cases where you need SIMD on a specific platform, at least. A lot of crates like glam seem to only support SIMD on x86/x86-64 for that reason.
I know that `wide`, `faster`, and `simdeez` are options for crates to do cross-platform SIMD in safe Rust. Eventually we'll have `std::simd`, but I don't know what the timeline is or what the blockers are.
Interesting. I think you'd find that doing this in chunks of 128-bit wide vectors with platform-agnostic logic would be even better. Have you tried something like Google Highway or writing this exact logic in scalar code and seeing if the compiler will autovectorize it? Then you have the option to do up to 4 of them at once instead of one if your platform supports it.
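For reference, a scalar version of the kind suggested above (a sketch in C for brevity, assuming 32-bit packed pixels where the ARGB-to-RGBA shuffle amounts to a left rotate by 8); compilers commonly autovectorize this loop, but it is worth checking the generated assembly:

    #include <stddef.h>
    #include <stdint.h>

    void argb_to_rgba(uint32_t *px, size_t n) {
        for (size_t i = 0; i < n; ++i)
            px[i] = (px[i] << 8) | (px[i] >> 24);   /* 0xAARRGGBB -> 0xRRGGBBAA */
    }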
There are two digits for months. A check is required per character to prevent bad chars (e.g. 0x3A) from being laundered by the multiply. The '19' comes from the fact that the first digit may only be 0 or 1, while the second digit may be any value 0..9 (e.g. September is '0','9', October is '1','0'). The second check catches bad month values between 13..19 which were not caught by looking at individual digits. Realistically, the first check may be overbuilt; it only needs to check is_digit or not, but it still has to ignore the padding bytes at the end, somehow. Now... I believe there would be a problem with the month value of '00', because it unconditionally subtracts 1 from that field and then uses it as a table index.
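A hypothetical scalar rendition of that check (not the article's SIMD code), just to make the digit bounds concrete:

    /* Two-digit month: per-character bounds let 13..19 slip through
       (first digit <= '1', second <= '9'), so a value check follows,
       and '00' must be rejected before any "month - 1" table lookup. */
    int parse_month(const char *p) {
        if (p[0] < '0' || p[0] > '1' || p[1] < '0' || p[1] > '9')
            return -1;                       /* bad characters */
        int m = (p[0] - '0') * 10 + (p[1] - '0');
        if (m == 0 || m > 12)
            return -1;                       /* catches 00 and 13..19 */
        return m;
    }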
> You'd think a professor would want to encourage better documentation, or at least expose students who are reading this post to some basic commenting.
Welcome to academic code. We're lucky there's any comments at all. :)
What do you think I'm talking about? OP pruned the comments out of the GitHub code and put it on the page without them. I didn't click through to the GitHub page. Still confused?
In datasets that I've worked with, it is so incredibly common and so incredibly slow that I previously made a fast library to work with them: https://github.com/williame/TimeMillis
Not as fast as SIMD, perhaps, but much faster than the Java standard library…
Nothing wrong with not having seen it, but I default to timestamps like this for log files; I use it all the time in scripts. I even do it manually and have an alias in my .bashrc I call ‘now’ that spits out a call to date with that format string. This is, obviously, useful for organizing, sorting, and finding files & folders by date & time. (And, hopefully obviously, creation & mod times on files are completely unreliable for this once you factor in copying & archiving.)
I've seen lots of %Y%m%d on filenames timestamped by humans.
People timestamp stuff for different reasons. If you do it to avoid collisions then adding sub-second precision makes sense, but in many contexts second or even day precision is all you care about.
The important point is that %Y%m%d%H%M%S is human parsable and still sorts correctly (though humans typically prefer %Y-%m-%d %H:%M:%S for readability)