Parsing time stamps faster with SIMD instructions (lemire.me)
227 points by mfiguiere on July 2, 2023 | 103 comments



Neatly omits the massively expensive calendar conversion step for turning the timestamp into a useful format for fast arithmetic. Parsing integers from strings is not why parsing timestamps is slow, any more than counting the length of strings is why XML parsing might be slow.

For time-sequential data, one easy speedup hack is caching the UNIX epoch for the current day start and adding the HH:MM:SS manually. No measurements to offer, but I'd suggest approaches like this would cleanly beat SIMDification of a trivial part of a complex task.


Do you mean the conversion from YYYY-MM-DD to Unix Time? That can be quite fast; I wrote a blog article showing some algorithms that take less than 10ns to perform the conversion: https://blog.reverberate.org/2020/05/12/optimizing-date-algo...

Recently I came across another variant of the algorithm that is even smaller and simpler. I've been meaning to test this variant and add it to my article (assuming it is correct), but I haven't had the chance: https://dotat.at/@/2008-09-10-counting-the-days.html. (EDIT: I just tried it out. It works, and while the source code is smaller than my other variants, epoch_days_fast() from my article is still smaller and faster in terms of machine code: https://godbolt.org/z/Kjnafzqcx vs https://godbolt.org/z/e7nchjKGW)


That's amazing! A long time ago I wrote a conversion function that took longer than 10ns but had the input normalization feature you skipped in your blog post. (Say, January 32nd is automatically transformed to February 1st, and the thirteenth month of a year becomes January of the following year.) Those are simple to implement and I doubt they cost much.

To normalize months outside of the 1–12 range:

    --month;                      // switch to 0-based months
    if (month >= 12 || month < 0) {
      year += month / 12;         // carry whole years (C division truncates toward zero)
      month %= 12;
      if (month < 0) {            // the remainder of a negative month is negative in C
        --year;
        month += 12;
      }
    }
    ++month;                      // back to 1-based months
Fixing incorrect days is even easier: most algorithms do not need any changes.
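
For illustration, here is Howard Hinnant's well-known days_from_civil formulation in C (not necessarily the variant used in the article linked above): the day-of-month is a plain additive offset, so an out-of-range day like January 32nd lands on the same epoch day as February 1st with no special handling.

    #include <stdint.h>

    // Days since 1970-01-01 for a civil (proleptic Gregorian) date.
    // The year is shifted to start in March so leap days fall at the end,
    // and the day-of-month is simply added at the end.
    static int64_t days_from_civil(int64_t y, int m, int d) {
        y -= m <= 2;
        int64_t era = (y >= 0 ? y : y - 399) / 400;
        int64_t yoe = y - era * 400;                                   // [0, 399]
        int64_t doy = (153 * (m + (m > 2 ? -3 : 9)) + 2) / 5 + d - 1;  // day of shifted year
        int64_t doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;           // day of era
        return era * 146097 + doe - 719468;
    }

    // days_from_civil(2023, 1, 32) == days_from_civil(2023, 2, 1),
    // i.e. "January 32nd" needs no extra code.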


Amazing work :) Do you know if glibc gmtime_r() might be impacted by the active timezone? My "massively expensive" experience above came from gmtime_r() on Amazon Linux running in Lambda, but I can't remember the exact scenario involved (or e.g. whether TZ was set). It certainly seems unlikely to have been worth the effort I put into it if the costs were similar to those observed in your post.


Thanks! I didn't experiment with gmtime_r() since it goes in the other direction (UnixTime -> civil time). I wouldn't expect timegm() or gmtime_r() to depend on current timezone, since both are hard-coded to use UTC. I never did figure out why the macOS version is performing mutex locks on global data.


Indeed! It was a while ago, but I'm quite proud of the fast timestamp functions I made for Java after the standard library's timestamp performance turned out to be critical to some data pipelines I was working on.

The early versions of the code did cache the last parsed date etc., but as the parser got quicker this offered diminishing returns.

The fastest way I found for the calendar month numbers etc. was to use a precomputed table.

https://github.com/williame/TimeMillis
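
For illustration only (a C sketch, not the actual TimeMillis code), the usual table trick is to precompute the cumulative days before each month and add a leap-year adjustment after February:

    // Cumulative days before each month in a non-leap year.
    static const int DAYS_BEFORE_MONTH[12] =
        {0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334};

    // 0-based day of year; leap years get +1 for dates after February.
    static int day_of_year(int year, int month, int day) {
        int leap = (year % 4 == 0 && year % 100 != 0) || (year % 400 == 0);
        return DAYS_BEFORE_MONTH[month - 1] + (month > 2 ? leap : 0) + day - 1;
    }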


Wouldn't it be possible to contribute the faster algorithm to the JDK (e.g. the Instant class)? Maybe you'd need some iterations, but do you see any downsides?


> omits the massively expensive calendar conversion step

Is that not included in what's 6x faster? I agree comparing against strptime is a bit of a gimme and a scalar fixed-format parser would be fairer, but I'm pretty sure it will beat that too.

Edit to reply since rate-limited:

> No mention of calendar in either place. Did you see something in the article or code, or are you just asking us to check?

The calendar conversion (or rather, the fact that the SIMD part leaves it out) is mentioned in the blog post - "It does not get all the parsing done... We can just use standard C code for the result." - and the conversion itself is very obviously in the GitHub code.

https://github.com/lemire/Code-used-on-Daniel-Lemire-s-blog/...


After reading twice and scouring the code on GitHub, I don't think so.

No mention of calendar in either place.

Did you see something in the article or code, or are you just asking us to check?


> one easy speedup hack is caching the UNIX epoch for the current day start and adding the HH:MM:SS manually

Side note, this is a really great example of why you'd want an operation such as parsing to be state-mutating when you'd normally expect it not to, i.e. why it makes sense to support passing a Parser object rather than just a parse() function, even for operations where you don't think it's necessary at first glance.
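
A rough sketch of that shape in C (the names, the "YYYYMMDDHHMMSS" layout, and the parse_date_to_unix() helper are all made up for illustration; the helper stands in for whatever full calendar conversion you already have):

    #include <stdint.h>
    #include <string.h>

    // Hypothetical: full calendar conversion of "YYYYMMDD" to Unix time at 00:00:00 UTC.
    int64_t parse_date_to_unix(const char *yyyymmdd);

    typedef struct {
        char cached_date[8];       // "YYYYMMDD" of the last timestamp seen
        int64_t cached_day_start;  // Unix time at midnight of that day
    } ts_parser;

    // Parse "YYYYMMDDHHMMSS"; the calendar math only runs when the date changes.
    static int64_t ts_parse(ts_parser *p, const char *s) {
        if (memcmp(s, p->cached_date, 8) != 0) {
            memcpy(p->cached_date, s, 8);
            p->cached_day_start = parse_date_to_unix(s);
        }
        int hh = (s[8] - '0') * 10 + (s[9] - '0');
        int mm = (s[10] - '0') * 10 + (s[11] - '0');
        int ss = (s[12] - '0') * 10 + (s[13] - '0');
        return p->cached_day_start + hh * 3600 + mm * 60 + ss;
    }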


the downside is that now you will have to consider thread safety


Yep, you'd want to copy the parser per thread.


FWIW, the last time I measured, the string handling part was far slower than the timezone conversion when loading timestamptz data in Postgres.


What are you referring to? The benchmark includes the step to convert to a Unix time integer at the end of sse_parse_time(), even handling leap years.


> We know that some character must be smaller than 9, for example, we cannot have more than 59 seconds and never 60 seconds, in the time stamp string.

This is how you get leap seconds bugs?


Reminds me of that post about things programmers always get wrong about dates [1].

[1]: https://news.ycombinator.com/item?id=4128208


> we want to convert them to an integer presenting the number of seconds since the Unix epoch. The Unix epoch is January 1st 1970.

UNIX time ignores leap seconds.

I.e. it's the number of seconds ignoring leap seconds since 1970.


> UNIX time ignores leap seconds.

Side note but I always found this phrasing to be incredibly confusing (hence the need for the "i.e."...); it's the exact opposite of what I would think it means to "ignore" leap seconds. Ignoring something means you don't change your behavior when that something happens. But when leap seconds happen, the epoch timestamp halts its counting. That's not ignoring leap seconds -- that's actively countering the leap seconds!


yeah - "ignoring" to me means "is ignorant of" means "one second passing means increasing the time by one second". You do by far the most obvious thing when time passes, and you ignore the complexities of reality to continue being maximally obvious.

Doing anything else means handling them by adding custom behavior.


Ignores = does nothing

Normally, UNIX time goes up by one each second.

Unless, it's a leap second, in which case UNIX time does nothing.


Halting the advancement that would've ordinarily happened is not "ignoring it", it's doing something to actively counter it. That "something" ends up canceling the previous trajectory to evaluate to "nothing", but that doesn't happen through ignorance - it required a countering action.

Example: You're driving your car when pedestrians appear in front of you. What does "ignoring" them constitute? Stopping the car, or running into them at the previous speed?


I'd say that the people counting leap seconds are the ones "doing something".

At the point they needed to be inserted, Unix systems carry on counting seconds as units of time as they pass, "ignoring" the need to add 1 for the purposes of accounting.


> At the point they needed to be inserted, Unix systems carry on counting seconds as units of time as they pass

Unix systems will essentially roll back clocks during leap seconds. Here is a handy dandy table as an example:

    Time                         tv_sec         tv_usec    
    2015-06-30T23:59:58.9Z       1435708798     900000
    2015-06-30T23:59:59.0Z       1435708799          0
    2015-06-30T23:59:59.1Z       1435708799     100000
    ...                          ...            ...          
    2015-06-30T23:59:59.9Z       1435708799     900000
    2015-06-30T23:59:60.0Z       1435708799          0
    2015-06-30T23:59:60.1Z       1435708799     100000
    ...                          ...            ...          
    2015-06-30T23:59:60.9Z       1435708799     900000
    2015-07-01T00:00:00.0Z       1435708800          0
    2015-07-01T00:00:00.1Z       1435708800     100000
    ...                          ...            ...          
I don't know about you but I wouldn't call that "counting seconds as units of time as they pass"


If a system considers 2016-12-31 23:50 UTC and 2017-01-01 00:10 UTC to be 20 minutes apart because it is ignorant of leapseconds - is that ignorance "active"?


I suppose you can envision UNIX clock with "inertia"/Newton's second law.

But when I count people passing me, I press my clicker each time to count them. Except once in a while I don't do anything, i.e. ignore them.


Not doing anything is a decision to depart from your standard behavior. It’s not “ignoring”.


The computer that uses timestamps usually recognizes leap seconds. But that doesn't mean the timestamp itself recognizes them.

You can make a pretty good argument that "every day is 86400 seconds" is ignoring leap seconds. And that's what "Unix time" does.

Though you can also make a good argument that "seconds since 1970, no idea what a 'day' is" ignores leap seconds.


Sure, but he’s not parsing Unix timestamps. He’s parsing time in the format %Y%m%d%H%M%S, which would have leap seconds on most systems


He's parsing them into UNIX timestamps.

And apparently these strings don't have second 60.

Works for me.


Yes, that means that :60 should parse to the same value as :59. My reading of the code is that now it might end up in the `return false` branch instead, although I'm not exactly sure how the limits thing is used here.


That is not how POSIX defines seconds since the epoch. It says that :60 is the same as :00 of the following minute.

https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1...
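
For reference, POSIX's "Seconds Since the Epoch" definition is plain arithmetic over the broken-down fields, so tm_sec = 60 evaluates to exactly the same value as :00 of the following minute. A transcription of the formula into a helper, purely for illustration:

    #include <time.h>

    // POSIX "Seconds Since the Epoch" formula (tm_year counts years since 1900).
    // With tm_sec == 60 this yields one more than tm_sec == 59, i.e. the same
    // value as second :00 of the next minute.
    static long long posix_seconds_since_epoch(const struct tm *t) {
        return t->tm_sec + t->tm_min * 60LL + t->tm_hour * 3600LL +
               t->tm_yday * 86400LL + (t->tm_year - 70) * 31536000LL +
               ((t->tm_year - 69) / 4) * 86400LL -
               ((t->tm_year - 1) / 100) * 86400LL +
               ((t->tm_year + 299) / 400) * 86400LL;
    }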


I believe if you're using epoch timestamps you already have leap seconds bugs. I don't think they increment/decrement when the leap second happens.


If you're using epoch timestamps to represent time you don't have leap second bugs. Each Unix timestamp corresponds to one unique point in time which also has some representation in civil time.

If you're subtracting Unix timestamps to get durations you may have bugs, but those bugs might align better to your user's expectations anyway.


> Each Unix timestamp corresponds to one unique point in time

That depends. 30th June 2015 23:59:59 has the timestamp 1435708799. 1st July 2015 00:00:00 has the timestamp 1435708800. But what about the leap second that happened between those two seconds? I think most people would claim it also has timestamp 1435708799. But then "Each Unix timestamp corresponds to one unique point in time" isn't true, because 1435708799 now corresponds to two different points in time (30th June 2015 23:59:59 and 30th June 2015 23:59:60). Saying there is a unique mapping from unix timestamp to time would only be true if you either say that 30th June 2015 23:59:60 has no unix timestamp (which isn't a valid return value of any time API I've seen) or if you smear time.


> 30th June 2015 23:59:60 has no unix timestamp (which isn't a valid return value of any time API I've seen)

https://go.dev/play/p/I8GcKBoAVXi


Well, all time stamps are rounded or floored to the nearest second (or millisecond, or micro, or...) because you only have a finite amount of space to store them — you always have infinitely many time stamps mapping to each Unix second.

This means you can just adjust the mapping around the leap second. A naive implementation might have those two real-life seconds tick twice as fast to take only one Unix second. Or you could have the whole day’s seconds tick 86401/86400 times as fast, etc. In terms of the mapping from continuous time to discrete time, this is just as fine as the non-leap-second case.
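
A minimal sketch of the day-long option in C (a linear smear over a day containing one positive leap second; purely illustrative):

    // Map real (UTC) seconds elapsed since the smear day's midnight onto Unix
    // time: 86401 real seconds are squeezed into 86400 Unix seconds.
    static double smeared_unix_time(double day_start_unix, double real_elapsed) {
        return day_start_unix + real_elapsed * (86400.0 / 86401.0);
    }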


It still wouldn't be true because there are points in time smaller than a second.


No, I didn't say it was a bijective mapping. (Although with smearing, it still can be.)


> Each Unix timestamp corresponds to one unique point in time which also has some representation in civil time

This is plain false. For example 1435708799 corresponds to both 2015-06-30T23:59:59Z and 2015-06-30T23:59:60Z


No, your misdesigned ntpd resets your clock to 1435708799 at 2015-06-30T23:59:60Z. I have yet to see a convincing argument that smearing and freezing are not compatible with POSIX's definitions of time.


It's the kernel that does the leap second handling; ntpd just informs it of upcoming leaps.

https://elixir.bootlin.com/linux/v6.4.1/source/kernel/time/n...

Sure, leap smearing could be considered compatible with POSIX in the sense that POSIX afaik does not require any level of accuracy from clocks, so even something silly like xkcd 2266 could be considered POSIX compatible: https://xkcd.com/2266/


No, bugs can come up without any subtraction. If an event was recorded as happening at 23:59:60.2, you need it to be sequenced strictly after 23:59:59.9. Winding back the clock by 1 second makes it so you end up with the later event as having occurred at 23:59:59.2, violating causality. Even without sub-second precision the situation is still buggy, just not as bad - you end up with two simultaneous events when there weren't any in the original representation, which is itself quite surprising when the two look like they have the same precision.


> you end up with the later event as having occurred at 23:59:59.1

Unix time is seconds since 1970... If you convert between time formats you MAY lose information. That has nothing to do with Unix time.


> Unix time is seconds since 1970...

This is an unfortunately common misconception. Unix time is in reality seconds since the epoch minus the number of leap seconds.


No, it just doesn't care about "leap seconds".

It's just a guy who started counting at 1970.


UNIX time double-counts leap seconds.

UTC: 58, 59, 60, 00, 01

UNIX: 58, 59, 00, 00, 01


"Lose information" is not a binary thing, nor is it equivalent to introducing wrong information. You don't expect (short)INT_MAX to suddenly give you 1337 just because it "loses information". You still want to have (and do have) some useful properties & information to be preserved even after certain information loss, otherwise the implementation is buggy despite the information loss.

In this case, without leap seconds, you'd have still lost the precision, but you wouldn't have actually violated causality. Winding back during a leap second makes it so that a timestamp that's > 1 second later in real time still ends up being ordered as strictly before the earlier time. That's a bug, regardless of how you slice it.

Now this can be partially avoided by saturating the subtraction at the boundary rather than winding it back, and I'm not sure how many implementations do this. That seems better, but it's still surprising (and thus bug-prone) given the input and output representation otherwise appear to have sufficient precision.


I would say that Unix time doesn't violate causality, because it only does whole seconds and those whole seconds monotonically increase. You need to take extra care when bolting on fractional seconds.

If converting things from one system to another means two very close events get the same timestamp? Eh, that can happen in many ways with many systems. That's not bugginess.


Can you elaborate? How does this lead to a leap second bug? Would a string really say "60" as the number of seconds, somehow, due to leap seconds?



Yea, my eye twitched when I read that and I said to myself “so, who’s gonna tell him?” Glad to see this is the top comment. This is why we should not roll our own anything.


> This is why we should not roll our own anything.

This seems like a major overcorrection. It's equivalent to saying "you can't know anything" or "no one is qualified to do anything".

You really have to consider the downsides. In cryptography, "don't roll your own crypto" is the maxim because the downside is pretty bad. Make sure you're an expert before writing crypto.

What's the downside to mis-parsing leap seconds? Maybe a 500 on a super rare request? Then they patch it and it's fine? Most computers are off by more than 1 second anyway.

I think this mindset comes from developers looking into some field and realizing it's super deep ("Did you know there are missing dates in the gregorian calendar if you go back hundreds of years? Your time library is incorrect if it doesn't handle these cases!"). And sure, most things aren't completely correct, but you have to consider what the downside is to being slightly wrong. Maybe it'll be wrong in cases that will never actually happen in practice.


> This is why we should not roll our own anything.

How is anything going to be made then?

Also, the author isn't just anybody and has created stuff that you've most likely interacted with. If almost nobody should "roll his own", he's one of those who should.


At one consulting gig I had at an airline, I noted that timestamps were everywhere and among the most frequent operations in the system was comparison with current time. The legacy systems spit out date/time in at least 3 different formats, and almost never included the timezone. The TZ had to be determined by looking up the airport's TZ.

When I got there, time comparisons were a mess: parsing in multiple places, some comparisons not taking the TZ into account, some add and subtract operations incorrectly coded, and other issues.

I tried to make the case that parsing and comparing timestamps was a core operation and that the system should have all that functionality polished and packaged up into its own module so programmers could focus on the rest of the system, but I was unsuccessful. We were writing in Go, so the language libraries did about 80% of the work, and the less experienced programmers thought that was sufficient.

When I left they were struggling with issues around determining passenger age at the time of flight and getting off-by-one errors around certain cutoff ages.

Time is Hard. https://gist.github.com/timvisee/fcda9bbdff88d45cc9061606b4b...


Contrast that C code with:

    pub fn parseYear4(text: *const [4]u8) !u16 {
        const nnnn: @Vector(4, u16) = .{ text[0], text[1], text[2], text[3] };
        const zero: @Vector(4, u16) = .{ '0', '0', '0', '0' };
        const mmmm: @Vector(4, u16) = .{ 1000, 100, 10, 1 };
        const result = @reduce(.Add, (nnnn -% zero) *% mmmm);
        if (result > 9999) return error.CertificateTimeInvalid;
        return result;
    }

source: https://github.com/ziglang/zig/blob/fc9ab5f0e838196e99d1706e...

machine code example: https://godbolt.org/z/ezrPEdPx7


That snippet would not reject bad chars; are non-digits rejected somewhere else? A simple scalar loop that multiplies an accumulator by 10, atoi() style, would be faster.
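
For comparison, a rough, unbenchmarked C sketch of such a scalar loop that also rejects non-digits:

    #include <stdint.h>

    // Parse exactly four ASCII digits into a year; returns -1 on any non-digit.
    static int parse_year4_scalar(const char s[4], uint16_t *out) {
        uint32_t v = 0;
        for (int i = 0; i < 4; i++) {
            unsigned d = (unsigned char)s[i] - '0';
            if (d > 9) return -1;  // rejects bad chars such as '/' or ':'
            v = v * 10 + d;
        }
        *out = (uint16_t)v;
        return 0;
    }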


How hard would it be to modify this to work with ISO-8601, the international time stamp format?

In the most common case, you'd have something like 2023-07-02T15:46:00Z, possibly with a ".123" between the last "0" and the "Z".

Would it be fastest to just copy the relevant chars to the correct place for the code in the post? Or could you do it in place somehow?


The code in the post doesn't seem to finish the job. For a timestamp like "2023-07-02T15:46:00Z" or the other formats you mentioned, the timestamp fits in a single 32-byte register, so you could write code that deals with the string as it is. If you have a mix of timestamps with different numbers of decimals or with different time zones, dealing with those will take a bit of work compared to the ideal case where you had all "Z" or all "+" or all "-" for the time zones and all the same number of decimals for the fractional seconds.


Shouldn't be too hard, assuming you don't need any validation. There are afaik SIMD instructions to move bytes around (vpermb, maybe?), so that should take care of getting rid of the separators.
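
A rough sketch of that idea using SSSE3's _mm_shuffle_epi8 (pshufb) rather than vpermb; a single 16-byte register only reaches the first 16 characters, so the seconds digits are copied separately, and there's no validation:

    #include <string.h>
    #include <tmmintrin.h>  // SSSE3: _mm_shuffle_epi8

    // Compact "2023-07-02T15:46:00Z" into the 14 digits "20230702154600".
    static void compact_iso8601(const char in[20], char out[14]) {
        __m128i v = _mm_loadu_si128((const __m128i *)in);  // first 16 bytes
        // Pick the digit positions of "YYYY-MM-DDTHH:MM"; 0x80 lanes become zero.
        const __m128i idx = _mm_setr_epi8(0, 1, 2, 3, 5, 6, 8, 9, 11, 12, 14, 15,
                                          (char)0x80, (char)0x80, (char)0x80, (char)0x80);
        char tmp[16];
        _mm_storeu_si128((__m128i *)tmp, _mm_shuffle_epi8(v, idx));
        memcpy(out, tmp, 12);  // YYYYMMDDHHMM
        out[12] = in[17];      // seconds, tens digit
        out[13] = in[18];      // seconds, ones digit
    }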


The solution presented doesn't seem competitive enough to rank well on the problem posed here: https://highload.fun/tasks/14

That site is really good for playing with SIMD code.


Solution code is not available, as you can continuously update your solution.


What is actually the task here? Parsing is clear and it looks like one has to sum all parsed timestamps, but what does the additional comment about the shortest time mean?


It means that your solution is scored on how fast it runs.

Frequently we end up with solutions that use tricks that would not be applicable to the n=1 problem in TFA ("parse 1 timestamp") when we try to get the fastest solutions for "find the sum of all these timestamps."


How can I see Yuriy’s top solution for this task?


You can ask him but he probably won't share


Kind of a weird leaderboard if you can't even check to see if answers satisfy correctness. I've seen plenty of leaderboards where the top answers would fail cases that weren't covered by the leaderboard's testing.


I think it's not such a big deal because the input generators are usually pretty good and because most solutions that give up some correctness for performance can pretty cheaply detect that they done goofed and fall back to a slow path where they start over and do it some other way. This ends up being useful because scoring is the median runtime of 3 successful runs (so if you fail one of the 3 then your submission fails completely). It also means that probabilistic solutions that would be correct in non-adversarial environments (like using a bloom filter or N bloom filters tuned to give an exact result given the input constraints with probability >99.9%) are admissible.


This particular solution does satisfy correctness, although I can't share the source code (as the competition is ongoing). Feel free to provide your input data and I'll run it to compare with your expected result.


Yeah, most of the leetcode top solutions are a hardcoded table of answers for the scoring questions followed by the general solution.


Is the code for some of the solutions available? Thanks for the link. That's a fun place.


This is really cool. I recently wanted to implement an ARGB->RGBA conversion function with SIMD in Rust, but as far as I can tell SIMD is STILL not in stable Rust :(


The CPU-specific intrinsics stabilized a little while ago, for cases where you need SIMD on a specific platform at least. A lot of crates like glam seem to only support SIMD on x86/x86-64 for that reason.

I know that `wide`, `faster`, and `simdeez` are options for crates to do cross-platform SIMD in safe Rust. Eventually we'll have `std::simd`, but I don't know what the timeline is or what the blockers are.


SIMD is stable in Rust; it just requires unsafe.

Here's an article I wrote about using SIMD in Rust, with a real-world example: https://neosmart.net/blog/using-simd-acceleration-in-rust-to...


Generic, architecture-independent SIMD isn't stabilised, aka the proposed std::simd. But yeah, the x86-specific stuff is there.


You won't get much speed up using SIMD for such a simple task. You'll be memory bandwidth limited.


Interesting. I think you'd find that doing this in chunks of 128-bit wide vectors with platform-agnostic logic would be even better. Have you tried something like Google Highway or writing this exact logic in scalar code and seeing if the compiler will autovectorize it? Then you have the option to do up to 4 of them at once instead of one if your platform supports it.


Why does month in the limit vector go to 19?


There are two digits for months. A check is required per character to prevent bad chars (e.g. 0x3A) from being laundered by the multiply. The '19' comes from the fact that the first digit may only be 0 or 1, while the second digit may be any value 0..9 (e.g. September is '09', October is '10'). The second check catches bad month values between 13..19 which were not caught by looking at individual digits. Realistically, the first check may be overbuilt: it only needs to check is_digit or not, but it still has to ignore the padding bytes at the end, somehow. Now... I believe there would be a problem with the month value of '00'... because it unconditionally subtracts 1 from that field and then uses it as a table index.
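
A scalar rendition of the two-stage idea (illustrative only, not the post's actual SIMD code):

    // Stage 1: per-digit limits ('0'..'1', then '0'..'9'); stage 2: value range.
    static int parse_month(const char s[2]) {
        if (s[0] < '0' || s[0] > '1') return -1;  // tens digit capped at 1 (the "19" limit)
        if (s[1] < '0' || s[1] > '9') return -1;  // ones digit must be a digit
        int m = (s[0] - '0') * 10 + (s[1] - '0');
        return (m >= 1 && m <= 12) ? m : -1;      // rejects 00 and 13..19
    }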


OK, 8-bit unsigned ints and a zero-padded month, and then they do the zero-saturating subtract. I hadn't made it to the second check, doh.


[flagged]


> You'd think a professor would want to encourage better documentation, or at least expose students who are reading this post to some basic commenting.

Welcome to academic code. We're lucky there's any comments at all. :)


What are you even talking about? There's tons of comments and the code isn't obscure. https://github.dev/lemire/Code-used-on-Daniel-Lemire-s-blog/...


What do you think I'm talking about? OP pruned the comments out of github and put the code on the page w/o them. I didn't click to the github page. Still confused?


You mean the small code snippet surrounded by a page full of commentary? You seem to be the confused one.


Looks to be using a lot of similar ideas to the ones explained well in these articles:

https://kholdstare.github.io/technical/2020/05/26/faster-int...

http://0x80.pl/articles/simd-parsing-int-sequences.html


The methods in those articles could be slightly improved by picking better multipliers. https://stackoverflow.com/a/71594769



Oh, I didn't follow that link, I was just reading the blog. Those are decent comments, too bad they were nixed for the blog.


But the entire paragraph where they are introduced is a commentary on them.


[flagged]


You've unfortunately been breaking the site guidelines repeatedly. Can you please review https://news.ycombinator.com/newsguidelines.html and stick to them? We'd be grateful.


Code from academia is notoriously unreadable


I don't think I have ever seen %Y%m%d%H%M%S used in real life.


Sweden uses "%Y-%m-%d %H:%M:%S" - removal of the separators using SIMD left as an exercise ;-) https://en.wikipedia.org/wiki/Date_and_time_notation_in_Swed...


China does too, and IIRC Japan and South Korea.


in datasets that I’ve worked with it is so incredibly common and so incredibly slow that I’ve previously made a fast library to work with them: https://github.com/williame/TimeMillis

Not as fast as SIMD, perhaps, but much faster than the Java standard library…


"The TimeMillis library offers helper functions to read, write and operate on timestamps stored as milliseconds in UNIX epoch."

That doesn't seem related?


The parse function takes a string in the format mentioned.


Nothing wrong with not having seen it, but I default to timestamps like this for log files; I use it all the time in scripts. I also do it manually and have an alias in my .bashrc called 'now' that spits out a call to date with that format string. This is, obviously, useful for organizing, sorting, and finding files & folders by date & time. (And, hopefully obviously, creation & mod times on files are completely unreliable for this once you factor in copying & archiving.)


You mean without delimiters, or? Stripping superfluous delimiters in logs isn't unusual.


That's a very common format.


Other than RRSIG (which I would argue is not at all common), where have you seen it?


Timestamped filenames.


Whenever I've seen it on timestamped filenames I've also needed sub-second precision.

If that's a performance concern, a (potentially zero-padded, but realistically you don't even need that) Unix timestamp works just as well.


I've seen lots of %Y%m%d on filenames timestamped by humans.

People timestamp stuff for different reasons. If you do it to avoid collisions then adding sub-second precision makes sense, but in many contexts second or even day precision is all you care about.

The important point is that %Y%m%d%H%M%S is human parsable and still sorts correctly (though humans typically prefer %Y-%m-%d %H:%M:%S for readability)



