Day one part 2 was relatively rough. Things I learned from it: rust regex crate doesn't support look-ahead, rust onig crate is currently broken in many ways and shouldn't be used (the version in crates.io doesn't compile and the version on GitHub is failing tests and look-ahead isn't working). It was a very frustrating time for me. After 2 hours of troubleshooting the above I used the same approach in python and it took 2 minutes to write. So annoying.
> The regex crate doesn't support arbitrary look-around because it isn't known how to implement efficiently.
A bit of a philosophical question:
If how to write an efficient implementation is yet not known to man, ie. not a matter of the library's author time or skills, but literally a limit on human knowledge: why not at least provide the functionality with a good enough implementation? (with caveats just possibly mentioned in documentation)
IMHO that'd be arguably a good thing for everybody, at a minimum better than just not offering the possibility at all. Which drives users to frustration, or leaves them having to discover a more pragmatic alternative lib that opted to add it.
This is no complaint or feature request... I just want to learn from some insight behind the thought process of "if it's not efficient, better not have it at all"
PS. Thanks for the link. Now I have a good read for the weekend, for sure!
ReDoS is one answer. Another answer is that it would effectively require supporting another regex engine with very different semantics internally. That's a tall order, and with the existence of fancy-regex (and other engines), there's no real compelling reason to do it.
You don't need to be all things to all people. Embrace the fact that there are other choices!
Because it is a mechanism for ReDOS, and the standard library should not be introducing vulnerabilities into users. Other libraries can implement it for folks who decide they really need it.
Okay, let me amend my comment. If you have a degree in CS, you're absolutely sure that you've understood all the caveats of the libraries you're using, and you limit your inputs so that the expected running time is under your target execution time, then you can avoid putting a timeout on your regex executions. In any other case, add a timeout.
Ah jeez I totally missed this crate! Thanks!! I had originally gone with Onig because of a SO post you made years ago. Fancy-regex substituted right in and worked immediately. Much appreciation for all you do
It's a Regular Language, in the mathematical sense. All you need is the greedy match anything sequence, I don't know why so many people jumped to non-Regular regex extensions.
Oh I understand. That's a good idea. Honestly I only ever use regex for AOC so I never would've thought of using it like that. I'll keep it in mind. Thanks!
Easy mode for this use-case is to match on a reversed version of the regex on a reversed version of the string, but yeah, negative lookahead would be great.
Yeah I mean I did think of that but after spending all that time figuring out Regex I was decided to commit as fully as I could until it was 3am and I had to go to bed. There were lots of alternative solutions I thought of but chose to ignore out of stubborness. I'm admit, a good bit of the frustration was my own fault (but also c'mon this was way harder than any day 1 we've had before)
Yeah, that was exactly was I was doing. For the left number do everything as normal, while walking through the string. For the reverse part, just walk through the reverse of the string and match the reversed keywords.
I'm not familiar with the AoC problem. You might be able to. But RegexSet doesn't give you match offsets.
You can drop down to regex-automata, which does let you do multi-regex search and it will tell you which patterns match[1]. The docs have an example of a simple lexer[2]. But... that will only give you non-overlapping matches.
You can drop down to an even lower level of abstraction and get multi-pattern overlapping matches[3], but it's awkward. The comment there explains that I had initially tried to provide a higher level API for it, but was unsure of what the semantics should be. Getting the starting position in particular is a bit of a wrinkle.
It's kind of awkward with that one, because you still have to check the individual patterns; it doesn't give you a multi-match on each pattern.
With the Aho-Corasick implementation you can just map the string -> {ordered list of matches} -> numbers associated with the match, and then you've got a little vec of digits you can grab the first and last entries of. Ended up being just a few lines of code, together with a hard-coded list of ["0", "one", "1", "two", ...] and the numbers they mapped to [0, 1, 1, 2, 2, ...].
> With the Aho-Corasick implementation you can just map the string -> {ordered list of matches} -> numbers associated with the match, and then you've got a little vec of digits you can grab the first and last entries of.
My solution was to shove it through `Itertools::minmax_by_key(|m| m.start()).into_option()`, which returns the lowest and highest matches (or a duplicate of a single match). Then to map to digits I actually ordered the patterns differently: I went 1, 2, 3, ..., one, two, three, ...
That way:
- for part 1 I could slice out the first 9 elements and it works uniformly
- mapping a "digit" to an actual digit is taking the index (match.as_pattern().as_usize()) modulo 9 to shift the textual versions to the numerical, then add one.
0/zero is not a valid digit so you can just ignore it, although you could always include it, use mod 10, and not increment the result, so same diff.
Ah, that's a nice observation about zero, that would have saved me some typing. :-)
I did the same thing with my actual solution in terms of order (but I went 0, 1, 2, 3), but mostly so I could truncate the matching array to solve part 1. Notably, that was a ret-con of my actual solution to part 1, which I originally did by just mapping the characters through .is_ascii_digit(), but I wanted to consolidate the code a little. I ended up with:
Definitely not but it was the most elegant solution that I could think of right away, so I committed to it. In hindsight, it would've been relatively easy to solve with some loops and no dependencies but hey, I wouldn't have learned so much about the rust regex ecosystem that way.