Day one part 2 was relatively rough. Things I learned from it: rust regex crate ...

burntsushi · on Dec 1, 2023

Rust regex crate author here.

fancy-regex is built on top of the regex crate and supports look-around.

The regex crate doesn't support arbitrary look-around because it isn't known how to implement efficiently.

See: https://swtch.com/~rsc/regexp/regexp1.html

j1elo · on Dec 1, 2023

> The regex crate doesn't support arbitrary look-around because it isn't known how to implement efficiently.

A bit of a philosophical question:

If how to write an efficient implementation is yet not known to man, ie. not a matter of the library's author time or skills, but literally a limit on human knowledge: why not at least provide the functionality with a good enough implementation? (with caveats just possibly mentioned in documentation)

IMHO that'd be arguably a good thing for everybody, at a minimum better than just not offering the possibility at all. Which drives users to frustration, or leaves them having to discover a more pragmatic alternative lib that opted to add it.

This is no complaint or feature request... I just want to learn from some insight behind the thought process of "if it's not efficient, better not have it at all"

PS. Thanks for the link. Now I have a good read for the weekend, for sure!

burntsushi · on Dec 1, 2023

ReDoS is one answer. Another answer is that it would effectively require supporting another regex engine with very different semantics internally. That's a tall order, and with the existence of fancy-regex (and other engines), there's no real compelling reason to do it.

You don't need to be all things to all people. Embrace the fact that there are other choices!

fsociety · on Dec 1, 2023

Because it is a mechanism for ReDOS, and the standard library should not be introducing vulnerabilities into users. Other libraries can implement it for folks who decide they really need it.

uxp8u61q · on Dec 1, 2023

You need to put a time limit on your regex execution no matter what, if you're parsing untrusted input.

burntsushi · on Dec 1, 2023

Not necessarily. But it's complicated. See: https://docs.rs/regex/latest/regex/#untrusted-input

One of the key advantages of a regex engine based on finite automata is that it lets you make guarantees about the runtime performance of a search.

uxp8u61q · on Dec 2, 2023

Okay, let me amend my comment. If you have a degree in CS, you're absolutely sure that you've understood all the caveats of the libraries you're using, and you limit your inputs so that the expected running time is under your target execution time, then you can avoid putting a timeout on your regex executions. In any other case, add a timeout.

burntsushi · on Dec 2, 2023

I certainly wouldn't agree with that at all. And I don't see what a "degree in CS" has to do with anything.

63 · on Dec 2, 2023

Ah jeez I totally missed this crate! Thanks!! I had originally gone with Onig because of a SO post you made years ago. Fancy-regex substituted right in and worked immediately. Much appreciation for all you do

dumbo-octopus · on Dec 1, 2023

It's a Regular Language, in the mathematical sense. All you need is the greedy match anything sequence, I don't know why so many people jumped to non-Regular regex extensions.

63 · on Dec 2, 2023

I'm sorry I'm not familiar with the mathematics behind regular expressions. Can you give an example of what your approach would look like?

dumbo-octopus · on Dec 2, 2023

    first = line.match(  /(?:(0|zero)|(1|one) ... (9|nine))/)
    last  = line.match(/.*(?:(0|zero)|(1|one) ... (9|nine))/)

    indexOfGroupMatched = 
        ([_, ...groups]) => groups.findIndex(x => x != undefined)

    num = +[first, last].map(indexOfGroupMatched).join('')

63 · on Dec 2, 2023

Oh I understand. That's a good idea. Honestly I only ever use regex for AOC so I never would've thought of using it like that. I'll keep it in mind. Thanks!

czarit · on Dec 1, 2023

Easy mode for this use-case is to match on a reversed version of the regex on a reversed version of the string, but yeah, negative lookahead would be great.

63 · on Dec 2, 2023

Yeah I mean I did think of that but after spending all that time figuring out Regex I was decided to commit as fully as I could until it was 3am and I had to go to bed. There were lots of alternative solutions I thought of but chose to ignore out of stubborness. I'm admit, a good bit of the frustration was my own fault (but also c'mon this was way harder than any day 1 we've had before)

murermader · on Dec 2, 2023

Yeah, that was exactly was I was doing. For the left number do everything as normal, while walking through the string. For the reverse part, just walk through the reverse of the string and match the reversed keywords.

dgacmu · on Dec 1, 2023

You might try BurntSushi's aho-corasick crate. It led to a fairly nice solution in Rust for this one. (It will report overlapping matches)

masklinn · on Dec 1, 2023

I would assume you can also use RegexSet from the regex crate, as it

> match(es) multiple, possibly overlapping, regexes in a single search.

burntsushi · on Dec 1, 2023

I'm not familiar with the AoC problem. You might be able to. But RegexSet doesn't give you match offsets.

You can drop down to regex-automata, which does let you do multi-regex search and it will tell you which patterns match[1]. The docs have an example of a simple lexer[2]. But... that will only give you non-overlapping matches.

You can drop down to an even lower level of abstraction and get multi-pattern overlapping matches[3], but it's awkward. The comment there explains that I had initially tried to provide a higher level API for it, but was unsure of what the semantics should be. Getting the starting position in particular is a bit of a wrinkle.

[1]: https://docs.rs/regex-automata/latest/regex_automata/meta/in...

[2]: https://docs.rs/regex-automata/latest/regex_automata/meta/st...

[3]: https://github.com/rust-lang/regex/blob/837fd85e79fac2a4ea64...

dgacmu · on Dec 1, 2023

It's kind of awkward with that one, because you still have to check the individual patterns; it doesn't give you a multi-match on each pattern.

With the Aho-Corasick implementation you can just map the string -> {ordered list of matches} -> numbers associated with the match, and then you've got a little vec of digits you can grab the first and last entries of. Ended up being just a few lines of code, together with a hard-coded list of ["0", "one", "1", "two", ...] and the numbers they mapped to [0, 1, 1, 2, 2, ...].

masklinn · on Dec 1, 2023

> With the Aho-Corasick implementation you can just map the string -> {ordered list of matches} -> numbers associated with the match, and then you've got a little vec of digits you can grab the first and last entries of.

My solution was to shove it through `Itertools::minmax_by_key(|m| m.start()).into_option()`, which returns the lowest and highest matches (or a duplicate of a single match). Then to map to digits I actually ordered the patterns differently: I went 1, 2, 3, ..., one, two, three, ...

That way:

- for part 1 I could slice out the first 9 elements and it works uniformly

- mapping a "digit" to an actual digit is taking the index (match.as_pattern().as_usize()) modulo 9 to shift the textual versions to the numerical, then add one.

0/zero is not a valid digit so you can just ignore it, although you could always include it, use mod 10, and not increment the result, so same diff.

dgacmu · on Dec 1, 2023

Ah, that's a nice observation about zero, that would have saved me some typing. :-)

I did the same thing with my actual solution in terms of order (but I went 0, 1, 2, 3), but mostly so I could truncate the matching array to solve part 1. Notably, that was a ret-con of my actual solution to part 1, which I originally did by just mapping the characters through .is_ascii_digit(), but I wanted to consolidate the code a little. I ended up with:

https://github.com/dave-andersen/advent2023/blob/main/src/ma...

timeon · on Dec 1, 2023

Was regex even necessary?

burntsushi · on Dec 1, 2023

I don't think a regex is ever necessary for any problem, is it?

"necessary" is a strong word. :-)

63 · on Dec 2, 2023

Definitely not but it was the most elegant solution that I could think of right away, so I committed to it. In hindsight, it would've been relatively easy to solve with some loops and no dependencies but hey, I wouldn't have learned so much about the rust regex ecosystem that way.