You'll be able to do this soon with the Rust regex crate as well. Well, by using one of its dependencies. Once regex 1.9 is out, you'll be able to do this with regex-automata:
use regex_automata::{
    meta::Regex,
    util::iter::Searcher,
    Anchored, Input,
};

#[derive(Clone, Copy, Debug)]
enum Token {
    Integer,
    Identifier,
    Punctuation,
}

fn main() {
    // One regex built from several patterns. Each match reports the ID
    // (the index in this slice) of the pattern that matched.
    let re = Regex::new_many(&[
        r"[0-9]+",
        r"[a-z_]+",
        r"[,.]+",
        r"\s+",
    ]).unwrap();
    let hay = "45 pigeons, 23 cows, 11 spiders";
    // Anchored search: every match must begin exactly where the search starts,
    // so the iterator cannot skip over unrecognized input.
    let input = Input::new(hay).anchored(Anchored::Yes);
    let mut it = Searcher::new(input).into_matches_iter(|input| {
        Ok(re.search(input))
    }).infallible();
    for m in &mut it {
        let token = match m.pattern().as_usize() {
            0 => Token::Integer,
            1 => Token::Identifier,
            2 => Token::Punctuation,
            // Whitespace: matched, but not reported as a token.
            3 => continue,
            pid => unreachable!("unrecognized pattern ID: {:?}", pid),
        };
        println!("{:?}: {:?}", token, &hay[m.range()]);
    }
    // Anything left in the input's span is text that no pattern matched.
    let remainder = &hay[it.input().get_span()];
    if !remainder.is_empty() {
        println!("did not consume entire haystack");
    }
}
It's a bit more verbose than the Python, but the library exposes much lower-level components, so you have to do a little more stitching to get the `Scanner` behavior. But it does everything the Python does: it makes a single scan (using finite automata rather than backtracking like Python), skips certain token types, and guarantees that the entire haystack is consumed.
Yes, as I said, the APIs exposed in regex-automata give a lot more power. It's an "expert" level crate. You could pretty easily build a scanner-like abstraction and get pretty close to the Python code.
I posted this because a lot of regex engines don't support this type of use case. Or don't support it well without having to give something up.
Interesting! I don't use either crate often, but this example looks like something for which I'd usually reach for nom. Is there a reason I should consider using regex for this use case instead (if neither is a pre-existing dependency)?
I don't use nom. I've tried using parser combinator libraries in the past but generally don't like them.
That said, I don't usually use regexes for this either. Instead, I just do things by hand.
So I'm probably not the right person to answer your question, unfortunately. I just know that more than one person has asked for this style of use case to be supported in the regex crate. :-)
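
For comparison, doing the same tokenization by hand might look roughly like the sketch below. It uses only the standard library and is not taken from any crate: `Class`, `classify` and `tokenize` are names made up for illustration, and whitespace is ASCII-only here, unlike `\s` in the regex version.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Token {
    Integer,
    Identifier,
    Punctuation,
}

// Hypothetical helper type: what a single byte belongs to.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Class {
    Token(Token),
    Whitespace,
}

// Classify one byte, or return None if it cannot be part of any token.
fn classify(b: u8) -> Option<Class> {
    match b {
        b'0'..=b'9' => Some(Class::Token(Token::Integer)),
        b'a'..=b'z' | b'_' => Some(Class::Token(Token::Identifier)),
        b',' | b'.' => Some(Class::Token(Token::Punctuation)),
        b if b.is_ascii_whitespace() => Some(Class::Whitespace),
        _ => None,
    }
}

// Split `hay` into (token, text) pairs in a single pass, skipping whitespace.
// Returns Err(offset) with the byte offset of the first unrecognized byte.
fn tokenize(hay: &str) -> Result<Vec<(Token, &str)>, usize> {
    let bytes = hay.as_bytes();
    let mut tokens = Vec::new();
    let mut start = 0;
    while start < bytes.len() {
        let class = classify(bytes[start]).ok_or(start)?;
        // Extend the run for as long as the bytes stay in the same class.
        let mut end = start + 1;
        while end < bytes.len() && classify(bytes[end]) == Some(class) {
            end += 1;
        }
        if let Class::Token(token) = class {
            tokens.push((token, &hay[start..end]));
        }
        start = end;
    }
    Ok(tokens)
}

fn main() {
    let hay = "45 pigeons, 23 cows, 11 spiders";
    match tokenize(hay) {
        Ok(tokens) => {
            for (token, text) in tokens {
                println!("{:?}: {:?}", token, text);
            }
        }
        Err(offset) => {
            println!("did not consume entire haystack (stopped at byte {})", offset);
        }
    }
}
```

Since every recognized byte is ASCII, slicing `hay` at those offsets always lands on a character boundary. The trade-off against the regex version is the usual one: no dependency, but each new token type means editing `classify` by hand.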