Nom parser that skips escaped terminator characters
I've checked the other SO answers for nom parser combinator questions, but this one doesn't seem to have been asked yet.
I am attempting to parse delimited regular expressions. They will always be delimited as /...../, possibly with modifiers at the end (which are out of scope for all the data I need to parse right now). However, if there's an escaped \/ in the middle of the string, my parser stops prematurely at the first /, even if it was preceded by a \.
I have this parser:
use nom::bytes::complete::{tag, take_until};
use nom::{combinator::map_res, sequence::tuple, IResult};
use regex::Regex;
pub fn regex(input: &str) -> IResult<&str, Regex> {
    map_res(
        tuple((tag("/"), take_until("/"), tag("/"))),
        |(_, re, _)| Regex::new(re),
    )(input)
}
Naturally, take_until stops at the first / without noticing that the previous character was a \. I've looked at peek, recognize, map and a whole bunch of other things, but I keep coming up short. I feel like I want take_until("/") with some kind of awareness of escaping. In any case, I am using map_res to hand off to Rust's regex crate to do the actual parsing.
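To make the requirement concrete, here is a minimal sketch, using only the standard library rather than nom, of the escape-aware scan that take_until("/") lacks. The helper names find_unescaped_slash and split_delimited are hypothetical (they are not part of nom or regex), and the sketch assumes a backslash always escapes exactly the next character.

```rust
/// Hypothetical helper (not part of nom): returns the byte index of the first
/// `/` in `s` that is not escaped by a preceding backslash.
fn find_unescaped_slash(s: &str) -> Option<usize> {
    let mut escaped = false;
    for (i, c) in s.char_indices() {
        if escaped {
            // The previous character was a backslash, so this one is literal.
            escaped = false;
        } else if c == '\\' {
            escaped = true;
        } else if c == '/' {
            return Some(i);
        }
    }
    None
}

/// Hypothetical helper: splits `/body/rest` into `(body, rest)`.
/// Escapes inside `body` are left untouched.
fn split_delimited(input: &str) -> Option<(&str, &str)> {
    let inner = input.strip_prefix('/')?;
    let end = find_unescaped_slash(inner)?;
    Some((&inner[..end], &inner[end + 1..]))
}
```

Tracking an escaped flag, rather than only checking whether the previous character is a backslash, means an escaped backslash followed by the closing / is still treated as the end of the expression. Calling split_delimited(r"/hello \/world/") should yield the body hello \/world with an empty remainder.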
I also tried something like this using the escaped combinator, but the examples are somewhat unclear and I couldn't make it work:
pub fn regex(input: &str) -> IResult<&str, Regex> {
    map_res(
        tuple((
            tag("/"),
            escaped(many1(anychar), '\\', one_of(r"/")),
            tag("/"),
        )),
        |(_, re, _)| {
            println!("mapres {}", re);
            Regex::new(re)
        },
    )(input)
}
My test cases are as follows (the .unwrap().as_str() is just to keep the example small, since regex::Regex doesn't implement PartialEq):
#[cfg(test)]
mod tests {
    use super::regex;
    use super::Regex;

    #[test]
    fn test_parse_regex_simple() {
        assert_eq!(
            Regex::new(r#"hello world"#).unwrap().as_str(),
            regex("/hello world/").unwrap().1.as_str()
        );
    }

    #[test]
    fn test_parse_regex_with_escaped_forwardslash() {
        assert_eq!(
            Regex::new(r#"hello /world"#).unwrap().as_str(),
            regex(r"/hello \/world/").unwrap().1.as_str(),
        );
    }
}
Solution 1:[1]
The accepted answer from Chayim Friedman is correct. However, I was able to extend it to also handle \w, \d and other such escape sequences. It is simply an extension of Chayim's idea in the escaped_transform version:
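use nom::branch::alt;
use nom::bytes::complete::{escaped_transform, tag};
use nom::character::complete::none_of;
use nom::combinator::{map_res, value};
use nom::sequence::delimited;
use nom::IResult;
use regex::Regex;

pub fn regex(input: &str) -> IResult<&str, Regex> {
    map_res(
        delimited(
            tag("/"),
            escaped_transform(
                // Ordinary characters: anything except a backslash or the delimiter.
                none_of("\\/"),
                '\\',
                // What each character following a backslash is rewritten to.
                alt((
                    value(r"/", tag("/")),
                    value(r"\d", tag("d")),
                    value(r"\W", tag("W")),
                    value(r"\w", tag("w")),
                    value(r"\b", tag("b")),
                    value(r"\B", tag("B")),
                )),
            ),
            tag("/"),
        ),
        |re| Regex::new(&re),
    )(input)
}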
Note that this list is incomplete. https://docs.rs/regex/1.5.6/regex/#escape-sequences gives the complete set of escapes, and https://github.com/Geal/nom/blob/main/examples/string.rs gives a more detailed explanation of how to handle \u{....}-style escape sequences.
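As a possible alternative to enumerating every escape, here is a minimal standard-library sketch that rewrites only \/ and copies every other escape through unchanged, leaving its interpretation to the regex crate. The function name unescape_delimiter is hypothetical, and the sketch assumes the delimited body has already been extracted from between the slashes.

```rust
/// Hypothetical helper (not part of nom or regex): rewrites `\/` to `/` and
/// copies every other escape (for example `\d` or `\.`) through unchanged.
fn unescape_delimiter(body: &str) -> String {
    let mut out = String::with_capacity(body.len());
    let mut chars = body.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        match chars.next() {
            // An escaped delimiter becomes a plain slash.
            Some('/') => out.push('/'),
            // Any other escape is kept verbatim, backslash included.
            Some(other) => {
                out.push('\\');
                out.push(other);
            }
            // A trailing lone backslash is kept as-is.
            None => out.push('\\'),
        }
    }
    out
}
```

The resulting String could then be passed to Regex::new in the same way as the output of escaped_transform above. The trade-off is that invalid escapes are reported by the regex crate rather than by the parser.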
Sources
This article follows the attribution requirements of Stack Overflow and is licensed under CC BY-SA 3.0.
Source: Stack Overflow
| Solution | Source |
|---|---|
| Solution 1 | Lee Hambley |
