Struct regex_automata::util::prefilter::Prefilter
source · pub struct Prefilter { /* private fields */ }
Expand description
A prefilter for accelerating regex searches.
If you already have your literals that you want to search with,
then the vanilla Prefilter::new
constructor is for you. But
if you have an Hir
value from the regex-syntax
crate, then
Prefilter::from_hir_prefix
might be more convenient. Namely, it uses
the regex-syntax::hir::literal
module to
extract literal prefixes for you, optimize them and then select and build a
prefilter matcher.
A prefilter must have zero false negatives. However, by its very
nature, it may produce false positives. That is, a prefilter will never
skip over a position in the haystack that corresponds to a match of the
original regex pattern, but it may produce a match for a position
in the haystack that does not correspond to a match of the original
regex pattern. If you use either the Prefilter::from_hir_prefix
or
Prefilter::from_hirs_prefix
constructors, then this guarantee is
upheld for you automatically. This guarantee is not preserved if you use
Prefilter::new
though, since it is up to the caller to provide correct
literal strings with respect to the original regex pattern.
Cloning
It is an API guarantee that cloning a prefilter is cheap. That is, cloning it will not duplicate whatever heap memory is used to represent the underlying matcher.
Example
This example shows how to attach a Prefilter
to the
PikeVM
in order to accelerate
searches.
use regex_automata::{
nfa::thompson::pikevm::PikeVM,
util::prefilter::Prefilter,
Match, MatchKind,
};
let pre = Prefilter::new(MatchKind::LeftmostFirst, &["Bruce "])
.expect("a prefilter");
let re = PikeVM::builder()
.configure(PikeVM::config().prefilter(Some(pre)))
.build(r"Bruce \w+")?;
let mut cache = re.create_cache();
assert_eq!(
Some(Match::must(0, 6..23)),
re.find(&mut cache, "Hello Bruce Springsteen!"),
);
But note that if you get your prefilter incorrect, it could lead to an incorrect result!
use regex_automata::{
nfa::thompson::pikevm::PikeVM,
util::prefilter::Prefilter,
Match, MatchKind,
};
// This prefilter is wrong!
let pre = Prefilter::new(MatchKind::LeftmostFirst, &["Patti "])
.expect("a prefilter");
let re = PikeVM::builder()
.configure(PikeVM::config().prefilter(Some(pre)))
.build(r"Bruce \w+")?;
let mut cache = re.create_cache();
// We find no match even though the regex does match.
assert_eq!(
None,
re.find(&mut cache, "Hello Bruce Springsteen!"),
);
Implementations§
source§impl Prefilter
impl Prefilter
sourcepub fn new<B: AsRef<[u8]>>(kind: MatchKind, needles: &[B]) -> Option<Prefilter>
pub fn new<B: AsRef<[u8]>>(kind: MatchKind, needles: &[B]) -> Option<Prefilter>
Create a new prefilter from a sequence of needles and a corresponding match semantics.
This may return None
for a variety of reasons, for example, if
a suitable prefilter could not be constructed. That might occur
if they are unavailable (e.g., the perf-literal-substring
and
perf-literal-multisubstring
features aren’t enabled), or it might
occur because of heuristics or other artifacts of how the prefilter
works.
Note that if you have an Hir
expression, it may be more convenient
to use Prefilter::from_hir_prefix
. It will automatically handle the
task of extracting prefix literals for you.
Example
This example shows how match semantics can impact the matching algorithm used by the prefilter. For this reason, it is important to ensure that the match semantics given here are consistent with the match semantics intended for the regular expression that the literals were extracted from.
use regex_automata::{
util::{prefilter::Prefilter, syntax},
MatchKind, Span,
};
let hay = "Hello samwise";
// With leftmost-first, we find 'samwise' here because it comes
// before 'sam' in the sequence we give it..
let pre = Prefilter::new(MatchKind::LeftmostFirst, &["samwise", "sam"])
.expect("a prefilter");
assert_eq!(
Some(Span::from(6..13)),
pre.find(hay.as_bytes(), Span::from(0..hay.len())),
);
// Still with leftmost-first but with the literals reverse, now 'sam'
// will match instead!
let pre = Prefilter::new(MatchKind::LeftmostFirst, &["sam", "samwise"])
.expect("a prefilter");
assert_eq!(
Some(Span::from(6..9)),
pre.find(hay.as_bytes(), Span::from(0..hay.len())),
);
sourcepub fn from_hir_prefix(kind: MatchKind, hir: &Hir) -> Option<Prefilter>
pub fn from_hir_prefix(kind: MatchKind, hir: &Hir) -> Option<Prefilter>
This attempts to extract prefixes from the given Hir
expression for
the given match semantics, and if possible, builds a prefilter for
them.
Example
This example shows how to build a prefilter directly from an Hir
expression, and use to find an occurrence of a prefix from the regex
pattern.
use regex_automata::{
util::{prefilter::Prefilter, syntax},
MatchKind, Span,
};
let hir = syntax::parse(r"(Bruce|Patti) \w+")?;
let pre = Prefilter::from_hir_prefix(MatchKind::LeftmostFirst, &hir)
.expect("a prefilter");
let hay = "Hello Patti Scialfa!";
assert_eq!(
Some(Span::from(6..12)),
pre.find(hay.as_bytes(), Span::from(0..hay.len())),
);
sourcepub fn from_hirs_prefix<H: Borrow<Hir>>(
kind: MatchKind,
hirs: &[H]
) -> Option<Prefilter>
pub fn from_hirs_prefix<H: Borrow<Hir>>( kind: MatchKind, hirs: &[H] ) -> Option<Prefilter>
This attempts to extract prefixes from the given Hir
expressions for
the given match semantics, and if possible, builds a prefilter for
them.
Note that as of now, prefilters throw away information about which pattern each literal comes from. In other words, when a prefilter finds a match, there’s no way to know which pattern (or patterns) it came from. Therefore, in order to confirm a match, you’ll have to check all of the patterns by running the full regex engine.
Example
This example shows how to build a prefilter directly from multiple
Hir
expressions expression, and use it to find an occurrence of a
prefix from the regex patterns.
use regex_automata::{
util::{prefilter::Prefilter, syntax},
MatchKind, Span,
};
let hirs = syntax::parse_many(&[
r"(Bruce|Patti) \w+",
r"Mrs?\. Doubtfire",
])?;
let pre = Prefilter::from_hirs_prefix(MatchKind::LeftmostFirst, &hirs)
.expect("a prefilter");
let hay = "Hello Mrs. Doubtfire";
assert_eq!(
Some(Span::from(6..20)),
pre.find(hay.as_bytes(), Span::from(0..hay.len())),
);
sourcepub fn find(&self, haystack: &[u8], span: Span) -> Option<Span>
pub fn find(&self, haystack: &[u8], span: Span) -> Option<Span>
Run this prefilter on haystack[span.start..end]
and return a matching
span if one exists.
The span returned is guaranteed to have a start position greater than or equal to the one given, and an end position less than or equal to the one given.
Example
This example shows how to build a prefilter directly from an Hir
expression, and use it to find an occurrence of a prefix from the regex
pattern.
use regex_automata::{
util::{prefilter::Prefilter, syntax},
MatchKind, Span,
};
let hir = syntax::parse(r"Bruce \w+")?;
let pre = Prefilter::from_hir_prefix(MatchKind::LeftmostFirst, &hir)
.expect("a prefilter");
let hay = "Hello Bruce Springsteen!";
assert_eq!(
Some(Span::from(6..12)),
pre.find(hay.as_bytes(), Span::from(0..hay.len())),
);
sourcepub fn prefix(&self, haystack: &[u8], span: Span) -> Option<Span>
pub fn prefix(&self, haystack: &[u8], span: Span) -> Option<Span>
Returns the span of a prefix of haystack[span.start..span.end]
if
the prefilter matches.
The span returned is guaranteed to have a start position equivalent to the one given, and an end position less than or equal to the one given.
Example
This example shows how to build a prefilter directly from an Hir
expression, and use it to find an occurrence of a prefix from the regex
pattern that begins at the start of a haystack only.
use regex_automata::{
util::{prefilter::Prefilter, syntax},
MatchKind, Span,
};
let hir = syntax::parse(r"Bruce \w+")?;
let pre = Prefilter::from_hir_prefix(MatchKind::LeftmostFirst, &hir)
.expect("a prefilter");
let hay = "Hello Bruce Springsteen!";
// Nothing is found here because 'Bruce' does
// not occur at the beginning of our search.
assert_eq!(
None,
pre.prefix(hay.as_bytes(), Span::from(0..hay.len())),
);
// But if we change where we start the search
// to begin where 'Bruce ' begins, then a
// match will be found.
assert_eq!(
Some(Span::from(6..12)),
pre.prefix(hay.as_bytes(), Span::from(6..hay.len())),
);
sourcepub fn memory_usage(&self) -> usize
pub fn memory_usage(&self) -> usize
Returns the heap memory, in bytes, used by the underlying prefilter.
sourcepub fn max_needle_len(&self) -> usize
pub fn max_needle_len(&self) -> usize
Return the length of the longest needle in this Prefilter
sourcepub fn is_fast(&self) -> bool
pub fn is_fast(&self) -> bool
Implementations might return true here if they believe themselves to be “fast.” The concept of “fast” is deliberately left vague, but in practice this usually corresponds to whether it’s believed that SIMD will be used.
Why do we care about this? Well, some prefilter tricks tend to come with their own bits of overhead, and so might only make sense if we know that a scan will be much faster than the regex engine itself. Otherwise, the trick may not be worth doing. Whether something is “much” faster than the regex engine generally boils down to whether SIMD is used. (But not always. Even a SIMD matcher with a high false positive rate can become quite slow.)
Even if this returns true, it is still possible for the prefilter to be “slow.” Remember, prefilters are just heuristics. We can’t really know a prefilter will be fast without actually trying the prefilter. (Which of course we cannot afford to do.)