Struct regex_automata::dfa::onepass::Config
source · pub struct Config { /* private fields */ }
Expand description
The configuration used for building a one-pass DFA.
A one-pass DFA configuration is a simple data object that is typically used
with Builder::configure
. It can be cheaply cloned.
A default configuration can be created either with Config::new
, or
perhaps more conveniently, with DFA::config
.
Implementations§
source§impl Config
impl Config
sourcepub fn match_kind(self, kind: MatchKind) -> Config
pub fn match_kind(self, kind: MatchKind) -> Config
Set the desired match semantics.
The default is MatchKind::LeftmostFirst
, which corresponds to the
match semantics of Perl-like regex engines. That is, when multiple
patterns would match at the same leftmost position, the pattern that
appears first in the concrete syntax is chosen.
Currently, the only other kind of match semantics supported is
MatchKind::All
. This corresponds to “classical DFA” construction
where all possible matches are visited.
When it comes to the one-pass DFA, it is rarer for preference order and
“longest match” to actually disagree. Since if they did disagree, then
the regex typically isn’t one-pass. For example, searching Samwise
for Sam|Samwise
will report Sam
for leftmost-first matching and
Samwise
for “longest match” or “all” matching. However, this regex is
not one-pass if taken literally. The equivalent regex, Sam(?:|wise)
is one-pass and Sam|Samwise
may be optimized to it.
The other main difference is that “all” match semantics don’t support non-greedy matches. “All” match semantics always try to match as much as possible.
sourcepub fn starts_for_each_pattern(self, yes: bool) -> Config
pub fn starts_for_each_pattern(self, yes: bool) -> Config
Whether to compile a separate start state for each pattern in the one-pass DFA.
When enabled, a separate anchored start state is added for each pattern in the DFA. When this start state is used, then the DFA will only search for matches for the pattern specified, even if there are other patterns in the DFA.
The main downside of this option is that it can potentially increase the size of the DFA and/or increase the time it takes to build the DFA.
You might want to enable this option when you want to both search for anchored matches of any pattern or to search for anchored matches of one particular pattern while using the same DFA. (Otherwise, you would need to compile a new DFA for each pattern.)
By default this is disabled.
Example
This example shows how to build a multi-regex and then search for matches for a any of the patterns or matches for a specific pattern.
use regex_automata::{
dfa::onepass::DFA, Anchored, Input, Match, PatternID,
};
let re = DFA::builder()
.configure(DFA::config().starts_for_each_pattern(true))
.build_many(&["[a-z]+", "[0-9]+"])?;
let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
let haystack = "123abc";
let input = Input::new(haystack).anchored(Anchored::Yes);
// A normal multi-pattern search will show pattern 1 matches.
re.try_search(&mut cache, &input, &mut caps)?;
assert_eq!(Some(Match::must(1, 0..3)), caps.get_match());
// If we only want to report pattern 0 matches, then we'll get no
// match here.
let input = input.anchored(Anchored::Pattern(PatternID::must(0)));
re.try_search(&mut cache, &input, &mut caps)?;
assert_eq!(None, caps.get_match());
sourcepub fn byte_classes(self, yes: bool) -> Config
pub fn byte_classes(self, yes: bool) -> Config
Whether to attempt to shrink the size of the DFA’s alphabet or not.
This option is enabled by default and should never be disabled unless one is debugging a one-pass DFA.
When enabled, the DFA will use a map from all possible bytes to their
corresponding equivalence class. Each equivalence class represents a
set of bytes that does not discriminate between a match and a non-match
in the DFA. For example, the pattern [ab]+
has at least two
equivalence classes: a set containing a
and b
and a set containing
every byte except for a
and b
. a
and b
are in the same
equivalence class because they never discriminate between a match and a
non-match.
The advantage of this map is that the size of the transition table
can be reduced drastically from (approximately) #states * 256 * sizeof(StateID)
to #states * k * sizeof(StateID)
where k
is the
number of equivalence classes (rounded up to the nearest power of 2).
As a result, total space usage can decrease substantially. Moreover,
since a smaller alphabet is used, DFA compilation becomes faster as
well.
WARNING: This is only useful for debugging DFAs. Disabling this does not yield any speed advantages. Namely, even when this is disabled, a byte class map is still used while searching. The only difference is that every byte will be forced into its own distinct equivalence class. This is useful for debugging the actual generated transitions because it lets one see the transitions defined on actual bytes instead of the equivalence classes.
sourcepub fn size_limit(self, limit: Option<usize>) -> Config
pub fn size_limit(self, limit: Option<usize>) -> Config
Set a size limit on the total heap used by a one-pass DFA.
This size limit is expressed in bytes and is applied during construction of a one-pass DFA. If the DFA’s heap usage exceeds this configured limit, then construction is stopped and an error is returned.
The default is no limit.
Example
This example shows a one-pass DFA that fails to build because of a configured size limit. This particular example also serves as a cautionary tale demonstrating just how big DFAs with large Unicode character classes can get.
use regex_automata::{dfa::onepass::DFA, Match};
// 6MB isn't enough!
DFA::builder()
.configure(DFA::config().size_limit(Some(6_000_000)))
.build(r"\w{20}")
.unwrap_err();
// ... but 7MB probably is!
// (Note that DFA sizes aren't necessarily stable between releases.)
let re = DFA::builder()
.configure(DFA::config().size_limit(Some(7_000_000)))
.build(r"\w{20}")?;
let (mut cache, mut caps) = (re.create_cache(), re.create_captures());
let haystack = "A".repeat(20);
re.captures(&mut cache, &haystack, &mut caps);
assert_eq!(Some(Match::must(0, 0..20)), caps.get_match());
While one needs a little more than 3MB to represent \w{20}
, it
turns out that you only need a little more than 4KB to represent
(?-u:\w{20})
. So only use Unicode if you need it!
sourcepub fn get_match_kind(&self) -> MatchKind
pub fn get_match_kind(&self) -> MatchKind
Returns the match semantics set in this configuration.
sourcepub fn get_starts_for_each_pattern(&self) -> bool
pub fn get_starts_for_each_pattern(&self) -> bool
Returns whether this configuration has enabled anchored starting states for every pattern in the DFA.
sourcepub fn get_byte_classes(&self) -> bool
pub fn get_byte_classes(&self) -> bool
Returns whether this configuration has enabled byte classes or not. This is typically a debugging oriented option, as disabling it confers no speed benefit.
sourcepub fn get_size_limit(&self) -> Option<usize>
pub fn get_size_limit(&self) -> Option<usize>
Returns the DFA size limit of this configuration if one was set. The size limit is total number of bytes on the heap that a DFA is permitted to use. If the DFA exceeds this limit during construction, then construction is stopped and an error is returned.