Struct regex_automata::dfa::sparse::DFA
source · pub struct DFA<T> { /* private fields */ }
Expand description
A sparse deterministic finite automaton (DFA) with variable sized states.
In contrast to a [dense::DFA], a sparse DFA uses a more space efficient representation for its transitions. Consequently, sparse DFAs may use much less memory than dense DFAs, but this comes at a price. In particular, reading the more space efficient transitions takes more work, and consequently, searching using a sparse DFA is typically slower than a dense DFA.
A sparse DFA can be built using the default configuration via the
[DFA::new
] constructor. Otherwise, one can configure various aspects of a
dense DFA via [dense::Builder
], and then convert a dense DFA to a sparse
DFA using [dense::DFA::to_sparse
].
In general, a sparse DFA supports all the same search operations as a dense DFA.
Making the choice between a dense and sparse DFA depends on your specific work load. If you can sacrifice a bit of search time performance, then a sparse DFA might be the best choice. In particular, while sparse DFAs are probably always slower than dense DFAs, you may find that they are easily fast enough for your purposes!
Type parameters
A DFA
has one type parameter, T
, which is used to represent the parts
of a sparse DFA. T
is typically a Vec<u8>
or a &[u8]
.
The Automaton
trait
This type implements the Automaton
trait, which means it can be used
for searching. For example:
use regex_automata::{dfa::{Automaton, sparse::DFA}, HalfMatch, Input};
let dfa = DFA::new("foo[0-9]+")?;
let expected = Some(HalfMatch::must(0, 8));
assert_eq!(expected, dfa.try_search_fwd(&Input::new("foo12345"))?);
Implementations§
source§impl<T: AsRef<[u8]>> DFA<T>
impl<T: AsRef<[u8]>> DFA<T>
sourcepub fn as_ref<'a>(&'a self) -> DFA<&'a [u8]>
pub fn as_ref<'a>(&'a self) -> DFA<&'a [u8]>
Cheaply return a borrowed version of this sparse DFA. Specifically, the
DFA returned always uses &[u8]
for its transitions.
sourcepub fn to_owned(&self) -> DFA<Vec<u8>>
pub fn to_owned(&self) -> DFA<Vec<u8>>
Return an owned version of this sparse DFA. Specifically, the DFA
returned always uses Vec<u8>
for its transitions.
Effectively, this returns a sparse DFA whose transitions live on the heap.
sourcepub fn start_kind(&self) -> StartKind
pub fn start_kind(&self) -> StartKind
Returns the starting state configuration for this DFA.
The default is StartKind::Both
, which means the DFA supports both
unanchored and anchored searches. However, this can generally lead to
bigger DFAs. Therefore, a DFA might be compiled with support for just
unanchored or anchored searches. In that case, running a search with
an unsupported configuration will panic.
sourcepub fn starts_for_each_pattern(&self) -> bool
pub fn starts_for_each_pattern(&self) -> bool
Returns true only if this DFA has starting states for each pattern.
When a DFA has starting states for each pattern, then a search with the
DFA can be configured to only look for anchored matches of a specific
pattern. Specifically, APIs like Automaton::try_search_fwd
can
accept a Anchored::Pattern
if and only if this method returns true.
Otherwise, an error will be returned.
Note that if the DFA is empty, this always returns false.
sourcepub fn byte_classes(&self) -> &ByteClasses
pub fn byte_classes(&self) -> &ByteClasses
Returns the equivalence classes that make up the alphabet for this DFA.
Unless [dense::Config::byte_classes
] was disabled, it is possible
that multiple distinct bytes are grouped into the same equivalence
class if it is impossible for them to discriminate between a match and
a non-match. This has the effect of reducing the overall alphabet size
and in turn potentially substantially reducing the size of the DFA’s
transition table.
The downside of using equivalence classes like this is that every state transition will automatically use this map to convert an arbitrary byte to its corresponding equivalence class. In practice this has a negligible impact on performance.
sourcepub fn memory_usage(&self) -> usize
pub fn memory_usage(&self) -> usize
Returns the memory usage, in bytes, of this DFA.
The memory usage is computed based on the number of bytes used to represent this DFA.
This does not include the stack size used up by this DFA. To
compute that, use std::mem::size_of::<sparse::DFA>()
.
source§impl<T: AsRef<[u8]>> DFA<T>
impl<T: AsRef<[u8]>> DFA<T>
Routines for converting a sparse DFA to other representations, such as raw bytes suitable for persistent storage.
sourcepub fn write_to_little_endian(
&self,
dst: &mut [u8]
) -> Result<usize, SerializeError>
pub fn write_to_little_endian( &self, dst: &mut [u8] ) -> Result<usize, SerializeError>
Serialize this DFA as raw bytes to the given slice, in little endian
format. Upon success, the total number of bytes written to dst
is
returned.
The written bytes are guaranteed to be deserialized correctly and
without errors in a semver compatible release of this crate by a
DFA
’s deserialization APIs (assuming all other criteria for the
deserialization APIs has been satisfied):
Errors
This returns an error if the given destination slice is not big enough
to contain the full serialized DFA. If an error occurs, then nothing
is written to dst
.
Example
This example shows how to serialize and deserialize a DFA without dynamic memory allocation.
use regex_automata::{dfa::{Automaton, sparse::DFA}, HalfMatch, Input};
// Compile our original DFA.
let original_dfa = DFA::new("foo[0-9]+")?;
// Create a 4KB buffer on the stack to store our serialized DFA.
let mut buf = [0u8; 4 * (1<<10)];
// N.B. We use native endianness here to make the example work, but
// using write_to_little_endian would work on a little endian target.
let written = original_dfa.write_to_native_endian(&mut buf)?;
let dfa: DFA<&[u8]> = DFA::from_bytes(&buf[..written])?.0;
let expected = Some(HalfMatch::must(0, 8));
assert_eq!(expected, dfa.try_search_fwd(&Input::new("foo12345"))?);
sourcepub fn write_to_big_endian(
&self,
dst: &mut [u8]
) -> Result<usize, SerializeError>
pub fn write_to_big_endian( &self, dst: &mut [u8] ) -> Result<usize, SerializeError>
Serialize this DFA as raw bytes to the given slice, in big endian
format. Upon success, the total number of bytes written to dst
is
returned.
The written bytes are guaranteed to be deserialized correctly and
without errors in a semver compatible release of this crate by a
DFA
’s deserialization APIs (assuming all other criteria for the
deserialization APIs has been satisfied):
Errors
This returns an error if the given destination slice is not big enough
to contain the full serialized DFA. If an error occurs, then nothing
is written to dst
.
Example
This example shows how to serialize and deserialize a DFA without dynamic memory allocation.
use regex_automata::{dfa::{Automaton, sparse::DFA}, HalfMatch, Input};
// Compile our original DFA.
let original_dfa = DFA::new("foo[0-9]+")?;
// Create a 4KB buffer on the stack to store our serialized DFA.
let mut buf = [0u8; 4 * (1<<10)];
// N.B. We use native endianness here to make the example work, but
// using write_to_big_endian would work on a big endian target.
let written = original_dfa.write_to_native_endian(&mut buf)?;
let dfa: DFA<&[u8]> = DFA::from_bytes(&buf[..written])?.0;
let expected = Some(HalfMatch::must(0, 8));
assert_eq!(expected, dfa.try_search_fwd(&Input::new("foo12345"))?);
sourcepub fn write_to_native_endian(
&self,
dst: &mut [u8]
) -> Result<usize, SerializeError>
pub fn write_to_native_endian( &self, dst: &mut [u8] ) -> Result<usize, SerializeError>
Serialize this DFA as raw bytes to the given slice, in native endian
format. Upon success, the total number of bytes written to dst
is
returned.
The written bytes are guaranteed to be deserialized correctly and
without errors in a semver compatible release of this crate by a
DFA
’s deserialization APIs (assuming all other criteria for the
deserialization APIs has been satisfied):
Generally speaking, native endian format should only be used when you know that the target you’re compiling the DFA for matches the endianness of the target on which you’re compiling DFA. For example, if serialization and deserialization happen in the same process or on the same machine. Otherwise, when serializing a DFA for use in a portable environment, you’ll almost certainly want to serialize both a little endian and a big endian version and then load the correct one based on the target’s configuration.
Errors
This returns an error if the given destination slice is not big enough
to contain the full serialized DFA. If an error occurs, then nothing
is written to dst
.
Example
This example shows how to serialize and deserialize a DFA without dynamic memory allocation.
use regex_automata::{dfa::{Automaton, sparse::DFA}, HalfMatch, Input};
// Compile our original DFA.
let original_dfa = DFA::new("foo[0-9]+")?;
// Create a 4KB buffer on the stack to store our serialized DFA.
let mut buf = [0u8; 4 * (1<<10)];
let written = original_dfa.write_to_native_endian(&mut buf)?;
let dfa: DFA<&[u8]> = DFA::from_bytes(&buf[..written])?.0;
let expected = Some(HalfMatch::must(0, 8));
assert_eq!(expected, dfa.try_search_fwd(&Input::new("foo12345"))?);
sourcepub fn write_to_len(&self) -> usize
pub fn write_to_len(&self) -> usize
Return the total number of bytes required to serialize this DFA.
This is useful for determining the size of the buffer required to pass to one of the serialization routines:
Passing a buffer smaller than the size returned by this method will result in a serialization error.
Example
This example shows how to dynamically allocate enough room to serialize a sparse DFA.
use regex_automata::{dfa::{Automaton, sparse::DFA}, HalfMatch, Input};
// Compile our original DFA.
let original_dfa = DFA::new("foo[0-9]+")?;
let mut buf = vec![0; original_dfa.write_to_len()];
let written = original_dfa.write_to_native_endian(&mut buf)?;
let dfa: DFA<&[u8]> = DFA::from_bytes(&buf[..written])?.0;
let expected = Some(HalfMatch::must(0, 8));
assert_eq!(expected, dfa.try_search_fwd(&Input::new("foo12345"))?);
source§impl<'a> DFA<&'a [u8]>
impl<'a> DFA<&'a [u8]>
sourcepub fn from_bytes(
slice: &'a [u8]
) -> Result<(DFA<&'a [u8]>, usize), DeserializeError>
pub fn from_bytes( slice: &'a [u8] ) -> Result<(DFA<&'a [u8]>, usize), DeserializeError>
Safely deserialize a sparse DFA with a specific state identifier representation. Upon success, this returns both the deserialized DFA and the number of bytes read from the given slice. Namely, the contents of the slice beyond the DFA are not read.
Deserializing a DFA using this routine will never allocate heap memory.
For safety purposes, the DFA’s transitions will be verified such that
every transition points to a valid state. If this verification is too
costly, then a DFA::from_bytes_unchecked
API is provided, which
will always execute in constant time.
The bytes given must be generated by one of the serialization APIs
of a DFA
using a semver compatible release of this crate. Those
include:
- [
DFA::to_bytes_little_endian
] - [
DFA::to_bytes_big_endian
] - [
DFA::to_bytes_native_endian
] DFA::write_to_little_endian
DFA::write_to_big_endian
DFA::write_to_native_endian
The to_bytes
methods allocate and return a Vec<u8>
for you. The
write_to
methods do not allocate and write to an existing slice
(which may be on the stack). Since deserialization always uses the
native endianness of the target platform, the serialization API you use
should match the endianness of the target platform. (It’s often a good
idea to generate serialized DFAs for both forms of endianness and then
load the correct one based on endianness.)
Errors
Generally speaking, it’s easier to state the conditions in which an error is not returned. All of the following must be true:
- The bytes given must be produced by one of the serialization APIs on this DFA, as mentioned above.
- The endianness of the target platform matches the endianness used to serialized the provided DFA.
If any of the above are not true, then an error will be returned.
Note that unlike deserializing a [dense::DFA
], deserializing a sparse
DFA has no alignment requirements. That is, an alignment of 1
is
valid.
Panics
This routine will never panic for any input.
Example
This example shows how to serialize a DFA to raw bytes, deserialize it and then use it for searching.
use regex_automata::{dfa::{Automaton, sparse::DFA}, HalfMatch, Input};
let initial = DFA::new("foo[0-9]+")?;
let bytes = initial.to_bytes_native_endian();
let dfa: DFA<&[u8]> = DFA::from_bytes(&bytes)?.0;
let expected = Some(HalfMatch::must(0, 8));
assert_eq!(expected, dfa.try_search_fwd(&Input::new("foo12345"))?);
Example: loading a DFA from static memory
One use case this library supports is the ability to serialize a
DFA to disk and then use include_bytes!
to store it in a compiled
Rust program. Those bytes can then be cheaply deserialized into a
DFA
structure at runtime and used for searching without having to
re-compile the DFA (which can be quite costly).
We can show this in two parts. The first part is serializing the DFA to a file:
use regex_automata::dfa::sparse::DFA;
let dfa = DFA::new("foo[0-9]+")?;
// Write a big endian serialized version of this DFA to a file.
let bytes = dfa.to_bytes_big_endian();
std::fs::write("foo.bigendian.dfa", &bytes)?;
// Do it again, but this time for little endian.
let bytes = dfa.to_bytes_little_endian();
std::fs::write("foo.littleendian.dfa", &bytes)?;
And now the second part is embedding the DFA into the compiled program and deserializing it at runtime on first use. We use conditional compilation to choose the correct endianness. We do not need to employ any special tricks to ensure a proper alignment, since a sparse DFA has no alignment requirements.
use regex_automata::{
dfa::{Automaton, sparse::DFA},
util::lazy::Lazy,
HalfMatch, Input,
};
// This crate provides its own "lazy" type, kind of like
// lazy_static! or once_cell::sync::Lazy. But it works in no-alloc
// no-std environments and let's us write this using completely
// safe code.
static RE: Lazy<DFA<&'static [u8]>> = Lazy::new(|| {
#[cfg(target_endian = "big")]
static BYTES: &[u8] = include_bytes!("foo.bigendian.dfa");
#[cfg(target_endian = "little")]
static BYTES: &[u8] = include_bytes!("foo.littleendian.dfa");
let (dfa, _) = DFA::from_bytes(BYTES)
.expect("serialized DFA should be valid");
dfa
});
let expected = Ok(Some(HalfMatch::must(0, 8)));
assert_eq!(expected, RE.try_search_fwd(&Input::new("foo12345")));
Alternatively, consider using
lazy_static
or
once_cell
,
which will guarantee safety for you.
sourcepub unsafe fn from_bytes_unchecked(
slice: &'a [u8]
) -> Result<(DFA<&'a [u8]>, usize), DeserializeError>
pub unsafe fn from_bytes_unchecked( slice: &'a [u8] ) -> Result<(DFA<&'a [u8]>, usize), DeserializeError>
Deserialize a DFA with a specific state identifier representation in constant time by omitting the verification of the validity of the sparse transitions.
This is just like DFA::from_bytes
, except it can potentially return
a DFA that exhibits undefined behavior if its transitions contains
invalid state identifiers.
This routine is useful if you need to deserialize a DFA cheaply and
cannot afford the transition validation performed by from_bytes
.
Safety
This routine is not safe because it permits callers to provide
arbitrary transitions with possibly incorrect state identifiers. While
the various serialization routines will never return an incorrect
DFA, there is no guarantee that the bytes provided here are correct.
While from_bytes_unchecked
will still do several forms of basic
validation, this routine does not check that the transitions themselves
are correct. Given an incorrect transition table, it is possible for
the search routines to access out-of-bounds memory because of explicit
bounds check elision.
Example
use regex_automata::{dfa::{Automaton, sparse::DFA}, HalfMatch, Input};
let initial = DFA::new("foo[0-9]+")?;
let bytes = initial.to_bytes_native_endian();
// SAFETY: This is guaranteed to be safe since the bytes given come
// directly from a compatible serialization routine.
let dfa: DFA<&[u8]> = unsafe { DFA::from_bytes_unchecked(&bytes)?.0 };
let expected = Some(HalfMatch::must(0, 8));
assert_eq!(expected, dfa.try_search_fwd(&Input::new("foo12345"))?);
Trait Implementations§
source§impl<T: AsRef<[u8]>> Automaton for DFA<T>
impl<T: AsRef<[u8]>> Automaton for DFA<T>
source§fn is_special_state(&self, id: StateID) -> bool
fn is_special_state(&self, id: StateID) -> bool
source§fn is_dead_state(&self, id: StateID) -> bool
fn is_dead_state(&self, id: StateID) -> bool
source§fn is_quit_state(&self, id: StateID) -> bool
fn is_quit_state(&self, id: StateID) -> bool
source§fn is_match_state(&self, id: StateID) -> bool
fn is_match_state(&self, id: StateID) -> bool
source§fn is_start_state(&self, id: StateID) -> bool
fn is_start_state(&self, id: StateID) -> bool
source§fn is_accel_state(&self, id: StateID) -> bool
fn is_accel_state(&self, id: StateID) -> bool
source§fn next_state(&self, current: StateID, input: u8) -> StateID
fn next_state(&self, current: StateID, input: u8) -> StateID
source§unsafe fn next_state_unchecked(&self, current: StateID, input: u8) -> StateID
unsafe fn next_state_unchecked(&self, current: StateID, input: u8) -> StateID
source§fn next_eoi_state(&self, current: StateID) -> StateID
fn next_eoi_state(&self, current: StateID) -> StateID
source§fn pattern_len(&self) -> usize
fn pattern_len(&self) -> usize
source§fn match_len(&self, id: StateID) -> usize
fn match_len(&self, id: StateID) -> usize
source§fn match_pattern(&self, id: StateID, match_index: usize) -> PatternID
fn match_pattern(&self, id: StateID, match_index: usize) -> PatternID
source§fn has_empty(&self) -> bool
fn has_empty(&self) -> bool
source§fn is_always_start_anchored(&self) -> bool
fn is_always_start_anchored(&self) -> bool
0
. Read moresource§fn start_state(&self, config: &Config) -> Result<StateID, StartError>
fn start_state(&self, config: &Config) -> Result<StateID, StartError>
source§fn universal_start_state(&self, mode: Anchored) -> Option<StateID>
fn universal_start_state(&self, mode: Anchored) -> Option<StateID>
source§fn accelerator(&self, id: StateID) -> &[u8] ⓘ
fn accelerator(&self, id: StateID) -> &[u8] ⓘ
source§fn get_prefilter(&self) -> Option<&Prefilter>
fn get_prefilter(&self) -> Option<&Prefilter>
source§fn start_state_forward(&self, input: &Input<'_>) -> Result<StateID, MatchError>
fn start_state_forward(&self, input: &Input<'_>) -> Result<StateID, MatchError>
source§fn start_state_reverse(&self, input: &Input<'_>) -> Result<StateID, MatchError>
fn start_state_reverse(&self, input: &Input<'_>) -> Result<StateID, MatchError>
source§fn try_search_fwd(
&self,
input: &Input<'_>
) -> Result<Option<HalfMatch>, MatchError>
fn try_search_fwd( &self, input: &Input<'_> ) -> Result<Option<HalfMatch>, MatchError>
None
is returned. Read moresource§fn try_search_rev(
&self,
input: &Input<'_>
) -> Result<Option<HalfMatch>, MatchError>
fn try_search_rev( &self, input: &Input<'_> ) -> Result<Option<HalfMatch>, MatchError>
None
is
returned. Read moresource§fn try_search_overlapping_fwd(
&self,
input: &Input<'_>,
state: &mut OverlappingState
) -> Result<(), MatchError>
fn try_search_overlapping_fwd( &self, input: &Input<'_>, state: &mut OverlappingState ) -> Result<(), MatchError>
OverlappingState::get_match
method. Read moresource§fn try_search_overlapping_rev(
&self,
input: &Input<'_>,
state: &mut OverlappingState
) -> Result<(), MatchError>
fn try_search_overlapping_rev( &self, input: &Input<'_>, state: &mut OverlappingState ) -> Result<(), MatchError>
OverlappingState::get_match
method. Read moresource§fn try_which_overlapping_matches(
&self,
input: &Input<'_>,
patset: &mut PatternSet
) -> Result<(), MatchError>
fn try_which_overlapping_matches( &self, input: &Input<'_>, patset: &mut PatternSet ) -> Result<(), MatchError>
patset
. If multiple patterns match at the same
position and the underlying DFA supports overlapping matches, then all
matching patterns are written to the given set. Read more