Struct regex_automata::dfa::sparse::DFA

pub struct DFA<T> { /* private fields */ }

A sparse deterministic finite automaton (DFA) with variable sized states.

In contrast to a [dense::DFA], a sparse DFA uses a more space efficient representation for its transitions. Consequently, sparse DFAs may use much less memory than dense DFAs, but this comes at a price. In particular, reading the more space efficient transitions takes more work, and consequently, searching with a sparse DFA is typically slower than searching with a dense DFA.
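
The trade-off can be illustrated with a toy model. The sketch below is not this crate's actual in-memory layout; `DenseState`, `SparseState` and `digit_states` are hypothetical names chosen for illustration. It only shows why a per-byte table is quick to read but large, while a list of byte ranges is small but takes more work to read.

```rust
/// A dense state stores one next-state entry per possible byte value.
struct DenseState {
    next: [u32; 256],
}

/// A sparse state stores only the byte ranges that lead somewhere
/// other than the dead state (represented here as state 0).
struct SparseState {
    // Each entry is (start byte, end byte, next state), range inclusive.
    ranges: Vec<(u8, u8, u32)>,
}

impl DenseState {
    /// A single array index per byte.
    fn next_state(&self, byte: u8) -> u32 {
        self.next[usize::from(byte)]
    }
}

impl SparseState {
    /// A scan over the ranges: more work per byte, far less memory.
    fn next_state(&self, byte: u8) -> u32 {
        for &(start, end, next) in &self.ranges {
            if start <= byte && byte <= end {
                return next;
            }
        }
        0 // no range matched: go to the dead state
    }
}

/// Build both representations of a state that moves to state 7 on
/// any ASCII digit and to the dead state otherwise.
fn digit_states() -> (DenseState, SparseState) {
    let mut next = [0u32; 256];
    for b in b'0'..=b'9' {
        next[usize::from(b)] = 7;
    }
    (DenseState { next }, SparseState { ranges: vec![(b'0', b'9', 7)] })
}

fn main() {
    let (dense, sparse) = digit_states();
    // Both representations agree on every byte.
    for b in 0..=255u8 {
        assert_eq!(dense.next_state(b), sparse.next_state(b));
    }
    println!("dense row: {} bytes", std::mem::size_of::<DenseState>());
    println!(
        "sparse row: {} bytes of range data",
        sparse.ranges.len() * std::mem::size_of::<(u8, u8, u32)>()
    );
}
```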

A sparse DFA can be built using the default configuration via the [DFA::new] constructor. Otherwise, one can configure various aspects of a dense DFA via [dense::Builder], and then convert a dense DFA to a sparse DFA using [dense::DFA::to_sparse].

In general, a sparse DFA supports all the same search operations as a dense DFA.

Making the choice between a dense and sparse DFA depends on your specific workload. If you can sacrifice a bit of search time performance, then a sparse DFA might be the best choice. In particular, while sparse DFAs are probably always slower than dense DFAs, you may find that they are easily fast enough for your purposes!

Type parameters

A DFA has one type parameter, T, which is used to represent the parts of a sparse DFA. T is typically a Vec<u8> or a &[u8].
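
The pattern behind this type parameter can be sketched without the crate. `Table` below is a hypothetical stand-in, not the real DFA type. It shows how a single struct that is generic over `T: AsRef<[u8]>` can offer both a borrowed and an owned form, which is roughly the role `T` plays here.

```rust
/// A hypothetical container that is generic over how its bytes are
/// stored, in the same spirit as `DFA<T>`.
#[derive(Debug)]
struct Table<T> {
    bytes: T,
}

impl<T: AsRef<[u8]>> Table<T> {
    /// Borrow the bytes without copying them.
    fn as_ref(&self) -> Table<&[u8]> {
        Table { bytes: self.bytes.as_ref() }
    }

    /// Copy the bytes to the heap so the result borrows nothing.
    fn to_owned(&self) -> Table<Vec<u8>> {
        Table { bytes: self.bytes.as_ref().to_vec() }
    }

    /// Read-only operations work for every storage type.
    fn len(&self) -> usize {
        self.bytes.as_ref().len()
    }
}

fn main() {
    let owned: Table<Vec<u8>> = Table { bytes: vec![1, 2, 3] };
    let borrowed: Table<&[u8]> = owned.as_ref();
    let copied: Table<Vec<u8>> = borrowed.to_owned();
    assert_eq!(owned.len(), borrowed.len());
    assert_eq!(copied.bytes, vec![1, 2, 3]);
}
```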

The Automaton trait

This type implements the Automaton trait, which means it can be used for searching. For example:

use regex_automata::{dfa::{Automaton, sparse::DFA}, HalfMatch, Input};

let dfa = DFA::new("foo[0-9]+")?;
let expected = Some(HalfMatch::must(0, 8));
assert_eq!(expected, dfa.try_search_fwd(&Input::new("foo12345"))?);

Implementations

impl<T: AsRef<[u8]>> DFA<T>

pub fn as_ref<'a>(&'a self) -> DFA<&'a [u8]>

Cheaply return a borrowed version of this sparse DFA. Specifically, the DFA returned always uses &[u8] for its transitions.

pub fn to_owned(&self) -> DFA<Vec<u8>>

Return an owned version of this sparse DFA. Specifically, the DFA returned always uses Vec<u8> for its transitions.

Effectively, this returns a sparse DFA whose transitions live on the heap.

pub fn start_kind(&self) -> StartKind

Returns the starting state configuration for this DFA.

The default is StartKind::Both, which means the DFA supports both unanchored and anchored searches. However, this can generally lead to bigger DFAs. Therefore, a DFA might be compiled with support for just unanchored or anchored searches. In that case, running a search with an unsupported configuration will panic.

pub fn starts_for_each_pattern(&self) -> bool

Returns true only if this DFA has starting states for each pattern.

When a DFA has starting states for each pattern, then a search with the DFA can be configured to only look for anchored matches of a specific pattern. Specifically, APIs like Automaton::try_search_fwd can accept an Anchored::Pattern if and only if this method returns true. Otherwise, an error will be returned.

Note that if the DFA is empty, this always returns false.

pub fn byte_classes(&self) -> &ByteClasses

Returns the equivalence classes that make up the alphabet for this DFA.

Unless [dense::Config::byte_classes] was disabled, it is possible that multiple distinct bytes are grouped into the same equivalence class if it is impossible for them to discriminate between a match and a non-match. This has the effect of reducing the overall alphabet size and in turn potentially substantially reducing the size of the DFA’s transition table.

The downside of using equivalence classes like this is that every state transition will automatically use this map to convert an arbitrary byte to its corresponding equivalence class. In practice this has a negligible impact on performance.
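
The idea can be sketched independently of the crate. In the toy model below, the only distinction that matters is "ASCII digit" versus "everything else", so 256 byte values collapse into 2 classes. The functions `digit_classes`, `alphabet_len` and `next_state` are hypothetical helpers written for this illustration and are not how the crate computes or stores its classes.

```rust
/// Map every byte to an equivalence class for a pattern whose only
/// distinction is "ASCII digit" versus "everything else".
fn digit_classes() -> [u8; 256] {
    let mut classes = [0u8; 256];
    for b in b'0'..=b'9' {
        classes[usize::from(b)] = 1;
    }
    classes
}

/// Number of distinct classes, assuming classes are numbered 0..n.
fn alphabet_len(classes: &[u8; 256]) -> usize {
    usize::from(*classes.iter().max().unwrap()) + 1
}

/// A transition lookup first converts the byte to its class, then
/// indexes a row with one entry per class instead of one per byte.
fn next_state(classes: &[u8; 256], row: &[u32], byte: u8) -> u32 {
    row[usize::from(classes[usize::from(byte)])]
}

fn main() {
    let classes = digit_classes();
    // Two classes instead of 256 byte values.
    assert_eq!(alphabet_len(&classes), 2);
    // Row for one state: class 0 -> dead state 0, class 1 -> state 7.
    let row = [0u32, 7];
    assert_eq!(next_state(&classes, &row, b'4'), 7);
    assert_eq!(next_state(&classes, &row, b'x'), 0);
}
```

The extra indirection through `classes` on every byte is the per-transition cost described above.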

pub fn memory_usage(&self) -> usize

Returns the memory usage, in bytes, of this DFA.

The memory usage is computed based on the number of bytes used to represent this DFA.

This does not include the stack size used up by this DFA. To compute that, use std::mem::size_of::<sparse::DFA>().

impl<T: AsRef<[u8]>> DFA<T>

Routines for converting a sparse DFA to other representations, such as raw bytes suitable for persistent storage.

pub fn write_to_little_endian(&self, dst: &mut [u8]) -> Result<usize, SerializeError>

Serialize this DFA as raw bytes to the given slice, in little endian format. Upon success, the total number of bytes written to dst is returned.

The written bytes are guaranteed to be deserialized correctly and without errors in a semver compatible release of this crate by a DFA’s deserialization APIs (assuming all other criteria for the deserialization APIs have been satisfied):

  • DFA::from_bytes
  • DFA::from_bytes_unchecked

Errors

This returns an error if the given destination slice is not big enough to contain the full serialized DFA. If an error occurs, then nothing is written to dst.

Example

This example shows how to serialize and deserialize a DFA without dynamic memory allocation.

use regex_automata::{dfa::{Automaton, sparse::DFA}, HalfMatch, Input};

// Compile our original DFA.
let original_dfa = DFA::new("foo[0-9]+")?;

// Create a 4KB buffer on the stack to store our serialized DFA.
let mut buf = [0u8; 4 * (1<<10)];
// N.B. We use native endianness here to make the example work, but
// using write_to_little_endian would work on a little endian target.
let written = original_dfa.write_to_native_endian(&mut buf)?;
let dfa: DFA<&[u8]> = DFA::from_bytes(&buf[..written])?.0;

let expected = Some(HalfMatch::must(0, 8));
assert_eq!(expected, dfa.try_search_fwd(&Input::new("foo12345"))?);

pub fn write_to_big_endian(&self, dst: &mut [u8]) -> Result<usize, SerializeError>

Serialize this DFA as raw bytes to the given slice, in big endian format. Upon success, the total number of bytes written to dst is returned.

The written bytes are guaranteed to be deserialized correctly and without errors in a semver compatible release of this crate by a DFA’s deserialization APIs (assuming all other criteria for the deserialization APIs have been satisfied):

  • DFA::from_bytes
  • DFA::from_bytes_unchecked

Errors

This returns an error if the given destination slice is not big enough to contain the full serialized DFA. If an error occurs, then nothing is written to dst.

Example

This example shows how to serialize and deserialize a DFA without dynamic memory allocation.

use regex_automata::{dfa::{Automaton, sparse::DFA}, HalfMatch, Input};

// Compile our original DFA.
let original_dfa = DFA::new("foo[0-9]+")?;

// Create a 4KB buffer on the stack to store our serialized DFA.
let mut buf = [0u8; 4 * (1<<10)];
// N.B. We use native endianness here to make the example work, but
// using write_to_big_endian would work on a big endian target.
let written = original_dfa.write_to_native_endian(&mut buf)?;
let dfa: DFA<&[u8]> = DFA::from_bytes(&buf[..written])?.0;

let expected = Some(HalfMatch::must(0, 8));
assert_eq!(expected, dfa.try_search_fwd(&Input::new("foo12345"))?);

pub fn write_to_native_endian(&self, dst: &mut [u8]) -> Result<usize, SerializeError>

Serialize this DFA as raw bytes to the given slice, in native endian format. Upon success, the total number of bytes written to dst is returned.

The written bytes are guaranteed to be deserialized correctly and without errors in a semver compatible release of this crate by a DFA’s deserialization APIs (assuming all other criteria for the deserialization APIs have been satisfied):

  • DFA::from_bytes
  • DFA::from_bytes_unchecked

Generally speaking, native endian format should only be used when you know that the target you’re compiling the DFA for matches the endianness of the target on which you’re compiling the DFA. This is the case, for example, when serialization and deserialization happen in the same process or on the same machine. Otherwise, when serializing a DFA for use in a portable environment, you’ll almost certainly want to serialize both a little endian and a big endian version and then load the correct one based on the target’s configuration.
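
Why the byte order has to match can be shown with a single 32-bit value standing in for one multi-byte field of a serialized DFA. This is a simplified illustration that uses only the standard library. `encode_both` and `pick_native` are hypothetical helpers, and the real serialization format contains far more than one integer.

```rust
/// Encode a 32-bit value in both byte orders.
fn encode_both(id: u32) -> ([u8; 4], [u8; 4]) {
    (id.to_le_bytes(), id.to_be_bytes())
}

/// Pick the encoding that matches the target this code is compiled for.
fn pick_native(little: [u8; 4], big: [u8; 4]) -> [u8; 4] {
    if cfg!(target_endian = "little") { little } else { big }
}

fn main() {
    let id = 0x0102_0304u32;
    let (little, big) = encode_both(id);
    // The two encodings differ...
    assert_ne!(little, big);
    // ...and only the native one decodes back to the original value
    // when it is read using the target's native byte order.
    let native = pick_native(little, big);
    assert_eq!(u32::from_ne_bytes(native), id);
    let foreign = if native == little { big } else { little };
    assert_ne!(u32::from_ne_bytes(foreign), id);
}
```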

Errors

This returns an error if the given destination slice is not big enough to contain the full serialized DFA. If an error occurs, then nothing is written to dst.

Example

This example shows how to serialize and deserialize a DFA without dynamic memory allocation.

use regex_automata::{dfa::{Automaton, sparse::DFA}, HalfMatch, Input};

// Compile our original DFA.
let original_dfa = DFA::new("foo[0-9]+")?;

// Create a 4KB buffer on the stack to store our serialized DFA.
let mut buf = [0u8; 4 * (1<<10)];
let written = original_dfa.write_to_native_endian(&mut buf)?;
let dfa: DFA<&[u8]> = DFA::from_bytes(&buf[..written])?.0;

let expected = Some(HalfMatch::must(0, 8));
assert_eq!(expected, dfa.try_search_fwd(&Input::new("foo12345"))?);

pub fn write_to_len(&self) -> usize

Return the total number of bytes required to serialize this DFA.

This is useful for determining the size of the buffer required to pass to one of the serialization routines:

  • DFA::write_to_little_endian
  • DFA::write_to_big_endian
  • DFA::write_to_native_endian

Passing a buffer smaller than the size returned by this method will result in a serialization error.

Example

This example shows how to dynamically allocate enough room to serialize a sparse DFA.

use regex_automata::{dfa::{Automaton, sparse::DFA}, HalfMatch, Input};

// Compile our original DFA.
let original_dfa = DFA::new("foo[0-9]+")?;

let mut buf = vec![0; original_dfa.write_to_len()];
let written = original_dfa.write_to_native_endian(&mut buf)?;
let dfa: DFA<&[u8]> = DFA::from_bytes(&buf[..written])?.0;

let expected = Some(HalfMatch::must(0, 8));
assert_eq!(expected, dfa.try_search_fwd(&Input::new("foo12345"))?);

impl<'a> DFA<&'a [u8]>

pub fn from_bytes(slice: &'a [u8]) -> Result<(DFA<&'a [u8]>, usize), DeserializeError>

Safely deserialize a sparse DFA with a specific state identifier representation. Upon success, this returns both the deserialized DFA and the number of bytes read from the given slice. Namely, the contents of the slice beyond the DFA are not read.

Deserializing a DFA using this routine will never allocate heap memory. For safety purposes, the DFA’s transitions will be verified such that every transition points to a valid state. If this verification is too costly, then a DFA::from_bytes_unchecked API is provided, which will always execute in constant time.

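The bytes given must be generated by one of the serialization APIs of a DFA using a semver compatible release of this crate. Those include:

  • DFA::to_bytes_little_endian
  • DFA::to_bytes_big_endian
  • DFA::to_bytes_native_endian
  • DFA::write_to_little_endian
  • DFA::write_to_big_endian
  • DFA::write_to_native_endian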

The to_bytes methods allocate and return a Vec<u8> for you. The write_to methods do not allocate and write to an existing slice (which may be on the stack). Since deserialization always uses the native endianness of the target platform, the serialization API you use should match the endianness of the target platform. (It’s often a good idea to generate serialized DFAs for both forms of endianness and then load the correct one based on endianness.)

Errors

Generally speaking, it’s easier to state the conditions in which an error is not returned. All of the following must be true:

  • The bytes given must be produced by one of the serialization APIs on this DFA, as mentioned above.
  • The endianness of the target platform matches the endianness used to serialize the provided DFA.

If any of the above are not true, then an error will be returned.

Note that unlike deserializing a [dense::DFA], deserializing a sparse DFA has no alignment requirements. That is, an alignment of 1 is valid.

Panics

This routine will never panic for any input.

Example

This example shows how to serialize a DFA to raw bytes, deserialize it and then use it for searching.

use regex_automata::{dfa::{Automaton, sparse::DFA}, HalfMatch, Input};

let initial = DFA::new("foo[0-9]+")?;
let bytes = initial.to_bytes_native_endian();
let dfa: DFA<&[u8]> = DFA::from_bytes(&bytes)?.0;

let expected = Some(HalfMatch::must(0, 8));
assert_eq!(expected, dfa.try_search_fwd(&Input::new("foo12345"))?);
Example: loading a DFA from static memory

One use case this library supports is the ability to serialize a DFA to disk and then use include_bytes! to store it in a compiled Rust program. Those bytes can then be cheaply deserialized into a DFA structure at runtime and used for searching without having to re-compile the DFA (which can be quite costly).

We can show this in two parts. The first part is serializing the DFA to a file:

use regex_automata::dfa::sparse::DFA;

let dfa = DFA::new("foo[0-9]+")?;

// Write a big endian serialized version of this DFA to a file.
let bytes = dfa.to_bytes_big_endian();
std::fs::write("foo.bigendian.dfa", &bytes)?;

// Do it again, but this time for little endian.
let bytes = dfa.to_bytes_little_endian();
std::fs::write("foo.littleendian.dfa", &bytes)?;

And now the second part is embedding the DFA into the compiled program and deserializing it at runtime on first use. We use conditional compilation to choose the correct endianness. We do not need to employ any special tricks to ensure a proper alignment, since a sparse DFA has no alignment requirements.

use regex_automata::{
    dfa::{Automaton, sparse::DFA},
    util::lazy::Lazy,
    HalfMatch, Input,
};

// This crate provides its own "lazy" type, kind of like
// lazy_static! or once_cell::sync::Lazy. But it works in no-alloc
// no-std environments and lets us write this using completely
// safe code.
static RE: Lazy<DFA<&'static [u8]>> = Lazy::new(|| {
    #[cfg(target_endian = "big")]
    static BYTES: &[u8] = include_bytes!("foo.bigendian.dfa");
    #[cfg(target_endian = "little")]
    static BYTES: &[u8] = include_bytes!("foo.littleendian.dfa");

    let (dfa, _) = DFA::from_bytes(BYTES)
        .expect("serialized DFA should be valid");
    dfa
});

let expected = Ok(Some(HalfMatch::must(0, 8)));
assert_eq!(expected, RE.try_search_fwd(&Input::new("foo12345")));

Alternatively, consider using lazy_static or once_cell, which will guarantee safety for you.

pub unsafe fn from_bytes_unchecked(slice: &'a [u8]) -> Result<(DFA<&'a [u8]>, usize), DeserializeError>

Deserialize a DFA with a specific state identifier representation in constant time by omitting the verification of the validity of the sparse transitions.

This is just like DFA::from_bytes, except it can potentially return a DFA that exhibits undefined behavior if its transitions contain invalid state identifiers.

This routine is useful if you need to deserialize a DFA cheaply and cannot afford the transition validation performed by from_bytes.

Safety

This routine is not safe because it permits callers to provide arbitrary transitions with possibly incorrect state identifiers. While the various serialization routines will never return an incorrect DFA, there is no guarantee that the bytes provided here are correct. While from_bytes_unchecked will still do several forms of basic validation, this routine does not check that the transitions themselves are correct. Given an incorrect transition table, it is possible for the search routines to access out-of-bounds memory because of explicit bounds check elision.

Example
use regex_automata::{dfa::{Automaton, sparse::DFA}, HalfMatch, Input};

let initial = DFA::new("foo[0-9]+")?;
let bytes = initial.to_bytes_native_endian();
// SAFETY: This is guaranteed to be safe since the bytes given come
// directly from a compatible serialization routine.
let dfa: DFA<&[u8]> = unsafe { DFA::from_bytes_unchecked(&bytes)?.0 };

let expected = Some(HalfMatch::must(0, 8));
assert_eq!(expected, dfa.try_search_fwd(&Input::new("foo12345"))?);

Trait Implementations

impl<T: AsRef<[u8]>> Automaton for DFA<T>

fn is_special_state(&self, id: StateID) -> bool

Returns true if and only if the given identifier corresponds to a “special” state. A special state is one or more of the following: a dead state, a quit state, a match state, a start state or an accelerated state.

fn is_dead_state(&self, id: StateID) -> bool

Returns true if and only if the given identifier corresponds to a dead state. When a DFA enters a dead state, it is impossible to leave. That is, every transition on a dead state by definition leads back to the same dead state.

fn is_quit_state(&self, id: StateID) -> bool

Returns true if and only if the given identifier corresponds to a quit state. A quit state is like a dead state (it has no transitions other than to itself), except it indicates that the DFA failed to complete the search. When this occurs, callers can neither accept nor reject that a match occurred.

fn is_match_state(&self, id: StateID) -> bool

Returns true if and only if the given identifier corresponds to a match state. A match state is also referred to as a “final” state and indicates that a match has been found.

fn is_start_state(&self, id: StateID) -> bool

Returns true only if the given identifier corresponds to a start state.

fn is_accel_state(&self, id: StateID) -> bool

Returns true if and only if the given identifier corresponds to an accelerated state.

fn next_state(&self, current: StateID, input: u8) -> StateID

Transitions from the current state to the next state, given the next byte of input.

unsafe fn next_state_unchecked(&self, current: StateID, input: u8) -> StateID

Transitions from the current state to the next state, given the next byte of input.

fn next_eoi_state(&self, current: StateID) -> StateID

Transitions from the current state to the next state for the special EOI symbol.

fn pattern_len(&self) -> usize

Returns the total number of patterns compiled into this DFA.

fn match_len(&self, id: StateID) -> usize

Returns the total number of patterns that match in this state.

fn match_pattern(&self, id: StateID, match_index: usize) -> PatternID

Returns the pattern ID corresponding to the given match index in the given state.

fn has_empty(&self) -> bool

Returns true if and only if this automaton can match the empty string. When it returns false, all possible matches are guaranteed to have a non-zero length.

fn is_utf8(&self) -> bool

Whether UTF-8 mode is enabled for this DFA or not.

fn is_always_start_anchored(&self) -> bool

Returns true if and only if this DFA is limited to returning matches whose start position is 0.

fn start_state(&self, config: &Config) -> Result<StateID, StartError>

Return the ID of the start state for this DFA for the given starting configuration.

fn universal_start_state(&self, mode: Anchored) -> Option<StateID>

If this DFA has a universal starting state for the given anchor mode and the DFA supports universal starting states, then this returns that state’s identifier.

fn accelerator(&self, id: StateID) -> &[u8]

Return a slice of bytes to accelerate for the given state, if possible.

fn get_prefilter(&self) -> Option<&Prefilter>

Returns the prefilter associated with a DFA, if one exists.

fn start_state_forward(&self, input: &Input<'_>) -> Result<StateID, MatchError>

Return the ID of the start state for this DFA when executing a forward search.

fn start_state_reverse(&self, input: &Input<'_>) -> Result<StateID, MatchError>

Return the ID of the start state for this DFA when executing a reverse search.

fn try_search_fwd(&self, input: &Input<'_>) -> Result<Option<HalfMatch>, MatchError>

Executes a forward search and returns the end position of the leftmost match that is found. If no match exists, then None is returned.

fn try_search_rev(&self, input: &Input<'_>) -> Result<Option<HalfMatch>, MatchError>

Executes a reverse search and returns the start position of the leftmost match that is found. If no match exists, then None is returned.

fn try_search_overlapping_fwd(&self, input: &Input<'_>, state: &mut OverlappingState) -> Result<(), MatchError>

Executes an overlapping forward search. A match, if one exists, can be obtained via the OverlappingState::get_match method.

fn try_search_overlapping_rev(&self, input: &Input<'_>, state: &mut OverlappingState) -> Result<(), MatchError>

Executes a reverse overlapping search. A match, if one exists, can be obtained via the OverlappingState::get_match method.

fn try_which_overlapping_matches(&self, input: &Input<'_>, patset: &mut PatternSet) -> Result<(), MatchError>

Writes the set of patterns that match anywhere in the given search configuration to patset. If multiple patterns match at the same position and the underlying DFA supports overlapping matches, then all matching patterns are written to the given set.

impl<T: Clone> Clone for DFA<T>

fn clone(&self) -> DFA<T>

Returns a copy of the value.

fn clone_from(&mut self, source: &Self)

Performs copy-assignment from source.

impl<T: AsRef<[u8]>> Debug for DFA<T>

fn fmt(&self, f: &mut Formatter<'_>) -> Result

Formats the value using the given formatter.

Auto Trait Implementations

impl<T> RefUnwindSafe for DFA<T>
where T: RefUnwindSafe,

impl<T> Send for DFA<T>
where T: Send,

impl<T> Sync for DFA<T>
where T: Sync,

impl<T> Unpin for DFA<T>
where T: Unpin,

impl<T> UnwindSafe for DFA<T>
where T: UnwindSafe,

Blanket Implementations

impl<T> Any for T
where T: 'static + ?Sized,

fn type_id(&self) -> TypeId

Gets the TypeId of self.

impl<T> Borrow<T> for T
where T: ?Sized,

fn borrow(&self) -> &T

Immutably borrows from an owned value.

impl<T> BorrowMut<T> for T
where T: ?Sized,

fn borrow_mut(&mut self) -> &mut T

Mutably borrows from an owned value.

impl<T> From<T> for T

fn from(t: T) -> T

Returns the argument unchanged.

impl<T, U> Into<U> for T
where U: From<T>,

fn into(self) -> U

Calls U::from(self).

That is, this conversion is whatever the implementation of From<T> for U chooses to do.

impl<T> ToOwned for T
where T: Clone,

type Owned = T

The resulting type after obtaining ownership.

fn to_owned(&self) -> T

Creates owned data from borrowed data, usually by cloning.

fn clone_into(&self, target: &mut T)

Uses borrowed data to replace owned data, usually by cloning.

impl<T, U> TryFrom<U> for T
where U: Into<T>,

type Error = Infallible

The type returned in the event of a conversion error.

fn try_from(value: U) -> Result<T, <T as TryFrom<U>>::Error>

Performs the conversion.

impl<T, U> TryInto<U> for T
where U: TryFrom<T>,

type Error = <U as TryFrom<T>>::Error

The type returned in the event of a conversion error.

fn try_into(self) -> Result<U, <U as TryFrom<T>>::Error>

Performs the conversion.