Struct data_encoding::Specification
pub struct Specification {
    pub symbols: String,
    pub bit_order: BitOrder,
    pub check_trailing_bits: bool,
    pub padding: Option<char>,
    pub ignore: String,
    pub wrap: Wrap,
    pub translate: Translate,
}
Base-conversion specification
It is possible to define custom encodings given a specification. To do so, it is important to understand the theory first.
Theory
Each subsection has an equivalent subsection in the Practice section.
Basics
The main idea of a base-conversion encoding is to see [u8] as numbers written in little-endian base256 and convert them to another little-endian base. For performance reasons, this crate restricts this other base to be of size 2 (binary), 4 (base4), 8 (octal), 16 (hexadecimal), 32 (base32), or 64 (base64). The converted number is written as [u8] although it doesn’t use all the 256 possible values of u8. This crate encodes to ASCII, so only values smaller than 128 are allowed.
More precisely, we need the following elements:
- The bit-width N: 1 for binary, 2 for base4, 3 for octal, 4 for hexadecimal, 5 for base32, and 6 for base64
- The bit-order: most or least significant bit first
- The symbols function S from [0, 2^N) (called values and written uN) to symbols (represented as u8 although only ASCII symbols are allowed, i.e. smaller than 128)
- The values partial function V from ASCII to [0, 2^N), i.e. from u8 to uN
- Whether trailing bits are checked: trailing bits are leading zeros in theory, but since numbers are little-endian they come last
For the encoding to be correct (i.e. encoding then decoding gives back the initial input), V(S(i)) must be defined and equal to i for all i in [0, 2^N). For the encoding to be canonical (i.e. different inputs decode to different outputs, or equivalently, decoding then encoding gives back the initial input), trailing bits must be checked and if V(i) is defined then S(V(i)) is equal to i for all i.
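The two conditions on S and V can be checked mechanically. The following is a minimal sketch in plain Rust, not this crate's code: the names build_values, is_correct, and symbol_mapping_is_canonical are hypothetical, V is modelled as a 128-entry table, and the trailing-bits half of canonicity is deliberately left out.

```rust
/// Builds the values partial function V from the symbols function S.
/// Returns None if a symbol is not ASCII or appears twice.
/// (Hypothetical helper, for illustration only.)
fn build_values(symbols: &[u8]) -> Option<[Option<u8>; 128]> {
    let mut values: [Option<u8>; 128] = [None; 128];
    for (i, &s) in symbols.iter().enumerate() {
        if !s.is_ascii() || values[s as usize].is_some() {
            return None;
        }
        values[s as usize] = Some(i as u8);
    }
    Some(values)
}

/// Correctness: V(S(i)) is defined and equal to i for every value i.
fn is_correct(symbols: &[u8], values: &[Option<u8>; 128]) -> bool {
    symbols
        .iter()
        .enumerate()
        .all(|(i, &s)| values.get(s as usize).copied().flatten() == Some(i as u8))
}

/// Symbol half of canonicity: whenever V(c) is defined, S(V(c)) == c.
/// Checking trailing bits is a separate requirement not covered here.
fn symbol_mapping_is_canonical(symbols: &[u8], values: &[Option<u8>; 128]) -> bool {
    values.iter().enumerate().all(|(c, v)| match v {
        Some(v) => symbols.get(*v as usize) == Some(&(c as u8)),
        None => true,
    })
}
```

With the lowercase hexadecimal symbols, both checks pass on the table returned by build_values. Additionally mapping 'A' to the value 10 keeps the encoding correct but makes the second check fail, which is the situation the translation extension creates.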
Encoding and decoding are given by the following pipeline:
[u8] <--1--> [[bit; 8]] <--2--> [[bit; N]] <--3--> [uN] <--4--> [u8]
1: Map bit-order between each u8 and [bit; 8]
2: Base conversion between base 2^8 and base 2^N (check trailing bits)
3: Map bit-order between each [bit; N] and uN
4: Map symbols/values between each uN and u8 (values must be defined)
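To make the pipeline concrete, here is a minimal sketch in plain Rust of the encoding direction, assuming most-significant-bit-first order and no extensions. It is only an illustration of the four steps, not this crate's implementation; the name encode_msb_first is hypothetical and the symbols are passed as ASCII bytes.

```rust
/// Encodes `input` most significant bit first (hypothetical helper).
/// `symbols.len()` must be 2, 4, 8, 16, 32, or 64.
fn encode_msb_first(input: &[u8], symbols: &[u8]) -> String {
    // The bit-width N is the base-2 logarithm of the number of symbols.
    let n = symbols.len().trailing_zeros() as usize;
    assert!(symbols.len().is_power_of_two() && (1..=6).contains(&n));

    // Step 1: map each u8 to 8 bits, most significant bit first.
    let mut bits: Vec<u8> = Vec::with_capacity(input.len() * 8 + n);
    for &byte in input {
        for i in (0..8).rev() {
            bits.push((byte >> i) & 1);
        }
    }

    // Step 2: regroup into chunks of N bits, completing the last
    // chunk with zero trailing bits.
    while bits.len() % n != 0 {
        bits.push(0);
    }

    // Steps 3 and 4: map each chunk to a value, then to its symbol.
    bits.chunks(n)
        .map(|chunk| {
            let value = chunk.iter().fold(0usize, |acc, &b| (acc << 1) | b as usize);
            symbols[value] as char
        })
        .collect()
}
```

Under these assumptions, encode_msb_first(b"Bit", b"01234567") yields "20464564", the same string as the octal example in the Practice section.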
Extensions
All these extensions make the encoding not canonical.
Padding
Padding is useful if the following conditions are met:
- the bit-width is 3 (octal), 5 (base32), or 6 (base64)
- the length of the data to encode is not known in advance
- the data must be sent without buffering
For bases whose bit-width N does not divide 8, encoded data cannot in general be concatenated, because it is not possible to distinguish trailing bits from encoding bits. Padding solves this issue by adding a new character to discriminate between trailing bits and encoding bits. The idea is to work in blocks of lcm(8, N) bits, where lcm(8, N) is the least common multiple of 8 and N. When such a block is not complete, it is padded.
To preserve correctness, the padding character must not be a symbol.
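The block arithmetic can be sketched as follows. This is an illustrative calculation in plain Rust rather than code from this crate; the helper names (lcm, block_sizes, pad_count) are hypothetical.

```rust
fn gcd(a: usize, b: usize) -> usize {
    if b == 0 { a } else { gcd(b, a % b) }
}

/// Least common multiple, used for the block size in bits.
fn lcm(a: usize, b: usize) -> usize {
    a / gcd(a, b) * b
}

/// Returns (bytes per block, symbols per block) for bit-width `n`.
fn block_sizes(n: usize) -> (usize, usize) {
    let bits = lcm(8, n);
    (bits / 8, bits / n)
}

/// Number of padding characters needed after encoding `input_len` bytes
/// with bit-width `n`, so that the output ends on a block boundary.
fn pad_count(input_len: usize, n: usize) -> usize {
    let (bytes, symbols) = block_sizes(n);
    // Output length once the last block is completed.
    let padded_len = (input_len + bytes - 1) / bytes * symbols;
    // Symbols that actually carry data (rounded up to cover trailing bits).
    let data_symbols = (8 * input_len + n - 1) / n;
    padded_len - data_symbols
}
```

For instance, one byte in base64 (N = 6) gives pad_count(1, 6) == 2 and one byte in octal (N = 3) gives pad_count(1, 3) == 5.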
Ignore characters when decoding
Ignoring characters when decoding is useful if after encoding some characters are added for convenience or any other reason (like wrapping). In that case we want to first ignore those characters before decoding.
To preserve correctness, ignored characters must not contain symbols or the padding character.
Wrap output when encoding
Wrapping output when encoding is useful if the output is meant to be printed in a document where width is limited (typically 80-columns documents). In that case, the wrapping width and the wrapping separator have to be defined.
To preserve correctness, the wrapping separator characters must be ignored (see previous subsection). As such, wrapping separator characters must also not contain symbols or the padding character.
Translate characters when decoding
Translating characters when decoding is useful when encoded data may be copied by a human instead of a machine. Humans tend to confuse some characters for others. In that case we want to translate those characters before decoding.
To preserve correctness, the characters we translate from must not contain symbols or the padding character, and the characters we translate to must only contain symbols or the padding character.
Practice
Basics
use data_encoding::{Encoding, Specification};

fn make_encoding(symbols: &str) -> Encoding {
    let mut spec = Specification::new();
    spec.symbols.push_str(symbols);
    spec.encoding().unwrap()
}

let binary = make_encoding("01");
let octal = make_encoding("01234567");
let hexadecimal = make_encoding("0123456789abcdef");
assert_eq!(binary.encode(b"Bit"), "010000100110100101110100");
assert_eq!(octal.encode(b"Bit"), "20464564");
assert_eq!(hexadecimal.encode(b"Bit"), "426974");
The binary base has 2 symbols 0 and 1 with value 0 and 1 respectively. The octal base has 8 symbols 0 to 7 with value 0 to 7. The hexadecimal base has 16 symbols 0 to 9 and a to f with value 0 to 15. The following diagram gives the idea of how encoding works in the previous example (note that we can actually write such diagram only because the bit-order is most significant first):
[ octal] | 2 : 0 : 4 : 6 : 4 : 5 : 6 : 4 |
[ binary] |0 1 0 0 0 0 1 0|0 1 1 0 1 0 0 1|0 1 1 1 0 1 0 0|
[hexadecimal] | 4 : 2 | 6 : 9 | 7 : 4 |
^-- LSB ^-- MSB
Note that in theory, these little-endian numbers are read from right to left (the most significant bit is at the right). Since leading zeros are meaningless (in our usual decimal notation 0123 is the same as 123), it explains why trailing bits must be zero. Trailing bits may occur when the bit-width of a base does not divide 8. Only binary, base4, and hexadecimal don’t have trailing bits issues. So let’s consider octal and base64, which have trailing bits in similar circumstances:
use data_encoding::{Specification, BASE64_NOPAD};
let octal = {
    let mut spec = Specification::new();
    spec.symbols.push_str("01234567");
    spec.encoding().unwrap()
};
assert_eq!(BASE64_NOPAD.encode(b"B"), "Qg");
assert_eq!(octal.encode(b"B"), "204");
We have the following diagram, where the base64 values are written between parentheses:
[base64] | Q(16) : g(32) : [has 4 zero trailing bits]
[ octal] | 2 : 0 : 4 : [has 1 zero trailing bit ]
|0 1 0 0 0 0 1 0|0 0 0 0
[ ascii] | B |
^-^-^-^-- leading zeros / trailing bits
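The role of trailing bits when decoding can be sketched in plain Rust. This is a simplified illustration, not this crate's decoder: the name decode_msb_first is hypothetical, the bit-order is assumed to be most significant first, the error type is a plain String, and checks that a real decoder would likely perform (such as rejecting impossible lengths) are omitted.

```rust
/// Decodes `input` most significant bit first (hypothetical helper).
/// Leftover bits that do not fill a byte are the trailing bits.
fn decode_msb_first(
    input: &[u8],
    symbols: &[u8],
    check_trailing_bits: bool,
) -> Result<Vec<u8>, String> {
    let n = symbols.len().trailing_zeros() as usize;
    assert!(symbols.len().is_power_of_two() && (1..=6).contains(&n));

    // Map each symbol to its value, then to N bits.
    let mut bits: Vec<u8> = Vec::with_capacity(input.len() * n);
    for &c in input {
        let value = symbols
            .iter()
            .position(|&s| s == c)
            .ok_or_else(|| format!("invalid symbol {:?}", c as char))?;
        for i in (0..n).rev() {
            bits.push(((value >> i) & 1) as u8);
        }
    }

    // Split complete bytes from the trailing bits.
    let (full, trailing) = bits.split_at(bits.len() / 8 * 8);
    if check_trailing_bits && trailing.iter().any(|&b| b != 0) {
        return Err("non-zero trailing bits".to_string());
    }

    // Rebuild each byte from its 8 bits.
    Ok(full
        .chunks(8)
        .map(|byte| byte.iter().fold(0u8, |acc, &b| (acc << 1) | b))
        .collect())
}
```

In this sketch, "204" decodes to "B" with the octal symbols, while "205" (whose single trailing bit is 1) is rejected when trailing bits are checked and also decodes to "B" when they are not. Two inputs decoding to the same output is exactly what canonicity rules out.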
Extensions
Padding
For octal and base64, lcm(8, 3) == lcm(8, 6) == 24 bits or 3 bytes. For base32, lcm(8, 5) is 40 bits or 5 bytes. Let’s consider octal and base64:
use data_encoding::{Specification, BASE64};
let octal = {
    let mut spec = Specification::new();
    spec.symbols.push_str("01234567");
    spec.padding = Some('=');
    spec.encoding().unwrap()
};
// We start encoding but we only have "B" for now.
assert_eq!(BASE64.encode(b"B"), "Qg==");
assert_eq!(octal.encode(b"B"), "204=====");
// Now we have "it".
assert_eq!(BASE64.encode(b"it"), "aXQ=");
assert_eq!(octal.encode(b"it"), "322720==");
// By concatenating everything, we may decode the original data.
assert_eq!(BASE64.decode(b"Qg==aXQ=").unwrap(), b"Bit");
assert_eq!(octal.decode(b"204=====322720==").unwrap(), b"Bit");
We have the following diagrams:
[base64] | Q(16) : g(32) : = : = |
[ octal] | 2 : 0 : 4 : = : = : = : = : = |
|0 1 0 0 0 0 1 0|. . . . . . . .|. . . . . . . .|
[ ascii] | B | end of block aligned --^
^-- beginning of block aligned
[base64] | a(26) : X(23) : Q(16) : = |
[ octal] | 3 : 2 : 2 : 7 : 2 : 0 : = : = |
|0 1 1 0 1 0 0 1|0 1 1 1 0 1 0 0|. . . . . . . .|
[ ascii] | i | t |
Ignore characters when decoding
The typical use-case is to ignore newlines (\r and \n). But to keep the example small, we will ignore spaces.
let mut spec = data_encoding::HEXLOWER.specification();
spec.ignore.push_str(" \t");
let base = spec.encoding().unwrap();
assert_eq!(base.decode(b"42 69 74"), base.decode(b"426974"));
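Conceptually, ignoring characters amounts to filtering the input before the base conversion. A minimal sketch in plain Rust, with the hypothetical name strip_ignored (this crate's decoder may well proceed differently internally):

```rust
/// Removes every byte listed in `ignore` from `input`
/// (hypothetical helper, for illustration only).
fn strip_ignored(input: &[u8], ignore: &[u8]) -> Vec<u8> {
    input
        .iter()
        .copied()
        .filter(|c| !ignore.contains(c))
        .collect()
}
```

With ignore set to b" \t", the input b"42 69 74" becomes b"426974", which is why both inputs decode to the same bytes in the example above.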
Wrap output when encoding
The typical use-case is to wrap after 64 or 76 characters with a newline (\r\n or \n). But to keep the example small, we will wrap after 8 characters with a space.
let mut spec = data_encoding::BASE64.specification();
spec.wrap.width = 8;
spec.wrap.separator.push_str(" ");
let base64 = spec.encoding().unwrap();
assert_eq!(base64.encode(b"Hey you"), "SGV5IHlv dQ== ");
Note that the output always ends with the separator.
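The wrapping behavior seen in this example can be sketched as a post-processing step over the encoded text. This is a plain-Rust illustration with the hypothetical name wrap_output; it assumes ASCII input and a non-zero width, and does not claim to mirror how this crate implements wrapping.

```rust
/// Appends `separator` after every `width` characters of `encoded`,
/// including after the last (possibly shorter) line.
/// (Hypothetical helper; assumes `encoded` is ASCII.)
fn wrap_output(encoded: &str, width: usize, separator: &str) -> String {
    assert!(width > 0, "the wrapping width must be positive");
    let mut out = String::with_capacity(encoded.len() + separator.len());
    for line in encoded.as_bytes().chunks(width) {
        // ASCII input guarantees that each chunk is valid UTF-8.
        out.push_str(std::str::from_utf8(line).expect("ASCII input"));
        out.push_str(separator);
    }
    out
}
```

Applied to "SGV5IHlvdQ==" with a width of 8 and a space as separator, the sketch returns "SGV5IHlv dQ== ", trailing separator included.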
Translate characters when decoding
The typical use-case is to translate lowercase to uppercase or reciprocally, but it is also used for letters that look alike, like O0 or Il1. Let’s illustrate both examples.
let mut spec = data_encoding::HEXLOWER.specification();
spec.translate.from.push_str("ABCDEFOIl");
spec.translate.to.push_str("abcdef011");
let base = spec.encoding().unwrap();
assert_eq!(base.decode(b"BOIl"), base.decode(b"b011"));
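Translation can be pictured as a character-by-character substitution applied before decoding. The sketch below is plain Rust with a hypothetical name (translate_input) and is only meant to show the mapping between the from and to strings, which are paired position by position.

```rust
/// Replaces each byte found in `from` by the byte at the same
/// position in `to`; other bytes are left unchanged.
/// (Hypothetical helper, for illustration only.)
fn translate_input(input: &[u8], from: &[u8], to: &[u8]) -> Vec<u8> {
    assert_eq!(from.len(), to.len(), "from and to must have the same length");
    input
        .iter()
        .map(|&c| match from.iter().position(|&f| f == c) {
            Some(i) => to[i],
            None => c,
        })
        .collect()
}
```

With the from and to strings of the example above, b"BOIl" is rewritten to b"b011" before the hexadecimal decoding takes place.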
Fields
symbols: String
Symbols
The number of symbols must be 2, 4, 8, 16, 32, or 64. Symbols must be ASCII characters (smaller than 128) and they must be unique.
bit_order: BitOrder
Bit-order
The default is to use most significant bit first since it is the most common.
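To illustrate what the other bit-order means in terms of the theory, here is a sketch in plain Rust of a least-significant-bit-first encoder. It follows the pipeline of the Theory section directly and has not been checked against this crate's output; the name encode_lsb_first is hypothetical.

```rust
/// Encodes `input` least significant bit first (hypothetical helper).
/// `symbols.len()` must be 2, 4, 8, 16, 32, or 64.
fn encode_lsb_first(input: &[u8], symbols: &[u8]) -> String {
    let n = symbols.len().trailing_zeros() as usize;
    assert!(symbols.len().is_power_of_two() && (1..=6).contains(&n));

    // Each u8 becomes 8 bits, least significant bit first.
    let mut bits: Vec<u8> = Vec::with_capacity(input.len() * 8 + n);
    for &byte in input {
        for i in 0..8 {
            bits.push((byte >> i) & 1);
        }
    }

    // Complete the last chunk with zero trailing bits.
    while bits.len() % n != 0 {
        bits.push(0);
    }

    // Within a chunk, the first bit is the least significant one,
    // so the chunk is folded in reverse order.
    bits.chunks(n)
        .map(|chunk| {
            let value = chunk
                .iter()
                .rev()
                .fold(0usize, |acc, &b| (acc << 1) | b as usize);
            symbols[value] as char
        })
        .collect()
}
```

In this sketch the byte 0x42 encodes to "24" with the lowercase hexadecimal symbols, whereas most-significant-bit-first order gives "42".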
check_trailing_bits: bool
Check trailing bits
The default is to check trailing bits. This field is ignored when unnecessary (i.e. for base2, base4, and base16).
padding: Option<char>
Padding
The default is to not use padding. The padding character must be ASCII and must not be a symbol.
ignore: String
Characters to ignore when decoding
The default is to not ignore characters when decoding. The characters to ignore must be ASCII and must not be symbols or the padding character.
wrap: Wrap
How to wrap the output when encoding
The default is to not wrap the output when encoding. The wrapping characters must be ASCII and must not be symbols or the padding character.
translate: Translate
How to translate characters when decoding
The default is to not translate characters when decoding. The characters to translate from must be ASCII and must not have already been assigned a semantics. The characters to translate to must be ASCII and must have been assigned a semantics (symbol, padding character, or ignored character).
Implementations
impl Specification
pub fn new() -> Specification
Returns a default specification
Trait Implementations
impl Clone for Specification
fn clone(&self) -> Specification
fn clone_from(&mut self, source: &Self)