Regina 7.4 Calculation Engine
Tight encodings of data

Regina includes support for tight encodings, which are encodings of various data types as short printable strings.

Tight encodings have the following properties:

  • They use only printable ASCII characters (the 94 ASCII values from 33 to 126 inclusive), and do not contain any whitespace. This means (for example) you can use them as whitespace-separated tokens in plain text files. However, they do make use of all of the ASCII punctuation symbols, and so you must take care when (for example) trying to hard-code them as strings in source code, or using them as components of filenames.
  • They aim to be short (typically much shorter than the usual human-readable representations, such as decimal representations of integers or full text representations of polynomials).
  • When reading an encoded object character-by-character, the encoding contains enough information to know when the last character has been read (even without the foreknowledge of whether the input stream contains more characters).
  • Objects with the same inherent value, even if they use different underlying C++ types, will encode to the same string. For example, the integer 7 will have the same encoding regardless of whether it is stored as an int, a long, or a regina::Integer. Note that this guarantee only extends to types that "conceptually" intend to represent the same broad types of objects, possibly with different limitations. So, for example, there is no guarantee that the integer 7, the rational 7/1, and/or the constant polynomial 7 would encode to the same string.
  • Conversely, objects of the same type but with different inherent values will encode to different strings. So, for example, the integers 7 and -7 will have different encodings.

A consequence of the last two points is that, if the type of an object is known in advance, then its value can be recovered from its encoding. However, the encoding does not contain enough information to deduce the type if this is not already known.

Because encodings contain enough information to identify where they end, this means that you can encode a sequence of objects by concatenating the individual encodings with no separators, and (assuming the types of the objects are fixed) this will be enough to guarantee that different sequences likewise have different encodngs. Of course, a tight encoding of a sequence must encode extra information so that the sequence itself maintains the property that the end of its encoding can be detected without look-ahead - typically this extra information would be either a length written at the beginning, or a sentinel written at the end.

Tight encodings were originally designed to support perfect hashing (essentially "compressing" data into a short printable string whilst preserving the correctness of equality tests). As a result, they were originally intended to be used only in one direction. However, Regina does provide matching decoding routines if you need to reconstruct objects from their tight encodings.

For native C++ data types where tight encodings and decodings are supported, these are provided in the header utilities/tightencoding.h through overloads of the global functions tightEncoding() and tightDecoding() (which work with strings), and tightEncode() and tightDecode() (which work with input/output streams).

For Regina's own data types where tight encodings and decodings are supported, these are provided through member functions tightEncoding(), tightEncode(), tightDecoding() and tightDecode() within the corresponding classes.

Note that classes that provide a tightEncoding() member function will typically also provide a hash() member function (which often uses the tight encoding in its implementation). Unlike tight encodings, which are string-based and preserve equality/inequality perfectly, hashes map into a fixed-size integer range and so may have collisions (i.e., different objects may have the same hash value). In short: tight encodings are designed for compression and printability, whereas hashes are designed to be used as integer keys in hash tables (and, for Python users, dictionaries and sets).