© 1999–2021 Rick Jelliffe. PageSeeder and hosting generously provided by Allette Systems (Australia)
Rick Jelliffe (C) 2021-2024
The general goals of RAN:
Rapid Access Notation (RAN) is a possible document format, currently under design, to allow fast and efficient semi-random access to elements in fragments on the raw text, with lazy, deferred or multi-threaded parsing (see note 1).
The basic structure of a RAN document is a tree of specialist structural elements, using various variants of the familiar element/attribute tag, for:
A RAN file can be a single branch or a single scope (like an XML document), or a series of fragments, or a single finite stream of fragments with an explicit open and close.
Element names, attribute names and attribute values may be literals (with double quotes) or tokens. The token types have simple lexical rules to distinguish them without full parsing.
Linking metadata, such as direction, is assisted syntactically, by providing a range of delimiters between attribute names and values.
One way of understanding many RAN design decisions is that a lexer or parser can start at any point in the document and only needs to find the first preceding (or following) < or > delimiter in order to know how to parse the document from that point. To allow this, “<” and “>” are always delimiter characters no matter where they appear in the document (you cannot “comment out” tags), namespaces may only be declared at the top level of the document, and there are no CDATA sections. This allows parallel or reverse parsing.
RAN “simplifies” XML by removing features that prevent fast parsing on modern processors (parallelism, SIMD, GPUs) but enhances it by taking more advantage of delimiters and lexical typing.
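The idea of starting at an arbitrary point can be sketched in a few lines of Python. This is only an illustration under the rule stated above (that < and > are always delimiters); the function name `resync` and its return convention are hypothetical and not part of RAN.

```python
def resync(text: str, pos: int) -> tuple[str, int]:
    """Classify an arbitrary offset by looking back for the nearest delimiter.

    Returns ("markup", i) when the nearest preceding delimiter is '<' at
    index i (so pos is inside a tag), or ("text", i) when it is '>' at
    index i (so pos is in text content). If no delimiter precedes pos,
    the result is ("text", -1).
    """
    lt = text.rfind("<", 0, pos)  # nearest '<' before pos, or -1
    gt = text.rfind(">", 0, pos)  # nearest '>' before pos, or -1
    if lt > gt:
        return ("markup", lt)
    return ("text", gt)


doc = '<book id==eg2 alt="an example">\n<p>Hello world <b>!</b></p>\n</book>'
print(resync(doc, doc.index("alt")))    # offset inside the book start-tag
print(resync(doc, doc.index("world")))  # offset inside the text of <p>
```

Because no quoting or comment state has to be carried along, each worker thread could call something like this at its own chunk boundary before it begins lexing.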
Here is a very simple RAN document, like XML:
<book id==eg2 alt="an example">
<!-- A comment -->
<p>Hello world <b>!</b></p>
</book>
The first unusual thing compared to XML is that the first attribute on book uses "==". This delimiter indicates that the attribute is a primary identifier for that element within its scope (similar to an XML ID): an implementation may index this element for faster reference or retrieval. (There is also a corresponding delimiter “=}” for IDREF.) RAN documents do not need a schema to indicate this functionality.
In RAN, the text in between tags is text; however, the values of attributes can be typed data. RAN provides an extremely rich set of complex datatypes, and these are reliably determined using RAN’s datatype rules. All the named entities for Unicode characters are built in and can be used anywhere (and they won’t be recognized as delimiters or whitespace).
The second unusual thing in the book example is that the value of the id attribute (eg2) does not use double-quote delimiters, which means it is a token that will be lexically typed: in this case it is a name token. The lexical typing rules allow far richer types than other data-transfer or schema languages: in particular, they allow information vital to interpreting a number (such as its unit) to be kept with the number. Here are some examples of lexical typing of rich values:
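As a rough illustration of how an implementation might index primary identifiers without a schema, the Python sketch below scans the raw text for start-tags that contain a name==token attribute. The regular expression and the function name `build_id_index` are assumptions made for this example; the sketch ignores quoted literals and is not RAN's actual grammar.

```python
import re

# Hypothetical pattern: a start-tag (not an end-tag or comment) that
# contains an attribute written as name==token with an unquoted value.
ID_ATTR = re.compile(r"<[^<>/!][^<>]*?\s[^\s<>=\"]+==([^\s<>=\"]+)")


def build_id_index(text: str) -> dict[str, int]:
    """Map each '==' identifier to the offset of the tag that declares it."""
    return {m.group(1): m.start() for m in ID_ATTR.finditer(text)}


doc = '<book id==eg2 alt="an example">\n<p id==p1>Hello</p>\n</book>'
index = build_id_index(doc)
print(index)  # identifiers mapped to the offsets of their start-tags
```

A lookup by identifier then becomes a dictionary access followed by a seek to the recorded offset, which is the kind of retrieval the == delimiter is meant to make cheap.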
<lexically-typed-attributes>
<quantities
frequency=32_Hz temperature=32°F reading=-5.123e4 transmit=0xBEEF
></quantities>
<dates
birthday=2020-06-06TZ
era=1995-05-?01/2024-10-?11 ></dates>
<currency
polish-amount=¤100.10zł_PLN old-uk-amount=£4.3s.8d euro-amount=€100
></currency>
<tuples
cords=[ 134.5° 126° 18km 2024-12-01T10:20:10 ]
amount-at-time= [$100 2024-01-01T12:00 ]
></tuples>
</lexically-typed-attributes>
In the example above, you can see:
These allow much better datatyping for scientific, historical and financial information. No schema is needed.
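To give a feel for lexical typing, here is a small Python sketch that classifies a few of the unquoted tokens from the example by shape alone. The patterns and type labels are illustrative guesses, not RAN's datatype rules, and they deliberately leave out currencies, date ranges and tuples.

```python
import re

# Hypothetical patterns, tried in order. These are illustrative guesses,
# not the RAN datatype rules.
PATTERNS = [
    ("hex-integer", re.compile(r"0x[0-9A-Fa-f]+")),
    ("date-time", re.compile(r"\d{4}-\d{2}-\d{2}(T(\d{2}:\d{2}(:\d{2})?)?Z?)?")),
    ("number-with-unit", re.compile(r"[+-]?\d+(\.\d+)?(_[A-Za-z]+|°[A-Za-z]?|[a-z]+)")),
    ("number", re.compile(r"[+-]?\d+(\.\d+)?([eE][+-]?\d+)?")),
    ("name", re.compile(r"[A-Za-z_][\w.-]*")),
]


def classify(token: str) -> str:
    """Return the label of the first pattern that matches the whole token."""
    for label, pattern in PATTERNS:
        if pattern.fullmatch(token):
            return label
    return "unknown"


for tok in ["32_Hz", "32°F", "-5.123e4", "0xBEEF", "2020-06-06TZ", "18km", "eg2"]:
    print(tok, classify(tok))
```

The point of the sketch is only that the type can be decided from the token's own characters, with no schema lookup.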
Fragments: A RAN document can be an unbounded stream of fragments:
<<<"I am a fragment" id}=f1>>>
....
<<</"I am a fragment" id}=f1>>>
...
In the above example, there is a single fragment: it uses <<< and >>> delimiters. It has an ID attribute. Unlike XML, tag names can be string literals (e.g. "I am a fragment"), not just name tokens. Similarly, the end-tag has a matching ID attribute, to help with random-access processing. Fragments also allow parsers to skip past sections that will not be needed, for efficiency in retrieving data from large documents. (To indicate that a stream is finite, you can wrap the fragments in <<<< and >>>> tags.)
1 For information on the kinds of parsing and application that motivate the design of RAN, see the papers “Mison: A Fast JSON Parser for Data Analytics” (Li et al., 2017) and “Parsing Gigabytes of JSON per Second” (Langdale and Lemire, 2020).
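Skipping past unwanted fragments can be sketched as a plain substring search. The Python below assumes, as in the example, that a fragment's end-tag repeats the start-tag's contents exactly after a leading slash; it does not handle the <<<< and >>>> stream wrappers or nested fragments, and the name `fragment_spans` is hypothetical.

```python
from typing import Iterator


def fragment_spans(text: str) -> Iterator[tuple[str, int, int]]:
    """Yield (header, body_start, body_end) for each fragment in turn.

    The fragment body is never parsed: it is located only by searching
    for the end-tag that mirrors the start-tag.
    """
    pos = 0
    while True:
        start = text.find("<<<", pos)
        if start == -1:
            return
        head_end = text.find(">>>", start)
        if head_end == -1:
            return
        header = text[start + 3:head_end]     # tag name plus attributes
        end_tag = "<<</" + header + ">>>"     # assumed mirror of the start-tag
        body_start = head_end + 3
        body_end = text.find(end_tag, body_start)
        if body_end == -1:
            return
        yield header, body_start, body_end
        pos = body_end + len(end_tag)


stream = (
    '<<<"I am a fragment" id}=f1>>>\n<p>one</p>\n<<</"I am a fragment" id}=f1>>>\n'
    '<<<"second" id}=f2>>>\n<p>two</p>\n<<</"second" id}=f2>>>\n'
)
for header, a, b in fragment_spans(stream):
    if header.endswith("id}=f2"):
        print(stream[a:b])  # only the wanted fragment's body is touched
```

A reader that wants one fragment can therefore step over the others at search speed and hand only the selected body to a full parser.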