Dataspace 11: AR2, A Better Array Representation

Prelude:
Dataspace 0: Those Memex Dreams Again
Dataspace 1: In Search of a Data Model
Dataspace 2: Revenge of the Data Model
Dataspace 3: It Came From The S-Expressions
Dataspace 4: The Term-inator
Dataspace 10: An Array Representation

I want a Memex. Roughly, I want some kind of personal but shareable information desktop where I can enter very small pieces of data, cluster them into large chunks of data, and – most importantly – point to any of these small pieces of data from any of these chunks.

‘Pointable data’ needs a data model. The data model that I am currently exploring is what I call term-expressions (or T-expressions): a modified S-expression syntax and semantics that allows a list to end with (or even simply be, with no preceding list) a logical term in the Prolog sense.

A year ago, in Dataspace 10, I outlined a tentative ‘array representation’ for T-expressions, for use in runtimes (like Javascript, C or most other modern languages) that don’t provide Lisplike cons cell storage but do provide arrays. Even apart from just the pragmatic concern of ‘arrays are all we have and JSON is becoming the standard data format of the Web’, there are a number of advantages that arrays give over conses. One advantage is that array indexing gives us O(1) access time to elements, and another is that we can nest arrays inside arrays to get recursively contained blocks of storage. Recursive containment (rather than an undifferentiated planetwide storage pool or ‘soup’) is a feature of our real physical world and is key to achieving scalability and data transportability.

However I now think the array representation (let’s call it AR1) I outlined in Dataspace 10 is too clumsy and we can do better. It’s clumsy for a reason – I wanted to preserve arrays in their natural form when embedding them into terms, to prevent undue array slicing/copying operations and to take advantage of runtimes like Javascript which can do optimisation of large array layouts if all the elements have the same shape. But I’m now thinking that’s a bit of premature optimisation. Let’s make a simpler format that preserves some nicer properties, at the expense of maybe making copying an expensive operation.

This new array representation – let’s call it AR2 – is much simpler. It’s just the natural extension of cons cells to n-length arrays.

Let the 0-length array [] be NIL, the empty list.

Let every other array of lengths 1 and up represent two parts:

A ‘prefix’ of LENGTH-1 cells (cells 0 to LENGTH-2 in 0-based indexing), representing the listlike portion
A ‘suffix’ at the last index (cell LENGTH-1 in 0-based indexing), representing the termlike portion, which is itself a T-expression.

I’m also settling on the following characters for markup: [, ], ` (term marker) and \ (character escape). These four are chosen because they are available unshifted on the standard keyboard, \ is the standard C escape, and they don’t interfere with English text. Conspicuously missing is any kind of string quotation character. There are only string words/atoms, and T-expressions. For now, not even numbers.

(Because numbers immediately raise the question: which numbers? Decimal, hex, octal, binary? Integer, floating point, full numeric tower? JSON-compatible numbers or non-JSON formats? How do we deal with prematurely identifying a string of digits as the wrong kind of number and misparsing it? But if we do want to bake numbers into the syntax, we can do that with a simple rule: invoking \ inside a word marks it as a “quoted word” which is guaranteed to be a string. If we don’t invoke \ at any time, and it parses as a number, then it’s a number. When printing a word which could be parsed as a number, escape the first character.)

Some of my thinking here about removing string quoting entirely is my own, an idea I’ve been toying with for years; but I’ve been motivated towards this recently by the remarkable new Interactive Fiction language Dialog, which takes this approach, and demonstrates how well it works. More on Dialog later.

In AR2, we have a fairly natural and intuitive encoding/decoding, which is easy to do in your head:

If it’s a proper list, just append [] to the end.
If it’s a term [`foo bar], just wrap it in brackets, ie, [[“foo”,”bar”]]
If it’s an improper list, just put the term component on the end of the list, ie [1 2 3 `foo bar] becomes [“1″,”2″,”3”, [“foo”, “bar”]]

We can still tell the difference between NIL or EMPTY LIST [] ([] in AR2) and EMPTY TERM [`] ([[]] in AR2), if this is algebraically important to us. ( Specifically: [foo bar] in TX becomes [foo bar [] ] in AR2 while [foo bar `] becomes [foo bar [[]] ] )

It is now very easy to write a parser for AR2, and the resulting array data structure is about as easy to work with as one could hope for. Most of the pain in a language like Javascript comes from nested character escaping – \ becomes \\ in T-expressions which becomes \\\\ inside a Javascript string. And ` occasionally causes problems in some contexts (for example in WordPress text boxes), but that’s as good as we can get, I think. It’s a much better character to use as an escape than ‘ or ” which can occur in names and English sentences – and which also get mangled to curly quotes in some contexts (Microsoft Word, and WordPress text boxes).

Given an AR2 array, we can very quickly determine some key properties of it:

Take LENGTH – this should be a O(1) operation for modern sane length-prefixed non-null-terminated arrays. Maybe not in raw C. So don’t use raw C. I mean you can if you want, it’ll still work, just LENGTH won’t be a O(1) operation.
If LENGTH == 0: it’s NIL. All NILs should maybe be unique, though this is not the case in, eg, Javascript, so maybe don’t rely on that. Some implementations of Prolog, for example, may want to rely on Unknowns (unbound variables) being represented by unique NILs. However, if you do that, be aware that you’ve then got an in-RAM structure that can’t be uniquely serialised as T-expressions, which is not a particularly good thing.
If LENGTH == 1: it’s a Term. In this case, Array[0][0] is the Functor or Head, Array[0][1..] is the Body, Array[0][1] is the first argument, etc.
If LENGTH == 1 and Array[LENGTH-1] == NIL, then it’s EMPTY TERM.
If LENGTH > 1 and Array[LENGTH-1] == NIL, then it’s a Proper List
If LENGTH > 1 and Array[LENGTH-1] != NIL, then it’s an Improper List, and Array[LENGTH-1] is its Suffix. If the Suffix is an array, then Array[LENGTH-1][0] is the Functor/Head, etc.

And the very nice part is that all these properties apply recursively to all AR2 arrays.