Dataspace 10: An Array Representation

Dataspace 0: Those Memex Dreams Again
Dataspace 1: In Search of a Data Model
Dataspace 2: Revenge of the Data Model
Dataspace 3: It Came From The S-Expressions
Dataspace 4: The Term-inator

I want a Memex. Roughly, I want some kind of personal but shareable information desktop where I can enter very small pieces of data, cluster them into large chunks of data, and – most importantly – point to any of these small pieces of data from any of these chunks.

‘Pointable data’ needs a data model. The data model that I am currently exploring is what I call term-expressions (or T-expressions): a modified S-expression syntax and semantics that allows a list to end with (or even simply be, with no preceding list) a logical term in the Prolog sense.

So far, we have been looking at term-expressions as an extension of (or implemented on top of) Lisp or Scheme cons-cell structure. This is fine if we’re running on a Lisp or Scheme. But the most popular languages today are not Lisp or Scheme, and don’t usually have a native cons-cell implementation. Further, the model of all storage as a big undifferentiated soup of cons-cells has a couple of big limitations: 1) an O(n)  to O(log n) access time, depending on the data structure, if we don’t already have a pointer, and 2) pointers are relative to a big memory pool – they don’t give us an easy way to break our data into chunks and make sure that related data is stored close by.

One way of solving all of these problems is to look at how we can represent term-expressions not on cons-cells, but on a much more fundamental and widely-available data structure: arrays.

An array (assuming it’s implemented reasonably like a raw chunk of RAM, ignoring things like caching; not a very good assumption, but close enough in many cases) gives us a O(1) access time. That’s a big win! We can also nest arrays inside arrays; this means we could recursively represent an entire chunk of storage (a RAM bank, say or a hard drive) as one big array, then have term-expressions containing integers to represent pointers into one or another array. We can start to model the low-level behaviour of a language’s memory allocator, or a whole computer and network, down to the operating system and physical level, in ways we couldn’t using just cons cells. We can now start to think about using term-expressions to model all of a computing network.

Naively, we might think that mapping term-expressions to arrays is trivial: just reserve one symbol (say ‘/’ itself) and then insert that into the array. So, the term-expression

(1 2 3 /foo 4 5 6 /bar 7 8 9)

would become the array-expression (using say JSON syntax)

[1, 2, 3, “/”, “foo”, 4, 5, 6, “/”, “bar”, 7, 8, 9]

But this immediately gives us two rather nasty problems. Can you spot what they are?

The first is that we’ve now polluted our data represention space by consuming the string “/”. This is no longer available to us to store as actual data. (Remember, in the native term-expression we could represent it as the string “/”, ie by quoting it.)

This is a very, very bad thing as we’ve now irrevocably broken round-trip data representation. What if we need to store “/” in an expression? (Something that’s going to happen very quickly if we use these expressions in a programming language). We’re going to be flat out of luck, and a whole string of problems will follow.

The second problem is that we’ve lost our nice O(1) access time guarantee. Because two term-expressions can be appended, so that the CDR of any given cons cell might be an expression beginning with /, we can no longer find the n-th cell of an expression by indexing into its array at position n. We have to do a linear or binary-tree search instead; we’re back to the performance behaviour of cons cells, if not their memory allocation behaviour.

So what other options do we have? Some features that we would like:

* use pure arrays, not any higher-level language features like types or objects (ie, don’t assume we’re running on any particular language VM or runtime)

* assume that our array representation gives us at least the length of an array, so not C-type strings with a magic terminator value where you have to iterate through to the end to get the length – that’s just asking for trouble

* assume that pointing to an array is cheap but that re-ordering or modifying an array is expensive. Therefore if representing a raw array itself as an expression, try to keep that array by itself as one piece

* keep our one weird but defining feature of term-expressions: that you can append one term-expression to another. But, don’t let this feature break the ability to safely and reliably access the nth element of any appended chunk

The good news is we can get all these features with one representation: nested arrays.

The term expression

(1 2 3 /foo 4 5 6 /bar 7 8 9)

becomes the array expression

[ [], [1, 2, 3], foo, [4, 5, 6], bar [7, 8, 9]]

What does this give us?

* All expressions become arrays of length 2n, where n is the number of appended ‘chunks’

* Expressions are in the format: [head, body, … <repeat>]

* head gives us the type of the term and can be any array-expression. If the head is [] – term-expression (/), ie, True Null – it means a raw array. This seems to make sense logically: the type of a raw, untyped array is literally ‘null’, it doesn’t have a type.

* raw arrays are represent as themselves, unmodified, so they don’t cost expensive array rebuild operations

* it is very easy to look at an array expression, count its length, and work out how many chunks it has, and so immediately tell whether it’s an ‘awkward’ expression or not, whether the appended expression is legal for this term type, etc.  Similarly, we can get the arity of a term (if we care about that, eg, in the way Prolog does) by counting the length of its body. All this very useful metadata suddenly comes ‘for free’, from the array length data.

* there are no reserved symbols , so we have 100% data round trip compatibility and we can use these expressions to represent program terms or binary data

* we get the ability to express something we didn’t really think about before, but is very useful for memory management: the ‘continuation of an array’. We can break an array into fixed-length subarrays, then represent it as a chained sequence of these chunks. Eg:

[[], [1, 2, 3,], [], [4, 5, 6]]

which is the term expression

(1 2 3 / () 4 5 6)

This ability can be very, very useful for, eg, modelling an array representation which might start out at a fixed small size then double the chunk size as we add chunks. By being able to represent this directly, we get the ability to use array-expressions at the system programming or low-level database-storage-layout level.

How do we get the n-th element of a continued array like this? Simple enough: check n; if it’s less than the length of the first chunk, index it in the first chunk; if not, check if there’s a next chunk and if it’s a raw array; subtract the length of the first chunk; repeat. If n falls into a chunk which is not a raw array, return an error or null (ie: we’re asking for a cell which does not exist as a cell in this array; whatever the canonical algebraic definition of that ‘not here’ answer is, return it)

That should give us O(x) access time, where x is the number of chunks in the continued array rather than the number of elements.

* with a slight ‘abuse of syntax’ we can represent S-expression improper lists ending in a dotted pair, eg:

(1 2 . 3)

represents as

[[], [1, 2], [], 3]

eg, if the body of a chunk is not a list (something which is normally not a valid array-expression, so it’s not breaking any data representation), then it’s a CDR cell pointing to a system non-list type. We’ve now got full support for S-expressions within an array environment. And, we can quickly look at a list and tell if it’s a ‘proper’ or ‘improper’ list without having to scan it.

Given this representation, we can now think about implementing term-expressions directly on top of JSON, in a Javascript environment (or on top of any other array facility in a mainstream programming language). We could use JSON objects as fast, system-provided representations where they make sense (for simple string-indexed key-value maps) and use array-expressions for all other expressions that JSON objects struggle with.