Dataspace 0: Those Memex Dreams Again
Dataspace 1: In Search of a Data Model
Dataspace 2: Revenge of the Data Model
Dataspace 3: It Came From The S-Expressions
Dataspace 4: The Term-inator
Dataspace 10: An Array Representation
I want a Memex. Roughly, I want some kind of personal but shareable information desktop where I can enter very small pieces of data, cluster them into large chunks of data, and – most importantly – point to any of these small pieces of data from any of these chunks.
‘Pointable data’ needs a data model. The data model that I am currently exploring is what I call term-expressions (or T-expressions): a modified S-expression syntax and semantics that allows a list to end with (or even simply be, with no preceding list) a logical term in the Prolog sense.
This new array representation – let’s call it AR2 – is much simpler. It’s just the natural extension of cons cells to n-length arrays.
Let the 0-length array  be NIL, the empty list.
Let every other array of lengths 1 and up represent two parts:
- A ‘prefix’ of LENGTH-1 cells (cells 0 to LENGTH-2 in 0-based indexing), representing the listlike portion
- A ‘suffix’ at the last index (cell LENGTH-1 in 0-based indexing), representing the termlike portion, which is itself a T-expression.
I’m also settling on the following characters for markup: [, ], ` (term marker) and \ (character escape). These four are chosen because they are available unshifted on the standard keyboard, \ is the standard C escape, and they don’t interfere with English text. Conspicuously missing is any kind of string quotation character. There are only string words/atoms, and T-expressions. For now, not even numbers.
(Because numbers immediately raise the question: which numbers? Decimal, hex, octal, binary? Integer, floating point, full numeric tower? JSON-compatible numbers or non-JSON formats? How do we deal with prematurely identifying a string of digits as the wrong kind of number and misparsing it? But if we do want to bake numbers into the syntax, we can do that with a simple rule: invoking \ inside a word marks it as a “quoted word” which is guaranteed to be a string. If we don’t invoke \ at any time, and it parses as a number, then it’s a number. When printing a word which could be parsed as a number, escape the first character.)
Some of my thinking here about removing string quoting entirely is my own, an idea I’ve been toying with for years; but I’ve been motivated towards this recently by the remarkable new Interactive Fiction language Dialog, which takes this approach, and demonstrates how well it works. More on Dialog later.
In AR2, we have a fairly natural and intuitive encoding/decoding, which is easy to do in your head:
- If it’s a proper list, just append  to the end.
- If it’s a term [`foo bar], just wrap it in brackets, ie, [[“foo”,”bar”]]
- If it’s an improper list, just put the term component on the end of the list, ie [1 2 3 `foo bar] becomes [“1″,”2″,”3”, [“foo”, “bar”]]
We can still tell the difference between NIL or EMPTY LIST  ( in AR2) and EMPTY TERM [`] ([] in AR2), if this is algebraically important to us. ( Specifically: [foo bar] in TX becomes [foo bar  ] in AR2 while [foo bar `] becomes [foo bar [] ] )
Given an AR2 array, we can very quickly determine some key properties of it:
- Take LENGTH – this should be a O(1) operation for modern sane length-prefixed non-null-terminated arrays. Maybe not in raw C. So don’t use raw C. I mean you can if you want, it’ll still work, just LENGTH won’t be a O(1) operation.
- If LENGTH == 1: it’s a Term. In this case, Array is the Functor or Head, Array[1..] is the Body, Array is the first argument, etc.
- If LENGTH == 1 and Array[LENGTH-1] == NIL, then it’s EMPTY TERM.
- If LENGTH > 1 and Array[LENGTH-1] == NIL, then it’s a Proper List
- If LENGTH > 1 and Array[LENGTH-1] != NIL, then it’s an Improper List, and Array[LENGTH-1] is its Suffix. If the Suffix is an array, then Array[LENGTH-1] is the Functor/Head, etc.
And the very nice part is that all these properties apply recursively to all AR2 arrays.