Dataspace 11: AR2, A Better Array Representation

Prelude:
Dataspace 0: Those Memex Dreams Again
Dataspace 1: In Search of a Data Model
Dataspace 2: Revenge of the Data Model
Dataspace 3: It Came From The S-Expressions
Dataspace 4: The Term-inator

Dataspace 10: An Array Representation

I want a Memex. Roughly, I want some kind of personal but shareable information desktop where I can enter very small pieces of data, cluster them into large chunks of data, and – most importantly – point to any of these small pieces of data from any of these chunks.

‘Pointable data’ needs a data model. The data model that I am currently exploring is what I call term-expressions (or T-expressions): a modified S-expression syntax and semantics that allows a list to end with (or even simply be, with no preceding list) a logical term in the Prolog sense.

A year ago, in Dataspace 10, I outlined a tentative ‘array representation’ for T-expressions, for use in runtimes (like Javascript, C or most other modern languages) that don’t provide Lisplike cons cell storage but do provide arrays. Even apart from just the pragmatic concern of ‘arrays are all we have and JSON is becoming the standard data format of the Web’, there are a number of advantages that arrays give over conses. One advantage is that array indexing gives us O(1) access time to elements, and another is that we can nest arrays inside arrays to get recursively contained blocks of storage. Recursive containment (rather than an undifferentiated planetwide storage pool or ‘soup’) is a feature of our real physical world and is key to achieving scalability and data transportability.

However, I now think the array representation I outlined in Dataspace 10 (let’s call it AR1) is too clumsy, and we can do better. It’s clumsy for a reason – I wanted to preserve arrays in their natural form when embedding them into terms, to prevent undue array slicing/copying operations and to take advantage of runtimes like Javascript which can optimise large array layouts if all the elements have the same shape. But I’m now thinking that’s a bit of premature optimisation. Let’s make a simpler format that preserves some nicer properties, at the expense of maybe making copying an expensive operation.

This new array representation – let’s call it AR2 – is much simpler. It’s just the natural extension of cons cells to n-length arrays.

Let the 0-length array [] be NIL, the empty list.

Let every other array, of length 1 and up, represent two parts:

  1. A ‘prefix’ of LENGTH-1 cells (cells 0 to LENGTH-2 in 0-based indexing), representing the listlike portion
  2. A ‘suffix’ at the last index (cell LENGTH-1 in 0-based indexing), representing the termlike portion, which is itself a T-expression.
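
As a rough sketch of this split, here it is in Python, with plain lists standing in for arrays. The function names `is_nil` and `split` are mine, purely for illustration, and not part of any existing library.

```python
def is_nil(arr):
    """True for the 0-length array, which AR2 reads as NIL."""
    return isinstance(arr, list) and len(arr) == 0


def split(arr):
    """Return (prefix, suffix) for a non-empty AR2 array.

    The prefix is cells 0 to LENGTH-2 (the listlike portion) and the
    suffix is cell LENGTH-1 (the termlike portion). The slice copies
    the prefix cells; an index-based view would avoid that copy.
    """
    if is_nil(arr):
        raise ValueError("NIL has no prefix or suffix")
    return arr[:-1], arr[-1]


prefix, suffix = split(["a", "b", []])
print(prefix, suffix)  # ['a', 'b'] []
```

A length-1 array comes back with an empty prefix, so its single cell is all suffix.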

I’m also settling on the following characters for markup: [, ], ` (term marker) and \ (character escape). These four are chosen because they are available unshifted on the standard keyboard, \ is the standard C escape, and they don’t interfere with English text. Conspicuously missing is any kind of string quotation character. There are only string words/atoms, and T-expressions. For now, not even numbers.

(Because numbers immediately raise the question: which numbers? Decimal, hex, octal, binary? Integer, floating point, full numeric tower? JSON-compatible numbers or non-JSON formats? How do we deal with prematurely identifying a string of digits as the wrong kind of number and misparsing it? But if we do want to bake numbers into the syntax, we can do that with a simple rule: invoking \ inside a word marks it as a “quoted word” which is guaranteed to be a string. If we don’t invoke \ at any time, and it parses as a number, then it’s a number. When printing a word which could be parsed as a number, escape the first character.)
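
If we did go down that road, the rule might look something like the following Python sketch. It assumes, purely for the sake of example, that "number" means a decimal integer; that is exactly the question left open above. The names `read_word`, `print_word` and `unescape` are hypothetical.

```python
import re

NUMBER = re.compile(r"-?[0-9]+")  # assumption: decimal integers only
MARKUP = "[]`\\"                  # the four markup characters


def unescape(token):
    """Drop each \\ and keep the character that follows it."""
    out = []
    chars = iter(token)
    for ch in chars:
        if ch == "\\":
            ch = next(chars, "")
        out.append(ch)
    return "".join(out)


def read_word(token):
    """Turn the raw text of one word into a string or a number."""
    if "\\" in token:
        return unescape(token)      # a quoted word is always a string
    if NUMBER.fullmatch(token):
        return int(token)           # never escaped and parses: a number
    return token


def print_word(value):
    """Write a word so that read_word gives back the same value."""
    if isinstance(value, int):
        return str(value)
    text = "".join("\\" + ch if ch in MARKUP else ch for ch in value)
    if NUMBER.fullmatch(text):
        text = "\\" + text          # escape the first character
    return text


print(read_word("42"), repr(read_word("\\42")))  # 42 '42'
```

The point of the design is the round trip: the string "42" prints as \42 and reads back as a string, while the number 42 prints as 42 and reads back as a number.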

Some of my thinking here about removing string quoting entirely is my own, an idea I’ve been toying with for years; but I’ve been motivated towards this recently by the remarkable new Interactive Fiction language Dialog, which takes this approach, and demonstrates how well it works. More on Dialog later.

In AR2, we have a fairly natural and intuitive encoding/decoding, which is easy to do in your head:

  1. If it’s a proper list, just append [] to the end.
  2. If it’s a term [`foo bar], just wrap it in brackets, ie, [["foo","bar"]]
  3. If it’s an improper list, just put the term component on the end of the list, ie [1 2 3 `foo bar] becomes ["1","2","3",["foo","bar"]]

We can still tell the difference between NIL or EMPTY LIST [] ([] in AR2) and EMPTY TERM [`] ([[]] in AR2), if this is algebraically important to us. (Specifically: [foo bar] in TX becomes [foo bar []] in AR2, while [foo bar `] becomes [foo bar [[]]].)
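
The three rules, plus the empty cases, can be sketched in Python as below. The input shape is my own assumption for illustration: `items` is the listlike part (already-encoded cells) and `term` is either None (no term marker) or the list of cells after the term marker. The function name `encode` is hypothetical.

```python
def encode(items, term=None):
    """Encode one level of a T-expression as an AR2 array (sketch)."""
    items = list(items)
    if term is None:
        # Rule 1: a proper list gets [] appended; the empty list is NIL.
        return items + [[]] if items else []
    body = list(term)
    if not items:
        # Rule 2: a bare term is wrapped in brackets ([`] gives [[]]).
        return [body]
    if not body:
        # Empty term on the end of a list: written [[]] as in the text,
        # so it stays distinct from the [] that closes a proper list.
        return items + [[[]]]
    # Rule 3: an improper list carries the term component as last cell.
    return items + [body]


print(encode(["1", "2", "3"], ["foo", "bar"]))
# ['1', '2', '3', ['foo', 'bar']]
```

This only handles one level; nested lists would need to be encoded first and passed in as cells.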

It is now very easy to write a parser for AR2, and the resulting array data structure is about as easy to work with as one could hope for. Most of the pain in a language like Javascript comes from nested character escaping – \ becomes \\ in T-expressions which becomes \\\\ inside a Javascript string. And ` occasionally causes problems in some contexts (for example in WordPress text boxes), but that’s as good as we can get, I think. It’s a much better character to use as an escape than ‘ or ” which can occur in names and English sentences – and which also get mangled to curly quotes in some contexts (Microsoft Word, and WordPress text boxes).
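
To make the escaping pile-up concrete, here is a small Python illustration, using `json.dumps` as a stand-in for writing a Javascript string literal (the two agree on how backslashes are escaped). The example word is made up.

```python
import json

word = "C:\\temp"                      # the actual word: C:\temp (one backslash)
tx_text = word.replace("\\", "\\\\")   # as T-expression text: C:\\temp
js_literal = json.dumps(tx_text)       # as a string literal: "C:\\\\temp"

counts = [s.count("\\") for s in (word, tx_text, js_literal)]
print(counts)  # [1, 2, 4]
```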

Given an AR2 array, we can very quickly determine some key properties of it:

  1. Take LENGTH – this should be an O(1) operation for modern sane length-prefixed non-null-terminated arrays. Maybe not in raw C. So don’t use raw C. I mean you can if you want, it’ll still work, just LENGTH won’t be an O(1) operation.
  2. If LENGTH == 0: it’s NIL. All NILs should maybe be unique, though this is not the case in, eg, Javascript, so maybe don’t rely on that. Some implementations of Prolog, for example, may want to rely on Unknowns (unbound variables) being represented by unique NILs. However, if you do that, be aware that you’ve then got an in-RAM structure that can’t be uniquely serialised as T-expressions, which is not a particularly good thing.
  3. If LENGTH == 1: it’s a Term. In this case, Array[0][0] is the Functor or Head, Array[0][1..] is the Body, Array[0][1] is the first argument, etc.
  4. If LENGTH == 1 and Array[LENGTH-1] == NIL, then it’s EMPTY TERM.
  5. If LENGTH > 1 and Array[LENGTH-1] == NIL, then it’s a Proper List
  6. If LENGTH > 1 and Array[LENGTH-1] != NIL, then it’s an Improper List, and Array[LENGTH-1] is its Suffix. If the Suffix is an array, then Array[LENGTH-1][0] is the Functor/Head, etc.

And the very nice part is that all these properties apply recursively to all AR2 arrays.
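
The checks above can be sketched in Python as follows. The names `classify` and `functor_and_body` are hypothetical, and NIL is tested by comparing with an empty list rather than by identity, since (as noted) NILs can't be relied on to be unique. What to do with a suffix that is a bare word rather than an array isn't spelled out above, so here it simply has no functor.

```python
def classify(arr):
    """Name the kind of AR2 array, following the LENGTH checks."""
    n = len(arr)
    if n == 0:
        return "nil"
    last = arr[-1]
    if n == 1:
        # The EMPTY TERM check comes first, since it is a special
        # case of LENGTH == 1.
        return "empty term" if last == [] else "term"
    return "proper list" if last == [] else "improper list"


def functor_and_body(arr):
    """Return (functor, body) of the suffix, or None if it has none."""
    suffix = arr[-1] if arr else None
    if isinstance(suffix, list) and suffix:
        return suffix[0], suffix[1:]
    return None


print(classify(["1", "2", "3", ["foo", "bar"]]))  # improper list
print(functor_and_body([["foo", "bar"]]))         # ('foo', ['bar'])
```

Because every cell that is itself an array is also an AR2 array, the same two functions can be applied to any nested cell.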

Dataspace 10: An Array Representation

So far, we have been looking at term-expressions as an extension of (or implemented on top of) Lisp or Scheme cons-cell structure. This is fine if we’re running on a Lisp or Scheme. But the most popular languages today are not Lisp or Scheme, and don’t usually have a native cons-cell implementation. Further, the model of all storage as a big undifferentiated soup of cons-cells has a couple of big limitations: 1) an O(n) to O(log n) access time, depending on the data structure, if we don’t already have a pointer, and 2) pointers are relative to a big memory pool – they don’t give us an easy way to break our data into chunks and make sure that related data is stored close by.

One way of solving all of these problems is to look at how we can represent term-expressions not on cons-cells, but on a much more fundamental and widely-available data structure: arrays.

Dataspace 9: A Tower of Nulls, And Awkward Sets

Looking at term-expressions, one of the first things we notice is that there are a large number of null-like terms. I’m wondering what the meaning of these varieties of null might be.

  • The simplest null-like term is the nil pair or empty list: ()
  • The next one is the empty term: (/)
  • Then we have the empty set (if we can think of /all as a set) or empty union: (/all)
  • Then, for every other term functor X, the empty X: (/X)

An interesting question is whether terms correspond to types, (and if so, in what particular type system) or whether the notion of ‘type’ is unrelated to what we’re looking at here.

Dataspace 8: Example: Movie data

Sidebar: Here’s a quick comparison of what I’m hoping to achieve in terms of syntax and readability, and an example of why I think it’s important to spend a fair bit of time thinking about syntax. Particularly, about what’s not in the syntax, so it’s not there to get in the way.

SWI Prolog’s SWISH has some wonderful example programs on the web; here, for example, is a simple movie database with nearly 3000 separate facts (probably taken from IMDB, I guess).

Dataspace 7: A Low-Level Encoding

Up till now we’ve been looking at term-expressions as a thin layer over S-expressions (ie, one reserved symbol, the term marker), and assuming that at a machine level they will use a Lisplike cons cell structure (ie, linked lists).

The architecture of PicoLisp makes a good argument for using cons cells as the only method of storage, as it simplifies memory management, and simplicity may be more important for reliability and security than raw performance.

But if we wanted, we could have quite a dense encoding for term-expressions, based on the old Lisp Machine tricks of CDR coding and tagged pointers. This means we could map term-expressions directly onto sequences of memory cells.
