Dataspace 7: A Low-Level Encoding

Prelude:
Dataspace 0: Those Memex Dreams Again
Dataspace 1: In Search of a Data Model
Dataspace 2: Revenge of the Data Model
Dataspace 3: It Came From The S-Expressions
Dataspace 4: The Term-inator

I want a Memex. Roughly, I want some kind of personal but shareable information desktop where I can enter very small pieces of data, cluster them into large chunks of data, and – most importantly – point to any of these small pieces of data from any of these chunks.

‘Pointable data’ needs a data model. The data model that I am currently exploring is what I call term-expressions (or T-expressions): a modified S-expression syntax and semantics that allows a list to end with (or even simply be, with no preceding list) a logical term in the Prolog sense.

Up till now we’ve been looking at term-expressions as a thin layer over S-expressions (ie, one reserved symbol, the term marker), and assuming that at a machine level they will use a Lisplike cons cell structure (ie, linked lists).

The architecture of PicoLisp makes a good argument for using cons cells as the only method of storage, as it simplifies memory management, and simplicity may be more important for reliability and security than raw performance.

But if we wanted, we could have quite a dense encoding for term-expressions, based on the old Lisp Machine tricks of CDR coding and tagged pointers. This means we could map term-expressions directly onto sequences of memory cells.

With 32-bit cells that store memory pointers, the cells have to be aligned on 4-byte boundaries, so the lower two bits can be used for tags. With 64-bit cells, three tag bits are available.

We could use one bit to indicate the presence or absence of a term marker, which means using terms as a type code for a block of memory only takes one extra cell.

This leaves at least one more tag bit, which we could use for a ‘pointer’ or ‘jump’ bit: if this bit is set, then the cell acts like a Lisp ‘cdr’- it indicates the address to read for the continuation of the list. If it’s unset, then the cell acts like a ‘car’, ie a standard contents of a list.

If we set the pointer bit to be 0, and let memory cell 0 be NIL as in many Lisps then a memory cell initialised to 0 automatically acts as a jump to NIL, ie an end-of-list sentinel. This would work well with the use of /all as an explicit ‘dot’ – we don’t have to worry about how to represent dotted pairs, because everything is a proper list, ie a memory cell sequence terminating in 0.

But because we have terms as well, which act like type codes, we also have the option of special handling for certain low-level datatypes. Eg: we could have terms representing ‘integer’, where the next memory cell is read as a full integer, or ‘memory block’, where a whole sequence of cells could be read without being interpreted as tag-coded addresses.

(One possibility is to to take ‘jump’ and ‘term’ together as indicating ‘integer’, because it doesn’t make much sense to combine them, so an integer only takes two machine cells – pointer plus data).

CDR coding systems like this typically don’t work well with mutable data. But having the option of ‘jumps’ or ‘redirects’ which take up just one cell means that we have some options for moving data around in memory and leaving a ‘forwarding address’.

Also, combined with immutability (which is a property several other strands of research seem to be converging  towards), such an encoding might become more useful. It’s good to know that we have this option – particularly if we wanted to use term-expressions as a mass storage or transfer format between systems – and in fact that term-expressions seem to make CDR coding more viable than it might be in pure S-expressions.