Dataspace 4: The Term-inator

Dataspace 0: Those Memex Dreams Again
Dataspace 1: In Search of a Data Model
Dataspace 2: Revenge of the Data Model
Dataspace 3: It Came From The S-Expressions
Dataspace 4: The Term-inator

I want a Memex. Roughly, I want some kind of personal but shareable information desktop where I can enter very small pieces of data, cluster them into large chunks of data, and – most importantly – point to any of these small pieces of data from any of these chunks.

Before we can build such a system, we have to settle on a data model. The logic programming language Prolog gives us a potentially useful universal data model – term structure –  but we don’t get the most out of it unless we express it in Lisp-style S-expressions,  which reveal hidden semantics that even the formal logic and logic programming communities didn’t catch up with until 1989. But S-expressions also introduce ambiguity that wasn’t there in the original Prolog term structure.

There is a very simple way forward from here, and it’s one that I haven’t seen described before. I think it has potential as a fundamental data semantics for building very large or very small distributed systems.

This is an approach to representing term structure in S-expressions, so I’m calling it term expressions. If we wanted to be cheeky, we could call it T-expressions (because it’s one more than S…)

So S-expressions are a way of representing lists as pairs plus a final nil. But they don’t have to have a final nil. Instead, they can end in a dotted pair or dotted list – with a symbol after the dot indicating ‘the rest of the list’. This comes in really useful in variable matching. Eg:

(a b c . D)

is a dotted list. That D might perhaps be a variable, and if we were matching, it might stand for ‘everything after the c’. Which could be a list itself.

Here’s the core idea:

What if, instead of just a nil, or a dot and then a symbol, we allow a S-expression list to end with – or simply be – a third option: an arbitrary logical term?

So perhaps we had a list looking like this:

(a b c / d e f)

That ‘/’ is the magic character that replaces the dot. Anything after a /  is a logical term. It can have arbitrary structure.

Given this, we can represent a Prolog term and a Prolog list differently:

(/ likes john mary)  -> likes(john, mary)
(likes john mary) -> [likes, john, mary]

That’s it. The rest of this series is unpacking what this very simple, dumb syntax-level idea allows us to do. But it’s not just syntax –  like moving from Prolog terms to S-expressions, it opens a rather large new semantic level doorway. It lets us express things that we can’t quite fit into either S-expressions, or Prolog terms (or SQL relations, or JSON objects, or OOP objects, or filesystems – or sets, or Lisp reader macros, or arbitrary data types), and also simplifies dealing with and representing all of these.

At least that’s my hope. And I think there are reasons to expect hope.

In short:

  • The low-level data model is linked list pair structure, as in Lisp. It could be run in a more compressed form, perhaps – like straight sequences of machine words. But we’ll start with normal Lisp pairs as our basic layer. We can look at PicoLisp for inspiration for how to build a full language (now even an OS) on top of this.
  • The serialisation syntax is S-expressions but without the dot. We have parentheses () to open and close lists.
  • (We can also keep the dot if we want, for embedding within standard Lisps. But we  have the option to completely lose the dot for extra simplicity. Sometimes getting rid of a single reserved character can help a lot. Dots are not the best character to reserve, frankly, since they exist inside English text and numbers.)
  • Instead of the dot, we reserve one character to be a ‘term marker’. Initially I was favouring the vertical bar |, but because that’s a reserved character in many Lisps, I’m now going with the backslash / .
  • We probably also need a quoting strategy, and I’m more leaning toward C-style backslash \ for ‘quote the next character’ rather than double-quotes “” (because quotes get mangled, and more importantly, quotes don’t nest well at all, and we want recursive nestability)
  • Yeah, I’m thinking double quote is the better escape character after all, just because it’s so pervasive elsewhere that it’s immediately something all programmers understand, it’s how quoting works in Scheme, Unix and HTTP URLs are so absolutely pervasive as forms of data that it’s best not to mess with them, and all in all the good old quote is the best of a terrible set of options. Two quotes in a row can represent a quote inside a string, so the escape character becomes its own escape. We can probably manage quote injection issues since we won’t often be passing raw strings around. Well, let’s hope.
  • Though, you know? Looking at the standard keyboard, if I were writing a character-level parser from scratch, and not caring about embedding… I’d probably go with [] for brackets, and / for term. Because they’re all on four keys right next to each other, next to Enter, visually distinct, and unshifted.
  • At a low-level data pointer level, we can probably get rid of the / as a symbol and put it as a tag bit in the pointer. But that’s an optimisation to think about later.