Dataspace 14: What Question Should I Have Asked You Instead Of This One?

I don’t want to say that my Dataspace project is on hold, but I have hit a conceptual wall, a very familiar one that I’ve been hitting for a long time.

I’m reasonably happy with Term-Expressions as an approach to a universal data syntax. They’re not perfect and could take a bit more ongoing tinkering. (One piece of tinkering I’ve been doing involves looking at Retro Forth and thinking about the possibilities of a word-based rather than character-based syntax).

The wall is more to do with the execution model. What I want, and have wanted since about 2006, is a way of combining the SLD Resolution algorithm from Prolog and Kanren with the more familiar application/evaluation/reduction algorithm from Lisp and lambda calculus. It is a hard problem, or at least it appears to be. If there is a solution, it will appear very simple and obvious once it is solved, because simplicity is the requirement for being an acceptable solution. But I don’t have that solution yet.

The problem comes because there are two dimensions in which Prolog-like resolution expands on Lisp-like evaluation: a space-like dimension and a time-like dimension.

The “space-like” dimension of Prolog’s “logic programming” or Kanren’s “relational programming” comes from the way they both expand functions into relations. If the core abstraction in your programming language is a relation rather than a function, then an “evaluation” can return zero, one or many output values for any given input, rather than exactly one.
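As a toy Javascript sketch of that space-like dimension (my own names and fact table, nothing from any real system):

const FACTS = [["tom", "bob"], ["tom", "liz"], ["bob", "ann"]];

// childrenOf behaves like a relation: zero, one or many answers per input.
function* childrenOf(p) {
  for (const [parent, child] of FACTS) if (parent === p) yield child;
}

console.log([...childrenOf("tom")]); // ["bob", "liz"] -- two answers
console.log([...childrenOf("ann")]); // []             -- no answers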

This is a good thing, in my opinion, and it’s where a large amount of the power of logic/relational programming comes from. But dealing with every return value being a set (or list) of values means you have to think quite hard about where those values are going to go, and what you are going to do if there are either none, or a very large number of them. And also, how the whole system resolves down to its simplest form, which in normal programming we assume would be just one value. Both Prolog and Kanren (at least Micro-Kanren, its simplest form so far) solve this by giving you only one value at a time, with the rest of the values being something like a continuation (in Prolog) or a function you have to force (in Micro-Kanren). This solution, however, forces a linear search strategy onto what should otherwise be a parallel search space, and that doesn’t seem ideal. Prolog came up with a pretty good, cheap linear search strategy (depth-first search) but it fails in some fairly commonplace situations (left recursion). Kanren appears to have a better default search strategy that can cope with more forms of recursion, but it’s still linear.
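To make that “one value at a time” machinery concrete, here is a rough Javascript paraphrase of the Micro-Kanren stream idea (the shape of it, not its real API):

// A stream of answers is one of:
//   null                      -- no (more) answers
//   { head, tail }            -- one answer now, the rest of the stream later
//   a zero-argument function  -- an "immature" stream you must force
const cons = (head, tail) => ({ head, tail });

// Pull the next answer, forcing thunks as needed. This is inherently
// one-at-a-time: the rest of the search space stays frozen until asked for.
function pull(stream) {
  while (typeof stream === "function") stream = stream(); // force
  return stream; // null, or { head, tail }
}

// Take the first n answers from a possibly infinite stream.
function take(n, stream) {
  const out = [];
  while (n-- > 0) {
    const s = pull(stream);
    if (s === null) break;
    out.push(s.head);
    stream = s.tail;
  }
  return out;
}

// An infinite stream of answers: 0, 1, 2, ...
const nats = (n = 0) => cons(n, () => nats(n + 1));
console.log(take(3, nats())); // [0, 1, 2]

(The better default search comes, as I understand it, from how two streams get merged: Micro-Kanren interleaves answers from the two streams rather than exhausting the first one, which is why some recursive programs that loop forever in Prolog still produce answers.)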

It’s awkward that we have to think about this extra “space-like” dimension, where every “function call” can stall out with no values or balloon out into a large or even infinite number. But it’s something we can cope with.

The “time-like” dimension, though, is where things get weird. In Prolog and Kanren, you can pass an “unbound variable” to a predicate/relation.

I call this weird unbound-variables behaviour of Prolog “time-like” because in normal “Lisp-style” function calling you are asking:

“My question is X. Your answer to that will be Y. What’s Y?”

But in Prolog, you can also ask a question about questions themselves:

“What question should I have asked you instead of this one to get you to answer Y?”

and you can even have neither question nor answer bound:

“What question should I ask, on any possible subject, to get any answer from you at all? Please list all possible questions and answers, as pairs, in order. I’ll wait.”

(But obviously you don’t want and won’t get every possible question and answer. Even over a finite vocabulary that’s an infinite set – technically a countable one, so not quite the kind that makes Georg Cantor wince, but still vastly more than you want all at once – so some kind of lazy evaluation will need to be part of even the very simplest solution.)

Prolog handles this sort of weird query (as it handles everything else) via backtracking, which is essentially time travel. Computations run forward until they hit a snag (ie, zero values returned rather than at least one) and when that happens the computation gets unwound and time-travelled back to the last variable binding that worked.

Unbound variables are where the second chunk of Prolog magic happens, and they’re also very very weird to think about in traditional programming terms. They’re like… if you had a function, right, but instead of applying that function to any value, you just… didn’t apply it, to anything, and instead you examined the type of that function, found its domain, and iterated through the values of that domain, at runtime. So you were deconstructing the type of a function as a side effect of applying it. And that was a perfectly normal thing to do in your program, and every program could do it. There is therefore a perfect equivalence between “functions” and “types” inside a Prolog program, with the two both being constructable and examinable at runtime… or rather, neither functions nor types exist as things-in-themselves but as components or views of a single unified abstraction, predicates/relations. It’s all very interesting indeed. At least I think it is.
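Here is that “iterate through the domain at runtime” behaviour as a toy Javascript sketch (again my own names, with a fact table standing in for a real predicate):

const UNBOUND = Symbol("unbound"); // stand-in for a Prolog-style unbound variable
const FACTS = [["tom", "bob"], ["tom", "liz"], ["bob", "ann"]];

// parent(P, C): either argument may be UNBOUND, in which case we are
// effectively enumerating part of the relation's "domain" at runtime.
function* parentRel(p, c) {
  for (const [fp, fc] of FACTS) {
    if ((p === UNBOUND || p === fp) && (c === UNBOUND || c === fc)) {
      yield { P: fp, C: fc };
    }
  }
}

// "What question should I have asked to get the answer ann?"
console.log([...parentRel(UNBOUND, "ann")]); // [{ P: "bob", C: "ann" }]

// Neither side bound: every possible question-and-answer pair.
console.log([...parentRel(UNBOUND, UNBOUND)]); // all three facts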

This behaviour is something that the good old workhorse of computer science, Lambda Calculus, does not model well at all. Or at least, it does not model it through the usual mechanism of applying a lambda to other things. Computer scientists do use Lambda Calculus in complicated ways to prove all sorts of theorems but they often do it by “cheating”, ie, by directly accessing the components of a lambda expression itself. This is cheating because a computer, at runtime, can’t do this, at least not in languages descended from Lisp, which is most of our current programming languages. One of the ground rules of Lisp and all her children is that once you have a lambda, a function, the only thing you can do with it is apply it to a value and get a value out. It is forbidden to ask “what’s inside the box”. The box is an opaque box and the whole art of compilation rests on the box – the function – remaining opaque.

(It may be important and relevant, though, to note here that Lambda Calculus was not invented to write Lisp, or even to be a computer programming language of any kind. Rather, it was a failed project in logic that John McCarthy dug up a couple of decades later and used for a completely different purpose. So it’s entirely possible that Lambda Calculus is not actually the best tool for doing things related to computer programming – because it wasn’t invented for that. It was just the screwdriver sitting in the tool box that happened to be there and available, when maybe a spanner would have done a better job).

Prolog predicates work differently to Lambdas. They do have a sort of opaque-ness about them… it might not always be possible to ask a Prolog system exactly what the definition of a predicate is, for example. But, you can always “look inside” a predicate in a sense by asking it what its arguments would be (“what question should I have asked to get answer Y?”) in a way that you just can’t do with a Lambda in Lambda Calculus.

(Or, to be more precise, aren’t allowed to do, if you are not a mathematician with a pencil sitting outside the system, but are instead a computer program – ie, a Lambda – looking at a memory full of other Lambdas. If you are a Lambda, you have only one thing you can do to another Lambda – “eat it”. Maybe what you would like to do is just look at the other Lambda, but you weren’t given eyes by your mathematician creator, only a mouth).

Kanren relations are just functions, so they’re opaque boxes like in Lambda Calculus, but, they’re functions that are pretending to be Prolog predicates, so what they return is the equivalent of “the thing that you can look inside”. They’re opaque boxes holding a transparent box.

To really model what the Prolog SLD Resolution algorithm is doing in Lambda Calculus – to get to the core of what’s going on in the simplest terms – you would have to have some kind of primitive other than ‘abstract’ and ‘apply’, I think.

(Yes, you can implement the Resolution algorithm in Lambda Calculus… that’s what Kanren does. But it doesn’t directly model it. What I mean by that is that Kanren is a layer that runs over another language, like Lisp/Scheme. It is not a language in itself. Kanren sort of operates at two levels: Lambda Calculus at the lower level, interpreting or emulating Prolog at the higher level. That split-level quality is what makes Kanren feel a little clumsy and “ugly” to me.)

Finally, Prolog (and Kanren)’s unbound variables don’t really have a parallel in other computing or data storage systems, and that makes people nervous. It makes me nervous anyway. How do you name them? (Micro-Kanren uses integers, but how would that even begin to parallelise? What happens once you make more than BIGNUM queries over the lifetime of any given processing node or data set?) How do you transport them between systems? Do they make a set of data intractable for some processing? For example, one of their possible close relatives, RDF’s Blank Nodes, really scare people because they add all sorts of weird logical complexities (and massive CPU processing time sinks) into what would otherwise be simple bulk data handling processes. The SQL rival, Datalog, was derived from Prolog but removed unbound variables, which I think really limits what it can do but no doubt makes it tractable for some of these tasks.
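For what it’s worth, the integer scheme that parenthetical refers to looks roughly like this in Javascript (a loose paraphrase of Micro-Kanren’s call/fresh, not its actual code):

// A state is a substitution (variable index -> value) plus a counter.
const emptyState = { subst: new Map(), count: 0 };

const mkVar = (n) => ({ var: n }); // variable "names" are just integers

// call/fresh, roughly: hand the goal-builder a brand-new variable,
// then run the resulting goal with the counter bumped by one.
const fresh = (goalBuilder) => (state) =>
  goalBuilder(mkVar(state.count))({ ...state, count: state.count + 1 });

// Two independent nodes both count from 0, so "variable 3" over here and
// "variable 3" over there are different variables with the same name --
// which is exactly the naming and transport problem described above.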

The other side effect of unbound variables is that because of them – because you can ask Prolog or Kanren “what question should I have asked instead of this one?” and get a sensible answer, and the whole system relies on this – it is very hard to cleanly map a Prolog predicate or a Kanren relation onto a “function that you ask questions and get answers”. If you ask a “question”, it’s a question padded with a lot of other stuff (an entire predicate, something like “a database row”); if you get an “answer”, it’s an answer padded with even more stuff, and of a completely different shape to the question – not even “a database row” but rather a set of variables and their bindings. And the variables in that answer may not even have names but numbers, while the original question you asked used variables with names.
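Spelled out as data (my own illustrative rendering, with made-up structures), the mismatch looks like this:

// The "question": a term, with named variables.
//   append(Xs, Ys, [1 2 3])
const question = ["append", ["Xs", "Ys", [1, 2, 3]]];

// One "answer": not a term at all, but a set of bindings -- and in a
// Micro-Kanren-style system the variables may come back as numbers.
const answer = new Map([[0, [1]], [1, [2, 3]]]); // var 0 = [1], var 1 = [2 3]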

None of this feels either simple or “right”. In Christopher Alexander’s terms, it does not have a sense of Beauty to it, which must mean that there are unresolved (or unacknowledged) forces in the problem space that are not getting the attention that they need. I feel that there is an extremely strong and very important sense of Beauty very close to both Prolog and Kanren… but neither is quite there yet. Something is either missing, or something unnecessary has been added that breaks a symmetry.

Trying to understand this (on a deep enough level to make it feel simple… in other words, to feel my way to a solution that really has Beauty) is the conceptual wall that I keep hitting.

I am sure that there is something over that wall, but I can’t climb it yet.

Dataspace 13: Quotes and Whitespace

Prelude:
Dataspace 12: AR3 to AR4, or [`] = []

I have a prototype parser for AR4, but it’s time to rewrite it.

But first, a couple of notes about the syntax that have become clearer to me, I think.

One of the things I’ve really wanted to get a handle on is how to do string quotation in a sane way. You’d think that quotes would be easy – just use " – but right there we have multiple problems:

Continue reading “Dataspace 13: Quotes and Whitespace”

Dataspace 12: AR3 to AR4, or [`] = []

Prelude:
Dataspace 11: AR2, A Better Array Representation

Okay, that didn’t quite work.

I tried to implement a Javascript parser for AR2, and it kept feeling like it was much more complicated than it ought to be. Eventually I figured out that the problem was the distinction between NIL [] and EMPTY [`]. EMPTY just… did not like existing.

So I took the plunge and added an axiom that NIL == EMPTY, calling AR2 with this axiom AR3, and everything seemed to be much better. I wrote a working Javascript parser for AR3 and it was okay-ish.

But then I started to feel that even AR3 had some problems. The main one is that it loses the array character: you can’t feed an AR3 array to a standard Javascript function that expects an array and get a sensible result. You always have that extra cell on the end which is not part of the array.

So I took the next step, a new notation I call AR4. This one seems better.

AR4 is an array with three slots.

[Head,Body,Tail]

Head is the head (Prolog functor equivalent). It is NIL [] if the term is a List.

Body is an array. Each element of the array is an AR4.

Tail is [] or a new AR4 continuing the expression.
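For example, the term-expression [`foo 1 2 3] is, written out in full, ["foo", [1, 2, 3], []], and the plain list [1 2 3] is [[], [1, 2, 3], []] (treating bare atoms as already-simplified AR4s).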

Then, this array can be simplified to a one- or two-slot array with a few rules:

1 If Head and Tail are both [], ie, it’s a simple proper list, it can be represented as just one slot. This is very common so we want this to be as simple as possible.

[Body]

eg term-expression

[1 2 3]

represents into JSON as

[[1,2,3]]

2 If Body and Tail are both [], AND if Head is an atom (a string or number), then it can be represented as just one slot. This is also a very common case.

[Head]

eg term-expression

[`true]

represents into JSON as

["true"]

3 If Tail is [] then we can drop the third slot entirely.

[Head, Body]

eg term-expression

[`foo 1 2 3]

represents into JSON as

["foo", [1, 2, 3]]

This is pretty much all there is to AR4. A couple of other special-case simplification rules:

4 If both Head and Body are [], then we can reduce the whole thing to just Tail.

eg term-expression

[`[] `foo bar]

becomes

[`foo bar]

or JSON

["foo", ["bar"]]

This simplification rule is pretty important, as it means that NIL is always [].

5 If Body only has one element and it’s an atom (string or number), then we can reduce Body to its first element.

eg term-expression

[`foo bar]

becomes JSON

["foo", "bar"]

which is really just a little bit of syntax sugar for the JSON, and I’m not sure if it’s a good idea or not, but we can do it so we might as well.
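For concreteness, here is a minimal Javascript sketch of the five rules as I read them (assuming atoms are strings or numbers, NIL is the empty array, and Head, the Body elements and Tail are already simplified):

const isAtom = (x) => typeof x === "string" || typeof x === "number";
const isNil = (x) => Array.isArray(x) && x.length === 0;

// Simplify one full three-slot AR4 [Head, Body, Tail].
function simplify(head, body, tail) {
  // Rule 4: Head and Body both NIL -- the whole thing is just Tail.
  if (isNil(head) && isNil(body)) return tail;
  // Rule 1: simple proper list -- just one slot.
  if (isNil(head) && isNil(tail)) return [body];
  // Rule 2: atom Head, no Body, no Tail -- just one slot.
  if (isAtom(head) && isNil(body) && isNil(tail)) return [head];
  // Rule 5 (the optional sugar): a one-atom Body collapses to the atom.
  // I apply it only when there is a Head, so it can't collide with rule 1.
  const b = body.length === 1 && isAtom(body[0]) ? body[0] : body;
  // Rule 3: Tail is NIL -- drop the third slot.
  if (isNil(tail)) return [head, b];
  return [head, b, tail];
}

console.log(simplify([], [1, 2, 3], []));    // [[1,2,3]]       <- [1 2 3]
console.log(simplify("true", [], []));       // ["true"]        <- [`true]
console.log(simplify("foo", [1, 2, 3], [])); // ["foo",[1,2,3]] <- [`foo 1 2 3]
console.log(simplify("foo", ["bar"], []));   // ["foo","bar"]   <- [`foo bar]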

Parsing AR4 is not much more difficult than parsing AR3, and it means we can now use all the standard array functions on the Body of an expression, while the raw Javascript arrays remain fairly simple to read.

Dividing a term-expression conceptually into a Head, a Body and a Tail also feels like a very natural thing to do, and helps explain what is going on. It’s just an array (Body) marked up with two standard added fields: the Head telling us what kind of object it is, and the Tail giving us the Lisp CDR equivalent.

Putting together both the Head and the Tail, as well as the size and random-addressable array-ness of an array Body, gives us a “tagged linked list of arrays”, which seems like it must be a fairly well understood kind of data type, but letting the tags be fully general and combining it with a standard serialisation format still seems like something that hasn’t been explored very much. We’ve had universal data types that give us a Head (objects and predicates), we’ve had universal data types that give us a tail (Lisp conses, and other forms of singly linked lists), but this is a type that includes both as standard. It’s really just a parse tree, but it’s a concrete parse tree not an abstract one.

My feeling is that this is indeed a very powerful universal data type, one that could serve as the basis of a thousand-year language, but making that argument take actual shape will be the next step.

Dataspace 11: AR2, A Better Array Representation

Prelude:
Dataspace 0: Those Memex Dreams Again
Dataspace 1: In Search of a Data Model
Dataspace 2: Revenge of the Data Model
Dataspace 3: It Came From The S-Expressions
Dataspace 4: The Term-inator

Dataspace 10: An Array Representation

I want a Memex. Roughly, I want some kind of personal but shareable information desktop where I can enter very small pieces of data, cluster them into large chunks of data, and – most importantly – point to any of these small pieces of data from any of these chunks.

‘Pointable data’ needs a data model. The data model that I am currently exploring is what I call term-expressions (or T-expressions): a modified S-expression syntax and semantics that allows a list to end with (or even simply be, with no preceding list) a logical term in the Prolog sense.

A year ago, in Dataspace 10, I outlined a tentative ‘array representation’ for T-expressions, for use in runtimes (like Javascript, C or most other modern languages) that don’t provide Lisplike cons cell storage but do provide arrays. Even apart from just the pragmatic concern of ‘arrays are all we have and JSON is becoming the standard data format of the Web’, there are a number of advantages that arrays give over conses. One advantage is that array indexing gives us O(1) access time to elements, and another is that we can nest arrays inside arrays to get recursively contained blocks of storage. Recursive containment (rather than an undifferentiated planetwide storage pool or ‘soup’) is a feature of our real physical world and is key to achieving scalability and data transportability.

However I now think the array representation (let’s call it AR1) I outlined in Dataspace 10 is too clumsy and we can do better. It’s clumsy for a reason – I wanted to preserve arrays in their natural form when embedding them into terms, to prevent undue array slicing/copying operations and to take advantage of runtimes like Javascript which can do optimisation of large array layouts if all the elements have the same shape. But I’m now thinking that’s a bit of premature optimisation. Let’s make a simpler format that preserves some nicer properties, at the expense of maybe making copying an expensive operation.

This new array representation – let’s call it AR2 – is much simpler. It’s just the natural extension of cons cells to n-length arrays.

Let the 0-length array [] be NIL, the empty list.

Let every other array, of length 1 and up, represent two parts:

  1. A ‘prefix’ of LENGTH-1 cells (cells 0 to LENGTH-2 in 0-based indexing), representing the listlike portion
  2. A ‘suffix’ at the last index (cell LENGTH-1 in 0-based indexing), representing the termlike portion, which is itself a T-expression.

I’m also settling on the following characters for markup: [, ], ` (term marker) and \ (character escape). These four are chosen because they are available unshifted on the standard keyboard, \ is the standard C escape, and they don’t interfere with English text. Conspicuously missing is any kind of string quotation character. There are only string words/atoms, and T-expressions. For now, not even numbers.

(Because numbers immediately raise the question: which numbers? Decimal, hex, octal, binary? Integer, floating point, full numeric tower? JSON-compatible numbers or non-JSON formats? How do we deal with prematurely identifying a string of digits as the wrong kind of number and misparsing it? But if we do want to bake numbers into the syntax, we can do that with a simple rule: invoking \ inside a word marks it as a “quoted word” which is guaranteed to be a string. If we don’t invoke \ at any time, and it parses as a number, then it’s a number. When printing a word which could be parsed as a number, escape the first character.)
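Under that rule (as I read it), bare 1234 would parse as a number, \1234 would always be the string "1234", and a printer holding the string "1234" would emit it as \1234, so that it can never be misparsed on re-reading.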

Some of my thinking here about removing string quoting entirely is my own, an idea I’ve been toying with for years; but I’ve been motivated towards this recently by the remarkable new Interactive Fiction language Dialog, which takes this approach, and demonstrates how well it works. More on Dialog later.

In AR2, we have a fairly natural and intuitive encoding/decoding, which is easy to do in your head:

  1. If it’s a proper list, just append [] to the end.
  2. If it’s a term [`foo bar], just wrap it in brackets, ie, [["foo","bar"]]
  3. If it’s an improper list, just put the term component on the end of the list, ie [1 2 3 `foo bar] becomes ["1","2","3",["foo","bar"]]

We can still tell the difference between NIL or EMPTY LIST [] ([] in AR2) and EMPTY TERM [`] ([[]] in AR2), if this is algebraically important to us. (Specifically: the T-expression [foo bar] becomes [foo bar []] in AR2, while [foo bar `] becomes [foo bar [[]]].)
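Laid out side by side as Javascript values (my own tabulation of the rules above):

[]                              // NIL, the empty list []
[[]]                            // EMPTY TERM [`]
["1", "2", "3", []]             // the proper list [1 2 3]
[["foo", "bar"]]                // the bare term [`foo bar]
["1", "2", "3", ["foo", "bar"]] // the improper list [1 2 3 `foo bar]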

It is now very easy to write a parser for AR2, and the resulting array data structure is about as easy to work with as one could hope for. Most of the pain in a language like Javascript comes from nested character escaping – \ becomes \\ in T-expressions which becomes \\\\ inside a Javascript string. And ` occasionally causes problems in some contexts (for example in WordPress text boxes), but that’s as good as we can get, I think. It’s a much better character to use as an escape than ' or " which can occur in names and English sentences – and which also get mangled to curly quotes in some contexts (Microsoft Word, and WordPress text boxes).

Given an AR2 array, we can very quickly determine some key properties of it:

  1. Take LENGTH – this should be an O(1) operation for modern sane length-prefixed non-null-terminated arrays. Maybe not in raw C. So don’t use raw C. I mean you can if you want, it’ll still work, just LENGTH won’t be an O(1) operation.
  2. If LENGTH == 0: it’s NIL. All NILs should maybe be unique, though this is not the case in, eg, Javascript, so maybe don’t rely on that. Some implementations of Prolog, for example, may want to rely on Unknowns (unbound variables) being represented by unique NILs. However, if you do that, be aware that you’ve then got an in-RAM structure that can’t be uniquely serialised as T-expressions, which is not a particularly good thing.
  3. If LENGTH == 1: it’s a Term. In this case, Array[0][0] is the Functor or Head, Array[0][1..] is the Body, Array[0][1] is the first argument, etc.
  4. If LENGTH == 1 and Array[LENGTH-1] == NIL, then it’s EMPTY TERM.
  5. If LENGTH > 1 and Array[LENGTH-1] == NIL, then it’s a Proper List.
  6. If LENGTH > 1 and Array[LENGTH-1] != NIL, then it’s an Improper List, and Array[LENGTH-1] is its Suffix. If the Suffix is an array, then Array[LENGTH-1][0] is the Functor/Head, etc.

And the very nice part is that all these properties apply recursively to all AR2 arrays.
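These checks translate almost one-for-one into Javascript; a minimal sketch (names are mine):

const isNil = (x) => Array.isArray(x) && x.length === 0;

// Classify an AR2 array using only its LENGTH and its last cell.
function classify(a) {
  if (a.length === 0) return "NIL";
  const suffix = a[a.length - 1];
  if (a.length === 1) return isNil(suffix) ? "EMPTY TERM" : "Term";
  return isNil(suffix) ? "Proper List" : "Improper List";
}

console.log(classify([]));                         // NIL
console.log(classify([[]]));                       // EMPTY TERM
console.log(classify([["foo", "bar"]]));           // Term, functor "foo"
console.log(classify(["1", "2", "3", []]));        // Proper List
console.log(classify(["1", "2", ["foo", "bar"]])); // Improper List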

Dataspace 10: An Array Representation

Prelude:
Dataspace 0: Those Memex Dreams Again
Dataspace 1: In Search of a Data Model
Dataspace 2: Revenge of the Data Model
Dataspace 3: It Came From The S-Expressions
Dataspace 4: The Term-inator

I want a Memex. Roughly, I want some kind of personal but shareable information desktop where I can enter very small pieces of data, cluster them into large chunks of data, and – most importantly – point to any of these small pieces of data from any of these chunks.

‘Pointable data’ needs a data model. The data model that I am currently exploring is what I call term-expressions (or T-expressions): a modified S-expression syntax and semantics that allows a list to end with (or even simply be, with no preceding list) a logical term in the Prolog sense.

So far, we have been looking at term-expressions as an extension of (or implemented on top of) Lisp or Scheme cons-cell structure. This is fine if we’re running on a Lisp or Scheme. But the most popular languages today are not Lisp or Scheme, and don’t usually have a native cons-cell implementation. Further, the model of all storage as a big undifferentiated soup of cons-cells has a couple of big limitations: 1) an O(n) to O(log n) access time, depending on the data structure, if we don’t already have a pointer, and 2) pointers are relative to a big memory pool – they don’t give us an easy way to break our data into chunks and make sure that related data is stored close by.

One way of solving all of these problems is to look at how we can represent term-expressions not on cons-cells, but on a much more fundamental and widely-available data structure: arrays.

Continue reading “Dataspace 10: An Array Representation”