Dataspace 2: Revenge of the Data Model

Prelude:
Dataspace 0: Those Memex Dreams Again
Dataspace 1: In Search of a Data Model
Dataspace 2: Revenge of the Data Model
Dataspace 3: It Came From The S-Expressions
Dataspace 4: The Term-inator

I want a Memex. Roughly, I want some kind of personal but shareable information desktop where I can enter very small pieces of data, cluster them into large chunks of data, and – most importantly – point to any of these small pieces of data from any of these chunks.

But to even begin we need a data model that allows us to point at small pieces of data. This turns out, surprisingly, to be harder than it looks. The data models in common use that we’ve looked at and rejected so far are:

  1. Filesystems (as in Unix or the Web – the end-point components are too big)
  2. Object-oriented programming (objects are fragmented,  ill-defined and too complex)
  3. Dictionaries (promising, but don’t let us have multiple values per key)
  4. Relational databases (our data is not structured in fixed-length tuples)
  5. Graphs made of triples (they’re neither homoiconic nor recursively structured)

So are we out of options? Not quite. There’s one weird old trick we haven’t looked at. #6 may surprise you.

Option 6. Logic programming

Remember we mentioned (back when we looked at #3, dictionaries) that we’d ‘broken functions’ at the same time as breaking dictionaries, because we wanted to say that ‘the weather is rainy and sunny’ and this is not something the dictionary or function data model can represent? And in #4, that relations are strictly more powerful than functions, because they can represent such functions or dictionaries with multiple values? And we also mentioned in #5 that we don’t want to separate our program from our data  – that, like a spreadsheet, we’d like to mix and match code and data anywhere?

Well guess what. We can program directly in relations. The programming paradigm to do this exists, but it’s a bit old and dusty. It’s called logic programming.

The main language known for logic programming, Prolog, goes back to 1972 in France. This is only three years newer than Unix – it’s practically in the prehistoric era. Because of this, Prolog as a language is a little funky – it has a feel somewhere between Lisp and Fortran, but more like Fortran. It doesn’t handle strings well (at least not in a modern manner).

(There is a modern revival of the logic programming concept, Kanren – or now, miniKanren – which is sort of less of a language and more of an approach to embedding Prolog-like programming inside other languages. It’s worth looking at, but one problem with Kanren is that it depends on all the machinery of its host language, originally Scheme. It’s usually implemented as a library, but we’re looking for a data model rather than a library, so we’ll put Kanren on one side for now and maybe come back to it later.)

Prolog (and also Kanren) work by using a process called resolution using unification with backtracking.  This procedure dates back only to the 1960s and is, essentially, a reasonably fast brute-force depth-first search algorithm. It maintains state in the form of a binding environment of logic variables – both of which are concepts that don’t quite map onto the standard functional or OOP ideas of environment or evaluation. Logic variables, in particular, are quite weird beasts, and don’t play especially well with pure-functionality or parallel execution. This gives logic programming a bit of a hand-me-down flavour, because there’s been lots of work on the core methods of function evaluation, and not nearly as much (until the Kanren revival, at least) on unification and backtracking.

(Most of this logic programming research was big in the 1980s – headlined by the infamous Japanese Fifth Generation Computer Project – and flamed out by around 1993 after that project was cancelled. Now coming slowly back into view due to interest in Semantic Web and Big Data technologies.)

The general approach though, is: a program is built out of function-like things called (in Prolog) predicates. A predicate is a named n-ary relation – very much like a Codd relation or SQL table – which can be searched, but during the search, new values can be computed. So you get the best of both worlds: place to store data, and functions to compute new data. Code and data, mixed together.

Prolog has a kind of timeless thrill to it: that you can just start entering data and then worry later about what it means, is very freeing. It doesn’t even have types! Because it doesn’t need them. You just use predicates (relations) instead. Suddenly three categories which seemed entirely separate: code, data and types – all merge into one. A huge simplification in thought.

(The standard free Prolog today is SWI Prolog, which has a wonderful web-based scratchpad called SWISH. Give it a try.)

But of course, there’s a catch. Because of its roots in formal logic (which way predate programming, even Lisp – even typewriters were a bit high-tech for the logic era) and 1970s computing (which still hadn’t really come to terms with structured programming, let alone objects or pure functionality), Prolog has some very rough edges, which we’d like to sand off. Some examples:

  • You can’t put a Prolog program inside another Prolog program. A Prolog program is a list of clauses, which end in a period. Clauses are vitally important units of structure because they define the scope of logic variables – two variables with the same name inside a different clause are different entities. But you can’t nest a clause inside another clause. That breaks recursive decomposition, and means we can’t use Prolog (without modification) as the universal data model we’re looking for.
  • Prolog has one  ‘database’ which it stores all its information in. That’s not especially helpful if we want something we can run across multiple computers, or that even doesn’t have the idea of servers.
  • Prolog programs can change the state of the database as they run. That can cause problems with parallelisation.
  • Worse, Prolog doesn’t have a standard way of describing the database, or changes to it, as a logical statement, so even if we wanted to try to be pure-functional, we couldn’t easily do that.
  • Prolog’s handling of strings is weird, and variables have to start with a capital letter. Small things, but big enough to break a lot of connections to the outside world.
  • Prolog’s syntax allows us to define infix and postfix operators, which is nice, but those can become very tricky to parse, since we have to read code first to know how to read code. Also, once symbols have been reserved as operators, we then lose the ability to use that symbol normally. Prolog has already reserved a whole lot of symbols: . , ; () [] :- and lots, lots more. Good luck if you wanted to use any of those yourself.
  • A Prolog predicate is an n-ary relation, just like a Codd relation, and as we’ve seen earlier, these have big problems with trying to represent ragged, not-completely-uniform data. For example, if you tried to represent an object directly as a Prolog predicate, you’d have the same trouble as if you tried to represent it as a SQL table: not all the rows would be the same length.
  • As we saw with triples, Prolog represents lists as relations, and that means they have the same problems we get with lists in triple systems: we need a special syntax just for lists.

The underlying data model that Prolog uses, called term structure, seems useful, though. A Prolog term (ignoring any infix-y stuff) looks like this:

funny(cats)

which encodes the sentence

Cats are funny.

It looks like a function call, but the thing at the front isn’t an actual function name. It’s called a functor, a horrible and ambiguous word, but the definition I think comes not from mathematics but from linguistics – basically it’s a ‘function-like word’ that expresses a relation. The whole thing is called a term , because it’s not actually a predicate unless it’s defined in – you know what, let’s skip that. Just call it a term and move on.

The important thing is that functors don’t actually have any inherent meaning except what we give them. And a term doesn’t have any meaning other, unless we choose to think it does. This is important because it means we don’t have to automatically evaluate any term – and we don’t have to run screaming into the wilds of second-order logic * with it either . It can just sit there in the database, until we read it, and then we decide how to process it.

(* seriously, don’t follow that link, you will be eaten by a logician. The important thing is you don’t need to know any of this. We’re purely thinking about syntax here, not cascading hierarchies of infinities.)

Let’s look at how all those sentences we modelled earlier might appear in Prolog term structure.

Cats are funny.
The capital of North Korea is Pyongyang.
It is raining today in New Zealand.
Today, North Korea test-launched an ICBM
In response to (the event above), South Korea test-launched an ICBM.
It is both raining and sunny today in New Zealand.

funny(cats)
capital(northkorea,pyongyang)
ondate(20170704,weather(newzealand,raining))
ondate(20170704,launched(northkorea,icbm))
inresponseto(ondate(20170704,launched(northkorea,icbm)),launched(southkorea,icbm))
ondate(20170704,weather(newzealand,and(raining,sunny)))

Most of these are triples – except for funny(cats), which we can now see is a pair, and we don’t have to add that extra ‘are’ we needed for triples.

There’s a slight sense of sadness with this, as well as triples, in that we notice we’ve lost the nice ability we had with dictionaries, where we could put statements in any order we wanted (subject-relation-object, or relation-subject-object, or even object-relation-subject). Because we’ve lost this ability, and we’re forced into putting the relation first, we’ve actually lost the ability to organise data by objects.

That feels like it might be an important feature to have.

And actual Prolog programmers will notice that while the last line – weather(newzealand,and(raining,sunny)) – is syntactically okay, it’s not really standard Prolog usage. It’s fine-ish, if we wanted to do that, but it’s not really how things are done. Prolog doesn’t really frown on having one fact represent multiple facts, but it doesn’t really support it either.

‘and(X,Y)’ has a definition in Prolog after all (it is a true predicate), and that definition means ‘consult the current database and check if either of the two facts X and Y are in the database, or can be inferred’. We’re not really using that and-term to mean an and-predicate, and that might be a problem.

A more idiomatic Prolog way might be to use a list instead:

ondate(20170704,weather(newzealand,[raining,sunny]))

Which again is syntactically okay – and idiomatic as well, and many Prolog systems would do something like this.

But there’s a problem. It’s ambiguous. We don’t have a way, now, to distinguish between: ‘both of these two values are true’ and ‘this one value is true, which happens to be a list’.

What if we wanted to have a list? And also have multiple values? Eg, if we had these two statements:

shoppinglist([milk,eggs,butter])
shoppinglist([catfood,lightbulbs])

and we wanted to combine them into two? (Because, perhaps, we wanted to keep all the information about shoppinglist together in one place). Yes, we might be able to append the lists, but that loses information – we actually have two shopping lists, each of which is itself a list.

Long story short: there is no good answer to this problem within Prolog term structure as it exists. If you want collections, you just put things into lists, but they never quite do what you want. You have to be very careful with your context. You can’t unambiguously specify ‘a collection of stuff, which can be any kind of stuff’; you have to specify some kind of schema beforehand.

And unfortunately ‘collections of stuff which are self-describing’ is really what we want for a ‘dataspace’.

However, there is a solution – a very small tweak which also, I think, solves a whole bunch of other problems – including a more ‘object-y’ way of organising data.