Dataspace 1: In Search of a Data Model

Prelude:
Dataspace 0: Those Memex Dreams Again
Dataspace 1: In Search of a Data Model
Dataspace 2: Revenge of the Data Model
Dataspace 3: It Came From The S-Expressions
Dataspace 4: The Term-inator

I want a Memex. Roughly, I want some kind of personal but shareable information desktop where I can enter very small pieces of data, cluster them into large chunks of data, and – most importantly – point to any of these small pieces of data from any of these chunks.

So… like the Web? you say. Just use the Web, it’s been here for 20 years, problem solved, right?

Yes, like the Web. But there’s a slight problem there. The Web is a distributed filesystem – dating back to at least 1969 for Unix – using DNS for the server name, then folders on the server, then finally a file on the server. This gives us a file path. HTTP augments this with an anchor (the thing after a #), which is a single named node inside an HTML file. It then augments this further with query parameters (the things after a ?), which are lists of variables and values, but which don’t relate to a file (what HTTP calls a document) at all. Query parameters are only sensible to a HTTP server. And we’d like to get away entirely from the concept of servers. We want paths that identify a piece of data in a ‘dataspace’, which might be bigger or smaller than a document.

Long story short, HTTP and URLs won’t do it. It’s okay – ish – for large, chunky bits of data (‘web documents’) but not really good enough for small pieces of data. Because the core HTTP data model is not good enough is exactly why we have so many web application server frameworks on the servers and Javascript frameworks on the client – to try to reimplement the pieces that weren’t there in the original HTTP/HTML Web vision.

We need a data model that’s coherent and built for the job of ‘pointing at small pieces of data from anywhere’. So let’s look at some of our options.

Option 1: Filesystems

Filesystems have been really popular since Unix, and they’re everywhere. But we kinda just just looked at them right now and decided they don’t work. Why don’t they work? Because filesystems end in files. And files are just too big for the small pieces of data we want to point at.

Option 2: Object systems

Objects! As in, the entities described by Object Oriented Programming. These are typed chunks of data that are much smaller than files – some ‘compound document’ systems, in fact, decompose documents into objects (as does, eg, Javascript using the Document Object Model). They seem like they’d fit the bill. Obviously we don’t want both documents and objects as in HTML (documents could be a type of object), but can’t we just use objects everywhere?

Unfortunately, we looked at those too and the short answer is: no. We’d like something very much like objects, but there is no ‘one true object system’ that exists everywhere, or even enough places that we can use. Much worse: the very idea of an object has no formal definition, there’s no clear way to transport them or point at them between systems, and they just generally feel too complicated and too heavy.

We want something like objects! But simpler. We can add the complicated bits in later, hopefully.

Option 3: Dictionaries and functions

I will reserve the term ‘object ‘to mean object in the full Object-Oriented Programming sense: a thing with methods and (optionally) classes. But we don’t need to go all the way there. Javascript’s subset JSON (Javascript Object Notation), for instance, has ‘objects’ which don’t have methods – they are just sets of key-value pairs. And JSON isn’t the only example of these – most ‘scripting’ languages (Perl, Python, Ruby, Lua) have this data structure as well as most modern more traditional languages.

Oddly enough for something so ubiquitous (although before the 2000s, you didn’t really see many languages providing these as a standard built-in data type – somehow the more complicated objects got there first), this structure goes by many names in computer science. It’s sometimes called a hash, a map, an associative array, or a dictionary. I’ll call it a dictionary.

Anyway! We should surely be able to make a universal data space out of dictionaries… right? Just link them together, figure out some way of indicating a far pointer, and we’re ready to go.

But again – and this is more than a little surprising – it’s not quite enough! Let’s look at why, because this is a controversial claim to many working programmers (who aren’t deeply interested in, for example, the Semantic Web) and so it needs some unpacking.

The kind of information I want to store in a Memex type system turns out to be very similar to logical statements.

That is, I want to store small chunks of information in the form of something like sentences:

Cats are funny.
The capital of North Korea is Pyongyang.
It is raining today in New Zealand.
Today, North Korea test-launched an ICBM
In response to (the event above), South Korea test-launched an ICBM.
It is both raining and sunny today in New Zealand.

Okay. So let’s see how we could store these statements in an object. Perhaps something like this, using a JSON syntax but just stripping out the quotes for clarity:

{cats: funny}

Looking good so far.

{capital-of: {North-Korea: Pyongyang}}
{North-Korea: {capital: Pyongyang}}

This seems promising! We even have two ways of posing a statement: relation first, or subject first. The subject-first one is particularly interesting because it maps neatly to the object-oriented way of thinking about the world, which we already understand quite well from programming. Even better, it lets us group statements together by object: so everything we know about North-Korea would go into one dictionary.

Let’s try some more then:

{on-date: {20170704: {New-Zealand: {weather: raining}}}}

This also isn’t too bad. It encodes fine, although it’s rather noisy – all those close-brackets are a bit intimidating. We could bunch up a bunch of them together to get, say, news of the day:

{on-date: {20170704: {North-Korea: {launched: icbm} , New-Zealand: {weather: raining}}}}

This encodes fine, although it’s rather noisy – all those close-brackets are a bit intimidating.

{in-response-to: {on-date: {20170704: {north-korea: {launched: icbm}}}}: south-korea: {launched: icbm}}

This one is interesting. It’s a perfectly sensible logical statement, but that’s not valid JSON. It turns out this is a problem with JSON specifically, not with dictionaries as such: JSON will only let you use a text string as a key, not a dictionary itself. Other systems like Lua don’t have this restriction.

If we’re going to be storing logical statements, though, it’s going to be really really important that we can have dictionaries with keys that are arbitrary dictionaries. Because in the real world we’re going to have a lot of statements about other statements. We might have a program and want to describe its data type; or we might have a name that’s a compound name and we want to describe that name as a dictionary. If we restrict ourselves to keys that are only text, we lose almost all of the power of computing and of logic. Probably best not to do that, at least not right away in our fundamental data grammar.

Many filesystems do have this ‘paths must be text strings only’, restriction, so that’s another strike against filesystems.

We can use Lua, then, or any language with a ‘real’ dictionary data type. But, sadly, it gets worse. Let’s try that final statement:

{on-date: {20170704: {New-Zealand: {weather: raining, sunny}}}}

Whoops! That’s not only not valid JSON, it’s not valid dictionary semantics. Let’s see that spectacular failure again, in closeup and slow motion.

{weather: raining, sunny}

What is that? It’s not a dictionary. It’s what we get if we try to take the union of these two statements (that is, assert that they are both true, if we take them as being logical statements). Individually, they each are sensible dictionaries:

{weather: raining}
{weather: sunny}

But put together – not only do we not have a syntax for it, but we don’t have a semantics. We broke our data model. We broke programming’s favourite data model.

We’ll come back to this later, because it’s important. But that case – one key with multiple values – is in short is why we can’t use dictionaries – and why dictionaries are, in fact, not a sufficient data type on which to build programming or data representation.

But guess what? It gets even worse. Functions are dictionaries – as in, sets of key-value pairs, just ones that might be infinite sets, and which are computable – so we broke functions too. We have discovered that there is structured data that can’t be represented in function format. Now that’s fairly serious. Isn’t all of computing built on functions?

Well, most of it is. Most of post-1990s computing, that is. But not quite all.

Option 4: Relational databases

All of computing is built on functions? But no. Functions are just a special case of a much more fundamental mathematical structure – relations. And if you’re an ordinary working programmer, you’ve met these before – in SQL databases.

Relations are kind of awkward things. They have roots in formal logic, but weren’t really formalised until the 20th century, and even then, done in multiple ways.

Functions are a special case of a special kind of relation: binary relations. That is, a relation between two things. You can think of it roughly as a function that can have multiple values. For example, square root, which we think of as a function, is actually a binary relation because it has two return values (we usually just throw away the negative one).

There’s a whole bunch of maths about binary relations, so that’s great! They’re well known and understood. Lambda calculus, the core of modern functional programming theory, doesn’t work with binary relations, but still, how hard could it be to make work?

Hard enough, maybe. But what computing generally gives us isn’t binary relations. Instead we have finitary relations (relations of many places) – and in fact we don’t even have those. What we have in our SQL database engines is Dr Codd’s relational algebra.

Relational algebra is nice in a lot of ways, but restricted in a lot of others. One way it’s restricted is that it divides your data into relations (what databases call tables), but each entry (row) in each relation must be the same length. That’s a pretty serious restriction, and means relations can really only describe very regular, highly processed and highly structured data. Great for corporate databases, but not so wonderful for what you find on the Web, or what might be in a personal organiser.

In fact, you couldn’t store that set of sentences earlier in a single Codd relation – because they’re different lengths.

Another serious restriction, which isn’t in Codd’s relational algebra itself, but in SQL databases built on it, is that you can’t generally nest relations inside relations. That would count out our example of the ‘in-response-to’ statement, as well as almost every other case in computer science of putting things inside other things. For example, you couldn’t model a database itself except as a relation of relations.

So if we were to use a relational database, it would have to either be one not based on SQL, or try to find some way of wrangling some more flexibility out of the existing SQL databases we have.

Which brings us to:

Option 5: Graphs and triples

Since the 1990s, a lot of work on storing things like logical statements has been inspired by the Semantic Web idea. And it seems like a good idea! Led by Tim Berners-Lee, the guy who invented the Web itself, the Semantic Web looks like what we want: a way of pointing at small pieces of data.

And yet.

The Semantic Web concept is based on the graph – (based on the mathematical realm of graph theory, more specifically graphs with labelled edges, but stored as a large Codd relation – and the graph is built out of a vast number of triples. A triple is a three-place logical statement:

(subject predicate object)

You might already be seeing a problem, because those statements earlier didn’t quite look like three-place relations. But let’s try it. In a fictitious graph triple syntax which is just listing them inside parentheses, and assumes that we can have a triple itself as an entry:

(cats are funny)
(North-Korea capital Pyongyang)
(20170704 on-date (New-Zealand weather raining))
( 20170704 on-date (North-Korea launched ICBM))
((on-date 20170704 (North-Korea launched ICBM)) in-response (South-Korea launched ICBM)
(on-date 20170704 (New-Zealand weather (raining and sunny))

We’ve done a lot better! We actually can get most of our statements naturally into triple form. We had to wodge things a bit though with our ‘(X and Y)’ triple, and we’ve lost a bit of flexibility around which way round ‘on-date’ and ‘in-response/in-response-to’ should go. But it sorta kinda works.

This is where most of the Semantic Web and Big Data is right now. Lots and lots of triples, inside relational databases, with custom query systems on top.

It looks like, finally, we’ve found our ultimate data model.

But there’s a small problem and a big problem.

The small problem is right at the top: ‘cats are funny’. To get this into triple format, we had to expand a simple pair into a triple. This doesn’t look like a problem yet, but it becomes a medium-sized problem if we want to represent a very old, very simple data structure: lists.

Lists aren’t made out of triples. They’re made out of pairs. But we have no way to natively represent a list in triple form.

Guess what is a list? That list of triples itself, right there, that’s a list.

(Or, if you’re a bit more particular – and we will be later – a set, which might not quite be the same as a list, but you might want to represent as a list.)

To represent a list in triple-based systems you’d need to add an extra thing in the middle, so eg

(a b c)

becomes

(a . (b . (c . ()))

You might recognise this as good old S-expression dot notation. The kind that you don’t use because it has a more terse, clean syntax that everyone uses. But if you’re using triples you can’t use that. Or you have to roll your own, special syntax, just for lists – and now you’ve kinda broken your data model, because you’ve got special syntax, and you wanted to have a universal data model to get away from having special cases for everything.

In computer science (especially programming language geek) terminology, by committing to triples we’ve thrown away a very important, very useful property: we’ve abandoned homoiconicity.

What does that mean? Well, for one, if you wanted to serialise your list

(a . (b . (c . ()))

as a list itself – say, if you care about syntax and you want to export your source code, or do macro-style operations on it – then it’s going to grow each time you do. Eg, if you serialise that, you’ll get

(a . (. . (b . (. . (c . (. . ()))))))

There are ways around it (for example ‘don’t try to put source code for triples in your triple store as triples’), or perhaps try to invent custom list-representation formats, but those ways start to seem unhelpful if we want everything on our desktop to be in our triple store as a single big graph.

If you’re using RDF (Resource Description Framework) things get even messier because you might then serialise those triples over XML or a half dozen custom syntaxes invented just to solve this problem of ‘how do I serialise triples?’

That’s the small problem. It’s annoying enough. It might though be a pointer, a symptom of something we can’t quite see yet.

But the big problem is that you might want to put your triples – or your sets of triples – somewhere… and that means you might want to have graphs of graphs, or relations containing relations… and although Codd relations can do that, most triple stores, or graph query languages, haven’t quite got there.

So all these triples sort of float around in the ether, or in lots of very large datastores. And we’d like to try to democratise things a bit, and have recursively structured ‘places’ to put things. That was the idea we started out with, remember? Of the notion of ‘place’ as an organising principle. Triples and graphs seem great, but they don’t seem quite as simple as they could be, they seem to have a sort of odd roughness of feel to them, and they don’t quite latch onto this idea of recursive confinement either.

A third problem which might be so huge it’s not even registering on our radar: triples (or Codd relations generally) don’t have a programming model associated with them. They’re purely a data storage model. And while we can sorta live with that, again, it should give us pause. Because we’ve now broken our computing world into two big separate chunks: data vs program, and that’s not really how we’d like to see things. What part of a spreadsheet is the ‘program’, for example? Something seems awry if we have to split our brain into two separate hemispheres like this.

What I’d like to propose next – after a few more sightseeing stops – is a very slight tweak to the relation/graph/triple model which might make some of these problems go away. And maybe unify ‘triples’ with the other models, too. But it takes us into somewhat uncharted territory.