Dataspace 0: Those Memex Dreams Again

Dataspace 0: Those Memex Dreams Again
Dataspace 1: In Search of a Data Model
Dataspace 2: Revenge of the Data Model
Dataspace 3: It Came From The S-Expressions
Dataspace 4: The Term-inator

Computing in the Internet age has a number of inspiring visions: legendary systems, some of which got built, some of which remained hypothetical, “dream machines”. Among them are Vannevar Bush’s Memex (1945), Ted Nelson’s Xanadu (1960), J. C. R. Licklider’s Intergalactic Computer Network (1963), Douglas Engelbart’s NLS (1968), Alan Kay’s Dynabook (also 1968), and William Gibson’s Cyberspace (1982).

These visions serve to anchor our ideas about what’s possible and how we might achieve it.

This is not one of those.

It is, however, a very rough sketch of an idea about what a future computing system might look like. I don’t know how to get from here to there, or even if ‘there’ is entirely satisfactory. But I feel that a ‘there’ roughly in this vicinity is somewhere we should be heading towards.

Let’s start with what the ‘here’ is that is less satisfactory.

We currently have an Internet made of vast layers of complexity layered on each other; software layers going back to the 1960s at the very latest, built on traditions and workflows originated in the 1950s. Our current model of deploying computing services, ‘the cloud’, thinks nothing of *simulating entire computers* – with gigabytes of RAM and hundreds of gigabytes of disk – on other computers, just to get one service that listens on one TCP/IP port and sends a few bytes in response to a few other bytes.

The operating system inside these simulated computers-on-computers then consists of, essentially, an entire simulated computing department from the 1950s: a bank of clerks operating card punches (text editors and Integrated Development Environments), other clerks translating these punchcards from high-level to low-level languages (compiler toolchains), machine operators who load the right sets of cards into the machine (operating systems, schedulers, job control systems), banks of tape drives (filesystems and databases), printers (web servers, UIs)… and a whole bunch of prewritten software card stacks (libraries, component object systems, open source projects).

This seems a bit less than optimal. If we didn’t have that pile of prewritten software that we must stay compatible with (but of course we do), and we were rebuilding computers from scratch in an Internet environment, we might do things a bit differently. A bit simpler, even.

Back around 1980, as interactive systems were starting to become commonplace and home computers were showing that they were the future, the idea of Objects started to catch hold, driven by Alan Kay’s Smalltalk. This seemed to give us the solution: replace the ‘computer’ with the ‘object’, and build everything out of objects. One unified paradigm to build an entire network from.

We didn’t do that.

Object-Oriented Programming did become popular – in fact it became so popular through the 1990s-00s that there were so many different object systems that almost none of them could communicate. Even as we built the Web, based not on objects but on the much older notions of documents and data packets, we fell backwards – from a vision of Compound Documents based on objects that could move between systems (anyone remember OpenDoc?), to today’s walled gardens of sealed-box ‘apps and services’, which only run on one company’s platform, don’t let you export data, and won’t even boot if the corporate server goes offline.

The object vision was pretty neat for 1980. Now that we’ve had a few years to play with it, I think we’ve also found that it has some serious limitations:

  • An object is opaque by design, for safety, simplicity and compatibility. You can’t look inside an object, there’s no standard protocol for ‘serialising’ or dumping an object into a standard form, and because of this you can’t guarantee that you can transfer an object between systems.
  • An object has side effects. If you send a message to an object, things change, somewhere in the universe – potentially any object it has a link to can change. Since an object is opaque by design, those links can point anywhere. You can’t know what part of the universe just changed from that message you just sent.
  • An object ‘message send’ is not in fact, usually, an actual message. It’s generally a function call, which is executed locally and waits, blocking a thread, until it’s finished. This means the programmer has to manage all those side effects manually.
  • Not everything in the object world is in fact made out of objects. Objects are compiled from source code, which is generally stored in files, not in objects. Objects are made of methods, which (generally) aren’t themselves objects. And the messages which objects send to one another aren’t themselves objects.
  • Most seriously: after 27 or so years with the concept, we still don’t have anything approaching a standard, formal definition of what an object is. Not like ‘function’, or ‘relational database’ for example, which started with a whole bunch of maths (though we often ignore most of it and use C and SQL instead, which have almost functions and almost relations, but not quite). An object seems to be sort of a sociological tendency rather than a mathematical theory.

(The 1990s are littered with distributed object systems that either failed spectacularly or got trapped in various tiny niches. See, for example, IBM’s System Object Model; Sun’s Distributed Objects Everywhere; and NeXT’s Portable Distributed Objects, which live on kinda-sorta in OSX.)

But there’s one thing that objects do give you: the idea that the virtual ‘giant computer’ of the network or Internet is something like a space. A space made out of ‘places’ (objects) which have ‘directions’ you can ‘travel’ between them (methods, or fields). You can chain those directions together to make a ‘path’. And that path tells you what you will ‘find there’: another object, representing a return value or collection of values.

This ‘space’ concept seems to be fairly powerful, both for computers to use to do processing and for humans to use to store data. We use it all over: in filesystems, for example, and in directories like DNS or LDAP, where we use a slightly different syntax from chains of object method calls, but have a similar sense that you ‘go to a place and find things’.
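This shared ‘path’ idea can be sketched in a few lines of Python. This is my own toy illustration, not taken from any real filesystem, DNS resolver, or object system: a chain of names resolved one step at a time through nested ‘places’, here modelled as plain dictionaries.

```python
def resolve(space, path):
    """Walk a '/'-separated path through nested dictionaries.

    Each step in the path is a 'direction' leading from one
    'place' to the next, as with directories or method chains.
    """
    node = space
    for step in path.strip("/").split("/"):
        node = node[step]
    return node

# A tiny hierarchical 'space' (hypothetical data, for illustration only)
space = {"home": {"alice": {"notes": "remember the milk"}}}

print(resolve(space, "/home/alice/notes"))  # remember the milk
```

The same walk works whether the steps are directory names, DNS labels read right-to-left, or field accesses chained with dots; only the syntax differs.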

What a filesystem doesn’t give you, though, is a standard method of computing a value that doesn’t already exist as literal data. And both filesystems and objects put constraints on what kind of data you can store: with filesystems, it generally has to be fairly large ‘documents’ (not, eg, small values of data), and with object systems it generally has to be certain well-formed types or classes of data. And there are some quite heavy restrictions on who gets to define what those types or classes are, and how that information gets updated.

In the Internet age we often find that we have data that doesn’t quite fit into the category of ‘document’ or ‘typed object’. We might have very small pieces of data – small strings, numbers, structures like lists or sets – and we might have a lot of them. We might find that they don’t quite fit into any standard notion of ‘type’. We might want to keep these pieces or collections of data on our personal desktop, or personal network, but might also want to import some from around the Internet, or share some to the Internet. And we might often want to do computation on them.

And when we do a computation, we’d like for it not to have side effects – because we’d like to not touch one piece of string on our desktop and have it bring down a server in China. We’d like to keep updating state separate from computing state, if possible.

This set of requirements, taken together, roughly equates to what I call ‘dataspace’:

  • something with a hierarchical, ‘spatial’ organisation like a filesystem
  • something that has semantics like a set, so you can have collections of data
  • something that’s not really strictly typed, or if it has types, they can be added in later – as more data – by anyone who looks at the system, using another piece of the same system to do it
  • something that can take both very small pieces of data or very large collections of very small pieces of data
  • something that, if you go to certain places in it, can compute data as a pure function, without causing side effects elsewhere
  • something that you can share pieces of with other people, or import pieces of it that they’ve shared
  • something that’s simple and flexible enough that you can build a page, or a small application, or a modern desktop, or a network, or maybe something as big as the Internet out of it, just by adding more pieces
  • something that’s mathematically well-defined at a low level, and has well-defined formats for reading and printing every kind of ‘object’
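A few of these requirements can be combined in a toy sketch. This is purely my own illustration of the bullet points above, assuming nothing about how an actual dataspace would be implemented: a hierarchical space whose ‘places’ hold either small literal values or pure functions that compute a value on demand, with no side effects.

```python
def get(space, path):
    """Look up a place in a nested-dict 'space'.

    If the place holds a pure function, evaluate it against the
    space and return the result; updating state (editing the dicts)
    stays separate from computing state (calling get).
    """
    node = space
    for step in path:
        node = node[step]
    return node(space) if callable(node) else node

# Hypothetical contents, for illustration only
space = {
    "prices": {"apple": 3, "pear": 5},               # very small pieces of data
    "total": lambda s: sum(s["prices"].values()),    # a computed place, side-effect-free
}

print(get(space, ["prices", "apple"]))  # 3
print(get(space, ["total"]))            # 8
```

Real object systems hide their state behind methods with arbitrary side effects; here the computed place is just a function of the space it lives in, so reading it cannot change anything elsewhere.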

Interestingly, it turns out that although we have a huge pile of software and data standards, we don’t really have anything like this! And so this is an exploration of what kinds of properties a system like this might need, and how we might go about getting them.