Prelude:
Dataspace 0: Those Memex Dreams Again
Dataspace 1: In Search of a Data Model
Dataspace 2: Revenge of the Data Model
Dataspace 3: It Came From The S-Expressions
I want a Memex. Roughly, I want some kind of personal but shareable information desktop where I can enter very small pieces of data, cluster them into large chunks of data, and – most importantly – point to any of these small pieces of data from any of these chunks.
To get ‘pointable data’, I want a data model that allows me to embed pointers anywhere, but keeps the simplicity and clarity that, eg, RDF triples and XML/HTML don’t have.
The data model that I am currently exploring is what I call term-expressions (or T-expressions): a modified S-expression syntax and semantics that allows a list to end with (or even simply be, with no preceding list) a logical term in the Prolog sense. In term-expressions we simply represent a term as a list whose head is the designated term marker – in this case, I’m using () to indicate a list and / as the term marker. We can have the dot, as in S-expressions, or we could skip it entirely (which gives us some interesting expanded semantics for storage that I’ll talk about later).
I’ll tweak my syntax a little, and say that /, like ( and ), is treated specially by our character parser, so that
/head body
is the same as
/ head body
Here for instance are the test sentences we started with:
Cats are funny.
The capital of North Korea is Pyongyang.
It is raining today in New Zealand.
Today, North Korea test-launched an ICBM
In response to (the event above), South Korea test-launched an ICBM.
It is both raining and sunny today in New Zealand.
and term-expressions now give us some slightly expanded options for representing them in ways that are both clear for a human to read and precise for a machine. We could, for example, just literally translate our tentative Prolog terms:
funny(cats)
capital(northkorea,pyongyang)
ondate(20170704,weather(newzealand,raining))
ondate(20170704,launched(northkorea,icbm))
inresponseto(ondate(20170704,launched(northkorea,icbm)),launched(southkorea,icbm))
ondate(20170704,weather(newzealand,and(raining,sunny)))
into term-expressions as:
(/funny cats)
(/capital northkorea pyongyang)
(/ondate 20170704 (/weather newzealand raining))
(/ondate 20170704 (/launched northkorea icbm))
(/inresponseto (/ondate 20170704 (/launched northkorea icbm)) (/launched southkorea icbm))
(/ondate 20170704 (/weather newzealand (/and raining sunny)))
But because we have a much less noisy ‘list’ syntax, and (at the moment) no restrictions on things like variable names, we suddenly have some expanded options for cleanly representing tokens like names, or dates. We could do the following:
(/funny cats)
(/capital (North Korea) (Pyongyang))
(/on-date (2017 07 04) (/launched (North Korea) ICBM))
(/in-response-to
(/on-date (2017 07 04) (/launched (North Korea) ICBM))
(/launched (South Korea) ICBM))
(/on-date (2017 07 04) (/weather (New Zealand) /all (raining) (sunny)))
okay wait what was that last one? What’s /all?
Here’s where it gets interesting.
Because term-expressions allow us to end a list with a term (which, remember, you can’t do in standard Prolog term structure) – we suddenly have an extra degree of freedom, which we can use in multiple ways.
/all means ‘what follows is a set, expressed as a list. Interpret all of the above as completions of this term’
That is: we simply choose to define the semantics of ‘/all’ to mean, ‘the logical union of all the entries in the list’, such that the two logical statements
a
b
can equivalently be represented as the single logical statement
(/all a b)
and similarly, the logical statements
(a b)
(a b c)
(a b d)
can be represented by the single logical statement
(a b /all () (c) (d))
and so on.
This is useful because it lets us gather together sets of logical statements into their simplest form, avoiding duplication of terms, and keeping related terms together in a hierarchical structure. It also lets us create dictionary-like structures – which, without the ability to end lists with terms, we simply couldn’t express as logical terms. But now we can.
For example, here’s one way you might express,say, a person-details dictionary structure in term-expressions using /all:
(/all
(name Clark Kent)
(employer Daily Planet)
(friends /all (Lois Lane) (Jimmy Olsen))
)
So this is a setlike structure containing key-value pairs. It’s almost exactly like a dictionary – except that each key can, if we introduce ‘/all’, have multiple values.
(One of the odd things about /all is that, if we trace out the semantics of it – at least in the way it seems most natural to use – it doesn’t quite work like a set but like a multiway logical ‘and’. There’s some weirdness to that which I’ll explore later. For now, let’s think of it as essentially a set, but do remember that it’s not literally a set.)
Just using pairs and /all, we could create the semantics of JSON – but we can also expand that semantics from dictionaries/functions to true relations.
For example, if we chose, we could express those earlier sentences in something like the dictionary form we originally tried, but now using /all:
(/all
(cats funny)
(capital (North Korea) (Pyongtang)
(on-date (2017 07 04) /all
(weather (New Zealand) /all (raining) (sunny))
(launched (North Korea) ICBM))
(in-response-to (on-date (2017 07 04) (launched (North Korea) ICBM)) (launched (South Korea) ICBM))
)
Since /all lets us combine entire relations (or sets of logical statements) into a single statement, we can then do another thing that is very difficult to imagine in standard Prolog term structure or in SQL relations: we could prefix an entire relation (for example, a dump from a SQL server) with a list that represents a logical statement that might be a global context. For example:
(dns (myserver mycompany com) mydb /all
(table1 /all
(1 data1)
(2 data2)
(3 data3))
(table2 /all
(a dataa)
(b datab)
(c datab)))
This single logical statement (or fragment of a logical statement) could represent the data contents of an entire database within a server in a DNS namespace. Eg, it is the logical union of the following individual statements:
(dns (myserver mycompany com) mydb table1 1 data1)
(dns (myserver mycompany com) mydb table1 2 data2)
(dns (myserver mycompany com) mydb table1 3 data3)
(dns (myserver mycompany com) mydb table2 a dataa)
(dns (myserver mycompany com) mydb table2 b datab)
(dns (myserver mycompany com) mydb table2 c datac)
We wouldn’t want to – nor would we be able to – model an entire network as a set of those long individual ‘table rows’ or logical statements. There’d be huge amounts of space wasted by the prefix at the start of each statement, and we don’t have any concept of statements ‘being contained by’ any other statement. This is the situation we have in ordinary First Order Predicate Logic. This is how Prolog models its world of data: a single flat namespace inhabited by logical assertions.
But /all , just by using the syntactic trick of ending a list with a term, suddenly gives us a spacelike operator which is still just an ordinary logical term. You can see how entire ‘tables’ could be nested inside ‘databases’ inside ‘servers’ inside ‘DNS domains’ – or any other spacelike, setlike containment abstraction that we can imagine for our data – simply by using this one operator /all to represent ‘sets of lists’. Using recursive structures like this, it starts to seem possible that we could represent an entire network as a single logical statement.
Obviously a list would be very slow to search, so while /all might work as a data-transport format, it wouldn’t be appropriate to use as an in-memory representation. But we could imagine having, for example, a /tree term which represents the same data, but as a sorted binary tree. It might now be possible to think about logical terms which exactly correspond to the in-memory contents of machines.
You could get a lot of the capability of term expressions with Prolog term structure. But the extra degree of freedom that term expressions adds would be the equivalent of allowing a Prolog list that ends in a term – or a Prolog term that ends in another Prolog term.
/all is one example of how we could exploit ‘ending a term in another term’ to easily (and in a recursively decomposable manner) represent concepts like ‘the logical union of multiple logical terms’. In Prolog, you could represent a set of logical terms (for example, the current Prolog database) with a list of terms, or a term such as ‘and’ – but it wouldn’t easily be recursively decomposable (ie, the composition of terms happens only once, at the top level, which becomes ‘special’) . Because it’s not recursively decomposable, such a term doesn’t have the spacelike quality of related data being clustered together that we can get from objects or dictionaries.
However, it’s at this point I start wondering just what kind of logic would describe such ‘logical statements’ that end in other statements (rather than a well-mannered ‘end of list’ terminator like ‘nil’) . By ‘what kind of logic’, I don’t necessarily mean exotic creatures like higher-order logic – as with HiLog, it seems that we’re dealing here with something logically equivalent to First Order Predicate Logic (should we even get as far as defining logical operations).
But syntactically, we’re looking at configurations of symbols that – as with HiLog – the usual formulations of FOPL don’t model and therefore don’t allow or consider (because they’re not in the model). So while we might not be dealing with logical properties that aren’t expressible in FOPL, we are looking at expressions that aren’t cleanly and naturally expressible in most syntactic representations of FOPL. And a little bit of added syntactic freedom can make a huge difference to the clarity and efficiency of a representation, as I hope to show in some of these next posts.
One of the arguments underlying this whole program of exploration is that syntax is semantics: that is, both syntax and semantics are examples of patterns (in space and time). In programming and data representation, the fewer transformations of our syntax that we have to make, the fewer places we have to introduce ambiguity or lose data. If we arbitrarily limit our syntax, we also limit the semantics that we can represent cleanly and naturally with that syntax – and that means we limit opportunities for organising (or reorganising, without loss of clarity) our thoughts.
Mathematicians often focus purely on whether we can compute or express something. If the answer is not zero or infinity, a mathematician’s job is done. But for a programmer, the question is much more often: how clearly can we express something, and how efficiently can we compute it? And that’s where syntax becomes vitally important.
We might put it this way: if semantics is about ‘what symbols represent’, syntax is about ‘what symbols, and where do you put them’. But where you put something is, itself, a question with deep semantic significance – for example, is your data on a local server, or on a remote server over a link that is now unreachable because the network topology just changed due to an emergency or a financial or political upheaval or a rogue ICBM? And who might be listening in between, and where are they?
Syntax cares deeply about where your data is stored, and what shape it takes, and so should you.