Dataspace 6: Terms as Types

Prelude:
Dataspace 0: Those Memex Dreams Again
Dataspace 1: In Search of a Data Model
Dataspace 2: Revenge of the Data Model
Dataspace 3: It Came From The S-Expressions
Dataspace 4: The Term-inator

I want a Memex. Roughly, I want some kind of personal but shareable information desktop where I can enter very small pieces of data, cluster them into large chunks of data, and – most importantly – point to any of these small pieces of data from any of these chunks.

‘Pointable data’ needs a data model. The data model that I am currently exploring is what I call term-expressions (or T-expressions): a modified S-expression syntax and semantics that allows a list to end with (or even simply be, with no preceding list) a logical term in the Prolog sense.

I’m not a huge fan of type theory as it currently exists in functional programming languages such as Haskell. Type theory seems, to me, to be merely an application of logic – and I think what we’d find much more useful than a type-inference engine that only runs at compile time, is a general logical inference engine like Prolog that operates on general logical terms, and can operate at runtime.

(Because on the Internet, it’s always runtime. There is no ‘compile time’ where you can escape the entire network; all you can do is pass the output of one program into another via sending and receiving data. At some point, we need to start thinking about ‘types’ as merely syntactic transformations of data, or logical statements (themselves pieces of data) made, inferred and proved about data. To handle the creation of new types at runtime, we need functions that can take types – or structured data containing types – as arguments, and return types as values.)

But given a generalised logical term structure like term-expressions, we can markup any piece of data we wish as a term in a very lightweight manner. And we can interpret such terms as something like a type-annotated piece of data. This means that we could implement any number of different type systems at runtime, using terms to represent typed data.

One of the interesting side effects of generalised term-expressions is that we can reduce even what we’d normally think of as atoms – eg, numbers and strings – to structured terms. For example

ABC

might simply be a syntactic representation of a term which is actually, at a low level,

(/string 65 66 67)

or – perhaps- where even ‘string’ is a number, so it might be

(/01 65 66 67)

This solves a long-standing problem that Prolog has, where its native ‘string’ representation is literally a list of ASCII integers. Eg:

[65,66,67]

But this is ambiguous: if you don’t know that this list of integers is meant to be a string, then you can’t tell what it is. It just looks like a list, because it is a list.

In Prolog terms, you could represent every occurrence of a string as a term:

string([65,66,67])

but syntactically this adds complexity. Showing the two side by side as term-expressions, we can see the difference:

(/string 65 66 67)
(/string (65 66 67))

One layer of parentheses might not seem like much, but over a large program or dataset, every layer of unnecessary complexity adds up. More importantly, unnecessary complexity means we can’t clearly express our data model in its simplest form.

In term-expressions, we can see that

(1 2 3 . A)

could be the same as

(1 2 3 /string 65 66 67)

and logically equivalent to

(1 2 3 /all A)

(We can in fact end a list with a term in Prolog using the vertical bar operator, so

(A . B)

with the above definition of ‘string’ as a term can map to Prolog as

[string([65]) | string([66])]

but as you can see, this is syntactically noisy – three levels of parenthesis/bracket nesting. Compared to the term-expression equivalent:

((/string 65) /string 65)

which is much more pleasant for humans to read, as well as potentially more efficient for a machine to parse.

What we strictly can’t do in Prolog is create terms that end in other terms – because the body of a term is not considered to be a list, but a tuple.)

What we could do with strings we could do with numbers, characters, anything else that we can construct out of, ultimately, pointers, integers and a term marker:

(/int 123)
(/float 6 3145124)
(/complex 12 34)
(/date 1999 12 31)
(/vector 1 2 3 4 5)

And this ability to insert terms easily into list structure seems like it would do away with a large number of Lisp and Scheme reader macros. Instead of transforming input at read time, simply establish the simplest possible syntax, and use terms to represent more complex data types.

Although there’s a loose equivalency between terms and types, one thing that conventional type systems give us that Prolog-like term structure doesn’t is a way of attesting that a given term (or object) is in fact a well-formed member of its type. Well-formedness may also depend strongly on context. How to handle this in a network that could span the Internet is an interesting problem, and one I haven’t yet attempted to solve.

But probably the direction to go in would be to establish some in-memory representation for ‘sealed’ terms which have been validated by an authoritative source. This would be the rough equivalent of an object which was produced by a valid object constructor. Such terms might have a canonical serialisation format, but it would be as an ‘unsealed’ variant; unsealed terms, on import, would have to be processed by a validation function before they were released to other parts of the system.

This could be a hive of security issues, so would need to be very carefully thought through. At some point a trust border would need to be established, beyond which guarantees of typing are ignored and incoming data required to be re-validated. Remembering that ‘trust border’ in a cloud computing and virtual machine environment could mean that your entire contents of RAM might be subtly altered at any time.