Dataspace 13: Quotes and Whitespace

Prelude:
Dataspace 12: AR3 to AR4, or [`] = []

I have a prototype parser for AR4, but it’s time to rewrite it.

But first, a couple of notes about the syntax that have become clearer to me, I think.

One of the things I’ve really wanted to get a handle on is how to do string quotation in a sane way. You’d think that quotes would be easy – just use " – but right there we have multiple problems:

  • if you just pick ", that’s okay, but now you have to work out how to quote the " character itself. Microsoft (in Visual Basic) went with the idea that "" inside "" means ", so for example, """" means " and """""" means "", which isn’t terrible, but maybe not as good as it could be.
  • okay, you think, so let’s add ' as well as "! So if you want a string with a ' in it you put it inside ", like "'", and if you want a string with a " in it, you put it in ', like '"'. This strategy is very popular in modern scripting languages, like JavaScript.

    But if you do this, now you have two problems: there are two magic characters, ' and ", each of which can break your string, not just one. Worse – a ' is a terrible character to make magic, because it appears very commonly in English words and in English surnames. Well, Irish surnames. And Pacific Island surnames. And Chinese surnames and… well, it’s just generally terrible. And " appears in “English speech”, too, which, again, is not a problem unless you happen to have the English language and people speaking the English language appear anywhere inside a string.
  • Well then, you think, I’ll just fix those two problems by introducing a third quotation form! I’ll add an ‘escape’ character, maybe \, which I can use for unprintable ASCII characters too, so e.g. \n means newline, \t means tab, \' means ', \" means ", \\ means \. Oh, we forgot Unicode. Well, no problem, \uxxxx means the Unicode character with four… wait, Unicode just extended its block range to FIVE digits? Back in the mid-90s? Oh. Well. Um. Okay, we’ll add a SECOND Unicode escape, \U{XXXXX} and….

    Oh, and don’t forget that Windows path names conventionally use \ instead of / , so depending on what operating system you’re on, you may see a lot of strings like \\server\share\path\file become \\\\server\\share\\path\\file , which is less than helpful, especially if that gets fed through that same quoting algorithm a few times.

    So then all this escaping is looking tricky and you ask, what if I just want to include a literal piece of a large string file inside my large string file, newlines and tabs and all, and I don’t want to deal with escapes?
  • So you think: I know! A fourth quotation form will sort THAT out! We’ll add some kind of multiline string, maybe with """ on its own line to start and end a raw string block and
  • wait but what happens if I want to include that string, """ inside that other string?
  • oh well we’ll add a fifth quotation form maybe like |====| and the number of equals signs will tell us how deep the rabbithole goes
  • oh and what if we wanted to do string interpolation while inside a string block, should we have the $ form for string substitution and maybe only allow that inside " strings not ' strings, so this maybe counts as a sixth quotation form
  • or maybe we want string interpolation inside our """ string blocks, so we’ll say maybe the ` character does interpolation, whether you want it or not (JavaScript, I’m looking at you) and okay it’s not strictly a seventh string form because $ forms only appear in one set of families of scripting languages (the ones descending from bash) while ` appears in another family
  • ANYWAY
  • at this point you might well be thinking “this is actually quite a lot of crazy, isn’t it? Like multiple recursive layers of crazy. Can’t we just do something much simpler?”

And yes, I think we can.

So here are some rules I’m thinking about to make strings a bit more tractable in this new syntax.

We don’t use single or double quotes at all. These are just ordinary characters. You can use them in string atoms.

Simple string atoms that don’t have a space or unprintable Unicode or a [ or ` or ] don’t need quoting.
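
As a rough sketch of that rule in Python (this is not the prototype parser; the name needs_quoting is made up for illustration, "unprintable Unicode" is approximated with Python's str.isprintable, and treating the empty string as needing quotes is my own assumption):

```python
def needs_quoting(atom: str) -> bool:
    """Return True if `atom` cannot be written as a bare string atom."""
    if atom == "":
        # Assumption: an empty atom would be invisible if written bare.
        return True
    for ch in atom:
        # Space and the three structural characters force quoting.
        if ch in " [`]":
            return True
        # str.isprintable() treats the ASCII space as printable, which is
        # why the space is checked separately above.
        if not ch.isprintable():
            return True
    return False
```

Note that O'Brien and say"hi" both pass through bare, since ' and " are ordinary characters here.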

If you need to quote, you use a “quotation sequence”, which is a series of one or more backticks (and only backticks) inside brackets. It looks like this:

[`]a quoted string[`]

or

[``]a deeply [`]nested string [``]

The reason this works is that [`] and [``] and so on are all equivalent to [] (because of the axiom we just introduced in AR4) and so any character sequence in this form is guaranteed to not be a real term-expression. So we can use it as a magic sequence.

Three characters for a quote which can handle arbitrary nesting – eliminating about four or five different quote forms – isn’t too bad, I think.
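
Here is a minimal Python sketch of how a writer and reader might handle quotation sequences. It assumes that a quotation is closed by the next sequence with exactly the same number of backticks as its opener, and that shorter or longer sequences in between are just content (that matches the nested example above, but it is my reading, not a settled rule). The names quote and unquote are hypothetical.

```python
import re

def quote(text: str) -> str:
    """Wrap `text` in the shortest quotation sequence it does not contain."""
    n = 1
    while "[" + "`" * n + "]" in text:
        n += 1
    delim = "[" + "`" * n + "]"
    return delim + text + delim

# An opener is one or more backticks (and only backticks) inside brackets.
_OPENER = re.compile(r"\[(`+)\]")

def unquote(source: str, pos: int = 0):
    """Read one quoted string starting at `pos`.

    Returns (content, position just after the closing sequence).
    """
    m = _OPENER.match(source, pos)
    if m is None:
        raise ValueError("no quotation sequence at position %d" % pos)
    delim = m.group(0)
    # Assumption: the closer is the next exact copy of the opener.
    end = source.find(delim, m.end())
    if end == -1:
        raise ValueError("unterminated quotation " + delim)
    return source[m.end():end], end + len(delim)
```

With this, quote("a [`] b") picks the two-backtick form, and unquote applied to the nested example above returns the inner text with its [`] left untouched.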

For escaping unprintable characters in strings, I’m thinking of a different approach: an expression which composes strings and escape sequences together. For example:

[`~s [`]string expression with [`] [t n _ n u1234] [`]weird characters in it[`]]

where lists represent a series of escapes – t for tab, n for newline, _ for space, and strings beginning with u for hex Unicode codepoints.
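
To make the idea concrete, here is a small Python sketch of evaluating such a term once it has been parsed. The parsed shape is an assumption: I represent the body of the ~s term as a Python list whose items are either literal strings or lists of escape names. The function names are hypothetical too.

```python
# Escape names taken from the description above.
SIMPLE_ESCAPES = {"t": "\t", "n": "\n", "_": " "}

def expand_escape(name: str) -> str:
    """Turn one escape name into the character it stands for."""
    if name in SIMPLE_ESCAPES:
        return SIMPLE_ESCAPES[name]
    if name.startswith("u") and len(name) > 1:
        # 'u' followed by hex digits: a Unicode codepoint of any length.
        return chr(int(name[1:], 16))
    raise ValueError(f"unknown escape: {name!r}")

def compose(parts) -> str:
    """Join literal strings and lists of escapes into one string."""
    out = []
    for part in parts:
        if isinstance(part, str):
            out.append(part)          # literal text, used as-is
        else:
            out.extend(expand_escape(name) for name in part)
    return "".join(out)

example = compose([
    "string expression with ",
    ["t", "n", "_", "n", "u1234"],
    "weird characters in it",
])
```

Because the codepoint is just hex digits after u, there is no need for separate four-digit and five-digit forms.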

(I’m figuring that heads beginning with ~ will be reserved for use by the parser/writer. There will be a number of these. But the point is that they are otherwise just ordinary term-expressions.)

There’s one more piece of syntax/parsing magic, I think, that may be useful enough that it’s worth building in: the ability to parse sections of code with ‘significant whitespace’. This will let term-expressions compete with formats like Markdown.

There are two term types that I think would let this work:

[`~~~
a set of
lines with whitespace
`~~~]

would be a format for delimiting whitespace-significant sections. (The trailing `~~~ isn’t strictly needed, but I think it would be useful for human-readability.)

If written out in a non-whitespace-significant way, for example if we wanted to send this by email or across social media that might destroy whitespace, this would appear as

[`~w `[]
a set of `~w `[]
lines with `~w _ _ _ _ `[] whitespace `~w]

The idea is that we translate whitespace as `~w terms – with the body containing the escapes for the whitespace characters, and if the body is empty it means a single newline – and the tail being the content after the whitespace. Using the term in tail place makes it fairly easy to skip without losing the semantics of the expression.

Whitespace-significance begins from the first occurrence of `~w in a term, and lasts only until the end of that term. This should stop it ‘leaking’ beyond places it was intended to be.
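
A very rough Python sketch of the translation, working on data structures rather than on the concrete syntax. Everything here is an assumption layered on the description above: a ~w term is modelled as a tuple ("~w", body, tail), a single space between two words is treated as an ordinary separator rather than significant whitespace, only space, tab and newline are handled, and the function names are invented.

```python
import re

NAME_OF = {"\n": "n", "\t": "t", " ": "_"}
CHAR_OF = {name: ch for ch, name in NAME_OF.items()}

def to_w_terms(text: str) -> list:
    """Translate whitespace-significant text into nested ("~w", body, tail) terms."""
    runs = []
    for m in re.finditer(r"[ \t\n]+|[^ \t\n]+", text):
        chunk = m.group()
        if chunk[0] not in NAME_OF:
            runs.append(chunk)                      # an ordinary atom
        elif chunk == " " and 0 < m.start() and m.end() < len(text):
            continue                                # plain separator between atoms
        else:
            # An empty body stands for a single newline.
            body = [] if chunk == "\n" else [NAME_OF[c] for c in chunk]
            runs.append(("~w", body))
    # Fold from the right so each ~w term holds everything after it as its tail.
    tail = []
    for run in reversed(runs):
        if isinstance(run, tuple):
            tail = [("~w", run[1], tail)]
        else:
            tail = [run] + tail
    return tail

def from_w_terms(items: list) -> str:
    """Inverse of to_w_terms: rebuild the original text."""
    out = []
    for i, item in enumerate(items):
        if isinstance(item, tuple):
            _, body, tail = item
            out.append("".join(CHAR_OF[n] for n in body) if body else "\n")
            out.append(from_w_terms(tail))
        else:
            if i > 0:
                out.append(" ")                     # restore the plain separator
            out.append(item)
    return "".join(out)
```

Since each ~w term sits in tail place, a reader that does not care about whitespace can drop the body and splice the tail back in, which is the "easy to skip" property described above.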

It’s a somewhat complicated idea, and it might be more trouble than it’s worth, but it opens the door to a whole lot of applications where we currently have ad-hoc formats. Wiki articles, for example, where line breaks are fairly important to preserve, but we would also like to be able to include arbitrarily complex structured data.

(By the way, trying to type some of these sequences in WordPress is really tricky. WordPress is trying too hard to be ‘clever’, and ends up distorting text, especially text that looks like code, and especially text with backticks in it, and even more especially text with whitespace in it. This is exactly the sort of situation that I’m trying to make term-expressions do the right thing with – or at least a reasonably simple and sane thing.)