2019/07/10

Literally awesome!

Ecstasy supports a rich set of literals -- too many to cover in a single post, so consider this the first installment. It's important to lay out, up front, why a language supports literals, and what the design goals are in doing so.

First, when building a language, literals are often terminal constructs, such that other things in the language can be composed of them.

Second, a literal allows an efficient, human-readable encoding of information. For example, for most of us, it is far easier to read the number 42 than something like: 

new Byte(false, false, true, false, true, false, true, false)

Ecstasy's design goals for literals are fairly straightforward:
  1. Common constant types supported by the core runtime library should have a literal form. Examples include: Bits, nibbles, bytes, binary strings, integers, characters, character strings, dates, times, date/times, time durations, etc.
  2. Common complex types supported by the core runtime library should have a literal form. Examples include: Tuples, arrays, lists, sets, and maps (aka dictionaries).
  3. Literal formats should emphasize readability, and the formats should be fairly obvious to a programmer.
  4. It should be easy to work with literal formats using only a text editor.
  5. Literals should make common programming tasks simpler, where possible.

Integers
A "whole number", or an integer, starts with an optional sign, followed by an optional radix indicator ("0b" for binary, "0o" for octal, or "0x" for hex), followed by the digits of the appropriate radix, with optional underscores between digits to group them as desired. The BNF is in the language specification, but the simple explanation above should suffice. Here are some examples:
0
-1
42
0xFF
0b10_1010_1010_1010_1010
12345678901234567890123456789012345678901234567890
So, what is the type of each of the above? A 32-bit "int"? A 64-bit "int"? No. Each of the above is an IntLiteral, a const class. Just think of IntLiteral as an object that knows how it should look on the screen, and simultaneously knows what values of various numeric types it can represent. The benefits are fairly obvious: support for arbitrary integer sizes (without weird type casting or literal suffixes like "L"), and support for other numeric types whose range may be far beyond that of any fixed-length integer type.
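
To make the format concrete, here is a rough sketch of the rule described above, written in Python purely as a modeling language. It is not how the Ecstasy compiler or IntLiteral is implemented; the names parse_int_literal and fits are hypothetical, and accepting a leading "+" is an assumption (the text only says "optional sign").

```python
import re

# Optional sign, optional radix prefix, then digits that may be grouped
# with underscores. This follows the prose description, not the official BNF.
_INT_LITERAL = re.compile(r"([+-]?)(0b|0o|0x)?([0-9A-Fa-f_]+)")
_RADIX = {"0b": 2, "0o": 8, "0x": 16}


def parse_int_literal(text):
    """Return the arbitrary-precision value of an integer literal."""
    match = _INT_LITERAL.fullmatch(text)
    if match is None:
        raise ValueError(f"not an integer literal: {text!r}")
    sign, prefix, digits = match.groups()
    if digits.startswith("_") or digits.endswith("_"):
        raise ValueError(f"underscores must sit between digits: {text!r}")
    radix = _RADIX.get(prefix, 10)  # no prefix means decimal
    value = int(digits.replace("_", ""), radix)  # rejects digits outside the radix
    return -value if sign == "-" else value


def fits(value, bits, signed=True):
    """Report whether value is representable in a fixed-width integer."""
    if signed:
        return -(1 << (bits - 1)) <= value < (1 << (bits - 1))
    return 0 <= value < (1 << bits)
```

With this sketch, "0xFF" parses to 255, which fits an unsigned 8-bit value but not a signed one, and the 50-digit example above fits no 64-bit type at all. That is the kind of question a literal object can answer once it is no longer tied to a single fixed-width type.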

Characters
A character is a single-quoted Unicode code point, with predictable support for escapes using the backslash. When a code point in the range up to U+FFFF needs to be written as an escape, the format \u1234 can be used; beyond that range, the format \U12345678 can be used. Here are some examples:
'a'
' '
'\''
'\t'
This literal type is implemented by Char, a const class.
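
As a rough illustration of how those escapes resolve to a single code point, here is a small Python sketch. It is a model of the rule, not Ecstasy's lexer: decode_char_literal is a hypothetical name, and the escape table is an assumption beyond the escapes that actually appear in this post (\', \", \t and \n).

```python
# Single-letter escapes. Only \' \" \t and \n appear in this post;
# the remaining entries are assumed for illustration.
_SIMPLE_ESCAPES = {"'": "'", '"': '"', "\\": "\\", "t": "\t", "n": "\n", "r": "\r"}


def decode_char_literal(text):
    """Turn a quoted character literal such as 'a' into its one code point."""
    if len(text) < 3 or text[0] != "'" or text[-1] != "'":
        raise ValueError(f"not a character literal: {text!r}")
    body = text[1:-1]
    if not body.startswith("\\"):
        if len(body) != 1 or body == "'":
            raise ValueError(f"expected exactly one character: {text!r}")
        return body
    escape = body[1:]
    if escape in _SIMPLE_ESCAPES:
        return _SIMPLE_ESCAPES[escape]
    # \u takes four hex digits, \U takes eight, matching the formats shown.
    if escape[:1] == "u" and len(escape) == 5:
        return chr(int(escape[1:], 16))
    if escape[:1] == "U" and len(escape) == 9:
        return chr(int(escape[1:], 16))
    raise ValueError(f"unknown escape: {text!r}")
```

Each of the four examples above decodes to exactly one character under this sketch; anything longer, or an escape it does not know, is rejected rather than guessed at.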

Strings
A (character) string is a double-quote enclosed sequence of characters, supporting the same escapes as are supported for character literals. Here are some examples:
""
"Hello, world!"
"This is an example of \"quotes\" inside \"quotes\""
"Multiple\nlines\nof\ntext."
Multi-line strings are freeform, which means that character escapes are not processed; Unicode escapes, on the other hand, are supported, because they are handled by the earlier "lexer" stage of the compilation. Multi-line strings use a hard left border, defined by the "pipe" ("|") character; the first line of a multi-line string begins with a back-tick ("`") followed by a pipe. Here is an example:
String s = `|This is a test of
            |a "multiline" string
            |containing | and \ and ` and ' and " etc.
            ;  // <--- look at this
Like an end-of-line comment, the multi-line string takes everything from the pipe to the end of the line, as-is, which is why the semicolon in the example above has to be placed on the following line.
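
The "hard left border" rule is simple enough to sketch in a few lines of Python. This is only a model of the behavior described above, not the actual lexer: read_multiline_literal is a hypothetical name, joining the lines with "\n" is an assumption, and the Unicode-escape pass mentioned earlier is left out.

```python
def read_multiline_literal(lines):
    """Collect the text of a multi-line string from its source lines.

    The first line must contain the back-tick + pipe opener; every following
    line whose first non-blank character is a pipe continues the literal.
    """
    first = lines[0]
    start = first.find("`|")
    if start < 0:
        raise ValueError("first line has no `| opener")
    parts = [first[start + 2:]]  # everything after the opener, as-is
    for line in lines[1:]:
        stripped = line.lstrip()
        if not stripped.startswith("|"):
            break  # e.g. the line holding the semicolon
        parts.append(stripped[1:])  # everything after the border pipe, as-is
    return "\n".join(parts)


example = [
    'String s = `|This is a test of',
    '            |a "multiline" string',
    "            |containing | and \\ and ` and ' and \" etc.",
    '            ;  // <--- look at this',
]
text = read_multiline_literal(example)
```

Here text holds three lines, and the pipe, backslash, back-tick and quotes on the last of them come through untouched, because nothing after the border pipe is interpreted.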

A template allows a string to be formed dynamically from any valid expression. The format of the template string is the same as a normal string, except prefixed by the dollar sign ("$"); expressions inside the string are prefixed by dollar-sign + open-curly ("${") and suffixed by close-curly ("}"). Here are a few examples:
$"Hello, ${name}!"
$"2 + 2 = ${2 + 2}."
$"Finished in ${timer.elapsed.milliseconds}ms."
$"Finished in ${{timer.stop(); return timer.elapsed;}}"
Making up good examples is challenging, but templates are handy, and we already use them all over the place. The last example is quite interesting, in that it shows a statement expression (syntactically, a lambda body) inside of the template expression.
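
One way to picture what happens to a template is to split it into literal text and embedded expressions. The Python sketch below does that for the "${" ... "}" form described above, counting braces so that a statement expression like the last example stays in one piece. It is an illustration under stated assumptions, not the compiler's algorithm: split_template is a hypothetical name, and braces inside nested string or character literals are not handled.

```python
def split_template(body):
    """Split a template body into ("text", ...) and ("expr", ...) parts."""
    parts = []
    literal = []
    i = 0
    while i < len(body):
        if body.startswith("${", i):
            # Scan to the matching close-curly, tracking nested curlies.
            depth = 1
            j = i + 2
            while j < len(body) and depth > 0:
                if body[j] == "{":
                    depth += 1
                elif body[j] == "}":
                    depth -= 1
                j += 1
            if depth != 0:
                raise ValueError("unterminated ${ expression")
            if literal:
                parts.append(("text", "".join(literal)))
                literal = []
            parts.append(("expr", body[i + 2:j - 1]))
            i = j
        else:
            literal.append(body[i])
            i += 1
    if literal:
        parts.append(("text", "".join(literal)))
    return parts
```

For "Hello, ${name}!" this yields the text "Hello, ", the expression name, and the text "!". For the last example above, the whole "{timer.stop(); return timer.elapsed;}" block comes back as a single expression.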

Templates can also be used with multi-line strings, which is denoted by using a dollar sign instead of the opening back-tick:
String s = $|# TOML doc
            |[name]
            |first = "{person.firstname}"
            |last = "{person.lastname}"
            ;
Finally, if the string you need to glue into your code is too big and ugly to put into the source file, then don't. Just stick it in its own file in the same directory; for example, in a file named "ugly.txt":
String s = $./ugly.txt;
Yeah. That was easy.

The literal type for all of these forms of string is implemented by String, a const class.

Arrays
An array literal is a square-bracket enclosed list of values.

Here are some examples:
[]
['a', 'b', 'c']
[1, 2, 3]
This literal type is implemented by the Array class, which is variably mutable: Array literals are either Persistent (if they contain any values that are not compile-time constants) or Constant (if they contain only compile-time constants).

Summary
This was just a brief introduction to literals in Ecstasy. Each of these literal forms has many more rules than we covered here, but those rules are there to allow for more expressiveness (readability) in the source code, and not to restrict it. The forms for these literals are designed to make them super easy to write and very pleasant to read.

The rules do make the lexer and the parser more complex, but we look at it this way: The compiler only has to get written twice (one prototype to bootstrap the language, and then the real one written in natural Ecstasy code), so no matter how much work it is to make the language easier to use, we get to amortize that cost across many, many users over many, many years.

Literally.

2 comments:

  1. I like the multiline string starting with |, and the many usage of the $ to denote the many flavors of string literals. The statement expression is great as well.

    1. We spend a lot of cycles evaluating different approaches, and we spend a lot of time looking at the code from an aesthetic point of view. We know that "code gets written once, and read many times"!
