2019/06/20

The Quest for Equality

It's shocking how difficult equality is to get right in software. What does equality even mean?

Equality used to be so simple in the days of assembly language, when "bitwise equality" was all that existed, and the only things that could be compared were fixed-size CPU registers. Things got more and more complex, until we ended up with Object Oriented languages, in which each class may have its own idea of what equality means.

On top of that, the most popular languages today each have multiple forms of equality. Python offers "is" versus "==". Java and C# offer "==" and "o1.equals(o2)" -- which can easily differ from "o2.equals(o1)" because it is neither symmetric nor transitive. Javascript has both an "is" and "==", and just keeps adding equals signs when new meanings are desired; so far, they're up to "===", but anyone who denies the likelihood of a future "=====" operator is just kidding themselves.

Obviously, equality is not a simple problem. Most of the existing solutions are broken, because they cannot provide either symmetric or transitive behavior. Many of these problems are the direct result of having multiple type systems within a single language; that helps to explain why many of these languages provide two forms of equality: One for the primitive "value" type system, and one for the object type (reference-based) system.

It's obvious that in an object type system, a class needs to be able to define equality for that class, replacing any superclass definition of the same, which means that the definition of equality is virtual, as in "virtual method invocation". It's also obvious that such a definition cannot be a single dispatch method, because then "o1.equals(o2)" may provide a completely different answer from "o2.equals(o1)". There have been languages that attempt to address this type of conundrum with multiple dispatch support, but by the same token, one can swat a fly with an atomic bomb.

Generic types further add complexity to the notion of equality, because each type parameter may contribute its own potentially conflicting notion of equality to the equation.

So what's the answer?

Intent. Intent matters. The developer's intent when they write a line of code is important, but somehow that intent manages to be erased by a language compiler. (It's no mistake that the term type erasure is used to describe a generic type system that is implemented by forgetting what those types actually were.)

Ecstasy captures the developer intent by capturing the compile time type of the references in question, and retaining that information in the compiled code, and using that information for hard problems like equality.

Consider the following example:
Collection<String> c1 = foo();
Collection<String> c2 = bar();
if (c1 == c2)
    {
    // ...
    }
It doesn't matter what the actual runtime type of the object is that c1 or c2 refers to, because the developer clearly explained that they are Collections of String objects. The class of the collection is unknown here at compile time, and instead the Collection interface is used, which defines equality in a certain way. In theory, those collections could contain objects of various sub-classes of String, but when those contents are compared, they will be compared as Strings. Because that was the intent of the developer.

To accomplish this, equality is implemented non-virtually; here is the equals function on Collection:

/**
 * Two collections are equal iff they are they contain the same values.
 */
static <CompileType extends Collection>
        Boolean equals(CompileType collection1, CompileType collection2)
    {
    // they must be of the same arity
    if (collection1.size != collection2.size)
        {
        return False;
        }

    if (collection1.sortedBy() || collection2.sortedBy())
        {
        // if either is sorted, then both must be of the same order;
        // the collections were of the same arity, so the second iterator
        // shouldn't run out before the first
        Iterator<CompileType.Element> iter1 = collection1.iterator();
        Iterator<CompileType.Element> iter2 = collection2.iterator();
        for (val value1 : iter1)
            {
            assert val value2 := iter2.next();
            if (value1 != value2)
                {
                return False;
                }
            }

        // the collections were of the same arity, so the first iterator
        // shouldn't run out before the second
        assert !iter2.next();
        return True;
        }
    else
        {
        return collection1.containsAll(collection2);
        }
    }
What makes this work is that the compile-time type is passed to the equals function. Not the name of the type. Not the "type value". The type. The actual, for-real, usable-as-a-type-name-in-your-code type. A formal type. (Yes, types are objects. But they're also types.)

So how did this function get called? Well, because the compiler determined, and retained, and used, the compile-time type. Both the "==" operator and the "!=" operator ultimately cause a call to this function to occur.

And what if, instead, the developer had specified these variables as being Lists?
List<String> c1 = foo();
List<String> c2 = bar();
if (c1 == c2)
    {
    // ...
    }
Again, it doesn't matter what the actual runtime types of those object are, because the developer clearly explained that they must be Lists of String, and the result of the "==" operator is a call to List:
/**
 * Two lists are equal iff they are of the same size, and
 * they contain the same values, in the same order.
 */
static <CompileType extends List> Boolean equals(CompileType a1, CompileType a2)
    {
    Int c = a1.size;
    if (c != a2.size)
        {
        return False;
        }

    for (Int i = 0; i < c; ++i)
        {
        if (a1[i] != a2[i])
            {
            return False;
            }
        }

    return True;
    }
Even more obvious is that the equals functions themselves simply use the "==" and "!=" operators on the values with the Collections or Lists, because the compile time type of those contents has not been erased, and was encoded into the compiled code, and was used in the call to the Collection or List equals function. And this works for arbitrarily deep nesting of generic types, because turtles.

And it's type safe.
And it's symmetric.
And it's transitive.
And it's readable.
And it's obvious.
And it's correct.

One more thing. What if you really want to know if two objects are actually "the same object"? In Ecstasy, as described previously, the reference (Ref) to an object contains the type and the identity of the object that is referred to. Sameness refers to that identity, and the equals function on Ref tests for the equality of that identity:

List<String> c1 = foo();
List<String> c2 = bar();
if (&c1 == &c2)
    {
    // ...
    }
... which is exactly what the Object.equals function does itself:
static <CompileType extends Object>
        Boolean equals(CompileType o1, CompileType o2)
    {
    return &o1 == &o2;
    }

2019/06/18

A pointed paean for C

Less than 20% of programmers claim to use C today, and the real number of programmers actually using C is likely far smaller, but C lives on quite pervasively in its influence on C++, Java, C#, and other languages. Starting with Java, most of the "managed runtime" languages purposefully omitted one of the most powerful features in C: The pointer. In its place, these languages provide a concept called a reference, which is a type-safe pointer whose value (a memory address) cannot be obtained or manipulated by the programmer.

Two important things were lost in the process, however:
  • The reference itself became opaque, in that its only capability in these languages is to be de-referenced; and
  • Pass-by-reference is no longer possible.
Ecstasy references, on the other hand, are themselves objects (because turtles), and because references are objects, references have references (because turtles). It may make your head hurt to picture this, but in use, it becomes the most obvious and simple concept imaginable. In Ecstasy, a reference is represented by the Ref interface:

A Ref represents a reference to an Ecstasy object. In Ecstasy, "everything is an object", and the only way that one can interact with an object is through a reference to that object. The referent is the object being referred to; the reference (encapsulated in and represented by a Ref object) is the object that refers to the referent.

An Ecstasy reference is conceptually composed of two pieces of information:
  • A type;
  • An identity.
The type portion of an Ecstasy reference, represented by the actualType property of the Ref, is simply the set of operations that can be invoked against the referent and the set of properties that it contains. Regardless of the actual operations that the referent object implements, only those present in the type of the reference can be invoked through the reference. This allows references to be purposefully narrowed; an obvious example is when an object only provides a reference to its public members.

The Ref also has a RefType property, which is its type constraint. For example, when a Ref represents a compile time concept such as a variable or a property, the RefType is the compile time type of the reference. The reference may contain additional operations at runtime; the actualType is always a super-set (⊇) of the RefType.

The identity portion of an Ecstasy reference is itself unrepresentable in Ecstasy. In fact, it is this very unrepresentability that necessitates the Ref abstraction in the first place. For example, the identity may be implemented as a pointer, which points to an address in memory at which the state of the object is stored. However, that address could be located on the process' program stack, or allocated via a dynamic memory allocation, or could point into a particular element of an array or a structure that itself is located on the program stack or allocated via a dynamic memory allocation. Or the identity could be a handle, adding a layer of indirection to each of the above. Or the identity could itself be the object, as one would expect for the simplest (the most primitive) of types, such as booleans, bytes, characters, and integers.

To allow the Ecstasy runtime to provide the same behavioral guarantees regardless of how objects are allocated and managed, how they are addressed, and how house-keeping activities potentially affect all of the above, the Ref provides an opaque abstraction that hides the actual identity (and thus the actual underlying implementation) from the program and from the programmer.

Because it is impossible to represent the identity in Ecstasy, the Ref type is itself simply an interface; the actual Ref instances used for parameters, variables, properties, array elements, and so on, are provided by the runtime itself, and exposed to the running code via this interface.
Ref is read-only; the read/write form is the Var interface, which extends Ref. To obtain a Ref or a Var, we use the C address-of operator, "&":

String str = "Hello world!"; 

// get a read-only reference to the variable
Ref<String> ref = &str;

// alternatively, get a read/write reference to the variable
Var<String> var = &str;

// modify the variable via a reference
var.set("Goodbye, cruel world!"); 

// that modified the value that is held in the variable!
assert str == var.get();

// which we can also see through the read-only reference
assert ref.get() == str;

The last concept to grasp is this: Objects are of a class, but references are of a type. In most OO languages, the object's class is its type, but one takes a different route when designing a language -- like Ecstasy -- to build portable, containerized, safe, and secure applications in the cloud, versus designing a language -- like C -- to build an operating system.

This concept is unusual coming from the C++ (vtable-based, compile-time types only) family of languages, but -- very importantly! -- this concept does not create any additional cognitive load for the application developer. What it does allow, though, is for a systems developer to dynamically and securely reduce the surface area of an object when sharing that object across a container boundary.

2019/06/17

An Introduction to the Ecstasy Type System

Ecstasy's type system is designed to be an obvious and predictable type system:
  • Ecstasy has a single type system, which is an object type system, and that's it. In other words, Ecstasy only has objects.
  • Unlike C# or Java, there is no secondary primitive type system with types like "int" or "boolean" that everything else has to be constructed out of. In Ecstasy, all types are built out of other Ecstasy types. For someone who has used Smalltalk, this concept should be familiar.
  • There is a single root called Object. For someone who has used C# or Java, this should seem quite familiar, with one little twist: In Ecstasy, Object is an interface, not a class.
  • The Ecstasy type system is called the Turtles Type System, because the entire type system is bootstrapped on itself, and -- lacking primitives -- solely on itself. An Int, for example, is built out of an Array of Bit, and a Bit is built out of an IntLiteral (i.e. 0 or 1), which is built out of a String, which is an Array of Char, and a Char is built out of an Int. Thus, an Int is built out of many Ints. It's turtles, the whole way down.
  • The type system is fully generic, and fully reified. This means that a List-of-Int is actually a List-of-Int, and not just a List with some compile-time syntactic sugar.
  • The type system is covariant. This means that an Array-of-Int is a List-of-Int is a List-of-Number is a List-of-Object is a List is an Object.
  • The type system is module-based, transitively closed, type-checked, and type safe.
  • Most type safety checks are performed by the compiler and re-checked by the link-time verifier. Most of the remaining type safety checks are performed by the linker when it transitively closes over the type system. The remaining type safety checks are performed at runtime, for the specific cases in which the types cannot be fully known until runtime.
  • A class has a type, but a class is not a type. (A class may have many types, but minimally its existence implies at least four types: structural, private, protected, and public.)
  • The type system explicitly supports actual immutability.
In software, making something simple is the hardest thing that we do. The Ecstasy Turtles Type System is proof of that, but the benefits are worth it. Using the type system is incredibly easy, and reading the code is a joy.



2019/06/13

Conditional Methods

There's a common pattern in software libraries, which is the conditional result. For example, a language may have an iterator type, something that has a method that returns a type T:

T next();

Now, if there is no "next item" to return, the iterator could return some other value, like -1 or null. Of course, in some cases, -1 and null are perfectly valid values, so this design turns out to be a fairly poor one.

Instead, the iterator type may have to represent the "no next item" as a separate method, which must be called separately:

interface Iterator<T> { // note the ugly brace placement
  boolean hasNext();
  T next();
}


Now the caller can iterate like:

while (iter.hasNext()) {
  T value = iter.next();
}

This could be simplified if a language supported more than one return value, like most languages support more than one parameter to a method:

(boolean, T) next();

The problem is what to return as a value when the boolean value is false. You could return a null value, but the cost to doing this is one billion dollars (payable to Tony Hoare).
one billion dollars

Ecstasy addresses this common challenge by using conditional return values. For example:

/**
 * An iterator over a sequence of elements.
 */
interface Iterator<Element>
    {
    /**
     * Get a next element.
     *
     * @return a tuple of (true, nextValue) or (false) if no elements are available
     */
    conditional Element next();
    }

The next() method is allowed to return True and an element, or False if there is no element. Consuming a conditional method is incredibly simple; here's a helper methods on the Iterator interface itself:

/**
 * Perform the specified action for all remaining elements in the iterator, allowing for
 * a possibility to stop the iteration at any time.
 *
 * @param process  an action to perform on each element; if the action returns true, the
 *                 iterator is considered "short-circuited", the method returns immediately
 *                 and no more elements are iterated over
 *
 * @return true iff the iterator was short-circuited; otherwise false if the iteration
 *         completed without short-circuiting
 */
Boolean untilAny(function Boolean process(Element))
    {
    while (Element value := next())
        {
        if (process(value))
            {
            return True;
            }
        }
    return False;
    }

There's one other thing, the underlying reason that the next() method can't return a "null", and that is because Ecstasy corrects the billion dollar mistake:

/**
 * The Nullable type is the only type that can contain the value Null.
 *
 * Nullable is an Enumeration whose only value is the singleton enum value {@code Null}.
 */
enum Nullable { Null }

That one line is the actual code that defines the concept of Null in Ecstasy. And here's what it means:

Object   o = Null;    // ok ... Null is an Object
String   s = Null;    // error
Int      i = Null;    // error
Object[] a = Null;    // error


In other words, Null is an object just like "Hello world!" is an object, and all of the type rules apply to Null just like they apply to any other object. (Among other benefits, you can kiss those NullPointerExceptions goodbye.)

It really is that simple.


2019/06/12

How to assert yourself more

Assertions in software are valuable. Coupled with automated testing and continuous integration (CI), assertions help to harden software, and in doing so, provide an early warning system for breaking changes.

But different assertions address different types of requirements. For example, some assertions should only be run while testing, because they are simply too expensive to be used in production. Other assertions are actually runtime checks, so they must always run, and must not be disabled. Lastly, assertions should be easy to write, even easier to read -- and when they fail, they should provide the necessary information to help a developer track down and understand the cause of the failure.

The first thing to know about Ecstasy assertions is that they are checked at runtime, "in production". For example, if it is illegal to proceed using a negative number, then this assertion would check and prevent that condition:

assert n.sign != Negative;

Conceptually, that assertion is similar to (but simpler than) writing code like:

if (n.sign == Negative)
    {
    throw new IllegalState(
        $"Assertion failed: n.sign != Negative, n={n}");
    }


Which would throw an exception with a text message like:

Assertion failed: n.sign != Negative, n=-1

An assertion handles all of that complexity automatically on behalf of the developer. Having detailed information -- the inputs to the assertion -- show up automatically when the assertion fails is invaluable for debugging a failure after the fact -- especially when a problem is not easily reproducible!

Assertions can also be fine-tuned to throw an appropriate exception:
  • assert   throws  IllegalState
  • assert:arg   throws  IllegalArgument
  • assert:bounds   throws  OutOfBounds
  • assert:TODO   throws  UnsupportedOperation
The syntax for an assertion condition is the same syntax used in an if or while statement,  so conditional declarations and assignments are naturally supported:

// the iterator must have at least one item left
assert String s := iterator.next();


A condition isn't even necessary if the assertion is being used to indicate that somehow execution reached some place that should have been impossible:

assert;  // just use "assert;" instead of "assert False;"

In each case, the syntax is intended to clearly convey the intent of the developer, with as little excess as possible for the reader of the code to ingest.

And what if the intent of the developer is to only have the assertion execute during testing, and not when the software is running "in production"? Fortunately, Ecstasy provides a simple way to enable an assertion only when testing:

assert:test checkReferentialIntegrity();

(Or only when debugging, by using assert:debug to trigger a breakpoint when the assertion fails.)

Similarly, if something is important to verify at runtime, but only needs to be checked the first time that the code is executed, then assert:once can be used:

assert:once configLoaded;

Lastly, if an assertion needs to be occasionally tested at runtime, but is too expensive to check every time -- such as when the assertion occurs inside a performance-critical loop -- then a sampling-based assertion may be appropriate:

assert:rnd(100) checkConsistency();  // on average, 1 in 100 times

But at the end of the day, the advanced options are just that: Advanced, and optional.

Just the way that they should be.