Tuesday, April 29, 2014

Rust for C++ programmers - part 4: unique pointers

Rust is a systems language and therefore must give you raw access to memory. It does this (as in C++) via pointers. Pointers are one area where Rust and C++ are very different, both in syntax and semantics. Rust enforces memory safety by type checking pointers. That is one of its major advantages over other languages. Although the type system is a bit complex, you get memory safety and bare-metal performance in return.

I had intended to cover all of Rust's pointers in one post, but I think the subject is too large. So this post will cover just one kind - unique pointers - and other kinds will be covered in follow up posts.

First, an example without pointers:
fn foo() {
    let x = 75;

    // ... do something with `x` ...
}
When we reach the end of `foo`, `x` goes out of scope (in Rust as in C++). That means the variable can no longer be accessed and the memory for the variable can be reused.

In Rust, for every type `T` we can write `~T` for an owning (aka unique) pointer to `T`. We use the `box` keyword to allocate space on the heap and initialise that space with the supplied value (this has very recently changed from using `~` for allocation too). This is similar to `new` in C++. For example,
fn foo() {
    let x = box 75;
}
Here `x` is a pointer to a location on the heap which contains the value `75`. `x` has type `~int`; we could have written `let x: ~int = box 75;`. This is similar to writing `int* x = new int(75);` in C++. Unlike in C++, Rust will tidy up the memory for us, so there is no need to call `free` or `delete`. Unique pointers behave similarly to values - they are deleted when the variable goes out of scope. In our example, at the end of the function `foo`, `x` can no longer be accessed and the memory pointed at by `x` can be reused.

Owning pointers are dereferenced using the `*` as in C++. E.g.,
fn foo() {
    let x = box 75;
    println!("`x` points to {}", *x);
}
As with primitive types in Rust, owning pointers and the data they point to are immutable by default. Unlike C, you can't have a mutable (unique) pointer to immutable data or vice-versa. Mutability of the data follows from the pointer. E.g.,
fn foo() {
    let x = box 75;
    let y = box 42;
    // x = y;         // Not allowed, x is immutable.
    // *x = 43;       // Not allowed, *x is immutable.
    let mut x = box 75;
    x = y;            // OK, x is mutable.
    *x = 43;          // OK, *x is mutable.
}
Owning pointers can be returned from a function and continue to live on. If they are returned, then their memory will not be freed, i.e., there are no dangling pointers in Rust. The memory will not leak however, eventually it must go out of scope and then it will be free. E.g.,
fn foo() -> ~int {
    let x = box 75;
    x
}

fn bar() {
    let y = foo();
    // ... use y ...
}
Here, memory is initialised in `foo`, and returned to `bar`. `x` is returned from `foo` and stored in `y`, so it is not deleted. At the end of `bar`, `y` goes out of scope and so the memory is reclaimed.

Owning pointers are unique (also called linear) because there can be only one (owning) pointer to any piece of memory at any time. This is accomplished by move semantics. When one pointer points at a value, any previous pointer can no longer be accessed. E.g.,
fn foo() {
    let x = box 75;
    let y = x;
    // x can no longer be accessed
    // let z = *x;   // Error.
}
Likewise, if an owning pointer is passed to another function or stored in a field it can no longer be accessed:
fn bar(y: ~int) {}

fn foo() {
    let x = box 75;
    bar(x);
    // x can no longer be accessed
    // let z = *x;   // Error.
}
Rust's unique pointers are similar to C++ `std::unique_ptr`s. In Rust, as in C++, there can be only one unique pointer to a value and that value is deleted when the pointer goes out of scope. Rust does most of its checking statically rather than at runtime. So, in C++ accessing a unique pointer whose value has moved will result in a runtime error (since it will be null). In Rust this produces a compile time error and you cannot go wrong at runtime.

We'll see later that it is possible to create other pointer types which point at a unique pointer's value in Rust. This is similar to C++. However, in C++ this allows you to cause errors at runtime by holding a pointer to freed memory. That is not possible in Rust (we'll see how when we cover Rust's other pointer types).

As shown above, owning pointers must be dereferenced to use their values. However, method calls automatically dereference, so there is no need for a `->` operator or to use `*` for method calls. In this way, Rust pointers are a bit similar to both pointers and references in C++. E.g.,
fn bar(x: ~Foo, y: ~~~~~Foo) {
    x.foo();
    y.foo();
}
Assuming that the type `Foo` has a method `foo()`, both these expressions are OK.

Using the `box` operator on an existing value does not take a reference to that value, it copies that value. So,
fn foo() {
    let x = 3;
    let mut y = box x;
    *y = 45;
    println!("x is still {}", x);
}
In general, Rust has move rather than copy syntax (as seen above with unique pointers). Primitive types have copy semantics, so in the above example the value `3` is copied, but for more complex values it would be moved. We'll cover this in more detail later.

Sometimes when programming, however, we need more than one reference to a value. For that, Rust has borrowed pointers. I'll cover those in the next post.

Wednesday, April 23, 2014

Rust for C++ programmers - part 3: primitive types and operators

Rust has pretty much the same arithmetic and logical operators as C++. `bool` is the same in both languages (as are the `true` and `false` literals). Rust has similar concepts of integers, unsigned integers, and floats. However the syntax is a bit different. Rust uses `int` to mean an integer and `uint` to mean an unsigned integer. These types are pointer sized. E.g., on a 32 bit system, `uint` means a 32 bit unsigned integer. Rust also has explicitly sized types which are `u` or `i` followed by 8, 16, 32, or 64. So, for example, `u8` is an 8 bit unsigned integer and `i32` is a 32 bit signed integer. For floats, Rust has `f32` and `f64` (`f128` is coming soon too).

Numeric literals can take suffixes to indicate their type (using `i` and `u` instead of `int` and `uint`). If no suffix is given, Rust tries to infer the type. If it can't infer, it uses `int` or `f64` (if there is a decimal point). Examples:
fn main() {
    let x: bool = true;
    let x = 34;   // type int
    let x = 34u;  // type uint
    let x: u8 = 34u8;
    let x = 34i64;
    let x = 34f32;
}
As a side note, Rust lets you redefine variables so the above code is legal - each `let` statement creates a new variable `x` and hides the previous one. This is more useful than you might expect due to variables being immutable by default.

Numeric literals can be given as binary, octal, and hexadecimal, as well as decimal. Use the `0b`, `0o`, and `0x` prefixes, respectively. You can use an underscore anywhere in a numeric literal and it will be ignored. E.g,
fn main() {
    let x = 12;
    let x = 0b1100;
    let x = 0o14;
    let x = 0xe;
    let y = 0b_1100_0011_1011_0001;
}
Rust has chars and strings, but since they are Unicode, they are a bit different from C++. I'm going to postpone talking about them until after I've introduced pointers, references, and vectors (arrays).

Rust does not implicitly coerce numeric types. In general, Rust has much less implicit coercion and subtyping than C++. Rust uses the `as` keyword for explicit coercions and casting. Any numeric value can be cast to another numeric type. `as` cannot be used to convert between booleans and numeric types. E.g.,
fn main() {
    let x = 34u as int;     // cast unsigned int to int
    let x = 10 as f32;      // int to float
    let x = 10.45f64 as i8; // float to int (loses precision)
    let x = 4u8 as u64;     // gains precision
    let x = 400u16 as u8;   // 144, loses precision (and thus changes the value)
    println!("`400u16 as u8` gives {}", x);
    let x = -3i8 as u8;     // 253, signed to unsigned (changes sign)
    println!("`-3i8 as u8` gives {}", x);
    //let x = 45u as bool;  // FAILS!
}
Rust has the following numeric operators: `+`, `-`, `*`, `/`, `%`; bitwise operators: `|`, `&`, `^`, `<<`, `>>`; comparison operators: `==`, `!=`, `>`, `<`, `>=`, `<=`; short-circuit logical operators: `||`, `&&`. All of these behave as in C++, however, Rust is a bit stricter about the types the operators can be applied to - the bitwise operators can only be applied to integers and the logical operators can only be applied to booleans. Rust has the `-` unary operator which negates a number. The `!` operator negates a boolean and inverts every bit on an integer type (equivalent to `~` in C++ in the latter case). Rust has compound assignment operators as in C++, e.g., `+=`, but does not have increment or decrement operators (e.g., `++`).

Sunday, April 20, 2014

Formatting change

Quick note - yes, I have changed the formatting on my blog - hope you like it!

Rust for C++ programmers - part 2: control flow

If

The `if` statement is pretty much the same in Rust as C++. One difference is that the braces are mandatory, but brackets around the expression being tested are not. Another is that `if` is an expression, so you can use it the same way as the ternary `?` operator in C++ (remember from last time that if the last expression in a block is not terminated by a semi-colon, then it becomes the value of the block). There is no ternary `?` in Rust. So, the following two functions do the same thing:
fn foo(x: int) -> &'static str {
    let mut result: &'static str;
    if x < 10 {
        result = "less than 10";
    } else {
        result = "10 or more";
    }
    return result;
}

fn bar(x: int) -> &'static str {
    if x < 10 {
        "less than 10"
    } else {
        "10 or more"
    }
}
The first is a fairly literal translation of what you might write in C++. The second is in better Rust style.

You can also write `let x = if ...`, etc.

Loops

Rust has while loops, again just like C++:
fn main() {
    let mut x = 10;
    while x > 0 {
        println!("Current value: {}", x);
        x -= 1;
    }
}
There is no do...while loop in Rust, but we do have the `loop` statement which just loops forever:
fn main() {
    loop {
        println!("Just looping");   
    }
}
Rust has `break` and `continue` just like C++.

For loops

Rust also has `for` loops, but these are a bit different. Lets say you have a vector of ints and you want to print them all (we'll cover vectors/arrays, iterators, and generics in more detail in the future. For now, know that a `Vec<T>` is a sequence of `T`s and `iter()` returns an iterator from anything you might reasonably want to iterate over). A simple `for` loop would look like:
fn print_all(all: Vec<int>) {
    for a in all.iter() {
        println!("{}", a);
    }
}
If we want to index over the indices of `all` (a bit more like a standard C++ for loop over an array), you could do
fn print_all(all: Vec<int>) {
    for i in range(0, all.len()) {
        println!("{}: {}", i, all.get(i));
    }
}
Hopefully, it is obvious what the `range` and `len` functions do.

Switch/Match

Rust has a match expression which is similar to a C++ switch statement, but much more powerful. This simple version should look pretty familiar:
fn print_some(x: int) {
    match x {
        0 => println!("x is zero"),
        1 => println!("x is one"),
        10 => println!("x is ten"),
        y => println!("x is something else {}", y),
    }
}
There are some syntactic differences - we use `=>` to go from the matched value to the expression to execute, and the match arms are separated by `,` (that last `,` is optional). There are also some semantic differences which are not so obvious: the matched patterns must be exhaustive, that is all possible values of the matched expression (`x` in the above example) must be covered. Try removing the `y => ...` line and see what happens; that is because we only have matches for 0, 1, and 10 and obviously there are lots of other ints which don't get matched. In that last arm, `y` is bound to the value being matched (`x` in this case). We could also write:
fn print_some(x: int) {
    match x {
        x => println!("x is something else {}", x)
    }
}
Here the `x` in the match arm introduces a new variable which hides the argument `x`, just like declaring a variable in an inner scope.

If we don't want to name the variable, we can use `_` for an unnamed variable, which is like having a wildcard match. If we don't want to do anything, we can provide an empty branch:
fn print_some(x: int) {
    match x {
        0 => println!("x is zero"),
        1 => println!("x is one"),
        10 => println!("x is ten"),
        _ => {}
    }
}
Another semantic difference is that there is no fall through from one arm to the next.

We'll see in later posts that match is extremely powerful. For now I want to introduce just a couple more features - the 'or' operator for values and `if` clauses on arms. Hopefully an example is self-explanatory:
fn print_some_more(x: int) {
    match x {
        0 | 1 | 10 => println!("x is one of zero, one, or ten"),
        y if y < 20 => println!("x is less than 20, but not zero, one, or ten"),
        y if y == 200 => println!("x is 200 (but this is not very stylish)"),
        _ => {}
    }
}
Just like `if` expressions, `match` statements are actually expressions so we could re-write the last example as:
fn print_some_more(x: int) {
    let msg = match x {
        0 | 1 | 10 => "one of zero, one, or ten",
        y if y < 20 => "less than 20, but not zero, one, or ten",
        y if y == 200 => "200 (but this is not very stylish)",
        _ => "something else"
    };

    println!("x is {}", msg);
}
Note the semi-colon after the closing brace, that is because the `let` statement is a statement and must take the form `let msg = ...;`. We fill the rhs with a match expression (which doesn't usually need a semi-colon), but the `let` statement does. This catches me out all the time.

Motivation: Rust match statements avoid the common bugs with C++ switch statements - you can't forget a `break` and unintentionally fall through; if you add a case to an enum (more later on) the compiler will make sure it is covered by your `match` statement.

Method call

Finally, just a quick note that methods exist in Rust, similarly to C++. They are always called via the `.` operator (no `->`, more on this in another post). We saw a few examples above (`len`, `iter`). We'll go into more detail in the future about how they are defined and called. Most assumptions you might make from C++ or Java are probably correct.

Friday, April 18, 2014

Rust for C++ programmers - an intermission - why Rust

I realise that in terms of learning Rust, I had jumped straight to the 'how' and skipped the 'why'. I guess I am in enough of a Rust bubble that I can't imagine why you wouldn't want to learn it. So, I will make a bit more of an effort to explain why things are how they are. Here I will try to give a bit of an overview/motivation.

If you are using C or C++, it is probably because you have to - either you need low-level access to the system, or need every last drop of performance, or both. Rust aims to do offer the same level of abstraction around memory, the same performance, but be safer and make you more productive.

Concretely, there are many languages out there that you might prefer to use to C++: Java, Scala, Haskell, Python, and so forth, but you can't because either the level of abstraction is too high - you don't get direct access to memory, you are forced to use garbage collection, etc. - or there are performance issues - either performance is unpredictable or its simply not fast enough. Rust does not force you to use garbage collection, and as in C++, you get raw pointers to memory to play with. Rust subscribes to the 'pay for what you use' philosophy of C++. If you don't use a feature, then you don't pay any performance overhead for its existence. Furthermore, all language features in Rust have predictable (and usually small) cost.

Whilst these constraints make Rust a (rare) viable alternative to C++, Rust also has benefits: it is memory safe - Rust's type system ensures that you don't get the kind of memory errors which are common in C++ - memory leaks, accessing un-initialised memory, dangling pointers - all are impossible in Rust. Furthermore, whenever other constraints allow, Rust strives to prevent other safety issues too - for example, all array indexing is bounds checked (of course, if you want to avoid the cost, you can (at the expense of safety) - Rust allows you to do this in unsafe blocks, along with many other unsafe things. Crucially, Rust ensures that unsafety in unsafe blocks stays in unsafe blocks and can't affect the rest of your program). Finally, Rust takes many concepts from modern programming languages and introduces them to the systems language space. Hopefully, that makes programming in Rust more productive, efficient, and enjoyable.

I would like to motivate some of the language features from part 1. Local type inference is convenient and useful without sacrificing safety or performance (it's even in modern versions of C++ now). A minor convenience is that language items are consistently denoted by keyword (`fn`, `let`, etc.), this makes scanning by eye or by tools easier, in general the syntax of Rust is simpler and more consistent than C++. The `println!` macro is safer than printf - the number of arguments is statically checked against the number of 'holes' in the string and the arguments are type checked. This means you can't make the printf mistakes of printing memory as if it had a different type or addressing memory further down the stack by mistake. These are fairly minor things, but I hope they illustrate the philosophy behind the design of Rust.

Rust for C++ programmers - part 1: Hello world

This is the first in a series of blog posts (none written yet) which aim to help experienced C++ programmers learn Rust. Expect updates to be sporadic at best. In this first blog post we'll just get setup and do a few super basic things. Much better resources are at the tutorial and reference manual.

First you need to install Rust. You can download a nightly build from http://www.rust-lang.org/install.html (I recommend the nighlties rather than 'stable' versions - the nightlies are stable in that they won't crash too much (no more than the stable versions) and you're going to have to get used to Rust evolving under you sooner or later anyway). Assuming you manage to install things properly, you should then have a `rustc` command available to you. Test it with `rustc -v`.

Now for our first program. Create a file, copy and paste the following into it and save it as `hello.rs` or something equally imaginative.
fn main() {
    println!("Hello world!");
}
Compile this using `rustc hello.rs`, and then run `./hello`. It should display the expected greeting \o/

Two compiler options you should know are `-o ex_name` to specify the name of the executable and `-g` to output debug info; you can then debug as expected using gdb or lldb, etc. Use `-h` to show other options.

OK, back to the code. A few interesting points - we use `fn` to define a function or method. `main()` is the default entry point for our programs (we'll leave program args for later). There are no separate declarations or header files as with C++. `println!` is Rust's equivalent of printf. The `!` means that it is a macro, for now you can just treat it like a regular function. A subset of the standard library is available without needing to be explicitly imported/included (we'll talk about that later). The `println!` macros is included as part of that subset.

Lets change our example a little bit:
fn main() {
    let world = "world";
    println!("Hello {}!", world);
}
`let` is used to introduce a variable, world is the variable name and it is a string (technically the type is `&'static str`, but more on that in a later post). We don't need to specify the type, it will be inferred for us.

Using `{}` in the `println!` statement is like using `%s` in printf. In fact, it is a bit more general than that because Rust will try to convert the variable to a string if it is not one already*. You can easily play around with this sort of thing - try multiple strings and using numbers (integer and float literals will work).

If you like, you can explicitly give the type of `world`:
    let world: &'static str = "world";
In C++ we write `T x` to declare a variable `x` with type `T`. In Rust we write `x: T`, whether in `let` statements or function signatures, etc. Mostly we omit explicit types in `let` statements, but they are required for function arguments. Lets add another function to see it work:
fn foo(_x: &'static str) -> &'static str {
    "world"
}

fn main() {
    println!("Hello {}!", foo("bar"));
}
The function `foo` has a single argument `_x` which is a string literal (we pass it "bar" from `main`). We don't actually use that argument in `foo`. Usually, Rust will warn us about this. By prefixing the argument name with `_` we avoid these warnings. In fact, we don't need to name the argument at all, we could just use `_`.

The return type for a function is given after `->`. If the function doesn't return anything (a void function in C++), we don't need to give a return type at all (as in `main`). If you want to be super-explicit, you can write `-> ()`, `()` is the void type in Rust. `foo` returns a string literal.

You don't need the `return` keyword in Rust, if the last expression in a function body (or any other body, we'll see more of this later) is not finished with a semicolon, then it is the return value. So `foo` will always return "world". The `return` keyword still exists so we can do early returns. You can replace `"world"` with `return "world";` and it will have the same effect.



* This is a programmer specified conversion which uses the `Show` trait, which works a bit like `toString` in Java. You can also use `{:?}` which gives a compiler generated representation which is sometimes useful for debugging. As with printf, there are many other options.

Monday, April 07, 2014

Anger

There was a lot of anger about Brendan being appointed CEO. There has been a lot of anger since he quit. It is in no way my place to tell people when they should and should not be angry. One of the big points made by progressive movements is that everyone has a right to get angry about things which affect them and people who aren't affected shouldn't tell people not to be angry. Doing so is a control tactic and just generally unfair (especially where it intersects with myths and stereotypes - 'the gay agenda', 'the angry black woman', etc.). I agree. It is one reason I said nothing much about the whole affair. Better for me to listen.

Still, after all this has played out (hopefully), I am left feeling a bit angry myself. And a lot disappointed. Previously, I have mostly agreed with the progressive movements (feminism, LGBT rights, anti-racism, and so forth). When I have not agreed, I have often had my mind changed. I have learnt a lot and I have a great respect for many people in these movements. It feels bad to be on the wrong side of that. It seems to me that the subtlety in the discussion was lost - assumptions were made, opinions were fought for, there was not much attempt to establish empathy and tolerance, nor to accurately learn the specifics of the situation.

Back to anger. Though it is important not to tell people when they are allowed to get angry or what that anger should look like, I would like to suggest how that anger should be used. Anger can be constructive - it is one of the most motivating human emotions and has led to great changes over the years. It can also be amazingly destructive with no purpose - from a child's tantrum to pretty much every war ever fought. Sometimes it is good, emotionally, to get angry and break things. But we must try to put some thought into what gets broken. It was in large part anger that brought LGBT (and other civil) rights to where they are today. We need more of that, and less just breaking stuff, even if it makes us feel better.

In the last couple of weeks we've seen a lot of (justified) anger, but the result has not been positive. Things got broken, but nothing has changed for the better. A small, non-profit organisation which fights for freedom and privacy on the internet against corporate interests and overbearing governments has been damaged in many ways. All to harm a man who made a semi-public donation to an admittedly odious cause. I can't think of anyone who's life has got better from this, maybe some CEOs of other companies who probably have private views worse than Brendan's, but weren't as honest about declaring them.

Sunday, April 06, 2014

Marriage

I have a lot of thoughts and feelings around the recent Brendan/CEO kerfuffle. Mostly though I think it has been better to listen than to talk, so I haven't blogged about it. I will just say that I am very unhappy with the result of it all. I would also like to voice my thoughts on marriage.

One element of misunderstanding, it seems to me, is about whether marriage is important. For some people, marriage is viewed as a purely religious/cultural construct which should be dictated by their religion/culture. They don't see why it is so important for gay people (or sometimes other minority groups) to be able to marry. Especially if alternatives such as civil unions exist. They have the privilege of not being denied the marriage of their choice.

In contrast, for many people marriage is a large and practical thing since it can affect things such as immigration status, benefits, hospital visitation, etc. (As well as having their relationships treated as second class in the eyes of the wider society, of course). In my view, it is unfortunate that the practical side of things exists. I am lucky enough to live in a country where marriage is (mostly) not important in that way, and I prefer it greatly.

Marriage is a wonderful thing, and I would not deny it to anybody who wants it. In my view, it should not involve either the state or any cultural or religious institution. I find the fact that a couple has to be married by a third party weird. In my ideal world, the people being married would only have to marry themselves to each other, and no-one else would get a say. Marriage should simply be a public declaration of commitment in front of the people who are important to those being married. No-one should have to officiate or register it, and no-one should have to say who can or who can't get married. And certainly, being married should have no effects on your legal or moral life.

To clarify, I don't think marriage should lead to tax breaks or extra respect from any institution. I don't believe adultery should be judged any better or worse because of it, etc.

Once marriage brings material benefit from the state or the legal system, and once marriage is bestowed by an institution rather than being freely chosen, it becomes just another tool for enforcing established power. By allowing powerful groups of people to bestow benefits either social or material on individuals, it becomes open to corruption. It becomes something minorities have to fight for and which exclusive majorities seek to prevent others obtaining. That an expression of love and commitment ends up like this is immensely saddening, and says a lot about human society.

And don't even get me started about the commercial side of things. The whole wedding industry makes me feel sick.

A very rough few thoughts on initialising data structures in Rust

This is not a well thought out blog post. It ought to be a tweet, but it is too long.

Thoughts (in kind of random order):
  • Rust has an ownership model based on hierarchies.
  • It is easy to support tree-based data structures this way.
  • Structures with cycles are much more annoying.
  • The real pain is in initialisation.
  • You can have a data structure with cycles using ref-counting or borrowed references as long as you 'delete' the whole data structure at once.
  • Rust helps you do this with its type system.
  • BUT there is no way to initialise such a data structure because of the way `mut` works. Specifically the requirement that `mut` variables are unique and have move/borrow semantics.
  • Therefore you need unsafe blocks, which is a pain.
  • Since it seems that you ought to be able to verify safe initialisation (I'm not sure how though).
  • I think this is an area of the language which has not caused much pain because compilers (especially the frontend) and browsers are both very 'tree-heavy' programs. Long term we should find a solution to this, but it is certainly something that can be put off till post-1.0 (since it has an unsafe solution and I can' imagine a safe solution not being backwards-compatible).
  • Kind of related - when using a data structure where all components have the same lifetime (i.e., the whole structure must be destroyed at once) you end up writing a huge number of really un-interesting lifetime parameters on functions and data which basically just get passed around. I would love to have a shorthand here - perhaps modules could have lifetime parameters? Not sure if that would help that much. More thought required in any case...

Saturday, April 05, 2014

Some thoughts on data structures in Rust

One thing I'm not so keen on in Rust (more of an itch, than a pain) is that there are a lot of data structures. C++ has classes and structs (which are basically the same thing), enums, and unions. Java has classes and enums (unless its gained some recently). Scripting and functional languages tend to have more (lists, tuples, dictionaries/maps, and so forth). (I'm ignoring arrays, lists, and other sequences for this post). Rust has:
  •     structs
  •     tuples
  •     tuple structs
  •     unit structs
  •     enums
  •     unit variants
  •     'tuple' variants
  •     struct variants
This seems like a lot, and I think it is a bit confusing. But there is good motivation for all of them. Still I think it is a little bit of an ad-hoc collection.

We've been thinking recently about how to support structures like the DOM, and there have been a bunch of different proposals. One of mine (https://github.com/rust-lang/rfcs/pull/24) suggests using a kind of unification of structs and enums to support Java style inheritance. I think that idea probably won't work. But I have a lot of thoughts which I would like to preserve for posterity. I'm not proposing any changes to Rust here, just trying to put into order the data structures a bit, and maybe provide a design for a core calculus or something. Basically intellectual masturbation, but perhaps it is interesting.

Classifying data structures

I think it is useful to classify the data structure along several axes:
  1. Are there many variants or just one (enums  - many, everything else - just one);
  2. Do the data structures give types (yes, except for the variants of enums);
  3. Does the data structure introduce a name, and thus nominal typing (tuples don't, everything else does);
  4. Are the fields of the structure named (enums and unit structs/variants - NA, tuples and tuple variants - no, everything else - yes);
  5. Can the data structure be instantiated (enums - no, everything else - yes);
  6. Can the data structure be a variant of another (variants - yes, everything else - no).
Do these categories uniquely identify each data structure? Yes.

Are there combinations of categories which are empty? Also, yes. Some of these are not interesting - there are no instantiable data structure which don't introduce a type. Some are just kind of holes - why is there no anonymous record type (i.e., a structure with named fields, but which can't be named itself)? Some are design decisions - why can't variants be used as types (maybe they will in the future)? Finally, some are really interesting questions - why don't we have variants which are themselves enums? Why don't enums have fields (named or not)? What would it mean to instantiate an enum? Could we have un-named variants?

What would a minimal set of data structures look like

In RFC PR 24, I proposed nesting enums and unifying enums and structs. I would like to go even further and try and think about a minimal set of data structures with the same expressivity as Rust data structures (more, in fact).

First a few observations. If fields are not named, then order is important, otherwise it isn't. That means mixing named and un-named fields doesn't really work. If a structure is named then it is nominally typed, otherwise, it is structurally typed. A unit structure is the same as a structure with n fields (named or not), where n = 0.

Lets start with records - we want named and un-named records with named and un-named fields. That gives us structs and tuples, both with and without names (as opposed to Rust which is missing un-named structs). Here is a Rust-like syntax (I use `[]` to mean a comma-separated sequence):
S ::= 'struct' id '{' ([id:T] | [T]) '}'
T ::= id | '{' ([id:T] | [T]) '}'
e ::= id? '{' ([id:e] | [e]) '}'
The next thing we need are enums and variants. Lets follow my unification proposal and just add variants to structs. The next question is what the variants should be - lets just use the existing data structures. And then we have the question of should an enum be instantiable - rather than following my RFC and using enum vs struct to differentiate, lets just add a keyword - abstract. Now we have:
S ::= 'abstract'? 'struct' id '{' ([id:T] | [T]), [S] '}'
T ::= ... | id | '{' ([id:T] | [T]) '}'
e ::= ... | id? '{' [[id:e] | [e]] '}'
Note that the initialiser syntax has changed to be a sequence of sequences of possibly named expressions. That is because I imagine the fields of outer data structures being inherited by inner ones. We have to accommodate un-named fields (where order is important) and named fields (where order is not, at least within the sub-sequence).

I believe this emulates all Rust data structures and adds anonymous records. Enums are abstract structs with no fields, structs are structs with named fields and no variants. Unit, tuple, and struct variants are nullary structs, structs with un-named fields, and structs with named fields - each nested inside an abstract struct with no fields. And so forth. One difference is that our encoding of variants introduce types, but that is an increase in expressivity.

With reference to my earlier classification, the obviously missing structures are un-named structures with variants and un-named variants - I don't believe either are practically useful; neither are present in Rust.

We have also added expressivity in the form of inheritance of variants from their enclosing structures. In terms of data, we can therefore emulate Java-style classes (we need to add virtual functions to emulate their behaviour, see the RFC for details of that).

To make the syntax more friendly we would add unit variants and structs. Then eliminate the struct keyword. We could use different kinds of brackets. Continuing we could end up back at Rust. I hope the little syntax snippet gives a glimpse of an elegant foundation for all these data structures.

A practical matter

I said above that my RFC/proposal wouldn't actually work, and this thought experiment has the same problem - how to implement enums? As with any value we must know the size of the value at compile time. In a language like Java where everything is a pointer, that is easy - the size is always one word. In Rust we want to really pass values by value. Since each enum variant could be a different size, that is a problem.

The current Rust implementation is to use the maximum size of the variants as the static size for an enum value. Then pad any variant values which are smaller than that. That is fine if we don't have too many enum objects or all enum variants are similar sizes. That is mostly the case.

However, under the scheme sketched above, we might expect both those constraints to be false. If we think about the DOM for instance, each object could be wildly different in size and we would have lots of objects, so there would be a lot of wasted memory.

One solution might be to represent enum values as a tag and a pointer to a bunch of fields, rather than as a tag and a bunch of fields. That would work, although you would have to do be careful to copy the pointed-to fields and not just the pointer when passing by value. However, it adds the cost of a dereference to every access and it is often convenient to really have a value. Rust is a systems language in the C-spirit and so there should not be this kind of magic.

In the context of virtual structs, we have previously discussed treating struct objects as dynamically sized types. We could then allocate precisely sized instances and refer to them via pointer which gives subtyping. However, such values must have restricted uses and so this would be incompatible with the current use of enum values. A solution suggested by bill-myers on GitHub is to use some keyword or attribute (`unsized`, say) which indicates that instances should be dynamically sized and mostly passed by reference. Omitting the keyword would give the current semantics and it is up to the programmer to ensure there is not too much memory wasted by padding.