Wednesday, October 22, 2014

Thoughts on numeric types

Rust has fixed width integer and floating point types (`u8`, `i32`, `f64`, etc.). It also has pointer width types (`int` and `uint` for signed and unsigned integers, respectively). I want to talk a bit about when to use which type, and comment a little on the ongoing debate around the naming and use of these types.

Choosing a type


Hopefully you know whether you want an integer or a floating point number. From here on in I'll assume you want an integer, since they are the more interesting case. Hopefully you know if you need your integer to be signed or not. Then it gets interesting.

All other things being equal, you want to use the smallest integer you can for performance reasons. Programs run as fast or faster on smaller integers (especially if the larger integer has to be emulated in software, e.g., `u64` on a 32 bit machine). Furthermore, smaller integers will give smaller code.

At this point you need to think about overflow. If you pick a type which is too small for your data, then you will get overflow and usually bugs. You very rarely want overflow. Sometimes you do - if you want arithmetic modulo 2^n where n is 8, 16, 32, or 64, you can use the fixed width type and rely on overflow behaviour. Sometimes you might also want signed overflow for some bit twiddling tricks. But usually you don't want overflow.
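As a rough sketch of the difference between wanting overflow and not wanting it, the snippet below sums some bytes twice: once modulo 2^8 and once with overflow reported. It assumes a current Rust toolchain; the function names `checksum_mod_256` and `total_or_overflow` are made up for illustration.

```rust
/// Adds up bytes modulo 2^8, deliberately relying on wrap-around.
fn checksum_mod_256(bytes: &[u8]) -> u8 {
    bytes.iter().fold(0u8, |acc, &b| acc.wrapping_add(b))
}

/// Adds up bytes, but reports overflow (as `None`) instead of wrapping.
fn total_or_overflow(bytes: &[u8]) -> Option<u8> {
    bytes.iter().try_fold(0u8, |acc, &b| acc.checked_add(b))
}

fn main() {
    let data = [200u8, 100, 7];
    // 200 + 100 + 7 = 307, and 307 mod 256 = 51.
    assert_eq!(checksum_mod_256(&data), 51);
    // The same sum does not fit in a `u8`, so the checked version gives `None`.
    assert_eq!(total_or_overflow(&data), None);
    // A sum that does fit is returned unchanged.
    assert_eq!(total_or_overflow(&[1, 2, 3]), Some(6));
}
```

The wrapping version is only correct when arithmetic modulo 2^8 is really what you want; in every other case the `None` from the checked version is the signal that the chosen type was too small.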

If your data could grow to any size, you should use a type which will never overflow, such as Rust's `num::bigint::BigInt`. You might be able to do better performance-wise if you can prove that values might only overflow in certain places and/or you can cope with overflow without 'upgrading' to a wider integer type.

If, on the other hand, you choose a fixed width integer, you are asserting that the value will never exceed that size. For example, if you have an ASCII character, you know it won't exceed 8 bits, so you can use `u8` (assuming you're not going to do any arithmetic which might cause overflow).
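To make the ASCII case concrete, here is a small sketch assuming a current Rust toolchain. The helper `ascii_byte` is a hypothetical name; it only narrows a `char` to `u8` once the "fits in 8 bits" assertion has actually been checked.

```rust
/// Returns the byte for an ASCII character, or `None` if the character
/// is outside the ASCII range and so cannot be asserted to fit in a `u8`.
fn ascii_byte(c: char) -> Option<u8> {
    if c.is_ascii() {
        Some(c as u8)
    } else {
        None
    }
}

fn main() {
    assert_eq!(ascii_byte('A'), Some(65));
    // 'é' is not ASCII, so the assertion behind `u8` does not hold.
    assert_eq!(ascii_byte('é'), None);
    // Arithmetic can still leave the range: 122 + 200 does not fit in a `u8`.
    assert_eq!(b'z'.checked_add(200), None);
}
```

The last assertion is the caveat from the paragraph above: the data fits in a `u8`, but the result of doing arithmetic on it may not.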

So far, so obvious. But what are `int` and `uint` for? These types are pointer width, which means they are the same size as a pointer on the system you are compiling for. When using these types, you are asserting that a value will never grow larger than a pointer (taking into account details about the sign, etc.). This is actually quite a rare situation. The usual case is indexing into an array, which is itself quite rare in Rust (since we prefer using an iterator).
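The following sketch illustrates the pointer width claim and the indexing case. One loud assumption: it is written for a current Rust toolchain, where the pointer width types are spelled `isize` and `usize` rather than `int` and `uint`. The function names are made up for illustration.

```rust
use std::mem::size_of;

/// Sums a slice by indexing; the index `i` has the pointer width type `usize`.
fn sum_by_index(values: &[u32]) -> u64 {
    let mut total = 0u64;
    for i in 0..values.len() {
        total += u64::from(values[i]);
    }
    total
}

/// Sums the same slice with an iterator, so no index type is needed at all.
fn sum_by_iterator(values: &[u32]) -> u64 {
    values.iter().map(|&v| u64::from(v)).sum()
}

fn main() {
    // The pointer width integers are the same size as a (thin) pointer.
    assert_eq!(size_of::<usize>(), size_of::<*const u8>());
    assert_eq!(size_of::<isize>(), size_of::<*const u8>());

    let values = [1u32, 2, 3];
    assert_eq!(sum_by_index(&values), 6);
    assert_eq!(sum_by_iterator(&values), 6);
}
```

The iterator version never mentions a pointer width integer, which is part of why these types should come up so rarely in practice.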

What you should never do is think "this number is an integer, I'll use `int`". You must always consider the maximum size of the integer and thus the width of the type you'll use.

Language design issues


There are a few questions that keep coming up around numeric types - how to name the types? Which type to use as a default? What should `int`/`uint` mean?

It should be clear from the above that there are very few situations where using `int`/`uint` is the right choice. So, it is a terrible choice for any kind of default. But what is a good choice? Well, first of all, there are two meanings for 'default': the 'go to' type to represent an integer when programming (especially in tutorials and documentation), and the default when a numeric literal does not have a suffix and type inference can't infer a type. The first is a matter of recommendation and style, and the second is built in to the compiler and language.
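For reference, these are the ways a literal's type gets pinned down before any compiler default comes into play. This is a sketch assuming a current Rust toolchain; `literal_widths` is a hypothetical helper that returns the size in bytes of each variable.

```rust
use std::mem::size_of_val;

/// Returns the sizes in bytes of three literals typed in three different ways.
fn literal_widths() -> (usize, usize, usize) {
    // A suffix on the literal fixes its type directly.
    let small = 200u8;
    // A type annotation on the binding fixes it too.
    let wide: i64 = 200;
    // With neither, inference uses later uses of the value:
    // assigning it to a `u16` binding makes `inferred` a `u16`.
    let inferred = 200;
    let _copy: u16 = inferred;
    (size_of_val(&small), size_of_val(&wide), size_of_val(&inferred))
}

fn main() {
    assert_eq!(literal_widths(), (1, 8, 2));
}
```

The language-level default only matters for a literal that has none of these: no suffix, no annotation, and no use that constrains its type.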

In general programming, you should use the right width type, as discussed above. For tutorials and documentation, it is often less clear which width is needed. We have had an unfortunate tendency to reach for `int` here because it is the most familiar and the least scary looking. I think this is wrong. We should probably use a variety of sized types so that newcomers to Rust get acquainted with the fixed width integer types and don't perpetuate the habit of using `int`.

For a long time, Rust had `int` as the default type when the compiler couldn't decide on something better. Then we got rid of the default entirely and made it a type error if no precise numeric type could be inferred. Now we have decided we should have a default again. The problem is that there is no good default. If you aren't being explicit about the width of the value, you are basically waving your hands about overflow and taking an "it'll be fine, whatever" approach, which is bad programming. There is no default choice of integer which is appropriate for that situation (except a growable integer like BigInt, but that is not an appropriate default for a systems language). We could go with `i64` since that is the least worst option we have in terms of overflow (and thus safety). Or we could go with `i32` since that is probably the most performant on current processors. But neither of these options is future-proof. We could use `int`, but this is wrong since it is so rare to be able to reason that you won't overflow when you have an essentially unknown width. Also, on processors with pointers narrower than 32 bits, it is far too easy to overflow. I suspect there is no good answer. Perhaps the best thing is to pick `i64` because "safety first".

Which brings us to naming. Perhaps `int`/`uint` are not the best names, since they suggest that they are the default types to use when they are not. Names such as `index`/`uindex`, `int_ptr`/`u_ptr`, and `size`/`usize` have been suggested. All of these are better in that they suggest the proper use for these types. They are not such nice names, but perhaps that is OK, since we should (mostly) discourage their use. I'm not really sure where I stand on this. Again, I don't really like any of the options, and at the end of the day, naming is hard.