Wednesday, October 22, 2014

Thoughts on numeric types

Rust has fixed width integer and floating point types (`u8`, `i32`, `f64`, etc.). It also has pointer width types (`int` and `uint` for signed and unsigned integers, respectively). I want to talk a bit about when to use which type, and comment a little on the ongoing debate around the naming and use of these types.

Choosing a type


Hopefully you know whether you want an integer or a floating point number. From here on in I'll assume you want an integer, since they are the more interesting case. Hopefully you know if you need your integer to be signed or not. Then it gets interesting.

All other things being equal, you want to use the smallest integer you can for performance reasons. Programs run as fast or faster on smaller integers (especially if the larger integer has to be emulated in software, e.g., `u64` on a 32 bit machine). Furthermore, smaller integers will give smaller code.

At this point you need to think about overflow. If you pick a type which is too small for your data, then you will get overflow and usually bugs. You very rarely want overflow. Sometimes you do - if you want arithmetic modulo 2^n where n is 8, 16, 32, or 64, you can use the fixed width type and rely on overflow behaviour. Sometimes you might also want signed overflow for some bit twiddling tricks. But usually you don't want overflow.
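As a rough sketch of the difference between wanting overflow and not wanting it, the example below uses `wrapping_add` and `checked_add`. These are standard library methods in Rust 1.0 and later, so this assumes a newer toolchain than the one current when this post was written. The function names `add_mod_256` and `add_checked` are made up for illustration.

```rust
/// Adds two bytes modulo 2^8; wrap-around is the intended behaviour here.
fn add_mod_256(a: u8, b: u8) -> u8 {
    a.wrapping_add(b)
}

/// Adds two bytes, returning `None` when the sum does not fit in a `u8`.
fn add_checked(a: u8, b: u8) -> Option<u8> {
    a.checked_add(b)
}

fn main() {
    // 250 + 10 = 260, which wraps to 260 - 256 = 4.
    println!("{}", add_mod_256(250, 10));
    // The same sum is reported as an overflow instead of wrapping.
    println!("{:?}", add_checked(250, 10));
    // 100 + 27 = 127 fits in a byte, so the sum is returned.
    println!("{:?}", add_checked(100, 27));
}
```

Spelling out which behaviour you want at the call site makes the modular case obvious to a reader and turns the accidental case into something you have to handle.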

If your data could grow to any size, you should use a type which will never overflow, such as Rust's `num::bigint::BigInt`. You might be able to do better performance-wise if you can prove that values might only overflow in certain places and/or you can cope with overflow without 'upgrading' to a wider integer type.

If, on the other hand, you choose a fixed width integer, you are asserting that the value will never exceed that size. For example, if you have an ASCII character, you know it won't exceed 8 bits, so you can use `u8` (assuming you're not going to do any arithmetic which might cause overflow).
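One way to make that assertion explicit is to check it when narrowing a value. The sketch below is only illustrative: `ascii_byte` is a hypothetical helper, and it relies on `TryFrom`, which was added to the standard library long after this post.

```rust
use std::convert::TryFrom;

/// Narrows a character to a byte, or returns `None` if it is not ASCII.
/// `ascii_byte` is a hypothetical helper, not a standard library function.
fn ascii_byte(c: char) -> Option<u8> {
    if c.is_ascii() {
        // ASCII code points are at most 127, so this conversion cannot fail.
        u8::try_from(u32::from(c)).ok()
    } else {
        None
    }
}

fn main() {
    // 'a' is code point 97, which fits in a `u8`.
    println!("{:?}", ascii_byte('a'));
    // U+00E9 (e with acute accent) is not ASCII, so it is rejected.
    println!("{:?}", ascii_byte('\u{e9}'));
}
```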

So far, so obvious. But what are `int` and `uint` for? These types are pointer width, which means they are the same size as a pointer on the system you are compiling for. When using these types, you are asserting that a value will never grow larger than a pointer (taking into account details about the sign, etc.). This is actually quite a rare situation. The usual case is indexing into an array, which is itself quite rare in Rust (since we prefer using an iterator).
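A small sketch of the indexing case and its iterator alternative follows. It assumes a current Rust toolchain, where the pointer width types discussed here are spelled `isize` and `usize` rather than `int` and `uint`. The function names are made up for illustration.

```rust
use std::mem::size_of;

/// Sums a slice by indexing, so the counter has to be the pointer width `usize`.
fn sum_by_index(values: &[u32]) -> u64 {
    let mut total: u64 = 0;
    let mut i: usize = 0;
    while i < values.len() {
        total += u64::from(values[i]);
        i += 1;
    }
    total
}

/// Sums the same slice with an iterator, so no index type appears at all.
fn sum_by_iter(values: &[u32]) -> u64 {
    values.iter().map(|&v| u64::from(v)).sum()
}

fn main() {
    // The pointer width integer has the same size as a thin pointer.
    assert_eq!(size_of::<usize>(), size_of::<*const u8>());

    let data = [1u32, 2, 3];
    println!("{} {}", sum_by_index(&data), sum_by_iter(&data));
}
```

The iterator version never mentions an index type, which is part of why the pointer width types come up less often than you might expect.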

What you should never do is think "this number is an integer, I'll use `int`". You must always consider the maximum size of the integer and thus the width of the type you'll use.

Language design issues


There are a few questions that keep coming up around numeric types: How should the types be named? Which type should be the default? What should `int`/`uint` mean?

It should be clear from the above that there are very few situations where using `int`/`uint` is the right choice. So, it is a terrible choice for any kind of default. But what is a good choice? Well, first of all, there are two meanings for 'default': the 'go to' type to represent an integer when programming (especially in tutorials and documentation), and the default when a numeric literal does not have a suffix and type inference can't infer a type. The first is a matter of recommendation and style, and the second is built into the compiler and language.

In general programming, you should use the right width type, as discussed above. For tutorials and documentation, it is often less clear which width is needed. We have had an unfortunate tendency to reach for `int` here because it is the most familiar and the least scary looking. I think this is wrong. We should probably use a variety of sized types so that newcomers to Rust get acquainted with the fixed width integer types and don't perpetuate the habit of using `int`.

For a long time, Rust had `int` as the default type when the compiler couldn't decide on something better. Then we got rid of the default entirely and made it a type error if no precise numeric type could be inferred. Now we have decided we should have a default again. The problem is that there is no good default. If you aren't being explicit about the width of the value, you are basically waving your hands about overflow and taking an "it'll be fine, whatever" approach, which is bad programming. There is no default choice of integer which is appropriate for that situation (except a growable integer like BigInt, but that is not an appropriate default for a systems language). We could go with `i64` since that is the least worst option we have in terms of overflow (and thus safety). Or we could go with `i32` since that is probably the most performant on current processors, but neither of these options is future-proof. We could use `int`, but this is wrong since it is so rare to be able to reason that you won't overflow when you have an essentially unknown width. Also, on processors with pointers narrower than 32 bits, it is far too easy to overflow. I suspect there is no good answer. Perhaps the best thing is to pick `i64` because "safety first".

Which brings us to naming. Perhaps `int`/`uint` are not the best names, since they suggest they should be the default types to use when they are not. Names such as `index`/`uindex`, `int_ptr`/`u_ptr`, and `size`/`usize` have been suggested. All of these are better in that they suggest the proper use for these types. They are not such nice names, but perhaps that is OK, since we should (mostly) discourage their use. I'm not really sure where I stand on this. Again, I don't really like any of the options, and at the end of the day, naming is hard.

6 comments:

Stuart said...

Is it actually true that smaller types are always faster?

I was under the impression that 8-bit and 16-bit arithmetic can theoretically end up being slower, if the compiler backend ends up having to emulate mod-2^8 or mod-2^16 arithmetic using 32-bit operations.

But I don't know if this concern actually applies to modern-day CPUs.

Michael L said...

How about:
1. default to bigint
2. give a hint with the number of times bigint was used, suggesting the use of suffixes or type declarations.
3. provide a flag that turns the hint into an error.

Josh Tumath said...

Maybe a low-level warning should be given in the compiler, as well, when using int or uint. Renaming it would also make sense. Personally, I favour the arguments of defaulting to i32 rather than i64.

It's very likely that if something isn't done, use of int will just become very common since it's what's used in so many other languages.

Cole Mak said...

Is there any performance penalty in using an int8 or int16 on a 32-bit CPU? Or performance benefit?

mnem said...

I don't think it's the case that smaller types are always faster. Take the case of a loop counter. For a counter that is the width of a byte on a modern x86 chip, it'll undergo a movzx (move, clearing with zeros) before the cmp instruction which compares the counter value. ARM will generally perform a similar step (see http://www.davespace.co.uk/arm/efficient-c-for-arm/localvar.html ). So in those cases there will be more code in the executable file but more crucially, more instructions executed during each loop iteration. 16 bit widths behave similarly on x86 I think, although I'm not sure about ARM.

In general I imagine that dealing with widths smaller than the machine's word size will result in a pair of load-clearing-with-zero and unload-masking-some-bits type of instructions when moving data in and out of registers.

Having said all that, perhaps it makes no practical difference for speed these days. Chips are all damn fast. It's memory access that hurts. Also it's many years since I actually did anything useful with assembler, so I may just be writing rubbish :)

Anonymous said...

If 'int' is meant to be a pointer size, then calling it 'ptr' makes much more sense. Why not 'ptr' and 'u_ptr'? How does the 'int_' in 'int_ptr' help except to add more letters to type?