Monday, January 23, 2023

A Short Note on Types in Rust

Types, loosely defined, are about capabilities: what can be done with a value of a certain type? Programmers experienced with strongly and statically typed languages will probably have a more visceral appreciation of this, but even programmers of weakly and dynamically typed languages like JavaScript encounter the reality of types and their capabilities; they just do so at runtime.

For example, a JavaScript developer might type out the following code:

> "3".toExponential()

But this will fail at runtime with a TypeError, because "3" is of the String type, which does not have the capability to be shown in exponential form. The following, on the other hand, will work seamlessly:

> (3).toExponential()
'3e+0'

This works because 3 is of the number type, and the number type has the capability to be formatted in exponential form.
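For comparison, here is a minimal sketch of the same idea in Rust, where the missing capability is reported at compile time rather than at runtime. The helper name to_exponential is made up for this illustration; the {:e} format specifier is part of Rust's standard formatting machinery and only works for types that implement the LowerExp trait.

```rust
// Hypothetical helper, named after the JavaScript method for illustration.
fn to_exponential(n: f64) -> String {
    // `{:e}` requires the `LowerExp` trait, which numeric types implement.
    format!("{:e}", n)
}

fn main() {
    println!("{}", to_exponential(3.0));

    // The line below is rejected by the compiler, because `&str` does not
    // implement `LowerExp`, i.e. it lacks this capability:
    // println!("{:e}", "3");
}
```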

Rust extends the idea of types to include capabilities that relate to memory management. These capabilities are what ensure memory safety in Rust.

A short note on memory

Dealing with memory is an area that is mostly hidden from developers who have primarily worked with garbage-collected languages. This section gives a brief overview of some key aspects of memory management, as moving to Rust requires a deeper understanding of what is happening behind the scenes.

Stack and Heap

Values within a program take up memory. There are various memory regions where values can be placed, but two of the most common are the stack and the heap.

You can think of the stack as a temporary memory location for values needed by an executing function. This includes function parameters and locally defined variables. Since this is more of a scratch area, access to it is fast, and once a function is done executing, its values can be discarded or overwritten. Hence, the typical lifespan of a value on the stack is tied to the function that needs it for execution.

The heap, on the other hand, is a more durable memory region, where values that are not tied to the lifetime of an executing function can be placed. Dealing with the heap is more involved: finding space for a new value may require additional low-level memory management work to make that space available, and the space has to be released again once the value is no longer needed. The stack does not have this complexity.

The idea is to limit the use of the heap as much as possible, as it is generally not as fast as the stack.
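As a rough sketch of the difference in Rust: local variables live on the stack, while Box::new places a value on the heap and hands back a pointer to it. The function name make_on_heap is made up for this illustration.

```rust
// The local variable lives in this function's stack frame and is gone when
// the function returns. The boxed value lives on the heap, so it can be
// handed back to the caller and outlive the call.
fn make_on_heap() -> Box<u32> {
    let local: u32 = 41; // stack: tied to this function call
    Box::new(local + 1)  // heap: survives after the function returns
}

fn main() {
    let on_stack: u32 = 7;        // stored in main's stack frame
    let on_heap = make_on_heap(); // the Box itself is a small pointer on the stack
    println!("{on_stack} {}", *on_heap); // prints "7 42"
}
```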

Sized Types and Dynamically Sized Types

The type of a variable determines the type of values such a variable can contain. A consequence of this is that types also dictate how big or small the memory required by such a value is.

A variable of type u32 can contain numeric values in the range 0 to 4,294,967,295, while a variable of type u8 has a smaller size and can contain numeric values in the range 0 to 255. This means a variable of type u32 requires 32 bits of memory, while one of type u8 requires 8 bits.

Most types have a particular size, and hence a required amount of memory, that is knowable at compile time. u8 and u32 fall into this category. These types are called Sized Types. Their size never varies: a u32 always needs 32 bits of memory, regardless of whether it contains 0 or 4,294,967,295. The same applies to all other sized types.
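The sizes mentioned above can be checked with std::mem::size_of from the standard library, which reports the size of a sized type in bytes. The helper bits_of below is a made-up convenience for this sketch that converts bytes to bits.

```rust
use std::mem::size_of;

// Hypothetical helper: size of a sized type in bits (size_of reports bytes).
fn bits_of<T>() -> usize {
    size_of::<T>() * 8
}

fn main() {
    println!("u8:  {} bits, max value {}", bits_of::<u8>(), u8::MAX);
    println!("u32: {} bits, max value {}", bits_of::<u32>(), u32::MAX);
}
```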

On the other hand, there are other types whose sizes cannot be uniquely known at compile time.

One example is a slice, represented by [T]. This type represents a certain number of T values in sequence, but we don’t know how many there are; it could be 0, 1, 132, or 1 million T's. (This is different from an array, written [T; N], whose length N is part of the type and whose size is therefore known.) Hence, it is not possible to ascribe a unique size to this type at compile time. Such types are known as Dynamically Sized Types (DSTs).

The str type is another example of a DST. It represents a string slice: a sequence of UTF-8 encoded bytes. For the same reason that we cannot determine a unique size for [T] at compile time, we cannot determine one for str: a string slice could be 0, 1, 131, or 1 million bytes long.

The point with DSTs is that the type alone does not fix the size: each individual value has a definite size, but that size varies from value to value and is, in general, only known at run time. Hence no unique size can be ascribed to the type at compile time.
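A small sketch of this, using std::mem::size_of_val from the standard library, which reports the size in bytes of a particular value. Two values of the same type, str or [u32], report different sizes. The values are accessed through the & syntax, which is covered later in this post; the helper byte_len is made up for this illustration.

```rust
use std::mem::size_of_val;

// Hypothetical helper: number of bytes occupied by the string data itself.
fn byte_len(s: &str) -> usize {
    size_of_val(s)
}

fn main() {
    // Same type (str), different sizes.
    println!("{} {}", byte_len("hi"), byte_len("hello world")); // prints "2 11"

    // Same type ([u32]), different sizes: 4 bytes per element.
    let few: &[u32] = &[1, 2];
    let many: &[u32] = &[1, 2, 3, 4];
    println!("{} {}", size_of_val(few), size_of_val(many)); // prints "8 16"
}
```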

The Rust compiler generally prefers to know the size of types at compile time, for example so that it knows how much stack space to reserve for a local variable. So, we have a problem here with dynamically sized types, since creating a variable and annotating it with a DST won't compile.

This:

fn main() {
   let dst_value: str = "hello world";
}

Will fail compilation with the following error:

error[E0277]: the size for values of type `str` cannot be known at compilation time
  --> src/main.rs:70:9
   |
70 |     let dst_value: str = "hello world";
   |         ^^^^^^^^^ doesn't have a size known at compile-time
   |
   = help: the trait `Sized` is not implemented for `str`
   = note: all local variables must have a statically known size
   = help: unsized locals are gated as an unstable feature

The Rust compiler is kind enough to tell us why it is refusing to compile the code. A very instructive reason it gave is that "all local variables must have a statically known size."

So, how do we deal with this?

To fully understand how, we first need to take a look at the concepts of ownership and the borrow checker in Rust.

Ownership, Copying, Referencing, Moving and Borrowing

Let's first look at copying and referencing. We can illustrate both using JavaScript.

Copying

This is one of the most basic operations in any programming language: you have a value in one variable, and you copy it into another. In JavaScript:

> let a = 1
undefined
> let b = a
undefined
> b
1
> b = 2
2
> b
2
> a
1

Basically, with let b = a, the value of a is copied to b; hence when b is modified, it does not affect the value still contained in a.

Referencing

Referencing is when multiple variables do not each have their own copy of a value, but instead point to the same value. This means that if the value is changed through one of the variables, the change is also visible through the others.

> let a = [1,2,3]
undefined
> let b = a
undefined
> a.push(4)
4
> a
[ 1, 2, 3, 4 ]
> b
[ 1, 2, 3, 4 ]

To avoid creating a reference, a clone of the value has to be created instead.

The concepts of copying and referencing also exist in Rust, but Rust handles them differently. Before discussing copying in Rust, let's take a look at the concepts of ownership and moving of values, which, on the other hand, are characteristic of Rust.

Ownership and moving of values

Rust introduces the concept of ownership and the moving of values. Simply put, values are not merely assigned to variables; rather, values are owned by variables. For example:

>> let a = 1;

This should now be read as: the variable a was granted ownership of the value 1.

It is also possible to copy a value owned by a variable, which creates another value that is owned by the other variable.

>> let a = 1;
>> let b = a;
>> a
1
>> b
1

This should be interpreted as variable a owning the value 1, then with let b = a, a new copy of the value owned by the variable a is created and handed over to be owned by variable b.

The above is more or less the same behavior as the copying we saw with JavaScript.
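As a compilable sketch of the same behavior (the function name copy_then_change is made up for this illustration): changing b after the copy leaves a untouched, and a remains usable.

```rust
// Copies `a` into `b`, changes `b`, and returns both to show that `a`
// is unaffected and still usable after the copy.
fn copy_then_change(a: i32) -> (i32, i32) {
    let mut b = a; // the value owned by `a` is copied; `b` owns the copy
    b += 1;        // only `b` changes
    (a, b)
}

fn main() {
    let (a, b) = copy_then_change(1);
    println!("a = {a}, b = {b}"); // prints "a = 1, b = 2"
}
```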

The additional feature in Rust is the concept of moving values.

This means, instead of copying, a variable yields ownership of a value to another variable. As a result, once the value is moved and ownership is transferred, the old variable can no longer be used.

fn main() {
   let a = String::from("Hello world");
   let b = a; // value has moved from a to b
   println!("{b}");
   println!("{a}"); // compile error: a was moved and can no longer be used
}

The above code won't compile. This is because, with values of type String, the let b = a line means the value is moved from variable a to variable b. That is, variable a has transferred ownership to variable b. Hence, trying to use the variable a later on in the code by attempting to print it out won't compile.

What could be confusing is the fact that in the previous example, where we worked with an integer value, the behavior was copying, which is similar to what we had with JavaScript. But when the value is of type String, the value was moved!

What gives?

Well, it comes down to the idea of types again and their capabilities. In Rust, moving is the default; only types that have the capability to be copied (those implementing the Copy trait) are copied instead. As a general rule, simple primitive types such as integers, which live entirely on the stack, are Copy, while types that manage memory on the heap are not. String is a type that manages memory on the heap, which is why it moves rather than copies. To create a distinct copy of a String, it has to be cloned instead.

It is worth pointing out that Rust further distinguishes between copying and cloning. A copy can be interpreted as a simple bit-wise transfer from one memory location to another, while a clone is explicit (a call to clone()) and may require additional logic to be executed, such as allocating new heap memory.
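A minimal sketch of cloning a String (the function name clone_and_change is made up for this illustration): after the clone, both variables own their own distinct value, so changing one does not affect the other and both remain usable.

```rust
// Clones `a`, changes only the clone, and returns both values.
fn clone_and_change(a: String) -> (String, String) {
    let mut b = a.clone(); // explicit clone: `b` owns a separate heap allocation
    b.push('!');           // only the clone changes
    (a, b)                 // `a` was not moved by the clone, so it is still usable
}

fn main() {
    let (a, b) = clone_and_change(String::from("Hello world"));
    println!("{a}"); // prints "Hello world"
    println!("{b}"); // prints "Hello world!"
}
```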

This concept of moving values by default is a distinctive feature of Rust, and programmers new to Rust need to be aware of it.

Borrowing and referencing of values

Now, to recreate the scenario we had in JavaScript, where mutating an array through one variable was reflected in another variable, we have the following Rust code:

fn main() {
   let mut a = vec![1,2,3,4];
   let b = &mut a;
   b.push(5);
   println!("{a:?}");
}

Variable a holds a vector, but it was mutated via variable b and 5 was added, and this got reflected in variable a. Essentially, the same reference scenario we had with JavaScript arrays.

The only difference, however, is that Rust makes referencing explicit.

Let's go through the syntax. let mut a = vec![1,2,3,4]; grants ownership of the vector to variable a. Variables are immutable by default in Rust, so we use the mut keyword to indicate that the value owned by a can be modified.

Then let b = &mut a; is where the magic happens.

This, instead of moving the value from a to b, allows b to borrow the value instead.

This form of borrowing is achieved by creating a reference and handing that over to b.

The syntax &mut is used to achieve this, and the mut keyword indicates that it is possible to mutate the value through this reference.

It is also possible to borrow a value, i.e. create a reference to it, without being able to mutate the value: a sort of read-only access. In such a scenario, you use only &, for example:

fn main() {
   let a = vec![1,2,3,4]; // no mut needed: the vector is only read
   let b = &a;
   println!("{b:?}");
}

But there are certain things you can and cannot do depending on whether a value is owned, moved, borrowed, mutable, or immutable. This whole class of restrictions is there to ensure memory safety, and it is what the borrow checker enforces.

In summary, the borrow checker ensures that you can have multiple read-only references to the same value as long as there are no mutable references to it. If a mutable reference is created, then for as long as it is in use, it is the only reference that can read from or write to the value.
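The rule above can be sketched as follows (the function name read_then_write is made up for this illustration). The read-only references are allowed to coexist, and the mutable reference is accepted only because the read-only ones are no longer used once it is created.

```rust
// Reads the vector through two shared references, then mutates it through
// an exclusive one, and returns the result.
fn read_then_write(mut a: Vec<i32>) -> Vec<i32> {
    // Any number of read-only references may coexist.
    let r1 = &a;
    let r2 = &a;
    println!("{r1:?} {r2:?}");

    // r1 and r2 are not used past this point, so a mutable reference is allowed.
    let m = &mut a;
    m.push(4);

    // Uncommenting the next line makes compilation fail, because r1 (read-only)
    // would then still be in use while m (mutable) exists:
    // println!("{r1:?}");

    a
}

fn main() {
    println!("{:?}", read_then_write(vec![1, 2, 3])); // prints "[1, 2, 3, 4]"
}
```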

To get a more elaborate exploration of the rules the borrow checker enforces, see Rust Ownership Rules.

Introduction to Pointers and using References to work with DST

DSTs were already introduced as types that do not have a unique size at compile time, and as mentioned above, Rust prefers to deal with types whose size it knows at compile time. So, how does Rust work with DSTs? To understand this, we need to talk about pointers.

Pointers and References in Rust

Pointers are variables, just like any other, but the values they hold are the memory addresses of other values. They are an indirection that points to the memory location where another value can be found.

In Rust, pointers also have types, which makes sense, since there are capabilities, such as dereferencing, that only apply to pointers. Pointers can also be found in languages like C and C++, but they can be difficult and unsafe to work with, because unrestricted access to memory locations is easy to misuse.

This is the reason why, in Rust, you rarely deal with pointers directly. Instead, you have references, which are pointers with safety guarantees: a reference always points to a valid, live value.

You can think of them as a protective layer that makes working with pointers safer. Pointers that are not references, i.e. those without this protective guarantee, are usually called raw pointers.
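To make the contrast concrete, here is a small sketch (the function name read_through_raw is made up for this illustration). A reference can be read in ordinary safe code, whereas reading through a raw pointer requires an unsafe block, since the compiler no longer vouches for the pointer's validity.

```rust
// Reads an i32 through a raw pointer derived from a reference.
fn read_through_raw(value: &i32) -> i32 {
    let raw: *const i32 = value; // a reference can be turned into a raw pointer
    // Dereferencing a raw pointer is only allowed inside `unsafe`.
    // It is sound here because `raw` was just created from a valid reference.
    unsafe { *raw }
}

fn main() {
    let x = 5;
    let reference: &i32 = &x;
    println!("{}", *reference);           // safe: prints "5"
    println!("{}", read_through_raw(&x)); // prints "5"
}
```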

The thing with references is that they also have types, and the good thing about them is that they have a known size. This is because references hold memory addresses, so a fixed bit length, big enough to store any memory address, can always be assigned to them.

Sometimes extra metadata about the value they point to, such as its length, is stored alongside the address. When this is the case, the reference is usually called a fat pointer.
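A quick way to observe this is to compare reference sizes against usize, the pointer-sized integer type. In the sketch below (the helper words_in is made up for this illustration), a reference to a sized type takes one machine word, while references to str and [u8] take two: the address plus the length. These are the sizes observed with current Rust compilers.

```rust
use std::mem::size_of;

// Hypothetical helper: size of a type measured in pointer-sized words.
fn words_in<T>() -> usize {
    size_of::<T>() / size_of::<usize>()
}

fn main() {
    println!("&u8:   {} word(s)", words_in::<&u8>());   // thin pointer: address only
    println!("&str:  {} word(s)", words_in::<&str>());  // fat pointer: address + length
    println!("&[u8]: {} word(s)", words_in::<&[u8]>()); // fat pointer: address + length
}
```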

The type of a reference to a value of type T is written &T. For example, in the following code:

use std::collections::HashMap;
fn main() {
   let phone_code: HashMap<String, u8> = HashMap::from([("NL".to_string(), 31), ("USA".to_string(), 1)]);
   let ref_to_phone_code: &HashMap<String, u8> = &phone_code;
   dbg!(ref_to_phone_code);
}

phone_code is of the type HashMap<String, u8>, but ref_to_phone_code, a reference, is of the type &HashMap<String, u8> - notice the ampersand now in the type.

Also, the value of type &HashMap<String, u8> was created by prefixing the variable it references with &, that is, &phone_code in the code above.

Rust uses references to work with DSTs. Since references are types with a known size at compile time, the trick is to only allow interaction with DSTs via references to them. This is why annotating a variable with str, a DST, causes a compilation failure, but annotating it with &str, a reference to a DST, compiles fine.
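Here is the earlier failing example again, this time annotated with &str so that it compiles. The helper first_word is made up for this sketch; it shows that functions can likewise accept and return string slices of any length through references.

```rust
// Accepts a string slice of any length and returns a slice of its first word
// (or an empty slice if there is none).
fn first_word(text: &str) -> &str {
    text.split_whitespace().next().unwrap_or("")
}

fn main() {
    // `&str` has a known size, so this local variable is accepted.
    let dst_value: &str = "hello world";
    println!("{dst_value}");             // prints "hello world"
    println!("{}", first_word(dst_value)); // prints "hello"
}
```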

In Summary: T (Sized and DST), &T and &mut T

As mentioned at the beginning of this post, types are about encoding capabilities allowed with a value. Rust extends this to encompass capabilities related to memory management and layout.

So a type T can exist in one of two flavors: those whose memory size is always known to be of a particular bit length, called Sized Types, and those whose size is not unique and can vary from value to value, called Dynamically Sized Types.

Then you have references, which are pointers (variables that hold a memory location) with guarantees that ensure safe memory access.

These guarantees are enforced by the borrow checker.

These references could be of type &T or &mut T, depending on whether the reference is mutable or not. If type T is a DST, then a variable of type &T (or &mut T) can be used to reference the DST.
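To round this off, a short sketch with the slice type [i32] as the DST (the function names sum and double_all are made up for this illustration): &[i32] gives read-only access to a sequence of any length, and &mut [i32] allows it to be modified in place.

```rust
// Read-only access to a slice of any length.
fn sum(values: &[i32]) -> i32 {
    values.iter().sum()
}

// Mutable access to a slice of any length.
fn double_all(values: &mut [i32]) {
    for v in values.iter_mut() {
        *v *= 2;
    }
}

fn main() {
    let mut numbers = [1, 2, 3];  // a sized array: [i32; 3]
    double_all(&mut numbers);     // borrowed mutably as &mut [i32]
    println!("{numbers:?} {}", sum(&numbers)); // prints "[2, 4, 6] 12"

    let mut more = vec![10, 20];  // a Vec can be borrowed as a slice too
    double_all(&mut more);
    println!("{more:?}");         // prints "[20, 40]"
}
```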


