That covers the easy cases for ownership and lifetimes. But it's apparently necessary in Rust to use "unsafe" code in libraries to implement basic structures like re-sizable arrays. This suggests a limitation of the ownership primitives.
I looked at this problem a decade ago for C/C++, and got as far as borrow semantics (I used the term "keep", as the opposite of borrow). But there were ownership cases I couldn't figure out a good way to handle. This sort of thing is why the C++ committee beat their head against the wall with "auto_ptr". Rust has the same problems, and the solutions don't seem to be entirely clean.
Living with strict single ownership semantics is hard. One amusing approach is to have an atomic "swap" as a primitive. Swap is safe as regards single ownership. With enough swaps, you can do things like move the contents of an array to a new, larger space. This is an inefficient way to do it, but easy to prove sound. "Move semantics" can be viewed as a special case of swap, where one of the moved pointers is known to be null.
It can be useful to view something like moving the contents of an array as a sequence of swaps, which are then optimized. If the destination array is all NULL at the start, each swap operation becomes a pointer copy followed by a store of NULL into the source pointer. Then, if we can show that all the swap operations are non-overlapping, we can do all the pointer copies before all the stores of NULL. Then, if the source buffer is about to be discarded and is never read again, we can omit the stores of NULL. Then if the pointer copies are all adjacent, they can be rolled up into one MOV operation.
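A safe sketch of this idea in today's Rust (the names are illustrative): model each slot as `Option<T>` so the "null" is explicit, and use `std::mem::replace`, which is exactly the swap-with-null move:

```rust
use std::mem;

// Move every element of `old` into a fresh, larger buffer using only
// per-slot swaps with "null" (None). No unsafe code is needed; an
// optimizer is free to fuse the per-slot moves as described above.
fn grow<T>(old: &mut Vec<Option<T>>, new_cap: usize) -> Vec<Option<T>> {
    assert!(new_cap >= old.len());
    let mut bigger: Vec<Option<T>> = (0..new_cap).map(|_| None).collect();
    for (dst, src) in bigger.iter_mut().zip(old.iter_mut()) {
        // The "swap with null" move: dst takes ownership, src becomes None.
        *dst = mem::replace(src, None);
    }
    bigger
}
```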
This may be a way to avoid using "unsafe" so much.
> But it's apparently necessary in Rust to use "unsafe"
> code in libraries to implement basic structures like
> re-sizable arrays.
Efficient data structures are one domain where single ownership does present barriers to implementation, true. Your domain determines how much unsafe code you need to deal with. For efficient data structures, you will need a large amount of unsafe code (probably a third to half of your code). For applications that need to really push the performance envelope, I expect you'll need around 10% unsafe code (this is a ballpark estimate for the amount of unsafe code in Servo, last I checked). For most domains I'd expect the volume of unsafe code to be less than 1% of your codebase (contrast this with C and C++, where by Rust's definition 100% of the codebase is unsafe).
> Living with strict single ownership semantics is hard.
This is why Rust provides a reference-counted smart pointer (somewhat like C++'s shared_ptr) in the standard library, for those cases where multiple ownership is necessary and you don't want to manually wrangle raw pointers. There's unsafe code underlying it (the aforementioned raw pointer wrangling), but putting it in the stdlib means that it can be well-audited.
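For illustration, a minimal sketch using the standard library's `Rc` (the non-atomic reference-counted pointer; `Arc` is the thread-safe version):

```rust
use std::rc::Rc;

// Two handles share ownership of one allocation; it is freed only
// when the last handle is dropped.
fn share() -> (usize, usize) {
    let first = Rc::new(vec![1, 2, 3]);
    let second = Rc::clone(&first);
    let while_shared = Rc::strong_count(&first);
    drop(second);
    let after_drop = Rc::strong_count(&first);
    (while_shared, after_drop)
}
```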
> This may be a way to avoid using "unsafe" so much.
Rust tries very hard to give you the tools to avoid using `unsafe` blocks, but it's a thoroughly pragmatic language and so it knows that sometimes such code is unavoidable. If you could afford inefficiency in the name of 100% safety, you'd just be using a GC'd language. If it denied you efficiency in the name of dogmatic safety, you'd just go back to using C++. The complexity here is inherent to the domain. There must exist a compromise if we expect to ever make progress.
> For applications that need to really push the performance envelope, I expect you'll need around 10% unsafe code (this is a ballpark estimate for the amount of unsafe code in Servo, last I checked).
10% seems higher than any number I've seen.
In any case, it matters quite a bit where the unsafe code is, and that's why I don't like just counting lines of unsafe code. Empirically, a large number of security-critical bugs in browsers are in layout code, often due to use-after-free. Servo's core layout code denies unsafe via `#![deny(unsafe_blocks)]`. We do not allow unsafe core layout code to be checked into the tree; the only unsafe code in layout is general code that manages relatively simple invariants.
The question that matters is: Can you write high-performance code in Rust without writing unsafe code of your own? I'm confident that the answer is yes.
> For applications that need to really push the performance envelope, I expect you'll need around 10% unsafe code (this is a ballpark estimate for the amount of unsafe code in Servo, last I checked).
Servo isn't just "a browser engine written in Rust"; it's a project that serves as the largest driving force for the language. As a result, many abstractions that will in the future end up in the standard library or one or more packages will show up in Servo first. And I'd expect the implementation of the standard library to contain a significant amount of unsafe code, as part of implementing abstractions usable without invoking "unsafe".
I would expect other application projects to have far less unsafe code on average. New Rust libraries wrapping C libraries, or creating new low-level abstractions around memory, will have quite a bit of unsafe code, with a safe exported API.
If it denied you efficiency in the name of dogmatic safety, you'd just go back to using C++.
For new work, most efforts are moving to C# or D or Go or Java, all of which solve this problem by using a garbage collector. The claim for Rust is that it's memory-safe without a garbage collector. This claim does not appear to stand up to scrutiny. If 10% of a browser has to be "unsafe", the language is seriously flawed.
> For new work, most efforts are going forward to C# or D or Go or Java, all of which solve this problem by using a garbage collector. The claim for Rust is that it's memory-safe without a garbage collector.
In C#, D, Go, and Java, arrays are implemented as language primitives in the runtime. Their runtimes are implemented in raw C or C++ and are not formally verified. So my question is: Why is implementing an array in C as a language primitive safer than implementing it in the standard library via unsafe code?
There's a similar story for the swap operation. Rust in fact used to have a swap operation, as you suggested, but it was removed because such an operation can easily be implemented in the library. We could have moved the code back into the compiler, but how would that have made the language safer? (In fact, I would be tempted to argue that implementing language primitives in codegen is less safe, since programmatically generating LLVM IR is more difficult than writing essentially C with a different syntax.)
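A sketch of how such a library swap can be written, with unsafe internals behind a safe signature (roughly what `std::mem::swap` does, minus optimizations):

```rust
use std::ptr;

// Safe interface, unsafe body: the signature makes misuse impossible.
fn my_swap<T>(a: &mut T, b: &mut T) {
    unsafe {
        // SAFETY: `a` and `b` are valid and aligned, and as two live
        // &mut references they are guaranteed not to alias.
        let tmp = ptr::read(a);
        ptr::copy_nonoverlapping(b, a, 1);
        ptr::write(b, tmp);
    }
}
```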
You didn't really answer the question. The GP was asking, how can Rust claim to be "memory-safe" if 10% of application code (e.g. Servo) is unsafe code. We can accept unsafe code in the standard libraries (equivalent to how C# or Java or Go are implemented), as it's reasonable to assume it will be extensively peer-reviewed and battle-tested.
You shouldn't be writing unsafe application code. If your application needs new unsafe abstractions (which you can think of as language extensions), then you can write those. But unsafe is absolutely not for application code.
(This is, as I explained downthread, part of the reason why I dislike going around quoting "X% of the code in Y app is unsafe." Your application code shouldn't be unsafe.)
Right now, pretty much 100% of all browser code is "unsafe", because they can't (for a pragmatic definition of "can't") be written in C#/Go/Java and have to be written in either C++ (or something equally unsafe), or something else with more static analysis. That's why dogmatically going with safety drives those projects back to C++.
The implementations of C#/Go/Java are also 100% unsafe, as they're in the same boat.
Getting that 100% down to 10%, where the 10% is clearly marked and sequestered behind safe interfaces that are well-tested and well-audited, is a much better solution than giving up on one or the other of performance or safety entirely.
That's the exact principle I'm talking about- meta-circular implementations shrink the amount of unsafe code and confine it to specific places where it can be better tested and checked.
Rust's unsafe-with-safe-interfaces style is just another way to do that, which honestly is a lot more straightforward.
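A toy example of the style (the `Samples` type here is hypothetical, not from any real codebase): the safe wrapper establishes the precondition that the internal `unsafe` call relies on, so callers can never violate it:

```rust
pub struct Samples {
    data: Vec<f64>,
}

impl Samples {
    pub fn new(data: Vec<f64>) -> Samples {
        Samples { data }
    }

    // Safe API: the bounds check makes the unchecked access sound.
    pub fn get(&self, i: usize) -> Option<f64> {
        if i < self.data.len() {
            // SAFETY: i < len was just verified above.
            Some(unsafe { *self.data.get_unchecked(i) })
        } else {
            None
        }
    }
}
```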
10% is totally unrealistic, and you also have to consider that most unsafe usage is actually interacting with C libraries and custom low-level data structures. Projects like Cargo don't contain a single unsafe block.
> Rust has the same problems, and the solutions don't seem to be entirely clean.
There's no entirely clean solution, just as there is no perfect GC algorithm. There's a big design space where the type-system restricts aliasing, and pointers carry static type information about what they can point to. Rust represents one point in that design space, one tuned towards memory management. There's all sorts of other design points: http://ilyasergey.net/papers/ownership-survey.pdf.
Most of the work in this area has actually been done in a Java context, and focuses more on access control than on memory management. These other type systems can enforce properties that are useful for encapsulation, but aren't necessarily all that useful for memory management.
I must point out that the Rust standard library includes std::mem::swap, which provides the safe atomic swap primitive you describe. It is implemented using unsafe, but the interface is safe.
Historically, swap actually was part of the Rust language (the `<->` operator). It was removed from the language as a primitive because it can be implemented in a library.
The recent discussion about the word 'lifetime' in rust interests me, because it is something I found confusing at first... but I still don't see any of the alternatives ('scope', 'lifespan', 'borrow bound') as being any meaningful improvement.
I think the lifetime guides should be more explicit about how memory management occurs.
Something like this:
When an instance X of type T is freed in Rust and its destructor (if any)
is invoked, we refer to this as 'dropping' X.
X will be dropped if:
- It goes out of scope
- Its immediate parent is dropped
*T in Rust is a 'raw' pointer, like a C pointer.
*T can result in a segmentation fault just as in C, if the memory it points
to has already been dropped (use after free).
&T is a special type in rust that prevents use after free errors by
applying compile time guards enforcing safe usage.
Typically the compile-time guards are determined automatically using
static analysis, but sometimes ambiguities require explicit hints when
writing an API. In these cases it is necessary to hint to the static type
checker what the appropriate lifetime for a &T is.
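A sketch of the distinction in today's syntax, with the unsound parts left as comments:

```rust
// Creating a dangling raw pointer is legal; only dereferencing it is UB.
fn dangling() -> *const i32 {
    let x = 42;
    &x as *const i32
} // x is dropped here; the returned pointer now points to dead stack memory.

// Dereferencing it requires `unsafe`, and here would be a use after free:
//     unsafe { println!("{}", *dangling()); } // undefined behavior
//
// The equivalent with &T is rejected at compile time:
//     fn borrowed() -> &'static i32 {
//         let y = 42;
//         &y // error[E0515]: cannot return reference to local variable `y`
//     }
```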
The existing lifetime definition:
A lifetime is a static approximation of the span of
execution during which the pointer is valid
Doesn't really cut it for me.
It needs to be explicit 1) that lifetimes are for the static type checker, and 2) that they are more than just for borrowed pointers.
For example, for a trait or closure, the lifetime is not the duration during which the pointer is valid. It's a lower bound on the possible lifetimes of other types that can go into either that closure or trait.
For example, to pass a closure to a thread it must be 'static, which means that the lower bound for any lifetime used in the closure is static. This for example:
let foo = 0u32;
let bar = || { let y = foo + 1; /* ... */ };
The closure in this example has a lifetime which is <= the lifetime of foo, which it uses. This is < 'static, so bar cannot be 'static.
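In current syntax, the thread case plays out like this (a minimal sketch): a borrowing closure is rejected, while a `move` closure, which owns its captures, satisfies the 'static bound:

```rust
use std::thread;

// A borrowing closure is not 'static, so it cannot cross a thread boundary:
//     let foo = 0u32;
//     let bar = || foo + 1;   // bar's lifetime <= foo's lifetime
//     thread::spawn(bar);     // error[E0373]: closure may outlive `foo`
//
// With `move`, the closure owns a copy of foo and so is 'static:
fn run() -> u32 {
    let foo = 0u32;
    let handle = thread::spawn(move || foo + 1);
    handle.join().unwrap()
}
```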
The same is true of structures and traits. A boxed trait Box&lt;Foo + 'a&gt; means the struct implementing Foo contained in the box must have a lifetime of at least 'a.
It's almost like we have two completely separate ideas:
1) Lifetime <--- The lifetime of a borrowed pointer
2) Lifetime bound <---- The lifetime on a closure, struct, trait, etc.
OP here. Yes, I now agree with you that doing a “Replace All” on lifetime with another term does not really solve the problem. (I have replied to you on the RFC discussion.) We need to clearly distinguish some existing lifetime-related terms (concepts), including the two you just mentioned.
The issue is &'a is so common in Rust code. We simply call it lifetime 'a, which is misleading. It is not any object’s or pointer’s lifetime, but a set of lifetime requirements (or lifetime bound in your term) that every borrower imposes on the borrowed resources. It is the B I mentioned in the last section of my blog post:
When we talk about borrow, there are three different kinds of “lifetime” involved:
A: the lifetime of the resource owner (or the owned/borrowed resource)
B: the “lifetime” of the whole borrow, i.e. from the first borrow to the last return
C: the lifetime of an individual borrower or borrowed pointer
The required condition is A >= B, where B = C1 U C2 U ... U Cn.
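A concrete sketch of A, C1, and C2 (the variable names are hypothetical):

```rust
fn demo() -> usize {
    let owner = String::from("resource"); // A: the owner's lifetime begins
    let n1;
    {
        let c1 = &owner; // C1: first borrower
        n1 = c1.len();
    }                    // C1 ends
    let n2;
    {
        let c2 = &owner; // C2: second borrower
        n2 = c2.len();
    }                    // C2 ends; B = C1 U C2
    drop(owner);         // A ends only here, so A >= B and this compiles
    n1 + n2
}
```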
I think scope is a great word, because my own important insight was that "lifetimes" represent the scope a borrowed reference lives in -- it's not really about the life of the value you borrow from.
Yes if you have a lifetime 'a, it will correspond to some reference on the stack in a scope somewhere up the callchain.
What I never see discussed is how this borrowing system works in the presence of global state. Can't you have two global variables that reference the same value? How do you then go about building indexes and things like that? Is there a form of weak reference?
I don't know Rust, so the question may be non-sensical, still, an explanation would be much appreciated :)
* You can have thread-local storage, which is a form of "global" state for a whole task/thread.
* You can use Arc (Atomically reference counted smart pointer), to share immutable data with all your tasks, without needing any locking. You may generate this data, and either hand tasks an Arc pointer at their creation, or send it out using channels.
* You can use RWLock or Mutex to share owned data with limited/locked mutable access between several tasks. These types are smart wrappers that only allow access to their protected value while properly locked.
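A minimal sketch combining the last two bullets, in today's std: Arc shares ownership of the lock across threads, and the Mutex only hands out access to the value while held:

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Each spawned thread gets its own Arc handle to the shared counter;
// mutation is only possible through the lock.
fn count_with_threads(n: u32) -> u32 {
    let counter = Arc::new(Mutex::new(0u32));
    let mut handles = Vec::new();
    for _ in 0..n {
        let counter = Arc::clone(&counter);
        handles.push(thread::spawn(move || {
            *counter.lock().unwrap() += 1;
        }));
    }
    for handle in handles {
        handle.join().unwrap();
    }
    let total = *counter.lock().unwrap();
    total
}
```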
Oh, and FYI: the borrow checker is strictly local to a function or method. Other features of the Rust language, like the Send and Sync kinds, are involved in checking which values are safe to share between tasks. Normal Rust references (&T) only point across tasks if you get them through one of the primitives Arc, Mutex, RWLock, or similar.