Swift Value Semantics, COWs and Clean Threads

How safe is safe?

Among programming language aficionados there's an old saying: Pascal keeps your hands tied. C gives you enough rope to hang yourself.

In other words, Pascal sacrifices ease of expression and speed for safety, while C does exactly what you tell it, as quickly as possible. Unfortunately, in the real world C's approach leads to unpredictable crashes and security vulnerabilities caused by low-level programming errors like pointer mismanagement and writing past the end of buffers.

In its heyday, C became the preferred operating system development language largely because of raw speed compared to safer languages like Pascal. However, now the pendulum is swinging back in the direction of safety. Memory and CPU cycles are more plentiful so we're willing to give up a little efficiency for improved reliability and security. The Swift language was created to be fast, but not to the exclusion of safety. It includes runtime checks for programming errors like attempted buffer overflows and invalid memory access.

However, currently these safety precautions exist mainly for single threads of execution. In concurrent software it's possible to make access to shared memory safe by making it atomic, so only a single thread can access a portion of memory at a time. However, this has significant performance and other practical limitations for a language, and at the time of writing Swift doesn't attempt to solve that problem. Perhaps a future version of Swift will include features to address this, such as primitive support for the actor model, but for now, thread safety is mostly left to the developer.

Fortunately, there are other techniques to minimize safety concerns due to shared memory access. For example, as we'll discuss here, the need for shared memory between threads can be reduced with the use of data structures based on value types.

Value and reference types

To work with Swift at even an introductory level, it's important to understand the difference between value and reference types. If you're already familiar with the concept, feel free to skip this section.

The Swift language shows the influence of C++ on its designers in several places, but the two languages treat these keywords differently. In C++, class and struct both define value types. In Swift, a data structure defined with the class keyword is a reference type, usually allocated in dynamic memory (i.e., the heap), whereas a data structure defined with either the struct or enum keyword is a value type, usually allocated from automatic memory (i.e., the stack) when used as a local variable. The terms value type and reference type describe the behavior of the variables and properties that hold instances of the two kinds of type.

A property or function argument for a reference type contains a reference (i.e., a pointer) to the memory allocated for the object. Copying one property to another, or passing it as a function parameter, creates another reference to the same object.

On the other hand, a property or function argument for a value type contains the actual contents (i.e., the value) of the data structure directly. Copying one property to another, or passing it as a function parameter, makes a new copy of the data.
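
The difference is easiest to see side by side. The following is a minimal sketch; PointValue and PointReference are hypothetical types invented for this illustration.

```swift
// Hypothetical types for illustration only.
struct PointValue {
    var x: Int
}

final class PointReference {
    var x: Int
    init(x: Int) { self.x = x }
}

let valueA = PointValue(x: 1)
var valueB = valueA        // copies the data itself
valueB.x = 2               // valueA.x is still 1

let refA = PointReference(x: 1)
let refB = refA            // copies only the reference
refB.x = 2                 // refA.x is now 2, because both names refer to one object
```

After the assignments, the two struct variables hold different values, while the two class variables still refer to the same object.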

In Swift, it's helpful to understand how constant properties behave for both value and reference types. A constant for a value type (declared with the let keyword) considers all the elements of the value to be immutable. However, a constant for a reference type (also declared with the let keyword) considers only the reference itself to be immutable, and the elements of the object referred to can still be mutated. In other words, a constant reference will always refer to the original object, but the object itself can still be modified.
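
A short sketch of the two kinds of constant, using hypothetical Size and Window types. The commented-out lines are the ones the compiler would reject.

```swift
// Hypothetical types for illustration only.
struct Size {
    var width: Int
}

final class Window {
    var width = 100
}

let size = Size(width: 100)
// size.width = 200      // would not compile: every part of a constant value is immutable

let window = Window()
window.width = 200       // allowed: only the reference itself is constant
// window = Window()     // would not compile: the constant cannot refer to a new object
```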

Because the parameters of Swift functions are always constants, this has important implications for function arguments. A value type passed to a function can never be changed within the function (excluding inout arguments, which behave as if they were passed by reference), but the contents of a reference parameter can change. We'll discuss the ramifications of this difference next.
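
The three cases can be sketched as follows. Counter, SharedCounter and the three functions are hypothetical names used only for this example.

```swift
// Hypothetical types and functions for illustration only.
struct Counter {
    var count = 0
}

final class SharedCounter {
    var count = 0
}

// Value parameter: the function can only work on its own copy.
func incremented(_ counter: Counter) -> Counter {
    // counter.count += 1   // would not compile: parameters are constants
    var result = counter
    result.count += 1
    return result
}

// Reference parameter: the caller's object is modified.
func incrementShared(_ counter: SharedCounter) {
    counter.count += 1
}

// inout parameter: the caller's value is modified and must be passed with &.
func incrementInPlace(_ counter: inout Counter) {
    counter.count += 1
}

let start = Counter()
let next = incremented(start)        // start.count stays 0, next.count is 1

let shared = SharedCounter()
incrementShared(shared)              // shared.count is now 1

var mutableCounter = Counter()
incrementInPlace(&mutableCounter)    // mutableCounter.count is now 1
```

Only the first function leaves its argument untouched, which is what makes it safe to treat as a black box.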

The pros and cons of value types

Many languages use value types for primitives like booleans, integers and floating-point numbers, and reference types for most everything else. However, by convention Swift uses value types whenever possible, and reference types only when there is a clear benefit to doing so. Programmers coming from other languages are probably familiar with the behavior of value types from working with these language primitives. In those languages, if you pass an object as an argument to a function it is passed as a reference, so changes to the object within the function will also affect that object outside of the function. Thus the function is said to have side effects on its environment.

However, if you pass a primitive such as an integer variable to a function, since it's a value type, nothing within the function will change the value of the original integer variable. This makes it easier to reason about the behavior of your code because you can treat the function as a black box. You don't care what happens inside the box, just what you put in and what you get out. In functional programming, this is called a pure function because it simply transforms inputs into outputs without any side effects. In general, pure functions are easier to reason about and test, especially when concurrency comes into play.

Value types are especially helpful in multi-threaded situations where more than one thread is potentially working with a copy of a value concurrently. With value copies, each thread can safely change its own copy without affecting values in use elsewhere. In contrast, reference copies must be explicitly protected because each thread is referring to the same data, so changing a value with one thread can cause conflicts or unexpected results for other threads.

However, it's important to note this aspect of value types only applies when threads own distinct copies of a value. If multiple threads are using a single shared value, explicit protection such as a mutex is still needed. Going back to the integer example above, if multiple threads are trying to read and write to a global integer simultaneously without mitigating conflicts, it's well understood that problems can be expected.
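
As a rough sketch of that kind of explicit protection, the example below guards a single shared integer with an NSLock. LockedCounter and countConcurrently are hypothetical names, and NSLock is just one of several possible mutex choices.

```swift
import Foundation
import Dispatch

// Hypothetical wrapper: one integer shared by many threads, guarded by a lock.
final class LockedCounter {
    private var value = 0
    private let lock = NSLock()

    func increment() {
        lock.lock()
        defer { lock.unlock() }
        value += 1
    }

    var current: Int {
        lock.lock()
        defer { lock.unlock() }
        return value
    }
}

// Increments one shared counter from many threads and returns the final total.
func countConcurrently(iterations: Int) -> Int {
    let counter = LockedCounter()
    // concurrentPerform spreads the iterations across threads and returns when all are done.
    DispatchQueue.concurrentPerform(iterations: iterations) { _ in
        counter.increment()
    }
    return counter.current
}

let total = countConcurrently(iterations: 1_000)
```

Because every read and write of the integer happens while the lock is held, no increment is lost. Without the lock, the same loop would be a data race.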

So far we've discussed the advantages of value types, but there are also drawbacks. Sometimes it's desirable to have multiple references to a single data instance as a shared state, or for efficiency with large data sets. Because each copy of a value type contains its own distinct range of bytes set aside in memory for the value, using multiple copies of large data structures can quickly become an inefficient use of memory.

Execution speed can also suffer when large data structures are copied repeatedly. For example, each time a value type is used as an argument to a function, the entire data structure must be copied as part of the function call rather than just quickly copying a relatively small reference to that data. Also, as discussed above, value types are generally stored on the stack, whereas reference types are allocated on the heap. This can make value types faster to create, but most operating systems provide substantially more memory for the heap than the stack, so a software design using many large value type instances may experience hard-to-predict runtime failures such as stack overflows.

Clearly, there are good applications for both value and reference types in modern software design.

Standard collections in Swift

As mentioned earlier, Swift uses value types whenever possible, and this includes the collection types defined in the Swift standard library — arrays, sets, dictionaries and strings. For experienced programmers new to Swift, this can be unexpected because these collections represent potentially large data sets. Therefore most other language ecosystems understandably allocate them as reference types such as objects. Given the limitations introduced by using value types for large data sets, how does Swift derive the benefits of value types for collections, while overcoming the limitations?

These standard collections are actually hybrid types transparently using internal references for storage. A standard Swift collection is a value containing a reference to the actual storage. This arguably creates the best of both worlds, the ability to reason about code behavior together with the performance benefits of large reference types. Performance is good with these hybrids because copying them is essentially the same as copying a reference, and as long as the copies aren't altered that's all that's needed.

However, if it's necessary to alter a copy, these hybrid types defer the performance impact as long as possible using a scheme called copy-on-write (COW). Multiple copies are maintained internally as a single shared reference as long as they're only being read from, but if an owner writes to its copy, that instance is first physically duplicated from the shared original before being changed. This preserves the contract provided by value types while eliminating much of the performance impact. Most modern computer memory architectures allow copy-on-write to happen at a very low level in the operating system, but it is also possible to implement this scheme at the application level, as we'll see below.
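
One way to observe this with a standard array is to compare the address of the underlying buffer before and after a write. This is only a sketch that peeks at an implementation detail for illustration; baseAddress(of:) is a hypothetical helper, and production code should not depend on these addresses.

```swift
// Hypothetical helper: returns the address of the array's element buffer.
func baseAddress(of array: [Int]) -> UnsafeRawPointer? {
    array.withUnsafeBufferPointer { UnsafeRawPointer($0.baseAddress) }
}

let original = [1, 2, 3]
var copy = original

// Before any write, both values refer to the same storage.
let sharedBeforeWrite = baseAddress(of: original) == baseAddress(of: copy)

// Writing to the copy first duplicates the storage, then changes the duplicate.
copy.append(4)
let sharedAfterWrite = baseAddress(of: original) == baseAddress(of: copy)
```

In this sketch the two arrays share a buffer until the append, after which the copy has its own storage and the original is unchanged.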

A hybrid example

The Swift standard library and Foundation overlay provide value type equivalents for most of the reference types in Apple's legacy Objective-C Foundation library. At the time this was written, one exception was the attributed string, which is a text string with associated typesetting attributes such as fonts, styles, sizes and colors. As an example, we can build on the legacy reference type to create a simple Swift value-based attributed string.

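import Foundation

public struct AttributedString {
    public var string: String {
        get { storage.string }
        set { replaceString(with: newValue) }
    }

    public init(_ string: String, attributes: [NSAttributedString.Key: Any]? = nil) {
        self.storage = NSAttributedString(string: string, attributes: attributes)
    }

    public func attributes(at index: Int) -> [NSAttributedString.Key: Any] {
        storage.attributes(at: index, effectiveRange: nil)
    }

    private mutating func replaceString(with string: String) {
        // attributes(at:effectiveRange:) raises an exception on an empty string,
        // so only carry attributes over when there is existing text to read them from.
        let current: [NSAttributedString.Key: Any]? = storage.length > 0 ? attributes(at: 0) : nil
        storage = NSAttributedString(string: string, attributes: current)
    }

    // Immutable Foundation object that implements the attributed string logic.
    private var storage: NSAttributedString
}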

There are a couple of things worth calling out here. Our AttributedString type is a struct containing a single property called storage of type NSAttributedString. This creates a Swift value type containing a reference to an Objective-C class instance which implements the actual attributed string logic.

The size of the value data in memory is the sum of the size of all of its properties, properly byte aligned. In this case, where our struct contains a single reference, it's the size of a pointer on our platform (i.e., 8 bytes for 64-bit architecture). This will work great, and if we don't expect to modify the string much we're basically done.
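
This can be checked with MemoryLayout. The sketch below uses hypothetical Payload and Wrapper types rather than our attributed string, but the layout rule is the same: a struct holding one class reference is the size of one pointer, no matter how much data the object holds.

```swift
// Hypothetical types for illustration only.
final class Payload {
    let bytes = [UInt8](repeating: 0, count: 1_000)
}

struct Wrapper {
    let storage = Payload()
}

// The struct stores only the reference, not the 1,000 bytes behind it.
let wrapperSize = MemoryLayout<Wrapper>.size
let pointerSize = MemoryLayout<UnsafeRawPointer>.size
```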

However, notice we're using the immutable variant of NSAttributedString here, so to change the text in our string we need to allocate completely new storage for each change. Since Foundation provides a mutable version of NSAttributedString, it would be nice to use that and eliminate that extra object allocation by changing the last few lines. However, as we'll see, this adds more complexity to maintain the contract provided by a well-behaved value type. To start, we can just swap in the mutable form of our storage class.

public struct AttributedString {

    // similar to above

    private mutating func replaceString(with string: String) {
        storage.replaceCharacters(in: NSRange(location: 0, length: storage.length), with: string)
    }

    private let storage: NSMutableAttributedString
}

Now we've solved the need to allocate new storage on each change, but the attentive reader might wonder what we're doing to fulfill the implicit contract of value types regarding copy independence. To consider that concern it's helpful to write a unit test to see what happens when we copy a value and change it.

func testCopiedStringsAreIndependent() {
    let expectedOriginalValue = "Original"
    let expectedCopyValue = "Copy"
    var originalString = AttributedString("")
    var copiedString = originalString

    originalString.string = expectedOriginalValue
    copiedString.string = expectedCopyValue

    XCTAssertEqual(originalString.string, expectedOriginalValue)
    XCTAssertEqual(copiedString.string, expectedCopyValue)
}

This test creates an original attributed string, makes a copy and sets the values of each to separate strings. We expect both the original and copy to retain the differing values we set.

But instead, the values of both strings are the last value we set. This tells us it's time to implement copy-on-write, so again we change the last few lines of our struct and extend it with a private storage class.

public struct AttributedString {

    // similar to above

    private mutating func replaceString(with string: String) {
        // Copy the shared storage first if another struct copy also refers to it.
        if !isKnownUniquelyReferenced(&storage) {
            storage = Storage(NSMutableAttributedString(attributedString: storage.attributedString))
        }

        let text = storage.attributedString
        text.replaceCharacters(in: NSRange(location: 0, length: text.length),
                               with: string)
    }

    private var storage: Storage
}

extension AttributedString {
    // Reference type wrapper, so Swift can count the references to our storage.
    private final class Storage {
        fileprivate let attributedString: NSMutableAttributedString

        fileprivate init(_ attributedString: NSMutableAttributedString) {
            self.attributedString = attributedString
        }
    }
}

This change is a little more complex. Rather than using NSMutableAttributedString directly as our storage, we embed it in a custom private reference type (i.e., a Swift class). Before mutating an AttributedString, we check whether more than one reference to the storage exists. If the storage is uniquely referenced, only one copy of our struct exists and there's no need to copy the storage, so we can just change it in place. However, if our storage isn't uniquely referenced, we know more than one copy of our struct is sharing it, so we have to make a copy before writing to it.

The function isKnownUniquelyReferenced is a little bit odd in Swift. It's a Swift runtime function to compare the reference count of a class instance to one. It accepts an inout parameter, not because it changes the value of the parameter, but because Swift guarantees inout parameters are referenced uniquely and it doesn't add a new reference for the function argument.
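
A small sketch of how the function behaves on its own, using a hypothetical Box class. The call to withExtendedLifetime is there only to make sure the second reference is still alive when the check runs.

```swift
// Hypothetical class for illustration only.
final class Box {}

func uniquenessDemo() -> (alone: Bool, shared: Bool) {
    var first = Box()

    // Only one variable refers to the object at this point.
    let alone = isKnownUniquelyReferenced(&first)

    // A second variable now refers to the same object.
    let second = first
    let shared = withExtendedLifetime(second) {
        isKnownUniquelyReferenced(&first)
    }

    return (alone, shared)
}

let result = uniquenessDemo()
```

Here result.alone is expected to be true and result.shared false, which is the signal a copy-on-write type uses to decide whether it must duplicate its storage.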

Now our test passes, demonstrating we have independent strings in our copies. However, there's one other implicit contract requirement of value types we need to consider. Separate copies of an original must work correctly on separate threads. If multiple threads try to change the string in their copy simultaneously, we have a potential race condition where the first thread to get to the data might change it behind the back of another thread.

This means we need another unit test. Since we're testing behavior in multiple concurrent threads, assuming we're working with Apple's development tools, it's useful to turn on an Xcode feature called Thread Sanitizer. This is a runtime tool to constantly monitor memory shared by threads for conflicting access. Since it adds runtime overhead we only want to use it in non-production situations where we're more concerned about catching race conditions than evaluating performance. Thread Sanitizer is activated in the Xcode scheme editor, under the Diagnostics tab of the Run or Test action.

Now we're ready to run a unit test with multiple threads.

func testStringsInThreads() {
    let queue1 = DispatchQueue(label: "queue1")
    let queue2 = DispatchQueue(label: "queue2")

    var attributedString = AttributedString("")

    let iterations = 10

    var copy1 = attributedString
    queue1.async {
        for i in 1 ... iterations {
            copy1.string = copy1.string + "\(i)"
        }
    }

    var copy2 = attributedString
    queue2.async {
        for i in 1 ... iterations {
            copy2.string = copy2.string + "\(i)"
        }
    }

    for i in 1 ... iterations {
        attributedString.string = attributedString.string + "\(i)"
    }

    queue1.sync {}
    queue2.sync {}

    let expectedString = attributedString.string
    XCTAssertEqual(copy1.string, expectedString)
    XCTAssertEqual(copy2.string, expectedString)
}

This test creates an original attributed string, makes two copies of it and modifies it identically in three threads: the main thread and two background threads. Finally, after all modifications are complete, it asserts that all three copies are identical. One important detail to note here is both copies are created from the main thread — otherwise, we'd need to protect access to the original with some kind of mutex. If everything is working correctly, we expect the three strings to be equal, and both the thread sanitizer and any internal consistency checks in NSMutableAttributedString to remain silent.

With the thread sanitizer, we're adding a runtime check for a race between two or more threads accessing the same specific address in memory while it's changing. This can be a hard-to-detect programming error, and a specific test like the one above won't always catch it because the error is highly timing-dependent. With some luck and a well-considered test, though, either the test fails or the thread sanitizer reports a runtime error during testing.

But finding race conditions can be something of a "dark art." It might be helpful to adjust the number of concurrent copies, iterations or timing to generate a reproducible error. When detected, data race conditions do not necessarily cause test failures so vigilance is required when running these tests to notice runtime errors.

In this case, we can resolve the problem by adding a mutex around our copy-on-write function, so only one thread at a time can check for and copy shared storage.

private mutating func replaceString(with string: String) {
    // copyOnWriteMutex is a lock owned by the Storage instance; its declaration is not shown here.
    let mutex = self.storage.copyOnWriteMutex
    mutex.lock()
    // Release the lock on every exit path.
    defer { mutex.unlock() }

    if !isKnownUniquelyReferenced(&storage) {
        storage = Storage(NSMutableAttributedString(attributedString: storage.attributedString))
    }

    storage.attributedString.replaceCharacters(in: NSRange(location: 0, length: storage.attributedString.length),
                                               with: string)
}

Now our test runs without error, which unfortunately doesn't prove we've eliminated all potential race conditions, but it at least provides greater confidence they've been removed.
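
For reference, the pieces discussed above can be assembled into one self-contained sketch. Several details here are assumptions rather than part of the original listings: the type is renamed CowAttributedString (a hypothetical name, chosen so it can't be confused with the AttributedString type that newer versions of Foundation provide), the mutex is assumed to be an NSLock stored on the Storage class, and the attributes accessor is left out for brevity.

```swift
import Foundation

// Hypothetical name for the value type built up in this article.
public struct CowAttributedString {
    // Reference type wrapper holding the mutable Foundation string and the lock.
    private final class Storage {
        let attributedString: NSMutableAttributedString
        let copyOnWriteMutex = NSLock()   // assumption: NSLock serves as the mutex

        init(_ attributedString: NSMutableAttributedString) {
            self.attributedString = attributedString
        }
    }

    private var storage: Storage

    public init(_ string: String, attributes: [NSAttributedString.Key: Any]? = nil) {
        storage = Storage(NSMutableAttributedString(string: string, attributes: attributes))
    }

    public var string: String {
        get { storage.attributedString.string }
        set { replaceString(with: newValue) }
    }

    private mutating func replaceString(with string: String) {
        // Take the lock that belongs to the storage we currently share.
        let mutex = storage.copyOnWriteMutex
        mutex.lock()
        defer { mutex.unlock() }

        // Copy-on-write: duplicate the storage if another copy still refers to it.
        if !isKnownUniquelyReferenced(&storage) {
            storage = Storage(NSMutableAttributedString(attributedString: storage.attributedString))
        }

        let text = storage.attributedString
        text.replaceCharacters(in: NSRange(location: 0, length: text.length), with: string)
    }
}

// Usage: the same steps as the copy independence test shown earlier.
var original = CowAttributedString("")
var copy = original
original.string = "Original"
copy.string = "Copy"
```

After these assignments the two values hold "Original" and "Copy" respectively, matching the expectation of the first unit test.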

What does this mean to me?

The emphasis in the Swift ecosystem on value types impacts virtually all work in the Swift language through the collections available from the standard library and built into the language. For even simple applications, it's important to understand how this differs from most other programming environments, but when using concurrency or large data sets, the examples above illustrate complications to account for when designing software in Swift.

Modern hardware limitations point to concurrency as the best way to increase software performance for the foreseeable future, but multi-threaded software adds a great deal of complexity and potential flaws. These flaws can be hard to find and correct, but that's all the more reason to constrain their impact. Value types provide a powerful tool when working with multiple threads by reducing possible side effects and making the behavior of code easier to reason about. However, when using large data structures, it's important to understand how copy-on-write semantics can affect performance, especially when modifying shared data.

If not accounted for, this can cause unexpected performance impacts at unpredictable times. However, with forethought, these factors can be anticipated and mitigated before they become problems.