Ford Fusion

Abstract

Ford Fusion was an overhaul of Urbit’s over-the-air upgrade process and a rewrite of its build system. The new update system corrected a few long-standing bugs with the previous one, and the new build system is simpler, smaller (by around 5,000 lines), and easier to manage. This historical report was published on the Urbit Blog as a recapitulation of the project, which successfully revamped the Hoon code build system. It is lightly annotated to reflect the minor technical changes that have occurred since the original publication; this is intended as a living document to some extent. Ford Fusion remains the state of the art for building Urbit software as of writing.

1 Overview and Rationale

Ford Fusion was an overhaul of Urbit’s over-the-air upgrade process and a rewrite of its build system. The new update system corrects a few long-standing bugs with the previous one, and the new build system is simpler, smaller (by around 5,000 lines), and easier to manage.

Since deployment of Ford Fusion to the livenet in late June 2020, over-the-air updates (otas) have been much smoother. Before Ford Fusion, it was common for an ota to take several hours, use too much memory, and leave ships in inconsistent states. After Ford Fusion, multiple otas have been pushed out, including kernelspace changes, and most users didn’t even notice.

Urbit has always been able to update itself ota, but this process has often been rocky. Updating an operating system kernel on the fly is a difficult problem in general, like performing heart replacement surgery on yourself while running a marathon. Code that allows Linux to update its kernel in this way became a startup called Ksplice, won multiple awards, and was sold to Oracle. Even that, as impressive as it is, and as brilliant as its programmers are, can only perform certain limited kinds of patches to the kernel.

Urbit isn’t exactly a traditional operating system, so the comparison is somewhat unfair, but the purpose of better architecture is to create unfair comparisons. In this case, because the Nock layer is frozen, upgrading everything above that layer is easier. Upgrades are also facilitated by pure-functional semantics, transactional event processing, a type system oriented toward concrete data, and orthogonal persistence. These features make it feasible for Urbit to upgrade itself in the general case, not just some special cases.

Ford Fusion has fixed the major upgrade issues of the past by guaranteeing three properties that in retrospect are obvious requirements, but, like much of Urbit, took many years and rewrites to identify as such: atomicity, self-containment, and order.

1.1 Atomicity

In previous versions of Urbit, updates failed atomicity by deferring parts of the update to later events, which are separate transactions that can fail independently. Generally, each deferral causes an exponential increase in the number of failure states that need to be handled.

We’ve learned that asynchronicity is an entropic state. A system will tend toward more asynchronicity over time unless effort is put into keeping it synchronous. As developer Jonathan Blow has noted (Blow (2019), 42m27s), the language server protocol has turned every editor plugin into a distributed system, since now it has to communicate asynchronously with the main editor process.

Consider an update system that took multiple Arvo events to complete. An ad-hoc higher-level transaction system would need to be built to roll back the effects of the first few events in case of failure. It’s important for various parts of the system to be able to emit effects on upgrade; since those effects would need to be rolled back if a later event in this upgrade fails, the system would need to maintain a queue of those effects and only apply them once all the other upgrade events have completed.

Note that the asynchronicity has now spread. Some eﬀects that would normally be guaranteed to be processed synchronously might now be asynchronous. Entropy has begun to take hold, chipping away at the set of invariants the system is capable of guaranteeing.

This observation is not purely theoretical. False modularity was the cause of internal asynchronicity in Clay where it had to wait for responses in a complex dance with Ford, which was another vane (Arvo kernel module); moving Ford into Clay allowed function calls that were synchronous from Clay’s perspective, which allowed further simpliﬁcations, culminating in about a twenty percent reduction of source code size of the Arvo kernelspace.

Steve Yegge’s “platform rant” (Yegge, 2011) describes a Bezosian edict prohibiting synchronous communication among modules through direct linking. This can be seen as an acknowledgment of the difficulty Amazon was going to have when it needed to turn internal services external. If your software needs to run in hell, build it that way from the start.

An Urbit ship is not an enterprise SaaS product and does not need to run in this hell; it exists for just one person, with natural pressure pushing it in the opposite direction from Amazon’s web services. Instead of needing a ship’s state and functionality sharded into microservices strewn across multiple clusters, an Urbit instance is easiest to manage as a single server with all its state uniﬁed into one data structure and its event log stored as one totally ordered sequence of state updates – the antithesis of a distributed system.

1.2 Self-Containment

Before Ford Fusion, each commit to the Clay ﬁlesystem validated its ﬁles using ﬁletypes (called “marks”) deﬁned by ﬁles in the previous commit (see ~lagrev-nocfep (2024), pp. 35–50 in this issue). This could cause bugs if the ﬁletype deﬁnitions had changed in a backward-incompatible manner. It also meant a commit could not add both a new ﬁletype and new ﬁles of that type; instead, you needed two commits: one to deﬁne the ﬁletypes, and a second to add ﬁles of that new type. More theoretically, it caused history-dependence. The validated contents of ﬁles in a commit could vary based on the history of commits that led to this one.

Another way source code failed to be self-contained was that it had access to symmetry-breaking information at build-time, namely ship, desk (Urbit’s answer to a git branch), and (faked) date. A build recipe should be able to be shared, cached, and rerun without dependence on local conditions, so user code now no longer learns its ship, desk, or date until runtime.

Source code also had build-time access to Urbit’s immutable global namespace, called the “scry namespace”, which the kernel makes available as an implicit argument to userspace Hoon code. The scry namespace is immutable and referentially transparent, i.e., a request must always yield the same result for all time, but if an agent asks the kernel for a resource that’s from the future, hosted on another ship, or to which that agent doesn’t have permission, the kernel will deny the request.

If the kernel denies a scry request that user code made during the build process, the build system has no choice but to treat it as a nondeterministic error. Nondeterministic errors can never be fully eradicated, if for no other reason than that the user always has the option to defenestrate the machine—there’s nothing deterministic about that. But we try to minimize them, and especially to minimize uncertainty as to under what conditions they might occur.

No build should be killed by the absence of ﬁles outside the desk, so as of this update, user code can no longer scry at build time. Once built, userspace programs can scry if run in a context with a scry handler; a Gall agent’s runtime scry requests still work just ﬁne.

1.3 Order

The ﬁnal kind of failure ﬁxed by Ford Fusion was the lack of ordered layering during a software update. The most common form of this failure was that old Ford had a tendency to try to build userspace code using the previous version’s standard library. This didn’t work too well, unsurprisingly.

Emerging from this underworld required making a number of changes to the Arvo kernel, Clay, Gall, and the procedure for kernel updates. To avoid turning into a pillar of salt, I’ll skip the details of how the old system worked and instead describe the new update procedure.

2 How Updates Work Now

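These are the layers of the stack that update themselves on the fly, from lowest to highest: the Hoon compiler, the Arvo kernel, the Zuse standard library, the vanes (kernel modules), and userspace.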

An update to one layer necessitates a reload of all layers above it; e.g., a change to Zuse should trigger updates to the vanes and userspace. Conversely, an update to a higher layer should not cause a spurious reload of lower layers, which should not be affected by the change; for example, an update to just userspace should not cause any reloads of system code.
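The layering rule can be illustrated with a minimal Python sketch (Python rather than Hoon, purely for illustration). It assumes the layer order implied by the surrounding text; `LAYERS` and `layers_to_reload` are hypothetical names, not part of Arvo or Clay.

```python
# Layers from lowest to highest, as named in the surrounding text.
LAYERS = ["hoon", "arvo", "zuse", "vanes", "userspace"]


def layers_to_reload(changed):
    """Return the lowest changed layer and every layer above it."""
    if not changed:
        return []
    lowest = min(LAYERS.index(layer) for layer in changed)
    return LAYERS[lowest:]


print(layers_to_reload({"zuse"}))       # ['zuse', 'vanes', 'userspace']
print(layers_to_reload({"userspace"}))  # ['userspace']
```

A change to Zuse reloads everything above it, while a userspace-only change leaves system code alone.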

Clay is responsible for enforcing the layering of updates. An update to a module is triggered when an attempt is made to commit a change to Clay that aﬀects one or more ﬁles needed to build the module. For example, if the foo agent’s source, deﬁned in /app/foo/hoon, imports the bar library from /lib/bar/hoon, then a modiﬁcation to /lib/bar/hoon triggers an update to the foo agent. All vanes and userspace ﬁles depend on Zuse, which depends on the Arvo and Hoon sources, so a change to the Hoon, Arvo, or Zuse sources will trigger updates to all vanes and userspace ﬁles.
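As a rough sketch of this dependency rule, the following Python models the import graph as a plain dictionary and asks which agents are affected by a set of changed files. The graph, the `/app/baz/hoon` agent, and the function names are hypothetical; the implicit dependency of every file on Zuse is left out for brevity.

```python
# Hypothetical import graph: each file maps to the files it imports directly.
IMPORTS = {
    "/app/foo/hoon": {"/lib/bar/hoon"},
    "/app/baz/hoon": set(),
    "/lib/bar/hoon": set(),
}


def depends_on(path, changed, imports):
    """True if `path` or any of its transitive imports is in `changed`."""
    stack, seen = [path], set()
    while stack:
        current = stack.pop()
        if current in changed:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(imports.get(current, ()))
    return False


def agents_to_update(changed, agents, imports):
    """List the agents whose build inputs include a changed file."""
    return sorted(a for a in agents if depends_on(a, changed, imports))


AGENTS = ["/app/foo/hoon", "/app/baz/hoon"]
print(agents_to_update({"/lib/bar/hoon"}, AGENTS, IMPORTS))  # ['/app/foo/hoon']
```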

When asked to perform a commit, Clay determines which layers need to be updated based on which ﬁles have changed and which modules depend on those ﬁles. For now, all running programs load their source from the %base desk, so only changes to %base trigger stateful updates. Files in other desks can be built, but not installed into the system. This might be relaxed in the future.

The process of updating varies by layer. The Hoon and Zuse layers are stateless, so their newly rebuilt cores (Nock executables) must be stored (somewhere in the system’s Nock tree, in memory; remember, Urbit is a single-level store), but they have no state that would need to be migrated. The Arvo kernel, vanes, and userspace agents are all live, stateful programs, so in order to update one of those, the system must extract the state from the old program, pass that data into the newly built program, then discard the old program and store the new one. Arvo and agent state injection routines can emit eﬀects, but vane updates cannot.

To work around this limitation, Gall has a two-phase update process. First it enters a dormant “pupal” phase that stores not running agent cores, but only the agent states that the old Gall had extracted from its agents. When Clay notiﬁes Gall that its agents have been rebuilt, Gall “molts” back into normal functionality by loading the agent cores from Clay and then running their +on-load routines to inject the old state.¹
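The two-phase process can be pictured with a toy Python model. Everything here is an assumption made for illustration: `Agent`, `PupalGall`, `extract_state`, and `molt` are hypothetical names, and real agent cores and states are nouns rather than Python objects.

```python
class Agent:
    """Toy stand-in for an agent core built from one version of its source."""

    def __init__(self, version):
        self.version = version
        self.state = None

    def extract_state(self):
        return self.state

    def on_load(self, old_state):
        # A real +on-load could also migrate the state's shape here.
        self.state = old_state


class PupalGall:
    """Dormant phase: holds only extracted agent states, no agent cores."""

    def __init__(self, old_agents):
        self.states = {name: a.extract_state() for name, a in old_agents.items()}

    def molt(self, rebuilt):
        """Load rebuilt cores (as if from Clay) and inject the saved states."""
        agents = {}
        for name, state in self.states.items():
            agent = rebuilt[name]
            agent.on_load(state)
            agents[name] = agent
        return agents


old_agents = {"foo": Agent(version=1)}
old_agents["foo"].state = {"count": 3}
pupa = PupalGall(old_agents)
agents = pupa.molt({"foo": Agent(version=2)})
print(agents["foo"].version, agents["foo"].state)  # 2 {'count': 3}
```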

If there’s a change to ﬁles in /sys on the %base desk, Clay asks Arvo to update kernelspace. Clay sends a sequence of moves (eﬀects) to Arvo to ask Arvo to perform any necessary updates to Hoon, the kernel, Zuse, and vanes. This sequence is terminated by an extra move back to Clay itself, which will be received by the updated version of Clay after migrating its state. The rebuilt Clay can then use the newly rebuilt version of Zuse to rebuild userspace and notify clients of the update. One client is Gall, which molts when Clay notiﬁes it.

Clay triggers updates, but the Arvo kernel is responsible for performing updates to all kernelspace layers, and Gall is responsible for updating userspace agents. Agents are stored in Gall’s state, but all other layers are stored directly in the Arvo core’s state, so the Arvo kernel contains the routines that reload Hoon, the kernel itself, Zuse, and the vanes.

The Arvo kernel reloads itself by compiling the future version of itself, then calling the new core’s +load routine with the relevant parts of the old state. The state passed to the new Arvo now includes not just the vane cores and their states, but also the Arvo “duct” call stack, which maintains a stack of queues of moves to be passed from one vane to another, and a list of eﬀects to emit to Unix at the end of the current Arvo event. If needed, Arvo could migrate the outstanding moves themselves – if, say, the duct datatype changes.

Passing the Arvo call stack state to new Arvo allows a kernel update to happen in the middle of a more complex event without disturbing other sequences of processing steps happening concurrently in the vanes.

This entire update process happens in one Arvo event and doesn’t break event-dispatching semantics. This not only provides atomic rollback, but allows the update to be combined with other actions into a larger transaction – for example, to stage complex changes, user code could trigger two kernel updates in a row, both in the same event.

Note that this is the opposite situation from the entropically leaking asynchronicity described earlier. Now the guarantees don’t deteriorate; they can be composed into stronger guarantees.

3 How Clay Validates a Desk

A desk is Urbit’s answer to a Git repository. It’s almost identical, except that all files are typed and validated, and whenever a commit becomes the equivalent of Git’s HEAD, it’s assigned a semantically meaningful revision number.

If Clay has been asked to perform a commit, it needs to validate all the ﬁles in this desk and notify all subscribers to live queries of this desk’s data. Gall, for example, maintains live queries on builds of its live agents. Validation uses the Ford build system, which as of this update is no longer a standalone vane but a core within Clay.

A Clay commit, like a Git commit, is specified as the current value of all its changed files (and, separately, references its parent commits by hash), not as the diff from a parent commit. Unlike Git, Clay is typed, and every file must be validated according to its “mark”. A mark is named like a file extension, e.g., %txt, %png, or %noun, and Clay maintains a mapping from that name to behaviors of values of that type under various operations. The last segment of any Clay path specifies the mark to use for operations on that file, including validation.

Mark operations include conversion to and from other marks (such as converting %json to %txt), revision control operations (diﬀ, patch, and merge), and validating an untyped noun. Operations for mark %foo-bar are deﬁned by a core built using the source code at /mar/foo-bar/hoon, or if that doesn’t exist, at /mar/foo/bar/hoon.
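The lookup order for a mark's source file can be sketched in Python as follows. The function names are hypothetical, and the treatment of marks with more than one hyphen (splitting only at the first) is an assumption, since the text describes only the single-hyphen case.

```python
def mark_source_candidates(mark):
    """Paths to try, in order, for the source of a mark definition."""
    flat = f"/mar/{mark}/hoon"
    # Assumption: only the first hyphen is turned into a path separator.
    nested = f"/mar/{mark.replace('-', '/', 1)}/hoon"
    return [flat] if nested == flat else [flat, nested]


def resolve_mark_source(mark, desk_files):
    """Return the first candidate path that exists in the desk."""
    for path in mark_source_candidates(mark):
        if path in desk_files:
            return path
    raise FileNotFoundError(f"no definition found for %{mark}")


print(mark_source_candidates("foo-bar"))
# ['/mar/foo-bar/hoon', '/mar/foo/bar/hoon']
```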

Consider a ﬁle at /web/foo/json. In order to validate this ﬁle, Clay must load the mark deﬁnition core and use its validation routine to ensure the untyped value of /web/foo/json is in fact valid json. To obtain this core, Clay must build the ﬁle at /mar/json/hoon from source and then process the resulting raw mark core using some mild metaprogramming to get a standard interface core for dealing with marks, called a $dais, whose type is deﬁned in Zuse.

Since building a source ﬁle only makes sense if the ﬁle has been validated as a %hoon ﬁle, but mark deﬁnitions themselves must be built from source, there’s a logical dependency cycle – who validates the validators? To break this cycle, Clay hard-codes the validation of %hoon ﬁles. This allows mark deﬁnitions to be built from source, and in fact any ﬁle can depend on any other ﬁle of any mark as long as there are no cycles. As of Ford Fusion, Ford performs a cycle check to ensure acyclicity.
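One common way to check acyclicity is a depth-first search that flags any dependency leading back to a file still being visited. The Python sketch below shows that general technique; it is not necessarily how Ford implements its check, and `find_cycle` and the dictionary-shaped graph are hypothetical.

```python
def find_cycle(deps):
    """Return one dependency cycle as a list of paths, or None if acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, in progress, finished
    color = {node: WHITE for node in deps}
    stack = []

    def visit(node):
        color[node] = GREY
        stack.append(node)
        for dep in deps.get(node, ()):
            if color.get(dep, WHITE) == GREY:
                # `dep` is already on the current path, so this closes a cycle.
                return stack[stack.index(dep):] + [dep]
            if color.get(dep, WHITE) == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        stack.pop()
        color[node] = BLACK
        return None

    for node in deps:
        if color[node] == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


print(find_cycle({"a": ["b"], "b": ["c"], "c": ["a"]}))  # ['a', 'b', 'c', 'a']
print(find_cycle({"a": ["b"], "b": []}))                  # None
```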

Since building a ﬁle is a pure function, Clay memoizes the results of all builds, including builds of marks, mark conversions, and Hoon source ﬁles. These memoization results are stored along with the desk and are used by later revisions of that desk. Future work should allow merge commits to pull memoized builds from all parents, but for now only the previous revision of the current desk is used. This is a major simpliﬁcation of previous Ford architectures, which maintained much more complex caches with less clear eviction semantics. Now on every commit, we just throw away any unused memoized builds from the previous revision’s Ford cache.
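The caching scheme described here can be sketched in Python under a few assumptions: a cache key identifies a build together with all of its inputs, and `FordCache`, `build`, and `finish` are hypothetical names rather than Clay's actual types.

```python
class FordCache:
    """Per-revision memo table that inherits from the previous revision."""

    def __init__(self, previous=None):
        self.previous = previous or {}  # builds memoized by the prior revision
        self.current = {}               # builds actually used in this revision

    def build(self, key, compute):
        if key in self.current:
            return self.current[key]
        if key in self.previous:
            result = self.previous[key]  # reuse without recomputing
        else:
            result = compute()
        self.current[key] = result
        return result

    def finish(self):
        """Keep only the builds this revision used; the rest are dropped."""
        return dict(self.current)


calls = []

def compile_sole():
    calls.append("sole")
    return "sole-core"

rev1 = FordCache()
rev1.build(("file", "/lib/sole/hoon", "v1"), compile_sole)
rev1.build(("file", "/lib/old/hoon", "v1"), lambda: "old-core")

rev2 = FordCache(previous=rev1.finish())
rev2.build(("file", "/lib/sole/hoon", "v1"), compile_sole)
print(calls)                # ['sole']
print(list(rev2.finish()))  # [('file', '/lib/sole/hoon', 'v1')]
```

The second revision reuses the memoized build of the unchanged library, and the build it never asked for is gone from the cache it hands on.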

Once Clay has validated every ﬁle in this new revision of a desk, it constructs and sends updates to any subscriptions that other vanes or agents have requested. More Ford builds may be run to fulﬁll these requests, including builds for any running agents whose dependencies changed in this commit.

When Gall receives a newly rebuilt agent from Clay, it calls the gate produced by the +on-load arm of the new agent with the state extracted from the old agent. If there is a crash in any +on-load calls or in the handling of any eﬀects they emit (which can include more agent activations), then the whole event crashes, canceling the commit. This eﬀectively gives any agent the ability to abort a commit by crashing.
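The all-or-nothing behavior can be modeled in Python by treating a commit as a function from the old system state to a new one, which is only adopted if nothing crashes. This is a simplified sketch: `apply_commit`, `try_commit`, and the dictionary-shaped state are hypothetical, and effect handling is omitted.

```python
import copy


def apply_commit(system, new_agents):
    """Run every on_load against a scratch copy of the system state."""
    scratch = copy.deepcopy(system)
    for name, on_load in new_agents.items():
        old_state = scratch["agents"].get(name)
        scratch["agents"][name] = on_load(old_state)  # may raise
    scratch["revision"] += 1
    return scratch


def try_commit(system, new_agents):
    """Return (new_system, None) on success or (old_system, error) on a crash."""
    try:
        return apply_commit(system, new_agents), None
    except Exception as err:
        return system, err


def crash(state):
    raise ValueError("incompatible state")


system = {"revision": 3, "agents": {"foo": 1}}
ok, ok_err = try_commit(system, {"foo": lambda s: s + 1})
bad, bad_err = try_commit(system, {"foo": lambda s: s + 1, "bar": crash})
print(ok)             # {'revision': 4, 'agents': {'foo': 2}}
print(bad is system)  # True
```

Because the crashing agent aborts the whole transaction, the update to `foo` in the same commit is discarded along with it.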

It is a bit counterintuitive that an app reload failure could prevent a kernel update. The reason is that we don’t want the system to update itself into a broken state. An Urbit can be rendered practically unusable by the presence of broken agents, even if the kernel hasn’t lost integrity, so it’s kinder to the user not to break their agents by installing an incompatible kernel update. This also puts virtuous pressure on kernel developers not to “break userspace”, the importance of which has been insisted on for decades by Linus Torvalds, among others.

If an agent does crash a commit event that included a kernel update, the attempted commit is now trivially rolled back, and the system can deliver an error message to the user. This does not leave the system in an inconsistent or stuck state, so the user could modify the failing agent and try the kernel update again later. Supporting better workﬂows for keeping third-party agents up-to-date will be an important aspect of Urbit’s upcoming software distribution work.

4 Ford Build Semantics

4.1 The Three Types of Ford Builds: Files, Marks, and Casts

The Ford build semantics have been simpliﬁed. There are now three kinds of builds that Ford can perform: ﬁles, marks, and casts, all of which happen synchronously as function calls inside Clay and are available (without memoization) as scry interfaces.

4.1.1 File Builds

A ﬁle build takes in a ﬁlepath containing Ford runes and Hoon source, runs the Ford runes to perform imports, and then compiles the source, producing a $vase, a noun tagged with its Hoon type.

Clay exposes file builds into the scry namespace with %ca: as an example, .^(vase %ca /~zod/base/3/lib/sole/hoon) will build the sole library.

4.1.2 Mark Builds

A mark build produces a $dais mark-interface core. It ﬁrst performs a ﬁle build on the Hoon ﬁle in /mar that deﬁnes the mark core, then it does some metaprogramming to make the operations more convenient to use. If the raw mark core delegated revision control operations to another mark core, the mark build will also load the delegate mark core and resolve the result into the $dais.

Clay exposes mark builds into the scry namespace with %cb: as an example, .^(dais:clay %cb /~zod/base/3/mar/foo/hoon) builds a $dais for the %foo mark.

4.1.3 Cast Builds

A cast build produces a $tube: a gate that takes a value of one mark as input and converts it to a valid value of another mark or crashes. To convert from mark %foo to mark %bar, Clay tries the following operations, in order:

(1) The %foo mark can “grow to” %bar by providing an arm in its +grow core named +bar.
(2) %bar can convert from %foo using a +foo arm in its +grab core.
(3) %foo can chain a conversion through an intermediary using an arm in its +jump core.
(4) %bar can specify an “indirect grab” by having a +grab arm produce a delegate mark instead of directly defining a conversion gate.

Clay exposes cast builds into the scry namespace with %cc: as an example, .^(tube:clay %cc /~zod/base/3/foo/bar) builds a $tube conversion gate from %foo to %bar.
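The order in which conversions are tried can be sketched in Python. Marks are modeled here as dictionaries with optional "grow", "grab", and "jump" tables. That representation, the name `build_cast`, and the reading of an indirect grab as a conversion composed through the delegate mark are all assumptions made for illustration.

```python
def build_cast(marks, src, dst, _active=frozenset()):
    """Return a converter from mark `src` to mark `dst`, or raise LookupError."""
    if (src, dst) in _active:
        raise LookupError(f"conversion loop at %{src} -> %{dst}")
    active = _active | {(src, dst)}
    old, new = marks.get(src, {}), marks.get(dst, {})

    def via(mid):
        # Compose src -> mid with mid -> dst.
        first = build_cast(marks, src, mid, active)
        second = build_cast(marks, mid, dst, active)
        return lambda value: second(first(value))

    grow = old.get("grow", {}).get(dst)
    if grow is not None:           # 1. direct +grow in the source mark
        return grow
    grab = new.get("grab", {}).get(src)
    if callable(grab):             # 2. direct +grab in the target mark
        return grab
    jump = old.get("jump", {}).get(dst)
    if jump is not None:           # 3. +jump through an intermediary mark
        return via(jump)
    if isinstance(grab, str):      # 4. indirect +grab naming a delegate mark
        return via(grab)
    raise LookupError(f"no conversion from %{src} to %{dst}")


MARKS = {
    "foo": {"grow": {"mid": lambda v: v + 1}, "jump": {"bar": "mid"}},
    "mid": {"grow": {"bar": lambda v: v * 10}},
    "bar": {},
}
print(build_cast(MARKS, "foo", "bar")(1))  # 20
```

In this example %foo has no direct route to %bar, so the converter is composed through the %mid intermediary named in its jump table.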

4.2 Ford Runes

There are now only seven Ford runes. A ﬁle can contain zero, one, or many of each, but each Ford expression can only be one line, and they must be in the standard order of /-s, /+s, /=s, and then /*s.

The /- rune imports a structures ﬁle from /sur. You can import it as just foo, in which case the build result of that ﬁle (usually a core with mold deﬁnitions) will be pinned into the compilation subject with the face foo. If you preﬁx it with a * as in *bar, the result will be pinned into the subject with no face; if the structures ﬁle compiled to a core, this exposes all the arms into the namespace of the compilation subject. Finally, if you import it as baz=qux, the baz face will be applied instead of qux. This is similar to “import as” in other languages.

The /+ rune imports a library ﬁle from /lib. Aside from the diﬀerent source folder, the syntax and semantics are the same as for /-.
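The three import forms shared by /- and /+ can be sketched in Python, modeling the compilation subject as a dictionary of names. This is only an analogy, since a real subject is a typed noun; `parse_import` and `pin` are hypothetical helpers.

```python
def parse_import(spec):
    """Return (face, file_name); face is None when the import is unfaced."""
    if spec.startswith("*"):
        return None, spec[1:]
    if "=" in spec:
        face, name = spec.split("=", 1)
        return face, name
    return spec, spec


def pin(subject, spec, built):
    """Pin one import onto a dict-shaped stand-in for the compilation subject."""
    face, name = parse_import(spec)
    result = built[name]
    updated = dict(subject)
    if face is None:
        updated.update(result)   # expose the core's arms directly
    else:
        updated[face] = result   # wrap the whole build result in a face
    return updated


built = {"foo": {"arm-a": 1}, "bar": {"arm-b": 2}, "qux": {"arm-c": 3}}
subject = {}
for spec in ["foo", "*bar", "baz=qux"]:
    subject = pin(subject, spec, built)
print(sorted(subject))  # ['arm-b', 'baz', 'foo']
```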

The /= rune imports the result of building a Hoon file from a user-specified path (the second argument), wrapping it in a face specified by the first argument. The final /hoon at the end of the path must be omitted. This is mostly useful for importing a file for testing. The file at the specified path will be built as a normal userspace Hoon file; i.e., its compilation subject will be Zuse augmented with the results of any Ford runes it has at the top of the file.

The /* rune imports the contents of a ﬁle in the desk, speciﬁed as the third argument with the full path including the trailing mark, converted to the mark speciﬁed by the second argument, and pinned into the compilation subject wrapped in the face speciﬁed by the ﬁrst argument. This can be used to import static data at build-time, such as a data ﬁle, a media ﬁle, or, in the case of this example, a Hoon ﬁle as source text rather than already built.

A valid userspace Hoon ﬁle must contain a nonempty list of hoons (Hoon source expressions) below the Ford runes, separated by gap (more than one space, or at least one newline). The system wraps this list of hoons in a =~ expression so that the result of the previous hoon is used as the subject of the next hoon. The result of the Ford runes is used as the compilation subject for this =~ hoon; informally, the shape of the compilation subject can be thought of as:

The /$ rune imports a mark conversion gate between two types. These are marks on the same desk as the ﬁle.

The /~ rune imports, builds, evaluates, and pins the results of many Hoon files in a directory. Each Hoon file in the specified directory will be built and evaluated. The result of evaluating each file will be added to a ++map and pinned with the specified face some-face. The keys of the map will be the names of the files, and the values will be the results of evaluating each file and casting its result to the specified type some-type.

All of the Hoon files in the specified directory, when evaluated, must produce data of a type that nests under the specified type some-type. Files with a mark other than %hoon will be ignored.

The /% rune imports a mark deﬁnition from the /mar directory. The mark deﬁnition will be built and pinned with the speciﬁed face some-face.

5 Future Work

Urbit still needs to make better use of desks other than %base, and the development process should be adjusted given the tighter coupling between source code and kernel and the tighter criteria for accepting an update.

This work also hopefully provides a good foundation for a package management and software distribution system for Urbit. As ~wicdev-wisryt has said, a user should be able to run |install ~norsyr-torryn %canvas to load and build remote source. No one should experience dependency hell on Urbit, but we’re not there yet.

At least now, building a desk has no dependencies, other than a Ford with a compatible Hoon compiler. No decisions have been made on this yet, but Ford might get moved to inside the desk, possibly by making Zuse callable. This could allow a desk to expose a Nock interface in addition to a typed Hoon interface, which could even let a desk be used as a “pill” bootloader.

6 Conclusion

~littel-ponnys and I (~rovnys-ricfer) spent most of 2018 rewriting Ford with the intent of improving its performance. Compared to its predecessor, its result was better in some ways but worse in others. The caching system was labyrinthine and poorly factored, making the system diﬃcult to debug or prove correct, even informally. Some things were faster, but the caching and dependency tracking were actually complex enough that a number of common operations, like mark conversion, were too slow.

In early 2020, ~master-morzod suggested moving Ford into Clay to reduce asynchronicity. It seemed absurd at ﬁrst, but at some point I realized I could combine that idea with a simpler build-caching scheme and self-contained desk builds, and ~wicdev-wisryt realized he could use that to further simplify Clay’s commit and merge code, which he did as part of this project.

The ﬁrst time I rewrote Ford, it took me six months, with help from ~littel-ponnys, and it weighed in at 6,000 lines of code. The second time, in late 2018, took a few weeks. The third time, in January 2020, took a week. I wrote +ford in Ford Fusion in one long day, and it’s about 500 lines of synchronous, functional code.

It has taken me two or three years to understand this problem as well as I do, and I expect there are parts of it I still don’t understand. The code itself isn’t the issue; it’s ﬁnding the right answer to ontological and teleological questions. What is Ford? What will it be in a hundred years? I’m conﬁdent Ford Fusion is more similar than its predecessor to the Ford of 2120, because it’s smaller, more functional, and easier to understand and administer.

As an engineering discipline and organizational practice, working on a system intended to be frozen yields surprising simpliﬁcations like this every so often. Urbit is now reaching the point where we’re starting to see more of the obsidian edges of the frozen future system emerge from the lava.

¹Including agent state upgrade handling.