A revision control system[1] is responsible for tracking the state and history of assets and asset changes within a particular continuity. It may also manage the production of assets through a build system, as used for instance by a continuous integration/continuous deployment (CI/CD) workflow. Trivially, an rcs is a logical filesystem: a database for managing files. (Conventional rcss are agnostic to the underlying disk hardware, however.[2]) Computer revision control systems have developed from the 1960s onwards, once hard drives permitted files to be stored on disk rather than simply as punchcard decks. Succeeding generations have introduced new approaches and features, converging more or less on the broad functionality afforded today by Git (Torvalds, 2005) in its elaborated form as a distributed source control system.[3] Much care has been taken to treat the problem of dependencies (e. g. Ball et al. (2015)) and to build efficient user interfaces (e. g. GitHub, GitLab, etc.). However, although some modest interest has been expressed in a more complete notion of type in revision control, no current system known to the authors besides Urbit’s Clay fully implements a typed revision-controlled filesystem. In this article, we explore the approach of current rcss and elaborate Clay’s contributions to the ongoing task of information management and auditable, reproducible source builds.
The problem of type in revision control has been of both academic and practical interest for decades (Perry, 1987). Essentially, can file artifacts be organized and classified in such a way as to make them amenable to comparison across their history? The simplest arrangement which is commonly made is to designate a file as either plaintext or binary. A plaintext file can be updated by specifying the character offset and run length to be replaced (an example of a “diff”, or difference between two versions of a text; typically these are resolved at the line level rather than the character level). Plaintext files, whether ascii or some form of utf, are defined by a regular format with well-understood and predictable symbol widths and offsets. For instance, a little-endian representation of the text ‘clay’ would be reproduced in ascii hexadecimal as 0x7961.6c63 and ascii binary as 0b111.1001.0110.0001.0110.1100.0110.0011. (In binary, each character has a leading zero; that is, each entry is seven bits wide plus a 1-bit zero header. The leading zero of the most significant character is dropped when the whole value is written as a number.) In contrast, binary files (a byword for “everything else”) have arbitrary layout and offset; it may be difficult to determine how and why to compare two revisions of any given file. Many rcss simply treat all registered data as byte streams.
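The little-endian figures quoted above can be checked mechanically. The following Python sketch is illustrative only; the helper name `dotted` is ours, and the dot grouping simply imitates the notation used in the text.

```python
# Pack the ASCII bytes of "clay" into one integer, least-significant byte first.
text = "clay"
value = int.from_bytes(text.encode("ascii"), byteorder="little")


def dotted(digits: str, group: int) -> str:
    """Insert a dot every `group` digits, counting from the right."""
    chunks = []
    while digits:
        chunks.append(digits[-group:])
        digits = digits[:-group]
    return ".".join(reversed(chunks))


hex_form = "0x" + dotted(format(value, "x"), 4)
bin_form = "0b" + dotted(format(value, "b"), 4)

print(hex_form)  # 0x7961.6c63
print(bin_form)  # 0b111.1001.0110.0001.0110.1100.0110.0011
```

The first character, `c` (0x63), lands in the least significant byte, which is why it appears at the right-hand end of both forms.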
All modern rcss support plaintext and binary representations of files. Plaintext diffing at the level of line changes is straightforward; the Hunt–McIlroy algorithm is often employed (Hunt and McIlroy, 1976).[4] However, the most popular revision control systems (such as Subversion and Git) have included only minimal direct support for binary diffs. Some systems only support storing successive copies of a binary file; that is, the diff is the entire deletion of the old file and addition of the new file (e. g. cvs). Other systems may support binary file diffs by granular block and offset. Some rcss support diffing particular binary files using a custom conversion or merge tool. For example, Git’s textconv filter converts each version of the file to a plaintext representation and compares those representations. (To show a diff for a change in a pdf file, a configured Git instance could compare the text output of pdfinfo in each case as a proxy. This is generally legible as metadata but does not of itself include sufficient information to reproduce or reverse the intervening changes; the underlying rcs still stores the entire binary in its conventional representation.) In any case, type beyond file extension is not generally supported by rcss; they cannot meaningfully merge files classified as binary.[5]
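To make the shape of a line-level diff concrete, here is a small Python sketch using the standard library’s `difflib`. Note that `difflib` uses its own sequence-matching heuristic rather than the Hunt–McIlroy algorithm; the file names and sample lines are invented for illustration.

```python
import difflib

old = ["alpha\n", "beta\n", "gamma\n"]
new = ["alpha\n", "BETA\n", "gamma\n", "delta\n"]

# A unified diff records only the changed lines plus some surrounding context.
diff = list(difflib.unified_diff(old, new, fromfile="a.txt", tofile="b.txt"))
print("".join(diff), end="")

# Keep only the added/removed lines, skipping the two file-header lines.
changed = [
    line for line in diff
    if line[:1] in "+-" and not line.startswith(("+++", "---"))
]
```

A binary file offers no comparable notion of a “line”, which is why the same approach degrades to whole-file replacement for binaries.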
A general-purpose notion of file type is not isomorphic to conventional type in programming language theory. For instance, consider a json file. The following are equivalent as json artifacts:
{ "name": "Alice", "age": 30 }
{ "age": 30, "name": "Alice" }
{"name":"Alice","age":30}
Indeed, since json is structurally agnostic to syntactic whitespace, an infinite number of equivalent files could be constructed. Clearly, as plaintext representations, each of these differs, yet they are fully equivalent as json artifacts. An rcs which can recognize a json input can meaningfully store them as equivalent, while an rcs that makes only the trivial distinction between plaintext and binary must consider them all to be different; nonsemantic changes in structure may trigger spurious diffs in the file’s history. Furthermore, a json-aware rcs could meaningfully merge these files whereas a plaintext/binary rcs could not. Finally, a json-aware rcs could yield a different plaintext output when asked to produce the file, as it could normalize the whitespace to a canonical form. This may be considered by other systems to be an “incorrect” round trip.
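The distinction between textual and structural equality can be sketched in a few lines of Python. The `canonical` helper and its choice of canonical form (sorted keys, no optional whitespace) are our assumptions for illustration, not a description of any particular rcs.

```python
import json

# Three textual variants of one JSON object: reordered keys and different whitespace.
variants = [
    '{ "name": "Alice", "age": 30 }',
    '{ "age": 30, "name": "Alice" }',
    '{\n  "name": "Alice",\n  "age": 30\n}',
]

parsed = [json.loads(v) for v in variants]


def canonical(value) -> str:
    """Serialize with sorted keys and no optional whitespace."""
    return json.dumps(value, sort_keys=True, separators=(",", ":"))


canon = {canonical(p) for p in parsed}

print(len(set(variants)))  # 3 distinct plaintext files
print(canon)               # a single canonical form
```

Emitting the canonical form on read is exactly the kind of “incorrect” round trip described above: the bytes returned may differ from the bytes stored, while the artifact is unchanged.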
Thus we must consider two intersecting notions of file type:
The file type as understood by the system, typically based on bitwise reproducibility (e. g. plaintext, binary).
The file type as understood conceptually, typically based on structural relationships (e. g. json, xml/html, Word docx, etc.).
The employment of an intermediate representation allows us to track changes to files and manage their history (that is, their state) more effectively. Consider the popular JavaScript framework React (Meta, 2013), aspects of which were developed to resolve difficulties with slow repainting in browsers (for a popular explanation, see Baer (2018), chapter 1). React’s virtual Document Object Model (dom) system acts as an intermediate representation between the actual dom and the developer’s code. When a developer updates the virtual dom, the framework compares the new virtual dom with the previous virtual dom and updates the real dom only if necessary. This allows React to optimize the rendering process and improve performance. It also means that we can consider the structural representation of the dom object as an intermediate representation, and the production and application of the diff as a type-aware rcs operation. Higher-level interfaces like Microsoft Word’s “Track Changes” feature (Microsoft, 2024) similarly resolve diffs across text and formatting within a document, an artifact processed from a file into a user-friendly representation. While the “Track Changes” user interface is optimized for interactivity, conceptually it is quite similar to conventional merge tools from rcss, and de facto resolves the same sorts of diffs for its particular file type.[6] Furthermore, the “Source Code in Database” (scid) approach parses all code into a database as an intermediate representation; this shares some features with a hypothetical type-aware rcs. Thus a significant part of managing typed file data is deciding what the input, output, and diffs should look like, and the remainder is systematic application of these rules to the data.
Given this frame, consider Perry’s taxonomy of version control for module interfaces:
No version control. Files are simply replaced.
Basic version control. Files are versioned with a number, but no diff is applied.
Strongly-typed version control. Files are resolved at a structural level, such as procedures or arrays, and diffs are applied at that level.
In this scheme, a contemporary Git-style rcs is a “weakly-typed” rcs: the system is aware of versions and diffs, but typically not below the file or line level.[7] Perry was concerned with the behavior and representation of semantic objects, rather than files or data:
Version equivalence in this type of version control mechanism is defined in terms of syntactic equivalence: data objects are equivalent if their types or structures are equivalent; operations and modules are equivalent if their signatures are equivalent. ( Perry, 1987)
Strongly-typed version control requires coupling to the build system, since the system must be aware of the structure of the data in order to apply diffs. (Indeed, ~rovnys-ricfer (2024), pp. 6–7, in ustj vol. 1 iss. 1, pp. 1–46, discussed one solution to the linking problem which is well-suited to Urbit’s global namespace.)
Urbit provides the Clay vane at /sys/vane/clay to manage Unix-style files and source builds (cf. ~sorreg-namtyv (2016)). Clay acts as both the Unix-like filesystem and the source control vane. While Hoon itself (/sys/hoon) is responsible for compiling a source noun into a Hoon abstract syntax tree (ast) and ultimately into executable Nock, the actual construction of source input from files is delegated to the ++ford arm of Clay. This build process includes compiling the correct import files (such as libraries and shared structure files). Clay is also tightly coupled to the userspace vane Gall, since system updates and agent updates both occur via Clay.[8]
To motivate how Clay functions as a revision control filesystem, we need to briefly examine some major data structures utilized by Clay. At the highest level, files are organized into “desks”, described in previous literature as being analogous to Git branches but really more like Git repositories. A desk is a self-contained continuity of files; its state is the result of a history of commits. Desks may not include links to off-desk resources. Each desk must expose a few files at definite paths; primarily /mar for the marks used to read the desk contents.[9]
Ultimately, Arvo and Clay deal in nouns. A noun is either an atom (an unsigned integer of any size) or a cell (an ordered pair of nouns); that is, a noun is a binary tree whose leaves are unsigned integers. Nouns may be evaluated as Nock formulas or manipulated as data. A noun is formally agnostic to the underlying hardware and is almost always accessed via some representation path (i. e. as phenomenon). A noun serves as a universally legible intermediate representation for all data and code.
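As a rough illustration of the noun as a data structure, the Python sketch below models atoms as non-negative integers and cells as 2-tuples. This encoding, and the helper names `is_noun` and `leaves`, are our own assumptions for exposition; they say nothing about how an Urbit runtime actually lays nouns out in memory.

```python
from typing import List, Tuple, Union

# An atom is a non-negative int; a cell is a pair of nouns.
Noun = Union[int, Tuple["Noun", "Noun"]]


def is_noun(x) -> bool:
    """Check that x is an atom or a cell of nouns, recursively."""
    if isinstance(x, bool):  # bool is a subclass of int in Python; exclude it
        return False
    if isinstance(x, int):
        return x >= 0
    return (
        isinstance(x, tuple)
        and len(x) == 2
        and is_noun(x[0])
        and is_noun(x[1])
    )


def leaves(noun: Noun) -> List[int]:
    """List the atoms of a noun from left to right."""
    if isinstance(noun, int):
        return [noun]
    head, tail = noun
    return leaves(head) + leaves(tail)


example = (1, (2, 3))
print(is_noun(example))  # True
print(leaves(example))   # [1, 2, 3]
```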
Urbit has an immutable data model; the system state of Arvo is the unique result of the events in its history, stored as its event log. If a noun changes, it is the result of definite discrete changes to the system state. This yields very nice properties of referential transparency and has some ramifications in library management; see ~rovnys-ricfer and ~wicdev-wisryt (2024) for more details.
Typically, a noun is altered either directly via subject-modifying code (%~ censig, =. tisdot) or via a mark transformation, on which more later.
While conventional file systems denote file type primarily by file suffix or a cursory search using “magic tests” on the file header, Urbit instead stores all data as a noun accessed via a mark. A mark is essentially a representation rule for nouns. Marks permit nouns to be stored with more granularity than the text/binary dichotomy facilitates, since details of conversion are stored as part of the mark.
A mark may be considered an executable mime type.[10] A mark essentially describes how to map a data representation to a noun, and from a noun back out to a data representation. This conversion may be trivial (e. g. the binary storage of an audio file) or sophisticated (from a text stream to a linked-list utf representation). Data are accessed via a particular mark, and the appropriate conversion routine is invoked.[11] Clay is capable of searching for transition paths between mark representations, and in principle permits conversion between all compatible representations (if there is a path from Markdown md to rich text rtf to plaintext txt in marks, then the conversion from md to txt is automatically supported by Clay).
A mark is a term (symbol) which is the Urbit equivalent of a mime type, if mime types were names of typed validation functions. (~sorreg-namtyv et al. (2016), p. 51)
Unix filesystem mounting and committing involves synchronizing a desk’s state with a Unix logical representation of the files involved. The actual file data are imported as %mime-typed nouns and then converted to their target mark as file type.
Each mark defines several standard arms:
++grab converts from other marks (by arm name) to the given mark.
++grow converts from the mark to other marks.
++grad defines the diff applicators for the mark.
Mark conversions need not be symmetrical, nor must they round-trip exactly. A json input may yield an equivalent json output that differs in structural whitespace. Clay is also proactive in searching for possible mark paths: if an immediate conversion is not available, then a mutually available intermediary form may be used. To implement a custom filetype, all that is required is the production of a suitable mark file. Conversions from the filesystem require at a minimum a mime mark file, which acts as the global intermediary from the host OS.
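The idea of reaching a target mark through intermediaries can be sketched as a graph search. The Python below is a hypothetical model, not Clay’s algorithm: the `CONVERSIONS` table, the toy converter bodies, and the functions `find_path` and `convert` are all invented here, and real marks convert nouns via ++grab and ++grow rather than Python strings.

```python
from collections import deque
from typing import Callable, Dict, List, Optional, Tuple

# (source mark, target mark) -> converter. The bodies are toy stand-ins.
CONVERSIONS: Dict[Tuple[str, str], Callable[[str], str]] = {
    ("md", "rtf"): lambda s: s.replace("**", ""),  # pretend to drop bold markers
    ("rtf", "txt"): lambda s: s.strip(),           # pretend to drop layout padding
}


def find_path(src: str, dst: str) -> Optional[List[str]]:
    """Breadth-first search for the shortest chain of marks from src to dst."""
    if src == dst:
        return [src]
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        for (a, b) in CONVERSIONS:
            if a == path[-1] and b not in seen:
                if b == dst:
                    return path + [b]
                seen.add(b)
                queue.append(path + [b])
    return None


def convert(data: str, src: str, dst: str) -> str:
    """Apply each converter along the discovered path in turn."""
    path = find_path(src, dst)
    if path is None:
        raise ValueError(f"no conversion path from {src} to {dst}")
    for a, b in zip(path, path[1:]):
        data = CONVERSIONS[(a, b)](data)
    return data


print(find_path("md", "txt"))               # ['md', 'rtf', 'txt']
print(convert("  **hi**  ", "md", "txt"))   # hi
```

Because the table only lists one direction for each pair, `find_path("txt", "md")` finds nothing, which mirrors the point that conversions need not be symmetrical.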
A desk is a filesystem continuity,[12] and represents the main way that data are synchronized between Urbit instances (ships). The state of a desk is a history of sequential commits, with each commit after the first having at least one parent commit. Desks are often analogized to Git repositories.[13] Each desk must contain sufficient internal data, especially marks, to produce its own code, given Hoon and the standard library (which are always available).
Since Clay is a global filesystem, it has to maintain information about both local and remote desks. A local desk’s state maintains information about noun data, build and mark cache state, userspace app state, various policies, and synchronization with the host OS filesystem.
::  Domestic desk state.
+$  dojo
  $:  qyx=cult                                ::  subscribers
      dom=dome                                ::  desk state
      per=regs                                ::  read perms per path
      pew=regs                                ::  write perms per path
      fiz=melt                                ::  state for mega merges
  ==
::  Desk state.
+$  dome
  $:  let=aeon                                ::  top id
      hit=(map aeon tako)                     ::  versions by id
      lab=(map @tas aeon)                     ::  labels
      tom=(map tako norm)                     ::  tomb policies
      nor=norm                                ::  default policy
      mim=(map path mime)                     ::  mime cache
      fod=flue                                ::  ford cache
      wic=(map weft yoki)                     ::  commit-in-waiting
      liv=zest                                ::  running agents
      ren=rein                                ::  force agents on/off
  ==
::  Support types
+$  mizu  [p=@u q=(map @ud tako) r=rang]      ::  new state
+$  rang                                      ::  repository
  $+  rang
  $:  hut=(map tako yaki)                     ::  changes
      lat=(map lobe page)                     ::  data
  ==
The keystone structure is hit, which connects a given version of the desk (its aeon, which is merely a @ud) to the tree of associated commits. These are accessed from the “slush pool” of known commits via hut in Clay’s rang, which is not associated with a particular local desk.
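The chain of lookups from a desk version to file data can be sketched as follows. This Python model is a loose illustration: the field names `let`, `hit`, `hut`, and `lat` follow the listing above, but the layout of the `Yaki` commit record, the use of short strings in place of hashes, and the function `read_at` are our assumptions.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Yaki:
    """A commit: parent commits, a snapshot of files, and a timestamp."""
    parents: List[str]         # parent takos
    files: Dict[str, str]      # path -> lobe (content hash)
    when: int                  # stand-in for a @da timestamp


@dataclass
class Dome:
    let: int = 0                                         # latest aeon
    hit: Dict[int, str] = field(default_factory=dict)    # aeon -> tako


@dataclass
class Rang:
    hut: Dict[str, Yaki] = field(default_factory=dict)   # tako -> commit
    lat: Dict[str, bytes] = field(default_factory=dict)  # lobe -> data


def read_at(dome: Dome, rang: Rang, aeon: int, path: str) -> bytes:
    """Follow aeon -> tako -> yaki -> lobe -> data."""
    tako = dome.hit[aeon]
    yaki = rang.hut[tako]
    lobe = yaki.files[path]
    return rang.lat[lobe]


# The commit pool is shared; the desk only holds references into it.
rang = Rang(
    hut={
        "t1": Yaki([], {"/readme/txt": "l1"}, 100),
        "t2": Yaki(["t1"], {"/readme/txt": "l2"}, 200),
    },
    lat={"l1": b"hello", "l2": b"hello, world"},
)
dome = Dome(let=2, hit={1: "t1", 2: "t2"})

print(read_at(dome, rang, 1, "/readme/txt"))         # b'hello'
print(read_at(dome, rang, dome.let, "/readme/txt"))  # b'hello, world'
```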
Much of Clay’s machinery is intended to support noun changes propagated as commits, or desk changes. These may arise from three sources: a local host OS filesystem sync, a remote desk sync, or an on-Urbit local edit. A commit is a snapshot of a set of files with a list of parent commits and a timestamp.
Various merge strategies are available to support desk updating. Given a commit as a timestamped snapshot of files, how should two conflicting histories be reconciled? For instance, %fine is a Git-style “fast-forward” for use when histories nest but one is “ahead”. %meet combines changes as long as the same file/noun has not been modified in both histories. (Several others are available and are described in the system documentation.) While Clay has at times supported noun diffs in commits, it currently simply stores the entire file at each commit.[14]
Urbit developers have not so far encountered practical difficulties with desk and commit management due to this limitation, but resolution using mark diffs will eventually be desirable.
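The two strategies named above can be approximated in a short Python sketch. It follows only the prose description given here and is not Clay’s implementation; the function names, the `parents` map, and the use of plain strings for commit identifiers and content hashes are assumptions.

```python
from typing import Dict, List, Optional, Set


def is_ancestor(parents: Dict[str, List[str]], old: str, new: str) -> bool:
    """True if `old` is reachable from `new` by following parent links."""
    stack, seen = [new], set()
    while stack:
        commit = stack.pop()
        if commit == old:
            return True
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents.get(commit, []))
    return False


def fine(parents: Dict[str, List[str]], ours: str, theirs: str) -> Optional[str]:
    """Fast-forward: succeed only when our head is an ancestor of theirs."""
    return theirs if is_ancestor(parents, ours, theirs) else None


def changed(base: Dict[str, str], tip: Dict[str, str]) -> Set[str]:
    """Paths added, removed, or modified between base and tip."""
    return {p for p in set(base) | set(tip) if base.get(p) != tip.get(p)}


def meet(base: Dict[str, str], ours: Dict[str, str],
         theirs: Dict[str, str]) -> Dict[str, str]:
    """Combine both sides' changes, refusing if any path changed on both."""
    mine, yours = changed(base, ours), changed(base, theirs)
    if mine & yours:
        raise ValueError(f"conflict on {sorted(mine & yours)}")
    merged = dict(base)
    for paths, tip in ((mine, ours), (yours, theirs)):
        for p in paths:
            if p in tip:
                merged[p] = tip[p]
            else:
                merged.pop(p, None)
    return merged


history = {"c1": [], "c2": ["c1"], "c3": ["c2"], "d2": ["c1"]}
base = {"/a": "1", "/b": "1"}
ours = {"/a": "2", "/b": "1"}
theirs = {"/a": "1", "/b": "1", "/c": "1"}

print(fine(history, "c1", "c3"))  # c3
print(fine(history, "d2", "c3"))  # None
print(meet(base, ours, theirs))   # {'/a': '2', '/b': '1', '/c': '1'}
```

Because whole files are stored at each commit, comparing content hashes per path is enough to decide whether two histories touched the same file; merging within a file would require the mark-level diffs discussed above.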
Structurally, Clay tracks desk state as a sequential +$aeon or @ud, but files may be arbitrarily accessed using a +$case:
+$  case
  $%  [%da p=@da]                             ::  %da: date
      [%tas p=@tas]                           ::  %tas: label
      [%ud p=@ud]                             ::  %ud: sequence
      [%uv p=@uv]                             ::  %uv: hash
  ==
A case resolves into the corresponding aeon via the canonical timestamp of the commit; the label map; or the hash of the corresponding commit. A userspace system to support detailed commit messages, called Story, is available at @urbit/story.
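A sketch of case resolution in Python may help fix the idea. It is hypothetical throughout: cases are modeled as `(tag, value)` tuples, `hit` and `lab` follow the desk-state field names, `when` is an invented map from commit to timestamp, and the rule that a date resolves to the latest commit at or before that date is our assumption rather than something stated in the text.

```python
from typing import Dict, Optional, Tuple, Union

Case = Tuple[str, Union[int, str]]


def resolve(case: Case, hit: Dict[int, str], lab: Dict[str, int],
            when: Dict[str, int]) -> Optional[int]:
    """Map a case onto an aeon (revision number), or None if nothing matches."""
    kind, value = case
    if kind == "ud":    # sequence number: already an aeon, if it exists
        return value if value in hit else None
    if kind == "tas":   # label: look it up in the label map
        return lab.get(value)
    if kind == "uv":    # hash: find the aeon pointing at that commit
        return next((a for a, tako in hit.items() if tako == value), None)
    if kind == "da":    # date: latest aeon committed at or before that time (assumed)
        older = [a for a, tako in hit.items() if when[tako] <= value]
        return max(older, default=None)
    raise ValueError(f"unknown case tag: {kind}")


hit = {1: "aaa", 2: "bbb"}          # aeon -> commit hash (shortened)
lab = {"release": 1}                # label -> aeon
when = {"aaa": 100, "bbb": 200}     # commit hash -> timestamp

print(resolve(("ud", 2), hit, lab, when))           # 2
print(resolve(("tas", "release"), hit, lab, when))  # 1
print(resolve(("uv", "bbb"), hit, lab, when))       # 2
print(resolve(("da", 150), hit, lab, when))         # 1
```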
As a vane of Arvo, Clay lives “on Mars”; that is, it describes a logical filesystem inside of the Urbit ship but does not directly map its nouns to physical hardware. The particular implementation used to store the total state of Arvo, which contains Clay’s state including file data, depends on the runtime. At the time of writing, the Vere runtime uses the Lightning Memory-Mapped Database Manager (lmdb), which is a B-tree representation for a key-value store (Chu, 2011). While this is of practical consequence for lookup time and system behavior, the architecture of Clay as seen by an Urbit developer is independent of the runtime implementation.
The posix-compliant host OS supports a filesystem mirror of Clay files organized by desk. Importantly, while these files can be synced to Clay, they are not the canonical reference for each (which lives in lmdb). Clay maintains a “sync duct” with the runtime, which upon receipt of a %dirk task produces a list of file changes each way and implements the latest changes.[15]
In principle, as we have intimated, an rcs may abstract away from the file as the fundamental artifact being tracked. While Clay does not fully commit to this, in a concession to its posix-compliant host OS, it gestures towards the possibility of conceiving of build sources and artifacts as objects (nouns) other than files. Since Urbit “unifies” nouns (that is, it structurally stores only one copy of a noun and maintains references to that shared value), repetition of critical resources across desks does not lead to filesystem bloat. (See ~rovnys-ricfer and ~wicdev-wisryt (2024) in ustj vol. 1 iss. 1, pp. 75–82, for details.)
Clay is also directly responsible for building code. Formerly, Urbit built code from source files using the Ford vane. In 2020, the core developers realized that integrating the build system directly into the rcs solved a number of problems with updating the system correctly. Updates to the system should be atomic (complete or fail in one transaction); self-contained (no implicit dependencies or dependencies on previous system states); and ordered (sequenced within the system stack). By coupling the build system into the rcs, the system can be updated in a single transaction, with the build system ensuring that the system is always in a consistent state. In a typed rcs, the compiler or build system must always be at hand to process diffs in artifacts. Although of relatively minor consequence to userspace application developers, the integration of ++ford into Clay moves the system closer to an integrated rcs for generic nouns.
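The atomicity requirement can be illustrated with a toy commit routine in which building is part of committing. This Python sketch is an analogy under stated assumptions: the `Desk` record and the `commit` and `try_commit` functions are invented, and Python’s built-in `compile` merely stands in for a real build step.

```python
from types import CodeType
from typing import Dict, NamedTuple, Tuple


class Desk(NamedTuple):
    files: Dict[str, str]        # path -> source text
    built: Dict[str, CodeType]   # path -> build artifact


def commit(desk: Desk, changes: Dict[str, str]) -> Desk:
    """Produce a new desk state only if every file builds."""
    files = {**desk.files, **changes}
    # Any build failure raises here, before a new state exists.
    built = {path: compile(src, path, "exec") for path, src in files.items()}
    return Desk(files, built)


def try_commit(desk: Desk, changes: Dict[str, str]) -> Tuple[Desk, bool]:
    """Return (new state, True) on success or (old state, False) on failure."""
    try:
        return commit(desk, changes), True
    except SyntaxError:
        return desk, False


empty = Desk({}, {})
desk1, ok1 = try_commit(empty, {"lib.py": "x = 1"})
desk2, ok2 = try_commit(desk1, {"app.py": "y = "})  # does not build

print(ok1, sorted(desk1.files))  # True ['lib.py']
print(ok2, desk2 is desk1)       # False True
```

Because the old state is never mutated, a failed build leaves the desk exactly as it was, which is the sense of “complete or fail in one transaction” used above.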
As currently implemented, Clay presents several wrinkles in developer ergonomics. Chief among these is the way that marks, as simple type tags, are underspecified. The original philosophy of marks centered on network transmission of nouns as tagged data:
Like a mime type, the mark is just a label. There is no way to guarantee that the sender and receiver agree on what this label means. A noun which doesn’t normalize to itself is a packet drop. (~sorreg-namtyv et al. (2016), pp. 51–52)
In practice, different Clay desks and different Urbit versions can have different mark processors (and thus different behavior for the same tag). No simple version system tracks these; there is no global type registry (other than the scry namespace), and any given mark is simply identified by its desk and path. Developers have expressed a desire for a more robust mark management capability, and solutions have ranged from a global type registry (e. g. that proposed by Archetype) to using paths instead of simple tags for marks. In practice as of writing, marks are disambiguated by supplying the intended mark with the desk; this may lead to multiple different implementations of a mark being present on the same system, although uniquely identified by different Clay paths.
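The ambiguity described here can be made concrete with a toy registry keyed by desk and mark name. Everything in this Python sketch is hypothetical: the desk names `alpha` and `beta`, the `point` mark, the two normalizers, and the `receive` function are invented, and the “drop” rule simply follows the quoted principle that a value which does not normalize to itself is discarded.

```python
from typing import Callable, Dict, Optional, Tuple


def point2(value):
    """Desk alpha's idea of a point: a pair of integers."""
    x, y = value
    return (int(x), int(y))


def point3(value):
    """Desk beta's idea of a point: a triple of integers."""
    x, y, z = value
    return (int(x), int(y), int(z))


# The same tag names different normalizers on different desks.
MARKS: Dict[Tuple[str, str], Callable] = {
    ("alpha", "point"): point2,
    ("beta", "point"): point3,
}


def receive(desk: str, mark: str, value) -> Optional[tuple]:
    """Accept a value only if it normalizes to itself under that desk's mark."""
    normalize = MARKS[(desk, mark)]
    try:
        normal = normalize(value)
    except (TypeError, ValueError):
        return None  # wrong shape: dropped
    return value if normal == value else None


print(receive("alpha", "point", (1, 2)))    # (1, 2)
print(receive("beta", "point", (1, 2)))     # None: same tag, different meaning
print(receive("alpha", "point", ("1", 2)))  # None: does not normalize to itself
```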
The current architecture of Clay (largely in place by 2016, except for the unification of the former Ford vane into Clay in 2020) leads to three ongoing developer concerns:[16]
Implementing rcs behavior at the file system layer seems to be incorrect for Urbit (that is, too much of a sop to Unix). Clay attempts revision control at the base layer, but the file system is the wrong layer for it: what is wanted is a history of actions (a transaction log) rather than the full state at each revision (see migrev codebases for Clay state).
Desks seem to be the wrong abstraction in practice. The theory of the desk does not account for the distribution patterns people would actually like to use: one would prefer a single file tree in which any given spot is capable of acting as a desk (cf. chroot). Overuse of desks also makes Urbit feel less like a filesystem that one can explore.
The application system (userspace) and the build system and filesystem should be co-located. Currently, Clay and Gall are deeply entangled but must interact via a somewhat awkward message-passing interface. This leads to a number of ergonomic issues, such as the need to suspend a desk in order to suspend an agent. Employing paths rather than simple tags would lead to a desire to constrain what can exist at certain paths. This relates to the definition of files as data structures, and ultimately a noun-maximalist scenario abstracts more towards nouns as database entries rather than “files”.
One possible future for Clay sees it being restricted to source and build management, rather than expansion to a more fully-featured filesystem. In this contingency, the userspace management vane may take over conventional file storage, as Gall or a successor. This may only implement a filesystem interface for the host OS and external applications that “think” in files, such as web browsers.
Alternatively, the build system, rcs, and application engine fuse into a single userspace vane, which is a possible endpoint of the “shrubbery” project. Mark cores would be replaced with simpler conversion rules permitting only straightforward type casts between nouns as $-(from to) gates.[17]
In its current instantiation, Clay exhibits some notable characteristics as a typed revision control system. It is capable of tracking file changes at a structural level, permitting meaningful diffs and merges for files of the same mark. It is also capable of converting between marks, permitting a more flexible approach to file type than the conventional text/binary dichotomy of Git and other rcss. However, the future of userspace and code building is still in flux and may see a reworking of the Clay vane to better suit the needs of developers and users.
Baer, Eric (2018). What React Is and Why It Matters. Sebastopol, CA: O’Reilly Media. url: https://www.oreilly.com/library/view/what-react-is/9781491996744/ch01.html (visited on ~2024.8.12).
Ball, Thomas et al. (2015). “Beyond Open Source: The TouchDevelop Cloud-based Integrated Development Environment.” In: 2nd acm International Conference on Mobile Software Engineering and Systems. url: https://ieeexplore.ieee.org/document/7283033 (visited on ~2024.8.12).
Chu, Howard (2011) “lmdb”. url: http://www.lmdb.tech/doc/ (visited on ~2024.8.12).
Collins-Sussman, Ben, Brian W. Fitzpatrick, and C. Michael Pilato (2016). Version Control with Subversion. Sebastopol, CA: O’Reilly Media. url: https://svnbook.red-bean.com/en/1.8/svn.forcvs.binary-and-trans.html (visited on ~2024.8.12).
Hunt, James W. and M. Douglas McIlroy (1976). An Algorithm for Differential File Comparison. Computing Science Technical Report 41. Bell Laboratories. url: http://www.cs.dartmouth.edu/~doug/diff.pdf (visited on ~2024.8.12).
Meta (2013) “React”. url: https://react.dev/ (visited on ~2024.8.12).
Microsoft (2024) “Track changes in Word”. url: https://support.microsoft.com/en-us/office/track-changes-in-word-197ba630-0f5f-4a8e-9a77-3712475e806a (visited on ~2024.8.12).
Perry, D. E. (1987). “Version Control in the Inscape Environment.” In: Proceedings of the 9th International Conference on Software Engineering, pp. 142–149. url: https://users.ece.utexas.edu/~perry/work/papers/icse9b.pdf (visited on ~2024.8.12).
~rovnys-ricfer, Ted Blackman (2024). “The State of Urbit: Eight Years After the Whitepaper.” In: Urbit Systems Technical Journal 1.1, pp. 1–46.
~rovnys-ricfer, Ted Blackman and Philip C. Monk ~wicdev-wisryt (2024). “A Solution to Static vs. Dynamic Linking.” In: Urbit Systems Technical Journal 1.1, pp. 75–82.
~sorreg-namtyv, Curtis Yarvin (2016) “Toward a New Clay”. url: https://urbit.org/blog/toward-a-new-clay (visited on ~2024.8.12).
~sorreg-namtyv, Curtis Yarvin et al. (2016). Urbit: A Solid-State Interpreter. Whitepaper. Tlon Corporation. url: https://media.urbit.org/whitepaper.pdf (visited on ~2024.1.25).
Torvalds, Linus (2005) ““Initial revision of "git", the information manager from hell” (Git)”. url: https://github.com/git/git/commit/e83c5163316f89bfbde7d9ab23ca2e25604af290 (visited on ~2024.8.12).
[2] Compare other logical file systems: network file systems like Microsoft Server Message Block (smb) drives or distributed file systems like the Hadoop Distributed File System (hdfs) or the InterPlanetary File System (ipfs).
[3] Notably, Git has seemingly failed to fulfill its original promise of being a fully decentralized rcs: most users prefer to use the affordances of centralized Git services such as GitHub or GitLab.
[5] As always, there are exceptions to the rules. The Git rcs, for instance, can use a “binary diff” format on exact preimages. The Subversion rcs does provide an elaborated mime support system, wherein users can assign each file a mime type that Subversion can use in applying and displaying diffs. However, in practice this only results in the classic text/binary dichotomy with slightly more granularity. “Subversion treats all file data as literal byte strings, and files are always stored in the repository in an untranslated state” (Collins-Sussman, Fitzpatrick, and Pilato (2016), section “Binary Files and Translation”).
[6] Indeed, one can imagine a Word-aware command-line rcs that could meaningfully merge changes to a Word document, rather than simply storing successive copies of the file. The OpenDocument file formats (odt, ods, odg, etc.) consist of zip-compressed xml files, which would be highly amenable to type-aware merge tools at the command line.