Typed Revision Control

N. E. Davis ~lagrev-nocfep
Urbit Foundation

Abstract

Contents

1 Introduction
2 Type in Revision Control Systems
3 Clay
3.1 Revision Control and Marks
3.2 Desks
3.3 Runtime
3.4 Building Code
3.5 Shortcomings
4 Conclusion
References

1 Introduction

A revision control system1 is responsible for tracking the state and history of assets and asset changes within a particular continuity. It may also manage the production of assets through a build system, as used for instance by a continuous integration/continuous deployment (CI/CD) workflow. Trivially, an rcs is a logical filesystem: a database for managing files. (Conventional rcss are agnostic to the underlying disk hardware, however.2 ) Computer revision control systems have developed from the 1960s onwards once hard drives permitted files to be stored on tape rather than simply as punchcard files. Succeeding generations have introduced new approaches and features, converging more or less on the broad functionality today afforded by Git ( Torvalds, 2005) in its elaborated form as a distributed source control system.3 Much care has been taken to treat the problem of dependencies (e. g. Ball et al. (2015)) and efficient user interfaces (e. g. GitHub, GitLab, etc.). However, although some modest interest has been expressed for a more complete notion of type in revision control, no current system known to the authors besides Urbit’s Clay fully implements a typed revision-controlled filesystem. In this article, we explore the approach of current rcss and elaborate Clay’s contributions to the ongoing task of information management and auditable, reproducible source builds.

2 Type in Revision Control Systems

The problem of type in revision control has been of both academic and practical interest for decades ( Perry, 1987). Essentially, can file artifacts be organized and classified in such a way as to make them susceptible of comparison across their history? The simplest arrangement which is commonly made is to designate a file as either plaintext or binary. A plaintext file can be updated by specifying the character offset and run length to be replaced (an example of a “diff”, or difference between two examples of text; typically these are resolved at the line level rather than character level). Plaintext files, whether ascii or some form of utf, are defined by a regular format with well-understood and predictable symbol widths and offsets. For instance, a little-endian representation of the text ’clay’ would be reproduced in ascii hexadecimal as 0x7961.6c63 and ascii binary as 0b111.1001.0110.0001.0110.1100.0110.0011. (In binary, each character has a leading zero; that is, each entry is seven bits wide plus a 1-bit zero header.) In contrast, binary files (which is a byword for “everything else”) have arbitrary layout and offset; it may be difficult to determine how and why to compare two revisions of any given file. Many rcss simply treat all registered data as byte streams.

All modern rcss support plaintext and binary representations of files. Plaintext diffing at the level of line changes is straightforward; the Hunt–McIlroy algorithm is often employed ( Hunt and McIlroy, 1976).4 However, the most popular revision control systems (such as Subversion and Git) have included only minimal direct support for binary diffs. Some systems only support storing successive file copies of a binary file; that is, the diff is the entire deletion of the old file and addition of the new file (e. g. cvs). Other systems may support binary file diffs by granular block and offset. Some rcss support diffing particular binary files using a custom automatic merge tool. For example, the textconv tool converts each version of the file to a plaintext representation and compares those representations. (To show a diff for a changes in a pdf file, a configured Git instance could compare the text output of pdfinfo in each case as a proxy. This is generally legible as metadata but does not of itself include sufficient information to reproduce or reverse the intervening changes; the underlying rcs still stores the entire binary in its conventional representation.) In any case, type beyond file extension is not generally supported by rcss; they cannot meaningfully merge files classified as binary.5

A general-purpose notion of file type is not isomorphic to conventional type in programming language theory. For instance, consider a json file. The following are equivalent as json artifacts:

{ 
  "name": "Alice", 
  "age": 30 
}
{ 
  "age": 30, 
  "name": "Alice" 
}
{ "name": "Alice", "age": 30, }

Indeed, since json is structurally agnostic to syntactic whitespace, an infinite number of equivalent files could be constructed. Clearly as plaintext representations, each of these differ, yet they are fully equivalent as json artifacts. An rcs which can recognize a json input can meaningfully store them as equivalent, while an rcs that makes only the trivial distinction between plaintext and binary must consider them all to be different—nonsemantic changes in structure may trigger spurious diffs in the file’s history. Furthermore, a json-aware rcs could meaningfully merge these files whereas a plaintext/binary rcs could not. Finally, a json-aware rcs could yield a different plaintext output when asked to produce the file, as it could normalize the whitespace to a canonical form. This may be considered by other systems to be an “incorrect” round trip.

Thus we must consider two intersecting notions of file type:

  1. The file type as understood by the system, typically based on bitwise reproducibility (e. g. plaintext, binary).

  2. The file type as understood conceptually, typically based on structural relationships (e. g. json, xml/html, Word docx, etc.).

The employment of an intermediate representation allows us to track changes to files and manage their history—state—more effectively. Consider the popular JavaScript framework React ( Meta, 2013), aspects of which were developed to resolve difficulties with slow repainting in browsers (for a popular explanation, see Baer (2018), chapter 1). React’s virtual Document Object Model (dom) system acts as an intermediate representation between the actual dom and the developer’s code. When a developer updates the virtual dom, the framework compares the new virtual dom with the previous virtual dom and updates the real dom only if necessary. This allows React to optimize the rendering process and improve performance. It also means that we can consider the structural representation of the dom object as an intermediate representation, and the production and application of the diff as a type-aware rcs operation. Higher-level interfaces like Microsoft Word’s “Track Changes” feature ( Microsoft, 2024) similarly resolves diffs across text and formatting within a document, an artifact processed from a file into a user-friendly representation. While the “Track Changes” user interface is optimized for interactivity, conceptually it is quite similar to conventional merge tools from rcss, and de facto resolves the same sorts of diffs for its particular file type.6 Furthermore, the “Source Code in Database” (scid) approach parses all code into a database as an intermediate representation; this shares some features with a hypothetical type-aware rcs. Thus a significant part of managing typed file data is deciding what the input, output, and diffs should look like, and the remainder is systematic application of these rules to the data.

Given this frame, consider Perry’s taxonomy of version control for module interfaces:

  1. No version control. Files are simply replaced.

  2. Basic version control. Files are versioned with a number, but no diff is applied.

  3. Strongly-typed version control. Files are resolved at a structural level, such as procedures or arrays, and diffs are applied at that level.

In this scheme, contemporary Git-style rcs is a “weakly-typed” rcs: the system is aware of versions and diffs, but typically not below the file or line level.7 Perry was concerned with the behavior and representation of semantic objects, rather than files or data:

Version equivalence in this type of version control mechanism is defined in terms of syntactic equivalence: data objects are equivalent if their types or structures are equivalent; operations and modules are equivalent if their signatures are equivalent. ( Perry, 1987)

Strongly-typed version control requires coupling to the build system, since the system must be aware of the structure of the data in order to apply diffs. (Indeed, ~rovnys-ricfer (2024), pp. 6–7, in ustj vol. 1 iss. 1, pp. 1—46, discussed one solution to the linking problem which is well-suited to Urbit’s global namespace.)

3 Clay

3.1 Revision Control and Marks

Urbit provides the Clay vane at /sys/vane/clay to manage Unix-style files and source builds (cf. ~sorreg-namtyv (2016)). Clay acts as both the Unix-like filesystem and the source control vane. While Hoon itself (/sys/hoon) is responsible to compile a source noun into a Hoon abstract syntax tree (ast) and ultimately into executable Nock, the actual construction of source input from files is delegated to the ++ford arm of Clay. This build process includes compiling the correct import files (such as libraries and shared structure files). Clay is also tightly coupled to the userspace vane Gall since system updates and agent updates both occur via Clay.8

To motivate how Clay functions as a revision control filesystem, we need to briefly examine some major data structures utilized by Clay. At the highest level, files are organized into “desks”, described in previous literature as being analogous to Git branches but really more like Git repositories. A desk is a self-contained continuity of files; its state is the result of a history of commits. Desks may not include links to off-desk resources. Each desk must expose a few files at definite paths; primarily /mar for the marks used to read the desk contents.9

Ultimately, Arvo and Clay deal in nouns. A noun is a binary tree of unsigned integers. These may be evaluated as Nock formulas or manipulated as data. A noun is formally agnostic to the underlying hardware and is almost always accessed via some representation path (i. e. as phenomenon). A noun serves as a universally legible intermediate representation for all data and code.

Urbit has an immutable data model; the system state of Arvo is the unique result of the events in its history, stored as its event log. If a noun changes, it is the result of definite discrete changes to the system state. This yields very nice properties of referential transparency and has some ramifications in library management; see ~rovnys-ricfer and ~wicdev-wisryt (2024) for more details.

Typically, a noun is altered either directly via subject-modifying code (%~ censig, =. tisdot) or via a mark transformation, on which more later.

While conventional file systems denote file type primarily by file suffix or a cursory search using “magic tests” on the file header, Urbit instead stores all data as a noun accessed via a mark. A mark is essentially a representation rule for nouns. Marks permit nouns to be stored with more granularity than the text/binary dichotomy facilitates, since details of conversion are stored as part of the mark.

A mark may be considered an executable mime type.10 A mark essentially describes how to map a data representation to a noun, and from a noun back out to a data representation. This conversion may be trivial (e. g. the binary storage of an audio file) or sophisticated (from a text stream to a linked-list utf representation). Data are accessed via a particular mark, and the appropriate conversion routine is invoked.11 Clay is capable of searching for transition paths between mark representations, and in principle permits conversion between all compatible representations (if there were a path from Markdown md to rich text rtf to plaintext txt in marks, then the conversion is automatically supported by Clay).

A mark is a term (symbol) which is the Urbit equivalent of a mime type, if mime types were names of typed validation functions. ( ~sorreg-namtyv et al. (2016), p. 51)

Unix filesystem mounting and committing involves synchronizing a desk’s state with a Unix logical representation of the files involved. The actual file data are imported as %mime-typed nouns and then converted to their target mark as file type.

Each mark defines several standard arms:

  1. ++grab converts from other marks (by arm name) to a given mark.

  2. ++grow converts from the mark to other marks.

  3. ++grad defines the diffs applicators for the mark.

Marks do not need to be symmetrical or round-trip. A json input may yield an equivalent json output but altered by structural whitespace. Clay is also proactive in searching for possible mark paths—if an immediate conversion is not available, then a mutually available intermediary form may be used. To implement a custom filetype, all that is required is the production of a suitable mark file. Conversions from the filesystem require at a minimum a mime mark file, which acts as the global intermediary from the host OS.

3.2 Desks

A desk is a filesystem continuity,12 and represents the main way that data are synchronized between Urbit instances (ships). The state of a desk is a history of sequential commits, with each commit after the first having at least one parent commit. Desks are often analogized to Git repositories.13 Each desk must contain sufficient internal data—especially marks—to produce its own code, given Hoon and the standard library (which are always available).

Since Clay is a global filesystem, it has to maintain information about both local and remote desks. A local desk’s state maintains information about noun data, build and mark cache state, userspace app state, various policies, and synchronization with the host OS filesystem.

Listing 1: Clay desk types
::  Domestic desk state. 
+$  dojo 
  $:  qyx=cult              ::  subscribers 
      dom=dome              ::  desk state 
5      per=regs              ::  read perms per path 
      pew=regs              ::  write perms per path 
      fiz=melt              ::  state for mega merges 
  == 
::  Desk state. 
10+$  dome 
  $:  let=aeon              ::  top id 
      hit=(map aeon tako)   ::  versions by id 
      lab=(map @tas aeon)   ::  labels 
      tom=(map tako norm)   ::  tomb policies 
15      nor=norm              ::  default policy 
      mim=(map path mime)   ::  mime cache 
      fod=flue              ::  ford cache 
      wic=(map weft yoki)   ::  commit-in-waiting 
      liv=zest              ::  running agents 
20      ren=rein              ::  force agents on/off 
  == 
::  Support types 
+$  mizu  [p=@u q=(map @ud tako) r=rang] 
                            ::  new state 
25+$  rang                    ::  repository 
  $+  rang 
  $:  hut=(map tako yaki)   ::  changes 
      lat=(map lobe page)   ::  data 
  ==

The keystone structure is hit, which connects a given version of the desk (its aeon, which is merely a @ud) to the tree of associated commits. These are accessed from the “slush pool” of known commits via hut in the Clay’s rang, which is not associated with a particular local desk.

Much of Clay’s machinery is intended to support noun changes propagated as commits, or desk changes. These may arise from three sources: a local host OS filesystem sync, a remote desk sync, or an on-Urbit local edit. A commit is a snapshot of a set of files with a list of parent commits and a timestamp.

Various merge strategies are available to support desk updating. Given a commit as a timestamped snapshot of files, how should two conflicting histories be reconciled? For instance, %fine is a Git-style “fast-forward” for use when histories nest but one is “ahead”. %meet combines changes as long as the same file/noun has not been modified. (Several others are available and are described in the system documentation.) While Clay has at times supported noun diffs in commits, it currently simply stores the entire file at each commit.14 Urbit developers in practice have not encountered difficulties with desk and commit management due to this limitation at the current time, but resolution using mark diffs will eventually be desirable.

Structurally, Clay tracks desk state as a sequential +$aeon or @ud, but files may be arbitrarily accessed using a +$case:

+$  case 
  $%  [%da p=@da]     ::  %da:  date 
      [%tas p=@tas]   ::  %tas: label 
      [%ud p=@ud]     ::  %ud:  sequence 
5      [%uv p=@uv]     ::  %uv:  hash 
  ==

A case resolves into the corresponding aeon via the canonical timestamp of the commit; the label map; or the hash of the corresponding commit. A userspace system to support detailed commit messages, called Story, is available at @urbit/story.

3.3 Runtime

As a vane of Arvo, Clay lives “on Mars”; that is, it describes a logical filesystem inside of the Urbit ship but does not directly map its nouns to physical hardware. The particular implementation used to store the total state of Arvo, which contains Clay’s state including file data, depends on the runtime. At the time of writing, the Vere runtime uses the Lightning Memory-Mapped Database Manager (lmdb) which is a B-tree representation for a key-value store ( Chu, 2011). While of practical consequence for lookup time and system behavior, the architecture of Clay to an Urbit developer is independent of the runtime implementation.

The posix-compliant host OS supports a filesystem mirror of Clay files organized by desk. Importantly, while these files can be synced to Clay, they are not the canonical reference for each (which lives in lmbd). Clay maintains a “sync duct” with the runtime, which upon receipt of a %dirk task produces a list of file changes each way and implements the latest changes.15

In principle, as we have intimated, an rcs may abstract away from the file as the fundamental artifact being tracked. While Clay does not fully commit to this, in a concession to its posix-compliant host OS, it gestures towards the possibility of conceiving of build sources and artifacts as objects (nouns) other than files. Since Urbit “unifies” nouns (that is, it structurally stores only one copy of a noun and maintains references to that shared value), repetition of critical resources across desks does not lead to filesystem bloat. (See ~rovnys-ricfer and ~wicdev-wisryt (2024) in ustj vol. 1 iss. 1, pp. 75—82, for details.)

3.4 Building Code

Clay is also directly responsible for building code. Formerly, Urbit built code from source files using the Ford vane. In 2020, the core developers realized that integrating the build system directly into the rcs solved a number of problems with updating the system correctly. Updates to the system should be atomic (complete or fail in one transaction); self-contained (no implicit dependencies or dependencies on previous system states); and ordered (sequenced within the system stack). By coupling the build system into the rcs, the system can be updated in a single transaction, with the build system ensuring that the system is always in a consistent state. In a typed rcs, the compiler or build system must be always at hand to process diffs in artifacts. Although of relatively minor consequence to userspace application developers, the integration of ++ford into Clay transforms the system closer into an integrated rcs for generic nouns.

3.5 Shortcomings

As currently implemented, Clay presents several wrinkles in developer ergonomics. Chief among these is the way that marks, as simple type tags, are underspecified. The original philosophy of marks centered on network transmission of nouns as tagged data:

Like a mime type, the mark is just a label. There is no way to guarantee that the sender and receiver agree on what this label means. A noun which doesn’t normalize to itself is a packet drop. ( ~sorreg-namtyv et al. (2016), pp. 51–52)

In practice, different Clay desks and different Urbit versions can have different mark processors (and thus different behvaior for the same tag). No simple version system tracks these; there is no global type registry (other than the scry namespace), and any given mark is simply identified by its desk and path. Developers have expressed a desire for a more robust mark management capability, and solutions have ranged from a global type registry (e. g. that proposed by Archetype) to using paths instead of simple tags for marks. In practice as of writing, marks are disambiguated by supplying the intended mark with the desk; this may lead to multiple different implementations of a mark being present on the same system, although uniquely identified by different Clay paths.

The current architecture of Clay (largely in place by 2016 except for the unification of the former Ford vane into Clay in 2020) leads to three ongoing developer concerns:16

  1. Implementing rcs behavior at the file system layer seems to be incorrect for Urbit (that is, too much of a sop to Unix). trying to do revision control at base layer, but file system is wrong layer (you just want history of actions/tx log rather than full state); see migrev codebases for clay state

  2. Desks seem to be the wrong abstraction in practice. The theory of the desk doesn’t account for actual distribution patterns people would like to use; we want one desk but one spot in file tree is capable of acting as a desk (chroot analysis); overuse of desks makes Urbit not feel as much like a filesystem you can explore

  3. The application system (userspace) and the build system and filesystem should be co-located. Currently, Clay and Gall are deeply entangled but must interact via a somewhat awkward message-passing interface. This leads to a number of ergonomic issues, such as the need to suspend a desk in order to suspend an agent. Employing paths rather than simple tags would lead to a desire to constrain what can exist at certain paths. This relates to the definition of files as data structures, and ultimately a noun-maximalist scenario abstracts more towards nouns as database entries rather than “files”.

One possible future for Clay sees it being restricted to source and build management, rather than expansion to a more fully-featured filesystem. In this contingency, the userspace management vane may take over conventional file storage, as Gall or a successor. This may only implement a filesystem interface for the host OS and external applications that “think” in files, such as web browsers.

Alternatively, the build system, rcs, and application engine fuse into a single userspace vane, which is a possible endpoint of the “shrubbery” project. Mark cores would be replaced with simpler conversion rules permitting only straightforward type casts between nouns as $-(from to) gates.17

4 Conclusion

In its current instantiation, Clay exhibits some notable characteristics as a typed revision control system. It is capable of tracking file changes at a structural level, permitting meaningful diffs and merges for files of the same mark. It is also capable of converting between marks, permitting a more flexible approach to file type than the conventional text/binary dichotomy of Git and other rcss. However, the future of userspace and code building is still in flux and may see a reworking of the Clay vane to better suit the needs of developers and users.PIC

References

Baer, Eric (2018). What React Is and Why It Matters. Sebastopol, CA: O’Reilly Media. url: https://www.oreilly.com/library/view/what-react-is/9781491996744/ch01.html (visited on ~2024.8.12).

Ball, Thomas et al. (2015). “Beyond Open Source: The TouchDevelop Cloud-based Integrated Development Environment.” In: 2nd acm International Conference on Mobile Software Engineering and Systems. url: https://ieeexplore.ieee.org/document/7283033 (visited on ~2024.8.12).

Chu, Howard (2011) “lmdb”. url: http://www.lmdb.tech/doc/ (visited on ~2024.8.12).

Collins-Sussman, Ben, Brian W. Fitzpatrick, and C. Michael Pilato (2016). Version Control with Subversion. Sebastopol, CA: O’Reilly Media. url: https://svnbook.red-bean.com/en/1.8/svn.forcvs.binary-and-trans.html (visited on ~2024.8.12).

Hunt, James W. and M. Douglas McIlroy (1976). An Algorithm for Differential File Comparison. Computing Science Technical Report 41. Bell Laboratories. url: http://www.cs.dartmouth.edu/~doug/diff.pdf (visited on ~2024.8.12).

Meta (2013) “React”. url: https://react.dev/ (visited on ~2024.8.12).

Microsoft (2024) “Track changes in Word”. url: https://support.microsoft.com/en-us/office/track-changes-in-word-197ba630-0f5f-4a8e-9a77-3712475e806a (visited on ~2024.8.12).

Perry, D. E. (1987). “Version Control in the Inscape Environment.” In: Proceedings of the 9th International Conference on Software Engineering, pp. 142–149. url: https://users.ece.utexas.edu/~perry/work/papers/icse9b.pdf (visited on ~2024.8.12).

~rovnys-ricfer, Ted Blackman (2024). “The State of Urbit: Eight Years After the Whitepaper.” In: Urbit Systems Technical Journal 1.1, pp. 1–46.

~rovnys-ricfer, Ted Blackman and Philip C. Monk ~wicdev-wisryt (2024). “A Solution to Static vs. Dynamic Linking.” In: Urbit Systems Technical Journal 1.1, pp. 75–82.

~sorreg-namtyv, Curtis Yarvin (2016) “Toward a New Clay”. url: https://urbit.org/blog/toward-a-new-clay (visited on ~2024.8.12).

~sorreg-namtyv, Curtis Yarvin et al. (2016). Urbit: A Solid-State Interpreter. Whitepaper. Tlon Corporation. url: https://media.urbit.org/whitepaper.pdf (visited on ~2024.1.25).

Torvalds, Linus (2005) ““Initial revision of "git", the information manager from hell” (Git)”. url: https://github.com/git/git/commit/e83c5163316f89bfbde7d9ab23ca2e25604af290 (visited on ~2024.8.12).

Footnotes

1Also “version control system” (vcs); “source control system”.

2Compare other logical file systems: network file systems like Microsoft Server Message Block (smb) drives or distributed file systems like the Hadoop Distributed File System (hdfs) or the InterPlanetary File System (ipfs).

3Notably, Git has seemingly failed to fulfill its original promise of being a fully decentralized rcs: most users prefer to use the affordances of centralized Git services such as GitHub or GitLab.

4An implementation in Hoon is available at @sigilante/diff.

5As always, there are exceptions to the rules. The Git rcs, for instance, can use a “binary diff” format on exact preimages. The Subversion rcs does provide an elaborated mime support system, wherein users can assign each file a mime type that Subversion can use in applying and displaying diffs. However, in practice this only results in the classic text/binary dichotomy with slightly more granularity. “Subversion treats all file data as literal byte strings, and files are always stored in the repository in an untranslated state” ( Collins-Sussman, Fitzpatrick, and Pilato (2016), section “Binary Files and Translation”).

6Indeed, one can imagine a Word-aware command-line rcs that could meaningfully merge changes to a Word document, rather than simply storing successive copies of the file. The OpenDocument file form (odt, ods, odg, etc.) consists of zip-compressed xml files, which would be highly amenable to type-aware merge tools at the command line.

7Oddly, Perry did not call out weak typing in his enumeration, although he did address it elsewhere in the article.

8We omit discussion of several other notable properties of Clay as a single-level store: referential transparency, global addressability, and event-level persistence, for instance.