Typed Revision Control

N. E. Davis ~lagrev-nocfep
Urbit Foundation

Abstract

1 Introduction
2 Type in Revision Control Systems
3 Clay
3.1 Revision Control and Marks
3.2 Desks
3.3 Runtime
3.4 Building Code
3.5 Shortcomings
4 Conclusion
References

1 Introduction

A revision control system¹ is responsible for tracking the state and history of assets and asset changes within a particular continuity. It may also manage the production of assets through a build system, as used for instance by a continuous integration/continuous deployment (CI/CD) workflow. Trivially, an rcs is a logical filesystem: a database for managing files. (Conventional rcss are agnostic to the underlying disk hardware, however.²) Computer revision control systems have developed from the 1960s onwards, once magnetic storage (tape and hard drives) permitted files to be kept online rather than simply as punchcard decks. Succeeding generations have introduced new approaches and features, converging more or less on the broad functionality today afforded by Git (Torvalds, 2005) in its elaborated form as a distributed source control system.³ Much care has been taken to treat the problem of dependencies (e. g. Ball et al. (2015)) and efficient user interfaces (e. g. GitHub, GitLab, etc.). However, although some modest interest has been expressed in a more complete notion of type in revision control, no current system known to the authors besides Urbit’s Clay fully implements a typed revision-controlled filesystem. In this article, we explore the approach of current rcss and elaborate Clay’s contributions to the ongoing task of information management and auditable, reproducible source builds.

2 Type in Revision Control Systems

The problem of type in revision control has been of both academic and practical interest for decades (Perry, 1987). Essentially, can file artifacts be organized and classified in such a way as to make them susceptible of comparison across their history? The simplest arrangement which is commonly made is to designate a file as either plaintext or binary. A plaintext file can be updated by specifying the character offset and run length to be replaced (an example of a “diff”, or difference between two examples of text; typically these are resolved at the line level rather than character level). Plaintext files, whether ascii or some form of utf, are defined by a regular format with well-understood and predictable symbol widths and offsets. For instance, a little-endian representation of the text ‘clay’ would be reproduced in ascii hexadecimal as 0x7961.6c63 and ascii binary as 0b111.1001.0110.0001.0110.1100.0110.0011. (In binary, each character has a leading zero; that is, each entry is seven bits wide plus a 1-bit zero header. The leading zero of the most significant byte is not written.) In contrast, binary files (a byword for “everything else”) have arbitrary layout and offset; it may be difficult to determine how and why to compare two revisions of any given file. Many rcss simply treat all registered data as byte streams.
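The encoding of ‘clay’ given above can be checked mechanically. The following is a minimal Python sketch, not Urbit code; the helper names `as_atom` and `dotted` are hypothetical and exist only for this illustration.

```python
def as_atom(text: str) -> int:
    """Read the ASCII bytes of `text` as a little-endian unsigned integer."""
    return int.from_bytes(text.encode("ascii"), byteorder="little")


def dotted(digits: str, group: int = 4) -> str:
    """Group digits in fours from the right, separated by dots."""
    parts = []
    while digits:
        parts.append(digits[-group:])
        digits = digits[:-group]
    return ".".join(reversed(parts))


atom = as_atom("clay")
hex_form = "0x" + dotted(format(atom, "x"))
bin_form = "0b" + dotted(format(atom, "b"))
print(hex_form)  # 0x7961.6c63
print(bin_form)  # 0b111.1001.0110.0001.0110.1100.0110.0011
```

The first character, `c` (0x63), lands in the least significant byte, which is why it appears last when the integer is written out.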

All modern rcss support plaintext and binary representations of files. Plaintext diffing at the level of line changes is straightforward; the Hunt–McIlroy algorithm is often employed (Hunt and McIlroy, 1976).⁴ However, the most popular revision control systems (such as Subversion and Git) have included only minimal direct support for binary diffs. Some systems only support storing successive file copies of a binary file; that is, the diff is the entire deletion of the old file and addition of the new file (e. g. cvs). Other systems may support binary file diffs by granular block and offset. Some rcss support diffing particular binary files using a custom conversion or merge tool. For example, Git’s textconv setting converts each version of the file to a plaintext representation and compares those representations. (To show a diff for a change in a pdf file, a configured Git instance could compare the text output of pdfinfo in each case as a proxy. This is generally legible as metadata but does not of itself include sufficient information to reproduce or reverse the intervening changes; the underlying rcs still stores the entire binary in its conventional representation.) In any case, type beyond file extension is not generally supported by rcss; they cannot meaningfully merge files classified as binary.⁵
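Line-level diffing reduces to finding a longest common subsequence (LCS) of lines. The sketch below uses the simple quadratic dynamic-programming formulation of LCS rather than the candidate-based optimization of Hunt and McIlroy, so it should be read as an illustration of the idea and not as the algorithm any particular rcs uses. The function names are hypothetical.

```python
def lcs_table(a: list, b: list) -> list:
    """table[i][j] is the LCS length of a[i:] and b[j:]."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if a[i] == b[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])
    return table


def line_diff(old: str, new: str) -> list:
    """Return lines prefixed with '  ' (kept), '- ' (removed), or '+ ' (added)."""
    a, b = old.splitlines(), new.splitlines()
    table = lcs_table(a, b)
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append("  " + a[i])
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            out.append("- " + a[i])
            i += 1
        else:
            out.append("+ " + b[j])
            j += 1
    out.extend("- " + line for line in a[i:])
    out.extend("+ " + line for line in b[j:])
    return out


result = line_diff("a\nb\nc", "a\nx\nc")
print("\n".join(result))
```

A diff of this kind is meaningful only because both inputs are known to be sequences of lines; the same procedure applied to an arbitrary binary layout yields noise.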

A general-purpose notion of ﬁle type is not isomorphic to conventional type in programming language theory. For instance, consider a json ﬁle. The following are equivalent as json artifacts:

{ 
  "name": "Alice", 
  "age": 30 
}

{ 
  "age": 30, 
  "name": "Alice" 
}

{ "name": "Alice", "age": 30 }

Indeed, since json is structurally agnostic to insignificant whitespace and to the order of object members, an infinite number of equivalent files could be constructed. Clearly as plaintext representations, each of these differs, yet they are fully equivalent as json artifacts. An rcs which can recognize a json input can meaningfully store them as equivalent, while an rcs that makes only the trivial distinction between plaintext and binary must consider them all to be different; nonsemantic changes in structure may trigger spurious diffs in the file’s history. Furthermore, a json-aware rcs could meaningfully merge these files whereas a plaintext/binary rcs could not. Finally, a json-aware rcs could yield a different plaintext output when asked to produce the file, as it could normalize the whitespace to a canonical form. This may be considered by other systems to be an “incorrect” round trip.
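The distinction can be made concrete with Python’s standard `json` module. This is a sketch of the idea only; the choice of canonical form (sorted keys, no insignificant whitespace) is an assumption made for the example, not a convention taken from any rcs.

```python
import json

# Three textual renderings of the same JSON object.
texts = [
    '{\n  "name": "Alice",\n  "age": 30\n}',
    '{\n  "age": 30,\n  "name": "Alice"\n}',
    '{ "name": "Alice", "age": 30 }',
]

values = [json.loads(text) for text in texts]


def canonical(value) -> str:
    """One possible normal form: sorted keys, minimal separators."""
    return json.dumps(value, sort_keys=True, separators=(",", ":"))


distinct_as_text = len(set(texts))
equal_as_json = all(value == values[0] for value in values)
canonical_forms = {canonical(value) for value in values}

print(distinct_as_text, equal_as_json, canonical_forms)
```

A byte-oriented rcs sees three different files, while a json-aware rcs sees one value. Emitting the canonical form on output is exactly the “incorrect” round trip described above: the bytes change although the value does not.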

Thus we must consider two intersecting notions of ﬁle type:

The ﬁle type as understood by the system, typically based on bitwise reproducibility (e. g. plaintext, binary).
The ﬁle type as understood conceptually, typically based on structural relationships (e. g. json, xml/html, Word docx, etc.).

The employment of an intermediate representation allows us to track changes to files and manage their history (that is, their state) more effectively. Consider the popular JavaScript framework React (Meta, 2013), aspects of which were developed to resolve difficulties with slow repainting in browsers (for a popular explanation, see Baer (2018), chapter 1). React’s virtual Document Object Model (dom) system acts as an intermediate representation between the actual dom and the developer’s code. When a developer updates the virtual dom, the framework compares the new virtual dom with the previous virtual dom and updates the real dom only if necessary. This allows React to optimize the rendering process and improve performance. It also means that we can consider the structural representation of the dom object as an intermediate representation, and the production and application of the diff as a type-aware rcs operation. Higher-level interfaces like Microsoft Word’s “Track Changes” feature (Microsoft, 2024) similarly resolve diffs across text and formatting within a document, an artifact processed from a file into a user-friendly representation. While the “Track Changes” user interface is optimized for interactivity, conceptually it is quite similar to conventional merge tools from rcss, and de facto resolves the same sorts of diffs for its particular file type.⁶ Furthermore, the “Source Code in Database” (scid) approach parses all code into a database as an intermediate representation; this shares some features with a hypothetical type-aware rcs. Thus a significant part of managing typed file data is deciding what the input, output, and diffs should look like, and the remainder is systematic application of these rules to the data.
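A structural diff over an intermediate representation can be sketched in a few lines. The following Python example compares two nested dictionaries standing in for a parsed tree and applies the resulting change list; it is an assumption-laden toy (dictionary roots, string keys, hypothetical names `tree_diff` and `apply_diff`) and does not reproduce React’s reconciliation algorithm.

```python
import copy

MISSING = object()  # sentinel: the key is absent on one side


def tree_diff(old, new, path=()):
    """List (path, old_value, new_value) for every position that differs."""
    if isinstance(old, dict) and isinstance(new, dict):
        changes = []
        for key in sorted(old.keys() | new.keys()):
            changes += tree_diff(
                old.get(key, MISSING), new.get(key, MISSING), path + (key,)
            )
        return changes
    return [] if old == new else [(path, old, new)]


def apply_diff(tree, changes):
    """Return a copy of `tree` (a dict) with `changes` applied."""
    result = copy.deepcopy(tree)
    for path, _old, new in changes:
        node = result
        for key in path[:-1]:
            node = node.setdefault(key, {})
        if new is MISSING:
            node.pop(path[-1], None)
        else:
            node[path[-1]] = copy.deepcopy(new)
    return result


before = {"tag": "div", "attrs": {"class": "a"}, "text": "hi"}
after = {"tag": "div", "attrs": {"class": "b", "id": "x"}}
changes = tree_diff(before, after)
patched = apply_diff(before, changes)
```

The change list addresses positions in the structure rather than byte offsets or line numbers, which is the essential difference between a type-aware diff and a plaintext one.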

Given this frame, consider Perry’s taxonomy of version control for module interfaces:

No version control. Files are simply replaced.
Basic version control. Files are versioned with a number, but no diﬀ is applied.
Strongly-typed version control. Files are resolved at a structural level, such as procedures or arrays, and diﬀs are applied at that level.

In this scheme, contemporary Git-style rcs is a “weakly-typed” rcs: the system is aware of versions and diﬀs, but typically not below the ﬁle or line level.⁷ Perry was concerned with the behavior and representation of semantic objects, rather than ﬁles or data:

Version equivalence in this type of version control mechanism is defined in terms of syntactic equivalence: data objects are equivalent if their types or structures are equivalent; operations and modules are equivalent if their signatures are equivalent. (Perry, 1987)

Strongly-typed version control requires coupling to the build system, since the system must be aware of the structure of the data in order to apply diffs. (Indeed, ~rovnys-ricfer (2024), pp. 6–7, in ustj vol. 1 iss. 1, pp. 1–46, discussed one solution to the linking problem which is well-suited to Urbit’s global namespace.)
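Perry’s notion of version equivalence quoted above, in which operations are equivalent when their signatures are equivalent, can be illustrated loosely with Python’s `inspect` module. This is an analogy under our own assumptions, not a rendering of the Inscape mechanism; the `area_*` functions are invented for the example.

```python
import inspect


def area_v1(width: float, height: float) -> float:
    return width * height


def area_v2(width: float, height: float) -> float:
    # Different body, same interface.
    return height * width if width and height else 0.0


def area_v3(width: float, height: float, scale: float = 1.0) -> float:
    # The interface itself has changed.
    return width * height * scale


def same_signature(f, g) -> bool:
    """Compare parameter names, kinds, defaults, and annotations."""
    return inspect.signature(f) == inspect.signature(g)


v1_matches_v2 = same_signature(area_v1, area_v2)
v1_matches_v3 = same_signature(area_v1, area_v3)
```

Making this comparison requires parsing or loading the code, which is one way to see why a strongly-typed rcs must have the compiler or build system at hand.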

3 Clay

3.1 Revision Control and Marks

Urbit provides the Clay vane at /sys/vane/clay to manage Unix-style files and source builds (cf. ~sorreg-namtyv (2016)). Clay acts as both the Unix-like filesystem and the source control vane. While Hoon itself (/sys/hoon) is responsible for compiling a source noun into a Hoon abstract syntax tree (ast) and ultimately into executable Nock, the actual construction of source input from files is delegated to the ++ford arm of Clay. This build process includes compiling the correct import files (such as libraries and shared structure files). Clay is also tightly coupled to the userspace vane Gall since system updates and agent updates both occur via Clay.⁸

To motivate how Clay functions as a revision control ﬁlesystem, we need to brieﬂy examine some major data structures utilized by Clay. At the highest level, ﬁles are organized into “desks”, described in previous literature as being analogous to Git branches but really more like Git repositories. A desk is a self-contained continuity of ﬁles; its state is the result of a history of commits. Desks may not include links to oﬀ-desk resources. Each desk must expose a few ﬁles at deﬁnite paths; primarily /mar for the marks used to read the desk contents.⁹

Ultimately, Arvo and Clay deal in nouns. A noun is a binary tree of unsigned integers. These may be evaluated as Nock formulas or manipulated as data. A noun is formally agnostic to the underlying hardware and is almost always accessed via some representation path (i. e. as phenomenon). A noun serves as a universally legible intermediate representation for all data and code.
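The definition of a noun is small enough to model directly. The Python sketch below represents an atom as a non-negative integer and a cell as a 2-tuple; it is a stand-in for exposition, and the helper names are hypothetical. The `cell` helper mirrors the usual convention that [a b c] abbreviates [a [b c]].

```python
def is_noun(x) -> bool:
    """A noun is an atom (unsigned integer) or a cell (pair of nouns)."""
    if isinstance(x, bool):  # bool is a subclass of int in Python; exclude it
        return False
    if isinstance(x, int):
        return x >= 0
    return (
        isinstance(x, tuple)
        and len(x) == 2
        and is_noun(x[0])
        and is_noun(x[1])
    )


def cell(*items):
    """Build a right-nested cell from two or more nouns."""
    if len(items) < 2:
        raise ValueError("a cell needs at least two items")
    if len(items) == 2:
        return (items[0], items[1])
    return (items[0], cell(*items[1:]))


def leaves(x) -> list:
    """Atoms of a noun in left-to-right order."""
    if isinstance(x, int):
        return [x]
    head, tail = x
    return leaves(head) + leaves(tail)


example = cell(0x79616C63, (1, 2), 3)
```

Nothing in the structure says whether 0x79616c63 is a number or the text ‘clay’; that interpretation is supplied from outside, which is the role marks play for stored data.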

Urbit has an immutable data model; the system state of Arvo is the unique result of the events in its history, stored as its event log. If a noun changes, it is the result of deﬁnite discrete changes to the system state. This yields very nice properties of referential transparency and has some ramiﬁcations in library management; see ~rovnys-ricfer and ~wicdev-wisryt (2024) for more details.

Typically, a noun is altered either directly via subject-modifying code (%~ censig, =. tisdot) or via a mark transformation, on which more later.

While conventional ﬁle systems denote ﬁle type primarily by ﬁle suﬃx or a cursory search using “magic tests” on the ﬁle header, Urbit instead stores all data as a noun accessed via a mark. A mark is essentially a representation rule for nouns. Marks permit nouns to be stored with more granularity than the text/binary dichotomy facilitates, since details of conversion are stored as part of the mark.

A mark may be considered an executable mime type.¹⁰ A mark essentially describes how to map a data representation to a noun, and from a noun back out to a data representation. This conversion may be trivial (e. g. the binary storage of an audio file) or sophisticated (from a text stream to a linked-list utf representation). Data are accessed via a particular mark, and the appropriate conversion routine is invoked.¹¹ Clay is capable of searching for transition paths between mark representations, and in principle permits conversion between all compatible representations (if there is a path in marks from Markdown md to rich text rtf to plaintext txt, then the conversion from md to txt is automatically supported by Clay).
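Searching for a transition path between marks amounts to a search over a directed graph whose edges are the available conversions. The sketch below uses breadth-first search in Python; the search strategy, the `CONVERSIONS` table, and the toy converter functions are all assumptions made for illustration and are not taken from Clay’s implementation.

```python
from collections import deque

# Hypothetical direct conversions: (from_mark, to_mark) -> function.
CONVERSIONS = {
    ("md", "rtf"): lambda s: "{\\rtf1 " + s.replace("*", "") + "}",
    ("rtf", "txt"): lambda s: s[len("{\\rtf1 "):-1],
    ("txt", "mime"): lambda s: s.encode("utf-8"),
}


def find_path(source: str, target: str):
    """Shortest chain of marks from `source` to `target`, or None."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for src, dst in CONVERSIONS:
            if src == path[-1] and dst not in seen:
                seen.add(dst)
                queue.append(path + [dst])
    return None


def convert(value, source: str, target: str):
    """Apply each conversion along the discovered path in turn."""
    path = find_path(source, target)
    if path is None:
        raise LookupError(f"no conversion path from {source} to {target}")
    for src, dst in zip(path, path[1:]):
        value = CONVERSIONS[(src, dst)](value)
    return value


md_to_txt = find_path("md", "txt")
plain = convert("*hi*", "md", "txt")
```

No md-to-txt converter is defined anywhere in the table; the conversion exists only because the two intermediate steps compose.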

A mark is a term (symbol) which is the Urbit equivalent of a mime type, if mime types were names of typed validation functions. (~sorreg-namtyv et al. (2016), p. 51)

Unix ﬁlesystem mounting and committing involves synchronizing a desk’s state with a Unix logical representation of the ﬁles involved. The actual ﬁle data are imported as %mime-typed nouns and then converted to their target mark as ﬁle type.

Each mark deﬁnes several standard arms:

++grab converts from other marks (by arm name) to a given mark.
++grow converts from the mark to other marks.
++grad defines how diffs are computed and applied for the mark.

Marks do not need to be symmetrical or round-trip. A json input may yield an equivalent json output but altered by structural whitespace. Clay is also proactive in searching for possible mark paths—if an immediate conversion is not available, then a mutually available intermediary form may be used. To implement a custom ﬁletype, all that is required is the production of a suitable mark ﬁle. Conversions from the ﬁlesystem require at a minimum a mime mark ﬁle, which acts as the global intermediary from the host OS.
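The shape of a mark, and the fact that it need not round-trip bytes exactly, can be sketched in Python. The `Mark` class below borrows only the arm names `grab` and `grow` from the text (the diff-related `grad` arm is omitted); everything else is a hypothetical model rather than a translation of a real mark file.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class Mark:
    name: str
    # grab[other] converts from `other` into this mark's value.
    grab: Dict[str, Callable[[Any], Any]] = field(default_factory=dict)
    # grow[other] converts this mark's value out to `other`.
    grow: Dict[str, Callable[[Any], Any]] = field(default_factory=dict)


json_mark = Mark(
    name="json",
    grab={"mime": lambda raw: json.loads(raw.decode("utf-8"))},
    grow={"mime": lambda value: json.dumps(value, sort_keys=True).encode("utf-8")},
)

raw_in = b'{ "name": "Alice",\n  "age": 30 }'
value = json_mark.grab["mime"](raw_in)       # host bytes -> structured value
raw_out = json_mark.grow["mime"](value)      # structured value -> host bytes
value_again = json_mark.grab["mime"](raw_out)
```

Here `raw_out` differs from `raw_in` byte for byte, yet both grab to the same value, which is the sense in which a mark may be faithful to the data without being symmetrical over its representations.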

3.2 Desks

A desk is a ﬁlesystem continuity,¹² and represents the main way that data are synchronized between Urbit instances (ships). The state of a desk is a history of sequential commits, with each commit after the ﬁrst having at least one parent commit. Desks are often analogized to Git repositories.¹³ Each desk must contain suﬃcient internal data—especially marks—to produce its own code, given Hoon and the standard library (which are always available).

Since Clay is a global ﬁlesystem, it has to maintain information about both local and remote desks. A local desk’s state maintains information about noun data, build and mark cache state, userspace app state, various policies, and synchronization with the host OS ﬁlesystem.

Listing 1: Clay desk types

::  Domestic desk state.
+$  dojo
  $:  qyx=cult              ::  subscribers
      dom=dome              ::  desk state
      per=regs              ::  read perms per path
      pew=regs              ::  write perms per path
      fiz=melt              ::  state for mega merges
  ==
::  Desk state.
+$  dome
  $:  let=aeon              ::  top id
      hit=(map aeon tako)   ::  versions by id
      lab=(map @tas aeon)   ::  labels
      tom=(map tako norm)   ::  tomb policies
      nor=norm              ::  default policy
      mim=(map path mime)   ::  mime cache
      fod=flue              ::  ford cache
      wic=(map weft yoki)   ::  commit-in-waiting
      liv=zest              ::  running agents
      ren=rein              ::  force agents on/off
  ==
::  Support types
+$  mizu  [p=@u q=(map @ud tako) r=rang]
                            ::  new state
+$  rang                    ::  repository
  $+  rang
  $:  hut=(map tako yaki)   ::  changes
      lat=(map lobe page)   ::  data
  ==

The keystone structure is hit, which connects a given version of the desk (its aeon, which is merely a @ud) to the tree of associated commits. These are accessed from the “slush pool” of known commits via hut in Clay’s rang, which is not associated with a particular local desk.
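The chain of lookups from a desk version to file data can be modeled as follows. This Python sketch keeps only the fields let, hit, hut, and lat from Listing 1; the `Commit` class is a simplified stand-in for a yaki, and its field names, the string keys, and `read_file` are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Commit:                 # simplified stand-in for a yaki
    parents: List[str]        # takos of parent commits
    files: Dict[str, str]     # path -> lobe (key into `lat`)
    when: int                 # timestamp, reduced to an integer here


@dataclass
class Rang:                   # shared pool, not tied to one desk
    hut: Dict[str, Commit] = field(default_factory=dict)  # tako -> commit
    lat: Dict[str, bytes] = field(default_factory=dict)   # lobe -> data


@dataclass
class Dome:                   # per-desk state (two fields only)
    let: int = 0                                       # latest aeon
    hit: Dict[int, str] = field(default_factory=dict)  # aeon -> tako


def read_file(dome: Dome, rang: Rang, aeon: int, path: str) -> bytes:
    """Follow aeon -> tako -> commit -> lobe -> data."""
    tako = dome.hit[aeon]
    commit = rang.hut[tako]
    return rang.lat[commit.files[path]]


rang = Rang(
    hut={
        "t1": Commit([], {"/app/hi/hoon": "l1"}, when=10),
        "t2": Commit(["t1"], {"/app/hi/hoon": "l2"}, when=20),
    },
    lat={"l1": b"first", "l2": b"second"},
)
dome = Dome(let=2, hit={1: "t1", 2: "t2"})
```

Because commits and data live in the shared pool, two desks whose hit maps point at the same tako share that commit without copying it.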

Much of Clay’s machinery is intended to support noun changes propagated as commits, or desk changes. These may arise from three sources: a local host OS ﬁlesystem sync, a remote desk sync, or an on-Urbit local edit. A commit is a snapshot of a set of ﬁles with a list of parent commits and a timestamp.

Various merge strategies are available to support desk updating. Given a commit as a timestamped snapshot of files, how should two conflicting histories be reconciled? For instance, %fine is a Git-style “fast-forward” for use when histories nest but one is “ahead”. %meet combines changes as long as the same file/noun has not been modified on both sides. (Several others are available and are described in the system documentation.) While Clay has at times supported noun diffs in commits, it currently simply stores the entire file at each commit.¹⁴ In practice, Urbit developers have not yet encountered difficulties with desk and commit management due to this limitation, but resolution using mark diffs will eventually be desirable.
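The two strategies named above can be sketched over whole-file snapshots. The Python below follows only the prose descriptions given here (fast-forward when one history contains the other; combine when the changed paths are disjoint); the data layout and function names are hypothetical, and Clay’s actual behavior in edge cases may differ.

```python
def is_ancestor(commits: dict, older: str, newer: str) -> bool:
    """True if `older` is reachable from `newer` through parent links."""
    stack, seen = [newer], set()
    while stack:
        tako = stack.pop()
        if tako == older:
            return True
        if tako not in seen:
            seen.add(tako)
            stack.extend(commits[tako]["parents"])
    return False


def merge_fine(commits: dict, ours: str, theirs: str) -> str:
    """Fast-forward: succeeds only when one history contains the other."""
    if is_ancestor(commits, ours, theirs):
        return theirs
    if is_ancestor(commits, theirs, ours):
        return ours
    raise ValueError("histories have diverged; a fast-forward does not apply")


def changed_paths(base: dict, tip: dict) -> set:
    return {p for p in base.keys() | tip.keys() if base.get(p) != tip.get(p)}


def merge_meet(base: dict, ours: dict, theirs: dict) -> dict:
    """Combine both sides if no path was changed on both."""
    theirs_changed = changed_paths(base, theirs)
    overlap = changed_paths(base, ours) & theirs_changed
    if overlap:
        raise ValueError(f"both sides changed: {sorted(overlap)}")
    merged = dict(ours)
    for path in theirs_changed:
        if path in theirs:
            merged[path] = theirs[path]
        else:
            merged.pop(path, None)
    return merged


commits = {"a": {"parents": []}, "b": {"parents": ["a"]}, "c": {"parents": ["a"]}}
base = {"/x": "1", "/y": "1"}
merged = merge_meet(base, {"/x": "2", "/y": "1"}, {"/x": "1", "/y": "1", "/z": "9"})
```

Both strategies work at file granularity. A merge that reconciles two edits to the same file would need the mark’s diff machinery, which is the limitation noted above.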

Structurally, Clay tracks desk state as a sequential +$aeon or @ud, but ﬁles may be arbitrarily accessed using a +$case:

+$  case
  $%  [%da p=@da]     ::  %da:  date
      [%tas p=@tas]   ::  %tas: label
      [%ud p=@ud]     ::  %ud:  sequence
      [%uv p=@uv]     ::  %uv:  hash
  ==

A case resolves into the corresponding aeon directly (%ud); via the canonical timestamp of the commit (%da); via the label map (%tas); or via the hash of the corresponding commit (%uv). A userspace system to support detailed commit messages, called Story, is available at @urbit/story.
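Case resolution can be modeled as a dispatch on the tag. In the Python sketch below, the `Desk` class and its fields are hypothetical, aeons are assumed to start at 1, and a date is assumed to resolve to the latest commit at or before it; these are illustrative choices rather than statements about Clay’s exact rules.

```python
from bisect import bisect_right


class Desk:
    def __init__(self):
        self.times = []    # times[i] is the timestamp of aeon i + 1, ascending
        self.hashes = []   # hashes[i] is the commit hash of aeon i + 1
        self.labels = {}   # label -> aeon


def resolve(desk: Desk, case: tuple):
    """Map a (tag, value) case to an aeon, or None if it does not resolve."""
    tag, value = case
    if tag == "ud":    # sequence number: already an aeon
        return value if 1 <= value <= len(desk.times) else None
    if tag == "tas":   # label lookup
        return desk.labels.get(value)
    if tag == "uv":    # commit hash lookup
        return desk.hashes.index(value) + 1 if value in desk.hashes else None
    if tag == "da":    # count of commits at or before the date
        return bisect_right(desk.times, value) or None
    raise ValueError(f"unknown case tag: {tag}")


desk = Desk()
desk.times = [10, 20, 30]
desk.hashes = ["h1", "h2", "h3"]
desk.labels = {"release": 2}
```

For example, `resolve(desk, ("da", 25))` and `resolve(desk, ("tas", "release"))` both yield aeon 2 in this toy desk.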

3.3 Runtime

As a vane of Arvo, Clay lives “on Mars”; that is, it describes a logical filesystem inside of the Urbit ship but does not directly map its nouns to physical hardware. The particular implementation used to store the total state of Arvo, which contains Clay’s state including file data, depends on the runtime. At the time of writing, the Vere runtime uses the Lightning Memory-Mapped Database Manager (lmdb) which is a B-tree representation for a key-value store (Chu, 2011). While of practical consequence for lookup time and system behavior, the architecture of Clay to an Urbit developer is independent of the runtime implementation.

The posix-compliant host OS supports a filesystem mirror of Clay files organized by desk. Importantly, while these files can be synced to Clay, they are not the canonical reference for each (which lives in lmdb). Clay maintains a “sync duct” with the runtime, which upon receipt of a %dirk task produces a list of file changes each way and implements the latest changes.¹⁵

In principle, as we have intimated, an rcs may abstract away from the file as the fundamental artifact being tracked. While Clay does not fully commit to this, in a concession to its posix-compliant host OS, it gestures towards the possibility of conceiving of build sources and artifacts as objects (nouns) other than files. Since Urbit “unifies” nouns (that is, it structurally stores only one copy of a noun and maintains references to that shared value), repetition of critical resources across desks does not lead to filesystem bloat. (See ~rovnys-ricfer and ~wicdev-wisryt (2024) in ustj vol. 1 iss. 1, pp. 75–82, for details.)
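The effect of storing one copy of a repeated value can be approximated with a content-addressed store. The Python sketch below deduplicates by SHA-256 hash; this is a stand-in chosen for the example, and it should not be read as a description of how the runtime actually unifies nouns. The class and variable names are hypothetical.

```python
import hashlib


class ContentStore:
    """Keep one copy of each distinct value, addressed by its hash."""

    def __init__(self):
        self._by_key = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        self._by_key.setdefault(key, data)  # no second copy if already present
        return key

    def get(self, key: str) -> bytes:
        return self._by_key[key]

    def __len__(self) -> int:
        return len(self._by_key)


store = ContentStore()
library = b"shared library source"

# Two desks reference the same library; a third file is unique to one desk.
desk_a = {"/lib/shared/hoon": store.put(library)}
desk_b = {
    "/lib/shared/hoon": store.put(library),
    "/app/only-b/hoon": store.put(b"agent source"),
}
```

Three files are referenced across the two desks, but only two values are stored, so vendoring a common dependency into every desk costs a reference rather than a copy.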

3.4 Building Code

Clay is also directly responsible for building code. Formerly, Urbit built code from source files using the Ford vane. In 2020, the core developers realized that integrating the build system directly into the rcs solved a number of problems with updating the system correctly. Updates to the system should be atomic (complete or fail in one transaction); self-contained (no implicit dependencies or dependencies on previous system states); and ordered (sequenced within the system stack). By coupling the build system into the rcs, the system can be updated in a single transaction, with the build system ensuring that the system is always in a consistent state. In a typed rcs, the compiler or build system must be always at hand to process diffs in artifacts. Although of relatively minor consequence to userspace application developers, the integration of ++ford into Clay brings the system closer to an integrated rcs for generic nouns.
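The atomic and self-contained properties can be illustrated with a toy commit function in which the build runs before any state is replaced. This Python sketch is an assumption-heavy model (the `build` step merely upper-cases text, and all names are hypothetical); it shows the transaction shape, not Clay’s build system.

```python
class BuildError(Exception):
    pass


def build(path: str, source: str) -> str:
    """Toy build step: 'compile' by upper-casing; fail on marked input."""
    if "SYNTAX ERROR" in source:
        raise BuildError(path)
    return source.upper()


def commit_update(state: dict, changes: dict) -> dict:
    """Return a new state with `changes` applied and everything rebuilt.

    The old state is never mutated, so a failed build leaves it intact.
    Artifacts are rebuilt from the new files alone, not from old artifacts.
    """
    files = dict(state["files"])
    files.update(changes)
    built = {path: build(path, src) for path, src in files.items()}  # may raise
    return {"files": files, "built": built}


state0 = commit_update({"files": {}, "built": {}}, {"/lib/a": "alpha"})
state1 = commit_update(state0, {"/lib/b": "beta"})

try:
    state2 = commit_update(state1, {"/lib/a": "ok", "/lib/c": "SYNTAX ERROR"})
except BuildError:
    state2 = state1  # the whole update is rejected, including the valid part
```

Had the build been a separate service invoked after the files were committed, the failed update could have left new sources alongside stale artifacts; folding the build into the commit removes that intermediate state.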

3.5 Shortcomings

As currently implemented, Clay presents several wrinkles in developer ergonomics. Chief among these is the way that marks, as simple type tags, are underspeciﬁed. The original philosophy of marks centered on network transmission of nouns as tagged data:

Like a mime type, the mark is just a label. There is no way to guarantee that the sender and receiver agree on what this label means. A noun which doesn’t normalize to itself is a packet drop. (~sorreg-namtyv et al. (2016), pp. 51–52)

In practice, different Clay desks and different Urbit versions can have different mark processors (and thus different behavior for the same tag). No simple version system tracks these; there is no global type registry (other than the scry namespace), and any given mark is simply identified by its desk and path. Developers have expressed a desire for a more robust mark management capability, and solutions have ranged from a global type registry (e. g. that proposed by Archetype) to using paths instead of simple tags for marks. In practice as of writing, marks are disambiguated by supplying the intended mark with the desk; this may lead to multiple different implementations of a mark being present on the same system, although uniquely identified by different Clay paths.

The current architecture of Clay (largely in place by 2016 except for the uniﬁcation of the former Ford vane into Clay in 2020) leads to three ongoing developer concerns:¹⁶

Implementing rcs behavior at the file system layer seems to be incorrect for Urbit (that is, too much of a sop to Unix). Clay attempts revision control at the base layer, but the file system is the wrong layer for it: what is wanted is a history of actions (a transaction log) rather than the full state. (See migrev codebases for Clay state.)
Desks seem to be the wrong abstraction in practice. The theory of the desk does not account for the distribution patterns that people would actually like to use. Developers want a single desk in which any one spot in the file tree is capable of acting as a desk (by analogy with chroot). Overuse of desks also makes Urbit feel less like a filesystem that one can explore.
The application system (userspace) and the build system and ﬁlesystem should be co-located. Currently, Clay and Gall are deeply entangled but must interact via a somewhat awkward message-passing interface. This leads to a number of ergonomic issues, such as the need to suspend a desk in order to suspend an agent. Employing paths rather than simple tags would lead to a desire to constrain what can exist at certain paths. This relates to the deﬁnition of ﬁles as data structures, and ultimately a noun-maximalist scenario abstracts more towards nouns as database entries rather than “ﬁles”.

One possible future for Clay sees it being restricted to source and build management, rather than expansion to a more fully-featured ﬁlesystem. In this contingency, the userspace management vane may take over conventional ﬁle storage, as Gall or a successor. This may only implement a ﬁlesystem interface for the host OS and external applications that “think” in ﬁles, such as web browsers.

Alternatively, the build system, rcs, and application engine fuse into a single userspace vane, which is a possible endpoint of the “shrubbery” project. Mark cores would be replaced with simpler conversion rules permitting only straightforward type casts between nouns as $-(from to) gates.¹⁷

4 Conclusion

In its current instantiation, Clay exhibits some notable characteristics as a typed revision control system. It is capable of tracking ﬁle changes at a structural level, permitting meaningful diﬀs and merges for ﬁles of the same mark. It is also capable of converting between marks, permitting a more ﬂexible approach to ﬁle type than the conventional text/binary dichotomy of Git and other rcss. However, the future of userspace and code building is still in ﬂux and may see a reworking of the Clay vane to better suit the needs of developers and users.

References

Baer, Eric (2018). What React Is and Why It Matters. Sebastopol, CA: O’Reilly Media. url: https://www.oreilly.com/library/view/what-react-is/9781491996744/ch01.html (visited on ~2024.8.12).
Ball, Thomas et al. (2015). “Beyond Open Source: The TouchDevelop Cloud-based Integrated Development Environment.” In: 2nd acm International Conference on Mobile Software Engineering and Systems. url: https://ieeexplore.ieee.org/document/7283033 (visited on ~2024.8.12).
Chu, Howard (2011). “lmdb.” url: http://www.lmdb.tech/doc/ (visited on ~2024.8.12).
Collins-Sussman, Ben, Brian W. Fitzpatrick, and C. Michael Pilato (2016). Version Control with Subversion. Sebastopol, CA: O’Reilly Media. url: https://svnbook.red-bean.com/en/1.8/svn.forcvs.binary-and-trans.html (visited on ~2024.8.12).
Hunt, James W. and M. Douglas McIlroy (1976). An Algorithm for Differential File Comparison. Computing Science Technical Report 41. Bell Laboratories. url: http://www.cs.dartmouth.edu/~doug/diff.pdf (visited on ~2024.8.12).
Meta (2013). “React.” url: https://react.dev/ (visited on ~2024.8.12).
Microsoft (2024). “Track changes in Word.” url: https://support.microsoft.com/en-us/office/track-changes-in-word-197ba630-0f5f-4a8e-9a77-3712475e806a (visited on ~2024.8.12).
Perry, D. E. (1987). “Version Control in the Inscape Environment.” In: Proceedings of the 9th International Conference on Software Engineering, pp. 142–149. url: https://users.ece.utexas.edu/~perry/work/papers/icse9b.pdf (visited on ~2024.8.12).
~rovnys-ricfer, Ted Blackman (2024). “The State of Urbit: Eight Years After the Whitepaper.” In: Urbit Systems Technical Journal 1.1, pp. 1–46.
~rovnys-ricfer, Ted Blackman and Philip C. Monk ~wicdev-wisryt (2024). “A Solution to Static vs. Dynamic Linking.” In: Urbit Systems Technical Journal 1.1, pp. 75–82.
~sorreg-namtyv, Curtis Yarvin (2016). “Toward a New Clay.” url: https://urbit.org/blog/toward-a-new-clay (visited on ~2024.8.12).
~sorreg-namtyv, Curtis Yarvin et al. (2016). Urbit: A Solid-State Interpreter. Whitepaper. Tlon Corporation. url: https://media.urbit.org/whitepaper.pdf (visited on ~2024.1.25).
Torvalds, Linus (2005). “Initial revision of ‘git’, the information manager from hell” (Git). url: https://github.com/git/git/commit/e83c5163316f89bfbde7d9ab23ca2e25604af290 (visited on ~2024.8.12).

Footnotes

¹Also “version control system” (vcs); “source control system”.

²Compare other logical file systems: network file systems like Microsoft Server Message Block (smb) drives or distributed file systems like the Hadoop Distributed File System (hdfs) or the InterPlanetary File System (ipfs).

³Notably, Git has seemingly failed to fulfill its original promise of being a fully decentralized rcs: most users prefer to use the affordances of centralized Git services such as GitHub or GitLab.

⁴An implementation in Hoon is available at @sigilante/diff.

⁵As always, there are exceptions to the rules. The Git rcs, for instance, can use a “binary diff” format on exact preimages. The Subversion rcs does provide an elaborated mime support system, wherein users can assign each file a mime type that Subversion can use in applying and displaying diffs. However, in practice this only results in the classic text/binary dichotomy with slightly more granularity. “Subversion treats all file data as literal byte strings, and files are always stored in the repository in an untranslated state” (Collins-Sussman, Fitzpatrick, and Pilato (2016), section “Binary Files and Translation”).

⁶Indeed, one can imagine a Word-aware command-line rcs that could meaningfully merge changes to a Word document, rather than simply storing successive copies of the file. The OpenDocument file formats (odt, ods, odg, etc.) consist of zip-compressed xml files, which would be highly amenable to type-aware merge tools at the command line.

⁷Oddly, Perry did not call out weak typing in his enumeration, although he did address it elsewhere in the article.

⁸We omit discussion of several other notable properties of Clay as a single-level store: referential transparency, global addressability, and event-level persistence, for instance.