Testing, including unit testing, forms a critical step in software development. This article explores the philosophy and practice of testing in the context of Hoon and Urbit development, particularly as motivated by the development of urQL and Obelisk. Inductive testing provides an efficient yet complete basis for verifying code behavior. Principles and best practices are suggested, as well as a practical example of testing in Hoon.
What should I test? The short answer is, “Everything”. If you don’t test it, it doesn’t work.
However, “everything” raises more questions than it answers. Such questions fall into two categories: what and how.
No one writes non-trivial software without testing, even if you come to believe that “if it builds, it works”. The first time you ran your software was a test. Chances are, you ran individual components to see if those worked. Why not take the extra few minutes to record those tests and results? With practice you will get smart about efficiently recording your throw-away tests. And after you read this I hope you will have a new appreciation of why it is worth the extra time to record your tests.
Use what is useful and don’t get hung up on a particular framework or methodology. Record tests as you develop. You may think this slows down your development. That’s not necessarily a bad thing. You will move forward with confidence that the last thing you worked on really works, and you will find yourself thinking through your specification at a deeper level. This is important in a world of mostly incomplete and nebulous specs. What I am driving at is a cousin to test-driven development.1 Think of it as real-time test development.
Your test suite has two important non-obvious functions: as regression tests (OK, this one should be obvious) and as part of your specification. If an existing test starts failing, carefully consider Chesterton’s fence (Wikipedia (2024a)) before taking the easy road and removing the test.
Regression tests are important not just to prevent unexpectedly breaking functionality, but to allow you to refactor with confidence. Great software is the result of constant refactoring.
Descending from the hypothetical to the practical, every serious programming language comes with a unit testing environment. Hoon supplies /lib/test on the %base desk, paired with the /ted/test thread; the /lib file should be copied to your development desk.2 The rest of this article assumes a working knowledge of /lib/test.
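For readers new to the library, a test file is simply a core whose arms are named with a test- prefix, each returning a tang of failures (empty on success). A minimal sketch, assuming the usual conventions (the path and library name here are hypothetical):

```hoon
::  /tests/lib/my-lib/hoon  (path conventions vary by setup)
/+  *test
|%
::  arms whose names begin with 'test-' are run by the test thread
++  test-add
  %+  expect-eq
    !>  4
    !>  (add 2 2)
--
```

Such a file is typically run from the dojo via the test thread, e.g. `-test %/tests ~`.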
Don’t get hung up on unit testing dogma. A unit is whatever you want it to be. As a subject-oriented programming language, Hoon does not support private properties and functions: every arm and core in your software is accessible for testing. Generally a unit is some self-contained level of functionality, either from the programmer’s or the user’s point of view.
Listing 1 depicts a more thorough unit test for working with a Gall agent. (Some would prefer to call this an integration test.) Given that level of granularity, we can proceed to test the evolution of a database over time (Listing 2). The test is a bit more complex, but the principle holds constant: test a unit of functionality.
Under the hood, the Obelisk engine calls out to the urQL parser to create AST commands from the user-created scripts. The AS OF clause overrides the time passed in the bowl. Finally we weld together two results and check them against what we expect. The first check, ++expect-eq, is of the metadata returned by the last command; the second check is of the final database state after the evolution.
Unit testing purism insists on tailoring each test to one specific case of potential failure. This approach tends to be pedantic, and we frequently, pragmatically, sneak multiple independent tests into one test bundle, as in Listing 3. ++test-alter-index-1 is not an atomic unit test but tests several qualities together. Values like whitespace and mixed-case labels are tedious to test exhaustively and well-suited to property-based testing software like %quiz (see below).
We were intentionally dismissive of testing methodologies above because they are mere conventions. There is, however, some actual theory we can leverage in figuring out what and how to test.3
Peano arithmetic is a better introduction to proof by induction for the lay programmer. There exists (or perhaps does not exist) a special concept: zero. Playing fast and loose with classical logic, “ex nihilo sequitur quodlibet”: from nothing (more commonly a falsehood, or a contradiction) follows everything (i.e. anything).4
Zero is not only an integer; rather, in terms of Nock-based programming, ~ is not only the beginning of counting,5 but the nothing of every inductive type, most importantly trees and lists. Speaking to programmers, Peano tells us there is a universal nothing ~ and there is a function called successor \(\textrm {Succ}\), which produces some next thing from a previous thing (or lack of thing). Applying \(\textrm {Succ}\) to nothing, \(\textrm {Succ}(0)\), gives us the first thing. Applying \(\textrm {Succ}\) to the first thing, \(\textrm {Succ}(\textrm {Succ}(0))\), gives us the second thing, and so on.
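In symbols, the construction sketched above amounts to:

```latex
0 \in \mathbb{N}, \qquad
n \in \mathbb{N} \implies \textrm{Succ}(n) \in \mathbb{N},
```

with \(1 = \textrm {Succ}(0)\), \(2 = \textrm {Succ}(\textrm {Succ}(0))\), and so on.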
Induction appears in unexpected places: consider zero-length strings, which do not appear to involve ~ at all. The Hoon type unit, for instance, is also an inductive type: it is literally either nothing or something. Knowing now what to look for, see where you can find induction in your own code.
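A dojo sketch of the pattern: ~ plays the role of zero for lists and units alike.

```hoon
> `(list @)`~            ::  the empty list: nothing
~
> [1 `(list @)`~]        ::  a cell built on ~: the successor of nothing
[1 ~]
> `(unit @)`~            ::  the empty unit: nothing
~
> `(unit @)``42          ::  a unit wrapping a value: something
[~ 42]
```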
What does induction have to do with testing, however? For inductive types—and simple functions over inductive types—the software author only has to prove, or test, two cases: the case for zero and the case for the successor value after zero. For example, Listing 4 depicts tests for a gate over an inductive type (list) in /lib/seq (jackfoxy/sequent, currently distributed via %yard (urbit/yard)).
“Hey”, you say—“that’s more than two tests.” That’s right: two is the bare minimum of required tests, and only applies to the simplest units of inductive testing. In this case the first two tests suffice, but minimal tests frequently make for unhelpful documentation examples. We heartily recommend providing interesting examples in your documentation and including those examples in your test suite. You don’t want users to struggle with examples that don’t work, or worse, don’t even build. More tests don’t hurt anything; the computer doesn’t get tired.
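As a concrete sketch of the two required cases, here they are for the standard ++lent (list length) gate rather than the /lib/seq arm of Listing 4:

```hoon
/+  *test
|%
::  base case: the gate behaves correctly at ~
++  test-lent-null
  %+  expect-eq
    !>  0
    !>  (lent `(list @)`~)
::  successor case: correct for one element built on ~
++  test-lent-one
  %+  expect-eq
    !>  1
    !>  (lent `(list @)`~[42])
--
```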
Another reason for additional tests is taking a page out of “white box” or “gray box” testing. If you know that there is special logic for the first successor case, you need to test the first case independently as well as a subsequent successor. If you are the programmer and the tester you should approach all of your testing from this perspective. You might even see how to make your program simpler.
Through the mathematics of currying, programmatic functions can have input arguments of multiple inductive types.6 The minimum number of tests then multiplies: each inductive input contributes a zero case and a successor case, so a gate over two lists requires at least \(2 \times 2 = 4\) tests, starting with both elements set to ~ and so forth. Listing 5 depicts a series of tests for the ++append gate subject to this \(n^2\) explosion. The required number of tests may grow exponentially with the number of inductive inputs, but in most practical cases it remains a relatively small finite number.
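The shape of such a test matrix can be sketched against the standard ++weld gate (an append analog), standing in here for the article's ++append:

```hoon
/+  *test
|%
++  test-weld-null-null          ::  both inputs ~
  %+  expect-eq
    !>  `(list @)`~
    !>  (weld `(list @)`~ `(list @)`~)
++  test-weld-null-succ          ::  first ~, second nonempty
  %+  expect-eq
    !>  `(list @)`~[2]
    !>  (weld `(list @)`~ `(list @)`~[2])
++  test-weld-succ-null          ::  first nonempty, second ~
  %+  expect-eq
    !>  `(list @)`~[1]
    !>  (weld `(list @)`~[1] `(list @)`~)
++  test-weld-succ-succ          ::  both nonempty
  %+  expect-eq
    !>  `(list @)`~[1 2]
    !>  (weld `(list @)`~[1] `(list @)`~[2])
--
```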
Input argument interactions are not the only source of combinatorial explosion in testing. Imagine that your function (gate) is a black box. You start submitting random input to figure out what the underlying algorithm is. However, you discover that there is not one consistent algorithm for all inputs: some values or ranges of values behave differently from others. (Unexpected UTF-8 whitespace characters are notorious for revealing head-scratching bugs—so imagine all possible inputs over complex XML.) You could model this behavior as multiple inductive types, making for an even bigger Cartesian explosion of inputs to test. It is no longer practical to construct all the tests required by our theory. The number is still finite, but impractically large. What is to be done?
There is no general solution. Complete code coverage with tests is a start, inductively testing over each clause in a unit. This is time-consuming and requires thinking deeply about the code and its structure.
Another approach is to favor so-called edge case testing, in which you test the boundaries of the input space. This is a good idea, but it is not a complete solution. It is not always clear what the edge is, and it is not always clear that the edge is the same for all inductive types. In the case of testing a string function, should an edge be the empty string, a string with one character, or a string with two characters?
An automated testing solution called property-based testing may be applicable in these cases. The idea of property-based testing is to develop and instantiate invariant properties of the code and let software generate random inputs. The software runs the random inputs and tests the outputs against the invariant properties. An architecture for this approach was originally developed for Haskell and has since been ported to many other languages, including Hoon: %quiz is a well-documented Hoon implementation (hjorthjort/quiz). Once again, this solution requires some deep thinking about the code and its specification.
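Framework details aside, a property is just a gate from arbitrary input to a loobean that must always be %.y; consult the %quiz repository for its actual conventions, which this sketch does not claim to reproduce. An invariant of list concatenation, for instance:

```hoon
::  invariant: the length of a weld is the sum of the lengths.
::  a property-testing framework generates random a and b and
::  checks that this gate never produces %.n.
|=  [a=(list @) b=(list @)]
^-  ?
.=  (lent (weld a b))
(add (lent a) (lent b))
```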
From our experience with the F\(^\sharp \) implementation of property-based testing, we expect that you will need to boost the number of random input permutations beyond the default of 100 to get coverage that is “reasonably” exhaustive. We found 10,000 to be frequently adequate. (Since the inputs are randomly generated on each run, however, a property test of production code may fail when nothing has changed. Then think about how to explain to your boss that your tests are not deterministic.)
Lastly, whenever you fix a bona fide production bug (or one that a framework like %quiz discovered), add a test case to address that circumstance. This not only provides the standard for when the bug has been fixed, but protects against regressing to the prior behavior (thus, a “regression test”). Congratulations, you have just refined your specification.
In our experience, failure modes are the most overlooked part of software development. It starts with passing insufficient, or no, information from the programmed points of failure. So even before testing failure modes, make sure you distinguish (i.e. make unique) each message from every point of failure and include any and all relevant information available for debugging (cf. Listing 6).
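In Hoon, the ~| (sigbar) rune is the natural tool for this: it adds a tank to the crash trace only if the computation beneath it fails. A minimal sketch, with hypothetical names:

```hoon
::  each failure point gets a unique tag plus the offending data,
::  so a crash identifies exactly where and why it happened
|=  [key=@t store=(map @t @)]
^-  @
~|  [%lookup-failed key]
(~(got by store) key)
```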
The standard /lib/test library, as of this writing, can test for failure but not for an expected message. A modified testing arm (Listing 7) can be included in a testing thread to facilitate this kind of testing. Listing 8 shows a test for an expected error message in the Obelisk database engine, which uses this functionality to verify that the correct error message is raised on crash.
Keep the following principles in mind when producing and evaluating code as a software developer.
Strive to make the collection of units of testing exhaustive, both primitive units of code and units of work from the user perspective.
Test inductively wherever possible.
Test from a white or gray box perspective.
Test the failure modes.
Test all the examples in your documentation.
In a world lacking documentation, tests may be the only real specification.7
Regression tests are the key to refactoring with confidence. Beautiful code comes from refactoring.
Beck, Kent (2023) “Canon tdd”. url: https://tidyfirst.substack.com/p/canon-tdd (visited on ~2024.2.20).
~bithex-topnym, Rikard Hjort (2023) “Quiz: A randomized property testing library for Urbit”. url: https://github.com/hjorthjort/quiz (visited on ~2024.2.7).
Feathers, Michael C. (2005). Working Effectively with Legacy Code. Prentice Hall. isbn: 978-0-13-117705-5.
~lagrev-nocfep, N. E. Davis (2022) “Writing Robust Hoon: A Guide to Urbit Unit Testing”. url: https://medium.com/dcspark/writing-robust-hoon-a-guide-to-urbit-unit-testing-82b2631fe20a (visited on ~2024.2.20).
~nomryg-nilref, Jack Fox (2023) “Sequent: A library of Hoon list functions for mortal developers”. url: https://github.com/jackfoxy/sequent (visited on ~2024.2.7).
~sorreg-namtyv, Curtis Yarvin (2006) “U, a small model”. url: http://urbit.sourceforge.net/u.txt (visited on ~2024.2.20).
Wikipedia (2024a) “G. K. Chesterton, Chesterton’s Fence”. url: https://en.wikipedia.org/wiki/G._K._Chesterton (visited on ~2024.2.7).
— (2024b) “Mathematical Induction”. url: https://en.wikipedia.org/wiki/Mathematical_induction (visited on ~2024.2.20).
“Yard: A Developer Commons” (2023). url: https://github.com/urbit/yard (visited on ~2024.2.7).
3 Wikipedia (2024b) provides a modestly technical exposition of mathematical induction. You can skip it, because we will explain it non-rigorously for lay programmers. We also quibble with the inclusion of zero as a natural number. Old-school maths started the natural numbers with “1”, but computer science has since infected maths. (Have you ever seen zero of anything? No—it is not natural at all.) Zero is a very, very special number. It is an abstraction that does not map to anything in the physical world.⤴
4The “principle of explosion” rigorously follows from use of disjunctive syllogisms. Here, we jocosely indicate that by proceeding from zero inductively we can demonstrate desired properties of testing.⤴
5Yarvin opted to invert true (as 0) and false (as 1) (~sorreg-namtyv, 2006), meaning that Nock loobeans do not align inductively with other inductive types.⤴