Formal methods
Formal methods are techniques – usually based on mathematics – used for the specification, analysis and verification of software and hardware.
Why would anyone be interested in formal methods?
The main reasons to consider formal methods are verifying that a design is correct, verifying that a change is correct, analysing a system to learn about it, and developing an intuition about the system.
We’ll be considering only model checking because I believe it is the easiest to get started with and get a return on the investment quickly.
Formal methods for design
A design doc is meant to be a recipe for a piece of software. It usually contains a problem overview, one or more possible solutions to the problem, and a description of how each solution would be implemented.
The description of the solutions in a design doc always contains several assumptions about how the solution’s dependencies behave and how the implementation will work. These assumptions may be correct, partially correct or incorrect.
A design doc author could manually enumerate every assumption and requirement the solution aims to satisfy in a matrix – and some people actually do – but that’s clearly too much effort for most projects.
Formal methods can be included in the phase where the design doc is created. Instead of only describing the solution in words, build a specification of the solution, and of how it interacts with other parts of the system, using a specification language. A model checker then checks the specification against the invariants you define: when an invariant is violated, the model checker outputs the sequence of steps that leads to the violation, known as a counterexample. The counterexample is used for debugging and reproducing the issue.
TLA+
TLA+ is a formal specification language that can be used in the design process of a new system, to find flaws in existing systems, to build a better understanding of existing systems and to prove correctness of algorithms.
TLA+ is a simple language that consists mostly of variables and actions that assign values to variables.
Check the cheatsheet.
Here’s an example specification of a clock that starts at either 1 or 2, goes up to 12, then wraps back to 1. There’s one invariant checking that at all times the clock hours are between 1 and 12, and a temporal property saying that the clock must eventually reach midnight (12).
[] is read as always
<> is read as eventually
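The clock spec itself isn’t reproduced here, so as a rough sketch, a specification matching that description could look like the following (the module and definition names are illustrative, not the original spec):

```tla
---- MODULE clock ----
EXTENDS Naturals

VARIABLE hour

\* The clock starts at either 1 or 2
Init == hour \in 1..2

\* Advance one hour, wrapping from 12 back to 1
Tick == hour' = IF hour = 12 THEN 1 ELSE hour + 1

\* [] appears here: the clock always either ticks or stutters,
\* and weak fairness rules out stuttering forever
Spec == Init /\ [][Tick]_hour /\ WF_hour(Tick)

\* Invariant: the clock hours are always between 1 and 12
HourWithinBounds == hour \in 1..12

\* Temporal property: the clock eventually reaches midnight
EventuallyMidnight == <>(hour = 12)
====
```

Running TLC with HourWithinBounds as an invariant and EventuallyMidnight as a temporal property checks both claims from the paragraph above.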
Formal methods for analysing a system
When a system already exists, it can be valuable to write a specification based on it to gain a deeper understanding of how the system works. Everything must be explicit in a TLA+ specification, so it’s common to start writing one and realize that you don’t actually understand how something works the moment you need to define its behavior. Sometimes new bugs are even found in the already existing system after writing a specification.
Formal methods for testing a system
A model checked TLA+ specification visits every possible behavior of the system: model checking is exhaustive up to a maximum number of inputs that the specification author defines. After writing a specification, it’s possible to use it in combination with the model checker to generate test cases for the real implementation.
MongoDB has a blog post about generating test cases for a real system from one of their TLA+ specs. They got 100% test coverage that way, compared to 21% from manually written tests and 91% from AFL.
The paper Model-guided Fuzzing of Distributed Systems talks about using a TLA+ specification state space to guide a fuzzer to test the real implementation of the specifications. The authors found bugs in etcd-raft and RedisRaft by using this technique.
Experimenting with formal methods
Since most of the team was not familiar with TLA+, we decided to focus our efforts on a small subset of a new project and to invest in knowledge sharing to make it easier for people on the team to get comfortable with the tool.
The chosen project provides a secret store type of interface where services can store secrets, such as passwords, and retrieve them when needed. Operations such as creating and deleting secrets were async at first, and the service responsible for managing secrets – known as the secrets service – used the outbox pattern to maintain consistency.
We started exploring TLA+ while working on the outbox, so we decided to model the outbox interactions to make sure our design was sound. We modelled not only the basic operations, such as adding an operation to the queue, but also the invariants that must always be true for the system to behave correctly.
One of the invariants is that there may be only one operation in progress at a time per secret. This invariant was modelled in our specification as follows:
SecretMetadataHasPendingStatusWhenTheresAnOperationInTheQueue ==
    \* For all secrets
    \A secret \in Secrets:
        LET SecretIsInPendingQueue ==
            \E i \in DOMAIN queue.pending:
                queue.pending[i] = secret
        IN
        \* the secret is either in the outbox queue
        \/ /\ SecretIsInPendingQueue
           \* and the secret metadata status is Pending
           /\ SecretMetadataHasPendingStatus(secret)
        \* or the secret is not in the queue
        \/ /\ ~SecretIsInPendingQueue
           \* and the secret metadata status is not Pending
           /\ ~SecretMetadataHasPendingStatus(secret)
An invariant that says that something bad never happens is known as a safety property. TLA+ also allows us to specify properties that say that something good eventually happens, known as liveness properties.
Our specification defines that when an operation for a secret is added to the outbox queue, the operation will eventually succeed and the secret status will be set to Succeeded.
EventuallyEveryMetadataStatusIsReady ==
    \* For all secrets
    \A secret \in Secrets:
        \* Transform the queue.pending tuple into a set
        LET PendingQueueSet == {queue.pending[i]: i \in DOMAIN queue.pending} IN
        \* If the secret is in the pending queue
        /\ secret \in PendingQueueSet
        \* Leads to it eventually
        ~>
        \* Being in the processed queue
        /\ secret \in queue.processed
        \* And removed from the pending queue
        /\ secret \notin PendingQueueSet
        \* And the secret being in the metadata table with status "Succeeded"
        /\ \E metadata \in db.secret_metadata:
            /\ metadata.name = secret
            /\ metadata.status = "Succeeded"
We didn’t get to use the specification to generate tests yet, but we were already able to reap one of the benefits of formal modelling: a deeper understanding of the system being specified. While writing the specification, we were forced to answer several questions, since we needed to make the behavior explicit.
Catching logic bugs with formal methods
Some months ago, a team hit a bug where two regions sharing the same SQL database, but with caches unique to each instance, would end up caching valid tokens as invalid. A cache invalidation bug.
After the bug was found, a basic specification modelling the behavior of the systems sharing the database was written in a few minutes. The spec is high level, contains no implementation details at all, and found the same bug immediately.
We are also exploring deterministic simulation testing as a lightweight formal method but we haven’t got far enough with it yet.
Hands-on prerequisites: Install TLA+ tools
Download the TLA+ toolbox from the official website
Open VSCode and install the TLA+ extension
Create a file called spec.tla. This is where you’ll write your spec.
Write the following boilerplate in your spec.tla file:
---- MODULE spec ----
EXTENDS TLC, Integers, FiniteSets, Naturals
\* Write your spec here
====
Hands-on: Clock
Given the following boilerplate, write a specification for a clock that counts from 0 to 12 and then resets back to 0.
---- MODULE spec ----
EXTENDS TLC, Integers, FiniteSets, Naturals
\* Write your spec here
====
Hands-on: Waves
Given a group of people forming a circle
After choosing one person in the circle at random, that person must high five the person to their left, then duck, and then high five the person to their right. A person that’s ducking cannot get up or high five anyone until another high five has happened. High fives happen in the counterclockwise direction, starting from the person chosen at the beginning. The wave ends when it reaches the person that started it.
Hands-on: Die hard
The die hard problem is a problem from the movie Die Hard with a Vengeance. There is a 5-liter water jug, a 3-liter water jug, a very precise scale (that doesn’t show the weight of items on top of it), and a faucet with an infinite stream of water. There’s a bomb ready to explode in a few seconds, and the only way to stop it is to put exactly 4 liters of water on the scale.
The jugs start empty. The possible actions are:
Fill one of the jugs with water – remember that the jugs are unmarked, so the only way to know how much water is in one of them is to fill it until it’s completely full.
Remove water from one of the jugs
Move water from one jug to the other – remember that each jug has a limited capacity.
Design a specification to find the shortest sequence of steps that solves the die hard problem. The solution is found when one of the jugs contains exactly 4 liters of water.
Hands-on: Two phase commit
Two phase commit is a distributed commit protocol where all participants in a transaction should eventually commit or rollback.
Two phase commit is a protocol that can be found in many real-world systems, some of them being:
MySQL: Transactions in MySQL clusters
Postgres: Meant to be used by external transaction managers
The protocol works in the following manner: one node is a designated coordinator, which is the master site, and the rest of the nodes in the network are designated the participants. The protocol assumes that:
there is stable storage at each node with a write-ahead log,
no node crashes forever,
the data in the write-ahead log is never lost or corrupted in a crash, and
any two nodes can communicate with each other.
Choose between specifying the two phase commit protocol or creating your own protocol for distributed transactions. Imagine that there’s a single transaction and it must commit or abort on all nodes. Remember that nodes may crash at any time.
Hands-on: Single decree paxos
Paxos is a protocol heavily used in the industry to implement replicated systems that need to decide on a value, where the value that’s decided must be the same on all the nodes involved.
Paxos is used in many real-world systems, some of them being:
Cassandra: For lightweight distributed transactions
Google Spanner: For state machine replication
Failure model:
Processes crash at any time
Messages can be lost
Messages can be reordered
Invent and specify a protocol to make 3 nodes decide on the same value. The value must be proposed by one of the nodes and the other nodes cannot know which value will be proposed ahead of time.
Does my implementation match the specification?
You just wrote a specification that models your system at an abstraction level that doesn’t care about things such as I/O. The specification is a high level version of the implementation that captures the main properties of the system you’re trying to build.
Next ==
    \/ \E instance \in Instances, namespace \in Namespaces, value \in Values, data_key_id \in DataKeyIds:
        Encrypt(instance, namespace, value, data_key_id)
    \/ \E instance \in Instances, namespace \in Namespaces, data_key_id \in DataKeyIds:
        Decrypt(instance, namespace, data_key_id)
    \/ \E instance \in Instances, namespace \in Namespaces:
        RotateDataKeys(instance, namespace)
    \/ \E instance \in Instances:
        \/ CacheRemoveExpiredEntries(instance)
        \/ RestartInstance(instance)
    \/ AdvanceTime
    \/ Terminating
The model checker uses the specification to check possibly thousands of behaviors, generating a graph of the reachable states. A state is an assignment of a value to each of the specification’s variables.
Every invariant is satisfied by the behaviors allowed by the specification and you believe your spec gives you enough confidence to move on to the implementation.
DataKeyUidUniqueness ==
    \A dk1 \in database:
        {dk2 \in database : dk2.uid = dk1.uid} = {dk1}

DataKeyUniqueness ==
    \A dk1, dk2 \in database:
        (/\ dk1.uid /= dk2.uid
         /\ dk1.namespace = dk2.namespace)
        /\ dk1.label = dk2.label
        /\ dk1.active
        /\ dk2.active
        =>
        Assert(FALSE, <<"duplicated data key", dk1, dk2>>)
CacheConsistency ==
    \A instance \in Instances:
        /\ \A entry \in instances[instance].cache.by_id:
            \E row \in database:
                /\ entry.namespace = row.namespace
                /\ entry.data_key.uid = row.uid
        /\ \A entry \in instances[instance].cache.by_label:
            \E row \in database:
                /\ entry.namespace = row.namespace
                /\ entry.data_key.uid = row.uid

DataKeysAreCachedByLabelOnlyAfterCautionPeriod ==
    \A instance \in Instances:
        \A entry \in instances[instance].cache.by_label:
            Assert((now - entry.data_key.created) > CautionPeriod,
                   <<"data key cached before caution period",
                     "now", now,
                     "entry.data_key.created", entry.data_key.created,
                     "CautionPeriod", CautionPeriod>>)
The implementation is coming along just fine, but you can’t help but wonder: how do you know that the implementation implements what’s in the specification? If the implementation follows the specification precisely, the implementation will be a refinement of the specification – every behavior of the implementation will be a behavior allowed by the specification, even though the specification may allow behaviors the implementation never exhibits.
func (s *EncryptionManager) RotateDataKeys(ctx context.Context, namespace string) error {
    s.log.Info("Data keys rotation triggered, acquiring lock...")

    s.mtx.Lock()
    defer s.mtx.Unlock()

    s.log.Info("Data keys rotation started")

    err := s.store.DisableDataKeys(ctx, namespace)
    if err != nil {
        s.log.Error("Data keys rotation failed", "error", err)
        return err
    }

    s.dataKeyCache.flush(namespace)
    s.log.Info("Data keys rotation finished successfully")
    return nil
}
Maybe at this point you decide to put the implementation and the specification side by side and pattern match between them.
RotateDataKeys(instance, namespace) ==
    /\ DisableDataKeys(namespace)
    /\ DataKeyCacheFlush(instance, namespace)
    /\ UNCHANGED now
That can get you quite far, but having the implementation execute the same sequences of actions that the specification allows would increase the confidence that the implementation actually implements the specification.
The output graph generated in the model checking process can be used to build every possible path from the initial state to the final state. Similar to what has been done here, the paths can be used to exercise the implementation in a normal test.
The generated graph is stored in a .dot file, so the test starts by reading and parsing the graph stored in the file.
func TestParseDotFile(t *testing.T) {
    file, err := os.Open("./testdata/spec2.dot")
    require.NoError(t, err)
    defer file.Close()

    buffer, err := io.ReadAll(file)
    require.NoError(t, err)

    graphAst, err := gographviz.ParseString(string(buffer))
    require.NoError(t, err)

    graph := gographviz.NewGraph()
    require.NoError(t, gographviz.Analyse(graphAst, graph))
}
Then, for each possible path starting from the initial state, we apply each action in the order it appears in the path. Here are a few examples of possible sequences of actions:
[Init AdvanceTime AdvanceTime RestartInstance]
[Init Encrypt AdvanceTime AdvanceTime RestartInstance]
[Init Encrypt AdvanceTime AdvanceTime CacheRemoveExpiredEntries]
[Init Encrypt AdvanceTime AdvanceTime RotateDataKeys]
[Init Encrypt AdvanceTime AdvanceTime Decrypt]
[Init AdvanceTime AdvanceTime CacheRemoveExpiredEntries]
[Init AdvanceTime AdvanceTime RotateDataKeys]
[Init AdvanceTime AdvanceTime Encrypt]
Each sequence is a test case against which the implementation is exercised.
for path := range visitor.Iter() {
    check(t, path)
}
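The visitor package used in the loop above comes from the real test harness and isn’t shown here. Its core idea – enumerating every sequence of action labels from the initial state to a terminal state of the graph – can be sketched with a plain depth-first search. Everything below is illustrative (the types and the tiny example graph are made up, not the real spec’s state graph):

```go
package main

import "fmt"

// EdgeWithLabel is a transition in the state graph: an action
// label on an edge between two states.
type EdgeWithLabel struct {
	From, To string
	Label    string
}

// Paths enumerates every sequence of action labels from start to a
// node with no outgoing edges, using depth-first search.
func Paths(edges map[string][]EdgeWithLabel, start string) [][]string {
	var result [][]string
	var walk func(node string, path []string)
	walk = func(node string, path []string) {
		next := edges[node]
		if len(next) == 0 {
			// Terminal state: record a copy of the path taken so far.
			result = append(result, append([]string(nil), path...))
			return
		}
		for _, e := range next {
			// Copy before appending so sibling branches don't share storage.
			walk(e.To, append(append([]string(nil), path...), e.Label))
		}
	}
	walk(start, nil)
	return result
}

func main() {
	// A tiny hypothetical state graph: Init leads to a state from which
	// either AdvanceTime or Encrypt leads to a terminal state.
	edges := map[string][]EdgeWithLabel{
		"s0": {{From: "s0", To: "s1", Label: "Init"}},
		"s1": {
			{From: "s1", To: "s2", Label: "AdvanceTime"},
			{From: "s1", To: "s3", Label: "Encrypt"},
		},
	}
	for _, p := range Paths(edges, "s0") {
		fmt.Println(p)
	}
	// Prints:
	// [Init AdvanceTime]
	// [Init Encrypt]
}
```

Each element of the result plays the role of one `path` handed to `check` in the loop above.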
For each action in a path generated from the specification, an equivalent action is applied to the implementation. It boils down to a loop and a switch statement.
func check(t *testing.T, path []visitor.EdgeWithLabel) {
    defer func() {
        if t.Failed() {
            pathJSONbytes, _ := json.MarshalIndent(path, "", " ")
            fmt.Printf("%s\n\n", string(pathJSONbytes))
        }
    }()

    var (
        encryptionManager      *EncryptionManager
        encryptedValuesByKeyId map[string][]map[string][]byte
        fakeTime               *fakeTime
    )

    for _, node := range path {
        switch node.Label {
        case "Init":
            // Initialize the system under test
            ...
        case "Encrypt":
            // Perform the Encrypt operation
            ...
        case "Decrypt":
            // Perform the Decrypt operation
            ...
        case "RotateDataKeys":
            // Perform the RotateDataKeys operation
            ...
        case "AdvanceTime":
            // Perform the AdvanceTime operation
            ...
        case "CacheRemoveExpiredEntries":
            // Perform the CacheRemoveExpiredEntries operation
            ...
        case "RestartInstance":
            // Perform the RestartInstance operation
            ...
        default:
            panic(fmt.Sprintf("unhandled label, did you forget a switch case?: %+v", node))
        }
    }
}
Assertions are added wherever you see fit. It’s recommended to assert at each step that the system is in a valid state and that every response received – if any – makes sense.
Generating tests for snapshot isolation implementation
In databases, snapshot isolation is a transaction isolation level that guarantees that a transaction will have the illusion of working on a snapshot of the database, taken at the time the transaction starts.
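As a toy illustration of that guarantee – this is not the implementation discussed in this section, just a minimal single-threaded sketch with made-up names – a database can give each transaction a copy of the data at begin time and buffer its writes until commit:

```go
package main

import "fmt"

// Tx holds a transaction's private snapshot and its buffered writes.
type Tx struct {
	snapshot map[string]string
	writes   map[string]string
}

// DB is a toy key-value store with snapshot-isolated transactions.
type DB struct {
	data   map[string]string
	nextTx int
	txs    map[int]*Tx
}

func NewDB() *DB {
	return &DB{data: map[string]string{}, txs: map[int]*Tx{}}
}

// Begin takes a snapshot of the committed data and returns a tx id.
func (db *DB) Begin() int {
	snap := make(map[string]string, len(db.data))
	for k, v := range db.data {
		snap[k] = v
	}
	db.nextTx++
	db.txs[db.nextTx] = &Tx{snapshot: snap, writes: map[string]string{}}
	return db.nextTx
}

// Read sees the transaction's own writes first, then its snapshot,
// never the writes committed by other transactions after Begin.
func (db *DB) Read(tx int, key string) (string, bool) {
	t := db.txs[tx]
	if v, ok := t.writes[key]; ok {
		return v, true
	}
	v, ok := t.snapshot[key]
	return v, ok
}

// Write buffers the change privately until Commit.
func (db *DB) Write(tx int, key, value string) {
	db.txs[tx].writes[key] = value
}

// Commit applies the buffered writes to the committed data.
func (db *DB) Commit(tx int) {
	for k, v := range db.txs[tx].writes {
		db.data[k] = v
	}
	delete(db.txs, tx)
}

func main() {
	db := NewDB()
	t1 := db.Begin()
	db.Write(t1, "k", "v1")
	t2 := db.Begin() // t2's snapshot predates t1's commit
	db.Commit(t1)
	_, ok := db.Read(t2, "k")
	fmt.Println(ok) // false: t2 keeps its snapshot view
}
```

A real implementation would also handle write-write conflicts between concurrent transactions (first-committer-wins), which this sketch ignores.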
Specifications for snapshot isolation can easily be found, but I decided to write my own in TLA+. After writing the specification, I implemented an example database with snapshot isolation in Go. Once I was done with the implementation and had written a few tests, I decided to generate the test cases from the state graph produced during the model checking process.
for path := range visitor.Iter() {
    db := NewDatabase()
    // Map from model tx id to impl tx
    modelMap := make(map[int]int)

    for _, node := range path {
        switch node.Label {
        case "TxRead":
            modelTxId := int(mustParseInt(node.Args[0]))
            key := node.Args[1]
            db.Read(modelMap[modelTxId], key)
        case "TxBegin":
            modelTxId := int(mustParseInt(node.Args[0]))
            txId := db.BeginTransaction()
            modelMap[modelTxId] = txId
        case "TxCreate":
            modelTxId := int(mustParseInt(node.Args[0]))
            key := node.Args[1]
            value := node.Args[2]
            db.Write(modelMap[modelTxId], key, value)
        case "TxUpdate":
            modelTxId := int(mustParseInt(node.Args[0]))
            key := node.Args[1]
            value := node.Args[2]
            db.Write(modelMap[modelTxId], key, value)
        case "TxDelete":
            modelTxId := int(mustParseInt(node.Args[0]))
            key := node.Args[1]
            db.Delete(modelMap[modelTxId], key)
        case "TxRollback":
            require.Equal(t, 1, len(node.Args))
            modelTxId := int(mustParseInt(node.Args[0]))
            db.Rollback(modelMap[modelTxId])
        case "TxCommit":
            require.Equal(t, 1, len(node.Args))
            modelTxId := int(mustParseInt(node.Args[0]))
            db.Commit(modelMap[modelTxId])
        case "Terminating":
            // no-op
        default:
            panic(fmt.Sprintf("unhandled action: %+v", node))
        }
    }
}
The generated test cases resulted in 100% code coverage without any special effort.
I also got a specification for snapshot isolation from github.com/will62794/snapshot-isolation-spec. It was the first one I found, and I decided to generate test cases from it as well.
After that I decided to exercise the snapshot isolation implementation from the blog post Implementing MVCC and major SQL transaction isolation levels. I got 85.8% coverage using the specification I wrote. Note that Phil’s implementation contains more than just the code for the snapshot isolation level so code paths related to other isolation levels are not covered – as expected.
I also ran the same tests using will62794’s specification this time and got 80.1% coverage on the first try.
Generating tests for the encryption manager
The encryption manager is a package used for envelope encryption. A specification modeling some of the operations supported by the encryption manager was used to generate test cases for the implementation. The test cases resulted in 60.3% coverage.
Next ==
    \/ \E instance \in Instances, namespace \in Namespaces, value \in Values, data_key_id \in DataKeyIds:
        Encrypt(instance, namespace, value, data_key_id)
    \/ \E instance \in Instances, namespace \in Namespaces, data_key_id \in DataKeyIds:
        Decrypt(instance, namespace, data_key_id)
    \/ \E instance \in Instances, namespace \in Namespaces:
        RotateDataKeys(instance, namespace)
    \/ \E instance \in Instances:
        \/ CacheRemoveExpiredEntries(instance)
        \/ RestartInstance(instance)
    \/ AdvanceTime
    \/ Terminating
Most of the uncovered lines are error handling paths or unsupported operations; neither case was modeled in the specification.