Advanced Namespace Tools blog

14 January 2018

Rewriting ventiprog progressive backup script and related tools

The Advanced Namespace Tools can be used with any root fileserver, but have always been designed with fossil+venti as their primary environment, even after the change to 9front-only support. One of the components is a set of scripts to help manage venti+fossil replication. I have always felt there was insufficient support tooling around venti+fossil, which contributed to user frustration during the era when many people experienced failures of the system. Post-2012, I believe the old bugs have been cleaned up and reliability is much improved, but good backup hygiene is still important.

Basic concepts: arenas data and fossil rootscores

At its core, venti is an append-only log of data blocks. These blocks are stored in "arenas" partitions, which are further subdivided into arenas of 512MB each. The required "index" partition and optional "bloom" partition can be regenerated if needed from the raw arenas data.

The fossil file server uses venti as a backing store for presenting a conventional hierarchical file system to the user. Because the data describing how the blocks are assembled into a filesystem is itself part of the structure stored in venti, there is an ultimate "master block" representing the root of the entire filesystem, which can be referenced by its SHA-1 hash fingerprint. (Yes, venti does need an update to a newer hash function.) Whenever fossil stores a snapshot to venti, it records this fingerprint, known as a "rootscore". You can view it like this:

term% fossil/last /dev/sdE0/fossil
vac:98610970109e9a89f19dbc69c9d7b5b196d31459

If you can connect to a venti which holds a copy of the relevant data blocks, you can instantly create a new clone-fossil by issuing a command like:

fossil/flfmt -v 98610970109e9a89f19dbc69c9d7b5b196d31459 /dev/sdE1/fossil

Because fossil only retrieves blocks from venti as they are needed, a freshly flfmt -v fossil can be created instantly and occupies approximately no space. As data is read/written, blocks will be retrieved and cached locally by the fossil until the next snapshot is made, at which point they are considered free for reuse. The key point to understand is that a fossil/venti backup system can ignore everything about the fossils apart from the rootscores; all the data is contained in the venti arenas.

The tool for copying venti arenas is the venti/wrarena command. In action, it looks like this:

venti/wrarena -o 5369503744 /dev/sdF0/arenas 0x90c937a

That command sends data to the venti specified in the $venti environment variable, reading from the arenas partition /dev/sdF0/arenas, starting at the sub-arena given by the -o offset and at the clump offset given by the final hexadecimal argument. The first offset is the location of the given sub-arena within the overall arenas partition, and the clump offset is where within that arena's data clumps to begin reading. Backing up a freshly initialized venti which has not yet dumped any data will begin with a command more like:

venti/wrarena -o 794624 /dev/sdF0/arenas 0x0

That command references the very first sub-arena and the very beginning of the data clumps. The offsets are calculated by retrieving the index file from venti's http server and subtracting 8192 from the first parameter after the bracket in output lines such as:

arena='arenas0' on /dev/sdF0/arenas at [802816,537673728)
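
For arenas0 above, the wrarena offset is therefore 802816 - 8192 = 794624, which is exactly the -o value used in the fresh-venti command earlier. A quick way to do the subtraction by hand (and the way the rewritten script does it, as shown later) is with 9front's pc calculator; a trivial sketch:

echo '802816 - 8192' | pc	# should print 794624; the rewritten script strips any trailing ';' with sed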

The wrarena command only traverses a single sub-arena. To fully back up an active venti, you will need to traverse the entire list of active arenas.

The old ventiprog script

For the first ANTS release back in 2013, I had created a script called 'ventiprog' which automated this process somewhat. It did this by maintaining a list of rootscores under /n/9fat/rootscor and also a wrcmd at /n/9fat/wrcmd. The "fossilize" command would store the most recent fossil archival snapshot rootscore, and the ventiprog command would issue the stored wrcmd, track the new clump offset, and write an updated wrcmd with the new clump offset. You would then copy your stored rootscores to the backup venti.
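
The stored wrcmd was simply a saved wrarena invocation of the kind shown above, with the current clump offset filled in; hypothetically, something like:

venti/wrarena -o 5369503744 /dev/sdF0/arenas 0x90c937a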

The main annoyance with this approach was creating and maintaining the wrcmd. As soon as you filled up a given arena, you would need to manually create a new wrcmd by finding the offset of the next sub-arena within the arenas partition. If you were dealing with large quantities of data, this was obviously way too much extra work for a supposedly automated progressive backup/replication system. Because my data needs were comparatively small, mostly plaintext source code and writing, the system worked ok for me, but it clearly needed to be upgraded.

backup.example from /sys/src/cmd/venti/words

There was already a more sophisticated progressive venti backup script present in the distribution, an example created by Russ Cox. I wasn't making use of it because when I glanced at it, I didn't fully understand what it was doing and what assumptions it was making about its environment. When it was time to upgrade the ANTS venti replication system though, it was the right place to start. The core logic of the original script is as follows:

. bkup.info
fn x {
	echo x $*
	y=$1
	if(~ $#$y 0){
		$y=0
	}
	echo venti/wrarena -o $2 $3 $$y
	end=`{venti/wrarena -o $2 $3 $$y | grep '^end offset ' | sed 's/^end offset //'}
	if(~ $#end 1 && ! ~ $$y $end){
		$y=$end
		echo '#' `{date} >>bkup.info
		whatis $y >>bkup.info
	}
}
hget http://127.1:8000/index | 
awk '
/^index=/ { blockSize=0+substr($3, 11); }
/^arena=/ { arena=substr($1, 7); }
/^	arena=/ { start=0+substr($5, 2)-blockSize; printf("x %s %d %s\n", arena, start, $3); }
' |rc

This is not the easiest script in the world to grok on first reading, is it? I'll try to gloss this for you by describing what actually happens when you run it, assuming you start with a blank bkup.info file.

The tricky and clever part of the script is the $y variable, which is used to track the clump offsets. After the script has been run several times, the bkup.info file will look like this:

# Thu Jan 11 05:17:54 GMT 2018
arenas0=0x1fbbff7b
# Thu Jan 11 05:18:12 GMT 2018
arenas1=0x1fe6b78f
# Thu Jan 11 05:19:47 GMT 2018
arenas2=0x1fcc3e9e
# Thu Jan 11 05:19:55 GMT 2018
arenas3=0x12f3126
# Thu Jan 11 05:35:42 GMT 2018
arenas3=0x3602e6e
# Thu Jan 11 09:13:28 GMT 2018
arenas3=0x1fc9ddc9
# Thu Jan 11 09:15:22 GMT 2018
arenas4=0x1fc92873
# Thu Jan 11 09:17:43 GMT 2018
arenas5=0x1fba5742

As a result, when the ". bkup.info" command is executed, we create a set of variables tracking the final clump offset of each arena. As the x function executes, it uses a clever layer of indirection: $y is going to be an arenas label like "arenas3" and therefore $$y is going to be whatever the value of arenas3= was set to in the bkup.info script.
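
You can watch this indirection work interactively in rc; a tiny example using one of the variables defined by the bkup.info above:

term% arenas3=0x1fc9ddc9
term% y=arenas3
term% echo $$y
0x1fc9ddc9

The assignment $y=$end in the x function goes through the same indirection in the other direction, updating arenas3 itself, which is why a simple whatis $y appended to bkup.info is enough to record the new offset.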

Rewriting ventiprog based on backup.example

So, this script is obviously a better foundation for progressive venti replication, because it does the tricky part - calculating the sub-arena and clump offsets and issuing the wrarena command for each one - rather than relying on manual management. As written, though, it has a major issue in current 9front, which I'm not sure existed in the old Labs environment. The main problem is that the native awk port used by 9front does 32-bit signed arithmetic, which means that the partition offset values go wrong as soon as you are more than 2GB into the arenas partition. I fixed this by removing the arithmetic handling from awk and adding this logic:

[awk without arithmetic]' |sed 's/\,.*\)//g' >/tmp/offsetparse
numarenas=`{wc /tmp/offsetparse}
count=0
while(! ~ $count $numarenas(1)){
	data = `{grep arenas^$count /tmp/offsetparse}
	offset=`{echo $data(3) ' - 8192' |pc |sed 's/;//g'}
	echo x $data(2) $offset $data(4) >>/tmp/offsets.$mypid
	count=`{echo $count ' + 1' |hoc}
}
cat /tmp/offsets.$mypid |rc

These shenanigans just use the "pc" calculator in 9front to process the arithmetic and produce output in the same form as the original script. The other changes to backup.example are minor: an optional parameter for the name of the bkup.info file, so that you can back up to multiple ventis by keeping an info file for each one; an option to specify a different address than 127.1:8000 for the http request for the venti index file; and an option to specify a 'prep' file in addition to the info file, in case preconfiguration is necessary. For instance, on the Vultr VPS nodes that I use for the remote portion of my grid, my prep file looks like:

webfs
bind -b '#S' /dev
bind -b '#l1' /net.alt
bind -b '#I1' /net.alt

This is necessary to set up the environment after cpu'ing in from my terminal. The bkup.info file equivalents are named 9venti.info and ventiauth.info according to the names of the target ventis, and each begins with a line like:

venti=/net.alt/tcp!10.99.0.12!17034

This makes the wrarena command target the venti server reachable on the /net.alt private network of the VPS datacenter. To replicate the data from my main venti to two other ventis, I issue these commands:

ventiprog -p bkup.prep -f ventiauth.info
ventiprog -f 9venti.info

Because venti deduplicates data blocks, I can run a similar set of commands on the other ventis, so the blocks present in each venti are mirrored in the others, and any fossil server can init a copy of the most recent snapshot backed by any of these ventis.

Rootscore management with fossilize and storescore

The other component of the system that needed some improvement was management of fossil rootscores. A single 'rootscor' file is suboptimal when one venti is backing up multiple different fossils; you want to track each fossil's scores separately. The fossilize script was modified so that it stores rootscores in the 9fat at scores.$fsname, with fsname set by default to $sysname. A companion script, storescore, is used by the venti servers to receive output from fossilize, store it in their own 9fat partition under a parallel name, and pass it along to a possible next machine. A sample command, used after ventiprog has replicated the venti data blocks between machines:

rcpu -h fs -c fossilize |storescore |rcpu -h backup -c storescore

This is run on the primary venti server and retrieves the most recent rootscore from 'fs' via the fossilize command, then stores it locally with storescore, and continues the chain by sending the output of storescore (which is identical to its input) to storescore on the backup venti.

Testing revealed a mildly annoying issue: the default 9front namespace file does not bind '#S' into /dev, and once it is bound, it is visible at the standard path to remote machines during cpu. This caused the partition-finding logic to fail, because the target machines, when rcpu'd into, would see the controlling machine's disk files under /dev/sd* rather than their own. The fix was both to make sure that '#S' was unmounted on the controlling machine before issuing the fossilize and storescore commands via rcpu, and to add logic to bind '#S' if needed inside those scripts, after a rfork en.
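
For reference, here is a minimal sketch of what the pair does, not the actual ANTS code: it assumes the fossil lives on /dev/sdE0/fossil as in the earlier example, that 9fs 9fat is how the local 9fat ends up at /n/9fat, and that a single rootscore line is passed through the chain.

#!/bin/rc
# fossilize (sketch): print the newest rootscore and record it on the local 9fat
rfork en
bind -b '#S' /dev			# the real script only does this if needed
9fs 9fat				# assumption: mounts the local 9fat at /n/9fat
fsname=$sysname				# the real script lets you override this
score=`{fossil/last /dev/sdE0/fossil}	# assumed fossil partition, as in the example above
echo $score >>/n/9fat/scores.$fsname
echo $score

#!/bin/rc
# storescore (sketch): record a rootscore read from stdin and echo it unchanged
rfork en
bind -b '#S' /dev
9fs 9fat
fsname=fs				# assumption: a parallel name matching the originating fileserver
score=`{read}
echo $score >>/n/9fat/scores.$fsname
echo $score				# output equals input, so the chain can continue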

With these new, improved scripts, a fully automated replication workflow for venti data and fossil rootscores is easy to put in place.