Advanced Namespace Tools blog

30 November 2018

Design of the ANTS colony-spawngrid

This post describes work in progress. If you are interested in the current status, stop in to gridchat on the 9gridchan public grid. Instructions for requesting an account and for using the spawngrid can be found at doc.9gridchan.info/guides/spawngrid.

Overview

ANTS colony servers provide spawning and saving of snapshotted Plan 9 namespace environments on demand, from a globally replicated store of filesystem images. The goal is a simple Plan 9 analog of cloud computing platforms such as AWS and Google Compute Engine. The contemporary distributed architecture of containerized microservices is related to what Plan 9 pioneered in the 1990s - per-process independent namespaces (the foundation of container-like systems) and single-purpose network services (delivered via 9p in Plan 9, as opposed to http+json) working together.

Plan 9 laid a technological and conceptual foundation for this approach to distributed systems, but its own network architecture and administration have mostly remained rooted in static pre-configuration and single-point-of-failure services. The flexibility of per-process namespaces has not generally been used to create container-like systems. The ANTS project's perspective is that container-like partitions of the namespace are a natural evolution of Plan 9 design principles. No attempt has been made to imitate particular features of BSD or linux jail/container systems; instead, features have evolved from real-world experience of using Plan 9 systems in this way.

Storage backend

The backing store of the grid has two components - a set of venti data blocks, and a database file in ndb format mapping rootscores, names for those rootscores, and users. By replicating this data between storage nodes, the grid services can spawn any root filesystem on any server on demand. Synchronization of venti blocks is performed by the ventiprog command. Each server stores in 9fat a server.info file for each other venti with which it wishes to replicate blocks, and then periodically runs ventiprog -f server.info for each of the .info files. This is efficient because only newly written blocks are transferred. The venti protocol does not offer authentication or encryption, which are needed to run a globally distributed network of venti servers. Fortunately, 9front provides general purpose tools for making authenticated and encrypted tunnels. Each storage server runs venti listening on the local loopback, and has this script as /rc/bin/service/tcp27034:

#!/bin/rc
exec tlssrv -A /bin/aux/trampoline tcp!127.1!17034

This creates a listener which will authenticate clients and then tls-tunnel/proxy the connection to the venti listening on loopback. Clients need to start their own local listener to make the connection:

aux/listen1 -t tcp!127.1!27035 /bin/tlsclient -a tcp!server.ip!27034 &

With that listener in place on the client, the client sets its venti env variable to point at it:

venti=tcp!127.1!27035

Commands such as the wrarena invocations made by ventiprog then talk to the local listener, which makes the authenticated, tls-encrypted connection to the remote server and tunnels the traffic.
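
As an illustration, the block replication could be driven by a small rc loop like the one below. This is only a sketch, not the actual grid script: the 9fat mount point, the single-peer setup, and the hourly interval are assumptions.

#!/bin/rc
# sketch only: replicate newly written blocks to one peer venti
# through the local tls tunnel described above; /n/9fat and the
# sleep interval are illustrative assumptions
venti=tcp!127.1!27035		# the local tlsclient listener for this peer
while(){
	ventiprog -f /n/9fat/server.info
	sleep 3600		# only new blocks are sent, so rerunning is cheap
}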

The scores database is also stored in 9fat, in the 'scorenames' file. A simple script, scorecopy, is used to sync this file between servers. Storage servers post a /srv/fatsrv of their 9fat to make this more convenient. The scorecopy replication should match the ventiprog replication - each venti should have a matching set of blocks and corresponding scores. The current spawngrid doesn't enforce this perfectly; there may be a lag of a few minutes before the data blocks catch up to the score replication. No great harm is caused by this, but it should be tightened up.
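
For illustration, an entry in scorenames might look something like the following. The attribute names and values shown here are assumptions chosen to show the shape of the mapping between rootscores, names, and users; the actual schema used on the grid may differ.

# illustrative scorenames entry (ndb format) - attributes and values are assumptions
name=mybase owner=glenda
	score=vac:0123456789abcdef0123456789abcdef01234567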

Spawning service

The spawngrid is organized into paired storage+cpu service units. The storage server runs a hubfs called 'gridsvc' which is mounted by both the cpu server and grid clients. Scripts named 'stosvc' and 'cpusvc' attach to the hubfs on their respective servers and dispatch commands based on the messages they receive. To issue a command, a client dials a separate service with tlsclient to relay an authenticated command, then notifies the gridsvc that it has sent a request. The most important request type is to spawn a rootscore labeled with a given name.
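
To make that flow concrete, issuing a spawn request from a client might look roughly like the sketch below. The relay port, the hub mount point and file name, and the message wording are all illustrative assumptions; the real protocol is defined by the stosvc and cpusvc scripts.

# sketch only: relay an authenticated command, then note the request in gridsvc
tlsclient -a tcp!server.ip!17041 /bin/echo 'spawn mybase'	# authenticated relay (port and message assumed)
echo 'spawn request sent for mybase' >>/n/gridsvc/io0		# notify the gridsvc hubfs (mount point and file name assumed)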

When a spawn request is issued, the scorenames file is queried using the ndb tools to check that a score of the given name is available for that user. If it is, a new fossil server is started, either in a ramfs or a disk partition, and is initialized with that rootscore. Because fossil only retrieves blocks from venti when a client requests them, initializing a new fossil is very quick and involves little more than writing the bytes of the rootscore to it. This lazy loading of data blocks from venti means that even a 100MB ramdisk is very usable for a fossil root, although the full filesystem data may be arbitrarily large.
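
A rough sketch of the storage-side steps for a spawn into a ramfs is shown below. The mount point, sizes, srv name, and ndb attribute names are illustrative assumptions; the real stosvc script differs in its details, but the core of the operation is fossil/flfmt -v, which initializes a fossil from a venti rootscore.

#!/bin/rc
# sketch only: spawn a root filesystem from a rootscore into a ram-backed fossil
name=mybase
venti=tcp!127.1!17034						# local venti holding the replicated blocks
score=`{ndb/query -f /n/9fat/scorenames name $name score}	# look up the rootscore (attributes assumed)
ramfs -m /n/spawnram						# ram-backed storage for the fossil disk file
dd -if /dev/zero -of /n/spawnram/fossil -bs 65536 -count 1600	# roughly 100MB backing file
fossil/flfmt -v $score /n/spawnram/fossil			# initialize the fossil from the rootscore
fossil/fossil -f /n/spawnram/fossil -c 'srv '^$name		# serve it and post the service in /srv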

This property is an unusual bonus of the fossil+venti architecture, and it is somewhat surprising in practice. Intuitively, it feels wrong to be able to work freely, reading and writing data within a multi-gigabyte fs, using a storage partition of much smaller size. This kind of magic works because the amount of data a user actually reads and writes is usually quite small compared to the total contained within the root fs. If the user wants to perform an operation such as a full system rebuild in /sys/src, a 100MB ramdisk is inadequate, but the 1GB+ disk partitions available via 'spawndisk' are more than large enough.

After the root filesystem is running, a message is sent to the connected cpu server to 'fetch' the given root and make it available. The cpu server starts a new flow of control and rforks into a clean /srv namespace. Next, it dials the storage server, acquires the file descriptor for the spawned root filesystem, and places it at /srv/boot. It then creates a new hubfs-attached rc, using the ants non-forking srv device #z to keep it accessible from the main namespace, and sends commands into the hub to start standard services, including a cpu listener.

rfork V							# detach into a clean /srv namespace
srv $dialstring boot					# dial the storage server; post the spawned root fs as /srv/boot
hub -b -z $hubname					# create a hubfs via the ants non-forking srv device #z
mount /zrv/$hubname /n/h				# mount the hub into this namespace
echo 'auth/newns' >>/n/h/io0				# build a fresh namespace in the hub-attached rc
cat /bin/startsrvs >>/n/h/io0				# start the standard services
echo 'aux/listen1 -tv tcp!*!'^$port^' /rc/bin/service/tcp17019 &' >>/n/h/io0	# start a cpu listener on the assigned port
chmod 660 /srv/boot					# restrict access to the spawned root fs
chgrp -o $username /srv/boot				# so that only the requesting user can attach

The final chmod/chgrp is a security precaution: it prevents other users, who are authenticated to the spawngrid but are not the spawning user, from connecting to the spawned environment. This mechanism will be adjusted to use a custom-built cpu listener that allows access only to the specified user.

End result

The end result is the creation of a container-style namespace which behaves, for the user, much like a fully independent system. The root fs is private and fully controlled by the user, and the private /srv directory prevents namespace collisions. Plan 9 does not have namespacing for /proc or user identities, or a method of limiting the resource use of a group of processes, so the separation and independence of environments falls somewhere between that of a traditional multi-user system and fully jailed environments.

The spawning system on the spawngrid is used mostly to subdivide vps resources further, making it economically feasible to provide free Plan 9 namespace environments on demand to any interested Plan 9 user. Another application of this system is as the foundation for a distributed task-orchestration system, which is a goal of future development. The current public grid services will make a good test application to manage in this way.