Advanced Namespace Tools blog

14 February 2018

Improvements to Hubfs

Hubfs has been a stable and useful tool for me for many years, but nothing is completely perfect. This post describes recent work to fix two longstanding issues - one annoyance and one corner-case bug. The annoyance was that a process directly reading from a hubfile couldn't be successfully interrupted until new data arrived. The bug was that very occasionally, a client would receive a re-send of the entire previous hubfs buffer. Both of these problems have now been rectified, so anyone using hubfs is invited to upgrade. Hubfs is contained within the main ANTS repos, and is also available in a standalone repo and within my 9front contrib directory.

Tflushes and interrupts

When processes read from a hubfile, the 9p Tread requests are queued until data arrives to answer them. From the point of view of the reading process, it is stuck inside a blocking syscall. This is the reason that, before this patch, if you catted a hubfile and no data was arriving, you couldn't interrupt the cat - it was stuck inside the read syscall and would not receive the interrupt note until it crossed back out of the kernel boundary when the pending read was fulfilled.

I didn't realize for a very long time that there was a fix for this. It turns out that it was resolvable by adding a flush message handling function to the hubfs Srv structure. When an interrupt is sent to a process reading from a 9p fs, a flush message is sent to the fs. I'm not sure where this is documented - I learned it from reading a conversation in irc about someone else's fs development. (Thanks Mischief for explaining this!) When I saw the discussion, I realized that I could fix the uninterruptible-reads issue in hubfs, and set to work.
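
In lib9p terms, the change amounts to registering a flush handler in the Srv structure that hubfs passes to the library. A minimal sketch of what that registration looks like - the handler names other than fsflush are illustrative placeholders, not necessarily the exact names hubfs uses:

/* sketch: a lib9p Srv structure with a flush handler registered */
Srv fs = {
	.open=		fsopen,
	.read=		fsread,
	.write=		fswrite,
	.flush=		fsflush,	/* called when a Tflush arrives, e.g. after an interrupt */
};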

The Tflush request is sent with a data field "oldtag", which identifies the tag of the previous request that should be flushed. Pending requests are stored in per-hub arrays called "qreads" and "qwrits", so my first implementation tried to find the hub from r->fid->file->aux in the Tflush request, and then check the pending request arrays for the requested oldtag. Unfortunately, this produced a null-pointer dereference suicide. I assumed this must be because the Tflush request didn't fill in the Req structure in the same way, and a bit of time in acid confirmed this - the Tflush doesn't duplicate the data present in the original request, so I couldn't find the correct Hub structure from it.
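
To illustrate, here is a sketch of that first approach, reconstructed from the description above rather than copied from the real code:

/* reconstruction of the first attempt, for illustration only: read and
 * write handlers can find their Hub this way because their Req carries
 * the fid of an open hubfile; a Tflush Req does not, so this dereference
 * is what blew up */
void
fsflush(Req *r)
{
	Hub *h;

	h = r->fid->file->aux;	/* no valid file info on a Tflush */
	/* ...then search h->qreads and h->qwrits for r->ifcall.oldtag... */
}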

Fortunately, hubfs already maintains a linked list of the hubs that it contains, so it was possible to do a comprehensive search. The fsflush(Req *r) function receives a Tflush request and then walks the list of hubs, trying to flush the matching request from each in turn:

Hublist *currenthub;
int flushed;

/* walk the list of hubs, asking each in turn to flush the request named by oldtag */
currenthub = firsthublist;
flushed = flushcheck(currenthub->targethub, r);
if(flushed)
	return;
while(currenthub->nexthub->targethub != nil){
	currenthub = currenthub->nexthub;
	flushed = flushcheck(currenthub->targethub, r);
	if(flushed)
		return;
}
/* no pending request matched, just acknowledge the Tflush */
respond(r, nil);

The real work is done within the flushcheck(Hub *h, Req *r) function, which is called for each hub in turn by the code above:

Req *tr;
int i;

/* scan the queue of pending read requests on this hub */
for(i = h->qrans; i <= h->qrnum; i++){
	if(h->rstatus[i] != WAIT)	/* only requests still waiting can be flushed */
		continue;
	tr = h->qreads[i];
	if(tr->tag == r->ifcall.oldtag){
		tr->ofcall.count = 0;	/* answer the flushed read with no data */
		h->rstatus[i] = DONE;
		if((i == h->qrans) && (i < h->qrnum))
			h->qrans++;	/* it was at the front of the queue, so move the start forward */
		respond(tr, nil);	/* complete the original read */
		respond(r, nil);	/* then acknowledge the Tflush */
		return 1;
	}
}
[parallel code for qwrits array omitted]

When I originally wrote the above code, I forgot the if(h->rstatus[i] != WAIT) check, which resulted in old tags from already answered requests being matched, with bad results. I'm glad I ran several rounds of debugging, because in the first few tests there weren't any already-answered Reqs with a duplicate tag to confuse things, so everything seemed to be working.

The "spin" bug

After getting back into the hubfs code while implementing the Tflush handling, I decided it was time to hunt down a bug I had been seeing rarely, but repeatedly, for many years. The symptom was that hubfs would suddenly re-send an almost completely filled buffer to clients. It was obviously some kind of issue with the wraparound code that moves the writers and readers back to the start of the buffer once it has filled, but I had spent many hours staring at the relevant code without seeing any problem. I even had a comment in the relevant section of the reading and writing code that said WRAPAROUND LOGIC CHECK HERE FOR BUGS. Here's the relevant section of the wrsend() code for handling write requests:

count = r->ifcall.count;
/* if this write would bring the bucket to within 8k of full, wrap */
if((h->buckfull + count) >= BUCKSIZE - 8192){
	h->buckwrap = h->inbuckp;	/* record where this cycle's data ends */
	h->inbuckp = h->bucket;		/* and start writing from the beginning again */
	h->buckfull = 0;
}
memmove(h->inbuckp, r->ifcall.data, count);
h->inbuckp += count;
h->buckfull += count;
r->fid->file->length = h->buckfull;
r->ofcall.count = count;

And here is the code for the readers in msgsend():

count = r->ifcall.count;
/* if this reader has reached the wrap point, send it back to the start */
if(mq->nxt >= h->buckwrap){
	mq->nxt = h->bucket;
	mq->bufuse = 0;
}
/* don't read past the wrap point... */
if(mq->nxt + count > h->buckwrap)
	count = h->buckwrap - mq->nxt;
/* ...or past the data written so far this cycle */
if((mq->bufuse + count > h->buckfull) && (mq->bufuse < h->buckfull))
	count = h->buckfull - mq->bufuse;
memmove(r->ofcall.data, mq->nxt, count);
r->ofcall.count = count;
mq->nxt += count;
mq->bufuse += count;

The key variables here are "h->buckwrap", a pointer to where in the hub data buffer wraparound occurs; "h->buckfull", the quantity of data in the hub; "h->inbuckp", a pointer to where the next message will be written; "mq->bufuse", how much data has been transferred to this reader; and "mq->nxt", a pointer to the next data in the bucket that will be sent to this reader. If the next write would bring the amount of data written to within 8k of the end of the data bucket, the buckwrap pointer is set to the current limit of data written, and the write is moved back to the start of the bucket. For the reader, if its read point is at or beyond the bucket wrap point, it is sent back to the beginning, and if it wants to read beyond the wrap point, it reads only up to it.

Given this logic, I didn't understand how the "spinning" reads ever happened. I decided to vastly reduce the total size of the data bucket, so I could create controlled test cases with an amount of data I could easily type, rather than needing to put hundreds of thousands of bytes in the bucket to get close to the wraparound conditions. Sure enough, once I sized the buckets down to just 48 bytes and started typing in random small strings, I was able to generate a failure, with the reader "spinning" around the bucket repeatedly until I added some more data.

I attached acid to a hubfs in this state and examined the data structures. What I discovered was that inbuckp was greater than buckwrap. How could this be? When inbuckp gets close to the end of the buffer, buckwrap is set to inbuckp and inbuckp is reset to the start of the bucket. I stared at the code for a while but insight did not dawn. I decided I needed to watch each step of the failure case as it happened, so I started a new hubfs, set breakpoints at msgsend and wrsend, and started stepping through as I entered the data that triggered the failure.

While I was doing this, a flash of understanding arrived, and the acid-debugging process confirmed it. The failure could never happen within a single "cycle" through the buffer - it was an interaction between two successive cycles. There is an absolute limit on the quantity of data in the buffer, but the actual buckwrap point is always somewhat below that limit, and can fall at a range of values. The bug is triggered when, on the first cycle, buckwrap is set to a certain value, and on the next cycle, inbuckp takes a value that is higher than the previous buckwrap but still not high enough to trigger the wraparound test and cause buckwrap to be reset. The reader still sees the old buckwrap point, and resets itself to the start of the buffer without realizing it is a cycle behind. Once enough additional data is written, inbuckp resets to the start of the buffer, a new buckwrap is set, and things sync up again.
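
To make the failure sequence concrete, here is a small standalone model of the writer and reader state, using integer offsets in place of the buffer pointers and tiny illustrative sizes instead of the real bucket size and 8k margin. This is a sketch, not hubfs code, but it reproduces the stale-buckwrap reset described above. The first reader reset it prints, right after the writer itself wraps, is the normal case; the second one, after the final small write pushes the writer past the old wrap point without triggering a new wrap, is the spurious "spin".

#include <u.h>
#include <libc.h>

/* standalone model of the wraparound logic; offsets stand in for the
 * buffer pointers and the variable names follow the hubfs code above */

enum {
	Bucksize	= 64,
	Margin		= 16,	/* wrap when a write would pass Bucksize - Margin */
};

int inbuckp, buckfull, buckwrap = Bucksize;	/* writer state */
int nxt, bufuse;				/* one reader's state */

void
wr(int count)
{
	if(buckfull + count >= Bucksize - Margin){
		buckwrap = inbuckp;	/* mark where this cycle's data ends */
		inbuckp = 0;
		buckfull = 0;
	}
	inbuckp += count;
	buckfull += count;
}

void
rd(int count)
{
	if(nxt >= buckwrap){	/* a stale buckwrap sends the reader back too early */
		print("reader wraps to offset 0 (buckwrap %d, writer at %d)\n", buckwrap, inbuckp);
		nxt = 0;
		bufuse = 0;
	}
	if(nxt + count > buckwrap)
		count = buckwrap - nxt;
	if(bufuse + count > buckfull && bufuse < buckfull)
		count = buckfull - bufuse;
	nxt += count;
	bufuse += count;
	print("reader now at offset %d\n", nxt);
}

void
main(void)
{
	int i;

	/* cycle 1: five 10-byte writes; the fifth wraps and leaves buckwrap at 40 */
	for(i = 0; i < 5; i++){
		wr(10);
		rd(10);
	}
	/* cycle 2: three more 10-byte writes bring the writer back up to offset 40 */
	for(i = 0; i < 3; i++){
		wr(10);
		rd(10);
	}
	/* a 5-byte write moves the writer to 45, past the stale buckwrap of 40,
	 * without triggering the wrap test (45 < 48) - the reader now resets to
	 * offset 0 and starts re-sending data it already delivered: the "spin" */
	wr(5);
	rd(5);
	exits(nil);
}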

So the fix was very simple: in the wrsend function, add a check and correction for this condition immediately after h->inbuckp += count;

/* the write has moved past a stale buckwrap left over from the previous cycle; keep buckwrap ahead of it */
if(h->inbuckp > h->buckwrap)
	h->buckwrap = h->inbuckp + 1;

In further testing, this fixed the issue and I was happy to remove the "CHECK HERE FOR BUGS" comment from the code.

Hubfs status

Now that the corner-case wraparound bug and the uninterruptible-read annoyance have been fixed, there are no currently known hubfs issues. There are still some "compromises" that result from the simplicity of the design, such as the inability to synchronize output between stdout and stderr when it is used for interactive shells, but within the current design parameters, everything functions correctly so far as my testing has been able to determine.

Recently, we started using hubfs as the backend for an irc-like service on the public grid, which I believe is a good demonstration of how it fits with the basic unix "toolbox" philosophy, as well as of the strengths of using 9p fses to provide services.