Advanced Namespace Tools blog

24 January 2018

A dive into the memory allocation pool

I wake up in the morning and see in my "faces" program that I have a new email message. I click on it and the message opens, but I see an error print tear through my subrio. After reading the message, I exit the subrio and find this:

mem user overflow
pool sbrkmem block 55da98
hdr 0a110c09 00003f40 0020ddb6 00209358 faf0f1fe deadfa11
tail 00001bde 252b4a04 00001393 5a687d4d cafebabe ffffffff | ef2800be 00003f40
user data d0 24 00 00  d8 6f 68 5a | de 1b 00 00  04 4a 2b 25
panic: pool panic
upas/fs: user: glenda; note: sys: trap: fault read addr=0x0 pc=0x220ca6
fs 64945: suicide: sys: trap: fault read addr=0x0 pc=0x220ca6

A check with ps -a | grep Broke reveals upas/fs in the Broken state. We attach acid, check lstk(), and find - not much:

access(mode=0x41f248)+0x50 /sys/src/libc/9sys/access.c:27
	db=0x7ffffeffe8a8
0x7ffffeffe8a8 ?file?:0

So we have a mystery to solve: what exactly does the "mem user overflow" error mean, and what code triggered it?

Where does "mem user overflow" come from?

The first step is always to dig into /sys/src and see what we can find there. I used grep to search the kernel files in /sys/src/9/port and /sys/src/9/pc64 for "overflow" but no luck there. Probably something in the libc then? A search in /sys/src/libc found what I was looking for:

term% cd /sys/src/libc
term% grep overflow */*
[...]
port/pool.c: * marked with a magic value, which is checked to detect user overflow.
port/pool.c:	printblock(p, b, "mem user overflow (magic0)");
port/pool.c:	printblock(p, b, "mem user overflow");
port/pool.c:	panicblock(p, b, "mem user overflow");
[...]

Bingo. Looking at the code in pool.c was not immediately revelatory, so I performed the next standard step in debugging: call for help in irc. This produced a lot of output, much of which flattered me by presuming that I understood things better than I actually did.

At this point, a small digression is in order. It can be hard, when given obviously informative answers which are somewhat over your head, to not just say "ok, I'll check that out" and then abandon the whole thing and hope that nobody asks about the results of your investigation. It can also be hard not to say "I have no understanding of this; you figure it out, if you care about it."

Neither of these is the right thing to do. Treat these moments as golden opportunities for in-depth learning. If you are willing to use your brain, engage with the code, and make your best effort to figure things out, people will almost certainly provide you with pointers to the pieces you are missing. You have to demonstrate that you are putting in the effort to learn, so that next time you will be able to figure it out on your own. It is a case of "teach a man to fish" - but, to change aphorisms, you do not want to be the horse that is led to water but doesn't bother to drink.

So, I was informed that the error I saw indicated that memory had been corrupted. The code in pool.c was checking the validity of the allocated memory, and discovered an error.
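
To make that concrete, here is a toy sketch of the idea - my own illustration, not the real /sys/src/libc/port/pool.c: the allocator hands out slightly oversized blocks, writes a known magic byte just past the region the user asked for, and verifies it whenever the block comes back through the allocator.

#include <u.h>
#include <libc.h>

/* toy illustration of boundary checking, not the real pool code:
 * one guard byte sits just past the user's region; writing beyond
 * the requested size clobbers it, and the check catches that the
 * next time the allocator looks at the block */
enum { Guard = 0xef };

uchar*
toyalloc(ulong n)
{
	uchar *b;

	b = malloc(n+1);		/* one extra byte for the guard */
	if(b == nil)
		sysfatal("malloc: %r");
	b[n] = Guard;
	return b;
}

void
toycheck(uchar *b, ulong n)
{
	if(b[n] != Guard)
		sysfatal("toy mem user overflow");
}

void
main(void)
{
	uchar *b;

	b = toyalloc(16);
	memset(b, 0, 17);		/* deliberate one-byte overrun */
	toycheck(b, 16);		/* dies: the guard byte is gone */
	exits(nil);
}

The real pool code is doing a much more elaborate version of this, which is why the panic output above can print the hdr and tail words it was checking.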

The high-level overview

The goal of this debugging process, which I have not yet completed, is to find the code that corrupted memory. I was told that this information can be recovered from the corrupted memory itself, because somewhere in the data structures created by the memory allocator, the address of the allocating code is stored.

What I have learned so far is that there is both a "pool header" and a "malloc header". The malloc header holds the function address I am looking for; to find it, I need to use the information in the pool header to determine its location.
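
For context on how that address ends up in the malloc header at all: libc provides setmalloctag and setrealloctag (documented in malloc(2)) along with getcallerpc, and wrappers like upas's erealloc (which shows up later in this story) use them. A minimal sketch of the pattern, with a hypothetical wrapper name of my own:

#include <u.h>
#include <libc.h>

/* hypothetical wrapper showing how an allocation gets tagged with
 * the pc of the code that asked for it; that tag is what the pool
 * prints when it panics */
void*
xmalloc(ulong n)
{
	void *p;

	p = malloc(n);
	if(p == nil)
		sysfatal("malloc %lud: %r", n);
	/* malloc already tagged the block with a pc inside xmalloc;
	 * overwrite it so the tag points at xmalloc's caller instead */
	setmalloctag(p, getcallerpc(&n));
	return p;
}

void
main(void)
{
	char *s;

	s = xmalloc(32);
	/* the tag can be read back and handed to acid's src()/asm() */
	print("malloc tag: %p\n", (void*)getmalloctag(s));
	free(s);
	exits(nil);
}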

Where I'm at now

I've been told there are useful functions in /sys/lib/acid/pool to help view these structures. A first attempt to use them met with an error, because /sys/src/libc/port/pool.amd64.acid did not exist. To create it (the compiler's -a flag prints acid-format definitions instead of compiling):

term% cd /sys/src/libc/port
term% 6c -a pool.c >pool.amd64.acid

Now starting acid attached to the process with acid -l pool works, and I can do things like:

acid: checkblock(0x55da98)
0x0055da98 size 16152 user has overwritten boundary

Well, we knew that already.

acid: print(Block(0x55da98))
	alloc
	data 0x0055daa0
	datasize 16152
	magic	168889353
	size	16192
	0x005619d0
	magic0	0xbe
	datadiff	40
	magic1	0xef
	size	0x00003f40
	hdr	0x0055da98
{}
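
A sanity check on those fields (my own reading; the post doesn't spell this out): size 16192 minus datadiff 40 gives datasize 16152; size 0x00003f40 is 16192 and matches the second word of the hdr in the panic output; and magic0 0xbe / magic1 0xef are the be and ef bytes visible in the tail word ef2800be, whose middle bytes 0x28 = 40 are the datadiff again.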

I was told whopointsat() was a very useful function, but unfortunately it found nothing:

acid: print(whopointsat(0x55da98))
{}

A very helpful email link

Someone in irc provided a link to a very useful email from many years ago on 9fans: https://marc.info/?l=9fans&m=111558947828485&w=2

This link contains a good explanation of how to decode the block hdr:

>0a110c09 - magic number for an allocated block
>00000020 - size of entire block, header and tail included
>000023b5 - pc of malloc for block
>00001f3e - pc of realloc for block
>0023ff24 - begin actual data space
>caf0f1fe ef1500be - this is four bytes: be 00 15 ef. the be and ef
>are magic numbers, the 00 15 is the amount of header
>and tail in the block, not counting the two pc words.
>since the block is 0x20, the data size is 0x20 - 0x15 - 2*4 = 3.
>00000020 - size of entire block, header and tail included
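
Translating that description into a struct makes the layout easier to hold in mind. This is only my approximation of the 32-bit layout the mail describes, with invented field names; the real declarations live in /sys/src/libc/port/pool.c and malloc.c and differ in detail:

#include <u.h>
#include <libc.h>

/* approximate picture of an allocated block as described in the
 * 9fans mail above; the names are invented, not the libc's own */
typedef struct AllocBlock AllocBlock;
struct AllocBlock {
	u32int	magic;		/* 0x0a110c09 marks an allocated block */
	u32int	size;		/* whole block, header and tail included */
	u32int	mallocpc;	/* pc of the malloc call for this block */
	u32int	reallocpc;	/* pc of the realloc call for this block */
	/* user data begins here; guard bytes and a copy of size
	 * follow it at the tail of the block */
};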

So, applying this to the data I got: the pc of malloc for my troublesome block is 0020ddb6 (the third word of the hdr). Let's see what code lives there using acid:

(Digression on assumptions - I did this the hard way. Acid has, and I knew it had, a convenient src(address) function to print the source code that corresponds to a given program address. However, I didn't try to use this function at this point. Why not? Because I assumed that the ?file? I saw in the uninformative lstk() output meant that, due to the process forking, acid had "lost" the information about the originating file and source code. Because of this assumption, I skipped using src(), which would have directed me right to the C source code, rather than having to work backwards as I did. I keep trying to teach myself not to make assumptions about things I don't fully understand, but they keep creeping in - partly because things like this debugging process are already very time-intensive, and testing every assumption seems like it would extend it even further. The thing is, inaccurate assumptions will often cost you more time in the long run than you can ever save with mental shortcuts. I retain the following material for this post because it isn't wrong, even though there was an easier method available.)

acid: asm(0x0020ddb6)
erealloc+0xb 0x000000000020ddb6	ANDB	$0x28,AL
erealloc+0x10 0x000000000020ddbb	JNE	erealloc+0x17(SB)
erealloc+0x16 0x000000000020ddc1	ADDB	CL,0xffffff8b(AX)
erealloc+0x18 0x000000000020ddc3	MOVL	p+0x0(FP),BP
erealloc+0x1d 0x000000000020ddc8	ANDB	$0x30,AL
erealloc+0x21 0x000000000020ddcc	ANDB	$0x8,AL
erealloc+0x25 0x000000000020ddd0	JMP	emalloc+0x1c(SB)
erealloc+0x2a 0x000000000020ddd5	MOVL	AX,p+0x0(FP)
erealloc+0x2f 0x000000000020ddda	CMPL	p+0x0(FP),$0x0
erealloc+0x34 0x000000000020dddf	JNE	erealloc+0x48(SB)
erealloc+0x39 0x000000000020dde4	ADDB	CL,0x8930247c(BX)
erealloc+0x3e 0x000000000020dde9	XORB	CL,0xe808247c(CX)
erealloc+0x43 0x000000000020ddee	CALL	sysfatal(SB)
erealloc+0x48 0x000000000020ddf3	LEAQ	p+0x0(FP),BP
erealloc+0x4d 0x000000000020ddf8	CALL	getcallerpc(SB)
erealloc+0x52 0x000000000020ddfd	MOVQ	AX,.safe+0x18(SP)
erealloc+0x56 0x000000000020de01	SBBB	CL,0xffffff8b(AX)
erealloc+0x57 0x000000000020de02	MOVQ	p+0x0(FP),BP
erealloc+0x5b 0x000000000020de06	SUBB	CL,0xffffff8b(AX)
erealloc+0x5f 0x000000000020de0a	ANDB	$0x18,AL
erealloc+0x64 0x000000000020de0f	ANDB	$0x8,AL
erealloc+0x69 0x000000000020de14	ADDB	AL,0x0(AX)
erealloc+0x6e 0x000000000020de19	ANDB	$0x28,AL
erealloc+0x73 0x000000000020de1e	ANDB	AL,BL

This looks very promising. Now we need to find out what code in upas generated this assembly. We have nice signposts to search for with the "sysfatal" and "getcallerpc" calls.

Searching upas for the culprit

So, grepping for those in upas/fs rapidly leads us to:

void *
erealloc(void *p, ulong n)
{
	if(n == 0)
		n = 1;
	p = realloc(p, n);
	if(!p)
		sysfatal("realloc %lud: %r", n);
	setrealloctag(p, getcallerpc(&p));
	return p;
}

This looks right on target. We need to dig deeper though, and find out what is invoking this. Let's check the next location in our block hdr:

acid: asm(0x00209358)
imap4read+0x55 0x0000000000209358	ADDB	AL,0x0(AX)
imap4read+0x5a 0x000000000020935d	MOVL	0x30(CX),SI
imap4read+0x5e 0x0000000000209361	JOC	imap4read+0xa4(SB)
imap4read+0x61 0x0000000000209364	MOVL	0x38(CX),BP
imap4read+0x64 0x0000000000209367	MOVL	0x30(CX),SI
imap4read+0x66 0x0000000000209369	XORB	AL,CL
imap4read+0x69 0x000000000020936c	ADDB	$0x89,AL
imap4read+0x6e 0x0000000000209371	CALL	erealloc(SB)
imap4read+0x73 0x0000000000209376	MOVQ	imap+0x0(FP),CX
imap4read+0x78 0x000000000020937b	MOVQ	AX,0x38(CX)
imap4read+0x7d 0x0000000000209380	JPL	imap4read+0xaf(SB)
imap4read+0x82 0x0000000000209385	JLE	imap4read+0xc8(SB)
imap4read+0x87 0x000000000020938a	MOVL	$.string+0x496(SB),DI
imap4read+0x8c 0x000000000020938f	MOVQ	DI,0x8(SP)
imap4read+0x91 0x0000000000209394	CALL	imap4cmd(SB)
imap4read+0x96 0x0000000000209399	MOVQ	imap+0x0(FP),BP
imap4read+0x9b 0x000000000020939e	CALL	imap4resp(SB)
imap4read+0x9e 0x00000000002093a1	<stdin>:11: (error) indir: Ȩ�

Now we are really getting somewhere. Again we grep for the CALL targets and hit paydirt in upas/fs/imap.c:

static char*
imap4read(Imap *imap, Mailbox *mb)
{
	char *s;
	int i, n, c;
	Fetchi *f;
	Message *m, **ll;

	imap4cmd(imap, "status %Z (messages uidvalidity)", imap->mbox);
	if(!isokay(s = imap4resp(imap)))
		return s;

	imap->nuid = 0;
	imap->muid = imap->nmsg;
	imap->f = erealloc(imap->f, imap->nmsg*sizeof imap->f[0]);
	if(imap->nmsg > 0){
		imap4cmd(imap, "uid fetch 1:* (uid rfc822.size internaldate)");
		if(!isokay(s = imap4resp(imap)))
			return s;
	}

Even with only the vaguest understanding of assembly, we can see we have found what we are looking for: the CALLs to erealloc, imap4cmd, and imap4resp match what we saw in the asm. The manpage for the malloc(2) functions notes that "User errors can corrupt the storage arena." Evidently some bad data wormed its way into the erealloc call in imap4read. I will return to irc with my findings and perhaps will be given some more hints on how to proceed from this knowledge.

Follow-up: I wasn't quite accurate in saying that bad data was in the erealloc call - the bad data was apparently a layer lower, within the imap structure, and was fixed by this commit which checks for imap->nuid being greater than imap->muid and adjusts the value accordingly.
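
I don't have that commit in front of me, but based on the description, the shape of the fix would be something like the following clamp inside imap4read - a sketch of the described check, not the actual patch:

	/* sketch of the described fix, not the actual commit: imap->f was
	 * allocated with room for imap->muid (== imap->nmsg) entries, so
	 * never let nuid index past that */
	if(imap->nuid > imap->muid)
		imap->nuid = imap->muid;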