Advanced Namespace Tools blog
24 January 2018
A dive into the memory allocation pool
I wake up in the morning and see in my "faces" program that I have a new email message. I click on it and the message opens, but I see an error print tear through my subrio. After reading the message, I exit the subrio and find this:
mem user overflow
pool sbrkmem block 55da98
hdr 0a110c09 00003f40 0020ddb6 00209358 faf0f1fe deadfa11
tail 00001bde 252b4a04 00001393 5a687d4d cafebabe ffffffff | ef2800be 00003f40
user data d0 24 00 00 d8 6f 68 5a | de 1b 00 00 04 4a 2b 25
panic: pool panic
upas/fs: user: glenda; note: sys: trap: fault read addr=0x0 pc=0x220ca6
fs 64945: suicide: sys: trap: fault read addr=0x0 pc=0x220ca6
A check with ps -a | grep Broke reveals fs upas/fs plumbing in the Broken state. We attach acid, check lstk(), and find - not much:
access(mode=0x41f248)+0x50 /sys/src/libc/9sys/access.c:27
db=0x7ffffeffe8a8
0x7ffffeffe8a8 ?file?:0
So we have a mystery to solve: what exactly does the "mem user overflow" error mean, and what code triggered it?
Where does "mem user overflow" come from?
The first step is always to dig into /sys/src and see what we can find there. I used grep to search the kernel files in /sys/src/9/port and /sys/src/9/pc64 for "overflow" but no luck there. Probably something in the libc then? A search in /sys/src/libc found what I was looking for:
term% cd /sys/src/libc
term% grep overflow */*
[...]
port/pool.c: * marked with a magic value, which is checked to detect user overflow.
port/pool.c: printblock(p, b, "mem user overflow (magic0)");
port/pool.c: printblock(p, b, "mem user overflow");
port/pool.c: panicblock(p, b, "mem user overflow");
[...]
Bingo. Looking at the code in pool.c was not immediately revelatory, so I performed the next standard step in debugging: call for help in irc. This produced a lot of output, much of which flattered me by presuming that I understood things better than I actually did.
At this point, a small digression is in order. It can be hard, when given obviously informative answers which are somewhat over your head, to not just say "ok, I'll check that out" and then abandon the whole thing and hope that nobody asks about the results of your investigation. It can also be hard not to say "I have no understanding of this; you figure it out, if you care about it."
Neither of these is the right thing to do. Treat these moments as golden opportunities for in-depth learning. If you are willing to use your brain, engage with the code, and make your best effort to figure things out, people will almost certainly provide you with pointers to the pieces you are missing. You have to demonstrate that you are putting in the effort to learn, so that next time you will be able to figure it out on your own. It is a case of "teach a man to fish" - but, to change aphorisms, you do not want to be the horse that is led to water but doesn't bother to drink.
So, I was informed that the error I saw indicated that memory had been corrupted. The code in pool.c was checking the validity of the allocated memory, and discovered an error.
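To make "user overflow" concrete: it means some code wrote past the end of the space it asked malloc or realloc for, trampling the bookkeeping bytes the pool keeps near the end of the block. A contrived sketch of the kind of bug that trips the check - not the actual upas code, just an illustration:

#include <u.h>
#include <libc.h>

void
main(void)
{
	int i, n, *a;

	n = 4;
	a = malloc(n * sizeof a[0]);
	if(a == nil)
		sysfatal("malloc: %r");
	for(i = 0; i <= n; i++)		/* off-by-one: writes one int past the block */
		a[i] = i;
	free(a);	/* the pool may now panic with "mem user overflow" */
	exits(nil);
}

Note that nothing blows up at the moment of the stray write; the pool only notices the damage the next time it inspects that block, which is part of what makes these bugs unpleasant to trace back to their source.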
The high-level overview
The goal of this debugging process, which I have not yet completed, is to find the code that corrupted memory. I was informed that this information can be obtained from the corrupted memory itself, because somewhere in the data structures created by the memory allocation process, the address of the allocating function is stored. What I need to do is:
- understand the data structures created by the memory allocation process
- understand how to use acid to find the function address within those structures
What I have learned so far is that there is both a "pool header" and a "malloc header". The malloc header is what has the function address I am looking for. To find it, I need to use the information in the pool header to determine its location.
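Incidentally, malloc(2) also documents getmalloctag and getrealloctag for reading these recorded pcs back out of a live block. Since my process is sitting there Broken, I have to dig them out of its memory with acid instead, but for completeness here is a minimal sketch of the programmatic route (a hypothetical example, not code from upas; the casts are just to keep the print formats simple):

#include <u.h>
#include <libc.h>

void
main(void)
{
	char *p;
	ulong mtag, rtag;

	p = malloc(64);
	if(p == nil)
		sysfatal("malloc: %r");
	p = realloc(p, 128);
	if(p == nil)
		sysfatal("realloc: %r");
	/* the tags are the pcs recorded for this block at malloc and realloc time */
	mtag = (ulong)getmalloctag(p);
	rtag = (ulong)getrealloctag(p);
	print("malloc tag %#lux, realloc tag %#lux\n", mtag, rtag);
	free(p);
	exits(nil);
}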
Where I'm at now
I've been told there are useful functions in /sys/lib/acid/pool to help view these structures. A first attempt to use them met with an error, because /sys/src/libc/port/pool.amd64.acid did not exist. To create it:
term% cd /sys/src/libc/port
term% 6c -a pool.c >pool.amd64.acid
With that file in place, starting acid with acid -l pool and attaching to the process works. I can do things like:
acid: checkblock(0x55da98)
0x0055da98 size 16152 user has overwritten boundary
Well, we knew that already.
acid: print(Block(0x55da98))
alloc
data 0x0055daa0
datasize 16152
magic 168889353
size 16192
0x005619d0
magic0 0xbe
datadiff 40
magic1 0xef
size 0x00003f40
hdr 0x0055da98
{}
I was told whopointsat() was a very useful function, but unfortunately
acid: print(whopointsat(0x55da98))
{}
A very helpful email link
Someone in irc provided a link to a very useful email from many years ago on 9fans: https://marc.info/?l=9fans&m=111558947828485&w=2
This link contains a good explanation of how to decode the block hdr:
>0a110c09 - magic number for an allocated block
>00000020 - size of entire block, header and tail included
>000023b5 - pc of malloc for block
>00001f3e - pc of realloc for block
>0023ff24 - begin actual data space
>caf0f1fe ef1500be - this is four bytes: be 00 15 ef. the be and ef
>are magic numbers, the 00 15 is the amount of header
>and tail in the block, not counting the two pc words.
>since the block is 0x20, the data size is 0x20 - 0x15 - 2*4 = 3.
>00000020 - size of entire block, header and tail included
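To keep the offsets straight, here is a rough C picture of the block layout that decode describes. The field names are mine, not the ones actually used in pool.c or malloc.c, and the widths are just what the dump suggests:

/* Sketch of an allocated block, per the 9fans decode above; names invented. */
typedef struct Blk Blk;
struct Blk {
	u32int	magic;		/* 0x0a110c09 marks an allocated block */
	u32int	size;		/* whole block: header, data, and tail */
	u32int	malloctag;	/* pc recorded when the block was allocated */
	u32int	realloctag;	/* pc recorded when the block was reallocated */
	uchar	data[1];	/* user data starts here */
	/* at the end of the block comes the tail: a 0xbe magic byte, the
	 * header+tail overhead (0x28 here, matching datadiff 40 in the acid
	 * output), a 0xef magic byte, and the block size repeated. Clobber
	 * any of that and the pool panics with "mem user overflow". */
};

With that picture, the 00003f40 in my hdr line is the block size (16192 bytes, which matches the acid Block output), and the two words after it are the pcs I am after.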
So, applying this decode to the data I got tells me that the pc of malloc for my troublesome block is 0020ddb6. Let's see what code lives there using acid:
(Digression on assumptions - I did this the hard way. Acid has, and I knew it had, a convenient src(address) function to print the source code that corresponds to a given program address. However, I didn't try to use this function at this point. Why not? Because I assumed that the ?file? I saw in the uninformative lstk() output meant that, due to the process forking, acid had "lost" the information about the originating file and source code. Because of this assumption, I skipped using src(), which would have pointed me right at the C source code, rather than having to work backwards as I did. I keep trying to teach myself not to make assumptions about things I don't fully understand, but they keep creeping in - partly because things like this debugging process are already very time-intensive, and testing every assumption seems like it extends it even further. The thing is, inaccurate assumptions will often cost you more time in the long run than you can ever save with mental shortcuts. I retain the following material for this post because it isn't wrong, even though there was an easier method available.)
acid: asm(0x0020ddb6)
erealloc+0xb 0x000000000020ddb6 ANDB $0x28,AL
erealloc+0x10 0x000000000020ddbb JNE erealloc+0x17(SB)
erealloc+0x16 0x000000000020ddc1 ADDB CL,0xffffff8b(AX)
erealloc+0x18 0x000000000020ddc3 MOVL p+0x0(FP),BP
erealloc+0x1d 0x000000000020ddc8 ANDB $0x30,AL
erealloc+0x21 0x000000000020ddcc ANDB $0x8,AL
erealloc+0x25 0x000000000020ddd0 JMP emalloc+0x1c(SB)
erealloc+0x2a 0x000000000020ddd5 MOVL AX,p+0x0(FP)
erealloc+0x2f 0x000000000020ddda CMPL p+0x0(FP),$0x0
erealloc+0x34 0x000000000020dddf JNE erealloc+0x48(SB)
erealloc+0x39 0x000000000020dde4 ADDB CL,0x8930247c(BX)
erealloc+0x3e 0x000000000020dde9 XORB CL,0xe808247c(CX)
erealloc+0x43 0x000000000020ddee CALL sysfatal(SB)
erealloc+0x48 0x000000000020ddf3 LEAQ p+0x0(FP),BP
erealloc+0x4d 0x000000000020ddf8 CALL getcallerpc(SB)
erealloc+0x52 0x000000000020ddfd MOVQ AX,.safe+0x18(SP)
erealloc+0x56 0x000000000020de01 SBBB CL,0xffffff8b(AX)
erealloc+0x57 0x000000000020de02 MOVQ p+0x0(FP),BP
erealloc+0x5b 0x000000000020de06 SUBB CL,0xffffff8b(AX)
erealloc+0x5f 0x000000000020de0a ANDB $0x18,AL
erealloc+0x64 0x000000000020de0f ANDB $0x8,AL
erealloc+0x69 0x000000000020de14 ADDB AL,0x0(AX)
erealloc+0x6e 0x000000000020de19 ANDB $0x28,AL
erealloc+0x73 0x000000000020de1e ANDB AL,BL
This looks very promising. Now we need to find out what code in upas generated this assembly. We have nice signposts to search for with the "sysfatal" and "getcallerpc" calls.
Searching upas for the culprit
So, grepping for those in upas/fs rapidly leads us to:
void *
erealloc(void *p, ulong n)
{
	if(n == 0)
		n = 1;
	p = realloc(p, n);
	if(!p)
		sysfatal("realloc %lud: %r", n);
	setrealloctag(p, getcallerpc(&p));
	return p;
}
This looks right on target. (It also makes sense that the "pc of malloc" landed inside erealloc rather than in upas proper: libc's realloc presumably records the pc of its immediate caller - here erealloc - while erealloc's own setrealloctag then stamps its caller into the realloc tag.) We need to dig deeper though, and find out what is invoking this. Let's check the next location in our block hdr:
acid: asm(0x00209358)
imap4read+0x55 0x0000000000209358 ADDB AL,0x0(AX)
imap4read+0x5a 0x000000000020935d MOVL 0x30(CX),SI
imap4read+0x5e 0x0000000000209361 JOC imap4read+0xa4(SB)
imap4read+0x61 0x0000000000209364 MOVL 0x38(CX),BP
imap4read+0x64 0x0000000000209367 MOVL 0x30(CX),SI
imap4read+0x66 0x0000000000209369 XORB AL,CL
imap4read+0x69 0x000000000020936c ADDB $0x89,AL
imap4read+0x6e 0x0000000000209371 CALL erealloc(SB)
imap4read+0x73 0x0000000000209376 MOVQ imap+0x0(FP),CX
imap4read+0x78 0x000000000020937b MOVQ AX,0x38(CX)
imap4read+0x7d 0x0000000000209380 JPL imap4read+0xaf(SB)
imap4read+0x82 0x0000000000209385 JLE imap4read+0xc8(SB)
imap4read+0x87 0x000000000020938a MOVL $.string+0x496(SB),DI
imap4read+0x8c 0x000000000020938f MOVQ DI,0x8(SP)
imap4read+0x91 0x0000000000209394 CALL imap4cmd(SB)
imap4read+0x96 0x0000000000209399 MOVQ imap+0x0(FP),BP
imap4read+0x9b 0x000000000020939e CALL imap4resp(SB)
imap4read+0x9e 0x00000000002093a1 <stdin>:11: (error) indir: Ȩ�
Now we are really getting somewhere. Again we grep for the CALL targets and hit paydirt in upas/fs/imap.c:
static char*
imap4read(Imap *imap, Mailbox *mb)
{
	char *s;
	int i, n, c;
	Fetchi *f;
	Message *m, **ll;

	imap4cmd(imap, "status %Z (messages uidvalidity)", imap->mbox);
	if(!isokay(s = imap4resp(imap)))
		return s;
	imap->nuid = 0;
	imap->muid = imap->nmsg;
	imap->f = erealloc(imap->f, imap->nmsg*sizeof imap->f[0]);
	if(imap->nmsg > 0){
		imap4cmd(imap, "uid fetch 1:* (uid rfc822.size internaldate)");
		if(!isokay(s = imap4resp(imap)))
			return s;
	}
Even with only the vaguest understanding of assembly, we can see that this is what we were looking for: the CALLs to erealloc, imap4cmd, and imap4resp in the C source match the ones in the asm. The malloc(2) manpage notes that "User errors can corrupt the storage arena." Evidently some bad data wormed its way into the erealloc call in imap4read. I will return to irc with my findings and perhaps will be given some more hints on how to proceed from this knowledge.
Follow-up: I wasn't quite accurate in saying that bad data was in the erealloc call - the bad data was apparently a layer lower, within the imap structure, and was fixed by this commit which checks for imap->nuid being greater than imap->muid and adjusts the value accordingly.
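The commit itself is not reproduced here, but from that description the shape of the fix is roughly the following, somewhere in the imap reading path (an illustrative sketch only, not the actual patch):

	/* Sketch of the described fix, not the actual commit: if the server
	 * reports more uids than the f array was sized for (nmsg/muid),
	 * clamp nuid so nothing indexes past the erealloc'd buffer. */
	if(imap->nuid > imap->muid)
		imap->nuid = imap->muid;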