Saturday, January 17, 2009

Linux: Fun with Big Files

Recently, I was playing with a 150G compressed XML file containing a Wikipedia dump. Trying to delete the file gave me a fun glimpse into some Linux behavior that I normally wouldn't notice.

I had a job running that was parsing the file. I hit Control-c to kill the job. Then I deleted the file. "rm" returned immediately. I thought to myself, wow, that was fast. Hmm, I'm guessing that all it had to do was unlink the file. I would have expected it to take longer to mark all of the file's blocks as free.

I ran "df -h" to see if there was now 150G more free space on my drive. There was no new free space. Hmm, that's weird. I futzed around for a bit. I started cycling through my tabs in screen. I discovered that I had only killed the job that was tailing one of the files, not the actual job itself.

This reminded me that Linux uses reference counting for files. Even if you can't get to a file through the filesystem, a file might still exist because a program has an open file handle for it. That's how "tempfile.TemporaryFile" works.
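The behavior is easy to reproduce on a small scale. Here is a minimal Python sketch, assuming a POSIX system; the function name and payload are made up for illustration. It unlinks a file while a handle is still open and then reads the data back through that handle.

```python
import os
import tempfile


def read_after_unlink(payload: bytes = b"still here\n"):
    """Unlink a file while it is open, then read it back through the handle."""
    fd, path = tempfile.mkstemp()        # creates a named file and opens it
    with os.fdopen(fd, "w+b") as handle:
        handle.write(payload)
        handle.flush()
        os.unlink(path)                  # removes the name; the open handle keeps the data alive
        name_exists = os.path.exists(path)
        link_count = os.fstat(handle.fileno()).st_nlink
        handle.seek(0)
        data = handle.read()
    # Closing the handle drops the last reference; only now can the space be reclaimed.
    return name_exists, link_count, data


if __name__ == "__main__":
    print(read_after_unlink())
```

The name is gone as soon as os.unlink() returns, but the data stays readable until the handle is closed. On local Linux filesystems the link count reported by fstat is typically 0 at that point.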

I killed the job. I ran "df -h". I now saw a bunch of free space. For some reason, even though I had hit Control-c, the job hadn't returned, and I hadn't been given a new shell prompt. Hitting Control-c again didn't help. In fact, I couldn't even hit Control-z to "kill -9 %1" the job. Normally, that always works. Hmm, that's weird.

I switched to another tab in screen. I ran "ps aux". I didn't see the job. I switched back to my other tab. The shell was still frozen. Hmm, that's really weird.

I typed "df -h" over and over again. I could see free disk space slowly returning. After several minutes, I finally got a new shell prompt. I could now see 150G of new free disk space.

Here's what I think happened. When I hit Control-c, the program exited. The kernel removed the process from the process table. While doing this, it closed the open file handle to the 150G file. Next, it had to start freeing the file's disk blocks. 150G is a lot of blocks to free. So even though there was no entry in the process table (which is why the program was not visible to "ps aux"), the process was still stuck in kernel mode freeing them.

Linux is fun ;)


Mark said...

Given that this is reproducible, it might be fun to test your hypothesis with lsof.

Shannon -jj Behrens said...

> Given that this is reproducible, it might be fun to test your hypothesis with lsof.

Interesting idea. I'm almost 100% certain that the program still had the file open even though you couldn't reference it via the filesystem. That just makes sense.

The one thing I'm not 100% certain of is whether the shell was frozen because the kernel was releasing the file's blocks. My guess is that the process was removed from the process table, and then the kernel started freeing the file. Since the program was no longer visible via "ps aux", I'm guessing "lsof" would also not see the file.
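For anyone who wants to try Mark's suggestion without lsof, here is a rough Python sketch of the same check. It assumes Linux, where each entry in /proc/<pid>/fd is a symlink whose target ends in " (deleted)" once the file has been unlinked. The function name is hypothetical. If the process no longer has a /proc entry, the scan simply returns nothing, which is the situation guessed at above.

```python
import os


def deleted_open_files(pid="self"):
    """Return targets of descriptors in /proc/<pid>/fd that point at unlinked files."""
    fd_dir = f"/proc/{pid}/fd"
    found = []
    try:
        entries = os.listdir(fd_dir)
    except OSError:
        return found        # no /proc, no such process, or no permission
    for entry in entries:
        try:
            target = os.readlink(os.path.join(fd_dir, entry))
        except OSError:
            continue        # descriptor was closed while we were scanning
        if target.endswith(" (deleted)"):
            found.append(target)
    return found


if __name__ == "__main__":
    for target in deleted_open_files():
        print(target)
```

Pass a numeric pid instead of "self" to inspect another process you own.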

Brandon L. Golm said...

Just making this up here, but the shell was probably blocked on IO. Remember that the shell creates three pipes, forks, and dups those pipes over to 0,1,2, does a setpgrp, then execs whatever process (order, completeness, and accuracy are approximate). But the whole time, the shell is reading from those pipes (and spewing it back at you).

So when the process is completely stuck in some kernel place that isn't supposed to take long, there's probably something funny that happens with select or whatever, so the shell's routines that normally don't block ... are blocked.

That's *my* guess.

Brandon L. Golm said...

and then jj had to get all smart and suggest the shell is waiting on waitpid(). Only he can tell us. :-)
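To make the waitpid() guess concrete, here is a minimal Python sketch of how a shell might run a foreground job, assuming a POSIX system. It is not the code of any real shell, and run_foreground is a made-up name. The point is only that the parent sits in a blocking wait and cannot print a prompt until the child can be reaped.

```python
import os
import sys


def run_foreground(argv):
    """Fork, exec argv in the child, and block in waitpid() until it is reaped."""
    pid = os.fork()
    if pid == 0:
        # Child: replace this process image with the requested program.
        try:
            os.execvp(argv[0], argv)
        finally:
            os._exit(127)   # only reached if exec failed
    # Parent (the "shell"): no prompt until this call returns.
    _, status = os.waitpid(pid, 0)
    if os.WIFEXITED(status):
        return os.WEXITSTATUS(status)
    return -os.WTERMSIG(status)   # negative signal number if the child was killed


if __name__ == "__main__":
    code = run_foreground([sys.executable, "-c", "import sys; sys.exit(3)"])
    print("child exited with", code)
```

If the child's exit is held up inside the kernel, the parent stays in os.waitpid() for just as long, which would look like a frozen shell.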

Shannon -jj Behrens said...

Hahaha. Nah, I'll just wait for Kelly Yancey to tell me. He'll reply with the actual code from FreeBSD's kernel that would explain the situation ;)

Kelly Yancey said...

Hahaha, sorry JJ, I'd love to help you but FreeBSD is lacking Linux's freeze-while-it-performs-IO feature.

But, as you say, a file remains open so long as any process retains a handle to it (even if there is no name associated with the file in the filesystem). This is a common problem with naive log-rotation scripts: they rename the log file without signaling to the logging process that it needs to reopen the log file. As such, the process continues to write to the "old" file and the "new" file remains 0 bytes in size. This is because mv (like rm/unlink) operates on directory entries, not on the file's data. Incidentally, that is also why stat(2)'s mtime and atime fields don't reflect name changes to files: stat only returns information about the inode, not anything about the (possibly multiple) names referring to the file.
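The log-rotation pitfall can be shown in a few lines of Python. This is a sketch assuming a POSIX system; the file names app.log and app.log.1 and the function name are invented for the example. A "logger" keeps its handle open while a "rotation script" renames the file and creates a fresh one.

```python
import os
import tempfile


def rotate_without_reopen():
    """Rename a log out from under an open handle and see where later writes land."""
    with tempfile.TemporaryDirectory() as workdir:
        live = os.path.join(workdir, "app.log")        # hypothetical names
        rotated = os.path.join(workdir, "app.log.1")

        log = open(live, "a")
        log.write("before rotation\n")
        log.flush()
        inode_before = os.stat(live).st_ino

        os.rename(live, rotated)         # what a naive rotation script does
        open(live, "a").close()          # ...followed by creating a fresh, empty log

        log.write("after rotation\n")    # the logger never reopened, so this follows the inode
        log.flush()
        log.close()

        return {
            "same_inode": os.stat(rotated).st_ino == inode_before,
            "rotated_size": os.path.getsize(rotated),
            "live_size": os.path.getsize(live),
        }


if __name__ == "__main__":
    print(rotate_without_reopen())
```

Both writes end up in the renamed file and the new app.log stays empty. The usual fix is for the logger to close and reopen its log by name when it is told that a rotation happened.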

Since you asked, the relevant logic in the FreeBSD kernel starts with the vput() and vrele() routines in src/sys/kern/vfs_subr.c. These are two variants of the drop-reference part of the kernel's file handle reference counting code. When the reference count drops to zero, these routines call vinactive() which, in turn, calls the filesystem-specific implementation of the VOP_INACTIVE vnode operation. For the default UFS filesystem, that implementation is ufs_inactive() in src/sys/ufs/ufs/ufs_inode.c.
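The drop-reference pattern described above can be modeled in a few lines. What follows is a toy Python model, not FreeBSD code: ToyVnode, ref, rele, and on_inactive are invented names that only mirror the shape of the idea, where the last release triggers a filesystem-supplied cleanup callback.

```python
class ToyVnode:
    """Toy stand-in for a reference-counted file object; not FreeBSD's struct vnode."""

    def __init__(self, on_inactive):
        self.usecount = 0
        self.on_inactive = on_inactive   # plays the role of the filesystem-specific callback

    def ref(self):
        self.usecount += 1

    def rele(self):
        if self.usecount <= 0:
            raise RuntimeError("reference count underflow")
        self.usecount -= 1
        if self.usecount == 0:
            self.on_inactive(self)       # last reference gone: let the filesystem clean up


def demo():
    events = []
    vnode = ToyVnode(lambda vn: events.append("inactive"))
    vnode.ref()                  # first open handle
    vnode.ref()                  # second open handle
    vnode.rele()                 # one handle closed: nothing happens yet
    after_first = list(events)
    vnode.rele()                 # last handle closed: the callback fires
    return after_first, events


if __name__ == "__main__":
    print(demo())
```

In a real kernel the callback is where the expensive work, such as freeing the blocks of an unlinked file, would be kicked off.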

Anyway, I don't recall FreeBSD ever having long stalls while deleting files, but that really depends on the I/O scheduler. I'm not a Linux expert by any means, but perhaps you're experiencing this bug:

Enjoying the blog as always, JJ. Even if you do put me on the spot. :)


Shannon -jj Behrens said...

Haha, nice! ;)