Regenerating Large Mailman Archives

In the fall of 2005 a server that I operated was broken into. The perpetrators defaced every index.* file on the system, including those in the archives of a mailing list that I’ve run since December of 2005. Recently it came to my attention that those files were overlooked in the cleanup process. I checked my backup CDs from 2005 and found that none of them could be used to restore the defaced archives. So I did what anyone would do: I searched for ways to regenerate the archive. Mailman includes tools to regenerate archives, but there are some issues.

In mailman/bin is a script named arch. Arch regenerates all or a portion of the archives of a list. The documentation for arch states:

“This script is not very efficient with
respect to memory management, and for large archives, it may not be
possible to index the mbox entirely. For that reason, you can specify
the start and end article numbers.”

Since my archive is large (187MB mbox file) and the host machine has only 512 MB of RAM, I had to generate the archive in chunks. First I had to figure out how many messages were in the mbox file. An mbox file marks the start of each message with a line that begins with “From ”. To determine the number of messages I ran:

$ grep -c "^From " path/to/list.mbox

I had more than 64,000 messages. I wasn’t about to run arch manually for even 64 chunks of 1,000 messages each, so I wrote a script. Every time I have a task like this I invariably end up having to do it again, so I find it a wise investment of my time to write simple scripts that automate such tasks.
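The counting step is easy to try on a throwaway file before pointing it at a real archive. The following is a minimal sketch: the sample messages and addresses are made up, and it assumes the mbox writer escaped any body line starting with "From " as ">From ", which is common but not guaranteed.

```shell
#!/bin/bash
# Build a tiny throwaway mbox with three messages (made-up content).
sample_mbox="$(mktemp)"
cat > "$sample_mbox" <<'EOF'
From alice@example.com Mon Dec  5 10:00:00 2005
Subject: first

Hello.

From bob@example.com Mon Dec  5 11:00:00 2005
Subject: second

>From here on, this body line was escaped by the mbox writer.

From carol@example.com Mon Dec  5 12:00:00 2005
Subject: third

Goodbye.
EOF

# -c prints the number of matching lines instead of the lines themselves.
message_count=$(grep -c "^From " "$sample_mbox")
echo "$message_count messages"
rm -f "$sample_mbox"
```

If body lines were not escaped, the count would come out too high, so it is worth treating the number as an estimate.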

The script counts the messages in the mbox file and then runs arch as many times as necessary to archive them all. The default chunk size is 1000 messages; it can be changed through the STEP variable at the top of the script. Setting DRY_RUN to "yes" prints the arch commands instead of running them. The script isn’t terribly robust: it assumes the current directory is the mailman bin directory. That wouldn’t be hard to change either, but I didn’t want to spend a lot of time making the script perfect.
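The chunking arithmetic can be checked on its own, without touching Mailman. This is only a sketch: chunk_ranges is a hypothetical helper, and it assumes articles are numbered from 0 and that the end number passed to arch is inclusive, which is worth verifying against your Mailman version.

```shell
#!/bin/bash
# Print one "start stop" pair per chunk, covering articles 0 .. total-1.
# chunk_ranges is a hypothetical helper, not part of Mailman.
chunk_ranges() {
	local total="$1"
	local step="$2"
	local start stop
	for ((start = 0; start < total; start += step)); do
		stop=$((start + step - 1))
		# The last block may be shorter than a full step
		if ((stop > total - 1)); then
			stop=$((total - 1))
		fi
		echo "$start $stop"
	done
}

chunk_ranges 2500 1000
```

For 2500 messages and a step of 1000 this prints three ranges, the last one being the short remainder that ends at article 2499.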

The following is a listing of the script:

#!/bin/bash

DRY_RUN="no"
STEP=1000
ARCH=./arch
GREP=grep

function archiveChunk()
{
	local list="$1"
	local start="$2"
	local stop="$3"

	echo "rearchiving start=$start stop=$stop"
	if [ "$DRY_RUN" == "yes" ]; then
		# Show the command exactly as it would be run below
		echo "would execute: $ARCH -s $start -e $stop $list"
	else
		"$ARCH"  -s "$start" -e "$stop" "$list"
	fi
}

function rearchive()
{
	local list="$1"
	local last="$2"

	local step=$STEP

	local start=0
	local stop=0
	# Articles are 0 indexed, so the final article is $last - 1.
	# This assumes arch treats the end article number as inclusive.
	for ((start=0; start<last; start+=step)); do
		stop=$((start + step - 1))

		# Clamp the final block, which may be a partial one
		if ((stop > last - 1)); then
			stop=$((last - 1))
		fi

		archiveChunk "$list" "$start" "$stop"
	done
}

LIST="$1"
echo "Getting message count for: $LIST"
ARCHIVE_FILE="../archives/private/${LIST}.mbox/${LIST}.mbox"
LIST_COUNT=$("$GREP" --count "^From " "$ARCHIVE_FILE")
echo "$LIST_COUNT messages in $ARCHIVE_FILE"
echo "starting rearchiving..."
rearchive "$LIST" "$LIST_COUNT"
echo "rearchive complete"
