Designing a LAMP Based Online MUA

30 November -0001
By Justin Keane
July 3, 2003

Please note that the scripts provided here are proof of concept scripts. There are some major holes in implementation (proper datetime format, parsing mail from to name and addresses, etc.). Use them at your own risk.

I’ve been working for quite some time to develop a full featured MUA (mail user agent) that can be accessed via a web page on a LAMP (Linux, Apache, MySQL, and PHP / Perl / Python) server. I first started working on this project when I developed a fully PHP based webmail interface. Initially all I wanted was a quick and dirty way to read and send email online when I was away from my computer. PHP worked great for this, but over time I began to notice some deficiencies. In addition, I hate to have multiple, redundant applications running at the same time. I found that every time I sat down to a computer I would open a web browser and a MUA (Outlook, Mozilla mail, Kmail, etc.). It seemed a little useless, especially with Mozilla’s multiple tabbed browsing which made viewing multiple web pages at once possible. I’d usually fire up Mozilla and open a tab to my webmail interface and use the other to browse. I’ve got my webmail set to check for new email every 5 minutes and refresh the inbox page automatically, so all I had to do was tab back to my email every now and then to look for new messages while I surfed. This became so common that I began to wonder why I didn’t use the online interface for all my mail. I built a simple MySQL based contact manager to hold my email addresses and added it to the basic PHP based mail manager. The other nice thing was that the PHP interface didn’t delete (or pull) email from the server, so I could still access my email via SSH or whatever other MUA I was using at another location. However, all was not well. I began to notice as soon as I got over 200 emails that it was difficult for PHP to maintain an open connection to the POP3 server long enough to grab all the email headers for display and searching and sorting the email was impossible due to the limitations of the PHP interface (which was pretty much just a raw window into the POP3 server). From there I decided that there had to be a better way. I began investigating different technologies and settled on a system based on a MySQL database to store the email so I could sort, search, and display email in a more manageable fashion. The idea was to pull mail from the POP3 server and dump it into the database and then have the webmail interface reflect the database content rather than the POP3 server content.

The initial MySQL table looks something like this (although note that there isn’t a mechanism to track exactly to whom each email belongs, which becomes a problem on multi-user systems):

+------------------+--------------+------+-----+---------+----------------+
| Field            | Type         | Null | Key | Default | Extra          |
+------------------+--------------+------+-----+---------+----------------+
| email_id         | int(11)      |      | PRI | NULL    | auto_increment |
| email_from       | varchar(255) | YES  |     | NULL    |                |
| email_replyto    | varchar(255) | YES  |     | NULL    |                |
| email_subject    | varchar(255) | YES  |     | NULL    |                |
| email_header     | text         | YES  |     | NULL    |                |
| email_body       | text         | YES  |     | NULL    |                |
| email_attachment | text         | YES  |     | NULL    |                |
| email_datetime   | varchar(255) | YES  |     | NULL    |                |
| email_messageid  | varchar(255) | YES  |     | NULL    |                |
| email_boundary   | varchar(255) | YES  |     | NULL    |                |
| email_uid        | varchar(255) | YES  |     | NULL    |                |
+------------------+--------------+------+-----+---------+----------------+

The problem at this point was that I had over 1200 emails in my inbox. PHP simply couldn’t parse these into the database at any number greater than 3 at a time. While having a scripted page that would need to be called 400 times in a row was a tempting option I figured there had to be another way. I began looking to other technologies.

Interestingly enough, MySQL doesn’t seem to have any plug-ins to allow direct imports from a Sendmail server. I was left with two options. All Sendmail email is stored in a flat file, usually located it /var/spool/mail. I could manually parse through this file and sort out each individual email, or I could contact the POP3 server and parse the email into the database. I figured either option was going to involve a lot of string parsing, and Perl has a reputation for being the best programming language to accomplish these sorts of tasks (also its part of the LAMP architecture :).

After installing Perl I got to work writing a script that would contact the POP3 server, grab the mail, and parse it into the database. Unfortunately Perl pages encountered the same problem as PHP (they could only grab about 3 emails before the page bombed out). Something to do with preparing the page on the fly was interfering with the email pull. I decided to see if I could write a script and run it server side to gauge success. It turned out that virtually the same script, run server side, would pull hundreds of emails. I did find though that large attachments (which are embedded in the body of the emails) would stall out the script. I found I could import around 400 emails at a time safely (which was a far greater accomplishment than the three emails I could do before). I did run into one problem with the MySQL database being unable to buffer really large emails into the database successfully (see the rant article here).

The next step was to figure out a way to call the script from a web page. Unfortunately Perl pages didn’t offer a reliable way to fire off the script without bogging down the page and killing the script prematurely. After a lot of searching I finally found a reliable way to run the server side script without worrying about a page compile error or request timeout. SHTML or SSI (server side includes) turned out to be an very viable (if somewhat obscure) solution. You have to enable SSI in apache and have the module installed, but it works like a charm. SSI is a very limited dynamic scripting platform. In the end I ended up using a .pl file to check how many emails I needed to import, then using redirect to the .shtml page with a URL variable to tell the page which command line parameters to pass to the server side script, then some javascript to jump the .shtml back to the .pl page. In the end here’s what it looked like. First the Perl page:

#!/usr/bin/perl

# Load Modules
use DBI();
use Mail::POP3Client;
use Net::POP3;

$foo = "mystring";

print "Content-type: text/html\n\n";
print  <<finishHeader;
<html>
<head>
	<title>Madirish.net Webmail - Importing...</title>
</head>

<body bgcolor="white">

finishHeader

# Begin the DB portion
my $dbh = DBI->connect("DBI:mysql:database=dbname;host=localhost", "username",
"password", {'RaiseError' => 1});

# execute query
my $sth = $dbh->prepare("SELECT count(email_id) as email_id FROM email");
$sth->execute();

        # iterate through resultset
        while(my $ref = $sth->fetchrow_hashref())
        {
        $dbmax = $ref->{'email_id'};
        }


$dbh->disconnect();  


#beging POP3 check
    my $pop = new Mail::POP3Client(
	    HOST => 'www.servername.net',
	    USER => 'username',
	    PASSWORD => 'password',
	    );
    $pop->Connect();
	$maxPop = $pop->Count(); 
	
# clean up
$pop->Close();

if ($dbmax < $maxPop) 
{
	if (($maxPop - 200) > $dbmax)
	{
	$maximport = 200;
	}
	else
	{
	$maximport  = $maxPop - $dbmax;
	}
	print "<script language='javascript'>";
	print "alert('Redirecting for next " . $maximport . " of " . $maxPop . " (" . $dbmax . " done)');";
	print "location.href='import.shtml?" . $maximport . "';</script>";
}
else
{
	print "Synchronization is complete";
}

print "</body></html>";

You’ll notice I’m using javascript to do the redirect and pop an alert (so you know that the system is actually working). This page then redirects to import.shtml, which looks like this:

<HTML>
<HEAD>
<TITLE>Madirish.net Webmail - Importing</TITLE>
<script language="javascript">
function redir()
	{
	location.href="firstsync.pl";
	}
</script>
</HEAD>
<BODY bgcolor="white" onLoad="redir()">
importing email...
<!--#exec cmd="perl import.pl $QUERY_STRING"-->
</BODY></HTML>

which is pretty simple. All it is doing is making a shell call to ‘perl import.pl x’ where x is the number of emails I want to grab (passed by command line parameter). The script import.pl looks like this:

#!/usr/bin/perl

# Load Modules
use DBI();
use Mail::POP3Client;
use Net::POP3;



#beging POP3 display
    my $pop = new Mail::POP3Client(
	    HOST => 'www.servername.net',
	    USER => ‘username’,
	    PASSWORD => ‘password’,
	    );
    $pop->Connect();
	my $dbh = DBI->connect("DBI:mysql:database=database;host=localhost", "username",
"password", {'RaiseError' => 1});
	my $sth = $dbh->prepare("select max(email_id) as email_id from email");
	$sth->execute();
	while(my $ref = $sth->fetchrow_hashref())
        {$starter = $ref->{'email_id'};}
	
$theMax = $starter + $ARGV[0];

if (($starter == 0) || (! $starter)) { $starter = 1;}
else {$starter++;}

    for( my $i = $starter; $i <= $theMax; $i++ ) {

		$myHeader = $pop->Head($i);
		
		#replace all < and > with HTML compliant ASCII code
		$myHeader =~ s/</&lt\;/g;
		$myHeader =~ s/>/&gt\;/g;

		#escape all single quotes so SQL and HTML display won't break
		$myHeader =~ s/'/&#39\;/g;
		
		#replace all line breaks with <BR> tags (substr doesn't seem to catch \n)
		$myHeader =~ s/\n/<br>/g;
		
		########################################################
		####		BREAK UP MESSAGE FOR INSERT		        ####
		########################################################
		
		#date ex(Date: Thu, 12 Jun 2003 13:01:56 -0500)
		$theDateStart = index($myHeader, "Date:");
		if ($theDateStart > -1)
		{
		$theDateStart = $theDateStart + 5; 
		$theDateEnd = index($myHeader, "<br>", $theDateStart);
		$theDate = substr($myHeader, $theDateStart, ($theDateEnd - $theDateStart) );
		}
		else
		{
		$theDate = '';
		}
		#sender
		$theFromStart = index($myHeader, "From:");
		if ($theFromStart > -1)
		{
		$theFromStart = $theFromStart + 5;
		$theFromEnd = index($myHeader, "<br>", $theFromStart);
		$theFrom = substr($myHeader, $theFromStart, ($theFromEnd - $theFromStart) );
		}
		else
		{
		$theFrom = '';
		}
		#subject
		$theSubjectStart = index($myHeader, "Subject:");
		if ($theSubjectStart > -1)
		{
		$theSubjectStart = $theSubjectStart + 8;
		$theSubjectEnd = index($myHeader, "<br>", $theSubjectStart);
		$theSubject = substr($myHeader, $theSubjectStart, ($theSubjectEnd - $theSubjectStart) );
		}
		else
		{
		$theSubject = '';
		}
		#message_id (if exists) Message-id (must translate header so it matches case)
		$myHeader =~ s/Message-id:/Message-id:/gi;
		$theMessageidStart = index($myHeader, "Message-id:");
		if ($theMessageidStart > -1)
		{
		$theMessageidStart = $theMessageidStart + 11;
		$theMessageidEnd = index($myHeader, "<br>", $theMessageidStart);
		$theMessageid = substr($myHeader, $theMessageidStart, ($theMessageidEnd - $theMessageidStart) );
		}
		else
		{
		$theMessageid = '';
		}
		#boundary(if exists) - translate header so it matches case
		$myHeader =~ s/boundary=/boundary=/gi;
		$theBoundaryStart = index($myHeader, "boundary=");
		if ($theBoundaryStart > -1)
		{
		#note we have to adjust for quotes around the boundary
		$theBoundaryStart = $theBoundaryStart + 10;
		$theBoundaryEnd = index($myHeader, "<br>", $theBoundaryStart);
		$theBoundary = substr($myHeader, $theBoundaryStart, ($theBoundaryEnd - $theBoundaryStart -2) );
		}
		else
		{
		$theBoundary = '';
		}
		
		#escape single quotes from the message body
		$myBody = $pop->Body($i);
		
		#get the uid from the message
		$myUid = $pop->Uidl($i);
		
		#grab any extras
		if (length($theBoundary) == 0)
			{$theAttachment = '';}
		else {
			 $catchAttachments = index($myBody, $theBoundary);
			 $theAttachment = substr($myBody, $catchAttachments, (length($myBody)-$catchAttachments));
			 #print "$catchAttachments<BR>";
			 $myBody = substr($myBody, 0, $catchAttachments);
			 }
		
		#one more run of single quote stripping
		$theAttachment =~ s/'/&#39\;/g;
		$myBody =~ s/'/&#39\;/g;
		
		# fire at will		
		
		################################################
		#		PLEASE NOTE   
		# 
		# the next line(s) should only be ONE line if you cut and paste this script!
		#  I've broken it up here for good HTML display 
		################################################
		my $sth = $dbh->prepare("insert into email (email_uid, email_datetime, email_subject, 
		email_messageid, email_from, email_boundary, email_header, email_body, email_attachment) 
		values ('" . $myUid . "', '" . $theDate . "', '" . $theSubject . "', '" . $theMessageid . "', '" . $theFrom . 
		"', '" . $theBoundary . "', '" . $myHeader . "', '" . $myBody . "', '" . $theAttachment . "')");

		$sth->execute();
		#print ".";
    }
# clean up
$dbh->disconnect(); 
$pop->Close();
print "done.\n";

You’ll notice I’m using the Mail::POP3Client and the DBI modules in Perl. The first is used to make the connections to the POP3 server and DBI is Perl’s interface to connect to the MySQL server.

All these pages can be placed in your /var/www/perl directory on the website (or whatever directory is CGI enabled so you can make the .pl files executable). Obviously you need the mod_perl package installed on your Apache server as well as mod_include.

So that’s as far as I’ve gotten so far. The system takes roughly 20 minutes to import 1200 emails now, but I’ve got it running on a server that is fairly removed from the POP3 server (on a totally different network). I imagine that if they were both running on the same machine the transfer would be noticeably quicker. Also note that all the attachments are getting lumped together in one text field in the db, which is why I’m tracking the boundary (to parse them out later). I’ll post up some more when I get it, but this is a great way to get all your email into a MySQL database quickly and easily.