More on replacing our MIME parsers

Alex Hudson

2008-09-06 10:35:16 UTC

Some additional data points I think are worth sharing.

When I signed up to LKML, I set up a rule in Tbird that matched on the
List-Id: header to move mails into a folder. However, I started noticing
that not all mails were being moved: my LKML folder now has 553 mails,
and my Inbox has ten mails which failed to move according to the rule
(which is probably conservative because I know I moved a few manually too).

I've discovered that these mails all have one thing in common: they're
missing store properties such as nmap.mail.headersize. I believe that
Tbird is requesting the header, and our IMAP agent isn't able to oblige
because it doesn't know where the header is. The mail is appearing, but
it's not getting sorted according to the rule. Basically, these are all
failures in our mail parsing routines.
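
To illustrate why a missing header stops the rule from firing, here is a
minimal Python sketch of a List-Id match. The function name and the
sample messages are made up for illustration; this is not Tbird's filter
code or our IMAP agent's, just the general shape of the problem.

```python
from email import message_from_bytes
from email.policy import default


def rule_matches(raw_message: bytes, wanted_list: str) -> bool:
    """Hypothetical filter: True if the List-Id header mentions wanted_list."""
    msg = message_from_bytes(raw_message, policy=default)
    list_id = msg.get("List-Id")
    if list_id is None:
        # No header was delivered, so there is nothing to match against.
        return False
    return wanted_list.lower() in str(list_id).lower()


# Sample data, assumed for the example only.
with_header = (
    b"List-Id: <linux-kernel.vger.kernel.org>\r\n"
    b"Subject: [PATCH] example\r\n"
    b"\r\n"
    b"body\r\n"
)
without_header = b"Subject: [PATCH] example\r\n\r\nbody\r\n"

print(rule_matches(with_header, "linux-kernel.vger.kernel.org"))
print(rule_matches(without_header, "linux-kernel.vger.kernel.org"))
```

The second message is still a perfectly readable mail; it just never
gets filed, which matches what I'm seeing in my Inbox.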

10 out of 553 is about 2% of mail, so based on the mails to LKML our
parser is only working 98% of the time. Granted, those ten mails
represent about four authors, but what's slightly odd is that their
mails don't fail every time - only some of the time.
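
For reference, the rough arithmetic behind that figure as a small Python
sketch. The counts are the ones from my folders above; whether the ten
stragglers are counted against the 553 or against all 563 list mails
makes little difference.

```python
moved = 553     # mails the rule filed into the LKML folder
stranded = 10   # mails left behind in the Inbox

rate_of_folder = stranded / moved               # "10 out of 553"
rate_of_total = stranded / (moved + stranded)   # 10 out of all 563 list mails

print(f"{rate_of_folder:.1%} / {rate_of_total:.1%}")
```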

I haven't had any failures from the opensuse list yet, though, and that
box has 360 mails in it.

Food for thought. Basically, the point I'm making is that while we don't
fail often, it's certainly regular and for some users will probably be
painful.

Another issue: when we don't parse stuff properly, we end up with
non-UTF8 data in the store. This makes backup and restore difficult,
because Python doesn't like outputting strings whose encoding it doesn't
know. At the moment, the failure path is to replace bytes which aren't
valid UTF-8 with '?' characters - which is clearly lossy.
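
A small sketch of why that substitution loses information, written
against Python 3's bytes/str model. The helper name lossy_utf8 and the
sample bytes are assumptions for illustration, not our actual backup
code. The surrogateescape contrast at the end is only there to show what
a non-lossy decode looks like; it is not something we do today.

```python
def lossy_utf8(raw: bytes) -> str:
    """Hypothetical stand-in for the current failure path."""
    # Invalid bytes become U+FFFD on decode; map those to '?'.
    return raw.decode("utf-8", errors="replace").replace("\ufffd", "?")


# Two different Latin-1 encoded strings (e-acute vs e-grave).
a = b"caf\xe9 au lait"
b = b"caf\xe8 au lait"

# Both collapse to the same text, so the original bytes are unrecoverable.
print(lossy_utf8(a))
print(lossy_utf8(a) == lossy_utf8(b))

# For contrast: surrogateescape keeps the bad byte and round-trips it.
kept = a.decode("utf-8", errors="surrogateescape")
print(kept.encode("utf-8", errors="surrogateescape") == a)
```

Once two different inputs map to the same '?' output, no restore can
tell them apart again.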

Cheers,

Alex.