[lug] Stopping the New Generation of Spam
Philip.Cooper at openvest.org
Tue Dec 5 17:27:53 MST 2006
On Tue, Dec 05, 2006 at 03:25:25PM -0700, Daniel Webb wrote:
> On Tue, Dec 05, 2006 at 12:43:17PM -0700, Philip Cooper wrote:
> > CRM114 .... I was doing better than the geniuses at
> > google (story attacks seem to trip them up occasionally).
> How does it work against the "random story + GIF attachment" spam?
> I got the Debian package for crm114 and looked at the docs on how it
> classifies, and I don't see how it would work on those, since it's
> basically breaking the text of the email into tokenized phrases and
> looking for them in spam vs. non-spam.
Again, just reporting my experience... this is no problem. That being
said, I have even wondered why they seem so spammy to my filter. My
guesses:
1. The random story still trips them up. It is much like the Story
spams, you know--my father left me this money in <$some-country> when
he <$mode-of-death>..... Random story, Story spam, word salad all
offer enough word combinations that have no business in a real email
that they are an easy target for a Markov filter.
2. The preferred delivery is via gif, and that token might be enough to
tip the scales. That, the yahoo.co.uk From line, and who knows what
else.
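
To make point 1 concrete, here is a toy sketch of phrase-based
scoring. It is not CRM114's actual algorithm or file format, only a
simplified illustration of the idea: count word pairs that occur
within a short window and compare how often each shows up in spam
versus non-spam. The window size, the crude smoothing, and the names
phrase_features and PhraseFilter are all assumptions made up for this
example.

```python
import math
from collections import Counter

WINDOW = 5  # assumed window size for this sketch


def phrase_features(text, window=WINDOW):
    """Return single words plus (word, gap, later_word) pairs in a window."""
    words = text.lower().split()
    feats = []
    for i, first in enumerate(words):
        feats.append((first,))
        for gap in range(1, window):
            if i + gap < len(words):
                feats.append((first, gap, words[i + gap]))
    return feats


class PhraseFilter:
    """Hypothetical two-class filter keyed on phrase features."""

    def __init__(self):
        self.counts = {"spam": Counter(), "good": Counter()}
        self.totals = {"spam": 0, "good": 0}

    def train(self, text, label):
        feats = phrase_features(text)
        self.counts[label].update(feats)
        self.totals[label] += len(feats)

    def spam_score(self, text):
        """Sum of log-odds per feature; positive means more spam-like."""
        score = 0.0
        for f in phrase_features(text):
            # Crude add-one smoothing so unseen features do not zero out.
            p_spam = (self.counts["spam"][f] + 1) / (self.totals["spam"] + 2)
            p_good = (self.counts["good"][f] + 1) / (self.totals["good"] + 2)
            score += math.log(p_spam / p_good)
        return score


f = PhraseFilter()
f.train("my father left me this money in a bank when he died", "spam")
f.train("the meeting is moved to tuesday please bring the slides", "good")
print(f.spam_score("my father left me money") > 0)
print(f.spam_score("please bring the slides tuesday") < 0)
```

Because pairs are counted as well as single words, a story that
reuses ordinary words in combinations never seen in real mail still
piles up spam-side evidence, which is roughly why word salad ends up
an easy target.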
The one that concerns me is when they eliminate all of the words from
the email and just send the image. But what legitimate email is just
a gif? Those embarrassing x-mas party photos sent around would
probably be jpegs. And anyone sending just a jpeg is probably in your
whitelist explicitly or nominally in your nonspam database because you
trained in one of their emails.
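
For what it's worth, the "just an image, no words" case is easy to
spot mechanically. The sketch below is not something CRM114 does and
not something I run; it is a hypothetical helper (is_image_only is a
made-up name) using Python's standard email module, assuming a message
counts as image-only when it has at least one image part and no text
part with non-whitespace content. An HTML part containing only markup
would still count as text here, so treat it as a rough heuristic.

```python
from email import message_from_string


def is_image_only(raw_message):
    """True if the message carries an image and no non-empty text part."""
    msg = message_from_string(raw_message)
    has_image = False
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers hold no content of their own
        maintype = part.get_content_maintype()
        if maintype == "image":
            has_image = True
        elif maintype == "text":
            payload = part.get_payload(decode=True) or b""
            if payload.strip():
                return False  # real words present
    return has_image
```

A flag like this would only be one more input next to the whitelist
and the trained database, not a verdict on its own.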
Anyway, so far so good. OCR is another one of those arms race
things. It works. Then they put in random dots and such and it stops
working. Then you google and lug around for the better ocr etc etc
etc. They could get their images past OCR right now but they are
better off waiting for everyone to build the wall, then they knock it
down. Gumption trap for sysadmin types IMHO.
There are bleeding edge versions of CRM114 that work on the tokenizers
to improve the image spam problem, but for now I haven't seen any come
through my filters.
Reasons to not use CRM114:
25 MB of disk space per filter set. With 100k users (one filter set
each) that is roughly 2.5 TB, and you have an issue.
Performance: CRM114 is super fast, but I'm not a super big mailhost.
I don't want to sound too confident. Windows is attacked by viruses
in large part because it is the most common system. Linux and OSX are
less attractive because they are relatively seldom used. The
popularity of SpamAssassin keeps my statistical filter low on the
spammers' priority list.