Comment Spam Prevention
11 September 2008 14:13

As promised, here's a brief description of how I manage comment spam. I actually employ two different methods. I forget the original sites I found these on, so if anyone recognizes their method here, let me know and I'll be sure to give credit where it is due.

The first method is to parse the comment and assign points to it. If a comment ends up with a negative spam score, it is marked as spam and will not be displayed to regular readers. If the score is less than -20, it is deemed definite spam and is discarded outright. I regularly check the comments marked as spam to see if anything has been incorrectly identified as spam (or indeed incorrectly identified as not). I'm not going to explain my exact rules, but here's a basic outline of some of them:

  • Every URL in the comment is a lost point (-1).
    • However, having fewer than a certain number of URLs is a plus.
  • A comment shorter than a certain number of characters gets a penalty.
  • Every occurrence of a stop word (e.g. porn, viagra, casino) is a penalty.
  • If the comment contains nothing but links, it loses a lot of points.
  • If the comment is empty, it is spam.
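
To make the outline above a bit more concrete, here is a rough Python sketch of that kind of scoring. It is not my actual rule set: the thresholds, the point values (other than -1 per URL and the -20 cut-off) and the names score_comment and classify are all made up for illustration.

```python
import re

# Hypothetical values for illustration only; the real rules are not published.
URL_RE = re.compile(r"https?://\S+", re.IGNORECASE)
STOP_WORDS = ("porn", "viagra", "casino")
FEW_URLS_LIMIT = 2       # assumed: fewer URLs than this earns a bonus
FEW_URLS_BONUS = 2       # assumed bonus size
MIN_LENGTH = 20          # assumed: shorter comments are penalised
ONLY_LINKS_PENALTY = 10  # assumed size of "loses a lot of points"
DISCARD_BELOW = -20      # from the post: below this, discard outright


def score_comment(text: str) -> int:
    """Return a spam score; negative means the comment looks like spam."""
    body = text.strip()
    if not body:
        return -1  # an empty comment is always spam

    score = 0
    urls = URL_RE.findall(body)
    score -= len(urls)                      # every URL costs a point
    if len(urls) < FEW_URLS_LIMIT:
        score += FEW_URLS_BONUS             # few or no URLs is a plus
    if len(body) < MIN_LENGTH:
        score -= 1                          # very short comment
    lowered = body.lower()
    for word in STOP_WORDS:
        score -= lowered.count(word)        # one point per occurrence
    if urls and not URL_RE.sub("", body).strip():
        score -= ONLY_LINKS_PENALTY         # nothing but links
    return score


def classify(text: str) -> str:
    """Map a score to 'ok', 'spam' (held for review) or 'discard'."""
    score = score_comment(text)
    if score < DISCARD_BELOW:
        return "discard"
    if score < 0:
        return "spam"
    return "ok"
```

With these made-up numbers, an ordinary sentence without links scores 2 and is shown, while a comment consisting of three bare links scores -13 and is held back as spam.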

As you may be able to tell, it is quite hard to get positive points. One might think that this would lead to a lot of false positives. However, I have yet to come across a comment incorrectly identified as spam.

The second method is much simpler and (surprisingly) effective. The majority of comment spam is left by bots rather than humans. Most bots parse the HTML of the comment form and look at its elements to work out what the HTTP request needs to contain. I have added a text field named "email" to the form and hidden it from view with CSS (using the display style). The bots don't apply the CSS when parsing the HTML document, so they think this is another field that needs to be filled in. A human, however, will never even see it and won't (in general) be aware of its presence. So, if this field is filled in, it is assumed the comment was posted by a bot and the whole thing is automatically discarded, without even going through the scoring process above.
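
As a minimal sketch of the server-side half of this, assuming the submitted form arrives as a plain dictionary of field names to values: only the field name "email" and the use of the display style come from the description above. The markup, the other field names and the function name is_bot_submission are hypothetical.

```python
from typing import Mapping

# Name of the hidden decoy field, as described in the post.
HONEYPOT_FIELD = "email"

# Illustrative markup only; the real form is not shown in the post.
# The decoy field is hidden with the CSS display property.
FORM_HTML = """
<form method="post" action="/comment">
  <input type="text" name="name">
  <input type="text" name="email" style="display: none">
  <textarea name="comment"></textarea>
  <input type="submit" value="Post">
</form>
"""


def is_bot_submission(form: Mapping[str, str]) -> bool:
    """Return True if the hidden decoy field was filled in.

    A human never sees the field, so any non-blank value is taken
    as evidence that a bot filled in the form.
    """
    return bool(form.get(HONEYPOT_FIELD, "").strip())
```

A handler would call is_bot_submission first and drop the comment straight away when it returns True, before any scoring takes place. Treating a whitespace-only value as empty is a choice made for this sketch.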

Since putting the second method in place, I have received virtually no comment spam. The first method is fairly redundant at the moment, but it's handy to have it in place as a backup for the day when the bots smarten up (which they will sooner or later).

UPDATE (12-Nov-09): Since it may not be entirely clear how the hidden text field method works: Because the field is hidden using CSS, a normal browser will never show it, so a human will not enter any information. A bot, however, is not aware of CSS, so it thinks the field really is part of the form and fills something in. If there is something in this field, then, it must have been filled in by a bot, and the post is discarded.