Cloaking! The definitive guide (Part 2 - IP delivery)

User-agent cloaking sucks! Take the code from cloaking part one, dissect it, understand it then hit that delete button. User-agent cloaking is waaaaaay too easily spoofed by anybody who wants to spoof it.

IP cloaking

Real men use IP-based cloaking. That means we are going to need a list again. This time we have 3 choices: IPLists.com, Fantomaster’s SpiderSpy Database, or rolling your own list. So which should you choose?

The IP Lists

  • IPLists.com: Only recommended if you are just experimenting a little with cloaking. Expect to get banned.

    Advantages:
      • It’s free
      • It’s easily accessible and parsable

    Disadvantages:
      • It’s rarely updated these days
      • It isn’t as comprehensive as it needs to be
      • You’re going to get banned lots if you base your system around this list

  • SpiderSpy: My recommendation. Ralph has done an amazing job at putting together the best damn bot list out there. Updated every 6 hours. If you get banned whilst using this list, it’s probably a manual review.

    Advantages:
      • It’s the most comprehensive list around
      • You can download it in a variety of formats for easy parsing
      • You can get support from Ralph

    Disadvantages:
      • It costs money (though the value for money as far as I’m concerned is great)

  • Roll your own: If you are skilled enough to roll your own, you probably aren’t going to learn much from this series of posts.

    Advantages:
      • It’s free
      • You get full control over how you identify robots

    Disadvantages:
      • You need to be a shit-hot coder
      • It will take time to build up a list of bots
      • Even after all that effort, I doubt your list would be as good as the SpiderSpy list

So now you’ve picked your method of creating a spider list, used a few instances of explode() to parse it and then put it into your database (MySQL of course). Let’s assume it looks something like this:

ip            botname
123.45.67.89  Google
98.76.54.21   Yahoo

You only really need the first column for cloaking, but it can be handy to know which IP is attributed to which search engine for stats. My philosophy when it comes to data collection is a rather greedy one: I collect all the data I can, then decide what I’m going to do with it at a later date. This approach has allowed me to learn a lot about bot behaviour just by looking at my raw stats in phpMyAdmin. I’m sure if I started graphing the data and applying rules to it I could learn even more.
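The parsing step could look roughly like the sketch below. It assumes a made-up plain-text format (one IP per line, optionally followed by a bot name, with # comment lines); the real IPLists.com and SpiderSpy downloads may well be laid out differently, and parseSpiderList() is a hypothetical name, not something either list ships with.

```php
<?php
// Hypothetical helper: turn a plain-text spider list into [ip => botname] pairs.
// Assumed format (not taken from any real list): "IP<whitespace>botname" per line,
// with blank lines and lines starting with # ignored.
function parseSpiderList(string $raw): array
{
    $spiders = [];
    foreach (explode("\n", $raw) as $line) {
        $line = trim($line);
        if ($line === '' || $line[0] === '#') {
            continue;
        }
        // Split into at most two parts: the IP and everything after it
        $parts = preg_split('/\s+/', $line, 2);
        $ip = $parts[0];
        // Skip anything that is not a valid IPv4 or IPv6 address
        if (filter_var($ip, FILTER_VALIDATE_IP) === false) {
            continue;
        }
        // Lines without a name are kept, labelled 'unknown'
        $spiders[$ip] = $parts[1] ?? 'unknown';
    }
    return $spiders;
}

// Demo data only
$sample = "# demo data only\n123.45.67.89 Google\n98.76.54.21 Yahoo\nnot-an-ip Junk\n";
$spiders = parseSpiderList($sample);
print_r($spiders);
```

From there each ip/botname pair can be inserted into the table. Dropping lines that fail validation means a malformed download is less likely to end up as junk rows in your database.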

IP cloaking with PHP

Let’s write our first IP-based cloaking script.

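<?php
// Let's connect to MySQL (mysqli replaces the mysql_* functions, which were removed in PHP 7)
$db = new mysqli('localhost', 'username', 'password', 'database');
if ($db->connect_error) {
    exit('Database connection failed');
}
// First we grab the visitor's IP
$visitorIP = $_SERVER['REMOTE_ADDR'];
// Then we search the database for the IP using a prepared statement.
// "spiders" is a placeholder for your table name (TABLE itself is a reserved word in MySQL).
$stmt = $db->prepare('SELECT botname FROM spiders WHERE ip = ? LIMIT 1');
$stmt->bind_param('s', $visitorIP);
$stmt->execute();
$stmt->bind_result($botname);
// fetch() only returns true if a matching row was found
if ($stmt->fetch()) {
    echo 'uber SE content';
} else {
    echo 'standard page';
}
$stmt->close();
?>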
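For stats it can help to know which bot matched, not just whether one did. Here is a minimal sketch of that idea, assuming the spider list has already been loaded into a PHP array keyed by IP; classifyVisitor() and the array shape are hypothetical and not part of the script above.

```php
<?php
// Hypothetical helper: decide what to serve and note which bot (if any) matched.
// $spiders is assumed to be an [ip => botname] array loaded from your spider table.
function classifyVisitor(array $spiders, string $ip): array
{
    $botname = $spiders[$ip] ?? null;
    return [
        'ip'        => $ip,
        'is_spider' => $botname !== null,
        'botname'   => $botname,
    ];
}

$spiders = ['123.45.67.89' => 'Google']; // demo data only
// REMOTE_ADDR is not set on the command line, so fall back to localhost there
$visit = classifyVisitor($spiders, $_SERVER['REMOTE_ADDR'] ?? '127.0.0.1');
echo $visit['is_spider'] ? 'uber SE content' : 'standard page';
```

The returned array is the sort of record you could log on every request and dig through later.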

The problem of potential spoofing has now gone: a visitor can’t fake the source IP of an ordinary HTTP request and still get the page back (well, to my knowledge they can’t). So only bots can see my content now, right? Wrong!

When bots crawl your page, they store a copy of what they see in the search engine’s cache. That means somebody could just come along, click that lovely cache button and see your spammy, sorry, highly optimised content. We don’t want to make it easy for these snoopers, so let’s stop them seeing our content. Really you’ve got 2 options when it comes down to it.

Cache Busting?

  1. Tell Googlebot where to shove its cached page: Basically include <meta name="ROBOTS" content="NOARCHIVE"> in the <head></head> section of your page and the cache is no more! The problem with this approach is that it is a potential red flag to the search engines, and sometimes they ignore the tag anyway.
  2. Pass a naughty little bit of JavaScript: Let the user visit the cached page, then have it run some JavaScript that you stealthily stuffed into your page when the bot last visited. Just stick in this code:
     var url = "http://www.yourdomain.com/target.html";
     if (top.location != url) { top.location = url; }

(You’ll have to visit Fantomaster’s page to get the actual code, as this version has no <script> tags. WordPress was having none of it.)

That beautiful piece of code was borrowed (okay, stolen) from Fantomaster and will redirect anybody trying to look at the cache to your actual site. So now we’ve locked out the snoopers... unless they disable JavaScript when they view your cache.

Now, I’m not one to be beaten so easily by the self-appointed internet police who have nothing better to do than run around reporting sites to Google without being paid for their time. So take the source code from your uncloaked page, find the most cluttered part of it, then put in your optimised content. Wrap that content in a div and add this to the div’s CSS class:

text-indent: -12000px;

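This will shift the text 12,000 pixels to the left, well off the visible page (strictly speaking, text-indent only offsets the first line, so a very long block could wrap back into view). Preferably hide that rule in the most cluttered part of your CSS to confuse any snoopers a bit. In part 3 I will tell you how to use cloaking for…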

…Super Targeting

One Response to “Cloaking! The definitive guide (Part 2 - IP delivery)”

  1. Nicely done. Cloaking has always been a… faux pas in my mind. But I suppose it could have its uses. I tend to shy away from other examples in the past. Very clean though and obviously this is very effective.
