Tuesday, February 27, 2007

Keeping the spiders off

For whatever reason, you may want to prevent portions of your website from being indexed by Google, or by any search engine for that matter. It might be a matter of choice, or it may simply make no sense to have that portion of your site in a search engine’s database. Either way, keeping spiders off parts of your site is something you can easily do. Here are two ways to do it:

Robots.txt


The first method is the robots.txt file. The robots.txt file is the first file that a search engine visits on your site. With it, you can instruct a search engine to index or not to index parts of your site. As someone aptly put it, the robots.txt file is like a snooty nightclub bouncer with a velvet rope: it decides which search engines may enter your pages.

The robots.txt file is just a regular text file that you put in the root directory of your site. A typical one looks like this:

User-agent: googlebot
Disallow: /confidential-files/
Disallow: /private-directory/
Disallow: /constitution/

User-agent: *
Disallow: /cgi-scripts/


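This tells Google’s spider, googlebot, not to index the confidential files, private directory, and constitution directories. It also bars all other spiders (indicated by the wildcard asterisk) from the /cgi-scripts/ directory. Note that a spider with its own section follows only that section, so to keep googlebot out of /cgi-scripts/ as well, that line has to be repeated under its User-agent.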
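If you want to check how a well-behaved spider would read rules like these, here is a minimal sketch using Python's standard urllib.robotparser module. The site address www.example.com, the spider name "otherbot", and the file names are made up for illustration, and the directory names are written with hyphens because URL paths should not contain raw spaces.

```python
from urllib.robotparser import RobotFileParser

# Sample rules, modeled on the robots.txt discussed above.
RULES = """\
User-agent: googlebot
Disallow: /confidential-files/
Disallow: /private-directory/
Disallow: /constitution/

User-agent: *
Disallow: /cgi-scripts/
"""


def build_parser(rules_text):
    """Parse robots.txt text (no network access) and return the parser."""
    parser = RobotFileParser()
    parser.parse(rules_text.splitlines())
    return parser


parser = build_parser(RULES)
SITE = "http://www.example.com"  # hypothetical site

# googlebot is named explicitly, so its own section applies.
googlebot_constitution = parser.can_fetch("googlebot", SITE + "/constitution/index.html")

# "otherbot" (a made-up spider) falls back to the wildcard section.
other_constitution = parser.can_fetch("otherbot", SITE + "/constitution/index.html")
other_cgi = parser.can_fetch("otherbot", SITE + "/cgi-scripts/form.pl")

print(googlebot_constitution, other_constitution, other_cgi)
```

Here googlebot is refused the constitution page, while the made-up otherbot may fetch it but is refused anything under /cgi-scripts/.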

It is worth noting that compliance with the robots.txt instructions is voluntary, not compulsory. A search engine might decide to ignore the robots.txt file, but most reputable search engines do obey it.

Robots Meta Tag


The second method involves putting a specific meta tag into the head section of each page you want to exclude from a search engine’s index. It takes the following form:

<meta name="robots" content="noindex, nofollow" />

This tells search engine robots not to index the page’s content and not to follow the links on it.
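To confirm that a page really carries the tag, a rough sketch like the following can help. It uses Python's standard html.parser module; the class name RobotsMetaParser, the helper robots_directives, and the sample page are hypothetical names chosen for this example, not part of any search engine's tooling.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # Self-closing tags such as <meta ... /> also end up here.
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() != "robots":
            return
        for part in attr_map.get("content", "").split(","):
            part = part.strip().lower()
            if part:
                self.directives.add(part)


def robots_directives(html_text):
    """Return the set of robots directives declared in the given HTML."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return parser.directives


# A made-up page that uses the tag shown above.
PAGE = (
    "<html><head>"
    '<meta name="robots" content="noindex, nofollow" />'
    "</head><body>Private stuff</body></html>"
)

directives = robots_directives(PAGE)
print(sorted(directives))
```

A page without the tag gives an empty set, which means robots are free to index it and follow its links.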