Restricting the caching of a web site by protocol-following Internet spidering applications can be accomplished by adding a robots.txt file to the web site and inserting metadata into the HTML code of each web page. Robots.txt is a web standard that tells spiders and bots what they can and cannot access. Not all bots follow the robots.txt standard (see here), but many do, so it is still a good idea to have a robots.txt file.
For a robots.txt file that blocks all spidering, add the following lines to a blank plain text file:
User-Agent: *
Disallow: /
Save the file as “robots.txt”, then upload it to the top directory of the web site. It needs to be accessible at the top level of the web site, because that is the only place crawlers look for it. For example, the h4k.com robots.txt file can be found here: http://h4k.com/robots.txt
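Before uploading, the rules can be checked locally. The following is a minimal sketch using Python's standard urllib.robotparser module; the helper name is_allowed and the crawler name used in the example are only illustrative, and the check shows how a protocol-following crawler would interpret the file, not what every bot will actually do.

```python
from urllib.robotparser import RobotFileParser

# The two-line "block everything" robots.txt described above.
ROBOTS_TXT = """\
User-Agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the rules in robots_txt let user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines, so no network
    # access is needed for this check.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Every user agent should be refused for every path on the site.
print(is_allowed(ROBOTS_TXT, "Googlebot", "http://h4k.com/index.html"))
```

With the block-all file the check should report that the page is not allowed, whichever user agent name is passed in.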
The metadata tags below need to be inserted on each web page whose caching should be restricted. If the web site runs WordPress, the tags can be added to every page automatically by editing the header PHP code. Log in to the WordPress console, then go to Presentations –> Theme Editor and select “Header” under the “theme files” listing. Directly underneath the line that reads:
<title><?php bloginfo('name'); ?><?php wp_title(); ?></title>
insert these metadata tags:
<META name="ROBOTS" content="NONE">
<META http-equiv="CACHE-CONTROL" content="NO-CACHE">
<META http-equiv="EXPIRES" content="0">
* Note: if you copy and paste these tags, check that the quotes are plain straight quotes and retype them if necessary. Web pages sometimes display them as curly quotes, which will not work in HTML or PHP.
These tags instruct web spiders not to index the page or follow its links, instruct the web browser not to cache the web page, and set the web page to expire immediately.
Alternatively, you may want web crawlers and search engines to be able to see your web site but not make a copy of the content. To let web spiders and search engines index a web page without keeping a cached copy of it, set the ROBOTS metatag to “NOARCHIVE”:
<META name="ROBOTS" content="NOARCHIVE">
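To confirm that a page really carries the intended ROBOTS metatag, the HTML can be inspected with a short script. This is a rough sketch built on Python's standard html.parser module; the class name RobotsMetaParser and the function robots_directives are made up for this example, and the sketch only reads the tag. It does not tell you whether a given crawler honors it.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <META> and
        # <meta> are treated the same way here.
        if tag != "meta":
            return
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() != "robots":
            return
        # The content attribute may hold several comma-separated values.
        for part in attr_map.get("content", "").split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def robots_directives(html: str) -> set:
    """Return the lowercase ROBOTS directives declared in an HTML string."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


page = '<html><head><META name="ROBOTS" content="NOARCHIVE"></head></html>'
print(robots_directives(page))
```

An empty result means no ROBOTS metatag was found on the page, which usually points to the header edit not having been saved or to curly quotes in the pasted tag.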
Lots of good information about metatags can be found here.
It is also important to have a well-constructed robots.txt file to keep leech bots from wasting bandwidth on your site and from making copies of it. Wikipedia.org has a well-documented robots.txt file, on which the h4k.com robots.txt file is based.
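A more selective robots.txt blocks individual bots by name while leaving the site open to everyone else. The sketch below is only an illustration of that idea: the bot name ExampleLeechBot is hypothetical, the rules are not taken from the Wikipedia.org or h4k.com files, and the function can_crawl is a helper invented for this example. It uses Python's standard urllib.robotparser to show how such rules would be read.

```python
from urllib.robotparser import RobotFileParser

# Illustrative rules only: "ExampleLeechBot" is a made-up crawler name.
SELECTIVE_ROBOTS_TXT = """\
# Hypothetical bandwidth-heavy crawler: blocked from the whole site.
User-agent: ExampleLeechBot
Disallow: /

# Every other crawler: nothing is disallowed.
User-agent: *
Disallow:
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for agent in ("ExampleLeechBot", "Googlebot"):
    print(agent, can_crawl(SELECTIVE_ROBOTS_TXT, agent, "http://h4k.com/page.html"))
```

As with any robots.txt rule, this only affects bots that choose to follow the standard.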