How to make your website more crawlable

Posted by Dan Frost on Tue, 26/02/2008 - 18:17

I've read many articles on this subject, but since we've actually written a web crawler, I think we're pretty well placed to give good advice on how a website should be built to be easily crawled :)

1. Well structured HTML

This doesn't mean that your site has to adhere to W3C standards, but at the very least, one set of tags will help, as will generally ensuring your HTML tags are closed.

2. Text or image as a link - it makes no difference

From a crawling point of view, it makes no difference if a link is text or an image. It's quite feasible that the alt attribute on an image is as good as well-targeted anchor text for SEO purposes.
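As a rough illustration of why the two kinds of link look the same to a crawler, here is a minimal sketch of a link extractor built on Python's standard html.parser module. It is not the code of our crawler or anyone else's; the names LinkExtractor and extract_links are made up for this example, and it simply treats an image's alt text as the anchor text of the link that wraps it.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect (href, anchor text) pairs; image alt text counts as anchor text."""

    def __init__(self):
        super().__init__()
        self.links = []     # finished (href, text) pairs
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and attrs.get("href"):
            self._href = attrs["href"]
            self._text = []
        elif tag == "img" and self._href is not None:
            # An image inside a link contributes its alt text.
            self._text.append(attrs.get("alt") or "")

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace in the collected anchor text.
            text = " ".join("".join(self._text).split())
            self.links.append((self._href, text))
            self._href = None


def extract_links(html):
    """Hypothetical helper: return every (href, anchor text) pair in the HTML."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links
```

Fed a page containing one text link and one image link, extract_links returns both in the same form, which is all a crawler needs in order to follow them.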

3. Absolutely no JavaScript links

If you want a link to be followed by a crawler, make sure it's not a JavaScript link. Most crawlers, including ours, actively avoid JavaScript.

There are many ways around this if it's a problem, so there's really no excuse.
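To show the kind of check a crawler might make, here is a small sketch using Python's urllib.parse. The function name is_followable and the exact rules are assumptions for the sake of illustration, not a description of any particular crawler: it accepts plain relative and http(s) links and rejects anything with another scheme, such as javascript: or mailto:.

```python
from urllib.parse import urlsplit


def is_followable(href):
    """Return True if href looks like a plain link a crawler could follow."""
    href = href.strip()
    # Empty hrefs and bare fragments ("#") lead nowhere new.
    if not href or href.startswith("#"):
        return False
    scheme = urlsplit(href).scheme.lower()
    # Relative URLs have no scheme; anything else must be http or https.
    return scheme in ("", "http", "https")
```

A link written as href="/products/shoes" passes this check, while href="javascript:openPage('shoes')" does not, so the content behind it is never reached.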

4. Do not put forms in front of content you wish to be crawled

This is a surprisingly common problem. Forms are fine for users - they enter the details, click submit and there's all sorts of lovely content. Crawlers can't generally read or submit forms so you should provide an alternative route to that content.

A browsable list is usually quite useful for users as well as crawlers. Sitemaps help, but even if a crawler can find the pages, the fact that there appear to be few internal links to them (i.e. they are hidden behind a single submit button) may lead search engine ranking algorithms to rank those pages lower than you would like.

5. Mix of relative and absolute URLs

Most crawlers can deal with this now but it does add an overhead if the crawler has to establish the "real" URL to a page.

Our crawler applies a series of rules to ensure that each URL is consistent, and no doubt Googlebot and the others do the same.
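As a minimal sketch of what establishing the "real" URL involves, the following uses urljoin from Python's standard library to resolve a link against the page it was found on. The function name canonical_url and the particular clean-up steps (lower-casing the scheme and host, dropping the fragment, defaulting an empty path to "/") are assumptions chosen for illustration; a real crawler will apply its own rules.

```python
from urllib.parse import urljoin, urlsplit, urlunsplit


def canonical_url(base_url, href):
    """Resolve href against base_url and return one consistent absolute URL."""
    absolute = urljoin(base_url, href.strip())
    scheme, netloc, path, query, _fragment = urlsplit(absolute)
    # Scheme and host are case-insensitive; the fragment never reaches the server.
    return urlunsplit((scheme.lower(), netloc.lower(), path or "/", query, ""))
```

With this, a relative link such as "../hats/?page=2#top" found on http://Example.com/shop/shoes/index.html resolves to http://example.com/shop/hats/?page=2, the same result an absolute link to that page would give.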

The easier the crawler's job, the more it crawls.

6. Paging through content

There are many websites where the paging is based on unnecessary dynamic URLs. Crawlers can work with dynamic URLs but again, it involves more overhead to ensure that the crawler is not going to get stuck in a loop going over and over the same content.

Generated data grids and the like often use JavaScript for paging, which is very bad practice as far as crawlers are concerned.

7. Sessions

Sessions are a crawler's worst nightmare. The Crawl Score crawler and many others actively look for session-based URLs so that they don't follow any links that look like sessions.

URLs that contain "session_id" or anything remotely similar are avoided at all costs.
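To give a feel for how such a check might work, here is a rough sketch in Python. The list of parameter names in SESSION_PARAM is an assumption made up for this example (common names such as PHPSESSID and jsessionid), not the list our crawler actually uses, and looks_like_session_url is a hypothetical helper.

```python
import re
from urllib.parse import parse_qsl, urlsplit

# Assumed list of parameter names that usually carry a session ID.
SESSION_PARAM = re.compile(
    r"^(session_?id|sess_?id|sid|phpsessid|jsessionid)$", re.IGNORECASE
)


def looks_like_session_url(url):
    """Return True if the URL appears to carry a session identifier."""
    parts = urlsplit(url)
    params = parse_qsl(parts.query, keep_blank_values=True)
    if any(SESSION_PARAM.match(name) for name, _value in params):
        return True
    # Some Java servers append ";jsessionid=..." to the path instead.
    return ";jsessionid=" in parts.path.lower()
```

A crawler using a check like this would skip http://example.com/list?session_id=abc123&page=2 but happily follow http://example.com/list?page=2.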

The reason for this is that the same content can be repeated under different URLs and the crawler could get stuck in a loop. Most crawlers have a timeout value for each page or site for this situation, but there's a reasonable chance that a crawler will either not come back or at least come back with limited resources.

There are a surprising number of sites that use sessions. I'll be blogging about some UK government sites that use such practices in another blog post soon.

8. Use of Last Modified

A crawler has a limited amount of time to spend on your site. If you're not using the Last-Modified header correctly, then on each visit it will be re-crawling old, already indexed pages rather than looking for new content.

We see it time and time again: large sites have problems getting all their pages indexed because they don't send Last-Modified. These headers are there for a reason, which is to help both the search engine crawlers and you in getting your site fully indexed.
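For anyone wondering what using the header correctly looks like on the server side, here is a minimal sketch in Python. It assumes you know when each page last changed; the function name conditional_response is hypothetical and the code is not tied to any web framework. It always sends Last-Modified, and answers 304 Not Modified when the crawler's If-Modified-Since date shows it already has the current version.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_response(last_modified, if_modified_since=None):
    """Return (status, headers) for a page last changed at last_modified.

    last_modified must be a timezone-aware datetime in UTC.
    """
    # HTTP dates only have one-second resolution.
    last_modified = last_modified.replace(microsecond=0)
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable header: send the full page
        if since is not None and since.tzinfo is not None and last_modified <= since:
            return 304, headers  # nothing new, so no body is needed
    return 200, headers
```

A crawler that revisits with If-Modified-Since set to the date it was given last time gets a tiny 304 for every unchanged page, leaving more of its time for your new content.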

In summary, if you ensure your site is light work for a crawler then you'll get more of your content crawled.