Tuesday, 2 November 2010

Why Isn’t my Site being Crawled Properly?

If you are concerned that your website isn’t showing up as often as you’d like in the search engine results pages, consider this: parts of your website may be impossible for search engine bots to crawl. The simpler your site structure, the more easily spiders can navigate it and, as a result, catalogue more of your content and links.
It is not uncommon for a spider to reach a page and give up on caching it simply because it can’t read the data format the page uses.
Be on the lookout for any of these problematic areas:

Tabular Form Websites :
Using a table-based layout on today's Internet goes against best practice, and the markup may carry no "real" semantics. In the past tables were used to aid design layout, but the method is rarely used today. Spiders give more priority to titles and H1 tags; likewise, they will attribute more weight to content that appears to be a column heading, whether it really is one or not. This is a sure-fire way to have your content cached incorrectly.

Using JavaScript for Links :
It is strongly advised NOT to use JavaScript for any links on your website. JavaScript links are often written in such a way that a search engine gives them much less weight than a regular link, or does not crawl them at all. Spiders are getting cleverer by the minute, but a grey cloud still looms over this topic. If you want to be safe for the time being, stick to regular HTML links.
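
As a rough way to audit a page for this problem, the sketch below scans HTML for anchors that rely on JavaScript rather than a plain href. It uses only Python's standard html.parser module. The names ScriptLinkFinder and find_script_links are made up for this illustration, and the two patterns it checks (a "javascript:" URL, or an onclick handler with an empty or "#" href) are assumptions about what a script-driven link typically looks like, not a list published by any search engine.

```python
from html.parser import HTMLParser


class ScriptLinkFinder(HTMLParser):
    """Collects <a> tags that depend on JavaScript instead of a plain href."""

    def __init__(self):
        super().__init__()
        self.script_links = []  # (line_number, reason) pairs

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        href = (attr_map.get("href") or "").strip().lower()
        line_number = self.getpos()[0]  # line on which the tag starts
        if href.startswith("javascript:"):
            # Assumed pattern 1: the href itself is a script call.
            self.script_links.append((line_number, "javascript: URL"))
        elif href in ("", "#") and "onclick" in attr_map:
            # Assumed pattern 2: navigation happens only in an onclick handler.
            self.script_links.append((line_number, "onclick without a real href"))


def find_script_links(html):
    """Return a list of (line_number, reason) for script-driven links."""
    finder = ScriptLinkFinder()
    finder.feed(html)
    return finder.script_links
```

Calling find_script_links on a page's source returns one entry per suspect link, which gives you a short list of anchors to convert to regular HTML links.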

Crawling and Indexing Flash Content :
For a long while now, Flash and other rich media content have been difficult for spiders to extract information from. Google have recently been working with Adobe on better methods of crawling these sites, but the approach is by no means perfected yet. As with JavaScript, if you want to err on the safe side for the time being, it is probably best not to build your whole site in Flash.
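
To see how much of a page depends on Flash, a small check along the following lines can help. This is a minimal sketch using Python's standard html.parser module; FlashFinder and find_flash are hypothetical names, and the detection rule (an object or embed tag with the Flash MIME type or a .swf source) is an assumption about how Flash is usually embedded, so it will miss movies that are injected by a script.

```python
from html.parser import HTMLParser

FLASH_MIME = "application/x-shockwave-flash"


class FlashFinder(HTMLParser):
    """Collects the sources of <object> and <embed> tags that look like Flash."""

    def __init__(self):
        super().__init__()
        self.flash_sources = []

    def handle_starttag(self, tag, attrs):
        if tag not in ("object", "embed"):
            return
        attr_map = dict(attrs)
        # <object> normally uses "data", <embed> normally uses "src".
        source = attr_map.get("data") or attr_map.get("src") or ""
        mime = (attr_map.get("type") or "").lower()
        if mime == FLASH_MIME or source.lower().endswith(".swf"):
            self.flash_sources.append(source or "(no source given)")


def find_flash(html):
    """Return the list of Flash sources embedded directly in the HTML."""
    finder = FlashFinder()
    finder.feed(html)
    return finder.flash_sources
```

An empty result suggests the page's main content is available as ordinary HTML; a long list suggests that important content may be locked inside Flash movies.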

PDF and PowerPoint Files :
Search engines will definitely crawl PDFs but, as with the points above, links found within a PDF document seem to carry less weight than a regular link on a web page. The same goes for content and links found within PowerPoint files. An important point to note: always create your PDFs with a text-based program such as Microsoft Word, NOT Photoshop. This will at least ensure that all your content is correctly indexed.

Link Stuffed Pages :
Much of the literature available from the major search engines is pretty clear: when a spider comes across a page stuffed with more than 150-200 links, there is a high probability that not all of those links will be crawled. Try to keep each page below 100 links to ensure all of your links are accounted for.
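
Counting the links on a page by hand is tedious, so here is a minimal sketch of an automated check, again using Python's standard html.parser module. LinkCounter, count_links and within_link_limit are hypothetical names, and LINK_LIMIT simply encodes the 100-link rule of thumb from this post; it is not an official threshold.

```python
from html.parser import HTMLParser

LINK_LIMIT = 100  # rule of thumb from this post, not an official figure


class LinkCounter(HTMLParser):
    """Counts <a> tags that carry a non-empty href attribute."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag == "a" and any(name == "href" and value for name, value in attrs):
            self.count += 1


def count_links(html):
    """Return the number of links found in the HTML."""
    counter = LinkCounter()
    counter.feed(html)
    return counter.count


def within_link_limit(html, limit=LINK_LIMIT):
    """True when the page has fewer links than the limit."""
    return count_links(html) < limit
```

Named anchors without an href are deliberately ignored, since they are not links a spider would follow.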

Links in Forms :
The vast majority of links that sit inside forms requiring human intervention to proceed through the form steps cannot be indexed by a spider. Google have been known to attempt automated form submissions, but how much information the bot manages to extract is unclear.
