Tips for crawling high volume sites in SiteAnalyzer

2025-01-19 | Time to read: 3 minutes
Author: Simagin Andrey

If you are scanning a large website, you can significantly improve SiteAnalyzer's performance by doing the following:

1. First, check how large the site you are going to scan is

  • 1-1,000 pages is a small volume to scan. You can run it through the Fast Crawl function on most PCs without changing any specific settings.
  • 1k-50k pages is not a significant volume: most computers will scan the site without trouble and without configuration changes, although you will need at least 2-3 GB of free disk space.
  • 50k-100k pages is a fairly large-scale crawl. Here you can change the scanning mode in SiteAnalyzer from virtual to project (storing data in the database on the hard drive); it is better to store the database on an external SSD drive, if possible.
  • 500k+ pages: work in project mode and, if possible, use an external drive. Change the memory allocation so that 10+ GB of RAM stays free.
  • 1 million+ pages: we recommend using a powerful PC, preferably a desktop with at least 32 GB of RAM, on which you can allocate most of the resources directly to scanning.
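
As a rough illustration, the tiers above can be written down as a small lookup helper. This is only a sketch: the function name and the returned labels are made up for this example, and the 100k-500k range, which the list does not mention, is assumed here to behave like the 50k-100k tier.

```python
def recommend_setup(page_count: int) -> str:
    """Map an estimated page count to the tier described in the list above."""
    if page_count < 0:
        raise ValueError("page_count must be non-negative")
    if page_count <= 1_000:
        return "small: Fast Crawl, default settings"
    if page_count <= 50_000:
        return "moderate: default settings, 2-3 GB of free disk space"
    if page_count < 500_000:
        # Assumption: 100k-500k is not covered by the list, so it is
        # treated like the 50k-100k tier.
        return "large: project mode, database on an SSD if possible"
    if page_count < 1_000_000:
        return "very large: project mode, external drive, 10+ GB of free RAM"
    return "huge: powerful desktop PC with at least 32 GB of RAM"


if __name__ == "__main__":
    for pages in (800, 30_000, 75_000, 600_000, 2_000_000):
        print(pages, "->", recommend_setup(pages))
```

The page count itself can usually be estimated beforehand, for example from the number of URLs in the site's sitemap.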

2. Crawl your site without unnecessary parameters

Consider whether you really need to crawl all URLs.
If you have a dynamic site with many URL parameters, this can significantly increase the number of URLs to crawl without adding any value.

If so, determine which variables/parameters you can leave out in order to scale down the crawl.

Also, under «Settings > Exclude URLs», use a regular expression to exclude part of the URL path, e.g. the ?s= / ?= parameters.
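
Before pasting a pattern into the settings, it can help to try it on a few sample URLs. The sketch below uses Python's re module; the pattern and the sample URLs are illustrative assumptions, and the regular expression dialect accepted by SiteAnalyzer may differ from Python's, so treat it as a way to reason about the pattern rather than as the tool's exact behavior.

```python
import re

# Hypothetical pattern: matches a query parameter named "s" ("?s=", "&s=")
# or a parameter with an empty name ("?=", "&=").
EXCLUDE_PATTERN = re.compile(r"[?&]s?=")


def is_excluded(url: str) -> bool:
    """Return True if the URL contains one of the excluded parameters."""
    return EXCLUDE_PATTERN.search(url) is not None


if __name__ == "__main__":
    samples = [
        "https://example.com/?s=shoes",
        "https://example.com/blog/?=1",
        "https://example.com/catalog/",
        "https://example.com/catalog/?page=2",
    ]
    for url in samples:
        print("exclude" if is_excluded(url) else "keep", url)
```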

If you have a large website with numerous parameters, you can use filtering lists to exclude URLs with parameters, e.g.:

  • https://example.com/?color
  • https://example.com/?size
  • https://example.com/?type
  • https://example.com/?sex, and so on
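
To estimate how much such a list would shrink a crawl, the same filtering can be simulated on an exported list of URLs. The following is a minimal sketch using Python's standard urllib.parse; the parameter names come from the list above, while the helper name and the sample URLs are assumptions made for illustration.

```python
from urllib.parse import parse_qs, urlsplit

# Parameter names taken from the example list above.
EXCLUDED_PARAMS = {"color", "size", "type", "sex"}


def has_excluded_param(url: str) -> bool:
    """Return True if the URL's query string uses any excluded parameter."""
    query = urlsplit(url).query
    # keep_blank_values=True also reports parameters without a value ("?color").
    names = parse_qs(query, keep_blank_values=True).keys()
    return any(name in EXCLUDED_PARAMS for name in names)


if __name__ == "__main__":
    urls = [
        "https://example.com/?color",
        "https://example.com/?size=xl&page=2",
        "https://example.com/?page=2",
        "https://example.com/catalog/",
    ]
    kept = [u for u in urls if not has_excluded_param(u)]
    print(f"{len(kept)} of {len(urls)} URLs would remain in the crawl")
```

Matching on parsed parameter names rather than on raw substrings avoids accidentally excluding URLs that merely contain a word such as "size" in the path.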

3. Scan the site in parts

Consider splitting the crawl into several segments; that's great for large sites if you don't want to wait indefinitely for the entire site to be scanned.

We recommend setting up a crawl using a subfolder as the crawl base, then making sure that in the SiteAnalyzer settings the URLs to follow include the path to your subfolder. This way only URLs in that subfolder will be crawled.
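
The idea of restricting a crawl to one subfolder can be sketched as a simple URL check. This is not SiteAnalyzer's implementation, only an illustration of the rule; the function name and example URLs are hypothetical.

```python
from urllib.parse import urlsplit


def in_subfolder(url: str, base: str) -> bool:
    """Return True if url is on the same host as base and inside its path."""
    u, b = urlsplit(url), urlsplit(base)
    # Normalize the base path so "/blog" and "/blog/" behave the same
    # and "/blog-old/" is not mistaken for "/blog/".
    base_path = b.path if b.path.endswith("/") else b.path + "/"
    same_host = u.netloc.lower() == b.netloc.lower()
    inside = u.path.startswith(base_path) or u.path + "/" == base_path
    return same_host and inside


if __name__ == "__main__":
    base = "https://example.com/blog/"
    for url in (
        "https://example.com/blog/post-1",
        "https://example.com/blog-old/post-1",
        "https://example.com/shop/item",
    ):
        print(in_subfolder(url, base), url)
```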

4. Page Info

In «Settings > Scan Rules > Include Content types» – uncheck Images / CSS / JS etc. You can uncheck everything except HTML.
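
Conceptually, keeping only HTML means deciding by the Content-Type of each response. A minimal sketch of such a check is shown below; it is an illustration only, and treating application/xhtml+xml as HTML is an assumption of this example rather than something the settings dialog is known to do.

```python
# Media types treated as "HTML" in this sketch (the XHTML entry is an assumption).
HTML_TYPES = {"text/html", "application/xhtml+xml"}


def is_html(content_type_header: str) -> bool:
    """Return True if a Content-Type header value describes an HTML page."""
    # Drop parameters such as "; charset=utf-8" and normalize case.
    media_type = content_type_header.split(";", 1)[0].strip().lower()
    return media_type in HTML_TYPES


if __name__ == "__main__":
    for header in ("text/html; charset=utf-8", "image/png", "text/css"):
        print(is_html(header), header)
```

Skipping images, stylesheets, and scripts reduces both the number of requests and the amount of data stored per project.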

The solutions above should help you crawl your site more efficiently!
