If you are scanning a website, you can significantly improve SiteAnalyzer's performance by doing the following:
1. First, check how large the site you are going to scan is.
- 1-1,000 pages is a small volume. You can run it through the Fast Crawl function on most PCs without changing any settings.
- 1k-50k pages is still a modest volume: most computers will scan the site without troubleshooting and without configuration changes, although you will need at least 2-3 GB of free disk space.
- 50k-100k pages is a fairly large-scale crawl. Here you can change the scanning mode in SiteAnalyzer from virtual to project (storing data in a database on the hard drive). It is better to keep the database on an external SSD, if possible.
- 500k+ pages should be scanned in project mode and, if possible, with the database on an external drive. Change the memory allocation so that 10+ GB of RAM is kept free for the crawl.
- 1 million+ pages: we recommend a powerful PC, preferably a desktop with at least 32 GB of RAM, on which you can allocate most of the resources directly to scanning.
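The size tiers above can be written down as a small lookup, which may be handy if you estimate page counts for many sites at once. This is only a sketch: `recommend_setup` is a hypothetical helper, not part of SiteAnalyzer, and the returned strings simply paraphrase the list. The list does not mention sites between 100k and 500k pages, so the sketch does not guess for that range.

```python
def recommend_setup(pages: int) -> str:
    """Map an estimated page count to the advice from the list above."""
    if pages < 1:
        raise ValueError("pages must be a positive integer")
    if pages <= 1_000:
        return "Fast Crawl, default settings"
    if pages <= 50_000:
        return "default settings, 2-3 GB of free disk space"
    if pages <= 100_000:
        return "project mode, database on an external SSD if possible"
    if pages < 500_000:
        # Assumption: the list has no tier for 100k-500k, so nothing is invented here.
        return "not covered by the list above"
    if pages < 1_000_000:
        return "project mode, external drive if possible, 10+ GB of RAM reserved"
    return "powerful desktop PC with at least 32 GB of RAM"


if __name__ == "__main__":
    for count in (800, 30_000, 75_000, 600_000, 2_000_000):
        print(f"{count:>9} pages: {recommend_setup(count)}")
```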
2. Crawl your site without unnecessary parameters
Consider whether you really need to crawl all URLs.
If you have a dynamic site with many URL parameters, this can significantly increase the number of URLs to crawl without adding any value.
If so, determine which variables/parameters can be dropped to reduce the size of the crawl.
Also, under «Settings > Exclude URLs», use a regular expression to exclude parts of the URL, e.g. the ?s= / ?= parameters.
If you have a large website with numerous parameters, you can use filtering lists to exclude URLs with parameters, e.g.:
- https://example.com/?color
- https://example.com/?size
- https://example.com/?type
- https://example.com/?sex and so on
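Before pasting an exclusion pattern into the settings, it can help to try it against a handful of real URLs. The sketch below uses Python's `re` module for that. The pattern and the `keep_url` helper are illustrative assumptions, and the regex dialect SiteAnalyzer accepts may differ from Python's, so treat this as a way to reason about the rule rather than as the exact string to enter.

```python
import re

# Hypothetical pattern: matches a query parameter named color, size, type,
# sex or s, with or without a value (followed by "=", "&", "#" or end of URL).
EXCLUDE = re.compile(r"[?&](?:color|size|type|sex|s)(?:=|&|#|$)")


def keep_url(url: str) -> bool:
    """Return True if the URL does not contain an excluded parameter."""
    return EXCLUDE.search(url) is None


urls = [
    "https://example.com/?color",
    "https://example.com/?size=xl&page=2",
    "https://example.com/shop/?page=2&type=shoes",
    "https://example.com/shop/?page=2",
    "https://example.com/colors/",
    "https://example.com/?sort=asc",
]

kept = [u for u in urls if keep_url(u)]

if __name__ == "__main__":
    for u in kept:
        print(u)
```

Note that the parameter names are matched as whole names, so a path such as /colors/ or a parameter such as ?sort= is not excluded by accident.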
3. Scan the site in parts
Consider creating multiple segmented site crawls. That's great for large sites if you don't want to wait indefinitely for the entire site to be scanned.
We recommend setting up a crawl using a subfolder as the crawl base. Then make sure that in the SiteAnalyzer settings the URLs to follow include the path to your subfolder; this way only URLs in that subfolder will be crawled.
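The "only URLs in that subfolder" rule amounts to a prefix check on the URL path. A minimal sketch of that check, assuming a hypothetical `in_subfolder` helper and example URLs (this is not SiteAnalyzer's own code, only a way to preview which URLs a subfolder crawl would cover):

```python
from urllib.parse import urlsplit


def in_subfolder(url: str, base: str) -> bool:
    """Return True if url lives on the same host as base and under its path."""
    u, b = urlsplit(url), urlsplit(base)
    if (u.scheme, u.netloc) != (b.scheme, b.netloc):
        return False
    # Normalize the base path so that "/blog" and "/blog/" behave the same.
    base_path = b.path if b.path.endswith("/") else b.path + "/"
    # Accept the folder itself ("/blog") and anything below it ("/blog/...").
    return u.path == base_path.rstrip("/") or u.path.startswith(base_path)


if __name__ == "__main__":
    base = "https://example.com/blog/"
    for candidate in (
        "https://example.com/blog/post-1",
        "https://example.com/blogging/tips",
        "https://example.com/shop/",
    ):
        print(candidate, in_subfolder(candidate, base))
```

Comparing against the path with a trailing slash avoids treating /blogging/ as part of /blog/.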
4. Page Info
In «Settings > Scan Rules > Include Content types», uncheck Images / CSS / JS, etc. You can uncheck everything except HTML.
The above-mentioned solutions should help you crawl your site more efficiently!
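Conceptually, keeping only HTML means looking at each response's Content-Type header and skipping everything else. The sketch below illustrates that idea with a hypothetical `is_html` helper. Treating `application/xhtml+xml` as HTML is an assumption made here, and SiteAnalyzer's internal logic may work differently.

```python
# Assumption: these two media types are what "HTML" covers in this sketch.
HTML_TYPES = {"text/html", "application/xhtml+xml"}


def is_html(content_type: str) -> bool:
    """Return True if a Content-Type header value denotes an HTML page."""
    # Drop parameters such as "; charset=utf-8" and ignore letter case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type in HTML_TYPES


if __name__ == "__main__":
    for header in ("text/html; charset=utf-8", "image/png", "text/css"):
        print(header, is_html(header))
```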