SiteAnalyzer is designed to analyze a website and identify technical errors (broken links, duplicate pages, incorrect server responses), as well as errors and omissions in SEO optimization (empty meta tags, an excess or complete absence of H1 headings, page content quality, internal linking quality, and a variety of other SEO parameters). In total, more than 60 parameters are analyzed.
- Scanning of all site pages, as well as images, scripts, documents, video, and more
- Checking web server response codes for each page of the site (200, 301, 302, 404, 500, 503, etc.)
- Extracting the content of the Title tag, meta Keywords, meta Description, and H1-H6 headings
- Finding and displaying duplicate pages, meta tags, and headings
- Determining the presence of the rel="canonical" attribute for each page of the site
- Following the directives of "robots.txt", the "robots" meta tag, and the X-Robots-Tag header
- Following "noindex" and "nofollow" rules when crawling site pages
- Data scraping based on XPath, CSS, XQuery, RegEx
- Website content uniqueness checking
- Google PageSpeed score checking
- Domain analysis (WHOIS checker, CMS checker, searching for subdomains, keywords density, etc.)
- Hyperlink analysis: display of internal and external links for any page of the site
- Calculation of internal PageRank for each page
- Site Structure Visualization on the graph
- Check and Show Redirect Chains
- Scanning an arbitrary external URL and Sitemap.xml
- Sitemap "sitemap.xml" generation (with the possibility of splitting into several files)
- Filtering data by any parameter
- Search for arbitrary content on the site
- Export reports to CSV, Excel and PDF
Differences From Similar Tools
- Low demands on computer resources
- Scanning of websites of any size thanks to low computer resource requirements
- Portable format (works without installation on a PC or directly from removable devices)
- Getting Started
- Program Settings
- Main Settings
- Scanning Rules
- Virtual Robots.txt
- Yandex XML
- Custom HTTP headers
- Exclude URLs
- Include URLs
- White Label
- Working With SiteAnalyzer
- Configure Tabs & Columns
- Data Filtering
- Technical Statistics
- Custom Filters
- Custom Search
- Domain Analysis
- Data Scraping
- Content uniqueness checking
- PageSpeed score checking
- Site Structure
- Project List Context Menu
- Visualization Graph
- Internal Links Chart
- Page Load Performance Graph
- Sitemap.xml Generation
- Scan Arbitrary URLs
- Data Export
- Multilanguage Support
- Compress Database
After the crawler has visited all pages of the site, a report becomes available in the form of a table that displays the collected data, grouped into thematic tabs.
All analyzed projects are displayed in the left part of the program and are automatically saved in the program database together with the received data. To delete unnecessary sites, use the context menu of the project list.
- clicking the "Pause" button pauses the project scan; the current scan progress is saved to the database, which allows you, for example, to close the program and continue scanning the project from the stopping point after restarting it
- the "Stop" button aborts the scan of the current project with no possibility of resuming it
The "Settings" section of the main menu is intended for fine-tuning how the program works with external sites, and it contains 7 tabs:
The main settings section serves for specifying the user-defined directives used when scanning the site.
Description of the parameters:
- Number of threads
- The higher the number of threads, the more URLs can be processed per unit of time. Keep in mind that more threads consume more PC resources. It is recommended to set the number of threads in the range of 10-15.
- Scan Time
- It sets the time limit for scanning a site. It is measured in hours.
- Maximum depth
- This parameter specifies the crawl depth of the site. The home page has a nesting level of 0. For example, if you want to crawl pages such as "somedomain.ru/catalog.html" and "somedomain.ru/catalog/tovar.html", set the maximum depth to 2.
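As an illustration, the nesting level described above can be derived from the URL path alone. This is a hedged sketch of the idea, not SiteAnalyzer's actual implementation:

```python
from urllib.parse import urlparse

def nesting_depth(url: str) -> int:
    """Nesting level = number of path segments; the home page is level 0."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    return len(segments)

print(nesting_depth("https://somedomain.ru/"))                    # 0
print(nesting_depth("https://somedomain.ru/catalog.html"))        # 1
print(nesting_depth("https://somedomain.ru/catalog/tovar.html"))  # 2
```

So a maximum depth of 2 is enough to cover both example pages.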
- Delay between requests
- Sets pauses between the crawler's requests to the site's pages. This is necessary for sites on "weak" hosting that cannot withstand heavy loads and frequent requests.
- Query Timeout
- Sets how long the program waits for a site to respond to a request. If some pages respond slowly (take a long time to load), the site scan can take quite a while. Such pages can be cut off by specifying a timeout, after which the scanner moves on to the remaining pages so they do not delay the overall progress.
- Maximum crawled pages
- Limits the maximum number of pages crawled. This is useful if, for example, you need to scan only the first X pages of a site (images, CSS styles, scripts, and other file types do not count toward the limit).
- In this section, you can select the types of data that the parser will take into account when crawling pages (images, videos, styles, scripts) or exclude unnecessary information when parsing.
- These settings control how crawling is restricted by the "robots.txt" file, "nofollow" links, and "meta name='robots'" directives in the page code.
- SiteAnalyzer offers 3 types of cookie management:
- Permanent – select this option if the site cannot be accessed without cookies. It is also recommended if each request should be counted within the same session; otherwise, every new request will create a new session.
- Sessional – in this case, each new request will create a new session.
- Ignore – disables cookies entirely.
- We recommend using the first option, as it is the most universal and allows you to crawl most websites on the internet without any issues.
- It is also possible to export all cookies of the active website to a text file using the "Export Cookies" item in the main menu of the program.
This section serves to specify the main SEO-parameters being analyzed, which will be checked for correctness in the future when parsing pages, after which the statistics obtained will be displayed on the SEO statistics tab in the right part of the main program window.
With the help of these settings, you can select a service through which you will check the indexation of pages in the Yandex search system. There are two options for checking indexing: using the Yandex XML service or the Majento.ru service.
When choosing the Yandex XML service, keep in mind the possible hourly or daily restrictions applied when checking page indexing, which depend on the limits of your Yandex account. As a result, your account's limits may not be enough to check all pages at once, and you may have to wait for the next hour.
When using the Majento service, hourly and daily restrictions are practically absent, since your limit is merged into a shared pool, which is sizeable in itself and has a significantly larger hourly allowance than any individual "Yandex XML" user account.
The program supports a virtual robots.txt file, which can be used instead of the real one located at the root of the site.
This is especially convenient when testing a website: for example, when you need to crawl specific non-indexable sections or, conversely, exclude them from the scan, without wasting a developer's time changing the real robots.txt.
A virtual Robots.txt file is stored in the program settings and is common for all projects.
Note: when importing a list of URLs, the directives of the virtual robots.txt are taken into account (if this option is activated); otherwise, no robots.txt is applied to the URL list.
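The effect of a virtual robots.txt can be approximated with Python's standard library: instead of fetching the site's real file, a local set of directives is parsed and applied. The directives and URLs below are invented for illustration:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical virtual directives used instead of the site's real robots.txt.
VIRTUAL_ROBOTS = """\
User-agent: *
Disallow: /admin/
Disallow: /search
"""

rp = RobotFileParser()
rp.parse(VIRTUAL_ROBOTS.splitlines())

print(rp.can_fetch("*", "https://example.com/admin/users"))  # False
print(rp.can_fetch("*", "https://example.com/catalog/"))     # True
```

A crawler would consult `can_fetch` for every candidate URL before queueing it.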
In the User-Agent section, you can specify which user-agent will be presented to the program when accessing external sites during their scanning.
By default, a custom user agent is set; however, if necessary, you can select one of the standard agents most commonly found on the Internet, such as the YandexBot and GoogleBot search engine bots, the Chrome, Firefox, IE8, and MicrosoftEdge browsers, as well as mobile devices such as iPhone and Android, and many others.
This option allows you to analyze how a website and its pages behave in response to different requests. For example, you may need to send a Referer header, the administrator of a multilingual site might want to send Accept-Language|Charset|Encoding headers, and some might need to send unusual values in the Accept-Encoding, Cache-Control, or Pragma headers, etc.
Note: the User-Agent header is configured on a separate tab in the "User-Agent" settings.
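A request with custom headers like those mentioned above can be sketched with the standard library; the header values here are illustrative examples, not defaults of the program:

```python
from urllib.request import Request, urlopen

# Illustrative custom headers; vary any of these per experiment.
custom_headers = {
    "Referer": "https://example.com/landing",
    "Accept-Language": "de-DE,de;q=0.9",
    "Cache-Control": "no-cache",
    "Pragma": "no-cache",
}

req = Request("https://example.com/", headers=custom_headers)

# Actual network call, shown for completeness:
# with urlopen(req, timeout=10) as resp:
#     print(resp.status, resp.headers.get("Content-Type"))
```

Comparing responses with and without such headers shows how the server treats different clients.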
If you need to work through a proxy, this section lets you add a list of proxy servers through which the program will access external resources. It is also possible to check the proxies for availability and to remove inactive proxy servers.
This section is designed to avoid crawling certain pages and sections of the site when parsing.
Using regular expressions, you can specify which sections of the site must not be crawled and, accordingly, must not be included in the program database. This list is a local exception list for the duration of the site scan (relative to it, the "global" list is the "robots.txt" file at the root of the site).
Similarly, this section allows you to add URLs that must be crawled; all other URLs outside these folders will be ignored during the scan. This option also works with regular expressions.
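The interplay of the two lists can be modeled as follows; the patterns are hypothetical examples, not program defaults:

```python
import re

# Hypothetical exclude patterns: tag pages, sort parameters, PDF files.
EXCLUDE = [re.compile(p) for p in (r"/tag/", r"\?sort=", r"\.pdf$")]
# Hypothetical include pattern: crawl only the catalog section.
INCLUDE = [re.compile(p) for p in (r"^https://example\.com/catalog/",)]

def should_crawl(url: str) -> bool:
    """A URL is crawled if it matches an include rule and no exclude rule."""
    if INCLUDE and not any(p.search(url) for p in INCLUDE):
        return False
    return not any(p.search(url) for p in EXCLUDE)

print(should_crawl("https://example.com/catalog/item-1"))    # True
print(should_crawl("https://example.com/catalog/tag/sale"))  # False
print(should_crawl("https://example.com/blog/post"))         # False
```

Note that exclude rules win over include rules, which matches the usual crawler convention.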
Using the PageRank parameter, you can analyze the navigation structure of your websites, as well as optimize the system of internal links of a web resource for transmitting reference weight to the most important pages.
The program offers two options for calculating PageRank: the classical algorithm and its more modern counterpart. For analyzing a site's internal linking, there is little difference between the two, so you can use either algorithm.
A detailed description of the algorithm and the principles of calculating PageRank can be found in this article: calculation of internal PageRank.
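The classical algorithm mentioned above can be sketched as a simple iterative computation over the site's internal link graph. This is a minimal illustration under the standard damping-factor formulation, not the program's exact implementation:

```python
def internal_pagerank(links, damping=0.85, iterations=50):
    """links: {page: [linked pages]}; every page must appear as a key
    (use an empty list for pages without outgoing links)."""
    pages = list(links)
    n = len(pages)
    pr = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1 - damping) / n for p in pages}
        for page, targets in links.items():
            if targets:
                share = damping * pr[page] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Dangling page: spread its weight evenly over all pages.
                for t in pages:
                    new[t] += damping * pr[page] / n
        pr = new
    return pr

graph = {
    "home":    ["catalog", "about"],
    "catalog": ["home", "item"],
    "item":    ["home", "catalog"],
    "about":   ["home"],
}
ranks = internal_pagerank(graph)
```

In this toy graph, "home" collects the most internal link weight because every other page links to it.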
Enter a login and password for automatic authorization on pages protected via .htpasswd and BASIC server authorization.
White Label is a feature that lets you create reports and present them under your own brand. Add your logo and contact information to offer your clients more presentable site audits.
By default, the reports include the SiteAnalyzer logo and contacts. Now you can replace them with your own data.
To do so, go to the "Settings" section and click the "White Label" button. Fill in your company's details: logo, e-mail, phone number, address, website, and company name.
More information about the "White Label" feature can be found in the SiteAnalyzer 2.7 review.
After the scan is completed, the information in the "Master data" block becomes available to the user. Each tab contains data grouped with respect to their names (for example, the "Title" tab contains the contents of the page title <title></title>, the "Images" tab contains a list of all images of the site and so on). Using this data, you can analyze the content of the site, find "broken" links or incorrectly filled meta tags.
If necessary (for example, after making changes on the site), using the context menu, it is possible to rescan individual URLs to display changes in the program.
Using the same menu, you can display duplicate pages by the corresponding parameters (duplicate title, description, keywords, h1, h2, content of pages).
The item "Rescan URL with code 0" is intended for automatic double-checking of all pages that return a response code 0 (Read Timeout). This response code is usually given when the server does not have time to deliver content and the connection is closed by timeout, respectively, the page cannot be loaded and information from it cannot be extracted.
Now you can choose which tabs will be displayed in the main data interface (finally, it became possible to say goodbye to the obsolete Meta Keywords tab). This is convenient if the tabs do not fit on the screen, or you rarely use them.
Columns can also be hidden or moved to the desired location by dragging and dropping.
Tabs and columns can be displayed using the context menu in the main data panel. Columns are moved using the mouse drag-n-drop.
For more convenient analysis of site statistics, data filtering is available in the program. Filtering is possible in two variants:
- for any fields using the "quick" filter
- using a custom filter (using advanced data sampling settings)
Used to quickly filter data; it is applied simultaneously to all fields of the current tab.
Designed for detailed filtering and can contain multiple conditions at the same time. For example, you may want to filter pages whose "title" meta tag is no longer than 70 characters and contains the text "news" at the same time. That filter would look like this:
Thus, by applying a custom filter to any of the tabs, you can obtain data samples of any complexity.
The site's technical statistics tab is located on the Additional Data panel and contains a set of basic site technical parameters: statistics on links, meta tags, page response codes, page indexing parameters, content types, etc.
When you click on one of the parameters, the data is automatically filtered in the corresponding master data tab, and statistics are simultaneously displayed on the diagram at the bottom of the page.
The SEO-statistics tab is intended for conducting full-fledged site audits and contains 50+ main SEO parameters and identifies over 60 key internal optimization errors! Error mapping is divided into groups, which, in turn, contain sets of analyzed parameters and filters that detect errors on the site.
For all filtering results, it is possible to quickly export them to Excel without additional dialogues (the report is saved in the program folder).
A detailed description of all the checked parameters is available in the SiteAnalyzer 1.8 review.
This tab contains predefined filters that allow you to create selections for all external links, 404 errors, images and other parameters with all the pages on which they are present. Thus, now you can easily and quickly get a list of external links and the pages on which they are placed, or select all broken links and immediately see on which pages they are located.
All reports are available directly in the program and are displayed on the "Custom" tab of the master data panel. Additionally, they can be exported to Excel through the main menu.
The content search feature can be used to search the page source code and display web pages that contain the content you are looking for.
The custom filters module allows searching for micro-markup, meta tags, and web analytics tools, as well as fragments of specific text or HTML code.
The filter configuration window offers multiple parameters for finding specific text on a website's pages. You can also use it to exclude certain words or pieces of HTML code from a search (similar to searching a page's source code with Ctrl-F).
SiteAnalyzer includes a set of 6 domain analysis modules (WHOIS checker, CMS checker, subdomain search, keyword density, etc.). The domain analysis modules can be accessed through the context menu of the project list.
The list of modules:
- WHOIS Domain Info
- The tool is designed for bulk lookup of domain age and major domain parameters. Our WHOIS checker will help you determine the age of multiple domains in years, as well as display the WHOIS data: domain name registrar, creation and registration expiration dates, NS servers, IP addresses, owner organization names, and contact emails (if specified).
- Check Server Request
- This module allows you to check what response code is returned by the server when one of its pages is accessed. The 200 (OK) HTTP code is the standard response for successful HTTP requests. Non-existent pages return the 404 (Not Found) code. There are also other server response codes, such as 301, 403, 500, 503, etc.
- Searching for subdomains
- The tool is designed to look for subdomains of a specific website. The subdomains check will help you analyze the promoted site. It identifies all of its subdomains, including those that were "forgotten" for some reason or the indexed ones that had been used for tests. The module is also helpful if you want to analyze the structure of large online stores and portals. The "Find Subdomains" tool is based on specially constructed queries to various search engines. Thus, the search is conducted only on the subdomains of the desired web resource that are indexed.
- CMS Checker
- The module is designed to automatically determine the CMS of a website or a group of websites. Determining the type and name of the CMS is conducted by searching for certain patterns in the source code of the site pages. Thus, the module allows you to determine the CMS that runs dozens of specific websites in one click. You will not need to manually study the source code of the pages and the specifics of various content management systems.
- Text Semantic Analysis
- This module analyzes the main SEO parameters of a text, including text length, word count, keyword density ("nausea"), and text readability.
- Text Relevance Analysis
- This tool is designed specifically for SEO-specialists. It allows you to conduct a detailed content analysis of a specific page for its relevance to specific search queries.
- The main criteria by which a page is judged "ready" for promotion are: the presence of the H1 tag, the number of keywords in TITLE and BODY, the number of characters on the page, the number of exact query occurrences in the text, etc.
Most of these analysis modules support batch processing of domains and allow exporting the collected data to a CSV file or the clipboard.
More information about the "Domain Analysis" module can be found in the SiteAnalyzer 2.8 review.
The main web scraping methods are data parsing with XPath, CSS selectors, XQuery, RegExp and HTML templates.
Usually, scraping is used to solve tasks that are difficult to handle manually. For instance, it is useful when you need to extract product descriptions when creating a new online store, scrape prices for marketing research, or monitor advertisements.
Using SiteAnalyzer, you can configure scraping on the Data Extraction tab. It lets you define the extraction rules. You can save them and edit if needed.
There is also a rule testing module. Using the built-in rule debugger, one can quickly and easily get the HTML content of any page on the website and test HTTP requests. The debugged rules can then be used for data parsing in SiteAnalyzer.
As soon as the data extraction is finished, all the collected information can be exported to Excel.
To get more details regarding the module's operation and see the list of the most common rules and regular expressions, check the article How to parse and extract data from a website for free.
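For a feel of what an extraction rule does, here is a tiny XPath-style sketch using Python's standard library. The markup and class names are invented for the demo; SiteAnalyzer's own engine supports far richer XPath, CSS, XQuery, and RegExp rules:

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed page fragment for the demo.
HTML = """<html><body>
<div class="product"><h2>Mixer</h2><span class="price">19.90</span></div>
<div class="product"><h2>Kettle</h2><span class="price">24.50</span></div>
</body></html>"""

root = ET.fromstring(HTML)
# Limited XPath supported by ElementTree: pull each product's name and price.
rows = [
    (div.find("h2").text, div.find("span[@class='price']").text)
    for div in root.iter("div")
    if div.get("class") == "product"
]
print(rows)  # [('Mixer', '19.90'), ('Kettle', '24.50')]
```

The same idea, scaled to thousands of pages and exportable to Excel, is what the Data Extraction tab automates.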
This tool allows searching for duplicate pages and checking the uniqueness of texts within a website. In other words, you can use it to check numerous pages against one another for uniqueness in bulk.
This can be useful in such cases as:
- Searching for full duplicate pages (for instance, a specific webpage with parameters and its SEF URL).
- Searching for partial matches in content (for instance, two borscht recipes in a culinary blog that are 96% similar to each other, which suggests that one of the articles should be deleted to avoid SEO cannibalization).
- Detecting accidental rewrites, for instance when you publish an article on your blog on the same topic that you already covered 10 years ago; the tool will detect such a duplicate as well.
Here is how the content uniqueness checking tool works: the program downloads content from the list of website URLs, receives the text content of the page (without the HEAD block and HTML tags), and then compares them with each other using the Shingle algorithm.
Thus, using text shingles, the tool determines the uniqueness of each page. It can find full duplicates (pages with 0% text uniqueness) as well as partial duplicates with varying degrees of uniqueness. The program works with shingles of length 5.
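A minimal sketch of the shingle comparison, shown here with word shingles of length 5; the program's exact tokenization and shingle unit may differ:

```python
def shingles(text: str, size: int = 5) -> set:
    """Break the text into overlapping sequences ('shingles') of `size` words."""
    words = text.lower().split()
    return {" ".join(words[i:i + size])
            for i in range(max(1, len(words) - size + 1))}

def uniqueness(text_a: str, text_b: str) -> float:
    """Percentage of text_a's shingles that do NOT occur in text_b."""
    a, b = shingles(text_a), shingles(text_b)
    return 100.0 * len(a - b) / len(a) if a else 100.0

recipe = "classic borscht is cooked with beets cabbage and a rich beef broth"
copy = recipe + " garnish with sour cream"
print(uniqueness(recipe, recipe))  # 0.0 — a full duplicate
```

Two unrelated texts score near 100%, while a lightly edited copy lands somewhere in between.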
Read the article How to check a large number of web pages for duplicate content to learn more details about this module.
We have created a free module that allows checking the load speed score of multiple pages at once. It uses the API of the Google PageSpeed Insights tool.
Here are the main analyzed parameters:
- FCP (First Contentful Paint) – metric used for measuring the perceived webpage load speed.
- SI (Speed Index) – metric that shows how quickly the contents of a webpage are visibly populated.
- LCP (Largest Contentful Paint) – metric that measures when the largest content element in the viewport becomes visible.
- TTI (Time to Interactive) – metric that measures how much time passes before the webpage is fully interactive.
- TBT (Total Blocking Time) – metric that measures the time that a webpage is blocked from responding to user input.
- CLS (Cumulative Layout Shift) – metric used for measuring visual stability of a webpage.
Since SiteAnalyzer is a multi-threaded program, you can check hundreds of URLs or even more within several minutes, whereas checking page speed manually could take an entire day or more.
Moreover, the URL analysis itself only takes a few clicks. As soon as it is ready, you can download the report, which conveniently includes all the data in Excel.
All you need to get started is an API key.
Find out how to get one in this article.
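Under the hood, such checks boil down to calls to the public PageSpeed Insights v5 endpoint. A rough sketch follows; the endpoint and JSON path reflect the public API, but treat the details as an assumption to verify against Google's documentation:

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

PSI_ENDPOINT = "https://www.googleapis.com/pagespeedonline/v5/runPagespeed"

def psi_request_url(page_url: str, api_key: str, strategy: str = "mobile") -> str:
    """Build the request URL for one page; strategy is 'mobile' or 'desktop'."""
    query = urlencode({"url": page_url, "key": api_key, "strategy": strategy})
    return f"{PSI_ENDPOINT}?{query}"

def fetch_performance_score(page_url: str, api_key: str) -> float:
    """Network call: returns the Lighthouse performance score (0-100)."""
    with urlopen(psi_request_url(page_url, api_key), timeout=60) as resp:
        data = json.load(resp)
    return data["lighthouseResult"]["categories"]["performance"]["score"] * 100
```

A multi-threaded checker simply issues many such requests in parallel, one per URL.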
This functionality builds the structure of the site from the parsed data. The structure is generated based on the nesting level of page URLs. Once generated, it can be exported to CSV format (Excel).
- In the list of projects, a mass scan is available: select the desired sites and click the "Rescan" button, after which all the sites are queued and scanned one by one in the standard mode.
- Mass removal of selected sites is also available via the "Delete" button.
- In addition to scanning sites one by one, you can mass-add sites to the project list using a special form and then scan the projects of interest.
- For easier navigation through the list of projects, sites can be grouped into folders, and the project list can be filtered by name.
The link visualization mode on the graph helps an SEO specialist assess the distribution of internal PageRank across webpages, as well as understand which pages receive the most link equity (and thus more internal link weight in the eyes of search engines) and which webpages and website sections lack internal links.
Using the website structure visualization mode, an SEO specialist can visually evaluate how internal linking is organized on the website and, by seeing the PageRank mass assigned to particular pages, quickly adjust the current linking and thereby increase the relevance of the pages of interest.
In the left part of the visualization window are the main tools for working with the graph:
- graph zoom
- rotation of the graph at an arbitrary angle
- switching the graph window to full-screen mode (F11)
- show / hide node labels (Ctrl-T)
- show / hide arrows on lines
- show / hide links to external resources (Ctrl-E)
- Day / Night color scheme switching (Ctrl-D)
- show / hide legend and graph statistics (Ctrl-L)
- save the graph in PNG format (Ctrl-S)
- visualization settings window (Ctrl-O)
Section "View" is intended to change the display format of nodes on the graph. In the mode of drawing "PageRank" nodes, the sizes of the nodes are set relative to their previously calculated PageRank indicator, as a result of which you can clearly see on the graph which pages get the most link weight and which ones get the least links.
In "Classic" mode, the node sizes are set relative to the selected scale of the visualization graph.
This chart shows the link juice of a website. In other words, it is yet another visualization of internal linking, in addition to the Visualization Graph.
Numbers on the left side represent pages. Numbers on the right side are links. Finally, numbers at the bottom are quantiles for each column. Duplicate links are discarded from the chart (if page A has three internal links to page B, they are counted as one).
The screenshot above shows the following statistics for a 70-page website:
- 1% of pages have ~68 inbound links.
- 10% of pages have ~66 inbound links.
- 20% of pages have ~15 inbound links.
- 30% of pages have ~8 inbound links.
- 40% of pages have ~7 inbound links.
- 50% of pages have ~6 inbound links.
- 60% of pages have ~5 inbound links.
- 70% of pages have ~5 inbound links.
- 80% of pages have ~3 inbound links.
- 90% of pages have ~2 inbound links.
Pages with fewer than 10 inbound links have a weak internal linking structure. Here, 60% of pages have a satisfactory number of inbound links. Using this information, you can add more internal links to the weak pages if they are important for SEO.
In general practice, pages that have less than 10 internal links are crawled by search robots less often. This applies to Googlebot in particular.
With that in mind, if only 20-30% of pages on your website have a decent internal linking structure, it makes sense to change this. You will need to optimize the internal linking strategy or find another way to deal with the remaining 70-80% of weak pages (you can disable their indexing, use redirects, or delete them).
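The per-column quantiles described above can be reproduced from a list of internal links. This is an illustrative sketch; the chart's exact binning may differ:

```python
from collections import Counter

def inbound_quantiles(pages, links, percents=(10, 30, 50, 70, 90)):
    """pages: all page URLs; links: (source, target) pairs, duplicates counted once.
    Returns, for each percentile, the inbound-link count at that rank."""
    counts = Counter(target for _, target in set(links))
    values = sorted((counts.get(p, 0) for p in pages), reverse=True)
    n = len(values)
    return {p: values[min(n - 1, n * p // 100)] for p in percents}

pages = ["home", "catalog", "item", "about"]
links = [("catalog", "home"), ("item", "home"), ("about", "home"),
         ("home", "catalog"), ("home", "catalog")]  # duplicate link to "catalog"
print(inbound_quantiles(pages, links))
```

For this toy graph, the top 10% of pages ("home") have 3 inbound links, while half the pages have none, flagging them as weakly linked.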
Here is an example of a website with a poor internal linking structure:
And here is a website with a decent internal linking structure:
The Page Load Performance graph can be used to analyze the load speed of a website. For clarity purposes, the pages are divided into groups and time intervals with a 100-millisecond step.
Thus, based on the graph, you can identify how many pages are loaded quickly (within the 0-100 milliseconds range), how many of them have an average load time (around 100-200 milliseconds), and which pages take a long time to load (400 milliseconds or more).
Note: The displayed time shows the load speed of the HTML source code. It does not reflect how long it would take to load full pages. Factors such as page rendering or image optimization are not taken into account.
Sitemap is generated based on crawled pages or site images.
- When generating a sitemap consisting of pages, pages of the "text/html" format are added to it.
- When generating a sitemap consisting of images, JPG, PNG, GIF and similar images are added to it.
You can generate a Sitemap immediately after scanning the site, via the main menu: "Projects -> Generate Sitemap".
For large sites with 50,000 pages or more, there is a function that automatically splits "sitemap.xml" into several files (in this case, the main file contains links to additional files, which hold the direct links to the site pages). This is due to the requirements of search engines for processing large sitemap files.
If necessary, the number of pages per "sitemap.xml" file can be changed from the default of 50,000 to the desired value in the main program settings.
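The splitting logic is straightforward to sketch: chunk the URL list and emit one `<urlset>` per chunk. This is a simplified illustration of the sitemap format, not the program's code, and omits the index file that would link the parts together:

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

def build_sitemaps(urls, max_urls=50000):
    """Return one <urlset> XML string per chunk of at most max_urls pages."""
    files = []
    for i in range(0, len(urls), max_urls):
        urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
        for u in urls[i:i + max_urls]:
            ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = u
        files.append(ET.tostring(urlset, encoding="unicode"))
    return files

parts = build_sitemaps([f"https://example.com/page{i}" for i in range(5)],
                       max_urls=2)
print(len(parts))  # 3 files: 2 + 2 + 1 URLs
```

In production, each part would be saved as sitemap1.xml, sitemap2.xml, etc., and listed in a main `<sitemapindex>` file.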
The "Import URL" menu item is intended for scanning arbitrary lists of URLs, as well as Sitemap.xml files (including sitemap index files), for subsequent analysis.
Scanning arbitrary URLs is possible in three ways:
- by pasting a list of URLs from the clipboard
- by loading *.txt or *.xml files containing URL lists from the hard disk
- by downloading the Sitemap.xml file directly from the site
A feature of this mode is that when scanning arbitrary URLs, the "project" itself is not saved in the program, and its data is not added to the database. The "Site Structure" and "Dashboard" sections are also unavailable.
More information about the "Import URL" feature can be found in the SiteAnalyzer 1.9 review.
The Dashboard tab displays a detailed report on the current quality of site optimization. The report is generated based on the data of the SEO Statistics tab and also includes an overall site optimization quality score, calculated on a 100-point scale relative to the current degree of optimization. Data from the "Dashboard" tab can be exported as a handy PDF report.
For more flexible analysis of the received data, you can export it to CSV format (the currently active tab is exported) or generate a full report in Microsoft Excel with all the tabs in one file.
When exporting data to Excel, a special window is displayed in which the user can select the columns of interest and then generate the report with the required data.
The program lets you choose the preferred interface language to work in.
The main supported languages are English, German, Italian, Spanish, French, Russian, and others. At the moment, the program has been translated into more than 15 of the most popular languages.
If you want to translate the program into your own language, translate any "*.lng" file into the language of interest and send the translated file to "firstname.lastname@example.org" (comments in the letter should be written in Russian or English); your translation will be included in a new release of the program.
More detailed instructions on translating the program can be found in the distribution (file "lcids.txt").
P.S. If you have any comments on the quality of the translation – send comments and corrections to "email@example.com".
The "Compress Database" main menu item packs the database: it cleans out previously deleted projects and reorders the data (analogous to disk defragmentation on a personal computer).
This procedure is effective when, for example, a large project containing many records has been deleted from the program. In general, it is recommended to compress the database periodically to get rid of redundant data and reduce its size.
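SiteAnalyzer's storage engine is not specified here, but the effect of packing is analogous to SQLite's VACUUM, which rebuilds the database file and reclaims the space held by deleted rows. The schema below is invented purely for the demonstration:

```python
import os
import sqlite3
import tempfile

# Hypothetical crawl database with one bulky table.
db_path = os.path.join(tempfile.mkdtemp(), "projects.db")
con = sqlite3.connect(db_path)
con.execute("CREATE TABLE pages (project_id INTEGER, url TEXT)")
con.executemany("INSERT INTO pages VALUES (1, ?)", [("x" * 500,)] * 2000)
con.commit()
size_before = os.path.getsize(db_path)

con.execute("DELETE FROM pages")  # e.g. a large project was deleted
con.commit()
con.execute("VACUUM")             # rebuild the file, reclaiming free pages
con.close()
print(os.path.getsize(db_path) < size_before)  # True: the file shrank
```

Without the VACUUM step, the file would keep its old size even after the rows were deleted, which is exactly the situation "Compress Database" is meant to fix.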
The answers to the other questions can be found in the FAQ section.