Web scraping: how to parse and extract data from a website for free?

SiteAnalyzer / Blog

Comments: 0

7,385

2020-12-03 | Time to read: 14 minutes

Author: Simagin Andrey

It’s not a rare case when a webmaster, marketing expert, or SEO specialist needs to extract data from site pages and display it in a comfortable form for further processing. This could be parsing prices in an online store, getting the number of likes, or extracting reviews from resources you’re interested in.

Most technical site audit software collect only the H1 and H2 header content by default, however, if, for instance, you want to collect the H5 headers, you will have to extract them separately. They usually reach out to web scrapers to avoid the routine manual work of parsing and extracting data from the HTML pages.

Parsing and extracting data from website

Web scraping is a process of automated data extraction from the site pages following certain rules.

Web scraping application fields:

Tracking the prices of goods in online stores.
Extracting descriptions of goods and services, getting the number of goods and pictures in a data sheet.
Extracting contact information (email addresses, phone numbers, etc.).
Collecting data for marketing research (likes, shares, ratings).
Extracting specific data from the HTML-page code (searching for analytics systems, checking if there is a micro-markup).
Monitoring ads.

The basic web scraping methods are parsing methods using XPath, CSS selectors, XQuery, Regex, and HTML templates.

XPath is a query language for XML/XHTML document elements. To access elements, XPath uses DOM navigation by describing the path to the desired element on the page. Using it, you can get the value of an element by its ordinal number in the document, extract its text content or internal code, check for specific elements on the page. XPath description >>
CSS selectors are used to find an element of its part (attribute). CSS is syntactically alike to XPath, but in some cases, CSS locators show faster performance and are more descriptive and concise. The only drawback of CSS is that it only works in one direction, i.e. deeper into the document. XPath works both ways (for example, you can search for a parent element by a child). CSS and XPath Comparison table >>
XQuery is based on XPath. XQuery imitates XML, which allows you to create nested expressions in a way that is not possible in XSLT. XQuery description >>
Regex is a search language for extracting values from a set of text strings that match the required conditions (regular expression). Regex description >>
HTML templates is a language for extracting data from HTML documents, which is a combination of HTML markup aimed to describe the search template for the fragment, as well as functions and operations for extracting and transforming data. HTML templates description >>

As a rule, parsing is used to cope with tasks that are difficult to handle manually. This can be web scraping of product descriptions in a new online store, scraping in marketing research to monitor prices or ads (for instance, for apartment sales). For SEO optimization tasks, highly specialized tools are usually used with built-in parsers and all the necessary settings for extracting the main SEO parameters.

BatchURLScraper

There are many tools used for scraping (extracting data from websites), but most of them are paid and cumbersome, which limits their availability for mass use.

So, we decided to develop a simple to use and free tool called BatchURLScraper, designed specially to collect data from a URL list and adapted to export the results to Excel.

BatchURLScraper

The software interface is quite simple and has only 3 tabs:

The "URL List" tab is designed to add parsing pages and display the results of data extraction with their subsequent export.
In the "Rules" tab, you can configure scoping rules using XPath, CSS locators, XQuery, Regex, or HTML templates.
The "Settings" tab contains general program settings (number of threads, User-Agent, etc.).

Set rules of the XPath, CSSPath, XQuery, Regex

About

We also added a module for debugging rules.

Module for debugging rules XPath, CSSPath, XQuery, Regex

Using the built-in rule debugger, you can quickly and easily get the HTML content of any site page and test the operation of queries, and then use the debugged rules for parsing data in BatchURLScraper.

Let’s consider the examples of parsing configuration for different variants of data extraction.

Examples of extracting data from site pages

Since BatchURLScraper allows you to extract data from an arbitrary list of pages, which may contain URLs from different domains and, accordingly, different types of site, we will use all five scraping variants as examples to test data extraction: XPath, CSS, Regex, XQuery, and HTML templates. The list of test URLs and rule settings are located in the software package, so you can test them all this yourself using presets (preset parsing settings).

Data extraction mechanics

1. An example of scraping using XPath.

For example, in an online mobile phone store, we need to extract prices from the product card pages, as well as an indication of the item availability in stock (available or not).

To extract prices, we need to:

Go to the product card.
Highlight the price.
Right-click on it and click "Inspect".
In the window that opens, find the element responsible for the price (it will be highlighted).
Right-click on it and choose Copy > Copy XPath.

In the same way you can extract an indication of the product availability on the site.

Copy XPath from browaser

Since ordinary pages usually have the same template, it only takes to perform the operation of obtaining XPath for one such typical product page to parse the prices of the entire store.

Next, we add rules one by one in the list of program rules and paste the previously copied XPath element codes from the browser into them.

2. Then we need to determine the availability of a Google Analytics counter using Regex or XPath.

XPath:
- Open the source code of any page using Ctrl-U, then search for the text "gtm.start" in it, look for the UA -... identifier in the code, and then by using the display of the element code we copy its XPath and paste it into a new rule in BatchURLScraper.
Regex:
- Searching for a counter using regular expressions is even easier: you just need to insert the data extraction rule code ['](UA -.*?)['].

Check the presence of the Google Analytics counter

3. Extract contact email address using CSS.

It’s quite simple to do it. If there are hyperlinks like "mailto:" on the site pages, you can extract all email addresses from them.

To do this, we add a new rule, select the CSSPath in it, and insert the rule a[href^="mailto:"] into the data extraction rule code.

CSSPath, extract contact email using CSS

4. Extract values into a list or in a table using XQuery.

Unlike other selectors, XQuery allows you to use loops and other programming language features.

For example, using the FOR statement, you can get the values of all LI lists. Example:

XQuery, get values of all lists LI

Or you can figure out if there is email on the site pages:

if (count(//a[starts-with(@href, 'mailto:')])) then "Email is available" else "No email available"

5. Using HTML templates.

In this language, you can use XPath/XQuery, CSSpath, JSONiq, and regular expressions to extract data as functions.

Test table:

1	aaa	other
2	foo	columns
3	bar	are
4	xyz	here

For example, this template searches for a table with the id="t2" attribute and extracts the text from the second column of the table:

<table id="t2"><template:loop><tr><td></td><td>{text()}</td></tr></template:loop></table>

Extracting data from second line:

<table id="t2"><tr></tr><tr><template:loop><td>{text()}</td></template:loop></tr></table>

While this template calculates the sum of the numbers in the table column:

<table id="t2">{_tmp := 0}<template:loop><tr><td>{_tmp := $_tmp + .}</td></tr></template:loop>{result := $_tmp}</table>

HTML templates, the sum of the numbers in the table column

Therefore, we were able to extract almost any data from the site pages using an arbitrary list of URLs, including pages on different domains.

The following is a table with the most common rules for extracting data.

Examples of code for extracting data

In the table below, we have compiled a list of the most common data extracting options that can be retrieved using various types of extractors.

Extractor	Expression	Description
1. CSSPath	#comments>h4	Content of the ID "comments" and the H4 subtitle in it
2. CSSPath	*[src^=http://]	Finding insecure links to the HTTP protocol
3. CSSPath	a[href^=https://yourdomain.com]	Find pages that contain an absolute URL to your domain in hyperlinks
4. CSSPath	a[href^=mailto:]	Search for links containing mailto URLs:
5. CSSPath	a[itemprop=maps]:not([href=your_google_maps_url])	Search for pages that do not contain a specific URL on Google Maps
6. CSSPath	a[itemprop=maps][href=your_google_maps_url]	Search for pages containing a specific URL on Google Maps
7. CSSPath	body>noscript:has(iframe[src$=GTM-your_tracking_id]):first-child	Extract Google Tag Manager ID
8. CSSPath	head:has(link[rel=alternate][hreflang=es-es][href*=/es/])	Search for pages containing href or hreflang=/es/
9. CSSPath	img[alt*=SiteAnalyzer]	The content of the alt tag, if it contains the SiteAnalyzer text
10. CSSPath	link[rel=canonical][href^=http://]	Finding pages containing HTTP pages in canonical URLs
11. CSSPath	link[rel=stylesheet][href^=http://]	Finding unsafe links to the HTTP protocol in links to CSS style files
12. CSSPath	meta[name=description][content*=siteanalyzer]	The content of the description meta tag, if it contains the SiteAnalyzer text
13. Regex	['](GTM-.*?)[']	Extract Google Tag Manager ID 1
14. Regex	['](GTM-\w+)[']	Extract Google Tag Manager ID 2
15. Regex	['](UA-.*?)[']	Retrieving Google Analytics ID
16. Regex	[a-zA-Z0-9][a-zA-Z0-9\.+-]+\@[\w-\.]+\.\w+	Looking for Email 1
17. Regex	[a-zA-Z0-9-_.]+@[a-zA-Z0-9-]+.[a-zA-Z0-9-.]+	Looking for Email 2
18. Regex	\w+	Looking for single words
19. XPath	//*[@hreflang]	Content of all hreflang elements
20. XPath	//*[@hreflang]/@hreflang	Specific values ??of hreflang elements
21. XPath	//*[@id="our_price"]	Parsing prices 1
22. XPath	//*[@class="price"]	Price parsing 2
23. XPath	//*[@itemprop]/@itemprop	Itemprop rules
24. XPath	//*[@itemtype]/@itemtype	Structured data schema types
25. XPath	//*[contains(@class, 'watch-view-count')]"	Youtube Views
26. XPath	//*[contains(@class,'like-button-renderer-dislike-button')])[1]	Number of video dislikes on Youtube
27. XPath	//*[contains(@class,'like-button-renderer-like-button')])[1]	Number of video likes on Youtube
28. XPath	//a[contains(.,'SEO Spider')]/@href	Links that include anchor SEO Spider
29. XPath	//a[contains(@class, 'my_class')]	Retrieving pages containing a hyperlink with a specific class
30. XPath	//a[contains(@href, 'linkedin.com/in') or contains(@href, 'twitter.com/') or contains(@href, 'facebook.com/')]/@href;	Links to social networks
31. XPath	//a[contains(@href, 'site-analyzer.pro')]/@href	Links to internal pages
32. XPath	//a[contains(@href,'screamingfrog.co.uk')]	Extract links with entry (full code or anchor text)
33. XPath	//a[contains(@href,'screamingfrog.co.uk')]/@href	Extracting exactly the URL with the entry
34. XPath	//a[contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'),'seo spider')]/@href	Case sensitive search
35. XPath	//a[not(contains(@href, 'site-analyzer.pro'))]/@href	Links to external pages
36. XPath	//a[starts-with(@href, 'mailto')]	All emails on the page
37. XPath	//a[starts-with(@href, 'tel:')]	All phones on the page
38. XPath	//div[@class="example"]	Content by class
39. XPath	//div[@class="main-blog--posts_single--inside"]//a	Getting anchor text
40. XPath	//div[@class="main-blog--posts_single--inside"]//a	Link source code
41. XPath	//div[@class="main-blog--posts_single--inside"]//a/@href	URL content
42. XPath	//div[contains(@class ,'main-blog--posts_single-inner--text--inner')]//h3\|//a[@class="comments-link"]	Several rules in one expression
43. XPath	//div[contains(@class, 'rating-box')]	Parsing rating 4.5
44. XPath	//div[contains(@class, 'rating-box')]	Rating parsing
45. XPath	//div[contains(@class, 'right-text')]/span[1]	Parsing the price of a product
46. XPath	//div[contains(@class, 'video-line')]/iframe	Number of videos per page by class
47. XPath	//h1/text()	Getting h1 page
48. XPath	//h3	Content of all H3 subheadings
49. XPath	//head/link[@rel='amphtml']/@href	Looking for links to APM versions of pages
50. XPath	//iframe/@src	All URLs in IFrame Containers
51. XPath	//iframe[contains(@src ,'www.youtube.com/embed/')]	Find all URLs in IFrame that contain Youtube
52. XPath	//iframe[not(contains(@src, 'https://www.googletagmanager.com/'))]/@src	Find all URLs in IFrame that do not contain GTM
53. XPath	//meta[@name='description']/@content	Description Meta Tag Content
54. XPath	//meta[@name='robots']/@content	Getting Meta Robots values ??(Index / Noindex)
55. XPath	//meta[@name='theme-color']/@content	Content of the header color meta tag for mobile version
56. XPath	//meta[@name='viewport']/@content	Viewport Tag Content
57. XPath	//meta[starts-with(@property, 'fb:page_id')]/@content	Open Graph 1 markup content
58. XPath	//meta[starts-with(@property, 'og:title')]/@content	Open Graph 2 markup content
59. XPath	//meta[starts-with(@property, 'twitter:title')]/@content	Open Graph 3 markup content
60. XPath	//span[@class="example"]	Content by class
61. XPath	//table[@class="chars-t"]/tbody/tr[2]/td[2]	Parsing specific cells in table 1
62. XPath	//table[@class="chars-t"]/tbody/tr[4]/td[2]	Parsing specific cells in table 2
63. XPath	//title/text()	Getting the title of the page
64. XPath	/descendant::h3[1]	Content of the first listed subheading H3
65. XPath	/descendant::h3[position() >= 0 and position() <= 10]	Contents of the first 10 according to the list of H3 subheadings
66. XPath	count(//h3)	Number of subheadings H3
67. XPath	product: "(.*?)"	JSON-LD structured data 1
68. XPath	ratingValue: "(.*?)"	JSON-LD structured data 2
69. XPath	reviewCount: "(.*?)"	JSON-LD structured data 3
70. XPath	string-length(//h3)	Retrieved string length
71. XPath	//a[contains(@rel,'ugc')]/@href	UGC
72. XPath	//a[contains(@rel,'sponsored')]/@href	Sponsored
73. XQuery	if (count(//a[starts-with(@href, 'mailto:')])) then "Email is available" else "No email available"	Check if there is an Email on the page or not
74. XQuery	for $a in //li return $a	Get contents of all elements of the LI list
75. HTML templates	<table id="t2"><template:loop><tr><td></td><td>{text()}</td></tr></template:loop></table>	Search for a table with attribute id = "t2" and extract text from the second column
76. HTML templates	<table id="t2"><tr></tr><tr><template:loop><td>{text()}</td></template:loop></tr></table>	Retrieving data from the second row of a table
77. HTML templates	<table id="t2">{_tmp := 0}<template:loop><tr><td>{_tmp := $_tmp + .}</td></tr></template:loop>{result := $_tmp}</table>	Calculating the sum of numbers in a table column

Download BatchURLScraper (5 Mb) ↓

Rate this article

5/5

3

Other articles:

0 comments

You must be logged to leave a comment.

<< Back

Our Clients

Our social links

Developed by BrokenByte Software