Trying to scrape multiple data points in the same table with duplicate titles - web-scraping

I am trying to scrape multiple data points from the left-most table at the link below. I need to collect the "Total Qty:" value under each month, but I am struggling with it. I have tried getall() and a few other options; my goal is to collect the quantity for each month and write it to the CSV output under that month's name. The issue is that the list lengths change for each part, so it is difficult for me to identify the correct way to grab this data. Any help would be appreciated.
https://www.bricklink.com/v2/catalog/catalogitem_pgtab.page?idItem=115180&idColor=47&st=2&gm=1&gc=1&ei=0&prec=2&showflag=0&showbulk=0&currency=1
def parse_details(self, response):
    yield {
        'element_id': response.meta.get('element_id'),
        'USA_NEW_times_sold Month 1': response.xpath('//*[@class="pcipgOddColumn"]')[0].xpath(
            './/td[contains(text(),"Total Qty:")]/following::td//text()').get('')
    }
I have tried getall() and changing the path, but I am struggling with that portion.

When all else fails, you can always iterate through the elements to find the one you are looking for. Try locating the months, then use the following-sibling axis to search for the next row that contains "Total Qty:":
def parse(self, response):
    column = response.xpath('//td[@valign="top"][1]')
    for i in column.css('td.pcipgSubHeader'):
        month = i.xpath('./b/text()').get()
        for j in i.xpath('./../following-sibling::tr'):
            if j.xpath('./td/text()').re('Total Qty:'):
                qty = j.xpath('.//td/b/text()').get()
                yield {'month': month, 'qty': qty}
                break
With the above method I get this output:
{'month': 'February 2023', 'qty': '13'}
{'month': 'January 2023', 'qty': '8'}
{'month': 'December 2022', 'qty': '3'}
{'month': 'November 2022', 'qty': '2'}
{'month': 'October 2022', 'qty': '6'}
{'month': 'September 2022', 'qty': '10'}
{'month': 'January 2023', 'qty': '12'}
{'month': 'November 2022', 'qty': '2'}
{'month': 'October 2022', 'qty': '11'}
{'month': 'August 2022', 'qty': '2'}
{'month': 'November 2022', 'qty': '6'}
{'month': 'October 2022', 'qty': '2'}
{'month': 'January 2023', 'qty': '9'}
{'month': 'December 2022', 'qty': '2'}
{'month': 'November 2022', 'qty': '2'}
{'month': 'October 2022', 'qty': '2'}
{'month': 'September 2022', 'qty': '3'}
{'month': 'November 2022', 'qty': '3'}
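If you need a single CSV row per part, with one column per month (as described in the question), you can collect all of the month/quantity pairs into one item before yielding it. Below is a minimal sketch built on the same selectors as the answer above; the element_id field comes from the question's own code, while the column naming is an assumption:
def parse(self, response):
    # One dict per part: the CSV exporter then writes one row with a column per month.
    item = {'element_id': response.meta.get('element_id')}
    column = response.xpath('//td[@valign="top"][1]')
    for i in column.css('td.pcipgSubHeader'):
        month = i.xpath('./b/text()').get()
        for j in i.xpath('./../following-sibling::tr'):
            if j.xpath('./td/text()').re('Total Qty:'):
                # Hypothetical column naming: "<month> Total Qty"
                item[f'{month} Total Qty'] = j.xpath('.//td/b/text()').get()
                break
    yield item
Note that Scrapy's CSV exporter takes its columns from the first item it sees unless FEED_EXPORT_FIELDS is set, so if the set of months differs per part you may want to declare the expected columns explicitly.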

Related

Selecting text where each character is stored in a separate span

I am trying to scrape a code chunk from this documentation page, which hosts code in a peculiar way. The code chunk is split so that the function call and each argument have their own spans; even the parentheses and the comma have their own spans. I am at a loss trying to extract the code snippet under 'Usage' with a Scrapy spider.
Here's the code for my spider, which also scrapes the documentation text.
import scrapy
import w3lib.html

class codeSpider(scrapy.Spider):
    name = 'mycodespider'

    def start_requests(self):
        url = "https://rdrr.io/github/00mathieu/FarsExample/man/fars_map_state.html"
        yield scrapy.Request(url)

    def parse(self, response):
        docu = response.css('div#man-container p').getall()[2]
        code = response.css('pre::text').getall()
        yield {
            'docu': w3lib.html.remove_tags(docu).strip(),
            'code': code
        }
When I try to extract the text of the block using response.css('pre::text').getall(), somehow only the punctuation is returned, not the entire function call. The result also includes the example at the bottom of the page, which I'd rather avoid but don't know how to exclude.
Is there a better way to do this? I thought ::text would be perfect for this use case.
Try iterating through the pre elements and extracting the text from them individually.
import scrapy
import w3lib.html

class codeSpider(scrapy.Spider):
    name = 'mycodespider'

    def start_requests(self):
        url = "https://rdrr.io/github/00mathieu/FarsExample/man/fars_map_state.html"
        yield scrapy.Request(url)

    def parse(self, response):
        docu = response.css('div#man-container p').getall()[2]
        code = []
        for pre in response.css('pre'):
            code.append("".join(pre.css("::text").getall()))
        yield {
            'docu': w3lib.html.remove_tags(docu).strip(),
            'code': code
        }
OUTPUT:
2023-01-23 16:35:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://rdrr.io/github/00mathieu/FarsExample/man/fars_map_state.html> (referer: None) ['cached']
2023-01-23 16:35:16 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rdrr.io/github/00mathieu/FarsExample/man/fars_map_state.html>
{'docu': 'Read in csv for year and plot all the accidents in state\non a map.', 'code': [['1'], ['fars_map_state', '(', 'state.num', ',', ' ', 'year', ')', '\n'], ['1'], ['fars_map_state', '(', '1', ',', ' ', '2013', ')', '\n']]}
2023-01-23 16:35:16 [scrapy.core.engine] INFO: Closing spider (finished)
2023-01-23 16:35:16 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 336,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 29477,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 0.108742,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2023, 1, 24, 0, 35, 16, 722131),
'httpcache/hit': 1,
'item_scraped_count': 1,
'log_count/DEBUG': 4,
'log_count/INFO': 10,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
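If you also want to skip the example block at the bottom of the page, one option is to anchor on the 'Usage' heading and take only the first pre element that follows it. This is a hedged sketch; the heading tag and its exact text are assumptions about the page markup and should be verified in the browser:
# Assumption: the Usage snippet is the first <pre> after a heading whose text contains "Usage".
usage_pre = response.xpath(
    '//*[self::h2 or self::h3][contains(., "Usage")]/following-sibling::pre[1]'
)
usage_code = "".join(usage_pre.css('::text').getall())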

How is the btmgmt command used to set PHY?

When I run the command: btmgmt phy LE2MTX LE2MRX
It returns:
Could not set PHY Configuration with status 0x0d (Invalid Parameters)
btmon shows:
# MGMT Open: btmgmt
# MGMT Command: Set PHY Configuration (0x0045) plen 4
Selected PHYs: 0x1800
LE 2M TX
LE 2M RX
# MGMT Event: Command Status (0x0002) plen 3
Set PHY Configuration (0x0045)
Status: Invalid Parameters (0x0d)
# MGMT Close: btmgmt
I'm very unfamiliar with btmgmt; how do I specify that I want the LE 2M PHY whenever possible?
If I run: btmgmt phy
I get the available PHYs, which include LE2MTX and LE2MRX (the ones I am after).
Supported phys: BR1M1SLOT BR1M3SLOT BR1M5SLOT EDR2M1SLOT EDR2M3SLOT EDR2M5SLOT EDR3M1SLOT EDR3M3SLOT EDR3M5SLOT LE1MTX LE1MRX LE2MTX LE2MRX
Configurable phys: BR1M3SLOT BR1M5SLOT EDR2M1SLOT EDR2M3SLOT EDR2M5SLOT EDR3M1SLOT EDR3M3SLOT EDR3M5SLOT LE2MTX LE2MRX
Selected phys: BR1M1SLOT BR1M3SLOT BR1M5SLOT EDR2M1SLOT EDR2M3SLOT EDR2M5SLOT EDR3M1SLOT EDR3M3SLOT EDR3M5SLOT LE2MTX LE2MRX
These can also be seen in btmon:
Get PHY Configuration (0x0044) plen 12
Status: Success (0x00)
Supported PHYs: 0x1fff
BR 1M 1SLOT
BR 1M 3SLOT
BR 1M 5SLOT
EDR 2M 1SLOT
EDR 2M 3SLOT
EDR 2M 5SLOT
EDR 3M 1SLOT
EDR 3M 3SLOT
EDR 3M 5SLOT
LE 1M TX
LE 1M RX
LE 2M TX
LE 2M RX
Configurable PHYs: 0x19fe
BR 1M 3SLOT
BR 1M 5SLOT
EDR 2M 1SLOT
EDR 2M 3SLOT
EDR 2M 5SLOT
EDR 3M 1SLOT
EDR 3M 3SLOT
EDR 3M 5SLOT
LE 2M TX
LE 2M RX
Selected PHYs: 0x19ff
BR 1M 1SLOT
BR 1M 3SLOT
BR 1M 5SLOT
EDR 2M 1SLOT
EDR 2M 3SLOT
EDR 2M 5SLOT
EDR 3M 1SLOT
EDR 3M 3SLOT
EDR 3M 5SLOT
LE 2M TX
LE 2M RX

scrapy stops scraping elements that are addressed

Here are my spider code and the log I got. The problem is that the spider seems to stop scraping items somewhere in the middle of page 10 (while there are 352 pages to be scraped). When I check the XPath expressions of the remaining elements in my browser, they look the same.
Here is my spider:
# -*- coding: utf-8 -*-
import scrapy
import logging
import urllib.parse

parts = urllib.parse.urlsplit(u'http://fa.wikipedia.org/wiki/صفحهٔ_اصلی')
parts = parts._replace(path=urllib.parse.quote(parts.path.encode('utf8')))
encoded_url = parts.geturl().encode('ascii')
'https://fa.wikipedia.org/wiki/%D8%B5%D9%81%D8%AD%D9%87%D9%94_%D8%A7%D8%B5%D9%84%DB%8C'

class CriptolernSpider(scrapy.Spider):
    name = 'criptolern'
    allowed_domains = ['arzdigital.com']

    def start_requests(self):
        yield scrapy.Request(url='https://arzdigital.com',
                             callback=self.parse, dont_filter=True)

    def parse(self, response):
        posts = response.xpath("//a[@class='arz-last-post arz-row']")
        try:
            for post in posts:
                post_title = post.xpath(".//@title").get()
                post_link = post.xpath(".//@href").get()
                post_date = post.xpath(".//div[@class='arz-col-12 arz-col-md arz-last-post__link-box']/div/div[@class='arz-last-post__info']/div[@class='arz-last-post__publish-time']/time/@datetime").get()
                if post.xpath(".//div[@class='arz-col-12 arz-col-md arz-last-post__link-box']/div/div[@class='arz-last-post__info']/div[@class='arz-post__info-likes']/span[2]/text()"):
                    likes = int(post.xpath(".//div[@class='arz-col-12 arz-col-md arz-last-post__link-box']/div/div[@class='arz-last-post__info']/div[@class='arz-post__info-likes']/span[2]/text()").get())
                else:
                    likes = 0
                if post.xpath(".//div[@class='arz-col-12 arz-col-md arz-last-post__link-box']/div/div[@class='arz-last-post__info']/div[@class='arz-post__info-comment']/span[2]/text()"):
                    commnents = int(post.xpath(".//div[@class='arz-col-12 arz-col-md arz-last-post__link-box']/div/div[@class='arz-last-post__info']/div[@class='arz-post__info-comment']/span[2]/text()").get())
                else:
                    commnents = 0
                yield {
                    'post_title': post_title,
                    'post_link': post_link,
                    'post_date': post_date,
                    'likes': likes,
                    'commnents': commnents
                }
            next_page = response.xpath("//div[@class='arz-last-posts__get-more']/a[@class='arz-btn arz-btn-info arz-round arz-link-nofollow']/@href").get()
            if next_page:
                yield scrapy.Request(url=next_page, callback=self.parse, dont_filter=True)
            else:
                next_pages = response.xpath("//div[@class='arz-pagination']/ul/li[@class='arz-pagination__item arz-pagination__next']/a[@class='arz-pagination__link']/@href").get()
                if next_pages:
                    yield scrapy.Request(url=next_pages, callback=self.parse, dont_filter=True)
        except AttributeError:
            logging.error("The element didn't exist")
Here is the log, when the spider stops:
2021-12-04 11:06:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://arzdigital.com/latest-posts/page/10/>
{'post_title': 'ولادیمیر پوتین: ارزهای دیجیتال در نوع خود ارزشمند هستند', 'post_link': 'https://arzdigital.com/russias-putin-says-crypto-has-value-but-maybe-not-for-trading-oil-html/', 'post_date': '2021-10-16', 'likes': 17, 'commnents': 1}
2021-12-04 11:06:51 [scrapy.core.scraper] ERROR: Spider error processing <GET https://arzdigital.com/latest-posts/page/10/> (referer: https://arzdigital.com/latest-posts/page/9/)
Traceback (most recent call last):
File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
yield next(it)
File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
for x in result:
File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
return (_set_referer(r) for r in result or ())
File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\shima\projects\arzdigital\arzdigital\spiders\criptolern.py", line 32, in parse
likes=int(post.xpath(".//div[#class='arz-col-12 arz-col-md arz-last-post__link-box']/div/div[#class='arz-last-post__info']/div[#class='arz-post__info-likes']/span[2]/text()").get())
ValueError: invalid literal for int() with base 10: '۱,۸۵۱'
2021-12-04 11:06:51 [scrapy.core.engine] INFO: Closing spider (finished)
2021-12-04 11:06:51 [scrapy.extensions.feedexport] INFO: Stored csv feed (242 items) in: dataset.csv
2021-12-04 11:06:51 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 4112,
'downloader/request_count': 12,
'downloader/request_method_count/GET': 12,
'downloader/response_bytes': 292561,
'downloader/response_count': 12,
'downloader/response_status_count/200': 12,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 12, 4, 7, 36, 51, 830291),
'item_scraped_count': 242,
'log_count/DEBUG': 254,
'log_count/ERROR': 1,
'log_count/INFO': 10,
'request_depth_max': 10,
'response_received_count': 12,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 11,
'scheduler/dequeued/memory': 11,
'scheduler/enqueued': 11,
'scheduler/enqueued/memory': 11,
'spider_exceptions/ValueError': 1,
'start_time': datetime.datetime(2021, 12, 4, 7, 36, 47, 423017)}
2021-12-04 11:06:51 [scrapy.core.engine] INFO: Spider closed (finished)
I can't find the problem, or whether it is related to a wrong XPath expression. Thanks for any help!
EDIT:
So I guess it's better to see two files here.
The first is settings.py:
BOT_NAME = 'arzdigital'
SPIDER_MODULES = ['arzdigital.spiders']
NEWSPIDER_MODULE = 'arzdigital.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 10
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
'arzdigital.middlewares.ArzdigitalDownloaderMiddleware': None,
'arzdigital.middlewares.UserAgentRotatorMiddleware':400
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
AUTOTHROTTLE_ENABLED = True
# The initial download delay
AUTOTHROTTLE_START_DELAY = 60
# The maximum download delay to be set in case of high latencies
AUTOTHROTTLE_MAX_DELAY = 120
# The average number of requests Scrapy should be sending in parallel to
# each remote server
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
HTTPCACHE_ENABLED = False
FEED_EXPORT_ENCODING='utf-8'
And the second file is middlewares.py:
from scrapy import signals
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
import random, logging

class UserAgentRotatorMiddleware(UserAgentMiddleware):
    user_agent_list = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
        'Mozilla/5.0 (Windows NT 5.1; rv:7.0.1) Gecko/2010010 1 Firefox/7.0.1',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWeb Kit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.79 Safari/537.36 Edge/14.14393'
    ]

    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        try:
            self.user_agent = random.choice(self.user_agent_list)
            request.headers.setdefault('User-Agent', self.user_agent)
        except IndexError:
            logging.error("Couldn't fetch the user agent")
Your code works as you expect; the problem was in the pagination portion. I moved the pagination into start_urls, which is always accurate and more than twice as fast as following a "next page" link.
Code
import scrapy
import logging

# base url = https://arzdigital.com/latest-posts/
# start_url = https://arzdigital.com/latest-posts/page/2/

class CriptolernSpider(scrapy.Spider):
    name = 'criptolern'
    allowed_domains = ['arzdigital.com']
    start_urls = [f'https://arzdigital.com/latest-posts/page/{i}/' for i in range(1, 353)]

    def parse(self, response):
        posts = response.xpath("//a[@class='arz-last-post arz-row']")
        try:
            for post in posts:
                post_title = post.xpath(".//@title").get()
                post_link = post.xpath(".//@href").get()
                post_date = post.xpath(
                    ".//div[@class='arz-col-12 arz-col-md arz-last-post__link-box']/div/div[@class='arz-last-post__info']/div[@class='arz-last-post__publish-time']/time/@datetime").get()
                if post.xpath(".//div[@class='arz-col-12 arz-col-md arz-last-post__link-box']/div/div[@class='arz-last-post__info']/div[@class='arz-post__info-likes']/span[2]/text()"):
                    likes = int(post.xpath(
                        ".//div[@class='arz-col-12 arz-col-md arz-last-post__link-box']/div/div[@class='arz-last-post__info']/div[@class='arz-post__info-likes']/span[2]/text()").get())
                else:
                    likes = 0
                if post.xpath(".//div[@class='arz-col-12 arz-col-md arz-last-post__link-box']/div/div[@class='arz-last-post__info']/div[@class='arz-post__info-comment']/span[2]/text()"):
                    commnents = int(post.xpath(
                        ".//div[@class='arz-col-12 arz-col-md arz-last-post__link-box']/div/div[@class='arz-last-post__info']/div[@class='arz-post__info-comment']/span[2]/text()").get())
                else:
                    commnents = 0
                yield {
                    'post_title': post_title,
                    'post_link': post_link,
                    'post_date': post_date,
                    'likes': likes,
                    'commnents': commnents
                }
        except AttributeError:
            logging.error("The element didn't exist")
Output:
2021-12-04 17:25:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://arzdigital.com/latest-posts/page/352/>
{'post_title': 'تأکید مقام رسمی سابق وزارت دفاع آمریکا مبنی بر تشویق سرمایه گذاری بر روی بلاکچین', 'post_link': 'https://arzdigital.com/blockchain-investment/', 'post_date': '2017-07-27', 'likes': 4, 'commnents': 0}
2021-12-04 17:25:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://arzdigital.com/latest-posts/page/352/>
{'post_title': 'ریسک سرمایه گذاری از طریق ICO', 'post_link': 'https://arzdigital.com/ico-risk/', 'post_date': '2017-07-27', 'likes': 9, 'commnents': 0}
2021-12-04 17:25:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://arzdigital.com/latest-posts/page/352/>
{'post_title': '\xa0ای.سی.او چیست؟', 'post_link': 'https://arzdigital.com/what-is-ico/', 'post_date': '2017-07-27', 'likes': 7, 'commnents': 7}
2021-12-04 17:25:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://arzdigital.com/latest-posts/page/352/>
{'post_title': 'چرا\xa0فراریت بیت کوین و واحدهای مشابه آن، نسبت به سایر واحدهای پولی بیش\u200cتر است؟', 'post_link': 'https://arzdigital.com/bitcoin-currency/', 'post_date': '2017-07-27', 'likes': 6, 'commnents': 0}
2021-12-04 17:25:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://arzdigital.com/latest-posts/page/352/>
{'post_title': 'اتریوم کلاسیک Ethereum Classic چیست ؟', 'post_link': 'https://arzdigital.com/what-is-ethereum-classic/', 'post_date': '2017-07-24', 'likes': 10, 'commnents': 2}
2021-12-04 17:25:19 [scrapy.core.engine] INFO: Closing spider (finished)
2021-12-04 17:25:19 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 111431,
'downloader/request_count': 353,
'downloader/request_method_count/GET': 353,
'downloader/response_bytes': 8814416,
'downloader/response_count': 353,
'downloader/response_status_count/200': 352,
'downloader/response_status_count/301': 1,
'elapsed_time_seconds': 46.29503,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 12, 4, 11, 25, 19, 124154),
'httpcompression/response_bytes': 55545528,
'httpcompression/response_count': 352,
'item_scraped_count': 7920
.. so on
settings.py file
Make sure that in the settings.py file you change only the uncommented portion shown below, nothing else:
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 10
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'gs_spider.middlewares.GsSpiderSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'gs_spider.middlewares.GsSpiderDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# 'gs_spider.pipelines.GsSpiderPipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
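Independently of the pagination change, the ValueError in the original log came from a Persian-formatted number ('۱,۸۵۱'), which int() cannot parse because of the thousands separator. Below is a minimal sketch of a normalizing helper you could call instead of int() in the spider above; the helper name is hypothetical:
# Map Persian and Arabic-Indic digits to ASCII digits.
PERSIAN_DIGITS = str.maketrans('۰۱۲۳۴۵۶۷۸۹٠١٢٣٤٥٦٧٨٩', '01234567890123456789')

def to_int(text, default=0):
    """Turn strings such as '۱,۸۵۱' into 1851; fall back to default when empty."""
    if not text:
        return default
    cleaned = text.translate(PERSIAN_DIGITS).replace(',', '').replace('٬', '').strip()
    return int(cleaned) if cleaned.isdigit() else default
With such a helper, likes = to_int(post.xpath(...).get()) no longer raises on pages where the counter is shown with Persian digits.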

Unable to understand the ValueError: invalid literal for int() with base 10: 'تومان'

My crawler isn't working properly and I can't find the solution.
Here is the related part of my spider:
def parse(self, response):
    original_price = 0
    discounted_price = 0
    star = 0
    discounted_percent = 0
    try:
        for product in response.xpath("//ul[@class='c-listing__items js-plp-products-list']/li"):
            title = product.xpath(".//div/div[2]/div/div/a/text()").get()
            if product.xpath(".//div/div[2]/div[2]/div[1]/text()"):
                star = float(str(product.xpath(".//div/div[2]/div[2]/div[1]/text()").get()))
            if product.xpath(".//div/div[2]/div[3]/div/div/div[1]/span/text()"):
                discounted_percent = int(str(product.xpath(".//div/div[2]/div[3]/div/div/div[1]/span/text()").get().strip()).replace('٪', ''))
            if product.xpath(".//div/div[2]/div[3]/div/div/div/text()"):
                discounted_price = int(str(product.xpath(".//div/div[2]/div[3]/div/div/div/text()").get().strip()).replace(',', ''))
            if product.xpath(".//div/div[2]/div[3]/div/div/del/text()"):
                original_price = int(str(product.xpath(".//div/div[2]/div[3]/div/div/del/text()").get().strip()).replace(',', ''))
                discounted_amount = original_price - discounted_price
            else:
                original_price = print("not available")
                discounted_amount = print("not available")
            url = response.urljoin(product.xpath(".//div/div[2]/div/div/a/@href").get())
This is my log:
2020-10-21 16:49:56 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.digikala.com/search/category-book/> from <GET https://www.digikala.com/search/category-book>
2020-10-21 16:49:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.digikala.com/search/category-book/> (referer: None)
2020-10-21 16:49:57 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.digikala.com/search/category-book/> (referer: None)
Traceback (most recent call last):
File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
yield next(it)
File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
for x in result:
File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
return (_set_referer(r) for r in result or ())
File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\shima\projects\digi_allbooks\digi_allbooks\spiders\allbooks.py", line 31, in parse
discounted_percent = int(str(product.xpath(".//div/div[2]/div[3]/div/div/div[1]/span/text()").get().strip()).replace('٪', ''))
ValueError: invalid literal for int() with base 10: 'تومان'
2020-10-21 16:49:57 [scrapy.core.engine] INFO: Closing spider (finished)
2020-10-21 16:49:57 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 939,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 90506,
'downloader/response_count': 3,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/301': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 10, 21, 13, 19, 57, 630044),
'log_count/DEBUG': 3,
'log_count/ERROR': 1,
'log_count/INFO': 9,
'log_count/WARNING': 1,
'response_received_count': 2,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'spider_exceptions/ValueError': 1,
'start_time': datetime.datetime(2020, 10, 21, 13, 19, 55, 914304)}
2020-10-21 16:49:57 [scrapy.core.engine] INFO: Spider closed (finished)
I guess it says a string is being passed to the int() function, which raises the ValueError, but the XPath I'm using targets a number, not a string.
I can't work out the error, so I can't find the solution. Can someone help me out, please?
In at least one of the iterations, this line is scraping تومان instead of an integer:
discounted_percent = int(str(product.xpath(".//div/div[2]/div[3]/div/div/div[1]/span/text()").get().strip()).replace('٪', ''))
From a Google search, it seems this is a monetary unit. You need to work on your XPaths, or have the spider ignore this value, as there isn't a discount on this item.
It seems this XPath may be a better option for your purpose (I haven't checked all items, though):
product.xpath('.//div[@class="c-price__discount-oval"]/span/text()').get()
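As a more defensive alternative, you could strip everything that is not a digit before casting, so that a stray currency label such as تومان falls back to a default instead of raising. A minimal sketch; the helper name is hypothetical:
import re

def parse_number(text, default=0):
    """Keep only digit characters (drops 'تومان', '٪' and separators) before int()."""
    digits = re.sub(r'\D', '', text or '')
    return int(digits) if digits else default
You would then write discounted_percent = parse_number(product.xpath(...).get()) instead of wrapping the expression in int() directly.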

Scrapy splash not loading content

I started using Selenium a few months ago, then Scrapy. Learning from Udemy tutorials, YouTube, and Stack Overflow questions, all my scrapes were successful until I started working with this page: response.css and response.xpath didn't work, so I moved to scrapy-splash. I installed Docker and ran many tests, and I got successful responses. I have tried all the solutions I have found and it doesn't work; it doesn't even print. I tried Python 3.8 and 2.7 with scrapy-splash.
import scrapy
from scrapy_splash import SplashRequest

LUA_SCRIPT = """
function main(splash)
    splash.private_mode_enabled = false
    splash:go(splash.args.url)
    splash:wait(2)
    html = splash:html()
    splash.private_mode_enabled = true
    return html
end
"""

class MySpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["url"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url,
                                callback=self.parse,
                                endpoint='execute',
                                args={
                                    'wait': 1,
                                    "lua_source": LUA_SCRIPT})

    def parse(self, response):
        print('Result:')
        print(".breadcrumbs-link = %s" % (response.css('body').extract()))  # OUTPUT: [...HTML ELEMENTS...]
        print(".breadcrumbs-link = %s" % (response.xpath("//td[1]").extract()))
(Face python 3.8) F:\Selenium\>scrapy crawl quotes
2020-07-31 04:31:00 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: scrapysplash)
2020-07-31 04:31:00 [scrapy.utils.log] INFO: Versions: lxml 4.5.1.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.0 (tags/v3.8.0:fa919fd, Oct 14 2019, 19:21:23) [MSC v.1916 32 bit (Intel)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 2.9.2, Platform Windows-7-SP0
2020-07-31 04:31:00 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-07-31 04:31:00 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'scrapysplash',
'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage',
'NEWSPIDER_MODULE': 'scrapysplash.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['scrapysplash.spiders']}
2020-07-31 04:31:00 [scrapy.extensions.telnet] INFO: Telnet Password: f075a705cb8e0509
2020-07-31 04:31:00 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2020-07-31 04:31:00 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy_splash.SplashCookiesMiddleware',
'scrapy_splash.SplashMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-07-31 04:31:00 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-07-31 04:31:00 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-07-31 04:31:00 [scrapy.core.engine] INFO: Spider opened
2020-07-31 04:31:00 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-31 04:31:00 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-07-31 04:31:00 [py.warnings] WARNING: d:\selenium\python\lib\site-packages\scrapy_splash\request.py:41: ScrapyDeprecationWarning: Call to deprecated fu
nction to_native_str. Use to_unicode instead.
url = to_native_str(url)
2020-07-31 04:31:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://url/robots.txt> (referer: None)
2020-07-31 04:31:01 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET url>
2020-07-31 04:31:01 [scrapy.core.engine] INFO: Closing spider (finished)
2020-07-31 04:31:01 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
'downloader/exception_type_count/scrapy.exceptions.IgnoreRequest': 1,
'downloader/request_bytes': 223,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 370,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 0.355421,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 7, 31, 8, 31, 1, 228466),
'log_count/DEBUG': 2,
'log_count/INFO': 10,
'log_count/WARNING': 1,
'response_received_count': 1,
'robotstxt/forbidden': 1,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2020, 7, 31, 8, 31, 0, 873045)}
2020-07-31 04:31:01 [scrapy.core.engine] INFO: Spider closed (finished)
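The log above already shows why parse() never runs: ROBOTSTXT_OBEY is True in the overridden settings, and the request is reported as "Forbidden by robots.txt", so Scrapy drops it before Splash ever renders the page. If the site's terms allow crawling it, a minimal settings.py change would be:
# settings.py -- only if you are allowed to ignore the site's robots.txt
ROBOTSTXT_OBEY = False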
