Selecting text where each character is stored in a separate span - web-scraping

I am trying to scrape a code chunk from this documentation page, which hosts code in a peculiar way. Namely, the code chunk is split up so that the function call and each argument sit in their own spans; even the parentheses and the comma have their own span. I am at a loss trying to extract the code snippet under 'Usage' with a Scrapy spider.
Here's the code for my spider, which also scrapes the documentation text.
import scrapy
import w3lib.html

class codeSpider(scrapy.Spider):
    name = 'mycodespider'

    def start_requests(self):
        url = "https://rdrr.io/github/00mathieu/FarsExample/man/fars_map_state.html"
        yield scrapy.Request(url)

    def parse(self, response):
        docu = response.css('div#man-container p').getall()[2]
        code = response.css('pre::text').getall()
        yield {
            'docu': w3lib.html.remove_tags(docu).strip(),
            'code': code
        }
When I try to extract the text of the block with response.css('pre::text').getall(), somehow only the punctuation is returned, not the entire function call. The result also includes the example at the bottom of the page, which I'd rather avoid but don't know how to exclude.
Is there a better way to do this? I thought ::text would be perfect for this use case.

Try iterating through the pre elements and extracting the text from them individually.
import scrapy
import w3lib.html

class codeSpider(scrapy.Spider):
    name = 'mycodespider'

    def start_requests(self):
        url = "https://rdrr.io/github/00mathieu/FarsExample/man/fars_map_state.html"
        yield scrapy.Request(url)

    def parse(self, response):
        docu = response.css('div#man-container p').getall()[2]
        code = []
        for pre in response.css('pre'):
            code.append("".join(pre.css("::text").getall()))
        yield {
            'docu': w3lib.html.remove_tags(docu).strip(),
            'code': code
        }
OUTPUT:
2023-01-23 16:35:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://rdrr.io/github/00mathieu/FarsExample/man/fars_map_state.html> (referer: None) ['cached']
2023-01-23 16:35:16 [scrapy.core.scraper] DEBUG: Scraped from <200 https://rdrr.io/github/00mathieu/FarsExample/man/fars_map_state.html>
{'docu': 'Read in csv for year and plot all the accidents in state\non a map.', 'code': [['1'], ['fars_map_state', '(', 'state.num', ',', ' ', 'year', ')', '\n'], ['1'], ['fars_map_state', '(', '1', ',', ' ', '2013', ')', '\n']]}
2023-01-23 16:35:16 [scrapy.core.engine] INFO: Closing spider (finished)
2023-01-23 16:35:16 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 336,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 29477,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 0.108742,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2023, 1, 24, 0, 35, 16, 722131),
'httpcache/hit': 1,
'item_scraped_count': 1,
'log_count/DEBUG': 4,
'log_count/INFO': 10,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,

Related

Airflow failed to get task instance

I have tried to run a simple task using the Airflow BashOperator, but I keep getting stuck: my DAG never stops running, it stays green forever without success or failure, and when I check the logs I see something like this. Thanks in advance for your time and answers.
airflow-scheduler_1 | [SQL: INSERT INTO task_fail (task_id, dag_id, execution_date, start_date, end_date, duration) VALUES (%(task_id)s, %(dag_id)s, %(execution_date)s, %(start_date)s, %(end_date)s, %(duration)s) RETURNING task_fail.id]
airflow-scheduler_1 | [parameters: {'task_id': 'first_task', 'dag_id': 'LocalInjestionDag', 'execution_date': datetime.datetime(2023, 1, 20, 8, 0, tzinfo=Timezone('UTC')), 'start_date': datetime.datetime(2023, 1, 23, 3, 35, 27, 332954, tzinfo=Timezone('UTC')), 'end_date': datetime.datetime(2023, 1, 23, 3, 35, 27, 710572, tzinfo=Timezone('UTC')), 'duration': 0}]
postgres_1 | 2023-01-23 03:55:59.712 UTC [4336] ERROR: column "execution_date" of relation "task_fail" does not exist at character 41
I have tried execution_datetime, using xcom_push, creating functions with XCom, and switching to a PythonOperator, but everything falls back to the same error.

Company name extraction with bert-base-ner: easy way to know which words relate to which?

Hi, I'm trying to extract the full company name from a string description of the company with bert-base-ner. I am also open to trying other methods, but I couldn't really find one. The issue is that although it tags the orgs correctly, it tags them word by word/token, so I can't easily extract the full company name without concatenating and rebuilding it myself.
Is there an easier way or model to do this?
Here is my code:
from transformers import AutoTokenizer, AutoModelForTokenClassification
from transformers import pipeline
tokenizer = AutoTokenizer.from_pretrained("dslim/bert-base-NER")
model = AutoModelForTokenClassification.from_pretrained("dslim/bert-base-NER")
nlp = pipeline("ner", model=model, tokenizer=tokenizer)
ner_results = nlp(text1)
print(ner_results)
Here is my output for one text string:
[{'entity': 'B-ORG', 'score': 0.99965024, 'index': 1, 'word': 'Orion', 'start': 0, 'end': 5}, {'entity': 'I-ORG', 'score': 0.99945647, 'index': 2, 'word': 'Metal', 'start': 6, 'end': 11}, {'entity': 'I-ORG', 'score': 0.99943095, 'index': 3, 'word': '##s', 'start': 11, 'end': 12}, {'entity': 'I-ORG', 'score': 0.99939036, 'index': 4, 'word': 'Limited', 'start': 13, 'end': 20}, {'entity': 'B-LOC', 'score': 0.9997398, 'index': 14, 'word': 'Australia', 'start': 78, 'end': 87}]
I have faced a similar issue and solved it by using a better model, "xlm-roberta-large-finetuned-conll03-english", which is much better than the one you're using right now and will return the complete organization name rather than broken pieces. Feel free to test the code below, which extracts the full list of organizations from a document. Accept my answer by clicking the tick button if you find it useful.
from transformers import pipeline
from pdfminer.high_level import extract_text
import docx2txt
import spacy
from spacy.matcher import Matcher
import time

start = time.time()
nlp = spacy.load('en_core_web_sm')
matcher = Matcher(nlp.vocab)
model_checkpoint = "xlm-roberta-large-finetuned-conll03-english"
token_classifier = pipeline(
    "token-classification", model=model_checkpoint, aggregation_strategy="simple"
)

def text_extraction(file):
    """
    Extract text from both PDF and Word files.
    """
    if file.endswith(".pdf"):
        return extract_text(file)
    else:
        resume_text = docx2txt.process(file)
        if resume_text:
            return resume_text.replace('\t', ' ')
        return None

# Organisation names extraction
def org_name(file):
    # Extract the complete text from the document
    extracted_text = text_extraction(file)
    classifier = token_classifier(extracted_text)
    # Keep only the entities tagged as organisations ("entity_group": "ORG")
    values = [item for item in classifier if item["entity_group"] == "ORG"]
    # Collect the organisation names
    res = [sub['word'] for sub in values]
    final1 = list(set(res))             # Remove duplicates
    final = list(filter(None, final1))  # Remove empty strings
    print(final)

org_name("your file name")
end = time.time()
print("The time of execution of above program is :", round((end - start), 2))

scrapy stops scraping elements that are addressed

Here are my spider code and the log I got. The problem is that the spider seems to stop scraping items somewhere in the middle of page 10 (while there are 352 pages to be scraped). When I check the XPath expressions for the remaining elements in my browser, they look the same.
Here is my spider:
# -*- coding: utf-8 -*-
import scrapy
import logging
import urllib.parse

parts = urllib.parse.urlsplit(u'http://fa.wikipedia.org/wiki/صفحهٔ_اصلی')
parts = parts._replace(path=urllib.parse.quote(parts.path.encode('utf8')))
encoded_url = parts.geturl().encode('ascii')
'https://fa.wikipedia.org/wiki/%D8%B5%D9%81%D8%AD%D9%87%D9%94_%D8%A7%D8%B5%D9%84%DB%8C'

class CriptolernSpider(scrapy.Spider):
    name = 'criptolern'
    allowed_domains = ['arzdigital.com']

    def start_requests(self):
        yield scrapy.Request(url='https://arzdigital.com',
                             callback=self.parse, dont_filter=True)

    def parse(self, response):
        posts = response.xpath("//a[@class='arz-last-post arz-row']")
        try:
            for post in posts:
                post_title = post.xpath(".//@title").get()
                post_link = post.xpath(".//@href").get()
                post_date = post.xpath(".//div[@class='arz-col-12 arz-col-md arz-last-post__link-box']/div/div[@class='arz-last-post__info']/div[@class='arz-last-post__publish-time']/time/@datetime").get()
                if post.xpath(".//div[@class='arz-col-12 arz-col-md arz-last-post__link-box']/div/div[@class='arz-last-post__info']/div[@class='arz-post__info-likes']/span[2]/text()"):
                    likes = int(post.xpath(".//div[@class='arz-col-12 arz-col-md arz-last-post__link-box']/div/div[@class='arz-last-post__info']/div[@class='arz-post__info-likes']/span[2]/text()").get())
                else:
                    likes = 0
                if post.xpath(".//div[@class='arz-col-12 arz-col-md arz-last-post__link-box']/div/div[@class='arz-last-post__info']/div[@class='arz-post__info-comment']/span[2]/text()"):
                    commnents = int(post.xpath(".//div[@class='arz-col-12 arz-col-md arz-last-post__link-box']/div/div[@class='arz-last-post__info']/div[@class='arz-post__info-comment']/span[2]/text()").get())
                else:
                    commnents = 0
                yield {
                    'post_title': post_title,
                    'post_link': post_link,
                    'post_date': post_date,
                    'likes': likes,
                    'commnents': commnents
                }
            next_page = response.xpath("//div[@class='arz-last-posts__get-more']/a[@class='arz-btn arz-btn-info arz-round arz-link-nofollow']/@href").get()
            if next_page:
                yield scrapy.Request(url=next_page, callback=self.parse, dont_filter=True)
            else:
                next_pages = response.xpath("//div[@class='arz-pagination']/ul/li[@class='arz-pagination__item arz-pagination__next']/a[@class='arz-pagination__link']/@href").get()
                if next_pages:
                    yield scrapy.Request(url=next_pages, callback=self.parse, dont_filter=True)
        except AttributeError:
            logging.error("The element didn't exist")
Here is the log, when the spider stops:
2021-12-04 11:06:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://arzdigital.com/latest-posts/page/10/>
{'post_title': 'ولادیمیر پوتین: ارزهای دیجیتال در نوع خود ارزشمند هستند', 'post_link': 'https://arzdigital.com/russias-putin-says-crypto-has-value-but-maybe-not-for-trading-oil-html/', 'post_date': '2021-10-16', 'likes': 17, 'commnents': 1}
2021-12-04 11:06:51 [scrapy.core.scraper] ERROR: Spider error processing <GET https://arzdigital.com/latest-posts/page/10/> (referer: https://arzdigital.com/latest-posts/page/9/)
Traceback (most recent call last):
File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
yield next(it)
File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
for x in result:
File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
return (_set_referer(r) for r in result or ())
File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\shima\projects\arzdigital\arzdigital\spiders\criptolern.py", line 32, in parse
likes=int(post.xpath(".//div[@class='arz-col-12 arz-col-md arz-last-post__link-box']/div/div[@class='arz-last-post__info']/div[@class='arz-post__info-likes']/span[2]/text()").get())
ValueError: invalid literal for int() with base 10: '۱,۸۵۱'
2021-12-04 11:06:51 [scrapy.core.engine] INFO: Closing spider (finished)
2021-12-04 11:06:51 [scrapy.extensions.feedexport] INFO: Stored csv feed (242 items) in: dataset.csv
2021-12-04 11:06:51 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 4112,
'downloader/request_count': 12,
'downloader/request_method_count/GET': 12,
'downloader/response_bytes': 292561,
'downloader/response_count': 12,
'downloader/response_status_count/200': 12,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 12, 4, 7, 36, 51, 830291),
'item_scraped_count': 242,
'log_count/DEBUG': 254,
'log_count/ERROR': 1,
'log_count/INFO': 10,
'request_depth_max': 10,
'response_received_count': 12,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 11,
'scheduler/dequeued/memory': 11,
'scheduler/enqueued': 11,
'scheduler/enqueued/memory': 11,
'spider_exceptions/ValueError': 1,
'start_time': datetime.datetime(2021, 12, 4, 7, 36, 47, 423017)}
2021-12-04 11:06:51 [scrapy.core.engine] INFO: Spider closed (finished)
I can't tell whether the problem is related to a wrong XPath expression. Thanks for any help!
EDIT:
So I guess it's better to see two files here.
The first is settings.py:
BOT_NAME = 'arzdigital'
SPIDER_MODULES = ['arzdigital.spiders']
NEWSPIDER_MODULE = 'arzdigital.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 10
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
'arzdigital.middlewares.ArzdigitalDownloaderMiddleware': None,
'arzdigital.middlewares.UserAgentRotatorMiddleware':400
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
AUTOTHROTTLE_ENABLED = True
# The initial download delay
AUTOTHROTTLE_START_DELAY = 60
# The maximum download delay to be set in case of high latencies
AUTOTHROTTLE_MAX_DELAY = 120
# The average number of requests Scrapy should be sending in parallel to
# each remote server
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
HTTPCACHE_ENABLED = False
FEED_EXPORT_ENCODING='utf-8'
And the second file is middlewares.py:
from scrapy import signals
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
import random, logging

class UserAgentRotatorMiddleware(UserAgentMiddleware):
    user_agent_list = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
        'Mozilla/5.0 (Windows NT 5.1; rv:7.0.1) Gecko/2010010 1 Firefox/7.0.1',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWeb Kit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.79 Safari/537.36 Edge/14.14393'
    ]

    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        try:
            self.user_agent = random.choice(self.user_agent_list)
            request.headers.setdefault('User-Agent', self.user_agent)
        except IndexError:
            logging.error("Couldn't fetch the user agent")
Your code is working the way you expect; the problem is in the pagination portion. I've moved the pagination into start_urls, a style of pagination that is always reliable and more than twice as fast as following a "next page" link.
Code
import scrapy
import logging

# base url = https://arzdigital.com/latest-posts/
# start_url = https://arzdigital.com/latest-posts/page/2/

class CriptolernSpider(scrapy.Spider):
    name = 'criptolern'
    allowed_domains = ['arzdigital.com']
    start_urls = [f'https://arzdigital.com/latest-posts/page/{i}/' for i in range(1, 353)]

    def parse(self, response):
        posts = response.xpath("//a[@class='arz-last-post arz-row']")
        try:
            for post in posts:
                post_title = post.xpath(".//@title").get()
                post_link = post.xpath(".//@href").get()
                post_date = post.xpath(
                    ".//div[@class='arz-col-12 arz-col-md arz-last-post__link-box']/div/div[@class='arz-last-post__info']/div[@class='arz-last-post__publish-time']/time/@datetime").get()
                if post.xpath(".//div[@class='arz-col-12 arz-col-md arz-last-post__link-box']/div/div[@class='arz-last-post__info']/div[@class='arz-post__info-likes']/span[2]/text()"):
                    likes = int(post.xpath(
                        ".//div[@class='arz-col-12 arz-col-md arz-last-post__link-box']/div/div[@class='arz-last-post__info']/div[@class='arz-post__info-likes']/span[2]/text()").get())
                else:
                    likes = 0
                if post.xpath(".//div[@class='arz-col-12 arz-col-md arz-last-post__link-box']/div/div[@class='arz-last-post__info']/div[@class='arz-post__info-comment']/span[2]/text()"):
                    commnents = int(post.xpath(
                        ".//div[@class='arz-col-12 arz-col-md arz-last-post__link-box']/div/div[@class='arz-last-post__info']/div[@class='arz-post__info-comment']/span[2]/text()").get())
                else:
                    commnents = 0
                yield {
                    'post_title': post_title,
                    'post_link': post_link,
                    'post_date': post_date,
                    'likes': likes,
                    'commnents': commnents
                }
        except AttributeError:
            logging.error("The element didn't exist")
Output:
2021-12-04 17:25:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://arzdigital.com/latest-posts/page/352/>
{'post_title': 'تأکید مقام رسمی سابق وزارت دفاع آمریکا مبنی بر تشویق سرمایه گذاری بر روی بلاکچین', 'post_link': 'https://arzdigital.com/blockchain-investment/', 'post_date': '2017-07-27', 'likes': 4, 'commnents': 0}
2021-12-04 17:25:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://arzdigital.com/latest-posts/page/352/>
{'post_title': 'ریسک سرمایه گذاری از طریق ICO', 'post_link': 'https://arzdigital.com/ico-risk/', 'post_date': '2017-07-27', 'likes': 9, 'commnents': 0}
2021-12-04 17:25:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://arzdigital.com/latest-posts/page/352/>
{'post_title': '\xa0ای.سی.او چیست؟', 'post_link': 'https://arzdigital.com/what-is-ico/', 'post_date': '2017-07-27', 'likes': 7, 'commnents': 7}
2021-12-04 17:25:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://arzdigital.com/latest-posts/page/352/>
{'post_title': 'چرا\xa0فراریت بیت کوین و واحدهای مشابه آن، نسبت به سایر واحدهای پولی بیش\u200cتر است؟', 'post_link': 'https://arzdigital.com/bitcoin-currency/', 'post_date': '2017-07-27', 'likes': 6, 'commnents': 0}
2021-12-04 17:25:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://arzdigital.com/latest-posts/page/352/>
{'post_title': 'اتریوم کلاسیک Ethereum Classic چیست ؟', 'post_link': 'https://arzdigital.com/what-is-ethereum-classic/', 'post_date': '2017-07-24', 'likes': 10, 'commnents': 2}
2021-12-04 17:25:19 [scrapy.core.engine] INFO: Closing spider (finished)
2021-12-04 17:25:19 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 111431,
'downloader/request_count': 353,
'downloader/request_method_count/GET': 353,
'downloader/response_bytes': 8814416,
'downloader/response_count': 353,
'downloader/response_status_count/200': 352,
'downloader/response_status_count/301': 1,
'elapsed_time_seconds': 46.29503,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 12, 4, 11, 25, 19, 124154),
'httpcompression/response_bytes': 55545528,
'httpcompression/response_count': 352,
'item_scraped_count': 7920
.. so on
settings.py file
In the settings.py file, make sure you change only the uncommented portion and nothing else.
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 10
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'gs_spider.middlewares.GsSpiderSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'gs_spider.middlewares.GsSpiderDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# 'gs_spider.pipelines.GsSpiderPipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
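One thing this answer doesn't touch (an aside, not part of the original answer): the traceback in the question shows the spider died converting a likes count written with Persian digits and a comma ('۱,۸۵۱'), so the same ValueError can come back on later pages. A small helper along these lines, wrapped around the int() calls, is one possible guard; the function name and the fallback value of 0 are assumptions.
# Normalize Persian digits and separators before parsing, falling back to 0
# if the text still isn't numeric.
PERSIAN_DIGITS = str.maketrans('۰۱۲۳۴۵۶۷۸۹', '0123456789')

def to_int(text):
    cleaned = (text or '').strip().translate(PERSIAN_DIGITS).replace(',', '').replace('٬', '')
    try:
        return int(cleaned)
    except ValueError:
        return 0

print(to_int('۱,۸۵۱'))  # 1851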

Unable to understand the ValueError: invalid literal for int() with base 10: 'تومان'

My crawler isn't working properly and I can't figure out the solution.
Here is the related part of my spider:
def parse(self, response):
    original_price = 0
    discounted_price = 0
    star = 0
    discounted_percent = 0
    try:
        for product in response.xpath("//ul[@class='c-listing__items js-plp-products-list']/li"):
            title = product.xpath(".//div/div[2]/div/div/a/text()").get()
            if product.xpath(".//div/div[2]/div[2]/div[1]/text()"):
                star = float(str(product.xpath(".//div/div[2]/div[2]/div[1]/text()").get()))
            if product.xpath(".//div/div[2]/div[3]/div/div/div[1]/span/text()"):
                discounted_percent = int(str(product.xpath(".//div/div[2]/div[3]/div/div/div[1]/span/text()").get().strip()).replace('٪', ''))
            if product.xpath(".//div/div[2]/div[3]/div/div/div/text()"):
                discounted_price = int(str(product.xpath(".//div/div[2]/div[3]/div/div/div/text()").get().strip()).replace(',', ''))
            if product.xpath(".//div/div[2]/div[3]/div/div/del/text()"):
                original_price = int(str(product.xpath(".//div/div[2]/div[3]/div/div/del/text()").get().strip()).replace(',', ''))
                discounted_amount = original_price - discounted_price
            else:
                original_price = print("not available")
                discounted_amount = print("not available")
            url = response.urljoin(product.xpath(".//div/div[2]/div/div/a/@href").get())
This is my log:
2020-10-21 16:49:56 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (301) to <GET https://www.digikala.com/search/category-book/> from <GET https://www.digikala.com/search/category-book>
2020-10-21 16:49:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.digikala.com/search/category-book/> (referer: None)
2020-10-21 16:49:57 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.digikala.com/search/category-book/> (referer: None)
Traceback (most recent call last):
File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
yield next(it)
File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
for x in result:
File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
return (_set_referer(r) for r in result or ())
File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\shima\projects\digi_allbooks\digi_allbooks\spiders\allbooks.py", line 31, in parse
discounted_percent = int(str(product.xpath(".//div/div[2]/div[3]/div/div/div[1]/span/text()").get().strip()).replace('٪', ''))
ValueError: invalid literal for int() with base 10: 'تومان'
2020-10-21 16:49:57 [scrapy.core.engine] INFO: Closing spider (finished)
2020-10-21 16:49:57 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 939,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 90506,
'downloader/response_count': 3,
'downloader/response_status_count/200': 2,
'downloader/response_status_count/301': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 10, 21, 13, 19, 57, 630044),
'log_count/DEBUG': 3,
'log_count/ERROR': 1,
'log_count/INFO': 9,
'log_count/WARNING': 1,
'response_received_count': 2,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'spider_exceptions/ValueError': 1,
'start_time': datetime.datetime(2020, 10, 21, 13, 19, 55, 914304)}
2020-10-21 16:49:57 [scrapy.core.engine] INFO: Spider closed (finished)
I guess it says that a string was passed to int(), which raises the ValueError, but the XPath I'm using targets a number, not a string.
I can't make sense of the error, so I can't find the solution. Can someone help me out, please?
In at least one of the iterations this line is scraping تومان instead of an integer
discounted_percent = int(str(product.xpath(".//div/div[2]/div[3]/div/div/div[1]/span/text()").get().strip()).replace('٪', ''))
From a Google search it seems this is a monetary unit (toman). You need to work on your XPaths, or have the spider ignore this value, since there isn't a discount on this item.
It seems this XPath may be a better option for your intention: (I haven't checked all items though)
product.xpath(".//div[@class='c-price__discount-oval']/span/text()").get()
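To illustrate the "ignore this value" option (a sketch, not part of the original answer; the helper name is made up), the conversion can be made defensive so that non-numeric text is treated as "no discount" instead of crashing the spider:
def parse_percent(raw):
    # Strip the percent sign and thousands separator, then try to convert;
    # non-numeric text such as 'تومان' simply means there is no discount.
    try:
        return int((raw or '').strip().replace('٪', '').replace(',', ''))
    except ValueError:
        return 0

print(parse_percent('25٪'))    # 25
print(parse_percent('تومان'))  # 0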

Scrapy splash not load content

I started using Selenium a few months ago, then Scrapy. Following tutorials from Udemy, YouTube, and Stack Overflow questions, all my scrapes were successful until I started working with this page: response.css and response.xpath didn't work, so I moved to scrapy-splash. I installed Docker and ran many tests, and I got successful responses. I have tried every solution I have found and it doesn't work; it doesn't even print. I tried Python 3.8 and 2.7 with scrapy-splash.
import scrapy
from scrapy_splash import SplashRequest

LUA_SCRIPT = """
function main(splash)
    splash.private_mode_enabled = false
    splash:go(splash.args.url)
    splash:wait(2)
    html = splash:html()
    splash.private_mode_enabled = true
    return html
end
"""

class MySpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["url"]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url,
                                callback=self.parse,
                                endpoint='execute',
                                args={
                                    'wait': 1,
                                    "lua_source": LUA_SCRIPT})

    def parse(self, response):
        print('Result:')
        print(".breadcrumbs-link = %s" % (response.css('body').extract()))  # OUTPUT: [...HTML ELEMENTS...]
        print(".breadcrumbs-link = %s" % (response.xpath("//td[1]").extract()))
(Face python 3.8) F:\Selenium\>scrapy crawl quotes
2020-07-31 04:31:00 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: scrapysplash)
2020-07-31 04:31:00 [scrapy.utils.log] INFO: Versions: lxml 4.5.1.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.8.0 (tags/v3.8.0:fa919fd, Oct 14 2019, 19:21:23) [MSC v.1916 32 bit (Intel)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1g 21 Apr 2020), cryptography 2.9.2, Platform Windows-7-SP0
2020-07-31 04:31:00 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-07-31 04:31:00 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'scrapysplash',
'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
'HTTPCACHE_STORAGE': 'scrapy_splash.SplashAwareFSCacheStorage',
'NEWSPIDER_MODULE': 'scrapysplash.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['scrapysplash.spiders']}
2020-07-31 04:31:00 [scrapy.extensions.telnet] INFO: Telnet Password: f075a705cb8e0509
2020-07-31 04:31:00 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2020-07-31 04:31:00 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy_splash.SplashCookiesMiddleware',
'scrapy_splash.SplashMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-07-31 04:31:00 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-07-31 04:31:00 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-07-31 04:31:00 [scrapy.core.engine] INFO: Spider opened
2020-07-31 04:31:00 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-07-31 04:31:00 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-07-31 04:31:00 [py.warnings] WARNING: d:\selenium\python\lib\site-packages\scrapy_splash\request.py:41: ScrapyDeprecationWarning: Call to deprecated function to_native_str. Use to_unicode instead.
  url = to_native_str(url)
2020-07-31 04:31:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://url/robots.txt> (referer: None)
2020-07-31 04:31:01 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: <GET url>
2020-07-31 04:31:01 [scrapy.core.engine] INFO: Closing spider (finished)
2020-07-31 04:31:01 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
'downloader/exception_type_count/scrapy.exceptions.IgnoreRequest': 1,
'downloader/request_bytes': 223,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 370,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 0.355421,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 7, 31, 8, 31, 1, 228466),
'log_count/DEBUG': 2,
'log_count/INFO': 10,
'log_count/WARNING': 1,
'response_received_count': 1,
'robotstxt/forbidden': 1,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2020, 7, 31, 8, 31, 0, 873045)}
2020-07-31 04:31:01 [scrapy.core.engine] INFO: Spider closed (finished)
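One thing worth noting from this log (an observation, not an answer from the original thread): the only successful request is for robots.txt, and the page request itself is dropped by the RobotsTxtMiddleware ("Forbidden by robots.txt", with 'ROBOTSTXT_OBEY': True in the overridden settings), so parse() is never called and nothing prints. If scraping the site is acceptable under its policy, a settings.py change like this sketch would let the SplashRequest through:
# settings.py (sketch): stop Scrapy from honouring robots.txt for this project.
# Only do this if you are sure scraping the site is acceptable.
ROBOTSTXT_OBEY = False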
