scrapy stops scraping elements that are addressed - web-scraping
Here are my spider code and the log I got. The problem is that the spider seems to stop scraping items somewhere in the middle of page 10, although there are 352 pages to be scraped. When I check the XPath expressions of the remaining elements in my browser, they look the same as before.
Here is my spider:
# -*- coding: utf-8 -*-
import scrapy
import logging
import urllib.parse

parts = urllib.parse.urlsplit(u'http://fa.wikipedia.org/wiki/صفحهٔ_اصلی')
parts = parts._replace(path=urllib.parse.quote(parts.path.encode('utf8')))
encoded_url = parts.geturl().encode('ascii')
'https://fa.wikipedia.org/wiki/%D8%B5%D9%81%D8%AD%D9%87%D9%94_%D8%A7%D8%B5%D9%84%DB%8C'

class CriptolernSpider(scrapy.Spider):
    name = 'criptolern'
    allowed_domains = ['arzdigital.com']

    def start_requests(self):
        yield scrapy.Request(url='https://arzdigital.com',
                             callback=self.parse, dont_filter=True)

    def parse(self, response):
        posts = response.xpath("//a[@class='arz-last-post arz-row']")
        try:
            for post in posts:
                post_title = post.xpath(".//@title").get()
                post_link = post.xpath(".//@href").get()
                post_date = post.xpath(".//div[@class='arz-col-12 arz-col-md arz-last-post__link-box']/div/div[@class='arz-last-post__info']/div[@class='arz-last-post__publish-time']/time/@datetime").get()
                if post.xpath(".//div[@class='arz-col-12 arz-col-md arz-last-post__link-box']/div/div[@class='arz-last-post__info']/div[@class='arz-post__info-likes']/span[2]/text()"):
                    likes = int(post.xpath(".//div[@class='arz-col-12 arz-col-md arz-last-post__link-box']/div/div[@class='arz-last-post__info']/div[@class='arz-post__info-likes']/span[2]/text()").get())
                else:
                    likes = 0
                if post.xpath(".//div[@class='arz-col-12 arz-col-md arz-last-post__link-box']/div/div[@class='arz-last-post__info']/div[@class='arz-post__info-comment']/span[2]/text()"):
                    commnents = int(post.xpath(".//div[@class='arz-col-12 arz-col-md arz-last-post__link-box']/div/div[@class='arz-last-post__info']/div[@class='arz-post__info-comment']/span[2]/text()").get())
                else:
                    commnents = 0
                yield {
                    'post_title': post_title,
                    'post_link': post_link,
                    'post_date': post_date,
                    'likes': likes,
                    'commnents': commnents
                }
            next_page = response.xpath("//div[@class='arz-last-posts__get-more']/a[@class='arz-btn arz-btn-info arz-round arz-link-nofollow']/@href").get()
            if next_page:
                yield scrapy.Request(url=next_page, callback=self.parse, dont_filter=True)
            else:
                next_pages = response.xpath("//div[@class='arz-pagination']/ul/li[@class='arz-pagination__item arz-pagination__next']/a[@class='arz-pagination__link']/@href").get()
                if next_pages:
                    yield scrapy.Request(url=next_pages, callback=self.parse, dont_filter=True)
        except AttributeError:
            logging.error("The element didn't exist")
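As an aside, the repeated if/else blocks above can be expressed more compactly with the default argument of Scrapy's .get(); the following is only a sketch of that pattern (with a shortened, illustrative XPath), not the code that produced the log below:

# Sketch only: .get(default='0') returns '0' when the node is missing,
# replacing the separate if/else branch. The relative XPath here is
# abbreviated for illustration.
likes_text = post.xpath(".//div[@class='arz-post__info-likes']/span[2]/text()").get(default='0')
likes = int(likes_text)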
Here is the log from the point where the spider stops:
2021-12-04 11:06:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://arzdigital.com/latest-posts/page/10/>
{'post_title': 'ولادیمیر پوتین: ارزهای دیجیتال در نوع خود ارزشمند هستند', 'post_link': 'https://arzdigital.com/russias-putin-says-crypto-has-value-but-maybe-not-for-trading-oil-html/', 'post_date': '2021-10-16', 'likes': 17, 'commnents': 1}
2021-12-04 11:06:51 [scrapy.core.scraper] ERROR: Spider error processing <GET https://arzdigital.com/latest-posts/page/10/> (referer: https://arzdigital.com/latest-posts/page/9/)
Traceback (most recent call last):
File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\utils\defer.py", line 102, in iter_errback
yield next(it)
File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
for x in result:
File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
return (_set_referer(r) for r in result or ())
File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\shima\anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
return (r for r in result or () if _filter(r))
File "C:\Users\shima\projects\arzdigital\arzdigital\spiders\criptolern.py", line 32, in parse
likes=int(post.xpath(".//div[@class='arz-col-12 arz-col-md arz-last-post__link-box']/div/div[@class='arz-last-post__info']/div[@class='arz-post__info-likes']/span[2]/text()").get())
ValueError: invalid literal for int() with base 10: '۱,۸۵۱'
2021-12-04 11:06:51 [scrapy.core.engine] INFO: Closing spider (finished)
2021-12-04 11:06:51 [scrapy.extensions.feedexport] INFO: Stored csv feed (242 items) in: dataset.csv
2021-12-04 11:06:51 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 4112,
'downloader/request_count': 12,
'downloader/request_method_count/GET': 12,
'downloader/response_bytes': 292561,
'downloader/response_count': 12,
'downloader/response_status_count/200': 12,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 12, 4, 7, 36, 51, 830291),
'item_scraped_count': 242,
'log_count/DEBUG': 254,
'log_count/ERROR': 1,
'log_count/INFO': 10,
'request_depth_max': 10,
'response_received_count': 12,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 11,
'scheduler/dequeued/memory': 11,
'scheduler/enqueued': 11,
'scheduler/enqueued/memory': 11,
'spider_exceptions/ValueError': 1,
'start_time': datetime.datetime(2021, 12, 4, 7, 36, 47, 423017)}
2021-12-04 11:06:51 [scrapy.core.engine] INFO: Spider closed (finished)
I can't tell whether the problem is related to a wrong XPath expression. Thanks for any help!
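For what it's worth, the value the traceback complains about, '۱,۸۵۱', is a like count written with Persian digits and a thousands separator, which int() cannot parse as-is. Below is a minimal sketch of normalizing such a string before converting it; the helper name to_int is hypothetical and is not part of the spider above:

# Hypothetical helper, not part of the original spider: strip thousands
# separators and map Persian digits to ASCII digits before calling int().
def to_int(text, default=0):
    if not text:
        return default
    cleaned = text.strip().replace(',', '').replace('،', '')
    cleaned = cleaned.translate(str.maketrans('۰۱۲۳۴۵۶۷۸۹', '0123456789'))
    try:
        return int(cleaned)
    except ValueError:
        return default

print(to_int('۱,۸۵۱'))  # -> 1851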
EDIT:
So I guess it's better to show two more files here.
The first is settings.py:
BOT_NAME = 'arzdigital'
SPIDER_MODULES = ['arzdigital.spiders']
NEWSPIDER_MODULE = 'arzdigital.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# See also autothrottle settings and docs
DOWNLOAD_DELAY = 10
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
DOWNLOADER_MIDDLEWARES = {
    'arzdigital.middlewares.ArzdigitalDownloaderMiddleware': None,
    'arzdigital.middlewares.UserAgentRotatorMiddleware': 400
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
AUTOTHROTTLE_ENABLED = True
# The initial download delay
AUTOTHROTTLE_START_DELAY = 60
# The maximum download delay to be set in case of high latencies
AUTOTHROTTLE_MAX_DELAY = 120
# The average number of requests Scrapy should be sending in parallel to
# each remote server
AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
HTTPCACHE_ENABLED = False
FEED_EXPORT_ENCODING='utf-8'
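For context, the csv feed mentioned in the log (dataset.csv) is not configured anywhere in the files shown; one way it could have been produced (an assumption, not something stated in the post) is with Scrapy's FEEDS setting:

# Assumption only: the post does not show how the csv feed was configured.
# FEED_EXPORT_ENCODING above ensures the Persian text is written as UTF-8.
FEEDS = {
    'dataset.csv': {'format': 'csv', 'encoding': 'utf-8'},
}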
And the second file is middlewares.py:
from scrapy import signals
from scrapy.downloadermiddlewares.useragent import UserAgentMiddleware
import random, logging

class UserAgentRotatorMiddleware(UserAgentMiddleware):
    user_agent_list = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36',
        'Mozilla/5.0 (Windows NT 5.1; rv:7.0.1) Gecko/20100101 Firefox/7.0.1',
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.79 Safari/537.36 Edge/14.14393'
    ]

    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        try:
            self.user_agent = random.choice(self.user_agent_list)
            request.headers.setdefault('User-Agent', self.user_agent)
        except IndexError:
            logging.error("Couldn't fetch the user agent")
Your code is working as you expect; the problem was in the pagination portion. I've moved the pagination into start_urls, and this type of pagination is always accurate and more than twice as fast as following a "next page" link.
Code
import scrapy
import logging

# base url: https://arzdigital.com/latest-posts/
# start_url: https://arzdigital.com/latest-posts/page/2/

class CriptolernSpider(scrapy.Spider):
    name = 'criptolern'
    allowed_domains = ['arzdigital.com']
    start_urls = [f'https://arzdigital.com/latest-posts/page/{i}/' for i in range(1, 353)]

    def parse(self, response):
        posts = response.xpath("//a[@class='arz-last-post arz-row']")
        try:
            for post in posts:
                post_title = post.xpath(".//@title").get()
                post_link = post.xpath(".//@href").get()
                post_date = post.xpath(
                    ".//div[@class='arz-col-12 arz-col-md arz-last-post__link-box']/div/div[@class='arz-last-post__info']/div[@class='arz-last-post__publish-time']/time/@datetime").get()
                if post.xpath(".//div[@class='arz-col-12 arz-col-md arz-last-post__link-box']/div/div[@class='arz-last-post__info']/div[@class='arz-post__info-likes']/span[2]/text()"):
                    likes = int(post.xpath(
                        ".//div[@class='arz-col-12 arz-col-md arz-last-post__link-box']/div/div[@class='arz-last-post__info']/div[@class='arz-post__info-likes']/span[2]/text()").get())
                else:
                    likes = 0
                if post.xpath(".//div[@class='arz-col-12 arz-col-md arz-last-post__link-box']/div/div[@class='arz-last-post__info']/div[@class='arz-post__info-comment']/span[2]/text()"):
                    commnents = int(post.xpath(
                        ".//div[@class='arz-col-12 arz-col-md arz-last-post__link-box']/div/div[@class='arz-last-post__info']/div[@class='arz-post__info-comment']/span[2]/text()").get())
                else:
                    commnents = 0
                yield {
                    'post_title': post_title,
                    'post_link': post_link,
                    'post_date': post_date,
                    'likes': likes,
                    'commnents': commnents
                }
        except AttributeError:
            logging.error("The element didn't exist")
Output:
2021-12-04 17:25:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://arzdigital.com/latest-posts/page/352/>
{'post_title': 'تأکید مقام رسمی سابق وزارت دفاع آمریکا مبنی بر تشویق سرمایه گذاری بر روی بلاکچین', 'post_link': 'https://arzdigital.com/blockchain-investment/', 'post_date': '2017-07-27', 'likes': 4, 'commnents': 0}
2021-12-04 17:25:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://arzdigital.com/latest-posts/page/352/>
{'post_title': 'ریسک سرمایه گذاری از طریق ICO', 'post_link': 'https://arzdigital.com/ico-risk/', 'post_date': '2017-07-27', 'likes': 9, 'commnents': 0}
2021-12-04 17:25:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://arzdigital.com/latest-posts/page/352/>
{'post_title': '\xa0ای.سی.او چیست؟', 'post_link': 'https://arzdigital.com/what-is-ico/', 'post_date': '2017-07-27', 'likes': 7, 'commnents': 7}
2021-12-04 17:25:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://arzdigital.com/latest-posts/page/352/>
{'post_title': 'چرا\xa0فراریت بیت کوین و واحدهای مشابه آن، نسبت به سایر واحدهای پولی بیش\u200cتر است؟', 'post_link': 'https://arzdigital.com/bitcoin-currency/', 'post_date': '2017-07-27', 'likes': 6, 'commnents': 0}
2021-12-04 17:25:19 [scrapy.core.scraper] DEBUG: Scraped from <200 https://arzdigital.com/latest-posts/page/352/>
{'post_title': 'اتریوم کلاسیک Ethereum Classic چیست ؟', 'post_link': 'https://arzdigital.com/what-is-ethereum-classic/', 'post_date': '2017-07-24', 'likes': 10, 'commnents': 2}
2021-12-04 17:25:19 [scrapy.core.engine] INFO: Closing spider (finished)
2021-12-04 17:25:19 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 111431,
'downloader/request_count': 353,
'downloader/request_method_count/GET': 353,
'downloader/response_bytes': 8814416,
'downloader/response_count': 353,
'downloader/response_status_count/200': 352,
'downloader/response_status_count/301': 1,
'elapsed_time_seconds': 46.29503,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 12, 4, 11, 25, 19, 124154),
'httpcompression/response_bytes': 55545528,
'httpcompression/response_count': 352,
'item_scraped_count': 7920
.. so on
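As a side note, the same page range could also be generated lazily in start_requests rather than materialising the full start_urls list; the following is just a sketch of that variant, not the answer's tested code:

# Sketch of an equivalent approach: yield one request per listing page
# instead of building all 352 URLs in start_urls up front.
def start_requests(self):
    for i in range(1, 353):
        yield scrapy.Request(
            url=f'https://arzdigital.com/latest-posts/page/{i}/',
            callback=self.parse,
        )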
settings.py file
Please note that in the settings.py file you only need to change the uncommented portion, nothing else.
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 10
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'gs_spider.middlewares.GsSpiderSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'gs_spider.middlewares.GsSpiderDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
#ITEM_PIPELINES = {
# 'gs_spider.pipelines.GsSpiderPipeline': 300,
#}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False