I am trying to scrape the price and product ID from http://www.yhd.com. This is my spider/test.py file, but it seems to download nothing at all, and I do not know why.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from try_yhd.items import TryYhdItem

class MySpider(CrawlSpider):
    name = "yhdspider"
    allowed_domains = ["http://www.yihaodian.com.yhcdn.cn"]
    start_urls = ['http://item.yhd.com/item/11271079',
                  'http://item.yhd.com/item/2149386',
                  ]
    rules = [Rule(SgmlLinkExtractor(allow=['/item/\d+']), 'parse_torrent', follow=True),]

    def parse_items(self, response):
        hxs = HtmlXPathSelector(response)
        item = TryYhdItem()
        # find the price and product id.
        item['price'] = hxs.select("//span[@id='current_price']").extract()[0]
        item['id'] = hxs.select("//p[@class='product_id']/text()").extract()[0]
        return item
This is the output.
2014-09-22 10:18:31-0500 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2014-09-22 10:18:31-0500 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2014-09-22 10:18:32-0500 [yhdspider] DEBUG: Crawled (200) <GET http://item.yhd.com/item/11271079> (referer: None)
2014-09-22 10:18:32-0500 [yhdspider] DEBUG: Filtered offsite request to 'item.yhd.com': <GET http://item.yhd.com/item/11271079>
2014-09-22 10:18:32-0500 [yhdspider] DEBUG: Crawled (200) <GET http://item.yhd.com/item/2149386> (referer: None)
2014-09-22 10:18:32-0500 [yhdspider] INFO: Closing spider (finished)
2014-09-22 10:18:32-0500 [yhdspider] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 447,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 2,
'downloader/response_bytes': 68145,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2014, 9, 22, 15, 18, 32, 892277),
'log_count/DEBUG': 5,
'log_count/INFO': 7,
'offsite/domains': 1,
'offsite/filtered': 2,
'request_depth_max': 1,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2014, 9, 22, 15, 18, 31, 211841)}
2014-09-22 10:18:32-0500 [yhdspider] INFO: Spider closed (finished)
AFTER modification, I get the following output log. Could anyone tell me what is wrong?
You need to add item.yhd.com (or the parent domain yhd.com) to allowed_domains. The requests are being filtered as offsite by OffsiteMiddleware, which is enabled by default; the stats confirm it:
'offsite/domains': 1,
'offsite/filtered': 2,
This middleware filters out every request whose host name isn't in the spider's allowed_domains attribute. Note that allowed_domains takes bare domain names, not URLs, so an entry should not include the http:// scheme.
You have a couple of other choices. If the spider doesn't define an allowed_domains attribute, or the attribute is empty, the offsite middleware allows all requests.
If the request has the dont_filter attribute set, the offsite middleware allows the request even if its domain is not listed in allowed_domains.
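To see why the original allowed_domains entry filters everything, here is a minimal sketch of the kind of host check described above. It is an approximation written for illustration, not Scrapy's actual code: the function name is_offsite is hypothetical, and the rule assumed is that a host passes when it equals an allowed domain or is a subdomain of one.

```python
from urllib.parse import urlparse


def is_offsite(url, allowed_domains):
    """Return True if the URL's host falls outside every allowed domain.

    Hypothetical helper approximating the offsite check: a host passes when
    it equals an allowed domain or is a subdomain of one. An empty list
    allows everything.
    """
    if not allowed_domains:
        return False
    host = (urlparse(url).hostname or "").lower()
    for domain in allowed_domains:
        domain = domain.lower()
        if host == domain or host.endswith("." + domain):
            return False
    return True


url = "http://item.yhd.com/item/11271079"
# The entry from the question is a URL on another domain, so nothing matches.
print(is_offsite(url, ["http://www.yihaodian.com.yhcdn.cn"]))  # True
# A bare parent domain covers item.yhd.com and any other subdomain.
print(is_offsite(url, ["yhd.com"]))  # False
# No allowed domains at all means nothing is filtered.
print(is_offsite(url, []))  # False
```

Under these assumptions, listing yhd.com is enough to let the item pages through, which matches the advice above.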
Related
I'm new to Scrapy and I'm trying to scrape https://www.opensports.com.ar. I need some data from all products, so the idea is to get all brands (if I get all brands, I'll get all products). Each brand's URL has a number of pages (24 articles per page), so I need to work out the total number of pages for each brand and then request the pages from 1 to that total.
I'm facing a problem (or more than one!) with hrefs. This is the script:
import scrapy
from scrapy import Request
from scrapy.crawler import CrawlerProcess
from datetime import datetime
import datetime

#start_url: https://www.opensports.com.ar/marcas.html
class SolodeportesSpider(scrapy.Spider):
    name = 'solodeportes'
    start_urls = ['https://www.opensports.com.ar/marcas.html']
    custom_settings = {'FEED_URI':'opensports_' + f'{datetime.datetime.today().strftime("%d-%m-%Y-%H%M%S")}.csv', 'FEED_FORMAT': 'csv', }
    #get links of dif. brands
    def parse(self, response):
        marcas= response.css('#maincontent > div.category-view > div > div.brands-page > table > tbody td a::attr(href)').getall()
        for marca in marcas:
            yield Request(marca, self.parse_paginator)

    #get total number of pages of the brand And request all pages from 1 to total number of products
    def parse_paginator(self,response):
        total_products = int(int(response.css('#toolbar-amount > span:nth-child(3)::text').get() / 24) + 1)
        for count in range(1, total_products):
            yield Request(url=f'https://www.opensports.com.ar/{response.url}?p={count}',
                          callback=self.parse_listings)

    #Links list to click to get the articles detail
    def parse_listings(self, response):
        all_listings = response.css('a.product-item-link::attr(class)').getall()
        for url in all_listings:
            yield Request(url, self.detail_page)

    #url--Article-- Needed data
    def detail_page(self, response):
        yield {
            'Nombre_Articulo' :response.css('h1.page-title span::text').get(),
            'Precio_Articulo' : response.css('span.price::text').get(),
            'Sku_Articulo' : response.css('td[data-th="SKU"]::text').get() ,
            'Tipo_Producto': response.css('td[data-th="Disciplina"]::text').get() ,
            'Item_url': response.url
        }

process = CrawlerProcess()
process.crawl(SolodeportesSpider)
process.start()
And I'm getting this error message:
c:/Users/User/Desktop/Personal/DABRA/Scraper_opensports/opensports/opens_sp_copia_solod.py
2022-01-16 03:45:05 [scrapy.utils.log] INFO: Scrapy 2.5.1 started
(bot: scrapybot) 2022-01-16 03:45:05 [scrapy.utils.log] INFO:
Versions: lxml 4.7.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel
1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.10.1 (tags/v3.10.1:2cd268a, Dec 6 2021, 19:10:37) [MSC v.1929 64 bit
(AMD64)], pyOpenSSL 21.0.0 (OpenSSL 1.1.1m 14 Dec 2021), cryptography
36.0.1, Platform Windows-10-10.0.19042-SP0 2022-01-16 03:45:05 [scrapy.utils.log] DEBUG: Using reactor:
twisted.internet.selectreactor.SelectReactor 2022-01-16 03:45:05
[scrapy.crawler] INFO: Overridden settings: {} 2022-01-16 03:45:05
[scrapy.extensions.telnet] INFO: Telnet Password: b362a63ff2281937
2022-01-16 03:45:05 [py.warnings] WARNING:
C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-
packages\scrapy\extensions\feedexport.py:247:
ScrapyDeprecationWarning: The FEED_URI and FEED_FORMAT settings
have been deprecated in favor of the FEEDS setting. Please see
the FEEDS setting docs for more details exporter = cls(crawler)
2022-01-16 03:45:05 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats'] 2022-01-16 03:45:05
[scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats'] 2022-01-16
03:45:05 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware'] 2022-01-16 03:45:05
[scrapy.middleware] INFO: Enabled item pipelines: [] 2022-01-16
03:45:05 [scrapy.core.engine] INFO: Spider opened 2022-01-16 03:45:05
[scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min),
scraped 0 items (at 0 items/min) 2022-01-16 03:45:05
[scrapy.extensions.telnet] INFO: Telnet console listening on
127.0.0.1:6023 2022-01-16 03:45:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.opensports.com.ar/marcas.html> (referer: None)
2022-01-16 03:45:07 [scrapy.core.scraper] ERROR: Spider error
processing <GET https://www.opensports.com.ar/marcas.html> (referer:
None) Traceback (most recent call last): File
"C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\utils\defer.py",
line 120, in iter_errback
yield next(it) File "C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\utils\python.py",
line 353, in next
return next(self.data) File "C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\utils\python.py",
line 353, in next
return next(self.data) File "C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\core\spidermw.py",
line 56, in _evaluate_iterable
for r in iterable: File "C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\spidermiddlewares\offsite.py",
line 29, in process_spider_output
for x in result: File "C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\core\spidermw.py",
line 56, in _evaluate_iterable
for r in iterable: File "C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\spidermiddlewares\referer.py",
line 342, in
return (_set_referer(r) for r in result or ()) File "C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\core\spidermw.py",
line 56, in _evaluate_iterable
for r in iterable: File "C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\spidermiddlewares\urllength.py",
line 40, in
return (r for r in result or () if _filter(r)) File "C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\core\spidermw.py",
line 56, in _evaluate_iterable
for r in iterable: File "C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in
return (r for r in result or () if _filter(r)) File "C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\core\spidermw.py",
line 56, in evaluate_iterable
for r in iterable: File "c:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\opensports\opens_sp_copia_solod.py",
line 16, in parse
yield Request(marca, self.parse_paginator) File "C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\http\request_init.py",
line 25, in init
self.set_url(url) File "C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\http\request_init.py",
line 73, in _set_url
raise ValueError(f'Missing scheme in request url: {self._url}') ValueError: Missing scheme in request url: /marca/adidas.html
2022-01-16 03:45:07 [scrapy.core.engine] INFO: Closing spider
(finished) 2022-01-16 03:45:07 [scrapy.statscollectors] INFO: Dumping
Scrapy stats: {'downloader/request_bytes': 232,
'downloader/request_count': 1, 'downloader/request_method_count/GET':
1, 'downloader/response_bytes': 22711, 'downloader/response_count':
1, 'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 1.748282, 'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 1, 16, 6, 45, 7, 151772),
'httpcompression/response_bytes': 116063,
'httpcompression/response_count': 1, 'log_count/DEBUG': 1,
'log_count/ERROR': 1, 'log_count/INFO': 10, 'log_count/WARNING': 1,
'response_received_count': 1, 'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1, 'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1, 'spider_exceptions/ValueError': 1,
'start_time': datetime.datetime(2022, 1, 16, 6, 45, 5, 403490)}
First, I have a problem with the f-string URL. I don't know how to build the URL, because this selector:
marcas= response.css('#maincontent > div.category-view > div > div.brands-page > table > tbody td a::attr(href)').getall()
returns URLs of this type (I don't know whether that's OK or whether I need the https:// part):
'/marca/adidas.html'
I know that it's wrong and I couldn't find a way to fix it. Could anyone give me a hand?
Thanks in advance!
For the relative URLs you can use response.follow, or with Request just prepend the base URL.
Some other errors you have:
1. The pagination doesn't always work.
2. In the function parse_listings you select the class attribute instead of href.
3. For some reason I'm getting a 500 status for some of the URLs.
I've fixed errors #1 and #2; you need to figure out how to fix error #3.
import scrapy
from scrapy import Request
from scrapy.crawler import CrawlerProcess
from datetime import datetime
import datetime

#start_url: https://www.opensports.com.ar/marcas.html
class SolodeportesSpider(scrapy.Spider):
    name = 'solodeportes'
    start_urls = ['https://www.opensports.com.ar/marcas.html']
    custom_settings = {
        'FEED_URI': 'opensports_' + f'{datetime.datetime.today().strftime("%d-%m-%Y-%H%M%S")}.csv', 'FEED_FORMAT': 'csv',
    }

    #get links of dif. brands
    def parse(self, response):
        marcas= response.css('#maincontent > div.category-view > div > div.brands-page > table > tbody td a::attr(href)').getall()
        for marca in marcas:
            yield response.follow(url=marca, callback=self.parse_paginator)

    #get total number of pages of the brand And request all pages from 1 to total number of products
    def parse_paginator(self, response):
        yield scrapy.Request(url=response.url, callback=self.parse_listings, dont_filter=True)
        next_page = response.xpath('//a[contains(@class, "next")]/@href').get()
        if next_page:
            yield scrapy.Request(url=next_page, callback=self.parse_paginator)

    #Links list to click to get the articles detail
    def parse_listings(self, response):
        all_listings = response.css('a.product-item-link::attr(href)').getall()
        for url in all_listings:
            yield Request(url, self.detail_page)

    #url--Article-- Needed data
    def detail_page(self, response):
        yield {
            'Nombre_Articulo': response.css('h1.page-title span::text').get(),
            'Precio_Articulo': response.css('span.price::text').get(),
            'Sku_Articulo': response.css('td[data-th="SKU"]::text').get(),
            'Tipo_Producto': response.css('td[data-th="Disciplina"]::text').get(),
            'Item_url': response.url
        }
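The "Missing scheme" error comes from passing a relative href such as /marca/adidas.html straight to Request. As a small illustration of what resolving it against the page URL looks like, the sketch below uses urllib.parse.urljoin from the standard library, so it can be tried outside Scrapy. The helper name absolutize and the nike.html path are made up for the example.

```python
from urllib.parse import urljoin


def absolutize(page_url, hrefs):
    """Resolve each href against the URL of the page it was found on."""
    return [urljoin(page_url, href) for href in hrefs]


page_url = "https://www.opensports.com.ar/marcas.html"
hrefs = [
    "/marca/adidas.html",  # relative href, as returned by the selector in the question
    "https://www.opensports.com.ar/marca/nike.html",  # hypothetical absolute href
]

for url in absolutize(page_url, hrefs):
    print(url)
# https://www.opensports.com.ar/marca/adidas.html
# https://www.opensports.com.ar/marca/nike.html
```

Absolute hrefs pass through unchanged, so the same call is safe to apply to every extracted link.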
I've tried to set up Scrapy to crawl a database of technical norms and standards.
What is the problem:
I wrote a scraper and got a 200 response but no results - it scraped 0 pages:
2020-09-06 12:42:00 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: stack)
2020-09-06 12:42:00 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.5.1, w3lib 1.19.0, Twisted 18.9.0, Python 3.7.2 (tags/v3.7.2:9a3ffc0492, Dec 23 2018, 22:20:52) [MSC v.1916 32 bit (Intel)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0j 20 Nov 2018), cryptography 2.4.2, Platform Windows-10-10.0.18362-SP0
2020-09-06 12:42:00 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'stack', 'NEWSPIDER_MODULE': 'stack.spiders', 'SPIDER_MODULES': ['stack.spiders'], 'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
2020-09-06 12:42:00 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2020-09-06 12:42:01 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-09-06 12:42:01 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-09-06 12:42:01 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-09-06 12:42:01 [scrapy.core.engine] INFO: Spider opened
2020-09-06 12:42:01 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-09-06 12:42:01 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2020-09-06 12:42:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.beuth.de/de/regelwerke/vdi/vdi-richtlinien-entwuerfe> (referer: None)
2020-09-06 12:42:01 [scrapy.core.engine] INFO: Closing spider (finished)
2020-09-06 12:42:01 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 341,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 6149,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 9, 6, 10, 42, 1, 684021),
'log_count/DEBUG': 2,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2020, 9, 6, 10, 42, 1, 140686)}
2020-09-06 12:42:01 [scrapy.core.engine] INFO: Spider closed (finished)
Here are my items out of the items.py file which I want to scrape:
from scrapy.item import Item, Field

class StackItem(Item):
    title = Field()
    url = Field()
    date = Field()
    price = Field()
    subtitle = Field()
    description = Field()
This is my crawler code:
from scrapy import Spider
from scrapy.selector import Selector
from stack.items import StackItem

class StackSpider(Spider):
    name = "stack"
    allowed_domains = ["www.beuth.de"]
    start_urls = [
        "https://www.beuth.de/de/regelwerke/vdi/vdi-richtlinien-entwuerfe",
    ]

    def parse(self, response):
        elements = Selector(response).xpath('//div[@class="bwr-card__inner"]')
        for element in elements:
            item = StackItem()
            item['title'] = element.xpath('a[@class="bwr-link__label"]/text()').extract()[0]
            item['url'] = element.xpath('a[@class="bwr-card__title-link"]/@href').extract()[0]
            item['date'] = element.xpath('div[@class="bwr-type__item bwr-type__item--light"]/text()').extract()[0]
            item['price'] = element.xpath('div[@class="bwr-buybox__price-emph]/text()').extract()[0]
            item['subtitle'] = element.xpath('div[@class="bwr-card__subtitle bwr-data-dlink"]/text()').extract()[0]
            item['description'] = element.xpath('div[@class="bwr-card__text bwr-rte bwr-data-dlink"]/text()').extract()[0]
            yield item
What I tried to solve the problem:
I tried the Scrapy shell. After I set a User-Agent, the website didn't block me:
In [2]: from scrapy import Request
   ...: req = Request("https://www.beuth.de/de/regelwerke/vdi/vdi-richtlinien-entwuerfe",
   ...:               headers={"USER-AGENT" : "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 OPR/45.0.2552.888"})
   ...: fetch(req)
2020-09-06 12:48:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.beuth.de/de/regelwerke/vdi/vdi-richtlinien-entwuerfe> (referer: None)
After that I tested the selectors (see the items list above), but unfortunately I just get [] whatever XPath I use:
In [3]: response.xpath("//div[@class='bwr-buybox__price']/a/text").getall()
Out[3]: []
When I try view(response), I just get a browser window stuck in an infinite loading loop.
I tried to analyse the output, but without an error I wasn't sure where to start fixing.
I defined a User-Agent in settings.py because otherwise I got an error message about being blocked (I think the website doesn't allow crawling).
Summary:
I want to scrape the items above from the list, but I am not sure whether the problem is just my selectors, because testing in the shell returns [] every time.
Please, I need help. I am learning scraping and have been struggling to get a spider working on a website.
I get 0 items crawled every time. I have set a user agent and also set robot_txt = False in settings.py, and yet it doesn't work.
I notice that when I use the Scrapy shell I get all the details. I have checked through my code again and again but still can't find the error. Could someone please check it and tell me where I went wrong?
spider code:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.loader import ItemLoader
from batt_data.items import BattDataItem
import urllib.parse

class BatterySpider(CrawlSpider):
    name = 'battery'
    allowed_domains = ['web']
    start_urls = ['https://www.made-in-china.com/multi-search/24v%2Bbattery/F1/1.html']
    base_url = ['https://www.made-in-china.com/multi-search/24v%2Bbattery/F1/1.html']
    rules = (
        Rule(LinkExtractor(restrict_xpaths='//*[contains(@class, "nextpage")]'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = BattDataItem()
        item['description'] = response.xpath('//img[@class="J-firstLazyload"]/@alt').extract()
        item['chemistry'] = response.xpath('//li[@class="J-faketitle ellipsis"][1]/span/text()').extract()
        item['applications'] = response.xpath('//li[@class="J-faketitle ellipsis"][2]/span/text()').extract()
        item['shape'] = response.xpath('//li[@class="J-faketitle ellipsis"][4]/span/text()').extract()
        item['discharge_rate'] = response.xpath('//li[@class="J-faketitle ellipsis"][5]/span/text()').extract()
        yield item
log file:
C:\Users\Ikeen\batt_data>scrapy crawl battery
2020-08-29 21:17:27 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: batt_data)
2020-08-29 21:17:27 [scrapy.utils.log] INFO: Versions: lxml 4.1.0.0, libxml2 2.9.4, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.6.3 |Anaconda, Inc.| (default, Oct 15 2017, 03:27:45) [MSC v.1900 64 bit (AMD64)], pyOpenSSL 17.2.0 (OpenSSL 1.0.2l 25 May 2017), cryptography 2.0.3, Platform Windows-10-10.0.18362-SP0
2020-08-29 21:17:27 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-08-29 21:17:27 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'batt_data',
'NEWSPIDER_MODULE': 'batt_data.spiders',
'SPIDER_MODULES': ['batt_data.spiders'],
'USER_AGENT': 'Mozilla/5.0'}
2020-08-29 21:17:27 [scrapy.extensions.telnet] INFO: Telnet Password: 549b17173b135b6b
2020-08-29 21:17:27 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2020-08-29 21:17:28 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-08-29 21:17:28 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-08-29 21:17:28 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-08-29 21:17:28 [scrapy.core.engine] INFO: Spider opened
2020-08-29 21:17:28 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-08-29 21:17:28 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-08-29 21:17:30 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.made-in-china.com/multi-search/24v%2Bbattery/F1/1.html> (referer: None)
2020-08-29 21:17:30 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.made-in-china.com': <GET https://www.made-in-china.com/multi-search/24v%2Bbattery/F1/2.html;jsessionid=2B77F23449911847145999CD6E9B6429>
2020-08-29 21:17:30 [scrapy.core.engine] INFO: Closing spider (finished)
2020-08-29 21:17:30 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 234,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 54381,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 2.42789,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 8, 29, 20, 17, 30, 804912),
'log_count/DEBUG': 2,
'log_count/INFO': 10,
'offsite/domains': 1,
'offsite/filtered': 1,
'request_depth_max': 1,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2020, 8, 29, 20, 17, 28, 377022)}
2020-08-29 21:17:30 [scrapy.core.engine] INFO: Spider closed (finished)
2020-08-29 21:17:30 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.made-in-china.com': <GET https://www.made-in-china.com/multi-search/24v%2Bbattery/F1/2.html;jsessionid=2B77F23449911847145999CD6E9B6429>
Your request is being filtered because it doesn't belong to the allowed domains that you defined:
allowed_domains = ['web']
Use allowed_domains = ['made-in-china.com'] or remove the attribute completely.
Hello, I am trying to log in to a website via Scrapy. I'm a bit confused because, first, if I search for tokens there are two __RequestVerificationTokens on the login page. Second, when I inspect the page to find a 302 redirect on successful login, I am unable to find one.
Currently, when I run my code I get the same results regardless of whether the username and password are correct. If I pass a random string as the token, Scrapy errors out and redirects to a page-not-found error.
What do I need to do to get authenticated and redirected to the main page as if I were logging in myself?
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import FormRequest

class LoginSpider(scrapy.Spider):
    name = "login"
    allowed_domains = ["albertacannabis.org"]
    start_urls = ['https://albertacannabis.org/login/']

    def parse(self, response):
        csrf_token = response.xpath('//*[@name="__RequestVerificationToken"]/@value').extract()[0]
        yield FormRequest('https://albertacannabis.org/api/cxa/LoginExtended/LoginAglc/',
                          formdata={'__RequestVerificationToken' : csrf_token,
                                    'UserName': 'test',
                                    'Password' : 'test'},
                          callback=self.parse_after_login)

    def parse_after_login(self, response):
        if response.xpath('//a[text()="Log Out"]'):
            print 'Success'
This is what I am getting from Scrapy
2018-11-07 14:23:55 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: canna_spider)
2018-11-07 14:23:55 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.1, w3lib 1.19.0, Twisted 18.7.0, Python 2.7.15 |Anaconda, Inc.| (default, May 1 2018, 18:37:09) [MSC v.1500 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.0.2p 14 Aug 2018), cryptography 2.3.1, Platform Windows-10-10.0.17134
2018-11-07 14:23:55 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'canna_spider.spiders', 'SPIDER_MODULES': ['canna_spider.spiders'], 'BOT_NAME': 'canna_spider'}
2018-11-07 14:23:55 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2018-11-07 14:23:55 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-11-07 14:23:55 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-11-07 14:23:55 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-11-07 14:23:55 [scrapy.core.engine] INFO: Spider opened
2018-11-07 14:23:55 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-11-07 14:23:55 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2018-11-07 14:23:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://albertacannabis.org/login/> (referer: None)
2018-11-07 14:23:55 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://albertacannabis.org/api/cxa/LoginExtended/LoginAglc/> (referer: https://albertacannabis.org/login/)
2018-11-07 14:23:56 [scrapy.core.engine] INFO: Closing spider (finished)
2018-11-07 14:23:56 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 994,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 1,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 8316,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 11, 7, 19, 23, 56, 77000),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'request_depth_max': 1,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2018, 11, 7, 19, 23, 55, 547000)}
2018-11-07 14:23:56 [scrapy.core.engine] INFO: Spider closed (finished)
I was logged in, just not redirected to the home page. Here's slightly updated code:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import FormRequest
from scrapy.shell import inspect_response

class LoginSpider(scrapy.Spider):
    name = "login"
    allowed_domains = ["albertacannabis.org"]
    start_urls = ['https://albertacannabis.org/']

    def parse(self, response):
        csrf_token = response.xpath('//*[@name="__RequestVerificationToken"]/@value').extract()[0]
        yield FormRequest('https://albertacannabis.org/api/cxa/LoginExtended/LoginAglc',
                          formdata={'__RequestVerificationToken' : csrf_token,
                                    'UserName': 'test@gmail.com',
                                    'Password' : '12345678',
                                    'X-Requested-With': 'XMLHttpRequest'},
                          callback=self.parse_after_login)

    def parse_after_login(self, response):
        yield scrapy.Request('https://albertacannabis.org/',
                             callback=self.parse_home_page)

    def parse_home_page(self, response):
        if response.xpath('//a[text()="Log Out"]'):
            print('Success')
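The question mentions two __RequestVerificationToken fields on the page, and extract()[0] simply takes whichever one comes first in the document. A rough way to check which form each token belongs to is sketched below with the standard library's html.parser. The sample markup, the form actions, and the class name TokenCollector are all invented for illustration; the real page's structure may differ.

```python
from html.parser import HTMLParser


class TokenCollector(HTMLParser):
    """Collect (form action, token value) pairs for a named input field."""

    def __init__(self, field_name):
        super().__init__()
        self.field_name = field_name
        self.current_action = None  # action of the form currently open, if any
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.current_action = attrs.get("action")
        elif tag == "input" and attrs.get("name") == self.field_name:
            self.tokens.append((self.current_action, attrs.get("value")))

    def handle_endtag(self, tag):
        if tag == "form":
            self.current_action = None


# Invented sample markup: two forms, each carrying its own token.
SAMPLE_HTML = """
<form action="/search">
  <input type="hidden" name="__RequestVerificationToken" value="aaa">
</form>
<form action="/login">
  <input type="hidden" name="__RequestVerificationToken" value="bbb">
</form>
"""

collector = TokenCollector("__RequestVerificationToken")
collector.feed(SAMPLE_HTML)
for action, value in collector.tokens:
    print(action, value)
# /search aaa
# /login bbb
```

In this made-up page the first token belongs to the search form, so taking index 0 would submit the wrong one; whether that applies to the real site has to be checked against its actual HTML.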
I am trying to download images from different URLs via Scrapy. I'm new to Python and Scrapy, so maybe I'm missing something obvious. This is my first post on Stack Overflow. Help would be really appreciated!
Here are my different files:
items.py
from scrapy.item import Item, Field


class ImagesTestItem(Item):
    image_urls = Field()
    image_names = Field()
    images = Field()
    pass
settings.py:
from scrapy import log
log.msg("This is a warning", level=log.WARNING)
log.msg("This is a error", level=log.ERROR)
BOT_NAME = 'images_test'
SPIDER_MODULES = ['images_test.spiders']
NEWSPIDER_MODULE = 'images_test.spiders'
ITEM_PIPELINES = {'images_test.pipelines.images_test': 1}
IMAGES_STORE = '/Users/Coralie/Documents/scrapy/images_test/images'
DOWNLOAD_DELAY = 5
STATS_CLASS = True
spider:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.loader import XPathItemLoader
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item, Field
from scrapy.utils.response import get_base_url
import logging
from scrapy.log import ScrapyFileLogObserver

logfile = open('testlog.log', 'w')
log_observer = ScrapyFileLogObserver(logfile, level=logging.DEBUG)
log_observer.start()


class images_test(CrawlSpider):
    name = "images_test"
    allowed_domains = ['veranstaltungszentrum.bbaw.de']
    start_urls = ['http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib0%d_g.jpg' % i for i in xrange(9)]

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        sites = hxs.select()
        number = 0
        for site in sites:
            xpath = '//img/@src'
            image_urls = hxs.select('//img/@src').extract()
            item['image_urls'] = ["http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib0x_g.jpg" + x for x in image_urls]
            items.append(item)
            number = number + 1
        return item
        print item['image_urls']
pipelines.py
from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy.http import Request
from PIL import Image
from scrapy import log

log.msg("This is a warning", level=log.WARNING)
log.msg("This is a error", level=log.ERROR)
scrapy.log.ERROR


class images_test(ImagesPipeline):

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item
The log says the following:
/Library/Python/2.7/site-packages/Scrapy-0.20.2-py2.7.egg/scrapy/settings/deprecated.py:26: ScrapyDeprecationWarning: You are using the following settings which are deprecated or obsolete (ask scrapy-users@googlegroups.com for alternatives):
STATS_ENABLED: no longer supported (change STATS_CLASS instead)
warnings.warn(msg, ScrapyDeprecationWarning)
2014-01-03 11:36:48+0100 [scrapy] INFO: Scrapy 0.20.2 started (bot: images_test)
2014-01-03 11:36:48+0100 [scrapy] DEBUG: Optional features available: ssl, http11
2014-01-03 11:36:48+0100 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'images_test.spiders', 'SPIDER_MODULES': ['images_test.spiders'], 'DOWNLOAD_DELAY': 5, 'BOT_NAME': 'images_test'}
2014-01-03 11:36:48+0100 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-01-03 11:36:49+0100 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-01-03 11:36:49+0100 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-01-03 11:36:49+0100 [scrapy] WARNING: This is a warning
2014-01-03 11:36:49+0100 [scrapy] ERROR: This is a error
2014-01-03 11:36:49+0100 [scrapy] DEBUG: Enabled item pipelines: images_test
2014-01-03 11:36:49+0100 [images_test] INFO: Spider opened
2014-01-03 11:36:49+0100 [images_test] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-03 11:36:49+0100 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-01-03 11:36:49+0100 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-01-03 11:36:49+0100 [images_test] DEBUG: Crawled (404) <GET http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib00_g.jpg> (referer: None)
2014-01-03 11:36:55+0100 [images_test] DEBUG: Crawled (200) <GET http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib01_g.jpg> (referer: None)
2014-01-03 11:36:59+0100 [images_test] DEBUG: Crawled (200) <GET http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib02_g.jpg> (referer: None)
2014-01-03 11:37:05+0100 [images_test] DEBUG: Crawled (200) <GET http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib03_g.jpg> (referer: None)
2014-01-03 11:37:10+0100 [images_test] DEBUG: Crawled (200) <GET http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib04_g.jpg> (referer: None)
2014-01-03 11:37:16+0100 [images_test] DEBUG: Crawled (200) <GET http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib05_g.jpg> (referer: None)
2014-01-03 11:37:22+0100 [images_test] DEBUG: Crawled (200) <GET http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib06_g.jpg> (referer: None)
2014-01-03 11:37:29+0100 [images_test] DEBUG: Crawled (200) <GET http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib07_g.jpg> (referer: None)
2014-01-03 11:37:36+0100 [images_test] DEBUG: Crawled (200) <GET http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib08_g.jpg> (referer: None)
2014-01-03 11:37:36+0100 [images_test] INFO: Closing spider (finished)
2014-01-03 11:37:36+0100 [images_test] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2376,
'downloader/request_count': 9,
'downloader/request_method_count/GET': 9,
'downloader/response_bytes': 343660,
'downloader/response_count': 9,
'downloader/response_status_count/200': 8,
'downloader/response_status_count/404': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2014, 1, 3, 10, 37, 36, 166139),
'log_count/DEBUG': 15,
'log_count/ERROR': 1,
'log_count/INFO': 3,
'log_count/WARNING': 1,
'response_received_count': 9,
'scheduler/dequeued': 9,
'scheduler/dequeued/memory': 9,
'scheduler/enqueued': 9,
'scheduler/enqueued/memory': 9,
'start_time': datetime.datetime(2014, 1, 3, 10, 36, 49, 37947)}
2014-01-03 11:37:36+0100 [images_test] INFO: Spider closed (finished)
How come images are not getting saved? Even my print item['image_urls'] command is not being executed.
Thank you
Consider changing your spider code to the following:
from urlparse import urljoin  # Python 2; on Python 3 use urllib.parse

# inside your spider class:
start_urls = ['http://veranstaltungszentrum.bbaw.de/en/photo_gallery']

def parse(self, response):
    sel = HtmlXPathSelector(response)
    item = ImagesTestItem()
    url = 'http://veranstaltungszentrum.bbaw.de'
    # Turn every <img src> into an absolute URL for the images pipeline
    item['image_urls'] = [urljoin(url, x) for x in
                          sel.select('//img/@src').extract()]
    return item
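HtmlXPathSelector can only parse HTML documents, and it seems that you fed it images from your start_urls. Also, a CrawlSpider only calls parse_item when a rule names it as a callback; your spider defines no rules, so that method (and the print in it) never runs.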
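To illustrate why joining each `src` against a base URL is safer than prepending a fixed string, here is a minimal sketch using `urljoin` from Python 3's standard library (`urllib.parse`; on Python 2 the same function lives in `urlparse`). The `absolute_image_urls` helper and the sample `src` values are made up for illustration and are not taken from the real page.

```python
from urllib.parse import urljoin

BASE = "http://veranstaltungszentrum.bbaw.de/en/photo_gallery"


def absolute_image_urls(base, srcs):
    """Resolve each (possibly relative) src attribute against the page URL."""
    return [urljoin(base, src) for src in srcs]


# Hypothetical src values: root-relative, page-relative, and already absolute.
srcs = ["/images/leib01_g.jpg", "thumbs/leib02.jpg", "http://example.com/a.png"]
urls = absolute_image_urls(BASE, srcs)
```

Root-relative paths replace the whole path of the base, page-relative paths replace only its last segment, and absolute URLs pass through unchanged, which plain string concatenation cannot do.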
You can also try it without pipelines:
import urllib.request

def parse(self, response):
    # Extract the first image URL and make it absolute
    imageurl = response.urljoin(response.xpath("//img/@src").get())
    imagename = imageurl.split("/")[-1]
    req = urllib.request.Request(imageurl, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.100 Safari/537.36'})
    resource = urllib.request.urlopen(req)
    # The "foldername" directory must already exist
    with open("foldername/" + imagename, "wb") as output:
        output.write(resource.read())
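One detail worth hedging in the approach above: taking the last `/`-separated piece of the URL as the file name also keeps any query string. A small sketch of a more careful variant, using only the standard library, could look like this. The `image_filename` helper, its `default` parameter, and the example URLs are assumptions for illustration.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def image_filename(url, default="image"):
    """Derive a file name from the URL path, ignoring query string and fragment."""
    path = urlsplit(url).path
    name = posixpath.basename(unquote(path))
    # Fall back to a default when the path ends in "/" or is empty.
    return name or default


# Hypothetical URLs, used only to show the behaviour.
with_query = image_filename(
    "http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib01_g.jpg?size=large")
no_name = image_filename("http://example.com/")
```

The result can then be passed to `open()` in place of `imagename`.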