I'm trying to set up Scrapy to crawl a database of technical norms and standards.
What is the problem:
I wrote a spider and got a 200 response but no results; it scraped 0 items:
2020-09-06 12:42:00 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: stack)
2020-09-06 12:42:00 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.5.1, w3lib 1.19.0, Twisted 18.9.0, Python 3.7.2 (tags/v3.7.2:9a3ffc0492, Dec 23 2018, 22:20:52) [MSC v.1916 32 bit (Intel)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0j 20 Nov 2018), cryptography 2.4.2, Platform Windows-10-10.0.18362-SP0
2020-09-06 12:42:00 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'stack', 'NEWSPIDER_MODULE': 'stack.spiders', 'SPIDER_MODULES': ['stack.spiders'], 'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
2020-09-06 12:42:00 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2020-09-06 12:42:01 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-09-06 12:42:01 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-09-06 12:42:01 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-09-06 12:42:01 [scrapy.core.engine] INFO: Spider opened
2020-09-06 12:42:01 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-09-06 12:42:01 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2020-09-06 12:42:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.beuth.de/de/regelwerke/vdi/vdi-richtlinien-entwuerfe> (referer: None)
2020-09-06 12:42:01 [scrapy.core.engine] INFO: Closing spider (finished)
2020-09-06 12:42:01 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 341,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 6149,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 9, 6, 10, 42, 1, 684021),
'log_count/DEBUG': 2,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2020, 9, 6, 10, 42, 1, 140686)}
2020-09-06 12:42:01 [scrapy.core.engine] INFO: Spider closed (finished)
Here are my items out of the items.py file which I want to scrape:
from scrapy.item import Item, Field


class StackItem(Item):
    title = Field()
    url = Field()
    date = Field()
    price = Field()
    subtitle = Field()
    description = Field()
This is my crawler code:
from scrapy import Spider
from scrapy.selector import Selector
from stack.items import StackItem


class StackSpider(Spider):
    name = "stack"
    allowed_domains = ["www.beuth.de"]
    start_urls = [
        "https://www.beuth.de/de/regelwerke/vdi/vdi-richtlinien-entwuerfe",
    ]

    def parse(self, response):
        elements = Selector(response).xpath('//div[@class="bwr-card__inner"]')
        for element in elements:
            item = StackItem()
            item['title'] = element.xpath('a[@class="bwr-link__label"]/text()').extract()[0]
            item['url'] = element.xpath('a[@class="bwr-card__title-link"]/@href').extract()[0]
            item['date'] = element.xpath('div[@class="bwr-type__item bwr-type__item--light"]/text()').extract()[0]
            item['price'] = element.xpath('div[@class="bwr-buybox__price-emph"]/text()').extract()[0]
            item['subtitle'] = element.xpath('div[@class="bwr-card__subtitle bwr-data-dlink"]/text()').extract()[0]
            item['description'] = element.xpath('div[@class="bwr-card__text bwr-rte bwr-data-dlink"]/text()').extract()[0]
            yield item
What I tried to solve the problem:
I tried the Scrapy shell. Once I set a User-Agent header, the website no longer blocked me:
In [2]: from scrapy import Request
   ...: req = Request("https://www.beuth.de/de/regelwerke/vdi/vdi-richtlinien-entwuerfe",
   ...:               headers={"USER-AGENT": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 OPR/45.0.2552.888"})
   ...: fetch(req)
2020-09-06 12:48:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.beuth.de/de/regelwerke/vdi/vdi-richtlinien-entwuerfe> (referer: None)
After that I tested the selectors (see the items list above), but unfortunately I just get [] whatever XPath I use:
In [3]: response.xpath("//div[@class='bwr-buybox__price']/a/text").getall()
Out[3]: []
When I try view(response), I just get a browser window stuck in an infinite loading loop.
I tried to analyse the output, but without an error message I wasn't sure where to start fixing.
I defined a user agent in settings.py because otherwise I got an error message about blocking (I think the website doesn't allow crawling).
Summary:
I want to scrape the items above from the list, but I am not sure whether the problem is just my selectors, because testing in the shell returns [] every time.
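One way to narrow this down is to check whether the class names used in the selectors occur in the raw response body at all. The stats above report a response of only 6149 bytes, which may mean the listing is filled in by JavaScript after the page loads; that is an assumption, not something the log confirms. A minimal sketch, where `missing_markers` and the sample HTML are hypothetical and made up for illustration:

```python
def missing_markers(html, markers):
    """Return the markers (e.g. CSS class names) that do not occur in html."""
    return [marker for marker in markers if marker not in html]


# Hypothetical stand-in for response.text: a page whose content is loaded by a script.
sample_html = '<html><body><div id="app"></div><script src="app.js"></script></body></html>'

# Class names taken from the spider's XPath expressions.
wanted = ["bwr-card__inner", "bwr-link__label", "bwr-buybox__price-emph"]
absent = missing_markers(sample_html, wanted)
print(absent)
```

In the Scrapy shell the same check would be `missing_markers(response.text, wanted)`. If every marker is absent, no XPath run against the downloaded HTML can match, and the data has to come from wherever the page's scripts load it.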
Related
I'm new to Scrapy and I'm trying to scrape https://www.opensports.com.ar. I need some data from all products, so the idea is to get all brands (if I get all brands, I'll get all products). Each brand's URL has a number of pages (24 articles per page), so I need to determine the total number of pages for each brand and then request the pages from 1 to the total number of pages.
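As a rough sketch of that page arithmetic: the helper names `page_count` and `page_urls` are made up here, and the item total is assumed to be an integer already parsed from the page. Rounding up with `math.ceil` avoids requesting an extra empty page when the total is an exact multiple of 24, and `range` needs `+ 1` because its upper bound is exclusive.

```python
import math

ITEMS_PER_PAGE = 24  # page size stated in the description above


def page_count(total_items, per_page=ITEMS_PER_PAGE):
    """Number of listing pages needed to show total_items (at least 1)."""
    return max(1, math.ceil(total_items / per_page))


def page_urls(brand_url, total_items):
    """Build the ?p=1 .. ?p=N URLs for one brand (the ?p= parameter is assumed)."""
    last_page = page_count(total_items)
    return [f"{brand_url}?p={page}" for page in range(1, last_page + 1)]


print(page_count(24), page_count(25))
print(page_urls("https://www.opensports.com.ar/marca/adidas.html", 30))
```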
I'm facing a problem (or more!) with hrefs. This is the script:
import scrapy
from scrapy import Request
from scrapy.crawler import CrawlerProcess
from datetime import datetime
import datetime


# start_url: https://www.opensports.com.ar/marcas.html
class SolodeportesSpider(scrapy.Spider):
    name = 'solodeportes'
    start_urls = ['https://www.opensports.com.ar/marcas.html']
    custom_settings = {'FEED_URI': 'opensports_' + f'{datetime.datetime.today().strftime("%d-%m-%Y-%H%M%S")}.csv', 'FEED_FORMAT': 'csv', }

    # get links of dif. brands
    def parse(self, response):
        marcas = response.css('#maincontent > div.category-view > div > div.brands-page > table > tbody td a::attr(href)').getall()
        for marca in marcas:
            yield Request(marca, self.parse_paginator)

    # get total number of pages of the brand And request all pages from 1 to total number of products
    def parse_paginator(self, response):
        total_products = int(int(response.css('#toolbar-amount > span:nth-child(3)::text').get() / 24) + 1)
        for count in range(1, total_products):
            yield Request(url=f'https://www.opensports.com.ar/{response.url}?p={count}',
                          callback=self.parse_listings)

    # Links list to click to get the articles detail
    def parse_listings(self, response):
        all_listings = response.css('a.product-item-link::attr(class)').getall()
        for url in all_listings:
            yield Request(url, self.detail_page)

    # url--Article-- Needed data
    def detail_page(self, response):
        yield {
            'Nombre_Articulo': response.css('h1.page-title span::text').get(),
            'Precio_Articulo': response.css('span.price::text').get(),
            'Sku_Articulo': response.css('td[data-th="SKU"]::text').get(),
            'Tipo_Producto': response.css('td[data-th="Disciplina"]::text').get(),
            'Item_url': response.url
        }


process = CrawlerProcess()
process.crawl(SolodeportesSpider)
process.start()
And I'm getting this error message:
c:/Users/User/Desktop/Personal/DABRA/Scraper_opensports/opensports/opens_sp_copia_solod.py
2022-01-16 03:45:05 [scrapy.utils.log] INFO: Scrapy 2.5.1 started (bot: scrapybot)
2022-01-16 03:45:05 [scrapy.utils.log] INFO: Versions: lxml 4.7.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.10.1 (tags/v3.10.1:2cd268a, Dec 6 2021, 19:10:37) [MSC v.1929 64 bit (AMD64)], pyOpenSSL 21.0.0 (OpenSSL 1.1.1m 14 Dec 2021), cryptography 36.0.1, Platform Windows-10-10.0.19042-SP0
2022-01-16 03:45:05 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2022-01-16 03:45:05 [scrapy.crawler] INFO: Overridden settings: {}
2022-01-16 03:45:05 [scrapy.extensions.telnet] INFO: Telnet Password: b362a63ff2281937
2022-01-16 03:45:05 [py.warnings] WARNING: C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\extensions\feedexport.py:247: ScrapyDeprecationWarning: The FEED_URI and FEED_FORMAT settings have been deprecated in favor of the FEEDS setting. Please see the FEEDS setting docs for more details
  exporter = cls(crawler)
2022-01-16 03:45:05 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2022-01-16 03:45:05 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-01-16 03:45:05 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-01-16 03:45:05 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-01-16 03:45:05 [scrapy.core.engine] INFO: Spider opened
2022-01-16 03:45:05 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-01-16 03:45:05 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-01-16 03:45:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.opensports.com.ar/marcas.html> (referer: None)
2022-01-16 03:45:07 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.opensports.com.ar/marcas.html> (referer: None)
Traceback (most recent call last):
  File "C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\utils\defer.py", line 120, in iter_errback
    yield next(it)
  File "C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\utils\python.py", line 353, in __next__
    return next(self.data)
  File "C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\utils\python.py", line 353, in __next__
    return next(self.data)
  File "C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\core\spidermw.py", line 56, in _evaluate_iterable
    for r in iterable:
  File "C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\spidermiddlewares\offsite.py", line 29, in process_spider_output
    for x in result:
  File "C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\core\spidermw.py", line 56, in _evaluate_iterable
    for r in iterable:
  File "C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\spidermiddlewares\referer.py", line 342, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\core\spidermw.py", line 56, in _evaluate_iterable
    for r in iterable:
  File "C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\spidermiddlewares\urllength.py", line 40, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\core\spidermw.py", line 56, in _evaluate_iterable
    for r in iterable:
  File "C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\core\spidermw.py", line 56, in _evaluate_iterable
    for r in iterable:
  File "c:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\opensports\opens_sp_copia_solod.py", line 16, in parse
    yield Request(marca, self.parse_paginator)
  File "C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\http\request\__init__.py", line 25, in __init__
    self._set_url(url)
  File "C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\http\request\__init__.py", line 73, in _set_url
    raise ValueError(f'Missing scheme in request url: {self._url}')
ValueError: Missing scheme in request url: /marca/adidas.html
2022-01-16 03:45:07 [scrapy.core.engine] INFO: Closing spider (finished)
2022-01-16 03:45:07 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 232,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 22711,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 1.748282,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 1, 16, 6, 45, 7, 151772),
'httpcompression/response_bytes': 116063,
'httpcompression/response_count': 1,
'log_count/DEBUG': 1,
'log_count/ERROR': 1,
'log_count/INFO': 10,
'log_count/WARNING': 1,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'spider_exceptions/ValueError': 1,
'start_time': datetime.datetime(2022, 1, 16, 6, 45, 5, 403490)}
First, I have a problem with the f-string URL. I don't know how to build the full URL, because from this line:
marcas= response.css('#maincontent > div.category-view > div > div.brands-page > table > tbody td a::attr(href)').getall()
I get this type of url (I don't know if it's ok or I need the https:// part):
'/marca/adidas.html'
I know that it's wrong, and I couldn't find a way to fix it. Could anyone give me a hand?
Thanks in advance!
For the relative URLs you can use response.follow, or with Request just prepend the base URL.
Some other errors you have:
1. The pagination doesn't always work.
2. In the function parse_listings you select the class attribute instead of href.
3. For some reason I'm getting a 500 status for some of the URLs.
I've fixed errors #1 and #2; you need to figure out how to fix error #3.
import scrapy
from scrapy import Request
from scrapy.crawler import CrawlerProcess
from datetime import datetime
import datetime


# start_url: https://www.opensports.com.ar/marcas.html
class SolodeportesSpider(scrapy.Spider):
    name = 'solodeportes'
    start_urls = ['https://www.opensports.com.ar/marcas.html']
    custom_settings = {
        'FEED_URI': 'opensports_' + f'{datetime.datetime.today().strftime("%d-%m-%Y-%H%M%S")}.csv', 'FEED_FORMAT': 'csv',
    }

    # get links of dif. brands
    def parse(self, response):
        marcas = response.css('#maincontent > div.category-view > div > div.brands-page > table > tbody td a::attr(href)').getall()
        for marca in marcas:
            yield response.follow(url=marca, callback=self.parse_paginator)

    # get total number of pages of the brand And request all pages from 1 to total number of products
    def parse_paginator(self, response):
        yield scrapy.Request(url=response.url, callback=self.parse_listings, dont_filter=True)
        next_page = response.xpath('//a[contains(@class, "next")]/@href').get()
        if next_page:
            yield scrapy.Request(url=next_page, callback=self.parse_paginator)

    # Links list to click to get the articles detail
    def parse_listings(self, response):
        all_listings = response.css('a.product-item-link::attr(href)').getall()
        for url in all_listings:
            yield Request(url, self.detail_page)

    # url--Article-- Needed data
    def detail_page(self, response):
        yield {
            'Nombre_Articulo': response.css('h1.page-title span::text').get(),
            'Precio_Articulo': response.css('span.price::text').get(),
            'Sku_Articulo': response.css('td[data-th="SKU"]::text').get(),
            'Tipo_Producto': response.css('td[data-th="Disciplina"]::text').get(),
            'Item_url': response.url
        }
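To see what resolving a relative href involves, here is a small standard-library sketch. As far as I know, response.follow and response.urljoin resolve hrefs against the page's URL in a similar way; the snippet only uses `urllib.parse.urljoin`, so it runs without Scrapy. The two URLs are the ones from the question and its error message.

```python
from urllib.parse import urljoin

# URL of the page that was crawled, and the relative href found on it.
page_url = "https://www.opensports.com.ar/marcas.html"
relative_href = "/marca/adidas.html"

# Joining gives the absolute URL that Request() needs (scheme included).
absolute_url = urljoin(page_url, relative_href)
print(absolute_url)  # https://www.opensports.com.ar/marca/adidas.html

# An href that is already absolute comes back unchanged.
print(urljoin(page_url, absolute_url))
```

This is why passing '/marca/adidas.html' straight to Request raises "Missing scheme in request url", while the joined form does not.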
The code for items.py and the other files is given below, and the logs are at the end. I am not getting any errors, but according to the logs Scrapy has not scraped any items.
```
import scrapy
class YelpItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()
    name_url = scrapy.Field()
    rating = scrapy.Field()
    date = scrapy.Field()
    review_text = scrapy.Field()
    user_pic = scrapy.Field()
    city = scrapy.Field()
    is_true = scrapy.Field()
```
code for settings.py
import pathlib
BOT_NAME = 'yelp-scrapy-dev'
SPIDER_MODULES = ['yelp-scrapy-dev.spiders']
NEWSPIDER_MODULE = 'yelp-scrapy-dev.spiders'
{
    pathlib.Path('output1.csv'): {
        'format': 'csv',
    },
}
ROBOTSTXT_OBEY = False
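Note that the dictionary in this settings.py is not assigned to any name, so Python evaluates it and throws it away; that fits the log further down, where no feed export extension is listed. A sketch, assuming it was meant to be Scrapy's FEEDS setting (available since Scrapy 2.1):

```python
import pathlib

# Assumption: the unnamed dictionary was intended as the FEEDS setting.
FEEDS = {
    pathlib.Path('output1.csv'): {
        'format': 'csv',
    },
}
```

This only controls where yielded items are written. It would not by itself make the spider yield any items.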
code for pipelines.py
class YelpPipeline:
    def open_spider(self, spider):
        self.file = open('output1.csv', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        return item
code for middlewares.py
from scrapy import signals

# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter


class YelpSpiderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the spider middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_spider_input(self, response, spider):
        # Called for each response that goes through the spider
        # middleware and into the spider.

        # Should return None or raise an exception.
        return None

    def process_spider_output(self, response, result, spider):
        # Called with the results returned from the Spider, after
        # it has processed the response.

        # Must return an iterable of Request, or item objects.
        for i in result:
            yield i

    def process_spider_exception(self, response, exception, spider):
        # Called when a spider or process_spider_input() method
        # (from other spider middleware) raises an exception.

        # Should return either None or an iterable of Request or item objects.
        pass

    def process_start_requests(self, start_requests, spider):
        # Called with the start requests of the spider, and works
        # similarly to the process_spider_output() method, except
        # that it doesn’t have a response associated.

        # Must return only requests (not items).
        for r in start_requests:
            yield r

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)


class YelpDownloaderMiddleware:
    # Not all methods need to be defined. If a method is not defined,
    # scrapy acts as if the downloader middleware does not modify the
    # passed objects.

    @classmethod
    def from_crawler(cls, crawler):
        # This method is used by Scrapy to create your spiders.
        s = cls()
        crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
        return s

    def process_request(self, request, spider):
        # Called for each request that goes through the downloader
        # middleware.

        # Must either:
        # - return None: continue processing this request
        # - or return a Response object
        # - or return a Request object
        # - or raise IgnoreRequest: process_exception() methods of
        #   installed downloader middleware will be called
        return None

    def process_response(self, request, response, spider):
        # Called with the response returned from the downloader.

        # Must either;
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        return response

    def process_exception(self, request, exception, spider):
        # Called when a download handler or a process_request()
        # (from other downloader middleware) raises an exception.

        # Must either:
        # - return None: continue processing this exception
        # - return a Response object: stops process_exception() chain
        # - return a Request object: stops process_exception() chain
        pass

    def spider_opened(self, spider):
        spider.logger.info('Spider opened: %s' % spider.name)
code for the city spider. The spider collects the reviews from the specified URLs
import scrapy
from ..items import YelpItem


# currently will grab the first 100 reviews from the first 10 businesses from start url
class CitySpider(scrapy.Spider):
    name = 'city'
    start_urls = [
        'https://www.yelp.com/search?find_desc=&find_loc=Seattle%2C+WA',
        'https://www.yelp.com/search?find_desc=&find_loc=SanFrancisco%2C+CA',
        'https://www.yelp.com/search?find_desc=&find_loc=NewYork%2C+NY',
        'https://www.yelp.com/search?find_desc=&find_loc=Dallas%2C+TX',
        'https://www.yelp.com/search?find_desc=&find_loc=Atlanta%2C+GA',
    ]

    # gets the first 10 businesses from the start url
    def parse(self, response):
        business_pages = response.css('.text-weight--bold__373c0__1elNz a')
        yield from response.follow_all(business_pages, self.parse_business)

    # extracts the first 100 reviews from the yelp-scrapy-dev business
    def parse_business(self, response):
        items = YelpItem()
        all_reviews = response.css('.sidebarActionsHoverTarget__373c0__2kfhE')
        address = response.request.url.split('?')
        src = address[0].split('/')
        biz = src[-1].split('-')
        loc = biz[-1] if not biz[-1].isdigit() else biz[-2]
        if loc == 'seattle':
            city = 'Seattle, WA'
        elif loc == 'dallas':
            city = 'Dallas, TX'
        elif loc == 'francisco':
            city = 'San Francisco, CA'
        elif loc == 'york':
            city = 'New York, NY'
        elif loc == 'atlanta':
            city = 'Atlanta, GA'
        else:
            city = 'outofrange'
        for review in all_reviews:
            name = review.css('.link-size--inherit__373c0__1VFlE::text').extract_first()
            name_url = review.css('.link-size--inherit__373c0__1VFlE::attr(href)').extract_first().split('=')
            rating = review.css('.overflow--hidden__373c0__2y4YK::attr(aria-label)').extract()
            date = review.css('.arrange-unit-fill__373c0__3Sfw1 .text-color--mid__373c0__jCeOG::text').extract()
            review_text = review.css('.raw__373c0__3rKqk::text').extract()
            user_pic = review.css('.gutter-1__373c0__2l5bx .photo-box-img__373c0__35y5v::attr(src)').extract()
            if city != 'outofrange':
                # making sure data is stored as a str
                items['name'] = name
                items['name_url'] = name_url[1]
                items['rating'] = rating[0]
                items['date'] = date[0]
                items['review_text'] = review_text[0]
                items['user_pic'] = user_pic[0] != 'https://s3-media0.fl.yelpcdn.com/assets/srv0/yelp_styleguide/514f6997a318/assets/img/default_avatars/user_60_square.png'
                items['city'] = city
                items['is_true'] = True
                yield items
        source = response.request.url
        # prevent duplicate secondary pages from being recrawled
        if '?start=' not in source:
            # gets 20th-100th reviews, pages are every 20 reviews
            for i in range(1, 5):
                next_page = source + '?start=' + str(i*20)
                yield response.follow(next_page, callback=self.parse_business)
And below are the log lines.
(venv) C:\Users\somar\yelp-scrapy\yelp>scrapy crawl city
2020-10-09 22:34:53 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: yelp-scrapy-dev)
2020-10-09 22:34:53 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7.6 (default, Jan 8 2020, 20:23:39) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h 22 Sep 2020), cryptography 3.1.1, Platform Windows-10-10.0.18362-SP0
2020-10-09 22:34:53 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-10-09 22:34:53 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'yelp-scrapy-dev',
'NEWSPIDER_MODULE': 'yelp-scrapy-dev.spiders',
'SPIDER_MODULES': ['yelp-scrapy-dev.spiders']}
2020-10-09 22:34:53 [scrapy.extensions.telnet] INFO: Telnet Password: 1f95c571b9245c42
2020-10-09 22:34:53 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2020-10-09 22:34:54 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-10-09 22:34:54 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-10-09 22:34:54 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-10-09 22:34:54 [scrapy.core.engine] INFO: Spider opened
2020-10-09 22:34:54 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-10-09 22:34:54 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-10-09 22:34:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yelp.com/search?find_desc=&find_loc=Dallas%2C+TX> (referer: None)
2020-10-09 22:34:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yelp.com/search?find_desc=&find_loc=Atlanta%2C+GA> (referer: None)
2020-10-09 22:34:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yelp.com/search?find_desc=&find_loc=Seattle%2C+WA> (referer: None)
2020-10-09 22:34:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yelp.com/search?find_desc=&find_loc=NewYork%2C+NY> (referer: None)
2020-10-09 22:34:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yelp.com/search?find_desc=&find_loc=SanFrancisco%2C+CA> (referer: None)
2020-10-09 22:34:56 [scrapy.core.engine] INFO: Closing spider (finished)
2020-10-09 22:34:56 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1264,
'downloader/request_count': 5,
'downloader/request_method_count/GET': 5,
'downloader/response_bytes': 278234,
'downloader/response_count': 5,
'downloader/response_status_count/200': 5,
'elapsed_time_seconds': 2.159687,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 10, 10, 5, 34, 56, 173193),
'log_count/DEBUG': 5,
'log_count/INFO': 10,
'response_received_count': 5,
'scheduler/dequeued': 5,
'scheduler/dequeued/memory': 5,
'scheduler/enqueued': 5,
'scheduler/enqueued/memory': 5,
'start_time': datetime.datetime(2020, 10, 10, 5, 34, 54, 13506)}
2020-10-09 22:34:56 [scrapy.core.engine] INFO: Spider closed (finished)
Please, I need help. I am learning scraping and have been struggling to get a spider to work on a website.
I get 0 items scraped every time. I have set a user agent and also set robot_txt = False in settings.py, and yet it doesn't work.
I notice that when I use the Scrapy shell I get all the details. I have checked my code again and again but still can't find the error. Could someone please check it and tell me where I went wrong?
spider code:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.loader import ItemLoader
from batt_data.items import BattDataItem
import urllib.parse


class BatterySpider(CrawlSpider):
    name = 'battery'
    allowed_domains = ['web']
    start_urls = ['https://www.made-in-china.com/multi-search/24v%2Bbattery/F1/1.html']
    base_url = ['https://www.made-in-china.com/multi-search/24v%2Bbattery/F1/1.html']

    rules = (
        Rule(LinkExtractor(restrict_xpaths='//*[contains(@class, "nextpage")]'),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        item = BattDataItem()
        item['description'] = response.xpath('//img[@class="J-firstLazyload"]/@alt').extract()
        item['chemistry'] = response.xpath('//li[@class="J-faketitle ellipsis"][1]/span/text()').extract()
        item['applications'] = response.xpath('//li[@class="J-faketitle ellipsis"][2]/span/text()').extract()
        item['shape'] = response.xpath('//li[@class="J-faketitle ellipsis"][4]/span/text()').extract()
        item['discharge_rate'] = response.xpath('//li[@class="J-faketitle ellipsis"][5]/span/text()').extract()
        yield item
log file:
C:\Users\Ikeen\batt_data>scrapy crawl battery
2020-08-29 21:17:27 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: batt_data)
2020-08-29 21:17:27 [scrapy.utils.log] INFO: Versions: lxml 4.1.0.0, libxml2 2.9.4, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.6.3 |Anaconda, Inc.| (default, Oct 15 2017, 03:27:45) [MSC v.1900 64 bit (AMD64)], pyOpenSSL 17.2.0 (OpenSSL 1.0.2l 25 May 2017), cryptography 2.0.3, Platform Windows-10-10.0.18362-SP0
2020-08-29 21:17:27 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-08-29 21:17:27 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'batt_data',
'NEWSPIDER_MODULE': 'batt_data.spiders',
'SPIDER_MODULES': ['batt_data.spiders'],
'USER_AGENT': 'Mozilla/5.0'}
2020-08-29 21:17:27 [scrapy.extensions.telnet] INFO: Telnet Password: 549b17173b135b6b
2020-08-29 21:17:27 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2020-08-29 21:17:28 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-08-29 21:17:28 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-08-29 21:17:28 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-08-29 21:17:28 [scrapy.core.engine] INFO: Spider opened
2020-08-29 21:17:28 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-08-29 21:17:28 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-08-29 21:17:30 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.made-in-china.com/multi-search/24v%2Bbattery/F1/1.html> (referer: None)
2020-08-29 21:17:30 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.made-in-china.com': <GET https://www.made-in-china.com/multi-search/24v%2Bbattery/F1/2.html;jsessionid=2B77F23449911847145999CD6E9B6429>
2020-08-29 21:17:30 [scrapy.core.engine] INFO: Closing spider (finished)
2020-08-29 21:17:30 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 234,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 54381,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 2.42789,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 8, 29, 20, 17, 30, 804912),
'log_count/DEBUG': 2,
'log_count/INFO': 10,
'offsite/domains': 1,
'offsite/filtered': 1,
'request_depth_max': 1,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2020, 8, 29, 20, 17, 28, 377022)}
2020-08-29 21:17:30 [scrapy.core.engine] INFO: Spider closed (finished)
2020-08-29 21:17:30 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.made-in-china.com': <GET https://www.made-in-china.com/multi-search/24v%2Bbattery/F1/2.html;jsessionid=2B77F23449911847145999CD6E9B6429>
Your request is being filtered because its URL does not belong to any of the allowed domains you defined:
allowed_domains = ['web']
Use allowed_domains = ['made-in-china.com'] instead, or remove the attribute completely.
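To see why 'web' filters out the next page while 'made-in-china.com' lets it through, here is a rough sketch of a domain check of this kind. It is a simplified approximation for illustration, not Scrapy's actual implementation, and the helper name is_allowed is made up:

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or a subdomain of one.

    Simplified illustration only; not Scrapy's real offsite check.
    """
    if not allowed_domains:
        # With no allowed domains configured, nothing is filtered.
        return True
    host = (urlparse(url).hostname or "").lower()
    return any(
        host == domain.lower() or host.endswith("." + domain.lower())
        for domain in allowed_domains
    )


next_page = "https://www.made-in-china.com/multi-search/24v%2Bbattery/F1/2.html"
print(is_allowed(next_page, ["web"]))                # False
print(is_allowed(next_page, ["made-in-china.com"]))  # True
```

The host www.made-in-china.com neither equals 'web' nor ends with '.web', so the request to page 2 is dropped before it is ever scheduled.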
Hello, I am trying to log in to a website via Scrapy. I'm a bit confused for two reasons. First, when I search for tokens, there are two __RequestVerificationToken fields on the login page. Second, when I inspect the page looking for a 302 redirect on successful login, I cannot find one.
Currently, when I run my code I get the same result regardless of whether the username and password are correct. If I pass a random string as the token, Scrapy errors out and is redirected to a page-not-found error.
What do I need to do to get authenticated and redirected to the main page as if I were logging in myself?
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import FormRequest


class LoginSpider(scrapy.Spider):
    name = "login"
    allowed_domains = ["albertacannabis.org"]
    start_urls = ['https://albertacannabis.org/login/']

    def parse(self, response):
        csrf_token = response.xpath('//*[@name="__RequestVerificationToken"]/@value').extract()[0]
        yield FormRequest('https://albertacannabis.org/api/cxa/LoginExtended/LoginAglc/',
                          formdata={'__RequestVerificationToken': csrf_token,
                                    'UserName': 'test',
                                    'Password': 'test'},
                          callback=self.parse_after_login)

    def parse_after_login(self, response):
        if response.xpath('//a[text()="Log Out"]'):
            print 'Success'
This is what I am getting from Scrapy:
2018-11-07 14:23:55 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: canna_spider)
2018-11-07 14:23:55 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.1, w3lib 1.19.0, Twisted 18.7.0, Python 2.7.15 |Anaconda, Inc.| (default, May 1 2018, 18:37:09) [MSC v.1500 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.0.2p 14 Aug 2018), cryptography 2.3.1, Platform Windows-10-10.0.17134
2018-11-07 14:23:55 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'canna_spider.spiders', 'SPIDER_MODULES': ['canna_spider.spiders'], 'BOT_NAME': 'canna_spider'}
2018-11-07 14:23:55 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2018-11-07 14:23:55 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-11-07 14:23:55 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-11-07 14:23:55 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-11-07 14:23:55 [scrapy.core.engine] INFO: Spider opened
2018-11-07 14:23:55 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-11-07 14:23:55 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2018-11-07 14:23:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://albertacannabis.org/login/> (referer: None)
2018-11-07 14:23:55 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://albertacannabis.org/api/cxa/LoginExtended/LoginAglc/> (referer: https://albertacannabis.org/login/)
2018-11-07 14:23:56 [scrapy.core.engine] INFO: Closing spider (finished)
2018-11-07 14:23:56 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 994,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 1,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 8316,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 11, 7, 19, 23, 56, 77000),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'request_depth_max': 1,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2018, 11, 7, 19, 23, 55, 547000)}
2018-11-07 14:23:56 [scrapy.core.engine] INFO: Spider closed (finished)
When I ran your code I was logged in, just not redirected to the home page. Here is slightly updated code that requests the home page after the login POST:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import FormRequest
from scrapy.shell import inspect_response


class LoginSpider(scrapy.Spider):
    name = "login"
    allowed_domains = ["albertacannabis.org"]
    start_urls = ['https://albertacannabis.org/']

    def parse(self, response):
        csrf_token = response.xpath('//*[@name="__RequestVerificationToken"]/@value').extract()[0]
        yield FormRequest('https://albertacannabis.org/api/cxa/LoginExtended/LoginAglc',
                          formdata={'__RequestVerificationToken': csrf_token,
                                    'UserName': 'test@gmail.com',
                                    'Password': '12345678',
                                    'X-Requested-With': 'XMLHttpRequest'},
                          callback=self.parse_after_login)

    def parse_after_login(self, response):
        # The login endpoint does not redirect, so request the home page explicitly.
        yield scrapy.Request('https://albertacannabis.org/',
                             callback=self.parse_home_page)

    def parse_home_page(self, response):
        if response.xpath('//a[text()="Log Out"]'):
            print('Success')
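On the question of the two __RequestVerificationToken fields: one way to see which token belongs to which form is to list each token together with the action of the form that encloses it. The following is a minimal sketch using only the standard library. The sample HTML, the /search form, and the assumption that the login form's action matches the POST URL are all made up for illustration, so the real page's markup may well differ:

```python
from html.parser import HTMLParser

# Hypothetical markup: two forms, each carrying its own hidden token.
SAMPLE_HTML = """
<form action="/search">
  <input type="hidden" name="__RequestVerificationToken" value="aaa">
</form>
<form action="/api/cxa/LoginExtended/LoginAglc">
  <input type="hidden" name="__RequestVerificationToken" value="bbb">
  <input name="UserName">
</form>
"""


class TokenCollector(HTMLParser):
    """Collect (form action, token value) pairs for a given hidden field."""

    def __init__(self, field_name):
        super().__init__()
        self.field_name = field_name
        self.current_action = None  # action of the form being parsed, if any
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self.current_action = attrs.get("action")
        elif tag == "input" and attrs.get("name") == self.field_name:
            self.tokens.append((self.current_action, attrs.get("value")))

    def handle_endtag(self, tag):
        if tag == "form":
            self.current_action = None


def tokens_by_form(html, field_name="__RequestVerificationToken"):
    parser = TokenCollector(field_name)
    parser.feed(html)
    parser.close()
    return parser.tokens


for action, token in tokens_by_form(SAMPLE_HTML):
    print(action, token)
```

In a spider, the same idea can be expressed by scoping the XPath to the login form instead of taking the first match on the whole page.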
I am trying to download images from different URLs via Scrapy. I'm new to Python and Scrapy, so maybe I'm missing something obvious. This is my first post on Stack Overflow. Help would be really appreciated!
Here are my different files:
items.py
from scrapy.item import Item, Field


class ImagesTestItem(Item):
    image_urls = Field()
    image_names = Field()
    images = Field()
    pass
settings.py:
from scrapy import log
log.msg("This is a warning", level=log.WARNING)
log.msg("This is a error", level=log.ERROR)
BOT_NAME = 'images_test'
SPIDER_MODULES = ['images_test.spiders']
NEWSPIDER_MODULE = 'images_test.spiders'
ITEM_PIPELINES = {'images_test.pipelines.images_test': 1}
IMAGES_STORE = '/Users/Coralie/Documents/scrapy/images_test/images'
DOWNLOAD_DELAY = 5
STATS_CLASS = True
spider:
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.loader import XPathItemLoader
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item,Field
from scrapy.utils.response import get_base_url
import logging
from scrapy.log import ScrapyFileLogObserver
logfile = open('testlog.log', 'w')
log_observer = ScrapyFileLogObserver(logfile, level=logging.DEBUG)
log_observer.start()
class images_test(CrawlSpider):
    name = "images_test"
    allowed_domains = ['veranstaltungszentrum.bbaw.de']
    start_urls = ['http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib0%d_g.jpg' % i for i in xrange(9) ]

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        sites = hxs.select()
        number = 0
        for site in sites:
            xpath = '//img/@src'
            image_urls = hxs.select('//img/@src').extract()
            item['image_urls'] = ["http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib0x_g.jpg" + x for x in image_urls]
            items.append(item)
            number = number + 1
        return item
        print item['image_urls']
pipelines.py
from scrapy.contrib.pipeline.images import ImagesPipeline
from scrapy.exceptions import DropItem
from scrapy.http import Request
from PIL import Image
from scrapy import log
log.msg("This is a warning", level=log.WARNING)
log.msg("This is a error", level=log.ERROR)
scrapy.log.ERROR
class images_test(ImagesPipeline):

    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            yield Request(image_url)

    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item
The log says the following:
/Library/Python/2.7/site-packages/Scrapy-0.20.2-py2.7.egg/scrapy/settings/deprecated.py:26: ScrapyDeprecationWarning: You are using the following settings which are deprecated or obsolete (ask scrapy-users@googlegroups.com for alternatives):
STATS_ENABLED: no longer supported (change STATS_CLASS instead)
warnings.warn(msg, ScrapyDeprecationWarning)
2014-01-03 11:36:48+0100 [scrapy] INFO: Scrapy 0.20.2 started (bot: images_test)
2014-01-03 11:36:48+0100 [scrapy] DEBUG: Optional features available: ssl, http11
2014-01-03 11:36:48+0100 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'images_test.spiders', 'SPIDER_MODULES': ['images_test.spiders'], 'DOWNLOAD_DELAY': 5, 'BOT_NAME': 'images_test'}
2014-01-03 11:36:48+0100 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-01-03 11:36:49+0100 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-01-03 11:36:49+0100 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-01-03 11:36:49+0100 [scrapy] WARNING: This is a warning
2014-01-03 11:36:49+0100 [scrapy] ERROR: This is a error
2014-01-03 11:36:49+0100 [scrapy] DEBUG: Enabled item pipelines: images_test
2014-01-03 11:36:49+0100 [images_test] INFO: Spider opened
2014-01-03 11:36:49+0100 [images_test] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-03 11:36:49+0100 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-01-03 11:36:49+0100 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-01-03 11:36:49+0100 [images_test] DEBUG: Crawled (404) <GET http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib00_g.jpg> (referer: None)
2014-01-03 11:36:55+0100 [images_test] DEBUG: Crawled (200) <GET http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib01_g.jpg> (referer: None)
2014-01-03 11:36:59+0100 [images_test] DEBUG: Crawled (200) <GET http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib02_g.jpg> (referer: None)
2014-01-03 11:37:05+0100 [images_test] DEBUG: Crawled (200) <GET http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib03_g.jpg> (referer: None)
2014-01-03 11:37:10+0100 [images_test] DEBUG: Crawled (200) <GET http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib04_g.jpg> (referer: None)
2014-01-03 11:37:16+0100 [images_test] DEBUG: Crawled (200) <GET http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib05_g.jpg> (referer: None)
2014-01-03 11:37:22+0100 [images_test] DEBUG: Crawled (200) <GET http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib06_g.jpg> (referer: None)
2014-01-03 11:37:29+0100 [images_test] DEBUG: Crawled (200) <GET http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib07_g.jpg> (referer: None)
2014-01-03 11:37:36+0100 [images_test] DEBUG: Crawled (200) <GET http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib08_g.jpg> (referer: None)
2014-01-03 11:37:36+0100 [images_test] INFO: Closing spider (finished)
2014-01-03 11:37:36+0100 [images_test] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 2376,
'downloader/request_count': 9,
'downloader/request_method_count/GET': 9,
'downloader/response_bytes': 343660,
'downloader/response_count': 9,
'downloader/response_status_count/200': 8,
'downloader/response_status_count/404': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2014, 1, 3, 10, 37, 36, 166139),
'log_count/DEBUG': 15,
'log_count/ERROR': 1,
'log_count/INFO': 3,
'log_count/WARNING': 1,
'response_received_count': 9,
'scheduler/dequeued': 9,
'scheduler/dequeued/memory': 9,
'scheduler/enqueued': 9,
'scheduler/enqueued/memory': 9,
'start_time': datetime.datetime(2014, 1, 3, 10, 36, 49, 37947)}
2014-01-03 11:37:36+0100 [images_test] INFO: Spider closed (finished)
How come images are not getting saved? Even my print item['image_urls'] command is not being executed.
Thank you
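Consider changing your spider code to the following:
from urlparse import urljoin  # Python 2; on Python 3 use urllib.parse

# Inside the spider class: start from the gallery page, not from the image files.
start_urls = ['http://veranstaltungszentrum.bbaw.de/en/photo_gallery']

def parse(self, response):
    sel = HtmlXPathSelector(response)
    item = ImagesTestItem()
    url = 'http://veranstaltungszentrum.bbaw.de'
    # Turn every <img src> into an absolute URL for the images pipeline.
    item['image_urls'] = [urljoin(url, x) for x in
                          sel.select('//img/@src').extract()]
    return item
HtmlXPathSelector can only parse HTML documents. It seems that you fed it the images themselves, because your start_urls point directly at the .jpg files.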
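To illustrate how urljoin resolves the src values, here is a small standalone sketch using the Python 3 standard library. The src values are hypothetical examples, since the actual attributes on the gallery page are not shown in the question:

```python
from urllib.parse import urljoin

SITE = "http://veranstaltungszentrum.bbaw.de"
PAGE = "http://veranstaltungszentrum.bbaw.de/en/photo_gallery/"


def absolutize(base, srcs):
    """Resolve each src value against the given base URL."""
    return [urljoin(base, src) for src in srcs]


# Hypothetical src attributes: root-relative, page-relative, already absolute.
srcs = [
    "/en/photo_gallery/leib01_g.jpg",
    "leib02_g.jpg",
    "http://other.example/x.png",
]

for resolved in absolutize(PAGE, srcs):
    print(resolved)

# A page-relative src resolves differently depending on the base path:
print(urljoin(SITE, "leib02_g.jpg"))
print(urljoin(PAGE, "leib02_g.jpg"))
```

Unlike plain string concatenation, urljoin leaves absolute URLs untouched and handles root-relative paths correctly. If the page uses page-relative src values, joining against the page URL rather than the site root is probably what you want.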
You can also try it without pipelines:
import urllib.request

def parse(self, response):
    # Extract the image URL (this takes only the first <img> on the page)
    # and make it absolute in case the src attribute is relative.
    imageurl = response.urljoin(response.xpath("//img/@src").get())
    imagename = imageurl.split("/")[-1]
    req = urllib.request.Request(imageurl, headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.100 Safari/537.36'})
    resource = urllib.request.urlopen(req)
    # The folder "foldername" must already exist.
    with open("foldername/" + imagename, "wb") as output:
        output.write(resource.read())
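One caveat with taking the last "/"-separated piece as the file name: a query string or a trailing slash ends up in the name or leaves it empty. A slightly more defensive variant could look like the sketch below. The helper name filename_from_url and the fallback name "image" are made up for this example:

```python
import posixpath
from urllib.parse import unquote, urlsplit


def filename_from_url(url, default="image"):
    """Derive a file name from the path part of a URL.

    The query string and fragment are ignored; `default` is returned
    when the path does not end in a file name.
    """
    path = unquote(urlsplit(url).path)
    name = posixpath.basename(path)
    return name or default


print(filename_from_url("http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib01_g.jpg"))
print(filename_from_url("http://veranstaltungszentrum.bbaw.de/en/photo_gallery/leib01_g.jpg?size=large"))
print(filename_from_url("http://veranstaltungszentrum.bbaw.de/en/photo_gallery/"))
```

The first two calls give leib01_g.jpg, and the last one falls back to the default because the path ends in a slash.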