Scrapy Value Error f'Missing scheme in request - css

I'm new in scrapy and I'm trying to scrap https:opensports.I need some data from all products, so the idea is to get all brands (if I get all brands I'll get all products). Each url's brand, has a number of pages (24 articles per page), so I need to define the total number of pages from each brand and then get the links from 1 to Total number of pages.
I ' m facing a (or more!) problem with hrefs...This is the script:
import scrapy
from scrapy import Request
from scrapy.crawler import CrawlerProcess
from datetime import datetime
import datetime
#start_url: https://www.opensports.com.ar/marcas.html
class SolodeportesSpider(scrapy.Spider):
name = 'solodeportes'
start_urls = ['https://www.opensports.com.ar/marcas.html']
custom_settings = {'FEED_URI':'opensports_' + f'{datetime.datetime.today().strftime("%d-%m-%Y-%H%M%S")}.csv', 'FEED_FORMAT': 'csv', }
#get links of dif. brands
def parse(self, response):
marcas= response.css('#maincontent > div.category-view > div > div.brands-page > table > tbody td a::attr(href)').getall()
for marca in marcas:
yield Request(marca, self.parse_paginator)
#get total number of pages of the brand And request all pages from 1 to total number of products
def parse_paginator(self,response):
total_products = int(int(response.css('#toolbar-amount > span:nth-child(3)::text').get() / 24) + 1)
for count in range(1, total_products):
yield Request(url=f'https://www.opensports.com.ar/{response.url}?p={count}',
callback=self.parse_listings)
#Links list to click to get the articles detail
def parse_listings(self, response):
all_listings = response.css('a.product-item-link::attr(class)').getall()
for url in all_listings:
yield Request(url, self.detail_page)
#url--Article-- Needed data
def detail_page(self, response):
yield {
'Nombre_Articulo' :response.css('h1.page-title span::text').get(),
'Precio_Articulo' : response.css('span.price::text').get(),
'Sku_Articulo' : response.css('td[data-th="SKU"]::text').get() ,
'Tipo_Producto': response.css('td[data-th="Disciplina"]::text').get() ,
'Item_url': response.url
}
process = CrawlerProcess()
process.crawl(SolodeportesSpider)
process.start()
And I'm getting this error message:
c:/Users/User/Desktop/Personal/DABRA/Scraper_opensports/opensports/opens_sp_copia_solod.py
2022-01-16 03:45:05 [scrapy.utils.log] INFO: Scrapy 2.5.1 started
(bot: scrapybot) 2022-01-16 03:45:05 [scrapy.utils.log] INFO:
Versions: lxml 4.7.1.0, libxml2 2.9.12, cssselect 1.1.0, parsel
1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.10.1 (tags/v3.10.1:2cd268a, Dec 6 2021, 19:10:37) [MSC v.1929 64 bit
(AMD64)], pyOpenSSL 21.0.0 (OpenSSL 1.1.1m 14 Dec 2021), cryptography
36.0.1, Platform Windows-10-10.0.19042-SP0 2022-01-16 03:45:05 [scrapy.utils.log] DEBUG: Using reactor:
twisted.internet.selectreactor.SelectReactor 2022-01-16 03:45:05
[scrapy.crawler] INFO: Overridden settings: {} 2022-01-16 03:45:05
[scrapy.extensions.telnet] INFO: Telnet Password: b362a63ff2281937
2022-01-16 03:45:05 [py.warnings] WARNING:
C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-
packages\scrapy\extensions\feedexport.py:247:
ScrapyDeprecationWarning: The FEED_URI and FEED_FORMAT settings
have been deprecated in favor of the FEEDS setting. Please see
the FEEDS setting docs for more details exporter = cls(crawler)
2022-01-16 03:45:05 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats'] 2022-01-16 03:45:05
[scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats'] 2022-01-16
03:45:05 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware'] 2022-01-16 03:45:05
[scrapy.middleware] INFO: Enabled item pipelines: [] 2022-01-16
03:45:05 [scrapy.core.engine] INFO: Spider opened 2022-01-16 03:45:05
[scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min),
scraped 0 items (at 0 items/min) 2022-01-16 03:45:05
[scrapy.extensions.telnet] INFO: Telnet console listening on
127.0.0.1:6023 2022-01-16 03:45:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.opensports.com.ar/marcas.html> (referer: None)
2022-01-16 03:45:07 [scrapy.core.scraper] ERROR: Spider error
processing <GET https://www.opensports.com.ar/marcas.html> (referer:
None) Traceback (most recent call last): File
"C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\utils\defer.py",
line 120, in iter_errback
yield next(it) File "C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\utils\python.py",
line 353, in next
return next(self.data) File "C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\utils\python.py",
line 353, in next
return next(self.data) File "C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\core\spidermw.py",
line 56, in _evaluate_iterable
for r in iterable: File "C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\spidermiddlewares\offsite.py",
line 29, in process_spider_output
for x in result: File "C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\core\spidermw.py",
line 56, in _evaluate_iterable
for r in iterable: File "C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\spidermiddlewares\referer.py",
line 342, in
return (_set_referer(r) for r in result or ()) File "C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\core\spidermw.py",
line 56, in _evaluate_iterable
for r in iterable: File "C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\spidermiddlewares\urllength.py",
line 40, in
return (r for r in result or () if _filter(r)) File "C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\core\spidermw.py",
line 56, in _evaluate_iterable
for r in iterable: File "C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\spidermiddlewares\depth.py", line 58, in
return (r for r in result or () if _filter(r)) File "C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\core\spidermw.py",
line 56, in evaluate_iterable
for r in iterable: File "c:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\opensports\opens_sp_copia_solod.py",
line 16, in parse
yield Request(marca, self.parse_paginator) File "C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\http\request_init.py",
line 25, in init
self.set_url(url) File "C:\Users\User\Desktop\Personal\DABRA\Scraper_opensports\venv\lib\site-packages\scrapy\http\request_init.py",
line 73, in _set_url
raise ValueError(f'Missing scheme in request url: {self._url}') ValueError: Missing scheme in request url: /marca/adidas.html
2022-01-16 03:45:07 [scrapy.core.engine] INFO: Closing spider
(finished) 2022-01-16 03:45:07 [scrapy.statscollectors] INFO: Dumping
Scrapy stats: {'downloader/request_bytes': 232,
'downloader/request_count': 1, 'downloader/request_method_count/GET':
1, 'downloader/response_bytes': 22711, 'downloader/response_count':
1, 'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 1.748282, 'finish_reason': 'finished',
'finish_time': datetime.datetime(2022, 1, 16, 6, 45, 7, 151772),
'httpcompression/response_bytes': 116063,
'httpcompression/response_count': 1, 'log_count/DEBUG': 1,
'log_count/ERROR': 1, 'log_count/INFO': 10, 'log_count/WARNING': 1,
'response_received_count': 1, 'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1, 'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1, 'spider_exceptions/ValueError': 1,
'start_time': datetime.datetime(2022, 1, 16, 6, 45, 5, 403490)}
At first I have a problem with the f' url...I don't know how to concatenate the url because in :
marcas= response.css('#maincontent > div.category-view > div > div.brands-page > table > tbody td a::attr(href)').getall()
I get this type of url (I don't know if it's ok or I need the https:// part):
'/marca/adidas.html'
I know that it's wrong and I coudln't find a way to fix it...Could anyone give me a hand?
Thanks in advance!

For the relative you can use response.follow or with request just add the base url.
Some other errors you have:
The pagination doesn't always work.
In the function parse_listings you have class attribute instead of href.
For some reason I'm getting 500 status for some of the urls.
I've fixed errors #1 and #2, you need to figure out how to fix error #3.
import scrapy
from scrapy import Request
from scrapy.crawler import CrawlerProcess
from datetime import datetime
import datetime
#start_url: https://www.opensports.com.ar/marcas.html
class SolodeportesSpider(scrapy.Spider):
name = 'solodeportes'
start_urls = ['https://www.opensports.com.ar/marcas.html']
custom_settings = {
'FEED_URI': 'opensports_' + f'{datetime.datetime.today().strftime("%d-%m-%Y-%H%M%S")}.csv', 'FEED_FORMAT': 'csv',
}
#get links of dif. brands
def parse(self, response):
marcas= response.css('#maincontent > div.category-view > div > div.brands-page > table > tbody td a::attr(href)').getall()
for marca in marcas:
yield response.follow(url=marca, callback=self.parse_paginator)
#get total number of pages of the brand And request all pages from 1 to total number of products
def parse_paginator(self, response):
yield scrapy.Request(url=response.url, callback=self.parse_listings, dont_filter=True)
next_page = response.xpath('//a[contains(#class, "next")]/#href').get()
if next_page:
yield scrapy.Request(url=next_page, callback=self.parse_paginator)
#Links list to click to get the articles detail
def parse_listings(self, response):
all_listings = response.css('a.product-item-link::attr(href)').getall()
for url in all_listings:
yield Request(url, self.detail_page)
#url--Article-- Needed data
def detail_page(self, response):
yield {
'Nombre_Articulo': response.css('h1.page-title span::text').get(),
'Precio_Articulo': response.css('span.price::text').get(),
'Sku_Articulo': response.css('td[data-th="SKU"]::text').get(),
'Tipo_Producto': response.css('td[data-th="Disciplina"]::text').get(),
'Item_url': response.url
}

Related

I am using scrapy to scrape data from Yelp. I cannot see any error but data is not getting scraped from the StartURLs mentioned in the spider

Code for the items.py and other files are mentioned below. The logs are also mentioned at the end.I am not getting any error but according to the logs the scrapy has not scraped any pages.
```
import scrapy
class YelpItem(scrapy.Item):
# define the fields for your item here like:
name = scrapy.Field()
name_url = scrapy.Field()
rating = scrapy.Field()
date = scrapy.Field()
review_text = scrapy.Field()
user_pic = scrapy.Field()
city = scrapy.Field()
is_true = scrapy.Field()
```
code for settings.py
import pathlib
BOT_NAME = 'yelp-scrapy-dev'
SPIDER_MODULES = ['yelp-scrapy-dev.spiders']
NEWSPIDER_MODULE = 'yelp-scrapy-dev.spiders'
{
pathlib.Path('output1.csv'):{
'format':'csv',
},
}
ROBOTSTXT_OBEY = False
code for pipelines.py
class YelpPipeline:
def open_spider(self, spider):
self.file = open('output1.csv', 'w')
def close_spider(self, spider):
self.file.close()
def process_item(self, item, spider):
return item
code for middlewares.py
from scrapy import signals
# useful for handling different item types with a single interface
from itemadapter import is_item, ItemAdapter
class YelpSpiderMiddleware:
# Not all methods need to be defined. If a method is not defined,
# scrapy acts as if the spider middleware does not modify the
# passed objects.
#classmethod
def from_crawler(cls, crawler):
# This method is used by Scrapy to create your spiders.
s = cls()
crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
return s
def process_spider_input(self, response, spider):
# Called for each response that goes through the spider
# middleware and into the spider.
# Should return None or raise an exception.
return None
def process_spider_output(self, response, result, spider):
# Called with the results returned from the Spider, after
# it has processed the response.
# Must return an iterable of Request, or item objects.
for i in result:
yield i
def process_spider_exception(self, response, exception, spider):
# Called when a spider or process_spider_input() method
# (from other spider middleware) raises an exception.
# Should return either None or an iterable of Request or item objects.
pass
def process_start_requests(self, start_requests, spider):
# Called with the start requests of the spider, and works
# similarly to the process_spider_output() method, except
# that it doesn’t have a response associated.
# Must return only requests (not items).
for r in start_requests:
yield r
def spider_opened(self, spider):
spider.logger.info('Spider opened: %s' % spider.name)
class YelpDownloaderMiddleware:
# Not all methods need to be defined. If a method is not defined,
# scrapy acts as if the downloader middleware does not modify the
# passed objects.
#classmethod
def from_crawler(cls, crawler):
# This method is used by Scrapy to create your spiders.
s = cls()
crawler.signals.connect(s.spider_opened, signal=signals.spider_opened)
return s
def process_request(self, request, spider):
# Called for each request that goes through the downloader
# middleware.
# Must either:
# - return None: continue processing this request
# - or return a Response object
# - or return a Request object
# - or raise IgnoreRequest: process_exception() methods of
# installed downloader middleware will be called
return None
def process_response(self, request, response, spider):
# Called with the response returned from the downloader.
# Must either;
# - return a Response object
# - return a Request object
# - or raise IgnoreRequest
return response
def process_exception(self, request, exception, spider):
# Called when a download handler or a process_request()
# (from other downloader middleware) raises an exception.
# Must either:
# - return None: continue processing this exception
# - return a Response object: stops process_exception() chain
# - return a Request object: stops process_exception() chain
pass
def spider_opened(self, spider):
spider.logger.info('Spider opened: %s' % spider.name)
code for city spider. The spider collects the reviews from the specified URL's
import scrapy
from ..items import YelpItem
# currently will grab the first 100 reviews from the first 10 businesses from start url
class CitySpider(scrapy.Spider):
name = 'city'
start_urls = [
'https://www.yelp.com/search?find_desc=&find_loc=Seattle%2C+WA',
'https://www.yelp.com/search?find_desc=&find_loc=SanFrancisco%2C+CA',
'https://www.yelp.com/search?find_desc=&find_loc=NewYork%2C+NY',
'https://www.yelp.com/search?find_desc=&find_loc=Dallas%2C+TX',
'https://www.yelp.com/search?find_desc=&find_loc=Atlanta%2C+GA',
]
# gets the first 10 businesses from the start url
def parse(self, response):
business_pages = response.css('.text-weight--bold__373c0__1elNz a')
yield from response.follow_all(business_pages, self.parse_business)
# extracts the first 100 reviews from the yelp-scrapy-dev business
def parse_business(self, response):
items = YelpItem()
all_reviews = response.css('.sidebarActionsHoverTarget__373c0__2kfhE')
address = response.request.url.split('?')
src = address[0].split('/')
biz = src[-1].split('-')
loc = biz[-1] if not biz[-1].isdigit() else biz[-2]
if loc == 'seattle':
city = 'Seattle, WA'
elif loc == 'dallas':
city = 'Dallas, TX'
elif loc == 'francisco':
city = 'San Francisco, CA'
elif loc == 'york':
city = 'New York, NY'
elif loc == 'atlanta':
city = 'Atlanta, GA'
else:
city = 'outofrange'
for review in all_reviews:
name = review.css('.link-size--inherit__373c0__1VFlE::text').extract_first()
name_url = review.css('.link-size--inherit__373c0__1VFlE::attr(href)').extract_first().split('=')
rating = review.css('.overflow--hidden__373c0__2y4YK::attr(aria-label)').extract()
date = review.css('.arrange-unit-fill__373c0__3Sfw1 .text-color--mid__373c0__jCeOG::text').extract()
review_text = review.css('.raw__373c0__3rKqk::text').extract()
user_pic = review.css('.gutter-1__373c0__2l5bx .photo-box-img__373c0__35y5v::attr(src)').extract()
if city != 'outofrange':
# making sure data is stored as a str
items['name'] = name
items['name_url'] = name_url[1]
items['rating'] = rating[0]
items['date'] = date[0]
items['review_text'] = review_text[0]
items['user_pic'] = user_pic[0] != 'https://s3-media0.fl.yelpcdn.com/assets/srv0/yelp_styleguide/514f6997a318/assets/img/default_avatars/user_60_square.png'
items['city'] = city
items['is_true'] = True
yield items
source = response.request.url
# prevent duplicate secondary pages from being recrawled
if '?start=' not in source:
# gets 20th-100th reviews, pages are every 20 reviews
for i in range(1, 5):
next_page = source + '?start=' + str(i*20)
yield response.follow(next_page, callback=self.parse_business)
And below are the log lines.
(venv) C:\Users\somar\yelp-scrapy\yelp>scrapy crawl city
2020-10-09 22:34:53 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: yelp-scrapy-dev)
2020-10-09 22:34:53 [scrapy.utils.log] INFO: Versions: lxml 4.5.2.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.7
.6 (default, Jan 8 2020, 20:23:39) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 19.1.0 (OpenSSL 1.1.1h 22 Sep 2020), cryptography 3.1.1, Platform Windows-10-10
.0.18362-SP0
2020-10-09 22:34:53 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-10-09 22:34:53 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'yelp-scrapy-dev',
'NEWSPIDER_MODULE': 'yelp-scrapy-dev.spiders',
'SPIDER_MODULES': ['yelp-scrapy-dev.spiders']}
2020-10-09 22:34:53 [scrapy.extensions.telnet] INFO: Telnet Password: 1f95c571b9245c42
2020-10-09 22:34:53 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2020-10-09 22:34:54 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-10-09 22:34:54 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-10-09 22:34:54 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-10-09 22:34:54 [scrapy.core.engine] INFO: Spider opened
2020-10-09 22:34:54 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-10-09 22:34:54 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-10-09 22:34:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yelp.com/search?find_desc=&find_loc=Dallas%2C+TX> (referer: None)
2020-10-09 22:34:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yelp.com/search?find_desc=&find_loc=Atlanta%2C+GA> (referer: None)
2020-10-09 22:34:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yelp.com/search?find_desc=&find_loc=Seattle%2C+WA> (referer: None)
2020-10-09 22:34:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yelp.com/search?find_desc=&find_loc=NewYork%2C+NY> (referer: None)
2020-10-09 22:34:56 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.yelp.com/search?find_desc=&find_loc=SanFrancisco%2C+CA> (referer: None)
2020-10-09 22:34:56 [scrapy.core.engine] INFO: Closing spider (finished)
2020-10-09 22:34:56 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1264,
'downloader/request_count': 5,
'downloader/request_method_count/GET': 5,
'downloader/response_bytes': 278234,
'downloader/response_count': 5,
'downloader/response_status_count/200': 5,
'elapsed_time_seconds': 2.159687,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 10, 10, 5, 34, 56, 173193),
'log_count/DEBUG': 5,
'log_count/INFO': 10,
'response_received_count': 5,
'scheduler/dequeued': 5,
'scheduler/dequeued/memory': 5,
'scheduler/enqueued': 5,
'scheduler/enqueued/memory': 5,
'start_time': datetime.datetime(2020, 10, 10, 5, 34, 54, 13506)}
2020-10-09 22:34:56 [scrapy.core.engine] INFO: Spider closed (finished)

Scrapy runs but doesn't crawl site - Scrapy Shell response in loop

I've tried to set up Scrapy to crawl a database for technical norm and standards.
What is the problem:
I wrote a Scraper, got a 200 response but no results - it scraped 0 pages:
2020-09-06 12:42:00 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: stack)
2020-09-06 12:42:00 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.5.1, w3lib 1.19.0, Twisted 18.9.0, Python 3.7.2 (tags/v3.7.2:9a3ffc0492, Dec 23 2018, 22:20:52) [MSC v.1916 32 bit (Intel)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0j 20 Nov 2018), cryptography 2.4.2, Platform Windows-10-10.0.18362-SP0
2020-09-06 12:42:00 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'stack', 'NEWSPIDER_MODULE': 'stack.spiders', 'SPIDER_MODULES': ['stack.spiders'], 'USER_AGENT': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
2020-09-06 12:42:00 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2020-09-06 12:42:01 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-09-06 12:42:01 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-09-06 12:42:01 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-09-06 12:42:01 [scrapy.core.engine] INFO: Spider opened
2020-09-06 12:42:01 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-09-06 12:42:01 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2020-09-06 12:42:01 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.beuth.de/de/regelwerke/vdi/vdi-richtlinien-entwuerfe> (referer: None)
2020-09-06 12:42:01 [scrapy.core.engine] INFO: Closing spider (finished)
2020-09-06 12:42:01 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 341,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 6149,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 6149,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 9, 6, 10, 42, 1, 684021),
'log_count/DEBUG': 2,
'log_count/INFO': 7,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2020, 9, 6, 10, 42, 1, 140686)}
2020-09-06 12:42:01 [scrapy.core.engine] INFO: Spider closed (finished)
Here are my items out of the items.py file which I want to scrape:
from scrapy.item import Item, Field
class StackItem(Item):
title = Field()
url = Field()
date = Field()
price = Field()
subtitle = Field()
description = Field()
This my crawler code:
from scrapy import Spider
from scrapy.selector import Selector
from stack.items import StackItem
class StackSpider(Spider):
name = "stack"
allowed_domains = ["www.beuth.de"]
start_urls = [
"https://www.beuth.de/de/regelwerke/vdi/vdi-richtlinien-entwuerfe",
]
def parse(self, response):
elements = Selector(response).xpath('//div[#class="bwr-card__inner"]')
for element in elements:
item = StackItem()
item['title'] = element.xpath('a[#class="bwr-link__label"]/text()').extract()[0]
item['url'] = element.xpath('a[#class="bwr-card__title-link"]/#href').extract()[0]
item['date'] = element.xpath('div[#class="bwr-type__item bwr-type__item--light"]/text()').extract()[0]
item['price'] = element.xpath('div[#class="bwr-buybox__price-emph]/text()').extract()[0]
item['subtitle'] = element.xpath('div[#class="bwr-card__subtitle bwr-data-dlink"]/text()').extract()[0]
item['description'] = element.xpath('div[#class="bwr-card__text bwr-rte bwr-data-dlink"]/text()').extract()[0]
yield item
What I tried so solve the problem:
I tried the Scrapy-Shell. After I used a defined User-Agent the website didn't block me:
In [2]: from scrapy import Request
...: req = Request("https://www.beuth.de/de/regelwerke/vdi/vdi-richtlinien-entwuerfe",
headers={"USER-AGENT" : "Mozi
...: lla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 OPR/45.0.
...: 2552.888"})
...: fetch(req)
2020-09-06 12:48:19 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.beuth.de/de/regelwerke/vdi/vdi-richtlinien-entwuerfe>
(referer: None)
After that I tested the selectors (see items list above) but unfortunately I just get a [] whatever XPath I use:
In [3]: response.xpath("//div[#class='bwr-buybox__price']/a/text").getall()
Out[3]: []
When I try view(response) I just a Browser Window with an infinite loading loop.
I tried to analyse the outcome but without an error I wasnt sure, where to start fixing.
I defined a User Agent in the settings.py because otherwise I got an error message about blocking (I think the website dont allow crawls).
Summary:
I want to scrape the Items above from the list. But I am not sure if there is just a problem with my selectors because testing in Shell results in a [ ] everytime.

Scrapy returns 0 items and 0 crawled pages

please I need help. I am learning scraping and have been struggling to get it work scraping a website.
I get 0 items crawled every time. I have used user_agent and also set robot_txt = False in the settings.py and yet it doesn't work.
I notice when I use scrapy shell, I get all the details and have checked through my codes again and again to find errors but can't still find it. Please someone should help me check and tell me where I got it wrong.
spider code:
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.loader import ItemLoader
from batt_data.items import BattDataItem
import urllib.parse
class BatterySpider(CrawlSpider):
name = 'battery'
allowed_domains = ['web']
start_urls = ['https://www.made-in-china.com/multi-
search/24v%2Bbattery/F1/1.html']
base_url = ['https://www.made-in-china.com/multi-
search/24v%2Bbattery/F1/1.html']
rules = (
Rule(LinkExtractor(restrict_xpaths='//*[contains(#class,
"nextpage")]'), callback='parse_item', follow=True),
)
def parse_item(self, response):
item = BattDataItem()
item['description'] = response.xpath('//img[#class="J-firstLazyload"]/#alt').extract()
item['chemistry'] = response.xpath('//li[#class="J-faketitle ellipsis"][1]/span/text()').extract()
item['applications'] = response.xpath('//li[#class="J-faketitle ellipsis"][2]/span/text()').extract()
item['shape'] = response.xpath('//li[#class="J-faketitle ellipsis"][4]/span/text()').extract()
item['discharge_rate'] = response.xpath('//li[#class="J-faketitle ellipsis"][5]/span/text()').extract()
yield item
log file:
C:\Users\Ikeen\batt_data>scrapy crawl battery
2020-08-29 21:17:27 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: batt_data)
2020-08-29 21:17:27 [scrapy.utils.log] INFO: Versions: lxml 4.1.0.0, libxml2 2.9.4, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 20.3.0, Python 3.6.3 |Anaconda, Inc.| (default, Oct 15 2017, 03:27:45) [MSC v.1900 64 bit (AMD64)], pyOpenSSL 17.2.0 (OpenSSL 1.0.2l 25 May 2017), cryptography 2.0.3, Platform Windows-10-10.0.18362-SP0
2020-08-29 21:17:27 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-08-29 21:17:27 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'batt_data',
'NEWSPIDER_MODULE': 'batt_data.spiders',
'SPIDER_MODULES': ['batt_data.spiders'],
'USER_AGENT': 'Mozilla/5.0'}
2020-08-29 21:17:27 [scrapy.extensions.telnet] INFO: Telnet Password: 549b17173b135b6b
2020-08-29 21:17:27 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2020-08-29 21:17:28 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-08-29 21:17:28 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-08-29 21:17:28 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-08-29 21:17:28 [scrapy.core.engine] INFO: Spider opened
2020-08-29 21:17:28 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-08-29 21:17:28 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-08-29 21:17:30 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.made-in-china.com/multi-search/24v%2Bbattery/F1/1.html> (referer: None)
2020-08-29 21:17:30 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.made-in-china.com': <GET https://www.made-in-china.com/multi-search/24v%2Bbattery/F1/2.html;jsessionid=2B77F23449911847145999CD6E9B6429>
2020-08-29 21:17:30 [scrapy.core.engine] INFO: Closing spider (finished)
2020-08-29 21:17:30 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 234,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 54381,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 2.42789,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 8, 29, 20, 17, 30, 804912),
'log_count/DEBUG': 2,
'log_count/INFO': 10,
'offsite/domains': 1,
'offsite/filtered': 1,
'request_depth_max': 1,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2020, 8, 29, 20, 17, 28, 377022)}
2020-08-29 21:17:30 [scrapy.core.engine] INFO: Spider closed (finished)
2020-08-29 21:17:30 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.made-in-china.com': <GET https://www.made-in-china.com/multi-search/24v%2Bbattery/F1/2.html;jsessionid=2B77F23449911847145999CD6E9B6429>
Your request is being filtered as it doesn't belong to the allowed domains that you defined.
allowed_domains = ['web']
Use allowed_domains = ['made-in-china.com'] or remove it completely.

Scrapy Spider Login Issues

Hello I am trying to login to a website via scrapy. I'm a bit confused because first if I search tokens there are two __RequestVerificationTokens on the login page. Second of all when I inspect the page to find a 302 redirect on successful login, I am unable to find one.
Currently, if I run my code regardless of I have username and password correct I am getting the same results. If I pass a random string as the token then scrapy errors out and redirects to a page not found error.
What do I need to do to get authenticated and redirected to the main page as if I was logging in myself?
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import FormRequest
class LoginSpider(scrapy.Spider):
name = "login"
allowed_domains = ["albertacannabis.org"]
start_urls = ['https://albertacannabis.org/login/']
def parse(self, response):
csrf_token = response.xpath('//*[#name="__RequestVerificationToken"]/#value').extract()[0]
yield FormRequest('https://albertacannabis.org/api/cxa/LoginExtended/LoginAglc/',
formdata={'__RequestVerificationToken' : csrf_token,
'UserName': 'test',
'Password' : 'test'},
callback=self.parse_after_login)
def parse_after_login(self, response):
if response.xpath('//a[text()="Log Out"]'):
print 'Success'
This is what I am getting from Scrapy
2018-11-07 14:23:55 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: canna_spider)
2018-11-07 14:23:55 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.5.1, w3lib 1.19.0, Twisted 18.7.0, Python 2.7.15 |Anaconda, Inc.| (default, May 1 2018, 18:37:09) [MSC v.1500 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.0.2p 14 Aug 2018), cryptography 2.3.1, Platform Windows-10-10.0.17134
2018-11-07 14:23:55 [scrapy.crawler] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'canna_spider.spiders', 'SPIDER_MODULES': ['canna_spider.spiders'], 'BOT_NAME': 'canna_spider'}
2018-11-07 14:23:55 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.logstats.LogStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2018-11-07 14:23:55 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-11-07 14:23:55 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-11-07 14:23:55 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-11-07 14:23:55 [scrapy.core.engine] INFO: Spider opened
2018-11-07 14:23:55 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-11-07 14:23:55 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6024
2018-11-07 14:23:55 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://albertacannabis.org/login/> (referer: None)
2018-11-07 14:23:55 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://albertacannabis.org/api/cxa/LoginExtended/LoginAglc/> (referer: https://albertacannabis.org/login/)
2018-11-07 14:23:56 [scrapy.core.engine] INFO: Closing spider (finished)
2018-11-07 14:23:56 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 994,
'downloader/request_count': 2,
'downloader/request_method_count/GET': 1,
'downloader/request_method_count/POST': 1,
'downloader/response_bytes': 8316,
'downloader/response_count': 2,
'downloader/response_status_count/200': 2,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 11, 7, 19, 23, 56, 77000),
'log_count/DEBUG': 3,
'log_count/INFO': 7,
'request_depth_max': 1,
'response_received_count': 2,
'scheduler/dequeued': 2,
'scheduler/dequeued/memory': 2,
'scheduler/enqueued': 2,
'scheduler/enqueued/memory': 2,
'start_time': datetime.datetime(2018, 11, 7, 19, 23, 55, 547000)}
2018-11-07 14:23:56 [scrapy.core.engine] INFO: Spider closed (finished)
I was logged in, just not redirected to the home page, here's slightly updated code:
# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import FormRequest
from scrapy.shell import inspect_response
class LoginSpider(scrapy.Spider):
name = "login"
allowed_domains = ["albertacannabis.org"]
start_urls = ['https://albertacannabis.org/']
def parse(self, response):
csrf_token = response.xpath('//*[#name="__RequestVerificationToken"]/#value').extract()[0]
yield FormRequest('https://albertacannabis.org/api/cxa/LoginExtended/LoginAglc',
formdata={'__RequestVerificationToken' : csrf_token,
'UserName': 'test#gmail.com',
'Password' : '12345678',
'X-Requested-With': 'XMLHttpRequest'},
callback=self.parse_after_login)
def parse_after_login(self, response):
yield scrapy.Request('https://albertacannabis.org/',
callback=self.parse_home_page)
def parse_home_page(self, response):
if response.xpath('//a[text()="Log Out"]'):
print('Success')

Q: scrapy-redis didn't scrape any page, just finished in a second

My spider scrapes no pages,finished in less than a second ,but throws no errors.
I've checked the code and compared with another similar project which I ran successfully a few weeks ago, but still can't figure out what the problem might be .
I'm using scrapy 1.0.1 and scrapy-redis 0.6.
this is the logs:
2015-07-21 11:33:20 [scrapy] INFO: Scrapy 1.0.1 started (bot: demo)
2015-07-21 11:33:20 [scrapy] INFO: Optional features available: ssl, http11
2015-07-21 11:33:20 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'demo.spiders', 'LOG_LEVEL': 'INFO', 'SPIDER_MODULES': ['demo.spiders'], 'RETRY_HTTP_CODES': [500, 502, 503, 504, 400, 408, 404, 302, 403], 'BOT_NAME': 'demo', 'SCHEDULER': 'scrapy_redis.scheduler.Scheduler', 'DEFAULT_ITEM_CLASS': 'demo.items.DemoItem', 'REDIRECT_ENABLED': False}
2015-07-21 11:33:20 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2015-07-21 11:33:20 [scrapy] INFO: Enabled downloader middlewares: CustomUserAgentMiddleware, CustomHttpProxyMiddleware, HttpAuthMiddleware, DownloadTimeoutMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2015-07-21 11:33:20 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2015-07-21 11:33:20 [scrapy] INFO: Enabled item pipelines: RedisPipeline, DemoPipeline
2015-07-21 11:33:20 [scrapy] INFO: Spider opened
2015-07-21 11:33:20 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2015-07-21 11:33:20 [scrapy] INFO: Closing spider (finished)
2015-07-21 11:33:20 [scrapy] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
'finish_time': datetime.datetime(2015, 7, 21, 3, 33, 20, 301371),
'log_count/INFO': 7,
'start_time': datetime.datetime(2015, 7, 21, 3, 33, 20, 296941)}
2015-07-21 11:33:20 [scrapy] INFO: Spider closed (finished)
this is the spider.py
# -*- coding: utf-8 -*-
import scrapy
from demo.items import DemoItem
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy_redis.spiders import RedisMixin
from pip._vendor.requests.models import Request
class DemoCrawler(RedisMixin, CrawlSpider):
name = "demo"
redis_key = "demoCrawler:start_urls"
rules = ( Rule(LinkExtractor(allow='/shop/\d+?/$',restrict_xpaths=u"//ul/li/div[#class='txt']/div[#class='tit']/a"),callback = 'parse_demo'),
Rule(LinkExtractor(restrict_xpaths = u"//div[#class='shop-wrap']/div[#class='page']/a[#class='next']"),follow = True)
)
def parse_demo(self,response):
item = DemoItem()
temp = response.xpath(u"//div[#id='basic-info']/div[#class='action']/a/#href").re("\d.+\d")
item['id'] = temp[0] if temp else ''
temp = response.xpath(u"//div[#class='page-header']/div[#class='container']/a[#class='city J-city']/text()").extract()
item['city'] = temp[0] if temp else ''
temp = response.xpath(u"//div[#class='breadcrumb']/span/text()").extract()
item['name'] = temp[0] if temp else ''
temp = response.xpath(u"//div[#class='main']/div[#id='sales']/text()").extract()
item['deals'] = temp[0] if temp else ''
temp = response.xpath(u"//div[#class='main-nav']/div[#class='container']/a[1]/text()").extract()
item['category'] = temp[0] if temp else ''
temp = response.xpath(u"//div[#class='main']/div[#id='basic-info']/div[#class='expand-info address']/a/span/text()").extract()
item['region'] = temp[0] if temp else ''
temp = response.xpath(u"//div[#class='main']/div[#id='basic-info']/div[#class='expand-info address']/span/text()").extract()
item['address'] = temp[0] if temp else ''
yield item
To start the spider,I should type two commands in the shell:
redis-cli lpush demoCrawler:start_urls url
scrapy crawl demo
url is the specific url I'm scraping,for exmaple http://google.com

Resources