Empty result file using scrapy - web-scraping

I just started learning Python, so sorry if this is a stupid question!
I'm trying to scrape real estate data from this website: https://www.immoscout24.ch/de/buero-gewerbe-industrie/mieten/ort-zuerich?pn=2&r=10 using scrapy.
Ideally, in the end I'd get a file containing all available real estate offers and their respective address, price, area in m2, and other details (e.g. connection to public transport).
I built a test spider with scrapy but it always returns an empty file. I tried a whole bunch of different xpaths but can't get it to work. Can anyone help? Here's my code:
import scrapy


class GetdataSpider(scrapy.Spider):
    name = 'getdata'
    allowed_domains = ['immoscout24.ch']
    start_urls = ['https://www.immoscout24.ch/de/buero-gewerbe-industrie/mieten/ort-zuerich?r=10',
                  'https://www.immoscout24.ch/de/buero-gewerbe-industrie/mieten/ort-zuerich?pn=2&r=10',
                  'https://www.immoscout24.ch/de/buero-gewerbe-industrie/mieten/ort-zuerich?pn=3&r=10',
                  'https://www.immoscout24.ch/de/buero-gewerbe-industrie/mieten/ort-zuerich?pn=4&r=10',
                  'https://www.immoscout24.ch/de/buero-gewerbe-industrie/mieten/ort-zuerich?pn=5&r=10']

    def parse(self, response):
        single_offer = response.xpath('//*[@class="Body-jQnOud bjiWLb"]')
        for offer in single_offer:
            offer_price = offer.xpath('.//*[@class="Box-cYFBPY jPbvXR Heading-daBLVV dOtgYu xh-highlight"]/text()').extract()
            offer_address = offer.xpath('.//*[@class="Address__AddressStyled-lnefMi fUIggX"]/text()').extract_first()
            yield {'Price': offer_price,
                   'Address': offer_address}

First of all, you need to add a real user agent. I set the user agent in the settings.py file. I also corrected the XPath selection and built the pagination directly into start_urls, which in my experience is about twice as fast as other kinds of next-page pagination. This is the working example.
import scrapy


class GetdataSpider(scrapy.Spider):
    name = 'getdata'
    allowed_domains = ['immoscout24.ch']
    start_urls = ['https://www.immoscout24.ch/de/buero-gewerbe-industrie/mieten/ort-zuerich?pn='+ str(x)+'&r=10' for x in range(1,5)]

    def parse(self, response):
        single_offer = response.xpath('//*[@class="Content-kCEgNG degSLr"]')
        for offer in single_offer:
            name = offer.xpath('.//*[@class="Box-cYFBPY jPbvXR Heading-daBLVV dOtgYu"]/span/text()').extract_first()
            offer_address = offer.xpath('.//*[@class="AddressLine__TextStyled-eaUAMD iBNjyG"]/text()').extract_first()
            yield {'title': name,
                   'Address': offer_address}
Output:
{'title': 'CHF 950.—', 'Address': 'Friedackerstrasse 6, 8050 Zürich, ZH'}
2022-03-22 00:58:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.immoscout24.ch/de/buero-gewerbe-industrie/mieten/ort-zuerich?r=10>
{'title': 'CHF 220.—', 'Address': 'Freilagerstrasse 40, 8047 Zürich, ZH'}
2022-03-22 00:58:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.immoscout24.ch/de/buero-gewerbe-industrie/mieten/ort-zuerich?r=10>
{'title': 'CHF 220.—', 'Address': 'Freilagerstrasse 40, 8047 Zürich, ZH'}
2022-03-22 00:58:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.immoscout24.ch/de/buero-gewerbe-industrie/mieten/ort-zuerich?r=10>
{'title': 'CHF 220.—', 'Address': 'Freilagerstrasse 40, 8047 Zürich, ZH'}
2022-03-22 00:58:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.immoscout24.ch/de/buero-gewerbe-industrie/mieten/ort-zuerich?r=10>
{'title': 'CHF 4020.—', 'Address': 'Uitikonerstrasse 9, 8952 Schlieren, ZH'}
2022-03-22 00:58:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.immoscout24.ch/de/buero-gewerbe-industrie/mieten/ort-zuerich?r=10>
{'title': 'CHF 260.—', 'Address': 'Buckhauserstrasse 45/49, 8048 Zürich, ZH'}
2022-03-22 00:58:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.immoscout24.ch/de/buero-gewerbe-industrie/mieten/ort-zuerich?r=10>
{'title': 'CHF 260.—', 'Address': 'Letzigraben 75, 8003 Zürich, ZH'}
2022-03-22 00:58:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.immoscout24.ch/de/buero-gewerbe-industrie/mieten/ort-zuerich?r=10>
{'title': 'CHF 2655.—', 'Address': 'Weiningerstr. 53, 8103 Unterengstringen, ZH'}
2022-03-22 00:58:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.immoscout24.ch/de/buero-gewerbe-industrie/mieten/ort-zuerich?r=10>
{'title': 'CHF 320.—', 'Address': 'Baslerstrasse 60, 8048 Zürich, ZH'}
2022-03-22 00:58:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.immoscout24.ch/de/buero-gewerbe-industrie/mieten/ort-zuerich?r=10>
{'title': 'CHF 190.—', 'Address': 'Neugutstrasse 66, 8600 Dübendorf, ZH'}
2022-03-22 00:58:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.immoscout24.ch/de/buero-gewerbe-industrie/mieten/ort-zuerich?r=10>
{'title': 'CHF 180.—', 'Address': 'Herostrasse 9, 8048 Zürich, ZH'}
2022-03-22 00:58:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.immoscout24.ch/de/buero-gewerbe-industrie/mieten/ort-zuerich?r=10>
{'title': 'CHF 180.—', 'Address': 'Herostrasse 9, 8048 Zürich, ZH'}
2022-03-22 00:58:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.immoscout24.ch/de/buero-gewerbe-industrie/mieten/ort-zuerich?r=10>
... so on
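The answer mentions setting the user agent in settings.py but does not show that line. A minimal sketch of what the entry might look like is below. USER_AGENT is the Scrapy setting name; the browser string itself is a placeholder assumption, not the value the answerer actually used.

```python
# settings.py (sketch)
# Placeholder browser string: replace it with the user agent your own
# browser sends. This exact value is an assumption for illustration.
USER_AGENT = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/99.0.4844.74 Safari/537.36"
)
```

If only one spider needs it, the same key can instead go into that spider's custom_settings dict.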

Related

MY CODE CREATES A JSON WITH NO DATA IN IT

I am learning web scraping with scrapy. I wrote the following code, which I know works for other people. It is very basic:
import scrapy


class jumboSpider(scrapy.Spider):
    name = 'jumbo'
    start_urls = [
        'https://www.alkosto.com/computadores-tablet/computadores-portatiles/c/BI_104_ALKOS'
    ]

    def parse(self, response):
        prize = response.xpath('//span[@class="price"]/text()').getall()
        yield {
            'prize': prize
        }
Then I go to the terminal and run the following command
scrapy crawl jumbo -o jumbo.json
to create a JSON file that should contain the data I am extracting with the XPath expression. I already made sure the XPath is right: when I use the same XPath in scrapy shell, it returns the data with no issues.
The problem is that the file created by the code is empty. No data is shown.
I do not know if this happens to anyone else.
Any help is much appreciated.
The .getall() method returns a list rather than a single string, so the spider should iterate over the matched elements and extract the desired text from each one, as follows:
Script:
import scrapy


class jumboSpider(scrapy.Spider):
    name = 'jumbo'
    start_urls = ['https://www.alkosto.com/computadores-tablet/computadores-portatiles/c/BI_104_ALKOS']

    def parse(self, response):
        for price in response.xpath('//span[@class="price"]'):
            yield {
                'price': price.xpath('.//text()').get()}
Output:
{'price': '$2.249.000'}
2021-12-09 03:51:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.alkosto.com/computadores-tablet/computadores-portatiles/c/BI_104_ALKOS>
{'price': '$2.579.000'}
2021-12-09 03:51:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.alkosto.com/computadores-tablet/computadores-portatiles/c/BI_104_ALKOS>
{'price': '$2.999.900'}
2021-12-09 03:51:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.alkosto.com/computadores-tablet/computadores-portatiles/c/BI_104_ALKOS>
{'price': '$2.399.000'}
2021-12-09 03:51:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.alkosto.com/computadores-tablet/computadores-portatiles/c/BI_104_ALKOS>
{'price': '$2.399.000'}
2021-12-09 03:51:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.alkosto.com/computadores-tablet/computadores-portatiles/c/BI_104_ALKOS>
{'price': '$2.499.000'}
2021-12-09 03:51:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.alkosto.com/computadores-tablet/computadores-portatiles/c/BI_104_ALKOS>
{'price': '$3.449.000'}
2021-12-09 03:51:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.alkosto.com/computadores-tablet/computadores-portatiles/c/BI_104_ALKOS>
{'price': '$1.699.000'}
2021-12-09 03:51:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.alkosto.com/computadores-tablet/computadores-portatiles/c/BI_104_ALKOS>
{'price': '$1.769.000'}
2021-12-09 03:51:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.alkosto.com/computadores-tablet/computadores-portatiles/c/BI_104_ALKOS>
{'price': '$1.799.000'}
2021-12-09 03:51:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.alkosto.com/computadores-tablet/computadores-portatiles/c/BI_104_ALKOS>
{'price': '$949.000'}
2021-12-09 03:51:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.alkosto.com/computadores-tablet/computadores-portatiles/c/BI_104_ALKOS>
{'price': '$1.659.000'}
2021-12-09 03:51:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.alkosto.com/computadores-tablet/computadores-portatiles/c/BI_104_ALKOS>
{'price': '$1.815.000'}
2021-12-09 03:51:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.alkosto.com/computadores-tablet/computadores-portatiles/c/BI_104_ALKOS>
{'price': '$1.815.000'}
2021-12-09 03:51:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.alkosto.com/computadores-tablet/computadores-portatiles/c/BI_104_ALKOS>
{'price': '$1.749.000'}
2021-12-09 03:51:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.alkosto.com/computadores-tablet/computadores-portatiles/c/BI_104_ALKOS>
{'price': '$4.799.000'}
2021-12-09 03:51:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.alkosto.com/computadores-tablet/computadores-portatiles/c/BI_104_ALKOS>
{'price': '$799.000'}
2021-12-09 03:51:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.alkosto.com/computadores-tablet/computadores-portatiles/c/BI_104_ALKOS>
{'price': '$1.199.000'}
2021-12-09 03:51:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.alkosto.com/computadores-tablet/computadores-portatiles/c/BI_104_ALKOS>
{'price': '$1.399.000'}
2021-12-09 03:51:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.alkosto.com/computadores-tablet/computadores-portatiles/c/BI_104_ALKOS>
{'price': '$1.799.000'}
2021-12-09 03:51:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.alkosto.com/computadores-tablet/computadores-portatiles/c/BI_104_ALKOS>
{'price': '$1.299.000'}
2021-12-09 03:51:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.alkosto.com/computadores-tablet/computadores-portatiles/c/BI_104_ALKOS>
{'price': '$1.139.000'}
2021-12-09 03:51:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.alkosto.com/computadores-tablet/computadores-portatiles/c/BI_104_ALKOS>
{'price': '$1.629.000'}
2021-12-09 03:51:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.alkosto.com/computadores-tablet/computadores-portatiles/c/BI_104_ALKOS>
{'price': '$5.239.000'}
2021-12-09 03:51:51 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.alkosto.com/computadores-tablet/computadores-portatiles/c/BI_104_ALKOS>
{'price': '$1.799.000'}
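To make the difference between the two spiders concrete without running a crawl, here is a small offline sketch. It uses the standard library's ElementTree in place of Scrapy's selectors, and the markup is a made-up stand-in for the listing page, so treat it as an analogy rather than the site's real structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup standing in for the product listing page.
SAMPLE = """
<div>
  <span class="price">$2.249.000</span>
  <span class="price">$2.579.000</span>
  <span class="other">ignored</span>
</div>
""".strip()

root = ET.fromstring(SAMPLE)
price_nodes = root.findall(".//span[@class='price']")

# Analogue of the question's .getall(): one item holding the whole list.
single_item = {"price": [node.text for node in price_nodes]}

# Analogue of the answer's loop with .get(): one item per matched element.
items = [{"price": node.text} for node in price_nodes]

print(single_item)
print(items)
```

The first form gives one record with a list inside it; the second gives one record per price, which is the shape shown in the output above.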

Scrapy Returns Inconsistent Results

I'm trying to scrape an Amazon product page, but scrapy is giving me inconsistent results (sometimes it returns what I want and sometimes it returns None). I have no idea why the same code gives different results. I created a loop that yields the same request 10 times, and it gave me different results. Can anyone help me?
import scrapy
from scrapy import Request


class AmzsingleSpider(scrapy.Spider):
    name = 'amzsingle'

    def start_requests(self):
        for i in range(10):
            yield Request(url="https://www.amazon.com/%C2%A1Avancemos-Student-Level-2013-Spanish/dp/0547871929", callback=self.parse, dont_filter=True)

    def parse(self, response):
        yield {
            'title': response.xpath('//span[@id="productTitle"]/text()').get()
        }
This is the log that I get in the terminal. This attempt gave 9 None and 1 found (another time it returned 7 None and 3 found):
2021-11-27 22:08:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/robots.txt> (referer: None)
2021-11-27 22:08:30 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/%C2%A1Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-27 22:08:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/%C2%A1Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{'title': None}
2021-11-27 22:08:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/%C2%A1Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-27 22:08:33 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/%C2%A1Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{'title': None}
2021-11-27 22:08:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/%C2%A1Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-27 22:08:35 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/%C2%A1Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{'title': None}
2021-11-27 22:08:36 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/%C2%A1Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-27 22:08:36 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/%C2%A1Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{'title': None}
2021-11-27 22:08:38 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/%C2%A1Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-27 22:08:38 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/%C2%A1Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{'title': None}
2021-11-27 22:08:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/%C2%A1Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-27 22:08:39 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/%C2%A1Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{'title': None}
2021-11-27 22:08:40 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/%C2%A1Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-27 22:08:40 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/%C2%A1Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{'title': None}
2021-11-27 22:08:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/%C2%A1Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-27 22:08:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/%C2%A1Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{'title': None}
2021-11-27 22:08:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/%C2%A1Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-27 22:08:43 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/%C2%A1Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{'title': '\n¡Avancemos!: Student Edition Level 3 2013 (Spanish Edition)\n'}
2021-11-27 22:08:45 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/%C2%A1Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-27 22:08:45 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/%C2%A1Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{'title': None}
2021-11-27 22:08:45 [scrapy.core.engine] INFO: Closing spider (finished)
2021-11-27 22:08:45 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 4664,
'downloader/request_count': 11,
'downloader/request_method_count/GET': 11,
'downloader/response_bytes': 1508328,
'downloader/response_count': 11,
'downloader/response_status_count/200': 11,
'elapsed_time_seconds': 20.82323,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 11, 27, 15, 8, 45, 324091),
'httpcompression/response_bytes': 7323320,
'httpcompression/response_count': 11,
'item_scraped_count': 10,
'log_count/DEBUG': 22,
'log_count/INFO': 11,
'memusage/max': 53161984,
'memusage/startup': 53161984,
'proxies/good': 1,
'proxies/mean_backoff': 0.0,
'proxies/reanimated': 0,
'proxies/unchecked': 0,
'response_received_count': 11,
'robotstxt/request_count': 1,
'robotstxt/response_count': 1,
'robotstxt/response_status_count/200': 1,
'scheduler/dequeued': 10,
'scheduler/dequeued/memory': 10,
'scheduler/enqueued': 10,
'scheduler/enqueued/memory': 10,
'start_time': datetime.datetime(2021, 11, 27, 15, 8, 24, 500861)}
2021-11-27 22:08:45 [scrapy.core.engine] INFO: Spider closed (finished)
You can use a CSS selector.
import scrapy
from scrapy import Request


class AmzsingleSpider(scrapy.Spider):
    name = 'amzsingle-parse'

    def start_requests(self):
        for i in range(10):
            yield Request(url="https://www.amazon.com/%C2%A1Avancemos-Student-Level-2013-Spanish/dp/0547871929", callback=self.parse, dont_filter=True)

    def parse(self, response):
        yield {
            'title': response.css('#productTitle ::text').get()
        }
Output
{"title": "\n\u00a1Avancemos!: Student Edition Level 3 2013 (Spanish Edition)\n"}
2021-11-27 15:56:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/%C2%A1Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-27 15:56:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/%C2%A1Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-27 15:56:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/%C2%A1Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-27 15:56:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/%C2%A1Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-27 15:56:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/%C2%A1Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-27 15:56:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/%C2%A1Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{"title": "\n\u00a1Avancemos!: Student Edition Level 3 2013 (Spanish Edition)\n"}
2021-11-27 15:56:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/%C2%A1Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{"title": "\n\u00a1Avancemos!: Student Edition Level 3 2013 (Spanish Edition)\n"}
2021-11-27 15:56:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/%C2%A1Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{"title": "\n\u00a1Avancemos!: Student Edition Level 3 2013 (Spanish Edition)\n"}
2021-11-27 15:56:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/%C2%A1Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{"title": "\n\u00a1Avancemos!: Student Edition Level 3 2013 (Spanish Edition)\n"}
2021-11-27 15:56:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/%C2%A1Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{"title": "\n\u00a1Avancemos!: Student Edition Level 3 2013 (Spanish Edition)\n"}
2021-11-27 15:56:41 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/%C2%A1Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-27 15:56:41 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/%C2%A1Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{"title": "\n\u00a1Avancemos!: Student Edition Level 3 2013 (Spanish Edition)\n"}
2021-11-27 15:56:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.amazon.com/%C2%A1Avancemos-Student-Level-2013-Spanish/dp/0547871929> (referer: None)
2021-11-27 15:56:42 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.amazon.com/%C2%A1Avancemos-Student-Level-2013-Spanish/dp/0547871929>
{"title": "\n\u00a1Avancemos!: Student Edition Level 3 2013 (Spanish Edition)\n"}
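The titles in the output above still carry leading and trailing newlines, and the question shows that the selector can return None. A small helper along these lines could handle both cases before the item is yielded. The name clean_title is hypothetical and not part of Scrapy.

```python
from typing import Optional


def clean_title(raw: Optional[str]) -> Optional[str]:
    """Strip surrounding whitespace; pass a missing title (None) through."""
    if raw is None:
        return None
    return raw.strip()


print(clean_title("\n¡Avancemos!: Student Edition Level 3 2013 (Spanish Edition)\n"))
print(clean_title(None))
```

In the spider this would wrap the selector call, for example clean_title(response.css('#productTitle ::text').get()).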

Scrapy via API requests: [Product Catalogue page > Product Page] > Pagination

I am trying to scrape product details from the product page using API requests. I have no issues accessing the product catalogue page and getting the request URLs for each of the products, but I am having trouble passing them correctly from one callback to the next.
I think I am missing a few lines of code, or using self.parse incorrectly. If I send a new request (for each product page), should I send new request headers as well? The product page has different request headers from the product catalogue page. How do I do that?
Thank you so much for your feedback and help! Much appreciated.
This is my work so far: https://pastebin.com/H1yyDiDL
import scrapy
from scrapy.exceptions import CloseSpider
import json
class HtmshopeeSpider(scrapy.Spider):
    name = 'shopeeitem2'
    headers = {
        'authority': 'shopee.com.my',
        'method': 'GET',
        'path': '/api/v4/search/search_items?by=relevancy&keyword=chantiva&limit=60&newest=0&order=desc&page_type=search&scenario=PAGE_GLOBAL_SEARCH&version=2',
        'scheme': 'https',
        'accept': '*/*',
        'accept-encoding': 'gzip, deflate, br',
        'accept-language': 'en-US,en;q=0.9',
        'cache-control': 'no-cache',
        'cookie': 'private_content_version=75d921dc5d1fc85c97d8d9876d6e58b2; _fbp=fb.2.1626162049790.1893904607; _ga=GA1.3.518387377.1626162051; _gid=GA1.3.151467354.1626162051; _gcl_au=1.1.203553443.1626162051; x_axis_main=v_id:017a9ecfb7ba000a4be21b24a20803079001c0710093c$_sn:1$_ss:1$_pn:1%3Bexp-session$_st:1626163851002$ses_id:1626162051002%3Bexp-session',
        'if-none-match-': '55b03-676eb00af72df9e2b38a2976dd41d5ea',
        'pragma': 'no-cache',
        'referer': 'https://shopee.com.my/search?keyword=chantiva&page=0',
        'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
        'sec-ch-ua-mobile': '?0',
        'sec-fetch-dest': 'empty',
        'sec-fetch-mode': 'cors',
        'sec-fetch-site': 'same-origin',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
        'x-api-source': 'pc',
        'x-requested-with': 'XMLHttpRequest',
        'x-shopee-language': 'en'
    }

    def start_requests(self):
        yield scrapy.Request(
            url='https://shopee.com.my/api/v4/search/search_items?by=relevancy&keyword=chantiva&limit=60&newest=0&order=desc&page_type=search&scenario=PAGE_GLOBAL_SEARCH&version=2',
            headers=self.headers,
            callback=self.parse_products,
            meta={
                'newest': 0
            }
        )

    def parse_products(self, response):
        json_resp = json.loads(response.body)
        products = json_resp.get('items')
        for product in products:
            item_id = product.get('item_basic').get('itemid'),
            shop_id = product.get('item_basic').get('shopid')
            yield scrapy.Request(
                url=f"https://shopee.com.my/api/v2/item/get?itemid={item_id}&shopid={shop_id}",
                callback=self.parse_data,
                headers=self.headers
            )

    def parse_data(self, response):
        json_resp = json.loads(response.body)
        datas = json_resp.get('item')
        for data in datas:
            yield {
                'product': data.get('name')
            }
        count = 240000
        next_page = response.meta['newest'] + 60
        if next_page <= count:
            yield scrapy.Request(
                url=f"https://shopee.com.my/api/v4/search/search_items?by=relevancy&keyword=chantiva&limit=60&newest={next_page}&order=desc&page_type=search&scenario=PAGE_GLOBAL_SEARCH&version=2",
                headers=self.headers,
                meta={'newest': next_page}
            )
Here is the solution. The search returns a total of 123 items (total_count in the JSON response), at 60 items per page (limit=60 in the URL).
CODE:
import scrapy
from scrapy.exceptions import CloseSpider
import json
class HtmshopeeSpider(scrapy.Spider):
    name = 'shopeeitem2'
    headers = {
        'authority': 'shopee.com.my',
        'method': 'GET',
        'path': '/api/v4/search/search_items?by=relevancy&keyword=chantiva&limit=60&newest=0&order=desc&page_type=search&scenario=PAGE_GLOBAL_SEARCH&version=2',
        'scheme': 'https',
        'accept': '*/*',
        'accept-encoding': 'gzip, deflate, br',
        'accept-language': 'en-US,en;q=0.9',
        'cache-control': 'no-cache',
        'cookie': 'private_content_version=75d921dc5d1fc85c97d8d9876d6e58b2; _fbp=fb.2.1626162049790.1893904607; _ga=GA1.3.518387377.1626162051; _gid=GA1.3.151467354.1626162051; _gcl_au=1.1.203553443.1626162051; x_axis_main=v_id:017a9ecfb7ba000a4be21b24a20803079001c0710093c$_sn:1$_ss:1$_pn:1%3Bexp-session$_st:1626163851002$ses_id:1626162051002%3Bexp-session',
        'if-none-match-': '55b03-676eb00af72df9e2b38a2976dd41d5ea',
        'pragma': 'no-cache',
        'referer': 'https://shopee.com.my/search?keyword=chantiva&page=0',
        'sec-ch-ua': '" Not;A Brand";v="99", "Google Chrome";v="91", "Chromium";v="91"',
        'sec-ch-ua-mobile': '?0',
        'sec-fetch-dest': 'empty',
        'sec-fetch-mode': 'cors',
        'sec-fetch-site': 'same-origin',
        'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36',
        'x-api-source': 'pc',
        'x-requested-with': 'XMLHttpRequest',
        'x-shopee-language': 'en'
    }

    def start_requests(self):
        yield scrapy.Request(
            url='https://shopee.com.my/api/v4/search/search_items?by=relevancy&keyword=chantiva&limit=60&newest=0&order=desc&page_type=search&scenario=PAGE_GLOBAL_SEARCH&version=2',
            headers=self.headers,
            callback=self.parse_products,
            meta={
                'newest': 0
            }
        )

    def parse_products(self, response):
        json_resp = json.loads(response.body)
        products = json_resp.get('items')
        for product in products:
            yield {
                'Name': product.get('item_basic').get('name'),
                'Price': product.get('item_basic').get('price')
            }
        count = json_resp.get('total_count')
        next_page = response.meta['newest'] + 60
        if next_page <= count:
            yield scrapy.Request(
                url=f'https://shopee.com.my/api/v4/search/search_items?by=relevancy&keyword=chantiva&limit=60&newest={next_page}&order=desc&page_type=search&scenario=PAGE_GLOBAL_SEARCH&version=2',
                callback=self.parse_products,
                headers=self.headers,
                meta={'newest': next_page}
            )
OUTPUT: A portion of total output.
{'Name': 'Chantiva Haruan Tablet SS Plus 450mg (60 Tabs) Cepat sembuh luka', 'Price': 9000000}
2021-08-10 12:40:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://shopee.com.my/api/v4/search/search_items?by=relevancy&keyword=chantiva&limit=60&newest=0&order=desc&page_type=search&scenario=PAGE_GLOBAL_SEARCH&version=2>
{'Name': 'CHANTIVA 750MG 30 TABLETS (EXP:04/23)', 'Price': 8490000}
2021-08-10 12:40:30 [scrapy.core.scraper] DEBUG: Scraped from <200 https://shopee.com.my/api/v4/search/search_items?by=relevancy&keyword=chantiva&limit=60&newest=0&order=desc&page_type=search&scenario=PAGE_GLOBAL_SEARCH&version=2>
{'Name': 'CHANTIVA TABLET HARUAN SS PLUS 450MG (EXP: 03/2022)', 'Price': 1389000}
{'Name': 'CHANTIVA HARUAN SS PLUS TAB 60S', 'Price': 7550000}
2021-08-10 12:40:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://shopee.com.my/api/v4/search/search_items?by=relevancy&keyword=chantiva&limit=60&newest=60&order=desc&page_type=search&scenario=PAGE_GLOBAL_SEARCH&version=2>
{'Name': "CHANTIVA 450MG 1 STRIP 10'S (IKAN HARUAN)", 'Price': 2000000}
2021-08-10 12:40:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://shopee.com.my/api/v4/search/search_items?by=relevancy&keyword=chantiva&limit=60&newest=60&order=desc&page_type=search&scenario=PAGE_GLOBAL_SEARCH&version=2>
{'Name': 'CHANTIVA TABLET 750MG (EXP 04/23)', 'Price': 3800000}
2021-08-10 12:40:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://shopee.com.my/api/v4/search/search_items?by=relevancy&keyword=chantiva&limit=60&newest=60&order=desc&page_type=search&scenario=PAGE_GLOBAL_SEARCH&version=2>
{'Name': "TrueLifeSciences® CHANTIVA Haruan SS Plus 450mg Tablet 60's", 'Price': 8460000}
2021-08-10 12:40:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://shopee.com.my/api/v4/search/search_items?by=relevancy&keyword=chantiva&limit=60&newest=60&order=desc&page_type=search&scenario=PAGE_GLOBAL_SEARCH&version=2>
{'Name': 'Chantiva 450mg Tablet', 'Price': 9400000}
2021-08-10 12:40:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://shopee.com.my/api/v4/search/search_items?by=relevancy&keyword=chantiva&limit=60&newest=60&order=desc&page_type=search&scenario=PAGE_GLOBAL_SEARCH&version=2>
{'Name': 'Chantiva Tablet Haruan SS Plus 450mg 60s', 'Price': 8565000}
2021-08-10 12:40:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://shopee.com.my/api/v4/search/search_items?by=relevancy&keyword=chantiva&limit=60&newest=60&order=desc&page_type=search&scenario=PAGE_GLOBAL_SEARCH&version=2>
{'Name': 'Chantiva Skin Fix Cream 20g x2 (Twin Pack)', 'Price': 5380000}
2021-08-10 12:40:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://shopee.com.my/api/v4/search/search_items?by=relevancy&keyword=chantiva&limit=60&newest=60&order=desc&page_type=search&scenario=PAGE_GLOBAL_SEARCH&version=2>
{'Name': "CHANTIVA 450MG TABLET 60'S", 'Price': 7690000}
2021-08-10 12:40:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://shopee.com.my/api/v4/search/search_items?by=relevancy&keyword=chantiva&limit=60&newest=60&order=desc&page_type=search&scenario=PAGE_GLOBAL_SEARCH&version=2>
{'Name': 'CHANTIVA TABLET HARUAN (450MG/750MG)', 'Price': 2000000}
2021-08-10 12:40:31 [scrapy.core.scraper] DEBUG: Scraped from <200 https://shopee.com.my/api/v4/search/search_items?by=relevancy&keyword=chantiva&limit=60&newest=60&order=desc&page_type=search&scenario=PAGE_GLOBAL_SEARCH&version=2>
{'Name': 'CHANTIVA 750MG 30 TABLETS (EXP: 09/2022)', 'Price': 8490000}
2021-08-10 12:40:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://shopee.com.my/api/v4/search/search_items?by=relevancy&keyword=chantiva&limit=60&newest=120&order=desc&page_type=search&scenario=PAGE_GLOBAL_SEARCH&version=2> (referer: https://shopee.com.my/search?keyword=chantiva&page=0)
2021-08-10 12:40:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://shopee.com.my/api/v4/search/search_items?by=relevancy&keyword=chantiva&limit=60&newest=120&order=desc&page_type=search&scenario=PAGE_GLOBAL_SEARCH&version=2>{'Name': 'CHANTIVA TABLET IKAN HARUAN 450MG SAKIT LUTUT SAKIT URAT LUKA 60"S', 'Price': 7490000}
2021-08-10 12:40:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://shopee.com.my/api/v4/search/search_items?by=relevancy&keyword=chantiva&limit=60&newest=120&order=desc&page_type=search&scenario=PAGE_GLOBAL_SEARCH&version=2>{'Name': "[CLEARANCE][🎁WITH FREE GIFT🎁] CHANTIVA TABLET HARUAN SS PLUS 60'S (EXP:02/2021)", 'Price': 7600000}
2021-08-10 12:40:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://shopee.com.my/api/v4/search/search_items?by=relevancy&keyword=chantiva&limit=60&newest=120&order=desc&page_type=search&scenario=PAGE_GLOBAL_SEARCH&version=2>{'Name': "CHANTIVA 450MG TABLET 6X10'S by strip Exp:10/21", 'Price': 990000}
2021-08-10 12:40:32 [scrapy.core.engine] INFO: Closing spider (finished)
2021-08-10 12:40:32 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 3242,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 40725,
'downloader/response_count': 3,
'downloader/response_status_count/200': 3,
'elapsed_time_seconds': 4.219452,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 8, 10, 6, 40, 32, 976939),
'httpcompression/response_bytes': 377162,
'httpcompression/response_count': 3,
'item_scraped_count': 123,
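The pagination arithmetic in the answer can be checked in isolation. Below is a minimal sketch, assuming the API reports total_count and each page holds 60 items, as in the run above. The helper name page_offsets is hypothetical. Note that it stops strictly before total_count, which differs slightly from the answer's `<=` check (that one would request one extra, empty page when the total is an exact multiple of 60).

```python
from typing import List


def page_offsets(total_count: int, per_page: int = 60) -> List[int]:
    """Return the 'newest' offsets needed to cover total_count items."""
    return list(range(0, total_count, per_page))


# With the 123 items reported in the run above:
print(page_offsets(123))  # [0, 60, 120]
```

Three offsets match the three requests counted in the stats ('downloader/request_count': 3).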

scrapy not working on imdb keywords pages

Here is how I intend this code to work:
I have a keyword, say, "gadgets". I search for titles on IMDb's advanced search page. I want the code to go to each title page, then to the keywords page of each title, and then download the title and all the keywords.
The code structure looks good to me, but it is not working.
Could you suggest whether it needs to be rewritten, or whether it can be corrected with some advice?
Here is my spider:
import scrapy
class KwordsSpider(scrapy.Spider):
    name = 'ImdbSpider'
    allowed_domains = ['imdb.com']
    start_urls = [
        'https://www.imdb.com/search/title/?keywords=gadgets'
    ]

    def parse(self, response):
        titleLinks = response.xpath('//*[@class="lister-item-content"]')
        for link in titleLinks:
            title_url = 'https://www.imdb.com'+link.xpath('.//h3/a/@href').extract_first()
            yield scrapy.Request(title_url, callback=self.parse_title)
        next_page_url = 'https://www.imdb.com'+response.xpath('//div[@class="article"]/div[@class="desc"]/a[@href]').extract_first()
        if next_page_url is not None:
            next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(next_page_url, callback=self.parse)

    def parse_title(self, response):
        keywords_url = 'https://www.imdb.com' + response.xpath('//nobr/a[@href]').extract_first()
        yield scrapy.Request(keywords_url, callback=self.parse_keys)

    #looking at the keywords page
    def parse_keys(self, response):
        title = response.xpath('//h3/a/text()').extract_first()
        keys = response.xpath('//div[@class="sodatext"]/a/text()').extract()
        print('my print'+title)
        yield {
            'title': title,
            'Keywords': keys,
        }
Here are a few lines from the PowerShell output:
2020-05-02 08:33:40 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-05-02 08:33:40 [scrapy.core.engine] INFO: Spider opened
2020-05-02 08:33:40 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-05-02 08:33:40 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-05-02 08:33:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.imdb.com/search/title/?keywords=gadgets> (referer: None)
2020-05-02 08:33:43 [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.imdb.com<a href="': <GET https://www.imdb.com<a href="/search/title/?keywords=gadgets&start=51%22%20class=%22lister-page-next%20next-page%22%3ENext%20%C2%BB%3C/a%3E>
2020-05-02 08:33:46 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.imdb.com/title/tt3896198/> (referer: https://www.imdb.com/search/title/?keywords=gadgets)
2020-05-02 08:34:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.imdb.com/title/tt0369171/> (referer: https://www.imdb.com/search/title/?keywords=gadgets)
2020-05-02 08:34:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.imdb.com/title/tt1149317/> (referer: https://www.imdb.com/search/title/?keywords=gadgets)
2020-05-02 08:34:11 [scrapy.core.engine] INFO: Closing spider (finished)
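A few XPaths in your script were wrong. An expression ending in a[@href] selects the whole <a> element rather than the value of its href attribute, so the string you concatenated was not a valid URL (that is the "Filtered offsite request" line in your log). Use a/@href to get the attribute value, and response.urljoin to build the absolute URL. I've fixed them; it should work now.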
import scrapy

class KwordsSpider(scrapy.Spider):
    name = 'ImdbSpider'
    start_urls = [
        'https://www.imdb.com/search/title/?keywords=gadgets'
    ]

    def parse(self, response):
        titleLinks = response.xpath('//*[@class="lister-item-content"]')
        for link in titleLinks:
            # a/@href returns the attribute value; urljoin makes it absolute
            title_url = response.urljoin(link.xpath('.//h3/a/@href').get())
            yield scrapy.Request(title_url, callback=self.parse_title)
        next_page_url = response.xpath('//div[@class="article"]/div[@class="desc"]/a/@href').get()
        if next_page_url:
            next_page_url = response.urljoin(next_page_url)
            yield scrapy.Request(next_page_url, callback=self.parse)

    def parse_title(self, response):
        keywords_url = response.urljoin(response.xpath('//nobr/a/@href').get())
        yield scrapy.Request(keywords_url, callback=self.parse_keys)

    def parse_keys(self, response):
        title = response.xpath('//h3/a/text()').get()
        keys = response.xpath('//div[@class="sodatext"]/a/text()').getall()
        yield {
            'title': title,
            'Keywords': keys,
        }
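To see why the original a[@href] expressions failed, here is a minimal sketch that uses only the standard library. ElementTree stands in for Scrapy's selectors (an assumption made for illustration; its XPath subset is smaller than Scrapy's), and the HTML fragment is a made-up snippet shaped like the "next page" link that appears in the log.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urljoin

# Hypothetical fragment resembling the "next page" link from the log.
SNIPPET = (
    '<div class="desc">'
    '<a href="/search/title/?keywords=gadgets&amp;start=51" '
    'class="lister-page-next next-page">Next</a>'
    '</div>'
)

root = ET.fromstring(SNIPPET)
# The [@href] predicate only filters elements that have an href;
# the result is still the <a> element, not the attribute value.
link = root.find('.//a[@href]')

# Roughly what extracting 'a[@href]' gives: the serialized element.
whole_element = ET.tostring(link, encoding='unicode')
# Roughly what extracting 'a/@href' gives: just the attribute value.
href_only = link.get('href')

base = 'https://www.imdb.com/search/title/?keywords=gadgets'
bad_url = 'https://www.imdb.com' + whole_element  # markup glued onto the host
good_url = urljoin(base, href_only)               # resolved against the page URL

print(bad_url)
print(good_url)
```

The first printed value starts with https://www.imdb.com<a href=, which matches the malformed request in the log; the second is a usable absolute URL.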

Scrapy 0 pages crawled but no visible issue?

I used Portia to create a spider and then downloaded it as a Scrapy project. The spider runs without errors, but the log says "Crawled 0 pages (at 0 pages/min)" and nothing gets saved. However, the log also shows every page crawled with a 200 response, and then the byte counts at the end.
Spider Code
from __future__ import absolute_import
from scrapy import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.loader import ItemLoader
from scrapy.loader.processors import Identity
from scrapy.spiders import Rule
from ..utils.spiders import BasePortiaSpider
from ..utils.starturls import FeedGenerator, FragmentGenerator
from ..utils.processors import Item, Field, Text, Number, Price, Date, Url, Image, Regex
from ..items import PortiaItem, AllProductsBooksToScrapeSandboxItem

class BooksToscrape(BasePortiaSpider):
    name = "books.toscrape.com"
    allowed_domains = ['books.toscrape.com']
    start_urls = [{'fragments': [{'valid': True,
                                  'type': 'fixed',
                                  'value': 'http://books.toscrape.com/catalogue/page-'},
                                 {'valid': True,
                                  'type': 'range',
                                  'value': '1-50'},
                                 {'valid': True,
                                  'type': 'fixed',
                                  'value': '.html'}],
                   'type': 'generated',
                   'url': 'http://books.toscrape.com/catalogue/page-[1-50].html'}]
    rules = [
        Rule(
            LinkExtractor(
                allow=(),
                deny=('.*')
            ),
            callback='parse_item',
            follow=True
        )
    ]
    items = [
        [
            Item(
                AllProductsBooksToScrapeSandboxItem, None, '.product_pod', [
                    Field(
                        'title', 'h3 > a::attr(title)', []), Field(
                        'price', '.product_price > .price_color *::text', [])])]]
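A rough sketch of how the start_urls spec above could map to the 50 page requests seen in the log, and of what the deny pattern in the rule matches. expand_fragment and expand_start_url are hypothetical helpers written for this illustration; they are not Portia's FragmentGenerator, whose implementation is not shown here.

```python
import re
from itertools import product


def expand_fragment(fragment):
    """Return every string a single fragment can take (hypothetical helper)."""
    if fragment['type'] == 'fixed':
        return [fragment['value']]
    if fragment['type'] == 'range':
        start, end = fragment['value'].split('-')
        return [str(n) for n in range(int(start), int(end) + 1)]
    raise ValueError('unsupported fragment type: %r' % fragment['type'])


def expand_start_url(spec):
    """Join one option from each fragment, in order, into full URLs."""
    options = [expand_fragment(f) for f in spec['fragments']]
    return [''.join(parts) for parts in product(*options)]


spec = {
    'fragments': [
        {'valid': True, 'type': 'fixed',
         'value': 'http://books.toscrape.com/catalogue/page-'},
        {'valid': True, 'type': 'range', 'value': '1-50'},
        {'valid': True, 'type': 'fixed', 'value': '.html'},
    ],
    'type': 'generated',
}

urls = expand_start_url(spec)
print(len(urls), urls[0], urls[-1])

# ('.*') has no trailing comma, so it is the plain string '.*', not a tuple.
deny = ('.*')
# As a regular expression, '.*' matches any URL, including every start URL.
all_denied = all(re.search(deny, url) is not None for url in urls)
print(type(deny).__name__, all_denied)
```

Since the deny pattern matches every URL, it is worth checking whether this rule can extract any links at all; whether that explains the missing items depends on how BasePortiaSpider handles the start URL responses.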
Pipeline Code
I added open_spider and close_spider methods to write the items to a JSON Lines file while crawling, and I think it works because the .jl file gets created.
# -*- coding: utf-8 -*-
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json

class TesterPipeline(object):
    def open_spider(self, spider):
        self.file = open('items.jl', 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        line = json.dumps(dict(item)) + "\n"
        self.file.write(line)
        return item
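One caveat about using the .jl file as evidence: open_spider creates the file whether or not any item ever reaches process_item. The sketch below exercises a pipeline of the same shape outside Scrapy. The class name, the path parameter, and the sample item values are assumptions made for this illustration, and spider=None is only a placeholder argument.

```python
import json
import os
import tempfile


class JsonLinesPipeline:
    """Same shape as TesterPipeline, with the output path as a parameter."""

    def __init__(self, path):
        self.path = path

    def open_spider(self, spider):
        # Opening in 'w' mode creates the file immediately.
        self.file = open(self.path, 'w')

    def close_spider(self, spider):
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item)) + "\n")
        return item


tmp_dir = tempfile.mkdtemp()

# Run 1: the spider opens and closes without yielding any item.
empty_path = os.path.join(tmp_dir, 'empty.jl')
pipeline = JsonLinesPipeline(empty_path)
pipeline.open_spider(spider=None)
pipeline.close_spider(spider=None)

# Run 2: one (made-up) item passes through process_item.
full_path = os.path.join(tmp_dir, 'full.jl')
pipeline = JsonLinesPipeline(full_path)
pipeline.open_spider(spider=None)
pipeline.process_item({'title': 'Example Book', 'price': '51.77'}, spider=None)
pipeline.close_spider(spider=None)

empty_size = os.path.getsize(empty_path)
full_size = os.path.getsize(full_path)
print(os.path.exists(empty_path), empty_size, full_size > 0)
```

An existing but zero-byte items.jl therefore shows only that the pipeline was enabled, not that items were scraped.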
Settings Code
I also enabled the pipeline in the settings so that it runs.
# -*- coding: utf-8 -*-
# Scrapy settings for Tester project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://doc.scrapy.org/en/latest/topics/settings.html
# https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
# https://doc.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'Tester'
SPIDER_MODULES = ['Tester.spiders']
NEWSPIDER_MODULE = 'Tester.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'Tester (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = True
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://doc.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://doc.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'Tester.middlewares.TesterSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'Tester.middlewares.TesterDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://doc.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://doc.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
'Tester.pipelines.TesterPipeline': 300,
}
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://doc.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
When I run the spider, the following log is created:
(scrape) C:\Users\da74\Desktop\tester>scrapy crawl books.toscrape.com
2018-07-24 12:18:15 [scrapy.utils.log] INFO: Scrapy 1.5.0 started (bot: Tester)
2018-07-24 12:18:15 [scrapy.utils.log] INFO: Versions: lxml 4.2.2.0, libxml2 2.9.8, cssselect 1.0.3, parsel 1.4.0, w3lib 1.19.0, Twisted 17.5.0, Python 3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 11:27:44) [MSC v.1900 64 bit (AMD64)], pyOpenSSL 18.0.0 (OpenSSL 1.0.2o 27 Mar 2018), cryptography 2.2.2, Platform Windows-10-10.0.17134-SP0
2018-07-24 12:18:15 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'Tester', 'NEWSPIDER_MODULE': 'Tester.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['Tester.spiders']}
2018-07-24 12:18:15 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2018-07-24 12:18:16 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-07-24 12:18:16 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-07-24 12:18:16 [scrapy.middleware] INFO: Enabled item pipelines:
['Tester.pipelines.TesterPipeline']
2018-07-24 12:18:16 [scrapy.core.engine] INFO: Spider opened
2018-07-24 12:18:16 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-07-24 12:18:16 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-07-24 12:18:16 [scrapy.core.engine] DEBUG: Crawled (404) <GET http://books.toscrape.com/robots.txt> (referer: None)
2018-07-24 12:18:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-1.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-2.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-7.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-4.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-3.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-9.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-5.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-8.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-6.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-10.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-12.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-11.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-14.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-15.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-16.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-17.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-13.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-18.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-19.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-21.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-20.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-22.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-23.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-25.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-24.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-26.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-27.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-32.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-29.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-30.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-33.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-28.html> (referer: None)
2018-07-24 12:18:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-31.html> (referer: None)
2018-07-24 12:18:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-34.html> (referer: None)
2018-07-24 12:18:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-35.html> (referer: None)
2018-07-24 12:18:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-36.html> (referer: None)
2018-07-24 12:18:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-39.html> (referer: None)
2018-07-24 12:18:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-40.html> (referer: None)
2018-07-24 12:18:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-38.html> (referer: None)
2018-07-24 12:18:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-41.html> (referer: None)
2018-07-24 12:18:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-37.html> (referer: None)
2018-07-24 12:18:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-42.html> (referer: None)
2018-07-24 12:18:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-43.html> (referer: None)
2018-07-24 12:18:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-44.html> (referer: None)
2018-07-24 12:18:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-47.html> (referer: None)
2018-07-24 12:18:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-45.html> (referer: None)
2018-07-24 12:18:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-46.html> (referer: None)
2018-07-24 12:18:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-48.html> (referer: None)
2018-07-24 12:18:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-49.html> (referer: None)
2018-07-24 12:18:18 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://books.toscrape.com/catalogue/page-50.html> (referer: None)
2018-07-24 12:18:18 [scrapy.core.engine] INFO: Closing spider (finished)
2018-07-24 12:18:18 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 12168,
'downloader/request_count': 51,
'downloader/request_method_count/GET': 51,
'downloader/response_bytes': 299913,
'downloader/response_count': 51,
'downloader/response_status_count/200': 50,
'downloader/response_status_count/404': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 7, 24, 4, 18, 18, 598891),
'log_count/DEBUG': 52,
'log_count/INFO': 7,
'response_received_count': 51,
'scheduler/dequeued': 50,
'scheduler/dequeued/memory': 50,
'scheduler/enqueued': 50,
'scheduler/enqueued/memory': 50,
'start_time': datetime.datetime(2018, 7, 24, 4, 18, 16, 208142)}
2018-07-24 12:18:18 [scrapy.core.engine] INFO: Spider closed (finished)
I don't understand why it isn't gathering items. It first says that 0 pages were crawled and then shows a 200 success response for each page.
If anyone has any idea what to try to make it scrape the items, that would be helpful.
Thank you
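A note on reading the log above: the "Crawled 0 pages" line is printed by the LogStats extension when the spider opens, before any response has arrived, so on its own it does not indicate a failure. The final stats dump is the more telling place to look. A run that yields items includes an item_scraped_count key (as in the Shopee dump earlier), and this dump has none. A minimal sketch of that check, with values copied from the two dumps and a hypothetical summarize helper:

```python
# Keys copied from the books.toscrape.com stats dump above.
portia_stats = {
    'downloader/response_status_count/200': 50,
    'downloader/response_status_count/404': 1,
    'response_received_count': 51,
}

# Keys copied from the earlier Shopee dump, for contrast.
shopee_stats = {
    'downloader/response_status_count/200': 3,
    'item_scraped_count': 123,
}


def summarize(stats):
    """Return (pages fetched with status 200, items scraped) from a stats dict."""
    pages_ok = stats.get('downloader/response_status_count/200', 0)
    # The key is absent when no item was scraped, so default to 0.
    items = stats.get('item_scraped_count', 0)
    return pages_ok, items


print(summarize(portia_stats))
print(summarize(shopee_stats))
```

Here the pages were downloaded successfully but no item was produced, which points at the extraction step (rules, callback, or item selectors) rather than at the download.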

Resources