Scrapy scrape multiple pages - web-scraping

I have a function that can scrape an individual page. How can I scrape multiple pages after following the corresponding links? Do I need a separate function that calls parse(), like gotoIndivPage() below? Thank you!
import scrapy
class trainingScraper(scrapy.Spider):
    name = "..."
    start_urls = ["url with links to multiple pages"]
    # for scraping individual page
    def parse(self, response):
        SELECTOR1 = '.entry-title ::text'
        SELECTOR2 = '//li[@class="location"]/ul/li/a/text()'
        yield {
            'title': response.css(SELECTOR1).extract_first(),
            'date': response.xpath(SELECTOR2).extract_first(),
        }
    def gotoIndivPage(self, response):
        PAGE_SELECTOR = '//h3[@class="entry-title"]/a/@href'
        # .extract() turns the matched selectors into plain href strings
        for page in response.xpath(PAGE_SELECTOR).extract():
            if page:
                yield scrapy.Request(
                    response.urljoin(page),
                    callback=self.parse
                )

I generally create a new function for every different type of HTML structure I'm trying to scrape. So if your links send you to a page with a different HTML structure than your starting page, I would create a new function and pass that as the callback.
def parseNextPage(self, response):
    # Parse new page
    pass
def parse(self, response):
    SELECTOR1 = '.entry-title ::text'
    SELECTOR2 = '//li[@class="example"]/ul/li/a/text()'
    yield {
        'title': response.css(SELECTOR1).extract_first(),
        'date': response.xpath(SELECTOR2).extract_first(),
    }
    # Extract the link to follow and resolve it against the current page URL
    href = response.xpath('//li[@class="location"]/ul/li/a/@href').extract_first()
    if href:
        yield scrapy.Request(
            url=response.urljoin(href),
            callback=self.parseNextPage
        )
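Links extracted from a page are often relative, which is why the question's code passes them through response.urljoin before requesting them. As a rough, standard-library-only sketch of that resolution step (the helper name `absolute_links` and the example.com URLs are made up for illustration, and unlike Scrapy this ignores any `<base>` tag in the page):

```python
from urllib.parse import urljoin


def absolute_links(page_url, hrefs):
    """Resolve hrefs against page_url, skipping empty values and duplicates."""
    seen = set()
    links = []
    for href in hrefs:
        if not href:
            continue  # an empty or missing href cannot be followed
        absolute = urljoin(page_url, href)
        if absolute not in seen:
            seen.add(absolute)
            links.append(absolute)
    return links


# Hypothetical index page and hrefs, only to show the behaviour
print(absolute_links(
    "https://example.com/trainings/",
    ["course-1/", "/about", None, "course-1/"],
))
```

Each resulting URL would then go into its own request, with the callback chosen according to the kind of page it points to.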

Related

Scraping dynamic amazon page with scrolling

I am trying to scrape products on Amazon's Best Seller 100 for a particular category. For example -
https://www.amazon.com/Best-Sellers-Home-Kitchen/zgbs/home-garden/ref=zg_bs_nav_0
The 100 products are divided into two pages with 50 products on each page.
Earlier, the page was static and all 50 products used to appear on the page. However, the page is now dynamic, and I need to scroll down to see all 50 products.
I was using Scrapy to scrape the page earlier. I would really appreciate it if you could help me out with this. Thanks!
Adding my code below -
import scrapy
from scrapy_splash import SplashRequest
class BsrNewSpider(scrapy.Spider):
    name = 'bsr_new'
    allowed_domains = ['www.amazon.in']
    #start_urls = ['https://www.amazon.in/gp/bestsellers/kitchen/ref=zg_bs_nav_0']
    script = '''
        function main(splash, args)
            splash.private_mode_enabled = false
            url = args.url
            assert(splash:go(url))
            assert(splash:wait(0.5))
            return splash:html()
        end
    '''
    def start_requests(self):
        url = 'https://www.amazon.in/gp/bestsellers/kitchen/ref=zg_bs_nav_0'
        yield SplashRequest(url, callback=self.parse, endpoint="execute", args={
            'lua_source': self.script
        })
    def parse(self, response):
        for rev in response.xpath("//div[@id='gridItemRoot']"):
            yield {
                'Segment': "Home", #Enter name of the segment here
                #'Sub-segment':segment,
                'ASIN': rev.xpath(".//div/div[@class='zg-grid-general-faceout']/div/a[@class='a-link-normal']/@href").re(r'\S*/dp/(\S+)_\S+')[0][:10],
                'Rank': rev.xpath(".//span[@class='zg-bdg-text']/text()").get(),
                'Name': rev.xpath("normalize-space(.//a[@class='a-link-normal']/span/div/text())").get(),
                'No. of Ratings': rev.xpath(".//span[contains(@class,'a-size-small')]/text()").get(),
                'Rating': rev.xpath(".//span[@class='a-icon-alt']/text()").get(),
                'Price': rev.xpath(".//span[@class='a-size-base a-color-price']//text()").get()
            }
        next_page = response.xpath("//a[text()='Next page']/@href").get()
        if next_page:
            url = response.urljoin(next_page)
            yield SplashRequest(url, callback=self.parse, endpoint="execute", args={
                'lua_source': self.script
            })
Regards
Sreejan
Here is an alternative approach that does not need Splash.
The ASINs of all 50 products are embedded in the first page itself. You can extract these ASINs and build all 50 product URLs.
import scrapy
import json
class AmazonSpider(scrapy.Spider):
    custom_settings = {
        'DEFAULT_REQUEST_HEADERS': ''  # Important
    }
    name = 'amazon'
    start_urls = ['https://www.amazon.com/Best-Sellers-Home-Kitchen/zgbs/home-garden/ref=zg_bs_pg_1?_encoding=UTF8&pg=1']
    def parse(self, response):
        raw_data = response.css('[data-client-recs-list]::attr(data-client-recs-list)').get()
        data = json.loads(raw_data)
        for item in data:
            url = 'https://www.amazon.com/dp/{}'.format(item['id'])
            yield scrapy.Request(url, callback=self.parse_item)
    def parse_item(self, response):
        ...
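The URL-building step can be tried without running a spider. Below is a minimal sketch, assuming the attribute value is a JSON list of objects that each carry an `id` key, as the answer's code implies. The helper name `product_urls` and the sample ASINs are hypothetical.

```python
import json


def product_urls(raw_data, template="https://www.amazon.com/dp/{}"):
    """Turn the JSON text of data-client-recs-list into product URLs."""
    if not raw_data:
        # .get() returns None when the attribute is missing from the page
        return []
    items = json.loads(raw_data)
    return [template.format(item["id"]) for item in items if item.get("id")]


# Made-up attribute value, shaped the way the answer's code expects
sample = '[{"id": "B000000001"}, {"id": "B000000002"}]'
for url in product_urls(sample):
    print(url)
```

Returning an empty list when the attribute is missing keeps the callback from failing inside json.loads if Amazon serves a different page, for example a captcha.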

Scrapy working erratically while scraping Amazon

I am trying to scrape the brand name that appears on a product page on Amazon. It can be found in the line below the product name as 'Visit the xxxx store' or 'Brand: xxxx'. Once I scrape this text, I use the replace function in Excel to get the brand name.
For example visit the following page - https://www.amazon.in/dp/B09HC794BD
I use the following code to do so -
import scrapy
# asin_list and reviews_url are defined elsewhere (not shown in the question)
class BrandSpider(scrapy.Spider):
    name = 'brand'
    allowed_domains = ['www.amazon.in']
    def start_requests(self):
        for asin in asin_list:
            url = reviews_url.format(asin)
            yield scrapy.Request(url=url, meta={'ASIN': asin})
    def parse(self, response):
        asin = response.request.meta['ASIN']
        brand = response.xpath("normalize-space(//a[contains(@id,'bylineInfo')]/text())").get()
        yield {
            'ASIN': asin,
            'Brand': brand
        }
While it generally works, in several instances it returns an empty string '' for the brand name. I tried inspecting the page on the website and it works fine there. Please help me with this. Thanks in advance!
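The Excel clean-up step could also be done in Python before the item is yielded. This is only a sketch: it assumes the byline text has exactly the two shapes quoted in the question ('Visit the xxxx store' or 'Brand: xxxx'), and the helper name `clean_brand` is hypothetical. It does not address why the XPath sometimes returns an empty string; it only maps that case to None so it is easy to spot in the output.

```python
import re

# The two byline shapes mentioned in the question
_BYLINE_PATTERNS = [
    re.compile(r"^Visit the (.+?) Store$", re.IGNORECASE),
    re.compile(r"^Brand:\s*(.+)$", re.IGNORECASE),
]


def clean_brand(byline):
    """Return the bare brand name, or None when nothing was scraped."""
    text = (byline or "").strip()
    if not text:
        return None
    for pattern in _BYLINE_PATTERNS:
        match = pattern.match(text)
        if match:
            return match.group(1).strip()
    return text  # unknown shape: keep the text unchanged


print(clean_brand("Visit the Acme Store"))
print(clean_brand("Brand: Acme"))
print(clean_brand(""))
```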

Is it possible to scrape data from a sublink and go back to the main link using scrapy? Or is there any tool that I can use

For example, I will scrape a link, open any other links included in it, scrape those links, and then go back to the main link.
You could send a request back to the page your requests came from, but I don't think that would make much sense.
Since you can get all the data you need from the main link the first time around, I think it would be better to pass the item you need to the following pages with the meta attribute of Request or, in newer versions of Scrapy, with cb_kwargs.
yield Request(
    "http://www.example.com",
    self.callback,
    meta={
        'item': your_item,
        'main_url': response.url
    }
)
You could then access the item or main link using the response's meta attribute, work with your old item and then send a request back to the main link and callback with the item.
def callback(self, response):
    item = response.meta['item']
    main_url = response.meta['main_url']
    ...
    yield Request(main_url, self.parse, meta={'item': item})
For the next example I will use this item.
class MyItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    author_description = scrapy.Field()
    tags = scrapy.Field()
I extended the example from here to follow the "about the author" link and then yield the item.
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/page/1/']
    def parse(self, response):
        for quote in response.css('div.quote'):
            item = MyItem()
            item['text'] = quote.css('span.text::text').get()
            item['author'] = quote.css('span small::text').get()
            item['tags'] = quote.css('div.tags a.tag::text').getall()
            about_page = quote.xpath('.//a[text()="(about)"]/@href').extract_first()
            yield response.follow(
                about_page,
                self.parse_about_page,
                meta={
                    'item': item,
                    'main_url': response.url
                }
            )
    def parse_about_page(self, response):
        item = response.meta['item']
        item['author_description'] = response.xpath('//div[@class="author-description"]').extract_first()
        yield item
        # here you could go back to the main_page
        # beware, this will only work if you turn off the duplicate filter
        # and then result in an endless loop!
        yield response.follow(
            response.meta['main_url'],
            self.parse,
            meta={'item': item}
        )
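The warning in the comments is easier to see with a toy model of the duplicate filter. This is a deliberately simplified sketch: Scrapy's real filter fingerprints requests (method, canonicalised URL, body) rather than comparing raw URL strings, and the class name `SeenUrlFilter` is made up for illustration.

```python
class SeenUrlFilter:
    """Toy duplicate filter: remembers every URL it has let through."""

    def __init__(self):
        self.seen = set()

    def should_schedule(self, url, dont_filter=False):
        if dont_filter:
            # Requests marked dont_filter bypass the check entirely
            return True
        if url in self.seen:
            return False  # already requested once, so it is dropped
        self.seen.add(url)
        return True


dupefilter = SeenUrlFilter()
main_url = "http://quotes.toscrape.com/page/1/"
about_url = "http://quotes.toscrape.com/author/Albert-Einstein"

print(dupefilter.should_schedule(main_url))    # first visit to the main page
print(dupefilter.should_schedule(about_url))   # following the about link
print(dupefilter.should_schedule(main_url))    # going back is dropped
print(dupefilter.should_schedule(main_url, dont_filter=True))  # bypasses the filter
```

With the filter bypassed, parse runs again on the main page, follows the same about links again, and the cycle repeats, which is the endless loop the comment warns about.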

Scrapy response.css has empty list

I need to get the product name and price from this page:
https://www.lazada.com.my/shop-smart-tvs/
I started with
scrapy shell "https://www.lazada.com.my/shop-smart-tvs/"
The output I got was this.
response <200 https://www.lazada.com.my/shop-smart-tvs/>
I also used a proxy and user agents to make sure there weren't any blocks.
But when I used
response.css(".c13VH6 , .c16H9d a").css('::text').extract()
and
response.css(".c16H9d a::attr(title)").extract()
I get blank lists []
I tried the same approach with another site, and it works there.
P.S. I used a Chrome CSS selector widget to get the CSS selectors.
Please tell me where I went wrong.
The data is probably loaded dynamically, so you would have to render the page with something like Splash or Selenium to make your CSS expression work.
When you look at the page source, you can see that the product data is present in a big JSON object, so I would just load the JSON as a Python dictionary and get the data you want from there:
import scrapy
import json
class LazadaSpider(scrapy.Spider):
    name = 'lazada'
    allowed_domains = ['lazada.com.my']
    start_urls = ['https://www.lazada.com.my/shop-smart-tvs/']
    def parse(self, response):
        script = response.xpath(
            "//script[starts-with(text(), 'window.pageData')]/text()"
        ).extract_first()
        # The JSON object starts at the first '{' of the script text
        first = script.index('{')
        last = len(script)
        products = json.loads(script[first:last])
        items = products['mods']['listItems']
        for item in items:
            name = item['name']
            price = item['price']
            yield {'name': name,
                   'price': price}
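Slicing from the first '{' to the end of the script works as long as nothing follows the JSON object. If the script ends with a semicolon or further statements, json.loads would fail on the extra text. A slightly more tolerant sketch uses json.JSONDecoder.raw_decode, which parses one JSON value and ignores whatever comes after it. The helper name `extract_page_data` and the sample script text are assumptions for illustration, not Lazada's actual markup.

```python
import json


def extract_page_data(script_text):
    """Parse the first JSON object embedded in a script's text."""
    start = script_text.index("{")  # raises ValueError if there is no object
    data, _end = json.JSONDecoder().raw_decode(script_text, start)
    return data


# Made-up script content with a trailing semicolon after the object
sample = 'window.pageData = {"mods": {"listItems": [{"name": "TV A", "price": "999.00"}]}};'
page_data = extract_page_data(sample)
for item in page_data["mods"]["listItems"]:
    print(item["name"], item["price"])
```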

Change website deliver country with Scrapy

I need to scrape the website http://www.yellowkorner.com/
Choosing a different country changes all the prices. There are 40+ countries listed, and each of them must be scraped.
My current spider is pretty simple:
# coding=utf-8
import scrapy
class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['http://www.yellowkorner.com/photos/index.aspx']
    def parse(self, response):
        for url in response.css('a::attr("href")').re(r'/photos/\d\d\d\d/.*$'):
            yield scrapy.Request(response.urljoin(url), self.parse_prices)
    def parse_prices(self, response):
        yield None
How can I scrape price information for all countries?
Open the page with Firebug and refresh. Inspecting the web page in the Network panel, Cookies sub-panel, you will see that the page saves the country information in cookies.
So you have to force the LANGUAGE and COUNTRY values of the "YellowKornerCulture" cookie in the request. I made an example based on your code that gets the available countries from the site and loops over them to get all the prices. See the code below:
# coding=utf-8
import scrapy
class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['http://www.yellowkorner.com/photos/index.aspx']
    def parse(self, response):
        countries = self.get_countries(response)
        #countries = ['BR', 'US'] try this if you only have some countries
        for country in countries:
            #With the expression re(r'/photos/\d\d\d\d/.*$') you only get photos with 4-digit ids. I think this is not your goal.
            for url in response.css('a::attr("href")').re(r'/photos/\d\d\d\d/.*$'):
                yield scrapy.Request(
                    response.urljoin(url),
                    cookies={'YellowKornerCulture': 'Language=US&Country=' + str(country), 'YellowKornerHistory': '', 'ASP.NET_SessionId': ''},
                    callback=self.parse_prices,
                    dont_filter=True,
                    meta={'country': country}
                )
    def parse_prices(self, response):
        yield {
            'name': response.xpath('//h1[@itemprop="name"]/text()').extract()[0],
            'price': response.xpath('//span[@itemprop="price"]/text()').extract()[0],
            'country': response.meta['country']
        }
    #function that gets the countries available on the site
    def get_countries(self, response):
        return response.xpath('//select[@id="ctl00_languageSelection_ddlCountry"]/option/attribute::value').extract()
It took some time to figure this out, but you also have to clear the other cookies that the site uses to choose the language page. I also fixed the language value to English (US). The parameter dont_filter=True is used because each loop iteration requests a URL that has already been requested, and Scrapy's default behavior is to not repeat a request to the same URL, for performance reasons.
PS: The xpath expressions provided can be improved.
Hope this helps.
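The cookie dictionary from the answer can be pulled into a small helper so the request line stays readable. This is only a sketch: the cookie names and the Language=...&Country=... format are taken from the answer above, while the helper name `culture_cookies` and the two sample country codes are assumptions.

```python
def culture_cookies(country, language="US"):
    """Build the cookies that select a delivery country on the site."""
    return {
        "YellowKornerCulture": "Language={}&Country={}".format(language, country),
        # Cleared so a previous country or session does not override the choice
        "YellowKornerHistory": "",
        "ASP.NET_SessionId": "",
    }


# Sample country codes; the spider would use the values scraped from the site
for country in ["BR", "US"]:
    print(country, culture_cookies(country))
```

The returned dictionary would be passed as the cookies argument of scrapy.Request, alongside dont_filter=True and meta={'country': country} as in the spider above.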

Resources