Scraping dynamic Amazon page with scrolling

I am trying to scrape products on Amazon's Best Seller 100 for a particular category. For example -
https://www.amazon.com/Best-Sellers-Home-Kitchen/zgbs/home-garden/ref=zg_bs_nav_0
The 100 products are divided into two pages with 50 products on each page.
Earlier, the page was static and all 50 products used to appear on the page. However, the page is now dynamic and I need to scroll down to see all 50 products on the page.
I was using Scrapy to scrape the page earlier. I would really appreciate it if you could help me out with this. Thanks!
Adding my code below -
import scrapy
from scrapy_splash import SplashRequest


class BsrNewSpider(scrapy.Spider):
    name = 'bsr_new'
    allowed_domains = ['www.amazon.in']
    # start_urls = ['https://www.amazon.in/gp/bestsellers/kitchen/ref=zg_bs_nav_0']

    script = '''
    function main(splash, args)
        splash.private_mode_enabled = false
        url = args.url
        assert(splash:go(url))
        assert(splash:wait(0.5))
        return splash:html()
    end
    '''

    def start_requests(self):
        url = 'https://www.amazon.in/gp/bestsellers/kitchen/ref=zg_bs_nav_0'
        yield SplashRequest(url, callback=self.parse, endpoint="execute", args={
            'lua_source': self.script
        })

    def parse(self, response):
        for rev in response.xpath("//div[@id='gridItemRoot']"):
            yield {
                'Segment': "Home",  # Enter name of the segment here
                # 'Sub-segment': segment,
                'ASIN': rev.xpath(".//div/div[@class='zg-grid-general-faceout']/div/a[@class='a-link-normal']/@href").re(r'\S*/dp/(\S+)_\S+')[0][:10],
                'Rank': rev.xpath(".//span[@class='zg-bdg-text']/text()").get(),
                'Name': rev.xpath("normalize-space(.//a[@class='a-link-normal']/span/div/text())").get(),
                'No. of Ratings': rev.xpath(".//span[contains(@class,'a-size-small')]/text()").get(),
                'Rating': rev.xpath(".//span[@class='a-icon-alt']/text()").get(),
                'Price': rev.xpath(".//span[@class='a-size-base a-color-price']//text()").get()
            }

        next_page = response.xpath("//a[text()='Next page']/@href").get()
        if next_page:
            url = response.urljoin(next_page)
            yield SplashRequest(url, callback=self.parse, endpoint="execute", args={
                'lua_source': self.script
            })
Regards
Sreejan
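On the scrolling itself: if you stay with Splash, the Lua script above can scroll the page a few times before returning the HTML, so that the lazily loaded products render. A rough, untested sketch in the same style as the spider's existing script string (the loop count and wait times are guesses and may need tuning):

script = '''
function main(splash, args)
    splash.private_mode_enabled = false
    assert(splash:go(args.url))
    assert(splash:wait(1))
    -- scroll to the bottom a few times so lazy-loaded products render
    for i = 1, 8 do
        splash:runjs("window.scrollTo(0, document.body.scrollHeight)")
        assert(splash:wait(0.5))
    end
    return splash:html()
end
'''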

Here is an alternate approach that does not need Splash.
All 50 products' ASINs are already embedded in the first page's HTML (in the data-client-recs-list attribute). You can extract these ASINs and build all 50 product URLs from them.
import scrapy
import json


class AmazonSpider(scrapy.Spider):
    custom_settings = {
        'DEFAULT_REQUEST_HEADERS': ''  # Important
    }
    name = 'amazon'
    start_urls = ['https://www.amazon.com/Best-Sellers-Home-Kitchen/zgbs/home-garden/ref=zg_bs_pg_1?_encoding=UTF8&pg=1']

    def parse(self, response):
        # all 50 ASINs are stored as JSON in the data-client-recs-list attribute
        raw_data = response.css('[data-client-recs-list]::attr(data-client-recs-list)').get()
        data = json.loads(raw_data)
        for item in data:
            url = 'https://www.amazon.com/dp/{}'.format(item['id'])
            yield scrapy.Request(url, callback=self.parse_item)

    def parse_item(self, response):
        ...
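The parse_item callback is left as a stub above. A minimal sketch of what it could extract, using assumed selectors (Amazon's product-page markup changes often, so these are placeholders that will likely need adjusting):

    def parse_item(self, response):
        # Placeholder selectors -- verify them against the live product page.
        yield {
            'ASIN': response.url.split('/dp/')[-1][:10],
            'Name': response.xpath('normalize-space(//span[@id="productTitle"]/text())').get(),
            'Price': response.css('span.a-price span.a-offscreen::text').get(),
        }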

Related

Scrapy working erratically while scraping Amazon

I am trying to scrape the brand name that appears on a product page on Amazon. It can be found in the line below the product name as 'Visit the xxxx store' or 'Brand: xxxx'. Once I scrape this text, I just use the replace function in Excel to get the brand name.
For example visit the following page - https://www.amazon.in/dp/B09HC794BD
I use the following code to do so -
import scrapy

# asin_list and reviews_url are assumed to be defined elsewhere in the original script
class BrandSpider(scrapy.Spider):
    name = 'brand'
    allowed_domains = ['www.amazon.in']

    def start_requests(self):
        for asin in asin_list:
            url = reviews_url.format(asin)
            yield scrapy.Request(url=url, meta={'ASIN': asin})

    def parse(self, response):
        asin = response.request.meta['ASIN']
        brand = response.xpath("normalize-space(//a[contains(@id,'bylineInfo')]/text())").get()
        yield {
            'ASIN': asin,
            'Brand': brand
        }
While it generally works, in several instances it returns an empty string '' for the brand name. When I inspect the element on the website, the XPath works fine there. Please help me with this. Thanks in advance!
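As a side note, the Excel replace step could also be done in the spider itself. A small helper, assuming only the two byline formats quoted above (the exact wording and capitalisation on Amazon may vary):

def clean_brand(byline):
    # Strip the 'Visit the ... Store' / 'Brand: ...' wrappers from the byline text.
    if not byline:
        return byline
    byline = byline.strip()
    if byline.startswith('Visit the ') and byline.endswith(' Store'):
        return byline[len('Visit the '):-len(' Store')]
    if byline.startswith('Brand: '):
        return byline[len('Brand: '):]
    return byline

The spider could then yield {'ASIN': asin, 'Brand': clean_brand(brand)} instead of the raw byline text.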

Is it possible to scrape data from a sublink and go back to the main link using Scrapy? Or is there any tool that I can use?

For example, I will scrape a link, open any other links included in it, scrape those links, and then go back to the main link.

You could send a request back to the page your requests came from, I just don't think it would make much sense.
Since you can get all the data you need from the main link the first time around, I think it would be better to pass the item you need for the following pages with the meta attribute of Request or, in newer versions of Scrapy, with cb_kwargs.
yield Request(
    "http://www.example.com",
    self.callback,
    meta={
        'item': your_item,
        'main_url': response.url
    }
)
You could then access the item or the main link through the response's meta attribute, work with your old item, and then send a request back to the main link and callback, passing the item along.
def callback(self, response):
    item = response.meta['item']
    main_url = response.meta['main_url']
    ...
    yield Request(main_url, self.parse, meta={'item': item})
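For comparison, the same hand-off with cb_kwargs (available since Scrapy 1.7) passes the values as plain keyword arguments to the callback:

yield Request(
    "http://www.example.com",
    self.callback,
    cb_kwargs={
        'item': your_item,
        'main_url': response.url
    }
)

def callback(self, response, item, main_url):
    ...
    yield Request(main_url, self.parse, cb_kwargs={'item': item})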
For the next example I will use this item.
class MyItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    author_description = scrapy.Field()
    tags = scrapy.Field()
I extended the quotes.toscrape.com example to follow the "about the author" link and then yield the item.
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/page/1/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            item = MyItem()
            item['text'] = quote.css('span.text::text').get()
            item['author'] = quote.css('span small::text').get()
            item['tags'] = quote.css('div.tags a.tag::text').getall()
            about_page = quote.xpath('.//a[text()="(about)"]/@href').extract_first()
            yield response.follow(
                about_page,
                self.parse_about_page,
                meta={
                    'item': item,
                    'main_url': response.url
                }
            )

    def parse_about_page(self, response):
        item = response.meta['item']
        item['author_description'] = response.xpath('//div[@class="author-description"]').extract_first()
        yield item
        # here you could go back to the main page
        # beware, this will only work if you turn off the duplicate filter
        # and then result in an endless loop!
        yield response.follow(
            response.meta['main_url'],
            self.parse,
            meta={'item': item}
        )
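If you do want to go back to the main page as in the last comment, the duplicate filter can be bypassed for that single request with dont_filter=True (the endless-loop caveat still applies), so the final yield in parse_about_page could become:

        yield response.follow(
            response.meta['main_url'],
            self.parse,
            meta={'item': item},
            dont_filter=True  # skip the duplicate filter for this request only
        )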

Scrapy scrape multiple pages

I have a function that can scrape an individual page. How can I scrape multiple pages after following the corresponding links? Do I need a separate function that calls parse(), like the gotoIndivPage() below? Thank you!
import scrapy


class trainingScraper(scrapy.Spider):
    name = "..."
    start_urls = ["url with links to multiple pages"]

    # for scraping an individual page
    def parse(self, response):
        SELECTOR1 = '.entry-title ::text'
        SELECTOR2 = '//li[@class="location"]/ul/li/a/text()'
        yield {
            'title': response.css(SELECTOR1).extract_first(),
            'date': response.xpath(SELECTOR2).extract_first(),
        }

    def gotoIndivPage(self, response):
        PAGE_SELECTOR = '//h3[@class="entry-title"]/a/@href'
        for page in response.xpath(PAGE_SELECTOR).extract():
            if page:
                yield scrapy.Request(
                    response.urljoin(page),
                    callback=self.parse
                )
I generally create a new function for every different type of HTML structure I'm trying to scrape. So if your links send you to a page with a different HTML structure than your starting page, I would create a new function and pass it as the callback.
def parseNextPage(self, response):
    # Parse the new page here
    ...

def parse(self, response):
    SELECTOR1 = '.entry-title ::text'
    SELECTOR2 = '//li[@class="example"]/ul/li/a/text()'
    yield {
        'title': response.css(SELECTOR1).extract_first(),
        'date': response.xpath(SELECTOR2).extract_first(),
    }
    href = response.xpath('//li[@class="location"]/ul/li/a/@href').extract_first()
    yield scrapy.Request(
        url=response.urljoin(href),
        callback=self.parseNextPage
    )

How to get multi-user login in Scrapy data extraction?

I have successfully extracted data for a single user login to the website, but my requirement is to make the Scrapy script work for multiple users (generally 5000 users logging in at the same time), passed as a list, extract data for each user, and export the extracted data to CSV format.
Here is the code I have tried for a single user login:
import scrapy
import csv

username = 'abc'
password = 'xyz'


class GiffgafSpider(scrapy.Spider):
    name = 'giffgaf'
    allowed_domains = ['www.something.com']
    start_urls = ['https://www.something.com/auth/login']
    output = "output.csv"

    def __init__(self):
        open(self.output, "w").close()

    def parse(self, response):
        token = response.xpath('.//*[@name="form_key"]/@value').extract_first()
        yield scrapy.FormRequest(
            'https://www.something.com/auth/login',
            formdata={'form_key': token, 'username': username, 'password': password},
            callback=self.startscraping
        )

    def startscraping(self, response):
        yield scrapy.Request('https://www.giffgaff.com/dashboard', callback=self.verifylogin)

    def verifylogin(self, response):
        with open(self.output, "a", newline="") as f:
            writer = csv.writer(f)
            phonenumber = response.css("h2.profile-phone-number::text").extract_first()
            writer.writerow([username, password, phonenumber])
        yield {'uname': username, 'password': password, 'phone': phonenumber}
Thank you.
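One way to extend this to many users would be to drive start_requests from a credentials list and carry each username/password pair through the callbacks with cb_kwargs instead of the module-level variables. A rough sketch of spider methods, assuming a hypothetical users list of (username, password) tuples; note that concurrent sessions also need separate cookie jars via the 'cookiejar' request meta key (the dashboard request from the original is omitted here for brevity):

    # Hypothetical credentials list; in practice this could be loaded from a CSV.
    users = [('abc', 'xyz'), ('def', 'uvw')]

    def start_requests(self):
        for i, (uname, pwd) in enumerate(self.users):
            yield scrapy.Request(
                'https://www.something.com/auth/login',
                callback=self.login,
                cb_kwargs={'uname': uname, 'pwd': pwd},
                meta={'cookiejar': i},   # keep each user's session cookies separate
                dont_filter=True         # the same login URL is requested once per user
            )

    def login(self, response, uname, pwd):
        token = response.xpath('.//*[@name="form_key"]/@value').extract_first()
        yield scrapy.FormRequest(
            'https://www.something.com/auth/login',
            formdata={'form_key': token, 'username': uname, 'password': pwd},
            callback=self.verifylogin,
            cb_kwargs={'uname': uname, 'pwd': pwd},
            meta={'cookiejar': response.meta['cookiejar']},
            dont_filter=True
        )

    def verifylogin(self, response, uname, pwd):
        phonenumber = response.css("h2.profile-phone-number::text").extract_first()
        yield {'uname': uname, 'password': pwd, 'phone': phonenumber}

Exporting could then be left to Scrapy's feed exports (scrapy crawl giffgaf -o output.csv) instead of writing the CSV by hand, and with thousands of logins the concurrency settings (e.g. CONCURRENT_REQUESTS) would also need tuning.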

Change website delivery country with Scrapy

I need to scrape the website http://www.yellowkorner.com/
By choosing a different country, all the prices change. There are 40+ countries listed, and each of them must be scraped.
My current spider is pretty simple
# coding=utf-8
import scrapy


class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['http://www.yellowkorner.com/photos/index.aspx']

    def parse(self, response):
        for url in response.css('a::attr("href")').re(r'/photos/\d\d\d\d/.*$'):
            yield scrapy.Request(response.urljoin(url), self.parse_prices)

    def parse_prices(self, response):
        yield None
How can I scrape price information for all countries?
Open the page with Firebug and refresh. Inspecting the web page in the Network panel / Cookies sub-panel, you will see that the page stores the country information in cookies.
So you have to force the cookie "YellowKornerCulture" attribute values LANGUAGE and COUNTRY in the request. I made an example based on your code that gets the available countries on the site and loops over them to get all the prices. See the code below:
# coding=utf-8
import scrapy


class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['http://www.yellowkorner.com/photos/index.aspx']

    def parse(self, response):
        countries = self.get_countries(response)
        # countries = ['BR', 'US']  # try this if you only want some countries
        for country in countries:
            # With the expression re(r'/photos/\d\d\d\d/.*$') you only get photos with 4-digit ids. I think this is not your goal.
            for url in response.css('a::attr("href")').re(r'/photos/\d\d\d\d/.*$'):
                yield scrapy.Request(
                    response.urljoin(url),
                    cookies={'YellowKornerCulture': 'Language=US&Country=' + str(country), 'YellowKornerHistory': '', 'ASP.NET_SessionId': ''},
                    callback=self.parse_prices,
                    dont_filter=True,
                    meta={'country': country}
                )

    def parse_prices(self, response):
        yield {
            'name': response.xpath('//h1[@itemprop="name"]/text()').extract()[0],
            'price': response.xpath('//span[@itemprop="price"]/text()').extract()[0],
            'country': response.meta['country']
        }

    # function that gets the countries available on the site
    def get_countries(self, response):
        return response.xpath('//select[@id="ctl00_languageSelection_ddlCountry"]/option/attribute::value').extract()
It took some time to figure this out, but you have to clear the other cookies the site uses to choose the language page. I also fixed the language value to English (US). The parameter dont_filter=True is used because you are requesting an already requested URL on each loop iteration, and Scrapy's default behavior is not to repeat a request to the same URL for performance reasons.
PS: The XPath expressions provided can be improved.
Hope this helps.
