Change website delivery country with Scrapy - web-scraping

I need to scrape the website http://www.yellowkorner.com/
Choosing a different country changes all the prices. There are 40+ countries listed, and each of them must be scraped.
My current spider is pretty simple:
# coding=utf-8
import scrapy


class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['http://www.yellowkorner.com/photos/index.aspx']

    def parse(self, response):
        for url in response.css('a::attr("href")').re(r'/photos/\d\d\d\d/.*$'):
            yield scrapy.Request(response.urljoin(url), self.parse_prices)

    def parse_prices(self, response):
        yield None
How can I scrape price information for all countries?

Open the page with Firebug and refresh. Inspecting the page in the Network panel (Cookies sub-panel), you will see that the site stores the country information in cookies.
So you have to force the LANGUAGE and COUNTRY values of the "YellowKornerCulture" cookie on each request. I made an example based on your code that gets the available countries from the site and loops over them to get all the prices. See the code below:
# coding=utf-8
import scrapy


class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['http://www.yellowkorner.com/photos/index.aspx']

    def parse(self, response):
        countries = self.get_countries(response)
        # countries = ['BR', 'US']  # try this if you only want some countries
        for country in countries:
            # With the expression re(r'/photos/\d\d\d\d/.*$') you only get photos
            # with 4-digit ids. I think this is not your goal.
            for url in response.css('a::attr("href")').re(r'/photos/\d\d\d\d/.*$'):
                yield scrapy.Request(
                    response.urljoin(url),
                    cookies={
                        'YellowKornerCulture': 'Language=US&Country=' + str(country),
                        'YellowKornerHistory': '',
                        'ASP.NET_SessionId': '',
                    },
                    callback=self.parse_prices,
                    dont_filter=True,
                    meta={'country': country},
                )

    def parse_prices(self, response):
        yield {
            'name': response.xpath('//h1[@itemprop="name"]/text()').extract()[0],
            'price': response.xpath('//span[@itemprop="price"]/text()').extract()[0],
            'country': response.meta['country'],
        }

    # helper that gets the countries available on the site
    def get_countries(self, response):
        return response.xpath('//select[@id="ctl00_languageSelection_ddlCountry"]/option/attribute::value').extract()
It took some time to figure this out, but you have to clear the other cookies the site uses to choose the page language. I also fixed the language value to English (US). The dont_filter=True parameter is needed because each loop iteration requests an already-requested URL, and Scrapy's default behavior is to skip repeated requests to the same URL for performance reasons.
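As a standalone illustration of the cookie handling, the per-country cookie dict can be built in isolation. The country codes below are just examples; the spider reads the real list from the site's country dropdown:

```python
# Sketch: build the cookie dict sent with each per-country request.
# 'BR' and 'US' are example country codes; the spider gets the real
# list from the site's country <select> element.
def cookies_for(country):
    return {
        # force the country while keeping the language fixed to US English
        'YellowKornerCulture': 'Language=US&Country=' + str(country),
        # clear the cookies the site uses to remember a previous choice
        'YellowKornerHistory': '',
        'ASP.NET_SessionId': '',
    }

for country in ['BR', 'US']:
    print(cookies_for(country))
```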
PS: The XPath expressions provided can be improved.
Hope this helps.

Related

Scrapy working erratically while scraping Amazon

I am trying to scrape the brand name that appears on a product page on Amazon. It can be found in the line below the product name as 'Visit the xxxx store' or 'Brand: xxxx'. Once I scrape this text, I just use the replace function in Excel to get the brand name.
For example, visit the following page: https://www.amazon.in/dp/B09HC794BD
I use the following code to do so:
import scrapy

# asin_list and reviews_url are defined elsewhere in the project
class BrandSpider(scrapy.Spider):
    name = 'brand'
    allowed_domains = ['www.amazon.in']

    def start_requests(self):
        for asin in asin_list:
            url = reviews_url.format(asin)
            yield scrapy.Request(url=url, meta={'ASIN': asin})

    def parse(self, response):
        asin = response.request.meta['ASIN']
        brand = response.xpath("normalize-space(//a[contains(@id,'bylineInfo')]/text())").get()
        yield {
            'ASIN': asin,
            'Brand': brand
        }
While it generally works, in several instances it returns an empty string '' for the brand name. I tried inspecting the website and the selector looks fine there. Please help me with this. Thanks in advance!

Is it possible to scrape data from a sublink and go back to the main link using Scrapy? Or is there any tool that I can use?

For example, I will scrape a link, open any other links included in it, scrape those links, and then go back to the main link.
You could send a request back to the page your requests came from, I just don't think that it would make much sense.
Since you could get all the data you need from the main link the first time around, I think it would be better to pass the item you need for the following pages with the meta attribute of Request or in newer versions of Scrapy with cb_kwargs.
yield Request(
    "http://www.example.com",
    self.callback,
    meta={
        'item': your_item,
        'main_url': response.url
    }
)
You could then access the item or main link using the response's meta attribute, work with your old item and then send a request back to the main link and callback with the item.
def callback(self, response):
    item = response.meta['item']
    main_url = response.meta['main_url']
    ...
    yield Request(main_url, self.parse, meta={'item': item})
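With cb_kwargs (Scrapy 1.7+), the same hand-off is written as cb_kwargs={'item': your_item, 'main_url': response.url} on the Request, and the dict's keys then arrive as named parameters of the callback. A minimal sketch of the callback side; scrapy itself is not imported here, and the call at the end only simulates how Scrapy would invoke it:

```python
# With cb_kwargs on the Request, the callback receives the values
# directly as arguments, so no response.meta lookup is needed:
def callback(response, item, main_url):
    item['source'] = main_url
    return item

# Simulated invocation (Scrapy would pass the real response object):
result = callback(response=None, item={'text': 'a quote'}, main_url='http://www.example.com')
print(result)
```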
For the next example I will use this item.
class MyItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    author_description = scrapy.Field()
    tags = scrapy.Field()
I extended the example from here to follow the author's "(about)" link and then yield the item.
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/page/1/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            item = MyItem()
            item['text'] = quote.css('span.text::text').get()
            item['author'] = quote.css('span small::text').get()
            item['tags'] = quote.css('div.tags a.tag::text').getall()
            about_page = quote.xpath('.//a[text()="(about)"]/@href').extract_first()
            yield response.follow(
                about_page,
                self.parse_about_page,
                meta={
                    'item': item,
                    'main_url': response.url
                }
            )

    def parse_about_page(self, response):
        item = response.meta['item']
        item['author_description'] = response.xpath('//div[@class="author-description"]').extract_first()
        yield item
        # here you could go back to the main page
        # beware, this will only work if you turn off the duplicate filter
        # and it will then result in an endless loop!
        yield response.follow(
            response.meta['main_url'],
            self.parse,
            meta={'item': item}
        )

Scrapy response.css returns an empty list

I need to get the product name and price from this page:
https://www.lazada.com.my/shop-smart-tvs/
I started with:
scrapy shell "https://www.lazada.com.my/shop-smart-tvs/"
The output I got was this:
response <200 https://www.lazada.com.my/shop-smart-tvs/>
I also used proxies and user agents to make sure there weren't any blocks.
But when I used
response.css(".c13VH6 , .c16H9d a").css('::text').extract()
and
response.css(".c16H9d a::attr(title)").extract()
I get empty lists [].
I tried the same approach with another site, and it works.
P.S. I used a Chrome CSS selector widget to get the CSS selectors.
Please tell me where I went wrong.
The data is probably loaded dynamically, so you would have to render the page with something like Splash or Selenium to make your CSS expression work.
When you go to the page source you can see the product data is present in a big JSON object, so I would just load the JSON as a Python dictionary and get the data you want from there:
import scrapy
import json


class LazadaSpider(scrapy.Spider):
    name = 'lazada'
    allowed_domains = ['lazada.com']
    start_urls = ['https://www.lazada.com.my/shop-smart-tvs/']

    def parse(self, response):
        script = response.xpath(
            "//script[starts-with(text(), 'window.pageData')]/text()"
        ).extract_first()
        first = script.index('{')
        products = json.loads(script[first:])
        items = products['mods']['listItems']
        for item in items:
            name = item['name']
            price = item['price']
            yield {'name': name,
                   'price': price}
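The slicing-and-parsing step can be checked in isolation with a toy stand-in for the script tag's contents (the real window.pageData object on the page is much larger):

```python
import json

# Toy stand-in for the <script> text the spider extracts; the real page
# assigns a large JSON object to window.pageData.
script = 'window.pageData={"mods": {"listItems": [{"name": "TV A", "price": "999"}]}}'

# everything from the first '{' onwards is the JSON payload
first = script.index('{')
products = json.loads(script[first:])

for item in products['mods']['listItems']:
    print(item['name'], item['price'])
```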

Scrapy scrape multiple pages

I have a function that can scrape an individual page. How can I scrape multiple pages after following the corresponding links? Do I need a separate function that calls parse(), like gotoIndivPage() below? Thank you!
import scrapy


class trainingScraper(scrapy.Spider):
    name = "..."
    start_urls = ["url with links to multiple pages"]

    # for scraping an individual page
    def parse(self, response):
        SELECTOR1 = '.entry-title ::text'
        SELECTOR2 = '//li[@class="location"]/ul/li/a/text()'
        yield {
            'title': response.css(SELECTOR1).extract_first(),
            'date': response.xpath(SELECTOR2).extract_first(),
        }

    def gotoIndivPage(self, response):
        PAGE_SELECTOR = '//h3[@class="entry-title"]/a/@href'
        for page in response.xpath(PAGE_SELECTOR).extract():
            if page:
                yield scrapy.Request(
                    response.urljoin(page),
                    callback=self.parse
                )
I generally create a new function for every different type of HTML structure I'm trying to scrape. So if your links send you to a page with a different HTML structure than your starting page, I would create a new function and pass that to my callback.
def parseNextPage(self, response):
    # parse the new page here
    ...

def parse(self, response):
    SELECTOR1 = '.entry-title ::text'
    SELECTOR2 = '//li[@class="example"]/ul/li/a/text()'
    yield {
        'title': response.css(SELECTOR1).extract_first(),
        'date': response.xpath(SELECTOR2).extract_first(),
    }
    for href in response.xpath('//li[@class="location"]/ul/li/a/@href').extract():
        yield scrapy.Request(
            url=response.urljoin(href),
            callback=self.parseNextPage
        )

Going through categories that have different names for each category and product name

I'm trying to scrape data from the website https://www.powermaxed.com/.
Its directory structure is not very consistent, and I don't know what to do next.
Here is the code that I use for scraping:
from scrapy.spiders import CrawlSpider


class MySpider(CrawlSpider):
    name = 'powermaxed'
    start_urls = ['https://www.powermaxed.com/']

    def parse_product(self, response):
        yield {
            'product_title': response.xpath('//div[@class="container"]//div[@class="row"]//div[@id="content"]//h1/text()').extract_first(),
            'product_price_w/_tax': response.xpath('//div[@class="container"]//div[@class="row"]//div[@id="content"]//div[@class="row"]//div[@class="product-buy-wrapper"]//ul[@class="list-unstyled pp"]//li//h2//span[@id="formated_price"]/text()').extract_first(),
            'product_price_w/o_tax': response.xpath('//div[@class="container"]//div[@class="row"]//div[@id="content"]//div[@class="row"]//div[@class="product-buy-wrapper"]//ul[@class="list-unstyled pp"]//li//span[@id="formated_tax"]/text()').extract_first(),
            'product_desc': response.xpath('//div[@id="product-tabs"]//div[@class="tab-content"]//div[@id="tab-description"]//p/text()').extract_first(),
            'product_uses': response.xpath('//div[@id="product-tabs"]//div[@class="tab-content"]//div[@id="tab-description"]//ul//li/text()').extract(),
        }
The extracted data would be the product information.
What I need is to access all product pages from all directories on this website
and extract the information I've put in the code.
I used scrapy shell on the website to work out which data I want to extract in the spider.
You can simply scrape all the pages and return a product if there's one:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = 'powermaxed.com'
    start_urls = ['https://www.powermaxed.com/']
    rules = (
        Rule(LinkExtractor(), callback='parse_product'),
    )

    def parse_product(self, response):
        product_title = response.xpath('//div[@class="container"]//div[@class="row"]//div[@id="content"]//h1/text()').extract_first()
        if product_title:
            yield {
                'product_title': product_title,
                'product_price_w/_tax': response.xpath('//div[@class="container"]//div[@class="row"]//div[@id="content"]//div[@class="row"]//div[@class="product-buy-wrapper"]//ul[@class="list-unstyled pp"]//li//h2//span[@id="formated_price"]/text()').extract_first(),
                'product_price_w/o_tax': response.xpath('//div[@class="container"]//div[@class="row"]//div[@id="content"]//div[@class="row"]//div[@class="product-buy-wrapper"]//ul[@class="list-unstyled pp"]//li//span[@id="formated_tax"]/text()').extract_first(),
                'product_desc': response.xpath('//div[@id="product-tabs"]//div[@class="tab-content"]//div[@id="tab-description"]//p/text()').extract_first(),
                'product_uses': response.xpath('//div[@id="product-tabs"]//div[@class="tab-content"]//div[@id="tab-description"]//ul//li/text()').extract(),
            }
Can you add more details to your question? What kind of help do you need?
Get all main categories from the homepage, for example with nav#supermenu ul > li > a[href]:not(.tllhome), and scrape all the products from there. Iterate over the left filter block div.panel-category a if needed.
You can also try to scrape categories from the sitemap (https://www.powermaxed.com/sitemap.xml) and get all the products from those pages.
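For the sitemap route, the URL list can be pulled out with the standard library alone. The XML snippet below is a hypothetical stand-in for what the real sitemap returns:

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the response body of
# https://www.powermaxed.com/sitemap.xml
sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.powermaxed.com/example-category</loc></url>
  <url><loc>https://www.powermaxed.com/example-product</loc></url>
</urlset>"""

# sitemap elements live in the sitemaps.org namespace
ns = {'sm': 'http://www.sitemaps.org/schemas/sitemap/0.9'}
root = ET.fromstring(sitemap)
urls = [loc.text for loc in root.findall('sm:url/sm:loc', ns)]
print(urls)
```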
