Need to get the Product name and Price from this page:
https://www.lazada.com.my/shop-smart-tvs/
Started with
scrapy shell "https://www.lazada.com.my/shop-smart-tvs/"
The output I got was this.
response <200 https://www.lazada.com.my/shop-smart-tvs/>
I also used a proxy and user agents to make sure there weren't any blocks.
But when I used
response.css(".c13VH6 , .c16H9d a").css('::text').extract()
and
response.css(".c16H9d a::attr(title)").extract()
I get empty lists [].
I tried the same approach with another site, and it works.
P.S. I used a Chrome CSS selector widget to get the CSS selectors.
Please tell me where I went wrong.
The data is probably loaded dynamically, so you would have to render the page with something like Splash or Selenium to make your CSS expression work.
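If you do go the rendering route, a minimal Selenium sketch would look roughly like this (assuming Selenium 4+ with a local chromedriver; the class selectors are the ones from your question and may have changed since):
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
driver.get('https://www.lazada.com.my/shop-smart-tvs/')
# Same selectors as in your question -- they only match once JS has rendered the page
names = [a.get_attribute('title') for a in driver.find_elements(By.CSS_SELECTOR, '.c16H9d a')]
prices = [p.text for p in driver.find_elements(By.CSS_SELECTOR, '.c13VH6')]
driver.quit()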
When you look at the page source you can see the product data is present in a big JSON object inside a script tag, so I would just load that JSON as a Python dictionary and get the data you want from there:
import scrapy
import json


class LazadaSpider(scrapy.Spider):
    name = 'lazada'
    allowed_domains = ['lazada.com']
    start_urls = ['https://www.lazada.com.my/shop-smart-tvs/']

    def parse(self, response):
        # The product data lives in a <script> that assigns window.pageData
        script = response.xpath(
            "//script[starts-with(text(), 'window.pageData')]/text()"
        ).extract_first()
        # Cut off the "window.pageData = " prefix and parse the rest as JSON
        first = script.index('{')
        last = len(script)
        products = json.loads(script[first:last])
        items = products['mods']['listItems']
        for item in items:
            name = item['name']
            price = item['price']
            yield {'name': name,
                   'price': price}
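If you save this as lazada.py (the filename is just an example), a quick way to test it is scrapy runspider lazada.py -o products.json, which writes the yielded dictionaries to a JSON file.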
Related
I am trying to scrape data from the Boots.com skincare category page: Boots.skincare
There are 122 pages of skincare products in total.
I have successfully scraped the data on the first page using requests and BeautifulSoup.
Here is the code:
import requests
from bs4 import BeautifulSoup

productlinks = []
r = requests.get('https://www.boots.com/beauty/skincare/skincare-all-skincare')
soup = BeautifulSoup(r.content, 'lxml')
productlist = soup.find_all('div', class_='product_name')
for item in productlist:
    for link in item.find_all('a', href=True):
        productlinks.append(link['href'])
However, when I tried to expand the scraper to the other pages, it only returned the results of the first page.
I tried using a loop, but it kept repeating the same product URLs.
The following code gave me 48 results, but they were duplicates of the first page's 24 items.
productlinks = []
for i in range(24, 72, 24):
    page = f'https://www.boots.com/beauty/skincare/skincare-all-skincare#facet:&productBeginIndex:{i}&orderBy:&pageView:grid&minPrice:&maxPrice:&pageSize:&'
    r = requests.get(page)
    soup = BeautifulSoup(r.content, 'lxml')
    productlist = soup.find_all('div', class_='product_name')
    for item in productlist:
        for link in item.find_all('a', href=True):
            productlinks.append(link['href'])
I tried to use the URL of the 2nd page directly, but it still returned data from the first page:
productlinks = []
r = requests.get('https://www.boots.com/beauty/skincare/skincare-all-skincare#facet:&productBeginIndex:24&orderBy:&pageView:grid&minPrice:&maxPrice:&pageSize:&')
soup = BeautifulSoup(r.content, 'lxml')
productlist = soup.find_all('div', class_='product_name')
for item in productlist:
    for link in item.find_all('a', href=True):
        productlinks.append(link['href'])
I've searched for similar questions, but most websites' URLs use page=i to identify the page, whereas Boots.com uses productBeginIndex:{i} in the URL.
I am not sure if this is what is causing the issue.
If you go to the Network tab in Chrome, you will notice that when you switch pages, there is a POST request to:
Request URL:
https://www.boots.com/ProductListingViewRedesign?ajaxStoreImageDir=%2Fwcsstore%2FeBootsStorefrontAssetStore%2F&searchType=1000&advancedSearch=&cm_route2Page=&filterTerm=&storeId=11352&cm_pagename=&manufacturer=&sType=SimpleSearch&metaData=&catalogId=28501&searchTerm=&resultsPerPage=24&filterFacet=&resultCatEntryType=&gridPosition=&emsName=&disableProductCompare=false&langId=-1&facet=&categoryId=2300180
This POST request has a payload, which you can also find in the Network tab. Do a POST request with the correct headers and payload, and you will get the expected results.
If you have difficulties at any step, post back your attempts, and you will receive further help.
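For instance, a minimal sketch of that approach with requests might look like the following; the endpoint is the one from the Network tab above, but the payload keys and the headers are guesses based on the fragment in your URL, so replace them with the exact form data and headers you see in DevTools:
import requests
from bs4 import BeautifulSoup

url = ('https://www.boots.com/ProductListingViewRedesign'
       '?ajaxStoreImageDir=%2Fwcsstore%2FeBootsStorefrontAssetStore%2F'
       '&searchType=1000&sType=SimpleSearch&storeId=11352&catalogId=28501'
       '&resultsPerPage=24&langId=-1&categoryId=2300180')
headers = {
    'User-Agent': 'Mozilla/5.0',
    'X-Requested-With': 'XMLHttpRequest',  # guess: marks the call as AJAX
}
productlinks = []
for i in range(0, 122 * 24, 24):
    payload = {'productBeginIndex': i, 'beginIndex': i}  # hypothetical keys
    r = requests.post(url, data=payload, headers=headers)
    soup = BeautifulSoup(r.content, 'lxml')
    for item in soup.find_all('div', class_='product_name'):
        for link in item.find_all('a', href=True):
            productlinks.append(link['href'])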
I am trying to scrape the brand name that appears on a product page on Amazon. It can be found in the line below the product name as 'Visit the xxxx store' or 'Brand: xxxx'. Once I scrape this text, I just use the replace function in Excel to get the brand name.
For example visit the following page - https://www.amazon.in/dp/B09HC794BD
I use the following code to do so -
class BrandSpider(scrapy.Spider):
    name = 'brand'
    allowed_domains = ['www.amazon.in']

    def start_requests(self):
        # asin_list and reviews_url are defined elsewhere in my project
        for asin in asin_list:
            url = reviews_url.format(asin)
            yield scrapy.Request(url=url, meta={'ASIN': asin})

    def parse(self, response):
        asin = response.request.meta['ASIN']
        brand = response.xpath("normalize-space(//a[contains(@id,'bylineInfo')]/text())").get()
        yield {
            'ASIN': asin,
            'Brand': brand
        }
While it generally works, in several instances it returns an empty string '' for the brand name. I tried inspecting the element on the website, and the selector looks fine. Please help me with this. Thanks in advance!
For example, I will scrape a link, open any other links included in it, scrape those links, and then go back to the main link.
You could send a request back to the page your requests came from; I just don't think it would make much sense.
Since you could get all the data you need from the main link the first time around, I think it would be better to pass the item you need for the following pages with the meta attribute of Request, or in newer versions of Scrapy with cb_kwargs.
yield Request(
    "http://www.example.com",
    self.callback,
    meta={
        'item': your_item,
        'main_url': response.url
    }
)
You could then access the item or main link using the response's meta attribute, work with your old item and then send a request back to the main link and callback with the item.
def callback(self, response):
    item = response.meta['item']
    main_url = response.meta['main_url']
    ...
    yield Request(main_url, self.parse, meta={'item': item})
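For completeness, the same hand-off with cb_kwargs (available since Scrapy 1.7) would look roughly like this; the keyword names item and main_url are just illustrative:
yield Request(
    "http://www.example.com",
    self.callback,
    cb_kwargs={'item': your_item, 'main_url': response.url}
)

def callback(self, response, item, main_url):
    # entries from cb_kwargs arrive as keyword arguments on the callback
    ...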
For the next example I will use this item.
class MyItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    author_description = scrapy.Field()
    tags = scrapy.Field()
I extended the example from here to follow the "(about)" link to the author page and then yield the item.
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/page/1/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            item = MyItem()
            item['text'] = quote.css('span.text::text').get()
            item['author'] = quote.css('span small::text').get()
            item['tags'] = quote.css('div.tags a.tag::text').getall()
            about_page = quote.xpath('.//a[text()="(about)"]/@href').extract_first()
            yield response.follow(
                about_page,
                self.parse_about_page,
                meta={
                    'item': item,
                    'main_url': response.url
                }
            )

    def parse_about_page(self, response):
        item = response.meta['item']
        item['author_description'] = response.xpath('//div[@class="author-description"]').extract_first()
        yield item
        # here you could go back to the main page
        # beware, this will only work if you turn off the duplicate filter
        # and then result in an endless loop!
        yield response.follow(
            response.meta['main_url'],
            self.parse,
            meta={'item': item}
        )
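If you really do want that last request to go through, an alternative to turning off the duplicate filter globally is to pass dont_filter=True on just that request (the endless-loop warning still applies), for example:
yield response.follow(
    response.meta['main_url'],
    self.parse,
    meta={'item': item},
    dont_filter=True  # let Scrapy re-request an already-seen URL
)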
I'm trying to scrape data from a website: https://www.powermaxed.com/.
Its directory structure is not very consistent, and I don't know what to do next.
Here is the code that I use for scraping:
from scrapy.spiders import CrawlSpider

class MySpider(CrawlSpider):
    name = 'powermaxed'
    start_urls = ['https://www.powermaxed.com/']

    def parse_product(self, response):
        yield {
            'product_title': response.xpath('//div[@class="container"]//div[@class="row"]//div[@id="content"]//h1/text()').extract_first(),
            'product_price_w/_tax': response.xpath('//div[@class="container"]//div[@class="row"]//div[@id="content"]//div[@class="row"]//div[@class="product-buy-wrapper"]//ul[@class="list-unstyled pp"]//li//h2//span[@id="formated_price"]/text()').extract_first(),
            'product_price_w/o_tax': response.xpath('//div[@class="container"]//div[@class="row"]//div[@id="content"]//div[@class="row"]//div[@class="product-buy-wrapper"]//ul[@class="list-unstyled pp"]//li//span[@id="formated_tax"]/text()').extract_first(),
            'product_desc': response.xpath('//div[@id="product-tabs"]//div[@class="tab-content"]//div[@id="tab-description"]//p/text()').extract_first(),
            'product_uses': response.xpath('//div[@id="product-tabs"]//div[@class="tab-content"]//div[@id="tab-description"]//ul//li/text()').extract(),
        }
The extracted data would be the product information.
What I need is to access all product pages from all directories on this website
and extract the information I've put in the code.
I used scrapy shell on the website, so I've already worked out which data I want to extract in the spider.
You can simply scrape all the pages and return a product if there's one:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'powermaxed.com'
    start_urls = ['https://www.powermaxed.com/']

    rules = (
        Rule(LinkExtractor(), callback='parse_product'),
    )

    def parse_product(self, response):
        product_title = response.xpath('//div[@class="container"]//div[@class="row"]//div[@id="content"]//h1/text()').extract_first()
        if product_title:
            yield {
                'product_title': product_title,
                'product_price_w/_tax': response.xpath('//div[@class="container"]//div[@class="row"]//div[@id="content"]//div[@class="row"]//div[@class="product-buy-wrapper"]//ul[@class="list-unstyled pp"]//li//h2//span[@id="formated_price"]/text()').extract_first(),
                'product_price_w/o_tax': response.xpath('//div[@class="container"]//div[@class="row"]//div[@id="content"]//div[@class="row"]//div[@class="product-buy-wrapper"]//ul[@class="list-unstyled pp"]//li//span[@id="formated_tax"]/text()').extract_first(),
                'product_desc': response.xpath('//div[@id="product-tabs"]//div[@class="tab-content"]//div[@id="tab-description"]//p/text()').extract_first(),
                'product_uses': response.xpath('//div[@id="product-tabs"]//div[@class="tab-content"]//div[@id="tab-description"]//ul//li/text()').extract(),
            }
Can you add more details to your question? What kind of help do you need?
Get all the main categories from the homepage, for example with a selector like nav#supermenu ul > li > a[href]:not(.tllhome), and scrape all the products from there. Iterate over the left filter block (div.panel-category a) if needed.
You can also try to scrape the categories from the sitemap (https://www.powermaxed.com/sitemap.xml) and get all the products from those pages; a rough sketch follows.
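A minimal sketch of the sitemap route, assuming the sitemap lists the product and category pages; the title check mirrors the parse_product filter above:
from scrapy.spiders import SitemapSpider

class PowermaxedSitemapSpider(SitemapSpider):
    name = 'powermaxed_sitemap'
    sitemap_urls = ['https://www.powermaxed.com/sitemap.xml']

    def parse(self, response):
        # SitemapSpider sends every URL from the sitemap here by default;
        # only yield when the page actually looks like a product page.
        title = response.xpath('//div[@id="content"]//h1/text()').extract_first()
        if title:
            yield {'product_title': title, 'url': response.url}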
I need to scrape the website http://www.yellowkorner.com/
When you choose a different country, all the prices change. There are 40+ countries listed, and each of those must be scraped.
My current spider is pretty simple:
# coding=utf-8
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['http://www.yellowkorner.com/photos/index.aspx']

    def parse(self, response):
        for url in response.css('a::attr("href")').re(r'/photos/\d\d\d\d/.*$'):
            yield scrapy.Request(response.urljoin(url), self.parse_prices)

    def parse_prices(self, response):
        yield None
How can I scrape price information for all countries?
Open the page with Firebug and refresh. Inspecting the page in the Network panel, Cookies sub-panel, you will see that the page stores the country information in cookies.
So you have to force the cookie "YellowKornerCulture" attribute values LANGUAGE and COUNTRY in the request. I made an example based on your code that gets the available countries on the site and loops over them to get all the prices. See the code below:
# coding=utf-8
import scrapy

class BlogSpider(scrapy.Spider):
    name = 'blogspider'
    start_urls = ['http://www.yellowkorner.com/photos/index.aspx']

    def parse(self, response):
        countries = self.get_countries(response)
        # countries = ['BR', 'US']  # try this if you only want certain countries
        for country in countries:
            # With the expression re(r'/photos/\d\d\d\d/.*$') you only get photos with 4-digit ids. I think this is not your goal.
            for url in response.css('a::attr("href")').re(r'/photos/\d\d\d\d/.*$'):
                yield scrapy.Request(response.urljoin(url), cookies={'YellowKornerCulture': 'Language=US&Country=' + str(country), 'YellowKornerHistory': '', 'ASP.NET_SessionId': ''}, callback=self.parse_prices, dont_filter=True, meta={'country': country})

    def parse_prices(self, response):
        yield {
            'name': response.xpath('//h1[@itemprop="name"]/text()').extract()[0],
            'price': response.xpath('//span[@itemprop="price"]/text()').extract()[0],
            'country': response.meta['country']
        }

    # function that gets the countries available on the site
    def get_countries(self, response):
        return response.xpath('//select[@id="ctl00_languageSelection_ddlCountry"]/option/attribute::value').extract()
It took some time to figure this out, but you have to erase the other cookies the site uses to choose the language page. I also fixed the language value to English (US). The parameter dont_filter=True is used because you are requesting an already requested URL in each loop iteration, and Scrapy's default behavior is not to repeat a request to the same URL for performance reasons.
PS: The XPath expressions provided can be improved.
Hope this helps.