Best approach to scrape a dynamic website (built using React) with Python Scrapy - web-scraping

I have been trying to scrape this website (Link) using scrapy and scrapy-splash. As far as I know, the website is built with React. response.xpath always returns an empty list, whichever class name I use. Please suggest a way to approach scraping this React website. I set up Splash using this link and can scrape other websites in the same project, but not this React-based one. The code for the spider is below:
import scrapy
from scrapy_splash import SplashRequest


class NykaaFashionbrandsSpider(scrapy.Spider):
    name = 'nykaa_fashionbrands'
    start_urls = ["https://www.nykaafashion.com/"]
    custom_settings = {
        'FEED_FORMAT': 'csv',
        'FEED_URI': 'fashion_brands.csv',
    }

    def start_requests(self):
        for url in self.start_urls:
            # Render the page through Splash and wait 3 seconds for scripts to run
            yield SplashRequest(url, self.parse,
                                endpoint='render.html',
                                args={'wait': 3},
                                )

    def parse(self, response):
        # I am trying to get the list items
        print(response.xpath('//*[@class="br-inner"]/ul/li/text()').extract())

If you need to scrape all products, or the products in a particular category, you have to use an API URL like this one:
https://www.nykaafashion.com/rest/appapi/V2/categories/products?categoryId=6151&PageSize=12&sort=popularity&currentPage=2&filter_format=v2
Here is a piece of the response:
"products": [
{
"sku": "CTWK0648",
"imageUrl": "https://adn-static1.nykaa.com/nykdesignstudio-images/pub/media/catalog/product/3/8/3884c_1.jpg?rnd=20200526195200",
"isOutOfStock": 0,
"subTitle": "Black Embellished Sandals",
"title": "Catwalk",
"price": 1995,
"tag": {},
"offerCount": 0,
"categoryId": [
"102",
"3528",
"3522",
"2",
"6151",
"6557"
],
"discount": 55,
"offers": null,
"discountedPrice": 899,
"actionUrl": "/catwalk-black-embellished-sandals-3/p/537684",
"aspectRatio": 0.75,
"sizeVariation": [
{
"id": "537678",
"title": "4"
},
{
"id": "537679",
"title": "5"
},
{
"id": "537680",
"title": "6"
},
{
"id": "537681",
"title": "7"
}
],
"type": "configurable",
"id": "537684"
},
Splash is not needed for this website.
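As a rough sketch of the API approach above, the snippet below builds the paginated listing URLs and keeps a few fields from each product entry. The query parameters and field names are copied from the example URL and the response fragment. The helper names (build_products_url, extract_rows) are hypothetical, and since the fragment does not show where the "products" list sits inside the full JSON payload, the caller is assumed to locate that list first.

```python
from urllib.parse import urlencode

API_URL = "https://www.nykaafashion.com/rest/appapi/V2/categories/products"


def build_products_url(category_id, page, page_size=12):
    """Build the listing URL for one page of a category."""
    query = urlencode({
        "categoryId": category_id,
        "PageSize": page_size,
        "sort": "popularity",
        "currentPage": page,
        "filter_format": "v2",
    })
    return f"{API_URL}?{query}"


def extract_rows(products):
    """Keep a few fields from each entry of a "products" list."""
    return [
        {
            "sku": p.get("sku"),
            "brand": p.get("title"),
            "name": p.get("subTitle"),
            "price": p.get("price"),
            "discounted_price": p.get("discountedPrice"),
        }
        for p in products
    ]


if __name__ == "__main__":
    # Print the URLs for the first three pages of category 6151
    for page in range(1, 4):
        print(build_products_url(6151, page))
```

In a Scrapy spider, each URL could be yielded as a plain scrapy.Request and the body decoded with json.loads(response.text) in the callback, so no rendering step is involved.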

I suggest giving cloudscraper a try. I recently tested it on OpenSea, and it worked perfectly.
Install it by running:
pip install cloudscraper
To scrape the data:
import cloudscraper
scraper = cloudscraper.create_scraper(browser="chrome")
url = "https://www.nykaafashion.com/"
response = scraper.get(url)  # send a single request
scraped_status = response.status_code  # HTTP status code
scraped = response.text  # page HTML as text

Related

WooCommerce GET Requests wp-json/wc/v3/products/categories?page=1 shows random categories on each call?

Our online shop has recently had issues transferring the correct category order to our stock management system via the API.
Whenever we test the API call in Postman with:
wp-json/wc/v3/products/categories?page=1
we get a completely random category order, like this:
First call:
[
{
"id": 5179,
"name": "Redmi Note 2022",
"slug": "redmi-note-2022",
"parent": 3054,
Second call:
[
{
"id": 5181,
"name": "Displayeinheit",
"slug": "displayeinheit-redmi-note-2022",
"parent": 5179,
Any advice on how we can resolve this issue?
The issue has been resolved. The problem was caused by this plugin:
https://wordpress.org/plugins/taxonomy-terms-order/

Can WooCommerce Api update stock quantities in bulk?

Is it possible to update the inventory/stock quantity in bulk using the WooCommerce API? According to the documentation, every update requires a call to PUT /wp-json/wc/v3/products/<id>. We have more than 1000 products, so making 1000+ API calls just to update the quantity is very inefficient.
If you have any other thoughts, please share. Thanks
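You can use the batch endpoint, wp-json/wc/v3/products/batch (POST), which accepts several updates in one request. Here is an example body that sets the SKU and stock quantity of product 39. Note that WooCommerce only applies stock_quantity to products that have stock management enabled.
{
  "update": [
    {
      "id": 39,
      "sku": "NEW-SKU",
      "stock_quantity": 5
    }
  ]
}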
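With more than 1000 products, the updates still have to be split across several batch requests, because WooCommerce caps a batch at 100 objects by default. The sketch below is one way to build those request bodies, using the lowercase keys (update, id, stock_quantity) from the REST API documentation. The helper name build_batch_payloads and the sample stock data are hypothetical, and sending the bodies (authentication, HTTP client) is left out.

```python
import json

BATCH_LIMIT = 100  # WooCommerce's default cap on objects per batch request


def build_batch_payloads(stock_by_id, batch_size=BATCH_LIMIT):
    """Split {product_id: quantity} into a list of batch request bodies."""
    items = [
        {"id": product_id, "stock_quantity": quantity}
        for product_id, quantity in stock_by_id.items()
    ]
    return [
        {"update": items[i:i + batch_size]}
        for i in range(0, len(items), batch_size)
    ]


if __name__ == "__main__":
    # Hypothetical data: 250 products, each set to a quantity of 5
    stock = {product_id: 5 for product_id in range(1, 251)}
    payloads = build_batch_payloads(stock)
    print(len(payloads), "requests instead of", len(stock))
    print(json.dumps(payloads[0]["update"][:2]))
```

Each body would then be sent as a POST to wp-json/wc/v3/products/batch, which turns 1000+ single calls into roughly a dozen.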

Getting different responses for a single post and multiple posts from the WP REST API

Is there any way to get a different type of response for a single WordPress post and for a posts list?
My target response for the posts list is:
[
{"id":1,
"date":"2017-08-20T07:26:55",
"link":"http://localhost/wordpress/2017/08/20/test-post",
"title":{"rendered":"Test post"}
},
{"id":2,
"date":"2017-08-20T07:26:55",
"link":"http://localhost/wordpress/2017/08/20/test-post",
"title":{"rendered":"Test post"}
},
{"id":3,
"date":"2017-08-20T07:26:55",
"link":"http://localhost/wordpress/2017/08/20/test-post",
"title":{"rendered":"Test post"}
},
{"id":4,
"date":"2017-08-20T07:26:55",
"link":"http://localhost/wordpress/2017/08/20/test-post",
"title":{"rendered":"Test post"}
}
]
And for a single post:
{
"id": 92,
"date": "2017-08-20T07:13:42",
"date_gmt": "2017-08-20T07:13:42",
"guid": {
"rendered": "http://devel8/wp-news/?p=1"
},
"modified": "2017-08-20T07:13:42",
"modified_gmt": "2017-08-20T07:13:42",
"slug": "hello-world-2",
"status": "publish",
"type": "post",
"link": "http://localhost/wordpress/2017/08/20/hello-world-2/",
"title": {
"rendered": "Hello world!"
},
"content": {
"rendered": "<p>Welcome to WordPress. This is your first post. Edit or delete it, then start writing!</p>\n",
"protected": false
},
"excerpt": {
"rendered": "<p>Welcome to WordPress. This is your first post. Edit or delete it, then start writing!</p>\n",
"protected": false
},
"author": 1,
"featured_media": 0,
"comment_status": "open",
"ping_status": "open",
"sticky": false,
"template": "",
"format": "standard",
"meta": [],
"categories": [
1
],
"tags": [],
.....
.....
}
}
Note: using register_rest_field() and the rest_prepare_post filter we can modify the response for both cases (single and multiple posts), but we need a separate response for each.
OR
Is there any way to know, inside the get_callback function of register_rest_field(), whether the request is for multiple posts or a single post?
Thanks in advance.
Since I did not get a response or solution from anyone, I decided to develop a WordPress plugin that meets my requirement: it handles WP REST API responses differently for a single post and for a posts list (multiple posts), and the admin can control it from the back end.
After a long struggle I developed a plugin named
One Call – WP REST API Extension
Core features of the plugin are:
A custom REST API prefix controlled from the back end, such as ‘test-api’ in place of the default ‘wp-json’, as a first step to secure the API calls.
Different responses for posts lists and single posts.
For posts list (multiple) calls, you can control the ‘one_call’ fields from the back end.
Field filtering options for WordPress posts lists (loops of posts) from the back end.
I hope this plugin helps others who, like me, plan to develop a mobile application for a WordPress website using Ionic, PhoneGap, React Native, Framework7, NativeScript, etc.
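For readers who only need a trimmed list response, a different technique from the plugin above may be enough: newer WordPress versions (4.9.8 and later, to my knowledge) accept a _fields query parameter that limits which fields are returned. The sketch below only builds the two URLs. The function names are hypothetical and the base URL is taken from the example responses in the question.

```python
from urllib.parse import urlencode

# Fields wanted in the list response, as in the question's target output
LIST_FIELDS = ["id", "date", "link", "title"]


def posts_list_url(base_url, fields=LIST_FIELDS, per_page=10):
    """URL for a trimmed posts list using the _fields query parameter."""
    query = urlencode(
        {"_fields": ",".join(fields), "per_page": per_page},
        safe=",",  # keep the commas readable instead of %2C
    )
    return f"{base_url}/wp-json/wp/v2/posts?{query}"


def single_post_url(base_url, post_id):
    """URL for one post with the full default response."""
    return f"{base_url}/wp-json/wp/v2/posts/{post_id}"


if __name__ == "__main__":
    base = "http://localhost/wordpress"
    print(posts_list_url(base))
    print(single_post_url(base, 92))
```

This leaves the choice of fields to the client on each request, whereas the plugin moves that choice to the back end.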

WooCommerce REST API: updating order line item metadata for shipment

I've stumbled upon an issue with updating order line items' metadata via the WooCommerce REST API using Node.js.
I've been following these steps for updating orders (and was successful with some fields):
https://woocommerce.github.io/woocommerce-rest-api-docs/#update-an-order
Now, what I would like to achieve is changing the number of shipped line items of an order, something I would normally do with the partial orders WC plugin in the WordPress UI.
Below is a line item I get from WC using the orders API call. The last element of the meta_data array has the key 'shipped' and contains an array with one object, stating that one (out of two ordered items) has been shipped:
"line_items": [{
"id": 1937,
"name": "Maya",
"product_id": 1271,
"variation_id": 1272,
"quantity": 2,
"tax_class": "",
"subtotal": "140.00",
"subtotal_tax": "0.00",
"total": "140.00",
"total_tax": "0.00",
"taxes": [],
"meta_data": [{
"id": 21637,
"key": "pa_product-color",
"value": "beige"
}, {
"id": 21638,
"key": "pa_shoe-size",
"value": "42"
}, {
"id": 21639,
"key": "pa_shoe-width",
"value": "wide"
}, {
"id": 21640,
"key": "shipped",
"value": [{
"qty": 1,
"date": "Nov 21, 2017"
}
]
}
],
"sku": "2522BE42W",
"price": 70
},
As you can see, the value of the key 'shipped' is an array containing an object, not a string. When I try to send it (back) to WC, I get an error saying:
"data":{"status":400,"params":{"line_items":"line_items[0][meta_data][3][value] is not of type string."}}}
When I try to send the value as a string, i.e.
lineItems[0].meta_data = [{key:"shipped", value: "[{qty:'2'}]" }]
I get no errors, but WC treats this as a string, not as an object, and doesn't update the shipment qty in the DB the way I intended (it only pulls the shipped quantity down to 0 instead):
{
"id": 21640,
"key": "shipped",
"value": "[{qty:'2'}]"
}
Any insights or ideas - how could I modify the shipped quantity of line items via the WC API?
So, apparently there was a bug affecting WP 4.9, which was recently fixed in the following pull request:
https://github.com/woocommerce/woocommerce/pull/17849
It concerns the REST API schema. After the fix was merged into WooCommerce, the problem disappeared, and I am now able to send the data as an object.
More on the topic can be found here:
https://github.com/woocommerce/wc-api-dev/pull/74
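To illustrate what "sending the data as an object" looks like, here is a minimal sketch of the request body for the order update call, with the 'shipped' value kept as a real array of objects instead of a quoted string. It is written in Python for consistency with the other examples here (the question itself used Node.js), the helper name build_shipped_update is hypothetical, and the IDs are the ones from the line item shown in the question.

```python
import json


def build_shipped_update(line_item_id, meta_id, qty, date):
    """Body for the order update call (PUT .../orders/<order_id>)."""
    return {
        "line_items": [
            {
                "id": line_item_id,
                "meta_data": [
                    {
                        "id": meta_id,
                        "key": "shipped",
                        # A real array of objects, not a quoted string
                        "value": [{"qty": qty, "date": date}],
                    }
                ],
            }
        ]
    }


if __name__ == "__main__":
    body = build_shipped_update(1937, 21640, 2, "Nov 21, 2017")
    print(json.dumps(body, indent=2))
```

Whether the partial orders plugin picks up the new quantity still depends on that plugin; the sketch only shows the shape of the payload.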

scrape under "show more"

I am trying to scrape all the objects with the same tag from a specific site (Google Scholar) with BeautifulSoup, but it doesn't scrape the objects that only appear after clicking "show more" at the end of the page. How can I fix it?
Here's an example of my code:
# -*- coding: cp1253 -*-
from urllib.request import urlopen
from bs4 import BeautifulSoup

webpage = urlopen('http://scholar.google.gr/citations?user=FwuKA4UAAAAJ&hl=el')
soup = BeautifulSoup(webpage, 'html.parser')
# Print the title of every publication link on the page
for t in soup.find_all('a', {"class": "gsc_a_at"}):
    print(t.text)
You have to pass pagination parameters in the request URL:
cstart - defines the result offset, i.e. how many results to skip. It is used for pagination (e.g., 0 (default) is the first page of results, 20 is the second page, 40 is the third page, etc.).
pagesize - defines the number of results to return (e.g., 20 (default) returns 20 results, 40 returns 40 results, etc.). The maximum number of results to return is 100.
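As a minimal sketch of how these two parameters combine, the snippet below generates one URL per page for the author profile from the question. The function name author_page_urls is hypothetical, and the total number of results is assumed to be known up front. In practice it usually is not, so a real scraper would keep requesting pages until one comes back without any gsc_a_at links.

```python
from urllib.parse import urlencode

BASE_URL = "http://scholar.google.gr/citations"


def author_page_urls(user_id, total, page_size=100, lang="el"):
    """Yield one URL per page until `total` results are covered."""
    for offset in range(0, total, page_size):
        query = urlencode({
            "user": user_id,
            "hl": lang,
            "cstart": offset,      # number of results to skip
            "pagesize": page_size,  # results per page, at most 100
        })
        yield f"{BASE_URL}?{query}"


if __name__ == "__main__":
    for url in author_page_urls("FwuKA4UAAAAJ", total=250):
        print(url)
```

Each URL can then be fetched and parsed with the same BeautifulSoup loop as in the question.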
You could also use a third-party solution like SerpApi to do this for you. It's a paid API with a free trial.
Example Python code (libraries for other languages are also available) to retrieve the second page of results:
from serpapi import GoogleSearch

params = {
    "engine": "google_scholar_author",
    "hl": "en",
    "author_id": "FwuKA4UAAAAJ",
    "start": "20",
    "api_key": "secret_api_key"
}
search = GoogleSearch(params)
results = search.get_dict()
Example JSON output:
"articles": [
{
"title": "MuseumScrabble: Design of a mobile game for children’s interaction with a digitally augmented cultural space",
"link": "https://scholar.google.com/citations?view_op=view_citation&hl=en&user=FwuKA4UAAAAJ&cstart=20&citation_for_view=FwuKA4UAAAAJ:RHpTSmoSYBkC",
"citation_id": "FwuKA4UAAAAJ:RHpTSmoSYBkC",
"authors": "C Sintoris, A Stoica, I Papadimitriou, N Yiannoutsou, V Komis, N Avouris",
"publication": "Social and organizational impacts of emerging mobile devices: Evaluating use …, 2012",
"cited_by": {
"value": 69,
"link": "https://scholar.google.com/scholar?oi=bibs&hl=en&cites=6286720977869955347",
"serpapi_link": "https://serpapi.com/search.json?cites=6286720977869955347&engine=google_scholar&hl=en",
"cites_id": "6286720977869955347"
},
"year": "2012"
},
{
"title": "The effective combination of hybrid usability methods in evaluating educational applications of ICT: Issues and challenges",
"link": "https://scholar.google.com/citations?view_op=view_citation&hl=en&user=FwuKA4UAAAAJ&cstart=20&citation_for_view=FwuKA4UAAAAJ:hqOjcs7Dif8C",
"citation_id": "FwuKA4UAAAAJ:hqOjcs7Dif8C",
"authors": "N Tselios, N Avouris, V Komis",
"publication": "Education and Information Technologies 13 (1), 55-76, 2008",
"cited_by": {
"value": 68,
"link": "https://scholar.google.com/scholar?oi=bibs&hl=en&cites=1046912849634390721",
"serpapi_link": "https://serpapi.com/search.json?cites=1046912849634390721&engine=google_scholar&hl=en",
"cites_id": "1046912849634390721"
},
"year": "2008"
},
...
Check out the documentation for more details.
Disclaimer: I work at SerpApi.
In Chrome, press F12 and open the Network tab, select 'Preserve log' and disable the cache.
Now hit the "show more" button.
Check the GET/POST request that is sent. It shows the URL and parameters you need to reproduce in your own request.

Resources