crawl dynamic data using scrapy - web-scraping

I try to get the product rating information from target.com. The URL for the product is
http://www.target.com/p/bounty-select-a-size-paper-towels-white-8-huge-rolls/-/A-15258543#prodSlot=medium_1_4&term=bounty
After looking through response.body, I found that the rating information is not loaded statically, so I need to get it another way. Some similar questions say that in order to get dynamic data, I need to:
1. find out the correct XHR and where to send the request
2. use FormRequest to get the right JSON
3. parse the JSON
(if I am wrong about the steps please tell me)
I am stuck at step 2 right now: I found that one XHR named 15258543 contains the rating distribution, but I don't know how to send a request to get the JSON, i.e. where to send it and with what parameters.
Can someone walk me through this?
Thank you!

The trickiest thing is to get that 15258543 product ID dynamically and then use it inside the URL to get the reviews. This product ID can be found in multiple places on the product page, for instance, there is a meta element that we can use:
<meta itemprop="productID" content="15258543">
Here is a working spider that makes a separate GET request to get the reviews, loads the JSON response via json.loads() and prints the overall product rating:
import json

import scrapy


class TargetSpider(scrapy.Spider):
    name = "target"
    allowed_domains = ["target.com"]
    start_urls = ["http://www.target.com/p/bounty-select-a-size-paper-towels-white-8-huge-rolls/-/A-15258543#prodSlot=medium_1_4&term=bounty"]

    def parse(self, response):
        # grab the product ID from the itemprop meta element
        product_id = response.xpath("//meta[@itemprop='productID']/@content").extract_first()
        return scrapy.Request("http://tws.target.com/productservice/services/reviews/v1/reviewstats/" + product_id,
                              callback=self.parse_ratings,
                              meta={"product_id": product_id})

    def parse_ratings(self, response):
        data = json.loads(response.body)
        print(data["result"][response.meta["product_id"]]["coreStats"]["AverageOverallRating"])
Prints 4.5585.
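The JSON-handling step can be checked in isolation. Below is a minimal sketch that runs the same extraction as the spider's parse_ratings() callback against a made-up payload mirroring the structure it navigates; sample_body and its field values are placeholders, not real API output:

```python
import json

# Hypothetical sample of the reviewstats JSON, mirroring the structure
# the spider's parse_ratings() callback navigates.
sample_body = """
{
  "result": {
    "15258543": {
      "coreStats": {
        "AverageOverallRating": 4.5585,
        "TotalReviewCount": 123
      }
    }
  }
}
"""

def extract_rating(body, product_id):
    # Pull the overall rating out of the reviewstats payload.
    data = json.loads(body)
    return data["result"][product_id]["coreStats"]["AverageOverallRating"]

print(extract_rating(sample_body, "15258543"))  # 4.5585
```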

Related

Cannot Authenticate to website, Scrapy Spider, Bad Request

I'm trying to write a web scraper to study different social media platforms, and now I'm working on one for Gab. When I try to log in I get what I believe is a 400 HTTP code (Bad Request), and I'm not sure why.
import scrapy
from scrapy.http import FormRequest
from scrapy.utils.response import open_in_browser


class FeedparserSpider(scrapy.Spider):
    name = 'feedparser'
    allowed_domains = ['gab.com']
    start_urls = ['https://gab.com/auth/sign_in/']

    def parse(self, response):
        # Everything we need to sign in
        authenticity_token = response.xpath('//form[@class="simple_form new_user"]/input[@name="authenticity_token"]/@value').get()
        user_email = "my@emailaddress.com"
        user_password = "MyPassword"
        open_in_browser(response)
        return FormRequest.from_response("https://gab.com/auth/sign_in", formdata={
            'authenticity_token': authenticity_token,
            'user[email]': user_email,
            'user[password]': user_password,
        }, callback=self.parsefeed)

    def parsefeed(self, response):
        home_url = 'https://gab.com/home'
        yield scrapy.Request(url=home_url, callback=self.parse_feed)

    def parse_feed(self, response):
        open_in_browser(response)
Current Predicament
I suspect I'll need to change my formdata to include a user object with an email and a password property, but I'm not sure.
I am VERY new to web scraping, so I don't have many troubleshooting strategies or much insight yet. Any advice on what to do and how to proceed would be very helpful, and if this post could use any additional details please let me know and I will add them as quickly as possible.
You need to pass 'FormRequest.from_response' your response:
return FormRequest.from_response(response, formdata={
    'authenticity_token': authenticity_token,
    'user[email]': user_email,
    'user[password]': user_password},
    callback=self.parsefeed)
Then you'll get to a page with the following text:
To use the Gab Social web application, please enable JavaScript. Alternatively, try one of the native apps for Gab Social for your platform.
You need to check how the website checks for JavaScript and see if you can bypass it.
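Before fighting the login flow further, it can help to detect the JavaScript wall programmatically. This is a rough sketch with a made-up sample_html standing in for a real Gab response; the substring heuristic is an assumption, not Gab's actual detection mechanism:

```python
# A quick way to spot a JavaScript wall: look for the telltale <noscript>
# message in the response body before trying to parse the feed.
# (The HTML below is a made-up stand-in for an actual Gab response.)
sample_html = """
<html><body>
<noscript>To use the Gab Social web application, please enable JavaScript.</noscript>
<div id="mastodon"></div>
</body></html>
"""

def requires_javascript(html):
    # Heuristic: the page refuses to render without JS.
    return "enable JavaScript" in html

if requires_javascript(sample_html):
    print("JS required - consider a rendering backend or the site's JSON API")
```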

scrapy giving a different output than on website, problem with geo location?

I'm really a newbie at all of this and am just trying to learn a bit more.
I had a lot of help to get this going, but now I'm stuck on a very weird problem.
I am scraping info from a grocery store in Australia. I'm located in the state of Victoria, and when I go to the website the price of a Red Bull is $10.50, but as soon as I run my script I get $11.25.
I am guessing it might have to do with geolocation, but I'm not sure.
I basically need some help as to where to look to find out how to get the right price, the one I see when I go to the website.
Also, I noticed that when I go to the same website from my phone it gives me the price of $11.25, but if I use the store's app I get the accurate price of $10.50.
import json

import scrapy


class SpidervenderSpider(scrapy.Spider):
    name = 'spidervender'
    allowed_domains = ['woolworths.com.au']
    start_urls = ['https://www.woolworths.com.au/shop/productdetails/306165/red-bull-energy-drink']

    def parse(self, response):
        product_schema = json.loads(response.css('script[type="application/ld+json"]::text').get())
        yield {
            'title': product_schema['name'],
            'price': product_schema['offers']['price'],
        }
So the code works perfectly but the price is (I presume) for a different part of Australia.
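One avenue worth exploring, assuming the site keys pricing off a store or region cookie, is to send that cookie along with the request. The sketch below only builds the request (no network call is made), and the cookie name and value region=VIC are placeholders; inspect the cookies the real site sets in your browser's devtools (Application > Cookies) to find the ones that control store location:

```python
from urllib.request import Request

# Hypothetical: if pricing depends on a store/region cookie, sending it may
# reproduce the in-store price. "region=VIC" is a placeholder name/value,
# not the site's real cookie.
url = "https://www.woolworths.com.au/shop/productdetails/306165/red-bull-energy-drink"
req = Request(url, headers={
    "User-Agent": "Mozilla/5.0",
    "Cookie": "region=VIC",  # placeholder: find the real cookie in devtools
})

print(req.get_header("Cookie"))  # region=VIC
```

In Scrapy, the equivalent would be passing a `cookies={...}` dict to `scrapy.Request`.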

Scrape dynamic info from same URL using python or any other tool

I am trying to scrape the URL of every company who has posted a job offer on this website:
https://jobs.workable.com/
I want to pull the info to generate some stats about this website.
The problem is that when I click on an ad and navigate through the job post, the URL is always the same. I know a bit of Python, so any solution using it would be useful, but I am open to any other approach.
Thank you in advance.
This is just pseudocode to give you an idea of what you are looking for.
import requests

headers = {'User-Agent': 'Mozilla/5.0'}
first_url = 'https://job-board-v3.workable.com/api/v1/jobs?query=&orderBy=postingUpdateTime+desc'
base_url = 'https://job-board-v3.workable.com/api/v1/jobs?query=&orderBy=postingUpdateTime+desc&offset='
page_ids = ['0', '10', '20', '30', '40', '50']  # can also be created dynamically; this is just raw

for pep_id in page_ids:
    # the initial page has no offset parameter
    if pep_id == '0':
        page = requests.get(first_url, headers=headers)
        print('You still need to parse the first page')
        # Enter some parsing logic
    else:
        final_url = base_url + pep_id
        page = requests.get(final_url, headers=headers)
        print('You still need to parse the other pages')
        # Enter some parsing logic
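The hard-coded page_ids list can indeed be generated dynamically: step through offsets of 10 and stop at the first empty page. In this sketch, request_page is a stand-in for the requests.get(...) call above and fakes three pages of results so the loop logic can run offline:

```python
import itertools

# Stand-in for requests.get(base_url + offset): fakes three pages of jobs
# so the pagination loop can be demonstrated without the network.
def request_page(offset):
    fake_api = {0: ["job1"] * 10, 10: ["job2"] * 10, 20: ["job3"] * 4}
    return fake_api.get(offset, [])

all_jobs = []
for offset in itertools.count(start=0, step=10):
    jobs = request_page(offset)
    if not jobs:
        break  # an empty page means we've run out of results
    all_jobs.extend(jobs)

print(len(all_jobs))  # 24
```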

Detecting valid search parameters for a site? (Web scraping)

I'm trying to scrape a bunch of search results from the site:
http://www.wileyopenaccess.com/view/journals.html
Currently the results show up on 4 pages. The 4th page could be accessed with http://www.wileyopenaccess.com/view/journals.html?page=4
I'd like some way to get all of the results on one page for easier scraping, but I have no idea how to determine which request parameters are valid. I tried a couple of things like:
http://www.wileyopenaccess.com/view/journals.html?per_page=100
http://www.wileyopenaccess.com/view/journals.html?setlimit=100
to no avail. Is there a way to detect the valid parameters of this search?
I'm using BeautifulSoup; is there some obvious way to do this that I've overlooked?
Thanks
You cannot pass any magic params to get all the links, but you can use the Next button to get all the pages, which will work regardless of how many pages there may be:
import requests
from bs4 import BeautifulSoup


def get_all_pages():
    response = requests.get('http://www.wileyopenaccess.com/view/journals.html')
    soup = BeautifulSoup(response.text, "html.parser")
    yield soup.select("div.journalRow")

    # keep following the "Next" link until there isn't one
    nxt = soup.select_one("div.journalPagination.borderBox a[title^=Next]")
    while nxt:
        response = requests.get(nxt["href"])
        soup = BeautifulSoup(response.text, "html.parser")
        yield soup.select("div.journalRow")
        nxt = soup.select_one("div.journalPagination.borderBox a[title^=Next]")


for page in get_all_pages():
    print(page)

Python-requests: Can't scrape all the html code from a page

I am trying to scrape the content of the Financial Times search page.
Using Requests, I can easily scrape the articles' titles and hyperlinks.
I would like to get the next page's hyperlink, but I cannot find it in the Requests response, unlike the articles' titles and hyperlinks.
from bs4 import BeautifulSoup
import requests

url = 'http://search.ft.com/search?q=SABMiller+PLC&t=all&rpp=100&fa=people%2Corganisations%2Cregions%2Csections%2Ctopics%2Ccategory%2Cbrand&s=-lastPublishDateTime&f=lastPublishDateTime[2000-01-01T00%3A00%3A00%2C2016-01-01T23%3A59%3A59]&curations=ARTICLES%2CBLOGS%2CVIDEOS%2CPODCASTS&highlight=true&p=1et'
response = requests.get(url, auth=(my login informations))
soup = BeautifulSoup(response.text, "lxml")

def get_titles_and_links():
    titles = soup.find_all('a')
    for ref in titles:
        if ref.get('title') and ref.get('onclick'):
            print(ref.get('href'))
            print(ref.get('title'))
The get_titles_and_links() function gives me the titles and links of all the articles.
However, with a similar function for the next page, I have no results:
def get_next_page():
    next_page = soup.find_all("li", class_="page next")
    return next_page
Or:
def get_next_page():
    next_page = soup.find_all('li')
    for ref in next_page:
        if ref.get('page next'):
            print(ref.get('page next'))
If you can see the required links in the page source but are not able to get them via requests or urllib, it can mean one of two things:
1. There is something wrong with your logic. Let's assume it's not that.
2. Then what remains is Ajax: the parts of the page you are looking for are loaded by JavaScript after the document.onload event fires. So you cannot get something that's not there in the first place.
My solutions (more like suggestions) are:
1. Reverse engineer the network requests. Difficult, but universally applicable; I personally do this. You might want to use the re module.
2. Find something that renders JavaScript, i.e. simulate web browsing. You might want to check out the webdriver component of Selenium, Qt, etc. This is easier, but somewhat memory-hungry and consumes a lot more network resources compared to 1.
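A quick sanity check before reaching for either approach is to confirm the element really is absent from the raw HTML that requests receives. The two snippets below are made-up stand-ins for the static and browser-rendered versions of a page, illustrating why a plain HTTP fetch can never see JS-injected markup:

```python
# Made-up stand-ins: what requests sees vs. what the browser renders after JS.
static_html = '<html><body><div id="results"></div></body></html>'
rendered_html = ('<html><body><div id="results">'
                 '<li class="page next"><a href="/search?p=2">Next</a></li>'
                 '</div></body></html>')

marker = 'class="page next"'
print(marker in static_html)    # False: requests alone can never find it
print(marker in rendered_html)  # True: it only exists after JS runs
```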