scrapy giving a different output than on website, problem with geo location? - web-scraping

I'm really a newbie in all of this and am just trying to learn a bit more about this.
So I had a lot of help to get this going but now I'm stuck on a very weird problem.
I am scraping info from a grocery store in Australia. As I'm located in the state of Victoria when I go on a website the price of a Redbull is 10.5$ but as soon as I run my script I get 11.25$.
I am guessing it might have to do with a geolocation...but not sure.
I basically need some help as to where to look to find how to get the right price I get when I go to the website.
Also, I noticed that when I do go to the same website from my phone it gives me the price of 11.25$, but if I go to the app of the store I get the accurate price of 10.5$.
import json
import scrapy
class SpidervenderSpider(scrapy.Spider):
name = 'spidervender'
allowed_domains = ['woolworths.com.au']
start_urls = ['https://www.woolworths.com.au/shop/productdetails/306165/red-bull-energy-drink']
def parse(self, response):
product_schema = json.loads(response.css('script[type="application/ld+json"]::text').get())
yield {
'title': product_schema['name'],
'price': product_schema['offers']['price']
}
So the code works perfectly but the price is (I presume) for a different part of Australia.

Related

How can I get tweet having only tweet ID without using twitter API?

I have a large number of Tweet IDs that have been collected by other people (https://github.com/echen102/us-pres-elections-2020), and I now want to get these tweets from those IDs. What should I do without the Twitter API?
Do you want the url ? It is : https://twitter.com/user/status/<tweet_id>
If you want the text of the tweet withou using the api , you have to render the page, and then scrape it.
You can do it with one module, requests-html:
from requests_html import HTMLSession
session = HTMLSession()
url = "https://twitter.com/user/status/1414963866304458758"
r = session.get(url)
r.html.render(sleep=2)
tweet_text = r.html.find('.css-1dbjc4n.r-1s2bzr4', first=True)
print(tweet_text.text)
Output:
Here’s a serious national security question: Why does the Biden administration want to protect COMMUNISM from blame for the Cuban Uprising? They attribute it to vaccines. Even if the Big Guy can’t comprehend it, Hunter could draw a picture for him.

Scraping "older" pages with scrapy, rules and link extractors

I have been working on a project with scrapy. With help, from this lovely community I have managed to be able to scrape the first page of this website: http://www.rotoworld.com/playernews/nfl/football-player-news?ls=roto%3anfl%3agnav. I am trying to scrape information from the "older" pages as well. I have researched "crawlspider", rules and link extractors, and believed I had the proper code. I want the spider to perform the same loop on subsequent pages. Unfortunately at the moment when I run it, it just spits out the 1st page, and doesn't continue to the "older" pages.
I am not exactly sure what I need to change and would really appreciate some help. There are posts going all the way back to February of 2004... I am new to data mining, and not sure if it is actually a realistic goal to be able to scrape every post. If it is I would like to though. Please any help is appreciated. Thanks!
import scrapy
from scrapy.contrib.spiders import CrawlSpider,Rule
from scrapy.contrib.linkextractors import LinkExtractor
class Roto_News_Spider2(crawlspider):
name = "RotoPlayerNews"
start_urls = [
'http://www.rotoworld.com/playernews/nfl/football/',
]
Rules = (Rule(LinkExtractor(allow=(), restrict_xpaths=('//input[#id="cp1_ctl00_btnNavigate1"]',)), callback="parse_page", follow= True),)
def parse(self, response):
for item in response.xpath("//div[#class='pb']"):
player = item.xpath(".//div[#class='player']/a/text()").extract_first()
position= item.xpath(".//div[#class='player']/text()").extract()[0].replace("-","").strip()
team = item.xpath(".//div[#class='player']/a/text()").extract()[1].strip()
report = item.xpath(".//div[#class='report']/p/text()").extract_first()
date = item.xpath(".//div[#class='date']/text()").extract_first() + " 2018"
impact = item.xpath(".//div[#class='impact']/text()").extract_first().strip()
source = item.xpath(".//div[#class='source']/a/text()").extract_first()
yield {"Player": player,"Position": position, "Team": team,"Report":report,"Impact":impact,"Date":date,"Source":source}
If your intention is to fetch the data traversing multiple pages, you don't need to go for scrapy. If you still want to have any solution related to scrapy then I suggest you opt for splash to handle the pagination.
I would do something like below to get the items (assuming you have already installed selenium in your machine):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get("http://www.rotoworld.com/playernews/nfl/football/")
wait = WebDriverWait(driver, 10)
while True:
for item in wait.until(EC.presence_of_all_elements_located((By.XPATH,"//div[#class='pb']"))):
player = item.find_element_by_xpath(".//div[#class='player']/a").text
player = player.encode() #it should handle the encoding issue; I'm not totally sure, though
print(player)
try:
idate = wait.until(EC.presence_of_element_located((By.XPATH, "//div[#class='date']"))).text
if "Jun 9" in idate: #put here any date you wanna go back to (last limit: where the scraper will stop)
break
wait.until(EC.presence_of_element_located((By.XPATH, "//input[#id='cp1_ctl00_btnNavigate1']"))).click()
wait.until(EC.staleness_of(item))
except:break
driver.quit()
My suggestion: Selenium
If you want to change of page automatically, you can use Selenium WebDriver.
Selenium makes you to be able to interact with the page click on buttons, write on inputs, etc. You'll need to change your code to scrap the data an then, click on the older button. Then, it'll change the page and keep scraping.
Selenium is a very useful tool. I'm using it right now, on a personal project. You can take a look at my repo on GitHub to see how it works. In the case of the page that you're trying to scrap, you cannot go to older just changing the link to be scraped, so, you need to use Selenium to do change between pages.
Hope it helps.
No need to use Selenium in current case. Before scraping you need to open url in browser and press F12 to inspect code and to see packets in Network Tab. When you press next or "OLDER" in your case you can see new set of TCP packets in Network tab. It provide to you all you need. When you understand how it work you can write working spider.
import scrapy
from scrapy import FormRequest
from scrapy.contrib.spiders import CrawlSpider,Rule
from scrapy.contrib.linkextractors import LinkExtractor
class Roto_News_Spider2(CrawlSpider):
name = "RotoPlayerNews"
start_urls = [
'http://www.<DOMAIN>/playernews/nfl/football/',
]
Rules = (Rule(LinkExtractor(allow=(), restrict_xpaths=('//input[#id="cp1_ctl00_btnNavigate1"]',)), callback="parse", follow= True),)
def parse(self, response):
for item in response.xpath("//div[#class='pb']"):
player = item.xpath(".//div[#class='player']/a/text()").extract_first()
position= item.xpath(".//div[#class='player']/text()").extract()[0].replace("-","").strip()
team = item.xpath(".//div[#class='player']/a/text()").extract()[1].strip()
report = item.xpath(".//div[#class='report']/p/text()").extract_first()
date = item.xpath(".//div[#class='date']/text()").extract_first() + " 2018"
impact = item.xpath(".//div[#class='impact']/text()").extract_first().strip()
source = item.xpath(".//div[#class='source']/a/text()").extract_first()
yield {"Player": player,"Position": position, "Team": team,"Report":report,"Impact":impact,"Date":date,"Source":source}
older = response.css('input#cp1_ctl00_btnNavigate1')
if not older:
return
inputs = response.css('div.aspNetHidden input')
inputs.extend(response.css('div.RW_pn input'))
formdata = {}
for input in inputs:
name = input.css('::attr(name)').extract_first()
value = input.css('::attr(value)').extract_first()
formdata[name] = value or ''
formdata['ctl00$cp1$ctl00$btnNavigate1.x'] = '42'
formdata['ctl00$cp1$ctl00$btnNavigate1.y'] = '17'
del formdata['ctl00$cp1$ctl00$btnFilterResults']
del formdata['ctl00$cp1$ctl00$btnNavigate1']
action_url = 'http://www.<DOMAIN>/playernews/nfl/football-player-news?ls=roto%3anfl%3agnav&rw=1'
yield FormRequest(
action_url,
formdata=formdata,
callback=self.parse
)
Be carefull you need to replace all to corrent one in my code.

Detecting valid search parameters for a site? (Web scraping)

I'm trying to scrape a bunch of search results from the site:
http://www.wileyopenaccess.com/view/journals.html
Currently the results show up on 4 pages. The 4th page could be accessed with http://www.wileyopenaccess.com/view/journals.html?page=4
I'd like some way to get all of the results on one page for easier scraping, but I have no idea how to determine which request parameters are valid. I tried a couple of things like:
http://www.wileyopenaccess.com/view/journals.html?per_page=100
http://www.wileyopenaccess.com/view/journals.html?setlimit=100
to no avail. Is there a way to detect the valid parameters of this search?
I'm using BeautifulSoup; is there some obvious way to do this that I've overlooked?
Thanks
You cannot pass any magic params to get all the links but you can use the Next button to get all the pages which will work regardless of how many pages there may be:
from bs4 import BeautifulSoup
def get_all_pages():
response = requests.get('http://www.wileyopenaccess.com/view/journals.html')
soup = BeautifulSoup(response.text)
yield soup.select("div.journalRow")
nxt = soup.select_one("div.journalPagination.borderBox a[title^=Next]")
while nxt:
response = requests.get(nxt["href"])
soup = BeautifulSoup(response.text)
yield soup.select("div.journalRow")
nxt = soup.select_one("div.journalPagination.borderBox a[title^=Next]")
for page in get_all_pages():
print(page)

Python-requests: Can't scrape all the html code from a page

I am trying to scrape the content of the
Financial Times Search page.
Using Requests, I can easily scrape the articles' titles and hyperlinks.
I would like to get the next page's hyperlink, but I can not find it in the Requests response, unlike the articles' titles or hyperlinks.
from bs4 import BeautifulSoup
import requests
url = 'http://search.ft.com/search?q=SABMiller+PLC&t=all&rpp=100&fa=people%2Corganisations%2Cregions%2Csections%2Ctopics%2Ccategory%2Cbrand&s=-lastPublishDateTime&f=lastPublishDateTime[2000-01-01T00%3A00%3A00%2C2016-01-01T23%3A59%3A59]&curations=ARTICLES%2CBLOGS%2CVIDEOS%2CPODCASTS&highlight=true&p=1et'
response = requests.get(url, auth=(my login informations))
soup = BeautifulSoup(response.text, "lxml")
def get_titles_and_links():
titles = soup.find_all('a')
for ref in titles:
if ref.get('title') and ref.get('onclick'):
print ref.get('href')
print ref.get('title')
The get_titles_and_links() function gives me the titles and links of all the articles.
However, with a similar function for the next page, I have no results:
def get_next_page():
next_page = soup.find_all("li", class_="page next")
return next_page
Or:
def get_next_page():
next_page = soup.find_all('li')
for ref in next_page:
if ref.get('page next'):
print ref.get('page next')
If you can see the required links in the page source, but are not able to get them via requests or urllib. It can mean two things.
There is something wrong with your logic. Let's assume it's not that.
Then the thing remains is: Ajax, those parts of the page you are looking for are loaded by javascript after the document.onload method fired. So you cannot get something that's not there in the first place.
My solutions(more like suggestions) are
Reverse engineer the network requests. Difficult, but universally applicable. I personally do that. You might want to use re module.
Find something that renders javascript. That's just to say that, simulate web browsing. You might wanna check out the webdriver component of selenium, Qt etc. This is easier, but kinda memory hungry and consumes a lot more network resource compared to 1.

crawl dynamic data using scrapy

I try to get the product rating information from target.com. The URL for the product is
http://www.target.com/p/bounty-select-a-size-paper-towels-white-8-huge-rolls/-/A-15258543#prodSlot=medium_1_4&term=bounty
After looking through response.body, I find out that the rating information is not statically loaded. So I need to get using other ways. I find some similar questions saying in order to get dynamic data, I need to
find out the correct XHR and where to send request
use FormRequest to get the right json
parse json
(if I am wrong about the steps please tell me)
I am stuck at step 2 right now, i find out that one XHR named 15258543 contained rating distribution, but I don't know how can I sent a request to get the json. Like to where and use what parameter.
Can someone can walk me through this?
Thank you!
The trickiest thing is to get that 15258543 product ID dynamically and then use it inside the URL to get the reviews. This product ID can be found in multiple places on the product page, for instance, there is a meta element that we can use:
<meta itemprop="productID" content="15258543">
Here is a working spider that makes a separate GET request to get the reviews, loads the JSON response via json.loads() and prints the overall product rating:
import json
import scrapy
class TargetSpider(scrapy.Spider):
name = "target"
allowed_domains = ["target.com"]
start_urls = ["http://www.target.com/p/bounty-select-a-size-paper-towels-white-8-huge-rolls/-/A-15258543#prodSlot=medium_1_4&term=bounty"]
def parse(self, response):
product_id = response.xpath("//meta[#itemprop='productID']/#content").extract_first()
return scrapy.Request("http://tws.target.com/productservice/services/reviews/v1/reviewstats/" + product_id,
callback=self.parse_ratings,
meta={"product_id": product_id})
def parse_ratings(self, response):
data = json.loads(response.body)
print(data["result"][response.meta["product_id"]]["coreStats"]["AverageOverallRating"])
Prints 4.5585.

Resources