Scraping "older" pages with scrapy, rules and link extractors - web-scraping

I have been working on a project with scrapy. With help from this lovely community I have managed to scrape the first page of this website: http://www.rotoworld.com/playernews/nfl/football-player-news?ls=roto%3anfl%3agnav. I am trying to scrape information from the "older" pages as well. I have researched CrawlSpider, rules and link extractors, and believed I had the proper code. I want the spider to perform the same loop on subsequent pages. Unfortunately, at the moment when I run it, it just spits out the first page and doesn't continue to the "older" pages.
I am not exactly sure what I need to change and would really appreciate some help. There are posts going all the way back to February of 2004... I am new to data mining and not sure whether it is actually a realistic goal to be able to scrape every post. If it is, I would like to. Please, any help is appreciated. Thanks!
import scrapy
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class Roto_News_Spider2(CrawlSpider):
    name = "RotoPlayerNews"

    start_urls = [
        'http://www.rotoworld.com/playernews/nfl/football/',
    ]

    Rules = (Rule(LinkExtractor(allow=(), restrict_xpaths=('//input[@id="cp1_ctl00_btnNavigate1"]',)), callback="parse_page", follow=True),)

    def parse(self, response):
        for item in response.xpath("//div[@class='pb']"):
            player = item.xpath(".//div[@class='player']/a/text()").extract_first()
            position = item.xpath(".//div[@class='player']/text()").extract()[0].replace("-", "").strip()
            team = item.xpath(".//div[@class='player']/a/text()").extract()[1].strip()
            report = item.xpath(".//div[@class='report']/p/text()").extract_first()
            date = item.xpath(".//div[@class='date']/text()").extract_first() + " 2018"
            impact = item.xpath(".//div[@class='impact']/text()").extract_first().strip()
            source = item.xpath(".//div[@class='source']/a/text()").extract_first()
            yield {"Player": player, "Position": position, "Team": team, "Report": report, "Impact": impact, "Date": date, "Source": source}

If your intention is to fetch the data by traversing multiple pages, you don't need Scrapy for this. If you still want a Scrapy-based solution, then I suggest you opt for Splash to handle the pagination.
I would do something like the below to get the items (assuming you have already installed Selenium on your machine):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("http://www.rotoworld.com/playernews/nfl/football/")
wait = WebDriverWait(driver, 10)

while True:
    for item in wait.until(EC.presence_of_all_elements_located((By.XPATH, "//div[@class='pb']"))):
        player = item.find_element_by_xpath(".//div[@class='player']/a").text
        player = player.encode()  # it should handle the encoding issue; I'm not totally sure, though
        print(player)

    try:
        idate = wait.until(EC.presence_of_element_located((By.XPATH, "//div[@class='date']"))).text
        if "Jun 9" in idate:  # put here any date you wanna go back to (last limit: where the scraper will stop)
            break
        wait.until(EC.presence_of_element_located((By.XPATH, "//input[@id='cp1_ctl00_btnNavigate1']"))).click()
        wait.until(EC.staleness_of(item))
    except:
        break

driver.quit()

My suggestion: Selenium
If you want to change pages automatically, you can use Selenium WebDriver.
Selenium lets you interact with the page: click buttons, type into inputs, and so on. You'll need to change your code to scrape the data and then click the "Older" button; that loads the next page so you can keep scraping.
Selenium is a very useful tool; I'm using it right now on a personal project. You can take a look at my repo on GitHub to see how it works. In the case of the page you're trying to scrape, you cannot reach the older posts just by changing the link, so you need Selenium to move between pages, as sketched below.
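A minimal sketch of that click-and-scrape loop, assuming the "Older" control keeps the cp1_ctl00_btnNavigate1 id used elsewhere in this thread and that five pages back is enough (both are assumptions for illustration):
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("http://www.rotoworld.com/playernews/nfl/football/")
wait = WebDriverWait(driver, 10)

for _ in range(5):  # how many "Older" pages to visit; adjust as needed
    # scrape whatever you need from the posts currently on the page
    items = driver.find_elements_by_xpath("//div[@class='pb']")
    for item in items:
        print(item.find_element_by_xpath(".//div[@class='player']/a").text)
    # click "Older" and wait until the current posts go stale, i.e. the next page has loaded
    wait.until(EC.element_to_be_clickable((By.ID, "cp1_ctl00_btnNavigate1"))).click()
    wait.until(EC.staleness_of(items[0]))

driver.quit()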
Hope it helps.

No need to use Selenium in the current case. Before scraping, open the URL in a browser and press F12 to inspect the page and watch the requests in the Network tab. When you press next, or "OLDER" in your case, you can see a new set of requests in the Network tab. They provide everything you need. Once you understand how it works, you can write a working spider.
import scrapy
from scrapy import FormRequest
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

class Roto_News_Spider2(CrawlSpider):
    name = "RotoPlayerNews"

    start_urls = [
        'http://www.<DOMAIN>/playernews/nfl/football/',
    ]

    Rules = (Rule(LinkExtractor(allow=(), restrict_xpaths=('//input[@id="cp1_ctl00_btnNavigate1"]',)), callback="parse", follow=True),)

    def parse(self, response):
        for item in response.xpath("//div[@class='pb']"):
            player = item.xpath(".//div[@class='player']/a/text()").extract_first()
            position = item.xpath(".//div[@class='player']/text()").extract()[0].replace("-", "").strip()
            team = item.xpath(".//div[@class='player']/a/text()").extract()[1].strip()
            report = item.xpath(".//div[@class='report']/p/text()").extract_first()
            date = item.xpath(".//div[@class='date']/text()").extract_first() + " 2018"
            impact = item.xpath(".//div[@class='impact']/text()").extract_first().strip()
            source = item.xpath(".//div[@class='source']/a/text()").extract_first()
            yield {"Player": player, "Position": position, "Team": team, "Report": report, "Impact": impact, "Date": date, "Source": source}

        older = response.css('input#cp1_ctl00_btnNavigate1')
        if not older:
            return

        # rebuild the ASP.NET form: collect the hidden inputs plus the visible filter inputs
        inputs = response.css('div.aspNetHidden input')
        inputs.extend(response.css('div.RW_pn input'))
        formdata = {}
        for input in inputs:
            name = input.css('::attr(name)').extract_first()
            value = input.css('::attr(value)').extract_first()
            formdata[name] = value or ''

        # simulate a click on the "Older" image button and drop the buttons we are not pressing
        formdata['ctl00$cp1$ctl00$btnNavigate1.x'] = '42'
        formdata['ctl00$cp1$ctl00$btnNavigate1.y'] = '17'
        del formdata['ctl00$cp1$ctl00$btnFilterResults']
        del formdata['ctl00$cp1$ctl00$btnNavigate1']

        action_url = 'http://www.<DOMAIN>/playernews/nfl/football-player-news?ls=roto%3anfl%3agnav&rw=1'

        yield FormRequest(
            action_url,
            formdata=formdata,
            callback=self.parse
        )
Be careful: you need to replace every <DOMAIN> placeholder in my code with the correct domain.
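To try either spider, you can save it to a file and run it with Scrapy's command line; the file and output names here are arbitrary examples:
scrapy runspider roto_spider.py -o player_news.json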

Related

While trying to scrape data from the website, it displays None as the output

Link of the website: https://awg.wd3.myworkdayjobs.com/AW/job/Lincoln/Business-Analyst_R15025-2
How do I get the location, job type, and salary details from the website?
Can you please help me locate the above-mentioned details in the HTML code using BeautifulSoup?
The site uses a backend API to deliver the info. If you look at your browser's Developer Tools - Network - Fetch/XHR and refresh the page, you'll see the data load via JSON from a request with a URL similar to the one you posted.
So if we edit your URL to match the backend API URL, we can hit it and parse the JSON. Unfortunately, the pay amount is buried in some HTML within the JSON, so we have to get it out with BeautifulSoup and a bit of regex to match the £###,### pattern.
import requests
from bs4 import BeautifulSoup
import re

url = 'https://awg.wd3.myworkdayjobs.com/AW/job/Lincoln/Business-Analyst_R15025-2'
search = 'https://awg.wd3.myworkdayjobs.com/wday/cxs/awg/AW/' + url.split('AW')[-1]  # api endpoint from Developer Tools

data = requests.get(search).json()
posted = data['jobPostingInfo']['startDate']
location = data['jobPostingInfo']['location']
title = data['jobPostingInfo']['title']
desc = data['jobPostingInfo']['jobDescription']

soup = BeautifulSoup(desc, 'html.parser')
pay_text = soup.text
sterling = [x[0] for x in re.findall(r'(£[0-9]+(,[0-9]+)?)', pay_text)][0]  # get any £###,### type text

final = {
    'title': title,
    'posted': posted,
    'location': location,
    'pay': sterling
}
print(final)

scrapy giving a different output than on the website, problem with geolocation?

I'm really a newbie at all of this and am just trying to learn a bit more.
So I had a lot of help to get this going, but now I'm stuck on a very weird problem.
I am scraping info from a grocery store in Australia. I'm located in the state of Victoria, and when I go to the website the price of a Red Bull is $10.50, but as soon as I run my script I get $11.25.
I am guessing it might have to do with geolocation... but I'm not sure.
I basically need some help on where to look to find out how to get the right price, the one I see when I go to the website.
Also, I noticed that when I go to the same website from my phone it gives me the price of $11.25, but if I use the store's app I get the accurate price of $10.50.
import json
import scrapy

class SpidervenderSpider(scrapy.Spider):
    name = 'spidervender'
    allowed_domains = ['woolworths.com.au']
    start_urls = ['https://www.woolworths.com.au/shop/productdetails/306165/red-bull-energy-drink']

    def parse(self, response):
        product_schema = json.loads(response.css('script[type="application/ld+json"]::text').get())
        yield {
            'title': product_schema['name'],
            'price': product_schema['offers']['price']
        }
So the code works perfectly but the price is (I presume) for a different part of Australia.
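No answer was recorded for this one, but a common first step is to reproduce the browser's session in the spider by sending its cookies and a realistic User-Agent, since the store most likely remembers your location in a cookie. A rough sketch under that assumption; the cookie name below is purely hypothetical, so copy the real names and values from your browser's developer tools:
import json
import scrapy

class SpidervenderSpider(scrapy.Spider):
    name = 'spidervender'
    allowed_domains = ['woolworths.com.au']

    def start_requests(self):
        # 'store-cookie' is a made-up name; copy the actual cookie(s) your browser sends
        cookies = {'store-cookie': 'value-copied-from-browser'}
        headers = {'User-Agent': 'Mozilla/5.0'}
        yield scrapy.Request(
            'https://www.woolworths.com.au/shop/productdetails/306165/red-bull-energy-drink',
            cookies=cookies,
            headers=headers,
            callback=self.parse,
        )

    def parse(self, response):
        product_schema = json.loads(response.css('script[type="application/ld+json"]::text').get())
        yield {'title': product_schema['name'], 'price': product_schema['offers']['price']}
If the prices still differ, compare the full request headers and cookies between the browser and the spider; whatever differs is usually what drives the regional pricing.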

Scrape dynamic info from same URL using python or any other tool

I am trying to scrape the URL of every company that has posted a job offer on this website:
https://jobs.workable.com/
I want to pull the info to generate some stats about this website.
The problem is that when I click on an ad and navigate through the job post, the URL is always the same. I know a bit of Python, so any solution using it would be useful, but I am open to any other approach.
Thank you in advance.
This is just pseudo code to give you an idea of what you are looking for.
import requests

headers = {'User-Agent': 'Mozilla/5.0'}
first_url = 'https://job-board-v3.workable.com/api/v1/jobs?query=&orderBy=postingUpdateTime+desc'
base_url = 'https://job-board-v3.workable.com/api/v1/jobs?query=&orderBy=postingUpdateTime+desc&offset='
page_ids = ['0', '10', '20', '30', '40', '50']  # can also be created dynamically; this is just raw

for pep_id in page_ids:
    if pep_id == '0':
        # for the initial page
        page = requests.get(first_url, headers=headers)
        print('You still need to parse the first page')
        # Enter some parsing logic
    else:
        final_url = base_url + pep_id
        page = requests.get(final_url, headers=headers)
        print('You still need to parse the other pages')
        # Enter some parsing logic
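The parsing step that the pseudo code leaves out might look roughly like the sketch below. The 'jobs' and 'url' key names are assumptions about the JSON shape, so check the actual response in your browser's Network tab and adjust them:
import requests

headers = {'User-Agent': 'Mozilla/5.0'}
base_url = 'https://job-board-v3.workable.com/api/v1/jobs?query=&orderBy=postingUpdateTime+desc&offset='

company_urls = []
for offset in range(0, 60, 10):  # same pages as the pseudo code above
    data = requests.get(base_url + str(offset), headers=headers).json()
    for job in data.get('jobs', []):  # 'jobs' and 'url' are assumed field names
        company_urls.append(job.get('url'))

print(company_urls)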

Loading the full HTML after clicking a button to load additional elements with Selenium

I want to scrape a page and collect all the links. The page shows 30 entries, and to view the full list it's necessary to click a "load all" button.
I'm using the following code:
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.PhantomJS()
driver.get('http://www.christies.com/lotfinder/searchresults.aspx?&searchfrom=header&lid=1&entry=edgar%20degas&searchtype=p&action=paging&pg=all')

labtn = driver.find_element_by_css_selector('a.load-all')
labtn.click()

elem = driver.find_element_by_xpath("//*")
source_code = elem.get_attribute("outerHTML")
soup = BeautifulSoup(source_code, 'lxml')

url_list = []
for div in soup.find_all(class_='image-container'):
    for childdiv in div.find_all('a'):
        url_list.append(childdiv['href'])

print(url_list)
Here's the HTML mark-up
<div class="loadAllbtn">
    <a class="load-all" id="loadAllUpcomingPast" href="javascript:void(0);">Load all</a>
</div>
I am still getting the original 30 links and the initial code. It seems that I'm not properly using Selenium and would like to know what I'm doing wrong.
Selenium otherwise works: Node.js is installed, and I managed to take a screenshot and save it to a file.
When you click "Load all", you make an additional request to receive all the items. You need to wait some time for the server's response:
from selenium.webdriver.support.ui import WebDriverWait as wait
driver = webdriver.PhantomJS()
driver.get('http://www.christies.com/lotfinder/searchresults.aspx?&searchfrom=header&lid=1&entry=edgar%20degas&searchtype=p&action=paging&pg=all')
labtn = driver.find_element_by_css_selector('a.load-all')
labtn.click()
wait(driver, 15).until(lambda x: len(driver.find_elements_by_css_selector("div.detailscontainer")) > 30)
The above code should allow you to wait up to 15 seconds until the number of items exceeds 30. Then you can scrape the page source with the complete list of items.
P.S. Note that you don't need these lines of code
elem = driver.find_element_by_xpath("//*")
source_code = elem.get_attribute("outerHTML")
to get the page source. Just try
source_code = driver.page_source
P.P.S. You also don't need BeautifulSoup to get the links to each item. You can do it as
links = [link.get_attribute('href') for link in driver.find_elements_by_css_selector('div.image-container>a')]

Python-getting data from an asp.net AJAX application

Using Python, I'm trying to read the values on http://utahcritseries.com/RawResults.aspx. I can read the page just fine but am having difficulty changing the value of the year combo box to view data from other years. How can I read the data for years other than the default of 2002?
The page appears to be doing an HTTP POST once the year combo box has changed. The name of the control is ctl00$ContentPlaceHolder1$ddlSeries. I try setting a value for this control using urllib.urlencode(postdata), but I must be doing something wrong: the data on the page is not changing. Can this be done in Python?
I'd prefer not to use Selenium, if at all possible.
I've been using code like this (from Stack Overflow user dbr):
import urllib

postdata = {'ctl00$ContentPlaceHolder1$ddlSeries': 9}
src = urllib.urlopen(
    "http://utahcritseries.com/RawResults.aspx",
    data=urllib.urlencode(postdata)
).read()

print src
But it seems to be pulling up the same 2002 data. I've tried using Firebug to inspect the headers, and I see a lot of extraneous and random-looking data being sent back and forth. Do I need to post these values back to the server as well?
Use the excellent mechanize library:
from mechanize import Browser
b = Browser()
b.open("http://utahcritseries.com/RawResults.aspx")
b.select_form(nr=0)
year = b.form.find_control(type='select')
year.get(label='2005').selected = True
src = b.submit().read()
print src
Mechanize is available on PyPI: easy_install mechanize
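If you would rather stick with plain HTTP libraries, note that the "random-looking data" from the question is the ASP.NET view state, and it generally does need to be posted back. A rough sketch of that idea using requests and BeautifulSoup (my choice here, not the original answer's libraries) could look like this:
import requests
from bs4 import BeautifulSoup

url = "http://utahcritseries.com/RawResults.aspx"
session = requests.Session()

# First GET the page and collect the hidden ASP.NET fields (__VIEWSTATE, __EVENTVALIDATION, ...)
soup = BeautifulSoup(session.get(url).text, "html.parser")
postdata = {inp["name"]: inp.get("value", "")
            for inp in soup.find_all("input", type="hidden") if inp.get("name")}

# Set the control we want to change, then POST the whole form back.
# Depending on the page, __EVENTTARGET may also need to be set to the dropdown's name.
postdata["ctl00$ContentPlaceHolder1$ddlSeries"] = "9"
src = session.post(url, data=postdata).text
print(src)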
