Detecting valid search parameters for a site? (Web scraping) - web-scraping

I'm trying to scrape a bunch of search results from the site:
http://www.wileyopenaccess.com/view/journals.html
Currently the results show up on 4 pages. The 4th page could be accessed with http://www.wileyopenaccess.com/view/journals.html?page=4
I'd like some way to get all of the results on one page for easier scraping, but I have no idea how to determine which request parameters are valid. I tried a couple of things like:
http://www.wileyopenaccess.com/view/journals.html?per_page=100
http://www.wileyopenaccess.com/view/journals.html?setlimit=100
to no avail. Is there a way to detect the valid parameters of this search?
I'm using BeautifulSoup; is there some obvious way to do this that I've overlooked?
Thanks

There is no magic parameter you can pass to get all of the results at once, but you can follow the Next button to collect every page, which works regardless of how many pages there are:
import requests
from bs4 import BeautifulSoup

def get_all_pages():
    # Fetch the first page of results.
    response = requests.get('http://www.wileyopenaccess.com/view/journals.html')
    soup = BeautifulSoup(response.text, "html.parser")
    yield soup.select("div.journalRow")
    # Keep following the "Next" link until there is no next page.
    nxt = soup.select_one("div.journalPagination.borderBox a[title^=Next]")
    while nxt:
        response = requests.get(nxt["href"])
        soup = BeautifulSoup(response.text, "html.parser")
        yield soup.select("div.journalRow")
        nxt = soup.select_one("div.journalPagination.borderBox a[title^=Next]")

for page in get_all_pages():
    print(page)
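If the goal is to have all of the results in one place for easier scraping, the per-page row lists that the generator yields can simply be flattened into a single list (a small usage sketch building on the function above):

# Collect every journal row from every page into one list.
all_rows = [row for page in get_all_pages() for row in page]
print(len(all_rows))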

Related

Scrape dynamic info from same URL using python or any other tool

I am trying to scrape the URL of every company who has posted a job offer on this website:
https://jobs.workable.com/
I want to pull the info to generate some stats about this website.
The problem is that when I click on an ad and navigate through the job post, the URL is always the same. I know a bit of Python, so any solution using it would be useful; I am open to any other approach though.
Thank you in advance.
This is just pseudocode to give you an idea of what you are looking for.
import requests

headers = {'User-Agent': 'Mozilla/5.0'}
first_url = 'https://job-board-v3.workable.com/api/v1/jobs?query=&orderBy=postingUpdateTime+desc'
base_url = 'https://job-board-v3.workable.com/api/v1/jobs?query=&orderBy=postingUpdateTime+desc&offset='

page_ids = ['0', '10', '20', '30', '40', '50']  # can also be created dynamically; this is just raw

for pep_id in page_ids:
    if pep_id == '0':
        # initial page
        page = requests.get(first_url, headers=headers)
        print('You still need to parse the first page')
        # enter some parsing logic
    else:
        # subsequent pages, fetched via the offset parameter
        final_url = base_url + pep_id
        page = requests.get(final_url, headers=headers)
        print('You still need to parse the other pages')
        # enter some parsing logic
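The comment above notes that the offsets can also be generated dynamically; a rough sketch of that idea follows. The "jobs" key used to detect an empty response is an assumption about the JSON structure, not something verified against the API, so adjust it after inspecting a real response:

import requests

headers = {'User-Agent': 'Mozilla/5.0'}
base_url = 'https://job-board-v3.workable.com/api/v1/jobs?query=&orderBy=postingUpdateTime+desc&offset='

offset = 0
while True:
    response = requests.get(base_url + str(offset), headers=headers)
    data = response.json()
    # "jobs" is an assumed field name for the results list in the JSON.
    if not data.get('jobs'):
        break
    # ... parsing logic for data['jobs'] goes here ...
    offset += 10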

Python-requests: Can't scrape all the html code from a page

I am trying to scrape the content of the Financial Times Search page.
Using Requests, I can easily scrape the articles' titles and hyperlinks.
I would like to get the next page's hyperlink, but I can not find it in the Requests response, unlike the articles' titles or hyperlinks.
from bs4 import BeautifulSoup
import requests

url = 'http://search.ft.com/search?q=SABMiller+PLC&t=all&rpp=100&fa=people%2Corganisations%2Cregions%2Csections%2Ctopics%2Ccategory%2Cbrand&s=-lastPublishDateTime&f=lastPublishDateTime[2000-01-01T00%3A00%3A00%2C2016-01-01T23%3A59%3A59]&curations=ARTICLES%2CBLOGS%2CVIDEOS%2CPODCASTS&highlight=true&p=1et'
response = requests.get(url, auth=(my login informations))
soup = BeautifulSoup(response.text, "lxml")

def get_titles_and_links():
    titles = soup.find_all('a')
    for ref in titles:
        if ref.get('title') and ref.get('onclick'):
            print ref.get('href')
            print ref.get('title')
The get_titles_and_links() function gives me the titles and links of all the articles.
However, with a similar function for the next page, I have no results:
def get_next_page():
    next_page = soup.find_all("li", class_="page next")
    return next_page
Or:
def get_next_page():
    next_page = soup.find_all('li')
    for ref in next_page:
        if ref.get('page next'):
            print ref.get('page next')
If you can see the required links in the page source but are not able to get them via requests or urllib, it can mean one of two things.
There is something wrong with your logic. Let's assume it's not that.
Then what remains is Ajax: the parts of the page you are looking for are loaded by JavaScript after the document.onload event has fired, so you cannot get something that isn't there in the first place.
My solutions (more like suggestions) are:
1. Reverse engineer the network requests. Difficult, but universally applicable; I personally do that. You might want to use the re module.
2. Find something that renders JavaScript, which is to say, simulate web browsing. You might want to check out the webdriver component of Selenium, Qt, etc. (a rough sketch follows below). This is easier, but memory hungry and consumes a lot more network resources compared to option 1.
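For the second option, a minimal Selenium sketch might look like the following; the CSS selector for the next-page element is only a guess at the rendered FT markup, not something verified:

from selenium import webdriver

driver = webdriver.Firefox()  # or webdriver.Chrome(), depending on what is installed
driver.get('http://search.ft.com/search?q=SABMiller+PLC')  # the FT search URL from the question, shortened here

# At this point the JavaScript has run and the rendered HTML is available.
html = driver.page_source

# The selector below is an assumption about the pagination markup on the rendered page.
for link in driver.find_elements_by_css_selector('li.page.next a'):
    print(link.get_attribute('href'))

driver.quit()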

crawl dynamic data using scrapy

I am trying to get the product rating information from target.com. The URL for the product is
http://www.target.com/p/bounty-select-a-size-paper-towels-white-8-huge-rolls/-/A-15258543#prodSlot=medium_1_4&term=bounty
After looking through response.body, I found that the rating information is not loaded statically, so I need to get it some other way. I found some similar questions saying that in order to get dynamic data, I need to:
find out the correct XHR and where to send request
use FormRequest to get the right json
parse json
(if I am wrong about the steps please tell me)
I am stuck at step 2 right now. I found that one XHR named 15258543 contains the rating distribution, but I don't know how to send a request to get the JSON, i.e. to which URL and with what parameters.
Can someone walk me through this?
Thank you!
The trickiest thing is to get that 15258543 product ID dynamically and then use it inside the URL to get the reviews. This product ID can be found in multiple places on the product page, for instance, there is a meta element that we can use:
<meta itemprop="productID" content="15258543">
Here is a working spider that makes a separate GET request to get the reviews, loads the JSON response via json.loads() and prints the overall product rating:
import json
import scrapy

class TargetSpider(scrapy.Spider):
    name = "target"
    allowed_domains = ["target.com"]
    start_urls = ["http://www.target.com/p/bounty-select-a-size-paper-towels-white-8-huge-rolls/-/A-15258543#prodSlot=medium_1_4&term=bounty"]

    def parse(self, response):
        product_id = response.xpath("//meta[@itemprop='productID']/@content").extract_first()
        return scrapy.Request("http://tws.target.com/productservice/services/reviews/v1/reviewstats/" + product_id,
                              callback=self.parse_ratings,
                              meta={"product_id": product_id})

    def parse_ratings(self, response):
        data = json.loads(response.body)
        print(data["result"][response.meta["product_id"]]["coreStats"]["AverageOverallRating"])
Prints 4.5585.
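To try it, the spider can be saved to a file, e.g. target_spider.py (the filename here is arbitrary), and run with scrapy runspider target_spider.py.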

Why does this ScraperWiki for an ASPX site return only the same page of search results?

I'm trying to scrape an ASP-powered site using ScraperWiki's tools.
I want to grab a list of BBSes in a particular area code from the BBSmates.com website. The site displays 20 BBS search results at a time, so I will have to do form submits in order to move from one page of results to the next.
This blog post helped me get started. I thought the following code would grab the final page of BBS listings for the 314 area code (page 79).
However, the response I get is the FIRST page.
import mechanize

url = 'http://bbsmates.com/browsebbs.aspx?BBSName=&AreaCode=314'
br = mechanize.Browser()
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
response = br.open(url)
html = response.read()

br.select_form(name='aspnetForm')
br.form.set_all_readonly(False)
br['__EVENTTARGET'] = 'ctl00$ContentPlaceHolder1$GridView1'
br['__EVENTARGUMENT'] = 'Page$79'
print br.form
response2 = br.submit()
html2 = response2.read()
print html2
The blog post I cited above mentions that in their case there was a problem with a SubmitControl, so I tried disabling the two SubmitControls on this form.
br.find_control("ctl00$cmdLogin").disabled = True
Disabling cmdLogin generated HTTP Error 500.
br.find_control("ctl00$ContentPlaceHolder1$Button1").disabled = True
Disabling ContentPlaceHolder1$Button1 didn't make any difference. The submit went through, but the page it returned was still page 1 of the search results.
It's worth noting that this site does NOT use "Page$Next."
Can anyone help me figure out what I need to do to get ASPX form submit to work?
You need to post the values the page gives (EVENTVALIDATION, VIEWSTATE, etc.).
This code will work (note that it uses the awesome Requests library and not Mechanize):
import lxml.html
import requests

starturl = 'http://bbsmates.com/browsebbs.aspx?BBSName=&AreaCode=314'

s = requests.session()  # create a session object
r1 = s.get(starturl)    # get page 1
html = r1.text
root = lxml.html.fromstring(html)

# pick up the hidden ASP.NET form values
EVENTVALIDATION = root.xpath('//input[@name="__EVENTVALIDATION"]')[0].attrib['value']
VIEWSTATE = root.xpath('//input[@name="__VIEWSTATE"]')[0].attrib['value']

# build a dictionary to post to the site with the values we have collected;
# __EVENTARGUMENT can be changed to fetch another result page (3, 4, 5, etc.)
payload = {'__EVENTTARGET': 'ctl00$ContentPlaceHolder1$GridView1', '__EVENTARGUMENT': 'Page$25',
           '__EVENTVALIDATION': EVENTVALIDATION, '__VIEWSTATE': VIEWSTATE, '__VIEWSTATEENCRYPTED': '',
           'ctl00$txtUsername': '', 'ctl00$txtPassword': '',
           'ctl00$ContentPlaceHolder1$txtBBSName': '', 'ctl00$ContentPlaceHolder1$txtSysop': '',
           'ctl00$ContentPlaceHolder1$txtSoftware': '', 'ctl00$ContentPlaceHolder1$txtCity': '',
           'ctl00$ContentPlaceHolder1$txtState': '', 'ctl00$ContentPlaceHolder1$txtCountry': '',
           'ctl00$ContentPlaceHolder1$txtZipCode': '', 'ctl00$ContentPlaceHolder1$txtAreaCode': '314',
           'ctl00$ContentPlaceHolder1$txtPrefix': '', 'ctl00$ContentPlaceHolder1$txtDescription': '',
           'ctl00$ContentPlaceHolder1$Activity': 'rdoBoth', 'ctl00$ContentPlaceHolder1$drpRPP': '20'}

# post it
r2 = s.post(starturl, data=payload)
# our response is now page 2
print r2.text
When you get to the end of the results (result page 21) you have to pick up the VIEWSTATE and EVENTVALIDATION values again (and do that every 20 pages).
Note that there are a few values that you post that are empty, and a few that include values. The full list is like this:
'ctl00$txtUsername':'','ctl00$txtPassword':'','ctl00$ContentPlaceHolder1$txtBBSName':'','ctl00$ContentPlaceHolder1$txtSysop':'','ctl00$ContentPlaceHolder1$txtSoftware':'','ctl00$ContentPlaceHolder1$txtCity':'','ctl00$ContentPlaceHolder1$txtState':'','ctl00$ContentPlaceHolder1$txtCountry':'','ctl00$ContentPlaceHolder1$txtZipCode':'','ctl00$ContentPlaceHolder1$txtAreaCode':'314','ctl00$ContentPlaceHolder1$txtPrefix':'','ctl00$ContentPlaceHolder1$txtDescription':'','ctl00$ContentPlaceHolder1$Activity':'rdoBoth','ctl00$ContentPlaceHolder1$drpRPP':'20'
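A rough sketch of the full pagination loop, building on the code above: it refreshes the hidden values after every request for simplicity, even though the note above says that is only strictly needed every 20 pages, and the 79-page total comes from the question.

def get_hidden_fields(html):
    # re-read __VIEWSTATE and __EVENTVALIDATION from the page just fetched
    root = lxml.html.fromstring(html)
    return (root.xpath('//input[@name="__VIEWSTATE"]')[0].attrib['value'],
            root.xpath('//input[@name="__EVENTVALIDATION"]')[0].attrib['value'])

viewstate, eventvalidation = get_hidden_fields(r1.text)
for page_number in range(2, 80):
    payload['__EVENTARGUMENT'] = 'Page$%d' % page_number
    payload['__VIEWSTATE'] = viewstate
    payload['__EVENTVALIDATION'] = eventvalidation
    response = s.post(starturl, data=payload)
    # ... parse response.text here ...
    viewstate, eventvalidation = get_hidden_fields(response.text)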
Here is a discussion on the Scraperwiki mailing list on a similar problem: https://groups.google.com/forum/#!topic/scraperwiki/W0Xi7AxfZp0

Search a url for unique phrase using Google API

Does Google have an API with a function which will verify if a specific phrase can be found at a given url?
Say I have a webpage url: www.mysite/2011/01/check-if-phrase-exists
I want to know if the phrase foobar exists somewhere in that document (it can be anywhere in the HTML document, not just the "readable text").
The function/api would return True or False.
Question Update: The "method" should not require me to retrieve the entire page to my server and search it myself. It is the fetching of the webpage to my server that I am trying to avoid (to cut down on bandwidth).
I don't think they do, but you could do this yourself without much code (this is adapted from the App Engine docs):
import urllib2

url = "http://www.google.com/"
try:
    result = urllib2.urlopen(url)
    my_search_function(result)
    # or perhaps my_search_function(result.content)
except urllib2.URLError, e:
    handleError(e)
Then you can just define my_search_function(text) to do whatever you need, for example:
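A minimal sketch of such a function, assuming you simply search the raw response body for the literal phrase; the function name and the "foobar" phrase are just illustrative:

def my_search_function(result, phrase="foobar"):
    # read the raw HTML and check whether the phrase appears anywhere in it
    html = result.read()
    return phrase in html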
