Scraping an ASPX page with authentication using Python 3 - asp.net

I am trying to use Python's requests library to scrape an ASPX site and get information from a table on the page.
The problem I am experiencing has also been well described in "How to web scrape an ASPX page that requires authentication", which had no replies at the time of writing.
The way I am currently going about it is by:
creating a requests session,
fetching the login page with a GET request (using a custom User-Agent header),
parsing the response from the GET request with BeautifulSoup,
setting all of the parameters in a login_data dictionary.
import urllib.parse
import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36"}

with requests.Session() as session:
    session.headers.update(headers)

    # fetch the login page and parse out the ASP.NET hidden fields
    # (login_url, account_name and account_pass are defined elsewhere)
    response = session.get(login_url)
    soup = BeautifulSoup(response.content, "html.parser")
    VIEWSTATE = soup.find(id="__VIEWSTATE")['value']
    VIEWSTATEGENERATOR = soup.find(id="__VIEWSTATEGENERATOR")['value']
    EVENTVALIDATION = soup.find(id="__EVENTVALIDATION")['value']
    EVENTTARGET = soup.find(id="__EVENTTARGET")['value']
    EVENTARGUMENT = soup.find(id="__EVENTARGUMENT")['value']
    PREVIOUSPAGE = soup.find(id="__PREVIOUSPAGE")['value']
    CMSESSIONID = soup.find(id="CMSessionId")['value']
    soup.find(id="MasterHeaderPlaceHolder_ctl00_userNameTextbox")['value']  # stray lookup; result is not used

    # build the form payload for the login POST
    login_data = {
        "__VIEWSTATE": VIEWSTATE,
        "txtUserName": account_name,
        "txtPassword": account_pass,
        "__VIEWSTATEGENERATOR": VIEWSTATEGENERATOR,
        "__EVENTVALIDATION": EVENTVALIDATION,
        "__EVENTTARGET": EVENTTARGET,
        "__EVENTARGUMENT": EVENTARGUMENT,
        "__PREVIOUSPAGE": PREVIOUSPAGE,
        "CMSessionId": CMSESSIONID,
        "MasterHeaderPlaceHolder_ctl00_userNameTextbox": account_name,
        "MasterHeaderPlaceHolder_ctl00_passwordTextbox": account_pass,
        "MasterHeaderPlaceHolder_ctl00_tempPasswordTextbox": account_pass
    }
    login_data_encoded = urllib.parse.urlencode(login_data)  #*
The login_data dictionary is then passed as the data of a POST request to the login_url.
The same session is then used to send a GET request for the report_url.
response_1 = session.post(login_url, data=login_data)
response_2 = session.get(report_url)
The problem seems to be that the login is not taking effect, as the GET request is being re-routed to a login page.
Can anyone kindly shed some light on why this is the case? I am guessing that this is the correct flow, but please let me know if there is anything I am doing wrong or that can be improved.
Unfortunately, I am currently limited to using requests or other popular Python 3 libraries, as this is a requirement (referencing browser .exe files, as suggested in some replies on the subject, is not an option).
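For what it's worth, a quick way to narrow this down is to inspect what the login POST itself returns before looking at the report request. The snippet below is only a debugging sketch that reuses the names already defined above (login_url, login_data, session); the "still on the login page" check simply reuses the username textbox id from the code.
response_1 = session.post(login_url, data=login_data)
print(response_1.status_code)       # ASPX logins often return 200 even when they fail
print(response_1.history)           # any redirects issued during the POST
print(session.cookies.get_dict())   # did the server set an authentication/session cookie?

# if the login form is still present, the POST did not actually log us in
check = BeautifulSoup(response_1.content, "html.parser")
print("still on login page:", check.find(id="MasterHeaderPlaceHolder_ctl00_userNameTextbox") is not None)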

Related

While trying to scrape the data from the website, it is displaying None as the output

Link of the website: https://awg.wd3.myworkdayjobs.com/AW/job/Lincoln/Business-Analyst_R15025-2
How do I get the location, job type, and salary details from the website?
Can you please help me locate the above-mentioned details in the HTML using BeautifulSoup?
The site uses a backend API to deliver the info. If you look at your browser's Developer Tools - Network - Fetch/XHR and refresh the page, you'll see the data loaded as JSON from a request with a URL similar to the one you posted.
So if we edit your URL to match the backend API URL, we can hit it and parse the JSON. Unfortunately the pay amount is buried in some HTML within the JSON, so we have to get it out with BeautifulSoup and a bit of regex to match the £###,### pattern.
import requests
from bs4 import BeautifulSoup
import re

url = 'https://awg.wd3.myworkdayjobs.com/AW/job/Lincoln/Business-Analyst_R15025-2'
search = 'https://awg.wd3.myworkdayjobs.com/wday/cxs/awg/AW/' + url.split('AW')[-1]  # api endpoint from Developer Tools

data = requests.get(search).json()
posted = data['jobPostingInfo']['startDate']
location = data['jobPostingInfo']['location']
title = data['jobPostingInfo']['title']
desc = data['jobPostingInfo']['jobDescription']

soup = BeautifulSoup(desc, 'html.parser')
pay_text = soup.text
sterling = [x[0] for x in re.findall(r'(£[0-9]+(,[0-9]+)?)', pay_text)][0]  # get any £###,### type text

final = {
    'title': title,
    'posted': posted,
    'location': location,
    'pay': sterling
}
print(final)

Scrapy download zip file from ASP.NET site

I need some help getting Scrapy to download a file from an ASP.NET site. Normally, from a browser, one would click the link and the file would begin downloading, but that is not possible with Scrapy, so what I am trying to do is the following:
def retrieve(self, response):
    print('Response URL: {}'.format(response.url))
    pattern = re.compile('(dg[^\']*)')
    for file in response.xpath('//table[@id="dgFile"]/tbody/tr/td[2]/a'):
        file_url = file.xpath('@href').extract_first()
        target = re.search(pattern, file_url).group(1)
        viewstate = response.xpath('//*[@id="__VIEWSTATE"]/@value').extract_first()
        viewstategenerator = response.xpath('//*[@id="__VIEWSTATEGENERATOR"]').extract_first()
        eventvalidation = response.xpath('//*[@id="__EVENTVALIDATION"]').extract_first()
        data = {
            '_EVENTTARGET': target,
            '_VIEWSTATE': viewstate,
            '_VIEWSTATEGEERATOR': viewstategenerator,
            '_EVENTVALIDATION': eventvalidation
        }
        yield FormRequest.from_response(
            response,
            formdata=data,
            callback=self.end(response)
        )
I am trying to submit the information to the page in order to receive the zip file back as a response; however, this is not working as I hoped it would. Instead I am simply getting the same page back as a response.
In a situation like this, is it even possible to use Scrapy to download this file? Does anyone have any pointers?
I have also tried Selenium+PhantomJS, but I ran into a dead end trying to transfer the session from Scrapy to Selenium. I would be willing to use Selenium for this one function, but I need to use Scrapy for this project.
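For reference, a few things stand out in the snippet above: the ASP.NET form fields normally start with a double underscore (__EVENTTARGET, __VIEWSTATE, __VIEWSTATEGENERATOR, __EVENTVALIDATION), the generator/validation XPaths never select @value, and callback=self.end(response) calls the method instead of passing it. A rough sketch of how such a post-back is often built, relying on FormRequest.from_response to copy the hidden fields from the page automatically (save_zip is a hypothetical callback name, and re/FormRequest are assumed imported as in the question):
def retrieve(self, response):
    for href in response.xpath('//table[@id="dgFile"]/tbody/tr/td[2]/a/@href').extract():
        target = re.search(r"(dg[^']*)", href).group(1)
        # from_response picks up __VIEWSTATE, __EVENTVALIDATION, etc. from the page;
        # we only need to override the event target that the link click would set
        yield FormRequest.from_response(
            response,
            formdata={'__EVENTTARGET': target, '__EVENTARGUMENT': ''},
            callback=self.save_zip      # pass the method itself, don't call it
        )

def save_zip(self, response):
    # hypothetical callback: write the returned bytes to disk
    with open('download.zip', 'wb') as f:
        f.write(response.body)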

Flask remote authentication issue with Graphite

I have a Flask app which sends a request to a Graphite server to authenticate and redirect to its dashboard (with the setting REMOTE_USER_AUTHENTICATION = True changed). The request is as follows:
url = 'https://graphite.localdomain/dashboard'
s = requests.Session()
r = s.get(url, auth=('userx', 'passwordx'),verify=False)
print r.cookies.__dict__
return (r.text, r.status_code, r.headers.items())
The authentication from the request to the Graphite server is good: I get 200s for valid users and 401s for invalid users.
"print r.cookies.__dict__" will output...
{'_now': 1429303134, '_policy': <cookielib.DefaultCookiePolicy instance
at 0x7f263ec2b638>, '_cookies': {'graphite.localdomain': {'/':
{'sessionid': Cookie(version=0, name='sessionid',
value='**********masked**********', port=None, port_specified=False,
domain='graphite.localdomain', domain_specified=False,
domain_initial_dot=False, path='/', path_specified=True, secure=False,
expires=1430512734, discard=False, comment=None, comment_url=None,
rest={'httponly': None}, rfc2109=False)}}}, '_cookies_lock': <_RLock
owner=None count=0>}
...which appears right, because it looks identical to the one I get from logging in directly to Graphite. But when I return the response object (as in "Return a requests.Response object from Flask"), the browser reports content-encoding errors in both Chrome and Firefox. If I change that to something like...
return r.content
...the dashboard page appears, but it's missing everything because the CSS and JS resources are 404'ing.
I am obviously not understanding something, any help would be greatly appreciated.
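One likely cause (an assumption here, since the response headers aren't shown) is that requests transparently decompresses the body, while r.headers still advertises the original Content-Encoding and Content-Length, so relaying those headers as-is tells the browser to expect gzip that never arrives. A minimal sketch of returning the proxied page with those headers dropped:
EXCLUDED = {'content-encoding', 'content-length', 'transfer-encoding', 'connection'}

r = s.get(url, auth=('userx', 'passwordx'), verify=False)
headers = [(k, v) for k, v in r.headers.items() if k.lower() not in EXCLUDED]
return (r.content, r.status_code, headers)
The 404s on CSS and JS are a separate issue: the dashboard's relative asset URLs resolve against the Flask app, which does not serve Graphite's static files, so those would need to be proxied or rewritten as well.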

Staying Logged In Using Requests and Python

I am trying to log onto a website using Python and requests. I'm pretty sure I am logging on properly. The next part is that I go to a different page and try to download a file from it. However, in order to download the file you have to be logged in. When I go to download the file, it redirects me to the login page saying I haven't logged in. I am stuck and don't know what to do! By the way, the website is grabcad.com; what I'm basically trying to do is press the "Download All" button featured on a page such as
http://grabcad.com/library/apple-ipod-touch-5th-gen-1
import requests

payload = {'member[email]': 'username', 'member[password]': 'pass'}

with requests.Session() as s:
    rObject = s.post('http://www.grabcad.com/login', data=payload)
    cookies = rObject.cookies
    rObject = s.get('http://www.grabcad.com' + downloadUrl, cookies=cookies)
    # downloadUrl is something I obtain earlier and I know it's correct.
    # It's the URL for when you press the "Download All" button.

    path = 'C:\\User\\Desktop\\filename'
    with open(path, 'wb') as f:
        for chunk in rObject.iter_content():
            f.write(chunk)
So I took an altogether different route to solve the problem: I simply used mechanize, which is an automated browser tool for Python.
# how to use mechanize to log in, specifically for grabcad
import mechanize

b = mechanize.Browser()
b.open('http://grabcad.com/login')
b.form = list(b.forms())[1]
control = b.form.find_control("member[email]")
control2 = b.form.find_control("member[password]")
control.value = 'username'
control2.value = 'pass'
b.submit()

# Download part
path = 'C:\\User\\Desktop\\filename'
b.retrieve('https://www.grabcad.com' + downloadUrl, path)
# downloadUrl is obtained earlier and is simply the URL for the download
How are you ensuring that you're logged in correctly? I would print out the HTML after sending that POST request from the session object and ensure it isn't a login page or an invalid-password page. Cookies are automatically persistent across requests made on the session object, so I believe that the initial login isn't successful (http://docs.python-requests.org/en/latest/user/advanced/#session-objects).
Personally, I would use selenium for this though.
I have correctly logged into grabcad with the following code:
import requests
s = requests.session()
payload = {'member[email]': 'yourEmail', 'member[password]': 'yourPassword'}
p = s.post('https://grabcad.com/login', data=payload) # Ensure you're posting to HTTPS
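Following that suggestion, a minimal way to sanity-check the login and then reuse the same session (and its cookies) for the download might look like this; downloadUrl and the 'member[email]' marker used to detect the login form are assumptions for illustration.
print(p.status_code)
print(p.url)
print('member[email]' in p.text)   # True would suggest we are still on the login form

# reuse the same session for the download
r = s.get('https://grabcad.com' + downloadUrl, stream=True)
with open('model.zip', 'wb') as f:
    for chunk in r.iter_content(chunk_size=8192):
        f.write(chunk)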

Why does this ScraperWiki for an ASPX site return only the same page of search results?

I'm trying to scrape an ASP-powered site using ScraperWiki's tools.
I want to grab a list of BBSes in a particular area code from the BBSmates.com website. The site displays 20 BBS search results at a time, so I will have to do form submits in order to move from one page of results to the next.
This blog post helped me get started. I thought the following code would grab the final page of BBS listings for the 314 area code (page 79).
However, the response I get is the FIRST page.
import mechanize

url = 'http://bbsmates.com/browsebbs.aspx?BBSName=&AreaCode=314'
br = mechanize.Browser()
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
response = br.open(url)
html = response.read()
br.select_form(name='aspnetForm')
br.form.set_all_readonly(False)
br['__EVENTTARGET'] = 'ctl00$ContentPlaceHolder1$GridView1'
br['__EVENTARGUMENT'] = 'Page$79'
print br.form
response2 = br.submit()
html2 = response2.read()
print html2
The blog post I cited above mentions that in their case there was a problem with a SubmitControl, so I tried disabling the two SubmitControls on this form.
br.find_control("ctl00$cmdLogin").disabled = True
Disabling cmdLogin generated HTTP Error 500.
br.find_control("ctl00$ContentPlaceHolder1$Button1").disabled = True
Disabling ContentPlaceHolder1$Button1 didn't make any difference. The submit went through, but the page it returned was still page 1 of the search results.
It's worth noting that this site does NOT use "Page$Next."
Can anyone help me figure out what I need to do to get ASPX form submit to work?
You need to post the values the page gives (EVENTVALIDATION, VIEWSTATE, etc.).
This code will work (note that it uses the awesome Requests library and not Mechanize):
import lxml.html
import requests
starturl = 'http://bbsmates.com/browsebbs.aspx?BBSName=&AreaCode=314'
s = requests.session() # create a session object
r1 = s.get(starturl) #get page 1
html = r1.text
root = lxml.html.fromstring(html)
#pick up the javascript values
EVENTVALIDATION = root.xpath('//input[@name="__EVENTVALIDATION"]')[0].attrib['value']
#find the __EVENTVALIDATION value
VIEWSTATE = root.xpath('//input[@name="__VIEWSTATE"]')[0].attrib['value']
#find the __VIEWSTATE value
# build a dictionary to post to the site with the values we have collected. The __EVENTARGUMENT can be changed to fetch another result page (3,4,5 etc.)
payload = {
    '__EVENTTARGET': 'ctl00$ContentPlaceHolder1$GridView1', '__EVENTARGUMENT': 'Page$25',
    '__EVENTVALIDATION': EVENTVALIDATION, '__VIEWSTATE': VIEWSTATE, '__VIEWSTATEENCRYPTED': '',
    'ctl00$txtUsername': '', 'ctl00$txtPassword': '',
    'ctl00$ContentPlaceHolder1$txtBBSName': '', 'ctl00$ContentPlaceHolder1$txtSysop': '',
    'ctl00$ContentPlaceHolder1$txtSoftware': '', 'ctl00$ContentPlaceHolder1$txtCity': '',
    'ctl00$ContentPlaceHolder1$txtState': '', 'ctl00$ContentPlaceHolder1$txtCountry': '',
    'ctl00$ContentPlaceHolder1$txtZipCode': '', 'ctl00$ContentPlaceHolder1$txtAreaCode': '314',
    'ctl00$ContentPlaceHolder1$txtPrefix': '', 'ctl00$ContentPlaceHolder1$txtDescription': '',
    'ctl00$ContentPlaceHolder1$Activity': 'rdoBoth', 'ctl00$ContentPlaceHolder1$drpRPP': '20'
}
# post it
r2 = s.post(starturl, data=payload)
# our response is now page 2
print r2.text
When you get to the end of the results (result page 21) you have to pick up the VIEWSTATE and EVENTVALIDATION values again (and do that every 20 pages).
Note that there are a few values that you post that are empty, and a few that include values. The full list is like this:
'ctl00$txtUsername': '', 'ctl00$txtPassword': '',
'ctl00$ContentPlaceHolder1$txtBBSName': '', 'ctl00$ContentPlaceHolder1$txtSysop': '',
'ctl00$ContentPlaceHolder1$txtSoftware': '', 'ctl00$ContentPlaceHolder1$txtCity': '',
'ctl00$ContentPlaceHolder1$txtState': '', 'ctl00$ContentPlaceHolder1$txtCountry': '',
'ctl00$ContentPlaceHolder1$txtZipCode': '', 'ctl00$ContentPlaceHolder1$txtAreaCode': '314',
'ctl00$ContentPlaceHolder1$txtPrefix': '', 'ctl00$ContentPlaceHolder1$txtDescription': '',
'ctl00$ContentPlaceHolder1$Activity': 'rdoBoth', 'ctl00$ContentPlaceHolder1$drpRPP': '20'
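Putting that together, a loop over the result pages could look roughly like this; it is only a sketch that reuses starturl, the session s and the payload dictionary from the snippet above, and it re-reads the hidden fields from every response (a safe assumption, even if they only change every 20 pages):
html = s.get(starturl).text

for page in range(2, 80):   # result pages 2..79 for area code 314
    root = lxml.html.fromstring(html)
    payload['__VIEWSTATE'] = root.xpath('//input[@name="__VIEWSTATE"]')[0].attrib['value']
    payload['__EVENTVALIDATION'] = root.xpath('//input[@name="__EVENTVALIDATION"]')[0].attrib['value']
    payload['__EVENTARGUMENT'] = 'Page$%d' % page
    html = s.post(starturl, data=payload).text
    # ... parse the 20 BBS rows out of `html` here ...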
Here is a discussion on the Scraperwiki mailing list on a similar problem: https://groups.google.com/forum/#!topic/scraperwiki/W0Xi7AxfZp0
