Find the final redirected URL using Python

I am trying to use Python to find the final redirected URL for a given URL. I tried various solutions from Stack Overflow answers, but nothing worked for me; I only ever get the original URL back.
To be specific, I tried the requests, urllib2, and urlparse libraries, and none of them worked as they should. Here are some of the snippets I tried:
Solution 1:
s = requests.session()
r = s.post('https://www.boots.com/search/10055096', allow_redirects=True)
print(r.history)
print(r.history[1].url)
Result:
[<Response [301]>, <Response [302]>]
https://www.boots.com/search/10055096
Solution 2:
import urlparse
url = 'https://www.boots.com/search/10055096'
try:
    out = urlparse.parse_qs(urlparse.urlparse(url).query)['out'][0]
    print(out)
except Exception as e:
    print('not found')
Result:
not found
Solution 3:
import urllib2
def get_redirected_url(url):
    opener = urllib2.build_opener(urllib2.HTTPRedirectHandler)
    request = opener.open(url)
    return request.url
print(get_redirected_url('https://www.boots.com/search/10055096'))
Result:
HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Found
The expected URL below is the final redirected page, and that is what I want to return.
Original URL: https://www.boots.com/search/10055096
Expected URL: https://www.boots.com/gillette-fusion5-razor-blades-4pk-10055096
Solution 1 was the closest one: at least it returned two responses, but the second response wasn't the final page. Judging by its content, it seems to have been an intermediate loading page.

The first request returns an HTML file that contains JavaScript to update the page, and requests does not execute JavaScript. You can find the updated link by using:
import requests
from bs4 import BeautifulSoup
import re

r = requests.get('https://www.boots.com/search/10055096')
soup = BeautifulSoup(r.content, 'html.parser')
# The product URL sits in the inline script that follows the search box input
reg = soup.find('input', id='searchBoxText').findNext('script').contents[0]
# Pull the first http(s) URL out of that script's text
print(re.search(r'ht[\w\://\.-]+', reg).group())
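For reuse, the same idea can be wrapped in a small function. This is a minimal sketch under the same assumption as above (a searchBoxText input followed by an inline script containing the target URL); get_final_url is my own name for it, and it simply returns None when the page no longer matches that structure.
import re
import requests
from bs4 import BeautifulSoup

def get_final_url(url):
    # Best-effort: extract the redirect target embedded in the page's inline JS.
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html.parser')
    box = soup.find('input', id='searchBoxText')
    script = box.findNext('script') if box else None
    if not script or not script.contents:
        return None  # page structure changed, or a different page came back
    match = re.search(r'ht[\w\://\.-]+', script.contents[0])
    return match.group() if match else None

print(get_final_url('https://www.boots.com/search/10055096'))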

Related

Python Requests doesn't render full code from page

I'm trying to capture each agent's data from this page using Python requests, but response.text doesn't contain the markup shown in the browser's code inspector. Below is my script.
import requests
import re
response = requests.get('https://exprealty.com/agents/#/?city=Aberdeen+WA&country=US')
result = re.search('Mike Arthur',response.text)
try:
    print(result.group())
except:
    print('Nothing found.')
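No answer is shown for this one, but the symptom has the same root cause as the question above: requests fetches only the raw HTML, and everything after the # fragment is handled client-side by JavaScript, so the agent markup never reaches response.text. A minimal sketch of one common workaround, swapping in Selenium so a real browser renders the page first (this assumes Selenium and a ChromeDriver are installed; the crude sleep stands in for a proper WebDriverWait):
import time
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://exprealty.com/agents/#/?city=Aberdeen+WA&country=US')
time.sleep(10)               # crude wait for the JS app to render the agent list
html = driver.page_source    # now includes the JavaScript-rendered markup
print('Mike Arthur' in html)
driver.quit()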

While trying to scrape data from the website, it displays None as the output

Link of the website: https://awg.wd3.myworkdayjobs.com/AW/job/Lincoln/Business-Analyst_R15025-2
How do I get the location, job type, and salary details from the website?
Can you please help me locate the above-mentioned details in the HTML using BeautifulSoup?
The site uses a backend API to deliver the info. If you look at your browser's Developer Tools → Network → Fetch/XHR and refresh the page, you'll see the data load as JSON from a request whose URL is similar to the one you posted.
So if we edit your URL to match the backend API URL, we can hit that endpoint and parse the JSON. Unfortunately the pay amount is buried in some HTML within the JSON, so we have to get it out with BeautifulSoup and a bit of regex to match the £###,### pattern.
import requests
from bs4 import BeautifulSoup
import re
url = 'https://awg.wd3.myworkdayjobs.com/AW/job/Lincoln/Business-Analyst_R15025-2'
search = 'https://awg.wd3.myworkdayjobs.com/wday/cxs/awg/AW/'+url.split('AW')[-1] #api endpoint from Developer Tools
data = requests.get(search).json()
posted = data['jobPostingInfo']['startDate']
location = data['jobPostingInfo']['location']
title = data['jobPostingInfo']['title']
desc = data['jobPostingInfo']['jobDescription']
soup = BeautifulSoup(desc,'html.parser')
pay_text = soup.text
sterling = [x[0] for x in re.findall(r'(£[0-9]+(,[0-9]+)?)', pay_text)][0] #get any £###,### type text
final = {
    'title': title,
    'posted': posted,
    'location': location,
    'pay': sterling
}
print(final)
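One fragile spot worth flagging: url.split('AW')[-1] breaks if 'AW' ever appears elsewhere in the URL, and it also leaves a double slash in the endpoint. A slightly more defensive sketch of the same rewrite; the /wday/cxs/awg/AW prefix is taken from the answer above, while the urlparse handling is my own assumption:
from urllib.parse import urlparse

def to_api_url(job_url):
    # Graft the path after the '/AW' tenant segment onto the CXS endpoint
    # observed in Developer Tools.
    path = urlparse(job_url).path      # /AW/job/Lincoln/Business-Analyst_R15025-2
    suffix = path.split('/AW', 1)[-1]  # /job/Lincoln/Business-Analyst_R15025-2
    return 'https://awg.wd3.myworkdayjobs.com/wday/cxs/awg/AW' + suffix

print(to_api_url('https://awg.wd3.myworkdayjobs.com/AW/job/Lincoln/Business-Analyst_R15025-2'))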

Scrape dynamic info from same URL using python or any other tool

I am trying to scrape the URL of every company that has posted a job offer on this website:
https://jobs.workable.com/
I want to pull the info to generate some stats about this website.
The problem is that when I click on an ad and navigate through the job post, the URL is always the same. I know a bit of Python, so any solution using it would be useful, but I am open to any other approach.
Thank you in advance.
This is just pseudocode to give you an idea of what you are looking for.
import requests
headers = {'User-Agent': 'Mozilla/5.0'}
first_url = 'https://job-board-v3.workable.com/api/v1/jobs?query=&orderBy=postingUpdateTime+desc'
base_url = 'https://job-board-v3.workable.com/api/v1/jobs?query=&orderBy=postingUpdateTime+desc&offset='
page_ids = ['0', '10', '20', '30', '40', '50']  ## can also be created dynamically, this is just raw
for pep_id in page_ids:
    if pep_id == '0':
        # for the initial page
        page = requests.get(first_url, headers=headers)
        print('You still need to parse the first page')
        ## Enter some parsing logic
    else:
        final_url = base_url + pep_id
        page = requests.get(final_url, headers=headers)
        print('You still need to parse the other pages')
        ## Enter some parsing logic
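In place of each "Enter some parsing logic" comment, the step might look like the snippet below. A loud caveat: I have not verified this job-board API's response schema, so the 'jobs' and 'url' keys are placeholder assumptions that only illustrate the shape of the logic, not the real field names.
data = page.json()                # the endpoint answers with JSON
for job in data.get('jobs', []):  # 'jobs' is an assumed key; check the real payload
    print(job.get('url'))         # 'url' is an assumed key as well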

Scraping an ASPX page with authentication. Using Python 3

I am trying to use Python's requests library to scrape an ASPX site and get information from a table inside it.
The problem I am experiencing has also been well described in "How to web scrape an ASPX page that requires authentication", with no replies at the time of writing.
The way I am currently going about it is:
- creating a requests session and setting a browser User-Agent header,
- fetching the login page with a GET request,
- parsing the information received with BeautifulSoup to collect the hidden form fields,
- setting all of those parameters in a login_data dictionary.
import urllib.parse
import requests
from bs4 import BeautifulSoup

headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36"}

with requests.Session() as session:
    session.headers.update(headers)
    response = session.get(login_url)
    soup = BeautifulSoup(response.content, "html.parser")
    # Collect the hidden ASP.NET form fields from the login page
    VIEWSTATE = soup.find(id="__VIEWSTATE")['value']
    VIEWSTATEGENERATOR = soup.find(id="__VIEWSTATEGENERATOR")['value']
    EVENTVALIDATION = soup.find(id="__EVENTVALIDATION")['value']
    EVENTTARGET = soup.find(id="__EVENTTARGET")['value']
    EVENTARGUMENT = soup.find(id="__EVENTARGUMENT")['value']
    PREVIOUSPAGE = soup.find(id="__PREVIOUSPAGE")['value']
    CMSESSIONID = soup.find(id="CMSessionId")['value']
    soup.find(id="MasterHeaderPlaceHolder_ctl00_userNameTextbox")['value']  # no-op: the result is never assigned
    login_data = {
        "__VIEWSTATE": VIEWSTATE,
        "txtUserName": account_name,
        "txtPassword": account_pass,
        "__VIEWSTATEGENERATOR": VIEWSTATEGENERATOR,
        "__EVENTVALIDATION": EVENTVALIDATION,
        "__EVENTTARGET": EVENTTARGET,
        "__EVENTARGUMENT": EVENTARGUMENT,  # ASP.NET expects the "__EVENTARGUMENT" spelling
        "__PREVIOUSPAGE": PREVIOUSPAGE,
        "CMSessionId": CMSESSIONID,
        "MasterHeaderPlaceHolder_ctl00_userNameTextbox": account_name,
        "MasterHeaderPlaceHolder_ctl00_passwordTextbox": account_pass,
        "MasterHeaderPlaceHolder_ctl00_tempPasswordTextbox": account_pass
    }
    login_data_encoded = urllib.parse.urlencode(login_data)  # not used below; session.post() encodes the dict itself
Further to this, the login_data dictionary is passed as the data of a POST request to the login_url, and the same session is then used to GET the report_url.
response_1 = session.post(login_url, data=login_data)
response_2 = session.get(report_url)
The problem seems to be that the login is not taking effect, as the GET request is re-routed to a login page.
Can anyone kindly shed some light on why this is the case? I am guessing that this is the correct flow, but please let me know if there is anything I am doing wrong or that could be improved.
I am unfortunately limited to requests and other popular Python 3 libraries, as that is a requirement (driving a browser executable, as suggested in some replies on the subject, is not an option).
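One way to narrow this down is to stop requests from following the post-login redirect and inspect the raw response. This is a hedged debugging sketch, not a fix: ASP.NET form logins typically answer a successful POST with a 302 to the target page, so the status code, Location header, and session cookies usually show whether the credentials were accepted.
response_1 = session.post(login_url, data=login_data, allow_redirects=False)
print(response_1.status_code)              # 302 usually means the login was accepted
print(response_1.headers.get('Location'))  # where the server wants to send us next
print(session.cookies.get_dict())          # look for an auth/session cookie here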

Flask remote authentication issue with Graphite

I have a Flask app which sends a request to a Graphite server to authenticate and redirect to its dashboard (with the setting REMOTE_USER_AUTHENTICATION = True). The request is as follows:
url = 'https://graphite.localdomain/dashboard'
s = requests.Session()
r = s.get(url, auth=('userx', 'passwordx'),verify=False)
print r.cookies.__dict__
return (r.text, r.status_code, r.headers.items())
The authentication from the request to the Graphite server is good: I get 200s for valid users and 401s for invalid users.
"print r.cookies.__dict__" will output...
{'_now': 1429303134, '_policy': <cookielib.DefaultCookiePolicy instance
at 0x7f263ec2b638>, '_cookies': {'graphite.localdomain': {'/':
{'sessionid': Cookie(version=0, name='sessionid',
value='**********masked**********', port=None, port_specified=False,
domain='graphite.localdomain', domain_specified=False,
domain_initial_dot=False, path='/', path_specified=True, secure=False,
expires=1430512734, discard=False, comment=None, comment_url=None,
rest={'httponly': None}, rfc2109=False)}}}, '_cookies_lock': <_RLock
owner=None count=0>}
...which appears right, because it looks identical to the one I get from logging in to Graphite directly. But when I return the response object (Return a requests.Response object from Flask), the browser reports content-encoding errors in both Chrome and Firefox. If I change that to something like...
return r.content
...the dashboard page appears, but it's missing everything because the CSS and JS resources 404.
I am obviously not understanding something; any help would be greatly appreciated.
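No answer is shown here either, but the content-encoding error has a well-known cause when proxying a requests response through Flask: requests transparently decompresses the body, yet r.headers still advertises the original Content-Encoding, so the browser tries to decompress already-decoded bytes. A minimal sketch of the usual fix, dropping the hop-by-hop headers before returning (the 404s are a separate issue: the page's relative CSS/JS URLs resolve against the Flask app instead of Graphite, so those paths would need proxying too):
# Drop hop-by-hop headers that no longer describe the decoded body.
excluded = {'content-encoding', 'content-length', 'transfer-encoding', 'connection'}
headers = [(k, v) for k, v in r.headers.items() if k.lower() not in excluded]
return (r.content, r.status_code, headers)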
