Python Requests doesn't render full code from page - python-requests

I'm trying to capture each agent's data from this page using Python requests (see the attached screenshot of the fields I want to capture), but response.text doesn't contain the markup shown in the browser's code inspector. Below is my script.
import requests
import re

response = requests.get('https://exprealty.com/agents/#/?city=Aberdeen+WA&country=US')
result = re.search('Mike Arthur', response.text)  # agent name that is visible on the rendered page
try:
    print(result.group())
except AttributeError:  # re.search returned None -- the name isn't in the raw HTML
    print('Nothing found.')
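One detail worth spelling out (my note, not from the original post): everything after the # in that URL is a fragment, which the browser never sends to the server, so requests always receives the same generic /agents/ shell and the agent list is filled in afterwards by JavaScript. A quick sketch to illustrate:
from urllib.parse import urlsplit

parts = urlsplit('https://exprealty.com/agents/#/?city=Aberdeen+WA&country=US')
print(parts.path)      # '/agents/' -- the only path the server actually sees
print(parts.query)     # ''         -- empty: the city filter never reaches the server
print(parts.fragment)  # '/?city=Aberdeen+WA&country=US' -- handled client-side by JavaScript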

Related

Can you display a PlainTextResponse in the Swagger UI for a FastAPI API?

Right now, I can only view a PlainTextResponse by manually entering the API URL path into my browser. However, I would like to be able to view PlainTextResponses in my Swagger UI. It seems like the Swagger UI loads indefinitely every time I try to request a PlainTextResponse.
Here is a sample:
from fastapi import FastAPI
from fastapi.responses import PlainTextResponse
import pandas as pd

app = FastAPI()

@app.get("/plain_text", response_class=PlainTextResponse)
async def plain_text():
    url = 'https://raw.githubusercontent.com/ccodwg/Covid19Canada/master/official_datasets/can/phac_n_tests_performed_timeseries_prov.csv'
    df = pd.read_csv(url, index_col=0)
    return PlainTextResponse(df.to_csv(index=False), media_type="text/plain")
This sample actually works. I'm assuming it's because this specific CSV file is relatively small.
However, once you start using larger CSV files, it seems like you are unable to display the response in the UI. For example, try https://raw.githubusercontent.com/Schlumberger/hackathon/master/backend/dataset/data-large.csv instead and it will load forever in the UI, but it displays relatively quickly if you use the URL path directly.
I don't know what is happening at your end, but here is an MVP showing how PlainTextResponse comes through in the auto-generated docs.
from fastapi import FastAPI
from fastapi.responses import PlainTextResponse

app = FastAPI()

@app.get("/")
def root():
    return PlainTextResponse("Plain Response!")

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
Executing the operation (using the "Try out" button) in the generated docs yields the expected plain-text result.
Turns out it's not PlainTextResponse's issue but rather Swagger UI's; credit to this user's answer. Disabling syntax highlighting significantly improves performance, so the Swagger UI no longer hangs on large responses.
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import pandas as pd

app = FastAPI(swagger_ui_parameters={"syntaxHighlight": False})

@app.get("/plain_text", response_class=StreamingResponse)
async def plain_text():
    url = 'https://raw.githubusercontent.com/Schlumberger/hackathon/master/backend/dataset/data-large.csv'
    df = pd.read_csv(url, index_col=0)
    return StreamingResponse(iter([df.to_csv(index=False)]), media_type="text/csv")
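As a quick sanity check (my addition, not part of the original answer), the endpoint can also be hit programmatically, bypassing Swagger UI entirely; this assumes the app is served with uvicorn on localhost:8000 as in the MVP above.
import requests

resp = requests.get('http://localhost:8000/plain_text')
print(resp.headers.get('content-type'))  # should report text/csv
print(resp.text[:200])                   # first few lines of the CSV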

While trying to scrape data from the website, it displays None as the output

Link of the website: https://awg.wd3.myworkdayjobs.com/AW/job/Lincoln/Business-Analyst_R15025-2
How do I get the location, job type, and salary details from the website?
Can you please help me locate the above-mentioned details in the HTML (screenshot attached) using BeautifulSoup?
The site uses a backend API to deliver the info. If you look at your browser's Developer Tools > Network > Fetch/XHR and refresh the page, you'll see the data load via JSON in a request with a URL similar to the one you posted.
So if we edit your URL to match the backend API URL, we can hit it and parse the JSON. Unfortunately, the pay amount is buried in some HTML within the JSON, so we have to get it out with BeautifulSoup and a bit of regex to match the £###,### pattern.
import requests
from bs4 import BeautifulSoup
import re

url = 'https://awg.wd3.myworkdayjobs.com/AW/job/Lincoln/Business-Analyst_R15025-2'
search = 'https://awg.wd3.myworkdayjobs.com/wday/cxs/awg/AW/' + url.split('AW')[-1]  # API endpoint from Developer Tools

data = requests.get(search).json()

posted = data['jobPostingInfo']['startDate']
location = data['jobPostingInfo']['location']
title = data['jobPostingInfo']['title']
desc = data['jobPostingInfo']['jobDescription']

soup = BeautifulSoup(desc, 'html.parser')
pay_text = soup.text

sterling = [x[0] for x in re.findall(r'(£[0-9]+(,[0-9]+)?)', pay_text)][0]  # first £###,### style match

final = {
    'title': title,
    'posted': posted,
    'location': location,
    'pay': sterling
}
print(final)

Find the final redirected url using Python

I am trying to use Python to find the final redirected URL for a URL. I tried various solutions from Stack Overflow answers, but nothing worked for me; I am only getting the original URL.
To be specific, I tried the requests, urllib2 and urlparse libraries, and none of them worked as they should. Here is some of the code I tried:
Solution 1:
s = requests.session()
r = s.post('https://www.boots.com/search/10055096', allow_redirects=True)
print(r.history)
print(r.history[1].url)
Result:
[<Response [301]>, <Response [302]>]
https://www.boots.com/search/10055096
Solution 2:
import urlparse

url = 'https://www.boots.com/search/10055096'
try:
    out = urlparse.parse_qs(urlparse.urlparse(url).query)['out'][0]
    print(out)
except Exception as e:
    print('not found')
Result:
not found
Solution 3:
import urllib2

def get_redirected_url(url):
    opener = urllib2.build_opener(urllib2.HTTPRedirectHandler)
    request = opener.open(url)
    return request.url

print(get_redirected_url('https://www.boots.com/search/10055096'))
Result:
HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Found
The expected URL below is the final redirected page, and that is what I want to return.
Original URL: https://www.boots.com/search/10055096
Expected URL: https://www.boots.com/gillette-fusion5-razor-blades-4pk-10055096
Solution #1 was the closest. At least it returned two responses, but the second response wasn't the final page; judging by its content, it seems to be a loading page.
The first request returns an HTML file that contains JavaScript to update the site, and JavaScript is not executed by requests. You can find the updated link with:
import requests
from bs4 import BeautifulSoup
import re

r = requests.get('https://www.boots.com/search/10055096')
soup = BeautifulSoup(r.content, 'html.parser')
# the real product URL is embedded in the <script> tag right after the search box input
reg = soup.find('input', id='searchBoxText').findNext('script').contents[0]
print(re.search(r'ht[\w\://\.-]+', reg).group())
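A possible follow-up (my addition, not in the original answer): once the embedded link is extracted, a second plain requests call on it should land on the final product page, i.e. the Gillette URL from the question.
final_url = re.search(r'ht[\w\://\.-]+', reg).group()
final_page = requests.get(final_url)
print(final_page.url)  # expected: https://www.boots.com/gillette-fusion5-razor-blades-4pk-10055096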

Scrapy download zip file from ASP.NET site

I need some help getting Scrapy to download a file from an ASP.NET site. Normally, from a browser, one would click the link and the file would begin downloading, but that is not possible with Scrapy, so what I am trying to do is the following:
def retrieve(self, response):
    print('Response URL: {}'.format(response.url))
    pattern = re.compile('(dg[^\']*)')
    for file in response.xpath('//table[@id="dgFile"]/tbody/tr/td[2]/a'):
        file_url = file.xpath('@href').extract_first()
        target = re.search(pattern, file_url).group(1)
        viewstate = response.xpath('//*[@id="__VIEWSTATE"]/@value').extract_first()
        viewstategenerator = response.xpath('//*[@id="__VIEWSTATEGENERATOR"]').extract_first()
        eventvalidation = response.xpath('//*[@id="__EVENTVALIDATION"]').extract_first()
        data = {
            '_EVENTTARGET': target,
            '_VIEWSTATE': viewstate,
            '_VIEWSTATEGEERATOR': viewstategenerator,
            '_EVENTVALIDATION': eventvalidation
        }
        yield FormRequest.from_response(
            response,
            formdata=data,
            callback=self.end(response)
        )
I am trying to submit the information to the page in order to receive the zip file back as a response; however, this is not working as I hoped it would. Instead, I am simply getting the same page back as a response.
In a situation like this, is it even possible to use Scrapy to download the file? Does anyone have any pointers?
I have also tried to use Selenium + PhantomJS, but I ran into a dead end trying to transfer the session from Scrapy to Selenium. I would be willing to use Selenium for this one function, but I need to use Scrapy for this project.
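For reference, here is a minimal sketch (my own, not a confirmed fix) of how an ASP.NET postback is usually submitted with Scrapy. The hidden-field names follow the conventional double-underscore form (__EVENTTARGET, __VIEWSTATE, ...), FormRequest.from_response copies the form's hidden fields automatically, and the spider and callback names are hypothetical.
import re
import scrapy
from scrapy import FormRequest

class FileSpider(scrapy.Spider):
    name = 'files'  # hypothetical spider wrapping the question's retrieve() logic

    def retrieve(self, response):
        for link in response.xpath('//table[@id="dgFile"]//tr/td[2]/a'):
            file_url = link.xpath('@href').extract_first()
            target = re.search(r"(dg[^']*)", file_url).group(1)
            yield FormRequest.from_response(
                response,
                # from_response pre-fills the form's hidden fields (__VIEWSTATE,
                # __VIEWSTATEGENERATOR, __EVENTVALIDATION), so only the postback
                # target needs to be supplied -- note the double underscores
                formdata={'__EVENTTARGET': target, '__EVENTARGUMENT': ''},
                callback=self.save_zip,  # pass the method itself, don't call it
            )

    def save_zip(self, response):
        # hypothetical callback that writes the returned bytes to disk
        with open('download.zip', 'wb') as f:
            f.write(response.body)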

Python-requests: Can't scrape all the html code from a page

I am trying to scrape the content of the Financial Times search page.
Using requests, I can easily scrape the articles' titles and hyperlinks.
I would like to get the next page's hyperlink, but I cannot find it in the requests response, unlike the articles' titles and hyperlinks.
from bs4 import BeautifulSoup
import requests

url = 'http://search.ft.com/search?q=SABMiller+PLC&t=all&rpp=100&fa=people%2Corganisations%2Cregions%2Csections%2Ctopics%2Ccategory%2Cbrand&s=-lastPublishDateTime&f=lastPublishDateTime[2000-01-01T00%3A00%3A00%2C2016-01-01T23%3A59%3A59]&curations=ARTICLES%2CBLOGS%2CVIDEOS%2CPODCASTS&highlight=true&p=1et'
response = requests.get(url, auth=(my login informations))
soup = BeautifulSoup(response.text, "lxml")

def get_titles_and_links():
    titles = soup.find_all('a')
    for ref in titles:
        if ref.get('title') and ref.get('onclick'):
            print ref.get('href')
            print ref.get('title')
The get_titles_and_links() function gives me the titles and links of all the articles.
However, with a similar function for the next page, I have no results:
def get_next_page():
    next_page = soup.find_all("li", class_="page next")
    return next_page
Or:
def get_next_page():
    next_page = soup.find_all('li')
    for ref in next_page:
        if ref.get('page next'):
            print ref.get('page next')
If you can see the required links in the page in your browser but are not able to get them via requests or urllib, it can mean one of two things:
1. There is something wrong with your logic. Let's assume it's not that.
2. What remains is Ajax: the parts of the page you are looking for are loaded by JavaScript after document.onload has fired, so you cannot get something that isn't there in the first place.
My solutions (more like suggestions) are:
1. Reverse-engineer the network requests. Difficult, but universally applicable; I personally do that. You might want to use the re module.
2. Find something that renders JavaScript, that is, simulate web browsing. You might want to check out the webdriver component of Selenium, Qt, etc. (a sketch follows below). This is easier, but it is memory-hungry and consumes a lot more network resources compared to option 1.
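A minimal sketch of option 2 (my addition), assuming Selenium with a headless Chrome driver is installed; it reuses the question's FT search URL (shortened here) and its "page next" selector, and it skips the login that the question's requests call performs with auth.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

options = Options()
options.add_argument('--headless')       # run Chrome without a visible window
driver = webdriver.Chrome(options=options)

driver.get('http://search.ft.com/search?q=SABMiller+PLC')  # shortened form of the question's URL
soup = BeautifulSoup(driver.page_source, 'lxml')           # DOM after the JavaScript has run
driver.quit()

# the "next page" element rendered by JavaScript should now be present
next_page = soup.find('li', class_='page next')
print(next_page)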
