Why won't Splash render this webpage?

I'm quite new to Splash. Although I was able to get Splash set up on Ubuntu 18 (via Splash/Docker), it gives me different results for this page:
https://www.overstock.com/Home-Garden/Area-Rugs/31446/subcat.html
Normally it's rendered like this: [screenshot of the page rendering normally]
But when I try to render it in Splash, it comes out like this: [screenshot of the broken Splash render]
I have tried changing the user agent in Splash to this:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36
This makes the Splash script look like this:
function main(splash, args)
    splash:set_user_agent(
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36'
    )
    assert(splash:go(args.url))
    assert(splash:wait(0.5))
    return {
        html = splash:html(),
        png = splash:png(),
        har = splash:har(),
    }
end
Yet, despite these additions, it still fails to render the page.
How can I get Splash to render this page?

It seems that overstock.com requires Connection and Accept headers. Add them to your request and it should work as expected.
I tested this in Postman: without the Connection: keep-alive and Accept: */* headers I get the same error page; after adding the two headers, the page loads as expected.
Therefore, your script should be edited accordingly:
function main(splash, args)
    splash:set_custom_headers({
        ["Connection"] = "keep-alive",
        ["Accept"] = "*/*",
    })
    assert(splash:go(args.url))
    assert(splash:wait(0.5))
    return {
        html = splash:html(),
        png = splash:png(),
        har = splash:har(),
    }
end
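If you are driving Splash from Python rather than from the web UI, roughly the same thing can be done through Splash's /execute HTTP endpoint. A minimal sketch, assuming the default Docker setup with Splash listening on localhost:8050:

import base64
import requests

# The same script as above, sent to Splash's /execute HTTP API.
script = """
function main(splash, args)
    splash:set_custom_headers({
        ["Connection"] = "keep-alive",
        ["Accept"] = "*/*",
    })
    assert(splash:go(args.url))
    assert(splash:wait(0.5))
    return {
        html = splash:html(),
        png = splash:png(),
    }
end
"""

resp = requests.post(
    "http://localhost:8050/execute",
    json={
        "lua_source": script,
        "url": "https://www.overstock.com/Home-Garden/Area-Rugs/31446/subcat.html",
    },
)
resp.raise_for_status()
result = resp.json()

with open("page.html", "w", encoding="utf-8") as f:
    f.write(result["html"])
# Binary values (like the PNG) come back base64-encoded in the JSON result.
with open("page.png", "wb") as f:
    f.write(base64.b64decode(result["png"]))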

Related

Scraping data from the https://cardano.ideascale.com webpage, but the server claims I am using Internet Explorer

I am scraping the content of this link. My procedure is:
GET-TOKEN to get a Bearer token.
GET "Fork Gitcoin and deploy on Cardano" using the above token in the header, and receive JSON content in the response.
My issue: when I run the code below, the GET /detail response tells me I am using Internet Explorer, which is weird because my request header has "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36".
<div id="ie-unsupported-alert" class="ie-d-none">
    <p>We noticed you are using Internet Explorer. We don't have support for this browser in Incoming Moderation!</p>
    <p>We recommend using the Microsoft Edge Browser, Chrome, Firefox or Safari. <a href="https://help.ideascale.com/knowledge/internet-web-browsers-supported-by-ideascale">Click for more info.</a></p>
</div>
Can anyone explain the error and tell me how to fix it?
Below is my Python code.
import requests

def get_content(url):
    s = requests.session()
    response = s.get("https://cardano.ideascale.com/a/community/api/get-token")
    if response.status_code != 200:
        print(f"\033[4m\033[1m{response.status_code}\033[0m")
        return None
    cookies = response.cookies
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36",
        "Accept": "application/json",
        "Accept-Language": "en-US,en;q=0.5",
        "Accept-Encoding": "gzip, deflate",
        "Cache-Control": "no-cache",
        "Pragma": "no-cache",
        "Authorization": f'Bearer {response.content.decode("utf-8")}',
        "Alt-Used": "cardano.ideascale.com",
        "Connection": "keep-alive",
        "Referer": url,
        "Sec-Fetch-Dest": "empty",
        "Sec-Fetch-Mode": "cors",
        "Sec-Fetch-Site": "same-origin",
        "TE": "trailers",
    }
    response = s.get(f"{url}/detail", headers=headers, cookies=cookies)
    print(response.content)

get_content("https://cardano.ideascale.com/c/idea/317821")
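One thing worth checking: the ie-d-none class on the quoted div looks like a hidden-by-default CSS class, so the alert markup may be served to every client and simply hidden in supported browsers, rather than meaning the server actually detected Internet Explorer. A quick check with BeautifulSoup (a sketch, under that assumption):

import requests
from bs4 import BeautifulSoup

# If the alert div is present but carries a hidden-by-default class
# (ie-d-none), it is likely static markup served to all browsers,
# not a sign that the server thinks this client is IE.
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36",
}
html = requests.get("https://cardano.ideascale.com/c/idea/317821", headers=headers).text
alert = BeautifulSoup(html, "html.parser").find(id="ie-unsupported-alert")
if alert is not None:
    print("alert div present; classes:", alert.get("class"))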

Python Scrapy: How to log into an ASP.NET website

I am trying to write a script to log into a private website and crawl data with Scrapy, but the website requires a login.
I used Chrome DevTools to watch the network during a manual login and found that three requests are sent after I click the login button:
The first is the login request. [screenshot: Login request]
The second is checkuservalid. [screenshot: Check valid user]
The third is a GET request to the index page. [screenshot: Get request to index page]
Note: requests 1 and 2 appear briefly and disappear once the login succeeds.
I tried to follow some instructions using Scrapy's FormRequest and FormRequest.from_response, but I cannot log in.
Please give me some advice for this case.
import scrapy

class LoginSpider(scrapy.Spider):
    name = "Test"
    start_urls = ['http://hvsfcweb.fushan.fihnbb.com/Login.aspx']
    headers = {
        'Content-Type': 'application/json; charset=UTF-8',
        'Referer': 'http://hvsfcweb.fushan.fihnbb.com/Login.aspx',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest',
    }

    def start_requests(self):  # must be start_requests (plural), or Scrapy never calls it
        yield scrapy.Request(
            url=self.start_urls[0],  # start_urls is a list; pass a single URL
            method="POST",
            body='{"userCode":"hluvan","pwd":"1","lang":"en-us","loc":"S010^B125"}',
            headers=self.headers,
            callback=self.parse,
        )

    def parse(self, response):
        filename = 'quotes.html'
        with open(filename, 'wb') as f:
            f.write(response.body)

Scrapy - 403 error not solved even after adding headers

I am trying to scrape doordash.com, but every time I run the request it returns 403, along with this line: INFO: Ignoring response <403 http://doordash.com/>: HTTP status code is not handled or not allowed.
I tried several things, like adding a User-Agent, but it didn't work. I also added full browser headers, but the same thing keeps happening.
Here's my code:
import scrapy

class DoordashSpider(scrapy.Spider):
    name = 'doordash'
    allowed_domains = ['doordash.com']
    start_urls = ['http://doordash.com/']

    def start_requests(self):
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.101 Safari/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
            'Accept-Language': 'en-US,en;q=0.9',
            'Accept-Encoding': 'gzip, deflate, br',
        }
        for url in self.start_urls:
            yield scrapy.Request(url, headers=headers)

    def parse(self, response):
        print('Crawled Successfully')
How can I get a 200 response?
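A note on the quoted INFO line: it comes from Scrapy's HttpErrorMiddleware, which drops non-2xx responses before they reach your callback. If you only want to inspect the 403 page (this surfaces the blocked response, it does not bypass the block), you can allow the status explicitly. A minimal sketch:

import scrapy

class Doordash403Spider(scrapy.Spider):
    name = 'doordash_403'
    start_urls = ['http://doordash.com/']
    # Let 403 responses through HttpErrorMiddleware so parse() can see them.
    custom_settings = {'HTTPERROR_ALLOWED_CODES': [403]}

    def parse(self, response):
        # Inspect the blocked response, e.g. which server is rejecting us.
        print(response.status, response.headers.get('Server'))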

400 Bad request on python requests.get()

I am doing a bit of web scraping of political donations. I scrape a link from one page that I then need to request in turn. I can get the secondary links just fine, but when I call requests.get() on one, the HTML returned is a 400 Bad Request error page.
I've already tried changing the request and adding more headers, but nothing seems to work.
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept-Encoding": "gzip, deflate",
    "Accept-Language": "en-US,en;q=0.9",
    "Cache-Control": "max-age=0",
    "Connection": "keep-alive",
    "DNT": "1",
    "Host": "docquery.fec.gov",
    "Referer": "http://www.politicalmoneyline.com/tr/tr_MG_IndivDonor.aspx?tm=3",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36",
}

params = {
    "14960391627": ""
}

pdf_page = requests.get(potential_donor[10], headers=headers, params=params)
html = pdf_page.text
soup_donor_page = BeautifulSoup(html, 'html.parser')
print(soup_donor_page)
Note: the URLs look something like this:
http://docquery.fec.gov/cgi-bin/fecimg/?14960391627
with the ending digits being different.
The output of the print(soup_donor_page) is:
400 Bad request
Your browser sent an invalid request.
I need to get the actual html of the page in order to grab the embedded pdf from the page.
I suspect the cause is an issue with how requests encodes a parameter that has no value: params={"14960391627": ""} produces a URL ending in 14960391627= (note the trailing =), which this server rejects.
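You can see what requests actually sends here by preparing the URL yourself; a quick demonstration:

from requests.models import PreparedRequest

# Show how requests encodes a parameter that has an empty value.
req = PreparedRequest()
req.prepare_url("http://docquery.fec.gov/cgi-bin/fecimg/", params={"14960391627": ""})
print(req.url)  # -> http://docquery.fec.gov/cgi-bin/fecimg/?14960391627=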
Try building the URL with a format string instead:
import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
}

param = "14960391627"
r = requests.get(f"http://docquery.fec.gov/cgi-bin/fecimg/?{param}", headers=headers)
soup = BeautifulSoup(r.content, "html.parser")
print(soup.find("embed")["src"])
Result:
http://docquery.fec.gov/pdf/859/14960388859/14960388859_002769.pdf#zoom=fit&navpanes=0

ASP.NET Core Azure App Service httpContext.Request.Headers["Host"] Value

I ran into some strange behaviour today.
We are hosting an ASP.NET Core 1.1 web app on Azure App Service and using subdomains that route to a specific controller or area.
So in my SubdomainConstraint: IRouteConstraint I use
HttpContext.Request.Headers["Host"]
to get the host name. That previously returned something like
mywebsite.com or subdomain.mywebsite.com
Starting today (or maybe yesterday), it started returning my App Service name instead of the host name. On localhost everything works fine.
Enumerating through
Context.Request.Headers
in one of my views gives me, on localhost:
Accept : text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding : gzip, deflate, sdch, br
Accept-Language : ru-RU,ru;q=0.8,en-US;q=0.6,en;q=0.4,ca;q=0.2
Cookie : .AspNetCore.Antiforgery....
Host : localhost:37202
User-Agent : Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36
Upgrade-Insecure-Requests : 1
and in Azure App Service:
Connection : Keep-Alive
Accept : text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8
Accept-Encoding : gzip, deflate, sdch
Accept-Language : ru-RU,ru;q=0.8,en-US;q=0.6,en;q=0.4,ca;q=0.2
Cookie : AspNetCore.Antiforgery....
Host : mydeploymentname:80
Max-Forwards : 10
User-Agent : Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36
Upgrade-Insecure-Requests : 1
X-LiveUpgrade : 1
X-WAWS-Unencoded-URL : /
X-Original-URL : /
X-ARR-LOG-ID : 9c76e796-84a8-4335-919c-9ca4rb745f4fefdfde
DISGUISED-HOST : mywebsite.com
X-SITE-DEPLOYMENT-ID : mydeploymentname
WAS-DEFAULT-HOSTNAME : mydeploymentname.azurewebsites.net
X-Forwarded-For : IP:56548
MS-ASPNETCORE-TOKEN : a97b93ba-6106-4301-87b2-8af9a929d7dc
X-Original-For : 127.0.0.1:55602
X-Original-Proto : http
I can get what I need from
Headers["DISGUISED-HOST"]
but I'm having problems with redirects to the login page: they go to the wrong URL, containing my deployment name.
I wonder whether I messed something up somewhere, but we made our last deployment a few days ago and everything worked fine after that.
This is caused by a regression in the AspNetCoreModule that was deployed to a small number of apps in Azure App Service. The issue is being investigated; please follow this thread for status updates.
Here is a workaround you can use until the fix is deployed: in your Configure method (typically in Startup.cs), add the following:
public void Configure(IApplicationBuilder app, IHostingEnvironment env, ILoggerFactory loggerFactory)
{
    app.Use((ctx, next) =>
    {
        string disguisedHost = ctx.Request.Headers["DISGUISED-HOST"];
        if (!String.IsNullOrWhiteSpace(disguisedHost))
        {
            ctx.Request.Host = new Microsoft.AspNetCore.Http.HostString(disguisedHost);
        }
        return next();
    });

    // Rest of your code here...
}
