Scrapy - 403 error persists even after adding headers - web-scraping

I am trying to scrape doordash.com, but every time I run the request it returns a 403, along with this log line: INFO: Ignoring response <403 http://doordash.com/>: HTTP status code is not handled or not allowed.
I have tried several things, such as adding a User-Agent, but it still didn't work. I also added a full set of browser headers, but the same thing happens.
Here's my code:
import scrapy

class DoordashSpider(scrapy.Spider):
    name = 'doordash'
    allowed_domains = ['doordash.com']
    start_urls = ['http://doordash.com/']

    def start_requests(self):
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.101 Safari/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
            'Accept-Language': 'en-US,en;q=0.9',
            'Accept-Encoding': 'gzip, deflate, br',
        }
        for url in self.start_urls:
            yield scrapy.Request(url, headers=headers)

    def parse(self, response):
        print('Crawled Successfully')
How can I get a 200 response?
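For what it's worth, here is a minimal sketch of how I would debug this, assuming the 403 comes from header checks rather than JavaScript-based bot protection (in the latter case no header combination will help, and a headless browser or the site's internal API would be needed). handle_httpstatus_list lets the 403 response reach parse() so you can inspect what the server actually sends back:

import scrapy


class DoordashDebugSpider(scrapy.Spider):
    name = 'doordash_debug'
    allowed_domains = ['doordash.com']
    # https + www, in case the plain http URL itself triggers the block
    start_urls = ['https://www.doordash.com/']
    # Let 403 responses through to parse() instead of being dropped
    handle_httpstatus_list = [403]

    custom_settings = {
        # Browser-like headers sent on every request, not just the first one
        'DEFAULT_REQUEST_HEADERS': {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.9',
        },
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/91.0.4472.101 Safari/537.36',
    }

    def parse(self, response):
        self.logger.info('Got status %s', response.status)
        # Inspect the first part of the body to see what the server is returning
        self.logger.info(response.text[:500])

If the logged body turns out to be a CAPTCHA or bot-protection page, the 403 is not about missing headers at all.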

Related

Scraping data from the https://cardano.ideascale.com webpage, but the server thinks I am using Internet Explorer

I am scraping the content of this link, and my procedure is:
GET get-token to obtain a Bearer token.
GET the "Fork Gitcoin and deploy on Cardano" idea using the above token in the header, and receive JSON content in the response.
My issue is that when I run the code below, the GET to /detail returns a response saying I am using Internet Explorer, which is strange because my request header has "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36".
<div id="ie-unsupported-alert" class="ie-d-none">
    <p>We noticed you are using Internet Explorer. We don't have support for this browser in Incoming Moderation!
    </p>
    <p>We recommend using the Microsoft Edge Browser, Chrome, Firefox or Safari. <a
        href="https://help.ideascale.com/knowledge/internet-web-browsers-supported-by-ideascale">Click for more
        info.</a></p>
</div>
Can anyone explain the error and tell me how to fix it?
Below is my Python code.
import requests

def get_content(url):
    s = requests.session()
    response = s.get(f"https://cardano.ideascale.com/a/community/api/get-token")
    if response.status_code != 200:
        print(f"\033[4m\033[1m{response.status_code}\033[0m")
        return None
    cookies = response.cookies
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36",
        'Accept': 'application/json,',
        'Accept-Language': 'en-US,en;q=0.5',
        'Accept-Encoding': 'gzip, deflate',
        'Cache-Control': 'no-cache',
        'Pragma': 'no-cache',
        'Authorization': f'Bearer {response.content.decode("utf-8")}',
        'Alt-Used': 'cardano.ideascale.com',
        'Connection': 'keep-alive',
        'Referer': url,
        'Sec-Fetch-Dest': 'empty',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Site': 'same-origin',
        'TE': 'trailers',
    }
    # import ipdb; ipdb.set_trace()
    response = s.get(f"{url}/detail", headers=headers, cookies=cookies)
    print(response.content)

get_content("https://cardano.ideascale.com/c/idea/317821")
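One thing worth checking, offered as a guess rather than a confirmed diagnosis: the alert div carries the class ie-d-none, which looks like a hidden-by-default fallback that every response may contain and that real browsers simply never display. If so, its presence in the raw HTML does not mean the server classified the client as Internet Explorer. A small sketch to check whether the rest of the page came back alongside the warning (the URL and the element id are taken from the question; everything else is an assumption):

import requests
from bs4 import BeautifulSoup

def inspect_detail(url):
    s = requests.session()
    token_resp = s.get("https://cardano.ideascale.com/a/community/api/get-token")
    token_resp.raise_for_status()
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36",
        "Authorization": f"Bearer {token_resp.text}",
        "Referer": url,
    }
    detail = s.get(f"{url}/detail", headers=headers, cookies=token_resp.cookies)
    soup = BeautifulSoup(detail.text, "html.parser")

    ie_alert = soup.find(id="ie-unsupported-alert")
    print("Status code:", detail.status_code)
    print("IE alert div present:", ie_alert is not None)
    # If the response is much larger than the alert div alone, the warning is
    # probably just the hidden static fallback and not the real problem.
    print("Response length:", len(detail.text))

inspect_detail("https://cardano.ideascale.com/c/idea/317821")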

Python Scrapy: How to log in to an ASP.NET website

I am trying to write a script that logs in to a private website and crawls data with Scrapy.
The website requires a login.
I used Chrome to watch the network during a manual login and found that three requests are sent after I click the login button:
The first is the login request.
The second is a checkuservalid request.
The third is a GET request to the index page.
Note: requests 1 and 2 only show up briefly and disappear after the login succeeds.
I tried following some instructions using scrapy FormRequest and FormRequest.from_response, but I cannot log in.
Please give me some advice for this case.
import scrapy

class LoginSpider(scrapy.Spider):
    name = "Test"
    start_urls = ['http://hvsfcweb.fushan.fihnbb.com/Login.aspx']
    headers = {'Content-Type': 'application/json; charset=UTF-8',
               'Referer': 'http://hvsfcweb.fushan.fihnbb.com/Login.aspx',
               'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36',
               'X-Requested-With': 'XMLHttpRequest',
               }

    def start_request(self):
        yield scrapy.Request(url=self.start_urls,
                             method="POST",
                             body='{"userCode":"hluvan","pwd":"1","lang":"en-us","loc":"S010^B125"}',
                             headers=self.headers,
                             callback=self.parse)

    def parse(self, response):
        filename = f'quotes.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
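Two things stand out, so here is a hedged sketch of a corrected spider rather than a definitive fix. First, Scrapy's entry point is start_requests (plural); a method named start_request is never called, so Scrapy silently falls back to plain GET requests for start_urls. Second, scrapy.Request expects a single URL string, while url=self.start_urls passes the whole list. The login URL below reuses Login.aspx from the question, but the real JSON login endpoint may be a different path, whichever URL appeared as the first request in DevTools:

import json

import scrapy


class LoginSpider(scrapy.Spider):
    name = "Test"
    # Use the URL of the first request you saw in DevTools after clicking login;
    # it may not be Login.aspx itself.
    login_url = 'http://hvsfcweb.fushan.fihnbb.com/Login.aspx'
    headers = {
        'Content-Type': 'application/json; charset=UTF-8',
        'Referer': 'http://hvsfcweb.fushan.fihnbb.com/Login.aspx',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                      '(KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest',
    }

    def start_requests(self):  # note the plural: this is the method Scrapy actually calls
        payload = {"userCode": "hluvan", "pwd": "1", "lang": "en-us", "loc": "S010^B125"}
        yield scrapy.Request(
            url=self.login_url,  # a single URL string, not the start_urls list
            method="POST",
            body=json.dumps(payload),
            headers=self.headers,
            callback=self.after_login,
        )

    def after_login(self, response):
        # Scrapy keeps session cookies per spider by default, so once the login
        # POST succeeds, follow-up requests to the index page should be authenticated.
        with open('quotes.html', 'wb') as f:
            f.write(response.body)

From after_login you would then yield a normal scrapy.Request for the index page and parse it in a separate callback.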

Why won't Splash render this webpage?

I'm quite new to Splash, and though I was able to get Splash set up on my Ubuntu 18 machine (via Docker), it gives me different results for this page:
https://www.overstock.com/Home-Garden/Area-Rugs/31446/subcat.html
In a normal browser the page renders fine, but when I render it in Splash the page does not come out correctly.
I have tried changing the user agent in Splash to this:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36
Consequently, the Splash script looks like this:
function main(splash, args)
    splash:set_user_agent(
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36'
    )
    assert(splash:go(args.url))
    assert(splash:wait(0.5))
    return {
        html = splash:html(),
        png = splash:png(),
        har = splash:har(),
    }
end
Yet, despite these additions, it still fails to render the page.
How can I get Splash to render this page?
It seems like overstock.com requires Connection and Accept headers. Add them to your request and it should work as expected.
I tested this in Postman with and without the Connection: keep-alive and Accept: */* headers: without them I get the same error page, and after adding the two headers the page loads.
Therefore, your request should be edited accordingly:
function main(splash, args)
    splash:set_custom_headers({
        ["Connection"] = "keep-alive",
        ["Accept"] = "*/*",
    })
    assert(splash:go(args.url))
    assert(splash:wait(0.5))
    return {
        html = splash:html(),
        png = splash:png(),
        har = splash:har(),
    }
end
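If the script is being driven from Scrapy rather than from the Splash HTTP API directly, here is a sketch of how it might be wired up with scrapy-splash. This assumes scrapy-splash is installed and its SPLASH_URL and middleware settings are already configured per its README; the spider name and parse logic are placeholders:

import scrapy
from scrapy_splash import SplashRequest

# The Lua script from the answer above, passed to Splash's /execute endpoint
LUA_SCRIPT = """
function main(splash, args)
    splash:set_custom_headers({
        ["Connection"] = "keep-alive",
        ["Accept"] = "*/*",
    })
    assert(splash:go(args.url))
    assert(splash:wait(0.5))
    return {html = splash:html()}
end
"""


class OverstockSpider(scrapy.Spider):
    name = 'overstock_rugs'
    start_urls = ['https://www.overstock.com/Home-Garden/Area-Rugs/31446/subcat.html']

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(
                url,
                callback=self.parse,
                endpoint='execute',
                args={'lua_source': LUA_SCRIPT},
            )

    def parse(self, response):
        # With scrapy-splash's default magic_response handling, the 'html' key
        # returned by the Lua script becomes the response body.
        self.logger.info('Rendered %d characters of HTML', len(response.text))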

400 Bad request on python requests.get()

I am doing a bit of web scraping of political donations: I scrape a link from one page that I then need to request in turn. I can get the secondary links just fine; however, when I send a requests.get() call for one of them, the HTML returned is a 400 Bad Request error page.
I've already tried changing the request around by changing or adding headers, but nothing seems to work.
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept - Encoding": "gzip, deflate",
    "Accept-Language": "en-US,en;q=0.9",
    "Cache - Control": "max - age = 0",
    "Connection": "keep-alive",
    "DNT": "1",
    "Host": "docquery.fec.gov",
    "Referer": "http://www.politicalmoneyline.com/tr/tr_MG_IndivDonor.aspx?tm=3",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
}
params = {
    "14960391627": ""
}
pdf_page = requests.get(potential_donor[10], headers=headers, params=params)
html = pdf_page.text
soup_donor_page = BeautifulSoup(html, 'html.parser')
print(soup_donor_page)
Note: the URLs of the pages look something like this:
http://docquery.fec.gov/cgi-bin/fecimg/?14960391627
with the ending digits varying.
The output of print(soup_donor_page) is:
400 Bad request
Your browser sent an invalid request.
I need the actual HTML of the page in order to grab the embedded PDF from it.
I suspect the cause is an issue that arises when requests is given a parameter without a value.
Try building the URL with a format string instead:
import requests
from bs4 import BeautifulSoup
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
}
param = "14960391627"
r = requests.get(f"http://docquery.fec.gov/cgi-bin/fecimg/?{param}", headers=headers)
soup = BeautifulSoup(r.content, "html.parser")
print(soup.find("embed")["src"])
Result:
http://docquery.fec.gov/pdf/859/14960388859/14960388859_002769.pdf#zoom=fit&navpanes=0
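To see why the params approach misfires, you can compare the URL that requests actually builds against the one the server expects. A quick, hedged illustration of requests' URL encoding (nothing here is specific to fec.gov beyond the example digits from the question):

import requests

# URL built by requests when a parameter is given an empty value
prepared = requests.Request(
    "GET",
    "http://docquery.fec.gov/cgi-bin/fecimg/",
    params={"14960391627": ""},
).prepare()
print(prepared.url)
# http://docquery.fec.gov/cgi-bin/fecimg/?14960391627=   <- note the trailing '='

# URL the server expects (and what the format string produces)
param = "14960391627"
print(f"http://docquery.fec.gov/cgi-bin/fecimg/?{param}")
# http://docquery.fec.gov/cgi-bin/fecimg/?14960391627

That trailing '=' appears to be enough for this server to reject the request as malformed, which is consistent with the 400 going away once the URL is built by hand.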

Why are there 2 requests from my browser?

I have a simple Node server. All it does is log req.headers and send a response (I am learning!).
let http = require('http');

function handleIncomingRequest(req, res) {
    console.log('---------------------------------------------------');
    console.log(req.headers);
    console.log('---------------------------------------------------');
    console.log();
    console.log('---------------------------------------------------');
    res.writeHead(200, {'Content-Type': 'application/json'});
    res.end(JSON.stringify({error: null}) + '\n');
}

let s = http.createServer(handleIncomingRequest);
s.listen(8080);
When I use curl to test the server, it sends 1 request. When I use Chrome, it sends 2 different requests.
{ host: 'localhost:8080',
  connection: 'keep-alive',
  'cache-control': 'max-age=0',
  'upgrade-insecure-requests': '1',
  'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
  accept: 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
  'accept-encoding': 'gzip, deflate, sdch, br',
  'accept-language': 'en-GB,en;q=0.8' }
and
{ host: 'localhost:8080',
  connection: 'keep-alive',
  'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/56.0.2924.87 Safari/537.36',
  accept: 'image/webp,image/*,*/*;q=0.8',
  referer: 'http://localhost:8080/',
  'accept-encoding': 'gzip, deflate, sdch, br',
  'accept-language': 'en-GB,en;q=0.8' }
This is in incognito mode as in normal mode there are 3 requests!
What is the browser doing and why?
Hard to tell without seeing the full transaction data (for example, what the request was, i.e. what came after GET or POST, and what the answers from the server were).
But it could be caused by the 'upgrade-insecure-requests': '1' header:
When a server encounters this preference in an HTTP request's headers,
it SHOULD redirect the user to a potentially secure representation of
the resource being requested.
See this.
On the other hand, judging by its Accept header (accept: 'image/webp,image/*,*/*;q=0.8'), the second request is probably for an image only, most likely the favicon.ico, or perhaps a (bigger) icon for iPad/iPhone (which could explain the 3 requests in normal mode). You should check the full request data to be sure.
You can press F12 and select the Network tab in the browser to see what is really happening.
