400 Bad request on python requests.get() - python-requests

I am doing a bit of web scraping of political donations. From one page I scrape a set of links that I then need to scrape in turn. I can collect those secondary links just fine; however, when I make a requests.get() call for one of them, the HTML returned is a 400 Bad Request error page.
I've already tried changing the request by modifying or adding headers, but nothing seems to work.
headers = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
    "Accept - Encoding": "gzip, deflate",
    "Accept-Language": "en-US,en;q=0.9",
    "Cache - Control": "max - age = 0",
    "Connection": "keep-alive",
    "DNT": "1",
    "Host": "docquery.fec.gov",
    "Referer": "http://www.politicalmoneyline.com/tr/tr_MG_IndivDonor.aspx?tm=3",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
}
params = {
    "14960391627": ""
}
pdf_page = requests.get(potential_donor[10], headers=headers, params=params)
html = pdf_page.text
soup_donor_page = BeautifulSoup(html, 'html.parser')
print(soup_donor_page)
Note: the URLs of these pages look something like this:
http://docquery.fec.gov/cgi-bin/fecimg/?14960391627
with the ending digits differing from page to page.
The output of print(soup_donor_page) is:
400 Bad request
Your browser sent an invalid request.
I need the actual HTML of the page in order to grab the embedded PDF.

I suspect the cause is how requests encodes a query parameter that has no value.
Try building the URL with a format string instead:
import requests
from bs4 import BeautifulSoup
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
}
param = "14960391627"
r = requests.get(f"http://docquery.fec.gov/cgi-bin/fecimg/?{param}", headers=headers)
soup = BeautifulSoup(r.content, "html.parser")
print(soup.find("embed")["src"])
Result:
http://docquery.fec.gov/pdf/859/14960388859/14960388859_002769.pdf#zoom=fit&navpanes=0
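Since the stated goal is to grab the embedded PDF, here is a minimal follow-up sketch that reuses the requests import and headers from the snippet above (the output file name is just an illustration):
# Strip the viewer fragment (#zoom=...) and download the PDF that the <embed> tag points to.
pdf_url = soup.find("embed")["src"].split("#")[0]
pdf = requests.get(pdf_url, headers=headers)
pdf.raise_for_status()
with open("filing.pdf", "wb") as f:  # "filing.pdf" is an arbitrary example name
    f.write(pdf.content)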

Related

Scraping data from https://cardano.ideascale.com webpage, but server noticed I am using Internet Explorer

I am scraping the content of this link. My procedure is:
GET /get-token to obtain a Bearer token.
GET the "Fork Gitcoin and deploy on Cardano" idea using that token in the header, expecting JSON content in the response.
My issue is that when I run the code below, the GET /detail response tells me I am using Internet Explorer, which is strange because my request header has "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36".
<div id="ie-unsupported-alert" class="ie-d-none">
<p>We noticed you are using Internet Explorer. We don\'t have support for this browser in Incoming Moderation!
</p>
<p>We recommend using the Microsoft Edge Browser, Chrome, Firefox or Safari. <a
href="https://help.ideascale.com/knowledge/internet-web-browsers-supported-by-ideascale">Click for more
info.</a></p>
</div>
Can anyone explain the error and tell me how to fix it?
Below is my Python code.
import requests

def get_content(url):
    s = requests.session()
    response = s.get("https://cardano.ideascale.com/a/community/api/get-token")
    if response.status_code != 200:
        print(f"\033[4m\033[1m{response.status_code}\033[0m")
        return None
    cookies = response.cookies
    headers = {
        "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/104.0.0.0 Safari/537.36",
        'Accept': 'application/json,',
        'Accept-Language': 'en-US,en;q=0.5',
        'Accept-Encoding': 'gzip, deflate',
        'Cache-Control': 'no-cache',
        'Pragma': 'no-cache',
        'Authorization': f'Bearer {response.content.decode("utf-8")}',
        'Alt-Used': 'cardano.ideascale.com',
        'Connection': 'keep-alive',
        'Referer': url,
        'Sec-Fetch-Dest': 'empty',
        'Sec-Fetch-Mode': 'cors',
        'Sec-Fetch-Site': 'same-origin',
        'TE': 'trailers',
    }
    # import ipdb; ipdb.set_trace()
    response = s.get(f"{url}/detail", headers=headers, cookies=cookies)
    print(response.content)

get_content("https://cardano.ideascale.com/c/idea/317821")

Python Scrapy: How to login into ASP.net website

I am trying to write a script that logs in to a private website and crawls data with Scrapy.
The website requires a login.
I used Chrome to watch the network traffic during a manual login and found that three requests are sent after I click the login button:
The first is the login request.
The second is checkuservalid.
The third is a GET request to the index page.
Note: requests 1 and 2 only appear briefly and disappear once the login succeeds.
I tried to follow some instructions using Scrapy's FormRequest and FormRequest.from_response but could not log in.
Please give me some advice for this case.
import scrapy

class LoginSpider(scrapy.Spider):
    name = "Test"
    start_urls = ['http://hvsfcweb.fushan.fihnbb.com/Login.aspx']
    headers = {
        'Content-Type': 'application/json; charset=UTF-8',
        'Referer': 'http://hvsfcweb.fushan.fihnbb.com/Login.aspx',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36',
        'X-Requested-With': 'XMLHttpRequest',
    }

    def start_request(self):
        yield scrapy.Request(url=self.start_urls,
                             method="POST",
                             body='{"userCode":"hluvan","pwd":"1","lang":"en-us","loc":"S010^B125"}',
                             headers=self.headers,
                             callback=self.parse)

    def parse(self, response):
        filename = 'quotes.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
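Not a verified solution for this particular site, but one common pattern is to chain the three observed requests through callbacks so Scrapy's cookie middleware carries the ASP.NET session between them. The checkuservalid and index URLs below are placeholders inferred from the question; replace them with the exact paths shown in Chrome's network tab:
import json
import scrapy

class ChainedLoginSpider(scrapy.Spider):
    name = "test_chained"
    login_url = 'http://hvsfcweb.fushan.fihnbb.com/Login.aspx'
    # Placeholder URLs -- substitute the real paths from the network tab.
    check_url = 'http://hvsfcweb.fushan.fihnbb.com/CheckUserValid.aspx'
    index_url = 'http://hvsfcweb.fushan.fihnbb.com/Index.aspx'
    headers = {
        'Content-Type': 'application/json; charset=UTF-8',
        'X-Requested-With': 'XMLHttpRequest',
    }

    def start_requests(self):
        # Scrapy calls start_requests (plural); a method named start_request is never invoked.
        payload = {"userCode": "hluvan", "pwd": "1", "lang": "en-us", "loc": "S010^B125"}
        yield scrapy.Request(self.login_url, method="POST",
                             body=json.dumps(payload),
                             headers=self.headers,
                             callback=self.after_login)

    def after_login(self, response):
        # Session cookies set by the login response are reused automatically.
        yield scrapy.Request(self.check_url, method="POST", body='{}',
                             headers=self.headers, callback=self.after_check)

    def after_check(self, response):
        yield scrapy.Request(self.index_url, callback=self.parse)

    def parse(self, response):
        with open('quotes.html', 'wb') as f:
            f.write(response.body)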

Scrapy - 403 error not solving even after adding Headers

I am trying to scrape doordash.com, but every time I run the request it returns 403, along with this line: INFO: Ignoring response <403 http://doordash.com/>: HTTP status code is not handled or not allowed.
I tried many things, like adding a User-Agent, but it still didn't work. I also added the full set of browser headers, but the same thing happens.
Here's my code:
class DoordashSpider(scrapy.Spider):
    name = 'doordash'
    allowed_domains = ['doordash.com']
    start_urls = ['http://doordash.com/']

    def start_requests(self):
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.101 Safari/537.36',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
            'Accept-Language': 'en-US,en;q=0.9',
            'Accept-Encoding': 'gzip, deflate, br'
        }
        for url in self.start_urls:
            yield scrapy.Request(url, headers=headers)

    def parse(self, response):
        print('Crawled Successfully')
How can I get a 200?
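One thing worth trying (with no guarantee it gets past doordash.com's bot protection, which may reject any non-browser client regardless of headers) is to apply the headers globally through Scrapy settings, so retries and redirected requests carry them as well:
import scrapy

class DoordashSpider(scrapy.Spider):
    name = 'doordash'
    allowed_domains = ['doordash.com']
    # The https URL avoids an extra redirect from the bare http domain.
    start_urls = ['https://www.doordash.com/']
    custom_settings = {
        'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.101 Safari/537.36',
        'DEFAULT_REQUEST_HEADERS': {
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
            'Accept-Language': 'en-US,en;q=0.9',
        },
    }

    def parse(self, response):
        print('Crawled Successfully')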

Why won't Splash render this webpage?

I'm quite new to Splash, and though I was able to get Splash set up on Ubuntu 18 (via Splash/Docker), it gives me different results for this page:
https://www.overstock.com/Home-Garden/Area-Rugs/31446/subcat.html
In a normal browser the page renders correctly, but when I render it in Splash the page comes out broken.
I have tried changing the user agent in Splash to this:
Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36
Consequently, the Splash script looks like this:
function main(splash, args)
    splash:set_user_agent(
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.135 Safari/537.36'
    )
    assert(splash:go(args.url))
    assert(splash:wait(0.5))
    return {
        html = splash:html(),
        png = splash:png(),
        har = splash:har(),
    }
end
Yet, despite these additions, it still fails to render the page.
How can I get Splash to render this page?
It seems that overstock.com requires the Connection and Accept headers. Add them to your request and it should work as expected.
Tested in Postman: without the Connection: keep-alive and Accept: */* headers I get the same error page; after adding the two headers, the page renders correctly.
Therefore your request should be edited accordingly:
function main(splash, args)
    splash:set_custom_headers({
        ["Connection"] = "keep-alive",
        ["Accept"] = "*/*",
    })
    assert(splash:go(args.url))
    assert(splash:wait(0.5))
    return {
        html = splash:html(),
        png = splash:png(),
        har = splash:har(),
    }
end
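If you drive Splash from Scrapy rather than calling its HTTP API directly, the same script can be passed to the execute endpoint via scrapy-splash; this sketch assumes the scrapy-splash downloader middlewares are already configured in settings.py:
import scrapy
from scrapy_splash import SplashRequest

LUA_SCRIPT = """
function main(splash, args)
    splash:set_custom_headers({
        ["Connection"] = "keep-alive",
        ["Accept"] = "*/*",
    })
    assert(splash:go(args.url))
    assert(splash:wait(0.5))
    return {html = splash:html()}
end
"""

class OverstockSpider(scrapy.Spider):
    name = "overstock"

    def start_requests(self):
        url = "https://www.overstock.com/Home-Garden/Area-Rugs/31446/subcat.html"
        # 'execute' runs the Lua script above inside Splash.
        yield SplashRequest(url, self.parse, endpoint="execute",
                            args={"lua_source": LUA_SCRIPT})

    def parse(self, response):
        # With the default magic_response, the 'html' key returned by the
        # script becomes the response body.
        self.log(response.text[:200])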

Convert XHR (XML Http Request) into R command

I am trying to turn an XHR (XMLHttpRequest) request into an R command.
I am using the following code:
library(httr)
x <- POST("https://transparency.entsoe.eu/generation/r2/actualGenerationPerGenerationUnit/getDataTableDetailData/?name=&defaultValue=false&viewType=TABLE&areaType=BZN&atch=false&dateTime.dateTime=17.03.2017+00%3A00%7CUTC%7CDAYTIMERANGE&dateTime.endDateTime=17.03.2017+00%3A00%7CUTC%7CDAYTIMERANGE&area.values=CTY%7C10YBE----------2!BZN%7C10YBE----------2&productionType.values=B02&productionType.values=B03&productionType.values=B04&productionType.values=B05&productionType.values=B06&productionType.values=B07&productionType.values=B08&productionType.values=B09&productionType.values=B10&productionType.values=B11&productionType.values=B12&productionType.values=B13&productionType.values=B14&productionType.values=B20&productionType.values=B15&productionType.values=B16&productionType.values=B17&productionType.values=B18&productionType.values=B19&dateTime.timezone=UTC&dateTime.timezone_input=UTC&dv-datatable-detail_22WAMERCO000010Y_22WAMERCO000008L_length=10&dv-datatable_length=50&detailId=22WAMERCO000010Y_22WAMERCO000008L",
user_agent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.50 Safari/537.36"),
add_headers(`Referer`="https://transparency.entsoe.eu/generation/r2/actualGenerationPerGenerationUnit/show?name=&defaultValue=true&viewType=TABLE&areaType=BZN&atch=false&dateTime.dateTime=17.03.2017+00:00|UTC|DAYTIMERANGE&dateTime.endDateTime=17.03.2017+00:00|UTC|DAYTIMERANGE&area.values=CTY|10YBE----------2!BZN|10YBE----------2&productionType.values=B02&productionType.values=B03&productionType.values=B04&productionType.values=B05&productionType.values=B06&productionType.values=B07&productionType.values=B08&productionType.values=B09&productionType.values=B10&productionType.values=B11&productionType.values=B12&productionType.values=B13&productionType.values=B14&productionType.values=B15&productionType.values=B16&productionType.values=B17&productionType.values=B18&productionType.values=B19&productionType.values=B20&dateTime.timezone=UTC&dateTime.timezone_input=UTC&dv-datatable_length=100",
Connection = "keep-alive",
Host = "https://transparency.entsoe.eu/",
Accept = "application/json, text/javascript, */*; q=0.01",
`Accept-Encoding` = "gzip, deflate, br",
Origin = "https://transparency.entsoe.eu",
`X-Requested-With` = "XMLHttpRequest",
`Content-Type` = "application/json;charset=UTF-8",
`Accept-Language`= "en-US,en;q=0.8,nl;q=0.6,fr-FR;q=0.4,fr;q=0.2"))
But I keep getting a 400 Bad Request error instead of the 200 that would mark a successful response.
I extracted the values via the Chrome network monitor from this website; the XHR request is sent when the plus button is clicked. I can send it repeatedly from my browser, but it doesn't seem to work from R.
What am I doing wrong in creating the POST request?
