Scrapy request not going through - web-scraping

I don't know how to frame this question exactly. I am beginner at web scraping and I am trying to crawl a website using Python Scrapy. The website is dynamic and uses javascript and can't retrieve any data using the basic level xpath and CSS selectors.
I am trying to mimic the API request through my spider by requesting the url which has the data in json object. That request url is throwing a HTTP status code is not handled or not allowed error.
I think I am calling the wrong URL. 9/10 times this method of calling the json object url directly has worked for me. What can I do different?
the url has parameters and form data items in the headers section and the url doesn't even look like a valid website url
it starts with https://ih3kc909gb-dsn.algolia.net/1/indexes....
I know this is a long question but I could really use some help with how to get a response for this?

You should use start_requests() method instead of start_urls property. You can read more about it from here . Now, all you need to do is to make a POST request.
Code
import scrapy
class carswitch(scrapy.Spider):
name = 'car'
headers = {
"Connection": "keep-alive",
"Pragma": "no-cache",
"Cache-Control": "no-cache",
"sec-ch-ua": "\" Not;A Brand\";v=\"99\", \"Google Chrome\";v=\"91\", \"Chromium\";v=\"91\"",
"accept": "application/json",
"sec-ch-ua-mobile": "?0",
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36",
"content-type": "application/x-www-form-urlencoded",
"Origin": "https://carswitch.com",
"Sec-Fetch-Site": "cross-site",
"Sec-Fetch-Mode": "cors",
"Sec-Fetch-Dest": "empty",
"Referer": "https://carswitch.com/",
"Accept-Language": "en-US,en;q=0.9"
}
body = '{"params":"query=&hitsPerPage=24&page=0&numericFilters=%5B%22country_id%3D1%22%2C%22used_car%20%3D%201%22%5D&facetFilters=&typoTolerance=&tagFilters=%5B%5D&attributesToHighlight=%5B%5D&attributesToRetrieve=%5B%22make%22%2C%22make_ar%22%2C%22model%22%2C%22model_ar%22%2C%22year%22%2C%22trim%22%2C%22displayTrim%22%2C%22colorPaint%22%2C%22bodyType%22%2C%22salePrice%22%2C%22transmissionType%22%2C%22GPS%22%2C%22carID%22%2C%22inspectionID%22%2C%22inspectionStatus%22%2C%22rate%22%2C%22certified_dealer_id%22%2C%22dealer_category%22%2C%22used_car%22%2C%22new%22%2C%22top_condition%22%2C%22featured%22%2C%22photo%22%2C%22modifiedPlace%22%2C%22city%22%2C%22mileage%22%2C%22urgent_sales%22%2C%22price_dropped%22%2C%22urgent_sales_days%22%2C%22urgent_sales_end_date%22%2C%22date%22%2C%22negotiable%22%2C%22oldPrice%22%2C%22zero_downpayment%22%2C%22cashOnly%22%2C%22hasPriceGuidance%22%2C%22dealerOffer%22%2C%22maxPrice%22%2C%22fairPrice%22%2C%22pricey_deal%22%2C%22fair_deal%22%2C%22good_deal%22%2C%22great_deal%22%2C%22dealership_info%22%2C%22logo_small%22%2C%22GCCspecs%22%2C%22country%22%2C%22export%22%2C%22monthly_price%22%5D"}'
def start_requests(self):
url = 'https://ih3kc909gb-dsn.algolia.net/1/indexes/All_Carswitch_Cars/query?x-algolia-agent=Algolia%20for%20JavaScript%20(3.33.0)%3B%20Browser&x-algolia-application-id=IH3KC909GB&x-algolia-api-key=493a9bbc57331df3b278fa39c1dd8f2d'
yield Request(url=url, method='POST', headers=self.headers, body=self.body, callback=self.parse)
def parse(self,response):
print(response.body)

Related

wordpress json API fetch problem from localhost

if I call the json api v2 in wordpress from a local website via fetch, it works without any problems.
fetch('https://example.com/wp-json/wp/v2/pages/', { method: 'GET', headers: { 'content-type': 'application/json' } } )
when I call the old json api via fetch I get an error unable to fetch (directly with the browser no problem)
fetch('https://example.com/?json=get_page_index&custom_fields=*&dev=1&count=800', { method: 'GET', headers: { 'content-type': 'plain/text' } } )
in the network debug tab I see
provisional headers are shown Learn more
application/json: application/json
DNT: 1
Referrer: http://localhost:8080/
User-Agent: Mozilla/5.0 (iPad; CPU OS 13_3 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) CriOS/87.0.4280.7
What could be the reason.

Python Scrapy: How to login into ASP.net website

I try to make scripts to login into private website and crawl data with Scrapy.
However this website requested to login.
I used chrome to check network when do manual login and found out that have 3 request was sent out after i clicked login button.
The first is login
Login request
The second is checkuservalid
Check valid user
Request to index page
Get request to index page
Note: Request 1 and 2 just display and disappear after login success.
I try to do as some instruction with scrapy FormRequest, request_from respone but can not login.
Please help give me some advices for this case.
import scrapy
class LoginSpider(scrapy.Spider):
name = "Test"
start_urls = ['http://hvsfcweb.fushan.fihnbb.com/Login.aspx']
headers = {'Content-Type': 'application/json; charset=UTF-8',
'Referer': 'http://hvsfcweb.fushan.fihnbb.com/Login.aspx',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.71 Safari/537.36',
'X-Requested-With': 'XMLHttpRequest',
}
def start_request(self):
yield scrapy.Request(url=self.start_urls,
method="POST",
body='{"userCode":"hluvan","pwd":"1","lang":"en-us","loc":"S010^B125"}',
headers = self.headers,
callback=self.parse)
def parse(self, response):
filename = f'quotes.html'
with open(filename, 'wb') as f:
f.write(response.body)

400 Bad request on python requests.get()

I am doing a bit of web scraping with political donations and have a link that I am scraping from one page than I then need to scrape. I can get the secondary links just fine, however when i try to send a requests.get() call, the html returned from the call gives me a bad request 400 error.
I've already tried to change the request around by changing or adding more headers but nothing seems to work.
headers = {
"Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8",
"Accept - Encoding": "gzip, deflate",
"Accept-Language": "en-US,en;q=0.9",
"Cache - Control": "max - age = 0",
"Connection": "keep-alive",
"DNT": "1",
"Host": "docquery.fec.gov",
"Referer": "http://www.politicalmoneyline.com/tr/tr_MG_IndivDonor.aspx?tm=3",
"Upgrade-Insecure-Requests": "1",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
}
params = {
"14960391627": ""
}
pdf_page = requests.get(potential_donor[10], headers=headers, params=params)
html = pdf_page.text
soup_donor_page = BeautifulSoup(html, 'html.parser')
print(soup_donor_page)
note: the url of the sites should look something like this:
http://docquery.fec.gov/cgi-bin/fecimg/?14960391627
with the ending digits being different
The output of the print(soup_donor_page) is:
400 Bad request
Your browser sent an invalid request.
I need to get the actual html of the page in order to grab the embedded pdf from the page.
I suspect the cause is an issue with requests that arises when it is provided a parameter without a value.
Try building the url with a format string instead:
import requests
from bs4 import BeautifulSoup
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
}
param = "14960391627"
r = requests.get(f"http://docquery.fec.gov/cgi-bin/fecimg/?{param}", headers=headers)
soup = BeautifulSoup(r.content, "html.parser")
print(soup.find("embed")["src"])
Result:
http://docquery.fec.gov/pdf/859/14960388859/14960388859_002769.pdf#zoom=fit&navpanes=0

Convert XHR (XML Http Request) into R command

I am trying to turn an XHR (XMLHttpRequest) request into an R command.
I am using the following code:
library(httr)
x <- POST("https://transparency.entsoe.eu/generation/r2/actualGenerationPerGenerationUnit/getDataTableDetailData/?name=&defaultValue=false&viewType=TABLE&areaType=BZN&atch=false&dateTime.dateTime=17.03.2017+00%3A00%7CUTC%7CDAYTIMERANGE&dateTime.endDateTime=17.03.2017+00%3A00%7CUTC%7CDAYTIMERANGE&area.values=CTY%7C10YBE----------2!BZN%7C10YBE----------2&productionType.values=B02&productionType.values=B03&productionType.values=B04&productionType.values=B05&productionType.values=B06&productionType.values=B07&productionType.values=B08&productionType.values=B09&productionType.values=B10&productionType.values=B11&productionType.values=B12&productionType.values=B13&productionType.values=B14&productionType.values=B20&productionType.values=B15&productionType.values=B16&productionType.values=B17&productionType.values=B18&productionType.values=B19&dateTime.timezone=UTC&dateTime.timezone_input=UTC&dv-datatable-detail_22WAMERCO000010Y_22WAMERCO000008L_length=10&dv-datatable_length=50&detailId=22WAMERCO000010Y_22WAMERCO000008L",
user_agent("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.50 Safari/537.36"),
add_headers(`Referer`="https://transparency.entsoe.eu/generation/r2/actualGenerationPerGenerationUnit/show?name=&defaultValue=true&viewType=TABLE&areaType=BZN&atch=false&dateTime.dateTime=17.03.2017+00:00|UTC|DAYTIMERANGE&dateTime.endDateTime=17.03.2017+00:00|UTC|DAYTIMERANGE&area.values=CTY|10YBE----------2!BZN|10YBE----------2&productionType.values=B02&productionType.values=B03&productionType.values=B04&productionType.values=B05&productionType.values=B06&productionType.values=B07&productionType.values=B08&productionType.values=B09&productionType.values=B10&productionType.values=B11&productionType.values=B12&productionType.values=B13&productionType.values=B14&productionType.values=B15&productionType.values=B16&productionType.values=B17&productionType.values=B18&productionType.values=B19&productionType.values=B20&dateTime.timezone=UTC&dateTime.timezone_input=UTC&dv-datatable_length=100",
Connection = "keep-alive",
Host = "https://transparency.entsoe.eu/",
Accept = "application/json, text/javascript, */*; q=0.01",
`Accept-Encoding` = "gzip, deflate, br",
Origin = "https://transparency.entsoe.eu",
`X-Requested-With` = "XMLHttpRequest",
`Content-Type` = "application/json;charset=UTF-8",
`Accept-Language`= "en-US,en;q=0.8,nl;q=0.6,fr-FR;q=0.4,fr;q=0.2"))
But I keep getting an 400 error: bad request instead of the 200 which would mark a successful response.
I've extracted the values via the Chrome network monitor from this website. The XHR request is sent when the plus button is clicked. I can send it repeatedly from my browser, but it doesn't seem to work from R.
What am I doing wrong in creating the Post request?

Login Authentication with Requests

I'm trying to work with requests (python 3.4) to create a session where I log into gamefaqs.com and navigate to a board page so that I can scrape the content off to get relavant information for what I'm trying to accomplish. I directly copied the header and payload information from the developer console in firefox.
import requests
import urllib3
url = 'http://www.gamefaqs.com/user/login'
url2 = 'http://www.gamefaqs.com/user/Leight_Weight/boards'
header = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Encoding': 'gzip, deflate',
'Accept-Language': 'en-US,en;q=0.5',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:44.0) Gecko/20100101 Firefox/44.0',
'Referer': 'http://www.gamefaqs.com/',
'Connection': 'keep-alive',
'Host': 'www.gamefaqs.com',
}
payload = {
'path': "http://www.gamefaqs.com/",
'key': "71548de4",
'EMAILADDR': "username",
'PASSWORD': "password",
}
with requests.Session() as s:
p = s.get(url, headers=header)
p = s.post(url, headers=header, data=payload, cookies = s.cookies)
The problem that I'm having is that I'm not receiving back the authentication cookie passed from the website to my session. I'm using fiddler to track the post request from Python. Despite the request header information being identical to the request header information in firefox, the response header information is very different.
The response header from firefox (as seen by Fiddler):
Firefox Response Header
The response header from Python (as seen by Fiddler):
Python Response Header
At this point I'm at a bit of a loss. As far as I can tell my code is sound and the request headers are correct, however not receiving the authentication cookie proves something is wrong. If you look in the response header the codes are different (302 vs 200). I'm not sure what the error is.
As it turns out - the payload item 'key' changes depending on your session. I didn't catch this initially because I didn't think of the fact that browsers use persistent cookies through open/close, something this solution does not use.
I did a bit of a heavy-handed approach to finding the right key value using BeautifulSoup, but the result remains the same. Once I had the appropriate key value, I added that to the payload before doing the post command and viola - successful login.
For posterity's sake, the code is below.
import requests
from bs4 import BeautifulSoup as bs
url = 'http://www.gamefaqs.com/user/login'
url2 = 'http://www.gamefaqs.com/user/Leight_Weight/boards'
header = {
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Encoding': 'gzip, deflate',
'Accept-Language': 'en-US,en;q=0.5',
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:44.0) Gecko/20100101 Firefox/44.0',
'Referer': 'http://www.gamefaqs.com/',
'Connection': 'keep-alive',
'Host': 'www.gamefaqs.com',
}
payload = {
'PASSWORD': "password",
'path': "http://www.gamefaqs.com/",
'EMAILADDR': "username",
}
with requests.Session() as s:
resp = s.get(url, headers=header)
parse = bs(resp.text)
keyval = parse.find_all('form')[1].contents[1]['value']
payload['key'] = keyval
p = s.post(url, headers=header, data=payload)

Resources