requests POST: finding the Request Headers and Payload values (python-requests)

url : https://www.lotteon.com/p/product/LO1486128200?sitmNo=LO1486128200_1486128201&ch_no=100065&ch_dtl_no=1000030&entryPoint=pcs&dp_infw_cd=CHT&NaPm=ct%3Dlde41r14%7Cci%3Dc522920c2ab46a637e9b91c71b82a2649b6c1d19%7Ctr%3Dslbrc%7Csn%3D1243359%7Chk%3D1f928974fb136e1a4ff715351e786dfda81b79be
{
    "afflPdLwstMrgnRt": null,
    "afflPdMrgnRt": null,
    "aplyStdDttm": "20230127145554",
    "brdNo": "P31294",
    "cartDvsCd": "01",
    "chCsfCd": "PA",
    "chDtlNo": "1000030",
    "chNo": "100065",
    "chTypCd": "PA08",
    "cpnBoxVersion": "V2",
    "ctrtTypCd": "A",
    "dvCst": 0,
    "fprdDvPdYn": "N",
    "infwMdiaCd": "PC",
    "lrtrNo": "",
    "maxPurQty": 999999,
    "pcsLwstMrgnRt": 1,
    "scatNo": "BC02030400",
    "sfcoPdLwstMrgnRt": 1,
    "sfcoPdMrgnRt": 9,
    "sitmNo": "LO1486128200_1486128201",
    "slPrc": 2700000,
    "slQty": 1,
    "spdNo": "LO1486128200",
    "strCd": "",
    "thdyPdYn": "N",
    "trGrpCd": "SR",
    "trNo": "LO10022441"
}
This is the request payload needed when calling requests.post. I can't find the corresponding values in the soup or in the query string of the URL above. Is there any code where I can find those values through the URL?
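Some of those values do appear in the URL itself: sitmNo is a query parameter, ch_no and ch_dtl_no look like they map to chNo and chDtlNo, and spdNo is the product id in the path. A minimal sketch of extracting just those with the standard library (the remaining fields, such as slPrc or trNo, are not in the URL; they are presumably embedded in the page HTML or fetched by the site's JavaScript, so you would have to find them in the page source or in another XHR response):

from urllib.parse import urlparse, parse_qs

url = ("https://www.lotteon.com/p/product/LO1486128200"
       "?sitmNo=LO1486128200_1486128201&ch_no=100065&ch_dtl_no=1000030")

parsed = urlparse(url)
query = parse_qs(parsed.query)

payload = {
    "spdNo": parsed.path.rsplit("/", 1)[-1],   # product id is the last path segment
    "sitmNo": query["sitmNo"][0],
    "chNo": query["ch_no"][0],                 # assumption: ch_no maps to chNo
    "chDtlNo": query["ch_dtl_no"][0],          # assumption: ch_dtl_no maps to chDtlNo
}
print(payload)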

Related

How to extract the original response header from a site before redirection using the requests module in python

import re
import requests

def login_with_requests():
    url = "https://url/login/"
    login_data = {'csrfmiddlewaretoken': '', 'username': 'username', 'password': 'password'}
    response = requests.get(url)
    # print(response.headers)
    response_cookies = response.cookies  # cookies set for the anonymous session
    # Extract the csrfmiddlewaretoken from the login form HTML.
    csrfmiddlewarepattern = re.compile(r'csrfmiddlewaretoken\W\svalue\W{2}([a-zA-Z0-9]+)\W')
    matches = csrfmiddlewarepattern.finditer(response.text)
    for match in matches:
        csrfmiddlewaretoken = match.group(1)
        # print(csrfmiddlewaretoken)
    login_data['csrfmiddlewaretoken'] = csrfmiddlewaretoken
    login_response = requests.post(url, cookies=response_cookies, data=login_data)
    print(login_response.headers)
    print(login_response.history)
I'm able to log in to a site successfully using this code. The problem I have is that when I make a POST request to the login URL with the necessary parameters, the site, although the login succeeds, redirects to the home page. I therefore receive two sets of response headers: the first belongs to the actual response to the login POST (status 302) and contains a redirect header pointing to the home page, and the second belongs to the response carrying the home page itself.
My problem is that the first response from the site contains a session-id token that I need before I can keep interacting with the website, but login_response.headers returns the final response headers, which belong to the request made to the redirected home page.
How can I extract the original response headers received from the site before the redirection, given that they contain the session-id token I need for further interaction with the website?
I checked the login_response.history data; it seems to only return the status code of the previous request.
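As an aside, requests keeps the full intermediate responses, not just their status codes: login_response.history is a list of Response objects, each with its own .headers and .cookies. A minimal sketch of two ways to reach the pre-redirect headers, assuming the same url, login_data and response_cookies as in the code above:

# Option 1: the redirect chain is kept as full Response objects.
first_response = login_response.history[0]
print(first_response.status_code)   # 302
print(first_response.headers)       # headers of the original login response
print(first_response.cookies)       # cookies set before the redirect

# Option 2: don't follow the redirect at all.
raw_response = requests.post(url, cookies=response_cookies,
                             data=login_data, allow_redirects=False)
print(raw_response.headers)         # includes the Set-Cookie with the session id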
I found a solution, I thought I should share.
def login_with_requests_PaymentSite():
    url = "<site.com/login/>"
    login_data = {'csrfmiddlewaretoken': '', 'username': '<username>', 'password': '<password>'}
    response = requests.get(url)
    csrf_token = response.cookies  # Cookies returned by the site for a non-authenticated user.
    # Extract the csrfmiddlewaretoken to be used in the login post request.
    csrfmiddlewarepattern = re.compile(r'csrfmiddlewaretoken\W\svalue\W{2}([a-zA-Z0-9]+)\W')
    matches = csrfmiddlewarepattern.finditer(response.text)
    for match in matches:
        csrfmiddlewaretoken = match.group(1)
    login_data['csrfmiddlewaretoken'] = csrfmiddlewaretoken  # Save the csrfmiddlewaretoken in the post login data.
    # Start a session.
    session = requests.Session()
    login_session = session.post(url, cookies=csrf_token, data=login_data)  # Log in to the site.
    sessionid_cookies = login_session.cookies  # sessionid cookies to be used for the consecutive requests.
    with open('Login_Response_Paymentsite.html', 'w') as login_response_file:
        login_response_file.write(login_session.text)
    transaction_history_url = "<site.com/transactions/>"
    transaction_history = requests.get(transaction_history_url, cookies=sessionid_cookies)
    print("\n Result returned for the transactions page: ")
    print(transaction_history.text)
    userinfomation_url = "<site.com/userinformation/>"
    userinformation = requests.get(userinfomation_url, cookies=sessionid_cookies)
    print('\n Result returned for userinformation page: ')
    print(userinformation.text)
To be able to make consecutive requests to a site after a successful login using the requests module, you have to make use of requests.Session(). A Session stores the session id returned by the web application after a successful login. If you use the module-level requests.post instead, the session id is not retained between calls; a Session keeps it automatically.
After making the post request with login_session = session.post(url, cookies=csrf_token, data=login_data), you extract the session id to be used for consecutive requests with sessionid_cookies = login_session.cookies.
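Worth noting: once the login goes through a Session, the sessionid cookie lives in the Session's cookie jar and is sent automatically on every later request made through that same Session, so passing cookies= by hand isn't strictly necessary. A minimal sketch, reusing the url and login_data names from the code above:

session = requests.Session()
session.get(url)                    # picks up the anonymous cookies (incl. the csrf cookie)
session.post(url, data=login_data)  # logs in; the sessionid lands in the cookie jar
# No cookies= argument needed: the Session replays its jar on every request.
transactions = session.get("<site.com/transactions/>")
print(transactions.text)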

How can I convert the requests code to Scrapy?

import requests

def get_all_patent():
    patent_list = []
    for i in range(100):
        res = requests.get(url).text  # url is assumed to be defined elsewhere
        patent_list.append(res)
    return patent_list
Because Scrapy can't return a response from a Request directly (reference: How can I get the response from the Request in Scrapy?), I want to extend the patent_list variable, but I can't get the response body. Can I do it through the Request meta, or do something in the Response?
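A Scrapy version of the same loop would collect the bodies on the spider itself: each response lands in a callback, and the callback appends response.text to a list, which plays the role of patent_list. A minimal sketch, with a placeholder start URL since the original snippet never shows the real one:

import scrapy

class PatentSpider(scrapy.Spider):
    name = "patents"
    # Placeholder URL: the original snippet never shows the real one.
    start_urls = ["https://example.com/patents"]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.patent_list = []  # plays the role of patent_list above

    def parse(self, response):
        self.patent_list.append(response.text)  # the decoded response body

If a value needs to travel from one request to the next, Request.meta (or cb_kwargs in newer Scrapy versions) is the usual channel: set it when yielding the Request and read it back from response.meta in the callback.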

Unable to modify request in middleware using Scrapy

I am in the process of scraping public meteorology data for a project (data science), and in order to do that effectively I need to change the proxy used in my Scrapy requests whenever I receive a 403 response code.
For this, I have defined a download middleware to handle that situation, which is as follows:
from scrapy import Request

class ProxyMiddleware(object):
    def process_response(self, request, response, spider):
        if response.status == 403:
            with open("Proxies.txt") as f:
                proxy = random_line(f)  # Just returns a random line from the file with a valid structure ("http://IP:port")
            new_request = Request(url=request.url)
            new_request.meta['proxy'] = proxy
            spider.logger.info("[Response 403] Changed proxy to %s" % proxy)
            return new_request
        return response
After properly adding the class to settings.py, I expected this middleware to deal with 403 responses by generating a new request with the new proxy, and hence end up with a 200 response. The observed behaviour is that it does get executed (I can see the logger info about the changed proxy), but the new request does not seem to be made. Instead, I'm getting this:
2018-12-26 23:33:19 [bot_2] INFO: [Response] Changed proxy to https://154.65.93.126:53281
2018-12-26 23:33:26 [bot_2] INFO: [Response] Changed proxy to https://176.196.84.138:51336
...and so on indefinitely, with random proxies, which makes me think that I'm still getting 403 errors and the proxy is not actually changing.
Reading the documentation, regarding process_response, it states:
(...) If it returns a Request object, the middleware chain is halted and the returned request is rescheduled to be downloaded in the future. This is the same behavior as if a request is returned from process_request().
Is it possible that "in the future" does not mean "right after it is returned"? What should I do to change the proxy for all requests from that moment on?
Scrapy drops duplicate requests to the same url by default, so that's probably what's happening in your spider. To check whether this is your case, you can set these settings:
DUPEFILTER_DEBUG = True
LOG_LEVEL = 'DEBUG'
To solve it, add dont_filter=True to the new request:
new_request = Request(url=request.url, dont_filter=True)
Try this:
class ProxyMiddleware(object):
    def process_response(self, request, response, spider):
        if response.status == 403:
            with open("Proxies.txt") as f:
                proxy = random_line(f)
            new_request = Request(url=request.url)
            new_request.meta['proxy'] = proxy
            spider.logger.info("[Response 403] Changed proxy to %s" % proxy)
            return new_request
        else:
            return response
A better approach would be to use the scrapy-rotating-proxies module instead:
DOWNLOADER_MIDDLEWARES = {
    'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
    'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
}
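For completeness, rotating_proxies also needs a source of proxies; a minimal sketch of the extra settings.py entries, assuming the same Proxies.txt file as above (one "http://IP:port" per line):

# settings.py (sketch): point rotating_proxies at the proxy list.
ROTATING_PROXY_LIST_PATH = 'Proxies.txt'   # file with one proxy per line
# Or inline, as a list of strings:
# ROTATING_PROXY_LIST = ['http://IP1:port1', 'http://IP2:port2']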

How to find header data and name? (Python-requests)

I want to use requests to scrape a site behind a login. I have already done it with Selenium, but that way is inconvenient and slower, and I want to make the tool public (every user would have to download ChromeDriver).
The problem is that the site makes multiple requests, and I don't have any experience processing that traffic and extracting the header data and names. Any help is great, thanks.
[Premise]
Using the requests module you can send requests in this way:
import requests

url = "http://www.example.com"  # request url

headers = {  # headers dict to send in the request
    "header_name": "header_value",
}
params = {  # params to be encoded in the url
    "param_name": "param_value",
}
data = {  # data to send in the request body
    "data_name": "data_value",
}

# Send GET request.
requests.get(url, params=params, headers=headers)
# Send POST request.
requests.post(url, params=params, headers=headers, data=data)
Once you perform a request, you can get a lot of information from the response object:
>>> import requests
# We perform a request and get the response object.
>>> response = requests.get(url, params=params, headers=headers)
>>> response = requests.post(url, params=params, headers=headers, data=data)
>>> response.status_code  # server response status code
200  # eg.
>>> response.request.method
'GET'  # or eventually 'POST'
>>> response.request.headers  # headers you sent with the request
{'Accept-Encoding': 'gzip, deflate, br'}  # eg.
>>> response.request.url  # sent request url
'http://www.example.com'
>>> response.request.body  # body you sent with the request
'name=value&name2=value2'  # eg.
In conclusion, you can retrieve from the response object all the information that you see in the browser's Dev Tools. You need nothing else.
Once you send a GET or POST request, you can match the information shown in Dev Tools as follows:
In General:
Request URL: the url you sent the request to. Corresponds to response.request.url
Request Method: corresponds to response.request.method
Status Code: corresponds to response.status_code
In Response Headers:
You find response headers which correspond to response.headers
eg. Connection: Keep-Alive,
Content-Length: 0,
Content-Type: text/html; charset=UTF-8...
In Requests Headers:
You find request headers which correspond to response.request.headers
In Form Data:
You can find the data you passed with data keyword in requests.post.
Corresponds to response.request.body
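To tie the mapping together, here is a minimal sketch that prints each of those Dev Tools fields from a single response object (using httpbin.org purely as a stand-in test endpoint):

import requests

response = requests.post(
    "https://httpbin.org/post",             # stand-in endpoint
    params={"param_name": "param_value"},
    data={"data_name": "data_value"},
)

print(response.request.url)      # Dev Tools "Request URL"
print(response.request.method)   # Dev Tools "Request Method"
print(response.status_code)      # Dev Tools "Status Code"
print(response.headers)          # Dev Tools "Response Headers"
print(response.request.headers)  # Dev Tools "Request Headers"
print(response.request.body)     # Dev Tools "Form Data"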

Meteor GET request

Bear with me for any mistakes/wrong terminology, since I am new to all this. I am using Meteor to develop my project and I need to make a GET request to an external API (I have already run meteor add http). Below is my code:
HTTP.call('GET', 'url', {}, function(error, response) {
  if (error) {
    console.log(error);
  } else {
    console.log(response);
  }
});
If I use the code inside my Client folder in Meteor, I get the following error: No 'Access-Control-Allow-Origin' header is present on the requested resource. Origin 'http://localhost:3000' is therefore not allowed access. It has something to do with CORS, which I didn't understand how to implement. If I use the code above on my server side, I do get the correct response in the console, but how do I use it as a var in my client-side JavaScript code?
You can use the .call function of HTTP and pass your headers in the options:
HTTP.call(method, url, [options], [asyncCallback])
Arguments
method String
The HTTP method to use, such as "GET", "POST", or "HEAD".
url String
The URL to retrieve.
asyncCallback Function
Optional callback. If passed, the method runs asynchronously, instead of synchronously, and calls asyncCallback. On the client, this callback is required.
Options
content String
String to use as the HTTP request body.
data Object
JSON-able object to stringify and use as the HTTP request body. Overwrites content.
query String
Query string to go in the URL. Overwrites any query string in url.
params Object
Dictionary of request parameters to be encoded and placed in the URL (for GETs) or request body (for POSTs). If content or data is specified, params will always be placed in the URL.
auth String
HTTP basic authentication string of the form "username:password"
headers Object
Dictionary of strings, headers to add to the HTTP request.
timeout Number
Maximum time in milliseconds to wait for the request before failing. There is no timeout by default.
followRedirects Boolean
If true, transparently follow HTTP redirects. Cannot be set to false on the client. Default true.
npmRequestOptions Object
On the server, HTTP.call is implemented by using the npm request module. Any options in this object will be passed directly to the request invocation.
beforeSend Function
On the client, this will be called before the request is sent, to allow for more direct manipulation of the underlying XMLHttpRequest object, which will be passed as the first argument. If the callback returns false, the request will not be sent.
Source: the Meteor HTTP documentation.
Fixed it. On the client side:
Meteor.call("getURL",'url',{},function(err,res){
if(err){
console.log('Error: '+err);
}
if(!err){
console.log('Response: '+res);
}
and on the server:
Meteor.methods({
  'getURL': function(url_l) {
    console.log("Request: " + url_l);
    return HTTP.get(url_l);
  }
});
