How can I convert the requests code to Scrapy? - python-requests

import requests

def get_all_patent():
    patent_list = []
    for i in range(100):
        res = requests.get(url).text  # url is defined elsewhere in the asker's code
        patent_list.append(res)
    return patent_list
Because Scrapy can't return the response from a Request directly (reference: How can I get the response from the Request in Scrapy?), I want to extend the patent_list variable, but I can't get the response body.
Can I do this through the Request meta, or by doing something with the Response?
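A minimal sketch of the Scrapy equivalent, assuming the spider name and URL below are placeholders for the asker's real values: instead of building a list in a function, each response body is collected in the parse callback.
import scrapy

class PatentSpider(scrapy.Spider):
    name = 'patent'  # placeholder spider name
    url = 'http://example.com/patent'  # placeholder for the real patent URL

    def start_requests(self):
        for i in range(100):
            # dont_filter=True because Scrapy deduplicates identical URLs by default
            yield scrapy.Request(self.url, callback=self.parse, dont_filter=True)

    def parse(self, response):
        # response.text is the response body, equivalent to requests.get(url).text
        yield {'patent': response.text}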

Related

How to get new token headers during runtime of scrapy spider

I am running a Scrapy spider that starts by getting an authorization token from the website I am scraping, using the basic requests library. The function for this is called get_security_token(). This token is passed as a header to the Scrapy request. The issue is that the token expires after 300 seconds, and then I get a 401 error. Is there any way for the spider to see the 401 error, run the get_security_token() function again, and then pass the new token on to all future request headers?
import scrapy

class PlayerSpider(scrapy.Spider):
    name = 'player'

    def start_requests(self):
        urls = ['URL GOES HERE']
        header_data = {'Authorization': 'Bearer 72bb65d7-2ff1-3686-837c-61613454928d'}
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse, headers=header_data)

    def parse(self, response):
        yield response.json()
If it's pure Scrapy you can add handle_httpstatus_list = [401] after start_urls,
and then in your parse method you need to do something like this:
if response.status == 401:
    get_security_token()
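A hedged sketch of how that could look in the spider above, assuming get_security_token() is the asker's own helper defined elsewhere: let 401 responses reach parse(), refresh the token there, and re-issue the failed request with the new header.
import scrapy

class PlayerSpider(scrapy.Spider):
    name = 'player'
    handle_httpstatus_list = [401]  # let 401 responses reach parse() instead of being dropped

    def start_requests(self):
        self.token = get_security_token()  # asker's helper, defined elsewhere
        yield scrapy.Request('URL GOES HERE', callback=self.parse,
                             headers={'Authorization': f'Bearer {self.token}'})

    def parse(self, response):
        if response.status == 401:
            # token expired: fetch a new one and retry the same request
            self.token = get_security_token()
            yield response.request.replace(
                headers={'Authorization': f'Bearer {self.token}'},
                dont_filter=True)
            return
        yield response.json()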

How to find header data and name? (Python-requests)

I want to use requests to scrape a site that requires a login. I have already done this with Selenium, but it is inconvenient and slower, and I want to make the script public (every user would have to download a Chrome driver).
The problem is that the site makes multiple requests, and I don't have any experience processing that data and extracting the header data and names. Any help is appreciated, thanks.
[Premise]
Using the requests module you can send requests this way:
import requests
url = "http://www.example.com" # request url
headers = { # headers dict to send in request
"header_name": "headers_value",
}
params = { # params to be encoded in the url
"param_name": "param_value",
}
data = { # data to send in the request body
"data_name": "data_value",
}
# Send GET request.
requests.get(url, params=params, headers=headers)
# Send POST request.
requests.post(url, params=params, headers=headers, data=data)
Once you perform a request, you can get a lot of information from the response object:
>>> import requests
>>> # We perform a request and get the response object.
>>> response = requests.get(url, params=params, headers=headers)
>>> response = requests.post(url, params=params, headers=headers, data=data)
>>> response.status_code  # server response status code
200  # e.g.
>>> response.request.method
'GET'  # or eventually 'POST'
>>> response.request.headers  # headers you sent with the request
{'Accept-Encoding': 'gzip, deflate, br'}  # e.g.
>>> response.request.url  # sent request url
'http://www.example.com'
>>> response.request.body  # the request body you sent
'name=value&name2=value2'  # e.g.
In conclusion, you can retrieve all the information that you can find in Dev Tools in the browser, from the response object. You need nothing else.
Once you send a GET or POST request, you can match up the information shown in Dev Tools:
In General:
Request URL: the url you sent the request to. Corresponds to response.request.url
Request Method: corresponds to response.request.method
Status Code: corresponds to response.status_code
In Response Headers:
You find response headers which correspond to response.headers
eg. Connection: Keep-Alive,
Content-Length: 0,
Content-Type: text/html; charset=UTF-8...
In Request Headers:
You find request headers which correspond to response.request.headers
In Form Data:
You can find the data you passed with data keyword in requests.post.
Corresponds to response.request.body
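For a login form specifically, you would copy the Request URL, the request headers and the Form Data fields from Dev Tools into a requests.Session, so that cookies set by the login are reused on later requests. A minimal sketch, where the URL, header and field names are made-up placeholders; use the ones you actually see in Dev Tools:
import requests

session = requests.Session()  # keeps cookies between requests

login_url = 'http://www.example.com/login'   # Request URL from Dev Tools
headers = {'User-Agent': 'Mozilla/5.0'}      # request headers from Dev Tools
data = {                                     # Form Data fields from Dev Tools
    'username': 'my_user',
    'password': 'my_password',
}

response = session.post(login_url, headers=headers, data=data)
response.raise_for_status()

# After logging in, the same session can fetch pages that require authentication.
profile = session.get('http://www.example.com/profile', headers=headers)
print(profile.status_code)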

How to get only the success (200 OK) response from an HTTP request using a Groovy script

Here I'm facing a problem with a Groovy script. I want to extract the response data from the HTTP request: only when the response data contains 200 as the value should it extract the value_description and print it.
So here is the response I am getting:
{"value":"200","value_description":"pass"}
and the code is
def response = new groovy.json.JsonSlurper().parse("200".equals(prev.getResponseData()))
Is there any way that only if the value is 200, the value description gets printed? Using a Groovy script, please tell me with simple code.
Not sure if this is what you mean, it's really hard to tell, but I think you mean:
import groovy.json.JsonSlurper

// parseText() expects a String, so read the response data as a String
def response = new JsonSlurper().parseText(prev.getResponseDataAsString())
if (response.value == '200') {
    println response.value_description
}

Python - requests fail silently

import requests
is working properly for all my requests, like so:
url = 'http://www.stackoverflow.com'
response = requests.get(url)
but the following url does not return any results:
url = 'http://www.billboard.com'
response = requests.get(url)
It stalls and fails silently, returning nothing.
How do I force requests to throw an exception,
so I can know whether I'm being blacklisted or something else is wrong?
Requests won't raise an exception for a bad HTTP response, but you could use raise_for_status to raise an HTTPError exception manually, for example:
response = requests.get(url)
response.raise_for_status()
Another option is status_code, which holds the HTTP code.
response = requests.get(url)
if response.status_code != 200:
    print('HTTP', response.status_code)
else:
    print(response.text)
If a site returns HTTP 200 for bad requests, but has an error message in the response body or has no body, you'll have to check the response content.
error_message = 'Nothing found'
response = requests.get(url)
if error_message in response.text or not response.text:
    print('Bad response')
else:
    print(response.text)
If a site takes too long to respond you could set a maximum timeout for the request. If the site won't respond in that time a ReadTimeout exception will be raised.
try:
    response = requests.get(url, timeout=5)
except requests.exceptions.ReadTimeout:
    print('Request timed out')
else:
    print(response.text)
With the following (routing the request through Tor using requesocks):
import requesocks
#Initialize a new wrapped requests object
session = requesocks.session()
#Use Tor for both HTTP and HTTPS
session.proxies = {'http': 'socks5://localhost:9050',
'https': 'socks5://localhost:9050'}
#fetch a page that shows your IP address
response = session.get('https://www.billboard.com')
print(response.text)
I was able to get:
raise ConnectionError(e)
requesocks.exceptions.ConnectionError: HTTPSConnectionPool(host='www.billboard.com', port=None): Max retries exceeded with url: https://www.billboard.com/

Scrapy: sending HTTP requests and parsing the response

I have looked at the Scrapy docs, but can Scrapy send an HTTP form (e.g. user name, password, ...) and parse the result of sending this form?
There's an example on the same page: http://scrapy.readthedocs.org/en/latest/topics/request-response.html#passing-additional-data-to-callback-functions
def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = Request("http://www.example.com/some_page.html",
                      callback=self.parse_page2)
    request.meta['item'] = item
    return request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    return item
You just have to pass a callback function to the request and then parse the result in parse_page2 ;)
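For submitting a form (e.g. a user name and password), Scrapy also provides FormRequest. A minimal sketch, assuming a hypothetical login page and made-up field names:
import scrapy

class LoginSpider(scrapy.Spider):
    name = 'login'  # placeholder name
    start_urls = ['http://www.example.com/login']  # placeholder login page

    def parse(self, response):
        # Fill the form found on the login page and submit it.
        yield scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'my_user', 'password': 'my_password'},
            callback=self.after_login,
        )

    def after_login(self, response):
        # Parse the result of sending the form here.
        self.logger.info('Logged in, landed on %s', response.url)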
