I am running a Scrapy spider that starts by getting an authorization token from the website I am scraping, using the plain requests library. The function for this is called get_security_token(). This token is passed as a header to the Scrapy request. The issue is that the token expires after 300 seconds, and then I get a 401 error. Is there any way for a spider to see the 401 error, run get_security_token() again, and then pass the new token on to all future request headers?
import scrapy

class PlayerSpider(scrapy.Spider):
    name = 'player'

    def start_requests(self):
        urls = ['URL GOES HERE']
        header_data = {'Authorization': 'Bearer 72bb65d7-2ff1-3686-837c-61613454928d'}
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse, headers=header_data)

    def parse(self, response):
        yield response.json()
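If it's pure Scrapy, you can add handle_httpstatus_list = [401] as a class attribute (for example, after start_urls), so that 401 responses reach your callback instead of being filtered out.
Then, in your parse method, you need to do something like this:
if response.status == 401:
    get_security_token()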
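Since the token is only valid for 300 seconds, it may help to keep it in one place and refetch it when it is too old or has been rejected. The following is a minimal sketch of that idea, not code from the question: TokenCache, fetch_token, max_age and clock are hypothetical names, and fetch_token stands in for the asker's get_security_token().

```python
import time


class TokenCache:
    """Caches a bearer token and refetches it once it is max_age seconds old.

    Hypothetical helper: fetch_token is any zero-argument callable that
    returns a fresh token string (e.g. the asker's get_security_token).
    """

    def __init__(self, fetch_token, max_age=300, clock=time.monotonic):
        self._fetch_token = fetch_token
        self._max_age = max_age
        self._clock = clock
        self._token = None
        self._fetched_at = None

    def invalidate(self):
        # Forget the current token, e.g. after the server answered 401.
        self._token = None

    def get(self):
        now = self._clock()
        expired = self._token is None or now - self._fetched_at >= self._max_age
        if expired:
            self._token = self._fetch_token()
            self._fetched_at = now
        return self._token

    def headers(self):
        # Header dict in the same shape the question passes to scrapy.Request.
        return {'Authorization': 'Bearer ' + self.get()}
```

In a spider, one could call cache.invalidate() when response.status == 401 and then re-issue the request with response.request.replace(headers=cache.headers(), dont_filter=True). That wiring is an assumption about how the pieces fit together and has not been tested against the site in question.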
def get_all_patent():
    patent_list = []
    for i in range(100):
        res = requests.get(url).text
        patent_list.append(res)
    return patent_list
Scrapy does not hand back the response directly from a Request (reference: How can I get the response from the Request in Scrapy?).
I want to extend the variable patent_list, but I can't get the response body.
Can I do this through the Request meta, or by doing something in the response callback?
I want to handle 404 errors in Scrapy, but not all 404 error cases. How can I raise a 404 error when I don't want to handle it?
It turns out I can handle a 404 response from a specific request using errback.
import scrapy
from scrapy.spidermiddlewares.httperror import HttpError

class SampleSpider(scrapy.Spider):
    name = 'sample'
    allowed_domains = ['example.com']
    start_urls = ["https://example.com"]

    def parse(self, response):
        # Only this request gets the 404-aware errback. The original check against
        # self.handle_httpstatus_list is dropped because the spider never defines it.
        return scrapy.Request(url="https://example.com/404url/", callback=self.parse_page, errback=self.after_404)

    def parse_page(self, response):
        # parse the page and extract items for success result
        pass

    def after_404(self, failure):
        if failure.check(HttpError) and failure.value.response.status == 404:
            print("We got 404!")
            # handle the page for 404 status
        else:
            # Log others as error
            self.logger.error(repr(failure))
This way, other requests for which I don't want to handle the 404 status still report the error as usual.
I based it on https://docs.scrapy.org/en/latest/topics/request-response.html#topics-request-response-ref-errbacks
Relatively new to Splash. I'm trying to scrape a website which needs a login. I started off with the Splash API, with which I was able to log in perfectly. However, when I put my code in a Scrapy spider script using SplashRequest, it's not able to log in.
import scrapy
from scrapy_splash import SplashRequest

class Payer1Spider(scrapy.Spider):
    name = "payer1"
    start_url = "https://provider.wellcare.com/provider/claims/search"
    lua_script = """
    function main(splash,args)
        assert(splash:go(args.url))
        splash:wait(0.5)
        local search_input = splash:select('#Username')
        search_input:send_text('')
        local search_input = splash:select('#Password')
        search_input:send_text('')
        assert(splash:wait(0.5))
        local login_button = splash:select('#btnSubmit')
        login_button:mouse_click()
        assert(splash:wait(7))
        return{splash:html()}
    end
    """

    def start_requests(self):
        yield SplashRequest(self.start_url, self.parse_result, args={'lua_source': self.lua_script},)

    def parse_result(self, response):
        yield {'doc_title': response.text}
The output HTML is the login page and not the one after logging in.
You have to add endpoint='execute' to your SplashRequest to execute the lua-script:
yield SplashRequest(self.start_url, self.parse_result, args={'lua_source': self.lua_script}, endpoint='execute')
I believe you don't actually need Splash to log in to the site. You can try the following:
GET https://provider.wellcare.com, and then:
# Get request verification token..
token = response.css('input[name=__RequestVerificationToken]::attr(value)').get()
# Forge post request payload...
data = [
    ('__RequestVerificationToken', token),
    ('Username', 'user'),
    ('Password', 'pass'),
    ('ReturnUrl', '/provider/claims/search'),
]
# Make dict from list of tuples
formdata = dict(data)
# And then execute request
scrapy.FormRequest(
    url='https://provider.wellcare.com/api/sitecore/Login',
    formdata=formdata
)
Not completely sure if all of this will work. But you can try.
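To make the token step above concrete without Scrapy or a live site, here is a small standard-library sketch that pulls the hidden __RequestVerificationToken value out of an HTML string and builds the same payload. HiddenInputParser and build_login_payload are hypothetical names, and the field names are simply the ones used in the answer; whether the real login endpoint accepts this payload is not verified.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collects the value attribute of every named <input> element."""

    def __init__(self):
        super().__init__()
        self.values = {}

    def handle_starttag(self, tag, attrs):
        if tag == 'input':
            attributes = dict(attrs)
            if 'name' in attributes:
                # Inputs without a value attribute are stored as empty strings.
                self.values[attributes['name']] = attributes.get('value') or ''


def build_login_payload(html, username, password, return_url):
    """Build the form payload from the answer, reading the token from html."""
    parser = HiddenInputParser()
    parser.feed(html)
    return {
        '__RequestVerificationToken': parser.values['__RequestVerificationToken'],
        'Username': username,
        'Password': password,
        'ReturnUrl': return_url,
    }
```

The resulting dict has the same shape as formdata in the answer, so it could be passed to scrapy.FormRequest in the same way.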
import requests
is working properly for all my requests, like so:
url = 'http://www.stackoverflow.com'
response = requests.get(url)
but the following URL does not return any results:
url = 'http://www.billboard.com'
response = requests.get(url)
It stalls and fails silently, returning nothing.
How do I force requests to throw an exception,
so I can tell whether I'm being blacklisted or something else is wrong?
Requests won't raise an exception for a bad HTTP response, but you can call raise_for_status() to raise an HTTPError exception manually. Example:
response = requests.get(url)
response.raise_for_status()
Another option is status_code, which holds the HTTP code.
response = requests.get(url)
if response.status_code != 200:
    print('HTTP', response.status_code)
else:
    print(response.text)
If a site returns HTTP 200 for bad requests, but has an error message in the response body or has no body, you'll have to check the response content.
error_message = 'Nothing found'
response = requests.get(url)
if error_message in response.text or not response.text:
    print('Bad response')
else:
    print(response.text)
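The status check and the body check can be folded into one helper that says what looks wrong. This is only a sketch: describe_problem and its error_marker parameter are hypothetical names, and the 'Nothing found' marker is the placeholder from the example above, not something a particular site is known to return.

```python
def describe_problem(status_code, text, error_marker='Nothing found'):
    """Return a short description of what is wrong with a response,
    or None if it looks fine."""
    if status_code != 200:
        return 'HTTP %d' % status_code
    if not text:
        return 'empty body'
    if error_marker in text:
        return 'error message in body'
    return None
```

With requests it would be called as describe_problem(response.status_code, response.text), treating any non-None result as a bad response.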
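If a site takes too long to respond, you can set a maximum timeout for the request. If the site doesn't respond in that time, a Timeout exception is raised (ConnectTimeout if the connection could not be established in time, ReadTimeout if the server stopped sending data).
try:
    response = requests.get(url, timeout=5)
except requests.exceptions.Timeout:
    # Timeout is the common base class of ConnectTimeout and ReadTimeout
    print('Request timed out')
else:
    print(response.text)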
with:
import requesocks
# Initialize a new wrapped requests object
session = requesocks.session()
# Use Tor for both HTTP and HTTPS
session.proxies = {'http': 'socks5://localhost:9050',
                   'https': 'socks5://localhost:9050'}
# Fetch the page through the proxy
response = session.get('https://www.billboard.com')
print(response.text)
I was able to get:
raise ConnectionError(e)
requesocks.exceptions.ConnectionError: HTTPSConnectionPool(host='www.billboard.com', port=None): Max retries exceeded with url: https://www.billboard.com/
I have looked at the Scrapy docs, but can Scrapy send an HTTP form (e.g. user name, password, ...) and parse the result of sending this form?
There's an example on the same page: http://scrapy.readthedocs.org/en/latest/topics/request-response.html#passing-additional-data-to-callback-functions
def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = Request("http://www.example.com/some_page.html",
                      callback=self.parse_page2)
    request.meta['item'] = item
    return request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    return item
You just have to pass a callback function to the request and then parse the result in parse_page2 ;)