Scrapy: sending HTTP requests and parsing the response

I have looked at the Scrapy docs, but can Scrapy send an HTTP form (e.g. user name, password, ...) and parse the result of submitting that form?

There's an example on that same page: http://scrapy.readthedocs.org/en/latest/topics/request-response.html#passing-additional-data-to-callback-functions
def parse_page1(self, response):
    item = MyItem()
    item['main_url'] = response.url
    request = Request("http://www.example.com/some_page.html",
                      callback=self.parse_page2)
    request.meta['item'] = item
    return request

def parse_page2(self, response):
    item = response.meta['item']
    item['other_url'] = response.url
    return item
You just have to pass a callback function to the request and then parse the result in parse_page2 ;)
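As for the form-submission part of the question (user name, password, etc.), Scrapy provides FormRequest for exactly that; it isn't shown in the answer above, so here is a minimal sketch, assuming a hypothetical login page and placeholder field names:
import scrapy

class LoginSpider(scrapy.Spider):
    name = 'login'
    start_urls = ['http://www.example.com/login']  # hypothetical login page

    def parse(self, response):
        # Fill in and submit the login form found on the page
        return scrapy.FormRequest.from_response(
            response,
            formdata={'username': 'john', 'password': 'secret'},  # placeholder credentials
            callback=self.after_login,
        )

    def after_login(self, response):
        # Parse the page the site returns after the form is submitted
        self.logger.info('Logged in, landed on %s', response.url)
The callback pattern is the same as in the docs example above: after_login receives the response to the form submission.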

Related

How to send RedirectResponse from a POST to a GET route in FastAPI?

I want to send data from app.post() to app.get() using RedirectResponse.
@app.get('/', response_class=HTMLResponse, name='homepage')
async def get_main_data(request: Request,
                        msg: Optional[str] = None,
                        result: Optional[str] = None):
    if msg:
        response = templates.TemplateResponse('home.html', {'request': request, 'msg': msg})
    elif result:
        response = templates.TemplateResponse('home.html', {'request': request, 'result': result})
    else:
        response = templates.TemplateResponse('home.html', {'request': request})
    return response
@app.post('/', response_model=FormData, name='homepage_post')
async def post_main_data(request: Request,
                         file: FormData = Depends(FormData.as_form)):
    if condition:
        ......
        ......
        return RedirectResponse(request.url_for('homepage', **{'result': str(trans)}), status_code=status.HTTP_302_FOUND)
    return RedirectResponse(request.url_for('homepage', **{'msg': str(err)}), status_code=status.HTTP_302_FOUND)
How do I send result or msg via RedirectResponse and url_for() to app.get()?
Is there a way to hide the data in the URL, either as a path parameter or a query parameter? How do I achieve this?
I am getting the error starlette.routing.NoMatchFound: No route exists for name "homepage" and params "result" when trying this way.
Update:
I tried the below:
return RedirectResponse(app.url_path_for(name='homepage')
                        + '?result=' + str(trans),
                        status_code=status.HTTP_303_SEE_OTHER)
The above works, but it sends the param as a query parameter, i.e., the URL looks like localhost:8000/?result=hello. Is there any way to do the same thing without showing it in the URL?
For redirecting from a POST to a GET method, please have a look at this and this answer on how to do that and the reason for using status_code=status.HTTP_303_SEE_OTHER (an example is given below).
As for the reason for getting starlette.routing.NoMatchFound error, this is because request.url_for() receives path parameters, not query parameters. Your msg and result parameters are query ones; hence, the error.
A solution would be to use a CustomURLProcessor, as suggested in this and this answer, allowing you to pass both path (if need to) and query parameters to the url_for() function and obtain the URL. As for hiding the path and/or query parameters from the URL, you can use a similar approach to this answer that uses history.pushState() (or history.replaceState()) to replace the URL in the browser's address bar.
A complete working example can be found below (you can use your own TemplateResponse in place of HTMLResponse).
from fastapi import FastAPI, Request, status
from fastapi.responses import RedirectResponse, HTMLResponse
from typing import Optional
import urllib.parse

app = FastAPI()

class CustomURLProcessor:
    def __init__(self):
        self.path = ""
        self.request = None

    def url_for(self, request: Request, name: str, **params: str):
        self.path = request.url_for(name, **params)
        self.request = request
        return self

    def include_query_params(self, **params: str):
        parsed = list(urllib.parse.urlparse(self.path))
        parsed[4] = urllib.parse.urlencode(params)
        return urllib.parse.urlunparse(parsed)
@app.get('/', response_class=HTMLResponse)
def event_msg(request: Request, msg: Optional[str] = None):
    if msg:
        html_content = """
        <html>
            <head>
                <script>
                    window.history.pushState('', '', "/");
                </script>
            </head>
            <body>
                <h1>""" + msg + """</h1>
            </body>
        </html>
        """
        return HTMLResponse(content=html_content, status_code=200)
    else:
        html_content = """
        <html>
            <body>
                <h1>Create an event</h1>
                <form method="POST" action="/">
                    <input type="submit" value="Create Event">
                </form>
            </body>
        </html>
        """
        return HTMLResponse(content=html_content, status_code=200)
@app.post('/')
def event_create(request: Request):
    redirect_url = CustomURLProcessor().url_for(request, 'event_msg').include_query_params(msg="Successfully created!")
    return RedirectResponse(redirect_url, status_code=status.HTTP_303_SEE_OTHER)
Update
Regarding adding query params to url_for(), another solution would be to use Starlette's starlette.datastructures.URL, which now provides an include_query_params() method. Example:
from starlette.datastructures import URL

redirect_url = URL(request.url_for('event_msg')).include_query_params(msg="Successfully created!")
return RedirectResponse(redirect_url, status_code=status.HTTP_303_SEE_OTHER)

How do I set up automatic rotation of the GitHub token during parsing?

GitHub allows you to send no more than 2500 requests per hour. If I have several accounts/tokens, how do I set up an automatic token change in Scrapy when a certain number of requests is reached (for example 2500), or have the token change on a 403 response?
class GithubSpider(scrapy.Spider):
    name = 'github.com'
    start_urls = ['https://github.com']
    tokens = ['token1', 'token2', 'token3', 'token4']
    headers = {
        'Accept': 'application/vnd.github.v3+json',
        'Authorization': 'token ' + tokens[1],
    }

    def start_requests(self, **cb_kwargs):
        for lang in languages:
            cb_kwargs['lang'] = lang
            url = f'https://api.github.com/search/users?q=language:{lang}%20location:{country}&per_page=100'
            yield Request(url=url, headers=self.headers, callback=self.parse, cb_kwargs=cb_kwargs)
You could use the cycle function from the itertools module to create a generator from your list of tokens and draw the next token for each request you send, so that you use all the tokens equally and reduce the chance of reaching the limit for any one of them.
If you start receiving 403 responses, then you will know that all the tokens have reached their limit. See the sample code below.
from itertools import cycle

class GithubSpider(scrapy.Spider):
    name = 'github.com'
    start_urls = ['https://github.com']
    tokens = cycle(['token1', 'token2', 'token3', 'token4'])

    def start_requests(self, **cb_kwargs):
        for lang in languages:
            headers = {
                'Accept': 'application/vnd.github.v3+json',
                'Authorization': 'token ' + next(self.tokens)
            }
            cb_kwargs['lang'] = lang
            url = f'https://api.github.com/search/users?q=language:{lang}%20location:{country}&per_page=100'
            yield Request(url=url, headers=headers, callback=self.parse, cb_kwargs=cb_kwargs)
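To also switch tokens when a 403 comes back, as the question asks, one option (a sketch of my own, not part of the answer above) is to let 403 responses reach the callback and re-issue the request with the next token from the cycle:
from itertools import cycle

import scrapy
from scrapy import Request

class GithubSpider(scrapy.Spider):
    name = 'github.com'
    tokens = cycle(['token1', 'token2', 'token3', 'token4'])
    handle_httpstatus_list = [403]  # let 403 responses through to the callback

    def parse(self, response, **cb_kwargs):
        if response.status == 403:
            # The current token is rate-limited: retry the same URL with the next token
            headers = {
                'Accept': 'application/vnd.github.v3+json',
                'Authorization': 'token ' + next(self.tokens),
            }
            yield response.request.replace(headers=headers, dont_filter=True)
            return
        # ... normal parsing of a successful response goes here ...
Note that if every token is exhausted this will keep cycling, so in practice you would want a retry cap per URL.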

How can I convert the requests code to Scrapy?

def get_all_patent():
    patent_list = []
    for i in range(100):
        res = requests.get(url).text
        patent_list.append(res)
    return patent_list
Because Scrapy can't get the response from a Request directly (reference: How can I get the response from the Request in Scrapy?), I want to extend the patent_list variable, but I can't get the response body.
Can I do it through the Request meta, or do something in the Response?
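One way to express that loop in Scrapy (a sketch; url is the same undefined placeholder the question uses) is to yield one Request per iteration and emit each response body from the callback, instead of appending to a shared list:
import scrapy

class PatentSpider(scrapy.Spider):
    name = 'patents'

    def start_requests(self):
        for i in range(100):
            # dont_filter=True because the same placeholder URL repeats,
            # and Scrapy's dupefilter would otherwise drop the duplicates
            yield scrapy.Request(url, callback=self.parse, dont_filter=True)

    def parse(self, response):
        # Each response body becomes one scraped item rather than a list entry
        yield {'body': response.text}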

How to get new token headers at runtime in a Scrapy spider

I am running a Scrapy spider that starts by getting an authorization token from the website I am scraping from, using the basic requests library. The function for this is called get_security_token(). This token is passed as a header to the Scrapy request. The issue is that the token expires after 300 seconds, and then I get a 401 error. Is there any way for a spider to see the 401 error, run the get_security_token() function again, and then pass the new token on to all future request headers?
import scrapy

class PlayerSpider(scrapy.Spider):
    name = 'player'

    def start_requests(self):
        urls = ['URL GOES HERE']
        header_data = {'Authorization': 'Bearer 72bb65d7-2ff1-3686-837c-61613454928d'}
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse, headers=header_data)

    def parse(self, response):
        yield response.json()
If it's pure Scrapy, you can add handle_httpstatus_list = [401] after start_urls,
and then in your parse method you need to do something like this:
if response.status == 401:
    get_security_token()
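Fleshing that out, a sketch of the whole flow might look like the following; get_security_token() is the question's own function, assumed here to return the new token string, and re-issuing the failed request with fresh headers is my addition rather than part of the answer above:
import scrapy

class PlayerSpider(scrapy.Spider):
    name = 'player'
    handle_httpstatus_list = [401]  # let 401 responses reach the callback

    def start_requests(self):
        # get_security_token() comes from the question; assumed to return a token string
        self.token = get_security_token()
        yield scrapy.Request(url='URL GOES HERE', callback=self.parse,
                             headers={'Authorization': f'Bearer {self.token}'})

    def parse(self, response):
        if response.status == 401:
            # The token expired: fetch a new one and retry the same request
            self.token = get_security_token()
            yield response.request.replace(
                headers={'Authorization': f'Bearer {self.token}'},
                dont_filter=True,
            )
            return
        yield response.json()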

Using HTTPBuilder to execute an HTTP DELETE request

I'm trying to use the Groovy HTTPBuilder library to delete some data from Firebase via an HTTP DELETE request. If I use curl, the following works:
curl -X DELETE https://my.firebase.io/users/bob.json?auth=my-secret
Using the RESTClient class from HTTPBuilder works if I use it like this:
def client = new RESTClient('https://my.firebase.io/users/bob.json?auth=my-secret')
def response = client.delete(requestContentType: ContentType.ANY)
However, when I tried breaking the URL down into its constituent parts, it didn't work:
def client = new RESTClient('https://my.firebase.io')
def response = client.delete(
    requestContentType: ContentType.ANY,
    path: '/users/bob.json',
    query: [auth: 'my-secret']
)
I also tried using the HTTPBuilder class instead of RESTClient:
def http = new HTTPBuilder('https://my.firebase.io')

// perform a DELETE request, accepting any content type
http.request(Method.DELETE, ContentType.ANY) {
    uri.path = '/users/bob.json'
    uri.query = [auth: 'my-secret']

    // response handler for a success response code
    response.success = { resp, reader ->
        println "response status: ${resp.statusLine}"
    }
}
But this also didn't work. Surely there's a more elegant approach than stuffing everything into a single string?
There's an example of using HttpURLClient in the tests to do a delete, which in its simplest form looks like:
def http = new HttpURLClient(url:'https://some/path/')
resp = http.request(method:DELETE, contentType:JSON, path: "destroy/somewhere.json")
def json = resp.data
assert json.id != null
assert resp.statusLine.statusCode == 200
Your example is very close to the test for the delete in HTTPBuilder.
A few differences I see are:
Your path is absolute, not relative.
Your HTTP URL path doesn't end with a trailing slash.
You're using content type ANY where the test uses JSON. Does the target need the content type to be correct? (Probably not, as you're not setting it in the curl example, unless curl is doing some voodoo on your behalf.)
Alternatively, you could use Apache's HttpDelete, but it requires more boilerplate. For an HTTP connection, this is some code I've got that works. You'll have to adapt it for HTTPS, though.
def createClient() {
    HttpParams params = new BasicHttpParams()
    HttpProtocolParams.setVersion(params, HttpVersion.HTTP_1_1)
    HttpProtocolParams.setContentCharset(params, "UTF-8")
    params.setBooleanParameter(ClientPNames.HANDLE_REDIRECTS, true)

    SchemeRegistry registry = new SchemeRegistry()
    registry.register(new Scheme("http", PlainSocketFactory.getSocketFactory(), 80))
    ClientConnectionManager ccm = new PoolingClientConnectionManager(registry)

    HttpConnectionParams.setConnectionTimeout(params, 8000)
    HttpConnectionParams.setSoTimeout(params, 5400000)

    HttpClient client = new DefaultHttpClient(ccm, params)
    return client
}

HttpClient client = createClient()
def url = new URL("http", host, Integer.parseInt(port), "/dyn/admin/nucleus$component/")
HttpDelete delete = new HttpDelete(url.toURI())

// if you have any basic auth, you can plug it in here
def auth = "USER:PASS"
delete.setHeader("Authorization", "Basic ${auth.getBytes().encodeBase64().toString()}")

// convert a data map to NVPs
def data = [:]
List<NameValuePair> nvps = new ArrayList<NameValuePair>(data.size())
data.each { name, value ->
    nvps.add(new BasicNameValuePair(name, value))
}
delete.setEntity(new UrlEncodedFormEntity(nvps))

HttpResponse response = client.execute(delete)
def status = response.statusLine.statusCode
def content = response.entity.content
I adapted the code above from a POST version, but the principle is the same.
