How to not retry a Scrapy request? - web-scraping

Sometimes I don't want to retry on an error in a Scrapy request since I am only checking if the page exists. How to not retry a request?

It turns out Request supports a dont_retry key in its meta dict:
Request('myurl', callback=self.method_to_parse, meta={'dont_retry': True})

Related

AirFlow sends exception wrapped with HTTP 200 OK

When we trigger a DAG over REST, Airflow tries to execute it, but if there is an exception it sends back the stack trace in the response body with HTTP 200 OK.
How can we change this so that an execution error returns a 400 error instead?
This is the intended behavior, assuming that you're using the 'get DAG run' endpoint. You should reference the state field that is returned.
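A rough sketch of what "reference the state field" could look like on the client side. The function name and the sample payloads are hypothetical, and the state values "success" and "failed" are an assumption based on Airflow's stable REST API; check them against your Airflow version.

```python
import json


def dag_run_outcome(response_body: str) -> str:
    """Classify a 'get DAG run' response by its state field, not its HTTP status."""
    payload = json.loads(response_body)
    state = payload.get("state")
    if state == "success":
        return "succeeded"
    if state == "failed":
        return "failed"
    # Anything else (for example queued or running) is treated as not finished.
    return "pending"
```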

get https response from scrapy shell

I have a spider that is getting cookies from a site in the first few steps. I would like to get the cookies, start the scrape, and if the HTTP status of the current request == 302, I want to loop back to the cookies part to refresh them. How can I log the HTTP status as a variable in scrapy shell, to add in an "if http_status ==302, break and go back to step 1"? Thank you!
I'm an idiot. If anyone comes across this, all you have to do is set your variable (in my case http_response) to response.status. So http_response = response.status gives 200 (an integer) or whatever the status of the current request is. lol solved.

How timeout option in xdmp:http-post works?

In XQuery code when I make an xdmp:http-post call, I can configure a timeout value in the request options. Say I configure it as 5 seconds, it gives a timeout exception back.
My question is: will MarkLogic try to complete the calling XQuery module or cancel it? Often the query has to be cancelled manually from the Admin Console.
will MarkLogic try to complete the calling XQuery module or cancel it?
The module that you happen to be invoking from the xdmp:http-post() does not know that the client has timed out and stopped waiting for a response to be sent. It will continue processing the request and work to generate a response.
If you would like for it to have a shorter timeout closer to the timeout value of the module invoking xdmp:http-post(), then you could add xdmp:set-request-time-limit() to set an explicit (shorter) timeout for this request.
xdmp:set-request-time-limit(6),
for $i in (1 to 1000)
return ( xdmp:log("I'm feeling sleepy..."||$i), xdmp:sleep(1000) )
You could even accept a timeout value as a request parameter to the HTTP POST, so that the client could dynamically set the timeout per request.
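The general behaviour described above (the client gives up, the server keeps working) is not specific to MarkLogic. Here is a Python analogy using only the standard library; it is not MarkLogic code, and the handler, the sleep duration, and the timeout value are all made up for the demonstration.

```python
import socket
import threading
import time
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

finished = threading.Event()  # set once the server-side work completes


class SlowHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        time.sleep(1.0)  # simulate a long-running module
        finished.set()   # the work completes even if the client is gone
        try:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"done")
        except OSError:
            pass  # the client already hung up

    def log_message(self, *args):
        pass  # keep the demo quiet


def run_demo():
    server = ThreadingHTTPServer(("127.0.0.1", 0), SlowHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    url = f"http://127.0.0.1:{server.server_address[1]}/"
    # Ignore any proxy settings from the environment for this local call.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    request = urllib.request.Request(url, data=b"x", method="POST")
    client_timed_out = False
    try:
        opener.open(request, timeout=0.2)  # client waits far less than 1 second
    except (socket.timeout, urllib.error.URLError):
        client_timed_out = True
    # The server was never told about the timeout, so it still finishes.
    server_finished = finished.wait(timeout=3)
    server.shutdown()
    server.server_close()
    return client_timed_out, server_finished
```

Calling run_demo() should show the client timing out while the handler still runs to completion, which is why an explicit server-side limit like xdmp:set-request-time-limit() is needed if you want the work to stop.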

How can I execute concurrent HTTP requests in Paw?

In Paw if I have a long running request and attempt to invoke another I'm presented with a dialog that says Stop request? A request is currently running. with the options Cancel, Stop and Stop & Send. How can I keep the first request running and invoke another without canceling the first request?

HTTP status code for "success with errors"?

I've poked around a bit, but I don't see an HTTP status code for when a request succeeds, but there is an error after the "point of no return".
E.g., say you process a request and it's committed to the database, but while returning the result you run out of memory, or encounter an NPE, or what have you. It would have been a 200 response, but now, internally, you aren't able to return the proper, well-formed response.
202 Accepted doesn't seem to fit since we've already processed the request.
What status code means "Success, but errors"? Does one even exist?
HTTP doesn't have such a status code, but there is a best practice that allows you to handle such situations - redirect the user after a POST operation.
Here is a breakdown:
A POST request tries to modify data on the server
If the server fails, it sends a 500 error to indicate failure
If the server succeeds, it sends a 302 redirect response
The browser then sends a fresh GET request to the server
If this fails, you get a 500 error, otherwise you get a 200
So, your use case of 'Saved data but can't retrieve it immediately' translates to a 302 redirect for the initial POST, followed by a 500 for the subsequent GET.
This approach has other advantages - you get rid of the annoying 'Are you sure you want to resubmit the data?' message. Also keeps your back/forward/refresh buttons usable.
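To make the redirect-after-POST flow concrete, here is a small standard-library sketch. The handler, the /items path, and the in-memory list standing in for the database are all hypothetical. The client side relies on urllib following the 302 with a fresh GET, which is roughly what a browser does.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

saved_items = []  # stands in for the database


class PrgHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Step 1: modify data on the server.
        length = int(self.headers.get("Content-Length", 0))
        saved_items.append(self.rfile.read(length).decode())
        # Step 2: on success, redirect instead of rendering a result.
        self.send_response(302)
        self.send_header("Location", "/items")
        self.send_header("Content-Length", "0")
        self.end_headers()

    def do_GET(self):
        # Step 3: the follow-up GET renders the result (or would fail with a 500).
        body = ",".join(saved_items).encode()
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the demo quiet


def post_then_follow(data: bytes):
    server = HTTPServer(("127.0.0.1", 0), PrgHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        url = f"http://127.0.0.1:{server.server_address[1]}/items"
        # Ignore any proxy settings from the environment for this local call.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        request = urllib.request.Request(url, data=data, method="POST")
        with opener.open(request) as response:
            # urllib has already followed the 302 with a GET at this point.
            return response.status, response.read().decode()
    finally:
        server.shutdown()
        server.server_close()
```

If the GET handler raised instead, the client would see the 302 for the POST followed by a 500 for the GET, matching the "saved but can't retrieve" case.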
If the server is aware that it has encountered a problem, it should normally return a 5xx error. The most generic one is 500 Internal Server Error, which RFC 2616 defines as follows:
500 Internal Server Error
The server encountered an unexpected condition which prevented it
from fulfilling the request.
Then it's the client's responsibility to reattempt the request. If the previous request was partially committed, it's the server's (or the database's) responsibility to roll that back, or to handle the duplicate transaction appropriately.
I agree with @Daniel that the proper response is an HTTP 500 (server error). The web application has to be written to roll back the transaction when there is an error, not leave things half-finished.
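A minimal sketch of "roll back, then return 500", using sqlite3 from the standard library. The function handle_create, the items table, and the fail_after_write flag (which simulates the error after the write) are hypothetical.

```python
import sqlite3


def handle_create(conn, name, fail_after_write=False):
    """Insert a row and return (status, body); undo the insert on any error."""
    try:
        # "with conn" commits on success and rolls back if an exception escapes.
        with conn:
            conn.execute("INSERT INTO items(name) VALUES (?)", (name,))
            if fail_after_write:
                raise RuntimeError("simulated failure while building the response")
        return 200, "OK"
    except Exception:
        # Nothing is left half-finished, so the client can safely retry.
        return 500, "Internal Server Error"


def count_items(conn):
    return conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
```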
One thing you can leverage in your web application is "idempotency". This is the property of a function (or operation) whereby repeating it any number of times gives the same result as performing it once. For instance, if a read fails, the client can simply retry it until it succeeds. If a deletion appears to fail, the client can again retry and the server will treat the request as valid whether or not the resource being deleted is already gone. And if an update appears to fail, the client can retry that until it gets a successful return from the server. The REST approach to architecting web services makes heavy use of idempotency to make operations robust in the face of error.
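Here is a toy illustration of why an idempotent delete is safe to retry. FlakyStore and retry are hypothetical names; the store applies the deletion but sometimes "loses" the reply, which is the situation described in the question.

```python
class FlakyStore:
    """In-memory store whose delete() does the work but may lose the reply."""

    def __init__(self, items, failures):
        self.items = set(items)
        self.failures = failures  # number of calls that fail after doing the work

    def delete(self, key):
        # Idempotent: deleting a key that is already gone is not an error.
        self.items.discard(key)
        if self.failures > 0:
            self.failures -= 1
            raise ConnectionError("response lost after the delete was applied")
        return "deleted"


def retry(operation, attempts=5):
    """Call operation until it returns, re-raising after the last attempt."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts:
                raise


store = FlakyStore({"a", "b"}, failures=2)
result = retry(lambda: store.delete("a"))
```

The first two attempts fail after the item is already gone, and the third succeeds with the same end state, so the client never has to know which attempt actually did the work.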
