Proxymesh proxies returning 408 status code - web-scraping

I am using proxymesh proxies to crawl different websites. Suddenly it start sending status 408.
Here is log,
Retrying <GET http://www.toysrus.com/product/index.jsp?productId=23046326&cp=&parentPage=search> (failed 1 times): 408 Request Time-out
Any solution will be highly appreciated.

I haven't really used proxymesh myself, but there are some things you should do:
You need to confirm that it isn't a site problem, so going to the site to check if it is also returning a 408 status (or maybe just being down).
The site (and proxymesh) could actually need more time to complete the request, for which according to proxymesh documentation could be increased by sending the X-ProxyMesh-Timeout header.
Check if proxymesh is returning more information (not just 408 status) in the response or headers. If you are using scrapy, check the Retrying settings and you could handle bad statuses with errback method.

Related

What is the best HTTP status code to convey "saved, but with errors"?

In my API, I am processing an object which contains a DSL script which can fail syntax/type validation, however I'll still persist the object regardless of any parsing failure, and send back the persisted object along with the failure messages. However, I am having trouble choosing the right HTTP status code to characterize this condition.
From my StackOverflow research:
HTTP 400 seems inaccurate since the request was not malformed and could be processed.
HTTP 202 seems promising but there isn't really async processing happening.
HTTP 422 is popular in other SO posts though I've never seen it used in practice.
Is sending an HTTP 200 still appropriate in this scenario since nothing really "failed"? Is there a more appropriate HTTP code to use here?
Any 4xx status should imply that the server did nothing, and the state did no change. Therefore, a 2xx code is the most appropriate.
You could use the Warning header:
http://greenbytes.de/tech/webdav/rfc7234.html#header.warning

Implementing an HTTP Server - do I have to respond to all requests?

If I am making an HTTP server, can I choose to ignore requests I don't want to respond to and let them time out?
I'm just wondering whether I am in any sense better off not responding to requests from potentially malicious sources than responding to them with data I'd rather not serve up, or responding with some 403 Forbidden or similar response that lets them know I exist.
A 403 should suffice. But I wouldn't let it just time out. If someone is trying to be cheeky, a time out will be more informative than a Service Unavailable 503.
I answered a relevant question a while back, read the question/answer, it's about a specific use case, but it does mention cases where you don't want to return an HTTP status code because it gives too much info.
RFC - 404 or 400 for relation of entity not found in PUT request
Also have a look at this list of HTTP Status codes, you can always use something like Too Many Requests 429, a Not Acceptable 406 or even something like I'm a teapot 418 ;)

HTTP response after error handling

So, I saw here a couple of questions related to mine but I want to describe a more specific situation. Imagine your API always checks for the amount of daily requests made to an entry point. Once a specific client achieves this limit I'm returning 422 - "Unprocessable entity".
I guess I just want to consult whether someone has a different approach.
Whenever there are too many requests being fired to your server, you can always respond with HTTP 429: Too Many Requests. Check the RFC.
The 429 status code indicates that the user has sent too many requests
in a given amount of time ("rate limiting").
The response representations SHOULD include details explaining the
condition, and MAY include a Retry-After header indicating how long to
wait before making a new request.
For example:
HTTP/1.1 429 Too Many Requests
Content-Type: text/html
Retry-After: 3600
..body..

Which HTTP status code should I use for a health-check failure?

I'm implementing a /_status/ endpoint which does some sanity checks on data in our database.
For example, we are collecting measurements and the status should go "bad" if the latest measurement is over an hour old.
I would like to point Pingdom at this URL to leverage their alerting infrastructure and tell us when something's wrong.
On a "good" status I will serve an HTML page with an HTTP 200 OK status. But what would an appropriate HTTP status code be for "bad"? Or would it be more correct not to convey this information via status code, but via HTML content instead?
Thanks!
Well... this is an old question, but I ended up here, so I thought I'd give my two cents here:
It seems pretty clear that a 2xx should be returned if all is OK
If health is not OK, I think it should return a 5xx result (4xx talks about the client being at fault in the request; 2xx and 3xx are all successful to some degree).
I think that a 5xx is correct because this is a special request that is answering about the state of the whole service. Also, because most Load Balancers offer liveliness checks based on response codes and not all offer a way to parse a more complex payload (other than perhaps a RegExp Match which can make the check brittle).
I agree with #Julien that a 500 (specifically) doesn't seem appropriate, and we've decided on 503 Service Unavailable.
503 seems to fit for a couple of reasons:
It's a 5xx family result code which indicates that something is going on on the server side.
It has a temporary nature to it indicating that it may recover.
We just had a similar discussion in our group. We decided for our purposes that the HTTP response codes should be reporting on your server's success or failure to honor the request. For a GET, this would mean whether or not you can respond with the requested resource. In this case, the requested resource is a health report, so as long as you're returning that successfully, it should be a 200 response.
We're returning JSON for our health check, with a top-level "isHealthy" field set to true or false. Our load balancer and other monitors will parse the JSON and use this field to determine if the system is healthy or not.
If you don't want to parse JSON in your monitors, you could try putting a custom response header to indicate binary health of the system, e.g., System-Health: true or System-Health: false. You might have better luck getting monitors which can check that.
If you really want to use a response code, I would recommend an additional endpoint called something like "health" which returns a "204 No Content" when healthy, and a "404 Not Found" when not healthy. In this case, the resource defined by the URL is, symbolically, the health of your system, and so if it's healthy, you can return a successful response. If it's unhealthy, then it's health can't be found, hence the 404.
If your data is 'bad' because there is a service failure (even if that is a backend job failing) then a HTTP 500 seems like a valid response. It indicates that something, somewhere is broken.
It isn't very specific, you're shrugging your shoulders and saying:
The 500 (Internal Server Error) status code indicates that the server
encountered an unexpected condition that prevented it from fulfilling
the request.
ietf rfc7231
If you ask for health and the server state is not healthy, I'm partial to 409 Conflict which "Indicates that the request could not be processed because of conflict in the current state of the resource" .
Some people might object that if you can respond then the request can be processed, but I disagree. Every error message is a response. The server defines resource semantics. If you ask for the good news resource and the server responds "here is bad news", it didn't give you what it defines to have offered at that resource.
In practice, it's much easier to say 2**="up" 4**="down" and pipe request counts into an availability metric and have a load balancer remove the server from its pool based on the response code. Coming up with ways to argue that "hey, we told you something, so 200 OK" just seems like missing the forrest for the trees to me.

HTTP status code for "success with errors"?

I've poked around a bit, but I don't see an HTTP status code for when a request's succeeds, but there is an error after the "point of no return".
e.g., Say you process a request, its committed to the database, but while returning the result you run of memory, or encounter a NPE, or what have you. It would have been a 200 response, but now, internally, you aren't able to return the proper, well-formed response.
202 Accepted doesn't seem to fit since we've already processed the request.
What status code means "Success, but errors"? Does one even exist?
HTTP doesn't have such a status code, but there is a best practice that allows you to handle such situations - redirect the user after a POST operation.
Here is a break down -
A POST request tries to modify data on the server
If the server fails, it sends a 500 error to indicate failure
If the server succeeds, it sends a 302 redirect response
The browser then sends a fresh GET request to the server
If this fails, you get a 500 error, otherwise you get a 200
So, your use case of 'Saved data but can't retrieve it immediately' translates to a 302 redirect for the initial POST, followed by a 500 for the subsequent GET.
This approach has other advantages - you get rid of the annoying 'Are you sure you want to resubmit the data?' message. Also keeps your back/forward/refresh buttons usable.
If the server is aware that it has encountered a problem, it should normally return a 5xx error. The most generic one is the 500 Server Error, which the RFC 2616 defines as follows:
500 Internal Server Error
The server encountered an unexpected condition which prevented it
from fulfilling the request.
Then it's the client's responsibility to reattempt the request. If the previous request was partially committed, it's the server's (or the database's) responsibility to roll that back, or to handle the duplicate transaction appropriately.
I agree with #Daniel that the proper response is an HTTP 500 (server error). The web application has to be written to roll back the transaction when there is an error, not leave things half-finished.
One thing you can leverage in your web application is "idempotency". This is the property of a function (or operation) that you can repeat it as many times as you like with the same result. For instance if a read fails, the client can simply retry it until it succeeds. If a deletion appears to fail, the client can again retry and the server will treat the request as valid whether or not the resource being deleted is already gone. And if an update appears to fail, the client can retry that until it gets a successful return from the server. The REST approach to architecting web services makes heavy use of idempotency to make operations robust in the face of error.

Resources