Randomly getting HTTP 400 errors while scraping

My scraping program usually works, but it occasionally gets an HTTP 400 error from the server. There is no lasting throttle effect; it goes back to working immediately after the 400 response. I'd estimate that about 1 in 5000 of the responses I get are 400s, while the rest are successes (200s). What could be going on, and is there anything I could do to eliminate these errors?

HTTP 400 means Bad Request.
Something in that particular request is making the server reject it. It will not affect any other requests.
A typical cause is special characters in a string.
The best way to fix it is to find the root cause: try logging every request, if that is possible in your environment.
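
As a rough illustration of the logging suggestion, here is a minimal sketch that records every response and, on a 400, flags characters in the URL that look like they should have been percent-encoded. The function names, the logger name, and the "allowed" character set are my own assumptions for the example, not anything from the asker's scraper.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("scraper")  # hypothetical logger name

# Characters that may appear unencoded somewhere in a URL
# (RFC 3986 unreserved and reserved sets, plus '%' for escapes).
ALLOWED = set(
    "abcdefghijklmnopqrstuvwxyz"
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "0123456789"
    "-._~:/?#[]@!$&'()*+,;=%"
)


def suspicious_characters(url: str) -> list:
    """Return the characters in url that probably needed percent-encoding."""
    return sorted({ch for ch in url if ch not in ALLOWED})


def record_response(url: str, status: int, body: str = "") -> dict:
    """Log every request; on a 400, keep enough detail to find the cause."""
    entry = {"url": url, "status": status, "suspects": suspicious_characters(url)}
    if status == 400:
        # The response body often explains what the server objected to.
        log.warning("400 for %r, suspect chars %r, body %r",
                    url, entry["suspects"], body[:200])
    else:
        log.info("%s for %r", status, url)
    return entry
```

Calling `record_response(url, status, body)` after each fetch should make it possible to compare the rare failing URLs against the ones that succeed.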

Related

Why should the client not repeat a bad request?

I've read many times that when an HTTP 400 (Bad Request) error is returned, the client should not repeat the request.
I'm wondering: if the request couldn't be completed, why is it so important that the client does not repeat it? Even though repeating the request wouldn't help fix the error, it seems to be considered very significant that the client doesn't re-send the malformed request.
Why is that?

If a 400 Bad Request signifies a client-side issue, repeating the request would do nothing but waste the server's resources. Most of the time a query parameter, a header, or part of the body is incorrect, and the request itself has to be changed before it will work. Some servers use 400 to show that the request was recognized but not completed, but in most cases something in the request needs to change. A 500 code indicates a SERVER-side error, which is not the case here. A lot of the time there will be a response body explaining the error along with the 400 code.
The worst that repeating the request will do is waste server resources, since it does not cause an error on the server, but in most cases repeating it is pointless.
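
A minimal sketch of a retry policy along these lines: retry server-side failures, but give up immediately on a 400. The `send` callable and the choice to also retry 408 and 429 are assumptions I've added for the example.

```python
import time


def should_retry(status: int) -> bool:
    """Retry only when the same request could plausibly succeed later."""
    if status in (408, 429):  # assumption: timeout / rate limit are transient
        return True
    return 500 <= status <= 599  # server-side errors


def fetch_with_retry(send, max_attempts: int = 3, delay: float = 0.0):
    """Call send() (a hypothetical callable returning (status, body)).

    Returns (status, body, attempts_used). A 4xx such as 400 is returned
    after the first attempt, because resending it unchanged cannot help.
    """
    for attempt in range(1, max_attempts + 1):
        status, body = send()
        if not should_retry(status) or attempt == max_attempts:
            return status, body, attempt
        time.sleep(delay)
```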

Implementing an HTTP Server - do I have to respond to all requests?

If I am making an HTTP server, can I choose to ignore requests I don't want to respond to and let them time out?
I'm just wondering whether I am in any sense better off not responding to requests from potentially malicious sources than responding to them with data I'd rather not serve up, or responding with some 403 Forbidden or similar response that lets them know I exist.

A 403 should suffice. But I wouldn't let it just time out. If someone is trying to be cheeky, a time out will be more informative than a Service Unavailable 503.
I answered a relevant question a while back, read the question/answer, it's about a specific use case, but it does mention cases where you don't want to return an HTTP status code because it gives too much info.
RFC - 404 or 400 for relation of entity not found in PUT request
Also have a look at this list of HTTP Status codes, you can always use something like Too Many Requests 429, a Not Acceptable 406 or even something like I'm a teapot 418 ;)

Returning http 200 OK with error within response body

I'm wondering if it is correct to return HTTP 200 OK when an error occurred on the server side (the error details would be contained inside the response body).
Example:
We're sending HTTP GET
Something unexpected happened on the server side.
Server returns HTTP 200 OK status code with error inside a response (e.g. {"status":"some error occurred"})
Is this the correct behavior or not? Should we change the status code to something else than 200?

No, it's very incorrect to send 200 with an error body.
HTTP is an application protocol. 200 implies that the response contains a payload that represents the status of the requested resource. An error message usually is not a representation of that resource.
If something goes wrong while processing GET, the right status code is 4xx ("you messed up") or 5xx ("I messed up").

HTTP status codes say something about the HTTP protocol. HTTP 200 means transmission is OK on the HTTP level (i.e. the request was technically OK and the server was able to respond properly). See this wiki page for a list of all codes and their meaning.
HTTP 200 has nothing to do with the success or failure of your "business code". In your example, HTTP 200 is an acceptable status to indicate that your "business code error message" was successfully transferred, provided that no technical issues prevented the business logic from running properly.
Alternatively you could let your server respond with HTTP 5xx if technical or unrecoverable problems happened on the server. Or HTTP 4xx if the incoming request had issues (e.g. wrong parameters, unexpected HTTP method...) Again, these all indicate technical errors, whereas HTTP 200 indicates NO technical errors, but makes no guarantee about business logic errors.
To summarize: YES it is valid to send error messages (for non-technical issues) in your http response together with HTTP status 200. Whether this applies to your case is up to you. If for instance the client is asking for a file that isn't there, that would be more like a 404. If there is a misconfiguration on the server that might be a 500. If client asks for a seat on a plane that is booked full, that would be 200 and your "implementation" will dictate how to recognise/handle this (e.g. JSON block with a { "booking": "failed" })
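
To make the three cases in this answer concrete, here is a small sketch under this answer's view of things (other answers in this thread disagree). `NotFound`, `respond`, and `book_seat` are hypothetical names invented for the example.

```python
import json


class NotFound(Exception):
    """Hypothetical: the requested resource does not exist."""


def respond(handler):
    """Run handler() and translate the outcome into (status, json_body)."""
    try:
        result = handler()
    except NotFound:
        # The client asked for something that isn't there.
        return 404, json.dumps({"error": "not found"})
    except Exception:
        # Anything unexpected is a server-side fault.
        return 500, json.dumps({"error": "internal server error"})
    return 200, json.dumps(result)


def book_seat(seats_left: int) -> dict:
    # A full plane is a normal business outcome, not a protocol error,
    # so it is reported in the body of a 200 response.
    if seats_left <= 0:
        return {"booking": "failed", "reason": "flight is full"}
    return {"booking": "confirmed"}
```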

I think these kinds of problems are solved if we think about real life.
Bad Practice:
Example 1:
Darling everything is FINE/OK (HTTP CODE 200) - (Success):
{
...but I don't want us to be together anymore!!!... (Error)
// Then everything isn't OK???
}
Example 2:
You are the best employee (HTTP CODE 200) - (Success):
{
...But we cannot continue your contract!!!... (Error)
// Then everything isn't OK???
}
Good Practices:
Darling I don't feel good (HTTP CODE 400) - (Error):
{
...I no longer feel anything for you, I think the best thing is to separate... (Error)
// In this case, you are alerting me from the beginning that something is wrong ...
}
This is only my personal opinion, each one can implement it as it is most comfortable or needs.
Note: The idea for this explanation was drawn from a great friend, @diosney

Even if I want to return a business logic error as an HTTP status code, there is no suitable HTTP error code for such errors, and forcing one would misrepresent the actual error.
So HTTP 200 is fine for business logic errors. But all errors that are covered by HTTP error codes should use them.
Basically, HTTP 200 means that the server correctly processed the user's request. If there are no seats on the plane, that does not matter, because the request was still processed correctly; the server could even just return the number of seats available, so there would be no business logic error at all, or that logic could live on the client side. "Business logic error" is an abstract notion, whereas an HTTP error is more definite.

To clarify, you should use HTTP error codes where they fit with the protocol, and not use HTTP status codes to send business logic errors.
Errors like insufficient balance, no cabs available, bad user/password qualify for HTTP status 200 with application specific error handling in the response body.
See this software engineering answer:
I would say it is better to be explicit about the separation of protocols. Let the HTTP server and the web browser do their own thing, and let the app do its own thing. The app needs to be able to make requests, and it needs the responses--and its logic as to how to request, how to interpret the responses, can be more (or less) complex than the HTTP perspective.

I think people have put too much weight into the application logic versus protocol matter. The important thing is that the response should make sense. What if you have an API that serves a dynamic resource and a request is made for X which is derived from template Y with data Z and either Y or Z isn't currently available? Is that a business logic error or a technical error? The correct answer is, "who cares?"
Your API and your responses need to be intelligible and consistent. It should conform to some kind of spec, and that spec should define what a valid response is. Something that conforms to a valid response should yield a 200 code. Something that does not conform to a valid response should yield a 4xx or 5xx code indicative of why a valid response couldn't be generated.
If your spec's definition of a valid response permits { "error": "invalid ID" }, then it's a successful response. If your spec doesn't make that accommodation, it would be a poor decision to return that response with a 200 code.
I'd draw an analogy to calling a function parseFoo. What happens when you call parseFoo("invalid data")? Does it return an error result (maybe null)? Or does it throw an exception? Many will take a near-religious position on whether one approach or the other is correct, but ultimately it's up to the API specification.
"The status-code element is a three-digit integer code giving the result of the attempt to understand and satisfy the request"
Obviously there's a difference of opinion with regards to whether "successfully returning an error" constitutes an HTTP success or error. I see different people interpreting the same specs different ways. So pick a side, sure, but also accept that either way the whole world isn't going to agree with you. Me? I find myself somewhere in the middle, but I'll offer some commonsense considerations.
1) If your server-side code catches an unexpected exception when dispatching a request, that sounds like the very definition of a 500 Internal Server Error. This seems to be OP's situation. The application should not return a 200 for unexpected errors, but also see point 3.
2) If your server-side code should be able to gracefully handle a given invalid input, and it doesn't constitute an "exceptional" error condition, your spec should accommodate HTTP 200 responses that provide meaningful diagnostic information.
3) Above all: Have a spec. Make it consistent. Stick to it.
In OP's situation, it sounds like you have a de-facto standard that unhandled exceptions yield a 200 with a distinguishable response body. It's not ideal, but if it's not breaking things and actively causing problems, you probably have bigger, more important problems to solve.

HTTP is the protocol handling the transmission of data over the internet.
If that transmission breaks for whatever reason, the HTTP error codes tell you why the data can't be sent to you.
The data being transmitted is not covered by HTTP error codes, only the method of transmission.
HTTP can't say "OK, this answer is gobbledygook, but here it is". It just says 200 OK.
In other words: I've completed my job of getting it to you, the rest is up to you.
I know this has been answered already but I put it in words I can understand. sorry for any repetition.

IIS responding to single request with two responses

We have a user making a POST to our web server (Windows Server 2003, IIS 6). They get the full response from our webapp, but then IIS also responds with a "400 Bad Request". No other information is provided (yes, I have friendly errors turned off).
At first, I thought maybe it was some middleware injecting a response in there. However, I was able to find the following in the HTTPSys error logs which confirms that it is coming from our server:
2013-08-09 23:36:40 11901 80 HTTP/0.0 Unparsed - 400 - BadRequest -
There are a whole slew of these errors piling up, and I have no idea why. Unparsed doesn't really tell me anything, so I don't have much to go on. I was able to get them to produce a wireshark trace, which shows that we are indeed responding with the full correct response and then appending a 400 bad request response. I copied their request EXACTLY from wireshark and tried it from my machine and of course, I can't reproduce it (I get the one valid response back).
So I am completely unable to reproduce the "Unparsed" error, I WAS however able to get two responses back from one request. I intentionally broke the line endings between the request headers and body and I got back a full correct response followed by "Bad Request (Invalid Verb)".
Two questions
1) Does anyone have any ideas as to how to produce an "Unparsed" error in HTTPsys logs? Any thoughts on how I might go about reproducing this?
2) WHY is IIS responding to a single request with two responses? Is that normal behavior, or indicative of a configuration error?
Thanks for anyone willing to offer help on this terrible headache!

Going to answer my own question here because I wouldn't wish this pain upon anyone else.
It turned out their POST had a slight error in Content-Length. I think it wasn't counting the final "\r\n"; whatever the cause, the count ended up being short by 2.
The reason this never showed up when I copied the exact request and sent it from my machine was that I wasn't properly closing the connection. For some reason, closing the connection with a FIN actually makes the difference and causes the 400 Bad Request.
Not sure why this happens, maybe the extra chars sent in the initial request are somehow read when the FIN comes? I don't know, but there you have it. Fix the content-length and the pain goes away.
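
For anyone hitting something similar, here is a minimal sketch of computing Content-Length from the encoded body bytes (so a trailing "\r\n" and multi-byte characters are counted), plus a helper to compare the declared and actual sizes of a captured request. Both function names and the example headers are made up for illustration; this is not the client's actual code.

```python
def build_post(host: str, path: str, body: str) -> bytes:
    """Build a raw HTTP/1.1 POST whose Content-Length matches the body."""
    payload = body.encode("utf-8")  # count bytes, not characters
    head = (
        f"POST {path} HTTP/1.1\r\n"
        f"Host: {host}\r\n"
        "Content-Type: text/plain; charset=utf-8\r\n"
        f"Content-Length: {len(payload)}\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")
    return head + payload


def declared_vs_actual(raw: bytes) -> tuple:
    """Return (declared Content-Length, actual body size) for a raw request."""
    head, _, body = raw.partition(b"\r\n\r\n")
    declared = next(
        int(line.split(b":", 1)[1].decode("ascii").strip())
        for line in head.split(b"\r\n")
        if line.lower().startswith(b"content-length:")
    )
    return declared, len(body)
```

If the two numbers from `declared_vs_actual` differ for a request copied out of a capture, the leftover bytes are a likely candidate for what the server is failing to parse as a second request.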

HTTP status code for "success with errors"?

I've poked around a bit, but I don't see an HTTP status code for when a request succeeds, but there is an error after the "point of no return".
E.g., say you process a request and it's committed to the database, but while returning the result you run out of memory, or encounter an NPE, or what have you. It would have been a 200 response, but now, internally, you aren't able to return the proper, well-formed response.
202 Accepted doesn't seem to fit since we've already processed the request.
What status code means "Success, but errors"? Does one even exist?

HTTP doesn't have such a status code, but there is a best practice that allows you to handle such situations - redirect the user after a POST operation.
Here is a breakdown:
1) A POST request tries to modify data on the server
2) If the server fails, it sends a 500 error to indicate failure
3) If the server succeeds, it sends a 302 redirect response
4) The browser then sends a fresh GET request to the server
5) If this fails, you get a 500 error, otherwise you get a 200
So, your use case of 'Saved data but can't retrieve it immediately' translates to a 302 redirect for the initial POST, followed by a 500 for the subsequent GET.
This approach has other advantages - you get rid of the annoying 'Are you sure you want to resubmit the data?' message. Also keeps your back/forward/refresh buttons usable.
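
A toy sketch of that flow, written as a plain function rather than a real web framework so it stays self-contained. The `/items` paths, the in-memory `store`, and the `render` parameter are all assumptions made up for the example.

```python
def handle(method: str, path: str, store: dict, render=str):
    """Return (status, headers, body) for a toy Post/Redirect/Get flow."""
    if method == "POST" and path == "/items":
        try:
            item_id = str(len(store) + 1)
            store[item_id] = "saved"  # the "point of no return"
        except Exception:
            return 500, {}, "could not save"
        # Success: send no representation, just point at the new resource.
        return 302, {"Location": f"/items/{item_id}"}, ""
    if method == "GET" and path.startswith("/items/"):
        item_id = path.rsplit("/", 1)[1]
        if item_id not in store:
            return 404, {}, "not found"
        try:
            return 200, {}, render(store[item_id])
        except Exception:
            # The data is saved; only producing the response failed.
            return 500, {}, "could not render"
    return 404, {}, "not found"
```

With this split, a failure while building the response shows up as a 500 on the GET, while the earlier 302 has already told the client that the POST itself went through.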

If the server is aware that it has encountered a problem, it should normally return a 5xx error. The most generic one is 500 Internal Server Error, which RFC 2616 defines as follows:
500 Internal Server Error
The server encountered an unexpected condition which prevented it
from fulfilling the request.
Then it's the client's responsibility to reattempt the request. If the previous request was partially committed, it's the server's (or the database's) responsibility to roll that back, or to handle the duplicate transaction appropriately.

I agree with @Daniel that the proper response is an HTTP 500 (server error). The web application has to be written to roll back the transaction when there is an error, not leave things half-finished.
One thing you can leverage in your web application is "idempotency". This is the property of a function (or operation) that you can repeat it as many times as you like with the same result. For instance if a read fails, the client can simply retry it until it succeeds. If a deletion appears to fail, the client can again retry and the server will treat the request as valid whether or not the resource being deleted is already gone. And if an update appears to fail, the client can retry that until it gets a successful return from the server. The REST approach to architecting web services makes heavy use of idempotency to make operations robust in the face of error.
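
Here is a minimal sketch of that idea for the deletion case: the delete treats "already gone" as success, so a client that never saw the first response can simply try again. The in-memory `store`, the 204 status, and the use of `ConnectionError` to stand in for a lost response are assumptions for the example.

```python
def delete_item(store: dict, key: str) -> int:
    """Idempotent delete: the end state is the same however often it runs."""
    store.pop(key, None)  # no error if the key is already gone
    return 204


def retry_until_ok(operation, max_attempts: int = 5):
    """Repeat operation() until it returns a 2xx status or attempts run out.

    Returns (last_status, attempts_used). Repeating is only safe because
    the operation is assumed to be idempotent.
    """
    status = None
    for attempt in range(1, max_attempts + 1):
        try:
            status = operation()
        except ConnectionError:
            continue  # response lost; the request may or may not have applied
        if 200 <= status < 300:
            return status, attempt
    return status, max_attempts
```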
