What should the HTTP Status Code of a Degraded Health Check Be?

What should the HTTP Status Code of a Degraded Health Check Be? - http

I have a health check endpoint at /status that returns the following status codes and response bodies:
Healthy - 200 OK
Degraded - ?
Unhealthy - 503 Service Unnavailable
What should the HTTP status code be for a degraded response be? A 'degraded' check is used for checks that did succeed but are slow or unstable. What HTTP status code makes the most sense?

The most suitable HTTP status code for a "Degraded" status response from a health endpoint is nothing other than 200 OK.
I say this because I can't find any better code in the official Hypertext Transfer Protocol (HTTP) Status Code Registry maintained by IANA, pointed to by [RFC7231] HTTP/1.1: Semantics and Content. Unofficial codes should be avoided, because they only make your API more difficult to understand.
You should design your APIs so that they become easy to use. Resource names, HTTP verbs, status codes, etc. should be more or less self-explanatory, so that people who already know "the REST language" can immediately understand how to use your API without having to decipher vague names or unusual status codes. Which brings me to the next part of my answer...
Other comments on your design
The most natural way to interpret a 5xx response to any request is that the operation in question failed.
So a 503 Service Unavailable response to a GET /status request means that the status checking operation itself failed. Such a response would only be useful if we can be certain that /status is a health endoint, as pointed out in the API Health Check draft referred to in Nkosi's answer:
A health endpoint is only meaningful in the context of the component
it indicates the health of. It has no other meaning or purpose. As
such, its health is a conduit to the health of the component.
Clients SHOULD assume that the HTTP response code returned by the
health endpoint is applicable to the entire component (e.g. a larger
API or a microservice).
But with a URL path of just /status, it is not completely obvious that this really is a health endpoint. From looking at the URL, we only know that it returns information about the status of something, but we can't really be sure what that "something" is.
Since you're also telling us that yes, it is in fact a health endpoint, I must suggest that you change the name to health. I would also suggest placing it under some base path, e.g. /things/health, to make it more clear which component it indicates the health of.
If, on the other hand, /status was actually a resource of it own, i.e. something that represents the status of some other component/thing (like its name currently suggests), then 200 OK is the only reasonable status for successful invocations, even if the thing that it indicates the status of is "Unhealthy". In that case, a 5xx would mean that no status could be obtained, and details in the response payload would be assumed to be related to a failure in the /status service itself.
So be careful with how you name things and what status codes you use!

Consider returning a custom code within the 2xx Success range that is not already taken within the known/common status codes. Similar to some of the unofficial codes not supported by any standard.
For example 218 This is fine (Apache Web Server)
Used as a catch-all error condition for allowing response bodies to flow through Apache when ProxyErrorOverride is enabled. When ProxyErrorOverride is enabled in Apache, response bodies that contain a status code of 4xx or 5xx are automatically discarded by Apache in favor of a generic response or a custom response specified by the ErrorDocument directive
After doing some research I came across a draft
Health Check Response Format for HTTP APIs: draft-inadarei-api-health-check-03
Where they also made similar suggestions
In case of the “warn” status, endpoints MUST return HTTP status in the 2xx-3xx range, and additional information SHOULD be provided, utilizing optional fields of the response.
where the warn status in the draft is healthy, with some concerns, which I believe aligns closely to your desired model.
While not definitive, I believe it provides some ideas to help with the eventual design.

I would be wary of splitting hairs like this on a healthcheck on the upstream server side. The service providing the healthcheck should be lightly (and concurrently) testing all its upstream dependencies based on its own set of policies or rules - request timeouts, connection failures and so on. In reality the healthcheck either works or it doesn't and the application shouldn't really need to be keeping track of the results of the healthcheck (other than capturing metrics about what happened). IMHO a stateful healthcheck is a recipe for disaster.
I typically use the following interface for application healthchecks:
204 - No Content, everything is working within tolerences
500 - Something failed, and here's some details in the response about what went wrong
Where it gets tricky depends on your architecture. You may have a VIP or reverse proxy that is interpreting this response and deciding if a given node is healthy or not, in which case it's going to either route the request to a healthy node or return the 503 Service Unavailable. This decision is going to made on some policy basis - x healthcheck requests failed over a y time period across z upstream services.
If you use a mesh then everyone can feed data back to the service registry to keep the health state up to date and it can be based on actual service calls rather than a healthcheck.
The client is perfectly placed to make a decision based on the health of services it depends on as they can keep track of the various responses from the service. Circuit breakers are an excellent way to handle that and can do it continuously on actual requests rather than just on the healthcheck. Circuit breaker libraries (such as resilience4j) will do this for you at the cost of setting up some policies about how many failed/slow requests constitute a bad service. Service Registrys like the netflix eureka can help with the discovery and ongoing monitoring.

Assuming you are referring to the status code of a liveness/healthcheck endpoint of a service - to distinguish from 200 OK a 203 likely seems applicable and in line with:
https://datatracker.ietf.org/doc/draft-inadarei-api-health-check/
https://www.rfc-editor.org/rfc/rfc7234#section-5.5 despite being deprecated Warning: 199-header MAY carry details
align max-age with livenessProbe.periodSeconds
HTTP/1.1 203 Non-Authoritative Information
Warning: 199 - "FooBar Warning Details"
Content-Type: application/health+json
Cache-Control: max-age=10
Connection: close
{"status": "warn"}

Related

What is the most appropriate HTTP response from a backend service when attempting to remove an entry that no longer exists in the Database?

My team is developing a simple backend service that provides the operations ADD, GET and REMOVE a very simple item. All are triggered by an http request and they do not much besides adding, getting and removing the item from a database.
Regarding the specific scenario in which a REMOVE operation is triggered on a item that is not present in the DB (e.g. was removed before), our question is what should be the response of the service? We having been debating options like 200 + some specific message, 410 - resource gone, amongst other 2XX and 4XX possibilities, but we haven't reached a consensus.
I hope this is not Bikeshedding.
Thank you for your help.

What should be the response of the service?
It's important to highlight that status codes are meant to indicate the result of the server's attempt to understand and satisfy the client request. Having said that, 2xx status codes are unsuitable for this situation and should be avoided:
The 2xx (Successful) class of status code indicates that the client's request was successfully received, understood, and accepted.
The most suitable status code would be in the 4xx range:
The 4xx (Client Error) class of status code indicates that the client seems to have erred. Except when responding to a HEAD request, the server SHOULD send a representation containing an explanation of the error situation, and whether it is a temporary or permanent condition.
The 404 status code seems to be what you are looking for, as it indicates that the server can't find the requested resource:
6.5.4. 404 Not Found
The 404 (Not Found) status code indicates that the origin server did not find a current representation for the target resource or is not willing to disclose that one exists. A 404 status code does not indicate whether this lack of representation is temporary or permanent; [...]
If you are concerned on how the client will understand the 404 reponse, you could provide them with a payload stating that such resource is no longer available.
And just bear in mind that ADD and REMOVE are not standard HTTP methods. Hopefully that was a typo and you are using POST (or PUT) and DELETE to express operations over your resources.

Which HTTP status code should be used for business errors in API design?

Lets say I have an API endpoint that executes some business operation which can result in many different failures that are not connected directly to the request.
The request is correctly formed and I cannot return 4xx failures, but the logic of the application dictates that I return different error messages.
Now I want the client to be able to differentiate these error messages so that different actions can be taken depending on the code. I can return a custom JSON like this e.g.
{
"code": 15,
"message": "Some business error has occurred"
}
Now the question is which HTTP status code should I use for such occasions if no standard code like Conflict or NotFound makes sense.
It seems that 500 InternalServerError is logical, but then how can I additionally flag that this cannot be retried, should it be just documented that given status codes is not possible to retry so one can retry if you don't get one of those?

Consult RFC 7231:
503 Service Unavailable looks like a potential candidate, but the RFC mentions that this is supposed to represent a problem "which will likely be alleviated after some delay." This would indicate to a client that it could try the same call later, maybe after business hours or on the weekend. This is not what you want.
501 Not Implemented could be possible, but the RFC mentions "This
is the appropriate response when the server does not recognize the
request method and is not capable of supporting it for any resource. A 501 response is cacheable by default;" This does not appear to be the case here - the HTTP method itself was presumably valid - the failure here seems to be happening at the business rules layer (e.g. sending in an account number that is not in the database), rather than an HTTP method (GET, POST, etc.) that you never got around to implementing.
That leaves the last serious candidate,
500 Internal Server Error
The 500 (Internal Server Error) status code indicates that the server
encountered an unexpected condition that prevented it from fulfilling
the request.
This is the error code that is normally used for generic "an exception occurred in the app" situations. 500 is the best choice.
As to how to distinguish this from a "temporal internal trouble" error, you can include this as part of the HTTP body - just make sure that your client can parse out the custom codes!

Is 418 "I'm a teapot" really an HTTP response code?

Is 418 "I'm a teapot" really an HTTP response code?
There are various references to this on the internet, including in lists of response codes, but I can't figure out whether it's a weird joke.

I use this code. I have nginx reverse-proxying requests to two separate HTTP servers. One handles requests for unauthenticated users, and the second handles requests for authenticated users. The problem in this particular case, is the first server is the one that determines if the user is authenticated. Please don't ask why.
So, if the first server determines the user is authenticated, it responds 418 I'm a teapot. NGINX then reroutes the traffic internally to the second server. As far as the browser is concerned, it was a single request.
This is in the spirit of HTCPCP code 418, because if you attempt to BREW with a teapot, the appropriate response is "I'm not the kind of thing that can handle that request, but there may be others." .. In other words, "I'm a teapot. Find a coffee maker." (the second server being the coffee maker).
Ultimately, while 418 is not explicitly defined in RFC 7231, it is still covered by the umbrella of 4xx (Client Error).
6. Response Status Codes
4xx (Client Error): The request contains bad syntax or cannot be fulfilled
6.5. Client Error 4xx
The 4xx (Client Error) class of status code indicates that the client
seems to have erred. Except when responding to a HEAD request, the
server SHOULD send a representation containing an explanation of the
error situation, and whether it is a temporary or permanent
condition. These status codes are applicable to any request method.
User agents SHOULD display any included representation to the user.

HTTP response code 418 was originally defined in RFC 2324 ("Hyper Text Coffee Pot Control Protocol (HTCPCP/1.0)") and RFC 7168 ("The Hyper Text Coffee Pot Control Protocol for Tea Efflux Appliances (HTCPCP-TEA)") protocols.
Per Wikipedia: List of HTTP status codes: #418
This code was defined in 1998 as one of the traditional IETF April Fools' jokes, in RFC 2324, Hyper Text Coffee Pot Control Protocol, and is not expected to be implemented by actual HTTP servers. The RFC specifies this code should be returned by teapots requested to brew coffee. This HTTP status is used as an Easter egg in some websites, including Google.com.

Yes, I can confirm, that I've seen HTTP 418 coming back from a real production server. It does exist.

Yes it's a "real" code in the sense that it was published as part of an official RFC by the Internet Engineering Task Force, RFC-2324. However, since that RFC was published on April 1 and meant as an April fools joke (along with the rest of Hyper Text Coffee Pot Control Protocol), not for legitimate implementation, it's not a "serious" code in the typical sense*. That's why most sites use it as an Easter egg, but otherwise avoid it. As noted by wizulus in this comment, there are often more appropriate statuses like 400 (Bad Request). Despite its satirical nature, it's now a reserved code (presumably as a result of becoming a popular engineering meme for the briefest moment), so don't expect it to be going anywhere anytime soon.
*according to the RFC's author, Larry M Masinter, the HTTP extension in question actually does serve a serious-but-satirical purpose: "it identifies many of the ways in which HTTP has been extended inappropriately."

You really should not ask a HTTCP (Hyper-Text Teapot Control Protocol) to brew coffee. That's a job for a HTCPCP (Hyper-Text Coffee Pot Control Protocol).

I think it is safer to treat 418 as a reserved code that once had a half - official meaning but now officially "unassigned".
I assume, historically something has been differently thought about these codes than it is now. This sounds meaningless and funny today; probably was not?
In other words, I would avoid using this code.

I actually use it. In my project we have a backend and a frontend and if I'm developing a new API I use 418 to denote "Bad Requests that should not be possible to make in the frontend". They now trigger an event in our error reporting tool with severity level "warning", where standard 400 only trigger at level "info".
I would not like to use a 500 because it is an error on the caller and I don't think it is a regular 400, because we have many cases where the backend is handling the validation and 400's is not a bug. We could have used a 501, but it is in the 4xx because it is a request error.

HTTP Status 404 or 400 if no such API endpoint exists?

Status code 400 Bad Request is used when the client has sent a request that cannot be processed due to being malformed.
Status code 404 Not Found is used when the requested resource does not exist / cannot be found.
My question is, when a client sends a request to an endpoint my API does not serve, which of these status codes is more appropriate?
Should an endpoint be considered a "resource", and thus a 404 be returned? My issue with this is that if the client only checks the status code, they cannot tell the difference between a 404 indicating that they got to the correct endpoint, but there was no result matching their query, versus a 404 indicating that they queried a non-existing endpoint.
Alternatively, should we expect that a client has prior knowledge of all available API endpoints, and thus treat their request as malformed and return a 400 if the endpoint they are trying to reach does not exist?
Maybe this depends on whether the endpoints are REST or not. If they are REST endpoints, the client should not need prior API knowledge, but be able to learn about all relevant API endpoints by navigating the API from a single root endpoint. In such a case, I guess 404 would be more appropriate.
In my specific case right now, this is an internal (non-REST) HTTP API, where I expect the client to have prior knowledge of all API endpoints, so I am leaning towards 400, to avoid issues where 404 from accessing the wrong endpoint could be misconstrued as a 404 indicating that what they sought from the correct endpoint could not be found.
Thoughts?

As a convenience, many modern APIs provide human-readable endpoints for developer convenience. The intent of REST, however, is that URLs are treated as opaque - they may happen to contain semantic content, but can't be relied upon to do so. There's no such thing as a "malformed" URL. There's only a URL that points to something and a URL that doesn't.
Now, that's the REST dogma (and arguably also the HTTP 1.1 spec). That doesn't mean it's what you should do. If you have a single internal client for your API, and that's not going to change, you have a lot of flexibility in designing your own standards. Just make sure to document them, especially those that might confuse the guy straight out of college that they hire to replace you when you move on.

Which HTTP status code should I use for a health-check failure?

I'm implementing a /_status/ endpoint which does some sanity checks on data in our database.
For example, we are collecting measurements and the status should go "bad" if the latest measurement is over an hour old.
I would like to point Pingdom at this URL to leverage their alerting infrastructure and tell us when something's wrong.
On a "good" status I will serve an HTML page with an HTTP 200 OK status. But what would an appropriate HTTP status code be for "bad"? Or would it be more correct not to convey this information via status code, but via HTML content instead?
Thanks!

Well... this is an old question, but I ended up here, so I thought I'd give my two cents here:
It seems pretty clear that a 2xx should be returned if all is OK
If health is not OK, I think it should return a 5xx result (4xx talks about the client being at fault in the request; 2xx and 3xx are all successful to some degree).
I think that a 5xx is correct because this is a special request that is answering about the state of the whole service. Also, because most Load Balancers offer liveliness checks based on response codes and not all offer a way to parse a more complex payload (other than perhaps a RegExp Match which can make the check brittle).
I agree with #Julien that a 500 (specifically) doesn't seem appropriate, and we've decided on 503 Service Unavailable.
503 seems to fit for a couple of reasons:
It's a 5xx family result code which indicates that something is going on on the server side.
It has a temporary nature to it indicating that it may recover.

We just had a similar discussion in our group. We decided for our purposes that the HTTP response codes should be reporting on your server's success or failure to honor the request. For a GET, this would mean whether or not you can respond with the requested resource. In this case, the requested resource is a health report, so as long as you're returning that successfully, it should be a 200 response.
We're returning JSON for our health check, with a top-level "isHealthy" field set to true or false. Our load balancer and other monitors will parse the JSON and use this field to determine if the system is healthy or not.
If you don't want to parse JSON in your monitors, you could try putting a custom response header to indicate binary health of the system, e.g., System-Health: true or System-Health: false. You might have better luck getting monitors which can check that.
If you really want to use a response code, I would recommend an additional endpoint called something like "health" which returns a "204 No Content" when healthy, and a "404 Not Found" when not healthy. In this case, the resource defined by the URL is, symbolically, the health of your system, and so if it's healthy, you can return a successful response. If it's unhealthy, then it's health can't be found, hence the 404.

If your data is 'bad' because there is a service failure (even if that is a backend job failing) then a HTTP 500 seems like a valid response. It indicates that something, somewhere is broken.
It isn't very specific, you're shrugging your shoulders and saying:
The 500 (Internal Server Error) status code indicates that the server
encountered an unexpected condition that prevented it from fulfilling
the request.
ietf rfc7231

If you ask for health and the server state is not healthy, I'm partial to 409 Conflict which "Indicates that the request could not be processed because of conflict in the current state of the resource" .
Some people might object that if you can respond then the request can be processed, but I disagree. Every error message is a response. The server defines resource semantics. If you ask for the good news resource and the server responds "here is bad news", it didn't give you what it defines to have offered at that resource.
In practice, it's much easier to say 2**="up" 4**="down" and pipe request counts into an availability metric and have a load balancer remove the server from its pool based on the response code. Coming up with ways to argue that "hey, we told you something, so 200 OK" just seems like missing the forrest for the trees to me.

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex