Understand the weak comparison function - HTTP

HTTP 1.1 defines a weak comparison function for cache validators: in order to be considered equal, both validators MUST be identical in every way, but either or both of them MAY be tagged as "weak" without affecting the result.
I understand that the following statement (for two ETags) is true:
W/"Foo" = "Foo"
Now I'm wondering what real world use case might exist where a server compares a weak ETag against a strong one.

There are cases where servers first assign a weak ETag and later promote it to a strong ETag (by removing the "W/" prefix). An example is Apache mod_dav (or is it plain httpd?), when configured to create entity tags based on the filesystem timestamp of the file being served.
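To make the rule concrete, here is a minimal sketch of the weak comparison function in Python (a hypothetical helper, assuming the validators arrive as raw header strings):

import re

def weak_compare(a: str, b: str) -> bool:
    # Strip the optional W/ prefix: under the weak comparison function,
    # the weakness flag does not affect the result.
    def opaque(tag: str) -> str:
        return tag[2:] if tag.startswith("W/") else tag
    return opaque(a) == opaque(b)

assert weak_compare('W/"Foo"', '"Foo"')       # equal under weak comparison
assert not weak_compare('"Foo"', '"Bar"')     # opaque values still have to match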

Related

How to determine if a DAV folder had parallel updates while I was modifying it

I'm syncing a local client with a DAV folder (CardDAV in this particular case).
For this folder, I have an ETag (a CTag in SabreDAV dialect, to distinguish folder ETags from item ETags). If the CTag has changed, I need to re-sync. But if this change was caused by myself (e.g. I just uploaded a contact into this CardDAV folder), isn't there any way to avoid the resync?
Ideally, I would want the DAV server to return the following on each request which changes anything on the server:
CTag1, the CTag of the folder as it was before my action was applied
CTag2, the CTag of the folder after my action was applied
the ETag assigned to the item in question (although it's not relevant to this particular question)
This would let me tell whether the CTag change was caused only by my own actions (no resync needed) or whether something else occurred in between (and thus a resync is needed).
Currently, I can only query the folder for its CTag at any time, but I have no clue what to do if the CTag changed (in pseudo-code):
cTag0 = ReadStoredValue() ' The value left from the previous sync.
cTag1 = GetCTag()
If cTag0 <> cTag1 Then
Resync()
End If
UploadItem() ' Can get race condition if another client changes anything right now
cTag2 = GetCTag()
cTag2 will obviously not be the same as cTag1, but this provides zero information on whether something else occurred in the middle (another client changing something in the same folder). So the cTag0 <> cTag1 comparison won't save me from race conditions; I could think that I'm in sync while some other update sneaked in unnoticed.
It would be great to have:
cTag0 = ReadStoredValue() ' The value left from the previous sync.
(cTag1, cTag2) = UploadItem()
If cTag0 == cTag1
' No resync needed, just remember new CTag for the next sync cycle.
cTag0 = cTag2
Else
Resync()
cTag0 = cTag2
End If
I'm aware of the DAV-Sync protocol extension, but that would be a different story. In this task, I'm referring to standard DAV, no extensions allowed.
EDIT: One thought which just crossed my mind: I noticed that the CTag is sequential. It's a number which gets incremented by 1 on each operation on the folder. So if it increases by more than 1 between obtaining the CTag, performing my action, and obtaining the CTag again, that indicates something else has occurred in the meantime. But this does not seem reliable; I'm afraid it's too implementation-specific to rely on this behavior. Looking for a more robust solution.
How to determine if a DAV folder had parallel updates while I was modifying it
This is very similar to
How to avoid time conflict or overlap for CalDAV?
Technically, in pure DAV you are not guaranteed to be able to do this. In the real world, though, most servers will return the ETag in the response to the PUT which was used to create/update the resource. This allows you to reconcile concurrent changes to the same resource.
There is also the Calendar Server Bulk Change Requests for *DAV Protocols extension, which is supported by some servers and provides a more specific way to do this. Since it isn't an RFC, I wouldn't suggest relying on it, though.
So what you would probably do is a PUT. If that returns the ETag, you are good and can reconcile by syncing the collection (by whatever mechanism: PROPFIND Depth:1, CTag, or sync-report). If not, you can either reconcile by other means (e.g. comparing/hashing the content) or just treat the change as a concurrent edit, which I think most implementations do.
If you are very lucky, the server may also return the CTag/sync-token in the PUT response. But AFAIK there is no standard for that; servers are not required to do it.
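As an illustration of that reconciliation flow, here is a sketch using Python's requests (the URL, payload, and local bookkeeping are placeholders, and as said, the server is not obliged to send the ETag):

import requests

vcard = b"BEGIN:VCARD\r\nVERSION:3.0\r\nFN:Jane Doe\r\nEND:VCARD\r\n"
known_etags = {}  # local bookkeeping left over from the previous sync

# Upload the contact; many real-world servers return the new ETag here.
resp = requests.put(
    "https://dav.example.com/addressbook/contact1.vcf",  # placeholder URL
    data=vcard,
    headers={"Content-Type": "text/vcard"},
)

etag = resp.headers.get("ETag")
if etag is not None:
    # On the next sync, an item whose ETag matches this value is our own
    # write, so it does not by itself require pulling the item again.
    known_etags["contact1.vcf"] = etag
else:
    # No ETag returned: reconcile by other means (compare/hash content),
    # or simply treat the item as a concurrent edit on the next sync.
    pass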
For this folder, I have ETag (CTag in SabreDAV dialect)
This is a misconception. A CTag is absolutely not the same as an ETag; it is its own thing, documented over here:
CalDAV CTag.
I'm aware of the DAV-Sync protocol extension, but that would be a different story. In this task, I'm referring to standard DAV, no extensions allowed.
CTag is not a DAV standard at all; it is a private Apple extension (there is no RFC for it).
Standard HTTP/1.1 specifies the ETag. It corresponds to the resource representation and does not apply to the contents of a WebDAV collection, which are a distinct concept. WebDAV collections often also have a body of their own (retrievable by GET etc.); the ETag corresponds to that body.
The official standard which replaces the proprietary CTag extension is in fact DAV-Sync, aka RFC 6578, and the sync-token property and header are what replace the CTag property.
So if "no extensions allowed" is your use case, you need to do resource comparison on the client side. Pure WebDAV doesn't provide this capability.
I noticed that CTag is sequential
CTags are not sequential; they are opaque tokens. A specific server may use a sequence, but that is completely arbitrary. (The same is true for all DAV tokens; they are always opaque.)

Should I test all enum values in a contract?

I have a doubt about whether I should consider a certain type of test functional or contract.
Let's say I have an API like /getToolType that accepts {"object": "myObject"} as input and returns a type in the form {"type": "[a-z]+"}.
It was agreed between client and server that the types returned will match a set of strings, let's say [hammer|knife|screwdriver], so the consumer decided to parse them into an enum, with a fallback value for when the returned type is unknown.
Should the consumer include a test case for each type (hammer, knife, screwdriver) to ensure the producer is still following the agreement that it will always return, for instance, the lowercase string "hammer" when /getToolType is called with a hammer object?
Or would you consider such a test case as functional? And why?
IMO the short answer is 'no'.
Contract testing is more interested in structure; if we start boundary-testing the API we move into functional-test territory, which is best done in the provider code base. You can use a matcher to ensure only one of those three values is returned; this ensures the provider build can't return other values (see the sketch after this answer).
I would echo #J_A_X's comments - there is no right or wrong answer, just be wary of testing all permutations of input/output data.
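As a sketch of such a matcher, using pact-python's Term (the field name is taken from the question; the regex pins the value to the agreed set, while the second argument is the example value the mock server will return):

from pact import Term

# Constrain "type" to the agreed values without one interaction per value.
expected_body = {
    "type": Term(r"^(hammer|knife|screwdriver)$", "hammer"),
}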
Great question. Short answer: there's no right or wrong way, just how you want to do it.
Longer answer:
The point of Pact (and contract testing) is to test specific scenarios and make sure that they match up. You could simply, in your contract, create a regex that allows any string for those enums, or maybe null, but only if your consumer simply doesn't care about that value. For instance, if the tool type had a brand, I wouldn't care about the brand, just that it's returned back as a string, since I just display the brand verbatim on the consumer (front-end).
However, if it was up to me, from what I understand of your scenario, it seems like the tool type is actually pretty important considering the endpoint it's hitting, hence I would probably have specific tests and contracts for each enum to make sure that those particular scenarios on my consumer are valid (I call X with something and I expect Y to have tool type Z).
Both of these solutions are valid, what it comes down to is this: Do you think the specific tool type is important to the consumer? If it is, create contracts specific to it, if not, then just create a generic contract.
Hope that helps.
The proper state is that the consumer consumes hammer, knife, and screwdriver (c=(hammer,knife,screwdriver) for short), while the producer produces hammer, knife, and screwdriver (p=(hammer,knife,screwdriver)).
There are four regression scenarios:
c=(hammer,knife,screwdriver,sword), p=(hammer,knife,screwdriver)
c=(hammer,knife,screwdriver), p=(hammer,knife,screwdriver,sword)
c=(hammer,knife,screwdriver), p=(hammer,knife)
c=(hammer,knife), p=(hammer,knife,screwdriver)
Scenarios 1 and 3 break the contract in a very soft way.
In the 1st scenario, the consumer declares a new type that is not (yet) supported by the producer.
In the 3rd scenario, the producer stops supporting a type.
The gravity of these scenarios may of course vary; something I consider a soft regression might sit in a business-critical process for a certain service.
However, if it is critical then there is a significant motivation to cover it with a dedicated test case.
The 2nd and 4th scenarios are more severe: in both cases, the consumer may end up in an error, e.g. it might not be able to deserialize the data.
Having a test case for each type should detect scenarios 3 and 4.
In the 1st scenario, it may prompt the developer to create an extra test case that will fail on the producer side.
However, the test cases are helpless against the 2nd scenario.
So despite the relatively high cost, this strategy does not provide full test coverage.
Having one test case with a regex covering all valid types (i.e. hammer|knife|screwdriver) should be a strong trigger for the consumer developer to redesign the test case in the 1st and 4th scenarios.
Once the regex is adjusted to the new consumer capabilities, it can detect scenario 4 with probability p=1/3 (i.e. the test will fail if the producer selected screwdriver as the sample value).
Even without regex adjustment, it will detect the 3rd scenario with p=1/3.
This strategy is helpless against the 1st and 2nd scenarios.
However, on top of the regex, we can do more.
Namely, we can design the producer test case with random data.
Assuming that the type in question is defined as follows:
enum Tool {hammer,knife,screwdriver}
we can render the test data with:
responseBody = Arranger.some(Tool.class);
This piece of code uses test-arranger, but there are other libraries that can do the same as well.
It selects one of the valid enum values.
Each time it can be a different one.
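For illustration, the same idea outside the JVM could look like this in Python, with random.choice standing in for Arranger.some:

import random
from enum import Enum

class Tool(Enum):
    HAMMER = "hammer"
    KNIFE = "knife"
    SCREWDRIVER = "screwdriver"

# Draw a different valid enum value on each run of the producer test.
response_type = random.choice(list(Tool)).value
print(response_type)  # e.g. "knife"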
What does it change?
Now we can detect the 2nd scenario and after regex adjustment the 4th one.
So it covers the most severe scenarios.
There is also a drawback to consider.
The producer test is nondeterministic: depending on the drawn value, it can either succeed or fail, which is usually considered an antipattern.
When tests sometimes fail despite the tested code being correct, people start to ignore their results.
Note that the producer test case with random data is not like that; it is in fact the opposite: it can sometimes succeed despite the tested code being incorrect.
It is still far from perfect, but it is an interesting tradeoff, as it is the first strategy that manages to address the very severe 2nd scenario.
My recommendation is to use the producer test case with random data, supported by a regex on the consumer side.
Nonetheless, there is no perfect solution, and you should always consider what is important for your services.
Specifically, if the consumer can safely ignore unknown values, the recommended approach might not be a perfect fit.

Generating a multipart/byterange response without scanning the parts ahead of sending

I would like to generate a multipart byterange response. Is there a way for me to do it without scanning each segment I am about to send out, given that I need to generate the multipart boundary strings?
For example, a user can request a byte range that would have me fetch and scan 2 GB of data, which in my case involves loading that data into my (slow) VM as strings and so forth. Ideally, I would like to simply state in the response that a part has a length of a certain number of bytes and be done with it. Is there any tooling that could provide me with this option? I see that many developers just grab a UUID as the boundary and are apparently willing to risk the tiny probability that it will appear somewhere within a part; is that risk small enough that taking it is reasonable?
To explain in more detail: scanning the parts ahead of time (before generating the response) is not really feasible in my case, since I need to fetch them via HTTP from an upstream service. This means I would effectively have to prefetch the entire part first to compute a non-matching multipart boundary, and only then could I splice that part into the response.
Assuming the data can be arbitrary, I don’t see how you could guarantee absence of collisions without scanning the data.
If the format of the data is very limited (like... base 64 encoded?), you may be able to pick a boundary that is known to be an illegal sequence of bytes in that format.
Even if your boundary does collide with the data, it must be followed by headers such as Content-Range, which is even more improbable, so the client is likely to treat it as an error rather than consume the wrong data.
Major Web servers use very simple strategies. Apache grabs 8 random bytes at startup and renders them in hexadecimal. nginx uses a sequential counter left-padded with zeroes.
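Both strategies are trivial to reproduce; for example, in Python (a sketch, not either server's actual code):

import itertools
import secrets

# Apache-style: a handful of random bytes rendered in hexadecimal.
apache_style_boundary = secrets.token_hex(8)    # e.g. "3f9a0c2e5b7d1a46"

# nginx-style: a zero-padded sequential counter.
counter = itertools.count(1)
nginx_style_boundary = f"{next(counter):016d}"  # e.g. "0000000000000001"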
UUIDs are designed to avoid collisions with other UUIDs, not with arbitrary data. A UUID is no more likely to be a good boundary than a completely random string of the same length. Moreover, some UUID variants include information that you may not want to disclose, such as your machine’s MAC address.
Ideally I would like to simply state in the response that a part has a length of a certain number of bytes, and be done with it. Is there any tooling that could provide me with this option?
Maybe you can avoid supporting multiple ranges and simply tell the clients to request each range separately. In that case, you don’t use the multipart format, so there is no problem.
If you do want to send multiple ranges in one response, then RFC 7233 requires the multipart format, which requires the boundary string.
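To show how little bookkeeping the boundary actually requires, here is a sketch of streaming a multipart/byteranges body where each part's length comes from the range arithmetic rather than from scanning (upstream fetching is abstracted as an iterable of byte chunks):

import secrets
from typing import Iterable, Iterator, Tuple

def byteranges_body(
    parts: Iterable[Tuple[int, int, int, Iterable[bytes]]],
    boundary: str,
    part_type: str = "application/octet-stream",
) -> Iterator[bytes]:
    # Each part is (first_byte, last_byte, total_size, chunks).
    for first, last, total, chunks in parts:
        yield (
            f"\r\n--{boundary}\r\n"
            f"Content-Type: {part_type}\r\n"
            f"Content-Range: bytes {first}-{last}/{total}\r\n\r\n"
        ).encode("ascii")
        for chunk in chunks:  # upstream bytes pass through untouched
            yield chunk
    yield f"\r\n--{boundary}--\r\n".encode("ascii")

boundary = secrets.token_hex(8)
# Response header: Content-Type: multipart/byteranges; boundary=<boundary>
# Since each part's length is (last - first + 1), an exact Content-Length
# can be computed up front from the headers alone, if desired.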
You can, of course, invent your own mechanism instead of that of RFC 7233. In that case:
You cannot use 206 (Partial Content). You must use 200 (OK) or some other applicable status code.
You cannot use the multipart/byteranges media type. You must come up with your own media type.
You cannot use the Range request header.
Because a 200 (OK) response to a GET request is supposed to carry a (full) representation of the resource, you must do one of the following:
encode the requested ranges in the URL; or
use something like POST instead of GET; or
use a custom, non-standard status code instead of 200 (OK); or
(not sure if this is a correct approach) use media type parameters, send them in Accept, and add Accept to Vary.
The chunked transfer coding may be useful, but you cannot rely on it alone, because it is a property of the connection, not of the payload.

Is If-Modified-Since strong or weak validation?

HTTP 1.1 states that there can be either strong or weak ETag/If-None-Match validation. My question is: is Last-Modified/If-Modified-Since validation strong or weak?
This has implications for whether sub-range requests can be made or not.
From http://greenbytes.de/tech/webdav/draft-ietf-httpbis-p5-range-23.html#rfc.section.4.3:
"A response might transfer only a subrange of a representation if the connection closed prematurely or if the request used one or more Range specifications. After several such transfers, a client might have received several ranges of the same representation. These ranges can only be safely combined if they all have in common the same strong validator, where "strong validator" is defined to be either an entity-tag that is not marked as weak (Section 2.3 of [Part4]) or, if no entity-tag is provided, a Last-Modified value that is strong in the sense defined by Section 2.2.2 of [Part4]."
An ETag can be strong or weak depending on its prefix. Normally it will be strong, except when you access dynamic content where the content management system (CMS) handles that, which is IMHO very uncommon.
However, the If-Modified-Since result should be strong too, if and only if nobody manipulates the metadata of the files in the filesystem. On Linux that is pretty simple with the touch command, but I think you normally don't need to care about it. If somebody manipulates your server, you have a different problem entirely.
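Where this matters in practice is resuming a download with a sub-range request; a sketch with Python's requests (the URL is a placeholder):

import requests

url = "https://example.com/file.bin"  # placeholder
last_modified = requests.head(url).headers["Last-Modified"]

# Ask for the tail of the file, but only if the representation is unchanged.
# Using Last-Modified in If-Range is only safe if it is a strong validator.
resp = requests.get(url, headers={"Range": "bytes=1000-", "If-Range": last_modified})

if resp.status_code == 206:
    print("got the missing tail; safe to append to the partial download")
elif resp.status_code == 200:
    print("representation changed; restart the download from scratch")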

Does if-match HTTP header require two-phase commits?

I'm trying to design a RESTful web API, so I've been studying RFC 2616. I like the idea of using ETags for optimistic concurrency and was trying to use them to build a safe way to add resources without race conditions. However, I noticed the following two statements in section 14.24:
If the request would, without the If-Match header field, result in anything other than a 2xx or 412 status, then the If-Match header MUST be ignored.
A request intended to update a resource (e.g., a PUT) MAY include an If-Match header field to signal that the request method MUST NOT be applied if the entity corresponding to the If-Match value (a single entity tag) is no longer a representation of that resource.
I'm using an RDBMS and don't know whether a transaction will successfully commit until I try it, so the first requirement seems a bit onerous. Consider a case where somebody supplies an If-Match header with mismatched ETags: if the commit would succeed, then I should heed the If-Match header, NOT attempt the commit, and return 412. If the commit would fail, then a request without the If-Match header would have resulted in a non-2xx/412 response, so I MUST ignore the If-Match header, meaning I should attempt the commit.
As far as I can figure out, I have 2 options:
Use 2-phase commits to gain foresight into whether the commit will succeed before attempting it.
Ignore the first requirement above, and return 412 even if ignoring If-Match would have resulted in a non-2xx/412 response (this is the one I'm leaning towards).
Any other ideas? Am I misinterpreting the specs?
Wouldn't something like "update unless modified" (optimistic locking) work? The entity would need to store a version number or the ETag in the database. The flow would be (see the sketch after this list):
Run validations that don't require a commit, ignoring the ETag; return an error if necessary.
update entity where id = :the_id and etag = :expected_etag
This returns either 0 or 1 for affected rows.
If 0, the resource has seen a concurrent update (or the id is completely wrong, which you could check separately). In this case, return 412.
Commit.
If the commit fails, return an error as appropriate.
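A sketch of that flow using Python's sqlite3 (the table and column names are made up; any RDBMS that reports affected-row counts works the same way):

import sqlite3

def update_with_etag(conn: sqlite3.Connection, entity_id: int,
                     expected_etag: str, new_body: str, new_etag: str) -> int:
    """Optimistic-locking update; returns an HTTP status code."""
    try:
        cur = conn.execute(
            "UPDATE entity SET body = ?, etag = ? WHERE id = ? AND etag = ?",
            (new_body, new_etag, entity_id, expected_etag),
        )
        if cur.rowcount == 0:
            # ETag mismatch (or unknown id, which you could check separately).
            conn.rollback()
            return 412
        conn.commit()
        return 200
    except sqlite3.Error:
        conn.rollback()
        return 500  # the commit itself failed for unrelated reasons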
Maybe this is somewhat on the theoretical side, but based on my current understanding of the HTTP specification, I would consider If-Match-like headers practically unusable for all but maybe the safe methods, because of this:
"If the request would, without the If-Match header field, result in anything other than a 2xx or 412 status, then the If-Match header MUST be ignored."
Why? Simply because in most practical cases, it's just impossible to foresee what would happen if the request were carried out.
As an example, who can foresee an IO-level error, or some exceptional case occurring in code that must be run?
It would be more "solvable" if 5xx were added to 2xx and 412.
