Doubts about optimistic concurrency control when creating an item - azure-cosmosdb

I noticed that ContainerProxy in the Cosmos DB SDK has a few methods that contain etag and match_condition parameters, which I understand are for optimistic concurrency control. But one thing I do not understand is why create_item also has etag and match_condition, see the link HERE. The way optimistic concurrency control works is that we retrieve the etag from a record and use it to check whether the record has been changed while we were updating it. But when creating a new record, there is no etag to retrieve. How do we really use etag and match_condition in create_item?

The SDK is just a nice wrapper around the Cosmos REST API. According to the REST API description, the Create a Document operation does not use an etag, and as you said there's no reason why it would. It could be a leftover from writing reusable code shared across the different operations.
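For context, here is a minimal sketch of how etag and match_condition are normally used for optimistic concurrency on an update (replace_item rather than create_item), assuming the azure-cosmos v4 Python SDK; the database/container names, ids and the "counter" field below are made up for illustration:

from azure.core import MatchConditions
from azure.cosmos import CosmosClient, exceptions

client = CosmosClient(url, credential=key)  # url/key assumed to be defined elsewhere
container = client.get_database_client("mydb").get_container_client("mycontainer")

doc = container.read_item(item="item-id", partition_key="pk-value")
doc["counter"] = doc.get("counter", 0) + 1

try:
    container.replace_item(
        item=doc["id"],
        body=doc,
        etag=doc["_etag"],                              # etag captured when the item was read
        match_condition=MatchConditions.IfNotModified,  # i.e. send an If-Match precondition
    )
except exceptions.CosmosHttpResponseError as e:
    if e.status_code == 412:
        # Precondition failed: someone changed the document after we read it,
        # so re-read and retry (or surface a conflict to the caller).
        ...
    else:
        raise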

Related

Update large datasets in Apigee BaaS

I have a scenario where a collection in the BaaS will have to be updated frequently. As I understand it, to insert entities into a collection, I can make a single HTTP POST request with the payload containing an array of entities.
However, using HTTP PUT, it seems I have to send a single entity per request, and I'm not sure about its performance.
What is the best / recommended way of updating a collection with a large number of entities?
Regards
You can do a batch update using HTTP PUT. See here: http://apigee.com/docs/app-services/content/updating-collections. Notice that the ql clause is the Usergrid query language used to restrict which items get updated.
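Based on that doc page, a batch update request would look roughly like this (the organization, application, collection names and the query are placeholders, and the ql value would need to be URL-encoded in practice):

PUT /your-org/your-app/items?ql=select * where status='pending'
{ "status": "processed" }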
As far as I know, that needs to be done one entity at a time. I apologize for any inconvenience this may cause.

How should Transient Resources be retrieved in a RESTful API

For a while I was (wrongly) thinking that a RESTful API just exposed CRUD operations on persisted entities for a web application. When you code something up in "the real world" you soon find out that this is not enough. For example, a bank account transfer doesn't have to be a persisted entity. It could be a transient resource where you POST to /transfers/ and specify the details in the payload:
{"accountToCredit":1234, "accountToDebit":5678, "amount":10}
Using POST here makes sense because it changes the state on the server ($10 moves from one account to another every time this POST occurs).
What should happen in the case where it doesn't affect the server? The simple first answer would be to use GET. For example, you want to get a list of savings and checking accounts that have less than $100. You would then call something like GET on /accounts/searchResults?minBalance=0&maxBalance=100. What happens, though, if your search parameters need to use complex objects that wouldn't fit within the maximum length of a GET request?
My first thought was to use POST, but after thinking about it some more it should probably be a PUT, since it isn't changing the state of the server. From my (limited) understanding, though, I always thought of PUT as updating a resource and POST as creating a resource (like creating these search results). So which should be used in this case?
I found the following links which provide some information but it wasn't clear to me what should be used in the different cases:
Transient REST Representations
How to design RESTful search/filtering?
RESTful URL design for search
I would agree with your approach: it seems reasonable to me to use GET when searching for resources, and as said in one of the links you provided, the whole point of query strings is for doing things like search. I also agree that PUT fits better when you want to update some resource in an idempotent way (no matter how many times you repeat the request, the result will be the same).
So generally, I would do it as you propose. Now, if you are limited by the maximum length of a GET request, then you could use POST or PUT, passing your parameters as JSON to a URI like:
PUT /api/search
You could see this as a "search resource" to which you send new parameters. I know it seems like a workaround, and you may be worried that REST is about avoiding verbs in URIs. Well, there are a few cases where it's still acceptable and RESTful to use verbs, e.g. in cases where calculation or conversion is involved in generating the result (for more about this, check this reference).
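For illustration, the account search from the question could then be expressed as a request body instead of a query string (the field names simply mirror the ones used above):

PUT /api/search
Content-Type: application/json

{ "minBalance": 0, "maxBalance": 100, "accountTypes": ["savings", "checking"] }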
PS. I think this workaround is still RESTful, but even if it weren't, REST isn't an obsession or an ultimate goal. Being pragmatic and keeping a clean API design might be a better approach, even if in a few cases you are not strictly RESTful.

RESTful Alternatives to DELETE Request Body

While the HTTP 1.1 spec seems to allow message bodies on DELETE requests, it seems to indicate that servers should ignore them since there are no defined semantics for them.
4.3 Message Body
A server SHOULD read and forward a message-body on any request; if the
request method does not include defined semantics for an entity-body,
then the message-body SHOULD be ignored when handling the request.
I've already reviewed several related discussions on this topic on SO and beyond, such as:
Is an entity body allowed for an HTTP DELETE request?
Payloads of HTTP Request Methods
HTTP GET with request body
Most discussions seem to concur that providing a message body on a DELETE may be allowed, but is generally not recommended.
Further, I've noticed a trend in various HTTP client libraries where more and more enhancement requests seem to be getting logged for these libraries to support request bodies on DELETE. Most libraries seem to oblige, although occasionally with a little bit of initial resistance.
My use case calls for the addition of some required metadata on a DELETE (e.g. the "reason" for deletion, along with some other metadata required for deletion). I've considered the following options, none of which seem completely appropriate and inline with HTTP specs and/or REST best practices:
Message Body - The spec indicates that message bodies on DELETE have no semantic value; not fully supported by HTTP clients; not standard practice
Custom HTTP Headers - Requiring custom headers is generally against standard practices; using them is inconsistent with the rest of my API, none of which requires custom headers; further, there is no good HTTP response available to indicate bad custom header values (probably a separate question altogether)
Standard HTTP Headers - No standard headers are appropriate
Query Parameters - Adding query params actually changes the Request-URI being deleted; against standard practices
POST Method - (e.g. POST /resourceToDelete { deletemetadata }) POST is not a semantic option for deleting; POST actually represents the opposite action desired (i.e. POST creates resource subordinates; but I need to delete the resource)
Multiple Methods - Splitting the DELETE request into two operations (e.g. PUT delete metadata, then DELETE) splits an atomic operation into two, potentially leaving an inconsistent state. The delete reason (and other related metadata) are not part of the resource representation itself.
My first preference would probably be to use the message body, second to custom HTTP headers; however, as indicated, there are some downsides to these approaches.
Are there any recommendations or best practices inline with REST/HTTP standards for including such required metadata on DELETE requests? Are there any other alternatives that I haven't considered?
Despite some recommendations not to use the message body for DELETE requests, this approach may be appropriate in certain use cases. This is the approach we ended up using after evaluating the other options mentioned in the question/answers, and after collaborating with consumers of the service.
While the use of the message body is not ideal, none of the other options were a perfect fit either. The request-body DELETE allowed us to easily and clearly add semantics around the additional data/metadata that needed to accompany the DELETE operation.
I'd still be open to other thoughts and discussions, but wanted to close the loop on this question. I appreciate everyone's thoughts and discussions on this topic!
Given the situation you have, I would take one of the following approaches:
Send a PUT or PATCH: I am deducing that the delete operation is virtual, given the need for a delete reason. Therefore, I believe updating the record via a PUT/PATCH operation is a valid approach, even though it is not a DELETE operation per se.
Use the query parameters: The resource URI is not being changed. I actually think this is also a valid approach. The question you linked was talking about not allowing the delete if the query parameter was missing. In your case, I would just use a default reason if the reason is not specified in the query string. The resource will still be resource/:id. You can make it discoverable with Link headers on the resource for each reason (with a rel tag on each to identify the reason). Examples of both options follow below.
Use a separate endpoint per reason: Using a URL like resource/:id/canceled. This does actually change the Request-URI and is definitely not RESTful. Again, Link headers can make this discoverable.
Remember that REST is not law or dogma. Think of it more as guidance. So, when it makes sense to not follow the guidance for your problem domain, don't. Just make sure your API consumers are informed of the variance.
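To make the first two options concrete (the resource path, reason value and field names are illustrative only):

PATCH /resource/42
Content-Type: application/json

{ "state": "deleted", "reason": "duplicate entry" }

DELETE /resource/42?reason=duplicate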
What you seem to want is one of two things, neither of which is a pure DELETE:
You have two operations, a PUT of the delete reason followed by a DELETE of the resource. Once deleted, the contents of the resource are no longer accessible to anyone. The 'reason' cannot contain a hyperlink to the deleted resource. Or,
You are trying to alter a resource from state=active to state=deleted by using the DELETE method. Resources with state=deleted are ignored by your main API but might still be readable to an admin or someone with database access. This is permitted - DELETE doesn't have to erase the backing data for a resource, only to remove the resource exposed at that URI.
Any operation which requires a message body on a DELETE request can, at its most general, be broken down into a POST to do all the necessary tasks with the message body, followed by a DELETE. I see no reason to break the semantics of HTTP.
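For instance, that decomposition into two requests might look roughly like this (the paths and field names are illustrative):

POST /resource/42/deletion-metadata
Content-Type: application/json

{ "reason": "duplicate entry" }

DELETE /resource/42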
I suggest you include the required metadata as part of the URI hierarchy itself. An example (Naive):
If you need to delete entries based on a date range, instead of passing the start date and end date in the body or as query parameters, structure the URI in such a way that you pass the required information as part of the URI.
e.g.
DELETE /entries/range/01012012/31122012 -- Delete all entries between 01 January 2012 and 31 December 2012
Hope this helps.
I would say that query parameters are part of the resource definition; thus you can use them to define the scope of your operation and then "apply" the operation.
My conclusion is that the Query Parameters option, as you defined it, is the best approach.

What's the suggested way of storing a resource ETag?

Where should I store the ETag for a given resource?
Approach A: compute on the fly
Get the resource and compute the ETag on the fly upon each request:
$resource = $repository->findByPK($id); // query
// Compute ETag
$etag = md5($resource->getUpdatedAt());
$response = new Response();
$response->setETag($etag);
$response->setLastModified($resource->getUpdatedAt());
if ($response->isNotModified($this->getRequest())) {
    return $response; // 304
}
Approach B: storing at database level
Saving a bit of CPU time while making INSERT and UPDATE statements a bit slower (we use triggers to keep the ETag updated):
$resource = $repository->findByPK($id); // query
$response = new Response();
$response->setETag($resource->getETag());
$response->setLastModified($resource->getUpdatedAt());
if ($response->isNotModified($this->getRequest())) {
    return $response;
}
Approach C: caching the ETag
This is like approach B but ETag is stored in some cache middleware.
I suppose it would depend on the cost of having the data that goes into the ETag readily available.
I mean, the user sends along a request for a given resource; this should trigger a retrieval operation on the database (or some other operation).
If the retrieval is something simple such as fetching a file, then inquiring about the file stats is fast, and there's no need to store anything anywhere: an MD5 of the file path plus its update time is enough.
If the retrieval implies querying a database, then it depends on whether you can decompose the query without losing performance (e.g., the user requests an article by ID. You might retrieve the relevant data from the article table only, so a cache "hit" will entail a single SELECT on a primary key. But a cache "miss" means you have to query the database again, wasting the first query - or not - depending on your model).
If the query (or sequence of queries) is well-decomposable (and the resulting code maintainable) then I'd go with the dynamic ETag again.
If it is not, then most depends on the query cost and the overall cost of maintaining a stored-ETag solution. If the query is costly (or the output is bulky) and INSERTs/UPDATEs are few, then (and, I think, only then) it will be advantageous to store a secondary column (or table) with the ETag.
As for the caching middleware, I don't know. If I had a framework keeping track of everything for me, I might say 'go for it' -- the middleware is supposed to take care of and implement the points above. Should the middleware be implementation-agnostic (unlikely, unless it's a cut-and-paste slap-on ... which is not unheard of), then there would be either the risk of it "screening" updates to the resource, or an excessive awkwardness in invoking some cache-clearing API upon updates. Both factors would need to be evaluated against the load improvement offered by ETag support.
I don't think that in this case a 'silver bullet' exists.
Edit: in your case there is little - or even no - difference between cases A and B. To be able to implement getUpdatedAt(), you would need to store the update time in the model.
In this specific case I think the dynamic, explicit calculation of the ETag (case A) would be simpler and more maintainable. The retrieval cost is incurred in any case, and the explicit calculation cost is that of an MD5 calculation, which is really fast and completely CPU-bound. The advantages in maintainability and simplicity are, in my opinion, overwhelming.
On a semi-related note, it occurs to me that in some cases (infrequent updates to the database and much more frequent queries to the same) it might be advantageous and almost transparent to implement a global Last-Modified time for the whole database. If the database has not changed, then there is no way that any query to the database can return varied resources, no matter what the query is. In such a situation, one would only need to store the Last-Modified global flag in some easy and quick to retrieve place (not necessarily the database). For example
function dbModified() {
    touch('.last-update'); // creates the file, or updates its modification time
}
in any UPDATE/DELETE code. The resource would then add a header
function sendModified() {
    $tsstring = gmdate('D, d M Y H:i:s ', filemtime('.last-update')) . 'GMT';
    Header("Last-Modified: " . $tsstring);
}
to inform the browser of that resource's modification time.
Then, any request for a resource that includes If-Modified-Since could be bounced back with a 304 without ever accessing the persistence layer (or at least saving all persistent resource access). No record-level update time would (have to) be needed:
function ifNotModified() {
    // Check out timezone settings. The GMT helps but it's not always the ticket
    $ims = isset($_SERVER['HTTP_IF_MODIFIED_SINCE'])
        ? strtotime($_SERVER['HTTP_IF_MODIFIED_SINCE'])
        : -1; // This ensures the test will FAIL
    if (filemtime('.last-update') <= $ims) {
        // The database was never updated after the resource retrieval.
        // There's no way the resource may have changed.
        exit(Header('HTTP/1.1 304 Not Modified'));
    }
}
One would put the ifNotModified() call as early as possible in the resource supply route, the sendModified() call as early as possible in the resource output code, and the dbModified() call wherever the database gets modified significantly as far as resources are concerned (i.e., you can and probably should avoid it when logging access statistics to the database, as long as they do not influence the resources' content).
In my opinion persisting ETags is a BAD IDEA, unless your business logic is ABOUT persisting ETags - for example, when you write an application to track users based on ETags and this is a business feature :).
Potential savings in execution time will be small or non-existent. The downsides of this solution are certain and will grow as your application grows.
According to the specification, the same version of a resource may give different ETags depending on the endpoint from which it has been obtained.
From http://en.wikipedia.org/wiki/HTTP_ETag:
"Comparing ETags only makes sense with respect to one URL—ETags for resources obtained from different URLs may or may not be equal, so no meaning can be inferred from their comparison."
From this you may conclude that you should persist not just the ETag but also its endpoint, and store as many ETags as you have endpoints. Sounds crazy?
Even if you want to ignore the HTTP specification and just provide one ETag per entity without any metadata about its endpoints, you are still coupling at least two layers (caching and business logic) that ideally should not be mixed. The idea behind having entities (versus some loose data) is to keep the business logic in them separate and decoupled, not to pollute them with concerns about networking, view-layer data or... caching.
IMHO, this depends on how often resources are updated vs. how often they are read.
If each ETag is read only once or twice between modifications, then just calculate it on the fly.
If your resources are read far more often than they're updated, then you'd better cache the ETags, recalculating them every time the resource is modified (so you don't have to bother with out-of-date cached ETags).
If resources are modified almost as often as they're read, then I'd still cache the ETags, especially since it seems your resources are stored in a database.

What's the justification behind disallowing partial PUT?

Why does an HTTP PUT request have to contain a representation of the 'whole' state of the resource, and why can't it just be a partial representation?
I understand that this is the existing definition of PUT - this question is about the reason(s) why it would be defined that way.
i.e:
What is gained by preventing partial PUTs?
Why was preventing idempotent partial updates considered an acceptable loss?
PUT means what the HTTP spec defines it to mean. Clients and servers cannot change that meaning. If clients or servers use PUT in a way that contradicts its definition, at least the following might happen:
PUT is by definition idempotent. That means a client (or intermediary!) can repeat a PUT any number of times and be sure that the effect will be the same. Suppose an intermediary receives a PUT request from a client. When it forwards the request to the server, there is a network problem. The intermediary knows by definition that it can retry the PUT until it succeeds. If the server uses PUT in a non-idempotent way, these potential multiple calls will have an undesired effect.
If you want to do a partial update, use PATCH or use POST on a sub-resource and return 303 See Other to the 'main' resource, e.g.
POST /account/445/owner/address
Content-Type: application/x-www-form-urlencoded
street=MyWay&zip=22222&city=Manchaster
303 See Other
Location: /account/445
EDIT: On the general question why partial updates cannot be idempotent:
A partial update cannot be idempotent in general because the idempotency depends on the media type semantics. IOW, you might be able to specify a format that allows for idempotent patches, but PATCH cannot be guaranteed to be idempotent in every case. Since the semantics of a method cannot be a function of the media type (for orthogonality reasons), PATCH needs to be defined as non-idempotent. And PUT (being defined as idempotent) cannot be used for partial updates.
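As a hypothetical illustration using JSON Patch (RFC 6902): the first partial update appends an element on every execution and is therefore not idempotent, while the second sets a field to a fixed value and happens to be idempotent (the resource path and fields are made up):

PATCH /tickets/17
Content-Type: application/json-patch+json

[{ "op": "add", "path": "/tags/-", "value": "urgent" }]

PATCH /tickets/17
Content-Type: application/json-patch+json

[{ "op": "replace", "path": "/status", "value": "closed" }]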
Because, I guess, this would have translated into inconsistent "views" when multiple concurrent clients access the state. There isn't a "partial document" semantics in REST as far as I can tell, and probably the benefits of adding it, in the face of the complexity of dealing with that semantics in the context of concurrency, weren't worth the effort.
If the document is big, there is nothing preventing you from building multiple independent documents and having an overarching document that ties them together. Furthermore, once all the bits and pieces are collected, a new document can be collated on the server, I guess.
So, considering one can "work around" these "limitations", I can understand why this feature didn't make the cut.
Short answer: ACIDity of the PUT operation and the state of the updated entity.
Long answer:
RFC 2616, Section 9.5: the "POST method requests the enclosed entity to be accepted as a new subordinate of the requested URL". Section 9.6: the "PUT method requests the enclosed entity to be stored at the specified URL".
Since every time you execute POST the semantic is to create a new entity instance on the server, POST constitutes an ACID operation. But repeating the same POST twice with the same entity in the body might still result in a different outcome, if for example the server has run out of storage to store the new instance that needs to be created - thus, POST is not idempotent.
PUT on the other hand has the semantics of updating an existing entity. There's no guarantee that even if a partial update is idempotent, it is also ACID and results in a consistent and valid entity state. Thus, to ensure ACIDity, the PUT semantics require the full entity to be sent. Even if it was not a goal for the HTTP protocol authors, the idempotency of the PUT request would happen as a side effect of the attempt to enforce ACID.
Of course, if the HTTP server has close knowledge of the semantic of the entities, it can allow partial PUTs, since it can ensure through server-side logic the consistency of the entity. This however requires tight coupling between the data and the server.
With a full document update it's obvious, without knowing any details of the particular API or its limitations on the document structure, what the resulting document will be after the update.
If a certain method were known to never be a partial content update, and an API someone provided only supported that method, then it would always be clear what someone using the API would have to do to change a document to have a given set of valid contents.
