How to pre-process Riak data entry before sending it to a client - riak

I have a Riak installation with many nodes. It stores entries with relatively big blobs on the value side. In some cases, clients need only a subset of this data, so there is no need to transfer all of it over the network.
Is there any elegant way to pre-process this data on the server side before actually sending it to the client?
The only idea I have is to install a small "agent" on every node that interacts with the client over its own protocol and acts as a proxy, reducing the data based on the query.
But such a solution will only work if I can determine (based on the key) which node a particular entry is stored on. Is there a way to do that?

You may be able to do that with MapReduce, if you specify a single bucket/key pair as input. The map function would run locally on the node where the data is stored and send its result to the reduce function on whichever node received the request from the client. Since you are providing a specific key, there shouldn't be any coverage folding, which is what causes the heavy load the docs warn about.
This would mean that your requests would effectively be using r=1, so if there is ever an outage you would get some false not found results.
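For illustration, a rough sketch of such a single-object MapReduce job against Riak's HTTP /mapred interface; the host, bucket, key, and field names are made up, and it assumes a fetch-capable environment (Node 18+ or a browser):
// A single bucket/key pair as input: the map phase runs on the node that
// holds the object and returns only the fields the client needs, so the
// big blob never crosses the network.
const job = {
  inputs: [["users", "user-42"]],          // one bucket/key pair
  query: [{
    map: {
      language: "javascript",
      source: `function (value) {
        var obj = Riak.mapValuesJson(value)[0];
        return [{ id: obj.id, name: obj.name }]; // drop the heavy fields
      }`
    }
  }]
};

fetch("http://riak-node:8098/mapred", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify(job)
})
  .then(function (res) { return res.json(); })
  .then(function (result) { console.log(result); });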

Related

Would real world Meteor application use server side Methods almost exclusively?

I'm learning Meteor and fundamentally enjoy how fast I can build data-driven applications. However, as I went through the Creating Posts chapter in the Discover Meteor book, I learned about using server-side Methods. Specifically, the primary reason (and there are a number of very valid reasons to use these) was the timestamp: you wouldn't want to rely on the client date/time, you'd want to use the server date/time.
Makes sense, except that in almost every application I've ever built we store the date/time of row create/update in a column. Effectively every single create or update to the database records a date/time, which in Meteor now looks like it would require server-side Methods to ensure data integrity.
If I'm understanding correctly, that pretty much eliminates the ease of use and real-time nature of a client-side Collection, because I'll need to use Methods for almost every single update and create to our databases.
Just wanted to check and see how everyone else is doing this in the real world. Are you just calling a server-side Method that returns the date/time and then using a client-side Collection, or something else?
Thanks!
The short answer to this question is that yes, every operation that affects the server's database will go through a server-side method. The only difference is whether you are defining this method explicitly or not.
When you are just getting started with Meteor, you will probably do insert/update/remove operations directly on client collections using allow/deny validators, which check whether the operation is permitted. This usage is actually calling predefined methods on both the server and the client (for a collection named foo you have /foo/insert, for example), which simply check the specified validators before doing the operation. As you become more familiar with Meteor you will probably override these default methods, for the reasons you described (among others).
When using your own methods, you will typically want to define a method both on the server and the client, just as the default collection functions do for you. This is because of Meteor's latency compensation, which allows most client operations to be reflected immediately in the browser without any noticeable lag, as long as they are permitted. Meteor does this by first simulating the effect of a method call in the client, updating the client's cached data temporarily, then sending the actual method call to the server. If the server's method causes a different set of changes than the client's simulation, the client's cache will be updated to reflect this when the server method returns. This also means that if the client's method would have done the same as the server, we've basically allowed for an instant operation from the perspective of the client.
By defining your own methods on the server and client, you can extend this to fit your own needs. For example, if you want to insert timestamps on updates, have the client insert its own timestamp in the simulation method. The server will insert an authoritative timestamp, which will replace the client's timestamp when the method returns. From the client's perspective, the insert operation will be instant, except for an update to the timestamp if the client's clock happens to be way off. (By the way, you may want to check out my timesync package for displaying relative server time accurately on the client.)
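For example, a minimal sketch of that pattern, assuming an illustrative Posts collection and insertPost method defined in shared code (loaded on both client and server):
// The client runs this as a simulation and stamps the document with its
// local clock; when the same method runs on the server, the server clock
// wins and the client's cached document is patched.
Posts = new Mongo.Collection("posts");

Meteor.methods({
  insertPost: function (attrs) {
    check(attrs, { title: String, body: String });
    return Posts.insert({
      title: attrs.title,
      body: attrs.body,
      createdAt: new Date()  // client time in the simulation, server time on the server
    });
  }
});

// Client-side call: the new post shows up instantly via the simulation.
Meteor.call("insertPost", { title: "Hello", body: "First post" });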
A final note: it's good to understand what scope you are doing collection operations in, as this was one of the things that originally confused me about Meteor. For example, if you have a collection instance Foo on the client, Foo.insert() in normal client code will call the default pair of client/server methods. However, Foo.insert() in a client method will run only in a simulation and will never call server code, so you will need to define the same method on the server and make sure you do Foo.insert() there as well for the method to work properly.
A good rule of thumb for moving forward is to replace groups of validated collection operations with your own methods that do the same operations, and then add specific extra features on the server and client respectively.
In short: yes!
Publications exist to send out a 'live', dynamic subset of the database to the client, sending DDP added messages for existing records, followed by a ready, and then added, changed, and removed messages to keep the client's cache consistent.
Methods exist to cause Mongo updates, directly or indirectly, and as Andrew mentioned, they are always in use.
But truly, because of Meteor's publication architecture, any edit to a collection that is currently being published to at least one client will be published via DDP, regardless of the source of the change to Mongo, even an outside process.
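As a rough illustration (reusing the hypothetical Posts collection from the sketch above):
// Server: publish only a bounded, "live" slice of the collection.
Meteor.publish("recentPosts", function () {
  return Posts.find({}, { sort: { createdAt: -1 }, limit: 20 });
});

// Client: subscribing opens a DDP subscription. The server streams added
// messages for the current matches, then ready, and afterwards
// changed/removed messages keep the client cache consistent, regardless
// of which process modified Mongo.
Meteor.subscribe("recentPosts");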

How send request with AFNetworking 2 in strict sequential order?

I'm doing a sync to mirror a local sqlite DB to one on a server.
I have a Master-Detail table, where the details must be sent to the server ASAP. However, it is possible that detail 3 arrives before detail 2. I need to mimic the steps made to the document and respect the order of the operations.
When a record is saved locally, I send a notification and then post the data. How can I guarantee a strict sequential order using AFNetworking?
By default, operations run concurrently, with no guarantee of order. The only way to ensure that requests execute one after another is to prevent more than one request operation from running at a given time, by setting the manager's operationQueue.maxConcurrentOperationCount property to 1 (or, if you're not using a manager, by enqueueing operations into an operation queue whose property is set the same way).

How efficient is Meteor's DDP at syncing very large collections?

Meteor's DDP protocol works very well for syncing a small collection of data from a server to a browser-based client, which inherently limits the amount of data that is processed.
However, consider a situation where Meteor is being used to sync a large collection from one server to another, or just the DDP protocol itself is used to sync one MongoDB with another.
How efficient is DDP in this case (computationally)? How well does it scale to several clients? Is the limit to performance only bandwidth or will DDP hit some CPU bound as well? What is the largest amount of data that can be reasonably synced over DDP right now? Is DDP just the wrong approach for doing this (see references below)?
Some additional thoughts:
As far as I know, the current version of DDP keeps track of each client's entire collection, so it can't be asymptotically very efficient.
Smart Collections were created to improve the performance of server-to-client collection syncing. But it's unclear to me whether this improves DDP itself or something else.
See also:
How to implement real-time replication of MongoDB (or CouchDB) to many remote clients
DDP vs Straight MongoDB access for synching large amounts of data
EDIT:
After some empirical experience with this, I have to conclude that the answer is "not very efficient". See https://stackoverflow.com/a/21835534/586086 for an explanation.
Discussions with Meteor devs indicated that this problem will be addressed in the future with a revision of DDP and the publish-subscribe API, whereby the merge box will be removed and clients will handle merging. This will save CPU/memory on the server and allow for much larger datasets to be sent over the wire.
Basically it is more a matter of what and how you are publishing to the client than of the number of clients. A query is usually handled in O(log N) time if it is indexed, so it is quite easy for the server to recompute the result set even if (in the worst case) the whole collection changed. So, from the server side you can quite quickly get the new result sets to publish to the clients (if they changed from the ones the clients already had).
The real problem (and a common error) comes when you publish everything to the client (as with the former autopublish), so build your publications wisely and give the client only what it is supposed to see. You can either prune the documents by hiding useless fields, or reduce the result set sent to the client by creating a publication whose parameters are specific to your data's scope of use.
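A minimal sketch of such a scoped publication, assuming an illustrative Listings collection and field names:
// Server: parameters narrow the result set and a fields projection prunes
// the heavy parts of each document.
Meteor.publish("listingsInCity", function (city, limit) {
  check(city, String);
  check(limit, Number);
  return Listings.find(
    { city: city },
    {
      fields: { title: 1, price: 1, city: 1 },  // hide useless/large fields
      limit: Math.min(limit, 100)               // cap what a client can pull
    }
  );
});

// Client: subscribe only to the slice the current view needs.
Meteor.subscribe("listingsInCity", "Paris", 50);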
Data reactivity (a session parameter bound to a publication) should also be handled with care: if, for example, you send a request each time a key is pressed in the search field, you can quickly overload the connection (still depending strongly on the size of the set you are publishing). We had to take care of this while building a real-estate service over Meteor; with a data set of several gigabytes, it was quite challenging to handle without blocking the pipe with overloaded data.
In terms of bandwidth, DDP is quite good because it supports clever entry updates (sending only the field changes instead of the whole document), and moving an item is (or will be) supported too (server-side reordering).
Also take a look at this excellent answer concerning huge collections and what is done under the hood.

When a Firebase node syncs, is the full new value sent to the server or just the difference?

And does it work the same way when an object is being updated via a callback like ref.on('value', ...?
I tried to figure it out myself in the Chrome dev tools but wasn't able to.
This makes a difference for me because I'm working on an app where users might store large amounts of text. If only diffs are sent over the wire, it's a lot more lightweight and I can sync much more frequently. If full values are sent, I wouldn't want to do that.
When data is written, the Firebase client currently sends all of the data being written to the server. If you write a large object and then rewrite the whole object with the same object again, the entire object will be sent over the wire (this may change in the future, but that's the current implementation).
When sending data from the server back out to other clients, we do perform some optimization and don't transmit some of the duplicate data.
Firebase is designed to allow you to granularly access data. I would strongly suggest you address into the data that is changing and only update the relevant portions. For example:
//inefficient method:
ref.set(HUGE_BLOCK_OF_JSON);
//efficient method:
ref.child("a").child("b").child("c").set(SOME_SMALL_PIECE_OF_DATA);
When you address into a piece of data, only that small piece is transmitted and rebroadcast to other clients.
Firebase is intended for true real-time apps where updates are made as soon as data changes. If you find yourself intentionally caching changes for a while and saving them as big blobs for performance reasons, you should probably be breaking up your data and only writing the relevant portions.

Consequences of POST not being idempotent (RESTful API)

I am wondering if my current approach makes sense or if there is a better way to do it.
I have multiple situations where I want to create new objects and let the server assign an ID to those objects. Sending a POST request appears to be the most appropriate way to do that.
However since POST is not idempotent the request may get lost and sending it again may create a second object. Also requests being lost might be quite common since the API is often accessed through mobile networks.
As a result I decided to split the whole thing into a two-step process:
First sending a POST request to create a new object which returns the URI of the new object in the Location header.
Secondly performing an idempotent PUT request to the supplied Location to populate the new object with data. If a new object is not populated within 24 hours the server may delete it through some kind of batch job.
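For illustration, a minimal sketch of this two-step flow, assuming a hypothetical /orders endpoint:
// The POST only reserves an ID and returns its URI in the Location header;
// the PUT that fills in the data is idempotent and can be retried safely.
fetch("https://api.example.com/orders", { method: "POST" })
  .then(function (res) {
    const location = res.headers.get("Location");
    return fetch(location, {
      method: "PUT",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ item: "book", quantity: 1 })
    });
  });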
Does that sound reasonable or is there a better approach?
The only advantage of POST-creation over PUT-creation is the server generation of IDs.
I don't think that is worth the loss of idempotency (and the resulting need to remove duplicates or empty objects).
Instead, I would use a PUT with a UUID in the URL. With a UUID generator you can be nearly certain that the ID you generate client-side will be unique server-side.
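A rough sketch of that PUT-based creation, assuming a hypothetical /orders endpoint and an environment with crypto.randomUUID (modern browsers or Node 16.7+); retrying is safe because the same URI is written again:
// Client generates the identity, so a repeated request rewrites the same
// resource instead of creating a duplicate.
const id = crypto.randomUUID();

fetch(`https://api.example.com/orders/${id}`, {
  method: "PUT",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ item: "book", quantity: 1 })
}).then(function (res) {
  console.log(res.status); // e.g. 201 on first creation, 200/204 on a retry
});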
Well, it all depends. To start with, you should talk more about URIs, resources, and representations, and not be concerned about objects.
The POST method is designed for non-idempotent requests, or requests with side effects, but it can be used for idempotent requests.
On POST of form data to /some_collection/:
  normalize the natural key of your data (e.g. lowercase the Title field for a blog post)
  calculate a suitable hash value (e.g. in the simplest case, your normalized field value)
  look up the resource by hash value
  if none is found:
    generate a server identity and create the resource
    respond => 201 Created, Location: /some_collection/<new_id>
  if found, but no updates should be carried out due to app logic:
    respond => 302 Found (Moved Temporarily) or 303 See Other
    (the client will need to GET that resource, which might include fields required for updates, like version numbers)
  if found, and updates may occur:
    respond => 307 Temporary Redirect, Location: /some_collection/<id>
    (like a 302, but the client should reuse the original HTTP method, and may do so automatically)
A suitable hash function might be as simple as some concatenated fields; for large fields or values, a truncated MD5 could be used. See hash function for more details.
I've assumed that:
  you need a different identity value than a hash value
  the data fields used for identity can't be changed
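To make the flow concrete, a rough Express sketch of the hash-based lookup above; the route, fields, and in-memory storage are purely illustrative:
const express = require("express");
const crypto = require("crypto");

const app = express();
app.use(express.json());

const byHash = new Map();    // hash of the normalized natural key -> resource id
const resources = new Map(); // resource id -> resource

app.post("/some_collection/", (req, res) => {
  // Normalize the natural key and hash it (a truncated md5 would also do).
  const normalized = String(req.body.title || "").toLowerCase();
  const hash = crypto.createHash("md5").update(normalized).digest("hex");

  const existingId = byHash.get(hash);
  if (existingId) {
    // Repeat POST: point the client at the resource that already exists.
    return res.redirect(303, "/some_collection/" + existingId);
  }

  // First time we see this key: mint a server identity and create the resource.
  const id = crypto.randomUUID();
  resources.set(id, { id, title: req.body.title });
  byHash.set(hash, id);
  res.status(201).location("/some_collection/" + id).end();
});

app.listen(3000);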
Your method of generating IDs at the server, in the application, in a dedicated request-response, is a very good one! Uniqueness is very important, but clients, like suitors, are going to keep repeating the request until they succeed, or until they get a failure they're willing to accept (unlikely). So you need to get uniqueness from somewhere, and you only have two options: either the client, with a GUID as Aurélien suggests, or the server, as you suggest. I happen to like the server option. Seed columns in relational DBs are a readily available source of uniqueness with zero risk of collisions. Around 2000, I read an article advocating this solution, called something like "Simple Reliable Messaging with HTTP", so this is an established approach to a real problem.
Reading REST stuff, you could be forgiven for thinking a bunch of teenagers had just inherited Elvis's mansion. They're excitedly discussing how to rearrange the furniture, and they're hysterical at the idea they might need to bring something from home. The use of POST is recommended because it's there, without ever broaching the problems with non-idempotent requests.
In practice, you will likely want to make sure all unsafe requests to your api are idempotent, with the necessary exception of identity generation requests, which as you point out don't matter. Generating identities is cheap and unused ones are easily discarded. As a nod to REST, remember to get your new identity with a POST, so it's not cached and repeated all over the place.
Regarding the sterile debate about what idempotent means, I say it needs to be everything. Successive requests should generate no additional effects, and should receive the same response as the first processed request. To implement this, you will want to store all server responses so they can be replayed, and your ids will be identifying actions, not just resources. You'll be kicked out of Elvis's mansion, but you'll have a bombproof api.
But now you have two requests that can be lost? And the POST can still be repeated, creating another resource instance. Don't over-think stuff. Just have the batch process look for dupes. Possibly have some "access" count statistics on your resources to see which of the dupe candidates was the result of an abandoned post.
Another approach: screen incoming POSTs against some log to see whether a request is a repeat. That should be easy to detect: if the body content of a request is the same as that of a request just x time ago, consider it a repeat. You could also check extra parameters like the originating IP, the same authentication, and so on.
No matter what HTTP method you use, it is theoretically impossible to make an idempotent request without generating the unique identifier client-side, temporarily (as part of some request checking system) or as the permanent server id. An HTTP request being lost will not create a duplicate, though there is a concern that the request could succeed getting to the server but the response does not make it back to the client.
If the end client can easily delete duplicates and they don't cause inherent data conflicts it is probably not a big enough deal to develop an ad-hoc duplication prevention system. Use POST for the request and send the client back a 201 status in the HTTP header and the server-generated unique id in the body of the response. If you have data that shows duplications are a frequent occurrence or any duplicate causes significant problems, I would use PUT and create the unique id client-side. Use the client created id as the database id - there is no advantage to creating an additional unique id on the server.
I think you could also collapse the creation and update requests into a single request (an upsert). To create a new resource, the client POSTs to a "factory" resource, located for example at /factory-url-name, and the server returns the URI for the new resource.
Why don't you use a request ID at your originating point? Your originating point should do two things: first, send a GET request with request_id=2 to see whether its request has been applied (e.g. a response saying the person was created as part of request_id=2).
This ensures your originating system knows which request was executed last, since the request ID is stored in the DB.
Second, if your originating point finds that the last request is still at 1 and not yet 2, it may try again with 3, to cover the case where just the GET response was lost but request 2 was in fact created in the DB.
You can introduce a limit on the number of tries for your GET request, a wait time before firing the GET again, and similar mechanisms.
