DDP vs Straight MongoDB access for synching large amounts of data

DDP vs Straight MongoDB access for synching large amounts of data - meteor

We are building an app in Meteor that will be participating in an education ecosystem.
There are a number of applications (e.g. a GradeBook, a Student Information System, a Reporting System...) that will all need to have their data stores kept in synch with Meteor. The datastore size will be in the hundreds of thousands of documents.
My understanding is that DDP is used to connect "clients" to a Meteor app (by subscribing to feeds when Meteor is pushing data changes and RPC to get the data in to Meteor). And a "client" is generally scoped to a user...so the size of the data set is relatively small compared to the universe of data (a teacher might have access to 100 of the 250K documents).
If I connected a Reporting System (as a "client") to Meteor with DDP, all data in the store would need to be synched...does that mean that every time the Reporting System lost the connection to Meteor, all data would be re-sent from Meteor to the DDP client? (because the Reporting System is interested in ALL the data)...and if that's the case, DDP wouldn't be the way to keep application in synch, right?...it's meant more for much smaller scoped data sets....and we should probably be interacting directly with Mongo to keep things synch.
Thanks!
Mike

based on this
http://meteor.com/blog/2012/03/21/introducing-ddp
Distributed Data Protocol. DDP is a standard way to solve the biggest problem facing client-side JavaScript developers: querying a server-side database, sending the results down to the client, and then pushing changes to the client whenever anything changes in the database.
it seems clear that any new DDP client, receives all data and then deltas as the data changes.
i would suggest that if your 'client' doesnt need reactivity / realtime updates / 2 way synching, you should pull the data directly from mongo and avoid the overhead of 'syncing'. for a 'reporting system' this should be perfectly acceptable, grab a bunch of data, generate reports. you shouldnt care about changing data in this context, just a snapshot and reports from that snapshot.
if you do need the more real time features, DDP is likely worth the overhead and initial setup difficulty.

I think nate's answer goes perfect on what you should do especially considering the volume of data. And if you need to display a whole lot of data if you're using pages to use a paginated subscription so that you can enjoy the realtime functionality (if you decide to use it) without downloading it all at once. Keep in mind though that at the moment the data is sent down like this (for each session, so if the tab is closed and reopened it will be redone):
1 - Connect to DDP Server/Proxy (Long Polling now due to websocket issues with chrome)
2 - Establish a 'subscription'
3 - Fetch all data relevant to subscription (initial download)
4 - Subscription is complete, now the client will 'listen' for changes
5 - Any updates (remove/update/insert, etc) are sent down to the client
There really isn't a sync system at this point where old data is kept offline (in a localstorage or indexed db or anything) so that step no 3 can be avoided and only the syncs from that point would occur.
This is mind, if there is a connectivity interruption (e.g losing connectivitiy for a short peroid of time Meteor will lose connection to the DDP wire and when it reconnects it download everything again as if it were from scratch.

Related

How efficient is Meteor's DDP at syncing very large collections?

Meteor's DDP protocol works very well for syncing a small collection of data from a server to a browser-based client, which inherently limits the amount of data that is processed.
However, consider a situation where Meteor is being used to sync a large collection from one server to another, or just the DDP protocol itself is used to sync one MongoDB with another.
How efficient is DDP in this case (computationally)? How well does it scale to several clients? Is the limit to performance only bandwidth or will DDP hit some CPU bound as well? What is the largest amount of data that can be reasonably synced over DDP right now? Is DDP just the wrong approach for doing this (see references below)?
Some additional thoughts:
As far as I know, the current version of DDP keeps track of each client's entire collection, so it can't be asymptotically very efficient.
Smart Collections were created to improve the performance of server-to-client collection of syncing. But it's unclear to me if this is improving DDP or something else.
See also:
How to implement real-time replication of MongoDB (or CouchDB) to many remote clients
DDP vs Straight MongoDB access for synching large amounts of data
EDIT:
After some empirical experience with this, I have to conclude that the answer is "not very efficient". See https://stackoverflow.com/a/21835534/586086 for an explanation.
Discussions with Meteor devs indicated that this problem will be addressed in the future with a revision of DDP and the publish-subscribe API, whereby the merge box will be removed and clients will handle merging. This will save CPU/memory on the server and allow for much larger datasets to be sent over the wire.

Basically it is more a matter of what and how you are publishing to the client than the number of clients. A request is usually handled in log2(N) if indexed, therefore it is quite easy for the server to recompute the result set even if (in the worst case) the whole collection would change. So, from the server side you can quite quickly get the new result sets to publish to the clients (if they changed from the one they had already).
The real problem (and common error) comes when you do publish everything to the client (like with the former autopublish), so make a publication wisely so that you do only give what the client is supposed to see. You can either prune the documents hiding useless fields or reduce the result set to send to the client by creating a publication with specific to your data scope of use parameters.
Data reactivity (session parameter bound on a publication) should also be handled with care, if for example you are sending a request each time you press a key in the search field, you might quickly overload the connection (still strongly depending on the size of the set you are publishing). We had to take care of this trying to build real estate service over meteor, the data set being over several gigabytes it was quite challenging to handle this without blocking the pipe with overloaded data.
In term of bandwidth, the DDP is quite good because it does supports clever entries updating (sending only fields changes instead of the whole document), moving an item is (will be) supported to (server side reordering).
Also take a look on this excellent answer concerning huge collections, what is done under the hood.

Meteorjs how much of a real time is it really?

I had my chance to play with this tool for a while now and made a chat application instead of a hello world. My project has 2 meteor applications sharing the same mongo database:
client
operator
when I type a message from the operator console it sometimes takes as much as 7-8 seconds to appear to the subscribed client. So my question is...how much of a real time can I expect from this meteor? Right now I can see better results with other services such as pubnub or pusher.
Should the delay come from the fact that it's 2 applications subscribed to the same db?
P.S. I need 2 applications because the client and operator apps are totally different mostly in design and media libraries (css/jquery plugins etc.) which is the only way I found to make the client app much lighter.

If you use two databases without DDP your apps are not going to operate in real time. You should either use one complete app or use DDP to relay messages to the other instance (via Meteor.connect)
This is a bit of an issue for the moment if you want to do the subscription on the server as there isn't really server to server ddp support with subscriptions yet. So you need to use the client to make the subscription:
connection = Meteor.connect("http://YourOtherMetorInstanceUrl");
connection.subscribe("messages");
Instead of
Meteor.subscribe("messages");
In your client app, of course using the same subscription names as you do for your corresponding publish functions on the other meteor instance

Akshat's answer is good, but there's a bit more explanation of why:
When Meteor is running it adds an observer to the collection, so any changes to data in that collection are immediately reactive. But, if you have two applications writing to the same database (and this is how you are synchronizing data), the observer is not in place. So it's not going to be fully real-time.
However, the server does regularly poll the database for outside changes, hence the 7-8 second delay.
It looks like your applications are designed this way to overcome the limitation Meteor has right now where all client code is delivered to all clients. Fixing this is on the roadmap.
In the mean time, in addition to Akshat's suggestion, I would also recommend using Meteor methods to insert messages. Then from client application, use Meteor.call('insertMessage', options ... to add messages via DDP, which will keep the application real-time.
You would also want to separate the databases.

Disconnected meteor application

I am interested in creating an application using the the Meteor framework that will be disconnected from the network for long periods of time (multiple hours). I believe meteor stores local data in RAM in a mini-mongodb js structure. If the user closes the browser, or refreshes the page, all local changes are lost. It would be nice if local changes were persisted to disk (localStorage? indexedDB?). Any chance that's coming soon for Meteor?
Related question... how does Meteor deal with document conflicts? In other words, if 2 users edit the same MongoDB JSON doc, how is that conflict resolved? Optimistic locking?

Conflict resolution is "last writer wins".
More specifically, each MongoDB insert/update/remove operation on a client maps to an RPC. RPCs from a given client always play back in order. RPCs from different clients are interleaved on the server without any particular ordering guarantee.
If a client tries to issue RPCs while disconnected, those RPCs queue up until the client reconnects, and then play back to the server in order. When multiple clients are executing offline RPCs, the order they finally run on the server is highly dependent on exactly when each client reconnects.
For some offline mutations like MongoDB's $inc and $addToSet, this model works pretty well as is. But many common modifiers like $set won't behave very well across long disconnects, because the mutation will likely conflict with intervening changes from other clients.
So building "offline" apps is more than persisting the local database. You also need to define RPCs that implement some type of conflict resolution. Eventually we hope to have turnkey packages that implement various resolution schemes.

Caching data and notifying clients about changes in data in ASP.NET

We are thinking to make some architectural changes in our application, which might affect the technologies we'll be using as a result of those changes.
The change that I'm referring in this post is like this:
We've found out that some parts of our application have common data and common services, so we extracted those into a GlobalServices service, with its own master data db.
Now, this service will probably have its own cache, so that it won't have to retrieve data from the db on each call.
So, when one client makes a call to that service that updates data, other clients might be interested in that change, or not. Now that depends on whether we decide to keep a cache on the clients too.
Meaning that if the clients will have their own local cache, they will have to be notified somehow (and first register for notifications). If not, they will always get the data from the GlobalServices service.
I need your educated advice here guys:
1) Is it a good idea to keep a local cache on the clients to begin with?
2) If we do decide to keep a local cache on the clients, would you use
SqlCacheDependency to notify the clients, or would you use WCF for
notifications (each might have its cons and pros)
Thanks a lot folks,
Avi

I like the sound of your SqlCacheDependency, but I will answer this from a different perspective as I have worked with a team on a similar scenario. We created a master database and used triggers to create XML representations of data that was being changed in the master, and stored it in a TransactionQueue table, with a bit of meta data about what changed, when and who changed it. The client databases would periodically check the queue for items it was interested in, and would process the XML and update it's own tables as necessary.
We also did the same in reverse for the client to update the master. We set up triggers and a TransactionQueue table on the client databases to send data back to the master. This in turn would update all of the other client databases when they next poll.
The nice thing about this is that it is fairly agnostic on client platform, and client data structure, so we were able to use the method on a range of legacy and third party systems. The other great point here is that you can take any of the databases out of the loop (including the master - e.g. connection failure) and the others will still work fine. This worked well for us as our master database was behind our corporate firewall, and the simpler web databases were sitting with our ISP.
There are obviously cons to this approach, like race hazard, so we were careful with the order of transaction processing, error handling, de-duping etc. We also built a management GUI to provide a human interaction layer before important data was changed in the master.
Good luck! Tim

Flex - best strategy for keeping client data in synch with backend database?

In an Adobe flex applicaiton using BlazeDS AMF remoting, what is the best stategy for keeping the local data fresh and in synch with the backend database?
In a typical web application, web pages refresh the view each time they are loaded, so the data in the view is never too old.
In a Flex application, there is the tempation to load more data up-front to be shared across tabs, panels, etc. This data is typically refreshed from the backend less often, so there is a greater chance of it being stale - leading to problems when saving, etc.
So, what's the best way to overcome this problem?
a. build the Flex application as if it was a web app - reload the backend data on every possible view change
b. ignore the problem and just deal with stale data issues when they occur (at the risk of annoying users who are more likely to be working with stale data)
c. something else
In my case, keeping the data channel open via LiveCycle RTMP is not an option.

a. Consider optimizing back-end changes through a proxy that does its own notification or poling: it knows if any of the data is dirty, and will quick-return (a la a 304) if not.
b. Often, users look more than they touch. Consider one level of refresh for looking and another when they start and continue to edit.
Look at BuzzWord: it locks on edit, but also automatically saves and unlocks frequently.
Cheers

If you can't use the messaging protocol in BlazeDS, then I would have to agree that you should do RTMP polling over HTTP. The data is compressed when using RTMP in AMF which helps speed things up so the client is waiting long between updates. This would also allow you to later scale up to the push methods if the product's customer decides to pay up for the extra hardware and licenses.

You don't need Livecycle and RTMP in order to have a notification mechanism, you can do it with the channels from BlazeDS and use a streaming/long polling strategy

In the past I have gone with choice "a". If you were using Remote Objects you could setup some cache-style logic to keep them in sync on the remote end.
Sam

Can't you use RTMP over HTTP (HTTP Polling)?
That way you can still use RTMP, and although it is much slower than real RTMP you can still braodcast updates this way.
We have an app that uses RTMP to signal inserts, updates and deletes by simply broadcasting RTMP messages containing the Table/PrimaryKey pair, leaving the app to automatically update it's data. We do this over Http using RTMP.

I found this article about synchronization:
http://www.databasejournal.com/features/sybase/article.php/3769756/The-Missing-Sync.htm
It doesn't go into technical details but you can guess what kind of coding will implement this strategies.
I also don't have fancy notifications from my server so I need synchronization strategies.
For instance I have a list of companies in my modelLocator. It doesn't change really often, it's not big enough to consider pagination, I don't want to reload it all (removeAll()) on each user action but yet I don't want my application to crash or UPDATE corrupt data in case it has been UPDATED or DELETED from another instance of the application.
What I do now is saving in a SESSION the SELECT datetime. When I come back for refreshing the data I SELECT WHERE last_modified>$SESSION['lastLoad']
This way I get only rows modified after I loaded the data (most of the time 0 rows).
Obviously you need to UPDATE last_modified on each INSERT and UPDATE.
For DELETE it's more tricky. As the guy point out in his article:
"How can we send up a record that no longer exists"
You need to tell flex which item it should delete (say by ID) so you cannot really DELETE on DELETE :)
When a user delete a company you do an UPDATE instead: deleted=1
Then on refresh companies, for row where deleted=1 you just send back the ID to flex so that it makes sure this company isn't in the model anymore.
Last but not the least, you need to write a function that clean rows where deleted=1 and last_modified is older than ... 3days or whatever you think suits your needs.
The good thing is that if a user delete a row by mistake it's still in the database and you can save it from real delete within 3days.

Rather than caching on flex client, why not do caching on server side? Some reasons,
1) When you cache data on server side, its centralized and you can make sure all clients have the same state of data
2) There are much better options available on server side for caching rather than on flex. Also you can have a cron job which refreshes data based on some frequency say every 24 hours.
3) As data is cached on server and it doesn't need to fetch it from db every time, communication with flex will be much faster
Regards,
Tejas

Develop Reference

r css asp.net wordpress firebase qt symfony nginx http apache-flex