What is the most bandwidth efficient way to unidirectionally synchronise a list of data from one server to many clients?
I have a sizeable chunk of data (perhaps 20,000 50-byte records) which I need to periodically synchronise to a series of clients over the Internet (perhaps 10,000 clients). Records may be added, removed or updated, but only at the server end.
Something similar to BitTorrent? Or even using BitTorrent itself, or maybe a wrapper around it.
(Assuming you pay for bandwidth on your server and not the others ...)
OK, so we've got some detail now - perhaps 10 GB of total (uncompressed) data, roughly 1 MB per client across 10,000 clients, every 3 days, so that's 100 GB per month.
That's actually not really a sizeable chunk of data these days. Whose bandwidth are you trying to save - yours, or your clients'?
Does the data perhaps compress very readily? For raw binary data it's not uncommon to achieve 50% compression, and if the data happens to have a lot of repeated patterns within it then 80%+ is possible.
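If it helps, a quick way to measure how well a sample of the real data compresses (Python purely for illustration; the file name is a placeholder):

import zlib

raw = open("records.bin", "rb").read()   # e.g. a dump of the 50-byte records
packed = zlib.compress(raw, 9)
print("compressed to %.0f%% of original size" % (100.0 * len(packed) / len(raw)))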
That said, if you really do need a system that can just transfer the changes, my thoughts are as follows (a rough sketch in code follows the list):
make sure you've got a well defined primary key field - use that as your key to identify each record
record a timestamp for each record to say when it last changed
have each client tell you the timestamp of the last change it knows of, so you can calculate the deltas
ensure that full downloads are possible too, in case clients get out of sync
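As a rough sketch of that timestamp-based delta exchange (shown in Python against SQLite purely to make it concrete; the table layout, column names and the tombstone table for removed records are illustrative assumptions):

import sqlite3
import time

def changes_since(conn, last_seen_ts):
    # Records added or updated after the client's last known timestamp.
    changed = conn.execute(
        "SELECT id, payload, updated_at FROM records WHERE updated_at > ?",
        (last_seen_ts,)).fetchall()
    # Removed records, tracked in a hypothetical tombstone table so deletions
    # can be propagated as part of the delta.
    removed = [row[0] for row in conn.execute(
        "SELECT id FROM deleted_records WHERE deleted_at > ?",
        (last_seen_ts,)).fetchall()]
    return {"changed": changed, "removed": removed, "server_ts": time.time()}

# A client that has never synced (or has fallen out of sync) sends
# last_seen_ts = 0 and effectively gets the full download.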
I tried to run a Gremlin query that adds a property to vertices through the Gremlin console.
g.V().hasLabel("user").has("status", "valid").property(single, "type", "valid")
I constantly get this error:
org.apache.tinkerpop.gremlin.jsr223.console.RemoteException: Connection to server is no longer active
This error happens after the query has been running for one or two minutes.
I tried some simple queries like g.V().limit(10) and they work fine.
Since the affected vertex count is more than 4 million, I'm not sure if it is failing due to a timeout.
I also tried to split it into small batches:
g.V().hasLabel("user").has("status", "valid").hasNot("type").limit(200000).property(single, "type", "valid")
It succeeded for the first few batches and then started failing again.
Are there any recommendations for updating millions of vertices?
The precise approach you take may vary depending on the backend graph database and storage you are using. The capacity of the hardware where Gremlin Server is running, in terms of number of CPUs and, most importantly, memory, will also be a factor, as will the query timeout setting.
To do this in Gremlin, if you had a way to easily identify distinct ranges of vertices, you could split the work across multiple threads, each doing batches of updates. If the example you show is representative of your actual need, then that is likely not possible in this case.
Likewise, some graph databases provide a bulk-load capability that is often a good way to do large batch updates, but that is probably not an option here, as you essentially need to do a conditional update based on the current presence (or absence) of a property.
Without more information about your data model, hardware, etc., the best answer is probably to do two things:
Use smaller limits. Maybe try 5K or even just 1K at first and work up from there until you find a reliable sweet spot.
Increase the query timeout settings.
You may need to experiment to find what works reliably in your environment, as the capacity of the hardware will definitely play a role in situations like this, as will how you write your query.
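To make the "smaller batches" idea concrete, here is a rough sketch using gremlin-python (the server URL and batch size are assumptions, and you may still need to raise the server-side query timeout):

from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.traversal import Cardinality
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

conn = DriverRemoteConnection("ws://localhost:8182/gremlin", "g")
g = traversal().withRemote(conn)

BATCH = 1000  # start small (1K-5K) and work up to a reliable sweet spot

while True:
    # Each iteration updates at most BATCH vertices that still lack "type",
    # keeping every request well under the query timeout.
    updated = (g.V().hasLabel("user").has("status", "valid")
                .hasNot("type")
                .limit(BATCH)
                .property(Cardinality.single, "type", "valid")
                .count().next())
    if updated == 0:
        break

conn.close()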
I read in a Stack Overflow post (link here) that:
By using predictable (e.g. sequential) IDs for documents, you increase the chance you'll hit hotspots in the backend infrastructure. This decreases the scalability of the write operations.
I would like it if anyone could explain in more detail the limitations that can occur when using sequential or user-provided IDs.
Cloud Firestore scales horizontally by allocating key ranges to machines. As load increases beyond a certain threshold on a single machine, it will split the range being served and assign it to 2 machines.
Let's say you have just started writing to Cloud Firestore, which means a single server is currently handling the entire range.
When you are writing new documents with random IDs and we split the range into 2, each machine will end up with roughly the same load. As load increases, we continue to split into more machines, with each one getting roughly the same load. This scales well.
When you are writing new documents with sequential IDs and you exceed the write rate a single machine can handle, the system will try to split the range into 2. Unfortunately, one half will get no load, and the other half the full load! This doesn't scale well, as you can never get more than a single machine to handle your write load.
In the case where a single machine is handling more load than it can optimally serve, we call this "hot spotting". Sequential IDs mean we cannot scale to handle more load. Incidentally, this same concept applies to index entries too, which is why we warn against sequential index values such as timestamps of "now" as well.
So, how much is too much load? We generally say 500 writes/second is what a single machine will handle, although this will naturally vary depending on a lot of factors, such as how big a document you are writing, number of transactions, etc.
With this in mind, you can see that smaller, more consistent workloads aren't a problem, but if you want something that scales based on traffic, sequential document IDs or index values will naturally limit you to what a single machine in the database can keep up with.
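For illustration, here is roughly what relying on auto-generated random IDs looks like with the google-cloud-firestore Python client (the collection name and fields are made up, and credentials are assumed to be configured in the environment):

from google.cloud import firestore

db = firestore.Client()

# No ID supplied: the client generates a random, well-distributed document ID.
doc_ref = db.collection("users").document()
doc_ref.set({"status": "valid"})
print("wrote", doc_ref.id)

# By contrast, sequential IDs such as "user-000123", "user-000124", ... all
# land in one narrow key range and can pin the write load on a single server.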
There's an SQLite database being used to store static-sized data in a round-robin fashion.
For example, 100 days of data are stored. On day 101, day 1 is deleted and then day 101 is inserted.
The number of rows is the same between days. The individual fields in the rows are all integers (32-bit or less) and timestamps.
The database is stored on an SD card with poor I/O speed,
something like a read speed of 30 MB/s.
VACUUM is not allowed because it can introduce a wait of several seconds
and the writers to that database can't be allowed to wait for write access.
So the concern is fragmentation, because I'm inserting and deleting records constantly
without VACUUMing.
But since I'm deleting/inserting the same set of rows each day,
will the data get fragmented?
Is SQLite fitting day 101's data in day 1's freed pages?
And although the set of rows is the same,
the integers may be 1 byte one day and then 4 bytes another.
The database also has several indexes, and I'm unsure where they're stored
and if they interfere with the perfect pattern of freeing pages and then re-using them.
(SQLite is the only technology that can be used. Can't switch to a TSDB/RRDtool, etc.)
SQLite will reuse free pages, so you will get fragmentation (if you delete so much data that entire pages become free).
However, SD cards are likely to have a flash translation layer, which introduces fragmentation whenever you write to some random sector.
Whether the first kind of fragmentation is noticeable depends on the hardware, and on the software's access pattern.
It is not possible to make useful predictions about that; you have to measure it.
In theory, WAL mode is append-only, and thus easier on the flash device.
However, checkpoints would be nearly as bad as VACUUMs.
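If you want to see what actually happens on your device, a small Python sqlite3 sketch like this (the database path and the daily delete/insert step are placeholders) lets you watch page reuse across a rollover:

import sqlite3

conn = sqlite3.connect("/mnt/sdcard/ringbuffer.db")   # hypothetical SD-card path
conn.execute("PRAGMA journal_mode=WAL")               # append-mostly journal

def page_stats():
    pages = conn.execute("PRAGMA page_count").fetchone()[0]
    free = conn.execute("PRAGMA freelist_count").fetchone()[0]
    return pages, free

print("before daily rollover:", page_stats())
# ... DELETE day 1 and INSERT day 101 here ...
print("after daily rollover:", page_stats())

# If page_count stays flat and freelist_count drops back to roughly its
# previous value, the new day's rows are being packed into the freed pages
# rather than growing the file.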
I would like to use Firebase to load a 2D map on my site. Fields will be loaded dynamically as the user scrolls the map, and changed map fields should also be shown.
But I am interested in which is more efficient:
Reading a list of values even if I need only about 50% of the loaded fields (e.g. 100 loaded fields):
geoRef.startAt(null, start).endAt(null, end).on('value', callback);
maybe better:
geoRef.startAt(null, start).endAt(null, end).once('value', callback);
geoRef.startAt(null, start).endAt(null, end).on('child_changed', callback);
or reading a lot of single values (e.g. 50x):
geoRef.child(valueX).child(valueY).on('value', callback);
These reads will be triggered on every user scroll, so there will be many of these 50x vs. 1x (50%) reads.
Thanks
It's impossible to answer this question specifically without details. Are we talking 1000 records that are each 10 bytes or 1 million records that are each 5MB?
Data size and network bandwidth are the consideration here, not the number of connections. Firebase holds a socket opened to the server, so the overhead of establishing TCP connections is not a concern (as it would be with multiple HTTP requests), although the time it takes a request to return from the server (latency) is.
This leaves only two considerations: how much data and how many records.
For instance, if my system contains 1002 records and I want 1000 of them, and each is 1KB in size, it's going to be faster to simply request them all at once (since this requires only the latency of waiting for one response from the server). But if I want 10 of them, requesting them separately would likely be faster.
Even more ideal would be to segment them using priorities or split them cleverly into multiple paths by category, time frame, or another context. Then I can retrieve only segments of the data as a single transaction.
For example:
/messages/today
/messages/yesterday
/messages/all_messages
Now, assuming today is measured in hundreds and the payload is 1KB, I can just grab this whole list and iterate it client side any time I'd like--not worth the energy to grab them individually. If this is my common use case, perfect.
And assuming all_messages is measured in the millions of records, each about 1KB, then to grab 100 messages from here, I'll naturally gravitate to snagging each one individually.
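As a rough sketch of that segmentation idea with the firebase_admin Python SDK (the service-account path, database URL, path names and fields are all assumptions):

import firebase_admin
from firebase_admin import credentials, db

cred = credentials.Certificate("service-account.json")
firebase_admin.initialize_app(cred, {"databaseURL": "https://example-app.firebaseio.com"})

# One round trip for the whole "today" segment, then filter client side.
today = db.reference("messages/today").get() or {}
unread = {key: msg for key, msg in today.items() if msg.get("unread")}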
I need to write a client/server app with its data stored on a network file system. I am quite aware that this is a no-no, but was wondering if I could sacrifice performance (Hermes: "And this time I mean really slash.") to prevent data corruption.
I'm thinking something along the lines of the following (a rough code sketch comes after the list):
Create a separate file in the system every time a write is called (I'm willing to do it for every connection if necessary)
Store the file name as the current millisecond timestamp
Check to see if a file with that timestamp, or an earlier one, exists
If the same one exists, wait a random time between 0 and 10 ms, and try again.
While our file has the earliest timestamp, do the work and then delete the lock file; otherwise wait 10 ms and try again.
If a file persists for more than a minute, log it as an error and stop until a person determines that the data is not corrupted.
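To make those steps concrete, here is a rough Python sketch of the scheme (the shared directory and timeouts are illustrative assumptions, and this is not a claim that the approach is safe on NFS):

import glob
import os
import random
import time

LOCK_DIR = "/mnt/nfs_share/locks"   # hypothetical shared directory
STALE_SECONDS = 60                  # a lock older than this is treated as an error

def acquire_write_lock():
    while True:
        name = os.path.join(LOCK_DIR, "%013d.lock" % int(time.time() * 1000))
        if os.path.exists(name):                  # that millisecond is already taken
            time.sleep(random.uniform(0, 0.010))  # wait 0-10 ms and try again
            continue
        open(name, "w").close()                   # our lock file
        return name

def wait_until_earliest(my_lock):
    while True:
        locks = sorted(glob.glob(os.path.join(LOCK_DIR, "*.lock")))
        if locks[0] == my_lock:
            return                                # earliest timestamp: do the work
        if time.time() - os.path.getmtime(locks[0]) > STALE_SECONDS:
            raise RuntimeError("lock %s is over a minute old; stopping for manual check" % locks[0])
        time.sleep(0.010)                         # wait 10 ms and re-check

def release_write_lock(my_lock):
    os.remove(my_lock)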
The problem I see is trying to maintain the previous state if something locks up. Or choosing to ignore it, if the state change was actually successful.
Is there a better way of doing this that doesn't involve not doing it this way? Or has anyone written one of these with a lot fewer problems than the SQLite FAQ warns about? Will these mitigations even factor into preventing data corruption?
A couple of notes:
This must exist on an NFS; the why is not important, because it is not my decision to make (it doesn't look like I was clear enough on that point).
The number of readers/writers on the system will be between 5 and 10, all reading and writing at the same time, but rarely on the same record.
There will only be clients and a shared memory space; there is no way to put a server on there or use a server-based RDBMS. If there were, obviously I would do it in a New York minute.
The amount of data will initially start off at about 70 MB (plain text, uncompressed) and will grow continuously from there at a reasonable, but not tremendous, rate.
I will accept an answer of "No, you can't gain reasonably guaranteed concurrency on an NFS by sacrificing performance" if it contains a detailed and reasonable explanation of why.
Yes, there is a better way. Don't use NFS to do this.
If you are willing to create a new file every time something changes, I expect that you have a small amount of data and/or very infrequent changes. If the data is small, why use SQLite at all? Why not just have files with node names and timestamps?
I think it would help if you described the real problem you are trying to solve a bit more. For example if you have many readers and one writer, there are other approaches.
What do you mean by "concurrency"? Do you actually mean "multiple readers/multiple writers", or can you get by with "multiple readers/one writer with limited latency"?