Can you sacrifice performance to get concurrency in Sqlite on a NFS? - sqlite

I need to write a client/server app stored on a network file system. I am quite aware that this is a no-no, but was wondering if I could sacrifice performance (Hermes: "And this time I mean really slash.") to prevent data corruption.
I'm thinking something along the lines of:
Create a separate file in the system everytime a write is called (I'm willing do it for every connection if necessary)
Store the file name as the current millisecond timestamp
Check to see if the file with that time or earlier exists
If the same one exists wait a random time between 0 to 10 ms, and try again.
While file is the earliest timestamp, do work, delete file lock, otherwise wait 10ms and try again.
If a file persists for more than a minute, log as an error, stop until it is determined that the data is not corrupted by a person.
The problem I see is trying to maintain the previous state if something locks up. Or choosing to ignore it, if the state change was actually successful.
Is there a better way of doing this, that doesn't involve not doing it this way? Or has anyone written one of these with a lot less problems than the Sqlite FAQ warns about? Will these mitigations even factor in to preventing data corruption?
A couple of notes:
This must exist on an NSF, the why is not important because it is not my decision to make (it doesn't look like I was clear enough on that point).
The number of readers/writers on the system will be between 5 and 10 all reading and writing at the same time, but rarely on the same record.
There will only be clients and a shared memory space, there is no way to put a server on there, or use a server based RDMS, if there was, obviously I would do it in a New York minute.
The amount of data will initially start off at about 70 MB (plain text, uncompressed), it will grown continuous from there at a reasonable, but not tremendous rate.
I will accept an answer of "No, you can't gain reasonably guaranteed concurrency on an NFS by sacrificing performance" if it contains a detailed and reasonable explanation of why.

Yes, there is a better way. Don't use NFS to do this.
If you are willing to create a new file every time something changes, I expect that you have a small amount of data and/or very infrequent changes. If the data is small, why use SQLite at all? Why not just have files with node names and timestamps?
I think it would help if you described the real problem you are trying to solve a bit more. For example if you have many readers and one writer, there are other approaches.

What do you mean by "concurrency"? Do you actually mean "multiple readers/multiple writers", or can you get by with "multiple readers/one writer with limited latency"?

Related

Copying 100 GB with continues change of file between datacenters with R-sync is good idea?

I have a datacenter A which has 100GB of the file changing every millisecond. I need to copy and place the file in Datacenter B. In case of failure on Datacenter A, I need to utilize the file in B. As the file is changing every millisecond does r-sync can handle it at 250 miles far datacenter? Is there any possibility of getting the corropted file? As it is continuously updating when we call this as a finished file in datacenter B ?
rsync is a relatively straightforward file copying tool with some very advanced features. This would work great for files and directory structures where change is less frequent.
If a single file with 100GB of data is changing every millisecond, that would be a potential data change rate of 100TB per second. In reality I would expect the change rate to be much smaller.
Although it is possible to resume data transfer and potentially partially reuse existing data, rsync is not made for continuous replication at that interval. rsync works on a file level and is not as commonly used as a block-level replication tool. However there is an --inplace option. This may be able to provide you the kind of file synchronization you are looking for. https://superuser.com/questions/576035/does-rsync-inplace-write-to-the-entire-file-or-just-to-the-parts-that-need-to
When it comes to distance, the 250 miles may result in at least 2ms of additional latency, if accounting for the speed of light, which is not all that much. In reality this would be more due to cabling, routers and switches.
rsync by itself is probably not the right solution. This question seems to be more about physics, link speed and business requirements than anything else. It would be good to know the exact change rate, and to know if you're allowed to have gaps in your restore points. This level of reliability may require a more sophisticated solution like log shipping, storage snapshots, storage replication or some form of distributed storage on the back end.
No, rsync is probably not the right way to keep the data in sync based on your description.
100Gb of data is of no use to anybody without without the means to maintain it and extract information. That implies structured elements such as records and indexes. Rsync knows nothing about this structure therefore cannot ensure that writes to the file will transition from one valid state to another. It certainly cannot guarantee any sort of consistency if the file will be concurrently updated at either end and copied via rsync
Rsync might be the right solution, but it is impossible to tell from what you have said here.
If you are talking about provisioning real time replication of a database for failover purposes, then the best method is to use transaction replication at the DBMS tier. Failing that, consider something like drbd for block replication but bear in mind you will have to apply database crash recovery on the replicated copy before it will be usable at the remote end.

How Redis RDB persistance actually works behind the scene?

i was going through Redis RDB persistence. I having some doubts regarding RDB persistence related to its disadvantage.
Understanding So far:
We should use rdb persistence when we need to save the snapshot of dataset currently in memory at some regular interval.
I can understand that in this way we can lose some data in case of server break down. But another disadvantage that i can't understand is how fork can be time consuming when persisting large dataset using rdb.
Quoting from Documentation
RDB needs to fork() often in order to persist on disk using a child
process. Fork() can be time consuming if the dataset is big, and may
result in Redis to stop serving clients for some millisecond or even
for one second if the dataset is very big and the CPU performance not
great. AOF also needs to fork() but you can tune how often you want to
rewrite your logs without any trade-off on durability.
I know how fork works as per my knowledge When parent process forks it create a new Child process and we can allow some code that child process will execute based on its pid or we can provide it some new executable that it will work on using exec() system call.
but things that i don't understand how it will be heavy task when size of dataset is larger?
I think i know the answer but i m not sure about that
Quoted from this link https://www.bottomupcs.com/fork_and_exec.xhtml
When a process calls fork then
the operating system will create a new process that is exactly the same as the parent process. This means all the state that was talked about previously is copied, including open files, register state and all memory allocations, which includes the program code.
As per above statement whole dataset of redis will be copied to child.
Am i understanding right?
When standard fork is called with copy-on-write the OS must still copy all the page table entries, which can take time time if you have small 4k pages and a huge dataset, this is what makes the actual fork() time slow.
You can also find a lot of time and memory is required if your dataset is changing a lot in a sparse way, as copy-on-write semantics triggers the actual memory pages to be copied as changes are made to the original. Redis also performs incremental rehashing and maintains expiry etc. so an instance that is more active will typically take longer to save to disk.
More reading:
Faster forking of large processes on Linux?
http://kirkwylie.blogspot.co.uk/2008/11/linux-fork-performance-redux-large.html

How to calculate duration for a BerkeleyDB dump/load operation for a given BDB file?

I'm using a 3rd party application that uses BerkeleyDB for its local datastore (called BMC Discovery). Over time, its BDB files fragment and become ridiculously large, and BMC Software scripted a compact utility that basically uses db_dump piped into db_load with a new file name, and then replaces the original file with the rebuilt file.
The time it can take for large files is insanely long, and can take hours, while some others for the same size take half that time. It seems to really depend on the level of fragmentation in the file and/or type of data in it (I assume?).
The utility provided uses a crude method to guestimate the duration based on the total size of the datastore (which is composed of multiple BDB files). Ex. if larger than 1G say "will take a few hours" and if larger than 100G say "will take many hours". This doesn't help at all.
I'm wondering if there would be a better, more accurate way, using the commands provided with BerkeleyDB.6.0 (on Red Hat), to estimate the duration of a db_dump/db_load operation for a specific BDB file ?
Note: Even though this question mentions a specific 3rd party application, it's just to put you in context. The question is generic to BerkelyDB.
db_dump/db_load are the usual (portable) way to defragment.
Newer BDB (like last 4-5 years, certainly db-6.x) has a db_hotbackup(8) command that might be faster by avoiding hex conversions.
(solutions below would require custom coding)
There is also a DB->compact(3) call that "optionally returns unused Btree, Hash or Recno database pages to the underlying filesystem.". This will likely lead to a sparse file which will appear ridiculously large (with "ls -l") but actually only uses the blocks necessary to store the data.
Finally, there is db_upgrade(8) / db_verify(8), both of which might be customized with DB->set_feedback(3) to do a callback (i.e. a progress bar) for long operations.
Before anything, I would check configuration using db_tuner(8) and db_stat(8), and think a bit about tuning parameters in DB_CONFIG.

What file format do people use when logging data to a FAT32 file system using a 8bit microcontroller?

Updated question to be less vague.
I plan to log sensor data by time so something like sqlite would be perfect, but it requires too much resources in something like an atmega328p. Most of the searching will be done off the uC.
What do other people use? Flat text files? XML? A more complicated data structure?
Thanks for the feedback. It is good to know what other people are using. I've decided to serialize my data structures and save them in a binary file to eliminate string processing on the uC for now.
I've used flat text files for similar projects, albiet years ago, but I believe it's still a good approach for that environment. Since you don't need to process the data on-chip, you want it to be as efficient as possible (as little overhead as possible).
However, if you want more flexibility and weren't as concerned about space, perhaps saving JSON objects would be better, where each field is identified clearly. A tiny bit of overhead for creating the objects, but allows you to add and remove fields without complex logic on the interpreting side. I would pick JSON over XML just because you have about half the overhead (in space, and probably in processing).
With a small micro-controller like the 328, it is very important to determine the space requirements.
How big is each record? How many records do you want to store? How will you get the records off of the micro-controller?
Like Doug, I usually use a flat text to store data. So each record might contain year, day of the year and a value if I am storing a value once a day.
The file would look like:
11,314,100<cr>
11,315,99<cr>
11,316,98<cr>
11,317,220<cr>
You could store approximately 90-100 records, requiring you to dump the data every three months
If you need more then the 1kEEprom holds (200 5 byte records, 100 10 byte records or simliar) then you will need additional memory using an IC, SD or Flash.
If you want to unplug the memory and plug it into a PC, then the SD or Flash would be best.
You could use a vinculum chip from FTDIChip.com to simplify writing the fat files to flash drive.

Most bandwidth efficient unidirectional synchronise (server to multiple clients)

What is the most bandwidth efficient way to unidirectionally synchronise a list of data from one server to many clients?
I have sizeable chunk of data (perhaps 20,000, 50-byte records) which I need to periodically synchronise to a series of clients over the Internet (perhaps 10,000 clients). Records may added, removed or updated only at the server end.
Something similar to bittorrent? Or even using bittorrent. Or maybe invent a wrapper around bittorrent.
(Assuming you pay for bandwidth on your server and not the others ...)
Ok, so we've got some detail now - perhaps 10 GB of total (uncompressed) data, every 3 days, so that's 100 GB per month.
That's actually not really a sizeable chunk of data these days. Whose bandwidth are you trying to save - yours, or your clients'?
Does the data perhaps compress very readily? For raw binary data it's not uncommon to achieve 50% compression, and if the data happens to have a lot of repeated patterns within it then 80%+ is possible.
That said, if you really do need a system that can just transfer the changes, my thoughts are:
make sure you've got a well defined primary key field - use that as your key to identify each record
record a timestamp for each record to say when it last changed
have each client tell you the timestamp of the last change it knows of, so you can calculate the deltas
ensure that full downloads are possible too, in case clients get out of sync

Resources