Sync and back up files encrypted (using a Raspberry Pi) - encryption

I am currently looking for a way to synchronize confidential files between two PCs (and possibly an always-running Raspberry Pi, which would serve as host and backup).
On each PC I have a LUKS-encrypted partition. I want to synchronize the files in those partitions with the RPi, but I don't want to store them on the RPi in clear text.
I think the only reliable way is to encrypt the files while they are still on the PC (any other way, the files could be obtained by anyone with physical access to the RPi).
One possible way is storing the files in an encrypted partition on the RPi as well and sending the passphrase to the RPi every time I want to sync, but I did not find a simple way to do this (e.g. Unison doesn't offer such a feature), and the passphrase could be obtained by fairly simple manipulation.
The second way I thought of was storing the files in an encrypted container and synchronizing the container, but then with every little change the whole container file would have to be uploaded to the RPi.
So, is there a fast way to encrypt single files (especially only the changed ones), and possibly combine that with synchronization right away?
I read that openssl is one way of encrypting single files.
I don't know much about encryption or synchronization, but I want to find a way that is reasonably safe, not more than reasonably complex, and doesn't use any external services...
Thank you very much for reading and considering my question,
Max
Edit: One part that might solve my problem right away:
If I use a container (LUKS) and change some files, will the changes in the container file be proportional to the changes I made in the files, and will rsync only transmit the changed parts of the big container file?
Edit: After editing my question the first time I continued researching and found this article: Off Site Encrypted Backups using Rsync and AES
This article covers backing up files to a remote machine and encrypting them before transmitting them. The next step would be to compare files and use the more recent one; I can probably use a local sync mechanism (which rsync offers) if there isn't an option for that already.
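As a rough sketch of what that encrypt-then-sync approach might look like (this is not the article's actual script), one could call openssl and rsync as external commands from Python. The paths, passphrase file, and remote target below are purely illustrative, and the -pbkdf2 option assumes OpenSSL 1.1.1 or newer:

    import subprocess
    from pathlib import Path

    SRC = Path("/secure/files")          # plaintext files (on the LUKS partition)
    STAGING = Path("/secure/encrypted")  # encrypted copies that get synced
    REMOTE = "pi@raspberrypi:/backup/"   # illustrative remote target

    STAGING.mkdir(parents=True, exist_ok=True)

    for plain in SRC.rglob("*"):
        if not plain.is_file():
            continue
        # Flatten the relative path into one file name for the staging area.
        enc = STAGING / (str(plain.relative_to(SRC)).replace("/", "_") + ".enc")
        # Only re-encrypt files that changed since the last run.
        if enc.exists() and enc.stat().st_mtime >= plain.stat().st_mtime:
            continue
        subprocess.run(
            ["openssl", "enc", "-aes-256-cbc", "-pbkdf2", "-salt",
             "-in", str(plain), "-out", str(enc),
             "-pass", "file:/secure/passfile"],   # passphrase stays on the PC
            check=True,
        )

    # Push only the encrypted copies to the Pi; rsync skips unchanged files.
    subprocess.run(["rsync", "-av", str(STAGING) + "/", REMOTE], check=True)

One caveat with this approach: openssl enc uses a fresh random salt each run, so a re-encrypted file looks completely different to rsync and delta transfer does not help for files that change often. That is part of what makes the container approach from the later edits attractive.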
Edit: I finally found this discussion debating whether a TrueCrypt container could be synced via rsync. The discussion concluded that it is in fact possible. This might be the perfect solution for me then. I would still be interested in whether it is possible with LUKS containers as well (I might try that out), but I will probably simply use TrueCrypt.

This discussion presents a solution.
If a TrueCrypt container is synced with rsync, only the affected blocks of the container will be updated.
I tried out the procedure explained in the article using a LUKS container (aes-xts-plain) and it worked too. So this answers my question.
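For anyone wanting to script that procedure, here is a minimal sketch of the idea (the container path, mapper name, mount point, and remote target are assumptions, not part of the original discussion). The important part is letting rsync's delta-transfer algorithm update the container in place:

    import subprocess

    CONTAINER = "/secure/vault.img"               # LUKS container file (example path)
    REMOTE = "pi@raspberrypi:/backup/vault.img"   # example target on the RPi

    # Close the container first so the image file is in a consistent state
    # (mount point and mapper name are just examples).
    subprocess.run(["umount", "/mnt/vault"], check=False)
    subprocess.run(["cryptsetup", "luksClose", "vault"], check=False)

    # rsync's delta-transfer algorithm sends only the blocks that changed;
    # --inplace updates the remote file directly instead of rewriting a full
    # temporary copy, and --partial keeps progress if the connection drops.
    subprocess.run(
        ["rsync", "-av", "--inplace", "--partial", "--progress", CONTAINER, REMOTE],
        check=True,
    )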

Related

Distributed eventual consistency Key Value Store

I find it difficult to convince myself of the advantage of using a complex design like DynamoDB over a simple duplication strategy.
Let's say we want to build a distributed key/value data store over 5 servers (each server holds exactly the same replica).
An eventually consistent system like DynamoDB typically uses complicated conflict reconciliation, vector timestamps, etc. to achieve eventual consistency.
But instead, why couldn't we simply do the following:
For writes, the client issues the write command to all the servers, so all servers execute the clients' write commands in the same order. The system replies to the client before the servers have committed the write.
For reads, the client just does a round robin; only one server at a time handles a given read command (the other servers won't see it).
Yes, a client may see temporarily stale data, but eventually all replicas will have the same dataset, which is the same semantics as DynamoDB.
What's the disadvantage of this simple design vs. the complicated DynamoDB?
Your strategy has a few disadvantages, but their exact nature depends on details you haven't covered.
One obvious example is dealing with network segmentation. That is, when one part of your network becomes segmented (disconnected) from another part.
In this case, you have a couple of choices about how to react when you try to write some data to the server, and that fails. You might just assume that it worked, and continue as if everything was fine. If you do that, and the server later comes back on line, a read may return stale data.
To prevent that, you might treat a failed write as a true failure, and refuse to accept the write until/unless all servers confirm the write. This, unfortunately, makes the system as a whole quite fragile--in fact, much more fragile (at least with respect to writing) than if you didn't replicate at all (because if any of the servers go off-line, you can't write any more). It also has one more problem: it limits write throughput to the (current) speed of the slowest server, so even if they're all working, unless they're perfectly balanced (unlikely to happen) you're wasting capacity.
To prevent those problems, many systems (including Paxos, if memory serves) use some sort of "voting" based system. That is, you attempt to write to all the servers. You consider the write complete if and only if the majority of servers confirm that they've received the write. Likewise on a read, you attempt to read from all the servers, and you consider a value properly read if and only if the majority of servers agree on the value.
This way, up to one fewer than half the servers can be off-line at any given time, and you can still read and write data. Likewise, if you have a few servers that react a little more slowly than the rest, that doesn't slow down operations overall.
Of course, you need to fill in quite a few details to create a working system--but the fact remains that the basic concept is pretty simple, as outlined above.
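To make the voting idea concrete, here is a toy sketch of majority-quorum reads and writes. This is not Paxos, and the in-memory `servers` dictionaries merely stand in for real replicas reached over the network:

    import collections
    from typing import Dict, List, Optional, Tuple

    class QuorumStore:
        """Toy illustration of majority-quorum reads and writes."""

        def __init__(self, servers: List[Dict[str, Tuple[int, str]]]):
            self.servers = servers
            self.quorum = len(servers) // 2 + 1   # majority

        def write(self, key: str, value: str, version: int) -> bool:
            acks = 0
            for server in self.servers:
                try:
                    server[key] = (version, value)   # pretend this is an RPC
                    acks += 1
                except Exception:
                    pass                             # server down / unreachable
            return acks >= self.quorum               # succeed only with a majority

        def read(self, key: str) -> Optional[str]:
            votes = collections.Counter()
            for server in self.servers:
                if key in server:
                    votes[server[key]] += 1          # count (version, value) pairs
            if not votes:
                return None
            (version, value), count = votes.most_common(1)[0]
            return value if count >= self.quorum else None

    # With 5 replicas, up to 2 can be offline and reads/writes still succeed.
    replicas = [dict() for _ in range(5)]
    store = QuorumStore(replicas)
    assert store.write("user:1", "alice", version=1)
    print(store.read("user:1"))  # -> alice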

Downloading data directly to volatile memory

When you download a file from the internet, whether it be an FTP request, a peer-to-peer connection, etc., you are always prompted with a window asking where to store the file on your HDD or SSD (maybe you have a little NAS enclosure in your house). Either way you put it, this information is being stored to a physical drive and is not considered volatile: it is stored digitally or magnetically and remains readily available to you even after the system is restarted.
Is it possible for software to be programmed to download and store information directly to a designated location in RAM without it ever touching a form of non-volatile memory?
If this is not possible can you please elaborate on why?
Otherwise, if this is possible, could you give me examples of software that implements this, or perhaps a scenario where this would be the only way to achieve a desired outcome?
Thank you for the help. I feel this must be possible; however, I can't think of any time I've encountered it, and Google doesn't seem to understand what I'm asking.
Edit: This is being asked from the perspective of a novice programmer, someone who is looking into creating something like this. I seem to have over-inflated my own question. I suppose what I mean to ask is as follows:
How is software such as RAMDisk programmed, how exactly does it work, and are heavily abstracted languages such as C# and Java incapable of implementing such a feature?
This is actually not very hard to do, if I understand your request correctly. What you're looking for is tmpfs [1].
Carve out a tmpfs partition (if /tmp isn't tmpfs for you by default) and mount it at a location, say something like /volatile.
Then you can simply configure your browser or whatever application to download all files to that directory henceforth. Since tmpfs is essentially RAM mounted as a folder, it's reset after a reboot.
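A minimal sketch of that setup, assuming root privileges, a 512 MB size, and a /volatile mount point (all of which are just example choices):

    import os
    import subprocess
    import urllib.request

    MOUNT_POINT = "/volatile"   # example mount point

    # Mount a 512 MB tmpfs at /volatile (needs root; add an fstab entry if you
    # want it recreated on every boot). Everything under it lives in RAM only.
    os.makedirs(MOUNT_POINT, exist_ok=True)
    subprocess.run(
        ["mount", "-t", "tmpfs", "-o", "size=512m", "tmpfs", MOUNT_POINT],
        check=True,
    )

    # A file saved here never touches the disk and vanishes on unmount/reboot.
    urllib.request.urlretrieve("https://example.com/file.bin",
                               MOUNT_POINT + "/file.bin")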
Edit: The OP asks how tmpfs and related RAM-based file systems are implemented. This is usually operating-system specific, but the general idea probably remains the same: the driver responsible for the RAM file system mmap()s the required amount of memory and then exposes that memory in a way that the file system APIs typical of your operating system (for example, POSIX-y operations on Linux/Solaris/BSD) can access.
Here's a paper describing the implementation of tmpfs on Solaris [2].
Further note: If, however, you're trying to simply download something, use it, and delete it without ever hitting disk, entirely internally to your application, then you can simply allocate memory dynamically based on the size of whatever you're downloading, write the bytes into the allocated memory, and free() it once you're done.
This answer assumes you're on a Linux-y operating system. There are likely similar solutions for other operating systems.
References:
[1] https://en.wikipedia.org/wiki/Tmpfs
[2] http://www.solarisinternals.com/si/reading/tmpfs.pdf
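As an illustration of that further note, here is a small Python sketch that downloads straight into an in-memory buffer and uses the data without ever writing to disk (the URL and the zip inspection are just examples):

    import io
    import urllib.request
    import zipfile

    # Download straight into a RAM buffer; nothing is written to disk.
    with urllib.request.urlopen("https://example.com/data.zip") as response:
        buffer = io.BytesIO(response.read())

    # Use the data directly from memory, e.g. inspect a zip archive.
    with zipfile.ZipFile(buffer) as archive:
        print(archive.namelist())

    # When `buffer` goes out of scope it is garbage-collected; the data only
    # ever existed in this process's (volatile) memory.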

What are the options for transferring 60GB+ files over a network?

I'm about to start developing an application to transfer very large files without any rush, but with a need for reliability. I would like people who have worked on such a case to give me an insight into what I'm about to get into.
The environment will be an intranet FTP server, so far using active FTP on the normal ports, on Windows systems. I might also need to zip up the files before sending, and I remember working with a library once that would zip in memory and there was a limit on the size... ideas on this would also be appreciated.
Let me know if I need to clarify something else. I'm asking for general/higher-level gotchas, if any, not really detailed help. I've done apps with normal sizes (up to 1 GB) before, but for this one it seems I'd need to limit the speed so I don't kill the network, or things like that.
Thanks for any help.
I think you can get some inspiration from torrents.
Torrents generally break up the file into manageable pieces and calculate a hash of each. Later they transfer the file piece by piece. Each piece is verified against its hash and accepted only if it matches. This is a very effective mechanism: it lets the transfer happen from multiple sources and also lets it restart any number of times without worrying about corrupted data.
For a transfer from a server to a single client, I would suggest that you create a header which includes metadata about the file, so the receiver always knows what to expect, knows how much has been received, and can check the received data against the hashes.
I have implemented this idea in practice in a client-server application, though the data size was much smaller (around 1500 KB) and reliability and redundancy were important factors. This way you can also effectively control the amount of traffic you want to allow through your application.
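A rough sketch of that scheme in Python (the chunk size, function names, and file name are made up for illustration): the sender ships a manifest with a hash per piece, and the receiver accepts a piece only if its hash matches.

    import hashlib
    import json
    from pathlib import Path

    CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB pieces (arbitrary choice)

    def build_manifest(path: str) -> dict:
        """Header the sender ships first: file size plus a hash per piece."""
        p = Path(path)
        hashes = []
        with p.open("rb") as f:
            while chunk := f.read(CHUNK_SIZE):
                hashes.append(hashlib.sha256(chunk).hexdigest())
        return {"name": p.name, "size": p.stat().st_size,
                "chunk_size": CHUNK_SIZE, "sha256": hashes}

    def verify_chunk(manifest: dict, index: int, data: bytes) -> bool:
        """Receiver accepts a piece only if its hash matches the manifest."""
        return hashlib.sha256(data).hexdigest() == manifest["sha256"][index]

    # The receiver can persist the manifest and tick off verified pieces, which
    # makes it safe to resume an interrupted transfer at any point.
    if __name__ == "__main__":
        manifest = build_manifest("bigfile.bin")   # illustrative file name
        print(json.dumps(manifest, indent=2)[:200])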
I think the way to go is to use the rsync utility as an external process from Python.
Quoting from here:
"... the pieces, using checksums, to possibly existing files in the target site, and transports only those pieces that are not found from the target site. In practice this means that if an older or partial version of a file to be copied already exists in the target site, rsync transports only the missing parts of the file. In many cases this makes the data update process much faster as all the files are not copied each time the source and target site get synchronized."
And you can use the -z switch to get on-the-fly compression for the data transfer transparently; there's no need to bottle up either end compressing the whole file.
Also, check the answers here:
https://serverfault.com/questions/154254/for-large-files-compress-first-then-transfer-or-rsync-z-which-would-be-fastest
And from rsync's man page, this might be of interest:
--partial
    By default, rsync will delete any partially transferred file if the transfer is interrupted. In some circumstances it is more desirable to keep partially transferred files. Using the --partial option tells rsync to keep the partial file, which should make a subsequent transfer of the rest of the file much faster.
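Putting those pieces together, a minimal way to drive rsync from Python might look like the following (the paths, host, and bandwidth cap are illustrative; --bwlimit addresses the wish not to saturate the network):

    import subprocess

    SOURCE = "/data/export/huge_file.bin"        # illustrative paths
    DEST = "backupuser@fileserver:/incoming/"

    # -z         compress on the wire (no need to zip the file beforehand)
    # --partial  keep partially transferred files so a retry resumes cheaply
    # --bwlimit  cap throughput (KB/s) so the transfer doesn't flood the LAN
    result = subprocess.run(
        ["rsync", "-av", "-z", "--partial", "--progress",
         "--bwlimit=20000", SOURCE, DEST],
        check=False,
    )
    if result.returncode != 0:
        print(f"rsync failed with exit code {result.returncode}; retry later")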

Classic file system problem - concurrent remote processing on a directory

I have an application that processes files in a directory and moves them to another directory along with the processed output. Nothing special about that. An interesting requirement was introduced:
Implement fault tolerance and processing throughput by allowing multiple remote instances to work on the same file store.
Additional considerations are that we cannot assume a particular file system, as we support both Windows and NFS.
Of course the problem is: how do I make sure that the different instances do not try to process the same work, potentially corrupting work or reducing throughput? File locking can be problematic, especially across network shares. We could use a more sophisticated method, such as a simple database or messaging framework (a la JMS or similar), but the entire cluster needs to be fault tolerant. We can't have one database or messaging provider because of the single point of failure it introduces.
We've implemented a solution that uses multicast messages to self-discover processing instances and elect a supervisor who assigns work. There's a timeout in case the supervisor goes down, and another election takes place. Our networking library, however, isn't very mature, and our implementation of messaging is clunky.
My instincts, however, tell me that there is a simpler way.
Thoughts?
I think you can safely assume that rename operations are atomic on all the network file systems that you care about. So arrange for a piece of work to be a single file (or keyed to a single file), then have each server first list the directory containing new work, pick a piece of work, and rename the file to carry its own server name (say, machine name or IP address). For exactly one of the instances that concurrently perform the same operation, the rename will succeed, and that instance should then process the work. For the others, it will fail, so they should pick a different file from the listing they got.
For the creation of new work, assume that directory creation (mkdir) is atomic, but file creation is not (with file creation, a second writer might overwrite an existing file). So if there are also multiple producers of work, create a new directory for each piece of work.
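A small sketch of the rename-to-claim idea in Python (the shared directory, the ".work" naming convention, and the hostname suffix are assumptions made for the example):

    import os
    import socket
    from pathlib import Path

    WORK_DIR = Path("/shared/incoming")        # example shared work directory
    CLAIM_SUFFIX = "." + socket.gethostname()  # e.g. job42.work.worker-03

    def claim_next_file():
        """Try to claim one piece of work by renaming it to carry our name."""
        for candidate in sorted(WORK_DIR.glob("*.work")):   # unclaimed items only
            claimed = candidate.with_name(candidate.name + CLAIM_SUFFIX)
            try:
                os.rename(candidate, claimed)  # atomic rename: only one winner
                return claimed                 # we won the race, process this one
            except OSError:
                continue                       # another instance got it; try the next
        return None

    work = claim_next_file()
    if work is not None:
        print("processing", work)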

Sqlite over a network share [closed]

Does anyone have real world experience running a Sqlite database on an SMB share on a LAN (Windows or Linux)?
It's clear from the documentation that this is not really the fastest way to share a SQLite database.
The obvious caveats are that it may be slow, and SQLite only supports a single thread writing to the DB at a time. So you become a lot less concurrent, because your DB updates will now lock the DB for longer (the DB will be locked while data is in transit over the network).
For my application the amount of data that I would like to share is fairly small and writes are not too frequent (a few writes every few seconds at most).
What should I watch out for? Can this work?
I know this is not what SQLite was designed for. I am less interested in a Postgres/MySQL/SQL Server based solution, as I am trying to keep my app as light as possible with a minimal number of dependencies.
Related links:
From the SQLite mailing list; so I guess one big question is how unreliable the file-locking APIs are over SMB (Windows or Linux).
My experience of file-based databases (i.e. those without a database server process), which goes back over twenty years, is that if you try to share them, they will inevitably get corrupted eventually. I'd strongly suggest you look at MySQL again.
And please note, I am not picking on SQLite - I use it myself, just not as a shared database.
You asked for real-world experience. Here's some:
SQLite locking is robust, ASSUMING the underlying (networked) file system is also robust. Historically, that's been a poor assumption. Recent operating systems get it much better.
If you play by the rules, your biggest problem will be cases where the database stays "locked" for many minutes at a stretch. For example, if the network drops an "unlock" request from a reader, you might be unable to write until the lock expires. If an "unlock" from a writer goes missing, you'll be unable to read. (To be fair, you can experience the same problems with ordinary documents.)
You'll get fewer problems on a good reliable network with "opportunistic locking" (client-level file caching) disabled for the database.
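If you do go ahead with it, a small defensive sketch along those lines (the UNC path is illustrative) is to set a generous busy timeout and keep write transactions short. Note that SQLite's WAL journal mode relies on shared memory and is generally not usable across a network filesystem, so the default rollback journal is the safer choice here.

    import sqlite3

    # UNC path on the SMB share (illustrative). The timeout makes the driver
    # keep retrying for up to 30 seconds when the file is locked by another
    # client, instead of failing immediately with "database is locked".
    conn = sqlite3.connect(r"\\fileserver\share\app.db", timeout=30.0)
    conn.execute("PRAGMA busy_timeout = 30000")   # same idea, enforced by SQLite

    conn.execute("CREATE TABLE IF NOT EXISTS events (ts TEXT, payload TEXT)")

    # Keep write transactions as short as possible so the file-level lock is
    # held only briefly while data crosses the network.
    with conn:
        conn.execute("INSERT INTO events VALUES (datetime('now'), ?)", ("ping",))
    conn.close()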
Well, I am no great SQLite expert, but I believe the locking of records/tables may not work correctly and may corrupt the database. Since there is no single server maintaining central locking, two SQLite DLL instances on different machines sharing the same file over the network may not work correctly at all. If the database is opened on the same machine, SQLite can use the file-level locking offered by the OS to maintain integrity, but I doubt it works correctly over a network share.
"If you have many client programs accessing a common database over a
network, you should consider using a client/server database engine
instead of SQLite. SQLite will work over a network filesystem, but
because of the latency associated with most network filesystems,
performance will not be great. Also, the file locking logic of many
network filesystems implementation contains bugs (on both Unix and
Windows). If file locking does not work like it should, it might be
possible for two or more client programs to modify the same part of
the same database at the same time, resulting in database corruption.
Because this problem results from bugs in the underlying filesystem
implementation, there is nothing SQLite can do to prevent it."
from https://www.sqlite.org/whentouse.html
That also applies to any kind of file-based database, like Microsoft Access.
