How to setup bidirectional rsync? - rsync

I tend to run simulations on a cluster that produces files larger than 100MB and I can't sync my computer with the cluster. So I considered setting up rsync between the two by following this link.
However, I believe this is just a cron job to sync the backup server with the main server and doesn't work in both directions. What will be the stepwise instructions to set up a bidirectional rsync ?
Both the systems run linux

Rsync isn't really the right tool for this job. You can sort of get it to work, using cron jobs and extremely carefully chosen parameters, but there's significant danger of data loss, especially if you want file deletion to propagate.
I'd recommend a tool like Syncthing for bidirectional sync. You want something that maintains an independent database of what's changed and what hasn't, and real-time updates are nice to have too.

Related

Sync and back up files encrypted (using a raspberry pi)

I am currently looking for a way to synchronize confidential files between two PCs (and possibly an always running raspberry pi - would serve as a host and backup).
On each PC I have an LUKS-encrypted partition. I want to synchronize the files in those partitions with the rpi, but I don't want to store them on rpi in clear text.
I think the only reliable way is to encrypt the files while still on the PC (in every other way the files could be obtained as long as there is physical access to the rpi).
One possible way is storing the files also in a encrypted partition of the rpi and sending the pass-phrase to the rpi every time I want to sync, but I did not find an extremely simple way to do this (e.g. Unison doesn't over such a feature) + the pass-phrase could be obtained by simple manipulations.
The second way I thought of was storing the files in an encrypted container an synchronizing the container, but with every little change the whole file would have to be uploaded to the rpi.
So, is there a fast way to encrypt single files (esp. only the changed ones and possibly combine it with synchronization right away)?
I read openssl is one way of encrypting single files.
I don't know much about encryption or synchronization, but I want to find a way that is reasonably safe and not more than reasonably complex and doesn't use any external services...
Thank you very much for reading and considering my question,
Max
Edit: One part that might solve my problem right away:
If I use a container (luks) and change some files, will the changes in the container file be proportional to the changes I made in the files AND will rsync only transmit the changed parts of the big container file?
Edit: After editing my question the first time I continued researching and found this article: Off Site Encrypted Backups using Rsync and AES
This article covers backing up files to a remote machine and encrypting them before transmitting them. The next step will be to compare files and use the more recent one. I can probably use a local sync mechanism (which rsync offers) if there not an option for that already.
Edit: I finally found this discussion debating whether a truecrypt container could be synced via rsync. The discussion concluded that it in fact is possible. This might be the perfect solution for me then. I would still be interested whether it is possible with luks-containers as well (I might try that out), but I will probably simply use truecrypt.
This discussion presents a solution.
If a truecrypt container is synced by rsync only the affected blocks of the container will be updated.
I tried out the procedure explained in the article using an LUKS-container (aes-xts-plain) and it worked, too. So, this answers my question.

percona xtrabackup incremental backup vs replication

I was playing with percona xtrabackup innobackupex for incremental backups. It is a cool tool and very efficient and effective for incremental backups. However, i could not help but wonder why doing incremental backups would be any better than just doing a regular mysql master-slave replication, and whenever needed to retrieve point-in-time data, just use the binary log?
What advantages would doing incremental backups have over doing master-slave replication? When should you choose to use over the other?
One disadvantage to using master-slave replication as a backup is that accidentally running data damaging commands like
DROP TABLE users;
would replicate to the slave.
They are solutions to two different problems; master-slave is redundancy and backup is resilience.
The MySQL JDBC driver has the ability to connect to many servers. If you look at the driver options (https://dev.mysql.com/doc/connector-j/5.1/en/connector-j-reference-url-format.html) you will notice that the host option is not only host, but hosts. If you specify the URL to both the master and the slave and something happens to the master the driver will automatically connect to the slave instead.
Backup, on the other hand, is, as was mentioned earlier, a way to recover from either a catastrophic crash (having your backups stored off-site is a must) or recover from a catastrophic mistake -- neither of which is served by a master-slave setup. (Well, technically you could have the slave at a different site but that still does not cover the mistake scenario)

How to deploy MPI program?

MPI require I deploy mpi program to each machine. Currently, I put the mpi program in nfs, but this method has 2 issues, one is nfs has latency issue and the other is nfs not suitable for large cluster. I know that I could use some linux shell commands to sync up my program to each node, but looks like not so convenient. especially, when I change the program frequently. Is there any easy method to to that ?
There's nothing wrong with NFS or any other network filing system in large clusters. It just means your file server isn't sized for the job. If you replace NFS with anything like ssh, ftp, scripts, or whatever and change nothing else, I don't think that'll make any significant difference. Also, if the loading time is a significant and bothersome component of the overall runtime then why use MPI in the first place?
OK, enough of playing devils advocate. One thing you can do is to have nodes load your program onto other nodes in a binary tree type arrangement. You'll need a script that will copy the executable to two other nodes along with a copy of the script, start that script running asynchronously on those nodes and then runs the executable locally. The result would be a chain reaction of copying and running spreading across the network. The only difficult bit is in choosing which nodes to copy to so that each one is visited just once. It will be a lot faster.
Depending on the nature of the application and the nature of the NFS network, using a shared file system for both the MPI implementation and the application "should" be able to scale with reasonable performance, to a point. Keep in mind that there is some NFS caching at the node level, so multiple ranks on the same node will not each have to traverse the network to reach the files.
In general terms, I tend to advise that NFS be discontinued at about 128 nodes or 1024 ranks in favor of local installations. That advice changes if the NFS is delivered with 10GigE, IPoIB, or if a high performance file system like SFS or GPFS is used.
If you are committed to local installations, then tools like rsync, or scp are good candidates to distribute the bits. Script the final result. You can even do a tar to shared, and remote command (e.g. ssh, clush) un-tar to local disc. The "solution" only needs to be robust, not polished or elegant.
I'll also chime in to say the NFS should be just fine in this use-case, unless you have a cluster of over 100-200 nodes.
If you just want a lightweight tool for doing many-node parallel operations, I'd suggest pdsh. pdsh is a very common tool on HPC clusters. It includes a command called pdcp for doing parallel node copies, i.e.
pdcp -w node[00-99] myfile /path/to/destination/myfile
Where the nodenames are node00, node01, ... node99.
Similarly, you use the pdsh command to run a command in parallel across all the nodes. I.e.,
pdsh -w node[00-99] /path/to/my/executable
Alternatively, if you're looking for something a little less ad-hoc for doing these operations, I can recommend Ansible as an easy and lightweight configuration management and deployment tool. It's not as simple to get started as pdsh, but might be more manageable in the long run...
For example, a simple Ansible playbook to copy a tarball to all nodes, extract it, and then execute a binary might look like:
---
- hosts: computenodes
user: myname
vars:
num_procs: 32
tasks:
- name: copy and extract tarball to deployment location
action: unarchive src=myapp.tar.gz dest=/path/to/deploy/
- name: execute app
action: command mpirun -np {{num_procs}} /path/to/deploy/myapp.exe

Symfony2 Job Queue or Parallel Processing?

Does anyone know how to run a number of processes in the background either through a job queue or parallel processing.
I have a number of maintenance updates that take time to run and want to do this in the background.
I would recomment Gearman server, it prooved quite stable, it's totally outside of Symfony2, and you have to have server up and running (don't know what your hosting options are), but it distribues jobs perfectly. In skiniest version, it just keeps all jobs in-memory, but you can configure it to use sqlite database as backup, so for any reason server reboots, or gearman deamon breaks, you can just start it again, and your jobs will be perserved. I konw it has been tested with very large loads (adding up 1k jobs per second), and it stood it's ground. It's probably more stable nowdays, I'm speaking from experience 2 yrs ago, where we offloaded some long-running tasks in ZF application to background processing via Gearman. It should be quite self-explanitory how it works from image below:
Checkout RabbitMq. It's the most popular option according to knpbundles.com
Take a look at http://github.com/mmoreram/rsqueue-bundle
Uses Redis as queue core and will be mantained.
Take a look at enqueue libraty. There are a lot of transports (AMQP, STOMP, AmazonSQS, Redis, Filesystem, Doctrine DBAL and more) to choose from. Easy to use and feature rich. That would be enough for simple job queue, though if you need something more sophisticated look at enqueue/job-queue. It can run an exclusive job (only one job running at a given time) or a job with sub-jobs, or a job with something to do after it has been done.
Of course, there is a bundle for it

A program to kill long-running runaway programs

I manage Unix systems where, sometimes, programs like CGI scripts run forever, sometimes eating a lot of CPU time and wasting resources.
I want a program (typically invoked from cron) which can kill these runaways, based on the following criteria (combined with AND and OR):
Name (given by a regexp)
CPU time used
elapsed time (for programs which are blocked on an I/O)
I do not really know what to type in a search engine for this sort of program. I certainly could write it myself in Python but I'm lazy and there is may be a good program already existing?
(I did not tag my question with a language name since a program in Perl or Ruby or whatever would work as well)
Try using system-level quota enforcement instead. Most systems will allow to set per-process CPU time limit for different users.
Examples:
Linux: /etc/security/limits.conf
FreeBSD: /etc/login.conf
CGI scripts can usually be run under their own user ID, for example using mod_suid for Apache.
This might be something more like what you were looking for:
http://devel.ringlet.net/sysutils/timelimit/
Most of the watchdig-like programs or libraries are just trying to see whether a given process is running, so I'd say you'd better off writing your own, using the existing libraries that give out process information.

Resources