rsync vs SyncML (Funambol) - rsync

I would like some idea about how rsync compares to SyncML/Funambol, especially when it comes to bandwidth, sync over unstable network and multiple clients to one server.
This is to sync several mobile devices with a directory structure of growing text-files. (Se we essentially want as much as possible on the server, and inconsistent files is not really a problem, also we know where changes originates).
So far, it seems Funambol doesn't compress, doesn't handle partial updates, and it is difficult to handle interruptions in a file-transfer.
I know rsync doesn't go through the server, but I don't quite see how that is a disadvantage.

Olav,
rsync can:
Compress the data (as you said) - thus gaining better performances over the net.
Synchronize only the newest data within each file - thus, once again, saving time.
Can be ran by multiple users at the same time. It's a very basic backup software behavior.
And one of my favorites: work over a secure shell.
You might want to check Rsyncrypto, for compressing and encrypting at the same time.
Dotan

Related

Hosting big files for users

We need to be able to supply big files to our users. The files can easily grow to 2 or 3GB. These files are not movies or similiar. They are software needed to control and develop robots in an educational capacity.
We have some conflict in our project group in how we should approach this challenge. First of all, Bittorrent is not a solution for us (despite the goodness it could bring us). The files will be availiable through HTTP (not FTP) and via a filestream so we can control who gets access to the files.
As a former pirate in the early days of the internet i have often struggled with corrupt files and using filehashes and filesets to minimize the amount of redownload required. I advocate a small application that downloads and verifies a fileset and extracts the big install file once it is completely downloaded and verified.
My colleagues don't think this is nessecary and point to the TCP/IP protocols inherit capabiltities to avoid corrupt downloads. They also mention that Microsoft has moved away from a downloadmanager for their MSDN files.
Are corrupt downloads still a widespread issue or will the amount of time we spend creating a solution to this problem be wasted, compared to the amount of people who will actually be affected by it?
If a download manager is the way to go, what approach would you suggest we take?
-edit-
Just to clearify. Is downloading 3GB of data in one chunk, over HTTP a problem OR should we make our own EXE that downloads the big file in smaller chunks (and verifies them).
You do not need to go for your own download manager. You can use some really smart approach.
Split files in smaller chunks, let's say 100MB each. So even if a download is corrupted, user will end-up downloading with that particular chunk.
Most of web servers are capable of understanding and treating/serving range headers. You can recommend the users to use download manager / browser add-ons which can use this capacity. If your users are using unix/linux systems, wget is such a utility.
Its true that TCP/IP has capacities of preventing corruption but it basically assumes that network is still up and accessible. #2 mentioned above can be one possible work-around to the problems where network was completely down in middle of download.
And finally, it is always good to provide file hash to your users. This is not only to ensure the download but also to ensure the security of the software that you are distributing.
HTH

How do you set up a large scale Alfresco CIFS server?

Alfresco provides a CIFS connector so it can act just a normal file-server in your intranet.
Compared with a "normal" (windows/samba) based fileserver, certain operations can really hurt the system, e.g. listing a folder with a few thousand files using windows explorer. Not quite sure, but I think permission checking is the primary reason for this case. Anyways, now assume you have a big filesystem hierarchy exposed and many users using CIFS, stressing the system, effectively "knocking it down".
What is the suggested approach to scale / improve performance ?
In my experience Windows Explorer is part of the CIFS performance issue. I don't have exact numbers, but I remember working on an instance with roughly 500GB data, mostly composed of small images and a few texts in a not well balanced folder tree, for which listing a folder with a thousand children was taking in Explorer around a minute to display. The same operation was taking around 3s on Chrome browser.
We never had time to investigate the issue thoroughly, but we saw an impressive amount of traffic generated by Explorer due to prefetch of information of the subfolders of the currently open folder.
Been revisiting the issue a little, and I guess the best answer I can give for now is: Tweak the cache(s).
I used a 5k children space, default cache values and benchmarked executing "ls -alrt" on the CIFS mount running alfresco 4.0.d.
The first execution took roughly two minutes bombarding the (lightning fast) mysql database with approx 200k queries.
The second execution took "only" around 40 seconds, but the amount of queries did not change significantly.
Increasing the CIFS fileinfo cache, I got the second time down to 30 seconds, but I still see 160k DB queries firing. I'm fairly sure this lions share has to do with permissions/ACLs and it should be possible improve the situation a lot.
PS: Windows Explorer definitely behaves a little unexpected, but I cannot confirm that it makes a significant difference regarding user experience.
PPS: https://issues.alfresco.com/jira/browse/ALFCOM-2951
PPPS: I'll look into this further when I find the time - should be this year. ;)
Update: The massive amount of queries is no permission issue.
Permission checks definitely IS a part of the problem. I can't link to anything specific, but browsing alfresco forums and the net for the last few years I've learned that permissions can hurt the performance.
I've read (and experienced) in several scenarios that alfresco spaces with large numbers of children (1000+) can be painfully slow. One part you noticed yourself: it takes a while to go through 100-200k queries. But hook up something into alfresco to watch what's it doing and you'll see that massive amounts of time go on serialization/deserialization (e.g.webscripts for share) and also node traversal (hence the thousands of queries and averages of 400-500 qps when nobody is logged on).
So you're on the right way with your cache optimizations.
Do you have dedicated hardware for your installation? I've had big issues with performance, but I've moved the MySQL server to a separate box (server-grade hardware - 4 cores, 8GB ram, SSD for myqsl server and SAS for tomcat server etc) and I gained a lot. So, get on with begging for the new hardware too :)
I think you're on the right path here.

RAMDisk reliability

I was looking at the existing RAMDisk discussions ... and none seem to bring up any reliability issues. I recently started using a Dataram ramdisk for my source code and am wondering if there are any risks I should be concerned with.
It did speed up the compile time by 30%
I am not fully familiar with that product, but the answer probably depends on whether you have a (good) UPS, and what you are using to sync changes with your hard drive. I had looked into this a while ago (on a linux machine) mapping a portion of the ram as a disk and using RSYNC to persist changes to the hard drive, but discontinued the idea and got a faster hard drive instead :) I would be very interested in seeing this working...

Tools to diagnose slow compile/build reasons

In a number of situations as a programmer, I've found that my compile times are slower than I would like, and I want to understand the reason and fix them. Particular language (yes, I'm using C/C++) tricks have already been discussed, and we apply many of them. I've also seen this question and realize it's related. What I'm more interested in is what tools people use to diagnose hardware/system bottlenecks in the build process. Is there a standard way to prove "Disk read/writes are too slow for our builds - we need SSD's!" or "The anti-virus settings are killing our build times!", etc...?
Resources I've found, none directly related to compiling performance diagnosis:
A TechNet article about using PerfMon (Quite good, close to what I'd like)
This IBM link detailing some PerfMon information, but it's not specific to compiling and appears somewhat out of date.
A webpage specifically describing diagnosis of avg disk queue length
Currently, diagnosing a slow build is very much an art, and my tools of choice are:
PerfMon
Process Explorer
Process Monitor
Push hard enough to get a machine to "just try it". (Basically, trial and error.)
What do others do to diagnose system-level build performance bottlenecks?
Can we come up with a list of PerfMon or Process Explorer statistics to watch for, with thresholds for whats "acceptable" on a modern machine?
PerfMon:
CPU -> % of processor time
MEMORY -> Page/sec
DISK -> Avg. disk queue length
Process Explorer:
CPU -> CPU
DISK -> I/O Delta Total
MEMORY -> Page Faults
I resolved a "too slow build" time issue with Eclipse and Spring recently. For me the solution was to use the Vista Resource Monitor (which identified CPU spiking but not consistently high) and quite a bit of disk activity. I then used Procmon from Sysinternals to identify exactly which files were being heavily accessed.
Part of our build process also involves checking external Maven (binary file) repositories for updates every build. I disabled that check (which also gives me perfect control over when I update those dependencies). If you have resources external to the build machine, benchmark how long it takes to access them (source control, maven, etc.).
Since I'm stuck on 32-bit Vista for now, I decided to try creating a Ramdisk with the 700MB of non-addressable memory (PC has 4GB, Vista only exposes 3.3GB) and place the heavily accessed files as identified by Procmon on the Ramdisk using a nice trick of creating drive junctions to make that move transparent to my IDE. For details see here.
I have used filemon to see the header files that a C++ build was most often opening then used:
“#ifndef checks” so header files are only included once
Precompiled headers
Combined some small header files
Reduce the number of header files included by other header files by tidying up the code.
However these days I would start with a RamDisk and or SSD, but opening lot of header files still uses lots of CPU time.

How to avoid pauses when editing code on a network drive?

I'm planning on doing more coding from home but in order to do so, I need to be able to edit files on a Samba drive on our dev server. The problem I've run into with several editors is that the network latency causes the editor to lock up for long periods of time (Eclipse, TextMate). Some editors cope with this a lot better than others, but are there any file system or other tweaks I can make to minimize the impact of lag?
A few additional points:
There's a policy against having company data on personal machines, so I'd like to avoid checking out the code locally.
The mount is over a PPTP VPN connection.
Mounting to Linux or OS X client
Use a source control system — Subversion, Perforce, Git, Mercurial, Bazaar, etc. — so you're never editing code on a shared server. Instead you should be editing a local work area and committing changes to a repository located on the network.
Also, convince your company to adapt their policy such that company code is allowed on personal machines if it's on an encrypted volume. Encrypted disk images that you can use for this are trivial to create using Disk Utility, and can use strong cryptography. You can get even more security by not storing your encryption passphrase in your keychain, and instead typing it every time you mount the encrypted volume; this means that even if your local user account is compromised, as long as you don't have the volume mounted, nobody else will be able to mount it.
I did this all the time when I was consulting and none of my clients — some of whom had similar rules about company code — ever had a problem with it once I explained how things worked. (I think some of them even started using encrypted disk images even within their offices.)
Remate plugin simply disables this dreadful refresh-on-focus feature.
Download, unpack, doubleclick and choose "Disable Refresh on Regaining Focus" from "Window" menu (you can refresh manually by right-clicking project in drawer). Voila!
If you are accessing the data from your personal computer, it is in your RAM, so we will assume that you just can't store it on your hard drive, floppy, USB stick, etc.
Your solution is a RAM drive. Copy the files you need to edit there using whatever method you prefer (I would suggest source control) and then you can edit them without lag. When you are done commit them back to the server.
As was pointed out your editor may be caching changes to your temp directory, or maybe even your swap file (if it is in memory, then it can get swapped out). The solution to that is get a much larger RAM drive and run a Virtual Machine in the RAM drive. Not sure what OS you are running, but you can get a pretty slim install of most OS's if all you are doing is editing source code.
If you don't have enough RAM, then get a Gigabyte i-RAM solid state drive and remove the battery, that way it will lose everything when you power down.
Set your VMWare to not allow the OS to swap any of the virtual machine. Keep a baseline VM on your hard drive and copy it to your RAM drive before booting it up. Then you can use the hard drive in the VM like a hard drive, even though it is RAM.
Might be a good idea to run a secure erase on your RAM drive before powering down. Also keep in mind that they have found if you super cool a RAM chip before removing it from a functioning computer, and place it in a new computer quick enough, the data may still be intact.
I guess it all comes down to how detailed that policy is, and how it is interpreted.
Good luck!
Short answer: you can do no trick. CIFS is really geared towards LAN with a reasonably calm trafic, so you have zero chance to not suffer intermittent lag accessing a share through a VPN. The editor at some point needs to access the file in blocking IO, because it makes no real sense to do otherwise.
You could switch editor and use Emacs + TRAMP which is geared to work on remote files.

Resources