rdiff-backup-like storage on Artifactory - artifactory

I am looking for a way to store files in Artifactory repository in a storage efficient way and upload/download difference between local version and remote in order to save disk space, bandwidth and time.
There are two good utilities which works in this way rsync and rdiff-backup. Sure there are others.
Is there a way to organize something similar with Artifactory stack?
What is rsync:
DESCRIPTION
Rsync is a fast and extraordinarily versatile file copying tool. It can copy locally,
to/from another host over any remote shell, or to/from a remote rsync daemon. It offers
a large number of options that control every aspect of its behavior and permit very
flexible specification of the set of files to be copied. It is famous for its
delta-transfer algorithm, which reduces the amount of data sent over the network by
sending only the differences between the source files and the existing files in the des-
tination. Rsync is widely used for backups and mirroring and as an improved copy com-
mand for everyday use.

JFrog CLI includes a functionality called "Sync Deletes", allowing to sync files between the local file system and Artifactory.
This functionality is supported by both the "jfrog rt upload" and "jfrog rt download" commands. Both commands accept the optional --sync-deletes flag.
When uploading, the value of this flag specofies a path in Artifactory, under which to sync the files after the upload. After the upload, this path will include only the files uploaded during this upload operation. The other files under this path will be deleted.
The same goes for downloading, but this time, the value of the --sync-deletes flag specifies a path in the local file system, under which files which had not been downloaded from Artifactory are deleted.
Read more this in the following link:
https://www.jfrog.com/confluence/display/CLI/CLI+for+JFrog+Artifactory

Related

Generating ZIP files in azure blob storage

What is the best method to zip large files present in AZ blob storage and download them to the user in an archive file (zip/rar)
does using azure batch can help ?
currently we implement this functions in a traditionally way , we read stream generate zip file and return the result but this take many resources on the server and time for users.
i'am asking about the best technical and technologies solution (preferred way using Microsoft techs)
There are few ways you can do this **from azure-batch only point of view**: (for the initial part user code should own whatever zip api they use to zip their files but once it is in blob and user want to use in the nodes then there are options mentioned below.)
For initial part of your question I found this which could come handy: https://microsoft.github.io/AzureTipsAndTricks/blog/tip141.html (but this is mainly from idea sake and you will know better + need to design you solution space accordingly)
In option 1 and 3 below you need to make sure you user code handle the unzip or unpacking the zip file. Option 2 is the batch built-in feature for *.zip file both at pool and task level.
Option 1: You could have your *rar or *zip file added as azure batch resource files and then unzip them at the start task level, once resource file is downloaded. Azure Batch Pool Start up task to download resource file from Blob FileShare
Option 2: The best opiton if you have zip but not rar file in the play is this feature named Azure batch applicaiton package link here : https://learn.microsoft.com/en-us/azure/batch/batch-application-packages
The application packages feature of Azure Batch provides easy
management of task applications and their deployment to the compute
nodes in your pool. With application packages, you can upload and
manage multiple versions of the applications your tasks run, including
their supporting files. You can then automatically deploy one or more
of these applications to the compute nodes in your pool.
https://learn.microsoft.com/en-us/azure/batch/batch-application-packages#application-packages
An application package is a .zip file that contains the application binaries and supporting files that are required for your
tasks to run the application. Each application package represents a
specific version of the application.
With regards to the size: refer to the max allowed in blob link in the document above.
Option 3: (Not sure if this will fit your scenario) Long shot for your specific scenario but you could also mount virtual blob to the drive at join pool via mount feature in azure batch and you need to write code at start task or some thing to unzip from the mounted location.
Hope this helps :)

Docker layer reuse

Artifactory at the moment stores multiple duplicate docker image layers. If image A and image B both depend on layer SHA__12345 then artifactory will store both layer copies. Which is not a problem unless the layer SHA__12345 is a a gigabyte in size. In that case you can really quickly run out of space.
Is there a way in artifactory to deduplicate overlapping layers for storage reasons?
Thanks!
Artifactory uses checksum-based storage:
A file that is uploaded to Artifactory, first has its SHA1 checksum calculated, and is then renamed to its checksum. It is then hosted in the configured filestore in a directory structure made up of the first two characters of the checksum. For example, a file whose checksum is "ac3f5e56..." would be stored in directory "ac"; a file whose checksum is "dfe12a4b..." would be stored in directory "df" and so forth.
In parallel, Artifactory's creates a database entry mapping the file's checksum to the path it was uploaded to in a repository. This way of storing binaries optimizes many operations in Artifactory since they are implemented through simple database transactions rather than actually manipulating files.
One implication of this is that artifacts are deduplicated in general. Any two artifacts with the same checksum will point to the same file in storage, even if they're in different repositories. This applies to docker layers, as well as all other artifacts. So you shouldn't be having any issues with this.

Btrfs Snapshot WITH Backup

Is there a way to backup a btrfs file system by copying the entire disk over at first backup, but then copying over snapshot files in place of using rsync (or is this a bad idea)?
You can definitely do this, though rsync will duplicate some blocks on the new system.
You might be interested in Buttersink. ButterSink is like rsync, but for btrfs subvolumes instead of files, which makes it much more efficient for things like archiving backup snapshots. It is built on top of btrfs send and receive capabilities. Sources and destinations can be local btrfs file systems, remote btrfs file systems over SSH, or S3 buckets.
For example, the following will copy over just snapshot differences to the remote machine, and create an efficient mirror of your snapshots there:
buttersink /home/snaps/ ssh://backup-server/bak/snaps/

SFTP polling using java

My scenario as follows:
One java program is updating some random files to a SFTP location.
My requirement is as soon as a file is uploaded by the previous java program, using java I need to download the file. The files can be of size 100MB. I am searching for some java API which is helpful in this way. Here I even don't know the name of files. But I can keep a regular expression for this. A same file can be uploaded by previous program periodically. Since file size is high I need to wait until the complete file to be uploaded.
I used Jsch to download files, but I am not getting how to poll using jsch.
Polling
All you can do is to keep listing remote directory periodically, until you find a new file. There's no better way with SFTP. For that you obviously use ChannelSftp.ls().
Regarding selecting files matching certain pattern, see:
JSch ChannelSftp.ls - pass match patterns in java
Waiting until the upload is complete
Again, there's no support for this in widespread implementations of SFTP.
For details, see my answer at:
SFTP file lock mechanism.

How do I scp a file to a Unix host so that a file polling service won't see it before the copy is complete?

I am trying to transfer a file to a remote Unix server using scp. On that server, there is a service which polls the target directory to detect incoming files for processing. I would like to ensure that the polling service does not pick up new files before the copy is complete. Is there a way of doing that?
My file transfer process is a simple scp command embedded in a larger Java program. Ideally, a solution which did not involve changing the Jana would be best (for reasons involving change control processes).
You can scp the file to a different (/tmp) directory and move the
file via ssh after transfer is complete. The different directory needs to be on the same partition as the final destination directory otherwise there will be a copy operation and you'll face a similar problem. Another service on the destination machine can do this move operation.
You can copy the file as hidden (prefix the filename with .) and copy, then move
If you can modify the polling service, you can check active scp processes and ignore files matching scp arguments.
You can check for open files with lsof +d $directory and ignore them in the polling server
I suggest copying the file using rsync instead of scp. rsync already copies new files to temporary filenames, and has many other useful features for file synchronization as well.
$ rsync -a source/path/ remotehost:/target/path/
Of course, you can also copy file-by-file if that's your preference.
If rsync's temporary filenames are sufficient to avoid being picked up by your polling service, then you could simply replace your scp command with a shell script that acts as a wrapper for rsync, eliminating the need to change your Java program.
You would need to know the precise format that your Java program uses to call the scp command, to make sure that the options you feed to rsync do what you expect.
You would also need to figure out how your Java program calls scp. If it does so by full pathname (i.e. /usr/bin/scp), then this solution might put other things at risk on your system that depend on scp (like you, for example, expecting scp to behave as it usually does instead of as a wrapper). Changing a package-installed binary like /usr/bin/scp may also "break" your package registration, making it difficult to install future security updates because a binary has changed to a shell script. And of course, there might be security implications to any change you make.
All in all, I suspect you're better off changing your Java program to make it do precisely what you want, even if that is to launch a shell script to handle aspects of automation that you want to be able to change in the future without modifying your Java.
Good luck!

Resources