Delete cached data from DVC - dvc

I would like to be able to delete individual files or folders from the DVC cache, after they have been pulled with dvc pull, so they don't occupy space in local disk.
Let me make things more concrete and summarize the solutions I found so far. Imagine you have downloaded a data folder using something like:
dvc pull <my_data_folder.dvc>
This will place the downloaded data into .dvc/cache, and it will create a set of soft links in my_data_folder (if you have configured DVC to use soft links)
ls -l my_data_folder
You will see something like:
my_data_file_1.pk --> .dvc/cache/4f/7bc7702897bec7e0fae679e968d792
my_data_file_2.pk --> .dvc/cache/4f/7bc7702897bec7e0fae679e968d792
...
Imagine you don't need this data for a while, and you need to free its space from local disk. I know of two manual approaches for doing that, although I am not sure about the second one:
Preliminary step (optional)
Not needed if you have symlinks (which I believe is true, at least in unix-like OS):
dvc unprotect my_data_folder
Approach 1 (verified):
Delete all the cached data. From the repo's root folder:
rm -r my_data_folder
rm -rf .dvc/cache
This seems to work properly, and will completely free the disk space previously used by the downloaded data. Once we need the data again, we can pull it by doing dvc pull as previously. The drawback is that we are removing all the data downloaded with dvc so far, not only the data corresponding to my_data_folder, so we would need to do dvc pull for all the data again.
Approach 2 (NOT verified):
Delete only specific files (to be thoroughly tested that this does not corrupt DVC in any way):
First, take note of the path indicated in the soft link:
ls -l my_data_folder
You will see something like:
my_data_file_1.pk --> .dvc/cache/4f/7bc7702897bec7e0fae679e968d792
my_data_file_2.pk --> .dvc/cache/4f/7bc7702897bec7e0fae679e968d792
If you want to delete my_data_file_1.pk, from the repo's root folder run:
rm .dvc/cache/4f/7bc7702897bec7e0fae679e968d792
Note on dvc gc
For some reason, running dvc gc does not seem to delete the files from the cache, at least in my case.
I would appreciate if someone can suggest a better way, or also comment whether the second approach is actually appropriate. Also, if I want to delete the whole folder and not go file by file, is there any way to do that automatically?
Thank you!

It's not possible at the moment to granularly specify a directory / file to be removed from the cache. Here are the tickets to vote and ask to prioritize this:
dvc gc remove
Reconsider gc implementation
For some reason, running dvc gc does not seem to delete the files from the cache, at least in my case.
This is a bit concerning. If you run it with the -w option it keeps only files / dirs that are referenced in the current versions of the .dvc and dvc.lock files. And it should remove everything else.
So, let's say you are building a model:
my_model_file.pk
You created it once and its hash is 4f7bc7702897bec7e0fae679e968d792 and it's written in the dvc.lock or in the my_model_file.dvc.
Then you do another iteration and now hash is different 5a8cc7702897bec7e0faf679e968d363. It should be now written in the .dvc or lock. It means that a model that corresponds to the previous 4f7bc7702897bec7e0fae679e968d792 is not referenced anymore. In this case dvc gc -w should definitely collect it. If that is not happening please create a ticket and we'll try to reproduce and take a look.

Related

How to add a file to a dvc-tracked folder without pulling the whole folder's content?

Let's say I am working inside a git/dvc repo. There is a folder data containing 100k small files. I track it with DVC as a single element, as recommended by the doc:
dvc add data
and because in my experience, DVC is kinda slow when tracking that many files one by one.
I clone the repo on another workspace, and now I have the data.dvc file locally but none of the actual files inside yet. I want to add a file named newfile.txt to the data folder and track it with DVC. Is there a way to do this without pulling the whole content of data locally ?
What I have tried for now:
Adding the data folder again:
mkdir data
mv path/to/newfile.txt data/newfile.txt
dvc add data
The data.dvc file is built again from the local state of data which only contains newfile.txt so this doesn't work.
Adding the file as a single element in data folder:
dvc add data/newfile.txt
I get :
Cannot add 'data/newfile.txt', because it is overlapping with other DVC tracked output: 'data'.
To include 'data/newfile.txt' in 'data', run 'dvc commit data.dvc'
Using dvc commit as suggested
mkdir data
mv path/to/newfile.txt data/newfile.txt
dvc commit data.dvc
Similarly as 1., the data.dvc is rebuilt again from local state of data.
I clone the repo on another workspace, and now I have the data.dvc file locally but none of the actual files inside yet (haven't dvc pulled). I want to add a file to the data folder and track it with DVC. Is there a way to do this without pulling the whole content of data locally ?
Interesting question. I think there is no easy way to do this now because in this other machine if you dvc add data again but with only one file in there, DVC will think you deleted all the other files, create a new cached version of the data dir (containing only the new file), and update the .dvc file accordingly (as you discovered).
You could open a feature request in https://github.com/iterative/dvc.org/issues.

Revert a dvc remove -p command

I have just removed a DVC tracking file by mistake using the command dvc remove training_data.dvc -p, which led to all my training dataset gone completely. I know in Git, we can easily revert a deleted branch based on its hash. Does anyone know how to revert all my lost data in DVC?
You should be safe (at least data is not gone) most likely. From the dvc remove docs:
Note that it does not remove files from the DVC cache or remote storage (see dvc gc). However, remember to run dvc push to save the files you actually want to use or share in the future.
So, if you created training_data.dvc as with dvc add and/or dvc run and dvc remove -p didn't ask/warn you about anything, means that data is cached similar to Git in the .dvc/cache.
There are ways to retrieve it, but I would need to know a little bit more details - how exactly did you add your dataset? Did you commit training_data.dvc or it's completely gone? Was it the only data you have added so far? (happy to help you in comments).
Recovering a directory
First of all, here is the document that describes briefly how DVC stores directories in the cache.
What we can do is to find all .dir files in the .dvc/cache:
find .dvc/cache -type f -name "*.dir"
outputs something like:
.dvc/cache/20/b786b6e6f80e2b3fcf17827ad18597.dir
.dvc/cache/00/db872eebe1c914dd13617616bb8586.dir
.dvc/cache/2d/1764cb0fc973f68f31f5ff90ee0883.dir
(if the local cache is lost and we are restoring data from the remote storage, the same logic applies, commands (e.g. to find files on S3 with .dir extension) look different)
Each .dir file is a JSON with a content of one version of a directory (file names, hashes, etc). It has all the information needed to restore it. The next thing we need to do is to understand which one do we need. There is no one single rule for that, what I would recommend to check (and pick depending on your use case):
Check the date modified (if you remember when this data was added).
Check the content of those files - if you remember a specific file name that was present only in the directory you are looking for - just grep it.
Try to restore them one by one and check the directory content.
Okay, now let's imagine we decided that we want to restore .dvc/cache/20/b786b6e6f80e2b3fcf17827ad18597.dir, (e.g. because content of it looks like:
[
{"md5": "6f597d341ceb7d8fbbe88859a892ef81", "relpath": "test.tsv"}, {"md5": "32b715ef0d71ff4c9e61f55b09c15e75", "relpath": "train.tsv"}
]
and we want to get a directory with train.tsv).
The only thing we need to do is to create a .dvc file that references this directory:
outs:
- md5: 20b786b6e6f80e2b3fcf17827ad18597.dir
path: my-directory
(note, that path /20/b786b6e6f80e2b3fcf17827ad18597.dir became a hash value: 20b786b6e6f80e2b3fcf17827ad18597.dir)
And run dvc pull on this file.
That should be it.

How to build a rpm that installs host dependent files

I have to build one rpm that copies the contents of file A to /path/to/tartetfile if the hostname is A. In all other cases the contents of B should be copied to /path/to/targetfile. I'm aware that this may be a misusage of rpm, but I still have to do it like this. Do you have any ideas how to get this done in an elegant way?
My solution at the moment would be to create an empty /path/to/targetfile in my BUILD directory as well as a /tmp/contents.tar.gz that contains the files A and B. In the postinstall routine i then would extract the relevant parts of /tmp/contents.tar.gz to /path/to/targetfile and delete the tarball afterwards. In the pre-uninstall routine I'd then touch the /tmp/contents.tar.gz to supress rpm reporting errors for an already deleted file.
To me this seems to be a very dirty way to get this done. Do you have better ones?
If you plan on abusing rpm for things it was not desinged for, you'll have to do dirty tricks.
I don't see another workaround for you. I fail to see the use of removing the tar.gz etc, unless that (little?) extra space is really a problem for you. I would propose:
package all files (A and B) into some specific directory (/usr/lib/your-package or whatever), not in compressed format.
in the %post section create just symlinks so that /path/to/targetfile points to /usr/lib/your-package/A or /usr/lib/your-package/B (symlinks take up almost no space). This has the additional value that ls -l /path/to/targetfile will show you which which file it points to, giving you the information whether this is file A or B.
in your %files section declare %ghost /path/to/targetfile for a nice cleanup upon removal.

Can a pre-commit Git hook zip a directory and add it to the repository?

I'm doing development on a Wordpress plugin. My development directory contains a lot of development-specific stuff (e.g. Grunt files, Sass files, the git repository itself, etc.).
Obviously, I don't want to distribute this folder containing all of those development files; people don't want a few MB of Grunt files when they download my Wordpress plugin.
Up until now, though, my "release" process has been cumbersome:
Commit the Git changes
Zip the entire folder
Open the zip file and delete the .git folder, grunt files, and all the other development-specific files
Release the new zip
I don't know the best way to accomplish this, but I'm very vaguely familiar with Git hooks, and I had this thought: could I set up a Git hook that would zip ONLY the needed production files into a ZIP file and store it with the repo? That way, every time I commit it would automatically create a new release ZIP.
Is that possible? If so, could someone point me in the right direction?
Oh also, I'm on Windows (・_・;). So I'm hoping that there's a way to do it on Windows.
I can't speak for Windows, but:
It's technically possible to do that sort of thing in a pre-commit hook.
Don't.
A pre-commit hook that modifies "what you will commit" is annoying (if nothing else, it violates the "rule of least astonishment", where your version control system simply stores the versions you tell it to store). Apart from that, storing large pre-compressed binaries interferes with git's attempt to save space in pack files, and will cause rapid repository bloat, poor performance, running out of memory, and so on. A ZIP-archive is a pre-compressed binary and hence will behave badly.
In general, a more reasonable "hook-y" way to handle releases is to set up a "release server" to which you push new releases, and have the push trigger the archive-generation. (There are ways to do this without a separate server / repository, and you can do it in a more pull-style fashion, but the push-style is easy to illustrate.)
[Edit: I had originally considered git archive but did not realize you could get it to exclude files conveniently, so wrote up the below instead. So, jthill's answer is better and should be one's first resort. I'll leave this in place as an alternative for some case where for some reason, git archive might not do.]
For instance, here's a server-side post-receive hook code fragment that checks whether a branch whose name matches release* has been pushed-to, and if so, invokes a shell function with the name of the branch (once for each such branch):
#! /bin/sh
NULL_SHA1=0000000000000000000000000000000000000000
scan()
{
local oldsha newsha fullref shortref
local optype
while read oldsha newsha fullref; do
case $oldsha,$newsha in
$NULL_SHA1,*) optype=create;;
*,$NULL_SHA1) optype=delete;;
*) optype=update;;
esac
case $fullref in
refs/heads/*)
reftype=branch
shortref=${fullref#refs/heads/}
;;
*)
reftype=other
shortref=fullref
;;
esac
case $optype,$reftype,$shortref in
create,branch,release*|update,branch,release*)
do_release $shortref;;
esac
done
}
scan
(much of the above is boilerplate, which I have stripped down to essentials). You would have to write the do_release function, which might resemble (totally untested):
do_release()
{
local tmpdir=/tmp/build.$$ # or use mktemp -d
# $tmpdir/index is git's index; $tmpdir/t is the work tree
trap "rm -rf $tmpdir; exit 1" 1 2 3 15
rm -rf $tmpdir
mkdir $tmpdir/t
GIT_INDEX_FILE=$tmpdir/index GIT_WORK_TREE=$tmpdir/t git checkout $1
# now clean out grunt files and make zip archive
(cd $workdir/t; rm -rf grunt; zip ../t.zip .)
# put completed zip archive in export location, name it
# based on the branch name
mv $workdir/t.zip /place/where/zip/files/live/$1.zip
# clean up temp dir now, and no longer need to clean up
# on signal related abort
rm -rf $tmpdir
trap - 1 2 3 15
}
There's actually a command for this, git archive.
git archive master -o wizzo-v1.13.0.zip
See the EXAMPLES section, you can select paths, add prefixes to them, define custom postprocessing by output extension, and some more minor tweaks.
Also see the ATTRIBUTES section: you can give files -- arbitrary patterns, really -- an export-ignore attribute to exclude them from archives.
It's got a bunch more handy-dandies, you can get archives from remote repos, expand arbitrary git log --pretty=format: placeholders, the git manpages are definitely worth whatever time you can invest in them.

How do I synchronize in both directions?

I want to use rsync to synchronize two directories in both directions.
I refer to synchronization in classical sense
(not how it is meant in rsync manuals):
I want to update the directories in both directions,
depending on which of them is newer.
Can this be done by rsync (preferable in a Linux-way)?
If not, what other solutions exist?
Just run it twice, with "newer" mode (-u or --update flag) plus -t (to copy file modified time), -r (for recursive folders), and -v (for verbose output to see what it is doing):
rsync -rtuv /path/to/dir_a/* /path/to/dir_b
rsync -rtuv /path/to/dir_b/* /path/to/dir_a
This won't handle deletes, but I'm not sure there is a good solution to that problem with only periodic sync'ing.
Do you know Unison File Synchronizer?
Unison is a file-synchronization tool
for Unix and Windows. It allows two
replicas of a collection of files and
directories to be stored on different
hosts (or different disks on the same
host), modified separately, and then
brought up to date by propagating the
changes in each replica to the other. ...
Note also that it is resilient to failure:
Unison is resilient to failure. It is
careful to leave the replicas and its
own private structures in a sensible
state at all times, even in case of
abnormal termination or communication failures.
You need to run rsync twice and I recommend to run it with -au:
rsync -au /local/source/* /remote/destination
rsync -au /remote/destination/* /local/source
-a (a for archive) is a shortcut for -rlptgoD:
-r Recurse into sub directories
-l Also sync symbolic links
-p Also sync file permissions
-t Also sync file modification times
-g Also sync file groups
-o Also sync file owner
-D Also sync special (not regular/meta) files
Basically whenever you want to create an identical one-to-one copy using rsync, you should always use -a as that's what most users expect to happen when they talk about "syncing". Other answers here seem to overlook that sometimes the content of a file stays unchanged but its owner may have changed or its access permissions may have changed and in that case rsync would not sync the file which could be fatal.
But you also require -u as that tells rsync to completely leave any file/folder alone, in case it exists already at the destination and has a newer last modification date. Without -u rsync would sync regardless if a file/folder is newer or not.
Please note that this solution cannot handle deleted files. Handling deletes is not easily possible as consider the following situation: A file has been deleted at the source, now how shall rsync know if that file once existed and has been deleted (in that case it must be deleted at the destination as well) or whether it never existed at the source (in that case it must be copied from the destination). These two situations look identical to rsync thus it cannot know how to react correctly. It won't help to sync the other way round as that can lead to the same situation: A file exists at the source but not at the destination. Why? Has it never existed at the destination or has it been deleted? Both cases look identical to rsync.
Sync tools that can reliably sync deleted files usually manage a sync log about all past sync operations. If that log reveals that there once was a file and has been synced but now it is missing, it's clear that it has been deleted. If there never was such a file according to the log, it must be synced. By storing all log entries with timestamps, it's even possible that a deleted file comes back and gets deleted multiple times yet the sync tool will always know what to do and the result is always correct. rsync has no such log, it only relies on the current file state of two sides of the operation.
You can however build yourself a sync command using rsync and a bit POSIX shell scripting which gets already very close to a sync tool as described above. As I needed such a tool myself, here is an answer on Stackoverflow that guides you through the creation of such a script.
Thanks jsight
rsync -urv --progress dir_a dir_b && rsync -urv --progress dir_b dir_a
This would result in the second sync happening immediately after 1st sync is over. In case the directory structure is huge, this will save time, as one does not need to sit before the pc. If the structure is huge, remove the verbose and progress stuff
rsync -ur dir_a dir_b && rsync -ur dir_b dir_a
Use rsync <OPTIONS> [hostname:]source-dir [hostname:]dest-dir
for example:
rsync -pogtEtvr --progress --bwlimit=2000 xxx-files different-stuff
Will sync xxx-files to different-stuff/xxx-files .If different-stuff/xxx-files did not exist, it will create it - i.e. copy it.
-pogtEtv - just bunch of options to preserve file metadata, plus v - verbose and r - recursive
--progress - show progress of syncing in real time - super useful if you copy big files
--bwlimit=2000 - sets maximum speed of copying/syncing (bw = bandwidth)
P.S. rsync is critically important when you work over network in case of local machine you can use commands like cp.
Good Luck!
What you need is Rclone. Rclone ("rsync for cloud storage") is a command line Linux program to sync files and directories to and from different cloud storage providers (box,dropbox,ftp etc) and local filesystems. Rlone supports mirror syncing only.
Another more graphical solution which includes real-time syncing would be to use FreeFileSync, which includes the program RealTimeSync. FreefileSync support 2-way bidirectional syncing which includes handling deletes.
I was having the same question and end up using git. It might not fit your situation, but if anyone find this topic and have the same question, you may consider a version control system.
I'm using rsync with inotifywait.
When you change any file, rsync will be executed.
inotifywait -m --exclude "$_LOG_FILE" -r -e create,delete,delete_self,modify,moved_to --format "%w%f" "$folder"
You need run inotifywait on both host. Please check example inotifywait

Resources