Updating tracked dir in DVC - dvc

According to this tutorial, when I update a file I should first remove it from DVC control (i.e. run dvc unprotect <myfile>.dvc or dvc remove <myfile>.dvc) and then add it again via dvc add <myfile>. However, it's not clear whether I should apply the same workflow to directories.
I have a directory under DVC control with the following structure:
data/
1.jpg
2.jpg
Should I run dvc unprotect data every time the directory content is updated?
More specifically I'm interested if I should run dvc unprotect data in the following use cases:
New file is added. For example if I put 3.jpg image in the data dir
File is deleted. For example if I delete 2.jpg image in the data dir
File is updated. For example if I edit 1.jpg image via graphic editor.
A combination of the previous use cases (i.e. some files are updated, other deleted and new files are added)

Only when a file is updated, i.e. you edit 1.jpg with your editor, AND only if the hardlink or symlink cache type is enabled.
Please, check this link:
updating tracked files has to be carried out with caution to avoid data corruption when the DVC config option cache.type is set to hardlink or/and symlink
I would strongly recommend reading this document: Performance Optimization for Large Files. It explains the benefits of using hardlinks/symlinks.
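To see why in-place edits are the risky case, here is a minimal simulation that uses plain ln instead of DVC. The cache/ and data/ directories and the blob file name are made up for the illustration; the copy-and-rename step only approximates the effect of unlinking a file and is not a description of how dvc unprotect is implemented.

```shell
# Simulation only: a plain hardlink stands in for DVC's hardlink cache type.
set -e
demo_dir=$(mktemp -d)              # hypothetical scratch directory
cd "$demo_dir"
mkdir cache data
printf 'original\n' > cache/blob   # stands in for a cache entry
ln cache/blob data/1.txt           # workspace file shares the cache entry's inode

# Unsafe: an in-place edit goes through the shared inode and changes the cache too.
printf 'edited\n' >> data/1.txt
cache_after_unsafe_edit=$(cat cache/blob)

# Reset the cache entry, then break the link before editing.
printf 'original\n' > cache/blob
cp data/1.txt data/1.txt.copy
mv data/1.txt.copy data/1.txt      # data/1.txt is now an independent file
printf 'edited\n' >> data/1.txt
cache_after_safe_edit=$(cat cache/blob)

echo "cache after unsafe edit: $cache_after_unsafe_edit"
echo "cache after safe edit:   $cache_after_safe_edit"
```

Adding or deleting files in the directory never writes through an existing link, which is consistent with the answer above: only edits to existing files need the extra step.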

The links above no longer work. Here is the up-to-date link, with the instructions pasted below:
Modifying content
Unlink the file with dvc unprotect. This will make train.tsv safe to edit:
dvc unprotect train.tsv
Then edit the content of the file, for example with:
echo "new data item" >> train.tsv
Add the new version of the file back with DVC:
dvc add train.tsv
git add train.tsv.dvc
git commit -m "modify train data"
If you have remote storage and/or an upstream repo:
dvc push
git push
Replacing files
If you want to replace the file altogether, you can take the following steps.
First, stop tracking the file by using dvc remove on the .dvc file. This will remove train.tsv from the workspace (and unlink it from the cache):
dvc remove train.tsv.dvc
Next, replace the file with new content:
echo new > train.tsv
And start tracking it again:
dvc add train.tsv
git add train.tsv.dvc .gitignore
git commit -m "new train data"
If you have remote storage and/or an upstream repo:
dvc push
git push

Related

How to add a file to a dvc-tracked folder without pulling the whole folder's content?

Let's say I am working inside a git/dvc repo. There is a folder data containing 100k small files. I track it with DVC as a single element, as recommended by the doc:
dvc add data
and because in my experience, DVC is kinda slow when tracking that many files one by one.
I clone the repo on another workspace, and now I have the data.dvc file locally but none of the actual files inside yet. I want to add a file named newfile.txt to the data folder and track it with DVC. Is there a way to do this without pulling the whole content of data locally ?
What I have tried for now:
Adding the data folder again:
mkdir data
mv path/to/newfile.txt data/newfile.txt
dvc add data
The data.dvc file is built again from the local state of data which only contains newfile.txt so this doesn't work.
Adding the file as a single element in data folder:
dvc add data/newfile.txt
I get :
Cannot add 'data/newfile.txt', because it is overlapping with other DVC tracked output: 'data'.
To include 'data/newfile.txt' in 'data', run 'dvc commit data.dvc'
Using dvc commit as suggested
mkdir data
mv path/to/newfile.txt data/newfile.txt
dvc commit data.dvc
As in the first attempt, data.dvc is rebuilt from the local state of data.
I clone the repo on another workspace, and now I have the data.dvc file locally but none of the actual files inside yet (haven't dvc pulled). I want to add a file to the data folder and track it with DVC. Is there a way to do this without pulling the whole content of data locally ?
Interesting question. I think there is no easy way to do this now because in this other machine if you dvc add data again but with only one file in there, DVC will think you deleted all the other files, create a new cached version of the data dir (containing only the new file), and update the .dvc file accordingly (as you discovered).
You could open a feature request in https://github.com/iterative/dvc.org/issues.

Undo 'dvc add' operation

I dvc add-ed a file I did not mean to add. I have not yet committed.
How do I undo this operation? In Git, you would do git rm --cached <filename>.
To be clear: I want to make DVC forget about the file, and I want the file to remain untouched in my working tree. This is the opposite of what dvc remove does.
One issue on the DVC issue tracker suggests that dvc unprotect is the right command. But reading the manual page suggests otherwise.
Is this possible with DVC?
As per mroutis on the DVC Discord server:
dvc unprotect the file; this won't be necessary if you don't use symlink or hardlink caching, but it can't hurt.
Remove the .dvc file
If you need to delete the cache entry itself, run dvc gc, or look up the MD5 in data.dvc and manually remove it from .dvc/cache.
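The first two steps above can be wrapped in a small helper. This is only a sketch: undo_dvc_add and the DVC_BIN override are hypothetical names introduced here, not part of DVC.

```shell
# Hypothetical helper: make DVC forget a file while leaving it in the working tree.
# DVC_BIN (an assumption of this sketch) lets you point at a specific dvc executable.
undo_dvc_add() {
    target=$1
    dvc_bin=${DVC_BIN:-dvc}
    # Step 1: replace any symlink/hardlink with a regular copy of the file.
    "$dvc_bin" unprotect "$target" || return 1
    # Step 2: delete the metadata file so DVC no longer tracks the target.
    rm -f -- "$target.dvc"
    # Optional step 3 (not run here): clean unused cache entries with `dvc gc`.
}

# Usage: undo_dvc_add myfile.txt
```

dvc add also writes an entry for the file into .gitignore, so you may want to remove that line by hand afterwards.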
Edit -- there is now an issue on their Github page to add this to the manual: https://github.com/iterative/dvc.org/issues/625
dvc remove appears to do what you need for uncommitted files - at least for files that aren't in a pipeline. The key (which wasn't clear to me from the error or the docs) is to pass the ….dvc file name, otherwise it tries to find and remove it as a section from dvc.yaml.
# Precondition: DVC is configured for the repo. No dvc.yaml file (untested with it)
$ touch so-57966851.txt
$ dvc add so-57966851.txt
WARNING: 'so-57966851.txt' is empty.
100% Adding...|████████████████████████████████████████|1/1 [00:00, 49.98file/s]
To track the changes with git, run:
git add .gitignore so-57966851.txt.dvc
# Ooops! I did the wrong thing! I didn't mean to add that…
$ dvc remove so-57966851.txt.dvc
$ ll so-*.txt
-rw-r--r-- 1 ibboard users 0 Aug 23 20:27 so-57966851.txt
(Tested with v2.5.4)

Excluding files from local git repo

I am working on a Wordpress site, and been given access to the git repository for this project. The entire WP install is in the Repo. All I care about is being able to push my changes to the theme and a select list of plugin folders, ie:
/wp-content/themes/myTheme2017/
/wp-content/plugins/myPlugin1/
/wp-content/plugins/myPlugin2/
....
How can I exclude everything else from being tracked? How can I update my local WP install, and customize my wp-config.php file, and not have those changes be tracked?
As per How do I configure git to ignore some files locally?, I can specify the files I want excluded much like in gitignore files. Then, I can run git update-index --skip-worktree [<file>...] and get my desired results.
git update-index --skip-worktree wp-config.php
The real question is then can I exclude entire folders? Do I have to run the skip-worktree command on every file?
Yes, every file: Git works with content (files), not containers (directories).
Here is an approach using submodules:
git submodule add -f https://github.com/wp-plugins/wp-migrate-db.git ./wp-content/plugins/wp-migrate-db
git commit -m "Added WP Migrate DB plugin"
That allows you to commit separately in your parent repo or your submodule.
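Because skip-worktree has to be set per file, a folder can still be covered in one go by feeding the list of tracked files to git update-index. A minimal sketch, where skip_worktree_dir is a hypothetical helper name:

```shell
# Hypothetical helper: mark every tracked file under a directory as skip-worktree.
skip_worktree_dir() {
    dir=$1
    # -z and -0 keep file names containing spaces or newlines intact.
    git ls-files -z -- "$dir" | xargs -0 git update-index --skip-worktree --
}

# Usage (run from the repository root):
#   skip_worktree_dir wp-content/plugins/some-other-plugin
# Flagged files show up with an "S" prefix in: git ls-files -v
```

Files added to the folder later are not covered automatically, so the helper would need to be run again after they are tracked.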

.gitignored files still shown in RStudio

I added the folder .Rproj.user to .gitignore. However, some files contained in it still show up (see screenshot). Any ideas what can I do about it?
Update
No changes after adding .Rproj.user/**
First of all, your files are already committed, so you have to remove them from the repo:
# Once you add files to git, it will keep tracking them,
# so we have to delete them and commit your deletion
git rm -r --cached .Rproj.user/**
# Commit the deleted files
git commit -m "Removed files...."
# now add it to the `.gitignore` and the files will be ignored
echo '.Rproj.user/**' >> .gitignore
You need to mark it as a folder.
To do so, add the two asterisks (**) as described above.
P.S.
Here is a cool hook which will block that kind of file from being added when you try to push it to the server.
What are some more forceful ways than a .gitignore to keep (force) files out of a repo?

Git Archive but first put all the files inside a folder then start archiving

As the title suggests, I want to know if there is a single git command that first puts my whole project into one folder (excluding .gitignored files) and then archives that folder, so that ignored files are left out of the archive.
This would be useful for me as I am working on a WordPress plugin with multiple releases. Some references.
I want all the files (minus the .gitignored files) moved to a folder first, then proceed to archive that folder
It is possible in one command provided you define an alias but this isn't git-related:
you can:
clone your repo elsewhere (that way you don't get any ignored or private file)
move your files as you see fit in that local clone
archive (tar cpvf yourArchive.tar yourFolder)
But git archive alone won't help you move those files, which is why I would recommend a script with custom bash commands (not git commands).
You don't really need to copy / clone the repo anywhere.
Make sure you committed all your changes.
Process the files any way you want.
Run tar -cvjf dist/archive-name.tbz2 --transform='s,^,archive-name/,' $(git ls-tree --full-tree -r --name-only --full-name HEAD)
Run git reset --hard to restore without any of the changes you made in step #2.
Hints:
The --transform='s,^,archive-name/,' option makes your files extract into archive-name/...; you can remove it if you don't need that.
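If no processing of the files is needed before archiving, the folder-prefix part on its own can also be done with git archive and its --prefix option, which is a different technique from the tar --transform approach above. A minimal sketch, where archive_with_prefix is a hypothetical helper name; it only covers files committed in HEAD, so untracked and ignored files are left out:

```shell
# Hypothetical helper: archive the files committed in HEAD under one top-level folder.
archive_with_prefix() {
    name=$1   # folder name inside the archive
    out=$2    # path of the tarball to write
    # The trailing slash matters: without it the prefix is glued onto the file names.
    git archive --format=tar --prefix="$name/" -o "$out" HEAD
}

# Usage (run from the repository root):
#   archive_with_prefix archive-name archive-name.tar
```

Uncommitted changes are not included, so this does not replace the tar-based steps when files are modified before packaging.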
