How to add to a DVC stage outputs already tracked by DVC?

In my project, I already have some files tracked by DVC that I added with dvc add. Now I want to create stages using these files as outputs and dependencies, but when I try to create a stage I get an error that says ERROR: output '[FILE NAME]' is already specified in stages.
I assume that dvc add adds the files to the dependency graph as outputs, so including them in a stage creates a conflict, but I couldn't find anything in the official documentation confirming this. So now I am confused about how to add outputs to a stage if these outputs are already tracked by DVC.
Here is an example of the error I get when creating a stage:
>>> dvc stage add -n train -d data/data.csv -o models/model python train.py
ERROR: output 'models/model' is already specified in stages:
- models/model.dvc
- train
In this example the file data/data.csv and the directory models/model have already been added to DVC but are not part of any stage; however, they are present in the dependency graph.
So how do I include these files in a DVC stage? Is there a way to do it without having to remove the files from DVC and then add them again through a stage?

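DVC stage outputs are automatically tracked by DVC, so you don't need to run dvc add on them. If you have already done so, you can safely un-track the output with dvc remove first; in the example above that would presumably be dvc remove models/model.dvc, the .dvc file named in the error message.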
Note that the actual output files or directories of the stage (outs field) are not removed by this command, unless the --outs option is used.
One more thing to note: when you create a stage and run it, DVC removes the stage's outputs first (unless a persistence flag is specified). This is done for reproducibility: your stage is expected to produce its outputs every time it runs.

Related

Delete cached data from DVC

I would like to be able to delete individual files or folders from the DVC cache, after they have been pulled with dvc pull, so they don't occupy space in local disk.
Let me make things more concrete and summarize the solutions I found so far. Imagine you have downloaded a data folder using something like:
dvc pull <my_data_folder.dvc>
This will place the downloaded data into .dvc/cache and create a set of soft links in my_data_folder (if you have configured DVC to use soft links). If you then run:
ls -l my_data_folder
You will see something like:
my_data_file_1.pk --> .dvc/cache/4f/7bc7702897bec7e0fae679e968d792
my_data_file_2.pk --> .dvc/cache/4f/7bc7702897bec7e0fae679e968d792
...
Imagine you don't need this data for a while, and you need to free its space from local disk. I know of two manual approaches for doing that, although I am not sure about the second one:
Preliminary step (optional)
Not needed if you have symlinks (which I believe is true, at least in unix-like OS):
dvc unprotect my_data_folder
Approach 1 (verified):
Delete all the cached data. From the repo's root folder:
rm -r my_data_folder
rm -rf .dvc/cache
This seems to work properly, and will completely free the disk space previously used by the downloaded data. Once we need the data again, we can pull it by doing dvc pull as previously. The drawback is that we are removing all the data downloaded with dvc so far, not only the data corresponding to my_data_folder, so we would need to do dvc pull for all the data again.
Approach 2 (NOT verified):
Delete only specific files (to be thoroughly tested that this does not corrupt DVC in any way):
First, take note of the path indicated in the soft link:
ls -l my_data_folder
You will see something like:
my_data_file_1.pk --> .dvc/cache/4f/7bc7702897bec7e0fae679e968d792
my_data_file_2.pk --> .dvc/cache/4f/7bc7702897bec7e0fae679e968d792
If you want to delete my_data_file_1.pk, from the repo's root folder run:
rm .dvc/cache/4f/7bc7702897bec7e0fae679e968d792
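For a whole folder, the lookup step of Approach 2 can be scripted instead of done file by file. The following is only a minimal Python sketch, assuming DVC is configured to use symlinks and the cache layout shown above; cache_targets is a hypothetical helper, not part of DVC, and it only lists the cache files without deleting anything.

```python
import os
import tempfile
from pathlib import Path


def cache_targets(folder, cache_dir):
    """Return the cache files that symlinks under `folder` point to.

    Hypothetical helper (not a DVC API). It only collects paths.
    """
    cache_root = Path(cache_dir).resolve()
    targets = set()
    for root, _dirs, files in os.walk(folder):
        for name in files:
            link = Path(root) / name
            if not link.is_symlink():
                continue  # a plain copy or hardlink: nothing to resolve
            target = link.resolve()
            # Keep only targets that really live inside the cache directory.
            if cache_root in target.parents:
                targets.add(target)
    return sorted(targets)


# Throwaway demo that mimics the listing above (two links, one cache file).
with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp, ".dvc", "cache")
    blob = cache / "4f" / "7bc7702897bec7e0fae679e968d792"
    blob.parent.mkdir(parents=True)
    blob.write_bytes(b"data")
    data = Path(tmp, "my_data_folder")
    data.mkdir()
    (data / "my_data_file_1.pk").symlink_to(blob)
    (data / "my_data_file_2.pk").symlink_to(blob)
    found = cache_targets(data, cache)
    demo_names = [p.relative_to(cache.resolve()).as_posix() for p in found]

print(demo_names)
```

Note that several workspace files can point at the same cache entry, as in the listing above, so removing one cache file may break more than one link.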
Note on dvc gc
For some reason, running dvc gc does not seem to delete the files from the cache, at least in my case.
I would appreciate if someone can suggest a better way, or also comment whether the second approach is actually appropriate. Also, if I want to delete the whole folder and not go file by file, is there any way to do that automatically?
Thank you!
It's not possible at the moment to granularly specify a directory / file to be removed from the cache. Here are the tickets to vote and ask to prioritize this:
dvc gc remove
Reconsider gc implementation
Regarding "For some reason, running dvc gc does not seem to delete the files from the cache, at least in my case":
This is a bit concerning. If you run it with the -w option, it keeps only the files and directories that are referenced in the current versions of the .dvc and dvc.lock files, and it should remove everything else.
So, let's say you are building a model:
my_model_file.pk
You created it once, its hash is 4f7bc7702897bec7e0fae679e968d792, and that hash is written in dvc.lock or in my_model_file.dvc.
Then you do another iteration and the hash is now different: 5a8cc7702897bec7e0faf679e968d363. That new hash should now be written in the .dvc or lock file, which means the model corresponding to the previous hash 4f7bc7702897bec7e0fae679e968d792 is not referenced anymore. In this case dvc gc -w should definitely collect it. If that is not happening, please create a ticket and we'll try to reproduce it and take a look.
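To make "not referenced anymore" concrete, here is a rough Python sketch of the comparison described above. It is not how dvc gc is implemented. It assumes the cache layout shown in this thread (a two-character directory plus the rest of the hash as the file name), it ignores the child files listed inside .dir entries, and unreferenced is a hypothetical helper name.

```python
import tempfile
from pathlib import Path


def unreferenced(cache_dir, referenced):
    """List hashes present in the cache but absent from `referenced`.

    Hypothetical helper; `referenced` stands for the hashes found in the
    current .dvc and dvc.lock files.
    """
    stale = []
    for path in Path(cache_dir).glob("*/*"):
        if not path.is_file():
            continue
        # The two-character directory name plus the file name form the hash.
        file_hash = path.parent.name + path.name
        if file_hash not in referenced:
            stale.append(file_hash)
    return sorted(stale)


old_hash = "4f7bc7702897bec7e0fae679e968d792"  # first iteration of the model
new_hash = "5a8cc7702897bec7e0faf679e968d363"  # second iteration

with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp, ".dvc", "cache")
    for h in (old_hash, new_hash):
        entry = cache / h[:2] / h[2:]
        entry.parent.mkdir(parents=True)
        entry.write_bytes(b"model")
    # Only the new hash is still written in the .dvc or lock file.
    stale = unreferenced(cache, {new_hash})

print(stale)
```

In this toy cache only the old hash is reported, which is the entry dvc gc -w would be expected to collect.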

How to add a file to a dvc-tracked folder without pulling the whole folder's content?

Let's say I am working inside a git/dvc repo. There is a folder data containing 100k small files. I track it with DVC as a single element, as recommended by the doc:
dvc add data
and because in my experience, DVC is kinda slow when tracking that many files one by one.
I clone the repo on another workspace, and now I have the data.dvc file locally but none of the actual files inside yet. I want to add a file named newfile.txt to the data folder and track it with DVC. Is there a way to do this without pulling the whole content of data locally ?
What I have tried for now:
1. Adding the data folder again:
mkdir data
mv path/to/newfile.txt data/newfile.txt
dvc add data
The data.dvc file is built again from the local state of data which only contains newfile.txt so this doesn't work.
2. Adding the file as a single element in the data folder:
dvc add data/newfile.txt
I get :
Cannot add 'data/newfile.txt', because it is overlapping with other DVC tracked output: 'data'.
To include 'data/newfile.txt' in 'data', run 'dvc commit data.dvc'
3. Using dvc commit as suggested:
mkdir data
mv path/to/newfile.txt data/newfile.txt
dvc commit data.dvc
As in attempt 1, data.dvc is rebuilt from the local state of data.
I clone the repo on another workspace, and now I have the data.dvc file locally but none of the actual files inside yet (haven't dvc pulled). I want to add a file to the data folder and track it with DVC. Is there a way to do this without pulling the whole content of data locally ?
Interesting question. I think there is no easy way to do this right now. On the other machine, if you dvc add data again with only one file in it, DVC will think you deleted all the other files, create a new cached version of the data directory (containing only the new file), and update the .dvc file accordingly (as you discovered).
You could open a feature request in https://github.com/iterative/dvc.org/issues.

Revert a dvc remove -p command

I have just removed a DVC tracking file by mistake using the command dvc remove training_data.dvc -p, which wiped out my entire training dataset. I know that in Git we can easily restore a deleted branch from its hash. Does anyone know how to recover my lost data in DVC?
You are most likely safe (at least the data is not gone). From the dvc remove docs:
Note that it does not remove files from the DVC cache or remote storage (see dvc gc). However, remember to run dvc push to save the files you actually want to use or share in the future.
So, if you created training_data.dvc with dvc add and/or dvc run, and dvc remove -p didn't ask or warn you about anything, it means the data is still cached in .dvc/cache, similar to how Git keeps objects.
There are ways to retrieve it, but I would need a few more details: how exactly did you add your dataset? Did you commit training_data.dvc, or is it completely gone? Was it the only data you have added so far? (Happy to help in the comments.)
Recovering a directory
First of all, here is the document that describes briefly how DVC stores directories in the cache.
What we can do is to find all .dir files in the .dvc/cache:
find .dvc/cache -type f -name "*.dir"
outputs something like:
.dvc/cache/20/b786b6e6f80e2b3fcf17827ad18597.dir
.dvc/cache/00/db872eebe1c914dd13617616bb8586.dir
.dvc/cache/2d/1764cb0fc973f68f31f5ff90ee0883.dir
(If the local cache is lost and we are restoring data from remote storage, the same logic applies, but the commands, e.g. to find files with the .dir extension on S3, look different.)
Each .dir file is a JSON file describing the content of one version of a directory (file names, hashes, etc.). It has all the information needed to restore it. The next step is to work out which one we need. There is no single rule for that; here is what I would recommend checking (pick depending on your use case):
Check the date modified (if you remember when this data was added).
Check the content of those files - if you remember a specific file name that was present only in the directory you are looking for - just grep it.
Try to restore them one by one and check the directory content.
Okay, now let's imagine we decided that we want to restore .dvc/cache/20/b786b6e6f80e2b3fcf17827ad18597.dir, (e.g. because content of it looks like:
[
{"md5": "6f597d341ceb7d8fbbe88859a892ef81", "relpath": "test.tsv"}, {"md5": "32b715ef0d71ff4c9e61f55b09c15e75", "relpath": "train.tsv"}
]
and we want to get a directory with train.tsv).
The only thing we need to do is to create a .dvc file that references this directory:
outs:
- md5: 20b786b6e6f80e2b3fcf17827ad18597.dir
  path: my-directory
(note, that path /20/b786b6e6f80e2b3fcf17827ad18597.dir became a hash value: 20b786b6e6f80e2b3fcf17827ad18597.dir)
And run dvc pull on this file.
That should be it.
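The search-and-restore steps above can be sketched in Python. This is an illustration under the assumptions of this answer (the cache layout and .dir JSON format shown here), not a DVC API: find_dir_entries and dvc_file_text are hypothetical helpers, and the second .dir file in the demo has made-up content.

```python
import json
import tempfile
from pathlib import Path


def find_dir_entries(cache_dir, wanted_relpath):
    """Return hashes of the .dir files that list `wanted_relpath`."""
    matches = []
    for dir_file in sorted(Path(cache_dir).glob("*/*.dir")):
        entries = json.loads(dir_file.read_text())
        if any(entry.get("relpath") == wanted_relpath for entry in entries):
            # The path 20/b786...dir becomes the hash value 20b786...dir.
            matches.append(dir_file.parent.name + dir_file.name)
    return matches


def dvc_file_text(dir_hash, path):
    """Build the minimal .dvc file content shown above."""
    return f"outs:\n- md5: {dir_hash}\n  path: {path}\n"


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp, ".dvc", "cache")
    listings = {
        "20/b786b6e6f80e2b3fcf17827ad18597.dir": [
            {"md5": "6f597d341ceb7d8fbbe88859a892ef81", "relpath": "test.tsv"},
            {"md5": "32b715ef0d71ff4c9e61f55b09c15e75", "relpath": "train.tsv"},
        ],
        # Made-up content for a second directory version (illustration only).
        "00/db872eebe1c914dd13617616bb8586.dir": [
            {"md5": "0" * 32, "relpath": "other.tsv"},
        ],
    }
    for rel, entries in listings.items():
        target = cache / rel
        target.parent.mkdir(parents=True)
        target.write_text(json.dumps(entries))
    matches = find_dir_entries(cache, "train.tsv")

dvc_text = dvc_file_text(matches[0], "my-directory")
print(dvc_text)
```

Writing the printed text to a .dvc file (any name, for example my-directory.dvc) and running dvc pull on it corresponds to the last two steps described above.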

Yocto: how to remove a layer without rebuild all

I'm playing with a Yocto project that has in its conf/bblayers.conf file the following line:
ADDONSLAYERS += "${@'${OEROOT}/layers/meta-qt5' if os.path.isfile('${OEROOT}/layers/meta-qt5/conf/layer.conf') else ''}"
I partially bitbaked the project but now I want to try to disable the whole meta-qt5 layer.
After commenting out the line above, how do I remove the already built files from the output folder and carry on with the rest?
I tried with bitbake -c cleansstate meta-qt5 but it doesn't work. I guess it works with recipes only, and not with whole layers.
The easiest way to clean a build is to remove the TMPDIR temporary folder (default is <build>/tmp).
That removes the previous compilation results, but they are also kept in the SSTATE_DIR cache folder, so the next build will not rebuild everything: it will reuse the cached results to speed things up.
Then you can clean obsolete entries out of your cache folder with the sstate-cache-management.sh script:
# Example of usage (after sourcing oe-init-build-env)
sstate-cache-management.sh --cache-dir=../sstate-cache -d -y

How to set the build area for rpmbuild per-invocation

I'm modifying an automated build, and want to tell rpmbuild to use a specific build area when invoking it.
This is similar to an existing question, but more specific.
I don't want to run any of the build commands as the root user; the aim is only to have an RPM, not to install anything into the system.
I don't want to require the user to change their dotfiles (e.g. $HOME/.rpmrc); the build should be self-contained and not affect the user's existing settings.
I don't want to hard-code the location into the foo.spec file; that file should be useable as-is if the user wants to build in a different location.
The --buildroot option is not what I need; that sets a pseudo-root filesystem for the make part of the build process, but I need to specify the “build area” for the entire RPM build process.
What I'm looking for is a hypothetical --build-area FOODIR option that can be given to the rpmbuild command, or an equivalent environment variable. It should thus affect just that single invocation of the command and cause it to use a specified user-writable location for its build area.
I've seen references to a _topdir macro that seems to be what I'm talking about, but it doesn't appear to be configurable per invocation.
It would be ideal if rpmbuild could set up its own environment in that location when it needs it, but I don't mind setting up the directories for that per build, since that can be automated as part of the build. The goal is to have that user-writable location exist only for the duration of the build run, and then clean up by deleting that entire location once the RPM file is generated.
It's not documented, but the _topdir macro determines the build area.
So you can set this per-invocation with rpmbuild --define "_topdir ${PWD}/foobar" ... to set the directory to whatever you want.
--define is the key to setting values for any macro, not just _topdir.
The --buildroot option is not what you are looking for. The name is a bit misleading as it is not changing the buildroot but instead is setting the root for the install phase of the build. RPM is basically doing a "make install" as part of the build and is then packing the results of this. The buildroot option allows you to do this install into for example /tmp/myinstallroot.
I recently had to integrate RPM package building into an automated build and had the same problem. What I did was generate a custom .rpmmacros file with %_topdir set appropriately. I then just temporarily changed HOME to the location of that custom .rpmmacros file:
HOME=mytopdir rpmbuild ...
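For an automated build, the --define "_topdir ..." approach can be wrapped in a throwaway directory. Below is a minimal Python sketch, not a definitive recipe: rpmbuild_command is a hypothetical helper, the -bb build mode is an assumption (use whichever mode your build needs), and pre-creating the conventional subdirectories may be unnecessary on your rpm version. The sketch only builds the command line and does not run rpmbuild.

```python
import tempfile
from pathlib import Path

# Conventional layout of an RPM build area.
RPM_SUBDIRS = ("BUILD", "RPMS", "SOURCES", "SPECS", "SRPMS")


def rpmbuild_command(spec_file, topdir):
    """Return an rpmbuild argv that overrides _topdir for one invocation."""
    return [
        "rpmbuild",
        "--define", f"_topdir {topdir}",
        "-bb",  # assumption: build a binary package only
        str(spec_file),
    ]


with tempfile.TemporaryDirectory() as topdir:
    for sub in RPM_SUBDIRS:
        Path(topdir, sub).mkdir()
    demo_cmd = rpmbuild_command("foo.spec", topdir)
    # subprocess.run(demo_cmd, check=True) would go here; copy the RPM out
    # of topdir/RPMS before this block ends, because the directory is
    # deleted on exit.

print(demo_cmd)
```

Passing the arguments as a list avoids shell quoting problems with the space inside the macro definition.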

Resources