git-ignore dvc.lock in repositories where only the DVC pipelines are used

I want to use the pipeline functionality of dvc in a git repository. The data is managed otherwise and should not be versioned by dvc. The only functionality which is needed is that dvc reproduces the needed steps of the pipeline when dvc repro is called. Checking out the repository on a new system should lead to an 'empty' repository, where none of the pipeline steps are stored.
Thus, if I understand correctly, there is no need to track the dvc.lock file in the repository. However, adding dvc.lock to the .gitignore file leads to an error message:
ERROR: 'dvc.lock' is git-ignored.
Is there any way to disable the check for dvc.lock in .gitignore for this use case?

This is definitely possible, as DVC features are loosely coupled to one another. You can do pipelining by writing your dvc.yaml file(s), but avoid data management/versioning by using cache: false in the stage outputs (outs field). See also helper dvc stage add -O (big O, alias of --outs-no-cache).
The same goes for initial data dependencies: you can dvc add --no-commit them.
You do want to track dvc.lock in Git though, so that DVC can determine the latest stage of the pipeline associated with the Git commit in every repo copy or branch.
You'll be responsible for placing the right data files/dirs (matching .dvc files and dvc.lock) in the workspace for dvc repro or dvc exp run to behave as expected. dvc checkout won't be able to help you.
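As a rough illustration of the pipeline-only setup described above, the following Python sketch builds a dvc stage add command line that registers every output with -O (--outs-no-cache), so the stage is tracked but its outputs are not stored in the DVC cache. The helper name stage_add_command, the stage name train and the output file model.pkl are hypothetical and chosen only for this example; it assumes the -n, -d and -O flags of dvc stage add.

```python
import shlex


def stage_add_command(name, cmd, deps=(), outs=()):
    """Build the argument list for `dvc stage add` with uncached outputs.

    Hypothetical helper for illustration; it only assembles arguments
    and does not run DVC itself.
    """
    args = ["dvc", "stage", "add", "-n", name]
    for dep in deps:
        args += ["-d", dep]   # dependency: the stage re-runs when it changes
    for out in outs:
        args += ["-O", out]   # -O / --outs-no-cache: listed as output, not cached
    args.append(cmd)          # the stage command, passed as one quoted argument
    return args


command = stage_add_command(
    "train", "python train.py", deps=["train.py"], outs=["model.pkl"]
)
# Print a copy-pasteable shell command line.
print(shlex.join(command))
```

The printed line can be pasted into a shell inside an initialised DVC repository, or the list can be handed to subprocess.run. The resulting dvc.yaml and dvc.lock would still be committed to Git, as recommended above.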

Related

DVC experiment is restoring deleted files

I am using DVC to run experiments in my project using
dvc exp run
Now when I make changes to a file (for example, train.py) and run "dvc exp run", everything goes well.
But when the change is deleting a file (for example, train.py or an image in the data folder), the file is restored as soon as I run "dvc exp run".
How can I stop that from happening?
This is my dvc.yaml:
stages:
  train:
    cmd: python train.py
    deps:
      - train.py
    metrics:
      - metrics.txt:
          cache: false
From the clarifications under the OP it seems that (both train.py and) the data files are controlled by Git.
DVC experiments have to be based on the Git HEAD, so dvc exp run may be doing git checkout HEAD internally, before reproducing the pipeline (dvc.yaml). Any Git-tracked files will be restored.
UPDATE: Looks like this may be a bug. Being tracked in https://github.com/iterative/dvc/issues/6297. Should be fixed soon!

dvc gc and files in remote cache

The DVC documentation for the dvc gc command states that the -r option indicates "Remote storage to collect garbage in", but I'm not sure I understand it correctly. For example, I execute this command:
dvc gc -r myremote
What exactly happens if I execute this command? I have 2 possible answers:
1. dvc checks which files should be deleted, then moves these files to "myremote" and then deletes all these files in local cache but not in remote.
2. dvc checks which files should be deleted and deletes these files both in local cache and "myremote".
Which one of them is correct?
One of the DVC maintainers here.
Short answer: 2. is correct.
A bit of additional information:
Please be careful when using dvc gc. It will clear from your cache all dependencies that are not mentioned in the current HEAD of your Git repository.
We are working on making dvc gc preserve the whole history by default.
So if you don't want to delete files that belong to earlier commits in your history, it would be better to wait for that task to be completed.
[EDIT]
Please see comment below.

Do not re-create repositories after updating

We manage systems and thus manage repositories. We remove repositories which we do not use, present in /etc/yum.repos.d/<file>
Our problem is: after an update/upgrade of the system, CentOS automatically re-creates the repositories which were removed, which is an issue for us.
Question: Is there a command or method to ensure repositories are not re-created after an upgrade on CentOS 7 systems?
Those repositories are created by someone; the OS doesn't recreate them by itself.
Either they are restored by an update of an RPM package such as centos-release, or by an automatic script you set up and run (Ansible?).
I'm not aware of an automatic method to delete a repo, but I see a couple of solutions:
1. Exclude centos-release from the upgradable packages, by adding
exclude=centos-release
to /etc/yum.conf (space-separated list), but this could break some updates;
2. Disable them with:
# yum-config-manager --disable base,updates,extras,centosplus,epel,whatever
(this can be easily scripted and put in a cron job or in your Ansible playbook)
3. Write a small script and place it in /etc/cron.hourly/, e.g. /etc/cron.hourly/wipe_repos, containing:
#!/usr/bin/env bash
rm -f /etc/yum.repos.d/CentOS-Base.repo
or, better:
#!/usr/bin/env bash
yum-config-manager --disable base,updates,extras,centosplus,epel,whatever
I would suggest using solution 2, since the repo files aren't overwritten by updates; instead, the new versions are placed alongside the old ones as .rpmnew files.
This is guaranteed by the flag %config(noreplace) in the source rpm of centos-release, applied to all files in /etc/yum.repos.d/.
You can check this by downloading the .src.rpm and opening the centos-release.spec file.
$ mkdir test && cd test
$ yumdownloader --source centos-release
$ rpm2cpio centos-release*.rpm | cpio -idmv
$ cat centos-release.spec
(or search for the package online and download the src.rpm)
Then scroll down to section %files and you'll notice:
%config(noreplace) /etc/yum.repos.d/*
%config(noreplace) means that those files are not replaced by new files from an update; instead, the files from the new rpm are saved with the extension .rpmnew, so you'll have:
$ ls /etc/yum.repos.d/
CentOS-Base.repo <-- here you set them as disabled
CentOS-Base.repo.rpmnew <-- this comes from the update, but yum will ignore it
For reference, see http://people.ds.cam.ac.uk/jw35/docs/rpm_config.html or https://serverfault.com/a/48819/.
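The "disable instead of delete" idea can also be sketched in Python for cases where yum-config-manager is not available. This is only a minimal sketch under stated assumptions: the function name disable_repos is hypothetical, the .repo files are assumed to be plain INI files that Python's configparser can read, and rewriting a file this way drops any comment lines it contained.

```python
import configparser
from pathlib import Path


def disable_repos(repo_dir, repo_ids):
    """Set enabled=0 for the given repo ids in every *.repo file in repo_dir.

    Hypothetical helper: returns the sorted list of repo ids it changed.
    Note: configparser does not preserve comments when writing back.
    """
    changed = []
    for repo_file in sorted(Path(repo_dir).glob("*.repo")):
        # interpolation=None keeps values such as $releasever or % untouched.
        parser = configparser.ConfigParser(interpolation=None)
        parser.optionxform = str  # keep option names exactly as written
        parser.read(repo_file)
        touched = False
        for section in parser.sections():
            # A repo without an explicit "enabled" option counts as enabled.
            if section in repo_ids and parser.get(section, "enabled", fallback="1") != "0":
                parser.set(section, "enabled", "0")
                changed.append(section)
                touched = True
        if touched:
            with open(repo_file, "w") as handle:
                parser.write(handle, space_around_delimiters=False)
    return sorted(changed)
```

A call such as disable_repos("/etc/yum.repos.d", {"base", "updates"}) would need root privileges. Because the files are edited rather than removed, a later centos-release update should leave them alone and drop its own copies next to them as .rpmnew files, as described above.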
As I already said in the comments below the question, the reason why those repositories keep reappearing after an update is quite simple: the files defining the system repositories are owned by the package centos-release and whenever this package gets updated or reinstalled, the repositories reappear.
The package centos-release is a very basic package: it provides the capabilities redhat-release and system-release, and a number of other basic packages depend on it.
[local ~]$ rpm -q --provides centos-release
centos-release = 7-6.1810.2.el7.centos
centos-release(upstream) = 7.6
centos-release(x86-64) = 7-6.1810.2.el7.centos
config(centos-release) = 7-6.1810.2.el7.centos
redhat-release = 7.6-1
system-release = 7.6-1
system-release(releasever) = 7
[local ~]$ rpm -q --whatrequires system-release
setup-2.8.71-10.el7.noarch
grubby-8.28-25.el7.x86_64
[local ~]$ rpm -q --whatrequires redhat-release
initscripts-9.49.46-1.el7.x86_64
systemd-219-62.el7_6.5.x86_64
There is no easy way out of this.
But one possible solution might be to create a customized RPM package to replace centos-release. It should contain the pointers to your own repositories and of course needs to provide the capabilities redhat-release and system-release.
Please be aware that I have no idea if this is actually going to work, it's just something that came to my mind while thinking about the problem. It might save you the work of creating a full custom distribution derived from CentOS, which is the only other way I can think of to achieve what you seem to want.
My solution doesn't exactly solve the problem you request ("how do I delete default repository config files forever?"), but it does stabilize your config changes. If you zero out the files instead of deleting them, then system updates will leave your 'edited' versions unchanged.
I do feel that this is a 'hack', leaving named ghost files, but it's one I can live with. No need to disable or customize redhat-release or system-release.
My problem was slightly different than yours - I maintained different configs for the same repositories for different situations, indicated by filename. On updates the original files would return, leaving me with redundant and incorrect definitions. Now they don't.

Git merge results in 400 rename/rename conflicts, how do I resolve them quickly?

So, I have a number of WordPress sites managed with a Git repository, all of which are branches off of a central upstream Git repository. I recently applied a bunch of updates to the parent repo, but one of the child website repos had a plugin updated to a different version and now throws up about 400 rename/rename conflicts. All of these conflicts are in an upstream plugin directory that would be safe to just resolve in favor of the upstream branch.
I want to do the following:
Ensure the upstream version of the files 'wins' the merge conflict (e.g. what the --theirs flag does with checkout)
Produce a mergeable history (If it's not safe for a coworker to type "git pull origin master" with an old repo, it's not an option. I'm religiously opposed to rebasing.)
Not restructure my Git repository (My hosting provider, Pantheon, will not install Composer dependencies at deploy time. Upstream plugins have to be part of the repo.)
Not get a repetitive stress injury (Has to be a reasonably small number of commands because I have to resolve these kinds of messes once a month or so.)
If I just type "git checkout wp-content/plugins/** --theirs", I get hit in the face with about 400 errors, and Git refuses to checkout the files. They look like this:
....400 or so errors omitted...
error: path 'wp-content/plugins/wordpress-seo/js/dist/wp-seo-quick-edit-handler-710.min.js' does not have their version
error: path 'wp-content/plugins/wordpress-seo/js/dist/wp-seo-quick-edit-handler-720.min.js' does not have their version
error: path 'wp-content/plugins/wordpress-seo/js/dist/wp-seo-recalculate-710.min.js' does not have their version
error: path 'wp-content/plugins/wordpress-seo/js/dist/wp-seo-recalculate-720.min.js' does not have their version
I categorically refuse to type 400 git rm/git add commands with each individual path included. git checkout --force is not an option, as --theirs and --force are mutually incompatible (for some reason). My current solution is to open Git GUI and manually right-click -> Use Remote Version and then click Yes... 400 times. I don't have to type the path at least but this is still time consuming.
How do I efficiently resolve a large number of rename/rename conflicts in favor of the remote repository?
Do you want to just resolve the conflicted files in favour of the remote, or just take a whole tree as it is in the remote?
For the latter, you could do this:
1. Just accept the files as-is with conflicts: git add . or similar.
2. Commit the merge.
3. rm -Rf path/in/question
4. git checkout origin/branch -- path/in/question
5. git commit --amend -a
For the former, it's probably something pretty similar:
1. Just accept the files as-is with conflicts: git add . or similar.
2. Commit the merge.
3. Find files with conflicts, e.g. grep -r -l '>>>>' path/in/question > /tmp/conflicts.txt
4. Delete the files with conflicts, check out the desired versions, and amend the commit in the same way as above.
(If there are files/paths with spaces in them, small adjustments to the above commands may be necessary. I've given the simpler versions for clarity.)
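For paths with spaces, the "find files with conflicts" step can be sketched in Python instead of grep, since it returns path objects rather than whitespace-separated text. This is a minimal sketch: the function name files_with_conflict_markers is hypothetical, and it assumes Git's default marker format, where a line starting with seven '>' characters and a space closes a conflict hunk.

```python
from pathlib import Path

# Git's default closing conflict marker: seven '>' followed by a space.
CLOSING_MARKER = b">>>>>>> "


def files_with_conflict_markers(root):
    """Return a sorted list of files under root that contain a conflict marker.

    Hypothetical helper: files are read as bytes so binary or oddly
    encoded files do not raise decoding errors.
    """
    hits = []
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            with open(path, "rb") as handle:
                if any(line.startswith(CLOSING_MARKER) for line in handle):
                    hits.append(path)
        except OSError:
            # Unreadable files are skipped rather than aborting the scan.
            continue
    return hits
```

The returned paths could then be removed and checked out again from the desired branch, followed by git commit --amend -a, as in the steps above.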

Capifony deploy runs some commands against previous release

I'm running a Capifony deployment. However, I notice that Capifony's built-in commands are running against the previous release, whereas my custom commands are correctly targeting the current release.
For example, if I run cap -d staging deploy, I see some commands output like this (linebreaks added):
--> Updating Composer.......................................
Preparing to execute command: "sh -c 'cd /home/myproj/releases/20130924144349 &&
php composer.phar self-update'"
Execute ([Yes], No, Abort) ? |y|
You'll see that this is referring to my previous release - from 2013.
I also see commands referring to this new release's folder (from 2014):
--> Running migrations......................................
Preparing to execute command: "/home/myproj/releases/20140219150009/
app/console doctrine:migrations:migrate --no-interaction"
Execute ([Yes], No, Abort) ? |y|
In my commands, I use the #{release_path} variable, whereas looking at Capifony's code, it's using #{latest_release}. But obviously I can't change Capifony's code.
This issue against Capistrano talks about something similar, but I don't think it really helps, as again I can't change Capifony's code.
If I delete my releases folder on the server, I have a similar problem - #{latest_release} doesn't have any value, so it attempts to do things like create a folder /app/cache (since the code is something like mkdir -p #{latest_release}/app/cache).
(Assuming I don't delete the current symlink and the release folder, the specific error I see is when it fails to copy vendors: cp: cannot copy a directory, /home/myproj/current/vendor, into itself. However, this is just the symptom of the bigger problem - if it thinks the new release is actually the previous one, that explains why current also points there!)
Any ideas? I'm happy to provide extracts from my deploy.rb or staging.rb (I'm using the multistage extension) but didn't just want to dump in the whole thing, so let me know what you're interested in! Thanks
I finally got to the bottom of this one!
I had a step set to run before deployment:
before "deploy", "maintenance:enable"
This maintenance step (correctly) sets up maintenance mode on the existing site (in the example above, my 2013 one).
However, the maintenance task was referring to the previous release by using the latest_release variable. Since the step was running before deployment, latest_release did indeed refer to the 2013 release. However, once latest_release has been used, its value is set for the rest of the deployment run - so it remained set to the 2013 release!
I therefore resolved this by changing the maintenance code so that it didn't use the latest_release variable. I used current_release instead (which doesn't seem to have this side-effect). However, another approach would be to define your own variable which gets its value in the same way as latest_release would:
set :prev_release, exists?(:deploy_timestamped) ? release_path : current_release
I worked out how latest_release was being set by looking in the Capistrano code. In my environment, I could find this by doing bundle show capistrano (since it was installed with bundler), but the approach will differ for other setups.
Although the reason for my problem was quite specific, my approach may help others: I created an entirely vanilla deployment following the Capifony instructions and gradually added in features from my old deployment until it broke!
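The side effect described above, where a lazily evaluated variable keeps whatever value it had the first time it was read, can be illustrated with a small Python analogy. This is not Capistrano's code: the class LazySettings and its set/fetch methods are hypothetical and only mimic the behaviour observed in this answer, using the two release timestamps from the question.

```python
class LazySettings:
    """Toy settings store: values are computed on first fetch, then cached."""

    def __init__(self):
        self._factories = {}
        self._cache = {}

    def set(self, name, factory):
        # Store a callable; nothing is evaluated yet.
        self._factories[name] = factory
        self._cache.pop(name, None)

    def fetch(self, name):
        # Evaluate the callable only once and remember the result.
        if name not in self._cache:
            self._cache[name] = self._factories[name]()
        return self._cache[name]


releases = ["20130924144349"]
settings = LazySettings()
settings.set("latest_release", lambda: releases[-1])

# A hook that runs before the deploy reads the variable early...
early = settings.fetch("latest_release")
# ...then the deploy creates the new release directory...
releases.append("20140219150009")
# ...but later tasks still see the cached, older value.
late = settings.fetch("latest_release")
print(early, late)
```

In this toy model, both reads return the 2013 release, which mirrors why the built-in tasks kept targeting the previous release once the maintenance step had touched latest_release.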

Resources