DVC experiment is restoring deleted files - dvc

I am using DVC to run experiments in my project using
dvc exp run
When I make changes to a file (for example train.py) and run "dvc exp run", everything goes well.
But when the change is deleting a file (for example train.py, or an image in the data folder), the file is restored as soon as I run "dvc exp run".
How do I stop that from happening?
This is my dvc.yaml:
stages:
  train:
    cmd: python train.py
    deps:
      - train.py
    metrics:
      - metrics.txt:
          cache: false

From the clarifications under the OP, it seems that both train.py and the data files are tracked by Git.
DVC experiments have to be based on the Git HEAD, so dvc exp run may be doing a git checkout HEAD internally before reproducing the pipeline (dvc.yaml). Any Git-tracked files will be restored.
UPDATE: Looks like this may be a bug. Being tracked in https://github.com/iterative/dvc/issues/6297. Should be fixed soon!
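Until that fix lands, a possible workaround (a sketch, assuming the deletion itself should be part of the experiment's baseline) is to commit the deletion to Git first, so the file is no longer in HEAD when the experiment runs:

git rm data/old-image.png            # hypothetical file name; any Git-tracked file works the same way
git commit -m "Remove old-image.png"
dvc exp run                          # the experiment is now based on the new HEAD, without the file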

Related

git-ignore dvc.lock in repositories where only the DVC pipelines are used

I want to use the pipeline functionality of dvc in a git repository. The data is managed otherwise and should not be versioned by dvc. The only functionality which is needed is that dvc reproduces the needed steps of the pipeline when dvc repro is called. Checking out the repository on a new system should lead to an 'empty' repository, where none of the pipeline steps are stored.
Thus - if I understand correctly - there is no need to track the dvc.lock file in the repository. However, adding dvc.lock to the .gitignore file leads to an error message:
ERROR: 'dvc.lock' is git-ignored.
Is there any way to disable the dvc.lock .gitignore check for this use case?
This is definitely possible, as DVC features are loosely coupled to one another. You can do pipelining by writing your dvc.yaml file(s), but avoid data management/versioning by using cache: false in the stage outputs (outs field). See also the helper dvc stage add -O (capital O, an alias of --outs-no-cache).
The same goes for initial data dependencies: you can dvc add --no-commit them.
You do want to track dvc.lock in Git though, so that DVC can determine the latest state of the pipeline associated with the Git commit in every repo copy or branch.
You'll be responsible for placing the right data files/dirs (matching the .dvc files and dvc.lock) in the workspace for dvc repro or dvc exp run to behave as expected; dvc checkout won't be able to help you.
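For example, a minimal command-line sketch under those assumptions (the file and stage names here are hypothetical):

# track a raw input without copying it into the DVC cache
dvc add --no-commit data/raw.csv

# define a stage whose output is not cached/versioned by DVC
dvc stage add -n train \
    -d train.py -d data/raw.csv \
    -O model.pkl \
    python train.py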

dvc gc and files in remote cache

The dvc documentation for the dvc gc command states that the -r option indicates "Remote storage to collect garbage in", but I'm not sure I understand it correctly. For example, I execute this command:
dvc gc -r myremote
What exactly happens if I execute this command? I have 2 possible answers:
1. dvc checks which files should be deleted, then moves these files to "myremote" and deletes them from the local cache but not from the remote.
2. dvc checks which files should be deleted and deletes these files both in the local cache and in "myremote".
Which one of them is correct?
One of the DVC maintainers here.
Short answer: 2. is correct.
A bit of additional information:
Please be careful when using dvc gc. It will remove from your cache everything that is not referenced in the current HEAD of your Git repository.
We are working on making dvc gc preserve the whole history by default.
So if you don't want to delete files belonging to older commits, it would be better to wait until that work is complete.
[EDIT]
Please see comment below.
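For reference, a hedged usage sketch with more recent DVC versions, where the remote is only touched when the --cloud (-c) flag is passed alongside -r (check dvc gc --help for your version):

# keep only data referenced in the current workspace; delete everything else
# from the local cache and from the "myremote" remote
dvc gc --workspace --cloud -r myremote

# keep data referenced by any commit, so the whole Git history is preserved
dvc gc --all-commits --cloud -r myremote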

Is it possible to put a readme file for R code on github, that displays output?

Recently I participated in the #100daysofmlcode challenge on LinkedIn, started by Siraj Raval. I do all of my coding in R, but when I push an R Markdown file or a readme file for my R code to GitHub, it doesn't show the output generated from the code. This makes it really difficult for viewers to follow the explanation. Is there a way to display both the code and its output, so that it becomes easier for readers to understand? I know they can pull my changes from GitHub and view them on their local machines, but considering everyone's time limitations, I would still like to know if there is a way to display both the R code and its output in a readme file on GitHub.
Thank you
GitHub is just a server; it can't process your R Markdown file. Two strategies are:
1. Call your file README.Rmd, and run render() on it to generate a README.md file that contains the output, then push both to GitHub.
2. Set up a continuous integration service like Travis-CI and instruct it to render your README and push the result back to GitHub.
The first option is easiest from a technical setup perspective - you just have to render().
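For the first option, a minimal sketch of the local workflow (the output_format argument forces a Markdown file that GitHub can display; omit it if your README.Rmd header already sets github_document):

Rscript -e 'rmarkdown::render("README.Rmd", output_format = "github_document")'
git add README.Rmd README.md
git commit -m "Render README"
git push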
The second option is more convenient but requires some setup in your repo, configuring Travis to build (but not build on its own commits), and setting up credentials on Travis to do the push back to GitHub. To do this you'll need a .travis.yml file that looks something like:
language: r
script:
  - bash renderreadme.sh
And a bash script file in your repo called renderreadme.sh:
#!/bin/bash
set -o errexit -o nounset

renderreadme(){
  ## Set up repo parameters
  git init
  git config user.name "your_github_username"
  git config user.email "your_email@example.com"
  git config --global push.default simple

  ## Add the authenticated GitHub remote and check out the current branch
  git remote add upstream "https://$GH_TOKEN@github.com/$TRAVIS_REPO_SLUG.git"
  git fetch upstream
  git checkout master

  ## Knit the README and push the result back (skipping CI on the commit)
  Rscript -e 'rmarkdown::render("README.Rmd")'
  git add README.md
  git commit -m "knit README [skip ci]"
  git push
}

renderreadme
And you'll need to use the travis client (or something equivalent) to store the secure GitHub credentials needed for the git push operation in that script to succeed. The general guidance in "Building an R Project" for Travis will be useful for these general configuration aspects.
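A hedged sketch of that step with the travis client (the GH_TOKEN name matches the script above; the exact invocation can vary between client versions):

# encrypt a GitHub personal access token and store it in .travis.yml
# as a secure global environment variable
travis encrypt GH_TOKEN=<your_personal_access_token> --add env.global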
In Markdown, you use ` ` for inline code and ``` ``` for blocks (multiple lines) of code.

Update deployed meteor app while running with minimum downtime - best practice

I run my meteor app on EC2 like this: node main.js (in tmux session)
Here are the steps I use to update my meteor app:
1) meteor bundle app.tgz
2) scp app.tgz EC2-server:/path
3) ssh EC2-server and attach to tmux
4) kill the current meteor-node process by C-c
5) extract app.tgz
6) run "node main.js" of the extracted app.tgz
Is this the standard practice?
I realize forever can be used too, but do I still have to kill the old Node process and start a new one every time I update my app? Can the upgrade be more seamless, without killing the Node process?
You can't do this without killing the node process, but I haven't found that it really matters. What's actually more annoying is the browser refresh on the client, but there isn't much you can do about that.
First, let's assume the application is already running. We start our app via forever with a script like the one in my answer here. I'd show you my whole upgrade script but it contains all kinds of Edthena-specific stuff, so I'll outline the steps we take below:
Build a new bundle. We do this on the server itself, which avoids any missing fibers issues. The bundle file is written to /home/ubuntu/apps/edthena/edthena.tar.gz.
We cd into the /home/ubuntu/apps/edthena directory and rm -rf bundle. That will blow away the files used by the currently running process. Because the server is still running in memory, it will keep executing. However, this step is problematic if your app regularly does uncached disk operations, like reading from the private directory after startup. We don't, and all of the static assets are served by nginx, so I feel safe doing this. Alternatively, you can move the old bundle directory to something like bundle.old and it should work.
tar xzf edthena.tar.gz
cd bundle/programs/server && npm install
forever restart /home/ubuntu/apps/edthena/bundle/main.js
There really isn't any downtime with this approach - it just restarts the app in the same way it would if the server threw an exception. Forever also keeps the environment from your original script, so you don't need to specify your environment variables again.
Finally, you can have a look at the log files in your ~/.forever directory. The exact path can be found via forever list.
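Put together, a condensed sketch of those steps (the /home/ubuntu/apps/edthena paths are just the example layout from above):

cd /home/ubuntu/apps/edthena
rm -rf bundle                               # the running process keeps executing from memory
tar xzf edthena.tar.gz                      # unpack the freshly built bundle
(cd bundle/programs/server && npm install)  # install the server dependencies
forever restart /home/ubuntu/apps/edthena/bundle/main.js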
David's method is better than this one, because there's less downtime when using forever restart compared to forever stop; ...; forever start.
Here's the deploy script spelled out, using the latter (stop/start) technique. In ~/MyApp, I run this bash script:
echo "Meteor bundling..."
meteor bundle myapp.tgz
mkdir ~/myapp.prod 2> /dev/null
cd ~/myapp.prod
forever stop bundle/myapp.js
rm -rf bundle
echo "Unpacking bundle"
tar xzf ~/MyApp/myapp.tgz
mv bundle/main.js bundle/myapp.js
# `pwd` is there because ./myapp.log would create the log in ~/.forever/myapp.log actually
PORT=3030 ROOT_URL=http://myapp.example.com MONGO_URL=mongodb://localhost:27017/myapp forever -a -l `pwd`/myapp.log start bundle/myapp.js
You're asking about best practices.
I'd recommend mup and cluster
They allow for horizontal scaling, and a bunch of other nice features, while using simple commands and configuration.

Capifony deploy runs some commands against previous release

I'm running a Capifony deployment. However, I notice that Capifony's in-built commands are running against the previous release, whereas my custom commands are correctly targeting the current release.
For example, if I run cap -d staging deploy, I see some command output like this (line breaks added):
--> Updating Composer.......................................
Preparing to execute command: "sh -c 'cd /home/myproj/releases/20130924144349 &&
php composer.phar self-update'"
Execute ([Yes], No, Abort) ? |y|
You'll see that this is referring to my previous release - from 2013.
I also see commands referring to this new release's folder (from 2014):
--> Running migrations......................................
Preparing to execute command: "/home/myproj/releases/20140219150009/
app/console doctrine:migrations:migrate --no-interaction"
Execute ([Yes], No, Abort) ? |y|
In my commands, I use the #{release_path} variable, whereas looking at Capifony's code, it's using #{latest_release}. But obviously I can't change Capifony's code.
This issue against Capistrano talks about something similar, but I don't think it really helps, as again I can't change Capifony's code.
If I delete my releases folder on the server, I have a similar problem - #{latest_release} doesn't have any value, so it attempts to do things like create a folder /app/cache (since the code is something like mkdir -p #{latest_release}/app/cache).
(Assuming I don't delete the current symlink and the release folder, the specific error I see is when it fails to copy vendors: cp: cannot copy a directory, /home/myproj/current/vendor, into itself. However, this is just the symptom of the bigger problem - if it thinks the new release is actually the previous one, that explains why current also points there!)
Any ideas? I'm happy to provide extracts from my deploy.rb or staging.rb (I'm using the multistage extension) but didn't just want to dump in the whole thing, so let me know what you're interested in! Thanks
I finally got to the bottom of this one!
I had a step set to run before deployment:
before "deploy", "maintenance:enable"
This maintenance step (correctly) sets up maintenance mode on the existing site (in the example above, my 2013 one).
However, the maintenance task was referring to the previous release by using the latest_release variable. Since the step was running before deployment, latest_release did indeed refer to the 2013 release. However, once latest_release has been used, its value is set for the rest of the deployment run - so it remained set to the 2013 release!
I therefore resolved this by changing the maintenance code so that it didn't use the latest_release variable. I used current_release instead (which doesn't seem to have this side-effect). However, another approach would be to define your own variable which gets its value in the same way as latest_release would:
set :prev_release, exists?(:deploy_timestamped) ? release_path : current_release
I worked out how latest_release was being set by looking in the Capistrano code. In my environment, I could find this by doing bundle show capistrano (since it was installed with bundler), but the approach will differ for other setups.
Although the reason for my problem was quite specific, my approach may help others: I created an entirely vanilla deployment following the Capifony instructions and gradually added in features from my old deployment until it broke!
