How to use different remotes for different folders? - dvc

I want my data and models stored in separate Google Cloud buckets. The idea is that I want to be able to share the data with others without sharing the models.
One idea I can think of is using separate git submodules for data and models. But that feels cumbersome and imposes additional requirements on the end user (e.g. having to run git submodule update).
So can I do this without using git submodules?

You can first add the different DVC remotes you want to use (let's say you call them data and models, each one pointing to a different GC bucket), but don't set any remote as the project's default; this way, dvc push won't work without the -r (or --remote) option.
You would then need to push each directory or file individually to the appropriate remote, like dvc push data/ -r data and dvc push model.dat -r models.
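For illustration, a minimal sketch of that setup (the bucket names are placeholders):

    # add two remotes, neither marked as default (no -d flag)
    dvc remote add data gs://my-data-bucket
    dvc remote add models gs://my-models-bucket

    # push each artifact explicitly to the remote it belongs to
    dvc push data/ -r data
    dvc push model.dat -r models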
Note that a feature request to configure this exists on the DVC repo too. See Specify file types that can be pushed to remote.

Yes, you can use multiple remotes without Git-submodules.
There is a separate command for using data artifacts from external repositories: dvc import http://your-repo datadir. The command brings the data into your repo and keeps a connection to the original repo (to avoid duplicating data across remotes).
In your case, one repository can hold the dataset with its own data remote. A second repo can hold the code and models; it imports the dataset project, while all its models and outputs go to another data remote.
With import, no dvc push -r myremote is needed; a plain dvc push synchronizes data to the proper remote.
EDITED: Simply use one Git repo for the dataset with its data remote (bucket/folder), and import it from another repo that holds the code and models and has its own, separate data remote.
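A rough sketch of that two-repo layout (the repo URL and bucket names are made up):

    # dataset repo: track the data and push it to its own remote
    dvc add data/
    dvc remote add -d dataremote gs://my-data-bucket
    dvc push

    # model/code repo: import the dataset, use a separate default remote
    dvc import https://github.com/me/dataset-repo data
    dvc remote add -d modelremote gs://my-models-bucket
    dvc push    # only models and other outputs go to modelremote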

Related

Is there an alternative to DVC pipelines to create a DAG which is also aware of inputs/outputs to nodes to cache results?

I recently started to use DVC pipelines to create DAG in my application. I work on Machine Learning projects, and I need to experiment a lot with different nodes of my system. For example:
Data preprocessing -> feature extraction -> model training -> model evaluation
Each node produces an output, and the output of each node is used in another node. What DVC allows me to do is to create a pipeline in which I can specify dependencies between nodes. I also use .yaml files to configure the parameters of my application, and you can specify these parameters as dependencies for different nodes too. So, whenever a dependency of a node changes (either configuration parameters or the specified inputs/outputs), DVC is able to detect this and run only the necessary parts of the pipeline. If the dependencies of a particular node haven't changed, DVC can use its cache to skip that step. This is really useful for me, since some nodes take a really long time to execute and don't always need to be re-run (if their dependencies haven't changed).
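For reference, such a chain can be declared roughly like this with dvc stage add (the script, data, and parameter names are made up):

    # each stage lists its dependencies (-d), parameters (-p) and outputs (-o)
    dvc stage add -n preprocess -d preprocess.py -d data/raw -o data/processed python preprocess.py
    dvc stage add -n features -d features.py -d data/processed -o data/features python features.py
    dvc stage add -n train -d train.py -d data/features -p train.epochs -o model.pkl python train.py
    dvc stage add -n evaluate -d evaluate.py -d model.pkl -o metrics.json python evaluate.py

    # only stages whose dependencies changed are re-executed
    dvc repro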
I also started to use hydra to manage my config files, and to be honest, DVC doesn't work well with hydra. It expects a static config file for specifying parameter dependencies, which is tricky to arrange with hydra and complicates things.
My question is: is there any alternative to DVC Pipelines which also goes well with hydra?

Copy latest artifact from one path to another

I'm trying to copy the latest artifact from one path to another using Artifactory API.
POST /api/copy/{srcRepoKey}/{srcFilePath}?to=/{targetRepoKey}/{targetFilePath}[&dry=1][&suppressLayouts=0/1(default)][&failFast=0/1]
Let's say I have a few RPMs named: artifact-1.0-1.rpm, artifact-1.0-2.rpm and artifact-1.0-3.rpm.
How can I automatically copy the latest (third) artifact?
With the next release of JFrog's CLI, planned in a couple of weeks, you'll be able to use SORT and LIMIT in the COPY command.
This will let you target only the latest item/artifact by SORTing by date and LIMITing the result set to 1.
For now, you can use two sequential curl commands to try to accomplish what you're after:
First use an AQL SEARCH with your SORT and LIMIT to retrieve the relevant item's path, and then use your COPY command with that path.
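A rough sketch of those two calls (the server URL, repository keys and credentials are placeholders):

    # 1) AQL search: newest RPM matching the pattern, sorted by creation date, limited to one result
    curl -u user:password -X POST "https://artifactory.example.com/artifactory/api/search/aql" \
         -H "Content-Type: text/plain" \
         -d 'items.find({"repo":"rpm-local","name":{"$match":"artifact-1.0-*.rpm"}}).sort({"$desc":["created"]}).limit(1)'

    # 2) copy the path returned by the search to the target repository
    curl -u user:password -X POST \
         "https://artifactory.example.com/artifactory/api/copy/rpm-local/artifact-1.0-3.rpm?to=/rpm-release/artifact-1.0-3.rpm"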
Note: the CLI's SORT and LIMIT feature has already been checked in to the CLI's dev branch, so if you wish to use a snapshot you can "download and build" the dev branch from GitHub, and then test whether the solution suits you.
I doubt that you can automatically copy all these artifacts in one statement. You can copy the folder, but no regex or pattern can be defined in the copy command.

R Studio - Cloning local repository

I want to create a master repository on our server, from which I can clone a local version onto my computer.
I am using R Studio v0.98.994.
So far, this is what I have tried doing:
Create a folder for the master repository to live in. I do this using 'new project' in R studio, and tell it to make a git repository.
I can then open up another new project, located on my C drive, and use R studio to clone, by telling it to open an existing project and setting the URL as the location of the master project.
However, then when I make changes and commit to my local repository (which works fine) I cannot push to the master repository, I get an error exactly as described in this question: git push fails: `refusing to update checked out branch: refs/heads/master`
So it appears that R Studio creates non-bare repositories?
Now I thought, well okay, I will use git bash to initialise the repository and then connect to that within R studio.
I do so, but cannot then find a way to use that repository in R Studio.
I am very new to Git, so it is entirely probable that this is one of those 'read the instructions' questions, in which case I am very sorry, and could someone possibly point me towards some guidance for this situation? I have spent the better part of a day googling around this error and haven't yet managed to pull together the pieces :( I also apologise; this doesn't feel like a very reproducible question.
It sounds like you are using Windows Git, with a setup on a local Windows machine (C: drive) and a server of some kind, mounted as the S: drive. There are a few things you should be aware of when doing this.
Shared Repositories
If you are intending for multiple people to share the same repository, you want to initialize a shared repository. See the --shared option in git-init for more details. Note that I'm not sure how having your repository on a Windows machine affects the sharing options. If you are just trying to keep your repository in two places, that makes things a lot easier.
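For instance, a shared bare repository could be created roughly like this (the path is a placeholder):

    # create a bare, group-writable repository that several users can push to
    git init --bare --shared=group /s/repos/shared-project.git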
Bare Repositories
Separate from the discussion of sharing is the discussion of bare repositories. If you don't intend to ever work with the files on the server (i.e. it's just going to be a place to push changes so they are safely stored), you could initialize a bare repository. A bare repository contains the database structure of Git, but does not have the actual files in the directory.
A standard Git repository is a directory with a hidden folder in it named .git. This .git folder contains all the various data structures that Git uses to track changes. A bare repository is essentially a folder containing only the contents of .git.
The good thing about a bare repository is that no one can work in the repository itself (since there is no working directory, just the database). This means that no one could log into S: and edit the repository themselves. Instead, they would have to clone the repository, then push their changes back to the origin. The GitGuys have a good article about why this is ideal.
Note that shared repos and bare repos are not dependent or mutually exclusive. As a general practice, if you have a "server repo" from which you pull and to which you push, you should make it bare, regardless of whether the project is shared.
A Non-Shared Workflow
Since it's not clear whether you are sharing or not, and you're on a Windows environment, which I don't know about from a sharing standpoint, I'm going to give you a simple example. Using git-bash, you should be able to change directories to wherever on S: you keep your repositories. Then use git init with the --bare option as described above to initialize a bare repository. Navigate to where you want your repository to live on C:, and then do git clone to get a working copy.
Add a README file or something else so you can do your initial commit, and then commit and do git push origin master to push your changes to the S: repository. Once all that is done, THEN initialize the RStudio Git project. RStudio should defer to your existing configuration, and things should hopefully work.
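A minimal sketch of that sequence from git-bash (the paths are hypothetical):

    # on the server drive: create the bare repository
    cd /s/repos
    git init --bare myproject.git

    # on the local drive: clone it and make the first commit
    cd /c/projects
    git clone /s/repos/myproject.git
    cd myproject
    echo "Initial notes" > README.md
    git add README.md
    git commit -m "Initial commit"
    git push origin master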

Use Fossil for system files?

As a new user of Fossil, I'm curious whether there are any negative implications of using Fossil to store things like /etc and /usr/local/etc files from Unix-like systems such as FreeBSD & OpenBSD. If I'm doing this for multiple systems, I think I'd create a branch for each hostname to track those files.
Q1: Have you done this? Do you prefer a different VCS to handle the system files?
Q2: Lots of changes have happened in Fossil over the years, and I'm curious whether it's possible to restrict who can merge branches into trunk. From reading earlier threads it wasn't possible, but there are two workarounds:
a) tell people not to merge to trunk
b) have people clone and trunk maintainer pick up changes from their repo
System configuration files stored in /etc, /var or /usr/local/etc can generally only be edited by the root user. But since root has complete access to the whole system, a mistaken command there can have dire consequences.
For that reason I generally use another location to keep edited configuration files: a directory in my home directory that I call setup, which is under the control of git. Since I have multiple machines running FreeBSD, each machine gets its own subdirectory. There is a special subdirectory of setup called shared for those configuration files that are used on multiple machines. Maintaining multiple copies of identical files in separate repositories or even branches can be a lot of extra work.
My workflow is the following:
Edit a configuration file in my repository.
Copy it to its proper location.
Test the changes. If problems occur, go back to step 1.
Commit the changes to the revision control system. Copy the committed files to their proper location.
Initially I had a shell script (basically a list of install commands) to install the files for me. But I also wanted to see the differences between the working tree and the installed files.
So for my convenience, I wrote a script called deploy to help me with this. It can tell me which files in the repo are different from the installed files and can show me the differences. It can also install files to their proper locations.
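A very reduced sketch of what such a helper could look like (the file list and layout are assumptions, not the author's actual script):

    #!/bin/sh
    # deploy-like helper: $1 is "diff" or "install"
    # each entry maps a file in the repo to its system location (example entries only)
    FILES="rc.conf:/etc/rc.conf
    sshd_config:/etc/ssh/sshd_config"

    for entry in $FILES; do
        src=${entry%%:*}      # path inside the working tree
        dst=${entry##*:}      # installed location
        case "$1" in
            diff)    diff -u "$dst" "$src" ;;
            install) install -m 644 "$src" "$dst" ;;
        esac
    done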

How to manage multiple alfresco repositories?

Problem description:
I have multiple alfresco installations (development, testing, production) of one project.
I need to copy files under Data Dictionary folder (Scripts, Templates, Web Scripts) from one to another in one direction (development -> testing -> production).
Current solution:
I copy files manually via WebDAV, which is annoying and unreliable (I can forget to copy some).
Desired solution:
I'd like to have a tool which will copy the changed files at my command, once they are ready for the next step. I had an idea that it could internally use a Git repository with a branch for each installation, fetching the files from development and pushing them to testing and production. This way (with Git) it could also support reverting changes.
It looks like quite a common problem, but I wasn't able to find anything about it by googling, so I'm asking here. Does such a tool exist, or is there a better way of managing multiple repositories?
If you have a brand new installation of your development/testing/production Alfresco instances, you could simply migrate the contents of the alf_data directory, which by default contains the database, indexes, content store and backup files. If you need to, you could migrate the "shared" folder too, or at least some files from it, as it may contain Alfresco customizations (custom scripts or similar). Here is a link that helps with the migration steps:
http://wiki.alfresco.com/wiki/System_Migration
Otherwise, if you only need to move a folder from the Data Dictionary, or a set of documents, you could use ACP export/import to achieve that. Here is the wiki page for doing this: http://wiki.alfresco.com/wiki/Export_and_Import
You could do this via FTP. When you want to deploy new changes, you can use a manual client like FileZilla to download the changes from dev and then upload them to test.
But you can also automate FTP, so that a scheduled job checks whether there is anything new on, say, dev and pushes it to test.
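As a sketch of such an automated job (host names, credentials and paths are placeholders, and it assumes the Alfresco FTP interface exposes the Data Dictionary at the path shown):

    # pull newer files from the dev server into a local staging copy
    lftp -u admin,password -e 'mirror --only-newer "/Alfresco/Data Dictionary" ./staging; quit' ftp://dev.example.com

    # push the staged files on to the test server
    lftp -u admin,password -e 'mirror --reverse --only-newer ./staging "/Alfresco/Data Dictionary"; quit' ftp://test.example.com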
If you use Git for source control, you could also do this via git-ftp. Keep a copy of the Data Dictionary in your source folder, then add some sort of pre-commit check that sees whether you changed any of those files. If you did, the commit will push the change to dev and test.
I think Alfresco's Replication service is suitable for you.
http://wiki.alfresco.com/wiki/Alfresco_Community_3.4.a#Replication
