DVC dependencies for derived data without imports - dvc

I am new to DVC, and so far I like what I see. But possibly my question is fairly easy to answer.
My question: how do we correctly track dependencies on files in an original hugedatarepo (let's assume that this can also change) from a derivedData project, but WITHOUT the huge files generally being imported when the derived data is checked out? I don't think I can use dvc import to achieve this.
Details: We have a repository with a large amount of quite big data files (scans) and use this data to design and train various algorithms. Often we want to use only specific files, and even only small chunks from within those files, for training, annotation and so on. That is, we derive data for specific tasks that we want to put in new repositories.
Currently my idea is to dvc get the relevant data, put it in an untracked temporary folder, and then manage the derived data with DVC again, while still recording a dependency on the original data.
hugeFileRepo
+-- metaData.csv
+-- dataFolder
    +-- hugeFile_1
    ...
    +-- hugeFile_n
In the derivedData repository I do:
dvc import hugeFileRepo.git metaData.csv
dvc run -f derivedData.dvc \
        -d metaData.csv \
        -d deriveData.py \
        -o derivedDataFolder \
        python deriveData.py
My deriveData.py does something along these lines (pseudocode):
metaData = read("metaData.csv")
# Hack because I don't know how to do it right:
gitRevision = getGitRevision("metaData.csv.dvc")
...
for metaDataForFile, file in metaData:
    if iWantFile(metaDataForFile):
        # download the specific file
        !dvc get --rev {gitRevision} -o tempFolder/{file} hugeFileRepo.git {file}
        # process the huge file and store the result in derivedDataFolder
        processAndWrite(f"tempFolder/{file}")
So I use the metaData file as a proxy for the actual data. The hugeFileRepo data will not change frequently, and the metaData file will be kept up to date. I am also absolutely fine with having a dependency on the data in general rather than on the actual files I used. So I believe this solution would work for me, but I am sure there is a better way.

This is not a very specific answer because I'm not sure I understand the details and setup completely, but in general here are some ideas:
We have a repository with a large amount of quite big data files (scans)... Often we want to use only specific files and even only small chunks
DVC commands that accept a target data file should support granularity (see https://github.com/iterative/dvc/issues/2458), meaning you can import only specific files from a tracked directory. As for chunks, there's no way for DVC to import only certain parts of files from the CLI; that would require semantically understanding every possible data format.
we derive data for specific tasks
Looking at this step as a proper DVC stage (derivedData.dvc) seems like the best approach here, and this stage depends on the full original data files (again, no way for DVC to know in advance what parts of the data the source code will actually use).
Since you're using Python though, there is an API to open and stream data from online DVC repos directly into your program at runtime, so deriveData.py could do that, without needing to import or download anything beforehand. See https://dvc.org/doc/api-reference
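For example, a minimal sketch using dvc.api.open (the repo URL, revision and file path below are placeholders for your setup; whether you can stream partial content depends on your data format):

import dvc.api

# Stream one huge file straight from the remote DVC repo at a given revision,
# without importing or downloading it into the derivedData workspace first.
with dvc.api.open(
        'dataFolder/hugeFile_1',                      # path inside hugeFileRepo
        repo='https://example.com/hugeFileRepo.git',  # placeholder URL
        rev='master',                                 # tag/branch/commit (optional)
        mode='rb') as f:
    chunk = f.read(1024 * 1024)  # read only the part you actually need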
Sorry, I don't think I understand the intention of the last code sample, where git revisions are being used, or its relationship with the main question.

Related

Data version control (DVC): editing files in place results in a cyclic dependency

We have a larger dataset and several preprocessing scripts.
These scripts alter data in place.
When I try to register them with dvc run, it complains about cyclic dependencies (the input is the same as the output).
I would assume this is a very common use case.
What is the best practice here?
I tried to google around but did not see any solution to this (besides creating another folder for the output).
Usually we split input and output into separate files rather than modifying everything in place, not only for separation-of-concerns reasons but also to make the workflow fit tools like DVC.
I hope you can try this approach instead.
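For illustration, a minimal Python sketch of that split (the directory names data/raw and data/processed are placeholders): the script never overwrites its own inputs, so the DVC stage can declare the input directory as a dependency and the output directory as an output without a cycle.

from pathlib import Path

RAW = Path("data/raw")              # declared as a dependency (-d) of the stage
PROCESSED = Path("data/processed")  # declared as an output (-o) of the stage

PROCESSED.mkdir(parents=True, exist_ok=True)

for src in RAW.glob("*.csv"):
    cleaned = src.read_text().strip()         # stand-in for real preprocessing
    (PROCESSED / src.name).write_text(cleaned)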

Unix touch command usage

I know you can use touch to create a new empty file.
I just learned that touch can be used to update the access and modification time of a file. I don't quite know in what situations and why you would need to update the access and modification time of a file, i.e. what the usefulness of this particular function is.
Thanks!
Some utilities depend on the timestamp of a file.
For example, make uses timestamps to decide whether it needs to do something (usually rebuild), based on the timestamps of the source code and of the outputs (executables, object files, ...).
By touching the source file and then running make, you can force a rebuild.
In addition, touch has a -d option that can fake the modification time.
If you know what you're doing, you can avoid long build times due to unnecessary recompilations.
For example, when adding a declaration to a common header file
that does not change any existing API, you can fake the header's real modification time
and bypass the Makefile's dependencies.
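To make the mechanism concrete, here is a small Python sketch of the same idea; the file names source.c, header.h and app.o are hypothetical and must exist for os.utime to succeed.

import os

def needs_rebuild(source: str, target: str) -> bool:
    # make's rule: rebuild if the target is missing or older than the source.
    if not os.path.exists(target):
        return True
    return os.path.getmtime(source) > os.path.getmtime(target)

# Equivalent of `touch source.c`: set the timestamps to "now" to force a rebuild.
os.utime("source.c")

# Rough equivalent of `touch -d <old date> header.h`: give the header an old
# timestamp so make believes nothing changed.
old = os.path.getmtime("app.o") - 1
os.utime("header.h", (old, old))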

Convenient diff of patch files in Gerrit

My team uses Gerrit for reviewing changes, and sometimes we have to push a .patch file for some files. These .patches can reach over ~1000 lines (which is obviously not good for reviewing). It is very inconvenient to review them as a diff between the patches themselves. In some cases it would be better to review the diff between the original file and the original file with the patch applied (not the diff between the .patches). Even if the original file isn't under version control, the committer could attach it with the patch set, right?
Unfortunately, after a lot of googling I didn't find anything...
Is there any way to show the diff between two patch files (not patch sets) in this way, or something similar?
I think the best you can do is to find a three-way merge tool to compare
the original source;
the source with the old patch; and
the source with the new patch.
To use such a tool, I think you'd need to use the command lines the Gerrit interface provides to fetch the changes locally, then apply the patches, then use the merge tool for the review.
I don't think you'll find a way to get a "three-way merge" view of the code in the Gerrit UI itself.
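If you want to script that locally, here is a hedged sketch in Python, assuming the standard patch utility and a three-way-capable tool such as meld are installed; all file names are placeholders for your own original file and the two .patch files.

import shutil
import subprocess

# Make two copies of the original file, one per patch.
shutil.copy("original.c", "original_with_old_patch.c")
shutil.copy("original.c", "original_with_new_patch.c")

# Apply each .patch to its own copy (patch <file> <patchfile>).
subprocess.run(["patch", "original_with_old_patch.c", "old.patch"], check=True)
subprocess.run(["patch", "original_with_new_patch.c", "new.patch"], check=True)

# Open the three versions side by side for the review.
subprocess.run(["meld", "original.c",
                "original_with_old_patch.c", "original_with_new_patch.c"])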

u2 or uniVerse code to iterate through folders,subfolders and files to check their permissions

I am new to UniVerse and I have to write a UniVerse program which will check the permissions of folders, subfolders and files. For example, we have a folder called A, a subfolder A1, and files in A1. Now I have to check whether their permissions are set correctly.
Let's say the files in subfolder A1 are supposed to be rwxrwxr-x (775) but they are rwxrwxrwx (777); based on this I need to report that the files in folder A1 are not set correctly.
So far, a little push/ideas/references/code snippets etc. would really help.
Thanks a million in advance for your help.
I'm more of a UniData person but in UniVerse Basic it looks like you could leverage the STATUS command which returns a dynamic array that contains the UNIX permissions in numeric form (e.g. 777).
More information available in the UniVerse Basic reference manual here: http://www.rocketsoftware.com/u2/products/universe/resources/technical-manuals/universe-11.1.11-documentation/basicref-v11r1.pdf
One of the best resources for UniVerse and RetrieVe questions is this site: http://www.mannyneira.com/universe/
If your system allows it, you should look into trying to write a script that executes via Shell. You can write UniVerse scripts that duck in and out of Shell. According to the UniVerse and Linux page on that site, you should be able to access it via the SH command.
When writing a Shell program to interact with UniVerse, you're typically going to want to use uvsh to output the data and then pipe it to something else (such as col) to manipulate it. If you pass a string to the uvsh command, it will execute it - so you can pass it commands to read file data (such as from voc pointers).
Keep in mind that every time you run the SH or uvsh command, you're nesting another shell within your current one, not switching between them.
However, it sounds like the file permission information you're interested in could be handled purely on the Shell side of things...
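As an illustration of that shell-side logic (not UniVerse BASIC), here is a small Python sketch; the root folder "A" and the expected mode 775 are placeholders.

import os
import stat

EXPECTED = 0o775  # rwxrwxr-x

def report_bad_permissions(root):
    # Walk the folder tree and flag anything whose mode differs from EXPECTED.
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            mode = stat.S_IMODE(os.stat(path).st_mode)
            if mode != EXPECTED:
                print(f"{path}: expected {oct(EXPECTED)}, found {oct(mode)}")

report_bad_permissions("A")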

Maintaining same piece of code in multiple files

I have an unusual environment in a project where I have many files that are each independent standalone scripts. All of the code required by the script must be in the one file and I can't reference outside files with includes etc.
There is a common function in all of these files that does authorization and that is the last function in each file. If this function changes at all (as it does now and then) it has to be changed in all the files, and there are plenty of them.
Initially I was thinking of keeping the authorization function in a separate file and running a batch process that produced the final files by combining the auth file with each of the others. However, this is extremely cumbersome when debugging because the auth function needs to be in the main file for this purpose. So I'd always be testing and debugging in the folder with the combined file and then have to copy changes back to the uncombined files.
Can anyone think of a way to solve this problem, i.e. to maintain an identical fragment of code in multiple files?
I'm not sure what you mean by "the auth function needs to be in the main file for this purpose", but a typical Unix solution might be to use make(1) and cpp(1) here.
Not sure what environment/editor you're using, but one thing you can do is use prebuild events. Create a start tag/end tag which defines the import region, and then in the prebuild event copy the common code between the tags and then compile...
//$start-tag-common-auth
..... code here .....
//$end-tag-common-auth
In your prebuild event, just find those tags, replace the code between them with the common code, and then finish compiling.
VS supports pre-/post-build events which can call external processes (like batch files or scripts), but these do not directly interact with the environment.
Instead of keeping the authentication code in a separate file, designate one of your existing scripts as the primary or master script. Use this one to edit/debug/work on the authentication code. Then add a build/batch process like you are talking about that copies the authentication code from the master script into all of the other scripts.
That way you can still debug and work with the master script at any time, you don't have to worry about one more file, and your build/deploy process keeps everything in sync.
You can use a technique like the one @Priyank Bolia suggested to make it easy to find/replace the required bit of code.
An ugly way I can think of:
have the original code in all the files, and surround it with markers like:
///To be replaced automatically by the build process to the latest code
String str = "my code copy that can be old";
///Marker end.
This code block can be replaced automatically by the build process, from one common code file.
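For what it's worth, here is a hedged Python sketch of such a build step, assuming the tag markers shown above and placeholder file names; it copies the marked block from a designated master script into every other standalone script.

import glob
import re

START = "//$start-tag-common-auth"
END = "//$end-tag-common-auth"
block = re.compile(re.escape(START) + r".*?" + re.escape(END), re.DOTALL)

# Take the current auth block from the designated master script...
with open("master_script.txt") as f:
    auth_block = block.search(f.read()).group(0)

# ...and paste it over the marked region in every other standalone script.
for path in glob.glob("scripts/*.txt"):
    with open(path) as f:
        text = f.read()
    updated = block.sub(lambda _: auth_block, text)
    if updated != text:
        with open(path, "w") as f:
            f.write(updated)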
