How to estimate the size of a git version? - jgit

I would like to estimate the size of a git commit for one file. We are using jgit to store versions of some internal files, and I would like to estimate how much a new commit for one file increases the size of the git repository.
Any suggestions?
UPDATE: The files we store in git are YAML text files.
When I commit a changed version of a YAML file I do not see any significant change in the size of the .git directory.

(Answer ignores pack files.)
Git would have to store one new blob plus some insignificant metadata (like the commit message). If you compress the file with zlib and inspect the resulting size, you will get a good estimate of how much the repository will grow.
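For a quick approximation from the command line (gzip uses the same DEFLATE algorithm as zlib, so the result is close to the size of the loose object git would write; the file name is only an example):
# prints the compressed size in bytes
gzip -c myfile.yaml | wc -c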

Related

Delete cached data from DVC

I would like to be able to delete individual files or folders from the DVC cache, after they have been pulled with dvc pull, so they don't occupy space in local disk.
Let me make things more concrete and summarize the solutions I found so far. Imagine you have downloaded a data folder using something like:
dvc pull <my_data_folder.dvc>
This will place the downloaded data into .dvc/cache, and it will create a set of soft links in my_data_folder (if you have configured DVC to use soft links).
ls -l my_data_folder
You will see something like:
my_data_file_1.pk --> .dvc/cache/4f/7bc7702897bec7e0fae679e968d792
my_data_file_2.pk --> .dvc/cache/4f/7bc7702897bec7e0fae679e968d792
...
Imagine you don't need this data for a while and you want to free up the space it takes on your local disk. I know of two manual approaches for doing that, although I am not sure about the second one:
Preliminary step (optional)
Not needed if you have symlinks (which I believe is the case, at least on Unix-like OSes):
dvc unprotect my_data_folder
Approach 1 (verified):
Delete all the cached data. From the repo's root folder:
rm -r my_data_folder
rm -rf .dvc/cache
This seems to work properly and will completely free the disk space previously used by the downloaded data. Once we need the data again, we can pull it with dvc pull as before. The drawback is that we are removing all the data downloaded with DVC so far, not only the data corresponding to my_data_folder, so we would need to run dvc pull for all the data again.
Approach 2 (NOT verified):
Delete only specific files (this should be thoroughly tested to make sure it does not corrupt DVC in any way):
First, take note of the path indicated in the soft link:
ls -l my_data_folder
You will see something like:
my_data_file_1.pk --> .dvc/cache/4f/7bc7702897bec7e0fae679e968d792
my_data_file_2.pk --> .dvc/cache/4f/7bc7702897bec7e0fae679e968d792
If you want to delete my_data_file_1.pk, from the repo's root folder run:
rm .dvc/cache/4f/7bc7702897bec7e0fae679e968d792
Note on dvc gc
For some reason, running dvc gc does not seem to delete the files from the cache, at least in my case.
I would appreciate it if someone could suggest a better way, or comment on whether the second approach is actually appropriate. Also, if I want to delete a whole folder rather than go file by file, is there any way to do that automatically?
Thank you!
It's not possible at the moment to granularly specify a directory or file to be removed from the cache. Here are the tickets to vote for and ask to prioritize this:
dvc gc remove
Reconsider gc implementation
For some reason, running dvc gc does not seem to delete the files from the cache, at least in my case.
This is a bit concerning. If you run it with the -w option, it keeps only the files and directories that are referenced in the current versions of the .dvc and dvc.lock files, and it should remove everything else.
So, let's say you are building a model:
my_model_file.pk
You created it once, its hash is 4f7bc7702897bec7e0fae679e968d792, and that hash is written in dvc.lock or in my_model_file.dvc.
Then you do another iteration and now the hash is different: 5a8cc7702897bec7e0faf679e968d363. It is now the one written in the .dvc or lock file, which means the model that corresponds to the previous hash 4f7bc7702897bec7e0fae679e968d792 is not referenced anymore. In this case dvc gc -w should definitely collect it. If that is not happening, please create a ticket and we'll try to reproduce it and take a look.
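For reference, a minimal sketch of that cleanup, run from the repository root (it removes everything from the local cache that is not referenced by the current .dvc / dvc.lock files, so make sure anything you still need has been pushed first):
dvc gc -w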

Revert a dvc remove -p command

I have just removed a DVC tracking file by mistake using the command dvc remove training_data.dvc -p, which deleted my entire training dataset. I know that in Git we can easily recover a deleted branch from its hash. Does anyone know how to recover all my lost data in DVC?
Most likely you are safe (at least the data is not gone). From the dvc remove docs:
Note that it does not remove files from the DVC cache or remote storage (see dvc gc). However, remember to run dvc push to save the files you actually want to use or share in the future.
So, if you created training_data.dvc with dvc add and/or dvc run, and dvc remove -p didn't ask or warn you about anything, it means the data is cached in .dvc/cache, similar to how Git caches objects.
There are ways to retrieve it, but I would need to know a few more details: how exactly did you add your dataset? Did you commit training_data.dvc, or is it completely gone? Was it the only data you have added so far? (Happy to help you in the comments.)
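If training_data.dvc had been committed to Git before the removal, a minimal recovery sketch (the exact revision to restore from depends on whether the deletion itself was already committed) would be:
# restore the .dvc file from Git history, then re-link the data from .dvc/cache
git checkout HEAD -- training_data.dvc    # or HEAD~1 / a specific commit
dvc checkout training_data.dvc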
Recovering a directory
First of all, here is the document that describes briefly how DVC stores directories in the cache.
What we can do is to find all .dir files in the .dvc/cache:
find .dvc/cache -type f -name "*.dir"
outputs something like:
.dvc/cache/20/b786b6e6f80e2b3fcf17827ad18597.dir
.dvc/cache/00/db872eebe1c914dd13617616bb8586.dir
.dvc/cache/2d/1764cb0fc973f68f31f5ff90ee0883.dir
(If the local cache is lost and we are restoring data from remote storage, the same logic applies, but the commands, e.g. to find files with a .dir extension on S3, look different.)
Each .dir file is a JSON file with the content of one version of a directory (file names, hashes, etc.). It has all the information needed to restore it. The next thing we need to do is figure out which one we need. There is no single rule for that; here is what I would recommend checking (pick depending on your use case):
Check the date modified (if you remember when this data was added).
Check the content of those files - if you remember a specific file name that was present only in the directory you are looking for, just grep it (see the example after this list).
Try to restore them one by one and check the directory content.
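For instance, assuming train.tsv is a file name you remember (the name and cache layout below are just an illustration):
# list the .dir files that mention that file name
grep -l "train.tsv" .dvc/cache/*/*.dir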
Okay, now let's imagine we decided that we want to restore .dvc/cache/20/b786b6e6f80e2b3fcf17827ad18597.dir (e.g. because its content looks like:
[
    {"md5": "6f597d341ceb7d8fbbe88859a892ef81", "relpath": "test.tsv"},
    {"md5": "32b715ef0d71ff4c9e61f55b09c15e75", "relpath": "train.tsv"}
]
and we want to get a directory with train.tsv).
The only thing we need to do is to create a .dvc file that references this directory:
outs:
- md5: 20b786b6e6f80e2b3fcf17827ad18597.dir
  path: my-directory
(Note that the cache path 20/b786b6e6f80e2b3fcf17827ad18597.dir became the hash value 20b786b6e6f80e2b3fcf17827ad18597.dir.)
And run dvc pull on this file.
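For example, assuming the snippet above was saved as my-directory.dvc (the file name is arbitrary):
dvc pull my-directory.dvc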
That should be it.

How to build an rpm that installs host-dependent files

I have to build one rpm that copies the contents of file A to /path/to/targetfile if the hostname is A. In all other cases the contents of B should be copied to /path/to/targetfile. I'm aware that this may be a misuse of rpm, but I still have to do it like this. Do you have any ideas how to get this done in an elegant way?
My solution at the moment would be to create an empty /path/to/targetfile in my BUILD directory as well as a /tmp/contents.tar.gz that contains the files A and B. In the post-install routine I would then extract the relevant parts of /tmp/contents.tar.gz to /path/to/targetfile and delete the tarball afterwards. In the pre-uninstall routine I'd then touch /tmp/contents.tar.gz to suppress rpm reporting errors for an already deleted file.
To me this seems to be a very dirty way to get this done. Do you have a better one?
If you plan on abusing rpm for things it was not designed for, you'll have to do dirty tricks.
I don't see another workaround for you. I fail to see the point of removing the tar.gz, unless that (little?) extra space is really a problem for you. I would propose the following (the sketch after this list puts it together):
package all files (A and B) into some specific directory (/usr/lib/your-package or whatever), not in compressed format.
in the %post section just create symlinks so that /path/to/targetfile points to /usr/lib/your-package/A or /usr/lib/your-package/B (symlinks take up almost no space). This has the additional benefit that ls -l /path/to/targetfile will show you which file it points to, telling you whether it is file A or B.
in your %files section declare %ghost /path/to/targetfile for a nice cleanup upon removal.
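A minimal sketch of the relevant spec file sections, assuming the package ships its payload under /usr/lib/your-package (the package name, paths, and the hostname test are placeholders to adapt):
%install
mkdir -p %{buildroot}/usr/lib/your-package
install -m 0644 A B %{buildroot}/usr/lib/your-package/

%post
# pick the payload based on the host name
if [ "$(hostname -s)" = "A" ]; then
    ln -sf /usr/lib/your-package/A /path/to/targetfile
else
    ln -sf /usr/lib/your-package/B /path/to/targetfile
fi

%files
/usr/lib/your-package/A
/usr/lib/your-package/B
%ghost /path/to/targetfile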

How to add files to the RPM package of a Sailfish OS project?

I am trying to build a Sailfish OS app, and I need to use *.wav files, which are to be distributed through the *.rpm package. In my case, these files are to be put in /usr/share/[application_name]/sounds/*. How do I set up the *.pro and *.yaml files accordingly?
This isn't an RPM question per se: you seem to be asking how to configure your application through *.pro and *.yaml if you deliver content in *.rpm packages.
The packaging answer is: patch the configuration files exactly the same way as if you were installing the *.wav files manually (i.e. not through *.rpm).
You will need to copy the *.wav content into the %buildroot tree that is used to stage the files to be included in the package, as well as the modified *.pro and *.yaml content. All the files to be included in the *.rpm package need to be listed in the %files manifest exactly as they are to be installed (i.e. without the %buildroot prefix used for staging files while building).
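A minimal sketch of what that means in the spec's %install and %files sections, assuming the sounds live in a sounds/ directory in the source tree (paths are illustrative):
%install
mkdir -p %{buildroot}%{_datadir}/%{name}/sounds
cp -a sounds/*.wav %{buildroot}%{_datadir}/%{name}/sounds/

%files
%{_datadir}/%{name}/sounds/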
I finally found an answer!
I want to thank the owner of that project:
https://github.com/krig/metronom-sailfish
From the .pro and the .yaml files of this project I found out how to deploy the files. First, we declare the constant DEPLOYMENT_PATH = /usr/share/$${TARGET}, which holds the path to /usr/share/[appname]. Next, we define some kind of a variable (TODO: find a more detailed explanation of that). Its definition first sets the files to deploy, for example data.files = data (the second data is the folder), and then sets data.path to $${DEPLOYMENT_PATH}. We list all the files in OTHER_FILES and add the setting, in our case data, to INSTALLS. Now that we are finished with the .pro file, we move to the .yaml file for the .rpm and add the necessary line to the Files: section, in our case - '%{_datadir}/%{name}/data', the last part being the folder we need to add. TODO: to whoever is more experienced, please provide a more detailed answer.
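Put together, the description above corresponds roughly to the following .pro and .yaml fragments (a sketch based on that project, with data standing in for your own folder of *.wav files):
# in the .pro file
DEPLOYMENT_PATH = /usr/share/$${TARGET}
data.files = data
data.path = $${DEPLOYMENT_PATH}
OTHER_FILES += data/*
INSTALLS += data

# in the .yaml file
Files:
  - '%{_datadir}/%{name}/data'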
Did you check https://sailfishos.org/develop-packaging-apps.html carefully? It may help.

Going remote from local repo: Git and forgetting large files

I'm a relative git newbie, as you're about to see. So please forgive my poor use of git terminology, I'm still learning.
Concise summary of problem: I want to put my local repo on GitHub, but I have some previously-tracked files that are too big.
Background:
This morning I had a local repository where all sorts of files were being tracked: R scripts, .RData files, .csv's, etc. I decided I wanted to make my repository publicly available by pushing it to GitHub.
When I tried to push (using git remote add origin https://github.com/me/repo.git followed by git push -u origin master), I realized that some of my data files were too large for GitHub. I've decided that it would be OK if the .RData files didn't get pushed to GitHub and weren't tracked by git (although I don't want to delete the files locally). But I can't figure out how to make this happen.
Things I've tried thus far:
First I added .RData files to the .gitignore file. I quickly realized that this does nothing for files that are already being tracked.
I used git rm -r --cached . followed by git commit -am "Remove ignored files", thinking this would help git forget about all of those huge files I just ignored.
Further following the git help page, I tried git commit --amend -CHEAD, but I still couldn't push.
I attempted to use the BFG, but I didn't get very far with it because it apparently didn't find any files larger than 100M. Clearly I was doing something wrong, but I decided not to pursue it further.
Following some tips I found HERE, I then tried git filter-branch --tree-filter 'git rm -r -f --ignore-unmatch *.RData' HEAD. This definitely did something, but I still couldn't push. However, instead of the huge list of too-big files, I am now down to 2 files that are too big (even though other .RData files in the same directory are no longer listed).
After my last git push -u origin master --force, this is the printout in the terminal:
Counting objects: 1163, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (1134/1134), done.
Writing objects: 100% (1163/1163), 473.07 MiB | 6.80 MiB/s, done.
Total 1163 (delta 522), reused 0 (delta 0)
remote: error: GH001: Large files detected.
remote: error: Trace: 4ce4aa642e458a7a715654ac91c56af4
remote: error: See http://git.io/iEPt8g for more information.
remote: error: File Results/bigFile1.RData is 166.51 MB; this exceeds GitHub's file size limit of 100 MB
remote: error: File Results/bigFile2.RData is 166.32 MB; this exceeds GitHub's file size limit of 100 MB
To https://github.com/me/repo.git
! [remote rejected] master -> master (pre-receive hook declined)
error: failed to push some refs to 'https://github.com/me/repo.git'
If you haven't guessed, I don't really know what I'm doing ... I'm essentially trying any code snippet I can find, and seeing if it allows me to push. All of my data and files are backed up, so I'm experimenting rather brazenly.
Given that I'm willing to not track the huge .RData files, how do I get my local repo to the point where I can push it to GitHub?
Any help would be very greatly appreciated. Thanks!
I am pretty sure you will just need to remove them from your .git repo history. It is not enough to remove them from the most current version; they need to be excised from ever having existed in your repo.
The technique is covered elsewhere, see this stackoverflow post or the BFG tool.
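As a minimal sketch of that idea (not the exact recipe from the linked post), you could rewrite every ref so the two files reported in the push error never existed, then repack and force-push. Everything should be backed up first, since this rewrites history:
# drop the two large files from every commit on every branch and tag
git filter-branch --force --index-filter \
  'git rm -r --cached --ignore-unmatch Results/bigFile1.RData Results/bigFile2.RData' \
  --prune-empty --tag-name-filter cat -- --all
# remove the backup refs and expire reflogs so the old blobs can be pruned
rm -rf .git/refs/original/
git reflog expire --expire=now --all
git gc --prune=now --aggressive
git push -u origin master --force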
