Going remote from local repo: Git and forgetting large files

I'm a relative git newbie, as you're about to see, so please forgive my poor use of git terminology; I'm still learning.
Concise summary of problem: I want to put my local repo on GitHub, but I have some previously-tracked files that are too big.
Background:
This morning I had a local repository where all sorts of files were being tracked: R scripts, .RData files, .csv files, etc. I decided I wanted to make my repository publicly available by pushing it to GitHub.
When I tried to push (using git remote add origin https://github.com/me/repo.git followed by git push -u origin master), I discovered that some of my data files were too large for GitHub. I've decided it would be fine if the .RData files weren't pushed to GitHub and weren't tracked by git at all (though I don't want to delete them locally), but I can't figure out how to make this happen.
Things I've tried thus far:
First I added .RData files to the .gitignore file. I quickly realized that this does nothing for files that are already being tracked.
I used git rm -r --cached . followed by git commit -am "Remove ignored files", thinking this would help git forget about all of those huge files I had just ignored.
Further following the git help page, I tried git commit --amend -CHEAD, but I still couldn't push.
I attempted to use the BFG, but I didn't get very far with it because it apparently didn't find any files larger than 100M. Clearly I was doing something wrong, but I decided not to pursue it further.
Following some tips I found here, I then tried git filter-branch --tree-filter 'git rm -r -f --ignore-unmatch *.RData' HEAD. This definitely did something, but I still couldn't push. However, instead of the huge list of too-big files, I am now down to 2 files that are too big (even though other .RData files in the same directory are no longer listed).
After my last git push -u origin master --force, this is the printout in the terminal:
Counting objects: 1163, done.
Delta compression using up to 8 threads.
Compressing objects: 100% (1134/1134), done.
Writing objects: 100% (1163/1163), 473.07 MiB | 6.80 MiB/s, done.
Total 1163 (delta 522), reused 0 (delta 0)
remote: error: GH001: Large files detected.
remote: error: Trace: 4ce4aa642e458a7a715654ac91c56af4
remote: error: See http://git.io/iEPt8g for more information.
remote: error: File Results/bigFile1.RData is 166.51 MB; this exceeds GitHub's file size limit of 100 MB
remote: error: File Results/bigFile2.RData is 166.32 MB; this exceeds GitHub's file size limit of 100 MB
To https://github.com/me/repo.git
! [remote rejected] master -> master (pre-receive hook declined)
error: failed to push some refs to 'https://github.com/me/repo.git'
If you haven't guessed, I don't really know what I'm doing ... I'm essentially trying any code snippet I can find, and seeing if it allows me to push. All of my data and files are backed up, so I'm experimenting rather brazenly.
Given that I'm willing to not track the huge .RData files, how do I get my local repo to the point where I can push it to GitHub?
Any help would be very greatly appreciated. Thanks!

I am pretty sure you will need to remove them from your repository's history, not just from the most recent commit; they need to be excised from ever having existed in your repo.
The technique is covered elsewhere; see this Stack Overflow post or the BFG tool.
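By way of illustration only (the local repo path is a placeholder, and this assumes you have the BFG jar available), a history rewrite along those lines might look roughly like this; note that the BFG protects your latest commit by default, so the big files should already be removed from HEAD (e.g. via git rm --cached plus a commit) before you run it:
# Work on a bare mirror so the original repository stays untouched.
git clone --mirror /path/to/my/repo repo-mirror.git

# Rewrite history, dropping every .RData blob from all commits.
java -jar bfg.jar --delete-files '*.RData' repo-mirror.git

# Expire the now-unreferenced objects and repack so the pack actually shrinks.
cd repo-mirror.git
git reflog expire --expire=now --all
git gc --prune=now --aggressive

# Push the slimmed-down branch to GitHub.
git push https://github.com/me/repo.git master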

Related

Delete cached data from DVC

I would like to be able to delete individual files or folders from the DVC cache after they have been pulled with dvc pull, so they don't take up space on the local disk.
Let me make things more concrete and summarize the solutions I found so far. Imagine you have downloaded a data folder using something like:
dvc pull <my_data_folder.dvc>
This will place the downloaded data into .dvc/cache and create a set of soft links in my_data_folder (if you have configured DVC to use soft links):
ls -l my_data_folder
You will see something like:
my_data_file_1.pk --> .dvc/cache/4f/7bc7702897bec7e0fae679e968d792
my_data_file_2.pk --> .dvc/cache/4f/7bc7702897bec7e0fae679e968d792
...
Imagine you don't need this data for a while and want to free up the space it occupies on the local disk. I know of two manual approaches for doing that, although I am not sure about the second one:
Preliminary step (optional)
Not needed if you have symlinks (which I believe is the case, at least on Unix-like OSes):
dvc unprotect my_data_folder
Approach 1 (verified):
Delete all the cached data. From the repo's root folder:
rm -r my_data_folder
rm -rf .dvc/cache
This seems to work properly and completely frees the disk space previously used by the downloaded data. Once we need the data again, we can fetch it with dvc pull as before. The drawback is that we are removing all the data downloaded with DVC so far, not only the data corresponding to my_data_folder, so we would need to dvc pull all of it again.
Approach 2 (NOT verified):
Delete only specific files (to be thoroughly tested that this does not corrupt DVC in any way):
First, take note of the path indicated in the soft link:
ls -l my_data_folder
You will see something like:
my_data_file_1.pk --> .dvc/cache/4f/7bc7702897bec7e0fae679e968d792
my_data_file_2.pk --> .dvc/cache/4f/7bc7702897bec7e0fae679e968d792
If you want to delete my_data_file_1.pk, from the repo's root folder run:
rm .dvc/cache/4f/7bc7702897bec7e0fae679e968d792
Note on dvc gc
For some reason, running dvc gc does not seem to delete the files from the cache, at least in my case.
I would appreciate it if someone could suggest a better way, or comment on whether the second approach is actually appropriate. Also, if I want to delete a whole folder rather than going file by file, is there any way to do that automatically?
Thank you!
It's not possible at the moment to granularly specify a directory or file to be removed from the cache. Here are the tickets to vote on and ask to prioritize this:
dvc gc remove
Reconsider gc implementation
For some reason, running dvc gc does not seem to delete the files from the cache, at least in my case.
This is a bit concerning. If you run it with the -w option, it keeps only the files and directories that are referenced in the current versions of the .dvc and dvc.lock files, and it should remove everything else.
So, let's say you are building a model:
my_model_file.pk
You created it once, its hash is 4f7bc7702897bec7e0fae679e968d792, and that hash is written in dvc.lock or in my_model_file.dvc.
Then you do another iteration and the hash is now different: 5a8cc7702897bec7e0faf679e968d363. It should now be written in the .dvc or lock file. That means the model corresponding to the previous hash 4f7bc7702897bec7e0fae679e968d792 is no longer referenced. In this case dvc gc -w should definitely collect it. If that is not happening, please create a ticket and we'll try to reproduce it and take a look.
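As a quick usage sketch of that behaviour (flags as in DVC's CLI; gc asks for confirmation before deleting anything):
# Remove everything from .dvc/cache that is NOT referenced by the current
# workspace's .dvc files and dvc.lock (-w is short for --workspace).
dvc gc --workspace

# The same, but skip the interactive confirmation prompt.
dvc gc -w -f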

Can a pre-commit Git hook zip a directory and add it to the repository?

I'm doing development on a Wordpress plugin. My development directory contains a lot of development-specific stuff (e.g. Grunt files, Sass files, the git repository itself, etc.).
Obviously, I don't want to distribute this folder containing all of those development files; people don't want a few MB of Grunt files when they download my Wordpress plugin.
Up until now, though, my "release" process has been cumbersome:
Commit the Git changes
Zip the entire folder
Open the zip file and delete the .git folder, grunt files, and all the other development-specific files
Release the new zip
I don't know the best way to accomplish this, but I'm very vaguely familiar with Git hooks, and I had this thought: could I set up a Git hook that would zip ONLY the needed production files into a ZIP file and store it with the repo? That way, every time I commit it would automatically create a new release ZIP.
Is that possible? If so, could someone point me in the right direction?
Oh also, I'm on Windows (・_・;). So I'm hoping that there's a way to do it on Windows.
I can't speak for Windows, but:
It's technically possible to do that sort of thing in a pre-commit hook.
Don't.
A pre-commit hook that modifies "what you will commit" is annoying (if nothing else, it violates the "rule of least astonishment": your version control system should simply store the versions you tell it to store). Apart from that, storing large pre-compressed binaries interferes with git's attempts to save space in pack files, and will cause rapid repository bloat, poor performance, running out of memory, and so on. A ZIP archive is a pre-compressed binary and hence will behave badly.
In general, a more reasonable "hook-y" way to handle releases is to set up a "release server" to which you push new releases, and have the push trigger the archive-generation. (There are ways to do this without a separate server / repository, and you can do it in a more pull-style fashion, but the push-style is easy to illustrate.)
[Edit: I had originally considered git archive but did not realize you could get it to exclude files conveniently, so I wrote up the below instead. jthill's answer is better and should be one's first resort; I'll leave this in place as an alternative for cases where, for some reason, git archive might not do.]
For instance, here's a server-side post-receive hook code fragment that checks whether a branch whose name matches release* has been pushed-to, and if so, invokes a shell function with the name of the branch (once for each such branch):
#! /bin/sh
NULL_SHA1=0000000000000000000000000000000000000000

scan()
{
    local oldsha newsha fullref shortref
    local optype reftype

    # each line on stdin is "<old-sha> <new-sha> <ref-name>"
    while read oldsha newsha fullref; do
        case $oldsha,$newsha in
        $NULL_SHA1,*) optype=create;;
        *,$NULL_SHA1) optype=delete;;
        *)            optype=update;;
        esac
        case $fullref in
        refs/heads/*)
            reftype=branch
            shortref=${fullref#refs/heads/}
            ;;
        *)
            reftype=other
            shortref=$fullref
            ;;
        esac
        case $optype,$reftype,$shortref in
        create,branch,release*|update,branch,release*)
            do_release $shortref;;
        esac
    done
}

scan
(much of the above is boilerplate, which I have stripped down to essentials). You would have to write the do_release function, which might resemble (totally untested):
do_release()
{
    local tmpdir=/tmp/build.$$          # or use mktemp -d
    # $tmpdir/index is git's index; $tmpdir/t is the work tree
    trap "rm -rf $tmpdir; exit 1" 1 2 3 15
    rm -rf $tmpdir
    mkdir -p $tmpdir/t
    # populate the temporary work tree from the pushed branch
    # (-f since the work tree starts out empty)
    GIT_INDEX_FILE=$tmpdir/index GIT_WORK_TREE=$tmpdir/t git checkout -f $1
    # now clean out grunt files and make zip archive
    (cd $tmpdir/t; rm -rf grunt; zip -r ../t.zip .)
    # put completed zip archive in export location, name it
    # based on the branch name
    mv $tmpdir/t.zip /place/where/zip/files/live/$1.zip
    # clean up temp dir now, and no longer need to clean up
    # on signal related abort
    rm -rf $tmpdir
    trap - 1 2 3 15
}
There's actually a command for this, git archive.
git archive master -o wizzo-v1.13.0.zip
See the EXAMPLES section: you can select paths, add prefixes to them, define custom postprocessing by output extension, and make some more minor tweaks.
Also see the ATTRIBUTES section: you can give files (arbitrary patterns, really) an export-ignore attribute to exclude them from archives.
It's got a bunch more handy-dandies: you can get archives from remote repos, expand arbitrary git log --pretty=format: placeholders, and so on; the git manpages are definitely worth whatever time you can invest in them.
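To make the export-ignore idea concrete, here is a small sketch (the file names are placeholders for whatever development files you want kept out of the archive):
# Mark development-only files so git archive leaves them out.
cat >> .gitattributes <<'EOF'
Gruntfile.js    export-ignore
sass/**         export-ignore
.gitattributes  export-ignore
EOF
git add .gitattributes
git commit -m "Exclude development files from release archives"

# Build the release zip, with everything placed under a my-plugin/ prefix.
git archive --prefix=my-plugin/ -o my-plugin-v1.13.0.zip master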

Git post-receive hooks for two repos on the same server

I'm developing a site in Wordpress that I’ve built a template and a plugin for. For my git workflow, I'd like to be able to track the template and the plugin separately with two different repos on the server. Right now I push template commits into a bare repo on the server, which then executes this post-receive hook:
#!/bin/sh
export GIT_WORK_TREE=/home/user/public_html/wp-content/themes/custom_theme
export GIT_DIR=/home/user/public_html/custom_theme.git
git checkout -f master
And the files show up in the theme folder as intended when I push. But when I tried to set up the exact same thing for the plugin, I get no result. Here is the post-receive hook for the plugin bare repo:
#!/bin/sh
export GIT_WORK_TREE=/home/user/public_html/wp-content/plugins/custom_plugin
export GIT_DIR=/home/user/public_html/custom_plugin.git
git checkout -f master
When I push to the plugin bare repo, the live plugin directory remains empty and I don't even see any error logs. Does this have to do with using two repos on the same server? I tried adding "unset GIT_DIR" at the start of both hooks to see if that made any difference (it didn't). I've also already checked the file permissions and both hooks are executable for all users. Is there at least a way I can run the plugin hook manually and see what the shell response is?
EDIT:
I added "echo working 1>&2" to the hook to test whether it triggers (per torek's suggestion). After committing a test change and pushing to the server bare repo, here is what I got:
stdin: is not a tty
Counting objects: 5, done.
Delta compression using up to 12 threads.
Compressing objects: 100% (3/3), done.
Writing objects: 100% (3/3), 376 bytes | 0 bytes/s, done.
Total 3 (delta 1), reused 0 (delta 0)
remote: Already on 'master'
remote: working
To user@dev.mysite.org:public_html/custom_plugin.git
f1cc0bb..9a71b5b master -> master
So it seems like the hook is triggering, right? I'm getting the "remote: working" line of output above, but I'm not exactly sure what all of the output means. Am I missing something there?
Seems to be working just fine now after staring at it for a day.
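For the record, since this particular post-receive script never reads its standard input, one way to exercise a hook like it by hand on the server and watch its output directly is simply (paths as in the question):
cd /home/user/public_html/custom_plugin.git
sh hooks/post-receive

# A hook that does read stdin would additionally need the
# "<old-sha> <new-sha> <ref-name>" line piped into it, e.g.:
echo "$(git rev-parse master^) $(git rev-parse master) refs/heads/master" | sh hooks/post-receive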

Git Archive but first put all the files inside a folder then start archiving

As the title suggests, I want to know if there is a single git command that first puts all of my project files into one folder (not including .gitignored files) and then archives that folder, so ignored files are left out of the archive, which is nice.
This can be beneficial for me as I am working on a WordPress plugin with multiple releases.
I want all the files (minus the .gitignored files) moved to a folder first, and then that folder archived.
It is possible in one command provided you define an alias, but this isn't git-related:
you can:
clone your repo elsewhere (that way you don't get any ignored or private file)
move your files as you see fit in that local clone
archive (tar cpvf yourArchive.tar yourFolder)
But git archive alone won't help you move those files, which is why I would recommend a script with custom bash commands (not git commands).
You don't really need to copy / clone the repo anywhere.
1. Make sure you have committed all your changes.
2. Process the files any way you want.
3. Run tar -cvjf dist/archive-name.tbz2 --transform='s,^,archive-name/,' $(git ls-tree --full-tree -r --name-only --full-name HEAD)
4. Run git reset --hard to restore the tree without any of the changes you made in step #2.
Hints:
The --transform='s,^,archive-name/,' is there so your files will be extracted to archive-name/...; you can remove it if you don't need that.
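For comparison, git archive's --prefix option (touched on in the answer to the previous question) produces the same everything-under-one-folder layout in a single command, and since it only packs tracked files, .gitignored files stay out automatically. A quick sketch, reusing the archive-name placeholder from above:
mkdir -p dist
# Archive HEAD with every file placed under archive-name/ inside the tarball.
git archive --prefix=archive-name/ -o dist/archive-name.tar.gz HEAD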

Fossil: recognise changes while repo closed

I know I should only change files in a project when the repository is open. But I tried to see what happens when I change a file while the repo is closed, because I will often do that; I'm going to forget to open repos. It's inconvenient ...
Now I see what happens: changes are not recognised. When I try to commit, I get the message "nothing has changed" ... which is not true.
What can I do to make fossil recognise the missed changes?
Why did you close the repository? When you do fossil open, fossil will try to deploy the latest version. Maybe it has overwritten your files…
You should use open ... --keep if you don't want to harm your working directory.
As a comparison with git (which seems to be your background):
in git, each working directory has its own .git folder. Multiple working directories for the same repository are typically hardlinked.
in fossil, each working directory contains a file named _FOSSIL_ or .fslckout, depending on your version. It contains both a pointer to the repository (the object database) and workingdir-specific data (what you'd call HEAD, the stash, and uncommitted additions/deletions/renames). close deletes that file. So, in git terms, it's as if you did git clone --bare . some_other_folder.git and then recursively removed .git. You still have the project history somewhere, but all information about your working tree is lost.
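Given that, a recovery sketch for your situation might look like the following (the repository filename is a placeholder, and it assumes your edited files are still on disk; --keep tells open not to overwrite the working directory):
# Re-open the repository without letting fossil touch the working files.
fossil open ../myproject.fossil --keep

# Fossil should now report the files that differ from the opened check-out...
fossil changes

# ...and you can commit them as usual.
fossil commit -m "Changes made while the repo was closed"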
