Commit a large number of files in RStudio using the Git panel

In RStudio, if you are dealing with a directory that contains a large number of files, and you want to commit and push the changes you recently made to all of them to your repository, the GUI Git pane gets super slow and practically stops working. Any ideas?

Of course you can ignore the GUI and stick to command-line Git forever, but if you don't want to, a quick jump to the command line solves this problem for now.
The temporary solution that I found is as follows:
Click the blue gear icon in the Git panel inside RStudio.
Select Shell (a terminal window will pop up).
Run the add and commit commands in the terminal:
{ATTENTION: The following command will commit changes to ALL files! Adjust it to whatever is appropriate for your situation.}
git add -A && git commit -m 'staging all files'
Now you can go back to the Git GUI and click the Push button. Everything you staged and committed in the terminal will be pushed up to your repository.
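If you prefer to stay in the terminal, the push can be done there too. A minimal sketch, assuming your remote is called origin and your branch is main (adjust both to your setup):

git add -A
git commit -m 'staging all files'
git push origin main    # remote and branch names are assumptions; adjust as needed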

My workaround today was to:
1. create a man1/ directory in my RStudio project (after force-closing and relaunching RStudio a few times, since I had done something that caused it to hang again),
2. include man1/ in .gitignore,
3. move just about everything in man/ to man1/,
4. delete the .git/index.lock file in the repository,
5. futz around with RStudio until it was responsive enough to make the (small) commit of files from man/,
6. pull and push, so that the remote main was once again fully synced,
7. copy some files from man1/ back to man/ and commit these,
8. rinse and repeat steps 6 and 7 until there's nothing left in man1/,
9. delete man1/ and its entry in .gitignore.
My recipe above isn't one-size-fits-all. For example, you may have run into a "diff is too large" difficulty with RStudio because of a single oversized file, rather than (as I had) because of "too many" small files. If you're trying to commit a monstrously big set of diffs from a single file, you should probably be adding that file to your .gitignore rather than expecting git to version-control it without any difficulty. Also, if you're locking up RStudio's Git interface because of "too many" files being committed simultaneously, your first port of call should, I think, be to commit directories one at a time (but do be sure to push and pull after each commit).
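To make those last two suggestions concrete, here is a minimal sketch; the file name oversized-file.dat and the man/ directory are only examples of what you might ignore or commit in isolation:

echo 'oversized-file.dat' >> .gitignore    # keep the monstrously-big file out of version control
git add man/                               # stage one directory at a time
git commit -m 'commit man/ on its own'
git pull && git push                       # sync after each per-directory commit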
And... I'm not going to complain about this defect in how RStudio interacts with git!
Instead I'll close with some kudos. After just a few days of futzing around with RStudio, I'm finding it to be so much easier than what I remember about hacking on S via emacs in the early 1990s. RStudio handles just about everything (especially the Roxygen documentation workflows) far more smoothly than the emacs/ESS setup that I had been struggling to get fully operational earlier this month.
I'm also impressed with how R has developed since the last time I looked at it -- about 20 years ago! The semantics lurking in corner cases are still very "surprising", to put it mildly ;-) But I do appreciate how it has maintained compatibility with the truly-bizarre and primitive semantics of S while allowing its power-users (which will surely never include me!) to write expressively, elegantly, and with an appropriate balance between concision and the write-only alphabet-soup of APL and its ilk (https://en.wikipedia.org/wiki/Write-only_language).

Related

GitHub branch syncing with the master for no obvious reason

I'm coding in Rstudio and my workflow is along these lines:
make a new branch using Rstudio's UI
add some code or fix a bug
commit code when I'm satisfied and push to GitHub
merge the new code into the master on GitHub
pull the latest master code from GitHub into Rstudio using its UI
delete any local/remote branches via the command line (because Rstudio doesn't have that functionality and doesn't sync with GitHub when it comes to remote branch deletion); a command-line sketch of these steps is shown below
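For what it's worth, a rough command-line equivalent of this workflow looks like the following; the branch name feature-x and the remote name origin are assumptions, and the merge itself still happens on GitHub:

git checkout -b feature-x              # make a new branch
# ...add some code or fix a bug...
git add -A && git commit -m 'fix bug'  # commit when satisfied
git push -u origin feature-x           # push the branch to GitHub
# merge the new code into master on GitHub, then:
git checkout master
git pull                               # pull the latest master
git branch -d feature-x                # delete the local branch
git push origin --delete feature-x     # delete the remote branch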
This might not be the most efficient way of doing things (I'm new to git) but it works well enough except for the problem I'm having. Twice now, seemingly at random, I've created a new branch, worked on it and when I've gone back to check something in the master they are identical. The code changes I've made in the branch have already synced with the master.
This is what the last two lines of the History look like:
[screenshot: git history]
independant_erp_norm_regressions is the last branch I merged into the master, while preprocess_select_global_pars is the current branch which is syncing to the master unduly.
I'm at a loss as to what is going on because I'm doing the same thing as I usually do and haven't been able to find any similar questions on stackoverflow.
Any help would be greatly appreciated (as well as any ways in which I can streamline my workflow).
OK, thanks for the responses guys. As per Tim's reply, I decided to commit the changes made to the new branch via Rstudio's UI and then check at the command line to see what happened behind the scenes. (After that I thought I would do an entire branch/merge via the command line, to see whether the problem persisted or whether it was an Rstudio bug.) Just before committing the changes, Rstudio's git interface showed that master and my branch were still in sync, up to and including having the same staged files selected. After committing, I used "git show-branch" at the command line and it showed that only the correct branch had a new commit; this was mirrored in Rstudio's git history interface, and after merging via GitHub all is well. So it seems like it was just an Rstudio/git bug of sorts.
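For reference, the commands involved in that sanity check are plain git and work the same from any shell; a sketch:

git show-branch                  # which commits are on which branches
git log --oneline --graph --all  # visualise where each branch head sits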

R Server - Resuming R Session - message hanging or taking 15+min

I frequently work in an R-server environment. However, whenever I come back to my work after the last working day, the system often gets stuck on 'resuming R session'. This might take upwards of 5-15 minutes. I try to terminate R or restart R, but often this doesn't really do anything.
I'm looking for a workaround, as it is very frustrating to go to the R-server URL and have to wait forever to get started again. Ideally, I'd be able to pick up right where I left off. However, if this can't be done, I guess that is OK.
I was looking around at the folder structure and I noticed that there is a folder called "Suspended-R-Session".
Within this folder are a few files such as:
'options',
'lib paths',
'history',
'environment_vars',
'environment',
and 'settings'.
Should I be deleting these files in order to speed up load time?
As described in this thread (https://support.rstudio.com/hc/en-us/community/posts/200638878-resuming-session-hangup), in my case (R version 3.5) the fix was:
# the session directory name will differ on your machine
cd ~/.rstudio/sessions/active/session-45204d30
rm -rf suspended-session-data
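If several sessions are stuck, the same clean-up can be applied to all of them at once. A sketch, assuming the default ~/.rstudio layout shown above (session directory names will differ on your machine):

rm -rf ~/.rstudio/sessions/active/session-*/suspended-session-data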

Automatically log changes to system files and allow revert

I'm trying to learn about the guts of Unix right now, mostly through experimentation. When I was first starting, I found myself looking through forum posts, copying and pasting bash code. When I broke something, I often had to do a fresh install because I couldn't remember what exactly I had changed where. Now, the simple solution is to keep a log of all the system files I've changed and keep original copies of all the default files so I can revert if necessary. It would be great if there were a command-line tool that did this for me automatically. It would be even better if I could step back through changes. Basically, I'm looking to version-control my entire OS.
Does anything like this exist? I would also accept alternative strategies for spelunking through Unix without causing permanent damage if you think I'm going about this wrong.
I'm using Debian, if it matters.
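One low-tech way to get part of the way there is to put /etc itself under git; on Debian there is also the etckeeper package, which automates this. A minimal sketch of the manual approach (note it only covers files under /etc, not the whole OS):

cd /etc
sudo git init
sudo git add -A
sudo git commit -m 'baseline /etc before experimenting'
# ...change system files, break things...
sudo git diff              # see exactly what you changed
sudo git checkout -- .     # revert tracked files back to the last commit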

How to load packages automatically when opening a project in RStudio

Every time I restart RStudio, it requires me to reload all of the packages that were loaded in the workspace previously. I can't seem to figure out what the problem is; RStudio is saving the projects when it closes them.
How can I make sure that RStudio reloads the necessary packages when I open the project?
I presume you mean that you have to reload all of the packages that were loaded in the workspace previously. That's not an error, that's by design.
If you want to load some packages at startup in a project, you can do so by creating a file called .Rprofile in the project directory, and specify whatever code you want RStudio to run when loading the project.
For example:
cat("Welcome to this project.\n")
require(ggplot2)
require(zoo)
would print a welcome message in the console, and load ggplot2 and zoo every time you open the project.
See also http://www.rstudio.com/ide/docs/using/projects
In general, default package loading in RStudio works no differently than in R (see How to load packages in R automatically?). On startup, R checks for an .Rprofile file in your local directory or, failing that, your home or install directory (on Mac/Linux: ./.Rprofile or else ~/.Rprofile) and executes it, along with any options(defaultPackages = ...) or other package-load-related commands it contains.
The only small difference is that RStudio "helpfully" changes your default path before startup (see "RStudio: Working with Projects"), so you might load a missing or wrong .Rprofile, depending on whether you've opened an RStudio Project or just plain files, and on what your RStudio default working directory is set to. It's not always clear what directory you're in, so sometimes this causes real grief.
I tend to use RStudio without defining my code as an RStudio Project, simply because it's heavy-handed and creates more files and directories without adding anything (to my use case, anyway).
So the solution I found for maintaining .Rprofile and making sure the right one gets loaded is a trusty old Unix symlink from the project directory to my home directory:
ln -s ~/.Rprofile ./.Rprofile
(If you're on Windows it's more painful.)
You don't need to have one global .Rprofile; you could keep task-specific ones for different types of projects or trees, say a .Rprofile.nlp, .Rprofile.financial, .Rprofile.bio and so on. As well as options(defaultPackages = ...), you can gather all your thematically related settings there: scipen, width, data.table/dplyr-specific options, the search path...
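Extending the symlink trick above, a task-specific profile can be wired into a project the same way; the .Rprofile.nlp name is just an example:

ln -s ~/.Rprofile.nlp ./.Rprofile    # run in the project directory
# alternatively, plain R honours the R_PROFILE_USER environment variable:
R_PROFILE_USER=~/.Rprofile.nlp R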
Power tips:
obviously keep backups or SCM of your valuable .Rprofile(s). Definitely make sure git is tracking them, so don't put them in .gitignore
if you have multiple .Rprofiles, put a cat("Loading .Rprofile.foo") line in each one so you can see from the console that the right .Rprofile.xyz got loaded
after every project, revise, trim, tweak your .Rprofile; add new use case stuff, comment out irrelevant stuff, commit the changes to git

What artifacts to save for a nightly build?

Assume that I set up an automatic nightly build. What artifacts of the build should I save?
For example:
Input source code
output binaries
Also, how long should I save them, and where?
Do your answers change if I do Continuous Integration?
You shouldn't save anything for the sake of saving it. You should save it because you need it (i.e., QA uses nightly builds to test), at which point "how long to save it" becomes however long QA wants them.
I wouldn't "save" source code so much as tag/label it. I don't know what source control you're using, but tagging is trivial (in performance and disk space) for any quality source control system. Once your build is tagged, unless you need the binaries, there really isn't any benefit to keeping them around, because you can simply re-compile from source when necessary.
Most CI tools let you tag on each successful build. This can become problematic for some systems as you can easily have 100+ tags a day. For such cases I recommend still running a nightly build and only tagging that.
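If your source control happens to be git, tagging a nightly is a one-liner; the tag name below is just an example:

git tag -a nightly-2016-03-14 -m 'nightly build'
git push origin nightly-2016-03-14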
Here are some artifacts/pieces of information that I'm used to keeping for each build:
The tag name of the snapshot you are building (tag and do a clean checkout before you build)
The build scripts themselves, or their version number (if you treat them as a separate project with its own version control)
The output of the build script: logs and final product
A snapshot of your environment:
compiler version
build tool version
libraries and dll/libs versions
database version (client & server)
ide version
script interpreter version
OS version
source control version (client and server)
versions of other tools used in the process and everything else that might influence the content of your build products. I usually do this with a script that queries all this information and logs it to a text file that should be stored with the other build artifacts.
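A minimal sketch of such an environment-snapshot script; which tools you query will obviously differ per project, and the ones below are only examples:

#!/bin/sh
# dump tool/OS versions into a text file stored with the other build artifacts
{
  date
  uname -a                    # OS version
  gcc --version | head -n1    # compiler version
  make --version | head -n1   # build tool version
  git --version               # source control client version
} > build-environment.txt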
Ask yourself this question: "if something destroys entirely my build/development environment what information would I need to create a new one so I can redo my build #6547 and end up with the exact same result I got the first time?"
Your answer is what you should keep at each build and it will be a subset or superset of the things I already mentioned.
You can store everything in your SCM (I'd recommend a separate repository), but in that case your question of how long to keep the items loses its meaning. Alternatively, store everything in zipped folders, or burn a CD/DVD with the build result and artifacts. Whatever you choose, have a backup copy.
You should store them as long as you might need them. How long that is will depend on your development team's pace and your release cycle.
And no, I don't think it changes if you do continuous integration.
This isn't a direct answer to your question, but don't forget to version control the nightly build setup itself. When the project structure changes, you may have to change the build process, which will break older builds from that point on.
In addition to the binaries, as everyone else has mentioned, I would recommend setting up a symbol server and a source server and making sure you get the correct information out of the build and into those. It will aid in debugging tremendously.
We save the binaries, stripped and unstripped (so we have the exact same binary, once with and once without debug symbols). Further, we build everything twice, once with debug output enabled and once without (again, stripped and unstripped, so every build results in 4 binaries). The build is stored in a directory named after the SVN revision number. That way we can always retrieve the source from the SVN repository by simply checking out that very revision (so the source is archived as well).
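A sketch of that directory-per-revision layout; svnversion is standard Subversion tooling, and the binary names are placeholders:

REV=$(svnversion)                        # working-copy revision number
mkdir -p "builds/$REV"
cp myapp myapp-stripped "builds/$REV"/   # example binary names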
A surprising one I learned about recently: If you're in an environment that might be audited you'll want to save all the output of your build, the script output, the compiler output, etc.
That's the only way you can verify your compiler settings, build steps, etc.
Also, how long to save them for, and where to save them?
Save them until you know that build won't be going to production; in other words, as long as you have the compiled bits around.
One logical place to save them is your SCM system. Another option is to use a tool that will automatically save them for you, like AnthillPro and its ilk.
We're doing something close to "embedded" development here, and I can tell you what we save:
the SVN revision number and timestamp, as well as the machine it was built on and by whom (also burned into the build binaries)
a full build log, showing whether it was a full/incremental build, any interesting (STDERR) output the data baking tools produced, a list of files compiled and any compiler warnings (this compresses very well, being text)
the actual binaries (for anywhere from 1-8 build configurations)
files produced as a side effect of linking: a linker command file, address map, and a sort of "manifest" file indicating what was burned into the final binaries (CRC and size for each), as well as the debugging database (.pdb equivalent)
We also mail out the result of running some tools over the "side-effect" files to interested users. We don't actually archive these since we can reproduce them later, but these reports include:
total and delta of filesystem size, broken down by file type and/or directory
total and delta of code section sizes (.text, .data, .rodata, .bss, .sinit, etc)
When we have unit tests or functional tests (e.g. smoke tests) running, those results show up in the build log.
We've not thrown out anything yet -- granted, our target builds usually end up at ~16 or 32 MiB per configuration, and they're fairly compressible.
We do keep uncompressed copies of the binaries around for 1 week for ease of access; after that we keep only the lightly compressed version. About once a month we have a script that extracts each .zip that the build process produces and 7-zips a whole month of build outputs together (which takes advantage of only having small differences per build).
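A sketch of that monthly recompression step, assuming the per-build .zip files sit in one directory per month; the paths and 7z options are examples:

for z in builds/2016-03/*.zip; do
  unzip -q "$z" -d "builds/2016-03/unpacked/$(basename "$z" .zip)"
done
7z a -mx=9 builds/2016-03.7z builds/2016-03/unpacked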
An average day might have a dozen or two builds per project... The build server wakes up about every 5 minutes to check for relevant differences and builds. A full .7z on a large, very active project for one month might be 7-10 GiB, but it's certainly affordable.
For the most part, we've been able to diagnose everything this way. Occasionally there's a hiccup in the build system and a file isn't actually at the revision it's supposed to be when a build happens, but there's usually enough evidence of this in the logs. Sometimes we have to dig out a tool that understands the debugging database format and feed it a few addresses to diagnose a crash (we have automatic stack dumps built into the product). But usually all the information needed is there.
We haven't had to crack open the .7z archives yet, I should mention. But we have the info there, and I have some interesting ideas on how to mine bits of useful data from it.
Save what can't be reproduced easily. I work on FPGAs where only the FPGA team has the tools, and some cores (libraries) of the design are licensed to compile on only one machine. So we save the output bitstreams. But try to check them in over one another rather than with a date/time/version stamp.
Save as in check in to source code control, or just save on disk? Save nothing to source code control. All derived files should be visible in the file system and available to developers. Don't check in binaries, code generated from XML files, message digests, etc. A separate packaging step will make these end products available. Since you have the change number, you can always reproduce the build if necessary, assuming of course that everything you need to do a build is completely in the tree and available to all builds by syncing.
I would save your built binaries for exactly as long as they have a chance to go into production or be used by some other team (like a QA group). Once something has left production, what you do with it can vary a lot. For a lot of teams, they'll keep just their most recent prior build around (for rollback) and otherwise discard their builds.
Others have regulatory requirements to keep anything that went into production around for as long as seven years (banks). If you are a product company, I'd keep around any binary a customer might have installed in case a tech support guy wants to install the same version.
