After reading this question I attempted to clean out my workspace and found that each time I opened R, all the original items I had recently removed were restored. I then checked .RData and found that it had not been modified in a few weeks, even though I repeatedly saved the workspace image. How often is .RData updated, and how can I control when it is updated so that it reflects more recent changes?
It gets modified if and when you
use save.image()
use q() and answer yes
Otherwise it does not get changed.
My personal preference is to explicitly load and save data I want to cache across sessions or for further analysis.
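For example (object and file names here are purely illustrative), you can cache only the results you actually want to keep:

save(results, model, file = "analysis-cache.RData")  # write just these objects
# ...in a later session:
load("analysis-cache.RData")                          # restores them under their original names
# or, for a single object, choose the name yourself when reading it back:
saveRDS(results, "results.rds")
results <- readRDS("results.rds")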
When closing RStudio at the end of an R session, I am asked via a dialog box: "Save workspace image to [working directory]?"
What does that mean? If I choose to save the workspace image, where is it saved? I always choose not to save the workspace image; are there any disadvantages to doing that?
I looked on Stack Overflow but did not find posts explaining what the question means; I only found a question about how to disable the prompt (with no simple answers...): How to disable "Save workspace image?" prompt in R?
What does that mean?
It means that R saves the objects in your global environment (i.e. where your normal work happens) into a file. When R next loads, these objects are by default restored (at least partially; there are cases where it won't work).
A consequence is that restarting R does not give you a clean slate. Instead, your workspace is cluttered with existing stuff, which is generally not what you want. People then resort to all kinds of hacks to try to clean their workspace. But none of these hacks are reliable, and none are necessary if you simply don’t save/restore your workspace.
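For example, the most common workaround only removes visible objects and leaves the rest of the session state (loaded packages, attached data, options, the random seed) untouched:

rm(list = ls(all.names = TRUE))   # deletes objects, but this is not a clean restart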
If I choose to save the workspace image, where is it saved?
R creates a (hidden) file called .RData in your current working directory.
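For example, you can check for the file from within R; the directory reported by getwd() is where it would go:

getwd()                 # the current working directory
file.exists(".RData")   # TRUE if a saved workspace image is present there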
I always choose not to save the workspace image; are there any disadvantages to doing that?
The advantage is that, under some circumstances, you avoid recomputing results when you continue your work later. However, there are other, better ways of achieving this. On the flip side, starting R without a clean slate has many disadvantages: Any new analysis you now start won’t be in a clean room, and it won’t be reproducible when executed again.
So you are doing the right thing by not saving the workspace! It's one of the rules of creating reproducible R code. For more information, I recommend Jenny Bryan's article on using R with a project-oriented workflow.
But having to manually reject saving the workspace every time is annoying and error-prone. You can disable the dialog box in the RStudio options.
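If you have the usethis package installed, a helper can flip the same settings from code (this is just a convenience; plain R outside RStudio can instead be started with the --no-save --no-restore flags):

usethis::use_blank_slate()   # configure RStudio to never save or restore .RData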
The workspace will include any of your saved objects, e.g. data frames, matrices, functions, etc.
Saving it into your working directory lets you load it back in the next time you open RStudio, so you can continue exactly where you left off. There is no real disadvantage to not saving it if you can recreate everything from your script and the script doesn't take long to run.
The only thing I would add is that some people are working on ongoing projects, i.e. work that isn't finished in one day, and so they save their workspace image to avoid starting from the beginning again.
I think best practice is: it's OK to save your workspace, but your code only really works if you can clear your entire workspace and then rerun it completely with no errors!
I noticed that even after clearing the environment, clearing the workspace, and uninstalling R, old variables still show up.
Here is how I launch my database:
# rm(list = ls(all = TRUE))
stim <- read.table(file.choose(), header = TRUE)
attach(stim)
names(stim)
summary(stim)
str(stim$emotionT2)
names(stim)
I tried removing the attach(stim) line, but then none of the code that uses the newly imported dataset works.
How can I completely clear all data to make sure that I am really testing the newly imported one?
Delete any .RData files in your working directory (your home directory, if you aren't sure). If you want to be careful, just move them/rename them rather than deleting.
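For example, run from within R in that directory (moving the file aside is the cautious option mentioned above):

file.exists(".RData")                 # is there a saved workspace here?
file.rename(".RData", ".RData.bak")   # move it aside rather than deleting
# file.remove(".RData")               # or delete it outright once you're sure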
Do I have to
1) load a data.frame from the physical RData file into memory,
2) make changes,
3) save it back to the physical RData file, and
4) remove it from memory to avoid conflicts?
Is there any way I can skip the load/save steps and make permanent changes to the physical RData file directly? Is there a way to work with a data.frame the way one works with a SQLite/MySQL database? Or should I just use SQLite/MySQL (instead of a data.frame) as the data storage?
More thoughts: I think the major difference is that to work with SQLite/MySQL you establish a connection to the database, but to work with a data.frame from an RData file you make a copy in memory. The latter approach can create conflicts in complex programs. To avoid potential conflicts you have to save the data.frame and immediately remove it from memory every time you change it.
Thanks!
Instead of using load you may want to consider using attach. This attaches the saved data file to the search path without loading the objects in it into the global environment. The data frame is then available to use.
If you want to change the data frame then you would need to copy it to the global environment (this will happen automatically for most editing) and then you would need to save it again (there is no simple way to save it back into a .RData file that contains other objects).
When you are done you can use detach (but if you have made a copy in the global environment then you will still need to delete that copy).
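A minimal sketch of that workflow, with "mydata.RData" and stim as illustrative names (note the caveat above: save() here rewrites the file with only the one object):

attach("mydata.RData")                 # puts the file's objects on the search path
summary(stim)                          # read-only use works without copying
stim <- transform(stim, flag = TRUE)   # editing creates a copy in the global environment
save(stim, file = "mydata.RData")      # writes the changed copy back (only stim!)
detach("file:mydata.RData")            # attached files appear as "file:<name>" on the search path
rm(stim)                               # and drop the global copy when done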
If you don't like typing the load/save commands (or attach/detach) each time then you could write your own function that goes through all the steps for you (and if the copy is only in the environment of the function then you don't need to worry about deleting it).
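A hedged sketch of such a helper; update_rdata and its arguments are illustrative names, not an existing API:

update_rdata <- function(file, name, fun) {
  e <- new.env()
  load(file, envir = e)                        # objects exist only inside e
  e[[name]] <- fun(e[[name]])                  # apply the change to the named object
  save(list = ls(e), file = file, envir = e)   # write everything back to the file
  invisible(e[[name]])
}
# e.g. update_rdata("mydata.RData", "stim", function(d) transform(d, flag = TRUE))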
You may also want to consider different ways of storing your data. The typical .RData file works well for an all-or-nothing approach. The saveRDS and readRDS functions will save and read a single object (and do not force you to use the same name when reading it back in). Interfacing with a database is probably the best approach if you are making frequent changes to tables and want them stored outside of R.
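For completeness, the single-object route looks like this ("stim.rds" is an illustrative file name):

saveRDS(stim, "stim.rds")      # store one object; no workspace file involved
stim2 <- readRDS("stim.rds")   # read it back under whatever name you like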
I've written a small function that downloads a file from my S3 data repository only if the size of the local version of the file is different, to save bandwidth and time.
I would like to improve it to download if and only if the last update datetime is different. I can make the check using HEAD (from the httr package) to get the datetime for the remote file and file.info for the local one.
But (as expected) when I download a fresh copy of the file, it gets the current system date as its creation/last update time. I need a way to update the datetime of the fresh local copy with the one from the server, while accounting for potential issues due to different time zones.
file.info doesn't seem to be able to write file properties.
Any idea how I can do that?
I don't think you can, and even if you could, that approach seems a bit unreliable to me (you mentioned time zones, for example). Instead, I would suggest you rely on a file's md5sum (a fingerprint of its contents) to tell when it has changed:
library(tools)
# Copy only when the content hashes differ (md5sum works on file paths);
# overwrite = TRUE replaces the stale local copy.
if (md5sum(remote) != md5sum(local)) file.copy(remote, local, overwrite = TRUE)
Issue solved, see answers for details.
I would like to run some code (with knitr) on a more powerful server and then be able to make small changes on my own laptop. Even after copying across the entire folder, it seems that the cache is rebuilt when re-compiling locally; is there a way to avoid that and actually use the results in the cache?
Update: the problem arose from different versions of knitr on different machines.
In theory, yes -- if you do not change anything, the cache will be kept. In practice, you have to check carefully what the "small changes" are. The documentation page for cache explains when the cache will be rebuilt, and you need to check whether all three conditions are met.
I wonder if, in addition to @Yihui's answer, the process of copying from one machine to another changes the datetimes on the files so that they look out of date even when nothing has changed.
Look at the dates on the files involved after copying. If you can figure out which files need to be newer than others, then touching them may prevent the rebuild.
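For example, something along these lines from within the project directory, assuming the cache lives in a "cache" folder (adjust to your cache.path):

cache_files <- list.files("cache", full.names = TRUE, recursive = TRUE)
for (f in cache_files) Sys.setFileTime(f, Sys.time())   # "touch" each cache file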
Another option would be to just paste in the cached pieces directly so that they are not rerun (though that means you have to rerun and re-paste manually if you change anything in those parts).