How to work directly with a data.frame in a physical RData file?

Do I have to
1) load a data.frame from the physical RData to the memory,
2) make changes,
3) save it back to the physical RData,
4) remove it from the memory to avoid conflicts?
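Spelled out, that cycle looks roughly like this (the file name mydata.RData and the data frame my_df are made up for illustration):
load("mydata.RData")                 # 1) bring my_df into memory
my_df$score <- my_df$score * 2       # 2) make changes
save(my_df, file = "mydata.RData")   # 3) write it back to the RData file
rm(my_df)                            # 4) remove the in-memory copy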
Is there any way I can skip the load/save steps and make permanent changes to the physical RData directly? Is there a way to work with a data.frame the way one works with a SQLite/MySQL database? Or should I just use SQLite/MySQL (instead of a data.frame) as the data storage?
More thoughts: I think the major difference is that to work with SQLite/MySQL you establish a connection to the database, but to work with a data.frame from RData you make a copy in memory. The latter approach can create conflicts in complex programs. To avoid potential conflicts you have to save the data.frame and immediately remove it from memory every time you change it.
Thanks!

Instead of using load you may want to consider using attach. This can attach the saved .RData file to the search path without loading all of its objects into the global environment. The data frame would then be available to use.
If you want to change the data frame then you would need to copy it to the global environment (this will happen automatically for most edits) and then you would need to save it again (there is no simple way to save it back into a .RData file that contains other objects).
When you are done you can use detach (but if you have made a copy in the global environment then you will still need to delete that copy).
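A minimal sketch of that workflow, assuming the file is called mydata.RData and contains a data frame my_df:
attach("mydata.RData")        # attaches the file to the search path as "file:mydata.RData"
head(my_df)                   # read-only use needs no copy in the global environment
my_df$flag <- TRUE            # editing creates a modified copy in the global environment
detach("file:mydata.RData")   # remove the attached file from the search path
rm(my_df)                     # also remove the copy created by the edit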
If you don't like typing the load/save commands (or attach/detach) each time then you could write your own function that goes through all the steps for you (and if the copy is only in the environment of the function then you don't need to worry about deleting it).
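For example, a rough sketch of such a wrapper (the function, file, and object names are invented):
edit_rdata <- function(file, name, fun) {
  e <- new.env()
  load(file, envir = e)                        # load only into the function's environment
  e[[name]] <- fun(e[[name]])                  # apply the change to the named object
  save(list = ls(e), file = file, envir = e)   # write everything back to the same file
  invisible(e[[name]])
}
# usage: add a column to my_df stored in mydata.RData
edit_rdata("mydata.RData", "my_df", function(df) { df$flag <- TRUE; df })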
You may also want to consider different ways of storing your data. The typical .RData file works well for an all-or-nothing approach. The saveRDS and readRDS functions save and read a single object (and do not force you to use the same name when reading it back in). The database-interface approach is probably best if you are making frequent changes to tables and want them stored outside of R.
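A quick sketch of the single-object approach (the file name is arbitrary):
saveRDS(my_df, "my_df.rds")   # store one object per file
df <- readRDS("my_df.rds")    # read it back, under any name you like
df$flag <- TRUE
saveRDS(df, "my_df.rds")      # write the updated version back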

Related

How to make a package set up protected variables in R?

I am trying to create an R package mypckg with a function createShinyApp. The latter function should create a directory structure ready to use as a shiny app at some location. In this newly created shiny app, I have some variables which should be accessible from within the shiny app, but not directly by the user (to prevent a user from accidentally overwriting them). The reason for these variables to exist (I know one should not use global variables) is that the shiny app handles a text corpus and I want to avoid passing (and hence copying) it between the many functions, because this could exhaust memory. Somebody using mypckg should be able to set these variables and later use createShinyApp.
My ideas so far are:
I make mypckg::createShinyApp save the protected variables in a protectedVariables.rds file and have the shiny app load the variables from this file into a new environment. I am not very experienced with environments, so I could not get this to work properly yet: creating a new environment when the shiny app starts up has not worked so far.
I make mypckg::createShinyApp save the protected variables in a protectedVariables.rds file and have the shiny app load the variables from this file into the options. Thereafter I would read and set the variables with getOption() and options().
What are the advantages and disadvantages of these ideas and are there yet simpler and more elegant ways of achieving my goal?
It's a little bit difficult to understand the situation without seeing a concrete example of the kind of variable and context you're using it in, but I'll try to answer.
First of all: in R, it's very, very difficult to achieve 100% protection of a variable. Even in shiny, the authors put up a lot of mechanisms to keep certain variables (like input, for example) from being overwritten by users, and while that makes it much harder, it's impossible, or at least extremely difficult, to actually block every way of changing a variable.
With that disclaimer out of the way, I assume that you'd be happy with something that prevents people from accidentally overwriting the variable; if they go purposely out of their way to do it, then so be it. In that case, you can certainly read the variables from an RDS file as you suggest (with the caveat that the user can of course overwrite that file). You can also use a global package-level variable: global variables are usually frowned upon, but in the context of a package this is a very common thing to do.
For example, you can define in a globals.R file in your package:
.mypkgenv <- new.env(parent = emptyenv())
.mypkgenv$var1 <- "some variable"
.mypkgenv$var2 <- "some other variable"
And the shiny app can access these using
mypckg:::.mypkgenv$var1
This is just one way; there are other ways too.
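If users of mypckg should be able to set these values before calling createShinyApp, one option (just a sketch; the accessor names are invented) is to export a small getter/setter pair that wraps the same environment:
set_var1 <- function(value) {
  .mypkgenv$var1 <- value   # store the user-supplied value in the package environment
  invisible(value)
}
get_var1 <- function() {
  .mypkgenv$var1            # read it back from within the package or the shiny app
}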

Is there a "correct" way to use exported data internally in an R package?

I know that exported data (access to users) belongs in the data/ folder and that internal data (data used internally by package functions) belongs in R/sysdata.rda. However, what about data I wish to both export to the user AND be available internally for use by package functions?
Currently, presumably due to the order in which objects/data are added to the NAMESPACE, my exported data is not available during devtools::check() and I am getting a NOTE: no visible binding for global variable 'data_x'.
There are probably a half dozen ways to get around this issue, many of which appear to me as rather hacky, so I was wondering if there was a "correct" way to have BOTH external and internal data (and avoid the NOTE from R CMD check).
So far I see these options:
Write an internal function that calls the data and use that everywhere internally (see the sketch after this list)
Use ':::' to access the data, which seems odd and invokes a different warning
Have a copy of data_x in BOTH data/ and R/sysdata.rda (super hacky)
Get over it and ignore the NOTE
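A rough sketch of the first option (the package name mypackage and the accessor name are placeholders):
# internal code calls .get_data_x() instead of referring to data_x as a bare symbol
.get_data_x <- function() {
  e <- new.env()
  utils::data("data_x", package = "mypackage", envir = e)   # load the exported dataset
  e$data_x
}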
Any suggestions greatly appreciated,
Thx.

Saving workspace image, in R

When closing R Studio at the end of a R session, I am asked via a dialog box: "Save workspace image to [working directory] ?"
What does that mean? If I choose to save the workspace image, where is it saved? I always choose not to save the workspace image; are there any disadvantages to that?
I looked on Stack Overflow but did not find posts explaining what the question means. I only found a question about how to disable the prompt (with no simple answers...): How to disable "Save workspace image?" prompt in R?
What does that mean?
It means that R saves a list of objects in your global environment (i.e. where your normal work happens) into a file. When R next loads, this list is by default restored (at least partially — there are cases where it won’t work).
A consequence is that restarting R does not give you a clean slate. Instead, your workspace is cluttered with existing stuff, which is generally not what you want. People then resort to all kinds of hacks to try to clean their workspace. But none of these hacks are reliable, and none are necessary if you simply don’t save/restore your workspace.
If I choose to save the workspace image, where is it saved?
R creates a (hidden) file called .RData in your current working directory.
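Roughly, this corresponds to running:
save.image(file = ".RData")   # what "Save workspace image" does on exit
load(".RData")                # what R does on the next startup in that directory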
I always choose not to save the workspace image; are there any disadvantages to that?
The advantage is that, under some circumstances, you avoid recomputing results when you continue your work later. However, there are other, better ways of achieving this. On the flip side, starting R without a clean slate has many disadvantages: Any new analysis you now start won’t be in a clean room, and it won’t be reproducible when executed again.
So you are doing the right thing by not saving the workspace! It's one of the rules of creating reproducible R code. For more information, I recommend Jenny Bryan's article on using R with a Project-oriented workflow.
But having to manually reject saving the workspace every time is annoying and error-prone. You can disable the dialog box in the RStudio options.
The workspace will include any of your saved objects, e.g. data frames, matrices, functions, etc.
Saving it into your working directory will allow you to load it back in the next time you open RStudio, so you can continue exactly where you left off. There is no real disadvantage to not saving it if you can recreate everything from your script next time and your script doesn't take long to run.
The only thing I have to add here is that you should seriously consider that some people may be working on ongoing projects, i.e. things that aren't accomplished in one day, and thus must save their workspace image so as not to start from the beginning again.
I think best practice is: it's OK to save your workspace, but your code only really works if you can clear the entire workspace and then rerun it completely with no errors!

Store map key/values in a persistent file

I will be creating a structure more or less of the form:
type FileState struct {
    LastModified int64
    Hash         string
    Path         string
}
I want to write these values to a file and read them in on subsequent calls. My initial plan is to read them into a map and lookup values (Hash and LastModified) using the key (Path). Is there a slick way of doing this in Go?
If not, what file format can you recommend? I have read about and experimented with some key/value file stores in previous projects, but not using Go. Right now, my requirements are probably fairly simple, so a big database server system would be overkill. I just want something I can write to and read from quickly, easily, and portably (Windows, Mac, Linux). Because I have to deploy on multiple platforms, I am trying to keep my non-Go dependencies to a minimum.
I've considered XML, CSV, JSON. I've briefly looked at the gob package in Go and noticed a BSON package on the Go package dashboard, but I'm not sure if those apply.
My primary goal here is to get up and running quickly, which means the least amount of code I need to write along with ease of deployment.
As long as your entire data set fits in memory, you shouldn't have a problem. Using an in-memory map and writing snapshots to disk regularly (e.g. by using the gob package) is a good idea. The Practical Go Programming talk by Andrew Gerrand uses this technique.
If you need to access those files with different programs, using a popular encoding like JSON or CSV is probably a good idea. If you just have to access those files from within Go, I would use the excellent gob package, which has a lot of nice features.
As soon as your data becomes bigger, it's not a good idea to always write the whole database to disk on every change. Also, your data might not fit into RAM anymore. In that case, you might want to take a look at the leveldb key-value database package by Nigel Tao, another Go developer. It's currently under active development (but not yet usable), but it will also offer some advanced features like transactions and automatic compression. Also, the read/write throughput should be quite good because of the leveldb design.
There's an ordered key-value persistence library for Go that I wrote, called gkvlite:
https://github.com/steveyen/gkvlite
JSON is very simple but makes bigger files because of the repeated variable names. XML has no advantage. You should go with CSV, which is really simple too. Your program will be less than a page of code.
But it depends, in fact, upon your modifications. If you make a lot of modifications and need them stored synchronously on disk, you may need something a little more complex than a single file. If your map is mainly read-only, or if you can afford to dump it to a file only rarely (not every second), a single CSV file along with an in-memory map will keep things simple and efficient.
BTW, use Go's encoding/csv package to do this.

Access cacheSweave objects from cache

I use Sweave and cacheSweave with pleasure.
Occasionally I have a document in which some section takes a really long time (like 20 minutes or something) and after processing I'd like to open up the objects it created and play around with them in an interactive session.
Does anyone know a method for doing that, presumably by fetching directly against the stashR database or something?
I would have preferred to put this as a comment, but I don't have an account here, so anyway.
The easiest way to do this is probably to put save.image() statements at strategic points in the .Rnw file, so that all objects created up to that point are saved. Then one can open a new instance of R and interact with the objects without altering the Sweave file. HTH.
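For example, inside a chunk right after the long-running section (the file name here is arbitrary):
save.image(file = "after-long-section.RData")   # snapshot everything created so far
and then, in a separate interactive session:
load("after-long-section.RData")                # restore those objects and explore them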
