I want to load datasets from different packages but assign each one to a separate object. Some packages ship data with the same name, and I want to load both as separate objects. For example:
data("milk", package = "EMSC")
data("milk", package = "baseline")
But the later call replaces the object created by the first. So I want to assign them to separate objects, e.g. milk.emsc and milk.baseline.
Is there an efficient and simple solution for this?
Since I came back to this question after a long time, I will write up the answer I came up with, in case someone runs into the same problem.
local({
  data("milk", package = "baseline", envir = environment())
  assign(x = "milk_baseline", value = milk, envir = .GlobalEnv)
})
local({
  data("milk", package = "EMSC", envir = environment())
  assign(x = "milk_emsc", value = milk, envir = .GlobalEnv)
})
This way the global environment stays clean and ends up with only the two data sets that share a name across the two packages.
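An equivalent approach, if you prefer to avoid assign() into the global environment, is to wrap the idea in a small helper that loads the dataset into a throwaway environment and returns it. This is just a sketch; the helper name get_pkg_data is my own, and it assumes the dataset loads a single object with the name you asked for.
get_pkg_data <- function(name, package) {
  e <- new.env()
  data(list = name, package = package, envir = e)  # load into the temporary environment
  e[[name]]                                        # return the loaded object
}
milk_emsc     <- get_pkg_data("milk", "EMSC")
milk_baseline <- get_pkg_data("milk", "baseline")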
I am currently writing an R package containing project-specific data cleaning functions for my collaborators, using devtools and roxygen2 and following the RStudio suggested formats. Many of these functions essentially fix typos/common data entry errors using reference files (dataframes) that are currently stored in the package's sysdata.rda file under /R. Prior to the issue I present below, referencing the files and using them within functions was working fine. Lazy load is set to true.
I would like to make a function that allows users to add a row to these reference files if they come across a novel typo/error. From my research and from reading the very helpful information at https://r-pkgs.org/data.html, it seems the best way to do this is to copy the reference files into a new environment and then allow the user to edit those files within that session-specific environment. Ideally, these changes would persist across sessions, but I cannot figure out how to make that work, so I am continuing down this path.
For brevity, I've only included one of these files, called column_standardized, which contains standard names for the columns as well as the potential alternatives we regularly come across. A function called "standardize.columns" coerces column names of input data frames to the standards and also reorders them to our agreed standard.
Here is a short reproducible of the column_standardized:
column_standardized <- data.frame(
  standard_name = c("date", "date", "time", "ID", "ID", "ID", "location", "location"),
  other_names = c("DATE", "day", "TIME", "id", "individual", "name", "LOCATION", "locale")
)
To do this, I created a file in /R that contains the following code, based heavily on the example in https://r-pkgs.org/data.html (section 8.5, Internal State). The file is titled "aaaaa.R" so it comes before other functions in /R:
(The reason I set key, correct_col, and alt_col is so that the same function can act upon different reference files with the same code. It works fine, and I am not necessarily seeking feedback on the data manipulation aspects of this function.)
the <- new.env(parent = emptyenv())
the$column_standardized <- column_standardized

#' Add alternative names to reference
#'
#' @param correct A character string of the standard/correct name
#' @param alt A character string of the alternative name
#' @param data_type A character string indicating which reference file to edit. Options are 'column' (and others in reality).
#'
#' @return NA; edits included reference files for the current R session.
#' @export
add.alt <- function(correct, alt, data_type){
  if (data_type == "column") {
    key <- the$column_standardized
    correct_col <- "standard_name"
    alt_col <- "other_names"
  } else {
    print("\nData type not found. Acceptable options are 'column', etc.")
  }
  new_key <- data.frame(matrix(ncol = length(colnames(key)), nrow = 1))
  colnames(new_key) <- colnames(key)
  new_key[1, correct_col] <- correct
  new_key[1, alt_col] <- alt
  key <- rbind(key, new_key)
  if (data_type == "column") {
    the$column_standardized <- key
    invisible(key)
  }
}
No errors/issues are flagged when I run document() or load_all(), but when I check the package it is unable to install because column_standardized does not exist. I assume this is because sysdata.rda is loaded after .R files in /R.
I have also tried putting column_standardized in the /data folder and calling it using system.file(), but I run into the same error.
The actual file is over 300 rows long, and there are multiple reference files, and so I don't think it makes sense to just recreate the data frame in the environment from scratch, although I've considered it.
Finally, to my specific questions:
Is there a way to load the system data first so that .R files in /R can reference the data included?
I am not wedded to storing them internally, although that would be ideal for privacy reasons, and could move the data frames to /data or another location. Storing them internally seemed simplest at first, but I could be wrong.
Is there a modification that would allow each user to "permanently" modify these reference files? It wouldn't be too much of a headache for them to run add.alt() each session since the files already contain most common errors, and once user data is edited/standardized one usually does not need to restandardize in another session. If someone knows a solution, however, it would probably be the ideal.
I am potentially completely off-base here as this is my first time developing a package, so any tips are appreciated! Many thanks in advance, and happy to provide more documentation information if I've forgotten anything crucial.
I would like to remove some data from the workspace. I know the "Clear All" button will remove all data. However, I would like to remove just certain data.
For example, I have these data frames in the data section:
data
data_1
data_2
data_3
I would like to remove data_1, data_2 and data_3, while keeping data.
I tried data_1 <- data_2 <- data_3 <- NULL, which does remove the data (I think), but it still keeps the objects in the workspace area, so it is not quite what I want.
You'll find the answer by typing ?rm
rm(data_1, data_2, data_3)
A useful way to remove a whole set of similarly named objects:
rm(list = ls()[grep("^tmp", ls())])
thereby removing all objects whose name begins with the string "tmp".
Edit: Following Gsee's comment, making use of the pattern argument:
rm(list = ls(pattern = "^tmp"))
Edit: Answering Rafael's comment, one way to retain only a subset of objects is to name the data you want to retain with a specific pattern. For example, if you wanted to remove all objects whose names do not start with "paper", you would issue the following command:
rm(list = grep("^paper", ls(), value = TRUE, invert = TRUE))
The following command will remove all objects (all = TRUE also includes hidden objects whose names begin with a dot):
rm(list=ls(all=TRUE))
In RStudio, ensure the Environment tab is in Grid (not List) mode.
Tick the object(s) you want to remove from the environment.
Click the broom icon.
You can use the apropos function, which finds objects by partial name.
rm(list = apropos("data_"))
Use the following command
remove(list=c("data_1", "data_2", "data_3"))
If you want to keep just one of a group of variables, you can build the list of names to remove with setdiff(). The rm function can then remove all the variables apart from "data". Here is the script:
0->data
1->data_1
2->data_2
3->data_3
#check variables in workspace
ls()
rm(list=setdiff(ls(), "data"))
#check remaining variables in workspace after deletion
ls()
#note: if you just use rm(list) then R will attempt to remove the "list" variable.
list=setdiff(ls(), "data")
rm(list)
ls()
paste0("data_", seq(1, 3, 1))
# makes multiple data.frame names with sequential numbers
rm(list = paste0("data_", seq(1, 3, 1)))
# the above removes data_1 to data_3
If you're using RStudio, please consider never using the rm(list = ls()) approach!* Instead, you should build your workflow around frequently employing the Ctrl+Shift+F10 shortcut to restart your R session. This is the fastest way to both nuke the current set of user-defined variables AND to clear loaded packages, devices, etc. The reproducibility of your work will increase markedly by adopting this habit.
See this excellent thread on the RStudio Community forum (h/t @kierisi) for a more thorough discussion (the main gist is captured by what I've stated already).
I must admit my own first few years of R coding featured script after script starting with the rm "trick" -- I'm writing this answer as advice to anyone else who may be starting out their R careers.
*of course there are legitimate uses for this -- much like attach -- but beginning users will be much better served (IMO) crossing that bridge at a later date.
To clear all data:
click on Misc > Remove all objects.
You're good to go.
To clear the console:
click on Edit > Clear console.
No need for any code.
Adding one more way, using ls() and remove()
ls() returns a vector of character strings giving the names of the objects in the specified environment.
Create a list of the objects you want to remove from the environment using ls(), and then use remove() to remove them.
remove(list = ls()[ls() != "data"])
You can also use the tidyverse:
library(tidyverse)  # provides %>% and stringr::str_subset()
# to remove specific object(s)
rm(list = ls() %>% str_subset("xxx"))
# or to keep specific object(s)
rm(list = setdiff(ls(), ls() %>% str_subset("xxx")))
Maybe this can help as well
remove(list = c(ls()[!ls() %in% c("what", "to", "keep", "here")] ) )
This is a tricky one as I can't provide a reproducible example, but I'm hoping that others may have had experience dealing with this.
Essentially I have a function that pulls a large quantity of data from a DB, cleans it and reduces its size, and loops through some parameters to produce a series of lm model objects, parameter values and other reference values. This is compiled into a complex list structure that totals about 10 MB.
It's then supposed to be saved as an RDS file on AWS S3, where it's retrieved in a production environment to build predictions.
e.g.
db.connection <- db.connection.object

build_model_list <- function(db.connection) {
  clean_and_build_models <- function(db.connection, other.parameters) {
    get_db_data <- function(db.connection, some.parameters) {# Retrieve db data} ## Externally defined
    db.data <- get_db_data()
    build_models <- function(db.data, some.parameters) ## Externally defined
    clean_data <- function(db.data, some.parameters) {# Cleans and filters data based on parameters} ## Externally defined
    clean.data <- clean_data()
    lm_model <- function(clean.data) {# Builds lm model based on clean.data} ## Externally defined
    lm.model <- lm_model()
    return(list(lm.model, other.parameters))} ## Externally defined
  looped.model.object <- llply(some.parameters, clean_and_build_models)
  return(looped.model.object)}

model.list <- build_model_list()
saveRDS(model.list, "~/a_place/model_list.RDS")
The issue I'm getting is that the 'model.list' object, which is only 10 MB in memory, inflates to many GB when I save it locally as RDS or try to upload it to AWS S3.
I should note that though the function processes very large quantities of data (~ 5 million rows), the data used in the outputs is no larger than a few hundred rows.
Reading the limited info on this on Stack Exchange, I've found that moving some of the externally defined functions (as part of a package) inside the main function (e.g. clean_data and lm_model) helps reduce the RDS save size.
This however has some big disadvantages.
Firstly, it's trial and error and follows no clear logical order; with frequent crashes and a couple of hours needed to build the list object, it's a very long debugging cycle.
Secondly, it'll mean my main function will be many hundreds of lines long which will make future alterations and debugging much more tricky.
My question to you is:
Has anyone encountered this issue before?
Any hypotheses as to what's causing it?
Has anyone found a logical non-trial-and-error solution to this?
Thanks for your help.
It took a bit of digging but I did actually find a solution in the end.
It turns out it was the lm model objects that were the guilty party. Based on this very helpful article:
https://blogs.oracle.com/R/entry/is_the_size_of_your
It turns out that the lm.object$terms component includes an environment that references the objects present in the global environment when the model was built. Under certain circumstances, when you call saveRDS, R will try to draw those environment objects into the saved object.
As I had ~0.5 GB sitting in the global environment and a list of ~200 lm model objects, this caused the RDS object to inflate dramatically, as it was actually trying to compress ~100 GB of data.
To test if this is what's causing the problem, execute the following code:
as.matrix(lapply(lm.object, function(x) length(serialize(x,NULL))))
This will tell you if the $terms component is inflating.
The following code will remove the environmental references from the $terms component:
rm(list=ls(envir = attr(lm.object$terms, ".Environment")), envir = attr(lm.object$terms, ".Environment"))
Be warned though it'll also remove all the global environmental objects it references.
For model objects, you could also simply delete the reference to the environment, for example like this:
ctl <- c(4.17,5.58,5.18,6.11,4.50,4.61,5.17,4.53,5.33,5.14)
trt <- c(4.81,4.17,4.41,3.59,5.87,3.83,6.03,4.89,4.32,4.69)
group <- gl(2, 10, 20, labels = c("Ctl","Trt"))
weight <- c(ctl, trt)
lm.D9 <- lm(weight ~ group)
attr(lm.D9$terms, ".Environment") <- NULL
saveRDS(lm.D9, file = "path_to_save.RDS")
This unfortunately breaks the model, but you can add an environment manually after loading it again:
lm.D9 <- readRDS("path_to_save.RDS")
attr(lm.D9$terms, ".Environment") <- globalenv()
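After restoring the environment, the model can be used as usual; a quick check (the newdata values are just an illustration) might be:
predict(lm.D9, newdata = data.frame(group = factor("Trt", levels = c("Ctl", "Trt"))))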
This helped me in my specific use case and looks a bit safer to me...
Neither of these two solutions worked for me.
Instead I have used:
downloaded_object <- storage_download(connection, "path")
read_RDS <- readRDS(downloaded_object)
The answer by mhwh mostly solved my problem, but with the additional step of creating an empty list and copying into it only what was relevant from the model object. This might be due to additional (undocumented) environment references associated with the model class I used.
library(lfe)  # felm()
mm <- felm(formula = formula, data = data, keepX = TRUE, ...)
# Make an empty list and copy into it what we need:
mm_cp <- list()
mm_cp$coefficients <- mm$coefficients
# mm_cp$ <- something else from mm you might need ...
mm_cp$terms <- terms(mm)
attr(mm_cp$terms, ".Environment") <- NULL
saveRDS(mm_cp, file = "path_to_save.RDS")
Then when we need to use it:
mm_cp <- readRDS("path_to_save.RDS")
attr(mm_cp$terms, ".Environment") <- globalenv()
In my case the file went from 5.5 GB to 13 KB. Additionally, when reading in the file it used to allocate >32 GB of memory, more than six times the file size. This also reduced execution time significantly (no need to recreate various environments?).
Environmental references sounds like an excellent contender for a new chapter in the R Inferno book.
Column-wise storage in the inst/extdata directory of a package, as suggested by Jan, is now implemented in the dfunbind package.
I'm using the data-raw idiom to make entire analyses from the raw data to the results reproducible. For this, datasets are first wrapped in R packages which can then be loaded with library().
One of the datasets I'm using is largish, around 8 million observations with about 80 attributes. For my current analysis I only need a small fraction of the attributes, but I'd like to package the entire dataset anyway.
Now, if it is simply packaged as a data frame (e.g., with devtools::use_data()), it will be loaded in its entirety when first accessing it. What would be the best approach to package this kind of data so that I can lazy-load at the column level? (Only those columns which I'm actually accessing are loaded, the others happily stay on disk and don't occupy RAM.) Would the ff package help? Can anyone point me to a working example?
I think I would store the data in inst/extdata. Then create a couple of functions in your package that can read and return parts of that data. In your functions you can get the path to your data using: system.file("extdata", "yourfile", package = "yourpackage"). (As on the page you linked to.)
The question then is in what format you store your data and how do you obtain selections from it without reading the data in memory. For that, there are a large number of options. To name some:
SQLite: store your data in an SQLite database. You can then perform queries on this data using the RSQLite package (a short sketch follows this list).
ff: store your data in ff objects (e.g. save using the save.ffdf function from ffbase; use load.ffdf to load again). ff doesn't handle character fields well (they are always converted to factors). In theory the files are not cross-platform, although as long as you stay on Intel platforms you should be OK.
CSV: store your data in a plain old CSV file. You can then make selections from this file using the LaF package. The performance will probably be less than with ff but might be good enough.
RDS: store each of your columns in a separate RDS file (using saveRDS) and load them using readRDS. The advantage is that you do not depend on any other R packages. This is fast. The disadvantage is that you cannot do row selections (but that does not seem to be needed in your case).
If you only want to select columns, I would go with RDS.
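For completeness, here is the SQLite sketch mentioned above. It assumes the data has already been written to a table called "mydata" in an SQLite file shipped in inst/extdata; the file, table, and column names are placeholders.
library(DBI)
library(RSQLite)
db_path <- system.file("extdata", "mydata.sqlite", package = "yourpackage")
con <- dbConnect(SQLite(), db_path)
# read only the column(s) you need; the rest of the table stays on disk
one_column <- dbGetQuery(con, "SELECT some_column FROM mydata")
dbDisconnect(con)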
A rough example using RDS
The following code creates an example package containing the iris data set:
load_data <- function(dataset, columns) {
  result <- vector("list", length(columns))
  for (i in seq_along(columns)) {
    col <- columns[i]
    fn <- system.file("extdata", dataset, paste0(col, ".RDS"), package = "lazydata")
    result[[i]] <- readRDS(fn)
  }
  names(result) <- columns
  as.data.frame(result)
}

store_data <- function(package, name, data) {
  dir <- file.path(package, "inst", "extdata", name)
  dir.create(dir, recursive = TRUE)
  for (col in names(data)) {
    saveRDS(data[[col]], file.path(dir, paste0(col, ".RDS")))
  }
}

packagename <- "lazydata"
package.skeleton(packagename, "load_data")
store_data(packagename, "iris", iris)
After building and installing the package (you'll need to fix the documentation, e.g. delete it) you can do:
library(lazydata)
data <- load_data("iris", "Sepal.Width")
To load the Sepal.Width column of the iris data set.
Of course this is a very simple implementation of load_data: there is no error handling, it assumes all columns exist, it does not know which columns exist, and it does not know which data sets exist.
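A slightly more defensive variant could check which column files are actually present before reading. This is only a sketch; the helper names list_columns and load_data_safe are my own.
list_columns <- function(dataset, package = "lazydata") {
  dir <- system.file("extdata", dataset, package = package)
  sub("\\.RDS$", "", list.files(dir, pattern = "\\.RDS$"))
}

load_data_safe <- function(dataset, columns) {
  available <- list_columns(dataset)
  missing <- setdiff(columns, available)
  if (length(missing) > 0)
    stop("Unknown column(s): ", paste(missing, collapse = ", "))
  load_data(dataset, columns)
}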
I need to simulate some data, and I would like to have a function with everything built in, so I just need to run
simulate(scenario="xxx")
This function stores all the simulated datasets for the designated scenario in a list called simdat. Within the function, I want to rename that list "simdat.xxx" and save it out as "simdat_xxx.RData", so later on I can just load this file and have access to the list simdat.xxx. I need the list to have a name that refers specifically to which batch it is, because I am dealing with a lot of batches and I may want to load several at the same time.
Is there a way to, within a function, make a name and use it to name an object? I searched over and over again and could not find a way to do this. In desperation, I am resorting to doing this: within the function,
(a) write a temporary script using paste, which looks like this
temp.fn <- function(simdat){
simdat.xxx <- simdat
save(simdat.xxx,file="simdat_xxx.RData")
}
(b) use writeLines to write it out to a .R file
(c) source the file
(d) run it
This seriously seems like overkill to me. Is there a better way to do this?
Thanks much for your help!
Trang
Try this,
simulate <- function(scenario="xxx"){
simdat <- replicate(4, rnorm(10), simplify=FALSE)
data_name <- paste("simdat", scenario, sep=".")
assign(data_name, simdat)
save(list = data_name, file = paste0("simdat_", scenario, ".Rdata"))
}
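A quick usage sketch (the scenario name "batch1" is only an illustration): calling the function creates the file, and load() later restores the list under its scenario-specific name.
simulate(scenario = "batch1")  # saves simdat.batch1 to "simdat_batch1.Rdata"
load("simdat_batch1.Rdata")    # restores an object named simdat.batch1
str(simdat.batch1)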