How can I automatically load all data sets from my own package in R?

Assume I have two packages:
package1, which contains the data sets d1, d2, d3
package2, which should use the data sets from package1
The data sets in package1 are the ones I want to use for testing. I can access each of them via e.g. package1::d1. But how can I load all of them in an automated way?
Something like
ds <- data(package = "package1") # you can try e.g. "carData"
ds$results[1, 3] # gives the first entry
mydataset <- load(ds$results[1, 3]) # this does not work
Others will use both packages, so it should work for other users and on different platforms (Windows, Mac).
Any ideas?
ds$results[1, 3]
# Item
# "d1"
looks promising, but
data(ds$results[1, 3])
# Warning message:
# In data(ds$results[1, 3]) : data set ‘ds$results[1, 3]’ not found

As indicated by @dcarlson, you can extract the names of all datasets in your package and pass them to data() via its list argument. However, this only creates a promise for each dataset rather than loading the actual data.
my_package <- "datasets"
name_of_all_datasets <- data.frame(data(package = my_package)$results)$Item
data(list = name_of_all_datasets, package = my_package)
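If you need the actual data rather than promises, a follow-up sketch (it assumes every Item entry is a plain object name, with no "object (file)" style entries) is to collect the loaded objects into a named list with mget(), which forces the promises:
# collect everything data() loaded into a named list
datalist <- mget(name_of_all_datasets, envir = .GlobalEnv)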

I just realized there are two options:
Option 1: You want to load external files from inst/extdata:
pathExtData <- system.file("extdata", package = "myPackage")
allFilenames <- list.files(pathExtData, full.names = TRUE)
# e.g. in case of Excel files
datalist <- list()
for (i in 1:length(allFilenames)) {
  datalist[[i]] <- readxl::read_xlsx(path = allFilenames[i], sheet = "mySheet")
}
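A small addition (not in the original answer): naming the list after the files makes the datasets easier to look up later:
names(datalist) <- tools::file_path_sans_ext(basename(allFilenames))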
Option 2: You want to load .RData data sets from a package. Then you can use:
ds <- data(package = "myPackage")
datalist <- list()
for (i in 1:length(ds$results[, 3])) {
  eval(parse(text = paste0("datalist[[", i, "]] <- myPackage::", ds$results[i, 3])))
}
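If you would rather avoid eval(parse(...)), a roughly equivalent sketch (assuming the package uses LazyData, so that myPackage::d1 works for its datasets) uses getExportedValue(), which is what the :: operator calls under the hood:
ds <- data(package = "myPackage")
nms <- ds$results[, "Item"]
# fetch each dataset by name, equivalent to myPackage::<name>
datalist <- setNames(lapply(nms, function(nm) getExportedValue("myPackage", nm)), nms)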

Related

How to update data.table variable within a function?

Sorry if this is a duplicate. I am very new to data.table. Basically, I am able to get my code to work outside of functions, but when I pack the operations inside of a function, they break down. Ultimately, I had hoped to make the functions age.inds and m.inds internal functions in a package.
# required functions ------------------------------------------------------
library(data.table) # needed for data.table() and :=
# create object
create.obj <- function(n = 100){
  obj = list()
  obj$inds <- data.table(age = rep(0.1, n), m = NA)
  obj$m$model <- function(age, a){return(age^a)}
  obj$m$params <- list(a = 2)
  return(obj)
}
# calculate new 'age' of inds
age.inds <- function(obj){
  obj$inds[, age := age + 1]
  return(obj)
}
# calculate new 'm' of inds
m.inds <- function(obj){
  ARGS <- list()
  args.incl <- which(names(obj$m$params) %in% names(formals(obj$m$model)))
  ARGS <- c(ARGS, obj$m$params[args.incl])
  args.incl <- names(obj$inds)[names(obj$inds) %in% names(formals(obj$m$model))]
  ARGS <- c(ARGS, obj$inds[, ..args.incl]) # double dot '..' version
  # ARGS <- c(ARGS, inds[, args.incl, with = FALSE]) # 'with' version
  obj$inds[, m := do.call(obj$m$model, ARGS)]
  return(obj)
}
# advance object
adv.obj <- function(obj, times = 1){
  for(i in seq(times)){
    obj <- age.inds(obj)
    obj <- m.inds(obj)
  }
  return(obj)
}
# Example ----------------------------------------------------------------
# this doesn't work
obj <- create.obj(n = 10)
obj # so far so good
obj <- age.inds(obj)
obj # 'inds' gone
# would ultimately like to call adv.obj
obj <- adv.obj(obj, times = 5)
Also (as a side note), most of what I would like to do in my code is vectorized calculations (i.e. updating variables in obj$inds), so I don't even know if moving to data.table makes much sense for me (there are no by-group operations as of yet). I am dealing with large objects and wondered whether switching from data.frame objects would speed things up (I can get my code to work using data.frames).
Thanks
Update
OK, the issue with the printing has been solved thanks to @eddi. I am, however, unable to use these "inds" functions when they are located internally within a package (i.e. not exported). I made a small package (DTtester) that has this example in the help file for adv.obj:
obj <- create.obj(n=10)
obj <- adv.obj(obj, times = 5)
# Error in `:=`(age, new.age) :
# Check that is.data.table(DT) == TRUE. Otherwise, := and `:=`(...) are
# defined for use in j, once only and in particular ways. See help(":=").
Any idea why the functions would fail in this way?
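For what it's worth, the usual cause (an assumption here, since the DESCRIPTION of DTtester isn't shown) is that a package using := must declare data.table under Imports or Depends; if data.table is only suggested, the package additionally has to flag itself as data.table-aware, e.g.:
# in one of the package's R files, e.g. R/zzz.R:
.datatable.aware <- TRUE
Otherwise data.table's internal check concludes the call comes from a data.table-unaware package and falls back to data.frame semantics, which produces exactly this error.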

Aggregating all SpatialPolygonsDataFrame objects from list into one SpatialPolygonsDataFrame

I am not looking to edit topology, merely to aggregate all polygons into one sp object of class SpatialPolygonsDataFrame (spdf). There is only one polygon per spdf.
Data (Dropbox link to data, file size 1.1 KB; dput() is not appropriate in this instance):
list_of_spdf <- unlist(readRDS("data.Rds"))
I get the desired result with:
one_spdf <- rbind(list_of_spdf[1][[1]], list_of_spdf[2][[1]], list_of_spdf[3][[1]], makeUniqueIDs = TRUE)
# when plotting can see two polygons (third object is a repeat for sake of testing)
plot(one_spdf)
Having hundreds of objects (though only one polygon per spdf), I need to do the rbind programmatically. So I tried lapply:
list_of_spdf <- lapply(list_of_spdf, rbind, makeUniqueIDs = TRUE)
Obviously, this returns a list and therefore not what I'm looking for.
So I wrote a function:
rbindSPDF <- function(lst) {
  # Create empty spdf objects
  pol <- SpatialPolygonsDataFrame(SpatialPolygons(list()), data = data.frame())
  pols <- SpatialPolygonsDataFrame(SpatialPolygons(list()), data = data.frame())
  # loop for rbind
  for (i in 1:length(lst)) {
    pol[i] <- lst[i][[1]]
    if (length(pols) == 0) {
      pols <- pol[i]
    } else {
      pols <- rbind(pols, pol[i], makeUniqueIDs = TRUE)
    }
  }
  return(pols)
}
However, when using rbindSPDF:
single_spdf <- rbindSPDF(list_of_spdf)
I get:
Error in as.vector(data) :
no method for coercing this S4 class to a vector
Not sure what I'm doing wrong here.
Plus, I'm guessing I probably don't even need to use my own function.
Note: On top of many other packages, I'm using sp and rgdal for spatial data and would rather avoid using yet another one due to attaching/detaching time and masking.
To have a programmatical version of
one_spdf <- rbind(list_of_spdf[1][[1]],
                  list_of_spdf[2][[1]],
                  list_of_spdf[3][[1]],
                  ...
                  makeUniqueIDs = TRUE)
for a very long list in list_of_spdf, would something like the following work?
# generate list containing list_of_spdf[i][[1]]
list.df <- lapply(seq_along(list_of_spdf),
                  function(i){list_of_spdf[i][[1]]})
# apply rbind to the list
one_spdf2 <- do.call("rbind",
                     c(args = list.df, makeUniqueIDs = TRUE))
> all.equal(one_spdf, one_spdf2)
[1] TRUE
The results seem equivalent on my machine.
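Since list_of_spdf[i][[1]] is the same as list_of_spdf[[i]], the intermediate lapply() can arguably be dropped; a minimal variant (a sketch, untested against the linked data):
one_spdf3 <- do.call(rbind, c(list_of_spdf, makeUniqueIDs = TRUE))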

How to loop through input variables for a function in R?

I have a function, biovars(prec, tmin, tmax), with three variables as input. prec, tmin and tmax are multi-band GeoTiffs. However, there are several Tiffs stored in different directories that need to be stacked before they can be passed to the function:
stack("/path1/to/file/prec1.tif")
stack("/path2/to/file/tmin1.tif")
stack("/path3/to/file/tmax1.tif")
How can I loop through each directory, stack the first file and use the result of the loop as input for the function:
biovars(prec1,tmin1,tmax1)
biovars(prec2,tmin2,tmax2)
...
biovars(precn,tminn,tmaxn)
Is there an easy way to do this? I don't think it is possible with three for loops...
From your description, I think you can do something like this:
library(dismo)
# setwd("root of data files")
fprec <- list.files(".", pattern="prec", recursive=TRUE, full.names=TRUE)
ftmin <- gsub("prec", "tmin", fprec)
ftmax <- gsub("prec", "tmax", fprec)
fbio <- gsub("prec", "bio", fprec)
for (i in 1:length(fprec)) {
  prec <- brick(fprec[i])
  tmin <- brick(ftmin[i])
  tmax <- brick(ftmax[i])
  b <- biovars(prec, tmin, tmax, filename = fbio[i])
}
I always prefer to use *apply functions and avoid get. It also leads to more readable code (IMO).
prec_files = c("/path/prec1.tif", "/path/prec2.tif", ..., "/path/precn.tif")
tmin_files = c("/path/tmin1.tif", "/path/tmin2.tif", ..., "/path/tminn.tif")
tmax_files = c("/path/tmax1.tif", "/path/tmax2.tif", ..., "/path/tmaxn.tif")
prec_stacks <- lapply(prec_files, stack)
tmin_stacks <- lapply(tmin_files, stack)
tmax_stacks <- lapply(tmax_files, stack)
results <- mapply(FUN = biovars, prec_stacks, tmin_stacks, tmax_stacks)
If biovars returns a useful object, you can store the results of the mapply call. If you hit memory limitations from (I assume) loading all the rasters at once, you can wrap the loading and the biovars call into one function:
stack_and_biovar <- function(prec_path, tmin_path, tmax_path) {
  prec <- stack(prec_path)
  tmin <- stack(tmin_path)
  tmax <- stack(tmax_path)
  biovars(prec, tmin, tmax)
}
mapply(FUN = stack_and_biovar, prec_files, tmin_files, tmax_files)
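A side note (my preference, not part of the original answer): Map() is equivalent to mapply(..., SIMPLIFY = FALSE) and always returns a plain list, which avoids any unwanted simplification of the returned raster objects:
results <- Map(stack_and_biovar, prec_files, tmin_files, tmax_files)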
If I understand you correctly, there is a simple way to do it using either pmap or pwalk from the purrr package.
Basically, you list all the files with the given patterns, then use pmap/pwalk to loop through that list and pass the corresponding file names to the raster::stack and biovars functions.
library(tidyverse)
pathGIS <- "/path1/to/file/"
filePattern <- list("^prec*", "^tmax*", "^tmin*")
# List files and sort them numerically using mixedsort from gtools
fileList <- map(filePattern, ~ gtools::mixedsort(list.files(path = pathGIS, recursive = TRUE,
                                                            pattern = .x, full.names = TRUE)))
head(fileList)
# ..1, ..2, etc. correspond to the sublist order within fileList
pwalk(fileList, ~ biovars(raster::stack(..1), raster::stack(..2), raster::stack(..3)))
pmap(fileList, ~ biovars(raster::stack(..1), raster::stack(..2), raster::stack(..3)))
Simple example
fileList <- list(c("./GIS/prec1.tif", "./GIS/prec2.tif"),
                 c("./GIS/tmax1.tif", "./GIS/tmax2.tif"),
                 c("./GIS/tmin1.tif", "./GIS/tmin2.tif"))
dummy_function <- function(precFile, tmaxFile, tminFile) {
  cat("\n-----\n")
  print(precFile)
  print(tmaxFile)
  print(tminFile)
  cat("-----\n")
}
pwalk(fileList, ~ dummy_function(..1, ..2, ..3))
-----
[1] "./GIS/prec1.tif"
[1] "./GIS/tmax1.tif"
[1] "./GIS/tmin1.tif"
-----
-----
[1] "./GIS/prec2.tif"
[1] "./GIS/tmax2.tif"
[1] "./GIS/tmin2.tif"
-----

Running 'xlsx' processes in parallel, using the 'parallel' R package

I have a project where I need to process some data from an Excel file with R. I must use the 'xlsx' package because of some specific functions.
First, I wrote a script, which works as expected without errors.
options(java.parameters = "-Xmx4096m") #for extra memory
library(xlsx)
wb <- loadWorkbook(file = "my_excel.xlsx")
sheet1 <- getSheets(wb)[[1]]
rows <- getRows(sheet1)
make_df <- function(x) {
  cells <- getCells(rows[x])
  styles <- sapply(cells, getCellStyle)
  cellColor <- function(style) {
    fg <- style$getFillForegroundXSSFColor()
    rgb <- tryCatch(fg$getRgb(), error = function(e) NULL)
    rgb <- paste(rgb, collapse = "")
    return(rgb)
  }
  colors <- sapply(styles, cellColor)
  if (!any(colors == "ff0000")) {
    df[nrow(df) + 1, ] <- sapply(cells, getCellValue) # I define this 'df' somewhere in the code; this part could be improved
  }
}
df <- sapply(1 : length(rows), make_df)
In short, I am looking for the rows in Excel where there are no red-colored cells, as described here. The problem is that the Excel file is very big, and it takes a lot of time to process.
What I'd like to do is run the row checking in parallel, to be more efficient, so I added:
library(parallel)
cl = makeCluster(detectCores() - 1)
clusterEvalQ(cl = cl, c(library(xlsx))) # loading the package on the workers
clusterExport(cl = cl, c('rows')) # sharing the 'rows' variable with the workers
df <- parSapply(cl, 1 : length(rows), make_df)
And after running this, I get the following error:
Error in checkForRemoteErrors(val) :
7 nodes produced errors; first error: RcallMethod: attempt to call a method of a NULL object.
I tried the parallelization with another example, without using 'xlsx' functions, and it worked.
After some digging, I found this post which offered somewhat of an answer (more like a workaround), but I can't seem to be able to implement it.
Is there a clean way to do what I'm trying to do here?
If not, then what would be the best solution in this case?
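A common explanation (an assumption on my part, but consistent with how rJava works): objects such as rows are external pointers into the master's JVM, and they become NULL when serialized to the workers, hence "attempt to call a method of a NULL object". A sketch of a workaround is to let every worker open its own workbook instead of receiving the master's handles; it assumes make_df is adjusted to return the row values rather than append to a global df:
library(parallel)
cl <- makeCluster(detectCores() - 1)
clusterEvalQ(cl, {
  options(java.parameters = "-Xmx4096m") # must be set before the JVM starts
  library(xlsx)
  wb <- loadWorkbook(file = "my_excel.xlsx") # each worker opens its own handle
  sheet1 <- getSheets(wb)[[1]]
  rows <- getRows(sheet1)
})
clusterExport(cl, "make_df")
df <- parSapply(cl, 1:length(rows), make_df) # 'rows' here is the master's copy
stopCluster(cl)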

Checking the class of an object of a package in R

I want to check which objects in the "datasets" package are data frames, and for those that are, if the dimensions are 248 observations of 8 variables, print those objects.
This is the code I have tried, but I am unable to get the output:
library(datasets)
lsf.str("package:datasets")
listname <- as.list((ls("package:datasets")))
lst=c()
for(i in listname){
  lst <- is.data.frame(as.name(listname[[i]]))
}
Any help would be much appreciated.
What you basically need are the get function and the Filter function.
datasets <- as.environment("package:datasets")
dataset_names <- ls(envir = datasets)
filter_func <- function(x){
  y <- get(x, envir = datasets)
  is.data.frame(y) && all(dim(y) == c(248, 8))
}
Filter(filter_func, dataset_names)
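The same check can also be written without Filter(), e.g. as a vectorized sketch over all objects:
objs <- mget(dataset_names, envir = datasets)
keep <- vapply(objs, function(y) is.data.frame(y) && all(dim(y) == c(248, 8)), logical(1))
dataset_names[keep]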
