Probe global variables to call inside a function - R

I want to access variables in the .GlobalEnv from inside a function; basically, I want to concatenate some number of data frames into a matrix.
Here is some dummy code:
Alpha <- data.frame(lon=124.9167,lat=1.53333)
Alpha_2 <- data.frame(lon=3.13333, lat=42.48333)
Alpha_3 <- data.frame(lon=-91.50667, lat=27.78333)
myfunc <- function(x){
  vars <- ls(.GlobalEnv, pattern=x)
  mat <- as.matrix(rbind(vars[1], vars[2], vars[3]))
  return(mat)
}
When calling myfunc('Alpha'), I would like the same thing to be returned as when you run:
as.matrix(rbind(Alpha, Alpha_2, Alpha_3))
        lon      lat
1 124.91670  1.53333
2   3.13333 42.48333
3 -91.50667 27.78333
Any pointers would be appreciated, thanks!

You can use get to retrieve variables by name (or mget to retrieve several at once), then do.call with rbind to bind them together.
myfunc <- function(x){
  vars <- ls(.GlobalEnv, pattern=x)
  df <- do.call(rbind, mget(vars, .GlobalEnv)) # courtesy @Roland
  return(df)
}
myfunc("Alpha")
#               lon      lat
# Alpha   124.91670  1.53333
# Alpha_2   3.13333 42.48333
# Alpha_3 -91.50667 27.78333
Note that, in practice, you probably want to check that the variables matching the pattern actually are what you think they are, but this gives you the rough tools you need; a hedged sketch of such a check follows the old version below.
Old version (2nd line of func):
df <- do.call(rbind, lapply(vars, get, envir=.GlobalEnv))
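As a sketch of that check (the lon/lat column names are assumptions taken from the example data, not part of the original answer), you can anchor the pattern and keep only data frames with the expected columns:
myfunc_safe <- function(x){
  vars <- ls(.GlobalEnv, pattern = paste0("^", x))  # anchor the pattern to avoid accidental matches
  objs <- mget(vars, envir = .GlobalEnv)
  # keep only data frames that carry the lon/lat columns from the example
  objs <- Filter(function(o) is.data.frame(o) && all(c("lon", "lat") %in% names(o)), objs)
  do.call(rbind, objs)
}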

Related

lapply function with arguments I want to pick from a dataframe with a loop

I'm still very new to R and haven't found any answer so far. Sorry to finally ask.
Edit, with a quick example:
I want to compute a multidimensional development index based on South Africa Data.
My list is composed of individual information for each year, so basically df1 is about year 1 and df2 is about year 2.
df1 <- data.frame(var1=c(1,1,1), var2=c(0,0,1), var3=c(1,1,0))
df2 <- data.frame(var1=c(1,0,1), var2=c(1,0,1), var3=c(0,1,0))
mylist <- list(df1, df2)
You can find here a very simplified working index function:
myindex <- function(x, dimX, dimY){
  econ_i <- (x[dimX] + x[dimY])
  return((1/length(econ_i)) * sum(econ_i))
}
myindex(df1, "var2", "var3")
Then I have my dataframe of the variable sets I want to use for my index:
mydf <- data.frame(set1=c("var1", "var2"), set2=c("var2", "var3"))
I'm using a function to get arguments from that dataframe, such as:
pick_values <- function(x){
  vect <- c()
  for(i in x){
    vect <- c(vect, i)
  }
  return(vect)
}
I'd like to set up an lapply loop that applies my function over my list, for all sets of arguments in my dataframe. In other words, I'd like to compute my index for both years, with every set of variables I can use. (End of edit.)
I've tried many unsuccessful things so far. For instance:
lapply(mylist, myindex, lapply(mydf,pick_values))
Thanks a lot for your help!
Okay, I don't like your mydf name, nor that it has factors, so I rename it args (it holds function arguments) and set stringsAsFactors = F:
args <- data.frame(set1=c("var1", "var2"), set2=c("var2", "var3"), stringsAsFactors = F)
We'll also write a wrapper for myindex that accepts a vector of arguments instead of dimX and dimY:
myindex2 <- function(x, d) {
  myindex(x, d[1], d[2])
}
Then we can nest lapply like this:
lapply(mylist, function(m) lapply(args, myindex2, x = m))
# $df1
# $df1$set1
# [1] 4
#
# $df1$set2
# [1] 3
#
#
# $df2
# $df2$set1
# [1] 4
#
# $df2$set2
# [1] 3
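If you would rather have a matrix than a nested list, the same logic collapses with sapply (a small variation on the above, not part of the original answer):
sapply(mylist, function(m) sapply(args, myindex2, x = m))
#      [,1] [,2]
# set1    4    4
# set2    3    3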

R: using apply over two data.frames

I want to use apply instead of a for-loop. The problem is that my for-loop uses two data.frames as input. For example:
x <- data.frame(col1=c(1,NA,3,NA), col2=c(9,NA,11,12))
y <- data.frame(col1=c(1,2,3,4), col2=c(5,6,7,8))
output <- rep(NA, 2)
for(i in 1:2){
  output[i] <- sum(is.na(x[,i])) + sum(y[,i])
}
The result here is, correctly, c(12, 27).
But if I try a function and apply:
test <- function(vector1,vector2) sum(is.na(vector1))+sum(vector2)
apply(x,y,MARGIN=2,FUN=test)
With apply the result is c(38,37).
How can I fix this?
You can use mapply instead of apply; it applies test to the corresponding columns of x and y in parallel. (Your apply call misbehaves because MARGIN and FUN are matched by name, so the positional y falls into apply's ... and is passed whole to test as vector2: each column of x is then paired with all of y, giving sum(y) = 36 plus the NA counts 2 and 1, i.e. c(38, 37).)
x <- data.frame(col1=c(1,NA,3,NA), col2=c(9,NA,11,12))
y <- data.frame(col1=c(1,2,3,4), col2=c(5,6,7,8))
test <- function(vector1,vector2) sum(is.na(vector1))+sum(vector2)
mapply(test, x, y)
# col1 col2
# 12 27
?mapply
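For the record, Map is the list-returning wrapper around mapply, in case you want the results as a list rather than a simplified vector (a minor variation, not from the original answer):
Map(test, x, y)
# $col1
# [1] 12
#
# $col2
# [1] 27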

Create empty object in R

I'm trying to create an empty numeric object like this:
corr <- cor()
to use it later on in a loop,
but it keeps returning this error:
Error in is.data.frame(x) : argument "x" is missing, with no default.
Here is my full script:
EVI <- "D:\\Modis_EVI\\Original\\EVI_Stack_single5000.tif"
y.EVI <- brick(EVI)
m.EVI.cropped <- as.matrix(y.EVI)
time <- 1:nlayers(y.EVI)
corr <- cor()
inf2NA <- function(x) { x[is.infinite(x)] <- NA; x }
for (i in 1:nrow(m.EVI.cropped)){
  EVI.m <- m.EVI.cropped[i,]
  time <- 1:nlayers(y.EVI)
  Corr[i] <- cor(EVI.m, time, method="pearson", use="pairwise.complete.obs")
}
Any advice please?
Since you are asking for advice:
It is very likely that you don't need to do this, since you can probably use (i) a vectorized function or (ii) an lapply loop that pre-allocates the return object for you. If you insist on using a for loop, set it up properly. This means you should pre-allocate, which you can do, e.g., with corr <- numeric(n), where n is the number of iterations. Appending to a vector inside a loop is extremely slow. A sketch of the pre-allocated loop follows.
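A minimal sketch of that advice applied to the script above (assuming, as the question implies, that the raster package provides brick() and nlayers()):
library(raster)
y.EVI <- brick("D:\\Modis_EVI\\Original\\EVI_Stack_single5000.tif")
m.EVI.cropped <- as.matrix(y.EVI)
time <- 1:nlayers(y.EVI)
corr <- numeric(nrow(m.EVI.cropped))  # pre-allocate one slot per row
for (i in seq_len(nrow(m.EVI.cropped))){
  corr[i] <- cor(m.EVI.cropped[i,], time, method="pearson", use="pairwise.complete.obs")
}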
We can create empty objects with numeric(0), logical(0), character(0) etc.
For example
num_vec <- numeric(0)
creates an empty numeric vector that can be filled up later on:
num_vec[1] <- 2
num_vec
# [1] 2
num_vec[2] <- 1
num_vec
# [1] 2 1

For loops referring to differently numbered objects

In R I have 100 phylo objects called Newick1, Newick2, Newick3, etc. I want to do pairwise comparisons between the trees (e.g. all.equal.phylo(Newick1, Newick2)) but am having difficulty figuring out how to do this efficiently, since each object has a different name.
I think something like the for loop below will work, but how do I designate a different object for each iteration of the loop? For obvious reasons, the [i] and [j] I put in the code below don't work, but I don't know what to replace them with.
Thank you very much!
for (i in 1:99) {
  for (j in (i+1):100) {
    output[i,j] <- all.equal.phylo(Newick[i], Newick[j])
  }
}
Try mget() to reference multiple objects by name:
x1 <- x2 <- x3 <- 1
mget(paste0("x", 1:3))
# $x1
# [1] 1
#
# $x2
# [1] 1
#
# $x3
# [1] 1
You can try a variation on the following:
# make a two column dataframe
# and filter the identical values
df <- expand.grid(1:100,1:100)
names(df) <- c('i','j')
df <- df[!df$i == df$j,]
# example function that takes two parameters
addtwo <- function(i,j){i + j}
# apply that function across rows of the dataframe
results <- mapply(addtwo, df$i, df$j)
# using the same logic,
# your function would look something like this
getdistance <- function(i, j, newicks=NEWICKS) {
  all.equal.phylo(newicks[[i]], newicks[[j]])
}
# and apply it like this
results <- mapply(getdistance, df$i, df$j)
Key concepts:
expand.grid()
mapply()
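Putting the two answers together, a hedged end-to-end sketch (it assumes the trees sit in the global environment under the names from the question, and that the ape package supplies all.equal.phylo):
library(ape)  # for all.equal.phylo
# collect the 100 objects into one named list
newicks <- mget(paste0("Newick", 1:100), envir = .GlobalEnv)
# one row per unordered pair, each pair once
df <- expand.grid(i = 1:100, j = 1:100)
df <- df[df$i < df$j,]
results <- mapply(function(i, j) isTRUE(all.equal.phylo(newicks[[i]], newicks[[j]])), df$i, df$j)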

flatten record-based list/object into dataframe

Edit: this question is outdated. The jsonlite package flattens automatically.
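A minimal sketch of that (assuming the pastebin URL below is still live and the top-level element is still called "data"):
library(jsonlite)
# fromJSON simplifies the record list to a data frame;
# flatten = TRUE unnests the nested "l" object into l.lo, l.la, ... columns
out <- fromJSON("http://pastebin.com/raw.php?i=yz6z9t25", flatten = TRUE)
mydf <- out$data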
I am dealing with online data streams that have record-based encoding, usually in JSON. The structure of the objects (i.e. the names in the JSON) is known from the API documentation; however, values are mostly optional and not present in every record. Lists can contain nested lists, and the structure is sometimes quite deep. Here is a fairly simple example of some GPS data: http://pastebin.com/raw.php?i=yz6z9t25. Note that in the lower rows the "l" object is missing due to no GPS signal.
I am looking for an elegant way to flatten these objects into a dataframe. I am currently using something like this:
library(RJSONIO)
library(plyr)
obj <- fromJSON("http://pastebin.com/raw.php?i=yz6z9t25", simplifyWithNames=FALSE, simplify=FALSE)
flatdata <- lapply(obj$data, as.data.frame)
mydf <- rbind.fill(flatdata)
This does the job; however, it is slow and a bit error-prone. A problem with this approach is that I am not using my knowledge of the structure (the object names) of the data; instead, the structure is inferred from the data. This leads to problems when a certain property happens to be absent from every record: in that case it will not appear in the dataframe at all, instead of appearing as a column of NA values. This can lead to issues downstream. For example, I need to process the location timestamp:
mydf$l.t <- structure(mydf$l.t/1000, class="POSIXct")
However, this will result in an error for a dataset in which the l$t object isn't there. Furthermore, both as.data.frame and rbind.fill make things quite slow, and the example dataset is a relatively small one. Any suggestions for a better implementation? A robust solution would always yield a dataframe with the same columns in the same order, where only the number of rows varies.
Edit: below a dataset with more meta data. It is larger in size and nested more deeply:
obj <- fromJSON("http://www.stat.ucla.edu/~jeroen/files/output.json", simplifyWithNames=FALSE, simplify=FALSE)
Here's a solution that lets you take advantage of your prior knowledge of the data field names and classes. By avoiding the repeated calls to as.data.frame and the single call to plyr's rbind.fill() (both time-intensive), it runs about 60 times faster on your example data.
cols <- c("id", "ls", "ts", "l.lo","l.tz", "l.t", "l.ac", "l.la", "l.pr", "m")
numcols <- c("l.lo", "l.t", "l.ac", "l.la")
## Flatten each top-level list element, converting it to a character vector.
x <- lapply(obj$data, unlist)
## Extract fields that might be present in each record (returning NA if absent).
y <- sapply(x, function(X) X[cols])
## Convert to a data.frame with columns of desired classes.
z <- as.data.frame(t(y), stringsAsFactors=FALSE)
z[numcols] <- lapply(numcols, function(X) as.numeric(as.character(z[[X]])))
Edit: To confirm that my approach gives results identical to those in the original question, I ran the following test. (Notice that in both cases I set stringsAsFactors=FALSE to avoid meaningless differences in orderings of the factor levels.)
flatdata <- lapply(obj$data, as.data.frame, stringsAsFactors=FALSE)
mydf <- rbind.fill(flatdata)
identical(z, mydf)
# [1] TRUE
Further Edit:
Just for the record, here's an alternate version of the above that, in addition, automatically:
finds names of all data fields
determines their class/type
coerces the columns of the final data.frame to the correct class
dat <- obj$data
## Find the names and classes of all fields
fields <- unlist(lapply(dat, function(X) rapply(X, class, how="unlist")))
fields <- fields[unique(names(fields))]
cols <- names(fields)
## Flatten each top-level list element, converting it to a character vector.
x <- lapply(dat, unlist)
## Extract fields that might be present in each record (returning NA if absent).
y <- sapply(x, function(X) X[cols])
## Convert to a data.frame with columns of desired classes.
z <- as.data.frame(t(y), stringsAsFactors=FALSE)
## Coerce columns of z (all currently character) back to their original type
z[] <- lapply(seq_along(fields), function(i) as(z[[cols[i]]], fields[i]))
Here's an attempt that tries to make no assumptions about the types of the data. It's a bit slower than @JoshOBrien's, but faster than the OP's original solution.
Joshua <- function(x) {
  un <- lapply(x, unlist, recursive=FALSE)
  ns <- unique(unlist(lapply(un, names)))
  un <- lapply(un, function(x) {
    y <- as.list(x)[ns]
    names(y) <- ns
    lapply(y, function(z) if(is.null(z)) NA else z)})
  s <- lapply(ns, function(x) sapply(un, "[[", x))
  names(s) <- ns
  data.frame(s, stringsAsFactors=FALSE)
}
Josh <- function(x) {
  cols <- c("id", "ls", "ts", "l.lo", "l.tz", "l.t", "l.ac", "l.la", "l.pr", "m")
  numcols <- c("l.lo", "l.t", "l.ac", "l.la")
  ## Flatten each top-level list element, converting it to a character vector.
  x <- lapply(x, unlist)
  ## Extract fields that might be present in each record (returning NA if absent).
  y <- sapply(x, function(X) X[cols])
  ## Convert to a data.frame with columns of desired classes.
  z <- as.data.frame(t(y))
  z[numcols] <- lapply(numcols, function(X) as.numeric(as.character(z[[X]])))
  z
}
Jeroen <- function(x) {
  flatdata <- lapply(x, as.data.frame)
  rbind.fill(flatdata)
}
library(rbenchmark)
benchmark(Josh=Josh(obj$data), Joshua=Joshua(obj$data),
          Jeroen=Jeroen(obj$data), replications=5, order="relative")
# test replications elapsed relative user.self sys.self user.child sys.child
# 1 Josh 5 0.24 1.000000 0.24 0 NA NA
# 2 Joshua 5 0.31 1.291667 0.32 0 NA NA
# 3 Jeroen 5 12.97 54.041667 12.87 0 NA NA
Just for clarity, I am adding a combination of Josh's and Joshua's solutions, which is the best I have come up with so far.
flatlist <- function(mylist){
  lapply(rapply(mylist, enquote, how="unlist"), eval)
}

records2df <- function(recordlist, columns) {
  if(length(recordlist)==0 && !missing(columns)){
    return(as.data.frame(matrix(ncol=length(columns), nrow=0, dimnames=list(NULL, columns))))
  }
  un <- lapply(recordlist, flatlist)
  if(!missing(columns)){
    ns <- columns
  } else {
    ns <- unique(unlist(lapply(un, names)))
  }
  un <- lapply(un, function(x) {
    y <- as.list(x)[ns]
    names(y) <- ns
    lapply(y, function(z) if(is.null(z)) NA else z)})
  s <- lapply(ns, function(x) sapply(un, "[[", x))
  names(s) <- ns
  data.frame(s, stringsAsFactors=FALSE)
}
The function is reasonably fast, although I still think it should be possible to speed it up further:
obj <- fromJSON("http://www.stat.ucla.edu/~jeroen/files/output.json", simplifyWithNames=FALSE, simplify=FALSE)
flatdata <- records2df(obj$data)
It also allows you to 'force' certain columns, although it doesn't result in too much of a speedup:
flatdata <- records2df(obj$data, columns=c("m", "doesnotexist"))
