I'm calculating summary statistics for numerous data frames across multiple slices against a single response variable. I currently do this by passing a list of data frames to a function, but the function has to specify each column (i.e., each slice) individually. This already speeds up my process dramatically, but I think there must be an even more efficient way to do it with an apply()-family function. I'm hoping someone here can help me out.
Here's my code:
table1 <- function(x) {
dl2 <- list()
for (i in 1:length(x)) {
z <- x[[i]]
t.sliceA <- addmargins(table(list(z$sliceA, z$Growing)))
t.sliceB <- addmargins(table(list(z$sliceB, z$Growing)))
t.sliceC <- addmargins(table(list(z$sliceC, z$Growing)))
t.sliceD <- addmargins(table(list(z$sliceD, z$Growing)))
...
t.sliceAA <- addmargins(table(list(z$sliceAA, z$Growing)))
table.list <- list(t.sliceA, t.sliceB, t.sliceC, ... , t.sliceAA)
names(table.list) <- c("t.sliceA", "t.sliceB", ... , "t.sliceAA")
dl2[[i]] <- table.list
}
assign("dl",dl2, envir=.GlobalEnv)
}
# run the function
dl <- list(DF1, DF2, ..., DF.n) # list(), not c(): c() would flatten the data frames into their columns
table1(dl)
I assume there must be a more efficient way to do this via lapply() where I only have to specify the columns needed. Something where I would replace the lines
t.sliceA <- [blah]
...
t.sliceAA <- [blah]
with something like:
apply(z[,c(1:4,10:12,15)],2, function(x) addmargins(table(list(x,z$Growing))))
Any help that you can provide would be very helpful. Thanks!
Update: Reproducible example
@Chase
My apologies if this was done poorly. It's my first time using GitHub.
https://gist.github.com/3719220
and here's the code:
# load the example datasets
a.small <- dget("df1.txt")
l.small <- dget("df2.txt")
# working function that I'd like to simplify
table1 <- function(x) {
dl2 <- list()
for (i in 1:length(x)) {
z <- x[[i]]
t.tenure <- addmargins(table(list(z$Tenure.Group, z$Growing)))
t.optfile <- addmargins(table(list(z$opt.file, z$Growing)))
t.checking <- addmargins(table(list(z$checking, z$Growing)))
t.full <- addmargins(table(list(z$add.full, z$Growing)))
t.optdm <- addmargins(table(list(z$opt.dm, z$Growing)))
t.up <- addmargins(table(list(z$add.up, z$Growing)))
t.off <- addmargins(table(list(z$offmode, z$Growing)))
table.list <- list(t.tenure, t.optfile, t.checking, t.full, t.optdm, t.up, t.off)
names(table.list) <- c("t.tenure", "t.optfile", "t.checking", "t.full", "t.optdm", "t.up", "t.off")
dl2[[i]] <- table.list
}
assign("dl",dl2, envir=.GlobalEnv)
}
# create a DF list to send to the function
dl <- list(a.small, l.small)
table1(dl) # run the function
As far as I can see, this can easily be done with a couple of lapply statements.
If we define our function to create a table with margins as
tabulate_df <- function(DF, .what, .with) {
table.add.margins <- function(...) addmargins(table(...))
lapply(DF[.what], table.add.margins, DF[[.with]])
}
Then
# the columns we want to cross tabulate with `Growing`
table_names <- setdiff(names(df1), 'Growing')
df_list <- setNames(list(df1,df2), c('df1','df2'))
lapply(df_list, tabulate_df, .what = table_names, .with = 'Growing')
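For a quick check of the approach above, here is a minimal sketch on made-up data. The data frames and the column names sliceA and sliceB are invented for illustration (only Growing comes from the question), and tabulate_df is a slightly condensed version of the function above.

```r
# Toy data: sliceA and sliceB are hypothetical slice columns
set.seed(1)
make_df <- function(n) {
  data.frame(
    sliceA  = sample(c("x", "y"), n, replace = TRUE),
    sliceB  = sample(c("lo", "hi"), n, replace = TRUE),
    Growing = sample(c("no", "yes"), n, replace = TRUE)
  )
}
df_list <- list(df1 = make_df(20), df2 = make_df(30))

# Cross-tabulate each requested column against the response column
tabulate_df <- function(DF, .what, .with) {
  lapply(DF[.what], function(col) addmargins(table(col, DF[[.with]])))
}

res <- lapply(df_list, tabulate_df,
              .what = c("sliceA", "sliceB"), .with = "Growing")
res$df1$sliceA  # margin table for sliceA in the first data frame
```

Each element of res is a named list of margin tables, one per slice column, so nothing has to be written to the global environment with assign().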
I have to automate this sequence of functions:
for (i in c(15,17,20,24,25,26,27,28,29,45,50,52,55,60,62)) {
WBES_sf_angola_i <- subset(WBES_sf_angola, isic == i)
WBES_angola_i <- as_Spatial(WBES_sf_angola_i)
FDI_angola_i <- FDI_angola[FDI_angola$isic==i,]
dist_ao_i <- distm(WBES_angola_i,FDI_angola_i, fun = distGeo)/1000
rm(WBES_sf_angola_i,WBES_angola_i,FDI_angola_i)
}
As a result, I want a "dist_ao" for each i. The indexed values are to be found in the isic columns of the WBES_sf_angola and the FDI_angola datasets.
How can I embed the index in the various items' names?
EDIT:
I tried with following modification:
for (i in c(15,17,20,24,25,26,27,28,29,45,50,52,55,60,62)) {
WBES_sf_angola_i <- subset(WBES_sf_angola, isic == i)
WBES_angola_i <- as_Spatial(WBES_sf_angola_i)
FDI_angola_i <- FDI_angola[FDI_angola$isic==i,]
result_list <- list()
result_list[[paste0("dist_ao_", i)]] <- distm(WBES_angola_i,FDI_angola_i, fun = distGeo)/1000
rm(WBES_sf_angola_i,WBES_angola_i,FDI_angola_i)
}
and the output is just a list of length 1 that contains dist_ao_62. How do I avoid overwriting the earlier results?
Untested (due to missing MRE) but should work:
result_list <- list()
for (i in c(15,17,20,24,25,26,27,28,29,45,50,52,55,60,62)) {
result_list[[paste0("dist_ao_", i)]] <- distm(as_Spatial(subset(WBES_sf_angola, isic == i)) , FDI_angola[FDI_angola$isic==i,], fun = distGeo)/1000
}
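The same pattern can also be written with lapply(), which builds the whole list in one call instead of growing it inside a loop. This is only a sketch on toy data, since the original datasets are not available: the data frame toy and the division that stands in for the distm() call are assumptions made for illustration.

```r
# Toy stand-in for the real data; only the isic column mirrors the question
toy <- data.frame(isic = c(15, 15, 17, 20, 20, 20), value = 1:6)
codes <- c(15, 17, 20)

# One result per code; the division is a placeholder for distm(...)/1000
result_list <- lapply(codes, function(i) toy$value[toy$isic == i] / 1000)

# Embed the index in the element names
names(result_list) <- paste0("dist_ao_", codes)

result_list[["dist_ao_20"]]
```

Because the list is created once, outside any loop body, there is nothing to overwrite on later iterations.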
You could approach it this way. All results are stored in named lists, and the last line of the code shows how to combine one of those lists into a single dataframe. NOTE: since the question is not reproducible, I have mostly reused the code from inside your loop.
WBES_sf_angola_result <- list() # renamed, since WBES_sf_angola is the name of your dataset
WBES_angola_result <- list()
FDI_angola_result <- list() # renamed for the same reason: FDI_angola is your dataset
dist_ao <- list()
for (i in c(15,17,20,24,25,26,27,28,29,45,50,52,55,60,62)) {
  key <- paste0("i_", i) # element name shared by all four lists
  WBES_sf_angola_result[[key]] <- subset(WBES_sf_angola, isic == i)
  WBES_angola_result[[key]] <- as_Spatial(WBES_sf_angola_result[[key]])
  FDI_angola_result[[key]] <- FDI_angola[FDI_angola$isic==i,]
  dist_ao[[key]] <- distm(WBES_angola_result[[key]], FDI_angola_result[[key]], fun = distGeo)/1000
}
WBES_sf_angola_df <- do.call(rbind, WBES_sf_angola_result) # to get a dataframe
Your subset data can also be accessed by name through the list, e.g.
WBES_sf_angola_result[["i_15"]] # for the first item.
This code chunk creates 10 objects, one for each element of alpha.
alpha <- seq(.1,1,by=.1)
for (i in 1:length(alpha)){
assign(paste0("list_ts_ses_tune", i),NULL)
}
How do I store each function's output in the new list_ts_ses_tune1 ... NULL objects I've created? Each call returns a list, and it works if I set list_ts_ses_tune1 <- lapply ...
for (i in 1:length(alpha))
{
list_ts_ses_tune[i] <- lapply(list_ts, function(x)
forecast::forecast(ses(x,h=24,alpha=alpha[i])))
list_ts_ses_tune[i] <- lapply(list_ts_ses_tune[i], "[", c("mean"))
}
Maybe this is a better way to do this? I need each individual output in a list of values.
Edit:
for (i in 1:length(alpha))
{
list_ts_ses_tune[[i]] <- lapply(list_ts[1:(length(list_ts)/2)],
function(x)
forecast::forecast(ses(x,h=24,alpha=alpha[i])))
list_ts_ses_tune[[i]] <- lapply(list_ts_ses_tune[[i]], "[", c("mean"))
}
We can use mget to return all the objects into a list
mget(ls(pattern = '^list_ts_ses_tune\\d+'))
Also, the NULL list can be created more easily instead of 10 objects in the global environment
list_ts_ses_tune <- vector('list', length(alpha))
Now, we can just use the OP's code
for (i in 1:length(alpha))
{
list_ts_ses_tune[[i]] <- lapply(list_ts, function(x)
forecast::forecast(ses(x,h=24,alpha=alpha[i])))
}
If we want to create a single data.frame
for(i in seq_along(alpha)) {
list_ts_ses_tune[[i]] <- data.frame(Mean = do.call(rbind, lapply(list_ts, function(x)
forecast::forecast(ses(x,h=24,alpha=alpha[i]))$mean)))
}
You could simply accomplish everything by doing:
library(forecast)
list_ts_ses_tune <- Map(function(x)
lapply(alpha, function(y)forecast(ses(x,h=24,alpha=y))['mean']), list_ts)
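To show the resulting shape without needing the forecast package, here is a minimal base-R sketch of the "one list element per alpha" idea. The function ses_level is a hand-written stand-in, not forecast::ses(): it only returns the final smoothed level of a simple exponential smoother. The series in list_ts are made up.

```r
alpha <- seq(0.1, 1, by = 0.1)
list_ts <- list(a = c(1, 2, 3, 4), b = c(10, 8, 6))  # toy series

# Stand-in for ses(): final level of simple exponential smoothing,
# initialised at the first observation
ses_level <- function(x, alpha) {
  Reduce(function(level, obs) alpha * obs + (1 - alpha) * level, x[-1], x[1])
}

# Outer lapply over alpha, inner lapply over the series
list_ts_ses_tune <- lapply(alpha, function(a) lapply(list_ts, ses_level, alpha = a))

length(list_ts_ses_tune)  # one element per alpha value
list_ts_ses_tune[[1]]$a   # result for series "a" with the first alpha
```

With this nesting, list_ts_ses_tune[[i]] holds the results for alpha[i], so no numbered objects such as list_ts_ses_tune1 are needed in the global environment.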
I'm trying to go through all the rows from a table to apply some functions. Something like:
for(i in 1:nrow(df)){
df[i,2] <- somefunction1(df[i,1])
df[i,3] <- somefunction2(df[i,1])
df[i,3] <- somefunction3(df[i,1])
df[i,4] <- somefunction4(df[i,1])
}
This works, but it takes too long, so I was thinking of parallelizing it with the doParallel library. But when I try:
foreach(i = 1:nrow(df) ) %dopar% {
df[i,2] <- somefunction1(df[i,1])
df[i,3] <- somefunction2(df[i,1])
df[i,3] <- somefunction3(df[i,1])
df[i,4] <- somefunction4(df[i,1])}
It doesn't change the table, but it returns a list with the last function results.
I guess this may not be the best approach for what I want to achieve, so I am open to other ways of making this kind of code more efficient. An Excel table applies a formula to all cells at once without taking much time, so I think R should be able to do this somehow.
If parallelization is the way to go, I would appreciate some guidance on how to store the results in the table directly inside the loop, without executing each function separately and storing the results afterwards (that makes the code slow and makes it less reliable to match results to variables).
Thank you in advance.
That's a really inefficient way to perform a function on every row in the data frame. Do you have to use a for loop at all?
Here is some code that runs some simple functions on each row of the data frame, in parallel:
a <- sample(1:1000)
df <- as.data.frame(cbind(a))
somefunction1 <- function(x) {
x/1
}
somefunction2 <- function(x) {
x/2
}
somefunction3 <- function(x) {
x/3
}
somefunction4 <- function(x) {
x/4
}
for(i in 1:nrow(df)){
df[i,1] <- somefunction1(df[i,1])
df[i,2] <- somefunction2(df[i,1])
df[i,3] <- somefunction3(df[i,1])
df[i,4] <- somefunction4(df[i,1])
}
library(foreach)
library(doMC)
library(abind)
registerDoMC(detectCores()-1)
acomb <- function(...) abind(..., along=1)
par_df <- foreach(i=icount(nrow(df)), .combine='acomb', .multicombine=TRUE) %dopar% { # %dopar% must stay on the same line as foreach(), otherwise R fails to parse it
df[i,1] <- somefunction1(df[i,1])
df[i,2] <- somefunction2(df[i,1])
df[i,3] <- somefunction3(df[i,1])
df[i,4] <- somefunction4(df[i,1])
df[i,]
}
par_df_2 <- data.frame(par_df, stringsAsFactors=FALSE)
This is not an issue with parallelization.
Your code between { } behaves like a function would. See the following example
myfun <- function() {
1
2
3
}
myfun()
# 3
There is an implicit return of the last evaluated value, and the other evaluated values are lost. The same is occurring with your "function"
foreach(i = 1:nrow(df) ) %dopar% {
df[i,2] <- somefunction1(df[i,1])
df[i,3] <- somefunction2(df[i,1])
df[i,3] <- somefunction3(df[i,1])
df[i,4] <- somefunction4(df[i,1])}
You can modify it as
foreach(i = 1:nrow(df) ) %dopar% {
c(somefunction1(df[i,1]), somefunction2(df[i,1]), somefunction3(df[i,1]), somefunction4(df[i,1])) }
to return a vector of the evaluated values
Second, and arguably more important, you should try to change your functions to accept vectors and return vectors. So instead of
df[i,2] <- somefunction1(df[i,1]) # single element in vector
Try
df[,2] <- somefunction1(df[,1]) # entire vector
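A minimal sketch of the vectorized version, assuming the functions accept a whole vector. The function bodies below are placeholders and the column names f1, f2 and parity are made up. For a function that only handles one value at a time, vapply() is one possible fallback.

```r
df <- data.frame(a = 1:1000)

# Placeholder functions, assumed to work on whole vectors
somefunction1 <- function(x) x / 1
somefunction2 <- function(x) x / 2

# One call per column instead of one call per cell
df$f1 <- somefunction1(df$a)
df$f2 <- somefunction2(df$a)

# Hypothetical scalar-only function: if() needs a single value
scalar_only <- function(x) if (x %% 2 == 0) "even" else "odd"
df$parity <- vapply(df$a, scalar_only, character(1))

head(df)
```

Each assignment fills an entire column at once, which is usually far faster than a row-by-row loop and often removes the need for parallelization.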
I am trying to extend the "matrix" class and, in turn, override the default behaviour of "[". Code examples below:
annMatrix <- function(mat=NULL, rowAnn=NULL, colAnn=NULL) {
if(is.null(mat)) mat <- matrix(nrow=0, ncol=0)
mat <- as.matrix(mat)
if(is.null(rowAnn)) rowAnn <- data.frame(row.names=seq_len(nrow(mat)))
if(is.null(colAnn)) colAnn <- data.frame(row.names=seq_len(ncol(mat)))
rowAnn <- data.frame(rowAnn, stringsAsFactors=FALSE)
colAnn <- data.frame(colAnn, stringsAsFactors=FALSE)
stopifnot(nrow(mat)==nrow(rowAnn) & ncol(mat)==nrow(colAnn))
attr(mat, "colAnn") <- colAnn
attr(mat, "rowAnn") <- rowAnn
class(mat) <- append(class(mat), "annMatrix")
mat
}
`[.annMatrix` <- function(annMat, rowExpr=NULL, colExpr=NULL) {
stopifnot(is.valid.annMatrix(annMat))
rowExpr <- eval(substitute(list(rowExpr)), attr(annMat, "rowAnn"), parent.frame())
colExpr <- eval(substitute(list(colExpr)), attr(annMat, "colAnn"), parent.frame())
indsR <- unlist(rowExpr)
indsC <- unlist(colExpr)
if(is.null(indsR)) indsR <- seq_len(nrow(annMat))
if(is.null(indsC)) indsC <- seq_len(ncol(annMat))
attr(annMat, "rowAnn") <- attr(annMat, "rowAnn")[indsR,,drop=FALSE]
attr(annMat, "colAnn") <- attr(annMat, "colAnn")[indsC,,drop=FALSE]
annMat <- unclass(annMat)
annMat <- annMat[indsR,indsC,drop=FALSE]
class(annMat) <- append(class(annMat), "annMatrix")
annMat
}
The basic idea is to make the matrix preserve its specific attributes after subsetting.
However I am running into a problem:
How to write "[" function in such a way that it behaves differently when called with and without a comma:
annMat[i]
annMat[i,]
as the default "[" for matrices seems to do.
I was thinking to set second argument to some value by default, but the value will not change because of an added comma.
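One possible approach is to count the arguments with nargs(), which is also what base R's `[.data.frame` does: an empty argument still counts, so x[i] is seen as two arguments and x[i, ] as three. The sketch below uses a hypothetical class demoMat and only reports which form was used. It is an illustration of the detection step, not a finished annMatrix method.

```r
# Hypothetical class "demoMat", used only to show the dispatch
`[.demoMat` <- function(x, i, j, drop = FALSE) {
  # nargs() counts supplied arguments, including empty ones such as the
  # missing j in x[i, ]; subtract x itself and an explicit drop= if given
  n_index <- nargs() - 1L - !missing(drop)
  if (n_index == 1L) "vector-style" else "matrix-style"
}

m <- matrix(1:6, nrow = 2)
class(m) <- c("demoMat", class(m))

m[1]    # "vector-style"
m[1, ]  # "matrix-style"
```

Inside a real method, the two branches would call the vector-style or the matrix-style subsetting on the unclassed matrix and then restore the annotations.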
I have the following function :
ExampleFunction <- function(ListNumber1, ListNumber2)
{
OutputData <- list()
OutputData[[1]] <- rbind.fill(ListNumber1[[1]], ListNumber2[[1]])
OutputData[[2]] <- rbind.fill(ListNumber1[[2]], ListNumber2[[2]])
return(OutputData)
}
I want to improve this function by allowing a variable number of arguments (i.e. lists in my example). Here is an attempt to do this, but I don't see how to fill in the arguments of rbind.fill().
ExampleFunctionUpgrade <- function(...)
{
Arguments <- list(...)
OutputData <- list()
VarNames <- paste0("List", seq_along(Arguments))
for (i in 1:length(Arguments))
{
assign(VarNames[i], Arguments[[i]])
}
OutputData <- rbind.fill(???)
return(OutputData)
}
I would iterate over the list positions and, for each position, use an lapply call to collect the elements that are to be bound together.
ExampleFunctionUpgrade <- function(...)
{
Arguments <- list(...)
OutputData <- list()
for(i in 1:length(Arguments[[1]])) {
OutputData[[i]] <- rbind.fill(lapply(Arguments, '[[', i))
}
return(OutputData)
}
If you don't like 'for loops' you can use two lapply calls.
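A sketch of the two-lapply version follows. To keep it runnable with base R only, the binding step is a parameter named bind (an assumption made for this example) that defaults to do.call(rbind, ...). Passing plyr::rbind.fill as bind should give the behaviour from the question when the columns differ.

```r
ExampleFunctionLapply <- function(..., bind = function(dfs) do.call(rbind, dfs)) {
  Arguments <- list(...)
  # Outer lapply: one pass per position within the input lists
  lapply(seq_along(Arguments[[1]]), function(i) {
    # Inner lapply: pull the i-th element out of every input list
    bind(lapply(Arguments, `[[`, i))
  })
}

# Toy input lists with matching columns at each position
l1 <- list(data.frame(x = 1:2), data.frame(y = 3))
l2 <- list(data.frame(x = 3:4), data.frame(y = 4:5))

out <- ExampleFunctionLapply(l1, l2)
out[[1]]  # rows of l1[[1]] followed by rows of l2[[1]]
```

This assumes that every input list has the same length as the first one.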