lapply on dataframe list using different column index - r

I'm trying to do feature selection on a list of data frames using the caret package. The data frames differ, but the last 6 columns are the same in each. When I apply the model to a single data frame it works fine:
# For a single dataframe
mx.chem # the name of my single dataframe
#define the control
mx.control <- rfeControl(functions=rfFuncs, method = "cv", number = 10)
# run the rfe
mx.results <- rfe(mx.chem[,1:22], mx.chem[,23], sizes = c(1:22), rfeControl = mx.control)
print(mx.results)
My problem is when I try to use lapply on a list of data frames. The code I have so far is:
require(mlbench)
require(caret)
mylist # is a df list containing 3 df
for (i in 1:3) {
my.control <- rfeControl(functions=rfFuncs, method = "cv", number = 10) # define the control
longdata <- length(i)-6
idxindustry <- longdata +1
my.results <- lapply(mylist, function(x) rfe ( x[,1:longdata], x[,idxindustry], data = x, sizes =c(1:longdata), rfeControl = my.control))
}
I'm not sure if I'm using the column index properly. Does anyone have an idea how to fix my code? Thanks.

Here are two possible ways:
#Using lapply
mx.control <- rfeControl(functions=rfFuncs, method = "cv", number = 10)
rfe.lst <- lapply(mylist,
function(x) {
longdata <- ncol(x)-6
rfe ( x[,1:longdata], x[,longdata + 1],
sizes =c(1:longdata),
rfeControl = mx.control)
})
#For loop
mx.control <- rfeControl(functions=rfFuncs, method = "cv", number = 10)
rfe.lst <- vector("list", 3)
for(i in 1:3) {
longdata <- ncol(mylist[[i]])-6
rfe.lst[[i]] <- rfe(mylist[[i]][,1:longdata], mylist[[i]][,longdata + 1],
sizes=c(1:longdata),
rfeControl=mx.control)
}

Your code doesn't do what you think. length(i) will always be 1, because i is your loop index and takes the values 1 to 3. You mean to do:
length(mylist[[i]])
Note the double brackets. That's how you select the element from the list, in this case the data frame. If you use single brackets, you get back a list with the elements you want.
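The difference between the loop index, single brackets, and double brackets can be checked on a toy list. This is a minimal sketch; the data frames below are made up for illustration and only their column counts matter.

```r
# Toy list of three data frames with different column counts (hypothetical data)
mylist <- list(
  a = data.frame(matrix(0, nrow = 2, ncol = 8)),
  b = data.frame(matrix(0, nrow = 2, ncol = 10)),
  c = data.frame(matrix(0, nrow = 2, ncol = 12))
)

i <- 2
len_index   <- length(i)            # length of the loop index itself
single_item <- mylist[i]            # single brackets: a list of length 1
double_item <- mylist[[i]]          # double brackets: the data frame itself
n_columns   <- length(mylist[[i]])  # number of columns, same as ncol()
```

Here length(i) says nothing about the data, while length(mylist[[i]]) is the column count of the second data frame.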
But that's still not what you aim to achieve. If you changed only that line in your code, you would have 2 loops:
an outer loop that creates longdata and idxindustry based on a single data frame each time.
an inner lapply loop that uses the values for longdata and idxindustry on all three dataframes.
Remember that lapply takes each element in the list and passes it as the first argument to the function you specify. So you can do this in a single lapply like this:
my.control <- rfeControl(functions=rfFuncs, method = "cv", number = 10)
my.results <- lapply(mylist, function(x){
# x becomes one of the data frames in the list mylist here, so you can
# treat it like a data frame in the code below
longdata <- length(x) - 6
idxindustry <- longdata +1
rfe( x[,1:longdata], x[,idxindustry], data = x,
sizes =c(1:longdata), rfeControl = my.control)
})
And then you run rfe with longdata and idxindustry based on the data frame at hand. Note I put the call to rfeControl outside the lapply loop for performance.
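To see that the column indices really are computed per data frame, here is a small sketch that needs no caret. The helper make_df and the column names p1, t1, and so on are assumptions made up for this example, and the anonymous function only reports which columns would be passed on, standing in for the rfe call.

```r
# Hypothetical helper: builds a data frame with n_pred predictor columns
# followed by 6 trailing columns that every data frame shares.
make_df <- function(n_pred, n_rows = 5) {
  preds <- as.data.frame(matrix(rnorm(n_rows * n_pred), nrow = n_rows))
  names(preds) <- paste0("p", seq_len(n_pred))
  tail6 <- as.data.frame(matrix(rnorm(n_rows * 6), nrow = n_rows))
  names(tail6) <- paste0("t", 1:6)
  cbind(preds, tail6)
}
mylist <- list(make_df(3), make_df(5), make_df(8))

# Stand-in for the rfe() call: only reports which columns would be used.
split_info <- lapply(mylist, function(x) {
  longdata <- ncol(x) - 6          # number of predictor columns in this df
  idxindustry <- longdata + 1      # first of the 6 shared columns
  list(n_predictors = ncol(x[, 1:longdata, drop = FALSE]),
       outcome_name = names(x)[idxindustry])
})
```

Each element of split_info reflects the width of its own data frame, which is what the single lapply achieves and the nested loop did not.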

Related

Optimize a large number of variable operations and variable ordering

I would like some suggestions on speeding up the code below. The flow of the code is fairly straightforward:
create a vector of unique combinations (m=3, 4, or 5) from df variable names
transform the vector of combinations into a list of formulas
break up the list of formulas to process into chunks to get around memory limitations
iterate through each chunk performing the formula operation and subset the df to the user specified number of rows (topn)
The full reprex is below, including the different attempts using purrr::map and base lapply. I also attempted to use := from data.table following the link below, but I was unable to figure out how to transform the list of formulas into formulas that could be fed to quote(`:=`(...)):
Apply a list of formulas to R data.table
It appears to me that one of the bottlenecks in my code is the variable operation step. A previous bottleneck was the ordering step, which I've managed to speed up quite a bit using the kit library and the link below, but any suggestions that could speed up the entire flow are appreciated. The example I'm posting here uses combn of 4, as that is typically what I use in my workflow, but I would also like to be able to go up to combn of 5 if the speed is reasonable.
Fastest way to find second (third...) highest/lowest value in vector or column
library(purrr)
library(stringr)
library(kit)
df <- data.frame(matrix(data = rnorm(80000*90,200,500), nrow = 80000, ncol = 90))
df$value <- rnorm(80000,200,500)
cols <- names(df)
cols <- cols[!grepl("value", cols)]
combination <- 4
## create unique combinations of column names
ops_vec <- combn(cols, combination, FUN = paste, collapse = "*")
## transform ops vector into list of formulas
ops_vec_l <- purrr::map(ops_vec, .f = function(x) str_split(x, "\\*", simplify = T))
## break up the list of formulas into chunks otherwise memory error
chunks_run <- split(1:length(ops_vec_l), ceiling(seq_along(ops_vec_l)/10000))
## store results of each chunk into one final list
chunks_list <- vector("list", length = length(chunks_run))
ptm <- Sys.time()
chunks_idx <- 1
for (chunks_idx in seq_along(chunks_run))
{
## using purrr::map
# p <- Sys.time()
ele_length <- length(chunks_run[[chunks_idx]])
ops_list_temp <- vector("list", length = ele_length)
ops_list_temp <- purrr::map(
ops_vec_l[ chunks_run[[chunks_idx]] ], .f = function(x) df[,x[,1]]*df[,x[,2]]*df[,x[,3]]*df[,x[,4]]
)
# (p <- Sys.time()-p) #Time difference of ~ 3.6 secs to complete chunk of 10,000 operations
# ## using base lapply
# p <- Sys.time()
# ele_length <- length( ops_vec_l[ chunks_run[[chunks_idx]] ])
# ops_list_temp <- vector("list", length = ele_length)
# ops_list_temp <- lapply(
# ops_vec_l[ chunks_run[[chunks_idx]] ], function(x) df[,x[,1]]*df[,x[,2]]*df[,x[,3]]*df[,x[,4]]
# )
# (p <- Sys.time()-p) #Time difference of ~3.7 secs to complete a chunk of 10,000 operations
## number of rows I want to subset from df
topn <- 250
## list to store indices of topn values for each list element
indices_list <- vector("list", length = length(ops_list_temp))
## list to store value of the topn indices for each list element
values_list <- vector("list", length = length(ops_list_temp))
## for each variable combination in "ops_list_temp" list, find the index (indices) of the topn values in decreasing order
## each element in this list should be the length of topn
indices_list <- purrr::map(ops_list_temp, .f = function(x) kit::topn(vec = x, n = topn, decreasing = T, hasna = F))
## after finding the indices of the topn values for a given variable combination, find the value(s) corresponding to index (indices) and store in the list
## each element in this list, should be the length of topn
values_list <- purrr::map(indices_list, .f = function(x) df[x,"value"])
## save completed chunk to final list
chunks_list[[chunks_idx]] <- values_list
}
(ptm <- Sys.time()-ptm) # Time difference of 41.1 mins
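One direction that may be worth trying is sketched below; it is not benchmarked here and the sizes are deliberately tiny. It assumes that combn can hand back the column names directly as a matrix (so the paste and str_split round trip is not needed) and that the products are taken on a numeric matrix rather than on data frame columns. Base order stands in for kit::topn so that the sketch has no package dependencies.

```r
set.seed(1)
# Small stand-in for the 80000 x 90 data in the question (sizes are assumptions)
df <- data.frame(matrix(rnorm(200 * 6, 200, 500), nrow = 200, ncol = 6))
df$value <- rnorm(200, 200, 500)

cols <- setdiff(names(df), "value")
m <- 4      # variables per combination
topn <- 10  # rows to keep per combination

# combn() without FUN returns an m x n_combinations character matrix,
# one combination of column names per column.
combos <- combn(cols, m)

# Work on a numeric matrix instead of extracting data frame columns each time.
mat <- as.matrix(df[cols])
value <- df$value

values_list <- lapply(seq_len(ncol(combos)), function(j) {
  prod_vec <- mat[, combos[1, j]]
  for (k in 2:m) prod_vec <- prod_vec * mat[, combos[k, j]]
  # indices of the topn largest products, in decreasing order
  idx <- order(prod_vec, decreasing = TRUE)[seq_len(topn)]
  value[idx]
})
```

Because m is a variable here, the same code covers combinations of 3, 4, or 5 columns without rewriting the product expression.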

T test for two lists of matrices

I have two lists of matrices (list A and list B, each matrix is of dimension 14x14, and list A contains 10 matrices and list B 11)
and I would like to do a t test for each coordinate to compare the means of each coordinate of group A and group B.
As a result I would like to have a matrix of dimension 14x14 which contains the p value associated with each t test.
Thank you in advance for your answers.
Here's a method using a for loop and then applying the lm() function.
First we'll generate some fake data as described in the question.
#generating fake matrices described by OP
listA <- vector(mode = "list", length = 10)
listB <- vector(mode = "list", length = 11)
for (i in 1:10) {
listA[[i]] <- matrix(rnorm(196),14,14,byrow = TRUE)
}
for (i in 1:11) {
listB[[i]] <- matrix(rnorm(196),14,14,byrow = TRUE)
}
Then we'll unwrap each matrix as described by dcarlson in a for loop.
UnwrappedCorMats <- NULL
for (ID in 1:10) {
unwrapped <- as.vector(as.matrix(listA[[ID]])) #Unwrapping each matrix into a vector (column by column)
withID <- c(ID, "GroupA", unwrapped) #labeling it with ID# and which group it belongs to
UnwrappedCorMats <- rbind(UnwrappedCorMats, withID) #appending to the rows collected so far
}
for (ID in 1:11) {
unwrapped <- as.vector(as.matrix(listB[[ID]]))
withID <- c(ID, "GroupB", unwrapped)
UnwrappedCorMats <- rbind(UnwrappedCorMats, withID)
}
Then write and apply a function to run lm(). lm() is statistically equivalent to an unpaired t-test in this context but I'm using it to be more easily adapted into a mixed effect model if anyone wants to add mixed effects.
UnwrappedDF <- as.data.frame(UnwrappedCorMats)
lmPixel2Pixel <- function(i) { #defining function to run lm
lmoutput <- summary(lm(as.numeric(i) ~ V2, data= UnwrappedDF)) #columns are character after c(ID, "GroupA", ...), so convert back to numeric
lmoutputJustP <- lmoutput$coefficients[2,4] #Comment out this line to return full lm output rather than just p value
}
Vector_pvals <- sapply(UnwrappedDF[3:length(UnwrappedDF)], lmPixel2Pixel)
Finally we will reshape the vector into the same shape as the original matrix to make the results easier to interpret. as.vector() unwrapped each matrix column by column, so the matrix is filled column by column as well.
LM_mat.again <- as.data.frame(matrix(Vector_pvals, nrow = nrow(listA[[1]]), ncol = ncol(listA[[1]]), byrow = FALSE))
colnames(LM_mat.again) <- colnames(listA[[1]]) #if your matrix has row or column names then restoring them is helpful for interpretation
rownames(LM_mat.again) <- rownames(listA[[1]])
head(LM_mat.again)
I'm sure there are faster methods, but this one is pretty straightforward and I was surprised there weren't answers for this type of operation yet.
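For comparison, a more direct route is to stack each list into a 3-D array and call t.test once per coordinate. This is a sketch on simulated data; var.equal = TRUE is chosen here so that the p values match the lm() approach, whereas the t.test default is the Welch test, which does not assume equal variances.

```r
set.seed(42)
# Fake data with the shapes from the question: 10 and 11 matrices of 14 x 14
listA <- replicate(10, matrix(rnorm(196), 14, 14), simplify = FALSE)
listB <- replicate(11, matrix(rnorm(196), 14, 14), simplify = FALSE)

# Stack each list into a 14 x 14 x n array
arrA <- simplify2array(listA)
arrB <- simplify2array(listB)

# One t test per (row, column) coordinate; the result keeps the matrix layout
pvals <- matrix(NA_real_, nrow = 14, ncol = 14)
for (r in 1:14) {
  for (cc in 1:14) {
    pvals[r, cc] <- t.test(arrA[r, cc, ], arrB[r, cc, ],
                           var.equal = TRUE)$p.value
  }
}
```

pvals[r, cc] is then the p value for comparing coordinate (r, cc) between the two groups, with no reshaping step needed afterwards.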

R Convert loop into function

I would like to clean up my code a bit and start to use more functions for my everyday computations (where I would normally use for loops). I have an example of a for loop that I would like to make into a function. The problem I am having is in how to step through the constraint vectors without a loop. Here's what I mean;
## represents spectral data
set.seed(11)
df <- data.frame(Sample = 1:100, replicate(1000, sample(0:1000, 100, rep = TRUE)))
## feature ranges by column number
frm <- c(438,563,953,963)
to <- c(548,803,1000,993)
nm <- c("WL890", "WL1080", "WL1400", "WL1375")
WL.ps <- list()
for (i in 1:length(frm)){
## finds the minimum value within the range constraints and returns the corresponding column name
WL <- colnames(df[frm[i]:to[i]])[apply(df[frm[i]:to[i]],1,which.min)]
WL.ps[[i]] <- WL
}
new.df <- data.frame(WL.ps)
colnames(new.df) <- nm
The part where I iterate through the 'frm' and 'to' vector values is what I'm having trouble with. How does one go from frm[1] to frm[2], and so on, in a function (apply or otherwise)?
Any advice would be greatly appreciated.
Thank you.
You could write a function that returns the column name of the minimum value in each row for a particular range of columns. I have used max.col on the negated data instead of apply(df, 1, which.min), since max.col is more efficient than apply. ties.method = "first" makes ties resolve to the first column, as which.min does; the default breaks ties at random.
apply_fun <- function(data, x, y) {
cols <- x:y
names(data[cols])[max.col(-data[cols], ties.method = "first")]
}
Apply this function using Map :
WL.ps <- Map(apply_fun, frm, to, MoreArgs = list(data = df))
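Map returns a list with one character vector per range, so the last two lines of the original loop still apply. The sketch below runs the whole thing on a smaller made-up data set; the ranges and the names A, B, C are placeholders, not the spectral features from the question.

```r
set.seed(11)
# Smaller stand-in for the spectral data: 20 samples, 30 measurement columns
df <- data.frame(Sample = 1:20,
                 replicate(30, sample(0:1000, 20, replace = TRUE)))

frm <- c(2, 10, 20)       # first column of each range (hypothetical)
to  <- c(9, 19, 31)       # last column of each range (hypothetical)
nm  <- c("A", "B", "C")   # output column names (hypothetical)

apply_fun <- function(data, x, y) {
  cols <- x:y
  # negate so that the row-wise maximum is the original minimum
  names(data[cols])[max.col(-data[cols], ties.method = "first")]
}

# Map pairs frm[1] with to[1], frm[2] with to[2], and so on
WL.ps <- Map(apply_fun, frm, to, MoreArgs = list(data = df))
new.df <- data.frame(setNames(WL.ps, nm))
```

new.df has one row per sample and one column per range, like the result of the original loop.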

Assigning a variable to pasted name of column in R

I have a few data frames with the names:
Meanplots1,
Meanplots2,
Meanplots3 etc.
I am trying to write a for loop to do a series of equations on each data frame.
I am attempting to use the paste0 function.
What I want to happen is for x to be a column of each data set. So the code should work like this line:
x <- Meanplots1$PAR
However, since I want to put this in a for loop I want to format it like this:
for (i in 1:3){
x <- paste0("Meanplots",i,"$PAR")
Dmodel <- nls(y ~ ((a*x)/(b + x )) - c, data = dat, start = list(a=a,b=b,c=c))
}
What this does is assign x the string "Meanplots1$PAR", not the actual column. Any idea on how to fix this?
We can get all the data.frame in a list with mget
lst1 <- mget(ls(pattern = '^Meanplots\\d+$'))
then loop over the list with lapply and apply the model
DmodelLst <- lapply(lst1, function(dat) nls(y ~ ((a* PAR)/(b + PAR )) - c,
data = dat, start = list(a=a,b=b,c=c)))
Replace 'x' with the column name 'PAR'.
In the OP's loop, create an empty list to store the output ('Outlst'), get the value of the object named by paste0, then apply the formula with the unquoted column name, i.e. 'PAR'. The new data passed to predict must also name its column 'PAR', since that is the variable the model uses.
Outlst <- vector("list", 3)
ndat <- data.frame(PAR = seq(0,2000,100))
for(i in 1:3) {
dat <- get(paste0("Meanplots", i))
modeltmp <- nls(y ~ ((a*PAR)/(b + PAR )) - c,
data = dat, start = list(a=a,b=b,c=c))
MD <- data.frame(predict(modeltmp, newdata = ndat))
MD[,2] <- ndat$PAR
names(MD) <- c("Photo","PARi")
Outlst[[i]] <- MD
}
Now, we extract the output of each list element
Outlst[[1]]
Outlst[[2]]
instead of creating multiple objects in the global environment
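The difference between the pasted string and the object it names can be seen without fitting any model. In this sketch the three data frames are made-up stand-ins, and they live in a private environment (env) so the example does not touch the global workspace.

```r
# Hypothetical stand-ins for Meanplots1..3, created in a private environment
env <- new.env()
for (i in 1:3) {
  assign(paste0("Meanplots", i),
         data.frame(PAR = c(0, 100, 200) * i, y = c(1, 2, 3)),
         envir = env)
}

# paste0() alone only builds text
as_text <- paste0("Meanplots", 1, "$PAR")

# get() looks the object up by name; $PAR then extracts the column
as_column <- get(paste0("Meanplots", 1), envir = env)$PAR

# mget() fetches all matching data frames at once as a named list
lst1 <- mget(ls(env, pattern = "^Meanplots\\d+$"), envir = env)
max_par <- sapply(lst1, function(dat) max(dat$PAR))
```

With the data frames in a list, any per-data-frame computation, including the nls fit, becomes a single lapply or sapply call.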

Combining multiple function arguments inside list2env(lapply(),)

I'm working with ten training datasets, train1 through train10, and would like to repeat the following statements for 1 through 10 with a single block of code:
train_y_1 <- c(train1$y)
train1$y <-NULL
train_x_1 <- data.matrix(train1)
olsfit_1 <- cv.glmnet(y=train_y_1, x=train_x_1, alpha=1, family="gaussian")
I've read in the forums that lapply() is preferable to for loops. My code:
# Create empty data frames and list (to be populated with values in main program)
list2env(setNames(lapply(1:10, function(i) data.frame()), paste0('train_y_', 1:10)), envir=.GlobalEnv)
list2env(setNames(lapply(1:10, function(i) data.frame()), paste0('train_x_', 1:10)), envir=.GlobalEnv)
list2env(setNames(lapply(1:10, function(i) list()), paste0('lasso_', 1:10)), envir=.GlobalEnv)
# Create y and x input matrices and run ten lasso regressions
list2env(lapply(mget(paste0('train', 1:10)), mget(paste0('train_y_', 1:10)), mget(paste0('train_x_', 1:10)), mget(paste0('lasso_', 1:10)),
function(a,b,c,d)
{
b <- c(a$y);
a$y <- NULL;
c <- data.matrix(a);
d <- cv.glmnet(y=b, x=c, alpha=1, family="gaussian");
}), envir=.GlobalEnv)
which produces the error message:
Error in match.fun(FUN) :
'mget(paste0("train_y_", 1:10))' is not a function, character or symbol
So it looks like R is confused by the four mget() functions which I intended to be reading in values for the a,b,c,d arguments, but I'm not sure how to proceed next.
Any suggestions?
You want to keep all your data in lists whenever possible, avoiding polluting the global environment with a bunch of variables. This isn't tested, and train is missing, but should be a similar list of your train data. Then, you could do something like,
trainy <- setNames(lapply(1:10, function(i) data.frame()), paste0('train_y_', 1:10))
trainx <- setNames(lapply(1:10, function(i) data.frame()), paste0('train_x_', 1:10))
lasso <- setNames(lapply(1:10, function(i) list()), paste0('lasso_', 1:10))
f <- function(a,b,c,d) {
b <- c(a$y);
a$y <- NULL;
c <- data.matrix(a);
d <- cv.glmnet(y=b, x=c, alpha=1, family="gaussian");
}
mapply(f, train, trainy, trainx, lasso, SIMPLIFY=FALSE)
Although, since your lists are just initializing variables, you probably just want to loop (apply) over a list of your training data,
lapply(train, function(x) {
... # the statements you want to repeat
list(...) # return a list of the three data.frames
})
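A filled-in version of that skeleton might look like the following. The training sets are simulated, and lm.fit() is used only as a stand-in for cv.glmnet() so the sketch runs without glmnet installed; swapping the real call back in should only change the line that creates fit.

```r
set.seed(7)
# Hypothetical training sets kept in one list instead of train1 ... train10
train <- lapply(1:3, function(i) {
  data.frame(y = rnorm(20), x1 = rnorm(20), x2 = rnorm(20))
})
names(train) <- paste0("train", 1:3)

results <- lapply(train, function(d) {
  y <- d$y                                     # response vector
  x <- data.matrix(d[setdiff(names(d), "y")])  # predictors without y
  # lm.fit() stands in for cv.glmnet() so the sketch needs no extra packages
  fit <- lm.fit(cbind(Intercept = 1, x), y)
  list(y = y, x = x, fit = fit)                # everything for this data set
})
```

Each element of results holds the y vector, the x matrix, and the fit for one training set, for example results$train1$fit.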
We can achieve this with the following code.
# Load libraries
library(dplyr);library(glmnet)
# Gather all the variables in global into a list
fit = mget(paste0("train", 1:10), envir = .GlobalEnv) %>%
# Pipe each element of the list into `cv.glmnet` function
lapply(function(dat) {cv.glmnet(y = dat$y,
x = data.matrix(dat %>% mutate(y = NULL)),
alpha = 1,
family = "gaussian")})
Your output will be neatly stored in fit, which is a list with 10 elements. You can call each element with fit[[i]]. For example coef(fit[[1]]) pulls out the coefs for train1 and lapply(fit, coef) pulls the coef for all 10 models and stores them in a list.

Resources