I need to apply a list of indices to a list of dataframes with a one-to-one mapping: the first element of the list of indices goes to the first dataframe only, and so on. The indices apply to the rows of the dataframes.
A list of complementary dataframes also needs to be created by selecting the rows not mentioned in the indices list.
Here is some sample data:
set.seed(1)
A <- data.frame(matrix(rnorm(40,0,1), nrow = 10))
B <- data.frame(matrix(rnorm(40,2,3), nrow = 10))
C <- data.frame(matrix(rnorm(40,3,4), nrow = 10))
dflis <- list(A,B,C)
# Create a sample row index
ix <- lapply(lapply(dflis,nrow), sample, size = 6)
So far I have managed this working but ugly-looking code:
dflis.train <- lapply(seq_along(dflis), function(x) dflis[[x]][ix[[x]],])
dflis.test <- lapply(seq_along(dflis), function(x) dflis[[x]][-ix[[x]],])
Can someone suggest something better, more elegant?
Use Map/mapply instead of the univariate lapply, so that you can iterate over both objects and apply a function, like:
Map(function(d,r) d[r,], dflis, ix)
Or if you want to be fancy:
Map(`[`, dflis, ix, TRUE)
It matches the result of your original code:
identical(
Map(function(d,r) d[r,], dflis, ix),
lapply(seq_along(dflis), function(x) dflis[[x]][ix[[x]],])
)
#[1] TRUE
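For the complementary ("test") data frames from the question, i.e. the rows not in ix, the same Map pattern works with negated indices, for example:
Map(function(d, r) d[-r, ], dflis, ix)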
I would like some suggestions on speeding up the code below. The flow of the code is fairly straightforward:
create a vector of unique combinations (m=3, 4, or 5) from df variable names
transform the vector of combinations into a list of formulas
break up the list of formulas into chunks to get around memory limitations
iterate through each chunk, performing the formula operation and subsetting the df to the user-specified number of rows (topn)
The full reprex is below, including the different attempts using purrr::map and base lapply. I also attempted to use := from data.table following the link below, but I was unable to figure out how to transform the list of formulas into formulas that could be fed to quote(:=(...)):
Apply a list of formulas to R data.table
It appears to me that one of the bottlenecks in my code is the variable operation step. A previous bottleneck was in the ordering step, which I've managed to speed up quite a bit using the kit library and the link below, but any suggestions that could speed up the entire flow are appreciated. The example I'm posting here uses combn of 4, as that is typically what I use in my workflow, but I would also like to be able to go up to combn of 5 if the speed is reasonable.
Fastest way to find second (third...) highest/lowest value in vector or column
library(purrr)
library(stringr)
library(kit)
df <- data.frame(matrix(data = rnorm(80000*90,200,500), nrow = 80000, ncol = 90))
df$value <- rnorm(80000,200,500)
cols <- names(df)
cols <- cols[!grepl("value", cols)]
combination <- 4
## create unique combinations of column names
ops_vec <- combn(cols, combination, FUN = paste, collapse = "*")
## transform ops vector into list of formulas
ops_vec_l <- purrr::map(ops_vec, .f = function(x) str_split(x, "\\*", simplify = T))
## break up the list of formulas into chunks otherwise memory error
chunks_run <- split(1:length(ops_vec_l), ceiling(seq_along(ops_vec_l)/10000))
## store results of each chunk into one final list
chunks_list <- vector("list", length = length(chunks_run))
ptm <- Sys.time()
chunks_idx <- 1
for (chunks_idx in seq_along(chunks_run))
{
## using purrr::map
# p <- Sys.time()
ele_length <- length(chunks_run[[chunks_idx]])
ops_list_temp <- vector("list", length = ele_length)
ops_list_temp <- purrr::map(
ops_vec_l[ chunks_run[[chunks_idx]] ], .f = function(x) df[,x[,1]]*df[,x[,2]]*df[,x[,3]]*df[,x[,4]]
)
# (p <- Sys.time()-p) #Time difference of ~ 3.6 secs to complete chunk of 10,000 operations
# ## using base lapply
# p <- Sys.time()
# ele_length <- length( ops_vec_l[ chunks_run[[chunks_idx]] ])
# ops_list_temp <- vector("list", length = ele_length)
# ops_list_temp <- lapply(
# ops_vec_l[ chunks_run[[chunks_idx]] ], function(x) df[,x[,1]]*df[,x[,2]]*df[,x[,3]]*df[,x[,4]]
# )
# (p <- Sys.time()-p) #Time difference of ~3.7 secs to complete a chunk of 10,000 operations
## number of rows I want to subset from df
topn <- 250
## list to store indices of topn values for each list element
indices_list <- vector("list", length = length(ops_list_temp))
## list to store value of the topn indices for each list element
values_list <- vector("list", length = length(ops_list_temp))
## for each variable combination in "ops_list_temp" list, find the index (indices) of the topn values in decreasing order
## each element in this list should be the length of topn
indices_list <- purrr::map(ops_list_temp, .f = function(x) kit::topn(vec = x, n = topn, decreasing = T, hasna = F))
## after finding the indices of the topn values for a given variable combination, find the value(s) corresponding to index (indices) and store in the list
## each element in this list, should be the length of topn
values_list <- purrr::map(indices_list, .f = function(x) df[x,"value"])
## save completed chunk to final list
chunks_list[[chunks_idx]] <- values_list
}
(ptm <- Sys.time()-ptm) # Time difference of 41.1 mins
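One possibility for the variable-operation bottleneck (only a sketch, not benchmarked here): extract the candidate columns once into a numeric matrix, so each product indexes matrix columns instead of repeatedly subsetting the data.frame. The purrr::map() call inside the loop could then be swapped for something like:
m <- as.matrix(df[, cols])  ## do this once, before the chunk loop
ops_list_temp <- purrr::map(
  ops_vec_l[ chunks_run[[chunks_idx]] ],
  .f = function(x) m[, x[, 1]] * m[, x[, 2]] * m[, x[, 3]] * m[, x[, 4]]
)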
Below, I'm wondering how to use the base R function quantile() separately across the elements in L that are named EFL and ESL.
Note: this is a toy example, L could contain any number of similarly named elements.
foo <- function(X) {
X <- as.matrix(X)
tab <- table(row(X), factor(X, levels = sort(unique(as.vector(X)))))
w <- diag(ncol(tab))
rosum <- rowSums(tab)
obs_oc <- tab * (t(w %*% t(tab)) - 1)
obs_c <- colSums(obs_oc)
max_oc <- tab * (rosum - 1)
max_c <- colSums(max_oc)
SA <- obs_c / max_c
h <- names(SA)
h[is.na(h)] <- "NA"
setNames(SA, h)
}
DAT <- read.csv("https://raw.githubusercontent.com/rnorouzian/m/master/X.csv", row.names = 1)
L <- replicate(50, foo(DAT[sample(1:nrow(DAT), replace = TRUE),]), simplify = FALSE)
# How to use `quantile()` separately across all similarly named elements (e.g., EFL, ESL) in `L[[i]]`, i = 1, ..., 50
# quantile(all EFL elements across `L`)
# quantile(all ESL elements across `L`)
In a previous solution I used do.call to rbind the list into a matrix, converted it to a data.frame, and then calculated the quantile over each column (one column per named element).
sapply(as.data.frame(do.call(rbind, L)), quantile)
However, when an element is missing a row, this does not take it into account. To line the rows up correctly you need to fill in the missing ones. I used data.table's rbindlist (you could also use plyr::rbind.fill) with fill = TRUE to fill the missing values. It requires each element to be a data.frame/data.table/list, so I converted each one to a data.frame; before doing so you need to transpose (t()) the data so that the named values line up as columns. It could be written in a single line, but it is easier to read what is happening over multiple lines.
L2 = lapply(L, function(x){as.data.frame(t(x))})
df = data.table::rbindlist(L2, fill=TRUE) # or plyr::rbind.fill(L2)
sapply(df, quantile, na.rm = TRUE)
You can also use purrr::transpose:
Lt <- purrr::transpose(L)
quantile(unlist(Lt$EFL),.8)
quantile(unlist(Lt$ESL),.8)
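Since L can contain any number of similarly named elements, the same idea extends to every name at once; a minimal sketch:
Lt <- purrr::transpose(L)
sapply(Lt, function(el) quantile(unlist(el), 0.8))
Note that transpose() templates the output on the names of the first element by default, so the missing-name caveat above still applies.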
I have a list of dataframes that I would like to multiply by the elements of a vector.
The first dataframe in the list would be multiplied by the first observation of the vector, and so on, producing another list of dataframes already multiplied.
I tried to do this with a loop, but was unsuccessful. I also tried to imagine something using map or lapply, but I couldn't.
for(i in vec){
for(j in listdf){
listdf2 <- i*listdf[[j]]
}
}
Error in listdf[[j]] : invalid subscript type 'list'
Any idea how to solve this?
* The vector and the list of dataframes have the same length.
Use Map:
listdf2 <- Map(`*`, listdf, vec)
In purrr this can be done using map2:
listdf2 <- purrr::map2(listdf, vec, `*`)
If you are interested in a for loop solution, you just need one loop:
listdf2 <- vector('list', length(listdf))
for (i in seq_along(vec)) {
listdf2[[i]] <- listdf[[i]] * vec[i]
}
data
vec <- c(4, 3, 5)
df <- data.frame(a = 1:5, b = 3:7)
listdf <- list(df, df, df)
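As a quick sanity check with the sample data (a sketch only), the first element of the result should equal df multiplied by the first vector element:
identical(Map(`*`, listdf, vec)[[1]], df * vec[1])
# should be TRUE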
I have a large list (z) containing 3 lists of 10 data frames. I would like to collapse this object into a list of 3 data frames, where each data frame is the sum of the 10 prior data frames (think matrix addition). Here is what I am working with; keep in mind that these are fake numbers, as the real data are read in from hundreds of *.csv files.
x = rep(1,100)
x = matrix(x,10,10)
x = as.data.frame(x)
y = list(x,x,x,x,x,x,x,x,x,x)
z = list(y,y,y)
The desired end product would look like this:
x1 = rep(10,100)
x1 = matrix(x1,10,10)
y1 = list(x1,x1,x1)
I keep trying stuff along the lines of:
z1 = c()
for (i in 1:3){
for (j in 1:10){
z1[[i]] = sum(z[[i]][[j]])
}
}
However, this does not yield the desired output. I have also messed around with some of the apply functions, but to no avail.
Thanks in advance for your help!
We can use Reduce to sum the corresponding i, j elements across the data frames in each inner list, collapsing each one into a single dataset:
lapply(z, function(x) Reduce(`+`, x))
If we want to remove the last column, which is not numeric:
lapply(z, function(x) Reduce(`+`, lapply(x, function(y) y[-ncol(y)])))
Or it can be done by looping over the sequence of the list:
lapply(seq_along(z), function(i) Reduce(`+`, lapply(seq_along(z[[i]]),
function(j) z[[i]][[j]][-ncol(z[[i]][[j]])])))
If we want to use sum, the data.frames inside the list can be converted to an array; we then loop over the array with apply, specifying the MARGIN, and do the sum. With this option there is also the possibility of taking care of NA elements with na.rm = TRUE in sum:
lapply(z, function(x) apply(array(unlist(x), c(10, 10, 10)),
1:2, sum, na.rm = TRUE))
Or make it more efficient by looping over only one dimension and using colSums:
lapply(z, function(x) apply(array(unlist(x), c(10, 10, 10)), 1, colSums, na.rm = TRUE))
Or using a for loop
z1 <- replicate(length(z), matrix(0, 10, 10), simplify = FALSE)
for(i in seq_along(z)) for(j in seq_along(z[[1]])) z1[[i]] <- z1[[i]] + z[[i]][[j]]
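With the toy data above, a quick sanity check (sketch only) that each collapsed element is a 10 x 10 table of 10s:
z2 <- lapply(z, function(x) Reduce(`+`, x))
all(sapply(z2, function(d) all(d == 10)))
# should be TRUE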
Assume a matrix that contains all bit strings of length r and is in order.
library(gtools)
mat<-permutations(n = 2, r = 5, v = c(0,1), repeats.allowed = TRUE)
mat<-cbind(mat, round(runif(nrow(mat)), digits = 2))
and several vectors each with r elements:
r=5
vec<-t(replicate(100,sample(c(0,1),5,replace=T)))
For each vector (i.e., row in vec) I would like to identify the corresponding row in mat.
Note: I would like to list the result for each row, not just the unique elements.
Is there an efficient way to do this without using a for loop?
Try
indx1 <- do.call(`paste0`,as.data.frame(mat[,-6]))
indx2 <- do.call(`paste0`, as.data.frame(vec))
sapply(indx2, function(x) mat[indx1 %in% x,6])
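An alternative sketch along the same lines: since every row of mat is unique, match() gives the corresponding row number in mat directly, avoiding the per-element %in% scan:
rows <- match(indx2, indx1)
mat[rows, 6]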