Intersection of pairs of sets (any possible combination) - r

I have more than three sets, but here is a small example with three.
S1<-c("Frizzy","Jack","Amy")
S2<-c("Alice","Samy","Anna","Jack")
S3<-c("Frizzy","Anna","Fred","Jack")
I would like to obtain the following result
length(intersect(S1,S2))+length(intersect(S1,S3))+length(intersect(S2,S3))
without manually writing out all the possible combinations.

We can use combn to compute the intersection of each pair of sets, take the lengths of the resulting list elements, and sum them
sum(lengths(combn(list(S1, S2, S3), 2,
FUN = function(x) Reduce(intersect, x), simplify = FALSE)))
#[1] 5
If there are many objects whose names follow the same pattern ('S' followed by some digits), use mget to collect them all into a list instead of typing them out manually
lst1 <- mget(ls(pattern = '^S\\d+$'))
sum(lengths(combn(lst1, 2,
FUN = function(x) Reduce(intersect, x), simplify = FALSE)))
#[1] 5
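As a cross-check on the combn approach above, here is a minimal sketch that keeps the size of each pairwise intersection instead of only the total. The names sets, pairs and sizes are made up for this illustration.

```r
S1 <- c("Frizzy", "Jack", "Amy")
S2 <- c("Alice", "Samy", "Anna", "Jack")
S3 <- c("Frizzy", "Anna", "Fred", "Jack")

# Named list, so each pair can be labelled by the set names
sets <- list(S1 = S1, S2 = S2, S3 = S3)

# One column per pair of set names
pairs <- combn(names(sets), 2)

# Size of the intersection for each pair (one value per column of pairs)
sizes <- apply(pairs, 2, function(p) {
  length(intersect(sets[[p[1]]], sets[[p[2]]]))
})
names(sizes) <- apply(pairs, 2, paste, collapse = "&")

sizes       # per-pair counts
sum(sizes)  # same total as the one-liner above
```

Keeping the per-pair counts makes it easier to see which pairs contribute to the total.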

Related

Submit every similarly named elements of a list of vectors to a function in R

Below, I'm wondering how to use BASE R function quantile() separately across elements in L that are named EFL and ESL?
Note: this is a toy example, L could contain any number of similarly named elements.
foo <- function(X) {
X <- as.matrix(X)
tab <- table(row(X), factor(X, levels = sort(unique(as.vector(X)))))
w <- diag(ncol(tab))
rosum <- rowSums(tab)
obs_oc <- tab * (t(w %*% t(tab)) - 1)
obs_c <- colSums(obs_oc)
max_oc <- tab * (rosum - 1)
max_c <- colSums(max_oc)
SA <- obs_c / max_c
h <- names(SA)
h[is.na(h)] <- "NA"
setNames(SA, h)
}
DAT <- read.csv("https://raw.githubusercontent.com/rnorouzian/m/master/X.csv", row.names = 1)
L <- replicate(50, foo(DAT[sample(1:nrow(DAT), replace = TRUE),]), simplify = FALSE)
# How to use `quantile()` separately across all similarly named elements (e.g., EFL, ESL) in `L[[i]]`, i = 1, ..., 50
# quantile(all EFL elements across `L`)
# quantile(all ESL elements across `L`)
In my previous solution I used do.call to rbind the list elements into a matrix, converted it to a data.frame, and then calculated the quantiles of each column (one column per element name).
sapply(as.data.frame(do.call(rbind, L)), quantile)
However, when a named element is missing from some of the vectors, that approach does not take it into account. To get accurate results you need to fill in the missing values. I used data.table's rbindlist (you could also use plyr::rbind.fill) with fill=TRUE to fill the missing values with NA. It requires each element to be a data.frame/table/list, so I converted each to a data.frame, but before doing so you need to transpose (t()) the data so that each named element becomes a column. It could be written in a single line, but it's easier to read what is happening in multiple lines.
L2 = lapply(L, function(x){as.data.frame(t(x))})
df = data.table::rbindlist(L2, fill=TRUE) # or plyr::rbind.fill(L2)
sapply(df, quantile, na.rm = TRUE)
You can also use purrr::transpose:
Lt <- purrr::transpose(L)
quantile(unlist(Lt$EFL),.8)
quantile(unlist(Lt$ESL),.8)
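As a dependency-free alternative, the same grouping can be sketched in base R by flattening the list and splitting the values by name. The toy list below is made up for illustration (its second element deliberately lacks ESL); it is not the data from the question.

```r
# Toy stand-in for L: named numeric vectors, one of them missing "ESL"
L <- list(
  c(EFL = 0.2, ESL = 0.5),
  c(EFL = 0.4),
  c(EFL = 0.6, ESL = 0.7)
)

# unlist() keeps the inner names, so values can be grouped by name
v <- unlist(L)
by_name <- split(unname(v), names(v))

# Quantiles per name; a missing element simply contributes no value
q <- sapply(by_name, quantile, probs = c(0.25, 0.5, 0.75))
q
```

Because nothing is padded with NA here, na.rm is not needed; this assumes L itself is an unnamed list, since names on the outer list would be pasted onto the inner names by unlist().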

Is there a way to sum together lists of data frames within a larger list?

I have a large list (z) containing 3 lists of 10 data frames. I would like to collapse this object into a list of 3 data frames where each data frame is the sum of the 10 prior data frames (think matrix addition). Here is what I am working with. Keep in mind that these are fake numbers; the real data are read in from hundreds of *.csv files
x = rep(1,100)
x = matrix(x,10,10)
x = as.data.frame(x)
y = list(x,x,x,x,x,x,x,x,x,x)
z = list(y,y,y)
The desired end product would look like this:
x1 = rep(10,100)
x1 = matrix(x1,10,10)
y1 = list(x1,x1,x1)
I keep trying stuff along the lines of:
z1 = c()
for (i in 1:3){
for (j in 1:10){
z1[[i]] = sum(z[[i]][[j]])
}
}
However, this does not yield the desired output. I have also messed around with some of the apply functions, but to no avail.
Thanks in advance for your help!
We can use Reduce with `+` to add the data frames in each inner list element by element, collapsing each inner list to a single dataset
lapply(z, function(x) Reduce(`+`, x))
If the last column is not numeric and we want to drop it before summing
lapply(z, function(x) Reduce(`+`, lapply(x, function(y) y[-ncol(y)])))
Or we can loop over the indices of the list
lapply(seq_along(z), function(i) Reduce(`+`, lapply(seq_along(z[[i]]),
function(j) z[[i]][[j]][-ncol(z[[i]][[j]])])))
If we want to use sum, the data.frames inside each list can be converted to an array; then loop over the array with apply, specify the MARGIN, and take the sum. With this option it is also possible to handle NA elements with na.rm = TRUE in sum
lapply(z, function(x) apply(array(unlist(x), c(10, 10, 10)),
1:2, sum, na.rm = TRUE))
Or make it more efficient by looping over only one dimension and using rowSums, so that each slice is summed across the third (data frame) dimension
lapply(z, function(x) apply(array(unlist(x), c(10, 10, 10)), 2, rowSums, na.rm = TRUE))
Or using a for loop
z1 <- replicate(length(z), matrix(0, 10, 10), simplify = FALSE)
for(i in seq_along(z)) for(j in seq_along(z[[1]])) z1[[i]] <- z1[[i]] + z[[i]][[j]]
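With the all-ones toy data every approach returns a matrix of 10s, so mistakes in the summing dimension are easy to miss. The following is a small sketch with distinct values (the data frames a and b are made up for this illustration) that compares the Reduce result with an array-based sum using rowSums with dims = 2, which sums over the third dimension.

```r
# Two small data frames with distinct values (illustrative only)
a <- data.frame(V1 = c(1, 3), V2 = c(2, 4))
b <- data.frame(V1 = c(10, 30), V2 = c(20, 40))
z <- list(list(a, b), list(b, b))

# Element-wise sum with Reduce: returns a list of data frames
by_reduce <- lapply(z, function(x) Reduce(`+`, x))

# Same sum via a 3-d array: rows x columns x data frames
by_array <- lapply(z, function(x) {
  arr <- array(unlist(x), c(nrow(x[[1]]), ncol(x[[1]]), length(x)))
  rowSums(arr, dims = 2)  # collapse the third dimension
})

by_reduce[[1]]
by_array[[1]]
```

The array version assumes every data frame in an inner list is numeric and has the same dimensions; Reduce keeps the column names, while the array version returns a plain matrix.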

apply list of indices to list of dataframes

I need to apply a list of indices to a list of dataframes with a one-to-one mapping: the first element of the list of indices goes to the first dataframe only, and so on. The indices refer to rows in the dataframes.
A list of complementary dataframes also needs to be created by selecting the rows not mentioned in the list of indices.
Here is some sample data:
set.seed(1)
A <- data.frame(matrix(rnorm(40,0,1), nrow = 10))
B <- data.frame(matrix(rnorm(40,2,3), nrow = 10))
C <- data.frame(matrix(rnorm(40,3,4), nrow = 10))
dflis <- list(A,B,C)
# Create a sample row index
ix <- lapply(lapply(dflis,nrow), sample, size = 6)
So far I have this working but ugly-looking code:
dflis.train <- lapply(seq_along(dflis), function(x) dflis[[x]][ix[[x]],])
dflis.test <- lapply(seq_along(dflis), function(x) dflis[[x]][-ix[[x]],])
Can someone suggest something better, more elegant?
Use Map/mapply instead of the univariate lapply, so that you can iterate over both objects and apply a function, like:
Map(function(d,r) d[r,], dflis, ix)
Or if you want to be fancy:
Map(`[`, dflis, ix, TRUE)
This matches your requested result:
identical(
Map(function(d,r) d[r,], dflis, ix),
lapply(seq_along(dflis), function(x) dflis[[x]][ix[[x]],])
)
#[1] TRUE
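The complementary (test) set can be sketched the same way by negating the row indices inside Map. The data below is a reduced version of the sample data, and the names train and test are chosen only for this illustration.

```r
set.seed(1)
dflis <- list(
  data.frame(matrix(rnorm(40, 0, 1), nrow = 10)),
  data.frame(matrix(rnorm(40, 2, 3), nrow = 10))
)

# Six random row indices per data frame
ix <- lapply(dflis, function(d) sample(nrow(d), size = 6))

# Rows listed in ix, and the rows not listed in ix
train <- Map(function(d, r) d[r, ], dflis, ix)
test  <- Map(function(d, r) d[-r, ], dflis, ix)

sapply(train, nrow)
sapply(test, nrow)
```

One caveat: negative indexing with an empty index vector selects no rows rather than all rows, so this assumes every element of ix is non-empty.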

All combinations of two-way tables

How can I generate all two way tables from a data frame in R?
some_data <- data.frame(replicate(100, base::sample(1:4, size = 50, replace = TRUE)))
combos <- combn(names(some_data), 2)
The following does not work. I was planning to wrap a for loop around it and store the results from each iteration somewhere
i=1
table(some_data[combos[, i][1]], some_data[combos[, i][2]])
Why does this not work? The individual arguments evaluate as expected:
some_data[combos[, i][1]]
some_data[combos[, i][2]]
Calling it with the variable names directly yields the desired result, but how to loop through all combos in this structure?
table(some_data$X1, some_data$X2)
combn has a FUN argument, so we can use it to subset 'some_data' by each pair of column names and tabulate it; the table output is returned in an array
out <- combn(names(some_data), 2, FUN = function(i) table(some_data[i]))
Regarding the issue in the OP's post
table(some_data[combos[, i][1]], some_data[combos[, i][2]])
Both arguments are data.frames. If we extract each column as a vector instead (note the added comma inside the brackets), it should work
table(some_data[, combos[, i][1]], some_data[, combos[, i][2]])
or more compactly
table(some_data[combos[, i]])
Update
combn has simplify = TRUE by default, that is, it converts the output to an array. If the columns do not all contain the same set of values, the tables will have different dimensions unless the columns are converted to factors with the levels specified. An array can only hold elements of fixed dimensions, so if some of the tables differ in dimension, the call results in an error. One way around this is to use simplify = FALSE to return a list, which doesn't have that restriction.
Here is an example where the previous code fails
set.seed(24)
some_data2 <- data.frame(replicate(5, base::sample(1:10, size = 50,
replace = TRUE)))
some_data <- data.frame(some_data, some_data2)
out1 <- combn(names(some_data), 2, FUN = function(i)
table(some_data[i]), simplify = FALSE)
is.list(out1)
#[1] TRUE
length(out1)
#[1] 5460
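To make the list output easier to navigate, each table can be labelled with the pair of columns it came from. This is a small sketch on made-up data (the data frame d and its columns a, b, c are hypothetical), not the data from the question.

```r
set.seed(42)
d <- data.frame(
  a = sample(1:2, 20, replace = TRUE),
  b = sample(1:3, 20, replace = TRUE),
  c = sample(1:2, 20, replace = TRUE)
)

# All pairs of column names, kept as a list of length-2 vectors
pairs <- combn(names(d), 2, simplify = FALSE)

# One two-way table per pair; a list can hold tables of different shapes
tabs <- lapply(pairs, function(p) table(d[p]))
names(tabs) <- vapply(pairs, paste, FUN.VALUE = character(1), collapse = ":")

names(tabs)
tabs[["a:b"]]
```

A table can then be looked up by name, for example tabs[["a:b"]], rather than by its position in the combn output.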

Error using colSds; error while loop across lists

I have the following list:
d1<-data.frame(y1=c(34,56,89,45),y2=c(42,54,68,25),y3=c(253,547,586,258),y4=c(233,537,554,258))
d2<-data.frame(y1=c(37,26,14,67),y2=c(65,54,43,23),y3=c(243,577,516,125),y4=c(267,527,567,368))
d3<-data.frame(y1=c(35,24,14,58),y2=c(65,51,43,21),y3=c(267,527,567,368),y4=c(243,577,516,125))
d4<-data.frame(y1=c(34,23,13,36),y2=c(65,55,44,24),y3=c(233,537,554,258),y4=c(253,547,586,258))
lst <- list(d1,d2,d3,d4)
My intention is to obtain different data frames with the means and sd of certain columns for each of the elements of the list. The first problem came when trying to use colSds to obtain the sd.
W.mean<-list()
W.sd<-list()
for (i in ids){
W.mean<-lapply(lst, function(i) colMeans(i[,c(1,2,4)],na.rm=TRUE))
W.sd<-lapply(lst, function(i) colSds(i[,c(1,2,4)],na.rm = TRUE))
}
As soon as I run this script I obtain the following error:
Error in colVars(x, rows = rows, cols = cols, ...) :
Argument 'x' must be a matrix or a vector.
The mean function still works, so I have a new list with all the means (W.mean).
Now I want to create separate data.frames with just the means (I would also include the sd, but I need to make it work first)
for (i in c("d1","d2","d3","d4")){
df<-get(i)
df<-data.frame(t(W.mean[[i]]))
assign(paste0(i,"mean"), df)
}
However, I get a new error: Error in t.default(W.mean[[i]]) : argument is not a matrix
Can someone help me fix the errors? Thanks
The reason is that colSds works on a matrix and not on a data.frame. According to the Description in ?colSds
Description - Standard deviation estimates for each row (column) in a matrix.
lapply(lst, function(i) colSds(i[, c(1,2, 4)], na.rm = TRUE))
Error in colVars(x, rows = rows, cols = cols, ...) : Argument 'x' must be a matrix or a vector.
Therefore, convert the 'data.frame' to 'matrix' and it should work fine
lapply(lst, function(i) colSds(as.matrix(i[, c(1,2, 4)]), na.rm = TRUE))
#[[1]]
#[1] 23.76272 18.24600 173.64427
#[[2]]
#[1] 22.70095 17.91415 139.72682
#[[3]]
#[1] 18.89224 18.40290 216.20110
#[[4]]
#[1] 10.66146 17.56891 180.27202
Also, the for loop in
for(i in ids) {
}
seems to be unnecessary if the intention is just to loop over the list of data.frames and get the colMeans and colSds. We can also do this in a single lapply call instead of two separate ones
res1 <- lapply(lst, function(i) t(cbind(Mean = colMeans(i[, c(1,2, 4)],
na.rm = TRUE), Sds = colSds(as.matrix(i[, c(1,2, 4)]), na.rm = TRUE))))
and the result can be converted to a single dataset by rbinding the list elements
do.call(rbind, res1)
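The second error in the question (argument is not a matrix) is not covered above. A likely cause is that lst has no names, so W.mean[["d1"]] returns NULL and t(NULL) fails. Below is a minimal sketch, assuming the list is given names; it uses base sd through sapply rather than colSds, which is a substitution made here for illustration, and only the columns used in the question are included for two of the data frames.

```r
d1 <- data.frame(y1 = c(34, 56, 89, 45), y2 = c(42, 54, 68, 25),
                 y4 = c(233, 537, 554, 258))
d2 <- data.frame(y1 = c(37, 26, 14, 67), y2 = c(65, 54, 43, 23),
                 y4 = c(267, 527, 567, 368))

# Naming the list lets results be looked up by "d1", "d2", ...
lst <- list(d1 = d1, d2 = d2)
cols <- c("y1", "y2", "y4")

# One 2 x 3 matrix (Mean and Sd rows) per data frame
stats <- lapply(lst, function(d) {
  rbind(Mean = colMeans(d[cols], na.rm = TRUE),
        Sd   = sapply(d[cols], sd, na.rm = TRUE))
})

stats[["d1"]]
```

Since lapply preserves names, stats[["d1"]] now returns a matrix instead of NULL, and there is no need for get or assign.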

Resources