All vs all set intersection for list values - r

I have a list L containing dataframes, L=(A,B,C,D). Each dataframe has a column z. For each pairwise comparison of the dataframes in the list, I would like to take the set intersection of the values in column z and count the shared values, so that I get a final matrix
A B C D
A
B
C
D
Where the cells of the matrix contain the number of shared values. I am not sure which is the most idiomatic way to implement this in R. I could do a for loop: start with the first member of the list, extract the values of column z, perform a set intersection, and populate an empty matrix. But there could be a better, more efficient approach.
Any ideas and implementations?
Example:
df1 <- data.frame(z=c(1,2,3),s=c(4,5,6))
df2 <- data.frame(z=c(3,2,4),s=c(6,5,4))
my.list <- list(df1, df2)
expected output
df1 df2
df1 3 2
df2 2 3

You can possibly try the outer function (name the list, e.g. my.list <- list(df1 = df1, df2 = df2), so that the dimnames appear in the result):
outer(my.list, my.list, function(x, y) Map(function(i, j) length(intersect(i$z, j$z)), x, y))
df1 df2
df1 3 2
df2 2 3
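An alternative sketch that returns a plain numeric matrix instead of a list matrix, using nested sapply (the list names here are only assumed for labelling the output):

```r
df1 <- data.frame(z = c(1,2,3), s = c(4,5,6))
df2 <- data.frame(z = c(3,2,4), s = c(6,5,4))
my.list <- list(df1 = df1, df2 = df2)   # named so the result carries labels

# pull out the z columns once, then count pairwise shared values
zs  <- lapply(my.list, `[[`, "z")
res <- sapply(zs, function(a) sapply(zs, function(b) length(intersect(a, b))))
res
#     df1 df2
# df1   3   2
# df2   2   3
```

Extracting the z columns up front avoids re-indexing the data frames inside the double loop.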

Related

Finding longest length out of 3 different vectors in R

I do not know if there is a function for this but I have 3 dataframes with different lengths. I was wondering if there is a way to find which one is the largest length and load that into a variable. For example:
x <- c(1:10)
y <- c(1:20)
z <- c(1:40)
I would want to use z as my variable because it has the longest length. Is there a function that can search through these 3 variables (x,y,z) and give me back the one with the longest length?
Thanks
We can place them in a list, use lengths to get the index of the maximum length, and extract that element from the list:
lst[which.max(lengths(lst))]
data
lst <- list(x, y, z)
If you have data frames rather than vectors:
lst[which.max(sapply(lst,nrow))]
data
lst <- list(df1, df2, df3)
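A runnable sketch with the question's example vectors; which.max(lengths(lst)) picks the index, and [[ extracts the object itself:

```r
x <- 1:10
y <- 1:20
z <- 1:40

lst <- list(x = x, y = y, z = z)
longest <- lst[[which.max(lengths(lst))]]  # the element with the most entries
length(longest)
# [1] 40
```

Note that if several elements tie for the longest length, which.max returns only the first of them.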

for every row in one data frame find the most similar row in another

I have two data frames with Boolean values and numeric values. If needed the numeric values could be put into categories.
var1 <- c(400,234,199,45,77,19)
var2 <- c(0,0,1,1,0,1)
var3 <- c(1,0,1,0,0,1)
df1 <- data.frame(var1,var2,var3)
var.1 <- c(78,147,670,200,75,17)
var.2 <- c(0,0,0,1,1,1)
var.3 <- c(0,1,1,0,1,1)
df2 <- data.frame(var.1,var.2,var.3)
For each row in df1, I want to find the most similar row in df2.
I am aware of cluster analysis, which I could do for one data frame by itself, but once I have clusters for one data frame, how would I extract and apply the same clustering algorithm to the other data frame, so that both data frames are clustered in the same way? I also need as many "clusters" as rows in the data frame, which makes me think cluster analysis is not for this task.
Additionally, every row in df1 must be matched with only one row in df2 so that at the end of the process every row in df1 matches to a different row in df2. This is tricky, because if taking each row in isolation in df1, the same row in df2 might get matched multiple times, which is not desired.
You don't have to do clustering, just search for the smallest distance. Take the first row of df1 and rbind it with df2. This is easiest if the column names are identical:
var1 <- c(400,234,199,45,77,19)
var2 <- c(0,0,1,1,0,1)
var3 <- c(1,0,1,0,0,1)
df1 <- data.frame(var1,var2,var3)
var.1 <- c(78,147,670,200,75,17)
var.2 <- c(0,0,0,1,1,1)
var.3 <- c(0,1,1,0,1,1)
df2 <- data.frame(var.1,var.2,var.3)
names(df1) <- names(df2)  # make the column names identical so rbind works
rbind(df1[1,], df2)
The result of this can be examined with dist. We are only interested in the first column of the distance matrix, i.e. the first nrow(df2) entries:
dist(rbind(df1[1,], df2))[1:nrow(df2)]
evaluates to
> dist(rbind(df1[1,], df2))[1:nrow(df2)]
[1] 322.0016 253.0000 270.0000 200.0050 325.0015 383.0013
and which.min tells us which of the rows has the smallest distance:
> which.min(dist(rbind(df1[1,], df2))[1:nrow(df2)])
[1] 4
So the fourth row in df2 has the smallest distance to the first row of df1. You can put that into an apply call or a for loop to do the calculation for each row in df1.
You have to answer the question though, how the distance of a mixture of Boolean and numeric values should be computed. There is no universal answer for that.
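For the one-to-one requirement in the question, a minimal greedy sketch of the per-row loop: each df1 row takes the nearest df2 row that is still free. This guarantees distinct matches but not a globally optimal assignment (that would need something like the Hungarian algorithm); the data are the question's, with column names assumed identical:

```r
# example data; identical column names so rbind() works
df1 <- data.frame(v1 = c(400,234,199,45,77,19),
                  v2 = c(0,0,1,1,0,1),
                  v3 = c(1,0,1,0,0,1))
df2 <- data.frame(v1 = c(78,147,670,200,75,17),
                  v2 = c(0,0,0,1,1,1),
                  v3 = c(0,1,1,0,1,1))

match_idx <- integer(nrow(df1))   # df2 row chosen for each df1 row
free <- rep(TRUE, nrow(df2))      # df2 rows still available
for (i in seq_len(nrow(df1))) {
  d <- dist(rbind(df1[i, ], df2))[1:nrow(df2)]  # distances from df1 row i to every df2 row
  d[!free] <- Inf                               # block rows already matched
  match_idx[i] <- which.min(d)
  free[match_idx[i]] <- FALSE
}
match_idx  # a distinct df2 row for every df1 row
```

Because the result depends on the order of the rows in df1, rows processed earlier get first pick of the nearest matches.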

Forcing Rbind with uneven columns in R

I am trying to force some list objects (e.g. 4 tables of frequency counts) into a matrix by doing rbind. However, they have uneven columns (i.e. some range from 2 to 5, while others range from 1 to 5). What I want is for a table that does not begin with a column of 1 to display NA in that row of the resulting rbind matrix. I tried the approach below, but the values repeat themselves within the row rather than displaying NAs where a value does not exist.
I considered rbind.fill, but it requires the tables to be data frames. I could write some loops, but in the spirit of R, I wonder if there is another approach I could use?
# Example
a <- sample(0:5,100, replace=TRUE)
b <- sample(2:5,100, replace=TRUE)
c <- sample(1:4,100, replace=TRUE)
d <- sample(1:3,100, replace=TRUE)
list <- list(a,b,c,d)
table(list[4])
count(list[1])
matrix <- matrix(ncol=5)
lapply(list,(table))
do.call("rbind",(lapply(list,table)))
When I have a similar problem, I include all the values I want in the vector and then subtract one from the result
table(c(1:5, a)) - 1
This could be made into a function
table2 <- function(x, values, ...){
table(c(x, values), ...) - 1
}
Of course, this will give zeros rather than NA
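Applied across the example list, table2 can feed do.call(rbind, ...) directly, since every table now covers the same set of values (a minimal sketch; the range 0:5 is assumed from the example, and set.seed is added only for reproducibility):

```r
set.seed(1)  # reproducible sampling
a <- sample(0:5, 100, replace = TRUE)
b <- sample(2:5, 100, replace = TRUE)
c <- sample(1:4, 100, replace = TRUE)
d <- sample(1:3, 100, replace = TRUE)

table2 <- function(x, values, ...) {
  table(c(x, values), ...) - 1
}

# every row now has columns 0..5; absent values show up as 0
m <- do.call(rbind, lapply(list(a, b, c, d), table2, values = 0:5))
m
```

As the answer notes, the absent cells come out as 0 rather than NA; replacing them afterwards with m[m == 0] <- NA would also blank genuine zero counts, so the distinction is lost with this trick.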

Finding nearest number between two lists

I have a list of dataframes (df1) and another list of dataframes (df2) which hold values required to find the 'nearest value' in the first list.
df1<-list(d1=data.frame(y=1:10), d2=data.frame(y=3:20))
df2<-list(d3=data.frame(y=2),d4=data.frame(y=4))
Say I have this function:
df1[[1]]$y[which(abs(df1[[1]]$y-df2[[1]])== min(abs(df1[[1]]$y-df2[[1]])))]
This function works perfectly in finding the closest value to the first df2 value in df1. What I can't achieve is getting it to work with lapply, as in something like:
lapply(df1, function(x){
f<-x$y[which(abs(x$y-df2) == min(abs(x$y - df2)))]
})
I would like to return a dataframe with all f values which show the nearest number for each item in df1.
Thanks,
M
I assume you're trying to compare the first data.frames in df1 and df2 to each other, and the second data.frames in df1 and df2 to each other. It would also be useful to use the which.min function (check out help(which.min)).
edit
In response to your comment, you could use mapply instead:
> mapply(function(x,z) x$y[which.min(abs(x$y - z$y))], df1, df2)
d1 d2
2 4
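If a data frame of the f values is wanted (as the question asks) rather than a named vector, the mapply result can be wrapped up; a minimal sketch with the example lists (the column names item and f are just assumptions for illustration):

```r
df1 <- list(d1 = data.frame(y = 1:10), d2 = data.frame(y = 3:20))
df2 <- list(d3 = data.frame(y = 2), d4 = data.frame(y = 4))

# nearest value in each df1 element to the corresponding df2 element
nearest <- mapply(function(x, z) x$y[which.min(abs(x$y - z$y))], df1, df2)
res <- data.frame(item = names(df1), f = unname(nearest), row.names = NULL)
res
#   item f
# 1   d1 2
# 2   d2 4
```

mapply pairs the lists element-wise, so df1 and df2 are assumed to have the same length and matching order.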
The OP's real problem is unclear, but I would probably do...
library(data.table)
DT1 = rbindlist(unname(df1), idcol=TRUE)
DT2 = rbindlist(unname(df2), idcol=TRUE)
DT1[DT2, on=c(".id","y"), roll="nearest"]
# .id y
# 1: 1 2
# 2: 2 4

How to use a for loop to extract specific cells from a matrix?

Searched a few different topics but am not finding the exact same question. I have a square correlation matrix where the row/column names are genes. Slice of the matrix shown below.
Xelaev15073085m Xelaev15073088m Xelaev15073090m Xelaev15073095m
Xelaev15000002m 0.1250128 -0.6368677 0.3119062 0.3980826
Xelaev15000006m 0.4127414 -0.8805597 0.6435158 0.9629489
Xelaev15000007m 0.4012530 -0.8854113 0.6425895 0.9614517
I have a data frame which has pairs of genes I want to extract from this large matrix.
V1 V2
1 Xelaev15011657m Xelaev15017932m
2 Xelaev15011587m Xelaev15046612m
3 Xelaev15011594m Xelaev15046616m
4 Xelaev15011597m Xelaev15046617m
5 Xelaev15011603m Xelaev15046624m
6 Xelaev15011654m Xelaev15017928m
I am trying to loop through the data frame and output the matrix cell for each pair, matrix["gene1","gene2"] (for example the value 0.1250128 when comparing Xelaev15073085m and Xelaev15000002m). Doing this on a single-gene basis is easy, but my attempt at a for loop over the thousands of pairs in this list is failing. In the example below, headedlist is a sample of the data frame above and FullcorSM is the full correlation matrix.
for(i in headedlist$V1){
data.frame(i, headedlist[i,2], FullcorSM[i,headedlist[i,2]])
}
The above line was my first attempt and returns null. My 2nd attempt is shown below.
for(i in 1:nrow(stagelist)){
write.table(data.frame(stagelist$V1, stagelist$V2, FullcorSM["stagelist$V1","stagelist$V2"]),
file="sampleout",
sep="\t",quote=F)
}
Which returns an out-of-bounds error. Running the 2nd example without the quotes in FullcorSM["stagelist$V1", "stagelist$V2"] returns all values of the 2nd column for each entry in the first column, which is closer to what I want, but I am still missing some understanding of how R interprets my matrix/data frame syntax, and it is not clear to me what the fix is. Any insight on how to proceed?
The functionality you're trying to create is actually built into R. You can extract values from a matrix using another two-column matrix, where the first column represents the rownames and the second represents the column names. For example:
m = as.matrix(read.table(text=" Xelaev15073085m Xelaev15073088m Xelaev15073090m Xelaev15073095m
Xelaev15000002m 0.1250128 -0.6368677 0.3119062 0.3980826
Xelaev15000006m 0.4127414 -0.8805597 0.6435158 0.9629489
Xelaev15000007m 0.4012530 -0.8854113 0.6425895 0.9614517"))
# note that your subscript matrix has to be a matrix too, not a data frame
n = as.matrix(read.table(text="Xelaev15000002m Xelaev15073088m
Xelaev15000006m Xelaev15073090m"))
# then it's quite simple
print(m[n])
# [1] -0.6368677 0.6435158
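To get the three-column output the question's loop was aiming for (gene pair plus value), the same matrix indexing can go straight into a data frame; a self-contained sketch with a trimmed version of the matrices above:

```r
# correlation matrix (trimmed to two columns for brevity)
m <- as.matrix(read.table(text = "                Xelaev15073088m Xelaev15073090m
Xelaev15000002m      -0.6368677       0.3119062
Xelaev15000006m      -0.8805597       0.6435158"))
# subscript matrix: column 1 = row names, column 2 = column names
n <- as.matrix(read.table(text = "Xelaev15000002m Xelaev15073088m
Xelaev15000006m Xelaev15073090m"))

out <- data.frame(gene1 = n[, 1], gene2 = n[, 2], value = m[n], row.names = NULL)
out
#             gene1           gene2      value
# 1 Xelaev15000002m Xelaev15073088m -0.6368677
# 2 Xelaev15000006m Xelaev15073090m  0.6435158
```

This also writes cleanly with write.table(out, ...), replacing the loop in the question entirely.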
Far from as clean as @David Robinson's very nice solution. Anyway, here it doesn't matter which genes are in rows and which are in columns in the correlation matrix, or whether the subscript matrix contains combinations not in the correlation matrix. Same matrix names as in @David's solution:
# combinations of row and column names for original and transposed correlation matrix
m_comb <- c(outer(rownames(m), colnames(m), paste),
outer(rownames(t(m)), colnames(t(m)), paste))
# 'dim names' in subscript matrix
n_comb <- paste(n[, "V1"], n[, "V2"])
# subset
m[n[n_comb %in% m_comb, ]]
# [1] -0.6368677 0.6435158
Update
Another possibility, slightly more convoluted but perhaps a more useful output. First read the correlation matrix to a data frame df, and the subscript matrix to a data frame df2.
# add row names as a column in correlation matrix
df$rows <- rownames(df)
# melt the correlation matrix
library(reshape2)
df3 <- melt(df)
# merge subscript data and correlation data
df4 <- merge(x = df2, y = df3, by.x = c("V1", "V2"), by.y = c("rows", "variable"))
df4
# V1 V2 value
# 1 Xelaev15000002m Xelaev15073088m -0.6368677
# 2 Xelaev15000006m Xelaev15073090m 0.6435158
