Table of all intersections in two data frames - r

I have two data frames. Each row of the data frames has a different number of elements (actually gene names). I used read.csv("file.csv",fill=TRUE) to read them in, so there is some NA padding in some of the rows.
Both data frames contain the same elements, but they have been clustered differently, so the elements fall into different groups. I want to output a table of the intersection sizes between the groups of the two data frames.
So if
df1<-data.frame(c("a","b","NA","NA"),c("c","d","e","f"),c("g","h","i","NA" ),c("j","NA","NA","NA"))
df2<-data.frame(c("c","e","i","NA"),c("f","g","h","NA"),c("a","b","d","j" ))
then I want to get to something like this:
        df1[1,] df1[2,] df1[3,] df1[4,]
df2[1,]       0       2       1       0
df2[2,]       0       1       2       0
df2[3,]       2       1       0       1
It seems like something I should be able to do with intersect() and an apply function of some sort, but I can't get my head around it. The nearest thing I can find by searching is "Finding an efficient way to count the number of overlaps between interval sets in two tables?", but that deals with data tables and, as best I can tell, looks at numerical overlaps between line segments rather than lists of names.
Does anyone have any idea how to do this?

You could do this by looping through the rows of each data frame and then calculating the length of the intersection of the rows, omitting missing values:
apply(df1, 1, function(i) apply(df2, 1, function(j) length(na.omit(intersect(i, j)))))
# [,1] [,2] [,3] [,4]
# [1,] 0 2 1 0
# [2,] 0 1 2 0
# [3,] 2 1 0 1
Sample data:
(df1<-rbind(c("a","b", NA, NA),c("c","d","e","f"),c("g","h","i", NA),c("j", NA, NA, NA)))
# [,1] [,2] [,3] [,4]
# [1,] "a" "b" NA NA
# [2,] "c" "d" "e" "f"
# [3,] "g" "h" "i" NA
# [4,] "j" NA NA NA
(df2<-rbind(c("c","e","i", NA),c("f","g","h", NA),c("a","b","d","j")))
# [,1] [,2] [,3] [,4]
# [1,] "c" "e" "i" NA
# [2,] "f" "g" "h" NA
# [3,] "a" "b" "d" "j"
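A variation on the same idea, as a minimal sketch: strip the NA padding from each row once, then count the shared names for every pair of groups and label the result like the table in the question. The helper drop_na and the names groups1, groups2 and overlap are made up for illustration; the data are the sample matrices above.

```r
# Sample data from the answer: one group per row, padded with NA
df1 <- rbind(c("a","b", NA, NA), c("c","d","e","f"), c("g","h","i", NA), c("j", NA, NA, NA))
df2 <- rbind(c("c","e","i", NA), c("f","g","h", NA), c("a","b","d","j"))

# Hypothetical helper: turn each row into a character vector without NA padding
drop_na <- function(m) lapply(seq_len(nrow(m)), function(r) m[r, ][!is.na(m[r, ])])
groups1 <- drop_na(df1)
groups2 <- drop_na(df2)

# Rows index the groups of df2, columns the groups of df1
overlap <- sapply(groups1, function(g1)
  sapply(groups2, function(g2) length(intersect(g1, g2))))
dimnames(overlap) <- list(paste0("df2[", seq_along(groups2), ",]"),
                          paste0("df1[", seq_along(groups1), ",]"))
overlap
```

For data frames read with read.csv, converting with as.matrix() first should allow the same code to run, assuming the padding really is NA rather than empty strings.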

Related

Compare two matrices, keeping values in one matrix that are TRUE in the other

This seems to be an easy task, but I can't find a solution in R after looking here and elsewhere. I have two matrices, one with string values and another with logical values.
a <- matrix(c(
"A", "B", "C"
))
b <- matrix(c(
T, F, T
))
> b
[,1]
[1,] TRUE
[2,] FALSE
[3,] TRUE
> a
[,1]
[1,] "A"
[2,] "B"
[3,] "C"
I need to create a third matrix that keeps the values of the first where the second is TRUE and has NA everywhere else, like so:
> C
[,1]
[1,] "A"
[2,] NA
[3,] "C"
How do I achieve the above result?
C <- matrix(a[ifelse(b, T, NA)], ncol = ncol(a))
Here is an alternative that simply assigns NA wherever b is FALSE:
a[b==FALSE] <- NA
[,1]
[1,] "A"
[2,] NA
[3,] "C"
using which:
c<-a
c[which(b==FALSE)]<-NA
a <- a[b] might also work, depending on how you want the result: it drops the entries where b is FALSE instead of replacing them with NA.
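For reference, a minimal sketch that builds the masked copy without modifying a, using base R's replace(); the name masked is just illustrative. Because replace() returns a modified copy of its first argument, the matrix dimensions are kept.

```r
a <- matrix(c("A", "B", "C"))
b <- matrix(c(TRUE, FALSE, TRUE))

# replace() returns a copy of a with NA written wherever b is FALSE
masked <- replace(a, !b, NA)
masked

# a itself is left unchanged
a
```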

R - filtering Matrix based off True/False vector

I have a data structure that can contain both vectors and matrices. I want to filter it based on a TRUE/FALSE vector, but I can't figure out how to filter both kinds of element successfully.
result <- structure(list(aba = c(1, 2, 3, 4), beta = c("a", "b", "c", "d"),
chi = structure(c(0.438148361863568, 0.889733991585672, 0.0910745360888541,
0.0512442977633327, 0.812013201415539, 0.717306115897372, 0.995319503592327,
0.758843480376527, 0.366544214077294, 0.706843026448041, 0.108310810523108,
0.225777650484815, 0.831163870869204, 0.274351604515687, 0.323493955424055,
0.351171918679029), .Dim = c(4L, 4L))), .Names = c("aba", "beta", "chi"))
> result
$aba
[1] 1 2 3 4
$beta
[1] "a" "b" "c" "d"
$chi
[,1] [,2] [,3] [,4]
[1,] 0.43814836 0.8120132 0.3665442 0.8311639
[2,] 0.88973399 0.7173061 0.7068430 0.2743516
[3,] 0.09107454 0.9953195 0.1083108 0.3234940
[4,] 0.05124430 0.7588435 0.2257777 0.3511719
tf <- c(T,F,T,T)
What I would like to do is something like
> lapply(result,function(x) {ifelse(tf,x,NA)})
$aba
[1] 1 NA 3 4
$beta
[1] "a" NA "c" "d"
$chi
[1] 0.43814836 NA 0.09107454 0.05124430
but the $chi matrix structure is lost.
The result I'd expect is
ifelse(matrix(tf,ncol=4,nrow=4),result$chi,NA)
[,1] [,2] [,3] [,4]
[1,] 0.43814836 0.8120132 0.3665442 0.8311639
[2,] NA NA NA NA
[3,] 0.09107454 0.9953195 0.1083108 0.3234940
[4,] 0.05124430 0.7588435 0.2257777 0.3511719
The part I'm having trouble with is how to match the tf vector to the data. It feels like I need a conditional based on data type, which I'd like to avoid. Thoughts and answers are appreciated.
I don't see how you can avoid either checking the data type or the "dimensions" of the data. As such, I would propose something like:
lapply(result, function(x) {
if (is.null(dim(x))) x[!tf] <- NA else x[!tf, ] <- NA
x
})
# $aba
# [1] 1 NA 3 4
#
# $beta
# [1] "a" NA "c" "d"
#
# $chi
# [,1] [,2] [,3] [,4]
# [1,] 0.43814836 0.8120132 0.3665442 0.8311639
# [2,] NA NA NA NA
# [3,] 0.09107454 0.9953195 0.1083108 0.3234940
# [4,] 0.05124430 0.7588435 0.2257777 0.3511719
This seems fairly simple:
is.na(tf) <- !tf # convert FALSE to NA
result$chi[ tf, ] # and use the default behavior of "[" with NA arg
[,1] [,2] [,3] [,4]
[1,] 0.43814836 0.8120132 0.3665442 0.8311639
[2,] NA NA NA NA
[3,] 0.09107454 0.9953195 0.1083108 0.3234940
[4,] 0.05124430 0.7588435 0.2257777 0.3511719
But now I see that you wanted NAs at the corresponding positions of the atomic vectors as well. Unfortunately, "[" with the additional empty argument would error out on that type of object.
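One possible way around the type check, sketched under the assumption that length(tf) equals the length of every vector and the number of rows of every matrix in the list: a logical index is recycled over a matrix in column-major order, so x[!tf] <- NA blanks the same rows in every column. The small list toy below is made-up illustration data, not the result object from the question.

```r
tf <- c(TRUE, FALSE, TRUE, TRUE)

# Illustrative stand-in for the list in the question
toy <- list(aba  = c(1, 2, 3, 4),
            beta = c("a", "b", "c", "d"),
            chi  = matrix(1:16, nrow = 4))

# The logical index !tf is recycled along each column of a matrix,
# so the same assignment works for vectors and matrices alike
masked <- lapply(toy, function(x) {
  x[!tf] <- NA
  x
})
masked
```

This relies on the row count matching length(tf). If it does not, the recycling would silently blank the wrong cells, so an explicit dim() check is the safer choice for mixed shapes.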

Test arrays of paired words to find out which letters the pairs share

What I want to do is use apply instead of a loop to compare two arrays of character strings row by row, e.g. row one of x.str with row one of y.str.
x.str
[,1] [,2] [,3] [,4]
[1,] "c" "o" "m" "e"
[2,] "g" "o" "n" "e"
[3,] "b" "o" "o" "d"
[4,] "f" "i" "n" "e"
y.str
[,1] [,2] [,3] [,4]
[1,] "t" "o" "o" "t"
[2,] "j" "a" "m" "m"
[3,] "b" "e" "e" "n"
[4,] "l" "e" "t" "s"
If I was going to write it as a loop:
A = array(0,dim=dim(x.str))
for(i in 1:length(x.str[,1])){
A[i,] = ifelse(x.str[i,] %in% y.str[i,],1,0)
}
With the output:
[,1] [,2] [,3] [,4]
[1,] 0 1 0 0
[2,] 0 0 0 0
[3,] 1 0 0 0
[4,] 0 0 0 1
However, the dimensions of the real arrays will be approximately
array(0,dim=c(10000,12))
I therefore wanted to use apply instead, expecting it to be much quicker than a loop. I've looked all over this site and others and tried many different approaches, but I can't work out how to select the row currently being processed within apply so that I can use it in the function. Similar posts have suggested using:
nrow()
rownames()
I've used them like:
stringCom = function(x){
i = nrow(x)
ifelse(x.str[i,] %in% y.str1[i,],0,1)
}
apply(x.str,1,stringCom)
but all I keep getting is errors. I have tried:
test = function(x){
r = nrow(x)
r
}
apply(x.str,1,test)
This just gives NULL as its output. A similar thing happens with rownames, NROW, names, etc. I'm sure there is a very simple answer, but I cannot seem to find it.
Any suggestions/help would be greatly appreciated.
If you want to iterate two things at once, it's better to use some form of mapply. Here's some sample input data
x.str<-matrix(strsplit("cgbfoooimnoneede","")[[1]], ncol=4)
y.str<-matrix(strsplit("tjbloaeeomettmns","")[[1]], ncol=4)
then you could do something like
t(mapply(function(a,b) a%in%b,
split(x.str, row(x.str)), split(y.str, row(y.str)))+0)
# [,1] [,2] [,3] [,4]
# 1 0 1 0 0
# 2 0 0 0 0
# 3 1 0 0 0
# 4 0 0 0 1
which returns the same thing as the code you wrote
A = array(0,dim=dim(x.str))
for(i in 1:length(x.str[,1])){
A[i,] = ifelse(x.str[i,] %in% y.str[i,],1,0)
}
A
# [,1] [,2] [,3] [,4]
# [1,] 0 1 0 0
# [2,] 0 0 0 0
# [3,] 1 0 0 0
# [4,] 0 0 0 1
But it's not always true that for loops are slow. You should benchmark different strategies to see what works best for a given application. The biggest bottleneck is usually memory management: as long as you preallocate the memory needed for the result, a for loop can often be faster.
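As for selecting the current row inside an apply-style call, one common pattern is to iterate over row indices rather than over the rows themselves. A minimal sketch using vapply and the sample matrices above (the name shared is made up):

```r
x.str <- matrix(strsplit("cgbfoooimnoneede", "")[[1]], ncol = 4)
y.str <- matrix(strsplit("tjbloaeeomettmns", "")[[1]], ncol = 4)

# Iterate over row indices so the same i selects the row in both matrices;
# vapply returns one column per i, hence the transpose, and + 0 turns TRUE/FALSE into 1/0
shared <- t(vapply(seq_len(nrow(x.str)),
                   function(i) x.str[i, ] %in% y.str[i, ],
                   logical(ncol(x.str)))) + 0
shared
```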

R: duplicates elimination in a matrix, keeping track of multiplicities

I have a basic problem with R.
I have produced the matrix
M
[,1] [,2]
[1,] "a" "1"
[2,] "b" "2"
[3,] "a" "3"
[4,] "c" "1"
I would like to obtain the 3x3 matrix N
[,1] [,2] [,3]
[1,] "a" "1" "3"
[2,] "b" "2" NA
[3,] "c" "1" NA
obtained by eliminating duplicates in M[,1] and writing in N[i,2], N[i,3] the values in M[,2] corresponding to the same element in M[,1], for all i's. The "NA"'s in N[,3] correspond to the singletons in M[,1].
I know how to eliminate duplicates from a vector in R; my problem is keeping track of the elements in M[,2] and writing them into the resulting matrix N. I tried for loops, but they do not work well in my "real world" case, where the matrices are much bigger.
Any suggestions?
I thank you very much.
You can use dcast in the reshape2 package after turning your matrix to a data.frame. To reverse the process you can use melt.
df = data.frame(c("a","b","a","c"),c(1:3,1))
colnames(df) = c("factor","obs")
require(reshape2)
df2=dcast(df, factor ~ obs)
now df2 is:
factor 1 2 3
1 a 1 NA 3
2 b NA 2 NA
3 c 1 NA NA
To me it makes more sense to keep it like this. But if you need it in your format:
res = t(apply(df2,1,function(x) { newLine = as.vector(x[which(!is.na(x))],mode="any"); newLine=c(newLine,rep(NA, ncol(df2)-length(newLine) )) }))
res = res[,-ncol(res)]
[,1] [,2] [,3]
[1,] "a" " 1" " 3"
[2,] "b" " 2" NA
[3,] "c" " 1" NA
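A base-R sketch of the same reshaping that works directly on the character matrix, without reshape2: split the second column by the first, pad each group to a common length with NA, and bind the pieces back together. The names vals, width, padded and N are chosen here for illustration.

```r
M <- cbind(c("a", "b", "a", "c"), c("1", "2", "3", "1"))

# Group the second column by the first, keeping keys in order of first appearance
vals <- split(M[, 2], factor(M[, 1], levels = unique(M[, 1])))

# Pad every group to the longest length; `length<-` fills with NA
width <- max(lengths(vals))
padded <- lapply(vals, function(v) { length(v) <- width; v })

# One row per unique key: the key followed by its values
N <- unname(cbind(names(vals), do.call(rbind, padded)))
N
```

Using do.call(rbind, ...) rather than sapply keeps the result a matrix even when every key occurs only once.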

R/Igraph Display edge weights in an edge list?

Is there any way to display edge weights when viewing the graph object as an edge list?
I want to do something in the spirit of:
get.edgelist(graph, attr='weight')
so as to view the edge pairings with the weights listed alongside the nodes, but that seems not to be allowed. The only way I know to view the weights is to view the network data as an adjacency matrix. I'm hoping that's not the only way.
Using the example in the help page for function get.edgelist in pkg:igraph:
> cbind( get.edgelist(g) , round( E(g)$weight, 3 ))
[,1] [,2] [,3]
[1,] "a" "b" "0.342"
[2,] "b" "d" "0.181"
[3,] "b" "e" "0.403"
[4,] "b" "f" "0.841"
[5,] "d" "f" "0.997"
[6,] "e" "g" "0.029"
[7,] "a" "h" "0.17"
[8,] "b" "j" "0.69"
[9,] "g" "j" "0.422"
Another option is to use get.data.frame() from the igraph package
# create a random graph with weighted edges
g <- erdos.renyi.game(5, 5/10, directed = TRUE)
E(g)$weight <- runif(length(E(g)), 1, 5)
# pull nodes and edge weights
get.data.frame(g)
from to weight
1 1 5 4.716679
2 2 1 4.119414
3 1 2 4.535791
4 2 5 2.486553
5 3 2 4.932118
6 5 2 3.353693
7 1 3 3.003062
8 2 3 3.350118
9 1 4 2.929069
10 2 4 4.929474
11 5 4 4.333134
