I am looking for an efficient way to combine selected columns in a logical matrix by "ANDing" them together and ending up with a new matrix. An example of what I am looking for:
matrixData <- rep(c(TRUE, TRUE, FALSE), 8)
exampleMatrix <- matrix(matrixData, nrow=6, ncol=4, byrow=TRUE)
exampleMatrix
[,1] [,2] [,3] [,4]
[1,] TRUE TRUE FALSE TRUE
[2,] TRUE FALSE TRUE TRUE
[3,] FALSE TRUE TRUE FALSE
[4,] TRUE TRUE FALSE TRUE
[5,] TRUE FALSE TRUE TRUE
[6,] FALSE TRUE TRUE FALSE
The columns to be ANDed with each other are specified in a numeric vector of length ncol(exampleMatrix), where columns that should be ANDed together share the same value (a value from 1 to n, where n <= ncol(exampleMatrix) and every value in 1:n is used at least once). The resulting matrix should have its columns in order from 1:n. For example, if the vector that specifies the column groups is
colGroups <- c(3, 2, 2, 1)
Then the resulting matrix would be
[,1] [,2] [,3]
[1,] TRUE FALSE TRUE
[2,] TRUE FALSE TRUE
[3,] FALSE TRUE FALSE
[4,] TRUE FALSE TRUE
[5,] TRUE FALSE TRUE
[6,] FALSE TRUE FALSE
Where in the resulting matrix
[,1] = exampleMatrix[,4]
[,2] = exampleMatrix[,2] & exampleMatrix[,3]
[,3] = exampleMatrix[,1]
My current way of doing this looks basically like this:
finalMatrix <- matrix(TRUE, nrow=nrow(exampleMatrix), ncol=3)
for (i in 1:3){
selectedColumns <- exampleMatrix[,colGroups==i, drop=FALSE]
finalMatrix[,i] <- rowSums(selectedColumns)==ncol(selectedColumns)
}
Where rowSums(selectedColumns)==ncol(selectedColumns) is an efficient way to AND all of the columns of a matrix together.
My problem is that I am doing this on very big matrices (millions of rows) and I am looking for any way to make this quicker. My first instinct would be to use apply in some way but I can't see any way to use that to improve efficiency as I am not performing the operation in the for loop many times but instead it is the operation in the loop that is slow.
In addition, any tips to reduce memory allocation would be very useful, as I currently have to run gc() within the loop frequently to avoid running out of memory completely, and it is a very expensive operation that significantly slows everything down as well. Thanks!
For a more representative example, this is a much larger exampleMatrix:
matrixData <- rep(c(TRUE, TRUE, FALSE), 8e7)
exampleMatrix <- matrix(matrixData, nrow=6e7, ncol=4, byrow=TRUE)
From your example, I understand that there are very few columns and very many rows. In this case, it'll be efficient to just do a simple loop over colGroups (30% improvement over your suggestion):
for (jj in seq_along(colGroups)) {
  finalMatrix[, colGroups[jj]] <-
    finalMatrix[, colGroups[jj]] & exampleMatrix[, jj]
}
I think it will be hard to beat this without parallelizing. The loop can be parallelized if there are more columns, though that has to be done carefully, in batches.
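For reference, here is a self-contained sketch of the loop above on the small example from the question. It assumes finalMatrix starts out all TRUE, as in the question's code, and sizes it with max(colGroups) rather than a hard-coded 3; the helper variable g is only there for readability.

```r
exampleMatrix <- matrix(rep(c(TRUE, TRUE, FALSE), 8),
                        nrow = 6, ncol = 4, byrow = TRUE)
colGroups <- c(3, 2, 2, 1)

# Start from all TRUE so the first AND for each group just copies the column.
finalMatrix <- matrix(TRUE, nrow = nrow(exampleMatrix), ncol = max(colGroups))

for (jj in seq_along(colGroups)) {
  g <- colGroups[jj]  # output column that input column jj belongs to
  finalMatrix[, g] <- finalMatrix[, g] & exampleMatrix[, jj]
}

finalMatrix
```

Each pass touches one input column and one output column, so no intermediate sub-matrix is built the way exampleMatrix[, colGroups == i, drop = FALSE] builds one.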
As far as I can tell, this is an aggregation across columns using the all function. So if you transpose to rows, then use colGroups as the grouping factor to apply all, then transpose back to columns, you should get the intended result:
t(aggregate(t(exampleMatrix), list(colGroups), FUN=all)[-1])
# [,1] [,2] [,3]
#V1 TRUE FALSE TRUE
#V2 TRUE FALSE TRUE
#V3 FALSE TRUE FALSE
#V4 TRUE FALSE TRUE
#V5 TRUE FALSE TRUE
#V6 FALSE TRUE FALSE
The [-1] just drops the group-identifier variable which you don't require in the final output.
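The same group-wise AND can also be sketched in base R without transposing, which may matter when the matrix has millions of rows, since transposing turns those rows into columns. This is an alternative formulation rather than the code from this answer, and the names groupIdx and grouped are purely illustrative.

```r
exampleMatrix <- matrix(rep(c(TRUE, TRUE, FALSE), 8),
                        nrow = 6, ncol = 4, byrow = TRUE)
colGroups <- c(3, 2, 2, 1)

# Column indices belonging to each group, ordered by group value (1, 2, 3).
groupIdx <- split(seq_along(colGroups), colGroups)

# For each group, AND its columns together one at a time.
grouped <- sapply(groupIdx, function(idx) {
  Reduce(`&`, lapply(idx, function(j) exampleMatrix[, j]))
})

grouped
```

Because split() orders the groups by value, the result columns come out in 1:n order, with the group values as column names.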
If you're working with stupid big data, the by-group aggregation could be done in data.table as well:
library(data.table)
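# keyby sorts the groups, so the columns come out in 1:n order
# (plain by keeps the order of first appearance, here 3, 2, 1)
t(as.data.table(t(exampleMatrix))[, lapply(.SD, all), keyby=colGroups][, -1])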
I am trying to perform the following outer operation:
x <- c(1, 11)
choices <- list(1:10, 10:20)
outer(x, choices, FUN=`%in%`)
I expect the following matrix:
[,1] [,2]
[1,] TRUE FALSE
[2,] FALSE TRUE
which would correspond to the following operations:
outer(x, choices, FUN=paste, sep=" %in% ")
[,1] [,2]
[1,] "1 %in% 1:10" "1 %in% 10:20"
[2,] "11 %in% 1:10" "11 %in% 10:20"
But for some reason I am getting:
[,1] [,2]
[1,] FALSE FALSE
[2,] FALSE FALSE
What is happening?
As expressed in the comments, the table argument of match (the function called by %in%) isn't intended to be a list (if it is, it gets coerced to a character vector). You should use vapply:
vapply(choices,function(y) x %in% y,logical(length(x)))
# [,1] [,2]
#[1,] TRUE FALSE
#[2,] FALSE TRUE
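To see where the all-FALSE matrix comes from, it may help to look at the coercion directly. outer calls FUN once, with both arguments expanded to full length, so %in% receives the list itself as its table. The sketch below only uses objects from the question; res is an illustrative name.

```r
x <- c(1, 11)
choices <- list(1:10, 10:20)

# match() converts a list table to character, one string per list element.
as.character(choices)

# So 1 is compared against those strings, not against the numbers inside.
1 %in% choices

# Looping over the list elements keeps each table a plain integer vector.
res <- sapply(choices, function(y) x %in% y)
res
```

sapply is used here only for brevity; vapply, as in the answer above, is the stricter choice because it checks the type and length of each result.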
Another way, closer to your train of thought, is to use expand.grid() to create the combinations and then apply %in% across the two columns with mapply, i.e.
d1 <- expand.grid(x, choices)
matrix(mapply(`%in%`, d1$Var1, d1$Var2), nrow = length(x))
#or you can use Map(`%in%`, ...) in order to keep results in a list
OR
As @nicola suggests, wrapping x in a list lets mapply return the matrix directly:
d1 <- expand.grid(list(x), choices)
mapply(`%in%`, d1$Var1, d1$Var2)
both giving,
[,1] [,2]
[1,] TRUE FALSE
[2,] FALSE TRUE
I have a matrix, named "mat", and a smaller matrix, named "center".
temp = c(1.8421,5.6586,6.3526,2.904,3.232,4.6076,4.8,3.2909,4.6122,4.9399)
mat = matrix(temp, ncol=2)
[,1] [,2]
[1,] 1.8421 4.6076
[2,] 5.6586 4.8000
[3,] 6.3526 3.2909
[4,] 2.9040 4.6122
[5,] 3.2320 4.9399
center = matrix(c(3, 6, 3, 2), ncol=2)
[,1] [,2]
[1,] 3 3
[2,] 6 2
I need to compute the (squared) distance between each row of mat and every row of center. For example, the distance between mat[1,] and center[1,] can be computed as
diff = mat[1,]-center[1,]
t(diff)%*%diff
[,1]
[1,] 3.92511
Similarly, I can find the distance of mat[1,] and center[2,]
diff = mat[1,]-center[2,]
t(diff)%*%diff
[,1]
[1,] 24.08771
Repeating this process for each row of mat, I end up with
[,1] [,2]
[1,] 3.925110 24.087710
[2,] 10.308154 7.956554
[3,] 11.324550 1.790750
[4,] 2.608405 16.408805
[5,] 3.817036 16.304836
I know how to implement it with for-loops. I was really hoping someone could tell me how to do it with some kind of an apply() function, maybe mapply() I guess.
Thanks
apply(center, 1, function(x) colSums((x - t(mat)) ^ 2))
# [,1] [,2]
# [1,] 3.925110 24.087710
# [2,] 10.308154 7.956554
# [3,] 11.324550 1.790750
# [4,] 2.608405 16.408805
# [5,] 3.817036 16.304836
If you want apply for the expressiveness of the code, that's one thing, but it is still a loop with different syntax. This can be done without any loops, or with a very small loop across center instead of mat. I'd transpose first, because it's a good habit to move as much work as possible out of the apply call. (BrodieG's answer is functionally almost identical.) These work because R automatically recycles the smaller vector along the matrix, and does so much faster than apply or for.
tm <- t(mat)
apply(center, 1, function(m){
colSums((tm - m)^2) })
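A fully loop-free variant can be sketched with the identity ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, evaluated for all row pairs at once. This is a swapped-in technique, not the code from the answers here, and the names sqDist and viaApply are illustrative. The expansion can lose some precision when two points are nearly identical, so treat it as a sketch.

```r
mat <- matrix(c(1.8421, 5.6586, 6.3526, 2.904, 3.232,
                4.6076, 4.8, 3.2909, 4.6122, 4.9399), ncol = 2)
center <- matrix(c(3, 6, 3, 2), ncol = 2)

# Squared row norms combined pairwise, minus twice the pairwise dot products.
sqDist <- outer(rowSums(mat^2), rowSums(center^2), "+") -
  2 * tcrossprod(mat, center)

# Cross-check against the apply() version above.
viaApply <- apply(center, 1, function(m) colSums((t(mat) - m)^2))
all.equal(sqDist, viaApply)
```

tcrossprod(mat, center) is mat %*% t(center), so the result has one row per row of mat and one column per row of center.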
Use dist and then extract the relevant submatrix:
ix <- 1:nrow(mat)
as.matrix( dist( rbind(mat, center) )^2 )[ix, -ix]
#           6         7
# 1 3.925110 24.087710
# 2 10.308154 7.956554
# 3 11.324550 1.790750
# 4 2.608405 16.408805
# 5 3.817036 16.304836
You could use outer as well
d <- function(i, j) sum((mat[i, ] - center[j, ])^2)
outer(1:nrow(mat), 1:nrow(center), Vectorize(d))
This will solve it
t(apply(mat,1,function(row){
d1<-sum((row-center[1,])^2)
d2<-sum((row-center[2,])^2)
return(c(d1,d2))
}))
Result:
[,1] [,2]
[1,] 3.925110 24.087710
[2,] 10.308154 7.956554
[3,] 11.324550 1.790750
[4,] 2.608405 16.408805
[5,] 3.817036 16.304836
I am trying to compare 1st row of a matrix with all rows of the same matrix. But the vectorized comparison is not returning correct results. Any reason why this may be happening?
m <- matrix(c(1,2,3,1,2,4), nrow=2, ncol=3, byrow=TRUE)
> m
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 1 2 4
> # Why does the first row not have 3 TRUE values?
> m[1,] == m
[,1] [,2] [,3]
[1,] TRUE FALSE FALSE
[2,] FALSE FALSE FALSE
> m[1,] == m[1,]
[1] TRUE TRUE TRUE
> m[1,] == m[2,]
[1] TRUE TRUE FALSE
Follow-up. My actual data has a large number of rows (at least 10 million), so both time and memory add up. Are there any additional suggestions beyond those offered by others below?
m <- matrix(rep(c(1,2,3), 1000000), ncol=3, byrow=TRUE)
> # by @alexis_laz
> m1 <- matrix(m[1,], nrow = nrow(m), ncol = ncol(m), byrow = TRUE)
> system.time(m == m1)
user system elapsed
0.21 0.03 0.31
> object.size(m1)
24000112 bytes
> # by @PaulHiemstra
> system.time( t(apply(m, 1, function(x) x == m[1,])) )
user system elapsed
35.18 0.08 36.04
Follow-up 2. @alexis_laz, you are correct. I want to compare every row with each other and have posted a follow-up question on that (How to vectorize comparing each row of matrix with all other rows).
In the comparison m[1,] == m, the first term m[1,] is recycled (once) to equal the length of m. The comparison is then done column-wise.
You're comparing c(1,2,3) with c(1,1,2,2,3,4), that is, c(1,2,3,1,2,3) with c(1,1,2,2,3,4) after recycling, so you get one TRUE followed by five FALSE (packaged as a matrix to match the dimensions of m).
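The two vectors that actually get compared can be written out explicitly. This is only an illustration of the recycling described above; lhs and rhs are made-up names.

```r
m <- matrix(c(1, 2, 3, 1, 2, 4), nrow = 2, ncol = 3, byrow = TRUE)

lhs <- rep(m[1, ], length.out = length(m))  # m[1, ] recycled to six values
rhs <- as.vector(m)                         # m unrolled column by column

rbind(lhs, rhs)  # line the two vectors up element by element
lhs == rhs       # the same values that m[1, ] == m stores in its matrix
```

Only the first position lines up, because m is stored column by column while m[1, ] runs along a row.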
As @MatthewLundberg pointed out, the recycling rules of R do not behave as you expected. In my opinion it is always better to state explicitly what to compare and not rely on R's assumptions. One way to make the correct comparison:
t(apply(m, 1, function(x) x == m[1,]))
[,1] [,2] [,3]
[1,] TRUE TRUE TRUE
[2,] TRUE TRUE FALSE
or:
m == rbind(m[1,], m[1,])
[,1] [,2] [,3]
[1,] TRUE TRUE TRUE
[2,] TRUE TRUE FALSE
or by making R's recycling work in your favor (thanks to @Arun):
t(t(m) == m[1,])
[,1] [,2] [,3]
[1,] TRUE TRUE TRUE
[2,] TRUE TRUE FALSE
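Two further ways to line the first row up with the column-major layout are sketched below. They are offered as options to try rather than benchmarked recommendations, and the names viaRep and viaSweep are illustrative. Both still build a full-size intermediate, so the memory concern from the follow-up is not removed.

```r
m <- matrix(c(1, 2, 3, 1, 2, 4), nrow = 2, ncol = 3, byrow = TRUE)

# Repeat each element of the first row nrow(m) times, so value k covers
# every entry of column k in column-major order.
viaRep <- m == rep(m[1, ], each = nrow(m))

# sweep() applies the comparison along columns (MARGIN = 2).
viaSweep <- sweep(m, 2, m[1, ], FUN = "==")

viaRep
```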
Basically I'm looking to write a function that will take a vector of strings and a search term as input, and output a boolean vector. After this, I'd also like to take a list of strings and run it through this same function to output multiple results vectors, one for each string.
So the initial data looks like:
> searchVector <- cbind(c("aaa1","aaa2","","bbb1,aaa1,ccc1", "ddd1,ccc1,aaa1"))
> searchVector
[,1]
[1,] "aaa1"
[2,] "aaa2"
[3,] ""
[4,] "bbb1,aaa1,ccc1"
[5,] "ddd1,ccc1,aaa1"
and this is what we'd hope to see:
>findTrigger(c("aaa","bbb"),searchVector)
[aaa] [bbb]
[1,] 1 0
[2,] 1 0
[3,] 0 0
[4,] 1 1
[5,] 1 0
I've made the following attempt:
searchfunction <- function (searchTerms, searchVector) {
output = matrix( nrow = length(searchVector),
ncol = length(searchTerms),
dimnames = searchTerms)
for (j in seq(1,length(searchTerms)))
{
for (i in seq(1,length(searchVector)))
{
output[i,j]=is.numeric(pmatch(searchTerms[j], searchVector[i]))
}
}
return(as.numeric(output))
}
But I just get a matrix of all 1's. I'm fairly new to R and I've looked around online, but haven't had any luck. Any help would be greatly appreciated, Thanks!
The key is to use the function grepl. This should get you started:
searchVector <- c("aaa1","aaa2","","bbb1,aaa1,ccc1", "ddd1,ccc1,aaa1")
res <- lapply(c('aaa','bbb'),function(pattern,x) as.numeric(grepl(pattern = pattern,x = x)),x = searchVector)
do.call(cbind,res)
To explore this a bit, start with just grepl:
> grepl('aaa',searchVector)
[1] TRUE TRUE FALSE TRUE TRUE
> as.numeric(grepl('aaa',searchVector))
[1] 1 1 0 1 1
Then I'm just wrapping that up in lapply, to loop over the vector c('aaa','bbb'). This will return a list of vectors, which we then combine into the matrix you indicated using do.call and cbind.
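Wrapped up as the findTrigger function from the question, this could look roughly like the sketch below. The use of vapply and of fixed = TRUE (treating each search term as a literal substring rather than a regular expression) are assumptions about what is wanted, not part of the answer above. As a side note, the original attempt returns all 1's because pmatch gives an integer NA when nothing matches, and is.numeric() is TRUE for that as well.

```r
findTrigger <- function(searchTerms, searchVector) {
  # One column per search term; 1 where the term occurs in the string.
  # fixed = TRUE is an assumption: terms are literal substrings, not regexes.
  vapply(searchTerms,
         function(term) as.integer(grepl(term, searchVector, fixed = TRUE)),
         integer(length(searchVector)))
}

searchVector <- cbind(c("aaa1", "aaa2", "", "bbb1,aaa1,ccc1", "ddd1,ccc1,aaa1"))
res <- findTrigger(c("aaa", "bbb"), searchVector)
res
```

Because the search terms are a character vector, vapply uses them as column names, which gives the labelled columns the question asks for.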
mapply and grep or grepl (thanks joran) are your friends:
searchTerms <- c("aaa", "bbb")
searchVector <- cbind(c("aaa1","aaa2","","bbb1,aaa1,ccc1", "ddd1,ccc1,aaa1"))
M <- mapply(grepl, searchTerms, MoreArgs=list(x=searchVector))
M
aaa bbb
[1,] TRUE FALSE
[2,] TRUE FALSE
[3,] FALSE FALSE
[4,] TRUE TRUE
[5,] TRUE FALSE
If you want it as 1,0: apply(M,2,as.numeric)