I want to identify whether two matrices have NAs in the same spots.
Setup:
We have three matrices. I want to run a function that tells me that mat1 and mat2 have NAs in identical spots, and that mat3 has NAs in different spots from mat1 (and mat2):
mat1 <- matrix(nrow = 2, ncol = 2, data = c(NA, 0, 0, NA))
mat2 <- matrix(nrow = 2, ncol = 2, data = c(NA, 0, 0, NA))
mat3 <- matrix(nrow = 2, ncol = 2, data = c(NA, 0, 0, 0))
Compare the NA status of all elements:
> all(is.na(mat1) == is.na(mat2))
[1] TRUE
> all(is.na(mat1) == is.na(mat3))
[1] FALSE
In a function I'd do this:
> nanana <- function(m1, m2) { !any(is.na(m1) != is.na(m2)) }
I've inverted the logic so that any() can stop scanning as soon as it finds a difference. In practice the savings are small: the element-wise comparison is evaluated in full before any() or all() sees it, and all() likewise stops scanning at the first FALSE, so the two forms behave almost identically; it might save you a millisecond or two at best.
> nanana(mat1, mat2)
[1] TRUE
> nanana(mat1, mat3)
[1] FALSE
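A rough illustration (a sketch, not a careful benchmark; timings will vary by machine): the element-wise comparison is computed in full before any() or all() sees it, so both forms do essentially the same work.

big1 <- matrix(rnorm(1e6), 1000, 1000)
big2 <- big1
big1[1] <- NA  # mismatch at the very first element
system.time(all(is.na(big1) == is.na(big2)))
system.time(!any(is.na(big1) != is.na(big2)))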
We can write a function which compares the positions of NA elements in two matrices:
identical_NA_matrix <- function(m1, m2) {
  identical(which(is.na(m1), arr.ind = TRUE), which(is.na(m2), arr.ind = TRUE))
}
identical_NA_matrix(mat1, mat3)
#[1] FALSE
identical_NA_matrix(mat1, mat2)
#[1] TRUE
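A simpler equivalent is worth noting (a sketch, assuming both matrices have the same dimensions): compare the logical NA masks directly. is.na() preserves the dim attribute and identical() compares it, so this also distinguishes differently shaped matrices.

identical(is.na(mat1), is.na(mat2))
#[1] TRUE
identical(is.na(mat1), is.na(mat3))
#[1] FALSE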
Related
Suppose I have a matrix,
mat <- matrix((1:9)^2, 3, 3)
I can slice the matrix like so
> mat[2:3, 2]
[1] 25 36
How does one store the subscript as a variable? That is, what should my_sub be, such that
> mat[my_sub]
[1] 25 36
A list gives an "invalid subscript type" error. A vector loses the multidimensionality. It seems like such a basic operation not to have a primitive type that fits this usage.
I know I can access the matrix via vector addressing, which means converting from [2:3, 2] to c(5, 6), but that mapping presumes knowledge of the matrix shape. What if I simply want [2:3, 2] for any matrix shape (assuming it has at least those dimensions)?
Here are some alternatives. They both generalize to higher-dimensional arrays.
1) Matrix subscripting. If the indexes are all scalar except possibly one, as in the question, then:
mi <- cbind(2:3, 2)
mat[mi]
# test
identical(mat[mi], mat[2:3, 2])
## [1] TRUE
In higher dimensions:
a <- array(1:24, 2:4)
mi <- cbind(2, 2:3, 3)
a[mi]
# test
identical(a[mi], a[2, 2:3, 3])
## [1] TRUE
It would be possible to extend this to eliminate the scalar restriction using:
L <- list(2:3, 2:3)
array(mat[as.matrix(do.call(expand.grid, L))], lengths(L))
however, in light of (2), which also uses do.call but avoids the need for expand.grid, it seems unnecessarily complex.
2) do.call. This approach does not have the scalar limitation. mat and a are from above:
L2 <- list(2:3, 1:2)
do.call("[", c(list(mat), L2))
# test
identical(do.call("[", c(list(mat), L2)), mat[2:3, 1:2])
## [1] TRUE
L3 <- list(2, 2:3, 3:4)
do.call("[", c(list(a), L3))
# test
identical(do.call("[", c(list(a), L3)), a[2, 2:3, 3:4])
## [1] TRUE
This could be made prettier by defining:
`%[%` <- function(x, indexList) do.call("[", c(list(x), indexList))
mat %[% list(2:3, 1:2)
a %[% list(2, 2:3, 3:4)
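A quick check, in the same style as the tests above, that the helper agrees with direct subscripting:

# test
identical(mat %[% list(2:3, 1:2), mat[2:3, 1:2])
## [1] TRUE
identical(a %[% list(2, 2:3, 3:4), a[2, 2:3, 3:4])
## [1] TRUE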
Use which with the argument arr.ind = TRUE. (Note that mat == x recycles x over mat, which triggers the warning below; in this case the positions of 25 and 36 are still found.)
x <- c(25, 36)
inx <- which(mat == x, arr.ind = TRUE)
Warning message:
In mat == x :
longer object length is not a multiple of shorter object length
mat[inx]
#[1] 25 36
This is an interesting question. The subset function can actually help. You cannot subset your matrix directly using a vector or a list, but you can store the indexes in a list and use subset to do the trick.
mat <- matrix(1:12, nrow=4)
mat[2:3, 1:2]
# example using subset
subset(mat, subset = 1:nrow(mat) %in% 2:3, select = 1:2)
# double check
identical(mat[2:3, 1:2],
subset(mat, subset = 1:nrow(mat) %in% 2:3, select = 1:2))
# TRUE
Actually, we can write a custom function if we want to store the row and column indexes in the same list.
cust.subset <- function(mat, dim.list){
  subset(mat, subset = 1:nrow(mat) %in% dim.list[[1]], select = dim.list[[2]])
}
# initialize a list that includes your sub-setting indexes
sbdim <- list(2:3, 1:2)
sbdim
# [[1]]
# [1] 2 3
# [[2]]
# [1] 1 2
# subset using your custom f(x) and your list
cust.subset(mat, sbdim)
# [,1] [,2]
# [1,] 2 6
# [2,] 3 7
I want to get the column means for the last list element, which is a sparse matrix multiplied by a regular matrix. Whenever I use colMeans, however, I get an error. For example:
# Use the igraph package to create a sparse matrix
library(igraph)
my.lattice <- get.adjacency(graph.lattice(length = 5, dim = 2))
# Create a conformable matrix of TRUE and FALSE values
start <- matrix(sample(c(TRUE, FALSE), 50, replace = T), ncol = 2)
# Multiply the sparse matrix by the regular matrix, and save the results to a list
out <- list()
out[[1]] <- my.lattice %*% start
out[[2]] <- my.lattice %*% out[[1]]
# Try to get column means of the last element
colMeans(tail(out, 1)[[1]]) # Selecting first element because tail creates a list
# Error in colMeans(tail(out, 1)[[1]]) :
# 'x' must be an array of at least two dimensions
# But tail(out, 1)[[1]] seems to have two dimensions
dim(tail(out, 1)[[1]])
# [1] 25 2
Any idea what's causing this error, or what I can do about it?
It looks like explicitly calling the colMeans function from the Matrix package works:
> Matrix::colMeans(tail(out, 1)[[1]])
# [1] 4.48 5.48
Thanks to user20650 for this suggestion.
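An alternative sketch (assuming the result is small enough to hold as a dense base matrix): coerce the Matrix-class object to a base matrix first, so that the base colMeans applies.

colMeans(as.matrix(tail(out, 1)[[1]]))
# [1] 4.48 5.48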
I have a setup that looks like the one below:
for (V in seq(1, 250, by = 5)) {
  for (n in seq(1, 250, by = 5)) {
    # 1) working algorithm creating a probability,
    #    i.e. a vector in the range [0, 1]

    # 2) take the natural log of this probability
    a <- log(lag(Probability), base = exp(1))

    # 3) calculate price differences
    b <- abs(diff(Price) - 1)

    # 4) then compute the correlation between a and b
    cor(a, b)

    # 5) here I'd like to save this in the corresponding index of a matrix
  }
}
So that I get a [V, n]-sized matrix as output that collects a value from each loop iteration.
I have a few problems with this.
The first problem is that my correlation is not computable, as the Probability is often 0, creating ln(0) = -Inf entries in the log(Probability) vector. Is there a way to compute the standard deviation or correlation of a logged vector with -Inf entries?
My second question is: how do I save this correlation output into a matrix, filling one entry per loop iteration?
Thanks for your help. I hope this is clear enough.
For your second question (how to save the correlation output into a matrix), you could initialise a matrix before the loop and store each computed correlation at the corresponding index, like:
sz <- seq(1, 250, by = 5)
out_mat <- matrix(0, nrow = length(sz), ncol = length(sz))
# then continue with your for-loop
for (V in seq_along(sz)) {
  for (n in seq_along(sz)) {
    # here, instead of accessing V and n when computing the probability,
    # use sz[V] and sz[n]
    ...
    ...
    # after computing the correlation, index with V and n (not sz[V] or sz[n])
    out_mat[V, n] <- c  # c holds the value of cor(a, b)
  }
}
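Here is a self-contained toy version of the same pattern; the stored value is a stand-in for cor(a, b), just to show the sz[V]/sz[n] versus V/n indexing:

sz <- seq(1, 250, by = 5)
out_mat <- matrix(0, nrow = length(sz), ncol = length(sz))
for (V in seq_along(sz)) {
  for (n in seq_along(sz)) {
    out_mat[V, n] <- sz[V] + sz[n]  # placeholder for cor(a, b)
  }
}
dim(out_mat)
# [1] 50 50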
What you can do with -Inf is replace it with NA, for example:
x <- runif(10)
x[3] <- 1/0
> is.infinite(x)
[1] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
x[is.infinite(x)] <- NA
> x
[1] 0.09936348 0.66624531 NA 0.90689357 0.71578917 0.14655174
[7] 0.59561047 0.41944552 0.67203026 0.03263173
And use the na.rm argument for sd:
> sd(x, na.rm = TRUE)
[1] 0.3126829
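cor() offers a similar mechanism through its use argument; for example, use = "complete.obs" drops observations that are NA in either vector (a sketch continuing from x above):

y <- runif(10)
cor(x, y, use = "complete.obs")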
I have two data frames with different numbers of rows but the same number of columns. In the example below, data frame 1 is 4 x 2 and data frame 2 is 3 x 2. I need a 4 x 3 logical matrix where TRUE indicates that the corresponding rows of the two data frames match. The example below works, but it takes a very long time with larger data frames (I'm trying two data frames with about 5,000 rows each, still just two columns). Is there a more efficient way of doing this?
> df1 <- data.frame(row.names=1:4, var1=c(TRUE, TRUE, FALSE, FALSE), var2=c(1,2,3,4))
> df2 <- data.frame(row.names=5:7, var1=c(FALSE, TRUE, FALSE), var2=c(5,2,3))
>
> m1 <- t(as.matrix(df1))
> m2 <- as.matrix(df2)
>
> apply(m2, 1, FUN=function(x) { apply(m1, 2, FUN=function(y) { all(x==y) } ) })
5 6 7
1 FALSE FALSE FALSE
2 FALSE TRUE FALSE
3 FALSE FALSE TRUE
4 FALSE FALSE FALSE
Thanks in advance for any help.
I was drawn here by your post on R-bloggers: http://jason.bryer.org/posts/2013-01-24/Comparing_Two_Data_Frames.html
If, like you say, your data has no numeric vectors, then I think I can suggest a faster approach. It consists of:
turning your two data.frames into two matrices of integers, and
computing the Euclidean distance between the rows of your two data sets.
Quick example using your data:
mat1 <- as.matrix(sapply(df1, as.integer))
mat2 <- as.matrix(sapply(df2, as.integer))
library(fields)
rdist(mat1, mat2) < 1e-9
# [,1] [,2] [,3]
# [1,] FALSE FALSE FALSE
# [2,] FALSE TRUE FALSE
# [3,] FALSE FALSE TRUE
# [4,] FALSE FALSE FALSE
A few comments:
if your data contained character vectors, you would have to convert them into factors and make sure that they share the same factor levels (see the sketch after these comments).
I used the fields package to compute the Euclidean distance. It uses a Fortran implementation and is, as far as I know, the fastest R package around for the task (and I have tested many, trust me).
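Here is a sketch of that character-column case (chr1 and chr2 are hypothetical columns, not part of the example data): build a common set of levels first, so that equal strings map to equal integer codes.

chr1 <- c("a", "b", "c")
chr2 <- c("b", "c", "d")
lvls <- union(chr1, chr2)                # shared factor levels
as.integer(factor(chr1, levels = lvls))  # 1 2 3
as.integer(factor(chr2, levels = lvls))  # 2 3 4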
I'm honestly not sure if this will be faster, but you might try:
foo <- Vectorize(function(x,y) {all(df1[x,] == df2[y,])})
> outer(1:4,1:3,FUN = foo)
[,1] [,2] [,3]
[1,] FALSE FALSE FALSE
[2,] FALSE TRUE FALSE
[3,] FALSE FALSE TRUE
[4,] FALSE FALSE FALSE
I feel compelled to at least mention the danger of using == for comparisons, as opposed to all.equal or identical. I'm presuming that you're comfortable enough with the data types involved that this won't be a problem.
I suspect that the optimal solution depends on how many unique rows and how many total rows you have.
For the example on your blog, where there are 1000-1500 rows but only 20 unique values (for the seed you set there), I think it's faster to do this:
assign ids to each unique row and then
run outer on the vector of ids seen in each data.frame.
Here's the performance I got. #flodel's approach does about the same on my computer; it's the third one below. Disclaimer: I don't know much about running these kinds of tests.
> set.seed(2112)
> df1 <- data.frame(row.names=1:1000,
+ var1=sample(c(TRUE,FALSE), 1000, replace=TRUE),
+ var2=sample(1:10, 1000, replace=TRUE) )
> df2 <- data.frame(row.names=1001:2500,
+ var1=sample(c(TRUE,FALSE), 1500, replace=TRUE),
+ var2=sample(1:10, 1500, replace=TRUE))
>
> # candidate method on blog
> system.time({
+ df1$var3 <- apply(df1, 1, paste, collapse='.')
+ df2$var3 <- apply(df2, 1, paste, collapse='.')
+ df6 <- sapply(df2$var3, FUN=function(x) { x == df1$var3 })
+ dimnames(df6) <- list(row.names(df1), row.names(df2))
+ })
user system elapsed
1.13 0.00 1.14
>
> rownames(df1) <- NULL # in case something weird happens to rownames on merge
> rownames(df2) <- NULL
> # id method
> system.time({
+ df12 <- unique(rbind(df1,df2))
+ df12$id <- rownames(df12)
+
+ id1 <- merge(df12,df1)$id
+ id2 <- merge(df12,df2)$id
+
+ x <- outer(id1,id2,`==`)
+ })
user system elapsed
0.11 0.02 0.13
>
> library(fields)
> # rdist from fields method
> system.time({
+ mat1 <- as.matrix(sapply(df1, as.integer))
+ mat2 <- as.matrix(sapply(df2, as.integer))
+ rdist(mat1, mat2) < 1e-9
+ })
user system elapsed
0.15 0.00 0.16
I guess the rbind and the merges would make this solution relatively more costly with different data.
I can't believe this is taking me this long to figure out, and I still can't figure it out.
I need to keep a collection of vectors, and later check that a certain vector is in that collection. I tried lists combined with %in% but that doesn't appear to work properly.
My next idea was to create a matrix and rbind vectors to it, but now I don't know how to check whether a vector is contained in a matrix. %in% appears to compare sets, not exact rows. The same appears to apply to intersect.
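For example, here is what I mean about %in% comparing sets rather than whole rows:

c(3, 1, 2) %in% c(1, 2, 3)
# [1] TRUE TRUE TRUE  (per-element membership; order is ignored)
all(c(3, 1, 2) %in% c(1, 2, 3))
# [1] TRUE  (even though the vectors are not the same)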
Help much appreciated!
Do you mean like this:
wantVec <- c(3,1,2)
myList <- list(A = c(1:3), B = c(3,1,2), C = c(2,3,1))
sapply(myList, function(x, want) isTRUE(all.equal(x, want)), wantVec)
## or, is the vector in the set?
any(sapply(myList, function(x, want) isTRUE(all.equal(x, want)), wantVec))
We can do a similar thing with a matrix:
myMat <- matrix(unlist(myList), ncol = 3, byrow = TRUE)
## As the vectors are now in the rows, we use apply over the rows
apply(myMat, 1, function(x, want) isTRUE(all.equal(x, want)), wantVec)
## or
any(apply(myMat, 1, function(x, want) isTRUE(all.equal(x, want)), wantVec))
Or by columns:
myMat2 <- matrix(unlist(myList), ncol = 3)
## As the vectors are now in the cols, we use apply over the cols
apply(myMat2, 2, function(x, want) isTRUE(all.equal(x, want)), wantVec)
## or
any(apply(myMat2, 2, function(x, want) isTRUE(all.equal(x, want)), wantVec))
If you need to do this a lot, write your own function
vecMatch <- function(x, want) {
isTRUE(all.equal(x, want))
}
And then use it, e.g. on the list myList:
> sapply(myList, vecMatch, wantVec)
A B C
FALSE TRUE FALSE
> any(sapply(myList, vecMatch, wantVec))
[1] TRUE
Or even wrap the whole thing:
vecMatch <- function(x, want) {
out <- sapply(x, function(x, want) isTRUE(all.equal(x, want)), want)
any(out)
}
> vecMatch(myList, wantVec)
[1] TRUE
> vecMatch(myList, 5:3)
[1] FALSE
EDIT: A quick comment on why I wrapped the all.equal() calls in isTRUE(): when the two arguments are not equal, all.equal() doesn't return the logical value FALSE:
> all.equal(1:3, c(3,2,1))
[1] "Mean relative difference: 1"
isTRUE() is useful here because it returns TRUE iff its argument is TRUE, while it returns FALSE if it is anything else.
> M
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
v <- c(2, 5, 8)
Check each column:
c1 <- which(M[, 1] == v[1])
c2 <- which(M[, 2] == v[2])
c3 <- which(M[, 3] == v[3])
Here is a way to still use intersect() on more than two elements:
> intersect(intersect(c1, c2), c3)
[1] 2
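Equivalently, Reduce() folds intersect() over any number of index vectors, which avoids the nesting (a sketch using c1, c2 and c3 from above):

Reduce(intersect, list(c1, c2, c3))
# [1] 2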