I have a matrix filled with TRUE/FALSE values and I am trying to find the index position of the first TRUE value on each row (or return NA if there is no TRUE value in the row). The following code gets the job done, but it uses an apply() call, which I believe is just a wrapper around a for loop. I'm working with some large datasets and performance is suffering. Is there a faster way?
> x <- matrix(rep(c(F,T,T),10), nrow=10)
> x
       [,1]  [,2]  [,3]
 [1,] FALSE  TRUE  TRUE
 [2,]  TRUE  TRUE FALSE
 [3,]  TRUE FALSE  TRUE
 [4,] FALSE  TRUE  TRUE
 [5,]  TRUE  TRUE FALSE
 [6,]  TRUE FALSE  TRUE
 [7,] FALSE  TRUE  TRUE
 [8,]  TRUE  TRUE FALSE
 [9,]  TRUE FALSE  TRUE
[10,] FALSE  TRUE  TRUE
> apply(x,1,function(y) which(y)[1])
[1] 2 1 1 2 1 1 2 1 1 2
Not sure this is any better, but this is one solution:
> library(matrixStats)  # rowMins() is from the matrixStats package
> x2 <- t(t(matrix(as.numeric(x), nrow=10)) * 1:3)
> x2[x2 == 0] <- Inf
> rowMins(x2)
[1] 2 1 1 2 1 1 2 1 1 2
Edit: Here's a better solution using base R:
> x2 <- (x2 <- which(x, arr.ind=TRUE))[order(x2[,1]),]
> x2[as.logical(c(1,diff(x2[,1]) != 0)),2]
[1] 2 1 1 2 1 1 2 1 1 2
A couple of years later, I want to add two alternative approaches.
1) With max.col (a caveat for all-FALSE rows is sketched after these two approaches):
> max.col(x, "first")
[1] 2 1 1 2 1 1 2 1 1 2
2) With aggregate:
> aggregate(col ~ row, data = which(x, arr.ind = TRUE), FUN = min)$col
[1] 2 1 1 2 1 1 2 1 1 2
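Two side notes (my additions, not part of the original answers). First, the aggregate call works because which(x, arr.ind = TRUE) returns a two-column matrix whose columns are named row and col, which the formula interface coerces to a data frame before grouping. Second, max.col(x, "first") returns 1 rather than NA for a row with no TRUE at all, because every entry in such a row ties at 0. If you need the NA behaviour asked for in the question, a minimal sketch:
first.true <- max.col(x, "first")
# blank out reported positions that don't actually hold a TRUE (all-FALSE rows)
first.true[!x[cbind(seq_len(nrow(x)), first.true)]] <- NA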
As performance is an issue, let's test the different methods on a larger dataset. First create a function for each method:
abiel <- function(n){apply(n, 1, function(y) which(y)[1])}
maxcol <- function(n){max.col(n, "first")}
aggr.min <- function(n){aggregate(col ~ row, data = which(n, arr.ind = TRUE), FUN = min)$col}
shane.bR <- function(n){x2 <- (x2 <- which(n, arr.ind=TRUE))[order(x2[,1]),]; x2[as.logical(c(1,diff(x2[,1]) != 0)),2]}
joris <- function(n){z <- which(t(n))-1;((z%%ncol(n))+1)[match(1:nrow(n), (z%/%ncol(n))+1)]}
Second, create a larger dataset:
xl <- matrix(sample(c(F,T),9e5,replace=TRUE), nrow=1e5)
Third, run the benchmark:
library(microbenchmark)
microbenchmark(abiel(xl), maxcol(xl), aggr.min(xl), shane.bR(xl), joris(xl),
               unit = 'relative')
which results in:
Unit: relative
         expr        min         lq       mean     median         uq       max neval cld
    abiel(xl)  55.102815  33.458994  15.781460  33.243576  33.196486  2.911675   100  d 
   maxcol(xl)   1.000000   1.000000   1.000000   1.000000   1.000000  1.000000   100  a 
 aggr.min(xl) 439.863935 262.595535 118.436328 263.387427 256.815607 16.709754   100   e
 shane.bR(xl)  12.477856   8.522470   7.389083  13.549351  24.626431  1.748501   100  c 
    joris(xl)   7.922274   5.449662   4.418423   5.964554   9.855588  1.491417   100  b 
You can gain a lot of speed by using %% and %/%:
x <- matrix(rep(c(F,T,T),10), nrow=10)
z <- which(t(x)) - 1
((z %% ncol(x)) + 1)[match(1:nrow(x), (z %/% ncol(x)) + 1)]
This can be adapted as needed: if you want to do this for columns, you don't have to transpose the matrix.
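For illustration, here is the column variant as a sketch of my own (not from the original answer): without the transpose, which() walks the matrix column by column, so %% now recovers the row and %/% the column.
z <- which(x) - 1
# first row containing a TRUE in each column, NA for all-FALSE columns
((z %% nrow(x)) + 1)[match(1:ncol(x), (z %/% nrow(x)) + 1)]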
Tried out on a 1,000,000 x 5 matrix:
x <- matrix(sample(c(F,T),5000000,replace=T), ncol=5)
system.time(apply(x,1,function(y) which(y)[1]))
#> user system elapsed
#> 12.61 0.07 12.70
system.time({
  z <- which(t(x)) - 1
  (z %% ncol(x) + 1)[match(1:nrow(x), (z %/% ncol(x)) + 1)]
})
#> user system elapsed
#> 1.11 0.00 1.11
You could gain quite a lot this way.
What is the fastest approach to saving unique values that go into matrix multiplication (without 0)?
For example, if I have a data.table object
library(data.table)
A = data.table(j3=c(3,0,3),j5=c(0,5,5),j7=c(0,7,0),j8=c(8,0,8))
I would like to see which unique values go into A*transpose(A) (or as.matrix(A) %*% as.matrix(t(A))). Right now, I can do it using for loops as:
B = t(A)
L = list()
models = c('A1','A2','A3')
for (i in 1:nrow(A)) {
  for (j in 1:ncol(B)) {
    u = union(unlist(A[i,]), B[,j])
    u = u[u != 0] # remove 0
    L[[paste(models[i], models[j])]] = u
  }
}
However, is there a faster and more RAM-efficient way? The output doesn't have to be a list object, as in my case, it can be a data.table (data.frame) as well. Also, the order of values is not important. For example, 3 5 8 is as good as 5 3 8, 8 5 3 etc.
Any help is appreciated.
EDIT: So as.matrix(A) %*% as.matrix(t(A)) is:
     [,1] [,2] [,3]
[1,]   73    0   73
[2,]    0   74   25
[3,]   73   25   98
The first element is calculated as 3*3 + 0*0 + 0*0 + 8*8 = 73, the second element is 3*0 + 0*5 + 0*7 + 8*0 = 0, etc. I need the unique numbers that go into this calculation, but without 0.
Therefore outputs (saved in the list L) are:
> L
$`A1 A1`
[1] 3 8
$`A1 A2`
[1] 3 8 5 7
$`A1 A3`
[1] 3 8 5
$`A2 A1`
[1] 5 7 3 8
$`A2 A2`
[1] 5 7
$`A2 A3`
[1] 5 7 3 8
$`A3 A1`
[1] 3 5 8
$`A3 A2`
[1] 3 5 8 7
$`A3 A3`
[1] 3 5 8
Once again, the output doesn't have to be a list object; I would prefer a data.table if that is doable. Is it possible to rewrite my approach as an Rcpp function?
Potential optimizations
Following up on #user2554330's answer, note that if A is an m-by-n matrix, then AAT = A %*% t(A) (equivalently tcrossprod(A)) is an m-by-m symmetric matrix. AAT[i, j] and AAT[j, i] are computed using the same entries of A, so you only need to inspect m*(m+1)/2 pairs of rows of A, not m*m.
You can do even better by finding and caching the unique elements of each row before pairing them. Preprocessing in this way avoids redundant computation and should noticeably improve performance when m << n.
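As a concrete sketch of both ideas (mine, in base R; the J here is exactly the pair enumeration used in f and g below):
m <- nrow(A)
a <- seq_len(m)
J <- cbind(sequence(a), rep.int(a, a)) # all pairs (i, j) with i <= j
nrow(J) == m * (m + 1) / 2             # TRUE: triangular number of pairs
# cache the distinct non-zero elements of each row once, up front
row.uniques <- lapply(asplit(A, 1L), function(x) unique(x[x != 0]))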
Limitations
Another aspect of the problem is how unique works under the hood. unique has an argument nmax that you can use to specify an expected maximum number of unique elements. From ?duplicated:
Except for factors, logical and raw vectors the default nmax = NA is equivalent to nmax = length(x). Since a hash table of size 8*nmax bytes is allocated, setting nmax suitably can save large amounts of memory. For factors it is automatically set to the smaller of length(x) and the number of levels plus one (for NA). If nmax is set too small there is liable to be an error: nmax = 1 is silently ignored.
Long vectors are supported for the default method of duplicated, but may only be usable if nmax is supplied.
These comments apply to unique as well. Since you have a 300-by-4e+07 matrix, you would be evaluating (with preprocessing):
unique(<4e+07-length vector>), 300 times,
unique(<up to 8e+07-length vector>), 299*300/2 times.
That can consume a lot of memory if you don't know anything about your matrix that might allow you to set nmax. And it can take a long time if you don't have access to many CPUs.
So I agree with comments asking you to consider why you need to do this at all and whether your underlying problem has a nicer solution.
Two answers
FWIW, here are two approaches to your general problem that actually take advantage of symmetry. f and g are without and with preprocessing. [[.utri allows you to extract elements from the return value, an m*(m+1)/2-length list, as if it were an m-by-m matrix. as.matrix.utri constructs the full, symmetric m-by-m list matrix.
f <- function(A, nmax = NA) {
  a <- seq_len(nrow(A))
  J <- cbind(sequence(a), rep.int(a, a))
  FUN <- function(i) {
    if (i[1L] == i[2L]) {
      x <- A[i[1L], ]
    } else {
      x <- c(A[i[1L], ], A[i[2L], ])
    }
    unique.default(x[x != 0], nmax = nmax)
  }
  res <- apply(J, 1L, FUN, simplify = FALSE)
  class(res) <- "utri"
  res
}
g <- function(A, nmax = NA) {
  l <- lapply(asplit(A, 1L), function(x) unique.default(x[x != 0], nmax = nmax))
  a <- seq_along(l)
  J <- cbind(sequence(a), rep.int(a, a))
  FUN <- function(i) {
    if (i[1L] == i[2L]) {
      l[[i[1L]]]
    } else {
      unique.default(c(l[[i[1L]]], l[[i[2L]]]))
    }
  }
  res <- apply(J, 1L, FUN, simplify = FALSE)
  class(res) <- "utri"
  res
}
`[[.utri` <- function(x, i, j) {
  stopifnot(length(i) == 1L, length(j) == 1L)
  class(x) <- NULL
  if (i <= j) {
    x[[i + (j * (j - 1L)) %/% 2L]]
  } else {
    x[[j + (i * (i - 1L)) %/% 2L]]
  }
}
as.matrix.utri <- function(x) {
  p <- length(x)
  n <- as.integer(round(0.5 * (-1 + sqrt(1 + 8 * p))))
  i <- rep.int(seq_len(n), n)
  j <- rep.int(seq_len(n), rep.int(n, n))
  r <- i > j
  ir <- i[r]
  i[r] <- j[r]
  j[r] <- ir
  res <- x[i + (j * (j - 1L)) %/% 2L]
  dim(res) <- c(n, n)
  res
}
Here is a simple test on a 4-by-4 integer matrix:
mkA <- function(m, n) {
  A <- sample(0:(n - 1L), size = as.double(m) * n, replace = TRUE,
              prob = rep.int(c(n - 1, 1), c(1L, n - 1L)))
  dim(A) <- c(m, n)
  A
}
set.seed(1L)
A <- mkA(4L, 4L)
A
##      [,1] [,2] [,3] [,4]
## [1,]    0    0    2    3
## [2,]    0    1    0    0
## [3,]    2    1    0    3
## [4,]    1    2    0    0
identical(f(A), gA <- g(A))
## [1] TRUE
gA[[1L, 1L]] # used for 'tcrossprod(A)[1L, 1L]'
## [1] 2 3
gA[[1L, 2L]] # used for 'tcrossprod(A)[1L, 2L]'
## [1] 2 3 1
gA[[2L, 1L]] # used for 'tcrossprod(A)[2L, 1L]'
## [1] 2 3 1
gA # under the hood, an 'm*(m+1)/2'-length list
## [[1]]
## [1] 2 3
##
## [[2]]
## [1] 2 3 1
##
## [[3]]
## [1] 1
##
## [[4]]
## [1] 2 3 1
##
## [[5]]
## [1] 1 2 3
##
## [[6]]
## [1] 2 1 3
##
## [[7]]
## [1] 2 3 1
##
## [[8]]
## [1] 1 2
##
## [[9]]
## [1] 2 1 3
##
## [[10]]
## [1] 1 2
##
## attr(,"class")
## [1] "utri"
mgA <- as.matrix(gA) # the full, symmetric, 'm'-by-'m' list matrix
mgA
## [,1] [,2] [,3] [,4]
## [1,] integer,2 integer,3 integer,3 integer,3
## [2,] integer,3 1 integer,3 integer,2
## [3,] integer,3 integer,3 integer,3 integer,3
## [4,] integer,3 integer,2 integer,3 integer,2
mgA[1L, ] # used for first row of 'tcrossprod(A)'
## [[1]]
## [1] 2 3
##
## [[2]]
## [1] 2 3 1
##
## [[3]]
## [1] 2 3 1
##
## [[4]]
## [1] 2 3 1
## If you need names
dimnames(mgA) <- rep.int(list(sprintf("A%d", seq_len(nrow(mgA)))), 2L)
mgA["A1", ]
## $A1
## [1] 2 3
##
## $A2
## [1] 2 3 1
##
## $A3
## [1] 2 3 1
##
## $A4
## [1] 2 3 1
## If you need an 'm'-by-'m' 'data.table' result
DT <- data.table::as.data.table(mgA)
DT
## A1 A2 A3 A4
## 1: 2,3 2,3,1 2,3,1 2,3,1
## 2: 2,3,1 1 1,2,3 1,2
## 3: 2,3,1 1,2,3 2,1,3 2,1,3
## 4: 2,3,1 1,2 2,1,3 1,2
And here are two benchmarks on two large integer matrices, showing that preprocessing can help quite a bit:
set.seed(1L)
A <- mkA(100L, 1e+04L)
microbenchmark::microbenchmark(f(A), g(A), times = 10L, setup = gc(FALSE))
## Unit: milliseconds
## expr min lq mean median uq max neval
## f(A) 2352.0572 2383.3100 2435.7954 2403.8968 2431.6214 2619.553 10
## g(A) 843.0206 852.5757 858.7262 858.2746 863.8239 881.450 10
A <- mkA(100L, 1e+06L)
microbenchmark::microbenchmark(f(A), g(A), times = 10L, setup = gc(FALSE))
## Unit: seconds
## expr min lq mean median uq max neval
## f(A) 290.93327 295.54319 302.57001 301.17810 307.50226 318.14203 10
## g(A) 72.85608 73.83614 76.67941 76.57313 77.78056 83.73388 10
Perhaps we can try this
f <- function(A, models) {
  AA <- replace(A, A == 0, NA)
  setNames(
    c(t(outer(
      1:nrow(A),
      1:nrow(A),
      Vectorize(function(x, y) unique(na.omit(c(t(AA[c(x, y)])))))
    ))),
    t(outer(models, models, paste))
  )
}
which gives
$`A1 A1`
[1] 3 8
$`A1 A2`
[1] 3 8 5 7
$`A1 A3`
[1] 3 8 5
$`A2 A1`
[1] 5 7 3 8
$`A2 A2`
[1] 5 7
$`A2 A3`
[1] 5 7 3 8
$`A3 A1`
[1] 3 5 8
$`A3 A2`
[1] 3 5 8 7
$`A3 A3`
[1] 3 5 8
If you care about the speed, you can try
lst <- asplit(replace(A, A == 0, NA), 1)
mat <- matrix(list(), nrow = nrow(A), ncol = nrow(A))
mat[lower.tri(mat)] <- combn(lst, 2, function(...) unique(na.omit(unlist(...))), simplify = FALSE)
mat[upper.tri(mat)] <- t(mat)[upper.tri(mat)]
diag(mat) <- Map(function(x) unname(x)[!is.na(x)], lst)
L <- c(t(mat))
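A quick sanity check of my own (with L.loop holding the result of the question's nested loop, saved before L is reassigned above): every entry should contain the same values as the loop's output, even if the order within each vector differs.
all(mapply(setequal, L.loop, c(t(mat)))) # TRUE if each pair agrees as a set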
Thanks for posting the additional information in your edits. From what you posted, it appears that for all pairs of rows of a matrix or data table A, you want the unique non-zero values in those two rows.
To do that efficiently I'd suggest ensuring that A is a matrix. Row indexing in dataframes or data tables is a lot slower than doing so in matrices. (Column indexing can be faster, but I doubt if it's worth transposing the table to get that.)
Once you have a matrix, A[i, ] is a vector containing the values in row i, and that's a pretty fast calculation. You want the unique non-zero values in c(A[i, ], A[j, ]). The unique function will produce this, but won't leave out the zeros. I'd suggest experimenting. Depending on the contents of each row, it is conceivable that leaving the zeros out of the rows first before computing the unique entries could be either faster or slower than calculating all the unique values and deleting 0 afterwards.
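For concreteness, the two orderings suggested above look like this (my sketch, for row indices i and j of a numeric matrix A):
v <- c(A[i, ], A[j, ])
u1 <- unique(v[v != 0])     # drop the zeros first, then deduplicate
u2 <- setdiff(unique(v), 0) # deduplicate first, then drop the zero
setequal(u1, u2)            # TRUE: the same set either way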
You say you want to do this for a few hundred rows, but each row is very long. I'd guess you won't be able to improve much on nested loops: the time will be spent on each entry, not on the loops. However, you could experiment with vectorization using the apply() function, e.g.
nrows <- nrow(A) # 'nrows' was not defined in the original; presumably the number of rows
result <- vector("list", nrows)
for (i in 1:nrows)
  result[[i]] <- apply(A, 1, function(row) setdiff(unique(c(row, A[i, ])), 0))
This will give a list of lists; if you want to examine entry i, j, you can use result[[c(i,j)]].
Given a list of the locations of 1s in each row, I'm trying to find an efficient way to construct a binary matrix. Here's a small example, although I'm trying to find something that scales well.
Given a binary matrix:
> M <- matrix(rbinom(25,1,0.5),5,5)
> M
[,1] [,2] [,3] [,4] [,5]
[1,] 0 1 1 1 0
[2,] 0 1 1 1 1
[3,] 1 1 0 1 1
[4,] 1 0 0 1 0
[5,] 0 1 1 0 0
I can transform M into an adjacency list using:
> Mlist <- apply(M==1, 1, which, simplify = FALSE)
> Mlist
[[1]]
[1] 2 3 4
[[2]]
[1] 2 3 4 5
[[3]]
[1] 1 2 4 5
[[4]]
[1] 1 4
[[5]]
[1] 2 3
I'd like to transform Mlist back into M. One possibility is:
M.new <- matrix(0,5,5)
for (row in 1:5){M.new[row,Mlist[[row]]] <- 1}
But, it seems like there should be a more efficient way.
Thanks!
1) Using M and Mlist defined in the Note at the end, sapply over the components of Mlist, replacing a vector of zeros with ones at the needed locations. Transpose at the end.
M2 <- t(sapply(Mlist, replace, x = integer(length(Mlist)), 1L))
identical(M, M2) # check that M2 equals M
## [1] TRUE
2) A variation with slightly more keystrokes, but faster, would be
M3 <- do.call("rbind", lapply(Mlist, replace, x = integer(length(Mlist)), 1L))
identical(M, M3)
## [1] TRUE
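One detail worth flagging (my note, not from the answer): both versions use integer(length(Mlist)) as the zero template, which implicitly assumes the matrix is square. For an m-by-n matrix, the template's length should be the number of columns:
M4 <- t(sapply(Mlist, replace, x = integer(ncol(M)), 1L)) # works for non-square M too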
Benchmark
Here ex1 and ex2 are (1) and (2) above and ex0 is the for loop in the question, except we used integer instead of double. Note that (2) is about 100x faster than the loop in the question.
library(microbenchmark)
microbenchmark(
ex0 = { M.new <- matrix(0L,5,5); for (row in 1:5){M.new[row,Mlist[[row]]] <- 1L} },
ex1 = t(sapply(Mlist, replace, x = integer(length(Mlist)), 1L)),
ex2 = do.call("rbind", lapply(Mlist, replace, x = integer(length(Mlist)), 1L))
)
giving:
Unit: microseconds
 expr    min      lq     mean median      uq    max neval cld
  ex0 4454.4 4504.15 4639.111 4564.1 4670.10 8450.2   100   b
  ex1   73.1   84.75   98.220   94.3  111.75  130.8   100  a 
  ex2   32.0   36.20   43.866   42.7   51.85   82.5   100  a 
Note
set.seed(123)
M <- matrix(rbinom(25,1,0.5),5,5)
Mlist <- apply(M==1, 1, which, simplify = FALSE)
Use vectorized row/column indexing: replicate the sequence along 'Mlist' by the lengths of 'Mlist', and cbind it with the unlisted 'Mlist' to create a two-column index matrix that can be used to assign the selected elements of 'M.new' to 1.
M.new <- matrix(0, 5, 5) # start from a zero matrix, as in the question
ind <- cbind(rep(seq_along(Mlist), lengths(Mlist)), unlist(Mlist))
M.new[ind] <- 1
Checking:
> all.equal(M, M.new)
[1] TRUE
Or another option is sparseMatrix
library(Matrix)
as.matrix(sparseMatrix(i = rep(seq_along(Mlist), lengths(Mlist)),
j = unlist(Mlist), x = 1))
[,1] [,2] [,3] [,4] [,5]
[1,] 0 0 1 1 1
[2,] 0 1 0 1 0
[3,] 1 0 0 1 0
[4,] 0 1 0 1 0
[5,] 1 0 1 1 1
I would like to create 3 matrices from a bigger matrix.
The new matrices should contain:
new matrix 1: the 1st, 4th, 7th, ... elements of the old matrix
new matrix 2: the 2nd, 5th, 8th, ... elements of the old matrix
new matrix 3: the 3rd, 6th, 9th, ... elements of the old matrix
So if my matrix looks like this:
m<-matrix(c(1:3),nrow=12, ncol=2)
[,1] [,2]
[1,] 1 1
[2,] 2 2
[3,] 3 3
[4,] 1 1
[5,] 2 2
[6,] 3 3
[7,] 1 1
[8,] 2 2
[9,] 3 3
[10,] 1 1
[11,] 2 2
[12,] 3 3
I tried it with a for loop like this
for(i in 1:4){
m1<-m[i+3,]
m2<-m[i+4,]
m3<-m[i+5,]
}
But this not only fails to give me the 1st/2nd/3rd rows, it also doesn't give me all the rows, since m1, m2 and m3 are overwritten on every iteration.
There has to be a more elegant way to do it.
Take advantage of the recycling rule of indexing in R:
m[c(T, F, F),]
# [,1] [,2]
# [1,] 1 1
# [2,] 1 1
# [3,] 1 1
# [4,] 1 1
m[c(F, T, F),]
# [,1] [,2]
# [1,] 2 2
# [2,] 2 2
# [3,] 2 2
# [4,] 2 2
m[c(F, F, T),]
# [,1] [,2]
# [1,] 3 3
# [2,] 3 3
# [3,] 3 3
# [4,] 3 3
When we index the matrix with a vector shorter than the number of rows, the indexing vector gets recycled until its length matches. In the first case, for instance, the actual indexing vector is extended to c(T, F, F, T, F, F, T, F, F, T, F, F), which picks up the first, fourth, seventh and tenth rows, as expected. The same goes for cases two and three.
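You can make the recycling explicit with rep (my sketch):
idx <- rep(c(TRUE, FALSE, FALSE), length.out = nrow(m)) # the fully recycled index vector
identical(m[idx, ], m[c(T, F, F), ])                    # TRUE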
We can use seq to do this. This will be faster for big datasets.
m[seq(1, nrow(m), by =3),]
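The other two matrices follow by shifting the starting offset (my completion of the same idea):
m[seq(2, nrow(m), by = 3), ] # 2nd, 5th, 8th, ... rows
m[seq(3, nrow(m), by = 3), ] # 3rd, 6th, 9th, ... rows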
Or we could do:
m[seq(nrow(m))%%3==1,] # 1st, 4th, 7th, ...
m[seq(nrow(m))%%3==2,] # 2nd, 5th, 8th, ...
m[seq(nrow(m))%%3==0,] # 3rd, 6th, 9th, ...
BENCHMARKING
library(microbenchmark)
m <- matrix(c(1:3),nrow=12, ncol=2)
func_Psidom <- function(m){m[c(T, F, F),]}
func_akrun <- function(m){ m[seq(1, nrow(m), by =3),]}
func_42 <- function(m){ m[c(TRUE,FALSE,FALSE), ]}
func_m0h3n <- function(m){m[seq(nrow(m))%%3==1,]}
r <- func_Psidom(m)
all(func_akrun(m)==r)
# [1] TRUE
all(func_42(m)==r)
# [1] TRUE
all(func_m0h3n(m)==r)
# [1] TRUE
microbenchmark(func_Psidom(m), func_akrun(m), func_42(m), func_m0h3n(m))
# Unit: microseconds
# expr min lq mean median uq max neval
# func_Psidom(m) 2.566 3.850 4.49990 4.2780 4.7050 14.543 100
# func_akrun(m) 38.923 39.779 43.58536 40.2065 41.0615 252.359 100
# func_42(m) 2.994 3.422 4.13628 4.2770 4.7050 13.688 100
# func_m0h3n(m) 18.820 20.103 22.37447 20.7445 21.3860 104.365 100
# ============================================================
m <- matrix(c(1:3),nrow=1200, ncol=2)
r <- func_Psidom(m)
all(func_akrun(m)==r)
# [1] TRUE
all(func_42(m)==r)
# [1] TRUE
all(func_m0h3n(m)==r)
# [1] TRUE
microbenchmark(func_Psidom(m), func_akrun(m), func_42(m), func_m0h3n(m))
# Unit: microseconds
# expr min lq mean median uq max neval
# func_Psidom(m) 12.832 13.6875 14.41458 14.542 14.543 22.242 100
# func_akrun(m) 56.033 57.3150 65.17700 57.743 58.599 289.998 100
# func_42(m) 12.832 13.4735 14.76962 14.115 14.543 56.032 100
# func_m0h3n(m) 76.990 78.2730 97.82522 78.702 79.557 1873.437 100
# ============================================================
m <- matrix(c(1:3),nrow=120000, ncol=2)
r <- func_Psidom(m)
all(func_akrun(m)==r)
# [1] TRUE
all(func_42(m)==r)
# [1] TRUE
all(func_m0h3n(m)==r)
# [1] TRUE
microbenchmark(func_Psidom(m), func_akrun(m), func_42(m), func_m0h3n(m))
# Unit: microseconds
# expr min lq mean median uq max neval
# func_Psidom(m) 963.665 978.6355 1168.161 1026.113 1076.798 3648.498 100
# func_akrun(m) 1674.117 1787.6785 2808.231 1890.760 2145.043 58450.377 100
# func_42(m) 960.672 976.2835 1244.467 1033.812 1115.507 3114.268 100
# func_m0h3n(m) 5817.920 6127.8070 7697.345 7455.895 8055.565 62414.963 100
Logical vectors get recycled to the number of rows or columns when indexing a matrix:
m[c(TRUE,FALSE,FALSE), ]
[,1] [,2]
[1,] 1 1
[2,] 1 1
[3,] 1 1
[4,] 1 1
m[c(TRUE,FALSE,FALSE)[c(2,1,3)], ] # the numeric vector permutes the logical values
[,1] [,2]
[1,] 2 2
[2,] 2 2
[3,] 2 2
[4,] 2 2
m[c(TRUE,FALSE,FALSE)[c(2,3,1)], ]
[,1] [,2]
[1,] 3 3
[2,] 3 3
[3,] 3 3
[4,] 3 3
I have a dataframe with six numeric variables V1, V2, V3 and V1.lag, V2.lag, V3.lag.
NOTE: My real dataset has many more variables, but I use 3 for illustration only!
I would like to be able to automatically (without hardcoding anything) run through all V variables (not the lag variables) and create V1.over.V1.lag variables by dividing each V variable by its corresponding lag variable.
df<-data.frame(matrix(rnorm(216),72,6));
colnames(df) <- c("v1.raw", "v2.raw", "v3.raw", "v1.lag", "v2.lag", "v3.lag");
Thanks in advance
EDIT: I figured out how to identify "raw" columns and "lag" columns
raws <- sapply( names(df), function(x){ unlist(strsplit(x, "[.]"))[2] == "raw" } ); ## which are raw factors
lags <- sapply( names(df), function(x){ unlist(strsplit(x, "[.]"))[2] == "lag" } ); ## which are lagged factors
but I still can't figure out how to divide all the raw factors by their lag counterparts
which(raws);
will give me the indices, but how do I combine them with lags into a new factor?
df[which(raws)] / df[which(lags)]
doesn't work
Assuming you have only v.raw and v.lag columns in your data.frame, this should work
mm <- colnames(df) <- c("v1.raw", "v2.raw", "v3.raw", "v1.lag", "v2.lag", "v3.lag")
df[, gregexpr('.raw', mm) > 0] / df[, gregexpr('.*lag', mm) > 0]
Edit: some explanations of the solution:
gregexpr('.raw',mm) > 0
[1] TRUE TRUE TRUE FALSE FALSE FALSE
head(df[,gregexpr('.raw',mm) > 0],1)
v1.raw v2.raw v3.raw
1 0.7719037 -0.2078197 -1.223753
regexpr('.lag',mm) > 0
[1] FALSE FALSE FALSE TRUE TRUE TRUE
head(df[,gregexpr('.lag',mm) > 0],1)
v1.lag v2.lag v3.lag
1 0.7719037 -0.2078197 -1.223753
Then we use the vectorized / to do the division in one operation.
Here is an example:
df <- matrix(rep(c(1,2,3,4,5,6),each = 5),ncol=6)
colnames(df) <- c("v1.raw", "v2.raw", "v3.raw", "v1.lag", "v2.lag", "v3.lag")
     v1.raw v2.raw v3.raw v1.lag v2.lag v3.lag
[1,]      1      2      3      4      5      6
[2,]      1      2      3      4      5      6
[3,]      1      2      3      4      5      6
[4,]      1      2      3      4      5      6
[5,]      1      2      3      4      5      6
mm <- colnames(df)
df[,which(gregexpr('.raw',mm) > 0)] /df[,which(gregexpr('.lag',mm) > 0)]
     v1.raw v2.raw v3.raw # as expected: 1/4, 2/5, 3/6
[1,]   0.25    0.4    0.5
[2,]   0.25    0.4    0.5
[3,]   0.25    0.4    0.5
[4,]   0.25    0.4    0.5
[5,]   0.25    0.4    0.5
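A slightly more idiomatic variant (my sketch, not from the answer) uses grepl, which returns a logical vector directly instead of the list that gregexpr returns:
df[, grepl("\\.raw$", mm)] / df[, grepl("\\.lag$", mm)]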
Edit 2: preventing NaN with zeros
df <- matrix(rep(c(1,2,3,4,5,6),each = 5),ncol=6)
colnames(df) <- c("v1.raw", "v2.raw", "v3.raw", "v1.lag", "v2.lag", "v3.lag")
df[1,4] <- 0 ## I introduce a 0 here
mm <- colnames(df)
## ifelse() is vectorized too: where the lag value is 0, skip the division
## and return the original raw value (you can do other things here instead)
ifelse(df[, which(gregexpr('.lag', mm) > 0)] != 0,
       df[, which(gregexpr('.raw', mm) > 0)] / df[, which(gregexpr('.lag', mm) > 0)],
       df[, which(gregexpr('.raw', mm) > 0)])
     v1.lag v2.lag v3.lag ## for some reason ifelse picks up the other column names (lag, not raw)
[1,]   1.00    0.4    0.5
[2,]   0.25    0.4    0.5
[3,]   0.25    0.4    0.5
[4,]   0.25    0.4    0.5
[5,]   0.25    0.4    0.5
Given the following:
a <- c(1,2,3)
b <- c(1,2,3)
c <- c(4,5,6)
A <- cbind(a,b,c)
I want to find which columns in A are equal to for example my vector a.
My first attempt would be:
> which(a==A)
[1] 1 2 3 4 5 6
Which did not do that. (To be honest, I don't even understand what that did.)
Second attempt was:
a==A
        a    b     c
[1,] TRUE TRUE FALSE
[2,] TRUE TRUE FALSE
[3,] TRUE TRUE FALSE
which definitely is a step in the right direction, but it is expanded into a matrix, and what I would have preferred is just a single row of it. How do I compare a vector to the columns of a matrix, and how do I find the columns that are equal to the vector?
Use identical. That is R's "scalar" comparison operator; it returns a single logical value, not a vector.
apply(A, 2, identical, a)
# a b c
# TRUE TRUE FALSE
If A is a data frame in your real case, you're better off using sapply or vapply because apply coerces its input to a matrix.
d <- c("a", "b", "c")
B <- data.frame(a, b, c, d)
apply(B, 2, identical, a) # incorrect!
# a b c d
# FALSE FALSE FALSE FALSE
sapply(B, identical, a) # correct
# a b c d
# TRUE TRUE FALSE FALSE
But note that data.frame coerces character inputs to factors unless you ask otherwise:
sapply(B, identical, d) # incorrect
# a b c d
# FALSE FALSE FALSE FALSE
C <- data.frame(a, b, c, d, stringsAsFactors = FALSE)
sapply(C, identical, d) # correct
# a b c d
# FALSE FALSE FALSE TRUE
Identical is also considerably faster than using all + ==:
library(microbenchmark)
a <- 1:1000
b <- c(1:999, 1001)
microbenchmark(
all(a == b),
identical(a, b))
# Unit: microseconds
# expr min lq median uq max
# 1 all(a == b) 8.053 8.149 8.2195 8.3295 17.355
# 2 identical(a, b) 1.082 1.182 1.2675 1.3435 3.635
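One caveat to identical (my addition, not from the original answer): it compares type as well as value, so an integer vector is not identical to its double equivalent, whereas == coerces before comparing:
identical(1:3, c(1, 2, 3)) # FALSE: integer vs double
all(1:3 == c(1, 2, 3))     # TRUE: == coerces first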
If you add an extra column:
> A
     a b c  
[1,] 1 1 4 4
[2,] 2 2 5 2
[3,] 3 3 6 1
Then you can see that this function is correct:
> hasCol=function(A,a){colSums(a==A)==nrow(A)}
> A[,hasCol(A,a)]
a b
[1,] 1 1
[2,] 2 2
[3,] 3 3
But the earlier accepted version doesn't:
> oopsCol=function(A,a){colSums(a==A)>0}
> A[,oopsCol(A,a)]
     a b  
[1,] 1 1 4
[2,] 2 2 2
[3,] 3 3 1
It returns the 4,2,1 column because the 2 matches the 2 in 1,2,3.
Surely there's a better solution but the following works:
> a <- c(1,2,3)
> b <- c(1,2,3)
> c <- c(4,5,6)
> A <- cbind(a,b,c)
> sapply(1:ncol(A), function(i) all(a==A[,i]))
[1] TRUE TRUE FALSE
And to get the indices:
> which(sapply(1:ncol(A), function(i) all(a==A[,i])))
[1] 1 2
colSums(a==A)==nrow(A)
Recycling in == effectively turns a into a matrix with every column equal to a and the same dimensions as A. colSums then sums each column; since TRUE counts as 1 and FALSE as 0, a column equal to a sums to exactly the number of rows. Comparing the column sums against nrow(A) reduces the answer to a logical vector.
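Plugging in the example data from the question (my sketch), the intermediate steps look like this:
colSums(a == A)                   # a: 3, b: 3, c: 0
colSums(a == A) == nrow(A)        # a: TRUE, b: TRUE, c: FALSE
which(colSums(a == A) == nrow(A)) # indices of the matching columns: a 1, b 2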
EDIT:
library(microbenchmark)
A <- rep(1:14, 1000); dim(A) <- c(7, 2000)
a <- 1:7
microbenchmark(
  apply(A, 2, function(b) identical(a, b)),
  apply(A, 2, function(b) all(a == b)),
  colSums(A == a) == nrow(A))
# Unit: microseconds
#                                       expr      min        lq    median         uq       max
# 1     apply(A, 2, function(b) all(a == b)) 9446.210 9825.6465 10278.335 10648.7820 30588.765
# 2 apply(A, 2, function(b) identical(a, b)) 9324.203 9915.7935 10314.833 10868.5970 13905.095
# 3               colSums(A == a) == nrow(A)  120.252  121.5885   140.185   141.7035   162.858