Yesterday I asked a very simple vectorization question and got some great answers. Today the question is a bit more complex and I'm wondering if R has a function to speed up the runtime of this loop through vectorization.
The loop is
for(j in 1:N) {
  A[j,1] = B[max(which(C[j]>=D))];
}
I tried
A[,1] = B[max(which(C>=D))];
and this dropped the runtime considerably ... but the answer was wrong. Is there a "correct" way to do this in R?
EDIT1:
Thanks for the questions regarding data. I will give sizes of the arrays here:
We are looping over 1:N
A is N x 1
B is length M
C is length N
D is length M
If it matters in terms of speed, in this example, N = 844, M = 2500.
Edit2:
And here are some values for a smaller simulated dataset:
B <- c(1.0000000, 1.0000000, 1.0000000, 0.9565217, 0.9565217, 0.9565217, 0.9565217,
0.9565217, 0.9565217, 0.9565217, 0.8967391, 0.8369565, 0.7771739, 0.7173913,
0.7173913, 0.7173913, 0.7173913, 0.7173913, 0.6277174, 0.6277174, 0.5230978,
0.5230978, 0.3923234, 0.3923234, 0.3923234)
C <- c(0.10607, 0.14705, 0.43607, 0.56587, 0.76203, 0.95657, 1.03524, 1.22956, 1.39074, 2.36452)
D <- c(0.10607, 0.13980, 0.14571, 0.14705, 0.29412, 0.33693, 0.43607, 0.53968, 0.56587,
0.58848, 0.64189, 0.65475, 0.75518, 0.76203, 0.95657, 1.03524, 1.05454, 1.18164,
1.22956, 1.23760, 1.39074, 1.87604, 2.36452, 2.89497, 4.42393)
The result should be:
> A
[,1]
[1,] 1.0000000
[2,] 0.9565217
[3,] 0.9565217
[4,] 0.9565217
[5,] 0.7173913
[6,] 0.7173913
[7,] 0.7173913
[8,] 0.6277174
[9,] 0.5230978
[10,] 0.3923234
If you are eager to get the answer immediately, jump to Conclusion. I offer you a single line of R code with maximum efficiency. For details and ideas, read through the following.
Code re-shaping and problem re-definition
When the OP asks for a vectorization of the following loop:
for(j in 1:N) A[j, 1] <- B[max(which(C[j] >= D))]
The first thing I do is transform it into a cleaner two-stage version:
## stage 1: index computation (need vectorization)
id <- integer(N); for(j in 1:N) id[j] <- max(which(D <= C[j]))
## stage 2: shuffling (readily vectorized)
A[, 1] <- B[id]
Now we see that only stage 1 needs to be vectorized. This stage essentially does the following:
D[1] D[2] D[3] ... D[M]
C[1]
C[2]
C[3]
.
.
C[N]
For each row j, find the cut-off location k(j) in D, such that D[k(j) + 1], D[k(j) + 2], ..., D[M] > C[j].
Efficient algorithm based on sorting
There is actually an efficient algorithm to do this:
sort C in ascending order, into CC (record ordering index iC, such that C[iC] == CC)
sort D in ascending order, into DD (record ordering index iD, such that D[iD] == DD)
By sorting, we substantially reduce the work complexity.
If the data are unsorted, then we have to explicitly scan all elements D[1], D[2], ..., D[M] in order to decide on k(j). So there is an O(M) cost for each row, thus an O(MN) cost in total.
However, if the data are sorted, then we only need to do the following:
j = 1: search `D[1], D[2], ..., D[k(1)]`, till `D[k(1) + 1] > C[1]`;
j = 2: search `D[k(1) + 1], D[k(1)+2], ..., D[k(2)]`, till `D[k(2) + 1] > C[2]`;
...
For each row, only a partial search is applied, so the D vector is touched only once overall, rather than N times as in the trivial implementation: the scanning cost drops from O(MN) to O(M) (plus O(N) for walking through C). As a result, after sorting, the algorithm is roughly N times faster! For large M and N this is a huge difference. As you said in another comment, this code will be called millions of times, so we definitely want the O(M) algorithm instead of the O(MN) one.
Also note that the memory cost of this approach is O(M + N): we only concatenate two vectors, rather than expanding them into an M-by-N matrix. That storage saving is noticeable as well.
In fact, we can take one step further, by converting this comparison problem into a matching problem, which is easier to vectorize in R.
## version 1:
CCDD <- c(CC, DD) ## combine CC and DD
CCDD <- sort(CCDD, decreasing = TRUE) ## sort into descending order
id0 <- M + N - match(CC, CCDD) + 1
id <- id0 - 1:N
To understand why this works, consider an alternative representation:
## version 2:
CCDD <- c(CC, DD) ## combine CC and DD
CCDD <- sort(CCDD) ## sort into ascending order
id0 <- match(CC, CCDD)
id <- id0 - 1:N
Now the following diagram illustrates what CCDD vector looks like:
CCDD: D[1] D[2] C[1] D[3] C[2] C[3] D[4] D[5] D[6] C[4] .....
id0: 3 5 6 10 .....
id : 2 3 3 6 .....
So CCDD[id] gives D[2], D[3], D[3], D[6], ...: exactly the last element no greater than C[1], C[2], C[3], C[4], .... Therefore, id is just the index vector we want!
People may wonder why I suggest "version 1" rather than "version 2". The reason is that when there are tied values in CCDD, "version 2" gives the wrong result: match() takes the first element that matches and ignores later matches. So instead of matching from left to right (in ascending order), we have to match from right to left (in descending order).
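A tiny illustration of the tie problem (my own toy example, not from the OP's data): match() reports only the first occurrence of a value.
x <- c(1, 2, 2, 3)
match(2, x)
# [1] 2   ## the duplicate 2 at position 3 is never reported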
Using OP's data
With this in mind, I start looking at the OP's data. Amazingly, the OP's data are already sorted:
C <- c(0.10607, 0.14705, 0.43607, 0.56587, 0.76203, 0.95657, 1.03524, 1.22956, 1.39074, 2.36452)
D <- c(0.10607, 0.13980, 0.14571, 0.14705, 0.29412, 0.33693, 0.43607, 0.53968, 0.56587, 0.58848,
0.64189, 0.65475, 0.75518, 0.76203, 0.95657, 1.03524, 1.05454, 1.18164, 1.22956, 1.23760,
1.39074, 1.87604, 2.36452, 2.89497, 4.42393)
M <- length(D); N <- length(C)
is.unsorted(C)
# FALSE
is.unsorted(D)
#FALSE
Furthermore, OP has already combined C and D:
all(C %in% D)
# TRUE
It seems the OP and I have the same idea about efficiency. Presumably the OP once had a shorter D vector, and the D vector supplied here is really the CCDD vector I mentioned above!
Now, in this situation, things become very simple: a single line does it:
id <- M - match(C, rev(D)) + 1
Note the rev(): the OP has sorted D in ascending order, so I need to reverse it. This single line may look quite different from the "version 1" code, but nothing is wrong here. Remember, the D used here is really the CCDD of the "version 1" code, and the M here is really the M + N there. Also, there is no need to subtract 1:N from id, because of this different definition of D.
Checking result
Now, the trivial R-loop gives:
id <- integer(N); for(j in 1:N) id[j] <- max(which(D <= C[j]))
id
# [1] 1 4 7 9 14 15 16 19 21 23
Well, our single line, vectorized code gives:
id <- M - match(C, rev(D)) + 1
id
# [1] 1 4 7 9 14 15 16 19 21 23
Perfect match, hence we are doing the right thing.
Conclusion
So, Laurbert, this is the answer you want:
A[, 1] <- B[M - match(C, rev(D)) + 1]
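For completeness: stage 1 is exactly a "last position in sorted D not exceeding C[j]" lookup, and base R has a built-in for that. A minimal sketch using findInterval(), assuming (as holds for the OP's data) that D is sorted in ascending order:
id <- findInterval(C, D)  ## index of the last D element <= C[j]; 0 if none exists
A[, 1] <- B[id]
findInterval() uses binary search, so it stays fast even when C is not sorted.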
You can use outer for this.
Your code:
A1 <- matrix(NA_real_, ncol = 1, nrow = length(C))
for(j in seq_along(C)) {
  A1[j,1] = B[max(which(C[j]>=D))];
}
Test if the elements of C are larger/equal the elements of D with outer:
test <- outer(C, D, FUN = ">=")
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24] [,25]
# [1,] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [2,] TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [3,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [4,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [5,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [6,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [7,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
# [8,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
# [9,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE
#[10,] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE FALSE
Note that this can use a lot of memory for large vectors.
Then find the last TRUE value in each row:
ind <- max.col(test, ties.method = "last") * (rowSums(test) > 0)
rowSums(test) > 0 tests if there are any TRUE values and makes the corresponding element of ind 0 otherwise. It's undefined what you'd want to happen in this case. (A 0 index is ignored during subsetting. Possibly, you'd want NA instead in your final result?)
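If NA is indeed what you would want there, a small variation of the same idea (my own sketch, not part of the original answer):
ind_na <- ifelse(rowSums(test) > 0, max.col(test, ties.method = "last"), NA_integer_)
B[ind_na]  ## rows of test with no TRUE yield NA instead of being dropped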
Now subset:
A2 <- matrix(B[ind], ncol = 1)  # matrix() rather than as.matrix(), which silently ignores ncol
# [,1]
# [1,] 1.0000000
# [2,] 0.9565217
# [3,] 0.9565217
# [4,] 0.9565217
# [5,] 0.7173913
# [6,] 0.7173913
# [7,] 0.7173913
# [8,] 0.6277174
# [9,] 0.5230978
#[10,] 0.3923234
Are the results identical?
identical(A2, A1)
#[1] TRUE
The data (please use dput next time to provide example data):
B <- c(1.0000000, 1.0000000, 1.0000000, 0.9565217, 0.9565217, 0.9565217, 0.9565217,
0.9565217, 0.9565217, 0.9565217, 0.8967391, 0.8369565, 0.7771739, 0.7173913,
0.7173913, 0.7173913, 0.7173913, 0.7173913, 0.6277174, 0.6277174, 0.5230978,
0.5230978, 0.3923234, 0.3923234, 0.3923234)
C <- c(0.10607, 0.14705, 0.43607, 0.56587, 0.76203, 0.95657, 1.03524, 1.22956, 1.39074,
2.36452)
D <- c(0.10607, 0.13980, 0.14571, 0.14705, 0.29412, 0.33693, 0.43607, 0.53968, 0.56587,
0.58848, 0.64189, 0.65475, 0.75518, 0.76203, 0.95657, 1.03524, 1.05454, 1.18164,
1.22956, 1.23760, 1.39074, 1.87604, 2.36452, 2.89497, 4.42393)
After seeing @Roland's answer, I think I understand better what you are asking. To double check: you want to compare each value of C (individually) against all values of D, and get the largest index of D (let's call it k) that holds a value no greater than C[j]. You then want to use it to assign the corresponding value of B to A, so that A[j] = B[k]. Is this correct?
I don't have an answer regarding how to vectorize what you want to do, but do have some suggestions on how to speed it up. Before that, let me ask whether it's actually worth going through the effort. For the larger example you mentioned (N~1000, M~2500), your loop still runs in well under a second on my laptop. Unless this calculation is done many times over inside another loop, it seems like unnecessary optimization...
Also, as @Roland pointed out, it's not clear what should happen if there is a value in C that's smaller than all values in D. These functions (including your original loop) will not work if that happens and would need some slight tweaking.
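To make that corner case concrete, here is what the original expression does when C[j] is below min(D) (a quick demonstration with a made-up value of 0.01):
max(which(0.01 >= D))
# [1] -Inf
# Warning message:
# In max(which(0.01 >= D)) : no non-missing arguments to max; returning -Inf
The resulting -Inf is not a meaningful subscript for B, hence the need for tweaking.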
Anyway, these are my suggestions:
First, let me wrap your loop into a function for convenience.
f_loop <- function(B, C, D){
  N <- length(C)
  A <- matrix(0, ncol=1, nrow=N)
  for(j in 1:N) {
    A[j,1] = B[max(which(C[j]>=D))]
  }
  return(A)
}
If you want it to look a bit more "R-like" you can replace the loop with one of the *apply functions. In this case, it also runs slightly faster than the loop.
vapply(C, function(x) B[max(which(x>=D))], 0)
## Wrapped into a function for easier reference
f_vapply <- function(B, C, D){
  vapply(C, function(x) B[max(which(x>=D))], 0)
}
My other suggestion is uglier (and not really "R-like"), but can help speed things up a lot (if that's the end goal here). I used the inline package to create a compiled version of your loop (note that depending on your OS and R setup, you may need to download additional tools or packages to be able to compile code).
## Translate loop into Fortran
loopcode <-
" integer i, j, k
  do i = 1, n
    k = 0
    do j = 1, m
      if (C(i) >= D(j)) k = j
    end do
    A(i) = B(k)
  end do
"
## Compile into function
library(inline)
loopfun <- cfunction(
  sig = signature(A="numeric", B="numeric", C="numeric", D="numeric",
                  n="integer", m="integer"),
  dim = c("(n)", "(m)", "(n)", "(m)", "", ""),
  loopcode, language = "F95")
## Wrap into function for easier reference
f_compiled <- function(B, C, D){
  A <- C
  n <- length(A)
  m <- length(B)
  out <- loopfun(A, B, C, D, n, m)
  return(matrix(out$A, ncol = 1))  # matrix() rather than as.matrix(), which ignores ncol
}
Let's check that the results all match:
cbind(A, f_loop(B, C, D), f_vapply(B, C, D), f_compiled(B, C, D))
## [,1] [,2] [,3] [,4]
## [1,] 1.0000000 1.0000000 1.0000000 1.0000000
## [2,] 0.9565217 0.9565217 0.9565217 0.9565217
## [3,] 0.9565217 0.9565217 0.9565217 0.9565217
## [4,] 0.9565217 0.9565217 0.9565217 0.9565217
## [5,] 0.7173913 0.7173913 0.7173913 0.7173913
## [6,] 0.7173913 0.7173913 0.7173913 0.7173913
## [7,] 0.7173913 0.7173913 0.7173913 0.7173913
## [8,] 0.6277174 0.6277174 0.6277174 0.6277174
## [9,] 0.5230978 0.5230978 0.5230978 0.5230978
## [10,] 0.3923234 0.3923234 0.3923234 0.3923234
And check the speed:
library(microbenchmark)
microbenchmark(f_loop(B, C, D), f_vapply(B, C, D), f_compiled(B, C, D))
## Unit: microseconds
## expr min lq mean median uq max neval cld
## f_loop(B, C, D) 52.804 54.8075 57.34588 56.5420 58.4615 83.843 100 c
## f_vapply(B, C, D) 38.677 41.5055 43.21231 42.8825 44.1525 65.355 100 b
## f_compiled(B, C, D) 17.095 18.2775 20.55372 20.1770 21.4710 66.407 100 a
We can also try it with vectors of similar size to the larger ones you mentioned (note the change in units for the results):
## Make the vector larger for benchmark
B <- rep(B, 100) # M = 2500
C <- rep(C, 100) # N = 1000
D <- rep(D, 100) # M = 2500
microbenchmark(f_loop(B, C, D), f_vapply(B, C, D), f_compiled(B, C, D))
## Unit: milliseconds
## expr min lq mean median uq max neval cld
## f_loop(B, C, D) 24.380069 24.85061 25.99855 25.839282 25.952433 62.75721 100 b
## f_vapply(B, C, D) 23.543749 24.18427 25.34881 25.015859 25.179924 62.60746 100 b
## f_compiled(B, C, D) 1.976611 2.01403 2.06750 2.032864 2.057594 3.13658 100 a
EDIT:
I realized that if you always want the largest index of D for which C[j]>=D holds, of course it makes much more sense to loop through D starting from the end of the array, and exiting as soon as the first instance is found (instead of looping through the full array).
This is a small tweak to the Fortran code I wrote above that takes advantage of that.
loopcode <-
" integer i, j, k
  do j = 1, n
    k = 0
    do i = m, 1, -1
      if (C(j) >= D(i)) then
        k = i
        exit
      end if
    end do
    A(j) = B(k)
  end do
"
I won't include it in the benchmarks, because its timing will depend heavily on the actual data points. But it is clear that the worst-case behavior is the same as the previous loop (e.g. if the index of interest occurs at the beginning, D is looped through in full) and the best case almost completely eliminates the looping through D (e.g. if the condition holds at the end of the array).
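The same early-exit idea can also be written in plain (uncompiled) R with base Position(), which scans from the right and stops at the first hit. A hedged sketch, not benchmarked here; being interpreted, it won't match the compiled version's speed, and for unsorted D the worst case is still a full scan:
f_position <- function(B, C, D){
  vapply(C, function(x) B[Position(function(d) x >= d, D, right = TRUE)], 0)
}
f_position(B, C, D)  ## same values as f_loop(B, C, D)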
Related
I am looking for an efficient way to combine selected columns in a logical matrix by "ANDing" them together and ending up with a new matrix. An example of what I am looking for:
matrixData <- rep(c(TRUE, TRUE, FALSE), 8)
exampleMatrix <- matrix(matrixData, nrow=6, ncol=4, byrow=TRUE)
exampleMatrix
[,1] [,2] [,3] [,4]
[1,] TRUE TRUE FALSE TRUE
[2,] TRUE FALSE TRUE TRUE
[3,] FALSE TRUE TRUE FALSE
[4,] TRUE TRUE FALSE TRUE
[5,] TRUE FALSE TRUE TRUE
[6,] FALSE TRUE TRUE FALSE
The columns to be ANDed to each other are specified in a numeric vector of length ncol(exampleMatrix), where the columns to be grouped together ANDed have the same value (a value from 1 to n, where n <= ncol(exampleMatrix) and every value in 1:n is used at least once). The resulting matrix should have the columns in order from 1:n. For example, if the vector that specifies the column groups is
colGroups <- c(3, 2, 2, 1)
Then the resulting matrix would be
[,1] [,2] [,3]
[1,] TRUE FALSE TRUE
[2,] TRUE FALSE TRUE
[3,] FALSE TRUE FALSE
[4,] TRUE FALSE TRUE
[5,] TRUE FALSE TRUE
[6,] FALSE TRUE FALSE
Where in the resulting matrix
[,1] = exampleMatrix[,4]
[,2] = exampleMatrix[,2] & exampleMatrix[,3]
[,3] = exampleMatrix[,1]
My current way of doing this looks basically like this:
finalMatrix <- matrix(TRUE, nrow=nrow(exampleMatrix), ncol=3)
for (i in 1:3){
  selectedColumns <- exampleMatrix[,colGroups==i, drop=FALSE]
  finalMatrix[,i] <- rowSums(selectedColumns)==ncol(selectedColumns)
}
Where rowSums(selectedColumns)==ncol(selectedColumns) is an efficient way to AND all of the columns of a matrix together.
My problem is that I am doing this on very big matrices (millions of rows) and I am looking for any way to make this quicker. My first instinct was to use apply in some way, but I can't see how that would improve efficiency: the for loop body runs only a few times; it is the operation inside the loop that is slow.
In addition, any tips to reduce memory allocation would be very useful, as I currently have to run gc() within the loop frequently to avoid running out of memory completely, and it is a very expensive operation that significantly slows everything down as well. Thanks!
For a more representative example, this is a much larger exampleMatrix:
matrixData <- rep(c(TRUE, TRUE, FALSE), 8e7)
exampleMatrix <- matrix(matrixData, nrow=6e7, ncol=4, byrow=TRUE)
From your example, I understand that there are very few columns and very many rows. In this case, it'll be efficient to just do a simple loop over colGroups (30% improvement over your suggestion):
## finalMatrix must start out all TRUE, as in the question's own code
for (jj in seq_along(colGroups))
  finalMatrix[ , colGroups[jj]] <-
    finalMatrix[ , colGroups[jj]] & exampleMatrix[ , jj]
I think it will be hard to beat this without parallelizing. The loop is parallelizable if there are more columns, though the parallelization would have to be done a bit carefully, in batches.
As far as I can tell, this is an aggregation across columns using the all function. So if you transpose to rows, then use colGroups as the grouping factor to apply all, then transpose back to columns, you should get the intended result:
t(aggregate(t(exampleMatrix), list(colGroups), FUN=all)[-1])
# [,1] [,2] [,3]
#V1 TRUE FALSE TRUE
#V2 TRUE FALSE TRUE
#V3 FALSE TRUE FALSE
#V4 TRUE FALSE TRUE
#V5 TRUE FALSE TRUE
#V6 FALSE TRUE FALSE
The [-1] just drops the group-identifier variable which you don't require in the final output.
If you're working with stupid big data, the by-group aggregation could be done in data.table as well:
library(data.table)
t(as.data.table(t(exampleMatrix))[, lapply(.SD,all), by=colGroups][,-1])
Greetings good people of Stackland!
Recently I was given this task
1. Generate the Fibonacci sequence in any language
2. Evaluate whether each value is odd or even
3. Sum the even numbers such that their total is not >500,000
I chose to do this in R, as I am learning the language and thought it would be a good exercise.
I have managed to complete step 2 of the task but haven't been able to proceed any further. Please see code and comments below.
len <- 50
fibvals <- numeric(len)
fibvals[1] <- 1
fibvals[2] <- 1
for(i in 3:len) { fibvals[i] <- fibvals[i-1]+fibvals[i-2]}
fibvals
[1] 1 1 2 3 5
[6] 8 13 21 34 55
[11] 89 144 233 377 610
[16] 987 1597 2584 4181 6765
[21] 10946 17711 28657 46368 75025
[26] 121393 196418 317811 514229 832040
[31] 1346269 2178309 3524578 5702887 9227465
[36] 14930352 24157817 39088169 63245986 102334155
[41] 165580141 267914296 433494437 701408733 1134903170
[46] 1836311903 2971215073 4807526976 7778742049 12586269025
# Creates a variable called len in which the value 50 is stored
# Creates a var called fibvals, which is a numeric datatype, which should have len (50) vals
# Sets the value of the first entry in fibvals to 1
# Sets the value of the second entry in fibvals to 1
# Loop - "for (i in 3:len)" dictates that the loop should be executed between step 3 and step 50 (denoted by "len")
# Loop - Defines entry "i" as the sum of the entry before it (i-1) and the entry two before it (i-2)
# Loop - Example: 5 = 2 + 3 | Example: 21 = 8 + 13
is.even <- function(x){ x %% 2 == 0 }
# Creates a UDF to check if values are odd or even by using modulo:
# if the remainder is 0 when a value is divided by 2, it is even
is.even(fibvals)
[1] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE
[11] FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE
[21] TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE
[31] FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE
[41] FALSE TRUE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE
# Evaluates all Fibonacci values on odd or even property
What I need is a bit of guidance as to where I should go from here.
Should I create a data.table and query that using the SQL package, or is there a much more elegant and less cumbersome way?
Thanks in advance!
To sort out the even numbers from the first 50 Fibonacci numbers you can use this:
even_numbers <- fibvals[fibvals%%2==0]
Then, by computing the cumulative sum of those even numbers and imposing the condition on the maximum value of the sum, you can select those even numbers like this:
cumsum(even_numbers)<500000
Therefore your desired Fibonacci numbers are
even_numbers[cumsum(even_numbers)<500000]
and their sum is
sum(even_numbers[cumsum(even_numbers)<500000])
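Running this against the fibvals vector built in the question, my own run of that last line returns (worth double-checking on your side):
# [1] 257114
the even values kept being 2, 8, 34, 144, 610, 2584, 10946, 46368 and 196418.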
This would do it:
fsum <- 0
for (i in 1:len) {
  if (is.even(fibvals[i]) && (fsum + fibvals[i]) <= 500000) {
    fsum <- fsum + fibvals[i]
  }
}
The sum would then be stored in fsum.
Here's a way to do it with a recursive function:
getEvenWithFibber <- function(y = c(1,1),
                              s = 0,
                              threshold = 500000) {
  if (s + y[1] + y[2] < threshold)
    getEvenWithFibber(y = c(y[1] + y[2], y),
                      s = s + ifelse(y[1] %% 2 == 0, y[1], 0))
  else
    list(sum = s, seq = y, iseven = y %% 2 == 0)
}
getEvenWithFibber()
I am trying to compare 1st row of a matrix with all rows of the same matrix. But the vectorized comparison is not returning correct results. Any reason why this may be happening?
m <- matrix(c(1,2,3,1,2,4), nrow=2, ncol=3, byrow=TRUE)
> m
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 1 2 4
> # Why does the first row not have 3 TRUE values?
> m[1,] == m
[,1] [,2] [,3]
[1,] TRUE FALSE FALSE
[2,] FALSE FALSE FALSE
> m[1,] == m[1,]
[1] TRUE TRUE TRUE
> m[1,] == m[2,]
[1] TRUE TRUE FALSE
Follow-up: in my actual data I have a large number of rows (at least 10 million), so both time and memory add up. Any additional suggestions on the approaches suggested below by others?
m <- matrix(rep(c(1,2,3), 1000000), ncol=3, byrow=TRUE)
> # by @alexis_laz
> m1 <- matrix(m[1,], nrow = nrow(m), ncol = ncol(m), byrow = T)
> system.time(m == m1)
user system elapsed
0.21 0.03 0.31
> object.size(m1)
24000112 bytes
> # by @PaulHiemstra
> system.time( t(apply(m, 1, function(x) x == m[1,])) )
user system elapsed
35.18 0.08 36.04
Follow-up 2: @alexis_laz, you are correct. I want to compare every row with each other and have posted a follow-up question on that (How to vectorize comparing each row of matrix with all other rows).
In the comparison m[1,] == m, the first term m[1,] is recycled (once) to equal the length of m, and the comparison is then done element-wise in column-major order.
You're comparing c(1,2,3) with c(1,1,2,2,3,4), thus c(1,2,3,1,2,3) with c(1,1,2,2,3,4), so you have one TRUE followed by five FALSE (packaged as a matrix to match the dimensions of m).
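You can see the recycling explicitly by flattening both sides yourself (a small demonstration using the m defined above):
as.vector(m)                          # 1 1 2 2 3 4  (column-major order)
rep(m[1,], length.out = length(m))    # 1 2 3 1 2 3  (recycled)
as.vector(m) == rep(m[1,], length.out = length(m))
# [1]  TRUE FALSE FALSE FALSE FALSE FALSE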
As @MatthewLundberg pointed out, the recycling rules of R do not behave as you expected. In my opinion it is always better to explicitly state what to compare and not rely on R's assumptions. One way to make the correct comparison:
t(apply(m, 1, function(x) x == m[1,]))
[,1] [,2] [,3]
[1,] TRUE TRUE TRUE
[2,] TRUE TRUE FALSE
or:
m == rbind(m[1,], m[1,])
[,1] [,2] [,3]
[1,] TRUE TRUE TRUE
[2,] TRUE TRUE FALSE
or by making R's recycling work in your favor (thanks to @Arun):
t(t(m) == m[1,])
[,1] [,2] [,3]
[1,] TRUE TRUE TRUE
[2,] TRUE TRUE FALSE
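For the record, sweep() expresses the same row-wise comparison without the manual double transpose (a hedged alternative, not necessarily faster):
sweep(m, 2, m[1,], FUN = "==")
[,1] [,2] [,3]
[1,] TRUE TRUE TRUE
[2,] TRUE TRUE FALSE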
I have a matrix A,
A = as.matrix(data.frame(col1 = c(1,1,2,3,1,2), col2 = c(-1,-1,-2,-3,-1,-2), col3 = c(2,6,1,3,2,4)))
And I have a vector v,
v = c(-1, 2)
How can I get a vector of TRUE/FALSE that compares the last two columns of the matrix and returns TRUE if the last two columns match the vector, or false if they don't?
I.e., if I try
A[,c(2:3)] == v
I obtain,
col2 col3
[1,] TRUE FALSE
[2,] FALSE FALSE
[3,] FALSE FALSE
[4,] FALSE FALSE
[5,] TRUE FALSE
[6,] FALSE FALSE
This is not what I want; I want both columns to match the vector v, more like
result = c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE)
since the first and fifth rows match the vector v entirely.
Here's a simple alternative
> apply(A[, 2:3], 1, function(x) all(x==v))
[1] TRUE FALSE FALSE FALSE TRUE FALSE
Oops, by looking into the R mailing list I found an answer: https://stat.ethz.ch/pipermail/r-help/2010-September/254096.html
check.equal <- function(x, y) {
  isTRUE(all.equal(y, x, check.attributes=FALSE))
}
result = apply(A[,c(2:3)], 1, check.equal, y=v)
I'm not sure I need to define a function and do all that; maybe there are easier ways to do it.
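One loop-free variant (my own suggestion, not from the mailing list post): compare on the transposed matrix so that v recycles down each column, then require every element of a row to match.
colSums(t(A[, 2:3]) == v) == length(v)
# [1] TRUE FALSE FALSE FALSE TRUE FALSE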
Here's another straightforward option:
which(duplicated(rbind(A[, 2:3], v), fromLast=TRUE))
# [1] 1 5
results <- rep(FALSE, nrow(A))
results[which(duplicated(rbind(A[, 2:3], v), fromLast=TRUE))] <- TRUE
results
# [1] TRUE FALSE FALSE FALSE TRUE FALSE
Alternatively, as one line:
duplicated(rbind(A[, 2:3], v), fromLast=TRUE)[-(nrow(A)+1)]
# [1] TRUE FALSE FALSE FALSE TRUE FALSE
A dirty one:
result <- c()
for (n in 1:nrow(A)) {
  result[n] <- sum(A[n, -1] == v) == 2
}
> result
[1] TRUE FALSE FALSE FALSE TRUE FALSE
I have a logical vector, for which I wish to insert new elements at particular indexes. I've come up with a clumsy solution below, but is there a neater way?
probes <- rep(TRUE, 15)
ind <- c(5, 10)
probes.2 <- logical(length(probes)+length(ind))
probes.ind <- ind + 1:length(ind)
probes.original <- (1:length(probes.2))[-probes.ind]
probes.2[probes.ind] <- FALSE
probes.2[probes.original] <- probes
print(probes)
gives
[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
and
print(probes.2)
gives
[1] TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE
[13] TRUE TRUE TRUE TRUE TRUE
So it works but is ugly looking - any suggestions?
These are all very creative approaches. I think working with indexes is definitely the way to go (Marek's solution is very nice).
I would just mention that there is a function to do roughly that: append().
probes <- rep(TRUE, 15)
probes <- append(probes, FALSE, after=5)
probes <- append(probes, FALSE, after=11)
Or you could do this recursively with your indexes (you need to grow the "after" value on each iteration):
probes <- rep(TRUE, 15)
ind <- c(5, 10)
for(i in 0:(length(ind)-1))
probes <- append(probes, FALSE, after=(ind[i+1]+i))
Incidentally, this question was also previously asked on R-Help. As Barry says:
"Actually I'd say there were no ways of doing this, since I dont think you can actually insert into a vector - you have to create a new vector that produces the illusion of insertion!"
You can do some magic with indexes:
First create a vector with the output values:
probs <- rep(TRUE, 15)
ind <- c(5, 10)
val <- c( probs, rep(FALSE,length(ind)) )
# > val
# [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# [13] TRUE TRUE TRUE FALSE FALSE
Now the trick: each old element gets an integer rank, and each new element gets a half-rank.
id <- c( seq_along(probs), ind+0.5 )
# > id
# [1] 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0 10.0 11.0 12.0 13.0 14.0 15.0
# [16] 5.5 10.5
Then use order() to sort into the proper order:
val[order(id)]
# [1] TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE
# [13] TRUE TRUE TRUE TRUE TRUE
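The trick generalizes nicely; here it is wrapped into a small helper (insert_at is a name made up for this sketch, not an existing function):
insert_at <- function(x, ind, value = FALSE) {
  val <- c(x, rep(value, length(ind)))   # old values followed by the new ones
  id  <- c(seq_along(x), ind + 0.5)      # integer ranks and half-ranks
  val[order(id)]
}
insert_at(rep(TRUE, 15), c(5, 10))  ## same output as above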
probes <- rep(TRUE, 1000000)
ind <- c(50:100)
val <- rep(FALSE,length(ind))
new.probes <- vector(mode="logical",length(probes)+length(val))
new.probes[-ind] <- probes
new.probes[ind] <- val
Some timings:
My method
user system elapsed
0.03 0.00 0.03
Marek method
user system elapsed
0.18 0.00 0.18
R append with for loop
user system elapsed
1.61 0.48 2.10
How about this:
> probes <- rep(TRUE, 15)
> ind <- c(5, 10)
> probes.ind <- rep(NA, length(probes))
> probes.ind[ind] <- FALSE
> new.probes <- as.vector(rbind(probes, probes.ind))
> new.probes <- new.probes[!is.na(new.probes)]
> new.probes
[1] TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE
[13] TRUE TRUE TRUE TRUE TRUE
That is sorta tricky. Here's one way. It iterates over the list, inserting each time, so it's not too efficient.
probes <- rep(TRUE, 15)
probes.ind <- ind + 0:(length(ind)-1)
for (i in probes.ind) {
  probes <- c(probes[1:i], FALSE, probes[(i+1):length(probes)])
}
> probes
[1] TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE TRUE TRUE TRUE FALSE
[13] TRUE TRUE TRUE TRUE TRUE
This should even work if ind has repeated elements, although ind does need to be sorted for the probes.ind construction to work.
Or you can do it using the insertRow function from the miscTools package. Note that insertRow(x, r, v) makes v the new row r, so the insertion point has to be shifted past the target index and by the number of values already inserted:
library(miscTools)
probes <- rep(TRUE, 15)
ind <- c(5, 10)
for (k in seq_along(ind)) {
  probes <- as.vector(insertRow(as.matrix(probes), ind[k] + k, FALSE))
}
I came up with a good answer that's easy to understand and fairly fast to run, building off Wojciech's answer above. I'll adapt the method for the example here, but it can be easily generalized to pretty much any data type for an arbitrary pattern of missing points (shown below).
probes <- rep(TRUE, 15)
ind <- c(5,10)
probes.final <- rep(FALSE, length(probes)+length(ind))
probes.final[-ind] <- probes
(Note this leaves the FALSE values at positions ind themselves; to reproduce probes.2 from the question, with the insertions landing after those positions, use probes.final[-(ind + seq_along(ind))] <- probes instead.)
The data I needed this for is sampled at a regular interval, but many samples are thrown out, and the resulting data file only includes the timestamps and measurements for those retained. I needed to produce a vector containing all the timestamps and a data vector with NAs inserted for timestamps that were tossed. I used the "not in" function stolen from here to make it a bit simpler.
`%notin%` <- Negate(`%in%`)
dat <- rnorm(50000) # Data given
times <- seq(from=554.3, by=0.1, length.out=70000) # "Original" time stamps
times <- times[-sample(2:69999, 20000)] # "Given" times with arbitrary points missing from interior
times.final <- seq(from=times[1], to=times[length(times)], by=0.1)
na.ind <- which(times.final %notin% times)
dat.final <- rep(NA, length(times.final))
dat.final[-na.ind] <- dat
Um, hi, I had the same doubt, but I couldn't understand what people had answered, because I'm still learning the language. So I tried to make my own, and I suppose it works! I created a vector and I wanted to insert the value 100 after the 3rd, 5th and 6th indexes. This is what I wrote.
vector <- c(0:9)
indexes <- c(6, 3, 5)
indexes <- indexes[order(indexes)]
i <- 1
j <- 0
while (i <= length(indexes)) {
  vector <- append(vector, 100, after = indexes[i] + j)
  i <- i + 1
  j <- j + 1
}
vector
The vector "indexes" must be in ascending order for this to work. This is why I put them in order at the third line.
The variable "j" is necessary because at each iteration, the length of the new vector increases and the original values are moved.
In case you wish to insert the new values next to each other, simply repeat the index. For instance, by assigning indexes <- c(3, 5, 5, 5, 6), you should get vector == 0 1 2 100 3 4 100 100 100 5 100 6 7 8 9