Combining two vectors element-by-element in R

I have 2 vectors, such as these:
A <- c(1,2,NA,NA,NA,NA,7)
B <- c(NA,NA,3,4,NA,NA,7)
I would like to combine them so that the resulting vector is
1,2,3,4,NA,NA,-1
That is
when only 1 value (say X) in either vector at position i exists (the other being NA) the new vector should take the value X at position i.
when both values are NA at position i, the new vector should take the value NA at position i
when both vectors have a value at position i, the new vector should take the value -1 at position i.
I can easily do this with a loop, but it is very slow on a large dataset, so can anyone suggest a fast way to do this?
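For reference, the loop approach mentioned above might look something like this (my own sketch of the slow baseline, not code from the question):
combine_loop <- function(A, B) {
  X <- rep(NA_real_, length(A))
  for (i in seq_along(A)) {
    if (is.na(A[i]) && is.na(B[i])) {
      X[i] <- NA        # both missing
    } else if (is.na(A[i])) {
      X[i] <- B[i]      # only B has a value
    } else if (is.na(B[i])) {
      X[i] <- A[i]      # only A has a value
    } else {
      X[i] <- -1        # both have a value
    }
  }
  X
}
combine_loop(A, B)
#[1]  1  2  3  4 NA NA -1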

These commands create the vector:
X <- A
X[is.na(A)] <- B[is.na(A)]
X[is.na(B)] <- A[is.na(B)]
X[!is.na(A & B)] <- -1
#[1] 1 2 3 4 NA NA -1

A <- c(1,2,NA,NA,NA,NA,7)
B <- c(NA,NA,3,4,NA,NA,7)
C <- rowMeans(cbind(A,B),na.rm=TRUE)
C[which(!is.na(A*B))]<- -1
#[1] 1 2 3 4 NaN NaN -1
Benchmarks:
Unit: microseconds
expr min lq median uq max
1 Roland(A, B) 17.863 19.095 19.710 20.019 68.985
2 Sven(A, B) 11.703 13.243 14.167 14.783 100.398
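For reference, a benchmark like the one above can be reproduced roughly as follows, by wrapping the two answers in functions and calling microbenchmark (which formula belongs to which name is my assumption based on the output above):
library(microbenchmark)
Sven <- function(A, B) {
  X <- A
  X[is.na(A)] <- B[is.na(A)]
  X[is.na(B)] <- A[is.na(B)]
  X[!is.na(A & B)] <- -1
  X
}
Roland <- function(A, B) {
  C <- rowMeans(cbind(A, B), na.rm = TRUE)
  C[which(!is.na(A * B))] <- -1
  C
}
microbenchmark(Roland(A, B), Sven(A, B))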

A bit late to the party, but here is another option: a function that applies the rules row by row to the two vectors cbind-ed together.
# get the data
A <- c(1,2,NA,NA,NA,NA,7)
B <- c(NA,NA,3,4,NA,NA,7)
# define the function
process <- function(A, B) {
  x <- cbind(A, B)
  apply(x, 1, function(x) {
    if (sum(is.na(x)) == 1) {
      na.omit(x)
    } else if (all(is.na(x))) {
      NA
    } else if (!any(is.na(x))) {
      -1
    }
  })
}
# call the function
process(A,B)
#[1] 1 2 3 4 NA NA -1
The main benefit of wrapping this in a function is that it is easy to update the rules or the inputs when applying the code to new data.

Related

Check if list contains another list in R

I want to check if a list (or a vector, equivalently) is contained in another one, not whether it is a subset thereof. Let us assume we have
r <- c(1,1)
s <- c(5,2)
t <- c(1,2,5)
The function should behave as follows:
is.contained(r,t)
[1] FALSE
# as (1,1) is not contained in (1,2,5) since the former
# contains two 1 whereas the latter only one.
is.contained(s,t)
[1] TRUE
The operator %in% checks for subsets, hence it would return TRUE in both cases, likewise all or any. I am sure there is a one-liner but I just do not see it.
How about using a loop? I iterate over the first vector and check whether each element is present in the second vector. If it is there, I remove it from the second vector, and the process continues.
is.contained <- function(vec1, vec2) {
  x <- vector(length = length(vec1))
  for (i in 1:length(vec1)) {
    x[i] <- vec1[i] %in% vec2
    # drop the matched element so repeated values are only counted once
    if (x[i]) vec2 <- vec2[-match(vec1[i], vec2)]
  }
  all(x)
}
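With the vectors from the question, this gives the required behaviour (output shown for illustration):
is.contained(r, t)
# [1] FALSE
is.contained(s, t)
# [1] TRUE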
The set functions (e.g. intersect, union, etc.) from base R give results consistent with set theory. Sets technically don't have repeating elements, so the vectors c(1,1,2) and c(1,2) are considered the same as far as sets are concerned (see Set (Mathematics)). This is the main problem this question faces, and it is why some of the solutions posted here fail (including my previous attempts). The solution to the OP's question lies somewhere between understanding sets and sequences. Although sequences allow repetition, order matters in a sequence, and here we don't care about order (just as order doesn't matter in sets).
Below, I have provided a vector intersect function (VectorIntersect) that returns all of the common elements between two vectors regardless of order or presence of duplicates. Also provided is a containment function called is.contained, which calls VectorIntersect and determines whether all of the elements in one vector are present in another vector.
VectorIntersect <- function(v, z) {
  unlist(lapply(unique(v[v %in% z]), function(x) rep(x, min(sum(v == x), sum(z == x)))))
}
is.contained <- function(v,z) {length(VectorIntersect(v,z))==length(v)}
Let's look at a simple example:
r <- c(1, 1)
s <- c(rep(1, 5), rep("a", 5))
s
[1] "1" "1" "1" "1" "1" "a" "a" "a" "a" "a"
VectorIntersect(r, s)
[1] 1 1
is.contained(r, s) ## r is contained in s
[1] TRUE
is.contained(s, r) ## s is not contained in r
[1] FALSE
is.contained(s, s) ## s is contained in itself.. more on this later
[1] TRUE
Now, let's look at @Gennaro's clever recursive approach, which gives correct results (many apologies and also many kudos... on earlier tests, I was under the impression that it was checking whether b was contained in s and not the other way around):
fun.contains(s, r) ## s contains r
[1] TRUE
fun.contains(r, s) ## r does not contain s
[1] FALSE
fun.contains(s, s) ## s contains s
[1] TRUE
We will now step through the other set-based algorithms and see how they handle r and s above. I have added print statements to each function for clarity. First, @Jilber's function.
is.containedJilber <- function(x, y) {
  temp <- intersect(x, y)
  print(temp); print(length(x)); print(length(temp)); print(all.equal(x, temp))
  out <- ifelse(length(x) == length(temp), all.equal(x, temp), FALSE)
  return(out)
}
is.containedJilber(r, s) ## should return TRUE but does not
[1] "1" ## result of intersect
[1] 2 ## length of r
[1] 1 ## length of temp
## results from all.equal.. gives weird results because lengths are different
[1] "Modes: numeric, character" "Lengths: 2, 1" "target is numeric, current is character"
[1] FALSE ## results from the fact that 2 does not equal 1
is.containedJilber(s, s) ## should return TRUE but does not
[1] "1" "a" ## result of intersect
[1] 10 ## length of s
[1] 2 ## length of temp
## results from all.equal.. again, gives weird results because lengths are different
[1] "Lengths (10, 2) differ (string compare on first 2)" "1 string mismatch"
[1] FALSE ## results from the fact that 10 does not equal 2
Here is @Simon's:
is.containedSimon <- function(x, y) {
  print(setdiff(x, y))
  z <- x[x %in% setdiff(x, y)]
  print(z); print(length(x)); print(length(y)); print(length(z))
  length(z) == length(x) - length(y)
}
is.containedSimon(s, r) ## should return TRUE but does not
[1] "a" ## result of setdiff
[1] "a" "a" "a" "a" "a" ## the elements in s that match the result of setdiff
[1] 10 ## length of s
[1] 2 ## length of r
[1] 5 ## length of z
[1] FALSE ## result of 5 not being equal to 10 - 2
Hopefully this illustrates the pitfalls of applying strict set operations in this setting.
Let's test for efficiency and equality. Below, we build many test vectors and check to see if they are contained in either the vector testContainsNum (if it's a number vector) or testContainsAlpha (if it is a character vector):
set.seed(123)
testContainsNum <- sample(20:40, 145, replace=TRUE) ## generate large vector with random numbers
testContainsAlpha <- sample(letters, 175, replace=TRUE) ## generate large vector with random letters
tVec <- lapply(1:1000, function(x) {  ## generating test data..
  if (x %% 2 == 0) {
    sample(20:40, sample(50:100, 1), replace=TRUE)   ## even indices will contain numbers
  } else {
    sample(letters, sample(50:90, 1), replace=TRUE)  ## odd indices will contain characters
  }
})
tContains <- lapply(1:1000, function(x) if (x%%2==0) {testContainsNum} else {testContainsAlpha})
## First check equality
tJoe <- mapply(is.contained, tVec, tContains)
tGennaro <- mapply(fun.contains, tContains, tVec)
tSimon <- mapply(is.containedSimon, tContains, tVec)
tJilber <- mapply(is.containedJilber, tVec, tContains)
all(tJoe==tGennaro) ## Give same results
[1] TRUE
## Neither Jilber's nor Simon's solution returns any TRUE values
any(tJilber)
[1] FALSE
any(tSimon)
[1] FALSE
## There should be 170 TRUEs
sum(tJoe)
[1] 170
Let's take a closer look to determine if is.contained and fun.contains are behaving correctly.
table(tVec[[3]])
a b c e f g h i j k l m n o p q r t u v w x y z
3 4 5 2 2 1 5 3 5 3 2 1 7 3 1 2 4 3 5 5 2 4 3 3
table(tContains[[3]])
a b c d e f g h i j k l m n o p q r s t u v w x y z
4 11 4 3 7 8 13 4 4 9 13 3 10 7 7 4 8 7 8 6 7 5 9 4 4 6
## Note above that tVec[[3]] has 1 more c and h than tContains[[3]],
## thus tVec[[3]] is not contained in tContains[[3]]
c(tJoe[3], tGennaro[3])
[1] FALSE FALSE ## This is correct!!!!
table(tVec[[14]])
20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
6 4 4 7 6 3 4 6 3 5 4 4 6 4 4 2 2 5 3 1 4
table(tContains[[14]])
20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40
6 4 10 7 6 4 10 6 8 10 5 5 6 9 8 5 7 5 11 4 9
## Note above that every element in tVec[[14]] is present in
## tContains[[14]] and the number of occurrences is less than or
## equal to the occurrences in tContains[[14]]. Thus, tVec[[14]]
## is contained in tContains[[14]]
c(tJoe[14], tGennaro[14])
[1] TRUE TRUE ## This is correct!!!!
Here are the benchmarks:
library(microbenchmark)
microbenchmark(Joe = mapply(is.contained, tVec, tContains),
Gennaro = mapply(fun.contains, tContains, tVec))
Unit: milliseconds
expr min lq mean median uq max neval cld
Joe 165.0310 172.7817 181.3722 178.7014 187.0826 230.2806 100 a
Gennaro 249.8495 265.4022 279.0866 273.5923 288.1159 336.8464 100 b
Side Note about VectorIntersect()
After spending a good bit of time with this problem, it became increasingly clear that separating VectorIntersect from is.contained is tremendously valuable. In my own work, the need to obtain the intersection of two vectors without removing duplicates comes up frequently, and the methods I had used were often messy and probably not reliable (easy to see why after this!). This is why VectorIntersect is a great utility function in addition to is.contained.
Update
Actually, @Gennaro's solution can be improved quite a bit by computing s[duplicated(s)] only once as opposed to three times (similarly, b[duplicated(b)] and the length of the duplicated values are each computed once instead of twice).
fun.containsFAST <- function(b, s) {
  dupS <- s[duplicated(s)]; dupB <- b[duplicated(b)]
  lenS <- length(dupS)
  all(s %in% b) && lenS <= length(dupB) &&
    (if (lenS > 0) fun.containsFAST(dupB, dupS) else 1)
}
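As a quick sanity check (this check is my addition, not part of the original benchmark), the faster version should agree with fun.contains everywhere on the test data generated above, since both implement the same recursion:
all(mapply(fun.containsFAST, tContains, tVec) == tGennaro)
# expected: [1] TRUE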
microbenchmark(Joe = mapply(is.contained, tVec, tContains),
GenFAST = mapply(fun.containsFAST, tContains, tVec),
Gennaro = mapply(fun.contains, tContains, tVec))
Unit: milliseconds
expr min lq mean median uq max neval cld
Joe 163.3529 172.1050 182.3634 177.2324 184.9622 293.8185 100 b
GenFAST 145.3982 157.7183 169.3290 164.7898 173.4063 277.1561 100 a
Gennaro 243.2416 265.8270 281.1472 273.5323 284.8820 403.7249 100 c
Update 2
What about testing containment for really big vectors? The function I provided is not likely to perform well, as building the "intersection" (with duplicates etc.) by essentially looping over the true set intersection isn't very efficient. The modified version of @Gennaro's function won't be fast either, because for very large vectors with many duplicates the recursive calls can get nested pretty deep. With this in mind, I built yet another containment function that is specifically designed for dealing with large vectors. It utilizes vectorized base R functions, most notably pmin.int, which returns the parallel minima of multiple vectors. The inner function myL is taken directly from the guts of the rle function in base R (although slightly modified for this specific use).
is.containedBIG <- function(v, z) { ## v and z must be sorted
  myL <- function(x) {LX <- length(x); diff(c(0L, which(x[-1L] != x[-LX]), LX))}
  sum(pmin.int(myL(v[v %in% z]), myL(z[z %in% v]))) == length(v)
}
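As a quick illustration (my own check, not from the original post), on the small vectors from the question, remembering to sort the inputs first:
is.containedBIG(sort(c(1,1)), sort(c(1,2,5)))  # r vs t
# [1] FALSE
is.containedBIG(sort(c(5,2)), sort(c(1,2,5)))  # s vs t
# [1] TRUE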
Note that on smaller examples is.contained and fun.containsFAST are faster (this is mostly due to the time it takes to repeatedly sort; as you will see, if the data is already sorted, is.containedBIG is much faster). Observe (for thoroughness we also verify @Chirayu's function and test its efficiency):
## we are using tVec and tContains as defined above in the original test
tChirayu <- mapply(is.containedChirayu, tVec, tContains)
tJoeBIG <- sapply(1:1000, function(x) is.containedBIG(sort(tVec[[x]]), sort(tContains[[x]])))
all(tChirayu==tJoe) ## @Chirayu's returns correct results
[1] TRUE
all(tJoeBIG==tJoe) ## specialized algorithm returns correct results
[1] TRUE
microbenchmark(Joe=sapply(1:1000, function(x) is.contained(tVec[[x]], tContains[[x]])),
JoeBIG=sapply(1:1000, function(x) is.containedBIG(sort(tVec[[x]]), sort(tContains[[x]]))),
GenFAST=sapply(1:1000, function(x) fun.containsFAST(tContains[[x]], tVec[[x]])),
Chirayu=sapply(1:1000, function(x) is.containedChirayu(tVec[[x]], tContains[[x]])))
Unit: milliseconds
expr min lq mean median uq max neval cld
Joe 154.6158 165.5861 176.3019 175.4786 180.1299 313.7974 100 a
JoeBIG 269.1460 282.9347 294.1568 289.0174 299.4687 445.5222 100 b ## about 2x as slow as GenFAST
GenFAST 140.8219 150.5530 156.2019 155.8306 162.0420 178.7837 100 a
Chirayu 1213.8962 1238.5666 1305.5392 1256.7044 1294.5307 2619.5370 100 c ## about 8x as slow as GenFAST
Now, with sorted data, the results are quite astonishing. is.containedBIG shows a 3-fold improvement in speed, whereas the other functions return almost identical timings.
## with pre-sorted data
tVecSort <- lapply(tVec, sort)
tContainsSort <- lapply(tContains, sort)
microbenchmark(Joe=sapply(1:1000, function(x) is.contained(tVecSort[[x]], tContainsSort[[x]])),
JoeBIG=sapply(1:1000, function(x) is.containedBIG(tVecSort[[x]], tContainsSort[[x]])),
GenFAST=sapply(1:1000, function(x) fun.containsFAST(tContainsSort[[x]], tVecSort[[x]])),
Chirayu=sapply(1:1000, function(x) is.containedChirayu(tVecSort[[x]], tContainsSort[[x]])))
Unit: milliseconds
expr min lq mean median uq max neval cld
Joe 154.74771 166.46926 173.45399 172.92374 177.09029 297.7758 100 c
JoeBIG 83.69259 87.35881 94.48476 92.07183 98.37235 221.6014 100 a ## now it's the fastest
GenFAST 139.19631 151.23654 159.18670 157.05911 162.85636 275.7158 100 b
Chirayu 1194.15362 1241.38823 1280.10058 1260.09439 1297.44847 1454.9805 100 d
For very large vectors, we have the following (only showing GenFAST and JoeBIG as the other functions will take too long):
set.seed(97)
randS <- sample(10^9, 8.5*10^5)
testBigNum <- sample(randS, 2*10^7, replace = TRUE)
tVecBigNum <- lapply(1:20, function(x) {
sample(randS, sample(1500000:2500000, 1), replace=TRUE)
})
system.time(tJoeBigNum <- sapply(1:20, function(x) is.containedBIG(sort(tVecBigNum[[x]]), sort(testBigNum))))
user system elapsed
74.016 11.351 85.409
system.time(tGennaroBigNum <- sapply(1:20, function(x) fun.containsFAST(testBigNum, tVecBigNum[[x]])))
user system elapsed
662.875 54.238 720.433
sum(tJoeBigNum)
[1] 13
all(tJoeBigNum==tGennaroBigNum)
[1] TRUE
## pre-sorted data
testBigSort <- sort(testBigNum)
tVecBigSort <- lapply(tVecBigNum, sort)
system.time(tJoeBigSort <- sapply(1:20, function(x) is.containedBIG(tVecBigSort[[x]], testBigSort)))
user system elapsed
33.910 10.302 44.289
system.time(tGennaroBigSort <- sapply(1:20, function(x) fun.containsFAST(testBigSort, tVecBigSort[[x]])))
user system elapsed
196.546 54.923 258.079
identical(tJoeBigSort, tGennaroBigSort, tJoeBigNum)
[1] TRUE
Regardless of whether your data is sorted or not, the point of this last test is to show that is.containedBIG is much faster on larger data. An interesting takeaway from this last test is that fun.containsFAST showed a very large improvement in time when the data was sorted. I was under the impression that duplicated (which is the workhorse of fun.containsFAST) did not depend on whether a vector was sorted. The earlier tests confirmed this (the unsorted timings were practically identical to the sorted timings; see above). More research is needed.
How about a recursive method that checks the number of duplicates in each list?
fun.contains <- function(b, s) {
  all(s %in% b) && length(s[duplicated(s)]) <= length(b[duplicated(b)]) &&
    (if (length(s[duplicated(s)]) > 0) fun.contains(b[duplicated(b)], s[duplicated(s)]) else 1)
}
The idea is that one list is contained in another if and only if the list of its duplicates is also contained in the other's duplicates, unless there are no duplicates (in which case the recursion defaults to TRUE).
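Called with the vectors from the question (note the argument order: fun.contains(b, s) asks whether s is contained in b):
r <- c(1,1); s <- c(5,2); t <- c(1,2,5)
fun.contains(t, r)
# [1] FALSE
fun.contains(t, s)
# [1] TRUE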
Another custom-function version, checking whether the number of elements (length()) of the non-equal elements (setdiff) is equal to the difference in the vectors' length:
# Does vector x contain vector y?
is.contained <- function(x, y) {
  z <- x[x %in% setdiff(x, y)]
  length(z) == length(x) - length(y)
}
r <- c(1,1)
s <- c(1,1,5)
t <- c(1,2,5)
is.contained(r, t)
#> [1] FALSE
is.contained(s, r)
#> [1] TRUE
is.contained(r, s)
#> [1] FALSE

Efficiently save indices of nonzero matrix elements to a file

I need to save the indices of a matrix's nonzero elements to a file. This works very well for small matrices, storing the row indices of the non-zero elements in a and the column indices in b:
X <- matrix(c(1,0,3,4,0,5), byrow=TRUE, nrow=2);
a <- row(X)[which(!X == 0)]
b <- col(X)[which(!X == 0)]
But the matrix is huge, and I need an efficient way to save the indices to a txt file, so that I have a[1] b[1] (new line) a[2] b[2], and so on. Any suggestions?
The package Matrix has a great solution for extremely large matrices. The sparseMatrix object can be summarized into a data.frame where i and j are your indices and x is the value:
X <- matrix(c(1,0,3,4,0,5), byrow=TRUE, nrow=2);
a <- row(X)[which(!X == 0)]
b <- col(X)[which(!X == 0)]
library(Matrix)
Y <- Matrix(X, sparse = TRUE)
(res <- summary(Y))
2 x 3 sparse Matrix of class "dgCMatrix", with 4 entries
i j x
1 1 1 1
2 2 1 4
3 1 3 3
4 2 3 5
class(res)
[1] "sparseSummary" "data.frame"
You can then subset to get just i and j:
res[, c("i", "j")]
i j
1 1 1
2 2 1
3 1 3
4 2 3
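To actually write these index pairs to a file (the file name is just a placeholder), write.table can be used the same way as in the other answer:
write.table(res[, c("i", "j")], "file.txt", row.names = FALSE, col.names = FALSE)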
You can grab the rows and columns of all non-zero locations using which with parameter arr.ind=TRUE, writing the result to a file with write.table:
write.table(which(X != 0, arr.ind=TRUE), "file.txt", row.names=F, col.names=F)
This yields space-separated output of the pairs of elements in the specified file:
1 1
2 1
1 3
2 3
Using which with arr.ind=TRUE saves a few scans through your input matrix compared to the code posted in your question, so it should be a bit quicker at calculating the data to output. You can see this with a benchmark for a larger matrix (1000 x 1000, with 1% density):
set.seed(144)
bigX <- matrix(sample(c(rep(0, 99), 1), 1000000, replace=T), nrow=1000)
OP <- function(X) cbind(row(X)[which(!X == 0)], col(X)[which(!X == 0)])
josilber <- function(X) which(X != 0, arr.ind=TRUE)
library(microbenchmark)
microbenchmark(OP(bigX), josilber(bigX))
# Unit: milliseconds
# expr min lq mean median uq max neval
# OP(bigX) 20.513535 23.014517 36.463423 25.354250 59.130520 65.50304 100
# josilber(bigX) 3.873165 4.281624 6.741824 5.250777 6.998415 45.02542 100
In this case we see about a 5x speedup in computing the non-zero rows and columns. Depending on the density and size of your matrix the output operation (write.table) might instead be the bottleneck, in which case there may not be too much benefit to this approach.

Simple way of creating dummy variable in R

I want to know how simply a dummy variable can be created. I found many similar questions on dummy variables, but they either rely on external packages or are quite technical.
I have data like this :
df <- data.frame(X=rnorm(10,0,1), Y=rnorm(10,0,1))
df$Z <- c(NA, diff(df$X)*diff(df$Y))
This creates a new variable Z within df, i.e. the product of the change in X and the change in Y.
Now I want to create a dummy variable D in df such that if Z < 0 then D == 1, and if Z > 0 then D == 0.
I tried it this way:
df$D <- NA
for(i in 2:10) {
  if(df$Z[i] < 0) {
    D[i] == 1
  }
  if(df$Z[i] > 0) {
    D[i] == 0
  }
}
This is not working.
I want to know why the above code is not working (along with an easy way of doing this) and how dummy variables can be created in R without any external packages, with a little bit of explanation.
Try:
df$D<-ifelse(df$Z<0,1,0)
df
X Y Z D
1 -0.1041896 -1.11731404 NA NA
2 -1.4286604 1.42523717 -3.36753491 1
3 0.3931643 -0.05525477 -2.69719691 1
4 -0.2236541 1.64531526 -1.04894297 1
5 1.1725167 0.80063291 -1.17932089 1
6 0.7571427 0.64072381 0.06642209 0
7 0.4929186 1.25125268 -0.16131645 1
8 0.9715885 -0.54755653 -0.86103574 1
9 -0.2962052 -1.37459521 1.04851438 0
10 -1.4838675 -0.85788632 -0.61367565 1
The ifelse function takes 3 arguments: the condition to evaluate (df$Z < 0), the value if the condition is TRUE (1), and the value if the condition is FALSE (0). The function is vectorized, so it works well in this case.
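As for why the loop in the question does not work: D[i] == 1 is a comparison, not an assignment, and D refers to a free-standing object rather than the column df$D. A corrected version of the loop (shown only for illustration; the ifelse call above is the simpler route) would be:
df$D <- NA
for (i in 2:10) {
  if (df$Z[i] < 0) df$D[i] <- 1  # assignment uses <-, not ==
  if (df$Z[i] > 0) df$D[i] <- 0
}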
We can create a logical vector with df$Z < 0 and then coerce it to binary (0/1) by wrapping it with +.
df$D <- +(df$Z <0)
Or as @BenBolker mentioned, the canonical options would be
as.numeric(df$Z < 0)
or
as.integer(df$Z < 0)
Benchmarks
set.seed(42)
Z <- rnorm(1e7)
library(microbenchmark)
microbenchmark(akrun= +(Z < 0), etienne = ifelse(Z < 0, 1, 0),
times= 20L, unit='relative')
# Unit: relative
# expr min lq mean median uq max neval
# akrun 1.00000 1.00000 1.000000 1.00000 1.00000 1.000000 20
# etienne 12.20975 10.36044 9.926074 10.66976 9.32328 7.830117 20
You can try
df$D[df$Z<0]<-1
df$D[df$Z>0]<-0
But you should consider the possibility that Z can be 0.
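If Z can be exactly 0 and that case should get its own value (NA here is just an illustration; pick whatever fits your analysis), a nested ifelse makes the choice explicit:
df$D <- ifelse(df$Z < 0, 1, ifelse(df$Z > 0, 0, NA))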

Memorize the last "correct" value of a sequence (for removing outliers)

I have a little problem with a function.
Its aim is to remove outliers I've detected in my data.frame. They are detected when there is too big a difference from the previous correct value (e.g. in c(1,2,3,20,30,4,5,6), "20" and "30" are the outliers). But my data is much more complex than this.
My idea is to consider the first two numeric values of my column as "correct". Then, I want to test each next value:
if the difference between the tested value and the previous one is <20, then it's a new correct one, and the test must start again from this new correct value (and not from the previous correct one)
if the same difference is >20, then it's a wrong one. An index must be put next to the wrong value, and the test must still continue from this same correct value, until a new correct value is detected
Here's an example with my function and a fake DF:
myts <- data.frame(x=c(12,12,35,39,46,45,33,5,26,28,29,34,15,15), z=NA)
test <- function(x){
  st1 <- NULL
  temp <- st1[1] <- x[1]
  st1 <- numeric(length(x))
  for (i in 2:(length(x))){
    if((!is.na(x[i])) & (!is.na(x[i-1])) & (abs((x[i])-(temp)) > 20)){
      st1[i] <- 1
    }
  }
  return(st1)
}
myts[,2] <- apply(as.data.frame(myts[,1]), 2, test)
myts[,2] <- as.numeric(myts[,2])
It nearly does the job, but the problem is that the last correct value is not memorized. It still does the test from the first correct value.
Due to this, rows 9 to 11 in my example are not detected. I let you imagine the problem on a 500,000-row data.frame.
How can I solve this little problem? The rest of the function may be OK.
You just need to update temp for any indices that aren't outliers:
test <- function(x) {
  temp <- x[1]
  st1 <- numeric(length(x))
  for (i in 2:(length(x))) {
    if (!is.na(x[i]) & !is.na(x[i-1]) & abs(x[i]-temp) > 20) {
      st1[i] <- 1
    } else {
      temp <- x[i]
    }
  }
  return(st1)
}
myts[,2] <- apply(as.data.frame(myts[,1]),2,test)
myts[,2] <- as.numeric(myts[,2])
myts
# x z
# 1 12 0
# 2 12 0
# 3 35 1
# 4 39 1
# 5 46 1
# 6 45 1
# 7 33 1
# 8 5 0
# 9 26 1
# 10 28 1
# 11 29 1
# 12 34 1
# 13 15 0
# 14 15 0
One thing to note is that for loops in R will be quite slow compared to vectorized functions. However, because each element of your vector depends in a complicated way on the previous ones, it's tough to use R's built-in vectorized functions to compute it efficiently. You can convert this code nearly verbatim to C++ and use the Rcpp package to regain the efficiency:
library(Rcpp)
test2 <- cppFunction(
"IntegerVector test2(NumericVector x) {
  const int n = x.length();
  IntegerVector st1(n, 0);
  double temp = x[0];
  for (int i = 1; i < n; ++i) {
    // same check as the R version: current and previous values non-missing
    if (!R_IsNA(x[i]) && !R_IsNA(x[i-1]) && fabs(x[i] - temp) > 20.0) {
      st1[i] = 1;
    } else {
      temp = x[i];
    }
  }
  return st1;
}")
all.equal(test(myts[,1]), test2(myts[,1]))
# [1] TRUE
# Benchmark on large vector with some NA values:
set.seed(144)
large.vec <- c(0, sample(c(1:50, NA), 1000000, replace=T))
all.equal(test(large.vec), test2(large.vec))
# [1] TRUE
library(microbenchmark)
microbenchmark(test(large.vec), test2(large.vec))
# Unit: milliseconds
# expr min lq mean median uq max neval
# test(large.vec) 2343.684164 2468.873079 2667.67970 2604.22954 2747.23919 3753.54901 100
# test2(large.vec) 9.596752 9.864069 10.97127 10.23011 11.68708 16.67855 100
The Rcpp code is about 250x faster on a vector of length 1 million. Depending on your use case this speedup may or may not be important.

check each element of vector against all rows of data frame

I have a vector, for which I want to check each element against each row of a data frame. It involves a grep function, since the elements to be checked are buried in other text.
With the help of this forum, I got this code:
mat <- data.frame(par=c('long A story','C story', 'blabla D'), val=1:3)
vec <- c('Z','D','A')
mat$label <- NA
for (x in vec){
  is.match <- lapply(mat$par, function(y) grep(x, y))
  mat$label[which(is.match > 0)] <- x
}
The problem is that it takes minutes to execute. Is there a way to vectorize this?
I've assumed you only want the first match in each case:
which.matches <- grep("[ZDA]", mat$par)
what.matches <- regmatches(mat$par, regexpr("[ZDA]", mat$par))
mat$label[which.matches] <- what.matches
mat
par val label
1 long A story 1 A
2 C story 2 <NA>
3 blabla D 3 D
EDIT: Benchmarking
Unit: microseconds
expr min lq median uq max
1 answer(mat) 185.338 194.0925 199.073 209.1850 898.919
2 question(mat) 672.227 693.9610 708.601 725.6555 1457.046
EDIT 2:
As @mrdwab suggested, this can actually be used as a one-liner:
mat$label[grep("[ZDA]", mat$par)] <- regmatches(mat$par, regexpr("[ZDA]", mat$par))
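If vec changes, the character class does not have to be hard-coded; it can be built from vec (this generalisation is my own note and assumes the elements of vec are single characters; for longer strings an alternation such as paste(vec, collapse = "|") would be needed instead):
pattern <- paste0("[", paste(vec, collapse = ""), "]")  # "[ZDA]" for the example vec
mat$label[grep(pattern, mat$par)] <- regmatches(mat$par, regexpr(pattern, mat$par))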
