I want to check whether a vector y contains another vector x as a contiguous sequence of elements
y <- c(0,0,0,NA,NA,0)
x <- c(0,0,0,0)
In this case, it should give me FALSE because there is no run of four consecutive zeros in y. But if we take a look at vector y2, the result should be TRUE.
y2 <- c(0,0,NA,0,0,0,0)
EDIT:
I tried to use %in% but it seems to only work for elements of vectors, not for whole vectors. The solution doesn't have to be applicable to more general problems. It would be nice if it works for this particular case.
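To illustrate, %in% only checks whether each element of x occurs somewhere in y, so it reports TRUE even though y has no run of four zeros:
x %in% y       # [1] TRUE TRUE TRUE TRUE
all(x %in% y)  # [1] TRUE, but the answer we want here is FALSE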
You can use a combination of grepl and paste. Here you need to collapse each vector into a single string using the collapse argument of paste.
> grepl(paste(x,collapse=";"),paste(y2,collapse=";"))
[1] TRUE
> grepl(paste(x,collapse=";"),paste(y,collapse=";"))
[1] FALSE
> grepl(paste(c(123),collapse=";"),paste(c(12,3),collapse=";"))
[1] FALSE
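If you want to reuse this, a small wrapper could look like the following sketch (contains_seq is just an illustrative name); fixed = TRUE makes grepl treat the collapsed pattern literally rather than as a regular expression:
# wrap the paste/grepl idea in a helper; ";" keeps element boundaries intact
contains_seq <- function(x, y) {
  grepl(paste(x, collapse = ";"), paste(y, collapse = ";"), fixed = TRUE)
}
contains_seq(x, y)   # FALSE
contains_seq(x, y2)  # TRUE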
Use this:
any(apply(embed(y,length(y)-length(x)+1),2,identical,x))
Just for those who might wonder, a time test of the answers.
findit1<-function(x,y) any(apply(embed(y,length(y)-length(x)+1),2,identical,x))
findit2<-function(x,y) grepl(paste(x,collapse=";"),paste(y,collapse=";"))
x<-c(0,1,1,0,0,0,1,0,1)
y<-sample(c(0,1),1e5,replace=TRUE)
Rgames> microbenchmark(findit1(x,y),findit2(x,y))
Unit: milliseconds
expr min lq median uq max neval
findit1(x, y) 403.79291 449.9028 457.8320 466.4996 603.6573 100
findit2(x, y) 99.09317 100.7774 101.4513 102.1728 119.8970 100
EDIT:
using eddi's rle answer,
Rgames> findit3<-function(x,y) sum(length(x) <= rle(y)$lengths[rle(y)$values %in% 0])
Rgames> x<-c(0,0,0,0,0)
Rgames> microbenchmark(findit1(x,y),findit2(x,y),findit3(x,y))
Unit: milliseconds
expr min lq median uq max neval
findit1(x, y) 340.63570 383.39450 414.6791 456.38786 532.98017 100
findit2(x, y) 99.72606 101.11308 101.9399 103.20869 117.91149 100
findit3(x, y) 23.39226 24.39826 31.8478 35.10592 53.15408 100
But in the general case of any sequence in x I doubt there's a way to massage rle or seqle to do this. I'll have to go play with things for a while. :-)
For this particular case of 0's only in x, just use rle:
sum(length(x) <= rle(y2)$lengths[rle(y2)$values %in% 0]) > 0
#[1] TRUE
sum(length(x) <= rle(y)$lengths[rle(y)$values %in% 0]) > 0
#[1] FALSE
The OP didn't ask for this, but here's a way to find where the instances of x occur.
I used "9" as my tagging character on the assumption that "9" never shows up in y. Clearly one could choose some other character.
> bar<-gsub(paste(x,collapse=""),'9',paste(y,collapse=""))
> rab<-as.numeric(unlist(strsplit(bar,'')))
> rle(rab==9)
Run Length Encoding
lengths: int [1:3123] 49 1 49 1 20 1 6 1 78 1 ...
values : logi [1:3123] FALSE TRUE FALSE TRUE FALSE TRUE ...
I really like this method:
# Second vector contains all the elements of the first
all(1:10 %in% 1:10)
# [1] TRUE
# Second vector does not contain all elements of the first
all(1:11 %in% 1:10)
# [1] FALSE
Yet another option:
length(x) == max(nchar(strsplit(paste(y,collapse=''),"NA")[[1]]))
length(x) == max(nchar(strsplit(paste(y2,collapse=''),"NA")[[1]]))
I also think there should be a smarter way, e.g. somehow utilizing cumsum (making it reset to 0 at every occurrence of NA, then taking the maximum and comparing it to the length of x). After some Internet searching I have:
length(x) == max(sapply(split(y, replace(cumsum(is.na(y)), is.na(y), -1))[-1],length))
length(x) == max(sapply(split(y2, replace(cumsum(is.na(y2)), is.na(y2), -1))[-1],length))
Or maybe start with which(is.na(y)) and then calculate the maximum difference between consecutive elements of the result.
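A rough sketch of that idea (only for the all-zeros x of this question; I use <= so that longer runs also count):
na_pos <- c(0, which(is.na(y)), length(y) + 1)   # NA positions padded with the vector boundaries
length(x) <= max(diff(na_pos) - 1)               # FALSE: the longest zero run in y is 3
na_pos2 <- c(0, which(is.na(y2)), length(y2) + 1)
length(x) <= max(diff(na_pos2) - 1)              # TRUE: y2 has a run of 4 zeros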
Similar to Ferdinand's (nice use of embed(), BTW), this will return a vector of all matches (empty if none):
which(sapply(1:(length(y)-length(x)+1), function(z) identical(x, y[z:(z+length(x)-1)])))
Related
Given is a vector
x <- c("A","B","C","A","C","B","C","A")
Now, for every element x_i in x, I want to find the largest index j such that the subvector from position i to position j contains no duplicates, and save those indices in a vector (of the same length as x).
Example:
Starting with the first element "A" in x, the solution is 3, since beginning at "A" we have "B", "C" and finally another "A" at index 4.
Here, at index 4, the subvector consisting of "A","B","C","A" has a duplicate "A", hence the answer is 3.
We continue with the second element "B". We then get the sequence "B","C","A","C" and have a duplicate, namely "C". Hence the answer is 4. (It's not 3, since "B" is already the second element, so we have to add 1.)
Now the third element: it is "C" and we get the sequence "C","A","C", which already contains a duplicate. Hence the answer is 4. (We have "C" and "A" as a unique length-2 vector, but "C" is already the third element, so we have to add 2.)
I hope it is clear what to do.
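Spelled out for the example vector above, the expected result would therefore be c(3, 4, 4, 6, 6, 8, 8, 8); the last three entries equal length(x) = 8 because no duplicate occurs before the end of the vector.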
This question is purely about performance. I already have a working function:
uniqueTill <- function(x) {
  res <- unlist(lapply(1:length(x), function(i) which(duplicated(x[i:length(x)]))[1]+i-2))
  res[is.na(res)] <- length(x)
  res
}
We can now check the function with:
inds <- uniqueTill(x)
#check if the subvectors are in fact unique
lapply(1:length(inds), function(k) x[k:inds[k]])
lapply(1:length(inds), function(k) any(duplicated(x[k:inds[k]])))
#check that including the next element of the original vector makes the subvectors no longer unique
lapply(1:length(inds), function(k) x[k:(inds[k]+1)])
lapply(1:length(inds), function(k) any(duplicated(x[k:(inds[k]+1)]))) #here the last 3 are FALSE since there are no more duplicates, but that's intended
I already got a small performance increase with the following function:
(This function does not give the desired output because of the behaviour of which.min on vectors containing only FALSE, which comes into play when we reach the end of the vector where no more duplicates occur. Maybe you also have an idea how to fix this little problem.)
uniqueTill2 <- function(x) {
  res <- unlist(lapply(1:length(x), function(i) which.min(!duplicated(x[i:length(x)]))+i-2))
  res[is.na(res)] <- length(x)
  res
}
#newer version
uniqueTill2 <- function(x) {
  unlist(lapply(1:length(x), function(i) {
    dups <- duplicated(x[i:length(x)])
    if(!any(dups)) return(length(x))
    which.max(dups)+i-2
  }))
}
Instead of calling which in every iteration and then taking its first index, it is enough to get the index of the first minimum via which.min.
For that we need to negate the duplicated vector, so that a duplicated element (TRUE, i.e. 1 as a number) becomes FALSE (0), which is what which.min then finds.
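To see the which.min problem mentioned above: once no duplicates are left in the tail, !duplicated(...) is all TRUE, so which.min returns 1 and the formula yields i - 1 instead of the desired length(x):
dups <- duplicated(c("B", "C", "A"))  # e.g. the tail starting at position 6: no duplicates
which.min(!dups)                      # [1] 1, although nothing is duplicated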
I got the following benchmarks:
x <- c("A","B","C","A","C","B","C","A")
microbenchmark::microbenchmark(uniqueTill(x),uniqueTill2(x))
Unit: microseconds
expr min lq mean median uq max neval
uniqueTill(x) 63.002 66.5010 74.50998 68.4015 73.3005 200.300 100
uniqueTill2(x) 56.401 58.8505 63.29607 60.8505 63.4510 119.901 100
xx <- rep(x,1e3)
microbenchmark::microbenchmark(uniqueTill(xx),uniqueTill2(xx),times=10)
Unit: milliseconds
expr min lq mean median uq max neval
uniqueTill(xx) 752.8640 761.6679 822.4813 845.1330 850.2489 906.9668 10
uniqueTill2(xx) 719.3167 725.2254 744.7558 727.9224 747.4992 812.6559 10
Do you have any ideas how to speed things up? Some Rcpp would also be very nice.
I have a huge vector which has a couple of NA values, and I'm trying to find the max value in that vector (the vector is all numbers), but I can't do this because of the NA values.
How can I remove the NA values so that I can compute the max?
Trying ?max, you'll see that it actually has a na.rm = argument, set by default to FALSE. (That's the common default for many other R functions, including sum(), mean(), etc.)
Setting na.rm=TRUE does just what you're asking for:
d <- c(1, 100, NA, 10)
max(d, na.rm=TRUE)
If you do want to remove all of the NAs, use this idiom instead:
d <- d[!is.na(d)]
A final note: Other functions (e.g. table(), lm(), and sort()) have NA-related arguments that use different names (and offer different options). So if NA's cause you problems in a function call, it's worth checking for a built-in solution among the function's arguments. I've found there's usually one already there.
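For instance, two of the differently named arguments mentioned above:
table(c(1, 2, NA), useNA = "ifany")  # counts NA as its own category
sort(c(3, NA, 1), na.last = TRUE)    # [1]  1  3 NA  (by default sort() drops NAs)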
The na.omit function is what a lot of the regression routines use internally:
vec <- 1:1000
vec[runif(200, 1, 1000)] <- NA
max(vec)
#[1] NA
max( na.omit(vec) )
#[1] 1000
Use discard from purrr (works with lists and vectors).
discard(v, is.na)
The benefit is that it is easy to use pipes; alternatively use the built-in subsetting function [:
v %>% discard(is.na)
v %>% `[`(!is.na(.))
Note that na.omit does not work on lists:
> x <- list(a=1, b=2, c=NA)
> na.omit(x)
$a
[1] 1
$b
[1] 2
$c
[1] NA
?max shows you that there is an extra parameter na.rm that you can set to TRUE.
Apart from that, if you really want to remove the NAs, just use something like:
myvec[!is.na(myvec)]
Just in case someone new to R wants a simplified answer to the original question
How can I remove NA values from a vector?
Here it is:
Assume you have a vector foo as follows:
foo = c(1:10, NA, 20:30)
running length(foo) gives 22.
nona_foo = foo[!is.na(foo)]
length(nona_foo) is 21, because the NA values have been removed.
Remember that is.na(foo) returns a logical vector, so indexing foo with the negation of this vector gives you all the elements which are not NA.
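For example, the logical vector for foo starts like this:
head(is.na(foo), 12)
#  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE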
You can call max(vector, na.rm = TRUE). More generally, you can use the na.omit() function.
I ran a quick benchmark comparing the two base approaches, and it turns out that x[!is.na(x)] is faster than na.omit. User qwr suggested I try purrr::discard as well - this turned out to be massively slower (though I'll happily take comments on my implementation & test!).
microbenchmark::microbenchmark(
purrr::map(airquality,function(x) {x[!is.na(x)]}),
purrr::map(airquality,na.omit),
purrr::map(airquality, ~purrr::discard(.x, .p = is.na)),
times = 1e6)
Unit: microseconds
expr min lq mean median uq max neval cld
purrr::map(airquality, function(x) { x[!is.na(x)] }) 66.8 75.9 130.5643 86.2 131.80 541125.5 1e+06 a
purrr::map(airquality, na.omit) 95.7 107.4 185.5108 129.3 190.50 534795.5 1e+06 b
purrr::map(airquality, ~purrr::discard(.x, .p = is.na)) 3391.7 3648.6 5615.8965 4079.7 6486.45 1121975.4 1e+06 c
For reference, here's the original test of x[!is.na(x)] vs na.omit:
microbenchmark::microbenchmark(
purrr::map(airquality,function(x) {x[!is.na(x)]}),
purrr::map(airquality,na.omit),
times = 1000000)
Unit: microseconds
expr min lq mean median uq max neval cld
map(airquality, function(x) { x[!is.na(x)] }) 53.0 56.6 86.48231 58.1 64.8 414195.2 1e+06 a
map(airquality, na.omit) 85.3 90.4 134.49964 92.5 104.9 348352.8 1e+06 b
Another option is to use complete.cases, like this:
d <- c(1, 100, NA, 10)
result <- complete.cases(d)
output <- d[result]
output
#> [1] 1 100 10
max(output)
#> [1] 100
Created on 2022-08-26 with reprex v2.0.2
As an example, I have a large list of vectors with various lengths (and some NULL) and would like to find the first list element with two elements. As in this post, I know that with a list you can use a similar approach by using sapply() and subsetting the first result. As the solution in the post linked above using match() doesn't work in this case, I'm curious if there is a more elegant (and more computationally efficient) way to achieve this.
A reproducible example
# some example data
x <- list(NULL, NULL, NA, rep("foo", 6), c("we want", "this one"),
c(letters[1:10]), c("foo", "bar"), NULL)
x
# find the first element of length 2 using sapply and sub-setting to result #1
x[sapply(x, FUN=function(i) {length(i)==2})][[1]]
Or, as in @Josh O'Brien's answer to this post,
# get the index of the first element of length 2
seq_along(x)[sapply(x, FUN=function(i) {length(i)==2})]
Any thoughts or ideas?
Do you want this?
Find(function(i) length(i) == 2, x) # [1] "we want" "this one"
Position(function(i) length(i) == 2, x) # [1] 5
I benchmarked each of the suggested solutions on a single list of 200,000 elements (28.8 Mb) made from rep(x, 25000) - just the x list from my example repeated many times. Here are the results (Find and Position presumably come out far ahead because they stop at the first match rather than processing the whole list):
> microbenchmark(Find(function(i) length(i) == 2, x),
x[sapply(x, length) == 2][[1]],
x[sapply(x, FUN=function(i) {length(i)==2})][[1]],
x[[match(2,lapply(x,length))]],
x[match(2, mapply(length, x))],
x[mapply(length, x) == 2][[1]])
Unit: microseconds
expr min lq median uq max neval
Find(function(i) length(i) == 2, x) 89.104 107.531 112.8955 119.6605 466.045 100
x[sapply(x, length) == 2][[1]] 166539.621 185113.274 193224.0270 209923.2405 378499.180 100
x[sapply(x, FUN = function(i) {length(i) == 2 })][[1]] 279596.600 301976.512 310928.3845 322857.7610 484233.342 100
x[[match(2, lapply(x, length))]] 378391.882 388831.223 398639.1430 415137.0565 591727.647 100
x[match(2, mapply(length, x))] 207324.777 225027.221 235982.9895 249744.3525 422451.010 100
x[mapply(length, x) == 2][[1]] 205649.537 223045.252 236039.6710 249529.5245 411916.734 100
Thanks for the quick and informative responses!
mapply seems to be really quick
> x <- rep(x, 25000)
> microbenchmark({ x[match(2, mapply(length, x))] })
# Unit: milliseconds
# min lq median uq max neval
# 243.7502 275.8941 326.2993 337.9221 405.7011 100
also check x[mapply(length, x) == 2][[1]]
Here's a different way with sapply
> x[sapply(x, length) == 2][[1]]
# [1] "we want" "this one"
This next one is interesting.
> x[ grep("2", summary(x)[,1])[1] ]
# [[1]]
# [1] "we want" "this one"
Using match can work.
match(2,lapply(x,length))
#[1] 5
x[[match(2,lapply(x,length))]]
#[1] "we want" "this one"
Suppose we have a vector (or a data.frame for that matter) as follows:
set.seed(1)
x <- sample(10, 1e6, TRUE)
And one wants to get all values of x where x > 4, say:
a1 <- x[x > 4] # (or)
a2 <- x[which(x > 4)]
identical(a1, a2) # TRUE
I think most people would prefer x[x > 4]. But surprisingly (at least to me), subsetting using which is faster!
require(microbenchmark)
microbenchmark(x[x > 4], x[which(x > 4)], times = 100)
Unit: milliseconds
expr min lq median uq max neval
x[x > 4] 56.59467 57.70877 58.54111 59.94623 104.51472 100
x[which(x > 4)] 26.62217 27.64490 28.31413 29.97908 99.68973 100
It's about 2.1 times faster on my machine.
One possibility for the difference, I thought, could be that which doesn't keep NAs whereas > returns them as well. But in that case the logical operation itself should account for the difference, which is obviously not the case. That is:
microbenchmark(x > 4, which(x > 4), times = 100)
Unit: milliseconds
expr min lq median uq max neval
x > 4 8.182576 10.06163 12.68847 14.64203 60.83536 100
which(x > 4) 18.579746 19.94923 21.43004 23.75860 64.20152 100
Using which is about 1.7 times slower for just the index computation, before any subsetting. But which seems to catch up drastically during the actual subsetting.
It doesn't seem possible to use my usual weapon of choice, debugonce (thanks to @GavinSimpson), as which calls .Internal(which(x)) whereas == calls .Primitive("==").
My question therefore is: why is subsetting with [ faster on the numeric index produced by which than on the logical vector produced by >? Any ideas?
I think I should move out of the comments and add an answer. This is my hunch, building on what the others have answered and discussed. (I'm sure the real answer exists in the C source for subset_dflt.)
Once I have a vector x and a logical vector x > 0, I can subset x on x > 0 in two ways. I can use which, or I can use the vector x > 0 directly as the index. However, we must note that the two are not identical, since x[x > 0] will preserve NAs while x[which(x > 0)] will not.
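A quick illustration of that difference with a toy vector:
v <- c(1, NA, 3)
v[v > 1]         # [1] NA  3  -- the NA survives logical subsetting
v[which(v > 1)]  # [1] 3      -- which() drops it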
However, in either method, I will need to examine each element of the vector x > 0. In an explicit which call I will have to examine only the boolean state of the element while in a direct sub-setting operation I will have to examine both missing-ness and the boolean state of each element.
@flodel brings up an interesting observation. Since [, is.na, which, and | are all primitives or internal routines, let's assume no extraordinary overhead and do this experiment:
microbenchmark(which(x > 0), x[which(x > 0)], x > 0 | is.na(x), x[x > 0],
unit="us", times=1000)
Unit: microseconds
expr min lq median uq max neval
which(x > 0) 1219.274 1238.693 1261.439 1900.871 23085.57 1000
x[which(x > 0)] 1554.857 1592.543 1974.370 2339.238 23816.99 1000
x > 0 | is.na(x) 3439.191 3459.296 3770.260 4194.474 25234.70 1000
x[x > 0] 3838.455 3876.816 4267.261 4621.544 25734.53 1000
Considering median values, and assuming x > 0 | is.na(x) is a crude model of what I am saying happens in logical subsetting, the actual time taken in the 'subset' step is ~500 us. The time taken in the 'subset' step with which is ~700 us. The two numbers are comparable and indicate that it is not the subsetting itself that is costly in one method or the other. Instead, it is the work done to compute the wanted subset that is cheaper in the which method.
Here's my take on it. Subsetting on a numeric index allows pulling out exactly those elements that are required. Subsetting on a logical vector requires examining each element of the index vector to see whether it's TRUE, and then building an internal list of the required elements of the target vector. There are two steps involved, so it takes longer.
The difference is biggest when the number of elements extracted is small relative to the size of the original vector. For example:
> z <- rnorm(1e8)
> system.time(z[which(z < -5)])
user system elapsed
0.58 0.03 0.60
> system.time(z[z < -5])
user system elapsed
2.56 0.14 2.70
> system.time(z[which(z < 5)])
user system elapsed
1.39 0.30 1.68
> system.time(z[z < 5])
user system elapsed
2.82 0.44 3.26
Here, if you're pulling out only a small proportion of elements (there were 23 elements with z < -5 in my test), using which takes only a small fraction of the time that logical indexing takes. However, if you're extracting a large proportion of elements, the times are closer.
This seems to be because subsetting by a logical vector is slower than subsetting by a numeric index.
> ii <- x > 4
> ij <- which(x > 4)
>
> head(ii)
[1] FALSE FALSE TRUE TRUE FALSE TRUE
> head(ij)
[1] 3 4 6 7 8 9
>
> microbenchmark(x[ii], x[ij], times = 100)
Unit: milliseconds
expr min lq median uq max neval
x[ii] 25.574977 26.15414 28.299858 31.080903 82.04686 100
x[ij] 3.037134 3.31821 3.670096 7.516761 12.39738 100
Updated:
Probably one reason is that a shorter numeric index means fewer iterations of the (internal) subsetting loop, and hence faster evaluation. You can see this in the timings: x[ik] < x[ij] < x[il].
But there must be another factor as well, because there is still a huge difference between x[ii] and x[il], even though both index vectors have the full length of x.
> ii <- x > 4
>
> ij <- which(x > 4)
> ik <- which(x > 9)
> il <- which(x > -1)
>
> microbenchmark(x[ii], x[ij], x[ik], x[il], times = 100)
Unit: microseconds
expr min lq median uq max neval
x[ii] 25645.621 25986.2720 28466.412 30693.158 79582.484 100
x[ij] 3111.974 3281.8280 3477.627 6142.216 55076.121 100
x[ik] 585.723 628.2125 650.184 682.888 7551.084 100
x[il] 5266.032 5773.9015 9073.614 10583.312 15113.791 100
I'd like to show the names of columns in a large dataframe that contain missing values. Basically, I want the equivalent of complete.cases(df) but for columns, not rows. Some of the columns are non-numeric, so something like
names(df[is.na(colMeans(df))])
returns "Error in colMeans(df) : 'x' must be numeric." So, my current solution is to transpose the dataframe and run complete.cases, but I'm guessing there's some variant of apply (or something in plyr) that's much more efficient.
nacols <- function(df) {
  names(df[, !complete.cases(t(df))])
}
w <- c("hello","goodbye","stuff")
x <- c(1,2,3)
y <- c(1,NA,0)
z <- c(1,0, NA)
tmp <- data.frame(w,x,y,z)
nacols(tmp)
[1] "y" "z"
Can someone show me a more efficient function to identify columns that have NAs?
This is the fastest way that I know of:
unlist(lapply(df, function(x) any(is.na(x))))
EDIT:
I guess everyone else wrote theirs out in full, so here it is in full:
nacols <- function(df) {
  colnames(df)[unlist(lapply(df, function(x) any(is.na(x))))]
}
And if you microbenchmark the 4 solutions on a WIN 7 machine:
Unit: microseconds
expr min lq median uq max
1 ANDRIE 85.380 91.911 106.375 116.639 863.124
2 MANOEL 87.712 93.778 105.908 118.971 8426.886
3 MOIRA 764.215 798.273 817.402 876.188 143039.632
4 TYLER 51.321 57.853 62.518 72.316 1365.136
Edit: At the time I wrote this, anyNA did not exist or I was unaware of it. It may speed things up further; per the help manual for ?anyNA:
The generic function anyNA implements any(is.na(x)) in a possibly faster way (especially for atomic vectors).
nacols <- function(df) {
  colnames(df)[unlist(lapply(df, function(x) anyNA(x)))]
}
Here is one way:
colnames(tmp)[colSums(is.na(tmp)) > 0]
Hope it helps,
Manoel
One way...
nacols <- function(x){
  y <- sapply(x, function(xx) any(is.na(xx)))
  names(y[y])
}
nacols(tmp)
[1] "y" "z"
Explanation: since the result y is a logical vector, names(y[y]) returns the names of y for only those cases where y is TRUE.
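For instance, with the tmp data frame from the question (using has_na in place of y to avoid clashing with the column name):
has_na <- sapply(tmp, function(xx) any(is.na(xx)))
has_na
#     w     x     y     z
# FALSE FALSE  TRUE  TRUE
names(has_na[has_na])
# [1] "y" "z"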