Get indices of two values that bracket zero in R

I have a vector x:
x <- c(-1,-0.5,-0.1,-0.001,0.5,0.6,0.9)
I want the index of the closest negative value to zero and the closest positive value to zero. In this case, 4 and 5. x is not necessarily sorted.
I can do this by setting numbers to NA:
# negative numbers only
tmp <- x
tmp[x > 0] <- NA
which.max(tmp)
# positive numbers only
tmp <- x
tmp[x < 0] <- NA
which.min(tmp)
But that seems clunky. Any tips?

good scenario
If you are in the classic case, where
1. your vector is sorted in increasing order,
2. it does not include 0,
3. it has no tied values,
you can simply do the following:
findInterval(0, x, TRUE) + 0:1
If condition 1 does not hold, but conditions 2 and 3 still hold, you can do
sig <- order(x)
sig[findInterval(0, x[sig], TRUE) + 0:1]
akrun's answer is fundamentally the same.
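As a quick check of the order-based variant, here is a small trace on a shuffled copy of the example vector (the shuffled values are only an illustration, not from the question). It is a sketch that still assumes x has no zeros and no ties, and that it contains at least one negative and one positive value:

```r
# Shuffled copy of the example vector (assumption: no zeros, no ties)
x <- c(0.6, -0.1, 0.9, -1, -0.001, 0.5, -0.5)

sig <- order(x)                                          # permutation that sorts x
pos <- findInterval(0, x[sig], rightmost.closed = TRUE)  # position of the largest negative value in sorted x
idx <- sig[pos + 0:1]                                    # map back to positions in the original x

idx      # 5 6
x[idx]   # -0.001 0.5
```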
bad scenario
Things become tricky once your vector x contains 0 or tied / repeated values, because:
- repeated values are a problem for sorting-based methods, since a sorting method such as "quick sort" is not stable (see What is stability in sorting algorithms and why is it important? if you don't know what a stable sort is);
- if x contains 0, findInterval will locate that 0 exactly, rather than the negative value just below it.
In this situation, you have to adapt Ronak Shah's answer, which allows you to exclude 0. But be aware that which may give you multiple indices if there are repeated values.
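One way to sketch such an adaptation is shown below. The helper name bracket_zero is hypothetical (it is not from any answer here); it excludes zeros, does not need x to be sorted, and on ties returns only the first matching index, which is a design choice rather than the only reasonable one:

```r
# Hypothetical helper: indices of the negative and positive values closest to zero.
# Zeros are ignored; on ties the first matching index is returned; NA if a side is empty.
bracket_zero <- function(x) {
  neg <- which(x < 0)
  pos <- which(x > 0)
  c(lower = if (length(neg)) neg[which.max(x[neg])] else NA_integer_,
    upper = if (length(pos)) pos[which.min(x[pos])] else NA_integer_)
}

x <- c(-1, -0.5, -0.1, -0.001, 0.5, 0.6, 0.9)
bracket_zero(x)                        # lower = 4, upper = 5

# Unsorted input with a zero and a tied negative value
bracket_zero(c(0.5, 0, -0.1, -0.1, 2)) # lower = 3, upper = 1
```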

Another way could be:
#closest positive value to zero.
which(x == min(x[x > 0]))
#[1] 5
#closest negative value to zero
which(x == max(x[x < 0]))
#[1] 4

We could try
rle(sign(x))$lengths[1] + 0:1
#[1] 4 5
if it is unsorted, then
x1 <- sort(x)
match(x1[rle(sign(x1))$lengths[1] + 0:1], x)

Related

How to select a specific amount of rows before and after predefined values

I am trying to select relevant rows from a large time-series data set. The tricky bit is, that the needed rows are before and after certain values in a column.
# example data
x <- rnorm(100)
y <- rep(0,100)
y[c(13,44,80)] <- 1
y[c(20,34,92)] <- 2
df <- data.frame(x,y)
In this case the critical values are 1 and 2 in the df$y column. If, e.g., I want to select 2 rows before and 4 after df$y==1 I can do:
ones<-which(df$y==1)
selection <- NULL
for (i in ones) {
jj <- (i-2):(i+4)
selection <- c(selection,jj)
}
df$selection <- 0
df$selection[selection] <- 1
This, arguably, scales poorly for more values. For df$y==2 I would have to repeat with:
twos<-which(df$y==2)
selection <- NULL
for (i in twos) {
jj <- (i-2):(i+4)
selection <- c(selection,jj)
}
df$selection[selection] <- 2
Ideal scenario would be a function doing something similar to this imaginary function selector(data=df$y, values=c(1,2), before=2, after=5, afterafter = FALSE, beforebefore=FALSE), where values is fed with the critical values, before with the amount of rows to select before and correspondingly after.
Whereas afterafter would allow for the possibility to go from certain rows until certain rows after the value, e.g. after=5, afterafter=10 (same but going in the other direction with beforebefore).
Any tips and suggestions are very welcome!
Thanks!
This is easy enough with rep and its each argument.
df$y[rep(which(df$y == 2), each=7L) + -2:4] <- 2
Here, rep repeats the row indices that match your criterion 7 times each (two before, the value itself, and four after; the L indicates that the argument should be an integer). Adding the values -2 through 4 gives the indices to select. Now, replace.
Note that for some comparisons, == will not be adequate due to numerical precision. See the SO post why are these numbers not equal for a detailed discussion of this topic. In these cases, you could use something like
which(abs(df$y - 2) < 0.001)
or whatever precision measure will work for your problem.
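The rep idea can be wrapped into something close to the selector function the question asks for. This is only a sketch: the name select_around and its arguments are hypothetical, it writes to a separate selection vector instead of overwriting y, and it silently drops indices that would fall outside the data:

```r
# Hypothetical helper: mark `before` rows before and `after` rows after each hit.
select_around <- function(y, values, before = 2L, after = 4L) {
  selection <- numeric(length(y))
  for (v in values) {
    hits <- which(y == v)
    # repeat each hit once per offset, then add the offsets (recycled per hit)
    idx <- rep(hits, each = before + after + 1L) + (-before):after
    idx <- idx[idx >= 1L & idx <= length(y)]   # stay inside the vector
    selection[idx] <- v
  }
  selection
}

y <- rep(0, 100)
y[c(13, 44, 80)] <- 1
y[c(20, 34, 92)] <- 2
sel <- select_around(y, values = c(1, 2), before = 2L, after = 4L)
which(sel == 1)   # 11:17, 42:48, 78:84
```

If windows for different values overlap, later values overwrite earlier ones, just as in the loop from the question.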

How to find if two or more consecutive elements of a vector are equal in R

I want to find a way to determine if two or more consecutive elements of a vector are equal.
For example, in the vector x=c(1,1,1,2,3,1,3), the first, second and third elements are equal.
With the following command, I can determine whether a vector, say y, contains two or more consecutive elements that are equal to 2 or 3:
all(rle(y)$lengths[which( rle(y)$values==2 | rle(y)$values==3 )]==1)
Is there any other faster way?
EDIT
Let say we have the vector z=c(1,1,2,1,2,2,3,2,3,3).
I want a vector with three elements as output. The first element refers to the value 1, the second to 2 and the third to 3. An element of the output is 1 if z contains two or more consecutive occurrences of the corresponding value, and 0 otherwise. So, the output for the vector z will be (1,1,1).
For the vector w=c(1,1,2,3,2,3,1) the output will be (1,0,0), since only the value 1 occurs twice in a row, in the first and second positions of w.
I'm not entirely sure if I'm understanding your question as it could be worded better. The first part just asks how you find if continuous elements in a vector are equal. The answer is to use the diff() function combined with a check for a difference of zero:
z <- c(1,1,2,1,2,2,3,2,3,3)
sort(unique(z[which(diff(z) == 0)]))
# [1] 1 2 3
w <- c(1,1,2,3,2,3,1)
sort(unique(w[which(diff(w) == 0)]))
# [1] 1
But your edit example seems to imply you are looking to see if there are repeated units in a vector, which will only contain the integers 1, 2, or 3. Your output will always be X, Y, Z, where
X is 1 if there is at least one "1" repeated, else 0
Y is 2 if there is at least one "2" repeated, else 0
Z is 3 if there is at least one "3" repeated, else 0
Is this correct?
If so, see the following
continuously <- function(x){
s <- sort(unique(x[which(diff(x) == 0)]))
output <- c(0,0,0)
output[s] <- s
return(output)
}
continuously(z)
# [1] 1 2 3
continuously(w)
# [1] 1 0 0
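If a strict 0/1 output is wanted, or the values are not exactly 1, 2 and 3, a variant based on rle may be easier to adapt. The function name has_run and its values argument are hypothetical, and this is a sketch rather than a tuned solution:

```r
# Hypothetical helper: for each value, 1 if it occurs at least twice in a row, else 0.
has_run <- function(x, values = sort(unique(x))) {
  r <- rle(x)
  repeated <- unique(r$values[r$lengths >= 2])   # values with a run of length >= 2
  setNames(as.integer(values %in% repeated), values)
}

z <- c(1, 1, 2, 1, 2, 2, 3, 2, 3, 3)
w <- c(1, 1, 2, 3, 2, 3, 1)
has_run(z)   # 1 1 1
has_run(w)   # 1 0 0
```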
Assuming your series is z=c(1,1,2,1,2,2,3,2,3,3), you can do:
(unique(z[c(FALSE, diff(z) == 0)]) >= 0)+0
which will output 1, 1, 1.
When you run the above command on your other sequence:
w=c(1,1,2,3,2,3,1)
(unique(w[c(FALSE, diff(w) == 0)]) >= 0)+0
it will return 1.
You may also try this for an exact output like 1,1,1 or 1,0,0 (%in% avoids relying on recycling when the two vectors differ in length):
(unique(z) %in% unique(z[c(FALSE, diff(z) == 0)]))+0 #1,1,1 for z and 1,0,0 for w
Logic:
diff takes the difference between each item and the one before it. Since there is always one fewer difference than there are items, I prepend FALSE for the first item. The resulting logical vector (is the difference zero or not?) is then used to subset the original sequence, which keeps only the repeated values. Finally, we convert them to 1s by asking whether they are greater than or equal to 0 (any other condition that is always true would work as well).
This assumes your sequence doesn't have negative numbers.

Are there ways to randomly sample among ties in the R function which.max()?

I currently am using the which.max() function in R within a loop. Sometimes, I have a vector which contains the same elements, like:
vec <- c(0,0,2,0,2)
The function will then always return:
> which.max(vec)
[1] 3
I am wondering if there is a simple solution to break ties randomly so that it doesn't always choose the smallest index among ties. I know that there is a which.is.max function in nnet, but was hoping to see if there was another simple solution without having to resort to installing extra packages. Thanks.
which(vec == max(vec)) will match all ties. You can then pick one at random using sample(which(vec == max(vec)), 1).
As you mentioned in the comments, sample does something annoying when the supplied vector is of length 1, that is, when there is only one maximum: sample(x, 1) then draws from 1:x instead of returning x.
You can fix this as follows:
maxima <- which(vec == max(vec))
if(length(maxima) > 1){
maxima <- sample(maxima, 1)
}
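The same guard can be written without the if by sampling a position rather than a value. The wrapper name which_max_random is hypothetical; sample.int is base R and always draws from 1:n, so the length-1 case needs no special handling:

```r
# Hypothetical wrapper: like which.max, but ties are broken at random.
which_max_random <- function(v) {
  maxima <- which(v == max(v))
  maxima[sample.int(length(maxima), 1L)]   # pick a position among the ties
}

vec <- c(0, 0, 2, 0, 2)
which_max_random(vec)          # 3 or 5, chosen at random
which_max_random(c(1, 5, 2))   # always 2
```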
Another method is using rank with ties.method = "random" and then we can use which.max on it.
which.max(rank(vec, ties.method = "random"))
which.max(rank(vec, ties.method = "random"))
#[1] 3
which.max(rank(vec, ties.method = "random"))
#[1] 5
rank would basically rank the vector according to their value and with ties.method = "random" it will randomly assign rank in case of a tie.
rank(vec, ties.method = "random")
#[1] 2 1 4 3 5
rank(vec, ties.method = "random")
#[1] 1 3 5 2 4
There is a concept called "perturbation", where you modify each number by a random amount that is significantly smaller than the existing variation. You can then take the maximum, which will be one of the original maxima plus some random amount. Which one of the original maxima is selected is random, as it's determined by which had the largest random amount added. So for instance, if all your numbers are integers, you can convert them to floats, add a random number between 0 and .001, pick the largest one, and then round it back to int. This is probably not the most efficient method, but given that you mentioned which.is.max in nnet, presumably you are doing work with neural networks, and perturbation is an important concept with NNs.
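In R, where only the index is needed, the idea might look like the sketch below. The bound eps is an assumption: it has to be smaller than the smallest gap between distinct values in the vector, otherwise a non-maximal element could win:

```r
vec <- c(0, 0, 2, 0, 2)
eps <- 1e-3   # assumption: far smaller than the smallest gap between distinct values

# add a small random amount to every element, then take the usual which.max
idx <- which.max(vec + runif(length(vec), min = 0, max = eps))
idx           # 3 or 5, chosen at random
```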
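As an alternative (using the %>% pipe from magrittr), this picks one random index for each distinct value:
library(magrittr)
vec <- c(0,0,2,0,2)
vec %>% unique %>% sapply(function(x) which(x==vec)[sample(x=length(which(x==vec)),1)])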

Number of overlapping elements

I've got two vectors:
vec1 <- c(1,0,1,1,1)
vec2 <- c(1,1,0,1,1)
The vectors have the same elements at position 1, 4 and 5.
How can I return the number of elements that overlap in 2 vectors taking the position into account? So, here I would like to return the number 3.
Test for equality, then sum, you might want to exclude NAs:
sum(vec1==vec2, na.rm=TRUE)
EDIT
Exclude 0==0 matches, by adding an exclusion like:
sum(vec1==vec2 & vec1!=0, na.rm=TRUE)
Thanks to @CarlWitthoft
Or, if you have only ones and zeros, then:
sum((vec1+vec2)==2, na.rm=TRUE)
If your entries are only 0 and 1 (or if you are only interested in 0 and anything that is not 0) you can use xor to determine where they differ and then sum its negation; otherwise you would have to test for equality, as @zx8754 commented:
sum(!xor(vec1,vec2))
[1] 3
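To see how the variants above differ, it can help to look at the matching positions themselves. This is a small sketch; the second pair of vectors is made up to show the effect of excluding 0==0 matches:

```r
vec1 <- c(1, 0, 1, 1, 1)
vec2 <- c(1, 1, 0, 1, 1)

same <- vec1 == vec2
which(same)                            # 1 4 5
sum(same, na.rm = TRUE)                # 3

# Made-up pair where both vectors share a 0 at position 1
a <- c(0, 1, 1)
b <- c(0, 1, 0)
sum(a == b, na.rm = TRUE)              # 2: counts the 0==0 match
sum(a == b & a != 0, na.rm = TRUE)     # 1: ignores it
```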

Removing zero lines from dataframe yields dataframe of zero lines

I have a script that has a bunch of quality control checksums and it got caught on a dataset that had no need to remove any samples (rows) due to quality control. However, this script gave me an unexpected result of a dataframe with zero rows. With example data, why does this work:
data(iris)
##get rid of those pesky factors
iris$Species <- NULL
med <- which(iris[, 1] < 4.9)
medtemp <- iris[-med, ]
dim(medtemp)
[1] 134 4
but this returns a dataframe of zero rows:
small <- which(iris[, 1] < 4.0)
smalltemp <- iris[-small, ]
dim(smalltemp)
[1] 0 4
As does this:
x <- 0
zerotemp <- iris[-x, ]
dim(zerotemp)
[1] 0 4
It seems that the smalltemp dataframe should be the same size as iris since there are no rows to remove at all. Why is this?
Copied verbatim from Patrick Burns's R Inferno p. 41 (I hope this constitutes "fair use" -- if someone objects I'll remove it)
negative nothing is something
> x2 <- 1:4
> x2[-which(x2 == 3)]
[1] 1 2 4
The command above returns all of the values in x2 not equal to 3.
> x2[-which(x2 == 5)]
numeric(0)
The hope is that the above command returns all of x2 since no elements are
equal to 5. Reality will dash that hope. Instead it returns a vector of length
zero.
There is a subtle difference between the two following statements:
x[]
x[numeric(0)]
Subtle difference in the input, but no subtlety in the difference in the output.
There are at least three possible solutions for the original problem.
out <- which(x2 == 5)
if(length(out)) x2[-out] else x2
Another solution is to use logical subscripts:
x2[!(x2 %in% 5)]
Or you can, in a sense, work backwards:
x2[ setdiff(seq_along(x2), which(x2 == 5)) ]
Could it be that in your second example, small evaluates to 0?
Taking the zeroth element of a vector will always return the empty vector:
> foo <- 1:3
> foo
[1] 1 2 3
> foo[0]
integer(0)
>
Instead of using which to get your indices, I would use a boolean vector and negate it. That way you can do this:
small <- iris[, 1] < 4.0
smalltemp <- iris[!small, ]
dim(smalltemp)
[1] 150 4
EDIT: I don't think a negative index of 0 (as in your case) is allowed since there is no 0th index and thus R can't exclude that index from your selection. Negative indexing can be interpreted as: "give me back all rows except those with these indices".
It is because of the rules of what to do with an index that is zero. Only strictly positive or strictly negative indices are allowed. As [0] returns nothing, and
R> -0 == 0
[1] TRUE
Hence you get nothing where you expected it to drop nothing.
The integer(0) issue is treated as indexing by a NULL and this is documented to work as if indexing by 0 and hence the same behaviour.
This is discussed in the R Language Definition manual
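The behaviour and one possible guard can be reproduced with the iris example from the question. The helper name drop_rows is hypothetical; it only applies the negative index when there is something to drop:

```r
# Hypothetical helper: drop rows by index, but keep everything if idx is empty.
drop_rows <- function(df, idx) {
  if (length(idx)) df[-idx, , drop = FALSE] else df
}

small <- which(iris[, 1] < 4.0)   # no such rows, so this is integer(0)
length(small)                     # 0
identical(-small, small)          # TRUE: negating an empty index changes nothing
nrow(iris[-small, ])              # 0
nrow(drop_rows(iris, small))      # 150

med <- which(iris[, 1] < 4.9)
nrow(drop_rows(iris, med))        # 134
```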
