I'm trying to find "peaks" in a vector, i.e. elements for which the nearest neighboring elements on both sides that do not have the same value have lower values.
So, e.g. in the vector
c(0,1,1,2,3,3,3,2,3,4,5,6,5,7)
there are peaks at positions 5,6,7,12 and 14
Finding local maxima and minima comes close, but doesn't quite fit.
This should work. The call to diff(sign(diff(x))) == -2 finds peaks by, in essence, testing for a negative second derivative at/around each of the run values picked out by rle.
x <- c(0,1,1,2,3,3,3,2,3,4,5,6,5,7)
r <- rle(x)
which(rep(x = diff(sign(diff(c(-Inf, r$values, -Inf)))) == -2,
times = r$lengths))
# [1] 5 6 7 12 14
(I padded your vector with -Infs so that both elements 1 and 14 have the possibility of being matched, should the nearest different-valued element have a lower value. You can obviously adjust the end-element matching rule by instead setting one or both of these to Inf.)
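If you need this more than once, the same idea can be wrapped in a small function. This is only a sketch: the name find_peaks and its left/right arguments are made up for illustration, and the body is the rle/diff approach from above with the padding values exposed.

```r
# Hypothetical helper (name and arguments are illustrative, not from any package).
# left/right are the padding values that decide how the end elements are treated.
find_peaks <- function(x, left = -Inf, right = -Inf) {
  r <- rle(x)                                   # collapse plateaus into single values
  is_peak <- diff(sign(diff(c(left, r$values, right)))) == -2
  which(rep(is_peak, times = r$lengths))        # expand back to original positions
}

x <- c(0,1,1,2,3,3,3,2,3,4,5,6,5,7)
find_peaks(x)               # 5 6 7 12 14
find_peaks(x, right = Inf)  # 5 6 7 12  (last element can no longer match)
```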
I have the following vector:
my_vec <- c(2,3,5,3,5,2,6,7,2,4,6,8)
threshold <- 4
Is there a way to sum up the differences of all smaller elements of my_vec compared to the threshold value?
So the expected result on this example should be 8 (2+1+0+1+0+2+0+0+2+0+0+0)
For my purpose, the sum (8) is all I need (I don't need the difference between every element). I tried this by using a loop but unfortunately, there are several vectors of different length so I can't loop from 1:12 (as on the above vector) on a vector which has only 10 elements.
First subset elements below threshold and then sum difference to threshold:
threshold <- 4
sum((threshold - my_vec[my_vec < threshold]))
# [1] 8
You can use pmin to cap each element of my_vec at threshold, and then sum the differences between threshold and the capped values.
sum(threshold - pmin(my_vec, threshold))
#[1] 8
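Since both versions are vectorised, no loop over indices is needed and the length of the vector does not matter. As a rough sketch (the function name below_threshold_sum is an assumption made for this example), an equivalent form with pmax can be applied to vectors of any length:

```r
# Hypothetical wrapper; pmax() replaces negative differences with 0.
below_threshold_sum <- function(v, threshold) {
  sum(pmax(threshold - v, 0))
}

my_vec <- c(2,3,5,3,5,2,6,7,2,4,6,8)
below_threshold_sum(my_vec, 4)       # 8
below_threshold_sum(c(1, 9, 3), 4)   # 3 + 0 + 1 = 4
```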
I want to find a way to determine whether two or more consecutive elements of a vector are equal.
For example, in the vector x=c(1,1,1,2,3,1,3), the first, second and third elements are equal.
With the following command, I can check whether a vector, say y, contains two or more consecutive elements that are equal to 2 or 3
all(rle(y)$lengths[which( rle(y)$values==2 | rle(y)$values==3 )]==1)
Is there any other faster way?
EDIT
Let's say we have the vector z=c(1,1,2,1,2,2,3,2,3,3).
I want a vector with three elements as output. The first element refers to the value 1, the second to 2 and the third to 3. An element of the output vector is 1 if two or more consecutive elements of z are equal to the corresponding value, and 0 otherwise. So the output for the vector z will be (1,1,1).
For the vector w=c(1,1,2,3,2,3,1) the output will be 1,0,0, since only the value 1 appears in two consecutive elements, namely in the first and second positions of w.
I'm not entirely sure I understand your question, as it could be worded better. The first part just asks how to find out whether consecutive elements in a vector are equal. The answer is to use the diff() function combined with a check for a difference of zero:
z <- c(1,1,2,1,2,2,3,2,3,3)
sort(unique(z[which(diff(z) == 0)]))
# [1] 1 2 3
w <- c(1,1,2,3,2,3,1)
sort(unique(w[which(diff(w) == 0)]))
# [1] 1
But your edit example seems to imply you are looking to see whether there are repeated values in a vector that contains only the integers 1, 2, or 3. Your output will always be X, Y, Z, where
X is 1 if there is at least one "1" repeated, else 0
Y is 2 if there is at least one "2" repeated, else 0
Z is 3 if there is at least one "3" repeated, else 0
Is this correct?
If so, see the following
continuously <- function(x){
s <- sort(unique(x[which(diff(x) == 0)]))
output <- c(0,0,0)
output[s] <- s
return(output)
}
continuously(z)
# [1] 1 2 3
continuously(w)
# [1] 1 0 0
Assuming your series is z=c(1,1,2,1,2,2,3,2,3,3), you can do:
(unique(z[c(FALSE, diff(z) == 0)]) >= 0)+0 which will output 1, 1, 1.
When you run the above command on your other sequence:
w=c(1,1,2,3,2,3,1)
then (unique(w[c(FALSE, diff(w) == 0)]) >= 0)+0 will return 1.
You may also try this for an exact output like 1,1,1 or 1,0,0. Note that it relies on recycling in ==, so it is only reliable when either every value or exactly one value is repeated:
(unique(z[c(FALSE, diff(z) == 0)]) == unique(z))+0 #1,1,1 for z and 1,0,0 for w
Logic:
The diff command takes the difference between each item and the one before it. Since there is always one fewer difference than there are items, I have prepended FALSE as the first item. I then test whether each difference is zero and use the result to subset the original sequence. Finally, the values are converted to 1s by asking whether they are greater than or equal to 0 (you may also use some other condition to get a series of 1s).
This assumes your sequence doesn't contain negative numbers.
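For an output that always has one 0/1 entry per value, regardless of how many values are repeated or whether they are negative, a variation using %in% may be easier to reason about. This is just a sketch; the function name has_repeat and its values argument are assumptions made for the example.

```r
# Hypothetical helper: 1 if a value occurs in two or more consecutive
# positions of x, 0 otherwise, reported for each entry of `values`.
has_repeat <- function(x, values = sort(unique(x))) {
  repeated <- x[c(FALSE, diff(x) == 0)]   # elements equal to their predecessor
  as.integer(values %in% repeated)
}

z <- c(1,1,2,1,2,2,3,2,3,3)
w <- c(1,1,2,3,2,3,1)
has_repeat(z)                           # 1 1 1
has_repeat(w)                           # 1 0 0
has_repeat(c(1,1,2,3,3), values = 1:3)  # 1 0 1
```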
Let's say I have a numeric vector X
X <- c(1,42,1,23,5,7)
I would like to create another vector Y with the same number of elements, each of which is a randomly generated whole number between 1 (the lower bound) and the corresponding element of X (the upper bound). For example, Y[2] would be a random whole number between 1 and 42, and Y[4] would be a random whole number between 1 and 23.
I have tried to use the apply function to do this
Y<-apply(X, 1, sample)
but I am having no luck and get the error message
Error in apply(X, 1, sample) : dim(X) must have a positive length
Is there a better way to do this?
You can't use apply on a vector; it works only on objects with dimensions (e.g., matrices). You have to use sapply instead. Furthermore, you need the argument size = 1, since you want to sample one value for each entry of X.
sapply(X, sample, size = 1)
# [1] 1 7 1 16 3 6  (random; your values will differ)
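The sapply call works because sample(), given a single number n >= 1, draws from 1:n. A slightly more explicit sketch of the same idea uses sample.int, which always treats its first argument as an upper bound, together with vapply to fix the result type. The set.seed call is only there to make the example repeatable.

```r
X <- c(1, 42, 1, 23, 5, 7)

set.seed(1)  # only for reproducibility of this example
# One draw from 1..upper for each element of X; vapply guarantees an integer vector.
Y <- vapply(X, function(upper) sample.int(upper, size = 1), integer(1))

length(Y) == length(X)   # TRUE
all(Y >= 1 & Y <= X)     # TRUE
```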
Suppose I have a vector of size 915. The name of the vector is base
[1] 1.467352 4.651796 4.949438 5.625817 5.691591 5.839439 5.927564 7.152487 8.195661 8.640770....591.3779 591.9426 592.0126 592.3861 593.2927 593.3991 593.6104 594.1526 594.5325 594.7093
I have also constructed another vector:
intervals <- c(0,seq(from = 1, by = 6,length.out = 100))
We can interpret this vector as interval boundaries.
Then I want to determine which interval (as defined by the vector intervals) each value of base lies in. For example, the first element of base lies in the second interval (1.467352 is not in (0,1], but is in (1,7]). I want to do the same for every value in base.
From this I want to create a third vector whose i-th element is the number of the interval that contains the i-th element of base.
BUT! Each interval has a maximum capacity of, for example, 5 (an interval can hold at most five elements). This means that even if seven elements of base lie in the second interval, the second interval may include only five of them.
third_vector = 2,2,2,2,2,3,3....
As we can see, only five elements are in the second interval. The 6th and 7th elements, for lack of space, must go into the third interval.
And the question is: how can I implement this efficiently in R?
One option is to bin the data into quantiles, where the number of quantiles is set based on the maximum number of values allowed in a given interval. Below is an example. Let me know if this is what you had in mind:
# Fake data
set.seed(1)
dat = data.frame(x=rnorm(83, 10, 5))
# Cut into intervals containing no more than n values
n = 5
dat$x.bin = cut(dat$x, quantile(dat$x, seq(0,1,length=ceiling(nrow(dat)/n)+1)),
include.lowest=TRUE)
# Check
table(dat$x.bin)
[-1.07,3.62] (3.62,5.87] (5.87,6.7] (6.7,7.29] (7.29,8.2] (8.2,9.32] (9.32,9.72]
5 5 5 5 5 4 5
(9.72,9.97] (9.97,10.8] (10.8,11.7] (11.7,12.1] (12.1,12.9] (12.9,13.5] (13.5,14]
5 5 5 5 4 5 5
(14,15.5] (15.5,17.4] (17.4,22]
5 5 5
To implement @LorenzoBusetto's suggestion, you could do the following. This method ensures that every interval, except possibly the last, contains exactly n values:
dat = dat[order(dat$x),]
dat$x.bin = 0:(nrow(dat)-1) %/% n
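Neither approach above uses the fixed interval boundaries from the question. If the rule is meant literally (each value goes to its own interval unless that interval is already full, in which case it spills into the next one), a simple loop over the sorted values is one way to sketch it. The function name assign_capped and the small test data are assumptions made for illustration; findInterval with left.open = TRUE gives the (a, b] intervals described in the question.

```r
# Hypothetical helper: interval number for each value of `base`, with at most
# `cap` values per interval; overflow moves to the next interval.
assign_capped <- function(base, breaks, cap) {
  natural <- findInterval(base, breaks, left.open = TRUE)  # (a, b] intervals
  out <- integer(length(base))
  cur <- 0L      # interval currently being filled
  filled <- 0L   # number of values already placed in `cur`
  for (i in order(base)) {
    if (natural[i] > cur) {          # value belongs to a later interval
      cur <- natural[i]
      filled <- 0L
    } else if (filled >= cap) {      # current interval is full: spill over
      cur <- cur + 1L
      filled <- 0L
    }
    filled <- filled + 1L
    out[i] <- cur
  }
  out
}

base_demo <- c(1.5, 2, 3, 4, 5, 6, 6.5, 8)   # made-up stand-in for `base`
assign_capped(base_demo, breaks = c(0, 1, 7, 13), cap = 5)
# 2 2 2 2 2 3 3 3
```

For 915 values a plain loop like this is fast enough; for much longer vectors a vectorised formulation would be worth the extra effort.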
I need to find stretches of values above 0 in a numeric vector where each stretch has at least 10 members. I do not want to check every single position, as that would be very time intensive (the vector has over 10 million elements).
Here is what I'm trying to do (very preliminary as I can't figure out how to skip increments in for loop):
1. Check if x[i] (start position) is positive.
a) if positive, check to see if x[i+10] (end position) is positive (since we want at least length 10 of positive integers)
* if positive, check every position in between to see if positive
* if negative, move to x[i+11], skip positions (e.g. new start position is x[i+12]) in between start & end positions since we would not get >10 members if negative end position is included.
x <- rnorm(50, mean=0, sd=4)
for(i in 1:length(x)){
if(x[i]>0){ # IF START POSITION IS POSITIVE
flag=1
print(paste0(i, ": start greater than 1"))
if(x[i+10]>0){ # IF END POSITION POSITIVE, THEN CHECK ALL POSITIONS IN BETWEEN
for(j in i+1:i+9){
if(x[j]>0){ # IF POSITION IS POSITIVE, CHECK NEXT POSITION IF POSITIVE
print(paste0(j, ": for j1"))
}else{ # IF POSITION IS NEGATIVE, THEN SKIP CHECKING & SET NEW START POSITION
print(paste0(j, ": for j2"))
i <- i+11
break;
}
}
}else{ # IF END POSITION IS NOT POSITIVE, START CHECK ONE POSITION AFTER END POSITION
i <- i+11
}
}
}
The issue I have is that even when I manually increment i, the for loop i value masks the new set value. Appreciate any insight.
I dunno if this approach is as efficient as Curt F's, but how about
runs <- rle(x>0)
And then working with the regions defined by runs$lengths >= 10 & runs$values == TRUE ?
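To turn those runs into start and end positions, the cumulative sum of the run lengths gives the end of each run. A minimal sketch on a small made-up vector (the variable names are just for illustration, and NA values in x are not handled):

```r
# Toy data: positive runs of length 12, 2 and 10
x <- c(-1, rep(2, 12), -3, 1, 1, -1, rep(5, 10))

runs   <- rle(x > 0)
ends   <- cumsum(runs$lengths)        # last position of every run
starts <- ends - runs$lengths + 1     # first position of every run
keep   <- runs$values & runs$lengths >= 10

regions <- data.frame(start = starts[keep], end = ends[keep])
regions
#   start end
# 1     2  13
# 2    18  27
```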
Here is a solution that finds stretches of ten positive numbers in a vector of length ten million. It does not use the loop approach suggested in the OP.
The idea here is to take the cumulative sum of the logical expression vec>0. The difference between the cumulative sum at positions n and n-10 will be 10 only if all ten values of the vector at positions n-9 through n are positive.
filter is an easy and relatively fast way to calculate these differences.
#generate random data
vec <- runif(1e7,-1,1)
#cumulative sum
csvec <- cumsum(vec>0)
#construct a filter that will find the difference between the nth value with the n-10th value of the cumulative sign vector
f11 <- c(1,rep(0,9),-1)
#apply the filter
fv <- filter(csvec, f11, sides = 1)
#find where the difference as computed by the filter is 10
inds <- which(fv == 10)
#check a few results
> vec[(inds[1]-9):(inds[1])]
[1] 0.98457526 0.03659257 0.77507743 0.69223183 0.70776891 0.34305865 0.90249491 0.93019927 0.18686722 0.69973176
> vec[(inds[2]-9):(inds[2])]
[1] 0.0623790 0.8489058 0.3783840 0.8781701 0.6193165 0.6202030 0.3160442 0.3859175 0.8416434 0.8994019
> vec[(inds[200]-9):(inds[200])]
[1] 0.0605163 0.7921233 0.3879834 0.6393018 0.2327136 0.3622615 0.1981222 0.8410318 0.3582605 0.6530633
#check all the results
> prod(sapply(1:length(inds),function(x){prod(sign(vec[(inds[x]-9):(inds[x])]))}))
[1] 1
I played around with system.time() to see how long the various steps took. On my not-very-powerful laptop the longest step was filter(), which took just over half a second for a vector of length ten million.
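To see what the filter step does, here is the same technique on a tiny vector with a window of 3 instead of 10. Two details in this sketch are my own additions rather than part of the code above: the call is written as stats::filter so it still works if another package (such as dplyr) masks filter, and a 0 is prepended to the cumulative sum so that a window starting at position 1 is also found.

```r
vec <- c(2, 1, 3, -1, 3, 4, 5, -2, 6)
k   <- 3                                  # required run length (10 in the answer)

cs0 <- c(0, cumsum(vec > 0))              # leading 0 so the first window is covered
# fv[m] = cs0[m] - cs0[m - k]; the first k entries are NA
fv  <- stats::filter(cs0, c(1, rep(0, k - 1), -1), sides = 1)

window_ends <- which(fv == k) - 1         # shift back because of the leading 0
window_ends                               # 3 7
vec[(window_ends[2] - k + 1):window_ends[2]]  # 3 4 5
```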
Vectorised solution using only basic commands:
x <- runif(1e7,-1,1) # generate random vector
y <- which(x<=0) # find boundaries i.e. negatives and zeros
dif <- y[2:length(y)] - y[1:(length(y)-1)] # find distance in boundaries
drange <- which(dif > 10) # find distances more than 10
starts <- y[drange]+1 # starting positions of sequence
ends <- y[drange+1]-1 # last positions of sequence
The first range you want is from x[starts[1]] to x[ends[1]] , etc.
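One thing to watch with this approach: a positive stretch at the very start or very end of x has a boundary on only one side, so it is not reported. A possible variation, sketched below on made-up data, adds sentinel boundaries at positions 0 and length(x) + 1 and uses diff() for the distances.

```r
# Toy data: positive runs at positions 1-11, 13-14, 16-25 and 27-38
x <- c(rep(1, 11), -1, 2, 2, 0, rep(3, 10), -4, rep(1, 12))

y <- c(0, which(x <= 0), length(x) + 1)  # boundaries plus two sentinels
gap  <- diff(y)                          # distance between consecutive boundaries
long <- which(gap > 10)                  # gap > 10 means at least 10 positives between

starts <- y[long] + 1
ends   <- y[long + 1] - 1
cbind(starts, ends)
#      starts ends
# [1,]      1   11
# [2,]     16   25
# [3,]     27   38
```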