I need to find stretches of values above 0 in a numeric vector, where each stretch has at least 10 members. I do not want to check every single position one at a time, as that would be very time-intensive (the vector has over 10 million elements).
Here is what I'm trying to do (very preliminary, as I can't figure out how to skip increments in a for loop):
1. Check if x[i] (start position) is positive.
a) If positive, check whether x[i+10] (end position) is also positive (since we want at least 10 positive values in a row).
* If the end position is positive, check every position in between to see if it is positive.
* If the end position is negative, skip the positions in between and move the start to x[i+11], since we could never get 10 or more members from a stretch that includes a negative end position.
x <- rnorm(50, mean=0, sd=4)
for(i in 1:(length(x)-10)){ # stop early enough that x[i+10] exists
  if(x[i]>0){ # IF START POSITION IS POSITIVE
    flag <- 1
    print(paste0(i, ": start greater than 0"))
    if(x[i+10]>0){ # IF END POSITION POSITIVE, THEN CHECK ALL POSITIONS IN BETWEEN
      for(j in (i+1):(i+9)){ # note: i+1:i+9 would misparse, since : binds tighter than +
        if(x[j]>0){ # IF POSITION IS POSITIVE, CHECK NEXT POSITION IF POSITIVE
          print(paste0(j, ": for j1"))
        }else{ # IF POSITION IS NEGATIVE, THEN SKIP CHECKING & SET NEW START POSITION
          print(paste0(j, ": for j2"))
          i <- i+11 # has no effect: for re-assigns i at the next iteration
          break
        }
      }
    }else{ # IF END POSITION IS NOT POSITIVE, START CHECK ONE POSITION AFTER END POSITION
      i <- i+11 # has no effect: for re-assigns i at the next iteration
    }
  }
}
The issue I have is that even when I manually increment i inside the loop, the for loop's own i value masks the newly set value on the next iteration. I'd appreciate any insight.
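(For what it's worth, this masking is by design: for re-assigns i from the sequence at the top of every iteration, so manual increments are discarded. A while loop keeps control of the index; a minimal sketch of the intended skipping logic, using x as above:)
i <- 1
while (i <= length(x) - 9) {
  if (all(x[i:(i + 9)] > 0)) {
    print(paste0(i, ": start of a run of at least 10 positives"))
    i <- i + 1 # step forward normally
  } else {
    # jump straight past the first non-positive value in the window
    i <- i + which(x[i:(i + 9)] <= 0)[1]
  }
}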
I don't know if this approach is as efficient as Curt F.'s, but how about
runs <- rle(x > 0)
and then working with the regions defined by runs$lengths >= 10 & runs$values == TRUE?
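For instance, the start and end positions of the qualifying regions can be recovered from the run lengths; a sketch, assuming x is the input vector:
runs <- rle(x > 0)
run.ends <- cumsum(runs$lengths)           # last index of each run
run.starts <- run.ends - runs$lengths + 1  # first index of each run
keep <- runs$values & runs$lengths >= 10   # positive runs of length >= 10
cbind(start = run.starts[keep], end = run.ends[keep])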
Here is a solution that finds stretches of ten positive numbers in a vector of length ten million. It does not use the loop approach suggested in the OP.
The idea here is to take the cumulative sum of the logical expression vec>0. The difference between the cumulative sum at position n and at position n-10 will be 10 only if all ten values of the vector at positions n-9 through n are positive.
filter is an easy and relatively fast way to calculate these differences.
#generate random data
vec <- runif(1e7,-1,1)
#cumulative sum
csvec <- cumsum(vec>0)
#construct a filter that computes the difference between the nth and the (n-10)th value of the cumulative sum vector
f11 <- c(1,rep(0,9),-1)
#apply the filter
fv <- filter(csvec, f11, sides = 1)
#find where the difference as computed by the filter is 10
inds <- which(fv == 10)
#check a few results
> vec[(inds[1]-9):(inds[1])]
[1] 0.98457526 0.03659257 0.77507743 0.69223183 0.70776891 0.34305865 0.90249491 0.93019927 0.18686722 0.69973176
> vec[(inds[2]-9):(inds[2])]
[1] 0.0623790 0.8489058 0.3783840 0.8781701 0.6193165 0.6202030 0.3160442 0.3859175 0.8416434 0.8994019
> vec[(inds[200]-9):(inds[200])]
[1] 0.0605163 0.7921233 0.3879834 0.6393018 0.2327136 0.3622615 0.1981222 0.8410318 0.3582605 0.6530633
#check all the results
> prod(sapply(1:length(inds),function(x){prod(sign(vec[(inds[x]-9):(inds[x])]))}))
[1] 1
I played around with system.time() to see how long the various steps took. On my not-very-powerful laptop the longest step was filter(), which took just over half a second for a vector of length ten million.
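For reference, the timing of the slowest step can be measured by wrapping the call, e.g.:
system.time(fv <- filter(csvec, f11, sides = 1)) # reported above as just over half a second for 1e7 elements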
Vectorised solution using only basic commands:
x <- runif(1e7,-1,1) # generate random vector
y <- which(x<=0) # find boundaries, i.e., zeros and negatives
dif <- y[2:length(y)] - y[1:(length(y)-1)] # find distance in boundaries
drange <- which(dif > 10) # find distances more than 10
starts <- y[drange]+1 # starting positions of sequence
ends <- y[drange+1]-1 # last positions of sequence
The first range you want is from x[starts[1]] to x[ends[1]], etc.
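One caveat worth noting: a run that touches the very start or end of x has no boundary on one side and is missed by the above. Padding the boundary vector with sentinels handles this; a sketch (diff(y) is equivalent to the dif line above):
y2 <- c(0, y, length(x) + 1) # sentinel boundaries before and after x
dif2 <- diff(y2)             # gaps between successive boundaries
drange2 <- which(dif2 > 10)  # gaps holding at least 10 positives
starts2 <- y2[drange2] + 1
ends2 <- y2[drange2 + 1] - 1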
Let's say I have a vector that increases and then decreases, like the simple example below. I want to identify the position (index) in the vector that is closest to a value, with the condition that the following value must be lower (I always want to pick up the closest value on the downslope of the data).
In the example below, I want the answer to be 13 (rather than 6).
I can't think of a solution using which.min() or match.closest() that would reliably work for this.
Any help gratefully received!
# example vector which increases then decreases
vector <- c(1,2,3,4,5,6,7,8,9,9,8,7,6,5,4,3,2,1)
# index
index <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18)
value <- 6.2
Maybe you can use cummax + rev like below
which.min(abs(rev(cummax(rev(vector)))-value))
which gives
[1] 13
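To see why this works: reversing and taking the running maximum replaces every upslope element with the peak value to its right, so only downslope elements remain distinct and can win which.min. The intermediate vector makes this visible:
rev(cummax(rev(vector)))
# [1] 9 9 9 9 9 9 9 9 9 9 8 7 6 5 4 3 2 1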
Assuming your points always continue to decrease in value after the first decrease, and value is between the point of the first decrease and the last point, you could do this:
closest <- function(value, vec, next_is){
  lead_fun <- function(x) c(tail(x, -1), NA) # shift vec left by one position
  meets_cond <- get(next_is)(lead_fun(vec), vec) # TRUE where the next value relates to the current one as requested
  which.min(abs(vec[meets_cond] - value)) + which.max(meets_cond) - 1
}
closest(6.2, vec = vector, next_is = '<')
# [1] 13
Check which elements in the vector meet your condition, find the index of the closest element in that vector, then add back the number of elements before the first which meets your condition.
Edit: ----------------------------------------
Another version of the function which accepts an arbitrary logical vector which is TRUE for indices meeting a condition:
closest <- function(value, vec, cond_vec){
  which.min(abs(vec[cond_vec] - value)) + which.max(cond_vec) - 1
}
Note that this assumes the values matching your condition are all in one contiguous region (not e.g. the first matches, then the third, then the sixth, etc.)
If your condition is that the point comes after the max value:
closest(6.2, vec = vector, cond_vec = seq_along(vector) > which.max(vector))
# [1] 13
I need to find consecutive values in a data.frame of wind speed measurements that are smaller than a certain threshold. I'm looking for 2 consecutive observations beneath the threshold. I want to return the location of the first observation of the series that meets these criteria.
The following should work for what you are asking for:
# create random vector, for example
set.seed(1234)
temp <- rnorm(50)
# get position of all observations that fulfill criterion, here obs is > 0.2
thresholdObs <- which(temp > .2)
Here, which returns the position of all observations fulfilling some criterion. At this point, it is prudent to test whether any observations satisfy your criterion at all. This can be achieved with the intersect function, or with subsetting together with the %in% operator:
length(intersect(thresholdObs, thresholdObs + 1))
or
length(thresholdObs[thresholdObs %in% (thresholdObs + 1L)])
If 0 is returned, then no such observation is in your data. If the length is 1 or greater, then you can use
# get the answer
min(thresholdObs[thresholdObs %in% (thresholdObs + 1L)] - 1)
or
min(intersect(thresholdObs, thresholdObs + 1))-1
As #Frank notes below, if min is fed a vector of length 0, it returns Inf, which means infinity in R. The logic here: I increment the positions (thresholdObs + 1) and then take the intersection of the two sets. The only positions returned are those whose previous position also passes the threshold test. I then subtract 1 from these positions and take the minimum to get the desired result. Because which returns an ordered result, the following will also work:
intersect(thresholdObs, thresholdObs + 1)[1] - 1
where [1] extracts the first element in the intersection.
Also note that
intersect(thresholdObs, thresholdObs + 1) - 1
or
thresholdObs[thresholdObs %in% (thresholdObs + 1L)]
will return all positions where there are at least two consecutive elements passing the threshold. However, a run of more than two consecutive passing values will return multiple overlapping positions.
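If instead you want only the first position of each qualifying run (one result per run, however long), an rle-based sketch using the same example criterion:
r <- rle(temp > .2)
run.ends <- cumsum(r$lengths)
run.starts <- run.ends - r$lengths + 1
run.starts[r$values & r$lengths >= 2] # first index of every run of 2+ passing values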
I have a custom deck consisting of eight cards each of the values 2^n, n = 0,...,6. I draw cards (without replacement) until the sum is equal to or greater than a threshold. How can I implement in R a function that calculates the mean of the difference between the sum and the threshold?
I tried to do it using this: How to store values in a vector with nested functions
but it takes ages... I think there is a way to do it with probabilities/simulations, but I can't figure it out.
The threshold could be greater than the value of any single card (e.g., threshold = 500) or less than the value of a single card (e.g., threshold = 50).
What I have done so far is to find all the subsets whose sum is greater than or equal to the threshold. Then I just subtract the threshold and calculate the mean.
I am using the following code in R. For a small set I get the answer quite fast. However, I have been running the function for several hours with the set containing the 56 numbers, and it is still running.
set <- c(rep(1,8), rep(2,8), rep(4,8), rep(8,8), rep(16,8), rep(32,8), rep(64,8))
recursive.subset <- function(x, index, current, threshold, result){
  for (i in index:length(x)){
    if (current + x[i] >= threshold){
      store <<- append(store, sum(c(result, x[i])))
    } else {
      recursive.subset(x, i + 1, current + x[i], threshold, c(result, x[i]))
    }
  }
}
store <- vector()
inivector <- vector(mode="numeric", length=0) # initializing empty vector
threshold <- 500 # threshold must be defined before the call, e.g. 500
recursive.subset(set, 1, 0, threshold, inivector)
I don't know if it is possible to get an exact solution, simply because there are so many possible combinations. It is probably better to do simulations, i.e., write a script for one full draw and then rerun that script many times. Since individual draws give very similar results, the simulation should give a pretty good approximation.
Ok, here goes:
set <- rep(2^(0:6), each = 8)
thr <- 500
fun <- function(set, thr){
  x <- cumsum(sample(set))         # shuffle the whole deck, then take the running total
  value <- x[min(which(x >= thr))] # first total at or above the threshold
  value
}
system.time(a <- replicate(100000, fun(set,thr)))
# user system elapsed
# 1.10 0.00 1.09
mean(a - thr)
# [1] 21.22992
Explanation: Rather than drawing one card at a time, I draw all cards simultaneously (sample) and then calculate the cumulative sum (cumsum). I then find the first point where the cards add up to the threshold or more, and look up the cumulative sum at that point in x. We run this function many times with replicate to obtain a vector of the final values, and use mean(a - thr) to calculate the mean difference.
Edit: Made a really stupid typo in the code, fixed it now.
Edit2: Shortened the function a little.
I'm trying to find "peaks" in a vector, i.e. elements for which the nearest neighboring elements on both sides that do not have the same value have lower values.
So, e.g. in the vector
c(0,1,1,2,3,3,3,2,3,4,5,6,5,7)
there are peaks at positions 5, 6, 7, 12 and 14.
Finding local maxima and minima comes close, but doesn't quite fit.
This should work. The call to diff(sign(diff(...))) == -2 finds peaks by, in essence, testing for a negative second derivative at/around each of the unique values picked out by rle.
x <- c(0,1,1,2,3,3,3,2,3,4,5,6,5,7)
r <- rle(x)
which(rep(x = diff(sign(diff(c(-Inf, r$values, -Inf)))) == -2,
times = r$lengths))
# [1] 5 6 7 12 14
(I padded your vector with -Infs so that both elements 1 and 14 have the possibility of being matched, should the nearest different-valued element have a lower value. You can obviously adjust the end-element matching rule by instead setting one or both of these to Inf.)
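For example, padding the front with Inf instead forbids the first element from ever matching (the output is unchanged here, since position 1 is not a peak in this x):
which(rep(x = diff(sign(diff(c(Inf, r$values, -Inf)))) == -2,
          times = r$lengths))
# [1]  5  6  7 12 14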
The language I'm using is R, but you don't necessarily need to know about R to answer the question.
Question:
I have a sequence that can be considered the ground truth, and another sequence that is a shifted version of the first, with some missing values. I'd like to know how to align the two.
setup
I have a sequence ground.truth that is basically a set of times:
ground.truth <- rep( seq(1,by=4,length.out=10), 5 ) +
rep( seq(0,length.out=5,by=4*10+30), each=10 )
Think of ground.truth as times where I'm doing the following:
{take a sample every 4 seconds for 10 times, then wait 30 seconds} x 5
I have a second sequence observations, which is ground.truth shifted with 20% of the values missing:
nSamples <- length(ground.truth)
idx_to_keep <- sort(sample( 1:nSamples, .8*nSamples ))
theLag <- runif(1)*100
observations <- ground.truth[idx_to_keep] + theLag
nObs <- length(observations)
If I plot these vectors (remember, think of these as times), observations looks like a right-shifted copy of ground.truth with some points missing. [plot omitted]
What I've tried. I want to:
calculate the shift (theLag in my example above)
calculate a vector idx such that ground.truth[idx] == observations - theLag
First, assume we know theLag. Note that observations[1] - theLag is not necessarily ground.truth[1]; in fact, observations[1] - theLag == ground.truth[1 + lagI] for some lagI >= 0.
To calculate this, I thought I'd use cross-correlation (ccf function).
However, whenever I do this, the maximum cross-correlation occurs at lag 0, implying ground.truth[1] == observations[1] - theLag. But I've tried this on examples where I explicitly made sure that observations[1] - theLag is not ground.truth[1] (i.e., modified idx_to_keep so that it doesn't contain 1).
The shift theLag shouldn't affect the cross-correlation (isn't ccf(x,y) == ccf(x, y - constant)?), so I was going to work it out later.
Perhaps I'm misunderstanding, though, because observations doesn't have as many values in it as ground.truth? Even in the simpler case where I set theLag == 0, the cross-correlation function still fails to identify the correct lag, which leads me to believe I'm thinking about this wrong.
Does anyone have a general methodology for me to go about this, or know of some R functions/packages that could help?
Thanks a lot.
For the lag, you can compute all the differences (distances) between your two sets of points:
diffs <- outer(observations, ground.truth, '-')
Your lag should be the value that appears length(observations) times:
which(table(diffs) == length(observations))
# 55.715382960625
# 86
Double check:
theLag
# [1] 55.71538
The second part of your question is easy once you have found theLag:
idx <- which(ground.truth %in% (observations - theLag))
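A couple of practical footnotes: which on a table returns the position within the table (the 86 printed above), while the lag itself is the table's name; and since these are floating-point comparisons, rounding before matching is safer. A sketch:
tab <- table(diffs)
theLag.hat <- as.numeric(names(tab)[tab == length(observations)]) # the lag as a number
# floating-point-safe variant of the idx lookup
idx <- which(round(ground.truth, 8) %in% round(observations - theLag.hat, 8))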
The following should work if your time series are not too long. You have two vectors of time-stamps, the second one being a shifted and incomplete copy of the first, and you want to find by how much it was shifted.
# Sample data
n <- 10
x <- cumsum(rexp(n,.1))
theLag <- rnorm(1)
y <- theLag + x[sort(sample(1:n, floor(.8*n)))]
We can try all possible lags and, for each one, compute how bad the alignment is by matching each observed timestamp with the next "truth" timestamp.
# Loss function
library(sqldf)
f <- function(u) {
  # Put all the values in a data.frame
  d1 <- data.frame(g="truth", value=x)
  d2 <- data.frame(g="observed", value=y+u)
  d <- rbind(d1, d2)
  # For each observed value, find the next truth value
  # (we could take the nearest, on either side,
  # but it would be more complicated)
  d <- sqldf("
    SELECT A.g, A.value,
           ( SELECT MIN(B.value)
             FROM d AS B
             WHERE B.g='truth'
             AND B.value >= A.value
           ) AS next
    FROM d AS A
    WHERE A.g = 'observed'
  ")
  # If u is greater than the lag, there are missing values.
  # If u is smaller, the differences decrease
  # as we approach the lag.
  if(any(is.na(d))) {
    return(Inf)
  } else {
    return( sum(d$`next` - d$value, na.rm=TRUE) )
  }
}
We can now search for the best lag.
# Look at the loss function
sapply( seq(-2,2,by=.1), f )
# Minimize the loss function.
# Change the interval if it does not converge,
# i.e., if it seems in contradiction with the values above
# or if the minimum is Inf
(r <- optimize(f, c(-3,3)))
-r$minimum
theLag # Same value, most of the time
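If you'd rather avoid the sqldf dependency, the same next-truth-timestamp lookup can be sketched in base R with findInterval (my rewrite of the loss function above; it relies on x being sorted, which it is here):
# Base-R loss: for each shifted observation, find the next truth value at or above it
f2 <- function(u) {
  v <- y + u
  # with left.open = TRUE, findInterval returns i with x[i] < v[k] <= x[i+1],
  # so x[i+1] is the smallest truth timestamp >= v[k]
  nxt <- findInterval(v, x, left.open = TRUE) + 1
  if (any(nxt > length(x))) return(Inf) # an observation lies beyond the last truth value
  sum(x[nxt] - v)
}
optimize(f2, c(-3, 3)) # should behave like the sqldf version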