R: Find consecutive values beneath threshold

I need to find consecutive values in a data.frame of wind speed measurements that are smaller than a certain threshold. I'm looking for 2 consecutive observations beneath the threshold. I want to return the location of the first observation of the series that meets these criteria.

The following should work for what you are asking for:
# create random vector, for example
set.seed(1234)
temp <- rnorm(50)
# get position of all observations that fulfill criterion, here obs is > 0.2
thresholdObs <- which(temp > .2)
Here, which returns the positions of all observations fulfilling some criterion. At this point, it is prudent to test whether any observations satisfy your criterion. This can be achieved with the intersect function or with subsetting together with the %in% operator:
length(intersect(thresholdObs, thresholdObs + 1))
or
length(thresholdObs[thresholdObs %in% (thresholdObs + 1L)])
If length 0 is returned, then no such observation is in your data. If the length is 1 or greater, then you can use
# get the answer
min(thresholdObs[thresholdObs %in% (thresholdObs + 1L)] - 1)
or
min(intersect(thresholdObs, thresholdObs + 1))-1
As @Frank notes below, if min is fed a vector of length 0, it returns Inf (with a warning), which means infinity in R. To explain the approach: I increment the positions with thresholdObs + 1 and then take the intersection of the two sets. The only positions returned are those whose previous position also passes the threshold test. I then subtract 1 from these positions and take the minimum to get the desired result. Because which returns positions in increasing order, the following will also work:
intersect(thresholdObs, thresholdObs + 1)[1] - 1
where [1] extracts the first element in the intersection.
Also note that
intersect(thresholdObs, thresholdObs + 1) - 1
or
thresholdObs[thresholdObs %in% (thresholdObs + 1L)] - 1
will return all positions where there are at least two consecutive elements that pass the threshold. However, multiple positions will be returned for any run of more than 2 consecutive values passing the threshold.
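Since the question asks for values beneath a threshold and needs a safe result when no pair exists, the steps above can be wrapped up as in the following sketch. The function name first_pair_below and the wind values are made up for illustration; they are not from the original post.

```r
# Hypothetical helper: position of the first observation that starts a run
# of two consecutive values below `threshold`, or NA if there is none.
first_pair_below <- function(x, threshold) {
  below <- which(x < threshold)
  # Positions whose predecessor is also below the threshold
  second_of_pair <- intersect(below, below + 1L)
  if (length(second_of_pair) == 0L) return(NA_integer_)
  # which() is ordered, so the first element is the earliest pair
  second_of_pair[1] - 1L
}

wind <- c(5.1, 4.8, 2.9, 3.5, 1.2, 1.8, 1.1, 6.0)  # made-up wind speeds
first_pair_below(wind, threshold = 2)  # 5
first_pair_below(wind, threshold = 1)  # NA
```

Returning NA instead of calling min on an empty vector avoids the Inf result and its warning.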

Related

Sum up the differences between every element of a vector and a given threshold

I have the following vector:
my_vec <- c(2,3,5,3,5,2,6,7,2,4,6,8)
threshold <- 4
Is there a way to sum up the differences of all smaller elements of my_vec compared to the threshold value?
So the expected result on this example should be 8 (2+1+0+1+0+2+0+0+2+0+0+0)
For my purpose, the sum (8) is all I need (I don't need the difference between every element). I tried this by using a loop but unfortunately, there are several vectors of different length so I can't loop from 1:12 (as on the above vector) on a vector which has only 10 elements.
First subset elements below threshold and then sum difference to threshold:
threshold <- 4
sum((threshold - my_vec[my_vec < threshold]))
# [1] 8
You can use pmin between my_vec and threshold to get the element-wise minimum, and then sum its differences from threshold.
sum(threshold - pmin(my_vec, threshold))
#[1] 8
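The question also mentions several vectors of different lengths. Because both answers are vectorised, no index range such as 1:12 is needed. A minimal sketch, assuming the vectors are kept in a list (the list vecs, its second element, and the helper name shortfall are made up for illustration):

```r
threshold <- 4
vecs <- list(
  a = c(2, 3, 5, 3, 5, 2, 6, 7, 2, 4, 6, 8),  # the vector from the question
  b = c(1, 4, 9, 3)                           # hypothetical shorter vector
)

# Hypothetical helper: sum of how far each element falls below the threshold.
# pmax() clips negative differences (elements above the threshold) to 0.
shortfall <- function(v, threshold) sum(pmax(threshold - v, 0))

res <- sapply(vecs, shortfall, threshold = threshold)
res  # a = 8, b = 4
```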

Find position of closest value to another value given a condition in R

Let's say I have a vector that increases and then decreases, like the simple example below. I want to identify the position (index) in the vector that is closest to a value, but with the condition that the following value must be lower (I always want to pick up the closest value on the downslope of the data).
In the example below, I want the answer to be 13 (rather than 6).
I can't think of a solution using which.min() or match.closest() which would reliably work for this.
Any help gratefully received!
# example vector which increases then decreases
vector <- c(1,2,3,4,5,6,7,8,9,9,8,7,6,5,4,3,2,1)
# index
index <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18)
value <- 6.2
Maybe you can use cummax + rev like below
which.min(abs(rev(cummax(rev(vector)))-value))
which gives
[1] 13
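To see why this works, it may help to look at the intermediate result. The sketch below only breaks the one-liner into steps; the name flattened is introduced here for illustration.

```r
vector <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 9, 8, 7, 6, 5, 4, 3, 2, 1)
value <- 6.2

# Running maximum computed from the right-hand end: every value on the
# upslope is replaced by the peak, so only the downslope keeps its values.
flattened <- rev(cummax(rev(vector)))
flattened
# 9 9 9 9 9 9 9 9 9 9 8 7 6 5 4 3 2 1

# The upslope can no longer be closest to 6.2, so the downslope position wins
pos <- which.min(abs(flattened - value))
pos  # 13
```

This relies on the data rising to a single peak and then only decreasing, as in the example.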
Assuming your points always continue to decrease in value after the first decrease, and value is between the point of the first decrease and the last point, you could do this:
closest <- function(value, vec, next_is){
  lead_fun <- function(x) c(tail(x, -1), NA)
  meets_cond <- get(next_is)(lead_fun(vec), vec)
  which.min(abs(vec[meets_cond] - value)) + which.max(meets_cond) - 1
}
closest(6.2, vec = vector, next_is = '<')
# [1] 13
Check which elements in the vector meet your condition, find the index of the closest element in that vector, then add back the number of elements before the first which meets your condition.
Edit:
Another version of the function which accepts an arbitrary logical vector which is TRUE for indices meeting a condition:
closest <- function(value, vec, cond_vec){
  which.min(abs(vec[cond_vec] - value)) + which.max(cond_vec) - 1
}
Note that this assumes the values matching your condition are all in one contiguous region (not e.g. the first matches, then the third, then the sixth, etc.)
If your condition is that the point comes after the max value:
closest(6.2, vec = vector, cond_vec = seq_along(vector) > which.max(vector))
# [1] 13

Create loop or function that calculates through every observation and performs another calculation based off the result of the first calculation

I have a vector called Time:
Time <- c(2.444582, 2.445613, 2.446644, 2.447675, 2.448706, 2.449737, 2.769358, 2.770389, 2.771420, 2.772451, 2.773482, 2.774513, 2.775544, 2.776575, 2.777606, 3.087606, 3.093759, 3.099134, 3.100493, 3.478295, 3.484896, 3.482309)
I want to create a loop or function that subtracts the previous observation from each observation, for example: 2.445613-2.444582, 2.446644-2.445613, 2.447675-2.446644 etc. Then, wherever the difference between consecutive observations is greater than 0.2 (i.e. 2.769358-2.449737 = 0.320), I want the difference between the last number before the gap and the first number in that sequence (i.e. 2.449737-2.444582, 2.777606-2.769358, 3.100493-3.087606) and, for the final sequence, the difference between the last number in the vector and the first number in that sequence (i.e. 3.482309-3.478295).
My desired output from this Time vector would be: 0.005155, 0.008248, 0.012887, 0.004014
Here is one way without loops:
We first get the positions where the difference between the current observation and the previous one is greater than 0.2. We then use these positions to build the indices of the last and first element of each sequence and subset the Time vector with them.
inds <- which(diff(Time) > 0.2)
Time[c(inds, length(Time))] - Time[c(1, inds + 1)]
#[1] 0.005155 0.008248 0.012887 0.004014
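An alternative way to read the same idea is to label each sequence with a group number and take last minus first within each group. This is a sketch of a different formulation, not the answer's method; the names group and spans are introduced here.

```r
Time <- c(2.444582, 2.445613, 2.446644, 2.447675, 2.448706, 2.449737,
          2.769358, 2.770389, 2.771420, 2.772451, 2.773482, 2.774513,
          2.775544, 2.776575, 2.777606, 3.087606, 3.093759, 3.099134,
          3.100493, 3.478295, 3.484896, 3.482309)

# A new group starts at the first element and after every gap larger than 0.2
group <- cumsum(c(TRUE, diff(Time) > 0.2))

# Last value minus first value within each group
spans <- tapply(Time, group, function(v) v[length(v)] - v[1])
round(as.numeric(spans), 6)
# 0.005155 0.008248 0.012887 0.004014
```

The group vector can also be reused for other per-sequence summaries, such as counts via table(group).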

gene expression datamatrix filtration

I have one matrix with 3064 rows and 27 columns which contains values between -0.5 and 2.0. I want to extract every row which has at least one value >= 0.5. As an answer I would like to have the whole row in its original matrix form.
Consider m is my matrix, I tried:
m[m[1:190,1:16]>0.5,1:16]
As this command does not accept more than 190 rows, I went for 190 rows, but somehow it went wrong, because it gave me rows which also have values < 0.5.
Is it possible to write a function that can be applied to the whole matrix?
You can also try it like this if your data is named df:
df2<- df[apply(df, MARGIN = 1, function(x) any(x >= 0.5)), ]
library(fBasics)
m2 <- subset(x = m, subset = rowMaxs(m)>=0.5)
What mm = m[1:190,1:16] > 0.5 gives you is a logical matrix indicating which values of m[1:190,1:16] are greater than 0.5.
Then when you do m[mm], R treats mm as a plain logical vector and returns the corresponding values of m. The problem is that dim(m) is 3064 x 27 while dim(m[1:190,1:16]) is 190 x 16. Because R stores matrices column by column, the 3040 values of mm are matched against the first 3040 entries of the first column of m (and then recycled), not against the cells they were computed from.
So in order to get only the elements greater than 0.5, you need to apply mm to m[1:190,1:16], which has the same dimensions, i.e.:
m[1:190,1:16][m[1:190,1:16] > 0.5]
But what you do here is m[mm, 1:16], so each individual value of mm is used as a row selector, even though mm is a 190 x 16 matrix. That means you supply 190*16 = 3040 row selectors; it does not work with more rows because m only has 3064 rows.
What you want is a logical vector of length 190 (or even 3064, I guess) specifying which rows to take. You can get this vector with rowSums(m >= 0.5) > 0, which is TRUE for each row with at least one value greater than or equal to 0.5. Then you get your output with:
m[rowSums(m >= 0.5) > 0,]
And it will work for the whole matrix. Note that some values will be smaller than 0.5 since you selected the whole line if at least one value was greater than 0.5.
Edit
For rows with values <0.5, the idea is the same:
m[rowSums(m < 0.5) > 0,]
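A small reproducible check of the rowSums approach for the >= 0.5 case. The 4 x 3 matrix below is made up for illustration; drop = FALSE is an addition here so the result stays a matrix even when only one row matches.

```r
# Made-up example matrix (the real one is 3064 x 27)
m <- matrix(c(-0.2, 0.1,  0.3,
               0.7, 0.0,  0.1,
               0.4, 0.5, -0.5,
               0.2, 0.3,  0.4),
            nrow = 4, byrow = TRUE)

# TRUE for rows that contain at least one value >= 0.5
keep <- rowSums(m >= 0.5) > 0
keep  # FALSE TRUE TRUE FALSE

# Whole rows are returned, including their values below 0.5
kept <- m[keep, , drop = FALSE]
kept
```

If the matrix may contain missing values, rowSums(m >= 0.5, na.rm = TRUE) is one way to ignore them.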

How can I skip increments in R 'for' loop?

I need to find stretches of values above 0 in a numeric vector where there are at least 10 members within each region. I do not want to check every single position as it would be very time intensive (vector is over 10 million).
Here is what I'm trying to do (very preliminary as I can't figure out how to skip increments in for loop):
1. Check if x[i] (start position) is positive.
   a) if positive, check to see if x[i+10] (end position) is positive (since we want at least length 10 of positive integers)
      * if positive, check every position in between to see if positive
      * if negative, move to x[i+11], skip positions (e.g. new start position is x[i+12]) in between start & end positions since we would not get >10 members if negative end position is included.
x <- rnorm(50, mean=0, sd=4)
for(i in 1:length(x)){
  if(x[i]>0){ # IF START POSITION IS POSITIVE
    flag=1
    print(paste0(i, ": start greater than 1"))
    if(x[i+10]>0){ # IF END POSITION POSITIVE, THEN CHECK ALL POSITIONS IN BETWEEN
      for(j in i+1:i+9){
        if(x[j]>0){ # IF POSITION IS POSITIVE, CHECK NEXT POSITION IF POSITIVE
          print(paste0(j, ": for j1"))
        }else{ # IF POSITION IS NEGATIVE, THEN SKIP CHECKING & SET NEW START POSITION
          print(paste0(j, ": for j2"))
          i <- i+11
          break;
        }
      }
    }else{ # IF END POSITION IS NOT POSITIVE, START CHECK ONE POSITION AFTER END POSITION
      i <- i+11
    }
  }
}
The issue I have is that even when I manually increment i, the for loop i value masks the new set value. Appreciate any insight.
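On the literal question of skipping positions: in R, a for loop assigns the next element of its sequence to i at the start of every iteration, so changes made to i inside the body are overwritten. A while loop gives manual control over the counter. A minimal sketch, with a made-up vector and an arbitrary skip size chosen only for illustration:

```r
x <- c(1, 2, -1, 3, 4, 5, -2, 6)  # made-up data
visited <- integer(0)

i <- 1L
while (i <= length(x)) {
  visited <- c(visited, i)  # record which positions were actually inspected
  if (x[i] > 0) {
    i <- i + 1L             # move to the next position
  } else {
    i <- i + 3L             # jump ahead; the skip size is arbitrary here
  }
}
visited  # 1 2 3 6 7
```

For a vector of ten million elements, a vectorised approach will usually be much faster than any explicit loop.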
I dunno if this approach is as efficient as Curt F's, but how about
runs <- rle(x>0)
And then working with the regions defined by runs$lengths>10 & runs$values ==TRUE ?
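The rle idea can be carried through to start and end positions roughly as follows. This is a sketch on made-up data; it assumes "at least 10" means a run length of 10 or more, so it uses >= 10 rather than > 10, and the names starts, ends and keep are introduced here.

```r
# Made-up data: positive runs of length 12, 4 and 10
x <- c(-1, rep(2, 12), -3, rep(1, 4), -2, rep(5, 10))

runs <- rle(x > 0)

# End and start position of every run (positive or not)
ends <- cumsum(runs$lengths)
starts <- ends - runs$lengths + 1L

# Keep only positive runs with at least 10 members
keep <- runs$values & runs$lengths >= 10
data.frame(start = starts[keep], end = ends[keep])
#   start end
# 1     2  13
# 2    20  29
```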
Here is a solution that finds stretches of ten positive numbers in a vector of length ten million. It does not use the loop approach suggested in the OP.
The idea here is to take the cumulative sum of the logical expression vec>0. The difference between the cumulative sums at positions n and n-10 will be 10 only if all values of the vector at positions n-9 through n are positive.
filter is an easy and relatively fast way to calculate these differences.
#generate random data
vec <- runif(1e7,-1,1)
#cumulative sum
csvec <- cumsum(vec>0)
#construct a filter that will find the difference between the nth value with the n-10th value of the cumulative sign vector
f11 <- c(1,rep(0,9),-1)
#apply the filter
fv <- filter(csvec, f11, sides = 1)
#find where the difference as computed by the filter is 10
inds <- which(fv == 10)
#check a few results
> vec[(inds[1]-9):(inds[1])]
[1] 0.98457526 0.03659257 0.77507743 0.69223183 0.70776891 0.34305865 0.90249491 0.93019927 0.18686722 0.69973176
> vec[(inds[2]-9):(inds[2])]
[1] 0.0623790 0.8489058 0.3783840 0.8781701 0.6193165 0.6202030 0.3160442 0.3859175 0.8416434 0.8994019
> vec[(inds[200]-9):(inds[200])]
[1] 0.0605163 0.7921233 0.3879834 0.6393018 0.2327136 0.3622615 0.1981222 0.8410318 0.3582605 0.6530633
#check all the results
> prod(sapply(1:length(inds),function(x){prod(sign(vec[(inds[x]-9):(inds[x])]))}))
[1] 1
I played around with system.time() to see how long the various steps took. On my not-very-powerful laptop the longest step was filter(), which took just over half a second for a vector of length ten million.
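The same technique on a tiny made-up vector makes the meaning of inds easier to check by hand. The sketch calls stats::filter explicitly, on the assumption that another attached package (dplyr, for example) might otherwise mask filter.

```r
# Made-up data: one positive run of length 12 (positions 2 to 13),
# then a shorter positive run of length 5
vec <- c(-1, rep(1, 12), -1, rep(1, 5))

cs <- cumsum(vec > 0)

# cs[n] - cs[n - 10]: number of positive values in positions (n-9):n
fv <- stats::filter(cs, c(1, rep(0, 9), -1), sides = 1)

# Positions where a window of ten consecutive positives ends
inds <- which(as.numeric(fv) == 10)
inds      # 11 12 13
inds - 9  # window starts: 2 3 4
```

A run longer than ten therefore yields several overlapping windows, one per possible end position.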
Vectorised solution using only basic commands:
x <- runif(1e7,-1,1) # generate random vector
y <- which(x<=0) # find boundaries i.e. negatives and zeros
dif <- y[2:length(y)] - y[1:(length(y)-1)] # find distance in boundaries
drange <- which(dif > 10) # find distances more than 10
starts <- y[drange]+1 # starting positions of sequence
ends <- y[drange+1]-1 # last positions of sequence
The first range you want is from x[starts[1]] to x[ends[1]], etc.
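As written, this approach only finds runs enclosed by two non-positive values, so a run at the very start or end of the vector is missed. One possible adjustment, sketched on made-up data, is to add sentinel boundaries at positions 0 and length(x) + 1; the names gap and long are introduced here.

```r
# Made-up data: positive runs at the start (length 10), middle (3), end (11)
x <- c(rep(1, 10), -1, rep(2, 3), 0, rep(3, 11))

# Boundaries: non-positive positions plus a sentinel before and after the vector
y <- c(0L, which(x <= 0), length(x) + 1L)

gap <- diff(y)           # distance between neighbouring boundaries
long <- which(gap > 10)  # a gap of 11 or more leaves at least 10 positives

starts <- y[long] + 1L
ends <- y[long + 1L] - 1L
starts  # 1 16
ends    # 10 26
```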
