Check which interval each value of a vector lies in - R

Suppose I have a vector of size 915. The name of the vector is base:
[1] 1.467352 4.651796 4.949438 5.625817 5.691591 5.839439 5.927564 7.152487 8.195661 8.640770....591.3779 591.9426 592.0126 592.3861 593.2927 593.3991 593.6104 594.1526 594.5325 594.7093
I have also constructed another vector:
intervals <- c(0,seq(from = 1, by = 6,length.out = 100))
We can interpret this vector as a set of interval boundaries.
I want to find out which interval (as defined by the vector intervals) each value of base lies in. For example, the first element of base lies in the second interval (1.467352 does not lie in (0,1], but it does lie in (1,7]). I want to repeat this check for every value in base.
From this I want to create a third vector that holds, for the i-th element of base, the number of the interval it lies in.
BUT! Each interval has a maximum size of, for example, 5 (one interval can hold at most five elements). This means that even if seven elements of base lie in the second interval, the second interval may include only five of them.
third_vector = 2,2,2,2,2,3,3....
As we can see, only five elements are in the second interval. For lack of space, the 6th and 7th elements must go into the third interval.
The question is: how can I implement this efficiently in R?

One option is to bin the data into quantiles, where the number of quantiles is set based on the maximum number of values allowed in a given interval. Below is an example. Let me know if this is what you had in mind:
# Fake data
set.seed(1)
dat = data.frame(x=rnorm(83, 10, 5))
# Cut into intervals containing no more than n values
n = 5
dat$x.bin = cut(dat$x, quantile(dat$x, seq(0,1,length=ceiling(nrow(dat)/n)+1)),
include.lowest=TRUE)
# Check
table(dat$x.bin)
[-1.07,3.62] (3.62,5.87] (5.87,6.7] (6.7,7.29] (7.29,8.2] (8.2,9.32] (9.32,9.72]
5 5 5 5 5 4 5
(9.72,9.97] (9.97,10.8] (10.8,11.7] (11.7,12.1] (12.1,12.9] (12.9,13.5] (13.5,14]
5 5 5 5 4 5 5
(14,15.5] (15.5,17.4] (17.4,22]
5 5 5
To implement @LorenzoBusetto's suggestion, you could do the following. This method ensures that every interval except the last contains n values:
dat = dat[order(dat$x),]
dat$x.bin = 0:(nrow(dat)-1) %/% n
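If the original breakpoints in intervals have to be kept, a loop that pushes overflow into the next interval is one way to get the third_vector described in the question. This is only a minimal sketch: assign_capped is a hypothetical helper name, it assumes base is sorted in ascending order (as it is in the question), and it assumes that values pushed into an interval by overflow count toward that interval's cap as well. It does not check whether overflow runs past the last interval.

```r
# Hypothetical helper: interval number per value, at most `cap` values per interval.
# Assumes x is sorted ascending.
assign_capped <- function(x, breaks, cap) {
  # k means breaks[k] < x <= breaks[k + 1]
  idx <- findInterval(x, breaks, left.open = TRUE)
  out <- integer(length(x))
  cur <- 0L   # interval currently being filled
  used <- 0L  # number of values already placed in it
  for (i in seq_along(x)) {
    target <- max(idx[i], cur)           # never move back to an earlier interval
    if (target == cur && used >= cap) {  # interval is full: overflow into the next one
      target <- cur + 1L
    }
    if (target != cur) {
      cur <- target
      used <- 0L
    }
    used <- used + 1L
    out[i] <- cur
  }
  out
}

intervals <- c(0, seq(from = 1, by = 6, length.out = 100))
# First ten values of base, copied from the question
base_head <- c(1.467352, 4.651796, 4.949438, 5.625817, 5.691591,
               5.839439, 5.927564, 7.152487, 8.195661, 8.640770)
third_vector <- assign_capped(base_head, intervals, cap = 5)
third_vector
# [1] 2 2 2 2 2 3 3 3 3 3
```

The first seven values all fall in (1,7], so the 6th and 7th are moved to interval 3, where they share the cap with the three values that belong there anyway.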

Related

Find the sum of specific elements from an interval (simulation required)

I would like to run a simulation with a for loop, while loop, ifelse (or any other method) to get the sum of the elements from specific positions in an interval. Thank you in advance if you can help me! I've been struggling a lot with this question!
The elements of the second set of five numbers must differ by more than 1 from the elements of the first set of five numbers, the elements of the third set must differ by more than 1 from the elements of the second set, and so on for each following set of five numbers.
Code to get the interval:
set.seed(50)
a=sort(runif(10,0,1))
b=sort(runif(30,1,4))
total=c(a,b)
For example, from the interval in the picture, total[1], total[2], total[3], total[4] and total[5] are my first five numbers; my next five numbers must then differ by more than one from the first five. Hence, the next five numbers must be total[11], total[12], total[13], total[14], total[15]. The 11th number must then be total[27], because total[27] is the first element that differs by more than one from total[11].
Is there a way to get the sum of total[1], total[2], total[3], total[4], total[5], total[11], total[12],...,total[27],.... without counting manually?
Here is a solution with a for() loop. First we create a data frame with the needed number of rows and columns. Then, inside the for loop, we take each set of five numbers and compare it to the last set that was kept. After the for loop we keep only the rows of the data frame that are of interest, i.e. those whose set differs from the previously kept set by one or more in every element.
n_rows <- length(total)-4
df <- data.frame(ind= rep(NA, n_rows), keep= rep(FALSE, n_rows))
df$ind[1] <- 1; df$keep[1] <- TRUE
last_ind <- 1
for(i in 2:n_rows){
  set_i <- total[i:(i+4)]
  last_set <- total[last_ind:(last_ind+4)]
  df$ind[i] <- i
  df$keep[i] <- all(set_i - last_set >= 1)
  last_ind <- df$ind[max(which(df$keep))]
}
df <- df[df$keep, ]
df
ind keep
1 1 TRUE
11 11 TRUE
27 27 TRUE
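The question also asks for the sum of the selected elements. Assuming the starting positions in df$ind are the ones you want, a small helper can expand each start into its five positions and add up those elements. This is a sketch only: window_sum is a hypothetical name, and the 1:40 input is made-up data chosen so the result is easy to check by hand.

```r
# Hypothetical helper: sum the elements of x in windows of `width`
# beginning at each position in `starts`.
window_sum <- function(x, starts, width = 5) {
  # Expand each start s into the positions s, s + 1, ..., s + width - 1
  idx <- unlist(lapply(starts, function(s) s:(s + width - 1)))
  sum(x[idx])
}

window_sum(1:40, starts = c(1, 11, 27))
# [1] 225   (sum(1:5) + sum(11:15) + sum(27:31))
```

With the objects from the answer above, the call would be window_sum(total, df$ind).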

How to add grouping variable to data set that will classify both an observation and its N neighbors based on some condition

I am having some trouble coming up with a solution that properly handles classifying a variable number of neighbors for any given observation in a data frame based on some condition. I would like to be able to add a simple, binary indicator variable to a data frame that will equal 1 if the condition is satisfied, and 0 if it is not.
Where I am getting stuck is I am unsure how to iteratively check the condition against neighboring observations only, in either direction (i.e., to check if out of 4 neighboring observations in a given column in my data frame, that at least 3 out of 4 of them contain the same value). I have tried first creating another indicator variable indicating if the condition is satisfied or not (1 or 0 = yes or no). Then, I tried setting up a series of ifelse() statements within a loop to try to assign the proper categorization of the observation where the initial condition is satisfied, +/- 2 observations in either direction. However, when I inspect the dataframe after running the loop, only the observation itself (not its neighbors) where the condition is satisfied is receiving the value, rather than all neighboring observations also receiving the value. Here is my code:
#sample data
sample_dat <- data.frame(initial_ind = c(0,1,0,1,0,0,1,1,0,1,0,0))
sample_dat$violate <- NULL
for(i in 1:nrow(sample_dat)){
  sample_dat$violate[i] <- ifelse(sample_dat$initial_ind[i]==1 &
                                    ((sample_dat$initial_ind[i-2]==1 |
                                        sample_dat$initial_ind[i-1]==1) &
                                       (sample_dat$initial_ind[i+2]==1 |
                                          sample_dat$initial_ind[i+1]==1)),
                                  "trending",
                                  "non-trending"
  )
}
This loop correctly identifies one of the four points that needs to be labelled "trending", but it does not also assign "trending" to the correct neighbors. In other words, I expect the output to be "trending" for observations 7-10, since 3 of the 4 observations in that group have a value of 1 in the initial indicator column. I feel like there might be an easier way to accomplish this, but what I need to ensure is that my code is robust enough to identify and assign observations to a group regardless of whether I want 3/4 to indicate a group, or 5/6, 2/5, etc.
Thank you for any and all advice.
You can use the rollapply function from the zoo package to apply a function over rolling windows of a set width in your data. The question then becomes how to write a function that satisfies your needs. I'm not sure if I've understood correctly, but it seems you want a function that checks whether the condition is true for at least 3 of the 5 values made up of the observation plus its four closest neighbors. In this case, just adding up the 1s and checking whether the sum is above 2 works.
library(zoo)
sample_dat <- data.frame(initial_ind = c(0,1,0,1,0,0,1,1,0,1,0,0))
trend_test = function(x){
  ifelse(sum(x) > 2, "trending", "non-trending")
}
sample_dat$violate_new = rollapply(sample_dat$initial_ind, FUN = trend_test, width = 5, fill = NA)
Edit: If you want a function that checks if the observation and the next 3 observations have at least 3 1s, you can do something very similar, just by changing the align argument on rollapply:
trend_test_2 = function(x){
  ifelse(sum(x) > 2, "trending", "non-trending")
}
sample_dat$violate_new = rollapply(sample_dat$initial_ind, FUN = trend_test_2, width = 4,
fill = NA, align = "left")
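If every member of a qualifying window should receive the label (observations 7-10 in the question), one base-R option is to loop over the window start positions and flag the whole window whenever it contains enough 1s. This is a rough sketch rather than a tested replacement for rollapply; flag_windows, width and min_hits are names made up for illustration.

```r
# Hypothetical helper: TRUE for every position that belongs to at least one
# window of `width` consecutive values containing `min_hits` or more 1s.
flag_windows <- function(ind, width = 4, min_hits = 3) {
  n <- length(ind)
  flagged <- rep(FALSE, n)
  if (n < width) return(flagged)
  for (start in seq_len(n - width + 1)) {
    window <- start:(start + width - 1)
    if (sum(ind[window]) >= min_hits) {
      flagged[window] <- TRUE  # label the observation and its neighbors together
    }
  }
  flagged
}

initial_ind <- c(0,1,0,1,0,0,1,1,0,1,0,0)
violate <- ifelse(flag_windows(initial_ind), "trending", "non-trending")
which(violate == "trending")
# [1]  7  8  9 10
```

Changing width and min_hits covers the other cases mentioned in the question, e.g. width = 6, min_hits = 5 for 5/6.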

Randomly sampling from each element of a vector

Let's say I have a numeric vector X
X <- c(1,42,1,23,5,7)
I would like to create another vector Y with the same number of elements, where each element is a randomly generated whole number between 1 (the lower bound) and the corresponding element of X (the upper bound). For example, Y[2] would be randomly selected from between 1 and 42, and Y[4] would be randomly selected from between 1 and 23.
I have tried to use the apply function to do this
Y<-apply(X, 1, sample)
but I am having no luck and generating the error message
Error in apply(X, 1, sample) : dim(X) must have a positive length
Is there a better way to do this?
You can't use apply on a plain vector; it only works on objects with dimensions (e.g., matrices). You have to use sapply instead. Furthermore, you need the argument size = 1 since you want to sample one value for each entry of X.
sapply(X, sample, size = 1)
[1] 1 7 1 16 3 6
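A variant of the same idea, in case you want the result type checked: vapply with sample.int draws one integer from 1 to u for each element u of X and guarantees an integer vector back. This is just a sketch; it assumes every element of X is a whole number of at least 1, and the seed is arbitrary.

```r
X <- c(1,42,1,23,5,7)

set.seed(42)  # arbitrary seed, only to make the draw reproducible
# One draw from 1..u per element u of X; integer(1) enforces one integer per element
Y <- vapply(X, function(u) sample.int(u, size = 1), integer(1))

# Every draw should respect its own upper bound
all(Y >= 1 & Y <= X)
# [1] TRUE
```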

R: Find consecutive values beneath threshold

I need to find consecutive values in a data.frame of wind speed measurements that are smaller than a certain threshold. I'm looking for 2 consecutive observations beneath the threshold. I want to return the location of the first observation of the series that meets these criteria.
The following should work for what you are asking for:
# create random vector, for example
set.seed(1234)
temp <- rnorm(50)
# get position of all observations that fulfill criterion, here obs is > 0.2
thresholdObs <- which(temp > .2)
Here, which returns the position of all observations fulfilling some criterion. At this point, it is prudent to test whether there are any observations that satisfy your criterion. This can be achieved with the intersect function or by subsetting together with the %in% operator:
length(intersect(thresholdObs, thresholdObs + 1))
or
length(thresholdObs[thresholdObs %in% (thresholdObs + 1L)])
If length 0 is returned, then no such observation is in your data. If the length is 1 or greater, then you can use
# get the answer
min(thresholdObs[thresholdObs %in% (thresholdObs + 1L)] - 1)
or
min(intersect(thresholdObs, thresholdObs + 1))-1
As @Frank notes below, if min is fed a vector of length 0, it returns Inf, which means infinity in R. To explain the approach: I increment the positions (thresholdObs + 1) and then take the intersection of the two sets. The only positions that are returned are those where the previous position also passes the threshold test. I then subtract 1 from these positions and take the minimum to get the desired result. Because which returns an ordered result, the following will also work:
intersect(thresholdObs, thresholdObs + 1)[1] - 1
where [1] extracts the first element in the intersection.
Also note that
intersect(thresholdObs, thresholdObs + 1) - 1
or
thresholdObs[thresholdObs %in% (thresholdObs + 1L)]
will return all positions where there are at least two consecutive elements that pass the threshold (the second form gives the second position of each pair, since 1 is not subtracted). However, for runs of more than 2 consecutive values passing the threshold, multiple positions will be returned.
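If you would rather get exactly one position per run, rle is an alternative to the intersect approach. The sketch below is not part of the answer above: first_run_start, threshold and run_len are hypothetical names, it tests for values beneath the threshold as the question asks, and it returns NA when no run is long enough.

```r
# Hypothetical helper: position of the first run of at least `run_len`
# consecutive values strictly beneath `threshold`, or NA if there is none.
first_run_start <- function(x, threshold, run_len = 2) {
  r <- rle(x < threshold)
  ends <- cumsum(r$lengths)
  starts <- ends - r$lengths + 1L
  # Keep runs of TRUE that are long enough; NA comparisons never qualify
  ok <- !is.na(r$values) & r$values & r$lengths >= run_len
  if (!any(ok)) return(NA_integer_)
  starts[ok][1]
}

wind <- c(5, 1, 4, 2, 1, 6, 0, 0, 0)  # made-up wind speeds
first_run_start(wind, threshold = 3)
# [1] 4
```

Position 2 is beneath the threshold but stands alone, so the first qualifying run starts at position 4.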

Finding peaks in vector

I'm trying to find "peaks" in a vector, i.e. elements for which the nearest neighboring elements on both sides that do not have the same value have lower values.
So, e.g. in the vector
c(0,1,1,2,3,3,3,2,3,4,5,6,5,7)
there are peaks at positions 5,6,7,12 and 14
Finding local maxima and minima comes close, but doesn't quite fit.
This should work. The call to diff(sign(diff(x))) == -2 finds peaks by, in essence, testing for a negative second derivative at/around each of the unique values picked out by rle.
x <- c(0,1,1,2,3,3,3,2,3,4,5,6,5,7)
r <- rle(x)
which(rep(x = diff(sign(diff(c(-Inf, r$values, -Inf)))) == -2,
times = r$lengths))
# [1] 5 6 7 12 14
(I padded your vector with -Infs so that both elements 1 and 14 have the possibility of being matched, should the nearest different-valued element have a lower value. You can obviously adjust the end-element matching rule by instead setting one or both of these to Inf.)
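To make the end-element rule easy to switch, the same code can be wrapped in a function with the padding values as arguments. This is only a sketch of that idea: find_peaks, pad_left and pad_right are made-up names, and it assumes x contains only finite values with no NAs.

```r
# Hypothetical wrapper around the rle/diff approach above.
# pad_left / pad_right = -Inf lets an end element count as a peak; Inf prevents it.
find_peaks <- function(x, pad_left = -Inf, pad_right = -Inf) {
  r <- rle(x)
  # A run is a peak when the slope changes from rising (+1) to falling (-1)
  is_peak <- diff(sign(diff(c(pad_left, r$values, pad_right)))) == -2
  which(rep(is_peak, times = r$lengths))
}

x <- c(0,1,1,2,3,3,3,2,3,4,5,6,5,7)
find_peaks(x)
# [1]  5  6  7 12 14
find_peaks(x, pad_right = Inf)
# [1]  5  6  7 12
```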
