Find the sum of specific elements from an interval (simulation required) - R

I would like to do some simulation with a for loop/while loop/ifelse (or any other method) to get the sum of certain elements from a specific interval. Thank you in advance if you can help me! I've been struggling a lot with this question!
There must be a difference of more than 1 between the elements of the second set of five numbers and the elements of the first set of five numbers, then likewise a difference of more than 1 between the elements of the third set of five numbers and the elements of the second set, and so on for each following set of five numbers.
Code to get the interval:
set.seed(50)
a <- sort(runif(10, 0, 1))
b <- sort(runif(30, 1, 4))
total <- c(a, b)
For example, from the interval in the picture, total[1], total[2], total[3], total[4] and total[5] are my first five numbers; my next five numbers must then each have a difference of more than one compared with the first five. Hence, the next five numbers must be total[11], total[12], total[13], total[14], total[15]. Then the 11th number must be total[27], because total[27] is the first element with a difference of more than one compared with total[11].
Is there any way to get the sum of the elements total[1], total[2], total[3], total[4], total[5], total[11], total[12], ..., total[27], ... without counting manually?

Here is a solution with a for() loop. First we create a data frame with the needed number of rows and columns. Then, inside the for loop, we take a set of five numbers and compare it to the last accepted set. After the for loop we keep only the rows of the data frame that are of interest, i.e. the sets that differ from the previously kept set by one or more.
n_rows <- length(total) - 4
df <- data.frame(ind = rep(NA, n_rows), keep = rep(FALSE, n_rows))
df$ind[1] <- 1; df$keep[1] <- TRUE   # the first set of five is always kept
last_ind <- 1
for (i in 2:n_rows) {
  set_i <- total[i:(i + 4)]                    # candidate set of five
  last_set <- total[last_ind:(last_ind + 4)]   # most recently kept set
  df$ind[i] <- i
  df$keep[i] <- all(set_i - last_set >= 1)     # keep if every element differs by at least 1
  last_ind <- df$ind[max(which(df$keep))]      # starting index of the last kept set
}
df <- df[df$keep, ]
df
   ind keep
1    1 TRUE
11  11 TRUE
27  27 TRUE
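This returns the starting indices of the kept sets; the question also asks for the sum of their elements. A minimal follow-up sketch, assuming the df produced above:
idx <- unlist(lapply(df$ind, function(i) i:(i + 4)))   # expand each start into its five indices
sum(total[idx])   # sum of total[1:5], total[11:15], total[27:31]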

Related

R: How to sample values under conditions from previous samples

I have tried to wrap my head around this for a few hours now and my head just blanked eventually...
My end goal is a data frame with 26 rows and 4 columns, containing the values 1-6, distributed at semi-random, meeting certain conditions.
Conditions are:
Each number can only appear once within a row.
Neighbouring rows of the same column can never contain the same value.
For this, columns 1 and 2 must be seen as one column containing 2 values, of which neither can repeat in a neighbouring row. So if I have "1 & 2" in one row of columns 1 & 2, I can only have a combination of values from 3-6 in the next and previous rows.
Values should be about equally distributed within any subsection of the table.
My original thought was that I would sample a first row, then create a for loop to add the other rows one by one, changing the probability for a number to be drawn based on the previous samples. Before adding each sampled row I could check that the second condition was met and, if not, resample.
I eventually realised that this is so nested that I just cannot wrap my head around it... I assume that I need a while loop, which I have no experience with: while the first two conditions are not met, resample with probabilities based on previous appearances. The problem, however, is that each column gets its own probability for each value based on its previous appearances, so I cannot just sample a whole row.
However, if I sample each column individually based on the previous appearances, I will likely get the same values within a row...
So, these are the conditions I would want to be FALSE (I tried it with a function that would just repeat if the statement is TRUE):
temp[1] == table[i-1, 1] || temp[1] == table[i-1, 2] || temp[2] == table[i-1, 1] || temp[2] == table[i-1, 2]
temp[3] == table[i-1, 3]
temp[4] == table[i-1, 4]
Here is how I could calculate the probabilities for sampling (I realise there is a problem if a value has not appeared yet, as this would mean division by zero):
probAB <- rep(1, 6) / table(table[, c(1, 2)])
probC  <- rep(1, 6) / table(table[, 3])
probD  <- rep(1, 6) / table(table[, 4])
If you want to know what it is supposed to be: it is a non-repeating chores rota between 6 people, where 2 people take up 1 chore together, while the other 2 chores are each done by only one person. I am open to alternative suggestions to achieve this^^
We can use rejection sampling: generate a sample for each row, and if it meets the conditions, accept it and go to the next row; otherwise, repeat.
nr <- 26
nc <- 4
k <- 6
set.seed(123)
# TRUE if row x is compatible with the previous row y: no value repeats in the
# same column, and columns 1 and 2 share no values with the previous row's pair
is_ok <- function(x, y) all(x != y) && x[1] != y[2] && x[2] != y[1]
tab <- matrix(NA, nr, nc)
tab[1, ] <- sample(k, nc)   # first row: nc distinct values from 1:k
for (i in 2:nr) repeat if (is_ok(tab[i, ] <- sample(k, nc), tab[i-1, ])) break
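A quick sanity check of the result (a sketch only, assuming the tab, nr and k objects created above):
stopifnot(all(apply(tab, 1, anyDuplicated) == 0))   # no value repeats within a row
stopifnot(all(tab[-1, ] != tab[-nr, ]))             # neighbouring rows never match within a column
stopifnot(all(tab[-1, 1] != tab[-nr, 2]))           # columns 1 and 2 treated as one combined column
stopifnot(all(tab[-1, 2] != tab[-nr, 1]))
Note that rejection sampling enforces the repetition conditions exactly; the "about equally distributed" condition is only met on average, not enforced.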

R programming - adding column in dataset error

cv.uk.df$new.d[2:nrow(cv.uk.df)] <- tail(cv.uk.df$deaths, -1) - head(cv.uk.df$deaths, -1) # this line of code works
I wanted to know why we use -1 in tail() and -1 in head() to create this new column.
I tried to understand by removing the -1, and R (the code is in RStudio) throws me this error.
Could anyone shed some light on this? I can't explain how much I would appreciate it.
Look at what is being done. On the left-hand side of the assignment operator, we have:
cv.uk.df$new.d[2:nrow(cv.uk.df)] <-
Let's pick this apart.
cv.uk.df # This is the data.frame
$new.d # a new column to assign or a column to reassign
[2:nrow(cv.uk.df)] # the rows which we are going to assign
Specifically, this line of code will assign a new value to all rows of this column except the first. Why would we want to do that? We don't have your data, but from your example, it looks like you want to calculate the change from one line to the next. That calculation is invalid for the first row (there is no previous row).
Now let's look at the right-hand side.
<- tail(cv.uk.df$deaths, -1) - head(cv.uk.df$deaths, -1)
The cv.uk.df$deaths column has the same number of rows as the data.frame. R gets grouchy when the numbers of elements don't match. For data.frames, the right-hand side needs to have the same number of elements as the target, or a number that can be recycled a whole number of times. For example, if you are assigning to 10 rows, you need to supply 10 values, or 5 values that R will recycle twice.
If your data.frame has 100 rows, only 99 are being replaced in this operation. You cannot feed 100 values into an operation that expects 99. We need to trim the data. Let's look at what is happening. The tail() function has the usage tail(x, n), where it returns the last n values of x. If n is a negative integer, tail() returns all values but the first |n|. The head() function works similarly.
tail(cv.uk.df$deaths, -1) # This returns all values but the first
head(cv.uk.df$deaths, -1) # This returns all values but the last
This makes sense for your calculation. You cannot subtract the number of deaths in the row before the first row from the number in the first row, nor can you subtract the number of deaths in the last row from the number in the row after the last row. There are more intuitive ways to do this thing using functions from other packages, but this gets the job done.
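A tiny worked example makes the trimming concrete (made-up numbers, not the asker's data):
deaths <- c(10, 15, 23, 30)
tail(deaths, -1)                     # 15 23 30 (drop the first)
head(deaths, -1)                     # 10 15 23 (drop the last)
tail(deaths, -1) - head(deaths, -1)  # 5 8 7, the change from one row to the next
Base R's diff(deaths) returns the same differences.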

Count duration of value in vector in R

I am trying to count the length of occurrences of a value in a vector such as
q <- c(1,1,1,1,1,1,4,4,4,4,4,4,4,4,4,4,4,4,6,6,6,6,6,6,6,6,6,6,1,1,4,4,4)
Actual vectors are longer than this, and are time based. What I would like is an output for 4 that tells me it occurred for 12 time steps (before the vector changes to 6) and then for 3 time steps (not that it occurred 15 times in total).
Currently my ideas for doing this are pretty inefficient (a loop that looks at the vector element by element and stops when it no longer equals the value I specified). Can anyone recommend a more efficient method?
x <- with(rle(q), data.frame(values, lengths)) will pull the information that you want (courtesy of d.b. in the comments).
From the R Documentation: rle is used to "Compute the lengths and values of runs of equal values in a vector – or the reverse operation."
y <- x[x$values == 4, ] will subset the data frame to include only the value of interest (4). You can then see clearly that 4 ran for 12 times and then later for 3.
Modifying the code will let you check whatever value you want.
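Applied to the example vector, the run-length encoding shows the two runs of 4 directly:
x <- with(rle(q), data.frame(values, lengths))
x[x$values == 4, ]
#   values lengths
# 2      4      12
# 5      4       3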

Check in which interval each value of a vector lies in R

Suppose I have a vector of size 915, named base:
[1] 1.467352 4.651796 4.949438 5.625817 5.691591 5.839439 5.927564 7.152487 8.195661 8.640770 ... 591.3779 591.9426 592.0126 592.3861 593.2927 593.3991 593.6104 594.1526 594.5325 594.7093
Also I have constructed another vector:
intervals <- c(0, seq(from = 1, by = 6, length.out = 100)), which we can interpret as the interval breakpoints.
Then I want to test in which interval (of the vector intervals) each value of base lies. For example, the first element of base lies in the second interval (1.467352 doesn't lie in (0,1], but does lie in (1,7]). I want to carry out the same procedure for every value in base.
From this I want to create a third vector whose i-th element is the number of the interval in which the i-th element of base lies.
BUT! The maximum capacity of each interval is, for example, 5 (one interval can hold only five elements). That means that even if seven elements of base lie in the second interval, that interval may take only five of them.
third_vector = 2,2,2,2,2,3,3....
As we see, only five elements go into the second interval; the 6th and 7th elements, for lack of space, must go into the third interval.
And the question is: how can I implement this efficiently in R?
One option is to bin the data into quantiles, where the number of quantiles is set based on the maximum number of values allowed in a given interval. Below is an example. Let me know if this is what you had in mind:
# Fake data
set.seed(1)
dat = data.frame(x=rnorm(83, 10, 5))
# Cut into intervals containing no more than n values
n = 5
dat$x.bin = cut(dat$x, quantile(dat$x, seq(0, 1, length = ceiling(nrow(dat)/n) + 1)),
                include.lowest = TRUE)
# Check
table(dat$x.bin)
[-1.07,3.62]  (3.62,5.87]   (5.87,6.7]   (6.7,7.29]   (7.29,8.2]   (8.2,9.32]  (9.32,9.72]
           5            5            5            5            5            4            5
 (9.72,9.97]  (9.97,10.8]  (10.8,11.7]  (11.7,12.1]  (12.1,12.9]  (12.9,13.5]    (13.5,14]
           5            5            5            5            4            5            5
   (14,15.5]  (15.5,17.4]    (17.4,22]
           5            5            5
To implement #LorenzoBusetto's suggestion, you could do the following. This method ensures that every interval except the last contains n values:
dat = dat[order(dat$x),]
dat$x.bin = 0:(nrow(dat)-1) %/% n
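For the exact setup in the question (the breakpoints in intervals plus a cap of five values per interval), one possible sketch, assuming base is sorted: start from the natural bin given by findInterval() and push any overflow forward until a bin has space.
n <- 5
bin <- findInterval(base, intervals, left.open = TRUE)   # natural interval numbers: (0,1] is 1, (1,7] is 2, ...
for (i in 2:length(bin)) {
  # if the i-th value's bin already holds n values, move it to the next bin
  while (sum(bin[seq_len(i - 1)] == bin[i]) >= n) bin[i] <- bin[i] + 1
}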

Optimizing Markov chain transition matrix calculations?

As an intermediate R user, I know that for loops can very often be optimized by using functions like apply or otherwise. However, I am not aware of functions that can optimize my current code to generate a Markov chain matrix, which is running quite slowly. Have I maxed out on speed, or are there things I am overlooking? I am trying to find the transition matrix for a Markov chain by counting the number of occurrences in 24-hour time periods before given alerts. The vector ids contains all possible ids (about 1700).
The original matrix looks like this, as an example:
> matrix
id       time
 1 1376084071
 1 1376084937
 1 1376023439
 2 1376084320
 2 1372983476
 3 1374789234
 3 1370234809
And here is my code to try to handle this:
matrixtimesort <- matrix[order(-matrix$time), ]
frequency <- 86400   # number of seconds in 1 day
# Initialize matrix that will contain probabilities
transprobs <- matrix(data = 0, nrow = length(ids), ncol = length(ids))
# Loop through each type of event
for (i in 1:length(ids)) {
  localmatrix <- matrix[matrix$id == ids[i], ]
  # Loop through each row of the event
  for (j in 1:nrow(localmatrix)) {
    localtime <- localmatrix[j, ]$time
    # Find top and bottom row number defining the 1-day window
    indices <- which(matrixtimesort$time < localtime & matrixtimesort$time >= (localtime - frequency))
    # Find IDs that occur within the 1-day window
    positiveids <- unique(matrixtimesort[c(min(indices):max(indices)), ]$id)
    # Add one to each cell in the matrix that corresponds to the occurrence of an event
    for (l in 1:length(positiveids)) {
      k <- which(ids == positiveids[l])
      transprobs[i, k] <- transprobs[i, k] + 1
    }
  }
  # Divide each row by total number of occurrences to determine probabilities
  transprobs[i, ] <- transprobs[i, ]/nrow(localmatrix)
}
# Normalize rows so that row sums are equal to 1
normalized <- transprobs/rowSums(transprobs)
Can anyone make any suggestions to optimize this for speed?
Using nested loops seems a bad idea; your code can be vectorized to speed it up.
For example, why find the top and bottom row numbers at all? You can simply compare the time values with "time_0 + frequency": it is a single vectorized operation.
HTH.
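To illustrate, a sketch of the vectorized window test (reusing localtime, matrixtimesort and frequency from the question's code; it replaces the min/max index bookkeeping inside the inner loop):
in_window <- matrixtimesort$time < localtime &
  matrixtimesort$time >= (localtime - frequency)   # one vectorized comparison per event
positiveids <- unique(matrixtimesort$id[in_window])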
