R: How to sample values under conditions from previous samples - r

I have tried to wrap my head around this for a few hours now and my head just blanked eventually...
My end goal is a data frame with 26 rows and 4 columns, containing the values 1-6, distributed at semi-random, meeting certain conditions.
Conditions are:
Each number can only appear once within a row.
Neighbouring rows of the same column can never contain the same value.
For this, column 1 and 2 must be seen as the same column containing 2 values, of which neither can repeat in a neighbouring row. So if I have "1 & 2" in one row of column 1 & 2, I can only have a combination between "3-6" in the next and previous row.
Values should be about equally distributed within any subsection of the table.
My original thought was that I would sample a first row, then create a for loop to add the other rows one by one, changing the probability for a number to be drawn based on the previous samples. Before adding each sampled row I could check that the second condition is met and, if not, resample.
I realised eventually that this is so nested that I can just not wrap my head around it... I assume that I need a while loop, which I have no experience with. As in while the first 2 conditions are not met, resample at a probability based on previous appearances. The problem is however that each column gets its own probability for each value based on the previous appearance, so I cannot just sample a row.
However, if I sample each column individually based on the previous appearances, I will likely get the same value more than once in a row...
So these are the conditions I want to be FALSE (I tried it with a function that would just repeat if the statement is TRUE):
temp[1] == table[i-1,1] || temp[1] == table[i-1,2] || temp[2] == table[i-1,1] || temp[2] == table[i-1,2]
temp[3]== table[i-1,3]
temp[4]== table[i-1,4]
here is how I could calculate the probabilities for sampling (I realise there is a problem if values did not appear yet as this would mean division by 0)
probAB <- rep(1,6) /table(table[,c(1,2)])
probC <- rep(1,6) /table(table[,3])
probD <- rep(1,6) /table(table[,4])
If you want to know what it is supposed to be: it is a non-repeating chores rota between 6 people, where 2 people take up 1 chore while the other 2 chores are each done by only one person. I am open to alternative suggestions to achieve this^^

We can use rejection sampling. Just generate a sample for each row and if it meets the conditions accept it and go to the next row; otherwise, repeat.
nr <- 26
nc <- 4
k <- 6
set.seed(123)
is_ok <- function(x, y) all(x != y) && x[1] != y[2] && x[2] != y[1]
tab <- matrix(NA, nr, nc)
tab[1, ] <- sample(k, nc)
for(i in 2:nr) repeat if (is_ok(tab[i, ] <- sample(k, nc), tab[i-1, ])) break
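The one-line loop above is compact, so here is the same rejection sampler spelled out, followed by a quick look at how often each value appears. This is only a sketch: make_rota is a hypothetical wrapper name, and nothing in it enforces the "about equally distributed" condition; table() just lets you inspect the balance.

```r
# Rejection sampling, written out step by step (make_rota is a hypothetical name).
make_rota <- function(nr = 26, nc = 4, k = 6) {
  # Column-wise check, plus the cross check that treats columns 1 and 2 as one pair
  is_ok <- function(x, y) all(x != y) && x[1] != y[2] && x[2] != y[1]
  tab <- matrix(NA_integer_, nr, nc)
  tab[1, ] <- sample(k, nc)        # sampling without replacement: no repeats within a row
  for (i in 2:nr) {
    repeat {
      cand <- sample(k, nc)        # propose a candidate row
      if (is_ok(cand, tab[i - 1, ])) break   # accept only if it fits the previous row
    }
    tab[i, ] <- cand
  }
  tab
}

set.seed(123)
rota <- make_rota()
table(rota)   # how often each of the values 1-6 was used overall
```

If the counts turn out too uneven for your purposes, one option is to generate several rotas and keep the most balanced one.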

Related

Find the sum of specific elements from an interval (simulation required)

I would like to do some simulation with a for loop/while loop/ifelse (or any other method) to get the total of the elements from a specific interval. Thank you in advance if you can help me! I've been struggling a lot with this question!
There must be a difference of more than 1 between the elements of the second set of five numbers and the elements of the first set of five numbers, then also a difference of more than 1 between the elements of the third set and the elements of the second set, and so on for the following sets of five numbers.
Code to get the interval:
set.seed(50)
a=sort(runif(10,0,1))
b=sort(runif(30,1,4))
total=c(a,b)
For example, from the interval in the picture, total[1], total[2], total[3], total[4] and total[5] are my first five numbers, then my next 5 numbers must have a difference of more than one compared with the first 5 numbers. Hence, the next 5 numbers must be total[11], total[12], total[13], total[14], total[15]. Then the 11th number must be total[27], because total[27] is the first element that has a difference of more than one compared with total[11].
May I know whether there is any way to get the sum of the elements total[1], total[2], total[3], total[4], total[5], total[11], total[12], ..., total[27], ... without counting manually?
Here is a solution with a for() loop. First we create a data frame with the needed number of rows and columns. Then, inside the for loop, we get a set of five numbers and compare it to the last kept set. After the for loop we keep only the rows of the data frame that are of interest, i.e. the sets whose elements each differ from the previous kept set by one or more.
n_rows <- length(total)-4
df <- data.frame(ind= rep(NA, n_rows), keep= rep(FALSE, n_rows))
df$ind[1] <- 1; df$keep[1] <- TRUE
last_ind <- 1
for(i in 2:n_rows){
set_i <- total[i:(i+4)]
last_set <- total[last_ind:(last_ind+4)]
df$ind[i] <- i
df$keep[i] <- all(set_i - last_set >= 1)
last_ind <- df$ind[max(which(df$keep))]
}
df <- df[df$keep, ]
df
ind keep
1 1 TRUE
11 11 TRUE
27 27 TRUE
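To get the sum the question asks for, you can add up the elements of the kept sets. A minimal sketch, assuming the same total vector and the same >= 1 criterion as the loop above; pick_starts is a hypothetical helper name, not something from the original answer.

```r
set.seed(50)
total <- c(sort(runif(10, 0, 1)), sort(runif(30, 1, 4)))

# Greedy scan: accept a window of `size` values only if every element is at
# least `gap` above the matching element of the last accepted window.
pick_starts <- function(v, size = 5, gap = 1) {
  starts <- 1
  for (i in seq_len(length(v) - size + 1)[-1]) {
    last <- starts[length(starts)]
    if (all(v[i:(i + size - 1)] - v[last:(last + size - 1)] >= gap)) {
      starts <- c(starts, i)
    }
  }
  starts
}

starts <- pick_starts(total)
# Expand each start index into its five positions; unique() guards against
# double counting if two accepted windows ever overlap.
picked <- unique(unlist(lapply(starts, function(s) s:(s + 4))))
sum(total[picked])
```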

How to add grouping variable to data set that will classify both an observation and its N neighbors based on some condition

I am having some trouble coming up with a solution that properly handles classifying a variable number of neighbors for any given observation in a data frame based on some condition. I would like to be able to add a simple, binary indicator variable to a data frame that will equal 1 if the condition is satisfied, and 0 if it is not.
Where I am getting stuck is I am unsure how to iteratively check the condition against neighboring observations only, in either direction (i.e., to check if out of 4 neighboring observations in a given column in my data frame, that at least 3 out of 4 of them contain the same value). I have tried first creating another indicator variable indicating if the condition is satisfied or not (1 or 0 = yes or no). Then, I tried setting up a series of ifelse() statements within a loop to try to assign the proper categorization of the observation where the initial condition is satisfied, +/- 2 observations in either direction. However, when I inspect the dataframe after running the loop, only the observation itself (not its neighbors) where the condition is satisfied is receiving the value, rather than all neighboring observations also receiving the value. Here is my code:
#sample data
sample_dat <- data.frame(initial_ind = c(0,1,0,1,0,0,1,1,0,1,0,0))
sample_dat$violate <- NULL
for(i in 1:nrow(sample_dat)){
sample_dat$violate[i] <- ifelse(sample_dat$initial_ind[i]==1 &
((sample_dat$initial_ind[i-2]==1 |
sample_dat$initial_ind[i-1]==1) &
(sample_dat$initial_ind[i+2]==1 |
sample_dat$initial_ind[i+1]==1)),
"trending",
"non-trending"
)
}
This loop correctly identifies one of the four points that needs to be labelled "trending", but it does not also assign "trending" to the correct neighbors. In other words, I expect the output to be "trending" for observations 7-10, since 3/4 observations in that group of 4 have a value of 1 in the initial indicator column. I feel like there might be an easier way to accomplish this, but what I need to ensure is that my code is robust enough to identify and assign observations to a group regardless of whether I want 3/4 to indicate a group, 5/6, 2/5, etc.
Thank you for any and all advice.
You can use the rollapply function from the zoo package to apply a function to set intervals in your data. The question then becomes about creating a function that satisfies your needs. I'm not sure if I've understood correctly, but it seems you want a function that checks if the condition is true for at least 3/5 of the observation plus its four closest neighbors. In this case just adding the 1s up and checking if they're above 2 works.
library(zoo)
sample_dat <- data.frame(initial_ind = c(0,1,0,1,0,0,1,1,0,1,0,0))
trend_test = function(x){
ifelse(sum(x) > 2, "trending", "non-trending")
}
sample_dat$violate_new = rollapply(sample_dat$initial_ind, FUN = trend_test, width = 5, fill = NA)
Edit: If you want a function that checks if the observation and the next 3 observations have at least 3 1s, you can do something very similar, just by changing the align argument on rollapply:
trend_test_2 = function(x){
ifelse(sum(x) > 2, "trending", "non-trending")
}
sample_dat$violate_new = rollapply(sample_dat$initial_ind, FUN = trend_test_2, width = 4,
fill = NA, align = "left")
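If the goal is to label every observation that belongs to a qualifying group (rather than one label per window), a base R loop can mark all members of each window that passes. This is a sketch under my reading of the question; flag_windows, width and min_hits are hypothetical names, and no zoo is needed.

```r
# Mark every position that lies in at least one window of `width` consecutive
# observations containing at least `min_hits` ones.
flag_windows <- function(x, width = 4, min_hits = 3) {
  n <- length(x)
  out <- rep(FALSE, n)
  if (n < width) return(out)
  for (start in seq_len(n - width + 1)) {
    idx <- start:(start + width - 1)
    if (sum(x[idx]) >= min_hits) out[idx] <- TRUE   # flag the whole window
  }
  out
}

sample_dat <- data.frame(initial_ind = c(0,1,0,1,0,0,1,1,0,1,0,0))
sample_dat$violate <- ifelse(flag_windows(sample_dat$initial_ind),
                             "trending", "non-trending")
which(sample_dat$violate == "trending")   # observations 7 to 10
```

Changing width and min_hits covers the 5/6 or 2/5 cases mentioned in the question.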

How to select a specific amount of rows before and after predefined values

I am trying to select relevant rows from a large time-series data set. The tricky bit is, that the needed rows are before and after certain values in a column.
# example data
x <- rnorm(100)
y <- rep(0,100)
y[c(13,44,80)] <- 1
y[c(20,34,92)] <- 2
df <- data.frame(x,y)
In this case the critical values are 1 and 2 in the df$y column. If, e.g., I want to select 2 rows before and 4 after df$y==1 I can do:
ones<-which(df$y==1)
selection <- NULL
for (i in ones) {
jj <- (i-2):(i+4)
selection <- c(selection,jj)
}
df$selection <- 0
df$selection[selection] <- 1
This, arguably, scales poorly for more values. For df$y==2 I would have to repeat with:
twos<-which(df$y==2)
selection <- NULL
for (i in twos) {
jj <- (i-2):(i+4)
selection <- c(selection,jj)
}
df$selection[selection] <- 2
Ideal scenario would be a function doing something similar to this imaginary function selector(data=df$y, values=c(1,2), before=2, after=5, afterafter = FALSE, beforebefore=FALSE), where values is fed with the critical values, before with the amount of rows to select before and correspondingly after.
Whereas afterafter would allow for the possibility to go from certain rows until certain rows after the value, e.g. after=5, afterafter=10 (the same, but going in the other direction, with beforebefore).
Any tips and suggestions are very welcome!
Thanks!
This is easy enough with rep and its each argument.
df$y[rep(which(df$y == 2), each=7L) + -2:4] <- 2
Here, rep repeats the row indices that match your criterion 7 times each (two before, the value itself, and four after; the L indicates that the argument should be an integer). Adding the values -2 through 4, which are recycled across each block of 7 repeated indices, gives the indices to select. Now, replace.
Note that for some comparisons, == will not be adequate due to numerical precision. See the SO post "Why are these numbers not equal?" for a detailed discussion of this topic. In these cases, you could use something like
which(abs(df$y - 2) < 0.001)
or whatever precision measure will work for your problem.
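The same idea can be wrapped into something close to the selector function imagined in the question. This is only a sketch: the name and arguments follow the question's wish list (without afterafter/beforebefore), and it additionally drops indices that would fall outside the data, which the one-liner above does not do.

```r
# Hypothetical helper: label `before` rows before and `after` rows after
# every occurrence of each critical value with that value.
selector <- function(y, values, before = 2, after = 4) {
  out <- numeric(length(y))
  for (v in values) {
    hits <- which(y == v)
    idx <- rep(hits, each = before + after + 1) + (-before:after)
    idx <- idx[idx >= 1 & idx <= length(y)]   # drop out-of-range positions
    out[idx] <- v
  }
  out
}

y <- rep(0, 100)
y[c(13, 44, 80)] <- 1
y[c(20, 34, 92)] <- 2
selection <- selector(y, values = c(1, 2), before = 2, after = 4)
which(selection == 1)
```

Where windows for different values overlap, the value processed last wins.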

R: how to conditionally replace rows in data frame with randomly sampled rows from another data frame?

I need to conditionally replace rows in a data frame (x) with rows selected at random from another data frame (y). Some of the rows in the two data frames are the same, so data frame x will contain rows with repeated information. What sort of base R code would I need to achieve this?
I am writing an agent based model in r where rows can be thought of as vectors of attributes pertaining to an agent and columns are attribute types. For agents to transmit their attributes they need to send rows from one data frame (population) to another, but according to conditional learning rules. These rules need to be: conditionally replace values in row n in data frame x if attribute in column 10 for that row is value 1 or more and if probability s is greater than a randomly selected number between 0 and 1. Probability s is itself an adjustable parameter that can take any value from 0 to 1.
I have tried IF function in the code below, but I am new to r and have made a mistake somewhere with it as I get this warning:
"missing value where TRUE/FALSE needed"
I reckon that I have not specified what should happen to a row if the conditions are not satisfied.
I cannot think of an alternative method of achieving my aim.
Note: agent.dat is data frame x and top_ten_percent is data frame y.
s = 0.7
N = nrow(agent.dat)
copy <- runif(N) #to generate a random probability for each row in agent.dat
for (i in 1:nrow(agent.dat)){
if(agent.dat[,10] >= 1 & copy < s){
agent.dat <- top_ten_percent[sample(nrow(top_ten_percent), 1), ]
}
}
The agent.dat data frame should have rows that are replaced with values from rows in the top_ten_percent data frame if the randomly selected value of copy between 0 and 1 for that row is less than the value of parameter s and if the value for that row in column 10 is 1 or more. For each row I need to replace the first 10 columns of agent.dat with the first 10 columns of top_ten_percent (excluding column 11 i.e. copy value).
Assistance with this problem is greatly appreciated.
So you just need to change a few things.
You need to use the values for the current row in each iteration of the for loop (use copy[i] and agent.dat[i,10]); otherwise the condition has one value per row and if() cannot evaluate it.
You also need to make the & in the if statement an && (Boolean operators && and ||)
Then you need to replace a particular row (and columns 1 through 10) in agent.dat, instead of the whole thing (agent.dat[i,1:10]), taking the same 10 columns from the sampled row.
So, the final code should look like:
copy <- runif(N)
for (i in 1:nrow(agent.dat)){
if(agent.dat[i,10] >= 1 && copy[i] < s){
agent.dat[i,1:10] <- top_ten_percent[sample(nrow(top_ten_percent), 1), 1:10]
}
}
This should fix your errors, assuming your data structure fits your code:
copy <- runif(nrow(agent.dat))
s <- 0.7
for (i in 1:nrow(agent.dat)){
if(agent.dat[i,10] >= 1 & copy[i] < s){
agent.dat[i,] <- top_ten_percent[sample(1:nrow(top_ten_percent), 1), ]
}
}
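The loop can also be avoided by finding all rows to replace in one step. A sketch with made-up toy data (the contents of agent.dat and top_ten_percent below are assumptions purely for illustration); donors are sampled with replacement, so one row of top_ten_percent may be copied more than once.

```r
set.seed(1)
# Toy stand-ins for the real data frames (assumed structure: 10 attribute columns)
agent.dat <- data.frame(matrix(0, nrow = 20, ncol = 10))
agent.dat[, 10] <- rep(c(0, 1), 10)
top_ten_percent <- data.frame(matrix(9, nrow = 3, ncol = 10))

s <- 0.7
copy <- runif(nrow(agent.dat))

# Rows that meet both conditions: column 10 is at least 1 and the draw is below s
replace_rows <- which(agent.dat[, 10] >= 1 & copy < s)

# One randomly chosen donor row per row to be replaced
donors <- sample(nrow(top_ten_percent), length(replace_rows), replace = TRUE)
agent.dat[replace_rows, 1:10] <- top_ten_percent[donors, 1:10]
```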

Excel or R: writing code to automate filtering of non-oscillatory changes in data.

I am new to coding and need direction to turn my method into code.
In my lab I am working on a time-series project to discover which genes in a cell naturally change over the organism's cell cycle. I have a tabular data set with numerical values (originally 10 columns, 27,000 rows). To analyze whether a gene is cycling over the data set I divided the values of one time point (or column) by each subsequent time point (or column), and continued that trend across the data set (the top section of the picture is an example of the spreadsheet with a numerical value at each time point; the bottom section is an example of what the time comparisons looked like across the data).
I then imposed an advanced filter with multiple AND / OR criteria that followed the logic (Source Jeeped)
WHERE (column A >= 2.0 AND column B <= 0.5)
OR (column A >= 2.0 AND column C <= 0.5)
OR (column A >= 2.0 AND column D <= 0.5)
OR (column A >= 2.0 AND column E <= 0.5)
(etc ...)
From there, I slid the advanced filter across the entire data set (in the photograph, A on the left is an example of the original filter, and B is the filter sliding across the data).
The filters produced multiple sheets of genes that fit my criteria. To figure out how many unique genes met these criteria I merged Column A (Gene_IDs) of all the sheets and removed duplicates to produce a list of unique gene IDs.
The process took me nearly 3 hours due to the size of each spreadsheet (37 columns, 27000 rows before filtering). Can this process be expedited? If so, can someone point me in the right direction or help me create the code to do so?
Thank you for your time, and if you need any clarification please don't hesitate to ask.
There are a few ways to do this in R. A common and easy-to-think-about way is to combine the logical tests with the element-wise operators & and |. Putting | between a series of tests means that a row passes if any of them is true for that row. Note that any() itself collapses everything it is given into a single TRUE or FALSE, so it is not suitable for a row-by-row filter. You can test each column, join the tests with |, and then combine the result with an & for the logical test on column a. There are probably other ways to abstract this as well, but this should get you started:
df <- data.frame(
a = 1:100,
b = 1:100,
c = 51:150,
d = 101:200,
value = rep("a", 100),
stringsAsFactors = FALSE
)
# row-wise: a passes AND at least one of b, c, d passes
df[ df$a > 2 & (df$b > 5 | df$c > 5 | df$d > 5), "value"] <- "Test Passed!"
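For the sliding filter described in the question (some column is >= 2.0 and any later column is <= 0.5), the comparison can be done on the whole table at once. A sketch with made-up ratios; the column names t1 to t3 and the gene IDs are assumptions, and the thresholds are the ones quoted in the question.

```r
# Toy ratio table (hypothetical values): one row per gene, one column per comparison
ratios <- data.frame(
  gene_id = c("g1", "g2", "g3"),
  t1 = c(2.5, 1.0, 2.1),
  t2 = c(0.4, 1.1, 1.5),
  t3 = c(1.0, 0.9, 0.3),
  stringsAsFactors = FALSE
)

m <- as.matrix(ratios[, -1])
up <- m >= 2.0     # cells at or above the upper threshold
down <- m <= 0.5   # cells at or below the lower threshold

# Slide across the columns: a gene passes if column j is "up" and
# at least one later column is "down".
hit <- rep(FALSE, nrow(m))
for (j in seq_len(ncol(m) - 1)) {
  later_down <- rowSums(down[, (j + 1):ncol(m), drop = FALSE]) > 0
  hit <- hit | (up[, j] & later_down)
}

unique(ratios$gene_id[hit])   # g1 and g3 pass, g2 does not
```

Because every gene is tested once against all column pairs, there is no need to merge sheets and remove duplicates afterwards.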
