In R: How do I increment a column value based on multiples of a certain value in the adjacent column

I'm quite new to R, unfortunately I wasn't able to find help in other related questions so far.
I have this dataframe called selection, including column 'RUN' and column 'TRNO'.
It originally had 9 columns. I added the column 'RUN' which contains a count that increases by 1 whenever the value in the column 'DAP' is 0, using this code:
# Insert column RUN in "selection" dataframe
library(dplyr)
selection$RUN <- cumsum(selection$DAP == 0)
That worked perfectly. Now I would like to do a similar operation for the column 'TRNO'. It also needs to contain a count, but this time one that only increases at every multiple of 80 in 'RUN' (i.e. RUN 1-80 --> count = 1; RUN 81-160 --> count = 2, ...).
I tried several codes, amongst others this one:
# Insert column TRNO in "selection" dataframe
i = 0
repeat{
i = i+80
selection$TRNO <- cumsum(selection$RUN == i)
break
}
Instead of increasing the count at every multiple of 80, it returns "0" when RUN values are between 1-80, increases to 92 when RUN values are at 80, and then stagnates at 92 for all the higher values in RUN.

try this:
selection$TRNO <- ceiling(selection$RUN / 80)
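As a quick check with some made-up RUN values, ceiling() gives exactly the mapping you described:
RUN <- c(1, 79, 80, 81, 160, 161)   # made-up RUN values
ceiling(RUN / 80)                   # returns 1 1 1 2 2 3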

R programming- adding column in dataset error

cv.uk.df$new.d[2:nrow(cv.uk.df)] <- tail(cv.uk.df$deaths, -1) - head(cv.uk.df$deaths, -1) # this line of code works
I wanted to know why we use -1 in tail() and -1 in head() to create this new column.
I tried to understand by removing the -1, and R (the code is in RStudio) throws an error.
Could anyone shed some light on this? I can't explain how much I would appreciate it.
Look at what is being done. On the left-hand side of the assignment operator, we have:
cv.uk.df$new.d[2:nrow(cv.uk.df)] <-
Let's pick this apart.
cv.uk.df # This is the data.frame
$new.d # a new column to assign or a column to reassign
[2:nrow(cv.uk.df)] # the rows which we are going to assign
Specifically, this line of code will assign a new value to all rows of this column except the first. Why would we want to do that? We don't have your data, but from your example, it looks like you want to calculate the change from one line to the next. That calculation is invalid for the first row (there is no previous row).
Now let's look at the right-hand side.
<- tail(cv.uk.df$deaths, -1) - head(cv.uk.df$deaths, -1)
The cv.uk.df$deaths column has the same number of rows as the data.frame. R gets grouchy when the numbers of elements don't match. For data.frames, the right-hand side needs to have the same number of elements as the target, or a number that can be recycled a whole number of times. For example, if you are filling 10 rows, you need a replacement of 10 values, or you can have 5 values that R will recycle twice.
If your data.frame has 100 rows, only 99 are being replaced in this operation. You cannot feed 100 values into an operation that expects 99, so we need to trim the data. Let's look at what is happening. The tail() function has the usage tail(x, n), where it returns the last n values of x. If n is a negative integer, tail() instead returns all values except the first |n|. The head() function works similarly.
tail(cv.uk.df$deaths, -1) # This returns all values but the first
head(cv.uk.df$deaths, -1) # This returns all values but the last
This makes sense for your calculation. You cannot subtract the number of deaths in the row before the first row from the number in the first row, nor can you subtract the number of deaths in the last row from the number in the row after the last row. There are more intuitive ways to do this thing using functions from other packages, but this gets the job done.
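To make it concrete, here is the same calculation on a toy vector (made-up numbers standing in for cv.uk.df$deaths); base R's diff() gives the same result in one call:
deaths <- c(10, 12, 15, 21)          # made-up values standing in for cv.uk.df$deaths
tail(deaths, -1) - head(deaths, -1)  # 2 3 6: each value minus the previous one
diff(deaths)                         # same result using base R's diff()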

How to add grouping variable to data set that will classify both an observation and its N neighbors based on some condition

I am having some trouble coming up with a solution that properly handles classifying a variable number of neighbors for any given observation in a data frame based on some condition. I would like to be able to add a simple, binary indicator variable to a data frame that will equal 1 if the condition is satisfied, and 0 if it is not.
Where I am getting stuck is that I am unsure how to iteratively check the condition against neighboring observations only, in either direction (e.g., to check whether at least 3 out of 4 neighboring observations in a given column of my data frame contain the same value). I have tried first creating another indicator variable indicating whether the condition is satisfied (1 or 0 = yes or no). Then, I tried setting up a series of ifelse() statements within a loop to assign the proper categorization to the observation where the initial condition is satisfied, +/- 2 observations in either direction. However, when I inspect the data frame after running the loop, only the observation where the condition is satisfied (not its neighbors) receives the value. Here is my code:
#sample data
sample_dat <- data.frame(initial_ind = c(0,1,0,1,0,0,1,1,0,1,0,0))
sample_dat$violate <- NULL
for(i in 1:nrow(dat_date_ord)){
sample_dat$violate[i] <- ifelse(sample_dat$initial_ind[i]==1 &
((sample_dat$initial_ind[i-2]==1 |
sample_dat$initial_ind[i-1]==1) &
(sample_dat$initial_ind[i+2]==1 |
sample_dat$initial_ind[i+1]==1)),
"trending",
"non-trending"
)
}
This loop correctly identifies one of the four points that needs to be labelled "trending", but it does not also assign "trending" to the correct neighbors. In other words, I expect the output to be "trending" for observations 7-10, since 3 of the 4 observations in that group have a value of 1 in the initial indicator column. I feel like there might be an easier way to accomplish this - but what I need to ensure is that my code is robust enough to identify and assign observations to a group regardless of whether I want 3/4 to indicate a group, 5/6, 2/5, etc.
Thank you for any and all advice.
You can use the rollapply function from the zoo package to apply a function over set intervals in your data. The question then becomes about creating a function that satisfies your needs. I'm not sure if I've understood correctly, but it seems you want a function that checks whether the condition is true for at least 3 out of 5 values (the observation plus its four closest neighbors). In that case, just adding the 1s up and checking whether the sum is above 2 works.
library(zoo)
sample_dat <- data.frame(initial_ind = c(0,1,0,1,0,0,1,1,0,1,0,0))
trend_test = function(x){
  ifelse(sum(x) > 2, "trending", "non-trending")
}
sample_dat$violate_new = rollapply(sample_dat$initial_ind, FUN = trend_test, width = 5, fill = NA)
Edit: If you want a function that checks if the observation and the next 3 observations have at least 3 1s, you can do something very similar, just by changing the align argument on rollapply:
trend_test_2 = function(x){
  ifelse(sum(x) > 2, "trending", "non-trending")
}
sample_dat$violate_new = rollapply(sample_dat$initial_ind, FUN = trend_test_2, width = 4,
                                   fill = NA, align = "left")
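Since the rule itself may change (3/4, 5/6, 2/5, and so on), a more general sketch is to pass the threshold as an extra argument through rollapply (the helper name trend_test_k and its argument k are made up here, not part of your original code):
library(zoo)
# Hypothetical helper: flags a window as "trending" when at least k of its values are 1
trend_test_k = function(x, k){
  ifelse(sum(x) >= k, "trending", "non-trending")
}
# e.g. at least 3 of 4 observations, looking forward from each row:
sample_dat$violate_new = rollapply(sample_dat$initial_ind, FUN = trend_test_k, k = 3,
                                   width = 4, fill = NA, align = "left")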

R: how to conditionally replace rows in data frame with randomly sampled rows from another data frame?

I need to conditionally replace rows in a data frame (x) with rows selected at random from another data frame (y). Some of the rows between the two data frames are the same, and so data frame x will contain rows with repeated information. What sort of base R code would I need to achieve this?
I am writing an agent based model in r where rows can be thought of as vectors of attributes pertaining to an agent and columns are attribute types. For agents to transmit their attributes they need to send rows from one data frame (population) to another, but according to conditional learning rules. These rules need to be: conditionally replace values in row n in data frame x if attribute in column 10 for that row is value 1 or more and if probability s is greater than a randomly selected number between 0 and 1. Probability s is itself an adjustable parameter that can take any value from 0 to 1.
I have tried an if statement in the code below, but I am new to R and have made a mistake somewhere with it, as I get this warning:
"missing value where TRUE/FALSE needed"
I reckon that I have not specified what should happen to a row if the conditions are not satisfied.
I cannot think of an alternative method of achieving my aim.
Note: agent.dat is data frame x and top_ten_percent is data frame y.
s = 0.7
N = nrow(agent.dat)
copy <- runif(N) #to generate a random probability for each row in agent.dat
for (i in 1:nrow(agent.dat)){
if(agent.dat[,10] >= 1 & copy < s){
agent.dat <- top_ten_percent[sample(nrow(top_ten_percent), 1), ]
}
}
Rows of agent.dat should be replaced with rows from the top_ten_percent data frame if the randomly drawn copy value (between 0 and 1) for that row is less than the parameter s and if the value in column 10 for that row is 1 or more. For each qualifying row I need to replace the first 10 columns of agent.dat with the first 10 columns of top_ten_percent (i.e. excluding column 11, the copy value).
Assistance with this problem is greatly appreciated.
So you just need to change a few things.
You need to get a particular value for copy for each iteration of the for loop (use: copy[i]).
You also need to make the & in the if statement an && (the short-circuit Boolean operators && and || work on single values, which is what if() expects).
Then you need to replace a particular row (and columns 1 through 10) in agent.dat, instead of the whole thing (agent.dat[i,1:10])
So, the final code should look like:
copy <- runif(N)
for (i in 1:nrow(agent.dat)){
  if(agent.dat[i,10] >= 1 && copy[i] < s){
    agent.dat[i,1:10] <- top_ten_percent[sample(nrow(top_ten_percent), 1), ]
  }
}
This should fix your errors, assuming your data structure fits your code:
copy <- runif(nrow(agent.dat))
s <- 0.7
for (i in 1:nrow(agent.dat)){
  if(agent.dat[i,10] >= 1 & copy[i] < s){
    agent.dat[i,] <- top_ten_percent[sample(1:nrow(top_ten_percent), 1), ]
  }
}
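If you prefer to avoid the loop entirely, a vectorized sketch (assuming agent.dat and top_ten_percent share the same first 10 columns, and copy and s are defined as above) would be:
# Rows that meet both conditions
replace_idx <- which(agent.dat[, 10] >= 1 & copy < s)
# Draw one random donor row from top_ten_percent for each row being replaced
agent.dat[replace_idx, 1:10] <-
  top_ten_percent[sample(nrow(top_ten_percent), length(replace_idx), replace = TRUE), 1:10]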

How to create a new column with repeated values based on another column?

Here is what I currently have. I have a column named "test1M", which has values of either 0 or 1. If the value is 1, I want to set the next 20 values in column "test1Mxx" to 1.
If I run this code, I get an error: (Error in if (data$test1M[x] == 1) { : argument is of length zero).
What's a better way for me to do this? The code is pretty repetitive, so I would like to minimize that if possible. If there is a way to turn this into a function, that would be preferable, so I could change the number of values (for instance, the following 25 values, or 40 values, etc.).
for(x in data$test1){
if(data$test1[x]==1){
data$test2[x+1]=1
data$test2[x+2]=1
data$test2[x+3]=1
data$test2[x+4]=1
data$test2[x+5]=1
data$test2[x+6]=1
data$test2[x+7]=1
data$test2[x+8]=1
data$test2[x+9]=1
data$test2[x+10]=1
data$test2[x+11]=1
data$test2[x+12]=1
data$test2[x+13]=1
data$test2[x+14]=1
data$test2[x+15]=1
data$test2[x+16]=1
data$test2[x+17]=1
data$test2[x+18]=1
data$test2[x+19]=1
data$test2[x+20]=1}
}
Your loop doesn't work because x is a value of data$test1, not an index of it. You need something like:
data$test2 <- data$test1
for (x in seq_along(data$test1))
  if (data$test1[x] == 1) data$test2[x + 1:20] <- 1
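If you want the adjustable version you mentioned, here is a sketch of a small helper (the name fill_next_n and its n argument are made up here) that sets the n values after every 1 and guards against running past the end of the column:
fill_next_n <- function(flag, n = 20){
  out <- flag
  for (x in seq_along(flag)){
    if (flag[x] == 1){
      idx <- x + seq_len(n)
      out[idx[idx <= length(flag)]] <- 1   # stay within bounds
    }
  }
  out
}
data$test2 <- fill_next_n(data$test1, n = 20)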

Using logical functions and rowSums together

I am trying to understand an R code I have inherited (see below).
sel <- which(rowSums(m3T3L1mRNA.tmp[,c(2,4)] == 20) != 2)
The output of this code essentially excludes all rows from this table (there are thousands of rows, only the first 5 have been shown) that have the value 20 (which in this table equates to NAs).
The code works fine, but I am having trouble interpreting it. As I understand it, the code is asking to get the rowSums of rows that contain a value of 20 in columns 2 and 4 (which would be 40) and select the ones that do not sum to 2.
Where does the value 2 come from? Shouldn't it be as below for the code to work as I think it should?
sel <- which(rowSums(m3T3L1mRNA.tmp[,c(2,4)] == 20) != 40)
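For reference, the == comparison happens before the sum: m3T3L1mRNA.tmp[,c(2,4)] == 20 returns a logical matrix, and rowSums() then counts the TRUEs in each row, so with two columns the possible row sums are 0, 1 or 2 rather than 40. A minimal sketch with made-up values standing in for columns 2 and 4:
tmp <- data.frame(a = c(20, 20, 5), b = c(20, 7, 20))   # toy stand-in for columns 2 and 4
m <- tmp == 20            # logical matrix: TRUE where the value is 20
rowSums(m)                # counts TRUEs per row: here 2 1 1
which(rowSums(m) != 2)    # rows where the two columns are not both 20: here rows 2 and 3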
