I am trying to understand a line of R code I have inherited (see below).
sel <- which(rowSums(m3T3L1mRNA.tmp[,c(2,4)] == 20) != 2)
The output of this code essentially excludes all rows from this table (there are thousands of rows, only the first 5 have been shown) that have the value 20 (which in this table equates to NAs).
The code works fine, but I am having trouble interpreting it. As I understand it, the code takes the rowSums of rows that contain the value 20 in columns 2 and 4 (which would be 40) and selects the ones that do not sum to 2.
Where does the value 2 come from? Shouldn't it be as below for the code to work as I think it should?
sel <- which(rowSums(m3T3L1mRNA.tmp[,c(2,4)] == 20) != 40)
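The 2 is a count, not a sum of the values. The comparison == 20 is evaluated first and gives a logical matrix, and rowSums() then counts the TRUE entries in each row, so the maximum over two columns is 2. A minimal sketch with a made-up data frame m (a hypothetical stand-in, not the real m3T3L1mRNA.tmp):

```r
# Made-up stand-in for the real table; columns 2 and 4 are the ones tested.
m <- data.frame(id = c("a", "b", "c", "d"),
                x  = c(20, 20, 5, 7),
                y  = 1:4,
                z  = c(20, 3, 20, 9))

is20 <- m[, c(2, 4)] == 20   # logical matrix: TRUE where the value is 20
hits <- rowSums(is20)        # TRUE counts as 1, so this is 2, 1, 1, 0
sel  <- which(hits != 2)     # rows where NOT both columns are 20: 2, 3, 4
```

In this sketch only row 1, where both columns are 20, is dropped. Rows with a single 20 are kept, and a comparison against 40 would keep every row because a count over two columns can never reach 40.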
I have tried to wrap my head around this for a few hours now and my head just blanked eventually...
My end goal is a data frame with 26 rows and 4 columns, containing the values 1-6, distributed at semi-random, meeting certain conditions.
Conditions are:
Each number can only appear once within a row.
Neighbouring rows of the same column can never contain the same value.
For this, columns 1 and 2 must be seen as a single column containing 2 values, neither of which can repeat in a neighbouring row. So if I have "1 & 2" in one row of columns 1 & 2, I can only have a combination of the values 3-6 in the next and previous rows.
Values should be about equally distributed within any subsection of the table.
My original thought was that I would sample a first row, then create a for loop to add the other rows one by one, changing the probability for a number to be drawn based on the previous samples. Before adding this sample row I could check that the second condition was met and, if not, resample.
I realised eventually that this is so nested that I just cannot wrap my head around it... I assume that I need a while loop, which I have no experience with. As in: while the first 2 conditions are not met, resample at a probability based on previous appearances. The problem, however, is that each column gets its own probability for each value based on the previous appearances, so I cannot just sample a row.
However, if I sample each column individually based on the previous appearances, I will likely get the same values in a row...
So, these are the conditions I would want to be FALSE (I tried it with a function that would just repeat if the statement is TRUE):
temp[1] == table[i-1,1] || temp[1] == table[i-1,2] || temp[2] == table[i-1,1] || temp[2] == table[i-1,2]
temp[3]== table[i-1,3]
temp[4]== table[i-1,4]
Here is how I could calculate the probabilities for sampling (I realise there is a problem if a value has not appeared yet, as this would mean division by 0):
probAB <- rep(1,6) /table(table[,c(1,2)])
probC <- rep(1,6) /table(table[,3])
probD <- rep(1,6) /table(table[,4])
If you want to know what it is supposed to be: it is a non-repeating chores rota between 6 people, where 2 people share 1 chore while the other 2 chores are each done by one person. I am open to alternative suggestions to achieve this^^
We can use rejection sampling. Just generate a sample for each row and if it meets the conditions accept it and go to the next row; otherwise, repeat.
nr <- 26
nc <- 4
k <- 6
set.seed(123)
is_ok <- function(x, y) all(x != y) && x[1] != y[2] && x[2] != y[1]
tab <- matrix(NA, nr, nc)
tab[1, ] <- sample(k, nc)
for (i in 2:nr) {
  repeat {
    tab[i, ] <- sample(k, nc)                  # draw a candidate row
    if (is_ok(tab[i, ], tab[i - 1, ])) break   # keep it only if it passes
  }
}
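To confirm that the accepted rows really satisfy the first two conditions, the result can be checked after the loop. The following is a minimal sketch that repeats the generation with the same settings and then tests the table; the names cand, rows_distinct, neighbours_ok and counts are introduced here for illustration only.

```r
set.seed(123)
nr <- 26; nc <- 4; k <- 6
is_ok <- function(x, y) all(x != y) && x[1] != y[2] && x[2] != y[1]

tab <- matrix(NA_integer_, nr, nc)
tab[1, ] <- sample(k, nc)
for (i in 2:nr) {
  repeat {
    cand <- sample(k, nc)                 # candidate row, no repeats within it
    if (is_ok(cand, tab[i - 1, ])) break  # reject and redraw otherwise
  }
  tab[i, ] <- cand
}

# Condition 1: each number appears at most once within a row.
rows_distinct <- all(apply(tab, 1, function(r) anyDuplicated(r) == 0))

# Condition 2: no clash with the previous row (columns 1 and 2 pooled).
neighbours_ok <- all(vapply(2:nr,
                            function(i) is_ok(tab[i, ], tab[i - 1, ]),
                            logical(1)))

# Condition 3 is not enforced by rejection sampling; inspect the counts.
counts <- table(factor(tab, levels = 1:k))
```

Note that the sampler only guarantees the first two conditions. The counts table shows how evenly the six values ended up being used, so a table that looks too unbalanced can simply be regenerated with a different seed.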
cv.uk.df$new.d[2:nrow(cv.uk.df)] <- tail(cv.uk.df$deaths, -1) - head(cv.uk.df$deaths, -1) # this line of code works
I want to know why we pass -1 to tail() and -1 to head() to create this new column.
I tried to understand it by removing the -1, and R (I am running the code in RStudio) throws this error.
Could anyone shed some light on this? I can't explain how much I would appreciate it.
Look at what is being done. On the left-hand side of the assignment operator, we have:
cv.uk.df$new.d[2:nrow(cv.uk.df)] <-
Let's pick this apart.
cv.uk.df # This is the data.frame
$new.d # a new column to assign or a column to reassign
[2:nrow(cv.uk.df)] # the rows which we are going to assign
Specifically, this line of code will assign a new value to all rows of this column except the first. Why would we want to do that? We don't have your data, but from your example, it looks like you want to calculate the change from one row to the next. That calculation is invalid for the first row (there is no previous row).
Now let's look at the right-hand side.
<- tail(cv.uk.df$deaths, -1) - head(cv.uk.df$deaths, -1)
The cv.uk.df$deaths column has the same number of rows as the data.frame. R gets grouchy when the numbers of elements don't follow some rules. For data.frames, the right-hand side needs to have the same number of elements as the rows being replaced, or a number that can be recycled a whole number of times. For example, if you have 10 rows, you need a replacement of 10 values. Or you can have 5 values, which R will recycle.
If your data.frame has 100 rows, only 99 are being replaced in this operation. You cannot feed 100 values into an operation that expects 99. We need to trim the data. Let's look at what is happening. The tail() function has the usage tail(x, n), where it returns the last n values of x. If n is a negative integer, tail() returns all values but the first n. The head() function works similarly.
tail(cv.uk.df$deaths, -1) # This returns all values but the first
head(cv.uk.df$deaths, -1) # This returns all values but the last
This makes sense for your calculation. You cannot subtract the number of deaths in the row before the first row from the number in the first row, nor can you subtract the number of deaths in the last row from the number in the row after the last row. There are more intuitive ways to do this thing using functions from other packages, but this gets the job done.
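A small worked example may make the trimming easier to see. This is a minimal sketch on a made-up vector of cumulative counts (the values and the names deaths, new_d and df are illustrative, not your data); it also shows that base R's diff() computes the same differences.

```r
deaths <- c(1, 3, 6, 10, 15)   # made-up cumulative counts

tail(deaths, -1)   # everything but the first value: 3 6 10 15
head(deaths, -1)   # everything but the last value:  1 3 6 10

# Element-wise subtraction pairs each value with the one before it.
new_d <- tail(deaths, -1) - head(deaths, -1)   # 2 3 4 5

# diff() gives the same lagged differences in one call.
same <- identical(new_d, diff(deaths))

# Padding with NA keeps the column the same length as the data frame.
df <- data.frame(deaths = deaths)
df$new.d <- c(NA, diff(df$deaths))
```

Both operands have one element fewer than the original vector, which is why the result fits into rows 2 to nrow() of the data frame.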
I'm quite new to R, unfortunately I wasn't able to find help in other related questions so far.
I have this dataframe called selection, including column 'RUN' and column 'TRNO'.
It originally had 9 columns. I added the column 'RUN' which contains a count that increases by 1 whenever the value in the column 'DAP' is 0, using this code:
# Insert column RUN in "selection" dataframe
library(dplyr)
selection$RUN <- cumsum(selection$DAP == 0)
That worked perfectly. Now I would like to do a similar operation for the column 'TRNO'. It also needs to contain a count that this time only increases when the column 'RUN' arrives at multiples of 80 (i.e. from RUN == 1-80 --> count =1; RUN == 81-160 --> count =2,...)
I tried several codes, amongst others this one:
# Insert column TRNO in "selection" dataframe
i = 0
repeat{
i = i+80
selection$TRNO <- cumsum(selection$RUN == i)
break
}
Instead of increasing the count at every multiple of 80, it returns "0" when RUN values are between 1-80, increases to 92 when RUN values are at 80, and then stagnates at 92 for all the higher values in RUN.
Try this:
selection$TRNO <- ceiling(selection$RUN/80)
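To see why dividing by 80 and rounding up gives the wanted blocks, here is a minimal sketch on a handful of made-up RUN values around the block boundaries (the vector RUN and the name TRNO2 are illustrative only):

```r
RUN <- c(1, 40, 80, 81, 160, 161, 240)   # made-up values near the boundaries

# 1..80 -> 1, 81..160 -> 2, 161..240 -> 3
TRNO <- ceiling(RUN / 80)

# Equivalent formulation with integer division.
TRNO2 <- (RUN - 1) %/% 80 + 1
```

The repeat loop in the question only ever runs once because of the break, so it compares RUN with 80 alone; the arithmetic above needs no loop at all.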
I need to conditionally replace rows in a data frame (x) with rows selected at random from another data frame (y). Some of the rows in the two data frames are the same, so data frame x will contain rows with repeated information. What sort of base R code would I need to achieve this?
I am writing an agent-based model in R where rows can be thought of as vectors of attributes pertaining to an agent and columns are attribute types. For agents to transmit their attributes they need to send rows from one data frame (population) to another, but according to conditional learning rules. The rule needs to be: replace the values in row n of data frame x if the attribute in column 10 for that row is 1 or more and if probability s is greater than a randomly selected number between 0 and 1. Probability s is itself an adjustable parameter that can take any value from 0 to 1.
I have tried an if statement in the code below, but I am new to R and have made a mistake somewhere with it, as I get this message:
"missing value where TRUE/FALSE needed"
I reckon that I have not specified what should happen to a row if the conditions are not satisfied.
I cannot think of an alternative method of achieving my aim.
Note: agent.dat is data frame x and top_ten_percent is data frame y.
s = 0.7
N = nrow(agent.dat)
copy <- runif(N) #to generate a random probability for each row in agent.dat
for (i in 1:nrow(agent.dat)){
if(agent.dat[,10] >= 1 & copy < s){
agent.dat <- top_ten_percent[sample(nrow(top_ten_percent), 1), ]
}
}
The agent.dat data frame should have rows that are replaced with values from rows in the top_ten_percent data frame if the randomly selected value of copy between 0 and 1 for that row is less than the value of parameter s and if the value for that row in column 10 is 1 or more. For each row I need to replace the first 10 columns of agent.dat with the first 10 columns of top_ten_percent (excluding column 11 i.e. copy value).
Assistance with this problem is greatly appreciated.
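So you just need to change a few things.
You need to get a particular value for copy for each iteration of the for loop (use: copy[i]).
Likewise, you need to test column 10 of the current row only (use: agent.dat[i, 10]); agent.dat[, 10] is the whole column, so the condition would have more than one element.
You also need to make the & in the if statement an && (see the Boolean operators && and ||), since if() expects a single TRUE or FALSE.
Then you need to replace a particular row (and columns 1 through 10) in agent.dat, instead of the whole thing (agent.dat[i, 1:10]), taking the same columns from the sampled row of top_ten_percent.
So, the final code should look like:
copy <- runif(N)
for (i in 1:nrow(agent.dat)){
  if(agent.dat[i, 10] >= 1 && copy[i] < s){
    agent.dat[i, 1:10] <- top_ten_percent[sample(nrow(top_ten_percent), 1), 1:10]
  }
}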
This should fix your errors, assuming your data structure fits your code:
copy <- runif(nrow(agent.dat))
s <- 0.7
for (i in 1:nrow(agent.dat)){
if(agent.dat[i,10] >= 1 & copy[i] < s){
agent.dat[i,] <- top_ten_percent[sample(1:nrow(top_ten_percent), 1), ]
}
}
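The same rule can also be applied without a loop. The following is a minimal sketch on toy data, assuming both data frames have at least 10 columns of compatible types; the toy values and the names before, to_replace and donors are made up for illustration.

```r
set.seed(42)

# Toy stand-ins for the real data (hypothetical values):
# 8 agents with 10 attribute columns; column 10 is 0 for the first 3 agents.
agent.dat <- as.data.frame(matrix(0, nrow = 8, ncol = 10))
agent.dat[[10]] <- c(0, 0, 0, 1, 2, 1, 3, 1)
# 3 donor rows filled with 99 so that copied rows are easy to spot.
top_ten_percent <- as.data.frame(matrix(99, nrow = 3, ncol = 10))

s <- 0.7
copy <- runif(nrow(agent.dat))   # one random draw per agent
before <- agent.dat              # kept only to compare afterwards

# Rows that satisfy both conditions; which() silently drops NA results.
to_replace <- which(agent.dat[[10]] >= 1 & copy < s)

if (length(to_replace) > 0) {
  # One randomly chosen donor row per replaced agent (with replacement).
  donors <- sample(nrow(top_ten_percent), length(to_replace), replace = TRUE)
  agent.dat[to_replace, 1:10] <- top_ten_percent[donors, 1:10]
}
```

Because which() ignores NA, a missing value in column 10 leaves that row unchanged here instead of stopping with "missing value where TRUE/FALSE needed", which is one plausible source of the message in the question.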
Here is what I currently have. I have a column named "test1M", which has values of either 0 or 1. If the value is 1, I want to set the next 20 values in column "test1Mxx" to value 1.
If I run this code, I get an error of (Error in if (data$test1M[x] == 1) { : argument is of length zero).
What's a better way for me to do this? The code is pretty repetitive, so I would like to minimize that if possible. If there is a way to turn this into a function, that would be preferable, so I could change the number of values (for instance, maybe the following 25 values, or 40 values, etc.)
for(x in data$test1){
if(data$test1[x]==1){
data$test2[x+1]=1
data$test2[x+2]=1
data$test2[x+3]=1
data$test2[x+4]=1
data$test2[x+5]=1
data$test2[x+6]=1
data$test2[x+7]=1
data$test2[x+8]=1
data$test2[x+9]=1
data$test2[x+10]=1
data$test2[x+11]=1
data$test2[x+12]=1
data$test2[x+13]=1
data$test2[x+14]=1
data$test2[x+15]=1
data$test2[x+16]=1
data$test2[x+17]=1
data$test2[x+18]=1
data$test2[x+19]=1
data$test2[x+20]=1}
}
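Your loop doesn't work because x is a value of data$test1, not an index of it. Also, x + 1:20 can run past the last row when a 1 occurs near the end, so clip the indices. You need something like:
data$test2 <- data$test1
n <- nrow(data)
for (x in seq_along(data$test1)) {
  idx <- x + 1:20  # the next 20 positions
  if (data$test1[x] == 1) data$test2[idx[idx <= n]] <- 1
}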
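Since the question also asks for a function with an adjustable window, here is a minimal sketch of one. The name flag_following and its arguments are hypothetical; unlike the loop above, it starts from a vector of zeros rather than a copy of the trigger column, and it drops positions that would fall past the end of the vector.

```r
# Returns a 0/1 vector in which the n positions after every 1 in
# `trigger` are set to 1. Positions beyond the end are ignored.
flag_following <- function(trigger, n = 20) {
  len <- length(trigger)
  out <- integer(len)                 # start with all zeros
  for (x in which(trigger == 1)) {    # indices (not values) of the 1s
    idx <- x + seq_len(n)             # the next n positions
    out[idx[idx <= len]] <- 1L        # clip at the last element
  }
  out
}

example <- flag_following(c(1, 0, 0, 0, 0), n = 2)   # 0 1 1 0 0
```

With a data frame this could be used as data$test2 <- flag_following(data$test1, n = 25), assuming test1 holds only 0, 1 or NA.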