Creating column based on values of other columns in R - r

I am trying to create a new column ($Correct) in a data frame based on values in two other columns ($Condition and $Response).
I realise that there are multiple ways of achieving this (I have since used another method), but I'm interested in the reason why the method below did not work.
training_data.df$Correct<- 0
training_data.df$Correct[training_data.df$Condition==2 & training_data.df$Response==1] <- 1
training_data.df$Correct[(training_data.df$Condition==1|3) & training_data.df$Response==2] <- 1
This method produces the correct values in the output (the new $Correct column), except for cases where $Condition==2 and $Response==2 (the value '1' prints in the $Correct column rather than '0').
This line of code works correctly on its own, but not in combination with the other (last) line for $Condition==1|3.
Can anyone explain why this occurs?

training_data.df$Condition==1|3
reads as:
"(training_data.df$Condition is equal to 1)"
or
"three".
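"(training_data.df$Condition is equal to 1)" can be TRUE or FALSE.
"three" is not a comparison at all: used as a logical value, any non-zero number counts as TRUE, so the whole expression is TRUE for every row.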
Whereas what I think you mean is:
"training_data.df$Condition is equal to (either 1 or 3)".
This would be (training_data.df$Condition==1 | training_data.df$Condition==3) or training_data.df$Condition %in% c(1,3).
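A minimal sketch of the difference, using a small made-up data frame (df and its values are hypothetical stand-ins for training_data.df):

```r
# Hypothetical toy data standing in for training_data.df
df <- data.frame(Condition = c(1, 2, 3, 2), Response = c(2, 1, 2, 2))

# `==` binds tighter than `|`, so this is (Condition == 1) | 3.
# The bare 3 is coerced to TRUE, so every element comes out TRUE.
always_true <- df$Condition == 1 | 3

# The intended test: Condition is 1 or 3.
intended <- df$Condition %in% c(1, 3)

df$Correct <- 0
df$Correct[df$Condition == 2 & df$Response == 1] <- 1
df$Correct[intended & df$Response == 2] <- 1
```

With the corrected condition, the last row (Condition 2, Response 2) keeps its 0 instead of being overwritten with 1.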

Related

What does index do in r?

I have a code I'm working with which has the following line,
data2 <- apply(data1[,-c(1:(index-1))],2,log)
I understand that this creates a new data frame from data1, with the values log-transformed column-wise and some columns removed, but I don't understand how the columns are removed. What does 1:(index-1) do exactly?
The ":" operator creates an integer sequence. Because 1:(index-1) is numeric and is used in the second position of the extraction operator "[" applied to a data frame, it refers to column numbers. The person writing the code didn't need the c() function. It could have been written more economically:
data1[,-(1:(index-1))]
# but the outer "("...")"'s are needed so it starts at 1 rather than -1
So it removes the first index-1 columns from the object passed to apply. (As MrFlick points out, index must have been defined before this line is run. There's no default value or interpretation for index in R.)
Suppose index is 5. Then index-1 is 4, so the sequence runs from 1 to 4. The leading - drops those columns, and because MARGIN = 2, apply loops over the remaining columns, i.e. all but the first 4.
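As an illustration only, here is the same indexing on a made-up data frame (data1 and index = 5 below are hypothetical values, not the asker's data):

```r
# Hypothetical stand-ins: a 3-row, 6-column data frame and index = 5
data1 <- as.data.frame(matrix(1:18, nrow = 3))
index <- 5

1:(index - 1)          # 1 2 3 4
-(1:(index - 1))       # -1 -2 -3 -4: drop columns 1 to 4
-1:(index - 1)         # -1 0 1 2 3 4: mixes negative and positive values

kept <- data1[, -(1:(index - 1))]   # columns V5 and V6 remain
data2 <- apply(kept, 2, log)        # log of each remaining column
```

Note that apply returns a matrix here rather than a data frame; wrap the result in as.data.frame() if a data frame is needed.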

R programming- adding column in dataset error

cv.uk.df$new.d[2:nrow(cv.uk.df)] <- tail(cv.uk.df$deaths, -1) - head(cv.uk.df$deaths, -1) # this line of code works
I wanted to know why we use -1 in tail and -1 in head to create this new column.
I made an effort to understand by removing the -1, and R (the code is in RStudio) throws an error.
Could anyone shed some light on this? I can't explain how much I would appreciate it.
Look at what is being done. On the left-hand side of the assignment operator, we have:
cv.uk.df$new.d[2:nrow(cv.uk.df)] <-
Let's pick this apart.
cv.uk.df # This is the data.frame
$new.d # a new column to assign or a column to reassign
[2:nrow(cv.uk.df)] # the rows which we are going to assign
Specifically, this line of code will assign a new value to all rows of this column except the first. Why would we want to do that? We don't have your data, but from your example, it looks like you want to calculate the change from one row to the next. That calculation is invalid for the first row (no previous row).
Now let's look at the right-hand side.
<- tail(cv.uk.df$deaths, -1) - head(cv.uk.df$deaths, -1)
The cv.uk.df$deaths column has the same number of rows as the data.frame. R gets grouchy when the numbers of elements don't follow some rules. For data.frames, the right-hand side needs to have the same number of elements, or a number that can be recycled a whole number of times. For example, if you have 10 rows, you need to have a replacement of 10 values. Or you can have 5 values that R will recycle.
If your data.frame has 100 rows, only 99 are being replaced in this operation. You cannot feed 100 values into an operation that expects 99. We need to trim the data. Let's look at what is happening. The tail() function has the usage tail(x, n), where it returns the last n values of x. If n is a negative integer, tail() returns all values but the first n. The head() function works similarly.
tail(cv.uk.df$deaths, -1) # This returns all values but the first
head(cv.uk.df$deaths, -1) # This returns all values but the last
This makes sense for your calculation. You cannot subtract the number of deaths in the row before the first row from the number in the first row, nor can you subtract the number of deaths in the last row from the number in the row after the last row. There are more intuitive ways to do this thing using functions from other packages, but this gets the job done.
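A small sketch of the same calculation on invented numbers (the deaths values below are hypothetical), showing why both vectors end up one element shorter than the column:

```r
# Hypothetical cumulative counts standing in for cv.uk.df$deaths
cv.uk.df <- data.frame(deaths = c(0, 2, 5, 9, 14))

tail(cv.uk.df$deaths, -1)   # 2 5 9 14   (everything but the first)
head(cv.uk.df$deaths, -1)   # 0 2 5 9    (everything but the last)

# Initialise the column so row 1 holds NA, then fill rows 2..n
cv.uk.df$new.d <- NA_real_
cv.uk.df$new.d[2:nrow(cv.uk.df)] <-
  tail(cv.uk.df$deaths, -1) - head(cv.uk.df$deaths, -1)

# Base R's diff() computes the same successive differences
same_as_diff <- identical(cv.uk.df$new.d[-1], diff(cv.uk.df$deaths))
```

Both sides of the assignment have nrow - 1 elements, so nothing needs to be recycled.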

How to create a new column with repeated values based on another column?

Here is what I currently have. I have a column named "test1M", which has values of either 0 or 1. If the value is 1, I want to set the next 20 values in column "test1Mxx" to 1.
If I run this code, I get an error (Error in if (data$test1M[x] == 1) { : argument is of length zero).
What's a better way for me to do this? The code is pretty repetitive, so I would like to minimize that if possible. A function would be preferable, so that I could change the number of values (for instance, the following 25 values, or 40 values, etc.).
for(x in data$test1){
if(data$test1[x]==1){
data$test2[x+1]=1
data$test2[x+2]=1
data$test2[x+3]=1
data$test2[x+4]=1
data$test2[x+5]=1
data$test2[x+6]=1
data$test2[x+7]=1
data$test2[x+8]=1
data$test2[x+9]=1
data$test2[x+10]=1
data$test2[x+11]=1
data$test2[x+12]=1
data$test2[x+13]=1
data$test2[x+14]=1
data$test2[x+15]=1
data$test2[x+16]=1
data$test2[x+17]=1
data$test2[x+18]=1
data$test2[x+19]=1
data$test2[x+20]=1}
}
Your loop doesn't work because x is a value of data$test1, not an index of it. You need something like:
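data$test2 <- data$test1
n <- nrow(data)
for (x in seq_along(data$test1)) {
  # fill the 20 rows after x; min() keeps the range inside the data frame
  if (data$test1[x] == 1 && x < n) data$test2[(x + 1):min(x + 20, n)] <- 1
}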
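Since the question asks for a function with an adjustable window, here is one possible sketch. The name fill_following and its arguments are made up for illustration, and unlike the loop above it starts from all zeros, so the position of the 1 itself is not flagged:

```r
# Hypothetical helper: flag the n positions following each 1 in `flag`
fill_following <- function(flag, n = 20) {
  out <- rep(0, length(flag))
  for (i in which(flag == 1)) {
    idx <- i + seq_len(n)
    idx <- idx[idx <= length(flag)]  # drop positions past the end
    out[idx] <- 1
  }
  out
}

test1 <- c(0, 1, 0, 0, 0, 0, 1, 0)
filled <- fill_following(test1, n = 3)   # 0 0 1 1 1 0 0 1
```

It would be called as data$test2 <- fill_following(data$test1, n = 25) to use a window of 25 instead of 20.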

How to find the length of a list based on a condition in R

The problem
I would like to find a length of a list.
The expected output
I would like to find the length based on a condition.
Example
Suppose that I have a list of 4 elements as follows:
myve <- list(1,2,3,0)
Here I have 4 elements, one of which is zero. How can I find the length excluding the zero values? Then, if the length is > 1, I would like to subtract one. That is:
If the length is 4, then I would like to have 4-1=3. So the output should be 3.
Note
Please note that I am working on a problem where the number of zero values may change from one case to another. For example, the first list may have only one 0 value, while the second list may have 2 or 3 zero values.
The values are always positive or zero.
You just need to apply the condition to each element. This will produce a vector of logical values; summing it gives the number of TRUE elements (i.e. the elements satisfying your condition).
In your case:
sum(myve != 0)
In a more complex case, where the condition is expressed by a function f, apply it to each element and sum the result:
sapply(myve, f)
Use sapply to test which elements are different from zero, and sum to count them:
sum(sapply(myve, function(x) x!=0))
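Putting both answers together on the list from the question; the predicate f below is a made-up example, not something from the original post:

```r
myve <- list(1, 2, 3, 0)

# One logical per element, then count the TRUEs
nonzero <- sapply(myve, function(x) x != 0)
n_nonzero <- sum(nonzero)          # 3

# The same pattern with an arbitrary predicate f (hypothetical example)
f <- function(x) x > 1
n_gt1 <- sum(sapply(myve, f))      # 2
```

Because TRUE counts as 1 and FALSE as 0, sum() on a logical vector is simply a count of the elements that pass the test.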

extracting value of variable from dataframe

I have one issue in selecting a value of one variable conditional on the value of another variable in a dataframe.
Dilutionfactor=c(1,3,9,27,80)
Log10Dilutionfactor=log10(Dilutionfactor)
Protection=c(100,81.25,40,10.52,0)
RM=as.data.frame(cbind(Dilutionfactor,Log10Dilutionfactor,Protection))
Now I want to know the value of Log10Dilutionfactor for the row where Protection is equal to 50 (if it appears) or, failing that, the value immediately below 50.
When I used subset(RM,Protection<= 50) it gave three rows, and when I tried RM[grepl(RM$Protection<=50,Log10Dilutionfactor),] it gave 0 values with a warning message. I would really appreciate it if someone could help me.
You can nest two subset() calls:
subset(RM,Protection==max(subset(RM,Protection<= 50)$Protection))$Log10Dilutionfactor
# [1] 0.954243
You could use
with(RM, Log10Dilutionfactor[which(Protection == max(Protection[Protection <= 50]))])
# [1] 0.9542425
or find the index of the Protection value that is closest to 50
index = which(abs(RM$Protection-50)<=min(abs(RM$Protection-50)))
and then look it up in whatever column you want, e.g. for Dilutionfactor:
RM$Dilutionfactor[index]
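The "largest value not above 50" lookup can also be written out step by step. This is only a sketch on the question's data; target, row and result are illustrative names:

```r
Dilutionfactor <- c(1, 3, 9, 27, 80)
RM <- data.frame(
  Dilutionfactor,
  Log10Dilutionfactor = log10(Dilutionfactor),
  Protection = c(100, 81.25, 40, 10.52, 0)
)

# Largest Protection value that does not exceed 50
target <- max(RM$Protection[RM$Protection <= 50])

# Row(s) holding that value, then the column of interest
row <- which(RM$Protection == target)
result <- RM$Log10Dilutionfactor[row]
```

Here target is 40, found in row 3. The closest-to-50 approach gives the same row for this data, but with other data it could pick a value above 50 (for example, if a Protection of 55 were present), so the two are not interchangeable.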
