conditional assignment of values - r

I have a dataset with two variables:
ACCURACY Feedback
141 0 3
156 0 1
167 1 2
185 1 1
191 1 NA
193 1 1
I have created a new column called X, where I would like to assign 3 potential values (correct, incorrect, unknown) based on combinations between the previous two values (i.e. accuracy ~ Feedback).
I have tried the following:
df$X=NA
df[!is.na((df$ACC==1)&(df$Feedback==1)),]$X <- "correct"
df[!is.na((df$ACC==1)&(df$Feedback==2)),]$X <- "unknown"
df[!is.na((df$ACC==1)&(df$Feedback==3)),]$X <- "incorrect"
df[!is.na((df$ACC==0)&(df$Feedback==1)),]$X <- "correct"
df[!is.na((df$ACC==0)&(df$Feedback==2)),]$X <- "unknown"
df[!is.na((df$ACC==0)&(df$Feedback==3)),]$X <- "incorrect"
But it doesn't assign a value in X based on both ACC and Feedback; instead, each line of code overrides the values assigned by the previous one.
I would appreciate any guidance/suggestions.

This can be done with nested ifelse functions, although, based on the example posted, it looks like X depends only on Feedback, never on ACCURACY.
ACCURACY Feedback
1 0 3
2 0 1
3 1 2
4 1 1
5 1 NA
6 1 1
df$X <- ifelse(df$ACCURACY == 1, ifelse(df$Feedback == 1, "correct", ifelse(df$Feedback == 2, "unknown", "incorrect")), ifelse(df$Feedback == 1, "correct", ifelse(df$Feedback == 2, "unknown", "incorrect")))
ACCURACY Feedback X
1 0 3 incorrect
2 0 1 correct
3 1 2 unknown
4 1 1 correct
5 1 NA <NA>
6 1 1 correct

If the values of X indeed do not depend on ACCURACY, you could just recode Feedback as a factor
df$X <- factor(df$Feedback,
levels = c(1, 2, 3),
labels = c("correct", "unknown", "incorrect"))

The issue is that you've wrapped all the assignment conditions with !is.na. These vectors all evaluate to the same thing. For example:
> !is.na((df$ACC==1)&(df$Feedback==2))
[1] TRUE TRUE TRUE TRUE FALSE TRUE
> !is.na((df$ACC==1)&(df$Feedback==3))
[1] TRUE TRUE TRUE TRUE FALSE TRUE
A possible solution would be to write a little function to do the assignments you want, and then use apply.
recoder <- function(row) {
accuracy <- row[['ACCURACY']]
feedback <- row[['Feedback']]
if(is.na(accuracy) || is.na(feedback)) {
ret_val <- NA
}
else if((accuracy==1 && feedback==1) || (accuracy==0 && feedback==1)) {
ret_val <- "correct"
}
else if((accuracy==1 && feedback==2) || (accuracy==0 && feedback==2)) {
ret_val <- "unknown"
}
else {
ret_val <- "incorrect"
}
return(ret_val)
}
df$X <- apply(df, 1, recoder)
df
> df
ACCURACY Feedback X
141 0 3 incorrect
156 0 1 correct
167 1 2 unknown
185 1 1 correct
191 1 NA <NA>
193 1 1 correct
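A vectorised alternative, sketched here under the assumption (noted in the first answer) that X depends only on Feedback: wrapping each comparison in which() drops the NA results, so every assignment touches only the rows that match its own condition. The data frame below is rebuilt from the question's example.

```r
# Example data rebuilt from the question (row names kept as in the post)
df <- data.frame(ACCURACY = c(0, 0, 1, 1, 1, 1),
                 Feedback = c(3, 1, 2, 1, NA, 1),
                 row.names = c(141, 156, 167, 185, 191, 193))

# Start with a character NA so unmatched rows stay missing
df$X <- NA_character_

# which() returns only the positions where the comparison is TRUE,
# so rows with Feedback == NA are never selected
df$X[which(df$Feedback == 1)] <- "correct"
df$X[which(df$Feedback == 2)] <- "unknown"
df$X[which(df$Feedback == 3)] <- "incorrect"

df
```

If ACCURACY should matter after all, the same pattern works with a combined condition such as which(df$ACCURACY == 1 & df$Feedback == 2).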

Related

R - use if statement to regroup variable

I want to regroup a variable into a new one.
If value is 0, new one should be 0 too.
If value is 999, then make it missing, NA.
Everything else 1
This is my try:
id <- 1:10
variable <- c(0,0,0,1,2,3,4,5,999,999)
df <- data.frame(id,variable)
df$variable2 <-
if (df$variable == 0) {
df$variable2 = 0
} else if (df$variable == 999){
df$variable2 = NA
} else {
df$variable2 = 1
}
And this is the error message:
In if (df$variable == 0) { : the condition has length > 1 and only
the first element will be used
A pretty basic question but I'm a basic user. Thanks in advance!
Try ifelse
df$variable2 <- ifelse(df$variable == 999, NA, ifelse(df$variable > 0, 1, 0))
df
# id variable variable2
#1 1 0 0
#2 2 0 0
#3 3 0 0
#4 4 1 1
#5 5 2 1
#6 6 3 1
#7 7 4 1
#8 8 5 1
#9 9 999 NA
#10 10 999 NA
When you do df$variable == 0 the output / condition is
#[1] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
where it should be a length-one logical vector that is not NA in if(condition), see ?"if".
You can avoid ifelse, for example, like so
df$variable2 <- df$variable
df$variable2[df$variable2 == 999] <- NA
df$variable2[df$variable2 > 0] <- 1
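As a quick sanity check, the subset-assignment version can be compared with the nested ifelse version on the same example data. This is only a sketch; the names via_ifelse and via_subset are made up for illustration, and which() is used in the last step so that the NA values created by the previous step are skipped explicitly.

```r
variable <- c(0, 0, 0, 1, 2, 3, 4, 5, 999, 999)

# Nested ifelse: 999 becomes NA, other positive values become 1, zero stays 0
via_ifelse <- ifelse(variable == 999, NA, ifelse(variable > 0, 1, 0))

# Subset assignment: recode 999 first, then everything still above zero
via_subset <- variable
via_subset[via_subset == 999] <- NA
via_subset[which(via_subset > 0)] <- 1

# Compare the two results element by element (NA positions included)
all.equal(via_ifelse, via_subset)
```

The order of the two assignments matters: if the "> 0" step ran first, 999 would already have been turned into 1 before it could be set to NA.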
It might be easier to avoid the if/else statement altogether by using conditional statements within subset notation:
when df$variable is equal to zero, change it to zero
df$variable[df$variable==0] <- 0
when df$variable is equal to 999, change it to NA
df$variable[df$variable==999] <- NA
when df$variable is greater than 0 and is not NA, change it to 1
df$variable[df$variable>0 & !is.na(df$variable)] <- 1
Looks like you want to recode your variable. You can do this (and other data/variable transformations) with the sjmisc-package, in your case with the rec()-command:
id <- 1:10
variable <- c(0,0,0,1,2,3,4,5,999,999)
df <- data.frame(id,variable)
library(sjmisc)
rec(df, variable, rec = c("0=0;999=NA;else=1"))
#> id variable variable_r
#> 1 1 0 0
#> 2 2 0 0
#> 3 3 0 0
#> 4 4 1 1
#> 5 5 2 1
#> 6 6 3 1
#> 7 7 4 1
#> 8 8 5 1
#> 9 9 999 NA
#> 10 10 999 NA
# or a single vector as input
rec(df$variable, rec = c("0=0;999=NA;else=1"))
#> [1] 0 0 0 1 1 1 1 1 NA NA
There are many examples, also in the help-file, and you can find a sjmisc-cheatsheet at the RStudio-Cheatsheet collection (or direct PDF-download here).
df$variable2 <- sapply(df$variable,
function(el) if (el == 0) {0} else if (el == 999) {NA} else {1})
This one-liner reflects your:
If value is 0, new one should be 0 too. If value is 999, then make it
missing, NA. Everything else 1
Well, it is slightly slower than @markus's second solution or @SPJ's solution, which are the most R-ish solutions.
Why one should keep one's hands off ifelse
tt <- c(TRUE, FALSE, TRUE, FALSE)
a <- c("a", "b", "c", "d")
b <- 1:4
ifelse(tt, a, b) ## [1] "a" "2" "c" "4"
# totally perfect and as expected!
df <- data.frame(a=a, b=b, c=tt)
df$d <- ifelse(df$c, df$a, df$b)
## > df
## a b c d
## 1 a 1 TRUE 1
## 2 b 2 FALSE 2
## 3 c 3 TRUE 3
## 4 d 4 FALSE 4
######### This is wrong!! ##########################
## df$d is not [1] "a" "2" "c" "4"
## the problem is that
## ifelse(df$c, df$a, df$b)
## returns for each TRUE or FALSE the entire
## df$a or df$b instead of treating it like a vector.
## Since the last df$c is FALSE, df$b is returned
## Thus we get df$b for df$d.
## Quite an unintuitive behaviour.
##
## If one uses purely vectors, ifelse is fine.
## But actually df$c, df$a, df$b should be treated each like a vector.
## However, `ifelse` does not.
## No warnings that using `ifelse` with them will lead to a
## totally different behaviour.
## In my view, this is a design mistake of `ifelse`.
## Thus I decided myself to abandon `ifelse` from my set of R commands.
## To avoid that such kind of mistakes can ever happen.
#####################################################
As @Parfait pointed out correctly, it was a misinterpretation.
The problem was that df$a was treated in the data frame as a factor.
df <- data.frame(a=a, b=b, c=tt, stringsAsFactors = FALSE)
df$d <- ifelse(df$c, df$a, df$b)
df
Gives the correct result.
a b c d
1 a 1 TRUE a
2 b 2 FALSE 2
3 c 3 TRUE c
4 d 4 FALSE 4
Thank you @Parfait for pointing that out!
Strange that I didn't recognize that in my initial trials.
But yeah, you are absolutely right!
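The factor effect can be reproduced side by side. The sketch below sets stringsAsFactors explicitly in both calls so that it does not depend on the default of the installed R version; the names df_fac, df_chr and the res_* variables are made up for illustration.

```r
tt <- c(TRUE, FALSE, TRUE, FALSE)
a <- c("a", "b", "c", "d")
b <- 1:4

# Same data twice: once with 'a' stored as a factor, once as character
df_fac <- data.frame(a = a, b = b, c = tt, stringsAsFactors = TRUE)
df_chr <- data.frame(a = a, b = b, c = tt, stringsAsFactors = FALSE)

# With a factor, ifelse picks up the underlying integer codes
res_fac <- ifelse(df_fac$c, df_fac$a, df_fac$b)

# Converting the factor to character first restores the labels
res_fix <- ifelse(df_fac$c, as.character(df_fac$a), df_fac$b)

# With a character column, no conversion is needed
res_chr <- ifelse(df_chr$c, df_chr$a, df_chr$b)

res_fac
res_fix
res_chr
```

So the surprising column d came from the factor codes of df$a, not from ifelse returning the whole of df$b.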

IF/THEN/ELSE in R

Hi everyone.
Hopefully an easy syntax question. I'm trying to create a new variable in a table in R which would say "1" if my patient was in the age range I was looking at, or "0" if not. The age range I'm interested in is 2-155. The code runs without any errors, but it is not working: when I look in my table, the new variable says 1 even though age4 is 158. Here is what I have:
table$newvar <- if (table$age4>=2 && table$age4 <=155) {table$newvar=1} else {table$newvar=0}
Any help is appreciated! Thanks in advance!
Two changes should be made:
Use the vectorized ifelse() function to generate the new column data.
Use the vectorized & logical-AND operator when combining the results of the comparisons.
table <- data.frame(age4=seq(1,200,10));
table$newvar <- ifelse(table$age4>=2 & table$age4<=155,1,0);
table;
## age4 newvar
## 1 1 0
## 2 11 1
## 3 21 1
## 4 31 1
## 5 41 1
## 6 51 1
## 7 61 1
## 8 71 1
## 9 81 1
## 10 91 1
## 11 101 1
## 12 111 1
## 13 121 1
## 14 131 1
## 15 141 1
## 16 151 1
## 17 161 0
## 18 171 0
## 19 181 0
## 20 191 0
The reason your code is not working is because the if statement and the && operator are not vectorized. The && operator only examines the first element of each operand vector, and only returns a one-element vector representing the result of the logical-AND on those two input values. The if statement always expects a one-element vector for its conditional, and executes the if-branch if that element is true, or the else-branch if false.
If you use a multiple-element vector as the conditional in the if statement, you get a warning:
if (c(T,F)) 1 else 0;
## [1] 1
## Warning message:
## In if (c(T, F)) 1 else 0 :
## the condition has length > 1 and only the first element will be used
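But for some odd reason, you don't get a warning if you use a multiple-element vector as an operand to && (or ||), at least in the R version used here (since R 4.3.0 this is an error, and since R 4.2.0 a multiple-element condition in if is an error as well):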
c(T,F) && c(T,F);
## [1] TRUE
That's why your code appeared to succeed (by which I mean it didn't print any warning message), but it didn't actually do what was intended.
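To make the difference concrete, here is a small sketch with a few hand-picked ages (the vector age4 and the names in_range, newvar and msg are illustrative, not taken from the question's data). The element-wise & yields one result per row, while all() or any() collapses that vector into the single value that if expects.

```r
age4 <- c(1, 2, 100, 155, 158)

# Element-wise comparison: one TRUE/FALSE per age
in_range <- age4 >= 2 & age4 <= 155

# Convert the logical vector to 0/1 for the new column
newvar <- as.integer(in_range)

# 'if' needs a single TRUE/FALSE; all() reduces the vector to one value
msg <- if (all(in_range)) "every age is in range" else "at least one age is out of range"

newvar
msg
```

Here the age of 158 gets 0, which is what the original if/&& code failed to produce.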
When used in arithmetic TRUE and FALSE become 1 and 0 so:
transform(table, newvar = (age4 >= 2) * (age4 <= 155) )
These also work:
transform(table, newvar = as.numeric( (age4 >= 2) & (age4 <= 155) ) )
transform(table, newvar = ( (age4 >= 2) & (age4 <= 155) ) + 0 )
transform(table, newvar = ifelse( (age4 >= 2) & (age4 <= 155), 1, 0) )
transform(table, newvar = (age4 %in% 2:155) + 0) # assuming no fractional ages

In R, how to add TRUE / FALSE column if a value differs from previous or following value?

Sample data:
df <- data.frame(
time = c(1, 2, 3, 4, 5, 6, 7),
status = c("good", "good", "good", "bad", "good", "good", "good")
)
Output:
time status
1 1 good
2 2 good
3 3 good
4 4 bad
5 5 good
6 6 good
7 7 good
I would like to add a new column statuschange IF status differs from the row above or below. The output would look like this:
time status statuschange
1 1 good NA
2 2 good TRUE
3 3 good FALSE
4 4 bad FALSE
5 5 good FALSE
6 6 good TRUE
7 7 good NA
I have the sense that there are lots of ways to do this, but I haven't been able to figure it out. Any assistance is appreciated!
You can apply diff to see if two entries are the same. You want two of these diffs, to see if both entries around an element are the same:
> !(c(NA, diff(as.numeric(x$status))) | c(rev(diff(as.numeric(rev(x$status)))), NA))
[1] NA TRUE FALSE FALSE FALSE TRUE NA
The first expression tells whether the prior element is different:
> c(NA, diff(as.numeric(x$status)))
[1] NA 0 0 -1 1 0 0
The second tells whether the following element is different:
> c(rev(diff(as.numeric(rev(x$status)))), NA)
[1] 0 0 1 -1 0 0 NA
The "or" operation | returns TRUE for nonzero, which means a leading or following element is different, so we then invert the result with the leading !.
You can use something like this
df$A <- rep(NA, 7)
for(i in 2:6){
# TRUE only if the status matches both the previous and the next row
df$A[i] <- df$status[i]==df$status[i-1] &
df$status[i]==df$status[i+1]
}
df
Another solution using looping:
statuschange <- logical()
for(i in seq_along(df$status))
{
statuschange[i]<-df$status[i]==df$status[i-1] && df$status[i]==df$status[i+1]
}
df$statuschange<-statuschange
Here's an approach using zoo::rollapply to loop through the column:
> library(zoo)
> c(NA, rollapply(x$status, width=3, FUN=function(x) x[1]==x[2] & x[2]==x[3]), NA)
[1] NA TRUE FALSE FALSE FALSE TRUE NA
What this does, is create a series of "windows" each of width 3, rolling across the data, and applies the function to each subset. As the window is 3, we end up with a vector that is two elements shorter than the original column (width - 1 shorter). The endpoint NA values are then added with c.
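A loop-free base R variant, offered as a sketch rather than as any answerer's method: shift the column by one position in each direction and compare. It works on character columns as well as factors, and the helper names prev_status and next_status are made up for illustration.

```r
df <- data.frame(
  time = c(1, 2, 3, 4, 5, 6, 7),
  status = c("good", "good", "good", "bad", "good", "good", "good"),
  stringsAsFactors = FALSE
)

n <- nrow(df)

# Status of the row above (NA for the first row) and below (NA for the last)
prev_status <- c(NA, df$status[-n])
next_status <- c(df$status[-1], NA)

# TRUE when the row matches both neighbours, as in the desired output
df$statuschange <- df$status == prev_status & df$status == next_status

df
```

One caveat: because NA & FALSE is FALSE in R, an end row whose only neighbour differs would get FALSE rather than NA. If the end rows must always be NA, set df$statuschange[c(1, n)] <- NA afterwards.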

R ddply, applying if and ifelse functions

I'm trying to apply a function to a dataframe using ddply from the plyr package, but I'm getting some results that I don't understand. I have 3 questions about the results.
Given:
mydf<- data.frame(c(12,34,9,3,22,55),c(1,2,1,1,2,2)
, c(0,1,2,1,1,2))
colnames(mydf)[1] <- 'n'
colnames(mydf)[2] <- 'x'
colnames(mydf)[3] <- 'x1'
mydf looks like this:
n x x1
1 12 1 0
2 34 2 1
3 9 1 2
4 3 1 1
5 22 2 1
6 55 2 2
Question #1
If I do:
k <- function(x) {
mydf$z <- ifelse(x == 1, 0, mydf$n)
return (mydf)
}
mydf <- ddply(mydf, c("x") , .fun = k, .inform = TRUE)
I get the following error:
Error in `$<-.data.frame`(`*tmp*`, "z", value = structure(c(12, 34, 9, :
replacement has 3 rows, data has 6
Error: with piece 1:
n x x1
1 12 1 0
2 9 1 2
3 3 1 1
I get this error regardless of whether I specify the variable to split by as c("x"), "x", or .(x). I don't understand why I'm getting this error message.
Question #2
But, what I really want to do is set up an if/else function because my dataset has variables x1, x2, x3, and x4 and I want to take those variables into account as well. But when I try something simple such as:
j <- function(x) {
if(x == 1){
mydf$z <- 0
} else {
mydf$z <- mydf$n
}
return(mydf)
}
mydf <- ddply(mydf, x, .fun = j, .inform = TRUE)
I get:
Warning messages:
1: In if (x == 1) { :
the condition has length > 1 and only the first element will be used
2: In if (x == 1) { :
the condition has length > 1 and only the first element will be used
Question #3
I'm confused about when to use function() and when to use function(x). Using function() for either j() or k() gives me a different error:
Error in .fun(piece, ...) : unused argument (piece)
Error: with piece 1:
n x x1 z
1 12 1 0 12
2 9 1 2 9
3 3 1 1 3
4 12 1 0 12
5 9 1 2 9
6 3 1 1 3
7 12 1 0 12
8 9 1 2 9
9 3 1 1 3
10 12 1 0 12
11 9 1 2 9
12 3 1 1 3
where column z is not correct. Yet I see a lot of functions written as function().
I sincerely appreciate any comments that can help me out with this
There's a lot that needs explaining here. Let's start with the simplest case. In your first example, all you need is:
mydf$z <- with(mydf,ifelse(x == 1,0,n))
An equivalent ddply solution might look like this:
ddply(mydf,.(x),transform,z = ifelse(x == 1,0,n))
Probably your biggest source of confusion is that you seem to not understand what is being passed as arguments to functions within ddply.
Consider your first attempt:
k <- function(x) {
mydf$z <- ifelse(x == 1, 0, mydf$n)
return (mydf)
}
The way ddply works is that it splits mydf up into several smaller data frames, based on the values in the column x. That means that each time ddply calls k, the argument passed to k is a data frame. Specifically, a subset of your primary data frame.
So within k, x is a subset of mydf, with all the columns. You should not be trying to modify mydf from within k. Modify x, and then return the modified version. (If you must, but the options I displayed above are better.) So we might re-write your k like this:
k <- function(x) {
x$z <- ifelse(x$x == 1, 0, x$n)
return (x)
}
Note that you've created some confusion by using x both as the argument to k and as the name of one of your columns.
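What ddply does with the function can be imitated in base R, which may make the "each call receives one piece" idea easier to see. This is a rough sketch of the split/apply/combine steps, not plyr's actual implementation; the function name add_z and its argument name piece are made up to avoid the clash with the column x.

```r
mydf <- data.frame(n  = c(12, 34, 9, 3, 22, 55),
                   x  = c(1, 2, 1, 1, 2, 2),
                   x1 = c(0, 1, 2, 1, 1, 2))

# The function receives one piece (a smaller data frame) at a time
add_z <- function(piece) {
  piece$z <- ifelse(piece$x == 1, 0, piece$n)
  piece
}

# Split by x, apply the function to each piece, then bind the pieces together
pieces <- split(mydf, mydf$x)
result <- do.call(rbind, lapply(pieces, add_z))
rownames(result) <- NULL

result
```

Inside add_z only piece is modified and returned; mydf itself is never touched, which is the point made above about k.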

Error in If else loop in R

I am writing some code in R. First I create a blank column in the data set, and then I want to assign 0 or 1 in that column according to some conditions. Here is my code:
#Creating an empty column in the data file
Mydata$final <- "";
#To assign 0,1 value in final variable
if(Mydata$Default_Config == "No" & is.na(Mydata$Best_Config)=="TRUE" & (Mydata$AlmostDefaultConfig!=1 | Mydata$AlmostDefaultConfig!=3)){
Mydata$final <- 1
}else{
Mydata$final <- 0
}
And I am getting this error
Warning message:
In if (Mydata$Default_Config == "No" & is.na(Mydata$Best_Config) == :
the condition has length > 1 and only the first element will be used
How Can I fix this error? Please help me out. Thanks in advance
Your problem is one of vectorisation. if is not vectorised. You are testing multiple values in each comparison in your if statement and R is telling you it will only use the first because if is not vectorised. You need ifelse which is vectorised:
ifelse( Mydata$Default_Config == "No" & is.na(Mydata$Best_Config)=="TRUE" & (Mydata$AlmostDefaultConfig!=1 | Mydata$AlmostDefaultConfig!=3) , 1 , 0 )
A reproducible example is below. If x is > 5 and y is even then return 1 otherwise return 0:
x <- 1:10
# [1] 1 2 3 4 5 6 7 8 9 10
y <- seq(1,30,3)
# [1] 1 4 7 10 13 16 19 22 25 28
x > 5
# [1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE
y %% 2 == 0
# [1] FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE
ifelse( x > 5 & y %% 2 == 0 , 1 , 0 )
# [1] 0 0 0 0 0 1 0 1 0 1
An alternative approach is to take advantage of R's coercion. You have a set of conditionals which are vectorizable, and R is happy to convert TRUE/FALSE to 1 / 0, so you can write it like:
Mydata$final <- ( (Mydata$Default_Config == "No") * (is.na(Mydata$Best_Config)) * ((Mydata$AlmostDefaultConfig!=1) + (Mydata$AlmostDefaultConfig!=3)) )
(extra parentheses added for clarity) .
Apologies if I fouled up the logic there.
Edit: My code for the OR won't quite work, since if both sides are TRUE you'd get a big number ("2" :-) ). Change it to as.logical((Mydata$AlmostDefaultConfig!=1) + (Mydata$AlmostDefaultConfig!=3))
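The arithmetic behind this can be checked on two small logical vectors. The sketch below uses made-up vectors a and b rather than the Mydata columns, and shows why the sum needs to be collapsed back to TRUE/FALSE before it is used as an OR.

```r
a <- c(TRUE, TRUE, FALSE, FALSE)
b <- c(TRUE, FALSE, TRUE, FALSE)

# Multiplication behaves like AND: 1 only where both are TRUE
and_num <- a * b

# Addition counts the TRUEs, so it reaches 2 where both are TRUE
sum_num <- a + b

# Collapsing the sum with as.logical() gives the same 0/1 as a plain OR
or_from_sum <- as.numeric(as.logical(a + b))
or_direct   <- as.numeric(a | b)

and_num
sum_num
or_from_sum
or_direct
```

Note also that a condition of the form (v != 1 | v != 3) is TRUE for every non-missing v, since no value equals both 1 and 3; if the intent was "neither 1 nor 3", the two comparisons would need to be joined with & instead.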

Resources