I want to recode a variable into a new one.
If the value is 0, the new one should be 0 too.
If the value is 999, make it missing (NA).
Everything else should become 1.
This is my attempt:
id <- 1:10
variable <- c(0,0,0,1,2,3,4,5,999,999)
df <- data.frame(id,variable)
df$variable2 <-
if (df$variable == 0) {
df$variable2 = 0
} else if (df$variable == 999){
df$variable2 = NA
} else {
df$variable2 = 1
}
And this is the error message:
In if (df$variable == 0) { : the condition has length > 1 and only
the first element will be used
A pretty basic question but I'm a basic user. Thanks in advance!
Try ifelse
df$variable2 <- ifelse(df$variable == 999, NA, ifelse(df$variable > 0, 1, 0))
df
# id variable variable2
#1 1 0 0
#2 2 0 0
#3 3 0 0
#4 4 1 1
#5 5 2 1
#6 6 3 1
#7 7 4 1
#8 8 5 1
#9 9 999 NA
#10 10 999 NA
When you do df$variable == 0 the output / condition is
#[1] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
whereas if (condition) expects a length-one logical vector that is not NA; see ?"if".
You can also avoid ifelse, for example like so:
df$variable2 <- df$variable
df$variable2[df$variable2 == 999] <- NA
df$variable2[df$variable2 > 0] <- 1
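To see why `if` complains while the vectorised forms work, here is a minimal sketch using the data from the question. It relies on base R only; `by_ifelse` and `by_index` are names introduced just for this illustration, and `which()` is added here as a precaution so that the `NA` positions are never used as an index.

```r
variable <- c(0, 0, 0, 1, 2, 3, 4, 5, 999, 999)

# The comparison is evaluated element-wise, so the condition has length 10,
# whereas if() expects a single TRUE or FALSE.
length(variable == 0)

# Vectorised recode with nested ifelse()
by_ifelse <- ifelse(variable == 999, NA, ifelse(variable > 0, 1, 0))

# Vectorised recode with logical indexing. The 999 -> NA step has to come
# first; which() then skips the NA positions created by that step.
by_index <- variable
by_index[by_index == 999] <- NA
by_index[which(by_index > 0)] <- 1

# Both approaches should agree
identical(by_ifelse, by_index)
```

The order of the two indexing steps matters: if the `> 0` step ran first, the 999 values would be turned into 1 before they could be set to `NA`.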
It might be easier to avoid the if/else statement altogether by using logical conditions inside the subset brackets. Note that this overwrites df$variable in place; copy it to df$variable2 first if you want to keep the original.
When df$variable is equal to 0, set it to 0 (a no-op, shown for completeness):
df$variable[df$variable==0] <- 0
When df$variable is equal to 999, change it to NA:
df$variable[df$variable==999] <- NA
When df$variable is greater than 0 and is not NA, change it to 1:
df$variable[df$variable>0 & !is.na(df$variable)] <- 1
Looks like you want to recode your variable. You can do this (and other data/variable transformations) with the sjmisc-package, in your case with the rec()-command:
id <- 1:10
variable <- c(0,0,0,1,2,3,4,5,999,999)
df <- data.frame(id,variable)
library(sjmisc)
rec(df, variable, rec = c("0=0;999=NA;else=1"))
#> id variable variable_r
#> 1 1 0 0
#> 2 2 0 0
#> 3 3 0 0
#> 4 4 1 1
#> 5 5 2 1
#> 6 6 3 1
#> 7 7 4 1
#> 8 8 5 1
#> 9 9 999 NA
#> 10 10 999 NA
# or a single vector as input
rec(df$variable, rec = c("0=0;999=NA;else=1"))
#> [1] 0 0 0 1 1 1 1 1 NA NA
There are many more examples in the help file, and an sjmisc cheatsheet is available in the RStudio cheatsheet collection.
df$variable2 <- sapply(df$variable,
function(el) if (el == 0) {0} else if (el == 999) {NA} else {1})
This one-liner mirrors your description:
If value is 0, new one should be 0 too. If value is 999, then make it
missing, NA. Everything else 1
It is slightly slower than @markus's second solution or @SPJ's solution, which are the most R-ish ones.
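If you like the element-by-element style, `vapply()` is a stricter variant of the same idea, because it checks that each call returns exactly one value of the declared type. This is only a sketch: `recode_one` is a hypothetical helper name, and `NA_real_` is used so that every branch returns a double.

```r
variable <- c(0, 0, 0, 1, 2, 3, 4, 5, 999, 999)

# Hypothetical helper: recode a single value following the three rules
recode_one <- function(el) {
  if (el == 0) 0 else if (el == 999) NA_real_ else 1
}

# vapply() verifies that each result is a single numeric value
variable2 <- vapply(variable, recode_one, numeric(1))
variable2
```

Inside `recode_one` the plain `if` is fine, because `el` is always a single value there.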
Why one should keep one's hands off ifelse
tt <- c(TRUE, FALSE, TRUE, FALSE)
a <- c("a", "b", "c", "d")
b <- 1:4
ifelse(tt, a, b) ## [1] "a" "2" "c" "4"
# totally perfect and as expected!
df <- data.frame(a=a, b=b, c=tt)
df$d <- ifelse(df$c, df$a, df$b)
## > df
## a b c d
## 1 a 1 TRUE 1
## 2 b 2 FALSE 2
## 3 c 3 TRUE 3
## 4 d 4 FALSE 4
######### This is wrong!! ##########################
## df$d is not [1] "a" "2" "c" "4"
## the problem is that
## ifelse(df$c, df$a, df$b)
## returns for each TRUE or FALSE the entire
## df$a or df$b instead of treating it like a vector.
## Since the last df$c is FALSE, df$b is returned
## Thus we get df$b for df$d.
## Quite an unintuitive behaviour.
##
## If one uses purely vectors, ifelse is fine.
## But actually df$c, df$a, df$b should be treated each like a vector.
## However, `ifelse` does not.
## No warnings that using `ifelse` with them will lead to a
## totally different behaviour.
## In my view, this is a design mistake of `ifelse`.
## Thus I decided myself to abandon `ifelse` from my set of R commands.
## To avoid that such kind of mistakes can ever happen.
#####################################################
As @Parfait correctly pointed out, this was a misinterpretation.
The real problem was that df$a was stored in the data frame as a factor.
df <- data.frame(a=a, b=b, c=tt, stringsAsFactors = FALSE)
df$d <- ifelse(df$c, df$a, df$b)
df
This gives the correct result:
a b c d
1 a 1 TRUE a
2 b 2 FALSE 2
3 c 3 TRUE c
4 d 4 FALSE 4
Thank you, @Parfait, for pointing that out!
Strange that I didn't notice that in my initial trials.
But yeah, you are absolutely right!
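The factor effect can be reproduced without a data frame. The sketch below builds the factor by hand (which is what `data.frame()` did to character columns by default before R 4.0.0) and compares it with the character version; `from_factor` and `from_character` are illustrative names.

```r
tt <- c(TRUE, FALSE, TRUE, FALSE)
a  <- factor(c("a", "b", "c", "d"))  # character data stored as a factor
b  <- 1:4

# With a factor, ifelse() ends up with the underlying integer codes
from_factor <- ifelse(tt, a, b)

# Converting to character first gives the labels back
from_character <- ifelse(tt, as.character(a), b)

from_factor
from_character
```

So the surprising result comes from the factor, not from `ifelse` treating data frame columns differently from plain vectors.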
I have stumbled across the behaviour of dplyr::filter in a complex statement on a large dataframe, which basically comes down to the treatment of NA values:
df <- tibble(a = c(rep(1,3),
rep(NA, 3)))
A tibble: 6 x 1
a
<dbl>
1 1
2 1
3 1
4 NA
5 NA
6 NA
Filtering for rows that equal 1 gives the expected result:
df %>% filter(a == 1)
A tibble: 3 x 1
a
<dbl>
1 1
2 1
3 1
Filtering for rows that do not equal 1, I would expect the remaining 3 rows of the df to be returned, which is not the case, however:
df %>% filter(!a == 1)
A tibble: 0 x 1
... with 1 variables: a <dbl>
So while in the first case NA is interpreted as not equaling 1, in the second case, it is interpreted as equaling 1. Is there a logic I am missing here?
I know I can use %in% to get the expected result:
df %>% filter(!a %in% 1)
A tibble: 3 x 1
a
<dbl>
1 NA
2 NA
3 NA
but it seems strange to me to use this operator with just one element (rather than a vector).
So my questions to the experts: Is this the intended behaviour of filter? Is it common practice to use %in% when negating a filter condition?
This is due to the behaviour of %in%, not filter.
Let's use a simple example:
a = c(1, 1, 1, NA, NA, NA)
> a == 1
[1] TRUE TRUE TRUE NA NA NA
> a != 1
[1] FALSE FALSE FALSE NA NA NA
> !(a == 1)
[1] FALSE FALSE FALSE NA NA NA
We see that when we use the relational operators == or !=, NA values in the input remain NA in the output. However...
> a %in% 1
[1] TRUE TRUE TRUE FALSE FALSE FALSE
> !(a %in% 1)
[1] FALSE FALSE FALSE TRUE TRUE TRUE
With the %in% operator, NA values in the input become FALSE in the output. Since this is supposed to be the more intuitive interface for match(), let's take a look at that as well:
> match(a, 1)
[1] 1 1 1 NA NA NA
So nope, match() itself doesn't behave this way, at least not with the default arguments. However, the help file ?match explains:
%in% is currently defined as
"%in%" <- function(x, table) match(x, table, nomatch = 0) > 0
There you have it. When we use a %in% 1, we are actually doing the following:
> match(a, 1, nomatch = 0L)
[1] 1 1 1 0 0 0
> match(a, 1, nomatch = 0L) > 0L
[1] TRUE TRUE TRUE FALSE FALSE FALSE
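filter() keeps only the rows for which the condition is TRUE. With == or != the NA rows evaluate to NA and are dropped, whereas with !(a %in% 1) they evaluate to TRUE, which is why filter() returns the NA rows in that case.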
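A base R sketch of the same logic, using `which()` as a stand-in for the keep-only-`TRUE` rule. The vector `a` mirrors the example above; the `kept_*` names are illustrative.

```r
a <- c(1, 1, 1, NA, NA, NA)

# Keep only positions where the condition is TRUE; NA counts as "not TRUE"
kept_neq <- a[which(a != 1)]

# Spelling out the NA case is an alternative to negating %in%
kept_explicit <- a[which(a != 1 | is.na(a))]
kept_in       <- a[which(!(a %in% 1))]

length(kept_neq)
identical(kept_explicit, kept_in)
```

In dplyr the explicit form would read `filter(df, a != 1 | is.na(a))`, which some readers may find clearer than `%in%` with a single value.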
I created two nested for loops to complete the following:
Iterating through each column that is not the first column:
Iterate through each row i that is NOT the last row (the last row is denoted j)
Compare the value in i to the value in j.
If i is NA, i = NA.
If i >= j, i = 0.
If i < j, i = 1.
Store the results of all iterations across all columns and rows in a df.
The code below creates some test data but produces a Value "out" that is NULL (empty). Any recommendations?
# Create df
a <- rnorm(5)
b <- c(rnorm(3),NA,rnorm(1))
c <- rnorm(5)
df <- data.frame(a,b,c)
rows <- nrow(df) # rows
cols <- ncol(df) # cols
out <- for (c in 2:cols){
for (r in 1:(rows - 1)){
ifelse(
is.na(df[r,c]),
NA,
df[r, c] <- df[r, c] < df[rows, c])
}
}
There's no need for looping at all. Use a vectorised function like sweep to compare, via >, all the other rows (df[-nrow(df),]) against your last row (df[nrow(df),]):
df
# a b c
#1 -0.2739735 0.5095727 0.30664838
#2 0.7613023 -0.1509454 -0.08818313
#3 -0.4781940 1.5760307 0.46769601
#4 1.1754130 NA 0.33394212
#5 0.5448537 1.0493805 -0.10528847
sweep(df[-nrow(df),], 2, unlist(df[nrow(df),]), FUN=`>`)
# a b c
#1 FALSE FALSE TRUE
#2 TRUE FALSE TRUE
#3 FALSE TRUE TRUE
#4 TRUE NA TRUE
sweep(df[-nrow(df),], 2, unlist(df[nrow(df),]), FUN=`>`) + 0
# a b c
#1 0 0 1
#2 1 0 1
#3 0 1 1
#4 1 NA 1
Here is another option. We can replicate the last row to make the dimensions of both datasets equal and then do the > to get a logical index, which can be coerced to binary by wrapping with +.
+(df[-nrow(df),] > df[nrow(df),][col(df[-nrow(df),])])
# a b c
#1 0 0 1
#2 1 0 1
#3 0 1 1
#4 1 NA 1
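As for why `out` is `NULL` in the question: a `for` loop in R always evaluates to `NULL`, so its result cannot be captured by assignment. The sketch below shows that, and then applies the rule as literally stated in the question (1 when a value is below the last row, 0 otherwise, `NA` kept), which means `<` rather than `>`. The data values are made up for illustration, all columns are compared as in the answers above, and `as.matrix()` is used so the comparison runs on a plain numeric matrix.

```r
# A for loop evaluates to NULL, whatever happens in its body
res <- for (i in 1:3) i * 2
is.null(res)

# Made-up data with one NA
df <- data.frame(a = c(1, 5, 3), b = c(2, NA, 4), c = c(9, 0, 4))

last <- unlist(df[nrow(df), ])      # values of the last row
rest <- as.matrix(df[-nrow(df), ])  # all other rows as a numeric matrix

# 1 where a value is below the last row's value, 0 otherwise, NA stays NA
out <- sweep(rest, 2, last, FUN = `<`) + 0
out
```

To leave the first column out, as the question describes, the same call could be applied to `rest[, -1, drop = FALSE]` and `last[-1]`.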
I have a dataset with two variables:
ACCURACY Feedback
141 0 3
156 0 1
167 1 2
185 1 1
191 1 NA
193 1 1
I have created a new column called X, where I would like to assign 3 potential values (correct, incorrect, unknown) based on combinations between the previous two values (i.e. accuracy ~ Feedback).
I have tried the following:
df$X=NA
df[!is.na((df$ACC==1)&(df$Feedback==1)),]$X <- "correct"
df[!is.na((df$ACC==1)&(df$Feedback==2)),]$X <- "unknown"
df[!is.na((df$ACC==1)&(df$Feedback==3)),]$X <- "incorrect"
df[!is.na((df$ACC==0)&(df$Feedback==1)),]$X <- "correct"
df[!is.na((df$ACC==0)&(df$Feedback==2)),]$X <- "unknown"
df[!is.na((df$ACC==0)&(df$Feedback==3)),]$X <- "incorrect"
But this doesn't assign a value to X based on both ACC and Feedback; instead, each line of code overwrites the values assigned by the previous one.
I would appreciate any guidance/suggestions.
This can be done with nested ifelse functions. Although, based on the example posted, it looks like X depends only on Feedback, never ACCURACY.
ACCURACY Feedback
1 0 3
2 0 1
3 1 2
4 1 1
5 1 NA
6 1 1
df$X <- ifelse(df$ACCURACY == 1, ifelse(df$Feedback == 1, "correct", ifelse(df$Feedback == 2, "unknown", "incorrect")), ifelse(df$Feedback == 1, "correct", ifelse(df$Feedback == 2, "unknown", "incorrect")))
ACCURACY Feedback X
1 0 3 incorrect
2 0 1 correct
3 1 2 unknown
4 1 1 correct
5 1 NA <NA>
6 1 1 correct
If the values of X indeed do not depend on ACCURACY, you could just recode Feedback as a factor
df$X <- factor(df$Feedback,
levels = c(1, 2, 3),
labels = c("correct", "unknown", "incorrect"))
The issue is that you've wrapped all the assignment conditions with !is.na. These vectors all evaluate to the same thing. For example:
> !is.na((df$ACC==1)&(df$Feedback==2))
[1] TRUE TRUE TRUE TRUE FALSE TRUE
> !is.na((df$ACC==1)&(df$Feedback==3))
[1] TRUE TRUE TRUE TRUE FALSE TRUE
A possible solution would be to write a little function to do the assignments you want, and then use apply.
recoder <- function(row) {
accuracy <- row[['ACCURACY']]
feedback <- row[['Feedback']]
if(is.na(accuracy) || is.na(feedback)) {
ret_val <- NA
}
else if((accuracy==1 && feedback==1) || (accuracy==0 && feedback==1)) {
ret_val <- "correct"
}
else if((accuracy==1 && feedback==2) || (accuracy==0 && feedback==2)) {
ret_val <- "unknown"
}
else {
ret_val <- "incorrect"
}
return(ret_val)
}
df$X <- apply(df, 1, recoder)
df
> df
ACCURACY Feedback X
141 0 3 incorrect
156 0 1 correct
167 1 2 unknown
185 1 1 correct
191 1 NA <NA>
193 1 1 correct
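A vectorised alternative to the row-wise function, staying close to the original attempt: replacing `!is.na(...)` with `which(...)` selects only the rows where the combined condition is actually `TRUE`. This is a sketch under the assumption that ACCURACY is always 0 or 1; the data frame is retyped from the question with default row names.

```r
df <- data.frame(ACCURACY = c(0, 0, 1, 1, 1, 1),
                 Feedback = c(3, 1, 2, 1, NA, 1))

df$X <- NA_character_

# which() keeps only positions where the test is TRUE and drops NA,
# so each assignment touches just the rows it is meant for
df$X[which(df$ACCURACY %in% c(0, 1) & df$Feedback == 1)] <- "correct"
df$X[which(df$ACCURACY %in% c(0, 1) & df$Feedback == 2)] <- "unknown"
df$X[which(df$ACCURACY %in% c(0, 1) & df$Feedback == 3)] <- "incorrect"
df
```

Rows with a missing Feedback never match any condition, so they keep the initial `NA`.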
I have a data frame structured something like:
a <- c(1,1,1,2,2,2,3,3,3,3,4,4)
b <- c(1,2,3,1,2,3,1,2,3,4,1,2)
c <- c(NA, NA, 2, NA, 1, 1, NA, NA, 1, 1, NA, NA)
df <- data.frame(a,b,c)
Where a and b uniquely identify an observation. I want to create a new variable, d, which indicates if each observation's value for b is present at least once in c as grouped by a. Such that d would be:
[1] 0 1 0 1 0 0 1 0 0 0 0 0
I can write a for loop which will do the trick,
attach(df)
for (i in unique(a)) {
for (j in b[a == i]) {
df$d[a == i & b == j] <- ifelse(j %in% c[a == i], 1, 0)
}
}
But surely in R there must be a cleaner/faster way of achieving the same result?
Using data.table:
library(data.table)
setDT(df) #convert df to a data.table without copying
# +() is code golf for as.integer
df[ , d := +(b %in% c), by = a]
# a b c d
# 1: 1 1 NA 0
# 2: 1 2 NA 1
# 3: 1 3 2 0
# 4: 2 1 NA 1
# 5: 2 2 1 0
# 6: 2 3 1 0
# 7: 3 1 NA 1
# 8: 3 2 NA 0
# 9: 3 3 1 0
# 10: 3 4 1 0
# 11: 4 1 NA 0
# 12: 4 2 NA 0
Adding the dplyr version for those of that persuasion. All credit due to @akrun.
library(dplyr)
df %>% group_by(a) %>% mutate(d = +(b %in% c))
And for posterity, a base R version as well (via @thelatemail below):
df <- df[order(df$a, df$b), ]
df$d <- unlist(by(df, df$a, FUN = function(x) (x$b %in% x$c) + 0L ))
The above answer by MichaelChirico apparently works well and is correct. I rarely use data.table so I don't understand the syntax. This is a way to get the same results without data.table.
invisible(lapply(unique(df$a), function(x) {
df$d[df$a==x] <<- 0L + (df$b[df$a==x] %in% df$c[df$a==x])
}))
This code takes each unique level of a and then modifies the data.frame for that level of a using the logic you request. The <<- is necessary because df would otherwise be modified only inside the anonymous function passed to lapply and not in .GlobalEnv. With <<-, R looks up the enclosing environments until it finds where df is defined and assigns to df there.
Also, note the slightly different version of the + "trick": the leading 0 makes it clearer to the reader that the result is an integer vector, because the logical vector must be coerced for the addition to work. The L after the 0 indicates that 0 is an integer and not a double. Note that the notation used by MichaelChirico for this conversion gives the same result (a column of class integer).
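If you would rather avoid both `<<-` and reordering the rows, base R's `split()` and `unsplit()` are one more option. This is a sketch on the data from the question; `per_group` is an illustrative name.

```r
a <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 4, 4)
b <- c(1, 2, 3, 1, 2, 3, 1, 2, 3, 4, 1, 2)
c <- c(NA, NA, 2, NA, 1, 1, NA, NA, 1, 1, NA, NA)
df <- data.frame(a, b, c)

# One integer vector per level of a: is each b found among that group's c?
per_group <- lapply(split(df, df$a), function(g) as.integer(g$b %in% g$c))

# unsplit() puts the pieces back in the original row order
df$d <- unsplit(per_group, df$a)
df$d
```

Because `unsplit()` uses the same grouping vector as `split()`, the result lines up with the rows of `df` even if the data are not sorted by a.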
Fake data for illustration:
df <- data.frame(a=c(1,2,3,4,5), b=c(2,2,2,2,NA),
c=c(NA,2,3,4,5))
This would get me the answer I want IF it weren't for the NA values:
df$count <- with(df, (a==1) + (b==2) + (c==3))
Also, would there be an even more elegant way if I was only interested in, e.g. variables==2?
df$count <- with(df, (a==2) + (b==2) + (c==2))
Many thanks!
The following works for your specific example, but I have a suspicion that your real use case is more complicated:
df$count <- apply(df,1,function(x){sum(x == 1:3,na.rm = TRUE)})
> df
a b c count
1 1 2 NA 2
2 2 2 2 1
3 3 2 3 2
4 4 2 4 1
5 5 NA 5 0
but this general approach should work. For instance, your second example would be something like this:
df$count <- apply(df,1,function(x){sum(x == 2,na.rm = TRUE)})
or more generally you could allow yourself to pass in a variable for the comparison:
df$count <- apply(df,1,function(x,compare){sum(x == compare,na.rm = TRUE)},compare = 1:3)
Another way is to subtract your target vector from each row of your data.frame, negate and then do rowSums with na.rm=TRUE:
target <- 1:3
rowSums(!(df-rep(target,each=nrow(df))),na.rm=TRUE)
[1] 2 1 2 1 0
target <- rep(2,3)
rowSums(!(df-rep(target,each=nrow(df))),na.rm=TRUE)
[1] 1 3 1 1 0
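A variant of the same idea that compares directly instead of subtracting, which may read more clearly. It is a sketch on the fake data from the question (retyped here so the block is self-contained); `m`, `count_123` and `count_2` are illustrative names.

```r
df <- data.frame(a = c(1, 2, 3, 4, 5),
                 b = c(2, 2, 2, 2, NA),
                 c = c(NA, 2, 3, 4, 5))
m <- as.matrix(df)

# One target per column: compare column-wise, then count TRUEs per row.
# na.rm = TRUE makes a missing value count as "no match".
count_123 <- rowSums(sweep(m, 2, 1:3, FUN = `==`), na.rm = TRUE)

# Same target for every column: a plain comparison is enough
count_2 <- rowSums(m == 2, na.rm = TRUE)

count_123
count_2
```

Working on `as.matrix(df)` before adding the count column keeps the new column from being compared against the targets.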