I have a data set of boolean variables and I am trying to generate a new variable based on 3 of the existing booleans using ifelse().
The rules I'd like to implement are:
If any of the three columns has value 1, the result is 1
If all three columns have value 0, the result is 0
If all three columns are NA, the result is NA
If the three columns hold some combination of 0 and NA, the result is 0
Here is the code to generate a sample with 3 variables that I want to use to create a fourth:
df <- structure(list(var1 = c(NA, NA, NA, 0,1),
var2 = c(1, NA, 0,0, 1),
var3 = c(NA, NA, NA,0,1)), class = "data.frame", row.names = c(NA, -5L))
I have tried the following to generate the new variable according to my desired rules:
df$newvar1 <-ifelse(df$var1 == 1 | df$var2 == 1 |df$var3 == 1, 1,
ifelse((is.na(df$var1) & is.na(df$var2) & is.na(df$var3)), NA,0))
df$newvar2 <- ifelse((is.na(df$var1)|df$var1==0) &
(is.na(df$var2)|df$var2==0) &
(is.na(df$var3)|df$var3==0),0,
ifelse(df$var1 == 1 | df$var2 == 1 |df$var3 == 1, 1,
ifelse(is.na(df$var1) & is.na(df$var2) & is.na(df$var3), NA,NA)))
df$newvar3 <-ifelse(df$var1 == 1 | df$var2 == 1 |df$var3 == 1, 1,
ifelse((is.na(df$var1) & is.na(df$var2) & is.na(df$var3)), NA,
ifelse((is.na(df$var1)|df$var1==0) &
(is.na(df$var2)|df$var2==0) &
(is.na(df$var3)|df$var3==0),0,0)))
I don't understand why newvar1 and newvar3 have NA values corresponding to combinations of NAs and 0s when both examples use "&" between the na specifications (row 3 in the results).
I am assuming that NAs don't show up in newvar2 because the first ifelse() function takes precedence.
Any insight into the ifelse() function or advice on how to get the results I'm looking for would be really helpful.
Here is another possible option using rowSums:
df$newvar <- +(rowSums(df, na.rm = TRUE) * NA ^ (rowSums(!is.na(df)) == 0) > 0)
# var1 var2 var3 newvar
#1 NA 1 NA 1
#2 NA NA NA NA
#3 NA 0 NA 0
#4 0 0 0 0
#5 1 1 1 1
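To see how the one-liner works, it may help to split it into steps. The sketch below is an illustration rather than part of the original answer: the intermediate names (cols, row_total, all_missing, na_mask) are made up here, and it assumes df holds the three input columns from the question. The trick relies on NA ^ 0 being 1 and NA ^ 1 being NA in R.

```r
df <- data.frame(var1 = c(NA, NA, NA, 0, 1),
                 var2 = c(1, NA, 0, 0, 1),
                 var3 = c(NA, NA, NA, 0, 1))
cols <- c("var1", "var2", "var3")   # hypothetical: restrict to the three inputs

row_total   <- rowSums(df[cols], na.rm = TRUE)   # NAs are skipped, so they count as 0
all_missing <- rowSums(!is.na(df[cols])) == 0    # TRUE only where every value is NA
na_mask     <- NA ^ all_missing                  # NA where all_missing, otherwise 1

# The product is NA for all-NA rows; everywhere else it is the row total.
newvar <- +(row_total * na_mask > 0)
newvar
```

Restricting to cols should keep the result stable if df later gains extra columns such as newvar itself.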
This gives your expected results:
df$newvar <- 0
df$newvar[Reduce(`|`, lapply(df[1:3], `%in%`, 1))] <- 1
df$newvar[Reduce(`&`, lapply(df[1:3], is.na))] <- NA
df
# var1 var2 var3 newvar
# 1 NA 1 NA 1
# 2 NA NA NA NA
# 3 NA 0 NA 0
# 4 0 0 0 0
# 5 1 1 1 1
This defaults to 0 and only changes values with known conditions: rows containing at least one 1 become 1 (even if the other values are NA or 0), and rows that are entirely NA become NA. Everything else, i.e. any mix of 0 and NA, keeps the default of 0.
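As for why newvar1 and newvar3 return NA in row 3: the first test uses ==, and a comparison with NA is itself NA. For row 3 the test is NA | FALSE | NA, which is NA, and ifelse() returns NA wherever its test is NA, so the inner ifelse() calls are never consulted for that row. A minimal sketch of this, plus one possible fix using %in% (the names test_eq, any_one and all_na are made up for illustration):

```r
df <- data.frame(var1 = c(NA, NA, NA, 0, 1),
                 var2 = c(1, NA, 0, 0, 1),
                 var3 = c(NA, NA, NA, 0, 1))

# The original test: NA in rows 2 and 3, so ifelse() yields NA there.
test_eq <- df$var1 == 1 | df$var2 == 1 | df$var3 == 1

# %in% never returns NA, so this test is always TRUE or FALSE.
any_one <- df$var1 %in% 1 | df$var2 %in% 1 | df$var3 %in% 1
all_na  <- is.na(df$var1) & is.na(df$var2) & is.na(df$var3)

df$newvar <- ifelse(any_one, 1, ifelse(all_na, NA, 0))
df
```

With a test that cannot be NA, the nested ifelse() only produces NA where all_na says so.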
Related
I try to create a subset, where I remove all answers == 0 for variable B, given another variable A == 1. However, I want to keep the NAs in Variable B (just remove the 0s).
I tried it with this df2 <- subset(df, B[df$A == 1] > 0) but the result makes no sense. Can someone help?
i <- c(1:10)
A <- c(0,1,1,1,0,0,1,1,0,1)
B <- c(0, 10, 13, NA, NA, 9, 0, 0, 3, NA)
df <- data.frame(i, A, B)
subset takes a condition and returns only the rows where it evaluates to TRUE. Comparisons such as NA == 0 or NA != 0 always return NA, which is neither TRUE nor FALSE, so subset drops those rows. There are multiple ways around this:
subset(df, !(A == 1 & B == 0) | is.na(B))
or:
subset(df, !(A == 1 & B %in% 0))
There are plenty of other options available, however.
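A quick way to check the NA behaviour described above, sketched with plain indexing instead of subset (the names drop and kept are made up for illustration):

```r
df <- data.frame(i = 1:10,
                 A = c(0, 1, 1, 1, 0, 0, 1, 1, 0, 1),
                 B = c(0, 10, 13, NA, NA, 9, 0, 0, 3, NA))

NA == 0      # NA: comparing with a missing value gives a missing result
NA %in% 0    # FALSE: %in% never returns NA

# Rows to remove: A is 1 and B is exactly 0 (NA in B does not match).
drop <- df$A == 1 & df$B %in% 0
kept <- df[!drop, ]
kept$i
```

Because drop contains no NA, negating it keeps the rows where B is missing.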
This should work, if I understand it correctly:
subset(df, (df$A == 1) & ((df$B != 0) | (is.na(df$B))))
outputs:
i A B
2 2 1 10
3 3 1 13
4 4 1 NA
10 10 1 NA
If you do not want to specify every single column, you can change the 0s to NA and the NAs (temporarily) to a number (for example 999 or -999), and switch back after you are finished. Note that na.omit then drops every row that contains a 0 in any column, including A.
i <- c(1:10)
A <- c(0,1,1,1,0,0,1,1,0,1)
B <- c(0, 10, 13, NA, NA, 9, 0, 0, 3, NA)
df <- data.frame(i, A, B)
df[is.na(df)] <- 999
df[df==0] <- NA
df <- na.omit(df)
df[df==999] <- NA
i A B
2 2 1 10
3 3 1 13
4 4 1 NA
10 10 1 NA
If i is unique, identify which cases you want to remove and select the rest, try:
# %in% avoids the silent recycling that != performs when several rows match
df[!df$i %in% subset(df, A==1 & B==0)$i, ]
Output:
i A B
1 1 0 0
2 2 1 10
3 3 1 13
4 4 1 NA
5 5 0 NA
6 6 0 9
9 9 0 3
10 10 1 NA
As the title suggests, my issue is the following. I have one variable identifying the beginning of an event and another variable indicating the end time of the same event. I want a variable indicating whether an event took place or not.
dat <-
data.frame(
"t" = c(1:10),
"id1" = c(1, NA, NA, NA, 2, 3, NA, NA, 4, NA),
"id2" = c(NA, 1, NA, NA, NA, 2, NA, 3, 4, NA),
"desiredoutcome" = c(1, 1, 0, 0, 1, 1, 1, 1,1, 0)
)
Here, the variable desiredoutcome takes value 1 whenever the row lies between the same value of id1 and id2. Consider e.g. row 6: it is both the end of the id = 2 event and the start of the id = 3 event, and the dummy should hence be 1.
Any idea how I can achieve this?
How about this?
#Position of non-NA index in `id1`
inds1 <- which(!is.na(dat$id1))
#Corresponding position of non-NA index in `id2`
inds2 <- match(dat$id1[inds1], dat$id2)
#Initialise the result column to 0
dat$result <- 0
#create a sequence between inds1 and inds2 and assign value as 1.
dat$result[unique(unlist(Map(seq, inds1, inds2)))] <- 1
dat
# t id1 id2 desiredoutcome result
#1 1 1 NA 1 1
#2 2 NA 1 1 1
#3 3 NA NA 0 0
#4 4 NA NA 0 0
#5 5 2 NA 1 1
#6 6 3 2 1 1
#7 7 NA NA 1 1
#8 8 NA 3 1 1
#9 9 4 4 1 1
#10 10 NA NA 0 0
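For reference, the intermediate values can be inspected step by step. This is only a sketch of the same idea: the names starts, ends, covered and result are made up here, and the stopifnot line spells out an assumption the approach seems to rely on, namely that every id in id1 also appears in id2 at or after its start.

```r
dat <- data.frame(t = 1:10,
                  id1 = c(1, NA, NA, NA, 2, 3, NA, NA, 4, NA),
                  id2 = c(NA, 1, NA, NA, NA, 2, NA, 3, 4, NA))

starts <- which(!is.na(dat$id1))            # rows where an event begins: 1 5 6 9
ends   <- match(dat$id1[starts], dat$id2)   # rows where the same id ends: 2 6 8 9

# Assumption: each event has an end, and it does not precede its start.
stopifnot(!anyNA(ends), all(ends >= starts))

# Every row index that falls inside some start:end range.
covered <- unique(unlist(Map(seq, starts, ends)))
result  <- as.integer(seq_len(nrow(dat)) %in% covered)
result
```

If an id were missing from id2, match would return NA and seq would fail, so the check makes that case explicit.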
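Here is one way to do it. First convert NA to 0:
dat[is.na(dat)] <- 0
Then use ifelse:
ifelse((dat$id1 == 0 & dat$id2 == 0), 0, 1)
To attach the result to the data frame as a named column:
dat <- cbind(dat, result = ifelse((dat$id1 == 0 & dat$id2 == 0), 0, 1))
Note that this flags only rows where id1 or id2 is set, so row 7, which lies inside an event but has NA in both columns, gets 0 rather than the desired 1.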
I have a data frame that has NA's in every row. Some are on the left, some in the middle, and some on the right. Something like this:
a <- c(NA, NA, 1, NA)
b <- c(NA, 1, 1, NA)
c <- c(NA, NA, 1, 1)
d <- c(1, 1, NA, 1)
df <- data.frame(a, b, c, d)
df
# a b c d
# NA NA NA 1
# NA 1 NA 1
# 1 1 1 NA
# NA NA 1 1
I would like to replace all the NAs that are in the middle and on the right side with 0 but keep all the NA's leading to a 1 on the left as NA. So I would like an efficient way (my data frame is large) to have this data frame:
# a b c d
# NA NA NA 1
# NA 1 0 1
# 1 1 1 0
# NA NA 1 1
We can use apply to loop over the rows and find the index of the first occurrence of 1, then replace the NAs from that element to the last with 0:
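df[] <- t(apply(df, 1, function(x) {
  i1 <- which(x == 1)[1]        # position of the first 1 in the row
  if (is.na(i1)) return(x)      # no 1 in this row: leave it unchanged
  i2 <- i1:length(x)            # from the first 1 to the end of the row
  x[i2][is.na(x[i2])] <- 0      # zero out the NAs in that stretch
  x
}))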
Or another option is
df[] <- t(apply(df, 1, function(x) replace(x,
cumsum(x ==1 & !is.na(x)) >= 1 & is.na(x), 0)))
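To see what the cumsum condition in the second option does, here it is traced on a single row. This is just an illustration; the names is_one, seen_one, to_zero and fixed are made up.

```r
x <- c(NA, 1, NA, 1)                # second row of the sample data

is_one   <- x == 1 & !is.na(x)      # FALSE  TRUE FALSE  TRUE
seen_one <- cumsum(is_one) >= 1     # FALSE  TRUE  TRUE  TRUE (a 1 has appeared so far)
to_zero  <- seen_one & is.na(x)     # FALSE FALSE  TRUE FALSE (NAs after the first 1)

fixed <- replace(x, to_zero, 0)
fixed
```

Leading NAs have seen_one equal to FALSE, so they are left alone.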
Imagine a small data set like the one below, composed of three variables:
v1 <- c(0, 1, NA, 1, NA, 0)
v2 <- c(0, 0, NA, 1, NA, NA)
v3 <- c(1, NA, 0, 0, NA, 0)
df <- data.frame(v1, v2, v3)
df
v1 v2 v3
1 0 0 1
2 1 0 NA
3 NA NA 0
4 1 1 0
5 NA NA NA
6 0 NA 0
One can use the is.na command as follows to calculate the number of rows with at least one missing value - and R would return 4:
sum(is.na(df$v1) | is.na(df$v2) | is.na(df$v3))
Or the number of rows with all three values missing - and R would return 1:
sum(is.na(df$v1) & is.na(df$v2) & is.na(df$v3))
Two questions at this point:
(1) How can I calculate the number of rows where "exactly one" or "exactly two" values are missing?
(2) If I am to do the above in a large data set, how can I limit the scope of the calculation to v1, v2 and v3 (that is, without having to create a subset)?
I tried variations of is.na, nrow and df, but could not get any of them to work.
Thanks!
We can use rowSums on the logical matrix (is.na(df)) and check whether the number of NAs in each row is equal to the value of interest.
n1 <- 1
sum(rowSums(is.na(df))==n1)
To make it easier, create a function to do this
f1 <- function(dat, n){
sum(rowSums(is.na(dat)) == n)
}
f1(df, 0)
#[1] 2
f1(df, 1)
#[1] 2
f1(df, 3)
#[1] 1
f1(df, 2)
#[1] 1
NOTE: rowSums is very fast, but on a large dataset the intermediate logical matrix can itself take up a lot of memory. In that case we can loop over the columns of the dataset (lapply(df, is.na)) and add the results up with Reduce.
sum(Reduce(`+`, lapply(df, is.na))==1)
#[1] 2
f2 <- function(dat, n){
sum(Reduce(`+`, lapply(dat, is.na))==n)
}
f2(df, 1)
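Neither function above limits the count to particular columns, which was the second part of the question. A minimal sketch of one way to do that without creating a subset first, assuming the columns are selected by name; count_na_rows and its cols argument are hypothetical names, and the extra column other is added only to show that it is ignored:

```r
df <- data.frame(v1 = c(0, 1, NA, 1, NA, 0),
                 v2 = c(0, 0, NA, 1, NA, NA),
                 v3 = c(1, NA, 0, 0, NA, 0),
                 other = c(NA, "a", "b", NA, "c", "d"))  # hypothetical extra column

# Count rows with exactly n missing values among the chosen columns.
count_na_rows <- function(dat, n, cols = names(dat)) {
  sum(rowSums(is.na(dat[cols])) == n)
}

exactly_one <- count_na_rows(df, 1, c("v1", "v2", "v3"))
exactly_two <- count_na_rows(df, 2, c("v1", "v2", "v3"))
c(exactly_one, exactly_two)
```

Leaving cols at its default counts across all columns, which matches f1.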
Try this:
num.rows.with.x.NA <- function(df, x, cols=names(df)) {
return(sum(apply(df, 1, function(y) sum(is.na(y[cols])) == x)))
}
df
v1 v2 v3
1 0 0 1
2 1 0 NA
3 NA NA 0
4 1 1 0
5 NA NA NA
6 0 NA 0
num.rows.with.x.NA(df, 0, names(df))
#[1] 2
num.rows.with.x.NA(df, 1, names(df))
#[1] 2
num.rows.with.x.NA(df, 2, names(df))
#[1] 1
num.rows.with.x.NA(df, 3, names(df))
#[1] 1
ORIGINAL QUESTION
I want to add a series of dummy variables in a data frame for each value of x in that data frame but containing an NA if another variable is NA. For example, suppose I have the below data frame:
x <- seq(1:5)
y <- c(NA, 1, NA, 0, NA)
z <- data.frame(x, y)
I am looking to produce:
var1 such that: z$var1 == 1 if x == 1, else if y == NA, z$var1 == NA, else z$var1 == 0.
var2 such that: z$var2 == 1 if x == 2, else if y == NA, z$var2 == NA, else z$var2 == 0.
var3 etc.
I can't seem to figure out how to vectorize this. I am looking for a solution that can be used for a large count of values of x.
UPDATE
There was some confusion that I wanted to iterate through each index of x. I am not looking for this, but rather for a solution that creates a variable for each unique value of x. When taking the below data as an input:
x <- c(1,1,2,3,9)
y <- c(NA, 1, NA, 0, NA)
z <- data.frame(x, y)
I am looking for z$var1, z$var2, z$var3, z$var9 where z$var1 <- c(1, 1, NA, 0, NA) and z$var2 <- c(NA, 0, 1, 0, NA). The original solution produces z$var1 <- z$var2 <- c(1,1,NA,0,NA).
You can use ifelse, which is vectorized, to construct the variables:
cbind(z, setNames(data.frame(sapply(unique(x), function(i) ifelse(x == i, 1, ifelse(is.na(y), NA, 0)))),
paste("var", unique(x), sep = "")))
x y var1 var2 var3 var9
1 1 NA 1 NA NA NA
2 1 1 1 0 0 0
3 2 NA NA 1 NA NA
4 3 0 0 0 1 0
5 9 NA NA NA NA 1
Update:
cbind(z, data.frame(sapply(unique(x), function(i) ifelse(x == i, 1, ifelse(is.na(y), NA, 0)))))
x y X1 X2 X3 X4
1 1 NA 1 NA NA NA
2 1 1 1 0 0 0
3 2 NA NA 1 NA NA
4 3 0 0 0 1 0
5 9 NA NA NA NA 1