Alternatives to apply same condition to multiple variables inside case_when function - r

I am trying to find a more efficient or elegant way to apply the same condition to several columns inside case_when().
I am creating a dummy column based on multiple conditions across specific columns of a data frame. In many cases I apply the same is.na() test to many columns. I get the correct result, but my attempts with apply, reduce and anyNA were unsuccessful.
Let's say this data frame looks like the data I'm working on:
library(dplyr)
set.seed(12)
dframe <- data.frame(
  x1 = sample(letters[1:2], 10, replace = TRUE),
  x2 = sample(0:1, 10, replace = TRUE),
  x3 = sample(0:2, 10, replace = TRUE),
  x4 = sample(0:2, 10, replace = TRUE),
  x5 = sample(0:2, 10, replace = TRUE),
  x6 = sample(0:2, 10, replace = TRUE)
) %>%
  mutate_if(is.numeric, list(~na_if(., 2)))
And it looks like this:
x1 x2 x3 x4 x5 x6
1 b 1 NA 0 0 0
2 b 0 0 0 NA NA
3 b 1 0 0 0 1
4 a 0 NA 1 NA 0
5 a 1 1 NA NA NA
6 b 0 NA 1 1 1
7 a 1 1 NA NA 0
8 a 1 0 1 NA 0
9 b 1 NA NA 0 0
10 b 1 1 0 NA NA
Then, I create the column x7 based on the following conditions:
dframe %>%
mutate(
x7 = case_when(
x2 == 1 &
(!is.na(x3) | !is.na(x4) | !is.na(x5)) &
!is.na(x6) ~ 1,
x2 == 1 ~ 0,
TRUE ~ NA_real_
)
)
resulting in:
x1 x2 x3 x4 x5 x6 x7
1 b 1 NA 0 0 0 1
2 b 0 0 0 NA NA NA
3 b 1 0 0 0 1 1
4 a 0 NA 1 NA 0 NA
5 a 1 1 NA NA NA 0
6 b 0 NA 1 1 1 NA
7 a 1 1 NA NA 0 1
8 a 1 0 1 NA 0 1
9 b 1 NA NA 0 0 1
10 b 1 1 0 NA NA 0
However, I want to find an alternative to write (!is.na(x3) | !is.na(x4) | !is.na(x5)) because in my real script I have to type this for 11 columns.
I've tried to use complete.cases(x3, x4, x5), but it doesn't follow the logic I'm using in the code.
Using anyNA(x3, x4, x5) throws Error in anyNA(x3, x4, x5) : anyNA takes 1 or 2 arguments.
Also tried the answers of a similar problem, but since I'm not using it for filtering, it didn't work out.
Maybe I'm overthinking it, but what I'm looking for is something without having to use (!is.na(x3) | !is.na(x4) | !is.na(x5)).

We could use rowSums and specify the columns by name
library(dplyr)
dframe %>%
mutate(x7 = case_when(
x2 == 1 &
rowSums(!is.na(.[c("x3","x4","x5")])) > 0 &
!is.na(x6) ~ 1,
x2 == 1 ~ 0,
TRUE ~ NA_real_
)
)
Or by position
rowSums(!is.na(.[3:5])) > 0
We could do this using inverted logic as well.
rowSums(is.na(.[c("x3","x4","x5")])) != 3
Or
rowSums(is.na(.[3:5])) != 3
We use 3 here as there are 3 columns to check in the given example (x3, x4 and x5), you can change the number based on your actual number of columns (11).
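Another option, assuming dplyr 1.0.4 or later is installed, is if_any(), which expresses "at least one of these columns is not NA" directly and takes a column range, so it scales to 11 columns without retyping. This is only a sketch: the data frame below is typed in from the printed table in the question rather than regenerated with set.seed(), since sample() results can differ between R versions.

```r
library(dplyr)

# Data copied by hand from the table printed in the question
dframe <- data.frame(
  x1 = c("b", "b", "b", "a", "a", "b", "a", "a", "b", "b"),
  x2 = c(1, 0, 1, 0, 1, 0, 1, 1, 1, 1),
  x3 = c(NA, 0, 0, NA, 1, NA, 1, 0, NA, 1),
  x4 = c(0, 0, 0, 1, NA, 1, NA, 1, NA, 0),
  x5 = c(0, NA, 0, NA, NA, 1, NA, NA, 0, NA),
  x6 = c(0, NA, 1, 0, NA, 1, 0, 0, 0, NA)
)

result <- dframe %>%
  mutate(
    x7 = case_when(
      # if_any() is TRUE for a row when the predicate holds in any of x3..x5
      x2 == 1 & if_any(x3:x5, ~ !is.na(.x)) & !is.na(x6) ~ 1,
      x2 == 1 ~ 0,
      TRUE ~ NA_real_
    )
  )

result$x7
```

The range x3:x5 can be swapped for any tidyselect expression, for example starts_with() or all_of() with a character vector of the 11 column names.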

Related

Ifelse across multiple columns matching on similar attributes

I need to create a binary variable called dum, (perhaps using an ifelse statement) matching on the number of the column names.
ifelse f[number] %in% c(4:6) & l[number]==1, 1, else 0
f1<-c(3,2,1,6,5)
f2<-c(4,1,5,NA,NA)
f3<-c(5,3,4,NA,NA)
f4<-c(1,2,4,NA,NA)
l1<-c(1,0,1,0,0)
l2<-c(1,1,1,NA,NA)
l3<-c(1,0,0,NA,NA)
l4<-c(0,0,0,NA,NA)
mydata<-data.frame(f1,f2,f3,f4,l1,l2,l3,l4)
dum is 1 if f1 contains values between 4, 5, 6 AND l1 contains a value of 1, OR f2 contains values between 4, 5, 6 AND l2 contains a value of 1, and so on.
In essence, the expected output should be
f1 f2 f3 f4 l1 l2 l3 l4 dum
1 3 4 5 1 1 1 1 0 1
2 2 1 3 2 0 1 0 0 0
3 1 5 4 4 1 1 0 0 1
4 6 NA NA NA 0 NA NA NA 0
5 5 NA NA NA 0 NA NA NA 0
I can only think of doing it in a very long way such as
mutate(dum=ifelse(f1 %in% c(4:6) & l1==1, 1,
ifelse(f2 %in% c(4:6) & l2==1, 1,
ifelse(f3 %in% c(4:6) & l3==1, 1,
ifelse(f4 %in% c(4:6) & l4==1, 1, 0)))))
But this is burdensome since the real data has many more columns than that and can go up to f20 and l20.
Is there a more efficient way to do this?
Here is one suggestion, although the question is not entirely clear. It assumes you want a single column dum that indicates whether, in that row, any column contains the number that appears in its own column name:
library(dplyr)
library(readr)
mydata %>%
mutate(across(f1:l4, ~case_when(. == parse_number(cur_column()) ~ 1,
TRUE ~ 0), .names = 'new_{col}')) %>%
mutate(sumNew = rowSums(.[9:16])) %>%
mutate(dum = ifelse(sumNew >=1, 1, 0)) %>%
select(1:8, dum)
f1 f2 f3 f4 l1 l2 l3 l4 dum
1 3 4 5 1 1 1 1 0 1
2 2 1 3 2 0 1 0 0 1
3 1 5 4 4 1 1 0 0 1
4 6 NA NA NA 0 NA NA NA 0
5 5 NA NA NA 0 NA NA NA 0
Here is one option with across. Loop across the 'f' columns with the first condition, loop across the 'l' columns with the second condition, and join the two with & to get a logical matrix. Then take the row-wise sum (TRUE -> 1, FALSE -> 0), check whether that sum is greater than 0 (i.e. whether there is any TRUE in that row), and coerce the logical to binary with + or as.integer.
library(dplyr)
mydata %>%
mutate(dum = +(rowSums(across(starts_with('f'), ~.x %in% 4:6) &
across(starts_with('l'), ~ .x %in% 1)) > 0))
f1 f2 f3 f4 l1 l2 l3 l4 dum
1 3 4 5 1 1 1 1 0 1
2 2 1 3 2 0 1 0 0 0
3 1 5 4 4 1 1 0 0 1
4 6 NA NA NA 0 NA NA NA 0
5 5 NA NA NA 0 NA NA NA 0
We could also use base R
mydata$dum <- +(Reduce(`|`, Map(function(x, y) x %in% 4:6 &
y %in% 1, mydata[startsWith(names(mydata), "f")],
mydata[startsWith(names(mydata), "l")])))
Here's an approach multiplying two mapplys together, columns identified with grep, then calculating rowSums > 0. If you set na.rm=F you could get NAs in respective rows.
as.integer(rowSums(mapply(`%in%`, mydata[grep('^f', names(mydata))], list(4:6))*
mapply(`==`, mydata[grep('^l', names(mydata))], 1), na.rm=T) > 0)
# [1] 1 0 1 0 0
If f* and l* each aren't consecutive, rather use sort(grep(., value=T)).
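To make the point about column order concrete, here is a small base R sketch that pairs each f column with the l column of the same suffix by name, so the result does not depend on where the columns sit in the data frame. The shuffled column order and the names f_cols, l_cols and hits are made up for illustration.

```r
f1 <- c(3, 2, 1, 6, 5);   f2 <- c(4, 1, 5, NA, NA)
f3 <- c(5, 3, 4, NA, NA); f4 <- c(1, 2, 4, NA, NA)
l1 <- c(1, 0, 1, 0, 0);   l2 <- c(1, 1, 1, NA, NA)
l3 <- c(1, 0, 0, NA, NA); l4 <- c(0, 0, 0, NA, NA)
mydata <- data.frame(f1, f2, f3, f4, l1, l2, l3, l4)

# Deliberately scramble the column order (illustration only)
shuffled <- mydata[c("l3", "f2", "l1", "f4", "f1", "l4", "f3", "l2")]

# Find the f columns by name, then derive the matching l column names
f_cols <- sort(grep("^f", names(shuffled), value = TRUE))
l_cols <- sub("^f", "l", f_cols)
stopifnot(all(l_cols %in% names(shuffled)))

# One logical column per (f, l) pair; %in% treats NA as "no match"
hits <- mapply(function(f, l) f %in% 4:6 & l %in% 1,
               shuffled[f_cols], shuffled[l_cols])

dum <- as.integer(rowSums(hits) > 0)
dum
```

Because the partner name is built from the f name, this also works when the suffixes go past 9 (f10, f11, ...), where alphabetical sorting alone would not give numeric order.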

how to make binary variable?

I have three variables on the scale (0, 1, 2), for example:
x1 x2 x3
1 0 1
NA NA 0
1 1 1
NA NA NA
0 0 0
I want to create another variable, x4, that is 1 if x1 and/or x2 and/or x3 is 1. Sample values for x4 are shown below:
x1 x2 x3 x4
1 0 1 1
NA NA 0 0
1 1 1 1
NA NA NA NA
0 0 0 0
I am using RStudio. I tried ifelse but did not get what I wanted.
Can anyone suggest other ways to create this variable?
I used the following code:
data$hope <- ifelse(data$x1 > 0 && data$x2 > 0 && data$x3 > 0,1,0)
We could use pmax if there are only binary columns in the dataset
df1$x4 <- do.call(pmax, c(df1, na.rm = TRUE))
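If the columns can also hold 2, pmax would return 2 rather than 1, so a count-based version may be safer. The sketch below assumes the data frame is called df1 and holds only x1 to x3, typed in from the question's table. It flags rows that contain a 1 and keeps NA only where all three values are missing.

```r
# Example data typed in from the question's first table
df1 <- data.frame(
  x1 = c(1, NA, 1, NA, 0),
  x2 = c(0, NA, 1, NA, 0),
  x3 = c(1, 0, 1, NA, 0)
)

# TRUE when at least one of the columns equals 1 (NAs are ignored)
has_one <- unname(rowSums(df1 == 1, na.rm = TRUE) > 0)

# TRUE when every column in the row is NA
all_missing <- unname(rowSums(!is.na(df1)) == 0)

# 1/0 flag, with NA only for rows that carry no information at all
df1$x4 <- ifelse(all_missing, NA, as.integer(has_one))
df1
```

The attempt in the question fails for a different reason: && is meant for single TRUE/FALSE values, whereas the element-wise operators & and | work on whole columns.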

Create a new variable based on any 2 conditions being true

I have a dataframe in R with 4 variables and would like to create a new variable based on any 2 conditions being true on those variables.
I have attempted to create it with if/else statements, but that would require spelling out every combination of conditions being true. I would also need it to scale, so that I can create a new variable based on any 3 conditions being true. Is there a more efficient method than if/else statements?
My example:
I have a dataframe X with following column variables
x1 = c(1,0,1,0)
X2 = c(0,0,0,0)
X3 = c(1,1,0,0)
X4 = c(0,0,1,0)
I would like to create a new variable X5 if any 2 of the variables are true (eg ==1)
The new variable based on the above dataframe would produce X5 (1,0,1,0)
This can easily be done by using the apply function:
x1 = c(1,0,1,0)
x2 = c(0,0,0,0)
x3 = c(1,1,0,0)
x4 = c(0,0,1,0)
df <- data.frame(x1,x2,x3,x4)
df$x5 <- apply(df,1,function(row) ifelse(sum(row != 0) == 2, 1, 0))
x1 x2 x3 x4 x5
1 1 0 1 0 1
2 0 0 1 0 0
3 1 0 0 1 1
4 0 0 0 0 0
The second argument of apply (MARGIN = 1) means the function is applied to every row. To scale this up to 3...N true values, just change the number in the ifelse statement.
You can try this:
#Data
df <- data.frame(x1,X2,X3,X4)
#Code
df$X5 <- ifelse(rowSums(df,na.rm=T)==2,1,0)
x1 X2 X3 X4 X5
1 1 0 1 0 1
2 0 0 1 0 0
3 1 0 0 1 1
4 0 0 0 0 0
You can use:
df$X5 <- 1*(apply(df == 1, 1, sum) == 2)
or
# note: mapply(sum, df) would sum each column, not each row
df$X5 <- 1*(rowSums(df == 1) == 2)
Output
> df
X1 X2 X3 X4 X5
1 0 1 0 1
0 0 1 0 0
1 0 0 1 1
0 0 0 0 0
Data
df <- data.frame(X1,X2,X3,X4)
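The answers above all reduce to counting the TRUE conditions per row and comparing that count with a threshold. A minimal sketch of that idea follows, with the column list spelled out so that a previously added X5 is never counted by accident. The names count_true and X5_at_least are made up here. Note also that "exactly 2" and "at least 2" are different tests, and the question does not say which one is meant.

```r
df <- data.frame(
  X1 = c(1, 0, 1, 0),
  X2 = c(0, 0, 0, 0),
  X3 = c(1, 1, 0, 0),
  X4 = c(0, 0, 1, 0)
)

# Number of conditions that hold in each row (NA counts as "not true")
count_true <- rowSums(df[c("X1", "X2", "X3", "X4")] == 1, na.rm = TRUE)

# Exactly two conditions true
df$X5 <- as.integer(count_true == 2)

# Two or more conditions true (same result for this data, different in general)
df$X5_at_least <- as.integer(count_true >= 2)
df
```

To move to "any 3 conditions", only the number in the comparison changes.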

How to parse a column while referring to values from other multiple columns?

I have this sample dataframe where column a to d are reference columns and column x1-3 need to be parsed and plugged with new values.
Here is the code to re-produce the data frame:
df1 <- data.frame(a = c(0,1,0,1), b = c(0,0,1,1), c = c(0,0,0,0), d = c(1,0,0,1),
x1 = c(NA, NA, NA, NA), x2 = c(NA, NA, NA, NA), x3 = c(NA, NA, NA, NA))
I want to give new values to x1 -x3 based on different value combination from column a, b, c, d. My pseudocode is as follows:
for df1[ , "x1"]:
if a = 1: then return 1
else: return 0
for df1[ , "x2"]:
if a = 1 & b = 1: then return 1
else: return 0
for df1[ , "x3"]:
all conditions: return 1
Ideally, all the values in x1 and x2 will be changed according to their given conditions. x3 should be filled with 1 no matter what. Can anyone suggest an efficient way to loop and parse through those columns, please?
You don't need loops:
df1$x1 <- df1$a
df1$x2 <- as.integer(df1$a & df1$b)
df1$x3 <- 1
Result:
a b c d x1 x2 x3
1 0 0 0 1 0 0 1
2 1 0 0 0 1 0 1
3 0 1 0 0 0 0 1
4 1 1 0 1 1 1 1
Edit:
If columns a-d are not binary values (0 or 1) you still can use the same expressions to create columns x1-3. Let's say you have this data frame:
a b c d x1 x2 x3
1 0 0 1 5 NA NA NA
2 3 9 2 1 NA NA NA
3 4 2 3 5 NA NA NA
4 2 1 4 1 NA NA NA
And your conditions are:
x1 = 1 if (b >= 2) and (d < 4), 0 otherwise
x2 = 1 if (a > b) and (b < d), 0 otherwise
x3 = always 1
You can use the same methodology:
df1$x1 <- as.integer(df1$b >= 2 & df1$d < 4)
df1$x2 <- as.integer(df1$a > df1$b & df1$b < df1$d)
df1$x3 <- 1
Result:
a b c d x1 x2 x3
1 0 0 1 5 0 0 1
2 3 9 2 1 1 0 1
3 4 2 3 5 0 1 1
4 2 1 4 1 0 0 1
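The same assignments can be written in one call with base R's transform(), which lets the conditions refer to the columns without repeating df1$. This is just a compact restatement of the non-binary example above, with the data typed in from the printed table.

```r
# Non-binary example data from the table above
df1 <- data.frame(
  a = c(0, 3, 4, 2),
  b = c(0, 9, 2, 1),
  c = c(1, 2, 3, 4),
  d = c(5, 1, 5, 1)
)

# Each condition is evaluated for all rows at once; as.integer() maps
# TRUE/FALSE to 1/0
df1 <- transform(
  df1,
  x1 = as.integer(b >= 2 & d < 4),
  x2 = as.integer(a > b & b < d),
  x3 = 1
)
df1
```

Note that transform() evaluates every expression against the original columns, so a new column cannot refer to another column created in the same call.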

extract rows for which first non-zero element is one

I would like to extract every row from the data frame my.data for which the first non-zero element is a 1.
my.data <- read.table(text = '
x1 x2 x3 x4
0 0 1 1
0 0 0 1
0 2 1 1
2 1 2 1
1 1 1 2
0 0 0 0
0 1 0 0
', header = TRUE)
my.data
desired.result <- read.table(text = '
x1 x2 x3 x4
0 0 1 1
0 0 0 1
1 1 1 2
0 1 0 0
', header = TRUE)
desired.result
I am not even sure where to begin. Sorry if this is a duplicate. Thank you for any suggestions or advice.
Here's one approach:
# index of rows
idx <- apply(my.data, 1, function(x) any(x) && x[as.logical(x)][1] == 1)
# extract rows
desired.result <- my.data[idx, ]
The result:
x1 x2 x3 x4
1 0 0 1 1
2 0 0 0 1
5 1 1 1 2
7 0 1 0 0
Probably not the best answer, but:
rows.to.extract <- apply(my.data, 1, function(x) {
no.zeroes <- x[x!=0] # removing 0
to.return <- no.zeroes[1] == 1 # finding if first non-zero number is 1
# if a row is all 0, then to.return will be NA
# this fixes that problem
to.return[is.na(to.return)] <- FALSE # if row is all 0
to.return
})
my.data[rows.to.extract, ]
x1 x2 x3 x4
1 0 0 1 1
2 0 0 0 1
5 1 1 1 2
7 0 1 0 0
1. Use apply to iterate over all rows:
first.element.is.one <- apply(my.data, 1, function(x) x[x != 0][1] == 1)
The function passed to apply compares the first ([1]) non-zero (x != 0) element of x to 1. It is called once for each row; x is a vector of four in your example.
2. Use which to extract the indices of the candidate rows (and remove NA values, too):
desired.rows <- which(first.element.is.one)
3. Select the rows of the matrix -- you probably know how to do this.
Bonus question: Where do the NA values mentioned in step 2 come from?
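For larger data, a vectorized variant without apply is possible. The sketch below uses max.col() with ties.method = "first" to find the position of the first non-zero entry in each row, then looks up the value at that position. The data is typed in from the question, and the names first_idx, first_val and keep are made up for illustration.

```r
my.data <- data.frame(
  x1 = c(0, 0, 0, 2, 1, 0, 0),
  x2 = c(0, 0, 2, 1, 1, 0, 1),
  x3 = c(1, 0, 1, 2, 1, 0, 0),
  x4 = c(1, 1, 1, 1, 2, 0, 0)
)

m <- as.matrix(my.data)

# 1 where the entry is non-zero, 0 otherwise
nonzero <- (m != 0) * 1

# Column index of the first non-zero entry in each row; for an all-zero
# row every entry ties at 0, so this returns column 1
first_idx <- max.col(nonzero, ties.method = "first")

# Value found at that position (0 for an all-zero row)
first_val <- m[cbind(seq_len(nrow(m)), first_idx)]

# Keep rows whose first non-zero value is 1; all-zero rows give FALSE, not NA
keep <- first_val == 1
my.data[keep, ]
```

Since an all-zero row yields a looked-up value of 0, no separate NA handling is needed here.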
