Unexpected behavior with case_when and is.na - r

I want to change all NA values in a column to 0 and all other values to 1. However, I can't get the combination of case_when and is.na to work.
# Create dataframe
a <- c(rep(NA,9), 2, rep(NA, 10))
b <- c(rep(NA,9), "test", rep(NA, 10))
df <- data.frame(a,b, stringsAsFactors = F)
# Create new column (c), where all NA values in (a) are transformed to 0 and other values are transformed to 1
df <- df %>%
mutate(
c = case_when(
a == is.na(.$a) ~ 0,
FALSE ~ 1
)
)
I expect column (c) to contain 0 for every NA and a single 1, but it's all 0s.
It does work when I use an if_else statement with is.na, like:
df <- df %>%
mutate(
c = if_else(is.na(a), 0, 1)
)
What is going on here?

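case_when() expects a logical condition on the left-hand side of each formula, so test is.na(a) directly and use TRUE as the catch-all. In your version, a == is.na(.$a) compares the values of a with a logical vector, and FALSE ~ 1 can never match. You should be doing this instead: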
df %>%
mutate(
c = case_when(
is.na(a) ~ 0,
TRUE ~ 1
)
)
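To see why the original condition never fires, it can help to evaluate both conditions on a small vector. This is only an illustrative sketch; the names cond_original, cond_fixed and res are made up for the example.

```r
library(dplyr)

# Small stand-in for column `a`
a <- c(NA, 2, NA)

# Original condition: compares `a` with a logical vector.
# NA == TRUE is NA and 2 == FALSE is FALSE, so it is never TRUE.
cond_original <- a == is.na(a)

# Corrected condition: a plain logical test
cond_fixed <- is.na(a)

# NA values become 0, everything else falls through to the TRUE branch
res <- case_when(cond_fixed ~ 0, TRUE ~ 1)
```

Printing cond_original shows only NA and FALSE, which is why no row ever matched the first branch.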

Related

How to select variables with numeric suffixes lower than a value

I have a data frame similar to this one.
df <- data.frame(id=c(1,2,3), tot_1=runif(3, 0, 100), tot_2=runif(3, 0, 100), tot_3=runif(3, 0, 100), tot_4=runif(3, 0, 100))
I want to select, or run an operation on, only the columns whose suffixes are lower than 3.
#select
df <- df %>% select(id, tot_1, tot_2)
#or sum
df <- df %>% mutate(sumVar = rowSums(across(c(tot_1, tot_2))))
However, in my real data, there are many more variables and not in order. So how could I select them without doing it manually?
We may use matches
df %>%
mutate(sumVar = rowSums(across(matches('tot_[1-2]$'))))
If we need to be more flexible, extract the numeric part of the column names that start with 'tot', subset them based on the condition, and use the resulting names:
library(stringr)
nm1 <- str_subset(names(df), 'tot')
nm2 <- nm1[readr::parse_number(nm1) <3]
df %>%
mutate(sumVar = rowSums(across(all_of(nm2))))
Solution with num_range
This is a rare use case for the often-forgotten num_range() selection helper, available through dplyr, which matches column names built from a prefix and a range of numbers in a single step:
Determine the threshold
suffix_threshold <- 3
select()
library(dplyr)
df %>% select(id, num_range(prefix='tot_',
range=seq_len(suffix_threshold-1)))
id tot_1 tot_2
1 1 26.75082 26.89506
2 2 21.86453 18.11683
3 3 51.67968 51.85761
mutate() with rowSums()
library(dplyr)
df %>% mutate(sumVar = across(num_range(prefix='tot_', range=seq_len(suffix_threshold-1)))%>%
rowSums)
id tot_1 tot_2 tot_3 tot_4 sumVar
1 1 26.75082 26.89506 56.27829 71.79353 53.64588
2 2 21.86453 18.11683 12.91569 96.14099 39.98136
3 3 51.67968 51.85761 25.63676 10.01408 103.53730
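As a small check that num_range() copes with columns that are out of order and with multi-digit suffixes, here is a sketch on made-up data. The data frame df_demo and its values are hypothetical, chosen only so the sums are easy to verify by hand.

```r
library(dplyr)

# Hypothetical data: columns out of order, including a two-digit suffix
df_demo <- data.frame(
  tot_10 = c(1, 2),
  tot_2  = c(3, 4),
  id     = c(1, 2),
  tot_3  = c(5, 6),
  tot_1  = c(7, 8)
)

suffix_threshold <- 3

# Keep id plus every tot_ column whose suffix is below the threshold
selected <- df_demo %>%
  select(id, num_range(prefix = "tot_", range = seq_len(suffix_threshold - 1)))

# Row-wise sum over the same set of columns (tot_1 and tot_2 only)
summed <- df_demo %>%
  mutate(sumVar = rowSums(across(num_range("tot_", seq_len(suffix_threshold - 1)))))
```

Here tot_10 is not picked up, because num_range() only matches the names formed from the prefix and the numbers in the range.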
Here is a base R way -
cols <- grep('tot_', names(df), value = TRUE)
#Select
df[c('id', cols[as.numeric(sub('tot_', '',cols)) < 3])]
# id tot_1 tot_2
#1 1 75.409112 30.59338
#2 2 9.613496 44.96151
#3 3 58.589574 64.90672
#Rowsums
df$sumVar <- rowSums(df[cols[as.numeric(sub('tot_', '',cols)) < 3]])
df
# id tot_1 tot_2 tot_3 tot_4 sumVar
#1 1 75.409112 30.59338 59.82815 50.495758 106.00250
#2 2 9.613496 44.96151 84.19916 2.189482 54.57501
#3 3 58.589574 64.90672 18.17310 71.390459 123.49629

R Set Column Value based on other Column Values

I need to set the values of a column to 0 or 1 based on other columns values.
If they are 0 or NA the new column should be 1.
I thought about:
ifelse(df[,53:62]==0|NA, df$newCol <- 1, df$newCol <- 0)
But in the end I get only 1 in the new column.
Thanks for your help
I think the tidyverse is a good fit for this common use case:
library(tidyverse)
df_example <- matrix(c(0,1),ncol = 100,nrow = 100) %>%
as_tibble()
df_example %>%
mutate(across(.cols = 53:62,
.fns = ~ if_else(.x == 0|is.na(.x),
1,
0))
) %>%
select(V54) # example: inspect one of the recoded columns
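The answer above recodes each of the selected columns in place. If the goal is instead a single new column, one possible reading of the question is that newCol should be 1 when every selected column is 0 or NA, and 0 otherwise. The base R sketch below assumes that reading; df_demo, cols and the column names x1 and x2 are hypothetical stand-ins for columns 53 to 62.

```r
# Hypothetical data standing in for the real columns
df_demo <- data.frame(
  x1 = c(0, NA, 2, 0),
  x2 = c(NA, 0, 0, 0)
)
cols <- c("x1", "x2")

# TRUE where a cell is NA or 0 (TRUE | NA evaluates to TRUE, so no NA remains)
is_zero_or_na <- is.na(df_demo[cols]) | df_demo[cols] == 0

# 1 when all selected cells in the row are 0 or NA, otherwise 0
df_demo$newCol <- as.integer(rowSums(is_zero_or_na) == length(cols))
```

If "any column is 0 or NA" is the intended rule, the comparison would be rowSums(is_zero_or_na) > 0 instead.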

How to use case_when with mutate_all to insert variable value

I have a seemingly small problem. I want to use mutate_all() in conjunction with case_when(). A sample data frame:
tbl <- tibble(
x = c(0, 1, 2, 3, NA),
y = c(0, 1, NA, 2, 3),
z = c(0, NA, 1, 2, 3),
date = rep(today(), 5)
)
I first made another data frame, replacing all the NAs with zeros and all other values with 1, using the following piece of code.
tbl %>%
mutate_all(
funs(
case_when(
. %>% is.na() ~ 0,
TRUE ~ 1
)))
Now I want to replace the NA values with blanks ("") and leave the other values as they are. However, I don't know how to set the TRUE branch so that it keeps the value of the column.
Any suggestions would be much appreciated!
To replace the NAs with "", we can use replace_na from tidyr
library(dplyr)
library(tidyr)
tbl %>%
mutate_all(replace_na, "")
# A tibble: 5 x 3
# x y z
# <chr> <chr> <chr>
#1 0 0 0
#2 1 1 ""
#3 2 "" 1
#4 3 2 2
#5 "" 3 3
With case_when or if_else, we have to make sure the types are the same across all branches. Here, inserting "" makes the result character, so the other values must also be coerced to character
tbl %>%
mutate_all(~ case_when(is.na(.) ~ "", TRUE ~ as.character(.)))
If we want to use only specific columns, then we can use mutate_at
tbl %>%
mutate_at(vars(x:y), ~ case_when(is.na(.) ~ "", TRUE ~ as.character(.)))
Also, to simplify the code in OP's post, it can be directly coerced to integer with as.integer or +
tbl %>%
mutate_all(~ as.integer(!is.na(.)))
Or if we are using case_when
tbl %>%
mutate_all(~ case_when(is.na(.)~ 0, TRUE ~ 1))
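In dplyr 1.0.0 and later, mutate_all() and mutate_at() are superseded by across(). A rough equivalent of the snippets above, sketched on a small made-up tibble (tbl_demo is hypothetical and omits the date column), could look like this:

```r
library(dplyr)

# Small stand-in for the question's data
tbl_demo <- tibble(
  x = c(0, 1, NA),
  y = c(NA, 2, 3)
)

# Replace NA with "" and coerce the remaining values to character
blanked <- tbl_demo %>%
  mutate(across(everything(), ~ case_when(is.na(.x) ~ "", TRUE ~ as.character(.x))))

# 0 for NA, 1 otherwise, via logical-to-integer coercion
flags <- tbl_demo %>%
  mutate(across(everything(), ~ as.integer(!is.na(.x))))
```

To restrict the operation to some columns, everything() can be swapped for a selection such as x:y.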

Making tidyeval function inside case_when

I have a data set in which I would like to impute one value with the others, based on the probability distribution of those values. Let me make a reproducible example first
library(tidyverse)
library(janitor)
dummy1 <- runif(5000, 0, 1)
dummy11 <- case_when(
dummy1 < 0.776 ~ 1,
dummy1 < 0.776 + 0.124 ~ 2,
TRUE ~ 5)
df1 <- tibble(q1 = dummy11)
here is the output:
df1 %>% tabyl(q1)
q1 n percent
1 3888 0.7776
2 605 0.1210
5 507 0.1014
I used mutate and sample to redistribute the value 5 among the values 1 and 2, like this:
df1 %>%
mutate(q1 = case_when(q1 == 5 ~ sample(
2,
length(q1),
prob = c(0.7776, 0.1210),
replace = TRUE
),
TRUE ~ as.integer(q1))
)
and here is the result :
q1 n percent
1 4322 0.8644
2 678 0.1356
This approach seems to work. However, since I need to apply it to several variables, I tried to write a function that works with the tidyverse using tidyeval, like this
my_impute <- function(.data, .prob_var, ...) {
.prob_var <- enquo(.prob_var)
.data %>%
sample(2, prob=c(!!.prob_var), replace = TRUE)
}
# running on data
df1 %>%
mutate(q1 = case_when(q1 == 5 ~ !!my_impute(q1),
TRUE ~ as.integer(q1))
)
The error is:
Error in eval_tidy(pair$lhs, env = default_env) : object 'q1' not found
We need the prob values from the 'percent' column generated from tabyl, so the function can be modified to
library(janitor)
library(dplyr)
my_impute <- function(.data, .prob_var, vals, ...) {
.prob_var = enquo(.prob_var)
.prob_vals <- .data %>%
janitor::tabyl(!!.prob_var) %>%
filter(!!.prob_var %in% vals) %>%
pull(percent)
.data %>%
mutate(!! .prob_var := case_when(!! .prob_var == 5 ~
sample(
2,
n(),
prob = .prob_vals,
replace = TRUE
),
TRUE ~ as.integer(!! .prob_var))
)
}
df1 %>%
my_impute(q1, vals = 1:2) %>%
tabyl(q1)
# q1 n percent
# 1 4285 0.857
# 2 715 0.143
Just to add my two cents: recent versions of rlang (0.4.0 and later) let you replace the enquo() + !! quasiquotation pattern with the curly-curly operator, {{ }}, to embrace function arguments. The function would look like this:
my_impute <- function(.data, .prob_var, vals, ...) {
#.prob_var = enquo(.prob_var)
# commented out since it is no longer needed
.prob_vals <- .data %>%
janitor::tabyl({{.prob_var}}) %>%
filter({{.prob_var}} %in% vals) %>%
pull(percent)
.data %>%
mutate( {{.prob_var}} := case_when( {{.prob_var}} == 5 ~
sample(
2,
n(),
prob = .prob_vals,
replace = TRUE
),
TRUE ~ as.integer({{.prob_var}}))
)
}
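The same embracing pattern can be shown without janitor. The sketch below is not the function from the answers: impute_value() and its arguments from, to and prob are hypothetical names introduced for illustration, and the probabilities are passed in directly rather than computed from the data.

```r
library(dplyr)

# Hypothetical helper: replace `from` in the chosen column with values
# sampled from `to` using the supplied probabilities
impute_value <- function(.data, .var, from, to, prob) {
  .data %>%
    mutate({{ .var }} := case_when(
      {{ .var }} == from ~ sample(to, n(), prob = prob, replace = TRUE),
      TRUE ~ as.integer({{ .var }})
    ))
}

set.seed(1)
df_demo <- tibble(
  q1 = c(1, 2, 5, 5, 1),
  q2 = c(5, 5, 2, 1, 2)
)

# Only q1 is touched; the column is passed unquoted
out <- df_demo %>%
  impute_value(q1, from = 5, to = 1:2, prob = c(0.8, 0.2))
```

Only the column argument is embraced with {{ }}. Ordinary values such as from, to and prob are used as plain R objects.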

Apply function over data frame rows

I'm trying to apply a function over the rows of a data frame and return a value based on the value of each element in a column. I'd prefer to pass the whole dataframe instead of naming each variable as the actual code has many variables - this is a simple example.
I've tried purrr map_dbl and rowwise but can't get either to work. Any suggestions please?
#sample df
df <- data.frame(Y=c("A","B","B","A","B"),
X=c(1,5,8,23,31))
#required result
Res <- data.frame(Y=c("A","B","B","A","B"),
X=c(1,5,8,23,31),
NewVal=c(10,500,800,230,3100)
)
#use mutate and map or rowwise etc
Res <- df %>%
mutate(NewVal=map_dbl(.x=.,.f=FnAdd(.)))
Res <- df %>%
rowwise() %>%
mutate(NewVal=FnAdd(.))
#sample fn
FnAdd <- function(Data){
if(Data$Y=="A"){
X=Data$X*10
}
if(Data$Y=="B"){
X=Data$X*100
}
return(X)
}
If there are many distinct values, it is better to build a key/value dataset, join it, and then do the multiplication
keyVal <- data.frame(Y = c("A", "B"), NewVal = c(10, 100))
df %>%
left_join(keyVal, by = "Y") %>%
mutate(NewVal = X*NewVal)
# Y X NewVal
#1 A 1 10
#2 B 5 500
#3 B 8 800
#4 A 23 230
#5 B 31 3100
It is not clear how many unique values there are in the 'Y' column of the actual dataset. If there are only a few, then case_when can be used
FnAdd <- function(Data){
Data %>%
mutate(NewVal = case_when(Y == "A" ~ X * 10,
Y == "B" ~ X *100,
TRUE ~ X))
}
FnAdd(df)
# Y X NewVal
#1 A 1 10
#2 B 5 500
#3 B 8 800
#4 A 23 230
#5 B 31 3100
You were originally looking for a solution using dplyr's rowwise() function, so here is that solution. The nice thing about this approach is that you don't need to create a separate function.
Here's the version using ifelse():
df %>%
rowwise() %>%
mutate(NewVal = ifelse(Y == "A", X * 10,
ifelse(Y == "B", X * 100, NA_real_)))
and here's the version using case_when:
df %>%
rowwise() %>%
mutate(NewVal = case_when(Y == "A" ~ X * 10,
Y == "B" ~ X * 100))
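If a function really has to see one row at a time, another option, not taken from the answers above, is purrr::pmap_dbl(), which calls a function once per row with the columns as named arguments. The function fn_row() below is a hypothetical rewrite of FnAdd for that calling convention.

```r
library(purrr)

df_demo <- data.frame(
  Y = c("A", "B", "B", "A", "B"),
  X = c(1, 5, 8, 23, 31)
)

# Hypothetical row-level function: receives one row's values by column name;
# `...` absorbs any other columns that are not used here
fn_row <- function(Y, X, ...) {
  if (Y == "A") {
    X * 10
  } else if (Y == "B") {
    X * 100
  } else {
    NA_real_
  }
}

# The whole data frame is passed; pmap_dbl() iterates over its rows
res <- df_demo
res$NewVal <- pmap_dbl(df_demo, fn_row)
```

For simple recoding like this, the vectorized case_when or join approaches above will usually be faster. The row-wise form is mainly useful when the per-row logic cannot be vectorized.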
