I have a situation where my data frame can contain different errors and I want to catch both cases with an if statement afterwards.
Situation 1: the data frame contains NA
library(dplyr)
data(iris)
attach(iris)
data <- iris %>% filter(Sepal.Length >=7.9)
sepal_slope <- data %>% group_by(Species) %>%
do(fit = lm(Sepal.Width ~ Sepal.Length, .)) %>%
summarise(sepal_slope = coef(fit)[2])
this is FALSE:
nrow(sepal_slope) == 0
# FALSE
is.na is TRUE here as intended
is.na(sepal_slope)
# TRUE
Situation 2: the data frame is empty
data <- iris %>% filter(Sepal.Length >=12)
sepal_slope <- data %>% group_by(Species) %>%
do(fit = lm(Sepal.Width ~ Sepal.Length, .)) %>%
summarise(sepal_slope = coef(fit)[2])
now this is TRUE as intended:
nrow(sepal_slope) == 0
# TRUE
but this does not return TRUE; it returns an empty, zero-row result:
is.na(sepal_slope)
# sepal_slope
So I cannot use
if(nrow(sepal_slope) == 0 | is.na(sepal_slope)) sepal_slope <- 5
# Error in if (nrow(sepal_slope) == 0 | is.na(sepal_slope)) sepal_slope <- 5 :
argument is of length zero
How can I catch both situations in one if statement?
Of course, the case where sepal_slope contains a numeric value should also be handled: the if condition should yield FALSE there, so the value is left unchanged.
If you coerce sepal_slope to numeric and take the first element, both cases return TRUE from is.na (indexing an empty vector with [1] gives NA):
if(is.na(as.numeric(unlist(sepal_slope))[1])) sepal_slope <- 5
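A minimal base R sketch of the same check, written so it can be tried without dplyr. The helper name is_missing_slope and the three small data frames are made up for illustration; they stand in for the NA, empty, and valid results from the question. It also uses || rather than |, so the second test is only evaluated when the vector is non-empty.

```r
# Hypothetical helper: TRUE when the slope is NA or the result has no rows.
is_missing_slope <- function(x) {
  v <- as.numeric(unlist(x))
  # `||` short-circuits, so is.na() is never asked about a zero-length vector.
  length(v) == 0 || is.na(v[1])
}

# Stand-ins for the three situations (assumed shapes, not the real lm output).
na_case    <- data.frame(sepal_slope = NA_real_)
empty_case <- data.frame(sepal_slope = numeric(0))
ok_case    <- data.frame(sepal_slope = 0.5)

results <- c(
  na    = is_missing_slope(na_case),
  empty = is_missing_slope(empty_case),
  ok    = is_missing_slope(ok_case)
)
print(results)
```

With a helper like this, the replacement becomes if (is_missing_slope(sepal_slope)) sepal_slope <- 5.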
I am getting myself confused with dplyr and if_any. I am trying to perform something along these lines:
If a column is present, then evaluate an expression. If not, return FALSE.
So these three scenarios capture what I am thinking:
library(dplyr)
dat <- data.frame(x = 1)
## GOOD: if foo_col is NA then return FALSE
dat %>%
mutate(foo_col = NA_character_) %>%
mutate(present = if_any(matches("foo_col"), ~ !is.na(.x)))
#> x foo_col present
#> 1 1 <NA> FALSE
## GOOD: if foo_col is not NA then return TRUE
dat %>%
mutate(foo_col = "value") %>%
mutate(present = if_any(matches("foo_col"), ~ !is.na(.x)))
#> x foo_col present
#> 1 1 value TRUE
## NOT GOOD: if foo_col is absent, return TRUE? Want this to be FALSE.
dat %>%
mutate(present = if_any(matches("foo_col"), ~ !is.na(.x)))
#> x present
#> 1 1 TRUE
So can anyone suggest a way to determine how I could check for the is.na condition but also if the column is actually there?
If we need the last case to be FALSE while still getting TRUE/FALSE for the other two cases, we can also check that the selection is not empty:
library(dplyr)
dat %>%
mutate(present = ncol(pick(matches("foo_col"))) > 0 &
if_any(matches("foo_col"), ~ !is.na(.x)))
-output
x present
1 1 FALSE
Or, as @boshek mentioned in the comments, rlang::is_empty should work as well:
dat %>%
mutate(present = !rlang::is_empty((across(matches("foo_col")))) &
if_any(matches("foo_col"), ~ !is.na(.x)))
-output
x present
1 1 FALSE
For the other cases
> dat %>%
+ mutate(foo_col = NA_character_) %>%
+ mutate(present = ncol(pick(matches("foo_col"))) > 0 &if_any(matches("foo_col"), ~ !is.na(.x)))
x foo_col present
1 1 <NA> FALSE
> dat %>%
+ mutate(foo_col = "value") %>%
+ mutate(present = ncol(pick(matches("foo_col"))) > 0 & if_any(matches("foo_col"), ~ !is.na(.x)))
x foo_col present
1 1 value TRUE
NOTE: This test cannot distinguish a FALSE caused by NA values in the column from a FALSE caused by the column not being found.
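As a rough base R sketch of the same "present and not NA" logic: the helper col_present below is hypothetical and takes an exact column name, whereas the answer above uses matches(), which is a regular expression and may select several columns.

```r
# Hypothetical helper: row-wise TRUE only if the column exists and is not NA.
col_present <- function(df, col) {
  if (col %in% names(df)) {
    !is.na(df[[col]])
  } else {
    # Column absent: FALSE for every row.
    rep(FALSE, nrow(df))
  }
}

dat <- data.frame(x = 1)
dat_na    <- data.frame(x = 1, foo_col = NA_character_)
dat_value <- data.frame(x = 1, foo_col = "value")

absent   <- col_present(dat, "foo_col")
all_na   <- col_present(dat_na, "foo_col")
has_data <- col_present(dat_value, "foo_col")
print(c(absent = absent, all_na = all_na, has_data = has_data))
```

If the two kinds of FALSE need to be told apart, the else branch could return NA instead, which is a design choice rather than something the answer above prescribes.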
Take this function foo(). I want it to have a default argument of cyl because that's the name of the field it will usually process.
library(tidyverse)
foo <- function(x = cyl){
case_when(
x == 6 ~ TRUE,
x == 8 ~ FALSE,
x == 4 ~ NA
)
}
# works:
mtcars %>%
mutate(cyl_refactor = foo(cyl)) %>%
select(cyl, cyl_refactor)
But I am surprised that the function will not work unless I explicitly supply the default argument. See failing code below
# fails:
mtcars %>%
mutate(cyl_refactor = foo()) %>%
select(cyl, cyl_refactor)
Error: Problem with `mutate()` column `cyl_refactor`. ℹ `cyl_refactor = foo()`. x object 'cyl' not found
It seems that default arguments are only processed when there is also a data parameter as below.
foo2 <- function(data, x = cyl){
data %>%
mutate(cyl_refactor = case_when(
{{x}} == 6 ~ TRUE,
{{x}} == 8 ~ FALSE,
{{x}} == 4 ~ NA
))
}
mtcars %>%
foo2() %>%
select(cyl, cyl_refactor)
I am sure there is some gap in my knowledge of quasiquotation, but I would like to understand how to use a default argument in foo().
Here's one that will "work", though I wouldn't recommend it:
foo <- function(x = cyl){
x <- enquo(x)
eval.parent(rlang::quo_squash(rlang::quo(case_when(
!!x == 6 ~ TRUE,
!!x == 8 ~ FALSE,
!!x == 4 ~ NA
))))
}
# Both run without error
mtcars %>%
mutate(cyl_refactor = foo(cyl)) %>%
select(cyl, cyl_refactor)
mtcars %>%
mutate(cyl_refactor = foo()) %>%
select(cyl, cyl_refactor)
The problem is that in order for case_when to work, you can't just pass in a column name without also passing in the data. In order to "find" the data in this case, I've used eval.parent() to go up the call chain to try to find the cyl variable.
It's better to make proper functions where you pass in the input data directly (rather than variable names they need to look up themselves).
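The lookup problem can be seen without dplyr at all. The sketch below uses made-up names (f, caller_with_local, hypothetical_col) and assumes no global variable called hypothetical_col exists. A default argument is evaluated inside the function's own environment, while an explicitly supplied argument is evaluated where the call was written, which is presumably why foo(cyl) can see the column and foo() cannot.

```r
# Default argument refers to a name that only exists in the caller.
f <- function(x = hypothetical_col) x

caller_with_local <- function() {
  hypothetical_col <- 42
  # Default: evaluated in f's environment, so the local variable is invisible.
  default_result <- tryCatch(f(), error = function(e) "not found")
  # Explicit: evaluated here in the caller, where hypothetical_col exists.
  explicit_result <- f(hypothetical_col)
  list(default = default_result, explicit = explicit_result)
}

res <- caller_with_local()
print(res)
```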
We could do this with missing() and cur_data_all() (note that cur_data_all() is deprecated as of dplyr 1.1.0 in favor of pick()):
foo <- function(x = cyl){
if(missing(x)) x <- cur_data_all()[["cyl"]]
case_when(
x == 6 ~ TRUE,
x == 8 ~ FALSE,
x == 4 ~ NA
)
}
-testing
> out1 <- mtcars %>%
+ mutate(cyl_refactor = foo(cyl)) %>%
+ select(cyl, cyl_refactor)
> out2 <- mtcars %>%
+ mutate(cyl_refactor = foo()) %>%
+ select(cyl, cyl_refactor)
>
> identical(out1, out2)
[1] TRUE
What is the best way to automate a mutate call in one dplyr pipeline?
It is best if I demonstrate with an example.
In the first part of the example I am creating new columns based on values of the variable carb. However, imagine I need to automate this step so that it 'iterates' over all unique values of carb and creates a new column for each value.
Is there any way to do so?
library(tidyverse)
cr <-
mtcars %>%
group_by(gear) %>%
nest()
# This is the 'by-hand' approach of what I would like to do. How can I automate it, e.g. when we do not know all values of 'carb'?
cr$data[[1]] %>%
mutate(VARIABLE1 =
case_when(carb == 1 ~ hp/mpg,
TRUE ~ 0)) %>%
mutate(VARIABLE2 =
case_when(carb == 2 ~ hp/mpg,
TRUE ~ 0)) %>%
mutate(VARIABLE4 =
case_when(carb == 4 ~ hp/mpg,
TRUE ~ 0))
# This is a pseudo-idea of what I need to do. Is there any way to change the iteration number in ONE dplyr pipeline?
vals <- cr$data[[1]] %>% pull(carb) %>% sort %>% unique()
for (i in vals) {
message(i)
cr$data[[1]] %>%
mutate(paste('VARIABLE', i, sep = '') = case_when(carb == i ~ hp/mpg, # At this line, all i shall be first element of vals
TRUE ~ 0)) %>%
mutate(paste('VARIABLE', i, sep = '') = case_when(carb == i ~ hp/mpg, # At this line, all i shall be second element of vals
TRUE ~ 0)) %>%
mutate(paste('VARIABLE', i, sep = '') = case_when(carb == i ~ hp/mpg, # At this line, all i shall be third element of vals
TRUE ~ 0))
}
One way would be to use dummy_cols from the fastDummies package.
Doing it for one data frame at a time:
library(fastDummies)
cr$data[[1]] %>%
dummy_cols(select_columns = 'carb') %>%
# funs() is deprecated; a formula lambda does the same job
mutate_at(vars(starts_with('carb_')), ~ . * hp / mpg)
You can also do this first and then group by gear, since the gear value is not used in the calculation, so the order does not matter. For that:
cr_new <- mtcars %>%
dummy_cols(select_columns = 'carb') %>%
mutate_at(vars(starts_with('carb_')), ~ . * hp / mpg) %>%
group_by(gear) %>%
nest()
Perhaps, something like this would help -
library(dplyr)
library(purrr)
bind_cols(mtcars, map_dfc(unique(mtcars$carb),
~mtcars %>%
transmute(!!paste0('carb', .x) := case_when(carb == .x ~ hp/mpg, TRUE ~ 0))))
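For comparison, here is a minimal base R sketch of the same idea, looping over the unique values and assigning one column per value. The function name add_carb_columns is made up for illustration, and the VARIABLE prefix follows the naming used in the question.

```r
# Hypothetical helper: one new column per unique carb value.
add_carb_columns <- function(df) {
  for (v in sort(unique(df$carb))) {
    # hp/mpg where carb equals v, 0 everywhere else
    df[[paste0("VARIABLE", v)]] <- ifelse(df$carb == v, df$hp / df$mpg, 0)
  }
  df
}

out <- add_carb_columns(mtcars)
new_cols <- setdiff(names(out), names(mtcars))
print(new_cols)
```

The same function could be applied to each nested data frame, for example with lapply(cr$data, add_carb_columns).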
It sounds a lot like what's called "the XY-problem".
https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem
Please read about tidy data, and/or tidyr's pivot_longer/pivot_wider. Column names should not encode information.
I want to create a single function called eval_data where the user can input
a list of data frames
a list of dplyr functions to apply to the data frames
a list of columns to select from each dataframe:
This will look something like:
eval_data <- function(data, dplyr_logic, select_vector) {
data %>%
# this doesn't work
eval(dplyr_logic) %>%
select(
{ select_vector }
)
}
The dplyr_logic is a list of either:
nothing
a mutate statement
2 mutate statements
a filter
Input 1: List of data frames:
dd <- list()
dd$data <- list(
mutate0 = iris,
mutate1 = iris,
mutate2= iris,
filter1 = iris
)
Input 3: Select vector:
select_vec <- list(
c("Species", "Sepal.Length"),
c("Species", "New_Column1"),
c("Species", "New_Column2", "New_Column3"),
c("Species", "Sepal.Width")
)
Input 2: list of logic to apply to each data frame in the list
logic <- list(
# do nothing -- this one works
I(),
#mutate1
rlang::expr(mutate(New_Column1 = case_when(
Sepal.Length > 7 ~'Big',
Sepal.Length > 6 ~ 'Medium',
TRUE ~ 'Small'
)
)),
#mutate2
rlang::expr(mutate(New_Column2 = case_when(
Sepal.Width > 3.5 ~'Big2',
Sepal.Width > 3 ~ 'Medium2',
TRUE ~ 'Small2'
)) %>%
mutate(
New_Column3 = case_when(
Petal.Width > 2 ~'Big3',
Petal.Width > 1 ~ 'Medium3',
TRUE ~ 'Small3'
)
)
),
#filter1
rlang::expr(filter(Sepal.Width > 3))
)
# eval_data(dd$data[[1]], logic[[1]], select_vec[[1]]) works
# eval_data(dd$data[[2]], logic[[2]], select_vec[[2]]) does not
Desired Goal:
pmap(dd$data, logic, select_vec, ~eval_data)
Desired Output
pmap_output <- list(
iris1 = iris %>% I() %>% select("Species", "Sepal.Length"),
iris2 = iris %>%
mutate(New_Column1 =
case_when(
Sepal.Length > 7 ~'Big',
Sepal.Length > 6 ~ 'Medium',
TRUE ~ 'Small')) %>%
select("Species", "New_Column1"),
iris4 = iris %>%
mutate(New_Column2 = case_when(
Sepal.Width > 3.5 ~'Big2',
Sepal.Width > 3 ~ 'Medium2',
TRUE ~ 'Small2'
)) %>%
mutate(
New_Column3 = case_when(
Petal.Width > 2 ~'Big3',
Petal.Width > 1 ~ 'Medium3',
TRUE ~ 'Small3'
)
) %>%
select("Species", "New_Column2", "New_Column3"),
iris3 = iris %>% filter(Sepal.Width > 3) %>% select("Species", "Sepal.Width")
)
What do I need to change in eval_data and the logic list in order to make this work? Any help appreciated!!
Two changes. First, you need to include data %>% into your dplyr logic evaluation:
eval_data <- function(data, dplyr_logic, select_vector) {
rlang::expr( data %>% !!dplyr_logic ) %>%
eval() %>%
select( one_of(select_vector) )
}
Second, the chained mutate is actually a bit tricky. Recall that x %>% f(y) can be rewritten as f(x,y). Your double-mutate expression can therefore be re-written as mutate( mutate(expr1), expr2 ). When you feed the data to it, it becomes
mutate(data, mutate(expr1), expr2)
instead of the desired
mutate(mutate(data, expr1), expr2)
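The structure of a chained expression can be inspected without evaluating it, which may make the point above easier to see. This sketch uses base R's quote() on a shortened, made-up version of the double mutate (a = 1 and b = 2 stand in for the real case_when expressions); no packages need to be loaded because nothing is run.

```r
# Capture a chained expression without evaluating it.
logic_expr <- quote(mutate(a = 1) %>% mutate(b = 2))

# The outermost call is the pipe itself, not mutate().
outer_fn <- as.character(logic_expr[[1]])
lhs <- deparse(logic_expr[[2]])  # left side of the pipe
rhs <- deparse(logic_expr[[3]])  # right side of the pipe

print(c(outer = outer_fn, lhs = lhs, rhs = rhs))
```

The left side is a mutate() call with no data argument, so the incoming data frame has no obvious slot unless one is marked explicitly.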
So, we need to use the pronoun . to specify where the pipe input should go to in our complex expression:
logic <- rlang::exprs( # We can use exprs instead of list(expr())
I(),
mutate(New_Column1 = case_when(
Sepal.Length > 7 ~'Big',
Sepal.Length > 6 ~ 'Medium',
TRUE ~ 'Small'
)),
{mutate(., New_Column2 = case_when( # <--- NOTE the { and the .
Sepal.Width > 3.5 ~'Big2',
Sepal.Width > 3 ~ 'Medium2',
TRUE ~ 'Small2')) %>%
mutate(
New_Column3 = case_when(
Petal.Width > 2 ~'Big3',
Petal.Width > 1 ~ 'Medium3',
TRUE ~ 'Small3'
))}, # <--- NOTE the matching }
filter(Sepal.Width > 3)
)
Everything works now:
res <- pmap(list(dd$data, logic, select_vec), eval_data)
## Compare to desired output
map2_lgl( res, pmap_output, identical )
# mutate0 mutate1 mutate2 filter1
# TRUE TRUE TRUE TRUE
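A different design, sketched here in base R under my own assumptions, is to store each step as an ordinary function instead of an unevaluated expression, so that no eval() is needed. The names steps, cols, and apply_step are hypothetical, and only three of the four cases from the question are reproduced.

```r
# Each step is a plain function of a data frame.
steps <- list(
  identity,
  function(d) transform(d, New_Column1 = ifelse(Sepal.Length > 7, "Big",
                                         ifelse(Sepal.Length > 6, "Medium", "Small"))),
  function(d) d[d$Sepal.Width > 3, ]
)

# Columns to keep after each step.
cols <- list(
  c("Species", "Sepal.Length"),
  c("Species", "New_Column1"),
  c("Species", "Sepal.Width")
)

# Apply one step, then keep the requested columns.
apply_step <- function(data, step, keep) step(data)[, keep, drop = FALSE]

res_fn <- Map(apply_step, list(iris, iris, iris), steps, cols)
print(sapply(res_fn, nrow))
```

Chained steps then become a single function body, which avoids the question of where the pipe input should go.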
I want to create a new column (T/F) based on whether any value from a list is present in multiple columns. For this example, I'm using mtcars and searching for two values in two columns, but my actual challenge is many values in many columns.
I have a successful filter using filter_at() included below, but I've been unable to apply that logic to a mutate:
# there are 7 cars with 6 cyl
mtcars %>%
filter(cyl == 6)
# there are 2 cars with 19.2 mpg, one with 6 cyl, one with 8
mtcars %>%
filter(mpg == 19.2)
# there are 8 rows with either.
# these are the rows I want as TRUE
mtcars %>%
filter(mpg == 19.2 | cyl == 6)
# set the cols to look at
mtcars_cols <- mtcars %>%
select(matches('^(mp|cy)')) %>% names()
# set the values to look at
mtcars_numbs <- c(19.2, 6)
# result is 8 rows with either value in either col.
# this is a successful filter of the data
out1 <- mtcars %>%
filter_at(vars(mtcars_cols), any_vars(
. %in% mtcars_numbs
)
)
# shows set with all 6 cyl, plus one 8 cyl with 19.2 mpg
out1 %>%
select(mpg, cyl)
# This attempts to apply the filter list to the cols,
# but I only get 6 rows as True
# I tried to change == to %in% but that results in an error
out2 <- mtcars %>%
mutate(
myset = rowSums(select(., mtcars_cols) == mtcars_numbs) > 0
)
# only 6 rows returned
out2 %>%
filter(myset == T)
I'm not sure why the two rows are skipped. I think it might be the use of rowSums that is aggregating those two rows in some way.
If we want to check each column against its corresponding value, it may be better to use map2:
library(dplyr)
library(purrr)
map2_df(mtcars_cols, mtcars_numbs, ~
mtcars %>%
filter(!! rlang::sym(.x) == .y)) %>%
distinct
NOTE: Comparing floating point numbers with == can be unreliable, because small differences in precision can make the result FALSE.
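A short illustration of that floating point caveat, using the classic 0.1 + 0.2 case rather than the mtcars values. The tolerance of 1e-8 is an arbitrary choice for the sketch.

```r
a <- 0.1 + 0.2

# Exact comparison fails because a is not stored as exactly 0.3.
exact <- a == 0.3

# Tolerant comparisons treat the two values as equal.
tolerant   <- isTRUE(all.equal(a, 0.3))
within_tol <- abs(a - 0.3) < 1e-8

print(c(exact = exact, tolerant = tolerant, within_tol = within_tol))
```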
Also, note that == compares element by element only when the lhs and rhs have the same length or the rhs vector is of length 1 (in which case it is recycled). If the rhs has length greater than 1 but does not match the lhs, it is recycled down each column (in column-major order), so the values are matched by row position rather than by column.
We can replicate to make the lengths equal and now it should work
mtcars %>%
mutate(
myset = rowSums(select(., mtcars_cols) == mtcars_numbs[col(select(., mtcars_cols))]) > 0
) %>% pull(myset) %>% sum
#[1] 8
In the above code, select is used twice to make the logic easier to follow. Alternatively, we can use rep:
mtcars %>%
mutate(
myset = rowSums(select(., mtcars_cols) == rep(mtcars_numbs, each = n())) > 0
) %>%
pull(myset) %>%
sum
#[1] 8
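The recycling effect can be reproduced in base R alone. This sketch assumes the same two columns and values as the question and compares the unaligned comparison with the one where each value is lined up with its column through col(); the variable names are my own.

```r
cols <- mtcars[, c("mpg", "cyl")]
vals <- c(19.2, 6)

# Unaligned: vals is recycled down the rows, so odd rows are tested
# against 19.2 and even rows against 6, in both columns.
unaligned <- rowSums(cols == vals) > 0

# Aligned: col() gives the column index of every cell, so mpg cells are
# tested against vals[1] and cyl cells against vals[2].
aligned <- rowSums(cols == vals[col(cols)]) > 0

print(c(unaligned = sum(unaligned), aligned = sum(aligned)))
```

The rows flagged only by the aligned version are the ones the rowSums attempt in the question skipped.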