How to select_if in dplyr, where the logical condition is negated

I want to select all numeric columns from a dataframe, and then select all the non-numeric columns. An obvious way to do the first is the following:
mtcars %>%
select_if(is.numeric) %>%
head()
This works exactly as I expect.
mtcars %>%
select_if(!is.numeric) %>%
head()
This doesn't, and produces the error message Error in !is.numeric : invalid argument type
Looking at another way to do the same thing :-
mtcars %>%
select_if(sapply(., is.numeric)) %>%
head()
works perfectly, but
mtcars %>%
select_if(sapply(., !is.numeric)) %>%
head()
fails with the same error message. (purrr::keep behaves exactly the same way).
In both cases, using - to drop the undesired columns fails too. The is.numeric version gives the same error as above, and the sapply version gives this one: Error: Can't convert an integer vector to function.
The help page for is.numeric says
is.numeric is an internal generic primitive function: you can write methods to handle specific classes of objects, see InternalMethods. ... Methods for is.numeric should only return true if the base type of the class is double or integer and values can reasonably be regarded as numeric (e.g., arithmetic on them makes sense, and comparison should be done via the base type).
The help page for ! says
Value
For !, a logical or raw vector(for raw x) of the same length as x: names, dims and dimnames are copied from x, and all other attributes (including class) if no coercion is done.
Looking at the useful question Negation ! in a dplyr pipeline %>% I can see some of the reasons why this doesn't work, but neither of the solutions suggested there works.
mtcars %>%
select_if(not(is.numeric())) %>%
head()
gives the reasonable error Error in is.numeric() : 0 arguments passed to 'is.numeric' which requires 1.
mtcars %>%
select_if(not(is.numeric(.))) %>%
head()
Fails with this error :-
Error in tbl_if_vars(.tbl, .predicate, caller_env(), .include_group_vars = TRUE) : length(.p) == length(tibble_vars) is not TRUE.
This behaviour definitely violates the principle of least surprise. It's not of great consequence to me now, but it suggests I am failing to understand some more fundamental point.
Any thoughts?

Negating a predicate function can be done with the dedicated Negate() or purrr::negate() functions (rather than the ! operator, which negates a vector, not a function):
library(dplyr)
mtcars %>%
mutate(foo = "bar") %>%
select_if(Negate(is.numeric)) %>%
head()
# foo
# 1 bar
# 2 bar
# 3 bar
# 4 bar
# 5 bar
# 6 bar
Or with purrr::negate() (lower-case), which has slightly different behavior; see the respective help pages:
library(purrr)
library(dplyr)
mtcars %>%
mutate(foo = "bar") %>%
select_if(negate(is.numeric)) %>%
head()
# foo
# 1 bar
# 2 bar
# 3 bar
# 4 bar
# 5 bar
# 6 bar
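To make the distinction concrete, here is a minimal base R sketch of why ! fails on a function while Negate() works. The names bang_error, is_not_numeric and non_numeric are made up for illustration, and iris is used only because it has one non-numeric column.

```r
# `!` operates on values (logical, numeric or raw vectors), not on functions,
# so applying it to the function object is.numeric itself is an error.
bang_error <- tryCatch(!is.numeric, error = function(e) conditionMessage(e))
bang_error

# Negate() takes a predicate function and returns a new function
# that gives the opposite answer.
is_not_numeric <- Negate(is.numeric)
is_not_numeric("a")
is_not_numeric(1)

# For the sapply() variant, negate the logical vector that sapply() returns
# rather than the function passed into it.
non_numeric <- !sapply(iris, is.numeric)
names(iris)[non_numeric]
```

The same reasoning explains the sapply(., !is.numeric) failure in the question: the ! is evaluated on the function before sapply() ever runs.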

You could define your own "is not numeric" function and then use that instead:
is_not_num <- function(x) !is.numeric(x)
mtcars %>%
select_if(is_not_num) %>%
head()

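mtcars %>%
select_if(~ !is.numeric(.)) %>%
head()
does the same. (This answer originally wrapped the expression in funs(), i.e. funs(!is.numeric(.)); funs() has been deprecated since dplyr 0.8.0, and the ~ formula shorthand replaces it.)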
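In current dplyr the scoped select_if() has been superseded by select() combined with where(). The following is a sketch of the same negated selection in that style, assuming dplyr 1.0.0 or later is installed; the names df, a and b are illustrative only.

```r
library(dplyr)

# Add one character column so there is something non-numeric to select
df <- mtcars %>% mutate(foo = "bar")

# Option 1: negate inside an anonymous predicate
a <- df %>% select(where(~ !is.numeric(.x))) %>% head()

# Option 2: negate the selection itself; here `!` acts on the
# tidyselect selection, not on the is.numeric function
b <- df %>% select(!where(is.numeric)) %>% head()

names(a)
names(b)
```

Both forms should keep only the foo column.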

Related

How to apply an anonymous function inside dplyr mutate pipeline in a rowwise manner?

A trivial reproducible example is presented below. I want to mutate the mtcars dataframe at variables
vs and am such that if the value is equal to 1 it is changed to 2. Below is my original approach, which produces the error "the condition has length > 1". So obviously it's not iterating through each element in the vector.
mtcars %>% mutate_at(vars(vs,am),function(x) {if(x == 1){x <- 2}})
My second approach was to try a lapply to iterate over each element in the vector, which also gave me an error Error in match.fun(FUN) : argument "FUN" is missing, with no default.
mtcars %>% mutate_at(vars(vs,am),lapply(function(x) {if(x == 1){x <- 2}}))
I obviously know how to accomplish this in a for loop, just want to understand the logic behind the scenes.
mtcars %>%
mutate(across(c(vs, am), ~ case_when(.x == 1 ~ 2, TRUE ~ .x)))
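The reason the original attempt fails is that mutate_at() hands the function a whole column at once, while if() expects a single TRUE or FALSE. A vectorised conditional handles every element in one call. The sketch below shows this with base ifelse() and with dplyr::if_else() as an alternative to case_when(); x and recoded are illustrative names.

```r
library(dplyr)

# A vectorised conditional works element by element on the whole vector
x <- c(0, 1, 1, 0)
ifelse(x == 1, 2, x)

# The same idea inside across(): each selected column arrives as a vector (.x)
recoded <- mtcars %>%
  mutate(across(c(vs, am), ~ if_else(.x == 1, 2, .x)))

# vs and am now only contain 0 and 2
unique(recoded$vs)
unique(recoded$am)
```

if_else() is stricter about types than ifelse(); it works here because both the replacement value 2 and the columns are doubles.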

Should I use %$% instead of %>%?

Recently I have found the %$% pipe operator, but I am missing the point regarding its difference with %>% and if it could completely replace it.
Motivation to use %$%
The operator %$% could replace %>% in many cases:
mtcars %>% summary()
mtcars %$% summary(.)
mtcars %>% head(10)
mtcars %$% head(.,10)
Apparently, %$% is more usable than %>%:
mtcars %>% plot(.$hp, .$mpg) # Does not work
mtcars %$% plot(hp, mpg) # Works
Implicitly fills the built-in data argument:
mtcars %>% lm(mpg ~ hp, data = .)
mtcars %$% lm(mpg ~ hp)
Since % and $ are next to each other in the keyboard, inserting %$% is more convenient than inserting %>%.
Documentation
We find the following information in their respective help pages.
(?magrittr::`%>%`):
Description:
Pipe an object forward into a function or call expression.
Usage:
lhs %>% rhs
(?magrittr::`%$%`):
Description:
Expose the names in ‘lhs’ to the ‘rhs’ expression. This is useful
when functions do not have a built-in data argument.
Usage:
lhs %$% rhs
I was not able to understand the difference between the two pipe operators. What is the difference between piping an object and exposing a name? And in the rhs of %$%, we are still able to get the piped object with the ., right?
Should I start using %$% instead of %>%? Which problems could I face doing so?
In addition to the provided comments:
%$% also called the Exposition pipe vs. %>%:
This is a short summary of this article https://towardsdatascience.com/3-lesser-known-pipe-operators-in-tidyverse-111d3411803a
"The key difference in using %$% or %>% lies in the type of arguments of used functions."
One advantage of %$% over %>%, and as far as I can tell the only one, is that
we can avoid repeating the dataframe name in functions that have no data argument.
For example, lm() has a data argument. In this case we can use %>% and %$% interchangeably.
But consider a function like cor(), which has no data argument:
mtcars %>% cor(disp, mpg) # Will give an Error
cor(mtcars$disp, mtcars$mpg)
is equivalent to
mtcars %$% cor(disp, mpg)
Note that to use the %$% pipe operator you have to load magrittr explicitly with library(magrittr).
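A way to see what "exposing names" means is to compare %$% with base R's with(), which evaluates an expression using the columns of a data frame as variables. This is a small sketch, assuming magrittr is installed; via_dollar, via_with and via_expo are illustrative names.

```r
library(magrittr)

# Three ways of computing the same correlation
via_dollar <- cor(mtcars$disp, mtcars$mpg)       # spell out the data frame twice
via_with   <- with(mtcars, cor(disp, mpg))       # base R: expose column names
via_expo   <- mtcars %$% cor(disp, mpg)          # exposition pipe

all.equal(via_dollar, via_with)
all.equal(via_dollar, via_expo)
```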
Update, in response to the OP's comment:
The pipe, whichever one you use, lets us write code that reads more like human language than machine language.
ggplot2 is special. ggplot2 is not internally consistent.
ggplot1 had a tidier API than ggplot2.
Pipes would work with ggplot1:
library(ggplot1)
mtcars %>%
ggplot(list(x = mpg, y = wt)) %>%
ggpoint() %>%
ggsave("mtcars.pdf", width = 8, height = 6)
In 2016 Hadley Wickham said:
"ggplot2 never would have existed if I'd discovered the pipe 10 years earlier!"
https://www.youtube.com/watch?v=K-ss_ag2k9E&list=LL&index=9
No, you shouldn't use %$% routinely. It is like using the with() function, i.e. it exposes the component parts of the LHS when evaluating the RHS. But it only works when the value on the left has names like a list or dataframe, so you can't always use it. For example,
library(magrittr)
x <- 1:10
x %>% mean()
#> [1] 5.5
x %$% mean()
#> Error in eval(substitute(expr), data, enclos = parent.frame()): numeric 'envir' arg not of length one
Created on 2022-02-06 by the reprex package (v2.0.1.9000)
You'd get a similar error with x %$% mean(.).
Even when the LHS has names, it doesn't automatically put the . argument in the first position. For example,
mtcars %>% nrow()
#> [1] 32
mtcars %$% nrow()
#> Error in nrow(): argument "x" is missing, with no default
In this case mtcars %$% nrow(.) would work, because mtcars has names.
Your example involving .$hp and .$mpg is illustrating one of the oddities of magrittr pipes. Because the . is only used in expressions, not alone as an argument, it is passed as the first argument as well as being passed in those expressions. You can avoid this using braces, e.g.
mtcars %>% {plot(.$hp, .$mpg)}
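The effect of the braces can be checked without plotting anything. The sketch below uses a hypothetical helper, count_args(), that simply reports how many arguments it received; it is not part of magrittr.

```r
library(magrittr)

# Hypothetical helper: returns the number of arguments it was called with
count_args <- function(...) length(list(...))

# The dot appears only inside nested calls (.$hp, .$mpg), so magrittr
# also inserts mtcars as the first argument: three arguments in total.
without_braces <- mtcars %>% count_args(.$hp, .$mpg)

# Wrapping the call in braces suppresses the first-argument insertion:
# only the two columns are passed.
with_braces <- mtcars %>% { count_args(.$hp, .$mpg) }

without_braces
with_braces
```

This is why plot(.$hp, .$mpg) misbehaves in the question: plot() receives the whole data frame as its first argument in addition to the two columns.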

Understand the warning message in across in R

This question is meant to build a deeper understanding of the R functions across() and where(). I ran the code below and got the following message. I want to understand:
a) what the difference is between the good and the bad practice
b) how the where() function works, both in general and in this use case
library(tidyverse)
iris %>% mutate(across(is.character,as.factor)) %>% str()
Warning message:
Problem with `mutate()` input `..1`.
i Predicate functions must be wrapped in `where()`.
# Bad
data %>% select(is.character)
# Good
data %>% select(where(is.character))
i Please update your code.
In terms of the result, there is not much difference between using where() and not using it: passing a bare predicate still works, but it is deprecated, so you get a warning suggesting the better syntax. Basically, where() takes a predicate function and applies it to every variable (column) of your data set. It then returns every variable for which the function returns TRUE. The following examples are taken from the documentation of where():
iris %>% select(where(is.numeric))
# or an anonymous function
iris %>% select(where(function(x) is.numeric(x)))
# or a purrr style formula as a shortcut for creating a function on the spot
iris %>% select(where(~ is.numeric(.x)))
You can also combine two conditions with &&, which short-circuits, so mean() is only evaluated for columns that are numeric:
# The following code selects all numeric variables whose means are greater than 3.5
iris %>% select(where(~ is.numeric(.x) && mean(.x) > 3.5))
You can pass where(is.character) as the .cols argument of across() and then supply the function to apply to the selected columns in the .fns argument.
For more information you can always refer to the documentation, which is the best source for learning more about these functions.
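As a rough illustration of what where() does, the following sketch reproduces the selection by hand in base R and then uses where() inside across(). It assumes dplyr 1.0.0 or later; keep, base_result, tidy_result and converted are illustrative names. Note that iris has no character columns (Species is a factor), so the across(is.character, as.factor) call in the question selects nothing and leaves the data unchanged.

```r
library(dplyr)

# Conceptually, where(pred) applies pred to every column and keeps
# the columns for which it returns TRUE.
keep <- vapply(iris, is.numeric, logical(1))
base_result <- iris[keep]
tidy_result <- iris %>% select(where(is.numeric))
identical(names(base_result), names(tidy_result))

# where() in the .cols argument of across(): convert every factor
# column to character and leave the rest alone.
converted <- iris %>% mutate(across(where(is.factor), as.character))
str(converted)
```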

Use dplyr's _if() functions like mutate_if() with a negative predicate function

According to the documentation of the dplyr package:
# The _if() variants apply a predicate function (a function that
# returns TRUE or FALSE) to determine the relevant subset of
# columns.
# mutate_if() is particularly useful for transforming variables from
# one type to another
iris %>% mutate_if(is.factor, as.character)
So how do I use the inverse form? I would like to transform all non-numeric values to characters, so I thought of doing:
iris %>% mutate_if(!is.numeric, as.character)
#> Error in !is.numeric : invalid argument type
But that doesn't work. Or just select all variables that are not numeric:
iris %>% select_if(!is.numeric)
#> Error in !is.numeric : invalid argument type
Doesn't work either.
How do I use negation with dplyr functions like mutate_if(), select_if() and arrange_if()?
EDIT: This might be solved in the upcoming dplyr 1.0.0: NEWS.md.
We can use the ~ shorthand notation for an anonymous function in the tidyverse:
library(dplyr)
iris %>%
mutate_if(~ !is.numeric(.), as.character)
Or without anonymous function, use negate from purrr
library(purrr)
iris %>%
mutate_if(negate(is.numeric), as.character)
In addition to negate, Negate from base R also works
iris %>%
mutate_if(Negate(is.numeric), as.character)
Same notation, works with select_if/arrange_if
iris %>%
select_if(negate(is.numeric))%>%
head(2)
# Species
#1 setosa
#2 setosa
Could be a nice suggestion to add to their package, so feel free to open an issue on GitHub.
For now, you can write a function 'on-the-fly':
iris %>% mutate_if(function(x) !is.numeric(x), as.character)
iris %>% select_if(function(x) !is.numeric(x))
And this might even be safer, not sure how the _if() internals work:
iris %>% mutate_if(function(...) !is.numeric(...), as.character)
iris %>% select_if(function(...) !is.numeric(...))
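With dplyr 1.0.0 and later, the scoped _if() verbs are superseded by across() with where(), and the selection itself can be negated. A minimal sketch under that assumption (converted is an illustrative name):

```r
library(dplyr)

# Convert every non-numeric column to character.
# Here `!` negates the tidyselect selection, not the is.numeric function.
converted <- iris %>%
  mutate(across(!where(is.numeric), as.character))

class(converted$Species)
class(converted$Sepal.Length)
```

The anonymous-function form, across(where(~ !is.numeric(.x)), as.character), should select the same columns.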

How do I write these using pipes in R?

How do I get subgroups by using pipes? I don't understand why what I wrote doesn't work. Can someone explain how these work? Reading and looking at examples online hasn't helped me, because I am not sure what it is that I am not understanding.
mean(mtcars$qsec)
mtcars %>%
select(qsec) %>%
mean()
Warning message:
In mean.default(.) : argument is not numeric or logical: returning NA
mean(mtcars$qsec[mtcars$cyl==8])
mtcars %>%
group-by(qsec) %>%
filter(cyl==8)
mean()
Error in mean.default() : argument "x" is missing, with no default
mean(mtcars$mpg[mtcars$hp > median(mtcars$hp)])
mtcars %>%
group_by(mpg) %>%
filter(hp>median(hp))
mean
The reason is that select() still returns a data.frame (with one column), whereas mean() expects a vector. From ?mean:
x - An R object. Currently there are methods for numeric/logical vectors and date, date-time and time interval objects. Complex vectors are allowed for trim = 0, only.
We can use pull to extract the column as a vector and apply the mean on it
library(dplyr)
mtcars %>%
pull(qsec) %>%
mean
#[1] 17.84875
In the second case, we are getting the mean of 'qsec' where 'cyl' is 8
mtcars %>%
select(qsec, cyl) %>%
filter(cyl == 8) %>%
pull(qsec) %>%
mean
#[1] 16.77214
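The third case in the question follows the same pattern: filter the rows first, then pull the column out as a vector. The sketch below also shows group_by() with summarise() as one way to get a mean per subgroup, since group_by() on its own does not subset anything. The names third, base_third and by_cyl are illustrative.

```r
library(dplyr)

# select() keeps a data frame; pull() extracts a plain vector
is.data.frame(select(mtcars, qsec))
is.numeric(pull(mtcars, qsec))

# Third case: mean mpg for cars whose hp is above the median hp
third <- mtcars %>%
  filter(hp > median(hp)) %>%
  pull(mpg) %>%
  mean()

# Same computation in base R, for comparison
base_third <- mean(mtcars$mpg[mtcars$hp > median(mtcars$hp)])
all.equal(third, base_third)

# Mean qsec for every cyl subgroup at once
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_qsec = mean(qsec))
by_cyl
```

Note that every step, including the final mean(), has to be joined with %>%; the "argument "x" is missing" error in the question comes from a mean() call that was not connected to the pipeline.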

Resources