This question is meant to build a deeper understanding of the R functions across() and where(). I ran this code and got the warning below. I want to understand:
a) what is the difference between the good and the bad practice
b) how the where() function works, both in general and in this use case
library(tidyverse)
iris %>% mutate(across(is.character,as.factor)) %>% str()
Warning message:
Problem with `mutate()` input `..1`.
i Predicate functions must be wrapped in `where()`.
# Bad
data %>% select(is.character)
# Good
data %>% select(where(is.character))
i Please update your code.
There is not much difference in behavior between using where() and not using it. The warning only suggests a better syntax. Basically, where() takes a predicate function and applies it to every variable (column) of your data set. It then selects every variable for which the function returns TRUE. The following examples are taken from the documentation of where():
iris %>% select(where(is.numeric))
# or an anonymous function
iris %>% select(where(function(x) is.numeric(x)))
# or a purrr style formula as a shortcut for creating a function on the spot
iris %>% select(where(~ is.numeric(.x)))
Or you can combine two conditions with &&:
# The following code selects all numeric variables whose means are greater than 3.5
iris %>% select(where(~ is.numeric(.x) && mean(.x) > 3.5))
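To make the mechanism concrete, here is a minimal sketch of what where() does conceptually, written by hand in base R and compared with the tidyselect version. The variable names (is_num, by_hand, via_where, big_means) are made up for illustration, and this is not how tidyselect is implemented internally.

```r
library(dplyr)

# Step 1: apply the predicate to every column, giving one TRUE/FALSE per column.
is_num <- vapply(iris, is.numeric, logical(1))

# Step 2: keep only the columns for which the predicate returned TRUE.
by_hand <- iris[, is_num, drop = FALSE]

# The same selection expressed with where().
via_where <- iris %>% select(where(is.numeric))

# && short-circuits, so mean() is never evaluated for the
# non-numeric Species column.
big_means <- iris %>% select(where(~ is.numeric(.x) && mean(.x) > 3.5))
names(big_means)
```

The short-circuiting is the reason for using && rather than & here: the second condition is only checked for columns that already passed the first one.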
You can use where(is.character) as the .cols argument of across() and then apply the function given in the .fns argument to the selected columns.
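As a small sketch of that pattern: iris has no character columns (Species is already a factor), so the example below first converts Species to character, purely for illustration, and then converts it back with across() and where(). The names df and converted are made up for this example.

```r
library(dplyr)

# iris has no character columns, so create one for the demonstration.
df <- iris %>% mutate(Species = as.character(Species))

# Select every character column with where() and convert it to a factor.
converted <- df %>% mutate(across(where(is.character), as.factor))

str(converted)
```

Written this way, the predicate is wrapped in where() and the warning from the question should not appear.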
For more information, you can always refer to the documentation, which is the best source for learning more about these functions.
Related
Hello R and tidyverse wizards,
I am trying to count the rows of the starwars data set to know how many observations we get with the variables "height" and "mass".
I managed to get it with this code:
library(tidyverse)
starwars %>%
  select(height, mass) %>%
  drop_na() %>%
  summarise(across(.cols = c(height, mass),
                   list(obs = ~ n(),
                        mean = mean,
                        sd = sd))) %>%
  View()
I would like to replace the obs = ~ n() by the count function and tried this version:
library(tidyverse)
starwars %>%
  select(height, mass) %>%
  drop_na() %>%
  summarise(across(.cols = c(height, mass),
                   list(obs = count,
                        mean = mean,
                        sd = sd))) %>%
  View()
but it was too simple to work, classic :p
I had this error message --> Error in View : Problem while computing ..1 = across(...)
And when I got rid of the View() function, I had another error message --> Error in summarise():
! Problem while computing ..1 = across(...).
Caused by error in across():
! Problem while computing column height_obs.
Caused by error in UseMethod():
! no applicable method for 'count' applied to an object of class "c('integer', 'numeric')"
So, I have two questions:
Could someone please explain why the code worked with ~ n() but not with count?
Is it possible to use the count function instead of ~ n() in that case?
Sorry if it is a dumb question, but I am just trying to understand the across and the count functions by playing with them.
In the function description it says that "df %>% count(a, b) is roughly equivalent to df %>% group_by(a, b) %>% summarise(n = n())", so I assume that using count() within across() results in something like a doubled summarise() call, which is why n() is used instead.
Edit: Here you find the solution in the comment by G. Grothendieck
What is the difference between n() and count() in R? When should one favour the use of either or both?
n() returns a number
count() returns a dataframe
count() takes a dataframe as its first argument. It then returns counts for columns within that dataframe, passed as additional arguments. e.g.,
library(dplyr)
count(starwars, mass, height)
When you put count() inside across(), each selected column is passed to count() on its own, as a bare vector, without a dataframe as the first argument. For the height column, that is roughly equivalent to running
count(starwars$height)
Because count() expects a dataframe as the first argument, it throws an error.
n(), on the other hand, doesn’t take any arguments, and simply counts the rows in the current group (or in the whole dataframe if it is ungrouped). You have to include the ~, as otherwise across() will try passing each column to n(), which causes an error since n() doesn’t expect arguments.
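A minimal sketch of the difference, using a small made-up data frame (df, by_n, by_count and stats are illustrative names, not from the original question):

```r
library(dplyr)

df <- data.frame(g = c("a", "a", "b"), x = c(1, 2, 3))

# n() is called inside summarise() and returns one number per group.
by_n <- df %>% group_by(g) %>% summarise(n = n())

# count() takes the data frame itself and returns a data frame.
by_count <- df %>% count(g)

# Inside across(), the lambda ~ n() ignores the column it is given
# and just reports the row count.
stats <- df %>% summarise(across(x, list(obs = ~ n(), mean = mean)))
```

Here by_n and by_count hold the same counts, which matches the "roughly equivalent" wording in the count() documentation quoted above.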
I have a simple task which I would like to loop over many datasets (which have similar variable names). I know how to do it in dplyr, but I need to convert it to base R in order to get it into an anonymous function.
For example (this is not the real data I am working with):
This is my dplyr approach:
mtcars %>%
select(mpg, contains("cyl")) %>%
distinct()
However, when I throw this into an anonymous function:
mtcars %>% (function(x) subset(x, select=c(mpg, contains("cyl"))))
I get an error: Error: No tidyselect variables were registered
Any ideas about how to solve this, and how to add distinct() to the function so that I only get unique values? Any and all suggestions are appreciated, thank you!
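One possible direction, sketched here under assumptions rather than taken from the thread: contains() is a tidyselect helper, and as the error message suggests, it has no registered variables to work with when called inside subset(). A base R version can match column names with grep() instead and use unique() in place of distinct(). The function name select_unique is hypothetical.

```r
# Hypothetical helper: keep mpg plus every column whose name contains "cyl",
# then drop duplicated rows (the base R counterpart of distinct()).
select_unique <- function(x) {
  cols <- c("mpg", grep("cyl", names(x), value = TRUE, fixed = TRUE))
  unique(x[, cols, drop = FALSE])
}

res <- select_unique(mtcars)
head(res)
```

Because it only relies on base R, the same function can be passed to lapply() to loop over a list of datasets.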
I want to select all numeric columns from a dataframe, and then to select all the non-numeric columns. An obvious way to do this is the following :-
mtcars %>%
select_if(is.numeric) %>%
head()
This works exactly as I expect.
mtcars %>%
select_if(!is.numeric) %>%
head()
This doesn't, and produces the error message Error in !is.numeric : invalid argument type
Looking at another way to do the same thing :-
mtcars %>%
select_if(sapply(., is.numeric)) %>%
head()
works perfectly, but
mtcars %>%
select_if(sapply(., !is.numeric)) %>%
head()
fails with the same error message. (purrr::keep behaves exactly the same way).
In both cases using - to drop the undesired columns fails too, with the same error as above for the is.numeric version, and this error message for the sapply version Error: Can't convert an integer vector to function.
The help page for is.numeric says
is.numeric is an internal generic primitive function: you can write methods to handle specific classes of objects, see InternalMethods. ... Methods for is.numeric should only return true if the base type of the class is double or integer and values can reasonably be regarded as numeric (e.g., arithmetic on them makes sense, and comparison should be done via the base type).
The help page for ! says
Value
For !, a logical or raw vector (for raw x) of the same length as x: names, dims and dimnames are copied from x, and all other attributes (including class) if no coercion is done.
Looking at the useful question Negation ! in a dplyr pipeline %>% I can see some of the reasons why this doesn't work, but neither of the solutions suggested there works.
mtcars %>%
select_if(not(is.numeric())) %>%
head()
gives the reasonable error Error in is.numeric() : 0 arguments passed to 'is.numeric' which requires 1.
mtcars %>%
select_if(not(is.numeric(.))) %>%
head()
Fails with this error :-
Error in tbl_if_vars(.tbl, .predicate, caller_env(), .include_group_vars = TRUE) : length(.p) == length(tibble_vars) is not TRUE.
This behaviour definitely violates the principle of least surprise. It's not of great consequence to me now, but it suggests I am failing to understand some more fundamental point.
Any thoughts?
Negating a predicate function can be done with the dedicated Negate() or purrr::negate() functions (rather than the ! operator, which negates a vector, not a function):
library(dplyr)
mtcars %>%
mutate(foo = "bar") %>%
select_if(Negate(is.numeric)) %>%
head()
# foo
# 1 bar
# 2 bar
# 3 bar
# 4 bar
# 5 bar
# 6 bar
Or with purrr::negate() (lower-case; it has slightly different behavior, see the respective help pages):
library(purrr)
library(dplyr)
mtcars %>%
mutate(foo = "bar") %>%
select_if(negate(is.numeric)) %>%
head()
# foo
# 1 bar
# 2 bar
# 3 bar
# 4 bar
# 5 bar
# 6 bar
You could also define your own "is not numeric" function and use that instead:
is_not_num <- function(x) !is.numeric(x)
mtcars %>%
select_if(is_not_num) %>%
head()
mtcars %>%
select_if(funs(!is.numeric(.))) %>%
head()
does the same. Note that funs() has since been deprecated in dplyr, so this form is best avoided in new code.
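With more recent versions of dplyr and tidyselect, the same negation can also be written with select() and where(). The following is a sketch assuming a tidyselect version that supports where() and the ! operator on selections; the names df, neg_a, neg_b and neg_c are made up for illustration.

```r
library(dplyr)

df <- mtcars %>% mutate(foo = "bar")

# Negate the selection itself ...
neg_a <- df %>% select(!where(is.numeric))

# ... or negate inside a lambda, where ! is applied to a logical value ...
neg_b <- df %>% select(where(~ !is.numeric(.x)))

# ... or negate the predicate function with Negate().
neg_c <- df %>% select(where(Negate(is.numeric)))

head(neg_a)
```

In all three cases ! or Negate() ends up operating on something it can handle (a selection, a logical value, or a function), which is exactly what !is.numeric on its own does not provide.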
I'm trying, as per "dplyr mutate using variable columns" and "dplyr - mutate: use dynamic variable names", to use dynamic names in mutate. What I am trying to do is to normalize column data by groups subject to a minimum standard deviation. Each column has a different minimum standard deviation
e.g. (I omitted loops & map statements for convenience)
require(dplyr)
require(magrittr)
data(iris)
iris <- tbl_df(iris)
minsd <- c('Sepal.Length' = 0.8)
varname <- 'Sepal.Length'
iris %>% group_by(Species) %>% mutate(!!varname := mean(pluck(iris,varname),na.rm=T)/max(sd(pluck(iris,varname)),minsd[varname]))
I got the dynamic assignment & variable selection to work as suggested by the reference answers. But group_by() is not respected which, for me at least, is the main benefit of using dplyr here
The desired answer is given by
iris %>% group_by(Species) %>% mutate(!!varname := mean(Sepal.Length,na.rm=T)/max(sd(Sepal.Length),minsd[varname]))
Is there a way around this?
I actually did not know much about pluck, so I don't know what went wrong, but I would go for this and this works:
iris %>%
  group_by(Species) %>%
  mutate(
    !!varname :=
      mean(!!as.name(varname), na.rm = TRUE) /
      max(sd(!!as.name(varname)),
          minsd[varname])
  )
Let me know if this isn't what you were looking for.
The other answer is obviously the best, and it also solved a similar problem that I have encountered. For example, with !!as.name(), there is no need to use group_by_() (or group_by_at()) or arrange_() (or arrange_at()).
However, another way is to replace pluck(iris,varname) in your code with .data[[varname]]. The reason why pluck(iris,varname) does not work is, I suppose, that iris in pluck(iris,varname) refers to the original, ungrouped data. In contrast, .data refers to the tibble that mutate() is operating on, and so it is grouped.
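A minimal sketch of the .data[[varname]] variant, reusing minsd and varname from the question (res and the trailing ungroup() are additions for illustration):

```r
library(dplyr)

minsd <- c(Sepal.Length = 0.8)
varname <- "Sepal.Length"

res <- iris %>%
  group_by(Species) %>%
  mutate(
    # .data[[varname]] looks the column up in the current group,
    # so mean() and sd() are computed per Species.
    !!varname := mean(.data[[varname]], na.rm = TRUE) /
      max(sd(.data[[varname]]), minsd[varname])
  ) %>%
  ungroup()

distinct(res, Species, Sepal.Length)
```

Each species should end up with its own value, whereas the pluck(iris, varname) version computes a single value from the whole column.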
An alternative to as.name() is rlang::sym() from the rlang package.
The article on dplyr here says "[]" (square brackets) can be used to subset filtered Tibbles like this:
filter(mammals, adult_body_mass_g > 1e7)[ , 3]
But I am getting an "object not found" error.
Here is the replication of the error on a more known dataset "iris"
library(dplyr)
iris %>% filter(Sepal.Length>6) [,c(1:3)]
Error in filter_(.data, .dots = lazyeval::lazy_dots(...)) :
object 'Sepal.Length' not found
I also want to mention that I am deliberately avoiding dplyr's native subsetting with select(), as I need a vector output and not a data frame when selecting a single column. Unfortunately, dplyr always forces a data frame output (for good reasons).
You need an extra pipe:
iris %>% filter(Sepal.Length>6) %>% .[,1:3]
Sorry, forgot the . before the brackets.
Note: Your code will probably be more readable if you stick to the tidyverse syntax and use select as the last operation.
iris %>%
filter(Sepal.Length > 6) %>%
select(1:3)
The dplyr-native way of doing this is to use select:
iris %>% filter(Sepal.Length > 6) %>% select(1:3)
You could also use {} so that the filtering is done before [ is applied:
{iris %>% filter(Sepal.Length>6)}[,c(1:3)]
Or, as suggested in another answer, use the . notation to indicate where the data should go in relation to [:
iris %>% filter(Sepal.Length>6) %>% .[,1:3]
You can also load magrittr explicitly and use extract, which is a "pipe-able" version of [:
library(magrittr)
iris %>% filter(Sepal.Length>6) %>% extract( ,1:3)
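As a quick check, the sketch below runs the three bracket-based variants side by side and also addresses the vector output mentioned in the question. The variable names are made up for illustration, and pull() is an addition not mentioned in the answers above; it is dplyr's function for extracting a single column as a vector.

```r
library(dplyr)
library(magrittr)

# Three ways of applying [ after the filter has run.
with_braces  <- {iris %>% filter(Sepal.Length > 6)}[, 1:3]
with_dot     <- iris %>% filter(Sepal.Length > 6) %>% .[, 1:3]
with_extract <- iris %>% filter(Sepal.Length > 6) %>% extract(, 1:3)

# For a single column, [ on a plain data frame drops to a vector,
# and dplyr's pull() does the same explicitly.
vec_bracket <- iris %>% filter(Sepal.Length > 6) %>% .[, 1]
vec_pull    <- iris %>% filter(Sepal.Length > 6) %>% pull(1)
```

Note that the dropping behavior of [ applies to a plain data frame such as iris; a tibble keeps a one-column tibble, in which case pull() is the more predictable choice.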
The blog entry you reference is old in dplyr time - about 3 years old. dplyr has been changing a lot. I don't know whether the blog's suggestion worked at the time it was written or not, but I'd recommend finding more recent sources to learn about this frequently changing package.