The dplyr article linked here says "[]" (square brackets) can be used to subset filtered tibbles like this:
filter(mammals, adult_body_mass_g > 1e7)[ , 3]
But I am getting an "object not found" error.
Here is a reproduction of the error on the better-known iris dataset:
library(dplyr)
iris %>% filter(Sepal.Length>6) [,c(1:3)]
Error in filter_(.data, .dots = lazyeval::lazy_dots(...)) :
object 'Sepal.Length' not found
I also want to mention that I am deliberately avoiding dplyr's native subsetting with select(), because I need a vector as output rather than a single-column data frame. Unfortunately, select() always returns a data frame (for good reasons).
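As an aside, dplyr itself covers the vector-output case: pull() extracts a single column as a plain vector. A minimal sketch of that approach:

```r
library(dplyr)

# pull() returns a plain vector rather than a one-column tibble
iris %>%
  filter(Sepal.Length > 6) %>%
  pull(Sepal.Length)
```

This avoids bracket subsetting entirely when only one column is needed.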
You need an extra pipe:
iris %>% filter(Sepal.Length>6) %>% .[,1:3]
Note: Your code will probably be more readable if you stick to the tidyverse syntax and use select as the last operation.
iris %>%
filter(Sepal.Length > 6) %>%
select(1:3)
The dplyr-native way of doing this is to use select:
iris %>% filter(Sepal.Length > 6) %>% select(1:3)
You could also use {} so that the filtering is done before [ is applied:
{iris %>% filter(Sepal.Length>6)}[,c(1:3)]
Or, as suggested in another answer, use the . notation to indicate where the data should go in relation to [:
iris %>% filter(Sepal.Length>6) %>% .[,1:3]
You can also load magrittr explicitly and use extract, which is a "pipe-able" version of [:
library(magrittr)
iris %>% filter(Sepal.Length > 6) %>% extract(, 1:3)
The blog entry you reference is old in dplyr terms - about three years old - and dplyr has been changing a lot. I don't know whether the blog's suggestion worked when it was written, but I'd recommend finding more recent sources for this frequently changing package.
Related
This question is to build a deeper understanding of the R functions across() and where(). I ran this code and got the message below. I want to understand:
a) what the difference is between the good and bad practice
b) how where() works, both in general and in this use case
library(tidyverse)
iris %>% mutate(across(is.character,as.factor)) %>% str()
Warning message:
Problem with `mutate()` input `..1`.
i Predicate functions must be wrapped in `where()`.
# Bad
data %>% select(is.character)
# Good
data %>% select(where(is.character))
i Please update your code.
There is not much difference between using where and not using it; the warning just suggests better syntax. Basically, where takes a predicate function and applies it to every variable (column) of your data set. It then returns every variable for which the function returns TRUE. The following examples are taken from the documentation of where:
iris %>% select(where(is.numeric))
# or an anonymous function
iris %>% select(where(function(x) is.numeric(x)))
# or a purrr style formula as a shortcut for creating a function on the spot
iris %>% select(where(~ is.numeric(.x)))
You can also combine two conditions using the shorthand &&:
# The following code selects all numeric variables whose means are greater than 3.5
iris %>% select(where(~ is.numeric(.x) && mean(.x) > 3.5))
You can pass where(is.character) as the .cols argument of the across function and then apply a function, given in the .fns argument, to the selected columns.
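Applied to the code in the question, the warning-free version simply wraps the predicate in where():

```r
library(dplyr)

# Same transformation as in the question, with the predicate wrapped
# in where() so no warning is emitted
iris %>%
  mutate(across(where(is.character), as.factor)) %>%
  str()
```

(iris happens to have no character columns, so this is a no-op here, but the pattern applies to any data frame.)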
For more information you can always refer to the documentation, which is the best place to learn more.
I'm trying, as per
dplyr mutate using variable columns
and
dplyr - mutate: use dynamic variable names
to use dynamic names in mutate(). What I am trying to do is normalize column data by group, subject to a minimum standard deviation; each column has a different minimum standard deviation,
e.g. (I omitted loops & map statements for convenience)
require(dplyr)
require(magrittr)
data(iris)
iris <- tbl_df(iris)
minsd <- c('Sepal.Length' = 0.8)
varname <- 'Sepal.Length'
iris %>%
  group_by(Species) %>%
  mutate(!!varname := mean(pluck(iris, varname), na.rm = TRUE) /
           max(sd(pluck(iris, varname)), minsd[varname]))
I got the dynamic assignment and variable selection to work as suggested by the referenced answers, but group_by() is not respected, which, for me at least, is the main benefit of using dplyr here.
The desired answer is given by:
iris %>% group_by(Species) %>% mutate(!!varname := mean(Sepal.Length,na.rm=T)/max(sd(Sepal.Length),minsd[varname]))
Is there a way around this?
I actually did not know much about pluck, so I don't know what went wrong, but I would go for this and this works:
iris %>%
group_by(Species) %>%
mutate(
!! varname :=
mean(!!as.name(varname), na.rm = T) /
max(sd(!!as.name(varname)),
minsd[varname])
)
Let me know if this isn't what you were looking for.
The other answer is obviously the best, and it also solved a similar problem that I have encountered. For example, with !!as.name() there is no need to use group_by_() / group_by_at() or arrange_() / arrange_at().
However, another way is to replace pluck(iris, varname) in your code with .data[[varname]]. The reason pluck(iris, varname) does not work is that, I suppose, the iris inside pluck() is the ungrouped original. .data, however, refers to the tibble executing mutate(), and so is grouped.
An alternative to as.name() is rlang::sym() from the rlang package.
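Putting that alternative together with the accepted approach, a minimal sketch (same varname and grouping as in the question):

```r
library(dplyr)
library(rlang)

varname <- "Sepal.Length"

# sym() turns the string into a symbol; !! unquotes it inside mutate()
iris %>%
  group_by(Species) %>%
  mutate(!!varname := mean(!!sym(varname), na.rm = TRUE))
```

rlang::sym() behaves like as.name() here; which one you use is a matter of style.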
I'm new to sparklyr (but familiar with Spark and PySpark), and I've got a really basic question. I'm trying to filter a column based on a partial match. In dplyr, I'd write my operation like so:
businesses %>%
filter(grepl('test', biz_name)) %>%
head
Running that code on a spark dataframe however gives me:
Error: org.apache.spark.sql.AnalysisException: Undefined function: 'GREPL'. This function is neither a registered temporary function nor a permanent function registered in the database 'project_eftpos_failure'.; line 5 pos 7
As in standard Spark, you can use either rlike (Java regular expressions):
df <- copy_to(sc, iris)
df %>% filter(rlike(Species, "osa"))
# or anchored
df %>% filter(rlike(Species, "^.*osa.*$"))
or like (simple SQL regular expressions):
df %>% filter(like(Species, "%osa%"))
Both methods can also be used in infix notation:
df %>% filter(Species %rlike% "^.*osa.*$")
and
df %>% filter(Species %like% "%osa%")
respectively.
For details see vignette("sql-translation").
Pipes and the tidyverse are sometimes very convenient. The user wants to convert one column from one type to another.
Like so:
mtcars$qsec <-as.integer(mtcars$qsec)
This requires typing twice what I need. Please do not suggest the with() command, since I find it confusing to use.
What would be the tidyverse and magrittr %<>% way of doing the same with least amount of typing? Also, if qsec is 6th column, how can I do it just refering to column position. Something like (not correct code)
mtcars %<>% mutate(as.integer,qsec)
mtcars %<>% mutate(as.integer,[[6]])
With typing reference to the column just once - the compliant answer is
mtcars %<>% mutate_at(6, as.integer)
Edit: note that as of 2021, the mutate_at() syntax has been superseded by
mtcars %<>% mutate(across(6, as.integer))
To refer to column by name, solution with one redundant typing of column name is
mtcars %<>% mutate(qsec = as.integer(qsec))
NOTE:credit goes to commenting users above
This solution is probably the shortest:
mtcars$qsec %<>% as.integer
The trick is to perform the cast directly on the column, so there is no need for mutate() any more.
Update dplyr 1.0.0
The tidyverse recommends using across(), which, as noted by @Aren Cambre, replaces mutate_at.
Here is how it looks:
mtcars %>% mutate(across(qsec, as.integer))
Important:
Note that as.integer is written without parentheses ()
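If the conversion function needs extra arguments, across() also accepts purrr-style formulas and anonymous functions; a small sketch (assuming column 6 is qsec, as above):

```r
library(dplyr)

# purrr-style formula: .x stands for each selected column
mtcars %>% mutate(across(6, ~ as.integer(.x)))

# base R anonymous-function syntax (R >= 4.1) works too
mtcars %>% mutate(across(6, \(x) as.integer(x)))
```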
I'd like to pick a different number of rows from each group of my data frame. I haven't figured out an elegant way to do this with dplyr yet. To pick the same number of rows for each group, I do this:
library(dplyr)
iris %>%
group_by(Species) %>%
arrange(Sepal.Length) %>%
top_n(2)
But I would like to be able to reference another table with the number of rows I'd like for each group, a sample table like this below:
top_rows_desired <- data.frame(Species = unique(iris$Species),
n_desired = c(4,2,5))
We can left_join 'iris' with 'top_rows_desired' by 'Species', group by 'Species', slice the first 'n_desired' rows of each group, and remove the 'n_desired' column with select.
left_join(iris, top_rows_desired, by = "Species") %>%
group_by(Species) %>%
arrange(desc(Sepal.Length)) %>%
slice(seq(first(n_desired))) %>%
select(-n_desired)
Just adding this answer for those folks who are unable to run the code that akrun provided. I struggled with this for a while. This answer tackles issue #2531 on GitHub.
You might not be able to run slice because you already have xgboost loaded in your environment. xgboost masks dplyr's slice function leading to this issue.
Attaching package: ‘xgboost’
The following object is masked from ‘package:dplyr’:
slice
Warning message:
package ‘xgboost’ was built under R version 3.4.1
So using
detach("package:xgboost")
might work for you.
I wasted an hour because of this. Hope this is helpful.