I am trying to make a function which I pass a string variable to dplyr pipeline but having some problem. Like the following
col_spread = "speed".
In select(), I can use get(col_spread) to select the column named speed.
df %>% select(get(col_spread))
However, when I am using spread function in dplyr
df %>% spread(key = Key_col, value = get(col_spread))
Error: Invalid column specification
It doesn't work.
Is NSE the only way to go? If so, what should I do?
Thank you!
Actually get really isn't a great idea. It would be better to use the standard evaulation version of
df %>% select_(col_spread)
and then for spread it would look like
df %>% spread_("Key_col", col_spread)
note which values are quoted and which are not. spread_ expects two character values.
Related
I would like to arrange a variable called "Name" by the number of characters in their Name. I'm aware that I need the arrange() function in the package dplyr, but do not find a function in the arrange() function that helps me to arrange based on numbers of characters in the name.
So far I have come up with: arrange((Name))
Is there someone who can help me with this?
Here's a simple workaround with dplyr package and iris data:
library(dplyr)
iris %>%
mutate(Species = as.character(Species)) %>% # Convert factor to characters
arrange(nchar(Species))
I am trying to rewrite this expression to magrittr’s pipe operator:
print(mean(pull(df, height), na.rm=TRUE))
which returns 175.4 for my dataset.
I know that I have to start with the data frame and write it as >df%>% but I’m confused about how to write it inside out. For example, should the na.rm=TRUE go inside mean(), pull() or print()?
UPDATE: I actually figured it out by trial and error...
>df%>%
+pull(height)%>%
+mean(na.rm=TRUE)
+print()
returns 175.4
It would be good practice to make a reproducible example, with dummy data like this:
height <- seq(1:30)
weight <- seq(1:30)
df <- data.frame(height, weight)
These pipe operators work with the majority of the tidyverse (not just magrittr). What you are trying to do is actually coming out of dplyr. The na.rm=T is required for many summary variables like mean, sd, as well as certain functions used to gather specific data points like min, max, etc. These functions don't play well with NA values.
df %>% pull(height) %>% mean(na.rm=T) %>% print()
Unless your data is nested you may not even need to use pull
df %>% summarise(mean = mean(height,na.rm=T))
Also, using summarise you can pipe these into another dataframe rather than just printing, and call them out of the dataframe whenever you want.
df %>% summarise(meanHt = mean(height,na.rm=T), sdHt = sd(height,na.rm=T)) -> summary
summary[1]
summary[2]
I want to create a plot for each column in a vector called dates. My data frame contains only these columns and I want to group on it, count the occurrences and then plot it.
Below code works, except for map which I want to use to go across a previously unknown number of columns. I think I'm using map correctly, I've had success with it before. I'm new to using quosures but given that my function call works I'm not sure what is wrong. I've looked at several other posts that appear to be set up this way.
df <- data.frame(
date1 = c("2018-01-01","2018-01-01","2018-01-01","2018-01-02","2018-01-02","2018-01-02"),
date2 = c("2018-01-01","2018-01-01","2018-01-01","2018-01-02","2018-01-02","2018-01-02"),
stringsAsFactors = FALSE
)
dates<-names(df)
library(tidyverse)
dates.count<-function(.x){
group_by<-enquo(.x)
df %>% group_by(!!group_by) %>% summarise(count=n()) %>% ungroup() %>% ggplot() + geom_point(aes(y=count,x=!!group_by))
}
dates.count(date1)
map(dates,~dates.count(.x))
I get this error: Error in grouped_df_impl(data, unname(vars), drop) : Column .x is unknown
When you pass the variable names to map() you are using strings, which indicates you need ensym() instead of enquo().
So your function would look like
dates.count <- function(.x){
group_by = ensym(.x)
df %>%
group_by(!!group_by) %>%
summarise(count=n()) %>%
ungroup() %>%
ggplot() +
geom_point(aes(y=count,x=!!group_by))
}
And you would use the variable names as strings for the argument.
dates.count("date2")
Note that tidyeval doesn't always play nicely with the formula interface of map() (I think I'm remembering that correctly). You can always do an anonymous function instead, but in your case where you want to map the column names to a function with a single argument you can just do
map(dates, dates.count)
Using the formula interface in map() I needed an extra !!:
map(dates, ~dates.count(!!.x))
I'm trying as per
dplyr mutate using variable columns
&
dplyr - mutate: use dynamic variable names
to use dynamic names in mutate. What I am trying to do is to normalize column data by groups subject to a minimum standard deviation. Each column has a different minimum standard deviation
e.g. (I omitted loops & map statements for convenience)
require(dplyr)
require(magrittr)
data(iris)
iris <- tbl_df(iris)
minsd <- c('Sepal.Length' = 0.8)
varname <- 'Sepal.Length'
iris %>% group_by(Species) %>% mutate(!!varname := mean(pluck(iris,varname),na.rm=T)/max(sd(pluck(iris,varname)),minsd[varname]))
I got the dynamic assignment & variable selection to work as suggested by the reference answers. But group_by() is not respected which, for me at least, is the main benefit of using dplyr here
desired answer is given by
iris %>% group_by(Species) %>% mutate(!!varname := mean(Sepal.Length,na.rm=T)/max(sd(Sepal.Length),minsd[varname]))
Is there a way around this?
I actually did not know much about pluck, so I don't know what went wrong, but I would go for this and this works:
iris %>%
group_by(Species) %>%
mutate(
!! varname :=
mean(!!as.name(varname), na.rm = T) /
max(sd(!!as.name(varname)),
minsd[varname])
)
Let me know if this isn't what you were looking for.
The other answer is obviously the best and it also solved a similar problem that I have encountered. For example, with !!as.name(), there is no need to use group_by_() (or group_by_at or arrange_() (or arrange_at()).
However, another way is to replace pluck(iris,varname) in your code with .data[[varname]]. The reason why pluck(iris,varname) does not work is that, I suppose, iris in pluck(iris,varname) is not grouped. However, .data refer to the tibble that executes mutate(), and so is grouped.
An alternative to as.name() is rlang::sym() from the rlang package.
I use dplyr group_by function to group my data frame,
and need to be able to group the data, by a column, i don't know the name of the column yet, i need to decide it along the code, so the name can't be hard coded.
for example,
i can't use
data %>% group_by(col_name)
i need to do somthing like
data %>% c <- col_name
data %>% group_by(c)
when i try doing so, it popes error:
Error: unknown variable to group by : c
All the examples I find are for the trevial case when you can hard code the name of the column
group by example
Same in the r help
Thanks.
You would like to look up NSE as others have said in their comments. Using that also requires you to use lazyeval package, and group_by_ function, which allows you to you standard evaluation. So it will look like:
data %>% group_by_(lazyeval::interp(~var, var = as.name(c)))