Using dplyr, how to pipe or chain to plot()?

I am new to the dplyr package and am trying to use it for my visualization assignment. I am able to pipe my data to ggplot() but unable to do that with plot(). I came across this post, but none of the answers, including the one in the comments, worked for me.
Code 1:
emission <- mynei %>%
select(Emissions, year) %>%
group_by(year) %>%
summarise (total=sum(Emissions))
emission %>%
plot(year, total,.)
I get the following error:
Error in plot(year, total, emission) : object 'year' not found
Code 2:
mynei %>%
select(Emissions, year) %>%
group_by(year) %>%
summarise (total=sum(Emissions))%>%
plot(year, total, .)
This didn't work either and returned the same error.
Interestingly, the solution from the post I mentioned works for the same dataset but doesn't work out for my own data. However, I am able to create the plot using emission$year and emission$total.
Am I missing anything?

plot.default doesn't take a data argument, so your best bet is to pipe to with:
mynei %>%
select(Emissions, year) %>%
group_by(year) %>%
summarise (total=sum(Emissions))%>%
with(plot(year, total))
In case anyone missed @aosmith's comment on the question, plot.formula does have a data argument, but of course the formula is the first argument so we need to use the . to put the data in the right place. So another option is
... %>%
plot(total ~ year, data = .)
Of course, ggplot takes data as the first argument, so to use ggplot do:
... %>%
ggplot(aes(x = year, y = total)) + geom_point()
lattice::xyplot is like plot.formula: there is a data argument, but it's not first, so:
... %>%
xyplot(total ~ year, data = .)
Just look at the documentation and make sure you use a . if data isn't the first argument. If there's no data argument at all, using with is a good work-around.
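To see why the original plot(year, total, .) fails: the dot is passed as a third argument, while year and total are still looked up as ordinary variables rather than as columns. Here is a small self-contained sketch of the two base-plot options above. The mynei data frame is a made-up stand-in for the asker's data, so treat the numbers as placeholders.

```r
library(dplyr)

# Hypothetical stand-in for the asker's `mynei` data (values are invented)
mynei <- data.frame(
  year = rep(c(1999, 2002, 2005, 2008), each = 3),
  Emissions = c(10, 20, 30, 8, 16, 24, 6, 12, 18, 4, 8, 12)
)

emission <- mynei %>%
  group_by(year) %>%
  summarise(total = sum(Emissions))

# plot.default has no data argument, so evaluate the call inside the data
emission %>% with(plot(year, total))

# plot.formula has a data argument, but it is not first, so place the dot explicitly
emission %>% plot(total ~ year, data = .)
```

Both calls should draw the same scatter plot of total against year.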

As an alternative, you can use the %$% operator from magrittr to be able to access the columns of a dataframe directly. For example:
iris %$%
plot(Sepal.Length~Sepal.Width)
This is useful many times when you need to feed the result of a dplyr chain to a base R function (such as table, lm, plot, etc). It can also be used to extract a column from a dataframe as a vector, e.g.:
iris %>% filter(Species=='virginica') %$% Sepal.Length
This is the same as:
iris %>% filter(Species=='virginica') %>% pull(Sepal.Length)
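One thing that is easy to trip over: %$% comes from magrittr, and loading dplyr alone only gives you %>%. A minimal sketch using the built-in mtcars data, with cor() and table() picked only as examples of base functions that have no data argument:

```r
library(magrittr)  # %$% is exported by magrittr; library(dplyr) does not attach it

# Expose the columns of mtcars to base R functions that take vectors
r <- mtcars %$% cor(mpg, wt)
tab <- mtcars %$% table(cyl, gear)
```

Here r should match cor(mtcars$mpg, mtcars$wt), and tab is the usual two-way contingency table.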

Related

How to use R function in Sparklyr

I have been searching online for how to use an R function but am still having a hard time figuring it out. Please help.
My initial code looks like:
whatever %>%
group_by(a) %>%
summarize(count=n()) %>%
collect() %>%
ggplot(aes(x=a, y=count)) +
geom_point()
I want to repeat this multiple times since there are other columns I want to check with the same function.
So I wrote:
point_dist <- function(dta, vari) {
dta %>%
group_by(vari) %>%
summarize(count=n()) %>%
collect() %>%
ggplot(aes(x=vari, y=count)) +
geom_point()
}
point_dist(whatever, a)
but it keeps telling me:
Error in eval_bare(sym, env) : object 'a' not found
I don't know why.
I also don't know if this is the right direction to go.
Thanks again.
Your issue is related to non-standard evaluation that dplyr functions tend to give you. When you reference a in your first call to point_dist, R attempts to evaluate it, which of course fails. (It's even more confusing when you have some variable named as such in your calling environment or higher ...)
NSE in dplyr means you can do something like select(mtcars, cyl), whereas with most standard-evaluation functions, you'll need myfunc(mtcars, "cyl"), since there isn't a variable named cyl in the calling environment.
In your case, try:
point_dist <- function(dta, vari) {
vari <- enquo(vari)
dta %>%
group_by(!!vari) %>%
summarize(count=n()) %>%
collect() %>%
ggplot(aes(x=!!vari, y=count)) +
geom_point()
}
This method of dealing with unquoted column-names in your functions can be confusing if you're familiar with normal R function definitions and/or are not familiar with NSE. This can be a good template for you if that's as far as you're going to go with it, otherwise I strongly urge you to read a little more at the first reference below.
Some good references for NSE, specifically in/around tidyverse stuff:
https://dplyr.tidyverse.org/articles/programming.html
http://adv-r.had.co.nz/Computing-on-the-language.html
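For what it's worth, newer versions of rlang (0.4.0 and later) also offer the embrace operator {{ }}, which does the enquo() and !! steps in one go. This is a sketch of the same function written that way, run on the local mtcars data rather than a Spark table, so it is an illustration of the pattern and not something tested against sparklyr.

```r
library(dplyr)
library(ggplot2)

# Same idea as enquo() + !!, written with the embrace operator
point_dist <- function(dta, vari) {
  dta %>%
    group_by({{ vari }}) %>%
    summarize(count = n()) %>%
    collect() %>%   # does nothing extra for a local data frame
    ggplot(aes(x = {{ vari }}, y = count)) +
    geom_point()
}

# The column name is passed unquoted, just like in dplyr verbs
p <- point_dist(mtcars, cyl)
```

Printing p draws one point per distinct value of cyl.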
If you are summarising your data and piping to ggplot, then you don't need to use collect().
df <- data.frame(group=sample(letters[1:10],1000,T))
df %>% group_by(group) %>% summarise(n=n()) %>%
ggplot(aes(group,n)) + geom_point()
If you are going to apply this summary and plot method to multiple columns, I would suggest trying gather() and then plotting everything at once using + facet_wrap() and bar plots.
df <- data.frame(matrix(sample(letters[1:10],10000,T),ncol = 10))
df %>% gather(k,v) %>% group_by(k,v) %>% summarise(n=n()) %>%
ggplot(aes(k,n,fill=v)) + geom_bar(stat='identity') +
facet_wrap(~v) + theme(legend.position = 'none')

R : doesn't recognise column in a new table

This is part of an online course I am doing, R for data analysis.
A tibble is created using the group_by and summarise functions on the diamonds data set - the new tibble indeed exists and looks as you would expect, I checked. Now a bar plot has to be created using these summary values in the new tibble, but it gives me all sorts of errors associated with not recognising the columns.
I transformed the tibble into a data frame, and still get the same problem.
Here is the code:
diamonds_by_color <- group_by(diamonds, color)
diamonds_mp_by_color <- summarise(diamonds_by_color, mean_price = mean(price))
diamonds_mp_by_color <- as.data.frame(diamonds_mp_by_color)
colorcounts <- count(diamonds_by_color$mean_price)
colorbarplot <- barplot(diamonds_by_color$mean_price, names.arg = diamonds_by_color$color,
main = "Average price for different colour diamonds")
The error I get when running the function count is:
Error in UseMethod("summarise_") :
no applicable method for 'summarise_' applied to an object of class "NULL"
In addition: Warning message:
Unknown or uninitialised column: 'mean_price'.
It's probably something trivial but I have been reading quite a lot and tried a few things and can't figure it out. Any help will be super appreciated :)
Your diamonds_by_color never has mean_price assigned to it.
Your last two lines of code work if you reference diamonds_mp_by_color instead:
colorcounts <- count(diamonds_mp_by_color, mean_price)
barplot(diamonds_mp_by_color$mean_price,
names.arg=diamonds_mp_by_color$color,
main="Average price for different colour diamonds")
Here is a way to summarise the price by color using dplyr and piping straight to a barplot using ggplot2.
diamonds %>% group_by(color) %>%
summarise(mean.price=mean(price,na.rm=TRUE)) %>%
ggplot(aes(color,mean.price)) + geom_bar(stat='identity')
Best dplyr idiom is not to declare a temporary result for each operation. Just do one big pipe; also the %>% notation is clearer because you don't have to keep specifying which dataframe as the first arg in each operation:
diamonds %>%
group_by(color) %>%
summarise(mean_price = mean(price)) %>%
with(barplot(mean_price, names.arg = as.character(color),
main = "Average price for different colour diamonds"))
(barplot() has no data argument, so the summary goes through with(). You can assign the output of the pipe before the barplot if you like.)

With dplyr and enquo my code works but not when I pass to purrr::map

I want to create a plot for each column in a vector called dates. My data frame contains only these columns and I want to group on it, count the occurrences and then plot it.
Below code works, except for map which I want to use to go across a previously unknown number of columns. I think I'm using map correctly, I've had success with it before. I'm new to using quosures but given that my function call works I'm not sure what is wrong. I've looked at several other posts that appear to be set up this way.
df <- data.frame(
date1 = c("2018-01-01","2018-01-01","2018-01-01","2018-01-02","2018-01-02","2018-01-02"),
date2 = c("2018-01-01","2018-01-01","2018-01-01","2018-01-02","2018-01-02","2018-01-02"),
stringsAsFactors = FALSE
)
dates<-names(df)
library(tidyverse)
dates.count<-function(.x){
group_by<-enquo(.x)
df %>% group_by(!!group_by) %>% summarise(count=n()) %>% ungroup() %>% ggplot() + geom_point(aes(y=count,x=!!group_by))
}
dates.count(date1)
map(dates,~dates.count(.x))
I get this error: Error in grouped_df_impl(data, unname(vars), drop) : Column .x is unknown
When you pass the variable names to map() you are using strings, which indicates you need ensym() instead of enquo().
So your function would look like
dates.count <- function(.x){
group_by = ensym(.x)
df %>%
group_by(!!group_by) %>%
summarise(count=n()) %>%
ungroup() %>%
ggplot() +
geom_point(aes(y=count,x=!!group_by))
}
And you would use the variable names as strings for the argument.
dates.count("date2")
Note that tidyeval doesn't always play nicely with the formula interface of map() (I think I'm remembering that correctly). You can always do an anonymous function instead, but in your case where you want to map the column names to a function with a single argument you can just do
map(dates, dates.count)
Using the formula interface in map() I needed an extra !!:
map(dates, ~dates.count(!!.x))
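If the column names are always going to arrive as strings, another route that avoids quosures is to select the column with all_of() and refer to it in aes() through the .data pronoun. This is only a sketch, assuming dplyr 1.0 or later and a recent ggplot2; dates_count is a renamed, hypothetical variant of the function above.

```r
library(dplyr)
library(ggplot2)
library(purrr)

df <- data.frame(
  date1 = c("2018-01-01", "2018-01-01", "2018-01-01", "2018-01-02", "2018-01-02", "2018-01-02"),
  date2 = c("2018-01-01", "2018-01-01", "2018-01-01", "2018-01-02", "2018-01-02", "2018-01-02"),
  stringsAsFactors = FALSE
)

# `col` is a column name given as a string
dates_count <- function(col) {
  df %>%
    group_by(across(all_of(col))) %>%
    summarise(count = n(), .groups = "drop") %>%
    ggplot() +
    geom_point(aes(x = .data[[col]], y = count)) +
    labs(x = col)
}

# One plot per column; works with or without the formula interface
plots <- map(names(df), dates_count)
```

Since col is an ordinary string inside the function, nothing needs unquoting when it is called from map().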

How do pipes work with purrr map() function and the "." (dot) symbol

When using both pipes and the map() function from purrr, I am confused about how data and variables are passed along. For instance, this code works as I expect:
library(tidyverse)
cars %>%
select_if(is.numeric) %>%
map(~hist(.))
Yet, when I try something similar using ggplot, it behaves in a strange way.
cars %>%
select_if(is.numeric) %>%
map(~ggplot(cars, aes(.)) + geom_histogram())
I'm guessing this is because the "." in this case is passing a vector to aes(), which is expecting a column name. Either way, I wish I could pass each numeric column to a ggplot function using pipes and map(). Thanks in advance!
cars %>%
select_if(is.numeric) %>%
map2(., names(.),
~{ggplot(tibble(var = .x), aes(var)) +
geom_histogram() +
labs(x = .y) })
# Alternate version
cars %>%
select_if(is.numeric) %>%
imap(.,
~{ggplot(tibble(var = .x), aes(var)) +
geom_histogram() +
labs(x = .y) })
There are a few extra steps.
Use map2 instead of map. The first argument is the dataframe you're passing it, and the second argument is a vector of the names of that dataframe, so it knows what to map over. (Alternately, imap(x, ...) is a synonym for map2(x, names(x), ...). It's an "index-map", hence "imap".)
You then need to explicitly enframe your data, since ggplot only works on dataframes and coercible objects.
This also gives you access to the .y pronoun to name the plots.
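A quick way to convince yourself of the imap/map2 equivalence, using a tiny made-up list instead of plots so the result is easy to inspect:

```r
library(purrr)

x <- list(a = 1, b = 2)

# .x is the element, .y is its name in both calls
via_imap <- imap(x, ~ paste(.y, .x))
via_map2 <- map2(x, names(x), ~ paste(.y, .x))

same <- identical(via_imap, via_map2)
```

For a named list like this one, same should come out TRUE.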
You aren't supposed to pass raw data to an aesthetic mapping. Instead you should dynamically build the data.frame. For example
cars %>%
select_if(is.numeric) %>%
map(~ggplot(tibble(x=.), aes(x)) + geom_histogram())

How can you obtain the group_by value for use in passing to a function?

I am trying to use dplyr to apply a function to a data frame that is grouped using the group_by function. I am applying a function to each row of the grouped data using do(). I would like to obtain the value of the group_by variable so that I might use it in a function call.
So, effectively, I have-
tmp <-
my_data %>%
group_by(my_grouping_variable) %>%
do(my_function_call(data.frame(x = .$X, y = .$Y),
GROUP_BY_VARIABLE))
I'm sure that I could call unique and get it...
do(my_function_call(data.frame(x = .$X, y = .$Y),
unique(.$my_grouping_variable)))
But, it seems clunky and would inefficiently call unique for every grouping value.
Is there a way to get the value of the group_by variable in dplyr?
I'm going to prematurely say sorry if this is a crazy easy thing to answer. I promise that I've exhaustively searched for an answer.
First, if necessary, check if it's a grouped data frame: inherits(data, "grouped_df").
If you want the subsets of data frames, you could nest the groups:
mtcars %>% group_by(cyl) %>% nest()
Usually, you won't nest within the pipe-chain, but check in your function:
your_function <- function(x) {
if(inherits(x, "grouped_df")) x <- nest(x)
x
}
Your function should then iterate over the list-column data with all grouped subsets. If you use a function within mutate, e.g.
mtcars %>% group_by(cyl) %>% mutate(abc = your_function_call(.x))
then note that your function directly receives the values for each group, passed as class structure. It's a bit difficult to explain, just try it out and debug your_function_call step by step...
You can use groups(), however a SE version of this does not exist so I'm unsure of its use in programming.
library(dplyr)
df <- mtcars %>% group_by(cyl, mpg)
groups(df)
[[1]]
cyl
[[2]]
mpg
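In more recent dplyr versions there are two helpers that seem to cover the original question: group_vars() returns the grouping column names as a character vector, and group_modify() hands each group's key to your function as .y. The sketch below assumes a reasonably recent dplyr; my_function_call is a hypothetical placeholder, and mpg and wt merely stand in for the asker's X and Y.

```r
library(dplyr)

# Hypothetical per-group function: gets the group's rows and its key value
my_function_call <- function(d, key) {
  data.frame(group_value = key, mean_x = mean(d$x))
}

result <- mtcars %>%
  group_by(cyl) %>%
  # .x is the subset for one group, .y is a one-row tibble holding its key
  group_modify(~ my_function_call(data.frame(x = .x$mpg, y = .x$wt), .y$cyl))

# Grouping column names as plain strings, handy when programming
vars <- group_vars(group_by(mtcars, cyl, mpg))
```

This avoids calling unique() on the grouping column inside every group.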
