How to use R function in Sparklyr

I have been researching online how to wrap sparklyr/dplyr code in an R function, but I'm still having a hard time figuring it out. Please help.
My initial code looks like:
whatever %>%
group_by(a) %>%
summarize(count=n()) %>%
collect() %>%
ggplot(aes(x=a, y=count)) +
geom_point()
I want to repeat this multiple times since there are other columns I want to check with the same function.
So I wrote:
point_dist <- function(dta, vari) {
dta %>%
group_by(vari) %>%
summarize(count=n()) %>%
collect() %>%
ggplot(aes(x=vari, y=count)) +
geom_point()
}
point_dist(whatever, a)
but R keeps telling me:
Error in eval_bare(sym, env) : object 'a' not found
I don't know why.
I also don't know whether this is the right direction to go in.
Thanks again.

Your issue is related to non-standard evaluation that dplyr functions tend to give you. When you reference a in your first call to point_dist, R attempts to evaluate it, which of course fails. (It's even more confusing when you have some variable named as such in your calling environment or higher ...)
NSE in dplyr means you can do something like select(mtcars, cyl), whereas with most standard-evaluation functions, you'll need myfunc(mtcars, "cyl"), since there isn't a variable named cyl in the calling environment.
In your case, try:
point_dist <- function(dta, vari) {
vari <- enquo(vari)
dta %>%
group_by(!!vari) %>%
summarize(count=n()) %>%
collect() %>%
ggplot(aes(x=!!vari, y=count)) +
geom_point()
}
This method of dealing with unquoted column-names in your functions can be confusing if you're familiar with normal R function definitions and/or are not familiar with NSE. This can be a good template for you if that's as far as you're going to go with it, otherwise I strongly urge you to read a little more at the first reference below.
Some good references for NSE, specifically in/around tidyverse stuff:
https://dplyr.tidyverse.org/articles/programming.html
http://adv-r.had.co.nz/Computing-on-the-language.html
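For readers on newer package versions, the same idea can be sketched with the embrace operator `{{ }}`, which combines enquo() and !! in one step. This is a minimal sketch rather than the answerer's method: it assumes rlang 0.4.0 or later and ggplot2 3.0.0 or later, uses the local mtcars data instead of a Spark table, and the helper name count_by is hypothetical.

```r
library(dplyr)
library(ggplot2)

# Hypothetical helper: count rows per level of one unquoted column.
count_by <- function(dta, vari) {
  dta %>%
    group_by({{ vari }}) %>%     # {{ }} forwards the caller's column name
    summarise(count = n()) %>%
    collect()                    # brings the result into R; leaves local data frames unchanged
}

# Same shape as the function in the answer, built on the helper above.
point_dist <- function(dta, vari) {
  count_by(dta, {{ vari }}) %>%
    ggplot(aes(x = {{ vari }}, y = count)) +
    geom_point()
}

counts <- count_by(mtcars, cyl)
p <- point_dist(mtcars, cyl)
```

Splitting the counting step from the plotting step also makes it easy to inspect the summarised table before drawing anything.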

If you are summarising your data and piping to ggplot, then you don't need to use collect().
df <- data.frame(group=sample(letters[1:10],1000,T))
df %>% group_by(group) %>% summarise(n=n()) %>%
ggplot(aes(group,n)) + geom_point()
If you are going to apply this summary and plot method to multiple columns, I would suggest trying gather() and then plotting everything at once using + facet_wrap() and bar plots.
df <- data.frame(matrix(sample(letters[1:10],10000,T),ncol = 10))
df %>% gather(k,v) %>% group_by(k,v) %>% summarise(n=n()) %>%
ggplot(aes(k,n,fill=v)) + geom_bar(stat='identity') +
facet_wrap(~v) + theme(legend.position = 'none')

Related

Is it possible to use group_by in a function for more than one variable?

I created a function that aggregates the numeric values in a dataset, and I use a group_by() function to group the data first. Below is an example of what the code I wrote looks like. Is there a way I can group_by() more than one variable without having to create another input for the function?
agg <- function(data, group){
aggdata <- data %>%
group_by({{group}}) %>%
select_if(function(col) !is.numeric(col) & !is.integer(col)) %>%
summarise_if(is.numeric, sum, na.rm = TRUE)
return(aggdata)

Your code has (at least) a misplaced curly brace, and it's a bit difficult to see what you're trying to accomplish without a reproducible example and desired result.
It is possible to pass a vector of variable names to group_by(). For example, the following produces the same result as mtcars %>% group_by(cyl, gear):
my_groups <- c("cyl", "gear")
mtcars %>% group_by(!!!syms(my_groups))
Maybe you could use this syntax within your function definition.
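As a rough illustration of that suggestion, the function could take a character vector of grouping columns. This is only a sketch under assumptions: the name agg_by is hypothetical, across() and where() require dplyr 1.0.0 or later, and summing every numeric column is a guess at what the question intends.

```r
library(dplyr)
library(rlang)

# Hypothetical helper: group by any number of columns given as strings,
# then sum every remaining numeric column.
agg_by <- function(data, groups) {
  data %>%
    group_by(!!!syms(groups)) %>%
    summarise(across(where(is.numeric), ~ sum(.x, na.rm = TRUE)),
              .groups = "drop")
}

res <- agg_by(mtcars, c("cyl", "gear"))
```

Because the groups arrive as one vector, the function needs no extra argument when a second or third grouping variable is added.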

more trouble with the R `{}` wrapper and dplyr

An earlier question I had (with answer) illustrates how the {} wrapper prevents piping to the first possible argument. Now I'm playing with this idea in the following manner.
# this all works
library(tidyverse)
mt <- mtcars %>% count(cyl)
seq_along(mt$cyl)
The code chunk above works. Neither of the two below works. I get the error "Error in function_list[k] : object 'cyl' not found". What did I do wrong this time?
# does not work
mtcars %>%
count(cyl) %>%
{seq_along(cyl)}
#does not work
mtcars %>%
count(cyl) %>%
seq_along(cyl)
If none of this makes sense, all I really need is the simplest example of how the {} wrapper works with dplyr. Thank you.

You would need
mtcars %>%
count(cyl) %>%
{seq_along(.$cyl)}
The object is still passed as . with the braces, but it's not automatically inserted as the first argument.
In your first case
mtcars %>%
count(cyl) %>%
{seq_along(cyl)}
is the same as these two separate commands
count(mtcars, cyl)
seq_along(cyl)
because you never actually use anything from the chain. And your second case
mtcars %>%
count(cyl) %>%
seq_along(cyl)
is the same as
seq_along(count(mtcars, cyl), cyl)
which doesn't work because seq_along isn't a tidy-friendly function in that it doesn't accept a data source as the first parameter.
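A small runnable check of the braces behaviour described above, using only mtcars. The pull() variant is an alternative added here for comparison, not part of the answer.

```r
library(dplyr)

# Inside braces the piped value is only available as `.`.
idx_braces <- mtcars %>%
  count(cyl) %>%
  {seq_along(.$cyl)}

# Alternative: extract the column first, then pipe the vector as usual.
idx_pull <- mtcars %>%
  count(cyl) %>%
  pull(cyl) %>%
  seq_along()
```

Both give one index per distinct cyl value.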

How do pipes work with purrr map() function and the "." (dot) symbol

When using both pipes and the map() function from purrr, I am confused about how data and variables are passed along. For instance, this code works as I expect:
library(tidyverse)
cars %>%
select_if(is.numeric) %>%
map(~hist(.))
Yet, when I try something similar using ggplot, it behaves in a strange way.
cars %>%
select_if(is.numeric) %>%
map(~ggplot(cars, aes(.)) + geom_histogram())
I'm guessing this is because the "." in this case is passing a vector to aes(), which is expecting a column name. Either way, I wish I could pass each numeric column to a ggplot function using pipes and map(). Thanks in advance!

cars %>%
select_if(is.numeric) %>%
map2(., names(.),
~{ggplot(tibble(var = .x), aes(var)) +
geom_histogram() +
labs(x = .y) })
# Alternate version
cars %>%
select_if(is.numeric) %>%
imap(.,
~{ggplot(tibble(var = .x), aes(var)) +
geom_histogram() +
labs(x = .y) })
There are a few extra steps.
Use map2 instead of map. The first argument is the data frame you're passing it, and the second argument is a vector of that data frame's column names, so each column is paired with its name. (Alternatively, imap(x, ...) is shorthand for map2(x, names(x), ...). It's an "index map", hence "imap".)
You then need to explicitly wrap each column in a data frame, since ggplot only works on data frames and coercible objects.
This also gives you access to the .y pronoun to label the plots.

You aren't supposed to pass raw data to an aesthetic mapping. Instead you should dynamically build the data.frame. For example
cars %>%
select_if(is.numeric) %>%
map(~ggplot(tibble(x=.), aes(x)) + geom_histogram())
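Another way to get there, sketched here as an alternative rather than as either answerer's method, is to map over the column names and look each column up with the .data pronoun inside aes(). This assumes ggplot2 3.0.0 or later; the bins value is an arbitrary choice to avoid the default binning message.

```r
library(dplyr)
library(purrr)
library(ggplot2)

# Names of the numeric columns in the built-in cars data.
num_cols <- cars %>%
  select_if(is.numeric) %>%
  names()

# One histogram per column; .x is the column name as a string.
plots <- num_cols %>%
  set_names() %>%
  map(~ ggplot(cars, aes(x = .data[[.x]])) +
        geom_histogram(bins = 10) +
        labs(x = .x))
```

The result is a list of plots named after the columns, so plots$speed and plots$dist can be printed individually.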

R: Using piping to pass a single argument to multiple locations in a function

I am attempting to exclusively use piping to rewrite the following code (using the babynames data from the babynames package):
library(babynames)
library(dplyr)
myDF <- babynames %>%
group_by(year) %>%
summarise(totalBirthsPerYear = sum(n))
slice(myDF, seq(1, nrow(myDF), by = 20))
The closest I have gotten is this code (not working):
myDF <- babynames %>%
group_by(year) %>%
summarise(totalBirthsPerYear = sum(n)) %>%
slice( XXX, seq(1, nrow(XXX), by = 20))
where XXX is meant to be passed via pipes to slice, but I'm stuck. Any help is appreciated.

You can reference piped data in a different position in the function by using the dot (.). In your case:
myDF2 <- babynames %>%
group_by(year) %>%
summarize(totalBirthsPerYear = sum(n)) %>%
slice(seq(1, nrow(.), by = 20))
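The same pattern can be tried without installing babynames. The data frame below is made up purely for illustration, and the n() variant is an alternative that avoids the dot altogether.

```r
library(dplyr)

# Made-up stand-in for the yearly summary: 100 consecutive years.
df <- data.frame(year = 1880:1979, total = seq_len(100))

# The dot refers to the piped data frame, even inside a nested call.
every20_dot <- df %>%
  slice(seq(1, nrow(.), by = 20))

# n() gives the number of rows inside slice(), so no dot is needed.
every20_n <- df %>%
  slice(seq(1, n(), by = 20))
```

Both keep rows 1, 21, 41, 61 and 81.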

Not sure if this should be opened as a separate question & answer but in case anybody arrives here as I did looking for the answer to the MULTIPLE in the title:
R: Using piping to pass a single argument to multiple locations in a function
Using the . from Andrew's answer in multiple places also achieves this.
For example, given a vector vec <- c("first", "middle", "last"),
we could get its last element with this code.
vec[length(vec)]
Using piping, the following code achieves the same thing:
vec %>% .[length(.)]
Hopefully this is helpful to others as it would have helped me (I knew about the . but couldn't get it working in multiple locations).

Using dplyr, how to pipe or chain to plot()?

I am new to the dplyr package and am trying to use it for my visualization assignment. I am able to pipe my data to ggplot() but unable to do so with plot(). I came across this post, but the answers, including the one in the comments, didn't work for me.
Code 1:
emission <- mynei %>%
select(Emissions, year) %>%
group_by(year) %>%
summarise (total=sum(Emissions))
emission %>%
plot(year, total,.)
I get the following error:
Error in plot(year, total, emission) : object 'year' not found
Code 2:
mynei %>%
select(Emissions, year) %>%
group_by(year) %>%
summarise (total=sum(Emissions))%>%
plot(year, total, .)
This didn't work either and returned the same error.
Interestingly, the solution from the post I mentioned works for the same dataset but doesn't work out for my own data. However, I am able to create the plot using emission$year and emission$total.
Am I missing anything?

plot.default doesn't take a data argument, so your best bet is to pipe to with:
mynei %>%
select(Emissions, year) %>%
group_by(year) %>%
summarise (total=sum(Emissions))%>%
with(plot(year, total))
In case anyone missed @aosmith's comment on the question, plot.formula does have a data argument, but of course the formula is the first argument so we need to use the . to put the data in the right place. So another option is
... %>%
plot(total ~ year, data = .)
Of course, ggplot takes data as the first argument, so to use ggplot do:
... %>%
ggplot(aes(x = year, y = total)) + geom_point()
lattice::xyplot is like plot.formula: there is a data argument, but it's not first, so:
... %>%
xyplot(total ~ year, data = .)
Just look at the documentation and make sure you use a . if data isn't the first argument. If there's no data argument at all, using with is a good work-around.
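Here is a self-contained version of the first two options, using mtcars in place of the asker's mynei data (the column choices are arbitrary). The pdf(NULL) call is an addition that lets the code run in a non-interactive session without writing a file; drop it to see the plots on screen.

```r
library(dplyr)

# Stand-in for the emissions summary: total horsepower per cylinder count.
totals <- mtcars %>%
  group_by(cyl) %>%
  summarise(total = sum(hp))

grDevices::pdf(NULL)  # null graphics device, nothing is written to disk

# Option 1: with() evaluates plot() using the columns of the piped data.
totals %>%
  with(plot(cyl, total))

# Option 2: the formula method has a data argument, filled via the dot.
totals %>%
  plot(total ~ cyl, data = .)

invisible(grDevices::dev.off())
```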

As an alternative, you can use the %$% operator from magrittr to be able to access the columns of a dataframe directly. For example:
iris %$%
plot(Sepal.Length~Sepal.Width)
This is often useful when you need to feed the result of a dplyr chain to a base R function (such as table, lm, plot, etc). It can also be used to extract a column from a dataframe as a vector, e.g.:
iris %>% filter(Species=='virginica') %$% Sepal.Length
This is the same as:
iris %>% filter(Species=='virginica') %>% pull(Sepal.Length)
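A quick non-graphical check of what %$% does, comparing it with the explicit $ form. Note that %$% lives in magrittr and is not attached by library(dplyr), so magrittr has to be loaded explicitly.

```r
library(magrittr)

# %$% exposes the columns of the left-hand side to the expression on the right.
r_exposed <- mtcars %$% cor(mpg, wt)

# Equivalent base R call with explicit column extraction.
r_dollar <- cor(mtcars$mpg, mtcars$wt)
```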
