Ungroup SparkR data frame - r

I have a spark data frame:
library(SparkR); library(magrittr)
as.DataFrame(mtcars) %>%
groupBy("am")
How can I ungroup this data frame? There doesn't seem to be any ungroup function in the SparkR library!

There doesn't seem to be any ungroup function in the SparkR library
That's because groupBy doesn't have the same meaning as group_by in dplyr.
SparkR::group_by / SparkR::groupBy returns not a SparkDataFrame but a GroupedData object, which corresponds to the GROUP BY clause in SQL. To convert it back to a SparkDataFrame, call SparkR::agg (or, if you prefer dplyr nomenclature, SparkR::summarize), which corresponds to the SELECT component of the SQL query.
Once you aggregate, you get back a SparkDataFrame and the grouping is no longer present.
Additionally, SparkR::groupBy has no equivalent of dplyr's group_by(...) %>% mutate(...). Instead, use window functions with a frame definition.
So the takeaway is: if you don't plan to aggregate, don't use groupBy.
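As a minimal sketch of the two routes described above, assuming a Spark installation that sparkR.session() can start locally (the column names avg_mpg and am_avg_mpg are illustrative and not from the original post):

```r
library(SparkR)

sparkR.session()  # assumption: a local Spark installation is available

df <- as.DataFrame(mtcars)

# groupBy() returns a GroupedData object, not a SparkDataFrame
grouped <- groupBy(df, "am")

# Aggregating turns it back into a SparkDataFrame with one row per group
result <- agg(grouped, avg_mpg = avg(df$mpg))
head(result)

# Per-group value on every row (roughly dplyr's group_by() %>% mutate()):
# use a window partitioned by the grouping column instead of groupBy()
ws <- windowPartitionBy("am")
with_avg <- withColumn(df, "am_avg_mpg", over(avg(df$mpg), ws))
head(with_avg)
```

In the second half no grouped object is ever created, so there is nothing to ungroup: with_avg is an ordinary SparkDataFrame.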

Related

Alternative to slice() function in sparklyr

Currently I'm using this code to keep one row per Loan_ID, namely the row whose Execution_Date is closest to a predetermined date (in my case "2022-03-31").
The example is as follows:
library(dplyr)
df %>%
group_by(Loan_ID) %>%
slice(which.min(abs(Execution_Date - as.Date("2022-03-31")))) %>%
ungroup
The problem is that if I run this code in sparklyr I get an error: "Slice() is not supported on database backends" (this is because SQL has no equivalent of the slice() function).
How can I deal with this problem?
Thank you in advance!

How can I write this R expression in the pipe operator format?

I am trying to rewrite this expression using magrittr's pipe operator:
print(mean(pull(df, height), na.rm=TRUE))
which returns 175.4 for my dataset.
I know that I have to start with the data frame and write df %>%, but I'm confused about how to turn the nested calls inside out. For example, should the na.rm=TRUE go inside mean(), pull() or print()?
UPDATE: I actually figured it out by trial and error...
df %>%
  pull(height) %>%
  mean(na.rm = TRUE) %>%
  print()
returns 175.4
It would be good practice to make a reproducible example, with dummy data like this:
height <- seq(1:30)
weight <- seq(1:30)
df <- data.frame(height, weight)
The pipe operator works with most of the tidyverse (not just magrittr). The pull() function you are using actually comes from dplyr. The na.rm=T argument is needed for many summary functions, such as mean and sd, as well as functions that pick out specific values, such as min and max, because these return NA when the input contains NA values.
df %>% pull(height) %>% mean(na.rm=T) %>% print()
Unless your data is nested you may not even need to use pull
df %>% summarise(mean = mean(height,na.rm=T))
Also, using summarise you can pipe these into another dataframe rather than just printing, and call them out of the dataframe whenever you want.
df %>% summarise(meanHt = mean(height,na.rm=T), sdHt = sd(height,na.rm=T)) -> summary
summary[1]
summary[2]

Dynamic Filter with dplyr R doesn't work

cols <- data %>% names()
data %>% dplyr::filter_(is.na(cols[1]))
returns zero rows, although it should output some rows. By contrast, calling
data %>% dplyr::filter(is.na(colName))
returns rows.
So dynamic filtering is not working. Is there an alternative?
dplyr::filter(data, is.na(data[, cols[1]]))
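The answer above indexes the data frame with base R. A hedged alternative, assuming a dplyr version that provides the .data pronoun (0.7.0 or later), is to look the column up by its name inside filter(). The example data and its column names a and b are made up for illustration:

```r
library(dplyr)

# Hypothetical example data; the column names are illustrative
data <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))
cols <- names(data)

# .data[[...]] looks the column up by the string stored in cols[1]
missing_a <- data %>% filter(is.na(.data[[cols[1]]]))
missing_a
```

The original filter_(is.na(cols[1])) tests whether the string itself is NA, which is never true, so it keeps no rows.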

Error using dplyr package in R

I am using the below code to extract the summary of data with respect to column x by counting the values in column x from the dataset unique_data and arranging the count values in descending order.
unique_data %>%
group_by(x) %>%
arrange(desc(count(x)))
But when I execute the above code, I get the following error message:
Error: no applicable method for 'group_by_' applied to an object of class "character"
Kindly let me know what is going wrong in my code. For your information, the column x is of character data type.
Regards,
Mohan
The problem is that count() is called inside arrange(). There it receives the character column x rather than a data frame, hence the error. The two steps need to be done separately: keep the OP's code, but split count and arrange into two separate pipe steps. The output of count is a frequency column 'n' (by default), which we then arrange in descending (desc) order.
unique_data %>%
group_by(x) %>%
count(x) %>%
arrange(desc(n))
Also, the group_by is not needed. According to the ?count documentation:
tally is a convenient wrapper for summarise that will either call n or
sum(n) depending on whether you're tallying for the first time, or
re-tallying. count() is similar, but also does the group_by for you.
So based on that, we can just do
count(unique_data, x) %>%
arrange(desc(n))
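As a further shortcut, count() has a sort argument that orders the result by descending n, so the separate arrange step can be dropped. A minimal sketch, using a made-up stand-in for unique_data:

```r
library(dplyr)

# Hypothetical stand-in for unique_data; the values are illustrative
unique_data <- data.frame(
  x = c("a", "b", "b", "c", "b", "a"),
  stringsAsFactors = FALSE
)

# sort = TRUE orders the counts from largest to smallest
counts <- count(unique_data, x, sort = TRUE)
counts
```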

r dplyr group_by - by variable content

I use the dplyr group_by function to group my data frame.
I need to group the data by a column whose name I don't know in advance. It is decided as the code runs, so the name can't be hard-coded.
For example, I can't use
data %>% group_by(col_name)
I need to do something like
c <- "col_name"
data %>% group_by(c)
When I try this, it throws an error:
Error: unknown variable to group by : c
All the examples I find, including the ones in the R help, cover the trivial case where you can hard-code the name of the column.
Thanks.
You should look up non-standard evaluation (NSE), as others have said in their comments. Working around it requires the lazyeval package and the group_by_ function, which lets you use standard evaluation. It will look like this:
data %>% group_by_(lazyeval::interp(~var, var = as.name(c)))
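Note that group_by_ and the other underscore verbs have been deprecated since dplyr 0.7. A hedged sketch of one replacement, assuming dplyr 1.0.0 or later, selects the grouping column with across(all_of(...)). The data and the variable name group_col (playing the role of c above) are hypothetical:

```r
library(dplyr)

# Hypothetical example data
data <- data.frame(g = c("a", "a", "b"), v = c(1, 2, 3))

# Column name decided at run time
group_col <- "g"

# all_of() selects the column named by the string in group_col
res <- data %>%
  group_by(across(all_of(group_col))) %>%
  summarise(total = sum(v))
res
```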
