Alternative to slice() function in sparklyr

Currently I'm using this code to subset unique rows grouped by Loan_ID where Execution_Date is the closest to a predetermined date (in my case "2022-03-31").
The example is as follows:
library(dplyr)
df %>%
  group_by(Loan_ID) %>%
  slice(which.min(abs(Execution_Date - as.Date("2022-03-31")))) %>%
  ungroup()
The problem is that if I run this code in sparklyr I get an error: "slice() is not supported on database backends" (this is because there is no direct equivalent of slice() in SQL).
How can I deal with this problem?
Thank you in advance!
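One possible workaround (an untested sketch, not from the original thread; it assumes df is a Spark table reference and Execution_Date is a date column): replace slice() with a grouped filter(), which dbplyr translates into a window function that database backends do support. datediff() here is Spark SQL's date-difference function, which sparklyr passes through to the backend unmodified.
library(sparklyr)
library(dplyr)

df %>%
  group_by(Loan_ID) %>%
  # abs() of the gap in days between Execution_Date and the target date;
  # datediff() is not translated by sparklyr but passed straight to Spark SQL
  mutate(day_gap = abs(datediff(Execution_Date, "2022-03-31"))) %>%
  # a grouped filter on min() becomes MIN() OVER (PARTITION BY Loan_ID)
  filter(day_gap == min(day_gap, na.rm = TRUE)) %>%
  ungroup() %>%
  select(-day_gap)
Unlike slice(which.min(...)), this keeps every row that ties for the minimum gap; breaking ties would need an extra filter(row_number() == 1) after an arrange().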

Related

Anonymous function to select variables and get unique values

I have a simple task which I would like to loop over many datasets (which have similar variable names). I know how to do it in dplyr, but I need to convert it to base R in order to get it into an anonymous function.
For example (this is not the real data I am working with):
This is my dplyr approach:
mtcars %>%
  select(mpg, contains("cyl")) %>%
  distinct()
However, when I throw this into an anonymous function:
mtcars %>% (function(x) subset(x, select = c(mpg, contains("cyl"))))
I get an error: Error: No tidyselect variables were registered
Any ideas about how to solve this, and how to add distinct() to the function so that I only get unique values? Any and all suggestions are appreciated, thank you!
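A base R sketch of what the asker seems to be after (not from the original thread): contains() only works inside tidyselect contexts such as select(), so match the "cyl" columns by name with grep() instead, and de-duplicate rows with unique().
# grep() is the base R stand-in for contains(); unique() replaces distinct()
pick_unique <- function(x) {
  keep <- c("mpg", grep("cyl", names(x), value = TRUE))
  unique(x[, keep, drop = FALSE])
}

pick_unique(mtcars)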

Ungroup SparkR data frame

I have a spark data frame:
library(SparkR); library(magrittr)
as.DataFrame(mtcars) %>%
  groupBy("am")
How can I ungroup this data frame? There doesn't seem to be any ungroup function in the SparkR library!
There doesn't seem to be any ungroup function in the SparkR library
That's because groupBy doesn't have the same meaning as group_by in dplyr.
SparkR::group_by / SparkR::groupBy returns not a SparkDataFrame but a GroupedData object, which corresponds to the GROUP BY clause in SQL. To convert it back to a SparkDataFrame you should call SparkR::agg (or, if you prefer dplyr nomenclature, SparkR::summarize), which corresponds to the SELECT component of the SQL query.
Once you aggregate you get back a SparkDataFrame and the grouping is no longer present.
Additionally, SparkR::groupBy has no equivalent of dplyr's group_by(...) %>% mutate(...); instead, window functions with a frame definition are used.
So the takeaway message is: if you don't plan to aggregate, don't use groupBy.
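A minimal sketch of that aggregate-then-continue pattern (untested; assumes a running Spark session started with sparkR.session()):
library(SparkR)
library(magrittr)

# groupBy() returns a GroupedData object; agg() turns it back into a
# SparkDataFrame, so there is nothing left to "ungroup"
as.DataFrame(mtcars) %>%
  groupBy("am") %>%
  agg(mpg = "mean")   # key = value shorthand: column name = aggregate name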

Error using dplyr package in R

I am using the code below to summarise the dataset unique_data by counting the values in column x and arranging the counts in descending order.
unique_data %>%
  group_by(x) %>%
  arrange(desc(count(x)))
But when I execute the above code, I get the error message below:
Error: no applicable method for 'group_by_' applied to an object of class "character"
Kindly let me know what is going wrong in my code. For your information, column x is of character data type.
Regards,
Mohan
The reason is the wrapping of arrange around count; these steps need to be done separately. Using the same code as in the OP's post, just split the count and arrange into two separate pipe steps. The output of count is a frequency column 'n' (by default), which we arrange in descending (desc) order.
unique_data %>%
  group_by(x) %>%
  count(x) %>%
  arrange(desc(n))
Also, the group_by is not needed. According to the ?count documentation:
tally is a convenient wrapper for summarise that will either call n or
sum(n) depending on whether you're tallying for the first time, or
re-tallying. count() is similar, but also does the group_by for you.
So based on that, we can just do
count(unique_data, x) %>%
  arrange(desc(n))
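As a side note (not in the original answer), count() also has a sort argument that performs the descending ordering in the same call:
# sort = TRUE orders the counts in descending order directly
count(unique_data, x, sort = TRUE)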

Pass a string variable to spread function in dplyr

I am trying to make a function in which I pass a string variable to a dplyr pipeline, but I am having some problems. For example, with:
col_spread <- "speed"
In select(), I can use get(col_spread) to select the column named speed.
df %>% select(get(col_spread))
However, when I am using spread function in dplyr
df %>% spread(key = Key_col, value = get(col_spread))
Error: Invalid column specification
It doesn't work.
Is NSE the only way to go? If so, what should I do?
Thank you!
Actually, get really isn't a great idea. It would be better to use the standard evaluation version of
df %>% select_(col_spread)
and then for spread it would look like
df %>% spread_("Key_col", col_spread)
Note which values are quoted and which are not: spread_ expects two character values.
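Note that the underscore verbs (select_, spread_) have since been deprecated in dplyr/tidyr; a sketch of the modern tidy-eval equivalent (assuming a recent tidyverse) converts the string to a symbol with rlang::sym() and unquotes it:
library(dplyr)
library(tidyr)
library(rlang)

col_spread <- "speed"

df %>% select(all_of(col_spread))                        # all_of() takes strings directly
df %>% spread(key = Key_col, value = !!sym(col_spread))  # unquote the symbol for spread()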

Filter n rows of grouped data frame when different n for each group

I'd like to pick a different number of rows from each group of my data frame. I haven't figured out an elegant way to do this with dplyr yet. To pick out the same number of rows for each group, I do this:
library(dplyr)
iris %>%
  group_by(Species) %>%
  arrange(Sepal.Length) %>%
  top_n(2)
But I would like to be able to reference another table with the number of rows I'd like for each group, such as the sample table below:
top_rows_desired <- data.frame(Species = unique(iris$Species),
                               n_desired = c(4, 2, 5))
We can do a left_join of 'iris' and 'top_rows_desired' by 'Species', group by 'Species', slice out the first 'n_desired' rows of each group, and remove the 'n_desired' column with select.
left_join(iris, top_rows_desired, by = "Species") %>%
  group_by(Species) %>%
  arrange(desc(Sepal.Length)) %>%
  slice(seq(first(n_desired))) %>%
  select(-n_desired)
Just adding this answer for those folks who are unable to run the code that akrun provided. I struggled with this for a while. This answer tackles issue #2531 mentioned on GitHub.
You might not be able to run slice because you already have xgboost loaded in your environment. xgboost masks dplyr's slice function, leading to this issue:
Attaching package: ‘xgboost’
The following object is masked from ‘package:dplyr’:
slice
Warning message:
package ‘xgboost’ was built under R version 3.4.1
So using
detach("package:xgboost")
might work for you.
I wasted an hour because of this. Hope this is helpful.
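An alternative to detaching (not from the original answer) is to qualify the call with its namespace, which sidesteps the masking entirely:
# dplyr::slice() resolves the conflict without detaching xgboost
left_join(iris, top_rows_desired, by = "Species") %>%
  group_by(Species) %>%
  arrange(desc(Sepal.Length)) %>%
  dplyr::slice(seq(first(n_desired))) %>%
  select(-n_desired)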
