R dplyr top_n does not work when used within a function

My dplyr function looks like this:
convert_to_top5_df=function(df)
{
require(dplyr)
require(lazyeval)
require(tidyr)
df %>%
filter(!is.na(SVM_LABEL_QOL)) %>%
select(globalsegment,Account,SVM_LABEL_QOL) %>%
group_by(globalsegment,Account) %>%
summarise_(QoL=interp(~round(sum(SVM_LABEL_QOL %in% 'QoL')/n(),2))) %>%
ungroup(globalsegment,Account) %>%
arrange(desc(QoL)) %>%
interp(~top_n(5,wt = "QoL"))
}
I added the interp call because I thought the problem was due to lazy evaluation.
However, this is not the case.
Using the function below (no interp around top_n), I get a result, but I do not see the top 5 results as desired.
From reading other Stack Overflow posts, I understand that this has to do with ungroup, but I am not sure how to apply the fix.
convert_to_top5_df=function(df)
{
require(dplyr)
require(lazyeval)
require(tidyr)
df %>%
filter(!is.na(SVM_LABEL_QOL)) %>%
select(globalsegment,Account,SVM_LABEL_QOL) %>%
group_by(globalsegment,Account) %>%
summarise_(QoL=interp(~round(sum(SVM_LABEL_QOL %in% 'QoL')/n(),2))) %>%
ungroup(globalsegment,Account) %>%
arrange(desc(QoL)) %>%
top_n(5,wt = "QoL")
}
Any ideas?

My solution: remove the quotation marks around QoL and add an additional argument to arrange (the code also drops the ungroup() step, so top_n is applied within each globalsegment):
#Function to convert dataframe for pie chart analysis (Global)
convert_to_top5_df=function(df)
{
require(dplyr)
require(lazyeval)
require(tidyr)
df %>%
filter(!is.na(SVM_LABEL_QOL)) %>%
select(globalsegment,Account,SVM_LABEL_QOL) %>%
group_by(globalsegment,Account) %>%
summarise_(QoL=interp(~round(sum(SVM_LABEL_QOL %in% 'QoL')/n(),2))) %>%
top_n(5,QoL) %>%
arrange(globalsegment,desc(QoL))
}
If anyone's got a more efficient way, please share
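
For readers on dplyr 1.0.0 or later, where summarise_() is deprecated and top_n() has been superseded by slice_max(), the same idea might be written roughly as below. This is only a sketch: the toy data frame, the n_top parameter, and the choice of with_ties = FALSE (top_n keeps ties, this does not) are assumptions made for illustration, not part of the original post.

```r
library(dplyr)

# Sketch of the top-5-per-segment summary without lazyeval or top_n.
# n_top is a hypothetical parameter added for illustration.
convert_to_top5_df <- function(df, n_top = 5) {
  df %>%
    filter(!is.na(SVM_LABEL_QOL)) %>%
    group_by(globalsegment, Account) %>%
    # mean() of a logical vector equals sum(...) / n()
    summarise(QoL = round(mean(SVM_LABEL_QOL %in% "QoL"), 2),
              .groups = "drop") %>%
    group_by(globalsegment) %>%
    # with_ties = FALSE caps each segment at n_top rows
    slice_max(QoL, n = n_top, with_ties = FALSE) %>%
    ungroup() %>%
    arrange(globalsegment, desc(QoL))
}

# Hypothetical toy data: 2 segments x 7 accounts x 4 rows each
toy <- expand.grid(globalsegment = c("A", "B"),
                   Account = paste0("acc", 1:7),
                   rep = 1:4,
                   stringsAsFactors = FALSE)
toy$SVM_LABEL_QOL <- ifelse(seq_len(nrow(toy)) %% 3 == 0, "QoL", "other")

top5 <- convert_to_top5_df(toy)
```

Passing the column as a bare name (QoL rather than "QoL") matters: a quoted string is treated as a constant rather than as the column, so it cannot rank the rows.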

Related

Creating a function with column name in group_by clause

I am trying to write the following function, where I pass a dataset's column name as the variable for the group_by clause.
counter<-function(df,col_name){
a<-df %>%
group_by(col_name) %>%
count() %>%
arrange(desc(n))
return(a)
}
So if I try for example:
fraud_continent<-counter(fraud,continent_source1)
where fraud is the dataset and continent_source1 is a column name from this dataset, the function won't work, and the error I get is:
Error: Must group by variables found in .data.
Column col_name is not found.
How do I solve this?

You can use the curly-curly operator ({{ }}).
counter<-function(df,col_name){
a<-df %>%
group_by({{col_name}}) %>%
count() %>%
arrange(desc(n))
return(a)
}
You can also do this without group_by:
counter<-function(df,col_name){
a<-df %>%
count({{col_name}}) %>%
arrange(desc(n))
return(a)
}
This can be called as:
fraud_continent<-counter(fraud,continent_source1)

We could use ensym with !!
library(dplyr)
counter <- function(df, colname){
df %>%
count(!! rlang::ensym(colname)) %>%
arrange(desc(n))
}
and then it can be called as either
fraud_continent<-counter(fraud,continent_source1)
Or
fraud_continent<-counter(fraud, "continent_source1")

Update:
Thanks to akrun for pointing out that .data[[col_name]] is the better choice here.
First answer:
Alternatively, you could use df[,col_name]:
library(dplyr)
counter<-function(df,col_name){
a<-df %>%
group_by(df[,col_name]) %>%
count() %>%
arrange(desc(n))
return(a)
}
fraud_continent<-counter(fraud,"continent_source1")
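
The update above mentions .data[[col_name]] but does not show it. A minimal sketch of what that version might look like follows, using the built-in mtcars data in place of the asker's fraud dataset (an assumption, since that dataset is not available). Unlike df[,col_name], the .data pronoun refers to the data at the current step of the pipeline.

```r
library(dplyr)

# Sketch: col_name is a string holding the name of the grouping column
counter <- function(df, col_name) {
  df %>%
    group_by(.data[[col_name]]) %>%
    count() %>%
    arrange(desc(n))
}

# mtcars stands in for the original fraud dataset
cyl_counts <- counter(mtcars, "cyl")
```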

Pipeline %>% in R

I want to use the pipe operator %>% from the tidyverse to make this more readable:
myChargingDevices<-data.frame(fromJSON(jsonFile))
myChargingDevices<-myChargingDevices %>%
mutate(myTime=ymd_hms(lastUpdateCheck))
myChargingDevices<-myChargingDevices[order(myChargingDevices$myTime,decreasing = TRUE),]
myChargingDevices$lastUpdateCheck<-NULL
Any ideas on how to do this more conveniently?
Thanks in advance.

Like this:
myChargingDevices <- jsonFile %>%
fromJSON %>%
data.frame %>%
mutate(myTime = ymd_hms(lastUpdateCheck)) %>%
arrange(desc(myTime)) %>%
select(-lastUpdateCheck)
I cannot test it because you did not provide a reproducible example.
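
To make the pipeline above testable, here is a self-contained sketch. The JSON string and its id field are invented for illustration (the original jsonFile is not shown), and the dplyr, jsonlite, and lubridate packages are assumed to be installed.

```r
library(dplyr)
library(jsonlite)
library(lubridate)

# Hypothetical payload standing in for the asker's jsonFile
json_txt <- '[
  {"id": 1, "lastUpdateCheck": "2020-01-01 10:00:00"},
  {"id": 2, "lastUpdateCheck": "2020-03-01 12:30:00"}
]'

myChargingDevices <- json_txt %>%
  fromJSON() %>%                                   # parse JSON into a data frame
  as.data.frame() %>%
  mutate(myTime = ymd_hms(lastUpdateCheck)) %>%    # parse timestamps
  arrange(desc(myTime)) %>%                        # newest first
  select(-lastUpdateCheck)                         # drop the raw string column
```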

Printing intermediate results without breaking pipeline in tidyverse

Is there a command that can be added to a tidyverse pipeline without breaking the flow, but that produces a side effect such as printing something out? The use case I have in mind is the following. Given a pipeline
data %>%
mutate(new_var = <some time consuming operation>) %>%
mutate(new_var2 = <some other time consuming operation>) %>%
...
I would like to add some command to the pipeline that would not modify the end result, but would print out some progress or the state of things. Maybe something like this:
data %>%
mutate(new_var = <some time consuming operation>) %>%
command_x(print("first operation done")) %>%
mutate(new_var2 = <some other time consuming operation>) %>%
...
Does such a command_x already exist?

For the specific case of printing an intermediate step in the pipeline, just use %>% print() %>%. E.g.,
mtcars %>%
filter(cyl == 4) %>%
print() %>%
summarise(mpg = mean(mpg))
For a simple status message, you'd do:
pipe_message = function(.data, status) {message(status); .data}
mtcars %>%
filter(cyl == 4) %>%
pipe_message("first operation done") %>%
select(cyl)
See the answer by @MrFlick for a more general solution for non-print functions.

You could easily write your own function:
pass_through <- function(data, fun) {fun(data); data}
And use it like this:
mtcars %>% pass_through(. %>% ncol %>% print) %>% nrow
Here we use the . %>% syntax to create an anonymous function. You could also write your own more explicitly with
mtcars %>% pass_through(function(x) print(ncol(x))) %>% nrow

You can do this on the fly with an anonymous function:
mtcars %>% ( function(x){print(x); return(x)} ) %>% nrow()
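
Another option, not mentioned in the answers above, is the tee pipe %T>% from magrittr: it evaluates the right-hand side for its side effect and then passes the left-hand side along unchanged. A small sketch, assuming magrittr is attached explicitly (dplyr alone does not export %T>%):

```r
library(dplyr)
library(magrittr)  # provides the tee pipe %T>%

res <- mtcars %>%
  filter(cyl == 4) %T>%
  # side effect only; the filtered data continues down the pipeline
  { message("first operation done: ", nrow(.), " rows") } %>%
  summarise(mpg = mean(mpg))
```

This avoids writing a helper such as pass_through, at the cost of an extra operator that readers need to recognise.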

Getting the tidyr::nest() -> purrr:map() workflow to work for special case of no grouping var

I'm trying to write a function that does a split-apply-combine for which the split variable(s) are parameters, and - importantly - a null split is acceptable. For example, running statistics either on subsets of data or on the entire dataset.
somedata=expand.grid(a=1:3,b=1:3)
somefun=function(df_in,grpvars=NULL){
df_in %>% group_by_(.dots=grpvars) %>% nest() %>%
mutate(X2.Resid=map(data,~with(.x,chisq.test(b)$residuals))) %>%
unnest(data,X2.Resid) %>% return()
}
somefun(somedata,"a") # This works
somefun(somedata) # This fails
The null condition fails because nest() seems to need a variable to nest by, rather than nesting the entire df into a 1x1 data.frame. I can get around this as follows:
somefun2=function(df_in,grpvars="Dummy"){
df_in$Dummy=1
df_in %>% group_by_(.dots=grpvars) %>% nest() %>%
mutate(X2.Resid=map(data,~with(.x,chisq.test(b)$residuals))) %>%
unnest(data,X2.Resid) %>%
select(-Dummy) %>% return()
}
somefun2(somedata) # This works
However, I'm wondering if there is a more elegant way to fix this, without needing the dummy variable?

Hmm, that behavior is a little surprising to me. A fix is easy though: you just have to make sure you nest everything():
somefun3 <- function(df_in, grpvars = NULL) {
df_in %>%
group_by_(.dots = grpvars) %>%
nest(everything()) %>%
mutate(X2.Resid = map(data, ~with(.x, chisq.test(b)$residuals))) %>%
unnest()
}
somefun3(somedata, "a")
somefun3(somedata)
Both work.
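
Note that group_by_() is deprecated in current dplyr. As a rough alternative, and not the nest/map method used in the answer, the same per-group residuals can be computed with a grouped mutate, which handles the no-grouping case without a dummy column. The name somefun4 and the character() default are assumptions made for this sketch; chisq.test may emit approximation warnings on data this small.

```r
library(dplyr)

somedata <- expand.grid(a = 1:3, b = 1:3)

# Sketch: group only when grouping variables were supplied
somefun4 <- function(df_in, grpvars = character()) {
  if (length(grpvars) > 0) {
    df_in <- group_by(df_in, across(all_of(grpvars)))
  }
  df_in %>%
    # residuals have one value per row of the (sub)group
    mutate(X2.Resid = as.numeric(chisq.test(b)$residuals)) %>%
    ungroup()
}

by_a <- suppressWarnings(somefun4(somedata, "a"))  # per-group residuals
overall <- suppressWarnings(somefun4(somedata))    # whole-dataset residuals
```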

Can one argument be mapped to more than one argument in a user-defined function?

Assume I want to run this:
MS_date<-bind_inpatient_MSW %>%
arrange(NRIC,
APPROVED_DATE_BILL,APPROVED_DATE_FF_APPLICATION) %>%
group_by(NRIC,
APPROVED_DATE_BILL,APPROVED_DATE_FF_APPLICATION) %>%
mutate(n_marital_status=n_distinct(MARITAL_STATUS,na.rm=TRUE))
and this
TH_date<-bind_inpatient_MSW %>%
arrange(NRIC,
APPROVED_DATE_BILL) %>%
group_by(NRIC,
APPROVED_DATE_BILL) %>%
mutate(n_TH=n_distinct(TYPE_OF_HOUSING,na.rm=TRUE))
These two differ in the variables used to arrange and group the dataframe, as well as in the added variable. I would like to write a user-defined function so that I don't have to write this more than once. I tried as follows:
df_date<-function(df,grpby,cntby){
dfnew<-df %>%
arrange(grpby) %>%
group_by(grpby) %>%
mutate(n=n_distinct(cntby,na.rm=TRUE))
return(dfnew)
}
Then I tried calling
df_date(bind_inpatient_MSW,NRIC,APPROVED_DATE_BILL,APPROVED_DATE_FF_APPLICATION,MARITAL_STATUS)
and
df_date(bind_inpatient_MSW,NRIC,APPROVED_DATE_BILL,TYPE_OF_HOUSING)
but neither works. How can I solve this?

You can try something like:
fun <- function(dat, group, ctnby) {
  dat %>%
    group_by_(group) %>%
    do((function(., ctnby) {
      with(., data.frame(n = n_distinct(get(ctnby))))
    })(., ctnby))
}
fun(mtcars,"cyl","hp")
which avoids lazy evaluation using do.
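
With dplyr 1.0.0 or later, one argument can carry several grouping columns if it is passed as a character vector and expanded with across(all_of(...)). The sketch below assumes that calling convention (strings rather than bare names) and uses mtcars in place of bind_inpatient_MSW, which is not available here.

```r
library(dplyr)

# Sketch: grpby is a character vector of column names, cntby a single name
df_date <- function(df, grpby, cntby) {
  df %>%
    arrange(across(all_of(grpby))) %>%
    group_by(across(all_of(grpby))) %>%
    mutate(n = n_distinct(.data[[cntby]], na.rm = TRUE)) %>%
    ungroup()
}

# Two grouping columns mapped through a single argument
res <- df_date(mtcars, c("cyl", "gear"), "carb")
```

For the original data, the calls would then look like df_date(bind_inpatient_MSW, c("NRIC", "APPROVED_DATE_BILL"), "TYPE_OF_HOUSING").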
