I'm new to programming with dplyr. Let's say I have a function like this:
example <- function(data, group, var) {
  group <- enquo(group)
  var <- enquo(var)
  data %>%
    group_by(!!group) %>%
    summarise(
      Min = min(!!var),
      Max = max(!!var)
    )
}
> diamonds %>% example(cut, price)
# A tibble: 5 x 3
cut Min Max
<ord> <dbl> <dbl>
1 Fair 337 18574
2 Good 327 18788
3 Very Good 336 18818
4 Premium 326 18823
5 Ideal 326 18806
I'd like to add a new argument, sort. If it is not specified, the function does nothing more; if it is specified, the function also runs arrange(desc(sort)). I tried something like this:
example <- function(data, group, var, sort = NULL) {
  sort <- enquo(sort)
  group <- enquo(group)
  var <- enquo(var)
  data <- data %>%
    group_by(!!group) %>%
    summarise(
      Min = min(!!var),
      Max = max(!!var)
    )
  if (is.null(sort)) data
  else arrange(data, desc(!!sort))
}
It works when the sort argument is specified:
> diamonds %>% example(cut, price, sort = Min)
# A tibble: 5 x 3
cut Min Max
<ord> <dbl> <dbl>
1 Fair 337 18574
2 Very Good 336 18818
3 Good 327 18788
4 Premium 326 18823
5 Ideal 326 18806
But if I leave sort blank, I get the error message:
Error: cannot arrange column of class 'NULL' at position 1
How can I fix this? Thanks for the help, and sorry for my bad English.
Moving sort <- enquo(sort) into the else branch gets you closer, but you will still have a problem: the if (is.null(sort)) line will throw an error when you call the function with sort = Min, because is.null() will try to evaluate the Min object and will not find it.
Instead of is.null(), you can use missing(), which does not evaluate Min.
example <- function(data, group, var, sort = NULL) {
  group <- enquo(group)
  var <- enquo(var)
  data <- data %>%
    group_by(!!group) %>%
    summarise(
      Min = min(!!var),
      Max = max(!!var)
    )
  if (missing(sort)) {
    data
  } else {
    sort <- enquo(sort)
    arrange(data, desc(!!sort))
  }
}
Now we get the following outputs:
diamonds %>% example(cut, price)
# A tibble: 5 x 3
cut Min Max
<ord> <dbl> <dbl>
1 Fair 337 18574
2 Good 327 18788
3 Very Good 336 18818
4 Premium 326 18823
5 Ideal 326 18806
diamonds %>% example(cut, price, sort = Min)
# A tibble: 5 x 3
cut Min Max
<ord> <dbl> <dbl>
1 Fair 337 18574
2 Very Good 336 18818
3 Good 327 18788
4 Premium 326 18823
5 Ideal 326 18806
See Testing a function that uses enquo() for a NULL parameter
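An alternative that keeps the NULL default is to capture sort unconditionally and test the resulting quosure itself. A minimal sketch using rlang::quo_is_null():

example <- function(data, group, var, sort = NULL) {
  group <- enquo(group)
  var <- enquo(var)
  sort <- enquo(sort)  # a quosure wrapping NULL when the argument is absent
  out <- data %>%
    group_by(!!group) %>%
    summarise(Min = min(!!var), Max = max(!!var))
  if (rlang::quo_is_null(sort)) out else arrange(out, desc(!!sort))
}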
I borrowed a dataset from SPSS, prepared for Julie Pallant's SPSS Survival Manual, and am running it in R.
I selected three columns to run a correlation and significance test on (toptim, tnegaff, sex) using select: df <- survey %>% select(toptim, tnegaff, sex).
Then problems emerged.
I'd like to know the correlation between toptim and tnegaff by sex, but I can't use cor and had to resort to correlate. Why is there an error, and what is the difference between the two methods?
df %>% group_by(sex) %>% summarise(cor = correlate(toptim, tnegaff))  # OK (male = 0.22, female = 0.394)
df %>% group_by(sex) %>% summarise(cor = cor(toptim, tnegaff))        # failed, returns NA
I also failed to obtain the test of significance with cor.test (the answer should be p = 0.0488):
Error in `summarise()`:
! Problem while computing `cor = cor.test(toptim, tnegaff)`.
✖ `cor` must be a vector, not a `htest` object.
ℹ The error occurred in group 1: sex = 1.
Then I tried to follow past examples and use broom::tidy, but got no output for p-values:
> df %>% group_by(sex) %>% broom::tidy(cor.test(toptim, tnegaff))
# A tibble: 3 × 13
column n mean sd median trimmed mad min max range skew kurtosis se
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 toptim 435 22.1 4.43 22 22.3 3 7 30 23 NA NA 0.212
2 tnegaff 435 19.4 7.07 18 18.6 4 10 39 29 NA NA 0.339
3 sex 439 1.58 0.494 2 1.58 0 1 2 1 -0.318 1.10 0.0236
How can I get the result, and what is the reason for the failure?
Thank you in advance for your answers.
cor() is trying to use all values and coming across NAs, I presume. If you set use = "complete.obs", it should work. For the cor.test() part, wrap the output in list() so that the tibble can hold a column of objects (a list-column).
For the final tidying and the p-values, use map(cor.test, broom::tidy), then tidyr::unnest() to get a full, tidy data frame. (Your own broom::tidy attempt ran tidy() on the grouped data frame itself rather than on a cor.test result, which is why you got column summaries instead of p-values.)
That's a few steps to go through, but I hope it helps!
df <- haven::read_sav("survey.sav")
library(tidyverse)
df %>%
  group_by(sex) %>%
  summarise(cor = cor(toptim, tnegaff, use = "complete.obs"),
            cor.test = list(cor.test(toptim, tnegaff))) %>%
  mutate(tidy_out = map(cor.test, broom::tidy)) %>%
  unnest(tidy_out)
#> # A tibble: 2 × 11
#> sex cor cor.t…¹ estim…² stati…³ p.value param…⁴ conf.…⁵ conf.…⁶ method
#> <dbl+l> <dbl> <list> <dbl> <dbl> <dbl> <int> <dbl> <dbl> <chr>
#> 1 1 [MAL… -0.220 <htest> -0.220 -3.04 2.73e- 3 182 -0.353 -0.0775 Pears…
#> 2 2 [FEM… -0.394 <htest> -0.394 -6.75 1.06e-10 248 -0.494 -0.284 Pears…
#> # … with 1 more variable: alternative <chr>, and abbreviated variable names
#> # ¹cor.test, ²estimate, ³statistic, ⁴parameter, ⁵conf.low, ⁶conf.high
Edit: examining the difference in correlation
Borrowing the function from here, you can test the difference in correlation coefficients between the sexes like this:
cor.diff.test(df$toptim[df$sex == 1], df$tnegaff[df$sex == 1], df$toptim[df$sex == 2], df$tnegaff[df$sex == 2])
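For reference, here is a minimal sketch of the kind of computation such a function typically performs (a Fisher r-to-z test for two independent correlations; the name cor_diff_test and its interface are hypothetical, not the linked author's code):

cor_diff_test <- function(x1, y1, x2, y2) {
  # sample sizes and correlations on complete pairs only
  n1 <- sum(complete.cases(x1, y1))
  n2 <- sum(complete.cases(x2, y2))
  r1 <- cor(x1, y1, use = "complete.obs")
  r2 <- cor(x2, y2, use = "complete.obs")
  # Fisher z-transform each correlation, then compare on the z scale
  z <- (atanh(r1) - atanh(r2)) / sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
  list(z = z, p.value = 2 * pnorm(-abs(z)))  # two-sided p-value
}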
I am trying to write a custom function that uses rlang's quasiquotation. This function also internally uses dplyr's join functions. Below is a minimal working example that illustrates my problem.
# needed libraries
library(tidyverse)

# function definition
df_combiner <- function(data, x, group.by) {
  # check how many variables were entered for this grouping variable
  group.by <- as.list(rlang::quo_squash(rlang::enquo(group.by)))

  # based on the number of arguments, select `group.by`: in cases like `c(cyl)`,
  # the first list element after `quo_squash()` will be `c`, which we don't need,
  # but if we pass just `cyl` there is no `c`; this takes care of that issue
  group.by <-
    if (length(group.by) == 1) {
      group.by
    } else {
      group.by[-1]
    }

  # creating internal dataframe
  df <- dplyr::group_by(.data = data, !!!group.by, .drop = TRUE)

  # creating dataframes to be joined: one with a tally, one with a summary
  df_tally <- dplyr::tally(df)
  df_mean <- dplyr::summarise(df, mean = mean({{ x }}, na.rm = TRUE))

  # without specifying the `by` argument, this works but prints a message I want to avoid
  print(dplyr::left_join(x = df_tally, y = df_mean))

  # joining by specifying the `by` argument (my failed attempt)
  dplyr::left_join(x = df_tally, y = df_mean, by = !!!group.by)
}
# using the function
df_combiner(diamonds, carat, c(cut, clarity))
#> Joining, by = c("cut", "clarity")
#> # A tibble: 40 x 4
#> # Groups: cut [5]
#> cut clarity n mean
#> <ord> <ord> <int> <dbl>
#> 1 Fair I1 210 1.36
#> 2 Fair SI2 466 1.20
#> 3 Fair SI1 408 0.965
#> 4 Fair VS2 261 0.885
#> 5 Fair VS1 170 0.880
#> 6 Fair VVS2 69 0.692
#> 7 Fair VVS1 17 0.665
#> 8 Fair IF 9 0.474
#> 9 Good I1 96 1.20
#> 10 Good SI2 1081 1.04
#> # ... with 30 more rows
#> Error in !group.by: invalid argument type
As can be seen here, I want to avoid the message #> Joining, by = c("cut", "clarity"), so I want to supply the by argument to the _join function explicitly, but I am not sure how to do this.
(I've tried rlang::as_string, rlang::quo_name, etc.)
We can convert the symbols to strings with rlang::as_string():
dplyr::left_join(x = df_tally, y = df_mean,
                 by = map_chr(group.by, rlang::as_string))
df_combiner <- function(data, x, group.by) {
  # check how many variables were entered for this grouping variable
  group.by <- as.list(rlang::quo_squash(rlang::enquo(group.by)))

  # based on the number of arguments, select `group.by` (drop the leading `c`
  # when the user passes something like `c(cyl)`)
  group.by <-
    if (length(group.by) == 1) {
      group.by
    } else {
      group.by[-1]
    }

  # creating internal dataframe
  df <- dplyr::group_by(.data = data, !!!group.by, .drop = TRUE)

  # creating dataframes to be joined: one with a tally, one with a summary
  df_tally <- dplyr::tally(df)
  df_mean <- dplyr::summarise(df, mean = mean({{ x }}, na.rm = TRUE))

  # joining, specifying `by` as a character vector
  dplyr::left_join(x = df_tally, y = df_mean,
                   by = map_chr(group.by, rlang::as_string))
}
Checking:
df_combiner(diamonds, carat, c(cut, clarity))
# A tibble: 40 x 4
# Groups: cut [5]
# cut clarity n mean
# <ord> <ord> <int> <dbl>
# 1 Fair I1 210 1.36
# 2 Fair SI2 466 1.20
# 3 Fair SI1 408 0.965
# 4 Fair VS2 261 0.885
# 5 Fair VS1 170 0.880
# 6 Fair VVS2 69 0.692
# 7 Fair VVS1 17 0.665
# 8 Fair IF 9 0.474
# 9 Good I1 96 1.20
#10 Good SI2 1081 1.04
# … with 30 more rows
Join functions take a character vector for their by argument. Use deparse() to go from expressions to strings:
dplyr::left_join(x = df_tally, y = df_mean, by = map_chr(group.by, deparse))
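deparse() works here because, after quo_squash(), each element of group.by is a bare symbol. To illustrate with hand-built symbols:

purrr::map_chr(list(quote(cut), quote(clarity)), deparse)
#> [1] "cut"     "clarity"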
As mentioned by earlier authors, `by` expects a character vector. An easy way to move from lists of quosures to strings is illustrated by stanwood on the RStudio Community thread Should tidyeval be abandoned?

...tidyr::left_join still expects a list of strings: by = c("Species", "Sepal.Length"). If I want to supply these programmatically the best solution I found was by = sapply(sepaldims, quo_text). Consider this a plug for abstracting quo_text to lists of quosures.
sepaldims <- quos(Species, Sepal.Length)
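To make that concrete, here is what the conversion yields (unname() added only to simplify the printed result):

unname(sapply(sepaldims, rlang::quo_text))
#> [1] "Species"      "Sepal.Length"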
For each categorical variable in a dataset, I want to get counts and summary stats for each level. I can do this with the dlookr package's diagnose_category() function, but since I don't have that package at work, I recreated the function using dplyr.
In sparklyr I am able to get counts for a single variable at a time; I need help extending this to all categorical variables.
Need help: implement the function via sparklyr.
Table 1: Final output needed:
# A tibble: 20 x 6
variables levels N freq ratio rank
<chr> <ord> <int> <int> <dbl> <int>
1 cut Ideal 53940 21551 40.0 1
2 cut Premium 53940 13791 25.6 2
3 cut Very Good 53940 12082 22.4 3
4 cut Good 53940 4906 9.10 4
5 cut Fair 53940 1610 2.98 5
6 color G 53940 11292 20.9 1
7 color E 53940 9797 18.2 2
8 color F 53940 9542 17.7 3
9 color H 53940 8304 15.4 4
10 color D 53940 6775 12.6 5
11 color I 53940 5422 10.1 6
12 color J 53940 2808 5.21 7
13 clarity SI1 53940 13065 24.2 1
14 clarity VS2 53940 12258 22.7 2
15 clarity SI2 53940 9194 17.0 3
16 clarity VS1 53940 8171 15.1 4
17 clarity VVS2 53940 5066 9.39 5
18 clarity VVS1 53940 3655 6.78 6
19 clarity IF 53940 1790 3.32 7
20 clarity I1 53940 741 1.37 8
R Code:
# Categorical variable profile
# Table based on the dlookr package's diagnose_category() function
#   variables : variable names
#   types     : the data type of the variable
#   levels    : level names
#   N         : number of observations
#   freq      : number of observations at the level
#   ratio     : percentage of observations at the level
#   rank      : rank of the occupancy ratio of the levels

library(ggplot2)
library(dplyr)
library(tidyr)
library(purrr)
library(tibble)
library(stringr)

# Helper function
cat_level_summary <- function(df, x) {
  count(df, x, sort = TRUE) %>%
    transmute(levels = x, N = sum(n), freq = n,
              ratio = n / sum(n) * 100, rank = row_number())
}
# Loading
diamonds_tbl <- diamonds

# Main code
CategoricalVariableProfile <- diamonds_tbl %>%
  select_if(~ !is.numeric(.)) %>%
  map(~ cat_level_summary(data.frame(x = .x), x)) %>%
  do.call(rbind.data.frame, .) %>%
  rownames_to_column(., "variables") %>%
  mutate(variables = str_match(variables, ".*(?=\\.)")[, 1])
Spark code:
# Spark data table
diamonds_tbl <- copy_to(sc, diamonds, "diamonds", overwrite = TRUE)

CategoricalVariableProfile <- diamonds_tbl %>%
  group_by(cut) %>%
  summarize(count = n()) %>%
  sdf_register("CategoricalVariableProfile")
Flatten your data using sdf_gather:
long <- diamonds_tbl %>%
  select(cut, color, clarity) %>%
  sdf_gather("variable", "level", "cut", "color", "clarity")
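Note that sdf_gather() is not a sparklyr built-in; it is assumed to be defined elsewhere. A minimal sketch of an equivalent, built from plain dplyr verbs that sparklyr can translate (the signature is an assumption, kept just wide enough for the call above; the as.character() cast keeps the stacked slices type-compatible):

sdf_gather <- function(tbl, key, value, ...) {
  cols <- c(...)
  pieces <- lapply(cols, function(col) {
    tbl %>%
      dplyr::transmute(
        !!key := !!col,                               # gathered column's name, inlined as a constant
        !!value := as.character(!!rlang::sym(col))    # that column's values
      )
  })
  Reduce(dplyr::union_all, pieces)  # stack the per-column slices
}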
Aggregate by variable and level:
counts <- long %>% group_by(variable, level) %>% summarise(freq = n())
And finally apply required window functions:
result <- counts %>%
  arrange(-freq) %>%
  mutate(
    rank = rank(),
    total = sum(freq, na.rm = TRUE),
    ratio = freq / total * 100)
Which will give you
result
# Source: spark<?> [?? x 6]
# Groups: variable
# Ordered by: -freq
variable level freq rank total ratio
<chr> <chr> <dbl> <int> <dbl> <dbl>
1 cut Ideal 21551 1 53940 40.0
2 cut Premium 13791 2 53940 25.6
3 cut Very Good 12082 3 53940 22.4
4 cut Good 4906 4 53940 9.10
5 cut Fair 1610 5 53940 2.98
6 clarity SI1 13065 1 53940 24.2
7 clarity VS2 12258 2 53940 22.7
8 clarity SI2 9194 3 53940 17.0
9 clarity VS1 8171 4 53940 15.1
10 clarity VVS2 5066 5 53940 9.39
# … with more rows
with the following optimized plan:
optimizedPlan(result)
<jobj[165]>
org.apache.spark.sql.catalyst.plans.logical.Project
Project [variable#524, level#525, freq#1478L, rank#1479, total#1480L, ((cast(freq#1478L as double) / cast(total#1480L as double)) * 100.0) AS ratio#1481]
+- Window [rank(_w1#1493L) windowspecdefinition(variable#524, _w1#1493L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS rank#1479], [variable#524], [_w1#1493L ASC NULLS FIRST]
+- Window [sum(freq#1478L) windowspecdefinition(variable#524, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS total#1480L], [variable#524]
+- Project [variable#524, level#525, freq#1478L, -freq#1478L AS _w1#1493L]
+- Sort [-freq#1478L ASC NULLS FIRST], true
+- Aggregate [variable#524, level#525], [variable#524, level#525, count(1) AS freq#1478L]
+- Generate explode(map(cut, cut#19, color, color#20, clarity, clarity#21)), [0, 1, 2], false, [variable#524, level#525]
+- Project [cut#19, color#20, clarity#21]
+- InMemoryRelation [carat#18, cut#19, color#20, clarity#21, depth#22, table#23, price#24, x#25, y#26, z#27], StorageLevel(disk, memory, deserialized, 1 replicas)
+- Scan ExistingRDD[carat#18,cut#19,color#20,clarity#21,depth#22,table#23,price#24,x#25,y#26,z#27]
and the corresponding SQL query (sdf_gather component not included):
dbplyr::remote_query(result)
<SQL> SELECT `variable`, `level`, `freq`, `rank`, `total`, `freq` / `total` * 100.0 AS `ratio`
FROM (SELECT `variable`, `level`, `freq`, rank() OVER (PARTITION BY `variable` ORDER BY -`freq`) AS `rank`, sum(`freq`) OVER (PARTITION BY `variable`) AS `total`
FROM (SELECT *
FROM (SELECT `variable`, `level`, count(*) AS `freq`
FROM `sparklyr_tmp_ded2576b9f1`
GROUP BY `variable`, `level`) `dsbksdfhtf`
ORDER BY -`freq`) `obyrzsxeus`) `ekejqyjrfz`
After much searching, I can't seem to figure this out.
I'm trying to write a function that:
takes a data frame, db
groups the data frame by var1
returns the mean and sd by group for several different columns
Here is my function:
myfun <- function(db, var1, ...) {
  var1 <- enquo(var1)
  var2 <- quos(...)
  for (i in var2) {
    db %>%
      group_by(!!var1) %>%
      summarise(mean_var = mean(!!!var2))
  }
}
When I pass the following, nothing is returned:
myfun(data, group, age, bmi)
Ideally, I would like to group both age and bmi by group and return the mean and sd for each. In the future, I would like to pass many more columns from data into the function.
The output would be similar to summaryBy from the doBy package, but for many columns at once, and would look like:
Group  age.mean  age.sd
0
1

Group  bmi.mean  bmi.sd
0
1
Your loop is unnecessary (you aren't doing anything with i, and the pipeline's result inside the loop is never printed or returned, which is why you see nothing). Instead, you can use summarise_at to achieve the effect you want:
myfun <- function(db, var1, ...) {
  var1 <- enquo(var1)
  var2 <- quos(...)
  db %>%
    group_by(!!var1) %>%
    summarise_at(vars(!!!var2), c(mean = mean, sd = sd))
}
And if we test it out with the diamonds dataset:
myfun(diamonds, cut, x, z)
cut x_mean z_mean x_sd z_sd
<ord> <dbl> <dbl> <dbl> <dbl>
1 Fair 6.25 3.98 0.964 0.652
2 Good 5.84 3.64 1.06 0.655
3 Very Good 5.74 3.56 1.10 0.730
4 Premium 5.97 3.65 1.19 0.731
5 Ideal 5.51 3.40 1.06 0.658
To get the formatting closer to what you had in mind in your original post, we can use a bit of tidyr magic:
myfun <- function(db, var1, ...) {
  var1 <- enquo(var1)
  var2 <- quos(...)
  db %>%
    group_by(!!var1) %>%
    summarise_at(vars(!!!var2), c(mean = mean, sd = sd)) %>%
    gather(variable, value, -(!!var1)) %>%
    separate(variable, c('variable', 'measure'), sep = '_') %>%
    spread(measure, value) %>%
    arrange(variable, !!var1)
}
cut variable mean sd
<ord> <chr> <dbl> <dbl>
1 Fair x 6.25 0.964
2 Good x 5.84 1.06
3 Very Good x 5.74 1.10
4 Premium x 5.97 1.19
5 Ideal x 5.51 1.06
6 Fair z 3.98 0.652
7 Good z 3.64 0.655
8 Very Good z 3.56 0.730
9 Premium z 3.65 0.731
10 Ideal z 3.40 0.658
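For what it's worth, summarise_at(), gather(), and spread() have since been superseded. A sketch of the same idea in current dplyr/tidyr idiom (assuming the measured columns' names contain no underscore, since "_" is used to split the generated names):

myfun <- function(db, var1, ...) {
  db %>%
    group_by({{ var1 }}) %>%
    summarise(across(c(...), list(mean = mean, sd = sd))) %>%
    tidyr::pivot_longer(-{{ var1 }},
                        names_to = c("variable", ".value"),
                        names_sep = "_") %>%
    arrange(variable, {{ var1 }})
}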
My question is how to use an if...else statement in a dplyr chain.
For example:
select.vars <- function(data, price = TRUE) {
  diamonds %>% {if (price) select(price) else select(carat)}
}
select.vars(diamonds)
I got this error:
Error in UseMethod("select_") :
no applicable method for 'select_' applied to an object of class "logical"
This is a meaningless function, just for illustration purposes.
Thanks a lot.
We can use the if/else inside select():
select.vars <- function(data, price = TRUE) {
  diamonds %>%
    select(if (price) "price" else "carat")
}

resprice <- select.vars(diamonds)
rescarat <- select.vars(diamonds, FALSE)
head(rescarat)
# A tibble: 6 x 1
# carat
# <dbl>
#1 0.23
#2 0.21
#3 0.23
#4 0.29
#5 0.31
#6 0.24
head(resprice)
# A tibble: 6 x 1
# price
# <int>
#1 326
#2 326
#3 327
#4 334
#5 335
#6 336
I just figured it out. Simply add the dot (.) to each select() call:
select.vars <- function(data, price = TRUE) {
  diamonds %>% {if (price) select(., price) else select(., carat)}
}