I am trying to write a custom function that uses rlang's quasiquotation and also calls dplyr's join functions internally. Below is a minimal working example that illustrates my problem.
# needed libraries
library(tidyverse)
# function definition
df_combiner <- function(data, x, group.by) {
# check how many variables were entered for this grouping variable
group.by <- as.list(rlang::quo_squash(rlang::enquo(group.by)))
# based on number of arguments, select `group.by` in cases like `c(cyl)`,
# the first list element after `quo_squash` will be `c` which we don't need,
# but if we pass just `cyl`, there is no `c`, this will take care of that
# issue
group.by <-
if (length(group.by) == 1) {
group.by
} else {
group.by[-1]
}
# creating internal dataframe
df <- dplyr::group_by(.data = data, !!!group.by, .drop = TRUE)
# creating dataframes to be joined: one with tally, one with summary
df_tally <- dplyr::tally(df)
df_mean <- dplyr::summarise(df, mean = mean({{ x }}, na.rm = TRUE))
# without specifying `by` argument, this works but prints a message I want to avoid
print(dplyr::left_join(x = df_tally, y = df_mean))
# joining by specifying `by` argument (my failed attempt)
dplyr::left_join(x = df_tally, y = df_mean, by = !!!group.by)
}
# using the function
df_combiner(diamonds, carat, c(cut, clarity))
#> Joining, by = c("cut", "clarity")
#> # A tibble: 40 x 4
#> # Groups: cut [5]
#> cut clarity n mean
#> <ord> <ord> <int> <dbl>
#> 1 Fair I1 210 1.36
#> 2 Fair SI2 466 1.20
#> 3 Fair SI1 408 0.965
#> 4 Fair VS2 261 0.885
#> 5 Fair VS1 170 0.880
#> 6 Fair VVS2 69 0.692
#> 7 Fair VVS1 17 0.665
#> 8 Fair IF 9 0.474
#> 9 Good I1 96 1.20
#> 10 Good SI2 1081 1.04
#> # ... with 30 more rows
#> Error in !group.by: invalid argument type
As can be seen here, I want to avoid the message `Joining, by = c("cut", "clarity")`, so I want to supply the `by` argument to the `_join` function explicitly, but I am not sure how to do this.
(I've tried rlang::as_string, rlang::quo_name, etc.)
We can convert the symbols to strings with rlang::as_string():
dplyr::left_join(x = df_tally, y = df_mean,
by = map_chr(group.by, rlang::as_string))
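Putting it together, the full function becomes: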
df_combiner <- function(data, x, group.by) {
# check how many variables were entered for this grouping variable
group.by <- as.list(rlang::quo_squash(rlang::enquo(group.by)))
# based on number of arguments, select `group.by` in cases like `c(cyl)`,
# the first list element after `quo_squash` will be `c` which we don't need,
# but if we pass just `cyl`, there is no `c`, this will take care of that
# issue
group.by <-
if (length(group.by) == 1) {
group.by
} else {
group.by[-1]
}
# creating internal dataframe
df <- dplyr::group_by(.data = data, !!!group.by, .drop = TRUE)
# creating dataframes to be joined: one with tally, one with summary
df_tally <- dplyr::tally(df)
df_mean <- dplyr::summarise(df, mean = mean({{ x }}, na.rm = TRUE))
# without specifying `by` argument, this works but prints a message I want to avoid
#print(dplyr::left_join(x = df_tally, y = df_mean))
# joining by specifying `by` argument (my failed attempt)
dplyr::left_join(x = df_tally, y = df_mean, by = map_chr(group.by, rlang::as_string))
}
Checking:
df_combiner(diamonds, carat, c(cut, clarity))
# A tibble: 40 x 4
# Groups: cut [5]
# cut clarity n mean
# <ord> <ord> <int> <dbl>
# 1 Fair I1 210 1.36
# 2 Fair SI2 466 1.20
# 3 Fair SI1 408 0.965
# 4 Fair VS2 261 0.885
# 5 Fair VS1 170 0.880
# 6 Fair VVS2 69 0.692
# 7 Fair VVS1 17 0.665
# 8 Fair IF 9 0.474
# 9 Good I1 96 1.20
#10 Good SI2 1081 1.04
# … with 30 more rows
Join functions take a character vector for their `by` argument. Use deparse() to go from expressions to strings:
dplyr::left_join(x = df_tally, y = df_mean, by = map_chr(group.by, deparse))
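For example, a quick check of what deparse() does to the symbols collected above:
group.by <- as.list(quote(c(cut, clarity)))[-1]
purrr::map_chr(group.by, deparse)
#> [1] "cut"     "clarity"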
As mentioned by earlier authors, `by` expects a character vector. An easy way to move from lists of quosures to strings is illustrated by stanwood on the RStudio Community thread Should tidyeval be abandoned?
...tidyr::left_join still expects a list of strings: by = c("Species", "Sepal.Length"). If I want to supply these programmatically, the best solution I found was by = sapply(sepaldims, quo_text). Consider this a plug for abstracting quo_text to lists of quosures.
sepaldims <- quos(Species, Sepal.Length)
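Applied to this question, that pattern would be (a sketch; quo_text also accepts plain symbols, so it works on the squashed group.by list as well):
group.by <- rlang::quos(cut, clarity)
dplyr::left_join(x = df_tally, y = df_mean, by = sapply(group.by, rlang::quo_text))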
In the code below, I've simulated dice rolls at increasing sample sizes and computed the average roll at each sample size. My lapply function works, but I'm uncomfortable with it since I know sample_n has been superseded by slice_sample in dplyr. I would like to make my code better with a dplyr solution rather than sample_n() within the lapply. I think I may have other syntactical errors within the lapply. Here is the code:
#Dice
dice <- c(1,2,3,4,5,6) #the set of possible outcomes of a dice roll
dice_probs <- c(1/6,1/6,1/6,1/6,1/6,1/6) #the probability of each option per roll
dice_df <- data.frame(dice,dice_probs)
#Simulate dice rolls for each of these sample sizes and record the average of the rolls
sample_sizes <- c(10,25,50,100,1000,10000,100000,1000000,100000000) #compute at each sample size
output <- lapply(X=sample_sizes, FUN = function(var){
obs = sample_n(dice_df,var,replace=TRUE)
sample_mean = mean(obs$dice)
new.df <- data.frame(sample_mean, var)
return(new.df)
})
The final step is computing the difference compared to the expected value, 3.5. I want a column that shows the difference between 3.5 and the sample mean. We should see the difference decrease as the sample size increases.
output <- output %>%
mutate(difference = across(sample_mean, ~3.5 - .x))
When I run this, it's throwing this error:
Error in UseMethod("mutate") :
no applicable method for 'mutate' applied to an object of class "list"
I've tried using sapply but I get a similar error: no applicable method for 'mutate' applied to an object of class "c('matrix', 'array', 'list')"
If it helps, here was my failed attempt at using slice_sample:
output <- lapply(X=sample_sizes, FUN = function(...){
obs = slice_sample(dice_df, ..., .preserve=TRUE)
sample_mean = mean(obs$dice)
new.df <- data.frame(sample_mean, ...)
return(new.df)
})
I got this error: Error: '...' used in an incorrect context
The output is a list of single-row data.frames. We can bind them with bind_rows and subtract once, instead of doing it multiple times:
library(dplyr)
bind_rows(output) %>%
mutate(difference = 3.5 - sample_mean )
sample_mean var difference
1 3.500000 10 0.00000000
2 2.800000 25 0.70000000
3 3.440000 50 0.06000000
4 3.510000 100 -0.01000000
5 3.495000 1000 0.00500000
6 3.502200 10000 -0.00220000
7 3.502410 100000 -0.00241000
8 3.498094 1000000 0.00190600
9 3.500183 100000000 -0.00018332
The n argument of slice_sample corresponds to sample_n's size argument.
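For example, these two calls draw the same kind of sample from the question's dice_df:
obs <- sample_n(dice_df, 10, replace = TRUE)          # superseded
obs <- slice_sample(dice_df, n = 10, replace = TRUE)  # current equivalent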
And to calculate the differences for your output list, we can use purrr::map instead of dplyr::across.
library(dplyr)
library(purrr)
set.seed(123)
#Dice
dice <- c(1,2,3,4,5,6) #the set of possible outcomes of a dice roll
dice_probs <- c(1/6,1/6,1/6,1/6,1/6,1/6) #the probability of each option per roll
dice_df <- data.frame(dice,dice_probs)
#Simulate dice rolls for each of these sample sizes and record the average of the rolls
sample_sizes <- c(10,25,50,100,1000,10000,100000,1000000,100000000) #compute at each sample size
output <- lapply(X=sample_sizes, FUN = function(var){
obs = slice_sample(dice_df,n = var,replace=TRUE)
sample_mean = mean(obs$dice)
new.df <- data.frame(sample_mean, var)
return(new.df)
})
output %>%
map(~ 3.5 - .x$sample_mean)
#> [[1]]
#> [1] -0.5
#>
#> [[2]]
#> [1] 0.42
#>
#> [[3]]
#> [1] -0.04
#>
#> [[4]]
#> [1] -0.34
#>
#> [[5]]
#> [1] 0.025
#>
#> [[6]]
#> [1] 0.0317
#>
#> [[7]]
#> [1] 0.00416
#>
#> [[8]]
#> [1] -2.6e-05
#>
#> [[9]]
#> [1] -4.405e-05
Created on 2021-08-02 by the reprex package (v0.3.0)
Alternatively, we can use purrr::map_df and add a column diff inside each tibble, as proposed by Martin Gal in the comments:
output %>%
map_df(~ tibble(.x, diff = 3.5 - .x$sample_mean))
#> # A tibble: 9 x 3
#> sample_mean var diff
#> <dbl> <dbl> <dbl>
#> 1 2.6 10 0.9
#> 2 3.28 25 0.220
#> 3 3.66 50 -0.160
#> 4 3.5 100 0
#> 5 3.53 1000 -0.0270
#> 6 3.50 10000 -0.00180
#> 7 3.50 100000 -0.00444
#> 8 3.50 1000000 -0.000226
#> 9 3.50 100000000 -0.0000669
Here is a base R way -
transform(do.call(rbind, output), difference = 3.5 - sample_mean)
# sample_mean var difference
#1 3.80 10 -0.300000
#2 3.44 25 0.060000
#3 3.78 50 -0.280000
#4 3.30 100 0.200000
#5 3.52 1000 -0.015000
#6 3.50 10000 -0.004200
#7 3.50 100000 -0.004370
#8 3.50 1000000 0.002696
#9 3.50 100000000 0.000356
If you just need the difference value you can do -
3.5 - sapply(output, `[[`, 'sample_mean')
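sapply simplifies the list to a plain numeric vector; for the run shown above this gives:
#> [1] -0.300000  0.060000 -0.280000  0.200000 -0.015000 -0.004200 -0.004370  0.002696  0.000356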
I'm new to programming with dplyr. Let's say I have a function like this:
example <- function(data, group, var){
group <- enquo(group)
var <- enquo(var)
data %>%
group_by(!! group) %>%
summarise(
Min = min(!! var),
Max = max(!! var)
)
}
> diamonds %>% example(cut, price)
# A tibble: 5 x 3
cut Min Max
<ord> <dbl> <dbl>
1 Fair 337 18574
2 Good 327 18788
3 Very Good 336 18818
4 Premium 326 18823
5 Ideal 326 18806
I'd like to add a new argument sort. If not specified, the function does nothing more; if specified, the function also does arrange(desc(sort)).
I tried things like this:
example <- function(data, group, var, sort = NULL){
sort <- enquo(sort)
group <- enquo(group)
var <- enquo(var)
data <-
data %>%
group_by(!! group) %>%
summarise(
Min = min(!! var),
Max = max(!! var)
)
if(is.null(sort)) data
else arrange(data, desc(!! sort))
}
It works when the sort argument is specified
> diamonds %>% example(cut, price, sort = Min)
# A tibble: 5 x 3
cut Min Max
<ord> <dbl> <dbl>
1 Fair 337 18574
2 Very Good 336 18818
3 Good 327 18788
4 Premium 326 18823
5 Ideal 326 18806
But if I leave sort blank, I get this error message:
Error: cannot arrange column of class 'NULL' at position 1
How can I fix this? Thanks for the help, and sorry for my bad English.
Moving sort <- enquo(sort) into the else branch gets you closer, but you will still have a problem: if you call the function with sort = Min, the if (is.null(sort)) line will throw an error, because is.null tries to evaluate the Min object and cannot find it.
Instead of is.null, you can use missing, which does not evaluate the Min object.
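A quick illustration of the difference (Min is never evaluated by missing, so no error is thrown):
f_null <- function(x = NULL) is.null(x)
f_miss <- function(x = NULL) missing(x)
f_miss()     #> TRUE
f_miss(Min)  #> FALSE -- lazy evaluation means Min is never looked up
f_null(Min)  #> Error: object 'Min' not found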
example <- function(data, group, var, sort = NULL){
group <- enquo(group)
var <- enquo(var)
data <-
data %>%
group_by(!! group) %>%
summarise(
Min = min(!! var),
Max = max(!! var)
)
if(missing(sort)) {
data
} else {
sort <- enquo(sort)
arrange(data, desc(!! sort))
}
}
Now we get the following outputs:
diamonds %>% example(cut, price)
# A tibble: 5 x 3
cut Min Max
<ord> <dbl> <dbl>
1 Fair 337 18574
2 Good 327 18788
3 Very Good 336 18818
4 Premium 326 18823
5 Ideal 326 18806
diamonds %>% example(cut, price, sort = Min)
# A tibble: 5 x 3
cut Min Max
<ord> <dbl> <dbl>
1 Fair 337 18574
2 Very Good 336 18818
3 Good 327 18788
4 Premium 326 18823
5 Ideal 326 18806
See Testing a function that uses enquo() for a NULL parameter
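Alternatively, a sketch using rlang's quosure predicate quo_is_null(), which tests the captured quosure itself and so avoids both is.null() and missing():
example <- function(data, group, var, sort = NULL){
sort <- enquo(sort)
group <- enquo(group)
var <- enquo(var)
data <- data %>%
group_by(!! group) %>%
summarise(Min = min(!! var), Max = max(!! var))
if (rlang::quo_is_null(sort)) data else arrange(data, desc(!! sort))
}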
For each categorical variable in a dataset, I want to get counts and summary stats for each level. I can do this with the dlookr R package, using its diagnose_category() function. Since I don't have that package at work, I recreated the function using dplyr.
In sparklyr I am able to get counts for a single variable at a time; I need help extending this to all categorical variables.
Need help: implementing the function via sparklyr.
Table 1: Final output needed:
# A tibble: 20 x 6
variables levels N freq ratio rank
<chr> <ord> <int> <int> <dbl> <int>
1 cut Ideal 53940 21551 40.0 1
2 cut Premium 53940 13791 25.6 2
3 cut Very Good 53940 12082 22.4 3
4 cut Good 53940 4906 9.10 4
5 cut Fair 53940 1610 2.98 5
6 color G 53940 11292 20.9 1
7 color E 53940 9797 18.2 2
8 color F 53940 9542 17.7 3
9 color H 53940 8304 15.4 4
10 color D 53940 6775 12.6 5
11 color I 53940 5422 10.1 6
12 color J 53940 2808 5.21 7
13 clarity SI1 53940 13065 24.2 1
14 clarity VS2 53940 12258 22.7 2
15 clarity SI2 53940 9194 17.0 3
16 clarity VS1 53940 8171 15.1 4
17 clarity VVS2 53940 5066 9.39 5
18 clarity VVS1 53940 3655 6.78 6
19 clarity IF 53940 1790 3.32 7
20 clarity I1 53940 741 1.37 8
R Code:
# Categorical Variable Profile
# Table based on dlookr package, diagnose_category() function
# variables : variable names
# types: the data type of the variable
# levels: level names
# N : Number of observation
# freq : Number of observation at the level
# ratio : Percentage of observation at the level
# rank : Rank of occupancy ratio of levels
library(ggplot2)
library(dplyr)
library(tidyr)
library(purrr)
library(tibble)
library(stringr)
# Helper Function
cat_level_summary <- function(df,x) {
count(df,x, sort = TRUE) %>%
transmute(levels = x, N = sum(n), freq = n,
ratio = n / sum(n) * 100, rank = row_number())
}
# Loading
diamonds_tbl <- diamonds
# Main Code
CategoricalVariableProfile <- diamonds_tbl %>%
select_if(~!is.numeric(.)) %>%
map(~cat_level_summary(data.frame(x=.x), x)) %>%
do.call(rbind.data.frame, .) %>%
rownames_to_column(., "variables")%>%
mutate(variables = str_match(variables, ".*(?=\\.)")[, 1] )
Spark Code:
#Spark data Table
diamonds_tbl <- copy_to(sc, diamonds, "diamonds", overwrite = TRUE)
CategoricalVariableProfile <- diamonds_tbl %>%
group_by(cut) %>%
summarize(count = n()) %>%
sdf_register("CategoricalVariableProfile")
Flatten your data using sdf_gather:
long <- diamonds_tbl %>%
select(cut, color, clarity) %>%
sdf_gather("variable", "level", "cut", "color", "clarity")
Aggregate by variable and level:
counts <- long %>% group_by(variable, level) %>% summarise(freq = n())
And finally apply required window functions:
result <- counts %>%
arrange(-freq) %>%
mutate(
rank = rank(),
total = sum(freq, na.rm = TRUE),
ratio = freq / total * 100)
Which will give you
result
# Source: spark<?> [?? x 6]
# Groups: variable
# Ordered by: -freq
variable level freq rank total ratio
<chr> <chr> <dbl> <int> <dbl> <dbl>
1 cut Ideal 21551 1 53940 40.0
2 cut Premium 13791 2 53940 25.6
3 cut Very Good 12082 3 53940 22.4
4 cut Good 4906 4 53940 9.10
5 cut Fair 1610 5 53940 2.98
6 clarity SI1 13065 1 53940 24.2
7 clarity VS2 12258 2 53940 22.7
8 clarity SI2 9194 3 53940 17.0
9 clarity VS1 8171 4 53940 15.1
10 clarity VVS2 5066 5 53940 9.39
# … with more rows
with the following optimized plan
optimizedPlan(result)
<jobj[165]>
org.apache.spark.sql.catalyst.plans.logical.Project
Project [variable#524, level#525, freq#1478L, rank#1479, total#1480L, ((cast(freq#1478L as double) / cast(total#1480L as double)) * 100.0) AS ratio#1481]
+- Window [rank(_w1#1493L) windowspecdefinition(variable#524, _w1#1493L ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$())) AS rank#1479], [variable#524], [_w1#1493L ASC NULLS FIRST]
+- Window [sum(freq#1478L) windowspecdefinition(variable#524, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS total#1480L], [variable#524]
+- Project [variable#524, level#525, freq#1478L, -freq#1478L AS _w1#1493L]
+- Sort [-freq#1478L ASC NULLS FIRST], true
+- Aggregate [variable#524, level#525], [variable#524, level#525, count(1) AS freq#1478L]
+- Generate explode(map(cut, cut#19, color, color#20, clarity, clarity#21)), [0, 1, 2], false, [variable#524, level#525]
+- Project [cut#19, color#20, clarity#21]
+- InMemoryRelation [carat#18, cut#19, color#20, clarity#21, depth#22, table#23, price#24, x#25, y#26, z#27], StorageLevel(disk, memory, deserialized, 1 replicas)
+- Scan ExistingRDD[carat#18,cut#19,color#20,clarity#21,depth#22,table#23,price#24,x#25,y#26,z#27]
and query (sdf_gather component not included):
dbplyr::remote_query(result)
<SQL> SELECT `variable`, `level`, `freq`, `rank`, `total`, `freq` / `total` * 100.0 AS `ratio`
FROM (SELECT `variable`, `level`, `freq`, rank() OVER (PARTITION BY `variable` ORDER BY -`freq`) AS `rank`, sum(`freq`) OVER (PARTITION BY `variable`) AS `total`
FROM (SELECT *
FROM (SELECT `variable`, `level`, count(*) AS `freq`
FROM `sparklyr_tmp_ded2576b9f1`
GROUP BY `variable`, `level`) `dsbksdfhtf`
ORDER BY -`freq`) `obyrzsxeus`) `ekejqyjrfz`
After much searching, I can't seem to figure this out.
Trying to write a function that:
- takes a data frame, db
- groups the data frame by var1
- returns the mean and sd by group on several different columns
Here is my function:
myfun <- function(db,var1, ...) {
var1 <- enquo(var1)
var2 <- quos(...)
for (i in var2) {
db %>%
group_by(!!var1) %>%
summarise(mean_var = mean(!!!var2))
}}
When I pass the following, nothing is returned:
myfun(data, group, age, bmi)
Ideally, I would like to group both age and bmi by group and return the mean and sd for each. In the future, I would like to pass many more columns from data into the function...
The output would be similar to summaryBy from the doBy package, but on many columns at once, and would look like:
Group  age.mean  age.sd
0
1
Group  bmi.mean  bmi.sd
0
1
Your loop appears to be unnecessary (you aren't doing anything with i). Instead, you could use summarize_at to achieve the effect you want:
myfun <- function(db,var1, ...) {
var1 <- enquo(var1)
var2 <- quos(...)
db %>%
group_by(!!var1) %>%
summarise_at(vars(!!!var2), c(mean = mean, sd = sd))
}
And if we test it out with the diamonds dataset:
myfun(diamonds, cut, x, z)
cut x_mean z_mean x_sd z_sd
<ord> <dbl> <dbl> <dbl> <dbl>
1 Fair 6.25 3.98 0.964 0.652
2 Good 5.84 3.64 1.06 0.655
3 Very Good 5.74 3.56 1.10 0.730
4 Premium 5.97 3.65 1.19 0.731
5 Ideal 5.51 3.40 1.06 0.658
To get the formatting closer to what you had in mind in your original post, we can use a bit of tidyr magic:
myfun <- function(db,var1, ...) {
var1 <- enquo(var1)
var2 <- quos(...)
db %>%
group_by(!!var1) %>%
summarise_at(vars(!!!var2), c(mean = mean, sd = sd)) %>%
gather(variable, value, -(!!var1)) %>%
separate(variable, c('variable', 'measure'), sep = '_') %>%
spread(measure, value) %>%
arrange(variable, !!var1)
}
myfun(diamonds, cut, x, z)
cut variable mean sd
<ord> <chr> <dbl> <dbl>
1 Fair x 6.25 0.964
2 Good x 5.84 1.06
3 Very Good x 5.74 1.10
4 Premium x 5.97 1.19
5 Ideal x 5.51 1.06
6 Fair z 3.98 0.652
7 Good z 3.64 0.655
8 Very Good z 3.56 0.730
9 Premium z 3.65 0.731
10 Ideal z 3.40 0.658
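For reference, in current dplyr (>= 1.0) summarise_at() and quos() are superseded; a sketch of the same idea with across() and embracing, assuming the same inputs:
myfun <- function(db, var1, ...) {
  db %>%
    group_by({{ var1 }}) %>%
    summarise(across(c(...), list(mean = mean, sd = sd)))
}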
I want to run an lm regression on each row of my data dt.
My code is
coe <- apply(dt, 1, FUN = function(x) lm(dbl ~ bld, data = as.data.frame(x))$coefficients)
But it returns:
Error in eval(predvars, data, env) : object 'dbl' not found
I have confirmed that dbl and bld are both present in my data dt, so I do not know how to deal with this error.
I am guessing you have mistakenly written about running a regression by row (which is impossible, since there would be just one observation for x and y in y ~ x). Instead, you want to run the regression repeatedly for some grouping variable?
This is pretty easy to do with groupedstats:
groupedstats::grouped_lm(
data = ggplot2::diamonds,
grouping.vars = c(cut, color), # grouping variables
formula = price ~ carat * clarity # formula
)
#> # A tibble: 547 x 10
#> cut color term estimate std.error t.value conf.low conf.high
#> <ord> <ord> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Ideal E (Int~ -3085. 64.9 -47.5 -3212. -2958.
#> 2 Ideal E carat 10529. 74.1 142. 10384. 10674.
#> 3 Ideal E clar~ -2088. 267. -7.81 -2612. -1564.
#> 4 Ideal E clar~ 168. 265. 0.633 -352. 688.
#> 5 Ideal E clar~ -926. 217. -4.26 -1352. -500.
#> 6 Ideal E clar~ 625. 157. 3.99 318. 932.
#> 7 Ideal E clar~ -392. 107. -3.65 -602. -181.
#> 8 Ideal E clar~ 83.9 79.1 1.06 -71.1 239.
#> 9 Ideal E clar~ -40.8 67.4 -0.605 -173. 91.4
#> 10 Ideal E cara~ 9746. 287. 34.0 9185. 10308.
#> # ... with 537 more rows, and 2 more variables: p.value <dbl>,
#> # significance <chr>
Created on 2018-08-19 by the reprex package (v0.2.0.9000).
There are two problems with what you're trying to do. First, when you pass dt to apply, each row x is dropped to a (named) numeric vector, and coercing it with as.data.frame yields a data.frame with a single column. This is why dbl is not found:
> x <- c(a = 1, b = 0.58)
> as.data.frame(x)
x
a 1.00
b 0.58
The second problem is that you're trying to fit a regression to a single data point. In essence, you are doing this:
> lm(b ~ a, data = data.frame(a = 1, b = 0.58))
Call:
lm(formula = b ~ a, data = data.frame(a = 1, b = 0.58))
Coefficients:
(Intercept) a
0.58 NA
You will not be able to estimate the parameter of interest because you'll need more points to do that.
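If a per-group regression is what you're after and you want to stay within dplyr, one sketch uses group_modify() with broom, assuming dt has a grouping column (called grp here for illustration):
library(dplyr)
dt %>%
  group_by(grp) %>%
  group_modify(~ broom::tidy(lm(dbl ~ bld, data = .x)))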