Having problems with quoting and unquoting variable names within an R function that uses dplyr. Have been through this site as well as Hadley's Programming with dplyr site and it's still getting the best of me.
The function code that doesn't work is:
gcreatedata <- function(dataframe,depvar,iv1,iv2){
depvar <- enquo(depvar)
iv1 <- enquo(iv1)
iv2 <- enquo(iv2)
newdata <- dataframe %>%
mutate(!!iv1 := factor(!!iv1)) %>%
group_by(!!iv1, !!iv2) %>%
summarise(TheMean = mean(!!depvar,na.rm=TRUE),
TheSD = sd(!!depvar,na.rm=TRUE),
TheSEM = sd(!!depvar,na.rm=TRUE)/sqrt(length(!!depvar)),
CI95Muliplier = qt(.95/2 + .5, length(!!depvar)-1))
return(as_tibble(newdata))
}
calling it with mtcars it would be
sss <- gcreatedata(mtcars,mpg,am,cyl)
I'm simply trying to convert the variable am to a factor for use downstream in a ggplot. Yes I know I could do it before I enter the function but I want it generic. The function works minus the factor step just fine which you can see if you run this version.
gcreatedata <- function(dataframe,depvar,iv1,iv2){
depvar <- enquo(depvar)
iv1 <- enquo(iv1)
iv2 <- enquo(iv2)
newdata <- dataframe %>%
mutate(foo := factor(!!iv1)) %>%
group_by(foo, !!iv2) %>%
summarise(TheMean = mean(!!depvar,na.rm=TRUE),
TheSD = sd(!!depvar,na.rm=TRUE),
TheSEM = sd(!!depvar,na.rm=TRUE)/sqrt(length(!!depvar)),
CI95Muliplier = qt(.95/2 + .5, length(!!depvar)-1))
return(as_tibble(newdata))
}
sss <- gcreatedata(mtcars,mpg,am,cyl)
It returns what I want except for the fact that am has become foo how do I get the name right in this line of code mutate(!!iv1 := factor(!!iv1)) %>% right now I'm getting an Error: LHS must be a name or string message and despite all manner of combinations I could think of no dice.
Thanks in advance.
Your situation is described in the tutorial part here: http://dplyr.tidyverse.org/articles/programming.html#different-input-and-output-variable
The following code works for me:
> library(dplyr)
>
> gcreatedata <- function(dataframe,depvar,iv1,iv2){
+ depvar <- enquo(depvar)
+ iv1_q <- enquo(iv1)
+ iv2 <- enquo(iv2)
+
+ iv1_name <- paste0("mean_", quo_name(iv1_q))
+
+ newdata <- dataframe %>%
+ mutate(!!iv1_name := factor(!!iv1_q)) %>%
+ group_by(!!iv1_q, !!iv2) %>%
+ summarise(TheMean = mean(!!depvar,na.rm=TRUE),
+ TheSD = sd(!!depvar,na.rm=TRUE),
+ TheSEM = sd(!!depvar,na.rm=TRUE)/sqrt(length(!!depvar)),
+ CI95Muliplier = qt(.95/2 + .5, length(!!depvar)-1))
+ return(as_tibble(newdata))
+ }
> sss <- gcreatedata(mtcars,mpg,am,cyl)
> sss
# A tibble: 6 x 6
# Groups: am [?]
am cyl TheMean TheSD TheSEM CI95Muliplier
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 0 4.00 22.9 1.45 0.839 4.30
2 0 6.00 19.1 1.63 0.816 3.18
3 0 8.00 15.0 2.77 0.801 2.20
4 1.00 4.00 28.1 4.48 1.59 2.36
5 1.00 6.00 20.6 0.751 0.433 4.30
6 1.00 8.00 15.4 0.566 0.400 12.7
Hope that helps!
Well 24 hours brought me a clearer head. Here's the answer should anyone else need it in the future...
gcreatedata <- function(dataframe,depvar,iv1,iv2){
depvar <- enquo(depvar)
iv1 <- enquo(iv1)
iv2 <- enquo(iv2)
newdata <- dataframe %>%
mutate(!!quo_name(iv1) := factor(!!iv1), !!quo_name(iv2) := factor(!!iv2)) %>%
group_by(!!iv1, !!iv2) %>%
summarise(TheMean = mean(!!depvar,na.rm=TRUE),
TheSD = sd(!!depvar,na.rm=TRUE),
TheSEM = sd(!!depvar,na.rm=TRUE)/sqrt(length(!!depvar)),
CI95Muliplier = qt(.95/2 + .5, length(!!depvar)-1))
return(as_tibble(newdata))
}
to test it on common data...
gcreatedata(mtcars,mpg,am,vs)
# A tibble: 4 x 6
# Groups: am [?]
am vs TheMean TheSD TheSEM CI95Muliplier
<fct> <fct> <dbl> <dbl> <dbl> <dbl>
1 0 0 15.0 2.77 0.801 2.20
2 0 1 20.7 2.47 0.934 2.45
3 1 0 19.8 4.01 1.64 2.57
4 1 1 28.4 4.76 1.80 2.45
Related
The answer to this question clearly explains how to retrieve tidy regression results by group when running a regression through a dplyr pipe, but the solution is no longer reproducible.
How can one use dplyr and broom in combination to run a regression by group and retrieve tidy results using R 4.02, dplyr 1.0.0, and broom 0.7.0?
Specifically, the example answer from the question linked above,
library(dplyr)
library(broom)
df.h = data.frame(
hour = factor(rep(1:24, each = 21)),
price = runif(504, min = -10, max = 125),
wind = runif(504, min = 0, max = 2500),
temp = runif(504, min = - 10, max = 25)
)
dfHour = df.h %>% group_by(hour) %>%
do(fitHour = lm(price ~ wind + temp, data = .))
# get the coefficients by group in a tidy data_frame
dfHourCoef = tidy(dfHour, fitHour)
returns the following error (and three warnings) when I run it on my system:
Error in var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm = na.rm) :
Calling var(x) on a factor x is defunct.
Use something like 'all(duplicated(x)[-1L])' to test for a constant vector.
In addition: Warning messages:
1: Data frame tidiers are deprecated and will be removed in an upcoming release of broom.
2: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA
3: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA
If I reformat df.h$hour as a character rather than factor,
df.h <- df.h %>%
mutate(
hour = as.character(hour)
)
re-run the regression by group, and again attempt to retrieve the results using broom::tidy,
dfHour = df.h %>% group_by(hour) %>%
do(fitHour = lm(price ~ wind + temp, data = .))
# get the coefficients by group in a tidy data_frame
dfHourCoef = tidy(dfHour, fitHour)
I get this error:
Error in var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm = na.rm) :
is.atomic(x) is not TRUE
I assume that the problem has to do with the fact that the group-level regression results are stored as lists in dfHour$fitHour, but I am unsure how to correct the error and once again tidily and quickly compile the regression results, as used to work in the originally posted code/answer.
****** Updated with more succinct code pulled from the dplyr 1.0.0 release notes ******
Thank you. I was struggling with a similar question with the update to dplyr 1.0.0 related to using the examples in the provided link. This was both a helpful question and answer.
One note as an FYI, do() has been superseded as of dplyr 1.0.0, so may consider using the updated language (now very efficient with my update):
dfHour = df.h %>%
# replace group_by() with nest_by()
# to convert your model data to a vector of lists
nest_by(hour) %>%
# change do() to mutate(), then add list() before your model
# make sure to change data = . to data = data
mutate(fitHour = list(lm(price ~ wind + temp, data = data))) %>%
summarise(tidy(mod))
Done!
This gives a very efficient df with select output stats. The last line replaces the following code (from my original response), which does the same thing, but less easily:
ungroup() %>%
# then leverage the feedback from #akrun
transmute(hour, HourCoef = map(fitHour, tidy)) %>%
unnest(HourCoef)
dfHour
Which gives the outupt:
# A tibble: 72 x 6
hour term estimate std.error statistic p.value
<fct> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 (Intercept) 68.6 21.0 3.27 0.00428
2 1 wind 0.000558 0.0124 0.0450 0.965
3 1 temp -0.866 0.907 -0.954 0.353
4 2 (Intercept) 31.9 17.4 1.83 0.0832
5 2 wind 0.00950 0.0113 0.838 0.413
6 2 temp 1.69 0.802 2.11 0.0490
7 3 (Intercept) 85.5 22.3 3.83 0.00122
8 3 wind -0.0210 0.0165 -1.27 0.220
9 3 temp 0.276 1.14 0.243 0.811
10 4 (Intercept) 73.3 15.1 4.86 0.000126
# ... with 62 more rows
Thanks for the patience, I am working through this myself!
Issue would be that there is a grouping attribute rowwise after the do call and the column 'fitHour' is a list. We can ungroup, loop over the list with map and tidy it to a list column
library(dplyr)
library(purrr)
library(broom)
df.h %>%
group_by(hour) %>%
do(fitHour = lm(price ~ wind + temp, data = .)) %>%
ungroup %>%
mutate(HourCoef = map(fitHour, tidy))
Or use unnest after the mtuate
df.h %>%
group_by(hour) %>%
do(fitHour = lm(price ~ wind + temp, data = .)) %>%
ungroup %>%
transmute(hour, HourCoef = map(fitHour, tidy)) %>%
unnest(HourCoef)
# A tibble: 72 x 6
# hour term estimate std.error statistic p.value
# <fct> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 1 (Intercept) 89.8 20.2 4.45 0.000308
# 2 1 wind 0.00493 0.0151 0.326 0.748
# 3 1 temp -1.84 1.08 -1.71 0.105
# 4 2 (Intercept) 75.6 23.7 3.20 0.00500
# 5 2 wind -0.00910 0.0146 -0.622 0.542
# 6 2 temp 0.192 0.853 0.225 0.824
# 7 3 (Intercept) 44.0 23.9 1.84 0.0822
# 8 3 wind -0.00158 0.0166 -0.0953 0.925
# 9 3 temp 0.622 1.19 0.520 0.609
#10 4 (Intercept) 57.8 18.9 3.06 0.00676
# … with 62 more rows
If we wanted a single dataset, pull the 'fitHour', loop over the list with map, condense it to a single dataset by row binding (suffix _dfr)
df.h %>%
group_by(hour) %>%
do(fitHour = lm(price ~ wind + temp, data = .)) %>%
ungroup %>%
pull(fitHour) %>%
map_dfr(tidy, .id = 'grp')
NOTE: The OP's error message was able to be replicated with R 4.02, dplyr 1.0.0 and broom 0.7.0
tidy(dfHour,fitHour)
Error in var(if (is.vector(x) || is.factor(x)) x else as.double(x),
na.rm = na.rm) :
Calling var(x) on a factor x is defunct.
Use something like 'all(duplicated(x)[-1L])' to test for a constant vector.
In addition: Warning messages:
1: Data frame tidiers are deprecated and will be removed in an upcoming release of broom.
2: In mean.default(X[[i]], ...) :
Your code actually works. Maybe package version or re starting a new R session could help:
library(dplyr)
library(broom)
df.h = data.frame(
hour = factor(rep(1:24, each = 21)),
price = runif(504, min = -10, max = 125),
wind = runif(504, min = 0, max = 2500),
temp = runif(504, min = - 10, max = 25)
)
dfHour = df.h %>% group_by(hour) %>%
do(fitHour = lm(price ~ wind + temp, data = .))
tidy(dfHour,fitHour)
# A tibble: 72 x 6
# Groups: hour [24]
hour term estimate std.error statistic p.value
<fct> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 (Intercept) 66.4 14.8 4.48 0.000288
2 1 wind 0.000474 0.00984 0.0482 0.962
3 1 temp 0.0691 0.945 0.0731 0.943
4 2 (Intercept) 66.5 20.4 3.26 0.00432
5 2 wind -0.00540 0.0127 -0.426 0.675
6 2 temp -0.306 0.944 -0.324 0.750
7 3 (Intercept) 86.5 17.3 5.00 0.0000936
8 3 wind -0.0119 0.00960 -1.24 0.232
9 3 temp -1.18 0.928 -1.27 0.221
10 4 (Intercept) 59.8 17.5 3.42 0.00304
# ... with 62 more rows
I'd like to produce "wide" summary tables of data in this sort of format:
---- Centiles ----
Param Group Mean SD 25% 50% 75%
Height 1 x.xx x.xxx x.xx x.xx x.xx
2 x.xx x.xxx x.xx x.xx x.xx
3 x.xx x.xxx x.xx x.xx x.xx
Weight 1 x.xx x.xxx x.xx x.xx x.xx
2 x.xx x.xxx x.xx x.xx x.xx
3 x.xx x.xxx x.xx x.xx x.xx
I can do that in dplyr 0.8.x. I can do it generically, with a function that can handle arbitrary grouping variables with arbitrary numbers of levels and arbitrary statistics summarising arbitrary numbers of variables with arbitrary names. I get that level of flexibility by making my data tidy. That's not what this question is about.
First, some toy data:
set.seed(123456)
toy <- tibble(
Group=rep(1:3, each=5),
Height=1.65 + rnorm(15, 0, 0.1),
Weight= 75 + rnorm(15, 0, 10)
) %>%
pivot_longer(
values_to="Value",
names_to="Parameter",
cols=c(Height, Weight)
)
Now, a simple summary function, and a helper:
quibble2 <- function(x, q = c(0.25, 0.5, 0.75)) {
tibble(Value := quantile(x, q), "Quantile" := q)
}
mySummary <- function(data, ...) {
data %>%
group_by(Parameter, Group) %>%
summarise(..., .groups="drop")
}
So I can say things like
summary <- mySummary(toy, Q=quibble2(Value), Mean=mean(Value, na.rm=TRUE), SD=sd(Value, na.rm=TRUE))
summary %>% head()
Giving
# A tibble: 6 x 5
Parameter Group Q$Value $Quantile Mean SD
<chr> <int> <dbl> <dbl> <dbl> <dbl>
1 Height 1 1.45 0.25 1.54 0.141
2 Height 1 1.49 0.5 1.54 0.141
3 Height 1 1.59 0.75 1.54 0.141
4 Height 2 1.64 0.25 1.66 0.0649
5 Height 2 1.68 0.5 1.66 0.0649
6 Height 2 1.68 0.75 1.66 0.0649
So that's the summary I need, but it's in long format. And Q is a df-col. It's a tibble:
is_tibble(summary$Q)
[1] TRUE
So pivot_wider doesn't seem to work. I can use nest_by() to get to a one-row-per-group format:
toySummary <- summary %>% nest_by(Group, Mean, SD)
toySummary
# Rowwise: Group, Mean, SD
Group Mean SD data
<int> <dbl> <dbl> <list<tbl_df[,2]>>
1 1 1.54 0.141 [3 × 2]
2 1 78.8 10.2 [3 × 2]
3 2 1.66 0.0649 [3 × 2]
4 2 82.9 9.09 [3 × 2]
5 3 1.63 0.100 [3 × 2]
6 3 71.0 10.8 [3 × 2]
But now the format of the centiles is even more complicated:
> toySummary$data[1]
<list_of<
tbl_df<
Parameter: character
Q :
tbl_df<
Value : double
Quantile: double
>
>
>[1]>
[[1]]
# A tibble: 3 x 2
Parameter Q$Value $Quantile
<chr> <dbl> <dbl>
1 Height 1.45 0.25
2 Height 1.49 0.5
3 Height 1.59 0.75
It looks like a list, so I guess some form of lapply would probably work, but is there a neater, tidy, solution that I've not spotted yet? I've discovered several new verbs that I didn't know abou whilst researching this question (chop, pack, rowwise(), nest_by and such) but none seem to give me what I want: ideally, a tibble with 6 rows (defined by unique Group and Parameter combinations) and columns for Mean, SD, Q25, Q50 and Q75.
To clarify in response to the first two proposed answers: getting the exact numbers that my toy example generates is less important than finding a generic technique for moving from the df-col(s) that summarise returns in dplyr v1.0.0 to a wide data summary of the general form that my example illustrates.
revised answer
Here is my revised answer. This time, I rewrote your quibble2 function with enframe and pivot_wider so that it returns a tibble with three rows.
This will again lead to a df-col in your summary tibble, and now we can use unpack directly, without using pivot_wider to get the expected outcome.
This should generalize on centiles etc. as well.
library(tidyverse)
set.seed(123456)
toy <- tibble(
Group=rep(1:3, each=5),
Height=1.65 + rnorm(15, 0, 0.1),
Weight= 75 + rnorm(15, 0, 10)
) %>%
pivot_longer(
values_to="Value",
names_to="Parameter",
cols=c(Height, Weight)
)
quibble2 <- function(x, q = c(0.25, 0.5, 0.75)) {
pivot_wider(enframe(quantile(x, q)),
names_from = name,
values_from = value)
}
mySummary <- function(data, ...) {
data %>%
group_by(Parameter, Group) %>%
summarise(..., .groups="drop")
}
summary <- mySummary(toy, Q=quibble2(Value), Mean=mean(Value, na.rm=TRUE), SD=sd(Value, na.rm=TRUE))
summary %>%
unpack(Q)
#> # A tibble: 6 x 7
#> Parameter Group `25%` `50%` `75%` Mean SD
#> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Height 1 1.62 1.66 1.73 1.70 0.108
#> 2 Height 2 1.73 1.77 1.78 1.76 0.105
#> 3 Height 3 1.55 1.64 1.76 1.65 0.109
#> 4 Weight 1 75.6 80.6 84.3 80.0 9.05
#> 5 Weight 2 75.4 76.9 79.6 77.4 7.27
#> 6 Weight 3 70.7 75.2 82.0 76.3 6.94
Created on 2020-06-13 by the reprex package (v0.3.0)
Second approach
without changing quibble2, we would need to first call unpack and then pivot_wider. This should scale as well.
library(tidyverse)
set.seed(123456)
toy <- tibble(
Group=rep(1:3, each=5),
Height=1.65 + rnorm(15, 0, 0.1),
Weight= 75 + rnorm(15, 0, 10)
) %>%
pivot_longer(
values_to="Value",
names_to="Parameter",
cols=c(Height, Weight)
)
quibble2 <- function(x, q = c(0.25, 0.5, 0.75)) {
tibble(Value := quantile(x, q), "Quantile" := q)
}
mySummary <- function(data, ...) {
data %>%
group_by(Parameter, Group) %>%
summarise(..., .groups="drop")
}
summary <- mySummary(toy, Q=quibble2(Value), Mean=mean(Value, na.rm=TRUE), SD=sd(Value, na.rm=TRUE))
summary %>%
unpack(Q) %>%
pivot_wider(names_from = Quantile, values_from = Value)
#> # A tibble: 6 x 7
#> Parameter Group Mean SD `0.25` `0.5` `0.75`
#> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Height 1 1.70 0.108 1.62 1.66 1.73
#> 2 Height 2 1.76 0.105 1.73 1.77 1.78
#> 3 Height 3 1.65 0.109 1.55 1.64 1.76
#> 4 Weight 1 80.0 9.05 75.6 80.6 84.3
#> 5 Weight 2 77.4 7.27 75.4 76.9 79.6
#> 6 Weight 3 76.3 6.94 70.7 75.2 82.0
Created on 2020-06-13 by the reprex package (v0.3.0)
generalized approach
I tried to figure out a more general approach by rewriting the mySummary function. Now it will convert automatically those outputs to df-cols which return a vector or a named vector. It will also wrap list automatically around expressions if necessary.
Then, I defined a function widen which will widen the df as much as possible, by preserving rows, including calling broom::tidy on supported list-columns.
The approach is not perfect, and could be extended by including unnest_wider in the widen function.
Note, that I changed the grouping in the example to be able to use t.test as another example output.
library(tidyverse)
set.seed(123456)
toy <- tibble(
Group=rep(1:3, each=5),
Height=1.65 + rnorm(15, 0, 0.1),
Weight= 75 + rnorm(15, 0, 10)
) %>%
pivot_longer(
values_to="Value",
names_to="Parameter",
cols=c(Height, Weight)
)
# modified summary function
mySummary <- function(data, ...) {
fns <- rlang::enquos(...)
fns <- map(fns, function(x) {
res <- rlang::eval_tidy(x, data = data)
if ( ((is.vector(res) || is.factor(res)) && length(res) == 1) ||
("list" %in% class(res) && is.list(res)) ||
rlang::call_name(rlang::quo_get_expr(x)) == "list") {
x
}
else if ((is.vector(res) || is.factor(res)) && length(res) > 1) {
x_expr <- as.character(list(rlang::quo_get_expr(x)))
x_expr <- paste0(
"pivot_wider(enframe(",
x_expr,
"), names_from = name, values_from = value)"
)
x <- rlang::quo_set_expr(x, str2lang(x_expr))
x
} else {
x_expr <- as.character(list(rlang::quo_get_expr(x)))
x_expr <- paste0("list(", x_expr,")")
x <- rlang::quo_set_expr(x, str2lang(x_expr))
x
}
})
data %>%
group_by(Parameter) %>%
summarise(!!! fns, .groups="drop")
}
# A function to automatically widen the df as much as possible while preserving rows
widen <- function(df) {
df_cols <- names(df)[map_lgl(df, is.data.frame)]
df <- unpack(df, all_of(df_cols), names_sep = "_")
try_tidy <- function(x) {
tryCatch({
broom::tidy(x)
}, error = function(e) {
x
})
}
df <- df %>% rowwise() %>% mutate(across(where(is.list), try_tidy))
ungroup(df)
}
# if you want to specify function arguments for convenience use purrr::partial
quantile3 <- partial(quantile, x = , q = c(.25, .5, .75))
summary <- mySummary(toy,
Q = quantile3(Value),
R = range(Value),
T_test = t.test(Value),
Mean = mean(Value, na.rm=TRUE),
SD = sd(Value, na.rm=TRUE)
)
summary
#> # A tibble: 2 x 6
#> Parameter Q$`0%` $`25%` $`50%` $`75%` $`100%` R$`1` $`2` T_test Mean SD
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <dbl> <dbl>
#> 1 Height 1.54 1.62 1.73 1.77 1.90 1.54 1.90 <htest> 1.70 0.109
#> 2 Weight 67.5 72.9 76.9 83.2 91.7 67.5 91.7 <htest> 77.9 7.40
widen(summary)
#> # A tibble: 2 x 11
#> Parameter `Q_0%` `Q_25%` `Q_50%` `Q_75%` `Q_100%` R_1 R_2 T_test$estimate
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Height 1.54 1.62 1.73 1.77 1.90 1.54 1.90 1.70
#> 2 Weight 67.5 72.9 76.9 83.2 91.7 67.5 91.7 77.9
#> # … with 9 more variables: $statistic <dbl>, $p.value <dbl>, $parameter <dbl>,
#> # $conf.low <dbl>, $conf.high <dbl>, $method <chr>, $alternative <chr>,
#> # Mean <dbl>, SD <dbl>
Created on 2020-06-14 by the reprex package (v0.3.0)
What if you change quibble2 to return a list, and then use unnest_wider?
quibble2 <- function(x, q = c(0.25, 0.5, 0.75)) {
list(quantile(x, q))
}
mySummary(toy, Q=quibble2(Value), Mean=mean(Value, na.rm=TRUE), SD=sd(Value, na.rm=TRUE)) %>%
unnest_wider(Q)
# A tibble: 6 x 7
Parameter Group `25%` `50%` `75%` Mean SD
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Height 1 1.62 1.66 1.73 1.70 0.108
2 Height 2 1.73 1.77 1.78 1.76 0.105
3 Height 3 1.55 1.64 1.76 1.65 0.109
4 Weight 1 75.6 80.6 84.3 80.0 9.05
5 Weight 2 75.4 76.9 79.6 77.4 7.27
6 Weight 3 70.7 75.2 82.0 76.3 6.94
After much searching, I can't seem to figure this out.
Trying to write a function that:
takes a data frame, db
groups the data frame by var1
returns the mean and sd by group on several different columns
Here is my function,
myfun <- function(db,var1, ...) {
var1 <- enquo(var1)
var2 <- quos(...)
for (i in var2) {
db %>%
group_by(!!var1) %>%
summarise(mean_var = mean(!!!var2))
}}
when I pass the following, nothing returns
myfun(data, group, age, bmi)
Ideally, I would like to group both age and bmi by group and return the mean and sd for each. In the future, I would like to pass many more columns from data into the function...
The output would be similar to summaryBy from doby package, but on many columns at once and would look like:
Group age.mean age.sd
0
1
bmi.mean bmi.sd
0
1
Your loop appears to be unnecessary (you aren't doing anything with i). Instead, you could use summarize_at to achieve the effect you want:
myfun <- function(db,var1, ...) {
var1 <- enquo(var1)
var2 <- quos(...)
db %>%
group_by(!!var1) %>%
summarise_at(vars(!!!var2), c(mean = mean, sd = sd))
}
And if we test it out with diamonds dataset:
myfun(diamonds, cut, x, z)
cut x_mean z_mean x_sd z_sd
<ord> <dbl> <dbl> <dbl> <dbl>
1 Fair 6.25 3.98 0.964 0.652
2 Good 5.84 3.64 1.06 0.655
3 Very Good 5.74 3.56 1.10 0.730
4 Premium 5.97 3.65 1.19 0.731
5 Ideal 5.51 3.40 1.06 0.658
To get the formatting closer to what you had in mind in your original post, we can use a bit of tidyr magic:
myfun <- function(db,var1, ...) {
var1 <- enquo(var1)
var2 <- quos(...)
db %>%
group_by(!!var1) %>%
summarise_at(vars(!!!var2), c(mean = mean, sd = sd)) %>%
gather(variable, value, -(!!var1)) %>%
separate(variable, c('variable', 'measure'), sep = '_') %>%
spread(measure, value) %>%
arrange(variable, !!var1)
}
cut variable mean sd
<ord> <chr> <dbl> <dbl>
1 Fair x 6.25 0.964
2 Good x 5.84 1.06
3 Very Good x 5.74 1.10
4 Premium x 5.97 1.19
5 Ideal x 5.51 1.06
6 Fair z 3.98 0.652
7 Good z 3.64 0.655
8 Very Good z 3.56 0.730
9 Premium z 3.65 0.731
10 Ideal z 3.40 0.658
I have several models fit to predict an outcome y = x1 + x2 + .....+x22. That's a fair number of predictors and a fair number of models. My customers want to know what's the marginal impact of each X on the estimated y. The models may include splines and interaction terms. I can do this, but it's cumbersome and requires loops or a lot of copy paste, which is slow or error prone. Can I do this better by writing my function differently and/or using purrr or an *apply function? Reproducible example is below. Ideally, I could write one function and apply it to longdata.
## create my fake data.
library(tidyverse)
library (rms)
ltrans<- function(l1){
newvar <- exp(l1)/(exp(l1)+1)
return(newvar)
}
set.seed(123)
mystates <- c("AL","AR","TN")
mydf <- data.frame(idno = seq(1:1500),state = rep(mystates,500))
mydf$x1[mydf$state=='AL'] <- rnorm(500,50,7)
mydf$x1[mydf$state=='AR'] <- rnorm(500,55,8)
mydf$x1[mydf$state=='TN'] <- rnorm(500,48,10)
mydf$x2 <- sample(1:5,500, replace = T)
mydf$x3 <- (abs(rnorm(1500,10,20)))^2
mydf$outcome <- as.numeric(cut2(sample(1:100,1500,replace = T),95))-1
dd<- datadist(mydf)
options(datadist = 'dd')
m1 <- lrm(outcome ~ x1 + x2+ rcs(x3,3), data = mydf)
dothemath <- function(x1 = x1ref,x2 = x2ref,x3 = x3ref) {
ltrans(-2.1802256-0.01114239*x1+0.050319692*x2-0.00079289232* x3+
7.6508189e-10*pmax(x3-7.4686271,0)^3-9.0897627e-10*pmax(x3- 217.97865,0)^3+
1.4389439e-10*pmax(x3-1337.2538,0)^3)}
x1ref <- 51.4
x2ref <- 3
x3ref <- 217.9
dothemath() ## 0.0591
mydf$referent <- dothemath()
mydf$thisobs <- dothemath(x1 = mydf$x1, x2 = mydf$x2, x3 = mydf$x3)
mydf$predicted <- predict(m1,mydf,type = "fitted.ind") ## yes, matches.
mydf$x1_marginaleffect <- dothemath(x1= mydf$x1)/mydf$referent
mydf$x2_marginaleffect <- dothemath(x2 = mydf$x2)/mydf$referent
mydf$x3_marginaleffect <- dothemath(x3 = mydf$x3)/mydf$referent
## can I do this with long data?
longdata <- mydf %>%
select(idno,state,referent,thisobs,x1,x2,x3) %>%
gather(varname,value,x1:x3)
##longdata$marginaleffect <- dothemath(longdata$varname = longdata$value) ## no, this does not work.
## I need to communicate to the function which variable it is evaluating.
longdata$marginaleffect[longdata$varname=="x1"] <- dothemath(x1 = longdata$value[longdata$varname=="x1"])/
longdata$referent[longdata$varname=="x1"]
longdata$marginaleffect[longdata$varname=="x2"] <- dothemath(x2 = longdata$value[longdata$varname=="x2"])/
longdata$referent[longdata$varname=="x2"]
longdata$marginaleffect[longdata$varname=="x3"] <- dothemath(x3 = longdata$value[longdata$varname=="x3"])/
longdata$referent[longdata$varname=="x3"]
testing<- inner_join(longdata[longdata$varname=="x1",c(1,7)],mydf[,c(1,10)])
head(testing) ## yes, both methods work.
Mostly you're just talking about a grouped mutate, with the caveat that dothemath is built such that you need to specify the variable name, which can be done by using do.call or purrr::invoke to call it on a named list of parameters:
longdata <- longdata %>%
group_by(varname) %>%
mutate(marginaleffect = invoke(dothemath, setNames(list(value), varname[1])) / referent)
longdata
#> # A tibble: 4,500 x 7
#> # Groups: varname [3]
#> idno state referent thisobs varname value marginaleffect
#> <int> <fct> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1 1 AL 0.0591 0.0688 x1 46.1 1.06
#> 2 2 AR 0.0591 0.0516 x1 50.2 1.01
#> 3 3 TN 0.0591 0.0727 x1 38.0 1.15
#> 4 4 AL 0.0591 0.0667 x1 48.4 1.03
#> 5 5 AR 0.0591 0.0515 x1 47.1 1.05
#> 6 6 TN 0.0591 0.0484 x1 37.6 1.15
#> 7 7 AL 0.0591 0.0519 x1 60.9 0.905
#> 8 8 AR 0.0591 0.0531 x1 63.2 0.883
#> 9 9 TN 0.0591 0.0780 x1 47.8 1.04
#> 10 10 AL 0.0591 0.0575 x1 50.5 1.01
#> # ... with 4,490 more rows
# the first values look similar
inner_join(longdata[longdata$varname == "x1", c(1,7)], mydf[,c(1,10)])
#> Joining, by = "idno"
#> # A tibble: 1,500 x 3
#> idno marginaleffect x1_marginaleffect
#> <int> <dbl> <dbl>
#> 1 1 1.06 1.06
#> 2 2 1.01 1.01
#> 3 3 1.15 1.15
#> 4 4 1.03 1.03
#> 5 5 1.05 1.05
#> 6 6 1.15 1.15
#> 7 7 0.905 0.905
#> 8 8 0.883 0.883
#> 9 9 1.04 1.04
#> 10 10 1.01 1.01
#> # ... with 1,490 more rows
# check everything is the same
mydf %>%
gather(varname, marginaleffect, x1_marginaleffect:x3_marginaleffect) %>%
select(idno, varname, marginaleffect) %>%
mutate(varname = substr(varname, 1, 2)) %>%
all_equal(select(longdata, idno, varname, marginaleffect))
#> [1] TRUE
It may be easier to reconfigure dothemath to take an additional parameter of the variable name so as to avoid the gymnastics.
I need to use the qchisq function on a column of a sparklyr data frame.
The problem is that it seems that qchisq function is not implemented in Spark. If I am reading the error message below correctly, sparklyr tried execute a function called "QCHISQ", however this doesn't exist neither in Hive SQL, nor in Spark.
In general, is there a way to run arbitrary functions that are not implemented in Hive or Spark, with sparklyr? I know about spark_apply, but haven't figured out how to configure it yet.
> mydf = data.frame(beta=runif(100, -5, 5), pval = runif(100, 0.001, 0.1))
> mydf_tbl = copy_to(con, mydf)
> mydf_tbl
# Source: table<mydf> [?? x 2]
# Database: spark_connection
beta pval
<dbl> <dbl>
1 3.42 0.0913
2 -1.72 0.0629
3 0.515 0.0335
4 -3.12 0.0717
5 -2.12 0.0253
6 1.36 0.00640
7 -3.33 0.0896
8 1.36 0.0235
9 0.619 0.0414
10 4.73 0.0416
> mydf_tbl %>% mutate(se = sqrt(beta^2/qchisq(pval)))
Error: org.apache.spark.sql.AnalysisException: Undefined function: 'QCHISQ'.
This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 49
As you noted you can use spark_apply:
mydf_tbl %>%
spark_apply(function(df)
dplyr::mutate(df, se = sqrt(beta^2/qchisq(pval, df = 12))))
# # Source: table<sparklyr_tmp_14bd5feacf5> [?? x 3]
# # Database: spark_connection
# beta pval X3
# <dbl> <dbl> <dbl>
# 1 1.66 0.0763 0.686
# 2 0.153 0.0872 0.0623
# 3 2.96 0.0485 1.30
# 4 4.86 0.0349 2.22
# 5 -1.82 0.0712 0.760
# 6 2.34 0.0295 1.10
# 7 3.54 0.0297 1.65
# 8 4.57 0.0784 1.88
# 9 4.94 0.0394 2.23
# 10 -0.610 0.0906 0.246
# # ... with more rows
but fair warning - it is embarrassingly slow. Unfortunately you don't have alternative here, short of writing your own Scala / Java extensions.
In the end I've used an horrible hack, which for this case works fine.
Another solution would have been to write a User Defined Function (UDF), but sparklyr doesn't support it yet: https://github.com/rstudio/sparklyr/issues/1052
This is the hack I've used. In short, I precompute a qchisq table, upload it as a sparklyr object, then join. If I compare this with results calculated on a local data frame, I get a correlation of r=0.99999990902236146617.
#' #param n: number of significant digits to use
> check_precomputed_strategy = function(n) {
chisq = data.frame(pval=seq(0, 1, 1/(10**(n)))) %>%
mutate(qval=qchisq(pval, df=1, lower.tail = FALSE)) %>%
mutate(pval_s = as.character(round(as.integer(pval*10**n),0)))
chisq %>% head %>% print
chisq_tbl = copy_to(con, chisq, overwrite=T)
mydf = data.frame(beta=runif(100, -5, 5), pval = runif(100, 0.001, 0.1)) %>%
mutate(se1 = sqrt(beta^2/qchisq(pval, df=1, lower.tail = FALSE)))
mydf_tbl = copy_to(con, mydf)
mydf_tbl.up = mydf_tbl %>%
mutate(pval_s=as.character(round(as.integer(pval*10**n),0))) %>%
left_join(chisq_tbl, by="pval_s") %>%
mutate(se=sqrt(beta^2 / qval)) %>%
collect %>%
filter(!duplicated(beta))
mydf_tbl.up %>% head %>% print
mydf_tbl.up %>% filter(complete.cases(.)) %>% nrow %>% print
mydf_tbl.up %>% filter(complete.cases(.)) %>% select(se, se1) %>% cor
}
> check_precomputed_strategy(4)
pval qval pval_s
1 0.00000000000000000000000 Inf 0
2 0.00010000000000000000479 15.136705226623396570 1
3 0.00020000000000000000958 13.831083619091122827 2
4 0.00030000000000000002793 13.070394140069462097 3
5 0.00040000000000000001917 12.532193305401813532 4
6 0.00050000000000000001041 12.115665146397173402 5
# A tibble: 6 x 8
beta pval.x se1 myvar pval_s pval.y qval se
<dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 3.42 0.0913 2.03 1. 912 0.0912 2.85 2.03
2 -1.72 0.0629 0.927 1. 628 0.0628 3.46 0.927
3 0.515 0.0335 0.242 1. 335 0.0335 4.52 0.242
4 -3.12 0.0717 1.73 1. 716 0.0716 3.25 1.73
5 -2.12 0.0253 0.947 1. 253 0.0253 5.00 0.946
6 1.36 0.00640 0.498 1. 63 0.00630 7.46 0.497
[1] 100
se se1
se 1.00000000000000000000 0.99999990902236146617
se1 0.99999990902236146617 1.00000000000000000000