I'd like to produce "wide" summary tables of data in this sort of format:
---- Centiles ----
Param Group Mean SD 25% 50% 75%
Height 1 x.xx x.xxx x.xx x.xx x.xx
2 x.xx x.xxx x.xx x.xx x.xx
3 x.xx x.xxx x.xx x.xx x.xx
Weight 1 x.xx x.xxx x.xx x.xx x.xx
2 x.xx x.xxx x.xx x.xx x.xx
3 x.xx x.xxx x.xx x.xx x.xx
I can do that in dplyr 0.8.x. I can do it generically, with a function that can handle arbitrary grouping variables with arbitrary numbers of levels and arbitrary statistics summarising arbitrary numbers of variables with arbitrary names. I get that level of flexibility by making my data tidy. That's not what this question is about.
First, some toy data:
set.seed(123456)
toy <- tibble(
Group=rep(1:3, each=5),
Height=1.65 + rnorm(15, 0, 0.1),
Weight= 75 + rnorm(15, 0, 10)
) %>%
pivot_longer(
values_to="Value",
names_to="Parameter",
cols=c(Height, Weight)
)
Now, a simple summary function, and a helper:
quibble2 <- function(x, q = c(0.25, 0.5, 0.75)) {
tibble(Value := quantile(x, q), "Quantile" := q)
}
mySummary <- function(data, ...) {
data %>%
group_by(Parameter, Group) %>%
summarise(..., .groups="drop")
}
So I can say things like
summary <- mySummary(toy, Q=quibble2(Value), Mean=mean(Value, na.rm=TRUE), SD=sd(Value, na.rm=TRUE))
summary %>% head()
Giving
# A tibble: 6 x 5
Parameter Group Q$Value $Quantile Mean SD
<chr> <int> <dbl> <dbl> <dbl> <dbl>
1 Height 1 1.45 0.25 1.54 0.141
2 Height 1 1.49 0.5 1.54 0.141
3 Height 1 1.59 0.75 1.54 0.141
4 Height 2 1.64 0.25 1.66 0.0649
5 Height 2 1.68 0.5 1.66 0.0649
6 Height 2 1.68 0.75 1.66 0.0649
So that's the summary I need, but it's in long format. And Q is a df-col. It's a tibble:
is_tibble(summary$Q)
[1] TRUE
So pivot_wider doesn't seem to work. I can use nest_by() to get to a one-row-per-group format:
toySummary <- summary %>% nest_by(Group, Mean, SD)
toySummary
# Rowwise: Group, Mean, SD
Group Mean SD data
<int> <dbl> <dbl> <list<tbl_df[,2]>>
1 1 1.54 0.141 [3 × 2]
2 1 78.8 10.2 [3 × 2]
3 2 1.66 0.0649 [3 × 2]
4 2 82.9 9.09 [3 × 2]
5 3 1.63 0.100 [3 × 2]
6 3 71.0 10.8 [3 × 2]
But now the format of the centiles is even more complicated:
> toySummary$data[1]
<list_of<
tbl_df<
Parameter: character
Q :
tbl_df<
Value : double
Quantile: double
>
>
>[1]>
[[1]]
# A tibble: 3 x 2
Parameter Q$Value $Quantile
<chr> <dbl> <dbl>
1 Height 1.45 0.25
2 Height 1.49 0.5
3 Height 1.59 0.75
It looks like a list, so I guess some form of lapply would probably work, but is there a neater, tidy, solution that I've not spotted yet? I've discovered several new verbs that I didn't know abou whilst researching this question (chop, pack, rowwise(), nest_by and such) but none seem to give me what I want: ideally, a tibble with 6 rows (defined by unique Group and Parameter combinations) and columns for Mean, SD, Q25, Q50 and Q75.
To clarify in response to the first two proposed answers: getting the exact numbers that my toy example generates is less important than finding a generic technique for moving from the df-col(s) that summarise returns in dplyr v1.0.0 to a wide data summary of the general form that my example illustrates.
revised answer
Here is my revised answer. This time, I rewrote your quibble2 function with enframe and pivot_wider so that it returns a tibble with three rows.
This will again lead to a df-col in your summary tibble, and now we can use unpack directly, without using pivot_wider to get the expected outcome.
This should generalize on centiles etc. as well.
library(tidyverse)
set.seed(123456)
toy <- tibble(
Group=rep(1:3, each=5),
Height=1.65 + rnorm(15, 0, 0.1),
Weight= 75 + rnorm(15, 0, 10)
) %>%
pivot_longer(
values_to="Value",
names_to="Parameter",
cols=c(Height, Weight)
)
quibble2 <- function(x, q = c(0.25, 0.5, 0.75)) {
pivot_wider(enframe(quantile(x, q)),
names_from = name,
values_from = value)
}
mySummary <- function(data, ...) {
data %>%
group_by(Parameter, Group) %>%
summarise(..., .groups="drop")
}
summary <- mySummary(toy, Q=quibble2(Value), Mean=mean(Value, na.rm=TRUE), SD=sd(Value, na.rm=TRUE))
summary %>%
unpack(Q)
#> # A tibble: 6 x 7
#> Parameter Group `25%` `50%` `75%` Mean SD
#> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Height 1 1.62 1.66 1.73 1.70 0.108
#> 2 Height 2 1.73 1.77 1.78 1.76 0.105
#> 3 Height 3 1.55 1.64 1.76 1.65 0.109
#> 4 Weight 1 75.6 80.6 84.3 80.0 9.05
#> 5 Weight 2 75.4 76.9 79.6 77.4 7.27
#> 6 Weight 3 70.7 75.2 82.0 76.3 6.94
Created on 2020-06-13 by the reprex package (v0.3.0)
Second approach
without changing quibble2, we would need to first call unpack and then pivot_wider. This should scale as well.
library(tidyverse)
set.seed(123456)
toy <- tibble(
Group=rep(1:3, each=5),
Height=1.65 + rnorm(15, 0, 0.1),
Weight= 75 + rnorm(15, 0, 10)
) %>%
pivot_longer(
values_to="Value",
names_to="Parameter",
cols=c(Height, Weight)
)
quibble2 <- function(x, q = c(0.25, 0.5, 0.75)) {
tibble(Value := quantile(x, q), "Quantile" := q)
}
mySummary <- function(data, ...) {
data %>%
group_by(Parameter, Group) %>%
summarise(..., .groups="drop")
}
summary <- mySummary(toy, Q=quibble2(Value), Mean=mean(Value, na.rm=TRUE), SD=sd(Value, na.rm=TRUE))
summary %>%
unpack(Q) %>%
pivot_wider(names_from = Quantile, values_from = Value)
#> # A tibble: 6 x 7
#> Parameter Group Mean SD `0.25` `0.5` `0.75`
#> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Height 1 1.70 0.108 1.62 1.66 1.73
#> 2 Height 2 1.76 0.105 1.73 1.77 1.78
#> 3 Height 3 1.65 0.109 1.55 1.64 1.76
#> 4 Weight 1 80.0 9.05 75.6 80.6 84.3
#> 5 Weight 2 77.4 7.27 75.4 76.9 79.6
#> 6 Weight 3 76.3 6.94 70.7 75.2 82.0
Created on 2020-06-13 by the reprex package (v0.3.0)
generalized approach
I tried to figure out a more general approach by rewriting the mySummary function. Now it will convert automatically those outputs to df-cols which return a vector or a named vector. It will also wrap list automatically around expressions if necessary.
Then, I defined a function widen which will widen the df as much as possible, by preserving rows, including calling broom::tidy on supported list-columns.
The approach is not perfect, and could be extended by including unnest_wider in the widen function.
Note, that I changed the grouping in the example to be able to use t.test as another example output.
library(tidyverse)
set.seed(123456)
toy <- tibble(
Group=rep(1:3, each=5),
Height=1.65 + rnorm(15, 0, 0.1),
Weight= 75 + rnorm(15, 0, 10)
) %>%
pivot_longer(
values_to="Value",
names_to="Parameter",
cols=c(Height, Weight)
)
# modified summary function
mySummary <- function(data, ...) {
fns <- rlang::enquos(...)
fns <- map(fns, function(x) {
res <- rlang::eval_tidy(x, data = data)
if ( ((is.vector(res) || is.factor(res)) && length(res) == 1) ||
("list" %in% class(res) && is.list(res)) ||
rlang::call_name(rlang::quo_get_expr(x)) == "list") {
x
}
else if ((is.vector(res) || is.factor(res)) && length(res) > 1) {
x_expr <- as.character(list(rlang::quo_get_expr(x)))
x_expr <- paste0(
"pivot_wider(enframe(",
x_expr,
"), names_from = name, values_from = value)"
)
x <- rlang::quo_set_expr(x, str2lang(x_expr))
x
} else {
x_expr <- as.character(list(rlang::quo_get_expr(x)))
x_expr <- paste0("list(", x_expr,")")
x <- rlang::quo_set_expr(x, str2lang(x_expr))
x
}
})
data %>%
group_by(Parameter) %>%
summarise(!!! fns, .groups="drop")
}
# A function to automatically widen the df as much as possible while preserving rows
widen <- function(df) {
df_cols <- names(df)[map_lgl(df, is.data.frame)]
df <- unpack(df, all_of(df_cols), names_sep = "_")
try_tidy <- function(x) {
tryCatch({
broom::tidy(x)
}, error = function(e) {
x
})
}
df <- df %>% rowwise() %>% mutate(across(where(is.list), try_tidy))
ungroup(df)
}
# if you want to specify function arguments for convenience use purrr::partial
quantile3 <- partial(quantile, x = , q = c(.25, .5, .75))
summary <- mySummary(toy,
Q = quantile3(Value),
R = range(Value),
T_test = t.test(Value),
Mean = mean(Value, na.rm=TRUE),
SD = sd(Value, na.rm=TRUE)
)
summary
#> # A tibble: 2 x 6
#> Parameter Q$`0%` $`25%` $`50%` $`75%` $`100%` R$`1` $`2` T_test Mean SD
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <dbl> <dbl>
#> 1 Height 1.54 1.62 1.73 1.77 1.90 1.54 1.90 <htest> 1.70 0.109
#> 2 Weight 67.5 72.9 76.9 83.2 91.7 67.5 91.7 <htest> 77.9 7.40
widen(summary)
#> # A tibble: 2 x 11
#> Parameter `Q_0%` `Q_25%` `Q_50%` `Q_75%` `Q_100%` R_1 R_2 T_test$estimate
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Height 1.54 1.62 1.73 1.77 1.90 1.54 1.90 1.70
#> 2 Weight 67.5 72.9 76.9 83.2 91.7 67.5 91.7 77.9
#> # … with 9 more variables: $statistic <dbl>, $p.value <dbl>, $parameter <dbl>,
#> # $conf.low <dbl>, $conf.high <dbl>, $method <chr>, $alternative <chr>,
#> # Mean <dbl>, SD <dbl>
Created on 2020-06-14 by the reprex package (v0.3.0)
What if you change quibble2 to return a list, and then use unnest_wider?
quibble2 <- function(x, q = c(0.25, 0.5, 0.75)) {
list(quantile(x, q))
}
mySummary(toy, Q=quibble2(Value), Mean=mean(Value, na.rm=TRUE), SD=sd(Value, na.rm=TRUE)) %>%
unnest_wider(Q)
# A tibble: 6 x 7
Parameter Group `25%` `50%` `75%` Mean SD
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Height 1 1.62 1.66 1.73 1.70 0.108
2 Height 2 1.73 1.77 1.78 1.76 0.105
3 Height 3 1.55 1.64 1.76 1.65 0.109
4 Weight 1 75.6 80.6 84.3 80.0 9.05
5 Weight 2 75.4 76.9 79.6 77.4 7.27
6 Weight 3 70.7 75.2 82.0 76.3 6.94
Related
Let me take simulated datasets to explain:
I have dataset dt and dt1
# dataset 1 `dt`
set.seed(12)
dt <- rnorm(5000,mean=10,sd=1)
dt <- data.frame(dt)
dt$group <- c("case","control")
colnames(dt) <- c("severity","group")
head(dt)
severity group
1 8.519432 case
2 11.577169 control
3 9.043256 case
4 9.079995 control
5 8.002358 case
6 9.727704 control
# dataset 2 `dt2`
set.seed(12)
dt2 <- rnorm(200,mean=12,sd=1)
dt2 <- data.frame(dt2)
dt2$group <- c("case2","control2")
colnames(dt2) <- c("severity","group")
head(dt2)
severity group
1 10.51943 case2
2 13.57717 control2
3 11.04326 case2
4 11.07999 control2
5 10.00236 case2
6 11.72770 control2
I am building one 1000 iterations for loop to do the following steps:
randomly take 500 rows from the dt and save as dt_sub
rbind dt_sub with dt2 and save as bd
select only rows with group as either case2 or control from the bd dataset (only cares the difference between these two groups)
t.tests on the variable severity between the case2 and control group
output t.tests results to t
use a for loop to repeat 1000 times
iteratively appends all t.test results to a dataframe results
Following is the code that I built in r
library(broom)
library(dplyr)
iter <- 1000
t <- data.frame()
for (i in 1:iter) {
dt_sub <- dt[sample(nrow(dt),500),]
bd <- rbind(dt_sub,dt2)
compare <- filter(bd, group %in% c("case2", "control"))
compare %>% group_by(group) %>% do(tidy(t.test(severity ~ group,data = compare))) -> t
t$iter <- i
}
results <- do.call(rbind,t)
My question is, this code works well when iter=1, but how should I set the compare %>% group_by(group) %>% do(tidy(t.test(severity ~ group,data = compare))) -> t line to ensure each run's t.test results will not be overwritten when iter ≥ 1? I tried t[i] but failed, anyone could advise please?
Thanks.
Create a function which runs the process once.
library(broom)
library(dplyr)
t_test_function <- function() {
dt_sub <- dt[sample(nrow(dt),500),]
bd <- rbind(dt_sub,dt2)
compare <- filter(bd, group %in% c("case2", "control"))
compare %>%
group_by(group) %>%
do(tidy(t.test(severity ~ group,data = compare))) %>%
ungroup
}
t_test_function()
# group estimate estimate1 estimate2 statistic p.value parameter conf.low
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 case2 1.94 11.9 9.99 17.4 9.40e-42 199. #1.72
#2 cont… 1.94 11.9 9.99 17.4 9.40e-42 199. 1.72
# … with 3 more variables: conf.high <dbl>, method <chr>,
# alternative <chr>
Now you can call this iter times using replicate and combine the dataset.
iter <- 5
results <- bind_rows(replicate(iter, t_test_function(), simplify = FALSE), .id = 'iter')
# A tibble: 10 x 12
# iter group estimate estimate1 estimate2 statistic p.value parameter
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 case2 1.88 11.9 10.1 17.3 1.05e-40 189.
# 2 1 cont… 1.88 11.9 10.1 17.3 1.05e-40 189.
# 3 2 case2 1.96 11.9 9.97 17.8 9.88e-43 194.
# 4 2 cont… 1.96 11.9 9.97 17.8 9.88e-43 194.
# 5 3 case2 1.94 11.9 9.99 17.9 3.76e-42 184.
# 6 3 cont… 1.94 11.9 9.99 17.9 3.76e-42 184.
# 7 4 case2 2.03 11.9 9.90 18.6 1.82e-44 189.
# 8 4 cont… 2.03 11.9 9.90 18.6 1.82e-44 189.
# 9 5 case2 1.96 11.9 9.97 18.1 7.05e-43 187.
#10 5 cont… 1.96 11.9 9.97 18.1 7.05e-43 187.
# … with 4 more variables: conf.low <dbl>, conf.high <dbl>, method <chr>,
# alternative <chr>
I have a df that looks something like this like this:
set.seed(42)
ID <- sample(1:30, 100, rep=T)
Trait <- sample(0:1, 100, rep=T)
Year <- sample(1992:1999, 100, rep=T)
df <- cbind(ID, Trait, Year)
df <- as.data.frame(df)
Where ID is an individual organism, trait is a presence/absence of a phenotype and Year is the year an observation was made.
I would like to model if trait is random between individuals, something like this
library(MCMCglmm)
m <- MCMCglmm(Trait ~ ID, random = ~ Year, data = df, family = "categorical")
Now, would like to shuffle the Trait column and run x permutations, to check if my observed mean and CI fall outside of what's expected from random.
I could do this with a for loop, but I'd rather use a tidyverse solution.
I've read that lapply is a bette(?) alternative, but I am struggling to find a specific enough walk-through that I can follow.
I'd appreciate any advice offered here.
Cheers!
Jamie
EDIT October 10th. Cleaned up the code and per comment below added the code to give you back a nice organized tibble\dataframe
### decide how many shuffles you want and name them
### in an orderly fashion for the output
shuffles <- 1:10
names(shuffles) <- paste0("shuffle_", shuffles)
library(MCMCglmm)
library(dplyr)
library(tibble)
library(purrr)
ddd <- purrr::map(shuffles,
~ df %>%
mutate(Trait = sample(Trait)) %>%
MCMCglmm(fixed = Trait ~ ID,
random = ~ Year,
data = .,
family = "categorical",
verbose = FALSE)) %>%
purrr::map( ~ tibble::as_tibble(summary(.x)$solutions, rownames = "model_term")) %>%
dplyr::bind_rows(., .id = 'shuffle')
ddd
#> # A tibble: 20 x 7
#> shuffle model_term post.mean `l-95% CI` `u-95% CI` eff.samp pMCMC
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 shuffle_1 (Intercept) 112. 6.39 233. 103. 0.016
#> 2 shuffle_1 ID -6.31 -13.5 -0.297 112. 0.014
#> 3 shuffle_2 (Intercept) 24.9 -72.5 133. 778. 0.526
#> 4 shuffle_2 ID -0.327 -6.33 5.33 849. 0.858
#> 5 shuffle_3 (Intercept) 4.39 -77.3 87.4 161. 0.876
#> 6 shuffle_3 ID 1.04 -3.84 5.99 121. 0.662
#> 7 shuffle_4 (Intercept) 7.71 -79.0 107. 418. 0.902
#> 8 shuffle_4 ID 0.899 -4.40 6.57 408. 0.694
#> 9 shuffle_5 (Intercept) 30.4 -62.4 144. 732. 0.51
#> 10 shuffle_5 ID -0.644 -6.61 4.94 970. 0.866
#> 11 shuffle_6 (Intercept) -45.5 -148. 42.7 208. 0.302
#> 12 shuffle_6 ID 4.73 -0.211 11.6 89.1 0.058
#> 13 shuffle_7 (Intercept) -16.2 -133. 85.9 108. 0.696
#> 14 shuffle_7 ID 2.47 -2.42 10.3 47.8 0.304
#> 15 shuffle_8 (Intercept) 0.568 0.549 0.581 6.60 0.001
#> 16 shuffle_8 ID -0.0185 -0.0197 -0.0168 2.96 0.001
#> 17 shuffle_9 (Intercept) -6.95 -112. 92.2 452. 0.886
#> 18 shuffle_9 ID 2.07 -3.30 8.95 370. 0.476
#> 19 shuffle_10 (Intercept) 43.8 -57.0 159. 775. 0.396
#> 20 shuffle_10 ID -1.36 -7.44 5.08 901. 0.62
Your original data
set.seed(42)
ID <- sample(1:30, 100, rep=T)
Trait <- sample(0:1, 100, rep=T)
Year <- sample(1992:1999, 100, rep=T)
df <- cbind(ID, Trait, Year)
df <- as.data.frame(df)
After much searching, I can't seem to figure this out.
Trying to write a function that:
takes a data frame, db
groups the data frame by var1
returns the mean and sd by group on several different columns
Here is my function,
myfun <- function(db,var1, ...) {
var1 <- enquo(var1)
var2 <- quos(...)
for (i in var2) {
db %>%
group_by(!!var1) %>%
summarise(mean_var = mean(!!!var2))
}}
when I pass the following, nothing returns
myfun(data, group, age, bmi)
Ideally, I would like to group both age and bmi by group and return the mean and sd for each. In the future, I would like to pass many more columns from data into the function...
The output would be similar to summaryBy from doby package, but on many columns at once and would look like:
Group age.mean age.sd
0
1
bmi.mean bmi.sd
0
1
Your loop appears to be unnecessary (you aren't doing anything with i). Instead, you could use summarize_at to achieve the effect you want:
myfun <- function(db,var1, ...) {
var1 <- enquo(var1)
var2 <- quos(...)
db %>%
group_by(!!var1) %>%
summarise_at(vars(!!!var2), c(mean = mean, sd = sd))
}
And if we test it out with diamonds dataset:
myfun(diamonds, cut, x, z)
cut x_mean z_mean x_sd z_sd
<ord> <dbl> <dbl> <dbl> <dbl>
1 Fair 6.25 3.98 0.964 0.652
2 Good 5.84 3.64 1.06 0.655
3 Very Good 5.74 3.56 1.10 0.730
4 Premium 5.97 3.65 1.19 0.731
5 Ideal 5.51 3.40 1.06 0.658
To get the formatting closer to what you had in mind in your original post, we can use a bit of tidyr magic:
myfun <- function(db,var1, ...) {
var1 <- enquo(var1)
var2 <- quos(...)
db %>%
group_by(!!var1) %>%
summarise_at(vars(!!!var2), c(mean = mean, sd = sd)) %>%
gather(variable, value, -(!!var1)) %>%
separate(variable, c('variable', 'measure'), sep = '_') %>%
spread(measure, value) %>%
arrange(variable, !!var1)
}
cut variable mean sd
<ord> <chr> <dbl> <dbl>
1 Fair x 6.25 0.964
2 Good x 5.84 1.06
3 Very Good x 5.74 1.10
4 Premium x 5.97 1.19
5 Ideal x 5.51 1.06
6 Fair z 3.98 0.652
7 Good z 3.64 0.655
8 Very Good z 3.56 0.730
9 Premium z 3.65 0.731
10 Ideal z 3.40 0.658
I have several models fit to predict an outcome y = x1 + x2 + .....+x22. That's a fair number of predictors and a fair number of models. My customers want to know what's the marginal impact of each X on the estimated y. The models may include splines and interaction terms. I can do this, but it's cumbersome and requires loops or a lot of copy paste, which is slow or error prone. Can I do this better by writing my function differently and/or using purrr or an *apply function? Reproducible example is below. Ideally, I could write one function and apply it to longdata.
## create my fake data.
library(tidyverse)
library (rms)
ltrans<- function(l1){
newvar <- exp(l1)/(exp(l1)+1)
return(newvar)
}
set.seed(123)
mystates <- c("AL","AR","TN")
mydf <- data.frame(idno = seq(1:1500),state = rep(mystates,500))
mydf$x1[mydf$state=='AL'] <- rnorm(500,50,7)
mydf$x1[mydf$state=='AR'] <- rnorm(500,55,8)
mydf$x1[mydf$state=='TN'] <- rnorm(500,48,10)
mydf$x2 <- sample(1:5,500, replace = T)
mydf$x3 <- (abs(rnorm(1500,10,20)))^2
mydf$outcome <- as.numeric(cut2(sample(1:100,1500,replace = T),95))-1
dd<- datadist(mydf)
options(datadist = 'dd')
m1 <- lrm(outcome ~ x1 + x2+ rcs(x3,3), data = mydf)
dothemath <- function(x1 = x1ref,x2 = x2ref,x3 = x3ref) {
ltrans(-2.1802256-0.01114239*x1+0.050319692*x2-0.00079289232* x3+
7.6508189e-10*pmax(x3-7.4686271,0)^3-9.0897627e-10*pmax(x3- 217.97865,0)^3+
1.4389439e-10*pmax(x3-1337.2538,0)^3)}
x1ref <- 51.4
x2ref <- 3
x3ref <- 217.9
dothemath() ## 0.0591
mydf$referent <- dothemath()
mydf$thisobs <- dothemath(x1 = mydf$x1, x2 = mydf$x2, x3 = mydf$x3)
mydf$predicted <- predict(m1,mydf,type = "fitted.ind") ## yes, matches.
mydf$x1_marginaleffect <- dothemath(x1= mydf$x1)/mydf$referent
mydf$x2_marginaleffect <- dothemath(x2 = mydf$x2)/mydf$referent
mydf$x3_marginaleffect <- dothemath(x3 = mydf$x3)/mydf$referent
## can I do this with long data?
longdata <- mydf %>%
select(idno,state,referent,thisobs,x1,x2,x3) %>%
gather(varname,value,x1:x3)
##longdata$marginaleffect <- dothemath(longdata$varname = longdata$value) ## no, this does not work.
## I need to communicate to the function which variable it is evaluating.
longdata$marginaleffect[longdata$varname=="x1"] <- dothemath(x1 = longdata$value[longdata$varname=="x1"])/
longdata$referent[longdata$varname=="x1"]
longdata$marginaleffect[longdata$varname=="x2"] <- dothemath(x2 = longdata$value[longdata$varname=="x2"])/
longdata$referent[longdata$varname=="x2"]
longdata$marginaleffect[longdata$varname=="x3"] <- dothemath(x3 = longdata$value[longdata$varname=="x3"])/
longdata$referent[longdata$varname=="x3"]
testing<- inner_join(longdata[longdata$varname=="x1",c(1,7)],mydf[,c(1,10)])
head(testing) ## yes, both methods work.
Mostly you're just talking about a grouped mutate, with the caveat that dothemath is built such that you need to specify the variable name, which can be done by using do.call or purrr::invoke to call it on a named list of parameters:
longdata <- longdata %>%
group_by(varname) %>%
mutate(marginaleffect = invoke(dothemath, setNames(list(value), varname[1])) / referent)
longdata
#> # A tibble: 4,500 x 7
#> # Groups: varname [3]
#> idno state referent thisobs varname value marginaleffect
#> <int> <fct> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1 1 AL 0.0591 0.0688 x1 46.1 1.06
#> 2 2 AR 0.0591 0.0516 x1 50.2 1.01
#> 3 3 TN 0.0591 0.0727 x1 38.0 1.15
#> 4 4 AL 0.0591 0.0667 x1 48.4 1.03
#> 5 5 AR 0.0591 0.0515 x1 47.1 1.05
#> 6 6 TN 0.0591 0.0484 x1 37.6 1.15
#> 7 7 AL 0.0591 0.0519 x1 60.9 0.905
#> 8 8 AR 0.0591 0.0531 x1 63.2 0.883
#> 9 9 TN 0.0591 0.0780 x1 47.8 1.04
#> 10 10 AL 0.0591 0.0575 x1 50.5 1.01
#> # ... with 4,490 more rows
# the first values look similar
inner_join(longdata[longdata$varname == "x1", c(1,7)], mydf[,c(1,10)])
#> Joining, by = "idno"
#> # A tibble: 1,500 x 3
#> idno marginaleffect x1_marginaleffect
#> <int> <dbl> <dbl>
#> 1 1 1.06 1.06
#> 2 2 1.01 1.01
#> 3 3 1.15 1.15
#> 4 4 1.03 1.03
#> 5 5 1.05 1.05
#> 6 6 1.15 1.15
#> 7 7 0.905 0.905
#> 8 8 0.883 0.883
#> 9 9 1.04 1.04
#> 10 10 1.01 1.01
#> # ... with 1,490 more rows
# check everything is the same
mydf %>%
gather(varname, marginaleffect, x1_marginaleffect:x3_marginaleffect) %>%
select(idno, varname, marginaleffect) %>%
mutate(varname = substr(varname, 1, 2)) %>%
all_equal(select(longdata, idno, varname, marginaleffect))
#> [1] TRUE
It may be easier to reconfigure dothemath to take an additional parameter of the variable name so as to avoid the gymnastics.
I need to use the qchisq function on a column of a sparklyr data frame.
The problem is that it seems that qchisq function is not implemented in Spark. If I am reading the error message below correctly, sparklyr tried execute a function called "QCHISQ", however this doesn't exist neither in Hive SQL, nor in Spark.
In general, is there a way to run arbitrary functions that are not implemented in Hive or Spark, with sparklyr? I know about spark_apply, but haven't figured out how to configure it yet.
> mydf = data.frame(beta=runif(100, -5, 5), pval = runif(100, 0.001, 0.1))
> mydf_tbl = copy_to(con, mydf)
> mydf_tbl
# Source: table<mydf> [?? x 2]
# Database: spark_connection
beta pval
<dbl> <dbl>
1 3.42 0.0913
2 -1.72 0.0629
3 0.515 0.0335
4 -3.12 0.0717
5 -2.12 0.0253
6 1.36 0.00640
7 -3.33 0.0896
8 1.36 0.0235
9 0.619 0.0414
10 4.73 0.0416
> mydf_tbl %>% mutate(se = sqrt(beta^2/qchisq(pval)))
Error: org.apache.spark.sql.AnalysisException: Undefined function: 'QCHISQ'.
This function is neither a registered temporary function nor a permanent function registered in the database 'default'.; line 1 pos 49
As you noted you can use spark_apply:
mydf_tbl %>%
spark_apply(function(df)
dplyr::mutate(df, se = sqrt(beta^2/qchisq(pval, df = 12))))
# # Source: table<sparklyr_tmp_14bd5feacf5> [?? x 3]
# # Database: spark_connection
# beta pval X3
# <dbl> <dbl> <dbl>
# 1 1.66 0.0763 0.686
# 2 0.153 0.0872 0.0623
# 3 2.96 0.0485 1.30
# 4 4.86 0.0349 2.22
# 5 -1.82 0.0712 0.760
# 6 2.34 0.0295 1.10
# 7 3.54 0.0297 1.65
# 8 4.57 0.0784 1.88
# 9 4.94 0.0394 2.23
# 10 -0.610 0.0906 0.246
# # ... with more rows
but fair warning - it is embarrassingly slow. Unfortunately you don't have alternative here, short of writing your own Scala / Java extensions.
In the end I've used an horrible hack, which for this case works fine.
Another solution would have been to write a User Defined Function (UDF), but sparklyr doesn't support it yet: https://github.com/rstudio/sparklyr/issues/1052
This is the hack I've used. In short, I precompute a qchisq table, upload it as a sparklyr object, then join. If I compare this with results calculated on a local data frame, I get a correlation of r=0.99999990902236146617.
#' #param n: number of significant digits to use
> check_precomputed_strategy = function(n) {
chisq = data.frame(pval=seq(0, 1, 1/(10**(n)))) %>%
mutate(qval=qchisq(pval, df=1, lower.tail = FALSE)) %>%
mutate(pval_s = as.character(round(as.integer(pval*10**n),0)))
chisq %>% head %>% print
chisq_tbl = copy_to(con, chisq, overwrite=T)
mydf = data.frame(beta=runif(100, -5, 5), pval = runif(100, 0.001, 0.1)) %>%
mutate(se1 = sqrt(beta^2/qchisq(pval, df=1, lower.tail = FALSE)))
mydf_tbl = copy_to(con, mydf)
mydf_tbl.up = mydf_tbl %>%
mutate(pval_s=as.character(round(as.integer(pval*10**n),0))) %>%
left_join(chisq_tbl, by="pval_s") %>%
mutate(se=sqrt(beta^2 / qval)) %>%
collect %>%
filter(!duplicated(beta))
mydf_tbl.up %>% head %>% print
mydf_tbl.up %>% filter(complete.cases(.)) %>% nrow %>% print
mydf_tbl.up %>% filter(complete.cases(.)) %>% select(se, se1) %>% cor
}
> check_precomputed_strategy(4)
pval qval pval_s
1 0.00000000000000000000000 Inf 0
2 0.00010000000000000000479 15.136705226623396570 1
3 0.00020000000000000000958 13.831083619091122827 2
4 0.00030000000000000002793 13.070394140069462097 3
5 0.00040000000000000001917 12.532193305401813532 4
6 0.00050000000000000001041 12.115665146397173402 5
# A tibble: 6 x 8
beta pval.x se1 myvar pval_s pval.y qval se
<dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
1 3.42 0.0913 2.03 1. 912 0.0912 2.85 2.03
2 -1.72 0.0629 0.927 1. 628 0.0628 3.46 0.927
3 0.515 0.0335 0.242 1. 335 0.0335 4.52 0.242
4 -3.12 0.0717 1.73 1. 716 0.0716 3.25 1.73
5 -2.12 0.0253 0.947 1. 253 0.0253 5.00 0.946
6 1.36 0.00640 0.498 1. 63 0.00630 7.46 0.497
[1] 100
se se1
se 1.00000000000000000000 0.99999990902236146617
se1 0.99999990902236146617 1.00000000000000000000