Problems with appending t.test results in a for loop - r

Let me take simulated datasets to explain:
I have two datasets, dt and dt2.
# dataset 1 `dt`
set.seed(12)
dt <- rnorm(5000,mean=10,sd=1)
dt <- data.frame(dt)
dt$group <- c("case","control")
colnames(dt) <- c("severity","group")
head(dt)
severity group
1 8.519432 case
2 11.577169 control
3 9.043256 case
4 9.079995 control
5 8.002358 case
6 9.727704 control
# dataset 2 `dt2`
set.seed(12)
dt2 <- rnorm(200,mean=12,sd=1)
dt2 <- data.frame(dt2)
dt2$group <- c("case2","control2")
colnames(dt2) <- c("severity","group")
head(dt2)
severity group
1 10.51943 case2
2 13.57717 control2
3 11.04326 case2
4 11.07999 control2
5 10.00236 case2
6 11.72770 control2
I am building a for loop with 1000 iterations that does the following steps:
randomly take 500 rows from the dt and save as dt_sub
rbind dt_sub with dt2 and save as bd
select only the rows of bd whose group is either case2 or control (I only care about the difference between these two groups)
t.tests on the variable severity between the case2 and control group
output t.tests results to t
use a for loop to repeat 1000 times
iteratively appends all t.test results to a dataframe results
Following is the code that I built in r
library(broom)
library(dplyr)
iter <- 1000
t <- data.frame()
for (i in 1:iter) {
dt_sub <- dt[sample(nrow(dt),500),]
bd <- rbind(dt_sub,dt2)
compare <- filter(bd, group %in% c("case2", "control"))
compare %>% group_by(group) %>% do(tidy(t.test(severity ~ group,data = compare))) -> t
t$iter <- i
}
results <- do.call(rbind,t)
My question is: this code works well when iter = 1, but how should I change the compare %>% group_by(group) %>% do(tidy(t.test(severity ~ group,data = compare))) -> t line so that each run's t.test results are not overwritten when iter > 1? I tried t[i] but it failed. Could anyone advise, please?
Thanks.

Create a function which runs the process once.
library(broom)
library(dplyr)
t_test_function <- function() {
dt_sub <- dt[sample(nrow(dt),500),]
bd <- rbind(dt_sub,dt2)
compare <- filter(bd, group %in% c("case2", "control"))
compare %>%
group_by(group) %>%
do(tidy(t.test(severity ~ group,data = compare))) %>%
ungroup
}
t_test_function()
# group estimate estimate1 estimate2 statistic p.value parameter conf.low
# <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#1 case2 1.94 11.9 9.99 17.4 9.40e-42 199. 1.72
#2 cont… 1.94 11.9 9.99 17.4 9.40e-42 199. 1.72
# … with 3 more variables: conf.high <dbl>, method <chr>,
# alternative <chr>
Now you can call this iter times using replicate and combine the dataset.
iter <- 5
results <- bind_rows(replicate(iter, t_test_function(), simplify = FALSE), .id = 'iter')
# A tibble: 10 x 12
# iter group estimate estimate1 estimate2 statistic p.value parameter
# <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 case2 1.88 11.9 10.1 17.3 1.05e-40 189.
# 2 1 cont… 1.88 11.9 10.1 17.3 1.05e-40 189.
# 3 2 case2 1.96 11.9 9.97 17.8 9.88e-43 194.
# 4 2 cont… 1.96 11.9 9.97 17.8 9.88e-43 194.
# 5 3 case2 1.94 11.9 9.99 17.9 3.76e-42 184.
# 6 3 cont… 1.94 11.9 9.99 17.9 3.76e-42 184.
# 7 4 case2 2.03 11.9 9.90 18.6 1.82e-44 189.
# 8 4 cont… 2.03 11.9 9.90 18.6 1.82e-44 189.
# 9 5 case2 1.96 11.9 9.97 18.1 7.05e-43 187.
#10 5 cont… 1.96 11.9 9.97 18.1 7.05e-43 187.
# … with 4 more variables: conf.low <dbl>, conf.high <dbl>, method <chr>,
# alternative <chr>
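If you would rather keep the explicit for loop from the question, one option is to store each iteration's result in a pre-allocated list and bind everything once at the end, instead of overwriting t. The following is only a minimal sketch under assumptions: it uses base R, it pulls a few fields out of the t.test object by hand rather than calling broom::tidy, and res_list and the chosen columns are illustrative names that do not appear in the original code.

```r
# Simulated data, following the question
set.seed(12)
dt  <- data.frame(severity = rnorm(5000, mean = 10, sd = 1),
                  group    = c("case", "control"))
set.seed(12)
dt2 <- data.frame(severity = rnorm(200, mean = 12, sd = 1),
                  group    = c("case2", "control2"))

iter <- 5
res_list <- vector("list", iter)   # one slot per iteration (illustrative name)

for (i in seq_len(iter)) {
  dt_sub  <- dt[sample(nrow(dt), 500), ]
  bd      <- rbind(dt_sub, dt2)
  compare <- bd[bd$group %in% c("case2", "control"), ]
  tt      <- t.test(severity ~ group, data = compare)

  # Store this run in its own slot instead of overwriting a single object
  res_list[[i]] <- data.frame(
    iter      = i,
    estimate1 = unname(tt$estimate[1]),
    estimate2 = unname(tt$estimate[2]),
    statistic = unname(tt$statistic),
    p.value   = tt$p.value,
    parameter = unname(tt$parameter)
  )
}

# Combine all iterations into one data frame, one row per run
results <- do.call(rbind, res_list)
```

Because the test is run once on compare without grouping, this gives one row per iteration rather than the duplicated case2/control rows shown above.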

Related

Unnesting tibble columns: "Wide" data summaries with dplyr v1.0.0

I'd like to produce "wide" summary tables of data in this sort of format:
---- Centiles ----
Param Group Mean SD 25% 50% 75%
Height 1 x.xx x.xxx x.xx x.xx x.xx
2 x.xx x.xxx x.xx x.xx x.xx
3 x.xx x.xxx x.xx x.xx x.xx
Weight 1 x.xx x.xxx x.xx x.xx x.xx
2 x.xx x.xxx x.xx x.xx x.xx
3 x.xx x.xxx x.xx x.xx x.xx
I can do that in dplyr 0.8.x. I can do it generically, with a function that can handle arbitrary grouping variables with arbitrary numbers of levels and arbitrary statistics summarising arbitrary numbers of variables with arbitrary names. I get that level of flexibility by making my data tidy. That's not what this question is about.
First, some toy data:
set.seed(123456)
toy <- tibble(
Group=rep(1:3, each=5),
Height=1.65 + rnorm(15, 0, 0.1),
Weight= 75 + rnorm(15, 0, 10)
) %>%
pivot_longer(
values_to="Value",
names_to="Parameter",
cols=c(Height, Weight)
)
Now, a simple summary function, and a helper:
quibble2 <- function(x, q = c(0.25, 0.5, 0.75)) {
tibble(Value := quantile(x, q), "Quantile" := q)
}
mySummary <- function(data, ...) {
data %>%
group_by(Parameter, Group) %>%
summarise(..., .groups="drop")
}
So I can say things like
summary <- mySummary(toy, Q=quibble2(Value), Mean=mean(Value, na.rm=TRUE), SD=sd(Value, na.rm=TRUE))
summary %>% head()
Giving
# A tibble: 6 x 5
Parameter Group Q$Value $Quantile Mean SD
<chr> <int> <dbl> <dbl> <dbl> <dbl>
1 Height 1 1.45 0.25 1.54 0.141
2 Height 1 1.49 0.5 1.54 0.141
3 Height 1 1.59 0.75 1.54 0.141
4 Height 2 1.64 0.25 1.66 0.0649
5 Height 2 1.68 0.5 1.66 0.0649
6 Height 2 1.68 0.75 1.66 0.0649
So that's the summary I need, but it's in long format. And Q is a df-col. It's a tibble:
is_tibble(summary$Q)
[1] TRUE
So pivot_wider doesn't seem to work. I can use nest_by() to get to a one-row-per-group format:
toySummary <- summary %>% nest_by(Group, Mean, SD)
toySummary
# Rowwise: Group, Mean, SD
Group Mean SD data
<int> <dbl> <dbl> <list<tbl_df[,2]>>
1 1 1.54 0.141 [3 × 2]
2 1 78.8 10.2 [3 × 2]
3 2 1.66 0.0649 [3 × 2]
4 2 82.9 9.09 [3 × 2]
5 3 1.63 0.100 [3 × 2]
6 3 71.0 10.8 [3 × 2]
But now the format of the centiles is even more complicated:
> toySummary$data[1]
<list_of<
tbl_df<
Parameter: character
Q :
tbl_df<
Value : double
Quantile: double
>
>
>[1]>
[[1]]
# A tibble: 3 x 2
Parameter Q$Value $Quantile
<chr> <dbl> <dbl>
1 Height 1.45 0.25
2 Height 1.49 0.5
3 Height 1.59 0.75
It looks like a list, so I guess some form of lapply would probably work, but is there a neater, tidy solution that I've not spotted yet? I've discovered several new verbs that I didn't know about whilst researching this question (chop, pack, rowwise(), nest_by and such) but none seem to give me what I want: ideally, a tibble with 6 rows (defined by unique Group and Parameter combinations) and columns for Mean, SD, Q25, Q50 and Q75.
To clarify in response to the first two proposed answers: getting the exact numbers that my toy example generates is less important than finding a generic technique for moving from the df-col(s) that summarise returns in dplyr v1.0.0 to a wide data summary of the general form that my example illustrates.
revised answer
Here is my revised answer. This time, I rewrote your quibble2 function with enframe and pivot_wider so that it returns a one-row tibble with three columns, one per quantile.
This will again lead to a df-col in your summary tibble, and now we can use unpack directly, without using pivot_wider to get the expected outcome.
This should generalize on centiles etc. as well.
library(tidyverse)
set.seed(123456)
toy <- tibble(
Group=rep(1:3, each=5),
Height=1.65 + rnorm(15, 0, 0.1),
Weight= 75 + rnorm(15, 0, 10)
) %>%
pivot_longer(
values_to="Value",
names_to="Parameter",
cols=c(Height, Weight)
)
quibble2 <- function(x, q = c(0.25, 0.5, 0.75)) {
pivot_wider(enframe(quantile(x, q)),
names_from = name,
values_from = value)
}
mySummary <- function(data, ...) {
data %>%
group_by(Parameter, Group) %>%
summarise(..., .groups="drop")
}
summary <- mySummary(toy, Q=quibble2(Value), Mean=mean(Value, na.rm=TRUE), SD=sd(Value, na.rm=TRUE))
summary %>%
unpack(Q)
#> # A tibble: 6 x 7
#> Parameter Group `25%` `50%` `75%` Mean SD
#> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Height 1 1.62 1.66 1.73 1.70 0.108
#> 2 Height 2 1.73 1.77 1.78 1.76 0.105
#> 3 Height 3 1.55 1.64 1.76 1.65 0.109
#> 4 Weight 1 75.6 80.6 84.3 80.0 9.05
#> 5 Weight 2 75.4 76.9 79.6 77.4 7.27
#> 6 Weight 3 70.7 75.2 82.0 76.3 6.94
Created on 2020-06-13 by the reprex package (v0.3.0)
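The question asked for columns named Q25, Q50 and Q75 rather than `25%`, `50%` and `75%`. A variant of the same idea could rename the quantiles before they become a df-col. This is a sketch under assumptions, not a tested drop-in: quibble_wide is a hypothetical helper name, and it relies on tibble::as_tibble_row(), which needs tibble 3.0.0 or later.

```r
library(dplyr)
library(tidyr)
library(tibble)

set.seed(123456)
toy <- tibble(
  Group  = rep(1:3, each = 5),
  Height = 1.65 + rnorm(15, 0, 0.1),
  Weight = 75 + rnorm(15, 0, 10)
) %>%
  pivot_longer(cols = c(Height, Weight),
               names_to = "Parameter", values_to = "Value")

# Hypothetical helper: returns a one-row tibble with columns Q25, Q50, Q75
quibble_wide <- function(x, q = c(0.25, 0.5, 0.75)) {
  vals <- quantile(x, q, names = FALSE, na.rm = TRUE)
  names(vals) <- paste0("Q", q * 100)
  as_tibble_row(vals)
}

wide <- toy %>%
  group_by(Parameter, Group) %>%
  summarise(Mean = mean(Value, na.rm = TRUE),
            SD   = sd(Value, na.rm = TRUE),
            Q    = quibble_wide(Value),   # df-col with one row per group
            .groups = "drop") %>%
  unpack(Q)                               # spread the df-col into plain columns
```

The result should have one row per Parameter and Group combination, with the quantile columns named from whatever probabilities are passed in q.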
Second approach
without changing quibble2, we would need to first call unpack and then pivot_wider. This should scale as well.
library(tidyverse)
set.seed(123456)
toy <- tibble(
Group=rep(1:3, each=5),
Height=1.65 + rnorm(15, 0, 0.1),
Weight= 75 + rnorm(15, 0, 10)
) %>%
pivot_longer(
values_to="Value",
names_to="Parameter",
cols=c(Height, Weight)
)
quibble2 <- function(x, q = c(0.25, 0.5, 0.75)) {
tibble(Value := quantile(x, q), "Quantile" := q)
}
mySummary <- function(data, ...) {
data %>%
group_by(Parameter, Group) %>%
summarise(..., .groups="drop")
}
summary <- mySummary(toy, Q=quibble2(Value), Mean=mean(Value, na.rm=TRUE), SD=sd(Value, na.rm=TRUE))
summary %>%
unpack(Q) %>%
pivot_wider(names_from = Quantile, values_from = Value)
#> # A tibble: 6 x 7
#> Parameter Group Mean SD `0.25` `0.5` `0.75`
#> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Height 1 1.70 0.108 1.62 1.66 1.73
#> 2 Height 2 1.76 0.105 1.73 1.77 1.78
#> 3 Height 3 1.65 0.109 1.55 1.64 1.76
#> 4 Weight 1 80.0 9.05 75.6 80.6 84.3
#> 5 Weight 2 77.4 7.27 75.4 76.9 79.6
#> 6 Weight 3 76.3 6.94 70.7 75.2 82.0
Created on 2020-06-13 by the reprex package (v0.3.0)
generalized approach
I tried to figure out a more general approach by rewriting the mySummary function. It now automatically converts expressions that return a vector or named vector of length greater than one into df-cols, and it wraps other non-scalar results (such as the output of t.test) in list() where necessary.
Then, I defined a function widen which widens the df as much as possible while preserving rows, including calling broom::tidy on supported list-columns.
The approach is not perfect, and could be extended by including unnest_wider in the widen function.
Note that I changed the grouping in the example to be able to use t.test as another example output.
library(tidyverse)
set.seed(123456)
toy <- tibble(
Group=rep(1:3, each=5),
Height=1.65 + rnorm(15, 0, 0.1),
Weight= 75 + rnorm(15, 0, 10)
) %>%
pivot_longer(
values_to="Value",
names_to="Parameter",
cols=c(Height, Weight)
)
# modified summary function
mySummary <- function(data, ...) {
fns <- rlang::enquos(...)
fns <- map(fns, function(x) {
res <- rlang::eval_tidy(x, data = data)
if ( ((is.vector(res) || is.factor(res)) && length(res) == 1) ||
("list" %in% class(res) && is.list(res)) ||
rlang::call_name(rlang::quo_get_expr(x)) == "list") {
x
}
else if ((is.vector(res) || is.factor(res)) && length(res) > 1) {
x_expr <- as.character(list(rlang::quo_get_expr(x)))
x_expr <- paste0(
"pivot_wider(enframe(",
x_expr,
"), names_from = name, values_from = value)"
)
x <- rlang::quo_set_expr(x, str2lang(x_expr))
x
} else {
x_expr <- as.character(list(rlang::quo_get_expr(x)))
x_expr <- paste0("list(", x_expr,")")
x <- rlang::quo_set_expr(x, str2lang(x_expr))
x
}
})
data %>%
group_by(Parameter) %>%
summarise(!!! fns, .groups="drop")
}
# A function to automatically widen the df as much as possible while preserving rows
widen <- function(df) {
df_cols <- names(df)[map_lgl(df, is.data.frame)]
df <- unpack(df, all_of(df_cols), names_sep = "_")
try_tidy <- function(x) {
tryCatch({
broom::tidy(x)
}, error = function(e) {
x
})
}
df <- df %>% rowwise() %>% mutate(across(where(is.list), try_tidy))
ungroup(df)
}
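# if you want to specify function arguments for convenience use purrr::partial
# note: quantile() names this argument `probs`; `q` is not matched and is ignored,
# which is why the output below shows the five default quantiles (0% to 100%)
quantile3 <- partial(quantile, x = , q = c(.25, .5, .75))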
summary <- mySummary(toy,
Q = quantile3(Value),
R = range(Value),
T_test = t.test(Value),
Mean = mean(Value, na.rm=TRUE),
SD = sd(Value, na.rm=TRUE)
)
summary
#> # A tibble: 2 x 6
#> Parameter Q$`0%` $`25%` $`50%` $`75%` $`100%` R$`1` $`2` T_test Mean SD
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <list> <dbl> <dbl>
#> 1 Height 1.54 1.62 1.73 1.77 1.90 1.54 1.90 <htest> 1.70 0.109
#> 2 Weight 67.5 72.9 76.9 83.2 91.7 67.5 91.7 <htest> 77.9 7.40
widen(summary)
#> # A tibble: 2 x 11
#> Parameter `Q_0%` `Q_25%` `Q_50%` `Q_75%` `Q_100%` R_1 R_2 T_test$estimate
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Height 1.54 1.62 1.73 1.77 1.90 1.54 1.90 1.70
#> 2 Weight 67.5 72.9 76.9 83.2 91.7 67.5 91.7 77.9
#> # … with 9 more variables: $statistic <dbl>, $p.value <dbl>, $parameter <dbl>,
#> # $conf.low <dbl>, $conf.high <dbl>, $method <chr>, $alternative <chr>,
#> # Mean <dbl>, SD <dbl>
Created on 2020-06-14 by the reprex package (v0.3.0)
What if you change quibble2 to return a list, and then use unnest_wider?
quibble2 <- function(x, q = c(0.25, 0.5, 0.75)) {
list(quantile(x, q))
}
mySummary(toy, Q=quibble2(Value), Mean=mean(Value, na.rm=TRUE), SD=sd(Value, na.rm=TRUE)) %>%
unnest_wider(Q)
# A tibble: 6 x 7
Parameter Group `25%` `50%` `75%` Mean SD
<chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Height 1 1.62 1.66 1.73 1.70 0.108
2 Height 2 1.73 1.77 1.78 1.76 0.105
3 Height 3 1.55 1.64 1.76 1.65 0.109
4 Weight 1 75.6 80.6 84.3 80.0 9.05
5 Weight 2 75.4 76.9 79.6 77.4 7.27
6 Weight 3 70.7 75.2 82.0 76.3 6.94

Create new column based on regular expression match

Problem
I would like to create a new column for relative standard deviation using the following formula: stdev * 100 / abs(mean). I have over 40 variables, each with its own stdev and mean (so 80 columns). What I would like to do is use regular expressions to calculate the relative standard deviation from the 2 columns (stdev and mean) based on the preceding names. For example, for columns AceticAcid.stdev and AceticAcid.mean, calculate the relative standard deviation to automatically create a new column AceticAcid.rsd. The equation being: AceticAcid.stdev * 100 / abs(AceticAcid.mean).
Example Dataframe
print(df)
AceticAcid.mean AceticAcid.stdev Glucose.mean Glucose.stdev Propanol.mean Propanol.stdev
1 28.75775 0.911130 48.27333 4.4991249 144.4770 38.34122
2 78.83051 10.562110 28.13337 1.2304387 134.6402 31.76264
3 40.89769 17.848381 37.10283 0.2102977 132.0253 33.76568
4 88.30174 11.028700 32.90534 1.6396036 149.7135 21.56639
5 94.04673 9.132295 14.11699 4.7725182 132.7853 15.88455
Desired Output (Don't care about the order of the new columns)
print(df_rsd)
AceticAcid.mean AceticAcid.stdev Glucose.mean Glucose.stdev Propanol.mean Propanol.stdev AceticAcid.rsd Glucose.rsd Propanol.rsd
1 28.75775 0.911130 48.27333 4.4991249 144.4770 38.34122 3.168294 9.3201039 26.53795
2 78.83051 10.562110 28.13337 1.2304387 134.6402 31.76264 13.398504 4.3735921 23.59076
3 40.89769 17.848381 37.10283 0.2102977 132.0253 33.76568 43.641536 0.5667969 25.57515
4 88.30174 11.028700 32.90534 1.6396036 149.7135 21.56639 12.489788 4.9827894 14.40511
5 94.04673 9.132295 14.11699 4.7725182 132.7853 15.88455 9.710380 33.8069175 11.96258
Repetitive Attempt...
I do not want to write these out 40 times (there has to be a nice regex way to achieve this):
df_rsd <- df %>% mutate(AceticAcid.rsd = AceticAcid.stdev * 100 / abs(AceticAcid.mean),
Glucose.rsd = Glucose.stdev * 100 / abs(Glucose.mean),
Propanol.rsd = Propanol.stdev * 100 / abs(Propanol.mean))
Reproducible Data
structure(list(AceticAcid.mean = c(28.7577520124614, 78.8305135443807,
40.89769218117, 88.3017404004931, 94.0467284293845), AceticAcid.stdev = c(0.911129987798631,
10.5621097609401, 17.8483808878809, 11.0287002893165, 9.13229470606893
), Glucose.mean = c(48.2733338139951, 28.1333662476391, 37.1028254181147,
32.9053360782564, 14.1169873066247), Glucose.stdev = c(4.49912485200912,
1.2304386717733, 0.210297667654231, 1.63960359641351, 4.77251824573614
), Propanol.mean = c(144.476965803187, 134.64017030783, 132.025340688415,
149.713488831185, 132.785289955791), Propanol.stdev = c(38.3412187267095,
31.7626409884542, 33.7656808178872, 21.5663894917816, 15.884545892477
)), class = "data.frame", row.names = c(NA, -5L))
We can use split.default to split the dataset into a list of data.frame columns based on removing the suffix part of the column names, then loop over the list with lapply, do the calculation and assign it to new column in 'df'
out <- lapply(split.default(df, sub("\\..*", "", names(df))),
function(x) x[[2]]* 100/abs(x[[1]]))
df[paste0(names(out), ".rsd")] <- out
df
# AceticAcid.mean AceticAcid.stdev Glucose.mean Glucose.stdev Propanol.mean Propanol.stdev AceticAcid.rsd Glucose.rsd Propanol.rsd
#1 28.75775 0.911130 48.27333 4.4991249 144.4770 38.34122 3.168294 9.3201039 26.53795
#2 78.83051 10.562110 28.13337 1.2304387 134.6402 31.76264 13.398504 4.3735921 23.59076
#3 40.89769 17.848381 37.10283 0.2102977 132.0253 33.76568 43.641536 0.5667969 25.57515
#4 88.30174 11.028700 32.90534 1.6396036 149.7135 21.56639 12.489788 4.9827894 14.40511
#5 94.04673 9.132295 14.11699 4.7725182 132.7853 15.88455 9.710380 33.8069175 11.96258
Or with tidyverse
library(purrr)
library(dplyr)
library(stringr)
df %>%
split.default(str_remove(names(.), "\\..*")) %>%
map_dfc(~ .x[[2]] * 100/abs(.x[[1]])) %>%
rename_all(~ str_c(., '.rsd')) %>%
bind_cols(df, .)
An alternative, also with the tidyverse, is to reshape to long format first, compute rsd there, and pivot back to wide.
library(tidyverse)
df_long <- df %>%
mutate(measurement_number=row_number(), .before=1) %>%
pivot_longer(cols=-measurement_number, names_to="var", values_to="value") %>%
separate(var, into=c("var", "indicator")) %>%
pivot_wider(id_cols=c("measurement_number", "var"), names_from = indicator, values_from=value) %>%
mutate(rsd=stdev * 100 / abs(mean)) %>%
arrange(var, measurement_number)
df_long
#> # A tibble: 15 x 5
#> measurement_number var mean stdev rsd
#> <int> <chr> <dbl> <dbl> <dbl>
#> 1 1 AceticAcid 28.8 0.911 3.17
#> 2 2 AceticAcid 78.8 10.6 13.4
#> 3 3 AceticAcid 40.9 17.8 43.6
#> 4 4 AceticAcid 88.3 11.0 12.5
#> 5 5 AceticAcid 94.0 9.13 9.71
#> 6 1 Glucose 48.3 4.50 9.32
#> 7 2 Glucose 28.1 1.23 4.37
#> 8 3 Glucose 37.1 0.210 0.567
#> 9 4 Glucose 32.9 1.64 4.98
#> 10 5 Glucose 14.1 4.77 33.8
#> 11 1 Propanol 144. 38.3 26.5
#> 12 2 Propanol 135. 31.8 23.6
#> 13 3 Propanol 132. 33.8 25.6
#> 14 4 Propanol 150. 21.6 14.4
#> 15 5 Propanol 133. 15.9 12.0
df_wide <- df_long %>%
pivot_wider(id_cols=c("measurement_number"),
names_from = c(var),
values_from = c(mean, stdev, rsd),
names_sep = ".")
df_wide
#> # A tibble: 5 x 10
#> measurement_num~ mean.AceticAcid mean.Glucose mean.Propanol stdev.AceticAcid
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 28.8 48.3 144. 0.911
#> 2 2 78.8 28.1 135. 10.6
#> 3 3 40.9 37.1 132. 17.8
#> 4 4 88.3 32.9 150. 11.0
#> 5 5 94.0 14.1 133. 9.13
#> # ... with 5 more variables: stdev.Glucose <dbl>, stdev.Propanol <dbl>,
#> # rsd.AceticAcid <dbl>, rsd.Glucose <dbl>, rsd.Propanol <dbl>
Created on 2020-05-26 by the reprex package (v0.3.0)
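The split.default answers above pick columns by position (x[[1]] as the mean, x[[2]] as the stdev), so they depend on the mean column coming first within each pair. If the column order is not guaranteed, a name-based loop is one alternative. This is a base R sketch under assumptions: the small data frame reuses the first two rows of the example, with the Glucose columns deliberately swapped to illustrate the point, and prefixes is an illustrative variable name.

```r
df <- data.frame(
  AceticAcid.mean  = c(28.75775, 78.83051),
  AceticAcid.stdev = c(0.911130, 10.562110),
  Glucose.stdev    = c(4.4991249, 1.2304387),  # stdev listed before mean on purpose
  Glucose.mean     = c(48.27333, 28.13337)
)

# Find every variable prefix that has a .mean or .stdev column
stat_cols <- grep("\\.(mean|stdev)$", names(df), value = TRUE)
prefixes  <- unique(sub("\\.(mean|stdev)$", "", stat_cols))

# Look each pair up by name, so column order does not matter
for (p in prefixes) {
  df[[paste0(p, ".rsd")]] <-
    df[[paste0(p, ".stdev")]] * 100 / abs(df[[paste0(p, ".mean")]])
}
```

Indexing by constructed names means a missing partner column produces an error at assignment time instead of a silently wrong ratio.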

Many regressions using tidyverse and broom: Same dependent variable, different independent variables

This link shows how to answer my question in the case where we have the same independent variables, but potentially many different dependent variables: Use broom and tidyverse to run regressions on different dependent variables.
But my question is: how can I apply the same approach (e.g., tidyverse and broom) to run many regressions in the reverse situation, with the same dependent variable but different independent variables? In line with the code in the previous link, something like:
mod = lm(health ~ cbind(sex,income,happiness) + faculty, ds) %>% tidy()
However, this code does not do exactly what I want, and instead, produces:
Call:
lm(formula = income ~ cbind(sex, health) + faculty, data = ds)
Coefficients:
(Intercept) cbind(sex, health)sex
945.049 -47.911
cbind(sex, health)health faculty
2.342 1.869
which is equivalent to:
lm(formula = income ~ sex + health + faculty, data = ds)
Basically you'll need some way to create all the different formulas you want. Here's one way
qq <- expression(sex,income,happiness)
formulae <- lapply(qq, function(v) bquote(health~.(v)+faculty))
# [[1]]
# health ~ sex + faculty
# [[2]]
# health ~ income + faculty
# [[3]]
# health ~ happiness + faculty
Once you have all your formula, you can map them to lm and then to tidy()
library(purrr)
library(broom)
formulae %>% map(~lm(.x, ds)) %>% map_dfr(tidy, .id="model")
# A tibble: 9 x 6
# model term estimate std.error statistic p.value
# <chr> <chr> <dbl> <dbl> <dbl> <dbl>
# 1 1 (Intercept) 19.5 0.504 38.6 1.13e-60
# 2 1 sex 0.755 0.651 1.16 2.49e- 1
# 3 1 faculty -0.00360 0.291 -0.0124 9.90e- 1
# 4 2 (Intercept) 19.8 1.70 11.7 3.18e-20
# 5 2 income -0.000244 0.00162 -0.150 8.81e- 1
# 6 2 faculty 0.143 0.264 0.542 5.89e- 1
# 7 3 (Intercept) 18.4 1.88 9.74 4.79e-16
# 8 3 happiness 0.205 0.299 0.684 4.96e- 1
# 9 3 faculty 0.141 0.262 0.539 5.91e- 1
Using sample data
set.seed(11)
ds <- data.frame(income = rnorm(100, mean=1000,sd=200),
happiness = rnorm(100, mean = 6, sd=1),
health = rnorm(100, mean=20, sd = 3),
sex = c(0,1),
faculty = c(0,1,2,3))
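The formulas can also be built from a character vector of variable names with base R's reformulate(), which some may find easier to read than bquote(). The sketch below is an assumption-laden variant of the answer above, not the author's code: it uses only base R, and tidy_one is a hypothetical helper that stacks the coefficient tables in place of broom::tidy.

```r
set.seed(11)
ds <- data.frame(income    = rnorm(100, mean = 1000, sd = 200),
                 happiness = rnorm(100, mean = 6, sd = 1),
                 health    = rnorm(100, mean = 20, sd = 3),
                 sex       = c(0, 1),
                 faculty   = c(0, 1, 2, 3))

ivs <- c("sex", "income", "happiness")

# One formula per varying predictor: health ~ <iv> + faculty
formulae <- lapply(ivs, function(v) reformulate(c(v, "faculty"), response = "health"))
names(formulae) <- ivs

fits <- lapply(formulae, lm, data = ds)

# Hypothetical helper: coefficient table of one model as a data frame
tidy_one <- function(nm) {
  cf  <- coef(summary(fits[[nm]]))
  out <- data.frame(model = nm, term = rownames(cf), cf, check.names = FALSE)
  rownames(out) <- NULL
  out
}

coefs <- do.call(rbind, lapply(names(fits), tidy_one))
```

Naming the list by predictor means the model column carries the variable name rather than a positional index.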
You could use the combn function to get all combinations of n independent variables and then iterate over them. Let's say n=3 here:
library(tidyverse)
ds <- data.frame(income = rnorm(100, mean=1000,sd=200),
happiness = rnorm(100, mean = 6, sd=1),
health = rnorm(100, mean=20, sd = 3),
sex = c(0,1),
faculty = c(0,1,2,3))
ivs = combn(names(ds)[names(ds)!="income"], 3, simplify=FALSE)
# Or, to get all models with 1 to 4 variables:
# ivs = map(1:4, ~combn(names(ds)[names(ds)!="income"], .x, simplify=FALSE)) %>%
# flatten()
names(ivs) = map(ivs, ~paste(.x, collapse="-"))
models = map(ivs,
~lm(as.formula(paste("income ~", paste(.x, collapse="+"))), data=ds))
map_df(models, broom::tidy, .id="model")
model term estimate std.error statistic p.value
* <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 happiness-health-sex (Intercept) 1086. 201. 5.39 5.00e- 7
2 happiness-health-sex happiness -25.4 21.4 -1.19 2.38e- 1
3 happiness-health-sex health 3.58 6.99 0.512 6.10e- 1
4 happiness-health-sex sex 11.5 41.5 0.277 7.82e- 1
5 happiness-health-faculty (Intercept) 1085. 197. 5.50 3.12e- 7
6 happiness-health-faculty happiness -25.8 20.9 -1.23 2.21e- 1
7 happiness-health-faculty health 3.45 6.98 0.494 6.23e- 1
8 happiness-health-faculty faculty 7.86 18.2 0.432 6.67e- 1
9 happiness-sex-faculty (Intercept) 1153. 141. 8.21 1.04e-12
10 happiness-sex-faculty happiness -25.9 21.4 -1.21 2.28e- 1
11 happiness-sex-faculty sex 3.44 46.2 0.0744 9.41e- 1
12 happiness-sex-faculty faculty 7.40 20.2 0.366 7.15e- 1
13 health-sex-faculty (Intercept) 911. 143. 6.35 7.06e- 9
14 health-sex-faculty health 3.90 7.03 0.554 5.81e- 1
15 health-sex-faculty sex 15.6 45.6 0.343 7.32e- 1
16 health-sex-faculty faculty 7.02 20.4 0.345 7.31e- 1

How to run a paired t-test on different levels of a categorical variable?

I am trying to run a paired t-test on pre- and post-intervention results of three intervention types. I am trying to run the test on each intervention separately using "subset" in the t.test function, but it keeps running the test on the whole sample. I cannot separate the intervention levels manually as this is a large database and I do not have access to the Excel file. Does anyone have any suggestions?
Here is the code I am using:
Treatment (intervention) levels:"Passive" "Pro" "Peer"
"Post" and "Pre" are continuous variables.
t.test(data$Post, data$Pre, paired=T, subset=data$Treatment=="Peer")
t.test(data$Post, data$Pre, paired=T, subset=data$Treatment=="Pro")
t.test(data$Post, data$Pre, paired=T, subset=data$Treatment=="Passive")
There is no subset argument (nor a data argument) for the t.test function when using the default method:
> args(stats:::t.test.default)
function (x, y = NULL, alternative = c("two.sided", "less",
"greater"), mu = 0, paired = FALSE, var.equal = FALSE,
conf.level = 0.95, ...)
You'll have to subset first,
with(subset(data, subset=Treatment=="Peer"),
t.test(Post, Pre, paired=TRUE)
)
There's also an easier way using dplyr and broom...
library(dplyr)
library(broom)
data %>%
group_by(Treatment) %>%
do(tidy(t.test(.$Pre, .$Post, paired=TRUE)))
Reproducible example:
set.seed(123)
data <- tibble(id=1:63, Pre=rnorm(21*3,10,5), Post=rnorm(21*3,13,5),
Treatment=sample(c("Peer","Pro","Passive"), 63, TRUE))
data
# A tibble: 63 x 4
id Pre Post Treatment
<int> <dbl> <dbl> <chr>
1 1 7.20 7.91 Pro
2 2 8.85 7.64 Peer
3 3 17.8 14.5 Peer
4 4 10.4 15.2 Peer
5 5 10.6 13.3 Passive
6 6 18.6 17.6 Passive
7 7 12.3 23.3 Pro
8 8 3.67 10.5 Peer
9 9 6.57 1.45 Pro
10 10 7.77 18.0 Passive
# ... with 53 more rows
Output:
# A tibble: 3 x 9
# Groups: Treatment [3]
Treatment estimate statistic p.value parameter conf.low conf.high method alternative
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
1 Passive -2.41 -1.72 0.107 14 -5.42 0.592 Paired t-~ two.sided
2 Peer -3.61 -2.96 0.00636 27 -6.11 -1.10 Paired t-~ two.sided
3 Pro -1.22 -0.907 0.376 19 -4.03 1.59 Paired t-~ two.sided
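In base R, the subset-first idea can be applied to all treatment levels at once by splitting the data frame on Treatment and running t.test on each piece. This is a minimal sketch, assuming the same simulated columns as the reproducible example; by_trt, tests and pvals are illustrative names.

```r
set.seed(123)
data <- data.frame(id        = 1:63,
                   Pre       = rnorm(63, 10, 5),
                   Post      = rnorm(63, 13, 5),
                   Treatment = sample(c("Peer", "Pro", "Passive"), 63, TRUE))

# One data frame per treatment level
by_trt <- split(data, data$Treatment)

# Paired t-test within each level; rows pair Pre and Post for the same id
tests <- lapply(by_trt, function(d) t.test(d$Post, d$Pre, paired = TRUE))

# Collect the p-values into a named vector, one per treatment
pvals <- sapply(tests, function(tt) tt$p.value)
```

Each element of tests is an ordinary htest object, so tests$Peer prints the same report as the single-level with(subset(...)) call.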

Calculate predicted model results by iterating through variables

I have several models fit to predict an outcome y = x1 + x2 + .....+x22. That's a fair number of predictors and a fair number of models. My customers want to know what's the marginal impact of each X on the estimated y. The models may include splines and interaction terms. I can do this, but it's cumbersome and requires loops or a lot of copy paste, which is slow or error prone. Can I do this better by writing my function differently and/or using purrr or an *apply function? Reproducible example is below. Ideally, I could write one function and apply it to longdata.
## create my fake data.
library(tidyverse)
library (rms)
ltrans<- function(l1){
newvar <- exp(l1)/(exp(l1)+1)
return(newvar)
}
set.seed(123)
mystates <- c("AL","AR","TN")
mydf <- data.frame(idno = seq(1:1500),state = rep(mystates,500))
mydf$x1[mydf$state=='AL'] <- rnorm(500,50,7)
mydf$x1[mydf$state=='AR'] <- rnorm(500,55,8)
mydf$x1[mydf$state=='TN'] <- rnorm(500,48,10)
mydf$x2 <- sample(1:5,500, replace = T)
mydf$x3 <- (abs(rnorm(1500,10,20)))^2
mydf$outcome <- as.numeric(cut2(sample(1:100,1500,replace = T),95))-1
dd<- datadist(mydf)
options(datadist = 'dd')
m1 <- lrm(outcome ~ x1 + x2+ rcs(x3,3), data = mydf)
dothemath <- function(x1 = x1ref,x2 = x2ref,x3 = x3ref) {
ltrans(-2.1802256-0.01114239*x1+0.050319692*x2-0.00079289232* x3+
7.6508189e-10*pmax(x3-7.4686271,0)^3-9.0897627e-10*pmax(x3- 217.97865,0)^3+
1.4389439e-10*pmax(x3-1337.2538,0)^3)}
x1ref <- 51.4
x2ref <- 3
x3ref <- 217.9
dothemath() ## 0.0591
mydf$referent <- dothemath()
mydf$thisobs <- dothemath(x1 = mydf$x1, x2 = mydf$x2, x3 = mydf$x3)
mydf$predicted <- predict(m1,mydf,type = "fitted.ind") ## yes, matches.
mydf$x1_marginaleffect <- dothemath(x1= mydf$x1)/mydf$referent
mydf$x2_marginaleffect <- dothemath(x2 = mydf$x2)/mydf$referent
mydf$x3_marginaleffect <- dothemath(x3 = mydf$x3)/mydf$referent
## can I do this with long data?
longdata <- mydf %>%
select(idno,state,referent,thisobs,x1,x2,x3) %>%
gather(varname,value,x1:x3)
##longdata$marginaleffect <- dothemath(longdata$varname = longdata$value) ## no, this does not work.
## I need to communicate to the function which variable it is evaluating.
longdata$marginaleffect[longdata$varname=="x1"] <- dothemath(x1 = longdata$value[longdata$varname=="x1"])/
longdata$referent[longdata$varname=="x1"]
longdata$marginaleffect[longdata$varname=="x2"] <- dothemath(x2 = longdata$value[longdata$varname=="x2"])/
longdata$referent[longdata$varname=="x2"]
longdata$marginaleffect[longdata$varname=="x3"] <- dothemath(x3 = longdata$value[longdata$varname=="x3"])/
longdata$referent[longdata$varname=="x3"]
testing<- inner_join(longdata[longdata$varname=="x1",c(1,7)],mydf[,c(1,10)])
head(testing) ## yes, both methods work.
Mostly you're just talking about a grouped mutate, with the caveat that dothemath is built such that you need to specify the variable name, which can be done by using do.call or purrr::invoke to call it on a named list of parameters:
longdata <- longdata %>%
group_by(varname) %>%
mutate(marginaleffect = invoke(dothemath, setNames(list(value), varname[1])) / referent)
longdata
#> # A tibble: 4,500 x 7
#> # Groups: varname [3]
#> idno state referent thisobs varname value marginaleffect
#> <int> <fct> <dbl> <dbl> <chr> <dbl> <dbl>
#> 1 1 AL 0.0591 0.0688 x1 46.1 1.06
#> 2 2 AR 0.0591 0.0516 x1 50.2 1.01
#> 3 3 TN 0.0591 0.0727 x1 38.0 1.15
#> 4 4 AL 0.0591 0.0667 x1 48.4 1.03
#> 5 5 AR 0.0591 0.0515 x1 47.1 1.05
#> 6 6 TN 0.0591 0.0484 x1 37.6 1.15
#> 7 7 AL 0.0591 0.0519 x1 60.9 0.905
#> 8 8 AR 0.0591 0.0531 x1 63.2 0.883
#> 9 9 TN 0.0591 0.0780 x1 47.8 1.04
#> 10 10 AL 0.0591 0.0575 x1 50.5 1.01
#> # ... with 4,490 more rows
# the first values look similar
inner_join(longdata[longdata$varname == "x1", c(1,7)], mydf[,c(1,10)])
#> Joining, by = "idno"
#> # A tibble: 1,500 x 3
#> idno marginaleffect x1_marginaleffect
#> <int> <dbl> <dbl>
#> 1 1 1.06 1.06
#> 2 2 1.01 1.01
#> 3 3 1.15 1.15
#> 4 4 1.03 1.03
#> 5 5 1.05 1.05
#> 6 6 1.15 1.15
#> 7 7 0.905 0.905
#> 8 8 0.883 0.883
#> 9 9 1.04 1.04
#> 10 10 1.01 1.01
#> # ... with 1,490 more rows
# check everything is the same
mydf %>%
gather(varname, marginaleffect, x1_marginaleffect:x3_marginaleffect) %>%
select(idno, varname, marginaleffect) %>%
mutate(varname = substr(varname, 1, 2)) %>%
all_equal(select(longdata, idno, varname, marginaleffect))
#> [1] TRUE
It may be easier to reconfigure dothemath to take an additional parameter of the variable name so as to avoid the gymnastics.
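One way to read that suggestion is a thin wrapper that takes the variable name as a string and forwards the value to the matching argument with do.call. The sketch below is an assumption about what such a reconfiguration could look like: dothemath_named is a hypothetical name, the coefficients are copied from the question, and the values in long are made up purely for illustration.

```r
ltrans <- function(l1) exp(l1) / (exp(l1) + 1)

x1ref <- 51.4
x2ref <- 3
x3ref <- 217.9

# Prediction function from the question
dothemath <- function(x1 = x1ref, x2 = x2ref, x3 = x3ref) {
  ltrans(-2.1802256 - 0.01114239 * x1 + 0.050319692 * x2 - 0.00079289232 * x3 +
           7.6508189e-10 * pmax(x3 - 7.4686271, 0)^3 -
           9.0897627e-10 * pmax(x3 - 217.97865, 0)^3 +
           1.4389439e-10 * pmax(x3 - 1337.2538, 0)^3)
}

# Hypothetical wrapper: vary one predictor, chosen by name, others stay at reference
dothemath_named <- function(varname, value) {
  do.call(dothemath, setNames(list(value), varname))
}

# Illustrative long data (values invented for the example)
long <- data.frame(varname = rep(c("x1", "x2", "x3"), each = 2),
                   value   = c(46.1, 60.9, 1, 5, 100, 400))

long$marginaleffect <- NA_real_
for (v in unique(long$varname)) {
  rows <- long$varname == v
  long$marginaleffect[rows] <- dothemath_named(v, long$value[rows]) / dothemath()
}
```

The loop runs once per variable, not once per row, so it stays vectorised within each variable.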
