R: How to run between group t-tests?

I am attempting to run a series of t-tests in R when splitting groups in the same dataset. I have easily been able to group data using group_by and selecting necessary variables. I also understand how to easily run t-tests using the t.test functions but these do not solve the problem of groups.
The data set consists of a group of participants completing an intervention with two different conditions, and varying degrees of load (see below for example).
Participant Condition Load var.1 var.2 var.3
P01 a 1 834.99 0.383 0.342
P01 a 2 917.22 0.342 0.301
P01 a 3 995.24 0.305 0.263
P01 b 1 1074.22 0.276 0.235
P01 b 2 1156.46 0.247 0.208
P01 b 3 871.41 0.307 0.277
P02 a 1 945.10 0.290 0.260
P02 a 2 1010.39 0.272 0.239
P02 a 3 1096.92 0.265 0.234
P02 b 1 1171.91 0.227 0.195
P02 b 2 664.00 0.260 0.191
P02 b 3 711.92 0.238 0.175
P03 a 1 782.02 0.211 0.154
P03 a 2 858.70 0.174 0.134
P03 a 3 915.21 0.154 0.114
P03 b 1 668.22 0.178 0.207
P03 b 2 723.92 0.243 0.186
P03 b 3 788.31 0.209 0.157
I have split groups using:
grouped.my.df <- my.df %>%
group_by(Condition, Load) %>%
select(-var.4, -var.5,-var.6)
I have then tried to run t-tests, but I am not sure how to run them on the groups created within the tbl. Is it better to create vectors for each group (and if so, how), or can I run t-tests directly on the groups created? (The code below is an example of what I want to do; I know it doesn't actually work.)
t.test(group.P01.a.1$var.1, group.P01.b.1$var1)
Any help is appreciated.

You are not applying group_by correctly. It doesn't really do anything the way you use it right now.
You can select a subset of your data set with filter, e.g.:
grouped.a.1 = my.df %>% filter(Condition == "a", Load == 1)
grouped.b.1 = my.df %>% filter(Condition == "b", Load == 1)
and then use that in the t.test:
t.test(grouped.a.1$var.1, grouped.b.1$var.1)
or, because t.test also accepts a formula argument if there are two groups:
t.test(var.1 ~ Condition, my.df %>% filter(Load == 1))
Both test the a condition against the b condition for Load == 1. I assume that the discrimination by participant in your t.test(group.P01.a.1$var.1, group.P01.b.1$var1) line was unintended.
I think I misunderstood your question, and what you want may be something like
my.df %>%
select(-Participant) %>%
group_by(Load) %>%
summarize_at(
vars(-group_cols(), -Condition),
list(p.value = ~ t.test(. ~ Condition)$p.value) )
This will give you the p-values of all two-group t-tests between the two conditions for all values of Load and all variables.
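As a cross-check of the idea above, here is a minimal base R sketch that runs one two-sample t-test per Load level. It rebuilds only var.1 from the example table in the question; the name p_by_load is made up for illustration, and the default Welch test is assumed (not a paired test).

```r
# Rebuild the example data from the question (var.1 only)
my.df <- data.frame(
  Participant = rep(c("P01", "P02", "P03"), each = 6),
  Condition   = rep(rep(c("a", "b"), each = 3), times = 3),
  Load        = rep(1:3, times = 6),
  var.1       = c(834.99, 917.22, 995.24, 1074.22, 1156.46, 871.41,
                  945.10, 1010.39, 1096.92, 1171.91, 664.00, 711.92,
                  782.02, 858.70, 915.21, 668.22, 723.92, 788.31)
)

# Split the rows by Load, then test condition a against b within each piece
p_by_load <- sapply(
  split(my.df, my.df$Load),
  function(d) t.test(var.1 ~ Condition, data = d)$p.value
)

p_by_load  # named vector: one p-value per Load level
```

The same pattern extends to the other variables by changing the left-hand side of the formula.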

Related

create new variables from formulas stored in a list using dplyr

I have a list of formulas which I want to use to create new variables with mutate. For each formula stored in my list, I want to create a new variable. I want to automatically generate one variable for each element in my list. This is my code
library("dplyr")
library("purrr")
library("formula.tools")
t<-10 #just some constant which needs to be included (and found within my pipe)
ut <- list( # my list with the formulas as elements
v1 = V.1 ~ A * B*t,
v2 = V.2 ~ A+B)
data <- tibble(A=rnorm(10),B=runif(10)) %>% ## the dataset
mutate(!!lhs(ut[["v1"]]) := !!rhs(ut[["v1"]]),
!!lhs(ut[["v2"]]) := !!rhs(ut[["v2"]]))
This works fine. However, I do not want to write this out for each element in my list. I want mutate to take each element of the list and apply the formula, i.e. I need some kind of loop. I tried with across, but across requires existing variables.
I tried to wrap it into a function and use map, but this didn't work
by_formula <- function(equation){
!!lhs(equation) := !!rhs(equation)
}
data <- tibble(A=rnorm(10),B=runif(10)) %>%
mutate(map(ut,by_formula))
I would appreciate any hints on how to do this so that I do not need to worry about the length of the list. This should be part of a function where the length of the list depends on the user input.
Here is one way
library(dplyr)
library(purrr)
library(formula.tools)
by_formula <- function(equation){
# //! cur_data_all may get deprecated in favor of pick
# pick(everything()) %>%
cur_data_all() %>%
transmute(!!lhs(equation) := !!rhs(equation) )
}
tibble(A=rnorm(10),B=runif(10)) %>%
mutate(map_dfc(ut, by_formula))
-output
# A tibble: 10 × 4
A B V.1 V.2
<dbl> <dbl> <dbl> <dbl>
1 1.73 0.0770 1.33 1.80
2 -1.46 0.894 -13.0 -0.562
3 -0.620 0.804 -4.99 0.184
4 0.834 0.524 4.37 1.36
5 -0.980 0.00581 -0.0569 -0.974
6 -0.361 0.316 -1.14 -0.0444
7 1.73 0.833 14.4 2.57
8 1.71 0.512 8.74 2.22
9 0.233 0.944 2.20 1.18
10 -0.832 0.474 -3.94 -0.358
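
For comparison, a rough base R sketch of the same idea, without dplyr or formula.tools: loop over the list, take the left-hand side of each formula as the new column name, and evaluate the right-hand side inside the data frame. It assumes each left-hand side is a single bare name, and that constants such as t live in the environment where the formulas were created.

```r
t <- 10  # constant used by the first formula
ut <- list(
  v1 = V.1 ~ A * B * t,
  v2 = V.2 ~ A + B
)

set.seed(1)
data <- data.frame(A = rnorm(10), B = runif(10))

for (f in ut) {
  new_name <- as.character(f[[2]])  # left-hand side, e.g. "V.1"
  # Evaluate the right-hand side with the columns of `data` in scope;
  # anything not found there (here `t`) is looked up in the formula's environment
  data[[new_name]] <- eval(f[[3]], envir = data, enclos = environment(f))
}

head(data)
```

Because the loop runs over whatever is in ut, the number of formulas does not need to be known in advance.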

How to apply function with multiple outputs on each group in R and store results in different columns?

Suppose I am using panel data: for each individual and time, there is an observation of a numerical variable. I want to apply a function to this numerical variable but this function outputs a vector of numbers. I'd like to apply this function over the observations of each individual and store the resulting vector as columns of a new dataframe.
Example:
TICKER OFTIC CNAME ANNDATS_ACT ACTUAL
<chr> <chr> <chr> <date> <dbl>
1 0001 EPE EP ENGR CORP 2019-05-08 -0.15
2 0004 ACSF AMERICAN CAPITAL 2014-08-04 0.29
3 000R CRCM CARECOM 2018-02-27 0.32
4 000V EIGR EIGER 2018-05-11 -0.84
5 000Y RARE ULTRAGENYX 2016-02-25 -1.42
6 000Z BIOC BIOCEPT 2018-03-28 -54
7 0018 EGLT EGALET 2016-03-08 -0.28
8 001A SESN SESEN BIO 2021-03-15 -0.11
9 001C ARGS ARGOS 2017-03-16 -7
10 001J KN KNOWLES 2021-02-04 0.38
For each TICKER, I will consider the time-series implied by ACTUAL and compute the autocorrelation function. I defined the following wrapper to perform the operation:
my_acf <- function(x, lag = NULL){
acf_vec <- acf(x, lag.max = lag, plot = FALSE, na.action = na.contiguous)$acf
acf_vec <- as.vector(acf_vec)[-1]
return(acf_vec)
}
If the desired maximum lag is, say, 3, I'd like to create another dataset in which I have 4 columns: TICKER and the corresponding first 3 autocorrelations of the associated series of ACTUAL observations.
My solution was:
max_lag = 3
autocorrs <- final_sample %>%
group_by(TICKER) %>%
filter(!all(is.na(ACTUAL))) %>%
summarise(rho = my_acf(ACTUAL, lag = max_lag)) %>%
mutate(order = row_number()) %>%
pivot_wider(id_cols = TICKER, values_from = rho, names_from = order, names_prefix = "rho_")
This indeed provides the desired output:
TICKER rho_1 rho_2 rho_3
<chr> <dbl> <dbl> <dbl>
1 0001 0.836 0.676 0.493
2 0004 0.469 -0.224 -0.366
3 000R 0.561 0.579 0.327
4 000V 0.634 0.626 0.604
5 000Y 0.370 0.396 0.117
6 000Z 0.476 0.454 0.382
7 0018 0.382 -0.0170 -0.278
8 001A 0.330 0.316 0.0944
9 001C 0.727 0.590 0.400
10 001J 0.281 -0.308 -0.0343
My question is how can one perform this operation without a pivot_wider and the manual creation of the order column? The summarise verb creates a single column that store the autocorrelations sequentially for each TICKER. Is there a way to force summarize to create different columns for the different output a given function may provide when applied to, let's say, the ACTUAL series?
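One possible approach, sketched under the assumption that dplyr 1.0 or later is available: when the expression inside summarise returns an unnamed one-row data frame, its columns are unpacked into separate output columns. The wrapper my_acf_wide and the toy data below are hypothetical stand-ins, not the original final_sample.

```r
library(dplyr)

# Variant of my_acf that returns a one-row data frame (rho_1, rho_2, ...)
my_acf_wide <- function(x, lag) {
  rho <- acf(x, lag.max = lag, plot = FALSE, na.action = na.contiguous)$acf
  rho <- as.vector(rho)[-1]                    # drop the lag-0 term
  names(rho) <- paste0("rho_", seq_along(rho))
  as.data.frame(as.list(rho))
}

# Toy panel: two tickers with 20 observations each (made-up data)
set.seed(1)
toy <- data.frame(
  TICKER = rep(c("AAA", "BBB"), each = 20),
  ACTUAL = rnorm(40)
)

autocorrs <- toy %>%
  group_by(TICKER) %>%
  summarise(my_acf_wide(ACTUAL, lag = 3))  # unnamed data frame result is unpacked

autocorrs
```

If some series are too short to produce all requested lags, the per-group data frames will have different numbers of columns, so that case would need separate handling.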

R run linear model by group in dataset [duplicate]

My dataset looks like this
df = data.frame(site=c(rep('A',95),rep('B',110),rep('C',250)),
nps_score=c(floor(runif(455, min=0, max=10))),
service_score=c(floor(runif(455, min=0, max=10))),
food_score=c(floor(runif(455, min=0, max=10))),
clean_score=c(floor(runif(455, min=0, max=10))))
I'd like to run a linear model on each group (i.e. for each site), and produce the coefficients for each group in a dataframe, along with the significance levels of each variable.
I am trying to group_by the site variable and then run the model for each site but it doesn't seem to be working. I've looked at some existing solutions on stack overflow but cannot seem to adapt the code to my solution.
# Trying to run this by group, and output the resulting coefficients per site in a separate df with their significance levels.
library(MASS)
summary(ols <- rlm(nps_score ~ ., data = df))
Any help on this would be greatly appreciated
library(tidyverse)
library(broom)
library(MASS)
# We first create a formula object.
# dplyr::select is called explicitly because MASS, loaded last, masks select().
my_formula <- as.formula(paste("nps_score ~ ", paste(df %>% dplyr::select(-site, -nps_score) %>% names(), collapse= "+")))
# Now we can group by site and use the formula object within the pipe.
results <- df %>%
group_by(site) %>%
do(tidy(rlm(formula(my_formula), data = .)))
which gives:
# A tibble: 12 x 5
# Groups: site [3]
site term estimate std.error statistic
<chr> <chr> <dbl> <dbl> <dbl>
1 A (Intercept) 5.16 0.961 5.37
2 A service_score -0.0656 0.110 -0.596
3 A food_score -0.0213 0.102 -0.209
4 A clean_score -0.0588 0.110 -0.536
5 B (Intercept) 2.22 0.852 2.60
6 B service_score 0.221 0.103 2.14
7 B food_score 0.163 0.104 1.56
8 B clean_score -0.0383 0.0928 -0.413
9 C (Intercept) 5.47 0.609 8.97
10 C service_score -0.0367 0.0721 -0.509
11 C food_score -0.0585 0.0724 -0.808
12 C clean_score -0.0922 0.0691 -1.33
Note: I'm not familiar with the rlm function and don't know whether it provides p-values in the first place, but the tidy function at least doesn't return p-values for rlm. If a simple linear regression suits your needs, you could replace rlm with lm, in which case a sixth column with p-values would be added.
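To illustrate the lm variant mentioned in the note, here is a small base R sketch that fits one ordinary linear model per site and collects estimates with p-values. The helper fit_site and the result name results_lm are made up for this example, and the data are regenerated randomly as in the question.

```r
set.seed(1)
df <- data.frame(
  site          = c(rep("A", 95), rep("B", 110), rep("C", 250)),
  nps_score     = floor(runif(455, min = 0, max = 10)),
  service_score = floor(runif(455, min = 0, max = 10)),
  food_score    = floor(runif(455, min = 0, max = 10)),
  clean_score   = floor(runif(455, min = 0, max = 10))
)

# Fit lm on one site's rows and return its coefficient table as a data frame
fit_site <- function(d) {
  fit <- lm(nps_score ~ service_score + food_score + clean_score, data = d)
  tab <- coef(summary(fit))
  data.frame(
    site     = d$site[1],
    term     = rownames(tab),
    estimate = tab[, "Estimate"],
    p.value  = tab[, "Pr(>|t|)"],
    row.names = NULL
  )
}

results_lm <- do.call(rbind, lapply(split(df, df$site), fit_site))
results_lm
```

Each site contributes four rows (intercept plus three predictors), so the combined table has twelve rows.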

Conditional sorting / reordering of column values in R

I have a data set similar to the following with 1 column and 60 rows:
value
1 0.0423
2 0.0388
3 0.0386
4 0.0342
5 0.0296
6 0.0276
7 0.0246
8 0.0239
9 0.0234
10 0.0214
.
40 0.1424
.
60 -0.0312
I want to reorder the rows so that certain conditions are met. For example one condition could be: sum(df$value[4:7]) > 0.1000 & sum(df$value[4:7]) <0.1100
With the data set looking like this for example.
value
1 0.0423
2 0.0388
3 0.0386
4 0.1312
5 -0.0312
6 0.0276
7 0.0246
8 0.0239
9 0.0234
10 0.0214
.
.
.
60 0.0342
What I tried was using repeat and sample as in the following:
repeat{
df1 <- as_tibble(sample(df$value))
if (sum(df1$value[4:7]) > 0.1000 & sum(df1$value[4:7]) < 0.1100) break
}
Unfortunately, this method takes quite some time, and I was wondering if there is a faster way to reorder rows based on mathematical conditions such as sum or prod.
Here's a quick implementation of the hill-climbing method I outlined in my comment. I've had to slightly reframe the desired condition as "distance of sum(x[4:7]) from 0.105" to make it continuous, although you can still use the exact condition when doing the check that all requirements are satisfied. The benefit is that you can add extra conditions to the distance function easily.
# Using same example data as Jon Spring
set.seed(42)
vs = rnorm(60, 0.05, 0.08)
get_distance = function(x) {
distance = abs(sum(x[4:7]) - 0.105)
# Add to the distance with further conditions if needed
distance
}
max_attempts = 10000
best_distance = Inf
swaps_made = 0
for (step in 1:max_attempts) {
# Copy the vector and swap two random values
new_vs = vs
swap_inds = sample.int(length(vs), 2, replace = FALSE)
new_vs[swap_inds] = rev(new_vs[swap_inds])
# Keep the new vector if the distance has improved
new_distance = get_distance(new_vs)
if (new_distance < best_distance) {
vs = new_vs
best_distance = new_distance
swaps_made = swaps_made + 1
}
complete = (sum(vs[4:7]) < 0.11) & (sum(vs[4:7]) > 0.1)
if (complete) {
print(paste0("Solution found in ", step, " steps"))
break
}
}
sum(vs[4:7])
There's no real guarantee that this method will reach a solution, but I often try this kind of basic hill-climbing when I'm not sure if there's a "smart" way to approach a problem.
Here's an approach using combn from base R, and then filtering using dplyr. (I'm sure there's a way w/o it but my base-fu isn't there yet.)
With only 4 numbers from a pool of 60, there are "only" 488k different combinations (ignoring order; =60*59*58*57/4/3/2), so it's quick to brute force in about a second.
# Make a vector of 60 numbers like your example
set.seed(42)
my_nums <- rnorm(60, 0.05, 0.08);
all_combos <- combn(my_nums, 4) # Get all unique combos of 4 numbers
library(tidyverse)
combos_table <- all_combos %>%
t() %>%
as_tibble() %>%
mutate(sum = V1 + V2 + V3 + V4) %>%
filter(sum > 0.1, sum < 0.11)
> combos_table
# A tibble: 8,989 x 5
V1 V2 V3 V4 sum
<dbl> <dbl> <dbl> <dbl> <dbl>
1 0.160 0.00482 0.0791 -0.143 0.100
2 0.160 0.00482 0.101 -0.163 0.103
3 0.160 0.00482 0.0823 -0.145 0.102
4 0.160 0.00482 0.0823 -0.143 0.104
5 0.160 0.00482 -0.0611 -0.00120 0.102
6 0.160 0.00482 -0.0611 0.00129 0.105
7 0.160 0.00482 0.0277 -0.0911 0.101
8 0.160 0.00482 0.0277 -0.0874 0.105
9 0.160 0.00482 0.101 -0.163 0.103
10 0.160 0.00482 0.0273 -0.0911 0.101
# … with 8,979 more rows
This says that in this example, there are about 9000 different sets of 4 numbers from my sequence which meet the criteria. We could pick any of these and put them in positions 4-7 to meet your requirement.
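The last step, placing one qualifying set in positions 4-7, could look roughly like the sketch below. It works on index combinations rather than values so that the chosen elements can be removed from the rest of the vector; the names idx, chosen, rest and reordered are illustrative only.

```r
set.seed(42)
my_nums <- rnorm(60, 0.05, 0.08)

# All combinations of 4 positions (one combination per column)
idx  <- combn(length(my_nums), 4)
sums <- colSums(matrix(my_nums[idx], nrow = 4))

# Take the first combination whose sum falls inside the target window
hit <- which(sums > 0.1 & sums < 0.11)[1]
stopifnot(!is.na(hit))
chosen <- idx[, hit]

# Rebuild the vector: 3 other values, the 4 chosen ones, then everything else
rest <- my_nums[-chosen]
reordered <- c(rest[1:3], my_nums[chosen], rest[4:length(rest)])

sum(reordered[4:7])
```

The values outside positions 4-7 keep their original relative order; any other arrangement of rest would satisfy the condition equally well.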

car::Anova Way to have a covariate that does not interact with the within-subject factors

I would like to run an ANCOVA using car::Anova but cannot find out if there is a way to add a covariate only as a main effect (i.e., should not interact with anything).
As far as I understand ANCOVA, covariates are just another main effect added to the model (i.e., one more effect), thereby controlling for the overall additive influence of this covariate. Consequently, the covariate(s) do not interact with the other factors. However, I cannot add a variable to Anova that does not interact with the within-subject factors (i.e., my final model does not seem to be an ANCOVA).
Let me illustrate my problem with an example from ?Anova. The OBrienKaiser data set has 2 between (treatment and gender) and 2 within (phase and hour) factors. Now let's assume we also recorded the age of the participants and would like to add it as a covariate to the analysis.
require(car)
set.seed(1)
n.OBrienKaiser <- within(OBrienKaiser, age <- sample(18:35, size = 16, replace = TRUE))
# the next part is taken from ?Anova
# I only modified the mod.ok <- ... call by adding + age
phase <- factor(rep(c("pretest", "posttest", "followup"), c(5, 5, 5)), levels=c("pretest", "posttest", "followup"))
hour <- ordered(rep(1:5, 3))
idata <- data.frame(phase, hour)
mod.ok <- lm(cbind(pre.1, pre.2, pre.3, pre.4, pre.5, post.1, post.2, post.3, post.4, post.5,
fup.1, fup.2, fup.3, fup.4, fup.5) ~ treatment*gender + age, data=n.OBrienKaiser)
(av.ok <- Anova(mod.ok, idata=idata, idesign=~phase*hour, type = 3))
As the output shows, the results contain interactions of the covariate age with the within-subject (or repeated-measures) factors phase and hour, and with their interaction phase:hour:
Type III Repeated Measures MANOVA Tests: Pillai test statistic
Df test stat approx F num Df den Df Pr(>F)
(Intercept) 1 0.129 1.33 1 9 0.278
treatment 2 0.443 3.58 2 9 0.072 .
gender 1 0.305 3.95 1 9 0.078 .
age 1 0.054 0.52 1 9 0.490
treatment:gender 2 0.222 1.28 2 9 0.323
phase 1 0.418 2.87 2 8 0.115
treatment:phase 2 0.871 3.47 4 18 0.029 *
gender:phase 1 0.084 0.37 2 8 0.703
age:phase 1 0.393 2.59 2 8 0.136
treatment:gender:phase 2 0.545 1.69 4 18 0.197
hour 1 0.565 1.95 4 6 0.222
treatment:hour 2 0.580 0.72 8 14 0.676
gender:hour 1 0.310 0.68 4 6 0.633
age:hour 1 0.508 1.55 4 6 0.301
treatment:gender:hour 2 0.707 0.96 8 14 0.504
phase:hour 1 0.975 9.56 8 2 0.098 .
treatment:phase:hour 2 1.145 0.50 16 6 0.873
gender:phase:hour 1 0.693 0.56 8 2 0.770
age:phase:hour 1 0.974 9.40 8 2 0.100 .
treatment:gender:phase:hour 2 1.314 0.72 16 6 0.723
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
My question is: Can one run an ANCOVA with car::Anova, and if so, is there a way to specify this ANCOVA without any interaction of age?
Update (July 22, 2012): I asked this question on R-help, but so far there have been no responses. If there is news, I will post it here.
I asked this question on R-help which started a helpful discussion with John Fox (later joined by Peter Dalgaard). Unfortunately it got split up into two threads: one, two.
The punchline is:
"The within-subjects contrasts are constructed by Anova() to be orthogonal in the row-basis of the design, so you should be able to safely ignore the effects in which (for some reason that escapes me) you are uninterested." (John Fox)
So the answer to the question is: No, one can't, but it doesn't matter, because these interactions do not alter the other effects, as they are orthogonal.
