anova - selecting multiple DVs simultaneously - r

I am trying to run ANOVA on many dependent variables. I have one independent variable, which is my grouping variable (Group). I have about 25 DVs - "TMTG, TMTF, CUE, CSE, TCUE, TCSE, WRS, WMAO, TWRS, TWMAO, JCP, JCPE ....etc". I used the following code for the first three variables and I am getting the desired output. How do I tweak the code to get the output for all 25 variables at the same time, but without naming them? I have another dataset with 100 DVs - I can't write those out!
Here is the data frame:
Group TMTG TMTF CUE CSE WRS
TN 27 33 35.12 13.56 0
TN 32 34 12.90 25.56 0
TN 14 78 11 14.78 0
TN 89 41 98 45.25 0
TL 65 11 18.5 23.89 0
TL 12 78 34.6 41.85 0
TL 11 20 35.5 45.5 0
TL 27 25 11.28 55.69 0
Here is the code:
mydataframe
manova_1 <-
manova(cbind(TMTG, TMTF, CUE) ~ as.factor(Group), data = mydataframe)
manova_1
summary.aov(manova_1)
Here is the output
Response TMTG :
Df Sum Sq Mean Sq F value Pr(>F)
as.factor(Group) 1 0.535 0.5351 0.1683 0.6858
Residuals 21 66.769 3.1795
Response TMTF :
Df Sum Sq Mean Sq F value Pr(>F)
as.factor(Group) 1 0.02 0.016 5e-04 0.9831
Residuals 21 749.13 35.673
Response CUE :
Df Sum Sq Mean Sq F value Pr(>F)
as.factor(Group) 1 14.7 14.75 0.0372 0.8489
Residuals 21 8325.7 396.46
I want to tweak this line:
manova(cbind(TMTG, TMTF, CUE) ~ as.factor(Group), data = mydataframe)
so that cbind can take in all the columns without me having to write them out. I tried cbind(2:24) but it's not working! Any help would be appreciated!!!

Assuming 1) Group is the first variable in mydataframe, and 2) you want to do a manova as opposed to a number of separate anovas, you could replace the line:
manova(cbind(TMTG, TMTF, CUE) ~ as.factor(Group), data = mydataframe)
with:
manova(as.matrix(mydataframe[, -1]) ~ as.factor(Group), data = mydataframe)
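A minimal sketch of the same idea on made-up data, in case it helps to see the whole flow. The data frame contents, dv_names, dv_matrix, manova_all and p_values are illustrative names, not from the question. It assumes every column other than Group is numeric; if any is not, as.matrix() returns a character matrix and manova() will fail.

```r
# Hypothetical stand-in for mydataframe: Group first, numeric DVs after it
set.seed(42)
mydataframe <- data.frame(
  Group = rep(c("TN", "TL"), each = 10),
  TMTG  = rnorm(20, mean = 30, sd = 10),
  TMTF  = rnorm(20, mean = 40, sd = 15),
  CUE   = rnorm(20, mean = 25, sd = 8)
)

# Pick the DVs by name so the position of Group does not matter
dv_names  <- setdiff(names(mydataframe), "Group")
dv_matrix <- as.matrix(mydataframe[, dv_names])

# One call covers every DV column in the matrix
manova_all <- manova(dv_matrix ~ as.factor(Group), data = mydataframe)

# summary.aov() returns one univariate ANOVA table per DV
univariate <- summary.aov(manova_all)
univariate

# Optional: collect the Group p-value of each DV into one vector
p_values <- sapply(univariate, function(tab) tab[["Pr(>F)"]][1])
p_values
```

With 100 DVs, the p_values vector is usually easier to scan than 100 printed tables.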

Obtaining Predictions for New Observations (R Programming Language)

I am working with the R programming language. I created a decision tree for this dataset in R (to predict whether the "diabetes" column is either "pos" or "neg"):
#load libraries
library(pdp)
library(C50)
#load data
data(pima)
#remove na's
new_data = na.omit(pima)
#format data
new_data$age = as.factor(ifelse(new_data$age >30, "1", "0"))
new_data$pregnant = as.factor(ifelse(new_data$pregnant >2, "1", "0"))
#run model
tree_mod <- C5.0(x = new_data[, 1:8], rules = TRUE, y = new_data$diabetes)
Here is my question: I am trying to obtain a column of "predictions" made by the model for new observations. I then want to take this column and append it to the original dataset.
Using the following link, https://cran.r-project.org/web/packages/C50/vignettes/C5.0.html, I used the "predict" function:
#pretend this is new data
new = new_data[1:10,]
#run predictions
pred = predict(tree_mod, newdata = new[, 1:8])
But this produces the following error:
Error in x[j] : invalid subscript type 'closure'
Can anyone please show me how to do this?
I am trying to create something like this ("prediction_made_by_model"):
pregnant glucose pressure triceps insulin mass pedigree age diabetes prediction_made_by_model
4 0 89 66 23 94 28.1 0.167 0 neg pos
5 0 137 40 35 168 43.1 2.288 1 pos neg
7 1 78 50 32 88 31.0 0.248 0 pos neg
9 0 197 70 45 543 30.5 0.158 1 pos pos
14 0 189 60 23 846 30.1 0.398 1 pos neg
15 1 166 72 19 175 25.8 0.587 1 pos pos
Thanks!
I was able to figure it out. For some reason, this was not working before:
pred = predict(tree_mod, newdata = new[, 1:8])
new$prediction_made_by_model = pred

Writing a function to compare differences of a series of numeric variables

I am working on a problem set and absolutely cannot figure this one out. I think I've fried my brain to the point where it doesn't even make sense anymore.
Here is a look at the data ...
sex age chol tg ht wt sbp dbp vldl hdl ldl bmi
<chr> <int> <int> <int> <dbl> <dbl> <int> <int> <int> <int> <int> <dbl>
1 M 60 137 50 68.2 112. 110 70 10 53 74 2.40
2 M 26 154 202 82.8 185. 88 64 34 31 92 2.70
3 M 33 198 108 64.2 147 120 80 22 34 132 3.56
4 F 27 154 47 63.2 129 110 76 9 57 88 3.22
5 M 36 212 79 67.5 176. 130 100 16 37 159 3.87
6 F 31 197 90 64.5 121 122 78 18 58 111 2.91
7 M 28 178 163 66.5 167 118 68 19 30 135 3.78
8 F 28 146 60 63 105. 120 80 12 46 88 2.64
9 F 25 231 165 64 126 130 72 23 70 137 3.08
10 M 22 163 30 68.8 173 112 70 6 50 107 3.66
# … with 182 more rows
I must write a function, myTtest, to perform the following task:
Perform two-sample t-tests to compare a series of numeric variables between the levels of a classification variable
The first argument, dat, is a data frame
The second argument, classVar, is a character vector of length 1. It is the name of the classification variable, such as 'sex.'
The third argument, numVar, is a character vector that contains the name of the numeric variables, such as c("age", "chol", "tg"). This means I need to perform three t-tests to compare the difference of those between males and females.
The function should return a data frame with the following variables: Varname, F.mean, M.mean, t (for t-statistics), df (for degrees of freedom), and p (for p-value).
I should be able to run this ...
myTtest(dat = chol, classVar = "sex", numVar = c("age", "chol", "tg"))
... and then get the data frame to appear.
Any help is greatly appreciated. I am pulling my hair out over this one! As well, as noted in my comment below, this has to be done without Tidyverse ... which is why I'm having so much trouble to begin with.
The intuition for this solution is that you can loop over your dependent variables, and call t.test() in each loop. Then save the results from each DV and stack them together in one big data frame.
I'll leave out some bits for you to fill in, but here's the gist:
First, some example data:
set.seed(123)
n <- 20
grp <- sample(c("m", "f"), n, replace = TRUE)
df <- data.frame(grp = grp, age = rnorm(n), chol = rnorm(n), tg = rnorm(n))
df
grp age chol tg
1 m 1.2240818 0.42646422 0.25331851
2 m 0.3598138 -0.29507148 -0.02854676
3 m 0.4007715 0.89512566 -0.04287046
4 f 0.1106827 0.87813349 1.36860228
5 m -0.5558411 0.82158108 -0.22577099
6 f 1.7869131 0.68864025 1.51647060
7 f 0.4978505 0.55391765 -1.54875280
8 f -1.9666172 -0.06191171 0.58461375
9 m 0.7013559 -0.30596266 0.12385424
10 m -0.4727914 -0.38047100 0.21594157
Now make a container that each of the model outputs will go into:
fits_df <- data.frame()
Loop over each DV and append the model output to fits_df each time with rbind:
for (dv in c("age", "chol", "tg")) {
  frml <- as.formula(paste0(dv, " ~ grp")) # make a model formula: dv ~ grp
  fit <- t.test(frml, alternative = "two.sided", data = df) # perform the t-test
  # hint: use str(fit) to figure out how to pull out each value you care about
  fit_df <- data.frame(
    dv = dv,
    f_mean = xxx, # each xxx is a value left for you to fill in
    m_mean = xxx,
    t = xxx,
    df = xxx,
    p = xxx
  )
  fits_df <- rbind(fits_df, fit_df)
}
Your output will look like this:
fits_df
dv f_mean m_mean t df p
1 age -0.18558068 -0.04446755 -0.297 15.679 0.7704954
2 chol 0.07731514 0.22158672 -0.375 17.828 0.7119400
3 tg 0.09349567 0.23693052 -0.345 14.284 0.7352112
One note: When you're pulling out values from fit, you may get odd row names in your output data frame. This is due to the names property of the various fit attributes. You can get rid of these by using as.numeric() or as.character() wrappers around the values you pull from fit (for example, fit$statistic can be cleaned up with as.character(round(fit$statistic, 3))).
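For reference, here is one way the pieces above could be assembled into the myTtest function from the question. Treat it as a sketch under assumptions rather than the expected solution: chol_demo is made-up data, the class variable is assumed to have exactly two levels that sort as F then M, and t.test() is left at its default (Welch, unequal variances).

```r
myTtest <- function(dat, classVar, numVar) {
  # One t-test per numeric variable; each result becomes a one-row data frame
  rows <- lapply(numVar, function(v) {
    frml <- as.formula(paste(v, "~", classVar))
    fit  <- t.test(frml, data = dat)
    data.frame(
      Varname = v,
      F.mean  = unname(fit$estimate[1]), # first level (assumed "F")
      M.mean  = unname(fit$estimate[2]), # second level (assumed "M")
      t       = unname(fit$statistic),
      df      = unname(fit$parameter),
      p       = fit$p.value
    )
  })
  # Stack the one-row data frames into the final table
  do.call(rbind, rows)
}

# Made-up data with the same column names as the question
set.seed(123)
chol_demo <- data.frame(
  sex  = sample(c("F", "M"), 40, replace = TRUE),
  age  = rnorm(40, mean = 35, sd = 8),
  chol = rnorm(40, mean = 180, sd = 30),
  tg   = rnorm(40, mean = 100, sd = 40)
)

result <- myTtest(dat = chol_demo, classVar = "sex",
                  numVar = c("age", "chol", "tg"))
result
```

Using lapply() plus do.call(rbind, ...) avoids growing a data frame inside a loop, and unname() prevents the odd row names mentioned above.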

Cox proportional hazard model

I am trying to run Cox proportional hazard model on a data of 4 groups.
Here's the data:
I am using this code:
time_Allo_NHL<- c(28,32,49,84,357,933,1078,1183,1560,2114,2144)
censor_Allo_NHL<- c(rep(1,5), rep(0,6))
time_Auto_NHL<- c(42,53,57,63,81,140,176,210,252,476,524,1037)
censor_Auto_NHL<- c(rep(1,7), rep(0,1), rep(1,1), rep(0,1), rep(1,1), rep(0,1))
time_Allo_HOD<- c(2,4,72,77,79)
censor_Allo_HOD<- c(rep(1,5))
time_Auto_HOD<- c(30,36,41,52,62,108,132,180,307,406,446,484,748,1290,1345)
censor_Auto_HOD<- c(rep(1,7), rep(0,8))
myData <- data.frame(time=c(time_Allo_NHL, time_Auto_NHL, time_Allo_HOD, time_Auto_HOD),
censor=c(censor_Allo_NHL, censor_Auto_NHL, censor_Allo_HOD, censor_Auto_HOD),
group= rep(1:4,), each= )
str(myData)
The problem is that each group has a different number of observations. What should I modify in this code:
myData <- data.frame(time=c(time_Allo_NHL, time_Auto_NHL, time_Allo_HOD, time_Auto_HOD),
censor=c(censor_Allo_NHL, censor_Auto_NHL, censor_Allo_HOD,
censor_Auto_HOD), group= rep(1:4,), each= )
What should I write instead of each=# so that the code runs properly and I can go on to fit the Cox proportional hazard model?
Then I have attempted to run a Cox proportional hazard model using the following code:
library(survival)
for(i in 1:43){
if (myData$group[i]==2)
myData$Z1[i]<-1
else myData$Z1[i]<-0
}
for(i in 1:43){
if (myData$group[i]==3)
myData$Z2[i]<-1
else myData$Z2[i]<-0
}
for(i in 1:43){
if (myData$group[i]==4)
myData$Z3[i]<-1
else myData$Z3[i]<-0
}
myData
Coxfit<-coxph(Surv(time,censor)~Z1+Z2+Z3, data = myData)
summary(Coxfit)
This is all I got. There are no values!!
Next, I want to test for an interaction between type of transplant and disease type using main effects and interaction terms.
The code I'm going to use:
n<-length(myData$time)
n
for (i in 1:n){
if (myData$(here?)[i]==2)
myData$W1[i] <-1
else myData$W1[i]<-0
}
for (i in 1:n){
if (myData$(here?)[i]==2)
myData$W2[i] <-1
else myData$W2[i]<-0
}
myData
Coxfit.W<-coxph(Surv(time,censor)~W1+W2+W1*W2, data = myData)
summary(Coxfit.W)
I'm not sure what should be written in place of (here?) in myData$(here?) in the code above.
This looks like the bone marrow transplant study at Ohio State University.
As you mentioned, the groups have different numbers of observations. I would consider binding the rows from each subgroup together in the end.
First, I would create a data frame for each group, with a column indicating which group its rows belong to. For example, in df_Allo_NHL every observation has Allo NHL for group:
df_Allo_NHL <- data.frame(group = "Allo NHL",
time = c(28,32,49,84,357,933,1078,1183,1560,2114,2144),
censor = c(rep(1,5), rep(0,6)))
Or just adding to the 2 vectors you have already:
df_Allo_NHL <- data.frame(group = "Allo NHL", time = time_Allo_NHL, censor = censor_Allo_NHL)
Then once you have your 4 data frames, you can combine them. One way to do this is by using Reduce and putting all your data frames in a list. The final result should be ready for cox proportional hazards analysis, in long form, and you will have group available to include. (Edit: Z1 and Z2 added from table for model.)
time_Allo_NHL<- c(28,32,49,84,357,933,1078,1183,1560,2114,2144)
censor_Allo_NHL<- c(rep(1,5), rep(0,6))
df_Allo_NHL <- data.frame(group = "Allo NHL",
time = time_Allo_NHL,
censor = censor_Allo_NHL,
Z1 = c(90,30,40,60,70,90,100,90,80,80,90),
Z2 = c(24,7,8,10,42,9,16,16,20,27,5))
time_Auto_NHL<- c(42,53,57,63,81,140,176,210,252,476,524,1037)
censor_Auto_NHL<- c(rep(1,7), rep(0,1), rep(1,1), rep(0,1), rep(1,1), rep(0,1))
df_Auto_NHL <- data.frame(group = "Auto NHL",
time = time_Auto_NHL,
censor = censor_Auto_NHL,
Z1 = c(80,90,30,60,50,100,80,90,90,90,90,90),
Z2 = c(19,17,9,13,12,11,38,16,21,24,39,84))
time_Allo_HOD<- c(2,4,72,77,79)
censor_Allo_HOD<- c(rep(1,5))
df_Allo_HOD <- data.frame(group = "Allo HOD",
time = time_Allo_HOD,
censor = censor_Allo_HOD,
Z1 = c(20,50,80,60,70),
Z2 = c(34,28,59,102,71))
time_Auto_HOD<- c(30,36,41,52,62,108,132,180,307,406,446,484,748,1290,1345)
censor_Auto_HOD<- c(rep(1,7), rep(0,8))
df_Auto_HOD <- data.frame(group = "Auto HOD",
time = time_Auto_HOD,
censor = censor_Auto_HOD,
Z1 = c(90,80,70,60,90,70,60,100,100,100,100,90,90,90,80),
Z2 = c(73,61,34,18,40,65,17,61,24,48,52,84,171,20,98))
myData <- Reduce(rbind, list(df_Allo_NHL, df_Auto_NHL, df_Allo_HOD, df_Auto_HOD))
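If you would rather keep the numeric group codes from the question, a possible alternative is rep() with its times argument, which takes one repeat count per value, so unequal group sizes are not a problem. The sketch below also builds the Z1, Z2, Z3 indicator columns from the question without loops; group_sizes is an illustrative name.

```r
time_Allo_NHL <- c(28, 32, 49, 84, 357, 933, 1078, 1183, 1560, 2114, 2144)
time_Auto_NHL <- c(42, 53, 57, 63, 81, 140, 176, 210, 252, 476, 524, 1037)
time_Allo_HOD <- c(2, 4, 72, 77, 79)
time_Auto_HOD <- c(30, 36, 41, 52, 62, 108, 132, 180, 307, 406, 446, 484,
                   748, 1290, 1345)

# One count per group, taken from the data instead of typed by hand
group_sizes <- c(length(time_Allo_NHL), length(time_Auto_NHL),
                 length(time_Allo_HOD), length(time_Auto_HOD))

myData <- data.frame(
  time  = c(time_Allo_NHL, time_Auto_NHL, time_Allo_HOD, time_Auto_HOD),
  group = rep(1:4, times = group_sizes) # 11 ones, 12 twos, 5 threes, 15 fours
)

# Vectorised 0/1 indicators, same coding as the for loops in the question
myData$Z1 <- as.integer(myData$group == 2)
myData$Z2 <- as.integer(myData$group == 3)
myData$Z3 <- as.integer(myData$group == 4)

table(myData$group)
```

The censor column can be added to data.frame() in the same way as time.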
Edit
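If you also add Z1 (Karnofsky score) and Z2 (waiting time from diagnosis to transplant), you can fit the Cox proportional hazards model as shown below. Here group is a factor whose first level, Allo NHL, is the reference category by default. (This relies on data.frame() converting strings to factors, which was the default before R 4.0; in R 4.0 and later, convert group with factor() and explicit levels to get the same reference category.)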
library(survival)
Coxfit<-coxph(Surv(time,censor)~group+Z1+Z2, data = myData)
summary(Coxfit)
Output
Call:
coxph(formula = Surv(time, censor) ~ group + Z1 + Z2, data = myData)
n= 43, number of events= 26
coef exp(coef) se(coef) z Pr(>|z|)
groupAuto NHL 0.77357 2.16748 0.58631 1.319 0.18704
groupAllo HOD 2.73673 15.43639 0.94081 2.909 0.00363 **
groupAuto HOD 1.06293 2.89485 0.63494 1.674 0.09412 .
Z1 -0.05052 0.95074 0.01222 -4.135 3.55e-05 ***
Z2 -0.01660 0.98354 0.01002 -1.656 0.09769 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
exp(coef) exp(-coef) lower .95 upper .95
groupAuto NHL 2.1675 0.46136 0.6869 6.8395
groupAllo HOD 15.4364 0.06478 2.4419 97.5818
groupAuto HOD 2.8948 0.34544 0.8340 10.0481
Z1 0.9507 1.05181 0.9282 0.9738
Z2 0.9835 1.01674 0.9644 1.0030
Concordance= 0.783 (se = 0.059 )
Likelihood ratio test= 32.48 on 5 df, p=5e-06
Wald test = 28.48 on 5 df, p=3e-05
Score (logrank) test = 39.45 on 5 df, p=2e-07
Data
group time censor Z1 Z2
1 Allo NHL 28 1 90 24
2 Allo NHL 32 1 30 7
3 Allo NHL 49 1 40 8
4 Allo NHL 84 1 60 10
5 Allo NHL 357 1 70 42
6 Allo NHL 933 0 90 9
7 Allo NHL 1078 0 100 16
8 Allo NHL 1183 0 90 16
9 Allo NHL 1560 0 80 20
10 Allo NHL 2114 0 80 27
11 Allo NHL 2144 0 90 5
12 Auto NHL 42 1 80 19
13 Auto NHL 53 1 90 17
14 Auto NHL 57 1 30 9
15 Auto NHL 63 1 60 13
16 Auto NHL 81 1 50 12
17 Auto NHL 140 1 100 11
18 Auto NHL 176 1 80 38
19 Auto NHL 210 0 90 16
20 Auto NHL 252 1 90 21
21 Auto NHL 476 0 90 24
22 Auto NHL 524 1 90 39
23 Auto NHL 1037 0 90 84
24 Allo HOD 2 1 20 34
25 Allo HOD 4 1 50 28
26 Allo HOD 72 1 80 59
27 Allo HOD 77 1 60 102
28 Allo HOD 79 1 70 71
29 Auto HOD 30 1 90 73
30 Auto HOD 36 1 80 61
31 Auto HOD 41 1 70 34
32 Auto HOD 52 1 60 18
33 Auto HOD 62 1 90 40
34 Auto HOD 108 1 70 65
35 Auto HOD 132 1 60 17
36 Auto HOD 180 0 100 61
37 Auto HOD 307 0 100 24
38 Auto HOD 406 0 100 48
39 Auto HOD 446 0 100 52
40 Auto HOD 484 0 90 84
41 Auto HOD 748 0 90 171
42 Auto HOD 1290 0 90 20
43 Auto HOD 1345 0 80 98
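On the interaction part of the question (the myData$(here?) placeholder): one possible reading is that W1 should flag the transplant type and W2 the disease type, both derived from group. The sketch below assumes the question's numbering (1 = Allo NHL, 2 = Auto NHL, 3 = Allo HOD, 4 = Auto HOD) and codes W1 = 1 for Auto and W2 = 1 for HOD; the choice of which level gets 1 is arbitrary.

```r
library(survival)

time <- c(28, 32, 49, 84, 357, 933, 1078, 1183, 1560, 2114, 2144,   # Allo NHL
          42, 53, 57, 63, 81, 140, 176, 210, 252, 476, 524, 1037,   # Auto NHL
          2, 4, 72, 77, 79,                                         # Allo HOD
          30, 36, 41, 52, 62, 108, 132, 180, 307, 406, 446, 484,
          748, 1290, 1345)                                          # Auto HOD
censor <- c(rep(1, 5), rep(0, 6),
            rep(1, 7), 0, 1, 0, 1, 0,
            rep(1, 5),
            rep(1, 7), rep(0, 8))
group <- rep(1:4, times = c(11, 12, 5, 15))

myData <- data.frame(time, censor, group)

# Main-effect indicators derived from the four-level group code
myData$W1 <- as.integer(myData$group %in% c(2, 4)) # 1 = Auto, 0 = Allo
myData$W2 <- as.integer(myData$group %in% c(3, 4)) # 1 = HOD,  0 = NHL

# W1 * W2 expands to W1 + W2 + W1:W2 (both main effects plus the interaction)
Coxfit.W <- coxph(Surv(time, censor) ~ W1 * W2, data = myData)
summary(Coxfit.W)
```

The W1:W2 row of the summary is the test of the interaction between transplant type and disease type.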

Several Grubbs tests simultaneously in R

I'm new to R and am just starting with the outliers package. This is probably very easy, but could anybody tell me how to run several Grubbs tests at the same time? I have 20 columns and I want to test all of them simultaneously.
Thanks in advance
Edit: Sorry for not explaining well. I'll try. I started using R today and I learned how to make Grubbs test using grubbs.test(data$S1, type=10 or 11 or 20) and it goes well. But I have a table with 20 columns, and I want to run Grubbs test for each of them simultaneously. I can do it one by one, but I think there must be a way to do it faster.
I ran the code from "How to repeat the Grubbs test and flag the outliers" as well, and it works perfectly, but again, I would like to do it with my 20 samples.
As an example of my data:
S1 S2 S3 S4 S5 S6 S7
96 40 99 45 12 16 48
52 49 11 49 59 77 64
18 43 11 67 6 97 91
79 19 39 28 45 44 99
9 78 88 6 25 43 78
60 12 29 32 2 68 25
18 61 60 30 26 51 70
96 98 55 74 83 17 69
19 0 17 24 0 75 45
42 70 71 7 61 82 100
39 80 71 58 6 100 94
100 5 41 18 33 98 97
Hope this helps.
You can use lapply:
library(outliers)
df = data.frame(a=runif(20),b=runif(20),c=runif(20))
tests = lapply(df,grubbs.test)
# or with parameters:
tests = lapply(df,grubbs.test,opposite=TRUE)
Results:
> tests
$a
Grubbs test for one outlier
data: X[[i]]
G = 1.80680, U = 0.81914, p-value = 0.6158
alternative hypothesis: highest value 0.963759744539857 is an outlier
$b
Grubbs test for one outlier
data: X[[i]]
G = 1.53140, U = 0.87008, p-value = 1
alternative hypothesis: highest value 0.975481075001881 is an outlier
$c
Grubbs test for one outlier
data: X[[i]]
G = 1.57910, U = 0.86186, p-value = 1
alternative hypothesis: lowest value 0.0136249314527959 is an outlier
You can access the results as follows:
> tests$a$statistic
G U
1.8067906 0.8191417
Hope this helps.
@Florian's answer can be extended a bit. For example, a tidier and easier-to-read result can be produced with the purrr package and the tidyverse. This can be useful if you are comparing many groups:
Load necessary packages:
library(dplyr)
library(purrr)
library(tidyr)
library(outliers)
Create some data - we're going to use the same from Florian's answer, but transformed to a modern tibble and long format:
df <- tibble(a = runif(20),
b = runif(20),
c = runif(20)) %>%
# transform to a long format
tidyr::gather(letter, value)
Then instead of apply functions we can use map and map_dbl from purrr:
df %>%
group_by(letter) %>%
nest() %>%
mutate(n = map_dbl(data, ~ nrow(.x)), # number of entries
G = map(data, ~ grubbs.test(.x$value)$statistic[[1]]), # G statistic
U = map(data, ~ grubbs.test(.x$value)$statistic[[2]]), # U statistic
grubbs = map(data, ~ grubbs.test(.x$value)$alternative), # Alternative hypothesis
p_grubbs = map_dbl(data, ~ grubbs.test(.x$value)$p.value)) %>% # p-value
# Let's make the output more fancy
mutate(G = signif(unlist(G), 3),
U = signif(unlist(U), 3),
grubbs = unlist(grubbs),
p_grubbs = signif(p_grubbs, 3)) %>%
select(-data) %>% # remove temporary column
arrange(p_grubbs)
And the desired output would be this:
# A tibble: 3 x 6
letter n G U grubbs p_grubbs
<chr> <dbl> <dbl> <dbl> <chr> <dbl>
1 c 20 1.68 0.843 lowest value 0.0489965472370386 is an outlier 0.84
2 a 20 1.58 0.862 lowest value 0.0174888013862073 is an outlier 1
3 b 20 1.57 0.863 lowest value 0.0656482006888837 is an outlier 1

Using ddply across numerous variables when calculating descriptive statistics

Here's my data. It shows the amount of fish I found at three different sites.
Selidor.Bay Enlades.Bay Cumphrey.Bay
1 39 29 187
2 70 370 50
3 13 44 52
4 0 65 20
5 43 110 220
6 0 30 266
What I would like to do is create a script to calculate basic statistics for each site.
If I re-arrange the data by stacking it. I.e :
values site
1 29 Selidor.Bay
2 370 Selidor.Bay
3 44 Selidor.Bay
4 65 Enlades.Bay
I'm able to use the following:
data <- ddply(df, c("site"), summarise,
N = length(values),
mean = mean(values),
sd = sd(values),
se = sd / sqrt(N),
sum = sum(values)
)
data
My question is how can I use the script without having to stack my dataframe?
Thanks.
A slight variation on @docendodiscimus' comment:
library(reshape2)
library(dplyr)
DF %>%
melt(variable.name="site") %>%
group_by(site) %>%
summarise_each(funs( n(), mean, sd, se=sd(.)/sqrt(n()), sum ), value)
# site n mean sd se sum
# 1 Selidor.Bay 6 27.5 27.93385 11.40395 165
# 2 Enlades.Bay 6 108.0 131.84688 53.82626 648
# 3 Cumphrey.Bay 6 132.5 104.29909 42.57992 795
melt does what the OP referred to as "stacking" the data.frame. The tidyr package has analogous functions (gather and, in newer versions, pivot_longer).
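If the goal is to skip stacking altogether, a base R sketch is to apply one summary function to each column, since every site is already its own column. describe_site and site_stats are illustrative names; DF is rebuilt here from the table in the question.

```r
DF <- data.frame(
  Selidor.Bay  = c(39, 70, 13, 0, 43, 0),
  Enlades.Bay  = c(29, 370, 44, 65, 110, 30),
  Cumphrey.Bay = c(187, 50, 52, 20, 220, 266)
)

# Summary statistics for one site (one column of counts)
describe_site <- function(values) {
  n <- length(values)
  c(N    = n,
    mean = mean(values),
    sd   = sd(values),
    se   = sd(values) / sqrt(n),
    sum  = sum(values))
}

# sapply() gives statistics in rows and sites in columns; t() flips that
site_stats <- as.data.frame(t(sapply(DF, describe_site)))
site_stats
```

The result has one row per site and the same N, mean, sd, se and sum columns as the ddply version.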
