I want to get t-tests between two populations (in or out of the treatment group; 1 or 0 in the sample data below, respectively) across a number of variables, and for different studies, all of which sit in the same dataframe. In the sample data below, I want to generate t-tests for all variables (Age, Dollars, DiseaseCnt) between the 1/0 Treatment groups. I want to run these t-tests by Program rather than across the whole population. I have the logic to generate the t-tests. However, I need assistance with the final step of extracting the appropriate parts from the function and creating something easily digestible.
Ultimately, what I want is: a table of t-stats, p-values, variable that t-test was performed on, and program for which variable was tested.
DT <- data.frame(
  Treated = sample(0:1, 1000, replace = TRUE),
  Program = c('Program A', 'Program B', 'Program C', 'Program D'),
  Age = as.integer(rnorm(1000, mean = 65, sd = 15)),
  Dollars = as.integer(rpois(1000, lambda = 1000)),
  DiseaseCnt = as.integer(rnorm(1000, mean = 5, sd = 2))
)
progs<-unique(DT$Program) # Pull program names
vars<-names(DT)[3:5] # pull variables to run t tests
test<-lapply(progs, function(i)
tt<-lapply(vars, function(j) {t.test( DT[DT$Treated==1 & DT$Program == i,names(DT)==j]
,DT[DT$Treated==0 & DT$Program == i,names(DT)==j]
,alternative = 'two.sided' )
list(j,tt$statistic,tt$p.value) }
) )
# the nested lapply produces results in list format that can be bound together, but the complete output with both lapply calls is erroneous
You should convert it into a data.table first. (In my code I call your original table DF):
DT <- as.data.table(DF)
DT[, t.test(data=.SD, Age ~ Treated), by=Program]
Program statistic parameter p.value conf.int estimate null.value alternative
1: Program A -0.6286875 247.8390 0.5301326 -4.8110579 65.26667 0 two.sided
2: Program A -0.6286875 247.8390 0.5301326 2.4828527 66.43077 0 two.sided
3: Program B 1.4758524 230.5380 0.1413480 -0.9069634 67.15315 0 two.sided
4: Program B 1.4758524 230.5380 0.1413480 6.3211834 64.44604 0 two.sided
5: Program C 0.1994182 246.9302 0.8420998 -3.3560930 63.56557 0 two.sided
6: Program C 0.1994182 246.9302 0.8420998 4.1122406 63.18750 0 two.sided
7: Program D -1.1321569 246.0086 0.2586708 -6.1855837 62.31707 0 two.sided
8: Program D -1.1321569 246.0086 0.2586708 1.6701237 64.57480 0 two.sided
method data.name
1: Welch Two Sample t-test Age by Treated
2: Welch Two Sample t-test Age by Treated
3: Welch Two Sample t-test Age by Treated
4: Welch Two Sample t-test Age by Treated
5: Welch Two Sample t-test Age by Treated
6: Welch Two Sample t-test Age by Treated
7: Welch Two Sample t-test Age by Treated
8: Welch Two Sample t-test Age by Treated
In this format, for each Program the statistic is the same in both rows and equal to t, and the parameter is the degrees of freedom. For conf.int, the rows give the lower then the upper bound (so for Program A the confidence interval is (-4.8110579, 2.4828527)), and for estimate the rows give group 0 then group 1 (so for Program A the mean for Treated == 0 is 65.26667, etc.).
This was the quickest solution I could come up with, and you could loop through vars, or perhaps there's a simpler way.
EDIT: I only confirmed for Program A and for Age, using the following code:
DT[Program == 'Program A', t.test(Age ~ Treated)]
Welch Two Sample t-test
data: Age by Treated
t = -0.62869, df = 247.84, p-value = 0.5301
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-4.811058 2.482853
sample estimates:
mean in group 0 mean in group 1
65.26667 66.43077
EDIT 2: Here is code that loops through your variables and rbinds the results together:
do.call(rbind, lapply(vars, function(x) DT[, t.test(data=.SD, eval(parse(text=x)) ~ Treated), by=Program]))
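If all you need is the t statistic and p-value per variable and Program (the table described in the question), one possible way to trim that looped output down is sketched below; it builds on the line above, and the names tidy_tests and Variable are just illustrative:
tidy_tests <- do.call(rbind, lapply(vars, function(x) {
  res <- DT[, t.test(data = .SD, eval(parse(text = x)) ~ Treated), by = Program]
  res <- unique(res[, .(Program, statistic, p.value)])  # the t.test output spans two rows per Program
  res$Variable <- x                                     # record which variable was tested
  res
}))
tidy_tests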
You can get an equivalent t-test out of a regression (note that regression gives the pooled, equal-variance version, whereas t.test defaults to the Welch test); if you think the effect of treatment differs by program, you should include an interaction. You can also specify multiple responses.
> m <- lm(cbind(Age,Dollars,DiseaseCnt)~Treated * Program - Treated - 1, DT)
> lapply(summary(m), `[[`, "coefficients")
$`Response Age`
Estimate Std. Error t value Pr(>|t|)
ProgramProgram A 63.0875912409 1.294086510 48.7506752932 1.355786133e-265
ProgramProgram B 65.3846153846 1.400330869 46.6922616771 1.207761156e-252
ProgramProgram C 66.0695652174 1.412455172 46.7763979425 3.534894216e-253
ProgramProgram D 66.6691729323 1.313402302 50.7606640010 5.038015651e-278
Treated:ProgramProgram A 2.8593114140 1.924837595 1.4854819032 1.377339219e-01
Treated:ProgramProgram B -0.9786003470 1.919883369 -0.5097186438 6.103619649e-01
Treated:ProgramProgram C -0.5066022544 1.922108032 -0.2635659631 7.921691261e-01
Treated:ProgramProgram D -2.8657541289 1.919883369 -1.4926709484 1.358412980e-01
$`Response Dollars`
Estimate Std. Error t value Pr(>|t|)
ProgramProgram A 998.5474452555 2.681598120 372.3702808887 0.0000000000
ProgramProgram B 997.4188034188 2.901757030 343.7292623810 0.0000000000
ProgramProgram C 1001.6869565217 2.926880936 342.2370019265 0.0000000000
ProgramProgram D 1001.2180451128 2.721624185 367.8752013053 0.0000000000
Treated:ProgramProgram A -0.9899231316 3.988636646 -0.2481858388 0.8040419882
Treated:ProgramProgram B 2.5060086113 3.978370529 0.6299082986 0.5288996396
Treated:ProgramProgram C -5.4721417069 3.982980462 -1.3738811324 0.1697889454
Treated:ProgramProgram D -4.0043698991 3.978370529 -1.0065351806 0.3144036460
$`Response DiseaseCnt`
Estimate Std. Error t value Pr(>|t|)
ProgramProgram A 4.53284671533 0.1793523653 25.27341475576 3.409326912e-109
ProgramProgram B 4.56410256410 0.1940771747 23.51694665775 1.515736580e-97
ProgramProgram C 4.25217391304 0.1957575279 21.72163675698 6.839384262e-86
ProgramProgram D 4.60150375940 0.1820294143 25.27890219412 3.133081901e-109
Treated:ProgramProgram A 0.13087009883 0.2667705543 0.49057175444 6.238378600e-01
Treated:ProgramProgram B -0.02274918064 0.2660839292 -0.08549625944 9.318841210e-01
Treated:ProgramProgram C 0.47375201288 0.2663922537 1.77840010867 7.564438017e-02
Treated:ProgramProgram D -0.31090546880 0.2660839292 -1.16844887901 2.429064705e-01
You specifically care about the Treated:Program entries of the regression table.
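If you want those entries collected into one small table, a possible sketch (reusing the model m and the summary(m) list from above; the column names Response, Term, t and p are just illustrative):
coefs <- lapply(summary(m), `[[`, "coefficients")
do.call(rbind, lapply(names(coefs), function(resp) {
  tab <- coefs[[resp]]
  keep <- grepl("^Treated:", rownames(tab))   # keep only the Treated:Program rows
  data.frame(Response = resp,
             Term = rownames(tab)[keep],
             t = tab[keep, "t value"],
             p = tab[keep, "Pr(>|t|)"],
             row.names = NULL)
}))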
You're getting errors because you're trying to access tt$statistic from within the function that creates tt, plus there are some bracketing problems.
Here's one way to do it, following your version:
results <- lapply(progs, function (i) {
  DS <- subset(DT, Program == i)            # data for this program only
  o <- lapply(vars, function (v) {          # use a different name than the outer 'i'
    frm <- formula(paste0(v, ' ~ Treated'))
    tt <- t.test(frm, data = DS)
    data.frame(Variable = v, T = tt$statistic, P = tt$p.value)
  })
  o <- do.call(rbind, o)
  o$Program <- i
  o
})
do.call(rbind, results)
Or you can do it without the explicit rbind-ing, using (e.g.) ddply (I think the rbinding still happens, just behind the scenes):
library(plyr)
combinations <- expand.grid(Program=progs, Y=vars)
ddply(combinations, .(Program, Y),
function (x) {
# x is a dataframe with the program and variable;
# just do the t-test and add the statistic & p-val to it
frm <- formula(paste0(x$Y, '~ Treated'))
tt <- t.test(frm, subset(DT, Program == x$Program))
x$T <- tt$statistic
x$P <- tt$p.value
x
})
Related
I have the following data (dat):
V W X Y Z
1 8 89 3 900
1 8 100 2 800
0 9 333 4 980
0 9 560 1 999
I wish to perform a Tukey's HSD pairwise test on the above data set.
library(tidyr)       # gather() comes from tidyr
dat1 <- gather(dat)  # convert to long form
pairwise.t.test(dat1$key, dat1$value, p.adj = "holm")
However, every time I try to run it, it keeps running and does not yield an output. Any suggestions on how to correct this?
I would also like to perform the same test using the function TukeyHSD(). However, when I try to use the wide/long format, I run into an error that says
" Error in UseMethod("TukeyHSD") :
no applicable method for 'TukeyHSD' applied to an object of class "data.frame"
We need 'x' to be dat1$value: since the arguments are not named, the first argument is taken as 'x' (the response values) and the second as 'g' (the grouping factor).
pairwise.t.test( dat1$value, dat1$key, p.adj = "holm")
#data: dat1$value and dat1$key
# V W X Y
#W 1.000 - - -
#X 0.018 0.018 - -
#Y 1.000 1.000 0.018 -
#Z 4.1e-08 4.1e-08 2.8e-06 4.1e-08
#P value adjustment method: holm
Or we can name the arguments and supply them in whatever order we want:
pairwise.t.test(g = dat1$key, x= dat1$value, p.adj = "holm")
Regarding TukeyHSD(): it expects a fitted model object (e.g. the result of aov()), not a data frame, which is why you got the "no applicable method" error. Fit the model first:
TukeyHSD(aov(value~key, data = dat1), ordered = TRUE)
#Tukey multiple comparisons of means
# 95% family-wise confidence level
# factor levels have been ordered
#Fit: aov(formula = value ~ key, data = dat1)
#$key
# diff lwr upr p adj
#Y-V 2.00 -233.42378 237.4238 0.9999999
#W-V 8.00 -227.42378 243.4238 0.9999691
#X-V 270.00 34.57622 505.4238 0.0211466
#Z-V 919.25 683.82622 1154.6738 0.0000000
#W-Y 6.00 -229.42378 241.4238 0.9999902
#X-Y 268.00 32.57622 503.4238 0.0222406
#Z-Y 917.25 681.82622 1152.6738 0.0000000
#X-W 262.00 26.57622 497.4238 0.0258644
#Z-W 911.25 675.82622 1146.6738 0.0000000
#Z-X 649.25 413.82622 884.6738 0.0000034
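If you would rather have the TukeyHSD result as a plain data frame, here is a small sketch (same aov() call as above; tk and tk_df are just illustrative names):
tk <- TukeyHSD(aov(value ~ key, data = dat1), ordered = TRUE)
tk_df <- as.data.frame(tk$key)        # columns: diff, lwr, upr, p adj
tk_df$comparison <- rownames(tk_df)   # keep the pair labels (e.g. "Y-V")
tk_df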
Based on the link below, I created code to run a regression on subsets of my data, split by a variable.
Loop linear regression and saving coefficients
In this example I created a DUMMY variable (0 or 1) to define the subsets (in reality I have 3000 subsets).
res <- do.call(rbind, lapply(split(mydata, mydata$DUMMY),function(x){
fit <- lm(y~x1 + x2, data=x)
res <- data.frame(DUMMY=unique(x$DUMMY), coeff=coef(fit))
res
}))
This results in the following dataset
DUMMY coeff
0.(Intercept) 0 22.8419956
0.x1 0 -11.5623064
0.x2 0 2.1006948
1.(Intercept) 1 4.2020874
1.x1 1 -0.4924303
1.x2 1 1.0917668
What I would like, however, is one row per regression, with the variables in the columns. I also need the p-values and standard errors included.
DUMMY interceptx1 coeffx1 p-valuex1 SEx1 coeffx2 p-valuex2 SEx2
0 22.84 -11.56 0.04 0.15 2.10 0.80 0.90
1 4.20 -0.49 0.10 0.60 1.09 0.60 1.20
Any idea how to do this?
While your desired output is (IMHO) not really tidy data, here is an approach using data.table and a custom-built extraction-function. It has an option to return a wide or long form of the results.
The extractor function takes an lm object and returns estimates, p-values and standard errors for all variables.
extractor <- function(model, return_wide = F){
#get datatable with coefficient, se and p-value
model_summary <- as.data.table(summary(model)$coefficients[,-3])
model_summary[,variable:=names(coef(model))]
#do some reshaping
step2 <- melt(model_summary, id.var="variable",variable.name="measure")
if(!return_wide){
return(step2)
}
step3 <- dcast(step2, 1~variable+measure,value.var="value")
return(step3)
}
Demonstration:
res_wide <- dat[,extractor(lm(y~x1 + x2), return_wide = T), by = dummy]
> res_wide
# dummy . (Intercept)_Estimate (Intercept)_Std. Error (Intercept)_Pr(>|t|) x1_Estimate x1_Std. Error x1_Pr(>|t|) x2_Estimate x2_Std. Error x2_Pr(>|t|)
# 1: 0 . 0.04314707 0.04495702 0.3376461 -0.054364406 0.04441204 0.2214895 0.01333804 0.04620999 0.7729757
# 2: 1 . -0.04137086 0.04471550 0.3553164 0.009864255 0.04533808 0.8278539 0.05272257 0.04507189 0.2426726
res_long <- dat[,extractor(lm(y~x1 + x2)), by = dummy]
# dummy variable measure value
# 1: 0 (Intercept) Estimate 0.043147072
# 2: 0 x1 Estimate -0.054364406
# 3: 0 x2 Estimate 0.013338043
# 4: 0 (Intercept) Std. Error 0.044957023
# 5: 0 x1 Std. Error 0.044412037
# 6: 0 x2 Std. Error 0.046209987
# 7: 0 (Intercept) Pr(>|t|) 0.337646052
# 8: 0 x1 Pr(>|t|) 0.221489530
Data used:
library(data.table)
set.seed(123)
nobs = 1000
dat <- data.table(
dummy = sample(0:1,nobs,T),
x1 = rnorm(nobs),
x2 = rnorm(nobs),
y = rnorm(nobs))
This question already has answers here:
Linear Regression and group by in R
(10 answers)
Closed 6 years ago.
I am running a linear regression on some variables in a data frame. I'd like to be able to subset the linear regressions by a categorical variable, run the linear regression for each categorical variable, and then store the t-stats in a data frame. I'd like to do this without a loop if possible.
Here's a sample of what I'm trying to do:
a<- c("a","a","a","a","a",
"b","b","b","b","b",
"c","c","c","c","c")
b<- c(0.1,0.2,0.3,0.2,0.3,
0.1,0.2,0.3,0.2,0.3,
0.1,0.2,0.3,0.2,0.3)
c<- c(0.2,0.1,0.3,0.2,0.4,
0.2,0.5,0.2,0.1,0.2,
0.4,0.2,0.4,0.6,0.8)
cbind(a,b,c)
I can begin by running the following linear regression and pulling the t-statistic out very easily:
summary(lm(b~c))$coefficients[2,3]
However, I'd like to be able to run the regression for when column a is a, b, or c. I'd like to then store the t-stats in a table that looks like this:
variable t-stat
a 0.9
b 2.4
c 1.1
Hope that makes sense. Please let me know if you have any suggestions!
Here is a solution using dplyr and tidy() from the broom package. tidy() converts various statistical model outputs (e.g. lm, glm, anova, etc.) into a tidy data frame.
library(broom)
library(dplyr)
data <- data_frame(a, b, c)
data %>%
group_by(a) %>%
do(tidy(lm(b ~ c, data = .))) %>%
select(variable = a, t_stat = statistic) %>%
slice(2)
# variable t_stat
# 1 a 1.6124515
# 2 b -0.1369306
# 3 c 0.8000000
Or, extracting the t-statistics for both the intercept and the slope term:
data %>%
group_by(a) %>%
do(tidy(lm(b ~ c, data = .))) %>%
select(variable = a, term, t_stat = statistic)
# variable term t_stat
# 1 a (Intercept) 1.2366939
# 2 a c 1.6124515
# 3 b (Intercept) 2.6325081
# 4 b c -0.1369306
# 5 c (Intercept) 1.4572335
# 6 c c 0.8000000
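A slightly more explicit variant of the slice(2) step above (just a sketch): filter on the term name instead of relying on row position, so it still works if the model gains more terms:
data %>%
  group_by(a) %>%
  do(tidy(lm(b ~ c, data = .))) %>%
  filter(term == "c") %>%                    # keep only the slope term
  select(variable = a, t_stat = statistic)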
You can use the lmList function from the nlme package to apply lm to subsets of data:
# the data
df <- data.frame(a, b, c)
library(nlme)
res <- lmList(b ~ c | a, df, pool = FALSE)
coef(summary(res))
The output:
, , (Intercept)
Estimate Std. Error t value Pr(>|t|)
a 0.1000000 0.08086075 1.236694 0.30418942
b 0.2304348 0.08753431 2.632508 0.07815663
c 0.1461538 0.10029542 1.457233 0.24110393
, , c
Estimate Std. Error t value Pr(>|t|)
a 0.50000000 0.3100868 1.6124515 0.2052590
b -0.04347826 0.3175203 -0.1369306 0.8997586
c 0.15384615 0.1923077 0.8000000 0.4821990
If you want the t values only, you can use this command:
coef(summary(res))[, "t value", -1]
# a b c
# 1.6124515 -0.1369306 0.8000000
Here's a vote for the plyr package and ddply().
library(plyr)
dF <- data.frame(a, b, c)   # the data from the question
plyrFunc <- function(x){
  mod <- lm(b ~ c, data = x)
  return(summary(mod)$coefficients[2, 3])
}
tStats <- ddply(dF, .(a), plyrFunc)
tStats
a V1
1 a 1.6124515
2 b -0.1369306
3 c 0.6852483
Use split to subset the data and do the looping with sapply:
dat <- data.frame(b,c)
dat_split <- split(x = dat, f = a)
res <- sapply(dat_split, function(x){
summary(lm(b~c, data = x))$coefficients[2,3]
})
Reshape the result to your needs:
data.frame(variable = names(res), "t-stat" = res)
variable t.stat
a a 1.6124515
b b -0.1369306
c c 0.8000000
You could do this:
a<- c("a","a","a","a","a",
"b","b","b","b","b",
"c","c","c","c","c")
b<- c(0.1,0.2,0.3,0.2,0.3,
0.1,0.2,0.3,0.2,0.3,
0.1,0.2,0.3,0.2,0.3)
c<- c(0.2,0.1,0.3,0.2,0.4,
0.2,0.5,0.2,0.1,0.2,
0.4,0.2,0.4,0.6,0.8)
df <- data.frame(a,b,c)
t.stats <- t(data.frame(lapply(c('a','b','c'),
function(x) summary(lm(b~c,data=df[df$a==x,]))$coefficients[2,3])))
colnames(t.stats) <- 't-stat'
rownames(t.stats) <- c('a','b','c')
Output:
> t.stats
t-stat
a 1.6124515
b -0.1369306
c 0.8000000
Unless I am mistaken, the values you give in your desired output are not the correct ones.
Or:
t.stats <- data.frame(t.stats)
t.stats$variable <- rownames(t.stats)
> t.stats[,c(2,1)]
variable t.stat
a a 1.6124515
b b -0.1369306
c c 0.8000000
If you want a data.frame and a separate column.
I am trying to calculate the correlation between two numeric columns in a data frame for each level of a factor. Here is an example data frame:
concentration <- c(3, 8, 4, 7, 3, 1, 3, 3, 8, 6)
area <- c(0.5, 0.9, 0.3, 0.4, 0.5, 0.8, 0.9, 0.2, 0.7, 0.7)
area_type <- c("A", "B", "A", "B", "A", "B", "A", "B", "A", "B")
data_frame <- data.frame(concentration, area, area_type)
In this example, I want to calculate the correlation between concentration and area for each level of area_type. I want to use cor.test rather than cor because I want p-values and kendall tau values. I have tried to do this using ddply:
ddply(data_frame, "area_type", summarise,
corr=(cor.test(data_frame$area, data_frame$concentration,
alternative="two.sided", method="kendall") ) )
However, I am having a problem with the output: it is organized differently from the normal Kendall cor.test output, which states z value, p-value, alternative hypothesis, and tau estimate. Instead of that, I get the output below. I don't know what each row of the output indicates. In addition, the output values are the same for each level of area_type.
area_type corr
1 A 0.3766218
2 A NULL
3 A 0.7064547
4 A 0.1001252
5 A 0
6 A two.sided
7 A Kendall's rank correlation tau
8 A data_frame$area and data_frame$concentration
9 B 0.3766218
10 B NULL
11 B 0.7064547
12 B 0.1001252
13 B 0
14 B two.sided
15 B Kendall's rank correlation tau
16 B data_frame$area and data_frame$concentration
What am I doing wrong with ddply? Or are there other ways of doing this? Thanks.
You can add an additional column with the names of corr. Also, your syntax is slightly incorrect: the . specifies that the variable comes from the data frame you've specified, so remove the data_frame$ prefix, or else cor.test will use the entire data frame instead of each subset:
ddply(data_frame, .(area_type), summarise,
corr=(cor.test(area, concentration,
alternative="two.sided", method="kendall")), name=names(corr) )
Which gives:
area_type corr name
1 A -0.285133 statistic
2 A NULL parameter
3 A 0.7755423 p.value
4 A -0.1259882 estimate
5 A 0 null.value
6 A two.sided alternative
7 A Kendall's rank correlation tau method
8 A area and concentration data.name
9 B 6 statistic
10 B NULL parameter
11 B 0.8166667 p.value
12 B 0.2 estimate
13 B 0 null.value
14 B two.sided alternative
15 B Kendall's rank correlation tau method
16 B area and concentration data.name
statistic is the z-value and estimate is the tau estimate.
EDIT: You can also do it like this to only pull what you want:
corfun<-function(x, y) {
corr=(cor.test(x, y,
alternative="two.sided", method="kendall"))
}
ddply(data_frame, .(area_type), summarise,z=corfun(area,concentration)$statistic,
pval=corfun(area,concentration)$p.value,
tau.est=corfun(area,concentration)$estimate,
alt=corfun(area,concentration)$alternative
)
Which gives:
area_type z pval tau.est alt
1 A -0.285133 0.7755423 -0.1259882 two.sided
2 B 6.000000 0.8166667 0.2000000 two.sided
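As a possible refinement (not part of the original answer): run cor.test once per group inside the ddply function and return all the pieces together, instead of calling corfun four times per group:
ddply(data_frame, .(area_type), function(d) {
  tt <- cor.test(d$area, d$concentration,
                 alternative = "two.sided", method = "kendall")
  data.frame(z = tt$statistic, pval = tt$p.value,
             tau.est = tt$estimate, alt = tt$alternative)
})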
Part of the reason this is not working is that cor.test returns:
Pearson's product-moment correlation
data: data_frame$concentration and data_frame$area
t = 0.5047, df = 8, p-value = 0.6274
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.5104148 0.7250936
sample estimates:
cor
0.1756652
This information cannot be put into a data.frame (which is what ddply does) without further complicating the code. If you can provide the exact information you need, then I can provide further assistance. I would look at just using something like:
corrTest <- ddply(.data = data_frame,
                  .variables = .(area_type),
                  .fun = function(d) data.frame(
                    corr = cor(d$concentration, d$area, method = "kendall")))
I haven't tested this code, but this is the route I would take initially and work from there.