"weighted" regression in R

I wrote the script below to do what I call "weighted" regression:
library(plyr)
set.seed(100)
temp.df <- data.frame(uid=1:200,
                      bp=sample(x=c(100:200),size=200,replace=TRUE),
                      age=sample(x=c(30:65),size=200,replace=TRUE),
                      weight=sample(c(1:10),size=200,replace=TRUE),
                      stringsAsFactors=FALSE)
temp.df.expand <- ddply(temp.df,
                        c("uid"),
                        function(df) {
                          data.frame(bp=rep(df[,"bp"],df[,"weight"]),
                                     age=rep(df[,"age"],df[,"weight"]),
                                     stringsAsFactors=FALSE)})
temp.df.lm <- lm(bp~age,data=temp.df,weights=weight)
temp.df.expand.lm <- lm(bp~age,data=temp.df.expand)
In temp.df each row carries a weight: there are 1178 samples in total, but rows with the same bp and age have been merged into one row, with the count recorded in the weight column.
I passed the weights argument to lm, then cross-checked the result against a second data frame in which temp.df is "expanded" back to one row per sample. The lm outputs for the two data frames differ.
Did I misinterpret the weights argument of lm? Can anyone tell me how to run the regression properly (i.e. without expanding the data frame manually) for a dataset laid out like temp.df? Thanks.

The coefficient estimates from the two fits agree. The problem is that lm bases the residual degrees of freedom on the 200 rows rather than on the sum of the weights, so the Df and mean-square statistics come out wrong. This will correct the problem:
temp.df.lm.aov <- anova(temp.df.lm)
# Residual Df = total weight - model Df - 1 (for the intercept)
temp.df.lm.aov$Df[length(temp.df.lm.aov$Df)] <-
    sum(temp.df.lm$weights) -
    sum(temp.df.lm.aov$Df[-length(temp.df.lm.aov$Df)]) - 1
# Recompute mean squares, F statistic and p-value with the corrected Df
temp.df.lm.aov$`Mean Sq` <- temp.df.lm.aov$`Sum Sq` / temp.df.lm.aov$Df
temp.df.lm.aov$`F value`[1] <- temp.df.lm.aov$`Mean Sq`[1] /
    temp.df.lm.aov$`Mean Sq`[2]
temp.df.lm.aov$`Pr(>F)`[1] <- pf(temp.df.lm.aov$`F value`[1], 1,
    temp.df.lm.aov$Df[2], lower.tail = FALSE)
temp.df.lm.aov
Analysis of Variance Table

Response: bp
            Df Sum Sq Mean Sq F value   Pr(>F)
age          1   8741  8740.5  10.628 0.001146 **
Residuals 1176 967146   822.4
Compare with:
> anova(temp.df.expand.lm)
Analysis of Variance Table

Response: bp
            Df Sum Sq Mean Sq F value   Pr(>F)
age          1   8741  8740.5  10.628 0.001146 **
Residuals 1176 967146   822.4
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
I am a bit surprised this has not come up more often on R-help. Either that or my search strategy development powers are weakening with old age.
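To see which parts of the two fits agree and which do not, here is a minimal sketch. It rebuilds data of the same shape as the question's and expands it with base R's rep() instead of plyr; the names d, d_expand, fit_w and fit_e are illustrative, not from the original script. The simulated values depend on the R version's sampling algorithm, so no particular numbers are assumed.

```r
set.seed(100)
d <- data.frame(bp     = sample(100:200, 200, replace = TRUE),
                age    = sample(30:65,  200, replace = TRUE),
                weight = sample(1:10,   200, replace = TRUE))

# Expand: repeat each row index `weight` times
d_expand <- d[rep(seq_len(nrow(d)), times = d$weight), c("bp", "age")]

fit_w <- lm(bp ~ age, data = d, weights = weight)  # weighted fit on merged rows
fit_e <- lm(bp ~ age, data = d_expand)             # plain fit on expanded rows

# The coefficient estimates are the same in both fits
coef_equal <- isTRUE(all.equal(coef(fit_w), coef(fit_e)))

# The residual degrees of freedom are not
df_w <- df.residual(fit_w)   # number of rows - 2
df_e <- df.residual(fit_e)   # sum of weights - 2

# Standard errors differ only through the residual Df, so they can be rescaled
se_w <- coef(summary(fit_w))[, "Std. Error"]
se_e <- coef(summary(fit_e))[, "Std. Error"]
se_rescaled <- se_w * sqrt(df_w / df_e)
```

Here coef_equal is TRUE and se_rescaled matches se_e, which is the same Df correction as in the ANOVA table above applied to the coefficient table.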

Related

R: Calculating ANOVA Sum sqr for a model with interacting numerical and categorical variables

I need to know how the Sum Sq column of the anova() function in R is calculated for a linear model of the form:
modelXg <- lm(Y ~ X * group, data)
(which is equivalent to lm(Y ~ X + group + X:group, data = data))
where: "X" is a numerical variable, and "group" is a categorical one.
The function anova(modelXg) returns a table like:
Analysis of Variance Table
Response: TMIN
Df Sum Sq Mean Sq F value Pr(>F)
X 1 6476 6476.1 282.9208 < 2.2e-16 ***
group 1 1176 1176.4 51.3956 7.666e-13 ***
X:group 1 64 64.2 2.8058 0.09393 .
Residuals 45130 1033029 22.9
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
What I need to know is how to calculate every term of the Sum Sq column, described in a way that is as simple and reproducible as possible, because I need to implement it in C#.
I have already searched a lot across the Net, but I didn't find this exact case. I found some useful info in "Interpretation of Sum Sq in ANOVA with numeric independent variable", but it is incomplete for this case because the model there does not involve the interaction between the two variables.
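As a sketch of where each entry comes from: anova() on an lm fit reports sequential (Type I) sums of squares, so each term's Sum Sq is the drop in the residual sum of squares when that term is added to the model containing the terms before it. The example below uses the built-in mtcars data, with wt standing in for X and am for group; those stand-ins and the helper rss() are assumptions for illustration only.

```r
dat <- transform(mtcars, group = factor(am))

# Residual sum of squares of a least-squares fit
rss <- function(formula) sum(residuals(lm(formula, data = dat))^2)

rss0 <- rss(mpg ~ 1)                        # intercept only
rss1 <- rss(mpg ~ wt)                       # + X
rss2 <- rss(mpg ~ wt + group)               # + group
rss3 <- rss(mpg ~ wt + group + wt:group)    # + X:group

sum_sq <- c("X"         = rss0 - rss1,
            "group"     = rss1 - rss2,
            "X:group"   = rss2 - rss3,
            "Residuals" = rss3)

# Reference values from anova() for comparison
reference <- anova(lm(mpg ~ wt * group, data = dat))[["Sum Sq"]]
```

Porting this to another language then only needs an ordinary least-squares routine run on four nested design matrices. Note that the order of the terms matters for sequential sums of squares.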

get p-values from post-hoc duncan test in r

I want to perform a post-hoc Duncan test (using the "agricolae" package in R) after running a one-way ANOVA comparing the means of 3 groups.
## run one-way anova
> t1 <- aov(q3a ~ pgy,data = pgy)
> summary(t1)
Df Sum Sq Mean Sq F value Pr(>F)
pgy 2 13 6.602 5.613 0.00367 **
Residuals 6305 7416 1.176
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
1541 observations deleted due to missingness
## run post-hoc duncan test
> duncan.test(t1,"pgy",group = T, console = T)
Study: t1 ~ "pgy"
Duncan's new multiple range test
for q3a
Mean Square Error: 1.176209
pgy, means
q3a std r Min Max
PGY1 1.604292 1.068133 2656 1 5
PGY2 1.711453 1.126446 2017 1 5
PGY3 1.656269 1.057937 1635 1 5
Groups according to probability of means differences and alpha level( 0.05 )
Means with the same letter are not significantly different.
q3a groups
PGY2 1.711453 a
PGY3 1.656269 ab
PGY1 1.604292 b
However, the output only tells me that the means of PGY1 and PGY2 differ, without a p-value for each group comparison (post-hoc pairwise t tests would generate a p-value for each comparison).
How can I get p-values from a Duncan test?
Thanks!!
One solution would be to use PostHocTest from the DescTools package.
Here is an example using the warpbreaks sample data.
library(DescTools)
res <- aov(breaks ~ tension, data = warpbreaks)
PostHocTest(res, method = "duncan")
#
# Posthoc multiple comparisons of means : Duncan's new multiple range test
# 95% family-wise confidence level
#
#$tension
# diff lwr.ci upr.ci pval
#M-L -10.000000 -17.95042 -2.049581 0.01472 *
#H-L -14.722222 -23.08443 -6.360012 0.00072 ***
#H-M -4.722222 -12.67264 3.228197 0.23861
The pairwise differences between the means for every group are given in the first column (e.g. M-L, and so on), along with confidence intervals and p-values.
For example, the difference in the mean breaks between H and M is not statistically significant.
If performing Duncan's test is not a critical requirement, you can also run pairwise.t.test with various other multiple comparison corrections. For example, using Bonferroni's method:
with(warpbreaks, pairwise.t.test(breaks, tension, p.adjust.method = "bonferroni"))
#
# Pairwise comparisons using t tests with pooled SD
#
#data: breaks and tension
#
# L M
#M 0.0442 -
#H 0.0015 0.7158
#
#P value adjustment method: bonferroni
Results are consistent with those from the post-hoc Duncan's test.

How to run leveneTest for 5,834 genes at the same time

We have single-cell RNA-Seq data for 35 mesenchymal stem cells (MSCs) and would like to compare gene expression heterogeneity between culture conditions (i.e. hypoxia and normoxia). In other words, we would like to identify genes that are more homogeneous in hypoxia than in normoxia.
I know how to run leveneTest from the car package in R for a single gene (please see the two examples below). However, I don't know how to run leveneTest for all 5,834 genes at the same time. Could you help me? Our data are at: http://68.181.92.180/~Gary/temporal/Log2_Transpose_CPM_MSC35Sample_CPM5Sample10_5834Gene.txt
More importantly, how can I collect all 5,834 leveneTest results into a table with two columns? The first column is the gene symbol, and the second column is the P-value. Many thanks.
setwd("/Volumes/TOSHIBA EXT/20170305_LeveneTest/car")
require(car)
msc <- read.table("Log2_Transpose_CPM_MSC35Sample_CPM5Sample10_5834Gene.txt", header = T)
Test:
leveneTest(CEBPB ~ Culture, data = msc, center=median)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 1 6.5486 0.01527 *
33
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
And:
leveneTest(PLIN2 ~ Culture, data = msc, center=median)
Levene's Test for Homogeneity of Variance (center = median)
Df F value Pr(>F)
group 1 0.1804 0.6738
33
The simplest way is to use an external package:
library(matrixTests)
col_brownforsythe(msc[-1], msc$Culture)
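Here Brown-Forsythe is the variant of Levene's test with center = "median". The genes are in columns (one column per gene, one row per sample), so the test is applied column-wise.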
Note: I cannot see your data, so I assumed "Culture" is in the first column. All columns that are not used for comparison have to be removed.
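If an extra package is not an option, the two-column table can also be built in base R. The sketch below relies on the fact that the median-centered Levene (Brown-Forsythe) test is a one-way ANOVA on the absolute deviations from each group's median. The toy data frame, its gene names (GENE_A and so on) and the helper bf_pvalue() are hypothetical stand-ins for msc, not part of the original data.

```r
set.seed(1)
# Toy stand-in for msc: one row per cell, one column per gene
toy <- data.frame(Culture = rep(c("hypoxia", "normoxia"), times = c(17, 18)),
                  GENE_A  = rnorm(35),
                  GENE_B  = rnorm(35),
                  GENE_C  = rnorm(35))

# Brown-Forsythe p-value for one gene: ANOVA on |y - group median|
bf_pvalue <- function(y, g) {
  g   <- factor(g)
  dev <- abs(y - ave(y, g, FUN = median))
  anova(lm(dev ~ g))[1, "Pr(>F)"]
}

genes  <- setdiff(names(toy), "Culture")
p_vals <- vapply(toy[genes], bf_pvalue, numeric(1), g = toy$Culture)

# Two-column result: gene symbol and p-value
result <- data.frame(gene = genes, p_value = unname(p_vals))
```

With the real data, replacing toy by msc should give one row per gene. With thousands of genes the p-values would normally be adjusted for multiple testing, for example with p.adjust().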

running multiple anova tests in r

I have seven groups and want to run an ANOVA test to see whether they differ significantly on a trait. I have about 600 traits.
I have already calculated the mean, standard deviation, and variance per group and per trait. The seven groups have different sample sizes. How can I arrange my data so that I can run them all in R?
set.seed(2)
sampledata <- expand.grid(group = paste0("group", 1:7), trait = paste0("trait", 1:600), value = 1:5)
sampledata$value <- rnorm(nrow(sampledata))
sampledata.aov <- aov(value ~ group * trait, data = sampledata)
anova(sampledata.aov)
Analysis of Variance Table
Response: value
Df Sum Sq Mean Sq F value Pr(>F)
group 6 7.1 1.1784 1.1670 0.32072
trait 599 658.0 1.0985 1.0878 0.07096 .
group:trait 3594 3613.0 1.0053 0.9955 0.56604
Residuals 16800 16964.3 1.0098
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
A warning, though: even with random numbers, you are more likely than not to see a significant difference somewhere when you test this many traits at once.
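If the goal is one test per trait rather than a single combined model, a possible sketch is to fit a one-way ANOVA for each trait and then adjust the p-values for the number of tests. This is an alternative arrangement, not the model fitted above; the smaller simulated data set (20 traits, 5 replicates) and the choice of the Benjamini-Hochberg adjustment are assumptions for illustration.

```r
set.seed(2)
# Long format: one row per observation, with group and trait labels
toy <- expand.grid(rep   = 1:5,
                   group = paste0("group", 1:7),
                   trait = paste0("trait", 1:20))
toy$value <- rnorm(nrow(toy))

# One-way ANOVA per trait; keep the p-value of the group effect
p_raw <- vapply(split(toy, toy$trait),
                function(d) anova(lm(value ~ group, data = d))[1, "Pr(>F)"],
                numeric(1))

# Adjust for the number of traits tested
p_adj <- p.adjust(p_raw, method = "BH")

per_trait <- data.frame(trait = names(p_raw),
                        p_raw = unname(p_raw),
                        p_adj = unname(p_adj))
```

The long format also handles unequal group sizes, since each group simply contributes as many rows as it has observations.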

R One-Way ANOVA (getting only 1 DF and expecting 2 DFs)

I'm working through the examples of One-Way ANOVA on the UCLA website http://www.ats.ucla.edu/stat/r/faq/posthoc.htm.
When I run the command a1 <- aov(write ~ ses), my output differs from the example output. What bothers me most is that summary(a1) reports 1 DF for ses, although there are three ses categories (1, 2, 3), so I expect 2 DFs, which is what the example on the website shows. I've checked the data in the 'write' and 'ses' columns, and the counts and averages seem to match the example, but the result from aov(write ~ ses) doesn't. Has something changed? Why am I getting only 1 DF?
hsb2 <- read.table("http://www.ats.ucla.edu/stat/data/hsb2.csv", sep=",", header=TRUE)
a1 <- aov(write ~ ses, data = hsb2)
summary(a1)
# Df Sum Sq Mean Sq F value Pr(>F)
# ses 1 770 769.8 8.908 0.0032 **
# Residuals 198 17109 86.4
The page you are learning from has an error: it doesn't tell you how to enter the data correctly. The ses variable is supposed to be a factor, but as we can see from the data they give us, it is read in as numeric:
str(hsb2$ses)
If we convert it to a factor, we get the same answer as the example:
hsb2$ses <- as.factor(hsb2$ses)
a1 <- aov(write ~ ses, data=hsb2)
summary(a1)
Df Sum Sq Mean Sq F value Pr(>F)
ses 2 859 429.4 4.97 0.00784 **
Residuals 197 17020 86.4
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
In addition, using attach is highly discouraged by most R users.
