I ran a model explaining the weight of a plant as a function of time, incorporating the treatment effect. The model looks like this:
mod <- lm(weight ~ time + treatment, data = df)
with the model summary being:
Call:
lm(formula = weight ~ time + treatment, data = df)
Residuals:
Min 1Q Median 3Q Max
-21.952 -7.674 0.770 6.851 21.514
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -37.5790 3.2897 -11.423 < 2e-16 ***
time 4.7478 0.2541 18.688 < 2e-16 ***
treatmentB 8.2000 2.4545 3.341 0.00113 **
treatmentC 5.4633 2.4545 2.226 0.02797 *
treatmentD 20.3533 2.4545 8.292 2.36e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 9.506 on 115 degrees of freedom
Multiple R-squared: 0.7862, Adjusted R-squared: 0.7788
F-statistic: 105.7 on 4 and 115 DF, p-value: < 2.2e-16
ANOVA table
Analysis of Variance Table
Response: weight
Df Sum Sq Mean Sq F value Pr(>F)
time 1 31558.1 31558.1 349.227 < 2.2e-16 ***
treatment 3 6661.9 2220.6 24.574 2.328e-12 ***
Residuals 115 10392.0 90.4
I want to test the H0 that intercept1 = intercept2 = intercept3 = intercept4. Is this done by simply interpreting the t-value and p-value for the intercept? I guess not, because that row is the baseline (treatment A in this case). I'm a bit puzzled by this, as little attention is paid to differences in intercepts in most sources I looked up.
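Is the right approach instead a joint test that all treatment coefficients are zero? A minimal sketch of the nested-model comparison I have in mind, using the df from the Call above:
# With a common slope for time, "equal intercepts" is exactly the null
# that every treatment coefficient is zero, so compare nested models:
mod0 <- lm(weight ~ time, data = df)               # one shared intercept
mod1 <- lm(weight ~ time + treatment, data = df)   # one intercept per treatment
anova(mod0, mod1)  # F-test of H0: intercept1 = intercept2 = intercept3 = intercept4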
I have a data.table data_dt on which I want to run a linear regression, where the user can choose the number of columns in groups G1 and G2 via the variable n_col. The following code works, but it is slow due to the extra time spent creating matrices. To improve its performance, is there a way to remove Steps 1, 2, and 3 altogether by tweaking the formula in the lm call and still get the same results?
library(timeSeries)
library(data.table)
data_dt = as.data.table(LPP2005REC[, -1])
n_col = 3 # Choose a number from 1 to 3
######### Step 1 ######### Create independent variable
xx <- as.matrix(data_dt[, "SPI"])
######### Step 2 ######### Create Group 1 of dependent variables
G1 <- as.matrix(data_dt[, .SD, .SDcols=c(1:n_col + 2)])
######### Step 3 ######### Create Group 2 of dependent variables
G2 <- as.matrix(data_dt[, .SD, .SDcols=c(1:n_col + 2 + n_col)])
lm(xx ~ G1 + G2)
Results:
summary(lm(xx ~ G1 + G2))
Call:
lm(formula = xx ~ G1 + G2)
Residuals:
Min 1Q Median 3Q Max
-3.763e-07 -4.130e-09 3.000e-09 9.840e-09 4.401e-07
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.931e-09 3.038e-09 -1.623e+00 0.1054
G1LMI -5.000e-01 4.083e-06 -1.225e+05 <2e-16 ***
G1MPI -2.000e+00 4.014e-06 -4.982e+05 <2e-16 ***
G1ALT -1.500e+00 5.556e-06 -2.700e+05 <2e-16 ***
G2LPP25 3.071e-04 1.407e-04 2.184e+00 0.0296 *
G2LPP40 -5.001e+00 2.360e-04 -2.119e+04 <2e-16 ***
G2LPP60 1.000e+01 8.704e-05 1.149e+05 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.762e-08 on 370 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 1.104e+12 on 6 and 370 DF, p-value: < 2.2e-16
This may be easier if you just create the formula with reformulate:
out <- lm(reformulate(names(data_dt)[c(1:n_col + 2, 1:n_col + 2 + n_col)],
                      response = "SPI"), data = data_dt)
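With n_col = 3, and assuming the column layout shown in the summary below, the reformulate call builds the formula directly from the column names:
reformulate(names(data_dt)[c(1:n_col + 2, 1:n_col + 2 + n_col)], response = "SPI")
## SPI ~ LMI + MPI + ALT + LPP25 + LPP40 + LPP60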
Checking:
> summary(out)
Call:
lm(formula = reformulate(names(data_dt)[c(1:n_col + 2, 1:n_col +
2 + n_col)], response = "SPI"), data = data_dt)
Residuals:
Min 1Q Median 3Q Max
-3.763e-07 -4.130e-09 3.000e-09 9.840e-09 4.401e-07
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.931e-09 3.038e-09 -1.623e+00 0.1054
LMI -5.000e-01 4.083e-06 -1.225e+05 <2e-16 ***
MPI -2.000e+00 4.014e-06 -4.982e+05 <2e-16 ***
ALT -1.500e+00 5.556e-06 -2.700e+05 <2e-16 ***
LPP25 3.071e-04 1.407e-04 2.184e+00 0.0296 *
LPP40 -5.001e+00 2.360e-04 -2.119e+04 <2e-16 ***
LPP60 1.000e+01 8.704e-05 1.149e+05 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.762e-08 on 370 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 1.104e+12 on 6 and 370 DF, p-value: < 2.2e-16
My objective is to fit a linear model with the same response, but with the predictors replaced by factors/scores. I am trying to find out which principal components to include in such a linear model if I want to achieve an R^2 of at least 0.9 * r.squared from my original model.
Which predictors should I choose?
model1 <- lm(Resp~.,data=test_dat)
> summary(model1)
Call:
lm(formula = Resp ~ ., data = test_dat)
Residuals:
Min 1Q Median 3Q Max
-0.35934 -0.07729 0.00330 0.08204 0.38709
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.18858 0.06926 -46.039 <2e-16 ***
Pred1 4.32083 0.03767 114.708 <2e-16 ***
Pred2 2.42110 0.04740 51.077 <2e-16 ***
Pred3 -1.00507 0.04435 -22.664 <2e-16 ***
Pred4 -3.19480 0.09147 -34.927 <2e-16 ***
Pred5 2.77779 0.05368 51.748 <2e-16 ***
Pred6 1.22923 0.05427 22.648 <2e-16 ***
Pred7 -1.21338 0.04562 -26.595 <2e-16 ***
Pred8 0.02485 0.05937 0.419 0.676
Pred9 -0.67831 0.05308 -12.778 <2e-16 ***
Pred10 1.69947 0.02628 64.672 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1193 on 489 degrees of freedom
Multiple R-squared: 0.997, Adjusted R-squared: 0.997
F-statistic: 1.645e+04 on 10 and 489 DF, p-value: < 2.2e-16
My new model should have an R^2 >= 0.897
(threshold <- 0.9 * r.sqrd)  # r.sqrd: presumably summary(model1)$r.squared
[1] 0.8973323
library(relaimpo)  # calc.relimp() comes from the relaimpo package
metrics.swiss <- calc.relimp(model1, type = c("lmg", "first", "last", "betasq", "pratt"))
metrics.swiss
metrics.swiss@lmg.rank
 Pred1  Pred2  Pred3  Pred4  Pred5  Pred6  Pred7  Pred8  Pred9 Pred10
     2      8      3      6      1     10      5      4      7      9
sum(metrics.swiss@lmg)
orderComponents <- c(5, 1, 3, 8, 7, 4, 9, 2, 10, 6)
PCAFactors <- Project.Data.PCA$scores
Rotated <- as.data.frame(cbind(Resp = test_dat$Resp, PCAFactors))
swissRotatedReordered <- Rotated[, c(1, orderComponents + 1)]
(nestedRSquared <- sapply(2:11, function(z)
  summary(lm(Resp ~ ., data = swissRotatedReordered[, 1:z]))$r.squared))
[1] 0.001041492 0.622569992 0.689046489 0.690319839 0.715051745 0.732286987
[7] 0.742441421 0.991291253 0.995263470 0.997035905
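A quick way to read the cutoff from those printed values (a sketch; positions index the lmg-ordered components):
# First position whose cumulative R^2 clears the 0.897 threshold
which(nestedRSquared >= threshold)[1]
## [1] 8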
You run a linear model with your scores as the new predictors.
"lmg" will show which factors made the biggest contribution, and those are the factors you should keep (see the selection sketch after the code below). In my case it was the top 3 factors.
predictors <- test_dat[-1]
Project.Data.PCA <- princomp(predictors)
summary(Project.Data.PCA)
PCAFactors <- Project.Data.PCA$scores
Rotated <- as.data.frame(cbind(Resp = test_dat$Resp, PCAFactors))
linModPCA <- lm(Resp ~ ., data = Rotated)
metrics.swiss <- calc.relimp(linModPCA, type = c("lmg", "first", "last", "betasq", "pratt"))
metrics.swiss
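As a sketch of the final selection step, assuming the @lmg slot keeps the predictor names (it is a named numeric vector), with the top-3 cutoff from above:
# Keep the top components by lmg contribution and refit on just those scores
top <- names(sort(metrics.swiss@lmg, decreasing = TRUE))[1:3]
lm(reformulate(top, response = "Resp"), data = Rotated)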
Normally, you and I (assuming you're not a bot) can easily identify whether a predictor is categorical or quantitative. For example, gender is obviously categorical, and your last vote can be classified categorically.
Basically, we can identify categorical predictors easily. But what happens when we feed some data into R and its lm function makes dummy variables for a predictor? How does it do that?
Somewhat related question on StackOverflow.
Search for R's factor function. Here is a small demo: the first model uses the number of cylinders as a numeric variable; the second uses it as a categorical variable.
> summary(lm(mpg~cyl,mtcars))
Call:
lm(formula = mpg ~ cyl, data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-4.9814 -2.1185 0.2217 1.0717 7.5186
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.8846 2.0738 18.27 < 2e-16 ***
cyl -2.8758 0.3224 -8.92 6.11e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.206 on 30 degrees of freedom
Multiple R-squared: 0.7262, Adjusted R-squared: 0.7171
F-statistic: 79.56 on 1 and 30 DF, p-value: 6.113e-10
> summary(lm(mpg~factor(cyl),mtcars))
Call:
lm(formula = mpg ~ factor(cyl), data = mtcars)
Residuals:
Min 1Q Median 3Q Max
-5.2636 -1.8357 0.0286 1.3893 7.2364
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 26.6636 0.9718 27.437 < 2e-16 ***
factor(cyl)6 -6.9208 1.5583 -4.441 0.000119 ***
factor(cyl)8 -11.5636 1.2986 -8.905 8.57e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.223 on 29 degrees of freedom
Multiple R-squared: 0.7325, Adjusted R-squared: 0.714
F-statistic: 39.7 on 2 and 29 DF, p-value: 4.979e-09
Hxd1011 addressed the more difficult case, when a categorical variable is stored as a number, so that R by default understands it as a numeric value; if this is not the desired behaviour, we must use the factor function.
Your example with the predictor ShelveLoc in the Carseats dataset is easier because it is a text (character) variable, and therefore it can only be a categorical variable.
> head(Carseats$ShelveLoc)
[1] Bad Good Medium Medium Bad Bad
Levels: Bad Good Medium
R decides this from the feature's type. You can check it with str(dataset). If a feature is of factor type, lm will create dummies for it.
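If you want to see the dummy columns themselves, model.matrix() exposes the design matrix that lm builds; a small sketch on mtcars:
# Each non-baseline factor level gets its own 0/1 indicator column
head(model.matrix(~ factor(cyl), data = mtcars))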
I have trouble understanding the difference between these two notations.
According to the R intro, y ~ x1/x2 means that x2 is nested within x1. If x1 is a factor and x2 a continuous variable, is lm(y ~ x1/x2) a correct representation of a nested ANCOVA?
What is confusing is that some online help topics suggest using aov(y ~ x1 + Error(x2)) to represent a nested ANOVA. Yet the two calls give completely different results.
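For reference, y ~ x1/x2 expands to x1 + x1:x2, i.e. a separate slope of x2 within each level of x1, which can be checked with terms():
attr(terms(y ~ x1/x2), "term.labels")
## [1] "x1"    "x1:x2"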
For example:
x2 <- rnorm(1000, 2)
x1 <- rep(c("A", "B"), each = 500)
y <- x2 * 3 + rnorm(1000)
Under this scenario I would expect x2 to be significant and x1 to be non-significant.
summary(aov(y~x1+Error(x2)))
Error: x2
Df Sum Sq Mean Sq
x1 1 9262 9262
Error: Within
Df Sum Sq Mean Sq F value Pr(>F)
x1 1 0.0 0.0003 0 0.985
Residuals 997 967.9 0.9708
aov() works as expected. However, lm()....
summary(lm( y~x1/x2))
Call:
lm(formula = y ~ x1/x2)
Residuals:
Min 1Q Median 3Q Max
-3.4468 -0.6352 0.0092 0.6526 2.8294
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.08727 0.09566 0.912 0.3618
x1B -0.24501 0.13715 -1.786 0.0743 .
x1A:x2 2.94012 0.04362 67.401 <2e-16 ***
x1B:x2 3.06272 0.04326 70.806 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.9838 on 996 degrees of freedom
Multiple R-squared: 0.9058, Adjusted R-squared: 0.9055
F-statistic: 3191 on 3 and 996 DF, p-value: < 2.2e-16
x1 is marginally significant here, and in many iterations it is highly significant. How can these results be so different?
What am I missing? Aren't those two formulas supposed to represent the same thing? Or am I misunderstanding something about the underlying statistics?
For more involved algorithms, determining the time complexity (i.e. big-O) is a pain. My solution has been to time the execution of the algorithm with parameters n and k, and come up with a function (a time function) that varies with n and k.
My data looks something like the below:
n k executionTime
500 1 0.02
500 2 0.03
500 3 0.05
500 ... ...
500 10 0.18
1000 1 0.08
... ... ...
10000 1 9.8
... ... ...
10000 10 74.57
I've been using the lm() function in the stats R package, but I don't know how to interpret the output of the multiple regression to determine a final big-O. This is my main question: how do you translate the output of a multiple-variable regression into a final ruling on the best big-O time complexity rating?
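For context, the call behind the first output is a plain additive fit; a sketch, where timings is a hypothetical data frame holding the table above:
fit <- lm(executionTime ~ n + k, data = timings)  # "timings" is a hypothetical name
summary(fit)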
Here's the output of the lm():
Residuals:
Min 1Q Median 3Q Max
-14.943 -5.325 -1.916 3.681 31.475
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.130e+01 1.591e+00 -13.39 <2e-16 ***
n 4.080e-03 1.953e-04 20.89 <2e-16 ***
k 2.361e+00 1.960e-01 12.05 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.962 on 197 degrees of freedom
Multiple R-squared: 0.747, Adjusted R-squared: 0.7444
F-statistic: 290.8 on 2 and 197 DF, p-value: < 2.2e-16
Here's the output of log(y) ~ log(n) + log(k):
Residuals:
Min 1Q Median 3Q Max
-0.4445 -0.1136 -0.0253 0.1370 0.5007
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -16.80405 0.13749 -122.22 <2e-16 ***
log(n) 2.02321 0.01609 125.72 <2e-16 ***
log(k) 1.01216 0.01833 55.22 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1803 on 197 degrees of freedom
Multiple R-squared: 0.9897, Adjusted R-squared: 0.9896
F-statistic: 9428 on 2 and 197 DF, p-value: < 2.2e-16
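My tentative reading of the log-log fit: treating it as a power law, the exponents are roughly 2 for n and 1 for k, which would suggest time growing like n^2 * k, i.e. O(n^2 k). A sketch of extracting them (same hypothetical timings frame):
logfit <- lm(log(executionTime) ~ log(n) + log(k), data = timings)
round(coef(logfit), 2)  # ~2 on log(n), ~1 on log(k) -> consistent with O(n^2 * k)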
Here's the output of the principal components, showing that both n and k contribute to the spread of the multivariate model:
                       PC1 (this is n)  PC2 (this is k)  PC3 (noise?)
Standard deviation              1.3654           1.0000       0.36840
Proportion of Variance          0.6214           0.3333       0.04524
Cumulative Proportion           0.6214           0.9548       1.00000