Linear regression on dynamic groups in R

I have a data.table data_dt on which I want to run a linear regression, where the user chooses the number of columns in groups G1 and G2 through the variable n_col. The following code works, but it is slow because of the extra time spent creating matrices. Is there a way to remove Steps 1, 2, and 3 altogether by adjusting the formula passed to lm, and still get the same results?
library(timeSeries)
library(data.table)
data_dt = as.data.table(LPP2005REC[, -1])
n_col = 3 # Choose a number from 1 to 3
######### Step 1 ######### Create the dependent (response) variable
xx <- as.matrix(data_dt[, "SPI"])
######### Step 2 ######### Create Group 1 of independent variables
G1 <- as.matrix(data_dt[, .SD, .SDcols=c(1:n_col + 2)])
######### Step 3 ######### Create Group 2 of independent variables
G2 <- as.matrix(data_dt[, .SD, .SDcols=c(1:n_col + 2 + n_col)])
lm(xx ~ G1 + G2)
Results -
summary(lm(xx ~ G1 + G2))
Call:
lm(formula = xx ~ G1 + G2)
Residuals:
Min 1Q Median 3Q Max
-3.763e-07 -4.130e-09 3.000e-09 9.840e-09 4.401e-07
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.931e-09 3.038e-09 -1.623e+00 0.1054
G1LMI -5.000e-01 4.083e-06 -1.225e+05 <2e-16 ***
G1MPI -2.000e+00 4.014e-06 -4.982e+05 <2e-16 ***
G1ALT -1.500e+00 5.556e-06 -2.700e+05 <2e-16 ***
G2LPP25 3.071e-04 1.407e-04 2.184e+00 0.0296 *
G2LPP40 -5.001e+00 2.360e-04 -2.119e+04 <2e-16 ***
G2LPP60 1.000e+01 8.704e-05 1.149e+05 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.762e-08 on 370 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 1.104e+12 on 6 and 370 DF, p-value: < 2.2e-16

It may be easier to skip the matrices and build the formula directly with reformulate:
out <- lm(reformulate(names(data_dt)[c(1:n_col + 2, 1:n_col + 2 + n_col)],
response = 'SPI'), data = data_dt)
Checking:
> summary(out)
Call:
lm(formula = reformulate(names(data_dt)[c(1:n_col + 2, 1:n_col +
2 + n_col)], response = "SPI"), data = data_dt)
Residuals:
Min 1Q Median 3Q Max
-3.763e-07 -4.130e-09 3.000e-09 9.840e-09 4.401e-07
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.931e-09 3.038e-09 -1.623e+00 0.1054
LMI -5.000e-01 4.083e-06 -1.225e+05 <2e-16 ***
MPI -2.000e+00 4.014e-06 -4.982e+05 <2e-16 ***
ALT -1.500e+00 5.556e-06 -2.700e+05 <2e-16 ***
LPP25 3.071e-04 1.407e-04 2.184e+00 0.0296 *
LPP40 -5.001e+00 2.360e-04 -2.119e+04 <2e-16 ***
LPP60 1.000e+01 8.704e-05 1.149e+05 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.762e-08 on 370 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 1.104e+12 on 6 and 370 DF, p-value: < 2.2e-16
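To see that the reformulate approach reproduces the matrix-based fit, here is a minimal self-contained sketch. It uses the built-in mtcars data instead of LPP2005REC, with mpg standing in for SPI; both substitutions are assumptions made only so the example runs without timeSeries or data.table.

```r
# Sketch on built-in data: mtcars stands in for data_dt, mpg for SPI (assumption).
n_col <- 2  # number of columns per group

# Column positions follow the same index arithmetic as in the question.
g1_idx <- 1:n_col + 2
g2_idx <- 1:n_col + 2 + n_col
predictors <- names(mtcars)[c(g1_idx, g2_idx)]

# Build the formula from the column names; no intermediate matrices needed.
f <- reformulate(predictors, response = "mpg")
fit_formula <- lm(f, data = mtcars)

# Matrix-based fit, mirroring Steps 1 to 3 of the question, for comparison.
xx <- as.matrix(mtcars[, "mpg"])
G1 <- as.matrix(mtcars[, g1_idx])
G2 <- as.matrix(mtcars[, g2_idx])
fit_matrix <- lm(xx ~ G1 + G2)

# The estimates agree; only the coefficient names differ (G1disp vs disp, ...).
same_fit <- isTRUE(all.equal(as.vector(coef(fit_formula)),
                             as.vector(coef(fit_matrix))))
print(f)
print(same_fit)
```

Changing n_col changes the selected names, and therefore the formula, without touching the rest of the code.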

Related

PCA with new Factors in R

My objective is to fit a linear model with the same response, but with the predictors replaced by factors/scores. I am trying to find out which principal components to include in such a linear model if I want to achieve an R^2 of at least 0.9*r.squared from my original model.
Which predictors should I choose?
model1 <- lm(Resp~.,data=test_dat)
> summary(model1)
Call:
lm(formula = Resp ~ ., data = test_dat)
Residuals:
Min 1Q Median 3Q Max
-0.35934 -0.07729 0.00330 0.08204 0.38709
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.18858 0.06926 -46.039 <2e-16 ***
Pred1 4.32083 0.03767 114.708 <2e-16 ***
Pred2 2.42110 0.04740 51.077 <2e-16 ***
Pred3 -1.00507 0.04435 -22.664 <2e-16 ***
Pred4 -3.19480 0.09147 -34.927 <2e-16 ***
Pred5 2.77779 0.05368 51.748 <2e-16 ***
Pred6 1.22923 0.05427 22.648 <2e-16 ***
Pred7 -1.21338 0.04562 -26.595 <2e-16 ***
Pred8 0.02485 0.05937 0.419 0.676
Pred9 -0.67831 0.05308 -12.778 <2e-16 ***
Pred10 1.69947 0.02628 64.672 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1193 on 489 degrees of freedom
Multiple R-squared: 0.997, Adjusted R-squared: 0.997
F-statistic: 1.645e+04 on 10 and 489 DF, p-value: < 2.2e-16
My new model should have an R^2 >= 0.897
(threshold<-0.9*r.sqrd)
[1] 0.8973323
metrics.swiss <- calc.relimp(model1, type = c("lmg", "first", "last","betasq", "Pratt"))
metrics.swiss
metrics.swiss@lmg.rank
>Pred1 Pred2 Pred3 Pred4 Pred5 Pred6 Pred7 Pred8 Pred9 Pred10
2 8 3 6 1 10 5 4 7 9
sum(metrics.swiss@lmg)
orderComponents<-c(5,1,3,8,7,4,9,2,10,6)
PCAFactors<-Project.Data.PCA$scores
Rotated<-as.data.frame(cbind(Resp=test_dat$Resp,PCAFactors))
swissRotatedReordered<-Rotated[,c(1,orderComponents+1)]
(nestedRSquared<-sapply(2:11,function(z)
summary(lm(Resp~.,data=swissRotatedReordered[,1:z]))$r.squared))
[1] 0.001041492 0.622569992 0.689046489 0.690319839 0.715051745 0.732286987
[7] 0.742441421 0.991291253 0.995263470 0.997035905
Fit a linear model on the PCA scores instead of the original predictors.
The "lmg" metric then shows which factors contribute the most to R^2, and those are the factors you should keep. In my case it was the top 3 factors.
predictors <- test_dat[-1]
Project.Data.PCA <- princomp(predictors)
summary(Project.Data.PCA)
PCAFactors<-Project.Data.PCA$scores
Rotated<-as.data.frame(cbind(Resp=test_dat$Resp,PCAFactors))
linModPCA<-lm(Resp~.,data=Rotated)
metrics.swiss <- calc.relimp(linModPCA, type = c("lmg", "first", "last","betasq",
"pratt"))
metrics.swiss
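As a rough base-R alternative to calc.relimp, the selection can be sketched by exploiting the fact that princomp scores are mutually uncorrelated, so each component's squared correlation with the response is its own share of R^2. This is a swapped-in technique, not the relaimpo computation above, and it uses mtcars with mpg as the response purely as stand-in data (an assumption).

```r
# Stand-in data (assumption): mpg is the response, the other columns are predictors.
y <- mtcars$mpg
X <- mtcars[, -1]

# Principal component scores of the predictors.
pca <- princomp(X)
scores <- pca$scores

# The scores are uncorrelated, so each squared correlation with y
# is that component's additive contribution to R^2.
r2_each <- cor(scores, y)[, 1]^2

# R^2 of the model on all components, and the 90% target.
full_r2 <- summary(lm(Resp ~ ., data = data.frame(Resp = y, scores)))$r.squared
threshold <- 0.9 * full_r2

# Add components in order of contribution until the target is reached.
ord <- order(r2_each, decreasing = TRUE)
cum_r2 <- cumsum(r2_each[ord])
n_keep <- which(cum_r2 >= threshold)[1]
keep <- names(cum_r2)[seq_len(n_keep)]

print(round(cum_r2, 4))
print(keep)
```

The names in keep are the components whose combined R^2 first reaches the threshold.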

Test if intercepts in ancova model are significantly different in R

I ran a model explaining the weight of some plant as a function of time and trying to incorporate the treatment effect.
mod <- lm(weight ~ time + treatment, data = df)
The model looks like this (plot not shown), with the model summary being:
Call:
lm(formula = weight ~ time + treatment, data = df)
Residuals:
Min 1Q Median 3Q Max
-21.952 -7.674 0.770 6.851 21.514
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -37.5790 3.2897 -11.423 < 2e-16 ***
time 4.7478 0.2541 18.688 < 2e-16 ***
treatmentB 8.2000 2.4545 3.341 0.00113 **
treatmentC 5.4633 2.4545 2.226 0.02797 *
treatmentD 20.3533 2.4545 8.292 2.36e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 9.506 on 115 degrees of freedom
Multiple R-squared: 0.7862, Adjusted R-squared: 0.7788
F-statistic: 105.7 on 4 and 115 DF, p-value: < 2.2e-16
ANOVA table
Analysis of Variance Table
Response: weight
Df Sum Sq Mean Sq F value Pr(>F)
time 1 31558.1 31558.1 349.227 < 2.2e-16 ***
treatment 3 6661.9 2220.6 24.574 2.328e-12 ***
Residuals 115 10392.0 90.4
I want to test the H0 that intercept1 = intercept2 = intercept3 = intercept4. Is this done by simply interpreting the t-value and p-value for the intercept? I guess not, because this is the baseline (treatment A in this case). I'm a bit puzzled by this, as most sources I looked up pay little attention to differences in intercepts.
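One common way to test this hypothesis, sketched here under assumptions, is to compare the model with a single common intercept against the model with one intercept per treatment using an F-test. The data below are simulated (the original df is not available), and the effect sizes are made up to loosely resemble the summary above.

```r
# Simulated stand-in for df (assumption): 4 treatments, 30 time points each.
set.seed(1)
df <- data.frame(
  time = rep(1:30, times = 4),
  treatment = factor(rep(c("A", "B", "C", "D"), each = 30))
)
shift <- c(A = 0, B = 8, C = 5, D = 20)  # hypothetical intercept shifts
df$weight <- -37 + 4.7 * df$time +
  unname(shift[as.character(df$treatment)]) +
  rnorm(nrow(df), sd = 9)

# Reduced model: one common intercept. Full model: one intercept per treatment.
reduced <- lm(weight ~ time, data = df)
full <- lm(weight ~ time + treatment, data = df)

# F-test of H0: all treatment intercepts are equal.
cmp <- anova(reduced, full)
print(cmp)
```

With time entered first, this comparison gives the same F statistic as the treatment row of the sequential ANOVA table, so in the output above that row (F = 24.574 on 3 and 115 DF) is the relevant test rather than the t-test on (Intercept).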

Extract root of dummy variable in model fit summary

In the following example, gender is encoded as dummy variables corresponding to the categories.
fit <- lm(mass ~ height + gender, data=dplyr::starwars)
summary(fit)
# Call:
# lm(formula = mass ~ height + gender, data = dplyr::starwars)
#
# Residuals:
# Min 1Q Median 3Q Max
# -41.908 -6.536 -1.585 1.302 55.481
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -46.69901 12.67896 -3.683 0.000557 ***
# height 0.59177 0.06784 8.723 1.1e-11 ***
# genderhermaphrodite 1301.13951 17.37871 74.870 < 2e-16 ***
# gendermale 22.39565 5.82763 3.843 0.000338 ***
# gendernone 68.34530 17.49287 3.907 0.000276 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 16.57 on 51 degrees of freedom
# (31 observations deleted due to missingness)
# Multiple R-squared: 0.9915, Adjusted R-squared: 0.9909
# F-statistic: 1496 on 4 and 51 DF, p-value: < 2.2e-16
Is there a way to extract the root of the dummy variable name? For example, for gendernone, gendermale and genderhermaphrodite, the root would be gender, corresponding to the original column name in the dplyr::starwars data.
Get the variable names from the formula and check which one matches the input:
input <- c("gendermale", "height")
v <- all.vars(formula(fit))
v[sapply(input, function(x) which(pmatch(v, x) == 1))]
## [1] "gender" "height"
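Another option, sketched here on the built-in iris data so that it runs without dplyr (an assumption), is to read the mapping from the "assign" attribute of the model matrix, which records which term each coefficient column came from.

```r
# Stand-in model (assumption): Species plays the role of gender.
fit <- lm(Sepal.Length ~ Sepal.Width + Species, data = iris)

mm <- model.matrix(fit)

# "assign" holds, for each column of the model matrix, the index of the
# term it belongs to (0 for the intercept).
term_index <- attr(mm, "assign")
term_labels <- attr(terms(fit), "term.labels")

# Map every coefficient name to the term (root) it was generated from.
roots <- c("(Intercept)", term_labels)[term_index + 1]
names(roots) <- colnames(mm)
print(roots)
```

Unlike partial matching on names, this does not depend on one variable name being a prefix of another. For interaction terms the root is the interaction label (for example a:b) rather than a single column name.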

R- How to save the console data into a row/matrix or data frame for future use?

I'm performing a multiple regression to find the best model to predict prices. The output in the R console is shown below.
I'd like to store the first column (Estimates) into a row/matrix or data frame for future use such as using R shiny to deploy on the web.
(Price = 698.8 + 0.116*Voltage - 70.72*VendorCHICONY - 36.6*VendorDELTA - 66.8*VendorLITEON - 14.86*H)
Can somebody kindly advise?? Thanks in advance.
Call:
lm(formula = Price ~ Voltage + Vendor + H, data = PSU2)
Residuals:
Min 1Q Median 3Q Max
-10.9950 -0.6251 0.0000 3.0134 11.0360
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 698.821309 276.240098 2.530 0.0280 *
Voltage 0.116958 0.005126 22.818 1.29e-10 ***
VendorCHICONY -70.721088 9.308563 -7.597 1.06e-05 ***
VendorDELTA -36.639685 5.866688 -6.245 6.30e-05 ***
VendorLITEON -66.796531 6.120925 -10.913 3.07e-07 ***
H -14.869478 6.897259 -2.156 0.0541 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.307 on 11 degrees of freedom
Multiple R-squared: 0.9861, Adjusted R-squared: 0.9799
F-statistic: 156.6 on 5 and 11 DF, p-value: 7.766e-10
Use coef on your lm output.
e.g.
m <- lm(Sepal.Length ~ Sepal.Width + Species, iris)
summary(m)
# Call:
# lm(formula = Sepal.Length ~ Sepal.Width + Species, data = iris)
# Residuals:
# Min 1Q Median 3Q Max
# -1.30711 -0.25713 -0.05325 0.19542 1.41253
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 2.2514 0.3698 6.089 9.57e-09 ***
# Sepal.Width 0.8036 0.1063 7.557 4.19e-12 ***
# Speciesversicolor 1.4587 0.1121 13.012 < 2e-16 ***
# Speciesvirginica 1.9468 0.1000 19.465 < 2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.438 on 146 degrees of freedom
# Multiple R-squared: 0.7259, Adjusted R-squared: 0.7203
# F-statistic: 128.9 on 3 and 146 DF, p-value: < 2.2e-16
coef(m)
# (Intercept) Sepal.Width Speciesversicolor Speciesvirginica
# 2.2513932 0.8035609 1.4587431 1.9468166
See also names(m) which shows you some things you can extract, e.g. m$residuals (or equivalently, resid(m)).
And also methods(class='lm') will show you some other functions that work on a lm.
> methods(class='lm')
[1] add1 alias anova case.names coerce confint cooks.distance deviance dfbeta dfbetas drop1 dummy.coef effects extractAIC family
[16] formula hatvalues influence initialize kappa labels logLik model.frame model.matrix nobs plot predict print proj qr
[31] residuals rstandard rstudent show simulate slotsFromS3 summary variable.names vcov
(oddly, 'coef' is not in there? ah well)
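To keep the estimates for later use, the named vector returned by coef can be reshaped as needed. The following is a small sketch using the same iris model; the object names coef_df, coef_row and full_table are only illustrative.

```r
m <- lm(Sepal.Length ~ Sepal.Width + Species, data = iris)
est <- coef(m)  # named numeric vector of estimates

# One row per term, convenient for joins or display.
coef_df <- data.frame(term = names(est), estimate = unname(est),
                      stringsAsFactors = FALSE)

# A 1-row matrix, with one column per term.
coef_row <- t(est)

# The whole coefficient table (estimate, std. error, t value, p value).
full_table <- as.data.frame(coef(summary(m)))

print(coef_df)
```

Any of these objects could then be written to disk with saveRDS() and read back with readRDS(), for example from a Shiny app.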
Besides, I'd like to know if there is a command to show the "residual percentage" = (actual value - fitted value) / actual value. Currently the residuals() command only shows the info below, but I need the percentage instead.
residuals(fit3ab)
1 2 3 4 5 6
-5.625491e-01 -5.625491e-01 7.676578e-15 -8.293815e+00 -5.646900e+00 3.443652e+00
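As far as I know there is no dedicated function for this, but the ratio can be computed directly from the fitted model. A minimal sketch on the iris model from the answer above (fit3ab itself is not available, so the model here is a stand-in):

```r
m <- lm(Sepal.Length ~ Sepal.Width + Species, data = iris)

# Actual response values as used in the fit.
actual <- model.response(model.frame(m))

# residual = actual - fitted, so this is (actual - fitted) / actual, in percent.
resid_pct <- 100 * resid(m) / actual

print(head(round(resid_pct, 2)))
```

Note that the ratio is undefined wherever the actual value is zero.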

A priori contrasts for lm() in R

I'm having trouble with setting a priori contrasts and would like to ask for some help. The following code should give two orthogonal contrasts to the factor level "d".
Response <- c(1,3,2,2,2,2,2,2,4,6,5,5,5,5,5,5,4,6,5,5,5,5,5,5)
A <- factor(c(rep("c",8),rep("d",8),rep("h",8)))
contrasts(A) <- cbind("d vs h"=c(0,1,-1),"d vs c"=c(-1,1,0))
summary.lm(aov(Response~A))
What I get is:
Call:
aov(formula = Response ~ A)
Residuals:
Min 1Q Median 3Q Max
-1.000e+00 -3.136e-16 -8.281e-18 -8.281e-18 1.000e+00
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.0000 0.1091 36.661 < 2e-16 ***
Ad vs h -1.0000 0.1543 -6.481 2.02e-06 ***
Ad vs c 2.0000 0.1543 12.961 1.74e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.5345 on 21 degrees of freedom
Multiple R-squared: 0.8889, Adjusted R-squared: 0.8783
F-statistic: 84 on 2 and 21 DF, p-value: 9.56e-11
But I expect the Estimate of (Intercept) to be 5.00, as it should be equal to the level d, right? Also the other estimates look strange...
I know I can get the correct values more easily using relevel(A, ref="d") (where they are displayed correctly), but I am interested in learning the correct formulation to test my own hypotheses. If I run a similar example with the following code (from a website), it works as expected:
irrigation<-factor(c(rep("Control",10),rep("Irrigated 10mm",10),rep("Irrigated20mm",10)))
biomass<-1:30
contrastmatrix<-cbind("10 vs 20"=c(0,1,-1),"c vs 10"=c(-1,1,0))
contrasts(irrigation)<-contrastmatrix
summary.lm(aov(biomass~irrigation))
Call:
aov(formula = biomass ~ irrigation)
Residuals:
Min 1Q Median 3Q Max
-4.500e+00 -2.500e+00 3.608e-16 2.500e+00 4.500e+00
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 15.5000 0.5528 28.04 < 2e-16 ***
irrigation10 vs 20 -10.0000 0.7817 -12.79 5.67e-13 ***
irrigationc vs 10 10.0000 0.7817 12.79 5.67e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.028 on 27 degrees of freedom
Multiple R-squared: 0.8899, Adjusted R-squared: 0.8817
F-statistic: 109.1 on 2 and 27 DF, p-value: 1.162e-13
I would really appreciate some explanation for this.
Thanks, Jeremias
I think the problem lies in how contrasts are interpreted (see ?contrasts for details). Let me explain:
If you use the default way for factor A,
A <- factor(c(rep("c",8),rep("d",8),rep("h",8)))
> contrasts(A)
d h
c 0 0
d 1 0
h 0 1
so the model that lm fits is
Mean(Response) = Intercept + beta_1 * I(d = 1) + beta_2 * I(h = 1)
summary.lm(aov(Response~A))
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.000 0.189 10.6 7.1e-10 ***
Ad 3.000 0.267 11.2 2.5e-10 ***
Ah 3.000 0.267 11.2 2.5e-10 ***
So for group c, the mean is just the intercept, 2; for group d, the mean is 2 + 3 = 5, and the same holds for group h.
What if you use your own contrast:
contrasts(A) <- cbind("d vs h"=c(0,1,-1),"d vs c"=c(-1,1,0))
A
[1] c c c c c c c c d d d d d d d d h h h h h h h h
attr(,"contrasts")
d vs h d vs c
c 0 -1
d 1 1
h -1 0
The model you fit turns out to be
Mean(Response) = Intercept + beta_1 * (I(d = 1) - I(h = 1)) + beta_2 * (I(d = 1) - I(c = 1))
= Intercept + (beta_1 + beta_2) * I(d = 1) - beta_2 * I(c = 1) - beta_1 * I(h = 1)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.000 0.109 36.66 < 2e-16 ***
Ad vs h -1.000 0.154 -6.48 2.0e-06 ***
Ad vs c 2.000 0.154 12.96 1.7e-11 ***
So for group c, the mean is 4 - 2 = 2, for group d, the mean is 4 - 1 + 2 = 5, for group h, the mean is 4 - (-1) = 5.
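The arithmetic above can be checked in one step: multiplying the coding matrix (with a leading column of ones for the intercept) by the fitted coefficients should give back the group means. A short sketch using the data from the question:

```r
Response <- c(1,3,2,2,2,2,2,2,4,6,5,5,5,5,5,5,4,6,5,5,5,5,5,5)
A <- factor(c(rep("c", 8), rep("d", 8), rep("h", 8)))
contrasts(A) <- cbind("d vs h" = c(0, 1, -1), "d vs c" = c(-1, 1, 0))

fit <- lm(Response ~ A)

# Each row of the coding matrix, prefixed by 1 for the intercept,
# times the coefficients gives the fitted mean of that group.
group_means <- drop(cbind(1, contrasts(A)) %*% coef(fit))
print(group_means)

# Compare with the raw group means.
print(tapply(Response, A, mean))
```

Both print 2, 5 and 5 for groups c, d and h, which shows that the matrix passed to contrasts() defines how the means are built from the coefficients, not which comparisons the coefficients estimate.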
==========================
Update:
The easiest way to do your contrast is to set the base (reference) level to be d.
contrasts(A) <- contr.treatment(3, base = 2)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.00e+00 1.89e-01 26.5 < 2e-16 ***
A1 -3.00e+00 2.67e-01 -11.2 2.5e-10 ***
A3 -4.86e-17 2.67e-01 0.0 1
If you want to use your contrast:
Response <- c(1,3,2,2,2,2,2,2,4,6,5,5,5,5,5,5,4,6,5,5,5,5,5,5)
A <- factor(c(rep("c",8),rep("d",8),rep("h",8)))
mat<- cbind(rep(1/3, 3), "d vs h"=c(0,1,-1),"d vs c"=c(-1,1,0))
mymat <- solve(t(mat))
my.contrast <- mymat[,2:3]
contrasts(A) <- my.contrast
summary.lm(aov(Response~A))
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.00e+00 1.09e-01 36.7 < 2e-16 ***
Ad vs h 7.69e-16 2.67e-01 0.0 1
Ad vs c 3.00e+00 2.67e-01 11.2 2.5e-10 ***
Reference: http://www.ats.ucla.edu/stat/r/library/contrast_coding.htm
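Why the inversion works can be sketched as follows: the rows of t(mat) state the hypotheses as weights on the group means, and using the inverse as the coding matrix makes the fitted coefficients equal to those weighted combinations. The check below reuses the data from the question; the column name "mean" and the object names coding and expected are only illustrative.

```r
Response <- c(1,3,2,2,2,2,2,2,4,6,5,5,5,5,5,5,4,6,5,5,5,5,5,5)
A <- factor(c(rep("c", 8), rep("d", 8), rep("h", 8)))

# Hypothesis weights on the group means (c, d, h), one hypothesis per column:
# the average of the three means, d minus h, and d minus c.
mat <- cbind(mean = rep(1/3, 3),
             "d vs h" = c(0, 1, -1),
             "d vs c" = c(-1, 1, 0))

# Coding matrix: inverse of the transposed weights, without the intercept column.
coding <- solve(t(mat))[, 2:3]
contrasts(A) <- coding
fit <- lm(Response ~ A)

# The coefficients should equal the weighted combinations of the group means.
group_means <- as.vector(tapply(Response, A, mean))
expected <- drop(t(mat) %*% group_means)

print(round(coef(fit), 10))
print(expected)
```

Here the intercept is the unweighted average of the group means (4), "d vs h" is 5 - 5 = 0 and "d vs c" is 5 - 2 = 3, matching the last coefficient table.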
