PCA with new Factors in R - r

My objective is to fit a linear model with the same response, but with the predictors replaced by principal component factors/scores. I am trying to find out which principal components to include in such a linear model if I want to achieve an R^2 of at least 0.9 * the r.squared of my original model.
Which components should I choose?
model1 <- lm(Resp~.,data=test_dat)
> summary(model1)
Call:
lm(formula = Resp ~ ., data = test_dat)
Residuals:
Min 1Q Median 3Q Max
-0.35934 -0.07729 0.00330 0.08204 0.38709
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.18858 0.06926 -46.039 <2e-16 ***
Pred1 4.32083 0.03767 114.708 <2e-16 ***
Pred2 2.42110 0.04740 51.077 <2e-16 ***
Pred3 -1.00507 0.04435 -22.664 <2e-16 ***
Pred4 -3.19480 0.09147 -34.927 <2e-16 ***
Pred5 2.77779 0.05368 51.748 <2e-16 ***
Pred6 1.22923 0.05427 22.648 <2e-16 ***
Pred7 -1.21338 0.04562 -26.595 <2e-16 ***
Pred8 0.02485 0.05937 0.419 0.676
Pred9 -0.67831 0.05308 -12.778 <2e-16 ***
Pred10 1.69947 0.02628 64.672 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1193 on 489 degrees of freedom
Multiple R-squared: 0.997, Adjusted R-squared: 0.997
F-statistic: 1.645e+04 on 10 and 489 DF, p-value: < 2.2e-16
My new model should have an R^2 >= 0.897
r.sqrd <- summary(model1)$r.squared
(threshold <- 0.9 * r.sqrd)
[1] 0.8973323
library(relaimpo)  # provides calc.relimp
metrics.swiss <- calc.relimp(model1, type = c("lmg", "first", "last", "betasq", "pratt"))
metrics.swiss
metrics.swiss@lmg.rank
 Pred1  Pred2  Pred3  Pred4  Pred5  Pred6  Pred7  Pred8  Pred9 Pred10
     2      8      3      6      1     10      5      4      7      9
sum(metrics.swiss@lmg)
orderComponents<-c(5,1,3,8,7,4,9,2,10,6)
PCAFactors<-Project.Data.PCA$scores
Rotated<-as.data.frame(cbind(Resp=test_dat$Resp,PCAFactors))
swissRotatedReordered<-Rotated[,c(1,orderComponents+1)]
(nestedRSquared<-sapply(2:11,function(z)
summary(lm(Resp~.,data=swissRotatedReordered[,1:z]))$r.squared))
[1] 0.001041492 0.622569992 0.689046489 0.690319839 0.715051745 0.732286987
[7] 0.742441421 0.991291253 0.995263470 0.997035905

Run a linear model with the PCA scores as the predictors.
The "lmg" metric then shows which components contributed the most, and those are the components you should keep. In my case it was the top 3 components.
predictors <- test_dat[-1]
Project.Data.PCA <- princomp(predictors)
summary(Project.Data.PCA)
PCAFactors<-Project.Data.PCA$scores
Rotated<-as.data.frame(cbind(Resp=test_dat$Resp,PCAFactors))
linModPCA<-lm(Resp~.,data=Rotated)
metrics.swiss <- calc.relimp(linModPCA, type = c("lmg", "first", "last","betasq",
"pratt"))
metrics.swiss
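Based on the lmg ranking, a minimal sketch (assuming the Rotated data frame and the threshold computed in the question) that adds the highest-ranked components one at a time until the target R^2 is reached:
ranked <- order(metrics.swiss@lmg, decreasing = TRUE)  # component indices, most important first
for (k in seq_along(ranked)) {
  keep  <- ranked[seq_len(k)]
  fit_k <- lm(Resp ~ ., data = Rotated[, c(1, keep + 1)])  # column 1 is Resp
  if (summary(fit_k)$r.squared >= threshold) break
}
keep                      # principal components to include
summary(fit_k)$r.squared  # achieved R^2, should be >= 0.8973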

Related

Linear regression on dynamic groups in R

I have a data.table data_dt on which I want to run linear regression so that the user can choose the number of columns in groups G1 and G2 via the variable n_col. The following code works, but it is slow because of the extra time spent creating matrices. To improve its performance, is there a way to remove Steps 1, 2, and 3 altogether by tweaking the formula passed to lm and still get the same results?
library(timeSeries)
library(data.table)
data_dt = as.data.table(LPP2005REC[, -1])
n_col = 3 # Choose a number from 1 to 3
######### Step 1 ######### Create the response variable
xx <- as.matrix(data_dt[, "SPI"])
######### Step 2 ######### Create Group 1 of predictor variables
G1 <- as.matrix(data_dt[, .SD, .SDcols = c(1:n_col + 2)])
######### Step 3 ######### Create Group 2 of predictor variables
G2 <- as.matrix(data_dt[, .SD, .SDcols = c(1:n_col + 2 + n_col)])
lm(xx ~ G1 + G2)
Results -
summary(lm(xx ~ G1 + G2))
Call:
lm(formula = xx ~ G1 + G2)
Residuals:
Min 1Q Median 3Q Max
-3.763e-07 -4.130e-09 3.000e-09 9.840e-09 4.401e-07
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.931e-09 3.038e-09 -1.623e+00 0.1054
G1LMI -5.000e-01 4.083e-06 -1.225e+05 <2e-16 ***
G1MPI -2.000e+00 4.014e-06 -4.982e+05 <2e-16 ***
G1ALT -1.500e+00 5.556e-06 -2.700e+05 <2e-16 ***
G2LPP25 3.071e-04 1.407e-04 2.184e+00 0.0296 *
G2LPP40 -5.001e+00 2.360e-04 -2.119e+04 <2e-16 ***
G2LPP60 1.000e+01 8.704e-05 1.149e+05 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.762e-08 on 370 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 1.104e+12 on 6 and 370 DF, p-value: < 2.2e-16
This may be easier by just creating the formula with reformulate:
out <- lm(reformulate(names(data_dt)[c(1:n_col + 2, 1:n_col + 2 + n_col)],
response = 'SPI'), data = data_dt)
Checking:
> summary(out)
Call:
lm(formula = reformulate(names(data_dt)[c(1:n_col + 2, 1:n_col +
2 + n_col)], response = "SPI"), data = data_dt)
Residuals:
Min 1Q Median 3Q Max
-3.763e-07 -4.130e-09 3.000e-09 9.840e-09 4.401e-07
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -4.931e-09 3.038e-09 -1.623e+00 0.1054
LMI -5.000e-01 4.083e-06 -1.225e+05 <2e-16 ***
MPI -2.000e+00 4.014e-06 -4.982e+05 <2e-16 ***
ALT -1.500e+00 5.556e-06 -2.700e+05 <2e-16 ***
LPP25 3.071e-04 1.407e-04 2.184e+00 0.0296 *
LPP40 -5.001e+00 2.360e-04 -2.119e+04 <2e-16 ***
LPP60 1.000e+01 8.704e-05 1.149e+05 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.762e-08 on 370 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 1.104e+12 on 6 and 370 DF, p-value: < 2.2e-16
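If n_col changes frequently, the same idea can be wrapped in a small helper (a sketch; fit_groups is a hypothetical name, assuming the column layout of data_dt above):
fit_groups <- function(dt, n_col) {
  vars <- names(dt)[c(1:n_col + 2, 1:n_col + 2 + n_col)]  # the G1 and G2 columns
  lm(reformulate(vars, response = "SPI"), data = dt)
}
summary(fit_groups(data_dt, n_col = 2))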

Test if intercepts in ancova model are significantly different in R

I ran a model explaining the weight of some plants as a function of time, trying to incorporate the treatment effect.
mod <- lm(weight ~ time + treatment, data = df)
The model summary is:
Call:
lm(formula = weight ~ time + treatment, data = df)
Residuals:
Min 1Q Median 3Q Max
-21.952 -7.674 0.770 6.851 21.514
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -37.5790 3.2897 -11.423 < 2e-16 ***
time 4.7478 0.2541 18.688 < 2e-16 ***
treatmentB 8.2000 2.4545 3.341 0.00113 **
treatmentC 5.4633 2.4545 2.226 0.02797 *
treatmentD 20.3533 2.4545 8.292 2.36e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 9.506 on 115 degrees of freedom
Multiple R-squared: 0.7862, Adjusted R-squared: 0.7788
F-statistic: 105.7 on 4 and 115 DF, p-value: < 2.2e-16
ANOVA table
Analysis of Variance Table
Response: weight
Df Sum Sq Mean Sq F value Pr(>F)
time 1 31558.1 31558.1 349.227 < 2.2e-16 ***
treatment 3 6661.9 2220.6 24.574 2.328e-12 ***
Residuals 115 10392.0 90.4
I want to test the H0 that intercept1 = intercept2 = intercept3 = intercept4. Is this done simply by interpreting the t-value and p-value for the intercept? I guess not, because that row is the baseline (treatment A in this case). I'm a bit puzzled by this, as not much attention is paid to differences in intercepts in most of the sources I looked up.
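One common way to test this (a sketch, assuming the data frame is called df): compare the ANCOVA model against the model without treatment. The F-test for dropping treatment is exactly the test that all four treatment-specific intercepts are equal, and it matches the treatment row of the ANOVA table above.
mod0 <- lm(weight ~ time, data = df)               # common intercept for all treatments
mod  <- lm(weight ~ time + treatment, data = df)   # treatment-specific intercepts
anova(mod0, mod)                                   # F-test of H0: all intercepts equal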

Extract root of dummy variable in model fit summary

In the following example, gender is encoded as dummy variables corresponding to the categories.
fit <- lm(mass ~ height + gender, data=dplyr::starwars)
summary(fit)
# Call:
# lm(formula = mass ~ height + gender, data = dplyr::starwars)
#
# Residuals:
# Min 1Q Median 3Q Max
# -41.908 -6.536 -1.585 1.302 55.481
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) -46.69901 12.67896 -3.683 0.000557 ***
# height 0.59177 0.06784 8.723 1.1e-11 ***
# genderhermaphrodite 1301.13951 17.37871 74.870 < 2e-16 ***
# gendermale 22.39565 5.82763 3.843 0.000338 ***
# gendernone 68.34530 17.49287 3.907 0.000276 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 16.57 on 51 degrees of freedom
# (31 observations deleted due to missingness)
# Multiple R-squared: 0.9915, Adjusted R-squared: 0.9909
# F-statistic: 1496 on 4 and 51 DF, p-value: < 2.2e-16
Is there a way to extract the root of the dummy variable name? For example, for gendernone, gendermale and genderhermaphrodite, the root would be gender, corresponding to the original column name in the dplyr::starwars data.
Get the variable names from the formula and check which one matches the input:
input <- c("gendermale", "height")
v <- all.vars(formula(fit))
v[sapply(input, function(x) which(pmatch(v, x) == 1))]
## [1] "gender" "height"
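An alternative sketch that avoids partial name matching: the "assign" attribute of the model matrix maps every coefficient column back to the term (original column) it came from.
mm   <- model.matrix(fit)
labs <- attr(terms(fit), "term.labels")   # "height", "gender"
root <- setNames(c("(Intercept)", labs)[attr(mm, "assign") + 1], colnames(mm))
root[c("gendermale", "height")]           # "gender" "height"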

Polynomial model to data in R [duplicate]

This question already has answers here:
Fitting polynomial model to data in R
(5 answers)
Closed 5 years ago.
Year <- c(1000,1500,1600,1700,1750,1800,1850,1900,1950,1955,1960,1965,1970,1975,1980,1985,1990,1995,2000,2005,2010,2015)
Africa <- c(70,86,114,106,106,107,111,133,229,254,285,322,366,416,478,550,632,720,814,920,1044,1186)
How can I find the population for the years 1925, 1963, 1978, 1988, and 1998 using polynomial linear regression?
Here is a starting point for the solution of your problem.
Year <- c(1000,1500,1600,1700,1750,1800,1850,1900,1950,1955,1960,1965,
1970,1975,1980,1985,1990,1995,2000,2005,2010,2015)
Africa <- c(70,86,114,106,106,107,111,133,229,254,285,322,366,416,478,550,
632,720,814,920,1044,1186)
df <- data.frame(Year, Africa)
# Polynomial linear regression of order 5
model1 <- lm(Africa ~ poly(Year,5), data=df)
summary(model1)
###########
Call:
lm(formula = Africa ~ poly(Year, 5), data = df)
Residuals:
Min 1Q Median 3Q Max
-59.639 -27.119 -12.397 9.149 97.398
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 411.32 10.12 40.643 < 2e-16 ***
poly(Year, 5)1 881.26 47.47 18.565 3.01e-12 ***
poly(Year, 5)2 768.50 47.47 16.190 2.42e-11 ***
poly(Year, 5)3 709.43 47.47 14.945 8.07e-11 ***
poly(Year, 5)4 628.45 47.47 13.239 4.89e-10 ***
poly(Year, 5)5 359.04 47.47 7.564 1.14e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 47.47 on 16 degrees of freedom
Multiple R-squared: 0.9852, Adjusted R-squared: 0.9805
F-statistic: 212.5 on 5 and 16 DF, p-value: 4.859e-14
#############
pred <- predict(model1)
plot(Year, Africa, type="o", xlab="Year", ylab="Africa")
lines(Year, pred, lwd=2, col="red")
The model estimated above fits poorly for years before 1900. It is therefore preferable to estimate a model only on the data after 1900.
# Polynomial linear regression of order 2
df2 <- subset(df,Year>1900)
model2 <- lm(Africa ~ poly(Year,2), data=df2)
summary(model2)
###########
Call:
lm(formula = Africa ~ poly(Year, 2), data = df2)
Residuals:
Min 1Q Median 3Q Max
-9.267 -2.489 -0.011 3.334 12.482
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 586.857 1.677 349.93 < 2e-16 ***
poly(Year, 2)1 1086.646 6.275 173.17 < 2e-16 ***
poly(Year, 2)2 245.687 6.275 39.15 3.65e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.275 on 11 degrees of freedom
Multiple R-squared: 0.9997, Adjusted R-squared: 0.9996
F-statistic: 1.576e+04 on 2 and 11 DF, p-value: < 2.2e-16
###########
df2$pred <- predict(model2)
plot(df2$Year, df2$Africa, type="o", xlab="Year", ylab="Africa")
lines(df2$Year, df2$pred, lwd=2, col="red")
The fit of this second model is clearly better.
Finally, we get the model predictions for the years 1925, 1963, 1978, 1988, and 1998.
df3 <- data.frame(Year=c(1925, 1963, 1978, 1988, 1998))
df3$pred <- predict(model2, newdata=df3)
df3
Year pred
1 1925 286.4863
2 1963 301.1507
3 1978 451.7210
4 1988 597.6301
5 1998 779.9623

R- How to save the console data into a row/matrix or data frame for future use?

I'm performing a multiple regression to find the best model to predict prices. The output in the R console is shown below.
I'd like to store the first column (Estimate) in a row/matrix or data frame for future use, such as deploying on the web with R Shiny.
(Price = 698.8 + 0.116*Voltage - 70.72*VendorCHICONY - 36.6*VendorDELTA - 66.8*VendorLITEON - 14.86*H)
Can somebody kindly advise? Thanks in advance.
Call:
lm(formula = Price ~ Voltage + Vendor + H, data = PSU2)
Residuals:
Min 1Q Median 3Q Max
-10.9950 -0.6251 0.0000 3.0134 11.0360
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 698.821309 276.240098 2.530 0.0280 *
Voltage 0.116958 0.005126 22.818 1.29e-10 ***
VendorCHICONY -70.721088 9.308563 -7.597 1.06e-05 ***
VendorDELTA -36.639685 5.866688 -6.245 6.30e-05 ***
VendorLITEON -66.796531 6.120925 -10.913 3.07e-07 ***
H -14.869478 6.897259 -2.156 0.0541 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.307 on 11 degrees of freedom
Multiple R-squared: 0.9861, Adjusted R-squared: 0.9799
F-statistic: 156.6 on 5 and 11 DF, p-value: 7.766e-10
Use coef on your lm output.
e.g.
m <- lm(Sepal.Length ~ Sepal.Width + Species, iris)
summary(m)
# Call:
# lm(formula = Sepal.Length ~ Sepal.Width + Species, data = iris)
# Residuals:
# Min 1Q Median 3Q Max
# -1.30711 -0.25713 -0.05325 0.19542 1.41253
#
# Coefficients:
# Estimate Std. Error t value Pr(>|t|)
# (Intercept) 2.2514 0.3698 6.089 9.57e-09 ***
# Sepal.Width 0.8036 0.1063 7.557 4.19e-12 ***
# Speciesversicolor 1.4587 0.1121 13.012 < 2e-16 ***
# Speciesvirginica 1.9468 0.1000 19.465 < 2e-16 ***
# ---
# Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#
# Residual standard error: 0.438 on 146 degrees of freedom
# Multiple R-squared: 0.7259, Adjusted R-squared: 0.7203
# F-statistic: 128.9 on 3 and 146 DF, p-value: < 2.2e-16
coef(m)
# (Intercept) Sepal.Width Speciesversicolor Speciesvirginica
# 2.2513932 0.8035609 1.4587431 1.9468166
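Since the goal is to reuse the estimates (e.g. in a Shiny app), a minimal sketch that stores them as a one-row data frame:
est <- data.frame(t(coef(m)), check.names = FALSE)  # one row, one column per coefficient
est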
See also names(m) which shows you some things you can extract, e.g. m$residuals (or equivalently, resid(m)).
And also methods(class='lm') will show you some other functions that work on a lm.
> methods(class='lm')
[1] add1 alias anova case.names coerce confint cooks.distance deviance dfbeta dfbetas drop1 dummy.coef effects extractAIC family
[16] formula hatvalues influence initialize kappa labels logLik model.frame model.matrix nobs plot predict print proj qr
[31] residuals rstandard rstudent show simulate slotsFromS3 summary variable.names vcov
(oddly, 'coef' is not in there? ah well)
Besides, I'd like to know if there is a command to show the "residual percentage"
= (actual value - fitted value)/actual value; currently the residuals() command
only shows the values below, but I need the percentage instead.
residuals(fit3ab)
1 2 3 4 5 6
-5.625491e-01 -5.625491e-01 7.676578e-15 -8.293815e+00 -5.646900e+00 3.443652e+00
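There is no built-in "residual percentage", but it is easy to compute; a sketch using the iris model m from above (the original fit3ab object is not shown in the question):
actual  <- fitted(m) + residuals(m)   # recovers the observed response values
res_pct <- residuals(m) / actual      # (actual - fitted) / actual
head(res_pct)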
