I want to use partial least squares regression to find the most representative variables for predicting my data.
Here is my code:
library(pls)
potion <- read.table("potion-insomnie.txt", header = TRUE)
potionTrain <- potion[1:182,]
potionTest <- potion[183:192,]
potion1 <- plsr(Sommeil ~ Aubepine + Bave + Poudre + Pavot, data = potionTrain, validation = "LOO")
Running summary(lm(potion1)) gives me this output:
Call:
lm(formula = potion1)
Residuals:
Min 1Q Median 3Q Max
-14.9475 -5.3961 0.0056 5.2321 20.5847
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.63931 1.67955 22.410 < 2e-16 ***
Aubepine -0.28226 0.05195 -5.434 1.81e-07 ***
Bave -1.79894 0.26849 -6.700 2.68e-10 ***
Poudre 0.35420 0.72849 0.486 0.627
Pavot -0.47678 0.52027 -0.916 0.361
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.845 on 177 degrees of freedom
Multiple R-squared: 0.293, Adjusted R-squared: 0.277
F-statistic: 18.34 on 4 and 177 DF, p-value: 1.271e-12
I deduced that only the variables Aubepine and Bave are representative, so I refit the model with just these two variables:
potion1 <- plsr(Sommeil ~ Aubepine + Bave, data = potionTrain, validation = "LOO")
And I plot:
plot(potion1, ncomp = 2, asp = 1, line = TRUE)
Here is the plot of predicted vs. measured values (plot not shown):
The problem is that I can see the regression line on the plot, but I cannot find its equation or its R². Is that possible?
Also, is the first part the same as a multiple linear regression (ANOVA)?
pacman::p_load(pls)
data(mtcars)
potion <- mtcars
potionTrain <- potion[1:28,]
potionTest <- potion[29:32,]
potion1 <- plsr(mpg ~ cyl + disp + hp + drat, data = potionTrain, validation = "LOO")
coef(potion1) # coefficients
scores(potion1) # scores
## R^2:
R2(potion1, estimate = "train")
## cross-validated R^2:
R2(potion1)
## Both:
R2(potion1, estimate = "all")
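The straight line that plot(..., line = TRUE) draws is the target (identity) line, not a fitted regression. If you want the equation and R² of the least-squares line through the predicted-vs-measured points, you can fit it yourself; a minimal sketch using the mtcars objects above (pred and line_fit are hypothetical names):
## Sketch: regress the fitted values on the observed response
pred <- drop(predict(potion1, ncomp = 2))  # fitted values with 2 components
line_fit <- lm(pred ~ potionTrain$mpg)     # predicted ~ measured
coef(line_fit)                             # intercept and slope of the line
summary(line_fit)$r.squared                # R^2 of that line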
I am looking to use PC1 from a PCA in a hierarchical regression analysis in R, to account for additional variation. Is this possible?
I ran my PCA with the code below:
library(factoextra)  # for fviz_eig() and fviz_pca_ind()
library(ggbiplot)    # for ggbiplot()
pca <- prcomp(my.data[, c(57:62)], center = TRUE, scale. = TRUE)
summary(pca)
str(pca)
fviz_eig(pca)
fviz_pca_ind(pca,
col.ind = "cos2", # Color by the quality of representation
gradient.cols = c("#00AFBB", "#E7B800", "#FC4E07"),
repel = TRUE # Avoid text overlapping
)
ggbiplot(pca)
print(pca)
#some results!
Rotation (n x k) = (4 x 4):
PC1 PC2 PC3 PC4
EC 0.5389823 -0.4785188 0.0003197419 0.6931938
temp 0.4787782 0.3590390 0.7913858440 -0.1247834
pH 0.5495125 -0.3839466 -0.2673991595 -0.6921840
DO. 0.4222624 0.7033461 -0.5497326925 0.1574569
Now I hope to use PC1 as a variable in my models, something like this:
m0<- lm(Rel.abund.rotifers~turb+chl.a+PC1,data=my.data)
Any help is much appreciated!
Extract the component scores using pca$x, add them to your dataframe using cbind(), then run your model. Example using mtcars:
pca <- prcomp(mtcars[, 3:6])
mtcars2 <- cbind(mtcars, pca$x)
m0 <- lm(mpg ~ cyl + PC1, data = mtcars2)
summary(m0)
Call:
lm(formula = mpg ~ cyl + PC1, data = mtcars2)
Residuals:
Min 1Q Median 3Q Max
-4.1424 -2.0289 -0.7483 1.3613 6.9373
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 27.99508 4.80433 5.827 2.56e-06 ***
cyl -1.27749 0.77169 -1.655 0.1086
PC1 -0.02275 0.01010 -2.251 0.0321 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.008 on 29 degrees of freedom
Multiple R-squared: 0.7669, Adjusted R-squared: 0.7508
F-statistic: 47.71 on 2 and 29 DF, p-value: 6.742e-10
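One caveat: the example above runs prcomp() on the raw columns, while the question's call used center = TRUE, scale. = TRUE. Scaling changes the scores in pca$x, so keep it when the variables are on different scales. A sketch of the same workflow with scaling (pca_s and mtcars3 are hypothetical names):
pca_s <- prcomp(mtcars[, 3:6], center = TRUE, scale. = TRUE)  # standardized PCA
mtcars3 <- cbind(mtcars, pca_s$x)                             # append the PC scores
summary(lm(mpg ~ cyl + PC1, data = mtcars3))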
For the Boston dataset, I want to perform polynomial regression with degrees 5, 4, 3, and 2. I want to use a loop but get this error:
Error in [.data.frame(data, 0, cols, drop = FALSE) :
undefined columns selected
library(caret)
train_control <- trainControl(method = "cv", number=10)
#set.seed(5)
cv <-rep(NA,4)
n=c(5,4,3,2)
for (i in n) {
cv[i]=train(nox ~ poly(dis,degree=i ), data = Boston, trncontrol = train_control, method = "lm")
}
Outside the loop, train(nox ~ poly(dis, degree = i), data = Boston, trncontrol = train_control, method = "lm") works fine.
Since poly() defaults to raw = FALSE, you are getting orthogonal polynomials. Hence there is no need for a for loop: fit the maximum degree, since the coefficient and standard error of each lower-order term will not change.
Here is a quick example using lm and the iris dataset:
summary(lm(Sepal.Length ~ poly(Sepal.Width, 2), iris))
Call:
lm(formula = Sepal.Length ~ poly(Sepal.Width, 2), data = iris)
Residuals:
Min 1Q Median 3Q Max
-1.63153 -0.62177 -0.08282 0.50531 2.33336
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.84333 0.06692 87.316 <2e-16 ***
poly(Sepal.Width, 2)1 -1.18838 0.81962 -1.450 0.1492
poly(Sepal.Width, 2)2 -1.41578 0.81962 -1.727 0.0862 .
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8196 on 147 degrees of freedom
Multiple R-squared: 0.03344, Adjusted R-squared: 0.02029
F-statistic: 2.543 on 2 and 147 DF, p-value: 0.08209
summary(lm(Sepal.Length ~ poly(Sepal.Width, 3), iris))
Call:
lm(formula = Sepal.Length ~ poly(Sepal.Width, 3), data = iris)
Residuals:
Min 1Q Median 3Q Max
-1.6876 -0.5001 -0.0876 0.5493 2.4600
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.84333 0.06588 88.696 <2e-16 ***
poly(Sepal.Width, 3)1 -1.18838 0.80687 -1.473 0.1430
poly(Sepal.Width, 3)2 -1.41578 0.80687 -1.755 0.0814 .
poly(Sepal.Width, 3)3 1.92349 0.80687 2.384 0.0184 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8069 on 146 degrees of freedom
Multiple R-squared: 0.06965, Adjusted R-squared: 0.05054
F-statistic: 3.644 on 3 and 146 DF, p-value: 0.01425
Take a look at the two summary tables: everything is the same, except that the row poly(Sepal.Width, 3)3 was added when degree 3 was used. So if we fit degree 3, we can read off exactly what degree 2 would look like, hence no need for a loop.
Note that you could also use several variables in poly(), e.g. poly(cbind(Sepal.Width, Petal.Length, Petal.Width), 4), and still easily recover poly(Sepal.Width, 2).
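That said, if you still want the caret loop from the question to run (for example, to compare cross-validated RMSE across degrees), the direct fixes are to spell the argument trControl correctly and to store the fitted models in a list, since train() returns a complex object rather than a scalar. A sketch, assuming the Boston data from MASS:
library(caret)
library(MASS)  # for the Boston dataset

train_control <- trainControl(method = "cv", number = 10)
degrees <- c(5, 4, 3, 2)
cv <- vector("list", length(degrees))
names(cv) <- paste0("degree_", degrees)

for (i in seq_along(degrees)) {
  cv[[i]] <- train(nox ~ poly(dis, degree = degrees[i]),
                   data = Boston,
                   trControl = train_control,  # was misspelled as trncontrol
                   method = "lm")
}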
How do I change the predictors in a linear regression in a loop in R?
Below is an example along with the error. Can someone please fix it?
# sample data: the mpg data frame from ggplot2
library(ggplot2)
mpg <- mpg
str(mpg)
# array of predictors
predictors <- c("hwy", "cty")
# loop over predictors
for (predictor in predictors)
{
# fit linear regression
model <- lm(formula = predictor ~ displ + cyl,
data = mpg)
# summary of model
summary(model)
}
Error
Error in model.frame.default(formula = predictor ~ displ + cyl, data = mpg, :
variable lengths differ (found for 'displ')
We may use paste() or reformulate(). Also, since it is a for loop, create an object to store the output from summary():
sumry_model <- vector('list', length(predictors))
names(sumry_model) <- predictors
for (predictor in predictors) {
# fit linear regression
model <- lm(reformulate(c("displ", "cyl"), response = predictor),
data = mpg)
# with paste
# model <- lm(formula = paste0(predictor, "~ displ + cyl"), data = mpg)
# summary of model
sumry_model[[predictor]] <- summary(model)
}
Output:
> sumry_model
$hwy
Call:
lm(formula = reformulate(c("displ", "cyl"), response = predictor),
data = mpg)
Residuals:
Min 1Q Median 3Q Max
-7.5098 -2.1953 -0.2049 1.9023 14.9223
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 38.2162 1.0481 36.461 < 2e-16 ***
displ -1.9599 0.5194 -3.773 0.000205 ***
cyl -1.3537 0.4164 -3.251 0.001323 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.759 on 231 degrees of freedom
Multiple R-squared: 0.6049, Adjusted R-squared: 0.6014
F-statistic: 176.8 on 2 and 231 DF, p-value: < 2.2e-16
$cty
Call:
lm(formula = reformulate(c("displ", "cyl"), response = predictor),
data = mpg)
Residuals:
Min 1Q Median 3Q Max
-5.9276 -1.4750 -0.0891 1.0686 13.9261
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 28.2885 0.6876 41.139 < 2e-16 ***
displ -1.1979 0.3408 -3.515 0.000529 ***
cyl -1.2347 0.2732 -4.519 9.91e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.466 on 231 degrees of freedom
Multiple R-squared: 0.6671, Adjusted R-squared: 0.6642
F-statistic: 231.4 on 2 and 231 DF, p-value: < 2.2e-16
This may also be done as a multivariate response:
summary(lm(cbind(hwy, cty) ~ displ + cyl, data = mpg))
Or, if we want to use the predictors vector:
summary(lm(as.matrix(mpg[predictors]) ~ displ + cyl, data = mpg))
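The same fits can also be written without an explicit loop; a sketch with lapply():
## Sketch: loop-free version of the reformulate() approach
sumry_model <- lapply(setNames(predictors, predictors), function(p) {
  summary(lm(reformulate(c("displ", "cyl"), response = p), data = mpg))
})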
I am analysing whether the effects of x_t on y_t differ during and after a specific time period.
I am trying to regress the following model in R using lm():
$y_t = b_0 + [b_1(1 - D_t) + b_2 D_t] x_t$
where D_t is a dummy variable with the value 1 over the time period and 0 otherwise.
Is it possible to use lm() for this formula?
observationNumber <- 1:80
obsFactor <- cut(observationNumber, breaks = c(0,55,81), right =F)
fit <- lm(y ~ x * obsFactor)
For example:
y = runif(80)
x = rnorm(80) + c(rep(0,54), rep(1, 26))
fit <- lm(y ~ x * obsFactor)
summary(fit)
Call:
lm(formula = y ~ x * obsFactor)
Residuals:
Min 1Q Median 3Q Max
-0.48375 -0.29655 0.05957 0.22797 0.49617
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.50959 0.04253 11.983 <2e-16 ***
x -0.02492 0.04194 -0.594 0.554
obsFactor[55,81) -0.06357 0.09593 -0.663 0.510
x:obsFactor[55,81) 0.07120 0.07371 0.966 0.337
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3116 on 76 degrees of freedom
Multiple R-squared: 0.01303, Adjusted R-squared: -0.02593
F-statistic: 0.3345 on 3 and 76 DF, p-value: 0.8004
obsFactor[55,81) is zero if observationNumber < 55 and one if it is greater than or equal to 55. The intercept is your $b_0$ and the coefficient on x is your $b_1$. x:obsFactor[55,81) is the product of the dummy and $x_t$; its coefficient is the change in slope, $b_2 - b_1$, so $b_2$ is the sum of the two slope coefficients.
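If you would rather have lm() report $b_1$ and $b_2$ directly, in exactly the question's parameterization (a single intercept and no main dummy effect), a sketch using the simulated data above (D and fit2 are hypothetical names):
## Sketch: fit y_t = b0 + b1*(1 - D_t)*x_t + b2*D_t*x_t directly
D <- as.numeric(observationNumber >= 55)  # the dummy D_t
fit2 <- lm(y ~ I((1 - D) * x) + I(D * x))
summary(fit2)  # (Intercept) is b0; the two slope coefficients are b1 and b2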
Is it possible to plot with emmip the marginal (log-odds) means from a geeglm model when you have a quadratic term? I have repeated-measures data, and the model fits better with a treatment × time-squared term in addition to the interaction with linear time.
I just want to be able to visualise the predicted curve in the data. If it is possible, I don't know how to specify it. I've tried:
mod3 <- geeglm(outcome ~ treatment*time + treatment*time_sq, data = dat, id = id, family = "binomial", corstr = "exchangeable")
mod3a.rg <- ref_grid(mod3, at = list(time = c(1,2,3,4,5,6), time_sq = c(1,4,9,16,25,36)))
emmip(mod3a.rg, treatment ~ time)
I don't think your mod3 is including your quadratic term correctly (hard to tell since you did not include reproducible code). This will let you include your squared term for time correctly:
mod3 <- geeglm(outcome ~ treatment*time + treatment*I(time^2), data = dat,
               id = id, family = "binomial", corstr = "exchangeable")
Then add plotit = TRUE to your call to emmip():
emmip(mod3a.rg, treatment ~ time, plotit = TRUE)
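A side benefit of I(time^2) over a separate time_sq column: emmeans derives the squared term from time itself, so the reference grid only needs time. A sketch, assuming the corrected mod3 above:
## Sketch: no need to specify time_sq values once I(time^2) is in the formula
mod3a.rg <- ref_grid(mod3, at = list(time = 1:6))
emmip(mod3a.rg, treatment ~ time, plotit = TRUE)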
Here's a simple reproducible example for comparison, using the savings dataset from the faraway package:
data(savings, package = "faraway")
# fit a model with a polynomial term
mod <- lm(sr ~ ddpi + I(ddpi^2), data = savings)
summary(mod)
The summary produces the following output; note the additional coefficient for the quadratic term:
Call:
lm(formula = sr ~ ddpi + I(ddpi^2), data = savings)
Residuals:
Min 1Q Median 3Q Max
-8.5601 -2.5612 0.5546 2.5735 7.8080
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.13038 1.43472 3.576 0.000821 ***
ddpi 1.75752 0.53772 3.268 0.002026 **
I(ddpi^2) -0.09299 0.03612 -2.574 0.013262 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.079 on 47 degrees of freedom
Multiple R-squared: 0.205, Adjusted R-squared: 0.1711
F-statistic: 6.059 on 2 and 47 DF, p-value: 0.004559
If you don't enclose the quadratic term in I(), your summary will only include a term for ddpi:
mod2 <- lm(sr ~ ddpi + ddpi^2, data = savings)
summary(mod2)
This produces the following summary, with a coefficient only for ddpi:
Call:
lm(formula = sr ~ ddpi + ddpi^2, data = savings)
Residuals:
Min 1Q Median 3Q Max
-8.5535 -3.7349 0.9835 2.7720 9.3104
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.8830 1.0110 7.797 4.46e-10 ***
ddpi 0.4758 0.2146 2.217 0.0314 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.311 on 48 degrees of freedom
Multiple R-squared: 0.0929, Adjusted R-squared: 0.074
F-statistic: 4.916 on 1 and 48 DF, p-value: 0.03139
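A third option that keeps the quadratic term without I() is a raw polynomial, which should reproduce the ddpi + I(ddpi^2) coefficients above; a sketch:
## Sketch: raw (non-orthogonal) polynomial, equivalent to ddpi + I(ddpi^2)
mod_raw <- lm(sr ~ poly(ddpi, 2, raw = TRUE), data = savings)
summary(mod_raw)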