I have a question about R: I would like to use two lag periods instead of one in my model (please see my code below), but I don't know how to write this in R. Can someone help, please?
Here are the details of my R code:
library(plm)
fixed <- plm(sp ~ lag(debt) + lag(I(debt^2)) + outgp + gvex + vlimp + vlexp + bcour + infcpi,
             data = pdata, index = c("country", "year"), model = "within")
The lags must be on the variable debt.
This should give two lags on the debt variable:
library(plm)
fixed <- plm(sp ~ lag(debt, k = 1:2) + lag(I(debt^2)) + outgp + gvex + vlimp + vlexp + bcour + infcpi,
             data = pdata, index = c("country", "year"), model = "within")
For example:
data("Grunfeld", package = "plm")
lags2mod <- plm(inv ~ lag(value, k=1:2) + capital, data = Grunfeld, model = "within")
summary(lags2mod)
Oneway (individual) effect Within Model
Call:
plm(formula = inv ~ lag(value, k = 1:2) + capital, data = Grunfeld,
model = "within")
Balanced Panel: n = 10, T = 18, N = 180
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-272.21434 -19.24168 0.42825 18.09930 260.85548
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
lag(value, k = 1:2)1 0.078234 0.015438 5.0677 1.059e-06 ***
lag(value, k = 1:2)2 -0.018754 0.016078 -1.1664 0.2451
capital 0.352658 0.021003 16.7910 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Total Sum of Squares: 2034500
Residual Sum of Squares: 617850
R-Squared: 0.69631
Adj. R-Squared: 0.67449
F-statistic: 127.633 on 3 and 167 DF, p-value: < 2.22e-16
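Note that k = 1:2 is shorthand for including each lag separately; the following sketch should reproduce the two lag coefficients above:
lags2mod_explicit <- plm(inv ~ lag(value, 1) + lag(value, 2) + capital,
                         data = Grunfeld, model = "within")
summary(lags2mod_explicit)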
I am running an FE regression of firm characteristics on the dependent variable, effective tax rates.
I tried both the plm and fixest packages. I understand the differences in the standard errors (and I correct them with coeftest for the plm regression, not shown here), but I do not understand the difference in adjusted R-squared between fixest and plm.
The coefficients are the same in both models, so the adjusted R-squared should be the same, right?
> fe <- feols(GETR ~ SIZE + LEV + CAPINT + INVINT + ROA + LLEV + CF + EK | id + year, data = panel52, cluster = ~ id + year)
> summary(fe)
OLS estimation, Dep. Var.: GETR
Observations: 19,240
Fixed-effects: id: 1,924, year: 10
Standard-errors: Clustered (id & year)
Estimate Std. Error t value Pr(>|t|)
SIZE 0.031979 0.010624 3.010150 0.0147123 *
LEV -0.021880 0.033039 -0.662243 0.5244090
CAPINT 0.098979 0.027374 3.615754 0.0056088 **
INVINT 0.045080 0.039294 1.147250 0.2808605
ROA 0.222094 0.089892 2.470664 0.0355315 *
LLEV 0.015973 0.025740 0.620558 0.5502796
CF -0.237174 0.098485 -2.408230 0.0393631 *
EK 0.027064 0.063651 0.425196 0.6806793
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
RMSE: 0.160357 Adj. R2: 0.212174
Within R2: 0.004764
> summary(fe52)
Twoways effects Within Model
Call:
plm(formula = GETR ~ SIZE + LEV + CAPINT + INVINT + ROA + LLEV +
CF + EK, data = panel52, na.action = na.exclude, effect = "twoways",
model = "within")
Balanced Panel: n = 1924, T = 10, N = 19240
Residuals:
Min. 1st Qu. Median 3rd Qu. Max.
-0.7032714 -0.0635238 -0.0079128 0.0376269 0.9293129
Coefficients:
Estimate Std. Error t-value Pr(>|t|)
SIZE 0.0319790 0.0065342 4.8941 9.967e-07 ***
LEV -0.0218800 0.0356996 -0.6129 0.5400
CAPINT 0.0989786 0.0222388 4.4507 8.612e-06 ***
INVINT 0.0450804 0.0366761 1.2292 0.2190
ROA 0.2220941 0.0389295 5.7050 1.182e-08 ***
LLEV 0.0159730 0.0180534 0.8848 0.3763
CF -0.2371736 0.0425711 -5.5712 2.567e-08 ***
EK 0.0270641 0.0380943 0.7104 0.4774
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Total Sum of Squares: 497.11
Residual Sum of Squares: 494.74
R-Squared: 0.004764
Adj. R-Squared: -0.10685
F-statistic: 10.3509 on 8 and 17299 DF, p-value: 1.4467e-14
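Note that fixest's "Within R2" (0.004764) is exactly plm's "R-Squared": the two packages adjust different baseline R2s, which is why the adjusted values diverge even though the coefficients agree. fixest's "Adj. R2" adjusts the R2 of the full model including the fixed effects, whereas plm's "Adj. R-Squared" adjusts the within R2 of the demeaned regression. A minimal sketch for extracting comparable quantities from the fixest fit (assuming the fe object above):
library(fixest)
# "wr2"/"war2" are the (adjusted) within R2, i.e. the quantities a plm
# "within" model reports; "r2"/"ar2" include the fixed effects
r2(fe, type = c("r2", "ar2", "wr2", "war2"))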
Is it possible to plot with emmip the marginal (log odds) means from a geeglm model when you have a quadratic term? I have repeated measures data and the model fits better with a treatment x time squared term in addition to an interaction term with linear time.
I just want to be able to visualise the predicted curve in the data. If it is possible, I don't know how to specify it. I've tried:
mod3 <- geeglm(outcome ~ treatment*time + treatment*time_sq, data = dat, id = id, family = "binomial", corstr = "exchangeable")
mod3a.rg <- ref_grid(mod3, at = list(time = c(1,2,3,4,5,6), time_sq = c(1,4,9,16,25,36)))
emmip(mod3a.rg, treatment ~ time)
I don't think your mod3 is including your quadratic term correctly (hard to tell since you did not include reproducible code). This will let you include your squared term for time correctly:
mod3 <- geeglm(outcome ~ treatment*time + treatment*I(time^2), data = dat, id = id, family = "binomial", corstr = "exchangeable")
Then add plotit = TRUE to your call to emmip():
emmip(mod3a.rg, treatment ~ time, plotit = TRUE)
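For what it's worth, here is a minimal end-to-end sketch with simulated data (the variable names mirror the question, the data-generating values are arbitrary, and the geepack and emmeans packages are assumed):
library(geepack)   # geeglm()
library(emmeans)   # ref_grid(), emmip()

set.seed(1)
dat <- data.frame(
  id        = rep(1:40, each = 6),
  time      = rep(1:6, times = 40),
  treatment = factor(rep(c("A", "B"), each = 120))
)
dat$outcome <- rbinom(nrow(dat), 1,
                      plogis(-1 + 0.5 * dat$time - 0.05 * dat$time^2 +
                             0.3 * (dat$treatment == "B") * dat$time))

mod3 <- geeglm(outcome ~ treatment * time + treatment * I(time^2),
               data = dat, id = id, family = binomial,
               corstr = "exchangeable")

# with I(time^2), the quadratic is handled automatically, so only
# time needs to be specified in the reference grid
mod3a.rg <- ref_grid(mod3, at = list(time = 1:6))
emmip(mod3a.rg, treatment ~ time)   # plots the marginal log-odds means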
Here's a simple reproducible example, for comparison, using the savings dataset from the faraway package:
data(savings, package = "faraway")
# fit a model with a polynomial term
mod <- lm(sr ~ ddpi + I(ddpi^2), data = savings)
summary(mod)
The summary produces this output; note the additional coefficient for your quadratic term:
Call:
lm(formula = sr ~ ddpi + I(ddpi^2), data = savings)

Residuals:
    Min      1Q  Median      3Q     Max
-8.5601 -2.5612  0.5546  2.5735  7.8080

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  5.13038    1.43472   3.576 0.000821 ***
ddpi         1.75752    0.53772   3.268 0.002026 **
I(ddpi^2)   -0.09299    0.03612  -2.574 0.013262 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.079 on 47 degrees of freedom
Multiple R-squared: 0.205, Adjusted R-squared: 0.1711
F-statistic: 6.059 on 2 and 47 DF, p-value: 0.004559
If you don't enclose the quadratic term in I(), your summary will only include a term for ddpi: inside a model formula, ^ denotes factor crossing rather than arithmetic exponentiation, so ddpi^2 simplifies to just ddpi.
mod2 <- lm(sr ~ ddpi + ddpi^2, data = savings)
summary(mod2)
This produces the following summary, with a coefficient only for ddpi:
Call:
lm(formula = sr ~ ddpi + ddpi^2, data = savings)

Residuals:
    Min      1Q  Median      3Q     Max
-8.5535 -3.7349  0.9835  2.7720  9.3104

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)   7.8830     1.0110   7.797 4.46e-10 ***
ddpi          0.4758     0.2146   2.217   0.0314 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4.311 on 48 degrees of freedom
Multiple R-squared: 0.0929, Adjusted R-squared: 0.074
F-statistic: 4.916 on 1 and 48 DF, p-value: 0.03139
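As an alternative, poly() builds the polynomial terms for you; with raw = TRUE the sketch below should reproduce the ddpi + I(ddpi^2) coefficients above (the default orthogonal polynomials give the same fitted values, just with decorrelated coefficients):
mod_poly <- lm(sr ~ poly(ddpi, 2, raw = TRUE), data = savings)
summary(mod_poly)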
I have fit a negative binomial model to my data as follows:
> ngbinmodel <- glm.nb( seizure.rate ~ age + treatment, data = epilepsy_reduced)
> summary(ngbinmodel)
Call:
glm.nb(formula = seizure.rate ~ age + treatment, data = epilepsy_reduced,
init.theta = 1.498983674, link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.3510 -0.8790 -0.4563 0.4328 1.8916
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.0985089 0.5845392 3.590 0.000331 ***
age -0.0007965 0.0193064 -0.041 0.967092
treatment -0.5011593 0.2405658 -2.083 0.037228 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(1.499) family taken to be 1)
Null deviance: 71.217 on 57 degrees of freedom
Residual deviance: 66.875 on 55 degrees of freedom
AIC: 341.12
Number of Fisher Scoring iterations: 1
Theta: 1.499
Std. Err.: 0.362
2 x log-likelihood: -333.119
Now I would like to check whether I should include the interaction effect between age and treatment. I have found two methods to do it:
> intearaction_nbm <- addterm(ngbinmodel, . ~ . * age, test = "Chisq")
> summary(intearaction_nbm)
Df AIC LRT Pr(Chi)
Min. :1 Min. :339.1 Min. :0.9383 Min. :0.3327
1st Qu.:1 1st Qu.:339.4 1st Qu.:0.9383 1st Qu.:0.3327
Median :1 Median :339.6 Median :0.9383 Median :0.3327
Mean :1 Mean :339.6 Mean :0.9383 Mean :0.3327
3rd Qu.:1 3rd Qu.:339.9 3rd Qu.:0.9383 3rd Qu.:0.3327
Max. :1 Max. :340.2 Max. :0.9383 Max. :0.3327
NA's :1 NA's :1 NA's :1
and
> ngbinmodel_int <- glm.nb( seizure.rate ~ age*treatment, data = epilepsy_reduced)
> summary(ngbinmodel_int)
glm.nb(formula = seizure.rate ~ age * treatment, data = epilepsy_reduced,
init.theta = 1.531539174, link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.3503 -0.8742 -0.3848 0.3403 1.8508
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.51361 0.83920 1.804 0.0713 .
age 0.01914 0.02826 0.677 0.4981
treatment 0.60748 1.12199 0.541 0.5882
age:treatment -0.03893 0.03850 -1.011 0.3119
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(1.5315) family taken to be 1)
Null deviance: 72.238 on 57 degrees of freedom
Residual deviance: 66.874 on 54 degrees of freedom
AIC: 342.18
Number of Fisher Scoring iterations: 1
Theta: 1.532
Std. Err.: 0.373
2 x log-likelihood: -332.180
I was expecting to obtain the same result from both methods.
How can I access the regression estimates of intearaction_nbm?
Why are the outcomes different? According to intearaction_nbm I should include the interaction term (the AIC is lower), but according to ngbinmodel_int I should not include it (the AIC increases).
Would discretizing my continuous variable age be advised?
Remark: You should move this post to Cross Validated.
How can I access the regression estimates of intearaction_nbm?
intearaction_nbm gives you the result of adding a single term to your model; if you print it, you will get one row per possible additional term (age:treatment, age:another_variable, etc.), giving the AIC and p-value, among other things.
Why are the outcomes different?
Not possible to answer without the data, but what I would do is define both models and compare their AICs using AIC(model_1, model_2). This way I am sure that I am comparing the same quantity. As you know, the AIC is only defined up to an additive constant, and unless you check how it is computed, you cannot be sure that two different functions in two different packages use the same definition.
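For instance, with the two models already fitted above, the like-for-like comparison is simply (the values should correspond to the AICs printed in the two summaries, 341.12 and 342.18):
AIC(ngbinmodel, ngbinmodel_int)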
Would discretizing my continuous variable age be advised?
Not possible to answer without the data...
Let us consider the quine dataset and the following model, with only main effects for the Eth and Lrn factors:
library(MASS)
negbin_no_int <- glm.nb(Days ~ Eth + Lrn, data = quine)
summary(negbin_no_int)
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 3.0367 0.1334 22.764 < 2e-16 ***
# EthN -0.5520 0.1597 -3.457 0.000546 ***
# LrnSL 0.0388 0.1611 0.241 0.809661
extractAIC(negbin_no_int)
# [1] 3.000 1112.576
The model with the interaction term between the two factors is:
negbin_with_int <- glm.nb(Days ~ Eth * Lrn, data = quine)
summary(negbin_with_int)
# Coefficients:
# Estimate Std. Error z value Pr(>|z|)
# (Intercept) 2.9218 0.1503 19.446 <2e-16 ***
# EthN -0.3374 0.2100 -1.607 0.108
# LrnSL 0.2929 0.2307 1.269 0.204
# EthN:LrnSL -0.4956 0.3201 -1.549 0.122
extractAIC(negbin_with_int)
# [1] 4.000 1112.196
The statistical significance of the interaction term is p=0.122.
Now we compare the two models using addterm:
interaction_nbm <- addterm(negbin_no_int, . ~ . + Eth:Lrn, test="Chisq")
print(interaction_nbm)
# Model:
# Days ~ Eth + Lrn
# Df AIC LRT Pr(Chi)
# <none> 1112.6
# Eth:Lrn 1 1112.2 2.3804 0.1229
The AICs given by addterm are the same as those calculated using extractAIC().
If you want to see the regression estimates from addterm, you can add a print(summary(nfit)) inside the function, as follows:
myaddterm <- function(object, scope, scale = 0, test = c("none", "Chisq"),
                      k = 2, sorted = FALSE, trace = FALSE, ...)
{
    if (missing(scope) || is.null(scope))
        stop("no terms in scope")
    if (!is.character(scope))
        scope <- add.scope(object, update.formula(object, scope))
    if (!length(scope))
        stop("no terms in scope for adding to object")
    ns <- length(scope)
    ans <- matrix(nrow = ns + 1L, ncol = 2L,
                  dimnames = list(c("<none>", scope), c("df", "AIC")))
    ans[1L, ] <- extractAIC(object, scale, k = k, ...)
    n0 <- nobs(object, use.fallback = TRUE)
    env <- environment(formula(object))
    for (i in seq_len(ns)) {
        tt <- scope[i]
        if (trace) {
            message(gettextf("trying + %s", tt), domain = NA)
            utils::flush.console()
        }
        nfit <- update(object, as.formula(paste("~ . +", tt)),
                       evaluate = FALSE)
        nfit <- try(eval(nfit, envir = env), silent = TRUE)
        print(summary(nfit))  # added line: print the summary of each candidate fit
        ans[i + 1L, ] <- if (!inherits(nfit, "try-error")) {
            nnew <- nobs(nfit, use.fallback = TRUE)
            if (all(is.finite(c(n0, nnew))) && nnew != n0)
                stop("number of rows in use has changed: remove missing values?")
            extractAIC(nfit, scale, k = k, ...)
        }
        else NA_real_
    }
    dfs <- ans[, 1L] - ans[1L, 1L]
    dfs[1L] <- NA
    aod <- data.frame(Df = dfs, AIC = ans[, 2L])
    o <- if (sorted)
        order(aod$AIC)
    else seq_along(aod$AIC)
    test <- match.arg(test)
    if (test == "Chisq") {
        dev <- ans[, 2L] - k * ans[, 1L]
        dev <- dev[1L] - dev
        dev[1L] <- NA
        nas <- !is.na(dev)
        P <- dev
        P[nas] <- MASS:::safe_pchisq(dev[nas], dfs[nas], lower.tail = FALSE)
        aod[, c("LRT", "Pr(Chi)")] <- list(dev, P)
    }
    aod <- aod[o, ]
    head <- c("Single term additions", "\nModel:", deparse(formula(object)))
    if (scale > 0)
        head <- c(head, paste("\nscale: ", format(scale), "\n"))
    class(aod) <- c("anova", "data.frame")
    attr(aod, "heading") <- head
    aod
}
interaction_nbm1 <- myaddterm(negbin_no_int, . ~ . + Eth:Lrn, test="Chisq")
The output is:
Call:
glm.nb(formula = Days ~ Eth + Lrn + Eth:Lrn, data = quine, init.theta = 1.177546225,
link = log)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.5770 -1.0470 -0.3645 0.3521 2.7227
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.9218 0.1503 19.446 <2e-16 ***
EthN -0.3374 0.2100 -1.607 0.108
LrnSL 0.2929 0.2307 1.269 0.204
EthN:LrnSL -0.4956 0.3201 -1.549 0.122
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Negative Binomial(1.1775) family taken to be 1)
Null deviance: 182.93 on 145 degrees of freedom
Residual deviance: 168.18 on 142 degrees of freedom
AIC: 1114.2
Number of Fisher Scoring iterations: 1
Theta: 1.178
Std. Err.: 0.146
2 x log-likelihood: -1104.196
Using the plm package in R to fit a fixed-effects model, what is the correct syntax to add a lagged variable to the model? Similar to the 'L1.variable' command in Stata.
Here is my attempt adding a lagged variable (this is a test model and it might not make sense):
library(foreign)
nlswork <- read.dta("http://www.stata-press.com/data/r11/nlswork.dta")
pnlswork <- plm.data(nlswork, c('idcode', 'year'))
ffe <- plm(ln_wage ~ ttl_exp + lag(wks_work, 1)
, model = 'within'
, data = nlswork)
summary(ffe)
R output:
Oneway (individual) effect Within Model
Call:
plm(formula = ln_wage ~ ttl_exp + lag(wks_work), data = nlswork,
model = "within")
Unbalanced Panel: n=3911, T=1-14, N=19619
Residuals :
Min. 1st Qu. Median 3rd Qu. Max.
-1.77000 -0.10100 0.00293 0.11000 2.90000
Coefficients :
Estimate Std. Error t-value Pr(>|t|)
ttl_exp 0.02341057 0.00073832 31.7078 < 2.2e-16 ***
lag(wks_work) 0.00081576 0.00010628 7.6755 1.744e-14 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Total Sum of Squares: 1296.9
Residual Sum of Squares: 1126.9
R-Squared: 0.13105
Adj. R-Squared: -0.085379
F-statistic: 1184.39 on 2 and 15706 DF, p-value: < 2.22e-16
However, I got different results compared with what Stata produces.
In my actual model, I would like to instrument an endogenous variable with its lagged value.
Thanks!
For reference, here is the Stata code:
webuse nlswork.dta
xtset idcode year
xtreg ln_wage ttl_exp L1.wks_work, fe
Stata output:
Fixed-effects (within) regression Number of obs = 10,680
Group variable: idcode Number of groups = 3,671
R-sq: Obs per group:
within = 0.1492 min = 1
between = 0.2063 avg = 2.9
overall = 0.1483 max = 8
F(2,7007) = 614.60
corr(u_i, Xb) = 0.1329 Prob > F = 0.0000
------------------------------------------------------------------------------
ln_wage | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
ttl_exp | .0192578 .0012233 15.74 0.000 .0168597 .0216558
|
wks_work |
L1. | .0015891 .0001957 8.12 0.000 .0012054 .0019728
|
_cons | 1.502879 .0075431 199.24 0.000 1.488092 1.517666
-------------+----------------------------------------------------------------
sigma_u | .40678942
sigma_e | .28124886
rho | .67658275 (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0: F(3670, 7007) = 4.71 Prob > F = 0.0000
lag(), as it is implemented in plm, lags the observations row-wise without "looking" at the time variable, i.e. it simply shifts the variable within each individual. If there are gaps in the time dimension, you probably want the value of the time variable to be taken into account. There is the (as of now) unexported function plm:::lagt.pseries, which takes the time variable into account and hence handles gaps in the data as you might expect.
Edit: Since plm version 1.7-0, the default behaviour of lag() in plm is to shift time-wise, but the behaviour can be controlled via the shift argument (shift = c("time", "row")) to shift either time-wise or row-wise (the old behaviour).
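A tiny sketch of the row-wise vs. time-wise difference on a panel with a gap (made-up data; the shift argument requires plm >= 1.7-0):
library(plm)
# one individual observed in 2000, 2001 and 2003 (gap at 2002)
df  <- data.frame(id = 1, year = c(2000, 2001, 2003), x = c(10, 11, 13))
pdf <- pdata.frame(df, index = c("id", "year"))
lag(pdf$x, shift = "time")  # value for 2003 is NA: there is no observation for 2002
lag(pdf$x, shift = "row")   # 2003 gets the 2001 value: rows are simply shifted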
The unexported plm:::lagt.pseries is used as follows:
library(plm)
library(foreign)
nlswork <- read.dta("http://www.stata-press.com/data/r11/nlswork.dta")
pnlswork <- pdata.frame(nlswork, c('idcode', 'year'))
ffe <- plm(ln_wage ~ ttl_exp + plm:::lagt.pseries(wks_work,1)
, model = 'within'
, data = pnlswork)
summary(ffe)
Oneway (individual) effect Within Model
Call:
plm(formula = ln_wage ~ ttl_exp + plm:::lagt.pseries(wks_work,
1), data = nlswork, model = "within")
Unbalanced Panel: n=3671, T=1-8, N=10680
Residuals :
Min. 1st Qu. Median 3rd Qu. Max.
-1.5900 -0.0859 0.0000 0.0957 2.5600
Coefficients :
Estimate Std. Error t-value Pr(>|t|)
ttl_exp 0.01925775 0.00122330 15.7425 < 2.2e-16 ***
plm:::lagt.pseries(wks_work, 1) 0.00158907 0.00019573 8.1186 5.525e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Total Sum of Squares: 651.49
Residual Sum of Squares: 554.26
R-Squared: 0.14924
Adj. R-Squared: -0.29659
F-statistic: 614.604 on 2 and 7007 DF, p-value: < 2.22e-16
Btw1: Better to use pdata.frame() instead of the deprecated plm.data().
Btw2: You can check for gaps in your data with plm's is.pconsecutive():
is.pconsecutive(pnlswork)
all(is.pconsecutive(pnlswork))
You can also make the data consecutive first and then use lag(), like this:
pnlswork2 <- make.pconsecutive(pnlswork)
pnlswork2$wks_work_lag <- lag(pnlswork2$wks_work)
ffe2 <- plm(ln_wage ~ ttl_exp + wks_work_lag
, model = 'within'
, data = pnlswork2)
summary(ffe2)
Oneway (individual) effect Within Model
Call:
plm(formula = ln_wage ~ ttl_exp + wks_work_lag, data = pnlswork2,
model = "within")
Unbalanced Panel: n=3671, T=1-8, N=10680
Residuals :
Min. 1st Qu. Median 3rd Qu. Max.
-1.5900 -0.0859 0.0000 0.0957 2.5600
Coefficients :
Estimate Std. Error t-value Pr(>|t|)
ttl_exp 0.01925775 0.00122330 15.7425 < 2.2e-16 ***
wks_work_lag 0.00158907 0.00019573 8.1186 5.525e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Total Sum of Squares: 651.49
Residual Sum of Squares: 554.26
R-Squared: 0.14924
Adj. R-Squared: -0.29659
F-statistic: 614.604 on 2 and 7007 DF, p-value: < 2.22e-16
Or simply:
ffe3 <- plm(ln_wage ~ ttl_exp + lag(wks_work)
, model = 'within'
, data = pnlswork2) # note: it is the consecutive panel data set here
summary(ffe3)
Oneway (individual) effect Within Model
Call:
plm(formula = ln_wage ~ ttl_exp + lag(wks_work), data = pnlswork2,
model = "within")
Unbalanced Panel: n=3671, T=1-8, N=10680
Residuals :
Min. 1st Qu. Median 3rd Qu. Max.
-1.5900 -0.0859 0.0000 0.0957 2.5600
Coefficients :
Estimate Std. Error t-value Pr(>|t|)
ttl_exp 0.01925775 0.00122330 15.7425 < 2.2e-16 ***
lag(wks_work) 0.00158907 0.00019573 8.1186 5.525e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Total Sum of Squares: 651.49
Residual Sum of Squares: 554.26
R-Squared: 0.14924
Adj. R-Squared: -0.29659
F-statistic: 614.604 on 2 and 7007 DF, p-value: < 2.22e-16
I want to use partial least squares regression to find the most representative variables for predicting my data.
Here is my code:
library(pls)
potion <- read.table("potion-insomnie.txt", header = TRUE)
potionTrain <- potion[1:182,]
potionTest <- potion[183:192,]
potion1 <- plsr(Sommeil ~ Aubepine + Bave + Poudre + Pavot, data = potionTrain, validation = "LOO")
The call summary(lm(potion1)) gives me this output:
Call:
lm(formula = potion1)
Residuals:
Min 1Q Median 3Q Max
-14.9475 -5.3961 0.0056 5.2321 20.5847
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.63931 1.67955 22.410 < 2e-16 ***
Aubepine -0.28226 0.05195 -5.434 1.81e-07 ***
Bave -1.79894 0.26849 -6.700 2.68e-10 ***
Poudre 0.35420 0.72849 0.486 0.627
Pavot -0.47678 0.52027 -0.916 0.361
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.845 on 177 degrees of freedom
Multiple R-squared: 0.293, Adjusted R-squared: 0.277
F-statistic: 18.34 on 4 and 177 DF, p-value: 1.271e-12
I deduced that only the variables Aubepine and Bave are representative, so I refit the model with just these two variables:
potion1 <- plsr(Sommeil ~ Aubepine + Bave, data = potionTrain, validation = "LOO")
And I plot:
plot(potion1, ncomp = 2, asp = 1, line = TRUE)
Here is the plot of predicted vs measured values:
The problem is that I can see the regression line on the plot, but I cannot find out its equation and R². Is that possible?
Is the first part the same as a multiple linear regression (ANOVA)?
pacman::p_load(pls)
data(mtcars)
potion <- mtcars
potionTrain <- potion[1:28,]
potionTest <- potion[29:32,]
potion1 <- plsr(mpg ~ cyl + disp + hp + drat, data = potionTrain, validation = "LOO")
coef(potion1) # coefficients
scores(potion1) # scores
## R^2:
R2(potion1, estimate = "train")
## cross-validated R^2:
R2(potion1)
## Both:
R2(potion1, estimate = "all")
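As for the line in the predicted-vs-measured plot: as far as I can tell, line = TRUE draws the target line y = x, not a fitted regression. If you want the equation and R² of a regression of predicted on measured values, here is a sketch (using the potion1 fit above):
pred <- drop(predict(potion1, ncomp = 2))  # predictions using 2 components
fit  <- lm(pred ~ potionTrain$mpg)         # predicted vs measured
coef(fit)                                  # intercept and slope of the line
summary(fit)$r.squared                     # R^2 of that regression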