I'm taking some models that were built in EViews and putting them into R. I'm having trouble replicating EViews' MA function.
I tried using the lag of the regression's residuals, but this isn't quite the same. I've seen some mentions that this is really an ARIMA-type regression. Is there no way to replicate EViews' MA term in an lm regression?
For example in R:
set.seed(2)
a = data.frame(a = 1:6,
b = runif(6, 0.0, 1.0),
c = runif(6, 0.0, 1.0))
fit_C = lm(c ~ a + b, data = a)
a$C.pred = predict.lm(fit_C, a)
a$C.resid = a$c - a$C.pred
fit_C = lm(c ~ a + b + lag(C.resid, 1), data = a)
summary(fit_C)
Outputs:
Call:
lm(formula = c ~ a + b + lag(C.resid, 1), data = a)
Residuals:
1 2 3 4 5 6
-1.779e-17 -1.131e-17 5.474e-17 -5.218e-18 -1.959e-17 -8.320e-19
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.327e-01 4.279e-17 1.011e+16 <2e-16 ***
a -3.998e-02 1.353e-17 -2.954e+15 <2e-16 ***
b 2.889e-01 7.278e-17 3.969e+15 <2e-16 ***
lag(C.resid, 1) 1.000e+00 8.241e-17 1.213e+16 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.389e-17 on 2 degrees of freedom
Multiple R-squared: 1, Adjusted R-squared: 1
F-statistic: 5.444e+31 on 3 and 2 DF, p-value: < 2.2e-16
Whereas in EViews:
Dependent Variable: C01
Method: Least Squares
Date: 09/18/18 Time: 10:24
Sample: 1 6
Included observations: 6
Convergence achieved after 9 iterations
MA Backcast: 0
Variable Coefficient Std. Error t-Statistic Prob.
C 0.892941 0.147320 6.061254 0.0262
A -0.101365 0.041651 -2.433684 0.1354
B 0.063370 0.257874 0.245740 0.8288
MA(1) -0.982901 0.058536 -16.79134 0.0035
R-squared 0.933603 Mean dependent var 0.462030
Adjusted R-squared 0.834008 S.D. dependent var 0.250812
S.E. of regression 0.102186 Akaike info criterion -1.489321
Sum squared resid 0.020884 Schwarz criterion -1.628148
Log likelihood 8.467963 Hannan-Quinn criter. -2.045057
F-statistic 9.373951 Durbin-Watson stat 2.907407
Prob(F-statistic) 0.097923
Inverted MA Roots .98
How do I replicate the MA(1) variable in R?
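One approach that is often suggested (not from the original post) is to fit the regression with MA(1) errors directly, rather than lagging the OLS residuals; a minimal sketch on the toy data above using base R's arima():
# sketch: regress c on a and b with MA(1) errors, fit by maximum likelihood
# (EViews estimates the MA term by nonlinear least squares with backcasting,
# so the numbers will generally differ; with only 6 observations this is
# purely illustrative)
fit_ma <- arima(a$c, order = c(0, 0, 1), xreg = as.matrix(a[, c("a", "b")]))
fit_ma  # the "ma1" coefficient plays the role of EViews' MA(1) term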
data("hprice2")
reg1 <- lm(price ~ rooms + crime + nox, hprice2)
summary(reg1)
Call:
lm(formula = price ~ rooms + crime + nox, data = hprice2)
Residuals:
Min 1Q Median 3Q Max
-18311 -3218 -772 2418 39164
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -19371.47 3250.94 -5.959 4.79e-09 ***
rooms 7933.18 407.87 19.450 < 2e-16 ***
crime -199.70 35.05 -5.697 2.08e-08 ***
nox -1306.06 266.14 -4.907 1.25e-06 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6103 on 502 degrees of freedom
Multiple R-squared: 0.5634, Adjusted R-squared: 0.5608
F-statistic: 215.9 on 3 and 502 DF, p-value: < 2.2e-16
Question 1.
Run two alternative (two-sided) t-tests for: H0: B1 = 8000
predict(reg1, data.frame(rooms=8000, crime = -199.70, nox = -1306.06), interval = .99)
Report your t-statistic and whether you reject or fail to reject the null at 90, 95, and/or 99 percent confidence levels.
I suppose by beta1 you mean the coefficient on rooms. The t values in the summary are tests against the null hypothesis beta = 0; as on the Wikipedia page, the statistic is t = (estimate - hypothesized value) / standard error. Using nox as an example:
tstat = (-1306.06 - 0)/266.14
[1] -4.907417
And the p-value is
2*pt(-abs(tstat),502)
[1] 1.251945e-06
In your case the hypothesized value is 8000, so you test rooms = 8000 the same way:
tstat = (7933.18 - 8000)/407.87
2*pt(-abs(tstat),502)
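For instance, you can pull the estimate and standard error straight out of the coefficient table; a short sketch using the reg1 fit above (row and column names as printed by summary()):
ctab  <- coef(summary(reg1))                          # coefficient table of reg1
b1    <- ctab["rooms", "Estimate"]
se1   <- ctab["rooms", "Std. Error"]
tstat <- (b1 - 8000) / se1                            # t statistic for H0: B1 = 8000
pval  <- 2 * pt(-abs(tstat), df = df.residual(reg1))  # two-sided p-value
c(t = tstat, p.value = pval)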
You can also use linearHypothesis() from the car package to do the above:
library(car)
linearHypothesis(reg1, c("rooms = 8000"))
I am analysing whether the effects of x_t on y_t differ during and after a specific time period.
I am trying to regress the following model in R using lm():
y_t = b_0 + [b_1(1-D_t) + b_2 D_t]x_t
where D_t is a dummy variable with the value 1 over the time period and 0 otherwise.
Is it possible to use lm() for this formula?
observationNumber <- 1:80
obsFactor <- cut(observationNumber, breaks = c(0, 55, 81), right = FALSE)
fit <- lm(y ~ x * obsFactor)
For example:
y = runif(80)
x = rnorm(80) + c(rep(0,54), rep(1, 26))
fit <- lm(y ~ x * obsFactor)
summary(fit)
Call:
lm(formula = y ~ x * obsFactor)
Residuals:
Min 1Q Median 3Q Max
-0.48375 -0.29655 0.05957 0.22797 0.49617
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.50959 0.04253 11.983 <2e-16 ***
x -0.02492 0.04194 -0.594 0.554
obsFactor[55,81) -0.06357 0.09593 -0.663 0.510
x:obsFactor[55,81) 0.07120 0.07371 0.966 0.337
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.3116 on 76 degrees of freedom
Multiple R-squared: 0.01303, Adjusted R-squared: -0.02593
F-statistic: 0.3345 on 3 and 76 DF, p-value: 0.8004
obsFactor[55,81) is the dummy $D_t$: it is zero if observationNumber < 55 and one if it is greater than or equal to 55. In the fit above, the intercept is your $b_0$ and the coefficient on x is your $b_1$. x:obsFactor[55,81) is the product of the dummy and $x_t$; its coefficient is the difference $b_2 - b_1$, so $b_2$ is the sum of the x and x:obsFactor[55,81) coefficients. The obsFactor[55,81) main effect lets the intercept shift between the two periods, which is not part of your original equation.
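If you prefer coefficients that map one-to-one onto $b_1$ and $b_2$ (without the extra intercept shift from the obsFactor main effect), one option is to keep only the interaction term; a sketch using the objects defined above:
# common intercept b_0, separate slope of x in each regime:
fit2 <- lm(y ~ x:obsFactor)
summary(fit2)
# coefficient on x:obsFactor[0,55)  corresponds to b_1 (slope when D_t = 0)
# coefficient on x:obsFactor[55,81) corresponds to b_2 (slope when D_t = 1)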
I have run a logistic regression in R using glm to predict the likelihood that an individual in 1993 will have arthritis in 2004 (Arth2004) based on gender (Gen), smoking status (Smoke1993), hypertension (HT1993), high cholesterol (HC1993), and BMI (BMI1993) status in 1993. My sample size is n=7896. All variables are binary with 0 and 1 for false and true except BMI, which is continuous numeric. For gender, male=1 and female=0.
When I run the regression in R, I get good p-values, but when I actually use the regression for prediction, I get values greater than one quite often for very standard individuals. I apologize for the large code block, but I thought more information may be helpful.
library(ResourceSelection)
library(MASS)
data=read.csv(file.choose())
data$Arth2004 = as.factor(data$Arth2004)
data$Gen = as.factor(data$Gen)
data$Smoke1993 = as.factor(data$Smoke1993)
data$HT1993 = as.factor(data$HT1993)
data$HC1993 = as.factor(data$HC1993)
data$BMI1993 = as.numeric(data$BMI1993)
logistic <- glm(Arth2004 ~ Gen + Smoke1993 + BMI1993 + HC1993 + HT1993, data=data, family="binomial")
summary(logistic)
hoslem.test(logistic$y, fitted(logistic))
confint(logistic)
min(data$BMI1993)
median(data$BMI1993)
max(data$BMI1993)
e=2.71828
The output is as follows:
Call:
glm(formula = Arth2004 ~ Gen + Smoke1993 + BMI1993 + HC1993 +
HT1993, family = "binomial", data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0362 -1.0513 -0.7831 1.1844 1.8807
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.346104 0.158043 -14.845 < 2e-16 ***
Gen1 -0.748286 0.048398 -15.461 < 2e-16 ***
Smoke19931 -0.059342 0.064606 -0.919 0.358
BMI1993 0.084056 0.006005 13.997 < 2e-16 ***
HC19931 0.388217 0.047820 8.118 4.72e-16 ***
HT19931 0.341375 0.058423 5.843 5.12e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 10890 on 7895 degrees of freedom
Residual deviance: 10309 on 7890 degrees of freedom
AIC: 10321
Number of Fisher Scoring iterations: 4
Hosmer and Lemeshow goodness of fit (GOF) test
data: logistic$y, fitted(logistic)
X-squared = 18.293, df = 8, p-value = 0.01913
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) -2.65715966 -2.03756775
Gen1 -0.84336906 -0.65364134
Smoke19931 -0.18619647 0.06709748
BMI1993 0.07233866 0.09588198
HC19931 0.29454661 0.48200673
HT19931 0.22690608 0.45595006
[1] 18
[1] 26
[1] 43
A non-smoking female w/ median BMI (26), hypertension, and high cholesterol yields the following:
e^(26*0.084056+1*0.388217+1*0.341375-0*0.748286-0*0.059342-2.346104)
[1] 1.7664
I think the issue is somehow related to BMI, since that is the only numeric variable. Does anyone know why this regression produces probabilities greater than 1?
By default, family = "binomial" uses the logit link function (see ?family), so exp(linear predictor) is the odds, not the probability. The probability you're looking for is 1.7664 / (1 + 1.7664).
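You can also let predict() apply the inverse link for you instead of doing the arithmetic by hand; a small sketch assuming the same column names and 0/1 factor coding as in the question:
# hypothetical individual: non-smoking female, BMI 26, hypertension, high cholesterol
newperson <- data.frame(Gen       = factor(0, levels = c(0, 1)),
                        Smoke1993 = factor(0, levels = c(0, 1)),
                        HT1993    = factor(1, levels = c(0, 1)),
                        HC1993    = factor(1, levels = c(0, 1)),
                        BMI1993   = 26)
predict(logistic, newdata = newperson, type = "response")  # probability, always between 0 and 1
# equivalent by hand: plogis(lp), i.e. exp(lp) / (1 + exp(lp)) for the linear predictor lp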
I am performing an ANCOVA to test the relationship between body size (covariate, logLCC) and different head measures (response variable, logLP) in each sex (categorical variable, sexo).
I got the slopes for each sex from the lm and I would like to compare them to 1. More specifically, I would like to know whether the slopes are significantly greater than, less than, or equal to 1, as each case has a different biological meaning for the allometric relationship.
Here is my code:
#Modelling my lm#
> lm.logLP.sexo.adu<-lm(logLP~logLCC*sexo, data=ADU)
> anova(lm.logLP.sexo.adu)
Analysis of Variance Table
Response: logLP
Df Sum Sq Mean Sq F value Pr(>F)
logLCC 1 3.8727 3.8727 3407.208 < 2.2e-16 ***
sexo 1 0.6926 0.6926 609.386 < 2.2e-16 ***
logLCC:sexo 1 0.0396 0.0396 34.829 7.563e-09 ***
Residuals 409 0.4649 0.0011
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
#Obtaining slopes#
> lm.logLP.sexo.adu$coefficients
(Intercept) logLCC sexoM logLCC:sexoM
-0.1008891 0.6725818 -1.0058962 0.2633595
> lm.logLP.sexo.adu1<-lstrends(lm.logLP.sexo.adu,"sexo",var="logLCC")
> lm.logLP.sexo.adu1
sexo logLCC.trend SE df lower.CL upper.CL
H 0.6725818 0.03020017 409 0.6132149 0.7319487
M 0.9359413 0.03285353 409 0.8713585 1.0005241
Confidence level used: 0.95
#Comparing slopes#
> pairs(lm.logLP.sexo.adu1)
contrast estimate SE df t.ratio p.value
H - M -0.2633595 0.04462515 409 -5.902 <.0001
#Checking whether the slopes are different than 1#
#Computes Summary with statistics
> s1<-summary(lm.logLP.sexo.adu)
> s1
Call:
lm(formula = logLP ~ logLCC * sexo, data = ADU)
Residuals:
Min 1Q Median 3Q Max
-0.13728 -0.02202 -0.00109 0.01880 0.12468
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.10089 0.12497 -0.807 0.42
logLCC 0.67258 0.03020 22.271 < 2e-16 ***
sexoM -1.00590 0.18700 -5.379 1.26e-07 ***
logLCC:sexoM 0.26336 0.04463 5.902 7.56e-09 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.03371 on 409 degrees of freedom
Multiple R-squared: 0.9083, Adjusted R-squared: 0.9076
F-statistic: 1350 on 3 and 409 DF, p-value: < 2.2e-16
#Computes the t statistic for H0: slope = 1. The coefficient estimates and their standard errors are in s1$coefficients
> t1<-(1-s1$coefficients[2,1])/s1$coefficients[2,2]
#Calculates two tailed probability
> pval<- 2 * pt(abs(t1), df = df.residual(lm.logLP.sexo.adu), lower.tail = FALSE)
> print(pval)
[1] 3.037231e-24
I saw this whole process in several threads here, but all I can conclude from it is that my slopes are different from 1.
How could I check that they are greater or smaller than 1?
EDITED
Solved!
#performs one-sided test, H0: slope >= 1 (alternative: slope < 1)
pval<-pt(t1, df = df.residual(lm.logLP.sexo.adu), lower.tail = FALSE)
#performs one-sided test, H0: slope <= 1 (alternative: slope > 1)
pval<-pt(t1, df = df.residual(lm.logLP.sexo.adu), lower.tail = TRUE)
Also, tests should be performed in single-sex models.
How could I check that they are greater or smaller than 1?
As in this post and this post, and as in your own question, you can use a Wald test, which you compute with
t1<-(1-s1$coefficients[2,1])/s1$coefficients[2,2]
Alternatively, use the vcov() and coef() functions to make the code more readable:
fit <- lm.logLP.sexo.adu
t1 <- (1 - coef(fit)[2]) / sqrt(vcov(fit)[2, 2])  # [2] is the logLCC slope; sqrt of the diagonal gives its standard error
The Wald test gives you a t statistic that can be used for either a two-sided or a one-sided test. For a one-sided test, drop the abs() and set the lower.tail argument according to which tail you want to test.
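Concretely, a brief sketch with the corrected t1 above (note that t1 = (1 - slope)/SE, so the tails map as follows):
# two-sided test of H0: slope = 1
2 * pt(abs(t1), df = df.residual(fit), lower.tail = FALSE)
# one-sided, alternative slope < 1 (H0: slope >= 1)
pt(t1, df = df.residual(fit), lower.tail = FALSE)
# one-sided, alternative slope > 1 (H0: slope <= 1)
pt(t1, df = df.residual(fit), lower.tail = TRUE)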
I want to use partial least squares regression to find the most representative variables for predicting my data.
Here is my code:
library(pls)
potion<-read.table("potion-insomnie.txt",header=T)
potionTrain <- potion[1:182,]
potionTest <- potion[183:192,]
potion1 <- plsr(Sommeil ~ Aubepine + Bave + Poudre + Pavot, data = potionTrain, validation = "LOO")
summary(lm(potion1)) gives me this output:
Call:
lm(formula = potion1)
Residuals:
Min 1Q Median 3Q Max
-14.9475 -5.3961 0.0056 5.2321 20.5847
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 37.63931 1.67955 22.410 < 2e-16 ***
Aubepine -0.28226 0.05195 -5.434 1.81e-07 ***
Bave -1.79894 0.26849 -6.700 2.68e-10 ***
Poudre 0.35420 0.72849 0.486 0.627
Pavot -0.47678 0.52027 -0.916 0.361
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 7.845 on 177 degrees of freedom
Multiple R-squared: 0.293, Adjusted R-squared: 0.277
F-statistic: 18.34 on 4 and 177 DF, p-value: 1.271e-12
I deduced that only the variables Aubepine and Bave are representative, so I redid the model with just these two variables:
potion1 <- plsr(Sommeil ~ Aubepine + Bave, data = potionTrain, validation = "LOO")
And I plot:
plot(potion1, ncomp = 2, asp = 1, line = TRUE)
Here is the plot of predicted vs measured values:
The problem is that I can see the regression line on the plot, but I cannot get its equation or R². Is that possible?
Also, is the first part the same as a multiple linear regression (ANOVA)?
pacman::p_load(pls)
data(mtcars)
potion <- mtcars
potionTrain <- potion[1:28,]
potionTest <- potion[29:32,]
potion1 <- plsr(mpg ~ cyl + disp + hp + drat, data = potionTrain, validation = "LOO")
coef(potion1) # coefficients
scores(potion1) # scores
## R^2:
R2(potion1, estimate = "train")
## cross-validated R^2:
R2(potion1)
## Both:
R2(potion1, estimate = "all")
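If you also want the intercept, so you can write out the equation of the fitted line, coef() for plsr objects takes an intercept argument; a short sketch with the potion1 fit above (assuming the 2-component model is the one you keep):
coef(potion1, ncomp = 2, intercept = TRUE)   # intercept plus one coefficient per predictor
R2(potion1, estimate = "train", ncomp = 2)   # training R^2 of the 2-component model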