Dynamic regression in R Forecasting Package

I have a time series of two variables representing two currencies: SYP (Syrian pound) and LBP (Lebanese pound). The data represent the daily values of both currencies over the previous six months. I previously ran a standard regression model using SYP as the dependent variable and LBP as the independent variable. These are the results:
SYPts <- ts(SYP_LBP)
modelSYPLBP <- tslm(SYP ~ LBP, data = SYPts)
summary(modelSYPLBP)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -835.77100 228.93013 -3.651 0.000319 ***
LBP 0.41801 0.02744 15.235 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 324.9 on 248 degrees of freedom
Multiple R-squared: 0.4834, Adjusted R-squared: 0.4814
F-statistic: 232.1 on 1 and 248 DF, p-value: < 2.2e-16
However, when I try to run a dynamic regression model using the auto.arima call from the forecast package I obtain:
model <- auto.arima(SYPts[, "SYP"], xreg = SYPts[, "LBP"], stationary = FALSE)
summary(model)
Series: SYPts[, "SYP"]
Regression with ARIMA(0,1,1) errors
Coefficients:
ma1 drift xreg
-0.4654 5.6911 -0.0163
s.e. 0.0591 3.3376 0.0315
sigma^2 estimated as 9744: log likelihood=-1495.39
AIC=2998.78 AICc=2998.94 BIC=3012.85
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set -0.06985184 97.92004 46.45812 -0.1826244 1.792092 1.18404
ACF1
Training set 0.001792873
The coefficient of the regression is now negative and much smaller than the one obtained with the standard regression model.
Furthermore, when I run the forecast with this model assuming no change in LBP over the next 15 days, I obtain a quite flat forecast curve, and, since the coefficient estimate is negative, the curve becomes flatter and then negative as I increase the forecast values of LBP.
I would like to ask:
1 - Did I make some specific error in treating the data and preparing the model? For example, I did not set any frequency for the time series. Should I?
2 - Should I intervene further on the data by differencing them before creating the model? In the call I used stationary = FALSE, as with stationary = TRUE I obtain a very low p-value in the checkresiduals test.
3 - I don't understand whether I am doing something wrong in generating the forecast values for the independent variable (LBP). In the call forecast <- forecast(model, xreg = rep(10000, 15)) I assume that the two arguments of rep represent, respectively, the value of LBP and for how many days I want it repeated over time. As 10000 was the latest value of LBP in the time series, by using it I assume no change intervenes over the next 15 days. Is that correct?
Thank you

(This is a very brief answer, just to complete the thread, in case someone chances upon this question again.)
Check your residuals from the tslm() regression for autocorrelation with checkresiduals(modelSYPLBP). (See ?checkresiduals for the interpretation.) If the tests indicate, as is quite likely with time-series regressions, that the regression residuals are autocorrelated, then inference based on the coefficient estimates is not valid.
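To see why autocorrelated errors break the OLS inference, here is a minimal base-R sketch with simulated data (not the asker's series; the coefficients and AR parameter are made up for illustration). It fits an ordinary regression whose errors follow an AR(1) process and then applies a Ljung-Box test to the residuals, which plays the same role as the Breusch-Godfrey test reported by checkresiduals():

```r
set.seed(1)
n <- 250
# simulate a regression whose errors follow an AR(1) process,
# mimicking autocorrelated residuals in a time-series regression
x <- cumsum(rnorm(n))
e <- as.numeric(arima.sim(model = list(ar = 0.8), n = n))
y <- -835 + 0.4 * x + e
fit <- lm(y ~ x)
# Ljung-Box test on the residuals: a small p-value signals autocorrelation,
# which invalidates the usual OLS standard errors and p-values
Box.test(resid(fit), lag = 10, type = "Ljung-Box")
```

A tiny p-value here means the OLS standard errors understate the true uncertainty, which is exactly why the tslm() and auto.arima() coefficients can disagree so sharply.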


Is there a way to get unweighted residuals and fitted values from a WLS regression in R?

I'm working on question C9 in Wooldridge's Intro to Econometrics Textbook. It asks you to obtain the unweighted fitted values and residuals from a weighted least squares regression. Does the following code give me the weighted or unweighted fitted values and residuals?
fitted(wlsmodel)
resid(wlsmodel)
I'm getting different answers from those in the textbook and I'm thinking it must be because the code I'm entering is giving me weighted fitted values and residuals. If this is the case, is there a way to get unweighted fitted values and residuals from a wls regression?
Okay, I've figured it out.
Chapter 8, question C9
(i) Obtain the OLS estimates in equation (8.35)
library(wooldridge)
reg<-lm(cigs~log(income)+log(cigpric)+educ+age+I(age^2)+restaurn,data=smoke)
(ii) Obtain the hhat used in the WLS estimation of equation (8.36) and reproduce equation (8.36). From this equation, obtain the unweighted residuals and fitted values; call these uhat and yhat, respectively.
uhat<-resid(reg)
uhat2<-uhat^2
ghat<-fitted(lm(log(uhat^2)~log(income)+log(cigpric)+educ+age+I(age^2)+restaurn,data=smoke))
hhat<-exp(ghat)
wls<-lm(cigs~log(income)+log(cigpric)+educ+age+I(age^2)+restaurn,weight=1/hhat,data=smoke)
uhatwls<-resid(wls)
yhatwls<-fitted(wls)
(iii) Let utilde=uhat/sqrt(hhat) and ytilde=yhat/sqrt(hhat) be the weighted quantities. Carry out the special case of the white test for heteroskedasticity by regressing utilde^2 on ytilde and ytilde^2, being sure to include an intercept, as always. Do you find heteroskedasticity in the weighted residuals?
utilde<-uhatwls/sqrt(hhat)
ytilde<-yhatwls/sqrt(hhat)
utilde2<-utilde^2
ytilde2<-ytilde^2
whitetest<-lm(utilde2~ytilde+ytilde2)
summary(whitetest)
Call:
lm(formula = utilde2 ~ ytilde + ytilde2)
Residuals:
Min 1Q Median 3Q Max
-5.579 -1.801 -1.306 -0.855 90.871
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.06146 0.96043 -0.064 0.94899
ytilde 0.28667 1.41212 0.203 0.83918
ytilde2 2.40597 0.78615 3.060 0.00228 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.414 on 804 degrees of freedom
Multiple R-squared: 0.027, Adjusted R-squared: 0.02458
F-statistic: 11.15 on 2 and 804 DF, p-value: 1.667e-05
The process above gives the correct answers as given in the solutions manual, so I know it has been done correctly. The thing that was confusing me was the request to obtain the 'unweighted' residuals from the WLS. It turns out that these are just the residuals obtained by default from that regression, which are then weighted in part (iii) of the question, as per above. The goal being to then test the WLS regression for heteroskedasticity, which is indeed present in the WLS regression.
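The key point above — that resid() and fitted() on a weighted lm fit return quantities on the original (unweighted) scale — can be checked directly. This is a small self-contained sketch with simulated data and a made-up variance function, not the smoke dataset:

```r
set.seed(42)
n <- 200
x <- runif(n, 1, 5)
h <- exp(0.5 * x)                        # known skedastic function, for the demo only
y <- 2 + 3 * x + rnorm(n, sd = sqrt(h))
wls <- lm(y ~ x, weights = 1 / h)
u <- resid(wls)                          # default residuals: y - fitted, i.e. UNweighted
utilde <- u / sqrt(h)                    # weighted residuals, as in part (iii)
# the weights affect the estimation, not the definition of the residuals:
all.equal(unname(u), unname(y - fitted(wls)))
```

So "unweighted residuals from WLS" really are just the default output; the weighting by 1/sqrt(hhat) is something you apply yourself afterwards.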

Predictor in logistic regression for a large sample size (1.8 million obs.) predicts only 0's

I am trying to run a logistic regression model to predict the default probabilities of individual loans. I have a large sample size of 1.85 million observations, about 81% of which were fully paid off and the rest defaulted. I ran the logistic regression with 20+ other predictors that were statistically significant and got the warning "fitted probabilities 0 or 1 occurred", and by adding predictors step by step I found that only one predictor was causing this problem: "annual income" (annual_inc). I ran a logistic regression with only this predictor and found that it predicts only 0's (fully paid off loans), although there is a significant proportion of defaulted loans. I tried different proportions of training and testing data. If I split the data so that 80% of the original sample goes to the testing set and 20% to the training set, R doesn't show the fitted-probabilities warning, but the model still predicts only 0's on the testing set. Below I attach the little code concerned, just in case. I doubt that adding a small sample of my data would be of any use given the circumstance, but if I am mistaken, let me know please and I will add it.
>set.seed(42)
>indexes <- sample(1:nrow(df), 0.8*nrow(df))
>df_test = df[indexes,]
>df_train = df[-indexes,]
>mymodel_2 <- glm(loan_status ~ annual_inc, data = df_train, family = 'binomial')
>summary(mymodel_2)
Call:
glm(formula = loan_status ~ annual_inc, family = "binomial",
data = df_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.6902 -0.6530 -0.6340 -0.5900 5.4533
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.308e+00 8.290e-03 -157.83 <2e-16 ***
annual_inc -2.426e-06 9.382e-08 -25.86 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 352917 on 370976 degrees of freedom
Residual deviance: 352151 on 370975 degrees of freedom
AIC: 352155
Number of Fisher Scoring iterations: 4
>res <- predict(mymodel_2, df_test, type = "response")
>confmatrix <- table(Actual_value = df_test$loan_status, Predicted_value = res >0.5)
>confmatrix
Predicted_value
Actual_value FALSE
0 1212481
1 271426
Moreover, when I searched for a solution to the issue on the Internet, I saw that it is often attributed to perfect separation, but my case predicts only 0's, and the analogous cases I have seen had small sample sizes. So far I am hesitant to implement penalised logistic regression, because I think my issue is not perfect separation. Also, it is worth pointing out that I want to use logistic regression specifically, due to the specifics of the research. How can I overcome the issue at hand?
As @deschen suggested, I used the ROSE resampling technique from the ROSE package for R and it solved my issue, although over-sampling, under-sampling, and a combination of both worked as well.
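For intuition on why the all-0 predictions happen in the first place: with roughly 19% defaults, the fitted probabilities from a weak single predictor hover around the base rate and never reach the default 0.5 cutoff. This base-R sketch uses simulated data with made-up income values and coefficients loosely resembling the summary above, not the asker's loan data:

```r
set.seed(1)
n <- 50000
inc <- rlnorm(n, meanlog = 11, sdlog = 0.6)   # hypothetical annual incomes
p_true <- plogis(-1.3 - 2.4e-6 * inc)         # roughly the fitted coefficients above
y <- rbinom(n, 1, p_true)
fit <- glm(y ~ inc, family = binomial)
phat <- predict(fit, type = "response")
max(phat)   # stays well below 0.5, so "phat > 0.5" labels every observation 0
# classify at the sample prevalence instead of the default 0.5 cutoff
table(pred = phat > mean(y), actual = y)
```

Resampling (as with ROSE) shifts the base rate toward 0.5; tuning the classification threshold is an alternative that leaves the fitted model untouched.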

How to implement a non linear model regression in R

I am quite new to both R and statistics and really need your help. I should analyze some data to find an analytical model that describes it.
I have 2 responses (y1, y2) and 4 predictors.
I thought of performing the analysis in R and followed these steps:
1) For each response, I tested a linear model (lm command) and I found:
Call:
lm(formula = data_mass$m ~ ., data = data_mass)
Residuals:
Min 1Q Median 3Q Max
-7.805e-06 -1.849e-06 -1.810e-07 2.453e-06 7.327e-06
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.367e-04 1.845e-05 -7.413 1.47e-06 ***
d 1.632e-04 1.134e-05 14.394 1.42e-10 ***
L 2.630e-08 1.276e-07 0.206 0.83927
D 1.584e-05 5.103e-06 3.104 0.00682 **
p 1.101e-06 1.195e-07 9.215 8.46e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.472e-06 on 16 degrees of freedom
Multiple R-squared: 0.9543, Adjusted R-squared: 0.9429
F-statistic: 83.51 on 4 and 16 DF, p-value: 1.645e-10
2) So I analyzed how good the model is by taking a look at plot(model) graphs.
Looking at the "residual vs fitted value" plot, the model should not be linear!! Is it correct?
3) I tried to eliminate some factors (like "L") and to introduce some quadratic terms (d^2 ; D^2), but the "residual vs fitted value" plot has the same trend.
What can I do now? Should I use a non-linear model?
Thank you to everyone who can help me =)
UPDATE:
Thank you again. I attached graph of plot(model) and DATA. The responses are m, Fz and the predictors d,L,D,p. The model is a linear model of response m.
[Residual vs Fitted][1]
[Normal Q-Q][2]
[Scale Location][3]
[Residual vs Leverage][4]
[DATA][5]
Looking the "residual vs fitted value" plot the model should not be linear!! Is it correct?
Yes and no. If the absolute value of the residuals has a strong correlation with the fitted values, that could mean heteroscedasticity (heterogeneity of variance).
Then the residuals would not be equally spread along the fitted values. Heteroscedasticity is one of the things you can look for in the residuals-vs-fitted graph, because it can invalidate statistical tests such as the *t*-tests reported by lm. You can also confirm it with the scale-location plot (which is quite similar but slightly better for this purpose).
On the other hand, a nonlinear pattern in this plot indicates nonlinearity, and you would probably want to change the structure of your model. Note that you want neither a linear nor a nonlinear relationship between residuals and fitted values: in the ideal scenario the values should be scattered more or less randomly and symmetrically around 0, between two parallel lines with zero slope.
You can find more discussion on the issue here: 1 2 3
What can I do now? Should I use a non-linear model?
If your diagnostic plots indicate nonlinearity, you may want to change/restructure/readjust your model (or transform the data) - there is some discussion on the options here
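One common first step before reaching for a fully non-linear model is to add polynomial terms (as the asker tried with d^2 and D^2) and test whether they improve the fit. A minimal base-R sketch with simulated data — the variable names echo the question, but the data and coefficients are invented:

```r
set.seed(7)
n <- 60
d <- runif(n); D <- runif(n)
m <- 1 + 2 * d^2 + 0.5 * D + rnorm(n, sd = 0.1)   # response truly quadratic in d
lin  <- lm(m ~ d + D)
quad <- lm(m ~ poly(d, 2, raw = TRUE) + D)
# nested-model F-test: does the quadratic term significantly improve the fit?
anova(lin, quad)
```

A significant F-test, together with a residuals-vs-fitted plot that loses its curved pattern, suggests the polynomial term was the missing structure; if the pattern persists, a genuinely non-linear model (e.g. via nls()) or a transformation of the response may be needed.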

Back-transforming gamma GLM to natural units to be able to predict values in unsampled locations

I'm working with ecological data, where I have used cameras to sample animal detections (converted to biomass) and run various models to identify the best-fitting model, chosen by looking at diagnostic plots, AIC and parameter effect sizes. The model is a gamma GLM (since biomass is a continuous response) with a log link. The chosen model has the predictor variables distance to water ("dist_water") and distance to forest patch ("dist_F3"). This is the model summary:
glm(formula = RAI_biomass ~ Dist_water.std + Dist_F3.std, family = Gamma(link = "log"),
data = biomass_RAI)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.3835 -1.0611 -0.3937 0.4355 1.5923
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.3577 0.2049 26.143 2.33e-16 ***
Dist_water.std -0.7531 0.2168 -3.474 0.00254 **
Dist_F3.std 0.5831 0.2168 2.689 0.01452 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for Gamma family taken to be 0.9239696)
Null deviance: 41.231 on 21 degrees of freedom
Residual deviance: 24.232 on 19 degrees of freedom
AIC: 287.98
Number of Fisher Scoring iterations: 7
The covariates were standardised prior to running the model. What I need to do now is to back-transform this model into natural units in order to predict biomass values at unsampled locations (in this case, farms). I made a table of each farm and their respective distance to water, and distance to forest patch. I thought the way to do this would be to use the exp(coef(biomass_glm)), but when I did this the dist_water.std coefficient changed direction and became positive.
exp(coef(biomass_glm8))
## Intercept Dist_water.std Dist_F3.std
## 212.2369519 0.4709015 1.7915026
To me this seems problematic: in the original GLM, increasing distance to water meant a decrease in biomass (which makes sense), but now we are seeing the opposite? The calculated biomass response also had a very narrow range, from 210.97 to 218.9331 (for comparison, in the original data biomass ranged from 3.04 to 2227.99).
I then tried to take the exponent of the entire model, without taking the exponent of each coefficient individually:
farms$biomass_est2 <- exp(5.3577 + (-0.7531*farms$Farm_dist_water_std) + (0.5831*farms$Farm_dist_F3_std))
and this gave me a new biomass response that makes a bit more sense, i.e. more variation given the variation in the two covariates (2.93-1088.84).
I then tried converting the coefficient estimates with exp(B) - 1, which again gave different results, although most similar to the ones obtained through exp(coef(biomass_glm)):
exp(-0.7531) - 1 # dist_water = -0.5290955
exp(0.5831) - 1 # dist_F3 = 0.7915837
exp(5.3577) - 1 # intercept = 211.2362
My question is, why are these estimates different, and what is the best way to take this gamma GLM with a log link and convert it into a format that can be used to calculate predicted values? Any help would be greatly appreciated!
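The safest route is usually to skip manual back-transformation entirely and let predict() apply the inverse link to the whole linear predictor, which is what the asker's second attempt (exponentiating the full linear combination) does by hand. A self-contained sketch with simulated data — coefficients loosely mimic the summary above, and the data are invented, not the camera-trap data:

```r
set.seed(3)
n <- 50
dist_water <- rnorm(n); dist_F3 <- rnorm(n)        # standardised covariates
mu <- exp(5.36 - 0.75 * dist_water + 0.58 * dist_F3)
y <- rgamma(n, shape = 2, rate = 2 / mu)           # Gamma draws with mean mu
fit <- glm(y ~ dist_water + dist_F3, family = Gamma(link = "log"))
newdata <- data.frame(dist_water = c(0, 1), dist_F3 = c(0, 0))
# predict() applies the inverse link for you: no manual exponentiation,
# and certainly no exponentiating coefficients one at a time
pred <- predict(fit, newdata, type = "response")
all.equal(pred, exp(predict(fit, newdata)))        # response scale = exp(link scale)
```

Exponentiating individual coefficients gives multiplicative effects (a one-unit increase in dist_water multiplies expected biomass by exp(-0.75) ≈ 0.47, i.e. a decrease — the sign has not really flipped), but those factors cannot be summed to produce predictions.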

What is the difference between Multiple R-squared and Adjusted R-squared in a single-variate least squares regression?

Could someone explain to the statistically naive what the difference between Multiple R-squared and Adjusted R-squared is? I am doing a single-variate regression analysis as follows:
v.lm <- lm(epm ~ n_days, data=v)
print(summary(v.lm))
Results:
Call:
lm(formula = epm ~ n_days, data = v)
Residuals:
Min 1Q Median 3Q Max
-693.59 -325.79 53.34 302.46 964.95
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2550.39 92.15 27.677 <2e-16 ***
n_days -13.12 5.39 -2.433 0.0216 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 410.1 on 28 degrees of freedom
Multiple R-squared: 0.1746, Adjusted R-squared: 0.1451
F-statistic: 5.921 on 1 and 28 DF, p-value: 0.0216
The "adjustment" in adjusted R-squared is related to the number of variables and the number of observations.
If you keep adding variables (predictors) to your model, R-squared will improve - that is, the predictors will appear to explain the variance - but some of that improvement may be due to chance alone. So adjusted R-squared tries to correct for this, by taking into account the ratio (N-1)/(N-k-1) where N = number of observations and k = number of variables (predictors).
It's probably not a concern in your case, since you have a single variate.
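The adjustment described above can be verified numerically against what summary.lm reports. This sketch uses simulated data shaped like the question's (30 observations, one predictor; the coefficients are invented):

```r
set.seed(5)
n <- 30; k <- 1                       # observations and predictors, as in the question
n_days <- 1:n
epm <- 2550 - 13 * n_days + rnorm(n, sd = 400)
fit <- lm(epm ~ n_days)
s <- summary(fit)
# adjusted R^2 = 1 - (1 - R^2) * (N - 1) / (N - k - 1)
adj <- 1 - (1 - s$r.squared) * (n - 1) / (n - k - 1)
all.equal(adj, s$adj.r.squared)
```

With k = 1 the penalty factor (N - 1)/(N - k - 1) is small, which is why the two values in the question (0.1746 vs 0.1451) are close.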
Some references:
How high, R-squared?
Goodness of fit statistics
Multiple regression
Re: What is "Adjusted R^2" in Multiple Regression
The Adjusted R-squared is close to, but different from, the value of R2. Instead of being based on the explained sum of squares SSR and the total sum of squares SSY, it is based on the overall variance (a quantity we do not typically calculate), s2T = SSY/(n - 1) and the error variance MSE (from the ANOVA table) and is worked out like this: adjusted R-squared = (s2T - MSE) / s2T.
This approach provides a better basis for judging the improvement in a fit due to adding an explanatory variable, but it does not have the simple summarizing interpretation that R2 has.
If I haven't made a mistake, you should verify the values of adjusted R-squared and R-squared as follows:
s2T <- sum(anova(v.lm)[[2]]) / sum(anova(v.lm)[[1]])
MSE <- anova(v.lm)[[3]][2]
adj.R2 <- (s2T - MSE) / s2T
On the other side, R2 is: SSR/SSY, where SSR = SSY - SSE
attach(v)
SSE <- deviance(v.lm) # or SSE <- sum((epm - predict(v.lm,list(n_days)))^2)
SSY <- deviance(lm(epm ~ 1)) # or SSY <- sum((epm-mean(epm))^2)
SSR <- (SSY - SSE) # or SSR <- sum((predict(v.lm,list(n_days)) - mean(epm))^2)
R2 <- SSR / SSY
The R-squared is not dependent on the number of variables in the model. The adjusted R-squared is.
The adjusted R-squared adds a penalty for adding variables to the model that are uncorrelated with the variable you're trying to explain. You can use it to test whether a variable is relevant to the thing you're trying to explain.
Adjusted R-squared is R-squared with some divisions added to make it dependent on the number of variables in the model.
Note that, in addition to number of predictive variables, the Adjusted R-squared formula above also adjusts for sample size. A small sample will give a deceptively large R-squared.
Ping Yin & Xitao Fan, J. of Experimental Education 69(2): 203-224, "Estimating R-squared shrinkage in multiple regression", compares different methods for adjusting r-squared and concludes that the commonly-used ones quoted above are not good. They recommend the Olkin & Pratt formula.
However, I've seen some indication that population size has a much larger effect than any of these formulas indicate. I am not convinced that any of these formulas are good enough to allow you to compare regressions done with very different sample sizes (e.g., 2,000 vs. 200,000 samples; the standard formulas would make almost no sample-size-based adjustment). I would do some cross-validation to check the r-squared on each sample.