I am quite new to both R and statistics and really need your help. I need to analyze some data to find an analytical model that describes it.
I have two responses (y1, y2) and four predictors.
I thought of performing the analysis using R and followed these steps:
1) For each response, I tested a linear model (lm command) and I found:
Call:
lm(formula = data_mass$m ~ ., data = data_mass)
Residuals:
Min 1Q Median 3Q Max
-7.805e-06 -1.849e-06 -1.810e-07 2.453e-06 7.327e-06
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.367e-04 1.845e-05 -7.413 1.47e-06 ***
d 1.632e-04 1.134e-05 14.394 1.42e-10 ***
L 2.630e-08 1.276e-07 0.206 0.83927
D 1.584e-05 5.103e-06 3.104 0.00682 **
p 1.101e-06 1.195e-07 9.215 8.46e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.472e-06 on 16 degrees of freedom
Multiple R-squared: 0.9543, Adjusted R-squared: 0.9429
F-statistic: 83.51 on 4 and 16 DF, p-value: 1.645e-10
2) Then I checked how good the model is by looking at the plot(model) diagnostic graphs.
Looking at the "residuals vs fitted values" plot, it seems the model should not be linear! Is that correct?
3) I tried eliminating some factors (like "L") and introducing quadratic terms (d^2, D^2), as in the sketch below, but the "residuals vs fitted values" plot shows the same trend.
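A minimal sketch of that attempt, using the same data_mass data frame and predictor names as in the summary above:

model2 <- lm(m ~ d + I(d^2) + D + I(D^2) + p, data = data_mass)  # drop L, add quadratic terms
plot(model2)  # the residuals vs fitted plot still shows the same trend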
What can I do now? Should I use a non-linear model?
Thank you to everyone who can help me =)
UPDATE:
Thank you again. I have attached the plot(model) graphs and the data. The responses are m and Fz, and the predictors are d, L, D, p. The model shown is the linear model for the response m.
(Attached: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage plots, plus the data.)
Looking the "residual vs fitted value" plot the model should not be linear!! Is it correct?
Yes and no. If the absolute values of the residuals are strongly correlated with the fitted values, that could indicate heteroscedasticity (heterogeneity of variance).
In that case the residuals would not be equally spread along the range of fitted values. Heteroscedasticity is one of the things you can look for on the residuals vs fitted plot, because it can invalidate statistical tests such as the *t*-tests reported by lm. You can also confirm it with the scale-location plot (which is quite similar, but slightly better suited for this purpose).
On the other hand, a curved pattern indicates nonlinearity, and then you would probably want to change the structure of your model. Ideally you want neither a linear nor a nonlinear relationship between residuals and fitted values: in the ideal case the residuals are more or less randomly and symmetrically scattered around 0, between two parallel lines with zero slope.
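If you want a more formal check to go with the plots, here is a small sketch (it assumes your fitted lm object is called model, and that the lmtest and car packages are installed):

library(lmtest)
library(car)
bptest(model)            # Breusch-Pagan test: a small p-value suggests heteroscedasticity
ncvTest(model)           # score test for non-constant error variance
plot(model, which = 3)   # the scale-location plot mentioned above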
You can find more discussion on the issue here: 1 2 3
What can I do now? Should I use a non-linear model?
If your diagnostic plots indicate nonlinearity, you may want to change or restructure your model (or transform the data); there is some discussion of the options here.
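For example, here is a rough sketch of two options, reusing the names from your lm() summary; whether either one is appropriate depends on your data, so treat it only as a starting point:

fit_log <- lm(log(m) ~ d + L + D + p, data = data_mass)   # transform the response (only if m > 0)
plot(fit_log, which = 1)                                   # re-check residuals vs fitted
# or fit an explicitly nonlinear model, here a hypothetical power-law form, via nls();
# the starting values are placeholders and will usually need tuning
fit_nls <- nls(m ~ a * d^b1 * D^b2 * p^b3, data = data_mass,
               start = list(a = 1, b1 = 1, b2 = 1, b3 = 1))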
I'm working on question C9 in Wooldridge's Introductory Econometrics textbook. It asks you to obtain the unweighted fitted values and residuals from a weighted least squares (WLS) regression. Does the following code give me the weighted or unweighted fitted values and residuals?
fitted(wlsmodel)
resid(wlsmodel)
I'm getting different answers from those in the textbook, and I think it must be because the code I'm entering is giving me the weighted fitted values and residuals. If that is the case, is there a way to get the unweighted fitted values and residuals from a WLS regression?
Okay, I've figured it out.
Chapter 8, question C9
(i) Obtain the OLS estimates in equation (8.35)
library(wooldridge)
reg<-lm(cigs~log(income)+log(cigpric)+educ+age+I(age^2)+restaurn,data=smoke)
(ii) Obtain the hhat used in the WLS estimation of equation (8.36) and reproduce equation (8.36). From this equation, obtain the unweighted residuals and fitted values; call these uhat and yhat, respectively.
uhat <- resid(reg)       # OLS residuals
uhat2 <- uhat^2
ghat <- fitted(lm(log(uhat2) ~ log(income) + log(cigpric) + educ + age + I(age^2) + restaurn, data = smoke))
hhat <- exp(ghat)        # estimated variance function
wls <- lm(cigs ~ log(income) + log(cigpric) + educ + age + I(age^2) + restaurn, weights = 1/hhat, data = smoke)
uhatwls <- resid(wls)    # unweighted WLS residuals
yhatwls <- fitted(wls)   # unweighted WLS fitted values
(iii) Let utilde=uhat/sqrt(hhat) and ytilde=yhat/sqrt(hhat) be the weighted quantities. Carry out the special case of the white test for heteroskedasticity by regressing utilde^2 on ytilde and ytilde^2, being sure to include an intercept, as always. Do you find heteroskedasticity in the weighted residuals?
utilde <- uhatwls / sqrt(hhat)   # weighted residuals
ytilde <- yhatwls / sqrt(hhat)   # weighted fitted values
utilde2 <- utilde^2
ytilde2 <- ytilde^2
whitetest <- lm(utilde2 ~ ytilde + ytilde2)   # special case of the White test
summary(whitetest)
Call:
lm(formula = utilde2 ~ ytilde + ytilde2)
Residuals:
Min 1Q Median 3Q Max
-5.579 -1.801 -1.306 -0.855 90.871
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.06146 0.96043 -0.064 0.94899
ytilde 0.28667 1.41212 0.203 0.83918
ytilde2 2.40597 0.78615 3.060 0.00228 **
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 6.414 on 804 degrees of freedom
Multiple R-squared: 0.027, Adjusted R-squared: 0.02458
F-statistic: 11.15 on 2 and 804 DF, p-value: 1.667e-05
The process above gives the correct answers from the solutions manual, so I know it has been done correctly. The thing that was confusing me was the request to obtain the 'unweighted' residuals from the WLS. It turns out that these are just the residuals obtained by default from that regression, which are then weighted in part (iii) of the question, as above. The goal is then to test the WLS regression for heteroskedasticity, which is indeed present.
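A quick way to see this in R (just a sketch; it assumes the wls, hhat and utilde objects from above, and no zero weights): resid() and fitted() on a weighted lm return the plain, unweighted response-scale quantities, while the weighted residuals used in part (iii) are also available directly via weighted.residuals().

# should be TRUE: weighted.residuals() applies sqrt(weights) to the unweighted residuals
all.equal(unname(weighted.residuals(wls)), unname(utilde))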
I have a time series of two variables representing two currencies: SYP (Syrian pound) and LBP (Lebanese pound). The data represent the daily values of both currencies over the previous six months. I previously ran a standard regression model using SYP as the dependent variable and LBP as the independent variable. These are the results:
SYPts <- ts(SYP_LBP)
modelSYPLBP <- tslm(SYP ~ LBP, data = SYPts)
summary(modelSYPLBP)
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -835.77100 228.93013 -3.651 0.000319 ***
LBP 0.41801 0.02744 15.235 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 324.9 on 248 degrees of freedom
Multiple R-squared: 0.4834, Adjusted R-squared: 0.4814
F-statistic: 232.1 on 1 and 248 DF, p-value: < 2.2e-16
However, when I try to run a dynamic regression model using the auto.arima call from the forecast package I obtain:
model <- auto.arima(SYPts[, "SYP"], xreg = SYPts[, "LBP"], stationary = FALSE)
summary(model)
Series: SYPts[, "SYP"]
Regression with ARIMA(0,1,1) errors
Coefficients:
ma1 drift xreg
-0.4654 5.6911 -0.0163
s.e. 0.0591 3.3376 0.0315
sigma^2 estimated as 9744: log likelihood=-1495.39
AIC=2998.78 AICc=2998.94 BIC=3012.85
Training set error measures:
ME RMSE MAE MPE MAPE MASE
Training set -0.06985184 97.92004 46.45812 -0.1826244 1.792092 1.18404
ACF1
Training set 0.001792873
The coefficient of the regression is now negative and a lot smaller than the one obtained using the standard regression model.
Furthermore, when I try to forecast with this model assuming no change in LBP over the next 15 days, I obtain a quite flat forecast curve; and since the coefficient estimate is negative, the curve becomes flatter and then turns negative as I increase the forecast values of LBP.
I would like to ask:
1 - Did I make some specific error in preparing the data and the model? For example, I did not set any frequency for the time series. Should I?
2 - Should I difference the data before creating the model? In the call I used stationary = FALSE, because with stationary = TRUE I obtain a very low p-value in the checkresiduals test.
3 - I am not sure whether I am doing something wrong when generating the forecast values for the independent variable (LBP). In the call forecast <- forecast(model, xreg = rep(10000, 15)) I assume that the two arguments of rep represent, respectively, the value of LBP and the number of days it should be repeated. As 10000 was the latest value of LBP in the time series, using it assumes that no change occurs over the next 15 days. Is that correct?
Thank you
(This is a very brief answer, just to complete the thread, and in case someone chances upon this question again).
Check the residuals from the tslm() regression for autocorrelation with checkresiduals(modelSYPLBP) (please look up ?checkresiduals for the interpretation). If the tests indicate, as is quite likely with time series regressions, that the regression residuals are autocorrelated, then inference based on the coefficient estimates is not valid.
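A minimal sketch of that check, together with the 15-day forecast from the question (this assumes the forecast package and the SYPts object defined above; the constant xreg value is simply the last observed LBP):

library(forecast)
checkresiduals(modelSYPLBP)                                     # Breusch-Godfrey test plus residual plots for the tslm() fit
fit <- auto.arima(SYPts[, "SYP"], xreg = SYPts[, "LBP"])        # regression with ARIMA errors, as in the question
checkresiduals(fit)                                             # check the dynamic regression residuals too
fc <- forecast(fit, xreg = rep(tail(SYPts[, "LBP"], 1), 15))    # hold LBP constant for 15 days
autoplot(fc)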
I am trying to run a logistic regression model to predict the default probabilities of individual loans. I have a large sample of 1.85 million observations, about 81% of which were fully paid off and the rest defaulted. I ran the logistic regression with 20+ predictors that were statistically significant and got the warning "fitted probabilities numerically 0 or 1 occurred"; by adding predictors step by step, I found that only one predictor was causing this problem, the annual income (annual_inc). I then ran a logistic regression with only this predictor and found that it predicts only 0's (fully paid off loans), although there is a significant proportion of defaulted loans. I tried different proportions of training and testing data. If I split the data so that 80% of the original sample goes to the test set and 20% to the training set, R doesn't show the fitted-probabilities warning, but the model still predicts only 0's on the test set. Below I attach the relevant bit of code, just in case. I doubt that adding a small sample of my data would be of any use given the circumstances, but if I am mistaken, please let me know and I will add it.
set.seed(42)
indexes <- sample(1:nrow(df), 0.8 * nrow(df))
df_test = df[indexes,]
df_train = df[-indexes,]
mymodel_2 <- glm(loan_status ~ annual_inc, data = df_train, family = 'binomial')
summary(mymodel_2)
Call:
glm(formula = loan_status ~ annual_inc, family = "binomial",
data = df_train)
Deviance Residuals:
Min 1Q Median 3Q Max
-0.6902 -0.6530 -0.6340 -0.5900 5.4533
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.308e+00 8.290e-03 -157.83 <2e-16 ***
annual_inc -2.426e-06 9.382e-08 -25.86 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 352917 on 370976 degrees of freedom
Residual deviance: 352151 on 370975 degrees of freedom
AIC: 352155
Number of Fisher Scoring iterations: 4
res <- predict(mymodel_2, df_test, type = "response")
confmatrix <- table(Actual_value = df_test$loan_status, Predicted_value = res > 0.5)
confmatrix
Predicted_value
Actual_value FALSE
0 1212481
1 271426
Moreover, when I searched for a solution to this issue on the Internet, I saw it is often attributed to perfect separation, but my model predicts only 0's, and the analogous cases I have seen had small sample sizes. So far I am hesitant to implement penalised logistic regression, because I do not think my issue is perfect separation. It is also worth pointing out that I want to use logistic regression specifically, due to the specifics of my research. How can I overcome this issue?
As @deschen suggested, I used the ROSE resampling technique from the ROSE package for R and it solved my issue, although over-sampling, under-sampling, and a combination of both worked as well.
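A minimal sketch of what that looked like (it assumes loan_status is a two-level factor; ovun.sample() with method = "over", "under" or "both" can be used in the same way):

library(ROSE)
set.seed(42)
df_train_rose <- ROSE(loan_status ~ annual_inc, data = df_train)$data   # synthetically balanced training set
table(df_train_rose$loan_status)                                        # classes are now roughly balanced
mymodel_rose <- glm(loan_status ~ annual_inc, data = df_train_rose, family = 'binomial')
summary(mymodel_rose)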
I built a random forest model in R to predict a continuous target variable, and I have just run into a problem with my customers. I was trying to tell them that RMSE is a good way to understand how good my model is, but they are asking me for some accuracy-like value, expressed as a percentage, to ease their understanding.
So what I have is a vector of continuous predicted values and the continuous actual values from my training set. What I have tried so far:
I computed the MAE of the two vectors, but they refused to accept MAE (since it is not the kind of answer they want, i.e. "**% accuracy").
I computed the correlation between the two, got 0.80, and they didn't like that either.
I tried binning the predicted values into factors, but there was no good way to rank them, and the accuracy with the generated factor was just over 45%. There is no way they will accept that.
So,
summary(lm(prediction~MAINDATA5_TEST_18FW$WEEK-1))
Call:
lm(formula = prediction ~ as.numeric(as.character(MAINDATA5_TEST_18FW$PLCWEEK)) -
1)
Residuals:
Min 1Q Median 3Q Max
-6.8077 -1.1036 0.3121 1.6788 7.0615
Coefficients:
Estimate Std. Error t value Pr(>|t|)
as.numeric(as.character(MAINDATA5_TEST_18FW$PLCWEEK)) 1.066786 0.006516 163.7 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2.168 on 460 degrees of freedom
Multiple R-squared: 0.9831, Adjusted R-squared: 0.9831
F-statistic: 2.68e+04 on 1 and 460 DF, p-value: < 2.2e-16
I tried the lm() fit above to produce more numbers, and also told them that the R-squared between the predicted and actual values is above 0.98, which is pretty high. It was not enough to make them happy (again, they are obsessed with percentages).
What caught my eye is the gradient, 1.066786. I was wondering whether I can say that the model's predicted values are about 6.7% higher than the actual values on average. Is this interpretation reasonable? If this is all nonsense, I will be very happy to get suggestions too. Thanks.
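For reference, a small sketch of how I would sanity-check that figure; I am assuming the actual values are the as.numeric(as.character(MAINDATA5_TEST_18FW$PLCWEEK)) vector that appears in the lm() call in the output:

actual <- as.numeric(as.character(MAINDATA5_TEST_18FW$PLCWEEK))
mean(prediction / actual)                 # average predicted/actual ratio (close to, but not the same as, the slope)
mean(abs(prediction - actual) / actual)   # MAPE-style percentage error; 1 minus this is a crude "% accuracy"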
Could someone explain to the statistically naive what the difference between Multiple R-squared and Adjusted R-squared is? I am doing a single-variate regression analysis as follows:
v.lm <- lm(epm ~ n_days, data=v)
print(summary(v.lm))
Results:
Call:
lm(formula = epm ~ n_days, data = v)
Residuals:
Min 1Q Median 3Q Max
-693.59 -325.79 53.34 302.46 964.95
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2550.39 92.15 27.677 <2e-16 ***
n_days -13.12 5.39 -2.433 0.0216 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 410.1 on 28 degrees of freedom
Multiple R-squared: 0.1746, Adjusted R-squared: 0.1451
F-statistic: 5.921 on 1 and 28 DF, p-value: 0.0216
The "adjustment" in adjusted R-squared is related to the number of variables and the number of observations.
If you keep adding variables (predictors) to your model, R-squared will improve - that is, the predictors will appear to explain the variance - but some of that improvement may be due to chance alone. So adjusted R-squared tries to correct for this, by taking into account the ratio (N-1)/(N-k-1) where N = number of observations and k = number of variables (predictors).
It's probably not a concern in your case, since you have a single variate.
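In formula terms, this is 1 - (1 - R^2) * (N - 1) / (N - k - 1). A small sketch using the numbers from your output (N = 30, since the residual degrees of freedom are 28 with one predictor, and k = 1):

R2 <- 0.1746
N <- 30
k <- 1
1 - (1 - R2) * (N - 1) / (N - k - 1)   # about 0.1451, matching the Adjusted R-squared in the summary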
Some references:
How high, R-squared?
Goodness of fit statistics
Multiple regression
Re: What is "Adjusted R^2" in Multiple Regression
The Adjusted R-squared is close to, but different from, the value of R2. Instead of being based on the explained sum of squares SSR and the total sum of squares SSY, it is based on the overall variance (a quantity we do not typically calculate), s2T = SSY/(n - 1) and the error variance MSE (from the ANOVA table) and is worked out like this: adjusted R-squared = (s2T - MSE) / s2T.
This approach provides a better basis for judging the improvement in a fit due to adding an explanatory variable, but it does not have the simple summarizing interpretation that R2 has.
If I haven't made a mistake, you should verify the values of adjusted R-squared and R-squared as follows:
s2T <- sum(anova(v.lm)[[2]]) / sum(anova(v.lm)[[1]])   # SSY / (n - 1): column 2 of the ANOVA table is Sum Sq, column 1 is Df
MSE <- anova(v.lm)[[3]][2]                             # Mean Sq of the residuals row
adj.R2 <- (s2T - MSE) / s2T
On the other hand, R2 is SSR/SSY, where SSR = SSY - SSE:
attach(v)                      # so that epm and n_days are visible without the v$ prefix
SSE <- deviance(v.lm)          # or SSE <- sum((epm - predict(v.lm, list(n_days)))^2)
SSY <- deviance(lm(epm ~ 1))   # or SSY <- sum((epm - mean(epm))^2)
SSR <- SSY - SSE               # or SSR <- sum((predict(v.lm, list(n_days)) - mean(epm))^2)
R2 <- SSR / SSY
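To cross-check, both quantities are also reported by summary(); a quick sketch:

all.equal(R2, summary(v.lm)$r.squared)           # should be TRUE
all.equal(adj.R2, summary(v.lm)$adj.r.squared)   # should be TRUE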
The R-squared does not take the number of variables in the model into account. The adjusted R-squared does.
The adjusted R-squared adds a penalty for adding variables to the model that are uncorrelated with the variable you're trying to explain. You can use it to check whether a variable is relevant to the thing you're trying to explain.
Adjusted R-squared is R-squared with a correction factor applied to make it depend on the number of variables in the model.
Note that, in addition to the number of predictor variables, the adjusted R-squared formula above also adjusts for sample size. A small sample will give a deceptively large R-squared.
Ping Yin & Xitao Fan, J. of Experimental Education 69(2): 203-224, "Estimating R-squared shrinkage in multiple regression", compares different methods for adjusting r-squared and concludes that the commonly-used ones quoted above are not good. They recommend the Olkin & Pratt formula.
However, I've seen some indication that sample size has a much larger effect than any of these formulas suggest. I am not convinced that any of these formulas are good enough to allow you to compare regressions done with very different sample sizes (e.g., 2,000 vs. 200,000 samples; the standard formulas would make almost no sample-size-based adjustment). I would do some cross-validation to check the r-squared on each sample.