Re-transform a linear model. Case study with R

Let's say I have a response variable which is not normally distributed and an explanatory variable. Let's create these two variables first (coded in R):
set.seed(12)
resp = (rnorm(120)+20)^3.79
expl = rep(c(1,2,3,4),30)
I fit a linear model and find that the residuals are not normally distributed. (I know that a Shapiro-Wilk test alone might not be enough to justify that conclusion, but that is not the point of my question.)
m1=lm(resp~expl)
shapiro.test(residuals(m1))
# p-value = 0.01794
Therefore I want to transform my response variable (looking for a suitable transformation with a Box-Cox procedure, for example).
m2=lm(resp^(1/3.79)~expl)
shapiro.test(residuals(m2))
# p-value = 0.4945
OK, now the normality of the residuals is no longer rejected, which is fine! I now want to make a graphical representation of my data and my model. But I do not want to plot my response variable in its transformed form because it would lose much of its intuitive meaning. Therefore I do:
plot(x=expl,y=resp)
What if I now want to add the model? I could do this
abline(m2) # m2 is the model with transformed variable
but of course the line does not fit the data represented. I could do this:
abline(m1) # m1 is the model with the original variable.
but it is not the model I ran for the statistics! How can I re-transform the line predicted by m2 so that it fits the data?

plotexpl <- seq(1,4,length.out=10)
predresp <- predict(m2,newdata=list(expl=plotexpl))
lines(plotexpl, predresp^(3.79))
I won't discuss the statistical issues here (e.g. a non-significant test does not mean that H0 is true and your model is not better than the mean).
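The same back-transformation idea extends to a confidence band. The sketch below reuses the simulated data from the question and assumes the exponent 3.79 is known; because the power function is monotone on positive values, the interval limits can be back-transformed the same way as the fitted values. Note that, under roughly symmetric errors on the transformed scale, the back-transformed curve is better read as a conditional median than as a conditional mean.

```r
# Simulated data and transformed-scale model, as in the question
set.seed(12)
resp <- (rnorm(120) + 20)^3.79
expl <- rep(c(1, 2, 3, 4), 30)
m2 <- lm(resp^(1/3.79) ~ expl)

# Predict on the transformed scale, including a 95% confidence interval
plotexpl <- seq(1, 4, length.out = 10)
pred <- predict(m2, newdata = data.frame(expl = plotexpl),
                interval = "confidence")

# Back-transform fit and interval limits to the original scale
back <- pred^3.79

# Plot the raw data and overlay the back-transformed model
plot(x = expl, y = resp)
lines(plotexpl, back[, "fit"])
lines(plotexpl, back[, "lwr"], lty = 2)
lines(plotexpl, back[, "upr"], lty = 2)
```

The matrix `back` has columns `fit`, `lwr` and `upr`, one row per value in `plotexpl`.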

Since you mentioned that the transformation might be based on the Box-Cox formula, I would like to point out an issue you might want to consider.
According to the Box-Cox transformation formula in Box, George E. P.; Cox, D. R. (1964), "An analysis of transformations", your implementation (if it is meant to be a Box-Cox one) might need a slight edit. The transformed y should be (y^(lambda)-1)/lambda instead of y^(lambda). (y^(lambda) on its own is the Tukey transformation, which is a distinct transformation formula.)
So, with lambda set to the exponent you used (1/3.79), the code should be:
lambda = 1/3.79
m2 = lm((resp^lambda - 1)/lambda ~ expl)
shapiro.test(residuals(m2))
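As a hedged illustration of how the two forms relate: for a fixed, non-zero lambda the Box-Cox transform (y^lambda - 1)/lambda is an affine function of y^lambda, so an OLS fit with an intercept gives residuals that differ only by the constant factor 1/lambda, and the Shapiro-Wilk statistic is the same for both. The helper names `box_cox` and `inv_box_cox` below are made up for this sketch, and lambda = 1/3.79 is assumed to match the question's exponent.

```r
# Same simulated data as in the question
set.seed(12)
resp <- (rnorm(120) + 20)^3.79
expl <- rep(c(1, 2, 3, 4), 30)

lambda <- 1 / 3.79  # assumed: the exponent used in the question

# Box-Cox transform and its inverse (lambda != 0 case only)
box_cox <- function(y, lambda) (y^lambda - 1) / lambda
inv_box_cox <- function(z, lambda) (lambda * z + 1)^(1 / lambda)

m_power  <- lm(resp^lambda ~ expl)             # plain power transform
m_boxcox <- lm(box_cox(resp, lambda) ~ expl)   # Box-Cox form

# Residuals of the two fits differ only by the factor 1/lambda
same_residuals <- all.equal(residuals(m_boxcox),
                            residuals(m_power) / lambda)

# The Shapiro-Wilk statistic does not change under rescaling
w_power  <- unname(shapiro.test(residuals(m_power))$statistic)
w_boxcox <- unname(shapiro.test(residuals(m_boxcox))$statistic)

# Fitted values mapped back to the original scale of resp
fitted_original <- inv_box_cox(fitted(m_boxcox), lambda)
```

Either form can therefore be used for the normality check; what matters is inverting with the matching formula when back-transforming predictions.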
More information
Correct implementation of Box-Cox transformation formula by boxcox() in R:
https://www.r-bloggers.com/on-box-cox-transform-in-regression-models/
A great comparison between Box-Cox transformation and Tukey transformation. http://onlinestatbook.com/2/transformations/box-cox.html
One could also find the Box-Cox transformation formula on Wikipedia:
en.wikipedia.org/wiki/Power_transform#Box.E2.80.93Cox_transformation
Please correct me if I misunderstood your implementation.

Related

How does fixest handle negative values of the demeaned dependent variable in poisson estimations?

I need to perform glm (poisson) estimations with fixed-effects (say merely unit FE) and several regressors (RHS variables). I have an unbalanced panel dataset where most (~90%) observations have missing values (NA) for some but not all regressors.
fixest::feglm() can handle this and returns my fitted model.
However, to do so, it (and fixest::demean too) removes observations that have at least one regressor missing, before constructing the fixed-effect means.
In my case, I am afraid this implies not using a significant share of available information in the data.
Therefore, I would like to demean my variables by hand, to be able to include as much information as possible in each fixed-effect dimension's mean, and then run feglm on the demeaned data. However, this implies getting negative dependent variable values, which is not compatible with Poisson. If I run feglm with "poisson" family and my manually demeaned data, I (coherently) get: "Negative values of the dependent variable are not allowed for the "poisson" family.". The same error is returned with data demeaned with the fixest::demean function.
Question:
How does feglm handle negative values of the demeaned dependent variable? Is there a way (like some data transformation) to reproduce fepois on a fixed-effect in the formula with fepois on demeaned data and a no fixed-effect formula?
To use the example from fixest::demean documentation (with two-way fixed-effects):
data(trade)
base = trade
base$ln_dist = log(base$dist_km)
base$ln_euros = log(base$Euros)
# We center the two variables ln_dist and ln_euros
# on the factors Origin and Destination
X_demean = demean(X = base[, c("ln_dist", "ln_euros")],
fe = base[, c("Origin", "Destination")])
base[, c("ln_dist_dm", "ln_euros_dm")] = X_demean
and I would like to reproduce
est_fe = fepois(ln_euros ~ ln_dist | Origin + Destination, base)
with
est = fepois(ln_euros_dm ~ ln_dist_dm, base)
I think there are two main problems.
Modelling strategy
In general, it is important to be able to formally describe the estimated model.
In this case it wouldn't be possible to write down the model with a single equation, where the fixed-effects are estimated using all the data and other variables only on the non-missing observations. And if the model is not clear, then... maybe it's not a good model.
On the other hand, if your model is well defined, then removing random observations should not change the expectation of the coefficients, only their variance. So again, if your model is well specified, you shouldn't worry too much.
By suggesting that observations with missing values are relevant to estimate the fixed-effects coefficients (or stated differently, that they are used to demean some variables) you are implying that these observations are not randomly distributed. And now you should worry.
Just using these observations to demean the variables wouldn't remove the bias on the estimated coefficients due to the selection to non-missingness. That's a deeper problem that cannot be removed by technical tricks but rather by a profound understanding of the data.
GLM
There is a misunderstanding with GLM. GLM is a super smart trick to estimate maximum likelihood models with OLS (there's a nice description here). It was developed and used at a time when regular optimization techniques were very expensive in terms of computational time, and it was a way to instead employ well developed and fast OLS techniques to perform equivalent estimations.
GLM is an iterative process in which a typical OLS estimation is performed at each step; the only changes from one iteration to the next concern the weights associated with each observation. Therefore, since each step is a regular OLS problem, techniques for fast OLS estimation with multiple fixed-effects can be leveraged (as is done in the fixest package).
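To make the iterative structure concrete, here is a minimal base-R sketch of textbook iteratively reweighted least squares for a Poisson model with a log link. It is not fixest's internal code; the simulated data and variable names are assumptions for illustration. In this standard formulation each step is a weighted least-squares fit in which both the working weights and the working response are recomputed from the current coefficients.

```r
# Hypothetical simulated Poisson data
set.seed(1)
n <- 200
x <- rnorm(n)
y <- rpois(n, lambda = exp(0.5 + 0.3 * x))
X <- cbind(intercept = 1, x = x)

# Start from an intercept-only guess
beta <- c(log(mean(y)), 0)

for (iter in 1:50) {
  eta <- drop(X %*% beta)        # linear predictor
  mu  <- exp(eta)                # fitted mean under the log link
  z   <- eta + (y - mu) / mu     # working response
  w   <- mu                      # working weights
  beta_new <- lm.wfit(X, z, w)$coefficients  # one weighted OLS step
  converged <- max(abs(beta_new - beta)) < 1e-10
  beta <- beta_new
  if (converged) break
}

# Reference fit from glm() for comparison
fit_glm <- glm(y ~ x, family = poisson)
irls_matches_glm <- all.equal(unname(beta), unname(coef(fit_glm)),
                              tolerance = 1e-6)
```

The weighted OLS step inside the loop is the only place where a demeaning shortcut for fixed effects could apply, and it has to be redone with new weights at every iteration.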
So actually, you could do what you want... but only within the OLS step of the GLM algorithm. By no means should you demean the data before running GLM because, well, it makes no sense (the FWL theorem has absolutely no hold here).

Estimation of a state-space model with lags in the measurement equation in R

I'm trying to estimate an SS model from this paper that has the following form:
Setting the order of the first lag polynomial to zero and the second one to one, we can reformulate it using terms from the MARSS package guide when applicable (x is the state, y is the observed variable, d is exogenous):
The MARSS package allows for estimation of a simpler model that doesn't include lagged variables in the measurement equation. Is there a way to estimate this one using MARSS or any other package without rewriting the estimation routine for this special case? Maybe there is a way to reformulate it so it could be "fed" to MARSS or some other package?
Take a look at how say the BSM Structural time series model or ARMA model is formulated as a MARSS model, aka a multivariate state-space model. That'll give you an idea of how to reform your model in multivariate state-space form.
Basically, your x will look like
See how the x_2 is just a dummy that is forced to be x(t-1)?
Now the y equation
The d and a are your D and A. I wrote them in lower case to indicate that they are scalars. But they can be matrices in general (if y is multivariate, say). Your inputs are the d_t and y_{t-1}. You prepare that 2x1xT matrix as an input.
Be careful with your initial condition specification. Probably best/easiest to set it at t=1 and estimate or use diffuse prior.
You can fit this model with MARSS. You can fit with any Kalman filter function that will allow you to pass in inputs in the y equation (some do, some don't). KFAS::KFS() allows that using the SScustom() function.
In MARSS the model list will look like so
mod.list=list(
B=matrix(list("b",1,0,0),2,2),
U=matrix(0,2,1),
Q=matrix(list("q",0,0,0),2,2),
Z=matrix(c("z", "c"),1,2),
A=matrix(0),
R=matrix("r"),
D=matrix(c("d", "a"),1,2),
x0=matrix(c("x1","x2"),2,1),
tinitx=1,
d=rbind(dt[2:TT],y[1:(TT-1)])
)
dat <- y[2:TT] # since you need y_{t-1} in the d (inputs)
fit <- MARSS(dat, model=mod.list)
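Reading the matrices back out of the model list above, the model being specified appears to be the following. This is a reconstruction from the code rather than a quotation, with $w_t$ and $v_t$ assumed to denote the process and observation errors, and U and A set to zero as in the list:

```latex
\begin{aligned}
\begin{bmatrix} x_{1,t} \\ x_{2,t} \end{bmatrix}
&= \begin{bmatrix} b & 0 \\ 1 & 0 \end{bmatrix}
   \begin{bmatrix} x_{1,t-1} \\ x_{2,t-1} \end{bmatrix}
 + \begin{bmatrix} w_t \\ 0 \end{bmatrix},
 \qquad w_t \sim \mathrm{N}(0, q), \\
y_t &= \begin{bmatrix} z & c \end{bmatrix}
       \begin{bmatrix} x_{1,t} \\ x_{2,t} \end{bmatrix}
     + \begin{bmatrix} d & a \end{bmatrix}
       \begin{bmatrix} d_t \\ y_{t-1} \end{bmatrix}
     + v_t,
 \qquad v_t \sim \mathrm{N}(0, r).
\end{aligned}
```

The second row of the state equation gives $x_{2,t} = x_{1,t-1}$, so the measurement equation reads $y_t = z\,x_{1,t} + c\,x_{1,t-1} + d\,d_t + a\,y_{t-1} + v_t$.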
It'll probably complain that it wants initial conditions for x0. Anything will work. The EM algorithm isn't sensitive to that like a BFGS or Newton algorithm. But method="BFGS" is actually often better for this type of structural ts model and in that case pick a reasonable initial condition for x (reasonable = close to your data in this case I think).

evaluate forecast by the terms of p-value and pearson correlation

I am using R to evaluate two different forecasting models. The basic idea is to compare the Pearson correlation and its corresponding p-value, obtained with cor.test(). The graph below shows the resulting correlation coefficients and their p-values.
We suggest that a model with a lower correlation coefficient but a correspondingly lower p-value (less than 0.05) is better (or, put the other way, better than a model with a higher correlation coefficient but a fairly high p-value).
So, in this case, we would say that overall model1 is better than model2.
But my question is: is there any other specific statistical method to quantify the comparison?
Thanks a lot !!!
I assume you're working with time series data, since you mention a "forecast". I think what you're really looking for is backtesting of your forecast model. You might want to take a look at the backtest.R function from Ruey S. Tsay's "An Introduction to Analysis of Financial Data with R".
backtest(m1,rt,orig,h,xre=NULL,fixed=NULL,inc.mean=TRUE)
# m1: is a time-series model object
# orig: is the starting forecast origin
# rt: the time series
# xre: the independent variables
# h: forecast horizon
# fixed: parameter constraint
# inc.mean: flag for constant term of the model.
Backtesting allows you to see how well your models perform on past data, and Tsay's backtest.R reports RMSE and mean absolute error, which give you another perspective beyond correlation. A caution: depending on the size of your data and the complexity of your model, this can be a very slow test to run.
To compare models you'll normally look at RMSE, which is essentially the standard deviation of your model's errors. The RMSEs of the two models are directly comparable, and smaller is better.
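As a rough sketch of such a comparison, the block below scores two toy forecasts on a held-out tail of a simulated series. Everything here is hypothetical (the series, the two models, the 100/20 split) and it does not use Tsay's backtest.R; it only shows how RMSE and mean absolute error can be computed and compared.

```r
# Hypothetical series with a linear trend plus noise
set.seed(42)
n <- 120
df <- data.frame(t = 1:n)
df$y <- 10 + 0.05 * df$t + rnorm(n)

# Time-ordered split: fit on the first 100 points, score on the last 20
train <- 1:100
test  <- 101:120

# Model 1: linear trend; model 2: constant at the training mean
fit_trend  <- lm(y ~ t, data = df[train, ])
pred_trend <- predict(fit_trend, newdata = df[test, ])
pred_mean  <- rep(mean(df$y[train]), length(test))

# Error measures
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))
mae  <- function(actual, predicted) mean(abs(actual - predicted))

actual <- df$y[test]
scores <- data.frame(
  model = c("trend", "mean"),
  rmse  = c(rmse(actual, pred_trend), rmse(actual, pred_mean)),
  mae   = c(mae(actual, pred_trend),  mae(actual, pred_mean))
)
print(scores)
```

The model with the smaller RMSE (and MAE) on the held-out points is the one that forecast those points more accurately.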
An even better alternative is to set up training, testing, and validation sets before you build your models. If you train two models on the same training / test data you can compare them against your validation set (which has never been seen by your models) to get a more accurate measurement of your model's performance measures.
One final alternative, if you have a "cost" associated with an inaccurate forecast, apply those costs to your predictions and add them up. If one model performs poorly on a more expensive segment of data, you may want to avoid using it.
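A minimal sketch of the cost idea, with entirely made-up numbers: each observation gets a hypothetical cost per unit of forecast error, and the models are ranked by total cost rather than by average error.

```r
# Hypothetical actual values and two sets of forecasts
actual <- c(100, 120, 90, 150)
pred_a <- c(101, 121, 91, 140)
pred_b <- c(105, 125, 95, 149)

# Hypothetical cost per unit of absolute error; the last point is expensive
cost_per_unit <- c(1, 1, 1, 5)

total_cost <- function(actual, predicted, cost) {
  sum(cost * abs(actual - predicted))
}

mae_a <- mean(abs(actual - pred_a))                 # 3.25
mae_b <- mean(abs(actual - pred_b))                 # 4
cost_a <- total_cost(actual, pred_a, cost_per_unit) # 53
cost_b <- total_cost(actual, pred_b, cost_per_unit) # 20
```

Here model A has the lower mean absolute error, but model B is cheaper once the expensive observation is weighted in, which is the situation the paragraph above warns about.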
As a side note, your interpretation of a p-value as "lower is better" is not quite right.
P-values address only one question: how likely are your data, assuming a true null hypothesis? They do not measure support for the alternative hypothesis.

R: Limit/Set values of predicted results from linear model

New to R.
Looking to limit the range of values that can be predicted.
df.Train <- data.frame(S=c(1,2,2,2,1),L=c(1,2,3,3,1),M=c(400,450,400,700,795),V=c(423,400,555,600,800),G=c(4,3.2,2,2.7,3.4), stringsAsFactors=FALSE)
m.Train <- lm(G~S+L+M+V,data=df.Train)
df.Test <- data.frame(S=c(1,2,1,2,1),L=c(1,2,3,1,1),M=c(400,450,500,800,795),V=c(423,475,555,600,555), stringsAsFactors=FALSE)
round(predict(m.Train, df.Test, type="response"),digits=1)
#seq(0,4,.1) #Predicted values should fall in this range
I've experimented with the predict() options but no luck.
Is there an option in predict? Should I be limiting it in the model?
Thank you
There are ways to transform your response variable (G in this case), but there needs to be a good reason to do so. For example, if you want the output to be probabilities between 0 and 1 and your response variable is binary (0, 1), then you need a logistic regression.
It all comes down to what data you have and whether a model / transformation of the response variable would be appropriate. In your example you do not specify what the data is and therefore we cannot say anything about which model or which transformation to use.
Setting the above aside, if you really care only about the prediction and do not care about the model or the transformation (but why wouldn't you care?), it looks like your data could use a quasipoisson generalised linear model, which might provide the output you need:
df.Train <- data.frame(S=c(1,2,2,2,1),L=c(1,2,3,3,1),M=c(400,450,400,700,795),V=c(423,400,555,600,800),G=c(4,3.2,2,2.7,3.4), stringsAsFactors=FALSE)
m.Train <- glm(G~S+L+M+V,data=df.Train, family=quasipoisson)
df.Test <- data.frame(S=c(1,2,1,2,1),L=c(1,2,3,1,1),M=c(400,450,500,800,795),V=c(423,475,555,600,555), stringsAsFactors=FALSE)
predict(m.Train, df.Test, type="response")
#        1        2        3        4        5
# 4.000000 2.840834 3.062754 3.615447 4.573276
#probably not as good as you want
The model is using a log link by default which ensures the values will be positive. There is no guarantee that the model will not predict values greater than 4 but since you fed it values of less than 4 (your G variable) then chances are that most of the predictions will follow that distribution (like in this example). You might then need to consider how to treat predictions that go above 4.
In general you should consider carefully which model and which response transformation to choose. The Poisson model above, for example, is usually used for count data. However, you should never manipulate predictions on your own, so if you choose the lm model in the end, make sure you use the predictions it gives.
EDIT
It looks like in your case a non-linear regression might be what you need. The problem with a linear model like lm is that predictions can be greater than the maximum or less than the minimum of the observed cases, in which case a linear regression might not be appropriate. There are algorithms that will never predict a value greater than the max or less than the min, and one of those might be better suited to your case. The k-nearest neighbor algorithm is one example:
library(FNN)
knn.reg(df.Train[1:4], test=df.Test[1:4], y=df.Train[5], k=3)
# Prediction:
# [1] 3.066667 3.066667 3.066667 2.700000 3.100000
As you can see, the predictions will never go above 4. That said, knn is a local solution algorithm, so again you need to research whether this is a good approach for your problem and your data. In terms of predictions, though, it definitely conforms to your conditions. Knn is a very easy to understand algorithm that relies on distances between points to calculate predictions.
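To see why a k-nearest-neighbour prediction cannot leave the observed range, here is a minimal base-R sketch. The helper `knn_predict` is hypothetical and is not FNN's implementation; it assumes unscaled Euclidean distance and does nothing special about ties. Each prediction is the mean of k observed responses, so it is bounded by the minimum and maximum of G in the training data.

```r
df.Train <- data.frame(S=c(1,2,2,2,1), L=c(1,2,3,3,1),
                       M=c(400,450,400,700,795), V=c(423,400,555,600,800),
                       G=c(4,3.2,2,2.7,3.4))
df.Test <- data.frame(S=c(1,2,1,2,1), L=c(1,2,3,1,1),
                      M=c(400,450,500,800,795), V=c(423,475,555,600,555))

# Hypothetical helper: average the responses of the k closest training rows
knn_predict <- function(train_x, train_y, test_x, k = 3) {
  train_x <- as.matrix(train_x)
  test_x  <- as.matrix(test_x)
  apply(test_x, 1, function(row) {
    # Euclidean distance from this test row to every training row
    d <- sqrt(colSums((t(train_x) - row)^2))
    mean(train_y[order(d)[seq_len(k)]])
  })
}

pred_knn <- knn_predict(df.Train[1:4], df.Train$G, df.Test[1:4], k = 3)
print(round(pred_knn, 6))

# Every prediction lies inside the observed range of G
in_range <- all(pred_knn >= min(df.Train$G) & pred_knn <= max(df.Train$G))
```

On this small data set the sketch should give the same five predictions as the knn.reg call above.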
Hope it helps :)

How does predict deal with models that include the AsIs function?

I have the model lm(y~x+I(log(x))) and I would like to use predict to get predictions for a new data frame containing new values of x, based on my model. How does predict deal with the AsIs function I in the model? Does the I(log(x)) need to be specified separately in the newdata argument of predict, or does predict understand that it should construct and use I(log(x)) from x?
UPDATE
@DWin: The way the variables enter the model affects the coefficients, especially for interactions. My example is simplistic, but try this out:
x<-rep(seq(0,100,by=1),10)
y<-15+2*rnorm(1010,10,4)*x+2*rnorm(1010,10,4)*x^(1/2)+rnorm(1010,20,100)
z<-x^2
plot(x,y)
lm1<-lm(y~x*I(x^2))
lm2<-lm(y~x*x^2)
lm3<-lm(y~x*z)
summary(lm1)
summary(lm2)
summary(lm3)
You see that lm1=lm3, but lm2 is something different (only 1 coefficient). Assuming you don't want to create the dummy variable z (computationally inefficient for large datasets), the only way to build an interaction model like lm3 is with I. Again this is a very simplistic example (that may make no statistical sense) however it makes sense in complicated models.
@Ben Bolker: I would like to avoid guessing and would rather ask for an authoritative answer (I can't directly check this with my models since they are much more complicated than the example). My guess is that predict correctly constructs the I(log(x)) term.
You do not need to make your variable names look like the term I(x). Just use "x" in the newdata argument.
The reason lm(y~x*I(x^2)) and lm(y~x*x^2) are different is that "^" and "*" are reserved symbols for formula in R. That's not the case with the log function. It is also incorrect that interactions can only be constructed with I(). If you wanted a second degree polynomial in R you should use poly(x, 2). If you build with I(log(x)) or with just log(x) you should get the same model. Both of them will get transformed to the predictor value properly with predict if you use:
newdata=data.frame( x=seq( min(x), max(x), length=10) )
Using poly will protect you from incorrect inferences that are so commonly caused by the use of I(x^2).
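A small check of both points, on simulated data chosen only for illustration: predict() rebuilds I(log(x)) from a newdata column named x, the I(log(x)) and log(x) versions give the same predictions, and the formula parser treats x*x^2 differently from x*I(x^2).

```r
# Hypothetical data; x is kept strictly positive so log(x) is defined
set.seed(1)
x <- seq(1, 100, by = 1)
y <- 3 + 0.5 * x + 2 * log(x) + rnorm(length(x))

fit_asis  <- lm(y ~ x + I(log(x)))
fit_plain <- lm(y ~ x + log(x))

# newdata only needs a column called x
new_x <- data.frame(x = seq(min(x), max(x), length.out = 10))
pred_asis  <- predict(fit_asis,  newdata = new_x)
pred_plain <- predict(fit_plain, newdata = new_x)

# The same predictions computed by hand from the coefficients
b <- unname(coef(fit_asis))
pred_manual <- b[1] + b[2] * new_x$x + b[3] * log(new_x$x)

# How the formula parser expands the two interaction formulas
labels_plain <- attr(terms(y ~ x * x^2),    "term.labels")
labels_asis  <- attr(terms(y ~ x * I(x^2)), "term.labels")
```

`labels_plain` contains only `x`, while `labels_asis` contains `x`, `I(x^2)` and their interaction, which is why lm2 in the question differs from lm1 and lm3.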
