I have a panel dataset of firms with data from 2010 to 2019 and I need to estimate this model:
LabourDemand_{i,t} = f(LabourDemand_{i,t-1}, Automation_{i,t-1}, Automation_{i,t-2}, Automation_{i,t-3}, Automation_{i,t-4}, LabourCostPerEmployee_{i,t}, ValueAdded_{i,t}, GrossInvestments_{i,t}, ControlVariables_{i,t})
To solve the endogeneity problem of the model, I need to use the generalized method of moments (GMM). In the model, endogeneity is due to:
the presence of the lagged dependent variable in the model
other covariates in the model
All the explanatory variables must be instrumented, using up to thrice-lagged instruments. As instruments for the level equations I need to use the differenced values of the independent variables, i.e. up to thrice-lagged differences of labour demand, automation, labour costs, etc.
The level equations must also include a set of industry and year dummies as controls.
Moreover, I have to run some tests: Wald test, autocorrelation tests, and Hansen test.
I don't know how to start.
Can you please tell me the R commands I need to apply?
Thank you very much for your help!
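A minimal sketch of how you could start, using the Blundell-Bond "system GMM" estimator via pgmm() from the plm package. The data frame, variable names, and exact instrument lag ranges below are placeholders to adapt to your data:

library(plm)

# One row per firm-year
pdata <- pdata.frame(mydata, index = c("firm", "year"))

sysgmm <- pgmm(
  labour ~ lag(labour, 1) + lag(automation, 1:4) +
    labour_cost + value_added + gross_inv |          # regressors
    lag(labour, 2:4) + lag(automation, 2:4) +        # GMM-style instruments
    lag(labour_cost, 2:4) + lag(value_added, 2:4) +
    lag(gross_inv, 2:4),
  data = pdata,
  effect = "twoways",       # firm effects plus year dummies
  model = "twosteps",       # two-step GMM
  transformation = "ld")    # "ld" = system GMM (levels + differences)

summary(sysgmm)  # reports Wald tests, AR(1)/AR(2) tests, Sargan overidentification test

With transformation = "ld", pgmm() automatically adds the lagged differences as instruments for the level equations. The year dummies come from effect = "twoways"; industry dummies are time-invariant, so they drop out of the differenced equations, but you can try adding them to the formula so that they enter the level equations.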
I'm forecasting many commodity price series. One, for example, is a commodity priced in Norwegian krone, so I am adding the NKR/USD exchange rate as an exogenous variable. However, there is likely to be a delay between the exchange rate changing and the change being reflected in the commodity price.
With linear models, I would first find the lag or lead at which the predictor series correlates most strongly with the response series, and use that in the model, e.g. lm(commodity ~ lag(demand, k) + lag(trade_flows, k)). I would expect the same to be true for ARIMA models, to ensure the exogenous series are properly aligned with the series being modelled.
1) When adding exogenous variables to ARIMA models, e.g. auto.arima(commodity, xreg = nkr_usd), is it advantageous to lag/lead the exogenous variables?
2) If the answer to 1 is "Yes, it is better to lag/lead", will the optimal time shift be the one at which the series are most strongly correlated? Or is another test better for selecting the shift?
A test would be better than producing many time-shifted models and comparing them.
There is not (yet) any coding to show, but it will follow based on the answer.
Thank you for any insights,
TC
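One way to make this concrete, as a sketch (the series names are placeholders, and both are assumed to be ts objects on the same frequency): fit the same specification with the regressor shifted by each candidate lag and compare information criteria, which judges each shift inside the model you will actually use rather than by raw correlation alone.

library(forecast)

lags <- 0:6
fits <- lapply(lags, function(k) {
  shifted <- stats::lag(nkr_usd, -k)        # exchange rate k periods earlier
  dat <- ts.intersect(commodity, shifted)   # align series, drop rows lost to shifting
  auto.arima(dat[, 1], xreg = dat[, 2])
})
data.frame(lag = lags, AICc = sapply(fits, function(f) f$aicc))

The cross-correlation function, e.g. ccf(diff(commodity), diff(nkr_usd)), is a reasonable first screen for candidate lags before running the comparison above.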
I have behavioral data for many groups of birds over 10 days of observation. I want to investigate whether there is a temporal pattern in some behaviors (e.g. does mate competition increase over time?), and I was told that I have to account for the autocorrelation of the data, since behavior is unlikely to be independent between days.
However I was wondering about two things:
Since I'm not interested in the differences in y among days but in the trend of y over days, do I still need to correct for autocorrelation?
If yes, how do I control for the autocorrelation so that I'm left with only the signal (and noise, of course)?
For the second question, keep in mind that I will be analyzing the effect of time on behavior using mixed models in R (since there are random effects, and pseudo-replication to deal with), but I have not found any straightforward method of correcting for autocorrelation when modeling the responses.
(1) Yes, you should check for/account for autocorrelation.
The first example here shows how to estimate trends in a mixed model while accounting for autocorrelation.
You can fit these models with lme from the nlme package. Here's a mixed model without autocorrelation included:
cmod_lme <- lme(GS.NEE ~ cYear,
                data = mc2, method = "REML",
                random = ~ 1 + cYear | Site)
and you can explore the autocorrelation by using plot(ACF(cmod_lme)).
(2) Add a correlation structure to the model, something like this:
cmod_lme_acor <- update(cmod_lme,
                        correlation = corAR1(form = ~ cYear | Site))
@JeffreyGirard notes that to check the ACF after updating the model to include the correlation argument, you will need to use plot(ACF(cmod_lme_acor, resType = "normalized")).
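If you want to check whether the AR(1) structure actually improves the fit, the two models can be compared directly (both are fitted by REML with identical fixed effects, so the comparison is valid):

anova(cmod_lme, cmod_lme_acor)  # AIC/BIC plus a likelihood-ratio test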
(I am using R and the lqmm package)
I was wondering how to account for autocorrelation in a linear quantile mixed model (LQMM).
I have a data frame that looks like this:
df1 <- data.frame(Time = seq(as.POSIXct("2017-11-13 00:00:00", tz = "UTC"),
                             as.POSIXct("2017-11-13 00:01:59", tz = "UTC"), "sec"),
                  HeartRate = rnorm(120, mean = 60, sd = 10),
                  Treatment = rep("TreatmentA", 120),
                  AnimalID = rep("ID01", 120),
                  Experiment = rep("Exp01", 120))
df2 <- data.frame(Time = seq(as.POSIXct("2017-08-11 00:00:00", tz = "UTC"),
                             as.POSIXct("2017-08-11 00:01:59", tz = "UTC"), "sec"),
                  HeartRate = rnorm(120, mean = 62, sd = 14),
                  Treatment = rep("TreatmentB", 120),
                  AnimalID = rep("ID02", 120),
                  Experiment = rep("Exp02", 120))
df <- rbind(df1, df2)
head(df)
With:
Heart rate (HeartRate) measured every second on several animals (AnimalID). These measurements are taken during experiments (Experiment) with different possible treatments (Treatment). Each animal (AnimalID) was observed in multiple experiments with different treatments. I wish to look at the effect of the variable Treatment on the 90th percentile of the heart rates, including Experiment as a random effect and accounting for the autocorrelation (as heart rates are taken every second). (If there is a way to include AnimalID as a random effect as well, it would be even better.)
My model so far:
library(lqmm)
# Note: lqmm() takes the grouping factor through its group argument,
# and the grouping variable is Experiment (not the level "Exp01")
model <- lqmm(fixed = HeartRate ~ Treatment, random = ~ 1,
              group = Experiment, tau = 0.9, data = df)
Thank you very much in advance for your help.
Let me know if you need more information.
For resources on thinking about this type of problem you might look at chapters 17 and 19 of Koenker et al. (2018), Handbook of Quantile Regression, CRC Press. Neither chapter has nice R code to work from, but they discuss different approaches to the kind of data you're working with. lqmm does use nlme machinery, so there may be a way to customize the covariance matrices for the random effects, but I suspect it would be easiest either to ask the package author for help or to do a deep dive into the package code to figure out how to do that.
Another resource is the quantile regression model for mixed effects that accounts for autocorrelation in 'Quantile regression for mixed models with an application to examine blood pressure trends in China' by Smith et al. (2015). They model a bivariate response with a copula, but you could fit a simplified version with a univariate response. I think their model at this point only incorporates a lag-1 correlation structure within subjects/clusters. The code for that model does not seem to be available online either, though.
I am using R to evaluate two different forecasting models. The basic idea of the evaluation is to compare the Pearson correlation between forecasts and observations, and its corresponding p-value, using cor.test(). The graph below shows the resulting correlation coefficients and p-values.
Our reasoning is that a model with a lower correlation coefficient but a correspondingly low p-value (less than 0.05) is better than one with a higher correlation coefficient but a rather high p-value.
So, in this case, overall, we would say that model 1 is better than model 2.
But the question is: is there another, more rigorous statistical method to quantify the comparison?
Thanks a lot!
I'm assuming you're working with time series data, since you mention a "forecast". I think what you're really looking for is backtesting of your forecast model. From Ruey S. Tsay's "An Introduction to Analysis of Financial Data with R", you might want to take a look at his backtest.R function.
backtest(m1, rt, orig, h, xre = NULL, fixed = NULL, inc.mean = TRUE)
# m1:       a fitted time-series model object
# rt:       the time series
# orig:     the starting forecast origin
# h:        the forecast horizon
# xre:      the independent (exogenous) variables
# fixed:    parameter constraints
# inc.mean: flag for including the constant term of the model
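For example, assuming rt is a series of 100 observations and that backtest.R has been downloaded from Tsay's site, a rolling one-step-ahead backtest over the last 20 points would look something like this:

source("backtest.R")                  # Tsay's script, obtained separately
m1 <- arima(rt, order = c(1, 0, 0))   # any fitted time-series model will do
backtest(m1, rt, orig = 80, h = 1)    # re-forecasts observations 81..100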
Backtesting allows you to see how well your models would have performed on past data, and Tsay's backtest.R reports RMSE and mean absolute error (MAE), which give you another perspective beyond correlation. A caution: depending on the size of your data and the complexity of your model, this can be a very slow-running test.
To compare models you'll normally look at RMSE, which is essentially the standard deviation of your model's errors. RMSE values for two models of the same series are directly comparable, and smaller is better.
An even better alternative is to set up training, test, and validation sets before you build your models. If you train two models on the same training/test data, you can compare them against your validation set (which has never been seen by either model) to get a more accurate measurement of each model's performance.
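As a minimal sketch of the holdout idea (the series name and split points are placeholders):

library(forecast)

train <- window(series, end = c(2018, 12))    # fit on data up to the split
test  <- window(series, start = c(2019, 1))   # hold out the rest

fit1 <- auto.arima(train)
fit2 <- ets(train)

accuracy(forecast(fit1, h = length(test)), test)  # RMSE, MAE, MAPE, ...
accuracy(forecast(fit2, h = length(test)), test)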
One final alternative: if you have a "cost" associated with an inaccurate forecast, apply those costs to your predictions and add them up. If one model performs poorly on a more expensive segment of the data, you may want to avoid using it.
As a side note, your interpretation of the p-value as "lower is better" is not quite right.
P-values address only one question: how likely are your data, assuming the null hypothesis is true? They do not measure support for the alternative hypothesis.
I am having trouble understanding what the arguments to knn() mean in the context of the R function, as I don't come from a statistics background.
Let's say that I am trying to predict the race results for each of three pools: A, B, and C.
I know the height and weight of every candidate competing in the races. Assuming that the competing candidates are the same every year, I also know who won for each of the past 30 years.
How would I predict who is going to win at pools A, B, and C this year?
My guess:
The train argument is a data frame with columns for each competitor's weight, height, and the pool he is competing in, covering the previous 29 years.
The test argument is a data frame with the same columns, for the most recent year.
The cl argument is a vector recording which competitor won the race each year.
Is this how knn() was intended to be used?
Reference:
http://stat.ethz.ch/R-manual/R-patched/library/class/html/knn.html
Not exactly. The train data is used for training and the test data for testing. You can't just train the model and apply it straight away - you need to cross-validate it. The aim of model training is not to minimize the error but to minimize the difference between in-sample and out-of-sample errors; otherwise you will overfit, and in fact, if you push hard enough, your in-sample error will reach 0, which will not give any good results for real prediction. The training set in that function is your in-sample data and the test set is your out-of-sample data.
Note that knn() builds and applies the "model" in one step: its return value is the vector of predicted classes for the rows of test, so once you have chosen k by cross-validation, you make the prediction for the current year simply by passing this year's data as test.
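A minimal sketch of that workflow (the data frames and column names are placeholders; note that cl needs one label per training row, not one per year):

library(class)

# Historical data: one row per competitor per year
train_X <- past_years[, c("height", "weight", "pool_id")]
cl      <- factor(past_years$won)   # "yes"/"no" outcome for each row

# This year's competitors, same columns
test_X  <- this_year[, c("height", "weight", "pool_id")]

pred <- knn(train = train_X, test = test_X, cl = cl, k = 5)

Since knn() computes Euclidean distances, all columns must be numeric: a categorical pool label would first have to be encoded (e.g. as dummy variables), and height and weight should be scaled to comparable units.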