Clustered robust standard errors on country-year pairs in R

I want to replicate a Stata do-file (panel model) in R, but unfortunately I'm ending up with the wrong standard error estimates. The data is proprietary, so I can't post it here. The Stata code looks like:
xtreg Y X, vce(cluster countrycodeid) fe nonest dfadj
Here fe requests fixed effects, nonest indicates that the panels are not nested within the clusters, and dfadj applies some sort of degrees-of-freedom adjustment - I haven't been able to find out which sort so far.
My R code looks like this and gives me the correct coefficient values:
model <- plm(Y~X+as.factor(year),data=panel,model="within",index=c("codeid","year"))
Now comes the difficult part, for which I haven't found a solution so far, even after trying numerous robust standard error estimators, for example making extensive use of lmtest and various degrees-of-freedom adjustments. The standard errors are supposed to be clustered on country-year pairs (captured by the variable countrycodeid in the Stata code, which takes the form codeid-year), as there appears to be missing data for some variables which are not available on a monthly basis.
Does anyone know whether there are special tricks to keep in mind when working with unbalanced panels and the plm() package, which sort of degrees-of-freedom adjustment can be used, and whether it is possible to cluster on a country-year basis in the coeftest() function?

This is not a complete answer.
Stata uses a finite-sample correction described in this post. I think that may get your standard errors a tad closer.
Moreover, you can learn more about nonest/dfadj by typing help whatsnew9 in Stata. Stata used to adjust the VCE for the within transformation when the cluster() option was specified. The cluster-robust VCE no longer adjusts unless dfadj is specified. You may need to use version control to replicate old estimates.
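For what it's worth, here is a minimal sketch of the R side under those assumptions. plm's vcovHC() offers type = "sss", a Stata-like small-sample correction, but it can only cluster on the group or time index; clustering on the codeid-year pair itself is shown via the equivalent dummy-variable regression with sandwich::vcovCL(). Variable names follow the question, countryyear is a constructed pair variable, and none of this is a verified replication of xtreg's dfadj.
library(plm)
library(lmtest)
library(sandwich)

## 'model' is the plm fit from the question.
## Stata-like small-sample correction, clustering on the panel id:
coeftest(model, vcov. = vcovHC(model, type = "sss", cluster = "group"))

## To cluster on the codeid-year pair instead, one option is the equivalent
## dummy-variable (LSDV) regression, passing the pair to sandwich::vcovCL().
## Note the degrees of freedom differ from xtreg, because the codeid dummies
## are counted as regressors here.
panel$countryyear <- interaction(panel$codeid, panel$year)
fit_lm <- lm(Y ~ X + factor(year) + factor(codeid), data = panel)
coeftest(fit_lm, vcov. = vcovCL(fit_lm, cluster = ~ countryyear, type = "HC1"))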

Related

Using PERMANOVA in R to analyse the effect of 3 independent variables on reef systems

I am trying to understand how to run a PERMANOVA using adonis2 in R to analyse some data that I have collected. I have been looking online, but as often happens, the explanations are a bit convoluted, so I am asking for your help. I have some fish and coral groups as columns, as well as 3 independent variables (reef age, depth, and material). [Snapshot of my dataset structure]
I think I have understood that p-values are not the only important bit of the output, and that the R2 values indicate how much each variable contributes to the model. Is there something wrong here, or something I am missing? Also, I think I understood that I should check for homogeneity of variance, but I have not understood whether I should check for it on each variable independently, or whether I should include them all in the same bit of code (which does not seem to work). Here are the bits of code that I am using to run the PERMANOVA (1), and the one that I am trying to use to assess homogeneity of variance, which does not work (2).
(1) adonis2(species ~ Age + Material + Depth, data = data.var, by = "margin")
'species' is the subset of the dataset including all the species counts, while 'data.var' is the subset including the 3 independent variables. Also, what is the difference between using '+' and '*' in the formula? When I use '*' it gives me 'Error in qr.X(object$CCA$QR) : need larger value of 'ncol' as pivoting occurred'. What does this mean?
(2) variance.check <- betadisper(species.distance, data.var, type = c("centroid"), bias.adjust = FALSE)
'species.distance' is a matrix calculated through 'vegdist' using the Bray-Curtis method. I used 'data.var' to check variance on all 3 independent variables, but it does not work, while it works if I check them independently (3). Why is that?
(3) variance.check <- betadisper(species.distance, data$Depth, type = c("centroid"), bias.adjust = FALSE)
Thank you in advance for your responses, and for your help. It will really help me to get my head around it (and sorry for the many questions).
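For reference, a hedged sketch of how these pieces usually fit together in vegan, using the object names from the question (everything else is an assumption). In particular, betadisper() expects a single grouping factor rather than a data frame, which is why passing data.var fails while a single column works:
library(vegan)

species.distance <- vegdist(species, method = "bray")

# (1) PERMANOVA with marginal tests for each term
adonis2(species.distance ~ Age + Material + Depth, data = data.var, by = "margin")

# (2) dispersion test, one factor at a time ...
anova(betadisper(species.distance, data.var$Depth, type = "centroid"))

# ... or on all three factors combined into a single grouping variable
groups <- interaction(data.var$Age, data.var$Material, data.var$Depth)
anova(betadisper(species.distance, groups, type = "centroid"))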

R: Evaluate Gradient Boosting Machines (GBM) for Regression

Which are the best metrics to evaluate the fit of a GBM algorithm in R (metrics, graphs, ratios)? And how to interpret them?
I think maybe you are overthinking this one! Take a step back and think about what matters: the error. You have forecasted values and you have observed values; the difference tells you most of what you need to know when comparing across models. Basic measures like MSE, MPE, etc. should do fine. If you are looking to refine within a given model, I would recommend taking a look at the gbm documentation. For example, you can pass your gbm model object to summary() to get the relative influence of each of your variables. Additionally, you can find a lot of information in the documentation, so if you haven't taken a look, I would recommend doing so! I have posted the link at the bottom.
-Carmine
gbm_documentation
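To make that concrete, here is a small self-contained sketch on simulated data (so the numbers are purely illustrative): it computes basic out-of-sample error metrics and the relative-influence table from summary().
library(gbm)

set.seed(1)
n <- 500
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
dat$y <- 2 * dat$x1 - dat$x2 + rnorm(n)
train <- dat[1:400, ]
test  <- dat[401:500, ]

fit <- gbm(y ~ ., data = train, distribution = "gaussian",
           n.trees = 2000, interaction.depth = 2, shrinkage = 0.01,
           cv.folds = 5)

# number of trees that minimises the cross-validated error
best_iter <- gbm.perf(fit, method = "cv", plot.it = FALSE)

# forecasted vs. observed values: basic error metrics
pred <- predict(fit, newdata = test, n.trees = best_iter)
rmse <- sqrt(mean((test$y - pred)^2))
mae  <- mean(abs(test$y - pred))

# relative influence of each predictor
summary(fit, n.trees = best_iter, plotit = FALSE)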

How to use the `vcconv` command in lme4 for serial correlation?

I'm working with a large longitudinal dataset of firm-year observations. For some time now I have been using lme4 to implement crossed (non-nested) effects for year and firm-ID groups.
My goal is now to correct for the serial correlation in the firm-group dimension. Based on chl's and fabians' answers to this question (as well as Ben Bolker's comment on the latter), I've assumed this is impossible with lmer(), but is feasible with nlme::lme().
I have been able to implement crossed effects in nlme based on the discussion in Pinheiro & Bates (2000, sec. 4.2.2, pp. 163-6). In principle, then, I believe I can use the correlation = AR1() specification in lme() to control for autocorrelation.
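For concreteness, a hedged sketch of the two specifications described above (data and variable names - dat, y, x, firm, year - are hypothetical):
library(lme4)
library(nlme)

# crossed (non-nested) random effects in lme4 -- straightforward
fit_lmer <- lmer(y ~ x + (1 | firm) + (1 | year), data = dat)

# the Pinheiro & Bates (2000, sec. 4.2.2) workaround in nlme: one artificial
# group containing all observations, with the firm and year effects entering
# through a blocked variance structure; an AR(1) term would in principle be
# added via the 'correlation' argument
dat$const <- factor(1)
fit_lme <- lme(y ~ x, data = dat,
               random = list(const = pdBlocked(list(pdIdent(~ 0 + firm),
                                                    pdIdent(~ 0 + year)))))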
My strong preference, however, would be to implement such a correlation specification in lmer() because:
lme4 is much, much (much) faster
nlme requires crossed effects to be nested in some higher group -- without such a higher level grouping I'm forced to create an arbitrary dummy for groupedData to which all observations belong (e.g., here). This creates issues interpreting the relative levels of variation between the two crossed groups and the residual variance because some of the variation appears to be captured by the higher-level dummy group.
I got excited when I found the feature request #224 on GitHub, but alas it doesn't seem like there's much movement on the flexLambda front (please let me know if I'm wrong!).
lme4 v1.1-10
I've just noticed that the latest (Oct. 2015) version of lme4 contains a vcconv command that can
Convert between representation of (co-)variance structures (EXPERIMENTAL.)
Based on the source code, it seems that maybe the sdcor2cov option could allow one to specify a correlation structure such as AR(1).
So my questions are:
Is this interpretation of the vcconv function correct?
If so, does the user supply the correlation (e.g., AR(1)) parameters or are they determined internally in lmer()?
How does one implement this function properly?

How to deal with heteroscedasticity in OLS with R

I am fitting a standard multiple regression with OLS method. I have 5 predictors (2 continuous and 3 categorical) plus 2 two-way interaction terms. I did regression diagnostics using residuals vs. fitted plot. Heteroscedasticity is quite evident, which is also confirmed by bptest().
I don't know what to do next. First, my dependent variable is reasonably symmetric (I don't think I need to try transformations of my DV). My continuous predictors are also not highly skewed. I want to use weights in lm(); however, how do I know what weights to use?
Is there a way to automatically generate weights for performing weighted least squares, or are there other ways to go about it?
One obvious way to deal with heteroscedasticity is the estimation of heteroscedasticity-consistent standard errors. Most often they are referred to as robust or White standard errors.
You can obtain robust standard errors in R in several ways. The following page describes one possible and simple way to obtain robust standard errors in R:
https://economictheoryblog.com/2016/08/08/robust-standard-errors-in-r
However, sometimes there are more subtle and often more precise ways to deal with heteroscedasticity. For instance, you might encounter grouped data and find yourself in a situation where standard errors are heterogeneous in your dataset, but homogeneous within groups (clusters). In this case you might want to apply clustered standard errors. See the following link to calculate clustered standard errors in R:
https://economictheoryblog.com/2016/12/13/clustered-standard-errors-in-r
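A minimal sketch of both options with the sandwich and lmtest packages (the model formula and variable names here are placeholders):
library(sandwich)
library(lmtest)

# placeholder model: substitute your own formula and data
fit <- lm(y ~ x1 + x2 + f1 + f2 + f3 + x1:f1 + x2:f2, data = dat)

# heteroscedasticity-consistent (White) standard errors
coeftest(fit, vcov. = vcovHC(fit, type = "HC1"))

# cluster-robust standard errors, if errors are homogeneous within clusters
coeftest(fit, vcov. = vcovCL(fit, cluster = ~ group))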
What is your sample size? I would suggest that you make your standard errors robust to heteroskedasticity, but that you do not worry about heteroskedasticity otherwise. The reason is that with or without heteroskedasticity, your parameter estimates are unbiased (i.e. they are fine as they are). The only thing that is affected (in linear models!) is the variance-covariance matrix, i.e. the standard errors of your parameter estimates will be affected. Unless you only care about prediction, adjusting the standard errors to be robust to heteroskedasticity should be enough.
See e.g. here how to do this in R.
Btw, for your solution with weights (which is not what I would recommend), you may want to look into ?gls from the nlme package.
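For completeness, a minimal sketch of that gls() route, letting the error variance be estimated as a function of the fitted values rather than supplying ad-hoc weights (variable names are hypothetical):
library(nlme)

fit_gls <- gls(y ~ x1 + x2, data = dat,
               weights = varPower(form = ~ fitted(.)))  # Var(e) proportional to |fitted|^(2*delta)
summary(fit_gls)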

Standard error of the ARIMA constant

I am trying to manually calculate the standard error of the constant in an ARIMA model, if it is included. I have referred to the Box and Jenkins (1994) text, especially Section 7.2, but my understanding is that the methods mentioned there calculate the variance-covariance matrix for the ARIMA parameters only, not the constant. I tried searching on the Internet, but couldn't find any theory. Software like Minitab and R calculates this, so I was wondering how it is done. Can someone provide any pointer(s) on this topic?
Thanks.
arima() will fit a regression model with ARMA errors. The constant is treated as the coefficient of a regression variable consisting only of 1s. So you need the covariance matrix of the regression coefficients, which is usually calculated separately from the covariance matrix of the ARMA coefficients. Look at Section 8.3 of Hamilton's "Time series analysis".
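In R's arima() you can see this directly: the term labelled "intercept" (the mean, when d = 0) has its standard error reported from the same variance-covariance matrix as the other coefficients. A small sketch on simulated data:
set.seed(1)
y <- arima.sim(list(ar = 0.5), n = 200) + 10

fit <- arima(y, order = c(1, 0, 0), include.mean = TRUE)
fit$coef                  # "ar1" and "intercept" (the constant/mean term)
sqrt(diag(fit$var.coef))  # standard errors, including the constant's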
One of the nicest things about R is that you can access a lot of the source code to R itself from within the environment. If you simply type arima at the command prompt, you get the high-level source code for the arima() function. I got several pages of code when I tried it.
You do miss out on anything implemented internally within the R executable in native code, but often the high-level code tells you everything you want to know.
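For example:
arima                  # typing the name prints the R-level source
getAnywhere("arima")   # also finds definitions that are not exported from a namespace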
Perhaps a shift of perspective can solve this problem.
Rather than seeing the constant as something special, just consider the problem without a constant and with an additional regressor that is a vector of ones.
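In code, that amounts to suppressing the built-in constant and supplying a column of ones as an external regressor; its coefficient and standard error are then reported like any other regression coefficient (simulated data, purely illustrative):
set.seed(1)
y <- arima.sim(list(ar = 0.5), n = 200) + 10

fit_xreg <- arima(y, order = c(1, 0, 0), include.mean = FALSE,
                  xreg = rep(1, length(y)))
fit_xreg$coef                  # AR term plus the coefficient on the column of ones
sqrt(diag(fit_xreg$var.coef))  # its standard error comes from the same matrix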
