What is the between variance formula in panel data?

I want to calculate panel descriptive statistics for my variables, analogous to what Stata's "xtsum" command provides. I can compute almost everything (overall/within sd, mean, min, max), but I cannot find a reliable source with the formula for the between sd. Does anybody know the formula, or have a reliable source for it?
So far I have used the formula from this thread: Between/within standard deviations in R. But I'm unsure whether that formula is correct.

In this post: xtsum command for R? you will find an R implementation of the entire xtsum command. You can pick out the line that computes the between variation; it is a little hidden.
I used some easy example data, and it perfectly replicates the results from Stata:
# XTSUM() is the function from the linked post above
paneldata <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3, 3, 3), time = rep(1:3, times = 3),
                        variable = c(9, 10, 11, 20, 20, 20, 25, 30, 35))
XTSUM(paneldata, varname = variable, unit = id)
The XTSUM output in R and the xtsum output in Stata (not reproduced here) match for this example.
Be aware of some differences in the within formula, which is adjusted in Stata. You will also find valuable information in this handout:
http://stephenporter.org/files/xtsum_handout.pdf
Example data comes from here:
http://rizaudinsahlan.blogspot.com/2016/06/within-and-between-variation-in-panel.html
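For reference, here is a minimal base-R sketch of the two quantities as xtsum defines them, applied to the example above; it gives between = 10 and within of roughly 2.55 for these data (for unbalanced panels, check the exact centring and denominators against the handout above):
group_means  <- tapply(paneldata$variable, paneldata$id, mean)   # panel means
overall_mean <- mean(paneldata$variable)
# Between sd: sd of the panel means around the overall mean, n - 1 denominator (n = number of panels)
between_sd <- sqrt(sum((group_means - overall_mean)^2) / (length(group_means) - 1))
# Within sd: sd of the deviations from the panel means, N - 1 denominator (N = total observations);
# Stata adds the overall mean back to these deviations, which affects min/max but not the sd
within_dev <- paneldata$variable - group_means[as.character(paneldata$id)]
within_sd  <- sqrt(sum(within_dev^2) / (length(within_dev) - 1))
c(between = between_sd, within = within_sd)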

Related

Custom code for compact letter display from pairwise table output

I would like to write custom code that creates a compact letter display from a pairwise test I have performed.
I have done this successfully with pairwise t-tests (packages for this exist), and I am also familiar with the multcomp package and its cld() function for getting compact letter displays from linear models, but neither works for my specific case here.
I often work with Kaplan-Meier survival data, and after I run the pairwise_survdiff() function (from the survival and survminer packages) to see whether any statistical differences exist between groups, I can easily extract a table displaying all pairwise comparisons and their corresponding p-values. I have included an example for you here (see df below).
When there are many comparisons, working out by hand which groups are different or similar becomes a mess and is prone to human error when many levels exist. Up to now I have always done it by hand, and I would like to change this.
Could someone help me with code that does this automatically?
Here is a mock dataframe df with 10 treatments (named treatment-1 ... treatment-10), filled with p-values. Let's take anything below p < 0.05 as significant. However, it would be very useful to be able to set the desired cut-off for statistical significance, e.g. a more conservative p < 0.01.
Thanks for your help, and again, here is the example dataframe:
df <- read.table("https://pastebin.com/raw/ZAKDBjVs", header = T)
While reflecting on this, I believe I found an answer on my own, using the multcompView and rcompanion packages.
Nonetheless, I think it is worth posting, since I have seen/heard this question multiple times. Here is how I solved my problem:
library(rcompanion)
library(multcompView)
df <- read.table("https://pastebin.com/raw/ZAKDBjVs", header = TRUE)
# Expand the pairwise comparisons into a full symmetric p-value matrix
PT1 <- fullPTable(df)
# Compact letter display: groups sharing a letter are not significantly different
multcompLetters(PT1,
                compare = "<",
                threshold = 0.05,
                Letters = letters,
                reversed = FALSE)
This gives me the desired output with compact letter displays for the groups. Additionally, one can make the statistical threshold more or less conservative by changing the threshold= argument, for example via a small helper like the one sketched below.
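As a convenience, the two steps can be wrapped in a small function that takes the cut-off as an argument (a sketch assuming the same df of pairwise p-values; the function name is my own):
# Hypothetical helper: compact letter display from a table of pairwise p-values
cld_from_pairwise <- function(pmat, alpha = 0.05) {
  full <- fullPTable(pmat)   # fill in the full symmetric p-value matrix
  multcompLetters(full, compare = "<", threshold = alpha, Letters = letters)
}
cld_from_pairwise(df, alpha = 0.01)   # more conservative cut-off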
Very happy with the result. This has bothered me for a while. I hope it is useful to other members

1 sample t-test from summarized data in R

I can perform a one-sample t-test in R with the t.test command, but this requires an actual data set; I can't use summary statistics (sample size, sample mean, standard deviation) directly. I can work around this using the BSDA package, but are there any other ways to accomplish this one-sample t-test in R without the BSDA package?
Many ways. I'll list a few:
directly calculate the p-value by computing the statistic and calling pt with it and the degrees of freedom as arguments, as commenters suggest above (it can be done with a single short line in R; ekstroem shows the two-tailed case, and for a one-tailed test you wouldn't double it) -- a worked sketch of this and the simulation approach follows the list
alternatively, if it's something you need a lot, you could convert that into a nice robust function, even adding in tests against non-zero mu and confidence intervals if you like. Presumably if you go this route you'll want to take advantage of the functionality built around the htest class
(code and even a reasonably complete function can be found in the answers to this stats.SE question.)
If the sample is not huge (smaller than a few million observations, say), you can simulate data with exactly the same mean and standard deviation and call the ordinary t.test function. If m, s and n are the mean, sd and sample size, t.test(scale(rnorm(n))*s+m) should do; it doesn't matter what distribution you use, so runif would suffice. Note the importance of calling scale there. This makes it easy to change your alternative or get a CI without writing more code, but it wouldn't be suitable if you had millions of observations and needed to do it more than a couple of times.
call a function in a different package that will calculate it -- there's at least one or two other such packages (you don't make it clear whether using BSDA was a problem or whether you wanted to avoid packages altogether)
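A minimal sketch of the direct-calculation and simulation approaches (m, s, n and mu below are placeholders for your own summary statistics and hypothesised mean):
m  <- 5.2    # sample mean
s  <- 1.4    # sample standard deviation
n  <- 30     # sample size
mu <- 5      # hypothesised mean
# 1. Direct calculation: t statistic and two-sided p-value via pt()
t_stat <- (m - mu) / (s / sqrt(n))
p_two  <- 2 * pt(-abs(t_stat), df = n - 1)
# 2. Simulate data with exactly this mean and sd, then call t.test()
x <- as.vector(scale(rnorm(n))) * s + m   # scale() forces mean 0, sd 1 before rescaling
t.test(x, mu = mu)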

Clustered robust standard errors on country-year pairs

I want to replicate a Stata do-file (panel model) in R, but unfortunately I end up with the wrong standard error estimates. The data is proprietary, so I can't post it here. The Stata code looks like:
xtreg Y X, vce(cluster countrycodeid) fe nonest dfadj
Here fe requests fixed effects, nonest indicates that the panels are not nested within the clusters, and dfadj means that some sort of degrees-of-freedom adjustment takes place; it is not possible to find out which sort as of now.
My R code looks like this and gives me the right coefficient values:
model <- plm(Y~X+as.factor(year),data=panel,model="within",index=c("codeid","year"))
Now comes the difficult part, for which I haven't found a solution so far, even after trying out numerous robust standard error estimation methods, for example making extensive use of lmtest and various degrees-of-freedom corrections. The standard errors are supposed to follow a country-year pair pattern (captured by the variable countrycodeid in the Stata code, which takes the form codeid-year), as there appears to be missing data for some variables that are not available on a monthly basis.
Does anyone know whether there are special tricks to keep in mind when working with unbalanced panels and the plm() package, which sort of degrees-of-freedom adjustment can be used, and whether there is a way to cluster on a country-year basis in the coeftest() function?
This is not a complete answer.
Stata uses a finite sample correction described in this post. I think that may get your standard errors a tad closer.
Moreover, you can learn more about nonest/dfadj by issuing help whatsnew9 in Stata. Stata used to adjust the VCE for the within transformation when the cluster() option was specified; the cluster-robust VCE no longer adjusts unless dfadj is specified. You may need to use version control to replicate old estimates.
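As a rough sketch of one way to cluster on country-year pairs in R (this refits the within model with explicit dummies via lm() so that sandwich::vcovCL can take an arbitrary cluster variable; it uses the Y, X, codeid and year names from the code above, rebuilds countrycodeid as their interaction, and is not guaranteed to reproduce Stata's dfadj behaviour exactly):
library(sandwich)
library(lmtest)
# Country-year cluster id, analogous to countrycodeid in the Stata code
panel$countrycodeid <- interaction(panel$codeid, panel$year, drop = TRUE)
# Within model refitted with explicit fixed-effect dummies (same slope on X as plm's within estimator)
fit <- lm(Y ~ X + factor(year) + factor(codeid), data = panel)
# Cluster-robust VCE; cadjust = TRUE applies the G/(G-1) part of Stata's finite-sample correction
coeftest(fit, vcov = vcovCL(fit, cluster = ~ countrycodeid, cadjust = TRUE))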

A case of GMM estimation in R

I want to estimate the forward-looking version of the Taylor rule equation using iterative nonlinear GMM.
I have the data for all the variables in the model, namely the inflation rate, the unemployment gap and the effective federal funds rate, and what I am trying to estimate is the set of parameters of the rule.
Where I need help is with the usage of the gmm() function in the {gmm} R package. I think the arguments of the function that I need are:
gmm(g, x, type = "iterative",...)
where g is the formula (so, the model stated above), x is the data vector (or matrix) and type is the type of GMM to use.
My problem is with the data matrix argument. I do not know how to construct it (not that I don't know about matrices in R), and all the examples I have seen on the internet are not similar to what I am attempting here. Also, this is my first time using the gmm() function in R. Is there anything else I need to know?
Your help will be much appreciated. Thank you :)
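Since the exact equation did not survive above, what follows is only a hedged sketch of how a nonlinear moment function and the data matrix fit together in gmm(); the data frame taylor_data, its column names, the instrument choice and the assumed rule ffr_t = theta1 + theta2*E[infl_{t+1}] + theta3*gap_t are all illustrative assumptions, not taken from the question:
library(gmm)
# x must be a matrix so that g() can index it by column name (assumed columns below)
x <- as.matrix(taylor_data[, c("ffr", "infl_lead", "gap",
                               "infl_lag", "gap_lag", "ffr_lag")])
# Moment conditions: residual of the assumed forward-looking rule,
# interacted with the instruments (a constant and lagged variables)
g <- function(theta, x) {
  resid <- x[, "ffr"] - theta[1] - theta[2] * x[, "infl_lead"] - theta[3] * x[, "gap"]
  z <- cbind(1, x[, c("infl_lag", "gap_lag", "ffr_lag")])
  resid * z          # n x q matrix of moment conditions
}
fit <- gmm(g, x = x, t0 = c(0, 1.5, -0.5), type = "iterative")
summary(fit)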

Function for removing nonsignificant variables at one step in R

I am trying to automate logistic regression in R.
Basically, my source code will generate a new equation every day as the input data is updated
(variables, data format etc. are the same) and print out the significant variables with their corresponding coefficients.
When I use the step function, sometimes the resulting coefficients are not significant. Therefore, I want to update my set of coefficients and get rid of all the ones that are not significant enough.
Is there a function or automated way of doing it?
If not, the only way I can think of is writing a script in another language that takes the coefficients and corresponding p-values, checks significance, and reruns R accordingly. But even for that, do you know how I can get only the p-values and coefficients of the variables? I can print the whole summary of the regression result with the summary function, but I can't reach only the p-values.
Thank you very much
It's a bit hard for me without sample code and data, but you can subset based on variable values like this,
newdata <- data[ which(data$p.value < 0.5), ]
You can inspect your R object using str (see ?str) to figure out how to select whatever you want to use in your subset, e.g. $p.value or $residuals.
If this doesn't answer your question try submitting some sample code and data.
Best,
Eric
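For the narrower question of pulling out just the coefficients and p-values of a fitted logistic regression (rather than printing the whole summary), here is a minimal sketch; fit stands for your glm(..., family = binomial) object:
# Coefficient table as a matrix: estimate, std. error, z value, p-value
ctab <- coef(summary(fit))
p_values  <- ctab[, "Pr(>|z|)"]
estimates <- ctab[, "Estimate"]
# Terms (other than the intercept) that clear a chosen significance threshold
keep <- setdiff(names(p_values)[p_values < 0.05], "(Intercept)")
keep
# One could then refit on the reduced set, e.g. update(fit, reformulate(keep, response = "y")),
# where "y" stands for your response variable name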
