Obtaining covariates' estimates in rdrobust package - r

I am using rdrobust to estimate RDDs, and for a journal submission the journal requires that I report tables with the covariates and their estimates. I don't think these should be reported in designs like these and don't really know how informative they are, but anyway: I can't find them anywhere in the output of the rdrobust call, so I was wondering whether there is any way of actually obtaining them.
Here's my code:
library(rdrobust)
rd <- rdrobust(y = full_data$share_female,
x = full_data$running,
c = 0,
cluster = full_data$constituency.name,
covs=cbind(full_data$income, full_data$year_fct,
full_data$population, as.factor(full_data$constituency.name)))
I then call the object
rd
And get:
Call: rdrobust
Number of Obs. 1812
BW type mserd
Kernel Triangular
VCE method NN
Number of Obs. 1452 360
Eff. Number of Obs. 566 170
Order est. (p) 1 1
Order bias (q) 2 2
BW est. (h) 0.145 0.145
BW bias (b) 0.221 0.221
rho (h/b) 0.655 0.655
Unique Obs. 1452 360
So, as you can see, there seems to be no information on this in the output or in the object the function returns. I don't really know what to do.
Thanks!

Unfortunately, I do not believe rdrobust() allows you to recover the coefficients introduced through the covs option.
In your case, running the code as you provided and then running:
rd$coef
will only give you the point estimate for the rd estimator.
Josh McCrain has written up a nice vignette here to replicate rdrobust using lfe that also allows you to recover the coefficients on covariates.
It involves some modification on your part and is of course not as user friendly, but does allow recovery of covariates.
This might be beside the point by now, but the journal requirement in an RD design is odd.
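The idea behind that kind of replication can be sketched in base R: keep the observations inside the bandwidth, weight them with a triangular kernel, and fit a weighted local linear regression with the covariates entering additively. This is a minimal sketch on simulated data, not the vignette's code. The data frame `dat`, the variable names, and the bandwidth `h` (copied from the output above) are assumptions, and `lm()` stands in for lfe. The standard errors from `lm()` are not the robust bias-corrected ones that rdrobust reports.

```r
# Simulated stand-in for the real data (all names here are hypothetical)
set.seed(1)
n <- 2000
running <- runif(n, -1, 1)
income <- rnorm(n)
treated <- as.numeric(running >= 0)
y <- 0.5 * treated + 0.3 * running + 0.2 * income + rnorm(n, sd = 0.5)
dat <- data.frame(y, running, treated, income)

h <- 0.145   # bandwidth, assumed to be taken from the rdrobust output
c0 <- 0      # cutoff

# Triangular kernel weights: 1 at the cutoff, falling linearly to 0 at distance h
dat$w <- pmax(0, 1 - abs(dat$running - c0) / h)
local <- dat[dat$w > 0, ]

# Local linear fit: separate slopes on each side of the cutoff,
# covariate added linearly (not interacted with treatment)
fit <- lm(y ~ treated * running + income, data = local, weights = w)
coef(fit)   # "treated" is the RD estimate, "income" the covariate coefficient
```

With real data, the covariate rows of `summary(fit)` are what would go into the requested table, with inference handled separately (for example with clustered standard errors).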

Use summary(rd). This will return the coefficient estimate.


Use superior predictive ability (SPA) test in R

Does anyone know how to use the SPA test in R, Matlab, or other software? It is a statistical method to evaluate models. I know that there is an R package called "ttrTests" that has a relevant SPA function, but it looks suitable for comparing portfolio strategies rather than for comparing general models in terms of some loss function. Can someone point me to another source, or explain how to prepare the data for the "ttrTests" package?
The Model Confidence Set package tests multiple models for superior predictive ability.
install.packages("MCS")
library(MCS)
data(Loss)
MCS <- MCSprocedure(Loss=Loss[,1:5],alpha=0.2,B=5000,statistic='Tmax',cl=NULL)
...and the output:
> MCS <- MCSprocedure(Loss=Loss[,1:5],alpha=0.2,B=5000,statistic='Tmax',cl=NULL)
###########################################################################################################################
Superior Set Model created :
Rank_M v_M MCS_M Rank_R v_R MCS_R Loss
sGARCH-norm 4 0.8201805 0.6034 4 1.43408052 0.3576 0.0004042581
sGARCH-std 5 0.9649670 0.5008 5 3.22834167 0.0058 0.0004010655
sGARCH-ged 1 -1.3942903 1.0000 3 0.21893448 0.9940 0.0003986329
sGARCH-snorm 2 -1.3101987 1.0000 2 0.08452883 0.9998 0.0003982803
sGARCH-sstd 3 -0.4739630 1.0000 1 -0.08452883 1.0000 0.0003977886
p-value :
[1] 0.5008
###########################################################################################################################
>
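For models of your own, the input is a loss matrix like `Loss` above: one row per evaluation period, one column per model. A minimal sketch of building one from forecasts, assuming squared-error loss and made-up forecasts (the names `actual`, `forecasts`, and the three model columns are hypothetical):

```r
set.seed(123)
n <- 250
actual <- rnorm(n)   # realised values over the evaluation period

# One column of forecasts per competing model (simulated here)
forecasts <- cbind(
  model_a = actual + rnorm(n, sd = 0.5),
  model_b = actual + rnorm(n, sd = 1.0),
  model_c = rep(0, n)
)

# Squared-error loss; 'actual' has one value per row, so it is
# subtracted from every column element by element
loss <- (forecasts - actual)^2
colMeans(loss)   # average loss per model

# With the MCS package installed, the matrix can be passed on as above:
# library(MCS)
# MCSprocedure(Loss = loss, alpha = 0.2, B = 5000, statistic = "Tmax")
```

Any other loss function (absolute error, QLIKE, and so on) can be substituted, as long as smaller values mean a better forecast.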
You could take a look at the MFE Toolbox for Matlab. It includes two distinct multiple hypothesis tests that may be helpful for you:
Model Confidence Set (mcs function)
Reality Check and Test for Superior Predictive Accuracy (bsds function)
The framework is pretty well documented and you can see these implementations from this page.

warning messages in lme4 for survival analysis that did not arise 3 years ago

I am trying to fit a generalized linear mixed-effects model to my data, using the lme4 package.
The data can be described as follows (see example below): Survival data of fish over 28 days. Explanatory variables in the example data set are:
Region This is the geographical region from which the larvae originated.
treatment The temperatures at which sub-samples of fish from each region were raised.
replicate One of three replications of the entire experiment
tub Random variable. 15 tubs (used to maintain experimental temperatures in aquaria) in total (3 replicates for each of 5 temperature treatments). Each tub contained 1 aquarium for each Region (4 aquaria in total) and was located randomly in the lab.
Day Self explanatory, number of days from the start of the experiment.
stage is not being used in the analysis. Can be ignored.
Response variable
csns cumulative survival. i.e remaining fish/initial fish at day 0.
start weights used to tell the model that the probability of survival is relative to this number of fish at start of experiment.
aquarium Second random variable. This is the unique ID for each individual aquarium, containing the value of each factor that it belongs to. e.g. N-14-1 means Region N, Treatment 14, replicate 1.
My problem is unusual, in that I have fitted the following model before:
dat.asr3<-glmer(csns~treatment+Day+Region+
treatment*Region+Day*Region+Day*treatment*Region+
(1|tub)+(1|aquarium),weights=start,
family=binomial, data=data2)
However, now that I am attempting to re-run the model, to generate analyses for publication, I am getting the following warnings with the same model structure and package. Output is listed below:
> Warning messages:
1: In eval(expr, envir, enclos) : non-integer #successes in a binomial glm!
2: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model failed to converge with max|grad| = 1.59882 (tol = 0.001, component >1)
3: In checkConv(attr(opt, "derivs"), opt$par, ctrl = control$checkConv, :
Model is nearly unidentifiable: very large eigenvalue
- Rescale variables?;Model is nearly unidentifiable: large eigenvalue ratio
- Rescale variables?
My understanding is the following:
Warning message 1.
non-integer #success in a binomial glm refers to the proportion format of the csns variable. I have consulted several sources, here included, github, r-help, etc, and all suggested this. The research fellow that assisted me in this analysis 3 years ago, is unreachable. Can it have to do with changes in lme4 package over the last 3 years?
Warning message 2.
I understand this is a problem because there are insufficient data points to fit the model to, particularly at
L-30-1, L-30-2 and L-30-3,
where only two observations are made:
Day 0 csns=1.00 and Day 1 csns=0.00
for all three aquaria. Therefore there is no variability or sufficient data to fit the model to.
Nevertheless, this model in lme4 has worked before, but doesn't run without these warnings now.
Warning message 3
This one is entirely unfamiliar to me. Never seen it before.
Sample data:
Region treatment replicate tub Day stage csns start aquarium
N 14 1 13 0 1 1.00 107 N-14-1
N 14 1 13 1 1 1.00 107 N-14-1
N 14 1 13 2 1 0.99 107 N-14-1
N 14 1 13 3 1 0.99 107 N-14-1
N 14 1 13 4 1 0.99 107 N-14-1
N 14 1 13 5 1 0.99 107 N-14-1
The data in question 1005cs.csv is available here via we transfer: http://we.tl/ObRKH0owZb
Any help with deciphering this problem, would be greatly appreciated. Also any alternative suggestions for suitable packages or methods to analyse this data would be great too.
tl;dr the "non-integer successes" warning is accurate; it's up to you to decide whether fitting a binomial model to these data really makes sense. The other warnings suggest that the fit is a bit unstable, but scaling and centering some of the input variables can make the warnings go away. It's up to you, again, to decide whether the results from these different formulations are different enough for you to worry about ...
data2 <- read.csv("1005cs.csv")
library(lme4)
Fit model (slightly more compact model formulation)
dat.asr3<-glmer(
csns~Day*Region*treatment+
(1|tub)+(1|aquarium),
weights=start, family=binomial, data=data2)
I do get the warnings you report.
Let's take a look at the data:
library(ggplot2); theme_set(theme_bw())
ggplot(data2,aes(Day,csns,colour=factor(treatment)))+
geom_point(aes(size=start),alpha=0.5)+facet_wrap(~Region)
Nothing obviously problematic here, although it does clearly show that the data are very close to 1 for some treatment combinations, and that the treatment values are far from zero. Let's try scaling & centering some of the input variables:
data2sc <- transform(data2,
Day=scale(Day),
treatment=scale(treatment))
dat.asr3sc <- update(dat.asr3,data=data2sc)
Now the "very large eigenvalue" warning is gone, but we still have the "non-integer # successes" warning, and a max|grad|=0.082. Let's try another optimizer:
dat.asr3scbobyqa <- update(dat.asr3sc,
control=glmerControl(optimizer="bobyqa"))
Now only the "non-integer #successes" warning remains.
d1 <- deviance(dat.asr3)
d2 <- deviance(dat.asr3sc)
d3 <- deviance(dat.asr3scbobyqa)
c(d1,d2,d3)
## [1] 12597.12 12597.31 12597.56
These deviances don't differ by very much (0.44 on the deviance scale is more than could be accounted for by round-off error, but not much difference in goodness of fit); actually, the first model gives the best (lowest) deviance, suggesting that the warnings are false positives ...
resp <- with(data2,csns*start)
plot(table(resp-floor(resp)))
This makes it clear that there really are non-integer responses, so the warning is correct.
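As a rough illustration of that check, and of one possible way around the warning, here is a sketch using the six sample rows from the question. Rounding to whole fish and refitting with a two-column successes/failures response is an assumption about what the data represent, not something the analysis above did.

```r
# The six sample rows shown in the question
sample_rows <- data.frame(
  Day   = 0:5,
  csns  = c(1.00, 1.00, 0.99, 0.99, 0.99, 0.99),
  start = 107
)

# Implied number of survivors; 0.99 * 107 = 105.93 is not a whole fish
alive_raw <- sample_rows$csns * sample_rows$start
is_whole <- abs(alive_raw - round(alive_raw)) < 1e-8
is_whole

# Assumption: csns was rounded to two decimals, so round back to counts
sample_rows$alive <- round(alive_raw)
sample_rows$dead  <- sample_rows$start - sample_rows$alive
sample_rows

# A count-based binomial response would then look like this (not run):
# glmer(cbind(alive, dead) ~ Day * Region * treatment +
#         (1 | tub) + (1 | aquarium), family = binomial, data = ...)
```

If the original counts are still available, using them directly is preferable to rounding the proportions.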

Convert mixed model with repeated measures from SAS to R

I have been trying to convert a repeated measures model from SAS to R, since a collaborator will do the analysis but does not have SAS. We are dealing with 4 groups, 8 to 10 animals per group, and then 5 time points for each animal. The mock data file is available here https://drive.google.com/file/d/0B-WfycVUQyhaVGU2MUpuQkg4Mk0/edit?usp=sharing as a Rdata file and here https://drive.google.com/file/d/0B-WfycVUQyhaR0JtZ0V4VjRkTk0/edit?usp=sharing as an excel file:
The original SAS code (1) is :
proc mixed data=essai.data_test method=reml;
class group time mice;
model param = group time group*time / ddfm=kr;
repeated time / type=un subject=mice group=group;
run;
Which gives :
Type 3 Tests of Fixed Effects
Effect       Num DF   Den DF   F Value   Pr > F
group             3     15.8      1.58   0.2344
time              4     25.2     10.11   <.0001
group*time       12     13.6      1.66   0.1852
I know that R does not handle degrees of freedom in the same way as SAS does, so I am first trying to obtain results similar to (2) :
proc mixed data=essai.data_test method=reml;
class group time mice;
model param = group time group*time;
repeated time / type=un subject=mice group=group;
run;
I have found some hints here Converting Repeated Measures mixed model formula from SAS to R and when specifying a compound symmetry correlation matrix this works perfectly. However, I am not able to obtain the same thing for a general correlation matrix.
With (2) in SAS, I obtain the following results :
Type 3 Tests of Fixed Effects
Effect       Num DF   Den DF   F Value   Pr > F
group             3       32      1.71   0.1852
time              4      128     11.21   <.0001
group*time       12      128      2.73   0.0026
Using the following R code :
options(contrasts=c('contr.sum','contr.poly'))
mod <- lme(param~group*time, random=list(mice=pdDiag(form=~group-1)),
correlation = corSymm(form=~1|mice),
weights = varIdent(form=~1|group),
na.action = na.exclude, data = data, method = "REML")
anova(mod,type="marginal")
I obtain:
numDF denDF F-value p-value
(Intercept) 1 128 1373.8471 <.0001
group 3 32 1.5571 0.2189
time 4 128 10.0628 <.0001
group:time 12 128 1.6416 0.0880
The degrees of freedom are similar, but not the tests on fixed effects and I don’t know where this comes from. Would anyone have any idea of what I am doing wrong here?
Your R code differs from the SAS code in multiple ways. Some of them are fixable, but I was not able to fix all the aspects to reproduce the SAS analysis.
The R code fits a mixed effects model with a random mice effect, while the SAS code fits a general linear model that allows correlation between the residuals, but there are no random effects (because there is no RANDOM statement). In R you would have to use the gls function (generalized least squares) from the same nlme package.
In the R code all observations within the same group have the same variance, while in the SAS code you have an unstructured covariance matrix, that is each time-point within each group has its own variance. You can achieve the same effect by using weights=varIdent(form=~1|group*time).
In the R code the correlation matrix is the same for every mouse regardless of group. In the SAS code each group has its own correlation matrix. This is the part that I don't know how to reproduce in R.
I have to note that the R model seems to be more meaningful - SAS estimates way too many variances and correlations (which, by the way, you can see meaningfully arranged using the R and RCORR options to the repeated statement).
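To make the "way too many variances and correlations" point concrete, here is a small base-R sketch that counts the covariance parameters under each specification and computes the per-group empirical covariance matrices that the SAS model is in effect estimating. The data are simulated to match the design described in the question (4 groups, 10 mice per group assumed, 5 time points); none of the numbers come from the real data.

```r
n_time  <- 5
n_group <- 4

# SAS: type=un with group=group gives one unstructured 5 x 5 matrix per group
un_params_per_group <- n_time * (n_time + 1) / 2      # 15
sas_cov_params <- n_group * un_params_per_group       # 60

# gls with one shared corSymm plus varIdent(form = ~1 | group*time)
shared_corr    <- n_time * (n_time - 1) / 2           # 10 correlations
variances      <- n_group * n_time                    # 20 variances
gls_cov_params <- shared_corr + variances             # 30

# Simulated data; rows are ordered by group, then mouse, then time
set.seed(2)
mice_per_group <- 10
dat <- expand.grid(time = 1:n_time, mouse = 1:mice_per_group, group = 1:n_group)
mouse_effect <- rep(rnorm(n_group * mice_per_group), each = n_time)
dat$param <- rnorm(nrow(dat)) + mouse_effect

# Empirical covariance of the 5 time points, separately for each group
cov_by_group <- lapply(split(dat, dat$group), function(d) {
  wide <- matrix(d$param, ncol = n_time, byrow = TRUE)  # rows = mice
  cov(wide)
})
round(cov_by_group[[1]], 2)
```

With about 10 animals per group, each of those 15 entries per group rests on very little data, which is why the simpler structure can be the more sensible model.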
"In the R code the correlation matrix is the same for every mouse regardless of group. In the SAS code each group has its own correlation matrix. This is the part that I don't know how to reproduce in R." - Try: correlation=corSymm(~1|group*time)

How to perform random forest/cross validation in R

I'm unable to find a way of performing cross validation on a regression random forest model that I'm trying to produce.
So I have a dataset containing 1664 explanatory variables (different chemical properties), with one response variable (retention time). I'm trying to produce a regression random forest model in order to be able to predict the chemical properties of something given its retention time.
ID RT (seconds) 1_MW 2_AMW 3_Sv 4_Se
4281 38 145.29 5.01 14.76 28.37
4952 40 132.19 6.29 11 21.28
4823 41 176.21 7.34 12.9 24.92
3840 41 174.24 6.7 13.99 26.48
3665 42 240.34 9.24 15.2 27.08
3591 42 161.23 6.2 13.71 26.27
3659 42 146.22 6.09 12.6 24.16
This is an example of the table that I have. I want to basically plot RT against 1_MW, etc (up to 1664 variables), so I can find which of these variables are of importance and which aren't.
I do:-
r = randomForest(RT..seconds.~., data = cadets, importance =TRUE, do.trace = 100)
varImpPlot(r)
which tells me which variables are of importance and what not, which is great. However, I want to be able to partition my dataset so that I can perform cross validation on it. I found an online tutorial that explained how to do it, but for a classification model rather than regression.
I understand you do:-
k = 10
n = floor(nrow(cadets)/k)
i = 1
s1 = ((i-1) * n+1)
s2 = (i * n)
subset = s1:s2
to define how many cross folds you want to do, and the size of each fold, and to set the starting and end value of the subset. However, I don't know what to do from here on. I was told to loop through but I honestly have no idea how to do this. Nor do I know how to then plot the validation set and the test set onto the same graph to depict the level of accuracy/error.
If you could please help me with this I'd be ever so grateful, thanks!
From the source:
The out-of-bag (oob) error estimate
In random forests, there is no need for cross-validation or a separate
test set to get an unbiased estimate of the test set error. It is
estimated internally, during the run...
In particular, predict.randomForest returns the out-of-bag prediction if newdata is not given.
As topchef pointed out, cross-validation isn't necessary as a guard against over-fitting. This is a nice feature of the random forest algorithm.
It sounds like your goal is feature selection; cross-validation is still useful for this purpose. Take a look at the rfcv() function within the randomForest package. The documentation specifies a data frame and a vector as input, so I'll start by creating those with your data.
set.seed(42)
x <- cadets
x$RT..seconds. <- NULL
y <- cadets$RT..seconds.
rf.cv <- rfcv(x, y, cv.fold=10)
with(rf.cv, plot(n.var, error.cv))
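If you do want to write the fold loop from the question by hand, the pattern looks roughly like the sketch below. To keep it runnable without extra packages it uses simulated data and `lm()` as a stand-in model; the data frame `cadets_demo` and its columns are hypothetical, and the commented line shows where a `randomForest()` call would go instead.

```r
set.seed(42)
n <- 100
cadets_demo <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
cadets_demo$rt <- 40 + 3 * cadets_demo$x1 + rnorm(n)

k <- 10
# Randomly assign each row to one of k folds of (near) equal size
fold <- sample(rep(seq_len(k), length.out = n))

pred <- numeric(n)
for (i in seq_len(k)) {
  test_idx <- which(fold == i)
  # Fit on everything except fold i (lm is only a placeholder model)
  fit <- lm(rt ~ ., data = cadets_demo[-test_idx, ])
  # With randomForest installed, one could use instead:
  # fit <- randomForest(rt ~ ., data = cadets_demo[-test_idx, ])
  pred[test_idx] <- predict(fit, newdata = cadets_demo[test_idx, ])
}

# Every row now has a prediction made by a model that never saw it
rmse <- sqrt(mean((cadets_demo$rt - pred)^2))
rmse

# Observed versus cross-validated predictions, with the 1:1 line
plot(cadets_demo$rt, pred, xlab = "Observed RT", ylab = "Predicted RT (held out)")
abline(0, 1)
```

Assigning folds at random, rather than taking consecutive blocks such as `s1:s2`, matters when the rows are sorted, as they appear to be by retention time in the table above.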

Error with gls function in nlme package in R

I keep getting an error like this:
Error in `coef<-.corARMA`(`*tmp*`, value = c(18.3113452983211, -1.56626248550284, :
Coefficient matrix not invertible
or like this:
Error in gls(archlogfl ~ co2, correlation = corARMA(p = 3)) : false convergence (8)
with the gls function in nlme.
The former example was with the model gls(archlogflfornma~nma,correlation=corARMA(p=3)) where archlogflfornma is
[1] 2.611840 2.618454 2.503317 2.305531 2.180464 2.185764 2.221760 2.211320
and nma is
[1] 138 139 142 148 150 134 137 135
You can see the model in the latter, and archlogfl is
[1] 2.611840 2.618454 2.503317 2.305531 2.180464 2.185764 2.221760 2.211320
[9] 2.105556 2.176747
and co2 is
[1] 597.5778 917.9308 1101.0430 679.7803 886.5347 597.0668 873.4995
[8] 816.3483 1427.0190 423.8917
I have R 2.13.1.
Roland
@GavinSimpson's comment above, that trying to estimate a model with 5 parameters from 10 observations is very hopeful, is correct. The general rule of thumb is that you should have at least 10 times as many data points as parameters, and that's for standard fixed effect/regression parameters. (Generally variance structure parameters such as AR parameters are even a bit harder/require a bit more data than regression parameters to estimate.)
That said, in a perfect world one could hope to estimate parameters even from overfitted models. Let's just explore what happens though:
archlogfl <- c(2.611840,2.618454,2.503317,
2.305531,2.180464,2.185764,2.221760,2.211320,
2.105556,2.176747)
co2 <- c(597.5778,917.9308,1101.0430,679.7803,
886.5347,597.0668,873.4995,
816.3483,1427.0190,423.8917)
Take a look at the data,
plot(archlogfl~co2,type="b")
library(nlme)
g0 <- gls(archlogfl~co2)
plot(ACF(g0),alpha=0.05)
This is an autocorrelation function of the residuals, with 95% confidence intervals (note that these are pointwise confidence intervals, so we would expect about 1/20 points to fall outside these boundaries in any case).
So there is indeed some (graphical) evidence for some autocorrelation here. We'll fit an AR(1) model, with verbose output (to understand the scale on which these parameters are estimated, you'll probably have to dig around in Pinheiro and Bates 2000: what's presented in the printout are the unconstrained values of the parameters, what's printed in the summaries are the constrained values):
g1 <- gls(archlogfl ~co2,correlation=corARMA(p=1),
control=glsControl(msVerbose=TRUE))
Let's see what's left after we fit AR1:
plot(ACF(g1,resType="normalized"),alpha=0.05)
Now fit AR(2):
g2 <- gls(archlogfl ~co2,correlation=corARMA(p=2),
control=glsControl(msVerbose=TRUE))
plot(ACF(g2,resType="normalized"),alpha=0.05)
As you correctly state, trying to go to AR(3) fails.
gls(archlogfl ~co2,correlation=corARMA(p=3))
You can play with tolerances, starting conditions, etc., but I don't think it's going to help much.
gls(archlogfl ~co2,correlation=corARMA(p=3,value=c(0.9,-0.5,0)),
control=glsControl(tolerance=1e-4,msVerbose=TRUE),verbose=TRUE)
If I were absolutely desperate to get these values I would code my own generalized least-squares function, constructing the AR(3) correlation matrix from scratch, and try to run it with some slightly more robust optimizer, but I would really have to have a good reason to work that hard ...
Another alternative would be to use arima to fit to the residuals from a gls or lm fit without autocorrelation: arima(residuals(g0),c(3,0,0)). (You can see that if you do this with arima(residuals(g0),c(2,0,0)) the answers are close to (but not quite equal to) the results from gls with corARMA(p=2).)
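A self-contained version of that two-step alternative, using the data from the question and `lm()` for the first stage (for a fit without a correlation structure, `gls` and `lm` give the same coefficients and therefore the same residuals). Only the AR(2) case is run here; the variable names `fit0` and `ar2` are just labels for this sketch.

```r
archlogfl <- c(2.611840, 2.618454, 2.503317, 2.305531, 2.180464,
               2.185764, 2.221760, 2.211320, 2.105556, 2.176747)
co2 <- c(597.5778, 917.9308, 1101.0430, 679.7803, 886.5347,
         597.0668, 873.4995, 816.3483, 1427.0190, 423.8917)

# Stage 1: ordinary regression, ignoring autocorrelation
fit0 <- lm(archlogfl ~ co2)

# Stage 2: AR(2) model fitted to the stage-1 residuals
ar2 <- arima(residuals(fit0), order = c(2, 0, 0))
coef(ar2)   # ar1, ar2 and an intercept for the residual series
```

Unlike `gls`, this does not re-estimate the regression coefficients given the correlation structure, so it is an approximation rather than an equivalent fit.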
