How to use covariates in the rddtools rdd_reg_lm function?

I am trying to run a parametric RD regression using the rddtools R package. However, the package documentation is not very clear to me.
First: the function to define an RD object is:
rdd_data(y, x, covar, cutpoint, z, labels, data)
where covar is described in the help file only as "Exogeneous variables". But of what type? A data frame? A list?
Second: the function rdd_reg_lm again asks for covariates, with this signature:
rdd_reg_lm(rdd_object, covariates = NULL, order = 1, bw = NULL,
           slope = c("separate", "same"),
           covar.opt = list(strategy = c("include", "residual"),
                            slope = c("same", "separate"), bw = NULL),
           covar.strat = c("include", "residual"), weights)
where, according to the help file, the covariates argument is simply a "Formula to include covariates". Again, it is not clear to me exactly how these covariates should be applied.
Moreover, is it possible to include multiple covariates in rdd_data() and rdd_reg_lm()?
I would appreciate some help here. I have read the help and vignette files again and again and searched many blogs, still without luck.
I have already checked this topic below
How to include a linear trend in a regression discontinuity design using rddtools
which showed me the following example:
rd.medic <- rdd_data(y = er, x = ageyrs, covar = ageyrs, cutpoint = 65, data = medicare)
rd.reg <- rdd_reg_lm(rdd_object = rd.medic, covariates = 'ageyrs',
                     slope = "same", covar.opt = list("include"))
Even so, the syntax is still not clear to me: I have been trying to add multiple covariates without success.
Thanks!

You can create a data frame with your covariates and then include it in rdd_data.
covariates <- data.frame(z1 = medicare$ageyrs, z2 = medicare$ageyrs2) # assuming both columns exist in medicare
rd.medic <- rdd_data(y = er, x = ageyrs, covar = covariates, cutpoint = 65, data = medicare)
rd.reg <- rdd_reg_lm(rdd_object = rd.medic, covariates = TRUE, slope = "same")
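If you only want some of the covariates to enter the regression stage, the covariates argument accepts a formula-style specification naming columns of the covar data frame. A hedged sketch (the 'z1 + z2' string form is an assumption based on the help file's "Formula to include covariates"; check it against your installed version):
# hypothetical: select the z1 and z2 columns passed to rdd_data() above
rd.reg2 <- rdd_reg_lm(rdd_object = rd.medic, covariates = 'z1 + z2',
                      covar.opt = list(strategy = "include"))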

Related

Writing a prediction equation from plsr model

Greetings to everyone.
I successfully fit a PLS regression model in R using the code below:
pls_modB_Kexch_2 <- plsr(Av.K_exc ~ ., data = trainKexch.sar.veg, scale = TRUE, method = "s", validation = 'CV')
The regression coefficients for ncomp = 11 were:
(Intercept) = -4.692966e+05,
Easting = 6.068582e+03, Northings = 7.929767e+02,
sigma_vv = 8.024741e+05, sigma_vh = -6.375260e+05,
gamma_vv = -7.120684e+05, gamma_vh = 4.330279e+05,
beta_vv = -8.949598e+04, beta_vh = 2.045924e+05,
c11_db = 2.305016e+01, c22_db = -4.706773e+01,
c12_real = -1.877267e+00
It predicts new data sets well when applied within the R environment.
My challenge is presenting this model in the form of an equation, y = sum(A_i * x_i) + B_0, where the A_i are the coefficients of the respective variables x_i, or in any other mathematical form that can be presented academically.
I tried the direct way of multiplying the coefficients by each variable and summing them up, but a quick manual trial of the predictions gave me strange results. Am I missing something here? Please help.
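One likely culprit is the scale = TRUE in the plsr() call above: the stored coefficients apply to predictors divided by their training standard deviations, so multiplying them by raw values gives strange numbers. A minimal sketch of a manual prediction that accounts for this ('newdata' is a hypothetical data frame with the same predictor columns):
# coefficients including the intercept, for the 11-component model
B <- coef(pls_modB_Kexch_2, ncomp = 11, intercept = TRUE)
b0 <- B[1, 1, 1] # intercept
b <- B[-1, 1, 1] # slopes, named by predictor
# with scale = TRUE, divide new predictors by the training sds stored in $scale
Xs <- sweep(as.matrix(newdata[, names(b)]), 2, pls_modB_Kexch_2$scale, "/")
yhat <- drop(Xs %*% b) + b0
# sanity check against the package's own prediction:
# predict(pls_modB_Kexch_2, newdata = newdata, ncomp = 11)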

Error with svyglm function in survey package in R: "all variables must be in design=argument"

New to Stack Overflow. I'm working on a project with NHIS data, but I cannot get the svyglm function to work even for a simple, unadjusted logistic regression with a binary predictor and binary outcome variable (ultimately I'd like to use multiple categorical predictors, but one step at a time).
El_under_glm <- svyglm(ElUnder ~ SO2, design = SAMPdesign, subset = NULL,
                       family = binomial(link = "logit"), rescale = FALSE,
                       correlation = TRUE)
Error in eval(extras, data, env) :
  object '.survey.prob.weights' not found
I changed the variables to 0 and 1 instead:
Under_narm$SO2REG <- ifelse(Under_narm$SO2 == "Heterosexual", 0, 1)
Under_narm$ElUnderREG <- ifelse(Under_narm$ElUnder == "No", 0, 1)
But then I get a different issue:
El_under_glm <- svyglm(ElUnderREG ~ SO2REG, design = SAMPdesign, subset = NULL,
                       family = binomial(link = "logit"), rescale = FALSE,
                       correlation = TRUE)
Error in svyglm.survey.design(ElUnderREG ~ SO2REG, design = SAMPdesign, :
  all variables must be in design= argument
This is the design I'm using to account for the weights -- I'm pretty sure it's correct:
SAMPdesign <- svydesign(data = Under_narm, id = ~NHISPID, weight = ~SAMPWEIGHT)
Any and all assistance appreciated! I've got a good grasp of stats but am a slow coder. Let me know if I can provide any other information.
Using some make-believe sample data, I was able to get your model to run by setting rescale = TRUE. The documentation states:
Rescaling of weights, to improve numerical stability. The default rescales weights to sum to the sample size. Use FALSE to not rescale weights.
So one solution may be just to set rescale = TRUE.
library(survey)
# sample data (1000 rows; the original rep(1:2, 1000) made SO2 twice as long as the other columns)
Under_narm <- data.frame(SO2 = factor(rep(1:2, 500)),
                         ElUnder = sample(0:1, 1000, replace = TRUE),
                         NHISPID = paste0("id", 1:1000),
                         SAMPWEIGHT = sample(c(0.5, 2), 1000, replace = TRUE))
# with 'rescale' = TRUE
SAMPdesign <- svydesign(ids = ~NHISPID,
                        data = Under_narm,
                        weights = ~SAMPWEIGHT)
El_under_glm <- svyglm(formula = ElUnder ~ SO2,
                       design = SAMPdesign,
                       family = quasibinomial(), # this family avoids warnings
                       rescale = TRUE) # weights rescaled to sum to the sample size
summary(El_under_glm, correlation = TRUE) # use correlation with summary()
Otherwise, looking at the code for this function's method with survey:::svyglm.survey.design, it seems like there may be a bug. I could be wrong, but on my reading, when rescale is FALSE, .survey.prob.weights never gets assigned a value.
if (is.null(g$weights))
    g$weights <- quote(.survey.prob.weights)
else g$weights <- bquote(.survey.prob.weights * .(g$weights)) # bug?
g$data <- quote(data)
g[[1]] <- quote(glm)
if (rescale)
    data$.survey.prob.weights <- (1/design$prob)/mean(1/design$prob)
There may be a workaround if you assign a vector of numeric values to .survey.prob.weights in the global environment. I have no idea what these values should be, but your error goes away if you do something like the following. (.survey.prob.weights needs to be double the length of the data.)
SAMPdesign <- svydesign(ids = ~NHISPID,
                        data = Under_narm,
                        weights = ~SAMPWEIGHT)
.survey.prob.weights <- rep(1, 2000)
El_under_glm <- svyglm(formula = ElUnder ~ SO2,
                       design = SAMPdesign,
                       family = quasibinomial(),
                       rescale = FALSE)
summary(El_under_glm, correlation = TRUE)

Predict using multiple variables in R

I have a slight problem with my R coursework.
I have made the following dataset (shown as an image in the original post):
Now I'm going to plot the values based on this dataset using the following command:
plot(x ~ Group.1, data = jarelmaks_vaikelaen23mean,
     xlab = "Vanus", ylab = "PD", main = "Järelmaks ja väikelaen")
After that, I'm creating a glm model using the following command. The difference is that now I'm using the original dataset (the values of the dependent variable are 1/0).
GLM command:
jarelmaks_vaikelaen23_mudel <- glm(Default ~ Vanus.aastates + Toode,
                                   family = binomial(link = 'logit'),
                                   data = jarelmaks_vaikelaen_23)
Now, I'm trying to predict the values using my model.
predict(jarelmaks_vaikelaen23_mudel, data.frame(Vanus.aastates = x), type = "resp")
Unfortunately, I get the following error message:
Error in data.frame(Vanus.aastates = x) : object 'x' not found
Can you give me some ideas on how to solve this problem, or explain how this predict() command works?
When you provide a data frame to the predict function's newdata argument, the data frame should have column names that match the variables used as independent variables in your model-fitting step. That is, your predict call should look like:
predict(
    jarelmaks_vaikelaen23_mudel,
    newdata = data.frame(
        Vanus.aastates = SOMETHING,
        Toode = SOMETHING_ELSE
    ),
    type = "response"
)
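For instance, a self-contained toy (made-up data and names) showing both the error and the fix:
# made-up data: binary outcome, one numeric and one categorical predictor
d <- data.frame(y = rbinom(20, 1, 0.5),
                age = 20:39,
                prod = rep(c("A", "B"), 10))
m <- glm(y ~ age + prod, family = binomial(link = 'logit'), data = d)
predict(m, newdata = data.frame(age = 30), type = "response")
# Error: object 'prod' not found -- every model variable must be in newdata
predict(m, newdata = data.frame(age = 30, prod = "A"), type = "response")
# works: both 'age' and 'prod' are supplied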

How to ensemble forecasts in R using weights

I have a couple of forecasts and am trying to figure out how to merge the two according to some widely used criterion.
In part one, I split the data and compared the forecasts against the actual values using Forecast_comb.
library(forecast)
library(ForecastCombinations)
y1 = rnorm(100)
train = y1[1:90]
test = y1[91:100]
fit1 = auto.arima(train)
fit2 = ets(train)
forc1 = forecast(fit1, h = 10)$mean # the horizon argument is 'h', not 'n'
forc2 = forecast(fit2, h = 10)$mean
forc_all = cbind(forc1, forc2)
forc_all
?Forecast_comb
fitted <- Forecast_comb(obs = test, fhat = as.matrix(forc_all),
                        Averaging_scheme = "best")$fitted
fitted
In part two, I rebuilt the whole model on the entire data and forecast out twenty values. How can one merge the two sets of forecasts according to some criterion?
fit3 = auto.arima(y1)
fit4 = ets(y1)
forc3 = forecast(fit3, h = 20)$mean
forc4 = forecast(fit4, h = 20)$mean
forc_all = cbind(forc3, forc4)
forc_all
fitted <- Forecast_comb(obs = y1[91:100], fhat = as.matrix(forc_all),
                        Averaging_scheme = "best")$fitted
fitted
Thanks for the help
The reason I am using ForecastCombinations is that it includes procedures for popular combination strategies. I thought that perhaps the Forecast_comb function could be modified to perform the desired ensembling.
Based on a lot of Kaggle competitions where people share/discuss their scripts, I'd say that by far the most common and most effective way is simply to manually weight and add your predictions.
pacman::p_load(forecast)
pacman::p_load(ForecastCombinations)
y1 = rnorm(100)
train = y1[1:90]
test = y1[91:100]
fit1 = auto.arima(train)
fit2 = ets(train)
forc1 = forecast(fit1, h = 10)$mean # horizon argument is 'h', not 'n'
forc2 = forecast(fit2, h = 10)$mean
forc_all = cbind(forc1, forc2)
forc_all
?Forecast_comb
fitted_1 <- Forecast_comb(obs = test, fhat = as.matrix(forc_all),
                          Averaging_scheme = "best")$fitted
fitted_1
fit3 = auto.arima(y1)
fit4 = ets(y1)
forc3 = forecast(fit3, h = 20)$mean
forc4 = forecast(fit4, h = 20)$mean
forc_all = cbind(forc3, forc4)
forc_all
fitted_2 <- Forecast_comb(obs = y1[91:100], fhat = as.matrix(forc_all),
                          Averaging_scheme = "best")$fitted
fitted_2
# By far the most common way to combine/weight is simply:
fitted <- fitted_2 * 0.5 + fitted_1 * 0.5
fitted
One might ask whether you should use equal weights, or how to know what to make the weights. This is usually determined by
(a) naive, equal weighting, if that's all you have time for and it seems to work fine, or
(b) iterating with a holdout or cross-validation sample(s), being careful not to overfit (see the sketch after this list).
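A minimal sketch of (b), reusing train/test and forc1/forc2 from the code above (the 0.05 grid step is arbitrary):
w <- seq(0, 1, by = 0.05) # candidate weights on forc1
rmse <- sapply(w, function(a) sqrt(mean((test - (a * forc1 + (1 - a) * forc2))^2)))
w_best <- w[which.min(rmse)] # forc2 gets 1 - w_best
combined <- w_best * forc1 + (1 - w_best) * forc2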
Some people try fancier approaches. It's easy to mess that up; however, if you get it right, it can lead you to a better forecast.
The model-based and other fancier approaches are things like adding another stage to the modeling process, wherein your predictions on a holdout sample form the X matrix and the outcome variable is the actual y for that sample.
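A minimal sketch of that stacking stage, again reusing test, forc1, and forc2 from above:
# holdout forecasts become predictors; holdout actuals are the outcome
stack_df <- data.frame(y = test,
                       f1 = as.numeric(forc1),
                       f2 = as.numeric(forc2))
stack_fit <- lm(y ~ f1 + f2, data = stack_df)
coef(stack_fit) # learned combination weights (plus an intercept)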
Also, check out Erin LeDell's approach in h2oEnsemble.

Set G in prior using MCMCglmm, with categorical response and phylogeny

I am new to the MCMCglmm package in R, and rather new to glm models in general. I have a dataset of species traits and whether or not they have been introduced outside of their native range.
I would like to test whether being introduced (as a binary 0/1 response variable) can be explained by any of the species traits. I would also like to correct for phylogeny between species.
I was told that for a binary response I could use family = "threshold", and that I should fix the residual variance at 1. But I am having some trouble with the other parameters needed for the prior.
I've specified the R value (the residual structure), but if I specify R I must also specify G, and it is not clear to me how to choose values for it. I've tried putting in default values, but I get error messages:
Error in MCMCglmm(fixed, random = ~species, data = data2, family = "threshold", :
prior$G has the wrong number of structures
I have read the help files, vignettes, and course notes but have not found an example with a binary response, and it is not clear to me how to choose the values for the priors. This is what I have so far:
fixed <- Intro_binary ~ Trait1 + Trait2 + Trait3
Ainv <- inverseA(redTree1)$Ainv
binary_model <- MCMCglmm(fixed, random = ~species, data = data,
                         family = "threshold",
                         ginverse = list(species = Ainv),
                         prior = list(
                             G = list(), # not sure about the parameters for the random effects
                             R = list(V = 1, fix = 1)), # to fix the residual variance at one
                         nitt = 60000, burnin = 10000)
Any help or feedback would be greatly appreciated!
This one is a bit tricky with the information you provide. I'd say you can define G as a "weak" prior using:
priors <- list(R = list(V = 1, fix = 1), # residual variance fixed at 1, as required for family = "threshold"
               G = list(G1 = list(V = 1, nu = 0.002))) # one structure per random term, hence the G1 nesting
binary_model <- MCMCglmm(fixed, random = ~species, data = data,
                         family = "threshold",
                         ginverse = list(species = Ainv),
                         prior = priors,
                         nitt = 60000, burnin = 10000)
However, without more information on your analysis, I strongly suggest you plot your posteriors to have a look at the results and see whether anything looks wrong. Have a look at the MCMCglmm package Course Notes for more info on how to set these priors (especially on what not to do, in section 1.5); you can also find more specific info on how to tune the prior to your model if it fits one of the categories in the tutorial.
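For a quick look at the posteriors (assuming the fitted binary_model above), the mcmc objects stored in the result can be plotted directly:
plot(binary_model$Sol) # trace and density plots of the fixed effects
plot(binary_model$VCV) # trace and density plots of the (co)variance components
summary(binary_model)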
