IV for logistic regression with clustered standard errors in R

I have individual-level data to analyze the effect of state-level educational expenditures on individual students' test performance. Performance is a binary variable (0 if the student does not pass the test, 1 if they pass). I run the following glm model with standard errors clustered at the state level:
library(miceadds)
df_logit <- data.frame(performance = c(0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0),
state = c("MA", "MA", "MB", "MC", "MB", "MD", "MA", "MC", "MB", "MD", "MB", "MC", "MA", "MA", "MA", "MA", "MD", "MA","MB","MA","MA","MD","MC","MA","MA","MC","MB","MB","MD", "MB"),
expenditure = c(123000, 123000,654000, 785000, 654000, 468000, 123000, 785000, 654000, 468000, 654000, 785000,123000,123000,123000,123000, 468000,123000, 654000, 123000, 123000, 468000,785000,123000, 123000, 785000, 654000, 654000, 468000,654000),
population = c(0.25, 0.25, 0.12, 0.45, 0.12, 0.31, 0.25, 0.45, 0.12, 0.31, 0.12, 0.45, 0.25, 0.25, 0.25, 0.25, 0.31, 0.25, 0.12, 0.25, 0.25, 0.31, 0.45, 0.25, 0.25, 0.45, 0.12, 0.12, 0.31, 0.1),
left_wing = c(0.10, 0.10, 0.12, 0.18, 0.12, 0.36, 0.10, 0.18, 0.12, 0.36, 0.12, 0.18, 0.10, 0.10, 0.10, 0.10, 0.36, 0.10, 0.12, 0.10, 0.10, 0.36, 0.18, 0.10, 0.10,0.18, 0.12, 0.12, 0.36, 0.12))
df_logit$performance <- as.factor(df_logit$performance)
glm_clust_1 <- miceadds::glm.cluster(data = df_logit,
                                     formula = performance ~ expenditure + population,
                                     cluster = "state",
                                     family = binomial(link = "logit"))
summary(glm_clust_1)
Since I cannot rule out the possibility that expenditures are endogenous, I would like to use the share of left-wing parties at the state level as an instrument for education expenditures.
However, I have not found a command in ivtools or other packages to run two-stage least squares with control variables in a logistic regression with state-level clustered standard errors.
Which commands can I use to extend my logit model with the instrument "left_wing" (also included in the example dataset) and, at the same time, output the common diagnostics such as the Wu-Hausman test or the weak-instruments test (as ivreg does for OLS)?
Ideally, I could adapt the following command to a binary dependent variable and cluster the standard errors at the state level:
iv_1 <- ivreg(performance ~ population + expenditure | left_wing + population, data=df_logit)
summary(iv_1, cluster="state", diagnostics = TRUE)

Try this?
require(mlogit)
require(ivprobit)
test <- ivprobit(performance ~ population | expenditure | left_wing + population, data = df_logit)
summary(test)
I wasn't completely sure about the clustering part, but according to this thread on CrossValidated, it might not be necessary. Please give it a read and let me know what you think.
Essentially, what I understood is that since the likelihood for binary data is already fully specified, there is no need to include the clusters. This is only true when your model is "correct"; if you believe there is something in the joint distribution that is not accounted for, then you should cluster. From my reading, though, it doesn't seem possible to implement clustering on an IV logit model in R.
In terms of the model itself, there is a really good explanation in this SO question: How can I use the "ivprobit" function in the "ivprobit" package in R?
From my reading, there should also be almost no difference between the end results of a logit vs. a probit model.
The basic breakdown of the formula is as follows (a sketch mapping these pieces onto the question's own variables follows the list):
y  = d2: the dichotomous left-hand-side (dependent) variable
x  = ltass + roe + div: the right-hand-side exogenous variables
y1 = eqrat + bonus: the right-hand-side endogenous variables
x2 = tass + roe + div + gap + cfa: the complete set of instruments
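For reference, this is how those pieces line up with the variables in the ivprobit() call above. This is only a sketch of the formula layout; the factor-to-numeric conversion is my assumption, since ivprobit examples are typically shown with a numeric 0/1 outcome:
library(ivprobit)

# y  (dichotomous outcome)      -> performance
# x  (exogenous regressors)     -> population
# y1 (endogenous regressor)     -> expenditure
# x2 (complete instrument set)  -> left_wing + population
df_logit$performance <- as.numeric(as.character(df_logit$performance))  # assumption: 0/1 numeric outcome
test <- ivprobit(performance ~ population | expenditure | left_wing + population,
                 data = df_logit)
summary(test)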
Feel free to comment on, edit, or give feedback on this answer, as I'm definitely not an expert in applications of causal analysis and it's been a long time since I've implemented one. I also have not explored the post-hoc tests available for this final model, so that part is still left open.

Related

Why do the results of Dunn's test in GraphPad Prism and R differ?

I have three sets of data to which I want to apply Dunn's test. However, the test shows different results when performed in GraphPad Prism and in R. I've been reading a little bit about the test here, but I couldn't understand why there is a difference in the p-values. I even tested in R all the methods to adjust the p-value (a sketch of that loop is shown after the code below), but none of them matched the GraphPad Prism result.
Below I present screenshots of the step-by-step procedure in GraphPad Prism and the code I used in R.
library(rstatix)
Day <- rep(1:10, 3)
FLs <- c(rep("FL1", 10), rep("FL2", 10), rep("FL3", 10))
Value <- c(0.2, 0.4, 0.3, 0.2, 0.3, 0.4, 0.2, 0.25, 0.32, 0.21,
0.9, 0.6, 0.7, 0.78, 0.74, 0.81, 0.76, 0.77, 0.79, 0.79,
0.6, 0.58, 0.54, 0.52, 0.39, 0.6, 0.52, 0.67, 0.65, 0.56)
DF <- data.frame(FLs, Day, Value)
Dunn <- DF %>%
  dunn_test(Value ~ FLs,
            p.adjust.method = "bonferroni",
            detailed = TRUE) %>%
  add_significance()
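For reference, trying each of R's built-in adjustment methods can be done in a loop along these lines (just a sketch; it reuses the DF data frame defined above):
library(rstatix)

# Run Dunn's test once per available p-value adjustment method
# and collect the adjusted p-values side by side
methods <- p.adjust.methods  # "holm", "hochberg", "hommel", "bonferroni", "BH", "BY", "fdr", "none"
results <- lapply(methods, function(m) dunn_test(DF, Value ~ FLs, p.adjust.method = m)$p.adj)
names(results) <- methods
do.call(cbind, results)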

R survival survreg not producing a good fit

I am new to using R, and I am trying to use survival analysis in order to find correlation in censored data.
The x data is the envelope mass of protostars. The y data is the intensity of an observed molecular line, and some values are upper limits. The data is:
x <- c(17.299, 4.309, 7.368, 29.382, 1.407, 3.404, 0.450, 0.815, 1.027, 0.549, 0.018)
y <- c(2.37, 0.91, 1.70, 1.97, 0.60, 1.45, 0.25, 0.16, 0.36, 0.88, 0.42)
censor <- c(0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1)
I am using the function survreg from the R survival package:
modeldata<-survreg(formula=Surv(y,censor)~x, dist="exponential", control = list(maxiter=90))
Which gives the following result:
summary(modeldata)
Call:
survreg(formula = Surv(y, censor) ~ x, dist = "exponential",
control = list(maxiter = 90))
             Value Std. Error     z     p
(Intercept) -0.114      0.568 -0.20 0.841
x            0.153      0.110  1.39 0.163
Scale fixed at 1
Exponential distribution
Loglik(model)= -6.9 Loglik(intercept only)= -9
Chisq= 4.21 on 1 degrees of freedom, p= 0.04
Number of Newton-Raphson Iterations: 5
n= 11
However, when I plot the data and the model using the following method:
plot(x,y,pch=(censor+1))
xnew<-seq(0,30)
model<-predict(modeldata,list(x=xnew))
lines(xnew,model,col="red")
I get this plot of x and y data; triangles are censored data
I am not sure where I am going wrong. I have tried different distributions, but all produce similar results. The same is true when I use other data, for example:
x <- c(1.14, 1.14, 1.19, 0.78, 0.43, 0.24, 0.19, 0.16, 0.17, 0.66, 0.40)
I am also not sure if I am interpreting the results correctly.
I have tried other examples using the same method (e.g. https://stats.idre.ucla.edu/r/examples/asa/r-applied-survival-analysis-ch-1/), and it works well, as far as I can tell.
So my questions are:
Am I using the correct function for fitting the data? If not, which would be more suitable?
If it is the correct function, why is the model not fitting the data even closely? Does it have to do with the plotting?
Thank you for your help.
The "shape" of the relationship looks concave downward, so I would have guessed a ~ log(x) fit might be be more appropriate:
dfrm <- data.frame( x = c(17.299, 4.309, 7.368, 29.382, 1.407, 3.404, 0.450, 0.815, 1.027, 0.549, 0.018),
y = c(2.37, 0.91, 1.70, 1.97, 0.60, 1.45, 0.25, 0.16, 0.36, 0.88, 0.42),
censor= c(0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 1))
modeldata<-survreg(formula=Surv(y,censor)~log(x), data=dfrm, dist="loggaussian", control = list(maxiter=90))
Your plotting code seemed appropriate (though with log(x) as the predictor, you may want xnew to start just above zero, since log(0) is -Inf):
png(); plot(y~x,pch=(censor+1),data=dfrm)
xnew<-seq(0,30)
model<-predict(modeldata,list(x=xnew))
lines(xnew,model,col="red"); dev.off()
modeldata
Call:
survreg(formula = Surv(y, censor) ~ log(x), data = dfrm, dist = "loggaussian",
control = list(maxiter = 90))
Coefficients:
(Intercept) log(x)
0.02092589 0.32536509
Scale= 0.7861798
Loglik(model)= -6.6 Loglik(intercept only)= -8.8
Chisq= 4.31 on 1 degrees of freedom, p= 0.038
n= 11

Efficient way to calculate average MAPE and MSE in R

I have real data and predicted data, and I want to calculate the overall MAPE and MSE. The data are time series, with each column representing data for a different week. I predict a value for each of the 52 weeks for each of the items, as shown below. What would be the best way to calculate the overall error in R?
real <- matrix(
  c("item1", "item2", "item3", "item4", .5, .7, 0.40, 0.6, 0.3, 0.29, 0.7, 0.09, 0.42, 0.032, 0.3, 0.37),
  nrow = 4,
  ncol = 4)
colnames(real) <- c("item", "week1", "week2", "week3")

predicted <- matrix(
  c("item1", "item2", "item3", "item4", .55, .67, 0.40, 0.69, 0.13, 0.9, 0.47, 0.19, 0.22, 0.033, 0.4, 0.37),
  nrow = 4,
  ncol = 4)
colnames(predicted) <- c("item", "week1", "week2", "week3")
How do you get the predicted values in the first place? The model used to produce them is probably fitted by minimising some function of the prediction errors (usually the MSE). Therefore, if you calculated the predicted values yourself, the residuals (and hence MSE/MAPE-type metrics) have already been computed somewhere along the line while fitting the model, and you can probably retrieve them directly.
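For example, for many model classes in R the residuals can be pulled straight from the fitted object. Here is a generic sketch using a toy lm() fit on the built-in mtcars data (nothing to do with the question's data, just to show the pattern):
fit    <- lm(mpg ~ wt, data = mtcars)   # stand-in for whatever model produced the predictions
res    <- residuals(fit)                # actual minus fitted values on the training data
actual <- fitted(fit) + res             # reconstruct the observed values
mean(res^2)                             # MSE
mean(abs(res / actual))                 # MAPE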
If the predicted values happened to be thrown into your lap and you have nothing to do with fitting the model, then you calculate MSE and MAPE as per below:
You have only one record per week for every item. So for every item, you can only calculate one prediction error per week. Depending on your application, you can choose to calculate the MSE and MAPE per item or per week.
This is what your data looks like:
real <- matrix(
c(.5, .7, 0.40, 0.6, 0.3, 0.29, 0.7, 0.09, 0.42, 0.032, 0.3, 0.37),
nrow = 4, ncol = 3)
colnames(real) <- c("week1", "week2", "week3")
predicted <- matrix(
c(.55, .67, 0.40, 0.69, 0.13, 0.9, 0.47, 0.19, 0.22, 0.033, 0.4, 0.37),
nrow = 4, ncol = 3)
colnames(predicted) <- c("week1", "week2", "week3")
Calculate the (percentage/squared) errors for every entry:
pred_error <- real - predicted
pct_error <- pred_error/real
squared_error <- pred_error^2
Calculate MSE, MAPE:
# For per-item prediction errors
apply(squared_error, MARGIN = 1, mean) # MSE
apply(abs(pct_error), MARGIN = 1, mean) # MAPE
# For per-week prediction errors (MARGIN = 2 averages over the columns)
apply(squared_error, MARGIN = 2, mean) # MSE
apply(abs(pct_error), MARGIN = 2, mean) # MAPE
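Since the question asks for a single overall figure, you can also average over all entries at once (a small addition, reusing the error matrices above):
mean(squared_error)   # overall MSE across all items and weeks
mean(abs(pct_error))  # overall MAPE across all items and weeks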

Bayes Factor values in the R package BayesFactor

I've followed the instructions on how to run a Bayesian 't-test' using default priors in the BayesFactor package in R.
Some of the returned values are astronomical.
Here is an example comparison with a huge Bayes factor:
#install.packages('BayesFactor')
library(BayesFactor)
condition1 <- c(0.94, 0.9, 0.96, 0.74, 1, 0.98, 0.86, 0.92, 0.918367346938776,
0.96, 0.4, 0.816326530612245, 0.8, 0.836734693877551, 0.56, 0.66,
0.605263157894737, 0.836734693877551, 0.84, 0.9, 0.92, 0.714285714285714,
0.82, 0.5, 0.565217391304348, 0.8, 0.62)
condition2 <- c(0.34, 0.16, 0.23, 0.19, 0.71, 0.36, 0.02, 0.83, 0.11, 0.06,
0.27, 0.347368421052632, 0.21, 0.13953488372093, 0.11340206185567,
0.14, 0.142857142857143, 0.257731958762887, 0.15, 0.29, 0.67,
0.0515463917525773, 0.272727272727273, 0.0895522388059701, 0.0204081632653061,
0.13, 0.0612244897959184)
bf = ttestBF(x = condition1, condition2, paired = TRUE)
bf
This returns:
Bayes factor analysis
--------------
[1] Alt., r=0.707 : 144035108289 ±0%
Against denominator:
Null, mu = 0
---
Bayes factor type: BFoneSample, JZS
For the most part the comparisons range from below 1 up to a few hundred. But I'm concerned that this value (144035108289!) is indicative of something erroneous on my part.
FYI: the p-value in the null-hypothesis test on the same data as above = 4.649279e-14.
Any assurances or insights into this returned BF would be much appreciated.
I calculated the BF by manually entering the t-value and sample size, using the same package:
exp(ttest.tstat(t=14.63, n1=27, rscale = 0.707)[['bf']])
It gives the same BF. The size of the value seems to be driven largely by the very large t statistic (14.63) at this sample size (n = 27). The returned BF appears to be on the up-and-up.
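As a quick cross-check (a small sketch, assuming the bf object from the question is still in the workspace), the numeric value can be extracted from both objects and compared directly:
library(BayesFactor)

bf_numeric <- extractBF(bf)$bf  # numeric BF from the ttestBF() object
bf_from_t  <- exp(ttest.tstat(t = 14.63, n1 = 27, rscale = 0.707)[["bf"]])  # BF reconstructed from the t statistic
c(from_ttestBF = bf_numeric, from_tstat = bf_from_t)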

Trying to fit f distribution to a vector

Would anyone know why the following code fails to execute fitdist, with the error "the function mle failed to estimate the parameters, with the error code 100"?
I have encountered this error in the past when working with the normal distribution; the solution in that case was increasing the variance of the vector (by multiplying it by, say, 100), but that does not help in this case. Please note all elements in the vector are positive. Thank you.
library(fitdistrplus)
VH <- c(0.36, 0.3, 0.36, 0.47, 0, 0.05, 0.4, 0, 0, 0.15, 0.89, 0.03, 0.45, 0.21, 0, 0.18, 0.04, 0.53, 0, 0.68, 0.06, 0.09, 0.58, 0.03, 0.23, 0.27, 0, 0.12, 0.12, 0, 0.32, 0.07, 0.04, 0.07, 0.39, 0, 0.25, 0.28, 0.42, 0.55, 0.04, 0.07, 0.18, 0.17, 0.06, 0.39, 0.65, 0.15, 0.1, 0.32, 0.52, 0.55, 0.71, 0.93, 0, 0.36)
f <- fitdist(na.exclude(VH),"f", start =list(df1=1, df2=2))
The error you get here is actually somewhat informative:
simpleError in optim(par = vstart, fn = fnobj, fix.arg = fix.arg, obs = data, ddistnam = ddistname, hessian = TRUE, method = meth, lower = lower, upper = upper, ...): function cannot be evaluated at initial parameters
Error in fitdist(na.exclude(VH), "f", start = list(df1 = 1, df2 = 2)) :
the function mle failed to estimate the parameters,
with the error code 100
That means something went wrong right away, not in the middle of the optimization process.
Taking a guess, I looked and saw that there was a zero value in your data (so your statement that all the elements are positive is not technically correct -- they're all non-negative ...). The F density is infinite at 0 when df1 < 2: df(0, 1, 2) is Inf.
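You can verify both points quickly (a small check, using the VH vector from the question):
any(VH == 0, na.rm = TRUE)  # TRUE: there are exact zeros in the data
df(0, df1 = 1, df2 = 2)     # Inf: the F density blows up at 0 when df1 < 2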
If I exclude the zero value, I get an answer ...
f <- fitdist(na.exclude(VH[VH>0]),"f", start =list(df1=1, df2=2))
... the estimated value for the second shape parameter is very large (approx. 6e6, with a big uncertainty), but seems to fit OK ...
par(las=1); hist(VH,freq=FALSE,col="gray")
curve(df(x,1.37,6.45e6),add=TRUE)
