I'm trying to use robumeta and I keep running into an error.
I'm using 113 observations on 8 variables:
EM <- read.csv(file="SchoolMotivationRisk.csv", header=TRUE,sep=",")
The eight variables are: studynum yi var.effect.size sei aget1 aget2 permale sexmix.
Doing str(EM) yields 'data.frame': 113 obs. of 8 variables.
The problem is when I go to fit:
res<-robu(formula = yi ~ 1, var.effect.size = var.effect.size, studynum = studynum, modelweights = "CORR", rho= 0.8, small=TRUE, data=EM)
I get the following error:
Error in data.frame(effect.size = mf[, 1], stats::model.matrix(formula, : arguments imply differing number of rows: 113, 0
Is there a way around this?
Also, the output from dput(EM) is at https://pastebin.com/vmMwy1u4
The argument is named var.eff.size, not var.effect.size:
robu(formula = yi ~ 1, var.eff.size = var.effect.size,
studynum = studynum, modelweights = "CORR", rho= 0.8, small=TRUE, data=EM)
#RVE: Correlated Effects Model with Small-Sample Corrections
#Model: yi ~ 1
#Number of studies = 17
#Number of outcomes = 113 (min = 2 , mean = 6.65 , median = 7 , max = 12 )
#Rho = 0.8
#I.sq = 57.54005
#Tau.sq = 0.004609755
# Estimate StdErr t-value dfs P(|t|>) 95% CI.L 95% CI.U Sig
#1 X.Intercept. 0.113 0.0184 6.12 14 0.0000263 0.0733 0.152 ***
#Signif. codes: < .01 *** < .05 ** < .10 *
#Note: If df < 4, do not trust the results
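On why the misspelling produced such an indirect error instead of an "unused argument" message: a plausible explanation is that the misnamed argument was silently absorbed by `...`, leaving the real argument missing. The sketch below shows that mechanism in base R. `fit_stub` is a hypothetical stand-in written for illustration; it is not robumeta's code.

```r
# fit_stub is a hypothetical stand-in, NOT robumeta::robu.
# It accepts `...`, so a misspelled argument name is not rejected by R itself.
fit_stub <- function(formula, var.eff.size, ...) {
  extra <- names(list(...))
  if (missing(var.eff.size)) {
    stop("var.eff.size not supplied; unmatched arguments: ",
         paste(extra, collapse = ", "))
  }
  length(var.eff.size)
}

# Correct argument name: matched to the formal
res_ok <- fit_stub(yi ~ 1, var.eff.size = c(0.1, 0.2))

# Misspelled name: "var.effect.size" is not a prefix of "var.eff.size",
# so partial matching does not apply and the value ends up in `...`
res_bad <- tryCatch(
  fit_stub(yi ~ 1, var.effect.size = c(0.1, 0.2)),
  error = function(e) conditionMessage(e)
)
res_bad
```

A quick way to check the expected names for a real function is names(formals(robu)) or args(robu).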
I would like to use nls to fit a global parameter and group-specific parameters. The closest I have found to a minimal reproducible example is below (found here: https://stat.ethz.ch/pipermail/r-help/2015-September/432020.html)
#Generate some data
d <- transform(data.frame(x=seq(0,1,len=17),
group=rep(c("A","B","B","C"),len=17)), y =
round(1/(1.4+x^ifelse(group=="A", 2.3, ifelse(group=="B",3.1, 3.5))),2))
#Fit to model using nls
nls(y~1/(b+x^p[group]), data=d, start=list(b=1, p=rep(3,length(levels(d$group)))))
This gives me an error:
Error in numericDeriv(form[[3L]], names(ind), env, central = nDcentral) :
Missing value or an infinity produced when evaluating the model
I have not been able to figure out if the error is coming from bad guesses for the starting values, or the way this code is dealing with group-specific parameters. It seems the line with p=rep(3,length(levels(d$group))) is for generating c(3,3,3), but switching this part of the code does not remove the problem (same error obtained as above):
#Fit to model using nls
nls(y~1/(b+x^p[group]), data=d, start=list(b=1, p=c(3, 3, 3)))
Switching to nlsLM gives a different error, which leads me to believe I am having an issue with the group-specific parameters:
#Generate some data
d <- transform(data.frame(x=seq(0,1,len=17),
group=rep(c("A","B","B","C"),len=17)), y =
round(1/(1.4+x^ifelse(group=="A", 2.3, ifelse(group=="B",3.1, 3.5))),2))
#Fit to model using nlsLM
nlsLM(y~1/(b+x^p[group]), data=d, start=list(b=1, p=c(3,3,3)))
Error in dimnames(x) <- dn :
length of 'dimnames' [2] not equal to array extent
Any ideas?
I think you can do this much more easily with nlme::gnls:
fit2 <- nlme::gnls(y~1/(b+x^p),
data = d,
params = list(p~group-1, b~1),
start = list(b=1, p = rep(3,3)))
Generalized nonlinear least squares fit
Model: y ~ 1/(b + x^p)
Data: d
Log-likelihood: 62.05887
p.groupA p.groupB p.groupC b
2.262383 2.895903 3.475324 1.407561
Degrees of freedom: 17 total; 13 residual
Residual standard error: 0.007188101
The params argument allows you to specify fixed-effect submodels for each nonlinear parameter. Using p ~ group-1 parameterizes the model with a separate estimate for each group, rather than fitting a baseline (intercept) value for the first group and the differences between successive groups. (In R's formula language, -1 or +0 signify "fit a model without intercept/set the intercept to 0", which in this case corresponds to fitting all three groups separately.)
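To make the difference between the two parameterizations concrete, here is a small base-R sketch that compares the design matrices R builds for `~ group - 1` and `~ group`. The variable names are illustrative only.

```r
group <- factor(c("A", "B", "B", "C"))

# Without intercept: one indicator column per group,
# so each coefficient is that group's own value
mm_separate <- model.matrix(~ group - 1)

# With intercept (default treatment contrasts): a baseline for group A,
# plus differences of B and C from that baseline
mm_baseline <- model.matrix(~ group)

colnames(mm_separate)
colnames(mm_baseline)
```

With the first form the three p estimates can be read directly as the exponents for groups A, B and C.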
I'm quite surprised that gnls and nls don't give identical results (although both give reasonable results); would like to dig in further ...
Parameter estimates (code below):
term nls gnls
1 b 1.41 1.40
2 pA 2.28 2.28
3 pB 3.19 3.14
4 pC 3.60 3.51
par(las = 1, bty = "l")
plot(y~x, data = d, col = d$group, pch = 16)
xvec <- seq(0, 1, length = 21)
f <- function(x) factor(x, levels = c("A","B","C"))
## fit1 is nls() fit
ll <- function(g, c = 1) {
lines(xvec, predict(fit1, newdata = data.frame(group=f(g), x = xvec)), col = c)
}
Map(ll, LETTERS[1:3], 1:3)
d2 <- expand.grid(x = xvec, group = f(c("A","B","C")))
pp <- predict(fit2, newdata = d2)
ll2 <- function(g, c = 1) {
lines(xvec, pp[d2$group == g], lty = 2, col = c)
}
Map(ll2, LETTERS[1:3], 1:3)
legend("bottomleft", lty = 1:2, col = 1, legend = c("nls", "gnls"))
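library(tidyverse) # dplyr, tidyr, purrr
library(broom) # tidy() for the nls fit
library(broom.mixed) # tidy() for the gnls fit (assumed; the loaded packages were not shown)
(purrr::map_dfr(list(nls=fit1, gnls=fit2), tidy, .id = "pkg")
%>% select(pkg, term, estimate)
%>% group_by(pkg)
## force common parameter names
%>% mutate(across(term, ~ c("b", paste0("p", LETTERS[1:3]))))
%>% pivot_wider(names_from = pkg, values_from = estimate)
)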
I was able to get this by switching the class of the group from chr to factor. Note the addition of factor() when generating the dataset.
> d <- transform(data.frame(
+ x=seq(0,1,len=17),
+ group=rep(factor(c("A","B","B","C")),len=17)),
+ y=round(1/(1.4+x^ifelse(group=="A", 2.3, ifelse(group=="B",3.1, 3.5))),2)
+ )
> str(d)
'data.frame': 17 obs. of 3 variables:
$ x : num 0 0.0625 0.125 0.1875 0.25 ...
$ group: Factor w/ 3 levels "A","B","C": 1 2 2 3 1 2 2 3 1 2 ...
$ y : num 0.71 0.71 0.71 0.71 0.69 0.7 0.69 0.69 0.62 0.64 ...
> nls(y~1/(b+x^p[group]), data=d, start=list(b=1, p=c(3,3,3)))
Nonlinear regression model
model: y ~ 1/(b + x^p[group])
data: d
b p1 p2 p3
1.406 2.276 3.186 3.601
residual sum-of-squares: 9.537e-05
Number of iterations to convergence: 5
Achieved convergence tolerance: 4.536e-06
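The reason the class of group matters can be seen without nls at all. In this base-R sketch (names are illustrative), indexing an unnamed parameter vector by a character vector looks up names that do not exist and returns NA, while indexing by a factor uses the integer level codes. That is consistent with the "Missing value or an infinity" error above.

```r
p <- c(2.3, 3.1, 3.5)               # one exponent per group, unnamed
group_chr <- c("A", "B", "B", "C")  # character, as in the original data
group_fct <- factor(group_chr)      # factor with levels A, B, C

by_chr <- p[group_chr]  # name lookup on an unnamed vector: all NA
by_fct <- p[group_fct]  # integer codes 1, 2, 2, 3 pick the right elements

by_chr
by_fct
```

An alternative that keeps the column as character would be to index with as.integer(factor(group)) inside the model formula, though converting the column once is simpler.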
I would like to perform a Cox regression for the following questions: A group of patients receives a treatment "drug" or not (0 / 1). My time variable "time" tells me, how many days the patient is observed and "status" if the patient survived or died (died = 1, survived = 0).
library(survival) # Surv(), coxph()
df <- data.frame(time = round(runif(100, min = 1, max = 70)),
status = round(runif(100, min = 0, max = 1)),
drug = round(runif(100, min = 0, max = 1)),
age40 = round(runif(100, min = 0, max = 1)),
stringsAsFactors = FALSE)
object <- Surv(df$time, df$status)
model <- coxph(object ~ drug, data = df)
This works fine for me and tells me that the HR is 0.89, so the drug appears to protect patients from dying.
Now I want to do some subgroup analysis, e.g. how does the HR change if the patient is <= 40 years or > 40 years old (age40: 0 vs 1)?
Is including the variable "age40" in the coxph call all I have to do?
object2 <- Surv(df$time, df$status)
model2 <- coxph(object2 ~ drug + age40, data = df)
If I do that my HR in the summary for drug1 slightly changes to 0.86 and I get another one for age40 (1.12).
Now my question is: What are the hazard ratios for dying under treatment (drug = 1) if the patient is <= 40 or > 40 years old?
EDIT: Another question would be how to graphically show the different HRs of the effect of drug on status in a forest plot, e.g. like this: https://rpkgs.datanovia.com/survminer/reference/ggforest-2.png.
Instead of "sex", "rx", "adhere" etc. I would like to show the HRs for Age40 = 0 vs. 1 and other variables as well, like hypertension = 0 vs. 1, smoker = 0 vs. 1.
Thank you!
The function you need to use is predict on your model2, and it needs to be supplied with a newdata argument that includes all the cases that you want to consider:
exp( predict(model2, newdata=expand.grid(drug=c(0,1), age40=c(0,1))) )
# 1 2 3 4
#1.0000000 0.8564951 1.1268713 0.9651598
You now have all 4 possible combinations of drug and age40. The base case has a value of unity because you are estimating risk ratios from a baseline case of {drug=0, age40=0}. You can see which combination each of the other risk ratios is associated with:
expand.grid(drug=c(0,1), age40=c(0,1))
drug age40
1 0 0
2 1 0
3 0 1
4 1 1
Notice that the ratio of drug=1 to drug=0 is the same for each age category considered separately. If you had wanted to see whether the effect of drug was different in the two age categories, you would have used an interaction model:
model3 <- coxph(object2 ~ drug * age40, data = df)
coxph(formula = object2 ~ drug * age40, data = df)
n= 100, number of events= 50
coef exp(coef) se(coef) z Pr(>|z|)
drug -0.18524 0.83091 0.45415 -0.408 0.683
age40 0.09611 1.10089 0.39560 0.243 0.808
drug:age40 0.05679 1.05843 0.63094 0.090 0.928
exp(coef) exp(-coef) lower .95 upper .95
drug 0.8309 1.2035 0.3412 2.024
age40 1.1009 0.9084 0.5070 2.390
drug:age40 1.0584 0.9448 0.3073 3.645
Concordance= 0.528 (se = 0.042 )
Likelihood ratio test= 0.34 on 3 df, p=1
Wald test = 0.33 on 3 df, p=1
Score (logrank) test = 0.33 on 3 df, p=1
And the effect estimates are now a bit different:
exp( predict(model3, newdata=expand.grid(drug=c(0,1), age40=c(0,1))) )
# 1 2 3 4
#1.0000000 0.8309089 1.1008850 0.9681861
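As a sketch of how the subgroup-specific hazard ratios for drug fall out of the interaction model, the coefficients printed above can be combined by hand. This assumes age40 = 0 means <= 40 years, as stated in the question; the numbers are copied from the model3 output, and the variable names are illustrative.

```r
# Coefficients copied from the model3 output above
b <- c(drug = -0.18524, age40 = 0.09611, "drug:age40" = 0.05679)

# HR for drug (1 vs 0) among patients with age40 = 0
hr_drug_age0 <- exp(b[["drug"]])

# HR for drug (1 vs 0) among patients with age40 = 1:
# main effect plus interaction term
hr_drug_age1 <- exp(b[["drug"]] + b[["drug:age40"]])

round(c(age40_0 = hr_drug_age0, age40_1 = hr_drug_age1), 4)
```

These match the ratios of the predicted values above (row 2 / row 1 and row 4 / row 3). Confidence intervals for the second HR would additionally need the covariance between the two coefficients, e.g. from vcov(model3).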
Use strata() in the model formula:
coxph(object ~ drug + strata(age40), data = df)
I am using an lm()-like function called robu() from the robumeta package within my own function foo.
However, I'm manipulating the formula argument such that when it is missing the default formula would be: formula(dint~1) or else any formula that user defines.
It works fine, however, in the output of foo the printed formula call always is: Model: missing(f) if formula(dint ~ 1) regardless of what formula is inputted in the foo.
Can I correct this part of output so that it only shows the exact formula used? (see below examples)
dat <- data.frame(dint = 1:9, SD = 1:9*.1,
time = c(1,1,2,3,4,3,2,4,1),
study.name = rep(c("bob", "jim", "jon"), 3))
foo <- function(f, data){
robu(formula = if(missing(f)) formula(dint~1) else formula(f), data = data, studynum = study.name, var = SD^2)
}
foo(data = dat) ## HERE I expect: `Model: dint ~ 1`
foo(dint~as.factor(time), data = dat) ## HERE I expect: `Model: dint ~ time`
One option is to update the 'ml' object
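foo <- function(f, data){
  # resolve the formula first, so a plain formula object is passed on
  fmla <- if(missing(f)) {
    formula(dint ~ 1)
  } else {
    formula(f)
  }
  model <- robu(formula = fmla, data = data, studynum = study.name, var = SD^2)
  model$ml <- fmla # overwrite the stored formula used when printing
  model
}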
foo(data = dat)
RVE: Correlated Effects Model with Small-Sample Corrections
Model: dint ~ 1
Number of studies = 3
Number of outcomes = 9 (min = 3 , mean = 3 , median = 3 , max = 3 )
Rho = 0.8
I.sq = 96.83379
Tau.sq = 9.985899
Estimate StdErr t-value dfs P(|t|>) 95% CI.L 95% CI.U Sig
1 X.Intercept. 4.99 0.577 8.65 2 0.0131 2.51 7.48 **
Signif. codes: < .01 *** < .05 ** < .10 *
Note: If df < 4, do not trust the results
foo(dint~ as.factor(time), data = dat)
RVE: Correlated Effects Model with Small-Sample Corrections
Model: dint ~ as.factor(time)
Number of studies = 3
Number of outcomes = 9 (min = 3 , mean = 3 , median = 3 , max = 3 )
Rho = 0.8
I.sq = 97.24601
Tau.sq = 11.60119
Estimate StdErr t-value dfs P(|t|>) 95% CI.L 95% CI.U Sig
1 X.Intercept. 3.98 2.50 1.588 2.00 0.253 -6.80 14.8
2 as.factor.time.2 1.04 4.41 0.236 1.47 0.842 -26.27 28.3
3 as.factor.time.3 1.01 1.64 0.620 1.47 0.617 -9.10 11.1
4 as.factor.time.4 2.52 2.50 1.007 2.00 0.420 -8.26 13.3
Signif. codes: < .01 *** < .05 ** < .10 *
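The same effect can be reproduced with base R's lm(), which may help to see where the odd label comes from: the fitted object records the unevaluated expression that was passed as the formula argument. This is only an analogy; robu's internals may differ, and `foo_lm` is a hypothetical wrapper written for this illustration.

```r
dat <- data.frame(dint = 1:9, time = c(1, 1, 2, 3, 4, 3, 2, 4, 1))

# Hypothetical wrapper: passes an `if` expression as the formula argument
foo_lm <- function(f, data) {
  lm(formula = if (missing(f)) dint ~ 1 else f, data = data)
}

fit <- foo_lm(data = dat)

# The recorded call keeps the unevaluated expression ...
stored_call <- deparse(fit$call$formula)

# ... whereas the formula actually used is recovered from the model terms
used_formula <- deparse(formula(fit))

stored_call
used_formula
```

Resolving the formula into a variable before the modelling call, as in the answer above, avoids relying on what the function chooses to store.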
I am writing a paper about the validity of a billing code in hospitalized children. I am a very novice RStudio user. I need the confidence intervals for the sensitivity, specificity, and positive and negative predictive values, but I can't figure out how to get them.
My data has 3 columns : ID, true value, billing value
Here is my code:
confusionMatrix(table(finalcodedataset$billing_value, finalcodedataset$true_value),
positive="1", boot=TRUE, boot_samples=4669, alpha=0.05)
here is the output:
Confusion Matrix and Statistics
0 1
0 4477 162
1 10 20
Accuracy : 0.9632
95% CI : (0.9574, 0.9684)
No Information Rate : 0.961
P-Value [Acc > NIR] : 0.238
Kappa : 0.1796
Mcnemar's Test P-Value : <2e-16
Sensitivity : 0.109890
Specificity : 0.997771
Pos Pred Value : 0.666667
Neg Pred Value : 0.965079
Prevalence : 0.038981
Detection Rate : 0.004284
Detection Prevalence : 0.006425
Balanced Accuracy : 0.553831
'Positive' Class : 1
You can use the epiR package for this purpose.
library(epiR)
data <- as.table(matrix(c(670,202,74,640), nrow = 2, byrow = TRUE))
rval <- epi.tests(data, conf.level = 0.95)
print(rval)
Outcome + Outcome - Total
Test + 670 202 872
Test - 74 640 714
Total 744 842 1586
Point estimates and 95 % CIs:
Apparent prevalence 0.55 (0.52, 0.57)
True prevalence 0.47 (0.44, 0.49)
Sensitivity 0.90 (0.88, 0.92)
Specificity 0.76 (0.73, 0.79)
Positive predictive value 0.77 (0.74, 0.80)
Negative predictive value 0.90 (0.87, 0.92)
Positive likelihood ratio 3.75 (3.32, 4.24)
Negative likelihood ratio 0.13 (0.11, 0.16)
Caret and other packages use the Clopper-Pearson Interval method to calculate the confidence interval.
I consider your 2x2 reversed since the TP (True Positive) is on the bottom right. If the TP is at the top left then variables (A,B,C,D) would be switched.
D = 4477
C = 162
B = 10
A = 20
Acc = (A+D)/(A+B+C+D)
Sensitivity = A / (A + C)
Specificity = D / (D + B)
P = (A+C)/(A+B+C+D)
PPV = (Sensitivity*P)/((Sensitivity*P)+((1-Specificity)*(1-P)))
NPV = (Specificity*(1-P))/(((1 - Sensitivity)*P)+((Specificity)*(1-P)))
n = A+B+C+D
x = n - (A+D)
alpha = 0.05
ub = 1 - ((1 + (n - x + 1)/ (x * qf(alpha *.5, 2*x, 2*(n - x + 1))))^-1)
lb = 1 - ((1 + (n - x) / ((x + 1)* qf(1-(alpha*.5), 2*(x+1), 2*(n-x))))^-1)
CI = c(lb,ub)
> Acc
[1] 0.9631613
> CI
[1] 0.9573536 0.9683800
> Sensitivity
[1] 0.1098901
> Specificity
[1] 0.9977713
> PPV
[1] 0.6666667
> NPV
[1] 0.9650787
Here is also a good resource for where these formulas come from.
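For the four quantities the question actually asks about, one option is to apply the same exact (Clopper-Pearson) interval to each proportion with base R's binom.test(). This is a sketch using the counts from the question's table; the helper `exact_ci` is a name introduced here, and treating PPV and NPV as simple proportions assumes the sample prevalence is representative of the population of interest.

```r
# Counts from the question's table (rows = billing value, columns = true value)
TP <- 20; FN <- 162; FP <- 10; TN <- 4477

# exact_ci is a helper defined for this sketch: Clopper-Pearson CI for x / n
exact_ci <- function(x, n, level = 0.95) {
  bt <- binom.test(x, n, conf.level = level)
  c(estimate = unname(bt$estimate),
    lower = bt$conf.int[1],
    upper = bt$conf.int[2])
}

ci_table <- rbind(
  sensitivity = exact_ci(TP, TP + FN),
  specificity = exact_ci(TN, TN + FP),
  ppv         = exact_ci(TP, TP + FP),
  npv         = exact_ci(TN, TN + FN),
  accuracy    = exact_ci(TP + TN, TP + FN + FP + TN)
)
round(ci_table, 4)
```

The accuracy row is included as a cross-check against the interval computed by hand above.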
The following reproducible example is partially inspired by "ROC curve from training data in caret".
library(caret)
data(Sonar, package = "mlbench")
ctrl <- trainControl(method = "cv", summaryFunction = twoClassSummary, classProbs = TRUE, savePredictions = TRUE)
fit1 <- train(Class ~ ., data = Sonar, method = "rf", trControl = ctrl)
bestmodel <- merge(fit1$bestTune, fit1$pred)
mtx <- confusionMatrix(table(bestmodel$pred, bestmodel$obs))$table
# M R
# M 104 23
# R 7 74
# 95% Confidence Interval
## Sensitivity
sens_errors <- sqrt(sensitivity(mtx) * (1 - sensitivity(mtx)) / sum(mtx[,1]))
sensLower <- sensitivity(mtx) - 1.96 * sens_errors
sensUpper <- sensitivity(mtx) + 1.96 * sens_errors
## Specificity
spec_errors <- sqrt(specificity(mtx) * (1 - specificity(mtx)) / sum(mtx[,2]))
specLower <- specificity(mtx) - 1.96 * spec_errors
specUpper <- specificity(mtx) + 1.96 * spec_errors
## Positive Predictive Values
ppv_errors <- sqrt(posPredValue(mtx) * (1 - posPredValue(mtx)) / sum(mtx[1,]))
ppvLower <- posPredValue(mtx) - 1.96 * ppv_errors
ppvUpper <- posPredValue(mtx) + 1.96 * ppv_errors
## Negative Predictive Values
npv_errors <- sqrt(negPredValue(mtx) * (1 - negPredValue(mtx)) / sum(mtx[2,]))
npvLower <- negPredValue(mtx) - 1.96 * npv_errors
npvUpper <- negPredValue(mtx) + 1.96 * npv_errors
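The four blocks above repeat the same normal-approximation (Wald) interval, so they can be folded into one helper. The sketch below does that using the counts from the confusion matrix shown in the comments; `wald_ci` is a name introduced here, and it uses qnorm() rather than the rounded 1.96, so results may differ in the later decimals.

```r
# wald_ci is a helper defined for this sketch: p +/- z * sqrt(p (1 - p) / n)
wald_ci <- function(x, n, level = 0.95) {
  p  <- x / n
  z  <- qnorm(1 - (1 - level) / 2)
  se <- sqrt(p * (1 - p) / n)
  c(estimate = p, lower = p - z * se, upper = p + z * se)
}

# Confusion matrix from the comments above (rows = predicted, columns = observed)
mtx <- matrix(c(104, 7, 23, 74), nrow = 2,
              dimnames = list(pred = c("M", "R"), obs = c("M", "R")))

sens <- wald_ci(mtx["M", "M"], sum(mtx[, "M"]))  # positive class is "M"
spec <- wald_ci(mtx["R", "R"], sum(mtx[, "R"]))
ppv  <- wald_ci(mtx["M", "M"], sum(mtx["M", ]))
npv  <- wald_ci(mtx["R", "R"], sum(mtx["R", ]))

round(rbind(sens, spec, ppv, npv), 4)
```

Note that Wald intervals can extend outside [0, 1] when a proportion is close to 0 or 1 or the denominator is small.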
I'm trying to reproduce the 95% CI that Stata produces when you run a model with clustered standard errors. For example:
regress api00 acs_k3 acs_46 full enroll, cluster(dnum)
Regression with robust standard errors Number of obs = 395
F( 4, 36) = 31.18
Prob > F = 0.0000
R-squared = 0.3849
Number of clusters (dnum) = 37 Root MSE = 112.20
| Robust
api00 | Coef. Std. Err. t P>|t| [95% Conf. Interval]
acs_k3 | 6.954381 6.901117 1.008 0.320 -7.041734 20.9505
acs_46 | 5.966015 2.531075 2.357 0.024 .8327565 11.09927
full | 4.668221 .7034641 6.636 0.000 3.24153 6.094913
enroll | -.1059909 .0429478 -2.468 0.018 -.1930931 -.0188888
_cons | -5.200407 121.7856 -0.043 0.966 -252.193 241.7922
I am able to reproduce the coefficients and the standard errors:
library(sandwich) # estfun(), sandwich()
library(lmtest) # coeftest()
clustered.se <- function(model_result, data, cluster) {
model_variables <-
intersect(colnames(data), c(colnames(model_result$model), cluster))
model_rows <- rownames(model_result$model)
data <- data[model_rows, model_variables]
cl <- data[[cluster]]
M <- length(unique(cl))
N <- nrow(data)
K <- model_result$rank
dfc <- (M / (M - 1)) * ((N - 1) / (N - K))
uj <-
apply(estfun(model_result), 2, function(x)
tapply(x, cl, sum))
vcovCL <- dfc * sandwich(model_result, meat = crossprod(uj) / N)
standard.errors <- coeftest(model_result, vcov. = vcovCL)[, 2]
p.values <- coeftest(model_result, vcov. = vcovCL)[, 4]
clustered.se <-
list(vcovCL = vcovCL,
standard.errors = standard.errors,
p.values = p.values)
return(clustered.se)
}
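To get a feel for the size of the finite-sample correction dfc used in the function above, it can be evaluated with the numbers from the Stata output (37 clusters, 395 observations) and K = 5 coefficients including the intercept. This is only a worked arithmetic check, not part of the original code.

```r
M <- 37   # number of clusters (dnum)
N <- 395  # number of observations
K <- 5    # model rank: 4 predictors + intercept

# Same formula as in clustered.se()
dfc <- (M / (M - 1)) * ((N - 1) / (N - K))

# dfc scales the covariance matrix, so standard errors scale by sqrt(dfc)
se_inflation <- sqrt(dfc)

c(dfc = dfc, se_inflation = se_inflation)
```

So the correction inflates the clustered standard errors by a little under 2% in this example.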
elemapi2 <- read.dta13(file = 'elemapi2.dta')
lm1 <-
lm(formula = api00 ~ acs_k3 + acs_46 + full + enroll,
data = elemapi2)
clustered_se <-
clustered.se(model_result = lm1,
data = elemapi2,
cluster = "dnum")
override.se = clustered_se$standard.errors,
override.p = clustered_se$p.value,
star.symbol = "\\*",
digits = 7
Model 1
(Intercept) -5.2004067
acs_k3 6.9543811
acs_46 5.9660147 *
full 4.6682211 ***
enroll -0.1059909 *
R^2 0.3848830
Adj. R^2 0.3785741
Num. obs. 395
RMSE 112.1983218
*** p < 0.001, ** p < 0.01, * p < 0.05
Alas, I cannot reproduce the 95% confidence Interval:
override.se = clustered_se$standard.errors,
override.p = clustered_se$p.value,
digits = 7,
ci.force = TRUE
Model 1
(Intercept) -5.2004067
[-243.8957845; 233.4949710]
acs_k3 6.9543811
[ -6.5715605; 20.4803228]
acs_46 5.9660147 *
[ 1.0051987; 10.9268307]
full 4.6682211 *
[ 3.2894567; 6.0469855]
enroll -0.1059909 *
[ -0.1901670; -0.0218148]
R^2 0.3848830
Adj. R^2 0.3785741
Num. obs. 395
RMSE 112.1983218
* 0 outside the confidence interval
If I do it 'by hand', I get the same thing as with texreg:
level <- 0.95
a <- 1-(1 - level)/2
coeff <- lm1$coefficients
se <- clustered_se$standard.errors
lb <- coeff - qnorm(a)*se
ub <- coeff + qnorm(a)*se
> lb
(Intercept) acs_k3 acs_46 full enroll
-243.895784 -6.571560 1.005199 3.289457 -0.190167
> ub
(Intercept) acs_k3 acs_46 full enroll
233.49497100 20.48032276 10.92683074 6.04698550 -0.02181481
What is Stata doing and how can I reproduce it in R?
PS: This is a follow up question.
PS2: The Stata data is available here.
It looks like Stata is using confidence intervals based on t(36) rather than Z (i.e. Normal errors).
Taking the values from the Stata output
coef=6.954381; rse= 6.901117 ; lwr= -7.041734; upr= 20.9505
(upr-coef)/rse
## [1] 2.028095
(lwr-coef)/rse
## [1] -2.028094
Computing/cross-checking the tail values for t(36):
pt(2.028094,36)
## [1] 0.975
qt(0.975,36)
## [1] 2.028094
I don't know how you pass confidence intervals to texreg. Since you haven't given a reproducible example (I don't have elemapi2.dta) I can't say exactly how you would get the df, but it looks like you would want tdf <- length(unique(elemapi2$dnum))-1 (37 clusters minus 1, i.e. 36):
level <- 0.95
a <- 1- (1 - level)/2
se <- clustered_se$standard.errors
bounds <- cbind(lwr = coef(lm1) - qt(a,tdf)*se, upr = coef(lm1) + qt(a,tdf)*se)
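As a check that the choice of distribution accounts for the whole discrepancy, the acs_k3 interval can be recomputed both ways from the coefficient and robust standard error printed by Stata. This is a small sketch using only base R; the variable names are illustrative.

```r
# acs_k3 row of the Stata output
coef_acs_k3 <- 6.954381
se_acs_k3   <- 6.901117
tdf         <- 37 - 1   # number of clusters minus 1

# Interval based on t(36), as Stata appears to use
ci_t <- coef_acs_k3 + c(-1, 1) * qt(0.975, tdf) * se_acs_k3

# Interval based on the normal distribution, as in the 'by hand' calculation
ci_z <- coef_acs_k3 + c(-1, 1) * qnorm(0.975) * se_acs_k3

rbind(t36 = ci_t, normal = ci_z)
```

The t-based row agrees with Stata's [-7.041734, 20.9505], and the normal-based row agrees with the [-6.5715605, 20.4803228] interval from the question.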
Indeed, Stata is using the t distribution rather than the normal distribution. There is now a really easy way to get confidence intervals that match Stata into texreg: use lm_robust from the estimatr package, which you can install from CRAN with install.packages("estimatr").
> library(estimatr)
> library(texreg)
> lmro <- lm_robust(mpg ~ hp, data = mtcars, clusters = cyl, se_type = "stata")
> screenreg(lmro)
Model 1
(Intercept) 30.10 *
[13.48; 46.72]
hp -0.07
[-0.15; 0.01]
R^2 0.60
Adj. R^2 0.59
Num. obs. 32
RMSE 3.86
* 0 outside the confidence interval