These are three different ways to run an individual fixed-effects model that give more or less the same results (see below). My main question is how to get predicted probabilities or average marginal effects from the second model (model_plm) or the third model (model_felm). I know how to do it for the first model (model_lm) and show an example below using ggeffects, but that approach only works when I have a small sample.
Since I have over a million individuals, the model is only feasible with model_plm or model_felm. With model_lm the fit takes very long, because the million individuals enter as dummy variables, and I also get the following error: Error: vector memory exhausted (limit reached?). I checked many threads on Stack Overflow to work around that error, but nothing seems to solve it.
I was wondering whether there is an efficient way to work around this issue. My main interest is to extract the predicted probabilities for the interaction residence*union. I usually extract predicted probabilities or average marginal effects using one of these packages: ggeffects, emmeans, or margins.
library(lfe)
library(plm)
library(ggeffects)
data("Males")
model_lm = lm(wage ~ exper + residence + health + residence*union + factor(nr) - 1, data = Males)
model_plm = plm(wage ~ exper + residence + health + residence*union, model = "within", index = c("nr", "year"), data = Males)
model_felm = felm(wage ~ exper + residence + health + residence*union | nr, data = Males)
pred_ggeffects <- ggpredict(model_lm, c("residence", "union"),
                            vcov.fun = "vcovCL",
                            vcov.type = "HC1",
                            vcov.args = list(cluster = Males$nr))
I tried adjusting the formula/dataset so that emmeans and plm play nicely together; let me know if anything here helps. (After some testing I realized my biglm answer below wasn't going to cut it for a million individuals.)
library(emmeans)
library(plm)
data("Males")
## this runs but we need to get an equivalent result with expanded formula
## and expanded dataset
model_plm = plm(wage ~ exper + residence + health + residence*union, model = "within", index = c("nr"), data = Males)
## expanded dataset
Males2 <- data.frame(wage = Males[complete.cases(Males), "wage"],
                     model.matrix(wage ~ exper + residence + health + residence*union, Males),
                     nr = Males[complete.cases(Males), "nr"])
(fmla2 <- as.formula(paste("wage ~ ", paste(names(coef(model_plm)), collapse= "+"))))
## expanded formula
model_plm2 <- plm(fmla2,
                  model = "within",
                  index = c("nr"),
                  data = Males2)
(fmla2_rg <- as.formula(paste("wage ~ -1 +", paste(names(coef(model_plm)), collapse= "+"))))
plm2_rg <- qdrg(fmla2_rg,
                data = Males2,
                coef = coef(model_plm2),
                vcov = vcov(model_plm2),
                df = model_plm2$df.residual)
plm2_rg
### when all 3 residences are 0, that's `rural area`
### then just pick the rows when one of the residences are 1
emmeans(plm2_rg, c("residencenorth_east","residencenothern_central","residencesouth", "unionyes"))
Which gives, after deleting the rows where more than one residence dummy equals 1:
residencenorth_east residencenothern_central residencesouth unionyes emmean SE df lower.CL upper.CL
0 0 0 0 0.3777 0.0335 2677 0.31201 0.443
1 0 0 0 0.3301 0.1636 2677 0.00929 0.651
0 1 0 0 0.1924 0.1483 2677 -0.09834 0.483
0 0 1 0 0.2596 0.1514 2677 -0.03732 0.557
0 0 0 1 0.2875 0.1473 2677 -0.00144 0.576
1 0 0 1 0.3845 0.1647 2677 0.06155 0.708
0 1 0 1 0.3326 0.1539 2677 0.03091 0.634
0 0 1 1 0.3411 0.1534 2677 0.04024 0.642
Results are averaged over the levels of: healthyes
Confidence level used: 0.95
The problem seems to be that when we add -1 to the formula, that creates an extra column in the model matrix that is not included in the regression coefficients. (This is a byproduct of the way that R creates factor codings.)
So I can work around this by adding a strategically placed coefficient of zero. We also have to fix up the covariance matrix the same way:
library(emmeans)
library(plm)
data("Males")
mod <- plm(wage ~ exper + residence + health + residence*union,
           model = "within",
           index = "nr",
           data = Males)
BB <- c(coef(mod)[1], 0, coef(mod)[-1])
k <- length(BB)
VV <- matrix(0, nrow = k, ncol = k)
VV[c(1, 3:k), c(1, 3:k)] <- vcov(mod)
RG <- qdrg(~ -1 + exper + residence + health + residence*union,
           data = Males, coef = BB, vcov = VV, df = df.residual(mod))
Verify that things line up:
> names(RG@bhat)
[1] "exper" ""
[3] "residencenorth_east" "residencenothern_central"
[5] "residencesouth" "healthyes"
[7] "unionyes" "residencenorth_east:unionyes"
[9] "residencenothern_central:unionyes" "residencesouth:unionyes"
> colnames(RG@linfct)
[1] "exper" "residencerural_area"
[3] "residencenorth_east" "residencenothern_central"
[5] "residencesouth" "healthyes"
[7] "unionyes" "residencenorth_east:unionyes"
[9] "residencenothern_central:unionyes" "residencesouth:unionyes"
They do line up, so we can get the results we need:
(EMM <- emmeans(RG, ~ residence * union))
residence union emmean SE df lower.CL upper.CL
rural_area no 0.378 0.0335 2677 0.31201 0.443
north_east no 0.330 0.1636 2677 0.00929 0.651
nothern_central no 0.192 0.1483 2677 -0.09834 0.483
south no 0.260 0.1514 2677 -0.03732 0.557
rural_area yes 0.287 0.1473 2677 -0.00144 0.576
north_east yes 0.385 0.1647 2677 0.06155 0.708
nothern_central yes 0.333 0.1539 2677 0.03091 0.634
south yes 0.341 0.1534 2677 0.04024 0.642
Results are averaged over the levels of: health
Confidence level used: 0.95
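If the interaction itself is the target, contrasts of these EMMs follow in the usual way; for example, a sketch of the union effect within each residence:

contrast(EMM, method = "pairwise", by = "residence")

Because the fixed effects and health average out of these differences, they should match what you would obtain from the corresponding LSDV fit.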
In general, the key is to identify where the added column occurs. It's going to be at the position of the first level of the first factor in the model formula. You can check it by comparing names(coef(mod)) with colnames(model.matrix(formula, data = data)), where formula is the model formula with the intercept removed.
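For instance, a quick sketch of that check using the model above (setdiff should return exactly the inserted column):

form0 <- wage ~ -1 + exper + residence + health + residence*union
## columns of the full dummy coding that have no matching coefficient:
setdiff(colnames(model.matrix(form0, data = Males)), names(coef(mod)))
## expected: "residencerural_area", i.e. the first level of the first factor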
Update: a general function
Here's a function that may be used to create a reference grid for any plm object. It turns out that sometimes these objects do have an intercept (e.g., random-effects models) so we have to check. For models lacking an intercept, you really should use this only for contrasts.
plmrg = function(object, ...) {
    form = formula(formula(object))
    if (!("(Intercept)" %in% names(coef(object))))
        form = update(form, ~ . - 1)
    data = eval(object$call$data, environment(form))
    mmat = model.matrix(form, data)
    sel = which(colnames(mmat) %in% names(coef(object)))
    k = ncol(mmat)
    b = rep(0, k)
    b[sel] = coef(object)
    v = matrix(0, nrow = k, ncol = k)
    v[sel, sel] = vcov(object)
    emmeans::qdrg(formula = form, data = data,
                  coef = b, vcov = v, df = df.residual(object), ...)
}
Test run:
> (rg = plmrg(mod, at = list(exper = c(3,6,9))))
'emmGrid' object with variables:
exper = 3, 6, 9
residence = rural_area, north_east, nothern_central, south
health = no, yes
union = no, yes
> emmeans(rg, "residence")
NOTE: Results may be misleading due to involvement in interactions
residence emmean SE df lower.CL upper.CL
rural_area 0.313 0.0791 2677 0.1579 0.468
north_east 0.338 0.1625 2677 0.0190 0.656
nothern_central 0.243 0.1494 2677 -0.0501 0.536
south 0.281 0.1514 2677 -0.0161 0.578
Results are averaged over the levels of: exper, health, union
Confidence level used: 0.95
This potential solution uses biglm::biglm() to fit the lm model and then uses emmeans::qdrg() with the individual factor nr specified as a nuisance. Does this approach help in your situation?
library(biglm)
library(emmeans)
## the biglm coefficients using factor() with all the `nr` levels has NAs.
## so restrict data to complete cases in the `biglm()` call
model_biglm <- biglm(wage ~ -1 + exper + residence + health + residence*union + factor(nr), data = Males[!is.na(Males$residence), ])
summary(model_biglm)
## double check that biglm and lm give same/similar model
## summary(model_biglm)
## summary(model_lm)
summary(model_biglm)$rsq
summary(model_lm)$r.squared
identical(coef(model_biglm), coef(model_lm)) ## not identical! but plot the coefficients...
head(cbind(coef(model_biglm), coef(model_lm)))
tail(cbind(coef(model_biglm), coef(model_lm)))
plot(cbind(coef(model_biglm), coef(model_lm))); abline(0,1,col="blue")
## do a "[q]uick and [d]irty [r]eference [g]rid and follow examples
### from ?qdrg and https://cran.r-project.org/web/packages/emmeans/vignettes/FAQs.html
rg1 <- qdrg(wage ~ -1 + exper + residence + health + residence*union + factor(nr),
            data = Males,
            coef = coef(model_biglm),
            vcov = vcov(model_biglm),
            df = model_biglm$df.resid,
            nuisance = "nr")
## Since we already specified nuisance in qdrg() we don't in emmeans():
emmeans(rg1, c("residence","union"))
Which gives:
> emmeans(rg1, c("residence","union"))
residence union emmean SE df lower.CL upper.CL
rural_area no 1.72 0.1417 2677 1.44 2.00
north_east no 1.67 0.0616 2677 1.55 1.79
nothern_central no 1.53 0.0397 2677 1.45 1.61
south no 1.60 0.0386 2677 1.52 1.68
rural_area yes 1.63 0.2011 2677 1.23 2.02
north_east yes 1.72 0.0651 2677 1.60 1.85
nothern_central yes 1.67 0.0503 2677 1.57 1.77
south yes 1.68 0.0460 2677 1.59 1.77
Results are averaged over the levels of: 1 nuisance factors, health
Confidence level used: 0.95
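Note that these EMMs sit at a different level than the plm-based ones, because the average of the nr dummies is folded in; differences between cells, though, should agree with the within fit, since both estimators share the same slope coefficients. A quick sketch to check (pairwise contrasts cancel the common constant):

contrast(emmeans(rg1, c("residence", "union")), method = "pairwise")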
I am using an lm()-like function called robu() from the robumeta package inside my own function foo.
I manipulate the formula argument so that when it is missing, the default formula(dint ~ 1) is used; otherwise, whatever formula the user supplies is used.
This works fine; however, in the printed output of foo the formula line always shows the unevaluated expression, Model: if (missing(f)) formula(dint ~ 1) else formula(f), regardless of which formula is actually passed to foo.
Can I correct this part of the output so that it shows the exact formula used? (See the examples below.)
dat <- data.frame(dint = 1:9, SD = 1:9 * .1,
                  time = c(1, 1, 2, 3, 4, 3, 2, 4, 1),
                  study.name = rep(c("bob", "jim", "jon"), 3))
library(robumeta)
# MY FUNCTION:
foo <- function(f, data){
robu(formula = if(missing(f)) formula(dint~1) else formula(f), data = data, studynum = study.name, var = SD^2)
}
# EXAMPLES OF USE:
foo(data = dat) ## HERE I expect: `Model: dint ~ 1`
foo(dint~as.factor(time), data = dat) ## HERE I expect: `Model: dint ~ as.factor(time)`
One option is to update the 'ml' component of the fitted object:
foo <- function(f, data){
  fmla <- if(missing(f)) {
    formula(dint ~ 1)
  } else {
    formula(f)
  }
  model <- robu(formula = fmla, data = data, studynum = study.name, var = SD^2)
  model$ml <- fmla   # overwrite the stored formula so print() shows the real one
  model
}
Checking:
foo(data = dat)
RVE: Correlated Effects Model with Small-Sample Corrections
Model: dint ~ 1
Number of studies = 3
Number of outcomes = 9 (min = 3 , mean = 3 , median = 3 , max = 3 )
Rho = 0.8
I.sq = 96.83379
Tau.sq = 9.985899
Estimate StdErr t-value dfs P(|t|>) 95% CI.L 95% CI.U Sig
1 X.Intercept. 4.99 0.577 8.65 2 0.0131 2.51 7.48 **
---
Signif. codes: < .01 *** < .05 ** < .10 *
---
Note: If df < 4, do not trust the results
foo(dint~ as.factor(time), data = dat)
RVE: Correlated Effects Model with Small-Sample Corrections
Model: dint ~ as.factor(time)
Number of studies = 3
Number of outcomes = 9 (min = 3 , mean = 3 , median = 3 , max = 3 )
Rho = 0.8
I.sq = 97.24601
Tau.sq = 11.60119
Estimate StdErr t-value dfs P(|t|>) 95% CI.L 95% CI.U Sig
1 X.Intercept. 3.98 2.50 1.588 2.00 0.253 -6.80 14.8
2 as.factor.time.2 1.04 4.41 0.236 1.47 0.842 -26.27 28.3
3 as.factor.time.3 1.01 1.64 0.620 1.47 0.617 -9.10 11.1
4 as.factor.time.4 2.52 2.50 1.007 2.00 0.420 -8.26 13.3
---
Signif. codes: < .01 *** < .05 ** < .10 *
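Another option (a sketch, assuming robu() builds its printed "Model:" line by deparsing the formula expression it was called with, which is what the original symptom suggests) is to splice the resolved formula into the call with bquote(), so no component has to be overwritten afterwards; foo2 is a hypothetical name:

foo2 <- function(f, data){
  fmla <- if (missing(f)) formula(dint ~ 1) else formula(f)
  ## splice the actual formula object into the robu() call, so the
  ## "Model:" line deparses the real formula, not the if() expression
  eval(bquote(robu(formula = .(fmla), data = data, studynum = study.name, var = SD^2)))
}
foo2(data = dat)  ## should print: Model: dint ~ 1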
In the R package {Epi}, the ROC() function can generate a plot from the aSAH dataset in the {pROC} package with the following commands:
require(Epi)
require(pROC)
data(aSAH)
rock = ROC(form = outcome ~ s100b, data=aSAH, plot = "ROC", MX = T)
The sensitivity and specificity were calculated for the 51 points contained in rock$res (nrow(rock$res) is 51), whereas nrow(aSAH) is 113.
Which points were used to generate rock$res?
If we were using the roc() function from the {pROC} package instead, we could get them via roc(aSAH$outcome, aSAH$s100b)$thresholds. But being different packages, they are probably different.
The answer is... of course... in the package documentation:
res dataframe with variables sens, spec, pvp, pvn and name of the test
variable. The latter is the unique values of test or linear predictor
from the logistic regression in ascending order with -Inf prepended.
So what are the unique values?
points = unique(aSAH$s100b); length(points)
[1] 50
Fifty of them, plus the prepended -Inf!
Nice inkling, but can we prove it... I think so:
require(Epi)
require(pROC)
data(aSAH)
rock = ROC(form = outcome ~ s100b, data=aSAH, plot = "ROC", MX = T)
d = aSAH
> head(d)
gos6 outcome gender age wfns s100b ndka
29 5 Good Female 42 1 0.13 3.01
30 5 Good Female 37 1 0.14 8.54
31 5 Good Female 42 1 0.10 8.09
points = sort(unique(d$s100b))
> head(points)
[1] 0.03 0.04 0.05 0.06 0.07 0.08
> length(points)
[1] 50
## Logistic regression coefficients:
beta.0 = as.numeric(rock$lr$coefficients[1])
beta.1 = as.numeric(rock$lr$coefficients[2])
## Sigmoid function:
sigmoid = 1 / (1 + exp(-(beta.0 + beta.1 * points)))
sigmoid = c(-Inf, sigmoid)  ## prepend -Inf, as ROC() does
lr.eta = rock$res$lr.eta
length(lr.eta)
head(lr.eta)
head(sigmoid)
> head(lr.eta)
[1] -Inf 0.1663429 0.1732556 0.1803934 0.1877585 0.1953526
> head(sigmoid)
[1] -Inf 0.1663429 0.1732556 0.1803934 0.1877585 0.1953526
## Trying to get the lr.eta number 0.304 on the plot:
> which.max(rowSums(rock$res[, c("sens", "spec")]))
0.30426295405785
              18
## What do we find in row 18 or res?
> rock$res[18,]
sens spec pvp pvn lr.eta
0.30426295405785 0.6341463 0.8055556 0.2054795 0.35 0.304263
## Yet, lr.eta is not the Youden's J statistic or index:
> rock$res[18,"sens"] + rock$res[18,"spec"] - 1
[1] 0.4397019
## Instead, it is the Probability of the outcome at the input with max Youden's index:
## Excluding the "-Inf" introduced by the ROC function (position 17 as opposed to 18):
max.sens.sp.cut = points[17]
> 1 / (1 + exp(-(beta.0 + beta.1 * max.sens.sp.cut)))
[1] 0.304263  ## !!!
Done!
lr.eta is, therefore, the predicted probability of the outcome at the threshold corresponding to the maximum Youden's index.
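As a cross-check (a sketch; note that pROC places its thresholds midway between consecutive observed marker values, so its "best" cutoff will be close to, but not identical with, points[17]):

r.proc = roc(aSAH$outcome, aSAH$s100b)
coords(r.proc, "best", best.method = "youden")  ## Youden-optimal cutoff on the marker scale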
I fitted a trilinear model
library(nlstools)
library(nlsMicrobio)
library(investr) # for plotFit function
trilinear
LOG10N ~ LOG10N0 - (t >= Sl) * (t <= (Sl + (LOG10N0 - LOG10Nres) *
log(10)/kmax)) * kmax * (t - Sl)/log(10) + (t >= Sl) * (t >
(Sl + (LOG10N0 - LOG10Nres) * log(10)/kmax)) * (LOG10Nres -
LOG10N0)
to bacterial survival data
data(survivalcurve1)
survivalcurve1
t LOG10N
1 0.00 7.56
2 0.33 7.41
3 1.00 7.26
4 2.00 7.30
5 3.00 7.26
6 4.00 7.15
7 5.00 7.30
8 6.00 6.48
9 7.00 6.15
10 8.00 5.30
11 9.00 4.78
12 10.00 5.11
13 11.00 2.30
14 13.00 3.15
15 14.00 2.00
16 16.00 1.00
17 18.00 1.00
18 20.00 1.00
19 23.00 1.00
using an OLS fit with nls :
nls = nls(trilinear, survivalcurve1,
          list(Sl = 5, kmax = 1.5, LOG10N0 = 7, LOG10Nres = 1))
overview(nls)
Parameters:
Estimate Std. Error t value Pr(>|t|)
Sl 4.7064 0.5946 7.915 9.82e-07 ***
kmax 1.3223 0.1222 10.818 1.76e-08 ***
LOG10N0 7.3233 0.1884 38.875 < 2e-16 ***
LOG10Nres 1.0000 0.2307 4.334 0.00059 ***
t-based confidence interval:
2.5% 97.5%
Sl 3.4389618 5.973874
kmax 1.0617863 1.582868
LOG10N0 6.9218035 7.724863
LOG10Nres 0.5082284 1.491772
plotFit(nls, interval="confidence")
I was wondering, though, whether I could also fit that model using maximum likelihood on the original (non-log-transformed) cell counts (which in this case would be survivalcurve1$N = 10^survivalcurve1$LOG10N), taking into account that the error structure would be approximately Poisson. Can this perhaps be done using bbmle's mle2, and if so, what would be the correct syntax?
EDIT: I tried with
library(bbmle)
survivalcurve1$N = as.integer(10^survivalcurve1$LOG10N)
trilinearN = formula(N ~ dpois(10^(LOG10N0 - (t >= Sl) * (t <= (Sl + (LOG10N0 - LOG10Nres) *
    log(10)/kmax)) * kmax * (t - Sl)/log(10) + (t >= Sl) * (t > (Sl + (LOG10N0 - LOG10Nres) * log(10)/kmax)) * (LOG10Nres - LOG10N0))))
m1 = mle2(trilinearN, start = list(Sl = 5, kmax = 1.5, LOG10N0 = 7, LOG10Nres = 1), data = survivalcurve1)
and
coef(summary(m1))
gives me
Estimate Std. Error z value Pr(z)
Sl 4.902048 1.669354e-04 2.936495e+04 0
kmax 1.475309 3.210865e-04 4.594739e+03 0
LOG10N0 7.344014 3.785883e-05 1.939842e+05 0
LOG10Nres -1.830498 1.343019e-10 -1.362972e+10 0
Couldn't get plotting the predictions to work though :
df=data.frame(t=seq(0,max(survivalcurve1$t),length=100))
df$pred=predict(m1,newdata=df)
with(df,lines(t,pred,col=2))
as this gave me the error
Error : object of type 'symbol' is not subsettable
Error in gfun(object, newdata = newdata, location = location, op = "predict") :
can only use predict() if formula specified
Any thoughts? Also, how could I tell whether the Poisson mle2 fit is any better than the nls one? (AIC values cannot be compared directly because of the difference in scale.)
PS The geeraerd model would be OK too, in case that would be any easier:
geeraerd
LOG10N ~ LOG10Nres + log10((10^(LOG10N0 - LOG10Nres) - 1) * exp(kmax *
Sl)/(exp(kmax * t) + (exp(kmax * Sl) - 1)) + 1)
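EDIT 2: For what it's worth, the predict() error can be sidestepped by evaluating the trilinear mean manually at the fitted coefficients (a sketch; tri is just the right-hand side of trilinear wrapped as a function):

tri = function(t, Sl, kmax, LOG10N0, LOG10Nres) {
    LOG10N0 -
        (t >= Sl) * (t <= (Sl + (LOG10N0 - LOG10Nres) * log(10)/kmax)) * kmax * (t - Sl)/log(10) +
        (t >= Sl) * (t > (Sl + (LOG10N0 - LOG10Nres) * log(10)/kmax)) * (LOG10Nres - LOG10N0)
}
cf = coef(m1)
df = data.frame(t = seq(0, max(survivalcurve1$t), length = 100))
df$pred = tri(df$t, cf["Sl"], cf["kmax"], cf["LOG10N0"], cf["LOG10Nres"])
with(survivalcurve1, plot(t, LOG10N))
with(df, lines(t, pred, col = 2))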
I need to create a fancy table and export it as a PNG. I'm trying the tables package in R. I need to group "variacion" into the agents ("agentes") who had a positive variation vs. the rest, and show the mean, the sd, and the number of agents who fulfill each condition.
My table is:
agente mes1 mes2 variacion
1 a1 0.50 0.60 0.20000000
2 a2 0.70 0.65 -0.07142857
3 a3 0.60 0.75 0.25000000
4 a4 0.80 0.60 -0.25000000
5 a5 0.78 0.90 0.15384615
My output should be (including format) a table with the mean, the sd, and the number of agents for each of the two groups. You can arrive at those numbers by doing, for example:
sd(t_agentes1$variacion[t_agentes1$variacion <= 0])
whose result, 0.126, is the last number in the sd column of the table.
So in tables library:
library(tables)
X <- t_agentes1$variacion
latex(tabular((X > 0) + (X < 0) + 1
              ~ ((n = 1) + X*(mean + sd + length))))
But I get the error:
non-numeric argument to binary operator
Also, when I try the first example from the package documentation, I get an error as well:
tabular((Species + 1) ~ (n = 1) + Format(digits = 2)*
        (Sepal.Length + Sepal.Width)*(mean + sd), data = iris)
Error in e[[3]] : subscript out of bounds
I really don't understand the parameters of this package. Is there a way to do the grouping? I'm really lost with this so any help would be really appreciated. Thanks.
X <- read.table(header = TRUE, text="agente mes1 mes2 variacion
1 a1 0.50 0.60 0.20000000
2 a2 0.70 0.65 -0.07142857
3 a3 0.60 0.75 0.25000000
4 a4 0.80 0.60 -0.25000000
5 a5 0.78 0.90 0.15384615")
X <- within(X, variation <- factor(variacion > 0, levels = c(TRUE, FALSE),
                                   labels = c('variation > 0',
                                              'variation <= 0')))
library(tables)
# latex(
#   tabular(Heading() * variation ~
#             Justify(l) * (Heading() * Format(digits = 2) * variacion * (mean + sd) + (number = (n = 1))),
#           data = X))
latex(
  tabular(Heading() * variation ~
            Justify(l) * (Heading() * variacion * (Format(digits = 2) * mean + Format(digits = 2) * sd) + (number = (n = 1))),
          data = X))
# mean sd number
# variation $>$ 0 0.20 0.048 3
# variation $\\leq$ 0 -0.16 0.126 2
This gives the table shown in the comments above.
Without prettifying the results, the same numbers come from:
tabular((X > 0) + (X < 0) ~ mean*X + sd*X + length*X)
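For the export-as-PNG part, one possibility (a sketch, assuming a rendered grid table is acceptable in place of compiled LaTeX; the file name and dimensions are arbitrary) is to coerce the tabular object to a character matrix and draw it with gridExtra:

library(gridExtra)
tab <- tabular(Heading() * variation ~
                 Justify(l) * (Heading() * variacion * (Format(digits = 2) * mean + Format(digits = 2) * sd) + (number = (n = 1))),
               data = X)
png("variation_table.png", width = 480, height = 160)
grid.table(as.matrix(tab))  ## as.matrix() flattens the tabular object to characters
dev.off()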