I'm trying to use the R package "ipw" for inverse probability weighting. I have some columns named "covar.1", "covar.2", "covar.3", ..., so I want to build a formula for them programmatically. From a previous question I got this working with glm, matchit, and other functions, but with ipw it doesn't work. It does work if I manually copy and paste what print(f1) outputs into the denominator, so I also tried without as.formula, but that still doesn't work. To reproduce, run:
library(ipw)
betaz <- c(0.75, -0.5, 0.25)
betay <- c(0.5, 1.0, -1.5)
X <- matrix(rnorm(3 * 250), 250)
ps <- pnorm(X %*% betaz)
Z <- rbinom(250, 1, ps)
epsilon <- rnorm(250, 0.0, 0.5)
Y0 <- X %*% betay + epsilon
Y1 <- X %*% betay + 0.5 + epsilon
Y <- Y0 * (1 - Z) + Y1 * Z
df <- data.frame(id = seq(250), covar = X, group = Z, metric = Y)
print(df[1:10,])
cols <- colnames(df)
covars <- cols[grep("covar", colnames(df))]
f <- as.formula(paste('group','~', paste(covars, collapse="+")))
psmodel <- glm(f, family = binomial(), data=df)
pscore <- psmodel$fitted.values
f1 <- as.formula(paste('~', paste(covars, collapse="+")))
print(f1)
weightmodel <- ipwpoint(
  exposure = group, family = "binomial", link = "logit",
  denominator = f1,
  data = df, trunc = .01
)
With as.formula, it complains about object 'groupf1' not found. I'm not sure why it is concatenating the names like that. Basically, I need a way to set f1 dynamically from a variable.
From traceback() I can see the relevant source code:
glm(formula = eval(
      parse(
        text = paste(
          deparse(tempcall$exposure, width.cutoff = 500),
          deparse(tempcall$denominator, width.cutoff = 500), sep = ""))),
    family = lf, data = data, na.action = na.fail, ...)
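Evaluating just that paste() step shows where the name 'groupf1' comes from (a minimal sketch using quoted symbols to mimic what ipwpoint() captures with match.call()):
tempcall <- list(exposure = quote(group), denominator = quote(f1))
paste(deparse(tempcall$exposure, width.cutoff = 500),
      deparse(tempcall$denominator, width.cutoff = 500), sep = "")
## [1] "groupf1"
So when denominator is a variable rather than a literal formula, only the variable's name is pasted into the glm() call, and evaluating that text fails.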
R masters, help needed please. What form does this denominator argument want?
ipw is written in such a way that it's very hard to enter a formula dynamically. This was one of the motivations I had to write the WeightIt package, which has the same functionality (in all but a few rare cases). In addition, in my cobalt package, there is the function f.build() which creates a formula from its inputs.
You can replace the last several lines of your code with the following:
f1 <- f.build("group", covars)
w.out <- weightit(f1, data = df, estimand = "ATE")
w.out2 <- trim(w.out, .01, lower = TRUE)
Here, f1 is your formula, created by f.build. This way you can cycle through multiple treatment variables in the first argument. The second argument can be either a vector of names of covariates or the data.frame of the covariates themselves. w.out is the weightit object containing the weights estimated by weightit(). The default is logistic regression, but this can be changed. (I noticed the true treatment propensities were generated using a probit model, which can be requested in weightit with link = "probit".)
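For example, a probit propensity score model could be requested like so (a sketch; as noted above, link = "probit" is accepted by weightit()):
w.out.probit <- weightit(f1, data = df, estimand = "ATE", link = "probit")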
It seems like you wanted to truncate the weights at the 1st and 99th percentiles, which is what trim does. By default, it only trims the highest weights, so I set lower = TRUE to also trim the lower weights. In general, you should check covariate balance and the variability of your weights before trimming, in case the untrimmed weights are sufficient. cobalt is designed to assess balance and is compatible with WeightIt. Below is how you could assess balance on a weightit object:
bal.tab(w.out, un = TRUE)
You can compare the trimmed and untrimmed weights too:
bal.tab(f1, data = df, un = TRUE,
        weights = list(untrimmed = w.out$weights,
                       trimmed = w.out2$weights))
When you're ready to estimate your treatment effect, you can just extract the weights from the weightit object. I use the jtools package to get robust standard errors, which are a must with PS weighting:
w1 <- w.out$weights
jtools::summ(lm(metric ~ group, data = df, weights = w1),
             robust = TRUE, confint = TRUE)
There is a good deal of documentation on WeightIt and cobalt. I hope you find them useful!
Hope all is well. I have been exploring a few options for constraining a spline function so that it not only stays positive but also stays above the lowest value of y in the data frame. I am assuming there is a penalized spline function out there whose shape can readily be constrained, though I have not found or tried one yet. I have also tried nls with an exponential decay function, which works; however, the last estimated point is much higher than desired (I would like it to pass through, or be closer to, the final value of y). See the code below with the options I have tried. The ultimate goal is to fit a spline that passes through all points and never drops below the lowest value of y at any point, while acknowledging that there are only 5 data points. Thanks in advance for the help.
library(tidyverse)
library(broom)
library(gnm)
library(cobs)
library(zoo)
DF <- data.frame(x = seq(1,5,1),y=c(26419753,9511111,3566667,57993,52194))
t=1:5
# option 1a and 1b: preferred method which is fitting a spline function
mod1a <- splinefun(DF$x,DF$y)
curve(mod1a, 1,5)
pred_interval_mod1a <- seq(1,5,length = 40)
mod1a(pred_interval_mod1a) # dips negative near the end; it should remain larger than y = 52194
mod1b <- cobs(x= DF$x,y = DF$y,pointwise=rbind(c(0,52194,-1),c(0,26419753,1)))
pred_interval_mod1b <- seq(1,5,length = 40)
predict(mod1b, pred_interval_mod1b)
# option 2: NLS for exponential decay with starting values
mod2 <- nls(y ~ SSasymp(t, yf, y0, log_alpha), data = DF)
qplot(t, y, data = augment(mod2)) + geom_line(aes(y = .fitted))
# option 3: similar NLS premise but with lower defined
mod3 <- nls(y ~ yf + (y0 - yf) * exp(-alpha * t), data = DF,
            start = list(y0 = 26419753, yf = 52194, alpha = 1),
            lower = c(-Inf, 52194, -Inf), algorithm = "port")
# option 4: similar to 2 and 3
a=log(52194)
mod4 <- gnm(y ~ Exp(1 + t) - 1, verbose = FALSE, constrain = "Exp(.+x).Intercept",
            constrainTo = a, start = c(a, -0.05), data = DF)
mod4_df <- data.frame(t = seq(1,5,by=1))
mod4_pred <- predict(mod4,newdata=mod4_df)
mod4_pred
I am working on predicting a GAM model with a random effect in order to produce a 3D surface plot with plot_ly.
Here is my code:
library(mgcv)

x <- runif(100)
y <- runif(100)
z <- x^2 + y + rnorm(100)
r <- rep(1,times=100) # random effect
r[51:100] <- 2 # replace 1 into 2, making two groups
df <- data.frame(x, y, z, r)
gam_fit <- gam(z ~ s(x) + s(y) + s(r,bs="re"), data = df) # fit
# create matrix data for the `add_surface` function in `plot_ly`
newx <- seq(0, 1, len=20)
newy <- seq(0, 1, len=30)
newxy <- expand.grid(x = newx, y = newy)
z <- matrix(predict(gam_fit, newdata = newxy), 20, 30) # predict data as matrix
However, the last line results in an error:
Error in model.frame.default(ff, data = newdata, na.action = na.act) :
variable lengths differ (found for 'r')
In addition: Warning message:
In predict.gam(gam_fit, newdata = newxy) :
not all required variables have been supplied in newdata!
Thanks to the previous answer, I am sure the above code works without the random effect, as shown here.
How can I predict GAM models with a random effect?
Assuming you want the surface conditional upon the random effects (but not for a specific level of the random effect), there are two ways.
The first is to provide a level for the random effect but exclude that term from the predicted values using the exclude argument to predict.gam(). The second is to again use exclude but this time to not provide any data for the random effect and instead stop predict.gam() from checking the newdata using the argument newdata.guaranteed = TRUE.
Option 1:
newxy1 <- with(df, expand.grid(x = newx, y = newy, r = 2))
z1 <- predict(gam_fit, newdata = newxy1, exclude = 's(r)')
z1 <- matrix(z1, 20, 30)
Option 2:
z2 <- predict(gam_fit, newdata = newxy, exclude = 's(r)',
              newdata.guaranteed = TRUE)
z2 <- matrix(z2, 20, 30)
These produce the same result:
> all.equal(z1, z2)
[1] TRUE
A couple of notes:
Which you use will depend on how complex the rest of your model is. I would generally use the first option, as it provides an extra check against me doing something stupid when creating the data. But in this instance, with a simple model and set of covariates, it seems safe enough to trust that newdata is OK.
Your example uses a random slope (was that intended?), not a random intercept, as r is not a factor. If your real example uses a factor random effect, then you'll need to be a little more careful when creating the newdata, as you need to get the levels of the factor right. For example:
expand.grid(x = newx, y = newy,
            r = with(df, factor(2, levels = levels(r))))
should get the right set-up for a factor r.
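Putting that together with the simulated data from the question (a hedged sketch; it assumes df, newx, and newy from above and converts r to a factor before refitting):
df$r <- factor(df$r)
gam_fit_f <- gam(z ~ s(x) + s(y) + s(r, bs = "re"), data = df)
newxy_f <- expand.grid(x = newx, y = newy,
                       r = factor(2, levels = levels(df$r)))
z_f <- matrix(predict(gam_fit_f, newdata = newxy_f, exclude = "s(r)"), 20, 30)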
I would like to estimate a smooth effect of some covariate N in a marginal model of type "exchangeable" in R, where the clustering variable is S. From what I could find, this should be possible with:
geeglm(..., id = S, corstr = "exchangeable")
as well as:
gamm(..., correlation = corCompSymm(form = ~1|S))
Below you can find an example where the results look good, in the sense that the two estimates are quite close. However, if I use the real data our project is about, the estimated smooth effects tend to be very different. I cannot publish that here, but maybe someone can still spot some problem in the code. For instance (see below), the gamm object says Number of Groups: 1, which worries me, as there clearly is more than one cluster...
(Yes, this is the realisation of a random-effects model by construction, but this should lead to the desired model, given the answer here.)
########
## Packages
########
library(ggplot2)
library(mgcv)
library(dplyr)
library(geepack)
library(splines)
########
## Data Simulation
########
f <- function(N) {return((-200+(N-25)^2)/100)}
N <- sort(sample(1:50, 10, replace = T))
S <- as.character(1:10)
S_Effect <- rnorm(length(S),0,1)
S_Effect <- rep(S_Effect,N)
S <- rep(S,N)
N <- rep(N,N)
E <- runif(length(N))
data <- data.frame(O = rep(0, length(N)),
                   E = E,
                   N = N,
                   S = as.factor(S),
                   S_Effect = S_Effect)
for (i in 1:length(N)) {
  data$O[i] <- rbinom(1, 1, plogis(f(N[i]) + qlogis(E[i]) + S_Effect[i]))
}
data <- data %>% mutate(E = qlogis(E))
########
## Fitting
########
formula_gamm <- as.formula("O ~ 1 + offset(E) + s(N, bs = 'bs')")
model_gamm <- gamm(formula_gamm, family = binomial(), correlation = corCompSymm(form=~1|S), data = data)
model_gamm
formula_geeglm <- as.formula("O ~ 1 + offset(E) + bs(N)")
model_geeglm <- geeglm(formula_geeglm, family = binomial(), corstr = "exchangeable", id = S, data = data)
########
## Plot
########
pred_gamm <- plot.gam(model_gamm$gam, select = 1)
x <- pred_gamm[[1]]$x
pred_geeglm <- predict(model_geeglm, type = "terms", newdata = data.frame(E = rep(0,length(x)), N = x))
z <- qnorm(0.9)
tmp <- data.frame(x = x,
                  y = pred_gamm[[1]]$fit,
                  group = rep("estimate gamm", length(x)))
tmp2 <- data.frame(x = x,
                   y = as.numeric(pred_geeglm),
                   group = rep("estimate geeglm", length(x)))
tmp3 <- data.frame(x = x,
                   y = f(x),
                   group = rep("actual function", length(x)))
data_pred = bind_rows(tmp,tmp2,tmp3) %>% mutate(group = as.factor(group))
p <- ggplot(data = data_pred, aes(x = x, y = y, color = group)) +
  geom_line(size = 2) +
  xlab("N") +
  ylab("f(N)")
p
An additional question: the gamm object contains enough information to plot a confidence band around the estimated function, but how can I do this for the geeglm estimate? You get something that looks reasonable if you simulate(model_geeglm, ...) and take the pointwise mean and so on, but that doesn't really satisfy me, as (1) the documentation on simulate doesn't mention marginal models and (2) it is very primitive...
The GAMM is using penalised splines, such that the degrees of freedom used by the resulting spline (smoother) is likely to be somewhat less than the requested basis dimension, which is 10. The GEE is fitting an unpenalized model. All else equal, the unpenalised model will be more wiggly than the penalised one.
To compare these approaches on a common footing, you need to make sure that bs() and s(x, bs = 'bs') both produce the same number of basis functions. (The s() version can produce one fewer, as it will remove the lack of identifiability with the intercept term, whereas you are omitting the intercept in the bs() version.)
Having assured yourself that you get the same basis dimension, you can make gamm() fit an unpenalised spline by adding fx = TRUE to the s(...) term in the formula.
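A sketch of that unpenalised fit using the objects from the question (choosing k so that s() matches the bs() basis is left for you to verify first, as described above):
formula_gamm_fx <- as.formula("O ~ 1 + offset(E) + s(N, bs = 'bs', fx = TRUE)")
model_gamm_fx <- gamm(formula_gamm_fx, family = binomial(),
                      correlation = corCompSymm(form = ~ 1 | S), data = data)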
Having done that, both models should be estimating similar smooth effects.
However, I would suggest that you use penalisation: for the GAMM model, use fx = FALSE, and then, after estimating the model, run gam.check(model$gam) (replacing model with your fitted model object) and see whether the basis size check passes for the smoother.
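In terms of the objects in the question, that check would simply be:
gam.check(model_gamm$gam)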
This is my first question, so please let me know if I'm doing anything wrong. We have a data frame with two variables and want to model EPR (egg production rate) as a function of temperature.
The relevant packages as per the nls page:
install.packages("tidyverse")
install.packages("nls.multstart")
install.packages("nlstools")
library(tidyverse)
library(nls.multstart)
library(nlstools)
The relevant variables from a larger df:
temp=c(9.2,9.9,12.7,12.8,14.3,14.5,16.3,16.5,18,18,19.6,19.6,19.9,19.9,22,22.4,23.2,23.4,25.3,25.6,27,27.3,28.5,30.3,20.9)
EPR=c(1.5,0,0,0,1.27,0.56,3.08,0.575,2.7,3.09,2,6.3,2,3.76,3.7,1.65,7.1,18.9,7.07,3.77,13.79,0,0,0.47,0)
df<-data.frame(temp,EPR)
Here I write the formula with the five parameters to be estimated (k1, a, b, k2, c); temp will be the x values. So far, so good.
formula <- function(k1, a, b, k2, c, temp) {
  modelEPR <- k1 * 1 / (1 + exp(-a * (temp - b))) - k2 * exp(c * temp)
  return(modelEPR)
}
This is where I'm stuck. I'm already using quite narrow start_lower and start_upper bounds, since I now know the parameters from using the Excel solver somewhat successfully. The values I get with this method do give me a model, albeit quite an inaccurate one. Yes, I gave start_lower and start_upper a much greater range in the beginning, but that didn't yield any better results.
fit <- nls_multstart(EPR ~ formula(k1, a, b, k2, c, temp),
                     data = df,
                     iter = 100,
                     start_lower = c(k1 = 14, a = 0.3, b = 20, k2 = 0.02, c = 0.15),
                     start_upper = c(k1 = 15, a = 0.5, b = 21, k2 = 0.08, c = 0.24),
                     supp_errors = 'Y',
                     na.action = na.omit)
fit
As mentioned above, I used the Excel solver to fit the model successfully and got the parameter estimates, then tried simply inserting them manually here in R, which makes for a much better model.
model <- df %>%
  mutate(pred = 14.69 / (1 + exp(-0.41 * (temp - 20.52))) - 0.05 * exp(0.19 * temp)) %>%
  ggplot() +
  xlab("Temperature (°C)") +
  ylab("EPR (Eggs per female per day)") +
  geom_point(aes(temp, EPR)) +
  geom_line(aes(temp, pred), col = "red")
model
Ultimately, I have two questions:
a) What am I doing wrong? Or is it simply that the data are weird? It seems to work better with Excel?!
b) How do I code the bridge between fit and model? fit will yield the 5 parameters, but how do I insert them directly into the model function? Can I utilize mutate somehow here?
Would appreciate any help!
A. Starting values and fitting model
To get starting values:
If k1 = 0 then we can rearrange the formula as follows and then use the result of fitting that linear model as a starting value for c.
log(EPR) ~ log(k2) + c * temp
b is a shift in temp and a is a scaling, so choose b = mean(temp) and a = 1/sd(temp).
We can use algorithm = "plinear" to avoid having to specify starting values for the linear parameters, i.e. for k1 and k2. When using plinear the right hand side of the formula should be a matrix such that k1 times the first column plus k2 times the second column gives the predicted EPR.
This gives the following. Note that k1 and k2 will be represented by .lin1 and .lin2 in the nls output.
fm1 <- lm(log(EPR) ~ temp, df, subset = EPR > 0)
st2 <- list(c = coef(fm1)[[2]], a = 1/sd(df$temp), b = mean(df$temp))
fo2 <- EPR ~ cbind(1/(1+exp(-a*(temp-b))), -exp(c*temp))
fm2 <- nls(fo2, df, start = st2, algorithm = "plinear",
           control = list(maxiter = 200))
deviance(fm2) # residual sum of squares
## [1] 333.6
Note that this represents a lower (better) residual sum of squares than the fit shown in the question:
pred <- with(df, 14.69 / (1 + exp(-0.41 * (temp - 20.52))) - 0.05 * exp(0.19 * temp)) # fit shown in question
sum((df$EPR - pred)^2) # residual sum of squares for that fit
## [1] 339.7
No packages were used.
We can plot the two fits, with the fit from the question in blue and the fit done here in red. From the plot there is some question as to whether the two large EPR values are outliers and whether they should be excluded.
plot(EPR ~ temp, df)
lines(fitted(fm2) ~ temp, df, subset = order(temp), col = "red")
lines(pred ~ temp, df, subset = order(temp), col = "blue")
B. Evaluating model at given parameters
For a given model expressed in formula notation we can evaluate it at given parameters using the nls2 package. nls2 takes arguments similar to nls, but if the starting value is a data frame with one row and the algorithm is "brute" then it simply returns the value of the right-hand side evaluated at the starting values. See ?nls2 for more information.
library(nls2)
fo <- EPR ~ k1*1/(1+exp(-a*(temp-b)))-k2*exp(c*temp)
st <- list(k1 = 14.69, a = 0.41, b = 20.52, k2 = 0.05, c = 0.19)
fm <- nls2(fo, df, start = data.frame(st), algorithm = "brute")
deviance(fm)
## [1] 339.7
fitted(fm) # predictions at parameter values given in st
or in terms of a function:
rhs <- function(a, b, c, k1, k2, temp) k1*1/(1+exp(-a*(temp-b)))-k2*exp(c*temp)
p <- do.call("rhs", c(st, list(temp = df$temp)))
all.equal(p, pred)
## [1] TRUE
I am using glmnet, and for the best lambda I want to check the VIF between variables. Can anyone suggest how I can accomplish this?
Below is the code I am following; fielddfm is the data frame containing the independent variables:
x <- model.matrix(depvar ~ ., fielddfm)[, -1]
y <- depvar
lambda <- 10^seq(10, -2, length = 100)
ridge.mod <- glmnet(x, y, alpha = 0, lambda = lambda)
predict(ridge.mod, s = 0, exact = T, type = 'coefficients')
cv.out <- cv.glmnet(x, y, alpha = 0, nfolds = 3)
bestlam <- cv.out$lambda.min
ridge.pred <- predict(ridge.mod, s = bestlam, newx = x)
predict(ridge.mod, type = "coefficients", s = bestlam)
Here I get the coefficients for the different promotion vehicles, but I want to know the VIF values of the different independent variables at the best lambda.
Could you please suggest how I can achieve this?
Since a) VIF is a function of your predictors rather than your model and b) a ridge regression keeps all variables irrespective of lambda, you could get the VIFs from an arbitrarily-fitted linear model. For example:
vifs = car::vif(lm(y ~ ., data = X))
where y is your response and X is your dataframe of predictors. Note that the results are independent of the values contained in y.
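In terms of the objects in your code above, that would look something like this (a sketch; it assumes x and y are the model matrix and response you already built, and the result does not depend on the lambda chosen):
Xdf <- as.data.frame(x)
vifs <- car::vif(lm(y ~ ., data = Xdf))
vifs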
Given the above, however, it's a little dubious whether this question makes sense in the first place...