In R, I would like to test the specification of a partial least squares (PLS) model m1 against a non-nested alternative m2 by applying the Davidson-MacKinnon J test. For a simple linear outcome Y this works quite well using the plsr estimator followed by the jtest command:
# Libraries and data
library(pls) # provides plsr()
library(plsRglm)
library(lmtest)
Z <- Cornell # illustration dataset that comes with the plsRglm package
# Simple linear model
m1 <- plsr(Z$Y ~ Z$X1 + Z$X2 + Z$X3 + Z$X4 + Z$X5 ,2) # including X1
m2 <- plsr(Z$Y ~ Z$X6 + Z$X2 + Z$X3 + Z$X4 + Z$X5 ,2) # including X6 as alternative
jtest(m1,m2)
However, if I use the generalized linear model (plsRglm) estimator to account for a possibly non-normal distribution of the outcome, e.g.:
# Generalized Model
m1 <- plsRglm(Z$Y ~ Z$X1 + Z$X2 + Z$X3 + Z$X4 + Z$X5 ,2, modele = "pls-glm-family", family=Gamma(link = "log"), pvals.expli=TRUE)
m2 <- plsRglm(Z$Y ~ Z$X6 + Z$X2 + Z$X3 + Z$X4 + Z$X5 ,2, modele = "pls-glm-family", family=Gamma(link = "log"), pvals.expli=TRUE)
I am running into an error when using jtest:
> jtest(m1,m2)
Error in terms.default(formula1) : no terms component nor attribute
>
It seems that plsRglm does not save an object of class "formula" that jtest can handle. Does anybody have a suggestion for how to edit my code to get this to work?
Thanks!
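For orientation, the artificial-nesting idea behind the J test can be sketched by hand: add the rival model's fitted linear predictor to the model under test and look at the t statistic of that extra term. The sketch below uses ordinary glm() fits on simulated data, not plsRglm objects; the helper name j_step and the simulated variables are hypothetical, and whether the same construction carries over to PLS components is an open assumption.

```r
# Hand-rolled J-type check for two non-nested Gamma GLMs (base R only).
set.seed(1)
n  <- 200
x1 <- rnorm(n); x2 <- rnorm(n); x6 <- rnorm(n)
mu <- exp(0.5 + 0.8 * x1 + 0.3 * x2)            # data are generated from "model 1"
d  <- data.frame(y = rgamma(n, shape = 5, rate = 5 / mu), x1, x2, x6)

g1 <- glm(y ~ x1 + x2, family = Gamma(link = "log"), data = d)
g2 <- glm(y ~ x6 + x2, family = Gamma(link = "log"), data = d)

# j_step (hypothetical helper): augment `base` with the rival's linear
# predictor and return the coefficient row of that added term.
j_step <- function(base, rival, data) {
  data$rival_fit <- predict(rival, type = "link")
  aug <- update(base, . ~ . + rival_fit, data = data)
  coef(summary(aug))["rival_fit", ]
}

j_g1 <- j_step(g1, g2, d)  # does g2 add anything to g1?
j_g2 <- j_step(g2, g1, d)  # does g1 add anything to g2?
rbind(g1_vs_g2 = j_g1, g2_vs_g1 = j_g2)
```

A significant rival_fit term is evidence against the model being augmented; as with jtest, both directions should be checked.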
In a Cox regression framework, I'd like to implement piecewise continuous time-dependent coefficients. This is for variables that do not satisfy the proportional hazards (PH) assumption.
In this vignette https://cran.r-project.org/web/packages/survival/vignettes/timedep.pdf, there are examples for step functions (p.17), and coefficient with some functional form (p.20).
What I'd like is to somehow have a piecewise relationship. Using the example provided in the vignette:
library(survival)
vfit <- coxph(Surv(time, status) ~ trt + prior + karno, veteran)
zp <- cox.zph(vfit, transform= function(time) log(time +20))
## Step functions
vet2 <- survSplit(Surv(time, status) ~ ., data= veteran, cut=c(90, 180),
episode= "tgroup", id="id")
vfit2 <- coxph(Surv(tstart, time, status) ~ trt + prior +
karno:strata(tgroup), data=vet2)
## Functional form
vfit3 <- coxph(Surv(time, status) ~ trt + prior + karno + tt(karno),
data=veteran,
tt = function(x, t, ...) x * log(t+20))
plot(zp[3])
abline(coef(vfit3)[3:4], col=2)
From the plot (also on p.21 in the vignette), we might argue that there is a similar but inverted trend from approximately time = 200 onwards. I've tried to model this, but without success.
I first tried a piecewise function directly in the tt argument, but that does not give two sets of coefficients: there is only one coefficient for karno and one for tt(karno). What I want is something like ax+b for t<200 and cx+d for t>=200.
vfit3 <- coxph(Surv(time, status) ~ trt + prior + karno + tt(karno),
data=veteran,
tt = function(x, t, t1, t2, ...) x * log(t1+20) * (t<200) +
x * t2 * (t>=200))
So in a second step, I tried to combine step functions with a functional form for each step, i.e. split the data into two time periods as for step functions and then fit a function within each. But this gives an error.
vfit3 <- coxph(Surv(tstart, time, status) ~ trt + prior +
(karno + tt(karno)):strata(tgroup),
data=vet2,
tt = function(x, t, ...) x * log(t+20) * (t<200) -
x * t * (t>=200))
Does anyone know how to implement this?
EDIT:
This is what I've come up with
library(survival)
## Original model
m1 <- coxph(formula = Surv(time, status) ~ trt + prior + karno,
data = veteran)
## Transform to long format as in the link
vet1 <- survSplit(Surv(time, status)~., data = veteran, id = "id",
cut = unique(veteran$time))
## Add a grouping variable (strata) for time before 200 days and after.
vet1$tgroup <- ifelse(vet1$time < 200, 1, 2)
## Add a time-transform function
## Here it is the same function for both strata, but they could be different
## e.g. ifelse(vet1$time < 200, f1(time), f2(time))
## Actually not sure, as we need to be careful with the time scale... Anyway
vet1$time1 <- log(vet1$time + 20)
## Same model as in the link, but then add an interaction with the strata
m2 <- coxph(formula = Surv(tstart, time, status)~
trt + prior + (karno + karno:time1):strata(tgroup), data = vet1)
## Some plots as in the vignette
zp <- cox.zph(m1, transform = function(time) log(time +20))
plot(zp[3])
abline(coef(m2)[c(3,5)], col="tomato")
abline(coef(m2)[c(4,6)], col="tomato")
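The two lines drawn by abline() above amount to one piecewise coefficient function of time. A minimal sketch of that function follows; the name beta_karno and the numeric values are hypothetical placeholders, and in practice the pairs would come from coef(m2)[c(3, 5)] and coef(m2)[c(4, 6)].

```r
# Piecewise time-dependent coefficient: one straight line in log(t + 20)
# per period, switching at `cut` (200 days in the example above).
# `early` and `late` are c(intercept, slope) pairs.
beta_karno <- function(t, early, late, cut = 200) {
  g <- log(t + 20)
  ifelse(t < cut, early[1] + early[2] * g, late[1] + late[2] * g)
}

early <- c(-0.10,  0.02)  # hypothetical values, not fitted estimates
late  <- c( 0.05, -0.01)  # hypothetical values, not fitted estimates
beta_karno(c(50, 150, 250, 400), early, late)
```

Note that nothing forces the two pieces to meet at t = 200; continuity at the cut point would need an extra constraint on the coefficients.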
In R stepwise forward regression, I would like to specify several minimal models. I am looking for the best model, choosing among 12 variables (6 flow variables Q_ and 6 precipitation variables LE_).
The biggest model takes all the variables into account:
formule <- "Q ~ 0 + Q_minus_1h + Q_minus_2h + Q_minus_3h + Q_minus_4h + Q_minus_5h + Q_minus_6h + LE_6h + LE_12h + LE_18h + LE_24h + LE_30h + LE_36h"
biggest <- formula(lm(formule, Sub_fit))
Here Sub_fit is my data set (a data frame with Q and my 12 variables).
I would like to have at least one "LE_XX" variable in my model. So my minimal model could be:
formule <- "Q ~ 0 + LE_6h"
smallest <- formula(lm(formule, Sub_fit))
OR
formule <- "Q ~ 0 + LE_12h"
smallest <- formula(lm(formule, Sub_fit))
OR...
formule <- "Q ~ 0 + LE_36h"
smallest <- formula(lm(formule, Sub_fit))
And finally:
modele.res <- step(lm(as.formula("Q ~ 0"),data=Sub_fit), direction='forward', scope=list(lower=smallest, upper=biggest))
The "lower" element of "scope" does not accept a list; it must be a single formula. Is it possible to do what I need?
To specify several minimal models in stepwise forward regression, create the smallest formulas with, for instance, lapply and then loop through them.
In the example below, built-in data set mtcars is used to fit several models having mpg as response, one per each of the 3 last variables in the data set.
data(mtcars)
biggest <- mpg ~ .
sml <- names(mtcars)[9:11]
small_list <- lapply(sml, function(x) {
fmla <- paste("mpg", x, sep = "~")
as.formula(fmla)
})
names(small_list) <- sml
fit <- lm(mpg ~ ., mtcars)
fit_list <- lapply(small_list, function(smallest){
step(fit, scope = list(lower = smallest, upper = biggest))
})
Now select with AIC as criterion
min_aic <- sapply(fit_list, AIC)
min_aic
# am gear carb
#154.1194 155.9852 154.5631
fit_list[[which.min(min_aic)]]
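The step() call above starts from the full fit, so it runs in the default direction rather than strictly forward. A forward-only variant, starting each search from its own minimal model, might look like the sketch below; the names forced, upper and fits are illustrative, and the upper formula is written out explicitly rather than with a dot.

```r
# Forward selection started from each minimal model in turn (base R only).
data(mtcars)
upper  <- mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
forced <- c("am", "gear", "carb")   # one minimal model per forced variable

fits <- lapply(forced, function(v) {
  lower <- reformulate(v, response = "mpg")   # e.g. mpg ~ am
  start <- lm(lower, data = mtcars)           # begin at the minimal model
  step(start, scope = list(lower = lower, upper = upper),
       direction = "forward", trace = 0)
})
names(fits) <- forced

aics <- sapply(fits, AIC)
best <- fits[[which.min(aics)]]   # lowest-AIC model among the searches
formula(best)
```

Each search keeps its forced variable because forward steps only add terms to the starting model.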
The stepwise function in the StepReg R package can force variables of your choice into every model during the stepwise regression.
library(StepReg)
f1 <- Q ~ 0 + Q_minus_1h + Q_minus_2h + Q_minus_3h + Q_minus_4h + Q_minus_5h + Q_minus_6h + LE_6h + LE_12h + LE_18h + LE_24h + LE_30h + LE_36h
## include LE_6h in the model
stepwise(formula=f1,
data=yourdata,
include="LE_6h",
selection="forward",
select="AIC")
## include LE_6h and LE_12h in the model
stepwise(formula=f1,
data=yourdata,
include=c("LE_6h","LE_12h"),
selection="forward",
select="AIC")
I'm trying to fit a logistic regression model in which every predictor enters as a polynomial term. I've tried the following, but it didn't work:
poly_model = glm(type~ poly(., 2), data=train_data, family=binomial)
I'm using the built in dataset:
train_data = MASS::Pima.tr
What's the correct way to do this?
There's not really a way to do that with the . syntax. You'll need to explicitly build the formula yourself. You can do this with a helper function:
get_formula <- function(resp) {
reformulate(
sapply(setdiff(names(train_data), resp), function(x) paste0("poly(", x, ", 2)")),
response = resp
)
}
model <- get_formula("type")
model
# type ~ poly(npreg, 2) + poly(glu, 2) + poly(bp, 2) + poly(skin,
# 2) + poly(bmi, 2) + poly(ped, 2) + poly(age, 2)
glm(model, data=train_data, family=binomial)
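If the data frame also contains factor columns, poly() will fail on them. A variant of the helper that takes the data as an argument and wraps only numeric predictors is sketched below; the function name poly_formula and the toy data frame are hypothetical.

```r
# Build resp ~ poly(num1, d) + poly(num2, d) + factor1 + ... from a data frame.
poly_formula <- function(data, resp, degree = 2) {
  preds  <- setdiff(names(data), resp)
  is_num <- vapply(data[preds], is.numeric, logical(1))
  rhs <- c(sprintf("poly(%s, %d)", preds[is_num], degree),  # numeric columns
           preds[!is_num])                                  # everything else as-is
  reformulate(rhs, response = resp)
}

# Toy data: one numeric predictor and one factor
toy <- data.frame(y = c(0, 1, 0, 1, 1, 0),
                  a = c(1, 2, 3, 4, 5, 6),
                  g = factor(c("u", "v", "u", "v", "u", "v")))
f <- poly_formula(toy, "y")
f
```

The resulting formula can be passed to glm() exactly like the one built by get_formula above.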
From Stata:
margins, at(age=40)
To understand why that yields the desired result, let us tell you that if you were to type
. margins
margins would report the overall margin—the margin that holds nothing constant. Because our model
is logistic, the average value of the predicted probabilities would be reported. The at() option fixes
one or more covariates to the value(s) specified and can be used with both factor and continuous
variables. Thus, if you typed
margins, at(age=40)
then margins would average over the data
the responses for everybody, setting age=40.
Could someone tell me which R package would be useful for this? I have already tried taking the mean of the predicted values on subsets of the data, but that doesn't work for sequences, for example margins, at(age=40(1)50).
There are many ways to get marginal effects in R.
You should understand that Stata's margins, at are simply marginal effects evaluated at means or representative points (see this and the documentation).
I think that you'll like this solution best as it's most similar to what you're used to:
library(devtools)
install_github("leeper/margins")
Source: https://github.com/leeper/margins
margins is an effort to port Stata's (closed source) margins command
to R as an S3 generic method for calculating the marginal effects (or
"partial effects") of covariates included in model objects (like those
of classes "lm" and "glm"). A plot method for the new "margins" class
additionally ports the marginsplot command.
library(margins)
x <- lm(mpg ~ cyl * hp + wt, data = mtcars)
(m <- margins(x))
cyl hp wt
0.03814 -0.04632 -3.11981
See also the prediction command (?prediction) in this package.
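What the quoted Stata passage describes (set one covariate to a fixed value for everybody, then average the predicted probabilities) can also be sketched in base R without any package. The helper name avg_pred_at is hypothetical, and mtcars with wt stands in for the questioner's data and age.

```r
# Average predicted probability with one covariate fixed at each value in turn,
# mimicking the averaging behind Stata's margins, at(var = a(step)b).
avg_pred_at <- function(model, data, var, values) {
  out <- sapply(values, function(v) {
    d <- data
    d[[var]] <- v                                  # everybody gets var = v
    mean(predict(model, newdata = d, type = "response"))
  })
  setNames(out, values)
}

fit <- glm(am ~ hp + wt, data = mtcars, family = binomial)
ap  <- avg_pred_at(fit, mtcars, "wt", seq(2, 4, by = 1))
ap
```

This returns point estimates only; standard errors, which Stata reports alongside, are not computed here.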
Aside from that, here are some other solutions I've compiled:
I. erer (package)
maBina() command
http://cran.r-project.org/web/packages/erer/erer.pdf
II. mfxboot
mfxboot <- function(modform,dist,data,boot=1000,digits=3){
x <- glm(modform, family=binomial(link=dist),data)
# get marginal effects
pdf <- ifelse(dist=="probit",
mean(dnorm(predict(x, type = "link"))),
mean(dlogis(predict(x, type = "link"))))
marginal.effects <- pdf*coef(x)
# start bootstrap
bootvals <- matrix(rep(NA,boot*length(coef(x))), nrow=boot)
set.seed(1111)
for(i in 1:boot){
samp1 <- data[sample(1:dim(data)[1],replace=T,dim(data)[1]),]
x1 <- glm(modform, family=binomial(link=dist),samp1)
pdf1 <- ifelse(dist=="probit",
mean(dnorm(predict(x1, type = "link"))),
mean(dlogis(predict(x1, type = "link")))) # use the bootstrap refit x1, not the original fit x
bootvals[i,] <- pdf1*coef(x1)
}
res <- cbind(marginal.effects,apply(bootvals,2,sd),marginal.effects/apply(bootvals,2,sd))
if(names(x$coefficients[1])=="(Intercept)"){
res1 <- res[2:nrow(res),]
res2 <- matrix(as.numeric(sprintf(paste("%.",paste(digits,"f",sep=""),sep=""),res1)),nrow=dim(res1)[1])
rownames(res2) <- rownames(res1)
} else {
res2 <- matrix(as.numeric(sprintf(paste("%.",paste(digits,"f",sep=""),sep=""),res)),nrow=dim(res)[1])
rownames(res2) <- rownames(res)
}
colnames(res2) <- c("marginal.effect","standard.error","z.ratio")
return(res2)}
Source: http://www.r-bloggers.com/probitlogit-marginal-effects-in-r/
III. Source: R probit regression marginal effects
x1 = rbinom(100,1,.5)
x2 = rbinom(100,1,.3)
x3 = rbinom(100,1,.9)
ystar = -.5 + x1 + x2 - x3 + rnorm(100)
y = ifelse(ystar>0,1,0)
probit = glm(y~x1 + x2 + x3, family=binomial(link='probit'))
xbar <- colMeans(cbind(1, x1, x2, x3)) # means of the intercept column and the three regressors
Now the graphic, i.e., the marginal effect of x1, x2 and x3
library(arm)
curve(invlogit(1.6*(probit$coef[1] + probit$coef[2]*x + probit$coef[3]*xbar[3] + probit$coef[4]*xbar[4]))) #x1
curve(invlogit(1.6*(probit$coef[1] + probit$coef[2]*xbar[2] + probit$coef[3]*x + probit$coef[4]*xbar[4]))) #x2
curve(invlogit(1.6*(probit$coef[1] + probit$coef[2]*xbar[2] + probit$coef[3]*xbar[3] + probit$coef[4]*x))) #x3
library(AER)
data(SwissLabor)
mfx1 <- mfxboot(participation ~ . + I(age^2),"probit",SwissLabor)
mfx2 <- mfxboot(participation ~ . + I(age^2),"logit",SwissLabor)
mfx3 <- mfxboot(participation ~ . + I(age^2),"probit",SwissLabor,boot=100,digits=4)
mfxdat <- data.frame(cbind(rownames(mfx1),mfx1))
mfxdat$me <- as.numeric(as.character(mfxdat$marginal.effect))
mfxdat$se <- as.numeric(as.character(mfxdat$standard.error))
# coefplot
library(ggplot2)
ggplot(mfxdat, aes(V1, marginal.effect,ymin = me - 2*se,ymax= me + 2*se)) +
scale_x_discrete('Variable') +
scale_y_continuous('Marginal Effect',limits=c(-0.5,1)) +
theme_bw() +
geom_errorbar(aes(x = V1, y = me),size=.3,width=.2) +
geom_point(aes(x = V1, y = me)) +
geom_hline(yintercept=0) +
coord_flip() +
ggtitle("Marginal Effects with 95% Confidence Intervals") # opts() has been removed from ggplot2
Using different sources, I wrote a little function that, after a linear regression model, creates a table with standard errors, t statistics and p-values that are clustered according to a group variable "cluster". The code is as follows:
cl1 <- function(modl,clust) {
# modl is the regression model
# clust is the clustervariable
# id is a unique identifier in ids
library(plm)
library(lmtest)
# Get Formula
form <- formula(modl$call)
# Get Data frame
dat <- eval(modl$call$data)
dat$row <- rownames(dat)
dat$id <- ave(dat$row, dat[[deparse(substitute(clust))]], FUN =seq_along)
pdat <- pdata.frame(dat,
index=c("id", deparse(substitute(clust)))
, drop.index= F, row.names= T)
# # Regression
reg <- plm(form, data=pdat, model="pooling")
# # Adjustments
G <- length(unique(dat[, deparse(substitute(clust))]))
N <- length(dat[,deparse(substitute(clust))])
# # Resid degrees of freedom, adjusted
dfa <- (G/(G-1))*(N-1)/reg$df.residual
d.vcov <- dfa* vcovHC(reg, type="HC0", cluster="group", adjust=T)
table <- coeftest(reg, vcov=d.vcov)
# # Output: se, t-stat and p-val
cl1out <- data.frame(table[, 2:4])
names(cl1out) <- c("se", "tstat", "pval")
# # Cluster VCE
return(cl1out)
}
For a regression like reg1 <- lm (y ~ x1 + x2 , data= df), calling the function cl1(reg1, cluster) will work just fine.
However, if I use a model like reg2 <- lm(y ~ . , data=df), I will get the error message:
Error in terms.formula(object) : '.' in formula and no 'data' argument
After some tests, I am guessing that I can't use "." to signal "use all variables in the data frame" for {plm}. Is there a way I can do this with {plm}? Otherwise, any ideas on how I could improve my function in a way that does not use {plm} and that accepts all possible specifications of a linear model?
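One plm-free route is to compute the cluster-robust covariance directly from the fitted lm object, which works for any formula, including y ~ . since the model matrix is taken from the fit. The sketch below is an assumption-laden illustration: the name cluster_se is hypothetical, it assumes no missing values and a cluster vector aligned with the rows used in the fit, and it applies the same G/(G-1) * (N-1)/(N-K) adjustment as cl1.

```r
# Cluster-robust standard errors for an lm fit, base R only.
cluster_se <- function(model, cluster) {
  X <- model.matrix(model)
  u <- residuals(model)
  G <- length(unique(cluster)); N <- nrow(X); K <- ncol(X)
  scores <- rowsum(X * u, group = cluster)   # per-cluster sums of x_i * u_i
  bread  <- solve(crossprod(X))
  V <- (G / (G - 1)) * ((N - 1) / (N - K)) *
    bread %*% crossprod(scores) %*% bread
  se    <- sqrt(diag(V))
  tstat <- coef(model) / se
  pval  <- 2 * pt(abs(tstat), df = df.residual(model), lower.tail = FALSE)
  data.frame(se, tstat, pval)
}

set.seed(42)
df <- data.frame(x1 = rnorm(60), x2 = rnorm(60), firm = rep(1:12, each = 5))
df$y <- 1 + 2 * df$x1 - df$x2 + rnorm(60)
reg <- lm(y ~ . - firm, data = df)   # dot notation, cluster id excluded
res <- cluster_se(reg, df$firm)
res
```

The reference distribution for the p-values (residual degrees of freedom here) is a design choice; some authors prefer G - 1.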
Indeed, you can't use the . notation in a formula within the plm package.
data("Produc", package = "plm")
plm(gsp ~ .,data=Produc)
Error in terms.formula(object) : '.' in formula and no 'data' argument
One idea is to expand the formula whenever it contains a dot (.). Here is a custom function that does the job (surely this is already done within other packages):
expand_formula <-
function(form="A ~.",varNames=c("A","B","C")){
has_dot <- any(grepl('.',form,fixed=TRUE))
if(has_dot){
ii <- intersect(as.character(as.formula(form)),
varNames)
varNames <- varNames[!grepl(paste0(ii,collapse='|'),varNames)]
exp <- paste0(varNames,collapse='+')
as.formula(gsub('.',exp,form,fixed=TRUE))
}
else as.formula(form)
}
Now test it:
(eform = expand_formula("gsp ~ .",names(Produc)))
# gsp ~ state + year + pcap + hwy + water + util + pc + emp + unemp
plm(eform,data=Produc)
# Model Formula: gsp ~ state + year + pcap + hwy + water + util + pc + emp + unemp
# <environment: 0x0000000014c3f3c0>
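As a base R alternative to string substitution, terms() expands the dot when it is given the data, and formula() turns the result back into an ordinary formula. A small sketch on a toy data frame (the object names are illustrative):

```r
# Expand "." against a data frame using terms() instead of gsub().
toy <- data.frame(y = 1:3, a = 4:6, b = 7:9)
expanded <- formula(terms(y ~ ., data = toy))
expanded
all.vars(expanded)
```

The expanded formula contains no dot, so it can be handed to functions that do not accept the . shorthand; this approach also leaves dots inside variable names untouched.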