I'm trying to fit a logistic regression model using all predictors as a polynomial model. I've tried doing this but didn't work:
poly_model = glm(type~ poly(., 2), data=train_data, family=binomial)
I'm using the built in dataset:
train_data = MASS::Pima.tr
What's the correct way to do this?
There's not really a way to do that with the . syntax. You'll need to explictly build the formula yourself. You can do this with a helper function
get_formula <- function(resp) {
reformulate(
sapply(setdiff(names(train_data), resp), function(x) paste0("poly(", x, ", 2)")),
response = resp
)
}
model <- get_formula("type")
model
# type ~ poly(npreg, 2) + poly(glu, 2) + poly(bp, 2) + poly(skin,
# 2) + poly(bmi, 2) + poly(ped, 2) + poly(age, 2)
glm(model, data=train_data, family=binomial)
Related
Im trying to fit a mixed effect model to asses for effects upon the rate of germinated polen grains. I started with a binomial distribution with a model structure like this:
glmer(cbind(NGG,NGNG) ~ RH3*Altitude + AbH + Date3 + (1 | Receptor/Code/Plant) +
(1 | Mountain/Community), data=database, family="binomial",
control = glmerControl(optimizer="bobyqa"))
Where NGG is the number of successes (germinated grains per stigma, can vary from 0 to e.g. 55), NGNG the number of failures (non-germinated grains 0 to e.g. 80). The issue is, after seeing the results, data seems to be over-dispersed, as indicated by the function (found in http://rstudio-pubs-static.s3.amazonaws.com/263877_d811720e434d47fb8430b8f0bb7f7da4.html):
overdisp_fun <- function(model) {
vpars <- function(m) {
nrow(m)*(nrow(m)+1)/2
}
model.df <- sum(sapply(VarCorr(model), vpars)) + length(fixef(model))
rdf <- nrow(model.frame(model))-model.df
rp <- residuals(model, type = "pearson") # computes pearson residuals
Pearson.chisq <- sum(rp^2)
prat <- Pearson.chisq/rdf
pval <- pchisq(Pearson.chisq, df = rdf, lower.tail = FALSE)
c(chisq = Pearson.chisq, ratio = prat, rdf = rdf, p = pval)
}
The output was:
chisq = 1.334567e+04, ratio = 1.656201e+00, rdf = 8.058000e+03, p = 3.845911e-268
So I decided to try a beta-binomial in glmmTMB as follows (its important to keep this hierarchical structure):
glmmTMB(cbind(NGG,NGNG) ~ RH3*Altitude + AbH + Date3 + (1 | Receptor/Code/Plant) +
(1 | Mountain/Community), data=database,
family=betabinomial(link = "logit"), na.action = na.omit, weights=NGT)
When I run it.. says:
Error in nlminb(start = par, objective = fn, gradient = gr, control = control$optCtrl) : (converted from warning) NA/NaN function evaluation
Is there something wrong in the model writing? I already checked for posible issues in (http://rstudio-pubs-static.s3.amazonaws.com/263877_d811720e434d47fb8430b8f0bb7f7da4.html) but did not find any solution yet.
thanks
In R, I would like to test the specification of a partial least square (PLS) model m1 against a non-nested alternative m2, applying the Davidson-MacKinnon J test. For a simple linear outcome Y it works quite well using the plsr estimator followed by the jtest command:
# Libraries and data
library(plsr)
library(plsRglm)
library(lmtest)
Z <- Cornell # illustration dataset coming with the plsrglm package
# Simple linear model
m1 <- plsr(Z$Y ~ Z$X1 + Z$X2 + Z$X3 + Z$X4 + Z$X5 ,2) # including X1
m2 <- plsr(Z$Y ~ Z$X6 + Z$X2 + Z$X3 + Z$X4 + Z$X5 ,2) # including X6 as alternative
jtest(m1,m2)
However, if Iuse the generalized linear model (plsRglm) estimator to account for a possible nonlinear distibution of an outcome, e.g.:
# Generalized Model
m1 <- plsRglm(Z$Y ~ Z$X1 + Z$X2 + Z$X3 + Z$X4 + Z$X5 ,2, modele = "pls-glm-family", family=Gamma(link = "log"), pvals.expli=TRUE)
m2 <- plsRglm(Z$Y ~ Z$X6 + Z$X2 + Z$X3 + Z$X4 + Z$X5 ,2, modele = "pls-glm-family", family=Gamma(link = "log"), pvals.expli=TRUE)
I am running into an error when using jtest:
> jtest(m1,m2)
Error in terms.default(formula1) : no terms component nor attribute
>
It seems that plsRglm does not save objects of class "formula", that jtest can handle. Has anybody a suggestion of how to edit my code to get this to work?
Thanks!
In R, one can build an lm() or glm() object with fixed coefficients, using the offset parameter in a formula.
x=seq(1,100)
y=x^2+3*x+7
# Forcing to fit the polynomial: 2x^2 + 4x + 8
fixed_model = lm(y ~ 0 + offset(8 + 4*x + 2*I(x^2) ))
Is it possible to do the same thing using poly()? I tried the code below but it doesn't seem to work.
fixed_model_w_poly <- lm(y ~ 0 + offset(poly(x, order=2, raw = TRUE, coefs= c(8, 4, 2))))
Error : number of offsets is 200, should equal 100 (number of observations)
I want to use poly() as a convenient interface to run iterations with a high number of fixed coefficients or order values, rather than having to manually code: offset(8 + 4*x + 2*I(x^2) ) for each order/coefficient combination.
P.S: Further but not essential information: This is to go inside an MCMC routine. So an example usage would be to generate (and then compare) model_current to model_next in the below code:
library(MASS)
coeffs_current <- c(8, 4, 2)
model_current <- lm(y ~ 0 + offset(poly(x, order=2, raw = TRUE, coefs= coeffs_current )))
cov <- diag(rep(1,3))
coeffs_next <- mvrnorm(1, mu = as.numeric(coeffs_current ),
Sigma = cov )
model_next <- lm(y ~ 0 + offset(poly(x, order=2, raw = TRUE, coeffs_next ))
This demonstrates what I suggested. (Not to use poly.)
library(MASS)
# coeffs_current <- c(8, 4, 2) Name change for compactness.
cc <- c(8, 4, 2)
form <- as.formula(bquote(y~x+offset(.(cc[1])+x*.(cc[2])+.(cc[3])*I(x^2) )))
model_current <- lm(form, data=dat))
I really have no idea what you intend to do with this next code. Looks like you want something based on the inputs to the prior function, but doesn't look like you want it based on the results.
cov <- diag(rep(1,3))
coeffs_next <- mvrnorm(1, mu = as.numeric(cc ),
Sigma = cov )
The code works (at least as I intended) with a simple test case. The bquote function substitutes values into expressions (well actually calls) and the as.formula function evaluates its argument and then dresses the result up as a proper formula-object.
dat <- data.frame(x=rnorm(20), y=rnorm(20) )
cc <- c(8, 4, 2)
form <- as.formula( bquote(y~x+offset(.(cc[1])+x*.(cc[2])+.(cc[3])*I(x^2) )))
model_current <- lm(form, data=dat)
#--------
> model_current
Call:
lm(formula = form, data = dat)
Coefficients:
(Intercept) x
-9.372 -5.326 # Bizarre results due to the offset.
#--------
form
#y ~ x + offset(8 + x * 4 + 2 * I(x^2))
i am trying to familiarize myself with the caret package. I would previously use rpart directly - e.g. with the following syntax
fit_rpart=rpart(y~.,data=dt1,method="anova").
i have specified anova as i am aiming for regression (rather than classification)
with caret - i would the following syntax:
rpart_fit <- train(y ~ ., data = dt1, method = "rpart",trControl=fitControl)
my question is, as the method slot is already used, where/how can i still specify method="anova"?
Many thanks in advance!
You can make a custom method using the current rpart code. First, get the current code:
library(caret)
rpart_code <- getModelInfo("rpart", regex = FALSE)[[1]]
You then just add the extra option to the code. This method is somewhat convoluted since it handles a bunch of different cases, but here is the edit:
rpart_code$fit <- function(x, y, wts, param, lev, last, classProbs, ...) {
cpValue <- if(!last) param$cp else 0
theDots <- list(...)
if(any(names(theDots) == "control")) {
theDots$control$cp <- cpValue
theDots$control$xval <- 0
ctl <- theDots$control
theDots$control <- NULL
} else ctl <- rpart.control(cp = cpValue, xval = 0)
## check to see if weights were passed in (and availible)
if(!is.null(wts)) theDots$weights <- wts
modelArgs <- c(list(formula = as.formula(".outcome ~ ."),
data = if(is.data.frame(x)) x else as.data.frame(x),
control = ctl,
method = "anova"),
theDots)
modelArgs$data$.outcome <- y
out <- do.call("rpart", modelArgs)
if(last) out <- prune.rpart(out, cp = param$cp)
out
}
then test:
library(rpart)
set.seed(445)
mod <- train(pgstat ~ age + eet + g2 + grade + gleason + ploidy,
data = stagec,
method = rpart_code,
tuneLength = 8)
Max
In caret 'method' refers to the type of model you would like to use, so for example rpart or lm (linear regression) or rf (random forest).
What you're referring to is defined as 'metric' in caret.
If your y-variable is a continuous variable, the metric will be default set to maximizing RMSE. So you don't have to do anything.
You could also explicitly specify this by:
rpart_fit <- train(y ~ ., data = dt1, method = "rpart",trControl=fitControl, metric="RMSE")
Using different sources, I wrote a little function that creates a table with standard errors, t statistics and standard errors that are clustered according to a group variable "cluster" after a linear regression model. The code is as follows
cl1 <- function(modl,clust) {
# model is the regression model
# clust is the clustervariable
# id is a unique identifier in ids
library(plm)
library(lmtest)
# Get Formula
form <- formula(modl$call)
# Get Data frame
dat <- eval(modl$call$data)
dat$row <- rownames(dat)
dat$id <- ave(dat$row, dat[[deparse(substitute(clust))]], FUN =seq_along)
pdat <- pdata.frame(dat,
index=c("id", deparse(substitute(clust)))
, drop.index= F, row.names= T)
# # Regression
reg <- plm(form, data=pdat, model="pooling")
# # Adjustments
G <- length(unique(dat[, deparse(substitute(clust))]))
N <- length(dat[,deparse(substitute(clust))])
# # Resid degrees of freedom, adjusted
dfa <- (G/(G-1))*(N-1)/reg$df.residual
d.vcov <- dfa* vcovHC(reg, type="HC0", cluster="group", adjust=T)
table <- coeftest(reg, vcov=d.vcov)
# # Output: se, t-stat and p-val
cl1out <- data.frame(table[, 2:4])
names(cl1out) <- c("se", "tstat", "pval")
# # Cluster VCE
return(cl1out)
}
For a regression like reg1 <- lm (y ~ x1 + x2 , data= df), calling the function cl1(reg1, cluster) will work just fine.
However, if I use a model like reg2 <- lm(y ~ . , data=df), I will get the error message:
Error in terms.formula(object) : '.' in formula and no 'data' argument
After some tests, I am guessing that I can't use "." to signal "use all variables in the data frame" for {plm}. Is there a way I can do this with {plm}? Otherwise, any ideas on how I could improve my function in a way that does not use {plm} and that accepts all possible specifications of a linear model?
Indeed you can't use . notation for formula within plm pacakge.
data("Produc", package = "plm")
plm(gsp ~ .,data=Produc)
Error in terms.formula(object) : '.' in formula and no 'data' argument
One idea is to expand the formula when you have a .. Here is a custom function that does the job (surely is done within other packages):
expand_formula <-
function(form="A ~.",varNames=c("A","B","C")){
has_dot <- any(grepl('.',form,fixed=TRUE))
if(has_dot){
ii <- intersect(as.character(as.formula(form)),
varNames)
varNames <- varNames[!grepl(paste0(ii,collapse='|'),varNames)]
exp <- paste0(varNames,collapse='+')
as.formula(gsub('.',exp,form,fixed=TRUE))
}
else as.formula(form)
}
Now test it :
(eform = expand_formula("gsp ~ .",names(Produc)))
# gsp ~ state + year + pcap + hwy + water + util + pc + emp + unemp
plm(eform,data=Produc)
# Model Formula: gsp ~ state + year + pcap + hwy + water + util + pc + emp + unemp
# <environment: 0x0000000014c3f3c0>