How to extract info from package in R and use in function? - r

I apologize for the vague question title. What I want to do is run a regression in R using geeglm from the geepack R package, then use information from that to calculate a quasilikelihood information criteria (QIC; Pan 2001). I can do this fairly easily for single models but I would like to write a general function that can do this for a variety of different types of models. I guess my real question is whether there is a better alternative than having a long series of nested ifelse statements?
Here's my current code:
library(geepack)
data(dietox) #data from the geepack package
# Run gee regression
dietox$Cu <- as.factor(dietox$Cu)
mf <- formula(Weight ~ Cu * (Time + I(Time^2) + I(Time^3)))
gee1 <- geeglm(mf, data = dietox, id = Pig, family = gaussian, corstr = "ar1")
Then I can run a function to calculate the quasilikelihood:
QlogLik.normal <- function(model.R) {
library(MASS)
mu.R <- model.R$fitted.values
y <- model.R$y
# Quasi Likelihood for Normal
quasi.R <- sum(((y - mu.R)^2)/-2)
quasi.R
}
However, I would like to write a function that is more general because the quasilikelihood function is different for every distribution. The above function would work for gee1 because it had a gaussian (normal) distribution. If I wanted to generalize it for a variety of distributions I could use a series of nested ifelse statements (below), but I don't know if this is the best way to do this. Does anyone have other options or a better solution? This just doesn't seem very elegant to say the least (clearly I don't have much programming or R experience).
QlogLik <- function(model.R) {
library(MASS)
mu.R <- model.R$fitted.values
y <- model.R$y
ifelse(model.R$modelInfo$variance == "poisson",
# Quasi Likelihood for Poisson
quasi.R <- sum((y*log(mu.R)) - mu.R),
ifelse(model.R$modelInfo$variance == "gaussian",
# Quasi Likelihood for Normal
quasi.R <- sum(((y - mu.R)^2)/-2),
ifelse(model.R$modelInfo$variance == "binomial",
# Quasilikelihood for Binomial
quasi.R <- sum(y*log(mu.R/(1 - mu.R)) + log(1 - mu.R)),
quasi.R <- "Error: distribution not recognized")))
quasi.R
}
In this example, I used the model output from geeglm to extract the type of distribution used to model the variance
model.R$modelInfo$variance
but there may be other ways to determine what distribution was used in the geeglm model. Any help would be appreciated.

You should be able to rewrite your function like this:
QlogLik <- function(model.R) {
library(MASS)
mu.R <- model.R$fitted.values
y <- model.R$y
type <- family(model.R)$family
switch(type,
poisson = sum((y*log(mu.R)) - mu.R),
gaussian = sum(((y - mu.R)^2)/-2),
binomial = sum(y*log(mu.R/(1 - mu.R)) + log(1 - mu.R)),
stop("Error: distribution not recognized"))
}
As #baptise points out, switch useful in these cases. You use family(model.R)$family to automatically detect what family type should be used with switch.
Also, if your commands for what to do in different cases run beyond one line, you can wrap the lines with curly brackets ({ do something here }) instead.
switch(type,
type1 = { something <- do(this)
thisis(something) },
type2 = do(that))
I hope this helps!

You may also use model.R$family$family which gives the type of distribution used to model the variance, but so far I didn't know if you could eliminate those ifelse statements. The quasi.R in your code differs among different distributions, so you have to define each of them separately.
BTW, it is a good question and thanks for posting it: I had similar situations in the past, and hope to get some advice on how to write the codes more efficiently.

Related

How to normalize a Lmer model?

lmer:
mixed.lmer6 <- lmer(Size ~ (Time+I(Time^2))*Country*STemperature +
(1|Country:Locality)+ (1|Locality:Individual)+(1|Batch)+
(1|Egg_masses), REML = FALSE, data = data_NoNA)
residuals:
plot_model(mixed.lmer6, type = "diag")
Tried manual log,power, sqrt transformations in my formula but no improvement and I also can not find a suitable automatic transformation R function such as BoxCox (which does not work for LMER's)
Any help or tips would be appreciated
This might be better suited for CrossValidated ("what should I do?" is appropriate for CV; "how should I do it?" is best for Stack Overflow), but I'll take a crack.
The Q-Q plot is generally the last/least important diagnostic you should look at (the order should be approximately (1) check for significant bias/missed patterns in the mean [fitted vs. residual, residual vs. covariates]; (2) check for outliers/influential points [leverage, Cook's distance]; (3) check for heteroscedasticity [scale-location plot]; (4) check distributional assumptions [Q-Q plot]). The reason is that any of the "upstream" failures (e.g. missed patterns) will show up in the Q-Q plot as well; resolving them will often resolve the apparent non-Normality.
If you can fix the distributional assumptions by fixing something else about the model (adding covariates/adding interactions/adding polynomial or spline terms/removing outliers), then do that.
you could code your own brute-force Box-Cox, something like
fitted_model <- lmer(..., data = mydata)
bcfun <- function(lambda, resp = "y") {
y <- mydata[[resp]]
mydata$newy <- if (lambda==0) log(y) else (y^lambda -1)/lambda
## https://stats.stackexchange.com/questions/261380/how-do-i-get-the-box-cox-log-likelihood-using-the-jacobian
log_jac <- sum((lambda-1)*log(y))
newfit <- update(fitted_model, newy ~ ., data = mydata)
return(-2*(c(logLik(newfit))+ log_jac))
}
lambdavec <- seq(-2, 2, by = 0.2)
boxcox <- vapply(lambdavec, bcfun, FUN.VALUE = numeric(1))
plot(lambdavec, boxcox - min(boxcox))
(lightly tested! but feel free to let me know if it doesn't work)
if you do need to fit a mixed model with a heavy-tailed residual distribution (e.g. Student t), the options are fairly limited. The brms package can fit such models (but takes you down the Bayesian/MCMC rabbit hole), and the heavy package (currently archived on CRAN) will work, but doesn't appear to handle crossed random effects.

Why does R and PROCESS render different result of a mediation model (one is significant, the other one is not)?

As a newcomer who just gets started in R, I am confused about the result of the mediation analysis.
My model is simple: IV 'T1Incivi', Mediator 'T1Envied', DV 'T2PSRB'. I ran the same model in SPSS using PROCESS, but the result was insignificant in PROCESS; however, the indirect effect is significant in R. Since I am not that familiar with R, could you please help me to see if there is anything wrong with my code? And tell me why the result is significant in R but not in SPSS?Thanks a bunch!!!
My code in R:
X predict M
apath <- lm(T1Envied~T1Incivi, data=dat)
summary(apath)
X and M predict Y
bpath <- lm(T2PSRB~T1Envied+T1Incivi, data=dat)
summary(bpath)
Bootstrapping for indirect effect
getindirect <- function(dataset,random){
d=dataset[random,]
apath <- lm(T1Envied~T1Incivi, data=d)
bpath <- lm(T2PSRB~T1Envied+T1Incivi, data=dat)
indirect <- apath$coefficients["T1Incivi"]*bpath$coefficients["T1Envied"]
return(indirect)
}
library(boot)
set.seed(6452234)
Ind1 <- boot(data=dat,
statistic=getindirect,
R=5000)
boot.ci(Ind1,
conf = .95,
type = "norm")`*PSRB as outcome*
In your function getindirect all linear regressions should be based on the freshly shuffled data in d.
However there is the line
bpath <- lm(T2PSRB~T1Envied+T1Incivi, data=dat)
that makes the wrong reference to the variable dat which should really not be used within this function. That alone can explain incoherent results.

Time-dependent covariates- is there something wrong with this code? (R program)

I am checking a few of my Cox multivariate regression analyses' proportional hazard assumptions using time-dependent co-variates, using the survival package. The question is looking at survival in groups with different ADAMTS13 levels (a type of enzyme).
Could I check if something is wrong with my code itself? It keeps saying Error in tt(TMAdata$ADAMTS13level.f) : could not find function "tt" . Why?
Notably, ADAMTS13level.f is a factor variable.
cox_multivariate_survival_ADAMTS13 <- coxph(Surv(TMAdata$Daysalive, TMAdata$'Dead=1')
~TMAdata$ADAMTS13level.f
+TMAdata$`Age at diagnosis`
+TMAdata$CCIwithoutage
+TMAdata$Gender.f
+TMAdata$`Peak Creatinine`
+TMAdata$DICorcrit.f,
tt(TMAdata$ADAMTS13level.f),
tt = function(x, t, ...)
{mtrx <- model.matrix(~x)[,-1]
mtrx * log(t)})
Thanks- starting with the fundamentals of my actual code or typos- I have tried different permutations to no avail yet.
#Limey was on the right track!
The time-transformed version of ADAMTS13level.f needs to be added to the model, instead of being separated into a separate argument of coxph(...).
The form of coxph call when testing the time-dependent categorical variables is described in How to use the timeSplitter by Max Gordon.
Other helpful documentation:
coxph - fit proportional hazards regression model
cox_multivariate_survival_ADAMTS13 <-
coxph(
Surv(
Daysalive,
'Dead=1'
) ~
ADAMTS13level.f
+ `Age at diagnosis`
+ CCIwithoutage
+ Gender.f
+ `Peak Creatinine`
+ DICorcrit.f
+ tt(ADAMTS13level.f),
tt = function(x, t, ...) {
mtrx <- model.matrix(~x)[,-1]
mtrx * log(t)
},
data = TMAdata
)
p.s. with the original data, there was also a problem because Daysalive included a zero (0) value, which eventually resulted in an 'infinite predictor' error from coxph, probably because tt transformed the data using a log(t). (https://rdrr.io/github/therneau/survival/src/R/coxph.R)

Is it possible to adapt standard prediction interval code for dlm in R with other distribution?

Using the dlm package in R I fit a dynamic linear model to a time series data set, consisting of 20 observations. I then use the dlmForecast function to predict future values (which I can validate against the genuine data for said period).
I use the following code to create a prediction interval;
ciTheory <- (outer(sapply(fut1$Q, FUN=function(x) sqrt(diag(x))), qnorm(c(0.05,0.95))) +
as.vector(t(fut1$f)))
However my data does not follow a normal distribution and I wondered whether it would be possible to
adapt the qnorm function for other distributions. I have tried qt, but am unable to apply qgamma.......
Just wondered if anyone knew how you would go about sorting this.....
Below is a reproduced version of my code...
library(dlm)
data <- c(20.68502, 17.28549, 12.18363, 13.53479, 15.38779, 16.14770, 20.17536, 43.39321, 42.91027, 49.41402, 59.22262, 55.42043)
mod.build <- function(par) {
dlmModPoly(1, dV = exp(par[1]), dW = exp(par[2]))
}
# Returns most likely estimate of relevant values for parameters
mle <- dlmMLE(a2, rep(0,2), mod.build); #nileMLE$conv
if(mle$convergence==0) print("converged") else print("did not converge")
mod1 <- dlmModPoly(dV = v, dW = c(0, w))
mod1Filt <- dlmFilter(a1, mod1)
fut1 <- dlmForecast(mod1Filt, n = 7)
Cheers

Fitting Step functions

AIM: The aim here was to find a suitable fit, using step functions, which uses age to describe wage, in the Wage dataset in the library ISLR.
PLAN:
To find a suitable fit, I'll try multiple fits, which will have different cut points. I'll use the glm() function (of the boot library) for the fitting purpose. In order to check which fit is the best, I'll use the cv.glm() function to perform cross-validation over the fitted model.
PROBLEM:
In order to do so, I did the following:
all.cvs = rep(NA, 10)
for (i in 2:10) {
lm.fit = glm(wage~cut(Wage$age,i), data=Wage)
all.cvs[i] = cv.glm(Wage, lm.fit, K=10)$delta[2]
}
But this gives an error:
Error in model.frame.default(formula = wage ~ cut(Wage$age, i), data =
list( : variable lengths differ (found for 'cut(Wage$age, i)')
Whereas, when I run the code given below, it runs.(It can be found here)
all.cvs = rep(NA, 10)
for (i in 2:10) {
Wage$age.cut = cut(Wage$age, i)
lm.fit = glm(wage~age.cut, data=Wage)
all.cvs[i] = cv.glm(Wage, lm.fit, K=10)$delta[2]
}
Hypotheses and Results:
Well, it might be possible that cut() and glm() might not work together. But this works:
glm(wage~cut(age,4),data=Wage)
Question:
So, basically we're using the cut() function, saving it's results in a variable, then using that variable in the glm() function. But we can't put the cut function inside the glm() function. And that too, only if the code is in a loop.
So, why is the first version of the code not working?
This is confusing. Any help appreciated.

Resources