How to find the best fitting regression equations - r

I have a very large dataset consisting of car insurance policyholders (C) and those who died in a car accident (D). The dataset includes different rate types (what type of insurance was in place). I want to do a logistic regression as a function of age. Is there a way to find an optimal regression equation?
For example, I currently have something like this in R:
glm( cbind(D, C-D)~d_regr+1, data=data, family=binomial)
where d_regr is something like age, (age^2), (age^3)/3 and so on.
Is there a nice way to find an optimal function that depends only on the variable age, for example by maximizing the pseudo R^2 or a similar criterion?
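One possible approach, sketched here under assumptions: because a pseudo R^2 cannot decrease when terms are added to a nested model, a penalized criterion such as AIC is often used to compare candidate functions of age instead. The data below are simulated (the counts C and D, the risk curve and the range of polynomial degrees are all made up for illustration), and poly(age, k) stands in for the d_regr terms.

```r
# Simulated stand-in for the grouped data: one row per age,
# C policyholders and D deaths (all numbers are hypothetical).
set.seed(1)
age <- 18:90
C <- rep(5000, length(age))
p <- plogis(-7 + 0.0008 * (age - 45)^2)   # made-up U-shaped risk in age
D <- rbinom(length(age), C, p)
dat <- data.frame(age, C, D)

# Fit binomial GLMs with orthogonal polynomials of degree 1 to 4 in age
fits <- lapply(1:4, function(k)
  glm(cbind(D, C - D) ~ poly(age, k), data = dat, family = binomial))

# Compare the candidates by AIC (lower is better)
aic <- sapply(fits, AIC)
names(aic) <- paste0("degree_", 1:4)
print(aic)

# Keep the candidate with the smallest AIC
best <- fits[[which.min(aic)]]
```

The same loop could compare other families of functions of age (for example natural splines via splines::ns) as long as every candidate is fitted to the same rows.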

Related

Is it appropriate to use a factor variable that was created from a predictor variable in a Cox PH regression?

This is for RStudio: survival analysis with data from the "survival" package, running a Cox PH analysis.
I have a basic survival dataset (nafld1) that includes age and bmi. I created a new variable from the existing bmi data: a category A, B, or C based on the quantiles of BMI. When I include these categories in the coxph formula and get results, is it okay to include them, or are they somehow getting "double counted" in the regression, since the formula looks something like
~sex+age+bmi+bmi_g1+bmi_g2+bmi_g3
Conceptually, it's messing with me because the groups are based on bmi and not technically a completely separate covariate (as opposed to like treatment A or B might be), but more of a refinement of detail from the given data.
Any help would be appreciated, and please let me know if you need further detail.
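A hedged sketch of one way to explore this, not a definitive answer: fit the model with continuous bmi, with the derived groups only, and with both, then compare them. It assumes the nafld1 data from the survival package with columns futime, status, male, age and bmi (the formula above uses sex; male is assumed here), and it assumes the groups are tertiles, which may differ from the cut points actually used.

```r
library(survival)

nafld <- nafld1
# Hypothetical grouping: tertiles of bmi labelled A, B, C
brk <- quantile(nafld$bmi, probs = c(0, 1/3, 2/3, 1), na.rm = TRUE)
nafld$bmi_g <- cut(nafld$bmi, breaks = brk, labels = c("A", "B", "C"),
                   include.lowest = TRUE)

# Continuous bmi only, groups only, and both together
fit_cont <- coxph(Surv(futime, status) ~ male + age + bmi, data = nafld)
fit_cat  <- coxph(Surv(futime, status) ~ male + age + bmi_g, data = nafld)
fit_both <- coxph(Surv(futime, status) ~ male + age + bmi + bmi_g, data = nafld)

# All three fits drop the same rows (missing bmi), so AIC is comparable
print(AIC(fit_cont, fit_cat, fit_both))

# Likelihood-ratio test: do the group steps add anything beyond linear bmi?
print(anova(fit_cont, fit_both))
```

In the combined model the group terms act as step shifts on top of the linear bmi trend, so their coefficients are interpreted conditional on bmi rather than as a separate covariate.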

Survival Curves For Cox PH Models. Checking My Understanding About Plotting Them

I'm using the book Applied Survival Analysis Using R by Moore to try to model some time-to-event data. The issue I'm running into is plotting the estimated survival curves from the Cox model. Because of this I'm wondering whether my understanding of the model is wrong. My data is simple: a time column t, an event indicator column i (1 for event, 0 for censored), and a predictor column p with 6 factor levels.
I believe I can plot estimated survival curves for a Cox model as shown below. But I don't understand how to use survfit with base plotting, or functions from survminer, to achieve the same end. Here is some generic code to clarify my question. I'll use the pharmacoSmoking data set to demonstrate my issue.
library(survival)
library(asaur)
t<-pharmacoSmoking$longestNoSmoke
i<-pharmacoSmoking$relapse
p<-pharmacoSmoking$levelSmoking
data<-as.data.frame(cbind(t,i,p))
model <- coxph(Surv(data$t, data$i) ~ p, data=data)
As I understand it, with the following code snippets, modeled after book examples, a baseline (cumulative) hazard at my reference factor level for p may be given from
base<-basehaz(model, centered=F)
An estimate of the survival curve is given by
s<-exp(-base$hazard)
t<-base$time
plot(s~t, typ = "l")
The survival curve associated with a different factor level may then be given by
beta_n<-model$coefficients #only one coef in this case
s_n <- s^(exp(beta_n))
lines(s_n~t)
where beta_n is the coefficient for the nth factor level from the Cox model. The code above gives what I think are estimated survival curves for heavy vs light smokers in the pharmacoSmoking dataset.
Since that's a fair bit of code, I was looking to packages for a one-liner solution. I had a hard time with the documentation for survival (there weren't many examples in the docs) and also tried survminer. For the latter I've tried:
library(survminer)
ggadjustedcurves(model, variable ="p" , data=data)
This gives me something different from my prior code, although it is similar. Is the method I used earlier incorrect? Or is there a different methodology that accounts for the difference? The survminer code doesn't work on my own data (I get a "cannot allocate vector of size ..." error; my data is ~1m rows), which seems odd considering I can make plots with my earlier approach without a problem. This is the primary reason I am wondering whether I understand how to plot survival curves for my model.
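For comparison, a minimal sketch of the survfit route with base plotting, on simulated data rather than pharmacoSmoking (the sample size, hazards and the heavy/light labels are made up). It keeps p as a factor in a data.frame, asks survfit for one curve per factor level via newdata, and then repeats the manual basehaz calculation described above so the two can be checked against each other.

```r
library(survival)

# Simulated stand-in: two-level factor p, exponential event times,
# uniform censoring (all parameters are hypothetical)
set.seed(42)
n <- 300
d <- data.frame(p = factor(sample(c("heavy", "light"), n, replace = TRUE)))
haz <- ifelse(d$p == "heavy", 0.03, 0.015)
event_time <- rexp(n, haz)
cens_time <- runif(n, 0, 120)
d$t <- pmin(event_time, cens_time)
d$i <- as.integer(event_time <= cens_time)

fit <- coxph(Surv(t, i) ~ p, data = d)

# One predicted curve per factor level
nd <- data.frame(p = factor(levels(d$p), levels = levels(d$p)))
sf <- survfit(fit, newdata = nd)
plot(sf, col = 1:2, xlab = "Time", ylab = "Estimated survival")
legend("topright", legend = levels(d$p), col = 1:2, lty = 1)

# Manual version: baseline at the reference level, then scale by exp(beta)
base <- basehaz(fit, centered = FALSE)
s_ref <- exp(-base$hazard)
s_other <- s_ref^exp(coef(fit))
```

Here sf$surv has one column per row of nd, and the manual curves are expected to match those columns. Note that building the data with cbind, as in the code above, turns a factor into its integer codes, in which case the uncentered baseline refers to a covariate value of 0 rather than to a factor level.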

Consider autocorrelation in a Linear Quantile mixed models (LQMM)

(I am using R and the lqmm package)
I was wondering how to account for autocorrelation in a linear quantile mixed model (LQMM).
I have a data frame that looks like this:
df1<-data.frame( Time=seq(as.POSIXct("2017-11-13 00:00:00",tz="UTC"),
as.POSIXct("2017-11-13 00:1:59",tz="UTC"),"sec"),
HeartRate=rnorm(120, mean=60, sd=10),
Treatment=rep("TreatmentA",120),
AnimalID=rep("ID01",120),
Experiment=rep("Exp01",120))
df2<-data.frame( Time=seq(as.POSIXct("2017-08-11 00:00:00",tz="UTC"),
as.POSIXct("2017-08-11 00:1:59",tz="UTC"),"sec"),
HeartRate=rnorm(120, mean=62, sd=14),
Treatment=rep("TreatmentB",120),
AnimalID=rep("ID02",120),
Experiment=rep("Exp02",120))
df<-rbind(df1,df2)
head(df)
With:
The heart rates (HeartRate) are measured every second on some animals (AnimalID). These measurements are taken during an experiment (Experiment), each with one of several possible treatments (Treatment). Each animal (AnimalID) was observed in multiple experiments with different treatments. I wish to look at the effect of the variable Treatment on the 90th percentile of the heart rates, while including Experiment as a random effect and accounting for the autocorrelation (as heart rates are taken every second). (If there is a way to include AnimalID as a random effect as well, that would be even better.)
Model for now:
library(lqmm)
model<-lqmm(fixed= HeartRate ~ Treatment, random= ~1, group= Experiment, data=df, tau=0.9)
Thank you very much in advance for your help.
Let me know if you need more information.
For resources on thinking about this type of problem you might look at chapters 17 and 19 of Koenker et al. 2018 Handbook of Quantile Regression from CRC Press. Neither chapter has nice R code to go from, but they discuss different approaches to the kind of data you're working with. lqmm does use nlme machinery, so there may be a way to customize the covariance matrices for the random effects, but I suspect it would be easiest to either ask for help from the package author or to do a deep dive into the package code to figure out how to do that.
Another resource is the quantile regression model for mixed effects models accounting for autocorrelation in 'Quantile regression for mixed models with an application to examine blood pressure trends in China' by Smith et al. (2015). They model a bivariate response with a copula, but you could do the simplified version with a univariate response. I think their model at this point only incorporates a lag-1 correlation structure within subjects/clusters. The code for that model does not seem to be available online either, though.

Running diagnostics on a multivariate multiple regression in r

I have a data set that gives the rates of incidence of some phenomena in all the zip codes of a state, and some demographic data. The rates are given for each year in the data set (year 1 - year 6). A snippet of the data is available here.
I've run a multivariate linear regression to examine the impact of the demographic variables on the rates, per Fox & Weisberg (2011), weighted by the average zip code population across all years (var = POPmean):
Y <- cbind(data$rateY1, data$rateY2, data$rateY3, data$rateY4, data$rateY5, data$rateY6)
model <- lm(Y ~ someVAR1+someVAR2+someVAR3+someVAR4+someVAR5, data=data, weights= POPmean)
summary(model)
coef(model)
summary(manova(model))
I'd like to plot the regression diagnostics for this model for each year, but have no idea how to do so. I'd like to use influencePlot() from the car package, but when I try to do so:
influencePlot(model, id.method="noteworthy", main="Robustness Check")
I receive an error stating that the lengths of x,y differ (which, of course, they do). Can anyone help figure out how to plot the regression diagnostics for the model given above? Or suggest an alternative method?
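One possible workaround, offered as a sketch under assumptions: the coefficients of a multivariate lm are the same as those from fitting each response column separately with the same predictors and weights, so per-year diagnostics can be produced from one univariate fit per year. The data below are simulated with two responses and two predictors (x1 and x2 are hypothetical stand-ins for someVAR1 to someVAR5).

```r
# Simulated stand-in for the zip code data (all values hypothetical)
set.seed(7)
n <- 60
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n),
                  POPmean = runif(n, 500, 5000))
dat$rateY1 <- 1 + 0.5 * dat$x1 + rnorm(n)
dat$rateY2 <- 2 - 0.3 * dat$x2 + rnorm(n)

# One weighted univariate fit per response column
responses <- c("rateY1", "rateY2")
fits <- lapply(responses, function(r)
  lm(reformulate(c("x1", "x2"), response = r), data = dat,
     weights = POPmean))
names(fits) <- responses

# Standard diagnostic plots, one row of panels per year
op <- par(mfrow = c(length(fits), 2))
for (r in responses) {
  plot(fits[[r]], which = 1, main = r)   # residuals vs fitted
  plot(fits[[r]], which = 5, main = r)   # residuals vs leverage
}
par(op)

# Cook's distances: one column per year
cooks <- sapply(fits, cooks.distance)
```

Each element of fits is an ordinary lm object, so a call such as influencePlot(fits$rateY1) from the car package should work on it in the same way as on any single-response model.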

Plotting a fixed effect against model estimated values?

So I've looked at a number of similarly themed posts but none of them seem to be exactly what I need, or I simply don't really understand the solutions they offered... So here it goes...
I ran a mixed-effects model with lme4 to look at some chimpanzee data. I have two factors (aggression rate; copulation rate) which affect my dependent (feeding time).
I would like to produce two scatter plots which show the relationship between each of the predictors and the outcome variable but I would like to draw a line, which is derived from the model estimates (and not an abline of the (lm(y ~ x)) type, which only gives a simple regression line, not one based on the full LMM).
I have a sense that this is only possible with ggplot2 but I have not been able to actually figure out how to do this. Having spent most of the day looking through books and forums, I was hoping this is something that may have a fairly straight-forward answer, if one knows what they are doing.
Thanks for any tips in advance!
Alex
To begin with I had the following model:
M3reml
Linear mixed model fit by REML ['lmerMod']
Formula: z.feeding_time ~ z.copul_rate + z.agro_given + z.agro_recd + (1 | Male) + ac_term
Data: N85
where the variables are the z-transformed values of: male chimpanzee feeding time (z.feeding_time); daily copulation rates with females (acts/hr; z.copul_rate); daily rate of aggression given (z.agro_given); and daily rate of aggression received (z.agro_recd). The random effect is male ID for the 12 males of my study, and there is a temporal autocorrelation term (ac_term).
I wanted to produce a regression line based on the model estimates for male feeding time.
Getting the estimates:
p1<-predict(M3reml)
Plotting the estimates against male rates of aggression (z-transformed values):
plot(p1~z.agro_given, data=N85)
adding a regression line:
abline(lm(p1~z.agro_given, data=N85))
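An alternative to regressing the fitted values on the predictor, sketched under assumptions: predict from the mixed model over a grid of the predictor of interest using only the fixed effects (re.form = NA), then draw that line over the raw data. The example uses the sleepstudy data shipped with lme4 as a stand-in for N85, so the variable names differ from the model above.

```r
library(lme4)

# Stand-in model: random intercept per subject, one fixed predictor
fit <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)

# Grid over the predictor of interest
nd <- data.frame(Days = seq(min(sleepstudy$Days), max(sleepstudy$Days),
                            length.out = 50))

# Population-level predictions: fixed effects only, no random effects
nd$pred <- predict(fit, newdata = nd, re.form = NA)

# Raw data with the model-based line on top (base graphics)
plot(Reaction ~ Days, data = sleepstudy)
lines(pred ~ Days, data = nd, lwd = 2)
```

With several predictors, the grid would need a column for each one; the others are commonly held at their mean, which is 0 for z-transformed variables.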
I would post an image of the plot here but apparently I am not allowed to yet.
