How do I code a spline model with degree 1 and different knots? - r

I'm trying to code a multivariate spline model in which some independent variables have multiple knots and some have none. The variables with splines will always have degree one. I have some code, but I don't know whether to trust it because I haven't done a spline regression in R before (only in some proprietary "black box" software). Below is the code.
I have checked a lot of the 6,000 posts on splines, but I see so many different ways of coding this that I'm confused.
Will anyone either
a) tell me whether this code is doing what I want it to do (degree = 1, different knots), or
b) suggest a better way to do this?
fit1 <- glm(freq ~ channel + term2 + pay_plan_bucket_2 +
              state + eff_year + marital_status + vehicle_type +
              insured_age_bucket + I(pmax(0, insured_age_bucket - 26)) +
              I(pmax(0, insured_age_bucket - 70)) +
              vehicle_length_bucket + I(pmax(0, vehicle_length_bucket - 45)) +
              veh_age + I(pmax(0, veh_age - 7)) + I(pmax(0, veh_age - 18)) +
              rba_bucket + I(pmax(0, rba_bucket - 3500)) + I(pmax(0, rba_bucket - 27000)) +
              credit_tier_bucket + I(pmax(0, credit_tier_bucket - 3)),
            family = quasipoisson(link = "log"),
            data = comp_training_set_newpayplan)
Thank you.

Your brute-force approach to splines is probably correct. Verify it against bs from the splines package, for instance bs(credit_tier_bucket, knots = 3, degree = 1) as a single term in the formula. The basis is constructed differently from your truncated-power coding, so gather the predicted values from both models and check that they are equal; if so, the two ways of coding the splines give equivalent estimation and inference.
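A minimal sketch of that check on made-up data (x, y, and the knot at 5 are placeholders for your variables):

```r
library(splines)

set.seed(1)
x <- runif(200, 0, 10)           # hypothetical predictor
y <- rpois(200, exp(0.1 * x))    # hypothetical count response

# Manual truncated-power basis: linear term plus one hinge at the knot
fit_manual <- glm(y ~ x + I(pmax(0, x - 5)),
                  family = quasipoisson(link = "log"))

# Same model via bs(): degree 1, same interior knot
fit_bs <- glm(y ~ bs(x, knots = 5, degree = 1),
              family = quasipoisson(link = "log"))

# The basis columns differ, but they span the same space,
# so the fitted values should agree up to numerical error
all.equal(unname(fitted(fit_manual)), unname(fitted(fit_bs)))
```

The individual coefficients will differ between the two fits (different bases), which is why the comparison is on fitted values rather than on coefficients.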

Related

lme4: Handling lmer "convergence code: 0"

I am currently fitting a multilevel model with 32 countries (country variable "CNTRY3"). The dependent variable is willingness to pay for environmental protection, "WTP" (scale 1-5, centered). I have included four random slopes in the third step (random intercept, random slope) of the multilevel analysis (I would like to vary these based on my theory):
RINC_ALL_z -> Income per person (z-standardized)
social_trust (scale 1 - 5, centered) -> trust in other people
political_trust (scale 1 - 5, centred) -> trust in politics
EC_cen (scale 1 - 5, centered) -> Environmental awareness
modell.3a <- lmer(WTP ~ RINC_ALL_z + social_trust_cen + political_trust_cen +
                    Men + lowest_degree + middle_degree + requirement_university +
                    uncompleted_university + university_degree + AGE_cen + urban +
                    EC_cen +
                    (RINC_ALL_z + social_trust_cen + political_trust_cen + EC_cen | CNTRY3),
                  data = ISSP2010_1)
Then this convergence warning appears:
convergence code: 0 Model failed to converge with max|grad| =
0.00527884 (tol = 0.002, component 1)
I was able to find out that these convergence warnings can be bypassed by fitting the model with a different optimizer, for example "bobyqa". And indeed, if I fit the model like this, no convergence warning appears any more:
modell.3b <- lmer(WTP ~ RINC_ALL_z + social_trust_cen + political_trust_cen +
                    Men + lowest_degree + middle_degree + requirement_university +
                    uncompleted_university + university_degree + AGE_cen + urban +
                    EC_cen +
                    (RINC_ALL_z + social_trust_cen + political_trust_cen + EC_cen | CNTRY3),
                  data = ISSP2010_1,
                  control = lmerControl(optimizer = "bobyqa", optCtrl = list(maxfun = 1e5)))
I then read in a very interesting article that when you use one optimizer, you should compare it against all the other available optimizers to find out whether the choice of optimizer influences the parameter estimates. No sooner said than done (with the variable environmental awareness): I replicated the plots from the article, but unfortunately the optimizers do not line up in one column (log-likelihood / t-value) as in the article. I have attached the two pictures here. My reading of the log-likelihood and t-value comparisons is that the majority (5) of the optimizers line up in one column (including "bobyqa", which I used) and only 2 optimizers deviate from the majority, so my optimizer should not be influencing the parameter estimates, right?
[Image: log-likelihood comparison]
[Image: t-value comparison]
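For reference, lme4 ships a helper, allFit(), that refits a model with every available optimizer, which automates exactly this comparison. A minimal sketch on the built-in sleepstudy data, as a stand-in for the ISSP model (the original data isn't available here):

```r
library(lme4)

# A small random-slope model on lme4's built-in sleepstudy data
m <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)

# Refit with every available optimizer
aa <- allFit(m)
ss <- summary(aa)

# Compare results across optimizers: for a well-behaved fit, the
# log-likelihoods agree to several decimals and the fixed-effect
# estimates are essentially identical
ss$llik
ss$fixef
```

If one optimizer's log-likelihood or estimates sit far from the cluster formed by the others, that is the fit to distrust, not the cluster.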
My first question: what does such an optimizer actually do? (One just reads that you should switch optimizers to avoid convergence issues.)
My second question: would you agree with my interpretation of the two diagrams?
I would be very happy about an answer, I have been thinking about this for several days... :-(
Many greetings
Joern

Visualising a three way interaction between two continuous variables and one categorical variable in R

I have a model in R that includes a significant three-way interaction between two continuous independent variables (IVContinuousA, IVContinuousB) and one categorical variable, IVCategorical (with two levels: Control and Treatment). The dependent variable (DV) is continuous.
model <- lm(DV ~ IVContinuousA * IVContinuousB * IVCategorical)
You can find the data here
I am trying to find out a way to visualise this in R to ease my interpretation of it (perhaps in ggplot2?).
Somewhat inspired by this blog post, I thought that I could dichotomise IVContinuousB into high and low values (so it would be a two-level factor itself):
IVContinuousBHigh <- mean(IVContinuousB) + sd(IVContinuousB)
IVContinuousBLow <- mean(IVContinuousB) - sd(IVContinuousB)
I then planned to plot the relationship between DV and IVContinuousA, and fit lines representing the slopes of this relationship for different combinations of IVCategorical and my newly dichotomised IVContinuousB:
IVCategoricalControl and IVContinuousBHigh
IVCategoricalControl and IVContinuousBLow
IVCategoricalTreatment and IVContinuousBHigh
IVCategoricalTreatment and IVContinuousBLow
My first question is: does this sound like a viable way to produce an interpretable plot of this three-way interaction? I want to avoid 3D plots if possible, as I don't find them intuitive... Or is there another way to go about it? Maybe facet plots for the different combinations above?
If this is an OK solution, my second question is: how do I generate the data to predict the fit lines representing the different combinations above?
Third question: does anyone have any advice on how to code this up in ggplot2?
I posted a very similar question on Cross Validated but because it is more code related I thought I would try here instead (I will remove the CV post if this one is more relevant to the community :) )
Thanks so much in advance,
Sarah
Note that there are NAs (left as blanks) in the DV column and the design is unbalanced - with slightly different numbers of datapoints in the Control vs Treatment groups of the variable IVCategorical.
FYI, I have the code for visualising a two-way interaction between IVContinuousA and IVCategorical:
A <- ggplot(data = data, aes(x = AOTAverage, y = SciconC, group = MisinfoCondition,
                             shape = MisinfoCondition, col = MisinfoCondition)) +
  geom_point(size = 2) +
  geom_smooth(method = 'lm', formula = y ~ x)
But what I want is to plot this relationship conditional on IVContinuousB....
Here are a couple of options for visualizing the model output in two dimensions. I'm assuming that the goal is to compare Treatment to Control.
library(tidyverse)
library(readxl)   # for read_excel(); not attached by library(tidyverse)

theme_set(theme_classic() +
            theme(panel.background = element_rect(colour = "grey40", fill = NA)))

dat = read_excel("Some Data.xlsx") # I downloaded your data file
mod <- lm(DV ~ IVContinuousA * IVContinuousB * IVCategorical, data=dat)
# Function to create a prediction grid data frame
make_pred_dat = function(data = dat, nA = 20, nB = 5) {
  nCat = length(unique(data$IVCategorical))
  d = with(data,
           data.frame(IVContinuousA = rep(seq(min(IVContinuousA), max(IVContinuousA), length = nA), nB * 2),
                      IVContinuousB = rep(rep(seq(min(IVContinuousB), max(IVContinuousB), length = nB), each = nA), nCat),
                      IVCategorical = rep(unique(IVCategorical), each = nA * nB)))
  d$DV = predict(mod, newdata = d)
  return(d)
}
IVContinuousA vs. DV by levels of IVContinuousB
The roles of IVContinuousA and IVContinuousB can of course be switched here.
ggplot(make_pred_dat(), aes(x=IVContinuousA, y=DV, colour=IVCategorical)) +
geom_line() +
facet_grid(. ~ round(IVContinuousB,2)) +
ggtitle("IVContinuousA vs. DV, by Level of IVContinousB") +
labs(colour="")
You can make a similar plot without faceting, but it gets difficult to interpret as the number of IVContinuousB levels increases:
ggplot(make_pred_dat(nB=3),
aes(x=IVContinuousA, y=DV, colour=IVCategorical, linetype=factor(round(IVContinuousB,2)))) +
geom_line() +
#facet_grid(. ~ round(IVContinuousB,2)) +
ggtitle("IVContinuousA vs. DV, by Level of IVContinousB") +
labs(colour="", linetype="IVContinuousB") +
scale_linetype_manual(values=c("1434","11","62")) +
guides(linetype=guide_legend(reverse=TRUE))
Heat map of the model-predicted difference, DV treatment - DV control on a grid of IVContinuousA and IVContinuousB values
Below, we look at the difference between treatment and control at each pair of IVContinuousA and IVContinuousB.
ggplot(make_pred_dat(nA=100, nB=100) %>%
group_by(IVContinuousA, IVContinuousB) %>%
arrange(IVCategorical) %>%
summarise(DV = diff(DV)),
aes(x=IVContinuousA, y=IVContinuousB)) +
geom_tile(aes(fill=DV)) +
scale_fill_gradient2(low="red", mid="white", high="blue") +
labs(fill=expression(Delta*DV~(Treatment - Control)))
If you really want to avoid 3-d plotting, you could indeed turn one of the continuous variables into a categorical one for visualization purposes.
For the purpose of the answer, I used the Duncan data set from the package car, as it is of the same form as the one you described.
library(car)      # for the Duncan data
library(ggplot2)

# the data
data("Duncan")
# the fitted model; education and income are continuous, type is categorical
lm0 <- lm(prestige ~ education * income * type, data = Duncan)
# turning education into high and low values (you can extend this to more
# levels)
edu_high <- mean(Duncan$education) + sd(Duncan$education)
edu_low <- mean(Duncan$education) - sd(Duncan$education)
# the values below should be used for predictions, each combination of the
# categories must be represented:
prediction_mat <- data.frame(income = Duncan$income,
                             education = rep(c(edu_high, edu_low), each = nrow(Duncan)),
                             type = rep(levels(Duncan$type), each = nrow(Duncan) * 2))
predicted <- predict(lm0, newdata = prediction_mat)
# rearranging the fitted values and the values used for predictions
df <- data.frame(predicted,
                 income = Duncan$income,
                 edu_group = rep(c("edu_high", "edu_low"), each = nrow(Duncan)),
                 type = rep(levels(Duncan$type), each = nrow(Duncan) * 2))
# plotting the fitted regression lines
ggplot(df, aes(x = income, y = predicted, group = type, col = type)) +
geom_line() +
facet_grid(. ~ edu_group)

Solving a single, long linear equation in R - many unknown variables

I have a single thirteen-term equation for different classes of disease prevalence (nx) that looks like this:
y = f*b0*n0 + f*b1*n1 + f*b2*n2 + ... + f*b12*n12
I also have an equality constraint, such that
n0 + n1 + n2 + ... + n12 = 1 ## prevalence must add up to one
The unknowns are the nx (where x = 0, ..., 12).
I would prefer to solve this analytically, but can also make do with a numerical solution.
I would be grateful for any pointers on R packages or approaches (or is this unsolvable?). The project is in support of the current humanitarian response in Iraq - so hopefully a worthwhile cause.
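For what it's worth, two linear equations in thirteen unknowns are underdetermined, so there are infinitely many solutions; one standard choice is the minimum-norm solution via the Moore-Penrose pseudoinverse. A minimal sketch with made-up values for f, the b coefficients, and y:

```r
library(MASS)   # for ginv()

set.seed(42)
f <- 0.8        # hypothetical scaling factor
b <- runif(13)  # hypothetical coefficients b0..b12
y <- 0.3        # hypothetical observed y

# Two equations in thirteen unknowns:
#   y = sum(f * b * n)  and  sum(n) = 1
A   <- rbind(f * b, rep(1, 13))
rhs <- c(y, 1)

# Minimum-norm solution via the pseudoinverse: one of infinitely
# many solutions, and not guaranteed to be non-negative
n <- as.vector(ginv(A) %*% rhs)
max(abs(A %*% n - rhs))   # ~0: both equations satisfied exactly
```

Since prevalences should also be non-negative, a constrained solver may be more appropriate; lsei() in the limSolve package accepts equality and inequality constraints for exactly this kind of problem.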

LMEM: Chi-square = 0 , prob = 1 - what's wrong with my code?

I'm running a LMEM (linear mixed effects model) on some data, and compare the models (in pairs) with the anova function. However, on a particular subset of data, I'm getting nonsense results.
This is my full model:
m3_full <- lmer(totfix ~ psource + cond + psource:cond +
                  (1 + cond | subj) + (1 + psource + cond | object),
                data, REML = FALSE)
And this is the model I'm comparing it to: (basically dropping out one of the main effects)
m3_psource <- lmer(totfix ~ psource + cond + psource:cond - psource +
                     (1 + cond | subj) + (1 + psource + cond | object),
                   data, REML = FALSE)
Running anova(m3_full, m3_psource) returns Chisq = 0, Pr(>Chisq) = 1.
I'm doing the same for a few other LMEMs and everything seems fine; it's just this particular response variable that gives me the weird chi-square and probability values. Does anyone have an idea why, and how I can fix it? Any help will be much appreciated!
This is not really a mixed-model-specific question: rather, it has to do with the way that R constructs model matrices from formulas (and, possibly, with the logic of your model comparison).
Let's narrow it down to the comparison between
form1 <- ~ psource + cond + psource:cond
and
form2 <- ~ psource + cond + psource:cond - psource
(which is equivalent to ~cond + psource:cond). These two formulas give equivalent model matrices, i.e. model matrices with the same number of columns, spanning the same design space, and giving the same overall goodness of fit.
Making up a minimal data set to explore:
dd <- expand.grid(psource=c("A","B"),cond=c("a","b"))
What constructed variables do we get with each formula?
colnames(model.matrix(form1,data=dd))
## [1] "(Intercept)" "psourceB" "condb" "psourceB:condb"
colnames(model.matrix(form2,data=dd))
## [1] "(Intercept)" "condb" "psourceB:conda" "psourceB:condb"
We get the same number of contrasts.
There are two possible responses to this problem.
There is one school of thought (typified by Nelder, Venables, etc.: e.g. see Venables' famous (?) but unpublished exegeses on linear models, section 5, or Wikipedia on the principle of marginality) that says that it doesn't make sense to try to test main effects in the presence of interaction terms, which is what you're trying to do.
There are occasional situations (e.g in a before-after-control-impact design where the 'before' difference between control and impact is known to be zero due to experimental protocol) where you really do want to do this comparison. In this case, you have to make up your own dummy variables and add them to your data, e.g.
## set up model matrix and drop intercept and "psourceB" column
dummies <- model.matrix(form1,data=dd)[,-(1:2)]
## d='dummy': avoid colons in column names
colnames(dummies) <- c("d_cond","d_source_by_cond")
colnames(model.matrix(~d_cond+d_source_by_cond,data.frame(dd,dummies)))
## [1] "(Intercept)" "d_cond" "d_source_by_cond"
This is a nuisance. My guess at the reason for this being difficult is that the original authors of R and S before it were from school of thought #1, and figured that generally when people were trying to do this it was a mistake; they didn't make it impossible, but they didn't go out of their way to make it easy.

producing a 2 class label prediction with neuralnet package in R

I am working with the neuralnet package in R (I am more familiar with nnet).
My target variable is a two-class label (Phone_sale, 1/0). I have a train and a test set. Also, all variables were normalized to the [0, 1] scale.
My nn model is:
wireless_model <- neuralnet(formula = Phone_sale ~ Topflight + Balance +
                              Qual_miles + cc1_miles. + cc2_miles. +
                              cc3_miles. + Bonus_miles + Bonus_trans +
                              Flight_miles_12mo + Flight_trans_12 +
                              Online_12 + Email + Club_member + Any_cc_miles_12mo,
                            data = wireless_train, hidden = 1, linear.output = FALSE)
The predicted results in wireless_model$net.result are floats between 0 and 1 (in fact almost all hover very close to zero), e.g. 0.07 and 0.21, instead of 1 or 0.
So obviously when I compare my train to my test, my prediction is bad because of the two different types of DV.
I want the predicted results to be either 1 or 0. I am sure I did not specify a correct setting somewhere in the neuralnet package.
My guess is that I may need to set the "family" in the formula to logistic so I get a 1 or 0 output, but I'm not sure how that works in this package.
Any help?
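As far as I know, neuralnet has no family argument; with linear.output = FALSE the output unit already applies a logistic activation, so the net.result values are probabilities. The usual fix is to threshold them at 0.5 to get class labels. A minimal sketch, with made-up probabilities standing in for the model output:

```r
# Hypothetical predicted probabilities, standing in for e.g.
# compute(wireless_model, wireless_test)$net.result
probs  <- c(0.07, 0.21, 0.64, 0.02, 0.81)
actual <- c(0, 0, 1, 0, 1)   # hypothetical true test labels

# Threshold at 0.5 to turn probabilities into 0/1 class labels
pred_class <- ifelse(probs > 0.5, 1, 0)

# Confusion matrix of predicted vs. true labels
table(predicted = pred_class, actual = actual)
```

If almost all predicted probabilities hover near zero, it is also worth checking whether the classes are heavily imbalanced; in that case a lower cutoff than 0.5 (or re-examining how the target was coded) may be warranted.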
