Using predict() to fit a line of predicted values onto a visualized model in R

I've created a linear mixed-effects model where the explanatory variable is mother age and the response variable is partner age, along with several random effects.
I've had no problem creating the model, but now that I'm using it to generate predicted values to fit to the graph I am having some difficulties!
As far as I can tell the code is correct, and I cannot tell why this isn't working.
First I tried the following:
pred <- expand.grid(Mother.age=EPPs$Mother.age)
head(pred)
tail(pred)
pred$fit <- predict(m6, newdata=EPPs, type='response', re.form=~0)
head(pred) # here I have predicted and fitted values, but they are not in order, so I tried the following
pred[order(pred$fit),] ## this said that it worked but changed nothing
plot(jitter(Genetic.father.age) ~ jitter(Mother.age), data=EPPs)
lines(fit ~ Mother.age, data=pred, col='red')
This essentially ends up as a massive back-and-forth red scribble instead of a line, so I decided to manually order the data points in Excel and try again.
pred1 <- read.csv("predictions1.csv", header=TRUE) ## these are the ordered points
head(pred1)
tail(pred1)
pred$fit <- predict(m6, newdata=EPPs, type='response', re.form=~0)
plot(jitter(Genetic.father.age) ~ jitter(Mother.age), data=EPPs)
lines(fit ~ Mother.age, data=pred1, col='red')
Now this gets me very close to a straight line, but it still has little 'step-ups' at each year on the x-axis; what I really want is a smooth, flat line!
Any help would be appreciated; I'm not sure what else I can do.
Oh, while I'm here: any recommendations for a post-hoc test for comparing two lme models to see if there is a significant difference? In this case it would be exactly the same model, except the response variable in the first is genetic father age and in the second is social father age (I want to show that extra-pair males are consistently older than the cuckolded social males on average).
Thanks!!
Whole example
library(lme4) # needed below for lmer()
EPPs <- read.csv("EPPs.csv", header=TRUE) # data I am using
Turn these bad boys into factors to be used as random effects
EPPs$Mother <- as.factor(EPPs$Mother)
EPPs$Mother.Cohort <- as.factor(EPPs$Mother.Cohort)
EPPs$Brood.year <- as.factor(EPPs$Brood.year)
EPPs$Social.Father <- as.factor(EPPs$Social.Father)
Within-subject centering
AveByInd <- function(x) mean(x)
d2 <- do.call("rbind", as.list(
  by(EPPs, EPPs["Mother"], transform, AveMAge=AveByInd(Mother.age))))
par(mfrow=c(1,1))
hist(d2$AveMAge, xlab="Average mother age", ylab="Frequency", main="")
WithinIndCentr <- function(x) x - mean(x)
d2 <- do.call("rbind", as.list(
  by(d2, d2["Mother"], transform, WithinMAge=WithinIndCentr(Mother.age))))
par(mfrow=c(1,1))
hist(d2$WithinMAge, xlab="Within-female centered age", ylab="Frequency", main="")
# Reference columns through data= (not d2$...) so predict() with newdata works
m6 <- lmer(Genetic.father.age ~ WithinMAge + AveMAge + (1|Mother) +
           (1|Mother.Cohort) + (1|Brood.year) + (1|Social.Father), data=d2)
summary(m6)
The model I have selected.
Everything above this point is working fine.
Now I want to use the predictions generated by my model to add a line of best fit to my graph.
I should mention I decided to try it with the Mother.age variable rather than the within-subject centered variables, as I thought it would be more straightforward.
Originally I was trying this
pred <- expand.grid(Mother.age=EPPs$Mother.age)
pred$fit <- predict(m6, newdata=EPPs, type='response', re.form=~0)
head(pred)
pred[order(pred$fit),]
I was told I had to order the matrix for it to work; however, this doesn't seem to work.
plot(jitter(Genetic.father.age) ~ jitter(Mother.age), data=EPPs)
lines(fit ~ Mother.age, data=pred, col='red')
This results in a horrible back-and-forth mess of red lines.
I was told by a friend that I needed to order the matrix to get it to work, so I tried doing it manually in Excel and re-uploading.
pred1 <- read.csv("predictions1.csv", header=TRUE) ## almost works
head(pred1) ## predictions are in order
dev.off()
plot(jitter(Genetic.father.age) ~ jitter(Mother.age), data=EPPs)
I'm using jitter because the data points are discrete and I want to give an idea of areas with lots of overlapping points.
lines(fit ~ Mother.age, data=pred1, col='red')
Now I get a nearly straight line, but it still has little 'steps' and I want it to be totally smooth.
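For what it's worth, two details in the code above would produce exactly these symptoms: pred[order(pred$fit),] prints the sorted frame but never assigns it back (you would need pred <- pred[order(...),]), and the sort should be on the x variable rather than on the fitted values. A minimal sketch of the usual approach, assuming m6 is fit with covariates referenced through data= as above; AveMAge is held at its mean so the fixed-effect line comes out smooth rather than stepped:
## Predict on a grid ordered by the x variable, fixed effects only (re.form=~0)
grid <- data.frame(WithinMAge = sort(unique(d2$WithinMAge)),
                   AveMAge    = mean(d2$AveMAge))
grid$fit <- predict(m6, newdata = grid, re.form = ~0)
plot(jitter(Genetic.father.age) ~ jitter(Mother.age), data = d2)
## x-axis position: within-mother age plus the overall mean of the average ages
lines(grid$WithinMAge + mean(d2$AveMAge), grid$fit, col = "red")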

Related

How to correctly take out zero observations in panel data in R

I'm running into some problems while running plm regressions on my panel database. Basically, I have to take out a year from my data and also all observations of some variable that are zero. I tried to make a reproducible example using a dataset from the AER package.
library(AER)
library(plm)
data("Grunfeld", package = "AER")
View(Grunfeld)
#Here I randomize some observations of the third variable (capital) as zero, to reproduce my dataset
set.seed(1) # added so the randomization is reproducible
for (i in 1:220) {
  x <- rnorm(10, 0, 1)
  if (mean(x) >= 0) {
    Grunfeld[i, 3] <- 0
  }
}
View(Grunfeld)
#First Method
#This is how I was originally manipulating my data and running my regression
panel <- Grunfeld
dd <-pdata.frame(panel, index = c('firm', 'year'))
dd <- dd[dd$year!=1935, ]
dd <- dd[dd$capital !=0, ]
ols_model_2 <- plm(log(value) ~ (capital), data=dd)
summary(ols_model_2)
#However, I couldn't plot the variables of the dataset in graphs, because they weren't vectors. So I tried another way:
#Second Method
panel <- panel[panel$year!= 1935, ]
panel <- panel[panel$capital != 0,]
ols_model <- plm(log(value) ~ log(capital), data=panel, index = c('firm','year'))
summary(ols_model)
#But this gave extremely different results for the ols regression!
To my understanding, both approaches should have yielded the same output in the OLS regression. Now I'm afraid my entire analysis is wrong, because I was doing it the first way. Could anyone explain to me what is happening?
Thanks in advance!
You are running two different models; I am not sure why you would expect the results to be the same.
Your first model is:
ols_model_2 <- plm(log(value) ~ (capital), data=dd)
While the second is:
ols_model <- plm(log(value) ~ log(capital), data=panel, index = c('firm','year'))
As you can see from the summaries of the models, both are "Oneway (individual) effect Within Model". In the first one you don't specify the index, since dd is a pdata.frame object; in the second you do, because panel is a plain data.frame. However, this makes no difference at all.
The difference is using the log of capital or capital without log.
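To see that this is the whole story, refitting the first model with the same transformation should reproduce the second. A sketch reusing the dd object from the question:
# Same formula as ols_model, but on the pdata.frame: results should now agree
ols_model_2b <- plm(log(value) ~ log(capital), data = dd)
summary(ols_model_2b)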
As a side note, leaving out 0 observations is often very problematic. If you do that, make sure you also try alternative ways of dealing with zero, and see how much your results change. You can get started here https://stats.stackexchange.com/questions/1444/how-should-i-transform-non-negative-data-including-zeros
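If you do experiment with alternatives, one common (if imperfect) option is to shift before logging rather than dropping the zeros. A sketch, built from a fresh copy that keeps the zero observations:
# log1p(capital) = log(1 + capital), which is defined at zero
panel2 <- Grunfeld[Grunfeld$year != 1935, ]
ols_model_alt <- plm(log(value) ~ log1p(capital), data = panel2, index = c('firm','year'))
summary(ols_model_alt)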

Fixed coefficient/Offset in Fine&Gray competing-risk adjusted model (FGR)

I want to fit a Fine & Gray competing-risk adjusted model including an offset. In other types of models, I am used to being able to simply put in offset(x), which adds an offset with its coefficient fixed at 1.
I tried to do the same using the FGR function from the riskRegression package. I didn't get a warning message, but I then noticed that the coefficients of the other variables were exactly the same in the models with and without offset(x).
Example:
# install.packages("riskRegression")
library(riskRegression)
matrix <- matrix(c(3,6,3,2,5,4,7,2,8,2,
0.8,0.6,0.4,0.25,0.16,0.67,0.48,0.7,0.8,0.78,
60,55,61,62,70,49,59,63,62,64,
15,16,18,12,16,13,19,12,15,14,
0,2,1,0,1,1,0,1,2,0,
345,118,225,90,250,894,128,81,530,268),
nrow=10,ncol=6)
df <- data.frame(matrix)
colnames(df) <- c("x","y","z", "a","event","time")
fit <- FGR(Hist(time,event)~ offset(x)+a+y+z, data=df, cause=1)
fit
fit2 <- FGR(Hist(time,event)~ a+y+z, data=df, cause=1)
fit2
If you run this script, you can see that the coefficients of a, y and z do not change, yet you get no warning that the offset cannot be used (so apparently offset(x) is simply ignored).
Does anybody know of a way to include x as an offset (i.e. with its coefficient fixed at 1) in FGR? (Edit: or another way to calculate the correct coefficients for a, y and z with x fixed?)
You can use the survival package for Fine-Gray models with offsets: just wrap the variable you would like to have as the offset in offset(var). I set the model below to model event 1. See the code below:
library(survival)
matrix <- matrix(c(3,6,3,2,5,4,7,2,8,2,
0.8,0.6,0.4,0.25,0.16,0.67,0.48,0.7,0.8,0.78,
60,55,61,62,70,49,59,63,62,64,
15,16,18,12,16,13,19,12,15,14,
0,2,1,0,1,1,0,1,2,0,
345,118,225,90,250,894,128,81,530,268),
nrow=10,ncol=6)
df <- data.frame(matrix)
colnames(df) <- c("x","y","z", "a","event","time")
coxph(Surv(time,event==1)~ offset(x)+a+y+z, data=df)
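A caution on the snippet above: coxph(Surv(time, event==1) ~ ...) fits a cause-specific Cox model, treating the competing event as censoring, which is not the same as the Fine-Gray subdistribution model. For a genuine Fine-Gray fit with an offset, one sketch uses survival::finegray() to build the weighted dataset first (the factor coding below is an assumption about how the events are labelled):
library(survival)
# finegray() needs the event as a factor whose first level is censoring
df$event.f <- factor(df$event, levels = c(0, 1, 2),
                     labels = c("censor", "ev1", "ev2"))
fg <- finegray(Surv(time, event.f) ~ ., data = df, etype = "ev1")
# Fit on the expanded data with its case weights; offset(x) fixes x's coefficient at 1
coxph(Surv(fgstart, fgstop, fgstatus) ~ offset(x) + a + y + z,
      weights = fgwt, data = fg)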

Partial residual plots for linear model including an interaction term

My model includes one response variable, five predictors and one interaction term for predictor_1 and predictor_2. I would like to plot partial residual plots for every predictor variable, which I would normally do using the crPlots function from the car package. Unfortunately the function complains that it doesn't work with models that include interaction terms.
Is there another way of doing what I want?
EDIT: I created a small example illustrating the problem
require(car)
R <- c(0.53,0.60,0.64,0.52,0.75,0.66,0.71,0.49,0.52,0.59)
P1 <- c(3.1,1.8,1.8,1.8,1.8,3.2,3.2,2.8,3.1,3.3)
P2 <- c(2.1,0.8,0.3,0.5,0.4,1.3,0.5,1.2,1.6,2.1)
lm.fit1 <- lm(R ~ P1 + P2)
summary(lm.fit1)
crPlots(lm.fit1) # works fine
lm.fit2 <- lm(R ~ P1*P2)
summary(lm.fit2)
crPlots(lm.fit2) # not available
Another way to do this is to put the interaction term in as a separate variable (which avoids hacking the code for crPlot(...)).
df <- data.frame(R,P1,P2,P1.P2=P1*P2)
lm.fit1 <- lm(R ~ ., df)
summary(lm.fit1)
crPlots(lm.fit1)
Note that summary(lm.fit1) yields exactly the same result as summary(lm(R ~ P1*P2, df)).
I must admit I'm not that familiar with partial residual plots, so I'm not entirely sure what their proper interpretation should be given an interaction term. But basically, the equivalent of
crPlot(lm.fit1, "P1")
is
x <- predict(lm.fit1, type="term", term="P1")
y <- residuals(lm.fit1, type="partial")[,"P1"]
plot(x, y)
abline(lm(y~x), col="red", lty=2)
loessLine(x,y,col="green3",log.x = FALSE, log.y = FALSE, smoother.args=list())
so really, there's no real reason the same idea couldn't work with an interaction term as well. We just leave the partial contribution from a variable due to the interaction as a separate entity and focus on the non-interaction contribution. So what I'm going to do is take out the check for the interaction term, and then we can use the function. Assuming that
body(car:::crPlot.lm)[[11]]
# if (any(attr(terms(model), "order") > 1)) {
# stop("C+R plots not available for models with interactions.")
# }
we can copy and modify it to create a new function without the check:
crPlot2 <- car:::crPlot.lm
body(crPlot2) <- body(crPlot2)[-11]
environment(crPlot2) <- asNamespace("car")
And then we can run
layout(matrix(1:2, ncol=2))
crPlot2(lm.fit2, "P1")
crPlot2(lm.fit2, "P2")
to get the two partial residual plots.
I'm sure the authors had a good reason for not incorporating models with interaction terms so use this hack at your own risk. It's just unclear to me what should happen to the residual from the interaction term when making the plot.

"Vectorizing" this for-loop in R? (suppressing interaction main effects in lm)

When interactions are specified in lm, R includes main effects by default, with no option to suppress them. This is usually appropriate and convenient, but there are certain instances (within estimators, ratio LHS variables, among others) where this isn't appropriate.
I've got this code that fits a log-transformed variable to a response variable, independently within subsets of the data.
Here is a silly yet reproducible example:
id = as.factor(c(1,2,2,3,3,3,4,4,4,4,5,5,5,5,6,7,7,8,8,8,9,9,9,9,10))
x = rexp(length(id))
y = rnorm(length(id))
logx = log(x)
data = data.frame(id,y,logx)
for (i in levels(data$id)) { # loop over each id once (not once per row)
  sub = subset(data, id==i) # This splits the data by id
  m = lm(y~logx-1, data=sub) # This gives me the linear (log) fit for one of my id's
  sub$x.tilde = log(1+3)*m$coef # This linearizes it and gives me the expected value for x=3
  data$x.tilde[data$id==i] = sub$x.tilde # This puts it back into the main dataset
  data$tildecoeff[data$id==i] = m$coef # This saves the coefficient (I use it elsewhere for plotting)
}
I want to fit a model like the following:
Y = B(X*id) +e
with no intercept and no main effect of id. As you can see from the loop, I'm interested in the expectation of Y when X=3, constraining the fit through the origin (because Y is a (logged) ratio Y[X=something]/Y[X=0]).
But if I specify
m = lm(Y~X*as.factor(id)-1)
there is no means of suppressing the main effects of id. I need to run this loop several hundred times in an iterative algorithm, and as a loop it is far too slow.
The other upside of de-looping this code is that it'll be much more convenient to get prediction intervals.
(Please, I don't need pious comments about how leaving out main effects and intercepts is improper -- it usually is, but I can promise that it isn't in this instance).
Thanks in advance for any ideas!
I think you want
m <- lm(y ~ 0 + logx:as.factor(id))
see R-intro, section 11.1, 'Defining statistical models; formulae'.
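As a sketch of how this replaces the loop (using the example data above, where id is already a factor): the interaction-only fit returns one through-origin slope per id, identical to the separate per-id fits, and the slopes can be mapped back onto the rows by coefficient name.
m <- lm(y ~ 0 + logx:id, data = data) # one slope per id, no intercept, no id main effect
slopes <- coef(m)                     # named "logx:id1", "logx:id2", ...
data$tildecoeff <- slopes[paste0("logx:id", data$id)]
data$x.tilde <- log(1 + 3) * data$tildecoeff # expected value at x = 3, as in the loop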

How to get a plot from nls in R?

In R I use nls to do a nonlinear least-squares fit. How do I then plot the model function using the coefficient values that the fit provided?
(Yes, this is a very naive question from a relative R newbie.)
Using the first example from ?nls and following the example I pointed you to line by line achieves the following:
#This is just our data frame
DNase1 <- subset(DNase, Run == 1)
DNase1$lconc <- log(DNase1$conc)
#Fit the model
fm1DNase1 <- nls(density ~ SSlogis(lconc, Asym, xmid, scal), DNase1)
#Plot the original points
# first argument is the x values, second is the y values
plot(DNase1$lconc,DNase1$density)
#This adds to the already created plot a line
# once again, first argument is x values, second is y values
lines(DNase1$lconc,predict(fm1DNase1))
The predict method for an nls object automatically returns the fitted y values. Alternatively, you can add a step and do
yFitted <- predict(fm1DNase1)
and pass yFitted as the second argument to lines instead.
Or if you want a "smooth" curve, what you do is to simply repeat this but evaluate the function at more points:
r <- range(DNase1$lconc)
xNew <- seq(r[1],r[2],length.out = 200)
yNew <- predict(fm1DNase1,list(lconc = xNew))
plot(DNase1$lconc,DNase1$density)
lines(xNew,yNew)
coef(x) returns the coefficients for regression results x.
model<-nls(y~a+b*x^k,my.data,list(a=0.,b=1.,k=1))
plot(y~x,my.data)
a<-coef(model)[1]
b<-coef(model)[2]
k<-coef(model)[3]
x <- 1:10
lines(x, a + b*x^k, col='red')
For example.
I know what you want (I'm a scientist). This isn't it, but it at least shows how to use curve to plot your fitting function over any range, and the curve will be smooth. Using the same data set as above:
nonlinFit <- nls(density ~ a - b*exp(-c*conc), data = DNase1, start = list(a=1, b=1, c=1) )
fitFnc <- function(x) predict(nonlinFit, list(conc=x))
curve(fitFnc, from=.5, to=10)
or,
curve(fitFnc, from=8.2, to=8.4)
or,
curve(fitFnc, from=.1, to=50) # well outside the data range
or whatever (without setting up a sequence of evaluation points first).
I'm a rudimentary R programmer, so I don't know how to implement (elegantly) something like ReplaceAll (/.) in Mathematica, which one would use to replace occurrences of the symbolic parameters in the model with the fitted parameters. This first step works, although it looks horrible:
myModel <- "a - b*exp(-c*conc)"
nonlinFit <- nls(as.formula(paste("density ~", myModel)), data = DNase1, start = list(a=1, b=1, c=1) )
It leaves you with a separate 'model' (as a character string) that you might be able to make use of with the fitted parameters; doing it cleanly (not digging out a, b, c by hand) would simply use nonlinFit, though I'm not sure how.
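One sketch of that substitution idea, assuming the myModel string and nonlinFit from above: evaluate the model expression in an environment built from coef(), so the parameters never have to be extracted one by one.
# Evaluate the model string with the fitted coefficients substituted in
fitFnc2 <- function(conc)
  eval(parse(text = myModel), c(as.list(coef(nonlinFit)), list(conc = conc)))
curve(fitFnc2, from = 0.5, to = 10) # smooth curve over any range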
The function "curve" will plot functions for you.
