Draw a logarithmic curve on a graph in R

I have the following set of data, which shows a curvilinear relationship when plotted:
Fish.species.richness Habitat.Complexity log.habitat
17 0.6376 -0.1954858
13 0.2335 -0.6317131
30 0.2866 -0.5427238
20 0.3231 -0.4906630
22 0.1073 -0.9694003
25 0.2818 -0.5500590
2 0.2182 -0.6612448
4 0.0189 -1.7246886
19 0.2960 -0.5287083
25 0.5507 -0.2590849
29 0.2689 -0.5704900
21 0.6286 -0.2016602
18 0.1557 -0.8078509
24 0.6851 -0.1642460
30 0.5059 -0.2959353
32 0.4434 -0.3532043
29 0.3585 -0.4455108
32 0.5920 -0.2276783
When I log the x axis and do a linear regression to find the intercept and slope I am able to add a line that fits the data:
summary(lm(Fish.species.richness~log.habitat,data=three))
plot(three$log.habitat,
three$Fish.species.richness,
xlab='Log Habitat Complexity',
ylab='Fish Species Richness')
abline(29.178,13.843)
However, when I then do a curvilinear regression and try to plot the curve, it doesn't fit the data. Where am I going wrong?
mod.log<-lm(Fish.species.richness~log(Habitat.Complexity),data=three)
plot(three$Habitat.Complexity,
three$Fish.species.richness)
abline(mod.log)

Using ggplot2:
library(ggplot2)
ggplot(three, aes(Habitat.Complexity, Fish.species.richness)) +
  geom_point(shape = 1) +
  stat_smooth(method = "lm", formula = y ~ log(x))

abline can only draw straight lines, of the form y = a + bx. Other curves can be added using the curve function.
plot(Fish.species.richness ~ Habitat.Complexity, three)
curve(coef(mod.log)[1] + coef(mod.log)[2]*log(x), add=TRUE)

For clarity, and for flexibility with other model types, you may want to use the predict function to calculate predicted values along the range of your predictor variable:
mod.log<-lm(Fish.species.richness~log(Habitat.Complexity), data=three)
# predict along predictor variable range
newdat <- data.frame(Habitat.Complexity=seq(min(three$Habitat.Complexity),
                                            max(three$Habitat.Complexity),
                                            length.out=100))
newdat$Fish.species.richness <- predict(mod.log, newdat, type="response")
# plot
plot(Fish.species.richness ~ Habitat.Complexity, data=three)
lines(Fish.species.richness ~ Habitat.Complexity, data=newdat)

Related

R - plotting a mixed-effect graph by group [duplicate]

I am new with mixed effect models and I need your help please.
I have plotted the below graph in ggplot:
ggplot(tempEf,aes(TRTYEAR,CO2effect,group=Myc,col=Myc)) +
facet_grid(~N) +
geom_smooth(method="lm",se=T,size=1) +
geom_point(alpha = 0.3) +
geom_hline(yintercept=0, linetype="dashed") +
theme_bw()
However, I would like to represent a mixed effects model instead of lm in geom_smooth, so I can include SITE as a random effect.
The model would be the following:
library(lme4)
tempEf$TRTYEAR <- as.numeric(tempEf$TRTYEAR)
mod <- lmer(r ~ Myc * N * TRTYEAR + (1|SITE), data=tempEf)
I have included TRTYEAR (year of treatment) because I am also interested in the patterns of the effect, which may increase or decrease over time for some groups.
Next is my best attempt so far to extract the plotting variables out of the model, but it only extracted the values for TRTYEAR = 5, 10 and 15.
library(effects)
ef <- effect("Myc:N:TRTYEAR", mod)
x <- as.data.frame(ef)
> x
Myc N TRTYEAR fit se lower upper
1 AM Nlow 5 0.04100963 0.04049789 -0.03854476 0.1205640
2 ECM Nlow 5 0.41727928 0.07342289 0.27304676 0.5615118
3 AM Nhigh 5 0.20562700 0.04060572 0.12586080 0.2853932
4 ECM Nhigh 5 0.24754017 0.27647151 -0.29556267 0.7906430
5 AM Nlow 10 0.08913042 0.03751783 0.01543008 0.1628307
6 ECM Nlow 10 0.42211957 0.15631679 0.11504963 0.7291895
7 AM Nhigh 10 0.30411129 0.03615213 0.23309376 0.3751288
8 ECM Nhigh 10 0.29540744 0.76966410 -1.21652689 1.8073418
9 AM Nlow 15 0.13725120 0.06325159 0.01299927 0.2615031
10 ECM Nlow 15 0.42695986 0.27301163 -0.10934636 0.9632661
11 AM Nhigh 15 0.40259559 0.05990085 0.28492587 0.5202653
12 ECM Nhigh 15 0.34327471 1.29676632 -2.20410343 2.8906529
Suggestions for a completely different approach to represent this analysis are welcome. I thought this question was better suited for Stack Overflow because it's about the technicalities in R rather than the statistics behind them. Thanks.
You can represent your model in a variety of different ways. The easiest is to plot the data by the various parameters using different plotting tools (colour, shape, line type, facet), which is what you did with your example except for the random effect site. Model residuals can also be plotted to communicate results. As @MrFlick commented, it depends on what you want to communicate. If you want to add confidence/prediction bands around your estimates, you'll have to dig deeper and consider bigger statistical issues (example1, example2).
Here's an example taking yours just a bit further.
Also, in your comment you said you didn't provide a reproducible example because the data do not belong to you. That doesn't mean you can't provide an example with made-up data. Please consider that for future posts so you can get faster answers.
#Make up data:
tempEf <- data.frame(
N = rep(c("Nlow", "Nhigh"), each=300),
Myc = rep(c("AM", "ECM"), each=150, times=2),
TRTYEAR = runif(600, 1, 15),
site = rep(c("A","B","C","D","E"), each=10, times=12) #5 sites
)
# Make up some response data
tempEf$r <- 2*tempEf$TRTYEAR +
-8*as.numeric(tempEf$Myc=="ECM") +
4*as.numeric(tempEf$N=="Nlow") +
0.1*tempEf$TRTYEAR * as.numeric(tempEf$N=="Nlow") +
0.2*tempEf$TRTYEAR*as.numeric(tempEf$Myc=="ECM") +
-11*as.numeric(tempEf$Myc=="ECM")*as.numeric(tempEf$N=="Nlow")+
0.5*tempEf$TRTYEAR*as.numeric(tempEf$Myc=="ECM")*as.numeric(tempEf$N=="Nlow")+
as.numeric(tempEf$site) + #Random intercepts; intercepts will increase by 1
tempEf$TRTYEAR/10*rnorm(600, mean=0, sd=2) #Add some noise
library(lme4)
model <- lmer(r ~ Myc * N * TRTYEAR + (1|site), data=tempEf)
tempEf$fit <- predict(model) #Add model fits to dataframe
Incidentally, the fitted model recovered the coefficients used to simulate the data quite well:
model
#Linear mixed model fit by REML ['lmerMod']
#Formula: r ~ Myc * N * TRTYEAR + (1 | site)
# Data: tempEf
#REML criterion at convergence: 2461.705
#Random effects:
# Groups Name Std.Dev.
# site (Intercept) 1.684
# Residual 1.825
#Number of obs: 600, groups: site, 5
#Fixed Effects:
# (Intercept) MycECM NNlow
# 3.03411 -7.92755 4.30380
# TRTYEAR MycECM:NNlow MycECM:TRTYEAR
# 1.98889 -11.64218 0.18589
# NNlow:TRTYEAR MycECM:NNlow:TRTYEAR
# 0.07781 0.60224
Adapting your example to show the model outputs overlaid on the data
library(ggplot2)
ggplot(tempEf,aes(TRTYEAR, r, group=interaction(site, Myc), col=site, shape=Myc )) +
facet_grid(~N) +
geom_line(aes(y=fit, lty=Myc), size=0.8) +
geom_point(alpha = 0.3) +
geom_hline(yintercept=0, linetype="dashed") +
theme_bw()
Notice all I did was change your colour from Myc to site, and linetype to Myc.
I hope this example gives some ideas how to visualize your mixed effects model.
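If you do want the confidence bands mentioned above for the fixed-effect predictions, one option is a parametric bootstrap with lme4::bootMer. This is only a rough sketch reusing the simulated model object; the prediction grid, nsim and quantile choices are assumptions and may need adjusting:
newdat <- expand.grid(TRTYEAR = seq(1, 15, length.out = 50),
                      Myc = c("AM", "ECM"),
                      N = c("Nlow", "Nhigh"))
pred_fun <- function(m) predict(m, newdata = newdat, re.form = NA)  # fixed effects only
boo <- bootMer(model, pred_fun, nsim = 200)                          # slow for large nsim
newdat$fit   <- pred_fun(model)
newdat$lower <- apply(boo$t, 2, quantile, 0.025, na.rm = TRUE)
newdat$upper <- apply(boo$t, 2, quantile, 0.975, na.rm = TRUE)
# newdat can then be passed to geom_ribbon()/geom_line() in the ggplot call above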

Plot Kaplan-Meier for Cox regression

I have a Cox proportional hazards model set up using the following code in R that predicts mortality. Covariates A, B and C are added simply to avoid confounding (e.g. age, sex, race), but we are really interested in the predictor X. X is a continuous variable.
cox.model <- coxph(Surv(time, dead) ~ A + B + C + X, data = df)
Now, I'm having trouble plotting a Kaplan-Meier curve for this. I've been searching for how to create this figure but I haven't had much luck. I'm not sure if plotting a Kaplan-Meier curve for a Cox model is even possible. Does the Kaplan-Meier curve adjust for my covariates, or does it not need them?
What I did try is below, but I've been told this isn't right.
plot(survfit(cox.model), xlab = 'Time (years)', ylab = 'Survival Probabilities')
I also tried to plot a figure that shows cumulative hazard of mortality. I don't know if I'm doing it right since I've tried it a few different ways and get different results. Ideally, I would like to plot two lines, one that shows the risk of mortality for the 75th percentile of X and one that shows the 25th percentile of X. How can I do this?
I could list everything else I've tried, but I don't want to confuse anyone!
Many thanks.
Here is an example taken from this paper.
url <- "http://socserv.mcmaster.ca/jfox/Books/Companion/data/Rossi.txt"
Rossi <- read.table(url, header=TRUE)
Rossi[1:5, 1:10]
# week arrest fin age race wexp mar paro prio educ
# 1 20 1 no 27 black no not married yes 3 3
# 2 17 1 no 18 black no not married yes 8 4
# 3 25 1 no 19 other yes not married yes 13 3
# 4 52 0 yes 23 black yes married yes 1 5
# 5 52 0 no 19 other yes not married yes 3 3
mod.allison <- coxph(Surv(week, arrest) ~
fin + age + race + wexp + mar + paro + prio,
data=Rossi)
mod.allison
# Call:
# coxph(formula = Surv(week, arrest) ~ fin + age + race + wexp +
# mar + paro + prio, data = Rossi)
#
#
# coef exp(coef) se(coef) z p
# finyes -0.3794 0.684 0.1914 -1.983 0.0470
# age -0.0574 0.944 0.0220 -2.611 0.0090
# raceother -0.3139 0.731 0.3080 -1.019 0.3100
# wexpyes -0.1498 0.861 0.2122 -0.706 0.4800
# marnot married 0.4337 1.543 0.3819 1.136 0.2600
# paroyes -0.0849 0.919 0.1958 -0.434 0.6600
# prio 0.0915 1.096 0.0286 3.194 0.0014
#
# Likelihood ratio test=33.3 on 7 df, p=2.36e-05 n= 432, number of events= 114
Note that the model uses fin, age, race, wexp, mar, paro and prio to predict arrest. As mentioned in this document, the survfit() function uses the Kaplan-Meier estimate for the survival rate.
plot(survfit(mod.allison), ylim=c(0.7, 1), xlab="Weeks",
ylab="Proportion Not Rearrested")
We get a plot (with a 95% confidence interval) for the survival rate. For the cumulative hazard rate you can do
# plot(survfit(mod.allison)$cumhaz)
but this doesn't give confidence intervals. However, no worries! We know that H(t) = -ln(S(t)) and we have confidence intervals for S(t). All we need to do is
sfit <- survfit(mod.allison)
cumhaz.upper <- -log(sfit$upper)
cumhaz.lower <- -log(sfit$lower)
cumhaz <- sfit$cumhaz # same as -log(sfit$surv)
Then just plot these
plot(cumhaz, xlab="weeks ahead", ylab="cumulative hazard",
ylim=c(min(cumhaz.lower), max(cumhaz.upper)))
lines(cumhaz.lower)
lines(cumhaz.upper)
You'll want to use survfit(..., conf.int=0.50) to get bands for 75% and 25% instead of 97.5% and 2.5%.
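For example, the same cumulative-hazard plot with 50% bands (a short sketch reusing the objects above):
sfit50 <- survfit(mod.allison, conf.int=0.50)
cumhaz50 <- sfit50$cumhaz
cumhaz50.lower <- -log(sfit50$upper)  # high survival bound -> low hazard bound
cumhaz50.upper <- -log(sfit50$lower)
plot(cumhaz50, xlab="weeks ahead", ylab="cumulative hazard",
     ylim=c(min(cumhaz50.lower), max(cumhaz50.upper)))
lines(cumhaz50.lower)
lines(cumhaz50.upper)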
The request for an estimated survival curve at the 25th and 75th percentiles of X first requires determining those percentiles and specifying values for all the other covariates in a data frame to be used as the newdata argument to survfit:
You can use the data suggested by the other respondent from Fox's website, although on my machine it required building a url object:
url <- url("http://socserv.mcmaster.ca/jfox/Books/Companion/data/Rossi.txt")
Rossi <- read.table(url, header=TRUE)
It's probably not the best example for this question, but it does have a numeric variable for which we can calculate the quartiles:
> summary(Rossi$prio)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 1.000 2.000 2.984 4.000 18.000
So this would be the model fit and survfit calls:
mod.allison <- coxph(Surv(week, arrest) ~
fin + age + race + prio ,
data=Rossi)
prio.fit <- survfit(mod.allison,
newdata= data.frame(fin="yes", age=30, race="black", prio=c(1,4) ))
plot(prio.fit, col=c("red","blue"))
Setting the values of the confounders to a fixed value and plotting the predicted survival probabilities at multiple points in time for given values of X (as @IRTFM suggested in his answer) results in a conditional effect estimate. That is not what a standard Kaplan-Meier estimator is used for, and I don't think that is what the original poster wanted. Usually we are interested in average causal effects. In other words: what would the survival probability be if X had been set to some specific value x in the entire sample?
We can obtain this probability using the Cox model that was fit plus g-computation. In g-computation, we set the value of X to x in the entire sample and then use the Cox model to predict the survival probability at t for each individual, using their observed covariate values in the process. Then we simply take the average of those predictions to obtain our final estimate. By repeating this process for a range of points in time and a range of possible values for X, we obtain a three-dimensional survival surface. We can then visualize this surface using color scales.
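As a minimal sketch of that idea using only the survival package (reusing cox.model, df, time and X from the question; the quantile values are just for illustration, and this manually averages the predicted curves rather than using a dedicated package):
library(survival)
g_comp_surv <- function(model, data, x) {
  newdat <- data
  newdat$X <- x                              # set X = x for the whole sample
  sf <- survfit(model, newdata = newdat)     # one predicted curve per individual
  data.frame(time = sf$time,
             surv = rowMeans(sf$surv))       # average over individuals at each time
}
surv25 <- g_comp_surv(cox.model, df, quantile(df$X, 0.25))
surv75 <- g_comp_surv(cox.model, df, quantile(df$X, 0.75))
plot(surv25$time, surv25$surv, type = "s", ylim = c(0, 1),
     xlab = "Time (years)", ylab = "Average survival probability")
lines(surv75$time, surv75$surv, type = "s", col = "red")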
This can be done using the contsurvplot R-package I developed, as discussed in this previous answer: Converting survival analysis by a continuous variable to categorical or in the documentation of the package. More information about this strategy in general can be found in the preprint version of my article on this topic: https://arxiv.org/pdf/2208.04644.pdf


How to calculate the predicted probability of negative binomial regression model?

I use the glm.nb() function in the R MASS package to estimate the parameters of a negative binomial regression model. How can I calculate the predicted probability (probability mass function) given new data, and which R function can I use?
My dataset is as follows. y follows a negative binomial distribution and x is a covariate. I use glm.nb(y ~ x, data=data) to estimate the model parameters. Given new x and y, how can I calculate the predicted probability?
Is there a way to calculate it using Java?
y x
91 1.000000
79 1.000000
86 1.000000
32 1.000000
41 1.000000
29 0.890609
44 1.000000
42 1.000000
31 0.734058
35 1.000000
Let's say you set up your data like this:
library(MASS)  # for glm.nb
set.seed(1)
x = seq(-2, 8, .01)
y = rnbinom(length(x), mu=exp(x), size=10)
fit = glm.nb(y ~ x)
and you have a new point: you want to find the probability of y=100 given x=5.
You can get the predicted value of y from x using predict (with type="response" to tell it you want it after the inverse of the link function has been applied):
predicted.y = predict(fit, newdata=data.frame(x=5), type="response")
Then you could find out the probability with:
dnbinom(100, mu=predicted.y, size=fit$theta)
(This is using fit$theta, the maximum likelihood estimate of the "size" parameter of the negative binomial).
So in one function:
prob = function(newx, newy, fit) {
dnbinom(newy, mu=predict(fit, newdata=data.frame(x=newx), type="response"), size=fit$theta)
}
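For example, using the fit from above, the estimated probability of observing y = 100 at x = 5 is:
prob(5, 100, fit)
# dnbinom is vectorized over counts, so a range of y values works too:
prob(5, 95:105, fit)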

Fitting a function in R

I have a few datapoints (x and y) that seem to have a logarithmic relationship.
> mydata
x y
1 0 123
2 2 116
3 4 113
4 15 100
5 48 87
6 75 84
7 122 77
> qplot(x, y, data=mydata, geom="line")
Now I would like to find an underlying function that fits the graph and allows me to infer other datapoints (i.e. 3 or 82). I read about lm and nls but I'm not getting anywhere really.
At first, I created a function that I thought resembled the plot the most:
f <- function(x, a, b) {
a * exp(b *-x)
}
x <- seq(0:100)
y <- f(seq(0:100), 1,1)
qplot(x,y, geom="line")
Afterwards, I tried to generate a fitting model using nls:
> fit <- nls(y ~ f(x, a, b), data=mydata, start=list(a=1, b=1))
Error in numericDeriv(form[[3]], names(ind), env) :
Missing value or an Infinity produced when evaluating the model
Can someone point me in the right direction on what to do from here?
Follow up
After reading your comments and googling around a bit further I adjusted the starting parameters for a, b and c and then suddenly the model converged.
fit <- nls(y~f(x,a,b,c), data=data.frame(mydata), start=list(a=1, b=30, c=-0.3))
x <- seq(0,120)
fitted.data <- data.frame(x=x, y=predict(fit, list(x=x)))
ggplot(mydata, aes(x, y)) + geom_point(color="red", alpha=.5) + geom_line(alpha=.5) + geom_line(data=fitted.data)
Maybe using a cubic specification for your model and estimating via lm would give you a good fit.
# Importing your data
dataset <- read.table(text='
x y
1 0 123
2 2 116
3 4 113
4 15 100
5 48 87
6 75 84
7 122 77', header=T)
# I think one possible specification would be a cubic linear model
y.hat <- predict(lm(y~x+I(x^2)+I(x^3), data=dataset)) # estimating the model and obtaining the fitted values from the model
qplot(x, y, data=dataset, geom="line") # your plot black lines
last_plot() + geom_line(aes(x=x, y=y.hat), col=2) # the fitted values red lines
# It fits well.
Try taking the log of your response variable and then using lm to fit a linear model:
fit <- lm(log(y) ~ x, data=mydata)
The adjusted R-squared is 0.8486, which at face value isn't bad. You can look at the fit using plot, for example:
plot(fit, which=2)
But perhaps, it's not such a good fit after all:
last_plot() + geom_line(aes(x=x, y=exp(fit$fitted.values)))
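To infer values at new points such as x = 3 or x = 82 (the values mentioned in the question), you can back-transform predictions from this log-scale fit; note that exponentiating returns the fitted median on the original scale rather than the mean:
exp(predict(fit, newdata = data.frame(x = c(3, 82))))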
Check this document out: http://cran.r-project.org/doc/contrib/Ricci-distributions-en.pdf
In brief, first you need to decide on the model to fit to your data (e.g., exponential) and then estimate its parameters.
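For instance, a rough sketch of fitting an exponential decay towards an asymptote with nls, using the mydata from the question; the starting values below are guesses and may need tuning:
fit.exp <- nls(y ~ a * exp(b * x) + c, data = mydata,
               start = list(a = 50, b = -0.05, c = 75))
coef(fit.exp)
predict(fit.exp, newdata = data.frame(x = c(3, 82)))  # infer new datapoints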
Here are some widely used distributions:
http://www.itl.nist.gov/div898/handbook/eda/section3/eda366.htm
