R - plotting a mixed-effect graph by group [duplicate] - r

I am new with mixed effect models and I need your help please.
I have plotted the below graph in ggplot:
ggplot(tempEf,aes(TRTYEAR,CO2effect,group=Myc,col=Myc)) +
facet_grid(~N) +
geom_smooth(method="lm",se=T,size=1) +
geom_point(alpha = 0.3) +
geom_hline(yintercept=0, linetype="dashed") +
theme_bw()
However, I would like to represent a mixed effects model instead of lm in geom_smooth, so I can include SITE as a random effect.
The model would be the following:
library(lme4)
tempEf$TRTYEAR <- as.numeric(tempEf$TRTYEAR)
mod <- lmer(r ~ Myc * N * TRTYEAR + (1|SITE), data=tempEf)
I have included TRTYEAR (year of treatment) because I am also interested in how the effect changes over time, since it may increase or decrease for some groups.
Next is my best attempt so far to extract the plotting variables from the model, but it only extracts the values for TRTYEAR = 5, 10 and 15.
library(effects)
ef <- effect("Myc:N:TRTYEAR", mod)
x <- as.data.frame(ef)
> x
Myc N TRTYEAR fit se lower upper
1 AM Nlow 5 0.04100963 0.04049789 -0.03854476 0.1205640
2 ECM Nlow 5 0.41727928 0.07342289 0.27304676 0.5615118
3 AM Nhigh 5 0.20562700 0.04060572 0.12586080 0.2853932
4 ECM Nhigh 5 0.24754017 0.27647151 -0.29556267 0.7906430
5 AM Nlow 10 0.08913042 0.03751783 0.01543008 0.1628307
6 ECM Nlow 10 0.42211957 0.15631679 0.11504963 0.7291895
7 AM Nhigh 10 0.30411129 0.03615213 0.23309376 0.3751288
8 ECM Nhigh 10 0.29540744 0.76966410 -1.21652689 1.8073418
9 AM Nlow 15 0.13725120 0.06325159 0.01299927 0.2615031
10 ECM Nlow 15 0.42695986 0.27301163 -0.10934636 0.9632661
11 AM Nhigh 15 0.40259559 0.05990085 0.28492587 0.5202653
12 ECM Nhigh 15 0.34327471 1.29676632 -2.20410343 2.8906529
Suggestions to a completely different approach to represent this analysis are welcome. I thought this question is better suited for stackoverflow because it’s about the technicalities in R rather than the statistics behind. Thanks

You can represent your model a variety of different ways. The easiest is to plot data by the various parameters using different plotting tools (color, shape, line type, facet), which is what you did with your example except for the random effect site. Model residuals can also be plotted to communicate results. Like @MrFlick commented, it depends on what you want to communicate. If you want to add confidence/prediction bands around your estimates, you'll have to dig deeper and consider bigger statistical issues (example1, example2).
Here's an example taking yours just a bit further.
Also, in your comment you said you didn't provide a reproducible example because the data do not belong to you. That doesn't mean you can't provide an example out of made up data. Please consider that for future posts so you can get faster answers.
#Make up data:
tempEf <- data.frame(
N = rep(c("Nlow", "Nhigh"), each=300),
Myc = rep(c("AM", "ECM"), each=150, times=2),
TRTYEAR = runif(600, 1, 15),
site = rep(c("A","B","C","D","E"), each=10, times=12) #5 sites
)
# Make up some response data
tempEf$r <- 2*tempEf$TRTYEAR +
-8*as.numeric(tempEf$Myc=="ECM") +
4*as.numeric(tempEf$N=="Nlow") +
0.1*tempEf$TRTYEAR * as.numeric(tempEf$N=="Nlow") +
0.2*tempEf$TRTYEAR*as.numeric(tempEf$Myc=="ECM") +
-11*as.numeric(tempEf$Myc=="ECM")*as.numeric(tempEf$N=="Nlow")+
0.5*tempEf$TRTYEAR*as.numeric(tempEf$Myc=="ECM")*as.numeric(tempEf$N=="Nlow")+
as.numeric(factor(tempEf$site)) + #Random intercepts; intercepts will increase by 1 (factor() is needed because site is a character column in R >= 4.0)
tempEf$TRTYEAR/10*rnorm(600, mean=0, sd=2) #Add some noise
library(lme4)
model <- lmer(r ~ Myc * N * TRTYEAR + (1|site), data=tempEf)
tempEf$fit <- predict(model) #Add model fits to dataframe
Incidentally, the fitted model recovers the coefficients used to simulate the data reasonably well:
model
#Linear mixed model fit by REML ['lmerMod']
#Formula: r ~ Myc * N * TRTYEAR + (1 | site)
# Data: tempEf
#REML criterion at convergence: 2461.705
#Random effects:
# Groups Name Std.Dev.
# site (Intercept) 1.684
# Residual 1.825
#Number of obs: 600, groups: site, 5
#Fixed Effects:
# (Intercept) MycECM NNlow
# 3.03411 -7.92755 4.30380
# TRTYEAR MycECM:NNlow MycECM:TRTYEAR
# 1.98889 -11.64218 0.18589
# NNlow:TRTYEAR MycECM:NNlow:TRTYEAR
# 0.07781 0.60224
Adapting your example to show the model outputs overlaid on the data:
library(ggplot2)
ggplot(tempEf,aes(TRTYEAR, r, group=interaction(site, Myc), col=site, shape=Myc )) +
facet_grid(~N) +
geom_line(aes(y=fit, lty=Myc), size=0.8) +
geom_point(alpha = 0.3) +
geom_hline(yintercept=0, linetype="dashed") +
theme_bw()
Notice all I did was change your colour from Myc to site, and linetype to Myc.
I hope this example gives some ideas how to visualize your mixed effects model.
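If the goal is one fitted line per Myc group, as in the original geom_smooth plot, but with site accounted for, one option is to predict from the fixed effects only. Below is a minimal sketch under stated assumptions: the data are simulated again (the names d, m and grid are made up for illustration), and predict() is called with re.form = NA so that the random site intercepts are left out of the predictions.

```r
library(lme4)
library(ggplot2)

set.seed(1)
# Simulated stand-in for the real data (hypothetical values)
d <- data.frame(
  N       = rep(c("Nlow", "Nhigh"), each = 300),
  Myc     = rep(c("AM", "ECM"), each = 150, times = 2),
  TRTYEAR = runif(600, 1, 15),
  site    = rep(c("A", "B", "C", "D", "E"), each = 10, times = 12)
)
d$r <- 2 * d$TRTYEAR - 8 * (d$Myc == "ECM") +
  as.numeric(factor(d$site)) + rnorm(600)

m <- lmer(r ~ Myc * N * TRTYEAR + (1 | site), data = d)

# Grid of covariate values to predict over (no site column needed)
grid <- expand.grid(
  Myc = c("AM", "ECM"), N = c("Nlow", "Nhigh"),
  TRTYEAR = seq(1, 15, length.out = 50),
  stringsAsFactors = FALSE
)
# re.form = NA drops the random effects: population-level predictions
grid$fit <- predict(m, newdata = grid, re.form = NA)

p <- ggplot(d, aes(TRTYEAR, r, colour = Myc)) +
  facet_grid(~N) +
  geom_point(alpha = 0.3) +
  geom_line(data = grid, aes(y = fit)) +
  theme_bw()
print(p)
```

The lines show the fitted population mean only, without confidence bands.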

Related

Compare treatment effects in three way interaction between two continuous variables and one categorical variable in R

I am trying to run a linear regression model which contains continuous variable A * continuous variable B * categorical variable (treatments with 4 levels). Data can be downloaded here.
Model<-lm(H2O2~Treatment*(A*B), data=mydata)
Now I want to compare different treatment effects.
I know that lstrends can deal with continuous variable * categorical variable in linear model, but it could not work in my situation. I also tried to divide the data based on different treatment groups and created 4 different linear models to compare, that did not work either.
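The equation you're estimating (numbering the coefficients in the order R reports them below, with HC as the reference category) is:
$$
\begin{aligned}
\text{H2O2} = {} & b_1 + b_2\,\text{HF} + b_3\,\text{LF} + b_4\,\text{MF} + b_5 A + b_6 B + b_7 AB \\
& + b_8\,\text{HF}\cdot A + b_9\,\text{LF}\cdot A + b_{10}\,\text{MF}\cdot A + b_{11}\,\text{HF}\cdot B + b_{12}\,\text{LF}\cdot B + b_{13}\,\text{MF}\cdot B \\
& + b_{14}\,\text{HF}\cdot AB + b_{15}\,\text{LF}\cdot AB + b_{16}\,\text{MF}\cdot AB + e
\end{aligned}
$$
where HF, LF and MF are 0/1 indicators for the treatment levels.
There are six different treatment effects in which you could be interested - they comprise the pairwise differences among treatment categories given fixed values of A and B. Three of these are represented by comparisons of estimated categories versus the reference category. For example, to figure out the effect of HF versus HC (the reference), you would calculate:
$$
\text{Effect}_{HF - HC} = b_2 + b_8 A + b_{11} B + b_{14} AB
$$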
Looking at the coefficients from your model:
b <- coef(Model)
b
(Intercept) TreatmentHF TreatmentLF TreatmentMF A B
-1.4318658015 1.5744952961 1.7649475644 -0.6971275663 0.0334782841 0.1528682774
A:B TreatmentHF:A TreatmentLF:A TreatmentMF:A TreatmentHF:B TreatmentLF:B
-0.0022753098 -0.0313728254 -0.0342105088 0.0173173280 -0.1430777577 -0.1214230927
TreatmentMF:B TreatmentHF:A:B TreatmentLF:A:B TreatmentMF:A:B
0.0212295284 0.0025811227 0.0023565223 -0.0007721532
You would want in R, something like
b[2] + b[8]*A + b[11]*B + b[14]*A*B
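As a quick worked example of that expression, here is the HF versus HC effect evaluated at a single point. The four coefficients are copied from the output above; the values A = 10 and B = 5 are hypothetical and chosen only for illustration.

```r
# Coefficients copied from the model output above (R's 1-based positions)
b2  <-  1.5744952961   # TreatmentHF
b8  <- -0.0313728254   # TreatmentHF:A
b11 <- -0.1430777577   # TreatmentHF:B
b14 <-  0.0025811227   # TreatmentHF:A:B

# Hypothetical values of A and B, chosen only for illustration
A <- 10
B <- 5

# Effect of HF relative to HC at this (A, B) combination
effect_HF_HC <- b2 + b8 * A + b11 * B + b14 * A * B
effect_HF_HC
# about 0.674
```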
You would want to substitute in a wide range of combinations of A and B, which you could do by making a sequence of values of each going from the minimum to the maximum, and then crossing them.
a_seq <- seq(min(mydata$A), max(mydata$A), length=25)
b_seq <- seq(min(mydata$B), max(mydata$B), length=25)
eg <- expand.grid(A=a_seq, B=b_seq)
head(eg)
# A B
# 1 5.03000 4.34
# 2 10.01292 4.34
# 3 14.99583 4.34
# 4 19.97875 4.34
# 5 24.96167 4.34
# 6 29.94458 4.34
You could then make the treatment effect in this dataset.
library(dplyr)
eg <- eg %>% mutate(treat_HC_HF = b[2] + b[8]*A + b[11]*B + b[14]*A*B)
Then, you could plot it using a heatmap or similar.
ggplot(eg, aes(x=A, y=B, fill=treat_HC_HF)) +
geom_tile() +
scale_fill_viridis_c() +
theme_classic() +
labs(fill="Treatment\nEffect")
You could do this for the other comparisons as well. There are two things that you don't get from this directly. First, this doesn't tell you anything about where you actually observe A and B. Second, it doesn't tell you whether any of these effects is statistically significant. The first problem you could solve more or less by only plotting those hypothetical values of A and B that are in the convex hull of A and B in the data.
library(geometry)
ch <- convhulln(mydata[,c("A", "B")])
eg <- eg %>%
mutate(inhull = inhulln(ch, cbind(A,B)))
eg %>%
filter(inhull) %>%
ggplot(aes(x=A, y=B, fill=treat_HC_HF)) +
geom_tile() +
scale_fill_viridis_c(limits = c(min(eg$treat_HC_HF), max(eg$treat_HC_HF))) +
theme_classic() +
labs(fill="Treatment\nEffect")
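To calculate whether or not these are significant, you would have to do a bit more work. First, you'd have to get the standard error of each comparison. What you need is a matrix we'll call M that collects the values you multiply the coefficients by to get the treatment effect. So, in the above example, we would have three pieces of information: the coefficients $b_t = (b_2, b_8, b_{11}, b_{14})^\top$, their $4 \times 4$ variance-covariance matrix $V_t$, and the matrix $M$ whose rows are $(1, A, B, AB)$, one row per combination of A and B.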
In R, we could get these with:
b_t <- b[c(2,8,11,14)]
V_t <- vcov(Model)[c(2,8,11,14), c(2,8,11,14)]
M <- cbind(1, eg$A, eg$B, eg$A*eg$B)
Then, we could calculate the standard error of the treatment effect as:
$$
SE = \sqrt{\operatorname{diag}\left(M V_t M^\top\right)}
$$
In R, we could do this and identify which treatment effects are significant (two-tailed 95% test) with:
eg <- eg %>%
mutate(se = sqrt(diag(M %*% V_t %*% t(M))),
sig = abs(treat_HC_HF/se) > qt(0.975, Model$df.residual)) # qt() gives the critical t value; pt() would return a probability
Then we could plot only those effects that are in the convex hull and significant:
eg %>%
filter(inhull, sig) %>%
ggplot(aes(x=A, y=B, fill=treat_HC_HF)) +
geom_tile() +
scale_fill_viridis_c(limits = c(min(eg$treat_HC_HF), max(eg$treat_HC_HF))) +
theme_classic() +
labs(fill="Treatment\nEffect")
You would have to do this for each one of the six paired comparisons of levels of the treatment effects. This seems like a lot of work, but the model, despite the simplicity of estimating it, is quite complicated to interpret.
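One way to cut down the repetition across the six comparisons is to let model.matrix() build the contrast rows instead of picking coefficients by position. The sketch below is only an illustration under assumptions: the data are simulated (sim is a made-up stand-in for mydata), and pair_effect() is a hypothetical helper, not a function from any package. It takes the difference between the design rows for two treatment levels at the same A and B, then applies the same effect and standard error formulas as above.

```r
set.seed(42)
# Simulated stand-in for mydata (hypothetical values)
sim <- data.frame(
  Treatment = factor(rep(c("HC", "HF", "LF", "MF"), each = 30)),
  A = runif(120, 5, 125),
  B = runif(120, 4, 30)
)
sim$H2O2 <- 1 + 0.03 * sim$A + 0.1 * sim$B +
  (sim$Treatment == "HF") * (1 - 0.02 * sim$A) + rnorm(120, sd = 0.5)

fit <- lm(H2O2 ~ Treatment * (A * B), data = sim)

# Hypothetical helper: effect of lev1 versus lev2 over a grid of A and B
pair_effect <- function(fit, grid, lev1, lev2) {
  tt <- delete.response(terms(fit))
  lv <- fit$xlevels$Treatment
  design <- function(lev) {
    nd <- grid
    nd$Treatment <- factor(lev, levels = lv)
    model.matrix(tt, nd, xlev = fit$xlevels)
  }
  D   <- design(lev1) - design(lev2)          # contrast rows
  est <- as.numeric(D %*% coef(fit))          # estimated difference
  se  <- sqrt(rowSums((D %*% vcov(fit)) * D)) # same as sqrt(diag(D V D'))
  data.frame(grid, effect = est, se = se,
             sig = abs(est / se) > qt(0.975, df.residual(fit)))
}

grid <- expand.grid(A = seq(5, 125, length.out = 5),
                    B = seq(4, 30, length.out = 5))
hf_hc <- pair_effect(fit, grid, "HF", "HC")
lf_mf <- pair_effect(fit, grid, "LF", "MF")
head(hf_hc)
```

The rowSums() form avoids building the full square matrix that diag(M %*% V %*% t(M)) creates, which matters once the grid gets large.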

fitting non linear function to data : singular gradient issue

I am trying to fit data to a non linear model, but I am getting "singular gradient" message when I build the model.
here is the data:
> astrodata
temperature intensity
1 277.15 121
2 282.15 131
3 287.15 153
4 292.15 202
5 297.15 311
The function:
y= a * exp(-b * temperature) + c
What I did so far:
> temperature <- astrodata$temperature
temperature
[1] 277.15 282.15 287.15 292.15 297.15
> intensity <- astrodata$intensity
> c.0 <- min(temperature)*0.5
> c.0 <- min(intensity)*0.5
> model.0 <- lm(log(intensity - c.0) ~ temperature, data=astrodata)
> start <- list(a=exp(coef(model.0)[1]), b=coef(model.0)[2], c=c.0)
>
> model <- nls(intensity ~ a * exp(-b * temperature) + c, data = astrodata, start = start)
Error in nls(intensity ~ a * exp(b * temperature) + c, data = astrodata, :
singular gradient
Does anybody have an idea how to solve this?
The model is linear in a and c and only nonlinear in b. That suggests we try the "plinear" algorithm. It has the advantage that only the non-linear parameters require starting values.
Note that the formula specification for that algorithm is different and has a RHS which is a matrix with one column per linear parameter.
model <- nls(intensity ~ cbind(exp(-b * temperature), 1), data = astrodata,
start = start["b"], algorithm = "plinear")
giving:
> model
Nonlinear regression model
model: intensity ~ cbind(exp(-b * temperature), 1)
data: astrodata
b .lin1 .lin2
-1.598e-01 4.728e-19 1.129e+02
residual sum-of-squares: 0.003853
Number of iterations to convergence: 5
Achieved convergence tolerance: 2.594e-07
Also:
plot(intensity ~ temperature, astrodata)
lines(fitted(model) ~ temperature, astrodata)
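With the "plinear" algorithm the linear parameters come back under the names .lin1 and .lin2 rather than a and c, so it can help to map them back before predicting at new temperatures. A small sketch, assuming the same data and the same starting value for b as in the question; predict_intensity() is a hypothetical helper written for this example.

```r
astrodata <- data.frame(
  temperature = c(277.15, 282.15, 287.15, 292.15, 297.15),
  intensity   = c(121, 131, 153, 202, 311)
)

# Starting value for b, derived as in the question
c.0 <- min(astrodata$intensity) * 0.5
model.0 <- lm(log(intensity - c.0) ~ temperature, data = astrodata)

fit <- nls(intensity ~ cbind(exp(-b * temperature), 1), data = astrodata,
           start = list(b = coef(model.0)[[2]]), algorithm = "plinear")

co <- coef(fit)   # named b, .lin1 (plays the role of a), .lin2 (plays the role of c)

# Hypothetical helper: evaluate a * exp(-b * temp) + c with the fitted values
predict_intensity <- function(temp) {
  co[[".lin1"]] * exp(-co[["b"]] * temp) + co[[".lin2"]]
}

predict_intensity(c(280, 290, 300))
```

Note that 300 lies just outside the observed temperature range, so that last value is an extrapolation.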
Note: Based on the comment below you don't really need an nls model and it may be good enough to just use geom_line
p <- ggplot(astrodata, aes(temperature, intensity)) + geom_point()
p + geom_line()
or splines:
p + geom_line(data = data.frame(spline(temperature, intensity)), aes(x, y))
Your data isn't varied enough.
nls uses least squares to work. This is a measurement of the distance between the model and the data points. If there is no distance, nls doesn't work. Your model fits the data exactly, this is called "zero-residual" data. Hence
singular gradient matrix at initial parameter estimates.
It's an overly complicated error message that simply means "There is no error to measure."
You only have 5 (x,y) combos, so this error is almost guaranteed using non-linear analysis with so little data. Use different data or more data.
One possibility is to double each data point, adding very tiny variations to the doubled data like so:
temperature intensity
1 277.15 121
2 282.15 131
3 287.15 153
4 292.15 202
5 297.15 311
11 277.15000001 121.000001
12 282.15000001 131.000001
13 287.15000001 153.000001
14 292.15000001 202.000001
15 297.15000001 311.000001
In the original data set, each point effectively has the same weight of 1.0, and in the "doubled" data set again each point effectively has the same weight of 2.0 so you get the same fitted parameter values but no error.

Plot Kaplan-Meier for Cox regression

I have a Cox proportional hazards model set up using the following code in R that predicts mortality. Covariates A, B and C are added simply to avoid confounding (i.e. age, sex, race) but we are really interested in the predictor X. X is a continuous variable.
cox.model <- coxph(Surv(time, dead) ~ A + B + C + X, data = df)
Now, I'm having trouble plotting a Kaplan-Meier curve for this. I've been searching on how to create this figure but I haven't had much luck. I'm not sure if plotting a Kaplan-Meier for a Cox model is possible? Does the Kaplan-Meier adjust for my covariates or does it not need them?
What I did try is below, but I've been told this isn't right.
plot(survfit(cox.model), xlab = 'Time (years)', ylab = 'Survival Probabilities')
I also tried to plot a figure that shows cumulative hazard of mortality. I don't know if I'm doing it right since I've tried it a few different ways and get different results. Ideally, I would like to plot two lines, one that shows the risk of mortality for the 75th percentile of X and one that shows the 25th percentile of X. How can I do this?
I could list everything else I've tried, but I don't want to confuse anyone!
Many thanks.
Here is an example taken from this paper.
url <- "http://socserv.mcmaster.ca/jfox/Books/Companion/data/Rossi.txt"
Rossi <- read.table(url, header=TRUE)
Rossi[1:5, 1:10]
# week arrest fin age race wexp mar paro prio educ
# 1 20 1 no 27 black no not married yes 3 3
# 2 17 1 no 18 black no not married yes 8 4
# 3 25 1 no 19 other yes not married yes 13 3
# 4 52 0 yes 23 black yes married yes 1 5
# 5 52 0 no 19 other yes not married yes 3 3
library(survival)
mod.allison <- coxph(Surv(week, arrest) ~
fin + age + race + wexp + mar + paro + prio,
data=Rossi)
mod.allison
# Call:
# coxph(formula = Surv(week, arrest) ~ fin + age + race + wexp +
# mar + paro + prio, data = Rossi)
#
#
# coef exp(coef) se(coef) z p
# finyes -0.3794 0.684 0.1914 -1.983 0.0470
# age -0.0574 0.944 0.0220 -2.611 0.0090
# raceother -0.3139 0.731 0.3080 -1.019 0.3100
# wexpyes -0.1498 0.861 0.2122 -0.706 0.4800
# marnot married 0.4337 1.543 0.3819 1.136 0.2600
# paroyes -0.0849 0.919 0.1958 -0.434 0.6600
# prio 0.0915 1.096 0.0286 3.194 0.0014
#
# Likelihood ratio test=33.3 on 7 df, p=2.36e-05 n= 432, number of events= 114
Note that the model uses fin, age, race, wexp, mar, paro, prio to predict arrest. As mentioned in this document the survfit() function uses the Kaplan-Meier estimate for the survival rate.
plot(survfit(mod.allison), ylim=c(0.7, 1), xlab="Weeks",
ylab="Proportion Not Rearrested")
We get a plot (with a 95% confidence interval) for the survival rate. For the cumulative hazard rate you can do
# plot(survfit(mod.allison)$cumhaz)
but this doesn't give confidence intervals. However, no worries! We know that H(t) = -ln(S(t)) and we have confidence intervals for S(t). All we need to do is
sfit <- survfit(mod.allison)
cumhaz.upper <- -log(sfit$upper)
cumhaz.lower <- -log(sfit$lower)
cumhaz <- sfit$cumhaz # same as -log(sfit$surv)
Then just plot these
plot(cumhaz, xlab="weeks ahead", ylab="cumulative hazard",
ylim=c(min(cumhaz.lower), max(cumhaz.upper)))
lines(cumhaz.lower)
lines(cumhaz.upper)
You'll want to use survfit(..., conf.int=0.50) to get bands for 75% and 25% instead of 97.5% and 2.5%.
The request for estimated survival curves at the 25th and 75th percentiles of X first requires determining those percentiles and specifying values for all the other covariates in a data frame to be used as the newdata argument to survfit:
We can use the data suggested by the other respondent from Fox's website, although on my machine it required building a url object:
url <- url("http://socserv.mcmaster.ca/jfox/Books/Companion/data/Rossi.txt")
Rossi <- read.table(url, header=TRUE)
It's probably not the best example for this question, but it does have a numeric variable for which we can calculate the quartiles:
> summary(Rossi$prio)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.000 1.000 2.000 2.984 4.000 18.000
So this would be the model fit and survfit calls:
mod.allison <- coxph(Surv(week, arrest) ~
fin + age + race + prio ,
data=Rossi)
prio.fit <- survfit(mod.allison,
newdata= data.frame(fin="yes", age=30, race="black", prio=c(1,4) ))
plot(prio.fit, col=c("red","blue"))
Setting the values of the confounders to a fixed value and plotting the predicted survival probabilities at multiple points in time for given values of X (as @IRTFM suggested in his answer), results in a conditional effect estimate. That is not what a standard Kaplan-Meier estimator is used for and I don't think that is what the original poster wanted. Usually we are interested in average causal effects. In other words: What would the survival probability be if X had been set to some specific value x in the entire sample?
We can obtain this probability using the cox-model that was fit plus g-computation. In g-computation, we set the value of X to x in the entire sample and then use the cox model to predict the survival probability at t for each individual, using their observed covariate values in the process. Then we simply take the average of those predictions to obtain our final estimate. By repeating this process for a range of points in time and a range of possible values for X, we obtain a three-dimensional survival surface. We can then visualize this surface using color scales.
This can be done using the contsurvplot R-package I developed, as discussed in this previous answer: Converting survival analysis by a continuous variable to categorical or in the documentation of the package. More information about this strategy in general can be found in the preprint version of my article on this topic: https://arxiv.org/pdf/2208.04644.pdf
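The g-computation steps described above can also be sketched by hand with the survival package alone. This is only an illustration under assumptions, not the contsurvplot implementation: it uses the lung data shipped with survival, treats ph.karno as the continuous X, and g_surv() is a hypothetical helper written for this example.

```r
library(survival)

# Complete cases of the variables used (lung ships with the survival package)
d <- na.omit(lung[, c("time", "status", "age", "sex", "ph.karno")])
fit <- coxph(Surv(time, status) ~ age + sex + ph.karno, data = d)

# Hypothetical helper: set X to one value for everyone, predict each person's
# survival curve from the Cox model, then average the curves over the sample
g_surv <- function(fit, data, xname, xval) {
  data[[xname]] <- xval
  sf <- survfit(fit, newdata = data)   # one column of sf$surv per individual
  data.frame(time = sf$time, surv = rowMeans(sf$surv))
}

q  <- quantile(d$ph.karno, c(0.25, 0.75))
lo <- g_surv(fit, d, "ph.karno", q[[1]])
hi <- g_surv(fit, d, "ph.karno", q[[2]])

plot(lo$time, lo$surv, type = "s", col = "red", ylim = c(0, 1),
     xlab = "Time (days)", ylab = "Average predicted survival")
lines(hi$time, hi$surv, type = "s", col = "blue")
legend("topright", legend = c("X at 25th percentile", "X at 75th percentile"),
       col = c("red", "blue"), lty = 1)
```

Unlike the conditional curves above, each line here averages over the observed age and sex values instead of fixing them at one chosen profile.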

Fitting a function in R

I have a few datapoints (x and y) that seem to have a logarithmic relationship.
> mydata
x y
1 0 123
2 2 116
3 4 113
4 15 100
5 48 87
6 75 84
7 122 77
> qplot(x, y, data=mydata, geom="line")
Now I would like to find an underlying function that fits the graph and allows me to infer other datapoints (i.e. 3 or 82). I read about lm and nls but I'm not getting anywhere really.
At first, I created a function of which I thought it resembled the plot the most:
f <- function(x, a, b) {
a * exp(b *-x)
}
x <- seq(0:100)
y <- f(seq(0:100), 1,1)
qplot(x,y, geom="line")
Afterwards, I tried to generate a fitting model using nls:
> fit <- nls(y ~ f(x, a, b), data=mydata, start=list(a=1, b=1))
Error in numericDeriv(form[[3]], names(ind), env) :
Missing value or an Infinity produced when evaluating the model
Can someone point me in the right direction on what to do from here?
Follow up
After reading your comments and googling around a bit further I adjusted the starting parameters for a, b and c and then suddenly the model converged.
fit <- nls(y~f(x,a,b,c), data=data.frame(mydata), start=list(a=1, b=30, c=-0.3))
x <- seq(0,120)
fitted.data <- data.frame(x=x, y=predict(fit, list(x=x)))
ggplot(mydata, aes(x, y)) + geom_point(color="red", alpha=.5) + geom_line(alpha=.5) + geom_line(data=fitted.data)
Maybe using a cubic specification for your model and estimating via lm would give you a good fit.
# Importing your data
dataset <- read.table(text='
x y
1 0 123
2 2 116
3 4 113
4 15 100
5 48 87
6 75 84
7 122 77', header=T)
# I think one possible specification would be a cubic linear model
y.hat <- predict(lm(y~x+I(x^2)+I(x^3), data=dataset)) # estimating the model and obtaining the fitted values from the model
qplot(x, y, data=dataset, geom="line") # your plot black lines
last_plot() + geom_line(aes(x=x, y=y.hat), col=2) # the fitted values red lines
# It fits well.
Try taking the log of your response variable and then using lm to fit a linear model:
fit <- lm(log(y) ~ x, data=mydata)
The adjusted R-squared is 0.8486, which at face value isn't bad. You can look at the fit using plot, for example:
plot(fit, which=2)
But perhaps, it's not such a good fit after all:
last_plot() + geom_line(aes(x=x, y=exp(fit$fitted.values)))
Check this document out: http://cran.r-project.org/doc/contrib/Ricci-distributions-en.pdf
In brief, first you need to decide on the model to fit onto your data (e.g., exponential) and then estimate its parameters.
Here are some widely used distributions:
http://www.itl.nist.gov/div898/handbook/eda/section3/eda366.htm
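Since the question describes the relationship as logarithmic, one simple candidate is a straight line in log(x + 1), which lm can fit without starting values. This is a sketch under that assumption only: the log(x + 1) form is a choice made here (the + 1 keeps x = 0 usable), not something established by the data, and the object names are made up for illustration.

```r
mydata <- data.frame(
  x = c(0, 2, 4, 15, 48, 75, 122),
  y = c(123, 116, 113, 100, 87, 84, 77)
)

# Assumed model form: y = intercept + slope * log(x + 1)
fit_log <- lm(y ~ log(x + 1), data = mydata)
summary(fit_log)$r.squared

# Infer y at the x values mentioned in the question
new_x <- data.frame(x = c(3, 82))
pred <- predict(fit_log, newdata = new_x)
pred

# Fitted curve over the observed range, drawn on top of the data
curve_df <- data.frame(x = 0:122)
curve_df$y <- predict(fit_log, newdata = curve_df)
plot(y ~ x, data = mydata)
lines(y ~ x, data = curve_df, col = "red")
```

If the curve misses the points systematically, that is a sign to try a different functional form rather than to tune this one.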
