I have a set of data that I have collected which consists of a time series, where each y-value is found by taking the mean of 30 samples of grape cluster weight.
The growth follows a Gompertz curve with formula y = a*exp(-exp(-(x-x0)/b)), with
a = 88.8
b = 11.7
x0 = 15.1.
The data:
x = c(0, 28, 36, 42, 50, 58, 63, 71, 79, 85, 92, 99, 106, 112)
y = c(0, 15, 35, 55, 62, 74, 80, 96, 127, 120, 146, 160, 177, 165).
x refers to the days from fruit set (i.e. 0 is when the time series starts)
x values correspond to the days in which the measurements are taken (which depends on certain growth stages of grapes)
I want to simulate more data from this, with the same number of x and y values, so that I can carry out some Bayesian analysis to find the posterior distribution of the data.
Effectively what I need is:
to simulate data which follows the Gompertz curve to create the posterior distribution. This data would technically be for "previous years" time series data.
to construct and test the fit of the predictive time-series model based on the distribution
If there is some skeleton code where it is possible to change around the parameters, then this could potentially be very helpful for me too.
Thanks
Let's inspect your data
x <- c(0, 28, 36, 42, 50, 58, 63, 71, 79, 85, 92, 99, 106, 112)
y <- c(0, 15, 35, 55, 62, 74, 80, 96, 127, 120, 146, 160, 177, 165)
and fitted Gompertz curve
gFun <- function(x){
  a <- 88.8
  b <- 11.7
  x0 <- 15.1
  est <- a*exp(-exp(-(x-x0)/b))
  return(est)
}
by visualisation:
library(ggplot2)
ggData <- data.frame(x=x, y=y) # data frame used for plotting, built from x and y above
ggplot(ggData, aes(x=x, y=y) ) +
  geom_point() +
  stat_function(fun=gFun, colour="blue") +
  theme_bw()
This doesn't look like a good fit. However, simulating data y|x at the fixed x values in the vector above can be done by adding an error term to the curve. I've used a normal distribution with sd=4 for illustration.
nSim <- 10
simData <- data.frame(x=c(0, rep(x[-1], each=nSim)) ) # x[-1] removes 0 from simulation
simData$y <- gFun(simData$x) + rnorm(n=nrow(simData), sd=4)
ggplot(simData, aes(x=x, y=y) ) +
geom_point(alpha=0.4) +
stat_function(fun=gFun, colour="blue") +
scale_x_continuous(limits=c(0, max(x)) ) +
theme_bw()
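Since you asked for skeleton code with changeable parameters, the simulation above could be wrapped roughly as follows. This is only a sketch: the names gompertz and simGompertz, the nYears argument, and the choice of independent normal errors with a constant sd are my assumptions for illustration, not something estimated from your data.

```r
# Gompertz mean curve with the parameters exposed as arguments
gompertz <- function(x, a = 88.8, b = 11.7, x0 = 15.1) {
  a * exp(-exp(-(x - x0) / b))
}

# Hypothetical helper: simulate nYears series at the same x values.
# Assumption: independent normal errors with constant sd.
simGompertz <- function(x, nYears = 5, sd = 4, ...) {
  mu   <- gompertz(x, ...)                     # mean curve at each x
  sims <- replicate(nYears, mu + rnorm(length(x), sd = sd))
  data.frame(year = rep(seq_len(nYears), each = length(x)),
             x    = rep(x, times = nYears),
             y    = as.vector(sims))           # column-major: year 1 first
}

x <- c(0, 28, 36, 42, 50, 58, 63, 71, 79, 85, 92, 99, 106, 112)
set.seed(1)
simYears <- simGompertz(x, nYears = 5, sd = 4)
head(simYears)
```

Other parameter values can be passed through, e.g. simGompertz(x, nYears = 10, sd = 8, a = 170), which makes it easy to try curves that sit closer to the observed points.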
We are supposed to find the 90% confidence interval for a 74 year old man.
x <- c(58, 69, 43, 39, 63, 52, 47, 31, 74, 36)
y <- c(189, 235, 193, 177, 154, 191, 213, 165, 198, 181)
(where x is age and y is cholesterol level)
I used:
correlation <- cor.test(x, y, conf.level = 0.90)
and that gives me this:
data:  x and y
t = 1.2656, df = 8, p-value = 0.2413
alternative hypothesis: true correlation is not equal to 0
90 percent confidence interval:
 -0.1857867  0.7839057
sample estimates:
      cor
0.4084309
When I asked people in my class what values they were getting, all of them told me (203.2717, 205.5591). Where am I going wrong? cor.test is telling me (-0.1857867, 0.7839057).
Also, the next portion of the assignment asks us to calculate a 90% prediction interval for a 74 year old. How would I do this in RStudio?
thanks a lot!
df <- data.frame(
x = c(58, 69, 43, 39, 63, 52, 47, 31, 74, 36),
y = c(189, 235, 193, 177, 154, 191, 213, 165, 198, 181)
)
predict.lm(
lm(y~x, data = df),
newdata = data.frame(x = 74),
interval = "confidence",
level = 0.90
)
# fit lwr upr
# 1 204.42 178.99 229.85
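For the second part of the question, the same call can return a prediction interval by switching the interval argument. Note that cor.test gives an interval for the correlation coefficient, which is why its output is on a completely different scale from cholesterol levels. A minimal sketch, where the object names fit, new, confInt and predInt are just my choices:

```r
df <- data.frame(
  x = c(58, 69, 43, 39, 63, 52, 47, 31, 74, 36),
  y = c(189, 235, 193, 177, 154, 191, 213, 165, 198, 181)
)
fit <- lm(y ~ x, data = df)
new <- data.frame(x = 74)

# 90% interval for the mean cholesterol level at age 74
confInt <- predict(fit, newdata = new, interval = "confidence", level = 0.90)
# 90% interval for a single new 74 year old
predInt <- predict(fit, newdata = new, interval = "prediction", level = 0.90)

confInt
predInt
```

The prediction interval is always wider than the confidence interval at the same x, because it also accounts for the scatter of an individual observation around the regression line.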
I am trying to apply the expectation-maximization algorithm to estimate missing count data but all the packages in R, such as missMethods, assume a multivariate Gaussian distribution. How would I apply the expectation-maximization algorithm to estimate missing count data assuming a Poisson distribution?
Say we have data that look like this:
x <- c(100, 96, 79, 109, 111, NA, 93, 95, 119, 90, 121, 96, NA,
NA, 85, 95, 110, 97, 87, 104, 101, 87, 87, NA, 89, NA,
113, NA, 95, NA, 119, 115, NA, 105, NA, 80, 90, 108, 90,
99, 111, 93, 99, NA, 87, 89, 87, 126, 101, 106)
Applying impute_EM using missMethods (missMethods::impute_EM(x, stochastic = FALSE)) gives an answer but the data are not continuous but discrete.
I understand that questions like these require a minimum, reproducible example, but I honestly do not know where to start. Even suggested reading to point me in the right direction would be helpful.
Defining x0:
x0 <- x[!is.na(x)]
The Jeffreys/reference prior for a Poisson distribution with mean lambda is proportional to 1/sqrt(lambda). From the observed values, this results in lambda having a gamma reference posterior with shape parameter sum(x0) + 0.5 and rate parameter length(x0) (equivalently, scale 1/length(x0)). You could take n samples of lambda with:
lambda <- rgamma(n, sum(x0) + 0.5, length(x0))
Then sample n missing values (xm) with
xm <- rpois(n, lambda)
Alternatively, since a Gamma-Poisson compound distribution can be formulated as a negative binomial (after integrating out lambda):
xm <- rnbinom(n, sum(x0) + 0.5, length(x0)/(length(x0) + 1L))
As a function:
MI_poisson <- function(x, n) {
  x0 <- x[!is.na(x)]
  rbind(matrix(x0, ncol = n, nrow = length(x0)),
        matrix(rnbinom(n*(length(x) - length(x0)), sum(x0) + 0.5, length(x0)/(length(x0) + 1L)), ncol = n))
}
This will return a matrix with n columns, where each column contains the observed values of x followed by imputed values for the NAs (so the row order differs from x). Each column can be used separately in further analysis, and the results can then be aggregated.
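As an illustration of the "analyse each column, then aggregate" step, here is a rough sketch that estimates the Poisson mean from each imputed data set and pools the results with Rubin's rules. The choice of the mean as the quantity of interest, the use of q/N as its within-imputation variance, and names such as imp, q and u are my assumptions for the example.

```r
x <- c(100, 96, 79, 109, 111, NA, 93, 95, 119, 90, 121, 96, NA,
       NA, 85, 95, 110, 97, 87, 104, 101, 87, 87, NA, 89, NA,
       113, NA, 95, NA, 119, 115, NA, 105, NA, 80, 90, 108, 90,
       99, 111, 93, 99, NA, 87, 89, 87, 126, 101, 106)

MI_poisson <- function(x, n) {
  x0 <- x[!is.na(x)]
  rbind(matrix(x0, ncol = n, nrow = length(x0)),
        matrix(rnbinom(n*(length(x) - length(x0)), sum(x0) + 0.5,
                       length(x0)/(length(x0) + 1L)), ncol = n))
}

set.seed(42)
m   <- 20                      # number of imputed data sets
imp <- MI_poisson(x, m)

q <- colMeans(imp)             # estimate from each imputed data set
u <- q / nrow(imp)             # its variance, assuming Poisson data

Q    <- mean(q)                # pooled point estimate
W    <- mean(u)                # within-imputation variance
B    <- var(q)                 # between-imputation variance
Tvar <- W + (1 + 1/m) * B      # total variance (Rubin's rules)

c(estimate = Q, se = sqrt(Tvar))
```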
How does one fit a linear regression line to a scatter plot using base R? Assuming you already have the summary info from the linear model.
I already have a scatter plot that compares a and ix, and I am trying to add the regression lines lm.a and lm.b to the plot. Should I use abline or something else?
a <- c(21, 23, 25, 27, 29)
ix <- c(100, 300, 500, 600, 750)
ib <- c(0, 1, 0, 1, 1)
x <- data.frame(a, ix, ib)
lm.a <- with(x, lm(a ~ ix + ib + ix*ib))
summary(lm.a)
n1 <- lm.a$coefficients[1]
n2 <- lm.a$coefficients[2]
n3 <- lm.a$coefficients[3]
n4 <- lm.a$coefficients[4]
You almost got it, here is a working example you can adapt:
height <- c(176, 154, 138, 196, 132, 176, 181, 169, 150, 175)
bodymass <- c(82, 49, 53, 112, 47, 69, 77, 71, 62, 78)
plot(bodymass, height) # base graphics calls are not chained with +
abline(lm(height ~ bodymass)) # pass the fitted lm object to abline
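abline(lm(...)) only works directly for a model with a single predictor. For the model in the question, which has a dummy variable and an interaction, one line per group can be drawn from the coefficients. A sketch, assuming the scatter plot has ix on the horizontal axis and a on the vertical axis (cf is just my name for the coefficient vector):

```r
a  <- c(21, 23, 25, 27, 29)
ix <- c(100, 300, 500, 600, 750)
ib <- c(0, 1, 0, 1, 1)
x  <- data.frame(a, ix, ib)

lm.a <- lm(a ~ ix * ib, data = x)   # same terms as ix + ib + ix*ib
cf   <- coef(lm.a)                  # (Intercept), ix, ib, ix:ib

plot(a ~ ix, data = x, pch = ifelse(ib == 1, 19, 1))
# line for ib = 0: intercept and slope as estimated
abline(a = cf[["(Intercept)"]], b = cf[["ix"]], lty = 2)
# line for ib = 1: add the dummy and interaction coefficients
abline(a = cf[["(Intercept)"]] + cf[["ib"]],
       b = cf[["ix"]] + cf[["ix:ib"]], lty = 1)
```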
I have a set of data that I have collected which consists of a time series, where each y-value is found by taking the mean of 30 samples of grape cluster weight.
I want to simulate more data from this, with the same number of x and y values, so that I can carry out some Bayesian analysis to find the posterior distribution of the data.
I have the data, and I know that the growth follows a Gompertz curve with formula:
[y = a*exp(-exp(-(x-x0)/b))], with a = 88.8, b = 11.7, and x0 = 15.1.
The data I have is
x = c(0, 28, 36, 42, 50, 58, 63, 71, 79, 85, 92, 99, 106, 112)
y = c(0, 15, 35, 55, 62, 74, 80, 96, 127, 120, 146, 160, 177, 165).
Any help would be appreciated thank you
**Will edit when more information is given**
I am a little confused by your question. I have compiled what you have written into R. Please elaborate for me so that I can help you:
gompertz <- function(x, x0, a, b){
a*exp(-exp(-(x-x0)/b))
}
y = c(0, 15, 35, 55, 62, 74, 80, 96, 127, 120, 146, 160, 177, 165) # means of 30 samples of grape cluster weights?
x = c(0, 28, 36, 42, 50, 58, 63, 71, 79, 85, 92, 99, 106, 112) # ?
#??
gompertz(x, x0 = 15.1, a = 88.8, b = 11.7)
gompertz(y, x0 = 15.1, a = 88.8, b = 11.7)
I am trying to generate a plot derived from my data and a multiple regression model run on these data. I am having trouble getting everything that I need plotted together within one graph however (i.e, raw data points + fitted lines + 95% CI's). In this model, there is a polynomial numeric predictor that interacts with a factor (three levels), so I will have three fitted lines in the plot.
#Here are the data
list<-c(60,75,90,120,180)
x1<-rep(list,each=3,times=3) #predictor 1
x2<-rep(seq(1:3),each=15) #predictor 2
y<-c(72, 63, 58, 56, 50, 52, 47, 48, 51, 41, 47, 44, 38, 34, 36, 92, 93, 88, 76, 76, 74, 72, 67, 78, 56, 71, 65, 53, 56, 60, 93, 73, 77, 96, 79, 81, 84, 79, 86, 80, 76, 75, 69, 61, 63)
df<-data.frame(cbind(x1,x2,y)) #combine vectors into data frame
df$x2<-factor(df$x2) #make x2 a factor
#here is the model
mod<-lm(y~poly(x1,2)*x2,data=df)
I prefer to start from an empty plot and build up from there. I can easily plot the raw data, but I'm not sure how to get the fitted lines and 95% confidence intervals from the model added in here.
#create empty plot
plot(y~x1,xlim=c(min(x1),max(x1)),ylim=c(min(y),max(y)),
type="n",xlab="x1",ylab="y",data=df)
#add data points
points(y[x2==1]~x1[x2==1], cex=1.7,pch=21,lwd=2,bg="gray10", data=df)
points(y[x2==2]~x1[x2==2], cex=1.7,pch=22,lwd=2,bg="gray40", data=df)
points(y[x2==3]~x1[x2==3], cex=1.7,pch=25,lwd=2,bg="gray80", data=df)
Through a lot of research and tinkering, I figured out how to get the fitted lines and 95% CI's using the 'effects' package, but I don't know how to add the raw data to this plot.
library(effects)
plot(allEffects(mod,xlevels=list(x1=min(x1):max(x1)),xlab="x1",ylab="y"),multiline=T,rug=F,ci.style="bands")
That was a lot of writing and code, but I hope what I'm trying to do is clear. Thank you very much in advance for your help.
The predict function handles all the messy calculations with the orthogonal polynomials:
x.one <- seq(min(df$x1), max(df$x1)) # x1 is the numeric predictor; x2 is the factor
lines(x = x.one,
      y = predict(mod, newdata=data.frame(x1=x.one, x2=factor("1", levels=levels(df$x2)) ) ) ,
      col="red")
lines(x = x.one,
      y = predict(mod, newdata=data.frame(x1=x.one, x2=factor("2", levels=levels(df$x2)) ) ) ,
      col="green")
lines(x.one,
      predict(mod, newdata=data.frame(x1=x.one, x2=factor("3", levels=levels(df$x2)) ) ) , col="orange")
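To get the 95% confidence bands on the same base plot, predict can also return the interval limits via interval = "confidence". Below is a sketch that rebuilds the data and draws points, fitted lines and shaded bands for each level of x2. The grid x.seq, the colours and the use of polygon for the bands are my choices, not the only way to do it.

```r
x1 <- rep(c(60, 75, 90, 120, 180), each = 3, times = 3)
x2 <- factor(rep(1:3, each = 15))
y  <- c(72, 63, 58, 56, 50, 52, 47, 48, 51, 41, 47, 44, 38, 34, 36,
        92, 93, 88, 76, 76, 74, 72, 67, 78, 56, 71, 65, 53, 56, 60,
        93, 73, 77, 96, 79, 81, 84, 79, 86, 80, 76, 75, 69, 61, 63)
df  <- data.frame(x1, x2, y)
mod <- lm(y ~ poly(x1, 2) * x2, data = df)

x.seq <- seq(min(df$x1), max(df$x1), length.out = 100)
cols  <- c("red", "green", "orange")
pchs  <- c(21, 22, 25)

# empty plot, then one band + line + points per factor level
plot(y ~ x1, data = df, type = "n", xlab = "x1", ylab = "y")
for (i in seq_along(levels(df$x2))) {
  lev <- levels(df$x2)[i]
  nd  <- data.frame(x1 = x.seq,
                    x2 = factor(lev, levels = levels(df$x2)))
  pr  <- predict(mod, newdata = nd, interval = "confidence", level = 0.95)
  polygon(c(x.seq, rev(x.seq)), c(pr[, "lwr"], rev(pr[, "upr"])),
          col = adjustcolor(cols[i], alpha.f = 0.2), border = NA)
  lines(x.seq, pr[, "fit"], col = cols[i], lwd = 2)
  points(y ~ x1, data = df[df$x2 == lev, ], pch = pchs[i], bg = cols[i])
}
```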