I am trying to apply the expectation-maximization algorithm to estimate missing count data, but the packages in R that I have found, such as missMethods, assume a multivariate Gaussian distribution. How would I apply the expectation-maximization algorithm to estimate missing count data assuming a Poisson distribution?
Say we have data that look like this:
x <- c(100, 96, 79, 109, 111, NA, 93, 95, 119, 90, 121, 96, NA,
NA, 85, 95, 110, 97, 87, 104, 101, 87, 87, NA, 89, NA,
113, NA, 95, NA, 119, 115, NA, 105, NA, 80, 90, 108, 90,
99, 111, 93, 99, NA, 87, 89, 87, 126, 101, 106)
Applying impute_EM using missMethods (missMethods::impute_EM(x, stochastic = FALSE)) gives an answer, but the data are discrete, not continuous, so the imputed values do not make sense as counts.
I understand that questions like these require a minimum, reproducible example, but I honestly do not know where to start. Even suggested reading to point me in the right direction would be helpful.
Defining x0:
x0 <- x[!is.na(x)]
The Jeffreys/reference prior for a Poisson distribution with mean lambda is proportional to 1/sqrt(lambda). Combined with the observed values, this gives lambda a gamma reference posterior with shape parameter sum(x0) + 0.5 and rate parameter length(x0). You could take n samples of lambda with:
lambda <- rgamma(n, sum(x0) + 0.5, length(x0))
Then sample n missing values (xm) with:
xm <- rpois(n, lambda)
Alternatively, since a Gamma-Poisson compound distribution is a negative binomial (if lambda ~ Gamma(shape = a, rate = b), then integrating out lambda gives x ~ Negative-Binomial(size = a, prob = b/(b + 1))), the missing values can be sampled directly:
xm <- rnbinom(n, sum(x0) + 0.5, length(x0)/(length(x0) + 1L))
As a function:
MI_poisson <- function(x, n) {
  x0 <- x[!is.na(x)]  # observed values
  # stack the observed values on top of n columns of negative-binomial draws
  rbind(matrix(x0, ncol = n, nrow = length(x0)),
        matrix(rnbinom(n*(length(x) - length(x0)),
                       sum(x0) + 0.5,
                       length(x0)/(length(x0) + 1L)),
               ncol = n))
}
This will return a matrix with n columns, each containing the observed values of x followed by imputed draws for the NA values (note that the observed values are stacked first, so the original element order is not preserved). Each column can be used separately in further analysis, and the results can then be aggregated.
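As a quick sketch of how the function might be used (with the sample mean as a toy downstream analysis):
set.seed(42)                 # for reproducibility
imp <- MI_poisson(x, n = 5)  # 5 imputed copies of x, one per column
colMeans(imp)                # one estimate of the mean per imputed data set
mean(colMeans(imp))          # simple pooled estimate across the imputations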
We are supposed to find the 90% confidence interval for a 74-year-old man.
x <- c(58, 69, 43, 39, 63, 52, 47, 31, 74, 36)
y <- c(189, 235, 193, 177, 154, 191, 213, 165, 198, 181)
(where x is age and y is cholesterol level)
I used:
correlation <- cor.test(x, y, conf.level = 0.90)
and that gives me this:
data:  x and y
t = 1.2656, df = 8, p-value = 0.2413
alternative hypothesis: true correlation is not equal to 0
90 percent confidence interval:
 -0.1857867  0.7839057
sample estimates:
      cor
0.4084309
When I asked people in my class what values they were getting, all of them told me (203.2717, 205.5591). Where am I going wrong? cor.test is telling me -0.1857867 to 0.7839057.
Also, the next portion of the assignment asks us to calculate a 90% prediction interval for a 74-year-old; how would I do this in RStudio?
Thanks a lot!
df <- data.frame(
  x = c(58, 69, 43, 39, 63, 52, 47, 31, 74, 36),
  y = c(189, 235, 193, 177, 154, 191, 213, 165, 198, 181)
)
predict.lm(
  lm(y ~ x, data = df),
  newdata = data.frame(x = 74),
  interval = "confidence",
  level = 0.90
)
# fit lwr upr
# 1 204.42 178.99 229.85
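For the second part of the question, the same call with interval = "prediction" gives the 90% prediction interval for an individual 74-year-old; it is wider than the confidence interval because it also accounts for the scatter of individual observations around the regression line:
predict.lm(
  lm(y ~ x, data = df),
  newdata = data.frame(x = 74),
  interval = "prediction",
  level = 0.90
)
#      fit    lwr    upr
# 1 204.42 155.68 253.15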
How does one fit a linear regression line to a scatter plot using base R, assuming you already have the summary info from the linear model?
I already have a scatter plot that compares a and ix, and I am trying to add the regression lines lm.a and lm.b to the plot. Should I use abline or something else?
a <- c(21, 23, 25, 27, 29)
ix <- c(100, 300, 500, 600, 750)
ib <- c(0, 1, 0, 1, 1)
x <- data.frame(a, ix, ib)
lm.a <- with(x, lm(a ~ ix + ib + ix*ib))
summary(lm.a)
n1 <- lm.a$coefficients[1]
n2 <- lm.a$coefficients[2]
n3 <- lm.a$coefficients[3]
n4 <- lm.a$coefficients[4]
You almost got it; here is a working example you can adapt:
height <- c(176, 154, 138, 196, 132, 176, 181, 169, 150, 175)
bodymass <- c(82, 49, 53, 112, 47, 69, 77, 71, 62, 78)
plot(bodymass, height)
abline(lm(height ~ bodymass))  # fit with lm(), then draw; base graphics calls are separate statements, not chained with +
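For the interaction model in the question, you can draw one line per level of the binary ib from the fitted coefficients. A sketch along those lines (the colours and plotting symbols are arbitrary choices):
a  <- c(21, 23, 25, 27, 29)
ix <- c(100, 300, 500, 600, 750)
ib <- c(0, 1, 0, 1, 1)
lm.a <- lm(a ~ ix * ib)  # ix*ib already expands to ix + ib + ix:ib

plot(ix, a, pch = 19, col = ib + 1)  # colour the points by group
cf <- coef(lm.a)
abline(a = cf[1],         b = cf[2],         col = 1)  # line for ib = 0
abline(a = cf[1] + cf[3], b = cf[2] + cf[4], col = 2)  # line for ib = 1 (adds the ib and ix:ib terms)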
I have a set of data that I have collected which consists of a time series, where each y-value is found by taking the mean of 30 samples of grape cluster weight.
I want to simulate more data from this, with the same number of x and y values, so that I can carry out some Bayesian analysis to find the posterior distribution of the data.
I have the data, and I know that the growth follows a Gompertz curve with formula:
y = a*exp(-exp(-(x - x0)/b)), with a = 88.8, b = 11.7, and x0 = 15.1.
The data I have is
x = c(0, 28, 36, 42, 50, 58, 63, 71, 79, 85, 92, 99, 106, 112)
y = c(0, 15, 35, 55, 62, 74, 80, 96, 127, 120, 146, 160, 177, 165)
Any help would be appreciated thank you
*Will edit when more information is given.*
I am a little confused by your question. I have compiled what you have written into R. Please elaborate so that I can help you:
gompertz <- function(x, x0, a, b){
  a*exp(-exp(-(x-x0)/b))
}
y = c(0, 15, 35, 55, 62, 74, 80, 96, 127, 120, 146, 160, 177, 165) # means of 30 samples of grape cluster weights?
x = c(0, 28, 36, 42, 50, 58, 63, 71, 79, 85, 92, 99, 106, 112) # ?
#??
gompertz(x, x0 = 15.1, a = 88.8, b = 11.7)
gompertz(y, x0 = 15.1, a = 88.8, b = 11.7)
I have a set of data that I have collected which consists of a time series, where each y-value is found by taking the mean of 30 samples of grape cluster weight.
The growth follows a Gompertz curve with formula y = a*exp(-exp(-(x-x0)/b)), with
a = 88.8
b = 11.7
x0 = 15.1.
The data:
x = c(0, 28, 36, 42, 50, 58, 63, 71, 79, 85, 92, 99, 106, 112)
y = c(0, 15, 35, 55, 62, 74, 80, 96, 127, 120, 146, 160, 177, 165)
x refers to the days from fruit set (i.e. 0 is when the time series starts)
x values correspond to the days on which the measurements are taken (which depend on certain growth stages of the grapes)
I want to simulate more data from this, with the same number of x and y values, so that I can carry out some Bayesian analysis to find the posterior distribution of the data.
Effectively what I need is:
- to simulate data that follow the Gompertz curve, in order to create the posterior distribution (this data would technically stand in for "previous years'" time series data)
- to construct and test the fit of the predictive time-series model based on that distribution
If there is some skeleton code where it is possible to change the parameters around, that could also be very helpful for me.
Thanks
Let's inspect your data
x <- c(0, 28, 36, 42, 50, 58, 63, 71, 79, 85, 92, 99, 106, 112)
y <- c(0, 15, 35, 55, 62, 74, 80, 96, 127, 120, 146, 160, 177, 165)
and fitted Gompertz curve
gFun <- function(x){
  a  <- 88.8
  b  <- 11.7
  x0 <- 15.1
  est <- a*exp(-exp(-(x-x0)/b))
  return(est)
}
by visualisation:
library(ggplot2)
ggData <- data.frame(x = x, y = y)  # ggplot() needs the vectors collected in a data frame
ggplot(ggData, aes(x=x, y=y) ) +
geom_point() +
stat_function(fun=gFun, colour="blue") +
theme_bw()
This doesn't look like a good fit. However, simulating data y|x at the fixed x values in the vector above can be done by adding an error term. I've used a normal distribution with sd = 4 for illustration.
nSim <- 10
simData <- data.frame(x=c(0, rep(x[-1], each=nSim)) )  # keep x = 0 once; replicate the remaining x values nSim times each
simData$y <- gFun(simData$x) + rnorm(n=nrow(simData), sd=4)
ggplot(simData, aes(x=x, y=y) ) +
geom_point(alpha=0.4) +
stat_function(fun=gFun, colour="blue") +
scale_x_continuous(limits=c(0, max(x)) ) +
theme_bw()
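Since the curve with the quoted parameters sits below the data, one option (my suggestion, not something from the question) is to re-estimate a, b and x0 from the observed points with nls(), using the quoted values as starting points, and then simulate from the refitted curve instead:
fit <- nls(y ~ a * exp(-exp(-(x - x0)/b)),
           data  = data.frame(x = x, y = y),
           start = list(a = 88.8, b = 11.7, x0 = 15.1))
summary(fit)  # refitted a, b and x0; tweak the start values if nls fails to converge
gFun2 <- function(x) predict(fit, newdata = data.frame(x = x))  # drop-in replacement for gFun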
I am struggling with some strange behaviour of the quantile function in R.
I have two sets of numeric data, and a custom boxplot stats function (which someone helped me write, so I am actually not too sure about every detail):
sample_lang = c(91, 122, 65, 90, 90, 102,
98, 94, 84, 86, 108, 104,
94, 110, 100, 86, 92, 92,
124, 108, 82, 65, 102, 90, 114,
88, 68, 112, 96, 84, 92,
80, 104, 114, 112, 108, 68,
92, 68, 63, 112, 116)
sample_vocab = c(96, 136, 81, 92, 95,
112, 101, 95, 97, 94,
117, 95, 111, 115, 88,
92, 108, 81, 130, 106,
91, 95, 119, 103, 132, 103,
65, 114, 107, 108, 86,
100, 98, 111, 123, 123, 117,
82, 100, 97, 89, 132, 114)
my.boxplot.stats <- function(x, coef = 1.5, do.conf = TRUE, do.out = TRUE) {
  if (coef < 0)
    stop("'coef' must not be negative")
  nna <- !is.na(x)
  n <- sum(nna)
  #stats <- stats::fivenum(x, na.rm = TRUE)
  stats <- quantile(x, probs = c(0.15, 0.25, 0.5, 0.75, 0.85), na.rm = TRUE)
  iqr <- diff(stats[c(2, 4)])
  if (coef == 0)
    do.out <- FALSE
  else {
    out <- if (!is.na(iqr)) {
      x < (stats[2L] - coef * iqr) | x > (stats[4L] + coef * iqr)
    }
    else !is.finite(x)
    if (any(out[nna], na.rm = TRUE))
      stats[c(1, 5)] <- range(x[!out], na.rm = TRUE)
  }
  conf <- if (do.conf)
    stats[3L] + c(-1.58, 1.58) * iqr/sqrt(n)
  list(stats = stats, n = n, conf = conf,
       out = if (do.out) x[out & nna] else numeric())
}
However, when I call quantile and my.boxplot.stats on the same data, I get different quantile results for the sample_vocab data (the results for the sample_lang data are consistent), and I am not sure why:
> quantile(sample_vocab, probs = c(0.15, 0.25, 0.5, 0.75, 0.85), na.rm=TRUE)
15% 25% 50% 75% 85%
89.6 94.5 101.0 114.0 118.4
>
> my.boxplot.stats(sample_vocab)
$stats
15% 25% 50% 75% 85%
81.0 94.5 101.0 114.0 136.0
Could someone help me understand what is happening? Please note that I am reasonably experienced with programming but have no formal training in R; I am learning on my own.
Thanks so much in advance!
The relevant bit of code is right here:
if (coef == 0)
  do.out <- FALSE
else {
  out <- if (!is.na(iqr)) {
    x < (stats[2L] - coef * iqr) | x > (stats[4L] + coef * iqr)
  }
  else !is.finite(x)
  if (any(out[nna], na.rm = TRUE))
    stats[c(1, 5)] <- range(x[!out], na.rm = TRUE)
}
Basically, if coef != 0 (in your case coef is 1.5, the default), then the first and last elements of the reported quantiles (here the 15% and 85% values) are replaced with the minimum and maximum data values lying within coef * iqr of the 25% and 75% quantiles, where iqr is the distance between those two quantiles.
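You can see this with a quick check on sample_vocab (a sketch re-deriving the numbers above):
stats <- quantile(sample_vocab, probs = c(0.15, 0.25, 0.5, 0.75, 0.85))
iqr   <- diff(stats[c(2, 4)])   # 114.0 - 94.5 = 19.5
lower <- stats[2] - 1.5 * iqr   # 65.25
upper <- stats[4] + 1.5 * iqr   # 143.25
out   <- sample_vocab < lower | sample_vocab > upper
range(sample_vocab[!out])       # 81 136
The value 65 falls just below the lower fence of 65.25, so it is flagged as an outlier and the first reported element becomes the smallest remaining value, 81; nothing exceeds the upper fence, so the last element becomes the maximum, 136. That is why your function reports 81.0 and 136.0 instead of the 15% and 85% quantiles (89.6 and 118.4). For sample_lang, no value falls outside the fences, so its reported quantiles match quantile() exactly.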