How to calculate percentile in R?

In the R help page, help(quantile), you can see:
Type 7: m = 1 - p. p[k] = (k - 1) / (n - 1). In this case, p[k] = mode[F(x[k])]. This is used by S.
Now, here is an example:
w<-c(75,64,47.4,66.9,62.2,62.2,58.7,63.5,66.6,64,57,69,56.9,50,72.0)
sort(w)
[1] 47.4 50.0 56.9 57.0 58.7 62.2 62.2 63.5 64.0 64.0 66.6 66.9 69.0 72.0 75.0
quantile(w)
0% 25% 50% 75% 100%
47.40 57.85 63.50 66.75 75.00
How can you use the type 7 formula to get the result?

I'm having some trouble deciding if the answer is just:
> quantile(w, type=7)
0% 25% 50% 75% 100%
47.40 57.85 63.50 66.75 75.00
My problem is that the default for quantile is type=7 and you already have that result. If you look at the code for quantile.default there is a section for type=7:
index <- 1 + (n - 1) * probs
lo <- floor(index)
hi <- ceiling(index)
x <- sort(x, partial = unique(c(lo, hi)))
qs <- x[lo]
i <- which(index > lo)
h <- (index - lo)[i]
qs[i] <- (1 - h) * qs[i] + h * x[hi[i]]
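For example, you can reproduce the 25% value (57.85) by hand from that code and the sorted data:
x <- sort(w)
n <- length(x)              # 15
p <- 0.25
index <- 1 + (n - 1) * p    # 4.5
lo <- floor(index)          # 4  -> x[4] = 57.0
hi <- ceiling(index)        # 5  -> x[5] = 58.7
h <- index - lo             # 0.5
(1 - h) * x[lo] + h * x[hi]
# [1] 57.85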

Related

Estimating the standard deviation from mean and confidence intervals with a gamma distribution in R

I have the following problem I'd like to solve in R and apply to a larger workflow. I need to estimate the standard deviation from a gamma distribution where the mean and 95% confidence intervals are known.
state = c("group1", "group2", "group3")
mean = c(0.970, 0.694, 0.988)
lowers = c(0.527, 0.381, 0.536)
uppers = c(1.87, 1.37, 1.90)
df = data.frame(state = state, mean = mean, lower = lowers, upper = uppers)
Using Excel and the Solver tool, I can adjust the standard deviation to minimize the sum of squared differences between the target 2.5% (lowers) and 97.5% (uppers) percentiles of the distribution and the actuals. The challenge is that this needs to be scaled up to a rather large set of data and operationalized in my R data-frame workflow. Any ideas how to solve this?
I think this problem is ultimately an optimization problem, dealing with one row of data at a time. Since you want to scale it, though, here's an approximation for finding the distribution's core parameters.
This process is not an optimization: it sweeps a defined range of possible k (shape) parameters and finds the shape/scale combination (given your mean) whose quantiles most closely match your upper and lower quantiles. You control the granularity of k, which is as close as you're going to get to a tolerance (which would be appropriate for a true optimization).
As such, this process will be imperfect. I hope it gets you a fast-enough process for good-enough accuracy.
I'm going to first demonstrate a process that operates one row at a time, as guesser1. I'll then expand it to do the same operation on an arbitrary number of mean, lower, and upper.
Data with Known Answers
But first, I want to generate my own samples so that we have known "truth" for this guesser.
library(dplyr)
library(purrr)    # for map2() and map_dbl()
library(tibble)   # for tibble()
set.seed(42)
n <- 4
randks <- tibble(
k = sample(1:10, size = n, replace = TRUE),
scale = sample(seq(0.5, 2.5, by = 0.5), size = n, replace = TRUE)
) %>%
mutate(
samp = map2(k, scale, ~ rgamma(1000, shape = .x, scale = .y)),
theor_mu = k*scale,
mu = map_dbl(samp, ~ mean(.x)),
lwr = map_dbl(samp, ~ quantile(.x, 0.025)),
upr = map_dbl(samp, ~ quantile(.x, 0.975))
) %>%
select(-samp)
randks
# # A tibble: 4 x 6
# k scale theor_mu mu lwr upr
# <int> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 10 2 20 19.9 9.47 33.7
# 2 10 1.5 15 15.1 7.36 25.0
# 3 3 2 6 5.85 1.08 14.5
# 4 9 0.5 4.5 4.51 1.99 7.72
Guesser1
Single row at a time:
guesser1 <- function(mu, lwr, upr, k.max = 10, k.by = 0.01) {
stopifnot(length(mu) == 1, length(lwr) == 1, length(upr) == 1)
ks <- seq(0, k.max, by = k.by)[-1]
L <- sapply(ks, function(k) qgamma(0.025, shape = k, scale = mu / k))
U <- sapply(ks, function(k) qgamma(0.975, shape = k, scale = mu / k))
dists <- sqrt((L-lwr)^2 + (U-upr)^2)
ind <- which.min(dists)
data.frame(
k = ks[ind],
scale = mu / ks[ind],
dist = min(dists),
lwr = L[ind],
upr = U[ind]
)
}
In action:
out1 <- do.call(rbind, Map(guesser1, randks$mu, randks$lwr, randks$upr))
cbind(subset(randks, select = -theor_mu), out1)
# k scale mu lwr upr k scale dist lwr upr
# 1 10 2.0 19.88 9.47 33.67 10.00 1.988 0.304 9.53 33.97
# 2 10 1.5 15.06 7.36 25.02 10.00 1.506 0.727 7.22 25.73
# 3 3 2.0 5.85 1.08 14.50 2.76 2.120 0.020 1.10 14.50
# 4 9 0.5 4.51 1.99 7.72 9.55 0.472 0.142 2.12 7.79
### \____ randks __________/ \____ guessed ____________/
There are certainly some differences, underscoring my original assertion that this is imperfect.
Guessers
All rows at once. This is a little more work in the function, since it deals with matrices instead of just vectors. Not a problem, I just wanted to prove it one-at-a-time before going for the gusto.
guessers <- function(mu, lwr, upr, k.max = 10, k.by = 0.01, include.size = FALSE) {
stopifnot(length(mu) == length(lwr), length(mu) == length(upr))
# count <- length(mu)
ks <- seq(0, k.max, by = k.by)[-1]
# 'ks' dims: [mu]
L <- outer(mu, ks, function(m, k) qgamma(0.025, shape = k, scale = m / k))
U <- outer(mu, ks, function(m, k) qgamma(0.975, shape = k, scale = m / k))
# 'L' & 'U' dims: [mu, ks]
dists <- sqrt((L - lwr)^2 + (U - upr)^2)
inds <- apply(dists, 1, which.min)
mindists <- apply(dists, 1, min)
i <- seq_along(mu)
out <- data.frame(
k = ks[inds],
scale = mu / ks[inds],
dist = mindists,
lwr = L[cbind(i, inds)],
upr = U[cbind(i, inds)]
)
size <- if (include.size) {
message("guessers memory: ",
object.size(list(ks, L, U, dists, inds, mindists, i, out)))
}
out
}
In action:
outs <- guessers(randks$mu, randks$lwr, randks$upr, include.size = TRUE)
# guessers memory: 106400
cbind(subset(randks, select = -theor_mu), outs)
# k scale mu lwr upr k scale dist lwr upr
# 1 10 2.0 19.88 9.47 33.67 10.00 1.988 0.304 9.53 33.97
# 2 10 1.5 15.06 7.36 25.02 10.00 1.506 0.727 7.22 25.73
# 3 3 2.0 5.85 1.08 14.50 2.76 2.120 0.020 1.10 14.50
# 4 9 0.5 4.51 1.99 7.72 9.55 0.472 0.142 2.12 7.79
### \____ randks __________/ \____ guessed (same) _____/
(I included a memory message in there just to track how much this can scale. It's not bad now, and that argument should definitely not be used in production. FYI.)
Comparison
Using microbenchmark, we repeat each operation a few times and compare their run times.
microbenchmark::microbenchmark(
g1 = Map(guesser1, randks$mu, randks$lwr, randks$upr),
gs = guessers(randks$mu, randks$lwr, randks$upr)
)
# Unit: milliseconds
# expr min lq mean median uq max neval
# g1 27.3 28.8 33.9 29.7 33.0 131.1 100
# gs 13.3 13.6 14.4 13.8 14.6 20.3 100
Not too surprisingly, the all-at-once guessers is a bit faster. When will this not be the case? When the number of rows gets so big that memory consumption becomes a problem. I don't know where that threshold is.
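If memory ever does become the limiting factor, one workaround (a sketch of mine, not benchmarked in this answer) is to run guessers on chunks of rows and bind the pieces back together:
chunk <- ceiling(seq_len(nrow(randks)) / 500)   # 500 rows per chunk, an arbitrary choice
out <- do.call(rbind, lapply(split(seq_len(nrow(randks)), chunk), function(i)
  guessers(randks$mu[i], randks$lwr[i], randks$upr[i])))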
Let's try the same thing, but increasing randks from 4 rows to 1000 and repeating the benchmark.
n <- 1000
# randks <- ...
nrow(randks)
# [1] 1000
microbenchmark::microbenchmark(
g1 = Map(guesser1, randks$mu, randks$lwr, randks$upr),
gs = guessers(randks$mu, randks$lwr, randks$upr),
times = 10
)
# Unit: seconds
# expr min lq mean median uq max neval
# g1 8.50 8.99 9.59 9.31 9.64 11.72 10
# gs 3.35 3.44 3.61 3.63 3.77 3.93 10
So it's definitely faster. The median run time for 1000 estimations is 3.63 seconds, which works out to roughly 275 rows per second.
guessers memory: 24066176
(About 24 MB.) Actually, that doesn't seem bad at all. Decrease k.by to increase your accuracy, or increase k.by to speed this up.
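To fold the guessed parameters back into a data-frame workflow like the one in the question, one option (a sketch, assuming df is the data frame with mean, lower and upper columns built from the question's vectors) is:
library(dplyr)
df_est <- bind_cols(df, guessers(df$mean, df$lower, df$upper))
df_est   # original columns plus the guessed k, scale, dist, lwr, upr per row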

Making a matrix from lsmeans contrasts return

To create the data frame:
library(lme4)      # for glmer()
library(lsmeans)   # for lsmeans()
num <- sample(1:25, 20)
x <- data.frame("Day_eclosion" = num,
                "Developmental" = c("AP", "MA", "JU", "L"),
                "Replicate" = 1:5)
model <- glmer(Day_eclosion ~ Developmental + (1 | Replicate),
               family = "poisson", data = x)
I get this return from:
a <- lsmeans(model, pairwise~Developmental, adjust = "tukey")
a$contrasts
contrast estimate SE df z.ratio p.value
AP - JU 0.2051 0.0168 Inf 12.172 <.0001
AP - L 0.3009 0.0212 Inf 14.164 <.0001
AP - MA 0.3889 0.0209 Inf 18.631 <.0001
JU - L 0.0958 0.0182 Inf 5.265 <.0001
JU - MA 0.1839 0.0177 Inf 10.387 <.0001
L - MA 0.0881 0.0222 Inf 3.964 0.0004
I am looking for a simple way to turn this output (just p values) into:
AP MA JU L
AP - <.0001 <.0001 <.0001
MA - - <.0001 0.0004
JU - - - <.0001
L - - -
I have about 20 sets of these that I need to turn into tables, so the simpler and more general the better.
Bonus points if the output is tab-deliminated, etc, so that I can easily paste into word/excel.
Thanks!
Here's a function that works...
pvmat = function(emm, ...) {
emm = update(emm, by = NULL) # need to work harder otherwise
pv = test(pairs(emm, reverse = TRUE, ...)) $ p.value
fmtpv = sprintf("%6.4f", pv)
fmtpv[pv < 0.0001] = "<.0001"
lbls = do.call(paste, emm@grid[emm@misc$pri.vars])
n = length(lbls)
mat = matrix("", nrow = n, ncol = n, dimnames = list(lbls, lbls))
mat[upper.tri(mat)] = fmtpv
idx = seq_len(n - 1)
mat[idx, 1 + idx] # trim off last row and 1st col
}
Illustration:
require(emmeans)
> warp.lm = lm(breaks ~ wool * tension, data = warpbreaks)
> warp.emm = emmeans(warp.lm, ~ wool * tension)
> warp.emm
wool tension emmean SE df lower.CL upper.CL
A L 44.6 3.65 48 37.2 51.9
B L 28.2 3.65 48 20.9 35.6
A M 24.0 3.65 48 16.7 31.3
B M 28.8 3.65 48 21.4 36.1
A H 24.6 3.65 48 17.2 31.9
B H 18.8 3.65 48 11.4 26.1
Confidence level used: 0.95
> pm = pvmat(warp.emm, adjust = "none")
> print(pm, quote=FALSE)
B L A M B M A H B H
A L 0.0027 0.0002 0.0036 0.0003 <.0001
B L 0.4170 0.9147 0.4805 0.0733
A M 0.3589 0.9147 0.3163
B M 0.4170 0.0584
A H 0.2682
Notes
As provided, this does not support by variables. Accordingly, the first line of the function disables them.
Using pairs(..., reverse = TRUE) generates the P values in the order needed later by upper.tri().
You can pass additional arguments to test() via the ... argument.
To create a tab-delimited version, use the clipr package:
clipr::write_clip(pm)
What you need is now in the clipboard and ready to paste into a spreadsheet.
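Since the question mentions about 20 of these sets, one way to batch the export (a sketch, assuming the fitted emmGrid objects are collected in a named list called emm_list, which is my own naming and not part of the original answer) is:
for (nm in names(emm_list)) {
  pm <- pvmat(emm_list[[nm]], adjust = "none")
  write.table(pm, file = paste0(nm, "_pvalues.txt"),
              sep = "\t", quote = FALSE, col.names = NA)
}
Each file is tab-delimited, so it pastes cleanly into Excel or Word.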
Addendum
Answering this question inspired me to add a new function pwpm() to the emmeans package. It will appear in the next CRAN release, and is available now from the github site. It displays means and differences as well as P values; but the user may select which to include.
> pwpm(warp.emm)
wool = A
L M H
L [44.6] 0.0007 0.0009
M 20.556 [24.0] 0.9936
H 20.000 -0.556 [24.6]
wool = B
L M H
L [28.2] 0.9936 0.1704
M -0.556 [28.8] 0.1389
H 9.444 10.000 [18.8]
Row and column labels: tension
Upper triangle: P values   adjust = "tukey"
Diagonal: [Estimates] (emmean)
Lower triangle: Comparisons (estimate)   earlier vs. later

Developing hierarchical version of nonlinear growth curve model in Stan

The following model is model 1 of Preece and Baines (1978, Annals of Human Biology), and is used to describe human growth.
My Stan code for this model is as follows:
```{stan output.var="test"}
data {
int<lower=1> n;
ordered[n] t; // age
ordered[n] y; // height of human
}
parameters {
positive_ordered[2] h;
real<lower=0, upper=t[n-1]> theta;
positive_ordered[2] s;
real<lower=0> sigma;
}
model {
h[1] ~ uniform(0, y[n]);
h[2] ~ normal(180, 20);
sigma ~ student_t(2, 0, 1);
s[1] ~ normal(5, 5);
s[2] ~ normal(5, 5);
theta ~ normal(10, 5);
y ~ normal(h[2] - (2*(h[2] - h[1]) * inv(exp(s[1]*(t - theta)) + exp(s[2]*(t - theta)))), sigma);
}
```
I now want to create a hierarchical version of this model, with parameters modelled either as identical across boys (such as the measurement variability sigma) or hierarchically (such as h_1, the adult height).
For the parameters sigma and theta, I want to keep the priors identical across boys: sigma ~ student_t(2, 0, 1); and theta ~ normal(10, 5);.
Unfortunately, I have almost no experience in implementing hierarchical modelling, and I have struggled in my attempts at doing any hierarchical examples beyond the simple binomial hierarchical models in Bayesian textbooks (see chapter 12, pages 358-359 of Statistical Rethinking by Richard McElreath). I do, however, understand the theory behind hierarchical modelling, as written in chapter 5 of Bayesian Data Analysis by Andrew Gelman and chapter 12 of Statistical Rethinking by Richard McElreath.
I am keen to discover how this type of hierarchical model would be implemented in Stan. Ideally, I am seeking explanations alongside the code, so that I may learn how to implement these types of hierarchical model examples independently in the future.
A sample of my data is as follows:
age boy01 boy02 boy03 boy04 boy05 boy06 boy07 boy08 boy09 boy10 boy11 boy12 boy13 boy14 boy15 boy16 boy17 boy18
1 1 81.3 76.2 76.8 74.1 74.2 76.8 72.4 73.8 75.4 78.8 76.9 81.6 78 76.4 76.4 76.2 75 79.7
2 1.25 84.2 80.4 79.8 78.4 76.3 79.1 76 78.7 81 83.3 79.9 83.7 81.8 79.4 81.2 79.2 78.4 81.3
3 1.5 86.4 83.2 82.6 82.6 78.3 81.1 79.4 83 84.9 87 84.1 86.3 85 83.4 86 82.3 82 83.3
4 1.75 88.9 85.4 84.7 85.4 80.3 84.4 82 85.8 87.9 89.6 88.5 88.8 86.4 87.6 89.2 85.4 84 86.5
5 2 91.4 87.6 86.7 88.1 82.2 87.4 84.2 88.4 90 91.4 90.6 92.2 87.1 91.4 92.2 88.4 85.9 88.9
6 3 101. 97 94.2 98.6 89.4 94 93.2 97.3 97.3 100. 96.6 99.3 96.2 101. 101. 101 95.6 99.4
7 4 110. 105. 100. 104. 96.9 102. 102. 107. 103. 111 105. 106. 104 106. 110. 107. 102. 104.
8 5 116. 112. 107. 111 104. 109. 109 113. 108. 118. 112 113. 111 113. 117. 115. 109. 112.
9 6 122. 119. 112. 116. 111. 116. 117. 119. 114. 126. 119. 120. 117. 120. 122. 121. 118. 119
10 7 130 125 119. 123. 116. 122. 123. 126. 120. 131. 125. 127. 124. 129. 130. 128 125. 128
I acknowledge the lack of precision in the decimal places. The data is in the form of a tibble table, which doesn't seem to respond to R's usual commands for greater precision. For the sake of consistency, it might be better to simply ignore all of the rows after row 5, since rows 1 - 5 display the full precision that is present in the original data.
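(A side note, not from the original post: if the data sit in a tibble, the extra digits are usually just hidden by the print method; something like the following should show them, assuming the tibble is called Children as in the full-data output below.)
options(pillar.sigfig = 7)                    # print more significant digits in tibbles
print(as.data.frame(Children), digits = 7)    # or coerce to a data frame for printing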
In the full data, the ages are
> Children$age
[1] 1.00 1.25 1.50 1.75 2.00 3.00 4.00 5.00 6.00 7.00 8.00 8.50 9.00 9.50 10.00 10.50 11.00 11.50 12.00 12.50
[21] 13.00 13.50 14.00 14.50 15.00 15.50 16.00 16.50 17.00 17.50 18.00
And there are 39 boys, listed in the same wide-data format as the above sample.
As a start, let's fit a (non-hierarchical) non-linear growth model using Stan.
We read in the sample data.
library(tidyverse);
df <- read.table(text = "
age boy01 boy02 boy03 boy04 boy05 boy06 boy07 boy08 boy09 boy10 boy11 boy12 boy13 boy14 boy15 boy16 boy17 boy18
1 1 81.3 76.2 76.8 74.1 74.2 76.8 72.4 73.8 75.4 78.8 76.9 81.6 78 76.4 76.4 76.2 75 79.7
2 1.25 84.2 80.4 79.8 78.4 76.3 79.1 76 78.7 81 83.3 79.9 83.7 81.8 79.4 81.2 79.2 78.4 81.3
3 1.5 86.4 83.2 82.6 82.6 78.3 81.1 79.4 83 84.9 87 84.1 86.3 85 83.4 86 82.3 82 83.3
4 1.75 88.9 85.4 84.7 85.4 80.3 84.4 82 85.8 87.9 89.6 88.5 88.8 86.4 87.6 89.2 85.4 84 86.5
5 2 91.4 87.6 86.7 88.1 82.2 87.4 84.2 88.4 90 91.4 90.6 92.2 87.1 91.4 92.2 88.4 85.9 88.9
6 3 101. 97 94.2 98.6 89.4 94 93.2 97.3 97.3 100. 96.6 99.3 96.2 101. 101. 101 95.6 99.4
7 4 110. 105. 100. 104. 96.9 102. 102. 107. 103. 111 105. 106. 104 106. 110. 107. 102. 104.
8 5 116. 112. 107. 111 104. 109. 109 113. 108. 118. 112 113. 111 113. 117. 115. 109. 112.
9 6 122. 119. 112. 116. 111. 116. 117. 119. 114. 126. 119. 120. 117. 120. 122. 121. 118. 119
10 7 130 125 119. 123. 116. 122. 123. 126. 120. 131. 125. 127. 124. 129. 130. 128 125. 128", header = T, row.names = 1);
df <- df %>%
    gather(boy, height, -age) %>%
    mutate(boy_id = as.integer(factor(boy)));   # integer boy index (1..J), used by the hierarchical fit below
We define the Stan model.
model <- "
data {
int N; // Number of observations
real y[N]; // Height
real t[N]; // Time
}
parameters {
real<lower=0,upper=1> s[2];
real h_theta;
real theta;
real sigma;
}
transformed parameters {
vector[N] mu;
real h1;
h1 = max(y);
for (i in 1:N)
mu[i] = h1 - 2 * (h1 - h_theta) / (exp(s[1] * (t[i] - theta)) + (exp(s[2] * (t[i] - theta))));
}
model {
// Priors
s ~ cauchy(0, 2.5);
y ~ normal(mu, sigma);
}
"
Here we consider weakly informative (regularising) priors on s[1] and s[2].
Fit the model to the data.
library(rstan);
options(mc.cores = parallel::detectCores());
rstan_options(auto_write = TRUE);
model <- stan(
model_code = model,
data = list(
N = nrow(df),
y = df$height,
t = df$age),
iter = 4000);
Summary of the 6 model parameters:
summary(model, pars = c("h1", "h_theta", "theta", "s", "sigma"))$summary
# mean se_mean sd 2.5% 25% 50%
#h1 131.0000000 0.000000000 0.0000000 131.0000000 131.0000000 131.0000000
#h_theta 121.6874553 0.118527828 2.7554944 115.4316738 121.1654809 122.2134014
#theta 6.5895553 0.019738319 0.5143429 5.4232740 6.4053479 6.6469534
#s[1] 0.7170836 0.214402086 0.3124318 0.1748077 0.3843143 0.8765256
#s[2] 0.3691174 0.212062373 0.3035039 0.1519308 0.1930381 0.2066811
#sigma 3.1524819 0.003510676 0.1739904 2.8400096 3.0331962 3.1411533
# 75% 97.5% n_eff Rhat
#h1 131.0000000 131.0000000 8000.000000 NaN
#h_theta 123.0556379 124.3928800 540.453594 1.002660
#theta 6.8790801 7.3376348 679.024115 1.002296
#s[1] 0.9516115 0.9955989 2.123501 3.866466
#s[2] 0.2954409 0.9852540 2.048336 6.072550
#sigma 3.2635849 3.5204101 2456.231113 1.001078
So what does this mean? From the Rhat values for s[1] and s[2] you can see that there are issues with convergence for these two parameters. This is due to the fact that s[1] and s[2] are indistinguishable: they cannot be estimated both at the same time. A stronger regularising prior on s[1] and s[2] would probably drive one of the s parameters to zero.
I'm not sure I understand the point of s[1] and s[2]. From a statistical modelling point of view, you cannot obtain estimates for both parameters in the simple non-linear growth model that we're considering.
Update
As promised, here is an update. This is turning into quite a long post, so I've tried to make things as clear as possible by adding additional explanations.
Preliminary comments
Using positive_ordered as data type for s makes a significant difference in terms of convergence of solutions. It's not clear to me why that is the case, nor do I know how Stan implements positive_ordered, but it works.
In this hierarchical approach, we partially pool height data across all boys by considering h1 ~ normal(mu_h1, sigma_h1), with priors on the hyperparameters mu_h1 ~ normal(max(y), 10) and a half-Cauchy prior on sigma_h1 ~ cauchy(0, 10) (it's half-Cauchy because sigma is declared as real<lower=0>).
To be honest, I am unsure about the interpretation (and interpretability) of some of the parameters. Estimates for h_1 and h_theta are very similar, and in some way cancel each other out. I would imagine that this creates some convergence issues when fitting the model, but as you can see further down, Rhat values seem ok. Still, as I don't know enough about the model, data and its context, I remain somewhat skeptical as to the interpretability of those parameters. Extending the model by turning some of the other parameters into group-level parameters is straightforward from a statistical modelling point of view; however I imagine that difficulties will arise from the indistinguishability and lack of interpretability.
All these points aside, this should give you a practical example of how to implement a hierarchical model.
The Stan model
model_code <- "
data {
int N; // Number of observations
int J; // Number of boys
int<lower=1,upper=J> boy[N]; // Boy of observation
real y[N]; // Height
real t[N]; // Time
}
parameters {
real<lower=0> h1[J];
real<lower=0> h_theta;
real<lower=0> theta;
positive_ordered[2] s;
real<lower=0> sigma;
// Hyperparameters
real<lower=0> mu_h1;
real<lower=0> sigma_h1;
}
transformed parameters {
vector[N] mu;
for (i in 1:N)
mu[i] = h1[boy[i]] - 2 * (h1[boy[i]] - h_theta) / (exp(s[1] * (t[i] - theta)) + (exp(s[2] * (t[i] - theta))));
}
model {
h1 ~ normal(mu_h1, sigma_h1); // Partially pool h1 parameters across boys
mu_h1 ~ normal(max(y), 10); // Prior on h1 hyperparameter mu
sigma_h1 ~ cauchy(0, 10); // Half-Cauchy prior on h1 hyperparameter sigma
h_theta ~ normal(max(y), 2); // Prior on h_theta
theta ~ normal(max(t), 2); // Prior on theta
s ~ cauchy(0, 1); // Half-Cauchy priors on s[1] and s[2]
y ~ normal(mu, sigma);
}
"
To summarise: We pool height data from all boys to improve estimates at the group (i.e. boy) level, by modelling the adult height parameter as h1 ~ normal(mu_h1, sigma_h1), where the hyperparameters mu_h1 and sigma_h1 characterise the normal distribution of adult height values across all boys. We choose weakly informative priors on the hyperparameters, and choose weakly informative priors on all remaining parameters similar to the first complete-pooling example.
Fit the model
# Fit model
fit <- stan(
model_code = model_code,
data = list(
N = nrow(df),
J = length(unique(df$boy)),
boy = df$boy_id,
y = df$height,
t = df$age),
iter = 4000)
Extract summary
We extract parameter estimates for all parameters; note that we now have as many h1 parameters as there are groups (i.e. boys).
# Get summary
summary(fit, pars = c("h1", "h_theta", "theta", "s", "sigma"))$summary
# mean se_mean sd 2.5% 25% 50%
#h1[1] 142.9406153 0.1046670943 2.41580757 138.4272280 141.2858391 142.909765
#h1[2] 143.7054020 0.1070466445 2.46570025 139.1301456 142.0233342 143.652657
#h1[3] 144.0352331 0.1086953809 2.50145442 139.3982034 142.3131167 143.971473
#h1[4] 143.8589955 0.1075753575 2.48015745 139.2689731 142.1666685 143.830347
#h1[5] 144.7359976 0.1109871908 2.55284812 140.0529359 142.9917503 144.660586
#h1[6] 143.9844938 0.1082691127 2.49497990 139.3378948 142.2919990 143.926931
#h1[7] 144.3857221 0.1092604239 2.51645359 139.7349112 142.6665955 144.314645
#h1[8] 143.7469630 0.1070594855 2.46860328 139.1748700 142.0660983 143.697302
#h1[9] 143.6841113 0.1072208284 2.47391295 139.0885987 141.9839040 143.644357
#h1[10] 142.9518072 0.1041206784 2.40729732 138.4289207 141.3114204 142.918407
#h1[11] 143.5352502 0.1064173663 2.45712021 138.9607665 141.8547610 143.483157
#h1[12] 143.0941582 0.1050061258 2.42894673 138.5579378 141.4295430 143.055576
#h1[13] 143.6194965 0.1068494690 2.46574352 138.9426195 141.9412820 143.577920
#h1[14] 143.4477182 0.1060254849 2.44776536 138.9142081 141.7708660 143.392231
#h1[15] 143.1415683 0.1049131998 2.42575487 138.6246642 141.5014391 143.102219
#h1[16] 143.5686919 0.1063594201 2.45328456 139.0064573 141.8962853 143.510276
#h1[17] 144.0170715 0.1080567189 2.49269747 139.4162885 142.3138300 143.965127
#h1[18] 143.4740997 0.1064867748 2.45545200 138.8768051 141.7989566 143.426211
#h_theta 134.3394366 0.0718785944 1.72084291 130.9919889 133.2348411 134.367152
#theta 8.2214374 0.0132434918 0.45236221 7.4609612 7.9127800 8.164685
#s[1] 0.1772044 0.0004923951 0.01165119 0.1547003 0.1705841 0.177522
#s[2] 1.6933846 0.0322953612 1.18334358 0.6516669 1.1630900 1.463148
#sigma 2.2601677 0.0034146522 0.13271459 2.0138514 2.1657260 2.256678
# 75% 97.5% n_eff Rhat
#h1[1] 144.4795105 147.8847202 532.7265 1.008214
#h1[2] 145.2395543 148.8419618 530.5599 1.008187
#h1[3] 145.6064981 149.2080965 529.6183 1.008087
#h1[4] 145.4202919 149.0105666 531.5363 1.008046
#h1[5] 146.3200407 150.0701757 529.0592 1.008189
#h1[6] 145.5551778 149.1365279 531.0372 1.008109
#h1[7] 145.9594956 149.5996605 530.4593 1.008271
#h1[8] 145.3032680 148.8695637 531.6824 1.008226
#h1[9] 145.2401743 148.7674840 532.3662 1.008023
#h1[10] 144.4811712 147.9218834 534.5465 1.007937
#h1[11] 145.1153635 148.5968945 533.1235 1.007988
#h1[12] 144.6479561 148.0546831 535.0652 1.008115
#h1[13] 145.1660639 148.6562172 532.5386 1.008138
#h1[14] 144.9975197 148.5273804 532.9900 1.008067
#h1[15] 144.6733010 148.1130207 534.6057 1.008128
#h1[16] 145.1163764 148.6027096 532.0396 1.008036
#h1[17] 145.5578107 149.2014363 532.1519 1.008052
#h1[18] 145.0249329 148.4886949 531.7060 1.008055
#h_theta 135.4870338 137.6753254 573.1698 1.006818
#theta 8.4812339 9.2516700 1166.7226 1.002306
#s[1] 0.1841457 0.1988365 559.9036 1.005333
#s[2] 1.8673249 4.1143099 1342.5839 1.001562
#sigma 2.3470429 2.5374239 1510.5824 1.001219
Visualise adult height estimates
Finally, we plot the adult height estimates h_1 for all boys, including 50% and 95% intervals.
# Plot h1 values
summary(fit, pars = c("h1"))$summary %>%
as.data.frame() %>%
rownames_to_column("Variable") %>%
mutate(
Variable = gsub("(h1\\[|\\])", "", Variable),
Variable = df$boy[match(Variable, df$boy_id)]) %>%
ggplot(aes(x = `50%`, y = Variable)) +
geom_point(size = 3) +
geom_segment(aes(x = `2.5%`, xend = `97.5%`, yend = Variable), size = 1) +
geom_segment(aes(x = `25%`, xend = `75%`, yend = Variable), size = 2) +
labs(x = "Median (plus/minus 95% and 50% CIs)", y = "h1")

Estimating missing values in time-series data frame based on a "rate of change"

I am trying to use a loop in R to estimate values that will replace the NAs in my data frame based on a rate of change ("rate") that multiplies my last value (ok, this is confusing, but please refer to the example below). This is something similar to my data:
l1 <- c(NA,NA,NA,27,31,0.5)
l2 <- c(NA,8,12,28,39,0.5)
l3 <- c(NA,NA,NA,NA,39,0.3)
l4 <- c(NA,NA,11,15,31,0.2)
l5 <- c(NA,NA,NA,NA,51,0.9)
data <- as.data.frame(rbind(l1,l2,l3,l4,l5))
colnames(data) <- c("dbh1","dbh2","dbh3","dbh4","dbh5","rate")
So I created a loop to identify the first non-NA value in each row, then use that value to estimate the previous values based on the "rate". For instance, in row 1, the first NA value would be replaced by 27-(0.5*3), the second one by 27-(0.5*2), and the third one by 27-(0.5*1). This is the loop I came up with. I know the first part (the outside loop) works, but the inside one doesn't:
for (i in 1: nrow(data)) {
dbh.cols <- data3[i,c("dbh1","dbh2","dbh3","dbh4","dbh5")]
sample.year <- which(dbh.cols != "NA")
data$first.dbh[i] <- min(dbh.cols, na.rm = T)
data$first.index[i] <- min(sample.year)
for (j on 1: (min(sample.year)-1)) {
ifelse(is.na(data[i,j]), min(dbh.cols, na.rm = T) - (min(sample.year)-j)*rate[i,j], data[i,j])
}
}
I am not good at programming so probably my internal loop strategy with "ifelse" is too weird (and wrong) but I just couldn't think of anything else that would work here... Any suggestions?
1) This uses no explicit loops, just an apply. It assumes that the NAs are all leading as in the example given.
fillIn <- function(x) {
rate <- tail(x, 1)
n <- sum(is.na(x)) # no of NAs
c(x[n+1] - rate * seq(n, 1), na.omit(x))
}
replace(data, TRUE, t(apply(data, 1, fillIn)))
giving:
dbh1 dbh2 dbh3 dbh4 dbh5 rate
l1 25.5 26.0 26.5 27.0 31 0.5
l2 7.5 8.0 12.0 28.0 39 0.5
l3 37.8 38.1 38.4 38.7 39 0.3
l4 10.6 10.8 11.0 15.0 31 0.2
l5 47.4 48.3 49.2 50.1 51 0.9
2) Here is a second approach that uses na.approx from the zoo package. It does not require apply. Here data1 has the same content as data except that the first column is filled in. The other NAs remain. The last line uses na.approx to fill in the remaining NAs linearly.
library(zoo)
NAs <- rowSums(is.na(data))
data1 <- cbind( data[cbind(1:nrow(data), NAs + 1)] - data$rate * NAs, data[-1] )
replace(data, TRUE, t(na.approx(t(data1))))
giving:
dbh1 dbh2 dbh3 dbh4 dbh5 rate
l1 25.5 26.0 26.5 27.0 31 0.5
l2 7.5 8.0 12.0 28.0 39 0.5
l3 37.8 38.1 38.4 38.7 39 0.3
l4 10.6 10.8 11.0 15.0 31 0.2
l5 47.4 48.3 49.2 50.1 51 0.9
2a) A variation on (2) uses na.locf in the middle line to bring forward the first non-NA in each row. The first and last lines are the same.
library(zoo)
NAs <- rowSums(is.na(data))
data1 <- cbind(na.locf(t(data), fromLast = TRUE)[1, ] - data$rate * NAs, data[-1])
replace(data, TRUE, t(na.approx(t(data1))))
You do not need to use multiple for loops for this. Here is some simplified code to do what you want just for the for loop. Working explicitly with your data we need to get the first non-NA value from each row.
for_estimate <- apply(data, 1, function(x) x[min(which(is.na(x) == FALSE))])
Secondly, we need to determine what integer to multiply the rate by for each row depending on how many NA values there are.
# total number of NA values per row
n_na <- apply(data,1, function(x) sum(is.na(x)) )
# make it a matrix with a 0's appended on
n_na <- matrix(c(n_na, rep(0, nrow(data) * (ncol(data)-1))),
nrow = nrow(data), ncol = ncol(data)-1)
# fill in the rest of the matrix
for(i in 2:ncol(n_na)){
n_na[,i] <- n_na[,i-1] -1
}
Once we have that we can use this code to back fill the NA values in that way you are interested in.
for(i in (ncol(data)-1):1){
if(sum(is.na(data[,i]))>0){
to_fill <- which(is.na(data[,i])==TRUE)
data[to_fill, i] <- for_estimate[to_fill] - (data$rate[to_fill] * n_na[to_fill, i])
}
}
output
dbh1 dbh2 dbh3 dbh4 dbh5 rate
l1 25.5 26.0 26.5 27.0 31 0.5
l2 7.5 8.0 12.0 28.0 39 0.5
l3 37.8 38.1 38.4 38.7 39 0.3
l4 10.6 10.8 11.0 15.0 31 0.2
l5 47.4 48.3 49.2 50.1 51 0.9

Assign value to dataframe inside parent environment

I have a parent function (fun1) that takes data and a helper function (fun2) that operates on columns from that data. I want to assign the value from that helper function to its matching column in data in fun1. In reality there are lots of little helper functions operating on columns, and I want the value changed by fun2 to be what the other helper functions deal with.
How can I take the output from fun2 and assign it to the matching column of data in fun1? I do not want to make that alteration to data in fun1 itself, but want to change data from within fun2.
fun1 <- function(data){
Map(fun2, data, colnames(data))
data[[1]]
}
fun2 <- function(x, colname_x){
x <- x * 2
# my attempt to assign to `data[["mpg"]]`
assign(sprintf("data[[\"%s\"]]", colname_x), value=x, pos = 1)
}
fun1(mtcars[1:3])
Desired output of fun1:
## [1] 42.0 42.0 45.6 42.8 37.4 36.2 28.6 48.8 45.6 38.4 35.6 32.8 34.6 30.4 20.8
## [16] 20.8 29.4 64.8 60.8 67.8 43.0 31.0 30.4 26.6 38.4 54.6 52.0 60.8 31.6 39.4
## [31] 30.0 42.8
EDIT: Also tried:
fun2 <- function(x, colname_x = "mpg"){
x <- x * 2
assign(sprintf("data[[\"%s\"]]", colname_x), value=x, env = parent.frame(3))
}
Edit 2: This works but seems not so clean:
fun2 <- function(x, colname_x = "mpg"){
x <- x * 2
data <- get('data', env=parent.frame(3))
data[[colname_x]] <- x
assign('data', value=data, env = parent.frame(3))
}
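A somewhat cleaner variant of the same idea (my sketch, not part of the original post) is to hand fun1's environment to fun2 explicitly instead of counting parent.frame() levels; the argument name where is introduced here purely for illustration:
fun1 <- function(data) {
  env <- environment()                       # fun1's own frame, where `data` lives
  mapply(fun2, data, colnames(data),
         MoreArgs = list(where = env), SIMPLIFY = FALSE)
  data[[1]]
}

fun2 <- function(x, colname_x, where) {
  where$data[[colname_x]] <- x * 2           # modify `data` in fun1's frame
  invisible(NULL)
}

fun1(mtcars[1:3])   # the first column returned is now mpg * 2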
