R boxplot with already computed mean, confidence intervals and min max

I am trying to generate a boxplot in R using already computed confidence intervals, minima, and maxima. For times 1, 2, 3, 4, 5 (x-axis) I have MN, an array of 5 elements giving the mean at each time point. I also have CI1, CI2, MINIM, and MAXM, each an array of 5 elements, one per time step, representing the upper CI, lower CI, minimum, and maximum.
I want to generate a box plot at each of the 5 time steps.
I have tried the usual boxplot function, but I could not get it to work with already computed CIs and min/max.
It would be great if the method worked with base R plotting, though ggplot would be fine too.

Since you have not posted data, I will use the built-in iris dataset, keeping only the first 4 columns.
data(iris)
iris2 <- iris[-5]
The function boxplot computes the statistics it uses and then calls bxp to do the drawing, passing it those computed values.
If you want a different set of statistics, you will have to compute them and pass them to bxp manually.
I am assuming that by CI you mean a normal 95% confidence interval for the mean. For that you first need the means and their standard errors.
n <- nrow(iris2)
s <- apply(iris2, 2, sd)
mn <- colMeans(iris2)
se <- s / sqrt(n)              # standard errors of the means
ci1 <- mn - qnorm(0.975) * se  # lower 95% confidence limit
ci2 <- mn + qnorm(0.975) * se  # upper 95% confidence limit
minm <- apply(iris2, 2, min)
maxm <- apply(iris2, 2, max)
Now have boxplot create the data structure used by bxp, without drawing anything; its stats component is the matrix we need to fill.
bp <- boxplot(iris2, plot = FALSE)
And fill the matrix with the values computed earlier.
bp$stats <- matrix(c(
  minm,  # lower whisker
  ci1,   # lower hinge: lower CI
  mn,    # middle: the mean
  ci2,   # upper hinge: upper CI
  maxm   # upper whisker
), nrow = 5, byrow = TRUE)
Finally, plot it.
bxp(bp)
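Applied to the setup in the question, the same trick looks like this. The MN, CI1, CI2, MINIM and MAXM values below are made-up placeholders, since no data was posted; only the structure matters.
# Hypothetical values standing in for the question's arrays (one per time point)
MN    <- c(10, 12, 11, 14, 13)  # means at times 1..5
CI1   <- MN + 1.5               # upper CIs
CI2   <- MN - 1.5               # lower CIs
MINIM <- MN - 4                 # minima
MAXM  <- MN + 4                 # maxima
# Any 5-column matrix works as a skeleton; only its dimensions and names matter
skel <- matrix(0, nrow = 2, ncol = 5, dimnames = list(NULL, 1:5))
bp <- boxplot(skel, plot = FALSE)
bp$stats <- matrix(c(MINIM, CI2, MN, CI1, MAXM), nrow = 5, byrow = TRUE)
bxp(bp, xlab = "time")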

Related

Calculating 95% confidence intervals for a weighted median over grouped data in dplyr

I have a dataset with several groups, where I want to calculate a median value for each group using dplyr. The data are weighted, and the weights need to be taken into account in calculating the median. I found the weighted.median function from spatstat which seems to work fine. Consider the following simplified example:
library(spatstat)
library(dplyr)
tst <- data.frame(group = rep(c(1:5), each = 100))
tst$val = runif(500) * tst$group
tst$wt = runif(500) * tst$val
tst %>%
  group_by(group) %>%
  summarise(weighted.median(val, wt))
# A tibble: 5 × 2
  group `weighted.median(val, wt)`
  <int>                      <dbl>
1     1                      0.752
2     2                      1.36
3     3                      1.99
4     4                      2.86
5     5                      3.45
However, I would also like to add 95% confidence intervals to these values, and this has me stumped. Things I've considered:
Spatstat also has a weighted.var function but there's no documentation, and it's not even clear to me whether this is variance around the median or mean.
This rcompanion post suggests various methods for calculating CIs around medians, but as far as I can tell none of them handle weights.
This blog post suggests a function for calculating CIs and a median for weighted data, and is the closest I can find to what I need. However, it doesn't work with my dplyr groupings. I suppose I could write a loop to do this one group at a time and build the output data frame, but that seems cumbersome. I'm also not totally sure I understand the function in the post and am slightly suspicious of its results: for instance, testing it out I get wider estimates for alpha = 0.1 than for alpha = 0.05, which seems backwards to me. Edit to add: upon further investigation, I think this function works as intended if I use alpha = 0.95 for 95% CIs rather than alpha = 0.05 (at least, this returns values that feel intuitively about right). I can also make it work with dplyr by editing it to return just a single moe value rather than a pair of high/low estimates. So this may be a good option, but I'm also considering others.
Is there an existing function in some library somewhere that can do what I want, or an otherwise straightforward way to implement this?
There are several approaches.
You could use the asymptotic formula for standard error of the sample median. The sample median is asymptotically normal with standard error 1/sqrt(4 n f(m)) where n is the number of observations, m is the true median, and f(x) is the probability density of the (weighted) random variable. You could estimate the probability density using the base R function density.default with the weights argument. If x is the vector of observed values and w the corresponding vector of weights, then
med <- weighted.median(x, w)
f <- density(x, weights = w / sum(w))   # density() expects weights that sum to 1
fmed <- approx(f$x, f$y, xout = med)$y  # estimated density at the median
samplesize <- length(x)
se <- 1 / sqrt(4 * samplesize * fmed)
ci <- med + c(-1, 1) * 1.96 * se
This relies on several asymptotic approximations so it may be inaccurate. Also the sample size depends on the interpretation of the weights. In some cases the sample size could be equal to sum(w).
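To get these intervals per group in the dplyr pipeline from the question, the asymptotic formula can be wrapped in a helper. A sketch, assuming dplyr >= 1.0 (where summarise() unpacks a returned tibble into columns) and taking the sample size to be length(x):
library(spatstat)
library(dplyr)

# Asymptotic CI for the weighted median -- a sketch, not a vetted implementation
median_ci <- function(x, w, level = 0.95) {
  med  <- weighted.median(x, w)
  f    <- density(x, weights = w / sum(w))
  fmed <- approx(f$x, f$y, xout = med)$y
  se   <- 1 / sqrt(4 * length(x) * fmed)
  z    <- qnorm(1 - (1 - level) / 2)
  tibble(median = med, lo = med - z * se, hi = med + z * se)
}

tst %>%
  group_by(group) %>%
  summarise(median_ci(val, wt))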
If there is very little data in each group, you could use the even simpler normal reference approximation,
med <- weighted.median(x, w)
v <- weighted.var(x, w)
sdm <- sqrt(pi/2) * sqrt(v)  # under normality, the sd of the median is sqrt(pi/2) times that of the mean
samplesize <- length(x)
se <- sdm / sqrt(samplesize)
ci <- med + c(-1, 1) * 1.96 * se
Alternatively you could use bootstrapping: generate random resamples of the input data (by choosing random resamples of the indices 1, 2, ..., n), extract the corresponding weighted observations (x_i, w_i), compute the weighted median of each resampled dataset, and take quantiles of those medians as the 95% confidence interval, as sketched below.
(This approach implicitly assumes the sample size is equal to n.)
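A minimal percentile-bootstrap sketch of that idea (weighted.median again comes from spatstat; B, the number of resamples, is a tuning choice):
library(spatstat)

boot_median_ci <- function(x, w, B = 2000, level = 0.95) {
  n <- length(x)
  meds <- replicate(B, {
    i <- sample.int(n, n, replace = TRUE)  # resample (x_i, w_i) pairs
    weighted.median(x[i], w[i])
  })
  alpha <- 1 - level
  quantile(meds, c(alpha / 2, 1 - alpha / 2))
}

# e.g. for group 1 of the tst data above:
with(subset(tst, group == 1), boot_median_ci(val, wt))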

Increase precision when standardizing test dataset

I am dealing with a dataset in R divided into train and test sets. I preprocess the data by centering and dividing by the standard deviation, and I want to store the mean and sd values of the training set so I can scale the test set using the same values. However, the precision obtained with the scale function is much better than with the colMeans and apply(x, 2, sd) functions.
set.seed(5)
a = matrix(rnorm(30000, mean=10, sd=5), 10000, 3) # Generate data
a_scale = scale(a) # scale using the scale function
a_scale_custom = (a - colMeans(a)) / apply(a, 2, sd) # Using custom function
Now if I compare the means of both matrices:
colMeans(a_scale)
[1] -9.270260e-17 -1.492891e-16 1.331857e-16
colMeans(a_scale_custom)
[1] 0.007461065 -0.004395052 -0.003046839
The matrix obtained using scale has column means of essentially 0 (around 10^-16), while the matrix obtained by subtracting the mean with colMeans has errors on the order of 10^-2. The same happens when comparing the standard deviations.
Is there any way I can obtain better precision when scaling the data without using the scale function?
The custom function has a bug in the matrix layout: R recycles colMeans(a) down the columns, not across the rows, so the wrong mean is subtracted from most entries. You need to transpose the matrix with t() before subtracting the vector, then transpose it back. Try the following:
set.seed(5)
a <- matrix(rnorm(30000, mean=10, sd=5), 10000, 3) # Generate data
a_scale <- scale(a) # scale using the scale function
a_scale_custom <- t((t(a) - colMeans(a)) / apply(a, 2, sd))
colMeans(a_scale)
colMeans(a_scale_custom)
see also: How to divide each row of a matrix by elements of a vector in R
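As an aside (not part of the original answer), base R's sweep() expresses the same column-wise centering and scaling without the double transpose:
# sweep() applies a summary vector along the given margin (2 = columns)
a_scale_sweep <- sweep(sweep(a, 2, colMeans(a), "-"), 2, apply(a, 2, sd), "/")
all.equal(a_scale_sweep, a_scale_custom)  # TRUE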

Convert uniform draws to normal distributions with known mean and std in R

I apply the sensitivity package in R. In particular, I want to use sobolroalhs, as it uses a sampling procedure for inputs that allows for evaluations of models with a large number of parameters. The function samples uniformly on [0,1] for all inputs. The documentation states that the desired distributions need to be obtained as follows:
####################
# Test case: dealing with non-uniform distributions
x <- sobolroalhs(model = NULL, factors = 3, N = 1000, order =1, nboot=0)
# X1 follows a log-normal distribution:
x$X[,1] <- qlnorm(x$X[,1])
# X2 follows a standard normal distribution:
x$X[,2] <- qnorm(x$X[,2])
# X3 follows a gamma distribution:
x$X[,3] <- qgamma(x$X[,3],shape=0.5)
# toy example
toy <- function(x){rowSums(x)}
y <- toy(x$X)
tell(x, y)
print(x)
plot(x)
I have non-zero means and standard deviations for some input parameters that I want to sample from a normal distribution. For others, I want to sample uniformly in a defined range (e.g. [0.03, 0.07] instead of [0, 1]). I tried using built-in R functions such as
SA$X[,1] <- rnorm(1000, mean = 579, sd = 21)
but I am afraid this procedure messes up the sampling design of the package, and it produced odd results for the sensitivity indices. Hence, I think I need to keep the uniform draws of the sobolroalhs function and transform the sampled values in [0, 1] into draws from the desired distributions (via the quantile functions?). Does this make sense to anyone, and/or does anyone know how I could sample from the right distributions following the syntax from the package description?
You can specify mean and sd in qnorm. So modify lines like this:
x$X[,2] <- qnorm(x$X[,2])
to something like this:
x$X[,2] <- qnorm(x$X[,2], mean = 579, sd = 21)
Similarly, you could use the min and max parameters of qunif to get values in a given range.
Of course, it's also possible to transform standard normals or uniforms to the ones you want using things like X <- 579 + 21*Z or Y <- 0.03 + 0.04*U, where Z is a standard normal and U is standard uniform, but for some distributions those transformations aren't so simple and using the q* functions can be easier.
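Putting this together for the question's numbers (which column gets which distribution is of course up to your model; the factors and N values below are just for illustration):
library(sensitivity)

x <- sobolroalhs(model = NULL, factors = 2, N = 1000, order = 1, nboot = 0)

# X1 ~ Normal(mean = 579, sd = 21), via the normal quantile function
x$X[,1] <- qnorm(x$X[,1], mean = 579, sd = 21)
# X2 ~ Uniform(0.03, 0.07), via the uniform quantile function
x$X[,2] <- qunif(x$X[,2], min = 0.03, max = 0.07)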

Generating random numbers in a specific interval

I want to generate some Weibull random numbers in a given interval. For example, 20 random numbers from the Weibull distribution with shape 2 and scale 30 in the interval (0, 10).
The rweibull function in R produces random numbers from a Weibull distribution with given shape and scale values. Can someone please suggest a method? Thank you in advance.
Use the distr package. It makes this kind of thing very easy.
require(distr)
# We create the truncated distribution
d <- Truncate(Weibull(shape = 2, scale = 30), lower = 0, upper = 10)
# The d object has four slots, d, r, p and q, corresponding to the [dprq] prefixes of standard R distributions
# This extracts 10 random numbers
d@r(10)
# Get a histogram
hist(d@r(10000))
Using base R, you can generate random numbers, keep those that fall into the target interval, and generate more if you end up with fewer than you need.
rweibull_interval <- function(n, shape, scale = 1, min = 0, max = 10) {
  weib_rnd <- rweibull(10 * n, shape, scale)             # oversample
  weib_rnd <- weib_rnd[weib_rnd > min & weib_rnd < max]  # keep draws in the interval
  if (length(weib_rnd) < n) {
    return(c(weib_rnd, rweibull_interval(n - length(weib_rnd), shape, scale, min, max)))
  } else {
    return(weib_rnd[1:n])
  }
}
set.seed(1)
rweibull_interval(20, 2, 30, 0, 10)
[1] 9.308806 9.820195 7.156999 2.704469 7.795618 9.057581 6.013369 2.570710 8.430086 4.658973
[11] 2.715765 8.164236 3.676312 9.987181 9.969484 9.578524 7.220014 8.241863 5.951382 6.934886
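For completeness, a rejection-free base R alternative (not from either answer above) is inverse-CDF sampling: draw uniforms between the CDF values at the interval endpoints and map them back through qweibull.
# Exact draws from Weibull(shape, scale) truncated to (min, max),
# by inverting the CDF on the corresponding sub-interval of probabilities
rweibull_trunc <- function(n, shape, scale = 1, min = 0, max = Inf) {
  u <- runif(n, pweibull(min, shape, scale), pweibull(max, shape, scale))
  qweibull(u, shape, scale)
}

set.seed(1)
rweibull_trunc(20, 2, 30, 0, 10)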

Obtaining confidence interval for npreg as values, not as plot

I am using the well-known "np" package of Hayfield & Racine for non-parametric regressions. It allows plotting confidence bands for the estimated coefficient based on bootstrap procedures. See the code below for an example.
Question: I am wondering how to obtain these confidence intervals in numerical form? One reason, though not the only one, is that I really don't like the presentation of the CIs. More generally, I would like to use and further process the confidence band within my analysis.
library(np)
# generate random variables:
x <- 1:100 + rnorm(100)/2
y <- (1:100)^(0.25) + rnorm(100)/2
mynp <- npreg(y~x)
plot(mynp, plot.errors.method="bootstrap")
When executing plot, the call is dispatched to the np package's plot method, which is the function npplot.
npplot accepts an argument plot.behavior, which equals "plot" by default; this plots the results and returns NULL. You should set plot.behavior = "plot-data", and the function will both plot and return the data of the object.
dat <- plot(mynp, plot.errors.method="bootstrap",plot.behavior = "plot-data")
Then the values on the line can be accessed through dat$r1$mean, and the values to be added to the mean to get the upper and lower CI through dat$r1$merr.
Notice that not all values are plotted, only half of them (every other value, plus the last).
Read the help on npplot for more options.
Below is an example of the code in use, together with the results:
library(np)
# generate random variables:
x <- 1:100 + rnorm(100)/2
y <- (1:100)^(0.25) + rnorm(100)/2
mynp <- npreg(y~x)
dat <- plot(mynp, plot.errors.method="bootstrap",plot.behavior = "plot-data")
Then recreating the results:
z <- unlist(dat$r1$eval, use.names = FALSE)
CI.up <- as.numeric(dat$r1$mean) + as.numeric(dat$r1$merr[,2])
CI.dn <- as.numeric(dat$r1$mean) + as.numeric(dat$r1$merr[,1])
plot(dat$r1$mean ~ z, cex = 1.5, xaxt = 'n', ylim = c(1.0, 3.5),
     xlab = '', ylab = 'lalala!', main = 'blahblahblah', col = 'blue', pch = 16)
arrows(z, CI.dn, z, CI.up, code = 3, length = 0.2, angle = 90, col = 'red')
Running this reproduces the band from npplot. As you can see, the results are the same (except that I have calculated the intervals for each point, not only for half of them).
Note the plot.errors.type argument of npplot, which takes "standard" or "quantiles" and defaults to "standard". With "standard", dat$r1$merr holds the standard errors and the plot shows mean ± standard error as the intervals. With "quantiles", the plot shows quantile intervals instead, and the quantiles are saved in dat$r1$merr; which quantiles to use is specified by the plot.errors.quantiles argument, which is only relevant if plot.errors.type = "quantiles".
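For example, to get 95% quantile bands rather than standard-error bands (a sketch using the arguments described above):
dat.q <- plot(mynp, plot.errors.method = "bootstrap",
              plot.errors.type = "quantiles",
              plot.errors.quantiles = c(0.025, 0.975),
              plot.behavior = "plot-data")
CI.up <- as.numeric(dat.q$r1$mean) + as.numeric(dat.q$r1$merr[,2])
CI.dn <- as.numeric(dat.q$r1$mean) + as.numeric(dat.q$r1$merr[,1])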
