I am simulating data using rnorm, but I need to set upper and lower limits. Does anyone know how to do this?
code:
rnorm(n = 10, mean = 39.74, sd = 25.09)
The upper limit needs to be 340 and the lower limit 0.
I am asking this question because I am rewriting SAS code in R. I have never used SAS.
I am trying to rewrite the following piece of code:
sim_sample(simtot=100000,seed=10004,lbound=0,ubound=340,round_y=0.01,round_m=0.01,round_sd=0.01,n=15,m=39.74,sd=25.11,mk=4)
The rtruncnorm() function will return the results you need.
library(truncnorm)
rtruncnorm(n=10, a=0, b=340, mean=39.74, sd=25.09)
You can quite simply make your own truncated normal sampler that doesn't require you to throw out observations:
rtnorm <- function(n, mean, sd, a = -Inf, b = Inf){
  # inverse-CDF sampling: draw uniforms between F(a) and F(b), then map back with qnorm
  qnorm(runif(n, pnorm(a, mean, sd), pnorm(b, mean, sd)), mean, sd)
}
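For example, applied to the numbers in the question (an illustrative call, not part of the original answer):
set.seed(1)
x <- rtnorm(10, mean = 39.74, sd = 25.09, a = 0, b = 340)
summary(x)  # all 10 values fall inside [0, 340]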
Like this?
mysamp <- function(n, m, s, lwr, upr, nnorm) {
  samp <- rnorm(nnorm, m, s)
  samp <- samp[samp >= lwr & samp <= upr]
  if (length(samp) >= n) {
    return(sample(samp, n))
  }
  stop(simpleError("Not enough values to sample from. Try increasing nnorm."))
}
set.seed(42)
mysamp(n=10, m=39.74, s=25.09, lwr=0, upr=340, nnorm=1000)
#[1] 58.90437 38.72318 19.64453 20.24153 39.41130 12.80199 59.88558 30.88578 19.66092 32.46025
However, the result is not normally distributed and usually won't have the mean and sd you specified (in particular if the limits are not symmetric around the specified mean).
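An illustrative check of this claim (not part of the original answer), using mysamp from above with a large sample:
set.seed(1)
chk <- mysamp(n = 1e5, m = 39.74, s = 25.09, lwr = 0, upr = 340, nnorm = 2e5)
mean(chk)  # larger than 39.74, because the lower tail is cut off at 0
sd(chk)    # smaller than 25.09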
Edit:
According to your comment it seems you want to translate this SAS function. I am not an SAS user, but this should do more or less the same:
mysamp <- function(n, m, s, lwr, upr, rounding) {
samp <- round(rnorm(n, m, s), rounding)
samp[samp < lwr] <- lwr
samp[samp > upr] <- upr
samp
}
set.seed(8)
mysamp(n=10, m=39.74, s=25.09, lwr=0, upr=340, rounding=3)
#[1] 37.618 60.826 28.111 25.920 58.207 37.033 35.467 12.434 0.000 24.857
You may then want to use replicate to run the simulations. Or if you want faster code:
sim <- matrix(mysamp(n=10*10, m=39.74, s=25.09, lwr=0, upr=340, rounding=3), 10)
means <- colMeans(sim)
sds <- apply(sim, 2, sd)
Assuming you want exactly 10 numbers and not just the subset of them that is > 0 and < 340 (which might not be normally distributed):
aa <- rnorm(n = 10, mean = 39.74, sd = 25.09)
while (any(aa < 0 | aa > 340)) { aa <- rnorm(n = 10, mean = 39.74, sd = 25.09) }
This is the function that I wrote to achieve the same purpose. It normalizes the result from the rnorm function and then adjusts it to fit the range.
NOTE: The standard deviation and mean (if specified) get altered during the normalization process.
#' Creates a random normal distribution within the specified bounds.
#'
#' WARNING: This function does not preserve the standard deviation or mean.
#' @param n The number of values to be generated
#' @param mean The mean of the distribution
#' @param sd The standard deviation of the distribution
#' @param lower The lower limit of the distribution
#' @param upper The upper limit of the distribution
rtnorm <- function(n, mean = NA, sd = 1, lower = -1, upper = 1){
  mean <- ifelse(is.na(mean) || mean < lower || mean > upper,
                 mean(c(lower, upper)), mean)
  data <- rnorm(n, mean = mean, sd = sd)                 # data
  if (!is.na(lower) && !is.na(upper)){                   # adjust data to specified range
    drange <- range(data)                                # data range
    irange <- range(lower, upper)                        # input range
    data <- (data - drange[1]) / (drange[2] - drange[1]) # normalize data (make it 0 to 1)
    data <- (data * (irange[2] - irange[1])) + irange[1] # adjust to specified range
  }
  return(data)
}
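A quick illustration (not from the original answer) of the warning above, using the question's numbers:
set.seed(1)
x <- rtnorm(1000, mean = 39.74, sd = 25.09, lower = 0, upper = 340)
range(x)        # exactly 0 and 340 by construction
mean(x); sd(x)  # both far from the requested 39.74 and 25.09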
There are several ways to set upper and lower limits on a normal distribution, all of which cause the result to no longer be normally distributed.
Assume mean=0 and sd=1, producing N=1e5 values with a lower boundary of LO=-1 and an upper boundary of UP=2.
N <- 1e5L
LO <- -1
UP <- 2
Move outliers to the border (@Roland)
set.seed(42)
x <- pmax(LO, pmin(UP, rnorm(N)))
mean(x)
#[1] 0.07238029
median(x)
#[1] -0.002066374
sd(x)
#[1] 0.8457605
hist(x, 30)
Cut off outliers (@Dason, @Roland, truncnorm::rtruncnorm, MCMCglmm::rtnorm)
set.seed(42)
x <- qnorm(runif(N, pnorm(LO), pnorm(UP)))
mean(x)
#[1] 0.2317875
median(x)
#[1] 0.173679
sd(x)
#[1] 0.7236536
Scale (@Alex Essilfie)
set.seed(42)
x <- rnorm(N)
x <- (x-min(x))/(max(x)-min(x))*(UP-LO)+LO
mean(x)
#[1] 0.4474876
median(x)
#[1] 0.4482257
sd(x)
#[1] 0.3595199
Combination of methods. E.g. Cut and scale:
set.seed(42)
x <- qnorm(runif(N, pnorm(-3), pnorm(3)))
x <- (x-min(x))/(max(x)-min(x))*(UP-LO)+LO
mean(x)
#[1] 0.5010759
median(x)
#[1] 0.5014713
sd(x)
#[1] 0.4957751
Asymmetric combination
set.seed(42)
n <- round(N*abs(LO)/diff(range(c(LO, UP))))
x <- c(qnorm(runif(n, pnorm(-3), 0.5)), qnorm(runif(N-n, 0.5, pnorm(3))))
x <- ifelse(x < 0, x/min(x)*LO, x/max(x)*UP)
mean(x)
#[1] 0.2651627
median(x)
#[1] 0.2127903
sd(x)
#[1] 0.5078264
Related
I'd like to test the coverage probabilities for trimmed means. I am using the formula from Wilcox's book for confidence intervals:
[Image: confidence-interval formula for the trimmed mean]
Here s_w is the Winsorised variance and γ is the proportion coefficient; in my code it is denoted alpha. The problem is that the code I have written always outputs confidence intervals containing 0, so the coverage probability is 1. So I think there is an error in the construction.
Code:
sample_var <- function(data, alpha){
n <- length(data)
data <- sort(data)
data_t <- data[(floor(n*alpha)+1):(n-floor(alpha*n))]
m <- length(data_t)
t_mean <- mean(data_t)
sigma <- (1/(1-2*alpha)^2)* ((1/n) *sum((data_t-t_mean)^2)+ alpha*(data_t[1]-t_mean)^2 +
alpha*(data_t[m]-t_mean)^2)
sigma
}
sample_var <- Vectorize(sample_var, vectorize.args = "alpha")
conf_int <- function(data,alpha){
a <- floor(alpha * n)
n <- length(data)
df <- n-2*a-1
data_t <- data[a:(n-a)]
t_mean <- mean(data_t)
t_quantile <- qt(p = alpha, df = df)
sw <- sample_var(data = data, alpha = alpha)
ul <- t_mean + t_quantile * sw / ((1-2*alpha)*sqrt(n))
ll <- t_mean - t_quantile * sw / ((1-2*alpha)*sqrt(n))
c(ll, ul)
}
Does anyone see the error?
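For comparison, here is a sketch of my own (not from the thread) that keeps the same formula structure but fixes the purely mechanical problems in conf_int: n is used before it is defined, the data are not sorted before trimming, sample_var returns a variance (so its square root is needed), and qt(p = alpha, ...) gives a lower-tail quantile rather than the two-sided critical value. The conf argument and the statistical correctness of the standard-error formula itself are not checked here.
conf_int2 <- function(data, alpha, conf = 0.95){
  n <- length(data)                                    # define n before it is used
  a <- floor(alpha * n)
  data_t <- sort(data)[(a + 1):(n - a)]                # trimmed (sorted) sample
  t_mean <- mean(data_t)
  df <- n - 2 * a - 1
  t_crit <- qt(1 - (1 - conf) / 2, df = df)            # two-sided critical value
  sw <- sqrt(sample_var(data = data, alpha = alpha))   # standard deviation, not variance
  half <- t_crit * sw / ((1 - 2 * alpha) * sqrt(n))
  c(t_mean - half, t_mean + half)
}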
EDIT:
Here I tried to construct the intervals using the wilcox.test function, but I don't know whether it accurately constructs the interval for the trimmed mean. Furthermore, no matter which alpha I use, I get the same interval for the given data set. So I suppose the subset argument is wrong.
set_seed(1)
data <- rnorm(100)
wilcox_test <- function(data, alpha){
n <- length(alpha)
a <- floor(alpha*n)+1
b <- n-floor(alpha)
wilcox.test(data, subset = data[a:b], conf.int = TRUE)
}
OK...with rnorm(100) and set.seed(1)
Close-ish...
set.seed(1) # note set.seed() is what you want here, I think.
data <- rnorm(100)
wilcox_test_out <- wilcox.test(data, subset = data[a:b], conf.int = TRUE, conf.level = 0.95)  # a and b as defined in the question's function; conf.int is logical, the level goes in conf.level
summary(wilcox_test_out)
# Note the CI's are in wilcox_test_out$conf.int for further use should you need them
wilcox_test_out$conf.int
I'm trying to calculate confidence intervals for all rows in a dataframe
I've been using something like this for a normal distribution:
function(x){
mean(x)+c(-1.96,1.96)*sd(x)/sqrt(length(x))
}
Any advice on how to modify the above for a t-distribution? Thanks a lot in advance.
Try the below. I broke it down into parts, but also provided a function similar to yours.
## In parts
n <- 25
x <- sample(1:100, n, replace = TRUE)
xbar <- mean(x)   # sample mean (avoid masking base mean())
s <- sd(x)        # sample standard deviation (avoid masking base sd())
error <- qt(0.975, df = n - 1) * s / sqrt(n)
lower <- xbar - error
upper <- xbar + error
# As function (self-contained: n and sd are computed from x rather than taken from the workspace)
example_function <- function(x){
  n <- length(x)
  error <- qt(0.975, df = n - 1) * sd(x) / sqrt(n)
  mean(x) + c(-error, error)
}
example_function(x)
# > lower
# [1] 37.23457
# > upper
# [1] 60.76543
# > example_function(x)
# [1] 37.23457 60.76543
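For a cross-check (not part of the original answer), base R's t.test() returns the same interval, and can be applied row-wise to a data frame as the question asks:
t.test(x, conf.level = 0.95)$conf.int   # same limits as above
# For every row of a numeric data frame df (df is a hypothetical example here):
# t(apply(df, 1, function(r) t.test(r)$conf.int))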
Suppose that I want to solve a function containing two integrals like this (this is an example; the actual function is uglier):
[Image: double-integral expression]
where a and b are the boundaries, c and d are known parameters, and f(x) and F(x) are the density and distribution function of the random variable x. In my problem f(x) and F(x) are estimated nonparametrically, so I know their values only at certain specific values of x. How would you set up the integral?
I did:
# Create the data
val <- runif(300, min=1, max = 10) #use the uniform distribution
CDF <- (val - 1)/(10 - 1)
pdf <- 1 / (10 - 1)
data <- data.frame(val = val, CDF = CDF, pdf = pdf)
c = 2
d = 1
# Inner integral
integrand1 <- function(x) {
i <- which.min(abs(x - data$val))
FF <- data$CDF[i]
ff <- data$pdf[i]
(1 - FF)^(c/d) * ff
}
# Vectorize the inner integral
Integrand1 <- Vectorize(integrand1)
# Outer integral
integrand2 <- function(x){
i <- which.min(abs(x - data$val))
FF <- data$CDF[i]
ff <- data$pdf[i]
(quadgk(Integrand1, x, 10) / FF) * c * ff
}
# Vectorize the outer integral
Integrand2 <- Vectorize(integrand2)
# Solve
require(pracma)
quadgk(Integrand2, 1, 10)
The integral is extremely slow. Is there a better way to solve this? Thank you.
---------EDIT---------
In my problem the pdf and CDF are computed from a vector of values v as follows:
# Create the original data
v <- runif(300, min = 1, max = 10)
require(np)
# Compute the CDF and pdf
v.CDF.bw <- npudistbw(dat = v, bandwidth.compute = TRUE, ckertype = "gaussian")
v.pdf.bw <- npudensbw(dat = v, bandwidth.compute = TRUE, ckertype = "gaussian")
# Extend v on a grid (I add this step because the v vector in my data
# is not very large. In this way I approximate the estimated pdf and CDF
# on a grid)
val <- seq(from = min(v), to = max(v), length.out = 1000)
data <- data.frame(val)
CDF <- npudist(bws = v.CDF.bw, newdata = data$val, edat = data )
pdf <- npudens(bws = v.pdf.bw, newdata = data$val, edat = data )
data$CDF <- CDF$dist
data$pdf <- pdf$dens
Have you considered using approxfun?
It takes vectors x and y and gives you a function that linearly interpolates between those. So for example, try
x <- runif(1000)+runif(1000)+2*(runif(1000)^2)
dx <- density(x)
fa <- approxfun(dx$x,dx$y)
curve(fa,0,2)
fa(0.4)
You should be able to call it using your gridded evaluations. It may be faster than what you're doing (as well as more accurate).
(edit: yes, as you say, splinefun should be fine if its fast enough for your needs)
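Applied to the setup in the question, a minimal sketch (my own, assuming the data frame data with columns val, CDF and pdf from the edit, the constants c and d defined there, and pracma loaded for quadgk; F_hat and f_hat are illustrative names):
F_hat <- approxfun(data$val, data$CDF, rule = 2)  # rule = 2 clamps values outside the grid
f_hat <- approxfun(data$val, data$pdf, rule = 2)
integrand1 <- function(x) (1 - F_hat(x))^(c/d) * f_hat(x)                         # already vectorized
integrand2 <- function(x) (quadgk(integrand1, x, 10) / F_hat(x)) * c * f_hat(x)   # inner integral inside, so not vectorized
quadgk(Vectorize(integrand2), 1, 10)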
I would like to compute the convolution of two probability distributions in R and I need some help. For the sake of simplicity, let's say I have a variable x that is normally distributed with mean = 1.0 and stdev = 0.5, and y that is log-normally distributed with mean = 1.5 and stdev = 0.75. I want to determine z = x + y. I understand that the distribution of z is not known a priori.
As an aside, the real-world example I am working with requires the addition of two random variables that are distributed according to a number of different distributions.
Does anyone know how to add two random variables by convolving the probability density functions of x and y?
I have tried generating n normally distributed random values (with above parameters) and adding them to n log-normally distributed random values. However, I wish to know if I can use the convolution method instead. Any help would be greatly appreciated.
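For reference, the convolution being asked about is f_Z(z) = ∫ f_X(x) · f_Y(z − x) dx over the whole real line; the answers below evaluate this integral numerically.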
EDIT
Thank you for these answers. I define a pdf and try to do the convolution integral, but R complains at the integration step. My pdfs are Log Pearson 3 and are as follows:
dlp3 <- function(x, a, b, g) {
p1 <- 1/(x*abs(b) * gamma(a))
p2 <- ((log(x)-g)/b)^(a-1)
p3 <- exp(-1* (log(x)-g) / b)
d <- p1 * p2 * p3
return(d)
}
f.m <- function(x) dlp3(x,3.2594,-0.18218,0.53441)
f.s <- function(x) dlp3(x,9.5645,-0.07676,1.184)
f.t <- function(z) integrate(function(x,z) f.s(z-x)*f.m(x),-Inf,Inf,z)$value
f.t <- Vectorize(f.t)
integrate(f.t, lower = 0, upper = 3.6)
R complains at the last step since the f.t function is bounded and my integration limits are probably not correct. Any ideas on how to solve this?
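One possible direction (a sketch of mine, not from the thread): for these parameters the Log Pearson 3 density is only defined where x > 0 and (log(x) − g)/b ≥ 0, so integrating over (-Inf, Inf) feeds it values where it returns NaN. Wrapping it so it returns 0 outside that support, and restricting the inner integration to that support (here roughly (0, exp(g)]), may help integrate() converge:
dlp3_safe <- function(x, a, b, g) {
  out <- numeric(length(x))
  ok <- x > 0
  ok[ok] <- (log(x[ok]) - g) / b >= 0   # support of this parameterization (b < 0 here)
  out[ok] <- dlp3(x[ok], a, b, g)
  out
}
f.m.safe <- function(x) dlp3_safe(x, 3.2594, -0.18218, 0.53441)
f.s.safe <- function(x) dlp3_safe(x, 9.5645, -0.07676, 1.184)
f.t <- Vectorize(function(z) integrate(function(x) f.s.safe(z - x) * f.m.safe(x),
                                       lower = 0, upper = exp(0.53441))$value)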
Here is one way.
f.X <- function(x) dnorm(x,1,0.5) # normal (mu=1, sigma=0.5)
f.Y <- function(y) dlnorm(y,1.5, 0.75) # log-normal (mu=1.5, sigma=0.75)
# convolution integral
f.Z <- function(z) integrate(function(x,z) f.Y(z-x)*f.X(x),-Inf,Inf,z)$value
f.Z <- Vectorize(f.Z) # need to vectorize the resulting fn.
set.seed(1) # for reproducible example
X <- rnorm(1000,1,0.5)
Y <- rlnorm(1000,1.5,0.75)
Z <- X + Y
# compare the methods
hist(Z,freq=F,breaks=50, xlim=c(0,30))
z <- seq(0,50,0.01)
lines(z,f.Z(z),lty=2,col="red")
Same thing using package distr.
library(distr)
N <- Norm(mean=1, sd=0.5) # N is signature for normal dist
L <- Lnorm(meanlog=1.5,sdlog=0.75) # same for log-normal
conv <- convpow(L+N,1) # object of class AbscontDistribution
f.Z <- d(conv) # density function
hist(Z,freq=F,breaks=50, xlim=c(0,30))
z <- seq(0,50,0.01)
lines(z,f.Z(z),lty=2,col="red")
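As a side note (not in the original answer), the same conv object also exposes the cdf and quantile function through the distr accessors p() and q():
p(conv)(5)    # P(Z <= 5)
q(conv)(0.5)  # median of Z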
I was having trouble getting integrate() to work for different density parameters, so I came up with an alternative to @jlhoward's answer using Riemann approximation:
set.seed(1)
#densities to be convolved. could also put these in the function below
d1 <- function(x) dnorm(x,1,0.5) #
d2 <- function(y) dlnorm(y,1.5, 0.75)
#Riemann approximation of convolution
conv <- function(t, a, b, d) { # [a, b] must cover the support of the densities above; d needs to be small for an accurate approximation
  x <- seq(a, b, d)
  z <- numeric(length(t))
  for (i in 1:length(t)){
    z[i] <- sum(d1(x) * d2(t[i] - x) * d)   # Riemann sum over x for each t[i]
  }
  return(z)
}
#check against sampled convolution
X <- rnorm(1000, 1, 0.5)
Y <- rlnorm(1000, 1.5, 0.75)
Z <- X + Y
t <- seq(0, 50, 0.05) #range to evaluate t, smaller increment -> smoother curve
hist(Z, breaks = 50, freq = F, xlim = c(0,30))
lines(t, conv(t, -100, 100, 0.1), type = "s", col = "red")
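The explicit loop can also be replaced with outer() (my own variation on the same Riemann idea, giving identical results):
conv2 <- function(t, a, b, d) {
  x <- seq(a, b, d)
  colSums(d1(x) * outer(x, t, function(x, t) d2(t - x)) * d)
}
lines(t, conv2(t, -100, 100, 0.1), col = "blue", lty = 3)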
I want to find the mean of the standard normal distribution in a given interval.
For example, if I divide the standard normal distribution into two halves ([-Inf, 0] and [0, Inf]), I want to get the mean of each half.
The following code does almost exactly what I want:
divide <- 2
boundaries <- qnorm(seq(0,1,length.out=divide+1))
t <- sort(rnorm(100000))
means.1 <- rep(NA,divide)
for (i in 1:divide) {
means.1[i] <- mean(t[(t>boundaries[i])&(t<boundaries[i+1])])
}
But I need a more precise (and elegant) method to calculate these numbers (means.1).
I tried the following code, but it did not work (maybe because of my lack of probability knowledge).
divide <- 2
boundaries <- qnorm(seq(0,1,length.out=divide+1))
means.2 <- rep(NA,divide)
f <- function(x) {x*dnorm(x)}
for (i in 1:divide) {
means.2[i] <- integrate(f,lower=boundaries[i],upper=boundaries[i+1])$value
}
Any ideas?
Thanks in advance.
The problem is that the integral of dnorm(x) over the interval (-Inf, 0) isn't 1; that's why you got the wrong answer. To correct it, you must divide the result you got by 0.5 (the value of that integral). Like this:
func <- function(x, ...) x * dnorm(x, ...)
integrate(func, -Inf, 0, mean=0, sd=1)$value / (pnorm(0, mean=0, sd=1) - pnorm(-Inf, mean=0, sd=1))
Adapting it to different intervals should be easy.
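A small general wrapper along these lines (my own illustration; cond_mean is a hypothetical name):
cond_mean <- function(lb, ub, mean = 0, sd = 1) {
  integrate(function(x) x * dnorm(x, mean, sd), lb, ub)$value /
    (pnorm(ub, mean, sd) - pnorm(lb, mean, sd))
}
cond_mean(-Inf, 0)  # approximately -0.798, the mean of the left half of the standard normal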
Thanks for answering my question.
I combined all the answers as I understand them:
divide <- 5
boundaries <- qnorm(seq(0,1,length.out=divide+1))
# My original thinking
t <- sort(rnorm(1e6))
means.1 <- rep(NA,divide)
for (i in 1:divide) {
means.1[i] <- mean(t[((t>boundaries[i])&(t<boundaries[i+1]))])
}
# Based on @DWin
t <- sort(rnorm(1e6))
means.2 <- tapply(t, findInterval(t, boundaries), mean)
# Based on @Rcoster
means.3 <- rep(NA,divide)
f <- function(x, ...) x * dnorm(x, ...)
for (i in 1:divide) {
means.3[i] <- integrate(f, boundaries[i], boundaries[i+1])$value / (pnorm(boundaries[i+1]) - pnorm(boundaries[i]))
}
# Based on @Kith
t <- sort(rnorm(1e6))
means.4 <- rep(NA,divide)
for (i in 1:divide) {
means.4[i] <- fitdistr(t[t > boundaries[i] & t < boundaries[i+1]], densfun="normal")$estimate[1]
}
Results
> means.1
[1] -1.4004895486 -0.5323784986 -0.0002590746 0.5313539906 1.3978177100
> means.2
[1] -1.3993590768 -0.5329465789 -0.0002875593 0.5321381745 1.3990997391
> means.3
[1] -1.399810e+00 -5.319031e-01 1.389222e-16 5.319031e-01 1.399810e+00
> means.4
[1] -1.399057073 -0.531946615 -0.000250952 0.531615180 1.400086731
I believe @Rcoster's is the one that I wanted. The rest are innovative approaches compared to mine, but still approximate.
Thanks.
You can use a combination of fitdistr and vector indexing.
Here's an example of how to get mean and std of just the positive values:
library("MASS")
x = rnorm(10000)
fitdistr(x[x > 0], densfun="normal")
or just the values in the interval (0,2):
fitdistr(x[x > 0 & x < 2], densfun="normal")
Let's say your cutpoints are -1, 0, 1, and 2, and you are interested in the mean of each section when simulating a standard normal.
samp <- rnorm(1e5)
(res <- tapply(samp, findInterval(samp, c( -1, 0, 1, 2)), mean) )
# 0 1 2 3 4
#-1.5164151 -0.4585519 0.4608587 1.3836470 2.3824633
Please do note that the labeling could be improved. One improvement could be:
names(res) <- paste("[", c(-Inf, -1, 0, 1, 2, Inf)[-6], " , ",
c(-Inf, -1, 0, 1, 2, Inf)[-1], ")", sep="")
> res
[-Inf , -1) [-1 , 0) [0 , 1) [1 , 2) [2 , Inf)
-1.5278185 -0.4623743 0.4621885 1.3834442 2.3835116
Using the distrEx and distr packages:
library(distrEx)
E(Truncate(Norm(mean=0, sd=1), lower=0, upper=Inf))
# [1] 0.797884
(See vignette(distr) in the distrDoc package for an excellent overview of the suite of distr and related packages.)
Or, using just base R, here's an alternative that constructs a discrete approximation of the expectation within the interval between lb and ub. The bases of the approximating rectangles are adjusted so that they all have equal areas (i.e. so that the probability of a point falling in each one of them is identical).
intervalMean <- function(lb, ub, n=1e5, ...) {
## Get x-values at n evenly-spaced quantiles between lower and upper bounds
xx <- qnorm(seq(pnorm(lb, ...), pnorm(ub, ...), length = n), ...)
## Calculate expectation
mean(xx[is.finite(xx)])
}
## Your example
intervalMean(lb=0, ub=1)
# [1] 0.4598626
## The mean of the complete normal distribution
intervalMean(-Inf, Inf)
## [1] -6.141351e-17
## Right half of standard normal distribution
intervalMean(lb=0, ub=Inf)
# [1] 0.7978606
## Right half of normal distribution with mean 0 and standard deviation 100
intervalMean(lb=0, ub=Inf, mean=0, sd=100)
# [1] 79.78606
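For comparison (not from the original answer), these values also follow from the closed-form truncated-normal mean: for Normal(mean, sd) restricted to [lb, ub] it is mean + sd * (dnorm(alpha) - dnorm(beta)) / (pnorm(beta) - pnorm(alpha)), with alpha = (lb - mean)/sd and beta = (ub - mean)/sd.
## Right half of the standard normal
dnorm(0) / (1 - pnorm(0))
# [1] 0.7978846
## Right half of Normal(0, 100)
100 * dnorm(0) / (1 - pnorm(0))
# [1] 79.78846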