R, rounding, ceiling and floors

Suppose that you have a bunch of data returned from pnorm(), such that you've got numbers between .0003ish and .9999ish.
numbers <- round(rnorm(n = 10000, mean = 100, sd = 15))
percentiles <- pnorm(numbers, mean = 100, sd = 15)*100
And then further suppose that one is interested in rounding the percentiles such that .0003 or whatever will come out to 1 (so ceiling()), but 99.999 will come out to 99 (so floor()).
I guess what I'm looking for is a round() that somehow brilliantly knows to reverse itself in the extreme cases, but as far as I know, no such thing exists. Am I going to have to ugly it up with an if statement? Is there a better method of handling such a thing?

You could use round and force things into 1 or 99 at the extremities using pmin and pmax:
pmax(1, pmin(99, round(percentiles)))
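A quick illustration (with made-up values): the clamp leaves mid-range values to ordinary rounding and only overrides the extremes:
pmax(1, pmin(99, round(c(0.03, 50.4, 99.99))))
#> [1]  1 50 99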

Related

Generate random decimal numbers with given mean in given range in R

Hey, I want to generate 100 decimal numbers in the range of 10 to 50 with a mean of 32.2.
I can use this to generate the numbers in the wanted range, but I don't get the mean:
runif(100, min=10, max=50)
Or I could use this, but then I don't get the range:
rnorm(100,mean=32.2,sd=10)
How can I combine those two or can I use another function?
I have tried to use this approach:
R - random distribution with predefined min, max, mean, and sd values
But I don't get the exact mean I want... (31.7 in my example run):
n <- 100
y <- rgbeta(n, mean = 32.2, var = 200, min = 10, max = 50)  # rgbeta() comes from the linked answer, not base R
Edit: OK, I have lowered the var and the mean gets close to 32.2, but I still want some values near the min and max of the range...
In order to get random numbers between 10 and 50 with a (true) mean of 32.2, you need a density function that fulfills those properties.
A uniform distribution with a min of 10 and a max of 50 (runif) will never deliver that mean, as the true mean of that distribution is 30.
The normal distribution has support from -infinity to infinity, independent of its mean, so rnorm will return numbers greater than 50 and smaller than 10.
You could use a truncated normal distribution
rnormTrunc(n = 100, mean = 32.2, sd = 1, min = 10, max = 50),
if that distribution would be okay. If you need a different distribution, things will get a little more complicated.
Edit: feel free to ask if you need the math behind that, but depending on what your density function should look like, it can get very complicated.
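For reference, a minimal runnable sketch of that suggestion; I'm assuming the rnormTrunc above is the one from the EnvStats package:
library(EnvStats)  # assumed source of rnormTrunc
set.seed(1)
x <- rnormTrunc(n = 100, mean = 32.2, sd = 10, min = 10, max = 50)
range(x)  # guaranteed to stay inside [10, 50]
mean(x)   # near 32.2, though truncation pulls the mean slightly toward the interval midpoint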
This isn't perfect, but maybe it's a start. I can't get the range to work out perfectly, so I just played with the "max" until I got an output I was happy with. There is probably a more solid mathematical way to do this. The result is uniform-adjacent... at best...
rand_unif_constrained <- function(num, min, max, mean) {
  vec <- runif(num, min, max)
  # rescale so the sample mean is exactly `mean`; this also stretches the range,
  # which is why `max` has to be tuned by hand
  vec / sum(vec) * mean * num
}
set.seed(35)
test <- rand_unif_constrained(100, 10, 40, 32.2) # play with max until the max output is less than 50
mean(test)
#> [1] 32.2
min(test)
#> [1] 12.48274
max(test)
#> [1] 48.345
hist(test)

Why does my function only sometimes work?

I have the following function:
library(dplyr)  # provides between()
samp315 <- function(n = 30, desmean = 86, distance = 3.4995) {
  x <- seq(from = 0, to = 100, by = 0.1)
  samp <- 0
  while (!between(mean(samp), desmean - distance, desmean + distance)) samp <- sample(x, n, replace = TRUE)
  samp
}
percent <- samp315()
so pretty much I want to generate 30 numbers within 0-100 that have a mean of 86 +/- 3.4995; however, whenever I run the last line it will load forever, or, when I am lucky, it will generate a list of desired results. Any idea how I could change the function to improve it?
As suggested by Parfait in the comments, you're using a randomization strategy that has a low probability of satisfying the condition you're interested in. Did no other answer to this question help you out?
Some other possible strategies for you to try out.
n = 30
# Using truncated normal
library(truncnorm)
x = round(rtruncnorm(n, a = -0.0495, b = 100.0495, mean = 85, sd = 3.5*2), 1)
# Using beta
sig = 3
x = round(100*rbeta(n, (0.85)*sig, (1-0.85)*sig), 1)
The round(..., 1) is meant to align with your vector x. These methods would both have very few values far from 85. It's a trade-off you have to consider: if you want a mean of 85 +/- 3.5, then you can't have too many values below 10, for example, so you have to lower the probability of such values being selected. Using your function, when it eventually completes, you'll probably find that values closer to 85 are more heavily represented.
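As a rough check (my own illustration, not part of the answer), the means of large draws from both strategies land near 85:
library(truncnorm)
set.seed(1)
mean(rtruncnorm(1e5, a = -0.0495, b = 100.0495, mean = 85, sd = 3.5 * 2))
# ~85 (slightly below, since the upper tail is truncated more than the lower)
mean(100 * rbeta(1e5, 0.85 * 3, (1 - 0.85) * 3))
# ~85 (the scaled beta has mean 100 * 0.85 in expectation)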

Creating a binary variable with probability in R

I'm trying to make a variable, Var, that takes the value 0 60% of the time, and 1 otherwise, with 50,000 observations.
For a normally distributed variable, I remember doing the following to define n:
Var <- rnorm(50000, 0, 1)
Is there a way I could combine an ifelse command with the above to specify the number of n as well as the probability of Var being 0?
I would use rbinom like this:
n_ <- 50000
p_ <- 0.4 # the probability of 1s
Var <- rbinom(n=n_, size=1, prob=p_)
By using variables, you can change the size and/or probability just by changing those variables. Hope that's what you are looking for.
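A quick sanity check on the draw (illustrative):
mean(Var)   # should be close to 0.4, the probability of a 1
table(Var)  # roughly 60% zeros, 40% ones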
If by 60% you mean a probability equal to 0.6 (rather than an empirical frequency), then
Var <- sample(0:1, 50000, prob = c(6, 4), replace = TRUE)
gives a desired sequence of independent draws with P(Var = 0) = 0.6 (sample normalizes the prob weights, so c(6, 4) is equivalent to c(0.6, 0.4)).
I'm picking nits here, but it actually isn't completely clear exactly what you want.
Do you want to simulate a sample of 50000 from the distribution you describe?
Or, do you want 50000 replications of simulating an observation from the distribution you describe?
These are different things that, in my opinion, should be approached differently.
To simulate a sample of size 50000 from that distribution you would use:
sample(c(0,1), size = 50000, prob = c(0.6, 0.4), replace = TRUE)
To replicate 50000 simulations of sampling from the distribution you describe I would recommend:
replicate(50000, sample(c(0,1), size = 1, prob = c(0.6, 0.4)))
This might seem silly, since these two lines of code draw from exactly the same distribution in this case.
But suppose your goal was to investigate properties of samples of size 50000? Then you would use a bunch (say, 1000) of replications of that first line of code, wrapped inside replicate:
replicate(1000, sample(c(0,1), size = 50000, prob = c(0.6, 0.4), replace = TRUE))
I hope I haven't been too pedantic about this. Having seen simulations go awry, I have come to believe that one should keep the thing being simulated separate from the number of simulations you decide to run. The former is fundamental to your problem, while the latter only affects the accuracy of the simulation study and how long it takes.
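To make that concrete, here is a scaled-down sketch (smaller numbers than the question's, so it runs quickly) of investigating the sampling distribution of the proportion of 1s:
sims <- replicate(1000, sample(c(0, 1), size = 500, prob = c(0.6, 0.4), replace = TRUE))
# each column is one simulated sample; the column means are centered near 0.4
hist(colMeans(sims))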

Why does runif() not produce the interval maximum value?

I was responding to a question posed over at Reddit AskScience and came across something odd with respect to the functionality of runif(). I was attempting to sample a set from 1 to 52 uniformly. My first thought was to use runif():
as.integer(runif(n, min = 1, max = 52))
However, I found that the operation never produced a value of 52. For example:
length(unique(as.integer(runif(1000000, 1, 52))))
[1] 51
For my purposes, I just turned to sample() instead:
sample(52, n, replace = TRUE)
In the runif() documentation it states:
runif will not generate either of the extreme values unless max = min or max-min is small compared to min, and in particular not for the default arguments.
I'm wondering why runif() acts this way. It seems like it should be able to produce the 'extreme values' of the set if it's attempting to generate samples uniformly. Is this a feature, and why?
This is indeed a feature. The C source of runif contains the following:
/* This is true of all builtin generators, but protect against
user-supplied ones */
do {u = unif_rand();} while (u <= 0 || u >= 1);
return a + (b - a) * u;
This implies that unif_rand() could return 0 or 1, but runif() is engineered to skip those (unlikely) cases.
My guess would be that this is done to protect user code that would fail in the edge cases (values exactly on the boundaries of the range).
This feature was implemented by Brian Ripley on Sep 19 2006 (from the comments it seems that 0<u<1 is automatically true of the built-in uniform generator, but might not be true for user-supplied ones).
sample(1:52,size=n,replace=TRUE) is an idiomatic (although not necessarily the most efficient) way to achieve your goal.
as.integer works like trunc: it forms an integer by truncating the given value toward 0. And since values can't reach 52 (see Ben's answer), they will always be truncated to a value between 1 and 51.
You would see a different result with floor (or ceiling). Note that you have to adjust the max of runif by adding 1 (or adjust the min in the case of ceiling). Also note that in this case, since both min and max are above 0, you could replace floor with trunc or as.integer too.
set.seed(42)
x = floor(runif(n = 1000000, min = 1, max = 52 + 1))
plot(prop.table(table(x)), las = 2, cex.axis = 0.75)
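For completeness, a sketch of the ceiling variant mentioned above, with min lowered by 1 instead of max raised:
y = ceiling(runif(n = 1000000, min = 1 - 1, max = 52))
# runif never returns the endpoints exactly, so ceiling over (0, 52) yields 1 through 52
length(unique(y))
#> [1] 52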
as.integer(51.999)
#> [1] 51
This is because of how as.integer works.
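To see the truncation behaviour (my own illustration), compare trunc and floor on a negative value; for positive inputs they agree, which is why the floor answer above could use trunc as well:
trunc(-1.5)      # toward zero
#> [1] -1
floor(-1.5)      # downward
#> [1] -2
as.integer(-1.5) # behaves like trunc
#> [1] -1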
If you want to draw from a discrete distribution, then use sample. runif is not for discrete distributions.

How to get the intervals for ntile()

I was trying to figure out whether there is a way to get the intervals used when ntile() is applied.
I have a sample that I want to use as a basis for getting the percentile values of a larger sample, and I was hoping to find a way to get the values of the intervals used by ntile().
Any enlightenment on this would be appreciated.
I really want to put this as a comment, but I still can't comment.
How about using quantile to generate the interval, like this:
# create fake data; 100 samples randomly picked from 1 to 500
fakeData <- runif(100, 1, 500)
# create percentile values; tweak the probs to specify the quantiles that you want
# (length.out = 101 gives breakpoints for 100 percentile bins)
x <- quantile(fakeData, probs = seq(0, 1, length.out = 100))
Then you can apply those intervals to the larger data set (i.e., using cut, which should give results comparable to dplyr's ntile).
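As a sketch of that last step (my illustration; the decile choice and the new sample are made up), take the breakpoints from quantile and bin the larger data set with cut:
# 11 probabilities give breakpoints for 10 bins (deciles)
breaks <- quantile(fakeData, probs = seq(0, 1, length.out = 11))
largerSample <- runif(10000, 1, 500)
bins <- cut(largerSample, breaks = breaks, include.lowest = TRUE, labels = FALSE)
table(bins)  # counts per interval; values outside the break range come back as NA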
