Percentile rank is frequently defined by the following formula:
Percentile rank = (L/N)*100
L=Number of values in dataset lower than or equal to value of interest
N=number of data points
In R, it is common to calculate percentile rank of values in a vector by
Percentile_Rank = rank(vec)/length(vec)*100
However, I would like to use a slightly modified definition of percentile rank, which is defined by the same formula as above but
L = Number of values in dataset strictly lower than the value of interest
This is similar to the PERCENTRANK.EXC function in Excel.
Is there a function built into R to calculate this? Otherwise, how can I do it?
Is this what you're looking for?
y = 1:10
# traditional percentile
rank(y)/length(y) * 100
# [1] 10 20 30 40 50 60 70 80 90 100
# percentile considering those values preceding current value
vapply(y, function(x){
sum(y < x)/length(y) * 100
}, FUN.VALUE = numeric(1L))
# [1] 0 10 20 30 40 50 60 70 80 90
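The same strict count can also be obtained without an explicit loop. The following is a minimal sketch, assuming the vector has no missing values; the helper name strict_percentile_rank is made up for illustration. It relies on rank() with ties.method = "min", which gives each value one plus the number of values strictly below it, so subtracting 1 recovers L.

```r
# Hypothetical helper: percentile rank based on the count of strictly lower values.
strict_percentile_rank <- function(vec) {
  # rank(..., ties.method = "min") equals 1 + (number of values strictly below)
  (rank(vec, ties.method = "min") - 1) / length(vec) * 100
}

y <- c(3, 1, 2, 2, 5)
strict_percentile_rank(y)
# [1] 60  0 20 20 80
```

Tied values (the two 2s here) receive the same percentile rank, which matches the sum(y < x) approach above.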
I have this histogram:
I am wondering what it means to find the empirical relative frequency of, say, salaries < 85000. I understand that empirical frequency is the number of occurrences in the data set.
Is this correct:
x = our_sample[our_sample$Salaries < 85000, ] # x has 292 rows
result <- 292 / length(our_sample) # our_sample has 500 samples
result
I am not sure which functions to use to find this.
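One way to compute this, sketched under the assumption that our_sample is a data frame with a numeric Salaries column (the data below are made up for illustration): the empirical relative frequency is the number of matching observations divided by the number of rows. Note that length() on a data frame returns the number of columns, so nrow() is the safer denominator. With 292 matching rows out of 500, the result would be 292 / 500 = 0.584.

```r
# Made-up data standing in for our_sample (500 rows, one Salaries column).
set.seed(42)
our_sample <- data.frame(Salaries = rnorm(500, mean = 80000, sd = 15000))

# For a data frame, length() counts columns; nrow() counts observations.
n_cols <- length(our_sample)
n_rows <- nrow(our_sample)

# Empirical relative frequency: share of observations below 85000.
# The mean of a logical vector is the proportion of TRUE values.
rel_freq <- mean(our_sample$Salaries < 85000)
rel_freq
```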
I'm calculating the evolution of particle diameters over time, and I'm trying to impose the condition that when a particle's diameter is less than or equal to a minimum diameter, the diameter is set to that fixed minimum value.
I tried an if condition but it is not working (code shown below). What I would like is that, from the first time the minimum diameter is reached, all later values are set to the minimum diameter, whatever they would otherwise be.
#p is my data frame and dp is diameter values
a <-p$diameter <- p$dp*((Te - p$t)/Te)^0.5
p$vol <- pi*(p$dp*1e-6)^3/6
#diam_min_ma is minimum diameter calculation
b <- diam_min_ma=(0.03*p$vol*6/pi)^(1/3)*1000000
c = if (a >= b)
{p$diameter=a}
else
{p$diameter=b}
p$diameter <- c
This is an example of the expected table (DpT1, ..., DpT7 are the diameters over time, and Dp min is the minimum diameter that can be reached):
DpT1  DpT2  DpT3  DpT4  DpT5  DpT6  DpT7
150   100   75    50    36    36    36    Dp min = 36 µm
100   60    45    30    28    28    28    Dp min = 28 µm
60    40    20    20    20    20    20    Dp min = 28 µm
In the end I found the answer, which was to use ifelse instead of if.
ifelse is vectorized, so it applies the condition to every row of the table instead of only the first one:
p$diameter <- ifelse(a >= b, a, b)
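For background, if() expects a single TRUE or FALSE, so it cannot test a whole column at once (from R 4.2.0 a condition of length greater than one is an error). Since the goal is simply "never go below the minimum", the element-wise maximum pmax() is an equivalent alternative to ifelse. The sketch below uses invented values for Te, dp and t; only the formulas are taken from the question.

```r
# Hypothetical inputs: three particles observed at time t, with lifetime Te.
Te <- 10
p <- data.frame(dp = c(150, 100, 60), t = c(2, 9.5, 9.9))

a <- p$dp * ((Te - p$t) / Te)^0.5          # shrinking diameter
p$vol <- pi * (p$dp * 1e-6)^3 / 6          # initial volume
b <- (0.03 * p$vol * 6 / pi)^(1/3) * 1e6   # minimum diameter per particle

# pmax() takes the element-wise maximum, so no diameter falls below its floor.
p$diameter <- pmax(a, b)
p
```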
I want to simulate the problem below in R and calculate the average probability based on 1000 simulations -
Scores on a test are normally distributed with mean 70 and std dev 10.
Estimate the probability that among 75 randomly selected students at least 22 score greater than 78
This is what I have done so far
set.seed(1)
scores = rnorm(1000,70,10)
head(scores)
hist(scores)
sm75=sample(scores,75)
length(sm75[sm75>78])/75
#[1] 0.1866667
However, this gives me only one iteration. I want 1000 iterations and then the average of those 1000 probabilities. I believe some kind of control structure such as a for loop could be used. Also, is there an easier way using the "apply" family of functions?
At the end of the day you are testing whether at least 22 students score higher than 78, which can be compactly computed with:
sum(rnorm(75, 70, 10) > 78) >= 22
Breaking this down a bit, rnorm(75, 70, 10) returns the 75 scores, which are normally distributed with mean 70 and standard deviation 10. rnorm(75, 70, 10) > 78 is a vector of length 75 that indicates whether or not each of these scores is above 78. sum(rnorm(75, 70, 10) > 78) converts each true to a 1 and each false to a 0 and sums these values up, meaning it counts the number of the 75 scores that exceed 78. Lastly we test whether the sum is 22 or higher with the full expression above.
replicate can be used to replicate this any number of times. So to see the breakdown of 1000 simulations, you can use the following 1-liner (after setting your random seed, of course):
set.seed(144)
table(replicate(1000, sum(rnorm(75, 70, 10) > 78) >= 22))
# FALSE TRUE
# 936 64
In 64 of the replicates, at least 22 students scored above a 78, so we estimate the probability to be 6.4%.
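As a cross-check on the simulated estimate, the probability can also be computed exactly, assuming the 75 students score independently: the number of students above 78 is then binomial with 75 trials and success probability P(score > 78). The variable names below are illustrative.

```r
# Probability that one student scores above 78 under N(70, 10).
p_above <- pnorm(78, mean = 70, sd = 10, lower.tail = FALSE)

# Exact P(at least 22 of 75 exceed 78) = P(X > 21) for X ~ Binomial(75, p_above).
exact <- pbinom(21, size = 75, prob = p_above, lower.tail = FALSE)

# Simulation estimate: the mean of the logical results is the share of TRUEs.
set.seed(144)
estimate <- mean(replicate(1000, sum(rnorm(75, 70, 10) > 78) >= 22))

c(exact = exact, estimate = estimate)
```

Taking mean() of the replicated logical values gives the estimated probability directly, without reading it off the table.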
Probability is calculated as the number of favourable outcomes divided by the total number of outcomes, so:
> scores <- sample(rnorm(1000,70,10),75)
> probability <- length(subset(scores,scores>78))/length(scores)
> probability
[1] 0.28
However, you want to do this 1000 times and then take the average.
> mean(replicate(1000, {scores<-sample(rnorm(1000,70,10),75);length(subset(scores,scores>78))/length(scores)}))
[1] 0.2133333
How do runif and sample differ in terms of the probability distribution they use? I know that runif gives fractional numbers and sample gives whole numbers, but what I am interested in is whether sample also uses the uniform probability distribution.
Consider the following code and output:
> set.seed(1)
> round(runif(10,1,100))
[1] 27 38 58 91 21 90 95 66 63 7
> set.seed(1)
> sample(1:100, 10, replace=TRUE)
[1] 27 38 58 91 21 90 95 67 63 7
This strongly suggests that, when asked to do the same thing, the two functions give essentially the same output (though interestingly it is round, rather than floor or ceiling, that gives the matching output here). The main differences are in the defaults; if you don't change those defaults, both draw from a uniform distribution (though sample draws from a discrete uniform and, by default, without replacement).
Edit
The more correct comparison is:
> ceiling(runif(10,0,100))
[1] 27 38 58 91 21 90 95 67 63 7
instead of using round.
We can even step that up a notch:
> set.seed(1)
> tmp1 <- sample(1:100, 1000, replace=TRUE)
> set.seed(1)
> tmp2 <- ceiling(runif(1000,0,100))
> all.equal(tmp1,tmp2)
[1] TRUE
Of course, if the prob argument to sample is used (with not all values equal), then it will no longer be uniform.
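One caveat, given as background rather than as part of the answer above: since R 3.6.0 the default for sample() is a rejection sampler, so on current versions the two sequences are not expected to match unless the older "Rounding" sampler is selected. The sketch below assumes R 3.6.0 or later and compares both settings; the variable names are illustrative.

```r
# Select the pre-3.6.0 sampler; R warns that it is slightly non-uniform.
old_kinds <- suppressWarnings(RNGkind(sample.kind = "Rounding"))

set.seed(1)
tmp1 <- sample(1:100, 1000, replace = TRUE)
set.seed(1)
tmp2 <- ceiling(runif(1000, 0, 100))
matches_old <- isTRUE(all.equal(tmp1, tmp2))

# Restore the sampler that was active before.
suppressWarnings(RNGkind(sample.kind = old_kinds[3]))

set.seed(1)
tmp3 <- sample(1:100, 1000, replace = TRUE)
matches_current <- isTRUE(all.equal(tmp3, tmp2))

c(rounding = matches_old, current = matches_current)
```

Either way, both functions still draw from a uniform distribution; only the mapping from the underlying random stream to integers differs.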
sample draws from a fixed set of inputs; if a single number n (at least 1) is passed as the first argument, it samples from the integers 1:n.
On the other hand, runif returns a sample from a real-valued range.
> sample(c(1,2,3), 1)
[1] 2
> runif(1, 1, 3)
[1] 1.448551
sample() runs faster than ceiling(runif())
This is useful to know if doing many simulations or bootstrapping.
A crude timing script that compares four equivalent approaches:
n <- 100   # sample size
m <- 10000 # simulations
system.time(sample(n, size = n*m, replace = TRUE)) # faster than ceiling/runif
system.time(ceiling(runif(n*m, 0, n)))
system.time(ceiling(n * runif(n*m)))
system.time(floor(runif(n*m, 1, n+1)))
The proportional time advantage increases with n and m, but take care not to fill memory!
Incidentally, don't use round() to convert a uniformly distributed continuous variable to a uniformly distributed integer, since the two end values get selected only half as often as they should.
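The round() pitfall can be seen by tabulating the results. A small sketch, with an arbitrary range of 1 to 10 and a sample size chosen only for illustration:

```r
set.seed(1)
N <- 1e5

# round(): 1 and 10 each cover only half an interval of [1, 10],
# so they appear about half as often as 2..9.
prop_round <- prop.table(table(round(runif(N, 1, 10))))

# floor() over [1, 11) gives each of 1..10 an interval of equal width.
prop_floor <- prop.table(table(floor(runif(N, 1, 11))))

round(prop_round, 3)
round(prop_floor, 3)
```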
Question:
Suppose the numbers in the following random number table correspond to people arriving for work at a large factory. Let 0, 1, and 2 be smokers and 3-9 be nonsmokers. After many arrivals, calculate the total relative frequency of smokers.
Here is my R code to simulate the total relative frequency of smokers:
simulation<-function(k){
x<-round(runif(k)*10)
return (length(x[x<3])/k)}
> simulation(100)
[1] 0.27
> simulation(1000)
[1] 0.244
> simulation(10000)
[1] 0.2445
> simulation(100000)
[1] 0.24923
Why can't I get the result 0.3?
If all you want to do is get a discrete uniform distribution on the numbers 0, 1, ..., 9, then just use sample:
sample(0:9, k, replace = TRUE)
With the code you have right now you'll actually get a probability of .05 each of getting 0 or 10 and a probability of .10 each of getting 1-9.
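Those probabilities explain the observed value: with round(runif(k) * 10), P(x < 3) = 0.05 + 0.10 + 0.10 = 0.25, which is what the simulations above converge to. A sketch of the simulation rewritten with sample, next to the original for comparison (the function names simulation_fixed and simulation_round are made up here):

```r
# Relative frequency of smokers (digits 0, 1, 2) among k simulated arrivals.
simulation_fixed <- function(k) {
  x <- sample(0:9, k, replace = TRUE)  # each digit 0..9 has probability 0.1
  mean(x < 3)
}

# The original round(runif(k) * 10) version, for comparison: it yields 0..10,
# with P(0) = 0.05 and P(1) = P(2) = 0.10, so P(x < 3) = 0.25 rather than 0.3.
simulation_round <- function(k) mean(round(runif(k) * 10) < 3)

set.seed(1)
c(fixed = simulation_fixed(1e5), round = simulation_round(1e5))
```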