I was wondering how I would convert Excel's percentile rank exclusive function to R. I found a technique here, which goes like this:
true_df <- data.frame(some_column = c(24516, 7174, 13594, 33838, 40000))
percentilerank <- function(x) {
  rx <- rle(sort(x))
  # counts of values strictly smaller / strictly larger than each unique value
  smaller <- cumsum(c(0, rx$lengths))[seq(length(rx$lengths))]
  larger <- rev(cumsum(c(0, rev(rx$lengths))))[-1]
  rxpr <- smaller / (smaller + larger)
  rxpr[match(x, rx$values)]
}
dfr<-percentilerank(true_df$some_column)
#output which is similar to =PERCENTRANK.INC and NOT =PERCENTRANK.EXC
#[1] 0.50 0.00 0.25 0.75 1.00
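For instance, 24516 has 2 smaller and 2 larger values in the array, so smaller/(smaller + larger) = 2/4 = 0.5, the first value in the output.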
But this is the equivalent of =PERCENTRANK.INC in R. According to the info popup in Excel, =PERCENTRANK.INC takes (array, x = value to rank, [significance, optional]) and returns the percentage rank inclusive of the first (0%) and last (100%) values in the array.
=PERCENTRANK.EXC is similar to its counterpart, but it returns the percentage rank exclusive of the first and last values in the array, meaning never exactly 0% or 100%.
Here is a small example using Excel to show the difference:
When I apply the above R function, it gives me output matching the PERCENTRANK.INC($A$32:$A$36,A32) column. How can I get the PERCENTRANK.EXC behaviour instead? I'm new to R.
Using dplyr:
library(dplyr)
x <- true_df$some_column
# inclusive
percent_rank(x)
# exclusive
percent_rank(c(-Inf, Inf, x))[-(1:2)]
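On the data above, the exclusive version should reproduce the desired ranks:
#[1] 0.5000000 0.1666667 0.3333333 0.6666667 0.8333333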
I messed around with the code and got this:
true_df <- data.frame(some_column = c(24516, 7174, 13594, 33838, 40000))
percentilerank <- function(x) {
  rx <- rle(sort(x))
  smaller <- cumsum(c(!0, rx$lengths))[seq(length(rx$lengths))]
  larger <- rev(cumsum(c(0, rev(rx$lengths))))
  rxpr <- smaller / (smaller + larger)
  rxpr[match(x, rx$values)]
}
dfr<-percentilerank(true_df$some_column)
#output now matches =PERCENTRANK.EXC
#[1] 0.5000000 0.1666667 0.3333333 0.6666667 0.8333333
Since 0% and 100% are not included in the exclusive percentile, I changed the line smaller<-cumsum(c(0.... to smaller<-cumsum(c(!0.... and, similarly, to get rid of the 100%, I removed the [-1] from the line larger<-...[-1].
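For comparison, here is a more direct sketch of the same idea (the function name percentrank_exc is mine), based on the Excel definition rank/(n + 1), using the lowest rank for ties:
percentrank_exc <- function(x) {
  # rank of each value = 1 + number of strictly smaller values
  (sapply(x, function(v) sum(x < v)) + 1) / (length(x) + 1)
}
percentrank_exc(true_df$some_column)
#[1] 0.5000000 0.1666667 0.3333333 0.6666667 0.8333333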
This is how to replicate PERCENTRANK.EXC with other native Excel formulas:
= Round(Rank/(N + 1) - 0.05%, 3)
Maybe that will help someone.
The 3 corresponds to the default significance in PERCENTRANK.EXC; change it as needed.
I'm trying to find the value which is the closest to 0.5 but not bigger than that, and I want to print another value from the same row.
For example, I have a table named 'exp' like this:
num possibility
1 0.16
2 0.43
5 0.64
4 0.12
3 0.76
...
And I'm trying to find: which possibility is the closest to, but smaller than, 0.5?
The answer is the second row, which contains num == 2 and possibility == 0.43.
But how can I find this with code?
I'm also trying to calculate the '+-2' range of the 'num' whose possibility is the closest to, and smaller than, '0.5'.
The num will surely be '5' and the range will be '3~7'.
But how can I do all of this at once, with the code linked together?
And what if I have many tables like exp1, exp2, exp3, exp4, ... that need the same work? How can I do this automatically?
Here is what I have tried:
exp[which.min(exp$possibility-0.5 <0) -1 , 1]
x < exp[which.min(exp$possibility-0.5 <0) -1 , 1]+2
& x> exp[which.min(exp$possibility-0.5 <0) -1 , 1]-2
This is my best attempt so far,
but I don't know why adding '<0' inside the 'which.min' call makes a difference, making it act like an 'ifelse', or how to find the 'closest smaller one' without using '-1' after the 'which.min' call.
Actually, I mostly want to know what simpler and more useful tools there are.
Please help.
You can try something like this. You can change the 3 to control how many of the nearest values you get back. You could also put this in a function and use lapply to iterate over all columns, as in the sketch after the code.
f <- data.frame(a = 1:10, b = runif(10))
cutoff <- 0.5
z <- f$b - cutoff
z <- ifelse(z > 0, 99, z)  # add this if you don't want values above 0.5
z <- abs(z)
z1 <- order(z)[1:3]        # indices of the 3 values nearest the cutoff
f$b[z1]
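Here is a rough sketch of the function + lapply idea (the function name closest_below and its defaults are mine):
closest_below <- function(v, cutoff = 0.5, n = 3) {
  d <- cutoff - v          # positive when v is below the cutoff
  d[d < 0] <- Inf          # discard values above the cutoff
  v[order(d)[1:n]]         # the n values closest to (and below) the cutoff
}
f <- data.frame(a = runif(10), b = runif(10))
lapply(f, closest_below)   # one result per column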
In your first expression (and similarly for the second one), when you do exp$possibility - 0.5 < 0, you get a logical vector, and when it is fed into which.min you are taking the minimum of a bunch of ones and zeros (TRUE and FALSE), which is not what you want.
which possibility is the closest to, but smaller than, 0.5?
There are many ways to achieve this. One is to first set the values larger than 0.5 to NA, which is what the ifelse does, and then take the row of the largest remaining possibility with which.max:
exp[which.max(ifelse(exp$possibility > 0.5, NA, exp$possibility)), ]
And I'm trying to calculate the '+-2' range of the 'num' whose possibility
is the closest to, and smaller than, '0.5'. The num will surely be '5'
and the range will be '3~7'.
You can store the number in a variable first ...
my.num <- exp[which.max(ifelse(exp$possibility > 0.5, NA, exp$possibility)), "num"]
... and subsequently retrieve the rows in that range with
exp[exp$num >= (my.num -2) & exp$num <= (my.num + 2), ]
or replace my.num with the first expression if you really want a one-liner.
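Putting it together on the example table from the question (reconstructed here), this picks out num == 2 and its +-2 range:
exp <- data.frame(num = c(1, 2, 5, 4, 3),
                  possibility = c(0.16, 0.43, 0.64, 0.12, 0.76))
my.num <- exp[which.max(ifelse(exp$possibility > 0.5, NA, exp$possibility)), "num"]
my.num
# [1] 2
exp[exp$num >= (my.num - 2) & exp$num <= (my.num + 2), ]
#   num possibility
# 1   1        0.16
# 2   2        0.43
# 4   4        0.12
# 5   3        0.76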
I am trying to learn how to write functions in R, and I have a very specific question regarding the use of table() and how to treat the "levels" of the tabulated variable.
My original problem is to write a cumulative hazard function. My function basically does this:
Example: the data x = c(1,1,2,2,2,3,14,25) has 8 observations/times.
From the vector of 8 observations, do the following operation: F(14) = 2/8 + 3/6 + 1/3 + 1/2,
F(2) = 2/8 + 3/6, and so on.
Basically I want the sum of: (how many observations have time i) / (how many observations have time greater than or equal to i), over all times i up to the given time.
So for i = 2, I have two fractions, 2/8 + 3/6, because there are 6 observations with time equal to 2 or more.
Specifically, I was using the function table(). However, this function gives me the frequencies and treats the value associated with each frequency as a factor level, not as a number.
For my data I have 5 levels: 1, 2, 3, 14, 25, but when I try to do operations like:
v <- c(1, 2, 3, 14, 25)
ta <- as.data.frame(table(v))
as.numeric(ta$v) < 14
[1] TRUE TRUE TRUE TRUE TRUE
However, I want the result to be TRUE TRUE TRUE FALSE FALSE. I want the values stored by table() to be treated as numbers.
How can I do that?
Just so you can see what I am doing, my extra code is below. It works well without the censoring, but this part is key for me to advance to the censored case.
cumh <- function(x, t, y = rep(1, length(x))) {
  le <- length(x)
  # Sum comparison of terms
  isum <- sum(x <= t)
  # Collapse table
  ta <- as.data.frame(table(x))
  ta$cum <- cumsum(ta$Freq)
  ta$den <- le
  for (j in 1:(nrow(ta) - 1)) {
    ta$den[j + 1] <- le - ta$cum[j]
  }
  ind <- isum >= ta$cum
  # correction for right censoring:
  ta2 <- as.data.frame(table(y * x))
  cumhaz <- sum(ind * ta2$Freq / ta$den)
  return(cumhaz)
}
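For example, cumh(c(1,1,2,2,2,3,14,25), 14) returns 2/8 + 3/6 + 1/3 + 1/2, which is about 1.583 and matches the hand calculation above.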
Here is one method using sapply and table
x <- c(1,1,2,2,2,3,14,25)
myTab <- table(x)
myTab / sapply(seq_along(myTab), function(i) sum(tail(c(0, myTab), -i)))
x
1 2 3 14 25
0.2500000 0.5000000 0.3333333 0.5000000 1.0000000
Here, tail successively removes values from the beginning of the table of counts (to which a 0 has been pre-pended), and the remaining values are summed together; sapply does this for each position, giving the number of observations with time greater than or equal to each value. Dividing the table by these sums returns the proportions.
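If you want the cumulative hazard F(t) itself rather than the per-time fractions, a small sketch building on this answer (the names hazard, times and F_t are mine) is to convert the table's names back to numbers and take a running sum:
x <- c(1, 1, 2, 2, 2, 3, 14, 25)
myTab <- table(x)
hazard <- myTab / sapply(seq_along(myTab), function(i) sum(tail(c(0, myTab), -i)))
# table() stores the observed times as names (factor levels);
# as.numeric(names(...)) or as.numeric(as.character(...)) recovers the numbers
times <- as.numeric(names(myTab))
F_t <- cumsum(hazard)
data.frame(time = times, F = as.numeric(F_t))
#   time         F
# 1    1 0.2500000
# 2    2 0.7500000
# 3    3 1.0833333
# 4   14 1.5833333
# 5   25 2.5833333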
I will try to explain my question well.
I have a table with some variables X, Y, Z, for example.
Each variable has numeric values.
So, let's say I have
RDIST RDENS AGR BLF
1 146 0.000 0 0.0
2 338 0.000 0 0.0
3 931 0.000 0 3.7
I'm trying to identify outliers, so I used dotchart.
But now I want to know, for each variable, in which observations the outliers occur.
With the list(x$BLF>3) command, I get a table of TRUE or FALSE values, but what I need to know is whether the outlier is in observation 2, 3, or 145.
I agree with @MrFlick: which() is the best way to go. If you want to take the next step and remove those outliers, you could do
x$BLF <- x$BLF[-which(x$BLF > 3)], which takes the indexes MrFlick was talking about and deletes those entries from the BLF column using the - operator; after that you store the result back into the same column. Actually, REALLY don't do what I said above, because if your data is stored in a data frame the replacement column must have as many rows as the data frame, so R will either throw an error or recycle the shorter vector to maintain the column length!
It is probably best to replace the outliers with NA, like this: x$BLF[which(x$BLF > 3)] <- NA. Or you could remove the entire row from your dataset, like this: x <- x[-which(x$BLF > 3), ]. The reason you have the comma now is that when you're dealing with a rectangular data frame you have to specify [row I want, column I want], so I just specify the rows I want deleted without specifying a column.
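To make that concrete, here is a tiny sketch on the example data from the question (numbers reconstructed from the post):
x <- data.frame(RDIST = c(146, 338, 931),
                RDENS = c(0, 0, 0),
                AGR   = c(0, 0, 0),
                BLF   = c(0, 0, 3.7))
which(x$BLF > 3)               # 3: the outlying observation is row 3
x$BLF[which(x$BLF > 3)] <- NA  # flag it as NA, or ...
# x <- x[-which(x$BLF > 3), ]  # ... drop the whole row instead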
Probably more than you wanted, but I thought it might help.
That's it! Thank you both!
For now, I just want to identify the outliers for each variable. That command totally solved my problem.
Thank you
This seems like a simple problem, but for some reason I haven't been able to find a solution.
I have a matrix of probabilities that sum to 1, and I want to know at which value I have a cumulative sum of, for example, 0.5. In other words, if I turned this matrix into a sorted vector, how far do I have to go from the highest value to get a cumulative sum of 0.5?
I transformed my matrix into a vector of values and used plot(cumsum(x)) to produce the following graph:
I can do something like
P<-ecdf(x)
P(0.00001)
to get the cumulative sum at an x value of 0.00001, but I want to go in the other direction, i.e. what is the x value at a cumulative sum of 0.5?
quantile() gives me the value at 50% of the ordered values (e.g. it would give me the value of sort(x)[4e+05] in the graph above), which is not what I'm after.
Thanks for your help with this seemingly simple question!
Cheers,
Josh
Solution:
x[max(which(cumsum(x)<=0.5))]
gives the value at the cumulative sum of 0.5 (thanks @plafort), although it seems as though there should be an easier way!
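For example (a toy sketch with made-up data, sorting from the largest value down):
set.seed(1)
m <- matrix(runif(16), 4, 4)
m <- m / sum(m)                  # probabilities that sum to 1
x <- sort(as.vector(m), decreasing = TRUE)
x[max(which(cumsum(x) <= 0.5))]  # the value at which the running total reaches 0.5
sum(cumsum(x) <= 0.5)            # how many of the largest values that takes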
I think I get what you want. Here is my solution, where the goal is to find the element of the matrix at which the cumulative sum reaches, for example, 20 or more.
Even though I think there must be a much easier way to achieve this.
set.seed(1)
data <- matrix(rnorm(9, 10), 3, 3)
data
[,1] [,2] [,3]
[1,] 9.373546 11.595281 10.48743
[2,] 10.183643 10.329508 10.73832
[3,] 9.164371 9.179532 10.57578
which(cumsum(data) >= 500)[1]
[1] NA
which(cumsum(data) >= 20)[1]
[1] 3
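If you need the row/column position of that element rather than its linear (column-major) index, arrayInd() converts it:
idx <- which(cumsum(data) >= 20)[1]
arrayInd(idx, dim(data))
     [,1] [,2]
[1,]    3    1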
I've created an 8 x 1000 matrix of exponential random variables. This represents 1000 iterations (columns) of sampling 8 times from an exponential distribution. I am trying to figure out how to get the percentage of values in each column that are less than a critical value, so that I end up with a vector of 1000 percentages.
This is my current version of the code, which doesn't quite work. I've used the apply function (without the for loop) when I wanted the mean or variance of the samples, so I've been trying the same approach, but this percentage calculation seems to require something different. Any thoughts?
m=1000
n=8
theta=4
crit=3
x=rexp(m*n,1/theta)
Mxs=matrix(x,nrow=n)
ltcrit=matrix(nrow=m,ncol=1)
for(i in 1:m){
lt3=apply(Mxs,2,length(which(Mxs[,i]<crit)/n))
}
ltcrit
You can use apply without any for loop and get the answer:
percentages = apply(Mxs, 2, function(column) mean(column < crit))
Note the creation of an anonymous function with function(column) mean(column < crit). You probably used apply(Mxs, 2, mean) when you wanted the means of the columns, or apply(Mxs, 2, var) when you wanted the variance of the columns, but notice that you can put any function you want into that space and it will perform it on each column.
Also note that mean(column < crit) is a good way to get the percentage of values in column less than crit.
You can use colMeans:
colMeans(Mxs < crit)
[1] 0.500 0.750 0.250 0.375 0.375 0.875 ...
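Both give the same vector of 1000 percentages; here is a quick reproducible check (I've added a seed for illustration):
set.seed(123)
m <- 1000; n <- 8; theta <- 4; crit <- 3
Mxs <- matrix(rexp(m * n, 1 / theta), nrow = n)
p1 <- apply(Mxs, 2, function(column) mean(column < crit))
p2 <- colMeans(Mxs < crit)
all(p1 == p2)  # TRUE
length(p1)     # 1000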