I am having extreme difficulty understanding the difference between percentiles and quantiles.
I have googled the two statistical measures and the statement that makes the most sense to me is:
If you know that your score is in the 90th percentile, that means you
scored better than 90% of people who took the test. Percentiles are
commonly used to report scores in tests, like the SAT, GRE and LSAT.
For example, the 70th percentile on the 2013 GRE was 156. ... The 75th
percentile is also called the third quartile.
However, even with the above statement I'm still having trouble trying to get my head around it.
Therefore, looking at the field values below, can someone please calculate the 75th percentile/quantile of the values in the field called Feed_source?
In layman's terms:
Sort the data array.
Choose the element at position N*0.75 (index after sorting, where N is the length of the array).
The value of this element is the 75th percentile.
Concerning your example: after sorting we have [101, 101, 103, 104, 107].
The index is 5*0.75 = 3.75 ≈ 4 (rounded to the closest integer).
So the value 104 is the required percentile.
Quantile is the more general term; a percentile is a quantile with 0.01 resolution.
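If it helps to see it in code, here is a minimal R sketch of the steps above, using the five sorted values from the example (R's built-in quantile() interpolates by default, but it happens to give the same answer here):
x <- c(101, 101, 103, 104, 107)  # the sorted Feed_source values from the example
idx <- round(length(x) * 0.75)   # 5 * 0.75 = 3.75, which rounds to 4
x[idx]                           # 104, the 75th percentile by the rule above
quantile(x, 0.75)                # R's default method also gives 104 here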
I'm a newbie in statistics and I'm studying R.
I decided to do this exercise to practice some analysis with an original dataset.
This is the issue: I want to create a dataset of, let's say, 100 subjects, and for each one of them I have a test score.
This test score has a range that goes from 0 to 70 and the mean score is 48 (and it's improbable that someone scores 0).
First I tried to create the set with x <- round(runif(100, min=0, max=70)), but then I found out that the values were not normally distributed when I looked at plot(x).
So I searched for another R command and found this, but I couldn't work out how to set the min/max:
ex1 <- round(rnorm(100, mean=48, sd=5))
I really can't understand what I have to do!
I would like to write a function that gives me a set of normally distributed data, in a range of 0-70, with a mean of 48 and a not-too-big standard deviation, in order to do some t-tests later...
Any help?
Thanks a lot in advance guys
The normal distribution, by definition, does not have a min or max. If you go more than a few standard deviations from the mean, the probability density is very small, but not 0. You can truncate a normal distribution by chopping off the tails. Here, I use pmin and pmax to set any values below 0 to 0, and any values above 70 to 70:
ex1 <- round(rnorm(100, mean=48, sd=5))  # draw 100 values from the normal distribution
ex1 <- pmin(ex1, 70)                     # cap anything above 70 at 70
ex1 <- pmax(ex1, 0)                      # raise anything below 0 to 0
You can calculate the probability of an individual observation being below or above a certain point using pnorm. For your mean of 48 and SD of 5, the probability an individual observation is less than 0 is very small:
pnorm(0, mean = 48, sd = 5)
# [1] 3.997221e-22
This probability is so small that the truncation step is unnecessary in most applications. But if you started experimenting with bigger standard deviations, or mean values closer to the bounds, it could become necessary.
This method of truncation is simple, but it is a bit of a hack. If you truncated a distribution to be within 1 SD of the mean using this method, you would end up with spikes at the upper and lower bounds that are even higher than the density at the mean! But it should work well enough for less extreme applications. A more robust method is to draw more samples than you need and keep only those that fall within your bounds (a sketch of this follows below). If you really care about doing things right, there are packages that implement truncated normal distributions.
(The upper bound of 70 is closer to your mean of 48 than 0 is, so the probability of an observation above 70 is somewhat larger, but it is still far too small to matter here.)
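For completeness, here is a minimal sketch of the oversample-and-filter approach mentioned above (the 2 * n oversampling factor is an arbitrary choice; the truncnorm package is one example of the packages that implement a proper truncated normal):
draw_truncated <- function(n, mean, sd, lower, upper) {
  # keep drawing from the normal until we have n values inside [lower, upper]
  out <- numeric(0)
  while (length(out) < n) {
    x <- rnorm(2 * n, mean = mean, sd = sd)    # oversample
    out <- c(out, x[x >= lower & x <= upper])  # keep only the in-bounds draws
  }
  out[1:n]
}
ex1 <- round(draw_truncated(100, mean = 48, sd = 5, lower = 0, upper = 70))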
EDIT:
I have been asked to add more detail. Originally I have a 360x180 matrix of E-P values, where E stands for evaporation and P for precipitation; they basically indicate sources (E-P > 0) and sinks (E-P < 0) of moisture. In order to obtain the most important sources of moisture I have to take only the positive values, and I want to obtain the 95th percentile of these values and then plot the values that are above this threshold. Since I wanted a reproducible example, I used the peaks data.
I have done this in MATLAB, but if it can be done in R that works for me as well.
I have an example 49x49 matrix like this:
a = peaks;
pcolor(a);
caxis([-10 10]);
cbh=colorbar('v');
set(cbh,'YTick',(-10:1:10))
And it shows something like this
What I want to do is obtain the 95th percentile of only the positive values, and then plot them.
How can I do this? Also, which would be better: to replace all the values less than zero with 0s or with NaNs?
If you have the Statistics Toolbox, you can use the function prctile to obtain a percentile. I don't have this toolbox, so a long time ago I wrote my own version, percentile, based on the code of the function median. With either prctile or percentile you can do:
a = peaks;
t = percentile(a(a>0),95);
b = a > t;
subplot(1,2,1)
pcolor(a);
subplot(1,2,2)
pcolor(b);
a(a>0) is a vector with all the positive values in a. t is the 95th percentile of this vector.
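Since you said R works for you as well, here is a rough R sketch of the same idea; a random matrix stands in for peaks, and image() is only a loose analogue of pcolor():
set.seed(1)
a <- matrix(rnorm(49 * 49), 49, 49)  # stand-in for the E-P / peaks matrix
thr <- quantile(a[a > 0], 0.95)      # 95th percentile of the positive entries only
b <- a > thr                         # logical mask: TRUE where a exceeds the threshold
par(mfrow = c(1, 2))                 # two panels side by side, like subplot(1,2,...)
image(a, main = "data")
image(1 * b, main = "above 95th percentile")  # 1 * b converts the logical mask to numeric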
I'm using the survival library. After computing the Kaplan-Meier estimator of a survival function:
km = survfit(Surv(time, flag) ~ 1)
I know how to compute percentiles:
quantile(km, probs = c(0.05,0.25,0.5,0.75,0.95))
But, how do I compute the mean survival time?
Calculate Mean Survival Time
The mean survival time will in general depend on what value is chosen for the maximum survival time. You can get the restricted mean survival time with print(km, print.rmean=TRUE). By default, this takes the upper limit to be the longest survival time in the data. You can set it to a different value by adding an rmean argument (e.g., print(km, print.rmean=TRUE, rmean=250)).
Extract Value of Mean Survival Time and Store in an Object
In response to your comment: I initially figured one could extract the mean survival time by looking at the object returned by print(km, print.rmean=TRUE), but it turns out that print.survfit doesn't return a list object but just returns text to the console.
Instead, I looked through the code of print.survfit (you can see the code by typing getAnywhere(print.survfit) in the console) to see where the mean survival time is calculated. It turns out that a function called survmean takes care of this, but it's not an exported function, meaning R won't recognize the function when you try to run it like a "normal" function. So, to access the function, you need to run the code below (where you need to set rmean explicitly):
survival:::survmean(km, rmean=60)
You'll see that the function returns a list where the first element is a matrix with several named values, including the mean and the standard error of the mean. So, to extract, for example, the mean survival time, you would do:
survival:::survmean(km, rmean=60)[[1]]["*rmean"]
Details on How the Mean Survival Time is Calculated
The help for print.survfit provides details on the options and how the restricted mean is calculated:
?print.survfit
The mean and its variance are based on a truncated estimator. That is,
if the last observation(s) is not a death, then the survival curve
estimate does not go to zero and the mean is undefined. There are four
possible approaches to resolve this, which are selected by the rmean
option. The first is to set the upper limit to a constant,
e.g., rmean=365. In this case the reported mean would be the expected
number of days, out of the first 365, that would be experienced by
each group. This is useful if interest focuses on a fixed period.
Other options are "none" (no estimate), "common" and "individual". The
"common" option uses the maximum time for all curves in the object as
a common upper limit for the auc calculation. For the
"individual" option the mean is computed as the area under each curve,
over the range from 0 to the maximum observed time for that curve.
Since the end point is random, values for different curves are not
comparable and the printed standard errors are an underestimate as
they do not take into account this random variation. This option is
provided mainly for backwards compatibility, as this estimate was the
default (only) one in earlier releases of the code. Note that SAS (as
of version 9.3) uses the integral up to the last event time of each
individual curve; we consider this the worst of the choices and do not
provide an option for that calculation.
Using the tail formula (and since our variable is non-negative), you can calculate the mean as the integral from 0 to infinity of 1 - CDF, which equals the integral of the survival function.
If we replace the parametric survival curve with a non-parametric KM estimate, the survival curve only goes up to the last time point in our dataset, and from there on it "assumes" that the line continues straight. So we can use the tail formula in a "restricted" manner, only up to some cut-off point that we can define (the default is the last time point in our dataset).
You can calculate it using the print function, or manually:
print(km, print.rmean=TRUE) # print function
sum(diff(c(0,km$time))*c(1,km$surv[1:(length(km$surv)-1)])) # manually
I add 0 at the beginning of the time vector and 1 at the beginning of the survival vector, since they are not included. Each interval width from diff() is then paired with the survival value at the start of that interval, which is why the last element of the survival vector is dropped. This basically calculates the area under the survival curve up to the last time point in your data.
If you set up a manual cut-off point after the last point, it will simply add that area; e.g., here:
print(km, print.rmean=TRUE, rmean=4) # gives out 1.247
print(km, print.rmean=TRUE, rmean=4+2) # gives out 1.560
1.247+2*min(km$surv) # gives out 1.560
If the cut-off value is before the last time point, it will only calculate the area under the KM curve up to that point.
There's no need to use the "hidden" survival:::survmean(km, rmean=60).
Just use summary(km)$table[5:6] (or summary(km)$table[, 5:6] if your fit has several strata and the table is a matrix), which gives you the restricted mean survival time (RMST) and its standard error. A confidence interval can be calculated using the appropriate quantile of the normal distribution.
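For example, a normal-approximation 95% CI can be put together like this (a sketch only, assuming a single-curve fit so that the table is a named vector; positions 5 and 6 are the RMST and its SE, as above):
tab <- summary(km)$table            # named vector for a single curve
est <- tab[5]                       # restricted mean survival time
se  <- tab[6]                       # its standard error
est + c(-1, 1) * qnorm(0.975) * se  # lower and upper 95% confidence limits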
I have standardized normal values for heart rates and respiratory rates in children from a recent article. I copied them into a csv to use as a dataset in R. The data is simply different age ranges (ex 3 months to 6 months, or 1 year to 2 years) and then the heart rate for the 1st percentile, 10th, 25th, 50th, 75th, 90th, 99th percentile for that age range.
I want to compare a patient's data with this reference table to tell me what percentile they are at. Since this is a perfectly normal distribution, I don't think it's a very hard task to do, but it's beyond my R skills and I can't seem to find any good information on how to accomplish this.
Based on what you explained, I can suggest this simple function, which takes the heart rate and the age range of your patient and returns the percentile based on a normal density for that specific range.
my.quantile = function(myrange, heart.rate){
  # reference table: mean and sd of the heart rate for each age range
  table <- data.frame('range' = c("range1", "range2", "range3"),
                      'mean'  = c(120, 90, 60),
                      'sd'    = c(12, 15, 30))
  # cumulative probability of this heart rate under the normal curve for that range
  res <- pnorm(q = heart.rate,
               mean = subset(table, range==myrange)$mean,
               sd = subset(table, range==myrange)$sd)
  return(res*100)  # expressed as a percentile
}
### my.quantile("range1", 140)
### [1] 95.22096
From what you say, if it is perfectly normal you just need the mean and standard deviation of each range, right? You can adapt it for the respiratory rate.
EDIT: to retrieve the normal distribution parameters from your quantile table, assuming the quantiles you have are reasonably precise:
i/ Your mean parameter is exactly the 50th percentile.
ii/ You find the standard deviation from any other percentile. For instance, let's assume your 90th percentile is 73 beats and the 50th is 61 beats:
(73-61)/qnorm(0.90)
### [1] 9.36365
9.36 is your standard deviation. From here it shouldn't be very hard to automate it.
Note: if your percentile data are not very precise, you may want to repeat the operation for each percentile value and take the average.
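To make the automation concrete, here is a hedged sketch in R; the two reference percentiles are the illustrative 61 and 73 beats from above, and the patient's heart rate of 70 is made up:
p50 <- 61                            # 50th-percentile heart rate for this age range
p90 <- 73                            # 90th-percentile heart rate for this age range
mu  <- p50                           # the mean is just the 50th percentile
sdv <- (p90 - p50) / qnorm(0.90)     # back out the standard deviation (about 9.36)
pnorm(70, mean = mu, sd = sdv) * 100 # percentile for a patient at 70 beats per minute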
I can't find the type of problem I have, and I was wondering if someone knows what kind of statistics it involves. I'm not sure it's even something that can be optimized.
I'd like to optimize over three variables, or more precisely the combination of two of them. The first is a Likert-scale average, the second is the number of times that item was rated on the Likert scale, and the third is the item ID. The Likert scale is [1, 2, 3, 4].
So:
3.25, 200, item1 would mean that item1 was rated 200 times and got an average rating of 3.25.
I have a bunch of items and I'd like to find the high-value items. For instance, an item at 4, 1 would suck because, while it is rated highest, it was rated only once. And 1, 1000 would also suck, for the inverse reason.
Is there a way to optimize with a simple heuristic? Someone told me to look into confidence bands but I am not sure how that would work. Thanks!
Basically you want to ignore scores with fewer than x ratings, where x is a threshold that can be estimated based on the variance in your data.
I would recommend estimating the standard deviation of your ratings and putting a threshold on the standard error of each item's average, then translating that threshold into the minimum number of ratings required. See: http://en.wikipedia.org/wiki/Standard_error_(statistics)
For example, if your data has standard deviation 0.5 and you want the standard error of an item's average to be at most 0.1, you need (0.5/0.1)^2 = 25 ratings; if you want a 95% confidence interval of +/- 0.1, you need roughly (1.96 * 0.5 / 0.1)^2 ≈ 97 ratings.
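In R, that arithmetic looks like this (0.5 and 0.1 are just the illustrative numbers from above):
sdev   <- 0.5                      # estimated standard deviation of the ratings
margin <- 0.1                      # how close you want the estimated average to be
(sdev / margin)^2                  # ratings needed for a standard error of 0.1 (25)
(qnorm(0.975) * sdev / margin)^2   # ratings needed for a 95% CI of +/- 0.1 (about 97)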