Determining percentile based on reference table - r

I have standardized normal values for heart rates and respiratory rates in children from a recent article. I copied them into a CSV to use as a dataset in R. The data is simply different age ranges (e.g. 3 months to 6 months, or 1 year to 2 years) and then the heart rate at the 1st, 10th, 25th, 50th, 75th, 90th, and 99th percentiles for that age range.
I want to compare a patient's data with this reference table to tell me what percentile they are at. Since this is a perfectly normal distribution, I don't think it's a very hard task to do, but it's outside my R experience and I can't seem to find any good information on how to accomplish this.

Based on what you explained, I can suggest this simple function: it takes the heart rate and the age range of your patient and returns the percentile based on a normal density for that specific range.
my.quantile <- function(myrange, heart.rate) {
  table <- data.frame('range' = c("range1", "range2", "range3"),
                      'mean'  = c(120, 90, 60),
                      'sd'    = c(12, 15, 30))
  res <- pnorm(q = heart.rate,
               mean = subset(table, range == myrange)$mean,
               sd   = subset(table, range == myrange)$sd)
  return(res * 100)
}
### my.quantile("range1", 140)
### [1] 95.22096
From what you say, if it is perfectly normal you just need the mean and standard deviation of each range, right? You can adapt it for the respiratory rate.
EDIT: in order to retrieve the normal distribution parameters from your quantile table, given the hypothesis that the quantiles you have are rather precise:
i/ Your mean parameter is exactly the 50th percentile.
ii/ You find the standard deviation by taking any other percentile; for instance, let's assume your 90th percentile is 73 beats and the 50th is 61 beats:
(73-61)/qnorm(0.90)
### [1] 9.36365
9.36 is your standard deviation. From here it shouldn't be very hard to automate it.
Note: if your percentile data are not very precise, you may want to repeat the operation for each percentile value and take the average.
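To make that automation concrete, here is a minimal sketch of the averaging idea for one age range. All numbers and variable names below are illustrative placeholders, not values from the article:
## Estimate the SD of one age range by averaging over several percentiles
## (all numbers below are made-up placeholders).
probs  <- c(0.01, 0.10, 0.25, 0.75, 0.90, 0.99)   # percentiles other than the median
q.vals <- c(90, 104, 112, 128, 136, 150)          # hypothetical heart rates at those percentiles
q50    <- 120                                     # the 50th percentile is the mean
sd.est <- mean((q.vals - q50) / qnorm(probs))     # one SD estimate per percentile, then averaged
## The patient's percentile then follows from pnorm(), as in the function above:
pnorm(140, mean = q50, sd = sd.est) * 100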

Related

How can I create a normally distributed set of data in R?

I'm a newbie in statistics and I'm studying R.
I decided to do this exercise to practice some analysis with an original dataset.
This is the issue: I want to create a dataset of, let's say, 100 subjects, and for each one of them I have a test score.
This test score has a range from 0 to 70 and the mean score is 48 (and it's improbable that someone scores 0).
Firstly I tried to create the set with x <- round(runif(100, min=0, max=70)), but then I found out with plot(x) that the values were not normally distributed.
So I searched for another R command and found this, but I couldn't decide the min/max:
ex1 <- round(rnorm(100, mean=48 , sd=5))
I really can't understand what I have to do!
I would like to write a function that gives me a set of normally distributed data, in a range of 0-70, with a mean of 48 and a not-so-big standard deviation, in order to do some t-tests later...
Any help?
Thanks a lot in advance, guys.
The normal distribution, by definition, does not have a min or max. If you go more than a few standard deviations from the mean, the probability density is very small, but not 0. You can truncate a normal distribution, chopping off the tails. Here, I use pmin and pmax to set any values below 0 to 0, and any values above 70 to 70:
ex1 <- round(rnorm(100, mean=48 , sd=5))
ex1 <- pmin(ex1, 70)
ex1 <- pmax(ex1, 0)
You can calculate the probability of an individual observation being below or above a certain point using pnorm. For your mean of 48 and SD of 5, the probability an individual observation is less than 0 is very small:
pnorm(0, mean = 48, sd = 5)
# [1] 3.997221e-22
This probability is so small that the truncation step is unnecessary in most applications. But if you started experimenting with bigger standard deviations, or mean values closer to the bounds, it could become necessary.
This method of truncation is simple, but it is a bit of a hack. If you truncated a distribution to be within 1 SD of the mean using this method, you would end up with spikes at the upper and lower bounds that are even higher than the density at the mean! But it should work well enough for less extreme applications. A more robust method is to draw more samples than you need and keep the first n samples that fall within your bounds. If you really care to do things right, there are packages that implement truncated normal distributions.
(Because the normal distribution is symmetric and your upper bound of 70 is closer to the mean of 48 than 0 is, the probability of an observation above 70 is larger than the probability of one below 0, but at roughly 5e-6 it is still negligible.)
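A minimal sketch of the "draw more samples than you need" idea mentioned above (the oversampling factor of 2 is an arbitrary choice; with a mean of 48 and SD of 5, essentially nothing gets rejected anyway):
n <- 100
draws <- round(rnorm(2 * n, mean = 48, sd = 5))   # draw more than needed
draws <- draws[draws >= 0 & draws <= 70]          # drop anything outside the bounds
ex1 <- head(draws, n)                             # keep the first n that survived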

The difference between percentile and quantile and how to calculate each

I am having extreme difficulty understanding the difference between percentiles and quantiles.
I have googled the two statistical measures and the statement that makes the most sense to me is:
If you know that your score is in the 90th percentile, that means you
scored better than 90% of people who took the test. Percentiles are
commonly used to report scores in tests, like the SAT, GRE and LSAT.
for example, the 70th percentile on the 2013 GRE was 156. ... The 75th
percentile is also called the third quartile.
However, even with the above statement I'm still having trouble trying to get my head around it.
Therefore, looking at the following field values, can someone please calculate the 75th percentile/quantile of the values in the field below called Feed_source?
In layman's terms:
Sort the data array.
Choose the element in position N*0.75 (index after sorting, where N is the length of the array).
The value of this element is the 75th percentile.
Concerning your example: after sorting we have [101, 101, 103, 104, 107].
The index is 5*0.75 = 3.75 ≈ 4 (rounded to the closest integer).
So the value 104 is the required percentile.
Quantile is the more general term; a percentile is a quantile with 0.01 resolution.
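For illustration, the same numbers in R (the question does not name a tool, so this is just to show the computation):
x <- c(101, 101, 103, 104, 107)
quantile(x, probs = 0.75)
## 75%
## 104
## R's default method (type 7) interpolates in general, but here the index
## (5 - 1) * 0.75 + 1 = 4 lands exactly on the 4th sorted element, 104.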

Trying to coerce the data onto a Gaussian curve and the results are not as expected

This is not a question about curve fitting. Instead, what I have is a collection of 60 different sites, from which I can collect maximum, minimum and average temperatures. I need to use this data to calculate the operating temperature of a photovoltaic cell; it doesn't make sense to use the average temperatures, however, because they include values from after sunset. Instead, I first create a "fake" average temperature (totalityoftemperatures_fakemeans), which is the average of the maximum and minimum temperatures. I then calculate an adjusted minimum temperature by subtracting one standard deviation (assuming 6 * sd = max - min), and finally calculate an "adjusted" mean temperature which is the average of the new minimum (fake mean - 1 * sd) and the pre-existing maximum temperature (this is the "adjusted mean").
What really bothers me is that this re-calculated average ought to be higher than the "fake" mean; after all, it is the average of the adjusted minimum and the original maximum. I might also cross-post this to the statistics Stack Exchange, but I'm pretty sure this is a coding issue right now. Can anyone look at the R code below?
#The first data sets of maxima and minima are taken from empirical data
for (i in 1:nrow(totalityofsites)) {
  for (j in 1:12) {
    totalityoftemperatures_fakemeans[i,j] = mean(totalityoftemperatures_maxima[i,j], totalityoftemperatures_minima[i,j])
  }
}
totality_onesigmaDF = abs((1/6)*(totalityoftemperatures_maxima - totalityoftemperatures_minima))
totalityoftemperatures_adjustedminima = totalityoftemperatures_fakemeans - totality_onesigmaDF
for (i in 1:nrow(totalityofsites)) {
  for (j in 1:12) {
    totalityoftemperatures_adjustedmeans[i,j] = mean(totalityoftemperatures_adjustedminima[i,j], totalityoftemperatures_maxima[i,j])
  }
}
#The second calculated average should be higher than the "fake" one, but that is not the case
I think your problem lies in your use of the mean function. When you do this:
mean(totalityoftemperatures_adjustedminima[i,j], totalityoftemperatures_maxima[i,j])
You are calling mean with two arguments, but mean's signature is mean(x, trim = 0, na.rm = FALSE, ...): the data must go in a single vector as the first argument. A second positional number is matched to the trim argument rather than being averaged, so it is effectively ignored. Look:
mean(2, 100)
#[1] 2
Whereas if you concatenate the values into a single vector, you get the right answer:
mean(c(2, 100))
#[1] 51
So you need to change
mean(totalityoftemperatures_maxima[i,j], totalityoftemperatures_minima[i,j])
to
mean(c(totalityoftemperatures_maxima[i,j], totalityoftemperatures_minima[i,j]))
and
mean(totalityoftemperatures_adjustedminima[i,j], totalityoftemperatures_maxima[i,j])
to
mean(c(totalityoftemperatures_adjustedminima[i,j], totalityoftemperatures_maxima[i,j]))
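As a side note (separate from the bug fix above), these are element-wise operations, so, assuming the maxima and minima objects are numeric matrices or data frames of identical dimensions, both loops can be replaced with vectorised arithmetic:
totalityoftemperatures_fakemeans <- (totalityoftemperatures_maxima + totalityoftemperatures_minima) / 2
totality_onesigmaDF <- abs(totalityoftemperatures_maxima - totalityoftemperatures_minima) / 6
totalityoftemperatures_adjustedminima <- totalityoftemperatures_fakemeans - totality_onesigmaDF
totalityoftemperatures_adjustedmeans <- (totalityoftemperatures_adjustedminima + totalityoftemperatures_maxima) / 2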

Obtaining the 95th percentile of a matrix and then plotting it

EDIT:
I have been asked to add more detail. Originally I have a 360x180 matrix of E-P values, where E stands for evaporation and P for precipitation; they basically indicate sources (E-P > 0) and sinks (E-P < 0) of moisture. To identify the most important sources of moisture I have to take only the positive values, and I want to obtain the 95th percentile of those values and then plot the values above this threshold. Since I wanted a reproducible example, I used the peaks data:
I have done this in MATLAB, but if it can be done in R that works for me as well.
I have an example 49x49 matrix like this:
a = peaks;
pcolor(a);
caxis([-10 10]);
cbh=colorbar('v');
set(cbh,'YTick',(-10:1:10))
And it shows something like this
What I want to do is to obtain the 95th percentile of only the positive values, and then plot them.
How can I do this? Also, which would be better: to replace all the values less than zero with 0s or with NaNs?
If you have the Statistics Toolbox, you can use the function prctile to obtain a percentile. I don't have this toolbox, so I wrote my own version (a long time ago) based on the code for the function median. With either prctile or my percentile function you can do:
a = peaks;
t = percentile(a(a>0),95);
b = a > t;
subplot(1,2,1)
pcolor(a);
subplot(1,2,2)
pcolor(b);
a(a>0) is a vector with all the positive values in a. t is the 95th percentile of this vector.
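Since the question says an R solution would also work, here is a rough sketch of the same idea in R. R has no built-in peaks(), so a random matrix stands in for the example data, and quantile() plays the role of prctile():
a <- matrix(rnorm(49 * 49), nrow = 49)   # stand-in for MATLAB's peaks matrix
t <- quantile(a[a > 0], probs = 0.95)    # 95th percentile of the positive values only
b <- (a > t) * 1                         # 1 where a exceeds the threshold, 0 elsewhere
par(mfrow = c(1, 2))
image(a, main = "original values")
image(b, main = "values above the 95th percentile")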

Wind Speed time series simulation in R

Following up on an R blog post which is interesting and quite useful for simulating the wind speed time series of an unknown area from its Weibull parameters.
Although this method gives a reasonably good estimate of the time series as a whole, it suffers a great deal when we look at seasonal changes.
Let's see an example:
For a particular set of Weibull parameters, this method would give monthly wind speeds such as:
Jan 7.492608
Feb 7.059587
March 7.261821
Apr 7.192106
May 7.399982
Jun 7.195889
July 7.290898
Aug 7.210269
Sept 7.219063
Oct 7.307073
Nov 7.135451
Dec 7.315633
It can be seen that the variation in wind speed across the months is small, whereas in reality it will change from one month to another. If I wanted to prioritise certain months, say June and July, over November and December while keeping the overall Weibull distribution unchanged, how would I do it?
Any lead or advice on making these changes to the code listed in the link above would be of great help.
On request, here is the sample code.
MeanSpeed <- 7.29 ## Mean yearly wind speed at the site.
Shape <- 2        ## Input shape parameter.
Scale <- 8        ## Calculated scale parameter.
MaxSpeed <- 17    ## Maximum possible wind speed over the year.
nStates <- 16     ## Number of discrete wind-speed states (categories).
These are the inputs in the blog: MeanSpeed is the average annual wind speed at a location that has the Shape and Scale parameters as provided. MaxSpeed is the maximum speed possible over the year.
I would like to have a MaxSpeed for each month, say Maxspeed_Jan, Maxspeed_feb ... till Maxspeed_dec, all with different values. This should reflect the seasonality in the wind speed variations across the year; a sketch of such a structure is given below.
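One hypothetical way to hold these month-specific maxima is a named vector that the rest of the blog's code could loop over; the numbers here are placeholders only, not real wind speeds:
Maxspeed_monthly <- c(Jan = 14, Feb = 14, Mar = 15, Apr = 16, May = 17, Jun = 18,
                      Jul = 18, Aug = 17, Sep = 16, Oct = 15, Nov = 13, Dec = 13)  # placeholder values
## e.g. run the simulation once per month, using Maxspeed_monthly[m] in place of MaxSpeed.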
Then calculate the following in a way that reflects this variation in the output time series.
nRows<-nStates;
nColumns<-nStates;
LCateg<-MaxSpeed/nStates;
WindSpeed = seq(LCateg/2, MaxSpeed - LCateg/2, by = LCateg) ## Define the velocity vector, centered on the midpoint of each category.
##Determine Weibull Probability Distribution.
wpdWind <- dweibull(WindSpeed, shape = Shape, scale = Scale) ## Weibull frequency distribution.
plot(wpdWind, type = "b", ylab = "frequency", xlab = "Wind Speed") ## Plot the Weibull probability distribution.
norm_wpdWind <- wpdWind/sum(wpdWind) ## Normalise the Weibull densities so they sum to 1.
## Correlation between states (Matrix G)
g<-function(x){2^(-abs(x))} ## decreasing correlation function between states.
G<-matrix(nrow=nRows,ncol=nColumns)
G <- row(G)-col(G)
G <- g(G)
##--------------------------------------------------------
## iterative process to calculate the matrix P (initial probability)
P0<-diag(norm_wpdWind); ## Initial value of the MATRIX P.
P1<-norm_wpdWind; ## Initial value of the VECTOR p.
## This iterative calculation should really run until the error falls below a certain tolerance;
## here, as something tentative, I fix the number of iterations.
steps=1000;
P=P0;
p=P1;
for (i in 1:steps) {
  r <- P %*% G %*% p
  r <- as.vector(r/sum(r)) ## The above result is in matrix form; change it to a vector.
  p <- p + 0.5*(P1 - r)
  P <- diag(p)
}
## $$ ----Markov Transition Matrix --- $$ ##
N=diag(1/as.vector(p%*%G));## normalization matrix
MTM=N%*%G%*%P ## Markov Transition Matrix
MTMcum <- t(apply(MTM, 1, cumsum)) ## Cumulative transition probabilities generated from the MTM.
##-------------------------------------------
## Calculating the series from the MTMcum
##Insert number of data sets.
LSerie <- 52560 ## Wind speed every 10 minutes for a year.
RandNum1<-runif(LSerie);## Random number to choose between states
State<-InitialState<-1;## assumes that the initial state is 1 (this must be changed when concatenating days)
StatesSeries=InitialState;
## Initialise ----
## The following loop chooses each next state as the first one whose cumulative
## probability (the row of MTMcum for the current state) is greater than or equal to the random number.
for (i in 2:LSerie) {
  ## i has to start at 2 !!
  State <- min(which(RandNum1[i] <= MTMcum[State,]))
  ## if (is.infinite(State)) {State <- 1} ## if the condition is never met, min() returns Inf
  StatesSeries <- c(StatesSeries, State)
}
RandNum2<-runif(LSerie); ## Random number to choose between speeds within a state
SpeedSeries=WindSpeed[StatesSeries]-0.5+RandNum2*LCateg;
## where the 0.5 correction is needed since the WindSpeed vector is centered around the mean value of each category.
library(MASS) ## fitdistr() comes from the MASS package.
print(fitdistr(SpeedSeries, 'weibull')) ## MLE fitting of SpeedSeries.
The obtained result should resemble the input Scale and Shape parameters, and instead of a nearly uniform wind speed for each month, the variation should reflect the input maximum wind speeds of each month.
Thank you.
