I was working on a statistics problem in R. Before doing it in R, I worked it out by hand. Here is the problem.
A sample of 300 TV viewers were asked to rate the overall quality of television shows from 0 (terrible) to 100 (the best). A histogram was constructed from the results, and it was noted that
it was mound-shaped and symmetric, with a sample mean of 65 and a sample standard
deviation of 8. Approximately what proportion of ratings would be above 81?
I have answered it manually like this:
Pr(X > 81) = Pr(Z > (81 - 65)/8) = Pr(Z > 2) = 0.0228
So the proportion is about 0.023, or 2.3%.
My trouble is how to do this in R. I have tried pnorm(p=.., mean=.., sd=..) but didn't get a result matching my manual calculation.
Thank you so much for the answer
You identified the correct function.
The help on pnorm gives the list of arguments:
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
with the explanation for the arguments:
x, q: vector of quantiles.
mean: vector of means.
sd: vector of standard deviations.
log, log.p: logical; if TRUE, probabilities p are given as log(p).
lower.tail: logical; if TRUE (default), probabilities are P[X <= x]
otherwise, P[X > x].
Under "Value:" it says
... ‘pnorm’ gives the distribution function,
So that covers everything. Pass the value you want the area to the left of as q, along with the correct mu and sigma values, and you will get the area below it. If you want the area above, add lower.tail=FALSE.
Like so:
pnorm(81,65,8) # area to left
[1] 0.9772499
pnorm(81,65,8,lower.tail=FALSE) # area to right ... which is what you want
[1] 0.02275013
(This way is more accurate than subtracting the first result from 1 once you get into the far upper tail.)
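For instance, far enough out in the tail, 1 - pnorm() rounds all the way down to zero, while lower.tail = FALSE still returns a usable value:
# a rating of 150 is more than 10 standard deviations above the mean
1 - pnorm(150, 65, 8)                  # the subtraction loses all precision: exactly 0
pnorm(150, 65, 8, lower.tail = FALSE)  # a tiny but nonzero probability (roughly 1e-26)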
I want to identify the probability of certain events occurring for a range.
Min = 600; Max = 50,000; most frequent outcome = 600
I generated a sequence of events: numbers <- seq(600,50000,by=1)
This is where I get stuck. I'm not sure whether I'm using the wrong distribution or whether my attempt at executing it has gone down the wrong path.
qpois(numbers, lambda = 600) produces NaNs
The desired outcome is a set of weighted probabilities (weighted toward the mean of 600), so that I can then assess, by summing the probabilities for those numbers, whether the likelihood of an outlier event above 30,000 is 5%, or make other cuts like that.
A bit rusty, haven't used this for a few years so any online resources to refresh is also appreciated!
Firstly, I think you're looking for ppois rather than qpois. The function qpois(p, 600) takes a vector p of probabilities. If you do qpois(0.75, 600) you will get 616, meaning that 75% of observations will be at or below 616.
ppois is the opposite of qpois. If you do ppois(616, 600) you will get (approximately) 0.75.
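Side by side, using the numbers above:
qpois(0.75, lambda = 600)   # from a probability to a quantile: 616
ppois(616, lambda = 600)    # from a quantile back to a probability: approximately 0.75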
As for your specific distribution, it can't be a Poisson distribution. Let's see what a Poisson distribution with a mean of 600 looks like:
x <- 500:700
plot(x, dpois(x, 600), type = "h")
Getting a value greater than even 900 has (essentially) zero probability:
1 - ppois(900, 600)
#> [1] 0
So if your data contains values of 30,000 or 50,000 as well as 600, it's certainly not a Poisson distribution.
Without knowing more about your actual data, it's not really possible to say what distribution you have. Perhaps if you include a sample of it in your question we could be of more help.
EDIT
With the sample of numbers provided in the comments, we can have a look at the actual empirical distribution:
hist(numbers, 200)
and if we want to know the probability at any point, we can create the empirical cumulative distribution function like this:
get_probability_of <- ecdf(numbers)
This allows us to do:
number <- 1:50000
plot(number, get_probability_of(number), ylab = "probability", type = "l")
and
get_probability_of(30000)
#> [1] 0.83588
Which means that the probability of getting a number higher than 30,000 is
1 - get_probability_of(30000)
#> [1] 0.16412
However, in this case, we know how the distribution is generated, so we can calculate the exact theoretical cdf just using some simple geometry (I won't show my working here because although it is simple, it is rather lengthy, dull, and not applicable to other distributions):
cdf <- function(x) ifelse(x < 600, 0, 1 - ((49400 - (x - 600)) / 49400)^2)
and
cdf(30000)
#> [1] 0.8360898
which is very close to, but more theoretically accurate than, the empirical value.
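So, matching the empirical calculation above, the theoretical probability of getting a number higher than 30,000 is:
1 - cdf(30000)
#> [1] 0.1639102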
I'm working with a very large dataset with 132,019 observations of 18 variables. I've used the clusterSim package to calculate the pseudo-F statistic on clusters created using Kohonen SOMs. I'm trying to assess the various cluster sizes (e.g., 4, 6, 9 clusters) with p-values, but I'm getting weird results and I'm not statistically savvy enough to know what's going on.
I use the following code to get the pseudo-F.
library(clusterSim)
psF6 <- index.G1(yelpInfScale, cl = som.6$unit.classif)
psF6
[1] 48783.4
Then I use the following code to get the p-value. When I do lower.tail = T I get a 1 and when I do lower.tail = F I get a 0.
k6 = 6
pf(q = psF6, df1 = k6 - 1, df2 = n - k6, lower.tail = FALSE)
[1] 0
I guess I wasn't expecting a round number, so I'm confused about how to interpret the results. I get exactly the same results regardless of which cluster size I evaluate. I read something somewhere about reversing df1 and df2 in the calculation, but that seems weird. Also, the reference text I'm using (Larose's "Data Mining and Predictive Analytics") uses this to evaluate k-means clusters, so I'm wondering if the problem is that I'm using Kohonen clusters.
I'd check your data, but it's not impossible to get a p-value of either 0 or 1. In your case, assuming you have your data right, it indicates that your data is heavily skewed and the clusters you have created are an ideal fit. So when you use lower.tail = FALSE, the p-value of zero indicates that your sample is classified with 100% accuracy and there is no chance of an error. The 1 from lower.tail = TRUE indicates that your clusters are very close to each other. In other words, your observations are clustered well away from each other, giving a 0 on the two-tailed test, but the centre points of the clusters are close enough to give a p-value of 1 on the one-tailed test. If I were you I'd try a 'k-means with splitting' variant with different values of the distance parameter 'w' to see how the data fits. If for some 'w' it fits with very low p-values for the clusters, I don't think a model as complex as a SOM is really necessary.
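For reference, a minimal sketch of the numbers involved (with n = 132019 taken from the question): with an F statistic this large and these degrees of freedom, the upper-tail probability is far smaller than the smallest representable double, so R prints exactly 0, and the complementary lower tail prints exactly 1, rather than a rounded non-integer value.
n <- 132019      # observations, from the question
k6 <- 6          # number of clusters
psF6 <- 48783.4  # pseudo-F reported above
pf(psF6, df1 = k6 - 1, df2 = n - k6, lower.tail = FALSE)   # underflows to 0
pf(psF6, df1 = k6 - 1, df2 = n - k6, lower.tail = TRUE)    # rounds to 1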
I have to simulate a system's fail times. To do so I have to use the Weibull distribution with a "decreasing hazard rate" and a shape of 0.7-0.8. I have to generate a file with 100 results for the function, using random numbers from 0 to 1.
So I've been searching a bit and I found this R function:
pweibull(q, shape, scale = 1, lower.tail = T, log.p = F)
There are some others (rweibull, qweibull, ...) but I think this is the one I have to use, since it is the cumulative distribution function, as the exercise statement says. The problem is that I am new to R and don't really know what parameters to pass to this function.
I'm guessing shape should be 0.7-0.8 and scale 1. For the q parameter, should I create a random vector of 100 numbers with values from 0 to 1? If so, any tips on how to do it? Also, any tips on how to export the resulting data to a file?
I'm not sure what the question is, but if you want to generate 100 values drawn from a Weibull distribution with a shape parameter of 0.75, use rweibull(100, 0.75).
If you want the cumulative probability of each of those draws (the chance of a value at or below it), use pweibull(rweibull(100, 0.75), 0.75).
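On exporting, a minimal sketch (the file name is just illustrative): draw the 100 fail times and write them out with write.csv.
set.seed(1)                                  # only to make the draw reproducible
fail_times <- rweibull(100, shape = 0.75)    # scale defaults to 1
write.csv(data.frame(fail_times), "fail_times.csv", row.names = FALSE)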
You should also be aware that there is a general no-homework rule on these sites.
The prob package numerically evaluates characteristic functions for base R distributions. For almost all distributions there are existing formulas. For a few cases, though, no closed-form solution is known. Case in point: the Weibull distribution (but see below).
For the Weibull characteristic function I essentially compute two integrals and put them together:
fr <- function(x) cos(t * x) * dweibull(x, shape, scale)
fi <- function(x) sin(t * x) * dweibull(x, shape, scale)
Rp <- integrate(fr, lower = 0, upper = Inf)$value
Ip <- integrate(fi, lower = 0, upper = Inf)$value
Rp + (0+1i) * Ip
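For concreteness, here is a sketch of how those pieces fit together in a function (an illustrative reconstruction along the lines of cfweibull, not the package's exact source):
cfweibull_sketch <- function(t, shape, scale = 1) {
  fr <- function(x) cos(t * x) * dweibull(x, shape, scale)   # real-part integrand
  fi <- function(x) sin(t * x) * dweibull(x, shape, scale)   # imaginary-part integrand
  Rp <- integrate(fr, lower = 0, upper = Inf)$value
  Ip <- integrate(fi, lower = 0, upper = Inf)$value
  Rp + (0+1i) * Ip
}
cfweibull_sketch(1, shape = 2)   # well-behaved inputs work fine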
Yes, it's clumsy, but it works surprisingly well! ...ahem, most of the time. A user reported recently that the following breaks:
cfweibull(56, shape = 0.5, scale = 1)
Error in integrate(fr, lower = 0, upper = Inf) :
the integral is probably divergent
Now, we know that the integral isn't divergent, so it must be a numerical problem. With some fiddling I could get the following to work:
fr <- function(x) cos(56 * x) * dweibull(x, 0.5, 1)
integrate(fr, lower = 0.00001, upper = Inf, subdivisions=1e7)$value
[1] 0.08024055
That's OK, but it isn't quite right, and it takes a fair bit of fiddling that doesn't scale well, so I've been investigating better solutions. I found a recently published "closed form" for the characteristic function with shape > 1 (see here), but it involves Wright's generalized confluent hypergeometric function, which isn't implemented in R (yet). I looked into the archives for alternatives to integrate, and there's a ton of stuff out there which doesn't seem very well organized.
As part of that searching it occurred to me to translate the region of integration to a finite interval via the inverse tangent, and voila! Check it out:
cfweibull3 <- function (t, shape, scale = 1){
if (shape <= 0 || scale <= 0)
stop("shape and scale must be positive")
fr <- function(x) cos(t * tan(x)) * dweibull(tan(x), shape, scale)/(cos(x))^2
fi <- function(x) sin(t * tan(x)) * dweibull(tan(x), shape, scale)/(cos(x))^2
Rp <- integrate(fr, lower = 0, upper = pi/2, stop.on.error = FALSE)$value
Ip <- integrate(fi, lower = 0, upper = pi/2, stop.on.error = FALSE)$value
Rp + (0+1i) * Ip
}
cfweibull3(56, shape=0.5, scale = 1)
[1] 0.08297194+0.07528834i
Questions:
Can you do better than this?
Can people who are expert about numerical integration routines shed some light on what's happening here? I have a sneaking suspicion that for large t the cosine fluctuates rapidly, which causes problems...?
Are there existing R routines/packages which are better suited for this type of problem, and could somebody point me to a well-placed position (on the mountain) to start the climb?
Comments:
Yes, it is bad practice to use t as a function argument.
I calculated the exact answer for shape > 1 using the published result with Maple, and the brute-force-integrate-by-the-definition-with-R kicked Maple's ass. That is, I get the same answer (up to numerical precision) in a small fraction of a second and an even smaller fraction of the price.
Edit:
I was going to write down the exact integrals I'm looking for but it seems this particular site doesn't support MathJAX so I'll give links instead. I'm looking to numerically evaluate the characteristic function of the Weibull distribution for reasonable inputs t (whatever that means). The value is a complex number but we can split it into its real and imaginary parts and that's what I was calling Rp and Ip above.
One final comment: Wikipedia has a formula listed (an infinite series) for the Weibull c.f. and that formula matches the one proved in the paper I referenced above, however, that series has only been proved to hold for shape > 1. The case 0 < shape < 1 is still an open problem; see the paper for details.
You may be interested in this paper, which discusses different integration methods for highly oscillatory integrals -- which is essentially what you are trying to compute:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.8.6944
Another piece of advice: instead of an infinite limit you may want to specify a smaller one, because if you specify the precision you want, then based on the cdf of the Weibull you can easily estimate how much of the tail you can safely truncate. And with a finite limit, you can specify (almost) exactly the number of subdivisions needed, e.g. in order to have a few (4-8) points per period; see the sketch below.
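For example (the 1e-10 tolerance is just an illustrative choice):
shape <- 0.5; scale <- 1
tol <- 1e-10                                              # tail mass we are willing to neglect
upper <- qweibull(tol, shape, scale, lower.tail = FALSE)  # roughly 530 for these parameters
# Integrating over [0, upper] instead of [0, Inf) then lets you pick the number of
# subdivisions so there are a few points per period of cos(t * x).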
I had the same problem as Jay -- not with the Weibull distribution, but with the integrate function. I found my answer to Jay's question 3 in a comment on this question:
Divergent Integral in R is solvable in Wolfram
The R package pracma contains several functions for numerical integration: routines for specific kinds of mathematical functions, plus a more general function, integral. The latter helped in my case. Example code is given below.
On question 2: the first answer to the question linked above states that R does not print the complete error message from the C source (the function may just converge too slowly). So I would agree with Jay that the fast fluctuation of the cosine may be the problem; it was the problem in my case and in the example below.
Example Code
# load Practical Numerical Math Functions package
library(pracma)
# define function
testfun <- function(r) cos(r*10^6)*exp(-r)
# Integrate it numerically with the basic 'integrate'.
out1 = integrate(testfun, 0, 100)
# "Error in integrate(testfun, 0, 100) : the integral is probably divergent"
# Integrate it numerically with 'integral' from 'pracma' package
# using 'Gauss-Kronrod' method and 10^-8 as relative tolerance. I
# did not try the other ones.
out2 = integral(testfun, 0, 100, method = 'Kronrod', reltol = 1e-8)
Two remarks
The integral function does not break the way integrate does, but it may take quite a long time to run. I do not know (and did not try) whether the user can limit the number of iterations.
Even if the integral function finishes without errors, I am not sure how correct the result is. Numerically integrating a function that fluctuates rapidly around zero seems tricky, since one does not know exactly where on the fluctuating function the values are evaluated (twice as many positive as negative values; positive values close to local maxima and negative values far off). I am not an expert on numerical integration and only know some basic fixed-step methods from my numerics lectures, so maybe the adaptive methods used in integral deal with this problem in some way.
I'm attempting to answer questions 1 & 3. That said, I am not contributing any original code; I did a Google search and hopefully this is helpful. Good luck!
Source: http://cran.r-project.org/doc/contrib/Ricci-distributions-en.pdf (p. 6)
#Script
library(ggplot2)
## sampling from a Weibull distribution with parameters shape=2.1 and scale=1.1
x.wei <- rweibull(n=200, shape=2.1, scale=1.1)
## theoretical quantiles from a Weibull population with known parameters shape=2 and scale=1
x.teo <- rweibull(n=200, shape=2, scale=1)
# Figure
qqplot(x.teo, x.wei, main="QQ-plot distr. Weibull")  ## QQ-plot
abline(0, 1)  ## a 45-degree reference line is plotted
Is this of any use?
http://www.sciencedirect.com/science/article/pii/S0378383907000452
Muraleedharan et al. (2007), "Modified Weibull distribution for maximum and significant wave height simulation and prediction", Coastal Engineering, Volume 54, Issue 8, August 2007, Pages 630–638
From the abstract: "The characteristic function of the Weibull distribution is derived."
I'm trying to generate random numbers with a multivariate skew-normal distribution using the rmsn command from the sn package in R. Ideally, I would like to get three columns of numbers with specified variances and covariances, while having one column strongly skewed. But I'm struggling to achieve both goals simultaneously.
The post at skew normal distribution was related and useful (and the source of some of the code below), but hasn't completely clarified the issue for me.
I've been trying:
a <- c(5, 0, 0) # set shape parameter
s <- diag(3) # create variance-covariance matrix
w <- sqrt(1/(1-((2*(a^2)/(1 + a^2))/pi))) # determine scale parameter to get sd of 1
xi <- w*a/sqrt(1 + a^2)*sqrt(2/pi) # determine location parameter to get mean of 0
apply(rmsn(n=1000, xi=c(xi), Omega=s, alpha=a), 2, sd)
colMeans(rmsn(n=1000, xi=c(xi), Omega=s, alpha=a))
The column means and SDs are correct for the second and third columns (which have no skew) but not the first (which does). Can anyone clarify where my code above, or my thinking, has gone wrong? I may be misunderstanding how to use rmsn, or its output. Any assistance would be appreciated.
The location is not the mean (except when there is no skew). From the documentation:
Notice that the location vector ‘xi’ does not represent the mean vector of the distribution (which in fact may not even exist if ‘df <= 1’), and similarly ‘Omega’ is not the covariance matrix of the distribution.
And you may want to replace Omega=s with Omega=w.
And this is supposed to be a variance matrix: there should be no square root.
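A hedged sketch of one way to put those hints together, using the standard skew-normal moment formulas E[X] = xi + omega*delta*sqrt(2/pi) and Var[X] = omega^2*(1 - 2*delta^2/pi), where delta = alpha/sqrt(1 + sum(alpha^2)) when Omega is diagonal (this is only an illustration, not the sn package's documented recipe):
library(sn)
a <- c(5, 0, 0)                          # shape (skew) parameters
delta <- a / sqrt(1 + sum(a^2))
w  <- 1 / sqrt(1 - 2 * delta^2 / pi)     # per-column scale giving variance 1
xi <- -w * delta * sqrt(2 / pi)          # location pulling each mean back to 0
x <- rmsn(n = 100000, xi = xi, Omega = diag(w^2), alpha = a)
colMeans(x)                              # all close to 0
apply(x, 2, sd)                          # all close to 1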