Alright, so I have the strangest issue here. I'm taking the mean of a dependent variable Y after partitioning a space by a particular quantile of an independent variable X.
My issue is that the quantile function in R is returning a value that falls outside the range of my independent variable X, even though the value it returns looks correct when printed to the screen. What makes this stranger is that it only happens with particular quantiles.
Some example code to demonstrate this weird effect:
x<-c(1.49,rep(1.59,86))
quantile(x,0.05) # returns 1.59, the correct value
# However both of these return all values as false
table(x>=quantile(x,0.05))
table(x==quantile(x,0.05))
# But if we take a quantile at 0.075 it works correctly
table(x>=quantile(x,0.075))
Any insight you guys can provide would be appreciated.
The quantile isn't exactly 1.59:
> quantile(x, 0.05)[[1]] == 1.59
[1] FALSE
> quantile(x, 0.05)[[1]] == 1.5900000000000003
[1] TRUE
quantile(..., type = 7) computes a weighted average of two order statistics, and here it effectively replaces 1.59 with 0.7000000000000001 * 1.59 + 0.3 * 1.59; the weights don't sum to exactly 1, so a tiny floating point error creeps in and exact equality fails.
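One way to work around it (a sketch, reusing the x defined above) is to compare against the quantile with a small tolerance instead of exactly:
q05 <- quantile(x, 0.05)
tol <- .Machine$double.eps^0.5
table(x >= q05 - tol)       # the 86 values of 1.59 now compare as >= the 5% quantile
table(abs(x - q05) < tol)   # tolerance-based "equality" instead of ==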
My professor assigned us some homework questions regarding normal distributions. We are using RStudio to calculate our values instead of the z-tables.
One question about meteors gives a mean (μ) = 4.35 and standard deviation (σ) = 0.59, and asks for the probability that x > 5.
I already figured out the answer with 1-pnorm((5-4.35)/0.59) ~ 0.135.
However, I am currently having some difficulty trying to understand what pnorm calculates.
Originally, I just assumed that z-scores were the only argument needed, so I proceeded to use pnorm(z-score) for most of the normal curve problems.
The help page for pnorm accessed through ?pnorm() indicates that the usage is:
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE).
My professor also says that I am ignoring the mean and sd by just using pnorm(z-score). I feel like it is just easier to type in one value instead of the whole set of arguments. So I experimented and found that
1-pnorm((5-4.35)/0.59) = 1-pnorm(5,4.35,0.59)
So it looks like pnorm(z-score) = pnorm(x, μ, σ).
Is there a reason that using the z-score lets you skip the mean and standard deviation in the pnorm function?
I have also noticed that adding the μ and σ arguments along with the z-score gives the wrong answer (e.g. pnorm(z-score, μ, σ)).
> 1-pnorm((5-4.35)/0.59)
[1] 0.1352972
> pnorm(5,4.35,0.59)
[1] 0.8647028
> 1-pnorm(5,4.35,0.59)
[1] 0.1352972
> 1-pnorm((5-4.35)/0.59,4.35,0.59)
[1] 1
That is because a z-score is standard normally distributed, meaning it has μ = 0 and σ = 1, which, as you found out, are the default parameters for pnorm().
The z-score is just the transformation of any normally distributed value to a standard normally distributed one.
So when you compute the probability for the z-score of x = 5, you indeed get the same value as asking for the probability of x > 5 in a normal distribution with μ = 4.35 and σ = 0.59.
But when you add μ = 4.35 and σ = 0.59 to your z-score inside pnorm() you get it all wrong, because you're looking for a standard normally distributed value in a different distribution.
pnorm() (to answer your first question) calculates the cumulative distribution function, which gives you P(X ≤ x), the probability that the random variable takes a value less than or equal to x. That's why you do 1 - pnorm(..) to find P(X > x).
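As a side note, pnorm() also has a lower.tail argument, so you can ask for the upper tail directly instead of subtracting from 1:
pnorm(5, mean = 4.35, sd = 0.59, lower.tail = FALSE)
# [1] 0.1352972
pnorm((5 - 4.35)/0.59, lower.tail = FALSE)   # same result using the z-score
# [1] 0.1352972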
I have used boxplot.stats$out to get outliers of a list in R. However I noticed that many times it fails to identify outliers. For example:
list = c(3,4,7,500)
boxplot.stats(list)
$`stats`
[1] 3.0 3.5 5.5 253.5 500.0
$n
[1] 4
$conf
[1] -192 203
$out
numeric(0)
quantile(list)
0% 25% 50% 75% 100%
3.00 3.75 5.50 130.25 500.00
130.25+1.5*IQR(list) = 320
As you can see, the boxplot.stats() function fails to find the outlier 500, even though, according to the documentation, it uses the Q1/Q3 ± 1.5*IQR method. So 500 should have been identified as an outlier, but it is not, and I'm not sure why.
I have tried this with a list of 5 elements instead of 4, or with an outlier that is very small instead of very large and I still get the same problem.
Notice that the fourth number in the "stats" portion (the upper hinge) is 253.5, not the 130.25 that quantile() reported as the 75th percentile.
The documentation for boxplot.stats says:
The two ‘hinges’ are versions of the first and third quartile, i.e.,
close to quantile(x, c(1,3)/4). The hinges equal the quartiles for odd
n (where n <- length(x)) and differ for even n. Whereas the quartiles
only equal observations for n %% 4 == 1 (n = 1 mod 4), the hinges do
so additionally for n %% 4 == 2 (n = 2 mod 4), and are in the middle
of two observations otherwise.
In other words, for your data it is using (500+7)/2 = 253.5 as the Q3 value (and incidentally (3+4)/2 = 3.5 as Q1, not the 3.75 that you got from quantile). boxplot.stats will therefore use the upper boundary 253.5 + 1.5*(253.5 - 3.5) = 628.5, and 500 falls well inside it, so it is not flagged as an outlier.
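You can see those hinge-based values directly with fivenum(), which computes Tukey's five-number summary (minimum, lower hinge, median, upper hinge, maximum) and is what the $stats output is built from before the whiskers are trimmed:
fivenum(c(3, 4, 7, 500))
# [1]   3.0   3.5   5.5 253.5 500.0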
If you read the help page help("boxplot.stats") carefully, the return value section says the following about stats:
stats
a vector of length 5, containing the extreme of the lower
whisker, the lower ‘hinge’, the median, the upper ‘hinge’ and
the extreme of the upper whisker.
Then, in the same section, about out:
out
the values of any data points which lie beyond the
extremes of the whiskers (if(do.out)).
Your data has 4 points. The extreme of the upper whisker, as returned in list member $stats, is 500.0, and this is the maximum of your data. There is no error.
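If you really do want a stricter rule for such a small sample, boxplot.stats() exposes the whisker multiplier through its coef argument (1.5 by default); a quick sketch:
boxplot.stats(c(3, 4, 7, 500), coef = 0.5)$out
# [1] 500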
Try this to identify all the outliers (Boxplot() with a capital B is from the car package; the example uses the built-in iris data):
library(car)
Boxplot(Petal.Length ~ Species, data = iris, id = list(n = Inf))
Is there an implemented (!) function in R which gives you the empirical quantile for each value? I couldn't find any ...
Let's say we have x
x = c(1,3,4,2)
I want to have the quantile of each element.
[1] 0.25, 0.75, 1, 0.5
Thank you very much!
You can use the ecdf() function:
ecdf(x)(x)
[1] 0.25 0.75 1.00 0.50
ecdf(x) creates a function, and you then pass the elements of x to that function. The syntax admittedly looks a bit strange.
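If the double-call syntax bothers you, the same values can be computed with rank() (a sketch, using the maximum rank for ties so it matches what ecdf() does):
rank(x, ties.method = "max") / length(x)
# [1] 0.25 0.75 1.00 0.50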
Lets say I have a function like follows:
testFunction <- function(testInputs){
  print( sum(testInputs)+1 == 2 )
  return( sum(testInputs) == 1 )
}
When I test this on the command line with the following input, c(0.65, 0.3, 0.05), it prints and returns TRUE as expected.
However, when I use c(1-0.3-0.05, 0.3, 0.05), I get TRUE printed and FALSE returned. This makes no sense, because it means sum(testInputs)+1 is 2 but sum(testInputs) is not 1.
Here is what I think: somehow the printed value is not exactly 1 but probably 0.9999999..., and it's just rounded on display. But this is only a guess. How does this work exactly?
This is exactly a floating point problem, but the interesting thing about it for me is how it demonstrates that the return value of sum() produces this error, but with + you don't get it.
See the links about floating point math in the comments. Here is how to deal with it:
sum(1-0.3-0.05, 0.3, 0.05) == 1
# [1] FALSE
dplyr::near(sum(1-0.3-0.05, 0.3, 0.05), 1)
# [1] TRUE
For me, the fascinating thing is:
(1 - 0.3 - 0.05 + 0.3 + 0.05) == 1
# [1] TRUE
Because you can't predict how the various implementations of floating point arithmetic will behave, you need to correct for it. Here, instead of using ==, use dplyr::near(). This problem (floating point math is inexact, and also unpredictable), is found across languages. Different implementations within a language will result in different floating point errors.
As I discussed in this answer to another floating point question, dplyr::near(), like all.equal(), has a tolerance argument, here tol. It is set to .Machine$double.eps^0.5, by default. .Machine$double.eps is the smallest number that your machine can add to 1 and be able to distinguish it from 1. It's not exact, but it's on that order of magnitude. Taking the square root makes it a little bigger than that, and allows you to identify exactly those values that are off by an amount that would make a failed test for equality likely to be a floating point error.
NOTE: yes, near() is in dplyr, which I almost always have loaded, so I forgot it wasn't in base. You could use all.equal(), but look at the source code of near(). It's exactly what you need, and nothing you don't:
near
# function (x, y, tol = .Machine$double.eps^0.5)
# {
# abs(x - y) < tol
# }
# <environment: namespace:dplyr>
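If you'd rather stay in base R, isTRUE(all.equal(...)) does the same kind of tolerance-based comparison; the isTRUE() wrapper is needed because all.equal() returns a character description of the difference, rather than FALSE, when the values differ:
isTRUE(all.equal(sum(1 - 0.3 - 0.05, 0.3, 0.05), 1))
# [1] TRUE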
I'm having a hard time building an efficient procedure that adds and multiplies probability density functions to predict the distribution of time that it will take to complete two process steps.
Let "a" represent the probability distribution function of how long it takes to complete process "A". Zero days = 10%, one day = 40%, two days = 50%. Let "b" represent the probability distribution function of how long it takes to complete process "B". Zero days = 10%, one day = 20%, etc.
Process "B" can't be started until process "A" is complete, so "B" is dependent upon "A".
a <- c(.1, .4, .5)
b <- c(.1,.2,.3,.3,.1)
How can I calculate the probability density function of the time to complete "A" and "B"?
This is what I'd expect as the output for the following example:
totallength <- 0 # initialize
totallength[1:(length(a) + length(b))] <- 0 # initialize
totallength[1] <- a[1]*b[1]
totallength[2] <- a[1]*b[2] + a[2]*b[1]
totallength[3] <- a[1]*b[3] + a[2]*b[2] + a[3]*b[1]
totallength[4] <- a[1]*b[4] + a[2]*b[3] + a[3]*b[2]
totallength[5] <- a[1]*b[5] + a[2]*b[4] + a[3]*b[3]
totallength[6] <- a[2]*b[5] + a[3]*b[4]
totallength[7] <- a[3]*b[5]
print(totallength)
[1] 0.01 0.06 0.16 0.25 0.28 0.19 0.05
sum(totallength)
[1] 1
I have an approach in Visual Basic that uses three for loops (one for each of the steps, and one for the output), but I hope I don't have to loop in R.
Since this seems to be a pretty standard process flow question, part two of my question is whether any libraries exist to model operations flow so I'm not creating this from scratch.
The efficient way to do this sort of operation is to use a convolution:
convolve(a, rev(b), type="open")
# [1] 0.01 0.06 0.16 0.25 0.28 0.19 0.05
This is efficient both because it's less typing than computing each value individually and also because it's implemented in an efficient way (using the Fast Fourier Transform, or FFT).
You can confirm that each of these values is correct using the formulas you posted:
(expected <- c(a[1]*b[1],
               a[1]*b[2] + a[2]*b[1],
               a[1]*b[3] + a[2]*b[2] + a[3]*b[1],
               a[1]*b[4] + a[2]*b[3] + a[3]*b[2],
               a[1]*b[5] + a[2]*b[4] + a[3]*b[3],
               a[2]*b[5] + a[3]*b[4],
               a[3]*b[5]))
# [1] 0.01 0.06 0.16 0.25 0.28 0.19 0.05
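If you later have more than two process steps, the same idea extends by folding the convolution over a list of step distributions; a sketch:
steps <- list(a, b)    # append further step distributions as needed
total <- Reduce(function(p, q) convolve(p, rev(q), type = "open"), steps)
round(total, 10)
# [1] 0.01 0.06 0.16 0.25 0.28 0.19 0.05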
See the distr package. Choosing the term "multiply" is unfortunate, since the situation described is not one where the contributions to the probabilities are independent events being intersected (where multiplication of probabilities would be the natural term to use). It's rather some sort of sequential addition, and that is exactly what the distr package provides as its interpretation of what "+" should mean when used as a symbolic manipulation of two discrete distributions.
A <- DiscreteDistribution(setNames(0:2, c("Zero", "one", "two")), a)
B <- DiscreteDistribution(setNames(0:4, c("Zero2", "one2", "two2",
                                          "three2", "four2")), b)
?'operators-methods'  # where "+" on two DiscreteDistribution objects is defined as convolution
plot(A+B)
After a bit of nosing around I see that the actual numeric values can be found here:
A.then.B <- A + B
> environment(A.then.B@d)$dx
[1] 0.01 0.06 0.16 0.25 0.28 0.19 0.05
Seems like there should be a method for displaying the probabilities, and since I'm not a regular user of this fascinating package there may well be one. Do read the vignette and the code demos, which I have not yet done. Further noodling around convinces me that the right place to look is the companion package distrDoc, whose vignette is 100+ pages long. It shouldn't have required any effort to find, either, since that advice is in the messages printed when the package is loaded; in my defense, there were a couple of pages of messages, so it was more tempting to jump straight into coding and the help pages.
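For what it's worth, the package does expose accessor functions; if I remember the distr API correctly, d() returns the probability function of a distribution object and support() its support, which avoids digging into environments:
d(A.then.B)(support(A.then.B))
# [1] 0.01 0.06 0.16 0.25 0.28 0.19 0.05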
I'm not familiar with a dedicated package that does exactly what your example describes, but let me suggest a more robust solution to this problem.
You are looking for a method to estimate the distribution of a process made up of n steps (in your case 2), which might not be as easy to compute analytically as your example.
The approach I would use is a simulation: draw 10k observations from the underlying distributions, then calculate the density of the simulated results.
using your example we can do the following:
x <- runif(10000)
y <- runif(10000)
library(data.table)
z <- as.data.table(cbind(x,y))
z[x>=0 & x<0.1, a_days:=0]
z[x>=0.1 & x<0.5, a_days:=1]
z[x>=0.5 & x<=1, a_days:=2]
z[y>=0 & y <0.1, b_days:=0]
z[y>=0.1 & y<0.3, b_days:=1]
z[y>=0.3 & y<0.6, b_days:=2]
z[y>=0.6 & y<0.9, b_days:=3]
z[y>=0.9 & y<=1, b_days:=4]
z[,total_days:=a_days+b_days]
hist(z[,total_days])
This will give you a very good proxy for the density, and the approach would also work if your second process were drawn from, say, an exponential distribution, in which case you'd use the rexp function to generate b_days directly.
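A more compact way to run the same kind of simulation (a sketch, under the same assumption that the two steps are independent) is to draw the day counts directly with sample(), which avoids hand-coding the cumulative breakpoints:
set.seed(1)                                        # for reproducibility
n <- 10000
a_days <- sample(0:2, n, replace = TRUE, prob = a)
b_days <- sample(0:4, n, replace = TRUE, prob = b)
total_days <- a_days + b_days
prop.table(table(total_days))                      # should be close to the exact convolution
hist(total_days)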