Suppose I have 100 marbles, and 8 of them are red. I draw 30 marbles, and I want to know the probability that at least five of them are red. I am currently using http://stattrek.com/online-calculator/hypergeometric.aspx and I entered 100, 8, 30, and 5 for population size, number of successes, sample size, and number of successes in sample, respectively. So the probability I'm interested in is the cumulative probability $P(X \geq 5)$, which equals 0.050 in this case. My question is: how do I calculate this in R?
I tried
> 1-phyper(5, 8, 92, 30, lower.tail = TRUE)
[1] 0.008503108
But this is very different from the previous answer.
phyper(5, 8, 92, 30) gives the probability of drawing five or fewer red marbles.
1 - phyper(5, 8, 92, 30) thus returns the probability of getting six or more red marbles
Since you want the probability of getting five or more (i.e. more than 4) red marbles, you should use one of the following:
1 - phyper(4, 8, 92, 30)
[1] 0.05042297
phyper(4, 8, 92, 30, lower.tail=FALSE)
[1] 0.05042297
Why use 1 - phyper(..., lower.tail = TRUE) at all? It is easier to write phyper(..., lower.tail = FALSE). Even though the two are mathematically equivalent, there are also numerical reasons for preferring the latter: computing the upper tail directly avoids the loss of precision that comes from subtracting a probability very close to 1.
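As a sanity check, you can also sum the individual point probabilities for 5 through 8 red marbles with dhyper(); this should agree with the upper-tail value above:
sum(dhyper(5:8, 8, 92, 30))
# [1] 0.05042297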
Does that fix your problem? I believe you are putting the correct inputs into the phyper function. Is it possible that you're looking at the wrong output on the website you linked?
I want to simulate a string of random non-negative integer values in R.
However, those values should not follow any particular probability distribution function and could be empirically distributed.
How do I go about doing it?
You will need a distribution; there is no alternative, philosophically. There's no such thing as a "random number," only numbers randomly distributed according to some distribution.
To sample from an empirical distribution stored as my_dist, you can use sample():
my_dist <- c(1, 1, 2, 3, 5, 8, 13, 21, 34, 55) # first 10 Fibonacci numbers
sample(my_dist, 100, replace = TRUE)  # draw 100 numbers from my_dist with replacement
Or, for some uniformly-distributed numbers between (for instance) 1 and 10, you could do:
sample(1:10, 100, replace = TRUE)
There are, of course, specific distributions implemented as functions in base R and various packages, but I'll avoid those since you said you weren't interested in them.
Editing per Rui's good suggestion: If you want non-uniform variables, you can specify the prob parameter:
sample(1:3, 100, replace = TRUE, prob = c(6, 3, 1))
# draws a 1 with 60% probability, a 2 with 30% probability, and a 3 with 10% probability
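As a quick check, sample() normalizes prob internally, so drawing many values should reproduce those proportions approximately:
x <- sample(1:3, 1e5, replace = TRUE, prob = c(6, 3, 1))
prop.table(table(x))  # roughly 0.6, 0.3, 0.1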
I have a vector of numbers, and I would like to sample a value that lies between a given position in the vector and its neighbors, such that the two closest neighbors have the largest influence and the influence decreases with distance from the reference point.
For example, let's say I have the following vector:
vec = c(15, 16, 18, 21, 24, 30, 31)
and my reference is the number 16 at position #2. I would like to sample a number that falls, with high probability, between 15 and 16, or (with the same high probability) between 16 and 18. The sampled numbers can be floats. Then, with decreasing probability, a number between 16 and 21, and with yet lower probability between 16 and 24, and so on.
The position of the reference is not known in advance, it can be anywhere in the vector.
I tried playing with runif and quantiles, but I'm not sure how to design the scores of the neighbors.
Specifically, I wrote the following function but I suspect there might be a better/more efficient way of doing this:
GenerateNumbers <- function(Ind, N) {
  # weight each position by the inverse of its distance from the reference index
  dist <- 1 / abs(Ind - 1:length(N))
  dist <- dist[!is.infinite(dist)]   # drop the reference position itself
  dist <- dist / sum(dist)
  sum(dist)                          # sanity check --> 1
  # draw one uniform value from each interval between consecutive elements
  V <- numeric(length(N) - 1)
  for (i in 1:(length(N) - 1)) {
    V[i] <- runif(1, N[i], N[i + 1])
  }
  # pick one of those values with probability proportional to the weights
  sample(V, 1, prob = dist)
}
where Ind is the position of the reference number (16 in this case), and N is the vector. dist weights the probabilities so that closer neighbors have a higher impact.
Improvements upon this code would be highly appreciated!
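For reference, a call looks like this (the seed is only there to make the example reproducible):
set.seed(42)   # arbitrary seed, just for reproducibility
vec <- c(15, 16, 18, 21, 24, 30, 31)
GenerateNumbers(2, vec)   # reference is the 16 at position 2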
I would go with a truncated Gaussian random sample generator, such as the one in the truncnorm package. Applied to your example:
# To install it: install.packages("truncnorm")
library(truncnorm)
vec <- c(15, 16, 18, 21, 24, 30, 31)
x <- rtruncnorm(n=100, a=vec[1], b=vec[7], mean=vec[2], sd=1)
A histogram of the generated sample shows that it satisfies the stated requirements.
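As an aside, if you would rather keep your original interval-based approach, the loop in GenerateNumbers can be replaced with a single vectorized runif() call, since runif() recycles vector min and max arguments. This is only a sketch intended to preserve your function's behavior:
GenerateNumbersVec <- function(Ind, N) {
  dist <- 1 / abs(Ind - seq_along(N))
  dist <- dist[!is.infinite(dist)]
  dist <- dist / sum(dist)
  # one uniform draw from each interval between consecutive elements
  V <- runif(length(N) - 1, head(N, -1), tail(N, -1))
  sample(V, 1, prob = dist)
}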
This excerpt from the CRAN documentation for the adagio function knapsack() works as expected: it solves the knapsack problem with profit vector p, weight vector w, and capacity cap, selecting the subset of elements with maximum total profit subject to the constraint that the total weight of the selected elements does not exceed the capacity.
library(adagio)
p <- c(15, 100, 90, 60, 40, 15, 10, 1)
w <- c( 2, 20, 20, 30, 40, 30, 60, 10)
cap <- 102
(is <- knapsack(w, p, cap))
How can I add a vector length constraint to the solution and still get an optimal answer? For example, the above exercise, but the selected subset must include exactly three elements.
One approach is to model the problem explicitly as a mixed integer linear program; the advantage of modeling it this way is that linear constraints like "pick exactly three objects" are simple to express. Here is an example using the lpSolve package in R. Each element of the knapsack problem is represented by a binary decision variable, and the requirement that exactly three elements be selected is captured by a constraint forcing those variables to sum to 3.
library(lpSolve)
p <- c(15, 100, 90, 60, 40, 15, 10, 1)
w <- c( 2, 20, 20, 30, 40, 30, 60, 10)
cap <- 102
exact.num.elt <- 3
mod <- lp(direction = "max",
          objective.in = p,
          const.mat = rbind(w, rep(1, length(p))),
          const.dir = c("<=", "="),
          const.rhs = c(cap, exact.num.elt),
          all.bin = TRUE)
# Solution
which(mod$solution >= 0.999)
# [1] 2 3 4
# Profit
mod$objval
# [1] 250
Subsetting the optimal solution of the standard adagio::knapsack problem down to the desired size is a reasonable heuristic when the desired subset is smaller than the cardinality of the unconstrained optimum. However, there are instances where the optimal solution to the standard knapsack problem and the optimal solution to the size-constrained problem are disjoint. For instance, consider the following problem data:
p <- c(2, 2, 2, 2, 3, 3)
w <- c(1, 1, 1, 1, 2, 2)
cap <- 4
exact.num.elt <- 2
With capacity 4 and no size constraint, the standard knapsack problem selects the four elements with profit 2 and weight 1, for a total profit of 8. However, when exactly two elements must be selected, the optimal solution is instead the two elements with profit 3 and weight 2, for a total profit of 6.
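To confirm, we can reuse the lpSolve formulation from above with this data (mod2 is just a new name for the second model):
library(lpSolve)
mod2 <- lp(direction = "max",
           objective.in = p,
           const.mat = rbind(w, rep(1, length(p))),
           const.dir = c("<=", "="),
           const.rhs = c(cap, exact.num.elt),
           all.bin = TRUE)
which(mod2$solution >= 0.999)
# [1] 5 6
mod2$objval
# [1] 6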
I have installed the mixdist package in R to fit mixture distributions. Specifically, I'm using the mix() function. See the documentation.
Basically, I'm getting
Error in nlm(mixlike, lmixdat = mixdat, lmixpar = fitpar, ldist = dist, :
missing value in parameter
I googled the error message, but no useful results popped up.
My first argument to mix() is a data frame called data.df. It is formatted exactly like the built-in data set pike65. I also did data.df <- as.mixdata(data.df).
My second argument has two rows. It is a data frame called datapar, formatted exactly like pikepar. My pi values are 0.5 and 0.5. My mu values are 250 and 463 (based on my data set). My sigma values are 0.5 and 1.
My call to mix() looks like:
fitdata <- mix(data.df, datapar, "norm", constr = mixconstr(consigma="CCV"), emsteps = 3, print.level = 2)
The printing shows that my pi values go from 0.5 to NaN after the first iteration, and that my gradient is becoming 0.
I would appreciate any help in sorting out this error.
Thanks,
n.i.
Using the test data you linked to
library(mixdist)
time <- seq(673,723)
counts <-c(3,12,8,12,18,24,39,48,64,88,101,132,198,253,331,
419,563,781,1134,1423,1842,2505,374,6099,9343,13009,
15097,13712,9969,6785,4742,3626,3794,4737,5494,5656,4806,
3474,2165,1290,799,431,213,137,66,57,41,35,27,27,27)
data.df <- data.frame(time=time, counts=counts)
We can see that
data.mix <- as.mixdata(data.df)
startparam <- mixparam(c(699, 707), 1)
data.fit <- mix(data.mix, startparam, "norm")
gives the same error. The error appears to be closely tied to the data, so the reason this data fails could be different from why yours does, but it is the only example you offered.
The problem with this data is that, at some point, the two groups become probabilistically indistinguishable. When that happens, the "E" step of the algorithm cannot estimate the pi values properly. Here,
pnorm(717,707,1)
# [1] 1
pnorm(717,699,1)
# [1] 1
both are exactly 1, and this seems to be causing the error. When mix() takes 1 minus these values and compares the ratio to estimate group membership, it gets NaN values, which are propagated to the estimates of the proportions. When these NaN values are passed internally to nlm() to do the estimation, you get the error message
Error in nlm(mixlike, lmixdat = mixdat, lmixpar = fitpar, ldist = dist, :
missing value in parameter
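This is not literally what mix() computes internally, but the NaN itself is easy to reproduce: both upper-tail probabilities underflow to zero, so their ratio is 0/0.
(1 - pnorm(717, 707, 1)) / (1 - pnorm(717, 699, 1))
# [1] NaN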
The same error message can be replicated with
f <- function(x) sum((x-1:length(x))^2)
nlm(f, c(10,10))
nlm(f, c(10,NaN)) #error
So it appears the mixdist package will not work in this scenario. You may wish to contact the package maintainer to see if they are aware of the problem. In the meantime, you will need to find another way to estimate the parameters of your mixture model.
Now, I am not an expert in mixture distributions, but I think @MrFlick's accepted answer is a little misleading for anyone googling the error message (although no doubt correct for the example he gave). The core problem is that in both your linked code and your example, the sigma values are very small compared to the mu values. The algorithm just cannot find a solution with such small starting sigma values. If you increase the starting sigma values, you will get a solution. Using the linked code as an example:
library(mixdist)
time <- seq(673,723)
counts <- c(3, 12, 8, 12, 18, 24, 39, 48, 64, 88, 101, 132, 198, 253, 331,
            419, 563, 781, 1134, 1423, 1842, 2505, 374, 6099, 9343, 13009,
            15097, 13712, 9969, 6785, 4742, 3626, 3794, 4737, 5494, 5656, 4806,
            3474, 2165, 1290, 799, 431, 213, 137, 66, 57, 41, 35, 27, 27, 27)
data.df <- data.frame(time=time, counts=counts)
data.mix <- as.mixdata(data.df)
startparam <- mixparam(mu = c(699,707), sigma = 1)
data.fit <- mix(data.mix, startparam, "norm") ## Leads to the error message
startparam <- mixparam(mu = c(699,707), sigma = 5) # Adjust start parameters
data.fit <- mix(data.mix, startparam, "norm")
plot(data.fit)
data.fit ### Estimates somewhat reasonable mixture distributions
# Parameters:
# pi mu sigma
# 1 0.853 699.3 4.494
# 2 0.147 708.6 2.217
The bottom line: if you can increase your starting sigma values, the mix() function might find reasonable estimates for you, and you do not necessarily have to try another package.
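If it helps, a quick way to probe several starting sigma values (a sketch using tryCatch to swallow the nlm error) is:
for (s in c(1, 2, 5, 10)) {
  fit <- tryCatch(mix(data.mix, mixparam(mu = c(699, 707), sigma = s), "norm"),
                  error = function(e) NULL)
  cat("sigma =", s, "->", if (is.null(fit)) "error" else "fitted", "\n")
}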
In addition, you can get this message if you have missing data in your dataset.
Using the example data sets:
data(pike65)
data(pikepar)
pike65$freq[10] <- NA
fitpike1 <- mix(pike65, pikepar, "lnorm", constr = mixconstr(consigma = "CCV"), emsteps = 3)
Error in nlm(mixlike, lmixdat = mixdat, lmixpar = fitpar, ldist = dist, :
missing value in parameter
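A quick check for this before fitting (using the names from the example above):
any(is.na(pike65$freq))    # [1] TRUE
which(is.na(pike65$freq))  # [1] 10
Once the missing frequency is filled in (or otherwise handled appropriately for your data), the fit should run as in the package's own example.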
Apologies if this is a bit of a simple question, but I haven't been able to find any answer to this over the past week and it's driving me crazy.
Background Info: I have a dataset that tracks the weight of 5 individuals over 5 years. Each year, I have a distribution for the weight of individuals in the group, from which I calculate the mean and standard deviation. Data is as follows:
Year = [2002,2003,2004,2005,2006]
Weights_2002 = [12, 14, 16, 18, 20]
Weights_2003 = [14, 16, 18, 20,20]
Weights_2004 = [16, 18, 20, 22, 18]
Weights_2005 = [18, 21, 22, 22, 20]
Weights_2006 = [2, 21, 19, 20, 20]
The Question: How do I project annual distributions of weight for the group the next 10 years? Ideally, I would like the uncertainty about the mean to increase as time goes on. Likewise, I would like the uncertainty about the standard deviation to increase too. Phrased another way, I would like to project the distributions of weight going forward, accounting for both:
Natural Variance in the Data
Increasing uncertainty.
Any help would be greatly, greatly appreciated. If anyone can suggest how to do this in R, that would be even better.
Thanks guys!
Absent specific suggestions on how to use the forecasting tools in R (see the comments on your question), here is an alternative approach that uses Monte Carlo simulation.
First, some housekeeping: the value 2 in Weights_2006 is either a typo or an outlier. Since I can't tell which, I will assume it's an outlier and exclude it from the analysis.
Second, you say you want to project the distributions based on increasing uncertainty. But your data doesn't support that.
Year <- c(2002,2003,2004,2005,2006)
W2 <- c(12, 14, 16, 18, 20)
W3 <- c(14, 16, 18, 20, 20)
W4 <- c(16, 18, 20, 22, 18)
W5 <- c(18, 21, 22, 22, 20)
W6 <- c(NA, 21, 19, 20, 20)
df <- rbind(W2,W3,W4,W5,W6)
df <- data.frame(Year,df)
library(reshape2) # for melt(...)
library(ggplot2)
data <- melt(df,id="Year", variable.name="Individual",value.name="Weight")
ggplot(data) +
  geom_histogram(aes(x = Weight), binwidth = 1, fill = "lightgreen", colour = "grey50") +
  facet_grid(Year ~ .)
The mean weight goes up over time, but the variance decreases. A look at the individual time series shows why.
ggplot(data, aes(x=Year, y=Weight, color=Individual))+geom_line()
In general, an individual's weight increases linearly with time (about 2 units per year), until it reaches 20, when it stops increasing but fluctuates. Since your initial distribution was uniform, the individuals with lower weight saw an increase over time, driving the mean up. But the weight of heavier individuals stopped growing. So the distribution gets "bunched up" around 20, resulting in a decreasing variance. We can see this in the numbers: increasing mean, decreasing standard deviation.
smry <- function(x)c(mean=mean(x),sd=sd(x))
aggregate(Weight~Year,data,smry)
# Year Weight.mean Weight.sd
# 1 2002 16.0000000 3.1622777
# 2 2003 17.6000000 2.6076810
# 3 2004 18.8000000 2.2803509
# 4 2005 20.6000000 1.6733201
# 5 2006 20.0000000 0.8164966
We can model this behavior using a Monte Carlo simulation.
set.seed(1)
start <- runif(1000, 12, 20)
X <- start
result <- X
for (i in 2003:2008) {
  X <- X + 2
  X <- ifelse(X < 20, X, 20) + rnorm(length(X))
  result <- rbind(result, X)
}
result <- data.frame(Year = 2002:2008, result)
In this model, we start with 1000 individuals whose weight forms a uniform distribution between 12 and 20, as in your data. At each time step we increase the weights by 2 units. If the result is >20 we clip it to 20. Then we add random noise distributed as N[0,1]. Now we can plot the distributions.
model <- melt(result, id = "Year", variable.name = "Individual", value.name = "Weight")
ggplot(model, aes(x = Weight)) +
  geom_histogram(aes(y = ..density..), fill = "lightgreen", colour = "grey50", bins = 20) +
  stat_density(geom = "line", colour = "blue") +
  geom_vline(data = aggregate(Weight ~ Year, model, mean), aes(xintercept = Weight),
             colour = "red", size = 2, linetype = 2) +
  facet_grid(Year ~ ., scales = "free")
The vertical red dashed lines mark the mean weight in each year.
If you believe that the natural variation in the weight of an individual increases over time, then use N[0,sigma] as the error term in the model, with sigma increasing with Year. The problem is that there is nothing in your data to support that.
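If you do want to build growing uncertainty into the simulation anyway, here is a minimal sketch; the growth rate of the noise standard deviation (0.25 per year) is purely an assumption for illustration:
set.seed(1)
X <- runif(1000, 12, 20)
result <- X
years <- 2003:2008
for (i in seq_along(years)) {
  sigma <- 1 + 0.25 * i   # assumed: noise sd grows by 0.25 per forecast year
  X <- ifelse(X + 2 < 20, X + 2, 20) + rnorm(length(X), 0, sigma)
  result <- rbind(result, X)
}
result <- data.frame(Year = c(2002, years), result)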