How to define Intervals for a uniform probability distribution?

I do not know if this is the right forum to ask, but I would appreciate it if someone could help me.
I have two processes, each with its own random variable, say X1 and X2. Each random variable is uniformly distributed on [0,1]. How can random.nextdouble() help me identify the variation between the probabilities of these two random variables? I need this because I want to find the probability distribution of the minimum of the two random variables.
Is it as simple as running the program 100000 or more times, twice, and then counting the minimum value across the two runs? If so, how do I map this result to the probabilities of the two random variables X1 and X2? That is, what is the criterion for saying that the first run was for X1 and the second run for X2?

The probability that a single uniform random variable on [0,1] is below d is P(X <= d) = d.
Thus, the probability that it is above d is P(X >= d) = 1 - d.
The probability that two independent such random variables are both above d is P(X >= d AND Y >= d) = P(X >= d) * P(Y >= d) = (1-d)^2.
Thus, the probability that at least one of X or Y is below d is p = 1 - (1-d)^2, and this is exactly the probability that the minimum is below d: P(min(X,Y) <= d) = 1 - (1-d)^2.
If you are looking for the probability density function, just take the derivative of this CDF:
f(x) = d/dx [1 - (1-x)^2] =
= d/dx (1 - 1 + 2x - x^2) =
= d/dx (2x - x^2) = 2 - 2x
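A quick simulation sketch (in R, matching the other threads below, rather than Java's random.nextDouble()) can confirm this: draw many pairs of uniforms, take the minimum of each pair, and compare the empirical fraction below some d with 1 - (1-d)^2. The variable names and the choice d = 0.3 are just illustrative.
set.seed(1)
n  <- 100000
x1 <- runif(n)          # draws for X1 ~ U(0,1)
x2 <- runif(n)          # draws for X2 ~ U(0,1)
m  <- pmin(x1, x2)      # minimum of the two variables in each repetition
d <- 0.3
mean(m <= d)            # empirical P(min <= d)
1 - (1 - d)^2           # theoretical value, 0.51 for d = 0.3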

Related

How to simulate a dataset with a binary target in proportions determined 'a-priori'?

Can someone tell me the best way to simulate a dataset with a binary target?
I understand how a dataset can be simulated, but what I'm looking for is a way to determine the proportion of each class 'a priori'. My idea was to change the intercept to achieve this, but I couldn't get it to work and I don't know why. I guess the averaging is playing a trick on me.
set.seed(666)
x1 = rnorm(1000)
x2 = rnorm(1000)
p=0.25 # <<< I'm looking for a 25%/75%
mean_z=log(p/(1-p))
b0 = mean( mean_z - (4*x1 + 3*x2)) # = mean_z - mean( 4*x1 + 3*x2)
z = b0 + 4*x1 + 3*x2 # = mean_z - (4*x1 + 3*x2) + (4*x1 + 3*x2) = rep(mean_z,1000)
mean( b0 + 4*x1 + 3*x2 ) == mean_z # TRUE!!
pr = 1/(1+exp(-z))
y = rbinom(1000,1,pr)
mean(pr) # ~ 40% << not achieved
table(y)/1000
What I'm looking for is to simulate the typical "logistic" problem in which the binary target can be modeled as a linear combination of features.
These 'logistic' models assume that the log-odds of the binary variable is linear in the features. That is:
log(p / (1-p)) = z = b0 + b1*x1 + b2*x2, where p = Prob(y = 1)
Going back to my sample code, we could set, for example, z = 1.3 + 4*x1 + 2*x2, but then the class proportion would simply be whatever comes out. Instead, we could choose the coefficient b0 so that the proportion is (statistically) close to the one sought:
log(0.25 / 0.75) = b0 + 4*x1 + 2*x2
This is my approach, but there may be better ones.
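One way to act on the intercept idea (a sketch, not taken from the thread) is to solve for b0 numerically, since the mean of plogis(b0 + 4*x1 + 3*x2) is generally not plogis applied to the mean; a uniroot() search over b0 does the job:
set.seed(666)
x1 <- rnorm(1000)
x2 <- rnorm(1000)
eta <- 4*x1 + 3*x2                            # linear predictor without intercept
p_target <- 0.25                              # desired 25%/75% split
f  <- function(b0) mean(plogis(b0 + eta)) - p_target
b0 <- uniroot(f, interval = c(-20, 20))$root  # intercept matching the target on average
pr <- plogis(b0 + eta)
y  <- rbinom(1000, 1, pr)
mean(pr)                                      # = 0.25 by construction
table(y) / 1000                               # close to 25%/75%, up to binomial noise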
I gather that you are considering a logistic regression model, right? If so, one way to generate a data set is to create two Gaussian bumps and say that one is class 1 and the other is class 0. Then generate 25 items from class 1 and 75 items from class 0. Then each generated item plus its label is a datum or record or whatever you want to call it.
Obviously you can choose any proportions of 1's and 0's. It is also interesting to make the problem "easy" by putting the Gaussian bumps farther apart (i.e. variances small compared to the difference of means) or "hard" by making the bumps overlap (i.e. variances large compared to the difference of means).
EDIT: In order to make sample data which correspond exactly to a logistic regression model, just make the variances of the two Gaussian bumps the same. When the variances (by this I mean specifically the covariance matrix) are the same, the surfaces of equal posterior class probability are planes; when the covariances are different, the surfaces of equal probability are quadratics. This is a standard result which will appear in many textbooks. I also have some notes online about this, which I can locate if it will help.
Aside from generating the two classes separately and then merging the results into one set, you can also sample from a single distribution over x, plug x into a logistic regression model with some weights (which you choose by any means you wish), and then use the resulting output as a probability for a coin toss. This method isn't guaranteed to output proportions that correspond exactly to prior class probabilities.
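A minimal sketch of the two-bumps idea from this answer, assuming equal (identity) covariances so the setup matches a logistic regression model exactly; the means, sample sizes and the 25%/75% split are illustrative:
set.seed(1)
n1 <- 25; n0 <- 75                         # a priori 25%/75% class proportions
mu1 <- c(2, 2); mu0 <- c(0, 0)             # class means; shared identity covariance
class1 <- cbind(rnorm(n1, mu1[1]), rnorm(n1, mu1[2]))
class0 <- cbind(rnorm(n0, mu0[1]), rnorm(n0, mu0[2]))
dat <- data.frame(x1 = c(class1[, 1], class0[, 1]),
                  x2 = c(class1[, 2], class0[, 2]),
                  y  = c(rep(1, n1), rep(0, n0)))
table(dat$y) / nrow(dat)                   # exactly 25%/75% by construction
# moving mu1 and mu0 closer together makes the problem "harder", farther apart "easier"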

Calculating probabilities of simulated random variables in R

I have a graph with nodes A, X, Y and B, and directed edges A→X, A→Y, X→Y, Y→X, X→B and Y→B. I need to travel from A to B, and I assume that I take the fastest route from A to B every day.
The travel times (in hours) between the nodes are exponentially distributed. I have simulated them, with the relevant lambda values, in R as follows:
AtoX <- rexp(1000, 4)
AtoY <- rexp(1000, 2.5)
XtoY <- rexp(1000, 10)
YtoX <- rexp(1000, 10)
XtoB <- rexp(1000, 3)
YtoB <- rexp(1000, 5)
I calculated the average daily travel time in R as follows:
AXB <- AtoX + XtoB
AYB <- AtoY + YtoB
AXYB <- AtoX + XtoY + YtoB
AYXB <- AtoY + YtoX + XtoB
TravelTimes <- pmin(AXB, AYB, AXYB, AYXB)
averageTravelTime <- mean(TravelTimes)
I'm now trying to find the following for every single day:
1. With what probability is each of the four possible routes from A to B taken?
2. What is the probability that I have to travel more than half an hour?
For (1), I understand that I need the cumulative distribution function (CDF), P(X <= x), for each route.
For (2), I understand that I need P(X >= 0.5), i.e. one minus the CDF at 0.5, where 0.5 denotes half an hour.
I have only just started learning R, and I am unsure of how to go about doing this.
Reading the documentation, it seems that I might need to do something like the following to calculate the CDF:
pexp()
1 - pexp()
How can I do this?
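For a single exponentially distributed leg, pexp() does what the question hints at; for whole routes and the minimum over routes, the simulation in the answer below is what is needed. A small illustrative sketch for the A-to-X leg:
pexp(0.5, rate = 4)       # P(A-to-X leg takes at most half an hour) = 1 - exp(-2) ~ 0.865
1 - pexp(0.5, rate = 4)   # P(A-to-X leg takes more than half an hour) ~ 0.135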
Let R1, R2, R3, R4 be, in some order, the random variables corresponding to the total times of the four routes. Being sums of independent exponential random variables with different rates, each of them follows a hypoexponential (generalized Erlang) distribution (see here); the Erlang/Gamma case would apply only if the rates were equal.
To answer 1, you want to find P(min{R1, R2, R3, R4} = R_i) for i = 1, 2, 3, 4. While the minimum of independent exponential random variables is tractable (see here), as far as I know that is not the case for these route totals in general (note also that the four route times are not independent, since they share edges). Hence, I believe you need to answer this question numerically, using simulation.
The same applies to the second question, which requires P(min{R1, R2, R3, R4} >= 1/2).
Hence, we have
table(apply(cbind(AXB, AYB, AXYB, AYXB), 1, which.min)) / 1000
# 1 2 3 4
# 0.312 0.348 0.264 0.076
and
mean(TravelTimes >= 0.5)
# [1] 0.145
as our estimates. By increasing 1000 to some higher number (e.g., 1e6 works fast) one could make those estimates more precise.
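As a sketch of the suggestion to increase the sample size, the same estimates with 1e6 draws per edge (same rates as in the question) could be obtained like this; the exact values will vary with the seed:
set.seed(1)
n <- 1e6
AtoX <- rexp(n, 4);  AtoY <- rexp(n, 2.5)
XtoY <- rexp(n, 10); YtoX <- rexp(n, 10)
XtoB <- rexp(n, 3);  YtoB <- rexp(n, 5)
routes <- cbind(AXB  = AtoX + XtoB,
                AYB  = AtoY + YtoB,
                AXYB = AtoX + XtoY + YtoB,
                AYXB = AtoY + YtoX + XtoB)
table(apply(routes, 1, which.min)) / n     # estimated probability of each route
mean(pmin(routes[, 1], routes[, 2], routes[, 3], routes[, 4]) >= 0.5)  # estimated P(travel time >= half an hour)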

Constrained optimization of a vector

I have a (non-symmetric) probability matrix, and an observed vector of integer outcomes. I would like to find a vector that maximises the probability of the outcomes, given the transition matrix. Simply, I am trying to estimate a distribution of particles at sea given their ultimate distribution on land, and a matrix of probabilities of a particle released from a given point in the ocean ending up at a given point on the land.
The vector that I want to find is subject to the constraint that all components must be between 0 and 1 and that the sum of the components must equal 1. I am trying to figure out the best optimisation approach for the problem.
My transition matrix and data set are quite large, but I have created a smaller one here:
I used a simulated known at-sea distribution,
msim <- c(.3, .2, .1, .3, .1, 0)
and a simulated probability matrix (t) to come up with a simulated coastal dataset (Datasim2), as follows:
t <- matrix(c(0, .1, .1, .1, .1, .2,
              0, .1,  0,  0, .3,  0,
              0,  0,  0, .4, .1, .3,
              0, .1,  0, .1, .4,  0,
              0,  0, .1,  0, .1, .1),
            nrow = 5, ncol = 6, byrow = TRUE)
rownames(t)<-c("C1","C2","C3","C4","C5") ### locations on land
colnames(t)<-c("S1","S2","S3","S4","S5","S6") ### locations at sea
Datasim<-as.numeric (round((t %*% msim)*500))
Datasim2<-c(rep("C1",95), rep("C2",35), rep("C3",90),rep("C4",15),rep("C5",30))
M <-c(0.1,0.1,0.1,0.1,0.1,0.1) ## starting M
I started with a straightforward function as follows:
EstimateSource3 <- function(M, Data, T) {
  EstEndProbsall <- M %*% T
  TotalLkhd <- rep(NA, times = dim(Data)[1])
  for (j in 1:dim(Data)[1]) {
    ObsEstEndLkhd <- 0
    ObsEstEndLkhd <- 1 - EstEndProbsall[1, ]  ## likelihood of the particle NOT ending up at locations other than the location of interest
    IndexC <- which(colnames(EstEndProbsall) == Data$LocationCode[j], arr.ind = TRUE)  ## likelihood of ending up at the location of interest
    ObsEstEndLkhd[IndexC] <- EstEndProbsall[IndexC]
    # Total likelihood
    TotalLkhd[j] <- sum(log(ObsEstEndLkhd))
  }
  SumTotalLkhd <- sum(TotalLkhd)
  return(SumTotalLkhd)
}
DistributionEstimate <- optim(par = M, fn = EstimateSource3, Data = Datasim2, T = t,
                              control = list(fnscale = -1, trace = 5, maxit = 500),
                              lower = 0, upper = 1)
To constrain the sum to 1, I tried a few of the suggestions posted here: How to set parameters' sum to 1 in constrained optimization
e.g. adding M <- M/sum(M) or SumTotalLkhd <- SumTotalLkhd - (10*pwr) to the body of the function, but neither yielded anything like msim, and in fact the second one produced the error “L-BFGS-B needs finite values of 'fn'”.
I thought perhaps the quadprog package might be of some help, but I don’t think I have a symmetric positive definite matrix…
Thanks in advance for your help!
What about this: let D = the distribution on land, M = the distribution at sea, and T the transition matrix. You know D and T, and you want to calculate M. You have
D' = M' T
hence D' T' = M' (T T')
and accordingly M' = D' T' (T T')^(-1)
Basically you solve it the same way as in linear regression (it seems SO does not support math notation: ' is transpose, ^(-1) is the ordinary matrix inverse).
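A hedged sketch of this linear-algebra idea using the question's own t and msim. With 5 land sites and 6 sea sites the system is underdetermined, so the plain inverse is replaced here by the Moore-Penrose pseudo-inverse from MASS::ginv, and the recovered vector is not forced to be non-negative or to sum to 1:
library(MASS)                          # for ginv(), the Moore-Penrose pseudo-inverse
D <- as.numeric(t %*% msim)            # land distribution implied by the "true" msim
M_hat <- as.numeric(ginv(t) %*% D)     # minimum-norm solution of D = t %*% M
round(M_hat, 3)                        # will generally differ from msim, since the system is underdetermined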
Alternatively, D may be counts of particles, and then you can ask questions like: what is the most likely distribution of particles at sea? That needs a different approach, though.
Well, I have never done such models but think along the following lines. Let M be of length 3 and D of length 2, and T is hence 3x2. We know T and we observe D_1 particles at location 1 and D_2 particles at location 2.
What is the likelihood that you observe one particle at location 1? It is Pr(D = 1) = M_1 T_11 + M_2 T_21 + M_3 T_31. Analogously, Pr(D = 2) = M_1 T_12 + M_2 T_22 + M_3 T_32. Now you can easily write the log-likelihood of observing D_1 and D_2 particles at locations 1 and 2. The code might look like this:
loglik <- function(M) {
  if (M[1] < 0 | M[1] > 1)
    return(NA)
  if (M[2] < 0 | M[2] > 1)
    return(NA)
  M3 <- 1 - M[1] - M[2]
  if (M3 < 0 | M3 > 1)
    return(NA)
  D[1]*log(T[1,1]*M[1] + T[2,1]*M[2] + T[3,1]*M3) +
    D[2]*log(T[1,2]*M[1] + T[2,2]*M[2] + T[3,2]*M3)
}
T <- matrix(c(0.1,0.2,0.3,0.9,0.8,0.7), 3, 2)
D <- c(100,200)
library(maxLik)
m <- maxLik(loglik, start=c(0.4,0.4), method="BFGS")
summary(m)
I get the answer (0, 0.2, 0.8) when I estimate it, but the standard errors are very large.
As I said, I have never done this, so I don't know if it makes sense.
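A small follow-up, assuming the maxLik fit above succeeded: the full estimated at-sea vector can be assembled from the two fitted components.
M_hat <- coef(m)              # estimates of M[1] and M[2]
c(M_hat, 1 - sum(M_hat))      # append M3 = 1 - M[1] - M[2]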

Generating power-law distributed numbers from uniform distribution – found 2 approaches: which one is correct?

I am trying to generate power-law distributed numbers ranging from 0 to 1 from a uniform distribution. I found two approaches and I am not sure which one is right and which one is wrong.
1st Source (Wolfram), formula (1): x = [(x_1^{n+1} - x_0^{n+1}) y + x_0^{n+1}]^{1/(n+1)}
2nd Source (Physical Review, page 2), formula (2): x = [x_1^{1-n} - (x_1^{1-n} - x_0^{1-n}) y]^{1/(1-n)}
Where: y = uniform variate, n = distribution power, x0 and x1 = range of the distribution, x = power-law distributed variate.
The second one only gives decent results for x0 = 0 and x1 = 1, when n is between 0 and 1.
If y is a uniform random variable between 0 and 1, then 1-y is too. So, letting z = 1-y, you can rewrite your formula (1) as:
x = [x_1^{n+1} - (x_1^{n+1} - x_0^{n+1}) z]^{1/(n+1)}
which is the same as your formula (2) except for the change n -> (-n).
So I suppose that the only difference between the two formulas is the notation for how n relates to the power-law decay (unfortunately the link you gave for the Wolfram formula is invalid, so I cannot check which notation they use).
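A short R sketch of the rewritten formula above, with z uniform on [0,1] (x0 = 0, x1 = 1 and n = 2.5 are just example values); the histogram should show the power-law shape:
set.seed(1)
np <- 2.5                      # distribution power n
x0 <- 0; x1 <- 1               # range of the distribution
z  <- runif(1e5)               # uniform variates
x  <- (x1^(np+1) - (x1^(np+1) - x0^(np+1)) * z)^(1/(np+1))
hist(x, breaks = 100, main = "power-law distributed sample")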

Random Pareto distribution in R with 30% of values being <= specified amount

Let me begin by saying this is a class assignment for an intro to R course.
First, in VGAM why are there dparetoI, ParetoI, pparetoI, qparetoI & rparetoI?
Are they not the same things?
My problem:
I would like to generate 50 random numbers in a pareto distribution.
I would like the range to be 1 – 60 but I also need to have 30% of the values <= 4.
Using VGAM I have tried a variety of functions and combinations of pareto from what I could find in documentation as well as a few things online.
I experimented with fit, quantiles and forcing a sequence from examples I found but I'm new and didn't make much sense of it.
I’ve been using this:
alpha <- 1 # location
k <- 2 # shape
mySteps <- rpareto(50,alpha,k)
range(mySteps)
str(mySteps[mySteps <= 4])
After enough iterations, the range will be acceptable, but the fraction of entries <= 4 is never close to 30%.
So my questions are:
Am I using the right pareto function?
If not, can you point me in the right direction?
If so, do I just keep running it until the “right” data comes up?
Thanks for the guidance.
So reading the Wikipedia entry for Pareto Distribution, you can see that the CDF of the Pareto distribution is given by:
F_X(x) = 1 - (x_m / x)^α
The CDF gives the probability that X (your random variable) < x (a given value). You want Pareto distributions where
Prob(X < 4) ≡ F_X(4) = 0.3
or
0.3 = 1 - (x_m / 4)^α
This defines a relation between x_m and α:
x_m = 4 * (0.7)^(1/α)
In R code:
library(VGAM)
set.seed(1)
alpha <- 1
k <- 4 * (0.7)^(1/alpha)
X <- rpareto(50,k,alpha)
quantile(X,0.3) # confirm that 30% are < 4
# 30%
# 3.891941
Plot the histogram and the distribution
hist(X, breaks=c(1:60,Inf),xlim=c(1,60))
x <- 1:60
lines(x,dpareto(x,k,alpha), col="red")
If you repeat this process for different alpha, you will get different distribution functions, but in all cases ~30% of the sample will be < 4. The reason it is only approximately 30% is that you have a finite sample size (50).
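A quick check of that last point (a hedged sketch; the scale/shape argument names follow VGAM's rpareto): re-derive the scale for a few shape values and measure the fraction of a large sample below 4, which should sit near 30% in every case.
library(VGAM)
set.seed(1)
for (alpha in c(0.5, 1, 2, 5)) {
  k <- 4 * 0.7^(1/alpha)                    # scale chosen so that P(X < 4) = 0.3
  X <- rpareto(1e5, scale = k, shape = alpha)
  cat("alpha =", alpha, " fraction below 4:", mean(X < 4), "\n")
}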
